TMS570LC4357: Assembler Execution Time Measurements at 300 MHz core speed

Part Number: TMS570LC4357

Dear support team,

While implementing the RTI driver we found something quite unexpected.

To test the RTI period, we put the pulse-generating code into the callback of the RTI compare-match ISR.

Below is the code snippet and its disassembly. As the second screenshot shows, it took 0.372 microseconds to toggle the output pin from 0 to 1 and back from 1 to 0, so about 186 nanoseconds to execute the five assembly instructions shown in the screenshot.

What is unexpected is that the core is running from a 300 MHz clock, so we would expect about 3.3 nanoseconds per cycle. We are aware that the LDR, EOR and STR instructions take multiple cycles to execute, but even if we account for that as described here:

http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0460d/Cfadhhhc.html the worst case would be that LDR and STR each take at most 5 cycles. We could not find the figure for EOR, but we assume it does not take more than a few cycles, so we would expect a maximum of around 25 cycles to execute the line of code that sets the pin to 1. That would be roughly 83 nanoseconds, much less than what we observe. This behaviour was observed on two evaluation boards and measured independently with different tools, with the same result.

The caches for the cores were disabled. With caches enabled, the total time was reduced from the initial 372 nanoseconds to around 212 nanoseconds.
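As a rough cross-check, the measurements above can be converted into core clock cycles. The sketch below is only illustrative arithmetic; the helper names `ns_to_cycles` and `cycles_to_ns` are hypothetical and not part of any TI or ARM library.

```c
/* Hypothetical helpers for converting between time and core clock cycles.
 * core_mhz is the core clock in MHz (300.0 for the setup described here). */

/* Number of core cycles that fit into a duration given in nanoseconds. */
static double ns_to_cycles(double ns, double core_mhz)
{
    return ns * core_mhz / 1000.0;
}

/* Duration in nanoseconds of a given number of core cycles. */
static double cycles_to_ns(double cycles, double core_mhz)
{
    return cycles * 1000.0 / core_mhz;
}
```

At 300 MHz, `ns_to_cycles(186.0, 300.0)` gives about 56 cycles per pin edge with caches disabled, and `ns_to_cycles(212.0, 300.0)` gives about 64 cycles for both edges with caches enabled. The 25-cycle estimate corresponds to `cycles_to_ns(25.0, 300.0)`, about 83 nanoseconds.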

 

So the question is, is this behaviour expected? We speculate that the delays may be coming from some wait states on the interconnect bus or from program flash but we’re unsure.

Figure 1

  • Hello,

    This is expected. The Cortex-R4/R5 architecture is designed to execute from cache / tightly-coupled memories as fast as possible (up to 1.66 DMIPS/MHz). Accesses to peripherals take significantly longer, which is what you are observing. Most of the delay is in the interconnect, getting the command / response to and from the slave being accessed. Essentially, any sort of bit-banging routine will not be very efficient on these processors.

    Also, you perform a read-modify-write operation each time you want to toggle an output. I would strongly recommend using the *SET and *CLR registers to toggle these outputs. That would at least avoid cycles "lost" in reading the output state.
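The difference between the two approaches in the reply above can be sketched as follows. This is a minimal illustration, not the poster's code: the `gio_port_t` layout, its field names and the helper functions are assumptions made for the example, and the real register names and offsets should be taken from the device header files or the technical reference manual.

```c
#include <stdint.h>

/* Hypothetical register block, for illustration only. The field names and
 * layout are assumptions and do not reproduce the actual device header. */
typedef struct {
    volatile uint32_t dout; /* output data register (read/write)        */
    volatile uint32_t dset; /* writing 1 to a bit drives that pin high   */
    volatile uint32_t dclr; /* writing 1 to a bit drives that pin low    */
} gio_port_t;

/* Read-modify-write toggle: one peripheral read followed by one write.
 * The read has to complete over the interconnect before the write can
 * be issued. */
static inline void pin_toggle_rmw(gio_port_t *port, uint32_t bit)
{
    port->dout ^= (1u << bit);
}

/* Set via the dedicated set register: a single write, no read. */
static inline void pin_set(gio_port_t *port, uint32_t bit)
{
    port->dset = (1u << bit);
}

/* Clear via the dedicated clear register: a single write, no read. */
static inline void pin_clear(gio_port_t *port, uint32_t bit)
{
    port->dclr = (1u << bit);
}
```

A pulse would then be `pin_set(port, n); pin_clear(port, n);` instead of two calls to `pin_toggle_rmw`. This removes the two peripheral reads, but each write still has to cross the interconnect, so the result will remain well above the core cycle time.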