Cortex-R4 execution cycles

Other Parts Discussed in Thread: TMS570LS0432

Hello Team TI

I use the TMS570LS0432 Hercules LaunchPad. I see that there are two Cortex-R4 processors in this controller. Is there anything to consider when calculating the number of clock cycles because of this architecture? For example, if I have 10 single-cycle instructions, none of which depends on the result of a previous instruction, I would compute 10 * 1 = 10 cycles and multiply by the clock period to get the execution time. Does this calculation change with the dual-processor architecture?

Thanks in advance !

  • Sindhu,

    From the programmer's point of view, you can treat these as a single core (pretty much).

    The two physical CPUs are in lockstep mode, meaning that there is a single program.
    The same program is executed by both processors independently and the results are compared
    on a cycle-by-cycle basis. This is a big part of the safety concept for the product.

    But from the programmer's point of view, the advantage is that you don't need to think of them as two independent processors, and you don't need to write code that compares their results, because that task is moved into hardware compare logic for you.

    So in the context of your question - no change in the calculation due to the lockstep CPU.

    On the other hand, with the 0432 it's hard to do the simple "10 instructions * 1 cycle = 10 cycles, then multiply by the clock period" type of calculation.
    The first reason is that if you turn on the flash pipeline, which you need to do in order to run at the maximum frequency, wait states will be inserted for some cycles. The flash pipeline is like having a very, very small cache: you'll get 'misses', and these will cost extra cycles.

    Second, the CPU pipeline can get pretty complicated: the same instruction may take a different number of cycles to execute depending on what is around it.

    It might be best for you to use the PMU (Performance Monitoring Unit) for actual cycle measurements. Doing the calculation by hand for critical sections of code is good, but I'd do it as a sanity check of your expectations against what the PMU tells you is actually happening. If you then find and understand the differences, this may give you ideas for performance optimization.
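
As a rough illustration of the "hand estimate versus PMU measurement" approach described above, here is a minimal C sketch. It is not TI's or ARM's reference code. The assumptions are: GCC-style inline assembly, privileged mode on the target, the standard ARMv7 PMU register encodings (PMCR, PMCNTENSET, PMCCNTR), and an 80 MHz core clock. The names `CPU_CLOCK_HZ`, `cycles_to_ns`, `pmu_cycle_counter_start`, `pmu_cycle_counter_read`, and `measure_cycles` are made up for this example. When built off-target, a stub replaces the counter so the file still compiles, but it measures nothing.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: core clock of 80 MHz. Adjust to your actual PLL configuration. */
#define CPU_CLOCK_HZ 80000000u

/* Hand estimate: convert a cycle count to nanoseconds at the given clock. */
static inline uint64_t cycles_to_ns(uint32_t cycles, uint32_t clock_hz)
{
    assert(clock_hz > 0u);
    return ((uint64_t)cycles * 1000000000u) / clock_hz;
}

#if defined(__GNUC__) && defined(__arm__)
/* Enable the PMU and reset/start the cycle counter (privileged mode only). */
static inline void pmu_cycle_counter_start(void)
{
    uint32_t pmcr;
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));  /* read PMCR */
    pmcr |= (1u << 0) | (1u << 2);  /* E: enable counters, C: reset cycle counter */
    pmcr &= ~(1u << 3);             /* D = 0: count every cycle, not every 64th */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr)); /* write PMCR */
    /* PMCNTENSET bit 31 enables the cycle counter itself. */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
}

/* Read the 32-bit cycle counter (PMCCNTR). */
static inline uint32_t pmu_cycle_counter_read(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}
#else
/* Off-target stub so the sketch compiles on a host; it does not count anything. */
static uint32_t fake_cycles;
static inline void pmu_cycle_counter_start(void) { fake_cycles = 0u; }
static inline uint32_t pmu_cycle_counter_read(void) { return fake_cycles; }
#endif

/* Measure the cycles spent in fn(). The result includes the overhead of the
 * call and of the two counter reads; unsigned subtraction tolerates a single
 * counter wraparound. */
uint32_t measure_cycles(void (*fn)(void))
{
    pmu_cycle_counter_start();
    uint32_t before = pmu_cycle_counter_read();
    fn();
    uint32_t after = pmu_cycle_counter_read();
    return after - before;
}
```

With these assumptions, the hand estimate for 10 cycles is `cycles_to_ns(10u, CPU_CLOCK_HZ)`, which is 125 ns at 80 MHz. On the target, compare that against `cycles_to_ns(measure_cycles(my_function), CPU_CLOCK_HZ)`, where `my_function` stands for the code under test. Measuring an empty function first gives an idea of the fixed overhead to subtract.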
  • Another point to consider is that the R4/R4F has limited dual-issue capabilities as well as branch prediction. As a result of these enhancements, 10 single-cycle instructions could execute in fewer than 10 clock cycles (ignoring memory effects). Overall, this is a more complex pipeline and memory system than most MCUs have, and I echo Anthony's recommendation to prototype and measure using the PMU.

    Regards,

    Karl
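
To put numbers on the dual-issue point, the following C sketch computes an idealized lower bound on issue cycles. It assumes that at most two instructions can issue per cycle and that every instruction pairs perfectly, which the R4 does not guarantee since its dual issue is limited to certain combinations. The function name `min_issue_cycles` is invented for this example, and memory effects and stalls are ignored.

```c
#include <assert.h>
#include <stdint.h>

/* Idealized lower bound: cycles needed to issue n single-cycle instructions
 * when at most issue_width instructions can issue per cycle.
 * This is a ceiling division; it ignores pairing rules, stalls, and memory. */
static inline uint32_t min_issue_cycles(uint32_t n, uint32_t issue_width)
{
    assert(issue_width > 0u);
    return (n + issue_width - 1u) / issue_width;
}
```

For the 10-instruction example, `min_issue_cycles(10u, 2u)` gives 5 and `min_issue_cycles(10u, 1u)` gives 10, so ignoring memory effects the real figure would fall somewhere in between, while flash wait states and pipeline stalls can push it above 10. That spread is the reason to measure with the PMU.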

  • Thank you, Anthony and Karl. I understand the complexity, so I'd better use the PMU :) Is there any material available online that describes this pipeline well? I have already looked at the Cortex-R4 TRM. Is there anything else besides that?
  • The TRM has the best detail on the pipeline that ARM is willing to release publicly.

    Regards,
    Karl