TMS570LC4357 and slow program execution

Other Parts Discussed in Thread: TMS570LC4357, TMS570LS3137, HALCOGEN

Hi all,

we have a problem with slow program execution on the TMS570LC4357 compared to the TMS570LS3137. The difference is too significant to ignore. At this point I hope we simply have a bug in our code, but we have run out of ideas about where it could be.

Here is a maximally simplified piece of code to measure it:

speedTest:
  mrc    p15, #0, r1, c9, c13, #0  // Read PMCCNTR Register
  nop    // 1
  nop    // 2
  nop    // 3
  nop    // 4
  nop    // 5
  nop    // 6
  nop    // 7
  nop    // 8
  nop    // 9
  nop    // 10
  nop    // 11
  nop    // 12
  nop    // 13
  nop    // 14
  nop    // 15
  nop    // 16
  nop    // 17
  nop    // 18
  nop    // 19
  nop    // 20
  nop    // 21
  nop    // 22
  nop    // 23
  nop    // 24
  nop    // 25
  nop    // 26
  nop    // 27
  nop    // 28
  nop    // 29
  nop    // 30
  nop    // 31
  nop    // 32
  nop    // 33
  nop    // 34
  nop    // 35
  nop    // 36
  nop    // 37
  nop    // 38
  nop    // 39
  nop    // 40
  nop    // 41
  nop    // 42
  nop    // 43
  nop    // 44
  nop    // 45
  nop    // 46
  nop    // 47
  nop    // 48
  nop    // 49
  mrc    p15, #0, r0, c9, c13, #0  // Read PMCCNTR Register
  sub    r0, r0, r1
  bx lr

The result on the TMS570LS3137 is 6 clocks for the MRC + 49 * 1 clock for the NOPs. The function returns 55 ticks, as expected.

But on the TMS570LC4357 it is much slower. The expected result is the same, but the returned value is 81 ticks.

And the bad news: the difference is even bigger in real code. For example, one real function takes 600 ticks on the TMS570LS3137 (about 3.3us@180MHz), but the same function needs 1700 ticks on the TMS570LC4357 (about 5.7us@300MHz)!
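
For reference, the arithmetic behind these numbers can be sketched in a few lines of C. This is only a host-side illustration; the helper names `expected_cycles` and `cycles_to_us` are made up for this sketch, and the 6-cycle MRC cost is the figure reported above for the LS3137, not a datasheet value.

```c
#include <stdint.h>

/* Expected cycle count for the test: one MRC plus nop_count single-cycle NOPs.
 * mrc_cycles = 6 is the value observed in this thread, taken here as an assumption. */
static uint32_t expected_cycles(uint32_t mrc_cycles, uint32_t nop_count)
{
    return mrc_cycles + nop_count;
}

/* Convert a PMCCNTR cycle count to microseconds for a CPU clock given in MHz. */
static double cycles_to_us(uint32_t cycles, double cpu_mhz)
{
    return (double)cycles / cpu_mhz;
}
```

With these helpers, `expected_cycles(6, 49)` gives 55, `cycles_to_us(600, 180.0)` is roughly 3.33 and `cycles_to_us(1700, 300.0)` is roughly 5.67, so the faster part ends up slower in wall-clock time as well as in cycles.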

Where could the problem be? GCLK = 300 MHz, HCLK = 150 MHz, flash data wait states = 3, flash prefetch is enabled, cache is enabled. The boot code (flash & cache init) comes from HALCoGen.

  • Hello Jiri,

      Is the cycle difference measured the first time this test function is called? Can you call this function at least once and obtain the cycle difference from the 2nd or a subsequent call? I suspect that these instructions are not yet in the cache the first time you call the function.

      In addition, you can use the event counters to count certain types of events to better understand what is going on. For example, you can program a PMU event counter to count events such as instruction cache misses, so we know whether there are any cache misses while executing these NOPs.
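
The "call it once, then measure" suggestion could look roughly like the sketch below. It assumes the routine under test returns its own elapsed cycle count, as speedTest does. `measure_after_warmup` and `fake_speed_test` are hypothetical names, and the fake routine with its made-up numbers exists only so the sketch can run on a host PC.

```c
#include <stdint.h>

/* A self-timing routine such as speedTest(): returns elapsed cycles. */
typedef uint32_t (*timed_fn)(void);

/* Run fn warmup_calls times and discard the results, then return the next
 * measurement, which is taken after the code has had a chance to be cached. */
static uint32_t measure_after_warmup(timed_fn fn, unsigned warmup_calls)
{
    for (unsigned i = 0u; i < warmup_calls; ++i) {
        (void)fn();
    }
    return fn();
}

/* Hypothetical host-side stand-in for speedTest(): pretends the first call
 * is slow (cold cache) and later calls are fast. The numbers are made up. */
static uint32_t fake_speed_test(void)
{
    static unsigned calls = 0u;
    return (calls++ == 0u) ? 120u : 55u;
}
```

On the target, `measure_after_warmup(speedTest, 1)` would discard the first result and report the second one.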

  • I did some more measurements and the result looks strange. It looks like the problem is NOT in the instruction cache, but the processor is waiting for instructions. Why?
    I also made a test that measures it 4 times in one function (speedTest2), and it measures the same value, 81 cycles, all 4 times.

    Here are the PMU results, in the same order as in ARM DDI 0363E. All values include the measurement overhead; see the measurement code below.

    Event #  result  note
    0x01          0   Instruction cache miss.
    0x03          0   Data cache miss.
    0x04          0   Data cache access.
    0x06          1   Data Read architecturally executed.
    0x07          2   Data Write architecturally executed.
    0x08         70   Instruction architecturally executed.
    0x5e          1   Dual-issued pair of instructions architecturally executed.
    0x09          0   Exception taken.
    0x0A          0   Exception return architecturally executed.
    0x0B          0   Change to Context ID executed.
    0x0C          4   Software change of PC, except by an exception, architecturally executed.
    0x0D          2   B immediate, BL immediate or BLX immediate instruction architecturally executed
    0x0E          2   Procedure return architecturally executed, other than exception returns, for example, BX Rm; LDM PC.
    0x0F          0   Unaligned access architecturally executed.
    0x10          0   Branch mispredicted or not predicted.
    0x11        238   Cycle count.
    0x12          2   Branches or other change in program flow that could have been predicted by the branch prediction resources of the processor.
    0x40        142   Stall because instruction buffer cannot deliver an instruction.
    0x41          2   Stall because of a data dependency between instructions.
    0x42          0   Data cache write-back.
    0x43          1   External memory request.
    0x44          0   Stall because of LSU being busy.
    0x45          0   Store buffer was forced to drain completely.
    0x46          0   The number of cycles FIQ interrupts are disabled.
    0x47         19   The number of cycles IRQ interrupts are disabled.
    0x48-0x5d     0
    0x5f-0x7e     0
    0x7f          238  ?       
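
One way to read the table is to relate a stall counter to the cycle counter. The sketch below does that on a host PC; `event_share_percent` is a made-up helper name, and the result is only approximate because the counts include the measurement overhead and are assumed to come from comparable runs.

```c
#include <stdint.h>

/* Share of the counted cycles attributed to one event, in percent.
 * Returns 0 when no cycles were counted, to avoid dividing by zero. */
static double event_share_percent(uint32_t event_count, uint32_t cycle_count)
{
    if (cycle_count == 0u) {
        return 0.0;
    }
    return 100.0 * (double)event_count / (double)cycle_count;
}
```

For the values above, `event_share_percent(142, 238)` is just under 60, i.e. the instruction buffer stall (event 0x40) accounts for more than half of the counted cycles (event 0x11).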

    And here is the measurement code:

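            U32 event;
            pmuSetCountEvent(2, eventTest);              // count the event under test on the selected counter
            event = pmuGetEventCount(2);                 // snapshot of the counter before the call
            speedTestResult = speedTest();               // run the timed NOP sequence
            eventResult1 = pmuGetEventCount(2) - event;  // events during the call, including measurement overhead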
    

    And here is the complete asm code:

    8407.speedTest.S

    PS: all ASM functions are aligned to 64 bits. If they are not aligned, code execution is slower (85 or 91 cycles instead of 81). Is this an effect of the flash prefetch unit and the 64-bit internal bus?

  • SCTLR(System Control Register) = 0x8BE71878
    It looks like the instruction cache was correctly enabled in the boot code (which comes from HALCoGen).
  • Event counter 0x14(Level 1 instruction cache access) = 0. Why?
  • Hello Jiri,
    Could you tell me the MPU configuration for the flash space? Please let me know the MPU Region Access Control Register setting for the flash region especially the TEX, S, C and B bits.
    The PMU indicates an external memory access (event 0x43), and event 0x40 shows that the instruction buffer is stalling. An access to external memory takes many cycles to complete.
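
For readers following along, the attribute bits asked about here sit in the low bits of the ARMv7-R MPU Region Access Control Register: TEX in bits [5:3], S in bit 2, C in bit 1 and B in bit 0. The sketch below only decodes a value that has already been read; the names `dracr_attrs`, `dracr_decode` and `dracr_clear_shareable` are made up for this example, and reading or writing the register itself (through CP15) is not shown.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARMv7-R DRACR memory attribute fields. */
#define DRACR_B          (1u << 0)   /* bufferable */
#define DRACR_C          (1u << 1)   /* cacheable */
#define DRACR_S          (1u << 2)   /* shareable */
#define DRACR_TEX_SHIFT  3u
#define DRACR_TEX_MASK   (7u << DRACR_TEX_SHIFT)

typedef struct {
    unsigned tex;  /* TEX[2:0] */
    bool s;        /* shareable bit */
    bool c;        /* cacheable bit */
    bool b;        /* bufferable bit */
} dracr_attrs;

/* Split a raw DRACR value into its TEX, S, C and B fields. */
static dracr_attrs dracr_decode(uint32_t dracr)
{
    dracr_attrs a;
    a.tex = (dracr & DRACR_TEX_MASK) >> DRACR_TEX_SHIFT;
    a.s = (dracr & DRACR_S) != 0u;
    a.c = (dracr & DRACR_C) != 0u;
    a.b = (dracr & DRACR_B) != 0u;
    return a;
}

/* Return the same attributes with the shareable bit cleared. */
static uint32_t dracr_clear_shareable(uint32_t dracr)
{
    return dracr & ~DRACR_S;
}
```

As a made-up example value, `dracr_decode(0x60F)` reports TEX = 1 with S, C and B all set, and `dracr_clear_shareable(0x60F)` yields 0x60B, the same attributes without the shareable bit.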
  • Hello Charles,
    thanks for pointing us in the right direction. We had a wrong MPU region setting for flash (the shareable bit was set).
    Stupid mistake :-(
    After correcting the DRACR settings I measure 55 clocks for this test, as expected.

    Many thanks, and enjoy the coming weekend!

    Jiri
  • Hi Jiri,
    Glad the problem is resolved! When a region is declared as shareable, it becomes non-cacheable. This is why the instructions were not being cached.