AM437x L2 cache performance issue running QNX

Hi,

 

AM437x L2 cache performance under QNX is much lower than under Linux. Please find the memperf utility results for Linux and QNX below:

 

 

AM437x - Linux (speed is MB/s)

   size  cpy C->C     set C    read C
-------  --------  --------  --------
     16     83.31    275.15    121.46
     32    475.01    495.56    213.95
     64    838.76    896.47    226.31
    128   1256.69   1490.99    277.24
    256   1816.52   2477.88    296.45
    512   1384.02   1583.52    306.32
   1024   1139.99   1283.98    311.68
   2048    855.11    994.45    314.79
   4096    814.10    881.34    316.17
   8192    773.33    799.84    316.47
  16384    705.63    717.44    315.73
  32768    635.63    660.97    308.94
  65536    646.98    652.32    295.30
 131072    637.49    648.60    290.84
 262144    534.84    642.05    284.35
 524288    339.25    640.94    261.00
1048576    307.94    640.81    251.47
2097152    306.26    640.76    250.08
4194304    305.51    640.07    249.42

AM437x - QNX 650/660 (speed is MB/s)

   size  cpy C->C     set C    read C
-------  --------  --------  --------
     16      39.2     44.74     14.63
     32     73.62     82.87     19.78
     64     147.2    132.53     20.17
    128    290.09     331.9     21.74
    256    412.56    500.17     22.41
    512     507.6   1028.03     22.76
   1024    517.47   1610.82     22.94
   2048    579.79   2263.21     23.03
   4096     616.7   2839.19     23.08
   8192     636.9   3246.89      23.1
  16384     647.1   3480.68     23.11
  32768     651.1   3642.84     23.12
  65536    644.05   3716.08     23.12
 131072     517.1   3738.43      23.1
 262144    202.44   2421.82     22.43
 524288    189.58    802.77     20.04
1048576    188.46    701.79     19.49
2097152    187.69    668.28     19.44
4194304    187.57    649.68     19.46

Based on the values above, do you have any suggestions on where to look?

 

A few questions:
-> Are there any security settings that would prevent L1 or L2 cached reads from working? Note that L1/L2 cached writes work OK.

-> Are there any clock settings that would prevent L1 or L2 cached reads from working? Again, L1/L2 cached writes work OK.

 

 

Notes:
- The same board is used to test QNX and Linux.
- QNX 650 and 660 test values are the same.
- The A9 clock is set to 1 GHz in both tests.
- The cache mapping flags are the same on both systems (Linux and QNX). Register values below:

- reg0_cache_id (0x48242000):
  410000C9

- reg0_cache_type (0x48242004):
  1E140140

- Control register (base + 0x100 = 0x48242100):
  Read at address  0x48242100 (0xb6f05100): 0x00000001

- Auxiliary Control Register (0x48242104):
  Read at address  0x48242104 (0xb6f05104): 0x7E030000

- reg_data_ram_control (0x4824210C):
  Read at address  0x4824210C (0xb6fcd10c): 0x00000110

- reg12_addr_filtering_start (0x48242C00):
  Read at address  0x48242C00 (0xb6f01c00): 0x00000001

- reg12_addr_filtering_end (0x48242C04):
  Read at address  0x48242C04 (0xb6f7bc04): 0x80000000

- reg15_prefetch_ctrl (0x48242F60):
  Read at address  0x48242F60 (0xb6f66f60): 0x30000000
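For anyone comparing these dumps, the raw values can be turned into readable settings with a small decoder. The following is only a sketch: the bit positions are taken from our reading of the L2C-310 TRM register descriptions and should be checked against the TRM for your revision, and all names (`l2_size_kb`, `ram_latency_cycles`, the `L2X0_*` macros) are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed L2C-310 Auxiliary Control Register fields (verify in the TRM). */
#define L2X0_AUX_ASSOC_16WAY     (1u << 16)  /* 0 = 8-way, 1 = 16-way */
#define L2X0_AUX_WAY_SIZE_SHIFT  17          /* bits [19:17] */
#define L2X0_AUX_WAY_SIZE_MASK   0x7u
#define L2X0_AUX_DATA_PREFETCH   (1u << 28)
#define L2X0_AUX_INSTR_PREFETCH  (1u << 29)

/* Total L2 size in KB implied by the way-size and associativity fields. */
static unsigned l2_size_kb(uint32_t aux)
{
    unsigned enc  = (aux >> L2X0_AUX_WAY_SIZE_SHIFT) & L2X0_AUX_WAY_SIZE_MASK;
    unsigned ways = (aux & L2X0_AUX_ASSOC_16WAY) ? 16u : 8u;

    /* Encodings 1..6 mean 16 KB..512 KB per way; others are not handled here. */
    assert(enc >= 1 && enc <= 6);
    return (8u << enc) * ways;
}

/* True when both the data and instruction prefetch enables are set. */
static bool l2_prefetch_enabled(uint32_t reg)
{
    uint32_t both = L2X0_AUX_DATA_PREFETCH | L2X0_AUX_INSTR_PREFETCH;
    return (reg & both) == both;
}

/*
 * Data/tag RAM latency fields store (cycles - 1).
 * shift: 0 = setup latency, 4 = read latency, 8 = write latency.
 */
static unsigned ram_latency_cycles(uint32_t ctrl, unsigned shift)
{
    return ((ctrl >> shift) & 0x7u) + 1u;
}
```

Under those assumptions, `l2_size_kb(0x7E030000)` gives 256 (16 ways of 16 KB), `l2_prefetch_enabled()` is true for both 0x7E030000 and 0x30000000, and `ram_latency_cycles(0x110, 4)` gives a 2-cycle data RAM read latency.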

  

Thanks. 

  • Hi,

    This forum only supports Linux. I will forward this post to the factory experts in case they can offer some tips on what might be causing this behavior.

  • Hi Datis
    Can you do a more comprehensive comparison of the A9 and PL310 registers? Perhaps you are still missing some A9/cache configuration beyond the registers you have listed.

    Regards
    Mukul
  • Hi Mukul, 

    I have looked at all the registers listed in Table 3-2 (offsets 0x000 - 0xF80) of the "AMBA Level 2 Cache Controller (L2C-310) Revision: r3p1" TRM. The other registers were not set.

    Please let me know if you have a particular set I should check. 

    Thanks,
    Datis

  • Datis, you may want to check whether the cache policies in your MMU tables are set up the same way in both OSes. This can have a big impact on how the cache operates.

    regards,
    James
  • Thanks James, we checked the startup code, and for am437x we use our standard Cortex-A9 cache policy. A typical cacheable memory mapping would be cacheable, bufferable, and the TEX bits would be 0x1, i.e. write-back/write-allocate + shared. That said, I can add some debug in our kernel code to dump the pte bits to be 100% sure.
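A debug dump of the pte bits could be checked with something like the sketch below. It assumes the ARMv7 short-descriptor format for a 4 KB small-page entry (B at bit 2, C at bit 3, TEX at bits [8:6], S at bit 10) and that TEX remap is disabled; if TEX remap is enabled, which we believe is the case on ARMv7 Linux, the same bits index into PRRR/NMRR instead and the raw values are not directly comparable between the two OSes. The struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Memory attribute bits of a second-level small-page descriptor. */
struct pte_attrs {
    unsigned tex;     /* TEX[2:0], bits [8:6] */
    bool     c;       /* cacheable, bit 3 */
    bool     b;       /* bufferable, bit 2 */
    bool     shared;  /* S, bit 10 */
};

/* Extract the attribute bits (short-descriptor small page assumed). */
static struct pte_attrs decode_small_page(uint32_t pte)
{
    struct pte_attrs a;

    a.b      = ((pte >> 2) & 1u) != 0;
    a.c      = ((pte >> 3) & 1u) != 0;
    a.tex    = (pte >> 6) & 0x7u;
    a.shared = ((pte >> 10) & 1u) != 0;
    return a;
}

/*
 * True when the entry encodes write-back/write-allocate + shared,
 * i.e. TEX = 0x1, C = 1, B = 1, S = 1 with TEX remap disabled.
 */
static bool is_wbwa_shared(uint32_t pte)
{
    struct pte_attrs a = decode_small_page(pte);

    return a.tex == 0x1u && a.c && a.b && a.shared;
}
```

For example, a made-up entry of 0x8000047E (S, TEX = 1, C and B set) passes the check, while 0x8000007E (same but without S) does not.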

    Actually the ARM & cache initialization for am437x is almost identical to the init for similar SOCs that also use Cortex-A9 MPCore + PL310 cache. The only difference I can see is that for am437x (and other TI SOCs) we never write to the CP15 Auxiliary Control register so we can't enable certain errata fixes like disabling prefetching.

    Do you guys support a smc instruction to allow writes to the CP15 Auxiliary Control register? If so could you share some details or a Linux example?

    Also is there anything else you can think of that would make am437x unique w.r.t ARM & cache init?

    Thanks!

    Mark

    * EDIT * - we also have some custom L2 cache init for am437x which uses the smc instruction but as Datis mentioned earlier we read the L2 cache registers at runtime on QNX and Linux and all the values were the same.

  • Yes, there are smc instructions. Take a look at the initialization chapter. There is a section on Services for HLOS Support.

    Steve K.

  • Hi Steve,

    Thanks for the info.

    We found that the SMP (coherency mode) bit, bit [6] of the CP15 ACTLR register, has to be set (see Section 4.3.10, Auxiliary Control Register, in the Cortex™-A9 Revision: r4p1 TRM).
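A quick way to confirm this at runtime is to read ACTLR on both OSes and compare it against the bits you expect. The sketch below only handles the comparison; the bit positions are from our reading of the Cortex-A9 TRM, the macro and function names are invented for this example, and the read itself (`mrc p15, 0, <Rt>, c1, c0, 1` from a privileged mode) is left out because it is OS specific.

```c
#include <stdint.h>

/* Assumed Cortex-A9 ACTLR bit positions (verify against the TRM). */
#define A9_ACTLR_FW                (1u << 0)  /* cache/TLB maintenance broadcast */
#define A9_ACTLR_L2_PREFETCH_HINT  (1u << 1)
#define A9_ACTLR_L1_PREFETCH       (1u << 2)
#define A9_ACTLR_SMP               (1u << 6)  /* coherency mode */

/* Return the bits that are wanted but not set in the value read back. */
static uint32_t actlr_missing_bits(uint32_t actual, uint32_t wanted)
{
    return wanted & ~actual;
}
```

A non-zero result from `actlr_missing_bits(value_read, A9_ACTLR_SMP)` would mean the SMP bit is clear on that system.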

    However, looking at the TI code, we are not sure how this is being set. Could you please confirm whether this is done in the boot ROM, or point out the code section that enables this bit?

    Thanks,
    Datis  

  • One more question: 

    Could you please confirm how the A9 errata workarounds are applied? Would this have to be done through the secure call (smc) as well?

    As an example, in the Linux code I can see:

    (processor-sdk-linux/arch/arm/mm/proc-v7.S):

    #ifdef CONFIG_ARM_ERRATA_742230
            cmp     r6, #0x22                       @ only present up to r2p2
            mrcle   p15, 0, r10, c15, c0, 1         @ read diagnostic register
            orrle   r10, r10, #1 << 4               @ set bit #4
            mcrle   p15, 0, r10, c15, c0, 1         @ write diagnostic register
    #endif

    But applying the same would freeze the board.  
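The revision gate in that snippet can be written out in C to see which cores it actually touches. This is a sketch under two assumptions: that r6 holds (variant << 4) | revision taken from MIDR bits [23:20] and [3:0], as the kernel setup code appears to compute it, and that 0x412FC09A is a representative MIDR for a Cortex-A9 r2p10 (read the real MIDR on your board to be sure). The function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Combine MIDR variant (bits [23:20]) and revision (bits [3:0]) as 0xVR. */
static unsigned midr_variant_revision(uint32_t midr)
{
    return ((midr >> 16) & 0xF0u) | (midr & 0x0Fu);
}

/* Mirrors "cmp r6, #0x22" + "le": workaround only applied up to r2p2. */
static bool erratum_742230_applies(uint32_t midr)
{
    return midr_variant_revision(midr) <= 0x22u;
}

/* Mirrors "orrle r10, r10, #1 << 4": set bit 4 of the diagnostic value. */
static uint32_t diag_with_742230(uint32_t diag)
{
    return diag | (1u << 4);
}
```

With the assumed r2p10 MIDR, the combined value is 0x2A, which is above 0x22, so the conditional instructions in the kernel snippet would not execute on such a core and the diagnostic register would not be written by this workaround.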

    Thanks & Regards,
    Datis

  • Hi Steve, 

    Any updates? 

    I have one more question ;) 

    Could you please confirm whether the following erratum applies to the AM437x (specifically, MPCore rev. r2p10)?

    From the ARM® Cortex™-A9 processors r2 releases Software Developers Errata Notice:

    ID: 845369 - Under very rare timing circumstances, transitioning into streaming mode might create a data corruption (Configurations affected: one processor if the ACP is present)

    *Note: According to AM437x TRM section 3.3.1.7, Configuration Options, the Accelerator Coherency Port (ACP) is included.

     

    The problem we have is that we cannot access the Diagnostic register to set the errata workaround (due to no dedicated diagnostic control register).

    TRM section 5.2.11 Services for HLOS Support – API (pg. 236) states: 


    “This Cortex core restricts accesses to few ARM coprocessor registers to the secure mode only.

    The services include:

    • L2 cache set debug register

    • L2 cache clean and invalidate range of physical address

    • L2 cache set control register

    • L2 cache set auxiliary control

    • L2 cache get control

    • L2 cache set latency”

     

    The other errata workarounds already implemented in our code don't seem to apply to the single-processor configuration we have:

     

    ID: 794072 - A short loop including a DMB instruction might cause a denial of service on another processor which executes a CP15 broadcast operation (Configurations affected: MPCore processors with two or more processors.)

    ID: 751472 - An interrupted ICIALLUIS operation might prevent the completion of a following broadcast operation (Configurations affected: configuration with two or more processors.)

    ID: 743622 - Faulty logic in the Store Buffer might cause data corruption (Configurations affected: All r2 revisions)
    *Note: Workaround already implemented (setting bit 6 in boot ROM).

     

    Thanks & Regards,
    Datis

  • Datis, you should be able to search the kernel for each errata ID number to determine if a workaround was implemented and/or if each workaround is being used. If the ID cannot be found, then the errata is not addressed in the kernel.

    Regards,
    James