Getting maximum performance out of C6A816x

Hello,

 

We are currently trying to run some benchmark code on the C6A816x (TI C6A816x/AM389x EVM). We have results from running the tests under the Embedded Linux that was provided with the board. When we run the same code as a CCS4 project, execution is much slower, resulting in quite poor performance.

 

We already checked some of the register settings and found that several performance-relevant features were simply turned off, such as the caches, branch prediction, and the MMU. Our question is: how can we configure the Cortex-A8 to deliver maximum performance under CCS4, given the complex hardware of the target (setting appropriate compiler flags is clear)? In fact, when purchasing this evaluation board, we expected some kind of appropriate boot code or startup file from TI. What we are already using is a GEL file from Spectrum Digital:

 

http://support.spectrumdigital.com/boards/netradimm/revc/files/evm816x.gel

 

But using only this file doesn't seem to be enough. Code runs some 15 times slower than under Linux... By the way, the debugger (XDS100v2) seems to be very slow, too. Some simple printf statements take quite a long time; maybe that's a different issue.

 

Our Development-System:

 

Windows XP SP3

CCS 4.2.0

XDS100v2 Debugger

TI C6A816x/AM389x EVM

Target Configuration for TI816x, with adapted file "ti816x_no_stm_no_M3.xml" from this post:

 

http://e2e.ti.com/support/development_tools/code_composer_studio/f/81/p/86690/302160.aspx

 

Kind regards,

 

Thomas

 

 

  • Thomas Stolze said:
    By the way, the debugger (XDS100v2) seems to be very slow, too. Some simple printf statements take quite a long time; maybe that's a different issue.

    You are using the XDS100v2 emulator, which is one of the slowest emulator types out there. To improve debug access, you should try one of the faster emulators. But they can get expensive:

    http://processors.wiki.ti.com/index.php/XDS100#Q:_I_would_like_to_purchase_a_faster_emulator.2C_which_one_is_recommended.3F

    If you want to stick with the XDS100v2, there are some tips here to deal with the slow speed:

    http://processors.wiki.ti.com/index.php/XDS100#Q:_How_to_maximize_performance_of_XDS100_under_CCS.3F

     

    As to the slow performance when running under CCSv4: if you have a lot of CIO (like printf), that would slow things down quite a bit, because printf is very intrusive and the XDS100 is slow. Basically, any kind of CCS<->target communication during execution can impact performance.

    ki

  • Thanks for your help.

     

    We'll try out your hints regarding the debugger.

     

    As for the slow performance of the C6A816x, it does not seem to be that easy. We only noticed the slow output speed when our results came up in the console, hence the question about debugging speed. But by the time printf is called, the benchmark is already over, so no benchmark time is spent in printf. We simply start execution and wait until printf comes up with the results. We take no further action with the debugger, and no output is produced during the run.

    Additionally, we wrote a function that uses the performance counters to measure the CPU clock cycles needed to execute the benchmark code. From that, we calculate the time spent in the code, and we are also able to verify it (simply using a stopwatch, to confirm that the time is roughly correct). That's why we believe the startup code is not well suited: the code just takes too long to execute.

     

    Do you have any hints?

     

    Regards,

     

    Thomas
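
The cycle-count-to-time calculation described above can be sketched in C as follows. This is a minimal sketch, not the code used in the thread: the function names and the 1 GHz value of ASSUMED_CPU_HZ are assumptions for illustration, and reading the counter itself is target-specific and left out. The divider parameter reflects that the Cortex-A8 cycle counter can be configured to count only every 64th cycle.

```c
#include <stdint.h>

/* Assumption: core clock in Hz. Replace with the frequency actually
 * configured by the GEL file or boot code on your board. */
#define ASSUMED_CPU_HZ 1000000000u

/* Cycles elapsed between two reads of a free-running 32-bit counter.
 * Unsigned subtraction gives the right result across a single wrap. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a counter delta to seconds.
 * divider is 1 if the counter counts every cycle, or 64 if the
 * counter was set up to count every 64th cycle. */
double cycles_to_seconds(uint64_t counts, uint32_t cpu_hz, uint32_t divider)
{
    return (double)counts * (double)divider / (double)cpu_hz;
}
```

For example, cycles_to_seconds(elapsed_cycles(start, end), ASSUMED_CPU_HZ, 1) gives the run time in seconds, provided the counter wrapped at most once. Comparing that figure against a stopwatch, as described above, would expose a wrong clock-frequency assumption.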

  • Thomas Stolze said:

    Hello,
    ...
    We already checked some of the register settings and found that several performance-relevant features were simply turned off, such as the caches, branch prediction, and the MMU. Our question is: how can we configure the Cortex-A8 to deliver maximum performance under CCS4, given the complex hardware of the target (setting appropriate compiler flags is clear)? In fact, when purchasing this evaluation board, we expected some kind of appropriate boot code or startup file from TI. What we are already using is a GEL file from Spectrum Digital:
    ...

    Thomas,

    I recently had the same issue of setting up the Cortex-A8 core for best performance. What I'd recommend is using SYS/BIOS as the operating system. If you use it, it will configure the L1 and L2 caches as well as the MMU. This happens before main() is entered. You don't have to use any of the SYS/BIOS features in this case. The BIOS scheduler won't run until you call BIOS_start(), so you can benchmark your application just as if no OS were there. SYS/BIOS is simply used to do the right cache and MMU setup for you.

    What you need is at least version 6.31.00.18 of SYS/BIOS. To get started, create a new CCS project. In the step where you are asked for the "Project Template", select "SYS/BIOS --> Generic Examples --> Hello Example", and on the next page select "ti.platforms.evmDM8168" as the "Platform". That's it. You now have a project where the cache is configured before entering main.

    Best regards,
      Robert Finger

  • Hi Robert,

     

    Thanks for your suggestion. We followed your hints, installed version 6.31.00.18 of SYS/BIOS, and set up a new project as described. We had to use the Spectrum Digital GEL file in order to load our code onto the target; without this file it did not work. The project was set up as "Hello Example" for a C6A816x; the EVM-DM8168 didn't work either. However, we did not get better results running SYS/BIOS. The MMU was turned on, as shown by CCS, but the execution times were as slow as before.

     

    Do you have any other hints we could check or try, or is there any information I can give you regarding our project settings?

     

    Best regards and thank you for your help,

     

    Thomas

  • Thomas,

    In addition to BIOS for cache setup, please try to call the following function in main() in order to enable branch prediction:

    static inline void set_cr(void)
    {
        // Read the CP15 c1 Control Register into r0.
        asm("   MRC p15, #0, r0, c1, c0, #0");
        asm("   nop");

        // Set bit 11 (Z), which enables branch prediction.
        asm("   ORR r0, r0, #0x0800");

        // Write the modified value back to the Control Register.
        asm("   MCR p15, #0, r0, c1, c0, #0");
        asm("   nop");

        // Read the Control Register again to confirm; the value is left in r0.
        asm("   MRC p15, #0, r0, c1, c0, #0");
        asm("   nop");
    }

    Maybe this helps to get better performance.

    Thanks,
      Robert
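
To check which of the features discussed in this thread are actually enabled, the Control Register value (as read back by the code above, or as shown in the CCS register view) can be decoded bit by bit. The following is a minimal sketch under the assumption that the value has already been obtained somehow; the macro and function names are hypothetical, and only the four enable bits mentioned in the thread are covered.

```c
#include <stdint.h>
#include <stdio.h>

/* Enable bits in the ARMv7-A CP15 c1 Control Register. */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data cache enable */
#define SCTLR_Z (1u << 11)  /* branch prediction enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

#define SCTLR_PERF_BITS (SCTLR_M | SCTLR_C | SCTLR_Z | SCTLR_I)

/* Return a mask of the performance-related bits that are NOT set.
 * A result of 0 means MMU, both caches and branch prediction are on. */
uint32_t sctlr_missing(uint32_t sctlr)
{
    return SCTLR_PERF_BITS & ~sctlr;
}

/* Print a human-readable summary of the four enable bits. */
void sctlr_report(uint32_t sctlr)
{
    printf("MMU:               %s\n", (sctlr & SCTLR_M) ? "on" : "off");
    printf("Data cache:        %s\n", (sctlr & SCTLR_C) ? "on" : "off");
    printf("Branch prediction: %s\n", (sctlr & SCTLR_Z) ? "on" : "off");
    printf("Instruction cache: %s\n", (sctlr & SCTLR_I) ? "on" : "off");
}
```

Calling sctlr_report() once at the start of main(), before the benchmark runs, would show whether the startup code left any of these features disabled. Note that this only covers the Control Register; the L2 cache and the MMU page attributes (which decide whether memory is cacheable at all) are configured elsewhere.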