Poor Performance

Other Parts Discussed in Thread: SYSBIOS, TMS320C6657

We're doing some benchmarking of filter functions and getting performance much worse than expected.  Using the DSPF_sp_biquad() function we expect a single biquad to use about 2e-6 percent of the processor's instruction bandwidth.  IOW about 20 instructions running at 1 GHz.  However we are measuring orders of magnitude worse than this.

Could it be because the cache is not enabled?  There's a function, Cache_enable() that takes an argument of "Bits 16 type" but there is no description anywhere of what this argument should be.  I tried enabling the cache in the SYS/BIOS but that didn't help.  Or is this DSP just a lot slower than advertised?

Searched the forum and the wiki to no avail.

  • Hi,

    Could you share which device this is? Also, which version of the SDK?

    Best Regards,
    Yordan
  • Usually, if the number of cycles is an order of magnitude off from what you expect, it is a memory issue, so enabling the cache is necessary.

    Where did you find the Cache_enable() function? (Which release of which package?)

    What is IOW?

    How do you measure the cycles?

    Ran

  • C6657. Latest SDK.
  • How do we enable the cache? The Cache_enable() function is in the documentation. Go to the Help menu in CCS and search for "cache". There is a Cache component in the SYS/BIOS but enabling it made no difference. Everything we are using is the absolute latest. I did a fresh install of all the IDE components a few days ago.

    "IOW" is a common internet forum acronym for "in other words".

    We measure the cycles using the TSCx registers. We know this works properly because pausing the task for a known amount of time generates the expected CPU usage. Specifically, what we are doing is running 256 biquads in cascade and measuring the total time to execute. Each biquad processes a block of data. Since the block size and the sample rate are both fixed, calculating the percentage of CPU bandwidth is easy: CpuLoad = (float)cycles * 100.0 * FS / (BLOCKSIZE * 1.0e9); where cycles is the number of CPU cycles elapsed during the biquad processing.

    The biquads and the data are in L2 memory.  DDR is not being used for this test.

    On a TigerSHARC this would take about 0.25%. We are measuring around 17% on the C6657 which is well below expected performance. Even a regular SHARC is at least an order of magnitude better.
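The load formula quoted above can be written as a small helper. This is a minimal sketch, not code from the poster's project: the function name `cpu_load_percent` is hypothetical, and the thread does not state the actual values of `FS` or `BLOCKSIZE`, so any values passed in are assumptions.

```c
#include <stdint.h>

/* Percentage of CPU time consumed by processing one block.
 *   cycles     : CPU cycles measured for the block (e.g. from the TSC registers)
 *   fs_hz      : sample rate in Hz (not given in the thread; caller supplies it)
 *   block_size : samples per block (not given in the thread; caller supplies it)
 *   cpu_hz     : core clock in Hz (1.0e9 for the 1 GHz clock assumed in the thread)
 */
double cpu_load_percent(uint64_t cycles, double fs_hz,
                        double block_size, double cpu_hz)
{
    /* Time available per block is block_size / fs_hz seconds;
     * time used is cycles / cpu_hz seconds. The ratio, times 100,
     * is the same expression as the CpuLoad formula in the post. */
    return (double)cycles * 100.0 * fs_hz / (block_size * cpu_hz);
}
```

As a sanity check with assumed values: at 48 kHz with 48-sample blocks, each block lasts 1 ms, so 1,000,000 cycles at 1 GHz would correspond to a 100% load.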

  • OK, here are a couple of suggestions:

    Enable the keep assembly switch (properties->Compiler->advanced Options->assembly options) and look at the number of cycles in the loop. It will help you assess how many cycles the code should consume, so you can compare that to what you measure.
    If the measured number is much higher than that, it is a memory issue.

    There are multiple ways to enable the cache. I prefer to use csl code. (You can use BIOS as well, but do not mix the two.) So find the file csl_CacheAux.h in the csl directory and look for the functions that set the size of the caches. To enable caching for a certain region (I prefer to call it a memory segment), look for the MAR registers in the C66 User guide and turn the cache on for that segment. Note that the first 16 registers are read-only, so L2 and MSMC memory are always L1D cached.
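To make the MAR lookup described above concrete, here is a minimal host-side sketch of the address arithmetic only. It does not touch hardware and does not use csl or BIOS calls. The helper names are hypothetical. It assumes each MAR covers a 16 MB address range and that the registers are 32 bits wide and consecutive from MAR0 at 0184 8000h; check both against the CorePac documentation for your device.

```c
#include <stdint.h>

/* Address of MAR0 (0184 8000h). */
#define MAR_BASE_ADDR 0x01848000u

/* Assumption: each MAR covers 16 MB, so the register index is the
 * top 8 bits of the 32-bit address. */
unsigned mar_index_for_address(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* Address of the memory-mapped register MARn.
 * Assumption: 32-bit registers laid out consecutively, 4 bytes apart. */
uint32_t mar_register_address(unsigned index)
{
    return MAR_BASE_ADDR + 4u * (uint32_t)index;
}

/* MAR0 through MAR15 cover reserved ranges and cannot be reconfigured,
 * per the post above. Returns 1 for those indices, 0 otherwise. */
int mar_is_reserved(unsigned index)
{
    return index < 16u;
}
```

For example, an external-memory buffer at the assumed address 0x80000000 maps to MAR128, whose register sits at 0x01848200 under these assumptions, while any address in the first 16 MB maps to the fixed MAR0.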

    I would agree with you that the biquad algorithm is not easy to parallelize, so this particular algorithm cannot take full advantage of the multiple functional units architecture of the C66. Other algorithms (and of course FIR) have better utilization.
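The point about parallelization is easier to see in code. The following is a generic textbook biquad in transposed direct form II, offered only as a sketch. It is not the DSPLIB implementation of DSPF_sp_biquad(), and the type and function names are hypothetical.

```c
/* One second-order section, transposed direct form II. */
typedef struct {
    float b0, b1, b2;  /* feed-forward coefficients */
    float a1, a2;      /* feedback coefficients (a0 normalized to 1) */
    float s1, s2;      /* filter state carried between samples */
} biquad_t;

/* Filter n samples from x into y, updating the state in f. */
void biquad_process(biquad_t *f, const float *x, float *y, int n)
{
    for (int i = 0; i < n; i++) {
        /* The output depends on s1, which was written by the previous
         * iteration, so iteration i cannot start before i-1 finishes
         * its state update. */
        float out = f->b0 * x[i] + f->s1;
        f->s1 = f->b1 * x[i] - f->a1 * out + f->s2;
        f->s2 = f->b2 * x[i] - f->a2 * out;
        y[i] = out;
    }
}
```

Because every output sample feeds back into the state used by the next one, the loop carries a dependency from iteration to iteration. An FIR filter has no such feedback, which is why it maps more readily onto several functional units at once.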

    Ran
  • Are there any examples? The documentation recommends AGAINST using csl with the latest tools. How would we do it using the BIOS? There is only a single switch and it doesn't seem to do anything?

    We did use the keep assembly switch and estimated that the biquad function should use about 20 instructions per loop so we are expecting about 2e-6% CPU usage per biquad. We are measuring about 60e-3% so we are much, much worse than predicted.

    And, frankly, telling a customer to hunt for header files is unacceptable. There should be clear, concise documentation on how to perform this or detailed examples. Note that we are a "customer". This implies that we pay money for the product. In exchange we expect a reasonable level of support.
  • Frankly, I could tell you exactly where the include file resides, but I think it is important that you are familiar with where the include files are, so that next time you look for something you know where to look.

    Case in point: BIOS functions to manage the cache. You know that it is part of BIOS, you also know that it is part of sysbios, and you expect it to depend on the processor family. So if you look at XXX\bios_6_45_01_29\packages\ti\sysbios\family\c66, where XXX is the installation directory (and you may have a different revision of BIOS), you will find Caches.h and Caches.c. They have functions that manage the cache from BIOS.

    As I said, I personally like csl. This is a forum, so other people may have other opinions.

    By the way, can you give the number of cycles as a numeric value rather than as a percentage, and then compare it to the number of cycles that you expect the code to take?
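Converting the reported percentages back into cycle counts only requires inverting the CpuLoad formula posted earlier in the thread. The sketch below does that. The function names are hypothetical, and the sample rate and block size must come from the actual project because the thread does not state them.

```c
/* Inverse of CpuLoad = cycles * 100 * FS / (BLOCKSIZE * cpu_hz):
 * returns the number of CPU cycles spent per block. */
double cycles_per_block(double load_percent, double fs_hz,
                        double block_size, double cpu_hz)
{
    return load_percent / 100.0 * block_size * cpu_hz / fs_hz;
}

/* Cycles spent per input sample by ONE filter of a cascade of
 * n_filters filters, given the total measured load of the cascade. */
double cycles_per_sample_per_filter(double load_percent, double fs_hz,
                                    double cpu_hz, int n_filters)
{
    double cycles_per_sample = load_percent / 100.0 * cpu_hz / fs_hz;
    return cycles_per_sample / (double)n_filters;
}
```

Calling `cycles_per_sample_per_filter(17.0, FS, 1.0e9, 256)` with the real sample rate gives a per-sample, per-biquad cycle count that can be compared directly against the loop cycle count in the kept assembly file.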

    Ran
  • This is an absolutely ridiculous response. I should not have to come to a forum and play 20 questions. I should be able to go into the CCS help menu and find detailed information on how to enable the cache along with example code. Trying to figure out how to use a function by looking at prototypes in a header file that is buried in one of thousands of directories is insane. My guess is that it's somewhere in ti/packages/ti/package/ti/packages/drv/package/ti/packages/csl/ti/packages/package/src.
  • There are multiple presentations on application optimization, including memory optimization, on e2e and in the TI training material.

    I have nothing to add.
  • This does not appear to be the issue. According to the C66x CorePac User's Guide (there is NO C66x User's Guide):
    4.3.7.2 Special MAR Registers
    MAR0 through MAR15 represent reserved address ranges in the C66x CorePac, and therefore are treated as follows:
    1. MAR0 is implemented as a read-only register. The PC of the MAR0 is always read as 1.

    On page 199 MAR0 is defined as:
    0184 8000h MAR0 Memory Attribute Register 0 Local L2 RAM (fixed)

    So L2 RAM is ALWAYS cached and this cannot be turned off. Our arrays are in L2 SRAM which is configured as all SRAM.

    Furthermore trying to use the SYS/BIOS to configure the cache results in:
    Cache is not supported for the specified device (TMS320C6657).
    Cache is only supported for the following devices on the C66 target:
    TMS320TI816X
    TMS320DA830
    TMS320C3430
    etc.

    Is there anyone there who is an expert on the C665x or C667x? We really need some help.
  • Hello lemmiwinks,

    Do you have the CCS project that shows the poor performance issue? If yes, can you please attach it?

    Do you have a GEL file in your CCS target configuration so that the PLL, DDR, etc. are initialized properly when you connect CCS? That is, the platform needs to be ready before you load and run the DSPLIB projects for benchmarking.

    best regards,
    David Zhou