
IDMA bandwidth

Hi,

I'm using a C6678 device. I tested the IDMA transfer rate for L2-to-L1D transfers (using IDMA1). As I understand it, the internal bus between the L2 and L1D memories is 256 bits wide and runs at the EMC clock, which is half the DSP clock. This gives a theoretical bandwidth of 16 GB/s (for a 1 GHz device).
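
(The arithmetic behind that figure, for reference:)

$$\frac{256\ \text{bits}}{8\ \text{bits/byte}} \times \frac{1\ \text{GHz}}{2} = 32\ \text{bytes/cycle} \times 500\ \text{MHz} = 16\ \text{GB/s}$$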

I ran several tests transferring data from L2 to L1D with block sizes from 128 bytes to 2 Kbytes, and I kept queuing transfers so that there was always one active and one pending transfer. I timed the transfers and measured only about 3 GB/s. All transfers were issued at the highest configurable priority (priority 0). To the best of my knowledge, no other memory transactions were taking place (no active master peripherals, little cache traffic, etc.).

Does this rate make sense? Do you have other figures?

Thanks,

Yishay

  • Yishay,

    I'm assuming you've configured L1D as part RAM and part cache for this (same for L2) and are checking the contents at the end to verify the data landed correctly.

    A couple questions.

    1.) How are you capturing the timestamps?

    2.) Which timer are you using for the timestamps?

    3.) How are you calculating the throughput?

    4.) Can you provide some raw numbers from what you've observed?

    Best Regards,

    Chad

  • 1, 2. I'm reading the CNTLO register of Timer0. I found that the system initializes this timer with a clock 6 times slower than the CPU clock.

    3. I'm dividing the total cycle count by the total byte count.
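
    In other words (just a sketch of the arithmetic; the names are illustrative, not from my actual code):

    #include <stdint.h>

    #define TIMER_DIV 6   /* Timer0 is clocked at CPU/6 */

    /* Convert two Timer0 CNTLO snapshots into a cycles-per-byte figure. */
    double cycles_per_byte(uint32_t timer_start, uint32_t timer_end, uint32_t total_bytes)
    {
        uint32_t cpu_cycles = (timer_end - timer_start) * TIMER_DIV;   /* back to CPU cycles */
        return (double)cpu_cycles / (double)total_bytes;
    }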

    4. My test transfers nine 1 KB blocks using the following code:

     

    TIMER_TIC;
    for (size = 1024, count = 0; count < BLOCK_COUNT; count++)
    {
        hIdma->IDMA1_SOURCE = (uint32_t) L2Buff  + size * count;
        hIdma->IDMA1_DEST   = (uint32_t) L1DBuff + size * count;
        hIdma->IDMA1_COUNT  = size;                  /* writing COUNT submits the transfer     */

        while (hIdma->IDMA1_STAT & 0x00000002)       /* wait until the pending slot is free    */
            ;
    }
    while (hIdma->IDMA1_STAT)                        /* wait for active + pending to complete  */
        ;
    TIMER_TIC;

     

    The TIMER_TIC is defined as:

    #define TIMER_TIC  TimerVal[timerValIndex++] = BaseAddress[0].regs->CNTLO; \
                       if (timerValIndex == TIME_VAL_LEN) timerValIndex = 0;

    I measured 2946 cycles for 9 KB (9216 bytes), which works out to 0.319661 cycles/byte.
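
    Turned around (and assuming the 2946 figure is already in CPU cycles on a 1 GHz device), that is:

    $$\frac{9216\ \text{bytes}}{2946\ \text{cycles}} \approx 3.13\ \text{bytes/cycle} \approx 3.1\ \text{GB/s}$$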

     

     

  • What are your compiler options? Are you using 'debug'? If so, you'll want to remove that option, as it produces the least efficient code.

    Where is the TimerVal[] data block located?

    What exactly are the cache/SRAM settings?

    Where are L1DBuff and L2Buff located?

    I'd suggest using the TSC (the core timer, available in the CSL) for your timestamps; I'm not sure how many cycles your current code spends on timestamping. Also, grab the first timestamp just prior to writing the IDMA1 count register, take another after exiting the for loop, and then another after the while (hIdma->IDMA1_STAT); loop.

    Basically, we need to figure out what is consuming the time, because the IDMA has been shown to achieve full theoretical performance in internal testing.

    Here's a quick example of the TSC timer usage - it's a direct register read.

            CSL_Uint64        counterVal;
           
            ...
           
            CSL_tscStart();
            counterVal = CSL_tscRead();
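
    Applied to your loop, the three timestamps would land roughly like this (a sketch reusing the names from your snippet; I have not compiled it):

            CSL_Uint64 t0, t1, t2;

            CSL_tscStart();                            /* enable the 64-bit TSC                  */

            t0 = CSL_tscRead();                        /* just before the first COUNT write      */
            for (size = 1024, count = 0; count < BLOCK_COUNT; count++)
            {
                hIdma->IDMA1_SOURCE = (uint32_t) L2Buff  + size * count;
                hIdma->IDMA1_DEST   = (uint32_t) L1DBuff + size * count;
                hIdma->IDMA1_COUNT  = size;

                while (hIdma->IDMA1_STAT & 0x00000002) /* wait for the pending slot              */
                    ;
            }
            t1 = CSL_tscRead();                        /* after the last transfer is submitted   */

            while (hIdma->IDMA1_STAT)                  /* wait for active + pending to drain     */
                ;
            t2 = CSL_tscRead();                        /* after everything has completed         */

            /* t1 - t0 = issue time, t2 - t1 = drain time, t2 - t0 = total */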

  • Chad,

    I'm using Release code with -o3 optimization.

    The TimerVal[] data block is located in L1D.

    My cache settings are: L2 cache - 128K, L1D cache - 16K, L1P cache - 16K.

    L1DBuff is in L1D and L2Buff is in L2. I also ran the whole routine from L1P so there would be no cache issues.

    Using the TSC did not change much. I measured with both the TSC and Timer0 and got the same results. When using the TSC through the CSL, calling CSL_tscRead actually makes things a bit worse, since the code branches to the routine instead of inlining it.
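
    One way around the call overhead would be to read the counter registers directly with the compiler's cregister support, something like this (an untested sketch; check the compiler guide for the exact declarations):

    extern cregister volatile unsigned int TSCL;    /* low half of the TSC                 */
    extern cregister volatile unsigned int TSCH;    /* high half of the TSC                */

    unsigned int       lo = TSCL;                   /* reading TSCL snapshots TSCH         */
    unsigned int       hi = TSCH;
    unsigned long long t  = _itoll(hi, lo);         /* combine into one 64-bit timestamp   */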

    The timing for the transfer (using the TSC): the whole transfer took 2895 cycles; the setup (writing the source and destination registers) took 50 cycles (some of which may be the TSC function call); and the completion phase, from the end of the loop (after issuing the last transfer to pend the IDMA) to the end of the transfer, took 298 cycles.

    Yishay

     

     

     

  • Can you zip up and post your test code so I can take a look at it?

    Best Regards,

    Chad

  • Chad,

    Attached is the source code. 

    The memory sections mentioned in the code are configured as follows in the RTSC cfg file:

    Program.sectMap[".L2_test"] = {loadSegment: "L2SRAM", loadAlign:8}; /* L2 Edma test*/
    Program.sectMap[".L1D_test"] = {loadSegment: "L1DSRAM", loadAlign:8}; /* L1D Edma test */
    Program.sectMap[".L1P_test"] = {loadSegment: "L1PSRAM", loadAlign:8}; /* L1P test */
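
    On the C side, the attached code ties the buffers and the test routine to those sections with section pragmas, along these lines (a sketch; L2Buff and L1DBuff are the names from the earlier snippet, idma_test is just a placeholder, and the actual attachment may differ):

    #include <stdint.h>

    #pragma DATA_SECTION(L2Buff,  ".L2_test")      /* source buffer in L2 SRAM        */
    uint8_t L2Buff[9 * 1024];

    #pragma DATA_SECTION(L1DBuff, ".L1D_test")     /* destination buffer in L1D SRAM  */
    uint8_t L1DBuff[9 * 1024];

    #pragma CODE_SECTION(idma_test, ".L1P_test")   /* run the test routine from L1P   */
    void idma_test(void);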

    L2SRAM, L1DSRAM and L1PSRAM are configured automatically when changing the cache sizes in the platform configuration. I used 128K for the L2 cache and 16K each for the L1D and L1P caches.
    Please tell me if you need more information.
    Thanks,
    Yishay
  • Yishay,

    I ran into the same problem with the C6678:

    ~1230 cycles to transfer 4K of data from L2 to L1 with IDMA1.

    Have you found a solution to the problem?

    Ivan

  • Yishay and Ivan,

    We identified an issue with IDMA1 transaction performance on silicon revision 1.0 of the C66x, and it has already been fixed in silicon revision 2.0.

    Basically, an L2-to-L1D transaction achieves about 3.41 bytes/cycle on revision 1.0, improved to about 7.67 bytes/cycle on revision 2.0, which essentially meets the theoretical 8 bytes/cycle. (At 1 GHz, 3.41 bytes/cycle is roughly 3.4 GB/s, in line with the ~3 GB/s measured above.)

    An L2-to-L2 transaction is about 3.75 bytes/cycle on revision 1.0, improved to about 7.76 bytes/cycle on revision 2.0, against a theoretical 8 bytes/cycle.

    One more example is an L2 fill, which is about 7.49 bytes/cycle on revision 1.0 and improved to about 15.44 bytes/cycle on revision 2.0, against a theoretical 16 bytes/cycle.

    A usage note covering this issue will be added to the next release of the errata document. Please plan on using revision 2.0 of the C66x devices if IDMA1 performance is critical in your design. Thanks.

    Sincerely,

    Steven

  • Steven, thanks!

  • Is there any way to determine the silicon revision of a C6678 chip soldered onto an EVM6678 board?

    I've just bought an EVM6678 development board (board revision 3.0), which I will use for my diploma thesis, and if the silicon revision has bandwidth issues I don't intend to use IDMA.

    Thank you in advance, Clemens

  • Clemens,

    Please take a look at the JTAGID register mentioned in the C6678 data manual and errata documents.

    The VARIANT field (bits 31-28) should show the silicon revision of the device you are using.

    As mentioned in the errata document, VARIANT=0 means silicon revision 1.0 and VARIANT=1 means silicon revision 2.0.
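
    For example, something like this should work (the 0x02620018 address is my recollection of where the JTAG ID register sits in the bootcfg space; please verify it against the data manual before relying on it):

    #include <stdint.h>

    /* NOTE: register address from memory -- confirm against the C6678 data manual. */
    #define JTAGID_REG   (*(volatile uint32_t *)0x02620018)

    uint32_t silicon_variant(void)
    {
        return (JTAGID_REG >> 28) & 0xF;   /* VARIANT: 0 => rev 1.0, 1 => rev 2.0 */
    }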

    Hope it helps.

  • Clemens,

    Even if you have silicon 1.0, which is what is on my rev 1.0 EVM, the IDMA performance is very fast. It can be a significant help for some applications by relieving the DSP from having to move data between L1 and L2.

    IDMA is an underutilized feature of the C64x+, C674x, and C66x DSP cores. As far as I know, none of my customers use it, but instead just rely on the DSP's cache and EDMA to do their data movement.

    But there are certain applications that can fit data in multiple buffers in L1D SRAM and keep larger buffers in L2 SRAM. Those can get a performance lift from using IDMA1 to copy the old results from L1D to L2 and then the next input from L2 to L1D, all while the DSP is executing on the current input in L1D and writing the current output to L1D; a sketch of the pattern follows below. In many cases the output can be placed in L2 without any performance loss, but that is very algorithm dependent.
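
    A rough sketch of that double-buffering pattern, input side only (idma1_copy and idma1_wait are hypothetical wrappers around the same IDMA1 registers used earlier in this thread, hIdma is the same handle, and all other names are made up):

    #include <stdint.h>

    #define NBLOCKS  16
    #define BLKSIZE  1024

    static uint8_t l2In[NBLOCKS][BLKSIZE];         /* input staged in L2 SRAM           */
    static uint8_t l1dIn[2][BLKSIZE];              /* ping-pong input buffers in L1D    */

    static void idma1_copy(void *src, void *dst, uint32_t bytes)
    {
        hIdma->IDMA1_SOURCE = (uint32_t)src;
        hIdma->IDMA1_DEST   = (uint32_t)dst;
        hIdma->IDMA1_COUNT  = bytes;               /* writing COUNT submits the copy    */
    }

    static void idma1_wait(void)
    {
        while (hIdma->IDMA1_STAT)                  /* any active or pending transfer?   */
            ;
    }

    void process(uint8_t *buf, uint32_t len);      /* compute kernel, reads from L1D    */

    void run(void)
    {
        int blk, cur = 0;

        idma1_copy(l2In[0], l1dIn[0], BLKSIZE);    /* prime the first buffer            */
        idma1_wait();

        for (blk = 0; blk < NBLOCKS; blk++, cur ^= 1)
        {
            if (blk + 1 < NBLOCKS)                 /* start fetching the next block...  */
                idma1_copy(l2In[blk + 1], l1dIn[cur ^ 1], BLKSIZE);

            process(l1dIn[cur], BLKSIZE);          /* ...while the DSP works on this one */

            idma1_wait();                          /* make sure the fetch has finished  */
        }
    }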

    Basically, I am suggesting that you not give up on IDMA1 just because you "only" get 3GB/s. That is still pretty good.

    Regards,
    RandyP

  • Hi Randy,

    I verified my EVM6678 has revision 2.0 silicon, so I'll give IDMA a try, hoping to get 5-10% more throughput for memory-bound algorithms by hiding the L2 SRAM latencies.

    Thanks, Clemens