C6472 EDMA transfer performance

I am seriously concerned about our recent observation regarding EDMA3 transfer performance from L2 to DDR2 on the C6472.

 

Here is the task I am trying to achieve:

The task is the reverse of example 3.3 in SPRU727D.

The input (in L2) is arranged in interleaved format (like the destination in example 3.3) and needs to be stored in DDR de-interleaved (like the source in example 3.3).

The data streams continuously into the L2 at a high data rate (~100 MBytes per second).

Each sample is 2 Bytes.

We set up the PaRAM as follows: ACNT = 2, BCNT = 1024, CCNT = 4, SRCBIDX = 8, DSTBIDX = 2, with AB_SYNC and intermediate chaining.
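
To make the addressing concrete, here is a minimal software model of that AB-synchronized transfer. It is only a sketch: `ParamModel`, `deinterleave_param` and `edma_model_ab_sync` are hypothetical names, the struct is not the real PaRAM register layout, and the SRCCIDX/DSTCIDX values are assumptions (they are not given above), chosen so that each frame picks the next channel and writes it to its own contiguous block.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified model of a PaRAM set (NOT the real register layout). */
typedef struct {
    uint16_t acnt;     /* bytes per array */
    uint16_t bcnt;     /* arrays per frame */
    uint16_t ccnt;     /* frames per block */
    int16_t  srcbidx;  /* source byte step between arrays */
    int16_t  dstbidx;  /* destination byte step between arrays */
    int16_t  srccidx;  /* source byte step between frame starts (AB-sync) */
    int16_t  dstcidx;  /* destination byte step between frame starts (AB-sync) */
} ParamModel;

/* Settings from the post for 16-bit samples; the two CIDX values are assumed. */
static ParamModel deinterleave_param(uint16_t channels, uint16_t samples)
{
    ParamModel p;
    p.acnt    = 2;                        /* one 16-bit sample */
    p.bcnt    = samples;                  /* 1024 in the post */
    p.ccnt    = channels;                 /* 4 in the post */
    p.srcbidx = (int16_t)(2 * channels);  /* 8 in the post */
    p.dstbidx = 2;
    p.srccidx = 2;                        /* assumed: next channel in L2 */
    p.dstcidx = (int16_t)(2 * samples);   /* assumed: next channel block in DDR */
    return p;
}

/* Software emulation: one frame of BCNT arrays per sync event, with chaining
 * standing in for the loop over CCNT frames. */
static void edma_model_ab_sync(const ParamModel *p, const uint8_t *src, uint8_t *dst)
{
    for (unsigned c = 0; c < p->ccnt; ++c) {
        const uint8_t *s = src + (ptrdiff_t)c * p->srccidx;
        uint8_t *d = dst + (ptrdiff_t)c * p->dstcidx;
        for (unsigned b = 0; b < p->bcnt; ++b) {
            memcpy(d, s, p->acnt);        /* one ACNT-byte array */
            s += p->srcbidx;
            d += p->dstbidx;
        }
    }
}
```

Calling `deinterleave_param(4, 1024)` reproduces the counts and B indexes listed above; the model only shows where the bytes go and says nothing about timing.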

 

We found the transfer BW to be about 120 MBytes/sec. (These benchmarks were done on the C6472 EVM, which has a 32-bit DDR bus.) This is a VERY LOW BW number for our application.

 

To investigate further and simplify the situation, we then modified the case a bit.

We let it do a straight transfer (no interleaving) from L2 to DDR2. With ACNT = 2, BCNT = 1024, CCNT = 4, SRCBIDX = DSTBIDX = 2, and AB_SYNC, we get a similarly VERY LOW transfer BW. Let’s call this case 1 for later reference.

 

Then we changed it to use ACNT = 2 * 1024, BCNT = 4, and CCNT = 1.

Now we get a transfer BW that is almost 14 times that of case 1. Let us call this case 2 for later reference.
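
For reference, both cases move the same number of bytes; what differs is how many ACNT-sized arrays the block is split into. A small sketch of that arithmetic (the `Counts` type and helper names are hypothetical, and whether each array carries a fixed cost in the transfer controller is an assumption on my part, not something I can confirm from the documents):

```c
#include <stdint.h>

/* Hypothetical helper type: just the three counts of a PaRAM set. */
typedef struct { uint32_t acnt, bcnt, ccnt; } Counts;

/* Total payload of the block in bytes. */
static uint32_t total_bytes(Counts c) { return c.acnt * c.bcnt * c.ccnt; }

/* Number of ACNT-sized arrays the block is divided into. */
static uint32_t array_count(Counts c) { return c.bcnt * c.ccnt; }

static const Counts CASE1 = { 2,        1024, 4 };  /* ACNT=2,    BCNT=1024, CCNT=4 */
static const Counts CASE2 = { 2 * 1024, 4,    1 };  /* ACNT=2048, BCNT=4,    CCNT=1 */
```

Both cases come to 8192 bytes, but case 1 is made of 4096 two-byte arrays while case 2 is made of 4 arrays of 2048 bytes each.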

 

We then looked into document SPRAAY0A. This document provides some figures describing various EDMA3 transfer BW scenarios. The difference in transfer BW is somewhat consistent with Figure 20. But to me Figure 20 is in direct contrast to Figure 19 in the same doc unless I misunderstand something. According to Figure 19 as long as ACNT x BCNT is large for AB_SYNC, the transfer BW should be large. But Figure 20 and our observation are different from Figure 19.

 

I don’t understand why the performance would suffer in AB_SYNC case since each TC has a FIFO and should be able to burst as long as ACNT x BCNT is large (as suggested by Figure 19 in SPRAAY0A).

 

This super low transfer BW is a BIG PROBLEM for us, since our input BW is large. It may mean the whole DSP performance could be brought down by this slow transfer: this EDMA transfer could take up close to 100% of DDR access, leaving no room for other cores to access the DDR.
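
As a rough check on the margin, the fraction of time the transfer has to be active follows directly from the two rates quoted above. This is only a back-of-the-envelope sketch (`busy_percent` is a hypothetical helper); it measures how busy the EDMA transfer is, which is not necessarily the same as the share of DDR bandwidth it consumes:

```c
/* Percentage of time a transfer must be active to keep up with the input,
 * given the input rate and the achievable transfer rate (same units).
 * Integer division truncates toward zero; saturates at 100. */
static unsigned busy_percent(unsigned input_rate, unsigned transfer_rate)
{
    if (transfer_rate == 0 || input_rate >= transfer_rate)
        return 100;
    return (input_rate * 100u) / transfer_rate;
}
```

With about 100 MBytes/sec coming in and about 120 MBytes/sec going out, `busy_percent(100, 120)` gives 83, so the transfer would be running more than four fifths of the time.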

 

 My questions are --

1. What causes this performance to be so LOW when ACNT is small (case 1) even when using AB_SYNC?

 

2. For case 1, where the performance is really low, am I losing the DDR BW completely? In other words, can other cores (C64x+) and other EDMA transfers running in parallel still utilize the DDR BW, so that the cumulative performance is good?

 

I need to understand this ASAP, since I may have to go to the HW team and have them take on the de-interleaving task instead of the DSP. I would really appreciate a quick response.

 

NOTE: I could reproduce the same performance issue on the C6455 too.

 

  • Louis Leung said:
    1. What causes this performance to be so LOW when ACNT is small (case 1) even when using AB_SYNC?



    SPRU727D Table 2-19 can give you a little more information on optimizing the EDMA3 transfers, but I am not certain how complete it is. The main point from this table that could help you, again by a small amount, would be to decrease BCNT below 1024. Depending on your CIDX values, if it is possible to change to BCNT=512 and CCNT=8, then your destination side could be optimized a small amount.

    My biggest concern is for the ACNT=2 part. The EDMA3 can definitely meet the functionality that you need, and you have found this to be true. There are two places where this may be leading to the inefficiencies that you have experienced: 1) the L2 access architecture takes as much time to read a large burst of sequential data as it does to read a single element of data (large bus width, bursting access architecture), and 2) the EDMA3 read FIFOs may be optimized for word storage and not multiples of bytes other than 4*N.

    The test I would ask you to try is to change to ACNT=4 and measure the performance. This will help you and me to understand the effect of the half-word access width in both the L2 read and the FIFO storage.

    Louis Leung said:
    2. For the case 1 where the performance is really low, am I losing the DDR BW completely?


    Since your writes to DDR are sequential, it is unlikely that you are losing DDR BW overall. If the read FIFOs support packing 16-bit half-words into 32-bit words within the FIFO, then you will not be losing any appreciable DDR BW. But if not, then you will have less efficiency on the DDR bus, but there should still be a lot of BW left for the other tasks. Measurements within your environment are the best way to find out the real impact.

    I recommend looking at the DDR bus to see whether it is bursting your writes (good use of BW) or if each half-word requires a separate CAS (not as good). Of course, the CAS for each write could be close together, so the losses could be minimized.

    The best way to improve this, based on my speculations here, would be to change how the data is stored in L2 to make the transfer out more efficient. That is a very system-specific recommendation, so it might not be at all practical for your case. But solutions could range from using extra buffers in L2 to store interleaved and non-interleaved copies of the data, to double-wide buffers so all samples are stored as 32-bit data, to running an additional de-interleaving DMA step from one L2 buffer to another and then copying from the de-interleaved L2 buffer to DDR. You will probably come up with other solutions tailored to your application.

    Regards,
    RandyP

  • Thanks Randy for your reply.

    I verified this by increasing my ACNT setting, and the performance did improve. For ACNT = 4 the throughput roughly doubled, and for ACNT = 8 it roughly quadrupled compared to ACNT = 2. That confirms that the way I read the L2 memory makes the difference, not the write to DDR2.
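
    These numbers are consistent with a simple model in which every ACNT-sized array costs about the same time regardless of its size, so throughput grows linearly with ACNT until some other limit takes over. A sketch of that model, assuming the roughly 120 MBytes/sec figure as the ACNT = 2 baseline and 14 times that as the ceiling seen in case 2 (the function names are hypothetical and the model is my guess, not a documented property of the EDMA3):

    ```c
    /* Time per ACNT-sized array in ns, implied by a measured throughput in
     * MBytes/sec (1 MByte = 1e6 bytes). */
    static double per_array_ns(unsigned acnt_bytes, double measured_mbps)
    {
        return (double)acnt_bytes / measured_mbps * 1000.0;
    }

    /* Modeled throughput in MBytes/sec: fixed time per array, capped at a peak. */
    static double modeled_mbps(unsigned acnt_bytes, double array_ns, double peak_mbps)
    {
        double mbps = (double)acnt_bytes / array_ns * 1000.0;
        return mbps < peak_mbps ? mbps : peak_mbps;
    }
    ```

    With `per_array_ns(2, 120.0)` (about 16.7 ns per array), the model gives about 240 MBytes/sec for ACNT = 4 and about 480 MBytes/sec for ACNT = 8, matching the doubling and quadrupling I measured.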

    To meet our requirement, instead of making use of the EDMA, I will have to do the de-interleaving algorithm differently.

    Thanks again for your help.

    -- Louis

  • Louis,

    The EDMA transfers do not use DSP MIPS, so that can be an advantage for your overall system throughput, even with the impact you are seeing now. If you can change the algorithm to de-interleave the data for free (in terms of DSP MIPS), then you will have a perfect solution. But if it requires DSP MIPS overhead, then you might end up slower than with EDMA. There are so many system possibilities that I can only speculate rather than offer intelligent guesses.

    One alternative could be using multiple EDMA transfers, where one does the de-interleaving into buffers in L2 and then transfers from that de-interleaved L2 to DDR with a second transfer. Just a thought.

    Regards,
    RandyP

  • I understand the benefit of using the EDMA to transfer and/or de-interleave data so that it frees up the DSP to do some other tasks and gain system performance.  In our case, since the performance of using the EDMA to de-interleave is not as good as I expected (it can barely keep up with the data acquisition rate), I will have to de-interleave the data with the DSP in L2 and then store the result to DDR2 with EDMA (the ACNT can be large in this case and therefore a very fast transfer).

    In fact, I have tried using EDMA to de-interleave the data from L2 to L2, then transfer to DDR2.  However, as I mentioned, because of ACNT being 2 bytes, reading it from L2 with EDMA is still slow although the writing part is fast.  The performance still can't meet our requirement.

    Thanks again for your suggestion.

    -- Louis