I am seriously concerned about a recent observation regarding EDMA3 transfer performance from L2 to DDR2 on the C6472.
Here is the task I am trying to achieve:
The task is the reverse of example 3.3 in SPRU727D.
The input (in L2) is arranged in interleaved format (like the destination in example 3.3) and needs to be stored in DDR de-interleaved (like the source in example 3.3).
The data streams continuously into the L2 at a high data rate (~100 MBytes per second).
Each sample is 2 Bytes.
We set up the PaRAM as follows: ACNT = 2, BCNT = 1024, CCNT = 4, SRCBIDX = 8, DSTBIDX = 2, with AB_SYNC and intermediate chaining.
We found the transfer BW to be about 120 MBytes/sec. (These benchmarks were done on the C6472 EVM, which has a 32-bit DDR bus.) This is a VERY LOW BW number for our application.
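To make the setup concrete, here is a minimal software model of how I understand the AB_SYNC addressing for these values. It is only a sketch: `ParamSketch` and `model_ab_sync` are hypothetical names, the struct is not the real PaRAM register layout, and SRCCIDX = 2 / DSTCIDX = 2048 are my assumptions (they are not listed above). Chaining and the actual TC behavior are not modeled.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative subset of a PaRAM set; NOT the real register layout. */
typedef struct {
    uint16_t acnt, bcnt, ccnt;
    int16_t  srcbidx, dstbidx;
    int16_t  srccidx, dstcidx;   /* assumed values, not stated in the post */
} ParamSketch;

/* Software model of AB-synchronized addressing: each of the CCNT frames
 * moves BCNT arrays of ACNT bytes. All index values are byte offsets.
 * In AB_SYNC the CIDX is taken from the start of one frame to the start
 * of the next frame. */
static void model_ab_sync(const ParamSketch *p, const uint8_t *src, uint8_t *dst)
{
    for (unsigned c = 0; c < p->ccnt; c++) {
        const uint8_t *s = src + (ptrdiff_t)c * p->srccidx;
        uint8_t       *d = dst + (ptrdiff_t)c * p->dstcidx;
        for (unsigned b = 0; b < p->bcnt; b++) {
            for (unsigned a = 0; a < p->acnt; a++)
                d[a] = s[a];            /* one array of ACNT bytes */
            s += p->srcbidx;
            d += p->dstbidx;
        }
    }
}

/* De-interleave setup: 4 channels, 1024 samples each, 2 bytes per sample. */
static ParamSketch deinterleave_param(void)
{
    ParamSketch p = {
        .acnt = 2, .bcnt = 1024, .ccnt = 4,
        .srcbidx = 8,     /* next sample of the same channel (4 ch * 2 bytes) */
        .dstbidx = 2,     /* packed samples in the destination */
        .srccidx = 2,     /* ASSUMPTION: next channel in the interleaved input */
        .dstcidx = 2048   /* ASSUMPTION: next channel block (1024 * 2 bytes) */
    };
    return p;
}
```

With these values, frame c reads sample n of channel c from byte offset (4n + c) * 2 and writes it to byte offset (1024c + n) * 2, which is the de-interleaved layout I am after.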
To investigate further and simplify the situation, we then modified the case a bit.
We let it do a straight transfer (no de-interleaving) from L2 to DDR2. With ACNT = 2, BCNT = 1024, CCNT = 4, SRCBIDX = DSTBIDX = 2 and AB_SYNC, we get a similarly VERY LOW transfer BW. Let's call this case 1 for later reference.
Then we changed it to use ACNT = 2 * 1024, BCNT = 4 and CCNT = 1.
Now we get a transfer BW that is almost 14 times that of case 1. Let's call this case 2 for later reference.
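For comparison, both cases move the same number of bytes and differ only in how those bytes are split into arrays. The small sketch below just does that arithmetic; `CaseCounts` and the helper names are hypothetical, and whether the TC really handles every ACNT-sized array as a separate access is an assumption on my part, not something I have confirmed.

```c
#include <stdint.h>

/* Only the three counts matter for this comparison. */
typedef struct {
    uint32_t acnt, bcnt, ccnt;
} CaseCounts;

/* Total payload of the whole transfer in bytes. */
static uint32_t case_total_bytes(CaseCounts k)
{
    return k.acnt * k.bcnt * k.ccnt;
}

/* Number of contiguous ACNT-byte arrays in the whole transfer. */
static uint32_t case_total_arrays(CaseCounts k)
{
    return k.bcnt * k.ccnt;
}

static const CaseCounts CASE1 = { 2,    1024, 4 };  /* ACNT = 2        */
static const CaseCounts CASE2 = { 2048, 4,    1 };  /* ACNT = 2 * 1024 */
```

Both cases come out to 8192 bytes in total, but case 1 consists of 4096 arrays of 2 bytes each while case 2 consists of 4 arrays of 2048 bytes each.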
We then looked into document SPRAAY0A. This document provides some figures describing various EDMA3 transfer BW scenarios. The difference in transfer BW we see is somewhat consistent with Figure 20. But to me Figure 20 seems to directly contradict Figure 19 in the same doc, unless I misunderstand something. According to Figure 19, as long as ACNT x BCNT is large for AB_SYNC, the transfer BW should be large. But Figure 20 and our observation both differ from Figure 19.
I don't understand why the performance would suffer in the AB_SYNC case, since each TC has a FIFO and should be able to burst as long as ACNT x BCNT is large (as suggested by Figure 19 in SPRAAY0A).
This super low transfer BW is a BIG PROBLEM for us since our input BW is large. It may mean the whole DSP's performance could be brought down by this slow transfer: the EDMA transfer could take up close to 100% of DDR access, leaving no room for other cores to access the DDR.
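As a rough estimate of how busy this one transfer would be, the sketch below divides the input rate by the measured transfer rate. `edma_busy_fraction` is a hypothetical helper, and treating this busy time as DDR occupancy is an assumption; whether that assumption holds is exactly what I am asking about.

```c
/* Fraction of time the transfer must be active to keep up with the input.
 * Both rates are in the same unit (here MBytes/sec). */
static double edma_busy_fraction(double input_rate, double transfer_rate)
{
    return input_rate / transfer_rate;
}
```

With the ~100 MBytes/sec input and the ~120 MBytes/sec measured above, this gives roughly 0.83, i.e. the transfer would be active more than 80% of the time.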
My questions are:
1. What causes the performance to be so LOW when ACNT is small (case 1), even when using AB_SYNC?
2. For case 1, where the performance is really low, am I losing the DDR BW completely? In other words, can other cores (C64x+) and other EDMA transfers that may run in parallel still utilize the DDR BW, so that the cumulative performance is good?
I need to understand this ASAP since I may have to ask the HW team to take on the de-interleaving task instead of doing it on the DSP. I would really appreciate a quick response.
NOTE - I could reproduce the same performance issue on the C6455 too.