I want to use EDMA, but SRCCIDX in our application is 65536, which is outside its range [-32768, 32767].



Hello everyone

In our application, we need to do a data corner turn. We have a 24*65536 Bytes data matrix and want to rearrange it as 65536*24. This is the same idea as the Data Sorting Example in Section 3.3 of the EDMA User Manual. So, ACNT = 4, BCNT = 65536, CCNT = 24. As shown in the example, SRCCIDX is ACNT * BCNT, which is larger than the SRCCIDX range [-32768, 32767].

So, how can I deal with this problem?

Thanks.

Xining

  • Xining Yu said:
    In our application, we need to do a data corner turn. We have a 24*65536 Bytes data matrix and want to rearrange it as 65536*24. This is the same idea as the Data Sorting Example in Section 3.3 of the EDMA User Manual. So, ACNT = 4, BCNT = 65536, CCNT = 24.

    Do you mean 24*65536 elements?  Your numbers don't match up otherwise.

    Xining Yu said:
    As shown in the example, SRCCIDX is ACNT * BCNT, which is larger than the SRCCIDX range [-32768, 32767].

    This kind of "rotation" operation causes a lot of problems for DDR devices, i.e. it is just about the most non-optimal way to use the bandwidth.  All those huge strides through memory cause you to open a page of memory on every write which is a very expensive/slow operation.  (This is purely a DRAM limitation, not anything to do with the processor.)  The way to get around this problem is to buffer a few rows of the transfer internally in memory.

    If you're going to work around this issue, you would need to do something drastic.  For example, one thought that comes to mind (that also helps work around the memory bandwidth problem) would be to utilize TWENTY FIVE CHANNELS to do the transfer.  (Not sure if you have that many available...)  Basically the thought would be to have 24 channels work on single lines (1x65536).  How much internal memory you have available will determine how many pieces you break each line into.  For example, let's say for starters that you break it down as ACNT=4, BCNT=1024, CCNT=64.  You would chain all 24 channels such that you end up transferring 1k elements from each of the 24 rows into internal SRAM.  You then chain to a 25th channel which does the actual rotation.  The beauty here is that you're now operating on a 1024*24 matrix in internal SRAM.  You would want to be sure to do all the strides in SRAM such that you are bursting to the DRAM interface.  This will give you better bandwidth on the DRAM interface and greatly reduce the penalty of this kind of operation.

    FYI, this will be extremely complicated to set up...  It might possibly be the most complicated EDMA setup I've ever devised! Before you invest this much energy, you might simply want to try doing it with the CPU...  If you're lucky perhaps you'll still meet your real-time deadlines.
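
    To make the data movement concrete, here is a minimal CPU-side sketch of the same two-stage idea: stage a block of columns from every row into a small buffer with linear reads, then stride through that buffer while writing the output linearly. It is only a model of the access pattern, not EDMA code. The names corner_turn_tiled, staging, and MAX_TILE_ELEMS are made up for illustration, and the ordinary static array stands in for internal SRAM.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TILE_ELEMS 4096 /* hypothetical capacity of the staging buffer, in elements */

/* Rearranges a rows x cols matrix (row-major) into cols x rows, one block of
 * tile_cols columns at a time. Not reentrant because of the static buffer. */
void corner_turn_tiled(const uint32_t *src, uint32_t *dst,
                       size_t rows, size_t cols, size_t tile_cols)
{
    static uint32_t staging[MAX_TILE_ELEMS]; /* stand-in for internal SRAM */
    assert(tile_cols > 0 && rows * tile_cols <= MAX_TILE_ELEMS);

    for (size_t c0 = 0; c0 < cols; c0 += tile_cols) {
        /* Width of this block; the last block may be narrower. */
        size_t w = (cols - c0 < tile_cols) ? cols - c0 : tile_cols;

        /* Stage 1 (one transfer per row): linear read of w elements from each row. */
        for (size_t r = 0; r < rows; r++)
            memcpy(&staging[r * w], &src[r * cols + c0], w * sizeof(uint32_t));

        /* Stage 2 (the rotation): stride through the staging buffer and
         * write the destination as one contiguous run of w * rows elements. */
        uint32_t *out = &dst[c0 * rows];
        for (size_t c = 0; c < w; c++)
            for (size_t r = 0; r < rows; r++)
                *out++ = staging[r * w + c];
    }
}
```

    With rows = 24 and tile_cols = 1024 this mirrors the 1024*24 block described above: all large strides stay inside the staging buffer, and both the source reads and destination writes are contiguous.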

  • Thanks for your reply.

    Based on your suggestion, I plan to fetch the data from DDR to MSMC first, perform the corner turn in MSMC, and then send the data back to DDR. I implemented our application this way.

    1. Fetch the data from DDR to MSMC. The matrix size is 8192 by 24, and each element is 8 bytes. In other words, the matrix is 65536 Bytes by 24. Ideally, I want to set ACount = 8, BCount = 24, and CCount = 8192; SRCBIDX = ACount, DSTBIDX = ACount*CCount; SRCCIDX = ACount*BCount, DSTCIDX = ACount.

    2. As the range of DSTBIDX is [-32768, 32767], I cannot implement the data corner turn directly. So I divided it into 4 parts. Thus, each corner turn has ACount = 8, BCount = 24, and CCount = 2048.

    3. As DSTBIDX is limited to 32767 and the required DSTBIDX is ACount*CCount = 65536 (with CCount = 8192), the data from each of the four small corner turns does not land at the intended memory addresses.

    4. To solve the problem in step 3, I run 24 individual EDMA transfers 4 times to place the data at the correct locations in MSMC. (Right now I am using for loops.)

    5. Send the data back to DDR.
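
    The arithmetic behind steps 1 and 2 can be sketched as below. This is only a check of the numbers, assuming the index fields are signed 16-bit values; fits_index and min_parts are hypothetical helper names, not part of any TI driver or library.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if a byte stride fits a signed 16-bit index field, else 0. */
int fits_index(long stride_bytes)
{
    return stride_bytes >= INT16_MIN && stride_bytes <= INT16_MAX;
}

/* Hypothetical helper: smallest number of equal parts that ccnt can be cut
 * into so that DSTBIDX = acnt * (ccnt / parts) fits the index field.
 * Returns 0 if no such split exists. */
unsigned min_parts(unsigned acnt, unsigned ccnt)
{
    assert(acnt > 0 && ccnt > 0);
    for (unsigned parts = 1; parts <= ccnt; parts++) {
        if (ccnt % parts != 0)
            continue; /* only consider equal-sized parts */
        if (fits_index((long)acnt * (ccnt / parts)))
            return parts;
    }
    return 0;
}
```

    For ACount = 8 and CCount = 8192, one part needs a stride of 65536 and two parts need 32768, both out of range; four parts need 16384, which fits. That matches the split into 4 with CCount = 2048.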

    This corner turn costs 4,989,737 cycles. The same corner turn costs 44e6 cycles when done with C statements like *dst++ = *src++;

    Do you have any suggestions for improving this?

    Thanks

    Xining

  • Xining Yu said:
    4. To solve the problem in step 3, I run 24 individual EDMA transfers 4 times to place the data at the correct locations in MSMC. (Right now I am using for loops.)

    If you can get everything set up in advance you should be able to use chaining to get everything to happen in the proper sequence without the need for "for" loops.

    Xining Yu said:
    This corner turn costs 4,989,737 cycles. The same corner turn costs 44e6 cycles when done with C statements like *dst++ = *src++;

    I don't think 44e6 cycles is remotely possible using C commands.  Have you benchmarked it?

    In any case, if you can get rid of the "for" loop then you will have achieved the primary goal of freeing up CPU cycles.  And if you're moving the data in the right order (i.e. striding through SRAM and bursting to DRAM) then you should also maximize your DDR efficiency.
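
    For the benchmarking question, a plain C baseline could look roughly like the sketch below. The function name corner_turn_naive is made up here, and the 8-byte element type follows the matrix described earlier in the thread. Timing code is left out because it depends on the platform's cycle counter.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Straightforward corner turn: a rows x cols row-major matrix becomes
 * cols x rows. Source reads are linear; destination writes are strided. */
void corner_turn_naive(const uint64_t *src, uint64_t *dst,
                       size_t rows, size_t cols)
{
    assert(src != dst); /* in-place operation is not supported */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            dst[c * rows + r] = src[r * cols + c];
}
```

    Reading the cycle counter immediately before and after a single call with rows = 8192 and cols = 24 would give a number that can be compared directly with the EDMA figure.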