C6474 Poor DDR Copy Performance

Estian Malan

Hi.

RandyP: If you are out there, I desperately need your help with this one. Anyone else is also welcome to assist...

I am trying to copy a frame of (32 x 16384) unsigned 16-bit values from one location in external DDR to another using the C64x+ core of the C6474 DSP. However, I am getting poor copy performance, which I attribute to the large number of cache misses during access to L3 RAM. I have compared the results with that of the EDMA3, and currently the EDMA3 is performing 20 TIMES faster!! However, I am running into source and destination BIDX overflow issues with the EDMA3 when the size of the matrix is increased any further.

Default L1D cache size for the C6474 is 32kB. I have tried to disable the cache before the copy by setting the L1D cache size to 0 using the CSL CACHE_setL1dSize(CACHE_L1_0KCACHE), and restoring it afterwards back to 32k. I was hoping this would force the core to read-write directly to the DDR, bypassing the cache and alleviating the penalties of cache misses, and consequently performing faster.

However, I am always getting the same copying speed performance, regardless of the cache size setting, and pretty poor compared to the EDMA3.

I have 2x questions that I would like to get answers for:

1. Am I following the right approach in trying to avoid cache misses, or is there some other way to achieve this?
2. Will the EDMA3 ALWAYS have better performance to the core in this regard, and why?

Thank you for the help!

Estian.

over 14 years ago

0 RandyP over 14 years ago

TI__Guru* 84110 points

Estian,

It is very surprising that you get the same copying speed performance with cache on or off. It should be much slower with cache off.

Estian Malan said:
1. Am I following the right approach in trying to avoid cache misses, or is there some other way to achieve this?

Cache misses are not much slower than non-cache accesses. You can set the MAR bit to 0 for your DDR memory range, and that will disable caching for that range. But using cache should always be best for sequential accesses. If you were doing random accesses with only one word read from each cache line, then cache misses could be a concern and disabling cache might help you. But for this case, disabling cache is not the way to go; the EDMA3 is the way to do this.
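To make the MAR suggestion concrete, here is a minimal sketch of how one might work out which MAR register governs a given DDR address. It assumes the usual C64x+ megamodule layout (MAR0 at 0x01848000, one 32-bit MAR per 16 MB region) and a DDR base of 0x80000000; both should be checked against the C6474 data manual. The helper names are made up for illustration and are not CSL functions.

```c
#include <stdint.h>

/* Assumption: C64x+ megamodule layout, where MAR0 sits at 0x01848000 and
   each 32-bit MARn register covers one 16 MB region of the address map. */
#define MAR_BASE_ADDR   0x01848000u
#define MAR_REGION_BITS 24u            /* 2^24 bytes = 16 MB per MAR */

/* Index n of the MARn register that controls cacheability of addr
   (hypothetical helper, not part of CSL). */
static uint32_t mar_index(uint32_t addr)
{
    return addr >> MAR_REGION_BITS;
}

/* Memory-mapped address of that MARn register (hypothetical helper). */
static uint32_t mar_reg_addr(uint32_t addr)
{
    return MAR_BASE_ADDR + 4u * mar_index(addr);
}
```

Under these assumptions, a buffer at 0x80000000 falls under MAR128, so that is the register to inspect when cache on/off makes no difference to the copy time.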

Estian Malan said:
Will the EDMA3 ALWAYS have better performance to the core in this regard, and why?

10 years ago, our rule-of-thumb was that the CPU could copy up to 5 sequential words of data from external memory to internal memory faster than setting up and executing a QDMA to do the same thing. There have been some changes in the QDMA mechanism since then (EDMA2, now EDMA3), but there have also been changes to the pipelines inside the DSP device and also in the DDR architecture (compared to SDRAM). But the number of words where the CPU is faster is still a very low number. It would be interesting to figure out what the right number is for the C6474, but it is definitely less than the data you want to copy.

The C64x+ CPU uses a register-centric RISC architecture. This means all operations are on registers. To copy data from DDR to DDR, the CPU will read the source location into a register in the CPU and then write that register to the destination. This can be slow or medium speed depending on how the code is written and optimized.

The DDR architecture is optimized for bursting operations. A single CPU instruction will read one word, and that has DDR-bus overhead. Any intermingled writes can add to that overhead. The EDMA3 architecture is designed to support bursting, so it is naturally better at DDR bursts.

So, yes, EDMA3 will always be faster for medium to large transfer sizes.

Estian Malan said:
I am running into source and destination BIDX overflow issues with the EDMA3 when the size of the matrix is increased any further.

If you are copying a bunch of contiguous blocks to a bunch of contiguous blocks, it should be possible. But there is not enough detail to know what you are trying to do and how you are trying to do it.

Regards,
RandyP

0 Estian Malan over 14 years ago in reply to RandyP

Intellectual 635 points

Hi RandyP.

Thank you for the prompt reply.

Although I expected the result to be opposite of what you explained, I am also baffled by the fact that there is no difference in transfer speed with cache on or off.

Let's look at the EDMA3. I have attached a picture that illustrates what I am trying to accomplish:

I have a matrix of 16384 x 32 unsigned 16-bit values stored in DDR (green frame). I want to copy subframes (blue and red sources) of this frame to destinations outside of the boundaries of this frame (blue and red destinations).

The basic PaRAM setup values for the EDMA3 for a single subframe transfer (e.g. the blue subframe) are as follows:

Transfer Type: AB-Sync
SRC/DST Address: Top Left Corner Address of Each Subframe
ACNT: 2 x a bytes
BCNT: d x rows
CCNT: 1
SRC/DST BIDX: (16384+a+b) x 16 bit values = 2 x (16384+a+b) bytes > 32768 bytes
SRC/DST CIDX: 1
LINK: 0xFFFF
Other OPTIONS: TCCEN, NORMAL, STATIC

Since the SRC/DST BIDX values are signed 16-bit (i.e. -32768 to 32767), you can see that the above values for SRC/DST BIDX exceed this limit and will overflow, so the transfer(s) will not be performed as intended.
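The overflow can be checked with a little arithmetic. The sketch below is only an illustration (the function names are hypothetical): it computes the row stride in bytes for 16-bit elements and tests whether it fits a signed 16-bit BIDX field.

```c
#include <stdint.h>
#include <stdbool.h>

/* Byte stride between the starts of consecutive rows when each element
   is an unsigned 16-bit value (2 bytes). */
static int64_t row_stride_bytes(int64_t elements_per_row)
{
    return 2 * elements_per_row;
}

/* True if the stride can be stored in a signed 16-bit BIDX field. */
static bool fits_bidx(int64_t stride_bytes)
{
    return stride_bytes >= INT16_MIN && stride_bytes <= INT16_MAX;
}
```

Even with a = b = 0, a row of 16384 elements gives a stride of 32768 bytes, one more than the largest value BIDX can hold, so fits_bidx() returns false for every subframe in this layout.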

What to do? Please help.

Regards.

Estian.

0 RandyP over 14 years ago in reply to Estian Malan

TI__Guru* 84110 points

Estian,

Estian Malan said:
Although I expected the result to be opposite of what you explained, I am also baffled by the fact that there is no difference in transfer speed with cache on or off.

Do you have the MAR bit(s) = 1 for your ranges of DDR? If the MAR bit = 0, then you will not use caching even if caching is enabled.

Your description and picture above are very helpful. And this description is very different from the simpler statements in your first post. If you wanted to copy the whole frame, you could do that. But copying a subframe when the row span > 32767 will not work in a straightforward manner.

What are the valid ranges for a, b, c, and d?

When you say "d x rows" and "32 x rows", you really mean "d rows" and "32 rows", right? Or is row a more complex entity?

QDMA Method 1:

1. Set up a list of PARAMs, one for each subrow.
2. Each PARAM LINK field links to the next in the order you want.
3. Write the first row setup to the real QDMA PARAM, which links to the second and which starts the transfers.
4. For best results, build a table in L2 and use IDMA0 to copy to the QDMA PARAMs, taking care they are in the right order.
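As a rough sketch of the table described in QDMA Method 1, the code below fills an array of PaRAM-shaped entries in ordinary memory, one per row, each linking to the next. It assumes the standard eight-word EDMA3 PaRAM layout and that PaRAM set n lives at byte offset 0x4000 + 32*n inside the EDMA3CC. The struct name, function name, and OPT value are placeholders; copying the table into the real PaRAM (for example with IDMA0) is not shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Eight 32-bit words, in EDMA3 PaRAM order. */
typedef struct {
    uint32_t opt;
    uint32_t src;
    uint32_t a_b_cnt;       /* BCNT in bits 31:16, ACNT in bits 15:0 */
    uint32_t dst;
    uint32_t src_dst_bidx;
    uint32_t link_bcntrld;  /* LINK in bits 15:0 */
    uint32_t src_dst_cidx;
    uint32_t ccnt;
} ParamSet;

#define PARAM_OFFSET 0x4000u  /* assumed PaRAM offset within the EDMA3CC */
#define PARAM_SIZE   32u
#define LINK_NULL    0xFFFFu

/* Entry r is meant to be copied into PaRAM set (first_set + r).
   Each entry moves one row of row_bytes bytes and links to the next;
   the last entry links to the null set. opt is caller-supplied. */
static void build_row_chain(ParamSet *tbl, size_t rows, uint32_t first_set,
                            uint32_t src, uint32_t dst,
                            uint32_t src_pitch, uint32_t dst_pitch,
                            uint16_t row_bytes, uint32_t opt)
{
    for (size_t r = 0; r < rows; ++r) {
        uint32_t next = first_set + (uint32_t)r + 1u;
        uint32_t link = (r + 1 < rows) ? PARAM_OFFSET + PARAM_SIZE * next
                                       : LINK_NULL;
        tbl[r].opt          = opt;
        tbl[r].src          = src + (uint32_t)r * src_pitch;
        tbl[r].a_b_cnt      = (1u << 16) | row_bytes;  /* BCNT = 1 */
        tbl[r].dst          = dst + (uint32_t)r * dst_pitch;
        tbl[r].src_dst_bidx = 0;   /* unused: one array per entry */
        tbl[r].link_bcntrld = link & 0xFFFFu;
        tbl[r].src_dst_cidx = 0;
        tbl[r].ccnt         = 1;
    }
}
```

Because the row pitch is applied by the CPU to 32-bit addresses when the table is built, the 16-bit BIDX limit never comes into play; the cost is one PaRAM set per row.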

QDMA Method 2:

1. Set QDMA trigger on DST and use EARLY TCC mode.
2. Write SRC, CNT, DST to trigger the first Transfer Request (TR).
3. Poll IPR for TR submitted.
4. Clear IPR and write SRC and DST for the next TR.
5. Repeat steps 3 and 4 until all rows have been transferred.
6. For the final TR, you may want to use a different TCC to generate a real interrupt, and you may want to use NORMAL TCC mode for that last one.
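The control flow of QDMA Method 2 can be sketched on a host machine with a stand-in for the hardware. Everything named Fake* or fake_* below is hypothetical: the struct models a QDMA channel whose trigger word is DST, the memcpy stands in for the transfer, and the ipr flag stands in for the IPR bit that early completion would set. It illustrates only the loop, not real register access.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical software model of one QDMA channel. */
typedef struct {
    const uint8_t *src;
    size_t         cnt;        /* bytes per row (ACNT) */
    int            ipr;        /* models the IPR bit for this channel */
    unsigned       transfers;  /* number of TRs submitted */
} FakeQdma;

static void fake_write_src(FakeQdma *q, const uint8_t *src) { q->src = src; }
static void fake_write_cnt(FakeQdma *q, size_t bytes)       { q->cnt = bytes; }

/* Writing DST is the trigger: the model performs the row copy and
   raises the IPR flag, as early completion would on submission. */
static void fake_write_dst(FakeQdma *q, uint8_t *dst)
{
    memcpy(dst, q->src, q->cnt);
    q->ipr = 1;
    q->transfers++;
}

/* One TR per row: write SRC, trigger on DST, poll, clear, repeat. */
static void copy_subframe(FakeQdma *q, const uint8_t *src, uint8_t *dst,
                          size_t rows, size_t row_bytes,
                          size_t src_pitch, size_t dst_pitch)
{
    fake_write_cnt(q, row_bytes);               /* CNT is the same for every row */
    for (size_t r = 0; r < rows; ++r) {
        fake_write_src(q, src + r * src_pitch);
        fake_write_dst(q, dst + r * dst_pitch); /* trigger word */
        while (!q->ipr) { /* poll for TR submitted */ }
        q->ipr = 0;                             /* clear before next TR */
    }
}
```

On real hardware the poll and clear would go through IPR and ICR, and the last row would likely use a different TCC as suggested in the final step.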

DMA Method 1 would be the same as QDMA Method 1 except that chaining to the same channel would be combined with the linking process. This still requires one PARAM for each row.

DMA Method 2:

Create a table in L2 (or DDR?) memory of SRC, CNT, DST values, one for each row.
Use a chaining sequence between two or three DMA channels.
DMA1 copies from the L2 table into DMA3's PARAM.
DMA2 writes to ESR to trigger DMA3.
DMA3 transfers one row then chains to DMA1 using ITCCHEN only.
DMA3 signals completion with an interrupt or IPR bit using TCINTEN.
DMA3 starts with CCNT = c (or d)
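The table in DMA Method 2 can be quite compact, because SRC, A_B_CNT and DST occupy three consecutive words in an EDMA3 PaRAM set (byte offsets 4, 8 and 12). A hedged sketch of building such a table follows; the names RowEntry and build_row_table are invented for illustration, and the DMA1/DMA2/DMA3 channel setup itself is not shown.

```c
#include <stdint.h>
#include <stddef.h>

/* Three words per row, in the same order as PaRAM words 1..3, so that a
   12-byte copy can drop one entry straight into the transfer PaRAM. */
typedef struct {
    uint32_t src;
    uint32_t a_b_cnt;  /* BCNT in bits 31:16, ACNT in bits 15:0 */
    uint32_t dst;
} RowEntry;

/* Fill one entry per row of the subframe. Pitches are in bytes and are
   applied to full 32-bit addresses, so they may exceed 32767. */
static void build_row_table(RowEntry *tbl, size_t rows,
                            uint32_t src, uint32_t dst,
                            uint32_t src_pitch, uint32_t dst_pitch,
                            uint16_t row_bytes)
{
    for (size_t r = 0; r < rows; ++r) {
        tbl[r].src     = src + (uint32_t)r * src_pitch;
        tbl[r].a_b_cnt = (1u << 16) | row_bytes;  /* one array per row */
        tbl[r].dst     = dst + (uint32_t)r * dst_pitch;
    }
}
```

With this layout, the channel that reloads the PaRAM only needs ACNT = 12 and a source index of 12 bytes per table entry, which stays well inside the 16-bit index range.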

All of these will require some work on your part to understand the advanced logic and mechanisms. You already have a very good understanding of the EDMA3 architecture, so it will just be a matter of working with it and exercising the debugger.

My response times will be slow this week, so I have attached Edma_long_chain_6455.zip, a project that I built for the C6455 for CCS 3.3. It will require rework to get it to work with the C6474, but I will leave that as a learning exercise for you if you choose one of those paths. Just reading the comments and code may help you understand some of the techniques, and you may decide to do something completely different.

Regards,
RandyP

Edma_long_chain_6455.zip

0 Estian Malan over 14 years ago in reply to RandyP

Intellectual 635 points

Hi RandyP.

Thank you for the wonderfully comprehensive feedback! Glad you like my illustration!

RandyP said:
When you say "d x rows" and "32 x rows", you really mean "d rows" and "32 rows", right? Or is row a more complex entity?

Nothing weird here. It simply means 32 rows and d rows..

I like all your proposed methods; however, I am not too keen on using a param set for each row of the subframe. Although I only illustrated 2x subframes, I am actually interested in a transfer of 8x subframes (all around the inner frame), and although there are 255 Param sets, I don't think they will be enough.

I will certainly be trying QDMA Method 2. I have actually used a similar method to perform a lightning fast matrix transpose operation using the EDMA3 on a large set of data in the DDR before, and it worked like a charm (MUCH faster than the core).

Thanx again!

Estian.

0 RandyP over 14 years ago in reply to Estian Malan

TI__Guru* 84110 points

Estian,

Once you have this working, please post your method here. The Community will benefit from your experience.

Regards,
RandyP
