
memory access performance

Other Parts Discussed in Thread: TMS320DM6437

Hi,

     I am reading SPRU862B, the TMS320C64x+ DSP Cache User's Guide.

The L1D miss penalty is 12.5 cycles. I am a little confused about why the penalty is so large, given that L2 operates at the same frequency as L1D.

For the case where L2 is used as SRAM instead of cache, or where an L2 read misses, what is the L1D miss penalty? I know DDR access time depends on many factors, but I hope to learn an estimated access time.

     I am using the DM6437, and according to the document the DDR frequency is DSP/3.

Best Regards

Jogging

  • Jogging Song said:
    I am a little confused about why the penalty is so large, given that L2 operates at the same frequency as L1D.

    What makes you say that L2 operates at the same frequency as L1? In general L2 will be slower than L1, typically half the speed; otherwise there would be little reason to have it as a separate layer. In addition to the pure speed difference between L1 and L2, there is also some cache-controller overhead contributing to this number.

    Jogging Song said:
    For the case where L2 is used as SRAM instead of cache, or where an L2 read misses, what is the L1D miss penalty? I know DDR access time depends on many factors, but I hope to learn an estimated access time.

    I don't believe using L2 as SRAM will affect the L1D miss penalty, but when you get to an L2 miss, the penalty increases dramatically. As you mention, DDR access time depends on too many factors to model truly accurately, but I can say that it is typically a huge delay relative to the CPU cycle time, to the point that it is not necessarily practical to measure in CPU cycles, since external memory runs independently of the CPU clock (the slower the CPU, the less the impact; the faster the CPU, the more dramatic the benefit of internal memory becomes). I have seen code run on the order of 100x faster from internal memory versus entirely external memory. That may not be the most typical case, but it should give some idea of the delays you could be looking at.

  • Hi Bernie,

    Thanks for the timely reply.

    On page 157 of SPRS345D, the TMS320DM6437 datasheet, it shows that there are three clock rates on the DM6437. From the figure, I think L2 operates at the DSP clock rate and DDR2 operates at the DSP/3 clock rate.

    But the DM6437 GEL file sets the DDR frequency to about DSP/4; I don't know why.

    12.5 cycles may be the time for filling the cache line. On the DM6437 the cache line is 64 bytes. If I access 8 bytes at a time from L2 and the accesses are sequential, it seems that the first access is a miss and the subsequent accesses are hits, so the average miss penalty is 12.5/8 cycles per access. Can I say that?


    On the DM6437 EVM I did a test using DMA. There are two images in DDR2. In one case I turn the cache on, and each pixel in one image is incremented by one and written into the other image. In the other case I turn the cache off: I transfer four rows of pixels into L1 SRAM using one DMA channel, process the pixels into another L1 buffer, then transfer the resulting four rows back to DDR2 using a second DMA channel. The performance improvement is about eight times.

    From my colleague I learned that DDR2 access time can be as low as half a clock cycle at best. Because DDR2 operates at DSP/3, that means I can get one datum at a rate of DSP*2/3. Accessing DDR2 via the cache makes DDR2 operate in burst mode. From my analysis the performance improvement is quite large.


    For the 100x case you mention, I am not clear whether you put the executable code in internal memory or the data you are processing in internal memory. I think it is the latter.


    Best Regards

    Jogging
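The cache-off experiment described above is essentially ping-pong (double-buffered) block processing. A portable sketch of the control flow, with `memcpy` standing in for the two EDMA channels and four rows per block as in the test, might look like this; the buffer names and image width are illustrative assumptions, and on the real device the copies would overlap with CPU processing rather than run inline:

```c
#include <assert.h>
#include <string.h>

#define WIDTH        64  /* illustrative image width in pixels */
#define ROWS_PER_BLK  4  /* rows moved per DMA transfer, as in the test */

/* Increment every pixel of src into dst, one ROWS_PER_BLK-row block at a
 * time, using two small "L1" working buffers. memcpy plays the role of the
 * EDMA in and out channels; on the DM6437 the transfers would run in the
 * background while the CPU processes the previous block. */
void add_one_blocked(const unsigned char *src, unsigned char *dst, int height)
{
    unsigned char in_buf[WIDTH * ROWS_PER_BLK];   /* "L1" input buffer  */
    unsigned char out_buf[WIDTH * ROWS_PER_BLK];  /* "L1" output buffer */
    size_t blk = (size_t)WIDTH * ROWS_PER_BLK;

    for (int row = 0; row < height; row += ROWS_PER_BLK) {
        memcpy(in_buf, src + (size_t)row * WIDTH, blk);   /* "DMA in"  */
        for (size_t i = 0; i < blk; i++)                  /* process   */
            out_buf[i] = (unsigned char)(in_buf[i] + 1);
        memcpy(dst + (size_t)row * WIDTH, out_buf, blk);  /* "DMA out" */
    }
}
```

Because each block transfer is a long contiguous run in DDR2, it gets the burst behavior discussed later in the thread, which is where most of the ~8x gain comes from.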

  • I think a little insight into the nature of DDR2 might help explain the difficulty in generalizing about access times. To simplify things, think of the DDR2 memory as a 2-D array of rows and columns. In order to read from a random memory location, the DDR2 controller must first issue an activate command to open the row corresponding to the requested address (preceded by a precharge if a different row is currently open). This is a relatively slow operation. It then issues a read command, which can take 4-6 DDR2 clocks to complete. The great thing about DDR2 is that if you are accessing memory sequentially within that same row, all the read requests can be pipelined and you get a read on every half clock cycle, as you mentioned. Combined with the 4-byte bus, you can get some (theoretically) serious throughput. As a side note, the DDR2 controller also has to periodically refresh the memory, which reduces throughput as well.

    The problem is that for memory access patterns that are not sequential, nothing pipelines. You have to open a row, wait for the read latency, then close the row and repeat the process, all to get one piece of data. So if you had an image in DDR2 and processed it sequentially (say, inner loop across a row) you would get vastly better throughput than if you accessed the same image by looping down columns (non-sequential accesses).

  • Thanks, MattLipsey.

    You mention that sequential access of an image in DDR2 gives better throughput. I wonder whether I still need to use DMA to transfer the data in this situation.

    Best Regards

    Jogging


  • Jogging Song said:
    I wonder whether I need to use DMA to transfer data under this situation.

    If you are moving large contiguous blocks of data from DDR, then using the DMA is just about the most efficient way you can do it, since that gives you maximum bursting and the most efficient use of DDR, as MattLipsey explained. Using the CPU would be more efficient if the amount of data is small, or non-contiguous to the point that it would take less time to just fetch the data with the CPU than to configure the DMA.
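The trade-off described here is a classic break-even calculation: the DMA pays a fixed setup cost but a lower per-byte cost. The constants below are placeholder assumptions for illustration, not measured DM6437 figures; on real hardware you would measure both costs:

```c
#include <assert.h>

/* Rough DMA-vs-CPU break-even model. All constants are illustrative
 * assumptions, not DM6437 measurements. */
#define DMA_SETUP_CYCLES 200.0  /* fixed cost to configure/trigger the DMA */
#define DMA_CYCLES_PER_B   0.25 /* amortized per-byte cost with DDR bursting */
#define CPU_CYCLES_PER_B   2.0  /* per-byte cost of CPU loads from DDR */

double dma_cost(long bytes) { return DMA_SETUP_CYCLES + DMA_CYCLES_PER_B * bytes; }
double cpu_cost(long bytes) { return CPU_CYCLES_PER_B * bytes; }

/* Smallest transfer size for which the DMA beats the CPU in this model. */
long break_even_bytes(void)
{
    long n = 1;
    while (dma_cost(n) >= cpu_cost(n))
        n++;
    return n;
}
```

Below the break-even size the setup overhead dominates and the CPU wins; above it the DMA's cheaper per-byte bursting takes over, which matches the rule of thumb in the reply above.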

  • Thanks, Bernie.
           I have a few questions; can you clarify them for me?
           1. What frequency does L2 operate at?
    In your reply you tell me that L2 is half the speed of L1, but in the document SPRS345D it seems L2 is in the DSP clock-rate domain.

    2. Is 12.5 cycles the time for filling a cache line? Does the DSP obtain the data only after the cache line is filled?
    On the DM6437 the cache line is 64 bytes. If I access 8 bytes at a time from L2 and the accesses are sequential, it seems that the first access is a miss and the subsequent accesses are hits, so the average miss penalty is 12.5/8 cycles per access.
    3. From a previous post I learned that a DM6437 simulator exists if you buy the full-featured CCS.
    Can the DM6437 simulator simulate DDR2?
    4. Is there any document about memory access performance evaluation?
    The DMA document gives example uses but doesn't discuss the performance improvement.

    Best regards
    Jogging


  • Jogging Song said:
    1. What frequency does L2 operate at?
    In your reply you tell me that L2 is half the speed of L1, but in the document SPRS345D it seems L2 is in the DSP clock-rate domain.

    In the C64x+ architecture L2 operates at CPU/2; the figure in the datasheet is not granular enough to show this. I believe the divide-down happens within the megamodule itself, so it may not be reflected in all block diagrams. This is not as clearly mentioned in the documentation as it should be; see, for one place it is noted, the second sentence of the last paragraph on page 48 of SPRU862B.

    Jogging Song said:
    2. Is 12.5 cycles the time for filling a cache line? Does the DSP obtain the data only after the cache line is filled?
    On the DM6437 the cache line is 64 bytes. If I access 8 bytes at a time from L2 and the accesses are sequential, it seems that the first access is a miss and the subsequent accesses are hits, so the average miss penalty is 12.5/8 cycles per access.

    The figures are the time to fill a cache line when a stall happens, so after 12.5 cycles you would have 64 bytes in the L1D cache. That said, I agree with your reasoning: for sequential 8-byte accesses, only the first access in each line misses, so you would see an average delay of 12.5/8 cycles per access, since the subsequent accesses within the line are no longer misses.
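The amortization argument can be written out explicitly. The 64-byte line and 12.5-cycle fill are the figures quoted above from SPRU862B; the helper itself is just the arithmetic:

```c
#include <assert.h>

/* Average stall per access when reading sequentially through a cache:
 * only the first access in each line misses, so the line-fill penalty
 * is amortized over (line_bytes / access_bytes) accesses. */
double avg_miss_penalty(double line_fill_cycles, int line_bytes,
                        int access_bytes)
{
    int accesses_per_line = line_bytes / access_bytes;
    return line_fill_cycles / accesses_per_line;
}
```

For the DM6437's L1D (64-byte lines, 12.5-cycle fill from L2) and 8-byte loads, this gives 12.5/8 = 1.5625 cycles of average stall per access, as discussed above.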

    Jogging Song said:
    3. From a previous post I learned that a DM6437 simulator exists if you buy the full-featured CCS.
    Can the DM6437 simulator simulate DDR2?

    The simulators only really model the internal memories, so they can tell you about L1 and L2 misses, but they do not model the external memory; for accurate results on DDR-induced latency you would have to use hardware.

    Jogging Song said:
    4. Is there any document about memory access performance evaluation?
    The DMA document gives example uses but doesn't discuss the performance improvement.

    I do not know of anything specific to the DM6437 that goes into detail on DDR performance relative to DMA. You may be able to get some ideas from SPRAAG8, which is based on another C64x+ device that uses DDR2 and EDMA3, by scaling the results for clock frequency; the clock rates and EDMA capabilities differ somewhat, though, so to be sure on the DM6437 you would have to do some measurements on hardware.