DM335: slow data reading

We have a custom board with a DM335 at 216MHz and 16-bit DDR2 at 171MHz. The theoretical bandwidth of the DDR2 is 342*2 = 684MB/s.
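The peak figure follows from the memory clock, the two transfers per clock of DDR, and the bus width in bytes. A minimal C sketch of that arithmetic (the function name `ddr_peak_mb_per_s` is made up for illustration, and 1MB is taken as 10^6 bytes):

```c
#include <assert.h>

/* Peak DDR transfer rate in MB/s (1 MB = 10^6 bytes).
 * clock_mhz: memory clock in MHz; bus_bits: data bus width in bits.
 * DDR moves data on both clock edges, hence the factor of 2. */
unsigned ddr_peak_mb_per_s(unsigned clock_mhz, unsigned bus_bits)
{
    assert(bus_bits % 8 == 0);                     /* whole bytes per transfer */
    unsigned megatransfers_per_s = clock_mhz * 2u; /* 171 MHz -> 342 MT/s */
    return megatransfers_per_s * (bus_bits / 8u);  /* 342 * 2 bytes */
}
```

Calling `ddr_peak_mb_per_s(171, 16)` reproduces the 342*2 calculation for this board.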

 

We ran some tests using a JTAG debugger to load the program into DDR2, so there is no OS or loader. The code and data caches are switched on.

 

The EDMA DDR2-to-DDR2 data move test gives about 250MB/s = 73% of max (2*250MB/s, counting read plus write traffic).

The CPU data write to DDR2 test gives about 450MB/s = 66% of max.

The CPU data read from DDR2 test gives about 120MB/s = 18% of max.
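The percentages can be checked against the DDR2 peak of 171MHz * 2 edges * 2 bytes. The sketch below assumes that the EDMA figure counts bus traffic twice (each copied byte is read once and written once), since that is the reading under which 250MB/s comes out at 73%. The helper name `percent_of_peak` is hypothetical.

```c
#include <assert.h>

/* Share of the peak bandwidth in use, rounded to the nearest percent.
 * traffic_mb_s: bytes actually crossing the DDR2 bus, in MB/s.
 * peak_mb_s:    theoretical peak of the bus, in MB/s. */
unsigned percent_of_peak(unsigned traffic_mb_s, unsigned peak_mb_s)
{
    assert(peak_mb_s != 0);
    return (100u * traffic_mb_s + peak_mb_s / 2u) / peak_mb_s;
}
```

With a peak of 171 * 2 * 2 MB/s, `percent_of_peak(2 * 250, peak)` gives 73, `percent_of_peak(450, peak)` gives 66, and `percent_of_peak(120, peak)` gives 18.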

 

The CPU data write test looks like this:

STMIA   r1!, {r5-r12}

The CPU data read test looks like this:

LDMIA   r1!, {r5-r12}

 

As far as I know, this is the fastest way to move data to/from the CPU.
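For anyone who wants to repeat the read test from C rather than assembly, here is a portable sketch of a loop that consumes eight 32-bit words per iteration, mirroring the eight registers in `LDMIA r1!, {r5-r12}`. Whether a given compiler turns the loop body into a single LDM is not guaranteed, the name `read_block_sum` is hypothetical, and the timing code is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reads n_words 32-bit words starting at src, eight per loop iteration,
 * and returns their sum so the compiler cannot discard the loads.
 * n_words must be a multiple of 8. */
uint32_t read_block_sum(const uint32_t *src, size_t n_words)
{
    assert(n_words % 8 == 0);
    uint32_t sum = 0;
    for (size_t i = 0; i < n_words; i += 8) {
        /* Eight consecutive words = 32 bytes per iteration. */
        sum += src[i + 0] + src[i + 1] + src[i + 2] + src[i + 3];
        sum += src[i + 4] + src[i + 5] + src[i + 6] + src[i + 7];
    }
    return sum;
}
```

To measure DDR2 rather than cache speed, the buffer passed in would need to be much larger than the data cache.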

 

My concern is the very slow data read speed. I checked the DDR2 data lines with an oscilloscope and saw a burst of 8 pulses (47ns) followed by a pause of about 200ns. So this is not a bandwidth limitation.
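The scope reading can be cross-checked against the measured read rate. Eight clocks on a 16-bit double-data-rate bus carry 8 * 2 * 2 = 32 bytes, which is exactly one eight-register LDMIA, and that burst repeats roughly every 47 + 200 ns. A rough C sketch of the arithmetic (function names are hypothetical, and the 200ns gap is only an approximate scope reading):

```c
/* Bytes moved by one burst: clocks * 2 edges (DDR) * bus width in bytes. */
unsigned burst_bytes(unsigned clocks, unsigned bus_bits)
{
    return clocks * 2u * (bus_bits / 8u);
}

/* Sustained rate in MB/s when a burst of `bytes` repeats every
 * (burst_ns + gap_ns) nanoseconds; bytes per ns times 1000 is MB/s. */
double sustained_mb_per_s(unsigned bytes, double burst_ns, double gap_ns)
{
    return bytes * 1000.0 / (burst_ns + gap_ns);
}
```

`sustained_mb_per_s(burst_bytes(8, 16), 47.0, 200.0)` comes out at roughly 130MB/s, in the same range as the measured 120MB/s, so the gap between bursts accounts for essentially all of the lost read bandwidth.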

But what is it?

 

  • This is an interesting problem. My first guess would be some strange cache interaction, though as long as you continually move the r1 register value through the DDR memory, the cache interaction should average out.

    Looking at the actual instruction, I think there may be a problem, though I will start by saying I am not an ARM assembly expert; most of the ARM programming I have done has been in C. According to the ARM documentation at http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0231b/BABEFCIB.html, it seems that LDMIA is a THUMB instruction, which means it cannot access registers above r7 directly. If so, LDMIA r1!, {r5-r12} would be erroneous, and it would need to be LDMIA r1!, {r5-r7} to be valid.

  • I am not a programming expert either, but I am sure that the LDMIA instruction works fine in the classic (non-THUMB) ARM mode.

     

    We also tested single-word reads with the data cache off. In this case we see a data burst of 4 pulses (4*2*2 = 16 bytes). As I understand it, this is the minimum burst size for DDR2 memory. But the pause between bursts stays the same.

      

    We ran a test with no pointer register increment and the data cache switched on and got 770MB/s. That is close to 216MHz * 4 bytes bus width = 864MB/s, so the CPU speed is OK and the caches are working.

    We have checked all combinations of cache states and data lengths; the first message shows the best result.

     

  • Does anyone have CPU-to-DDR2 throughput test results for the DM335 or DM355 evaluation module, or for a custom board?

  • I am not aware of any DDR2 benchmarks for the DM355 or DM335 for accesses from the ARM. I will see if I can come up with any.

    I am somewhat curious what drove you to perform these bandwidth tests in the first place. Was your actual application running into DDR2 read delay problems?

  • Unfortunately our real application is not yet ready, but it is very sensitive to throughput.

     

    We had two hypotheses about the 200ns pause:

    1. It is the latency between the CPU command and the DDR2 memory controller's command to the memory.
    2. It is the minimum pause between two successive requests from the CPU to the DDR2 memory controller.

    The second variant would have a much smaller influence on the real application.

     

    We ran the following test to check this:

    1. Set GPIO pin to 1.
    2. Set GPIO pin to 0.
    3. Set GPIO pin to 1.
    4. CPU reads one word from DDR2.
    5. Set GPIO pin to 0.
    6. Pause 2 microseconds.
    7. Go to 1.
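    The steps above can be sketched in C roughly as follows. This is only an illustration: `gpio_out` and `pin_mask` are placeholders rather than real DM335 register definitions, the pause is a crude busy loop instead of a calibrated 2 microsecond delay, and on real hardware the read would have to miss the data cache (or the cache be off) for the pin timing to reflect DDR2 latency.

    ```c
    #include <stdint.h>

    /* One pass of steps 1-5. Returns the word read so the load is kept. */
    uint32_t latency_probe_once(volatile uint32_t *gpio_out, uint32_t pin_mask,
                                const volatile uint32_t *ddr_word)
    {
        *gpio_out |= pin_mask;        /* 1. pin = 1 */
        *gpio_out &= ~pin_mask;       /* 2. pin = 0 (reference pulse width) */
        *gpio_out |= pin_mask;        /* 3. pin = 1 */
        uint32_t value = *ddr_word;   /* 4. read one word from DDR2 */
        *gpio_out &= ~pin_mask;       /* 5. pin = 0 */
        return value;
    }

    /* Steps 6-7: pause, then repeat. pause_loops is an uncalibrated count. */
    void latency_probe_run(volatile uint32_t *gpio_out, uint32_t pin_mask,
                           const volatile uint32_t *ddr_word,
                           unsigned passes, unsigned pause_loops)
    {
        for (unsigned p = 0; p < passes; ++p) {
            (void)latency_probe_once(gpio_out, pin_mask, ddr_word);
            for (volatile unsigned d = 0; d < pause_loops; ++d) {
                /* 6. busy-wait */
            }
        }
    }
    ```

    The first pulse (steps 1 to 2) measures the cost of the pin writes alone, so its width can be subtracted from the second pulse (steps 3 to 5) to isolate the read.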

     

    Test result:

    Time between steps 1 and 2: about 40-50ns.

    Time between steps 3 and 5: about 250-270ns.

     

    So this is variant 1  :-(
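
    The comparison behind that conclusion can be written out as a small calculation, treating the steps 1 to 2 interval as pure GPIO overhead and subtracting it from the steps 3 to 5 interval. The type and function names here are hypothetical, and the result is only as precise as the scope readings.

    ```c
    /* Closed interval in nanoseconds. */
    struct range_ns { unsigned lo, hi; };

    /* Read latency = (pulse around the read) - (pulse with no read).
     * Bounds are the widest consistent ones: smallest total minus largest
     * overhead, and largest total minus smallest overhead. */
    struct range_ns net_read_latency(struct range_ns with_read,
                                     struct range_ns gpio_only)
    {
        struct range_ns net = {
            with_read.lo - gpio_only.hi,
            with_read.hi - gpio_only.lo
        };
        return net;
    }
    ```

    With 250-270ns for the read pulse and 40-50ns for the reference pulse, this gives 200-230ns for a single isolated read, which matches the roughly 200ns gap seen between bursts in the streaming test.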

  • Thank you for the additional test data. That seems like a rather long latency for a read command, though it is possible this is expected. In any case, we are looking into this more deeply internally. Currently I am trying to find any DDR bandwidth benchmarks that may have been done during validation, which would show us either that this level of performance is expected or some alternate way to get higher bandwidth. On the other hand, I have no guarantee that such tests exist, so I am not sure if or when we will have a better answer, but I will keep you posted.