DM816x DDR3 memory performance, lmbench

We seem to be chasing a memory bottleneck (read or write memory bandwidth) on our Ti816x-based platform.

We are using DDR3 memory. I cross-compiled and executed the lmbench benchmark tools and here are the results.

iCVR-VS-005354# ./bw_mem 100M wr

100.00 2256.52

iCVR-VS-005354# ./bw_mem 100M rd

100.00 512.43

We are surprised that read and write performance differ so much: write bandwidth is roughly 4.4 times read bandwidth. Are we interpreting the results properly?
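For reference, the ratio can be checked with a short script. This is a minimal sketch that assumes each bw_mem output line has the form "size in MB, bandwidth in MB/s"; the helper name `parse_bw_mem` is hypothetical and not part of lmbench.

```python
def parse_bw_mem(line):
    """Split one bw_mem output line into (size_mb, bandwidth_mb_per_s).

    Assumes the two-column "size bandwidth" layout seen in the results above.
    """
    size_mb, mb_per_s = line.split()
    return float(size_mb), float(mb_per_s)


# The two result lines quoted in this post.
wr_size, wr_bw = parse_bw_mem("100.00 2256.52")
rd_size, rd_bw = parse_bw_mem("100.00 512.43")

# Ratio of write bandwidth to read bandwidth.
ratio = wr_bw / rd_bw
print(f"write: {wr_bw:.2f} MB/s, read: {rd_bw:.2f} MB/s, ratio: {ratio:.2f}")
```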

What is the expected performance for this benchmark on this platform?  Are there any other benchmarks that we should run to feel confident about our memory performance?

We observed similar performance when running these tests directly on the DVR-RDK hardware itself. 

Also, the DDR3 timing parameters in the RDK do not appear to be set for maximum performance. I'd like to understand where the DDR3 timing parameters in the RDK come from, to see whether these numbers can be optimized.

Regards, 

-- B

  • We have not tried lmbench on the DVR RDK. We will try it and update you, but we expect to get the same performance. The performance probes on the 816x allow measurement of DDR bandwidth per initiator using an XDS560v2 and CCS. We can confirm that the MB/s printed by the benchmark is correct using the hardware counters.

    Can you please provide the following info:

    - Which DVR RDK release are you using?

    - What was the original issue you were seeing? Was it high A8 load? What operation was being performed? Was it high-bandwidth network I/O or SATA I/O?

    - What is the application? Is it a DVR, hybrid DVR, or NVR? What is the A8's data read/write bandwidth in your application?

    - Is your intention to measure the A8's bandwidth or the DDR bandwidth? If you want to measure DDR bandwidth, you should use linked EDMA transfers to measure maximum throughput.

    Writes are expected to give better throughput than reads, since writes can be posted, whereas a read requires the data to actually be fetched from DDR, which can take up to 100 cycles on a cache miss.
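To illustrate why miss latency caps read throughput, here is a rough back-of-the-envelope model. It is only a sketch under stated assumptions: a 64-byte cache line, a single outstanding miss at a time, and a 1 GHz clock as the reference for the "100 cycles" figure. None of these values come from this thread apart from the 100 cycles, and the function name is hypothetical.

```python
def latency_bound_read_mb_per_s(line_bytes, miss_cycles, clock_hz, outstanding=1):
    """Estimate read bandwidth when every cache-line fill waits for a full miss.

    line_bytes  : bytes fetched per miss (assumed cache-line size)
    miss_cycles : latency of one miss, in cycles of clock_hz
    clock_hz    : clock the cycle count refers to (assumption)
    outstanding : number of misses in flight at once (assumption)
    """
    seconds_per_miss = miss_cycles / clock_hz
    bytes_per_second = outstanding * line_bytes / seconds_per_miss
    return bytes_per_second / 1e6  # MB/s, with 1 MB = 10**6 bytes


# Assumed values: 64-byte line, 100-cycle miss, 1 GHz reference clock.
estimate = latency_bound_read_mb_per_s(64, 100, 1.0e9)
print(f"latency-bound read estimate: {estimate:.0f} MB/s")
```

Posted writes do not wait for each transaction to complete, so they are not limited in this way. The actual read figure depends on the real miss latency, prefetching, and how many misses the core can keep in flight.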

    Regarding DDR timings, can you explain why you say the DDR timings are not optimal?

    The DDR timings are defined in U-Boot: uboot\u-boot-dvr-rdk\arch\arm\include\asm\arch-ti81xx\ddr_defs_ti816x.h

     

     

     

  • 4034.DM8167_DDR3_Timing_Registers.xlsx

     

    Hello:

    We are trying to transfer multiple 1080P streams from the M3 Decoder(s) through the scaler to multiple M3 encoder(s) and DSP cores.    We are not sure if it is a memory bottleneck or a problem with the data paths.    How many transcodes of 1080P (with resizing) can we expect to be able to handle?

    What benchmarking tool do you recommend?  We found lmbench because it was referenced by TI here:

    http://processors.wiki.ti.com/index.php/Lmbench

    Attached is a comparison between the DVR-RDK's SDRAM timing register values and our calculations.

    We are using DVR-RDK_03.00.01.03, but the timing register values do not seem to have changed for a while.

    Regards,

    --B

     

  • We are trying to transfer multiple 1080P streams from the M3 Decoder(s) through the scaler to multiple M3 encoder(s) and DSP cores.    We are not sure if it is a memory bottleneck or a problem with the data paths.    How many transcodes of 1080P (with resizing) can we expect to be able to handle?

    Running lmbench, which tests Cortex-A8 CPU data throughput, will not help with this issue. Please share the following info:

    1. Logs of Vsys_printDetailedStatistics() while your use case is running.

    2. Your use case file where the links are connected.

    We should be able to identify the source of the bottleneck from the logs. Please confirm that you are clocking DDR at 796 MHz.

     

    For transcode, the approximate numbers are:

    1 HDVICP can do 1 1080P60 H264 encode or decode.

    i.e. 1 HDVICP can do 1 1080P30 transcode.

    8167 has 3 HDVICP so you should be able to do 3 1080P30 transcode.
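The arithmetic behind this estimate can be sketched as follows. This is a simplified budget, assuming that one transcode costs one decode plus one encode at the stream frame rate and that the resize step does not consume HDVICP capacity. The function name is hypothetical.

```python
def max_transcodes(num_hdvicp, codec_fps_per_hdvicp, stream_fps):
    """Approximate number of simultaneous 1080P transcodes.

    num_hdvicp           : number of HDVICP engines on the device
    codec_fps_per_hdvicp : 1080P frames per second one HDVICP can encode or decode
    stream_fps           : frame rate of each transcoded stream
    """
    # Each transcoded stream needs one decode and one encode per frame.
    codec_fps_per_transcode = 2 * stream_fps
    total_codec_fps = num_hdvicp * codec_fps_per_hdvicp
    return total_codec_fps // codec_fps_per_transcode


# Figures quoted above: 3 HDVICPs, each good for 1080P60 encode or decode.
print(max_transcodes(num_hdvicp=3, codec_fps_per_hdvicp=60, stream_fps=30))
```

This counts codec capacity only. Memory bandwidth and the scaler path can still limit the real number, which is why the statistics logs are needed.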

  • FYI, the RDK 4.0 release has a 4-channel 1080P30 encode -> decode use case that runs in real time. Can you please share the Vsys_printDetailedStatistics logs from your use case for analysis?