Measuring data load/store time from DDR and MSMC

Cheng Wang66110

Other Parts Discussed in Thread: 66AK2H12

Hi,

I want to measure the time it takes to load data from DDR, MSMC, and L2 on KeyStone II (KS II). I would rather not write the benchmark from scratch. Could you please point me to some examples of how to do this?

Cheng

over 11 years ago

0 lding over 11 years ago

TI__Guru* 95265 points

Cheng,

Can you clarify whether the data movement is inside the DSP, from ARM to DSP (where CMEM can be used as shared memory between them), or between different chips? Then we can check whether any benchmarking code is available.

Regards, Eric

0 Cheng Wang66110 over 11 years ago in reply to lding

Intellectual 910 points

Hi Eric,

It is inside DSP, for a single-core performance on DSP.

Thanks

Cheng

0 lding over 11 years ago in reply to Cheng Wang66110

TI__Guru* 95265 points

Cheng,

Do you use the CPU or EDMA to move the data? If it is the CPU, it is just a few lines:

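extern cregister volatile unsigned int TSCL;

unsigned int time1, time2, diff;

TSCL = 0;                                    /* any write to TSCL starts the counter */
time1 = TSCL;                                /* DSP 32-bit timestamp */
*(volatile unsigned int *)address = value;   /* this could be in a for loop */
time2 = TSCL;                                /* DSP 32-bit timestamp */
diff = time2 - time1;                        /* elapsed CPU cycles */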
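A slightly fuller sketch of the same idea, in case it helps: it times a loop of stores and a loop of loads over a buffer. It assumes the TI C6000 compiler (which predefines `_TMS320C6X` and declares `TSCL` in `c6x.h`); the function names `start_cycles`, `time_stores`, and `time_loads` are made up for illustration. Off-target it falls back to `clock()`, so the code compiles on a host but the numbers there are clock ticks, not DSP cycles.

```c
#include <stddef.h>
#include <stdint.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* declares the TSCL control register */
static uint32_t read_cycles(void) { return TSCL; }
void start_cycles(void) { TSCL = 0; }   /* any write starts the counter */
#else
#include <time.h>
/* Host fallback (assumption): clock() ticks stand in for DSP cycles. */
static uint32_t read_cycles(void) { return (uint32_t)clock(); }
void start_cycles(void) { }
#endif

/* Time n 32-bit stores to buf and return the elapsed count.
 * Unsigned subtraction stays correct across one wrap of the 32-bit counter. */
uint32_t time_stores(volatile uint32_t *buf, size_t n, uint32_t value)
{
    size_t i;
    uint32_t t0, t1;

    t0 = read_cycles();
    for (i = 0; i < n; i++)
        buf[i] = value;
    t1 = read_cycles();
    return t1 - t0;
}

/* Time n 32-bit loads from buf. The sum is handed back through *sink so
 * the loop has an observable result. */
uint32_t time_loads(const volatile uint32_t *buf, size_t n, uint32_t *sink)
{
    size_t i;
    uint32_t t0, t1, sum = 0;

    t0 = read_cycles();
    for (i = 0; i < n; i++)
        sum += buf[i];
    t1 = read_cycles();
    *sink = sum;
    return t1 - t0;
}
```

Call `start_cycles()` once at startup, then divide the returned count by `n` for an average cost per access. Which memory is being measured depends on where `buf` is placed, and the result includes loop overhead and any cache effects.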

Is this what you are looking for?

Regards, Eric

0 Cheng Wang66110 over 11 years ago in reply to lding

Intellectual 910 points

Hi Eric,

Thanks for your reply!

Actually, what I am looking for is how to control where an array is allocated: in DDR, MSMC, or L2/L1.

Also, for measurement purposes, how do I turn the cache off?

Thanks

Cheng

0 Gaurav over 11 years ago

TI__Prodigy 240 points

Hi Cheng,

Please find attached an example we wrote to measure access times between different memory segments (DDR,MSMC,L2) using EDMA Manager initiated from a DSP core.

Note: Each DSP core (0-7) is assigned an EDMA channel controller. Cores 6 and 7 use controller 4, which gives higher bandwidth than controllers 1-3. This is evident from the measurements taken in this example (on a Rev 3.0 K2H EVM with a 1333 MHz DDR3 module):

root@keystone-evm1:/usr/share/ti/examples/openmpacc/edmabw# ./edmabw
Single channel EDMA bandwidth measured in GB/s
==============
From DSP Core:  0       1       2       3       4       5       6       7
==============
ddr  => ddr  :  3.5     3.4     3.5     3.5     3.4     3.4     3.5     3.5
ddr  => msmc :  6.3     6.3     6.3     6.3     6.2     6.2     12.0    12.1
ddr  => l2   :  6.3     6.3     6.4     6.4     6.2     6.2     6.4     6.4
msmc => ddr  :  5.4     5.4     5.4     5.4     5.1     5.1     7.4     7.4
msmc => msmc :  6.5     6.5     6.5     6.5     6.5     6.5     12.9    12.8
msmc => l2   :  6.5     6.5     6.5     6.5     6.5     6.5     6.5     6.5
l2   => ddr  :  5.4     5.4     5.4     5.4     5.1     5.1     6.2     6.2
l2   => msmc :  6.5     6.5     6.5     6.5     6.5     6.5     6.5     6.5
l2   => l2   :  6.1     6.1     6.1     6.1     6.5     6.5     6.1     6.1

These measurements represent a single EDMA channel transaction each. However, if you initiate EDMA transactions simultaneously from all DSP cores, you will max out the memory bus and achieve the maximum bandwidth possible.
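For reference, a GB/s figure like those in the table can be derived from the number of bytes moved, the elapsed DSP cycle count, and the DSP clock. A minimal sketch, assuming 1 GB = 1e9 bytes and a clock given in MHz; the function name `bandwidth_gbps` is hypothetical, and this is not necessarily how the attached example computes its numbers.

```c
/* Convert bytes moved and elapsed DSP cycles into GB/s.
 * Assumptions: 1 GB = 1e9 bytes; clock_mhz is the DSP clock in MHz. */
double bandwidth_gbps(unsigned long bytes, unsigned long cycles, double clock_mhz)
{
    double seconds;

    if (cycles == 0 || clock_mhz <= 0.0)
        return 0.0;                      /* nothing measurable */
    seconds = (double)cycles / (clock_mhz * 1e6);
    return (double)bytes / seconds / 1e9;
}
```

For example, 1,000,000 bytes moved in 200,000 cycles at 1200 MHz works out to 6.0 GB/s.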

Please feel free to change/modify the example as necessary for further benchmarking.

Regards,

Gaurav

4405.edma_bandwidth.tar.gz

0 Gaurav over 11 years ago in reply to Cheng Wang66110

TI__Prodigy 240 points

Hi Cheng,

To configure part of L1 as SRAM, you can currently use the chip support library (CSL). The attached file shows an example of how to use CSL to set up 16KB of SRAM and 16KB of cache in L1D.

In the next release of MCSDK-HPC, you will be able to use the following built-in functions in OpenCL and OpenMP Accelerator Model to perform these cache operations.

void     __cache_l1d_none  (void);
void     __cache_l1d_all   (void);
void     __cache_l1d_4k    (void);
void     __cache_l1d_8k    (void);
void     __cache_l1d_16k   (void);
void     __cache_l1d_flush (void);

void     __cache_l2_none   (void);
void     __cache_l2_128k   (void);
void     __cache_l2_256k   (void);
void     __cache_l2_512k   (void);
void     __cache_l2_flush  (void);
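As a rough sketch of how these could be used in a measurement, assuming the built-ins behave as their names suggest: switch the L1D and L2 caches off, run the code being timed, then restore whatever cache sizes you normally use. The helper `run_with_caches_off` and the `demo_kernel` placeholder are hypothetical, and the bodies in the `#else` branch are host-only stand-ins that just count calls so the sketch compiles off-target.

```c
#ifdef _TMS320C6X
/* Prototypes as listed above; provided by the runtime on the DSP. */
void __cache_l1d_none(void);
void __cache_l1d_all(void);
void __cache_l2_none(void);
void __cache_l2_128k(void);
#else
/* Host-only stand-ins (not the real implementation): count the calls. */
static int cache_calls[4];
void __cache_l1d_none(void) { cache_calls[0]++; }
void __cache_l1d_all(void)  { cache_calls[1]++; }
void __cache_l2_none(void)  { cache_calls[2]++; }
void __cache_l2_128k(void)  { cache_calls[3]++; }
#endif

typedef void (*cache_fn)(void);

/* Run kernel() with L1D and L2 configured as no cache, then restore the
 * cache sizes through the caller-supplied functions. */
void run_with_caches_off(void (*kernel)(void),
                         cache_fn restore_l1d, cache_fn restore_l2)
{
    __cache_l1d_none();
    __cache_l2_none();
    kernel();
    restore_l1d();
    restore_l2();
}

/* Placeholder for the code under test. */
static int kernel_runs;
static void demo_kernel(void) { kernel_runs++; }
```

A call might look like `run_with_caches_off(demo_kernel, __cache_l1d_all, __cache_l2_128k);`. The restore functions are passed in because the right sizes depend on your normal configuration.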

Regards,

Gaurav

2705.dsp_cache_ops.c

0 Cheng Wang66110 over 11 years ago in reply to Gaurav

Intellectual 910 points

Hi Gaurav,

I just looked into the code of measuring the memory bandwidth using EDMA. I could understand most of the parts except for the dsp_speed() function:

float dsp_speed()
{
    const unsigned DSP_PLL = 122880000;
    char *BOOTCFG_BASE_ADDR = (char*)0x02620000;
    char *CLOCK_BASE_ADDR = (char*)0x02310000;
    int MAINPLLCTL0 = (*(int*)(BOOTCFG_BASE_ADDR + 0x350));
    int MULT = (*(int*)(CLOCK_BASE_ADDR + 0x110));
    int OUTDIV = (*(int*)(CLOCK_BASE_ADDR + 0x108));

    unsigned mult = 1 + ((MULT & 0x3F) | ((MAINPLLCTL0 & 0x7F000) >> 6));
    unsigned prediv = 1 + (MAINPLLCTL0 & 0x3F);
    unsigned output_div = 1 + ((OUTDIV >> 19) & 0xF);
    unsigned speed = DSP_PLL * mult / prediv / output_div;
    return speed / 1e6;
}

Could you please explain in more detail what this function does?

2. In this example, you use EDMA to move the data. If I used the CPU instead, via memcpy, would you expect the results to be the same or very different? Do you have the theoretical peak bandwidth for each bus?

3. Regarding disabling the cache, would simply declaring the variable as "volatile" prevent the data from being cached?

Thanks for your help!

Cheng

0 Gaurav over 11 years ago in reply to Cheng Wang66110

TI__Prodigy 240 points

Hi Cheng,

1. The dsp_speed function reads PLL clock registers and calculates the actual frequency the DSP is running at.

2. I would not expect the results to be the same. EDMA should be faster in general. I do not have the theoretical peak for each bus. But you can look through http://www.ti.com/lit/ds/symlink/66ak2h12.pdf to find out.

3. Declaring a variable "volatile" will not make it non-cacheable. It will only instruct the compiler to not optimize out loads and stores for that variable.
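To make the arithmetic in point 1 concrete, here is a sketch that decodes the same fields as the quoted dsp_speed(), but from register values passed in as arguments so it can be checked off-target. The field meanings are inferred from the masks and shifts in the quoted code, the name `pll_speed_mhz` is hypothetical, and the register values in the example afterwards are made up. Unlike the quoted code, it uses a 64-bit intermediate so a large multiplier cannot overflow.

```c
#include <stdint.h>

/* ref_hz      : PLL reference clock in Hz (DSP_PLL in the quoted code)
 * mainpllctl0 : value read at BOOTCFG base + 0x350
 * mult_reg    : value read at clock base + 0x110
 * outdiv_reg  : value read at clock base + 0x108
 * Returns the resulting clock in MHz. */
double pll_speed_mhz(uint32_t ref_hz, uint32_t mainpllctl0,
                     uint32_t mult_reg, uint32_t outdiv_reg)
{
    /* Multiplier: low 6 bits from mult_reg, upper bits from
     * mainpllctl0[18:12] moved down to bit 6; stored as (value - 1). */
    uint32_t mult = 1u + ((mult_reg & 0x3Fu) | ((mainpllctl0 & 0x7F000u) >> 6));
    /* Pre-divider: mainpllctl0[5:0], stored as (value - 1). */
    uint32_t prediv = 1u + (mainpllctl0 & 0x3Fu);
    /* Output divider: outdiv_reg[22:19], stored as (value - 1). */
    uint32_t output_div = 1u + ((outdiv_reg >> 19) & 0xFu);

    uint64_t hz = (uint64_t)ref_hz * mult / prediv / output_div;
    return (double)hz / 1e6;
}
```

With a 122.88 MHz reference, a stored multiplier of 19 (x20), a stored pre-divider of 0 (/1), and a stored output divider of 1 (/2), this gives 122.88 * 20 / 1 / 2 = 1228.8 MHz.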

Regards,

Gaurav

0 Cheng Wang66110 over 11 years ago in reply to Gaurav

Intellectual 910 points

Thanks, Gaurav!

I will let you know the results.

Cheng

0 Cheng Wang66110 over 11 years ago in reply to Gaurav

Intellectual 910 points

Hi Gaurav,

Just a follow-up question:

Regarding cores 6 and 7, which deliver higher bandwidth: you said this is because they are connected to EDMA controller 4. Could you explain in a bit more detail why cores 6 and 7 were given higher bandwidth on KS II, and what the design considerations behind this were?

Thanks

Cheng

0 sunita chandrasekaran over 11 years ago in reply to Gaurav

Prodigy 50 points

Hi Gaurav

Thanks for the explanation. We are wondering why controller 4 gives cores 6 and 7 higher bandwidth than the other controllers give the remaining cores. The difference is clearly evident from the results.

Is there documentation somewhere that we could read to understand this better?

Thanks for your help.

Sunita

0 Gaurav over 11 years ago in reply to sunita chandrasekaran

TI__Prodigy 240 points

Hi Sunita, Cheng,

I am unaware of the design decisions that led to EDMA controller 4 having higher bandwidth.

Perhaps more information is available from:

EDMA3 User Guide for Keystone: http://www.ti.com/lit/ug/sprugs5a/sprugs5a.pdf
66AK2H Technical Reference Manual: http://www.ti.com/lit/ds/symlink/66ak2h12.pdf

Hope that helps.

Regards,

Gaurav

0 sunita chandrasekaran over 11 years ago in reply to Gaurav

Prodigy 50 points

Gaurav,

Thanks for your reply, we will go through the PDFs and get back to you when we find more information.

Sunita
