Measuring data load/store time from DDR and MSMC

Other Parts Discussed in Thread: 66AK2H12

Hi,

I want to measure the time it takes to load data from DDR, MSMC, and L2 on KeyStone II (KS II). I would rather not write the benchmark from scratch. Could you please point me to some examples of how to do that?

Cheng

  • Cheng, 

    Can you clarify where the data movement takes place? Is it inside the DSP, from ARM to DSP (although CMEM can be used as shared memory between them), or between different chips? Then we can check whether any benchmarking code already exists.

    Regards, Eric

  • Hi Eric,

    It is inside the DSP. I want to measure single-core performance on the DSP.

    Thanks

    Cheng

  • Cheng,

    Do you use the CPU or EDMA to move the data? If the CPU, it takes just a few lines:

    extern cregister volatile unsigned int TSCL;

    TSCL = 0;                            /* a write to TSCL starts the free-running counter */
    time1 = TSCL;                        /* DSP 32-bit timestamp */
    *(unsigned int *)address = value;    /* this could be in a for loop */
    time2 = TSCL;                        /* DSP 32-bit timestamp */
    diff = time2 - time1;

    Is this what you are looking for?

    Regards, Eric
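
    The subtraction in the snippet above stays correct even if the 32-bit counter wraps once between the two reads, because unsigned arithmetic in C is modulo 2^32. A minimal host-side sketch of that arithmetic follows. The helper names `elapsed_cycles` and `cycles_per_access` are hypothetical and not part of any TI library; on the DSP, the two timestamps would come from the TSCL reads in the snippet.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Cycles between two reads of a free-running 32-bit counter.
     * uint32_t subtraction is modulo 2^32, so a single wrap between
     * the two reads still yields the right count. */
    uint32_t elapsed_cycles(uint32_t start, uint32_t end)
    {
        return end - start;
    }

    /* Average cycles per access when the timed region is a loop of
     * `count` accesses (hypothetical helper, for illustration only). */
    double cycles_per_access(uint32_t start, uint32_t end, uint32_t count)
    {
        assert(count > 0);
        return (double)elapsed_cycles(start, end) / (double)count;
    }
    ```

    For example, elapsed_cycles(0xFFFFFFF0, 0x10) is 0x20, that is, 32 cycles. Dividing by the loop count spreads the cost of the two timestamp reads over many accesses, but it does not separate loop overhead from the memory access itself.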

  • Hi Eric,

    Thanks for your reply!

    Actually, what I am looking for is how to control where an array is allocated: in DDR, MSMC, or L2/L1.

    Also, for measurement purposes, how do I turn the cache off?

    Thanks

    Cheng

  • Hi Cheng,

    Please find attached an example we wrote to measure access times between different memory segments (DDR, MSMC, L2) using EDMA Manager, initiated from a DSP core.

    Note: Each DSP core (0-7) is assigned an EDMA channel controller. Cores 6 and 7 use controller 4, which gives higher bandwidth than controllers 1-3. This is evident from the measurements in this example, taken on a Rev 3.0 K2H EVM with a 1333 MHz DDR3 module:

    root@keystone-evm1:/usr/share/ti/examples/openmpacc/edmabw# ./edmabw
    Single channel EDMA bandwidth measured in GB/s
    ==============
    From DSP Core:  0       1       2       3       4       5       6       7
    ==============
    ddr  => ddr  :  3.5     3.4     3.5     3.5     3.4     3.4     3.5     3.5
    ddr  => msmc :  6.3     6.3     6.3     6.3     6.2     6.2     12.0    12.1
    ddr  => l2   :  6.3     6.3     6.4     6.4     6.2     6.2     6.4     6.4
    msmc => ddr  :  5.4     5.4     5.4     5.4     5.1     5.1     7.4     7.4
    msmc => msmc :  6.5     6.5     6.5     6.5     6.5     6.5     12.9    12.8
    msmc => l2   :  6.5     6.5     6.5     6.5     6.5     6.5     6.5     6.5
    l2   => ddr  :  5.4     5.4     5.4     5.4     5.1     5.1     6.2     6.2
    l2   => msmc :  6.5     6.5     6.5     6.5     6.5     6.5     6.5     6.5
    l2   => l2   :  6.1     6.1     6.1     6.1     6.5     6.5     6.1     6.1
    

    These measurements represent a single EDMA channel transaction each. However, if you initiate EDMA transactions simultaneously from all DSP cores, you will max out the memory bus and achieve the maximum bandwidth possible.

    Please feel free to change/modify the example as necessary for further benchmarking.

    Regards,

    Gaurav

    4405.edma_bandwidth.tar.gz
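
    A GB/s figure like those in the table can be derived from a byte count, an elapsed cycle count, and the DSP clock in MHz. The sketch below is not taken from the attached example. `bandwidth_gb_per_s` is a hypothetical name, and the code assumes 1 GB = 10^9 bytes, which the output above does not specify.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Convert a transfer of `bytes` bytes that took `cycles` DSP cycles
     * at `dsp_mhz` MHz into GB/s.
     * Assumption: 1 GB = 1e9 bytes (not stated in the thread). */
    double bandwidth_gb_per_s(uint64_t bytes, uint64_t cycles, double dsp_mhz)
    {
        assert(cycles > 0 && dsp_mhz > 0.0);
        double seconds = (double)cycles / (dsp_mhz * 1e6);
        return (double)bytes / seconds / 1e9;
    }
    ```

    As a check on the arithmetic: 6,000,000 bytes moved in 1,200,000 cycles at 1200 MHz takes 1 ms, which works out to about 6 GB/s.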

  • Hi Cheng,

    To create L1 SRAM, you can currently use the Chip Support Library (CSL). The attached file shows an example of how to use CSL to configure L1D as 16 KB of SRAM and 16 KB of cache.

    In the next release of MCSDK-HPC, you will be able to use the following built-in functions in OpenCL and the OpenMP Accelerator Model to perform these cache operations:

    void     __cache_l1d_none  (void);
    void     __cache_l1d_all   (void);
    void     __cache_l1d_4k    (void);
    void     __cache_l1d_8k    (void);
    void     __cache_l1d_16k   (void);
    void     __cache_l1d_flush (void);
    
    void     __cache_l2_none   (void);
    void     __cache_l2_128k   (void);
    void     __cache_l2_256k   (void);
    void     __cache_l2_512k   (void);
    void     __cache_l2_flush  (void);
    

    Regards,

    Gaurav

    2705.dsp_cache_ops.c

  • Hi Gaurav,

    I just looked into the code that measures memory bandwidth using EDMA. I understand most of it except for the dsp_speed() function:

    float dsp_speed()
    {
        const unsigned DSP_PLL = 122880000;
        char *BOOTCFG_BASE_ADDR = (char*)0x02620000;
        char *CLOCK_BASE_ADDR = (char*)0x02310000;
        int MAINPLLCTL0 = (*(int*)(BOOTCFG_BASE_ADDR + 0x350));
        int MULT = (*(int*)(CLOCK_BASE_ADDR + 0x110));
        int OUTDIV = (*(int*)(CLOCK_BASE_ADDR + 0x108));

        unsigned mult = 1 + ((MULT & 0x3F) | ((MAINPLLCTL0 & 0x7F000) >> 6));
        unsigned prediv = 1 + (MAINPLLCTL0 & 0x3F);
        unsigned output_div = 1 + ((OUTDIV >> 19) & 0xF);
        unsigned speed = DSP_PLL * mult / prediv / output_div;
        return speed / 1e6;
    }

    1. Could you please explain in more detail what this function does?

    2. In this example, you use EDMA to load data. If I use the CPU instead, via memcpy, would you expect the results to be the same or very different? Do you have the theoretical peak bandwidth for each bus?

    3. Regarding disabling the cache: would simply declaring the variable as "volatile" prevent the data from being cached?

    Thanks for your help!

    Cheng 

  • Hi Cheng,

    1. The dsp_speed function reads PLL clock registers and calculates the actual frequency the DSP is running at. 

    2. I would not expect the results to be the same. EDMA should be faster in general. I do not have the theoretical peak for each bus. But you can look through http://www.ti.com/lit/ds/symlink/66ak2h12.pdf to find out. 

    3. Declaring a variable "volatile" will not make it non-cacheable. It will only instruct the compiler to not optimize out loads and stores for that variable. 

    Regards,

    Gaurav
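
    The arithmetic in dsp_speed() can be tried on a host by passing the three register values in as parameters instead of reading them from memory-mapped addresses. The following is a sketch under stated assumptions, not the code from the attached example: the function and parameter names are hypothetical, the field positions are read directly off the masks and shifts in the quoted function, and the 122.88 MHz reference comes from its DSP_PLL constant. One deliberate difference is the 64-bit intermediate, since the 32-bit product DSP_PLL * mult would overflow once mult exceeds 34.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Recompute the DSP clock in MHz from raw register values.
     * ref_hz      : reference clock in Hz (122880000 in the quoted code)
     * mainpllctl0 : value read at BOOTCFG base + 0x350
     * pllm_reg    : value read at clock base + 0x110
     * secctl_reg  : value read at clock base + 0x108
     * Parameter names are illustrative, not TI register definitions. */
    double dsp_speed_mhz(uint32_t ref_hz, uint32_t mainpllctl0,
                         uint32_t pllm_reg, uint32_t secctl_reg)
    {
        /* Multiplier: low 6 bits from pllm_reg, upper bits from
         * mainpllctl0 bits 18:12 shifted down to bits 12:6; stored minus one. */
        uint32_t mult = 1u + ((pllm_reg & 0x3Fu) | ((mainpllctl0 & 0x7F000u) >> 6));
        /* Pre-divider: mainpllctl0 bits 5:0; stored minus one. */
        uint32_t prediv = 1u + (mainpllctl0 & 0x3Fu);
        /* Output divider: secctl_reg bits 22:19; stored minus one. */
        uint32_t output_div = 1u + ((secctl_reg >> 19) & 0xFu);

        /* 64-bit intermediate so that ref_hz * mult cannot overflow. */
        uint64_t hz = (uint64_t)ref_hz * mult / prediv / output_div;
        return (double)hz / 1e6;
    }
    ```

    With a multiplier field of 19, a pre-divider field of 0, and an output-divider field of 1, this gives 122.88 MHz * 20 / 1 / 2 = 1228.8 MHz.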

  • Thanks, Gaurav! 

    I will let you know the results.

    Cheng

  • Hi Gaurav,

    Just a follow-up question:

    Regarding cores 6 and 7, which deliver higher bandwidth: you said that is because they are connected to EDMA controller 4. Could you explain in a bit more detail why cores 6 and 7 were given higher bandwidth on KS II? What were the design considerations behind it?

    Thanks

    Cheng

  • Hi Gaurav

    Thanks for the explanation. We are wondering why using controller 4 gives cores 6 and 7 higher bandwidth than the other controllers do. It is clearly evident from the results.

    Is there documentation somewhere that we could read to understand this better?

    Thanks for your help.

    Sunita

  • Hi Sunita, Cheng,

    I am unaware of the design decisions that led to EDMA controller 4 having higher bandwidth. 

    Perhaps more information is available from:

    1. EDMA3 User Guide for Keystone: http://www.ti.com/lit/ug/sprugs5a/sprugs5a.pdf
    2. 66AK2H Technical Reference Manual: http://www.ti.com/lit/ds/symlink/66ak2h12.pdf 

    Hope that helps. 

    Regards,

    Gaurav

  • Gaurav,

    Thanks for your reply. We will go through the PDFs and get back to you when we find more information.

    Sunita