
Issue with per-core performance (C6678)

Other Parts Discussed in Thread: TMS320C6678

I want to know whether a tolerance range exists for the test results, and if it does, what that range is. I am currently running the test at 1 GHz, and a variation of about one-thousandth of one percent occurred. I would like to know whether this amount of variation is normal.

If it is not normal, could you tell me what kinds of factors cause such variation, or give some examples?

  • Can you please elaborate on your problem or requirement?
    What do you want to test?
  • source code:

    /* Reset and start the 64-bit time-stamp counter (TSCL/TSCH from c6x.h);
       any write to TSCL starts it. ke, pi, cd, el, and azi are application
       data defined elsewhere. */
    TSCL = 0; TSCH = 0;
    s_search = _itoll(TSCH, TSCL);

    for (i = 0; i < 90; i++)
    {
        for (j = 0; j < 360; j++)
            p = ke * ( cd->AA[i][0]*sin(el->el/180.*pi)*cos(azi->az/180.*pi)
                     + cd->AA[i][1]*sin(el->el/180.*pi)*sin(azi->az/180.*pi)
                     + cd->AA[i][2]*cos(el->el/180.*pi) );
    }

    e_search = _itoll(TSCH, TSCL);

    I measured the time on each of cores 0 to 3.
    When I run the same code repeatedly, I get a different measurement each time.
    Here are the results from 10 runs:

    (unit: nanoseconds)
    core 0
    127,796,736
    127,796,602
    127,796,432
    127,796,662
    127,796,066
    127,797,098
    127,795,638
    127,795,898
    127,795,604
    127,796,258

    core 1
    127,757,638
    127,757,786
    127,757,320
    127,758,080
    127,758,348
    127,757,672
    127,758,014
    127,757,816
    127,757,490
    127,757,954

    core 2
    127,757,788
    127,757,494
    127,757,892
    127,757,542
    127,758,436
    127,757,868
    127,757,842
    127,758,324
    127,757,904
    127,758,346

    core 3
    127,758,100
    127,758,364
    127,757,708
    127,757,768
    127,758,534
    127,757,976
    127,757,686
    127,758,078
    127,757,560
    127,758,192

    As shown above, the results are different on every run.
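    For reference, here is a minimal, self-contained sketch of this style of TSC-based timing. It assumes a 1 GHz core clock (so one TSC tick corresponds to one nanosecond) and uses a placeholder workload with illustrative names (run_workload, CPU_FREQ_HZ, sink) instead of the cd/el/azi structures used above.

    #include <c6x.h>      /* TSCL, TSCH registers and the _itoll intrinsic */
    #include <math.h>
    #include <stdio.h>

    #define CPU_FREQ_HZ 1000000000ULL   /* assumed 1 GHz core clock */

    static volatile double sink;        /* keeps the compiler from removing the loop */

    /* Placeholder workload standing in for the sin/cos loop above. */
    static void run_workload(void)
    {
        int i, j;
        const double pi = 3.14159265358979323846;
        for (i = 0; i < 90; i++)
            for (j = 0; j < 360; j++)
                sink = sin(i / 180.0 * pi) * cos(j / 180.0 * pi);
    }

    int main(void)
    {
        unsigned long long start, stop, cycles;

        TSCL = 0;                       /* any write to TSCL starts the free-running counter */
        start = _itoll(TSCH, TSCL);     /* 64-bit time stamp: high word, low word */

        run_workload();

        stop   = _itoll(TSCH, TSCL);
        cycles = stop - start;

        /* At 1 GHz one cycle is 1 ns; scale explicitly so other clock rates also work. */
        printf("cycles = %llu, elapsed = %llu ns\n",
               cycles, cycles * 1000000000ULL / CPU_FREQ_HZ);
        return 0;
    }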
  • Can you verify the following statements:
    After core 0 initializes the device,

    you run from CCS,

    you run the exact same code on a single DSP core while the other cores are in idle mode,

    and the data resides in the same memory location, right?

    By the way:
    There is a priority scheme for accessing slaves over the bus, but since you run a single DSP core it should not matter.
    Also, CCS accesses the different cores (for example, when there is a printf) in a different order, but again, I do not think that this is what matters in your case.
    So verify the first two statements and we will continue from there.

    Ran
  • It was run on a single core.
    The other cores were not running while core 0 was running.
    Similarly, the code was run on only one core.

    I ran the same code,
    and the data resides in the same memory location.

    Can you give me a test code to compare the performance?

    I would like to know the normal range of performance variation.
  • You should not see any difference in the time consumption in your setup. Something must be different between core 0 and core 1.

    I attach a document and C code and assembly code.  If you build the project per the instructions in the document you will be able to verify the frequency of the core.  Run the code on different cores and see if you see a difference.  If there is no difference, then the problem is in the memory configuration or some other configuration.

    Ran

    /cfs-file/__key/communityserver-discussions-components-files/791/2350.delayRoutine.sa

    /cfs-file/__key/communityserver-discussions-components-files/791/8154.mainClock.c

    /cfs-file/__key/communityserver-discussions-components-files/791/4478.Verifying-the-clock-rate-and-the-clock-function-of-C66XX-cores.docx

    Please report what you see

    Ran
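    The attached files are not reproduced here. For reference, an "overhead time" like the one printed by such a test is commonly obtained by timing an empty measurement, i.e. two back-to-back TSC reads with nothing in between. A minimal sketch of that idea (an assumption about the approach, not the actual contents of mainClock.c) follows.

    #include <c6x.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long long t1, t2;

        TSCL = 0;                     /* start the time-stamp counter */
        t1 = _itoll(TSCH, TSCL);
        t2 = _itoll(TSCH, TSCL);      /* nothing in between, so t2 - t1 is the read overhead */

        printf("overhead time is %llu\n", t2 - t1);
        return 0;
    }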

  • I ran your test code.

    I expected the values to be the same for all eight cores.

    However, it seems that the cores do not all behave the same.

    Here are the run results.

    [C66xx_0]   overhead time is   2
    [C66xx_1]   overhead time is   2
    [C66xx_2]   overhead time is   2
    [C66xx_3]   overhead time is   2
    [C66xx_4]   overhead time is   2
    [C66xx_5]   overhead time is   2
    [C66xx_6]   overhead time is   2
    [C66xx_7]   overhead time is   2
    [C66xx_0] t1  -2118105966  t1H  42  t2  -708040364 t2H 44   
    [C66xx_0] times passes (in milliseconds) 1.000000e+01
    [C66xx_2] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_2] times passes (in milliseconds) 1.000000e+01
    [C66xx_1] t1  2127169774  t1H  27  t2  -757731920 t2H 29   
    [C66xx_1] times passes (in milliseconds) 1.000000e+01
    [C66xx_3] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_3] times passes (in milliseconds) 1.000000e+01
    [C66xx_4] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_4] times passes (in milliseconds) 1.000000e+01
    [C66xx_5] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_5] times passes (in milliseconds) 1.000000e+01
    [C66xx_6] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_6] times passes (in milliseconds) 1.000000e+01
    [C66xx_7] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_7] times passes (in milliseconds) 1.000000e+01
    [C66xx_0]  t1  -708020784  t1H  44  t2  507733515 t2H 68   
    [C66xx_0] times passes (in milliseconds) 1.042950e+02
    [C66xx_0]  DONE !
    [C66xx_0]   DONE !
    [C66xx_0]   DONE !
    [C66xx_0]   DONE ! 
    [C66xx_0]  DONE ! 
    [C66xx_2]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_2] times passes (in milliseconds) 1.000000e+02
    [C66xx_2]  DONE !
    [C66xx_2]   DONE !
    [C66xx_2]   DONE !
    [C66xx_2]   DONE ! 
    [C66xx_2]  DONE ! 
    [C66xx_1]  t1  -757712370  t1H  29  t2  458041929 t2H 53   
    [C66xx_1] times passes (in milliseconds) 1.042950e+02
    [C66xx_1]  DONE !
    [C66xx_1]   DONE !
    [C66xx_1]   DONE !
    [C66xx_1]   DONE ! 
    [C66xx_1]  DONE ! 
    [C66xx_3]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_4]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_3] times passes (in milliseconds) 1.000000e+02
    [C66xx_4] times passes (in milliseconds) 1.000000e+02
    [C66xx_3]  DONE !
    [C66xx_4]  DONE !
    [C66xx_3]   DONE !
    [C66xx_4]   DONE !
    [C66xx_3]   DONE !
    [C66xx_4]   DONE !
    [C66xx_3]   DONE ! 
    [C66xx_4]   DONE ! 
    [C66xx_3]  DONE ! 
    [C66xx_4]  DONE ! 
    [C66xx_5]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_5] times passes (in milliseconds) 1.000000e+02
    [C66xx_5]  DONE !
    [C66xx_5]   DONE !
    [C66xx_5]   DONE !
    [C66xx_5]   DONE ! 
    [C66xx_5]  DONE ! 
    [C66xx_6]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_6] times passes (in milliseconds) 1.000000e+02
    [C66xx_6]  DONE !
    [C66xx_6]   DONE !
    [C66xx_6]   DONE !
    [C66xx_6]   DONE ! 
    [C66xx_6]  DONE ! 
    [C66xx_7]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_7] times passes (in milliseconds) 1.000000e+02
    [C66xx_7]  DONE !
    [C66xx_7]   DONE !
    [C66xx_7]   DONE !
    [C66xx_7]   DONE ! 
    [C66xx_7]  DONE ! 
    [C66xx_7] 

    Is this the difference between core 0 and core 1?

    Development Environment

    Code Composer Studio 5.1

    TMS320C6678

    RTSC Project

    IPC 1.24.3.32

    MCSDK 2.1.2.5

    MCSDK PDK 1.1.2.5

    SYS/BIOS 6.33.6.50

    XDCtools 3.23.4.60

    Compiler version TI v8.0.3

     

    Can you suggest anything else, or give me a different opinion?

     

  • I think that the differences between the cores may be because of the way the DDR is working. When you run the code from L2, do you get the same cycle counts or not?

    I will refer your question to our DDR expert and see if he can add anything.

    Ran
  • I understand that your end goal is to measure the performance variation of running a function on a C6678 core. The memory on the C6678 is hierarchical, and controlling the memory map configuration (linker command file and RTOS configuration) and what gets placed where can have a major effect on performance and on its run-to-run variation. There are multiple levels to this. I've listed a few here in order of increasing performance variation; a small cache-configuration sketch follows the list:

    1. Do not use caches at all, place all program and data in L1 level SRAM. Performance does not vary run to run, or core to core.

    2. Use L1P cache, place data in L1 level SRAM. Place code in local L2 SRAM. Performance does not vary run to run, or core to core. There can be some performance degradation because of L1P cache misses.

    3. Use L1P cache, place data in L1 level SRAM. Place code in shared L2 SRAM. Small performance variation is possible, depending on the activity of the other cores.

    ...

    N. The general-purpose processor model: place everything in "flat" memory in DDR and turn all caches on. Performance will vary based on the activity of other cores and the SoC, such as DDR refresh. DDR refresh is asynchronous to the core and can create a performance bubble of up to 1-2 microseconds, seen as stalls in the C66x core.
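    As an illustration of how the cache levels in the list above are typically switched on or off on a C66x core, here is a rough configuration sketch. The function and enum names are assumed from the C6678 Chip Support Library (csl_cacheAux.h) and should be checked against the CSL version actually installed.

    #include <ti/csl/csl_cacheAux.h>   /* CSL cache helpers (assumed API names) */

    /* Rough equivalent of cases 2/3 above: all of L1P used as cache,
       L1D and L2 left entirely as addressable SRAM (no cache). */
    void configure_caches(void)
    {
        CACHE_setL1PSize(CACHE_L1_32KCACHE);
        CACHE_setL1DSize(CACHE_L1_0KCACHE);
        CACHE_setL2Size(CACHE_0KCACHE);
    }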

    Now to your specific example. I'm focusing on the fact that there is run-to-run variation on just one core; I would look at this first, before comparing runs on different cores. The computational loop you included contains calls to multiple functions and direct accesses to data structures that are larger than the registers, so the placement of the data and program can affect the timing. Some suggestions on what could be done:

    - In which memory are your data structures (cd, el, azi, ...), the C stack, and the code placed? Sharing this (e.g. the .map file) would help in narrowing this down.

    - If your end goal is the lowest performance variation per core in a multicore use case, place all code and data in L2 SRAM (or even at the L1 level), or, for example, place code in MSMC SRAM (shared L2) and data in local L2. For larger data sets, use DMA paging to read and write data into L2 and/or L1 level memory. A placement sketch follows this list.

    - If your intention is to measure single-core performance variation with data and/or code in DDR, your results look reasonable.

    - For a typical C6678-based system, the effect of multiple cores and parallel DMAs sharing memory (MSMC SRAM, DDR) is the unavoidable source of performance variation. Placing the key signal processing kernels and other frequently called code in MSMC SRAM, placing the working memory of each task in local L2, and relying on the L1 caches and the prefetcher is often a good approach.
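    As a concrete illustration of the placement advice above, here is a hedged sketch using the TI compiler's DATA_SECTION and CODE_SECTION pragmas. The section names (.fastData, .fastCode), the memory range names (L2SRAM, MSMCSRAM), and the symbols AA_table and compute_p are illustrative assumptions that must be adapted to the project's actual linker command file.

    /* The section-to-memory mapping is an assumption; it has to match SECTIONS
       directives in the linker command file, for example:

           SECTIONS
           {
               .fastData > L2SRAM       (local, per-core L2)
               .fastCode > MSMCSRAM     (shared on-chip MSMC SRAM)
           }
    */
    #pragma DATA_SECTION(AA_table, ".fastData")
    double AA_table[90][3];                  /* hot working data kept out of DDR */

    #pragma CODE_SECTION(compute_p, ".fastCode")
    double compute_p(const double *row, double el_deg, double az_deg);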

      Pekka