
Issue with per-core performance (C6678)

Other Parts Discussed in Thread: TMS320C6678

I want to know whether a tolerance range exists for the test results, and if it does, what that range is. I am currently running the test at 1 GHz, and a variation of about one-thousandth of one percent occurred. I would like to know whether this amount of variation is normal.

If it is not normal, could you tell me what kinds of factors cause such variation, or give some examples?

  • Can you please elaborate on your problem or requirement?
    What do you want to test?
  • source code:

    /* Reset and start the 64-bit time-stamp counter (TSCL/TSCH from c6x.h);
       any write to TSCL starts it. ke, pi, cd, el, and azi are application
       data defined elsewhere. */
    TSCL = 0; TSCH = 0;
    s_search = _itoll(TSCH, TSCL);

    for (i = 0; i < 90; i++)
    {
        for (j = 0; j < 360; j++)
            p = ke * ( cd->AA[i][0]*sin(el->el/180.*pi)*cos(azi->az/180.*pi)
                     + cd->AA[i][1]*sin(el->el/180.*pi)*sin(azi->az/180.*pi)
                     + cd->AA[i][2]*cos(el->el/180.*pi) );
    }

    e_search = _itoll(TSCH, TSCL);

    I measured the time on each of cores 0 to 3.
    When I run the same code repeatedly, I get a different measurement each time.
    Here are the results from 10 runs:

    (unit: nanoseconds)
    core 0
    127,796,736
    127,796,602
    127,796,432
    127,796,662
    127,796,066
    127,797,098
    127,795,638
    127,795,898
    127,795,604
    127,796,258

    core 1
    127,757,638
    127,757,786
    127,757,320
    127,758,080
    127,758,348
    127,757,672
    127,758,014
    127,757,816
    127,757,490
    127,757,954

    core 2
    127,757,788
    127,757,494
    127,757,892
    127,757,542
    127,758,436
    127,757,868
    127,757,842
    127,758,324
    127,757,904
    127,758,346

    core 3
    127,758,100
    127,758,364
    127,757,708
    127,757,768
    127,758,534
    127,757,976
    127,757,686
    127,758,078
    127,757,560
    127,758,192

    As shown above, the results are different on every run.
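    For reference, here is a minimal, self-contained sketch of this style of TSC-based timing. It assumes a 1 GHz core clock (so one TSC tick corresponds to one nanosecond) and uses a placeholder workload with illustrative names (run_workload, CPU_FREQ_HZ, sink) instead of the cd/el/azi structures used above.

    #include <c6x.h>      /* TSCL, TSCH registers and the _itoll intrinsic */
    #include <math.h>
    #include <stdio.h>

    #define CPU_FREQ_HZ 1000000000ULL   /* assumed 1 GHz core clock */

    static volatile double sink;        /* keeps the compiler from removing the loop */

    /* Placeholder workload standing in for the sin/cos loop above. */
    static void run_workload(void)
    {
        int i, j;
        const double pi = 3.14159265358979323846;
        for (i = 0; i < 90; i++)
            for (j = 0; j < 360; j++)
                sink = sin(i / 180.0 * pi) * cos(j / 180.0 * pi);
    }

    int main(void)
    {
        unsigned long long start, stop, cycles;

        TSCL = 0;                       /* any write to TSCL starts the free-running counter */
        start = _itoll(TSCH, TSCL);     /* 64-bit time stamp: high word, low word */

        run_workload();

        stop   = _itoll(TSCH, TSCL);
        cycles = stop - start;

        /* At 1 GHz one cycle is 1 ns; scale explicitly so other clock rates also work. */
        printf("cycles = %llu, elapsed = %llu ns\n",
               cycles, cycles * 1000000000ULL / CPU_FREQ_HZ);
        return 0;
    }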
  • Can you verify the following statements:
    After core 0 initializes the device,

    you run from CCS,

    you run the exact same code on a single DSP core while the other cores are in idle mode,

    and the data resides in the same memory location, right?

    By the way:
    There is a priority scheme for accessing slaves over the bus, but since you run a single DSP core it should not matter.
    Also, CCS accesses the different cores (for example, when there is a printf) in a different order, but again, I do not think that this is what matters in your case.
    So verify the first two statements and we will continue from there.

    Ran
  • It was run on a single core.
    The other cores were not running while core 0 was running.
    Similarly, the code was run on only one core.

    I ran the same code,
    and the data resides in the same memory location.

    Can you give me a test code to compare the performance?

    I would like to know the normal range of performance variation.
  • You should not see any difference in the time consumption in your setup. Something must be different between core 0 and core 1.

    I attach a document and C code and assembly code.  If you build the project per the instructions in the document you will be able to verify the frequency of the core.  Run the code on different cores and see if you see a difference.  If there is no difference, then the problem is in the memory configuration or some other configuration.

    Ran

    /cfs-file/__key/communityserver-discussions-components-files/791/2350.delayRoutine.sa

    /cfs-file/__key/communityserver-discussions-components-files/791/8154.mainClock.c

    /cfs-file/__key/communityserver-discussions-components-files/791/4478.Verifying-the-clock-rate-and-the-clock-function-of-C66XX-cores.docx

    Please report what you see

    Ran
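    The attached files are not reproduced here. For reference, an "overhead time" like the one printed by such a test is commonly obtained by timing an empty measurement, i.e. two back-to-back TSC reads with nothing in between. A minimal sketch of that idea (an assumption about the approach, not the actual contents of mainClock.c) follows.

    #include <c6x.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long long t1, t2;

        TSCL = 0;                     /* start the time-stamp counter */
        t1 = _itoll(TSCH, TSCL);
        t2 = _itoll(TSCH, TSCL);      /* nothing in between, so t2 - t1 is the read overhead */

        printf("overhead time is %llu\n", t2 - t1);
        return 0;
    }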

  • I ran your test code.

    I expected the values to be the same for all eight cores.

    However, it seems that the cores do not all behave the same.

    Here are the run results.

    [C66xx_0]   overhead time is   2
    [C66xx_1]   overhead time is   2
    [C66xx_2]   overhead time is   2
    [C66xx_3]   overhead time is   2
    [C66xx_4]   overhead time is   2
    [C66xx_5]   overhead time is   2
    [C66xx_6]   overhead time is   2
    [C66xx_7]   overhead time is   2
    [C66xx_0] t1  -2118105966  t1H  42  t2  -708040364 t2H 44   
    [C66xx_0] times passes (in milliseconds) 1.000000e+01
    [C66xx_2] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_2] times passes (in milliseconds) 1.000000e+01
    [C66xx_1] t1  2127169774  t1H  27  t2  -757731920 t2H 29   
    [C66xx_1] times passes (in milliseconds) 1.000000e+01
    [C66xx_3] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_3] times passes (in milliseconds) 1.000000e+01
    [C66xx_4] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_4] times passes (in milliseconds) 1.000000e+01
    [C66xx_5] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_5] times passes (in milliseconds) 1.000000e+01
    [C66xx_6] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_6] times passes (in milliseconds) 1.000000e+01
    [C66xx_7] t1  6153  t1H  0  t2  1410071763 t2H 2   
    [C66xx_7] times passes (in milliseconds) 1.000000e+01
    [C66xx_0]  t1  -708020784  t1H  44  t2  507733515 t2H 68   
    [C66xx_0] times passes (in milliseconds) 1.042950e+02
    [C66xx_0]  DONE !
    [C66xx_0]   DONE !
    [C66xx_0]   DONE !
    [C66xx_0]   DONE ! 
    [C66xx_0]  DONE ! 
    [C66xx_2]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_2] times passes (in milliseconds) 1.000000e+02
    [C66xx_2]  DONE !
    [C66xx_2]   DONE !
    [C66xx_2]   DONE !
    [C66xx_2]   DONE ! 
    [C66xx_2]  DONE ! 
    [C66xx_1]  t1  -757712370  t1H  29  t2  458041929 t2H 53   
    [C66xx_1] times passes (in milliseconds) 1.042950e+02
    [C66xx_1]  DONE !
    [C66xx_1]   DONE !
    [C66xx_1]   DONE !
    [C66xx_1]   DONE ! 
    [C66xx_1]  DONE ! 
    [C66xx_3]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_4]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_3] times passes (in milliseconds) 1.000000e+02
    [C66xx_4] times passes (in milliseconds) 1.000000e+02
    [C66xx_3]  DONE !
    [C66xx_4]  DONE !
    [C66xx_3]   DONE !
    [C66xx_4]   DONE !
    [C66xx_3]   DONE !
    [C66xx_4]   DONE !
    [C66xx_3]   DONE ! 
    [C66xx_4]   DONE ! 
    [C66xx_3]  DONE ! 
    [C66xx_4]  DONE ! 
    [C66xx_5]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_5] times passes (in milliseconds) 1.000000e+02
    [C66xx_5]  DONE !
    [C66xx_5]   DONE !
    [C66xx_5]   DONE !
    [C66xx_5]   DONE ! 
    [C66xx_5]  DONE ! 
    [C66xx_6]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_6] times passes (in milliseconds) 1.000000e+02
    [C66xx_6]  DONE !
    [C66xx_6]   DONE !
    [C66xx_6]   DONE !
    [C66xx_6]   DONE ! 
    [C66xx_6]  DONE ! 
    [C66xx_7]  t1  1410090721  t1H  2  t2  -1669122276 t2H 25   
    [C66xx_7] times passes (in milliseconds) 1.000000e+02
    [C66xx_7]  DONE !
    [C66xx_7]   DONE !
    [C66xx_7]   DONE !
    [C66xx_7]   DONE ! 
    [C66xx_7]  DONE ! 
    [C66xx_7] 

    Is this the difference between core 0 and core 1?

    Development Environment

    Code Composer Studio 5.1

    TMS320C6678

    RTSC Project

    IPC 1.24.3.32

    MCSDK 2.1.2.5

    MCSDK PDK 1.1.2.5

    SYS/BIOS 6.33.6.50

    XDCtools 3.23.4.60

    Compiler version TI v8.0.3

     

    Can you suggest anything else, or give me a different opinion?

     

  • I think that the differences between the cores may be because of the way the DDR is working. When you run the code from L2, do you get the same cycle counts or not?

    I will refer your question to our DDR expert and see if he can add anything.

    Ran
  • I understand that your end goal is to measure the performance variation of running a function on a C6678 core. The memory on the C6678 is hierarchical, and controlling the memory map configuration (linker command file and RTOS configuration) and what gets placed where can have a major effect on performance and on its run-to-run variation. There are multiple levels to this. I've listed a few here in order of increasing performance variation; a small cache-configuration sketch follows the list:

    1. Do not use caches at all, place all program and data in L1 level SRAM. Performance does not vary run to run, or core to core.

    2. Use L1P cache, place data in L1 level SRAM. Place code in local L2 SRAM. Performance does not vary run to run, or core to core. There can be some performance degradation because of L1P cache misses.

    3. Use L1P cache, place data in L1 level SRAM. Place code in shared L2 SRAM. Small performance variation is possible, depending on the activity of the other cores.

    ...

    N. The general-purpose processor model: place everything in "flat" memory in DDR and turn all caches on. Performance will vary based on the activity of other cores and the SoC, such as DDR refresh. DDR refresh is asynchronous to the core and can create a performance bubble of up to 1-2 microseconds, seen as stalls in the C66x core.
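    As an illustration of how the cache levels in the list above are typically switched on or off on a C66x core, here is a rough configuration sketch. The function and enum names are assumed from the C6678 Chip Support Library (csl_cacheAux.h) and should be checked against the CSL version actually installed.

    #include <ti/csl/csl_cacheAux.h>   /* CSL cache helpers (assumed API names) */

    /* Rough equivalent of cases 2/3 above: all of L1P used as cache,
       L1D and L2 left entirely as addressable SRAM (no cache). */
    void configure_caches(void)
    {
        CACHE_setL1PSize(CACHE_L1_32KCACHE);
        CACHE_setL1DSize(CACHE_L1_0KCACHE);
        CACHE_setL2Size(CACHE_0KCACHE);
    }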

    Now to your specific example. I'm focusing on the fact that there is run-to-run variation on just one core; I would look at this first, before comparing runs on different cores. The computational loop you included contains calls to multiple functions and direct accesses to data structures that are larger than the registers, so the placement of the data and program can affect the timing. Some suggestions on what could be done:

    - In which memory are your data structures (cd, el, azi, ...), the C stack, and the code placed? Sharing this (e.g. the .map file) would help in narrowing this down.

    - If your end goal is the lowest performance variation per core in a multicore use case, place all code and data in L2 SRAM (or even at the L1 level), or, for example, place code in MSMC SRAM (shared L2) and data in local L2. For larger data sets, use DMA paging to read and write data into L2 and/or L1 level memory. A placement sketch follows this list.

    - If your intention is to measure single-core performance variation with data and/or code in DDR, your results look reasonable.

    - For a typical C6678-based system, the effect of multiple cores and parallel DMAs sharing memory (MSMC SRAM, DDR) is the unavoidable source of performance variation. Placing the key signal processing kernels and other frequently called code in MSMC SRAM, placing the working memory of each task in local L2, and relying on the L1 caches and the prefetcher is often a good approach.
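    As a concrete illustration of the placement advice above, here is a hedged sketch using the TI compiler's DATA_SECTION and CODE_SECTION pragmas. The section names (.fastData, .fastCode), the memory range names (L2SRAM, MSMCSRAM), and the symbols AA_table and compute_p are illustrative assumptions that must be adapted to the project's actual linker command file.

    /* The section-to-memory mapping is an assumption; it has to match SECTIONS
       directives in the linker command file, for example:

           SECTIONS
           {
               .fastData > L2SRAM       (local, per-core L2)
               .fastCode > MSMCSRAM     (shared on-chip MSMC SRAM)
           }
    */
    #pragma DATA_SECTION(AA_table, ".fastData")
    double AA_table[90][3];                  /* hot working data kept out of DDR */

    #pragma CODE_SECTION(compute_p, ".fastCode")
    double compute_p(const double *row, double el_deg, double az_deg);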

      Pekka