How to make the 6455 run as fast as the results shown in the Cycle Benchmarks of VLIB

shiyan sun

Other Parts Discussed in Thread: TMS320C6455

DSP: TMS320C6455

Board: custom

Frequency: 1 GHz

VLIB: vlib_c64Px_3_2_1_0

CCS: 5.5

Compiler: C6000 7.4.14, DEBUG configuration

Command: -mv64+ --abi=coffabi -O3 --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.14/include" --include_path="C:/Users/Hardware/Desktop/TLD_C6455_20150809_1/inc" --define=c6455 --display_error_number --diag_warning=225 --diag_wrap=off

Code:

int16_t *gradx = memalign(8, width * height * sizeof(int16_t));
assert(gradx != NULL);
int16_t *grady = memalign(8, width * height * sizeof(int16_t));
assert(grady != NULL);

uint8_t *previousImage = memalign(CACHE_L1P_LINESIZE, TLD_IMG_WIDTH * TLD_IMG_HEIGHT * sizeof(uint8_t));
assert(previousImage != NULL);

uint32_t a = TSCL;   /* TSCL is an unsigned 32-bit counter */

VLIB_xyGradients(previousImage, gradx + width + 1, grady + width + 1, width, height - 1);

// width = 512, height = 384, pt = width * height = 196608

uint32_t b = TSCL - a;

Result: b = 2,755,536 cycles (about 14 cycles/pt), which is much slower than the figure shown in the Cycle Benchmarks (average: 1 cycle/pt).
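A small sketch of the timing arithmetic, assuming a free-running 32-bit counter like TSCL. Because TSCL is an unsigned 32-bit register, storing the reads in uint32_t keeps the subtraction well-defined even if the counter wraps between the two reads. The helper names (elapsed_cycles, cycles_per_point, cycles_to_us) are hypothetical; the TSCL lines in the comment assume the TI C6000 compiler and its <c6x.h> header.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two reads of a free-running 32-bit counter such as
 * TSCL. Unsigned subtraction stays correct across one wrap of the counter. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Average cycles per processed point. */
static double cycles_per_point(uint32_t cycles, uint32_t points)
{
    assert(points > 0);
    return (double)cycles / (double)points;
}

/* Wall-clock time in microseconds at a given core clock in MHz. */
static double cycles_to_us(uint32_t cycles, double core_mhz)
{
    return (double)cycles / core_mhz;
}

/* On the target (TI C6000 compiler), the reads would look roughly like:
 *
 *   #include <c6x.h>
 *   TSCL = 0;                  // on C64x+, any write to TSCL starts the counter
 *   uint32_t t0 = TSCL;
 *   VLIB_xyGradients(...);
 *   uint32_t cycles = elapsed_cycles(t0, TSCL);
 */
```

With the numbers from the post, 2,755,536 cycles over 512 × 384 points is about 14.0 cycles/pt, or roughly 2.76 ms at 1 GHz.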

Is there anything wrong, or something I have missed?

Best regards

over 8 years ago

shiyan sun over 8 years ago

Also, all data is located in L2SRAM.

Titusrathinaraj Stalin over 8 years ago

Hi,
A few things make code run faster:
1) CPU frequency
2) Running the code from internal RAM
3) Code optimization

You are already running at a high frequency and from internal RAM, so you can try to optimize the code.

shiyan sun over 8 years ago in reply to Titusrathinaraj Stalin

Hi!
Thanks for your advice.
But as you can see, there is just one function in my code, and it belongs to VLIB,
so I don't think I can optimize my code any further.
Besides, I have found that in most cases the function takes about 2,700,000 cycles, but sometimes it takes only 400,000 cycles,
and the result is still correct. Is there anything else that can affect a VLIB function, such as interrupts or peripheral initialization?

Best regards!
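One way to tell whether a spread like 2,700,000 versus 400,000 cycles comes from cold caches or from interrupts is to time several back-to-back calls and compare the first run with the rest. Below is a minimal sketch, assuming the per-call cycle counts are collected with TSCL into an array; the summarize helper and timing_summary type are hypothetical. The target-side comment uses the TI C6000 intrinsics _disable_interrupts and _restore_interrupts, and disabling interrupts there is meant only as a diagnostic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct {
    uint32_t first;   /* first run, typically with cold caches */
    uint32_t min;
    uint32_t median;  /* upper median when n is even */
    uint32_t max;
} timing_summary;

static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a;
    uint32_t y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Summarize n cycle counts. Sorts a copy so the caller's order (and thus the
 * first sample) is preserved. Returns all zeros if the copy cannot be made. */
static timing_summary summarize(const uint32_t *samples, size_t n)
{
    timing_summary s = {0, 0, 0, 0};
    uint32_t *sorted;
    size_t i;

    assert(samples != NULL && n > 0);
    sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL)
        return s;
    for (i = 0; i < n; i++)
        sorted[i] = samples[i];
    qsort(sorted, n, sizeof *sorted, cmp_u32);

    s.first = samples[0];
    s.min = sorted[0];
    s.median = sorted[n / 2];
    s.max = sorted[n - 1];
    free(sorted);
    return s;
}

/* On the target, each sample might be taken like this:
 *
 *   unsigned int csr = _disable_interrupts();
 *   uint32_t t0 = TSCL;
 *   VLIB_xyGradients(...);
 *   samples[i] = TSCL - t0;
 *   _restore_interrupts(csr);
 */
```

If only the first sample is large, cache warm-up is the likely cause; if large values keep appearing at random with interrupts enabled and vanish with them disabled, interrupts are the more likely suspect.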

Jesse Villarreal over 8 years ago

The cycle performance numbers reported in the release are obtained by running on the C64+ core simulator, which essentially assumes the best (perhaps unrealistic) case, where all memory accesses go to L1 RAM (no cache misses). This is done to show the best-case performance that the cycle counts may converge to as the memory configuration is optimized on different chips.

This kernel in particular is load/store bound, so memory configuration optimizations can help you get closer to that best case. You said that all data is located in L2SRAM. What about the code and stack? If your application allows, I suggest the following:

1. Place the kernel code in L2SRAM as well (you can check your .map file to confirm this).
2. Ensure your L1 caches are enabled
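Besides reading the .map file, the placement of code, data, and stack can be sanity-checked at run time by testing their addresses against the L2SRAM window. A minimal sketch, assuming the C6455 L2 SRAM is 2 MB starting at 0x00800000 (verify against the device data sheet and your own linker command file); L2SRAM_BASE, L2SRAM_SIZE, addr_in_region, and in_l2sram are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed C6455 L2 SRAM window: 2 MB starting at 0x00800000. */
#define L2SRAM_BASE 0x00800000u
#define L2SRAM_SIZE 0x00200000u

/* Nonzero if addr lies in [base, base + size). Comparing the offset avoids
 * overflow when base + size would wrap. */
static int addr_in_region(uintptr_t addr, uintptr_t base, uintptr_t size)
{
    return addr >= base && (addr - base) < size;
}

static int in_l2sram(uintptr_t addr)
{
    return addr_in_region(addr, L2SRAM_BASE, L2SRAM_SIZE);
}

/* Possible use on the target. Casting a function pointer to an integer is
 * implementation-defined in C, but should be meaningful on the C6000's flat
 * 32-bit address space:
 *
 *   int on_stack;
 *   printf("code  in L2: %d\n", in_l2sram((uintptr_t)&VLIB_xyGradients));
 *   printf("data  in L2: %d\n", in_l2sram((uintptr_t)gradx));
 *   printf("stack in L2: %d\n", in_l2sram((uintptr_t)&on_stack));
 */
```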

Another thing to consider is the mismatch between core speed and memory speed. For a load/store-bound kernel, if the memory and interconnect speeds stay the same and you increase the core speed, the cycles/pt will also increase, because the core stalls for more cycles while waiting the same amount of time for instructions or data to arrive. This is one of the reasons to make sure all data and instructions are in internal RAM.
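To make the frequency argument concrete, here is a toy model with hypothetical parameters (not measured C6455 numbers): each point costs a fixed number of core cycles of work plus a memory stall whose length is fixed in nanoseconds. Converting that fixed stall time into cycles multiplies it by the core clock, which is why cycles/pt rises with frequency. The function names are illustrative only.

```c
#include <assert.h>

/* Toy model: each point needs compute_cycles of core work plus memory stalls
 * lasting stall_ns nanoseconds. A fixed stall time costs more core cycles at
 * a higher clock (stall cycles = stall_ns * core_ghz). */
static double model_cycles_per_point(double compute_cycles,
                                     double stall_ns, double core_ghz)
{
    return compute_cycles + stall_ns * core_ghz;
}

/* Wall-clock time per point in nanoseconds for the same model. */
static double model_ns_per_point(double compute_cycles,
                                 double stall_ns, double core_ghz)
{
    assert(core_ghz > 0.0);
    return model_cycles_per_point(compute_cycles, stall_ns, core_ghz) / core_ghz;
}
```

With 1 cycle of work and a 4 ns stall per point, doubling the clock from 0.5 GHz to 1 GHz raises the cost from 3 to 5 cycles/pt, while the time per point only drops from 6 ns to 5 ns.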

I think that when everything is in L2 and the L1 caches are enabled, a more realistic best case may be closer to 2 cycles/pt.
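For scale, applying 2 cycles/pt to the 512 × 384 image from the original post gives:

```latex
2\,\frac{\text{cycles}}{\text{pt}} \times 196{,}608\,\text{pt} = 393{,}216\,\text{cycles} \approx 0.39\,\text{ms at } 1\,\text{GHz}
```

This is in the same range as the occasional runs of about 400,000 cycles mentioned earlier in the thread.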

Jesse

shiyan sun over 8 years ago in reply to Jesse Villarreal

Yes, thank you very much!
The problem was that the cache size was too small.
It seems that the cache size is zero after reset, even though L1PRAM and L1DRAM are set to 32 KB in the CMD file:

L1PRAM: o = 0x00E00000 l = 0x00008000 /* 32kB L1 Program SRAM/CACHE */
L1DRAM: o = 0x00F00000 l = 0x00008000 /* 32kB L1 Data SRAM/CACHE */
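The MEMORY entries in a linker command file only describe address ranges for the linker; they do not configure the cache. On C64x+ devices the L1 cache/SRAM split is selected at run time through the L1PCFG and L1DCFG registers. Below is a sketch, assuming the register addresses and MODE encoding given in the C64x+ Megamodule Reference Guide (verify them for your device); l1_cache_kb and l1_enable_full_cache are hypothetical names, and the register writes are compiled only when the TI compiler's predefined _TMS320C6X macro is present. If L1 is switched entirely to cache, the linker must no longer place any sections in the L1PRAM/L1DRAM regions.

```c
#include <assert.h>
#include <stdint.h>

/* C64x+ megamodule cache configuration registers (verify for your device). */
#define L1PCFG_ADDR 0x01840020u
#define L1DCFG_ADDR 0x01840040u

/* MODE field (bits 2:0) of L1PCFG/L1DCFG: 0 = 0 KB, 1 = 4 KB, 2 = 8 KB,
 * 3 = 16 KB, 4 = 32 KB, 5-7 = maximum cache (32 KB on a 32 KB L1).
 * Whatever is not cache remains addressable SRAM. */
static unsigned l1_cache_kb(uint32_t cfg)
{
    static const unsigned kb[8] = {0, 4, 8, 16, 32, 32, 32, 32};
    return kb[cfg & 0x7u];
}

#ifdef _TMS320C6X
/* Target-only: switch both L1 memories to maximum cache. Each register is
 * read back after the write so the mode change completes before continuing. */
static void l1_enable_full_cache(void)
{
    volatile uint32_t *l1pcfg = (volatile uint32_t *)L1PCFG_ADDR;
    volatile uint32_t *l1dcfg = (volatile uint32_t *)L1DCFG_ADDR;

    *l1pcfg = 7u;
    (void)*l1pcfg;
    *l1dcfg = 7u;
    (void)*l1dcfg;
}
#endif
```

Reading the registers back at startup and printing l1_cache_kb() of each value is a quick way to confirm the cache size the code actually runs with.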

Processors forum
