How to make the 6455 run as fast as the results shown in the VLIB cycle benchmarks

DSP: TMS320C6455

Board: custom

Frequency: 1 GHz

VLIB: vlib_c64Px_3_2_1_0

CCS: 5.5

Compiler: 7.4.14, DEBUG

Command:  -mv64+ --abi=coffabi -O3 --include_path="C:/ti/ccsv5/tools/compiler/c6000_7.4.14/include" --include_path="C:/Users/Hardware/Desktop/TLD_C6455_20150809_1/inc" --define=c6455 --display_error_number --diag_warning=225 --diag_wrap=off

function:

int16_t *gradx = memalign(8,width*height*sizeof(int16_t));
assert(gradx != NULL);
int16_t *grady = memalign(8,width*height*sizeof(int16_t));
assert(grady != NULL);

uint8_t *previousImage = memalign(CACHE_L1P_LINESIZE,TLD_IMG_WIDTH * TLD_IMG_HEIGHT * sizeof(uint8_t));
assert(previousImage!=NULL);

int a = TSCL;

VLIB_xyGradients(previousImage, gradx + width + 1, grady + width + 1, width, height-1);

//width = 512  height = 384 pt = width * height = 196608

int b = TSCL - a;

Result: b = 2,755,536 cycles (about 14 cycles/pt), which is much slower than the figure shown in the cycle benchmarks (avg: 1 cycle/pt).

Is there anything wrong or missing?

Best regards
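
The cycles-per-point figure in the question can be reproduced with a small calculation. The following is a minimal sketch, not part of the original post: the helper names `elapsed_cycles` and `cycles_per_point` are hypothetical, and the two counter samples are assumed to be 32-bit reads such as the `TSCL` values `a` and `b` above.

```c
#include <assert.h>
#include <stdint.h>

/* Elapsed cycles between two 32-bit counter samples (for example two reads
 * of TSCL). Unsigned subtraction stays correct across one counter wrap. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Average cycles per output point for one kernel run. */
double cycles_per_point(uint32_t cycles, uint32_t points)
{
    assert(points > 0);
    return (double)cycles / (double)points;
}
```

With width = 512 and height = 384, `cycles_per_point(2755536u, 512u * 384u)` comes out at roughly 14.0, matching the figure quoted in the question. Storing the counter samples in an unsigned type keeps the subtraction well defined if the counter wraps between the two reads.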

  • Also, all data is located in L2SRAM.
  • Hi,
    Some of the things that make code run faster:
    1) CPU frequency
    2) Running the code from internal RAM
    3) Code optimization

    Since you are already running at a high clock speed and from internal RAM, you can try optimizing the code.
  • Hi!
    Thanks for your advice.
    But as you can see, there is just one function in my code, and that function belongs to VLIB,
    so I don't think I can optimize my code any further.
    Besides, I have found that in most cases the function takes 2,700,000 cycles, but sometimes it takes just 400,000 cycles
    and the result is still correct. Is there anything else that can affect a VLIB function, such as interrupts or peripheral initialization?

    Best regards!
  • The cycle performance numbers reported in the release are obtained by running on the C64+ core simulator, which essentially assumes the best (perhaps unrealistic) case, in which all memory accesses hit L1 RAM (no cache misses). This is done to give an idea of what the cycle performance may converge to in the best case as different memory configuration optimizations are made on different chips.

    This kernel in particular is load/store bound, so memory configuration optimizations can help you get closer to this point. You said that all data is located in L2SRAM. What about the code and stack? I suggest (if your application allows), the following:

    1. Place the kernel code in L2SRAM as well (you can check your .map file to confirm this)
    2. Ensure your L1 caches are enabled

    Another thing to consider is the discrepancy between the core speed and the memory speed. For a load/store-bound kernel, if the memory and interconnect speed stays the same and you increase the core speed, the core cycles/pt will also increase, because the core is stalled for more cycles while waiting the same amount of time for instructions or data to arrive. This is one of the reasons to make sure all data and instructions are in RAM.

    I think that when everything is in L2 and the L1 caches are enabled, a more realistic best case may be closer to 2 cycles/pt.

    Jesse
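
The point about core speed versus memory speed in the reply above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function name `modeled_cycles_per_point` and the numbers used with it are hypothetical and are not measurements from the C6455 or figures from VLIB.

```c
#include <assert.h>

/* Toy model: each point needs `core_cycles` cycles of pure computation plus
 * `mem_wait_ns` nanoseconds of waiting on memory. The wait is fixed in time,
 * so it costs more core cycles as the core clock rises.
 * At 1 GHz one cycle lasts 1 ns, so stall cycles = wait in ns * clock in GHz. */
double modeled_cycles_per_point(double core_cycles, double mem_wait_ns,
                                double core_ghz)
{
    assert(core_ghz > 0.0);
    return core_cycles + mem_wait_ns * core_ghz;
}
```

For an assumed 1 compute cycle and 10 ns of memory wait per point, the model gives 6 cycles/pt at 0.5 GHz and 11 cycles/pt at 1 GHz. The time per point still drops slightly (12 ns versus 11 ns), but the cycle count nearly doubles, which is why a memory-bound kernel looks worse in cycles/pt on a faster core.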
  • Yes, thank you very much!
    It is because the cache size is too small.
    It seems that the cache size is zero after reset, even though L1PRAM and L1DRAM are set to 32 KB in the CMD file:

    L1PRAM: o = 0x00E00000 l = 0x00008000 /* 32kB L1 Program SRAM/CACHE */
    L1DRAM: o = 0x00F00000 l = 0x00008000 /* 32kB L1 Data SRAM/CACHE */
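
The linker command file only describes the memory ranges; the L1 cache size itself is selected at run time. Below is a minimal sketch of one way to do that, assuming the C64x+ megamodule register addresses (L1PCFG at 0x01840020, L1DCFG at 0x01840040) and the mode encoding shown in the comments. Both should be verified against the megamodule reference guide for your device. The helper names `l1_mode_for_kb` and `set_l1_cache_kb` are hypothetical.

```c
#include <stdint.h>

/* Assumed C64x+ megamodule register addresses; verify for your device. */
#define L1PCFG_ADDR 0x01840020u
#define L1DCFG_ADDR 0x01840040u

/* Map a requested L1 cache size in KB to the assumed mode field encoding
 * (0 = cache off, 1 = 4 KB, 2 = 8 KB, 3 = 16 KB, 4 = 32 KB).
 * Returns -1 for a size the encoding does not cover. */
int l1_mode_for_kb(unsigned kb)
{
    switch (kb) {
    case 0:  return 0;
    case 4:  return 1;
    case 8:  return 2;
    case 16: return 3;
    case 32: return 4;
    default: return -1;
    }
}

#ifdef _TMS320C6X  /* predefined by the TI C6000 compiler; skipped on a host build */
/* Hypothetical helper: set both L1 caches to the same size. Returns 0 on success. */
int set_l1_cache_kb(unsigned kb)
{
    volatile uint32_t *l1pcfg = (volatile uint32_t *)L1PCFG_ADDR;
    volatile uint32_t *l1dcfg = (volatile uint32_t *)L1DCFG_ADDR;
    int mode = l1_mode_for_kb(kb);

    if (mode < 0)
        return -1;
    *l1pcfg = (uint32_t)mode;
    (void)*l1pcfg;  /* read back so the mode change completes before continuing */
    *l1dcfg = (uint32_t)mode;
    (void)*l1dcfg;
    return 0;
}
#endif
```

Calling `set_l1_cache_kb(32)` early in `main()` would request 32 KB of L1P and L1D cache. Any code or data sections that the linker placed in L1PRAM or L1DRAM would no longer be valid once that memory is used as cache, so those ranges should stay empty in the .map file if this approach is used. TI's chip support library may offer equivalent calls in `csl_cache.h`; check the CSL version you have installed.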