VLIB 2.2 (version c64x+ elf) performance in the DM8127 C674x



Hi dear forum members,

I am using the C674x DSP of DM8127 in the IPNC Camera of APPRO (EVM).

The module comes with vlib 1.0 and I moved to vlib 2.2 as in version 1.0 there were many issues (bugs...).

I would like to ask about three main things:

1. Performance of dilate / erode and pyramid build functions (VLIB_imagePyramid8, VLIB_gauss5x5PyramidKernel_8, VLIB_erode_bin_square,...).

     When we run the functions in x86 ref we get some valid results (quality-wise). Then, we run them on the DSP and we get results like:

     - Dilate: 500 usec (micro) and sometimes for the same image size 64x124 (but in a different flow) 20 usec.

     - VLIB_gauss5x5PyramidKernel_8 on 640x480 image (following the x86 as is which performs 3x(img_height) calls): 160509 usec.

     - VLIB_imagePyramid8 on 640x480 image (single call like in the ref code examples): 18069 usec.

I ran the flow at least 5 times and I always get these differing results. I allocate all the test memory via Utils_memAlloc (I can't see any difference in utils_mem.c between it and Utils_memAlloc_cached) and I use Utils_getCurTimeInUsec to measure the execution time (I presumed that even if this function isn't atomic, my intervals are wide enough that it won't affect the validity of my measurements).

What can cause such a performance penalty? The VLIB 2.2 documentation states, for example, for VLIB_gauss5x5PyramidKernel_8: "The compute-only performance with all buffers in L1 is 4.9 cycles per output value". For my 500 MHz DSP and a 640x480 image I would therefore expect something around (640x480x4.9/500e6)*1e6 =~ 3000 usec and not 18000!
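The estimate above can be written out as a small sketch. The function names are illustrative only (they are not part of VLIB), and the calculation assumes one output value per input pixel, exactly as in my arithmetic above:

```c
#include <assert.h>

/* Compute-only estimate in microseconds:
 * (number of outputs) * (cycles per output) / (clock in Hz) * 1e6.
 * Assumes one output per pixel, as in the post's own arithmetic. */
double expected_usec(unsigned width, unsigned height,
                     double cycles_per_output, double clock_hz)
{
    assert(clock_hz > 0.0);
    double cycles = (double)width * (double)height * cycles_per_output;
    return cycles / clock_hz * 1e6;
}

/* Ratio between a measured time and the compute-only estimate. */
double slowdown(double measured_usec, double estimate_usec)
{
    assert(estimate_usec > 0.0);
    return measured_usec / estimate_usec;
}
```

Calling expected_usec(640, 480, 4.9, 500e6) gives roughly 3010 usec, so the 18069 usec measurement is about 6 times the estimate and the 160509 usec measurement is about 53 times the estimate.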

Can it be that I'm missing something with cache optimizations, shared memory and such... ?

2. Atomic / valid profiling - can I rely on the result of Utils_getCurTimeInUsec (i.e. does it run with interrupts locked)?

I'm asking because I can't understand why dilate gives such different results (consistently) in different flows on the same image dimensions.
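One way to separate timer jitter from a systematic slowdown is to time the same call several times and compare the minimum and maximum. Below is a minimal sketch of such a harness, not IPNC code: the clock is passed in as a function pointer, host_clock_usec is a stand-in built on the standard clock(), and busy_work is a dummy workload. On the target, a wrapper around Utils_getCurTimeInUsec would presumably be passed instead (I am not assuming its exact signature here).

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef unsigned long long usec_t;
typedef usec_t (*clock_fn)(void);      /* returns current time in usec */
typedef void (*work_fn)(void *ctx);    /* the code under test */

typedef struct {
    usec_t min_usec;   /* fastest run: closest to undisturbed execution */
    usec_t max_usec;   /* slowest run: includes any interference */
} timing_t;

/* Host stand-in clock (assumption: replaced by a target timer on the DSP). */
usec_t host_clock_usec(void)
{
    return (usec_t)clock() * 1000000ULL / (usec_t)CLOCKS_PER_SEC;
}

/* Dummy workload standing in for a call such as the dilate kernel. */
void busy_work(void *ctx)
{
    volatile unsigned long sink = 0;
    unsigned long i, n = *(unsigned long *)ctx;
    for (i = 0; i < n; ++i)
        sink += i;
}

/* Time `runs` executions of `work` and keep the fastest and slowest. */
timing_t time_runs(work_fn work, void *ctx, clock_fn now, int runs)
{
    timing_t t = { (usec_t)-1, 0 };
    int i;
    assert(work != NULL && now != NULL && runs > 0);
    for (i = 0; i < runs; ++i) {
        usec_t start = now();
        work(ctx);
        usec_t elapsed = now() - start;
        if (elapsed < t.min_usec) t.min_usec = elapsed;
        if (elapsed > t.max_usec) t.max_usec = elapsed;
    }
    return t;
}
```

If the minimum and maximum are close within one flow but differ widely between flows, the difference is probably systematic (for example, where the buffers live) rather than an artifact of the timer being interrupted.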

3. Listing the disassembly of the vlib file (like with objdump).

I would like to know which tool and command line to use to examine the c64/67 instructions in an object file (specifically a *.lib in DWARF/ELF format in my case).

I tried looking into appro_dev/Source/ti_tools/cgt6x_7_3_5 and its docs but found nothing valid.

Thanks in advance for any help,

Regards,


Roei

  • Roei,

    The DM814x Forum is probably the best place for your question, and not this C67x Single Core DSP Forum. I will request a moderator to move this thread there for you to get the best answers.

    Have you received any references to TI Third Party Developers who have support for the IPNC? Or do you have any direct contacts at TI who will help with your questions?

    Regards,
    RandyP

  • Just to add to my previous request, I should also mention that when I run vlib 1.0 on the same functions, I get the exact same performance issues...

    If anyone could suggest what is wrong here - it would be great.

    Also, out of pure curiosity, can the linker parameters affect the performance of a closed lib like vlib? (My hunch was absolutely NO, as the library is already compiled for a release version.)

    Regards,

    Roei

  • To help others who might be interested: the problem was mainly in cache usage and an error in the MAR bits of the IPNC. The core problem was that the data resided in the SR2 region (0xB3...), which wasn't cached, while TILER is fully cached and unused in my flow. Therefore, I moved to using the TILER pool directly and got a boost of around x8 to x20 (depending on the BW usage) in the code.

    Regards,

    Roei
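For anyone checking a similar setup, here is a minimal sketch of how an external address maps to its MAR register. It rests on my understanding of the C64x+/C674x megamodule (MAR0 at 0x01848000, one 32-bit MAR per 16 MB range, bit 0 being the "permit caching" bit); please verify these values against the megamodule reference guide for your device. The helper names are illustrative and not from any TI library.

```c
#include <assert.h>
#include <stdint.h>

#define MAR_BASE_ADDR 0x01848000u  /* assumed address of MAR0 */
#define MAR_PC_BIT    0x1u         /* assumed "permit caching" bit */

/* Each MAR covers 16 MB, so the index is the top 8 address bits. */
unsigned mar_index(uint32_t addr)
{
    return (unsigned)(addr >> 24);
}

/* Address of the MAR register that controls `addr`. */
uint32_t mar_reg_addr(uint32_t addr)
{
    return MAR_BASE_ADDR + 4u * (uint32_t)mar_index(addr);
}

/* Nonzero if a MAR value read from the device allows caching. */
int mar_is_cacheable(uint32_t mar_value)
{
    return (mar_value & MAR_PC_BIT) != 0;
}
```

Because only the top 8 bits matter, any address beginning with 0xB3 falls under MAR179 (register address 0x018482CC under the assumptions above). Reading that register in a debugger and testing bit 0 shows whether the region is cacheable.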