DM6446 (C64x+ DSP) performance issue using DVSDK/Codec Engine

Hi,

I have my own XDAIS-compliant IUNIVERSAL algorithm (MYALG) running on the DSP of a DM6446 SoC.
I need real-time processing.
To reach that, a single MYALG process call must take less than 30 ms.

Processing time (the algorithm's performance) is calculated from TSCL (there is no overflow, so I don't need to use TSCH).
The TSC is the DSP's own time-stamp counter, so it seems to be a good timer for measuring execution cycles.
In addition, the processing time is measured by MYALG itself, inside its ALG_process() function, so only the DSP running time is measured.
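
(For reference, the measurement follows the usual TSCL pattern; the sketch below is only illustrative and is not the exact MYALG code.)

    #include <c6x.h>    /* TI C6000 compiler header: declares TSCL/TSCH control registers */

    /* Illustrative only: how the cycle count is taken around the algorithm body,
       inside the process function itself. */
    unsigned int measureProcessCycles(void)
    {
        unsigned int start, stop;

        TSCL = 0;               /* any write starts the free-running counter */
        start = TSCL;

        /* ... the actual algorithm work ... */

        stop = TSCL;
        return stop - start;    /* DSP cycles spent; valid while the delta stays below 2^32 */
    }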

After some digging (due to some performance issues), I found some strange processing times.
The simulator timing, the JTAG emulator timing and the real run (in the DVSDK environment) are quite different.

I could understand the difference between the simulator and the other setups, given the limits of simulation accuracy (although I tried to set the best simulator parameters).
But I don't understand the difference between the JTAG emulator run and the real run.
Processing times are 2-3 times longer in the real environment than with the JTAG emulator.
Once again: these processing times measure only the DSP running inside ALG_process()!

So I need to do some optimization work, but I don't know where to start.
I would appreciate it if anybody could explain the reason for this performance drop.

Should I take some configuration steps, or do I have to accept these performance ratios?

Thanks,
Peter



Details:

A./ Algorithm development

    CCS v5.3.0.00090
    Texas Instruments Simulators: v5.3.3.0
    JTAG Emulator: Blackhawk LAN560
    Board's DSP core freq: 594 MHz
    Board's DDR2 freq: 162 MHz

The measured benchmarks are:

A.1./ When DSP cache is disabled:
     92.67 ms ( 55,043,713 sysclk) - Simulator: C64x+ CPU Cycle Accurate Simulator
    316.60 ms (188,061,064 sysclk) - Emulator: running on board using JTAG emulator

A.2./ When DSP cache is enabled (L1P=32K, L1D=32K, L2=64K):
     22.07 ms ( 13,109,186 sysclk) - Simulator: C64x+ CPU Cycle Accurate Simulator
     31.70 ms ( 18,829,160 sysclk) - Simulator: DM6446 Device Cycle Accurate (600 MHz core)
     27.46 ms ( 16,310,579 sysclk) - Emulator: running on board using JTAG emulator

A.3./ Simulator without cache simulation:
     13.07 ms (  7,765,643 sysclk) - C64x+ CPU Cycle Accurate Simulator (no cache simulation)
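
(For reference, a minimal sketch of how these cache sizes can be set with the DSP/BIOS 5.x BCACHE API is shown below; the DDR2 base address and cacheable range are DM6446 examples, not my exact code.)

    #include <std.h>
    #include <bcache.h>

    /* Example only: set L1P/L1D/L2 to the "A.2" sizes and make external DDR2
       cacheable (DM6446 DDR2 starts at 0x80000000). */
    Void enableDspCache(Void)
    {
        BCACHE_Size size;

        size.l1psize = BCACHE_L1_32K;
        size.l1dsize = BCACHE_L1_32K;
        size.l2size  = BCACHE_L2_64K;
        BCACHE_setSize(&size);

        /* Mark 256 MB of external memory as cacheable via the MAR registers */
        BCACHE_setMar((Ptr)0x80000000, 0x10000000, BCACHE_MAR_ENABLE);
    }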


B./ Testing in final environment

    DVSDK: dvsdk_wince_01_11_00_03_patch_01
    DSP BIOS: 5_41_11_38
    Code Generation Tools: C6000_7_3_3
    XDCTOOLS: 3_23_00_32
    Codec Engine: 2_26_01_09
    Framework Components: 2_26_00_01
    DMAI: 1_27_00_06
    XDAIS: 6_26_01_03
    WinCE Utils: 1_01_00_01
    BIOS utils: 1_02_02
    DSPLink: 1_65_00_03
    CE utils: v1_07
    Algorithms opened in the application (in the codec server): DM6446-MPEG4 encode, JPEG encode, MYALG
    OS: WinCE 6 R3
    Board's DSP core freq: 594 MHz
    Board's DDR2 freq: 162 MHz

The measured benchmarks are:

B.1./ When DSP cache is disabled:
    575.10 ms (341,609,968 sysclk) - Free running: on board, within the framework (Codec Engine, as part of a codec server)

B.2./ When DSP cache is enabled (L1P=32K, L1D=16K, L2=64K):
     75.97 ms ( 45,124,660 sysclk) - Free running: on board, within the framework (Codec Engine, as part of a codec server)



  • Hi,

    Thanks for your post.

    Building your algorithm with -g (debug info) also causes problems in calculating the worst-case stack usage at present. This is usually not a major issue, since -g inhibits optimization and therefore most production libraries are not built with -g.

    Please check whether you are able to include the FC headers as below:

    # Your FC installation dir: set to the correct path for your build environment
    FC_INSTALL_DIR=/home/user/my_workspace/Project/~/~/~/~/RDK2.0/ti_tools/framework_components/framework_components_3_22_02_08

    # XDAIS_INSTALL_DIR is assumed to point at your XDAIS installation in the same way
    XDCPATH=$(XDAIS_INSTALL_DIR)/packages;$(FC_INSTALL_DIR)/packages

    To generate an XDAIS-compliant project successfully with XDM GenAlg version 7.21, kindly follow the FAQ link below:

    http://processors.wiki.ti.com/index.php/XDM_GenAlg_Wizard_FAQ#How_can_I_check_that_my_generated_algorithm_is_XDAIS_compliant.3F

    After building the generated algorithm package, it can be tested for XDAIS compliance using the QualiTI compliance tool. Validating "Rule 20: must declare worst-case stack requirements" is part of being XDAIS compliant and is relevant to your performance question; kindly validate all of the XDAIS compliance tests shown in the QualiTI link below:

    http://processors.wiki.ti.com/index.php/QualiTI#Details_on_tests_performed

    Also, to understand insufficient-stack issues, kindly check the link below:

    http://processors.wiki.ti.com/index.php/Stack_issues#Libraries_built_with_-g

    Thanks & regards,

    Sivaraj K


  • Hi Sivaraj K,

    Thanks for your answer, but I have no problem with my algorithm, nor with the XDAIS and FC paths, nor with the QualiTI tools.
    (I know these utilities and options, and I am using them.)

    I'll check the -g option, but I think any stack problem would cause more fatal failures than just a performance drop...

    But I'm confused by the behaviour of the SoC and/or the Codec Engine itself.
    So I don't understand (yet) how the same algorithm can run with such different processing times (and only the DSP cost is measured!).

    When I use the JTAG emulator and the code runs without any breakpoint, the processing time is 16,310,579 sysclk.
    But when I run the same algorithm in the DVSDK/Codec Engine framework in the real environment (ARM working, VPBE active, etc.), the processing time is 45,124,660 sysclk.

    Of course, I could accept a few percent of overhead when running in the real environment compared to running only the algorithm within CCS and the JTAG emulator.

    So how is it possible to lose so much time that performance ends up 2.77 times slower?
    (e.g. DVSDK/Codec Engine framework overhead, more frequent DDR2 usage, the ARM, the VPBE or other active submodules of the SoC, etc.)
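
    (To separate the framework/IPC overhead from the DSP-only time, the whole VISA call can be timed on the ARM side and compared with the TSCL count reported from inside ALG_process(). A rough sketch, with error handling omitted and the handle assumed to come from an earlier UNIVERSAL_create(), is:)

    #include <stdio.h>
    #include <windows.h>                        /* GetTickCount() on WinCE 6 */
    #include <ti/sdo/ce/universal/universal.h>  /* Codec Engine IUNIVERSAL API */

    /* Sketch only: time the complete remote call (Codec Engine + DSPLink + DSP)
       and compare it with the TSCL cycle count measured inside MYALG. */
    static void timeOneProcessCall(UNIVERSAL_Handle hAlg,
                                   XDM1_BufDesc *inBufs, XDM1_BufDesc *outBufs,
                                   XDM1_BufDesc *inOutBufs,
                                   UNIVERSAL_InArgs *inArgs,
                                   UNIVERSAL_OutArgs *outArgs)
    {
        DWORD start, elapsedMs;

        start = GetTickCount();                 /* milliseconds since boot */
        UNIVERSAL_process(hAlg, inBufs, outBufs, inOutBufs, inArgs, outArgs);
        elapsedMs = GetTickCount() - start;

        /* elapsedMs is the end-to-end cost seen by the ARM application;
           the TSCL count is the DSP-only cost; the difference is the
           framework/IPC/stall overhead. */
        printf("UNIVERSAL_process: %lu ms end-to-end\n", (unsigned long)elapsedMs);
    }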

    Peter