
TDA4VMXEVM: Queries related to custom kernel

Part Number: TDA4VMXEVM

Hi TI,

I have tested the custom kernel in host emulation mode and on the C66x; however, the performance on the target is quite poor, even though I have only implemented a simple color conversion. The goal is first to implement and run some more complex functions (i.e., functions not available in TI-supported libraries), which are/will be written in C, using whatever computational resources are available on the TDA4 to complete the pipeline, and then to move on to optimizing those functions. With this background in mind, I have the following questions:

1. What are the most powerful units that can implement C-based functions and give performance similar to the host?

2. Regarding the ARM, I am a bit confused by the reference names used. If my core is ARM, what should the targets be? IPUs, R5Fs, VPAC_MSC1, A72, and so on ... which are better to pick at first? Any document that explains these naming conventions would also be helpful.

With best regards,

H.M. Owais

  • Hafiz Muhammad Owais said:
    1. What are the most powerful units that can implement C-based functions and give performance similar to the host?

    When you say "host" I assume you mean the A72 ARM, right?  It depends on the type of C code you are writing.  The purpose of our heterogeneous architecture is to provide different types of compute cores to service different types of algorithms at different power consumption.  The A72s are powerful but consume a lot of power compared to the other cores.  Having said that, if it is just C control or algorithm code with a lot of branching and conditions, then this is best done on the main ARMs.  If you want to do SIMD-type operations (the same operation across a large buffer), then optimized C code on the C66x or C7x would be good.

    If you are just doing a simple color conversion from memory to memory, then this will be mostly data-access bound, as the DSP will sit mostly idle waiting for memory.  To best take advantage of the power of these DSPs, loops should be written to perform several operations merged together, so as to minimize intermediate round trips to DDR ... or else operate in SIMD loops across a whole block or line, putting intermediate data in L2 SRAM for good cache locality and reduced delays in accessing DDR.
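To illustrate the "merged operations, one pass over the buffer" advice above, here is a minimal sketch (my own illustrative code, not TI's): a fused RGB-to-luma conversion that reads each pixel once and writes the result once, with no intermediate buffer and therefore no extra DDR round trip. The `restrict` qualifiers and the simple loop shape are the kind of thing that helps the C66x/C7x compiler software-pipeline and vectorize the loop.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative fused color conversion: one pass, no intermediate buffer.
 * Integer BT.601 luma weights scaled by 256: Y = (77R + 150G + 29B) >> 8.
 * `restrict` tells the compiler the buffers do not alias, which enables
 * software pipelining / vectorization on the C66x and C7x DSPs. */
void rgb_to_luma_fused(const uint8_t *restrict rgb,
                       uint8_t *restrict luma,
                       size_t num_pixels)
{
    for (size_t i = 0; i < num_pixels; i++) {
        uint32_t r = rgb[3 * i + 0];
        uint32_t g = rgb[3 * i + 1];
        uint32_t b = rgb[3 * i + 2];
        luma[i] = (uint8_t)((77u * r + 150u * g + 29u * b) >> 8);
    }
}
```

The same result computed as two passes (e.g., weight pass, then sum pass, through a scratch buffer in DDR) would roughly double the memory traffic for no extra compute, which is exactly the pattern to avoid on a memory-bound kernel.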

    You may also want to check whether the operations you want are already supported by the VPAC/DMPAC hardware accelerators (please reference the TIOVX user guide for more information on these operations).

    Hafiz Muhammad Owais said:
    2. Regarding the ARM, I am a bit confused by the reference names used. If my core is ARM, what should the targets be? IPUs, R5Fs, VPAC_MSC1, A72, and so on ... which are better to pick at first? Any document that explains these naming conventions would also be helpful.

    The A72s are the main "big" ARMs, typically running the HLOS.  IPU refers to the R5F "small" MCU-type ARMs, which run an RTOS and are mainly used for controlling the hardware accelerators with low-latency interrupts, but they can also be used for some compute.  The VPAC_*/DMPAC_* targets represent tasks (threads) reserved for running the VPAC and DMPAC hardware accelerators.
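As a concrete (non-compilable) sketch of how these target names are used in practice: in TIOVX you pick the core a node runs on via the standard OpenVX `vxSetNodeTarget()` call with one of the SDK's `TIVX_TARGET_*` strings. This fragment assumes the TIOVX SDK headers and an already-created `node`; check the target documentation linked below for the exact names available in your SDK version.

```c
/* Fragment only -- requires the TIOVX SDK; `node` is assumed to exist. */
#include <VX/vx.h>
#include <TI/tivx.h>

/* Run this node on C66x DSP core 1 ... */
vxSetNodeTarget(node, VX_TARGET_STRING, TIVX_TARGET_DSP1);

/* ... or on the A72 "host" ARM, if the kernel is registered for that target. */
vxSetNodeTarget(node, VX_TARGET_STRING, TIVX_TARGET_A72_0);
```

The call only succeeds if your custom kernel was registered on that target, which is why the kernel-to-target mapping document linked below is worth reading alongside this.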

    Here is the documentation for explanation of targets:

    http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/TIOVX_ADD_TARGET.html#TIOVX_TARGET_EXPLANATION

    Here is the documentation for Kernels and how they map to targets:

    http://software-dl.ti.com/jacinto7/esd/processor-sdk-rtos-jacinto7/latest/exports/docs/tiovx/docs/user_guide/SUPPORTED_KERNELS.html#autotoc_md10