TDA4VM: TDA4VM vs TDA4VMid-Eco: Why are the speeds of the two processors different for the same logic?

Part Number: TDA4VM
Other Parts Discussed in Thread: TDA4VL

Hello,

I am porting the same logic to two processors: TDA4VM and TDA4VMid-Eco.

My logic pipeline configuration looks like this:
1) image capture 2) ldc 3) preprocessing 4) tidl 5) postprocessing 6) tracking 7) draw bbox

In addition, the TDA4VM is running SDK version 8.2 and the Mid-Eco is running SDK version 9.2.

Looking at the datasheets for both processors, the Mid-Eco board appears to have better DSP compute performance (160 GFLOPS vs 40 GFLOPS on the TDA4VM).

But, when we ran real logic, we saw increased execution times on all nodes compared to the TDA4VM.

Can you give me an explanation of the likely causes of why this is happening?

  • Hi Dohee,

    To clarify your statement on GFLOPS:

    Per the datasheet, TDA4VM has 80 GFLOPS and TDA4VMid-Eco has 160 GFLOPS because it has two C7x cores. If you are running your computation on only one of them, you are using the same 80 GFLOPS.

    > But, when we ran real logic, we saw increased execution times on all nodes compared to the TDA4VM.

    Can you clarify what you mean by nodes in this case?

    > 1) image capture 2) ldc 3) preprocessing 4) tidl 5) postprocessing 6) tracking 7) draw bbox

    Are you referring to these parts of the pipeline? If so, can you please give details on the performance differences that you are seeing?

    Best,

    Asha

  • Hi, Bhandarkar

    First of all, I apologize for not explaining my issue clearly.

    The nodes I was referring to are the ones you identified. On TDA4VM, I placed all the nodes except TIDL on the C66 cores, which, as far as I understood, have a performance of 40 GFLOPS.

    As you know, the Mid-Eco board does not have a C66 core, so I moved all the nodes that ran on the C66 cores in TDA4VM to the second C7x core on the Mid-Eco board, which does not have an MMA.
    According to you, this core has a performance of 80 GFLOPS.

    The performance difference I was referring to is execution time. For example, when we ran the same postprocessing code on a C66 core in TDA4VM and on a C7x core in TDA4VMid-Eco, the execution times for that function were as follows:

    TDA4VM(C66); PostProcNode: avg = 6194 usecs, min/max = 6078 / 25685 usecs,
    TDA4VMidEco (C7x); PostProcNode: avg = 11293 usecs, min/max = 11195 / 70855 usecs,
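
The two log lines above can be reduced to a single slowdown ratio with a little arithmetic. The sketch below is illustrative only: the function name `slowdown` and the constant names are hypothetical, and the inputs are the average times copied from the logs.

```python
# Average PostProcNode execution times copied from the logs above (microseconds).
TDA4VM_C66_AVG_US = 6194
MIDECO_C7X_AVG_US = 11293


def slowdown(baseline_us: float, candidate_us: float) -> float:
    """Return how many times slower the candidate is than the baseline."""
    if baseline_us <= 0:
        raise ValueError("baseline time must be positive")
    return candidate_us / baseline_us


ratio = slowdown(TDA4VM_C66_AVG_US, MIDECO_C7X_AVG_US)
print(f"C7x / C66 average execution time: {ratio:.2f}x")
```

With these averages the ratio is roughly 1.8x. The same helper can be applied to the other nodes to build a per-node breakdown.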

  • Hi Dohee,

    Thanks for those clarifications. 

    So, to my understanding, on TDA4VM everything runs on C66 except the TIDL node, which runs on the C7x core.

    On TDA4VMid-Eco, everything runs on C7x_2 except the TIDL node, which runs on C7x_1 with MMA.

    Comparing in this case might be slightly more difficult. Are you saying that you are seeing a difference for all nodes and not just PostProc?

    For PostProc are you reading from a similar point in memory? Are you using some VXLIB implementation for post-processing?

    Best,

    Asha

  • Hi, Bhandarkar

    One more thing: on TDA4VM, LDC and preproc were running on c66_1, and postproc, tracking, and draw bbox were running on c66_2.

    Also, the slowdown I mentioned above was not just for postproc, but was similar on all nodes except TIDL
    (TIDL was about 5-10% slower on Mid Eco compared to TDA4VM and ldc is similar in both boards.)

    > For PostProc are you reading from a similar point in memory? Are you using some VXLIB implementation for post-processing?

    I didn't understand what exactly this meant. What do you mean by reading similar points in memory? One thing is for sure, my code doesn't change between the two boards.

    Also, Is it correct to ask if the VXLIB implementation you are talking about is using the default example provided in the TI SDK? If so, we are using code that has been modified to suit our situation.

  • Hi Dohee,

    > Also, the slowdown I mentioned above was not just for postproc, but was similar on all nodes except TIDL
    > (TIDL was about 5-10% slower on Mid Eco compared to TDA4VM and ldc is similar in both boards.)

    Could you give a breakdown of the percentage slowdown? TIDL is 5-10% slower on MidEco, the other nodes are what percentage?

    > I didn't understand what exactly this meant. What do you mean by reading similar points in memory? One thing is for sure, my code doesn't change between the two boards.

    You are doing some processing of data from C66 and C7x - where is this data stored in memory? If your code does not change between boards - I am guessing this means you have not optimized your code to the C7x architecture?

    > Also, Is it correct to ask if the VXLIB implementation you are talking about is using the default example provided in the TI SDK? If so, we are using code that has been modified to suit our situation.

    Yes, I was wondering if you were using some default SDK implementation for some of your pre and post processing. So your answer here is no.

    Best,

    Asha

  • Hi Bhandarkar

    > Could you give a breakdown of the percentage slowdown? TIDL is 5-10% slower on MidEco, the other nodes are what percentage?

    All nodes except LDC and TIDL were slowed down by about 2x, which can be seen from the execution times of the PostProc node I sent you earlier.

    > You are doing some processing of data from C66 and C7x - where is this data stored in memory? If your code does not change between boards - I am guessing this means you have not optimized your code to the C7x architecture?

    Yes. I didn't do any additional optimizations when porting to C7x. I'm not sure if this is what you're referring to, but my data is stored in DDR memory and I'm using tivxMemShared2TargetPtr() to get the address. If this is not the answer you are looking for, please clarify further.

    Also, if additional optimization techniques are needed to go from C66 to C7x, could you provide example code or similar material?

  • Hi Bhandarkar.

    I have an additional question from my ongoing work, so I am writing again.

    Currently, I'm working on TDA4VMidEco and have used C7x intrinsics, among other things, to significantly reduce the execution time of the LDC node.

    This has almost halved the execution time compared to TDA4VM, but the CPU load remains the same. For example, for an LDC node with the same functionality, the situation is as follows:

    TDA4VM (C66): execution time: 40 ms / CPU load: 31%
    TDA4VMidEco (C7x): execution time: 20 ms / CPU load: 30%

    My expectation was that the CPU load would decrease if the execution time decreased, but it didn't. What exactly does CPU load measure, what factors determine it, and how can I reduce it? Please answer this along with my previous question.

    Thank you.

    Dohee Kang.
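
One way to reason about the numbers in the post above: if the reported CPU load is the fraction of wall-clock time the core spends busy (an assumption here; the thread does not define it), then load depends on how often the node runs as well as on how long each run takes. The sketch below uses hypothetical helper names and back-computes the frame period each measurement would imply if the LDC node were the only work on the core.

```python
def cpu_load(exec_ms: float, period_ms: float) -> float:
    """Busy fraction of a core that runs one exec_ms job every period_ms."""
    return exec_ms / period_ms


def implied_period_ms(exec_ms: float, load: float) -> float:
    """Period that would produce the given load for the given execution time."""
    return exec_ms / load


# Figures from the post; the single-node-per-core model is an assumption.
tda4vm_period = implied_period_ms(40.0, 0.31)   # C66
mideco_period = implied_period_ms(20.0, 0.30)   # C7x

print(f"TDA4VM implied period:      {tda4vm_period:.1f} ms")
print(f"TDA4VMidEco implied period: {mideco_period:.1f} ms")
```

Under this simplified model, halving the execution time leaves the load unchanged only if the node is invoked about twice as often, so checking the actual frame rate on each board would be a reasonable first step. Other tasks running on the same core would also contribute to the load.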

     

  • Dear Bhandarkar,

    When can I expect an answer to my question?

    Let me know if these were difficult questions to answer.

    Thanks.

    Dohee Kang.

  • Hi Dohee,

    Unlocking the thread to provide an update on the status of this thread.

    Asha is no longer part of the E2E Support team, so she won't be able to provide any update on this thread. We are trying to find a resource to look at this, but this will take time.

    In the meanwhile, I have reviewed this thread and have some comments.

    Your comparison varies three different factors at once: SoC (TDA4VM vs TDA4VL), SDK (8.2 vs 9.2), and DSP cores (C66x vs C71x). This makes a meaningful comparison very difficult; we are no longer comparing apples to apples.

    The C66x and C71x DSP architectures are very different, and they also run at different frequencies. TDA4VL has two C71x cores, one with MMAv2 and one without MMA, while the TDA4VM's C71x has an MMAv1. TIDL performance has certainly changed in recent SDKs, so TIDL numbers taken on different SDKs cannot be used for this comparison.

    Any comparison here needs to be done on equivalent factors, leaving the SoC as the only variable. If using TDA4VM, please first compare the performance between the C66x and C71x cores on that SoC to establish a baseline. This needs to be done on the same SDK version that you are going to use on TDA4VL.

    For TDA4VL, use the C71x with MMA so that you are comparing this against the C71x on TDA4VM.

    Please provide an exact test scenario that can be used on TI EVMs for us to reproduce the data for further analysis.

    regards

    Suman
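
The comparison plan suggested above amounts to changing one factor at a time. A minimal sketch of that idea follows; the `Config` type and `differing_factors` helper are hypothetical, and it assumes SDK 9.2 is the version used on both SoCs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    soc: str
    sdk: str
    dsp: str


def differing_factors(a: Config, b: Config) -> list[str]:
    """Names of the fields on which two configurations differ."""
    return [name for name in ("soc", "sdk", "dsp")
            if getattr(a, name) != getattr(b, name)]


# Original comparison from the thread: three factors change at once.
original = (Config("TDA4VM", "8.2", "C66x"), Config("TDA4VL", "9.2", "C71x"))

# Suggested steps, assuming SDK 9.2 can be used on both SoCs.
step1 = (Config("TDA4VM", "9.2", "C66x"), Config("TDA4VM", "9.2", "C71x"))
step2 = (Config("TDA4VM", "9.2", "C71x"), Config("TDA4VL", "9.2", "C71x"))

for label, (a, b) in [("original", original), ("step 1", step1), ("step 2", step2)]:
    print(label, "differs in:", differing_factors(a, b))
```

Each step differs in exactly one field, so any change in measured execution time can be attributed to that field alone.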