Hello,
I am porting the same logic to both processors: TDA4VM vs TDA4VMid Eco
My logic pipeline configuration looks like this:
1) image capture 2) ldc 3) preprocessing 4) tidl 5) postprocessing 6) tracking 7) draw bbox
In addition, TDA4VM is using SDK version 8.2 and Mid Eco is using SDK version 9.2.
Looking at the datasheets for both processors, the Mid Eco board appears to have better compute performance in DSP core (40GFLOPS vs 160GFLOPS)
But, when we ran real logic, we saw increased execution times on all nodes compared to the TDA4VM.
Can you give me an explanation of the likely causes of why this is happening?
Hi Dohee,
To clarify your statement on GFLOPS:
TDA4VM has 80 GFLOPS per the datasheet and TDA4VMid-Eco has 160 GFLOPS due to there being 2 C7x cores - if you are only running your computation on 1, then you are utilizing the same 80 GFLOPS
But, when we ran real logic, we saw increased execution times on all nodes compared to the TDA4VM.
Can you clarify what you mean by nodes in this case?
1) image capture 2) ldc 3) preprocessing 4) tidl 5) postprocessing 6) tracking 7) draw bbox
Are you referring to these parts of the pipeline? If so, can you please give details on the performance differences that you are seeing?
Best,
Asha
Hi, Bhandarkar
First of all, I would like to apologize for the lack of explanation of my issue.
The nodes I was referring to are the ones you identified. Also, I put all the nodes on C66 cores in TDA4VM except tidl, which as far as I understood had a performance of 40 GFLOPS.
As you know, the Mid Eco board does not have a C66 core, so I put all the nodes I put on the C66 core in TDA4VM on the 2nd C7x core, which does not have MMA on the Mid Eco board.
Also, according to you, this had a performance of 80 GFLOPS.
The performance difference I was referring to is execution time. For example, when we put the same postprocessing code on a C66 core in TDA4VM and a C7x core in TDA4VMid Eco, the execution time for that function in each situation was as follows
TDA4VM(C66); PostProcNode: avg = 6194 usecs, min/max = 6078 / 25685 usecs,
TDA4VMidEco (C7x); PostProcNode: avg = 11293 usecs, min/max = 11195 / 70855 usecs,
Hi Dohee,
Thanks for those clarifications.
So to my understanding on TDA4VM, everything is running on C66 except TIDL node which runs on the C7x core.
On TDA4VMid-Eco, everything runs on C7x_2 except TIDL node which runs on C7x_1 with MMA.
Comparing in this case might be slightly more difficult. Are you saying that you are seeing a difference for all nodes and not just PostProc?
For PostProc are you reading from a similar point in memory? Are you using some VXLIB implementation for post-processing?
Best,
Asha
Hi, Bhandarkar
One more thing, in TDA4VM, ldc and preproc were running on c66_1 and postproc, tracking, draw bbox on c66_2.
Also, the slowdown I mentioned above was not just for postproc, but was similar on all nodes except TIDL
(TIDL was about 5-10% slower on Mid Eco compared to TDA4VM and ldc is similar in both boards.)
For PostProc are you reading from a similar point in memory? Are you using some VXLIB implementation for post-processing?
I didn't understand what exactly this meant. What do you mean by reading similar points in memory? One thing is for sure, my code doesn't change between the two boards.
Also, Is it correct to ask if the VXLIB implementation you are talking about is using the default example provided in the TI SDK? If so, we are using code that has been modified to suit our situation.
Hi Dohee,
Also, the slowdown I mentioned above was not just for postproc, but was similar on all nodes except TIDL
(TIDL was about 5-10% slower on Mid Eco compared to TDA4VM and ldc is similar in both boards.)
Could you give a breakdown of the percentage slowdown? TIDL is 5-10% slower on MidEco, the other nodes are what percentage?
I didn't understand what exactly this meant. What do you mean by reading similar points in memory? One thing is for sure, my code doesn't change between the two boards.
You are doing some processing of data from C66 and C7x - where is this data stored in memory? If your code does not change between boards - I am guessing this means you have not optimized your code to the C7x architecture?
Also, Is it correct to ask if the VXLIB implementation you are talking about is using the default example provided in the TI SDK? If so, we are using code that has been modified to suit our situation.
Yes, I was wondering if you were using some default SDK implementation for some of your pre and post processing. So your answer here is no.
Best,
Asha
Hi Bhandarkar
Could you give a breakdown of the percentage slowdown? TIDL is 5-10% slower on MidEco, the other nodes are what percentage?
All nodes except LDC and TIDL were slowed down by about 2x, which can be seen by looking at the execution times of the PostProc node I sent you earlier.
You are doing some processing of data from C66 and C7x - where is this data stored in memory? If your code does not change between boards - I am guessing this means you have not optimized your code to the C7x architecture?
Yes. I didn't do any additional optimizations to port to C7x. Also, I'm not sure if you're referring to this, my data is stored in DDR memory and I'm using tivxMemShared2TargetPtr() to get the address. If this is not the answer you are looking for, please provide further clarification.
Also, if additional optimization techniques are needed to go from C66 to C7x, can you provide example code or something?
Hi Bhandarkar.
I have been working on this further and have an additional question, so I am writing again.
Currently, I'm working on TDA4VMidEco, using C7x intrinsics and similar techniques to significantly reduce the execution time of the LDC.
This has almost halved the execution time compared to TDA4VM, but the CPU load remains the same. For example, for an LDC node with the same functionality, here is the situation
TDA4VM(C66) ; Execution time: 40ms / CPU load : 31%.
TDA4VMidEco (C7x) : execution time: 20ms / CPU load : 30%.
My expectation was that the CPU load would decrease if the execution time decreased, but it didn't. What do you mean by CPU load and how can I reduce it? And, what factors determine the CPU Load? Please answer this along with my previous question.
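One possible reading of these numbers, sketched under an assumption the thread does not confirm: if the reported CPU load is the fraction of wall-clock time the core is busy (busy time divided by the measurement window), then the load depends on both the per-frame execution time and how often the node runs. The function names below are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>

/* Assumption (not confirmed in this thread): load is busy time over
 * elapsed time. For a node that runs once per frame this becomes
 * load% = 100 * exec_ms / period_ms. */
double cpu_load_percent(double exec_ms, double period_ms)
{
    assert(exec_ms >= 0.0 && period_ms > 0.0);
    return 100.0 * exec_ms / period_ms;
}

/* Inverse relation: the frame period implied by a measured
 * execution time and a reported load percentage. */
double implied_period_ms(double exec_ms, double load_percent)
{
    assert(exec_ms >= 0.0 && load_percent > 0.0);
    return 100.0 * exec_ms / load_percent;
}
```

Under this assumption, 40 ms at 31% implies a period of roughly 129 ms per frame, while 20 ms at 30% implies roughly 67 ms per frame. In other words, the load would stay flat if the node on the C7x runs about twice as often. If both pipelines were locked to the same frame rate, halving the execution time would be expected to halve the load, so comparing the actual frame rates on the two boards may be a useful check.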
Thank you.
Dohee Kang.
Dear. Bhandarkar
When can I expect an answer to my question?
Let me know if these were difficult questions to answer.
Thanks.
Dohee Kang.
Hi Dohee,
Unlocking the thread to provide an update on the status of this thread.
Asha is no longer part of the E2E Support team, so she won't be able to provide any update on this thread. We are trying to find a resource to look at this, but this will take time.
In the meantime, I have reviewed this thread and have some comments.
Your comparison spans three different variables: SoC (TDA4VM vs TDA4VL), SDK (8.2 vs 9.2), and DSP cores (C66x vs C71x). This makes a comparison very difficult; we are no longer comparing apples to apples.
The C66x and C71x DSP architectures are very different, and the frequencies at which they are running are also different. TDA4VL has two C71x, one with MMAv2 and one without MMA, while the TDA4VM's C71x has an MMAv1. TIDL performance has certainly changed in recent SDKs, so it cannot be meaningfully compared across different SDK versions.
Any comparison here needs to be done on equivalent factors, leaving the SoC as the only variable. If using TDA4VM, please first compare the performance between the C66x and C71x cores on that SoC to establish a baseline. This needs to be done on the same SDK version that you are going to use on TDA4VL.
For TDA4VL, use the C71x with MMA so that you are comparing this against the C71x on TDA4VM.
Please provide an exact test scenario that can be used on TI EVMs for us to reproduce the data for further analysis.
regards
Suman
Dear. Anna
Unfortunately, the TDA4VM board is not available at this time, only the TDA4VL is available.
Since the TDA4VL does not have a C66 core, we cannot try the method you suggested.
Can you recommend another method that uses only the TDA4VL?
Best.
Dohee Kang.
Hi Dohee,
Unfortunately, the TDA4VM board is not available at this time, only the TDA4VL is available.
The thread is a comparison effort between TDA4VM and TDA4VL started by you, so not much can be done if you have access to only the TDA4VL board.
I do not have any other recommendations.
regards
Suman
Hi Dohee
As requested, I am sharing performance difference of VxLIB among TDA4VM C66x, TDA4VM C7x and TDA4VE C7x.
• TDA4VM C66x VxLIB
software-dl.ti.com/.../VXLIB_c66x_TestReport.html
• TDA4VM C7x VxLIB
software-dl.ti.com/.../performance_summary.html
• TDA4VE C7x VxLIB
software-dl.ti.com/.../performance_summary.html
You can find additional information of this kind for DSPLIB, FFTLIB, MATHLIB, and VXLIB in the SDK of each device, as below.
Thank you.
Regards,
Johnny
Hi team, Suman,
Here is an update on the current situation:
-For the same algorithm, the execution time on C7x (VE) is almost twice that on C66x (VM), that is, C7x is slower than C66x, even though the specs in the datasheet are the same, as shown below.
TDA4VM(C66); PostProcNode: avg = 6194 usecs, min/max = 6078 / 25685 usecs,
TDA4VMidEco (C7x); PostProcNode: avg = 11293 usecs, min/max = 11195 / 70855 usecs,
-This execution time was measured using the OpenVX time measurement API; the customer will also provide a counter-based measurement.
-With optimization (e.g. bufferization), C7x became faster than C66x; however, the core load is almost the same, as shown below.
TDA4VM(C66) ; Execution time: 40ms / CPU load : 31%.
TDA4VMidEco (C7x) : execution time: 20ms / CPU load : 30%.
-C66x and C7x benchmarking resources were provided above at the customer's request.
-According to the datasheet, the same result should be obtained with the same optimization; in practice, it is not.
We still do not have an exact reason and have not found a solution. As discussed, could you please take a look and update the thread?
Thank you.
BR,
Lynn
-C66x and C7x benchmarking resource is provided as above according to the customer's request.
-According to the datasheet same result also has to be come out without optimization, however actually it is not.
Hi Lynn,
Can you please refer to this thread? The content seems very similar.
https://e2e.ti.com/support/processors-group/processors/f/processors-forum/951567/tda4vm-curious-difference-in-performance-between-c66x-and-c7x
Regards,
Sivadeep
Hi Sivadeep,
First of all, let me give you more detail on the issue; the content of the linked thread is quite different from this one.
-According to the datasheet same result also has to be come out without optimization, however actually it is not.
I apologize for the typo in this comment; it should read "same result also has to be with same optimization".
This execution time was measured with the same optimization level, but the gap is almost 2x even though the core specs in the datasheet are the same.
TDA4VM(C66) ; Execution time: 40ms / CPU load : 31%.
TDA4VMidEco (C7x) : execution time: 20ms / CPU load : 30%.
With further optimization such as bufferization, applied only on C7x, we obtained this outcome.
However, why is the core load almost the same although the execution time differs by 2x (in this case, C7x is now faster than C66x)?
The points are:
1. Why does the execution time differ by 2x between C66x (TDA4VM) and C7x (TDA4VE), with C66x even faster than C7x? They have the same core spec.
2. With further optimization such as bufferization on C7x only, why is the core load almost the same although the execution time differs by 2x (in this case, C7x is now faster than C66x)?
2.1. Is there any way to reduce the core load, and why is the CPU load still the same even though the execution time is reduced?
What do you mean by CPU load and how can I reduce it? And, what factors determine the CPU Load? Please answer this along with my previous question.
Could you please refer to the above and investigate the reason?
Hello Dohee,
Please refer to the above and let me know if I have missed your points in here.
Could you please share the counter-based execution time measurement between C66x and C7x?
Thank you.
BR,
Lynn
Hi Lynn,
Sorry for the delay in response.
1. Why the execution time has twice gap between C66x(TDA4VM) and C7x(TDA4VE), C66x is even faster than C7x? They have same core spec.
2. With more optimization like bufferization on only C7x, why the core load is almost same although the execution time has twice gap(in this case, now C7x is faster than C66x)?
Regards,
Sivadeep
Hello Sivadeep,
Can you please try to run the optimized code on both the C66x and C7x with the compiler optimization level set
For this please refer to the below.
What I meant here is that the code is already compiled with the same optimization level on both cores; however, the execution time differs as below.
I apologize for the typo in this comment; it should read "same result also has to be with same optimization".
This execution time was measured with the same optimization level, but the gap is almost 2x even though the core specs in the datasheet are the same.
As far as I know, with the same optimization level and the same algorithm, C7x and C66x performance should be at least the same, shouldn't it?
Could you please share how you calculated the CPU load? Additionally, could you provide the cycle count using TSC.
The customer checked it with the OpenVX API; the TSC-based cycle count will be provided once the customer has measured it.
So, is there any reason for the execution time gap between C66x and C7x, with the same algorithm and optimization level, other than the ones below?
- Right placement of buffers – This refers to where the data is placed in memory (whether the data is closer to the CPU or not).
- Right vectorization of the code – Ensuring that the code utilizes vector instructions effectively.
- Efficient use of loops – Performance also depends on the number and structure of loops in the code.
Thanks.
BR,
Lynn
Hi Lynn,
With more optimization like bufferization on only C7x, then this outcome was came out.
I have a question regarding this statement. From what I understand, the bufferization optimization that you used was applied only on C7x. If that's the case, can you run the optimized code on both C7x and C66x and get the cycle count?
As far as I know, with the same optimization level and the same algorithm, C7x and C66x performance should be at least the same, shouldn't it?
It can still depend on the factors I mentioned above. Which compiler optimization level are you using?
Regards,
Sivadeep
Hi Sivadeep,
From what I understand the bufferization optimization that you used was only on C7x.
Let me explain the detailed flow.
1. The same optimization and algorithm were tested first => an execution time gap was found, as below.
2. With further optimization such as bufferization, applied only on the C7x, the execution time finally became faster, as below; however, the core load seems almost the same.
TDA4VM(C66) ; Execution time: 40ms / CPU load : 31%.
TDA4VMidEco (C7x) : execution time: 20ms / CPU load : 30%.
In general, one would expect the execution times to look like this, or at least be equal, even in case 1, but they are not. Also, in case 2 the core load is nearly the same even though the code on C7x is more optimized than on C66x. Do you see any reason for this?
Which compiler optimization level are you using ?
For this, we need to check with Dohee.
Hello Dohee, could you please check the above in addition to the TSC?
Thanks.
BR,
Lynn
Hi Lynn Kim,
With more optimization like bufferization only for the C7x then the execution time finally became faster as below however the core loading seems almost same.
I'm not sure about applying the optimization only on C7x; can you please use the same optimization on C66x and compare? As I mentioned earlier, there can be cases where an execution time gap exists between the two.
Can you also please provide the number of cycles measured using TSC and the compiler optimization level you are using?
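A minimal sketch of a TSC-based measurement that could accompany the OpenVX timing. It assumes the usual TI compiler conventions (the `__TSC` register exposed through `c7x.h` on C7x, `TSCL`/`TSCH` through `c6x.h` on C66x, and the `__C7000__` / `_TMS320C6X` predefined macros); please verify these against your compiler version. The names `read_cycles`, `measure_cycles`, and `demo_kernel` are hypothetical, and the `clock()` fallback exists only so the sketch also builds off-target.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__C7000__)
#include <c7x.h>   /* assumed to expose __TSC */
#elif defined(_TMS320C6X)
#include <c6x.h>   /* assumed to expose TSCL / TSCH */
#endif

typedef void (*kernel_fn)(void *arg);

/* Returns a free-running counter value. On the DSP targets this is the
 * CPU time stamp counter; on any other host it falls back to clock()
 * ticks, which are NOT cycles and are only useful for a smoke test. */
uint64_t read_cycles(void)
{
#if defined(__C7000__)
    return (uint64_t)__TSC;
#elif defined(_TMS320C6X)
    /* On C66x the counter must have been started by a write to TSCL
     * (often already done by the RTOS). Reading TSCL first latches the
     * upper half into TSCH. */
    uint32_t lo = TSCL;
    uint32_t hi = TSCH;
    return ((uint64_t)hi << 32) | lo;
#else
    return (uint64_t)clock();
#endif
}

/* Runs fn(arg) once and returns the counter delta around the call. */
uint64_t measure_cycles(kernel_fn fn, void *arg)
{
    assert(fn != NULL);
    uint64_t start = read_cycles();
    fn(arg);
    return read_cycles() - start;
}

/* Hypothetical stand-in for the function under test:
 * it just counts how many times it was called. */
void demo_kernel(void *arg)
{
    int *calls = (int *)arg;
    (*calls)++;
}
```

Wrapping the same post-processing function with `measure_cycles` on both cores gives cycle counts that can be divided by each core's clock frequency, which separates a per-cycle efficiency difference from a frequency difference.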
Regards,
Sivadeep
Hi Sivadeep,
Can you also please update the number of cycles using TSC and the compiler optimization level you are using.
The optimization level is 3 on both C7x and C66x.
For the cycle count measurement we still need to wait for the customer, so could you please give your opinion on the above gap, given the same optimization level?
Thank you.
BR,
Lynn
Hi Lynn,
As I mentioned earlier, when the code itself is not optimized for the architecture (as opposed to compiler optimization), the results can vary, even though the cores have the same specifications.
For unoptimized code, the C7x will not necessarily outperform the C66x. Performance can vary depending on several factors, and the C7x achieves better performance only through proper optimization.
It can be based on factors such as:
- Right placement of buffers – This refers to where the data is placed in memory (whether the data is closer to the CPU or not).
- Right vectorization of the code – Ensuring that the code utilizes vector instructions effectively.
- Efficient use of loops – Performance also depends on the number and structure of loops in the code.
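The three factors above can be illustrated with a small, portable sketch. It is not the customer's post-processing code and not a TI library routine: the function names, the block size, and the Q8 gain operation are hypothetical. The static arrays stand in for on-chip scratch memory; on target, placing them in L2 SRAM (for example through the linker command file) and replacing `memcpy` with DMA are assumptions that would need to be adapted to the actual system.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_ELEMS 1024u  /* hypothetical block size, tune per memory budget */

/* Buffer placement: stand-ins for scratch buffers close to the CPU. */
static uint8_t scratch_in[BLOCK_ELEMS];
static uint8_t scratch_out[BLOCK_ELEMS];

/* Vectorization and loop structure: one simple counted loop, no branches
 * in the body, and restrict-qualified pointers so the compiler may assume
 * the input and output do not overlap. */
static void scale_block(const uint8_t *restrict in, uint8_t *restrict out,
                        size_t n, uint8_t gain_q8)
{
    for (size_t i = 0; i < n; i++) {
        out[i] = (uint8_t)((in[i] * gain_q8) >> 8);
    }
}

/* Processes a large buffer (e.g. in DDR) block by block through the
 * scratch buffers instead of operating on it directly. */
void scale_buffered(const uint8_t *src, uint8_t *dst, size_t n, uint8_t gain_q8)
{
    assert(src != NULL && dst != NULL);
    for (size_t off = 0; off < n; off += BLOCK_ELEMS) {
        size_t len = (n - off < BLOCK_ELEMS) ? (n - off) : BLOCK_ELEMS;
        memcpy(scratch_in, src + off, len);   /* stage input near the CPU */
        scale_block(scratch_in, scratch_out, len, gain_q8);
        memcpy(dst + off, scratch_out, len);  /* write the result back */
    }
}
```

Whether the inner loop is actually vectorized or software-pipelined on a given core is best confirmed from the compiler's own feedback rather than assumed, since the same C source can map very differently onto C66x and C7x.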
Regards,
Sivadeep
Dohee,
can this thread be closed now? There has been no update for 1 month now
Dave C