TDA2: The same algorithm runs slower on the DSP(c66x) of TDA2X than on the DSP(c674x) of DM8148

liuke liuke

Part Number: TDA2
Other Parts Discussed in Thread: SYSBIOS

We have two self-designed boards, one using the DM8148 (the DSP is a C674X) and the other using the TDA2X (the DSP is a C66X). We run the same algorithm on the DSPs of both boards and find that the C674X is faster than the C66X. On both boards the DDR frequency is 533 MHz and the DSP frequency is 750 MHz. We use the same compiler options: -c -qq -pdsw225 -mv6600 --abi=elfabi -mo -eo.oe66 -ea.se66 -ms --embed_inline_assembly --symdebug:dwarf -O3 --keep_asm. What is the reason for this?

Here is our test code:

#define MAXV(a,b) ((a>b)?(a):(b))
#define MINV(a,b) ((a<b)?(a):(b))

typedef unsigned char u8;

int LneDet_MorphEroX(const u8 *SrcImg, u8 *DstImg,
                     const int Wid,
                     const int Hei,
                     const int MorphSizeV,
                     const int HeiStart,
                     const int HeiEnd,
                     const float k1,
                     const float k2)
{
    int Delta, midSize;
    int MV_i, MV_j, MV_k;
    int gray_curr, gray_min, midL, midR;

    for (MV_i = HeiStart; MV_i < HeiEnd; MV_i++)
    {
        midSize = MorphSizeV/k1 - (HeiEnd-MV_i)*k2/((HeiEnd-HeiStart))*MorphSizeV;
        Delta = MAXV(1, midSize);

        for (MV_j = 0; MV_j < Wid; MV_j++)
        {
            gray_min = 0xff;
            midL = MAXV(0, (MV_j-Delta));
            midR = MINV((Wid-1), (MV_j+Delta));

            for (MV_k = midL; MV_k <= midR; MV_k += 2)
            {
                gray_curr = *(SrcImg+MV_i*Wid+MV_k);
                gray_min = MINV(gray_min, gray_curr);
                gray_curr = *(SrcImg+MV_i*Wid+MV_k+1);
                gray_min = MINV(gray_min, gray_curr);
            }

            *(DstImg+MV_i*Wid+MV_j) = gray_min;
        }
    }

    return 1;
}

We highly appreciate your help. Thanks!

over 5 years ago

0 Piyali Goswami over 5 years ago

TI__Mastermind 30235 points

Hi Liuke,

There are a couple of aspects which can affect the execution behavior:

1. The location where the code and data buffers are placed.
2. Cache settings between the two setups.
3. Are you running this multiple times and then comparing the average execution time? This ensures that the first-time access (cache miss) penalty is accounted for and averaged out.
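Point 3 can be sketched as a small averaging harness. This is only an illustration under assumptions: `average_ticks`, `counter_fn`, `kernel_fn`, and the `demo_*` stand-ins are hypothetical names, and the untimed warm-up runs are an addition that is not part of the suggestion above. On the DSP the counter callback would read a real cycle or time counter.

```c
#include <stdint.h>

/* Hypothetical callback types: `now` reads some free-running 32-bit
   counter, `kernel` runs the code under test once. */
typedef uint32_t (*counter_fn)(void);
typedef void (*kernel_fn)(void);

/* Runs `kernel` `warmup` times untimed (to absorb first-access cache
   misses), then returns the mean counter ticks over `runs` timed calls. */
uint32_t average_ticks(kernel_fn kernel, counter_fn now,
                       unsigned warmup, unsigned runs)
{
    uint64_t total = 0;
    unsigned i;

    if (runs == 0)
        return 0;
    for (i = 0; i < warmup; i++)
        kernel();
    for (i = 0; i < runs; i++) {
        uint32_t start = now();
        kernel();
        total += (uint32_t)(now() - start);  /* wrap-safe for a 32-bit counter */
    }
    return (uint32_t)(total / runs);
}

/* Stand-ins for illustration only: a fake counter and a fake kernel
   that pretends to cost 100 ticks per call. */
static uint32_t demo_ticks;
uint32_t demo_now(void)    { return demo_ticks; }
void     demo_kernel(void) { demo_ticks += 100; }
```

With the stand-ins, `average_ticks(demo_kernel, demo_now, 2, 10)` returns 100. Reporting the first timed run separately from the average would also show how large the cold-cache penalty is.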

Can you please check these aspects?

Also, how much difference do you see in the execution time?

Thanks and Regards,
Piyali

0 liuke liuke over 5 years ago in reply to Piyali Goswami

Prodigy 170 points

Hi Piyali,
Thank you very much for your response.
1. Our code and data are placed in DDR. We tried to put the code segment in L2, but it didn't work.
2. We didn't change the cache settings (we don't know how to change the cache settings in the Vision SDK).
3. The test result is the average time: the DM8148 takes 14 ms and the TDA2X takes 18 ms for the same code running on each DSP.

0 Piyali Goswami over 5 years ago in reply to liuke liuke

TI__Mastermind 30235 points

Hi Liuke,

The cache settings in the Vision SDK are configured in vision_sdk/links_fw/src/rtos/bios_app_common/tda2xx/cfg/DSP_common.cfg:

var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
Cache.initSize.l1pSize = Cache.L1Size_32K;
Cache.initSize.l1dSize = Cache.L1Size_32K;
Cache.initSize.l2Size = Cache.L2Size_32K;

Since you seem to be using BIOS, are you making sure that you measure the time with the piece of code inside a critical section, so that no interrupts or task switches can happen in between?

Thanks and Regards,

Piyali

0 liuke liuke over 5 years ago in reply to Piyali Goswami

Prodigy 170 points

Hi Piyali Goswami

We really appreciate your help. We want to ask one more question: how do we disallow interrupts/task switches in the Vision SDK?

Thanks and Regards,
Liuke

0 Piyali Goswami over 5 years ago in reply to liuke liuke

TI__Mastermind 30235 points

Hi Liuke,

The code is as below:

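#include <xdc/std.h>
#include <ti/sysbios/hal/Hwi.h>

UInt cookie = Hwi_disable();   /* disable interrupts, save the previous state */

< Critical Section >

Hwi_restore(cookie);           /* restore the saved interrupt state */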

Thanks and Regards,
Piyali

0 liuke liuke over 5 years ago in reply to Piyali Goswami

Prodigy 170 points

Hi Piyali Goswami
Thank you for your help. We disallowed interrupts/task switches, but the test result is still the same. We also ran the same algorithm on a C6678 at 1 GHz with -O3 optimization, code in L2 and data in L3 RAM, and the algorithm takes 12.6 ms. But the same algorithm on the DM8148's DSP, which runs at 750 MHz, takes only 14 ms. This suggests that the C674x has better performance than the C66x when running at the same frequency. Could you give us some advice on how to find the reason for this? Is there something wrong with our configuration?

Below is our test record:
1. [C6678, 1000 MHz, -O3, code in L2 RAM, data in L3 RAM]: 12.6 ms
2. [TDA2X-C66x, 750 MHz, -O3, code in L2 RAM, data in DDR at 533 MHz]: 17.5 ms
3. [TDA2X-C66x, 600 MHz, -O3, code in L2 RAM, data in DDR at 533 MHz]: 21.8 ms
4. [DM8148-C674X, 750 MHz, -O3, code in DDR at 533 MHz, data in DDR at 533 MHz]: 14 ms
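Because the four runs use different DSP clocks, it can help to convert each time into DSP cycles (milliseconds times MHz gives thousands of cycles) before comparing. The sketch below does only that arithmetic on the figures in the record; `kilocycles` and `print_test_record` are hypothetical helper names.

```c
#include <stdio.h>

/* Converts a measured time in ms at a DSP clock in MHz into thousands
   of DSP cycles, rounded to the nearest integer (ms * MHz = 1e3 cycles). */
long kilocycles(double time_ms, double clock_mhz)
{
    return (long)(time_ms * clock_mhz + 0.5);
}

/* Prints the test record above in kilocycles. */
void print_test_record(void)
{
    printf("C6678  @ 1000 MHz: %ld\n", kilocycles(12.6, 1000.0)); /* 12600 */
    printf("TDA2X  @  750 MHz: %ld\n", kilocycles(17.5, 750.0));  /* 13125 */
    printf("TDA2X  @  600 MHz: %ld\n", kilocycles(21.8, 600.0));  /* 13080 */
    printf("DM8148 @  750 MHz: %ld\n", kilocycles(14.0, 750.0));  /* 10500 */
}
```

In cycles, the two TDA2X runs are nearly identical (about 13.1 million), and both are roughly 25% above the DM8148 figure of 10.5 million. That would be consistent with the run time tracking the DSP clock, although these numbers alone do not show where the extra cycles are spent.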

Thanks and Regards,
Liuke

0 Deepak Poddar over 5 years ago in reply to liuke liuke

TI__Expert 4305 points

Hi,

The instruction sets of the two DSPs are slightly different, so the compiler is not guaranteed to produce identical machine code for both DSP versions. As a first check, can you please compare the software pipeline information (generated by the compiler) for the innermost loop of your test code? That will tell whether the core loop is scheduled identically or differently.

Regards

Deepak Poddar

0 liuke liuke over 5 years ago in reply to Deepak Poddar

Prodigy 170 points

Hi Deepak Poddar:

Thanks for your help. We have compared the .se66 and .se674 files generated by the compiler. The software pipeline information is the same. We found that the instructions are different, but we don't know much about them. Because the algorithm code is provided by our customer, we cannot optimize it by changing the code structure. Can you give us some suggestions for finding the reason? Is it possible that, for some specific code, the performance of the C66X is worse than that of the C674X?

0 Deepak Poddar over 5 years ago in reply to liuke liuke

TI__Expert 4305 points

Hi,

It is difficult to tell the exact reason unless we compile and run the code and analyze each portion carefully.

At this point, there could be two reasons for this mismatch:

1) The code generated by the compiler for the two platforms is itself different.

2) Cache or some other peripheral setting is making the code run slower on the C66x-based SoC.

You first need to narrow down the reason to #1 or #2.

For #1 --> You can generate the assembly file for both platforms and calculate on paper the estimated cycles consumed on each DSP. The assembly file gives the scheduled cycle estimate for each software-pipelined loop. Most cycles are generally consumed in loops, so the cycle count can be roughly calculated just by looking at the generated assembly file, without running the code. If the calculated cycles differ between the two DSPs, then the assembly code of each loop has to be analyzed. In that case, please share the generated assembly files; we can analyze them if you are not able to find the reason.

For #2 --> If the estimated cycles from experiment #1 are the same for both DSPs, run the code with all the input and output data in L1D/L2D/DDR and measure the cycles consumed on both DSPs. With the input and output in L1D, cache-related penalties are avoided, so you should see the same cycle consumption on both DSPs, close to the theoretical estimate from #1. Then measure the cycles with all data in L2D/DDR. If the cycle counts differ there, it is a cache/peripheral-level setting that is making the two DSPs behave differently.

Once you have narrowed it down to #1 or #2, we can investigate further from there.
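The narrowing-down procedure can be written as a small decision function. This is only a sketch: `classify`, `cause_t`, `roughly_equal`, and the tolerance parameter are all hypothetical, and the cycle counts have to come from the paper estimate (#1) and from on-target measurements (#2).

```c
typedef enum {
    CAUSE_CODEGEN,        /* reason #1: generated code differs */
    CAUSE_MEMORY_SYSTEM,  /* reason #2: cache/peripheral settings differ */
    CAUSE_UNDETERMINED    /* neither pattern matched; needs more analysis */
} cause_t;

/* Returns 1 if a and b agree within tol_pct percent of the larger value. */
static int roughly_equal(long a, long b, int tol_pct)
{
    long diff = (a > b) ? a - b : b - a;
    long big  = (a > b) ? a : b;
    return diff * 100 <= (long)tol_pct * big;
}

/* est_*: cycles estimated on paper from each assembly file (step #1).
   l1d_*: cycles measured with all data in L1D (step #2).
   ddr_*: cycles measured with all data in DDR (step #2).
   The _a and _b suffixes stand for the two DSPs being compared. */
cause_t classify(long est_a, long est_b,
                 long l1d_a, long l1d_b,
                 long ddr_a, long ddr_b, int tol_pct)
{
    if (!roughly_equal(est_a, est_b, tol_pct))
        return CAUSE_CODEGEN;
    if (roughly_equal(l1d_a, l1d_b, tol_pct) &&
        !roughly_equal(ddr_a, ddr_b, tol_pct))
        return CAUSE_MEMORY_SYSTEM;
    return CAUSE_UNDETERMINED;
}
```

The undetermined case covers, for example, matching estimates but differing L1D measurements, which neither #1 nor #2 explains as described.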

Thanks

Deepak Poddar

0 liuke liuke over 5 years ago in reply to Deepak Poddar

Prodigy 170 points

Hi Deepak Poddar:

Thank you very much for your help. We are sorry that we do not fully understand the assembly files. We tried our best to count the cycles in the loops. We found that between labels L2 and L6 there are 47 cycles in the se66 asm file, but 33 cycles in the se674 asm file. Could you help us find the problem? Thanks a lot! Attached are the C source file and the assembly files for C66x and C674x: df-algo.zip

0 Deepak Poddar over 5 years ago in reply to liuke liuke

TI__Expert 4305 points

Hi,

It looks like the software pipeline schedule is the same for both DSPs: "Total cycles (est.) : 8 + trip_cnt * 4". Please check "SOFTWARE PIPELINE INFORMATION" in the attached file. So with flat memory, the loop is expected to take the same number of cycles on both DSPs. This indicates that you have to do experiment #2, as suggested in the previous post, to narrow down the peripheral-related difference.
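As a rough illustration of the paper estimate, the quoted formula can be summed over one image row using the loop bounds from the posted test code. This is a sketch under assumptions: `inner_loop_cycles` and `row_cycles_estimate` are hypothetical helpers, `trip_cnt` is assumed to equal the source-level iteration count of the innermost loop (that is, no unrolling by the compiler), and outer-loop overhead is ignored.

```c
/* Estimated cycles for one execution of the software-pipelined inner
   loop, using the compiler's figure: 8 + trip_cnt * 4. */
long inner_loop_cycles(long trip_cnt)
{
    return 8 + trip_cnt * 4;
}

/* Sums the inner-loop estimate over one row of width `wid` with a
   half-window of `delta` pixels, mirroring midL/midR in the test code. */
long row_cycles_estimate(int wid, int delta)
{
    long total = 0;
    int j;

    for (j = 0; j < wid; j++) {
        int midL = (j - delta > 0) ? j - delta : 0;
        int midR = (j + delta < wid - 1) ? j + delta : wid - 1;
        long trip = (midR - midL) / 2 + 1;  /* k = midL; k <= midR; k += 2 */
        total += inner_loop_cycles(trip);
    }
    return total;
}
```

Multiplying the per-row figure by the number of rows processed (with the `Delta` computed for each row) gives a flat-memory estimate to compare against the cycles measured on each DSP.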

You may go through the various tutorials available for C6x DSPs, such as www.ti.com/.../sprui04b.pdf

As a general observation, this loop can be optimized further for better cycle consumption, and those optimizations are applicable to both DSPs.

Regarding the current observation of higher cycle counts on the C66x, can you please check the cycles consumed with the input and output kept in L1D?

Deepak

0 Piyali Goswami over 5 years ago in reply to Deepak Poddar

TI__Mastermind 30235 points

Hi Liuke,

We have not heard back from you on this one. I assume you have been able to proceed. If you continue to face issues, kindly let us know.

Thanks and Regards,
Piyali

0 liuke liuke over 5 years ago in reply to Piyali Goswami

Prodigy 170 points

Hi Piyali
Sorry for the delay. Thank you very much for your help. Our leader decided to put the project on hold because we failed to complete the task within the customer's deadline. Anyway, we would like to thank you for your help.
