TDA2: The same algorithm runs slower on the DSP (C66x) of TDA2x than on the DSP (C674x) of DM8148

Part Number: TDA2
Other Parts Discussed in Thread: SYSBIOS

Hi

We have two self-designed boards, one using the DM8148 (C674x DSP) and the other using the TDA2x (C66x DSP). We run the same algorithm on the DSP of each board and find that the C674x is faster than the C66x. On both boards the DDR frequency is 533 MHz and the DSP frequency is 750 MHz. We use the same compiler options: -c -qq -pdsw225 -mv6600 --abi=elfabi -mo -eo.oe66 -ea.se66 -ms --embed_inline_assembly --symdebug:dwarf -O3 --keep_asm. What is the reason for this?

Here is our test code:

#define MAXV(a,b) (((a)>(b))?(a):(b))
#define MINV(a,b) (((a)<(b))?(a):(b))

typedef unsigned char u8;

int LneDet_MorphEroX(const u8 *SrcImg, u8 *DstImg,
                     const int Wid,
                     const int Hei,
                     const int MorphSizeV,
                     const int HeiStart,
                     const int HeiEnd,
                     const float k1,
                     const float k2)
{
    int Delta, midSize;
    int MV_i, MV_j, MV_k;
    int gray_curr, gray_min, midL, midR;

    for (MV_i = HeiStart; MV_i < HeiEnd; MV_i++)
    {
        midSize = MorphSizeV/k1 - (HeiEnd-MV_i)*k2/((HeiEnd-HeiStart))*MorphSizeV;

        Delta = MAXV(1, midSize);

        for (MV_j = 0; MV_j < Wid; MV_j++)
        {
            gray_min = 0xff;

            midL = MAXV(0, (MV_j-Delta));

            midR = MINV((Wid-1), (MV_j+Delta));

            for (MV_k = midL; MV_k <= midR; MV_k += 2)
            {
                gray_curr = *(SrcImg+MV_i*Wid+MV_k);
                gray_min = MINV(gray_min, gray_curr);
                gray_curr = *(SrcImg+MV_i*Wid+MV_k+1);
                gray_min = MINV(gray_min, gray_curr);
            }

            *(DstImg+MV_i*Wid+MV_j) = gray_min;
        }
    }

    return 1;
}

We highly appreciate your help. Thanks!

  • Hi Liuke,

    There are a couple of aspects that can affect the execution behavior:

    1. Where the code and data buffers are placed.
    2. The cache settings of the two setups.
    3. Are you running this multiple times and then comparing the average execution time? This ensures that the first-access (cache miss) penalty is included and averaged out.

    Can you please check these aspects?

    Also, how much difference do you see in the execution time?

    Thanks and Regards,
    Piyali
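The averaging described in point 3 can be sketched as below. This is a minimal host-side illustration, not Vision SDK code: `time_kernel`, `struct timing`, and `dummy_kernel` are hypothetical names, and the standard C `clock()` stands in for whatever cycle counter the DSP provides. It reports the first (cold-cache) call separately from the mean of the later calls, so the first-access penalty stays visible instead of being hidden in the average.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical kernel type: any routine that takes one context pointer. */
typedef void (*kernel_fn)(void *ctx);

struct timing {
    double first_ms; /* first call, includes cold-cache misses */
    double mean_ms;  /* mean of the following `runs` calls */
};

/* Converts a clock() interval to milliseconds. */
static double elapsed_ms(clock_t start, clock_t stop)
{
    return 1000.0 * (double)(stop - start) / (double)CLOCKS_PER_SEC;
}

/* Times one cold call, then `runs` further calls, and returns both figures. */
struct timing time_kernel(kernel_fn fn, void *ctx, int runs)
{
    struct timing t;
    clock_t start, stop;
    int i;

    assert(fn != NULL && runs > 0);

    start = clock();
    fn(ctx);                      /* first call: caches are cold */
    stop = clock();
    t.first_ms = elapsed_ms(start, stop);

    start = clock();
    for (i = 0; i < runs; i++)
        fn(ctx);                  /* later calls: caches are warm */
    stop = clock();
    t.mean_ms = elapsed_ms(start, stop) / runs;

    return t;
}

/* Stand-in workload (an assumption): increments a counter 100000 times. */
void dummy_kernel(void *ctx)
{
    volatile unsigned *counter = (volatile unsigned *)ctx;
    unsigned i;

    for (i = 0; i < 100000u; i++)
        (*counter)++;
}
```

Calling `time_kernel(dummy_kernel, &counter, 10)` runs the workload 11 times in total. On the target, `fn` would instead wrap the call to `LneDet_MorphEroX` with its real buffers.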
  • Hi Piyali,
    Thank you very much for your response.
    1. Our code and data are placed in DDR. We tried to put the code segment in L2, but it didn't work.
    2. We didn't change the cache settings (we don't know how to change the cache settings in the Vision SDK).
    3. The test result is the average time: the same code takes 14 ms on the DM8148 DSP and 18 ms on the TDA2x DSP.
  • Hi Liuke,

    The cache settings in the Vision SDK are configured in vision_sdk/links_fw/src/rtos/bios_app_common/tda2xx/cfg/DSP_common.cfg

    var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
    Cache.initSize.l1pSize = Cache.L1Size_32K;
    Cache.initSize.l1dSize = Cache.L1Size_32K;
    Cache.initSize.l2Size = Cache.L2Size_32K;

    Since you seem to be using BIOS, are you measuring the time with the code inside a critical section, so that no interrupts or task switches can happen in between?

    Thanks and Regards,

    Piyali

  • Hi Piyali Goswami

    We really appreciate your help. We would like to ask one more question: how do we disable interrupts/task switches in the Vision SDK?

    Thanks and Regards,
    Liuke
  • Hi Liuke,

    The code is as below:

    #include <xdc/std.h>
    #include <ti/sysbios/hal/Hwi.h>

    UInt cookie = Hwi_disable();

    < Critical Section >

    Hwi_restore(cookie);

    Thanks and Regards,
    Piyali
  • Hi Piyali Goswami
    Thank you for your help. We disabled interrupts/task switches, but the test result is still the same. We also ran the same algorithm on a C6678 at 1 GHz with -O3, code in L2 and data in L3 RAM, and it takes 12.6 ms. On the DM8148 DSP, which runs at only 750 MHz, the same algorithm takes 14 ms. This suggests that the C674x performs better than the C66x at the same frequency. Could you give us some advice on how to find the reason for this? Is there something wrong with our configuration?

    Below is our test record:
    1. [C6678, 1000 MHz, -O3, code in L2 RAM, data in L3 RAM]: 12.6 ms
    2. [TDA2x C66x, 750 MHz, -O3, code in L2 RAM, data in DDR at 533 MHz]: 17.5 ms
    3. [TDA2x C66x, 600 MHz, -O3, code in L2 RAM, data in DDR at 533 MHz]: 21.8 ms
    4. [DM8148 C674x, 750 MHz, -O3, code in DDR at 533 MHz, data in DDR at 533 MHz]: 14 ms

    Thanks and Regards,
    Liuke
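Because the four runs in the test record use different DSP clocks, it can help to convert each time to DSP cycles before comparing. The sketch below does only that arithmetic (cycles = time in ms * clock in MHz * 1000). `cycles_from_ms`, `struct record`, and `print_cycle_table` are names made up for this illustration; the numbers are copied from the test record.

```c
#include <assert.h>
#include <stdio.h>

/* Converts a measured time to DSP cycles: ms * MHz * 1000. */
double cycles_from_ms(double time_ms, double clock_mhz)
{
    assert(time_ms >= 0.0 && clock_mhz > 0.0);
    return time_ms * clock_mhz * 1000.0;
}

struct record {
    const char *setup;   /* short description of the run */
    double clock_mhz;    /* DSP clock */
    double time_ms;      /* measured time from the test record */
};

static const struct record records[] = {
    { "C6678, code L2, data L3 RAM", 1000.0, 12.6 },
    { "TDA2x C66x, code L2, data DDR", 750.0, 17.5 },
    { "TDA2x C66x, code L2, data DDR", 600.0, 21.8 },
    { "DM8148 C674x, code and data DDR", 750.0, 14.0 },
};

/* Prints one line per run with its time and the equivalent cycle count. */
void print_cycle_table(void)
{
    size_t i;

    for (i = 0; i < sizeof records / sizeof records[0]; i++) {
        printf("%-34s %6.1f MHz %6.1f ms %12.0f cycles\n",
               records[i].setup, records[i].clock_mhz, records[i].time_ms,
               cycles_from_ms(records[i].time_ms, records[i].clock_mhz));
    }
}
```

Under this conversion the four runs come to roughly 12.6 million, 13.1 million, 13.1 million, and 10.5 million cycles. The two TDA2x runs differ by well under 1% in cycles, while the DM8148 run needs about 20% fewer.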
  • Hi,

    The instruction sets of the two DSPs are slightly different, so the compiler is not guaranteed to produce identical machine code for both. As a first check, can you please compare the software pipeline information (generated by the compiler) for the innermost loop of your test code? That will tell us whether the core loop is scheduled identically or differently.

    Regards

    Deepak Poddar

  • Hi Deepak Poddar:

    Thanks for your help. We have compared the .se66 and .se674 files generated by the compiler, and the software pipeline information is the same. We found that the instructions are different, but we do not know much about them. Because the algorithm code is provided by our customer, we cannot optimize it by changing the code structure. Can you give us some suggestions for finding the reason? Is it possible that, for some specific code, the performance of the C66x is worse than that of the C674x?

  • Hi,

    It is difficult to tell the exact reason unless we compile and run the code and analyze each portion carefully.

    At this point, there could be two reasons for the mismatch:

    1) The code generated by the compiler is itself different for the two platforms.

    2) The cache or another peripheral is making the code run slower on the C66x-based SoC.

    You first need to determine which of #1 and #2 is the cause.

    For #1: you can generate the assembly files for both platforms and calculate, on paper, the estimated cycles consumed on each DSP. The assembly file gives the scheduled cycle estimate for each software-pipelined loop. Most cycles are generally consumed in loops, so the cycle count can be roughly calculated just by looking at the generated assembly file; there is no need to run the code for this analysis. If the calculated cycles differ between the two DSPs, then the assembly code of each loop has to be analyzed. In that case, please share the generated assembly files, and we can analyze them if you are not able to find the reason.

    For #2: if experiment #1 shows that the estimated cycles are the same for both DSPs, run the code with all input and output data placed in L1D, L2D, or DDR in turn, and measure the cycles consumed on both DSPs. With the input and output in L1D, cache-related penalties are avoided, so you should see the same cycle consumption on both DSPs, close to the theoretical estimate from exercise #1. Then measure the cycles with all data in L2D or DDR. If the cycles consumed differ there, then a cache- or peripheral-level setting is making the two DSPs behave differently.

    Once you have narrowed it down to one of the two reasons (#1 or #2), we can investigate further from there.

    Thanks

    Deepak Poddar

  • Hi Deepak Poddar:

    Thank you very much for your help. We are sorry that we do not fully understand the assembly files. We tried our best to count the cycles in the loops and found that between the L2 and L6 labels there are 47 cycles in the se66 asm file but 33 cycles in the se674 asm file. Could you help us find the problem? Thanks a lot! Attached are the C source file and the assembly files for the C66x and C674x: df-algo.zip

  • Hi,

    It looks like the software pipeline schedule is the same for both DSPs: "Total cycles (est.)         : 8 + trip_cnt * 4". Please look for "SOFTWARE PIPELINE INFORMATION" in the attached files. So with flat memory, the loop is expected to take the same number of cycles on both DSPs. This indicates that you have to do experiment #2, as suggested in my previous post, to pin down the peripheral-related difference.

    You may go through the various tutorials available for C6x DSPs, such as www.ti.com/.../sprui04b.pdf

    As a general observation, this loop can be optimized further to reduce its cycle consumption, and those optimizations apply to both DSPs.

    Regarding the current observation of higher cycles on the C66x, can you please check the cycles consumed with the input and output kept in L1D?

    Deepak
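As a rough illustration of the on-paper estimate, the sketch below applies the reported schedule, 8 + trip_cnt * 4 cycles per execution of the innermost loop, to the loop bounds of LneDet_MorphEroX. It rests on several assumptions: that trip_cnt equals the number of iterations of the source-level MV_k loop, that the pipelined schedule is used for every execution of that loop, and that outer-loop and memory-stall cycles can be ignored. `estimate_inner_loop_cycles` is a hypothetical helper, not part of any TI tool.

```c
#include <assert.h>

#define MAXV(a, b) (((a) > (b)) ? (a) : (b))
#define MINV(a, b) (((a) < (b)) ? (a) : (b))

/* Sums 8 + 4 * trip_cnt over every execution of the innermost loop. */
long long estimate_inner_loop_cycles(int Wid, int MorphSizeV,
                                     int HeiStart, int HeiEnd,
                                     float k1, float k2)
{
    long long total = 0;
    int i, j;

    assert(Wid > 0 && HeiEnd > HeiStart && k1 != 0.0f);

    for (i = HeiStart; i < HeiEnd; i++) {
        /* Same window-size computation as the original routine. */
        int midSize = (int)(MorphSizeV / k1
                      - (HeiEnd - i) * k2 / (HeiEnd - HeiStart) * MorphSizeV);
        int Delta = MAXV(1, midSize);

        for (j = 0; j < Wid; j++) {
            int midL = MAXV(0, j - Delta);
            int midR = MINV(Wid - 1, j + Delta);
            /* MV_k runs from midL to midR inclusive in steps of 2. */
            int trip_cnt = (midR - midL) / 2 + 1;

            total += 8 + 4LL * trip_cnt;
        }
    }

    return total;
}
```

Dividing the result by the DSP clock gives a flat-memory figure to set against the measured times. The image size and parameters actually used are not given in the thread, so no concrete number is shown here.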

  • Hi Liuke,

    We have not heard back from you on this one. I assume you have been able to proceed. If you continue to face issues, kindly let us know.

    Thanks and Regards,
    Piyali
  • Hi Piyali
    Sorry for the delay. Thank you very much for your help. Our leader decided to put the project on hold because we failed to complete the task by the customer's deadline. Anyway, we would like to thank you for your help.