Running time of DSPLIB functions is higher than expected

Hi,

For my project, I am trying to use DSPLIB functions on the C6678. I have gone through the programmer's reference guide and calculated the theoretical execution time of the functions. The measured results are almost 3 times the theoretical ones. Can someone kindly let me know where the problem is? Optimization is set to level 3, and I am running the code only on Core 0.

The following is the piece of code that is using DSPLIB:

All are floating-point arrays of size 117912 (tempa is a temporary array). The CSL_tscRead() function samples the current time-stamp counter.

t1=CSL_tscRead();

/* Element-wise squares: usq = u*u, vsq = v*v, wsq = w*w */
DSPF_sp_vecmul(u,u,usq,117912);
DSPF_sp_vecmul(v,v,vsq,117912);
DSPF_sp_vecmul(w,w,wsq,117912);

/* DSPF_sp_w_vec(x1,x2,m,r,n) computes r = m*x1 + x2, so the next
   three calls accumulate vel_sq = ones - 1.5*(usq + vsq + wsq) */
DSPF_sp_w_vec(usq,ones,-1.5,tempa,117912);
DSPF_sp_w_vec(vsq,tempa,-1.5,tempa,117912);
DSPF_sp_w_vec(wsq,tempa,-1.5,vel_sq,117912);

/* usq = 4.5*usq + vel_sq, likewise for vsq and wsq */
DSPF_sp_w_vec(usq,vel_sq,4.5,usq,117912);
DSPF_sp_w_vec(vsq,vel_sq,4.5,vsq,117912);
DSPF_sp_w_vec(wsq,vel_sq,4.5,wsq,117912);

/* uv = u + v, then uv_sq = 4.5*(u+v)^2 + vel_sq; the remaining
   groups repeat the same pattern for u-v, u+w, u-w, v+w, v-w */
DSPF_sp_w_vec(v,u,1,uv,117912);
DSPF_sp_vecmul(uv,uv,uv_sq,117912);
DSPF_sp_w_vec(uv_sq,vel_sq,4.5,uv_sq,117912);

DSPF_sp_w_vec(v,u,-1,uv_1,117912);
DSPF_sp_vecmul(uv_1,uv_1,uv_1_sq,117912);
DSPF_sp_w_vec(uv_1_sq,vel_sq,4.5,uv_1_sq,117912);

DSPF_sp_w_vec(w,u,1,uw,117912);
DSPF_sp_vecmul(uw,uw,uw_sq,117912);
DSPF_sp_w_vec(uw_sq,vel_sq,4.5,uw_sq,117912);

DSPF_sp_w_vec(w,u,-1,uw_1,117912);
DSPF_sp_vecmul(uw_1,uw_1,uw_1_sq,117912);
DSPF_sp_w_vec(uw_1_sq,vel_sq,4.5,uw_1_sq,117912);

DSPF_sp_w_vec(w,v,1,vw,117912);
DSPF_sp_vecmul(vw,vw,vw_sq,117912);
DSPF_sp_w_vec(vw_sq,vel_sq,4.5,vw_sq,117912);

DSPF_sp_w_vec(w,v,-1,vw_1,117912);
DSPF_sp_vecmul(vw_1,vw_1,vw_1_sq,117912);
DSPF_sp_w_vec(vw_1_sq,vel_sq,4.5,vw_1_sq,117912);
t2=CSL_tscRead();
/* CSL_tscRead() returns a 64-bit cycle count, so use a 64-bit format */
printf("The time taken for this is %llu\n",(unsigned long long)(t2-t1));

Theoretically, this piece of code should execute in about 3 milliseconds (at 1 GHz), but it is taking around 10 milliseconds. My application is performance critical. So, is there any way to optimize the code and achieve the theoretical minimum?

Thanks & Regards

Varun

  • Varun,

    It looks like you have several very large arrays in your code, and those data arrays are in DDR3. It takes more cycles for the CPU to read and write data directly from and to DDR3. You can try to place all the data in L2 SRAM to improve the performance. Because your data arrays are huge, you can break them into smaller blocks and use EDMA to bring the data into L2; once the computations are done, you can use EDMA to send the data back to DDR3.
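
    As a very rough sketch of the blocking idea (edma_copy() here is a hypothetical blocking wrapper around an EDMA3 transfer, not an actual CSL call, and the ".l2sram" section name is an assumption that must be mapped to the L2 range in your linker command file):

    #include <ti/dsplib/dsplib.h>   /* DSPF_sp_vecmul(); header path per your DSPLIB install */

    #define BLOCK 4096   /* elements per block; assumes the total length is a multiple of BLOCK */

    /* Working buffers placed in L2 SRAM */
    #pragma DATA_SECTION(inBuf,  ".l2sram")
    #pragma DATA_SECTION(outBuf, ".l2sram")
    float inBuf[BLOCK], outBuf[BLOCK];

    /* Hypothetical helper: program an EDMA3 PaRAM set for a 1-D transfer,
       trigger it, and wait for completion */
    extern void edma_copy(void *dst, const void *src, unsigned int bytes);

    void square_in_blocks(const float *src, float *dst, int n)
    {
        int b;
        for (b = 0; b < n; b += BLOCK) {
            edma_copy(inBuf, src + b, BLOCK * sizeof(float));    /* DDR3 -> L2   */
            DSPF_sp_vecmul(inBuf, inBuf, outBuf, BLOCK);         /* compute in L2 */
            edma_copy(dst + b, outBuf, BLOCK * sizeof(float));   /* L2   -> DDR3 */
        }
    }

    With two buffer pairs and two EDMA channels you can double-buffer (ping-pong) so that the next block's transfer overlaps with the current block's computation, hiding most of the DDR3 latency.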

    You can run a simple experiment to see the potential performance improvement: configure the arrays u, v, w, usq, vsq, and wsq to a smaller size such as 512, put these arrays in L2 SRAM, then run the following routines and measure the time to see if it is closer to the theoretical value.

    DSPF_sp_vecmul(u,u,usq,512);
    DSPF_sp_vecmul(v,v,vsq,512);
    DSPF_sp_vecmul(w,w,wsq,512);
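
    For example, the placement and timing for this experiment could look like the following sketch (again, ".l2sram" is an assumed section name to be mapped to on-chip L2 in the linker command file):

    #include <stdio.h>
    #include <ti/csl/csl_tsc.h>   /* CSL_tscRead(); header path per your CSL version */

    #pragma DATA_SECTION(u,   ".l2sram")
    #pragma DATA_SECTION(v,   ".l2sram")
    #pragma DATA_SECTION(w,   ".l2sram")
    #pragma DATA_SECTION(usq, ".l2sram")
    #pragma DATA_SECTION(vsq, ".l2sram")
    #pragma DATA_SECTION(wsq, ".l2sram")
    float u[512], v[512], w[512], usq[512], vsq[512], wsq[512];

    uint64_t t1, t2;
    t1 = CSL_tscRead();
    DSPF_sp_vecmul(u,u,usq,512);
    DSPF_sp_vecmul(v,v,vsq,512);
    DSPF_sp_vecmul(w,w,wsq,512);
    t2 = CSL_tscRead();
    printf("Cycles for three 512-element vecmul calls: %llu\n",
           (unsigned long long)(t2 - t1));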

    Regards,

    Xiaohui

  • Dear Xiaohui,

    Thanks for your wonderful advice, but I have a question about implementing it:

    1. Instead of L2 SRAM, I want to use the L1D memory of Core 0, and then extend this similarly to all 8 cores, i.e., each core using its own L1 memory. Will there be any difference between these two approaches? As per my understanding, most of the time data gets prefetched into the L1D cache automatically to optimize the code, so if I move data into L1D myself, it may affect performance. Which approach would be better?

    Thanks & Regards

    Varun V

  • Varun,

    What do you mean by using the L1D cache? On C66x, we have 32KB of L1D. You can partition the L1D, configuring part of it as cache and part of it as L1D SRAM. Do you mean to use part of L1D as cache and bring data into the L1D SRAM part? That would work for you. The potential problem is that you have quite a few large data arrays, and it may not be easy to bring all the data into L1D SRAM at once because its size is limited. That's why I suggested using L2 SRAM.
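
    As a minimal sketch of the partitioning (assuming the KeyStone CSL cache module; verify the exact function and enum names against your csl_cacheAux.h):

    #include <ti/csl/csl_cacheAux.h>

    /* Shrink the L1D cache from 32KB to 16KB; the freed 16KB becomes
       directly addressable L1D SRAM that data sections can be placed in */
    CACHE_setL1DSize(CACHE_L1_16KCACHE);

    The freed address range must then appear in the linker command file so that you can place data sections into it with #pragma DATA_SECTION.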

    Regarding L1D, the data will almost always go through the L1D cache eventually, regardless of whether the data are in DDR or L2 SRAM, except for the case where the data are in a DDR section that is configured to be non-cacheable.
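
    For completeness, cacheability of DDR is controlled per 16MB region through the MAR registers; with the CSL cache module that looks roughly like this (CACHE_disableCaching() is my best recollection of the call name; treat it as an assumption and check your csl_cacheAux.h):

    #include <ti/csl/csl_cacheAux.h>

    /* DDR3 begins at 0x80000000, and MAR128 covers 0x80000000-0x80FFFFFF;
       clearing its cache-enable bit makes that 16MB region non-cacheable */
    CACHE_disableCaching(128);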

    Regards,

    Xiaohui

  • Thanks for your reply.

    I mean to use it as L1D SRAM. But, as you said, since the data is large, I will first try the option of using L2 SRAM. If we transfer the data to L1D SRAM (say), then that data will also be cached in L1D, right? So, to what extent will the performance differ between the two cases, i.e., using L1D SRAM versus L2 SRAM?

    And I have never tried transferring from DDR to L2 SRAM or L1D SRAM, but I am assuming that the procedure remains the same as a DDR-to-DDR transfer, except that the destination address will be that of L2 SRAM or L1D SRAM.

    Thanks & Regards,

    Varun

  • When you use L2 SRAM, the data will eventually go through L1D, so there may be some cache penalties. When the data are placed in L1D SRAM, the CPU accesses them directly; the data won't go through the L1D cache.

    In your case, the data are accessed in a strictly linear pattern, so I don't think the degradation will be significant when the data are placed in L2 SRAM.

    Yes, the procedure is the same as for DDR when transferring data to L1D SRAM or L2 SRAM. One thing to note: when specifying EDMA source or destination addresses for data in either L1D or L2 SRAM, you should use their global addresses. Please refer to the user manual.
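
    For illustration, on the C6678 each core's local L1/L2 addresses (0x00xxxxxx) have a global alias at 0x1N000000 plus the local offset, where N is the core number, so a conversion helper could look like this (CSL_chipReadDNUM() returns the core id; verify the name against your CSL version):

    #include <ti/csl/csl_chipAux.h>

    /* Convert a core-local L1D/L2 SRAM address to its global alias,
       e.g. local 0x00800000 on core 2 -> global 0x12800000 */
    static inline unsigned int local_to_global(unsigned int localAddr)
    {
        unsigned int core = CSL_chipReadDNUM();   /* DNUM: 0..7 on the C6678 */
        return (0x10000000u + (core << 24)) | localAddr;
    }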

    -Xiaohui