Performance 64x vs 64x+

Other Parts Discussed in Thread: TMS320C6457

Hello,

I am seeing a performance issue on the TMS320C6457.

The compiler is the C6000 compiler in CCS 5.2.

The attached code has two examples, one using local variables and one using global variables.

The version with global variables takes far more clock cycles than the version with local variables (1565 versus 10).

Thanks.

/**********************************************************************************************************************************/


//.... These Functions are getting called based on Interrupts..........//

/*.............Test Function With Local Variables....*/

void process_data(void)
{
    int start_time,end_time;
    TSCL=0;
    TSCH=0;
    start_time =_itoll(TSCH,TSCL);
    int k,s,uiaPixelStore1[2048],uiR1,uiR2,uiR3,uiR4;
    
    for(k=0;k<2048;k++)
    {
        uiaPixelStore1[k]=0;
    }
    for(s=0;s<2048;s++)
    {
        uiR1 = _extu(uiaPixelStore1[s], 24, 24);
        uiR2 = _extu(uiaPixelStore1[s], 16, 24);
        uiR3 = _extu(uiaPixelStore1[s], 8, 24);
        uiR4 = _extu(uiaPixelStore1[s], 0, 24);

    }
    end_time=_itoll(TSCH,TSCL);

    time_diffr= end_time-start_time;           ///Time Difference is 10 (time_diffr = 10 clk cycles)
}




/*.....................Test Function With Global Variables .....*/
//    Global Declaration ..

    int start_time,end_time;
    int i,uiaPixelStore[2048],uiR1,uiR2,uiR3,uiR4;


void process_data(void)
{

    TSCL=0;
    TSCH=0;
    start_time =_itoll(TSCH,TSCL);

    for(i=0;i<2048;i++)
    {
        uiaPixelStore[i]=0;
    }
    for(i=0;i<2048;i++)
    {
        uiR1 = _extu(uiaPixelStore[i], 24, 24);
        uiR2 = _extu(uiaPixelStore[i], 16, 24);
        uiR3 = _extu(uiaPixelStore[i], 8, 24);
        uiR4 = _extu(uiaPixelStore[i], 0, 24);

    }

    end_time=_itoll(TSCH,TSCL);

    time_diffr= end_time-start_time;        //Time Difference is 1565 (time_diffr = 1565 clk cycles)
}


/***********************************************************************************************************************************************/

  • Hi Akshay,

    It is worth checking in your linker command file how the memory sections are mapped onto the memory regions.

    In general, uninitialized global variables are placed in the ".bss" or ".far" section, while uninitialized local variables live on the stack, in the ".stack" section. Usually .bss and .far are placed in RAM.

    Please check which memory the ".stack" and ".bss" sections are placed in for your program. Depending on the memory region, the time taken to fetch the data may differ.

    Regards,

    Shankari
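As a complement to reading the linker command file, the placement can also be checked from the addresses themselves. The following is only a minimal sketch: `in_region` is a hypothetical helper, not part of any TI library, and the base and size of each region have to be taken from the MEMORY directive of your own linker command file or from the generated map file.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if addr lies inside the memory
   region [base, base + size), and 0 otherwise. */
int in_region(uintptr_t addr, uintptr_t base, uintptr_t size)
{
    assert(size > 0);                  /* an empty region is a caller error */
    return addr >= base && (addr - base) < size;
}
```

On the target, one could pass `(uintptr_t)&uiaPixelStore[0]` for the global array and the address of the local array for the stack, and compare both against the region that is expected to hold them.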


  • Thanks for the reply.

    1. We are using only L2 memory in our application, as IRAM.

    Both .stack and .bss are located in L2 memory.

    In spite of this, performance is very poor.

    2. Is there a specific sequence for configuring L2 memory as cache?

    We tried configuring L2 memory as cache using the lower 3 bits of the L2CFG register and the MAR bits. The registers are updated, but performance is the same as above.

    Thanks.

  • In the case with local variables, the compiler can see that the results are not used anywhere and therefore do not have to be calculated at all. Even with global variables, the compiler can choose to zero uiaPixelStore, eliminate the second loop, and simply assign zeros to uiR[1-4]. In other words, neither of the two examples necessarily tells you anything about real-life performance.
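One common way to keep the measured work from being optimized away is to make the results observable, for example by accumulating them and storing the total to a `volatile` variable. This is a minimal sketch under assumptions, not the original poster's code: `extu` is a portable stand-in for the TI `_extu(src, csta, cstb)` intrinsic (shift left by `csta`, then logical shift right by `cstb`), and `unpack_and_sum` and `g_checksum_sink` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* Portable stand-in for the TI _extu intrinsic (assumption):
   shift left by csta, then logical shift right by cstb. */
static uint32_t extu(uint32_t src, unsigned csta, unsigned cstb)
{
    assert(csta < 32 && cstb < 32);    /* larger shift counts are undefined in C */
    return (src << csta) >> cstb;
}

/* A store to a volatile object must be performed, so the value
   written here has to be computed. */
volatile uint32_t g_checksum_sink;

/* Extracts the four bytes of every pixel word and sums them, so that
   each extracted value contributes to an observable result. */
uint32_t unpack_and_sum(const uint32_t *pixels, int n)
{
    uint32_t sum = 0;
    int s;

    for (s = 0; s < n; s++) {
        sum += extu(pixels[s], 24, 24);   /* bits  7..0  */
        sum += extu(pixels[s], 16, 24);   /* bits 15..8  */
        sum += extu(pixels[s],  8, 24);   /* bits 23..16 */
        sum += extu(pixels[s],  0, 24);   /* bits 31..24 */
    }
    g_checksum_sink = sum;
    return sum;
}
```

Filling the input array with data that is not known at compile time (for example, data received from a peripheral) matters as well; with an all-zero array the compiler may still fold the whole computation into a constant.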

  • Akshay,

    To check what Andy has said, try different compiler switches for optimization. Also, compare the numbers you measure against the amount of work the code is supposed to do.

    Compiler switches:

    You have not said which compiler switches you are using, but it is safe to say you are using some level of optimization. Re-run both cases with -g (Debug switch) turned on, and then re-run like that with all of the levels of optimization from -o0 to -o3 and compare the values.

    Consider the numbers:

    In both examples (local and global variables), you are trying to write to 2048 32-bit memory locations and then read from 2048 32-bit memory locations 4 times each. This will take much more than 10 cycles to do. So you can quickly see that something is not being correctly done or correctly measured in the first case in which you report 10 cycles. That is not possible.

    Using CCS, you can step through your code to see if it is executing correctly. You probably have a case where you are taking performance measurements from code that you have not yet debugged functionally at the level of optimization you are using.

    Regards,
    RandyP
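The sanity check on the numbers can be written down as a small calculation. This is a rough sketch, and `min_cycles` is a hypothetical helper: the code performs 2048 writes and at least 2048 reads, and the number of memory accesses completed per cycle is passed in as an assumption rather than taken from a device specification.

```c
#include <assert.h>

/* Smallest number of cycles that can cover `accesses` memory accesses
   when at most `per_cycle` of them complete in each cycle
   (ceiling division). */
unsigned long min_cycles(unsigned long accesses, unsigned long per_cycle)
{
    assert(per_cycle > 0);
    return (accesses + per_cycle - 1) / per_cycle;
}
```

Even with a generous assumption of 4 accesses per cycle, `min_cycles(2048 + 2048, 4)` gives 1024 cycles, which is about a hundred times the 10 cycles reported for the local-variable case. The real bound depends on the code the compiler generates, but a measurement of 10 cycles indicates that the loops were not executed as written.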