CCS/TMS320C6747: Need help on DSP performance optimization

Part Number: TMS320C6747


Tool/software: Code Composer Studio

Hi,

We are facing a performance issue when all the audio effects are turned on. We tried optimizing the code to reduce the instruction count, and in debug mode the cycle count does drop (from 500k+ clocks to 360k+ clocks). However, when the modified code is built in release mode and downloaded to the board, the cycle period seems to get longer. (We use an LED to observe the cycle period; it flickers more slowly when the cycle period gets longer.)

The original code and the optimized code are attached; please help us analyze them. Thanks.

#pragma CODE_SECTION(iirLatticeLadderStereoProcess, ".criticalSectionInternal")
void iirLatticeLadderStereoProcess(float *restrict in, float *restrict out, iirLatticeLadder *filt, int len, int numFilters, Bool on)
{
	float *signalInL, *signalOutL, *signalInR, *signalOutR;
	float w1L, w4L, allpassL, allpoleL, sumL, inputL;
	float w1R, w4R, allpassR, allpoleR, sumR, inputR;
	int i, j;
	iirLatticeLadder *restrict filt1 = filt;

	signalInL = in;
	signalOutL = out;
	signalInR = in + len;
	signalOutR = out + len;

	// bypass filter
	if (on == FALSE)
	{
		for (j=0; j<len; j++)
		{
			*signalOutL++ = *signalInL++;
			*signalOutR++ = *signalInR++;
		}
	}
	else
	{
#pragma MUST_ITERATE(1, , )
		for (j=0; j<len; j++)
		{
			inputL = *signalInL++;
			inputR = *signalInR++;
#pragma MUST_ITERATE(1, , )
			for (i=0; i<numFilters; i++)
			{
				w1L = filt1[i].c2 * inputL - filt1[i].k2 * filt1[i].w5;	// L
				w1R = filt1[i].c2 * inputR - filt1[i].k2 * filt1[i].x5;	// R

				allpassL = filt1[i].k2 * inputL + filt1[i].c2 * filt1[i].w5;	// L
				allpassR = filt1[i].k2 * inputR + filt1[i].c2 * filt1[i].x5;	// R

				allpoleL = filt1[i].c1 * w1L - filt1[i].k1 * filt1[i].w3;	// L
				allpoleR = filt1[i].c1 * w1R - filt1[i].k1 * filt1[i].x3;	// R

				w4L = filt1[i].k1 * w1L + filt1[i].c1 * filt1[i].w3;		// L
				w4R = filt1[i].k1 * w1R + filt1[i].c1 * filt1[i].x3;		// R

				filt1[i].w5 = w4L;
				filt1[i].w3 = allpoleL;
				sumL = allpassL*filt1[i].v2+ w4L*filt1[i].v1 + allpoleL*filt1[i].v0;
				inputL = sumL;

				filt1[i].x5 = w4R;
				filt1[i].x3 = allpoleR;
				sumR = allpassR*filt1[i].v2+ w4R*filt1[i].v1 + allpoleR*filt1[i].v0;
				inputR = sumR;

			}
			*signalOutL++ = sumL;
			*signalOutR++ = sumR;
		}
	}
}
#pragma CODE_SECTION(iirDirectform1StereoProcess, ".criticalSectionInternal")
void iirDirectform1StereoProcess(float *restrict in, float *restrict out, float (*x)[6], float (*y)[6], iirdirectform1 *filt, int len, int numFilters, Bool on)
{
    float *signalInL, *signalOutL, *signalInR, *signalOutR;
    int i, j;
    iirdirectform1 *restrict filt1 = filt;
    signalInL = in;
    signalOutL = out;
    signalInR = in + len;
    signalOutR = out + len;
    // bypass filter
    if (on == FALSE)
    {
        for (j=0; j<len; j++)
        {
            *signalOutL++ = *signalInL++;
            *signalOutR++ = *signalInR++;
        }
    }
    else
    {
#pragma MUST_ITERATE(1, , )
        for (j=0; j<len; j++)
        {
            x[0][0] = *signalInL++;
            x[0][3] = *signalInR++;
#pragma MUST_ITERATE(1, , )
            for (i=0; i<numFilters; i++)
            {
                y[i][0] = filt1[i].b0 * x[i][0] + filt1[i].b1 * x[i][1] + filt1[i].b2 * x[i][2] - filt1[i].a1 * y[i][1] - filt1[i].a2 * y[i][2];
                y[i][2] = y[i][1];
                y[i][1] = y[i][0];
                x[i][2] = x[i][1];
                x[i][1] = x[i][0];

                y[i][3] = filt1[i].b0 * x[i][3] + filt1[i].b1 * x[i][4] + filt1[i].b2 * x[i][5] - filt1[i].a1 * y[i][4] - filt1[i].a2 * y[i][5];
                y[i][5] = y[i][4];
                y[i][4] = y[i][3];
                x[i][5] = x[i][4];
                x[i][4] = x[i][3];

                x[i+1][0] = y[i][0];
                x[i+1][3] = y[i][3];
            }
            *signalOutL++ = y[numFilters-1][0];
            *signalOutR++ = y[numFilters-1][3];
        }
    }

}

Thanks,

Zhanjun Li

  • Hi,

    What software are you using? Which Processor SDK RTOS version?

    Best Regards,
    Yordan

  • Can you please provide the build logs from your debug and release builds? In CCS, when you change the settings, the IDE lets you apply them to all build profiles or to one specific profile. I suspect that the release build profile doesn't have the compiler optimization settings that you applied, which would explain the difference in performance. If you add the -k option in both modes, the compiler will keep the generated assembly, and you should be able to compare the number of instructions in the assembly for this function.

    Also, please confirm that the test setup for debug and release mode is the same. For example, are you using a GEL file and the CCS setup with both binaries, or is the release-mode binary loaded using ROM boot? Ensure that the PLL settings are exactly the same between the two setups. Any other information on the setup will also be helpful: is the code run from internal memory or from DDR, and is the code interruptible?

    Regards,

    Rahul
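
Counting cycles directly is usually easier to compare between builds than an LED blink rate. The following is only a minimal sketch, assuming the TI compiler's `c6x.h` header and the free-running time-stamp counter `TSCL` of the C674x core. The helper names `cycles_start`, `cycles_now`, `cycles_elapsed` and `measure_cycles` are hypothetical, and the `clock()` fallback is there only so that the sketch also builds on a host PC.

```c
#include <assert.h>
#include <time.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI compiler header that declares the TSCL control register */
#endif

/* Start the counter. On the DSP, any write to TSCL enables it after reset. */
void cycles_start(void)
{
#ifdef _TMS320C6X
    TSCL = 0;
#endif
}

/* Read the low 32 bits of the free-running counter. */
unsigned int cycles_now(void)
{
#ifdef _TMS320C6X
    return TSCL;
#else
    return (unsigned int)clock();   /* host stand-in, not CPU cycles */
#endif
}

/* Unsigned subtraction stays correct across one 32-bit wrap-around. */
unsigned int cycles_elapsed(unsigned int start, unsigned int end)
{
    return end - start;
}

/* Time one call of a parameterless wrapper around the code under test. */
unsigned int measure_cycles(void (*fn)(void))
{
    unsigned int t0, t1;

    assert(fn != 0);
    t0 = cycles_now();
    fn();
    t1 = cycles_now();
    return cycles_elapsed(t0, t1);
}
```

Call `cycles_start()` once at startup, wrap the filter call in a small parameterless function, and pass that wrapper to `measure_cycles()` in both the debug and the release build. With the same PLL setup, the two counts can then be compared directly.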

  • Hi Yordan,

    We are using CCS 5.5 and DSP/BIOS 5.42.

  • Hi Rahul,

    Attached are the rebuilt log files for both debug and release modes, for your reference.

    We are using the same GEL file in both modes, and the .cmd files for the two modes are similar; those files are also attached.

    I see the pragma directive CODE_SECTION(iirLatticeLadderStereoProcess, ".criticalSectionInternal") ahead of this function, and ".criticalSectionInternal >IRAM" in the .cmd file, so this piece of code should run from internal memory.

    6747.zip

  • From the build log we can see that the -k option had already been added, and we do see the generated assembly files (.asm) for this code. However, the assembly for the two functions differs too much for us to analyze. I can post the files if necessary.

    By the way, the code is interruptible, but no hardware interrupt was triggered when we tested it.

  • I noticed that you are linking against the fastMath library and that the include or library search path in your link step is different between the two builds. If the library doesn't exist on that path, the fast math functions will not be found. The linker will then use the RTS library's log, exponential and reciprocal functions and will not place them in IRAM, which could be the reason for the performance drop.

    The link steps differ in the include path that points to the c674xfastMath library.


    To confirm, look at the map files for the debug and release binaries and make sure that the log, exponential and reciprocal functions are linked from the fastMath library rather than from rts6740.lib. Please fix the path to the fastMath library and let us know whether the issue is resolved.
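
The map-file check can also be done with a small host-side scan. This is a rough sketch only: it assumes that the map file names the contributing library and the function or section on the same text line, which should be verified against your own map files. The function names `line_mentions_both` and `count_matches` are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 when a single line of text mentions both the symbol and the
 * library, 0 otherwise. This is a plain, case-sensitive substring match,
 * so a short symbol name also matches longer names that contain it. */
int line_mentions_both(const char *line, const char *symbol, const char *library)
{
    assert(line != NULL && symbol != NULL && library != NULL);
    return strstr(line, symbol) != NULL && strstr(line, library) != NULL;
}

/* Counts the lines of an already opened map file that mention both names.
 * Lines longer than the buffer are read in pieces, so a match that spans
 * two pieces can be missed. */
int count_matches(FILE *map, const char *symbol, const char *library)
{
    char line[512];
    int count = 0;

    assert(map != NULL);
    while (fgets(line, sizeof line, map) != NULL) {
        if (line_mentions_both(line, symbol, library)) {
            count++;
        }
    }
    return count;
}
```

Run `count_matches()` on the debug and release map files, once with rts6740.lib and once with the fastMath library name as `library`. A nonzero count for rts6740.lib on one of the math functions would suggest that the function was taken from the RTS library in that build.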

  • Rahul,

    We modified the algorithm, and the new code seems to be working well so far. Thanks for the support.