CCS/TMS320C6747: Need help on DSP performance optimization

user5982956

Part Number: TMS320C6747

Tool/software: Code Composer Studio

Hi,

We are facing a performance issue when all the audio effects are turned on. We tried optimizing the code to reduce the instruction count, and in debug mode the clock count does drop (from 500k+ clocks to 360k+ clocks). However, when the modified code is built in release mode and downloaded to the board, the cycle period seems to become longer. (We use an LED to observe the cycle period; it flickers more slowly when the cycle period gets longer.)
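One way to compare debug and release builds with the same yardstick is to time the call directly instead of relying on LED flicker. The sketch below is a portable illustration using the standard `clock()`; on the target, the same pattern would normally read a hardware cycle counter instead, which is an assumption and not something confirmed in this thread. The names `timeCall`, `workFn`, and `dummyWork` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical signature for the routine being timed. */
typedef void (*workFn)(float *buf, int len);

/* Runs fn `repeats` times and returns the average elapsed clock() ticks per call. */
double timeCall(workFn fn, float *buf, int len, int repeats)
{
    clock_t start, stop;
    int r;

    assert(fn != NULL && repeats > 0);

    start = clock();
    for (r = 0; r < repeats; r++)
        fn(buf, len);
    stop = clock();

    return (double)(stop - start) / (double)repeats;
}

/* Stand-in workload used only to exercise the harness. */
void dummyWork(float *buf, int len)
{
    int j;
    for (j = 0; j < len; j++)
        buf[j] = buf[j] * 0.5f + 1.0f;
}
```

Calling `timeCall` around the filter routine in both build configurations gives two numbers taken the same way, which is easier to compare than a cycle count from the debugger on one side and an LED period on the other.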

The original and optimized code are attached. Please help analyze; thanks.

original codes.txt

#pragma CODE_SECTION(iirLatticeLadderStereoProcess, ".criticalSectionInternal")
void iirLatticeLadderStereoProcess(float *restrict in, float *restrict out, iirLatticeLadder *filt, int len, int numFilters, Bool on)
{
	float *signalInL, *signalOutL, *signalInR, *signalOutR;
	float w1L, w4L, allpassL, allpoleL, sumL, inputL;
	float w1R, w4R, allpassR, allpoleR, sumR, inputR;
	int i, j;
	iirLatticeLadder *restrict filt1 = filt;

	signalInL = in;
	signalOutL = out;
	signalInR = in + len;
	signalOutR = out + len;

	// bypass filter
	if (on == FALSE)
	{
		for (j=0; j<len; j++)
		{
			*signalOutL++ = *signalInL++;
			*signalOutR++ = *signalInR++;
		}
	}
	else
	{
#pragma MUST_ITERATE(1, , )
		for (j=0; j<len; j++)
		{
			inputL = *signalInL++;
			inputR = *signalInR++;
#pragma MUST_ITERATE(1, , )
			for (i=0; i<numFilters; i++)
			{
				w1L = filt1[i].c2 * inputL - filt1[i].k2 * filt1[i].w5;	// L
				w1R = filt1[i].c2 * inputR - filt1[i].k2 * filt1[i].x5;	// R

				allpassL = filt1[i].k2 * inputL + filt1[i].c2 * filt1[i].w5;	// L
				allpassR = filt1[i].k2 * inputR + filt1[i].c2 * filt1[i].x5;	// R

				allpoleL = filt1[i].c1 * w1L - filt1[i].k1 * filt1[i].w3;	// L
				allpoleR = filt1[i].c1 * w1R - filt1[i].k1 * filt1[i].x3;	// R

				w4L = filt1[i].k1 * w1L + filt1[i].c1 * filt1[i].w3;		// L
				w4R = filt1[i].k1 * w1R + filt1[i].c1 * filt1[i].x3;		// R

				filt1[i].w5 = w4L;
				filt1[i].w3 = allpoleL;
				sumL = allpassL*filt1[i].v2+ w4L*filt1[i].v1 + allpoleL*filt1[i].v0;
				inputL = sumL;

				filt1[i].x5 = w4R;
				filt1[i].x3 = allpoleR;
				sumR = allpassR*filt1[i].v2+ w4R*filt1[i].v1 + allpoleR*filt1[i].v0;
				inputR = sumR;

			}
			*signalOutL++ = sumL;
			*signalOutR++ = sumR;
		}
	}
}

optimized codes.txt

#pragma CODE_SECTION(iirDirectform1StereoProcess, ".criticalSectionInternal")
void iirDirectform1StereoProcess(float *restrict in, float *restrict out, float (*x)[6], float (*y)[6], iirdirectform1 *filt, int len, int numFilters, Bool on)
{
    float *signalInL, *signalOutL, *signalInR, *signalOutR;
    int i, j;
    iirdirectform1 *restrict filt1 = filt;
    signalInL = in;
    signalOutL = out;
    signalInR = in + len;
    signalOutR = out + len;
    // bypass filter
    if (on == FALSE)
    {
        for (j=0; j<len; j++)
        {
            *signalOutL++ = *signalInL++;
            *signalOutR++ = *signalInR++;
        }
    }
    else
    {
#pragma MUST_ITERATE(1, , )
        for (j=0; j<len; j++)
        {
            x[0][0] = *signalInL++;
            x[0][3] = *signalInR++;
#pragma MUST_ITERATE(1, , )
            for (i=0; i<numFilters; i++)
            {
                y[i][0] = filt1[i].b0 * x[i][0] + filt1[i].b1 * x[i][1] + filt1[i].b2 * x[i][2] - filt1[i].a1 * y[i][1] - filt1[i].a2 * y[i][2];
                y[i][2] = y[i][1];
                y[i][1] = y[i][0];
                x[i][2] = x[i][1];
                x[i][1] = x[i][0];

                y[i][3] = filt1[i].b0 * x[i][3] + filt1[i].b1 * x[i][4] + filt1[i].b2 * x[i][5] - filt1[i].a1 * y[i][4] - filt1[i].a2 * y[i][5];
                y[i][5] = y[i][4];
                y[i][4] = y[i][3];
                x[i][5] = x[i][4];
                x[i][4] = x[i][3];

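                // Feed this stage's output to the next stage. On the last pass this
                // writes x[numFilters], so x needs at least numFilters+1 rows.
                x[i+1][0] = y[i][0];
                x[i+1][3] = y[i][3];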
            }
            *signalOutL++ = y[numFilters-1][0];
            *signalOutR++ = y[numFilters-1][3];
        }
    }

}
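For checking a rewritten filter against known behavior on a host PC, a plain single-channel reference of the same direct form 1 cascade can help. This is a minimal sketch and not the poster's code: `Biquad`, `biquadCascade`, and the per-stage state layout are hypothetical. Each stage computes y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2], matching the expression in the inner loop above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-stage coefficients and state for one channel. */
typedef struct {
    float b0, b1, b2, a1, a2; /* coefficients, a0 assumed normalized to 1 */
    float x1, x2;             /* previous two inputs  */
    float y1, y2;             /* previous two outputs */
} Biquad;

/* Runs `len` samples through `numStages` cascaded direct form 1 biquads. */
void biquadCascade(const float *in, float *out, Biquad *st, int len, int numStages)
{
    int i, j;

    assert(in != NULL && out != NULL && st != NULL);
    assert(len >= 0 && numStages >= 1);

    for (j = 0; j < len; j++) {
        float s = in[j];                 /* signal entering the first stage */
        for (i = 0; i < numStages; i++) {
            Biquad *q = &st[i];
            float y = q->b0 * s + q->b1 * q->x1 + q->b2 * q->x2
                    - q->a1 * q->y1 - q->a2 * q->y2;
            /* Shift the delay lines for the next sample. */
            q->x2 = q->x1;
            q->x1 = s;
            q->y2 = q->y1;
            q->y1 = y;
            s = y;                       /* output feeds the next stage */
        }
        out[j] = s;
    }
}
```

A stage with b0 = 1 and all other coefficients 0 passes its input through unchanged, and a stage with only b1 = 1 delays it by one sample; both make quick sanity checks before comparing against the target build.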

Thanks,

Zhanjun Li

over 6 years ago

0 Yordan Kovachev over 6 years ago

TI Guru, 161600 points

Hi,

What software are you using? Which Processor SDK RTOS version?

Best Regards,
Yordan

0 Rahul Prabhu over 6 years ago

TI Guru, 116770 points

Can you please provide the build logs from your debug and release builds? When you change settings in CCS, the IDE lets you apply them to all build profiles or only to a specific one. I suspect that the release build profile doesn't have the compiler optimization settings you applied, which would explain the difference in performance. If you add the -k option in both modes, the compiler will save the generated assembly, and you should be able to look at the number of instructions in the assembly for this function.

Also, please clarify whether the test setup for debug and release mode is the same. For example, are you using a GEL file and CCS setup with both binaries, or is the release-mode binary loaded using ROM boot? Ensure that the PLL settings are exactly the same between the two setups. Any other information on the setup will also be helpful: is the code run from internal memory or from DDR, and is the code interruptible?

Regards,

Rahul

0 user5982956 over 6 years ago in reply to Yordan Kovachev

Prodigy 100 points

Hi Yordan,

We are using CCS 5.5 and DSP/BIOS 5.42.

0 user5982956 over 6 years ago in reply to Rahul Prabhu

Prodigy 100 points

Hi Rahul,

Attached are the rebuilt log files for both debug and release modes, for your reference.

We are using the same GEL file in both modes, and the .cmd files for both modes are also similar; those files are attached as well.

I see the pragma directive CODE_SECTION(iirLatticeLadderStereoProcess, ".criticalSectionInternal") ahead of this function, and ".criticalSectionInternal >IRAM" in the .cmd file, so this piece of code should run from internal memory.

6747.zip

0 user5982956 over 6 years ago in reply to Rahul Prabhu

Prodigy 100 points

From the build log we can see that the -k option had already been added, and we do see the generated assembly files (.asm) for that code. However, the assembly for the two functions differs too much for us to analyze. I can post the files if necessary.

By the way, the code is interruptible, but no hardware interrupt was triggered when we tested it.

0 Rahul Prabhu over 6 years ago in reply to user5982956

TI Guru, 116770 points

I noticed that you are linking against the fastMATH lib and that the include or library search path differs in your link step. If the library doesn't exist at that path, the fast math functions will not be found; the linker will instead use the RTS library's log, exponential, and reciprocal functions and will not place them in IRAM, which could be the reason for the performance drop.

The link steps differ in the include path pointing to the c674xfastMath library.

To confirm, look at the map files for the debug and release binaries and make sure that the log, exponential, and reciprocal functions are linked from the fastmathlib rather than rts6740.lib. Please fix the path to the fastmathlib and let us know if the issue is resolved.
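The map-file check can also be scripted on a host PC. The sketch below only counts lines that mention both a library name and an object or symbol name. The line format it expects and the names `mapLineMatches` and `countMatches` are assumptions for illustration, so adjust them to whatever the actual linker map looks like.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if one line of map text mentions both the library and the symbol. */
int mapLineMatches(const char *line, const char *lib, const char *sym)
{
    assert(line != NULL && lib != NULL && sym != NULL);
    return strstr(line, lib) != NULL && strstr(line, sym) != NULL;
}

/* Counts the lines in a NUL-terminated map text that match lib and sym.
   Lines longer than 255 characters are truncated before matching. */
int countMatches(const char *mapText, const char *lib, const char *sym)
{
    char line[256];
    int count = 0;

    assert(mapText != NULL);
    while (*mapText != '\0') {
        size_t n = strcspn(mapText, "\n");   /* length of the current line */
        size_t copy = (n < sizeof line - 1) ? n : sizeof line - 1;

        memcpy(line, mapText, copy);
        line[copy] = '\0';
        count += mapLineMatches(line, lib, sym);

        mapText += n;
        if (*mapText == '\n')
            mapText++;                       /* skip the line terminator */
    }
    return count;
}
```

Running this over the debug and release map files with "rts6740.lib" as the library name and each math function of interest as the symbol shows at a glance whether any of them fell back to the RTS library in one build only.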

0 user5982956 over 6 years ago in reply to Rahul Prabhu

Prodigy 100 points

Rahul,

We modified the algorithm, and the new code seems to be working well so far. Thanks for the support.
