TMS320F28377D: Optimal RAM usage for CLA operations

Mike Twieg1

Part Number: TMS320F28377D
Other Parts Discussed in Thread: C2000WARE

In my application, CLA1 is running an IIR filter. The input x_in is passed from CPU1 via the Cpu1ToCla1 message RAM, and the output y_out is passed back via the Cla1ToCpu1 message RAM. Here is a simplified example of the IIR code, which is based on the cla_iir2p2z project in C2000Ware (extended to 3p3z):

// global variables, declared elsewhere
// float x_in; 	//input to IIR. In Cpu1ToCla1 message RAM
// float xn;		//copy of x_in. Accessed frequently by IIR code
// float D[6];		//shift register/accumulator for IIR
// float A[4]; 	//denominator coefficients of IIR
// float B[4]; 	//numerator coefficients of IIR
// float yn;		//output of IIR. Accessed frequently by IIR code
// float y_out;	//copy of yn. In Cla1ToCpu1 message RAM

static inline void run3p3z_CLA(void)
{   //transposed direct form II
    //     Network Diagram  :
    //
    // xn------>(x)--->(+)--------------->yn
    //      |    ^      ^             |
    //      |    |      |D[5]         |
    //      |    B(0)  (z)            |
    //      |           ^             |
    //      |           |D[4]         |
    //      |-->(x)--->(+)<-----(x)---|
    //      |    ^      ^        ^    |
    //      |    |      |D[3]    |    |
    //      |    B(1)  (z)       A(1) |
    //      |           ^             |
    //      |           |D[2]         |
    //      --->(x)--->(+)<-----(x)----
    //      |    ^      ^        ^    |
    //      |    |      |D[1]    |    |
    //      |    B(2)  (z)       A(2) |
    //      |           ^             |
    //      |           |D[0]         |
    //      --->(x)--->(+)<-----(x)----
    //           ^               ^
    //           |               |
    //           B(3)            A(3)
    //

    xn = x_in; //copy from shared message RAM to local RAM to reduce conflicts between CLA and CPU
    yn = xn*B[0] + D[5];

    D[4] = xn*B[1] + yn*A[1] + D[3];
    D[5] = D[4];

    D[2] = xn*B[2] + yn*A[2] + D[1];
    D[3] = D[2];

    D[0] = xn*B[3] + yn*A[3];
    D[1] = D[0];

    y_out = yn; //copy result to Cla1ToCpu1 message RAM (single write per iteration)
}
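
For reference, the structure above appears to correspond to the difference equation y[n] = B[0]*x[n] + sum over k = 1..3 of (B[k]*x[n-k] + A[k]*y[n-k]), with the feedback terms added rather than subtracted and A[0] unused. The following is a minimal host-side sketch (plain C, not CLA code) that checks this against the same update order. The struct `iir3p3z_t`, the function names, and the coefficient values are hypothetical and introduced only for illustration.

```c
/* Hypothetical container for the filter state; not part of the original project. */
typedef struct {
    float A[4];   /* feedback coefficients; A[0] is unused */
    float B[4];   /* feedforward coefficients */
    float D[6];   /* delay line, same layout as in the post */
} iir3p3z_t;

/* One step of the transposed direct form II update, in the same order as the post. */
static float iir3p3z_step(iir3p3z_t *f, float xn)
{
    float yn = xn * f->B[0] + f->D[5];

    f->D[4] = xn * f->B[1] + yn * f->A[1] + f->D[3];
    f->D[5] = f->D[4];

    f->D[2] = xn * f->B[2] + yn * f->A[2] + f->D[1];
    f->D[3] = f->D[2];

    f->D[0] = xn * f->B[3] + yn * f->A[3];
    f->D[1] = f->D[0];

    return yn;
}

/* Direct evaluation of the difference equation for sample n (zero initial state). */
static float iir3p3z_direct(const iir3p3z_t *f, const float *x, const float *y, int n)
{
    float acc = f->B[0] * x[n];
    int k;
    for (k = 1; k <= 3 && k <= n; k++) {
        acc += f->B[k] * x[n - k] + f->A[k] * y[n - k];
    }
    return acc;
}

/* Largest absolute difference between the two forms over up to 64 samples. */
float iir3p3z_max_error(int len)
{
    /* Illustrative, stable coefficients (assumed values, not from the post). */
    iir3p3z_t f = {
        {0.0f, 0.5f, -0.25f, 0.125f},
        {0.2f, 0.3f, 0.1f, 0.05f},
        {0.0f}
    };
    float x[64];
    float y[64];
    float max_err = 0.0f;
    int n;

    if (len > 64) {
        len = 64;
    }
    for (n = 0; n < len; n++) {
        float expected;
        float err;

        x[n] = (n % 4 == 0) ? 1.0f : -0.5f;   /* simple repeating test input */
        y[n] = iir3p3z_step(&f, x[n]);
        expected = iir3p3z_direct(&f, x, y, n);

        err = y[n] - expected;
        if (err < 0.0f) {
            err = -err;
        }
        if (err > max_err) {
            max_err = err;
        }
    }
    return max_err;
}
```

If the two forms agree, `iir3p3z_max_error(32)` should return a value on the order of single-precision rounding error.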

My question is how I should place the different variables in RAM to get the fastest execution while minimizing access to the message RAMs.

Currently I use local versions of x and y (xn and yn), which are in local shared RAM (LS0, for example). That way I only access each message RAM once per iteration.

I also see that my code produces assembly with parallel instructions (MMOV32 || MADDF32), with each accessing a different RAM address (one is usually from D[], the other from A[] or B[]). If those two addresses are in the same RAM block (LS0), will that result in slower execution due to wait states? Should I therefore put D[] in a separate LSx block from A[] and B[]? The original example project did not explicitly place the shift registers; they are just declared in the .cla source file without a DATA_SECTION #pragma.

Regards,

Mike

over 4 years ago

Ashwini Athalye (TI Expert), over 4 years ago

Hi Mike,

If the code makes extensive use of two data buffers, putting each buffer in a different RAM block may improve performance. The goal is to reduce pipeline stalls caused by a write and a read to different buffers occurring in the same cycle.

Thanks,

Ashwini
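
As an illustration of the placement suggested above, here is a minimal sketch that puts the delay line D[] in one section and the coefficients A[] and B[] in another, using the TI compiler's DATA_SECTION pragma. The section names `ClaDataDelay` and `ClaDataCoeff` are hypothetical. They would have to be mapped to two different LSx blocks in the linker command file, and both blocks would have to be assigned to the CLA as data memory. Whether this actually removes stalls in a given build is an assumption that should be checked by profiling.

```c
/* Sketch only: section names are hypothetical and must match the linker
 * command file, for example (assumed mapping, adjust to the project):
 *     ClaDataDelay : > RAMLS0, PAGE = 1
 *     ClaDataCoeff : > RAMLS1, PAGE = 1
 * The pragmas are guarded so the file still compiles on a host compiler. */
#if defined(__TI_COMPILER_VERSION__)
#pragma DATA_SECTION(D, "ClaDataDelay")   /* shift register/accumulator */
#pragma DATA_SECTION(A, "ClaDataCoeff")   /* denominator coefficients */
#pragma DATA_SECTION(B, "ClaDataCoeff")   /* numerator coefficients */
#endif

float D[6];   /* read and written every iteration */
float A[4];   /* read-only during filtering */
float B[4];   /* read-only during filtering */
```

The idea is that the parallel MMOV32 || MADDF32 pairs then touch D[] in one block and A[]/B[] in another, which is the separation described in the reply.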
