This thread has been locked.

TMS320F280049C: Calculation Time of Arithmetic in the CLA

Part Number: TMS320F280049C

Hi experts,

We would like to reduce the calculation time of arithmetic in the CLA, and from the data sheet we found that the floating-point data type is faster than integer.

But we found something interesting, shown below (we used an oscilloscope to measure the elapsed time):

It is reasonable that scenario 2 is faster than scenario 1, and that scenario 4 is faster than scenario 3. But why does scenario 3 take so much more time than scenario 1?

If this were caused by using different variables, scenarios 4 and 5 should also take much more time than scenario 2.

Kindly help clarify our doubts.

 

Scenario 1: The data types of a and b are unsigned integers; it takes 1.07 µs.

uint16_t a, b; // a is a random variable

GpioDataRegs.GPBSET.bit.GPIO34 = 1; // pull high
b = a * (a + 1) - (a * a - 1);
b += a * (a + 2) - (a * a - 2);
b += a * (a + 3) - (a * a - 3);
b += a * (a + 4) - (a * a - 4);
b += a * (a + 5) - (a * a - 5);
b += a * (a + 6) - (a * a - 6);

GpioDataRegs.GPBCLEAR.bit.GPIO34 = 1; // pull low

Scenario 2: The data types of a and b are floats; it takes 0.42 µs.

float a, b; // a is a random variable

GpioDataRegs.GPBSET.bit.GPIO34 = 1; // pull high
b = a * (a + (float)1) - (a * a - (float)1);
b += a * (a + (float)2) - (a * a - (float)2);
b += a * (a + (float)3) - (a * a - (float)3);
b += a * (a + (float)4) - (a * a - (float)4);
b += a * (a + (float)5) - (a * a - (float)5);
b += a * (a + (float)6) - (a * a - (float)6);

GpioDataRegs.GPBCLEAR.bit.GPIO34 = 1; // pull low

Scenario 3: The data types of (a, c, d, e, f, g) and b are unsigned integers; it takes 34 µs.

uint16_t a, b, c, d, e, f, g; // a, c, d, e, f, g are random variables

GpioDataRegs.GPBSET.bit.GPIO34 = 1; // pull high
b = a * (g + 1) - (e * f - 1);
b += g * (a + 2) - (f * e - 2);
b += c * (d + 3) - (g * d - 3);
b += d * (e + 4) - (c * c - 4);
b += e * (f + 5) - (a * g - 5);
b += f * (a + 6) - (d * a - 6);

GpioDataRegs.GPBCLEAR.bit.GPIO34 = 1; // pull low

Scenario 4: The data types of (a, c, d, e, f, g) and b are floats; it takes 0.6 µs.

float a, b, c, d, e, f, g; // a, c, d, e, f, g are random variables

GpioDataRegs.GPBSET.bit.GPIO34 = 1; // pull high
b = a * (g + (float)1) - (e * f - (float)1);
b += g * (a + (float)2) - (f * e - (float)2);
b += c * (d + (float)3) - (g * d - (float)3);
b += d * (e + (float)4) - (c * c - (float)4);
b += e * (f + (float)5) - (a * g - (float)5);
b += f * (a + (float)6) - (d * a - (float)6);

GpioDataRegs.GPBCLEAR.bit.GPIO34 = 1; // pull low

Scenario 5: The data types of (a, c, d, e, f, g) are unsigned integers and b is a float; it takes 0.86 µs.

uint16_t a, c, d, e, f, g; // a, c, d, e, f, g are random variables

float b;

GpioDataRegs.GPBSET.bit.GPIO34 = 1; // pull high
b = (float)a * ((float)g + (float)1) - ((float)e * (float)f - (float)1);
b += (float)g * ((float)a + (float)2) - ((float)f * (float)e - (float)2);
b += (float)c * ((float)d + (float)3) - ((float)g * (float)d - (float)3);
b += (float)d * ((float)e + (float)4) - ((float)c * (float)c - (float)4);
b += (float)e * ((float)f + (float)5) - ((float)a * (float)g - (float)5);
b += (float)f * ((float)a + (float)6) - ((float)d * (float)a - (float)6);

GpioDataRegs.GPBCLEAR.bit.GPIO34 = 1; // pull low

Best Regards,

C.C, Liu

  • Hi,

    Have you looked at the generated assembly code? You can view that by enabling the --keep_asm compiler flag in the CCS project.

    Regards,

    Veena

  • Hi,

    Yes, we have checked the assembly code, but we are not familiar with the mechanism of this compiler, so we want to ask the experts directly.

    For example, why does scenario 3 use a lot of MNOP instructions while the others do not (a screenshot of a fragment is below)?

    Hope TI experts can clarify our concerns.

    By the way, is there documentation for each assembly instruction?

    BR,

    C.C. Liu

  • Hi,

    Yes, the CLA chapter in the device Technical Reference Manual has the details of all the CLA instructions.

    I will forward your query to the compiler experts.

    Regards,

    Veena

  • why scenario 3 uses a lot of MNOP

    The CPU pipeline of the CLA is not protected. When an instruction is issued, its result is sometimes not available for several cycles. The compiler attempts to fill those cycles with other, independent instructions. When no other instructions are available, those cycles get filled with MNOP instructions. For further details, please see this forum thread.
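    As a rough illustration (hand-written pseudo-assembly, not actual compiler output, and the delay-slot count is made up):

    ```
    ; the result of a multiply is not ready for several cycles
    MMPYF32  MR2, MR0, MR1   ; MR2 = MR0 * MR1
    MNOP                     ; no independent instruction available,
    MNOP                     ; so the compiler pads the delay slots
    MADDF32  MR3, MR2, MR1   ; first instruction that may read MR2
    ```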

    Thanks and regards,

    -George

  • Hi George,

    Thanks for your reply.

    I have an extended question. The floating-point scenarios (4 and 5) are almost 50 times faster than the unsigned-integer scenario 3. Is this because the CLA is a fully programmable, independent 32-bit floating-point hardware accelerator (and is that why there are not a lot of MNOP instructions)?

    BR,

    C.C. Liu

  • Hi Chen,

    Thanks for your question. I will route this thread back to Veena so that they can help you with this CLA question.

    Regards,

    Peter

  • Hi Chen,

    That is correct; the CLA instruction set is optimized for 32-bit floating point.

    Without knowing the generated assembly, one example I can think of that can make a difference is adding a 16-bit constant. For floating-point addition there is MADDF32 MRa, #16FHi, MRb, which takes the immediate operand as part of the instruction. For unsigned int, the only instruction supported is MADD32 MRa, MRb, MRc; there is no form that takes an immediate operand. This implies that for the code snippet you have, with integer types the constants need to be loaded into a register first, which is not needed for floating point.
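    A hand-written sketch of that difference (illustrative only, not generated output; the register moves are shown schematically):

    ```
    ; float: the constant is encoded directly in the instruction
    MADDF32  MR1, #2.0, MR0    ; MR1 = 2.0 + MR0, one instruction

    ; unsigned int: the constant must be loaded into a register first
    MMOVIZ   MR2, #0           ; clear upper 16 bits of MR2
    MMOVXI   MR2, #2           ; load immediate 2 into lower 16 bits
    MADD32   MR1, MR0, MR2     ; MR1 = MR0 + MR2
    ```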

    Thanks,

    Ashwini