TMS320C6416T: TMS320C6416T

Waseem Ajmal

Intellectual 310 points

Part Number: TMS320C6416T

Hello,

I have a question about an optimization problem.

Here is my code:

/* Loop for correlation computation. */
for (n1=0; n1<1400; n1++)
{
    fine_sync = 0.0+0.0*I;
    for (m1=0; m1<200; m1++)
    {
        fine_sync += mpysp(input[n1+m1],conjf(CMP_MOD_knownsym[m1]));
    }

    fine_sync1_vector_abs[n1]=cabsf(fine_sync);
}

This section of code takes about 54,000,000 cycles (54 ms), as measured with the CCS profiling tool. That is not affordable for any real-time application. I am running this code on a C6416 chip with a 1 GHz clock.

I need help optimizing this loop. The outer loop runs 1400 times, whereas the inner loop runs 200 times.

I know about loop unrolling, but it doesn't help much. I would like to know of other methods to reduce the execution cycles of this loop.

over 3 years ago

0 Victor Kazmirenko over 3 years ago

Guru 13042 points

Hello!

We need a bit more information about your situation to judge with certainty, though a few considerations are visible even now.

First, it looks like you are doing complex math in single precision on a fixed-point processor. To my knowledge, single-precision math is implemented in software on C64x. This means every simple addition and every multiplication is implemented as a function call. This not only adds overhead, but also prevents loop pipelining. I've never worked with complex.h on C64x, but I suspect that mpysp, conjf and cabsf are all implemented as function calls. This again prevents loop optimization.

C6x DSPs achieve their top performance when loops can be pipelined. We did RF signal processing on C64x, and we always used an integer representation for our data. It should properly be Q-format data, but we got by without diving into that, just with proper scaling in advance. The key difference is that the C64x core has hardware for integer additions and multiplications, so they don't need to be emulated in software. With that, the compiler can pipeline loops efficiently. You may want to see spra666.pdf, "Hand-Tuning Loops and Control Code on the TMS320C6000"; just google for it. This guide contains useful information on how to write efficient code for C6x. There is also a much more modern tutorial, SPRABG7 "Optimizing Loops on the C66x DSP". However, it targets C66x cores, which can do floating-point operations in hardware, and that part does not apply to your case.

All the magic happens when the -o3 compiler option is used, so be sure to enable it. The tutorial mentioned above also explains how to read compiler feedback. That is valuable information for understanding where the bottleneck is and what can be done to improve it.

When we were in your situation, we used integers for complex data sampled from RF. Specifically, they were 16-bit data, as that matches the AD/DA resolution, and operations on shorts can be packed into SIMD instructions.
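A minimal sketch of what an integer version of the correlation loop from the question could look like, assuming the samples have already been scaled to 16-bit Q15. The cplx16_t type and the correlate_q15 name are hypothetical. The squared magnitude stands in for cabsf on the assumption that only the position of the correlation peak matters; this is not necessarily what the original application needs.

```c
#include <stdint.h>

/* Hypothetical 16-bit complex sample: Q15 real and imaginary parts. */
typedef struct {
    int16_t re;
    int16_t im;
} cplx16_t;

/*
 * Correlates input against the conjugate of ref for n_lags lags and
 * stores the squared magnitude of each result.
 * input must hold n_lags + ref_len - 1 samples.
 * Assumes >> on a negative int is an arithmetic shift (true for the
 * common compilers, but implementation-defined in C).
 */
void correlate_q15(const cplx16_t *restrict input,
                   const cplx16_t *restrict ref,
                   int64_t *restrict mag2,
                   int n_lags, int ref_len)
{
    for (int n = 0; n < n_lags; n++) {
        int32_t acc_re = 0;
        int32_t acc_im = 0;
        for (int m = 0; m < ref_len; m++) {
            int32_t xr = input[n + m].re, xi = input[n + m].im;
            int32_t rr = ref[m].re, ri = ref[m].im;
            /* x * conj(r) = (xr*rr + xi*ri) + j(xi*rr - xr*ri).
             * Each product is scaled back to Q15 before summing, so a
             * few thousand terms fit in the 32-bit accumulators. */
            acc_re += ((xr * rr) >> 15) + ((xi * ri) >> 15);
            acc_im += ((xi * rr) >> 15) - ((xr * ri) >> 15);
        }
        /* 64-bit math is used only once per lag, outside the hot loop. */
        mag2[n] = (int64_t)acc_re * acc_re + (int64_t)acc_im * acc_im;
    }
}
```

With n_lags = 1400 and ref_len = 200 this covers the same range as the loop in the question. The inner loop contains only integer multiplies, shifts and adds, which is the kind of loop body the compiler has a chance to pipeline.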

Hope this helps.

0 Rahul Prabhu over 3 years ago in reply to Victor Kazmirenko

TI__Guru** 114410 points

I agree with rrlagic: your post is missing vital information needed to provide guidance on this issue. What are your compiler settings? Which memory is the code running from? Have you enabled the DSP cache? Can you use the DSP TSCL and TSCH registers instead of the CCS profile clock for accuracy?

0 Waseem Ajmal over 3 years ago in reply to Rahul Prabhu

Intellectual 310 points

My compiler settings are:

I am running my code from IRAM.

I have not enabled DSP cache. Should I enable DSP cache?

Yes, I have tried using TSCL and TSCH registers.

I wrote this:

{

unsigned long long start;
unsigned long long end;

start = _itoll (TSCH, TSCL);

my_function();

end= _itoll (TSCH, TSCL);

}

But I got this error:

#20 identifier "TSCH" is undefined
#20 identifier "TSCL" is undefined

0 Waseem Ajmal over 3 years ago in reply to Victor Kazmirenko

Intellectual 310 points

Yes, I am using complex math. I have included the following header files:

#include <csl.h>
#include <complex.h>
#include <math.h>
#include <mathf.h>
#include <float.h>
#include <string.h>
#include "stdbool.h"
#include <c6x.h>

#include "fastrts62x64x.h"

If it is not appropriate to perform complex math on C64x DSPs, how should I manipulate the real and imaginary parts of complex numbers?

Second, is there any way to compute angles using fixed-point math? For example, if I use the cos, sin, or tan functions, the result is always a float.

To convert to fixed point, we need to multiply by 32767. But the result of cos and sin is always a floating-point number; in math.h, these functions use the double or float data type.

Following the manuals you recommended, I am trying to reduce the cycle count of the section of code given above. Thanks for the suggestion. If you have further advice on handling complex numbers in integer format, please guide me.

0 Rahul Prabhu over 3 years ago in reply to Waseem Ajmal

TI__Guru** 114410 points

There are several issues with the DSP setup and compiler settings because of which the code is not executing optimally.

https://www.ti.com/lit/ug/spru187u/spru187u.pdf?ts=1618342922401&ref_url=https%253A%252F%252Fwww.ti.com%252Ftool%252Fdownload%252FC6000-CGT-7-4

Please use the -o3 optimization setting and enable the DSP L1D/L1P cache. I strongly recommend that you refer to the C6000 optimization app note and apply those optimization techniques.

https://www.ti.com/lit/an/sprabf2/sprabf2.pdf?ts=1618343077564

For handling floating-point math on a fixed-point processor, we provide the IQMATH library for the C64x+ architecture. Please use it to compute angles and cosine and sine values.

If you are still seeing issues, we can provide further guidance.

0 Victor Kazmirenko over 3 years ago in reply to Waseem Ajmal

Guru 13042 points

Hello!

You are asking the right questions, and there is a certain legacy of techniques for dealing with this. Here we just outline the major avenues, which you will have to learn to drive.

I believe it's pointless to do any significant floating-point work on a fixed-point DSP such as C64x. On the other hand, these DSPs are monster number crunchers with integers. One may then employ the so-called Q-format, which in a way is a fixed-point number represented and handled as an integer. I think TI provides a library for working with fixed-point numbers in Q-format, though there is a learning curve.
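To make the Q-format idea concrete, here is a minimal sketch of Q15, where a 16-bit integer q stands for the value q / 32768. The helper names are hypothetical and this is not TI's library. The float conversion is meant for building constants or tables ahead of time, not for use inside a hot loop.

```c
#include <stdint.h>

/* Converts a float in [-1.0, 1.0) to Q15, rounding to nearest and
 * saturating at the ends of the 16-bit range. */
int16_t q15_from_float(float x)
{
    float scaled = x * 32768.0f;
    if (scaled >= 32767.0f) return 32767;
    if (scaled <= -32768.0f) return -32768;
    return (int16_t)(scaled >= 0.0f ? scaled + 0.5f : scaled - 0.5f);
}

/* Converts Q15 back to float, e.g. for checking results on a PC. */
float q15_to_float(int16_t q)
{
    return (float)q / 32768.0f;
}

/*
 * Multiplies two Q15 numbers. The 32-bit product is Q30, so it is
 * rounded and shifted right by 15 to return to Q15.
 * Assumes >> on a negative int is an arithmetic shift.
 */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t p = (int32_t)a * b;     /* Q30 */
    p = (p + (1 << 14)) >> 15;      /* round, back to Q15 */
    if (p > 32767) p = 32767;       /* only reachable for -1.0 * -1.0 */
    return (int16_t)p;
}
```

Addition and subtraction of two Q15 numbers are plain integer operations; only multiplication needs the extra shift.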

Using complex numbers is common in signal processing. The only difference is that both the real and imaginary parts are stored and processed as integers. Now think about your AD/DA: they work in scaled integers and provide 14-16 bits at best. So C's short is just enough to store the useful information, and wider or floating-point data types do not preserve any more of it. The cost comes during math operations, and that is what needs to be carefully planned.

Next, it is very common to pack the Re/Im parts of a complex number as the lower/upper half-words of a 32-bit word. The C64x core can then perform a number of complex operations in hardware. For instance, the C64x core can add the upper and lower halves of two packed 32-bit complex containers in just one cycle, and that is a complex addition. It can also compute dot products and dot products with negation, operations that occur often in complex math. Some of these operations are recognized by the compiler; some require an explicit call to an intrinsic.
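A portable sketch of what those packed operations compute, so the arithmetic can be checked on a PC first. The function names are hypothetical, and placing Re in the upper half-word is an assumption made here for illustration. As far as I recall, on C64x the corresponding intrinsics are _add2, _dotp2 and _dotpn2, but check the compiler manual for their exact operand order before relying on this mapping.

```c
#include <stdint.h>

/* Packs a 16-bit complex number: Re in the upper half, Im in the lower. */
uint32_t cplx_pack(int16_t re, int16_t im)
{
    return ((uint32_t)(uint16_t)re << 16) | (uint16_t)im;
}

int16_t cplx_re(uint32_t c) { return (int16_t)(uint16_t)(c >> 16); }
int16_t cplx_im(uint32_t c) { return (int16_t)(uint16_t)(c & 0xFFFFu); }

/* Complex addition: both halves are added independently, without
 * saturation, and no carry leaks from the lower into the upper half. */
uint32_t cplx_add(uint32_t a, uint32_t b)
{
    uint32_t hi = (a & 0xFFFF0000u) + (b & 0xFFFF0000u);
    uint32_t lo = (a + b) & 0xFFFFu;
    return hi | lo;
}

/* Dot product of the two half-word pairs: hi*hi + lo*lo.
 * Assumes the four operands are not all -32768 (would overflow). */
int32_t dot2(uint32_t a, uint32_t b)
{
    return (int32_t)cplx_re(a) * cplx_re(b) + (int32_t)cplx_im(a) * cplx_im(b);
}

/* Dot product with negation: hi*hi - lo*lo. */
int32_t dotn2(uint32_t a, uint32_t b)
{
    return (int32_t)cplx_re(a) * cplx_re(b) - (int32_t)cplx_im(a) * cplx_im(b);
}

/* x * conj(r), the product used in correlation, from two dot products. */
void cplx_mul_conj(uint32_t x, uint32_t r, int32_t *re, int32_t *im)
{
    uint32_t x_swapped = cplx_pack(cplx_im(x), cplx_re(x));
    *re = dot2(x, r);            /* x.re*r.re + x.im*r.im */
    *im = dotn2(x_swapped, r);   /* x.im*r.re - x.re*r.im */
}
```

For example, (3 - 4j) * conj(1 + 2j) gives re = -5 and im = -10. The results are full-precision 32-bit products, so Q15 inputs still need scaling back afterwards.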

As to angles and harmonic functions, sometimes one can avoid them by using conjugate multiplications, for which the core has hardware and the respective intrinsics. There are also fast approximations of well-known functions. And finally, one may opt for custom-developed lookup tables to speed up the math.

Not to forget the evil operation of division. Used often, and especially inside loops, it may degrade performance a lot, so sometimes we use Newton iterations to approximate the result of a division.
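A minimal sketch of such a Newton iteration for a reciprocal, using only integer multiplies, subtractions and shifts. The function name and the Q-formats are choices made for this illustration, and the divisor is assumed to have been normalized into [0.5, 1.0) beforehand by shifting. A division a / d then becomes a multiplication of a by the reciprocal, followed by undoing the normalization shift.

```c
#include <stdint.h>

/*
 * Approximates 1/d without a division instruction or library call.
 * d:      divisor in Q15, assumed to lie in [0.5, 1.0), i.e. 16384..32767
 * return: reciprocal in Q14, i.e. between 16384 (1.0) and 32768 (2.0)
 */
int32_t recip_q14(int32_t d)
{
    /* Linear first guess x0 = 3 - 2*d. In Q14, 3.0 is 49152, and 2*d
     * in Q14 has the same integer value as d in Q15. */
    int32_t x = 49152 - d;

    /* Newton step x = x * (2 - d*x); the error roughly squares each
     * time, so a small fixed number of steps is enough here. */
    for (int i = 0; i < 3; i++) {
        int32_t t = (d * x) >> 15;      /* d*x in Q14, close to 16384 */
        x = (x * (32768 - t)) >> 14;    /* 2.0 is 32768 in Q14 */
    }
    return x;
}
```

The fixed trip count lets the compiler unroll the loop completely. Because of truncation in the shifts, the result can be off by a few least significant bits.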

It's up to you to decide how far you will go with these techniques. However, the use of packed integers is the starting point I would recommend.

0 Waseem Ajmal over 3 years ago in reply to Rahul Prabhu

Intellectual 310 points

Thank you for the suggestions.

I have started the optimization with the IQMATH library, to perform the angle computations in fixed point first.

I have downloaded and installed the IQMATH library for C64x+. I am studying the IQMATH library document "sprugg9". At the same time, I am trying to run the example code given in the document.

But I am facing some issues.

My first question is: is the IQMATH library for the C64x+ architecture compatible with the C6416 DSP or not?

Second, among the .lib files, which one should be added to the file search path in the CCS project properties?

However, just to try, I added IQmath_c64x+.lib to the file search path and added the include folder to the include options.

The code I wrote in CCS:

#include <IQmath.h> /* Header file for IQmath routine */

#define PI 3.14159F

_iq input, sin_out; /* Definition of variables using IQmath datatype */

void main(void)
{
    input = _IQ29(0.25*PI); /* radians represented in Q29 format */

    sin_out = _IQ29sin(input);
}

But when I build and run this example code, it gives me this error:

unresolved symbol IQNsin, first referenced in C:/ti/IQMATH library ...........

I have also tried adding the following to the .cmd file:

.data:IQmathTables > L2RAM

.text:IQmath > L2RAM

But I still get this error.

Kindly help me resolve this issue.

0 Victor Kazmirenko over 3 years ago in reply to Waseem Ajmal

Guru 13042 points

Hello!

To my knowledge, the C64x+ architecture is a superset of C64x, and so the C64x+ IQMATH library is not backward compatible with C64x devices.

You may want to see the thread e2e.ti.com/.../44931

0 Waseem Ajmal over 3 years ago in reply to Victor Kazmirenko

Intellectual 310 points

Okay, I have read this link.

I have downloaded "IQmath_c64x_v212.zip".

But I am still facing an error:

__mpy32su C:\ti\lib\IQmath_c64x_v212.lib
__mpy32us C:\ti\lib\IQmath_c64x_v212.lib
__mpy32u C:\ti\lib\IQmath_c64x_v212.lib

In the above link, there is another link to download the intrinsics, "http://tiexpressdsp.com/index.php/Run_Intrinsics_Code_Anywhere", but it doesn't open.

0 Victor Kazmirenko over 3 years ago in reply to Waseem Ajmal

Guru 13042 points

Hello!

My point was that although TI people provided IQmath_c64x to that customer, it appears that version was never a complete or public release. Indeed, people using that contribution faced the same problems as you do, and the response was that there is no cure for them. I could be wrong, but it seems to me that a complete and functional release of IQMATH for C64x did not exist. Instead, at that time people coded those algorithms on their own. When it comes to harmonic functions, there are a number of so-called "fast implementations". They revolve around the idea of periodicity plus some method of approximation, ranging from direct Taylor series to the CORDIC algorithm and lookup tables. We used lookup tables because they are simple to implement, but I can't claim they were either accurate or efficient.
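A minimal sketch of the lookup-table approach, not the code we actually used. It exploits periodicity and symmetry by storing only a quarter wave and interpolating linearly between entries. The names, the 17-entry table size and the phase convention (a 16-bit unsigned phase where 65536 is one full turn) are all assumptions made for this illustration.

```c
#include <stdint.h>

/* sin(k * pi/32) for k = 0..16, scaled by 32767 (Q15). */
static const int16_t SINE_QUARTER[17] = {
        0,  3212,  6393,  9512, 12539, 15446, 18204, 20787, 23170,
    25329, 27245, 28898, 30273, 31356, 32137, 32609, 32767
};

/* Sine of a first-quadrant phase p in 0..0x4000 (0 to 90 degrees). */
static int16_t sine_quarter(uint16_t p)
{
    uint16_t idx = p >> 10;            /* table index, 0..16 */
    int32_t frac = p & 0x3FF;          /* position between two entries */
    int32_t a = SINE_QUARTER[idx];
    if (idx == 16) {
        return (int16_t)a;             /* exactly 90 degrees */
    }
    int32_t b = SINE_QUARTER[idx + 1];
    return (int16_t)(a + (((b - a) * frac) >> 10));
}

/* Q15 sine of a phase where 0x10000 corresponds to a full turn. */
int16_t sin_q15(uint16_t phase)
{
    uint16_t quadrant = phase >> 14;
    uint16_t p = phase & 0x3FFF;
    if (quadrant & 1) {
        p = 0x4000 - p;                /* mirror quadrants 2 and 4 */
    }
    int16_t v = sine_quarter(p);
    return (quadrant & 2) ? (int16_t)-v : v;   /* negate lower half */
}

/* Cosine is sine advanced by a quarter turn; the phase wraps naturally. */
int16_t cos_q15(uint16_t phase)
{
    return sin_q15((uint16_t)(phase + 0x4000u));
}
```

A larger table reduces the interpolation error at the cost of memory, which is the usual trade-off with this method.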

At this point you might want to step back and think about whether sine computation is the only way to accomplish your goal. Perhaps your problem has other solutions as well.
