C5515 Audio Buffer With a little math on the side

Douglas Lambert

First, I'd like to thank the community for their help in the past. Every time I've posted here I've walked away with that much more knowledge. As a new DSP programmer and still a student, it's been quite the resource.

Anyway, I've been working off of an audio pass-through example kindly provided to me by the TI staff, which is working perfectly. I'm essentially trying to replicate their buff copy function (included below) with additional data processing in between the two buffers. However, I'm having issues...

My conclusions thus far:

1. The math and the filter in between function as desired. They run perfectly using a pre-allocated sine wave as an input.

2. This leads me to believe that I'm doing something wrong while handling the real-time data stream.

3. The DMA registers seem to be working fine, since the buff copy gives me crisp audio pass-through.

4. I've investigated using circ_update etc. from the programmer's handbook and intend to replace some of my logic with that, though it looks remarkably similar to what I'm already trying to do.

My concerns:

1. Are my signal modifications too slow to keep up with my buffer? How can I check this?

2. I understand my code is less than optimal; I guess that ties in with 1.

3. I might be entirely ignorant of the process... That's a reality...

If there's more information that I can provide, please let me know and I'll respond ASAP. It's a rather large project and I've left out a good majority of it so as not to overwhelm. I guess I'm mainly concerned with how my additional operations in between the buffers affect the data stream (aka why I'm getting utter junk at my output).

Thank you again, and thanks for taking the time to read a lengthy post!

The Buff Copy Function for Reference:

void buff_copy(Int16 *input, Int16 *output, Int16 size)

{

Int16 i;

for(i =0; i<size; i++)

{

*(output + i) = *(input +i);

}

}

My Modified Function:

void run_lms(Float32 *TapWeight,Int16 *Xmit,Int16 *Rcv_input,Int16 *Rcv_error)

{

//mu is now #defined in ref_bypassdata.h

//Vector_Size is #defined in ref_bypassdata.h

//Delay is #defined in ref_bypassdata.h

//Internal Variables

Int16 i;

Float32 Output;

Int16 k;

Int16 j;

Float32 Error;

Float32 Input_Scaled;

Float32 Error_Scaled;

Float32 ErrorInput;

Float32 Input;

for(k=0;k<(XMIT_BUFF_SIZE);k++)

{

j=k-1;

if(j<0)

{

//Handles the case when k-1 is less than the last buffer, so it loops back to the past values in the 44 45 46 47 8 range

j=XMIT_BUFF_SIZE;

}

//Ensure that if you get a 0 error, the correct output is loaded to the register

Output=*(Xmit+j)/Scale;

//Output=0;

for(i=0;i<(Vector_Size);i++)

{

j=k-i;

if(j<0)

{

//Handles the case when k-i is less than the last buffer, so it loops back to the past values in the 44 45 46 47 8 range

j=XMIT_BUFF_SIZE-i+k+1;

}

// Variable j is used to maintain the proper indexing of the input vector

//Using Scaled Version of Input

Input=*(Rcv_input+j);

Input_Scaled=(Input)/Scale;

Output+= *(TapWeight+i) * Input_Scaled;

}

Input=*(Rcv_input+k);

Input_Scaled=Input/Scale;

Error=Output+Input_Scaled;

for(i=0;i<(Vector_Size);i++)

{

j=k-i;

if(j<0)

{

//Handles the case when k-i is less than the last buffer, so it loops back to the past values in the 44 45 46 47 8 range

j=XMIT_BUFF_SIZE-i+k+1;

}

Input=*(Rcv_input+j);

Input_Scaled=Input/Scale;

*(TapWeight+i)+=mu * Input_Scaled * Error;

}

//^^ Can be commented out once the delay is confirmed.

*(Xmit+k)=Output*Scale;

}

}

To Clarify some variables:

Rcv_XXXX: are pointers to my incoming data stream buffer.

Xmit : is a pointer to my output buffer.

over 14 years ago

0 Jim Noxon over 14 years ago

TI__Genius 14940 points

Since the 5515 is a fixed-point processor, doing work in floating-point formats will be quite slow compared to simply copying a buffer from input to output. Generally, when doing real-time filtering of this type, the code is optimized by using fixed-point numbers rather than floating-point numbers. This allows the assembly to use integer math instructions while maintaining fractional precision based on where the decimal point is implied in the representation of the number. For a generalized filter function like yours, where you merely pass in an array of tap weights, fixed point doesn't work as well out of the box, because the right format is highly dependent upon the individual tap weight values.
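A rough way to check whether the processing can keep up is to compare the cycles available per sample against an estimate of the cycles the filter needs. The sketch below is only an illustration: the function names are hypothetical, and the clock rate, sample rate, and per-operation cycle cost you feed in are assumptions that should be replaced with values measured by a profiler or cycle counter.

```c
#include <stdint.h>

/* Cycles available to process one sample before the next one arrives. */
uint32_t cycles_per_sample(uint32_t cpu_hz, uint32_t sample_rate_hz)
{
    return cpu_hz / sample_rate_hz;
}

/* Rough cost of one LMS iteration: one multiply-accumulate per tap for
 * the filter output, plus one per tap for the weight update.
 * cycles_per_mac is an assumed figure, not a datasheet value. */
uint32_t lms_cycles_estimate(uint32_t taps, uint32_t cycles_per_mac)
{
    return 2u * taps * cycles_per_mac;
}

/* Nonzero when the estimated work fits inside the per-sample budget. */
int fits_in_real_time(uint32_t needed, uint32_t available)
{
    return needed <= available;
}
```

For example, with an assumed 100 MHz clock and a 48 kHz sample rate there are roughly 2083 cycles per sample. A 32-tap filter at an assumed 100 cycles per emulated floating-point multiply-accumulate would need about 6400 cycles, which would not fit.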

Take a look at Q math representation which the 5515 has instructions to support.

http://en.wikipedia.org/wiki/Q_(number_format)
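As a minimal sketch of the idea (not a TI library API; the helper names are made up for illustration), a Q15 value stores a number in [-1, 1) as a 16-bit integer scaled by 2^15. Multiplying two Q15 values gives a Q30 product in 32 bits, which is shifted right by 15 to return to Q15. The code assumes an arithmetic right shift for negative values, which is what common compilers provide.

```c
#include <stdint.h>

#define Q15_SHIFT 15

/* Convert a real value to Q15, rounding to nearest and clamping
 * to the representable range. */
int16_t float_to_q15(double x)
{
    double scaled = x * 32768.0;
    if (scaled >= 32767.0) return 32767;
    if (scaled <= -32768.0) return -32768;
    return (int16_t)(scaled >= 0.0 ? scaled + 0.5 : scaled - 0.5);
}

/* Interpret a Q15 value as a real number. */
double q15_to_float(int16_t q)
{
    return q / 32768.0;
}

/* Q15 x Q15 -> Q30 in 32 bits, then back to Q15.
 * The only product that does not fit is (-1) * (-1), so saturate it. */
int16_t q15_mul(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;   /* Q30 */
    int32_t result = product >> Q15_SHIFT;       /* Q15 */
    if (result > 32767) result = 32767;
    return (int16_t)result;
}
```

Here 0.5 is stored as 16384, and multiplying it by itself yields 8192, which reads back as 0.25.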

Jim Noxon

0 Douglas Lambert over 14 years ago in reply to Jim Noxon

Prodigy 80 points

Jim,

I'll check that out! It's been a lot of self-teaching, and sometimes just being made aware of an option puts you a mile ahead of where you were. I hate to ask, but is there a simple example of the Q math implementation? Otherwise I'll start digging through some of the TI documentation and see what I can find.

0 Jim Noxon over 14 years ago in reply to Douglas Lambert

TI__Genius 14940 points

You might want to check out the IQMath library and the IQMath User's Guide, which explains many of the math operations available.

Jim Noxon

0 Brian Willoughby over 14 years ago in reply to Jim Noxon

Genius 4630 points

Hmm, IQMath seems entirely designed for the C64x+ family, but we're focused on the C55x family here. Is there an equivalent of IQMath that is appropriate for C55x? The OP is using C5515 and I am using C5506.

0 Douglas Lambert over 14 years ago in reply to Brian Willoughby

Prodigy 80 points

Jim,

I dug up an older post that pointed out the absence of a C55XX implementation of the IQMath Library. Is this the case?

http://e2e.ti.com/support/dsp/tms320c5000_power-efficient_dsps/f/109/p/32071/111663.aspx#111663

I find it rather surprising that TI doesn't have a library for their latest low-power line.

0 Jim Noxon over 14 years ago in reply to Douglas Lambert

TI__Genius 14940 points

Yes, it is a bit disheartening because the C55xx DSPs do support fractional math via a status register bit. Unfortunately the C compiler environment never uses it, so you must write the routines that take advantage of it yourself. I was trying to find the documentation which describes a syntactic pattern that helps the compiler recognize operations very useful to fixed point arithmetic, but I couldn't find it for the C55xx devices. However, I did find it for the C54xx devices. See section 6.7.3 of the TMS320C54x Optimizing C/C++ Compiler User's Guide.

Since both the C54x and C55x processors only have 17x17 bit MAC units, performing actual 32x32 multiplies in C would require a subroutine call. However, generally speaking, when doing fixed point arithmetic on 16 bit operands, what is usually wanted is a 16x16 multiply giving a 32 bit result from which only 16 bits are extracted. I know that using the procedure described for the 54x also works for the 55x. Since the result of the multiply is held in a 40 bit accumulator, access to any specific 16 bits is only a single instruction using the barrel shifter within both parts. It puts some requirements on you to write the code, but you can get vast improvements in speed if you take the time to do so.

As an example, let's assume you have an input ADC which is 12 bits wide, so you can receive a value from 0 to 4095. The range of the ADC is 0 to 3.3 volts and we would like to represent the input number of the ADC in volts, so we must effect the multiply of 3.3/4096*adc_input. Since the output result is a number from 0 to 3.3, we can represent it as a q2.14 number. This representation indicates there are 2 bits to the left of the decimal point and 14 bits to the right of it, implying a 16 bit number. If we let bits 15 and 14 of a 16 bit value be the two bits to the left of the decimal point then we can represent a number from 0 to

2 + 1 + 1/2 + 1/4 + 1/8 + 1/16 + 1/32 + 1/64 + 1/128 + 1/256 + 1/512 + 1/1024 + 1/2048 + 1/4096 + 1/8192 + 1/16384 = 3.99993896484375

which is more than enough to represent our range of 0 to 3.3 volts. Notice also that our ADC input is in a q16.0 format where we know the upper 4 bits will always be zero (or in the case of signed numbers will always be sign extension bits). The key here is we want to calculate the voltage seen by the ADC with as much precision as possible. The gain value of 3.3/4096 = 0.0008056640625 can be represented as a q-10.26 number (Note the negative value). Thus the gain in volts per bit is represented as 54067 in a q-10.26 format.

Now assume our last ADC reading was 3145. We multiply it by 54067 to get 170040715. Because this was a q16.0 number multiplied by a q-10.26 number, we get a 32 bit result in q6.26 format. To get the voltage value in q2.14 format, we shift the 32 bit result right by 26-14=12 bits and AND with 0xFFFF to get our q2.14 result. The line of code would look like this

result = ((unsigned long)adc_in * (unsigned long)(unsigned int)54067) >> 12; // and of 0xFFFF implied by store operation

This results in a value of 41513, which when interpreted in the q2.14 format is 2.53375244140625 volts. All without any floating point operations.

Although in the above code we used a shift of 12 rather than 16 as in the compiler guide, the compiler still produces pretty efficient code because the C54x and C55x processors have 40 bit ALUs which incorporate barrel shift registers along with the 17x17 multiplier. Note the cast to unsigned int of the constant value. This is required to make sure the compiler recognizes how we want the value interpreted; otherwise the compiler would interpret it as a signed long, which would force a call to a 32x32 multiply function.
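The ADC example above can be tried on a desktop compiler with the sketch below. It is an illustration only: the function and macro names are hypothetical, and the fixed-width types from stdint.h stand in for the 16-bit unsigned int and 32-bit unsigned long of the C55x so the arithmetic behaves the same on a host machine.

```c
#include <stdint.h>

/* 3.3 / 4096 volts per count, scaled by 2^26 and truncated (q-10.26). */
#define ADC_GAIN_Q26 54067u

/* Convert a 12-bit ADC reading (q16.0) to volts in q2.14 format.
 * The 32-bit product is q6.26; shifting right by 26 - 14 = 12 bits
 * leaves 14 fractional bits, and the 16-bit return type keeps the
 * low 16 bits. */
uint16_t adc_to_volts_q2_14(uint16_t adc_in)
{
    uint32_t product = (uint32_t)adc_in * (uint32_t)ADC_GAIN_Q26;
    return (uint16_t)(product >> 12);
}

/* Interpret a q2.14 value as a real number, for checking results only. */
double q2_14_to_double(uint16_t q)
{
    return q / 16384.0;
}
```

Calling adc_to_volts_q2_14(3145) and passing the result to q2_14_to_double() reproduces the voltage worked out by hand above. The largest 12-bit reading, 4095, gives a product well below 2^32, so the intermediate value cannot overflow.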

Jim Noxon

0 Douglas Lambert over 14 years ago in reply to Jim Noxon

Prodigy 80 points

Jim,

That's wonderfully helpful! Thank you for taking the time to create such an informative post. I'm working with my colleagues and a professor here to implement the concept in our design. When we have it hashed out, I'll post the solution on the boards!

0 Jim Noxon over 14 years ago in reply to Douglas Lambert

TI__Genius 14940 points

Excellent! Glad to have been able to help. Looking forward to seeing your successful solution.

Jim Noxon

0 Douglas Lambert over 14 years ago in reply to Jim Noxon

Prodigy 80 points

Alright, so it's been quite a while but I finally have a working implementation. The Q math certainly led me down the right path. In the end, we did some clever manipulation of the steepest descent function to balance it out to a power of 2, allowing us to do the multiplications and divisions with simple bit shifts. Similarly, we knocked the incoming integer down a bit to prevent overflow.

In its simplest form it worked out similarly to this:

for(i=0;i<(Vector_Size);i++)

{

Input=*(Rcv_input+j);

Input=Input >> 1;

Output+=(signed long)*(TapWeight+i)*(signed long)Input;

}

Output=Output >> 15;

Error=(Output << 1)+Input;

for(i=0;i<(Vector_Size);i++)

{

j=k-i;

if(j<0)

{

//Handles the case when k-i is less than the last buffer, so it loops back to the past values in the 44 45 46 47 8 range

j=XMIT_BUFF_SIZE-i+k;

}

Input=*(Rcv_input+j);

Input=Input >> 1;

Step_Direction=(signed long)Input*(signed long)Error;

Step=Step_Direction >> 16;

*(TapWeight+i)-=Step;

}

*(Xmit+k)=Output << 1;

It was quite the experience. I'd never really handled integer-based math and I think I picked up a lot. Again, I'd like to thank everyone who chipped in and gave a word (or an incredibly helpful paragraph!) of advice!
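The wrap-around indexing used in the loops above (j = XMIT_BUFF_SIZE - i + k when k - i is negative) can be isolated in a small helper, which makes the filter loop easier to check. The sketch below is an illustration under assumptions, not the project's actual code: the function names are hypothetical, taps are treated as Q15 values, and a 64-bit accumulator stands in for the 40-bit hardware accumulator so it can run on a host compiler. It also assumes an arithmetic right shift for negative values.

```c
#include <stdint.h>

/* Map an index in [-size, size) back into [0, size). */
int wrap_index(int idx, int size)
{
    return (idx < 0) ? idx + size : idx;
}

/* Clamp a wide result to the 16-bit range. */
int16_t saturate16(int64_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Fixed-point FIR output for sample k of a circular input buffer.
 * Each input is halved first for headroom, as in the post above,
 * so the result is at half scale. */
int16_t fir_q15(const int16_t *taps, const int16_t *x,
                int k, int num_taps, int buf_size)
{
    int64_t acc = 0;
    int i;

    for (i = 0; i < num_taps; i++) {
        int j = wrap_index(k - i, buf_size);
        acc += (int32_t)taps[i] * (int32_t)(x[j] >> 1);
    }
    return saturate16(acc >> 15);   /* Q15 taps: drop 15 fraction bits */
}
```

With a buffer of 4 samples, k = 0, and 2 taps, the second tap reads index 3, the last element of the buffer, rather than running off either end.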

0 Cypher Punks over 14 years ago in reply to Douglas Lambert

Intellectual 410 points

Just a brief side note from a C programmer:

Yes, you can write *(array+i), but the usual (and far more readable) way to write that is array[i].

The fact that they are equivalent is useful for a programmer to know, but writing TapWeight[i] or Rcv_input[j] is a lot easier on the eyes. And it lets you write much more compact loop bodies like Output += (long)TapWeight[i] * (long)(Rcv_input[j] >> 1);

0 Norman Wong over 14 years ago in reply to Cypher Punks

Guru 26430 points

While we are hijacking this thread, there is also the question what compiles into the fastest code. For example, let's take the case of copying one array to another.

The most common form seen on this forum.

for(i=0; i<n; i++)
*(dst+i) = *(src+i);

Good for processors with a move pointer+index instruction.

As suggested by Cypher Punks above, this form is more readable.

for(i=0; i<n; i++)
dst[i] = src[i];

Also good for processors with a move pointer+index instruction.

And the form that is most common with C programmers.

pdst=dst;
psrc=src;
for(i=0; i<n; i++)
*pdst++ = *psrc++;

Good for processors with a move pointer post-increment instruction. It seems a lot of processors don't have this instruction. On those processors, the pointer + indexed form is probably faster.

Sometimes using the library function is faster.

memcpy(dst, src, n);

Assumes that memcpy() has been hand coded in assembler or exploits some single instruction that does an array copy. I would guess that there is a specialized DSP instruction for that.
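The forms above can be put side by side and checked against each other, as in the sketch below. The function names are hypothetical, and which form is fastest on a given target is something only the generated assembly or a benchmark can settle. One detail to watch is that memcpy() takes a size in units of sizeof, not an element count, so the count is multiplied by the element size here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Array-index form. */
void copy_indexed(int16_t *dst, const int16_t *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

/* Pointer post-increment form. */
void copy_pointer(int16_t *dst, const int16_t *src, size_t n)
{
    while (n-- > 0)
        *dst++ = *src++;
}

/* Library form: the length argument is n elements times the element size. */
void copy_memcpy(int16_t *dst, const int16_t *src, size_t n)
{
    memcpy(dst, src, n * sizeof *src);
}
```

All three leave the destination identical to the source, so the choice can be made on readability first and revisited only if profiling points at the copy.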

0 Jim Noxon over 14 years ago in reply to Douglas Lambert

TI__Genius 14940 points

Thank you very much for sharing your solution with the community. This is a great example of what the E2E community forums are all about!

Jim Noxon

0 Cypher Punks over 14 years ago in reply to Norman Wong

Intellectual 410 points

Regarding dest[i] = src[i] vs. *pdest++ = *psrc++, it's worth mentioning that these days, most compilers are good enough to convert between the two forms as necessary, so it doesn't really matter which one you use.

0 Norman Wong over 14 years ago in reply to Jim Noxon

Guru 26430 points

Jim's reply about avoiding the 32x32 function call reminded me of this documentation:

TMS320C55x DSP Programmer’s Guide (spru376)
3.1.2 How to Write Multiplication Expressions Correctly in C Code
Writing multiplication expressions in C code so that they are both correct and efficient can be confusing, especially when technically illegal expressions can, in some circumstances, generate the code you wanted in the first place. This section will help you choose the correct expression for your algorithm. The correct expression for a 16x16−>32 multiplication on a C55x DSP is:
long res = (long)(int)src1 * (long)(int)src2;
According to the C arithmetic rules, this is actually a 32x32−>32 multiplication, but the compiler will notice that each operand fits in 16 bits, so it will issue an efficient single-instruction multiplication. A 16-bit multiplication with a 32-bit result is an operation which does not directly exist in the C language, but does exist on C55x hardware, and is vital for multiply-and-accumulate (MAC)-like algorithm performance. Example 3−1 shows two incorrect ways and a correct way to write such a multiplication in C code.
Example 3−1. Generating a 16x16−>32 Multiply
long mult(int a, int b)
{
    long result;

    /* incorrect */
    result = a * b;

    /* incorrect */
    result = (long)(a * b);

    /* correct */
    result = (long)a * b;

    return result;
}
Note that the same rules also apply for other C arithmetic operators. For example, if you want to add two 16-bit numbers and get a full 32 bit result, the correct syntax is:
long res = (long)(int)src1 + (long)(int)src2;
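To see why the first two forms in Example 3−1 are marked incorrect, it helps to look at what happens when the multiply is carried out in 16 bits. On a desktop compiler int is usually 32 bits, so the sketch below emulates the 16-bit case explicitly. The function names are hypothetical, and the truncating version shows typical wraparound behavior; on a real target with 16-bit int, signed overflow is formally undefined.

```c
#include <stdint.h>

/* Emulates `result = a * b` with 16-bit int: only the low 16 bits of
 * the product survive before the value is widened to 32 bits. */
int32_t mul_truncated(int16_t a, int16_t b)
{
    uint16_t low = (uint16_t)((uint32_t)a * (uint32_t)b);
    return (int32_t)(int16_t)low;
}

/* Equivalent of `result = (long)a * b`: operands are widened first,
 * so the full 32-bit product is kept. */
int32_t mul_widened(int16_t a, int16_t b)
{
    return (int32_t)a * (int32_t)b;
}
```

For small operands such as 100 and 100 both functions agree. For 300 and 300 the true product, 90000, no longer fits in 16 bits, so only the widened version returns it.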

0 Jim Noxon over 14 years ago in reply to Norman Wong

TI__Genius 14940 points

Hmmm... We are beginning to really digress here. I've been holding back on commenting as it is partly my responsibility to keep threads from diverging without starting a new one. Perhaps we should move it, but the discussion at present has less to do with the correct or best answer than with stylistic interpretation, and I'm not sure which forum to move the discussion to. Thus, I'll chime in here as well.

Cypher Punks said:
Just a brief side note from a C programmer:

Cypher's first response fits nicely into the C parlance of a type. Here the basic type is "programmer" and the qualifier is "C", hence we have a "C Programmer". We could further expand on this with additional type modifiers such as "Real", "Old School", "New School", "Novice", etc. The question is what the default modifier applied to the qualified type "C Programmer" is. Since the post Cypher replied to was already in C, doesn't it imply Douglas' post was also written by a "C Programmer"? Yet Cypher's statement here would indicate some form of differentiation from the "C Programmer" type Douglas is. Perhaps Douglas was considered a "Novice C Programmer" and Cypher feels like a "Real C Programmer". I would characterize Douglas more like an "Old School C Programmer" and Cypher more of a "New School C Programmer". This, of course, is merely my opinion due to the stylistic form (or dialect) each prefers in writing their respective source code. =]

When the C language was invented (or initially evolved from B) there were few optimizers if any which were applied either to source or machine code. This is exactly why C has such a robust set of operators like pre and post increment and decrement, assignment operators such as +=, >>=, etc., and most importantly easy access to objects via pointers with * and & operators. The biggest reason for this rich set of operators was in direct response to the lack of optimizers. Thus a means to align the source code closely with the intended operations of the underlying machine was sought. This allows C code to be written in a manner whereby the compiler could infer the instruction type wanted by the coder. Thus, without the advent of optimizers, one might find code written as

#ifdef ARCH_A
for( ndx = 0; ndx < len; ndx++ )
dst[ndx] = src[ndx];
#elif ARCH_B
ndx = 0;
while( ndx++ < len )
*dst++ = *src++;
#else
#error "ERROR: Must choose an architecture to compile to"
#endif

In the above example, being "easy on the eyes"[Cypher] was less important than optimal code production as identified by Norman. Most would probably agree the conditional compilation construct makes the code far worse to read than either line alone would have. However, as optimizers came on line and increased in quality and capability, the above conditional compilation was no longer necessary as either form would compile to the optimal code regardless of what machine was targeted. Unfortunately, it wasn't good enough to merely choose one form or the other and remove all the conditional compilation. This is because the metric for the choice moved from "Fastest" or "Smallest" to "Esthetically Pleasing" or "Familiarity". The latter metrics are much more difficult to quantify and in the end has continued to be debated whereas the former decision metric had little variance of interpretation, it was either fastest or smallest or it wasn't. Thus the code from above now looks more like

#ifdef USER_PREFERS_ARRAY_INDICES
for( ndx = 0; ndx < len; ndx++ )
dst[ndx] = src[ndx];
#elif USER_PREFERS_POST_INCREMENT
ndx = 0;
while( ndx++ < len )
*dst++ = *src++;
#else
#error "ERROR: Must choose a stylistic approach to view code by"
#endif

Of course the above conditional code, if ever actually found in a real source file, would be scoffed at and immediately removed as the choice is now irrelevant from the standpoint of actually implementing code.

With respect to the last post by Cypher regarding the need for the special syntax regarding other operators as well, it certainly doesn't hurt for those special cases where you have two 16 bit numbers which will add to get a 17 bit number, but it has much more limited usefulness, since sign extension is performed by the choice of opcode rather than by choosing to circumvent additional yet unnecessary operations as in the use of the multiply operator. One could certainly argue this for the division case as well, but even here, since there is no underlying instruction to accomplish a quick divide or modulo, use of the syntax is not much better than choosing which form of pointer indirection to use. It could certainly be argued that it would allow the code to be efficiently compiled if it ever were compiled on a machine which supported such quick divide operations, but since this syntactic structure was built into the C5xxx code gen tools and parser, it is unlikely that other targets would draw the same implication from such a syntax.

I have to admit it's a bit ironic Cypher is willing to obfuscate code here more than necessary but at the same time wanting to simplify its view in other aspects. Perhaps if there were a more stylistic approach or even multiple approaches to telling the compiler how we want the code implemented Cypher would choose the more esthetic one here as well.

Please don't take any of this personally Cypher as no malice is intended. I merely am pointing out that regardless of your point of view, others will have a different one. I seem to remember a saying...

Opinions are like your backside, everyone has one and they all stink at one time or another (including mine).

Jim Noxon

0 Jim Noxon over 14 years ago in reply to Jim Noxon

TI__Genius 14940 points

I must correct my own correction here as in the last section of my previous post I meant to be talking about Norman rather than Cypher. Again, all in good humor.

Jim Noxon

0 Norman Wong over 14 years ago in reply to Jim Noxon

Guru 26430 points

Jim,

Actually, my post about the syntax to avoid the 32x32 multiply had nothing to do with the side-tracked discussion about pointers and arrays.

It was actually to your response on 28 Mar 2011 3:26 PM with this line:
result = ((unsigned long)adc_in * (unsigned long)(unsigned int)54067) >> 12;
That odd double cast is documented in spru376. I think only TI compilers recognize this pattern.

In the end, Douglas did not use that syntactic pattern of casting twice.

Indeed this digression has gotten a bit out of hand. Sorry about that.

0 Brian Willoughby over 14 years ago in reply to Cypher Punks

Genius 4630 points

Cypher Punks said:
Regarding dest[i] = src[i] vs. *pdest++ = *psrc++, it's worth mentioning that these days, most compilers are good enough to convert between the two forms as necessary, so it doesn't really matter which one you use.

You will find that the c55 compiler is not like most. Slight variations in the source can have significant effect on the code generated. If benchmarking shows that a particular section of source is taking a lot of cycles, then you might want to examine the mixed source and assembly output of the compiler to adjust your source until more efficient code is generated.
