decent 28xx compiler (cl2000) support for interfacing C/C++ and inline assembly

Jason R Sachs

When is TI's 28xx compiler going to support interfacing between C/C++ variables and symbolic quantities in inline assembly code? At present there is no such support. I can add a few lines of inline assembly, but there's no way to access any program variables unless they are static variables.

(This was one of the recommended approaches in a question I asked on stackoverflow, and someone commented this was a "MAJOR FAIL" for the compiler.)

Microchip's C compiler supports this (see their C30 compiler user's guide DS51284F, chapter 8: http://ww1.microchip.com/downloads/en/DeviceDoc/MPLAB%20C30%20UG_DS-51284f.pdf ). TI has an otherwise excellent C/C++ development environment, but this is a sorely missed feature, especially for small bits of inline assembly (5-10 instructions) that would speed up small calculations (e.g. minimum of three 16-bit numbers) without the overhead of a function call.

over 14 years ago

Jason R Sachs over 14 years ago

Expert 1890 points

Alternatively please give us a mechanism to write our own intrinsics.

Lori Heustess over 14 years ago in reply to Jason R Sachs

TI__Guru* 89465 points

Hi Jason,

Thank you for the feedback and suggestions. I am going to move this post to the code composer studio forum. More of the folks that deal directly with compiler development watch that forum.

Regards

-Lori

George Mock over 14 years ago in reply to Lori Heustess

TI__Guru**** 232880 points

I think intrinsics are a much better solution than inline assembly. Intrinsics act just like function calls. Thus they avoid many problems introduced by a non-standard language extension like inline assembly. We try to provide intrinsics for all of the "unusual" instructions on the device. What intrinsics are you missing?

Thanks and regards,

-George

Jason R Sachs over 14 years ago in reply to George Mock

Expert 1890 points

Thanks but I think you're missing the point: intrinsics can be great but they can only cover a limited number of cases. There are a lot of times that I just have a 5-10 line assembly procedure which may take 8-12 instruction cycles, and I can't use inline assembly because I don't have any way of interfacing from C/C++ variables to the inline assembly. This is an impedance-matching problem, not a how-to-implement problem. Assembly-implemented C-callable functions are ok but then you incur the overhead of a function call and the push/pops needed to do it.

Inline assembly with interfacing parameters is a nice way of working with the C compiler rather than against it.

With respect to intrinsics, I'd love to use them if I had a way to write my own. The ones provided by TI (SPRU514C) are not well documented:

- there is no mention of side effects (which flags or registers are affected... it's unclear whether the instructions given are literally the ones used or whether there are other ones inserted by the compiler as necessary, and if so which ones those are)

- whether they operate on rvalues or lvalues: if __rpt_rol(src, count) takes an rvalue, how does the C compiler interface with __rpt_rol(mystruct.member[1].foo->something) ?

- many of them (__rol, __sat, __addcu, etc) put their result in the accumulator. I was just going to ask how is it possible to get the accumulator value back into C, and then I realized that the compiler translates that into a C return value of the intrinsic.

- it is not clear how to combine several intrinsics together in an efficient way (will the compiler insert lots of extra code to slosh back and forth values between C and assembly?)

- __abs16_sat takes 1 argument (src) but the other two columns mention a "dst" (2nd argument?)

When I get down to brass tacks and start optimizing, I usually do so with a building block that will get reused, and I want to know what the chip is doing, so I'd really like to just write what I need in assembly. Again, the hard part is usually not writing the assembly code but interfacing it efficiently with C in a way that doesn't ruin the speed advantage by requiring a function call.

As far as which intrinsics are missing, there are some core math ones like

short __qmpy16(short a16, short b16, int q) -> 16x16 multiply (a16*b16)

unsigned short __uqmpy16(unsigned short a16, unsigned short b16, int q) -> 16x16 multiply (a16*b16)

(+ similar ones to do (a16*b16+L32) >> N, (a16*b16+c16*d16) >> N)

The IQ math is great but in most of the cases I work with, 16x16 is fine and it puzzles me why TI seems to have such good support for 32x32 "IQMath" but nothing for 16x16.

George Mock over 14 years ago in reply to Jason R Sachs

TI__Guru**** 232880 points

First, more detail on why intrinsics are preferred over inline assembly. Then I'll answer specific parts of your post.

A compiler does not know what is really going on in a piece of inline assembly. So, it must be very conservative about it. It has to presume this code could modify the execution environment in any way whatsoever, and/or carefully document what is not allowed in inline assembly. It just isn't pretty. Intrinsics have none of these problems. The compiler has a model of how each instruction works. An intrinsic obviously maps to one or more instructions represented in this model. The compiler has as much knowledge about instructions generated for intrinsics as it does about instructions generated for ordinary C statements. Much better!

Jason R Sachs said:
- there is no mention of side effects (which flags or registers are affected... it's unclear whether the instructions given are literally the ones used or whether there are other ones inserted by the compiler as necessary, and if so which ones those are)

Generally, the instructions used are those shown in the documentation for the intrinsic. To see what is generated, compile with -s and inspect the resulting assembly (.asm) file. I'm not sure why you are concerned with side effects. Let the compiler worry about that for you.

Jason R Sachs said:
- whether they operate on rvalues or lvalues: if __rpt_rol(src, count) takes an rvalue, how does the C compiler interface with __rpt_rol(mystruct.member[1].foo->something) ?

An intrinsic acts just like a function call, even though it is implemented with a few (usually one) instructions. So ...

__rpt_rol(really_long_expr1, really_long_expr2);

acts just like ...

your_function_call(really_long_expr1, really_long_expr2);

In your terms, those long expressions are treated as rvalues. When you write ...

var = __rpt_rol(...)

var is treated as an lvalue. Again, just like a function call. Generally speaking, the return value of an intrinsic is the destination operand of the instruction.

Jason R Sachs said:
- it is not clear how to combine several intrinsics together in an efficient way (will the compiler insert lots of extra code to slosh back and forth values between C and assembly?)

Feel free to invoke multiple intrinsics one right after another. The return value of one intrinsic can be an input to the next. This is exactly how they are intended to be used. It ends up looking a bit like hand-coded assembly. But it has all the advantages of full integration with the C environment.

Jason R Sachs said:
- __abs16_sat takes 1 argument (src) but the other two columns mention a "dst" (2nd argument?)

The dst mentioned in the other two columns is the return value of the intrinsic.

Jason R Sachs said:
When I get down to brass tacks and start optimizing, I usually do so with a building block that will get reused, and I want to know what the chip is doing, so I'd really like to just write what I need in assembly. Again, the hard part is usually not writing the assembly code but interfacing it efficiently with C in a way that doesn't ruin the speed advantage by requiring a function call.

It should be straightforward to re-write a block of hand-coded assembly as a series of intrinsics. Then that series of intrinsics can become a C macro or function.

Jason R Sachs said:

As far as which intrinsics are missing, there are some core math ones like

short __qmpy16(short a16, short b16, int q) -> 16x16 multiply (a16*b16)

unsigned short __uqmpy16(unsigned short a16, unsigned short b16, int q) -> 16x16 multiply (a16*b16)

(+ similar ones to do (a16*b16+L32) >> N, (a16*b16+c16*d16) >> N)

The IQ math is great but in most of the cases I work with, 16x16 is fine and it puzzles me why TI seems to have such good support for 32x32 "IQMath" but nothing for 16x16.

The intrinsics provide access to the instructions on the device. It seems there are no instructions which perform these operations, thus no intrinsics. As I'm not a C2000 expert, I don't know why.

Thanks and regards,

-George

pf over 14 years ago in reply to Jason R Sachs

TI__Expert 4930 points

Jason R Sachs said:
short __qmpy16(short a16, short b16, int q) -> 16x16 multiply (a16*b16)

Maybe I'm missing something, but the only QMPY instructions I see in SPRU430c are 32x32 multiplies. Given that both short and int are 16 bits on C2000, I think a 16x16 multiply would be simply a*b.

StephanS over 14 years ago in reply to pf

Genius 4006 points

I would like to add my 10 ct to the question "what intrinsics are you missing":

I am missing intrinsics for 16 and 32 bit addition and subtraction with saturation.

Another nice thing would be IQmpy for 16 bit - the current one only supports 32 bit.

Jason R Sachs over 14 years ago in reply to George Mock

Expert 1890 points

Georgem said:
First, more detail on why intrinsics are preferred over inline assembly. Then I'll answer specific parts of your post.

Thanks for the explanation and comments. TI should really really really consider producing either an appendix or an application note on proper use of intrinsics, containing the points you brought up, along with a dozen or two examples. What's in the compiler manual doesn't do justice to the topic.

Georgem said:
Generally, the instructions used are those shown in the documentation for the intrinsic. To see what is generated, compile with -s and inspect the resulting assembly (.asm) file. I'm not sure why you are concerned with side effects. Let the compiler worry about that for you.

So let's say that I'm trying to use a pair of intrinsics that do operations __X and __Y. Operation __X sets the carry bit (e.g. from a rotate or shift). Operation __Y takes the carry bit and combines it with a value of my choice.

__X(someval);

someVariable = __Y(somePointer->memberField.subMember);

How am I supposed to know whether the __Y operation preserves the carry or not? The "documentation" for the intrinsics tells me what core instructions they use. It doesn't tell me what auxiliary instructions are used to access data, or whether the carry is altered in the process.

Looking at what assembly instructions are generated by the compiler is insightful, but is no guarantee that it will always do the same thing (in future versions of the compiler, or even in all uses of the same compiler). I maintain there should be more documentation on what the intrinsic functions do and do not do.

Georgem said:
...it puzzles me why TI seems to have such good support for 32x32 "IQMath" but nothing for 16x16.
The intrinsics provide access to the instructions on the device. It seems there are no instructions which perform these operations, thus no intrinsics.


pf said:

Maybe I'm missing something, but the only QMPY instructions I see in SPRU430c are 32x32 multiplies. Given that both short and int are 16 bits on C2000, I think a 16x16 multiply would be simply a*b.

The point for having an intrinsic is not necessarily whether or not the processor has a single instruction to achieve a 16x16 bit multiply with shift, but rather that such an operation is a significant primitive computational block that can be done with a small number of machine instructions.

If I want to do fixed-point math on the 28xx with multiplication of 16-bit numbers, the most "harmonious" way to do it with the processor's core instructions is to do a straight 16x16 multiply with 32-bit result in the accumulator, followed by the "MOVH loc16,ACC << 1..8" instruction. This pair of operations is equivalent to (x*y)>>N where N is between 8 and 15. I know how to do this in 2 or 3 instructions in assembly, so it is frustrating that there is no intrinsic, and no other way to interface such a quick assembly sequence, short of making a C-callable assembly function. The other more common way of doing this is to bully the compiler into doing what I want, e.g. "(short)(((long)x*(long)y)>>N)", which involves a magic incantation of casts that causes the compiler to properly optimize this into fast machine code. This "works" but is kind of fragile and there's no guarantee (other than looking at the resulting assembly code every time) that it does the best thing.
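A minimal sketch of how the cast-then-shift idiom could be wrapped so the casts live in one place. This is plain portable C, not a TI intrinsic: the names qmul16 and qmac16 are hypothetical, and the code assumes that >> on a negative signed value is an arithmetic shift (implementation-defined in C). Whether the compiler turns this into the short multiply-and-store sequence still has to be checked in the generated assembly.

```c
#include <stdint.h>

/* Hypothetical helper (not a TI intrinsic): (a * b) >> q using a 32-bit
 * intermediate product. Assumes 0 <= q <= 31 and an arithmetic right
 * shift for negative values. A result that does not fit in 16 bits is
 * narrowed in an implementation-defined way, so pick q accordingly. */
static inline int16_t qmul16(int16_t a, int16_t b, int q)
{
    int32_t product = (int32_t)a * (int32_t)b;  /* widen before multiplying */
    return (int16_t)(product >> q);             /* shift first, then narrow */
}

/* Hypothetical helper: (a*b + c*d) >> q. The caller must make sure the
 * sum of the two products fits in 32 bits. */
static inline int16_t qmac16(int16_t a, int16_t b,
                             int16_t c, int16_t d, int q)
{
    int32_t sum = (int32_t)a * (int32_t)b + (int32_t)c * (int32_t)d;
    return (int16_t)(sum >> q);
}
```

For example, with Q15 operands, qmul16(16384, 16384, 15) multiplies 0.5 by 0.5 and yields 8192, which is 0.25 in Q15.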

Jason R Sachs over 14 years ago in reply to Jason R Sachs

Expert 1890 points

OK, I offer the following challenge. What follows are some very simple computations. Either they can be implemented in intrinsics but I just can't figure out how (in which case there should be documentation), or there are no intrinsics (in which case there should be, or there should be a way for us users to write our own). If they can be implemented using intrinsics, I hope someone will post an answer. If they cannot, I hope someone from TI is reading this who has influence with the 28xx compiler group.

1) Maximum of 2, maximum of 3, minimum of 2, minimum of 3

The 28xx instruction set has these nice MAX / MAXL / MIN / MINL instructions for 16- and 32-bit maximum/minimum computations. If I could only use them :-(

Someone has gone to the trouble to make use of them in the __IQsat() intrinsic, but there's no __min16 / __min32 / __max16 / __max32 intrinsics.
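For comparison, a portable C fallback for the 16-bit min/max cases might look like the sketch below. The function names are hypothetical, and nothing here guarantees that the compiler maps the comparisons onto MIN / MAX instructions; that would have to be confirmed from the assembly output.

```c
#include <stdint.h>

/* Hypothetical portable helpers; not TI intrinsics. */
static inline int16_t min2_i16(int16_t a, int16_t b)
{
    return (a < b) ? a : b;
}

static inline int16_t max2_i16(int16_t a, int16_t b)
{
    return (a > b) ? a : b;
}

/* Minimum/maximum of three values, built from the two-input versions. */
static inline int16_t min3_i16(int16_t a, int16_t b, int16_t c)
{
    return min2_i16(min2_i16(a, b), c);
}

static inline int16_t max3_i16(int16_t a, int16_t b, int16_t c)
{
    return max2_i16(max2_i16(a, b), c);
}
```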

2) unsigned divide with quotient and remainder

There are often times when I have a 32-bit number X and a 16-bit number D where I know (because of what ranges of numbers I'm using for X and D) that the results of dividing X by D will be a 16-bit quotient Q and a 16-bit remainder R. I can get them myself using SUBCU instructions in assembly, but I can't do this easily in C because of the integer promotion rules (to divide a 32-bit number by something, it will promote the something to 32-bits) that force the use of a 32-bit divide, and anyway in C it doesn't keep both quotient and remainder. Looks like the __rpt_subcu() intrinsic might help but I'm not sure how to use it (there's a funny "dst" argument...)

Here's what I get using C:

(compiler input)

                uint32_t X = serialEncoder.getU32(cp);
                uint16_t D = serialEncoder.getU16(cp);

                divQ = X / D;
                divR = X % D;

(compiler output for the divQ and divR calculations. Notice that it does 32-bit divides because of operand promotion, and it does this twice because it's not smart enough to get quotient and remainder from the same calculation.)

        MOVZ      AR7,AL                ; |178|
        MOVL      P,XAR6                ; |178|
        MOVL      XAR0,#1583            ; |178|
        MOVZ      AR4,AR7
        MOVB      ACC,#0
        RPT       #31
||     SUBCUL    ACC,XAR4              ; |178|
        MOV       *+XAR2[AR0],P         ; |178|
        MOVZ      AR7,AR7
        MOVL      XAR0,#1584            ; |444|
        MOVB      ACC,#0
        MOVL      P,XAR6                ; |444|
        RPT       #31
||     SUBCUL    ACC,XAR7              ; |444|
        MOV       *+XAR2[AR0],AL        ; |444|
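One way to at least avoid the second divide in C is to compute the quotient once and derive the remainder with a multiply and subtract. The sketch below is an illustration under assumptions: the type divmod16_t and the function udivmod_32by16 are hypothetical names, and the divide itself is still a 32-bit operation as far as the C language is concerned.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: 16-bit quotient and remainder together. */
typedef struct {
    uint16_t quot;
    uint16_t rem;
} divmod16_t;

/* Divide a 32-bit value by a 16-bit value using a single division.
 * Precondition: d != 0 and the quotient fits in 16 bits. */
static inline divmod16_t udivmod_32by16(uint32_t x, uint16_t d)
{
    divmod16_t result;
    uint32_t q;

    assert(d != 0);
    q = x / d;                 /* the only division */
    assert(q <= 0xFFFFu);      /* caller promised a 16-bit quotient */

    result.quot = (uint16_t)q;
    result.rem  = (uint16_t)(x - q * (uint32_t)d);  /* x % d without dividing */
    return result;
}
```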

3) linear feedback shift register = LFSR (see http://en.wikipedia.org/wiki/Linear_feedback_shift_register)

If I have a 16-bit or 32-bit number X, the following operations perform one iteration of a LFSR, useful for generating some classes of random numbers.

X >>= 1;

if (bit shifted out was a 1)

X ^= LFSR_MASK

This is possible in C and looks fairly simple, but the compiler doesn't do a great job on it.

X = (X >> 1) ^ (X & 1 ? LFSR_MASK : 0);

In hand-coded assembly it would be rather elegant; an LSR instruction handles the shift and sets/clears the carry according to the bit shifted out; a conditional move could be used to handle the ?: logic.
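A branch-free way to write the same LFSR step in C is sketched below. The mask value 0xB400 is an assumption taken from the 16-bit Galois example in the Wikipedia article linked above, and lfsr16_step is a hypothetical name; whether cl2000 compiles this any better than the ?: form would need to be checked in the assembly output.

```c
#include <stdint.h>

/* Example tap mask (assumption: the 16-bit Galois LFSR example from the
 * Wikipedia article; substitute your own polynomial). */
#define LFSR_MASK 0xB400u

/* One iteration of a right-shifting Galois LFSR.
 * (0 - lsb) is all ones when the bit shifted out was 1 and zero otherwise,
 * so the AND selects either LFSR_MASK or 0 without a branch. */
static inline uint16_t lfsr16_step(uint16_t x)
{
    uint16_t lsb  = (uint16_t)(x & 1u);
    uint16_t mask = (uint16_t)((uint16_t)(0u - lsb) & LFSR_MASK);
    return (uint16_t)((x >> 1) ^ mask);
}
```

A state of zero stays at zero, so the register should be seeded with a nonzero value.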

4) memory copying

Given pointers src and dst, and a number N, move N words of memory from source locations (src, src+1,...src+N-1) to destination locations (dst, dst+1, ... dst+N-1), assuming there are no overlapping locations between source and destination.

I think maybe __rpt_mov_imm() or __rpt_mov_imm_far() might do what I want (or is it a memory fill?), but the documentation is horrible: it sounds like someone is reading aloud assembly instructions instead of telling you how to use the intrinsic. From SPRU514C:

SPRU514C said:
Move the dst register to the result register. Move the dst register to a temp (ARx) register. Copy the immediate src to the temp register count + 1 times. The src must be a 16-bit immediate. The count can be an immediate from 0 to 255 or a variable.

Jason R Sachs over 14 years ago in reply to Jason R Sachs

Expert 1890 points

5) Swap halves of a 32-bit number or a 16-bit number.

The compiler produces a correct but inefficient implementation for the following C code (retval is a 32-bit unsigned integer):

retval = ((retval << 16) & 0xffff0000) | ((retval >> 16) & 0x0000ffff);

Jason R Sachs over 14 years ago in reply to Jason R Sachs

Expert 1890 points

Jason R Sachs said:

5) Swap halves of a 32-bit number or a 16-bit number.

The compiler produces a correct but inefficient implementation for the following C code (retval is a 32-bit unsigned integer):

retval = ((retval << 16) & 0xffff0000) | ((retval >> 16) & 0x0000ffff);

Hmm. I guess the 1st "&" part above is superfluous, not sure about the 2nd one. The compiler still goes through a kind of clumsy dance when it just sees this:

x = (x << 16) | (x >> 16);

which it translates to:

        MOVL      ACC,XAR6              ; |296|
        MOVU      ACC,AH                ; |296|
        MOVL      P,ACC                 ; |296|
        MOVL      ACC,XAR6              ; |296|
        LSL       ACC,16                ; |296|
        OR        AL,PL                 ; |296|
        OR        AH,PH                 ; |296|
        MOVL      XAR6,ACC              ; |296|

For the record, this is with -O3, which makes the result disappointing. :-( I guess maybe I expect too much from a compiler.

George Mock over 14 years ago in reply to Jason R Sachs

TI__Guru**** 232880 points

Jason R Sachs said:

__X(someval);

someVariable = __Y(somePointer->memberField.subMember);

How am I supposed to know whether the __Y operation preserves the carry or not?

The C environment and the CPU environment are related, yet nonetheless distinct. Intrinsics are not intended to somehow blur that distinction. Please give an example of __X and __Y operations you need to use that share state which is not in the C environment. I suspect we will find a way to define the intrinsics to make it work, while keeping those environments distinct.

Jason R Sachs said:
The other more common way of doing this is to bully the compiler into doing what I want, e.g. "(short)(((long)x*(long)y)>>N)", which involves a magic incantation of casts that causes the compiler to properly optimize this into fast machine code. This "works" but is kind of fragile and there's no guarantee (other than looking at the resulting assembly code every time) that it does the best thing.

This is the preferred way to do a 16x16 to 32 multiply. See slide 16 of this presentation http://tiexpressdsp.com/index.php/CGT_Tips_%26_Tricks_for_Beginners .

Thanks and regards,

-George

George Mock over 14 years ago in reply to Jason R Sachs

TI__Guru**** 232880 points

Jason R Sachs said:

1) Maximum of 2, maximum of 3, minimum of 2, minimum of 3

The 28xx instruction set has these nice MAX / MAXL / MIN / MINL instructions for 16- and 32-bit maximum/minimum computations. If I could only use them :-(

Someone has gone to the trouble to make use of them in the __IQsat() intrinsic, but there's no __min16 / __min32 / __max16 / __max32 intrinsics.

I'll file a request to have these intrinsics added, and let you know the number of that request in our ClearQuest system.

Jason R Sachs said:

2) unsigned divide with quotient and remainder

There are often times when I have a 32-bit number X and a 16-bit number D where I know (because of what ranges of numbers I'm using for X and D) that the results of dividing X by D will be a 16-bit quotient Q and a 16-bit remainder R. I can get them myself using SUBCU instructions in assembly, but I can't do this easily in C because of the integer promotion rules (to divide a 32-bit number by something, it will promote the something to 32-bits) that force the use of a 32-bit divide, and anyway in C it doesn't keep both quotient and remainder. Looks like the __rpt_subcu() intrinsic might help but I'm not sure how to use it (there's a funny "dst" argument...)

This will work for unsigned types ...

   result32 = __rpt_subcu(num32, denom16, 15);
   quotient = result32 & 0xffff;
   remainder = result32 >> 16;

For signed types you have to do a fair amount more than that, and it is probably better to use the built-in C divide instead.
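To see what the repeated SUBCU in the __rpt_subcu call above is doing, a host-side model can help. The sketch below is not TI code: the function names are hypothetical, and the step rule (temp = (ACC << 1) - (src << 16); if temp >= 0 then ACC = temp + 1, else ACC = ACC << 1) is my reading of the SUBCU description in the CPU guide, so treat it as an assumption. In this model the quotient lands in the low 16 bits of the result and the remainder in the high 16 bits, which is worth confirming against the instruction reference before relying on which half is which.

```c
#include <stdint.h>

/* Model of a single SUBCU step (assumed semantics, see lead-in).
 * A 64-bit temporary stands in for the 33-bit intermediate value. */
static uint32_t subcu_step_model(uint32_t acc, uint16_t src)
{
    uint64_t shifted = (uint64_t)acc << 1;
    uint64_t sub     = (uint64_t)src << 16;

    if (shifted >= sub)
        return (uint32_t)(shifted - sub + 1u);  /* subtract, shift in a 1 */
    return (uint32_t)shifted;                   /* just shift in a 0 */
}

/* Model of "RPT #count || SUBCU ACC,src": the step runs count + 1 times.
 * Valid for unsigned operands when the quotient fits in 16 bits. */
static uint32_t rpt_subcu_model(uint32_t acc, uint16_t src, unsigned count)
{
    unsigned i;
    for (i = 0; i <= count; ++i)
        acc = subcu_step_model(acc, src);
    return acc;
}
```

For example, rpt_subcu_model(7, 2, 15) returns 0x00010003 in this model: low half 3 (the quotient), high half 1 (the remainder).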

Jason R Sachs said:
3) linear feedback shift register = LFSR (see http://en.wikipedia.org/wiki/Linear_feedback_shift_register)

I'll file another request in ClearQuest for this. I'm not sure if it will result in an intrinsic, or the compiler recognizing this idiom, similar to how 16x16 to 32 multiply is handled. At any rate, once it is documented, you should be able to wrap it in a well-named macro.

Jason R Sachs said:

4) memory copying

Given pointers src and dst, and a number N, move N words of memory from source locations (src, src+1,...src+N-1) to destination locations (dst, dst+1, ... dst+N-1), assuming there are no overlapping locations between source and destination.

I think maybe __rpt_mov_imm() or __rpt_mov_imm_far() might do what I want (or is it a memory fill?),

Those intrinsics do not work like memcpy, but like memset. I suspect you are interested in knowing when the compiler inlines a call to memcpy. I'll have to get some help on that.

Thanks and regards,

-George

George Mock over 14 years ago in reply to George Mock

TI__Guru**** 232880 points

For the MAX and friends intrinsics, I entered SDSCM00034938 in our ClearQuest system. For the linear feedback shift register idea, I entered SDSCM00034939. You are welcome to track these issues here https://cqweb.ext.ti.com/cqweb/main?command=GenerateMainFrame&service=CQ&schema=SDo-Web&contextid=SDOWP&username=readonly&password=readonly . Enter the SDSCM entries above in the "Find Record ID" box.

Thanks and regards,

-George

ThomasS over 14 years ago in reply to George Mock

TI__Expert 6550 points

Jason R Sachs said:

4) memory copying

Given pointers src and dst, and a number N, move N words of memory from source locations (src, src+1,...src+N-1) to destination locations (dst, dst+1, ... dst+N-1), assuming there are no overlapping locations between source and destination.

I think maybe __rpt_mov_imm() or __rpt_mov_imm_far() might do what I want (or is it a memory fill?),

The C2000 compiler can generate a RPT PREAD instruction for C/C++ code that does a memory copy. This requires optimization and the unified memory (-mt) switch. For example:

extern int *a, *c;

void foo()
{
    int i;

    for (i = 0; i < 10; i++)
        c[i] = a[i];
}

with cl2000 -v28 -ml -mt -o3 generates:

RPT #9
|| PREAD *XAR4++,*XAR7

This also works for memcpy calls:

#include <string.h>   /* memcpy is declared in <string.h> */

int *a, *b;

void foo()
{
    memcpy(a, b, 50);
}

with the same options generates:

RPT #49
|| PREAD *XAR4++,*XAR7

Jason R Sachs over 14 years ago in reply to ThomasS

Expert 1890 points

George and Thomas--

I appreciate your responses. I haven't had time to digest them yet, but wanted to say thanks.

--Jason

Jason R Sachs over 13 years ago in reply to Jason R Sachs

Expert 1890 points

...so here I am again; in some critically-fast code I need a one-sided unsigned saturation algorithm. __max() won't work because it's signed. And the __sathigh16() and __satlow16() intrinsics are bidirectional so I can't use them... (not to mention the documentation is unclear)

When is TI going to let us write our own intrinsics? I guarantee you that no matter what list of intrinsics is included with the compiler, there will always be other short segments of assembly instructions that are valuable to implement in an intrinsic-like manner, and neither inline assembly (which is next-to-impossible to interface properly with C) nor C-callable assembly functions (which incur at least an 8-cycle hit to enter/exit a function call) are feasible.

Please give us a mechanism to write our own intrinsics, or to interface C/C++ variables with inline assembly.
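For reference, the one-sided unsigned saturation being asked for is easy to state in portable C, even if the generated code is not what one would write by hand. The sketch below uses hypothetical names (usat_high16, uadd_sat16) and makes no claim about which instructions cl2000 emits for it.

```c
#include <stdint.h>

/* Hypothetical helper: clamp an unsigned value to an upper limit only;
 * there is no lower bound other than zero. */
static inline uint16_t usat_high16(uint16_t x, uint16_t limit)
{
    return (x > limit) ? limit : x;
}

/* Hypothetical helper: unsigned add that sticks at 0xFFFF instead of
 * wrapping around. The sum is formed in 32 bits so the carry is visible. */
static inline uint16_t uadd_sat16(uint16_t a, uint16_t b)
{
    uint32_t sum = (uint32_t)a + (uint32_t)b;
    return (sum > 0xFFFFu) ? (uint16_t)0xFFFFu : (uint16_t)sum;
}
```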

d200guy over 13 years ago in reply to Jason R Sachs

Intellectual 555 points

I'm looking for the best way to do byte swapping since the C28xx is little-endian and the remote terminal is big-endian. Should/could I be using intrinsics?
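As a baseline for the byte-swapping question, here is a plain C sketch using shifts and masks. The function names are hypothetical and this is not an intrinsic. Since the smallest addressable unit on the C28x is a 16-bit word, the 8-bit halves have to be picked out arithmetically rather than through a char pointer; whether the compiler reduces this to a short instruction sequence would need to be checked in the assembly output.

```c
#include <stdint.h>

/* Swap the two 8-bit halves of a 16-bit word: 0x1234 -> 0x3412. */
static inline uint16_t swap_bytes16(uint16_t x)
{
    return (uint16_t)(((x & 0x00FFu) << 8) | ((x >> 8) & 0x00FFu));
}

/* Reverse the four 8-bit fields of a 32-bit value:
 * 0x12345678 -> 0x78563412. */
static inline uint32_t swap_bytes32(uint32_t x)
{
    return ((x & 0x000000FFul) << 24)
         | ((x & 0x0000FF00ul) << 8)
         | ((x >> 8)  & 0x0000FF00ul)
         | ((x >> 24) & 0x000000FFul);
}
```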

Jason R Sachs over 13 years ago in reply to Jason R Sachs

Expert 1890 points

...and here I am again. I need a 32-bit min and max that translate into MINL or MAXL, as appropriate. __min() and __max() are intrinsics that only work on 16-bit variables and translate into MIN and MAX... or at least I think they do, because there doesn't seem to be an up-to-date set of documentation on the intrinsics: SPRU514C doesn't list __min and __max, and there is no SPRU514D.

I'll say it again: please give us a mechanism to let us write our own intrinsics, or interface C/C++ variables with inline assembly.

pf over 13 years ago in reply to Jason R Sachs

TI__Expert 4930 points

Try __lmin and __lmax (and __llmin and __llmax for long-long).

I do agree that the documentation is lacking. I suspect many of your requests to write your own intrinsics would be addressed by better descriptions of the ones that already exist.

Jason R Sachs over 13 years ago in reply to pf

Expert 1890 points

>Try __lmin and __lmax (and __llmin and __llmax for long-long).

Thanks, that helps.

Here's what I'm running into right now. I've got an array of 3 signed 16-bit integers, which I need to offset by the same amount but saturate between -32768 and +32767. (In other words, if array[2] = 32760 and my offset is +10, I should end up with +32767 and not 32760+10 = 32770, which wraps around to -32766.)
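The required behavior can be pinned down with a small portable C reference, which is also handy for checking any intrinsic-based or assembly version against. This is only a sketch: sat16_from32 and offset_sat3 are hypothetical names, and it says nothing about how many instructions the compiler will spend on it.

```c
#include <stdint.h>

/* Clamp a 32-bit value into the signed 16-bit range. */
static inline int16_t sat16_from32(int32_t v)
{
    if (v > INT16_MAX) return INT16_MAX;
    if (v < INT16_MIN) return INT16_MIN;
    return (int16_t)v;
}

/* Add the same offset to three 16-bit values, saturating each result.
 * The sum is formed in 32 bits so it cannot wrap before the clamp. */
static inline void offset_sat3(int16_t v[3], int16_t offset)
{
    int i;
    for (i = 0; i < 3; ++i)
        v[i] = sat16_from32((int32_t)v[i] + (int32_t)offset);
}
```

With the numbers from the example, an element holding 32760 offset by +10 comes out as 32767.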

I've tried using intrinsics but am pretty unsatisfied with how they perform. If I were just doing this once, I wouldn't try to optimize, but I use these sorts of operations all over my code and so there's a huge multiplier for speeding things up.

The most obvious method is the __satlow16 intrinsic, but the compiler is stupid (I've tried -O1 and -O3) and takes 36 instructions with obvious inefficiencies:

        Iabc[0] = __satlow16(adc.getMeasurement(0) - (int32_t)Iabc_ref);
        Iabc[1] = __satlow16(adc.getMeasurement(1) - (int32_t)Iabc_ref);
        Iabc[2] = __satlow16(adc.getMeasurement(2) - (int32_t)Iabc_ref);

        MOVB      XAR0,#46              ; |13|
        ADDB      XAR1,#39              ; |13|
        MOV       ACC,*+XAR4[0]         ; |13|
        SUB       ACC,*+XAR3[AR0]       ; |13|
        SETC      OVM                   ; |13|
        CLRC      SXM
        ADD       ACC,#65535 << 15      ; |13|
        SUB       ACC,#65535 << 15      ; |13|
        SUB       ACC,#65535 << 15      ; |13|
        ADD       ACC,#65535 << 15      ; |13|
        CLRC      OVM                   ; |13|
        SETC      SXM
        MOV       *+XAR1[0],AL          ; |13|
        MOV       ACC,*+XAR4[1]         ; |13|
        SUB       ACC,*+XAR3[AR0]       ; |13|
        SETC      OVM                   ; |13|
        CLRC      SXM
        ADD       ACC,#65535 << 15      ; |13|
        SUB       ACC,#65535 << 15      ; |13|
        SUB       ACC,#65535 << 15      ; |13|
        ADD       ACC,#65535 << 15      ; |13|
        CLRC      OVM                   ; |13|
        SETC      SXM
        MOV       *+XAR1[1],AL          ; |13|
        MOV       ACC,*+XAR4[2]         ; |13|
        SUB       ACC,*+XAR3[AR0]       ; |13|
        SETC      OVM                   ; |13|
        CLRC      SXM
        ADD       ACC,#65535 << 15      ; |13|
        SUB       ACC,#65535 << 15      ; |13|
        SUB       ACC,#65535 << 15      ; |13|
        ADD       ACC,#65535 << 15      ; |13|
        CLRC      OVM                   ; |13|
        MOVL      XAR5,#32767           ; |13|
        SETC      SXM
        MOV       *+XAR1[2],AL          ; |13|

Then I tried __lmin and __lmax which take 24 instructions (actually 22, the 2nd and 3rd lines below are part of previous logic) but the compiler is acting weird; it does what I'd do for the positive limit, loading it into a register and using MINL, but it repeats loading the same constant into memory 3 times for XAR6 and wastes 4 instructions (should be 2 instructions to load XAR6, then one MAXL for each saturation step, vs. the 9 instructions that are present now).

        register int32_t maxlim = 0x00007fffL;
        register int32_t minlim = 0xffff8000L;
        Iabc[0] = __lmax(minlim,__lmin(maxlim,adc.getMeasurement(0) - (int32_t)Iabc_ref));
        Iabc[1] = __lmax(minlim,__lmin(maxlim,adc.getMeasurement(1) - (int32_t)Iabc_ref));
        Iabc[2] = __lmax(minlim,__lmin(maxlim,adc.getMeasurement(2) - (int32_t)Iabc_ref));

        MOVL      XAR5,#32767           ; |13|
        SETC      SXM
        MOV       *+XAR1[2],AL          ; |13|
        MOV       ACC,*+XAR4[0]         ; |13|
        SUB       ACC,*+XAR3[AR0]       ; |13|
        MINL      ACC,XAR5              ; |13|
        MOVL      XAR6,ACC              ; |13|
        MOV       ACC,#-1 << 15
        MAXL      ACC,XAR6              ; |13|
        MOV       *+XAR1[0],AL          ; |13|
        MOV       ACC,*+XAR4[1]         ; |13|
        SUB       ACC,*+XAR3[AR0]       ; |13|
        MINL      ACC,XAR5              ; |13|
        MOVL      XAR6,ACC              ; |13|
        MOV       ACC,#-1 << 15
        MAXL      ACC,XAR6              ; |13|
        MOV       *+XAR1[1],AL          ; |13|
        MOV       ACC,*+XAR4[2]         ; |13|
        SUB       ACC,*+XAR3[AR0]       ; |13|
        MINL      ACC,XAR5              ; |13|
        MOVL      XAR6,ACC              ; |13|
        MOV       ACC,#-1 << 15
        MAXL      ACC,XAR6              ; |13|
        MOV       *+XAR1[2],AL          ; |13|

There *should* be a way to do something like:

[register setup omitted]
        MOV       ACC, #0x7fff
        MOVL      XAR5, ACC
        MOV       ACC, #0x8000
        MOVL      XAR6, ACC

        MOV       ACC,{Iabc[0]}
        SUB       ACC,{Iabc_ref}
        MINL      ACC,XAR5
        MAXL      ACC,XAR6
        MOV       {Iabc[0]},AL

        MOV       ACC,{Iabc[1]}
        SUB       ACC,{Iabc_ref}
        MINL      ACC,XAR5
        MAXL      ACC,XAR6
        MOV       {Iabc[1]},AL

        MOV       ACC,{Iabc[2]}
        SUB       ACC,{Iabc_ref}
        MINL      ACC,XAR5
        MAXL      ACC,XAR6
        MOV       {Iabc[2]},AL

which would take 15 instructions core math + 4 instructions setup + whatever other register setup is needed. I can do this in a C-callable assembly function, but then I get an 8-instruction hit just for the call and return, plus whatever instructions are necessary to load registers for the routine.

Again, I wouldn't optimize to this level of speed, except that I have pieces like this that show up several times in critical code I'm writing, and it really frustrates me that there's no way to get the compiler to do what seems to me like such a simple optimization exercise.

StephanS over 13 years ago in reply to pf

Genius 4006 points

Where can I find all those intrinsics (__lmin, __llmax,....)?

In the latest available documentation (spru514c.pdf), they are not available.

Are they listed in some FAQ, readme, Wiki, whatever?

Mitja Nemec over 13 years ago in reply to Jason R Sachs

Genius 5885 points

Hi Jason

I use a different approach, and I think it works better. Before the code that needs saturation, I set the OVM bit via inline assembly (asm(" SETC OVM")), and afterward I clear it. So far it has worked like a charm, although you might get in trouble with the compiler.

Probably the proper way would be to save ST0 first (asm(" PUSH ST0")), then set the OVM bit, and afterward restore ST0 via a POP instruction.

Regards, Mitja

Mitja Nemec over 13 years ago in reply to Mitja Nemec

Genius 5885 points

Your problem got me reading SPRU514C, chapter 7.2.2, and there is an obvious way to avoid problems with the compiler: wrap the code in a function that sets OVM at the beginning and clears it at the end. This should comply with the compiler's assumptions about the values of status/mode bits.

I have a question for TI support. Does the compiler observe the same rules for status register fields if the function is inlined? And please update SPRU514 already.

Regards, Mitja

pf over 13 years ago in reply to Jason R Sachs

TI__Expert 4930 points

Jason R Sachs said:

Here's what I'm running into right now. I've got an array of 3 signed 16-bit integers, which I need to offset by the same amount but saturate between -32768 and +32767. (In other words, if array[2] = 32760 and my offset is +10, I should end up with +32767 and not 32760+10=32770=-32766.)

Do you have a small compilable example that I could play with? I'm not fluent in C2000 asm, but I see several details that make me curious and I'd like to explore them.

For example, saturating arithmetic is apparently done by setting a status bit and then using the normal instructions. (In contrast to the C6000, which has separate ADD and SADD instructions.) I think one of our other targets went to some effort to notice consecutive saturating operations and set the status bit across the whole group rather than over-and-over individually; why isn't that happening here?

There are repeating sections that I'd expect to have been optimised; why weren't they?

And there are intriguing bits in your cited code. Why the cast to int32_t, why initialise a signed constant with an overflowing value, that sort of thing.

I could derive an example from what you've reported, but I want to make sure I'm using the same context and options that you are.

Thanks.

Jason R Sachs over 12 years ago in reply to pf

Expert 1890 points

...and here I am yet again.

Is there an intrinsic for the 28xx FLIP instruction? (reverse bits in a 16-bit word) If not, could you please put in a request for one?

(I will make my continued plea to allow users to write their own intrinsics, so that small assembly routines can be utilized efficiently from C code.)
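
In the meantime, here is a portable C sketch of what FLIP computes (reversing the bit order of a 16-bit word). The name flip16 is hypothetical and this is not a TI intrinsic; it swaps progressively larger bit groups, so it costs several instructions where the hardware needs one.

```c
#include <stdint.h>

/* Reverse the bit order of a 16-bit word: bit 0 <-> bit 15,
   bit 1 <-> bit 14, and so on. (Hypothetical helper name.) */
static uint16_t flip16(uint16_t x)
{
    /* Swap adjacent bits, then bit pairs, then nibbles, then bytes. */
    x = (uint16_t)(((x & 0x5555u) << 1) | ((x >> 1) & 0x5555u));
    x = (uint16_t)(((x & 0x3333u) << 2) | ((x >> 2) & 0x3333u));
    x = (uint16_t)(((x & 0x0F0Fu) << 4) | ((x >> 4) & 0x0F0Fu));
    x = (uint16_t)((x << 8) | (x >> 8));
    return x;
}
```

For example, flip16(0x0001) gives 0x8000, and applying flip16 twice returns the original value.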

Jason R Sachs over 12 years ago in reply to pf

Expert 1890 points

pf said:

Here's what I'm running into right now. I've got an array of 3 signed 16-bit integers, which I need to offset by the same amount but saturate between -32768 and +32767. (In other words, if array[2] = 32760 and my offset is +10, I should end up with +32767 and not 32760+10=32770=-32766.)

Do you have a small compilable example that I could play with? I'm not fluent in C2000 asm, but I see several details that make me curious and I'd like to explore them.


I'll make time to do this sometime soon. Sorry, it's been a hellish 18 months or so since I had much breathing room at work.

Jason R Sachs over 12 years ago in reply to Jason R Sachs

Expert 1890 points

back again...

I'd like to request some intrinsics for the TSET and TCLR operations. Use case:

if (__tclr(myvar, 3))
    do_something(); // do_something() if bit in question was changed: this corresponds to TC bit being 1

if (__tset(myvar, 4))
    do_something(); // do_something() if bit in question was changed: this corresponds to TC bit being 0
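
To pin down the semantics I have in mind, here is a plain C sketch. The names tclr16 and tset16 are hypothetical stand-ins for the requested intrinsics, and they take a pointer because an ordinary C function cannot modify its argument the way an intrinsic could. Unlike a single TSET or TCLR instruction, this is a separate read and write, so it is not safe against an interrupt that touches the same word.

```c
#include <stdbool.h>
#include <stdint.h>

/* Clear bit `bit` (0..15) of *word. Returns true if the bit was set
   beforehand, i.e. the operation changed it (the TC = 1 case). */
static bool tclr16(uint16_t *word, unsigned bit)
{
    uint16_t mask = (uint16_t)(1u << bit);
    bool was_set = (*word & mask) != 0u;
    *word = (uint16_t)(*word & (uint16_t)~mask);
    return was_set;
}

/* Set bit `bit` (0..15) of *word. Returns true if the bit was clear
   beforehand, i.e. the operation changed it (the TC = 0 case). */
static bool tset16(uint16_t *word, unsigned bit)
{
    uint16_t mask = (uint16_t)(1u << bit);
    bool was_clear = (*word & mask) == 0u;
    *word = (uint16_t)(*word | mask);
    return was_clear;
}
```

With these, the use case above becomes if (tclr16(&myvar, 3)) do_something(); and likewise for tset16.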

George Mock over 12 years ago in reply to Jason R Sachs

TI__Guru**** 232880 points

I added the request for these intrinsics to SDSCM00034938 in the SDOWP system. That record had already been filed to ask for intrinsics for MAX and MIN instructions.

Thanks and regards,

-George