This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

C5515 Optimization considerations

Other Parts Discussed in Thread: OMAP-L137

Hi all,

I'm working with the C5515 eZdsp kit and I run quite "heavy" algorithms with quite a lot of input data. After turning on -O3 and optimize for speed = 5 I still don't meet my target requirements. I think I need your advice w.r.t. optimization for data acquisition, transmission and the algorithm.

First of all, here's what I do (roughly):
- Input data arrives in 8-bit parallel fashion on GPIO with about 3 MBytes/s and currently GPIO receive function is executed at a period of T_gpio = 10 ms
- Data is transmitted to the host over USB, where the host requests approx. 10 kbytes at an interval of 40 ms
- The data to process is of video type (64x64x16 bit matrices). My algorithm takes approximately T_algo = 8 ms currently.

My target requirements are:
- read 64x64x16 bit at an interval of approx T_gpio = 1 ms (very limit is T_gpio = 3 ms)
- Target-Host transfer is okay already
- Algorithm should run at approximately T_algo < 1 ms (very limit is T_algo = 4 ms)

If I don't manage to have T_algo < 1 ms (which is very probable) I still should read GPIO at T_gpio = 1 ms (or 2 or 3 ms) and then run a filter on the read input.

My questions:
- GPIO is all managed by the CPU. Is there a better way to handle this kind of data acquisition? I saw that SPI is handled by the CPU as well... If not, what's the best way to do data acquisition and the input filter algorithm independently from the rest of the algorithm? I'm currently using Timer HWIs to execute the GPIO receive code. It's okay. But when I put the filter algorithm inside the HWI, USB requests are blocked, which slows down my Target-Host transfer. How would you "organize" those HWIs?
- The algorithm is entirely coded in C. I'm not using DSPLIB. I might use DSPLIB, but if I can manage to optimize my code by hand, I'd rather leave it out. So here are some very specific optimization questions:

- if I do #pragma MUST_ITERATE .. (how) can I be sure that the code is pipelined or executed in parallel
- I have several "element-wise" matrix operations like subtract matrix A from B etc. What is better (see the following exemplary pseudo-code)?

for (i = 0; i < N; i++)
    C[i] = A[i] - B[i];
for (i = 0; i < N; i++)
    D[i] = k * C[i];
/* ... and many other operations to follow ... */

OR

for (i = 0; i < N; i++) {
    C[i] = A[i] - B[i];
    D[i] = k * C[i];
    /* ... and many other operations to follow ... */
}

- w.r.t. memory operations: I'm currently using memcpy for copying and memset for setting to zero. I imagine that I could get much better performance if I used DMA for memory copy. What about setting to zero? What's the best performing method? Doing it manually in a loop?
- and then there's the #pragma DATA_ALIGN ... what impact does it have on the C5515? I've seen that, for example, on the OMAP-L137 DATA_ALIGN(x, 128) is preferable for caching. But why is it DATA_ALIGN(x, 8) on the C5515?

I'd very much appreciate your advice and answers to (some of) my questions. Sorry for the huge list, but I'm quite new to DSP coding. And even though I've looked at the C55x compiler user's guide and assembly language tools guide, there are many points which are not very clear w.r.t. implementation. One thing is for sure: I think I have to go down to assembly for large parts of my code.

Thank you very much.

Best regards,

Andreas

  • Andreas,

    So here are some suggestions to try and achieve your goals.

    1) Make sure you have set the clock to run at its maximum rate.

    2) If you have large functions you may be able to help the optimizer if you can create local blocks and declare variables closer to where they are used and limit their scope as much as possible.
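    To illustrate point 2, here is a minimal sketch (the function and its contents are invented for illustration): by opening a local block, a temporary's live range ends with the block, which can free up a register for the optimizer.

```c
/* Hypothetical example of limiting variable scope with a local block so the
   optimizer can recycle registers instead of keeping everything live at once. */
int process_frame(const int *in, int *out, int n)
{
    int checksum = 0;
    {   /* 'diff' and 'i' are only needed here; once the block closes, their
           registers are free for whatever follows in the function */
        int i, diff;
        for (i = 1; i < n; i++) {
            diff = in[i] - in[i - 1];
            out[i] = diff;
            checksum += diff;
        }
    }
    out[0] = in[0];
    return checksum;
}
```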

    3) Always look at the assembler code produced (using the mixed mode display or the disassembly window) and ask yourself if it is reasonable or seems to contain redundant operations.  Sometimes you will need to contort your code with some rather difficult to read constructs to get the most out of the optimizer.

    4) Take advantage of the intrinsic functions, especially if you are concerned about overflow/underflow, as the DSP can perform saturation management for you.  These also give you another way to insert assembly language instructions into your code without having to go completely to assembly.  The good part about using the intrinsics is that the compiler understands them better than a simple asm statement, which allows the optimizer to get involved for further improvements.

    5) Lay out your variables carefully (especially any arrays or matrices) to utilize the dual access RAM to its fullest extent.  The architecture of the DSP has many buses for the movement of data in/out of the processor, and properly managing these (via variable placement) can significantly speed up your code, as the processor pipeline will stall less.

    6) Minimize access to I/O space and off chip memory/components.  These paths will cause the processor to stall more often as they are slower mechanisms.  In general, make as many of these accesses as possible via DMA or by using shadow variables (especially in I/O space when configuring peripherals on the fly).
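    The shadow-variable idea in point 6 can be sketched like this. The "register" below is simulated with a plain variable and an access counter just to show the effect; on the real part it would be a volatile ioport pointer at a peripheral address.

```c
#include <stdint.h>

/* Shadow-variable sketch: keep a RAM copy of a write-mostly peripheral
   register, modify the copy as often as you like, and touch the slow I/O
   location only once when you commit. */
static uint16_t sim_ctrl_reg;          /* stand-in for the I/O register   */
static uint16_t ctrl_shadow;           /* RAM shadow, fast to read/modify */
static int      io_writes;             /* counts slow I/O accesses        */

static void ctrl_write(uint16_t v) { sim_ctrl_reg = v; io_writes++; }

void ctrl_set_bits(uint16_t mask)   { ctrl_shadow |= mask; }
void ctrl_clear_bits(uint16_t mask) { ctrl_shadow &= (uint16_t)~mask; }
void ctrl_commit(void)              { ctrl_write(ctrl_shadow); }
```

    Three read-modify-write sequences against I/O space collapse into one write this way.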

    7) Use one channel of the DMA for setting up other DMA channels.  You do this by setting up the register configurations in structures located in memory somewhere.  Then write a small function which takes a DMA reference and a pointer to one of these structures, inserts the pointer into the DMA source address, and then kicks off the DMA.  The DMA will load the configuration structure data into the appropriate DMA channel while the processor does other preparations.  Then all you need to do is start it when you are ready.  Do this because accessing the I/O space is a slow process and consumes multiple cycles, which can cause pipeline stalls unnecessarily.  Be careful though: if you expect to port this code to another 55xx processor, not all of them can access the DMA I/O space.  This will work for the 5515 processor.
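    A simplified software model of point 7, just to make the idea concrete. The register layout below is invented for illustration and does not match the real C5515 DMA register map; the memcpy stands in for the loader channel's burst.

```c
#include <stdint.h>
#include <string.h>

/* Keep full DMA channel configurations in memory, and let one "loader"
   transfer copy a whole block into a channel's register file in a single
   burst, instead of the CPU writing each slow I/O register one at a time. */
typedef struct {
    uint32_t src;
    uint32_t dst;
    uint16_t count;
    uint16_t ctrl;       /* bit 0 = start, purely illustrative */
} DmaRegs;

static DmaRegs dma_ch[4];                     /* simulated channel registers */

/* "Loader DMA": one block copy into the channel register file, then kick off */
void dma_load_and_start(int ch, const DmaRegs *cfg)
{
    memcpy(&dma_ch[ch], cfg, sizeof *cfg);    /* on hardware: a DMA burst  */
    dma_ch[ch].ctrl |= 1u;                    /* set the start bit         */
}
```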

    As for your direct questions...

    - GPIO is all managed by CPU. Is there a better way to handle such kind of data acquisition? I saw that SPI is handled by CPU as well... If not, what's the best way to do data acquisition and input filter algorithm independently from the rest of the algorithm? I'm currently using Timer HWIs to execute GPIO receive code. It's okay. But when I put the filter algorithm inside the the HWI, USB requests are blocked which slows down my Target-Host transfer. How would you "organize" those HWIs?

    This is certainly a place where you want to take a look at the assembly produced for your code.  The context save and restore code for an ISR can vary, but some things just force a complete context save, which can be time consuming if you are popping in and out of an ISR quite frequently.  In general, the optimizer will attempt to save only those registers that need saving on entrance to an ISR, but if the ISR is large or calls functions in another object, where the optimizer cannot see the register allocations, then it will simply do a complete context save just in case.  If you must use ISR code, your best bet is to do as little as possible in there: read the GPIO, store it in a buffer, and return.  Then use a semaphore to tell the baseline code to work on the data as a set instead of a piece at a time.  Even better, you can set up a DMA to read the GPIO and put the data into the buffer.  You will still need to operate on the data using your filter, but doing it in the largest chunks possible usually saves time.  A DMA for the SPI is also a good choice for similar reasons.  The idea is to let the DMA do the mundane work while your code does the stuff the DMA cannot.  Also, the GPIO (and all I/O space in the processor) has a minimum 5 cycle access time, so if you're doing this with the processor you are spinning your wheels for a significant amount of time here.
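    The "do as little as possible in the ISR" pattern can be sketched as below. The GPIO data register is simulated with a plain variable here; on the C5515 it would be a volatile ioport read with its ~5-cycle I/O access penalty, and the function would be registered as the timer HWI.

```c
#include <stdint.h>

/* Minimal GPIO receive ISR sketch: one I/O read, one store, bump a counter
   semaphore, and get out.  All filtering happens later in baseline code. */
#define RX_BUF_LEN 1024

static uint16_t sim_gpio_data;                 /* stand-in for the GPIO port  */
static uint16_t rx_buf[RX_BUF_LEN];
static volatile uint16_t rx_head;              /* next write index            */
static volatile uint16_t rx_count;             /* counter semaphore for main  */

void gpio_rx_isr(void)                         /* keep this as short as possible */
{
    rx_buf[rx_head] = sim_gpio_data;           /* one I/O read, one store      */
    rx_head = (uint16_t)((rx_head + 1) % RX_BUF_LEN);
    rx_count++;                                /* tell baseline code: new data */
}
```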

    - if I do #pragma MUST_ITERATE .. (how) can I be sure that the code is pipelined or executed in parallel

    The MUST_ITERATE pragma does nothing to guarantee parallel opcodes will be generated.  What it does is give the compiler trip-count information, e.g. that the loop will iterate at least once.  This provides the optimizer the information to remove a pretest (for the case where the loop might never execute) and also clears the path for the optimizer to generate looping structures using the DSP instructions which support hardware looping, making the code smaller and faster.  You should also look into the UNROLL pragma, as this can further optimize speed by reducing the number of iterations a loop must execute, which reduces the looping overhead.  For example, if your code is looping over 5 machine instructions and you have to do a comparison and branch to determine if the loop is complete, then 2 out of the 7 instructions are overhead.  Using the UNROLL pragma you can tell the compiler it would be ok to unroll the loop once, thus only having to iterate half as many times.  Now 2 out of 12 instructions are overhead, so you have faster execution at the expense of 5 additional instructions.  Without these pragmas, the optimizer must make more pessimistic assumptions, which generally results in more generated code and slower execution.
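    Placement-wise, the pragmas go immediately before the loop they describe. A sketch (the function and the trip-count figures are invented for illustration; the #ifdef keeps the TI-specific pragmas from generating warnings on other compilers):

```c
#include <stdint.h>

/* MUST_ITERATE here asserts the trip count is at least 64, at most 4096,
   and a multiple of 4; UNROLL(2) suggests unrolling the loop once.  These
   are TI compiler pragmas, so they are guarded for portability. */
long sum_abs_diff(const int16_t *a, const int16_t *b, int n)
{
    long acc = 0;
    int i;
#ifdef __TMS320C55X__
    #pragma MUST_ITERATE(64, 4096, 4)
    #pragma UNROLL(2)
#endif
    for (i = 0; i < n; i++) {
        int d = a[i] - b[i];
        acc += (d < 0) ? -d : d;
    }
    return acc;
}
```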

    - I have several "element-wise" matrix operations like subtract matrix A from B etc. What is better (see the following exemplary pseudo-code)?

    for (i = 0; i < N; i++)
        C[i] = A[i] - B[i];
    for (i = 0; i < N; i++)
        D[i] = k * C[i];
    /* ... and many other operations to follow ... */

    OR

    for (i = 0; i < N; i++) {
        C[i] = A[i] - B[i];
        D[i] = k * C[i];
        /* ... and many other operations to follow ... */
    }

    Generally, the second case will generate faster, tighter code, but if you put too much inside the loop you can confuse the optimizer and lose some of the benefit.  A happy medium is usually the case; it also depends on the dependencies between operations.  For example, if I have 4 arrays (a, b, c, d) and I need to perform the operations a += b and c += d, then these operations are essentially independent of each other, so they can use your second style of for loop with a high probability that parallel instructions will be generated.  If the operations were instead b += c - d and a += b, then the second assignment is dependent upon the first.  In this case it is unlikely you will get parallelism if you implement these two operations in the for loop so they operate on the same index at the same time.  Instead, if you perform the first operation's first iteration before entering the loop, and inside the loop keep the first operation always an index ahead of the second, then parallelism can apply.  You just need to do the final iteration of the second operation after the loop.
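    For illustration, the skewing idea for the dependent pair can be sketched like this: the producing operation (on b) runs one index ahead of the consuming operation (on a), so the two statements inside the loop no longer touch the same element and become candidates for parallel scheduling.

```c
/* Skewed loop for the dependent pair:
       b[i] += c[i] - d[i];   (producer)
       a[i] += b[i];          (consumer)
   The producer is peeled one iteration ahead of the consumer. */
void skewed_update(int *a, int *b, const int *c, const int *d, int n)
{
    int i;
    if (n <= 0) return;
    b[0] += c[0] - d[0];              /* producer's first iteration, peeled */
    for (i = 1; i < n; i++) {
        b[i] += c[i] - d[i];          /* producer works on index i    */
        a[i - 1] += b[i - 1];         /* consumer trails at index i-1 */
    }
    a[n - 1] += b[n - 1];             /* consumer's last iteration, peeled */
}
```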

    - w.r.t. to memory operations. I'm currently using memcpy for copy and memset for setting to zero. I imagine that I could get much better performance if I'd use DMA for memory copy. What about setting to zero? What's the best performing method? Doing it manually in a loop?

    The DSP is a pretty good device; if you are merely moving data and not performing any operations on it, the DMA may not be significantly faster than the code.  However, the big advantage here is that you can do something else in code while the DMA is moving data around for you.  So if you don't have to wait for the DMA to finish before you can go on to another task, using the DMA is always a plus.  You can use the DMA to set a constant value too.  Just set up the source address not to increment while the destination address does, and point the source address at a variable where you have placed the initialization value.  If you have a repeating pattern, you can set the DMA to place that pattern as well, effectively a nested for loop implementation.  Also, since the processor operates fastest when accessing the internal memories, you can use the DMA to transfer data between external memory and internal memory, saving all that time for your processor to be doing better things.  So my answer is definitely yes: if you can use a DMA, then use it!
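    As a software model of the stationary-source fill just described (this is only a model of what the DMA does, not real register programming): the source does not advance while the destination does, so a single constant, or a short repeating pattern, gets smeared across the buffer.

```c
#include <stdint.h>
#include <stddef.h>

/* Model of a stationary-source DMA fill.  With pattern_len == 1 this
   behaves like memset; with pattern_len > 1 it repeats a short pattern,
   the "nested for loop" case mentioned above. */
void dma_model_fill(uint16_t *dst, size_t n_words,
                    const uint16_t *pattern, size_t pattern_len)
{
    size_t i;
    for (i = 0; i < n_words; i++)
        dst[i] = pattern[i % pattern_len];
}
```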

    - and then there's the #pragma DATA_ALIGN ... what impact does it have on the C5515? First, I've seen that for example on the OMAP-L137   DATA_ALIGN (x,128)   is preferable for caching. But why is it DATA_ALIGN (x, 8) on the C5515?

    The DATA_ALIGN pragma is never needed per se; however, there are several situations where it can be useful.  If you use an external SDRAM, you can align your structures to minimize the number of page commands needed to send across the bus.  Most external memory will be slower than the internal memory, and if the data is laid out so it crosses a 32 bit boundary it can cause additional accesses to memory, thus slowing things down.  The fact that code space is measured in bytes and data space is measured in words can require an alignment of 2 for constants.  I don't know off hand the reason for the alignment of 8, but I am also not familiar with the development board you are using, so I cannot comment as to the validity of it.
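    For reference, the pragma goes immediately before the definition it aligns. A sketch only (the buffer names are invented, the alignment values are just the ones discussed above, and since char is 16 bits on the C55x the units should be 16-bit words — verify against the compiler guide):

```c
#pragma DATA_ALIGN(frame_buf, 8)   /* the 8-word alignment from the question */
Uint16 frame_buf[64*64];

#pragma DATA_ALIGN(pair_buf, 2)    /* even alignment so 32-bit (long) accesses work */
Uint16 pair_buf[2*64];
```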

    You may not need to go all the way to assembly to complete what you want to do.  You can intermix C and assembly quite easily.  The Optimizing Compiler guide can show you how.  In this case, you should only need to code in assembly those functions which you cannot get the optimizer to do a decent job at.  Some of the matrix operations may benefit from this, as many of them can take advantage of the multiply-accumulate instructions of the DSP.

    Hope this has helped,

    Jim

     

  • Hi Jim

    First of all: a very big thank you for your detailed answer. It is good to know that such a good support exists.

    Your explanations are very clear and I think they'll help me a lot, in particular w.r.t. DMA and algorithm optimization. Concerning data input handling by GPIO, I'm not sure if I can do this via DMA, as I've read that the GPIOs do not have DMA access on the C5515?
    Further: is there any (good) example code on using semaphores to split up algorithm and data acquisition?

    I have some other, more direct, questions:

    - Should I prefer to prefix some of my locals with the "register" keyword or should I better let the compiler handle such things?

    - I've tried to use intrinsics to implement the filter. And yes, it's true, it nearly reaches my final target execution time. BUT: it does not work properly. How can I handle unsigned values with the intrinsics? For example, if I use _smac (acc, x0, x1), it assumes that acc is long and x0 and x1 are ints, whereas for my application acc should be Uint32 and x0 and x1 should be Uint16. Is there a better/easier way to have full unsigned resolution with the intrinsics, or would I have to right shift x0 and x1 in order to fit into Int16?

    - Similar question for the _min and _max intrinsics: do I really have to use _lmin and _lmax instead, if my inputs are Uint16 rather than Int16, to get correct results?

    - Are there (ways to use) intrinsics to enable parallel operations, for example parallel MAC? I've looked at the assembly output for the function where I've placed 2 successive independent _smac calls inside a for loop. But the compiler does not seem to enable parallelism.

    - When going into assembler:
    1) how can I directly access arrays or variables declared in C?
    2) In what registers are C function arguments placed?
    For example, I would like to write the assembly routine for C declaration: void video_fir (Uint16* x, Uint16* y, const Uint16* n) and inside that routine I'd like to access arrays v[0][0], v[1][0], v[2][0] and v[3][0] with v defined as Uint16 v[4][64*64].

    Again thank you very much for your help.

    Best regards,

    Andreas

     

  • Andreas,

    My mistake.  You are correct in that the 5515 DMAs cannot access GPIO I/O space.  If you must utilize the GPIOs, then this will have to be handled in code, most likely with an interrupt.  There is another option, however, depending on how your system is set up.  If you could provide the data on the GPIOs to the memory data bus (even if a tristate buffer is required), then you could read it directly from the external memory bus, which DMA controller 3 can access.  You'll need to assess what your system can support and then we can work from there.

    As for a semaphore example, the difficulty with a "good" example is that it's only good if it applies to what you are trying to do.  The basic premise of a semaphore system is to pass simple messages back and forth between two processes.  From there, things can diverge in a real hurry.  Semaphores can be as simple as yes/no, on/off, true/false, etc., all the way to very complicated ones involving structures and dynamic code selection.  Most data acquisition semaphore systems involve a buffer and a counter.  The interrupt routine reads in the next data value, puts it in the buffer, increments the counter semaphore, and exits the interrupt.  In the baseline code, the counter is tested for a non-zero value and, if true, the next data is retrieved from the buffer and the counter is decremented (with interrupts disabled).  The test is usually enclosed within a while loop, so the data is operated on as long as there is data in the buffer.  Alternatively, you could test for the buffer to contain some number of data samples and, after this threshold is reached, work on the entire block.  This can be more efficient sometimes, or simply necessary for some things like executing an FFT on 128 samples at a time.  Careful examination of your algorithm will dictate much of the approach you choose.
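    A bare-bones sketch of that buffer-plus-counter scheme (the interrupt enable/disable calls are stubbed out; on the C5515 they would be the real global interrupt mask operations, and produce() would be called from the acquisition ISR):

```c
#include <stdint.h>

#define BUF_LEN 256

static uint16_t buf[BUF_LEN];
static volatile uint16_t head, tail;
static volatile uint16_t count;               /* the counter semaphore */

static void disable_ints(void) { /* real code: mask global interrupts   */ }
static void enable_ints(void)  { /* real code: unmask global interrupts */ }

/* called from the acquisition ISR */
void produce(uint16_t sample)
{
    buf[head] = sample;
    head = (uint16_t)((head + 1) % BUF_LEN);
    count++;
}

/* called from the main loop: drain everything currently buffered */
long consume_all(void)
{
    long sum = 0;                             /* stand-in for real processing */
    while (count > 0) {
        sum += buf[tail];
        tail = (uint16_t)((tail + 1) % BUF_LEN);
        disable_ints();                       /* counter update must be atomic */
        count--;
        enable_ints();
    }
    return sum;
}
```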

    With regard to using the register keyword, if you turn on any level of optimization, the compiler will ignore all register modifiers used.  The only time the compiler considers honoring them is when optimization is completely off.  Strangely enough, if you cannot get the optimizer to accomplish the goals you are trying to achieve, turning it off and utilizing the register keyword judiciously and carefully coding your function can sometimes actually do better than the optimizer.  However, this is usually a last resort.  The better option is to open up local blocks within a function allowing you better resolution on variable scoping.  Alternatively, compiling your code with non-ANSI options or as C++ can achieve similar capabilities.

    For managing the unsigned arithmetic operations, read section 3.1.2 of the 55x DSP Programmer's Guide (http://focus.ti.com/lit/ug/spru376a/spru376a.pdf).  Also note that most of the MAC and MPY instructions and their parallel forms use 17 bit arithmetic, thus an unsigned 16 bit value can fit within them as a signed number.  Finally, you will want to understand sections 6.3, 6.4, and 6.5 of the Optimizing Compiler Guide for the 55x (http://focus.ti.com/lit/ug/spru281f/spru281f.pdf).  These describe the configuration of the DSP which is expected by the compiler.  The short answer is that section 3.1.2 explains how the DSP can provide some shortcuts which, although not adhering to the letter of the C language, adhere to the "as if" rule because the result is indistinguishable in both cases.  Since the compiler can recognize certain syntactic structures, it will use these shortcuts to significant advantage.  Try compiling your code with what would be considered normal C code, then try it with these special constructs, and see if the shortcut MPY and MAC instructions are used.  You may need to play with the syntax a bit, but it will generally follow the rules described.
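    As a sketch of the kind of MAC-shaped C the guide is after (the function is invented for illustration; whether your compiler maps this to a MAC depends on the syntactic rules in section 3.1.2, so check the generated assembly):

```c
#include <stdint.h>

/* MAC-shaped loop for unsigned 16-bit data.  The explicit widening cast is
   the kind of construct the compiler can recognize as a multiply-accumulate
   idiom; since the C55x MAC units use 17-bit arithmetic internally, an
   unsigned 16-bit operand fits without losing resolution. */
uint32_t dot_u16(const uint16_t *x0, const uint16_t *x1, int n)
{
    uint32_t acc = 0;
    int i;
    for (i = 0; i < n; i++)
        acc += (uint32_t)x0[i] * x1[i];   /* widen before the multiply */
    return acc;
}
```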

    Also, take a look at the final assembly generated by the compiler for the intrinsics as well.  You may find the opcodes used are valid for the intended operation with proper casting.

    If you are not getting the intended parallel instruction results, make sure the optimization is turned on for the assembly tab as well.  This is because some optimizations are accomplished in the assembly process after the C code has been translated to assembly.  The optimizer options in the C tab should enable these but you may need to coax the compiler a bit more.  It's been a while since I've used the 55x series so things may have changed with regard to my memory.

    In accessing variables from assembler, sections 6.4 and 6.5 of the Optimizing Compiler Guide will explain accessing them from either side of the system.  One thing to remember, as soon as you put inline assembly in a function, you may be invalidating some optimizations performed, i.e. the optimizer ignores the inline assembler code except to put it in exactly as defined.  Generally this will cause the optimizer window to be truncated thus limiting the amount of optimization that can be performed.  However, some optimizations merely assume the inline assembly adheres to the assumptions imposed on the surrounding code and thus may optimize anyway.

    In general, any assembly cannot change the state of the system with respect to the optimizer or optimizations could fail without warning.  Technically this is not a compiler bug since they tell you this can happen.  Thus, intrinsics, NOP, enabling and disabling interrupts, and assembly directives are about the only assembly code you want to mix in a function.  Anytime you really need to code part of your algorithm in assembly, you will save yourself significant headaches by writing the function entirely in assembly.  This doesn't mean you should write large functions in assembly, but rather short small functions which are easy to manage like the inner workings of a for loop, etc.

    If you're not comfortable writing complete assembly files, one trick is to write a C function and use only inline assembly within it.  You can access the variables as defined by the compiler guide, and with optimization turned on high, all function overhead code is optimized away, leaving only your assembly instructions.  This can simplify writing the assembly quite a bit, as the C compiler will manage most of the data/function declarations and debugging directives, so it works nicely with the IDE.

    For your example I would write the following code in a C file

    Uint16 v[4][64*64]; // a global array

    void video_fir( Uint16* x, Uint16* y, const Uint16* n )
    {
        // x is in AR0, y is in AR1, and n is in AR2 as per the compiler guide
        // the only saved registers on entrance are T2, T3, AR5, AR6, and AR7
        // any other registers must be pushed before being overwritten
        // access the v array as _v in the assembly, this is described in the compiler guide too
        asm(" MOV #(_v + 0 * 64 * 64), AR5"); // AR5 = &v[0][0]
        asm(" MOV #(_v + 3 * 64 * 64), AR6"); // AR6 = &v[3][0]
        asm(" MOV *AR5+, *AR6+");             // *AR6++ = *AR5++
        // ... remaining code of function
    }

    Please don't hold my assembly syntax against me.  It may well produce errors but I think you get the idea.  With optimization set on its highest for a given file, this should generate a function with no additional machine instructions added by the compiler.  If you don't use a high enough level of optimization, you may need to update the variable on the stack as well. A little experimentation will show you pretty quickly what's going on.

    Regards,

    Jim