Using SPLOOP in C66x

Hello,

I want to use SPLOOP to implement a tight loop that doesn't need software pipelining. The loop might look like this:

inst1 .L1

inst2 .L1

inst3 .L1

inst4 .L1

The wanted execution flow is :

Cycle 1 : inst1

Cycle 2 : inst2

Cycle 3 : inst3

Cycle 4 : inst4

Cycle 5 : inst1

Cycle 6 : inst2

...

I first tried the code :

MVK .S2 10,B9

MVC .S2 B9,ILC

NOP 3

SPLOOP 4

inst1 .L1

inst2 .L1

inst3 .L1

SPKERNEL 0,0

|| inst4 .L1

But the compiler says that the dynamic length must be greater than the ii.

Is there some way to use SPLOOP to do this the way I want? Could you explain what the SPKERNEL parameters really mean? I just set them to 0,0 so as not to have any delays.

Thanks.

  • Hi,

    First of all, I don't recommend writing hand-optimized assembly. Our compiler does a very good job at optimizing loops. If you want to further optimize your C code, I'd start by looking into pragmas and intrinsics.

    If you really want to program in assembly, I'd recommend writing serial assembly and letting the compiler parallelize the code.

    If you want to understand the SPLOOP hardware, you'll have to understand software pipelining first. Chapter 8 of the TMS320C66x DSP CPU and Instruction Set Reference Guide explains it all.

    Kind regards,

    one and zero
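
    As a rough illustration of the pragma route suggested above, a minimal sketch in C might look like the following. The function `dot_q15` and the trip-count bounds are hypothetical and not from this thread; the idea is only that `MUST_ITERATE` and `restrict` give the TI compiler trip-count and aliasing information it can use when it software-pipelines the loop. The pragma is guarded so the code still builds with other compilers.

    ```c
    /* Hypothetical example: dot product of two 16-bit vectors. */
    int dot_q15(const short *restrict a, const short *restrict b, int n)
    {
        int sum = 0;
        int i;

        /* Assumed contract (illustrative values): 8 <= n <= 1024 and
           n is a multiple of 8. The caller must honor it. */
    #ifdef __TI_COMPILER_VERSION__
        #pragma MUST_ITERATE(8, 1024, 8)
    #endif
        for (i = 0; i < n; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }
    ```

    Whether this gets close to hand-written SPLOOP code depends on the loop; the compiler's software-pipelining feedback is the place to check.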

     

  • Thanks for your answer, one and zero. However, I still need hand-optimized assembly. I've read the SPLOOP buffer chapter in the user guide, so could you help me?

    I got the previous example to work by using a dynamic length of 5 (a NOP added to the iteration) and ii = 4:

     

    MVK .S1 0,A9

    MVK .S2 10,B9

    MVC .S2 B9,ILC

    NOP 3

    SPLOOP 4

    ADD .S1 A9,1,A9 ; inst1

    ADD .S1 A9,1,A9 ; inst2

    ADD .S1 A9,1,A9 ; inst3

    ADD .S1 A9,1,A9 ; inst4

    SPKERNEL 0,0

    SUB .S1 A9,1,A9 ; finst

     

    The simulator gives the expected result. However, I'm wondering whether the SPKERNEL parameters are correct, so that the "finst" instruction executes immediately after the last "inst4" instruction contained in the epilog.

    Thanks

  • I solved my problem, thanks. What I was missing is that the SPKERNEL parameters indicate when the instructions following the SPLOOP should begin relative to the epilog. So the last code I wrote is correct, and it even avoids the NOP latency included in each iteration.

    By the way, my hand-written assembly turns out to be 6 times better than the optimized DSPLIB complex matrix multiply with intrinsics and pragmas, using -o3 and optimizing for speed 5 as compiler options. That's why I decided to optimize my functions in standard C66x assembly.

    Best regards, Mounir
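
    The timing described here can be checked with a small cycle model in C. This is only a sketch under stated assumptions, not a model of the real SPLOOP hardware: it assumes a new iteration starts every ii cycles, that the epilog begins ii cycles after the last iteration starts, and that the two SPKERNEL operands (called `fstg` and `fcyc` here) delay the post-loop code by `fstg * ii + fcyc` cycles from the start of the epilog. All function names are hypothetical.

    ```c
    /* Cycle, counted from 0 at the first loop cycle, in which execute
       packet `slot` (0-based) of iteration `iter` issues when a new
       iteration starts every `ii` cycles. */
    int issue_cycle(int ii, int iter, int slot)
    {
        return iter * ii + slot;
    }

    /* Number of stages for a loop body of `dynlen` cycles:
       ceil(dynlen / ii). */
    int stage_count(int ii, int dynlen)
    {
        return (dynlen + ii - 1) / ii;
    }

    /* Assumption: the epilog starts once no further iteration is
       launched, i.e. at cycle trip * ii, and the code after SPKERNEL
       starts fstg * ii + fcyc cycles later. */
    int post_loop_start(int ii, int trip, int fstg, int fcyc)
    {
        int epilog_start = trip * ii;
        return epilog_start + fstg * ii + fcyc;
    }
    ```

    For the loop above (ii = 4, dynamic length 5, ten iterations assumed from the value written to ILC), `stage_count(4, 5)` is 2, the last inst4 issues at `issue_cycle(4, 9, 3)`, which is cycle 39, and `post_loop_start(4, 10, 0, 0)` is cycle 40. In this model "finst" therefore follows the last inst4 directly, and the fifth, empty cycle of each iteration overlaps with the first ADD of the next one.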

  • Hi Mounir,

    First of all, thanks a lot for your feedback!

    I just want to clarify the "6 times better than the DSPLIB" result. Is your assembly code really functionally the same as the DSPLIB implementation, or have you made further functional optimizations?

    I'd also recommend looking into linear assembly, in case you haven't done so already. You might be interested in comparing the performance of your hand-optimized assembly with that of linear assembly. The advantage of linear assembly is that it's easier to write and maintain. You'll find more details in the C Compiler User's Guide (SPRU187), in chapter 4.3, "What You Need to Know to Write Linear Assembly".

     

    ... by the way, you don't necessarily need to specify the functional unit when writing hand-optimized assembly; the assembler will assign one for you if none is specified ...

    Kind regards,

    one and zero

  • Thanks, I'll look into linear assembly and the benefits it could provide. My hand-written assembly does just a complex matrix multiply, but I did slightly change the algorithm to be tiled, computing 8 values at the end of each iteration. The comparison I made is therefore unfair with regard to code optimization.
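
    To make the tiling idea concrete, here is a plain C sketch of a complex matrix multiply that produces 8 output values per tile. It is not the assembly from this thread: the 2 x 4 tile shape, the `cplx` type, the row-major layout and the function name are all assumptions chosen for illustration, and the matrix dimensions are assumed to be multiples of the tile size.

    ```c
    /* Hypothetical complex type: single-precision real and imaginary parts. */
    typedef struct { float re, im; } cplx;

    /* C (m x n) = A (m x k) * B (k x n), all row-major.
       Assumed: m is a multiple of 2 and n is a multiple of 4, so each
       (i, j) step finishes a 2 x 4 tile, i.e. 8 complex outputs. */
    void cmatmul_tiled(const cplx *a, const cplx *b, cplx *c,
                       int m, int k, int n)
    {
        int i, j, p, r, s;

        for (i = 0; i < m; i += 2) {
            for (j = 0; j < n; j += 4) {
                cplx acc[2][4] = {{{0.0f, 0.0f}}};  /* 8 accumulators */

                for (p = 0; p < k; p++) {
                    for (r = 0; r < 2; r++) {
                        cplx x = a[(i + r) * k + p];
                        for (s = 0; s < 4; s++) {
                            cplx y = b[p * n + j + s];
                            acc[r][s].re += x.re * y.re - x.im * y.im;
                            acc[r][s].im += x.re * y.im + x.im * y.re;
                        }
                    }
                }

                /* Store the 8 finished values of this tile. */
                for (r = 0; r < 2; r++)
                    for (s = 0; s < 4; s++)
                        c[(i + r) * n + j + s] = acc[r][s];
            }
        }
    }
    ```

    Each loaded element of A is reused for four outputs and each element of B for two, which is the usual motivation for tiling; a comparison against an untiled library routine measures that change as well as the quality of the generated code.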