Compiler/TMS320C6678: code runs much slower when replacing a C function with asm code generated by CCS

Part Number: TMS320C6678


Tool/software: TI C/C++ Compiler

I have run into a strange problem when using an asm function. I have a program written in C that consists of several functions, and I want to rewrite one of them in asm. As a first try, I copied the asm code generated by CCS, which I took from the Disassembly window in CCS. But then this function runs much slower. When the function is written in C, it uses about 23000 cycles. When I replace it with the asm code, it uses about 150000 cycles. I wonder why the asm code runs so much slower?

Here is some information about my program:

1) The whole program is written in C, and I run it on a TMX320C6678 demo board. The program runs in single-core mode.

2) I use _itoll(TSCH, TSCL); to measure the time consumed by this function.

3) This function reads data from an input buffer, does some multiplication and addition, and then writes the data to an output buffer. It does not call any other functions.

4) The only change I made to the asm code is that I changed the name of the label in it.

5) The .text section is placed in L2 SRAM; all other sections are placed in Multicore Shared Memory.
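
For reference, the measurement described in item 2 could be set up roughly as follows. This is only a sketch: `kernel_under_test` and `time_runs` are hypothetical names standing in for the real function and test harness, and the `#else` branch is a host-only fallback (using `clock()`, so it counts clock ticks rather than CPU cycles) that exists purely so the code builds off-target. It reads TSCL before TSCH, since a read of TSCL latches the upper half of the counter into TSCH, and it times several back-to-back calls so that a slow first call can be told apart from steady-state behaviour.

```c
#include <assert.h>

#ifdef _TMS320C6X
#include <c6x.h>   /* TI header: declares TSCL, TSCH and _itoll() */

/* Any write to TSCL starts the free-running time stamp counter. */
static void cycles_start(void) { TSCL = 0; }

static unsigned long long cycles_now(void)
{
    unsigned int lo = TSCL;   /* reading TSCL latches the upper half into TSCH */
    unsigned int hi = TSCH;
    return (unsigned long long)_itoll(hi, lo);
}
#else
#include <time.h>

/* Host fallback (assumption, not part of the original setup). */
static void cycles_start(void) { }
static unsigned long long cycles_now(void) { return (unsigned long long)clock(); }
#endif

/* Hypothetical stand-in for the function being measured. */
void kernel_under_test(const short *in, short *out, int n)
{
    int i;
    for (i = 0; i < n; i++)
        out[i] = (short)(in[i] * 3 + 1);
}

/* Times `runs` consecutive calls and stores each elapsed count in elapsed[]. */
void time_runs(const short *in, short *out, int n,
               unsigned long long *elapsed, int runs)
{
    int r;
    assert(n > 0 && runs > 0);
    cycles_start();
    for (r = 0; r < runs; r++) {
        unsigned long long t0 = cycles_now();
        kernel_under_test(in, out, n);
        elapsed[r] = cycles_now() - t0;
    }
}
```

If the first entry of `elapsed[]` is much larger than the rest, cold caches are a likely suspect; if all entries are equally slow, the cause is more likely in the generated code or the memory placement.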

Thank you!

  • Hi,

    Just a suggestion: use the compiler option "keep the generated assembly ..." (--keep_asm, in Advanced Options -> Assembler Options) to get the generated asm code, possibly suppressing the debug info, which cannot be used in manually written asm files. You'll get a cleaner source file to start from.
  • Xu Wang said:
    When the function is written in C, it uses about 23000 cycles. When I replace it with asm code, it uses about 150000 cycles.

    That sounds like a cache problem of some kind, or maybe a memory bank conflict.  I'm not an expert on memory effects like that.  I'll ask someone else to take a look.

    One way to test whether this is a memory effect problem ... Make sure the assembly function is exactly the same size as the C function.  Add NOP instructions if needed.  After linking, make sure everything, code and data, is at the same address.  Then test.  If execution goes back to 23000 cycles, then there must be some memory effect problem.  If it doesn't, then question whether your measurement method is correct.

    And while the suggestion to use --keep_asm is a good one, an even better suggestion is to use --src_interlist.  This causes the assembly file to be kept, and it adds comments to the source that allow you to better understand the assembly code.

    Thanks and regards,

    -George

  • My focus when I joined TI was on DSP optimization, etc. so I wrote quite a bit of assembly for c6000 earlier in my career.  Given the VLIW architecture and unprotected pipeline, it is VERY difficult to write assembly for this device and is very hard to do better than the compiler.

    Before going down the path of pure assembly language, have you looked into using pragmas?  The MUST_ITERATE pragma is a really good one for informing the compiler of constraints on a loop's trip count (e.g. minimum, maximum, multiple).  The "restrict" keyword is useful for pointers if you can guarantee that the different pointer inputs to your function never point to overlapping buffers.  Those two simple things alone can help the compiler generate substantially more efficient code.  I assume you've already increased the optimization level in the compiler.
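
    A minimal sketch of how those two hints might be applied to a multiply-and-add loop like the one described in the question. The function name `scale_add`, its parameters, and the trip-count limits (at least 8, at most 1024, a multiple of 8) are all assumptions made for illustration; the real values have to match what the caller can actually guarantee. The pragma is guarded so that the same file also builds with a non-TI compiler.

```c
#include <assert.h>

/* Hypothetical kernel: out[i] = in[i] * gain + offset.
 * "restrict" promises the compiler that in and out never overlap,
 * so loads from in[] can be scheduled ahead of stores to out[]. */
void scale_add(const short *restrict in, short *restrict out,
               int n, short gain, short offset)
{
    int i;

    /* The promise made by MUST_ITERATE below must really hold. */
    assert(n >= 8 && n <= 1024 && n % 8 == 0);

#ifdef _TMS320C6X
#pragma MUST_ITERATE(8, 1024, 8)   /* min, max, multiple of the trip count */
#endif
    for (i = 0; i < n; i++)
        out[i] = (short)(in[i] * gain + offset);
}
```

    Without "restrict", the compiler has to assume that a store to out[] might change a later load from in[], which limits how far it can overlap loop iterations.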

    Beyond MUST_ITERATE and the restrict keyword, the next step of optimization is generally to use intrinsics.  I find intrinsics a much better option than assembly.  It enables you to get the compiler to use very specific instructions without having to delve into all the details of writing assembly.  If you're doing work with packed data (e.g. a bunch of 16-bit words, etc.) then you can use specialized 66x instructions to operate on a bunch of items simultaneously.
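
    As an illustration of the packed-data idea, here is a small sketch of a 16-bit dot product that handles two elements per step with the `_dotp2` intrinsic and aligned 32-bit loads via `_amem4_const`. The function `dot16` and the wrapper macros are hypothetical, and which intrinsics would suit the function in this thread is not known. The `#else` branch is a host-only emulation so the code can be built and checked off-target.

```c
#include <assert.h>
#include <string.h>

#ifdef _TMS320C6X
#include <c6x.h>
/* Aligned 32-bit load: fetches two packed 16-bit values at once. */
#define LOAD_PAIR(p)   _amem4_const(p)
/* Multiplies the two 16-bit halves pairwise and adds the products. */
#define DOT_PAIR(a, b) _dotp2((int)(a), (int)(b))
#else
/* Host emulation (assumes 16-bit short and 32-bit unsigned int). */
static unsigned int load_pair_host(const void *p)
{
    unsigned int v;
    memcpy(&v, p, sizeof v);
    return v;
}
static int dot_pair_host(unsigned int a, unsigned int b)
{
    short pa[2], pb[2];
    memcpy(pa, &a, sizeof pa);
    memcpy(pb, &b, sizeof pb);
    return pa[0] * pb[0] + pa[1] * pb[1];
}
#define LOAD_PAIR(p)   load_pair_host(p)
#define DOT_PAIR(a, b) dot_pair_host((a), (b))
#endif

/* Hypothetical kernel: sum of x[i] * y[i] over n 16-bit elements.
 * On the target, x and y must be 4-byte aligned and n must be even. */
int dot16(const short *restrict x, const short *restrict y, int n)
{
    int i, sum = 0;
    assert(n >= 2 && n % 2 == 0);
    for (i = 0; i < n; i += 2)
        sum += DOT_PAIR(LOAD_PAIR(&x[i]), LOAD_PAIR(&y[i]));
    return sum;
}
```

    The code stays ordinary C, so the compiler still handles register allocation and software pipelining; the intrinsics only pin down which instructions do the arithmetic.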

  • Thank you for your suggestions on the keyword "restrict". It works for me.

    In fact, I wanted to rewrite this function in asm because I found that its speed is not stable. When I optimize other functions and leave this function unchanged, its cycle count varies. Sometimes it uses only 23000 cycles. Sometimes it uses 150000 cycles. I don't know the reason. I guess it may be because the compiler generates different asm code.

    Now that I have added the keyword "restrict", this function always uses 23000 cycles. A very useful keyword!

    Thank you.

  • Thank you for your help. As suggested by Brad Griffis, I added the keyword "restrict" to the C function, and this works for me. So it is indeed a memory effect problem.