This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Assembly works on simulator, not hardware. Pipelined Assembly Viterbi Decoder.

Other Parts Discussed in Thread: CCSTUDIO

I have written a Viterbi decoder in assembly (not linear) that is 25% faster than the C64x version that we purchased from a third party. When running the code in DSS (CPU accurate simulator) through Matlab with no noise there are no errors in the output. When running on the target hardware with almost no noise(possibly due to tasks and interrupts) there are bit errors very rarely (every thousandth call to the decoder).

When the old C64x code is run in place of the new decoder it runs without errors. I do not believe this is due to any difference in performance of the decoders.

I have avoided using registers A10-A15 and B10-B15. I have saved and used the return address. I am not modifying any registers like AMR. I have tried disabling interrupts using DINT and RINT (no success). 

What else should I be doing and what can I try to find the issue?

Thanks
Chris 

  • Chris,

    This shows two impressive feats. First, that you are successfully writing a complex program in C6000 assembly. Second, that the performance beats out a commercial software house by 25%. This should be enough to convince your boss that you deserve a raise and a bonus, or at least it will once you get this discrepancy solved.

    Some questions to clarify some points, please:

    1. DSS is "Debug Server Scripting"?

    2. WHich version of CCS are you using?

    3. Exactly which simulator are you using?

    4. Do you mean C64x and not C64x+?

    5. What is the -mv switch set to in your build options?

    6. What memory are you running out of for program, and for data?

    c0l0jar0 said:
    through Matlab with no noise there are no errors in the output.

     on the target hardware with almost no noise there are bit errors

    This means the data input to the two versions is different. Therefore, we do not know if the program is executing differently (implication of the thread's title) or if the data leads to differences in the execution. You really need to capture a fixed set of input data from either the Matlab process or from the board process. You should be able to insert some capture code at the start of your function and save a bunch of data to an external memory buffer, then save that to a file to use repeatedly for testing. For best results, do that for both the Matlab and hardware methods.

    Then adapt the wrapper around your decoder so it takes input from that exact same data from a pre-loaded buffer in external memory.

    With the exact same data, you can have two things to compare A-to-B to help you narrow down to the exact cause of the problem.

    c0l0jar0 said:
    on the target hardware with almost no noise (possibly due to tasks and interrupts) there are bit errors

     

    Can you explain, please, why tasks and interrupts would or could cause noise on your input data?

    Why would you have different tasks and interrupts running on the hardware than you have on the simulator? Can they not run the exact same test program?

    c0l0jar0 said:
    What else should I be doing and what can I try to find the issue?

    Debugging assembly is difficult, maybe by definition of the situation. With C code, you can compile with the optimization turned off and debug -g turned on for functional testing. If possible, expand your tight loops in assembly to remove some the high-performing pipelining so you can watch the register activity with a lot less confusion.

    Hopefully, you started from C code to test your algorithm, and then moved to higher levels of optimization and then to the pure assembly. If so, try running those less optimized version to see how they perform with the exact same data.

    Regards,
    RandyP

  • First of all, thank you very much for the detailed reply. I really didn't expect someone to provide helpful feedback. None of the DSP engineers at my current company have this kind of experience and I am trying to self-learn this level of hand-optimized assembly because the Viterbi decoder is 75% of the MIPs usage on our SDR application. I feel like I am 95% there so I hope I can see this through to completion.

    BACKGROUND

    To clarify, I mentioned the gains to explain the motivation for my effort. Part of the reason for the gain is due to the extended instructions available on our 6740 (-mv 6740 btw). The original code was C64x (not +). The new code is not based on the other. Still, I am very pleased to have come this far and have code comparable to theirs.

    Yes DSS does mean Debug Server Scripting. I run all test-benches with DSS in Matlab (used as a javascript). I use a special minimal project to call only the code under test.
    I use the C674x CPU cycle accurate simulator.
    I use CCS 5.3. 

    I'm sorry I wrote a confusing sentence. On the radio SNR is very high (0 errors even without FEC) but I did not mean to somehow suggest tasks and interrupts had anything to do with noise. Those were two separate thoughts. I mentioned tasks and interrupts because I did not know if there was a step that needed to be taken to make the code "reenterable". 

    The simulator does not have tasks or interrupts and of course has a flat memory model. All it does is call the function from Matlab (kind of like is done with MEX).

    I have tested this on two radios, one build keeps the input and scratch buffer in L3. The other keeps one in L1 and the other in L2. Code is in L2.

    QUESTIONS

    Are there any conditions in which the loading of words from memory would be delayed or early or anything like that? I use registers after loading to them with the assumption the previous value will be available for 4 cycles. Without this assumption performance would not be as good as it is. Any potential problems to avoid? Registers that need to be set a certain way?

    MOVING FORWARD

    I did start from C-code and I have saved it off, I will try to step back a few steps.

    I like the idea of controlling the decoder input.  I will save the Matlab input to cheap memory and copy that in at initialization and never re-write the input buffer then see if it still has a problem. That way I can know if it is the input.

    Thank you for your help.

    Chris

  • Chris,

    Which DSP is on the two radios?

    Please be sure to read the description of the CPU CA simulator in the Target Configuration window. There should be no issue with bit accuracy, but timing accuracy is not what Cycle Accurate means. I do not care for that name, but within the perspective of that simulator, it is correct, but not within the perspective of you developing and benchmarking an application that uses cache and anything other than L1 SRAM.

    c0l0jar0 said:

    QUESTIONS

    1. Are there any conditions in which the loading of words from memory would be delayed or early or anything like that?
    2. I use registers after loading to them with the assumption the previous value will be available for 4 cycles.
    3. Without this assumption performance would not be as good as it is.
    4. Any potential problems to avoid?
    5. Registers that need to be set a certain way?

    1. I am not really sure what you are asking here. Open-ended questions on assembly and architecture issues go back to the CPU & Instruction Set Reference Manual, where all things about the assembly instructions are addressed. There is a lot in there and there is a lot fo have to consider. The C6000 Optimization Workshop is available on the TI Wiki Pages, and it is a great source of clarification on the basic points of the architecture.

    2. This is the classic definition of a non-interrupt-tolerant situation. You must have interrupts disabled when executing code like this. The registers will get loaded with the new values in 4 cycles even if those four cycles occur while branching to an ISR.

    3. #2 is also the classic definition of software pipelining. It is vital for the performance, as you said, but does require interrupts to be turned off.

    4. The biggest problem to avoid is assembly programming. We always recommend that you write in C, optimize in C, use pragmas to optimize more in C, re-write your C code to let it be optimized better, then do any assembly programming by starting with what the compiler generated. There are lots of potential problems to avoid. Everyone will be described in the CPU & Instruction Set Reference Guide. The C compiler knows all the idiosyncrasies of the architecture, like when a register is being used too much, which subtle conflicts will cause stalls and which ones will cause errors, and what all can be done to clear/add/move/shift register values.

    5. Not sure I understand what would fit this question, but the answer would probably be the same as #4.

    Regards,
    RandyP

  • RandyP said:

    QUESTIONS

    1. Are there any conditions in which the loading of words from memory would be delayed or early or anything like that?
    2. I use registers after loading to them with the assumption the previous value will be available for 4 cycles.
    3. Without this assumption performance would not be as good as it is.
    4. Any potential problems to avoid?
    5. Registers that need to be set a certain way?

    1. I am not really sure what you are asking here. Open-ended questions on assembly and architecture issues go back to the CPU & Instruction Set Reference Manual, where all things about the assembly instructions are addressed. There is a lot in there and there is a lot fo have to consider. The C6000 Optimization Workshop is available on the TI Wiki Pages, and it is a great source of clarification on the basic points of the architecture.

    [/quote]

    If data can't be delivered in 4 cycles, then whole pipeline stalls, and illusion the data is delivered in 4 cycles is maintained 100%. As for "early" appearance of data, see below.

    RandyP said:

    2. This is the classic definition of a non-interrupt-tolerant situation. You must have interrupts disabled when executing code like this. The registers will get loaded with the new values in 4 cycles even if those four cycles occur while branching to an ISR.

    Correct, but a bit misleading. "You must have interrupts disabled" implies that programmer adheres to DINT/RINT. But there is situation when processors handles interrupts in a manner that allows for situation in question. SPLOOPs. In situation in question and in SPLOOP one doesn't have to explicitly disable interrupts. This is because if SPLOOP is interrupted, all modulo-scheduled operations are completed in implied order (SPLOOP is drained) prior interrupt is handled.

    RandyP said:

    4. The biggest problem to avoid is assembly programming. We always recommend that you write in C, optimize in C, use pragmas to optimize more in C, re-write your C code to let it be optimized better, then do any assembly programming by starting with what the compiler generated. There are lots of potential problems to avoid. Everyone will be described in the CPU & Instruction Set Reference Guide. The C compiler knows all the idiosyncrasies of the architecture, like when a register is being used too much, which subtle conflicts will cause stalls and which ones will cause errors, and what all can be done to clear/add/move/shift register values.

    TI assembler is aware of the quirks as well, and emits warnings and errors pointing at problematic instructions. And not all of the quirks are described in Instruction Set Reference Guide, you have to look in errata documents as well. Yes, assembler should warn you about errata cases as well. Though this implies that assembler is up-to-date.

  • Andy and Randy,

    Thank you both for your help. I have proven that the problem is the interrupt intolerance and I should have thought of this from the beginning.

    Looking back, trying to use the registers in the cycles before the LDDW write occurred made the code faster but MUCH harder to design. I needed to expand the code and free a register pair specifically for the loads but it works now. The performance is now about 40 cycles per bit from 32. The old decoder was 55 so it is still an improvement.

    This was a great learning experience. 

    Thanks

    Chris

  • c0l0jar0 said:

    The performance is now about 40 cycles per bit from 32. The old decoder was 55 so it is still an improvement.

    It's mistake to look at improvement coefficient alone. You should try to estimate theoretical limit and ask yourself why are you far from it. Of course it might turn out that theoretical estimate didn't take something into account and was therefore wrong, but it doesn't change the point. That improvement coefficient relative to some other code, purchased or not, doesn't necessarily tell that there is no room for improvement.

  • Chris,

    Good job finding the issue and fixing it. Assembly work is very tedious and it is easy to have subtle problems like this.

    Three suggestions, if you ever need to improve the performance from 40 cycles:

    1. Use SPLOOP, as Andy mentioned. It creates an interruptible solution that takes care of that issue within the loop. This is not always possible, for example when there are function calls or when the loop kernel length is too long. The compiler will make use of SPLOOP when it can do so, and that would be a good example to start from.

    2. Depending on the available execution units, you could disable and re-enable interrupts within a hand-pipelined loop. You would want to save the original value of the CSR.GIE (and maybe PGIE) for the restoration part in case interrupts were already disabled when the function was called.

    3. Like the compiler often does, you can move the old register value to another register that will not be dependent on the 4-cycle delay. This extends the lifetime of that value in the second register since it will not depend on the completion of the LDDW.

    Regards,
    RandyP

  • Just for completeness sake. There is one more case when interrupts are handled in special manner, it is post-branch delay slots,where they are implicitly disabled. So that "lower" edge of conventional loop and return from subroutine are also places where one can rely on counter-intuitive order of instructions. One can note that if one can't count on interrupts being enabled upon entry to subroutine [in which case you can't use DINT/RINT, but have to examine GIE as Randy suggests and act accordingly] one can protect short special code sequences with

    B label

    delay-slots long special code

    label:

  • Hi,

    Your experience is really exciting. But I do not understand what you said "the interrupt intolerance" and rearrange register pair. Could you explain it in more detail?

     

     

    Thanks,

  • Robert,

    You were addressing Chris, I believe. But I will comment on one item that I discussed with him above.

    The C6000 architecture implements an unprotected instruction pipeline. This is done to allow the most possible instructions to execute in parallel and the most possible instructions to execute per clock cycle. The tradeoff is increased assembly instruction complexity to get increased performance.

    One functional limitation of this is that you can write instruction sequences that cannot tolerate being interrupted. For these sequences, you must disable interrupts or rewrite the assembly code to be interrupt tolerant. To understand more about the pipeline, please refer to the CPU & Instruction Set Reference Guide for the DSP core that you are using.

    I strongly recommend avoiding writing C6000 assembly code. Instead, use the C Compiler and its optimization capabilities to get high-performing, readable, debuggable, and supportable code.

    You may have misunderstood the reference about "rearrange register pair", or else I misunderstand what you are asking. Chris mentioned needing a register pair for the LDDW instruction, so he had to rearrange his register assignments to make a pair available.

    Regards,
    RandyP

  • Thank you RandyP. I have no ambition to write an assembly project now. I have some hardware (FPGA) experience. I know the compiler efficiency is very amazing. To write DSP C code, I cannot compete with many engineers. I want to learn a little low level coding technique in case of needed. I really find that there is some similarities between FPGA and DSP asm. Both of them extensively uses pipeline and parallelism to improve throughput. For a career point of view, some companies have open for such engineer. A nearby consumer product company looks for DSP asm programmer for more than one year. They requires high throughput with a little memory footprint.

    Chris' achievement is excellent. I have known that the interrupt occurs when the 4 cycle LDDW executes, and this results the decoder errors.

    I still do not understand why Chris has the interrupt related error even he used DINT/RINT in his first post. Later what instruction he uses to disable interrup correct that error?

     

     

    Thanks

  • I don't know exactly how DINT/RINT did not fix the original problem. I did see on another thread that someone experienced the same problem and they claimed that something in BIOS would still cause interrupts.

    The original problem was that I was loading into a register and continuing to use it for a few more cycles assuming the load had not yet written to the register. When interrupted this could be a incorrect assumption. That is the conclusion that I came to with the help of Andy and Randy.

    What I did to loosely prove this theory was to free a register to allow for alternating the loading into a second register pair that was not used as much immediately after the load instruction. This reduced the errors by about 90%. I then stretched out the pipelined/unrolled inner loop so that there would be zero cases where this invalid assumption was used and the errors disappeared.

    Looking back on it, if I had saved some of the save-on-use registers to the stack I could have not lost any performance at all. At this point I have a lot of other work to do and I don't know if I will get back to this. It is good enough for now.

    Since everyone keeps mentioning that they recommend relying on the C/C++ compiler I wanted to make a comment. This Viterbi decoder uses ~80% of the cycles when  the DSP is at its worst case loading. We already had a Viterbi decoder written in assembly and with it we nearly had zero cycles to spare. For this reason there would be no point in writing a C decoder. I happen to enjoy the theory of Viterbi decoding and I wanted to leverage my knowledge of it to achieve performance that is much greater than the compiler can (better by a factor of 16 - trust me I tried it first in C++). A Viterbi decoder, I think, is an excellent candidate for hand-coded assembly because it is so parallelizable.

    Thanks again Randy and Andy for the feedback. Maybe someone can explain why RINT/DINT didn't work?

  • "The performance is now about 40 cycles per bit from 32. "

    A new question is about the above statement. If you find the new interrupt disable command, can you get the 32 cycle per bit performance again without error? How many cycles are needed for the new disable interrupt overhead? If it is less than 8 cycles, it is still better than 40.

     

    "if I had saved some of the save-on-use registers to the stack I could have not lost any performance at all."

    Could you explain the above a little more to me?

     

    What kind of encoder is used in your project? (e.g. 171 133 convolutional code? which type?) I really want to know this to get connection to your cycle count?

     

     

    Thanks for your great work.

     

  • Hi, Chris:

    "When running the code in DSS (CPU accurate simulator) through Matlab"

     

    Could you give me more information (web link, tutorial or something else) on how to use DSS through Matlab?

     

     

    Thanks,

  • Hi,

    Chris may feel my precious question is appropriate to answer. Sorry about that. I want to know the hand code ability in C64 fixed point. I cannot find it for a long time. Can I ask how many trellis butterflies (2 in X 2 out) in your decoding?

    What new instructions you use in  C6740 comparing C64?

    Your fist post said that your code was 25% faster than the original. Later, you quoted the original code cycle was 55. I cannot get your code cycle either 32 or 40. I ask you here because I often cannot get the cycle number consistent in similar article/post. I want to get a cycle calculation  example. Thanks,

     

  • I'm sorry I have not been able to get back to you. I went on vacation and I have been busy since.

    1) I didn't get DINT RINT to work, I never found the root cause but I thought it might relate to another thread that talked about the OS still using interrupts even after DINT. Not sure but either way I don't want to disable them for the final implementation.

    2) I had to "spread" the instructions thinner because of register use. If I had instead saved some of those "save-on-use" registers to the stack I would not have had to change the instruction order, only the registers used. Save on use is like A10-15 and B10-15 I think. I wasn't using those and if I had I would have kept the same performance.

    3)  The code is 133 171.

    4) For DSS in Matlab I use CCS v5.3 and Matlab R2012B 32-bit. This is very important. I put the following at the top of scripts to only include the paths once, prevents warnings.

    ds_dir = 'C:\ti\ccsv5\ccs_base\DebugServer\';

    % Do not add the path if it is already present
    java_path = javaclasspath;
    path_already_added = ~isempty(regexp(java_path,'dss.jar', 'once'));
    if(~path_already_added)
    javaaddpath([ ds_dir 'packages\ti\dss\java\dss.jar' ]);
    end

    % This goes at the top of the script 
    import com.ti.debug.engine.scripting.*;
    import com.ti.ccstudio.scripting.environment.*;

    Let me know if you need more help, I haven't gotten the actual GUI to open and attach to the DSS instance yet.

    5) The assembly:

    ; --- Loop Unrolling ---
    ; The inner loop has been completely unrolled.
    ; Side A and B each calculate FOUR branches at a time for a total of EIGHT
    ; There are 64 possible start and end states for each outer iteration so that means
    ; EIGHT sets of EIGHT have been unrolled and pipelined/overlapped
    ; Outer loop iterations also overlap for efficiency

    6) C67 has addsub2 and I load 4 path metrics at a time (LDDW) on both A and B side.

    7) Old 3rd party code(C64x): 55cycles/bit, my first implementation: 32, my recent implementation: 40 (but didn't need to be, I don't have time to fix it, see second answer).

    If you are using C64x I would think you should hope for 55cycles/bit.

    Good luck

    Chris

  • Thank you for your reply. I remember that someone did not suggest me using circular addressing in FIR implementation in the past on TI's forum for BIOS (and DSP status register) reason. Here for the metric table, who stores the accumulated metrics, has two addresses of 64. One 64 table is for previous while the other 64 is for updating. The reference design in TI DSP uses circular buffer. Do you use circular or not? There is no problem with the BIOS (OS)?

    Another small question is about the soft decision value. You use 4 bits? These soft decision data are packed to words, or not?

    What is the decoded bit stored? They are packed to 32 bit word?

    I had written C code for Viterbi. I am interested in writing it in asm as a private project for quite some time.

     

     

    Thanks,

     

  • Please ignore my input data format question. It has said clearly. Sorry about that.

     

    "C67 has addsub2 and I load 4 path metrics at a time (LDDW) on both A and B side."

  • Robert W said:

    I remember that someone did not suggest me using circular addressing in FIR implementation in the past on TI's forum for BIOS (and DSP status register) reason.

    A bit off topic. The attitude toward assembly programming is generally "hostile" [i.e. as maintained by participating TI employees], not to mention usage of esoteric capabilities such as circular addressing. I'd argue that it's rather rude to discourage people from using it, but it's rude not toward discouraged programmers, but toward TI hardware designers. Indeed, compare the effort devoted into providing really cool capabilities, and effort to advise not to use them.

    As for specifically circular addressing. As already implied earlier [in that other thread], the question whether or not to use it is effectively equivalent to whether or not operating environment supports it and if not, whether or not you are in position to do something about it. Or in even more practical terms. If you have source code for all interrupt service routines, then you can add pair of lines to each one of them to allow AMR usage [in interruptible code]. Ideally it should be arranged by compiler, i.e. handling of AMR (as well as some other status registers) in subroutines declared as interrupt handlers. But this would require an attitude change from TI...

  • Please see section 5.1.3 of the Compiler User Guide:

    Also, please look for the intrinsic version of any special assembly instruction you may want to use. The compiler team tries to have an intrinsic available for any instruction that would not normally be used by the compiler, such as GMPY or LMBD. These are listed in the Compiler User Guide.

    Regards,
    RandyP 

  • Hi,

    Thank you, Chris and the other two experts.

    When I try to write asm code, I find a problem related to memory reading. You said:

    " I load 4 path metrics at a time (LDDW) on both A and B side."

     

    I get the (133, 171) code nextStates, outputs as below:

    [aa.nextStates aa.outputs]

     

         0    32     0     3           

         32     0     0     3

          1    33     1     2

         33     1     1     2

    Those 8 branches is for register A and register B to complete. Is it right? Anyway, state metric 0 and 32 are far away, how can you using LDDW on both A and B side?

    Maybe I am wrong on your description. Please correct me.

     

     

     

     

  • Do you use two LDDW readings?

    one: LDDW branch 0,1,2,3

    two: LDDW branch 32,33,34, 35?

     

    These two LDDW cannot run at the same time for memory bank conflict. right?

     

    state branch is 16 bits. Is it right? 

    Please confirm or correct me. Thanks again.

  • Robert,

    I do believe I have already given you some very valuable information. I think you can create your own design given those hints. 

    I would recommend doing an initial design in a spread sheet though. Each column is one iteration of the unrolled outer loop, you can then see how they overlap.

    Good luck

  • 7) Old 3rd party code(C64x): 55cycles/bit, my first implementation: 32, my recent implementation: 40 (but didn't need to be, I don't have time to fix it, see second answer). If you are using C64x I would think you should hope for 55cycles/bit.

     

    I would like to know whether 40 cycles includes both ACS and traceback, or only ACS of the metric calculation? Thanks,