How is T3 getting trashed during CALL?

Brian Willoughby

I have a situation where my firmware is not working, and when I step through the code I see that T3 will sometimes be trashed (usually loaded with 0, but sometimes with a large number like 0x7FFE). At first, I dug deep into the called subroutine, but after turning up nothing I noticed that T3 might be trashed by a different subroutine each time.

The big picture: This is a C5506 on a custom board that runs other C55 test firmware perfectly. I am not using DSP/BIOS. I am only using CSL. The code is a mixture of C and assembly, including some code from dsplib 2.40. My interrupts are all very short, are written entirely in C, do not call assembly, and do not write to arrays. Basically, the interrupts are pure C code that updates global volatile variables, some 32-bit, some 16-bit. All data is in DARAM, except for one read-only 8K DMA buffer in SARAM. All code is in SARAM. I am using the large model and function-level optimization. The linker command file skips the memory-mapped registers, so T3 is probably not being clobbered in memory. One of the first things that I do in my C startup code is turn off C54CM via a custom assembly routine that I wrote. All code compiles without warnings because I have dealt with each of the warnings like CPU_116 until the compile/assemble process is clean.

The problem is that one of my critical non-interrupt functions has a for loop from 0 to 7, and the C compiler uses T3 as a counter to determine how many times the loop is repeated. When T3 gets trashed with a value like 0x7FFE, that function takes several thousands of cycles longer to execute, and possibly trashes memory. This critical function is called approximately twice per USB frame, so it's being called at 2 kHz.

I use a number of Texas Instruments assembly subroutines in this non-interrupt function, such as rfft(), and a few assembly subroutines that I created myself, such as a square and sum vector function and a 32-bit sqrtv() function. But each of these assembly subroutines always pushes T2 and T3 on the stack if it alters those registers.

This tells me that either I am trashing my stack, or the interrupts are trashing the stack, or the interrupts are trashing T3 directly.

When stepping through my critical function, I usually step over the fft or sqrt functions, and I can see in the debugger that T3 is changed while stepping over a single assembly CALL instruction. This really makes no sense to me.

One thing I tried was to move my stack allocation to the opposite end of DARAM memory. The idea here was that if any of my regular C arrays were overflowing (out-of-bounds writes), then originally the next object in memory was the stack. So, I moved the stack very high in memory such that I should notice other data being trashed before the stack, but this made no changes. That seems to leave interrupts as the culprit.

Although I have seen T3 change when stepping over a CALL, I also have stepped into the assembly functions, and one time I caught the following instruction alter T3: OR #0120h, mmap(ST1_55), which should be equivalent to BSET SXMD, BSET SATD. Reading about some of the status register mode bits, it seems that T3 may be affected by the C54CM mode bit, but I am unclear whether this could affect my program. As I mentioned, I clear C54CM right away in my program, so C54xx mode should never be enabled, and yet I still have lingering suspicions that T3 might somehow be affected by C54CM or some other mode.

Although I have never caught my code jumping into invalid memory (does the C55xx actually use the stack for return addresses?), it does seem to be the case that if I let my code run long enough then halting it with the debugger will show the Program Counter is in the middle of nowhere, but this could be a false reading if the debugger has somehow become confused by register trashing.

Note: If I comment out the single call to rfft() within my for (c = 0; c < 8; c++) loop, then everything runs fine (except that I'm processing time-domain samples instead of frequency domain). This makes it seem like cfft() or cbrev() or unpack() are trashing T3 directly, but they all properly save T3 on the stack. Surprisingly, I have also caught sqrt_16() killing T3, and the same story there is that the assembly seems to be properly saving T3 on the stack.

Anyone have any clues? Is there anything special about T3 with regard to C54xx Compatibility Mode, or any other mode for that matter?

Can I rely on the compiler to save all allocated registers that are used in a C interrupt function, provided that I use both the #pragma INTERRUPT and the C compiler interrupt type for the function? The assembly output of the C compiler sure seems to be producing valid interrupt routines that save context on the stack, so I don't really see how the interrupts could be trashing registers.

One possibility is that I have an out-of-bounds array write which is trashing the stack, but I have reviewed nearly all of my code and have not found anything obvious.

What I am looking for here is to hear whether anyone is aware of something that I might be overlooking. I have studied reams of TI DSP documentation, and I am quite familiar with the rules for mixing assembly and C, particularly that T0 and T1 are not preserved across a subroutine CALL, but T2 and T3 are supposed to be preserved. As I said, I have witnessed T3 get trashed when stepping over a number of different CALL instructions in this loop, so what do I suspect next? The debugger? The compiler? Interrupt context?

over 15 years ago

0 Steve Tsang over 15 years ago

TI__Mastermind 23785 points

Brian,

This is a tough one. One of my colleagues will take a crack at it.

Regards.

0 David Rick over 15 years ago

Intellectual 390 points

Brian,

I chased a problem like this about five years ago. It was very difficult to find, but it turned out that an assembly language routine in DSPlib was not properly preserving the T3 register. T3 is (or used to be) a "save on entry" register. In C, it isn't normally stacked unless the called function actually uses it. Then it's the called function that must push and pop it, not the caller. If you are calling assembly functions, they must follow the same rule. The problem was that DSPlib's maxval and minval routines weren't doing this. TI had documented a patch; when I did it my problem went away. It may be that it was fixed in a later version of DSPlib, but I don't know. Are you calling DSPlib functions?

David L. Rick

Hach Company

0 Brian Willoughby over 15 years ago in reply to David Rick

Genius 4630 points

Yes, I am calling dsplib functions. I have also been reviewing them specifically for any problems with not saving T3. So far, I have not found an obvious culprit.

I even had a moment where I thought T3 was not being saved by unpack.asm (part of the rfft() macro), until I looked closer and realized that MOV pair(T2), dbl(*SP(#00h)) is actually saving T3 on the stack, too.

I have fixed many bugs in dsplib, and I really wonder why there isn't a more up-to-date release with bug fixes from all around. I should keep looking, because perhaps I missed reviewing one function that I am using.

What's really curious is that if I remove rfft() from my loop, the problem goes away. Yet I have carefully reviewed cfft_scale, cbrev, and unpack and I cannot find a problem with T3 in those routines. We may be on the right track...

P.S. Thanks! Hopefully TI can help get to the bottom of this.

0 Peter Chung over 15 years ago in reply to Brian Willoughby

TI__Expert 8065 points

Brian,

I also looked at the three assembly functions and have not yet found a culprit. cbrev.asm does not save T3 on the stack, but it does not use T3 at all.

One thing that came to mind is that it may help to disable the optimization option. At least you can try it to see whether it makes any difference. Sorry, there is not much help I can provide.

Regards,

Peter Chung

0 Brian Willoughby over 15 years ago in reply to Peter Chung

Genius 4630 points

Thanks, Peter.

I am now using the three subroutines of rfft() directly, so that I can isolate the effects of each.

unpack() seems to be trashing T3, which to my best interpretation means the stack is being trashed some time between when T3 is saved and restored. I am continuing to look into this. As I mentioned, I have moved the stack around in memory. I'll have to check my interrupt code, too.

cfft_scale() causes some serious DMA sync drop problems, but it does not seem to trash T3.

cbrev() seems to be the safest of the pieces of rfft(). I now use it off-place for performance, which isn't possible within the rfft() macro.

I will try disabling optimization, but with my firmware, lack of optimization means the code does not fit into the 1 ms time slot needed for USB servicing, so I may be forced to hand-optimize various C code as assembly before I can successfully run without any optimization. Thanks for the suggestion.

0 Brian Willoughby over 15 years ago in reply to Brian Willoughby

Genius 4630 points

I did try turning off all optimization, and the T3 trashing seems to cease. But the problem is that I miss isochronous transactions on many USB frames because the routine now takes longer than 1 ms to complete. I have the added complication that my DMA sync drop interrupt is being called thousands of times per second, so I can't be 100% certain that all is working.

I suspect that the C interrupt routines might somehow be responsible for trashing T3, but I don't think I have reviewed the assembly output for all of my C interrupts yet...

0 Brian Willoughby over 15 years ago in reply to Brian Willoughby

Genius 4630 points

Is there any difference between the following two code sequences?

  PSH T2
  PSH T3
  ...
  POP T3
  POP T2

  ; versus

  AADD #-2, SP
  MOV pair(T2), dbl(*SP(#00h))
  ...
  MOV dbl(*SP(#00h)), pair(T2)
  AADD #2, SP

The reason I ask is that unpack() seems to be the only routine giving me trouble by trashing T3, and it's also the only one using the MOV pair/dbl instructions. It makes me wonder if somehow the stack utilization is different, and that's why the value is bad when T3 is restored from the stack.

P.S. The documentation is severely lacking or inaccurate. unpack() is not documented at all in SPRU422J, although it is briefly mentioned under rfft(). At least cbrev() has its own entry with a benchmark, so it is surprising that unpack() lacks one, since it takes even more cycles.

In addition, the header for unpack.asm claims that it unpacks the output of a radix-2 DIF complex FFT, and yet the rfft() macro uses cfft(), which is defined in its headers as a DIT complex FFT. I suppose it's not too bad, since I have found references online that confuse DIF with DIT. But it certainly isn't reassuring that dsplib is not even consistent.

0 Peter Chung over 15 years ago in reply to Brian Willoughby

TI__Expert 8065 points

Brian,

I don't see a problem with the code. One question for you: did you step into the assembly code and check the T3 value right after "MOV dbl(*SP(#00h)), pair(T2)"?

You can include the source file into your project and step into each line of the unpack.asm.

Regards,

Peter Chung

0 Brian Willoughby over 15 years ago in reply to Peter Chung

Genius 4630 points

Yes, I did step into the assembly code with the register window in CCS3 open. The debugger showed T3 in red as it changed to an out-of-range number (out of range for the loop). I was able to reproduce this several times: on entry, T3 is between 0 and 8, but at that instruction T3 changes to something that was not saved on the stack.

The only thing I haven't done is open a memory window on the stack to witness the original value of T3 be placed there. I'm hoping I might be able to catch the corruption while stepping through the code.

P.S. I do not recall whether or not dsplib is available as a linkable library. My projects all include the source and build them every time, although I do use the common dsplib.h

0 Brian Willoughby over 15 years ago in reply to Brian Willoughby

Genius 4630 points

Thanks, Peter. I just found the problem. This is a serious bug in unpack.asm, rendering the routine useless in any environment with interrupts.

Basically, the code should be: MOV pair(T2), dbl(*SP(#1))

With the Stack Pointer offset of #0, as in the original code, T2 is written within the safe area of the stack frame, but T3 is written outside the stack frame. The first interrupt that comes through will trash the original value of T3, so it's no wonder it is trashed when "restored" from the stack later.

To clarify with an example, I will walk through the code as I step in the debugger. Initially, my XSP = 0x0036E3. After AADD #-2, SP the value is XSP = 0x0036E1. At this point, one would expect T2 and T3 to be written to the stack at addresses 0x0036E1 and 0x0036E2, but watch what happens. After MOV pair(T2), dbl(*SP(#0)) the value of T2 is written to 0x0036E1 while the value of T3 is written to 0x0036E0. Note that 0x0036E0 is actually outside the stack frame! The stack frame element at address 0x0036E2 is never even used. As soon as an interrupt occurs, 0x0036E0 is modified by the interrupt event handling, and from then on the original value of T3 is lost. I discovered this by opening a memory window on the stack and stepping through individual assembly opcodes.
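The address arithmetic in that walk-through can be modelled in a few lines of C. This is only a sketch: it assumes the rule implied by the debugger session above, namely that a dbl() access puts the first register of the pair (T2) at the addressed word and the second (T3) at the other word of the same even-aligned pair, i.e. the address with bit 0 flipped. The function names are made up for illustration and are not part of dsplib or CSL.

```c
#include <stdint.h>
#include <stdbool.h>

/* Word address that receives the first register of the pair (T2). */
static uint32_t dbl_high_addr(uint32_t addr) { return addr; }

/* Word address that receives the second register (T3): ASSUMED to be the
   partner word in the same even-aligned pair, i.e. bit 0 flipped. */
static uint32_t dbl_low_addr(uint32_t addr) { return addr ^ 1u; }

/* True if a word address lies inside a frame of `words` words whose lowest
   address is sp (the stack grows downward, so the frame occupies
   sp .. sp + words - 1). */
static bool in_frame(uint32_t addr, uint32_t sp, uint32_t words)
{
    return addr >= sp && addr < sp + words;
}
```

With sp = 0x0036E3 - 2 = 0x0036E1, dbl_high_addr(sp) is 0x0036E1 and dbl_low_addr(sp) is 0x0036E0, and in_frame(0x0036E0, sp, 2) is false, which matches what the memory window showed.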

WARNING: It would seem that this error is in several source files that are part of dsplib, namely: abias, araw, aubias, cbias, craw, cubias, fltoq15, power, q15tofl, unpack, unpack32, unpacki, and unpacki32.asm - thus, it is not safe to use any of these dsplib routines in an environment where interrupts might occur.

Aside: While the MOV pair(TAx), dbl(Lmem) instruction is really "cool" because it moves 32 bits in a single cycle, there really seems to be no advantage to using it here. PSH T2, PSH T3 takes only 4/5 as much code space, runs in the same number of cycles, and is not prone to programmer offset mistakes. In contrast, saving ACx to the stack requires the dbl(Lmem) address mode, and thus care must be taken to use an odd offset.

Question: Is there some processor quirk that changes the address of Long Word accesses? SPRU371F has some confusing information on page 3-7 where there are two Long Words shown, each with a different endian scheme, but no explanation of what the deciding factor is for the odd/even addressing. Could this be related? If so, shouldn't unpack.asm assume that it is being called from C and adjust the stack to be long-word-aligned? Most of dsplib comes with the warning that the code is inefficient for assembly usage because its routines are designed to be called from C.

0 Peter Chung over 15 years ago in reply to Brian Willoughby

TI__Expert 8065 points

Brian,

Thank you for the update.

Actually, I was going to write to you about the odd and even address alignment with 32-bit data. I was checking whether I can push T2 and T3 one at a time.

Yes, that should be the culprit. I don't know why "MOV pair(T2), dbl(*SP(#0))" was used. If the stack pointer address is odd, T3 will be saved outside of the stack frame.

I think pushing T2 and T3 one at a time to stack is safer as below.

PSH mmap(T2)
PSH mmap(T3)

Please do not use "MOV pair(T2), dbl(*SP(#1))" either, unless you are sure about the stack pointer address when unpack() is called.

Since you include each source file of dsplib in your project, you can modify the code and see if it fixes the problem.

Regards,

Peter Chung

0 Maarten Venter over 15 years ago in reply to Peter Chung

Intellectual 395 points

I am using the acorr() function from DSPLIB and the same thing happens to me. As soon as I use the acorr() function in a loop where I reconfigure DMA structures and interrupts are working, my code bombs out. When I remove the acorr() function everything works fine. I am also using other DSPLIB functions like power(), fir(), atan2_16(), sqrt_16(), dlms() without problems in the same section. But when I use acorr() something goes wrong. I do not have optimization switched on.

What can I do to fix this, do I also have to rebuild the DSPLIB source code by adding the asm files to my project? I am not sure what to fix in the acorr.asm file.

Please help

Regards

0 Brian Willoughby over 15 years ago in reply to Peter Chung

Genius 4630 points

Thanks again, Peter.

My immediate fix for unpack.asm was to use PSH and POP, and it wasn't until later that I realized dbl(*SP(#1)) could suffer the same problems.

On a related note, I have written a 32-bit version of sqrt_16(), and I use MOV AC0, dbl(*SP(#0)) and MOV dbl(*SP(#0)), AC0 for a temporary variable. My question is: How do you save an Accumulator register in the stack frame as a 32-bit value without suffering from this alignment issue?
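To see why neither offset works in a two-word frame, here is a small C model of the check. It is a sketch under the same assumption as the behaviour observed in this thread (a dbl() access always covers one even-aligned pair of word addresses). The padded three-word frame at the end is just one conceivable workaround and is not something proposed by anyone here; dbl_fits and aligned_offset are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

/* ASSUMED dbl() rule, taken from the behaviour seen in this thread: the two
   words of a long access always occupy one even-aligned pair of addresses.
   Returns true if both words fall inside the frame sp .. sp + words - 1. */
static bool dbl_fits(uint32_t sp, uint32_t words, uint32_t offset)
{
    uint32_t addr  = sp + offset;   /* word named by *SP(#offset)      */
    uint32_t other = addr ^ 1u;     /* its partner in the aligned pair */
    uint32_t lo = addr < other ? addr : other;
    uint32_t hi = addr < other ? other : addr;
    return lo >= sp && hi < sp + words;
}

/* Offset of an aligned long slot inside a frame padded to three words:
   0 when SP is even, 1 when SP is odd. */
static uint32_t aligned_offset(uint32_t sp) { return sp & 1u; }
```

With an odd SP such as 0x0036E1 and a two-word frame, dbl_fits() is false for offset 0 (the low word lands below SP) and also for offset 1 (the pair spills into the caller's word above the frame). With one extra word reserved, dbl_fits(sp, 3, aligned_offset(sp)) holds for either parity. In real assembly the offset has to be a constant, so this only helps if the parity of SP at that point is known.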

0 Brian Willoughby over 15 years ago in reply to Maarten Venter

Genius 4630 points

Maarten,

Yes, in your case you would remove 55xdsp.lib or 55xdspx.lib from Incl.Libraries in the Linker Build Options and then add the individual source files to your project for the routines that you use. I would suggest that you copy the relevant file, e.g. C:\dsplib_2.40.00\55x_src\abias.asm (or araw.asm, or aubias.asm, as appropriate), into your source control management system, and place the copies with your other source code so that you can make modifications and later potentially merge outside bug fixes.

For the specific case of acorr(), the dbl(*sp(#0)) address mode is only used to save XAR5, XAR6, and XAR7 into pre-allocated stack frame memory. Most importantly, the saved values are never accessed within a loop, but are only read back as the last step before exiting the routine, so this fact makes it possible to replace MOV with PSHBOTH and POPBOTH. If you are using the large memory model, then you need to save more than 16 bits, so you can't simply use PSH and POP. PSHBOTH and POPBOTH are handy opcodes that use both stacks to store a pair of 16-bit values separately, so there is no 32-bit alignment issue. There are comments in the code indicating that the MOV operations were once PSHM, which might have been a similar opcode or alias. If you make this change, then you can reduce the size of REG_SAVE_SZ from 6 to 0, although that is not necessary.
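As a rough illustration of why PSHBOTH and POPBOTH sidestep the alignment question, here is a toy C model of the two stacks. It is a sketch only: it assumes that PSHBOTH stores the low 16 bits of the extended register on the data stack (SP) and bits 22-16 on the system stack (SSP), one word each, and the struct and function names are invented for the example.

```c
#include <stdint.h>

#define STACK_WORDS 16u

/* Toy model of the two stacks; each grows downward one word at a time.
   No bounds checking: the caller must not push more than STACK_WORDS. */
typedef struct {
    uint16_t data[STACK_WORDS];   /* data stack, indexed by sp    */
    uint16_t sys[STACK_WORDS];    /* system stack, indexed by ssp */
    uint32_t sp, ssp;
} dual_stack;

static void stack_init(dual_stack *s)
{
    s->sp  = STACK_WORDS;
    s->ssp = STACK_WORDS;
}

/* PSHBOTH (assumed): low 16 bits to the data stack, bits 22-16 to the
   system stack. Each store is a single 16-bit word, so SP parity is
   irrelevant. */
static void pshboth(dual_stack *s, uint32_t xreg)
{
    s->sp--;
    s->ssp--;
    s->data[s->sp] = (uint16_t)(xreg & 0xFFFFu);
    s->sys[s->ssp] = (uint16_t)((xreg >> 16) & 0x7Fu);
}

/* POPBOTH (assumed): reassemble the 23-bit value from the two stacks. */
static uint32_t popboth(dual_stack *s)
{
    uint32_t v = ((uint32_t)s->sys[s->ssp] << 16) | s->data[s->sp];
    s->sp++;
    s->ssp++;
    return v;
}
```

Pushing two values and popping them in reverse order returns the originals whether the stack pointer happens to be odd or even at the time, because no 32-bit memory access is involved.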

P.S. In my sqrt_32() routine, I cannot replace the MOV dbl(*sp(#0)), AC0 with POPBOTH because the operation occurs within a loop, and it's obviously not possible to pop a value off the stack more than once. I'm still looking for a solution for that one.

0 Maarten Venter over 15 years ago in reply to Maarten Venter

Intellectual 395 points

When I use acorr_c_bias(), the C-callable function, everything works. There must be something wrong with the .asm code.

Regards

0 Maarten Venter over 15 years ago in reply to Brian Willoughby

Intellectual 395 points

Hi Brian

Thanks for helping. I tried to replace the MOV with PSHBOTH and POPBOTH, but still get the same results.

Regards

0 David Rick over 15 years ago in reply to Maarten Venter

Intellectual 390 points

I once had a bug in some assembly code I had written in which I had pushed and popped ST1_55 without the mmap qualifier. This worked in small model, but broke (mysteriously) in large model because sometimes I wasn't really accessing ST1_55.

Now I notice that Peter's push and pop suggestion for T2 and T3 also uses the mmap qualifier. Is this required for pushes and pops of all CPU registers, or only for some? I guess I'm unclear about which push and pop instructions are doing memory-mapped access and which aren't. If this is necessary for all, then it would seem that the code below is destined to break eventually.

  ; preserve save-on-entry registers
  pshboth(XAR1)
  pshboth(XAR2)
  pshboth(XAR3)
  pshboth(XAR4)
  pshboth(XAR5)
  push(AR6,AR7)
  pshboth(XCDP)
  push(T2,T3)

David L. Rick

0 Peter Chung over 15 years ago in reply to Brian Willoughby

TI__Expert 8065 points

Brian,

Use "PSH dbl(AC0)".

This instruction decrements SP by 2, then moves the content of the accumulator high part ACx(31–16) to the 16-bit data memory location pointed to by SP and moves the content of the accumulator low part ACx(15–0) to the 16-bit data memory location pointed to by SP + 1, regardless of odd or even address.
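The behaviour described above can be sketched in C as follows. This is a toy model, not TI code: the memory array, the struct, and the function names are hypothetical, only the low 32 bits of the accumulator are modelled, and the matching pop is assumed to be the mirror image of the push.

```c
#include <stdint.h>

#define MEM_WORDS 32u

/* Toy word-addressed data memory with a downward-growing stack pointer.
   No bounds checking: sp must stay within 2 .. MEM_WORDS - 1 for a push. */
typedef struct {
    uint16_t mem[MEM_WORDS];
    uint32_t sp;
} stack_model;

/* PSH dbl(ACx) as described above: SP -= 2, high word at SP, low word at
   SP + 1, with no dependence on whether SP is odd or even. */
static void psh_dbl(stack_model *m, uint32_t acc)
{
    m->sp -= 2;
    m->mem[m->sp]     = (uint16_t)(acc >> 16);
    m->mem[m->sp + 1] = (uint16_t)(acc & 0xFFFFu);
}

/* Matching pop (assumed symmetric): read both words back, then release. */
static uint32_t pop_dbl(stack_model *m)
{
    uint32_t acc = ((uint32_t)m->mem[m->sp] << 16) | m->mem[m->sp + 1];
    m->sp += 2;
    return acc;
}
```

Starting from an odd SP of 0x11, pushing 0x12345678 leaves SP at 0x0F with 0x1234 at 0x0F and 0x5678 at 0x10, both inside the newly claimed two words.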

I have tested the instruction and it works.

Regards,

Peter Chung

0 Brian Willoughby over 15 years ago in reply to Peter Chung

Genius 4630 points

Peter,

Thanks for the suggestion, but it isn't appropriate for my situation and does not seem appropriate for any normal situation. At the risk of straying from the subject of this thread, I'll discuss my interpretation a little.

In the general case, PSH and POP are only useful when you need to save a register at the boundary of a block of code, usually a subroutine. But AC0 is not one of the registers that needs to be saved. Thus, I do not see much use for the PSH dbl(AC0) and POP dbl(AC0) instructions in normal mixed C and assembly, although perhaps there are a few esoteric situations where they might be useful, such as a significant block of code within a subroutine.

In my specific case, I need to allocate some space on the stack frame for a variable that is write-once, read-many. In other words, I need to calculate a specific value in AC0, save it to a local variable by storing it in the stack frame, and then read that value in every pass through my RPTB loop. You cannot POP the same value from the stack more than once. I did find that PSH/POP dbl(AC0) instruction addressing mode in the documentation, but realized that it would not work for me.

P.S. I noticed that there is a PSH T2, T3 opcode and a corresponding POP T2, T3 opcode. These instructions are twice as efficient as the original unpack.asm code.

David,

You raise a good question. In my experience, leaving off the mmap() modifier for any STn_55 status register results in a very clearly-worded warning/error from the assembler. The same message does not appear when T2&T3 are used without mmap(). I am left with the same question as you: When is mmap() necessary? I guess I could compile my entire firmware into one .asm file, including the C code using the preprocessor assembly output option, and then scan for code which sets the MMR data page register. Maybe a simpler approach would be to set a breakpoint in the debugger and examine the MMR data page register for a non-zero value at run time. Either way, if the MMR is non-zero, it would seem that mmap() is absolutely necessary.

But your question still remains, because it would seem onerous to wrap every register access with mmap(). I think that it depends upon the instruction, particularly the addressing mode. There is a difference between src/dst and Smem(or Lmem) addressing modes. If an instruction description mentions src or dst, then mmap() is not necessary because memory is not being accessed. However, some register accesses are technically a MOV to/from memory, where you see Smem in the opcode description, in which case the memory access must be qualified with mmap() when using the large or huge memory model. I am still investigating this question.

Brian Willoughby

Sound Consulting
