Storing C64+ stack pointer in a C-function

Does anyone have a good idea how to store the stack pointer (and maybe some other registers) on entering a C function? The best I have come up with is:

uint32 global_regs[4] = {0, 0, 0, 0};

and in the beginning of the C-function a call to save_regs: 

#define save_regs() \
  asm(" NOP 4  ; wait"); \
  asm(" STW     A4, *B15--[1] ; save a4 on stack"); \
  asm(" MVKL    _global_regs,A4"); \
  asm(" MVKH    _global_regs,A4"); \
  asm(" STW     B3, *A4++[0]    ; save return address"); \
  asm(" STW     B15, *A4++[0]   ; save stack pointer"); \
  asm(" STW     A15, *A4++[0]   ; save frame pointer"); \
  asm(" STW     B14, *A4[0]   ; save data page pointer"); \
  asm(" LDW     *++B15[1],A4  ; restore A4"); \
  asm(" NOP 4  ; wait");

Getting the correct SP value stored (rather than one that includes the extra 32-bit word holding A4 on the stack) seems to take quite a lot of cycles considering the simplicity of the task.

Other good ways to do this would also be appreciated.

The interrupts are running.

 

I'd also like to find some texts about the assembly language that are aimed at programmers. From the instruction set reference it's hard to find suitable instructions with the needed addressing mode options. All kinds of clever tricks and "phrases" are welcome too.

 

BTW, does the

asm(" STW     A4, *B15--[1] ; save a4 on stack"); \

overwrite the top of stack?

It looks like the instruction should be either "STW A4, *--B15[1]" or "STW A4, *B15--[0]" so that the write goes to the first unused memory word. I guess the TI C compiler uses a post-decrementing stack push?

 

  • Why do you want to do this?

    The cleanest way would be to call an assembly function that is written to abide by the rules set forth in the C Compiler User's Guide for interfacing C with assembly functions.

    You will need to study that document even for your asm() macro, to make sure you do the stack emulation correctly and to make sure you abide by the other rules, such as alignment.

  • I see LOTS of issues with what you are doing, i.e. lots of violations of C calling conventions, compiler conventions, stack conventions.  Whatever it is you're trying to do, this is not a good way of doing it.  Please give more details about what you're trying to accomplish and perhaps someone can suggest a better route.

  • What C-calling conventions are violated? The return address is in B3, SP in B15, FP in A15, DPP in B14.

    Stack is post-decrementing and points to the 1st free location. The idea is to store the regs without affecting the rest of the code.

    This is supposed to be in the beginning of a function to find out where it's called from in case an error is detected later on.

    This is because there are tens of processes each of which can call the function from tens of different places.

     

    The processor is C6482, and the compiler is TMS320C6x C/C++ Compiler v6.0.18

     

    The code is supposed to be used like:

     

    int send_a_message(const union SIGNAL **msg, PROCESS  pid)

    {

      OSBOOLEAN is_valid;

      save_regs();

      sysinfo_signal_validate(*msg, &is_valid);

      if (!is_valid)

      {

        dbg_msg("bad message at %p, (%p, %p, %p)\n", global_regs[0], global_regs[1], global_regs[2], global_regs[3]);

       return -1;

      }

      send(msg, pid);

      return 0;

    }

     

  • And how about this?

    uint32 regs[4];

    ...

      asm(" NOP 4  ; wait"); \
      asm(" STW     B15, *--B15[1] ; save SP on stack"); \
      asm(" STW     A5, *--B15[1] ; save a5 on stack"); \
      asm(" LDW     *B15[2],A5    ; SP to A5"); \
      asm(" STW     A4, *--B15[1] ; save a4 on stack"); \
      asm(" MVKL    _regs,A4"); \
      asm(" MVKH    _regs,A4"); \
      asm(" STW     B3, *A4++[0]    ; save return address"); \
      asm(" STW     A5, *A4++[0]   ; save stack pointer"); \
      asm(" STW     A15, *A4++[0]   ; save frame pointer"); \
      asm(" STW     B14, *A4[0]   ; save data page pointer"); \
      asm(" LDW     *B15++[1],A4  ; restore A4"); \
      asm(" LDW     *B15++[1],A5  ; restore A5"); \
      asm(" LDW     *B15++[1],B15  ; restore SP"); \
      asm(" NOP 4  ; wait");

  • turboscrew said:
    What C-calling conventions are violated?

    The basic concept of a high-level language removes you from the knowledge of or access of individual registers, including the stack pointer. So, from within a C program, you are not "supposed" to be interested in where the SP is or how it is used.

    This gets turned around when you start writing assembly routines, since you have to know how a C function implements things like SP, FP, DP, and return address.

    It is understandable that you want to get this information. There are a lot of times that this could be helpful in debug when odd problems start coming in. It does add a lot of overhead to your function, obviously.

    There is a lot of value in making this a call to an inlined assembly function instead of an inline macro. You have several registers that are free for your use, by the compiler's conventions. With inlining a function, you will still have the return address in B3, assuming it is retained there by the compiler anyway - that is not a given when there are other functions to be called later in this function.

    Without coding it up and trying what you have listed for your macro, I prefer the first one with a couple of changes. The save and restore of A4 to the stack need to use an offset value of [2] to maintain 8-byte alignment of the stack. The reason for the post-decrement is that the SP points to the first available location, but this location is not allocated. If you use it without moving the SP and an interrupt occurs, it will get overwritten. So, you have to allocate space and then write to that available location. The post-decrement does this for you all in one instruction, technically in the other order, but all in one instruction. Using the pre-decrement in your second macro example would allocate the space and write the data at that vulnerable spot at the top of the stack, which is not actually allocated within the stack.

    The use of the asm() function will disturb the optimizer since it treats this code as a black box of instructions that it cannot manipulate. If you are using reasonably good optimization levels, -o2 or -o3, everything will get moved around so it is possible to get unexpected placement of your code relative to overhead things like moving B3 to B9 or pre-loading variables. I do not know if the compiler guarantees that the save_regs() macro will be placed in the right order relative to the other function calls, but a call to an assembly function will be.
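To make the post-decrement argument above concrete, here is a minimal host-side C sketch (plain C, not C6000 code). It models the stack as an array of words with SP as the index of the first free slot. The names stack_model_t, push_post_dec, push_pre_dec and simulated_interrupt are made up for illustration, and the "interrupt" is simply assumed to push two words from SP downward and then restore SP.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_WORDS 16u

typedef struct {
    uint32_t mem[STACK_WORDS]; /* word-addressed model of stack memory */
    uint32_t sp;               /* word index of the first free slot */
} stack_model_t;

/* Model of "STW src,*B15--[n]": store at the free slot, then allocate n words. */
void push_post_dec(stack_model_t *s, uint32_t value, uint32_t n_words)
{
    assert(s->sp < STACK_WORDS && s->sp >= n_words);
    s->mem[s->sp] = value;
    s->sp -= n_words;
}

/* Model of "STW src,*--B15[n]": move SP first, then store at the new SP. */
void push_pre_dec(stack_model_t *s, uint32_t value, uint32_t n_words)
{
    assert(s->sp < STACK_WORDS && s->sp >= n_words);
    s->sp -= n_words;
    s->mem[s->sp] = value;
}

/* Hypothetical interrupt: treats SP as the first free slot, scribbles two
 * words from there downward, then restores SP before returning. */
void simulated_interrupt(stack_model_t *s)
{
    uint32_t saved_sp = s->sp;
    push_post_dec(s, 0xDEADBEEFu, 1u);
    push_post_dec(s, 0xDEADBEEFu, 1u);
    s->sp = saved_sp;
}

/* SP is a word index here, so 8-byte alignment means an even index. */
int sp_is_dword_aligned(const stack_model_t *s)
{
    return (s->sp % 2u) == 0u;
}
```

In this model a value pushed with push_post_dec(..., 2) survives the simulated interrupt and leaves SP even, while a value pushed with push_pre_dec(..., 1) sits exactly where the interrupt writes next. The real timing is best confirmed in the simulator, as suggested elsewhere in the thread.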

  • I kinda thought that post decrement uses the value before decrementing in the instruction, like:

    STW     A5, *--B15[1] means that B15 is first decremented by sizeof(word), then the decremented value plus an offset of +1 word is used as the address to store A5 (addr - 1 word + 1 word = addr), and STW     A5, *B15--[1] means that the contents of B15 (plus the offset of +1 word) are first used as the address to store A5, then B15 is decremented by sizeof(word).

    I guess that in the pre-decrement case the space is "allocated" before writing. Can an interrupt occur in the middle of an instruction? Between the store and the auto increment/decrement?

    I think the ASM block cannot be moved by the compiler. If it's the first executable piece of code in the function, I think it stays that way except for the function prologue.

    So if the stack should be moved in dwords (2 x 32-bit words), are there any autoincrements to do that? How does the compiler ensure there are no interrupts between word pushes that could leave the stack improperly aligned, or does the 8-byte stack alignment only apply to function calls?

    This is for debugging a real-time embedded system, and there is no way of using, say, a debugger when the problem takes place. If the system finds the error, it reports the kind of error but gives no hint about who made the last system call. The system call also does not return if an error is encountered.

    Anyway, is there a better/faster/more efficient way to store at least PC and SP? Maybe a way to push two registers at once?

     

     

     

     

  • Ok, thanks for explaining what you're up to.  It makes a little more sense now what you're attempting to do.  I still am very much against trying to accomplish this through asm() statements in your C code.   In general it's far too easy to end up with issues as a result.  If you're dead set on keeping to the assembly path then the "right way" to do it would be to write a wrapper function in assembly that abides by all the C calling conventions.  Your assembly wrapper function would store whatever variables/registers you want and then it would make a function call to the "real" function.  You will need separate names like you are already doing, e.g. send_a_message and send_message, etc.

    You may also want to consider trying to implement this functionality purely in C code (without asm statements).  Generally when doing instrumentation every function is given its own ID and the ID gets pushed to the trace buffer upon entry (and sometimes exit).  You could implement something like this, preferably with #ifdef statements around it so that it can be compiled out of the code entirely if desired.

    If you're not looking for system wide trace capability, i.e. if it's only this "special" function, then perhaps you might just modify the function definition such that the function calling it also passes in its own process/thread ID as part of the call.  That way you would have that information readily available from within the C environment.
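A pure-C version of the trace-buffer idea could look roughly like this. It is only a sketch under stated assumptions: TRACE_ENABLED, TRACE_DEPTH, trace_push, trace_recent and the TRACE macro are hypothetical names, the call-site IDs are assumed to be assigned by hand, and the ring buffer is not protected against interrupts or task switches.

```c
#include <assert.h>
#include <stdint.h>

#define TRACE_ENABLED            /* remove this line to compile tracing out */
#define TRACE_DEPTH 8u           /* number of remembered call sites */

#ifdef TRACE_ENABLED
static uint32_t trace_buf[TRACE_DEPTH];
static uint32_t trace_count;     /* total number of entries ever pushed */

/* Record the ID of a call site in a small ring buffer. */
void trace_push(uint32_t call_site_id)
{
    trace_buf[trace_count % TRACE_DEPTH] = call_site_id;
    trace_count++;
}

/* ID recorded 'age' pushes ago; age 0 is the most recent entry. */
uint32_t trace_recent(uint32_t age)
{
    assert(age < trace_count && age < TRACE_DEPTH);
    return trace_buf[(trace_count - 1u - age) % TRACE_DEPTH];
}

#define TRACE(id) trace_push(id)
#else
#define TRACE(id) ((void)0)
#endif
```

Each caller would put something like TRACE(42u) just before the system call, and the error path would print trace_recent(0). Whether hand-assigned IDs are practical for a code base of this size is a separate question.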

  • First, let me be rude.

    turboscrew said:
    I kinda thought that post decrement ...

    Our documentation is not the perfectly indexed and organized book that you and I would both like, but the actual answers are in there, most of the time. Your guess that *--B15[1] does the -- before the [1] is one of many intelligent guesses you might make. But our Instruction Set Architecture team wanted more flexibility than just 1 word greater range. If you lookup the STW instruction description in the CPU & Instruction Set Reference Guide, the Description section describes exactly how the offsets are applied and when.

    You seem to be confusing post-decrement *B15[1]-- and pre-decrement *--B15[1]. With B15 (SP) you never want to use an offset that will leave you mis-aligned and you never want to use pre-decrement because your data will be left in an unallocated spot. Try the simulator to see how and when the registers get updated and where the data goes. That is a lot better than guessing.

    Now I will go back to being polite.

    turboscrew said:
    I guess that in pre-decrement case the space is "allocated" before writing. Can interrupt occur in the middle of instruction? Between the store and auto increment/decrement?

    No, an interrupt will not occur in the middle of an instruction.

    I agree that it is hard to get a clear understanding of the assembly operation. And it is easy to see why you interpret it the way you do. But the way the stack is implemented, "pre-decrement" will move the target address (allocating space) and then write the data at that newly calculated address (just past the end of the new top-of-stack). As I said in my previous post, you want to use post-decrement because it will write the data to an available spot (old empty top-of-stack) and then move the SP to allocate more space which will include where the data was written. Since an interrupt cannot occur in the middle, you will not lose any data.

    By trying to convince you to stay away from assembly altogether, Brad is being more practical than I am. If you stay with assembly, you will have to try the simulator to see how it works. That is how many of us go about figuring out the intricacies of this architecture.

    turboscrew said:
    I think the ASM block cannot be moved by the compiler. If it's the first executable piece of code in the function, I think it stays that way except for the function prologue.

    Since you and I did not write the compiler, our opinions will not change how it actually works. Try it and see, and then understand that with different code it may behave differently, function calls, local variables, far accesses for code or data, etc. I would say that 75% of the time you would be right. But part of the prologue could be saving B3 and starting to use it to initialize local variables, just as an example. Maybe it would be good to look through the available intrinsics in the C Compiler User's Guide to see if there is one that will save some of these registers for you - I do not think there is one, but intrinsics are much easier for the compiler to work around when optimizing.

    turboscrew said:
    So if the stack should be moved in dwords (2 x 32-bit words), are there any autoincrements to do that?

    RandyP said:
    The save and restore of A4 to the stack need to use an offset value of [2] to maintain 8-byte alignment of the stack.

        STW A4, *B15[2]--

    This puts A4 at the empty top-of-stack location and then decrements B15 by 2 words' space (8 bytes) to maintain alignment.

    turboscrew said:
    How does the compiler ensure there are no interrupts between word pushes that could leave the stack improperly aligned, or does the 8-byte stack alignment only apply to function calls?

    This is an excellent question to ask, and one that everyone who decides to write assembly code should ask. How does the compiler do it?

    Turn on the -k compiler switch to keep the generated assembly output file. Then look at how SP/B15 is used. Try this for a function that has local variables and for one that does not.

    Also, look at the push and pop.asm functions in the rtssrc.zip file in the C6000\lib folder. Write a function with the "interrupt" keyword and save the assembly output so you can see how interrupts handle the stack and saving/restoring registers.

  • If you have not already looked at the compiler documentation, this is a "must read" for what you're doing:

    http://www.ti.com/lit/spru187

    Chapter 7.3 - 7.5 are very important for you.  In particular see 7.4.2 "How a called function responds".  You'll see that the compiler avoids the scenario where SP is not double-word aligned by simply decrementing SP at the start to allocate a "frame" on the stack.  It then uses pre-increments to "push" data into those spots.  In this way SP is never in a situation where it's not double-word aligned.

    These are the fine details I'd prefer to avoid!
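The frame-allocation scheme described above can be sketched on a host in plain C. This is an illustration only: frame_size_aligned, allocate_frame and frame_slot_address are hypothetical helpers, addresses are plain 32-bit numbers, and the slot numbering assumes SP points at the first free word, so that slots 1..N of an N-word frame are the allocated ones.

```c
#include <assert.h>
#include <stdint.h>

/* Round a frame size in bytes up to the next multiple of 8. */
uint32_t frame_size_aligned(uint32_t bytes)
{
    return (bytes + 7u) & ~7u;
}

/* Model of the prologue: a single subtraction allocates the whole frame,
 * so SP moves from one 8-byte aligned value straight to another. */
uint32_t allocate_frame(uint32_t sp, uint32_t bytes)
{
    assert((sp & 7u) == 0u);   /* the caller's SP is assumed to be aligned */
    return sp - frame_size_aligned(bytes);
}

/* Byte address of word slot n relative to SP, as in "STW src,*+B15[n]". */
uint32_t frame_slot_address(uint32_t sp, uint32_t n)
{
    return sp + (n << 2);
}
```

Because SP changes only once, an interrupt arriving between any two stores still sees an aligned SP with all frame slots above it.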

  • Well, according to spru732h, *B15[1]-- doesn't exist. Instead *B15--[1] does. (I've also been reading spru187N.)

    It says for STW that:

    The memory address is formed from a base address register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit unsigned constant (ucst5).

    offsetR/ucst5 is scaled by a left-shift of 2 bits. After scaling, offsetR/ucst5 is added to or subtracted from baseR. For the preincrement, predecrement, positive offset, and negative offset address generator options, the result of the calculation is the address to be accessed in memory. For postincrement or postdecrement addressing, the value of baseR before the addition or subtraction is sent to memory.

    Does the offset also affect the amount of increment/decrement?

    Is the offset in decrements considered negative, so Rxx--[4] is not Rxx + 4, Rxx = Rxx - sizeof(...), but Rxx - 4, Rxx = Rxx - 5*sizeof(...) or...?

    I understand that pre-increment means here, like in connection with most other processors, that the incrementation is done before use.

    That's why it's hard to figure out why POST-decrement is safer to use in context of stacks. And if interrupt cannot happen in the middle of an instruction, I don't see what's the safety difference between pre- and post-decrement here.

    Also, little more than 50 000 lines of code is a bit too much for temporarily adding instrumentation code everywhere. I'd rather add the piece of assembly in the beginning of about 10 system call wrapper functions. There is also a (stupid) reason not to use actual assembly, and I can do nothing about it.

    (Sorry about constant editing, but the Quote seems to cause posting to "automagically" move parts of the text randomly around.) [RandyP - html edit to attempt repair]
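One way to read the quoted STW description is the small C model below. It is a sketch of my reading rather than an authoritative statement of the hardware: it assumes that in the pre/post increment and decrement modes the base register is written back with the scaled offset added or subtracted, which matches the replies in this thread but should be verified in the simulator. The names addr_mode_t and word_access_address are invented for the example.

```c
#include <assert.h>
#include <stdint.h>

/* Addressing modes discussed in the thread, for word (STW/LDW) accesses. */
typedef enum {
    MODE_POS_OFFSET,   /* *+R[n]  : address = R + n*4, R unchanged      */
    MODE_NEG_OFFSET,   /* *-R[n]  : address = R - n*4, R unchanged      */
    MODE_PRE_INC,      /* *++R[n] : R += n*4, address = new R           */
    MODE_PRE_DEC,      /* *--R[n] : R -= n*4, address = new R           */
    MODE_POST_INC,     /* *R++[n] : address = old R, then R += n*4      */
    MODE_POST_DEC      /* *R--[n] : address = old R, then R -= n*4      */
} addr_mode_t;

/* Returns the byte address accessed and updates *base where the mode
 * modifies the base register. ucst5 is the 5-bit unsigned constant. */
uint32_t word_access_address(uint32_t *base, addr_mode_t mode, uint32_t ucst5)
{
    assert(ucst5 < 32u);
    uint32_t scaled = ucst5 << 2;  /* scaled by a left shift of 2 bits */
    uint32_t old = *base;

    switch (mode) {
    case MODE_POS_OFFSET: return old + scaled;
    case MODE_NEG_OFFSET: return old - scaled;
    case MODE_PRE_INC:    *base = old + scaled; return *base;
    case MODE_PRE_DEC:    *base = old - scaled; return *base;
    case MODE_POST_INC:   *base = old + scaled; return old;
    case MODE_POST_DEC:   *base = old - scaled; return old;
    }
    return old;  /* not reached for valid modes */
}
```

Under this model the bracketed constant is not a separate displacement in the auto-modify modes: it is the amount by which the register moves. For example, *B15--[2] accesses the old B15 and then lowers B15 by 8 bytes, and *A4++[0] leaves A4 unchanged.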

  • Also I didn't quite understand this (STW with B14/B15 with more than 5 bit offset):

    STW (.unit) src, *+B14/B15(60) represents an offset of 12 bytes

     

  • So what you are saying is that I should do something like:

    SUBAW B15, 4, B15   ; Add four 32-bit slots to stack

    STW A4, *B15[4]

    STW A5, *B15[3]

    ...

    ADDAW B15, 4, B15   ; Restore stack

     

  • turboscrew said:

    So what you are saying: I should do something like:

    SUBAW B15, 4, B15   ; Add four 32-bit slots to stack

    STW A4, *B15[4]

    STW A5, *B15[3]

    ...

    ADDAW B15, 4, B15   ; Restore stack

    You've got the idea but your notation is not quite right.  You should be doing something like

    STW A4, *+B15[4]

    Here's a snippet from the CPU Guide:

     

     

     

  • turboscrew said:
    That's why it's hard to figure out why POST-decrement is safer to use in context of stacks. And if interrupt cannot happen in the middle of an instruction, I don't see what's the safety difference between pre- and post-decrement here.

    These questions are determined by the convention used for the stack.  The direction of the stack determines whether to use an increment or decrement.  If the stack is defined to always point to the next FREE location you generally use a POST-inc/decrement.  If the stack is defined to point to the last OCCUPIED location then you use a PRE-inc/decrement.  A further consideration on this device is that the stack must always be double-word aligned.

    turboscrew said:
    There is also a (stupid) reason not to use actual assembly, and I can do nothing about it.

    When you use an asm() statement the compiler blindly chucks it into the generated assembly.  It does not comprehend what you've done.  In other words, it will NOT know that you have just changed the position of the stack pointer!  If the compiler generates any code that accesses something on the stack (e.g. using a constant offset like in your example in the previous post) then you are going to end up with broken code.  This is the sort of nasty behavior I'm trying to avoid.  Personally I do not think what you're doing is safe.  If you cannot write an assembly wrapper then I think you need to do something else.

  • Brad Griffis said:

    You've got the idea but your notation is not quite right.  You should be doing something like

    STW A4, *+B15[4]

      

     

    Oops, yes.

    Then there is the problem of  "getting a snapshot of the SP".

     

  • Well, the compiler manual says:

    "(that is, the stack pointer points to the next free location...)"
    and
    "Since the stack grows toward smaller addresses..."

    So maybe STDW B14:B15 and STDW  A4:A5

    But how do I handle the second operand? It looks like it has no autoincrement/decrement.

    Maybe:

    SUBAW B15, 4, B15   ; Add four 32-bit slots to stack

    STW  B15, *+B15[3]

    STDW A4:A5, *+B15[1]

    Then there is one unused 32-bit slot in the stack, but the stack is 64-bit aligned.

    Also, the idea is that C-code should not be aware that the ASM-block exists. That's why the stack needs to be restored before exiting the ASM-block and entering the C-code.

     

  • turboscrew said:

    But how do I handle the second operand? It looks like it has no autoincrement/decrement.

    Maybe:

    SUBAW B15, 4, B15   ; Add four 32-bit slots to stack

    STW  B15, *+B15[3]

    STDW A4:A5, *+B15[1]

    Then there is one unused 32-bit slot in the stack, but the stack is 64-bit aligned.

    I think that should do it.

    turboscrew said:
    Also, the idea is that C-code should not be aware that the ASM-block exists. That's why the stack needs to be restored before exiting the ASM-block and entering the C-code.

    Forgot about that -- ok, maybe this thing has a prayer of working!  :)

  • This has gone way way off-track.

    We want to discourage you from using asm() statements because of optimizer effects. An assembly wrapper would be the cleanest way to accomplish your requirement for information in the global variables, but you have a corporate policy that prohibits asm files but allows complex asm() instructions.

    You will not learn the C6000 assembly details by having us explain one instruction at a time. The way to learn it is to look at the compiler-generated assembly code, write your own test code, and run the simulator with single-steps while watching memory and registers to see when they change and how they change.

    From re-reading all of this thread, I do not see why you are considering the SUBAW B15,4,B15 instruction other than it keeps the stack aligned to 8 bytes. That is good, but you do not need to be putting the SP on the stack just to get a value to save, right?

    Please go back to your original posting and change the two [1]'s to [2]. This will keep the stack aligned, you will be safe for any interrupts that happen between your asm instructions, and your globals will be filled. When you look at the SP value, just remember that it is off by -8 from what the SP is in the real function.

    If you do not understand why the alignment works and the data goes to the right place, the best way to understand it will be to fire up the simulator and watch.
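On the C side, the saved array can be decoded with the SP correction applied in one place. The sketch below assumes the layout used by the macros in this thread (return address, SP, frame pointer, data page pointer, in that order) and that all four values were actually stored; saved_regs_t and decode_saved_regs are hypothetical names, and sp_bias_bytes would be 8 when B15 is captured after A4 has been pushed with an offset of [2].

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t return_addr;    /* B3  */
    uint32_t sp;             /* B15 as seen by the C function */
    uint32_t frame_ptr;      /* A15 */
    uint32_t data_page_ptr;  /* B14 */
} saved_regs_t;

/* Turn the raw 4-word array into named fields, adding back the bytes
 * by which the macro had lowered SP when it captured B15. */
saved_regs_t decode_saved_regs(const uint32_t regs[4], uint32_t sp_bias_bytes)
{
    assert((sp_bias_bytes & 7u) == 0u);  /* bias is expected to keep alignment */

    saved_regs_t out;
    out.return_addr   = regs[0];
    out.sp            = regs[1] + sp_bias_bytes;
    out.frame_ptr     = regs[2];
    out.data_page_ptr = regs[3];
    return out;
}
```

The debug print can then report out.sp directly instead of asking the reader to remember the 8-byte difference.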

  • RandyP said:

    This has gone way way off-track.

    We want to discourage you from using asm() statements because of optimizer effects. An assembly wrapper would be the cleanest way to accomplish your requirement for information in the global variables, but you have a corporate policy that prohibits asm files but allows complex asm() instructions.

    The world is not perfect, not even fair. I'm just trying to do my job nonetheless. By adding an .asm-file, I would be the one to be executed :-).

    RandyP said:

    You will not learn the C6000 assembly details by having us explain one instruction at a time. The way to learn it is to look at the compiler-generated assembly code, write your own test code, and run the simulator with single-steps while watching memory and registers to see when they change and how they change.

    I'm not trying to. I was just wondering whether the offset affects the amount of incrementation/decrementation. The manual doesn't say so. It suggests that the offset is just an offset, totally independent of the incrementation/decrementation, like in most other processors. At least it looks to me as if you suggested that using STW *--B15[2] causes decrementation by 2 words instead of one, whereas the manual says that the target address is the decremented B15 + 2 word sizes.

    RandyP said:

    From re-reading all of this thread, I do not see why you are considering the SUBAW B15,4,B15 instruction other than it keeps the stack aligned to 8 bytes. That is good, but you do not need to be putting the SP on the stack just to get a value to save, right?

    No, but I'd like to get (at least) the SP and return address stored somewhere as fast as possible. We don't have too many loose MIPS, and storing/restoring registers are memory accesses and tend to take time.

    RandyP said:

    Please go back to your original posting and change the two [1]'s to [2]. This will keep the stack aligned, you will be safe for any interrupts that happen between your asm instructions, and your globals will be filled. When you look at the SP value, just remember that it is off by -8 from what the SP is in the real function.

    If you do not understand why the alignment works and the data goes to the right place, the best way to understand it will be to fire up the simulator and watch.

    I'm just worried about what happens if there is an interrupt (and maybe even a context switch) between the first and second STW.

    [edit]

    Yep, Yep. Context switches switch the stacks too - not to worry about them.

    [/edit]

     

     

  • turboscrew said:
    I'm just worried about what happens if there is an interrupt (and maybe even a context switch) between the first and second STW.

    As long as the stack is 8-byte aligned there's no problem introduced by an interrupt occurring between instructions.

    At this point I think you have enough info to try this out.  Please give it a whirl and see how it works.  You can post your latest code for review too.

  • Thanks Brad Griffis and RandyP, for all the help.

    BTW, the code is needed to figure out about a customer problem, so there is not too much time for trying out different things.

    Anyway, I'm going for the 4 word "stack pre-allocation" version and storing SP via register instead of stack.

    It sounds the safest way.

     

  • How about this:

      asm(" NOP     4  ; wait"); \
      asm(" SUBAW   B15, 2, B15  ; Add two 32-bit slots to stack"); \
      asm(" STDW    A4:A5, *+B15[2] ; Store A4 and A5 into the stack slot"); \
      asm(" ADDAW   B15, 2, A5  ; Original SP to A5"); \
      asm(" MVKL    _global_regs,A4; Storage address to A4"); \
      asm(" MVKH    _global_regs,A4"); \
      asm(" STW     B3, *A4++[0]    ; save return address"); \
      asm(" STW     A5, *A4++[0]   ; save stack pointer"); \
      asm(" STW     A15, *A4++[0]   ; save frame pointer"); \
      asm(" STW     B14, *A4[0]   ; save data page pointer"); \
      asm(" LDDW    *+B15[2], A4:A5 ; Restore A4 and A5 from the stack slot"); \
      asm(" ADDAW   B15, 2, B15  ; Remove the two 32-bit slots from stack"); \
      asm(" NOP 4  ; wait");

     

    Do you think that would work?

    How about condition flags? Where can I find the explanations about the effects of instructions on flags?

     

     

  • No, the code above will not work. It overwrites the original stack contents and ends up saving only DP/B14 into the global_regs array.

    You can find out about condition flags (very insightful question) in the C64x+ CPU & Instruction Set Reference Guide. Each instruction's description page itemizes the condition flags that will be affected.

    Since you mentioned that MIPS are a valuable resource, here are some optimization hints.

    I assume that the NOP 4 at the top is a safety mechanism. Since stack modifications happen immediately, B15 does not need this protection, and early function setup should not be doing memory reads, just writes to the stack and setting up registers with constants. Not a guarantee, but high enough probability that it would make sense to look at the function code above your asm insertion to see. 

    Same assumption for the NOP 4 at the bottom. It is needed to protect any possible use of the destination reg(s). But the beauty of the C6000 unprotected pipeline is that you can move code around to get rid of lots of these NOPs.

    The following is the best I could come up with in terms of cycle count. I switched the address register to B4 to avoid cross-path conflicts with an eye on optimization. A free byproduct of moving the LDW earlier is that you get the "unmodified" SP placed into the global_regs array without having to do math separately.

      asm("   STW     B4, *B15--[2]    ; Store B4 onto new stack space"); \
      asm("|| MVKL    _global_regs,B4"); \
      asm("   MVKH    _global_regs,B4"); \
      asm("|| LDW     *++B15[2],B4     ; restore B4"); \
      asm("   STW     B3,  *B4++[1]    ; save return address"); \
      asm("   STW     B15, *B4++[1]    ; save modified stack pointer"); \
      asm("   STW     A15, *B4++[1]    ; save frame pointer"); \
      asm("   STW     B14, *+B4[0]     ; save data page pointer");

    This takes 6 cycles (disregarding cache and memory effects) instead of 19 for the SUBAW example above, or 10 cycles if you find a need to put the top NOP 4 back in.

    If you want to learn how the pipeline and the instructions work, use the Cycle Accurate simulator and the C64x+ CPU & Instruction Set Reference Guide.

    This will be good code to have lying around. I will remember it for some tough debug applications. Thanks for sharing your questions on the E2E forum.
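The remark that the SUBAW version ends up saving only DP/B14 follows if a post-increment with an offset of [0] leaves the pointer where it was, which is how I read the addressing-mode description. The host-side C sketch below models four consecutive "STW reg,*ptr++[step]" stores into a four-word array; store_four is a made-up helper, not part of any TI tool or library.

```c
#include <assert.h>
#include <stdint.h>

/* Model of four consecutive "STW reg,*ptr++[step]" stores into a 4-word
 * array. 'step' is the bracketed offset, counted in words. */
void store_four(uint32_t dest[4], const uint32_t regs[4], uint32_t step)
{
    uint32_t idx = 0u;
    for (uint32_t i = 0u; i < 4u; i++) {
        assert(idx < 4u);      /* stay inside the array */
        dest[idx] = regs[i];   /* store at the current pointer */
        idx += step;           /* post-increment by the scaled offset */
    }
}
```

With step 0 every store lands in element 0 and the last one (B14) wins. With step 1, as in the *B4++[1] version above, the four values end up in consecutive elements.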

  • RandyP said:

    No, the code above will not work. It overwrites the original stack contents and ends up saving only DP/B14 into the global_regs array.

    Hallelujah, a truckload of thanks.

    First, it's really good code for the task; second, I think figuring this out properly is a deeper learning experience in many regards.

     

  • turboscrew said:

    One more thing: if I understood it right, it should be interrupt-safe due to the 5-slot LDW. Did I understand it right?

     

  • turboscrew said:

    One more thing: if I understood it right, it should be interrupt-safe due to the 5-slot LDW. Did I understand it right?

    The code does nothing to disable interrupts.  Actually, now that I look at it again I don't think it's interrupt safe.  Specifically I'm concerned about an interrupt occurring after this instruction:

     

      asm("|| LDW     *++B15[2],B4     ; restore B4"); \

    If an interrupt occurs after that instruction then the load will complete (i.e. B4 will be restored) and when the code returns from the interrupt you will be storing data to the wrong address.  I think it needs to be tweaked.

      asm("   STW     B4, *B15--[2]    ; Store B4 onto new stack space"); \
      asm("|| MVKL    _global_regs,B4"); \
      asm("   MVKH    _global_regs,B4"); \
      asm("   STW     B3,  *B4++[1]    ; save return address"); \
      asm("   STW     B15, *B4++[1]    ; save modified stack pointer"); \
      asm("   STW     A15, *B4++[1]    ; save frame pointer"); \
      asm("   STW     B14, *+B4[0]     ; save data page pointer"); \
      asm("|| LDW     *++B15[2],B4     ; restore B4"); \
      asm("   NOP 5");
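
    The hazard in the earlier sequence (the one with the LDW issued next to MVKH) can be replayed on a host PC with a toy model. This is only a sketch: it assumes that an LDW result lands after 4 delay slots and that taking an interrupt lets an in-flight load complete before the interrupted code resumes, as described above. The function name and the packet numbering (0 = STW || MVKL, 1 = MVKH || LDW, 2..5 = the four STWs) are invented for the illustration, and nothing here is cycle-accurate.

```c
#include <assert.h>
#include <stdint.h>

#define LDW_DELAY_SLOTS 4   /* assumed: LDW result arrives after 4 delay slots */
#define NO_IRQ (-1)

/* Returns how many of the four STWs use the intended global_regs address.
   irq_after is the packet index after which an interrupt is taken, or NO_IRQ. */
int count_good_stores(uint32_t array_addr, uint32_t old_b4, int irq_after)
{
    uint32_t b4 = old_b4;
    int load_countdown = -1;   /* -1 means no load in flight */
    int good = 0;
    int packet;

    assert(irq_after >= NO_IRQ && irq_after < 6);

    for (packet = 0; packet < 6; ++packet) {
        /* A load that has used up its delay slots writes B4 now. */
        if (load_countdown == 0) {
            b4 = old_b4;
            load_countdown = -1;
        }

        if (packet == 1) {
            b4 = array_addr;                       /* MVKH completes the address */
            load_countdown = LDW_DELAY_SLOTS + 1;  /* LDW of the old B4 is issued */
        } else if (packet >= 2) {
            /* Each STW should hit the next word of the array. */
            if (b4 == array_addr + 4u * (uint32_t)(packet - 2)) {
                good++;
            }
            b4 += 4u;                              /* model of the [1] post-increment */
        }

        if (load_countdown > 0) {
            load_countdown--;
        }

        /* Interrupt taken here: the pending load completes before we resume. */
        if (packet == irq_after && load_countdown >= 0) {
            b4 = old_b4;
            load_countdown = -1;
        }
    }
    return good;
}
```

    In this model, with no interrupt all four stores land in the array, while an interrupt taken right after the LDW packet sends every store through the restored (old) B4 instead.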

     


  • turboscrew said:

    One more thing: if I understood it right, it should be interrupt-safe due to the 5-slot LDW. Did I understand it right?

    Good catch, turboscrew. I should have thought about that, because it is a risk anytime you try to do tricky things that count on the pipeline delays.

    Brad's fix will work, but it gives up the unmodified stack pointer being saved. Brad may need to explain the NOP 5 instead of NOP 4.

    Another solution is to use a B branch instruction to protect the code from any interrupts. When a B is executed, the 5 delay slots after it will also be executed without being interrupted. The only downside is that interrupts will be put off for 5 cycles. But by executing the B in parallel with the first execution packet, it will not add any cycle count to the total execution time.

    Unfortunately, the B needs a discrete target address. We cannot use the PC-relative $+n symbol because of the mixture of 16- and 32-bit instructions. And if you want to use this save_regs functionality more than once in a file, the target label has to be unique in each expansion so that the multiple calls to save_regs do not collide. The assembly macro is the only way I can think of to do this. It allows you to define a label followed by "?", and the ? will be replaced by an incrementing number with a $ before and after it.

    #define save_regs() \
      asm("save_regs_m    .macro    TargetArray"); \
      asm("   STW     B4, *B15--[2]    ; Store B4 onto new stack space"); \
      asm("|| MVKL    TargetArray,B4") ; \
      asm("|| B       EOM?             ; use Branch pipeline to protect LDW from interrupts"); \
      asm("   MVKH    TargetArray,B4") ; \
      asm("|| LDW     *++B15[2],B4     ; restore B4"); \
      asm("   STW     B3,  *B4++[1]    ; save return address"); \
      asm("   STW     B15, *B4++[1]    ; save modified stack pointer"); \
      asm("   STW     A15, *B4++[1]    ; save frame pointer"); \
      asm("   STW     B14, *+B4[0]     ; save data page pointer"); \
      asm("EOM?:                       ; target branch address at end of macro"); \
      asm("   .endm     ; end of the macro definition"); \
      asm("   save_regs_m _global_regs ; the macro call");

    This macro format also adds the flexibility of saving the registers in different memory arrays, if you wanted to do that.

    Unfortunately, it will also generate a compiler warning since it embeds a .macro / .endm construct inside a function.

    There could be a risk that the code immediately after the macro insertion could violate Branch-target rules. The assembler or linker would report this as an error, and I think there is a very good chance that the assembler or linker will take care of it automatically, but it could disturb some of the optimizations that the compiler tries to do.

  • RandyP said:

    Another solution is to use a B branch instruction to protect the code from any interrupts. When a B is executed, the 5 delay slots after it will also be executed without being interrupted. The only downside is that interrupts will be put off for 5 cycles. But by executing the B in parallel with the first execution packet, it will not add any cycle count to the total execution time.

    This seems to crash the stack.

     

     

    Are you sure about this

      STW     B4, *B15--[2]

    The manual does not actually say how this works. It talks about offset and post-decrementing, but it doesn't say how the offset is actually used. In the examples it looks like Rn--[2] autodecrements by two words after storing with (negative?) offset of 2 words.

    At first it looks like:

      B4 is stored to address (B15 + 2*sizeof(word))

      then the B15 is decremented by sizeof(word)

    Even if the offset is taken to go to the same direction as the incrementation/decrementation

    it looks like offset was 2 words but incrementation/decrementation only 1 word.

      STW  B4, *--B15[2]

    seems to first decrement B15 by sizeof(word) then store B4 to decremented B15 + 2*sizeof(word) => B15 + sizeof(word).

     

    Oh yes - the processor is used in big endian mode only.

     

    I tend to understand it like this:

    X,Y = used, 0= not used

    Originally


    (hi mem)
    X
    X
    X
    X
    0
    0
    0
    0  <= SP

    Pushing 0xYYYY = write 0xYYYY at SP, then decrement SP (STW ?, *B15--)

    (hi mem)
    X
    X
    X
    X
    Y
    Y
    Y
    Y
    0
    0
    0
    0  <= SP

  • turboscrew said:

    This seems to crash the stack.

    I took a closer look at Randy's code but don't see anything wrong with it.  Can you please copy/paste the disassembly of your function so we can see what it looks like after it has gone through the compiler, assembler, and linker?  Please include the stuff "around" the asm statements as we would want to see everything starting with your C function up through a little bit after the macro.

    Does it crash every time or only if an interrupt occurs?  If every time please step through the code to determine where things go wrong.

    turboscrew said:

    Are you sure about this

      STW     B4, *B15--[2]

    The manual does not actually say how this works. It talks about offset and post-decrementing, but it doesn't say how the offset is actually used. In the examples it looks like Rn--[2] autodecrements by two words after storing with (negative?) offset of 2 words.

    We use the standard meaning of "post decrement".  In the case above the contents of B4 would be written to the address pointed to by B15.  AFTER the write, the address held in B15 would be decremented by 2*sizeof(word) = 8 bytes.

    turboscrew said:

    I tend to understand it like this:

    X,Y = used, 0= not used

    Originally


    (hi mem)
    X
    X
    X
    X
    0
    0
    0
    0  <= SP

    Pushing 0xYYYY = write 0xYYYY at SP, then decrement SP (STW ?, *B15--)

    (hi mem)
    X
    X
    X
    X
    Y
    Y
    Y
    Y
    0
    0
    0
    0  <= SP

    What you show is what you would observe with STW ?, *B15--.  However, this is not allowed as the stack will no longer be aligned on a 2-word boundary.  That's why Randy's code is correct.
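
    The alignment point can be checked with simple arithmetic. A minimal sketch, assuming 4-byte words and the 2-word (8-byte) stack alignment rule mentioned above; the helper names and the example SP value are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* SP value left behind by STW reg, *B15--[words]. */
uint32_t sp_after_push(uint32_t sp, unsigned words)
{
    assert(words < 32u);        /* the constant is a 5-bit unsigned field */
    return sp - 4u * words;     /* the count is scaled to 4-byte words */
}

/* Non-zero if addr sits on a 2-word (8-byte) boundary. */
int is_dword_aligned(uint32_t addr)
{
    return (addr & 7u) == 0u;
}
```

    Starting from an aligned SP such as 0x00801000, a one-word post-decrement leaves 0x00800FFC, which is not on an 8-byte boundary, while the two-word form leaves 0x00800FF8, which is.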

  • How does the offset value affect the state B15 is left in?

    I understand that STW B4, *B15-- and STW B4, *B15--[2] are equal except that in the latter the contents of B4 get stored at an address 8 bytes higher. B15 is left in the same state (the old value - 4).

     

  • turboscrew said:

    I understand that STW B4, *B15-- and STW B4, *B15--[2] are equal except that in the latter the contents of B4 get stored at an address 8 bytes higher. B15 is left in the same state (the old value - 4).

    No, it's the other way around.  Both of your examples would ultimately end up with B4 stored in the same spot in memory (i.e. the original address of B15).  The value of B15 AFTER the store will be different (first example post-decrements by 1 word, second example post-decrements by 2 words).

  • Turboscrew,

    In your reply on Tue, Feb 23 2010 3:51 AM (Central Std Time), 11th post in this thread, you quoted the CPU & Inst Set Ref Guide. It is duplicated here for your convenience.

    SPRU187 said:

    The memory address is formed from a base address register (baseR) and an optional offset that is either a register (offsetR) or a 5-bit unsigned constant (ucst5).

    offsetR/ucst5 is scaled by a left-shift of 2 bits. After scaling, offsetR/ucst5 is added to or subtracted from baseR. For the preincrement, predecrement, positive offset, and negative offset address generator options, the result of the calculation is the address to be accessed in memory. For postincrement or postdecrement addressing, the value of baseR before the addition or subtraction is sent to memory.

    This tells you how the target memory address is formed and it tells you how baseR is modified.
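
    The quoted rules can be written out as a small host-side model for word accesses with a ucst5 constant. This is a sketch, not TI code: the type and function names are invented, and the rule for which modes write the result back to baseR (pre- and post-modify do, plain positive/negative offsets do not) is taken from the addressing-mode table in the reference guide as I read it.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    NEG_OFFSET,  /* *-R[ucst5]  */
    POS_OFFSET,  /* *+R[ucst5]  */
    PRE_DEC,     /* *--R[ucst5] */
    PRE_INC,     /* *++R[ucst5] */
    POST_DEC,    /* *R--[ucst5] */
    POST_INC     /* *R++[ucst5] */
} addr_mode;

typedef struct {
    uint32_t access_addr;  /* address sent to memory */
    uint32_t base_after;   /* value left in baseR    */
} addr_result;

/* Model of LDW/STW address generation with a 5-bit unsigned constant. */
addr_result gen_word_address(uint32_t base, unsigned ucst5, addr_mode mode)
{
    addr_result r = { base, base };
    uint32_t scaled;

    assert(ucst5 < 32u);
    scaled = (uint32_t)ucst5 << 2;   /* scaled by a left shift of 2 bits */

    switch (mode) {
    case NEG_OFFSET: r.access_addr = base - scaled; break;
    case POS_OFFSET: r.access_addr = base + scaled; break;
    case PRE_DEC:    r.access_addr = base - scaled; r.base_after = r.access_addr; break;
    case PRE_INC:    r.access_addr = base + scaled; r.base_after = r.access_addr; break;
    case POST_DEC:   r.base_after = base - scaled; break;  /* access uses old base */
    case POST_INC:   r.base_after = base + scaled; break;  /* access uses old base */
    }
    return r;
}
```

    With base 0x1000, STW B4, *B15--[2] maps to POST_DEC with ucst5 = 2: the store goes to 0x1000 and B15 ends up at 0x0FF8. The matching LDW *++B15[2],B4 maps to PRE_INC from 0x0FF8: it reads 0x1000 and leaves B15 at 0x1000 again.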

    Both of the example snippets that I offered do work. The second is protected from interrupts by using the Branch instruction and the macro methodology allows the same code to be used in multiple places in your program.

    The stack is not corrupted.

    Please load it and run it in the simulator if you want to see it working.

  • RandyP said:

    Both of the example snippets that I offered do work. The second is protected from interrupts by using the Branch instruction and the macro methodology allows the same code to be used in multiple places in your program.

    The stack is not corrupted.

    Please load it and run it in the simulator if you want to see it working.

    "the result of the calculation is the address to be accessed in memory" so this is the storage address = baseR +/- offset (pre-increment/-decrement,...)

    This calculated address is then not stored to baseR?

    I didn't see anything about offsets in the post-increment/-decrement case. It would just be odd if it wasn't symmetrical with pre-increment/-decrement:

    Offset doesn't affect the baseR value - just the address to be accessed.

    I'll try to post that piece of assembly with its "environment" here.

     

     

     

     

  • turboscrew said:

    Aha, getting the asm-listing doesn't work in CCS3.3. It shows the mixed C/ASM, but doesn't allow selection and copying.

     

    Now I guess I understand something (I watched the asm-code run with EVM-card and BlackHawk JTAG = real HW).

    The code seems to work fine (at least when single-stepping). I guess that the problem is in the  "EOM?".

    Too bad we're required to get rid of compiler warnings.

     

    In (at least) post-decrement the "offset" is not an OFFSET at all but the amount of incrementation/decrementation, and it doesn't seem to affect the STORE at all,

    only the baseR change. So STW B4, *B15--[2] does:

      first the contents of B4 are stored at the address held in B15 (not B15 + offset).

      then B15 is decremented by 2 * sizeof(word).

    So the autoincrement and decrement instructions (pre-/post-) are not semantically symmetrical?

    Negative and positive offsets are different from post-increment/-decrement, but similar to pre-increment/-decrement?

     

    The pipeline stalling only takes place if register is "marked as being written to" and the next instruction wants to use the value?

     

    [edit]

    I guess my way to understanding the processor is not too good - I should start learning the "microcode architecture".

     

    By the way, what then does STW B4, *+B15[2]?

    And is STW B4, *B15 == STW B4, *+B15[0]  = *B15++[0] ?

    [/edit]

    [edit2]

    Darn! Firefox seems to do weird things to the text.

    [/edit2]

    [edit3rp] I don't know if it is FF or just the forum, but there was a lot of html/xml clutter in there trying to support the clipboard paste. [/edit3rp]

  • Turboscrew,

    When you run the code on your hardware does it work properly or not?  I can't understand if there's an actual issue you're debugging or if you just can't understand the assembly code.

    Everything with the addressing modes is described in these 2 tables:

    The first table shows you very clearly which addressing modes actually result in a modified base register and which do not.  The second table tells you the name of the address mode and the corresponding syntax.  Please study these tables.

     

     

  • I just learned something by accident that is very awesome and solves that hated compiler warning. I also demand no warnings in my code, so I was embarrassed to offer that to you before. The new discovery is that the asm("..."); instruction can be located outside the executable scope of functions, in other words in the global space at the top of the file. This means you can use the following to replace your previous #define:

    Optimized save_regs:

      asm("save_regs_m    .macro    TargetArray");
      asm("   STW     B4, *B15--[2]    ; Store B4 onto new stack space");
      asm("|| MVKL    TargetArray,B4");
      asm("|| B       EOM?             ; use Branch pipeline to protect LDW from interrupts");
      asm("   MVKH    TargetArray,B4");
      asm("|| LDW     *++B15[2],B4     ; restore B4");
      asm("   STW     B3,  *B4++[1]    ; save return address");
      asm("   STW     B15, *B4++[1]    ; save modified stack pointer");
      asm("   STW     A15, *B4++[1]    ; save frame pointer");
      asm("   STW     B14, *+B4[0]     ; save data page pointer");
      asm("EOM?:                       ; target branch address at end of macro");
      asm("   .endm     ; end of the macro definition");
    #define save_regs() \
      asm("   save_regs_m _global_regs ; the macro call");
    #define save_regs_generic(x) /* x must be a quoted string of the assembly label for the target storage array */ \
      asm("   save_regs_m " x " ; the macro call");

    Usage:
      save_regs();
    or 
      save_regs_generic( "_global_regs" );
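
    save_regs_generic works because the C preprocessor hands the compiler adjacent string literals, which are then joined into one string before asm() sees them. A small host-side sketch of just that mechanism follows. SAVE_REGS_TEXT, STR, SAVE_REGS_TEXT_SYM, and save_regs_text are hypothetical names, and the stringizing variant (which lets the caller drop the quotes) is an addition of mine rather than something from the code above.

```c
#include <assert.h>
#include <string.h>

/* Adjacent string literals are concatenated by the compiler. */
#define SAVE_REGS_TEXT(x)       "   save_regs_m " x " ; the macro call"

/* Variant that stringizes a bare symbol, so no quotes are needed at the call. */
#define STR(x)                  #x
#define SAVE_REGS_TEXT_SYM(sym) SAVE_REGS_TEXT(STR(sym))

/* Returns the assembly line that asm() would receive for _global_regs. */
const char *save_regs_text(void)
{
    static const char text[] = SAVE_REGS_TEXT("_global_regs");

    assert(strlen(text) > 0);
    return text;
}
```

    Both SAVE_REGS_TEXT("_global_regs") and SAVE_REGS_TEXT_SYM(_global_regs) produce the same single string literal.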

    For most of the rest of your questions, even if you are not trying to learn the C6000 assembly details, you will need to look at more of the CPU & Inst Set Ref Guide, spru732h, as you have already started to do. I cannot explain the details of the instructions one-by-one any better than the Reference Guide does.

    Please look at Sections 3.8 "Addressing Modes", 3.8.1.1 "LD and ST Instructions", and 3.8.3 "Syntax for Load/Store Address Generation". The Tables that Brad inlined for you are in there. And for the level of understanding that you require of yourself, I would even recommend that you read in detail all of Sections 1, 2, 3, and 4. Please read the material with the attitude of learning how it does work rather than expecting it to prove why it does not work the way you expect it to based on your extensive previous experiences.

    Your questions are insightful and thoughtful and demonstrate a great deal of experience with other processor families. I may be wrong in how I interpret your questions, but when you try to apply a pre-requisite of symmetry, for example, you create a roadblock to understanding the actual requirement for performance with a limited instruction set.

  • After staring at the first table, I finally understood - sorry for my being half blind :-).

    I was too keen on looking into the cells of the table, but I should have paid more attention to the row/column headers.

     

    My problem is that with the code I showed, the single-stepping on EVM is fine, but on a target board (specific)

    and full speed, it crashed the stack. Maybe it's some interrupt issue...

     

    I think I'll try out the new version. Thanks.

     

    About my suppositions in general - it just affects my interpretation of things left ambiguous to me by first reading.

    I guess it's natural human behavior and "learning away" is much harder than learning.

    Anyway now that I understood how the indirect addressing works on this processor, I must say, it's better than

    what I have been used to.  Auto *crements have quite little use except for stack and table walk throughs, but then,

    most programming languages use local variables located in stack,  so the stack operations are heavily used.

     

    It's also nice that you can (I assume, haven't checked it yet) build a stack frame and push the old SP in the stack

    with a single instruction (and thus get rid of interrupt related problems), and then remove the stack frame and

    restore the SP with another single instruction.

     

     

  • I found this to be a very informative thread... thanks.

    I wonder if you could use the setjmp() call as a "safe and standard" way to save the regs

    ...just because you call setjmp(), doesn't mean you ever have to call longjmp()

     

    #include <stdio.h>
    #include <setjmp.h>

    jmp_buf Gregs;

    int a(int a1)
    {
        int i, j, k; // local vars

        i = j = k = 0xdead;
        setjmp(Gregs);
        printf("%x %x %x\n\n", i, j, k);

        // 13 words: see setjmp.h for reg size
        for (i = 0; i < 13; i++)
            // now you just have to decode what register is stored at what offset, slightly less portable
            printf("%d %08x\n", i, ((int *)Gregs)[i]);

        return 0;
    }

    int main(int argc, char **argv)
    {
        return a(1);
    }

     

  • Very interesting!  I have never looked into this function so now this thread has become educational for me too!  :)

    This certainly looks very close to what turboscrew was asking for.  The only issue I see, at least due to our specific implementation, is that setjmp is implemented as a function.  That means there will actually be a call and return.  In other words, when the stack pointer gets saved it will be a little bit different because the function calling setjmp will push some variables onto the stack before branching to the _setjmp symbol.  Did that make sense?  So if the stack pointer was "x" when we entered some function, it would actually get saved as "x+8" (or something similar) by setjmp since variables could be pushed to the stack prior to calling setjmp.

    This is still really good information.  There's actually another thread on the forum where I was trying to figure out a way to have an ISR check to make sure the stack pointer is double-word aligned.  I think setjmp is perfect for that!  Although the C code would modify the stack, it would always do so in double-word increments, so it would preserve whether or not the stack was aligned upon entering the function.

  • Yes, I understand the additional stack frame, setjmp() accounts for that.

    In a previous life (well, job) I had to implement setjmp(), and the (wonderful?) gcc compiler had globally optimized away the frame pointer. It made my task difficult, to say the least.

  • dkerns said:
    Yes, I understand the additional stack frame, setjmp() accounts for that.

    Perhaps yours did, but TI's does not.  I looked at the assembly (fyi, it is bundled in rtssrc.zip in the lib directory of the compiler), and it simply stores 13 registers to the provided address and then returns.

  • Hmm, OK, thanks. It should return 0, so the caller can distinguish between the original call and a "return" from longjmp().

    But even so, it should be "re-usable" beyond its original intent.

  • dkerns said:
    hmm ok. thanks. it should return 0, so the caller can distinguish between the original call and a "return" from longjmp()

    It does -- sorry left that part out!  I didn't understand its significance until you mentioned it just now.  Thanks for teaching me about this obscure C function!  :)
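
    For anyone else new to setjmp(), the return-value convention discussed above can be seen in a few lines of standard C. This is a generic sketch, not TI-specific: checkpoint, jump_back, and run_once are hypothetical names, and the value 7 is arbitrary.

```c
#include <assert.h>
#include <setjmp.h>

static jmp_buf checkpoint;   /* hypothetical name for the saved context */

/* Jump back to the setjmp() site; setjmp() then appears to return value. */
static void jump_back(int value)
{
    assert(value != 0);      /* longjmp turns 0 into 1, so use a non-zero value */
    longjmp(checkpoint, value);
}

/* Returns 0 when setjmp() only returned directly, 7 after a longjmp. */
int run_once(int do_jump)
{
    switch (setjmp(checkpoint)) {
    case 0:                  /* direct call: the context was just saved */
        if (do_jump) {
            jump_back(7);    /* does not return here */
        }
        return 0;
    case 7:                  /* second "return", caused by longjmp */
        return 7;
    default:
        return -1;
    }
}
```

    If longjmp() is never called, only the case 0 path runs, which is the situation when setjmp() is used purely to capture registers.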