Minimal impact profiling method for reading TSC register



Hi,

We are trying to find a minimal-impact method to profile using the TSC register.  Using _itoll() is affecting the code we're trying to profile -- it substantially increases the cycle count.  So now we're using the inline asm code shown below:

__asm(" STDW B1:B0, *--SP(128)");  /* save B1:B0 on stack */
__asm(" STW A0, *--SP(16)");    /* save A0 on stack */
__asm("  BNOP TSC_Read_Done, 1");
__asm("  MVC TSCL, B0");
__asm("  MVC TSCH, B1");
__asm("  MVKL conv_fx_profile_begin, A0"); 
__asm("  MVKH conv_fx_profile_begin, A0");
__asm("TSC_Read_Done:");
__asm("  STDW B1:B0, *A0");
__asm(" NOP 5");
__asm(" LDW *SP++(16), A0");   /* restore A0 from stack */
__asm(" LDDW *SP++(128), B1:B0");  /* restore B1:B0 from stack */
  :
  :  code to profile
  :
__asm(" STDW B1:B0, *--SP(128)");  /* save B1:B0 on stack */
__asm(" STW A0, *--SP(16)");    /* save A0 on stack */
__asm("  BNOP TSC_Read_Done2, 1");
__asm("  MVC TSCL, B0");
__asm("  MVC TSCH, B1");
__asm("  MVKL conv_fx_profile, A0"); 
__asm("  MVKH conv_fx_profile, A0");
__asm("TSC_Read_Done2:");
__asm("  STDW B1:B0, *A0");
__asm(" NOP 5");
__asm(" LDW *SP++(16), A0");   /* restore A0 from stack */
__asm(" LDDW *SP++(128), B1:B0");  /* restore B1:B0 from stack */
conv_fx_profile = conv_fx_profile - conv_fx_profile_begin; 
conv_fx_total_cycles+=conv_fx_profile;  

This mostly works well, but some functions crash.  What is the correct way to save A0, B0, and B1 on the stack?

Alternatively, is there another method?  The objective is the least possible impact on the code being measured; for example, if the function is inlined then I don't want another function call.

-- 
Thanks!

Regards,
Sarvani Chadalapaka
HPC Systems Engineer
Signalogic Inc.
  • I find it very difficult to believe that _itoll is the source of the slowdown.  It is specifically intended to allow the compiler to convert the values in zero cycles.  I can't imagine pushing registers to the stack is faster.

    First, make sure you are writing the C read of TSCL first, then TSCH:

    unsigned lo = TSCL;
    unsigned hi = TSCH;
    unsigned long long t = _itoll(hi, lo);

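    You can't write it like this, because C does not specify the order in which function arguments are evaluated, so TSCH might be read before TSCL and you might get the wrong result:

    unsigned long long t = _itoll(TSCH, TSCL);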

    See https://e2e.ti.com/support/dsp/tms320c6000_high_performance_dsps/f/112/t/330784

    You should not need to place these reads in the delay slot of a branch. Reading TSCL is supposed to lock the value in TSCH until you read from it.
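
    A minimal sketch of the read order described above, wrapped in an inline helper so that no extra function call is added at the measurement point. The name read_tsc and the host-side stand-ins under #else are assumptions for illustration only; on the target, TSCL and TSCH are the control registers declared by c6x.h.

    ```c
    #ifdef _TMS320C6X
    #include <c6x.h>                /* TI compiler: declares TSCL and TSCH */
    #else
    /* Hypothetical host stand-ins so the sketch compiles off-target. */
    static volatile unsigned TSCL, TSCH;
    static unsigned long long _itoll(unsigned hi, unsigned lo)
    {
        return ((unsigned long long)hi << 32) | lo;
    }
    #endif

    /* read_tsc is a hypothetical name, not a TI API. */
    static inline unsigned long long read_tsc(void)
    {
        unsigned lo = TSCL;         /* low half first ... */
        unsigned hi = TSCH;         /* ... then high half, as separate statements */
        return _itoll(hi, lo);      /* combine into one 64-bit value */
    }
    ```

    Taking read_tsc() before and after the code of interest and subtracting the two values gives the elapsed cycle count, assuming the counter has already been started by a write to TSCL.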

    Second, using inline assembly code is an at-your-own risk venture.  It's not guaranteed that the assembly code will appear exactly where you expect with respect to the generated assembly code for the C statements.

    Finally, for a C6000 push, you should be using post-decrement mode, not pre-decrement.  For the pop, you should be using pre-increment, not post-increment.  If you don't follow this convention and an interrupt occurs in the middle of your assembly statements, the top of the stack (your pushed register) will get clobbered.
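
    To illustrate why the push/pop addressing mode matters, here is a small host-side model in C. It is only a sketch under stated assumptions: the stack is modeled as an array, sp indexes the first free word (the convention that post-decrement push implies), and simulated_interrupt stands in for a handler that pushes and pops one word. All names here are hypothetical.

    ```c
    #include <stdint.h>

    enum { STACK_WORDS = 16 };

    /* Model: sp is the index of the first FREE word; the stack grows downward. */
    typedef struct {
        uint32_t mem[STACK_WORDS];
        int sp;
    } stack_model;

    /* Conventional push: store into the free word, then move sp down. */
    static void push_post_dec(stack_model *s, uint32_t v) { s->mem[s->sp] = v; s->sp--; }

    /* Conventional pop: move sp up, then load. */
    static uint32_t pop_pre_inc(stack_model *s) { s->sp++; return s->mem[s->sp]; }

    /* Unconventional push: move sp down first, then store AT sp. */
    static void push_pre_dec(stack_model *s, uint32_t v) { s->sp--; s->mem[s->sp] = v; }

    /* Unconventional pop: load from sp, then move sp up. */
    static uint32_t pop_post_inc(stack_model *s) { uint32_t v = s->mem[s->sp]; s->sp++; return v; }

    /* A handler that follows the convention: it treats mem[sp] as free. */
    static void simulated_interrupt(stack_model *s)
    {
        push_post_dec(s, 0xDEADBEEFu);
        (void)pop_pre_inc(s);
    }
    ```

    In this model, a value pushed with push_post_dec survives simulated_interrupt, while a value pushed with push_pre_dec sits in the word the handler considers free and is overwritten.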

  • Archaeologist,

    Thank you for your response. Your method of reading TSC registers helped. 

    -- 
    Thanks!
    
    Regards,
    Sarvani Chadalapaka
    HPC Systems Engineer
    Signalogic Inc.
  • Archaeologist said:

    You should not need to place these reads in the delay slot of a branch. Reading TSCL is supposed to lock the value in TSCH until you read from it.

    While the lock of the high part is correct, you still want to ensure that no interrupts occur between the low and high reads. Placing the reads in a branch delay slot is simply a way to ensure that, because interrupts are implicitly disabled there. The simplest and most reliable way is to implement this as a separate function in assembly, which has a predictable overhead (of 5 cycles):

            .global _favorite_name
    _favorite_name:
            .asmfunc
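            B       RA              ; return; the next 5 instructions fill the branch delay slots
            MVC     TSCL,B0         ; read low half first
            MVC     TSCH,B1         ; then read high half
      [!B0] MVC     B0,TSCL         ; start TSC
            MV      B0,A4           ; low word of the 64-bit return value
            MV      B1,A5           ; high word of the 64-bit return value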
            .endasmfunc
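
    A rough sketch of how such a function might be used from C, with the fixed read overhead measured once and subtracted. It assumes the COFF ABI name mapping suggested by the leading underscore (C name favorite_name, assembly label _favorite_name); tsc_calibrate, tsc_elapsed, and the host stand-in under #else are hypothetical helpers, not part of any TI library.

    ```c
    #ifdef _TMS320C6X
    /* Assumes COFF ABI: the C name favorite_name links to _favorite_name. */
    extern unsigned long long favorite_name(void);
    #else
    /* Hypothetical host stand-in: each call "costs" 5 fake cycles. */
    static unsigned long long fake_cycles;
    static unsigned long long favorite_name(void)
    {
        fake_cycles += 5;
        return fake_cycles;
    }
    #endif

    static unsigned long long tsc_overhead;   /* cost of one read, in cycles */

    static void tsc_calibrate(void)
    {
        unsigned long long t0, t1;
        (void)favorite_name();      /* first call may only start the counter */
        t0 = favorite_name();
        t1 = favorite_name();
        tsc_overhead = t1 - t0;     /* back-to-back reads measure the overhead */
    }

    static unsigned long long tsc_elapsed(unsigned long long begin,
                                          unsigned long long end)
    {
        return end - begin - tsc_overhead;
    }
    ```

    Calling tsc_calibrate() once at startup and then bracketing the code of interest with two favorite_name() calls keeps the correction out of the measured region.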