Issue with _itoll(TSCH,TSCL)

Hi

I am using the _itoll(TSCH,TSCL) method to find out the DSP cycles consumed by my encode process call.

I am developing a custom H264 encoder on the DM648 platform. On several occasions, I have obtained different values for the cycles consumed, even though I haven't modified my source code. Sometimes the count increases to the order of 25,000,000 cycles, and this causes frustration while optimizing. After resetting the custom board a few times, I am able to get back the original cycle count. I am confused by this behaviour.

Are the TSCH/TSCL counters reliable? Or do I have to suspect my hardware?

Please reply.

Best Regards

JK

 

  • JK,

    The TSC TimeStamp Counter in the C64x+ (and later cores) is very reliable and is the best choice for counting cycles.

    If you do not do a hard reset or power cycle, the TSC will continue counting from the last value it had. Always take the difference in values read instead of looking at the actual values. The TSCL=0; instruction at the beginning of the process does not reset the TSC to 0; it only starts the TSC if it was not already counting.
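
    As a rough sketch of that difference-based measurement (assuming the TI c6x.h header for the TSCL/TSCH declarations and the _itoll() intrinsic; encode() and buffer stand in for whatever you are timing):

    #include <c6x.h>                     /* declares TSCL, TSCH and _itoll() */

    unsigned long long start, stop, cycles;

    TSCL = 0;                            /* any write starts the TSC; the written value is ignored  */
    start  = _itoll(TSCH, TSCL);         /* timestamp before the region of interest                 */
    encode(buffer);                      /* code being benchmarked                                   */
    stop   = _itoll(TSCH, TSCL);         /* timestamp after                                          */
    cycles = stop - start;               /* only this difference is meaningful, not the raw values   */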

    Regards,
    RandyP

  • Reading the TSCL register causes the high 32 bits to be latched into TSCH.  The C language does not guarantee that this will happen when you write _itoll(TSCH, TSCL); if the compiler emits code that reads TSCH first, then you will see a stale value for TSCH.  Does it work as you expect if you write something like this?

    #include <c6x.h>                     /* declares TSCL, TSCH and the _itoll() intrinsic */

    unsigned lo = TSCL;                  /* reading TSCL latches the upper 32 bits into TSCH */
    unsigned hi = TSCH;                  /* read the latched high half */

    unsigned long long tsc = _itoll(hi, lo);
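
    As a follow-up sketch (read_tsc is a hypothetical helper name; it assumes the same c6x.h declarations), wrapping the two reads in a small function keeps the read order fixed everywhere it is used:

    static unsigned long long read_tsc(void)
    {
        unsigned lo = TSCL;              /* must be read first: latches TSCH */
        unsigned hi = TSCH;
        return _itoll(hi, lo);
    }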

  • Thanks RandyP and Michael for the answers.

    Michael, I will try your suggestion and let you know the result.

    Best Regards

    JK

  • Archaeologist and I had a lengthy discussion on the best way to implement this 64-bit read+concatenation. You can see that thread here. His suggested method was exactly as Michael suggested above (great minds think alike?).

    My experience was that the TI C6000 compiler did what I wanted, but for customer recommendations, the "right way" to do the operation is more important than just what works. Thanks for the recommendation, Michael.

    JK,

    How are you calculating your benchmark time? Are you taking differences or looking at the full 64-bit number expecting it to be the time from the beginning of the benchmark's execution?

    Regards,
    RandyP

  • Hi RandyP

    Yes, I am taking the differences.

    I did a quick check. Michael's suggestion seemed to work the same way. The cycle count still increases suddenly by some fixed amount for the same code. Still digging.

    Thanks

    JK

     

  • Jayakrishnan,

    How are you collecting and presenting your results? If you have printf's in your code, these will take a lot of CPU cycles and can easily overlap (through interrupts, etc.) with your benchmarking code.

    Are you using the emulator at all during the benchmarking? If it updates the screen, that will cause reads from any location involved in an active display window like the Memory Browser and Variables/Expressions. Those emulator reads will cost CPU cycles.

    If we can help, please post more information or questions.

    Regards,
    RandyP

  • RandyP,

    Sorry for the delay. I was investigating more on this issue.

    My code outline is as follows.

    start_time = _itoll(TSCH,TSCL);

    encode(buffer);

    end_time = _itoll(TSCH,TSCL);

    printf("Cycles consumed=%lld\n",end_time - start_time);

    I have ensured that there are no printfs inside the encode() call itself.

    I am using an XDS510USB emulator with CCS 5.5 in a Windows 8.1 environment.

    I have also tried benchmarking a simple file read using the same emulator. The cycles consumed by this operation are also not consistent. This is expected, since the data resides on the host's hard disk, and the speed of the emulator data transfer may also vary. But what is intriguing is the inconsistency in the cycles consumed by the encode() call.

    I have another question. Is there a simple way to check the CPU load of the DSP?

    Regards

    JK

     

     

  • JK,

    Certainly, something is taking extra time during some of your executions of the test. This is a valid result of your testing, and it is something you may want to track down so you can understand or eliminate it.

    If you can disable interrupts during the encode() execution, do so. This may be difficult if some are being used. If needed, mask out the ones that are not required for the encode() execution. If one or more are required for execution, they should only be occurring as part of the encode() function execution; if they are sometimes inside the benchmark region and sometimes outside, this would explain big disparities in the measurements.
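
    For the interrupt masking in particular, here is one possible sketch (it assumes the compiler's _disable_interrupts()/_restore_interrupts() intrinsics and the same c6x.h declarations as above; encode() and buffer are placeholders from your outline):

    unsigned int csr_save;
    unsigned long long start, stop;

    csr_save = _disable_interrupts();    /* clear GIE and remember the previous state */
    start = _itoll(TSCH, TSCL);
    encode(buffer);                      /* benchmark region with interrupts masked   */
    stop  = _itoll(TSCH, TSCL);
    _restore_interrupts(csr_save);       /* put GIE back the way it was               */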

    Benchmark sections of the encode() execution. You may be able to narrow down which portion has the variation and then track down what is causing it.

    Regards,
    RandyP

  • Dear RandyP,

    No interrupts are used in the code. Anyway, I disabled them and checked. No change.

    It is very difficult to benchmark each section, since the code base is so huge. But currently, the encode() process cycles have been consistent for the past 2 weeks. The only change I made is putting two big arrays into L2 SRAM (previously they were in external memory).

    The code has some floating point functions. Perhaps this could also be the reason for the inconsistency in the cycles consumed.

    Best Regards

    JK

  • JK,

    Since this is not a Compiler question at this point (thanks, Compiler Forum team!), I will ask a Moderator to move this to the DSP forum for more coverage of device issues.

    If there are no interrupts occurring, the next thing to make sure of is that the debugger is not doing anything during the execution of the testing. If any of your CCS screens are being updated or any scripts are reading values during the execution, that can cause a lot of delays in the code. This usually does not happen in simple debugging scenarios, but it is something you must be aware of: do not click on any windows until the execution has completed.

    With the debugger removed from any impact, and with interrupts globally disabled (GIE=0), the code should execute exactly the same every time it runs with the same data. If the data is changing, then you could have some functions that are non-deterministic based on data. Division routines can be non-deterministic but should average out the same over many data samples.

    You will need to narrow down the region of the problem by taking TSC measurements at multiple points in the code sequence and find which large sections are seeing the most variation, or whether the variation occurs everywhere in the algorithms.
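
    As a rough sketch of that kind of instrumentation (the stage names below are hypothetical; the reads assume the same TSCL/TSCH setup as above):

    unsigned long long mark[4];

    mark[0] = _itoll(TSCH, TSCL);
    motion_estimation(frame);            /* hypothetical stage 1 */
    mark[1] = _itoll(TSCH, TSCL);
    transform_and_quantize(frame);       /* hypothetical stage 2 */
    mark[2] = _itoll(TSCH, TSCL);
    entropy_code(frame);                 /* hypothetical stage 3 */
    mark[3] = _itoll(TSCH, TSCL);

    /* compare mark[i+1] - mark[i] across runs to see which stage varies the most */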

    Interrupts and the debugger are the only things I can think of that will cause variation like this, so please check and double-check both of those factors.

    Regards,
    RandyP