This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Profiling

Other Parts Discussed in Thread: DM3730

I have a simple algorithm running on DSP (in DM3730) that I want to profile. I started profiling by using Server_getCpuLoad. The pseudo code on host microcontroller looks like

process (...);

Server_getCpuLoad(..);

I consistently get a value bet 6 to 8 %. I then decided to corroborate this value using timer, so in my algorithm I have

startTime = TSCL

do the processing

endTime = TSCL

deltaTime = endTime - startTime;

I then compute

load = (deltaTime/DSP_FREQUENCY).

This number is consistently around 1%.  How is it possible that the two estimates of load vary so much?

regards

Sachin

  • Hi folks,

      The above issue is fairly urgent of us (TI Germantown) and I would greatly appreciate any clues/feedback as to where I can look to resolve this discrepancy.

    regards

    Sachin

  • Server_getCpuLoad() returns the approximate CPU load of the Server over the last second.  It involves sending a message to the DSP, gathering the load number and sending a reply to the ARM.  That messaging means it's not a completely trivial service, although if you call it once a second (the only way it really makes sense to do), the overhead is in the noise.  Typically users spin up a separate background thread that wakes up every second with the purpose of only calling Server_getCpuLoad().

    Are you calling Server_getCpuLoad() after _every_ process() call?  Which might be something like 30-60 times a second?  That might explain the increased loading.

    Chris

  • Thanks a ton Chris. I am calling Server_getCpuload after every process, and that is exactly 100 times a second. I used other ways of profiling my code http://processors.wiki.ti.com/index.php/Codec_Engine_Profiling and those methods consistently return 1% (which agrees with what I got using TSCL/TSCH). I will now try to call Server_getCpuLoad once a second and see what I get.

    regards

    Sachin

  • Hi Chris,

    Now I changed my code to call only once and I still get 8%. So the code effectively now is

    do {

      process (...);

    } while (5 sec)

    Server_getCpuLoad();

    regards

    Sachin

  • Do you have SW other than the codec running?  Maybe drivers, statically created tasks, etc...

    Chris

  • Yes there is driver running (EDMA3/McBSP/AIC) and also idle task, but can that amount to so much. Further I run the system at different sampling rates. (8, 16, 44.1 and 48 kHz), the following table show what I get.

    Load %

    Cycles

    Process+ ISR

    5

    5000000 + 49000 < 1%

    7

    5000000 + 80000

    16

    5000000 + 235000

    16

    5000000 + 250000 > 1%

    The load % is obtained using Server_getCpuLoad and the cycles obtained using TSCL/TSCH. I further used the two methods shown in http://processors.wiki.ti.com/index.php/Codec_Engine_Profiling, and I get the results that are consistent with cycles count shown in the last column of the table above. It is just that I am unable to account for the number shown by Server_getCpuLoad.

  • Hi folks,

      I still haven't been able to resolve my profiling issue satisfactorily. Here are the approaches I used

    1) Server_getCpuLoad (): this gives me around 7% load

    2) I used TSCL to measure actual cycles and get a figure of around 4700000, which translates to 1% CPU load

    3) Used GT_trace: I used two methods for this I saw @ http://processors.wiki.ti.com/index.php/Codec_Engine_Profiling

      a)  Execute "CE_DEBUG=1 CE_DSP0TRACE="CV=5;GT_time=2" ./devnode.." I get following o/p

          [DSP] @+001,475us: [+5 T:0x8d9f0bec] CV - VISA_enter(visa=0x8d9f0440): algHandle = 0x8d9f0478
          [DSP] @+018,745us: [+5 T:0x8d9f0bec] CV - VISA_exit(visa=0x8d9f0440): algHandle = 0x8d9f0478

         Based on this o/p => 18745 * 256 = 4798720 cylces, which matches what I get in point 2

      b)Execute "CE_DEBUG=1 CE_DSP0TRACE="CV=5;ti.mas.devnode.DEVNODE=5;GT_
    time=2" ./devnode...". I then get following o/p


    [DSP] @+000,001us: [+5 T:0x8d9f09a4] ti.mas.devnode.DEVNODE - BENCHMARK> process() call : delta=4630387

    As can be seen the delta still agrees with what I observe in item 2.

    Bottom line is that of the four different methods I use to profile my code, three of them seem to converg on same value (approx 4700000). However the Server_getCpuLoad seems to diverge a lot. How can I debug this issue?

    regards

    Sachin

  • Correct me if I'm wrong, but of the 4 methods, only Server_getCpuLoad() accounts for the entire load of the processor (including, for example, your driver code).  The other methods only benchmark the codec.

    Could you add instrumentation to benchmark the driver(s) and any other non-codec activity?

    Another item might be overhead for cache management before/after your process call.  If you use CE_DSP0TRACE to enable all benchmarking masks (level 5), not just "CV", you should see the cache overhead calls, too.  That is, probably switch from CE_DSP0TRACE="CV=5;GT_time=2" to CE_DSP0TRACE="*=5;GT_time=2" you should see a few more trace statements.

    Chris

  • You are correct in your observation. Actually my  "codec" is the driver, remember the famous DEVNODE, I am trying to bench mark that. I do not have any other codec running. I also got the complete CE_DSP0TRACE, and I noticed

    DSP] @+000,020us: [+5 T:0x8d9f0b54] CV - VISA_enter(visa=0x8d9f03a8): algHandle = 0x8d9f03e0
    [DSP] @+017,553us: [+5 T:0x8d9f0b54] CV - VISA_exit(visa=0x8d9f03a8): algHandle = 0x8d9f03e0
    [DSP] @+000,037us: [+5 T:0x8d9f0b54] CN - NODE> returned from call(algHandle=0x8d9f03a8, msg=0x8ff04880); messageId=0x0002cf76
    [DSP] @+002,358us: [+5 T:0x8d9f0b54] CN - NODE> 0x8d9f0310(AFE#0) call(algHandle=0x8d9f03a8, msg=0x8ff04880); messageId=0x0002cf77

    Does this give any clues?

    Sachin

    FWIW, I found this link http://processors.wiki.ti.com/index.php/Codec_Engine_Application_Developers_Guide in the wiki. I am trying to implement the Load_init , but can't locate Load.h??? Is this step required?

  • Those are the important 4 DSP-side trace points, but the trace you pasted has half of 2 different calls.  The order for a full DSP-side sequence is:

    1. CN - NODE> ... call ...
    2. CV - VISA_enter ...
    3. CV - VISA_exit ...
    4. CN - NODE> returned ...

    • #1 is printed just after the DSP has received the message from the ARM.
    • #2 is printed just before calling the codec's activate/process methods.  The delta between this and #1 is basically pre-process() call cache invalidate.
    • #3 is printed after returning from the codec's process() method.  The delta between this and #2 is the algActivate() (if present) and process() call.
    • #4 is printed just before sending the reply back to the ARM.  The delta between this and #3 is algDeactivate() (if present) and cache writebacks.
    Although we don't have a single full frame above, the times show _very_ small overhead in #2 and #4, which (assuming the trace statements above are for a process() call and not a control() call!) means the cache overhead is negligible.
    If I'm remembering the DEVNODE architecture, there's a driver independently running, capturing data and filling up buffers.  If that's right, what is the overhead of all that data capture work?
    Chris