I have a simple algorithm running on DSP (in DM3730) that I want to profile. I started profiling by using Server_getCpuLoad. The pseudo code on host microcontroller looks like
process (...);
Server_getCpuLoad(..);
I consistently get a value bet 6 to 8 %. I then decided to corroborate this value using timer, so in my algorithm I have
startTime = TSCL
do the processing
endTime = TSCL
deltaTime = endTime - startTime;
I then compute
load = (deltaTime/DSP_FREQUENCY).
This number is consistently around 1%. How is it possible that the two estimates of load vary so much?
regards
Sachin
Hi folks,
The above issue is fairly urgent of us (TI Germantown) and I would greatly appreciate any clues/feedback as to where I can look to resolve this discrepancy.
Server_getCpuLoad() returns the approximate CPU load of the Server over the last second. It involves sending a message to the DSP, gathering the load number and sending a reply to the ARM. That messaging means it's not a completely trivial service, although if you call it once a second (the only way it really makes sense to do), the overhead is in the noise. Typically users spin up a separate background thread that wakes up every second with the purpose of only calling Server_getCpuLoad().
Are you calling Server_getCpuLoad() after _every_ process() call? Which might be something like 30-60 times a second? That might explain the increased loading.
Chris
Thanks a ton Chris. I am calling Server_getCpuload after every process, and that is exactly 100 times a second. I used other ways of profiling my code http://processors.wiki.ti.com/index.php/Codec_Engine_Profiling and those methods consistently return 1% (which agrees with what I got using TSCL/TSCH). I will now try to call Server_getCpuLoad once a second and see what I get.
Hi Chris,
Now I changed my code to call only once and I still get 8%. So the code effectively now is
do {
} while (5 sec)
Server_getCpuLoad();
Do you have SW other than the codec running? Maybe drivers, statically created tasks, etc...
Yes there is driver running (EDMA3/McBSP/AIC) and also idle task, but can that amount to so much. Further I run the system at different sampling rates. (8, 16, 44.1 and 48 kHz), the following table show what I get.
Load %
Cycles
Process+ ISR
5
5000000 + 49000 < 1%
7
5000000 + 80000
16
5000000 + 235000
5000000 + 250000 > 1%
The load % is obtained using Server_getCpuLoad and the cycles obtained using TSCL/TSCH. I further used the two methods shown in http://processors.wiki.ti.com/index.php/Codec_Engine_Profiling, and I get the results that are consistent with cycles count shown in the last column of the table above. It is just that I am unable to account for the number shown by Server_getCpuLoad.
I still haven't been able to resolve my profiling issue satisfactorily. Here are the approaches I used
1) Server_getCpuLoad (): this gives me around 7% load
2) I used TSCL to measure actual cycles and get a figure of around 4700000, which translates to 1% CPU load
3) Used GT_trace: I used two methods for this I saw @ http://processors.wiki.ti.com/index.php/Codec_Engine_Profiling
a) Execute "CE_DEBUG=1 CE_DSP0TRACE="CV=5;GT_time=2" ./devnode.." I get following o/p
[DSP] @+001,475us: [+5 T:0x8d9f0bec] CV - VISA_enter(visa=0x8d9f0440): algHandle = 0x8d9f0478 [DSP] @+018,745us: [+5 T:0x8d9f0bec] CV - VISA_exit(visa=0x8d9f0440): algHandle = 0x8d9f0478
Based on this o/p => 18745 * 256 = 4798720 cylces, which matches what I get in point 2
b)Execute "CE_DEBUG=1 CE_DSP0TRACE="CV=5;ti.mas.devnode.DEVNODE=5;GT_time=2" ./devnode...". I then get following o/p
[DSP] @+000,001us: [+5 T:0x8d9f09a4] ti.mas.devnode.DEVNODE - BENCHMARK> process() call : delta=4630387
As can be seen the delta still agrees with what I observe in item 2.
Bottom line is that of the four different methods I use to profile my code, three of them seem to converg on same value (approx 4700000). However the Server_getCpuLoad seems to diverge a lot. How can I debug this issue?
Correct me if I'm wrong, but of the 4 methods, only Server_getCpuLoad() accounts for the entire load of the processor (including, for example, your driver code). The other methods only benchmark the codec.
Could you add instrumentation to benchmark the driver(s) and any other non-codec activity?
Another item might be overhead for cache management before/after your process call. If you use CE_DSP0TRACE to enable all benchmarking masks (level 5), not just "CV", you should see the cache overhead calls, too. That is, probably switch from CE_DSP0TRACE="CV=5;GT_time=2" to CE_DSP0TRACE="*=5;GT_time=2" you should see a few more trace statements.
You are correct in your observation. Actually my "codec" is the driver, remember the famous DEVNODE, I am trying to bench mark that. I do not have any other codec running. I also got the complete CE_DSP0TRACE, and I noticed
DSP] @+000,020us: [+5 T:0x8d9f0b54] CV - VISA_enter(visa=0x8d9f03a8): algHandle = 0x8d9f03e0[DSP] @+017,553us: [+5 T:0x8d9f0b54] CV - VISA_exit(visa=0x8d9f03a8): algHandle = 0x8d9f03e0[DSP] @+000,037us: [+5 T:0x8d9f0b54] CN - NODE> returned from call(algHandle=0x8d9f03a8, msg=0x8ff04880); messageId=0x0002cf76[DSP] @+002,358us: [+5 T:0x8d9f0b54] CN - NODE> 0x8d9f0310(AFE#0) call(algHandle=0x8d9f03a8, msg=0x8ff04880); messageId=0x0002cf77Does this give any clues?
FWIW, I found this link http://processors.wiki.ti.com/index.php/Codec_Engine_Application_Developers_Guide in the wiki. I am trying to implement the Load_init , but can't locate Load.h??? Is this step required?
Those are the important 4 DSP-side trace points, but the trace you pasted has half of 2 different calls. The order for a full DSP-side sequence is: