Help with Debug: algorithm takes 3x the runtime on the DSP compared to the ARM.

Other Parts Discussed in Thread: DM3730

I'm hoping to get some hints, or maybe a pointer to the right document that I haven't found yet, to help with my problem. I have a Huffman encoder that I've ported from the ARM to the DSP on a DM3730 SoC. The functionality is there, but it takes far longer to run than expected. The implementation is a straight port; the ARM runs at 600 MHz, which should mean the DSP is running at 520 MHz.


Optimization is set to -O2, but I'm not sure how to check other things, such as:

Am I wasting cycles on tracing in areas other than my algorithm? I define GT_TRACE 0 right before including gt.h, which I understand removes all of my GT_XTrace calls in my module. But is there tracing in other components of Codec Engine that get called in the process of running this algorithm remotely? If so, how do I know, and how do I turn it off? My gut tells me there is still some overhead in the system that I just need to find and root out.

Another thing I've noticed is that when I do have CE_DEBUG enabled, the timestamps aren't correct. My algorithm, measured round-trip from the ARM, takes about 50 ms, but the CE_DEBUG messages report it as taking about 100 ms. Clearly the timestamping is assuming the wrong clock rate or tick rate somewhere. I tried modifying this in the server.cfg of my codec server and set the clock frequency to 520 (MHz). Is this the right thing to do, and the right place to do it?
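For reference, this is roughly what I tried. I'm not sure whether server.cfg or server.tcf is the right home for the clock rate, so treat the lines below as a sketch that assumes the DSP/BIOS 5 GBL module in my server.tcf (and I don't know yet whether this is even what the CE_DEBUG timestamps key off of):

/* Sketch only: telling DSP/BIOS the DSP clock rate so its timing services
 * scale correctly.  Assumes a DSP/BIOS 5.x server.tcf, where "prog" is the
 * standard tconf handle to the program being configured. */
prog.module("GBL").CLKOUT = 520.0;   /* DSP clock in MHz at my operating point */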

Any tips or tricks are appreciated.  I've looked in a few places so far, but haven't found any definitive reasons why this would be significantly slower. 

I've looked in the GPP to DSP porting guide [1] and the Codec Engine FAQ, among a few others.

[1]

  • I found a partial answer to my question with this resource [1]: basically, I needed to discover the magic formula of multiplying the elapsed time by 256 and dividing by the clock rate. I also failed to mention in my original post that I'm using Codec Engine and DVSDK 4.03.00.06.

    What I've learned is that my MODULE_process call is truly running longer than the equivalent code on the ARM. What I need to find out is:

    1) In the package.mak (and the makefiles it includes) that gets generated in my codec package (fluke.codecs.huffenc), where is the configuration being pulled from? I'm talking about optimization flags, etc. I have searched and cannot figure it out. The comment says it's generated from package.bld, but that file doesn't have the compiler settings either. How can I change compiler settings when there are so many levels of indirection here? (See the config.bld/package.bld sketch at the end of this post.)

    2) Where are the cache settings for the DSP? I have a codec server with a server.cfg and server.tcf file. Within the server.tcf, there are device_regs settings like this:

    var device_regs = {
        l1PMode: "16k",
        l1DMode: "16k",
        l2Mode: "64k",
        l1DHeapSize: 0
    };

    that I presume are used. Is it safe to assume I have the cache enabled? (See the GBL cache sketch at the end of this post.)

    [1] processors.wiki.ti.com/.../Codec_Engine_Profiling
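
    To make question 1 concrete, here is my current understanding of where the flags would have to come from in an XDC-built package, written out as a sketch. The file names, paths, and options below are placeholders from my own guesswork (a config.bld next to my server build, plus the package.bld in fluke.codecs.huffenc), not something I've verified:

    /* config.bld (sketch): the target definition is where global compiler
     * options live; package.mak is generated and shouldn't be edited. */
    var C64P = xdc.useModule('ti.targets.C64P');
    C64P.rootDir = "/opt/ti/cgt6x";              /* placeholder codegen path */
    C64P.ccOpts.suffix += " -mw ";               /* placeholder extra flags  */
    Build.targets = [ C64P ];

    /* package.bld in fluke.codecs.huffenc (sketch): the profile picks the
     * canned optimization/debug options (e.g. "release" vs. "debug"). */
    for (var i = 0; i < Build.targets.length; i++) {
        Pkg.addLibrary("lib/huffenc", Build.targets[i], { profile: "release" })
            .addObjects(["huffenc.c"]);
    }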
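
    For question 2, my working assumption is that the device_regs block gets picked up by the shared example .tcf and turned into DSP/BIOS GBL cache settings. A sketch of what I believe the direct equivalent looks like on a C64x+ DSP with DSP/BIOS 5.x (the property names are my assumption; the MAR line is only an illustration, since external memory is cached only where the MAR bits are set):

    /* Sketch of what I assume my device_regs settings boil down to. */
    prog.module("GBL").C64PLUSCONFIGURE = true;   /* let BIOS program the cache */
    prog.module("GBL").C64PLUSL1PCFG = "16k";     /* L1 program cache           */
    prog.module("GBL").C64PLUSL1DCFG = "16k";     /* L1 data cache              */
    prog.module("GBL").C64PLUSL2CFG = "64k";      /* L2 cache                   */
    prog.module("GBL").C64PLUSMAR128to159 = 0xffffffff;  /* example MAR range   */

    If properties like these show up in the generated configuration under my server build directory, I'll take that as confirmation that the cache really is enabled.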
  • Moving this post to DM37x forum.