
RPE in DVRRDK runs much slower than RPE in EZSDK???



Hi,

   I am developing a DSP algorithm for the EVM8168.  After packaging my algorithm in an xDM-compliant form, I successfully validated it in EZSDK 5.05 with the RPE framework.  Fortunately, its performance meets our requirements: "Rpe_process()" takes about 50 ms to process a single frame by invoking my xDM algorithm.

   However, when the identical xDM algorithm is integrated into the DVRRDK 4.0 demo, again using the RPE framework, a single frame takes 500 ms to process!

   These results confuse me.  I suspect the issue is related to DSP caching; the difference in memory-map configuration between EZSDK and DVRRDK may also be responsible for the performance gap.

   I am not sure of the real cause of this issue.  Could you give me a hand?  I have always regarded the DVRRDK as a well-organized, high-performance software kit.
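
   (For context, a minimal sketch of how per-frame timing like this can be measured on the Linux/A8 side; the elapsed_us helper is illustrative, and the Rpe_process() argument list in the usage comment is schematic rather than the actual RPE API signature:)

    #include <stdio.h>
    #include <sys/time.h>

    /* Microsecond difference between two timestamps, for bracketing a call. */
    static long elapsed_us(struct timeval t0, struct timeval t1)
    {
        return (t1.tv_sec - t0.tv_sec) * 1000000L + (t1.tv_usec - t0.tv_usec);
    }

    /* Usage (schematic):
           gettimeofday(&t0, NULL);
           Rpe_process(...);                  /- invoke the xDM algorithm remotely
           gettimeofday(&t1, NULL);
           printf("frame: %ld us\n", elapsed_us(t0, t1));                        */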

Naroah

Nov/21/2013

  • This should be an issue with caching. Ensure that your algorithm's buffers and the input/output buffers are allocated from cached memory. In DVR RDK, SharedRegion 0 is non-cached, whereas in EZSDK I think it is cached. Refer to the DVR RDK RPE example and use SharedRegion 1 for both RPE algorithm input/output buffers, and ensure the xDM algorithm's internal buffers are also allocated from cached memory.
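
    (A minimal sketch of such an allocation using the IPC SharedRegion API; the header paths follow SysLink/IPC, the wrapper name is illustrative, and the DVR RDK provides its own helpers for the same thing:)

    #include <xdc/std.h>
    #include <xdc/runtime/Memory.h>
    #include <xdc/runtime/IHeap.h>
    #include <ti/ipc/SharedRegion.h>

    /* Illustrative helper: allocate an RPE I/O buffer from SharedRegion 1
       (cached) instead of SharedRegion 0 (non-cached in DVR RDK). */
    Ptr allocFromSR1(SizeT size)
    {
        IHeap_Handle heap = SharedRegion_getHeap(1);
        /* 128-byte alignment keeps buffers cache-line aligned on C674x */
        return Memory_alloc(heap, size, 128, NULL);
    }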

  • Hi Badri,

         Thank you for your reply.

         Actually, the memory allocation function we employ is the same as the one in the DVRRDK AAC codec demo, namely "Void *Audio_allocateSharedRegionBuf (Int32 bufSize)".  The heap name is "SR_FRAME_BUFFERS_ID", which I take to be SharedRegion 1, since audio_utils.c says "#define SR_FRAME_BUFFERS_ID     1".

          However, I am not sure whether SR1 is DSP-cache enabled: when rebuilding dvr_rdk, gmake prints some memory-map information derived from config_1G_256MLinux.bld.  This output, for instance "(SR1 bitstream buffer Cached on A8.  Cached on M3, although accessed by DMAs)", never mentions which chunks of memory are cached by the DSP.

         How can I find out the cache status of a given memory region on the DSP?  Is it feasible to modify configuration files such as config_1G_256MLinux.bld so that my algorithm runs efficiently on the DSP?  Since I am not familiar with this kind of configuration in DVRRDK, I would appreciate your help.
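
         (On the C674x, cacheability of external memory is controlled per 16 MB region by the MAR registers: MAR n covers addresses n*16 MB to (n+1)*16 MB - 1, and the .cfg file groups 32 MARs per word, e.g. Cache.MAR128_159.  A small sketch for mapping an address to its MAR bit:)

    #include <stdio.h>
    #include <stdint.h>

    /* Map an address to its MAR index and to the bit within the 32-MAR
       config word (e.g. Cache.MAR128_159), per the C674x cache scheme. */
    void marForAddr(uint32_t addr)
    {
        unsigned mar  = (unsigned)(addr >> 24);   /* one MAR per 16 MB region */
        unsigned word = mar & ~31u;               /* first MAR in its group   */
        printf("0x%08x -> MAR%u = bit %u of MAR%u_%u\n",
               (unsigned)addr, mar, mar & 31u, word, word + 31u);
    }

    /* marForAddr(0x90000000) prints "MAR144 = bit 16 of MAR128_159",
       so SR1 at 0x9000_0000 is cached when that bit is 1. */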

          One more thing: I have allocated 1920*1080 bytes (= 2025 KB) for inBuf and 129600 bytes (about 126.6 KB) for outBuf.

          I am sincerely looking forward to your reply.

    Naroah

    Nov/22/2013

  • Hi Badri,
       Following your suggestion, I rechecked my DSP configuration file.  Referring to "FC_RMAN_IRES_c6xdsp.cfg", I confirm that the cache for SR1, which is located at 0x9000_0000, is enabled:


    /* Disable caching for HWspinlock addresses */
    Cache.MAR0_31    = 0x00000000;
    Cache.MAR32_63   = 0x00000000;
    /* Config/EDMA registers - cache disabled */
    Cache.MAR64_95   = 0x00000000;
    Cache.MAR96_127  = 0x00000000;

    ========== The lines below indicate SR1 is cached correctly ==========

    /* CPU access code and data - 0x80000000, cache enabled */
    Cache.MAR128_159 = 0xFFFFFFFF;
    /* TILER memory - 0xA0000000, cache enabled */
    Cache.MAR160_191 = 0xFFFFFFFF;
    /* memory - 0xC0000000, cache enabled */
    Cache.MAR192_223 = 0xFFFFFFFF;
    /* memory - 0xE0000000, cache enabled */
    Cache.MAR224_255 = 0xFFFFFFFF;


       Furthermore, by changing MAR128_159 from 0xFFFF_FFFF to 0x0000_0000, I explicitly DISABLED the SR1 cache.  After rebuilding dvr_rdk, replacing the firmware binary (*.xe674) and running my test application again, I sadly find that it takes OVER 2000 ms for my algorithm to process a single frame!  This indicates that the 500 ms-per-frame result is definitely a cache-enabled one, while a non-cached run takes about 2000 ms instead.

        But how come?  I mean, the DSP cache system seems to be innocent.  So who stands accused of stealing my DSP cycles?

        Could you give me a hand in catching the outlaw?  Justice must be done.

    Yours sincerely,

    Naroah

    Nov/22/2013

  • Hi Badri,

         Although I didn't employ the OSD or SCD modules in my test application, I wonder whether they might consume DSP cycles and thus degrade my algorithm's performance.  Is that possible?  How can I confirm whether the OSD and SCD modules are disabled in the DVRRDK?

    Naroah

    Nov/23/2013

  • The test you did by disabling the MAR bits for 128-159 is wrong and doesn't provide any information on whether your I/O buffers and algorithm-internal buffers are cached or not. You just disabled caching for everything, including DSP code and CPU heaps, so very poor performance is expected. As I mentioned, printing the addresses of every buffer used by the algorithm is the only way to confirm whether caching is the problem.

  • Hi Badri,

        Thank you for your reply!

        You are right; disabling the global cache-enable bits was the wrong test.  I followed your advice and tried to print the input and output pointers to the terminal, and this is the result I get:

        I expected the buffers to be located in SR1 (0x9000_0000~0xA000_0000), yet the result was definitely not what I expected.  My buffer allocation code is as follows:

    /* ... before memory allocation ... */

    IHeap_Handle heap = System_ipcGetSRHeap(1);  /* heap backing SharedRegion 1 */
    inBufDesc->numBufs = 1;
    inBufDesc->descs[0].buf = (uint8_t *) Memory_alloc (heap, (WIDTH * HEIGHT), 128, NULL);
    inBufDesc->descs[0].bufSize = (WIDTH * HEIGHT);
    outBufDesc->numBufs = 1;
    outBufDesc->descs[0].buf = (uint8_t *) Memory_alloc (heap, (WIDTH * HEIGHT/16 * 5), 128, NULL);
    outBufDesc->descs[0].bufSize = (WIDTH * HEIGHT/16 * 5);

    // Print the Pointers

    printf("inBuf  pointer Address @ 0x%08x size = %d\n", (uint32_t) inBufDesc->descs[0].buf, inBufDesc->descs[0].bufSize);
    printf("outBuf pointer Address @ 0x%08x size = %d\n", (uint32_t)outBufDesc->descs[0].buf, outBufDesc->descs[0].bufSize);

    /* ... then invoke Rpe_process() ... */

        I don't know what's wrong with my code.  Could you tell me how I can allocate memory from SR1, which is located at 0x9000_0000?

        I am sincerely looking forward to your reply.

    Naroah

    Nov/24/2013

  • Hi Badri,

        I found I had made a mistake: the memory addresses printed in my previous post are virtual addresses.  This time, I obtained their actual physical addresses by using Vsys_allocBuf() and printing them out:

        ...

        Vsys_allocBuf(1, WIDTH * HEIGHT, 128, &pInBuf);
        Vsys_allocBuf(1, WIDTH * HEIGHT / 16 * 5, 128, &pOutBuf);

    ...

        inBufDesc->numBufs = 1;
        inBufDesc->descs[0].buf = pInBuf.virtAddr;

        inBufDesc->descs[0].bufSize = (WIDTH * HEIGHT);
        outBufDesc->numBufs = 1;
        outBufDesc->descs[0].buf = pOutBuf.virtAddr;
        outBufDesc->descs[0].bufSize = (WIDTH * HEIGHT / 16 * 5);
    ...
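
        (For completeness, a sketch of printing both addresses; this assumes the buffer-info struct filled in by Vsys_allocBuf() exposes virtAddr and physAddr fields, as in the DVR RDK:)

        printf("inBuf  virt @ 0x%08x phys @ 0x%08x\n",
               (uint32_t) pInBuf.virtAddr, (uint32_t) pInBuf.physAddr);
        printf("outBuf virt @ 0x%08x phys @ 0x%08x\n",
               (uint32_t) pOutBuf.virtAddr, (uint32_t) pOutBuf.physAddr);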

         However, the performance is as poor as in my first test.

         This time, I think the buffer pointers are set correctly, as their physical addresses are indeed in 0x9000_0000~0xA000_0000, and the corresponding MAR bits are set to 0xFFFFFFFF.  Is that enough to say my cache is configured correctly?  Or are there other options I should set to ensure the cache is enabled?

         Sincerely looking forward to your reply.

    Naroah

    Nov/24/2013

        

  • Check that the addresses of the algorithm's internal buffers (memTab/IRES buffers) are also in cached memory. Also, if your algorithm uses L2 SRAM, confirm that those buffers are actually allocated from SRAM.

  • Hi Badri,

        Thank you for your great advice.  I am trying to check the memory and buffer allocation on the DSP side.  Sorry for my ignorance -- actually, I don't know how to print debug information to the serial terminal from the C674x when it works as a slave core.  I tried "printf" but nothing happened.  Maybe "Syslink_printf()" is an alternative, yet I failed to output any strings to the serial terminal with that function, either.

        Is an emulator necessary in this debugging case?  I do have an XDS100v2, by the way, which has been lying at the bottom of my EVM8168 devkit box for about 2 years...

        Waiting for your reply, also sincerely.

    Naroah

    Nov/25/2013

  • You can use Vps_printf() to print from the C674x; the output appears on the Linux console if remote_debug_client.out is running (it is started by default when you execute init.sh).
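
    (For example, a sketch of dumping the buffer addresses as seen on the DSP side with it; the descriptor names here follow the earlier posts and are illustrative:)

        /* Inside the DSP-side process path, e.g. before running the algorithm: */
        Vps_printf("DSP: inBuf  @ 0x%08x size %d\n",
                   (UInt32) inBufDesc->descs[0].buf, inBufDesc->descs[0].bufSize);
        Vps_printf("DSP: outBuf @ 0x%08x size %d\n",
                   (UInt32) outBufDesc->descs[0].buf, outBufDesc->descs[0].bufSize);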

  •     Bloody great...  In order to focus on the RPE issue, I removed all the unnecessary parts from my test application by reorganizing the whole make project, so unfortunately the ti_vsys module is not available now.  Anyway, I will restore the system immediately and let you know as soon as I get additional debug information on this issue.

        Thank you all the same.

    Naroah

    Nov/25/2013

  • Well, you can also connect JTAG to the C674x target, and then the printf output will be seen in the CCS console.

  • Hi Badri,

        Excuse me, but I feel really confused at this moment. 

        How can I confirm whether my algorithm uses the L2 SRAM?  Some static arrays (look-up tables) are declared in my program, and I have never dynamically allocated any memory in my DSP C code.

        Moreover, I paid no attention to whether L2 SRAM was used when I packaged the algorithm as an xDM-compliant one.  Simply integrated into RPE in EZSDK, it runs perfectly.

        Could some crucial difference between the DSP cfg files in EZSDK and the one in DVRRDK be responsible for this issue?
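
        (One hedged way to check: print the link-time address of a static table at run time; the table name below is hypothetical.  On the C674x, L2 RAM appears at local address 0x0080_0000, so an address in SR1 (0x9000_0000~) means the table lives in external, cached DDR rather than in SRAM:)

        extern const unsigned char XXX_lookupTable[];  /* hypothetical LUT name */

        Vps_printf("LUT @ 0x%08x\n", (UInt32) XXX_lookupTable);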

    Naroah

    Nov/25/2013

  • Hi Badri,

        This time, I tested two other xDM algorithms built by myself -- one for G722 encoding and another for G722 decoding.  Surprisingly, their performance in RPE@DVRRDK is close to their performance in RPE@EZSDK.

        However, the 10x-slower issue still exists for the algorithm I described in my previous posts.

        Do you have any idea about it?

    Naroah

    Nov/25/2013

  • As the author of the algorithm, you know what resources it requests via the memTab interface and what resources it requests via the IRES interface. If you require some algorithm-internal buffer to be in internal memory, the memTab attribute for that buffer will have the corresponding type. You can print all the memTab-allocated buffers and confirm that the external-memory buffers are cached and that the internal-memory buffers are actually allocated from L2 SRAM.
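
    (A sketch of such a check, looping over the records the framework passes to the algorithm; IALG_MemRec and IALG_EXTERNAL are from ti/xdais/ialg.h, Vps_printf is per the earlier post, and the dump point and function name are illustrative:)

        #include <xdc/std.h>
        #include <ti/xdais/ialg.h>

        /* Illustrative: dump each memTab record once .base has been filled in
           by the framework (i.e. from algInit(), not algAlloc()). */
        Void XXX_dumpMemTab(const IALG_MemRec memTab[], Int n)
        {
            Int i;
            for (i = 0; i < n; i++) {
                Vps_printf("memTab[%d]: base=0x%08x size=%d space=%s\n",
                           i, (UInt32) memTab[i].base, (Int) memTab[i].size,
                           (memTab[i].space == IALG_EXTERNAL) ? "EXTERNAL"
                                                              : "INTERNAL");
            }
        }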

  • Hi Badri,

        Thank you for your reply.

        I used the xDM GenAlg wizard in CCSv5, so I paid no attention to memTab or IRES; I simply inserted my own functions into XXX_XX_process().  I will study the xDAIS documents concerning memTab and IRES -- I hope I can get a handle on this xDAIS stuff.

    Naroah

    Nov/26/2013

  • Hi Badri,

        The only place my algorithm uses memTab is:

    Int XXX_XX_alloc(const IALG_Params *algParams,
        IALG_Fxns **pf, IALG_MemRec memTab[])
    {
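        /* a single persistent record for the algorithm object, placed in
           external (DDR) memory rather than internal SRAM */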
        memTab[0].size = sizeof(XXX_XX_Obj);
        memTab[0].alignment = 0;
        memTab[0].space = IALG_EXTERNAL;
        memTab[0].attrs = IALG_PERSIST;

        ...

    }

        And I didn't use the IRES interface when I generated this algorithm with the GenAlg wizard.

        I didn't check the "Add IRES Interface" box.

        Moreover, my algorithm operates in-place.  At least, I have never tried to allocate a large chunk of memory inside the algorithm; only some temporary variables and small arrays (<64 B) are used.

        I have also read A Technical Overview of eXpressDSP-Compliant Algorithms for DSP Software Producers; chapter 5.3.4 says:

        Does that mean it is pointless for me to deal with the cache inside my algorithm?  By the way, I don't know how to enable/disable caching for my xDM algorithm, except by changing the MAR bits in "FC_RMAN_IRES_c6xdsp.cfg".
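
        (For what it's worth, the MAR bits only set whether a region is cacheable; per-buffer coherence on the DSP is normally handled by the framework around process() with writeback/invalidate calls.  A sketch using the SYS/BIOS Cache API, with illustrative buffer names:)

        #include <ti/sysbios/hal/Cache.h>

        /* Framework-side pattern around a process() call (sketch):
           invalidate the input before the DSP reads it, write back
           the output after the DSP has produced it. */
        Cache_inv(inBuf, inBufSize, Cache_Type_ALL, TRUE);
        /* ... run the algorithm's process() ... */
        Cache_wb(outBuf, outBufSize, Cache_Type_ALL, TRUE);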

    Naroah

    Nov/27/2013

  • OK, then print the address of your memTab[0] buffer and confirm it is in a cached memory region.
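
    (Note that memTab[].base is only filled in by the framework after algAlloc() returns, so a print like the sketch below belongs in algInit(); the function name follows the wizard-generated style used above:)

        Int XXX_XX_initObj(IALG_Handle handle, const IALG_MemRec memTab[],
                           IALG_Handle parent, const IALG_Params *params)
        {
            Vps_printf("memTab[0].base = 0x%08x\n", (UInt32) memTab[0].base);
            /* ... normal object initialization ... */
            return IALG_EOK;
        }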

  • Hi Badri,

        Using the XDS100v2 with CCSv5.50, connected to the C674x DSP, I stepped into XXX_XX_alloc() and walked through it, capturing the following from the watch window:

     And these below are for memTab[0]:

        According to these results, it seems that memTab is located in 0x9000_0000~0xA000_0000, i.e. SR1, which is cache-enabled by MAR128_159 = 0xFFFFFFFF.

    Naroah

    Nov/27/2013

  • You should look at the value of base, which is at address 0xCA3E7308. When was the screenshot taken? Was it after the allocation of the memTab memory?

  • This screenshot was taken just before the program executes "return(1)" in XXX_XX_alloc().

    What is the value of base for?

  • Hi Badri,

        OK, I see what you mean:

        When the alloc() function returns, xdm_server.c sets the algorithm handle by referring to memTab[0].base.  So this time I let the C6x core run through those lines and took another snapshot:

        Now I have the information for memTab[0].base: it is located at 0x9000_0090 and holds the value 0x9000_0100.

    Naroah

    Nov/27/2013

  • Hi Badri,

        Thank you for your generous help!  I have solved the problem!

        Yes, you are right.  It is definitely a cache problem, although the bug was "cached"/hidden in a dark corner of my cupboard.

        Although inBufDesc->descs[0].buf and the rest, such as the memTab buffers, are correctly cached, an argument table declared in my InArgs structure is frequently accessed by the kernel part of my algorithm.  By using the emulator to check these parameters, I found the InArgs located at 0xBFXX_XXXX, i.e. SR0, which is not cached by the DSP.

        After rearranging my code and reallocating memory for these arguments in SR1, my algorithm runs as fast as it did in EZSDK!
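
        (In sketch form, assuming the structure and table names below are hypothetical, the fix amounts to moving the frequently-accessed extension of InArgs from the default, non-cached SR0 into the cached SR1 heap:)

        IHeap_Handle heap = System_ipcGetSRHeap(1);     /* SR1, cached on DSP */
        XXX_XX_InArgs *inArgs = (XXX_XX_InArgs *)
            Memory_alloc(heap, sizeof(XXX_XX_InArgs), 128, NULL);
        inArgs->argTab = (Int32 *)
            Memory_alloc(heap, ARG_TAB_SIZE, 128, NULL); /* hypothetical table */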

        Thank you for your support again!

    Naroah

    Nov/27/2013