DVRRDK c6xdsp debug/release mode

Other Parts Discussed in Thread: SYSBIOS

I am working on DVRRDK 3.0.

Why is the DSP compiled in debug mode rather than release mode, while the M3 is compiled in release mode?

I want to change some compilation options in rules_c674.mk (suppress symbolic debug, etc.) so that the DSP can run with optimized performance.

Any suggestions? Thanks.

  • The DSP is compiled with all optimization flags enabled. You can remove the -g flag, but we have seen that it does not give a noticeable performance improvement because we already use the optimize_with_debug flag. The profile is just named "debug".

  • Yes, I have tried replacing --symdebug:dwarf with --symdebug:none, but saw no noticeable improvement. Can performance be improved if debug-related features are not enabled at all?

  • There is no debug-related feature. Which flag are you referring to? The compiler settings are already optimal for performance. If you think compiler flags are the cause, enable the compiler consultant flags, which will generate a report with recommendations, but I don't think you will get any significant boost from additional compiler settings. What is the performance issue you are seeing?

  • I am running some speed tests. I allocate a buffer from DSP_HEAPINT_MEM, use DMA to copy 16 rows of data into the buffer, perform some assignment statements on it, and then DMA it back to the original frame.

    #define DMA_BUF_WIDTH 960
    #define DMA_BUF_HEIGHT 16
    unsigned int n_buf_size = DMA_BUF_WIDTH*DMA_BUF_HEIGHT*2;

    extern const ti_sysbios_heaps_HeapMem_Handle DSP_HEAPINT_MEM;
    unsigned char* n_video_buf;

    Error_Block eb;

    Error_init(&eb);
    n_video_buf = (unsigned char*) Memory_alloc((xdc_runtime_IHeap_Handle)DSP_HEAPINT_MEM, n_buf_size, 8, &eb);

    dma_cnt = pSwOsdObj->videoWindowPrm.height/2 / DMA_BUF_HEIGHT;
    for (i=0; i<dma_cnt; i++)
    {
        // copy lower half of frame to buffer
        dmaPrm1.width = pSwOsdObj->videoWindowPrm.width;
        dmaPrm1.height = DMA_BUF_HEIGHT;
        dmaPrm1.srcStartX = 0;
        dmaPrm1.srcStartY = i*DMA_BUF_HEIGHT + pSwOsdObj->videoWindowPrm.height/2;
        dmaPrm1.destStartX = 0;
        dmaPrm1.destStartY = 0;

        status = Utils_dmaCopy2D(&pObj->dmaCh, &dmaPrm1, 1);
        UTILS_assert(status==FVID2_SOK);

        pt = n_video_buf;
        for (k=0; k<DMA_BUF_HEIGHT/2; k++)
        {
            for (j=0; j<72; j++)
            {
                pt[k*dmaPrm1.destPitch[0]+j] = pt[(k+DMA_BUF_HEIGHT/2)*dmaPrm1.destPitch[0]+j];
            }
        }

        pt = n_video_buf + (DMA_BUF_WIDTH*DMA_BUF_HEIGHT);
        for (k=0; k<DMA_BUF_HEIGHT/2/2; k++)
        {
            for (j=0; j<72; j++)
            {
                pt[k*dmaPrm1.destPitch[1]+j] = pt[(k+DMA_BUF_HEIGHT/2/2)*dmaPrm1.destPitch[1]+j];
            }
        }

        // copy buffer to upper half of frame
        dmaPrm2.width = pSwOsdObj->videoWindowPrm.width;
        dmaPrm2.height = DMA_BUF_HEIGHT;
        dmaPrm2.srcStartX = 0;
        dmaPrm2.srcStartY = 0;
        dmaPrm2.destStartX = 0;
        dmaPrm2.destStartY = i*DMA_BUF_HEIGHT;

        status = Utils_dmaCopy2D(&pObj->dmaCh, &dmaPrm2, 1);
        UTILS_assert(status==FVID2_SOK);
    }

    With this, I am processing only a 72 x 288 region of a frame, but the DSP load is already 40%. Without the two assignment statements (pt[i] = pt[j]) in the for loops, the load was only 15%. Why is it so slow? If I change them to memcpy, the load is about 19%.

  • What am I doing wrong that causes such a high load for so little processing?

    I also forgot to mention in the post above that the input to the DSP is 16ch D1 + 16ch CIF.

  • Please give me some idea of what I am doing wrong. Thanks.

  • Please profile the for loop where you copy the pixels and the DMA function separately. You can use the Timestamp_get32() API to get the start and end times. Take care of timer wraparound, as Timestamp_get32() wraps around once every 4 seconds or so.

    You are copying a huge amount of data by doing this processing for 32 ch x 30 fps x 72 lines, so some load is expected. You can use the mem_stats utility to measure the increase in DDR transactions with and without your loop.
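A minimal sketch of the wraparound handling mentioned above, assuming the start and end values come from two Timestamp_get32() calls and that the measured interval is shorter than one full wrap of the 32-bit counter. The names elapsed_ticks, profile_acc, profile_add, and profile_avg are hypothetical helpers for illustration, not part of the DVRRDK or SYS/BIOS.

```c
#include <stdint.h>

/* Ticks elapsed between two 32-bit timestamp readings.
 * Unsigned subtraction is performed modulo 2^32, so the result stays
 * correct when the counter wraps once between start and end. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return (uint32_t)(end - start);
}

/* Hypothetical accumulator used to average a measurement over many
 * iterations. The total is 64 bits wide so that it does not overflow. */
typedef struct {
    uint64_t total_ticks;
    uint32_t samples;
} profile_acc;

static void profile_add(profile_acc *acc, uint32_t start, uint32_t end)
{
    acc->total_ticks += elapsed_ticks(start, end);
    acc->samples++;
}

/* Average ticks per sample; returns 0 if nothing was recorded. */
static uint32_t profile_avg(const profile_acc *acc)
{
    if (acc->samples == 0) {
        return 0;
    }
    return (uint32_t)(acc->total_ticks / acc->samples);
}

/* Assumed usage on the target, with one accumulator per region:
 *
 *   t0 = Timestamp_get32();
 *   ... pixel copy loop ...
 *   profile_add(&loop_acc, t0, Timestamp_get32());
 */
```

Keeping one accumulator for the copy loop and another for the Utils_dmaCopy2D() calls would show which of the two accounts for the extra load.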

  • The current for loop, which copies just 72 bytes of each line byte by byte, increases the load by 25%, while using memcpy line by line for the whole row of 720 adds only about 4%. Is that expected?

    Which Utils_mem function API are you referring to for checking the increase in DDR transactions?

    Thanks

  • This is not expected. There is definitely something wrong with your code that is preventing the compiler from software pipelining the loop. Did you do the memcpy with the same src and dst? As I mentioned, enable the compiler consultant and check whether the loop is being disqualified for software pipelining.

    Add the compiler flags below to get verbose diagnostics:

    --gen_opt_info=2,--consultant,--verbose_diagnostics

    Also make sure you are not accessing any global variable in the loop. Also try adding the restrict qualifier to src and dest.
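The two suggestions above could be sketched as follows. This is an illustration only, not the poster's final code: copy_rows is a hypothetical helper, and it assumes the source and destination rows never overlap, which is what restrict promises to the compiler. The pitch is passed by value so that the inner loop does not re-read a structure member such as dmaPrm1.destPitch[0] on every iteration.

```c
#include <stddef.h>
#include <stdint.h>

/* Copy `rows` rows of `width` bytes from src to dst. Both regions use
 * the same pitch (bytes between the starts of consecutive rows).
 * The restrict qualifiers state that dst and src do not overlap, which
 * lets the compiler reorder loads and stores in the inner loop. */
static void copy_rows(uint8_t *restrict dst, const uint8_t *restrict src,
                      size_t pitch, size_t width, size_t rows)
{
    size_t k, j;

    for (k = 0; k < rows; k++) {
        /* Row pointers are computed once per row, outside the inner loop. */
        uint8_t *restrict d = dst + k * pitch;
        const uint8_t *restrict s = src + k * pitch;

        for (j = 0; j < width; j++) {
            d[j] = s[j];
        }
    }
}

/* Assumed usage for the first loop in the question, with the pitch
 * copied into a local variable first:
 *
 *   pitch = dmaPrm1.destPitch[0];
 *   copy_rows(pt, pt + (DMA_BUF_HEIGHT/2) * pitch,
 *             pitch, 72, DMA_BUF_HEIGHT/2);
 */
```

Whether this actually allows the loop to be software pipelined should be confirmed from the compiler consultant report rather than assumed.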

  • Another simple question, not related to this topic:

    If I create a local char array in a function, or create it in AlgLink_OsdObj, can I use it for DMA on the DSP? Or does the address need to be mapped to a physical address, as in the L2 RAM case?

  • A local array will be on the stack of the task, which is placed in DDR by default.

    Creating the array in AlgLink_OsdObj will place it in the .far section, which is also placed in DDR by default.

    DDR is mapped 1:1 (phy = virt), so no address translation is required.