DVRRDK c6xdsp debug/release mode

Thomas Lo

Other Parts Discussed in Thread: SYSBIOS

I am working on DVRRDK 3.0.

Why is the DSP compiled in debug mode rather than release mode, while the M3 is compiled in release mode?

I want to change some compilation options in rules_c674.mk (suppress symbolic debug, etc.) so that the DSP code is optimized for performance.

Any suggestions? Thanks.

over 12 years ago

0 Badri Narayanan over 12 years ago

TI__Guru 59700 points

The DSP is compiled with all optimization flags enabled. You can remove the -g flag, but we have seen that it does not give a noticeable performance improvement because we already use the optimize_with_debug flag. The profile is just called debug.

0 Thomas Lo over 12 years ago in reply to Badri Narayanan

Expert 1560 points

Yes, I have tried replacing --symdebug:dwarf with --symdebug:none, but there was no noticeable improvement. Can performance be improved if debug-related features are not enabled at all?

0 Badri Narayanan over 12 years ago in reply to Thomas Lo

TI__Guru 59700 points

There is no debug-related feature. Which flag are you referring to? The compiler settings are already optimal for performance. If you think the compiler flags are the cause, enable the compiler consultant flags, which will generate a report with recommendations, but I don't think you will get any significant boost from additional compiler settings. What is the performance issue you are seeing?

0 Thomas Lo over 12 years ago in reply to Badri Narayanan

Expert 1560 points

I am running some speed tests. I allocate a buffer from DSP_HEAPINT_MEM, then use DMA to copy 16 rows of data into the buffer, perform some assignment statements on it, and DMA it back to the original frame.

#define DMA_BUF_WIDTH 960
#define DMA_BUF_HEIGHT 16
unsigned int n_buf_size = DMA_BUF_WIDTH*DMA_BUF_HEIGHT*2;

extern const ti_sysbios_heaps_HeapMem_Handle DSP_HEAPINT_MEM;
unsigned char* n_video_buf;

Error_Block eb;

Error_init(&eb);
n_video_buf = (unsigned char*) Memory_alloc((xdc_runtime_IHeap_Handle)DSP_HEAPINT_MEM, n_buf_size, 8, &eb);

dma_cnt = pSwOsdObj->videoWindowPrm.height/2 / DMA_BUF_HEIGHT;
for (i=0; i<dma_cnt; i++)
{
    // copy lower half of frame to buffer
    dmaPrm1.width = pSwOsdObj->videoWindowPrm.width;
    dmaPrm1.height = DMA_BUF_HEIGHT;
    dmaPrm1.srcStartX = 0;
    dmaPrm1.srcStartY = i*DMA_BUF_HEIGHT + pSwOsdObj->videoWindowPrm.height/2;
    dmaPrm1.destStartX = 0;
    dmaPrm1.destStartY = 0;

    status = Utils_dmaCopy2D(&pObj->dmaCh, &dmaPrm1, 1);
    UTILS_assert(status==FVID2_SOK);

    pt = n_video_buf;
    for (k=0; k<DMA_BUF_HEIGHT/2; k++)
    {
        for (j=0; j<72; j++)
        {
            pt[k*dmaPrm1.destPitch[0]+j] = pt[(k+DMA_BUF_HEIGHT/2)*dmaPrm1.destPitch[0]+j];
        }
    }

    pt = n_video_buf + (DMA_BUF_WIDTH*DMA_BUF_HEIGHT);
    for (k=0; k<DMA_BUF_HEIGHT/2/2; k++)
    {
        for (j=0; j<72; j++)
        {
            pt[k*dmaPrm1.destPitch[1]+j] = pt[(k+DMA_BUF_HEIGHT/2/2)*dmaPrm1.destPitch[1]+j];
        }
    }

    // copy buffer to upper half of frame
    dmaPrm2.width = pSwOsdObj->videoWindowPrm.width;
    dmaPrm2.height = DMA_BUF_HEIGHT;
    dmaPrm2.srcStartX = 0;
    dmaPrm2.srcStartY = 0;
    dmaPrm2.destStartX = 0;
    dmaPrm2.destStartY = i*DMA_BUF_HEIGHT;

    status = Utils_dmaCopy2D(&pObj->dmaCh, &dmaPrm2, 1);
    UTILS_assert(status==FVID2_SOK);
}

With this, I am processing just 72 x 288 of a frame, but I already get 40% DSP load. Without the two assignment statements pt[i] = pt[j] in the for loops, the load was only 15%. Why is it that slow? If I change the loops to memcpy, the load is about 19%.

0 Thomas Lo over 12 years ago in reply to Thomas Lo

Expert 1560 points

What am I doing wrong that causes such a high load for so little processing?

I forgot to mention in the post above that the input to the DSP is 16ch D1 + 16ch CIF.

0 Thomas Lo over 12 years ago in reply to Badri Narayanan

Expert 1560 points

Please give me some idea of what I am doing wrong. Thanks.

0 Badri Narayanan over 12 years ago in reply to Thomas Lo

TI__Guru 59700 points

Please profile the for loop where you copy the pixels and the DMA function separately. You can use the Timestamp_get32() API to get the start and end times. Take care of timer wraparound, as Timestamp_get32() wraps around once every 4 seconds or so.

You are copying a huge amount of data by doing this processing for 32 ch x 30 fps x 72 lines, so some load is expected. You can use the mem_stats utility to measure the increase in DDR transactions with and without your loop.
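
For the Timestamp_get32() measurement suggested above, here is a minimal sketch of one way to handle the wraparound, assuming the counter wraps at most once between the two readings. `elapsed_ticks` is a hypothetical helper, not part of SYS/BIOS; on the target the two readings would come from Timestamp_get32() (declared in xdc/runtime/Timestamp.h).

```c
#include <stdint.h>

/* Hypothetical helper: ticks elapsed between two 32-bit timestamp readings.
 * Unsigned subtraction is performed modulo 2^32, so the result is still
 * correct if the counter wrapped once between `start` and `end`. */
static uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Intended use on the DSP (assumption, not taken from the thread):
 *
 *   uint32_t t0 = Timestamp_get32();
 *   ... code under test, e.g. only the pixel-copy loops ...
 *   uint32_t t1 = Timestamp_get32();
 *   uint32_t ticks = elapsed_ticks(t0, t1);
 */
```

This only works for intervals shorter than one full wrap period, so each measured region (the copy loops, or a single Utils_dmaCopy2D call) should be timed individually rather than across many frames.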

0 Thomas Lo over 12 years ago in reply to Badri Narayanan

Expert 1560 points

The current for loop, which copies just 72 bytes of each line byte by byte, increases the load by 25%, while using memcpy line by line for the whole row of 720 bytes adds only about 4%. Is that expected?

Which Utils_mem API are you referring to for checking the increase in DDR transactions?

Thanks

0 Badri Narayanan over 12 years ago in reply to Thomas Lo

TI__Guru 59700 points

This is not expected. There is definitely something wrong in your code that is preventing the compiler from software pipelining the loop. Did you do the memcpy from the same src to dst? As I mentioned, enable the compiler consultant and check whether the loop is being disqualified for software pipelining.

Add below compiler flags to get verbose diagnostics:

--gen_opt_info=2,--consultant,--verbose_diagnostics

Also make sure you are not accessing any global variable in the loop. Also try adding the restrict qualifier to src and dest.
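
As an illustration of the restrict suggestion, here is a minimal sketch of the row copy written as a separate function. `copy_rows` and its parameters are hypothetical names, not from the DVRRDK sources. The idea is that the pitch is passed by value, so it does not have to be re-read from a structure on every iteration, and `restrict` tells the compiler the two pointers never refer to overlapping memory, which is an assumption the caller must guarantee.

```c
#include <stddef.h>

/* Hypothetical helper: copy `width` bytes from each of `rows` lines, where
 * consecutive lines are `pitch` bytes apart in both source and destination.
 * The caller must ensure the source and destination regions do not overlap. */
static void copy_rows(unsigned char *restrict dst,
                      const unsigned char *restrict src,
                      size_t rows, size_t width, size_t pitch)
{
    size_t k, j;

    for (k = 0; k < rows; k++) {
        unsigned char *d = dst + k * pitch;        /* start of destination line */
        const unsigned char *s = src + k * pitch;  /* start of source line */

        for (j = 0; j < width; j++) {
            d[j] = s[j];
        }
    }
}
```

For the luma loop in the posted code, the call would look roughly like copy_rows(pt, pt + (DMA_BUF_HEIGHT/2) * pitch, DMA_BUF_HEIGHT/2, 72, pitch), with pitch read once from dmaPrm1.destPitch[0] before the call. Whether this actually lets the loop software pipeline still needs to be confirmed from the compiler's report.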

0 Thomas Lo over 12 years ago in reply to Badri Narayanan

Expert 1560 points

Another simple question, not related to this topic:

If I create a local char array in a function, or create it in AlgLink_OsdObj, can I use it for DMA on the DSP? Or does the address need to be mapped to a physical address, as in the L2 RAM case?

0 Badri Narayanan over 12 years ago in reply to Thomas Lo

TI__Guru 59700 points

A local array will be on the stack of the task, which is placed in DDR by default.

Creating the array in AlgLink_OsdObj will place it in the .far section, which is also placed in DDR by default.

DDR is mapped 1:1 (phy = virt), so no address translation is required.
