
DRA7xx: DSPLIB, MATHLIB calls provide degraded throughput!

Other Parts Discussed in Thread: MATHLIB, SYSBIOS

Hello,

We are working on a dra7xx-evm (OMAP5777) board with the following setup:

1. ipc_3_23_00_01

2. bios_6_37_03_30

3. xdctools_3_25_06_96

4. CCS5.5

5. dsplib_c66x_3_4_0_0

6. mathlib_c66x_3_1_0_0

We are running Linux on the ARM core and SYS/BIOS on the DSP core.

We profiled a simple vector addition call (DSPF_sp_vecadd) on the DRA7xx DSP1 core running at a 600 MHz clock with SYS/BIOS 6.37.

The length of the vector is 40000 float values.


As per the TI-provided benchmark, this call takes (3/4 * N + 24) cycles, which for N = 40000 is (3/4 * 40000 + 24) = 30024 cycles.

Since our DSP1 core runs at 600 MHz, the benchmark figure translates to 30024 cycles / 0.6 cycles/ns = 50040 ns = 50.04 microseconds.

But when we profile the same call in our code it takes 541.3 microseconds, more than 10 times the benchmark figure. We compiled our code with the -O3 optimization flag; the MATHLIB and DSPLIB libraries are also compiled with -O3.

 

Please note: we used Clock and Timestamp calls to profile the code. We verified the accuracy of the Timestamp calls beforehand by individually profiling them against a Task_sleep of 1 second, so there is no ambiguity in the profiled figures. A sketch of the measurement approach is shown below.
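A minimal sketch of the measurement approach, using the XDC runtime Timestamp API; the buffer names and the conversion to nanoseconds are illustrative, not taken from our actual code:

#include <xdc/std.h>
#include <xdc/runtime/Timestamp.h>
#include <xdc/runtime/Types.h>
#include <xdc/runtime/System.h>
#include <ti/dsplib/dsplib.h>

#define N 40000
float x1[N], x2[N], y[N];

void profile_vecadd(void)
{
    Types_FreqHz freq;
    UInt32 t0, t1, ns;

    Timestamp_getFreq(&freq);           /* timestamp ticks per second */

    t0 = Timestamp_get32();
    DSPF_sp_vecadd(x1, x2, y, N);       /* call under test */
    t1 = Timestamp_get32();

    /* convert elapsed ticks to nanoseconds (assumes freq.hi == 0) */
    ns = (UInt32)(((UInt64)(t1 - t0) * 1000000000ULL) / freq.lo);
    System_printf("DSPF_sp_vecadd: %u ns\n", (UInt)ns);
}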

Please let us know how we can improve the DSPLIB throughput. The addition call above is one piece of the algorithm our application uses; the algorithm performs multiple addition, multiplication, FFT, sqrt and various other vector operations. All of these calls show degraded performance, nowhere near the benchmark figures.


Please shed some light on the same.

Thanks,

Naveen Shetti

  • Unfortunately, the answer to your question is the same, or mostly the same, as the answer here:

    e2e.ti.com/.../387007

    Essentially, the difference is the flat memory model of the simulator, which was used for the benchmark values, versus silicon-based measurements that incorporate cache. For larger arrays the performance degrades even more and the equation no longer applies. For future reference, we are looking at adding silicon-measured results to the benchmark data.
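    As an illustration of the cache dependence, a minimal sketch of toggling DDR cacheability from SYS/BIOS, assuming the C66 Cache module's MAR interface; the base address and length are illustrative:

    #include <xdc/std.h>
    #include <ti/sysbios/family/c66/Cache.h>

    void set_ddr_cacheable(Bool enable)
    {
        /* each MAR bit covers a 16 MB region; this marks 0x95000000..0x95FFFFFF */
        Cache_setMar((Ptr)0x95000000, 0x1000000,
                     enable ? Cache_Mar_ENABLE : Cache_Mar_DISABLE);
    }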

    Regards,
    Travis
  • Hello Travis,

    Below are a few observations from tests conducted on the DSPLIB multiply call. We also enabled the optimization flags mentioned in the DSPLIB/MATHLIB manuals: -O3 -ms0 --symdebug:none --optimize_with_debug=on --opt_for_speed=5. Cache is also enabled for the buffers created in DDR.
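    For reference, a sketch of how these flags can be passed to the C6000 compiler directly on the command line; the cl6x invocation below is illustrative and the source file name is a placeholder:

    cl6x -mv6600 -O3 -ms0 --symdebug:none --optimize_with_debug=on --opt_for_speed=5 test_dsplib.c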

    We profiled this DSPLIB multiply call against a simple C for-loop multiply over 40k float values.
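    For clarity, a minimal sketch of the two variants being compared; the function and buffer names are illustrative, not from our actual test code:

    #include <ti/dsplib/dsplib.h>

    #define N 40000
    float a[N], b[N], out[N];

    /* plain C reference loop ("Loop Multiply") */
    void loop_mul(const float *x, const float *y, float *r, int n)
    {
        int i;
        for (i = 0; i < n; i++) {
            r[i] = x[i] * y[i];
        }
    }

    void run_both(void)
    {
        loop_mul(a, b, out, N);         /* "Loop Multiply" case   */
        DSPF_sp_vecmul(a, b, out, N);   /* "Vector Multiply" case */
    }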


    Profile results:
    -------------------------------------------------------------------------
    Buffer location               Vector Multiply (ns)    Loop Multiply (ns)
    DDR (cache enabled)                         538031                495208
    DDR (cache disabled)                       8638455               5510838
    L2SRAM                                      108815                108817



    Observations:
    ---------------------
    From the profile figures above we can conclude that:

    1. Operations performed in L2SRAM are much faster than those in DDR memory.

    2. The vector operation does not provide any significant performance improvement in any of the scenarios above. In most cases (when the vector size is large) we have observed that a simple for-loop multiply performs better than the DSPLIB multiply call.

    3. The simulator benchmark figure provided by TI for this DSPLIB vector call is (3/4 * N + 24) cycles, which for N = 40000 is 30024 cycles; at 600 MHz this translates to 30024 / 0.6 = 50040 ns = 50.04 microseconds. Our profile figures are nowhere near this.

    4. We are limited by the L2SRAM size (max 4 MB per the C66x CorePac datasheet). Our application's vector memory requirement for the various operations is much higher than this, so we cannot perform all our operations in L2SRAM.

    Please let us know how to improve DSPLIB/MATHLIB performance. Is there anything we have missed?


    We are also facing another issue:

    1. When we create the buffers in L2SRAM, we can load the test application through CCS/JTAG and execute it. The L2SRAM profile figures above are the output of the test code loaded directly through CCS.

    2. However, if we try to load the same application from the SD card, the corresponding DSP core does not boot at all. This happens only when the buffers are in L2SRAM and we boot from the SD card; if the buffers are in any other section, such as HEAP, booting works fine. Please help!


    Thanks,
    Naveen Shetti
  • Naveen,

    The DSP library has a floating-point vector multiply function, DSPF_sp_vecmul. You can use it with the data in L2SRAM to achieve the best performance. Since you have a large vector multiply requirement, it is advised that you use EDMA to bring data in and send it out between DDR and L2SRAM. You can create ping-pong buffers for both input and output, so that while the CPU works on the ping buffer the EDMA engine transfers data between memories through the pong buffer. That way you hide the data-transfer latencies and achieve the best CPU performance. A sketch of this scheme is shown below.
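    A minimal sketch of the ping-pong scheme, assuming hypothetical edma_copy_async()/edma_wait() helpers as stand-ins for whatever EDMA driver API is used (for example, the EDMA3 LLD); the block size and the ".l2buf" section name are illustrative:

    #include <stddef.h>
    #include <ti/dsplib/dsplib.h>

    #define BLK 4096                     /* floats per working block in L2SRAM */

    /* ping-pong working blocks, placed in an L2SRAM-mapped section */
    #pragma DATA_SECTION(inA, ".l2buf")
    float inA[2][BLK];
    #pragma DATA_SECTION(inB, ".l2buf")
    float inB[2][BLK];
    #pragma DATA_SECTION(outBuf, ".l2buf")
    float outBuf[2][BLK];

    /* hypothetical DMA helpers, not a real driver API */
    extern void edma_copy_async(void *dst, const void *src, size_t bytes);
    extern void edma_wait(void);

    void vecmul_blocked(const float *srcA, const float *srcB, float *dst, int n)
    {
        int nblk = n / BLK, i, cur = 0;

        /* prime the first (ping) block */
        edma_copy_async(inA[0], srcA, BLK * sizeof(float));
        edma_copy_async(inB[0], srcB, BLK * sizeof(float));
        edma_wait();

        for (i = 0; i < nblk; i++) {
            int nxt = cur ^ 1;
            if (i + 1 < nblk) {          /* prefetch pong while CPU works on ping */
                edma_copy_async(inA[nxt], srcA + (i + 1) * BLK, BLK * sizeof(float));
                edma_copy_async(inB[nxt], srcB + (i + 1) * BLK, BLK * sizeof(float));
            }
            DSPF_sp_vecmul(inA[cur], inB[cur], outBuf[cur], BLK);
            edma_copy_async(dst + i * BLK, outBuf[cur], BLK * sizeof(float));
            edma_wait();                 /* wait for prefetch and write-back */
            cur = nxt;
        }
    }

    The block size trades off L2SRAM usage against how well the DMA transfers hide behind the compute; it would need to be tuned for the actual algorithm.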

    Regards,

    Asheesh

  • Hello Asheesh,

    We used the same floating-point vector multiply function for the profile figures provided earlier.

    Our application requires multiple buffers of different sizes to be allocated in L2SRAM, and different vector operations act on these buffers. The total size of these buffers exceeds 4 MB, so we cannot keep all of them in L2SRAM at the same time.

    We will definitely consider the EDMA option if necessary. However, we are currently facing a problem whenever we allocate any buffer in L2SRAM and try to boot the DSP image from the SD card. Please help us resolve this.

    We are using the "remoteproc" driver to boot these DSP images. The images boot fine if no L2SRAM buffers are allocated in the test code.

    Path : /home/mistral/ti-glsdk_dra7xx-evm_6_10_00_02/board-support/linux/arch/arm/mach-omap2/remoteproc.c


    static struct omap_rproc_pdata dra7_rproc_data[] = {
        {
            .name         = "dsp1",
            .firmware     = "dra7-dsp1-fw.xe66",
            .mbox_name    = "mbox-dsp1",
            .oh_name      = "dsp1",
            .timers       = dra7_dsp1_timers,
            .timers_cnt   = ARRAY_SIZE(dra7_dsp1_timers),
            .set_bootaddr = dra7_ctrl_write_dsp1_boot_addr,
        },
        {
            .name         = "dsp2",
            .firmware     = "dra7-dsp2-fw.xe66",
            .mbox_name    = "mbox-dsp2",
            .oh_name      = "dsp2",
            .timers       = dra7_dsp2_timers,
            .timers_cnt   = ARRAY_SIZE(dra7_dsp2_timers),
            .set_bootaddr = dra7_ctrl_write_dsp2_boot_addr,
        },
        ....
        ....
    };

    Please also find below snippets from our app.cfg, linker.cmd, and test application.


    /* app.cfg Snippet */

    Program.sectMap[".matrix"]   = "L2SRAM";

     

    /* linker.cmd Snippet */

    MEMORY
    {

        EXT_CODE (RWX) : org = 0x95000000, len = 0x200000
        EXT_DATA (RW) : org = 0x95200000, len = 0x800000
        EXT_HEAP (RW) : org = 0x95a00000, len = 0x3200000
        TRACE_BUF (RW) : org = 0x9f000000, len = 0x60000
        EXC_DATA (RW) : org = 0x9f060000, len = 0x10000
        PM_DATA (RWX) : org = 0x9f070000, len = 0x20000
        SR_0 (RWX) : org = 0xbfc00000, len = 0x6400000
        L2SRAM (RWX) : org = 0x800000, len = 0x400000
    }

    SECTIONS
    {
        ...
        .matrix: load > L2SRAM
        ...
    }

    /* Test Code Snippet */

    #define SP_DW_ALIGN        4
    #define SP_IQ_MAX_COUNT    (40000)

    #pragma DATA_SECTION(gpf32LocalBuffI, ".matrix")
    #pragma DATA_ALIGN(gpf32LocalBuffI, SP_DW_ALIGN)
    FLOAT gpf32LocalBuffI[SP_IQ_MAX_COUNT];

    #pragma DATA_SECTION(gpf32LocalBuffQ, ".matrix")
    #pragma DATA_ALIGN(gpf32LocalBuffQ, SP_DW_ALIGN)
    FLOAT gpf32LocalBuffQ[SP_IQ_MAX_COUNT];

     

    /* DSPLIB call in main() */

    DSPF_sp_vecmul(gpf32LocalBuffI, gpf32LocalBuffI, gpf32LocalBuffQ, 40000);

    Regards

    Naveen Shetti


  • Naveen,

    To get performance for the algorithm, partition your buffers so that you can transfer the data using DMA. The size of the buffers will depend on your algorithm and the available L2SRAM.
    The SD card boot is a separate issue from the performance of the DSPLIB algorithms.

    I suggest you start a separate thread for the SD card boot issue.

    Regards

    Asheesh

  • Hello,
    How can I add these optimization flags: -O3 -ms0 --symdebug:none --optimize_with_debug=on --opt_for_speed=5?
    I am not using Code Composer Studio; I am working on Linux.
    Regards
  • Hi Mostafa,
    Is this related to the current post? If it isn't, you can create a fresh post for this question and you'll get it answered.
    Thanks,
    Moses