C6Run Benchmark DSP much Slower than ARM even for different memory maps

Other Parts Discussed in Thread: DM3730

Hello

I am new to TI DSP development and have been trying to figure out why I can't get DSP code to run faster. I have read through the forums as much as I could but have not figured out what my issue is. I have tried adjusting the DSP memory mapping, but that does not affect anything at all. I think I am missing a key point.

I have attached a text file that contains the benchmark runs for two different memory maps.

Here are two different memory maps I did for C6RUN:

First I had DSP_REGION_CMEM_SIZE at 16 MB and
DSP_REGION_CODE_SIZE  at 13 MB

then I tried DSP_REGION_CMEM_SIZE at 28 MB and
DSP_REGION_CODE_SIZE  at 28 MB

Neither change made any difference. When I checked lsmod for both, there was no difference either, even though I suspect there should be.

I followed all the steps in the C6Run readme for setting up the platform, etc., and I do not run into any runtime errors.

I read that I am NOT supposed to adjust the DSPLINK memory map, as the platform config does that for me.

How do I get the DSP to run faster? What am I missing?

Any recommendations would be great.

  • Performance depends on:

    1. What code are you trying to run, and is it suitable for the DSP? Generally the DSP is good at running code that does the same thing over and over again, so code built around loops will do better on the DSP.
    2. What options are you using to compile the code? See http://processors.wiki.ti.com/index.php/C6RunLib_Documentation#Common_Command-line_Options and make sure you are using options that help improve performance.
    3. How much work are you asking the DSP to do? Every time the ARM offloads processing to the DSP, there is some overhead involved. Make sure each offloaded call gives the DSP enough work; otherwise the calling overhead will dominate. I think the ballpark for the overhead is ~200 µs.
    4. Are you helping the DSP codegen tools get the best performance by giving them the relevant information? See the document at http://www.ti.com/lit/pdf/sprabf2. By helping the DSP codegen tools avoid worst-case assumptions about your code, you can gain significant performance.
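To illustrate point 4, here is a minimal sketch of the kind of hints that let a compiler avoid worst-case assumptions. The function name `dot_s16` and the trip-count contract (at least 8 elements, a multiple of 4) are made up for illustration. `restrict` is standard C99; the `MUST_ITERATE` pragma is specific to the TI compiler, so it is guarded and ignored elsewhere. How much this helps depends on your code and compiler options.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical example kernel: multiply-accumulate over two 16-bit arrays.
 * `restrict` promises the compiler that x and y never alias, so it is free
 * to reorder loads and overlap loop iterations. */
int32_t dot_s16(const int16_t *restrict x, const int16_t *restrict y, int n)
{
    int32_t acc = 0;
    int i;

    /* Caller contract assumed for this sketch. */
    assert(n >= 8 && n % 4 == 0);

#ifdef __TI_COMPILER_VERSION__
    /* TI compiler only: loop runs at least 8 times, in multiples of 4. */
    #pragma MUST_ITERATE(8, , 4)
#endif
    for (i = 0; i < n; i++) {
        /* Widen before multiplying; acc can still overflow for large n. */
        acc += (int32_t)x[i] * y[i];
    }
    return acc;
}
```

The pragma only states facts the caller must actually guarantee; if the contract is violated, the generated code may be wrong.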

     

  • Thank you for your response, Gagan. I was running all the included C6Run example code (bench_dsp and cfft_dsp) and comparing it with the ARM builds (bench_arm and cfft_arm).

    Both ARM versions run much faster. I used the standard memory map and the proper u-boot environment variable setup. I searched the forum and found someone else:

    http://e2e.ti.com/support/dsp/omap_applications_processors/f/447/p/70317/255208.aspx#255208

    who wrote a matrix calculation sample, so I ran that too. The DSP version ran at close to the same speed as the ARM version.

    I also ran the C6Accel sample code (c6accel_app), which goes through each function and logs the time it takes. Compared with the PDF TI provided, my numbers are slower as well.

    Running on DM3730. 

    I will look more into what you wrote (points 2, 3 and 4) and will report back if anything helps.

    Thanks

    Steve

  • Steve, one other thing that I didn't mention is the impact of running floating-point code. On the DM3730, the DSP is fixed point, whereas the A8 supports floating point. So if the benchmark you are running is natively floating point, the performance of the DSP will not be great. There are fixed-point versions of the FFTs provided in DSPLIB.
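As a rough illustration of what "fixed point" means here, the sketch below multiplies two fractional values using only integer operations in the common Q15 format. This is not DSPLIB code; the names `q15_t`, `q15_from_double` and `q15_mul` are made up for the example, and it assumes right shifts of negative integers are arithmetic, which holds on common compilers but is implementation-defined in C.

```c
#include <assert.h>
#include <stdint.h>

/* Q15: a value v in [-1, 1) is stored as the 16-bit integer v * 32768. */
typedef int16_t q15_t;

/* Convert a double in [-1, 1) to Q15, rounding to nearest and clamping. */
q15_t q15_from_double(double v)
{
    double scaled;

    assert(v >= -1.0 && v < 1.0);   /* Q15 cannot represent +1.0 */
    scaled = v * 32768.0 + (v >= 0.0 ? 0.5 : -0.5);
    if (scaled > 32767.0)  scaled = 32767.0;
    if (scaled < -32768.0) scaled = -32768.0;
    return (q15_t)scaled;
}

/* Multiply two Q15 values with integer arithmetic only. */
q15_t q15_mul(q15_t a, q15_t b)
{
    int32_t p = (int32_t)a * b;   /* 32-bit product in Q30 */

    p += 1 << 14;                 /* add half an LSB to round */
    p >>= 15;                     /* scale back from Q30 to Q15 */
    if (p > 32767) {
        p = 32767;                /* only (-1) * (-1) can overflow */
    }
    return (q15_t)p;
}
```

For example, 0.5 is stored as 16384, and `q15_mul(16384, 16384)` gives 8192, which is 0.25 in Q15. A floating-point benchmark does none of this by hand, so the fixed-point DSP has to emulate the floating-point operations in software.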

    The other thing to note is the CPU frequency of the two cores, which may differ. You should account for that when comparing performance.
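One way to account for the clock difference is to compare cycle counts rather than wall-clock times, after removing the fixed per-call offload overhead mentioned earlier in the thread (ballpark ~200 µs). The helpers below are a hypothetical sketch, not part of C6Run; the clock rates and overhead you pass in are assumptions you would need to check against your own board and measurements.

```c
#include <assert.h>
#include <stdint.h>

/* Convert elapsed microseconds to core cycles: cycles = us * MHz. */
uint64_t cycles_from_us(uint64_t elapsed_us, uint32_t core_mhz)
{
    assert(core_mhz > 0);
    return elapsed_us * (uint64_t)core_mhz;
}

/* Estimate time spent in the kernel itself by subtracting a fixed
 * per-call offload overhead from the measured total. Returns 0 when the
 * overhead estimate exceeds the measurement. */
uint64_t kernel_us(uint64_t measured_us, uint64_t calls,
                   uint64_t overhead_us_per_call)
{
    uint64_t overhead = calls * overhead_us_per_call;

    return (measured_us > overhead) ? measured_us - overhead : 0;
}
```

For instance, `kernel_us(1000, 2, 200)` leaves 600 µs of actual work out of a 1000 µs measurement; if the result is close to zero, the benchmark is mostly measuring call overhead rather than DSP throughput.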

    Cheers,
    Gagan