
Linux/AM5728: AM5728

Part Number: AM5728
Other Parts Discussed in Thread: SYSBIOS

Tool/software: Linux

Hi 

We are working on a custom board built around the AM5728 (based on Processor SDK 05.02). Our ultimate goal is to offload processing to the DSPs using OpenCL or IPC.

First we tested the simplest approach, without Linux on the ARM: running the program on the DSP via JTAG and measuring the calculation time gave 18.5 ms.

But when Linux (Processor SDK 5.02) runs on the ARM A15 and I run the program on DSP1 via IPC, it takes 32 ms. I see the same problem when using OpenCL to run code on the DSP.

Would you please shed some light on why we see this poor performance in our OpenCL- and IPC-based applications?

Any help would greatly be appreciated.

Kind regards,

Esmaill

  • Hi, Esmaill,

    There was a similar discussion on this. Please take a look at the reply from ran35366 in the thread, e2e.ti.com/.../560396

    If it answers your questions, please click "Resolved". Thanks!

    Rex

    I am comparing the performance of code running on the DSP while Linux runs on the ARM A15 against the same code loaded onto the DSP over JTAG with no Linux on the ARM.
    When Linux is running on the ARM, I load the code onto the DSP using IPC or OpenCL.
    When Linux is not running, I load the code onto the DSP over JTAG.
    In the first case the measured time is 32 ms; in the second, 18.5 ms.
    Why is the DSP performance different in the two cases?
    Would you please shed some light on why we see this poor performance in our OpenCL- and IPC-based applications?

    Any help would greatly be appreciated.

    Kind regards,

    Esmaill
  • Hi, Esmaill,

    I am not sure how your DSP application is coded. Where are the code and data located? If they are in DDR3, then the two cases are not an apples-to-apples comparison: with Linux running, the DSP competes with Linux for DDR3 bandwidth. Try moving the DSP code to local L2 memory, if it is not there already, and see if that makes a difference.
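One way to experiment with this from the application's .cfg is the XDC sectMap mechanism. This is a sketch; the section names below are examples, and "L2SRAM" must match a memory segment defined in your platform file:

```
/* Route selected output sections to on-chip L2 instead of DDR3.
 * "L2SRAM" must be a memory name from the platform definition. */
Program.sectMap[".text"]  = "L2SRAM";
Program.sectMap[".stack"] = "L2SRAM";
```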

    Rex
    Both cases are under the same conditions. We located the code and data in DDR. Our code is too big to fit in L2, so we configured 256 KB of L2 as cache and, to improve performance, placed some sections (e.g. .stack: load > L2SRAM) in the 32 KB of L2 left over after the cache.
    We checked the EMIF bandwidth with the command below:
    omapconf trace perf -t 30 -d 3 -s 0.1 -c trace_perf_config.dat
    The result was 16% utilization.

    So it does not seem to be a bandwidth problem.

    Kind regards,

    Esmaill

  • Hi, Esmaill,

    Linux runs out of DDR3, so when kernel context is executing, DSP accesses to DDR3 are affected. Try a small benchmark that fits entirely in L2 memory and repeat the comparison. I also think the more meaningful comparison is ARM Linux alone versus ARM Linux plus DSP RTOS, to see the benefit of the DSP.

    Rex
  • Rex, 

    Thank you for the effort. We will implement your proposed benchmark.

    We used the structure of the big data IPC example for executing code on the DSP. To enable caching on the DSP, we added a few lines to the DSP .cfg file:

    var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
    Cache.setMarMeta(EXT_DATA.base, EXT_DATA.len, Cache.PC|Cache.PFX);
    Cache.setMarMeta(EXT_CODE.base, EXT_CODE.len, Cache.PC|Cache.PFX);

    This code uses virtual addresses. Is it possible that the MMU reduces performance?

    Our code is larger than L2, the data set is about 1 MB, and we use IPC to run the code on the DSP (same structure as the big data example). What do you suggest to increase performance in this case?

    Best regards,

    Esmaill


  • Hi,

    We have created a benchmark whose code fits in L2.
    With Linux not running on the ARM, we measured 10,000,000 execution cycles.

    Then we ran the same code on the DSP from Linux using IPC. With the code inside a task, the cycle count was 20,000,000; but with the code placed before BIOS_start() (in main()), it was 8,500,000.
    It is very strange that the code takes longer after BIOS_start().

    Would you please shed some light on why performance is so poor in this case (after BIOS_start())?

    Any help would greatly be appreciated.

    Kind regards,

    Esmaill
  • Hi, Esmaill,

    First, I was hoping for a benchmark run on Linux/ARM compared against the same work offloaded to the DSP using IPC. Even then, depending on the algorithm, the offload may not show a big advantage, because the ARM architecture handles a lot of complicated computing well. In any case, my point is that the benefit of IPC is to offload heavy computation to the DSP and free the ARM so that other processes can run at the same time.

    On your question about the different behavior before BIOS_start() versus inside a task: I have raised it with a colleague and will have him work with you.

    Rex
  • Hi,

    We know that offloading complex algorithms to the DSP reduces execution time compared to ARM/Linux. But we want to get the maximum computation out of the DSP while Linux is running on the ARM, and when we execute the code on the DSP using IPC (under Linux), the execution time doubles. We expected to get the full computational power of the DSP in this case. As the benchmark shows, the doubling is not due to DDR bandwidth.

    We also wrote a program in which BIOS runs on the ARM and we offload our algorithm to the DSP using IPC. The execution time increased by only 600 us compared to running directly on the DSP. We used the big data example structure for this code.

    Best regards,

    Esmaill

  • Hi,

    Q1. In the Linux offload-to-DSP case, what is the DSP core clock speed? In the DSP JTAG standalone case without Linux/ARM, what is the DSP core clock speed? Are they the same?

    Q2. Is your benchmark a CPU+memory benchmark, i.e. with no peripherals involved?

    Q3. Then, we implemented the code in Linux using IPC on the DSP. We first put the code in the task.
    In this case, the number of cycles was 20,000,000. But when we put the code before bios.start() (in the main), the number of cycles was 8,500,000 .
    This result is very strange that it takes longer to execute the code after bios.start().
    =========> Without Linux, benchmarking the code on the DSP over JTAG only: 1) placed before BIOS_start(); 2) created as a task. Do both give the same performance?

    Regards, Eric
  • Hi,

    Yes, they are the same: 750 MHz.
    Yes, it is a CPU+memory benchmark.
    Yes, no peripherals are involved.

    We ran the benchmark code on the DSP over JTAG. With the code inside a task, the cycle count was 11,000,000; with the code before BIOS_start() (in main()), it was 8,200,000.
    In this benchmark, by varying the number of runs of the code (putting it inside a for loop), we observe a fixed overhead of 3 ms.

    We also wrote a program in which BIOS runs on the ARM and we offload our algorithm to the DSP using IPC. The execution time increased by only 600 us compared to running directly on the DSP. We used the big data example structure for this code.

    Regards, Esmaill
  • Hi,

    In the big data example, the CMEM memory was not cached on the DSP. After enabling caching for the CMEM region, the two times drew much closer together.
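For anyone who hits the same issue, the change was along these lines in the DSP .cfg. This is a sketch: the base address and size below are placeholders, so take the real values from your own CMEM carve-out, and remember that cached shared buffers need cache writeback/invalidate around IPC transfers:

```
var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
/* Placeholder base/len -- use your actual CMEM block address and size */
Cache.setMarMeta(0xA0000000, 0x02000000, Cache.PC | Cache.PFX);
```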

    Regards, Esmaill