
Linux/AM5728: AM5728

Part Number: AM5728
Other Parts Discussed in Thread: SYSBIOS

Tool/software: Linux

Hi 

We are working on a custom board built around the AM5728 (based on Processor SDK 05.02). Our ultimate goal is to offload processing to the DSPs using OpenCL or IPC.

First we tested the simplest approach, without Linux on the ARM: running the program on the DSP via JTAG and measuring the calculation time gave 18.5 ms.

But when Linux (Processor SDK 5.02) runs on the ARM A15 and I run the program on DSP1 via IPC, it takes 32 ms. I see the same problem when using OpenCL to run code on the DSP.

Would you please shed some light on why we see this poor performance in our OpenCL- and IPC-based applications?

Any help would greatly be appreciated.

Kind regards,

Esmaill

  • Hi, Esmaill,

    There was a similar discussion on this. Please take a look at the reply from ran35366 in the thread, e2e.ti.com/.../560396

    If it answers your questions, please click "Resolved". Thanks!

    Rex

    I am comparing the performance of code running on the DSP while Linux runs on the ARM A15 against the same code loaded onto the DSP over JTAG with no Linux on the ARM.
    When Linux is running on the ARM, I load the code onto the DSP using IPC or OpenCL.
    When Linux is not running, I load the code onto the DSP over JTAG.
    In the first case the measured time is 32 ms; in the second, 18.5 ms.
    Why is the DSP performance different in the two cases?
    Would you please shed some light on why we see this poor performance in our OpenCL- and IPC-based applications?

    Any help would greatly be appreciated.

    Kind regards,

    Esmaill
  • Hi, Esmaill,

    I am not sure how your DSP application is coded. Where are the code and data located? If they are in DDR3, then the two cases are not an apples-to-apples comparison: with Linux running, the DSP competes with Linux for DDR3 bandwidth. Try moving the DSP code to local L2 memory, if it is not there already, and see if that makes a difference.
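One way to experiment with this from the application's .cfg is the XDC sectMap mechanism. This is a sketch; the section names below are examples, and "L2SRAM" must match a memory segment defined in your platform file:

```
/* Route selected output sections to on-chip L2 instead of DDR3.
 * "L2SRAM" must be a memory name from the platform definition. */
Program.sectMap[".text"]  = "L2SRAM";
Program.sectMap[".stack"] = "L2SRAM";
```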

    Rex
    Both cases are under the same conditions. We located the code and data in DDR. Our code is too big to fit in L2, so we configured 256 KB of L2 as cache and, to improve performance, placed some sections (e.g. .stack: load > L2SRAM) in the 32 KB of L2 left over after the cache.
    We checked the EMIF bandwidth with the command below:
    omapconf trace perf -t 30 -d 3 -s 0.1 -c trace_perf_config.dat
    The result was 16% utilization.

    So it does not seem to be a bandwidth problem.

    Kind regards,

    Esmaill

  • Hi, Esmaill,

    Linux runs out of DDR3, so when kernel context is executing, DSP accesses to DDR3 are affected. Try a small benchmark that fits entirely in L2 memory and repeat the comparison. I also think the more meaningful comparison is ARM Linux alone versus ARM Linux plus DSP RTOS, to see the benefit of the DSP.

    Rex
  • Rex, 

    Thank you for the effort. We will implement your proposed benchmark.

    We used the structure of the big data IPC example for executing code on the DSP. To enable caching on the DSP, we added a few lines to the DSP .cfg file:

    var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
    Cache.setMarMeta(EXT_DATA.base, EXT_DATA.len, Cache.PC|Cache.PFX);
    Cache.setMarMeta(EXT_CODE.base, EXT_CODE.len, Cache.PC|Cache.PFX);

    This code uses virtual addresses. Is it possible that the MMU reduces performance?

    Our code is larger than L2, the data set is about 1 MB, and we use IPC to run the code on the DSP (same structure as the big data example). What do you suggest to increase performance in this case?

    Best regards,

    Esmaill


  • Hi,

    We have created a benchmark whose code fits in L2.
    With Linux not running on the ARM, we measured 10,000,000 execution cycles.

    Then we ran the same code on the DSP from Linux using IPC. With the code inside a task, the cycle count was 20,000,000; but with the code placed before BIOS_start() (in main()), it was 8,500,000.
    It is very strange that the code takes longer after BIOS_start().

    Would you please shed some light on why performance is so poor in this case (after BIOS_start())?

    Any help would greatly be appreciated.

    Kind regards,

    Esmaill
  • Hi, Esmaill,

    First, I was hoping for a benchmark run on Linux/ARM compared against the same work offloaded to the DSP using IPC. Even then, depending on the algorithm, the offload may not show a big advantage, because the ARM architecture handles a lot of complicated computing well. In any case, my point is that the benefit of IPC is to offload heavy computation to the DSP and free the ARM so that other processes can run at the same time.

    On your question about the different behavior before BIOS_start() versus inside a task: I have raised it with a colleague and will have him work with you.

    Rex
  • Hi,

    We know that offloading complex algorithms to the DSP reduces execution time compared to ARM/Linux. But we want to get the maximum computation out of the DSP while Linux is running on the ARM, and when we execute the code on the DSP using IPC (under Linux), the execution time doubles. We expected to get the full computational power of the DSP in this case. As the benchmark shows, the doubling is not due to DDR bandwidth.

    We also wrote a program in which BIOS runs on the ARM and we offload our algorithm to the DSP using IPC. The execution time increased by only 600 us compared to running directly on the DSP. We used the big data example structure for this code.

    Best regards,

    Esmaill

  • Hi,

    Q1. In the Linux offload-to-DSP case, what is the DSP core clock speed? In the DSP JTAG standalone case without Linux/ARM, what is the DSP core clock speed? Are they the same?

    Q2. Is your benchmark a CPU+memory benchmark, i.e. with no peripherals involved?

    Q3. Then, we implemented the code in Linux using IPC on the DSP. We first put the code in the task.
    In this case, the number of cycles was 20,000,000. But when we put the code before bios.start() (in the main), the number of cycles was 8,500,000 .
    This result is very strange that it takes longer to execute the code after bios.start().
    =========> Without Linux, benchmarking the code on the DSP over JTAG only: 1) placed before BIOS_start(); 2) created as a task. Do both give the same performance?

    Regards, Eric
  • Hi,

    Yes, they are the same: 750 MHz.
    Yes, it is a CPU+memory benchmark.
    Yes, no peripherals are involved.

    We ran the benchmark code on the DSP over JTAG. With the code inside a task, the cycle count was 11,000,000; with the code before BIOS_start() (in main()), it was 8,200,000.
    In this benchmark, by varying the number of runs of the code (putting it inside a for loop), we observe a fixed overhead of 3 ms.

    We also wrote a program in which BIOS runs on the ARM and we offload our algorithm to the DSP using IPC. The execution time increased by only 600 us compared to running directly on the DSP. We used the big data example structure for this code.

    Regards, Esmaill
  • Hi,

    In the big data example, the CMEM memory was not cached on the DSP. After enabling caching for the CMEM region, the two times drew much closer together.
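For anyone who hits the same issue, the change was along these lines in the DSP .cfg. This is a sketch: the base address and size below are placeholders, so take the real values from your own CMEM carve-out, and remember that cached shared buffers need cache writeback/invalidate around IPC transfers:

```
var Cache = xdc.useModule('ti.sysbios.family.c66.Cache');
/* Placeholder base/len -- use your actual CMEM block address and size */
Cache.setMarMeta(0xA0000000, 0x02000000, Cache.PC | Cache.PFX);
```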

    Regards, Esmaill