This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

  • Resolved

AM5729: Can custom DSP algorithm run with TIDL?


Part Number: AM5729

Dear Champs,

Is it possible to add a custom DSP algorithm on a DSP while running TIDL?

Should the customer separate the DSPs, e.g., run the customer's algorithm on DSP1 and TIDL on DSP2?

Please let me know how the customer can add their DSP algorithm alongside TIDL, and please provide a guide for this.

Thanks and Best Regards,

SI.

  • Hi SI, it is possible to use only one DSP for TIDL, which leaves the other DSP free. There are some examples in our User Guide. See the classification example commands below:

    cd /usr/share/ti/tidl/examples/classification
    ./tidl_classification -g 2 -d 1 -e 2 -l ./imagenet.txt -s ./classlist.txt -i 1 -c ./stream_config_j11_v2.txt
    cd /usr/share/ti/tidl/examples/classification
    ./tidl_classification -g 1 -d 2 -e 2 -l ./imagenet.txt -s ./classlist.txt -i ./clips/test10.mp4 -c ./stream_config_j11_v2.txt

    thank you,

    Paula

  • In reply to Paula Carrillo:

    Hi Paula,

    Thanks for this information.

    Actually, my customer already tried to run their custom algorithm on DSP1 while running TIDL on DSP2 with the EVEs, but TIDL could not run in this configuration.

    They suspect there is a conflict in the IPC, and they want to know how they can use IPC in their custom algorithm while TIDL runs on the other DSP.

    Could you please provide a guide on this?

    They suspect the IPC buffers overlapped, which is why TIDL did not work alongside their custom algorithm.

    Thanks and Best Regards,

    SI.

  • In reply to Sung-IL:

    Please see if you can use the environment variable "TI_OCL_COMPUTE_UNIT_LIST" to partition the DSP cores, e.g.

    TI_OCL_COMPUTE_UNIT_LIST="0" ./your_ocl_alg_application

    TI_OCL_COMPUTE_UNIT_LIST="1" ./your_tidl_application

    -Yuan

  • In reply to Yuan Zhao:

    Hi Yuan,

    They already use the environment variable "TI_OCL_COMPUTE_UNIT_LIST" to partition the DSP cores, as below.

    ~~~~~~~~~

    root@am57xx-evm:~# cat /etc/ti-mctd/ti_mctd_config.json

    {

            "cmem-block-offchip" : "0",

            "cmem-block-onchip" : "1",

            "compute-unit-list" : "1",

            "eve-devices-disable" : "0",

            "linux-shmem-size-KB" : "256",

    }

    ~~~~~~~~~~~~~

    But it still does not work.

    Do you think this configuration is right?

    They still suspected an issue in the IPC; they have since resolved one issue and can now run their DSP application alongside TIDL, but two problems remain in this case.

    Please let me know your opinion on the points below and guide us.

    While running TIDL on DSP2 and IPU1, they moved the CMA region of DSP1 to 0x9000_0000 ~ 0x9D00_0000 as below (e.g., they moved PHYS_MEM_IPC_VRING in rsc_table_vayu_dsp.c to 0x9000_0000), and they confirmed TIDL worked well on DSP2 together with their DSP application on DSP1.

    ~~~~~~~~~

    [    0.000000] Reserved memory: created CMA memory pool at 0x0000000090000000, size 208 MiB 

    [    0.000000] OF: reserved mem: initialized node dsp1-memory@90000000, compatible id shared-dma-pool

    [    0.000000] Reserved memory: created CMA memory pool at 0x000000009d000000, size 32 MiB 

    [    0.000000] OF: reserved mem: initialized node ipu1-memory@9d000000, compatible id shared-dma-pool

    [    0.000000] Reserved memory: created CMA memory pool at 0x000000009f000000, size 8 MiB 

    [    0.000000] OF: reserved mem: initialized node dsp2-memory@9f000000, compatible id shared-dma-pool

    [    0.000000] Reserved memory: created CMA memory pool at 0x000000009f800000, size 120 MiB 

    [    0.000000] OF: reserved mem: initialized node ipu2-memory@9f800000, compatible id shared-dma-pool

    [    0.000000] cma: Reserved 24 MiB at 0x00000000be400000

    ~~~~~~~~~

    Do you think this approach is right?

    In this case, there are two issues:

    1. Only about 204 MB remains for their DSP application software (0x9000_0000 ~ 0x9D00_0000).

    2. They tested IPC communication between the Cortex-A15 core and DSP1 while running TIDL on DSP2 and the custom DSP application on DSP1, and they found issues in the IPC communication after about 1000 tests.

    The details of the IPC communication test are:

    * They repeatedly send an IPC queue message (A15 -> DSP1) and run the OpenCL application, with a 200 ms delay between iterations.

    * There is no issue in IPC communication when only the OpenCL userspace application runs, without DSP1 communication.

    * But when the OpenCL userspace application runs together with the repeated DSP1 communication (200 ms delay), they found an issue in the IPU1 IPC after about 1000 trials.

    Do you think their approach is right? Or is there another way to run their DSP application alongside TIDL?

    Thanks and Best Regards,

    SI.

  • In reply to Sung-IL:

    Hi SI,

    I am not sure if the IPC issues are related to the movement of DSP/IPU CMA regions.

    1) One thing the customer could try is to make no modifications: put the stock PSDK filesystem on the EVM SD card. Then use the environment variable I mentioned in my previous post, run the OpenCL application on the Arm using DSP1 only, run the TIDL application on the Arm using DSP2 only, and see if the aforementioned IPC problems manifest.

    2) Can you clarify what images are running on each core? Are DSP1, DSP2, and IPU1 running OpenCL firmware? What is running on IPU2?

    3) The CMA for IPU2 could be problematic:

    [    0.000000] Reserved memory: created CMA memory pool at 0x000000009f800000, size 120 MiB 

    [    0.000000] OF: reserved mem: initialized node ipu2-memory@9f800000, compatible id shared-dma-pool

    This will step into 0xA000_0000, which is the DDR CMEM region.  For OpenCL/TIDL to run properly on AM57x, we need the first CMEM block to start from 0xA000_0000.

    4) TIDL-API under the hood uses OpenCL.  The "/etc/ti-mctd/ti_mctd_config.json" is a global setting for OpenCL.  When the customer sets "compute-unit-list" to "1", all OpenCL and TIDL applications will talk to DSP2 only.  Maybe the customer is not running OpenCL firmware on DSP1?  Are they running their own firmware on DSP1?

    I don't have a booted AM57x EVM available at the moment.  I'll see if I can find somebody in the office to help press the power button on my EVM.

    -Yuan

  • In reply to Yuan Zhao:

    Hi Yuan,

    Thanks for your response. I added my responses below.

    1) One thing the customer could try is to make no modifications: put the stock PSDK filesystem on the EVM SD card. Then use the environment variable I mentioned in my previous post, run the OpenCL application on the Arm using DSP1 only, run the TIDL application on the Arm using DSP2 only, and see if the aforementioned IPC problems manifest.

    Do you mean the customer should go back to the original setup and try it again after modifying only the environment variable?

    2) Can you clarify what images are running on each core? Are DSP1, DSP2, and IPU1 running OpenCL firmware? What is running on IPU2?

    TIDL is running on DSP2 and IPU1. The customer's DSP algorithm software runs on DSP1. There is no custom software on IPU2.

    The customer's DSP software on DSP1 is currently just a simple test program communicating with the A15.

    3) The CMA for IPU2 could be problematic:

    [    0.000000] Reserved memory: created CMA memory pool at 0x000000009f800000, size 120 MiB 

    [    0.000000] OF: reserved mem: initialized node ipu2-memory@9f800000, compatible id shared-dma-pool

    This will step into 0xA000_0000, which is the DDR CMEM region.  For OpenCL/TIDL to run properly on AM57x, we need the first CMEM block to start from 0xA000_0000.

    Do you think this IPU2 memory overrun into the 0xA000_0000 region may cause IPC instability? I'll ask the customer to run the IPC test again after keeping the 0xA000_0000 region free.

    4) TIDL-API under the hood uses OpenCL.  The "/etc/ti-mctd/ti_mctd_config.json" is a global setting for OpenCL.  When the customer sets "compute-unit-list" to "1", all OpenCL and TIDL applications will talk to DSP2 only.  Maybe the customer is not running OpenCL firmware on DSP1?  Are they running their own firmware on DSP1?

    Yes, their own firmware runs on DSP1, without the OpenCL firmware. They don't use OpenCL; they just use IPC in their own application on DSP1 to communicate with the A15.

    Thanks and Best Regards,

    SI.

  • In reply to Sung-IL:

    1) One thing the customer could try is to make no modifications: put the stock PSDK filesystem on the EVM SD card. Then use the environment variable I mentioned in my previous post, run the OpenCL application on the Arm using DSP1 only, run the TIDL application on the Arm using DSP2 only, and see if the aforementioned IPC problems manifest.

    Do you mean the customer should go back to the original setup and try it again after modifying only the environment variable?

    [YZ] Yes, put a different SD card in the EVM, running the stock PSDK filesystem unmodified.  No custom firmware.  Modify an OpenCL application (e.g., vecadd) to run in an infinite loop, and run it using DSP1 only.  Run the TIDL application using IPU1 (EVEs) and DSP2 only.

    2) Can you clarify what images are running on each core? Are DSP1, DSP2, and IPU1 running OpenCL firmware? What is running on IPU2?

    TIDL is running on DSP2 and IPU1. The customer's DSP algorithm software runs on DSP1. There is no custom software on IPU2.

    The customer's DSP software on DSP1 is currently just a simple test program communicating with the A15.

    3) The CMA for IPU2 could be problematic:

    [    0.000000] Reserved memory: created CMA memory pool at 0x000000009f800000, size 120 MiB 

    [    0.000000] OF: reserved mem: initialized node ipu2-memory@9f800000, compatible id shared-dma-pool

    This will step into 0xA000_0000, which is the DDR CMEM region.  For OpenCL/TIDL to run properly on AM57x, we need the first CMEM block to start from 0xA000_0000.

    Do you think this IPU2 memory overrun into the 0xA000_0000 region may cause IPC instability? I'll ask the customer to run the IPC test again after keeping the 0xA000_0000 region free.

    [YZ] I am not sure.  But the reserved memory regions should not overlap.

    4) TIDL-API under the hood uses OpenCL.  The "/etc/ti-mctd/ti_mctd_config.json" is a global setting for OpenCL.  When the customer sets "compute-unit-list" to "1", all OpenCL and TIDL applications will talk to DSP2 only.  Maybe the customer is not running OpenCL firmware on DSP1?  Are they running their own firmware on DSP1?

    Yes, their own firmware runs on DSP1, without the OpenCL firmware. They don't use OpenCL; they just use IPC in their own application on DSP1 to communicate with the A15.

    [YZ] The customer might also look into everything IPC-related, and into why IPC traffic on DSP1 interfered with IPC traffic on IPU1.  How many buffers did they define for their MessageQ?  Is the DSP1 firmware responding to each MessageQ message?  Is any flow control in place?  Where does the IPC daemon log go?  Has it exhausted the "/tmp" filesystem after 1000 iterations?  This is why I suggested the experiment in 1), to test the stock IPC implementation in OpenCL.

  • In reply to Sung-IL:

    Hi SI,

    I have conducted experiment 1) that I mentioned in the post above.  The A15 can have independent IPC communications with DSP1 and DSP2/IPU1.  The details are enclosed below.

    -Yuan

    Window 1 running OpenCL "conv1d" example, using DSP1 only
    - source patched for running many iterations and program progress print
    Window 2 running TIDL-API "mcbench" test on jacinto11_v2 network, using 4 EVEs (IPU1) and DSP2 only
    - source patched for progress print
    Start these two programs in the two windows at roughly the same time
    - conv1d example for 10 million iterations
    - mcbench jacinto11_v2 network on 50 million frames
    - start time: Sat Sep 12 03:09:36 UTC 2020
    Status at two days later:
    - Mon Sep 14 13:28:39 UTC 2020
    - Both applications are still running/progressing
    - Window 1, "conv1d" is at iteration 278270
    - Window 2, "mcbench" on the jacinto11_v2 network is at frame 14000200
    Conclusion:
    - Two applications can have independent IPC communications to DSP1 and
    DSP2/IPU1, and do not interfere with each other.

    Recommendation to the customer: make one small change at a time and debug at each step
    - Step 1: set up this experiment to validate independent IPC communications among the A15 and the other cores
    - Step 2: replace the "Window 2" application with the customer's TIDL application; validate IPC communications
    - Step 3: replace "Window 1" with the customer's IPC test (replace the DSP1 firmware, but do not move the DSP1 CMA memory yet); validate
    - Step 4: replace "Window 1" with the customer's IPC test (move the DSP1 CMA memory); validate


    Window 1:

    root@am57xx-evm:/usr/share/ti/examples/opencl/conv1d# diff -u main.cpp.orig main.cpp 
    --- main.cpp.orig
    +++ main.cpp
    @@ -79,6 +79,9 @@
       int input_numcompunits = 0;
       if (argc > 1)  input_numcompunits = atoi(argv[1]);  // valid: 1, 2, 4, 8
     
    + for (int iter = 0; iter < input_numcompunits; iter++)
    + {
    +  if (iter % 10 == 0)  printf("\n\n\nITER %d:\n\n", iter);
       try
       {
         Context             context (CL_DEVICE_TYPE_ACCELERATOR);
    @@ -275,6 +278,7 @@
         cerr << "ERROR: " << err.what() << "(" << err.err() << ", "
              << ocl_decode_error(err.err()) << ")" << endl;
       }
    + }
     
       if (num_errors != 0)
       {
    root@am57xx-evm:/usr/share/ti/examples/opencl/conv1d# make clean; make
    Compiling k_extc.c
    cl6x -mv6600 --abi=eabi -I/usr/share/ti/opencl -I/usr/share/ti/opencl -I/usr/share/ti/cgt-c6x/include -c -o3 -mw --symdebug:none k_extc.c
    Compiling ti_kernels.cl
    /usr/bin/clocl -t  ti_kernels.cl k_extc.obj
    Compiling main.cpp
    g++ -c -O3 -I/usr/include -Wall main.cpp
    root@am57xx-evm:/usr/share/ti/examples/opencl/conv1d# TI_OCL_COMPUTE_UNIT_LIST="0" ./conv1d 10000000; date
    
    

    Window 2:

    root@am57xx-evm:/usr/share/ti/tidl/examples/mcbench# diff -u main.cpp.orig main.cpp
    --- main.cpp.orig
    +++ main.cpp
    @@ -174,6 +174,7 @@
             for (uint32_t frame_idx = 0;
                  frame_idx < opts.num_frames + num_eops; frame_idx++)
             {
    +            if (frame_idx % 100 == 0) printf("\n\n\nFrame: %d\n\n", frame_idx);
                 ExecutionObjectPipeline* eop = eops[frame_idx % num_eops];
     
                 // Wait for previous frame on the same eop to finish processing
    root@am57xx-evm:/usr/share/ti/tidl/examples/mcbench# make clean; make
    rm -f mcbench stats_tool_out.* *.out
    rm -f frame_*.png multibox_*.png
    g++ -I/usr/include -O3 -Wall -Werror -Wno-error=ignored-attributes -I. -I/home/root/tidl/tidl_api/inc -std=c++11 -I/usr/share/ti/opencl -I/usr/share/ti/opencl main.cpp ../common/utils.cpp ../common/video_utils.cpp /home/root/tidl/tidl_api/tidl_api.a -L/usr/lib -lOpenCL -locl_util -lpthread -lopencv_highgui -lopencv_imgcodecs -lopencv_videoio -lopencv_imgproc -lopencv_core -o mcbench
    root@am57xx-evm:/usr/share/ti/tidl/examples/mcbench# TI_OCL_COMPUTE_UNIT_LIST="1" TIDL_PARAM_HEAP_SIZE_EVE=3000000 TIDL_PARAM_HEAP_SIZE_DSP=3000000 TIDL_NETWORK_HEAP_SIZE_EVE=20000000 TIDL_NETWORK_HEAP_SIZE_DSP=2000000 ./mcbench -g 2 -d 1 -e 4 -c ../test/testvecs/config/infer/tidl_config_j11_v2_lg2.txt -f 50000000 -i ../test/testvecs/input/preproc_0_224x224_multi.y -v; date
    
    

  • In reply to Yuan Zhao:

    Hi Yuan,

    Thanks for this information.

    My customer is trying to run your scenarios now, but they are worried that it will be hard to increase the memory size for their custom DSP application.

    Could you please let me know the memory map information for TIDL and OpenCL?

    Is the only limitation for TIDL/OpenCL the one you mentioned, i.e., 'For OpenCL/TIDL to run properly on AM57x, we need the first CMEM block to start from 0xA000_0000'?

    Could you please help with this, and tell us which CMEM regions must be reserved?

    They want to know which region can be used for their custom DSP application.

    Thanks and Best Regards,

    SI.
