
AM572x OpenCL example code fails

Other Parts Discussed in Thread: AM5728, FFTLIB

I'm hoping to find some pointers to debugging an issue I'm having running OpenCL code on an AM5728.

The gist of the problem is that the example code compiles and runs fine up to the point of validating the data returned by the DSP kernels. Every example I've tried has the same problem. Work is being done on the DSP, so the firmware is loading correctly and LAD is queueing messages just fine, but the large buffers all come back with wrong data. My guess is that something is wrong with the CMEM configuration, but everything looks to be configured correctly.

I'm running kernel 4.1.13 on a CompuLab CL-SOM-AM57x (AM5728) with SDK version 2.00.01.07. Unfortunately, that's the most recent SDK CompuLab has "officially" running, and I'm not 100% sure their documentation is complete; I've had to fill in some holes here and there.

I noticed that for some other folks having issues with earlier versions of the SDK, the recommendation is to update the SDK. I'd be happy to update to a newer version of the SDK if anyone can comment on which versions will work with a 4.1 kernel. I can manage recompiling everything in the SDK, but it will be more difficult to upgrade the kernel to 4.4 at this point since I don't have all of Compulab's configurations for that kernel. 

To illustrate some of the issues, here is the output of a few different examples...

The one with the most useful output seems to be the vecadd_md. Here's what it looks like:

$ ./vecadd_md
DEVICE 0: TI Multicore C66 DSP

=== Method 1: Using ReadBuffer/WriteBuffer APIs ===
Failed at Element 1: 8 != 4
Failed at Element 2: 16 != 8
Failed at Element 3: 24 != 12
Failed at Element 4: 32 != 16
Failed at Element 5: 40 != 20
Failed at Element 6: 48 != 24
Failed at Element 7: 56 != 28
Failed at Element 8: 64 != 32
Failed at Element 9: 72 != 36
Method 1:  1130 micro seconds
DEVICE 0:
Write BufA  : Queue  to Submit: 9 us
Write BufA  : Submit to Start : 26 us
Write BufA  : Start  to End   : 42 us

Write BufB  : Queue  to Submit: 58 us
Write BufB  : Submit to Start : 34 us
Write BufB  : Start  to End   : 25 us

Kernel Exec : Queue  to Submit: 1 us
Kernel Exec : Submit to Start : 18 us
Kernel Exec : Start  to End   : 279 us

Read BufDst : Queue  to Submit: 276 us
Read BufDst : Submit to Start : 125 us
Read BufDst : Start  to End   : 23 us

Fail with 8191 errors!


=== Method 2: Using MapBuffer/UnmapBuffer APIs ===
Failed at Element 1: 8 != 4
Failed at Element 2: 16 != 8
Failed at Element 3: 24 != 12
Failed at Element 4: 32 != 16
Failed at Element 5: 40 != 20
Failed at Element 6: 48 != 24
Failed at Element 7: 56 != 28
Failed at Element 8: 64 != 32
Failed at Element 9: 72 != 36
Method 2:  859 micro seconds
DEVICE 0:
Map   BufA  : Queue  to Submit: 1 us
Map   BufA  : Submit to Start : 16 us
Map   BufA  : Start  to End   : 2 us

Map   BufB  : Queue  to Submit: 115 us
Map   BufB  : Submit to Start : 5 us
Map   BufB  : Start  to End   : 1 us

Unmap BufA : Queue  to Submit: 1 us
Unmap BufA : Submit to Start : 18 us
Unmap BufA : Start  to End   : 10 us

Unmap BufB : Queue  to Submit: 1 us
Unmap BufB : Submit to Start : 14 us
Unmap BufB : Start  to End   : 6 us

Kernel Exec : Queue  to Submit: 1 us
Kernel Exec : Submit to Start : 12 us
Kernel Exec : Start  to End   : 234 us

Map   BufDst : Queue  to Submit: 240 us
Map   BufDst : Submit to Start : 49 us
Map   BufDst : Start  to End   : 4 us

Unmap BufDst : Queue  to Submit: 2 us
Unmap BufDst : Submit to Start : 15 us
Unmap BufDst : Start  to End   : 2 us

Fail with 8191 errors!
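
One thing worth checking in output like this is whether the mismatches are random or systematic. The following is a minimal sketch, not part of the TI example; the helper names `parse_failures` and `constant_ratio` are made up for illustration, and the sample lines are copied from the output above. It only compares the two numbers printed on each line and makes no assumption about which side is the expected value.

```python
import re

# Sample lines copied from the vecadd_md output above.
SAMPLE = """\
Failed at Element 1: 8 != 4
Failed at Element 2: 16 != 8
Failed at Element 3: 24 != 12
Failed at Element 4: 32 != 16
"""

LINE_RE = re.compile(r"Failed at Element (\d+): (-?\d+) != (-?\d+)")


def parse_failures(text):
    """Return a list of (index, left, right) tuples from 'Failed at Element' lines."""
    return [tuple(int(g) for g in m.groups()) for m in LINE_RE.finditer(text)]


def constant_ratio(failures):
    """Return left/right if it is identical for every failure, else None."""
    ratios = {left / right for _, left, right in failures if right != 0}
    return ratios.pop() if len(ratios) == 1 else None


failures = parse_failures(SAMPLE)
print(constant_ratio(failures))  # prints 2.0 for the lines above
```

For the lines shown, every mismatch differs by exactly a factor of two, which looks more like a systematic problem (for example a stale or partially written buffer) than random corruption, although the output alone does not say which.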

An example like dsplib_fft doesn't report any errors, but I don't think its code validates the results. It does show that at least some work is being done by the DSPs, so that much is working.

$ ./dsplib_fft 
Offloading FFT (SP,Complex) of 64K elements...

Write X : Queue  to Submit: 8 us
Write X : Submit to Start : 26 us
Write X : Start  to End   : 935 us

Twiddle : Queue  to Submit: 953 us
Twiddle : Submit to Start : 42 us
Twiddle : Start  to End   : 823 us

FFT : Queue  to Submit: 1271 us
FFT : Submit to Start : 5 us
FFT : Start  to End   : 13169 us

Read Y : Queue  to Submit: 14430 us
Read Y : Submit to Start : 134 us
Read Y : Start  to End   : 858 us

Done!

Okay, so some of the examples report this type of error:

sudo ./dspheap 
[host  ] DDR  heap size 16384k
recvfrom failed: Link has been severed (67)
rpmsgThreadFxn: transportGet failed on fd 12, returned -20
TIOCL FATAL: Communication to a DSP has been lost (likely due to an MMU fault). Please wait while the DSPs are reset and the runtime attempts to terminate. A reboot may be required before running another OpenCL application if this fails. See the kernel log for fault information.

Looking in the kernel log:

[   95.542808] omap-iommu 40d01000.mmu: iommu fault: da 0xc0011540 flags 0x0
[   95.549636]  remoteproc2: crash detected in 40800000.dsp: type mmufault
[   95.556289] omap-iommu 40d01000.mmu: 40d01000.mmu: errs:0x00000002 da:0xc0011540 pgd:0xec09b000 *pgd:px00000000
[   95.566472]  remoteproc2: handling crash #1 in 40800000.dsp
[   95.572107]  remoteproc2: recovering 40800000.dsp

I'm not sure if that's directly related to the other issue, but it would indicate that something is wrong. 
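
When comparing several runs, it can help to pull the fault details out of the kernel log automatically. This is a rough sketch that assumes the `omap-iommu ... iommu fault: da ... flags ...` line format seen above; the function name `find_faults` is hypothetical. Note that `da` is the address as seen by the DSP through its MMU, not a physical address.

```python
import re

# Sample lines copied from the kernel log above.
SAMPLE = """\
[   95.542808] omap-iommu 40d01000.mmu: iommu fault: da 0xc0011540 flags 0x0
[   95.549636]  remoteproc2: crash detected in 40800000.dsp: type mmufault
"""

FAULT_RE = re.compile(
    r"\[\s*(?P<time>\d+\.\d+)\]\s+omap-iommu (?P<mmu>[^:\s]+): iommu fault: "
    r"da (?P<da>0x[0-9a-fA-F]+) flags (?P<flags>0x[0-9a-fA-F]+)"
)


def find_faults(log_text):
    """Return one dict per 'iommu fault' line: timestamp, MMU name, address, flags."""
    return [
        {
            "time": float(m.group("time")),
            "mmu": m.group("mmu"),
            "da": int(m.group("da"), 16),
            "flags": int(m.group("flags"), 16),
        }
        for m in FAULT_RE.finditer(log_text)
    ]


for fault in find_faults(SAMPLE):
    print(f"{fault['time']:.6f}s {fault['mmu']} da={fault['da']:#010x}")
```

If the faulting address is the same on every run, that points at a fixed mapping the DSP firmware expects but the kernel did not set up, rather than at a stray pointer in one example.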

Thanks for any pointers.

Scott

  • Here's the relevant portion of the dts file:

    +/ {
    +	model = "CompuLab CL-SOM-AM57x";
    +	compatible = "compulab,cl-som-am57x", "ti,am5728", "ti,dra742", "ti,dra74", "ti,dra7";
    +
    +	memory {
    +		device_type = "memory";
    +		reg = <0x0 0x80000000 0x0 0x20000000>; /* 512 MB - minimal configuration */
    +	};
    +
    +	reserved-memory {
    +		#address-cells = <2>;
    +		#size-cells = <2>;
    +		ranges;
    +
    +		ipu1_cma_pool: ipu1_cma@9d000000 {
    +			compatible = "shared-dma-pool";
    +			reg = <0x0 0x9d000000 0x0 0x2000000>;
    +			reusable;
    +			status = "okay";
    +		};
    +
    +		ipu2_cma_pool: ipu2_cma@95800000 {
    +			compatible = "shared-dma-pool";
    +			reg = <0x0 0x95800000 0x0 0x3800000>;
    +			reusable;
    +			status = "okay";
    +		};
    +
    +		dsp1_cma_pool: dsp1_cma@99000000 {
    +			compatible = "shared-dma-pool";
    +			reg = <0x0 0x99000000 0x0 0x4000000>;
    +			reusable;
    +			status = "okay";
    +		};
    +
    +		dsp2_cma_pool: dsp2_cma@9f000000 {
    +			compatible = "shared-dma-pool";
    +			reg = <0x0 0x9f000000 0x0 0x800000>;
    +			reusable;
    +			status = "okay";
    +		};
    +
    +                cmem_block_mem_0: cmem_block_mem@a0000000 {
    +                        reg = <0x0 0xa0000000 0x0 0x0c000000>;
    +                        no-map;
    +                        status = "okay";
    +                };
    +
    +		cmem_block_mem_1_ocmc3: cmem_block_mem@40500000 {
    +			reg = <0x0 0x40500000 0x0 0x100000>;
    +			no-map;
    +			status = "okay";
    +		};
    +	};
    +
    +        cmem {
    +                compatible = "ti,cmem";
    +                #address-cells = <1>;
    +                #size-cells = <0>;
    +
    +		#pool-size-cells = <2>;
    +
    +                status = "okay";
    +
    +                cmem_block_0: cmem_block@0 {
    +                        reg = <0>;
    +                        memory-region = <&cmem_block_mem_0>;
    +                        cmem-buf-pools = <1 0x0 0x0c000000>;
    +                };
    +
    +		cmem_block_1: cmem_block@1 {
    +			reg = <1>;
    +			memory-region = <&cmem_block_mem_1_ocmc3>;
    +		};
    +        };
    +
    

    And the output of cat /proc/cmem:

    cat /proc/cmem
    
    Block 0: Pool 0: 1 bufs size 0xc000000 (0xc000000 requested)
    
    Pool 0 busy bufs:
    
    Pool 0 free bufs:
    id 0: phys addr 0xa0000000
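
A quick sanity check of the memory map above can be scripted. The sketch below is illustrative only: the region list is transcribed by hand from the dts fragment, and the helper names (`overlaps`, `outside`) are made up. It checks that no two reserved regions intersect and reports which ones fall outside the declared `memory` node.

```python
from itertools import combinations

# (name, base, size), transcribed from the dts fragment above.
MEMORY = ("memory", 0x80000000, 0x20000000)
RESERVED = [
    ("ipu1_cma", 0x9D000000, 0x2000000),
    ("ipu2_cma", 0x95800000, 0x3800000),
    ("dsp1_cma", 0x99000000, 0x4000000),
    ("dsp2_cma", 0x9F000000, 0x800000),
    ("cmem_block_mem_0", 0xA0000000, 0x0C000000),
    ("cmem_block_mem_1_ocmc3", 0x40500000, 0x100000),
]


def end(region):
    """Exclusive end address of a (name, base, size) region."""
    return region[1] + region[2]


def overlaps(regions):
    """Return pairs of region names whose address ranges intersect."""
    return [
        (a[0], b[0])
        for a, b in combinations(regions, 2)
        if a[1] < end(b) and b[1] < end(a)
    ]


def outside(regions, memory):
    """Return names of regions not fully contained in the memory node."""
    lo, hi = memory[1], end(memory)
    return [r[0] for r in regions if r[1] < lo or end(r) > hi]


print(overlaps(RESERVED))         # [] for the values above
print(outside(RESERVED, MEMORY))  # ['cmem_block_mem_0', 'cmem_block_mem_1_ocmc3']
```

With these values the CMA pools sit back to back with no overlap. Judging by its label, the OCMC block is on-chip RAM, so it is expected to lie outside DDR. The 192 MB CMEM block starts exactly where the declared 512 MB ends (0xa0000000); whether that matters depends on whether the bootloader rewrites the `memory` node for modules with more RAM, which the fragment alone does not show.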
    

  • Hi,

    The OpenCL experts have been notified. They will respond here.
  • Thanks. 

    As a follow-up, I am currently rebuilding the system from scratch and revising my documentation. The last build had a mix of pre-built 2.00.01 binaries and files built from TI git source (whatever the active master branch had), which may be complicating matters.

    So the new approach is to use the 4.1 TI kernel with kernel mods from Compulab, the current git source for cmemk from the ludev repo, and all 3.01 prebuilt binaries from the SDK.

    Here's the build doc so far:

    nw2sdevices.atlassian.net/.../SBC-AM57x Development Platform

    One thing I noticed, however, possibly related to the MMU crashes reported above: I have the following messages in the kernel log when the DSPs boot:

    [    3.689020] omap-rproc 40800000.dsp: assigned reserved memory node dsp1_cma@99000000
    [    3.689055]  remoteproc2: 40800000.dsp is available
    [    3.689057]  remoteproc2: Note: remoteproc is still under development and considered experimental.
    [    3.689059]  remoteproc2: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.
    [    3.689362] omap-rproc 41000000.dsp: assigned reserved memory node dsp2_cma@9f000000
    [    3.689409]  remoteproc3: 41000000.dsp is available
    [    3.689411]  remoteproc3: Note: remoteproc is still under development and considered experimental.
    [    3.689413]  remoteproc3: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.
    [    4.026909]  remoteproc2: registered virtio0 (type 7)
    [    4.033544]  remoteproc3: registered virtio1 (type 7)
    [    4.279553]  remoteproc2: powering up 40800000.dsp
    [    4.364175]  remoteproc2: Booting fw image dra7-dsp1-fw.xe66, size 22037092
    [    4.370815] omap_hwmod: mmu0_dsp1: _wait_target_disable failed
    [    4.370882] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
    [    4.370913] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
    [    4.438186]  remoteproc2: bad phdr da 0x800000 mem 0x5cd1
    [    4.438190]  remoteproc2: Failed to load program segments: -22
    [    4.470565] omap_hwmod: mmu1_dsp1: _wait_target_disable failed
    [    4.477140] omap_hwmod: mmu0_dsp1: _wait_target_disable failed
    [    4.522396]  remoteproc2: rproc_boot() failed -22
    [    4.525316]  remoteproc3: powering up 41000000.dsp
    [    4.546894]  remoteproc3: Booting fw image dra7-dsp2-fw.xe66, size 22037092
    [    4.553622] omap-iommu 41501000.mmu: 41501000.mmu: version 3.0
    [    4.553656] omap-iommu 41502000.mmu: 41502000.mmu: version 3.0
    [    4.565109]  remoteproc3: bad phdr da 0x800000 mem 0x5cd1
    [    4.565111]  remoteproc3: Failed to load program segments: -22
    [    4.586529] omap_hwmod: mmu1_dsp2: _wait_target_disable failed
    [    4.593142] omap_hwmod: mmu0_dsp2: _wait_target_disable failed
    [    4.604951]  remoteproc3: rproc_boot() failed -22
    [    4.604983] virtio_rpmsg_bus: probe of virtio1 failed with error -22
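
To compare boot logs between kernels, the remoteproc outcome can be summarized with a small script. This is a sketch under the assumption that the log uses the message formats shown above; `boot_status` and `bad_segments` are hypothetical helper names. The -22 in these messages is -EINVAL on Linux, and each `bad phdr` line reports the device address and size of a firmware segment the loader rejected.

```python
import re

# Sample lines copied from the kernel log above.
SAMPLE = """\
[    4.364175]  remoteproc2: Booting fw image dra7-dsp1-fw.xe66, size 22037092
[    4.438186]  remoteproc2: bad phdr da 0x800000 mem 0x5cd1
[    4.438190]  remoteproc2: Failed to load program segments: -22
[    4.522396]  remoteproc2: rproc_boot() failed -22
[    4.546894]  remoteproc3: Booting fw image dra7-dsp2-fw.xe66, size 22037092
[    4.565109]  remoteproc3: bad phdr da 0x800000 mem 0x5cd1
[    4.604951]  remoteproc3: rproc_boot() failed -22
"""

LINE_RE = re.compile(r"\]\s+(remoteproc\d+): (.*)")
PHDR_RE = re.compile(
    r"(remoteproc\d+): bad phdr da (0x[0-9a-fA-F]+) mem (0x[0-9a-fA-F]+)"
)


def boot_status(log_text):
    """Map each remoteproc name to 'up', 'failed', or 'unknown'."""
    status = {}
    for line in log_text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        name, msg = m.groups()
        status.setdefault(name, "unknown")
        if "is now up" in msg:
            status[name] = "up"
        elif msg.startswith("rproc_boot() failed"):
            status[name] = "failed"
    return status


def bad_segments(log_text):
    """Return (remoteproc, device_address, size) for each 'bad phdr' line."""
    return [(n, int(da, 16), int(sz, 16)) for n, da, sz in PHDR_RE.findall(log_text)]


print(boot_status(SAMPLE))
print(bad_segments(SAMPLE))
```

Both DSPs reject the same segment at device address 0x800000, so the failure is in how this kernel handles the firmware image rather than something specific to one core.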
    

  • At this point, I believe the issue is some sort of conflict with the 4.1 kernel. I've managed to get a basic 4.4 kernel running without the "bad phdr" errors. It will be a couple of days before I can see whether that translates into working example code, as I'm traveling and carrying my dev board with me. But here are the relevant logs:

    [    0.000000] Reserved memory: initialized node dsp1_cma@99000000, compatible id shared-dma-pool
    [    0.000000] Reserved memory: initialized node dsp2_cma@9f000000, compatible id shared-dma-pool
    [    0.297551] omap-iommu 40d01000.mmu: 40d01000.mmu registered
    [    0.297732] omap-iommu 40d02000.mmu: 40d02000.mmu registered
    [    0.297906] omap-iommu 58882000.mmu: 58882000.mmu registered
    [    0.298074] omap-iommu 55082000.mmu: 55082000.mmu registered
    [    0.298355] omap-iommu 41501000.mmu: 41501000.mmu registered
    [    0.298536] omap-iommu 41502000.mmu: 41502000.mmu registered
    [    3.980930] omap-rproc 40800000.dsp: assigned reserved memory node dsp1_cma@99000000
    [    3.980987]  remoteproc2: 40800000.dsp is available
    [    3.980991]  remoteproc2: Note: remoteproc is still under development and considered experimental.
    [    3.980994]  remoteproc2: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.
    [    3.988508] omap-rproc 41000000.dsp: assigned reserved memory node dsp2_cma@9f000000
    [    3.988561]  remoteproc3: 41000000.dsp is available
    [    3.988565]  remoteproc3: Note: remoteproc is still under development and considered experimental.
    [    3.988569]  remoteproc3: THE BINARY FORMAT IS NOT YET FINALIZED, and backward compatibility isn't yet guaranteed.
    [    4.422013]  remoteproc3: registered virtio0 (type 7)
    [    4.428938]  remoteproc2: registered virtio1 (type 7)
    [    4.834751]  remoteproc3: powering up 41000000.dsp
    [    4.974527]  remoteproc3: Booting fw image dra7-dsp2-fw.xe66, size 22037092
    [    5.013287] omap_hwmod: mmu0_dsp2: _wait_target_disable failed
    [    5.019197] omap-iommu 41501000.mmu: 41501000.mmu: version 3.0
    [    5.025206] omap-iommu 41502000.mmu: 41502000.mmu: version 3.0
    [    5.107312]  remoteproc3: remote processor 41000000.dsp is now up
    [    5.209873]  remoteproc2: powering up 40800000.dsp
    [    5.269372]  remoteproc2: Booting fw image dra7-dsp1-fw.xe66, size 22037092
    [    5.328078] omap_hwmod: mmu0_dsp1: _wait_target_disable failed
    [    5.333986] omap-iommu 40d01000.mmu: 40d01000.mmu: version 3.0
    [    5.339930] omap-iommu 40d02000.mmu: 40d02000.mmu: version 3.0
    [    5.366986]  remoteproc2: remote processor 40800000.dsp is now up
    [   15.722083] omap_hwmod: mmu1_dsp2: _wait_target_disable failed
    [   15.735042] omap_hwmod: mmu1_dsp1: _wait_target_disable failed
    [   15.748133] omap_hwmod: mmu0_dsp2: _wait_target_disable failed
    [   15.761200] omap_hwmod: mmu0_dsp1: _wait_target_disable failed
    

    Fingers crossed. 

    s

  • Okay, success with the 4.4 kernel and the 3.01.00 SDK. I just couldn't get things running with 4.1 and 2.00.01, and it's probably not worth any more effort. I'll just have to switch kernels until CompuLab releases their official 4.4 kernel, since I lose a few things building my own kernel. If anyone's interested (and happens to have the same board), the kernel I have working, which is the 4.1 TI kernel patched with CompuLab's overlay and then merged into the TI 4.4 branch, is here:

    https://github.com/nw2s/b2-dsp-linux

    
    ./vecadd_md 
    
    DEVICE 0: TI Multicore C66 DSP
    
    === Method 1: Using ReadBuffer/WriteBuffer APIs ===
    Method 1:  1504 micro seconds
    DEVICE 0:
    Write BufA  : Queue  to Submit: 12 us
    Write BufA  : Submit to Start : 193 us
    Write BufA  : Start  to End   : 57 us
    
    Write BufB  : Queue  to Submit: 234 us
    Write BufB  : Submit to Start : 97 us
    Write BufB  : Start  to End   : 55 us
    
    Kernel Exec : Queue  to Submit: 2 us
    Kernel Exec : Submit to Start : 49 us
    Kernel Exec : Start  to End   : 293 us
    
    Read BufDst : Queue  to Submit: 321 us
    Read BufDst : Submit to Start : 127 us
    Read BufDst : Start  to End   : 25 us
    
    Success!
    
    
    === Method 2: Using MapBuffer/UnmapBuffer APIs ===
    Method 2:  714 micro seconds
    DEVICE 0:
    Map   BufA  : Queue  to Submit: 2 us
    Map   BufA  : Submit to Start : 24 us
    Map   BufA  : Start  to End   : 3 us
    
    Map   BufB  : Queue  to Submit: 24 us
    Map   BufB  : Submit to Start : 5 us
    Map   BufB  : Start  to End   : 1 us
    
    Unmap BufA : Queue  to Submit: 2 us
    Unmap BufA : Submit to Start : 22 us
    Unmap BufA : Start  to End   : 12 us
    
    Unmap BufB : Queue  to Submit: 2 us
    Unmap BufB : Submit to Start : 18 us
    Unmap BufB : Start  to End   : 10 us
    
    Kernel Exec : Queue  to Submit: 2 us
    Kernel Exec : Submit to Start : 18 us
    Kernel Exec : Start  to End   : 306 us
    
    Map   BufDst : Queue  to Submit: 315 us
    Map   BufDst : Submit to Start : 28 us
    Map   BufDst : Start  to End   : 7 us
    
    Unmap BufDst : Queue  to Submit: 3 us
    Unmap BufDst : Submit to Start : 20 us
    Unmap BufDst : Start  to End   : 2 us
    
    Success!
    

  • Great work Thomas.

    Have you noticed that CompuLab released a Yocto distribution with Linux kernel 4.1.13-cl-som-am57x-ti-3.2, based on TI Processor SDK version 2.0.1?

    I am new to the platform too (though I have long experience with standalone DSPs), but I don't know how to set up the platform to create DSP applications. Could you give me a hint with this?

    Regards
  • I am not familiar with yocto, sorry. I've bugged compulab about 4.4 and sdk 3... If the yocto distribution is for 4.1 and sdk 2.0, then it may not be worth the effort as I couldn't get OpenCL working on those versions.

    If you want to see how I got 4.4/3.0 combo working from a mix of TI and compulab code, see nw2sdevices.atlassian.net/.../SBC-AM57x Development Platform - you might be able to use that as a starting point for doing the same with the newer kernel and SDK.

    Scott
  • Thomas, thanks for helping the community by sharing that page and the comments.
  • Thomas Wilson said:
    I am not familiar with yocto, sorry. I've bugged compulab about 4.4 and sdk 3... If the yocto distribution is for 4.1 and sdk 2.0, then it may not be worth the effort as I couldn't get OpenCL working on those versions.

    If you want to see how I got 4.4/3.0 combo working from a mix of TI and compulab code, see nw2sdevices.atlassian.net/.../SBC-AM57x Development Platform - you might be able to use that as a starting point for doing the same with the newer kernel and SDK.

    Scott

    Thank you, Thomas, for writing up that detailed process.

    One question after reading it all: did you use CompuLab's mainline-based Linux 4.4 distribution?

    25-Oct-2016, CL-SOM-AM57x Linux release

    Linux kernel 4.4.21-cl-som-am57x-3.2 and v4.1.13-cl-som-am57x-ti-3.2 for CL-SOM-AM57x updates

    It is supposed to come with the 4.4.21 kernel and patch 3.2, but with no support for the DSPs or graphics acceleration.

    Do you think it would be possible to start from it?

    On the other hand, we will use OpenCV and DSP applications (both OpenCL programs and our own programs developed with Code Composer Studio). Have you done anything similar? If so, do you have any examples of how to code for this DSP?

    Thank you again.

    PS: Of course, we will share any progress we make on this board. We find the documentation quite sparse for newbies.