AM62A7-Q1: accessing dmabuf takes too long

Part Number: AM62A7-Q1
Other Parts Discussed in Thread: AM62A7

1. dmabuf to memory, user space: about 36~40 ms;
[1970-01-01 00:01:54 505.320][v4l2halapi_cap][callback_process][189]:/dev/video1 copy size 3932160 from dmabuf to buf cost: 37 ms 732 us.
[1970-01-01 00:01:54 699.738][v4l2halapi_cap][callback_process][189]:/dev/video0 copy size 3932160 from dmabuf to buf cost: 37 ms 847 us.
[1970-01-01 00:01:54 740.942][v4l2halapi_cap][callback_process][189]:/dev/video1 copy size 3932160 from dmabuf to buf cost: 37 ms 954 us.
[1970-01-01 00:01:54 932.309][v4l2halapi_cap][callback_process][189]:/dev/video0 copy size 3932160 from dmabuf to buf cost: 36 ms 713 us.
2. dmabuf to memory, kernel space: about 36~40 ms;
[ 108.097452] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x11e01000 cost time 39699 us, sequence: 12, idx: 1.
[ 108.147086] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xb601000 cost time 37459 us, sequence: 12, idx: 0.
[ 108.333503] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x121c2000 cost time 38833 us, sequence: 13, idx: 1.
[ 108.382931] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xb9c2000 cost time 37214 us, sequence: 13, idx: 0.
[ 108.570038] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x12583000 cost time 38376 us, sequence: 14, idx: 1.
[ 108.619260] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xef01000 cost time 37029 us, sequence: 14, idx: 0.
[ 108.808304] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x12944000 cost time 38521 us, sequence: 15, idx: 1.
[ 108.857273] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xf2c2000 cost time 36848 us, sequence: 15, idx: 0.
3. Memory to memory, kernel space: about 3~6 ms according to the log below;
[ 103.366459] <DRV_ADS6311> [1970-01-01 00:01:50 000.705.930] <I> 4152 copy addr1 00000000bab3f124 to addr2 00000000ae7456e2 cost time 5669 us, len: 3932160.
[ 103.402573] <DRV_ADS6311> [1970-01-01 00:01:50 036.823.105] <I> 4152 copy addr1 00000000ae7456e2 to addr2 00000000bab3f124 cost time 4392 us, len: 3932160.
[ 103.434550] <DRV_ADS6311> [1970-01-01 00:01:50 068.801.460] <I> 4152 copy addr1 00000000bab3f124 to addr2 00000000ae7456e2 cost time 3050 us, len: 3932160.
[ 103.465226] <DRV_ADS6311> [1970-01-01 00:01:50 099.477.670] <I> 4152 copy addr1 00000000ae7456e2 to addr2 00000000bab3f124 cost time 2699 us, len: 3932160.
4. Memory to memory, user space: slightly over 2 ms:
[1970-01-01 00:01:55 393.750][v4l2halapi_cap][callback_process][200]:/dev/video0 copy size 3932160 from buf to buf cost: 2 ms 301 us.
[1970-01-01 00:01:55 434.785][v4l2halapi_cap][callback_process][200]:/dev/video1 copy size 3932160 from buf to buf cost: 2 ms 467 us.
[1970-01-01 00:01:55 622.394][v4l2halapi_cap][callback_process][200]:/dev/video0 copy size 3932160 from buf to buf cost: 2 ms 495 us.
[1970-01-01 00:01:55 663.103][v4l2halapi_cap][callback_process][200]:/dev/video1 copy size 3932160 from buf to buf cost: 2 ms 134 us.

  • Hi CE:

        I measured how long the memcpy operation takes in several scenarios. My test results are below.

        1. dmabuf to memory, user space takes about 36~40ms;
    [1970-01-01 00:01:54 505.320][v4l2halapi_cap][callback_process][189]:/dev/video1 copy size 3932160 from dmabuf to buf cost: 37 ms 732 us.
    [1970-01-01 00:01:54 699.738][v4l2halapi_cap][callback_process][189]:/dev/video0 copy size 3932160 from dmabuf to buf cost: 37 ms 847 us.
    [1970-01-01 00:01:54 740.942][v4l2halapi_cap][callback_process][189]:/dev/video1 copy size 3932160 from dmabuf to buf cost: 37 ms 954 us.
    [1970-01-01 00:01:54 932.309][v4l2halapi_cap][callback_process][189]:/dev/video0 copy size 3932160 from dmabuf to buf cost: 36 ms 713 us.
        2. dmabuf to memory, kernel space is between 36~40ms;
    [ 108.097452] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x11e01000 cost time 39699 us, sequence: 12, idx: 1.
    [ 108.147086] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xb601000 cost time 37459 us, sequence: 12, idx: 0.
    [ 108.333503] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x121c2000 cost time 38833 us, sequence: 13, idx: 1.
    [ 108.382931] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xb9c2000 cost time 37214 us, sequence: 13, idx: 0.
    [ 108.570038] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x12583000 cost time 38376 us, sequence: 14, idx: 1.
    [ 108.619260] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xef01000 cost time 37029 us, sequence: 14, idx: 0.
    [ 108.808304] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x12944000 cost time 38521 us, sequence: 15, idx: 1.
    [ 108.857273] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0xf2c2000 cost time 36848 us, sequence: 15, idx: 0.
        3. Memory to memory, kernel space takes about 3~6 ms;
    [ 103.366459] <DRV_ADS6311> [1970-01-01 00:01:50 000.705.930] <I> 4152 copy addr1 00000000bab3f124 to addr2 00000000ae7456e2 cost time 5669 us, len: 3932160.
    [ 103.402573] <DRV_ADS6311> [1970-01-01 00:01:50 036.823.105] <I> 4152 copy addr1 00000000ae7456e2 to addr2 00000000bab3f124 cost time 4392 us, len: 3932160.
    [ 103.434550] <DRV_ADS6311> [1970-01-01 00:01:50 068.801.460] <I> 4152 copy addr1 00000000bab3f124 to addr2 00000000ae7456e2 cost time 3050 us, len: 3932160.
    [ 103.465226] <DRV_ADS6311> [1970-01-01 00:01:50 099.477.670] <I> 4152 copy addr1 00000000ae7456e2 to addr2 00000000bab3f124 cost time 2699 us, len: 3932160.
        4. Memory to memory, user space takes about 2ms:
    [1970-01-01 00:01:55 393.750][v4l2halapi_cap][callback_process][200]:/dev/video0 copy size 3932160 from buf to buf cost: 2 ms 301 us.
    [1970-01-01 00:01:55 434.785][v4l2halapi_cap][callback_process][200]:/dev/video1 copy size 3932160 from buf to buf cost: 2 ms 467 us.
    [1970-01-01 00:01:55 622.394][v4l2halapi_cap][callback_process][200]:/dev/video0 copy size 3932160 from buf to buf cost: 2 ms 495 us.
    [1970-01-01 00:01:55 663.103][v4l2halapi_cap][callback_process][200]:/dev/video1 copy size 3932160 from buf to buf cost: 2 ms 134 us.

        The following is the corresponding log file:

    ti_am62a7_serial_ti_am62ax_serial_2024-03-26_11_44_17.log
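The test code behind these log lines is not shown in the thread. As a rough illustration only, a user-space measurement of this kind could look like the sketch below. It assumes a POSIX system with `clock_gettime`; the helper names `elapsed_us` and `timed_memcpy_us` are hypothetical and are not taken from `v4l2halapi_cap`.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Elapsed microseconds between two CLOCK_MONOTONIC timestamps. */
int64_t elapsed_us(const struct timespec *start, const struct timespec *end)
{
    int64_t ns = (int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
               + (int64_t)(end->tv_nsec - start->tv_nsec);
    return ns / 1000;
}

/* Hypothetical helper: copy len bytes and return the copy time in microseconds. */
int64_t timed_memcpy_us(void *dst, const void *src, size_t len)
{
    struct timespec t0, t1;

    assert(dst != NULL && src != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_us(&t0, &t1);
}
```

Calling `timed_memcpy_us` once with the mmap'd dmabuf as `src` and once with an ordinary heap buffer as `src`, both with a length of 3932160 bytes, would reproduce the "dmabuf to buf" and "buf to buf" comparison made in this post.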

  • Hi CE:

        The code for the new timing tests in the kernel driver is as follows:

        1.[ 108.097452] j721e-csi2rx 30102000.ticsi2rx: memcpy len 3932160 from addr 0x11e01000 cost time 39699 us, sequence: 12, idx: 1.

        2.[ 103.366459] <DRV_ADS6311> [1970-01-01 00:01:50 000.705.930] <I> 4152 copy addr1 00000000bab3f124 to addr2 00000000ae7456e2 cost time 5669 us, len: 3932160.

  • Hi Qingfeng,

    Thanks for sharing these measurements. 

    4. Memory to memory, user space takes about 2ms:
    [1970-01-01 00:01:55 393.750][v4l2halapi_cap][callback_process][200]:/dev/video0 copy size 3932160 from buf to buf cost: 2 ms 301 us.

    The user space memcpy performance is roughly as expected. Please refer to Sitara™ AM62A Benchmarks, Table 3-1, LMBench Results.

    3.932160 MB / 0.002301 s ≈ 1709 MB/s, which is close to the 2,058 MB/s given in the above app note.

    Do you know the clock frequency of the A53 core when you ran your measurements?

    I'll have to check internally about the first 3 cases of your test.

    Regards,

    Jianzhong
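The throughput arithmetic in the reply above can be written as a small helper. This is only a sketch; the function name `throughput_mb_per_s` is made up for illustration, and 1 MB is taken as 10^6 bytes, as in the reply.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in MB/s, with 1 MB = 10^6 bytes. */
double throughput_mb_per_s(uint64_t bytes, uint64_t micros)
{
    assert(micros > 0);
    /* Bytes per microsecond is numerically equal to MB per second. */
    return (double)bytes / (double)micros;
}
```

With the numbers logged in this thread, `throughput_mb_per_s(3932160, 2301)` gives about 1709 MB/s for the buffer-to-buffer copy, while `throughput_mb_per_s(3932160, 37732)` gives about 104 MB/s for the dmabuf copy.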

  • Hi Jianzhong:
        During testing, the core frequency was 1.25 GHz:

    root@am62axx-evm:~# cat /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq
    1250000
    root@am62axx-evm:~# cat /sys/devices/system/cpu/cpu1/cpufreq/cpuinfo_max_freq
    1250000
    root@am62axx-evm:~# cat /sys/devices/system/cpu/cpu2/cpufreq/cpuinfo_max_freq
    1250000
    root@am62axx-evm:~# cat /sys/devices/system/cpu/cpu3/cpufreq/cpuinfo_max_freq
    1250000
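Note that `cpuinfo_max_freq` reports the maximum frequency the core supports, while the sibling cpufreq file `scaling_cur_freq` reports the current one. A minimal sketch for reading either value from a test program follows; the helper name `read_cpufreq_khz` is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Read one integer (in kHz) from a cpufreq sysfs file; returns -1 on error. */
long read_cpufreq_khz(const char *path)
{
    FILE *f;
    long khz = -1;

    assert(path != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fscanf(f, "%ld", &khz) != 1)
        khz = -1;
    fclose(f);
    return khz;
}
```

For example, `read_cpufreq_khz("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq")` could be logged next to each memcpy timing to confirm the core was actually running at 1250000 kHz during the copy.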

  • Hi Qingfeng,

    For memcpy from dmabuf, the slow speed is due to cache coherency handling for the dmabuf. When DMA transfers CSI data into the dmabuf, any copy of the dmabuf contents held in cache becomes invalid. When you then do a memcpy from the dmabuf, Linux first performs cache maintenance and brings the data from DDR into cache, which takes extra time.

    If you have to use memcpy to transfer data from the dmabuf, you can change the ASEL value in the device tree to let the hardware take care of cache coherency. Please try changing the 3 lines at https://git.ti.com/cgit/ti-linux-kernel/ti-linux-kernel/tree/arch/arm64/boot/dts/ti/k3-am62a-main.dtsi?h=ti-linux-6.1.y#n974:

    		dmas = <&main_bcdma_csi 0 0x5000 0>, <&main_bcdma_csi 0 0x5001 0>,
    			<&main_bcdma_csi 0 0x5002 0>, <&main_bcdma_csi 0 0x5003 0>,
    			<&main_bcdma_csi 0 0x5004 0>, <&main_bcdma_csi 0 0x5005 0>;
    

    to

    		dmas = <&main_bcdma_csi 0 0x5000 15>, <&main_bcdma_csi 0 0x5001 15>,
    			<&main_bcdma_csi 0 0x5002 15>, <&main_bcdma_csi 0 0x5003 15>,
    			<&main_bcdma_csi 0 0x5004 15>, <&main_bcdma_csi 0 0x5005 15>;
    

    This will make the hardware maintain cache coherency between DMA buffers and cache. Therefore, when you do memcpy from dmabuf, the content in cache will always be valid.

    Regards,

    Jianzhong
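Separately from the device tree change, Linux provides a generic user-space interface, `DMA_BUF_IOCTL_SYNC` from `<linux/dma-buf.h>`, for telling the kernel when the CPU starts and stops accessing an mmap'd dma-buf. The sketch below only shows how a copy could be bracketed with it. It is not something suggested in this thread, it is not expected to make the copy faster by itself, and the helper names `dmabuf_sync_read` and `copy_from_dmabuf` are hypothetical.

```c
#include <assert.h>
#include <linux/dma-buf.h>
#include <string.h>
#include <sys/ioctl.h>

/* Begin (start != 0) or end (start == 0) a CPU read window on a dma-buf fd. */
int dmabuf_sync_read(int dmabuf_fd, int start)
{
    struct dma_buf_sync sync;

    memset(&sync, 0, sizeof sync);
    sync.flags = DMA_BUF_SYNC_READ |
                 (start ? DMA_BUF_SYNC_START : DMA_BUF_SYNC_END);
    return ioctl(dmabuf_fd, DMA_BUF_IOCTL_SYNC, &sync);
}

/* Copy len bytes out of an mmap'd dma-buf, bracketed by sync calls. */
int copy_from_dmabuf(int dmabuf_fd, void *dst, const void *mapped, size_t len)
{
    assert(dst != NULL && mapped != NULL);
    if (dmabuf_sync_read(dmabuf_fd, 1) != 0)
        return -1;
    memcpy(dst, mapped, len);
    return dmabuf_sync_read(dmabuf_fd, 0);
}
```

Here `dmabuf_fd` is assumed to be a dma-buf file descriptor (for example one exported with `VIDIOC_EXPBUF`) and `mapped` the address returned by `mmap` on that descriptor.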

  • Hi jianzhong:

        The modification works, and the memcpy time is greatly reduced:

        [1970-01-01 00:01:23 779.989][v4l2halapi_cap][callback_process][210]:/dev/video0 copy size 3932160 from dmabuf to buf cost: 4 ms 474 us.

        [1970-01-01 00:01:24 101.696][v4l2halapi_cap][callback_process][210]:/dev/video1 copy size 3932160 from dmabuf to buf cost: 3 ms 491 us.

  • Hi jianzhong:

        Does the AM62A7 support NEON?

  • Hi Qingfeng,

    Thanks for the testing and confirmation.

     Does am62a7 support neon?

    Yes, Neon is supported. Please refer to the AM62A TRM, section 1.4.1 Arm Cortex-A53 Subsystem (A53SS).

    Regards,

    Jianzhong
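As a rough illustration of using Neon from C, the sketch below checks at compile time whether the compiler has Neon enabled (the ACLE macro `__ARM_NEON`) and, if so, copies 16 bytes at a time with the `vld1q_u8` and `vst1q_u8` intrinsics. The function names `built_with_neon` and `copy_bytes` are hypothetical. The library `memcpy` is usually well optimized already, so this is not claimed to be faster.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Returns 1 when this file was compiled with Neon enabled, 0 otherwise. */
int built_with_neon(void)
{
#if defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}

/* Copy len bytes; uses 16-byte Neon loads and stores when available. */
void copy_bytes(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t i = 0;

    assert(dst != NULL && src != NULL);
#if defined(__ARM_NEON)
    for (; i + 16 <= len; i += 16)
        vst1q_u8(dst + i, vld1q_u8(src + i));
#endif
    /* Remaining tail bytes, or the whole buffer on builds without Neon. */
    memcpy(dst + i, src + i, len - i);
}
```

On a build without Neon the function simply falls back to `memcpy`, so the same source compiles on both the target and a development host.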