[FAQ] Why is memcpy so slow on AM6x when copying data from a V4L2 buffer?

Part Number: AM62A7
Other Parts Discussed in Thread: AM68A, AM69A, AM62A3, AM62A3-Q1

I want to copy data out of a V4L2 buffer filled by the CSI RX driver on AM6x, but the copy is very slow: a 4 MB copy takes about 38 ms. Why is that, and how can I improve memcpy performance?

Other parts: AM62A3, AM62A7-Q1, AM62A3-Q1, AM68A, AM69A

  • Reason for slow memcpy:

    On AM6x SoCs (AM62, AM62A, AM68A, etc.), the Linux CSI RX driver fills the V4L2 buffer in DDR through DMA. The DMA buffers for the CSI RX driver are configured as cache non-coherent by default.

    When the CSI RX driver fills the provided V4L2 buffer via DMA, any cached copy of that buffer becomes stale, so the DMA driver marks the cache lines for the V4L2 buffer invalid. When memcpy() then reads the V4L2 buffer, the data has to be brought in from DDR to the cache again. This is why memcpy() takes a long time.
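    To reproduce the measurement from the question on your own system, a minimal timing sketch such as the one below can help. The function names are illustrative, not part of any TI SDK, and the source pointer is assumed to be a V4L2 buffer you have already mapped. If "4 MB" means 4 MiB, 38 ms corresponds to roughly 110 MB/s.

```c
#define _POSIX_C_SOURCE 199309L
#include <stddef.h>
#include <string.h>
#include <time.h>

/* Convert a byte count and an elapsed time into MB/s (1 MB = 1e6 bytes).
 * Returns 0.0 if the elapsed time is not positive. */
double throughput_mb_per_s(size_t bytes, double seconds)
{
    return seconds > 0.0 ? (double)bytes / 1e6 / seconds : 0.0;
}

/* Time a single memcpy() of `len` bytes and return the elapsed seconds.
 * `src` would be the mmap()ed V4L2 buffer, `dst` an ordinary heap buffer. */
double time_memcpy(void *dst, const void *src, size_t len)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    memcpy(dst, src, len);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

    Timing the same copy between two ordinary malloc() buffers gives a baseline to compare the V4L2 buffer against.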

    Solution:

    To avoid the long memcpy time, the recommended approach is to skip memcpy altogether and either move the data with DMA or pass the buffer pointer to the consumer. If memcpy is unavoidable, the CSI RX DMA buffers can be made cache coherent. This is done in the device tree as shown below (using k3-am62a-main.dtsi for AM62A in SDK 9.2 as an example).

    diff --git a/arch/arm64/boot/dts/ti/k3-am62a-main.dtsi b/arch/arm64/boot/dts/ti/k3-am62a-main.dtsi
    index 1b3b492c7..0d35a6225 100644
    --- a/arch/arm64/boot/dts/ti/k3-am62a-main.dtsi
    +++ b/arch/arm64/boot/dts/ti/k3-am62a-main.dtsi
    @@ -971,9 +971,9 @@
      
            ti_csi2rx0: ticsi2rx@30102000 {
                    compatible = "ti,j721e-csi2rx";
    -               dmas = <&main_bcdma_csi 0 0x5000 0>, <&main_bcdma_csi 0 0x5001 0>,
    -                       <&main_bcdma_csi 0 0x5002 0>, <&main_bcdma_csi 0 0x5003 0>,
    -                       <&main_bcdma_csi 0 0x5004 0>, <&main_bcdma_csi 0 0x5005 0>;
    +               dmas = <&main_bcdma_csi 0 0x5000 15>, <&main_bcdma_csi 0 0x5001 15>,
    +                       <&main_bcdma_csi 0 0x5002 15>, <&main_bcdma_csi 0 0x5003 15>,
    +                       <&main_bcdma_csi 0 0x5004 15>, <&main_bcdma_csi 0 0x5005 15>;
                    dma-names = "rx0", "rx1", "rx2", "rx3", "rx4", "rx5";
                    reg = <0x00 0x30102000 0x00 0x1000>;
                    power-domains = <&k3_pds 182 TI_SCI_PD_EXCLUSIVE>;

    Refer to the device tree binding for DMA, Documentation/devicetree/bindings/dma/ti/k3-bcdma.yaml, for more information about DMA properties.

    Additional Information:

    The device tree modification above changes the ASEL value from 0 to 15. The meanings of the ASEL values are:

    1. ASEL=0. HW doesn't maintain data consistency between cache and DDR. Linux has to perform cache maintenance operations (such as flushing and invalidating) to keep cache and DDR consistent. When CSI data is written to DDR by DMA, the DMA driver marks the cache lines for the V4L2 buffer invalid.
    2. ASEL=14 or 15. HW maintains data consistency between cache and DDR, and Linux doesn't need to run cache operations. Therefore, memcpy is faster. The side effect of using 14 or 15 for ASEL is that all data moving through CBASS triggers snoops into the A53 caches so that stale lines are updated or invalidated. These snoops contend for L2 resources that programs running on the A53 cores may also be using, which can degrade A53 performance when traffic is heavy.
    3. The difference between 14 and 15 for ASEL is the L2 cache allocation used by the cache warming feature, as described in the TRM (for example, AM62A TRM, section 3.4.1 IO Coherency Support). When ASEL is 14, the A53 will allocate V4L2 buffer data in the cache (warming the cache).
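    To check which ASEL value a running system actually uses, the raw dmas property of the CSI RX node can be read from the live device tree and decoded. The sketch below only covers the decoding step and is illustrative: it assumes each dmas entry is a phandle followed by three specifier cells with ASEL last, as in the diff above, and that cells are stored as big-endian 32-bit values as in any flattened device tree. The helper names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: phandle + 3 specifier cells per entry, as in the diff above. */
#define CELLS_PER_DMA_ENTRY 4u

/* Read one big-endian 32-bit device tree cell. */
uint32_t dt_cell(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Return the ASEL cell of entry `entry` in a raw "dmas" property of
 * `len` bytes, or -1 if the entry does not exist. */
int dmas_asel(const uint8_t *prop, size_t len, size_t entry)
{
    const size_t entry_bytes = CELLS_PER_DMA_ENTRY * 4u;

    if (entry >= len / entry_bytes)
        return -1;

    /* ASEL is the last of the four cells in the entry. */
    return (int)dt_cell(prop + entry * entry_bytes + 3u * 4u);
}

/* Per the list above: 14 and 15 are the HW-coherent settings. */
int asel_is_coherent(int asel)
{
    return asel == 14 || asel == 15;
}
```

    The property bytes would typically come from reading the dmas file of the ticsi2rx node under /proc/device-tree; the exact path depends on the SoC's bus hierarchy and is left to the caller.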

    References:

    ARM Architecture - Accelerator Coherency Port

    For other common problems related to CSI camera, please refer to this FAQ: