
SK-AM69: tiovxisp + v4l2h264enc same pipeline performance bottleneck (each element performs as expected independently)

Part Number: SK-AM69

I have a gstreamer appsrc which feeds into the following end-to-end pipeline:

appsrc name=appsrc \
    ! tiovxisp dcc-isp-file=$DCC_PATH/dcc_viss.bin sink_0::dcc-2a-file=$DCC_PATH/dcc_2a.bin \
    ! v4l2h264enc ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true sync=false

Unfortunately, when I feed frames into the front end of this pipeline at 60 fps (format=bggr, framerate=60/1, width=2056, height=2464), I only receive 49 fps at the output. I instrumented the pipeline with leaky queues (`queue leaky=2`) and `tiperfoverlay`, and it appears that `tiovxisp` slows to ≤50 fps when used in conjunction with `v4l2h264enc`.

I've tried every combination of `output-io-mode` and `capture-io-mode` on `v4l2h264enc`, as well as various `pool-size` values on `tiovxisp`.

However, when I break up the pipeline, I am able to hit 60 fps or greater (tiovxisp can hit 100 fps):

# 100 FPS, tiovxisp only, WORKS at full FPS even when other pipeline is running
appsrc name=appsrc ! video/x-bayer,format=bggr,framerate=100/1,height=2056,width=2464 \
    ! tiovxisp dcc-isp-file=$DCC_PATH/dcc_viss.bin sink_0::dcc-2a-file=$DCC_PATH/dcc_2a.bin \
    ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true sync=false

# 60 FPS v4l2h264enc only, WORKS at full FPS even when other pipeline is running
# needs videorate since videotestsrc can't generate 100 fps on the AM69
videotestsrc ! video/x-raw,framerate=20/1 ! videorate \
    ! video/x-raw,framerate=60/1,height=2056,width=2464,format=NV12 ! v4l2h264enc \
    ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true sync=false

# Also 60 FPS from custom appsrc, WORKS at full FPS
appsrc name=appsrc ! video/x-raw,framerate=60/1,height=2056,width=2464,format=NV12 \
    ! v4l2h264enc ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true sync=false

My CPU utilization is very low and my memory bandwidth isn't unreasonably high. Most confusingly, I can run two of the above pipelines at the same time in separate processes without any slowdown (e.g. the `tiovxisp`-only pipeline at 100 fps plus the `v4l2h264enc` pipeline at 60 fps)! So I don't believe it's cache misses or a DMA bandwidth limitation.
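For what it's worth, a back-of-envelope estimate of the raw frame traffic (my own sketch, assuming 8-bit Bayer input, a single NV12 intermediate, and no extra buffer copies) comes out far below the ~4.5 GB/s DDR totals the overlay reports:

```python
# Back-of-envelope payload traffic for the end-to-end pipeline.
# Assumptions (mine, not measured): 8-bit Bayer input (1 B/px),
# one NV12 intermediate (1.5 B/px), no extra buffer copies.
W, H, FPS = 2056, 2464, 60

frame_px = W * H                 # pixels per frame
bayer_bytes = frame_px           # 8-bit Bayer: 1 byte/pixel
nv12_bytes = frame_px * 3 // 2   # NV12: 1.5 bytes/pixel

# tiovxisp reads Bayer and writes NV12; v4l2h264enc reads the NV12 back.
isp_bw = (bayer_bytes + nv12_bytes) * FPS
enc_bw = nv12_bytes * FPS
total_mb_s = (isp_bw + enc_bw) / 1e6

print(f"frame: {frame_px / 1e6:.2f} MP, payload traffic ~{total_mb_s:.0f} MB/s")
# → frame: 5.07 MP, payload traffic ~1216 MB/s
```

Even doubling that figure to account for hidden copies stays well under the measured DDR numbers, which is consistent with bandwidth not being the limit.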

`tiperfoverlay` from failing 50 fps pipeline (end to end)

CPU: mpu: TOTAL LOAD = 9.21
CPU:  c7x_1: TOTAL LOAD = 0.00
CPU:  c7x_2: TOTAL LOAD = 0.00
CPU:  c7x_3: TOTAL LOAD = 0.00
CPU:  c7x_4: TOTAL LOAD = 0.00
HWA:   VISS: LOAD = 37.16 % ( 250 MP/s )
DDR: READ  BW: AVG =   2445 MB/s, PEAK =   2445 MB/s
DDR: WRITE BW: AVG =   2120 MB/s, PEAK =   2120 MB/s
DDR: TOTAL BW: AVG =   4565 MB/s, PEAK =   4565 MB/s
TEMP: thermal_zone0(MCU_R5F) = 55.00 C
TEMP: thermal_zone1(MCU) = 55.43 C
TEMP: thermal_zone2(GPU) = 51.93 C
TEMP: thermal_zone3(C7x) = 55.00 C
TEMP: thermal_zone4(CPU) = 53.25 C
TEMP: thermal_zone5(C7x) = 55.00 C
TEMP: thermal_zone6(DDR) = 56.52 C
FPS: 48
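Incidentally, the VISS MP/s figures track the delivered frame rate almost exactly once you divide by the ~5.07 MP frame size (my own cross-check; the 488 MP/s figure is from the tiovxisp-only dump further down). At ~37 % load for ~49 fps, the ISP hardware itself seems to have plenty of headroom:

```python
# Cross-check tiperfoverlay's VISS MP/s against frame size * fps.
frame_mp = 2056 * 2464 / 1e6     # ~5.066 MP per frame

print(250 / frame_mp)  # failing end-to-end dump: ~49 fps (overlay showed FPS: 48)
print(488 / frame_mp)  # tiovxisp-only dump:      ~96 fps (overlay showed FPS: 95)
```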

Working partial pipeline (v4l2h264enc only @ 60 fps):

CPU: mpu: TOTAL LOAD = 13.10
CPU:  c7x_1: TOTAL LOAD = 0.00
CPU:  c7x_2: TOTAL LOAD = 0.00
CPU:  c7x_3: TOTAL LOAD = 0.00
CPU:  c7x_4: TOTAL LOAD = 0.00
DDR: READ  BW: AVG =   2577 MB/s, PEAK =   2577 MB/s
DDR: WRITE BW: AVG =   1699 MB/s, PEAK =   1699 MB/s
DDR: TOTAL BW: AVG =   4276 MB/s, PEAK =   4276 MB/s
TEMP: thermal_zone0(MCU_R5F) = 55.00 C
TEMP: thermal_zone1(MCU) = 55.00 C
TEMP: thermal_zone2(GPU) = 52.15 C
TEMP: thermal_zone3(C7x) = 55.22 C
TEMP: thermal_zone4(CPU) = 54.13 C
TEMP: thermal_zone5(C7x) = 54.78 C
TEMP: thermal_zone6(DDR) = 57.38 C

Partial (tiovxisp only @ 100 fps):

CPU: mpu: TOTAL LOAD = 7.50
CPU:  c7x_1: TOTAL LOAD = 0.00
CPU:  c7x_2: TOTAL LOAD = 0.00
CPU:  c7x_3: TOTAL LOAD = 0.00
CPU:  c7x_4: TOTAL LOAD = 0.00
HWA:   VISS: LOAD = 72.58 % ( 488 MP/s )
DDR: READ  BW: AVG =   2301 MB/s, PEAK =   2301 MB/s
DDR: WRITE BW: AVG =   2617 MB/s, PEAK =   2617 MB/s
DDR: TOTAL BW: AVG =   4918 MB/s, PEAK =   4918 MB/s
TEMP: thermal_zone0(MCU_R5F) = 54.78 C
TEMP: thermal_zone1(MCU) = 55.43 C
TEMP: thermal_zone2(GPU) = 51.71 C
TEMP: thermal_zone3(C7x) = 54.78 C
TEMP: thermal_zone4(CPU) = 52.81 C
TEMP: thermal_zone5(C7x) = 55.00 C
TEMP: thermal_zone6(DDR) = 56.52 C
FPS: 95

Two independent pipelines (`tiovxisp` @ 100 fps in one + `v4l2h264enc` @ 60 fps in another) running simultaneously:

CPU: mpu: TOTAL LOAD = 20.69
CPU:  c7x_1: TOTAL LOAD = 0.00
CPU:  c7x_2: TOTAL LOAD = 0.00
CPU:  c7x_3: TOTAL LOAD = 0.00
CPU:  c7x_4: TOTAL LOAD = 0.00
HWA:   VISS: LOAD = 70.35 % ( 473 MP/s )
DDR: READ  BW: AVG =   3992 MB/s, PEAK =   3992 MB/s
DDR: WRITE BW: AVG =   3783 MB/s, PEAK =   3783 MB/s
DDR: TOTAL BW: AVG =   7775 MB/s, PEAK =   7775 MB/s
TEMP: thermal_zone0(MCU_R5F) = 55.43 C
TEMP: thermal_zone1(MCU) = 56.95 C
TEMP: thermal_zone2(GPU) = 52.15 C
TEMP: thermal_zone3(C7x) = 56.52 C
TEMP: thermal_zone4(CPU) = 54.13 C
TEMP: thermal_zone5(C7x) = 55.22 C
TEMP: thermal_zone6(DDR) = 57.81 C

Is there any reason `tiovxisp` would slow down when attached to a `v4l2h264enc` element in the same pipeline?
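One way to quantify the gap: dropping from the 60 fps input rate to ~49 fps at the output corresponds to only a few extra milliseconds per frame, which looks like some fixed per-buffer cost that appears only when both elements share a pipeline (that interpretation is my guess; the arithmetic below just converts the rates):

```python
# Convert the observed frame rates into per-frame periods to see how much
# extra per-buffer time the combined pipeline is spending.
period_60 = 1000 / 60   # ms/frame at the 60 fps input rate
period_49 = 1000 / 49   # ms/frame at the ~49 fps observed output

print(f"{period_49 - period_60:.1f} ms of extra work per frame")
# → 3.7 ms of extra work per frame
```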

This is running on processor-sdk-linux-am69a (09_01_00).

EDIT: In the hope of reproducing this without my `appsrc` and hardware, here are some pipelines that should show the issue with just the Processor SDK:

# 1) end-to-end, <50 FPS performance
gst-launch-1.0 videotestsrc ! video/x-bayer,framerate=15/1 ! videorate \
    ! video/x-bayer,framerate=60/1,height=2056,width=2464,format=bggr \
    ! tiovxisp dcc-isp-file=/opt/imaging/imx219/linear/dcc_viss.bin \
        sink_0::dcc-2a-file=/opt/imaging/imx219/linear/dcc_2a.bin \
        sink_0::pool-size=16 src::pool-size=16 \
    ! v4l2h264enc capture-io-mode=dmabuf ! tiperfoverlay dump=true overlay=false \
    ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true -ve

# 2) only tiovxisp, ~60 fps
gst-launch-1.0 videotestsrc ! video/x-bayer,framerate=15/1 ! videorate \
    ! video/x-bayer,framerate=60/1,height=2056,width=2464,format=bggr \
    ! tiovxisp dcc-isp-file=/opt/imaging/imx219/linear/dcc_viss.bin \
        sink_0::dcc-2a-file=/opt/imaging/imx219/linear/dcc_2a.bin \
        sink_0::pool-size=16 src::pool-size=16 \
    ! tiperfoverlay dump=true overlay=false \
    ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true -ve
    
# 3) only v4l2h264enc, ~60 fps
gst-launch-1.0 videotestsrc ! video/x-raw,framerate=15/1 ! videorate \
    ! video/x-raw,framerate=60/1,height=2056,width=2464,format=NV12 \
    ! v4l2h264enc capture-io-mode=dmabuf ! tiperfoverlay dump=true overlay=false \
    ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true -ve

#########
# Note that pipelines 2 & 3 can run simultaneously on the AM69 without any issue,
# while pipeline 1 cannot achieve the expected rate.


  • One additional piece of information I found during testing tonight:

    If I split the end-to-end pipeline across two different processes using `shmsink` and `shmsrc`, I can get close to my target fps.

    ## The following achieves ~54-58 FPS ##
    # In one shell
    appsrc name=appsrc \
        ! tiovxisp dcc-isp-file=$DCC_PATH/dcc_viss.bin sink_0::dcc-2a-file=$DCC_PATH/dcc_2a.bin \
            sink_0::pool-size=16 src::pool-size=16 \
        ! queue leaky=2 max-size-bytes=501319680 ! tiperfoverlay dump=true overlay=false ! gdppay \
        ! shmsink socket-path=/tmp/ipc wait-for-connection=1 sync=true
    
    # In another
    shmsrc socket-path=/tmp/ipc ! gdpdepay ! v4l2h264enc capture-io-mode=dmabuf \
        ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true
    
    ## Output of `tiperfoverlay` when *both* pipelines are running
    CPU: mpu: TOTAL LOAD = 22.20
    CPU:  c7x_1: TOTAL LOAD = 0.00
    CPU:  c7x_2: TOTAL LOAD = 0.00
    CPU:  c7x_3: TOTAL LOAD = 0.00
    CPU:  c7x_4: TOTAL LOAD = 0.00
    HWA:   VISS: LOAD = 43.83 % ( 294 MP/s )
    DDR: READ  BW: AVG =   3936 MB/s, PEAK =   3936 MB/s
    DDR: WRITE BW: AVG =   3662 MB/s, PEAK =   3662 MB/s
    DDR: TOTAL BW: AVG =   7598 MB/s, PEAK =   7598 MB/s
    TEMP: thermal_zone0(MCU_R5F) = 55.87 C
    TEMP: thermal_zone1(MCU) = 57.38 C
    TEMP: thermal_zone2(GPU) = 52.37 C
    TEMP: thermal_zone3(C7x) = 56.30 C
    TEMP: thermal_zone4(CPU) = 54.13 C
    TEMP: thermal_zone5(C7x) = 55.43 C
    TEMP: thermal_zone6(DDR) = 58.02 C
    FPS: 58

    However, when I move both `tiovxisp` and `v4l2h264enc` back into the same pipeline, things slow down again...

    ## The following achieves only 47-49 fps ##
    # In one shell
    appsrc name=appsrc ! tiperfoverlay dump=true overlay=false ! gdppay \
        ! shmsink socket-path=/tmp/ipc wait-for-connection=1
    
    # In another
    shmsrc socket-path=/tmp/ipc ! gdpdepay \
        ! tiovxisp dcc-isp-file=$DCC_PATH/dcc_viss.bin sink_0::dcc-2a-file=$DCC_PATH/dcc_2a.bin \
            sink_0::pool-size=16 src::pool-size=16 \
        ! tiperfoverlay dump=true overlay=false ! queue ! v4l2h264enc \
        ! fpsdisplaysink video-sink=fakesink signal-fps-measurements=true
        
    # `tiperfoverlay` output when both pipelines are running
    CPU: mpu: TOTAL LOAD = 12.31
    CPU:  c7x_1: TOTAL LOAD = 0.00
    CPU:  c7x_2: TOTAL LOAD = 0.00
    CPU:  c7x_3: TOTAL LOAD = 0.00
    CPU:  c7x_4: TOTAL LOAD = 0.00
    HWA:   VISS: LOAD = 37.62 % ( 253 MP/s )
    DDR: READ  BW: AVG =   2954 MB/s, PEAK =   2954 MB/s
    DDR: WRITE BW: AVG =   2525 MB/s, PEAK =   2525 MB/s
    DDR: TOTAL BW: AVG =   5479 MB/s, PEAK =   5479 MB/s
    TEMP: thermal_zone0(MCU_R5F) = 54.78 C
    TEMP: thermal_zone1(MCU) = 55.22 C
    TEMP: thermal_zone2(GPU) = 51.93 C
    TEMP: thermal_zone3(C7x) = 54.78 C
    TEMP: thermal_zone4(CPU) = 53.47 C
    TEMP: thermal_zone5(C7x) = 54.56 C
    TEMP: thermal_zone6(DDR) = 56.95 C
    FPS: 48

    As mentioned previously, I've tried adding `queue` elements just about everywhere to force GStreamer to split the work across multiple threads, but in both the IPC case and the single-pipeline case, adding queues between each element has had no impact. This seems to be related to how `tiovxisp` or `v4l2h264enc` handle threading/memory management, but I'm at a loss for where to look next.

  • Hi Logan,

    Please allow me some time to take a look at this and test on my end. Please expect a response by the end of the week.

    Thank you for your patience.

    - Fabiana

  • Hi Logan,

    I recommend running the gst tracers script from our Edge AI SDK for the SK-AM69 to collect additional information such as latency: see "7. Measuring performance" in the Processor SDK Linux for AM69A documentation. I was able to run the pipelines you shared and observed the same results. As you suspected, having `v4l2h264enc` and `tiovxisp` in the same pipeline does appear to be the cause of the slowdown.

    I will discuss with my team to see if there is anything that can be done to optimize these pipelines and will get back to you after the holiday break, in the first week of January.

    Thank you,

    Fabiana