I’m seeing an intermittent kernel failure during video decode on the DM365. There is more than one type of failure, but at least 90% of them look like this:
Unable to handle kernel paging request at virtual address afd24814
pgd = c2370000
[afd24814] *pgd=00000000
Internal error: Oops: 5 [#1]
Modules linked in: dm365mmap edmak irqk cmemk linx_eth_cm linx regrw
CPU: 0
PC is at lnhcb_deliver+0xc7c/0xd80 [linx]
LR is at all_conns_connected+0x80/0x88 [linx]
pc : [<bf0250c4>] lr : [<bf021798>] Not tainted
sp : c2285e38 ip : c2285e20 fp : c2285e94
r10: 00000001 r9 : c389b2c0 r8 : 00010005
r7 : ef06c814 r6 : c0ba4560 r5 : 3bc1b205 r4 : 00000001
r3 : c0cb8000 r2 : 3bc1b205 r1 : 00000001 r0 : 00000001
Flags: nzCv IRQs on FIQs on Mode SVC_32 Segment user
Control: 5317F
Table: 82370000 DAC: 00000015
Process tsApp (pid: 786, stack limit = 0xc2284258)
Stack: (0xc2285e38 to 0xc2286000)
5e20: c0ada800 c0ada800
5e40: c2285e6c c2285e50 bf04721c 3bc1b205 00000002 c389b2c0 00000000 00000001
5e60: c2285e94 c2285e70 c01b3bc8 0000000c 3bc1b205 c389b2c0 00000001 c0ada800
5e80: 00000000 00000000 c2285edc c2285e98 bf046e1c bf024458 0000000c c389b2c0
5ea0: 00000000 00000002 00000000 f0007fff 00000001 c0ada800 c389b2c0 0000097b
5ec0: 00000000 c0ada8c4 00000001 00000000 c2285f04 c2285ee0 bf04750c bf046a6c
5ee0: c0ada8d0 00000000 00000004 00000000 00000009 c02d51c0 c2285f24 c2285f08
5f00: c0052d3c bf047324 00000001 c02d5210 00000102 c2284000 c2285f34 c2285f28
5f20: c0052dd4 c0052cd4 c2285f64 c2285f38 c00530d4 c0052dac c2285f74 00400140
5f40: c2284000 c2285fb0 00000001 00000000 c2284000 001cfa04 c2285f7c c2285f68
5f60: c00531a8 c0053088 00000035 c026ad40 c2285f8c c2285f80 c005352c c0053184
5f80: c2285fac c2285f90 c0038bc4 c00534f4 00000001 ffffffff fbc48000 001cebac
5fa0: 00000000 c2285fb0 c0037c2c c0038b90 42deb000 436fe3c8 00000168 00000001
5fc0: 436feb68 001d08f4 001cebac 001dc738 42de5000 00000000 001cfa04 436fe0e4
5fe0: 00000018 436fe0c0 0006e1bc 00087388 20000010 ffffffff 00000000 00000000
Backtrace:
[<bf024448>] (lnhcb_deliver+0x0/0xd80 [linx]) from [<bf046e1c>] (rx_tasklet_recv+0x3c0/0x3d8 [linx_eth_cm])
[<bf046a5c>] (rx_tasklet_recv+0x0/0x3d8 [linx_eth_cm]) from [<bf04750c>] (rx_tasklet+0x1f8/0x244 [linx_eth_cm])
[<bf047314>] (rx_tasklet+0x0/0x244 [linx_eth_cm]) from [<c0052d3c>] (__tasklet_action+0x78/0x94)
[<c0052cc4>] (__tasklet_action+0x0/0x94) from [<c0052dd4>] (tasklet_action+0x38/0x40)
r7 = C2284000 r6 = 00000102 r5 = C02D5210 r4 = 00000001
[<c0052d9c>] (tasklet_action+0x0/0x40) from [<c00530d4>] (___do_softirq+0x5c/0xfc)
[<c0053078>] (___do_softirq+0x0/0xfc) from [<c00531a8>] (__do_softirq+0x34/0x50)
[<c0053174>] (__do_softirq+0x0/0x50) from [<c005352c>] (irq_exit+0x48/0x64)
r5 = C026AD40 r4 = 00000035
[<c00534e4>] (irq_exit+0x0/0x64) from [<c0038bc4>] (asm_do_IRQ+0x44/0x50)
[<c0038b80>] (asm_do_IRQ+0x0/0x50) from [<c0037c2c>] (__irq_usr+0x4c/0xa0)
r6 = 001CEBAC r5 = FBC48000 r4 = FFFFFFFF
Code: e5893000 e51b2048 e5963010 e1a07102 (e7935007)
<1>Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c2370000
[00000000] *pgd=82286031, *pte=00000000, *ppte=00000000
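As a cross-check on the first oops above, the faulting instruction word and the register dump can be reconciled by hand. The last word on the Code: line, e7935007, is the ARM encoding of ldr r5, [r3, r7], and the word before it, e1a07102, is mov r7, r2, lsl #2. The minimal C sketch below recomputes the fault address from the logged registers. The helper names (decode_ldr_reg, ldr_reg_address, oops_is_consistent) are hypothetical and exist only for this illustration; the constants are copied from the log.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register fields of an ARM "LDR Rd, [Rn, Rm]" word (register offset, no shift). */
struct ldr_reg {
    unsigned rn, rd, rm;
};

/* Returns 0 and fills *out if insn is a plain LDR Rd, [Rn, Rm]; -1 otherwise. */
static int decode_ldr_reg(uint32_t insn, struct ldr_reg *out)
{
    assert(out != NULL);
    /* Mask off the condition and register fields; the remaining bits must be
     * a word load with register offset, pre-indexed, add, no writeback, and
     * a zero shift applied to Rm. */
    if ((insn & 0x0ff00ff0u) != 0x07900000u)
        return -1;
    out->rn = (insn >> 16) & 0xfu;
    out->rd = (insn >> 12) & 0xfu;
    out->rm = insn & 0xfu;
    return 0;
}

/* Address accessed by LDR Rd, [Rn, Rm]: Rn + Rm, wrapping modulo 2^32. */
static uint32_t ldr_reg_address(const uint32_t regs[16], const struct ldr_reg *f)
{
    return (uint32_t)(regs[f->rn] + regs[f->rm]);
}

/* Returns 1 if the logged registers reproduce the logged fault address. */
static int oops_is_consistent(void)
{
    uint32_t regs[16] = {0};
    struct ldr_reg f;

    regs[2]  = 0x3bc1b205u;  /* r2 from the dump */
    regs[3]  = 0xc0cb8000u;  /* r3 from the dump */
    regs[7]  = 0xef06c814u;  /* r7 from the dump */
    regs[11] = 0xc2285e94u;  /* fp from the dump */

    if (decode_ldr_reg(0xe7935007u, &f) != 0)
        return 0;

    return ldr_reg_address(regs, &f) == 0xafd24814u  /* logged fault address */
        && (uint32_t)(regs[2] << 2) == regs[7]       /* mov r7, r2, lsl #2 */
        && regs[11] - 0x48u == 0xc2285e4cu;          /* stack word holding 3bc1b205 */
}
```

The check passes: r3 + r7 wraps to afd24814, r7 is r2 shifted left by 2, and the word at fp - 0x48 in the stack dump is the same 3bc1b205. One possible reading, offered only as a hypothesis, is that lnhcb_deliver is using 0x3bc1b205 as an index into a table of 4-byte entries at c0cb8000, which would mean the index itself was already corrupt before the load.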
I’m running a test that decodes the same 5 video segments repeatedly. The 5 videos are encoded with the H.264 encoder at a resolution of 720 x 480, muxed with audio and subtitle data into an MPEG-2 transport stream, and range in length from 20 to 90 seconds. The set often decodes successfully 10 or more times before one of the videos fails, but sometimes the failure occurs the first time a video is decoded. Any one of the 5 videos may be the one that fails, and the frame number at which it fails varies as well. I have also re-recorded the 5 videos, so the video data is completely new, but the failure still occurs.
You can see from the oops console output that the failure appears to be in ‘linx’, which is third-party IPC software used in our system. However, the failure occurs only when we are actually decoding video via the VIDDEC2_process() call. If I perform all processing of the video up to the VIDDEC2_process() call and stop at that point, the failure doesn’t occur even with overnight testing (note that in this test case we are still decoding audio and subtitles and handling a large volume of linx message traffic). If I make the VIDDEC2_process() call but then bypass all remaining processing (the DMA of the decoded data to the display buffers and the display of the video), the problem still occurs. The problem has never been observed while encoding, only while decoding. Thus it seems to be isolated to the VIDDEC2_process() call.
It is important to note that our application is configured to perform encode or decode, but never both at the same time. Therefore after each decode, our application software is killed and restarted with a configuration as an encoder. When the subsequent video decode is started approximately 10 seconds later, the application is again killed and restarted, but it is restarted with a configuration as a decoder. So all dvsdk resources are being cleaned up and re-allocated with each individual decode session. Also, after each decode session the cmemk, irqk, edmak, and dm365mmap kernel modules are removed and re-installed. This occurs after the application is killed as a decoder and before it is restarted as an encoder.
The video data is encoded using the same processor and dvsdk that are attempting to decode it. Also, we calculate a CRC on each video frame and add it to the transport stream header for that frame, and then check the CRC just before making the VIDDEC2_process() call. The CRC always checks out OK, so I think we can rule out corruption of the video data between encoding and decoding.
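For reference, the per-frame check described above has roughly the following shape. This is a minimal sketch, not our actual code: the post does not depend on a particular CRC variant, so the standard reflected CRC-32 (polynomial 0xEDB88320) is assumed here, and the names frame_crc32 and frame_crc_ok are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (assumed variant: reflected polynomial 0xEDB88320,
 * initial value and final XOR of 0xFFFFFFFF). */
static uint32_t frame_crc32(const uint8_t *data, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; bit++) {
            /* Shift out one bit; XOR in the polynomial if that bit was set. */
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Returns 1 if the frame matches the CRC carried in its header, else 0.
 * Intended to be called on the input buffer just before the decode call. */
static int frame_crc_ok(const uint8_t *frame, size_t len, uint32_t expected)
{
    return frame_crc32(frame, len) == expected;
}
```

In the decode path, frame_crc_ok() would be called with the input buffer, the frame length, and the CRC read from the transport stream header for that frame, and the frame would be passed to VIDDEC2_process() only when it returns 1.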
The dvsdk I’m using is ‘udworks-v2.1-02_10_01_18’ and contains the following component versions:
codec_engine_2_24
dmai_1_21_00_10
dvtb_4_10_03
xdais_6_24
cg_xml_2_12_00
dm365_2_10_01_18_release_notes.html
edma3_lld_1_06_00_01
xdctools_3_15_01_59
dm365_codecs_01_00_06
linuxutils_2_24_03
dvsdk_demos_2_10_00_17
framework_components_2_25_00_04
…with the platinum codecs installed:
H.264 High Profile DM365 Encoder 02.00.00.08
H.264 High Profile DM365 Decoder 02.00.00.05
However, the problem was also observed with our software build that has these codec versions:
H.264 High Profile DM365 Encoder 01.20.00.05
H.264 High Profile DM365 Decoder 01.10.00.04
Our decoder configuration is:
IH264VDEC_Params tParams;
IH264VDEC_DynamicParams tDynamicParams;
tParams.viddecParams.maxHeight = 720;
tParams.viddecParams.maxWidth = 1280;
tParams.viddecParams.size = sizeof (IH264VDEC_Params);
tParams.viddecParams.maxFrameRate = 30000;
tParams.viddecParams.maxBitRate = 0;
tParams.viddecParams.dataEndianness = XDM_BYTE;
tParams.viddecParams.forceChromaFormat = XDM_YUV_420SP;
tParams.hdvicpHandle = NULL;
tParams.displayDelay = 16;
tParams.levelLimit = 0;
tParams.disableHDVICPeveryFrame = 0;
tParams.inputDataMode = 1;
tParams.sliceFormat = 1;
tParams.frame_closedloop_flag = 0;
// Set video decoder dynamic params
tDynamicParams.viddecDynamicParams.size = sizeof (IH264VDEC_DynamicParams);
tDynamicParams.viddecDynamicParams.decodeHeader = XDM_DECODE_AU;
tDynamicParams.viddecDynamicParams.displayWidth = 0;
tDynamicParams.viddecDynamicParams.frameSkipMode = IVIDEO_NO_SKIP;
tDynamicParams.viddecDynamicParams.frameOrder = IVIDDEC2_DISPLAY_ORDER;
tDynamicParams.viddecDynamicParams.newFrameFlag = XDAS_FALSE;
tDynamicParams.viddecDynamicParams.mbDataFlag = XDAS_FALSE;
tDynamicParams.getDataFxn = NULL;
tDynamicParams.dataSyncHandle = NULL;
tDynamicParams.resetHDVICPeveryFrame = 1;
And a typical set of decoder input arguments is:
VIDDEC2_InArgs tInArgs;
XDM1_BufDesc tInBufDesc;
tInBufDesc.descs[0].bufSize = 6417;
tInBufDesc.descs[0].buf = (XDAS_Int8 *)0x47b10000;
tInBufDesc.descs[0].accessMask = 0;
tInBufDesc.numBufs = 1;
tInArgs.numBytes = 6417;
tInArgs.inputID = 3;
tInArgs.size = sizeof (VIDDEC2_InArgs);
My questions are:
1) Is there a known problem with corruption of the kernel while decoding video on the DM365?
2) Are there any suggested methods of further isolating this problem to find the root cause?