I am encountering a scheduling issue with software based on the DM6467 h.264 encode demo. A call to Venc1_process() in a higher priority thread (thread 1) waits for a memory transfer loop in a lower priority thread (thread 2).
In my application, thread 2 can take as long as it wants; its speed does not matter at all. Thread 1 is crucial. My only wild guess is that the compiler is doing something clever with my for loop that prevents the higher priority thread from preempting it. Any ideas? Thanks.
****************** Excerpt from thread 1 (SCHED_FIFO Priority = MAX - 1) ***********************************
if (Venc1_process(hVe1, hCcvOutBuf, hDstBuf) < 0) {
    ERR("Failed to encode video buffer\n");
    cleanup(THREAD_FAILURE);
}
**************************************************************************************************************************
******************** Within the Venc1_process() call, defined in Videnc1.c, it stops at this line ***************
/* Encode video buffer */
status = VIDENC1_process(hVe->hEncode, &inBufDesc, &outBufDesc, &inArgs, &outArgs);
******************** Excerpt from thread 2 (SCHED_FIFO Priority = MAX - 5) **********************************
printf("Begin Semi-planar to Planar format conversion\n");
//Reformat data
for(k = 0; k < imgBufSize/4; k ++) {
    //Copy Cb data
    imgIn422PBufP[imgBufCPOff + k] = imgInBufP[imgBufCPOff + (k*2)];
    //Copy Cr data
    imgIn422PBufP[imgBufCPOff + imgBufCROff + k] = imgInBufP[imgBufCPOff + (k*2) + 1];
}
printf("Completed Semi-planar to Planar format conversion\n");
***********************************************************************************************************
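For anyone checking their own setup against the priorities quoted above: a minimal sketch, assuming plain POSIX threads on Linux, of how a SCHED_FIFO priority of "MAX - n" is typically requested. The helper name fifo_attr_init is hypothetical and is not taken from the demo code, which may set its priorities differently.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Prepare thread attributes for SCHED_FIFO at (max priority - offset).
 * Hypothetical helper for illustration only.
 * Returns 0 on success, -1 for an out-of-range offset, or a pthread error code. */
static int fifo_attr_init(pthread_attr_t *attr, int offset_from_max)
{
    struct sched_param sp;
    int max = sched_get_priority_max(SCHED_FIFO);
    int min = sched_get_priority_min(SCHED_FIFO);
    int rc;

    assert(attr != NULL);
    if (max < 0 || min < 0 || offset_from_max < 0 ||
        max - offset_from_max < min) {
        return -1;
    }

    rc = pthread_attr_init(attr);
    if (rc != 0) {
        return rc;
    }

    /* Without this call the new thread inherits the creator's policy and
     * the policy and priority stored in attr are silently ignored. */
    rc = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    if (rc == 0) {
        rc = pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    }
    if (rc == 0) {
        sp.sched_priority = max - offset_from_max;
        rc = pthread_attr_setschedparam(attr, &sp);
    }
    if (rc != 0) {
        pthread_attr_destroy(attr);
    }
    return rc;
}
```

The resulting attr would be passed as the second argument to pthread_create(). Creating a SCHED_FIFO thread normally requires root or CAP_SYS_NICE, so the return value of pthread_create() is worth checking as well.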
Andrew Muehlfeld wrote: A call to Venc1_process() in a higher priority thread (thread 1) waits for a memory transfer loop in a lower priority thread (thread 2).
Is this a remote codec? If so, do you know that the VIDENC1_process() call isn't just stuck waiting for a return from the remote core?
If you remove the low-pri thread loop, does the VIDENC1_process() call return normally?
Does the VIDENC1_process() call return after the loop is done?
Sorry for the basic questions, just need to set the table correctly in order to get to the bottom of the issue.
Regards,
- Rob
Q: Is this a remote codec? If so, do you know that the VIDENC1_process() call isn't just stuck waiting for a return from the remote core?
A: Yes, this is a remote codec. VIDENC1_process() is called from the ARM on a DM6467T. It is calling TI's h264enc v01.20.02, using the DMAI interface. The h264enc codec runs on the DSP core. It's certainly possible that the VIDENC1_process() call is waiting for a return from the remote core. At first glance, that seems pretty likely. If that's the case, the question becomes: why doesn't the remote core return immediately, as it does without the loop? The VIDENC1_process() call is part of a 30 frames per second video encoding application. It stalls for about one second during the memory transfer loop, noted both by watching the video and from debug printf() statements. Are there some resources VIDENC1_process() needs that may be used by the loop? Are there certain conditions under which a remote call cannot be made?
Q: If you remove the low-pri thread loop, does the VIDENC1_process() call return normally?
A: Yes. If I remove the loop, the VIDENC1_process() call returns normally.
Q: Does the VIDENC1_process() call return after the loop is done?
A: Yes, the VIDENC1_process() call returns after the loop is done.
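The stall above was observed from printf() statements and by watching the video. A minimal sketch, assuming a POSIX monotonic clock is available on the ARM side, of how the duration of each VIDENC1_process() call could be measured instead. The helper names stamp and elapsed_us are hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Record the current monotonic time (unaffected by wall-clock changes). */
static void stamp(struct timespec *ts)
{
    assert(ts != NULL);
    clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Microseconds elapsed between two stamps taken with stamp(). */
static long long elapsed_us(const struct timespec *start,
                            const struct timespec *end)
{
    long long ns;

    assert(start != NULL && end != NULL);
    ns = (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
       + (long long)(end->tv_nsec - start->tv_nsec);
    return ns / 1000LL;
}
```

Calling stamp() immediately before and after VIDENC1_process() and logging only the calls where elapsed_us() exceeds one frame period (about 33 ms at 30 frames per second) would show whether the delay is inside the remote call or elsewhere in thread 1.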
Two things come to mind:
1) The compiler may be producing tight loop code for the second thread and hence not allowing interrupts. To confirm this, can you try compiling the second thread with debug and without optimization flags?
2) The second thread may be keeping DDR a lot busier and hence delaying the encode, which needs the same DDR bandwidth. What is the size of the buffers the second thread is operating on, and are they by any chance non-cached buffers?
Thanks,
Satish
Thanks for your thoughts Satish. I think I have a lot left to learn here...

1) I compiled with debugging and no optimization flags, with no change in results.

2) Yes, the buffers in the second thread are non-cached. Both buffers are DMAI buffers created with Buffer_create(), using the default values for memParams (type = Memory_CONTIGPOOL, flags = Memory_NONCACHED, align = Memory_DEFAULTALIGNMENT, seg = 0). Both buffers are 21073920 bytes (~20MB).

Here is my understanding of cached vs. non-cached buffers. Please correct me. Using non-cached buffers in this case increases the time for the loop to complete, since each byte is read and written individually to DDR2, rather than reading and writing whole cache lines at a time. The reason DMAI buffers default to non-cached has something to do with the ability to pass buffers between the ARM core and the DSP core. I'm confused, however. I thought Codec Engine handled all cache management requirements, for the very purpose of allowing the ARM and DSP cores to share cached buffers. I even encountered a problem where one of my non-cached buffers became corrupted because I wasn't calling XDM_SETACCESSMODE_WRITE(outBufs->descs[0].accessMask). What am I missing?

I changed the two buffers in the second thread to cached by setting gfxAttrs.bAttrs.memParams.flags = 0; prior to calling Buffer_create(). This increased the speed of the loop's completion, but the 1st thread still stalls while the 2nd thread's loop completes.

3) I don't know exactly what the compiler does when you turn on debugging. I tried forcing the loop to give up the CPU by adding usleep(1) as the last statement in the loop. This allowed the 1st thread to run without interruption, which is my ultimate goal, but then it takes the loop several minutes to complete. When I originally said thread 2 had no speed requirements, I meant that it could take a few seconds, not a few minutes.

4) What else can I try to determine whether the loop is blocking interrupts, or the two threads are fighting for memory bandwidth?

5) Is there any way to monitor DDR2 bandwidth?

6) Should I be using VDCE for this memory operation, instead of C code on the ARM?
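On point 3: sleeping after every byte pair means millions of sleeps, and each one costs far more than a microsecond in practice, which would explain the multi-minute run time. A middle ground is to pause once per chunk rather than once per iteration. The following is only a sketch under that assumption; the function name, the chunk size and the pause length are hypothetical tuning knobs, and it has not been measured on the DM6467.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Split interleaved CbCr bytes (Cb0 Cr0 Cb1 Cr1 ...) into separate Cb and
 * Cr planes, sleeping for pause_ns between chunks of chunk_pairs pairs so
 * that other users of the CPU and memory bus get regular gaps.
 * pause_ns must be below one second; 0 disables the pause. */
static void deinterleave_chroma_chunked(const uint8_t *src,
                                        uint8_t *cb, uint8_t *cr,
                                        size_t pairs, size_t chunk_pairs,
                                        long pause_ns)
{
    struct timespec pause;
    size_t done = 0;

    assert(src != NULL && cb != NULL && cr != NULL);
    assert(chunk_pairs > 0);
    assert(pause_ns >= 0 && pause_ns < 1000000000L);

    pause.tv_sec = 0;
    pause.tv_nsec = pause_ns;

    while (done < pairs) {
        size_t n = pairs - done;
        size_t k;

        if (n > chunk_pairs) {
            n = chunk_pairs;
        }
        for (k = 0; k < n; k++) {
            cb[done + k] = src[2 * (done + k)];      /* even bytes: Cb */
            cr[done + k] = src[2 * (done + k) + 1];  /* odd bytes: Cr  */
        }
        done += n;

        /* One sleep per chunk instead of one per byte pair. */
        if (pause_ns > 0 && done < pairs) {
            nanosleep(&pause, NULL);
        }
    }
}
```

With the thread 2 variables, src would point at imgInBufP + imgBufCPOff, cb at imgIn422PBufP + imgBufCPOff, cr at imgIn422PBufP + imgBufCPOff + imgBufCROff, and pairs would be imgBufSize/4. Whether the stall in thread 1 shrinks as the chunk size is reduced would also help answer point 4.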
Andrew Muehlfeld wrote: I'm confused, however. I thought Codec Engine handled all cache management requirements, for the very purpose of allowing the ARM and DSP cores to share cached buffers. I even encountered a problem where one of my non-cached buffers became corrupted because I wasn't calling XDM_SETACCESSMODE_WRITE(outBufs->descs[0].accessMask). What am I missing?
This article may help - http://processors.wiki.ti.com/index.php/Cache_Management
In particular, the Codec Engine section toward the bottom describes what [little] cache management CE does/doesn't do.
Chris
A 20 MB buffer!! I'm curious what it contains. Is it going through DMAI to a usual video decoder/encoder?
Its cacheability on the ARM side depends on several factors, such as whether the ARM/DSP modify the data in this buffer by CPU access or only through DMAs. I can't comment much without knowing everything that is happening with this buffer.
Not a solution, but can you experiment with SCHED_RR instead of SCHED_FIFO? This should give your first thread a chance to run.
Another wild guess: if the first thread has any dependency on a kernel thread, the second thread, being scheduled as SCHED_FIFO, will have higher priority than that kernel thread and will preempt it.
The 20MB buffer contains YCbCr data for a 10 megapixel image (each pixel gets 8 bits Y and 8 bits C). I included a larger excerpt of the code below.

From the video thread (previously referred to as thread 1), a YUV422 Semi-Planar image is memcpy'd to a buffer that gets passed to the still thread (previously referred to as thread 2). It is memcpy'd, rather than passed directly, because the original buffer continues to be used in the video thread. The still thread (thread 2) copies the data from the newly populated buffer to another 20MB buffer, reformatting it to YUV422 Planar (separating interleaved Cb and Cr bytes into two separate blocks). The resulting buffer is passed to TI's jpeg encoder, running on the DSP, using DMAI calls.

With default non-cached buffers, the de-interleaving takes 1.7 seconds. With cached buffers, the de-interleaving takes 700ms. I read up on cache coherency, as recommended by Chris. The first buffer is never touched by the DSP, so I believe it can be cached with no cache management required. The second buffer gets sent to the DSP, after being written by the ARM CPU, so I need to call Memory_cacheWbInv() prior to process(). Is that correct? Using cached buffers is an improvement, but not a solution.

I experimented with SCHED_RR, both with the original priorities, and elevating thread 2's priority to the same as thread 1. The delay to thread 1 was unchanged.

Can you elaborate on the kernel thread idea? Both threads call DMAI functions which I believe use the dsplink.ko kernel module. Could that be related?

**********************************************************************
// Video Thread (thread 1) - Fill and pass buffer
int takeStill(VideoEnv *stillEnvp, Buffer_Handle *pPreProcessedBuf, Buffer_Handle *pImgInBuf, int imgBufSize)
{
    Int stillFifoRet;
    char *imgInBufP;
    char *diOutBufP;

    printf("Beginning takeStill\n");
    if(video_still_count > 0) {
        printf("Calling Fifo_get on hStillOutFifo, will print return after\n");
        stillFifoRet = Fifo_get(stillEnvp->hStillOutFifo, &(*pImgInBuf));
        printf("Returned from Fifo_get on hStillOutFifo\n");
        if (stillFifoRet < 0) {
            ERR("Failed to get buffer from still thread\n");
        }
    }
    imgInBufP = Buffer_getUserPtr(*pImgInBuf);
    diOutBufP = Buffer_getUserPtr(*pPreProcessedBuf);
    printf("takeStill: imgBufSize: %d\n", imgBufSize);
    memcpy(imgInBufP, diOutBufP, imgBufSize);
    Buffer_setNumBytesUsed(*pImgInBuf, imgBufSize);
    printf("Calling Fifo_put on hStillInFifo, will print return after\n");
    Fifo_put(stillEnvp->hStillInFifo, *pImgInBuf);
    printf("Returned from Fifo_put on hStillInFifo\n");
    //Clear flag
    takePushPin = 0;
    video_still_count ++;
    return 0;
}
**********************************************************************
// Still thread (thread 2) -
while (!gblGetQuit()) {
    /* Pause processing? */
    Pause_test(envp->hPauseProcess);

    /* Get a buffer to encode from the capture thread */
    fifoRet = Fifo_get(envp->hInFifo, &hImgInBuf);
    if (fifoRet < 0) {
        ERR("Failed to get buffer from video thread\n");
        cleanup(THREAD_FAILURE);
    }
    printf("completed Fifo_get()\n");

    /* Did the capture thread flush the fifo? */
    if (fifoRet == Dmai_EFLUSH) {
        cleanup(THREAD_SUCCESS);
    }

    int imgBufSize;
    int imgBufCPOff;
    int imgBufCROff;
    Int8 *imgInBufP;
    Int8 *imgIn422PBufP;

    imgIn422PBufP = Buffer_getUserPtr(hImgIn422PBuf);
    imgInBufP = Buffer_getUserPtr(hImgInBuf);
    imgBufSize = Buffer_getNumBytesUsed(hImgInBuf);
    imgBufCPOff = imgBufSize/2; //Color Plane Offset
    imgBufCROff = imgBufSize/4; //CR Offset (referenced from CP Offset)
    Buffer_setNumBytesUsed(hImgIn422PBuf, imgBufSize);

    printf("still.c: Doing memcpy\n");
    memcpy(imgIn422PBufP, imgInBufP, imgBufSize/2);

    printf("Beginning Semi-planar to Planar format conversion\n");
    //Reformat data
    for(k = 0; k < imgBufSize/4; k ++) {
        //Copy Cb data
        imgIn422PBufP[imgBufCPOff + k] = imgInBufP[imgBufCPOff + (k*2)];
        //Copy Cr data
        imgIn422PBufP[imgBufCPOff + imgBufCROff + k] = imgInBufP[imgBufCPOff + (k*2) + 1];
    }
    printf("Completed Semi-planar to Planar format conversion\n");

    //Return buffer to video thread for next still
    Fifo_put(envp->hOutFifo, hImgInBuf);

    if(Buffer_getNumBytesUsed(hImgIn422PBuf) == 691200) {
        //printf("Little jpeg\n");
        iEncDynamicParams->inputHeight = 480;
        iEncDynamicParams->inputWidth = 720;
        iEncDynamicParams->captureWidth = 720;
    } else {
        //printf("Big jpeg\n");
        iEncDynamicParams->inputHeight = TEN_MP_HEIGHT;
        iEncDynamicParams->inputWidth = TEN_MP_WIDTH;
        iEncDynamicParams->captureWidth = TEN_MP_WIDTH;
    }

    BufferGfx_getDimensions(hImgIn422PBuf, &tmpDimensions);
    //printf("Original hImgInBuf Dimensions --- Width: %d, Height: %d, Line Length: %d, X: %d, Y: %d\n", tmpDimensions.width, tmpDimensions.height, tmpDimensions.lineLength, tmpDimensions.x, tmpDimensions.y);
    tmpDimensions.width = iEncDynamicParams->inputWidth;
    tmpDimensions.height = iEncDynamicParams->inputHeight;
    tmpDimensions.lineLength = iEncDynamicParams->inputWidth;
    BufferGfx_setDimensions(hImgIn422PBuf, &tmpDimensions);
    BufferGfx_getDimensions(hImgIn422PBuf, &tmpDimensions);
    //printf("New hImgInBuf Dimensions --- Width: %d, Height: %d, Line Length: %d, X: %d, Y: %d\n", tmpDimensions.width, tmpDimensions.height, tmpDimensions.lineLength, tmpDimensions.x, tmpDimensions.y);

    iEncStatus.size = sizeof(IMGENC1_Status);
    iEncStatus.data.buf = NULL;

    //It is required to call IMGENC1_control() after each Ienc1_process() call to
    // reinitialize the jpeg encoder
    if(IMGENC1_control(hIMGENC1, XDM_SETPARAMS, iEncDynamicParams, &iEncStatus)) {
        printf("Called VIDENC1_control, failed\n");
    }
    if(IMGENC1_control(hIMGENC1, XDM_GETBUFINFO, iEncDynamicParams, &iEncStatus)) {
        printf("Called VIDENC1_control, failed\n");
    }

    printf("Calling Ienc1_process on image # %d\n", still_count);
    if(Ienc1_process(hIe1, hImgIn422PBuf, hImgOutBuf) != 0) {
        printf("Ienc1 Failed\n");
    } else {
        printf("Ienc1 Succeeded\n");
    }
    //printf("JPEG encoder used: %d bytes, updated\n", Buffer_getNumBytesUsed(hImgOutBuf));
    printf("Returned from Ienc1_process on image # %d\n", still_count);

    printf("Writing jpeg image to file\n");
    sprintf(dynamic_file_names, "/opt/dvsdk/dm6467/test_%d.jpg", still_count);
    outFile = fopen(dynamic_file_names, "w");
    fwrite(Buffer_getUserPtr(hImgOutBuf), 1, Buffer_getNumBytesUsed(hImgOutBuf), outFile);
    fclose(outFile);
    still_count ++;
    printf("Completed writing jpeg image to file\n");
}
**********************************************************************
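The chroma loop in the still thread moves one byte per access. A possible variation, offered only as a sketch and not measured on this hardware, is to read the interleaved chroma one 32-bit word at a time and write each plane 16 bits at a time, which reduces the number of individual memory accesses. It assumes a little-endian CPU, 4-byte aligned chroma source, 2-byte aligned destinations, and an even number of CbCr pairs. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Deinterleave CbCr data two pairs at a time.
 * On a little-endian CPU each source word holds, from low byte to high
 * byte, Cb0 Cr0 Cb1 Cr1. Each iteration emits one halfword (Cb0 Cb1) to
 * the Cb plane and one halfword (Cr0 Cr1) to the Cr plane.
 * words is the number of 32-bit source words, i.e. pairs / 2. */
static void deinterleave_chroma_words(const uint32_t *src_words,
                                      uint16_t *cb_halfwords,
                                      uint16_t *cr_halfwords,
                                      size_t words)
{
    size_t i;

    assert(src_words != NULL && cb_halfwords != NULL && cr_halfwords != NULL);

    for (i = 0; i < words; i++) {
        uint32_t w = src_words[i];

        /* Cb0 is bits 0-7, Cb1 is bits 16-23. */
        cb_halfwords[i] = (uint16_t)((w & 0x000000FFu) |
                                     ((w >> 8) & 0x0000FF00u));
        /* Cr0 is bits 8-15, Cr1 is bits 24-31. */
        cr_halfwords[i] = (uint16_t)(((w >> 8) & 0x000000FFu) |
                                     ((w >> 16) & 0x0000FF00u));
    }
}
```

For the 21073920-byte buffer, imgBufSize/4 gives 5268480 CbCr pairs, so words would be 2634240. The offsets imgBufSize/2 and imgBufSize/4 are multiples of 4 for that size, so the alignment assumptions hold if the buffer base addresses are themselves aligned.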
Thanks for the explanation. Here is what I understood from your explanation above
Couple of points
Thanks
Satish Arora wrote: Thanks for the explanation. Here is what I understood from your explanation above. There is a video thread (thread 1) which receives a captured frame in a buffer of around 20 MB. The video thread copies it into another buffer, passes that to thread 2 (the still thread), and continues to use the original buffer for video encoding. The still thread receives the YUV422 semi-planar buffer, converts it to YUV422 planar, and returns the buffer to the video thread. The still thread then goes on to encode the 422 planar buffer using the JPEG encoder.
All correct.
Satish Arora wrote: Is the rate for video encoding different from the rate for still encoding? I mean, out of all the frames processed by the video encoder, are only a few passed to the still thread?
Yes, the rate of still encoding is much lower than video encoding. The user can request a still frame encoding, which calls takeStill(). This is anticipated once every few minutes, with a maximum burst rate of a few consecutive still images at 5 second intervals.
Satish Arora wrote: You did a memcpy in the video thread so that you can pass the copied buffer to the still thread. Since you were making a copy, you might as well convert it to planar there; this should save the unnecessary conversion/copy in thread 2.
That's what I did originally, but the conversion to planar took too long. The memcpy() is quick. The motivation for creating a separate thread in the first place was to allow the conversion to run as a low priority thread, without blocking the video thread.
Satish Arora wrote: I might be wrong here, but I see that thread 1 has a dependency on thread 2, i.e. it needs to get the buffer (required for the copy) back from thread 2 before it can progress further. I just want to check that you have made sure whether thread 1 is actually stalled here or in the video encode call...
Thread 1 would wait on thread 2 if the user requested a second still image before the first completed. In my test case, I only request one, and have verified that takeStill() is only called once.
Satish Arora wrote: I am also wondering whether it is the copy in the second thread that holds up your first thread, or the JPEG encoding. I have this doubt because the JPEG encoder also runs on the DSP and hence can create some degradation for the video encoder, which also requires the DSP for encoding. To confirm this, you can try commenting out the copy part while leaving the JPEG encoder part as is.
This comment led me down a good path. I realized that the JPEG codec and H264 codec were in the same group on the remote server, with the same priority. I moved the JPEG codec to a new group, with its own scratch memory, and a lower priority. The JPEG encoder did introduce a delay that varied from 100ms to 300ms. That has been eliminated. I tried your suggestion of commenting out the transfer loop and leaving the JPEG encoder loop intact. Once I fixed the priorities in codec.cfg and server.cfg, there was no delay with the transfer loop commented out. With the transfer loop re-introduced, the delay came back, even with the fixed server priorities.
Satish Arora wrote: Also, if the stall is in the video encoder, do you have a rough idea how much additional stall you see because of the presence of thread 2?
memcpy (Y data): 60-70ms
deinterleave (Cb and Cr data): 700-1100ms
Andrew Muehlfeld wrote: This comment led me down a good path. I realized that the JPEG codec and H264 codec were in the same group on the remote server, with the same priority. I moved the JPEG codec to a new group, with its own scratch memory, and a lower priority. The JPEG encoder did introduce a delay that varied from 100ms to 300ms. That has been eliminated. I tried your suggestion of commenting out the transfer loop and leaving the JPEG encoder loop intact. Once I fixed the priorities in codec.cfg and server.cfg, there was no delay with the transfer loop commented out. With the transfer loop re-introduced, the delay came back, even with the fixed server priorities.
Good to see it helped somewhere.
Andrew Muehlfeld wrote: That's what I did originally, but the conversion to planar took too long. The memcpy() is quick. The motivation for creating a separate thread in the first place was to allow the conversion to run as a low priority thread, without blocking the video thread.
Understood. In general, copying big video buffers with the CPU is not a good idea. You would want to use DMA (EDMA on the DM6467) for such a copy. Not just for a simple copy; even for the YUV422SP to YUV422P conversion, you should be able to use EDMA. You can look at the VDCE driver, which uses EDMA to copy the luma buffer. Using EDMA you should also be able to do the SP to P conversion for chroma. Look at a few examples of using DMAs at http://www.ti.com/lit/ug/sprueq5b/sprueq5b.pdf.
If you are able to do this copy/conversion fast enough using EDMA, then you might be able to avoid two copies, as you can then do it all in thread 1.
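To make the EDMA suggestion concrete, here is a plain C model of how EDMA3 addressing parameters could express the chroma split. It is not driver code and calls no TI API. It only imitates, on the CPU, the address stepping of one AB-synchronized transfer described by the PaRAM fields ACNT, BCNT, CCNT, SRCBIDX, DSTBIDX, SRCCIDX and DSTCIDX. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Addressing-related subset of an EDMA3 PaRAM entry (software model only). */
typedef struct {
    uint16_t acnt;     /* bytes per array                           */
    uint16_t bcnt;     /* arrays per frame                          */
    uint16_t ccnt;     /* frames per block                          */
    int16_t  srcbidx;  /* source byte step between arrays           */
    int16_t  dstbidx;  /* destination byte step between arrays      */
    int16_t  srccidx;  /* source byte step between frame starts     */
    int16_t  dstcidx;  /* destination byte step between frame starts */
} ParamModel;

/* Parameters that pick every second byte from the source and pack the
 * result contiguously. The indices are signed 16-bit values, so bcnt is
 * limited here to keep 2 * bcnt within range. */
static ParamModel chroma_split_param(uint16_t bcnt, uint16_t ccnt)
{
    ParamModel p;

    assert(bcnt > 0 && bcnt <= 16383 && ccnt > 0);
    p.acnt = 1;                        /* one byte per array            */
    p.bcnt = bcnt;
    p.ccnt = ccnt;
    p.srcbidx = 2;                     /* skip the other chroma byte    */
    p.dstbidx = 1;                     /* pack output bytes back to back */
    p.srccidx = (int16_t)(2 * bcnt);   /* next frame in the source      */
    p.dstcidx = (int16_t)bcnt;         /* next frame in the destination */
    return p;
}

/* CPU imitation of the address sequence of an AB-synchronized transfer. */
static void param_model_run(const ParamModel *p,
                            const uint8_t *src, uint8_t *dst)
{
    size_t a, b, c;

    assert(p != NULL && src != NULL && dst != NULL);
    for (c = 0; c < p->ccnt; c++) {
        for (b = 0; b < p->bcnt; b++) {
            ptrdiff_t s = (ptrdiff_t)c * p->srccidx + (ptrdiff_t)b * p->srcbidx;
            ptrdiff_t d = (ptrdiff_t)c * p->dstcidx + (ptrdiff_t)b * p->dstbidx;
            for (a = 0; a < p->acnt; a++) {
                dst[d + (ptrdiff_t)a] = src[s + (ptrdiff_t)a];
            }
        }
    }
}
```

In this model, the Cb plane comes from running the transfer with the source pointing at the first chroma byte, and the Cr plane from running it again with the source advanced by one byte. For the 5268480 CbCr pairs in the buffer discussed here, one factorization that fits the 16-bit counts is bcnt = 1024 and ccnt = 5145. Whether single-byte arrays perform well on the real EDMA3 would need to be measured.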
Satish Arora wrote: Understood. In general, copying big video buffers with the CPU is not a good idea. You would want to use DMA (EDMA on the DM6467) for such a copy. Not just for a simple copy; even for the YUV422SP to YUV422P conversion, you should be able to use EDMA. You can look at the VDCE driver, which uses EDMA to copy the luma buffer. Using EDMA you should also be able to do the SP to P conversion for chroma. Look at a few examples of using DMAs at http://www.ti.com/lit/ug/sprueq5b/sprueq5b.pdf. If you are able to do this copy/conversion fast enough using EDMA, then you might be able to avoid two copies, as you can then do it all in thread 1.
I found another thread on e2e where somebody used EDMAs to convert from YUV422 SP to YUV422 Planar. - http://e2e.ti.com/support/embedded/multimedia_software_codecs/f/356/t/56639.aspx
Thanks for the ideas. I can see that this operation would be better done with DMA. There seem to be many ways to perform DMA on the DM6467T, but very little documentation on any of them.
One method is to use the ACPY3 API. Another method is to use the VDCE driver. Another is to use EDMA driver directly. Which method is most appropriate in this case? Is there any documentation available?
Did you solve the 422SP to 422P conversion with EDMA3? I need to do the same thing...
No, I did not modify my 422SP to 422P conversion to use EDMA3. I sped up the conversion on the ARM by using cached buffers, and optimized some other application-specific things surrounding the conversion, but the video stream still pauses. I'm still curious why the Linux thread priorities aren't functioning as expected, and I might still try EDMA3 sometime, but the hiccup has been reduced to a tolerable length, at least for the short term, and eliminating it completely isn't a top priority. If you do figure out how to do it, please share.
It's not a priority for now, but I think I will need to do it within the next 3 months. I will let you know ;) !
Mika