
QDMA Conflict between H.264 encoder and custom IVIDANALYTICS algorithm using ACPY3.



Hello,

I have begun seeing an issue when doing D1 H.264 (H264ENC 2.01.013) encoding with Codec Engine (2.25.02.11) simultaneously with a custom IVIDANALYTICS algorithm I wrote. I do not see the problem when I disable DMA in my algorithm, but then the performance is miserable (less than half the required framerate). The problem manifests as a hang in the first H.264 process call. When I use CCS to halt it while hung, it seems to be stuck in H264VENC_TI_QDMA_wait().

Both algorithms are in different scratch groups.  I am using 1D1D ACPY3 DMA transfers in my algorithm as a replacement for memcpy.  I am using only 1 DMA channel.

Does anyone have any ideas on how I can debug & fix this problem?

I have attached a verbose log file with FC tracing enabled.

6888.both.log

I have also attached the codec.cfg and server.cfg file I'm using for my codec server.

0804.cfg.zip

Thank you,

Dennis Estenson

  • Hi Dennis,

    I went through your log and the config files in detail, and in terms of resource assignment, I don't really see any issues. Both codecs are in separate scratch groups and have been assigned different resources. This means they should be able to work independently and together...

    In your experiments, you mention that you could manage to run both algorithms simultaneously when VMD was NOT doing DMA. When it was doing DMA, then H264 ENC gets stuck in first process call.

    Is H264VENC_TI_QDMA_wait calling ACPY3_wait? Is there a chance you can send me a snapshot of the EDMA3 registers at the time of the hang? I want to see what in the register space might be causing the halt.

    I can't tell from the addresses, but can you confirm (from the log) that all the requests for internal memory by the H264 codec are satisfied properly by internal memory buffers? If not, ACPY3 can get stuck in the wait call, because the transfer never happens. (The log only shows up to the process call; it doesn't show the ACPY3 APIs being called in the case when H264 is stuck.) The fact that H264 works when VMD is not doing DMA kind of implies that the usage of ACPY3 APIs in H264 is probably not an issue.

    One other thing to try, is to not use PaRAM#0 as your NULL PaRAM. Since all QDMA channels come up initially pointing to it, sometimes there can be interference...

    Change the following in your .cfg file:-

    DMAN3.nullPaRamIndex = 65; //Make sure this is a PaRAM not being used by your system, defaults to "0"

    Gunjan.

  • Hi Gunjan,

    Thank you for taking a look.  The H.264 encoder is the version provided by TI as a binary only in the DVSDK.  We do not have the source code, so I don't know what H264VENC_TI_QDMA_wait() is doing.  I tried changing the nullPaRamIndex variable as you suggested, but it had no effect with regards to this hang.

    Do you have a .gel file or something I could use to get the register dump you requested?

    Thanks,

    Dennis

  • Okay, we may need to get help from the codec team as well.

    Does the call stack show anything from ACPY3? If you trace the working scenario, that does show successful calls to ACPY3_start and ACPY3_wait, right? I'm trying to look for proof that H264VENC_TI_QDMA does indeed use ACPY3 for doing its transfers.

    Let me know what your observations are with the changed Null PARAM.

  • No sorry, I don't have a gel file. The simplest way might be to open up addresses 0x01C00000 through 0x01C04000 in CCS when the DSP is stuck. This is the entire EDMA3 register space.

    Just a snapshot of this would be fine. Is this possible?

  • I have attached the (tediously copy/pasted) log of the registers in that range.

    Thanks,

    Dennis

    8015.edma-regs.txt

  • It looks like the H264 encoder does not use ACPY3.  There is no mention of it in the logs of a successful run.

    It also looks like both VMD & H264 use QDMA channel 0.  I think that's a logical channel, but is it possible it's using the same physical channel?

    Thanks,

    Dennis

  • Thanks Dennis, I'll take a look at them. I'm sorry you had to copy/paste them; I was hoping you could just do a Print Screen or something like that.

  • The VMD log does show ACPY3 activate/configure/start/wait/deactivate statements, and if the H264 encoder used ACPY3, I would have expected to see similar statements in the log from the H264 + no-DMA VMD run.

    If two algorithms (in different scratch groups) using ACPY3 both use QDMA channel #0, that is perfectly fine, since the channel simply queues up all the transfers and performs them one by one. As long as the PaRAMs and TCCs used by the two different scratch groups are discrete, things should work fine.

    However, since the H264 encoder uses some other QDMA library, I have no idea if it is well-behaved with respect to ACPY3. ACPY3 performs all its global register initialization in the ACPY3_init call and expects it to persist throughout the length of the program. With another QDMA library, that assumption may or may not be valid; the global initialization that ACPY3 expects may be getting overwritten.

    Do you have a TI FAE/contact? I'd like to work with them to find out more about the codec and the DMA library being used by it.

  • We've been working with Tiemen Spits out of the San Diego office just down the road from us.  The H.264 encoder is using DMAN3. Should I be using it directly instead of using the ACPY3 library, in order that the algorithms play more nicely together?  Time is running very short for this project (it's being shipped in 7 days) and I'll have to work every day including the weekend until this problem is solved.  Can you suggest another workaround?

    Thanks,

    Dennis

  • I'll sync up with Tiemen. Without finding out more about the DMA lib being used by the codec, I can't really suggest a workaround. 

  • It looks like Tiemen is out until next week. The H.264 encoder we're using is the OMAP35x-optimized version found at http://software-dl.ti.com/dsps/dsps_public_sw/codecs/OMAP35xx/index_FDS.html.  I hope this helps.

    Dennis

  • Thanks, I'll take a look at the release notes and user guide etc. In your VIDANALYTICS codec or the rest of your app, are you doing ANY DMA-specific register programming or configuration? Or are ACPY3 and DMAN3 doing everything you need?

    Also, did you get a chance to try changing the NULL PaRAM from 0 to some other unused PaRAM?

  • I am only using DMAN3 and ACPY3. I'm not touching the registers at all.

    I did try to change the null param and it did not have any effect.

  • I put the 2 algorithms in the same scratch group & now it seems to work.  I don't know if that's a permanent solution though.

  • Putting algorithms in the same scratch group means allowing them to use the same set of resources (memory as well as DMA resources).

    The side effect of putting both in the same scratch group is that the framework doesn't allow both of them to run simultaneously. That means that once a particular algorithm is activated and in its own process call, a second algorithm will not be able to proceed to its own activation. Since they are potentially using the same set of resources, allowing the 2nd algorithm through would trash the state of the 1st algorithm.

    While this is a fair thing to do, it may not be good for you in terms of performance.

    Nevertheless, this is an interesting observation, could you share the log of this successful run and I can compare the two logs and see if I can figure out what might be causing the issue in the first case.

     


  • We had previously been using the same scratch group for these algorithms, but had severe enough performance issues that we had to scale the image to 1/5 size before doing VMD in order to maintain the full framerate while recording. This changed once we came upon, and tried to work around, a bug in Codec Engine, and it was suggested that having the two algorithms share the same scratch group was a "bad thing". Once we fixed the bug in Codec Engine, this problem arose, apparently because we had changed the scratch groups. It would be great if we could keep them in different scratch groups if that will improve the performance. I will get you a new log.

    Dennis

  • I have attached the log you requested.

    It appears to assign QDMA channel 0 to VMD and channels 1-6 to H.264, where before with different scratch groups, it assigned channel 0 to VMD and channels 0-5 to H.264. To me, that seems like a bug in DMAN3 or something.

    2642.both.log

    Thanks for your help.

    Dennis

  • Are there any known issues with using DMAN3 in one algorithm in one scratch group & using ACPY3 in another algorithm in another scratch group?  That seems to be the criteria necessary to cause this issue.

  • DMAN3 uses as many QDMA channels as have been assigned to it. You could even assign 1 QDMA channel, and both the algorithms (irrespective of their scratch groups) would share it. 

    QDMA channels are used in a round-robin fashion, so the fact that they are being shared in your failing case is NOT a bug in DMAN3/ACPY3. And to answer your question, no, we haven't had any issues when using multiple scratch groups with DMAN3 and ACPY3.

    In comparing the two logs, I do see that you have used NULL PaRAM 65 for the passing case. But you mention that changing the NULL PaRAM to 65 in your failing case (different scratch groups) didn't change the behavior, is that right?

    The other thing that changes between the 2 scenarios is that in the passing scenario, the VMD algorithm is deactivated before the H264 algorithm is activated (since they are all in the same scratch group, this is  necessary).

    In the failing case, since they are in different scratch groups, the VMD algorithm is still active (not deactivated) when we activate the H264 algorithm. I'm still waiting to hear from the H264 codec authors. But for VMD, can you give me an idea of what goes on in the activation/deactivation calls? Do you do something specific to DMA?

  • Hi Gunjan,

    The reason the NULL PARAM is 65 for the passing case is that I had not changed it back to zero when I tested it for the failing case.  It was definitely not that change that made it work.

    As for the activate/deactivate functions in VMD, they do nothing except print out the fact that they were called and, for testing purposes, dump internal algorithm data. Should more be done in these functions? Should I do something specific to DMA?

    Thanks,

    Dennis

  • Dennis,

    You don't necessarily have to do anything in the activate/deactivate functions. Those are the points in the application lifecycle when a codec has exclusive access to the scratch resources it has requested (memory and DMA resources). Before that (for instance in the IALG_Fxns::algInit call), the codec has access to (and should be touching) only the persistent resources that it has requested. It might be a good idea to make sure you aren't touching any scratch resources in your algInit call.

    The reason I asked what the VMD alg. does in the activate/deactivate functions is only because that was a point of difference between your passing and failing scenarios. That is, depending on which scratch group the codecs are in, these functions may or may not be called.

    We have managed to locate/loop in some codec experts, so maybe we will learn something from them.

    Thanks,

    Gunjan

  • I have again verified that there are no scratch resources being used in the algInit call, but the base addresses from memTab[i].base are stored in a member variable of our persistent object.

    How do constant static global variables fit into the mix? Are they considered persistent for the purposes of this discussion? Is it safe to access them in algInit? I am accessing some default params and dynamic params structures that are defined as static const structs.

    Thanks,

    Dennis

  • It should be okay to store the addresses of these buffers in the algorithm object's instance (persistent memory). The right time to access these buffers depends on how they were requested.

    The notes here about 'scratch memory' might be useful:-

    http://processors.wiki.ti.com/index.php/Framework_Components_FAQ#What_exactly_is_.22scratch_memory.22_and_when_can_my_algorithm_use_it_.3F

    If the process call of the algorithm is the first time you are using its scratch memory, then you should be fine.

  • Yes, the only time the scratch memory is actually used is in the process call.  It's used to store luminance data from the input frame and also contains an intermediate buffer for storing processed data whose lifetime is limited to the process function.

    Dennis

  • For H264 encoder, memTab[1] is a request for Internal scratch memory:-

    [DSP] @26,331,904tk: [+4 T:0x87ec0cb4 S:0x87ec88dc] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Requested memTab[1]: size=0xff00, align=0x8, space=IALG_DARAM0, attrs=IALG_SCRATCH

    And I see from the log that it actually gets allocated EXTERNAL memory for the failing case:-

    [DSP] @26,361,412tk: [+4 T:0x87ec0cb4 S:0x87ec88dc] ti.sdo.fc.dskt2 - DSKT2_createAlg3> Allocated memTab[1]: base=0x8774d900, size=0xff00, align=0x8, space=IALG_ESDATA, attrs=IALG_PERSIST

    This might be an issue depending on how strict the requirement for internal memory is (it is very strict for ACPY3-based algorithms such as the VMD alg you wrote). In the configuration file, I see DMAN3's USE_EXTERNAL_SCRATCH configuration is set to TRUE. Could you try setting that to false and adjusting your heap sizes etc. so that both algorithms get created such that they get the memory they request?

    Thanks,

    Gunjan

  • Dennis,

    Did you get a chance to try out the configuration change ?

    Thanks,

    Gunjan.

  • Gunjan said:
    This might be an issue depending on how strict the requirement for internal memory is (it is very strict for ACPY3 based algorithms such as the VMD alg you wrote).

    I have seen this stated before, that ACPY3 requires internal memory. I am using it to copy large (up to 345 KB) blocks of memory from DDR to DDR. How is this working if the internal memory requirement is so strict?

    Thanks,

    Dennis

  • The VMD algorithm is getting all the internal memory it needs, and ACPY3 would certainly not work without it.

    It is the H264Enc that is getting external memory instead of internal memory. The log statement I shared yesterday was from the H264Enc creation trace, and I don't know if the requirement for internal memory there is strict or not. I have asked the codec authors, but haven't received a response yet.

    For now, if you could configure it so that both the algorithms get the internal memory they need, that would either confirm/rule out the internal memory requirement as an issue.

  • After I made the change you suggested, to disable external scratch memory, the H.264 encoder fails to initialize.

    [DSP] @1,546,287tk: [+7 T:0x87ec0c74 S:0x87ec8934] ti.sdo.ce.alg.Algorithm - Algorithm_create> Algorithm creation FAILED; make sure that 1) alg params are correct/appropriate, 2) there is enough internal and external algorithm memory available -- check DSKT2 settings for heap assignments and scratch allocation

    The codec claims to require (in the H264ENC.xs file) 65280 (0xFF00) bytes of DARAM scratch. I am providing 0x10000 bytes in the codec.cfg file.

  • Failure to create the codec might have been the reason this parameter was set to TRUE, even though the comments mention that this switch should be on (see the huge paragraph just above this setting in server.cfg).

    The total amount of internal memory used by your application is the sum of the internal memory requirements of both codecs (since they are in different scratch groups). Is there enough memory available in L1DHEAP?

    If not, is there other internal memory (L2RAM) that can also be used to satisfy some of these requests ?

    You could set the SARAM_SCRATCH_SIZES and DARAM_SCRATCH_SIZES settings for both groups to 0, and let the algorithms' requests themselves dictate how much memory should be set aside for each scratch group (see the same comment in the .cfg file for an explanation). Also, the trace log will indicate where the allocation (by the DSKT2 module) actually fails.
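
    For reference, that change would look roughly like the fragment below in the server .cfg. This is a sketch: it assumes two scratch groups numbered 0 and 1, which may not match your configuration.

```javascript
// Let each group's scratch size be derived from the algorithms' own requests
DSKT2.DARAM_SCRATCH_SIZES[0] = 0;  // scratch group 0
DSKT2.SARAM_SCRATCH_SIZES[0] = 0;
DSKT2.DARAM_SCRATCH_SIZES[1] = 0;  // scratch group 1
DSKT2.SARAM_SCRATCH_SIZES[1] = 0;
```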

     

  • We confirmed with the H264 Enc authors that their requirement for internal memory is strict. It seems that when the two codecs are created in different scratch groups, since they cannot share memory, the Enc doesn't get all the internal memory it needs. It is assigned external memory instead of internal, which causes its process call to get stuck. In my mind, you have the following alternatives:-

    - Ensure total internal memory is sufficient for internal memory requirements of VIDANAL + H264Enc codecs so that they can remain in different scratch groups

    - Keep them in the same scratch group so they can share this internal memory

    - (Don't know if this is possible, but) reduce the internal memory requirements of the VIDANAL codec to the bare minimum and see if the total internal memory is sufficient to create both.

     

    Thanks,

    Gunjan

    Please mark this thread as Verified, if I have answered your questions.

  • For the time being, we'll be keeping both algorithms in the same scratch group.  However, next week, I'll probably be working on getting them into their own scratch groups again.  The VMD algorithm we have does not request any internal scratch memory except for the requirement that ACPY3 has.  Now that I think of it, it is using VLIB, and I'm not sure what requirements it has with regards to this (probably none, since it's not an XDAIS/DMAI algorithm).

    Something else has priority for me today, but I will try some things based on your suggestions to see if we can get this problem resolved completely.

    Thank you for your help,

    Dennis