EDMA3 6678 Transfers failing for "larger" A and B counts

I'm doing a matrix application, so part of it involves moving submatrices around in memory.  I was initially excited about the EDMA 'transpose'/'data sorting' technique, but when I included it, the transpose operation started returning incorrect results for larger than trivial-sized matrices.  I reasoned out where I needed to write back and invalidate the caches, and while debugging I added more cache operations (safely), but that never fixed the problem (in fact, one time I think it delayed the problem until an even larger size, and then it started returning incorrect results again).

The transfer sizes I'm talking about vary, but they all have an Acount of less than 4*256, and they start to fail at a Bcount of around 186 (the transfers of A=142*4, B=184 to be precise).

My guess was that I was overloading the event queue (although I wasn't explicitly putting anything in there myself), but I could not figure out how to remedy the problem.  And that might not even be the case, because it works fine for repeated, relatively small transfers in an inside loop, but fails on some of the slightly larger chunks of data in the outside loop.  The fact that this transfer is to/from MSMC and DDR may matter, but I believe I tried moving the particular matrix to L2 instead, with no luck.

My method of transferring data is to use the CSL 'channel set' function, spin on QUERY_INTRPEND, and then clear the IPR bit.  I think this is the "Shadow Region", but that terminology thoroughly confuses me.  This works for the other transfers and when I am doing small sizes.  I have seen some other methods to do DMA transfers (like maybe QDMA), so if I need to do one of those, I could try rewriting my code for that.  Originally, I was using 1 channel for all of my transfers to/from DDR, but allocating 2 extra channels solely for the "failing" transfers doesn't fix the problem (though they were from the same Event Queue as far as I could tell).

It would be very easy for me to believe that the hardware just "gave up" trying to transfer the data, set an error bit somewhere, and then set the IPR bit -- but I am not sure how to check that hypothetical error bit  or even how to relaunch the transfer once an error is detected in software.

I also saw some things about an error/interrupt handler, and that may be a step towards the solution, but I could not find any examples dealing with a handler.

Any help, hints, or alternatives would be much appreciated.

  • Here are some ideas to debug this.

    Since you are looking for the "error bit", the userguide (http://www.ti.com/lit/sprugs5) contains the register map for each EDMA (there are 3 EDMA-3 units on the 6678), while the device data sheet contains the base addresses for each peripheral (http://www.ti.com/lit/gpn/tms320c6678).  These documents should help find the registers to inspect after it fails.

    There are error reporting registers defined in section 4.2.2 of the user guide (sprugs5).  Since you suspect that a transfer gave up, check the EMR/EMRH registers.  Since you suspect that a queue overflowed, the CCERR register may help.  What is in these registers will dictate the next steps.
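
    As a rough illustration of the checks suggested above, here is a minimal C sketch that interprets raw EMR/EMRH and CCERR values. It assumes the usual EDMA3 CC bit layout (one missed-event bit per DMA channel, per-queue threshold flags in the low bits of CCERR, TCCERR at bit 16); the function names are hypothetical, and reading the registers themselves (via CSL or direct access) is left out.

```c
#include <stdint.h>

/* Assumed bit layout (EDMA3 CC register descriptions):
 *   EMR  bit n -> missed event on DMA channel n      (0..31)
 *   EMRH bit n -> missed event on DMA channel 32 + n (32..63)
 *   CCERR bits 0..7 -> queue n exceeded its watermark threshold
 *   CCERR bit 16    -> TCCERR (too many outstanding completion codes) */
#define CCERR_TCCERR (1u << 16)

/* Lowest channel number with a missed event, or -1 if none is flagged. */
int edma_first_missed_channel(uint32_t emr, uint32_t emrh)
{
    for (int ch = 0; ch < 32; ch++)
        if (emr & (1u << ch))
            return ch;
    for (int ch = 0; ch < 32; ch++)
        if (emrh & (1u << ch))
            return 32 + ch;
    return -1;
}

/* Nonzero if event queue q reports a threshold-exceeded condition. */
int edma_queue_threshold_exceeded(uint32_t ccerr, unsigned q)
{
    return q < 8 && (ccerr & (1u << q)) != 0;
}

/* Nonzero if the TCCERR bit is set. */
int edma_tcc_error(uint32_t ccerr)
{
    return (ccerr & CCERR_TCCERR) != 0;
}
```

    For example, an EMR value of 0x40 would point at channel 6, while a nonzero low byte in CCERR would support the queue-overflow theory.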

    There is an example on how to set up the interrupts in the EDMA3 LLD which can be found in the MCSDK which is downloadable from http://e2e.ti.com/support/embedded/b/announcements/archive/2011/07/05/production-release-of-mcsdk-2-0-for-c66x-devices.aspx

    After downloading the MCSDK, the code which sets up the interrupts is in "C:\Program Files\Texas Instruments\edma3_lld_02_11_01_02\packages\ti\sdo\edma3\drv\sample\src\platforms\sample_c6678_int_reg.c".

  • Thanks for the help. I went and inserted some of the CSL error checking functions -- at first I was mistakenly not passing an array to the EventsMissed function, but after a long debugging period I realized that mistake.

    And I found that indeed when I do larger transfers, I begin to get Events Missed. I understand conceptually the mechanism of the interrupts -- but the LLD code is extremely obfuscated and doesn't appear any more functional than CSL's CPINTC.  And even if I did get the interrupts configured correctly, it would end up going back to software and reissuing the transfer, so right now, I'm just checking the EventsMissed register and looping back to do the transfer again if there was an event missed.  However, just sticking a loop around the exact same transfer doesn't seem to fix the problem -- meaning I'll have to program in some sort of backoff routine rather than issue the same-sized transfer that failed in the first place.  Hopefully all the checks, spinning, and partial reissuing won't make the transfers slower overall than doing normal CPU loads...

  • Tim,

    Based on your observations, I have a few questions.

    1) Are you chaining/linking EDMA channels together? If so, which channels and in what order?

    2) Are you enabling any intermediate completion events in the EDMA paramsets? If so, for which ones? These two questions will give us a clear idea of the sequence of transfers leading up to the missed event.

    3) Does your last transfer link to a NULL paramset? If it does, please ensure that  you do not set any intermediate completion events on this last transfer.

  • I am not doing anything more than issuing a single transfer (as part of a loop in the application, but it is a single DMA transfer each time) with Acounts of less than 1024 (bytes) and Bcounts of around 200 (# of arrays).  I figured "large" number of arrays was causing a failure since the transfer might be automatically decomposed into smaller events and then that number of events was overflowing the Queue.  As far as I can tell, the only restriction on transfer sizes comes from the number of bits of representation.

     

    Aren't chaining and linking entirely different mechanisms? From what I gathered from the manual: linking is reloading a new PaRAM set that was specified in the OPTs when the transfer completes.  Chaining.. I'm not as sure about. The manual mentioned that the transpose example "chained to itself", but it wasn't clear how that was configured.  One can see that the Intermediate Transfer Chaining interrupt bit is set, but the "receiver" of the Intermediate event is not apparent.
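
    To make the distinction concrete: linking reloads the PaRAM set from the address held in its LINK field, while chaining uses the TCC field of OPT as the number of the channel to trigger when a transfer (or an intermediate part of one) completes. A channel "chained to itself" is therefore one whose TCC equals its own channel number. The sketch below encodes that idea; it assumes the standard OPT bit positions from the EDMA3 user guide, and the macro and function names are made up for illustration.

```c
#include <stdint.h>

/* Assumed OPT bit positions (EDMA3 PaRAM OPT word). */
#define OPT_SYNCDIM_AB (1u << 2)                         /* AB-synchronized */
#define OPT_STATIC     (1u << 3)
#define OPT_TCC(n)     (((uint32_t)(n) & 0x3Fu) << 12)   /* OPT[17:12] */
#define OPT_TCINTEN    (1u << 20)   /* interrupt on final completion */
#define OPT_ITCINTEN   (1u << 21)   /* interrupt on intermediate completion */
#define OPT_TCCHEN     (1u << 22)   /* chain on final completion */
#define OPT_ITCCHEN    (1u << 23)   /* chain on intermediate completion */

/* OPT word for a channel that re-triggers itself after every intermediate
 * frame and raises IPR[channel] once the whole transfer has finished. */
uint32_t opt_self_chained(unsigned channel)
{
    return OPT_SYNCDIM_AB | OPT_ITCCHEN | OPT_TCINTEN | OPT_TCC(channel);
}

/* Channel that a chaining event from this OPT word would trigger,
 * or -1 if neither chaining enable bit is set. */
int opt_chain_target(uint32_t opt)
{
    if (!(opt & (OPT_TCCHEN | OPT_ITCCHEN)))
        return -1;
    return (int)((opt >> 12) & 0x3Fu);
}
```

    With this encoding the "receiver" of the intermediate event is simply whatever channel number sits in TCC, so `opt_chain_target(opt_self_chained(6))` gives 6.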

    My application would have benefited from using a transpose/"data sorting", but mimicking the example in the EDMA Manual didn't yield much success (whatever the reason).  But the transpose operation isn't a bottleneck, so that's much less of a priority than getting the "simple" transfers working reliably.

    So without any chaining/linking, my single transfer is generating missed events. And I'm also encountering difficulties recovering from those missed events. I'm currently in the process of more extensive debugging. I can now clear the missed events, and I've gotten the transfer to (eventually) complete sometimes.  In one case, a transfer of 512x142 (Acount x Bcount) finally completed when it was backed off to 512 x 2 (every larger Bcount tried while halving down to 2 failed). On the other hand, a different transfer of 416x142 never completed (I backed off the Bcount down to 1, but the transfer kept generating a missed event, even though I cleared it each time). Hopefully backing off the Acounts too will get these transfers to eventually succeed, but I find it extremely discomforting that the transfers fail in the first place and that after failing, it gets harder for any transfers to succeed.  It's possible I'm doing something wrong, but I don't believe I've gone out of bounds with my addresses -- and the previous, smaller transfers were succeeding.
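
    The back-off described above amounts to splitting one Acount x Bcount transfer into several pieces with fewer arrays each. A minimal sketch of the address arithmetic follows; the `Chunk` type and `split_2d_transfer` are hypothetical helpers, not CSL calls, and actually submitting each piece is left out.

```c
#include <stdint.h>

typedef struct {
    uint32_t src;   /* byte address of the first source array in this piece */
    uint32_t dst;   /* byte address of the first destination array */
    uint16_t bcnt;  /* number of arrays in this piece */
} Chunk;

/* Split a 2-D transfer of bcnt arrays into pieces of at most
 * rows_per_chunk arrays. Returns the number of pieces written to out[],
 * or -1 if rows_per_chunk is 0 or out[] is too small. */
int split_2d_transfer(uint32_t src, uint32_t dst, uint16_t bcnt,
                      int16_t src_bidx, int16_t dst_bidx,
                      uint16_t rows_per_chunk, Chunk *out, int max_chunks)
{
    if (rows_per_chunk == 0)
        return -1;
    int n = 0;
    for (uint32_t row = 0; row < bcnt; row += rows_per_chunk) {
        if (n == max_chunks)
            return -1;
        uint32_t left = (uint32_t)bcnt - row;
        /* Array 'row' starts row * BIDX bytes after the first array. */
        out[n].src = src + (uint32_t)((int32_t)row * src_bidx);
        out[n].dst = dst + (uint32_t)((int32_t)row * dst_bidx);
        out[n].bcnt = (uint16_t)(left < rows_per_chunk ? left : rows_per_chunk);
        n++;
    }
    return n;
}
```

    Splitting the 512x142 case into pieces of 64 arrays, for instance, yields three pieces of 64, 64 and 14 arrays.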

    That brings up an important question: can the controllers be used in parallel?  If the same controller can't be used by multiple cores, then that's a serious limitation when there are more cores than controllers..

  • Tim,

    Let us leave out the self-chaining part for now. It is a very effective way of performing a transpose that uses just one ABSYNC transfer... maybe we can bring it in later. Also a multicore discussion can be a separate discussion unless your failing setup uses multiple cores. That brings a whole new dimension to the discussion.

    It is not clear whether your paramset contents are the same for each iteration of the loop or different. Is your objective a 2-D transfer or 3-D transfer? Since you are trying to transpose a matrix I assume it is 3-D. But I did not see any mention of a CCNT. To debug the issue, it will be helpful if you post the paramset contents here.
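
    For readers wondering how a single AB-synchronized transfer can produce a transpose, here is a small software model of the address arithmetic. It is only a sketch: it assumes the AB-sync indexing rules (BIDX steps between arrays, CIDX is measured from the start of one frame to the start of the next), it runs all frames in one call whereas the hardware needs one trigger per frame (which is what self-chaining supplies), and the parameter choice shown is one possible setup rather than the user guide's example. All names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Software model of a 3-D AB-synchronized transfer: ccnt frames,
 * each of bcnt arrays, each array acnt contiguous bytes. */
void model_absync_transfer(const uint8_t *src, uint8_t *dst,
                           uint16_t acnt, uint16_t bcnt, uint16_t ccnt,
                           int16_t src_bidx, int16_t dst_bidx,
                           int16_t src_cidx, int16_t dst_cidx)
{
    for (uint16_t c = 0; c < ccnt; c++) {
        const uint8_t *s = src + (ptrdiff_t)c * src_cidx;  /* frame start */
        uint8_t *d = dst + (ptrdiff_t)c * dst_cidx;
        for (uint16_t b = 0; b < bcnt; b++)
            memcpy(d + (ptrdiff_t)b * dst_bidx,
                   s + (ptrdiff_t)b * src_bidx, acnt);
    }
}

/* Transpose a rows x cols matrix of 32-bit words (row-major) into dst.
 * Returns -1 when a stride does not fit the 16-bit signed index fields. */
int model_transpose32(const uint32_t *src, uint32_t *dst,
                      uint16_t rows, uint16_t cols)
{
    uint32_t dst_bidx = (uint32_t)rows * 4u; /* next element: one dst row down */
    uint32_t src_cidx = (uint32_t)cols * 4u; /* next frame: next source row */
    if (dst_bidx > INT16_MAX || src_cidx > INT16_MAX)
        return -1;
    model_absync_transfer((const uint8_t *)src, (uint8_t *)dst,
                          4, cols, rows,
                          4, (int16_t)dst_bidx,
                          (int16_t)src_cidx, 4);
    return 0;
}
```

    In this setup ACNT is one element, BCNT is the number of columns and CCNT is the number of rows, so a 2x3 source {1,2,3,4,5,6} lands in the destination as {1,4,2,5,3,6}.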

  • Thanks for your help so far. I am grateful. I seem to have found the solution to the problem I was having.

    Due to the wonders of human error, it turns out I was doing 2 incredibly dangerous things in combination: I had the STATIC option disabled -- and (at first) I was leaving the LINK PaRAM field uninitialized.  I have since started initializing the LINK field to 0xFFFF and later enabled STATIC, but the errors only went away after I power cycled my board. I have tested up to the full size of the transfer I desired, so that means everything should work out.
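
    The reason this combination is dangerous: with STATIC disabled, the PaRAM set is reloaded from the address in its LINK field once the transfer is exhausted, and 0xFFFF is the value that means "no link" (a null set). A minimal sketch of a defensive initializer follows; the `ParamSet` struct is a hypothetical stand-in that mirrors the eight 32-bit PaRAM words, not the CSL type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical mirror of the eight 32-bit words of a PaRAM set. */
typedef struct {
    uint32_t opt;
    uint32_t src;
    uint32_t a_b_cnt;       /* BCNT in the high half, ACNT in the low half */
    uint32_t dst;
    uint32_t src_dst_bidx;  /* DSTBIDX high half, SRCBIDX low half */
    uint32_t link_bcntrld;  /* BCNTRLD high half, LINK low half */
    uint32_t src_dst_cidx;  /* DSTCIDX high half, SRCCIDX low half */
    uint32_t ccnt;
} ParamSet;

#define PARAM_LINK_NULL 0xFFFFu

/* Start every set from a known state: all fields zero, CCNT = 1,
 * and LINK set to the null link so nothing stale is ever reloaded. */
void param_init(ParamSet *p)
{
    memset(p, 0, sizeof *p);
    p->link_bcntrld = PARAM_LINK_NULL;  /* BCNTRLD = 0, LINK = 0xFFFF */
    p->ccnt = 1;
}

/* Nonzero if the set terminates with a null link. */
int param_link_is_null(const ParamSet *p)
{
    return (p->link_bcntrld & 0xFFFFu) == PARAM_LINK_NULL;
}
```

    Calling something like `param_init` before filling in the addresses and counts means a forgotten field ends up as zero or a null link rather than whatever happened to be on the stack.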

     

    As mentioned before, my application does involve a transpose that I was having problems performing using DMA.  The aforementioned problem may have been the cause, but I'll need to do a deeper investigation (I removed the DMA transpose routine very early on in development).  I also had a 3-submatrix transfer that I was having problems with when I was using DMA (i.e. I could do the transfer of 2 submatrices fine, but doing the 3rd caused problems) -- however, _that_ problem is more than likely the exact same as the STATIC/LINK error I had.

     

    Thanks for your patience! (and suggestion for scrutinizing the param set -- I had just been sticking with something that was producing correct results before)

     

    Finally, yes, this is a multi-core application, but I have the parallel issues worked out (on Debug, anyway) and I can execute initialization routines on both cores (I'm only testing on 2 at the moment) with separate CSL EDMA modules, but using the same EDMA instance (which would be required since 3 instances would need to be shared among 8 cores).  I could go into more detail, but I'm at a point where I have 2 cores being able to use EDMA in parallel and return correct results for the test cases so far.

  • Tim, good to know that you successfully traced the issue.

    My 2 cents if you want to share the same EDMA instance among two or more cores... it is good practice to allocate independent resources for each core using shadow regions. This avoids EDMA resource conflicts. There are 8 shadow regions on an EDMA instance, so you can assign one per core.
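
    One way to picture "independent resources per shadow region" is as a set of non-overlapping channel bit masks, one per region, of the kind that would be written to each region's DRAE/DRAEH access-enable registers. The allocation policy below (core n gets region n and a contiguous block of channels) is purely an assumption for illustration, the function names are hypothetical, and the number of channels actually available depends on the CC instance.

```c
#include <stdint.h>

#define NUM_REGIONS 8

/* Hypothetical policy: core n owns shadow region n and the channels
 * [n * chans_per_core, (n + 1) * chans_per_core). Returns a 64-bit mask
 * (low word -> DRAE, high word -> DRAEH), or 0 for invalid arguments. */
uint64_t region_channel_mask(unsigned core, unsigned chans_per_core)
{
    if (core >= NUM_REGIONS || chans_per_core == 0 ||
        chans_per_core > 64 / NUM_REGIONS)
        return 0;
    uint64_t block = (1ull << chans_per_core) - 1;
    return block << (core * chans_per_core);
}

/* Nonzero if no channel appears in more than one region's mask. */
int regions_are_disjoint(const uint64_t masks[], unsigned n)
{
    uint64_t seen = 0;
    for (unsigned i = 0; i < n; i++) {
        if (seen & masks[i])
            return 0;
        seen |= masks[i];
    }
    return 1;
}
```

    A check like `regions_are_disjoint` at initialization time is a cheap way to catch two cores being handed the same channel.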

    Please click the  Verify Answer  button below if your question is answered. Thanks.

  • So I incorporated the change, and everything worked swimmingly for 1-core 2D transfers -- but I tried to work with the Transpose again, and it still didn't work, but it isn't a vital part of the application, so I decided to just do without it again.

    But now I'm encountering a similar symptom (incorrect results) in a similar situation (many large transfers).  However, the kicker is that my application is now multicore, and that seems to be a contributing cause.

     

    Here's my situation:

    Everything is correct and fine when I run on 1-core with full optimization or N-cores with Debug.  When I do N-cores with some "slowing" function like a printf() on core0 between some of the transfers, the results appear to be correct, but depending on how "slow" the extra function is, the incorrect results may start happening again when I perform even more transfers.

    I first noticed a problem when I had 4 cores doing these transfers on CC0 with channels 0-3, using 2 event queues and 1 shadow region for each core.

    Then I moved some of the cores to work through CC1, and that started producing correct results again.

    Adding the 5th core didn't have a problem (the 5 cores spread across CC0 and CC1 using 5 shadow regions and channel 0-4 on whichever single CC the core was using)

    But when I added core6, I again started having problems.  Even spreading the cores evenly across CC0-2 didn't solve the problem.  I then found that if I changed the channel from Channel 5 to Channel 6, I didn't have the problem anymore.  That opened up the whole concern about whether the channels I'm using are interfering with the other events that apparently use the same event mechanism (even though I'm not using anything besides the CPU, IPC/Semaphores, and DMA).

     

    With that workaround, I got 6 cores running, but I am stuck at getting 7 to work.  Adding the 7th core with its own Shadow region seems to always return incorrect values when I issue more transfers ("always" meaning that there's at least one test size I'm using for which it fails).  I am doing every error check I can find in the CSL API, but nothing is being detected as an error or missed event.

     

    In all cases "incorrect results" means that an accuracy checking function exceeded its threshold (about 100x "correct" error values).  These incorrect results occurred at various places, very frequently at places that aren't even touched by the "newly added" core. Sometimes it was only a couple of arrays of the entire frame. The indexes of the incorrect results vary every time I run, but they are normally aligned to the beginning of a frame, though the frame may not be at the beginning of the data.

    Here's an example:

    Each loop iteration first does one transfer of at most Acnt=4*192, Bcnt=128 into MSMCSRAM, followed by repeated transfers of at most Acnt=4*128, Bcnt=24 into L1DSRAM:

    (Super Matrix dimensions, ... Maximum Error)

    [C66xx_0] 500_502_500_500: ...3.433e-05

    [C66xx_0] 720_722_720_720: ...4.578e-05

    [C66xx_0] 940_942_940_940: ...6.866e-05

    [C66xx_0] Error at (i= 192,j=   2): -15.7921 instead of -13.7985 

    [C66xx_0] Error at (i= 193,j=   2): -5.4402 instead of -7.1265 

    [C66xx_0] Error at (i= 194,j=   2): -9.4237 instead of -3.6400 

    [C66xx_0] Error at (i= 195,j=   2): -0.2810 instead of -2.4690 

    ... (more errors)

    [C66xx_0] Error at (i= 193,j=   3):  4.9041 instead of  1.9240 

    [C66xx_0] Error at (i= 194,j=   3): -1.4603 instead of  0.7088 

    [C66xx_0] Error at (i= 195,j=   3): 12.7566 instead of  1.5279 

    ... (more errors)

    [C66xx_0] 1160_1162_1160_1160: ..1.123e+01 

    [C66xx_0] 1380_1382_1380_1380:... 1.068e-04 

    [C66xx_0] 1600_1602_1600_1600: ... 1.183e-04 

    [C66xx_0] 1820_1822_1820_1820: ... 1.526e-04

     

    A typical Param set is:

    [C66xx_6] option     : 0x10600c () // TCINT enabled, TCC = 6, Static enabled, AB sync

    [C66xx_6] SrcAddr    : 0x80611d80 

    [C66xx_6] BcntAcnt   : 0x800010 (128, 16)

    [C66xx_6] DstAddr    : 0xc090c00 

    [C66xx_6] srcDstBidx : 0x101590 (5520, 16)

    [C66xx_6] linkBcntrld: 0xffff (65535, 0)

    [C66xx_6] srcDstCidx : 0x0 (0, 0)

    [C66xx_6] Ccnt       : 0x1 (1 )
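
    For anyone reading the dump above, the packed words can be unpacked as follows. This is a minimal sketch assuming the standard PaRAM packing (TCC in OPT[17:12], TCINTEN at bit 20, STATIC at bit 3, SYNCDIM at bit 2, and the 16-bit halves as noted in the comments); the `DecodedParam` type and `decode_param` function are hypothetical.

```c
#include <stdint.h>

typedef struct {
    unsigned tcc;        /* OPT[17:12] */
    int      tcinten;    /* OPT[20] */
    int      is_static;  /* OPT[3]  */
    int      ab_sync;    /* OPT[2]  */
    uint16_t acnt, bcnt;
    int16_t  src_bidx, dst_bidx;
    uint16_t link, bcnt_rld;
} DecodedParam;

/* Unpack the OPT, BcntAcnt, srcDstBidx and linkBcntrld words. */
DecodedParam decode_param(uint32_t opt, uint32_t bcnt_acnt,
                          uint32_t src_dst_bidx, uint32_t link_bcntrld)
{
    DecodedParam d;
    d.tcc       = (opt >> 12) & 0x3Fu;
    d.tcinten   = (opt >> 20) & 1u;
    d.is_static = (opt >> 3) & 1u;
    d.ab_sync   = (opt >> 2) & 1u;
    d.acnt      = (uint16_t)(bcnt_acnt & 0xFFFFu);     /* low half  */
    d.bcnt      = (uint16_t)(bcnt_acnt >> 16);         /* high half */
    d.src_bidx  = (int16_t)(src_dst_bidx & 0xFFFFu);   /* low half  */
    d.dst_bidx  = (int16_t)(src_dst_bidx >> 16);       /* high half */
    d.link      = (uint16_t)(link_bcntrld & 0xFFFFu);  /* low half  */
    d.bcnt_rld  = (uint16_t)(link_bcntrld >> 16);      /* high half */
    return d;
}
```

    Feeding it the values from the dump (0x10600c, 0x800010, 0x101590, 0xffff) gives TCC = 6 with TCINTEN, STATIC and AB-sync set, ACNT = 16, BCNT = 128, a source BIDX of 5520, a destination BIDX of 16, and a null link, which matches the annotations in the post.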

     

    Should I try using more channels? And wouldn't that cause problems if choosing a particular channel number appeared to affect my situation earlier? I feel like the sheer number of transfers I'm doing is the problem, but sometimes the transfers are correct for later, *even larger* values (though this is not true most of the time).  Is it just not possible to issue some shapes of transfers repeatedly (e.g. repeatedly doing large Bcounts but small Acounts)?

    Is there a known limitation on the CCs/TCs? Are my transfers too big and too frequent for manually-triggered transfers?

     

    I have a good feeling it's not related to this issue but rather some compiler bug for high optimization, but I sometimes (not always) get this error near when I initialize EDMA across all cores:

    xdc.runtime.Core: line 86: assertion failure: A_initializedParams: uninitialized Params struct

    Again, I don't think that's the cause of the problem (especially since recompiling normally fixes the problem), but it's the only explicit error I get from any source.

  • Tim,

    Are you checking for transfer completion in the IPR register? If your transfer uses TCC=6 and TCINTEN=1, IPR should be polled for IPR[6]=1. Perhaps you are already doing this, but I just want to double-check since you did not explicitly state it.
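
    In outline, the completion check described above looks like the sketch below. It works on plain pointers to the (shadow region) IPR and ICR registers rather than through CSL, assumes that writing a 1 to an ICR bit clears the matching IPR bit, and bounds the wait so the caller can go look at the error registers instead of spinning forever. The function name and the poll limit are hypothetical.

```c
#include <stdint.h>

/* Poll IPR for the completion code 'tcc' and acknowledge it through ICR.
 * Returns 0 once the bit was seen and the clear was issued, or -1 if
 * max_polls reads passed without completion. TCC values 32..63 live in
 * IPRH/ICRH and are not handled by this sketch. */
int wait_for_tcc(volatile const uint32_t *ipr, volatile uint32_t *icr,
                 unsigned tcc, uint32_t max_polls)
{
    if (tcc >= 32)
        return -1;
    uint32_t bit = 1u << tcc;
    while (max_polls--) {
        if (*ipr & bit) {
            *icr = bit;  /* write exactly this bit; zero bits have no effect */
            return 0;
        }
    }
    return -1;
}
```

    For the paramset above this would be called with `tcc` = 6, so the mask written to ICR is 0x40.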

    I doubt the failure has anything to do with using Ch6 vs. Ch7, since they are logical elements in the EDMA and do not represent hardware.

    #1 Can you paste how you are allocating resources across shadow regions? Also, FYI, for future reference, there is no advantage in setting STATIC=1 unless you are doing some debug on an isolated transfer that does not link/chain to another.

    Your paramset looks correct. Your transfer size is also well within bounds. I don't believe you are hitting any hardware limitation on the EDMA CC/TC. You would have hit an error status if you were. If you are checking for IPR completion, it is really strange that you see incorrect transfers and no errors to go along with it. What

    #2 If you turn optimization OFF and the errors go away, then we have hit the issue right there as you reported: "uninitialized Params struct". When you say recompiling fixes the problem, does a re-compile of the same source with the same optimization level make the "uninitialized Params struct" error go away and consequently, your incorrect transfers as well?

  • *epic facepalm*

    After writing a novel-length post about my setup, I got to the part where I started pasting every DMA-related segment... and then finally I got to the polling function, which immediately threw me off:

     

    It used to be:

     

    CSL_Edma3CmdIntr regionIntr;

        regionIntr.region = regionNum;

        regionIntr.intr = regionIntr.intr & ((0x1)<<intrNum);

        regionIntr.intrh  = 0;

        CSL_edma3HwControl(hModule,CSL_EDMA3_CMD_INTRPEND_CLEAR,&regionIntr);

     

     

    Uninitialized variables off the stack, anyone? So it looks like I wasn't necessarily clearing the IPR bit each time since if the uninitialized variable _happened_ to have a 0 in that bit, then ANDing it would have no effect and the following CSL command wouldn't clear the bit I wanted it to clear (or any other bits for that matter).

     

    I have now changed it to

    regionIntr.intr = ((0x1)<<intrNum);
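
    The difference between the two versions can be shown in isolation. In the sketch below, `IntrMask` is a hypothetical stand-in for the intr/intrh fields of the CSL command struct, and the "stale" argument plays the role of whatever happened to be on the stack.

```c
#include <stdint.h>

/* Stand-in for the intr/intrh fields of the command struct shown above. */
typedef struct {
    uint32_t intr;   /* bits for interrupts 0..31  */
    uint32_t intrh;  /* bits for interrupts 32..63 */
} IntrMask;

/* Original logic: the mask depends on the struct's previous contents.
 * Requires intr_num < 32. */
IntrMask buggy_clear_mask(IntrMask stale, unsigned intr_num)
{
    stale.intr = stale.intr & (1u << intr_num);
    stale.intrh = 0;
    return stale;
}

/* Corrected logic: the mask is a pure function of intr_num. Interrupts
 * 32..63 go to intrh; anything larger yields an empty mask. */
IntrMask clear_mask(unsigned intr_num)
{
    IntrMask m = {0, 0};
    if (intr_num < 32)
        m.intr = 1u << intr_num;
    else if (intr_num < 64)
        m.intrh = 1u << (intr_num - 32);
    return m;
}
```

    With stale contents that happen to have bit 6 clear, the old version produces an empty mask, so the clear command touches nothing and IPR[6] stays set; the next poll then "completes" immediately, before the new transfer has actually finished.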

     

    And I've been able to run far past the tests that were failing earlier.  I'll try going up to 8 cores, and I'll even try adding back in the transpose again! Thanks for the help and making me look at pieces of my code that had long since dropped off my radar.  +10 awesome points for you.

     

    P.S. Just did the transpose. It worked for smaller sizes (which it wasn't doing before), but when I had a large test, it returned incorrect results. Something could possibly be afoul though.

    P.P.S. The "uninitialized Params struct" error still occurs. Although it's talking about XDC params, not DMA Params.

  • No problem, let me know if you are able to run on all 8 cores.

  • At least for the one set of tests I performed, everything worked fine on 8 cores. I was using all 3 CC's even though I might have been able to get away with fewer.

    As I mentioned, the transpose didn't appear to work when I had larger matrices, but for the smaller matrices it worked -- although it slowed down my application. It appeared that the DMA transpose was taking longer than DMAing the data in and then doing an [optimized] CPU transpose. Doing the transpose in DMA also meant I needed to invalidate the result in the cache, as opposed to it being mildly warm in the cache from the CPU transpose.

     

    Anyway, I achieved my goal: 8 cores with DMA.

     

    Thanks a bunch!