OMAP L138 EDMA3 and SPI0 issue

Other Parts Discussed in Thread: OMAP-L138

Hello,

I'm currently using the EDMA3 controller on the OMAP-L138 (running at 456 MHz) for 16-bit transfers to and from the SPI0 peripheral. The SPI0 peripheral is set up as a slave on the L138 and is being driven at almost a 25 MHz SPI clock rate (so about 750 ns per 16-bit transfer). The DMA transfers in and out are double-buffered using EDMA3 PaRAM sets. The memory used for the transfers is L2 (IRAM).

To minimize issues with SPI transfers, I've dedicated transfer controller 0 of channel controller 0 to the SPI0 peripheral by setting the SPI events to go to queue zero. All other peripherals using DMA use TC1 (via queue 1).
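The queue assignment above boils down to a read-modify-write of the EDMA3 CC's DMAQNUMn registers. A minimal sketch of that bit math follows; the SPI0 event numbers (RX = channel 14, TX = channel 15) and the field layout are assumptions taken from the usual OMAP-L138 event map, so verify them against the TRM for your part.

```c
#include <stdint.h>

/* Each DMAQNUMn register holds eight 4-bit fields, one per DMA
 * channel; the low 3 bits of each field select the event queue. */

/* Which DMAQNUMn register (n = channel / 8) the channel lives in. */
static unsigned dmaqnum_index(unsigned channel)
{
    return channel / 8u;
}

/* Return the register value with 'channel' routed to 'queue'. */
static uint32_t dmaqnum_set(uint32_t reg, unsigned channel, unsigned queue)
{
    unsigned shift = (channel % 8u) * 4u;
    reg &= ~(0x7u << shift);         /* clear the 3-bit queue field */
    reg |= (queue & 0x7u) << shift;  /* select the new queue */
    return reg;
}
```

Since channels 14 and 15 both fall in DMAQNUM1, a single read-modify-write of that register routes both SPI0 events to queue 0 (TC0).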

I've also changed the master priority registers (in the SYSCFG module) to let EDMA3_0_TC0 run at priority 0 (default). EDMA3_0_TC1 runs at priority 1 (changed from default) and EDMA3_1_TC0 runs at priority 4 (default).
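For reference, the master priority change requires unlocking SYSCFG first. Below is a sketch with the register file modeled as a plain struct so it can be exercised on a host; the KICK unlock keys are the documented OMAP-L138 values, but the MSTPRI field offsets used here are assumptions to illustrate the read-modify-write, so check the SYSCFG chapter of the TRM.

```c
#include <stdint.h>

/* Host-side model of the relevant SYSCFG registers (illustrative). */
struct syscfg_model {
    uint32_t kick0, kick1;   /* KICK0R/KICK1R: unlock SYSCFG writes */
    uint32_t mstpri;         /* MSTPRI register holding TC priorities */
};

enum {
    PRI_EDMA30_TC0_SHIFT = 0,   /* assumed field positions - verify */
    PRI_EDMA30_TC1_SHIFT = 4
};

static void set_tc_priorities(struct syscfg_model *s)
{
    s->kick0 = 0x83E70B13u;  /* unlock keys per the OMAP-L138 TRM */
    s->kick1 = 0x95A4F1E0u;
    /* TC0 stays at priority 0 (highest); TC1 demoted to priority 1. */
    s->mstpri &= ~((0x7u << PRI_EDMA30_TC0_SHIFT) |
                   (0x7u << PRI_EDMA30_TC1_SHIFT));
    s->mstpri |= (0u << PRI_EDMA30_TC0_SHIFT) |
                 (1u << PRI_EDMA30_TC1_SHIFT);
}
```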

Unfortunately, I'm sometimes still seeing a dropped 16-bit word. Can anyone advise what other optimizations or changes are needed to run the SPI peripheral at the datasheet speed of 25MHz?

Thanks

  • Hi
    Questions
    1) Is it an occasional dropped word, or always a specific word at a specific time instance? I'm trying to figure out whether it is an initialization/synchronization type issue or truly a priority/contention type issue.
    2) What other traffic do you have in the system and on TC1?
    3) Does the "dropped word" issue go away if you do not have other traffic on TC1, or if you reduce the SPI clock frequency?
    4) What is the source/destination for the other traffic - is this also going in/out of L2? If so, can you try to put this in SHRAM or DDR?

    See the SCR section in the following wiki:
    processors.wiki.ti.com/.../AM1x_SoC_Architectural_Overview
    to see whether you have any common source/destination bridges that may be causing a choke point.
    Hope this helps.
  • Hi Mukul,

    1) It is only occasionally and is random.

    2) We have McBSP0 and McBSP1 running at 50 Mbit/s on TC1. CC1 is doing QDMA transfers between L2 and mDDR in both directions.

    3) SPI frequency reduction is not an option (it is a function of the SPI master). I can try to take out the TC1 traffic, but it is difficult as we have a dual-OMAP design with tasks designated to each core. I can possibly reduce some of the traffic to see if it makes a difference.

    4) The EDMA McBSP traffic is using un-cached mDDR, but our QDMA transfers are transferring from/to L2. 

  • My experience is with the SPI as master with EDMA, rather than your case of the SPI as slave. I'm not quite sure if you are using L2 or DDR to/from the McBSP. I found that using DDR resulted in pauses in a continuous transfer of words over SPI. I am guessing the DDR is busy doing a refresh and pauses the DMA; using L2 removed the pause. In my case, an irregular SPI master clock is annoying but functional. In your case, the pause might cause a receiver overflow.
  • Thanks.
    It appears that you may have a contention issue.

    A few suggestions to try out:
    1) Try to move your SPI traffic to SHRAM instead of L2 (I think L2 is getting heavier traffic from the QDMAs) - or, if you need the SPI in L2, try to move your QDMA traffic to SHRAM instead.
    2) Try to throttle your QDMA traffic - is the requirement on this link time critical? If not, make sure you are breaking the QDMA transfers into smaller data chunks per transfer request. If you have a big transfer size (ACNT*BCNT) going out per event, it may be back-to-back transfers to L2 that are starving your SPI/EDMA traffic due to head-of-line blocking somewhere (even though you have TC0 at higher priority).
    3) If CC1-TC0 is only going to be used for memory-to-memory transfers (and no other DMA, etc.) *and* your system is OK with lower throughput, you can also play with the TC.RDRATE register - this register setting puts dead cycles between every read issued by the TC. It can be detrimental to overall TC performance but perhaps acceptable for your system.
    4) It is important to understand/isolate the offending/contention traffic - is it the McBSP transfers, the QDMA transfers, or both? If it is the QDMA transfers, it is likely L2 that is your common choke point; if it is the McBSP transfers, it is likely BR3 and BR4 in the system interconnect, where you could have potential head-of-line blocking in the bridges.
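Suggestion 2 above amounts to picking a smaller ACNT and a larger BCNT for the same total byte count, so the TC emits many short transfer requests instead of one long burst. A minimal sketch of that split (names and the power-of-two shrink strategy are illustrative, not from the thread):

```c
#include <stdint.h>

struct ab_split { uint16_t acnt, bcnt; };

/* Split 'total_bytes' into ACNT x BCNT, shrinking the requested chunk
 * (by halving) until it divides the total evenly. */
static struct ab_split split_transfer(uint32_t total_bytes, uint16_t chunk)
{
    struct ab_split s;
    while (chunk > 1u && (total_bytes % chunk) != 0u)
        chunk >>= 1;
    s.acnt = chunk;
    s.bcnt = (uint16_t)(total_bytes / chunk);
    return s;
}
```

For example, a 4096-byte block requested with 64-byte chunks becomes ACNT = 64, BCNT = 64: sixty-four short transfer requests that give the SPI traffic more chances to slip in between them.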
  • Pirow, here is what I suggest:

    First, look at www.ti.com/.../spruh77c.pdf, Table and Figure 4-1 on page 102, and do a theoretical calculation of the total bandwidth that the system uses via each of the bridges. Basically, make a list of all the data moves that the system does; remember that the EDMA reads data into its internal buffers from the source and then writes the data toward the destination. Make sure that all bridges and connections are at no more than, say, 50% of capacity.
    Next, make sure that the choice of EDMA is the best with regard to data transfer, i.e. no EDMA CC is overused.
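A back-of-envelope helper for the budget Ran describes: since the EDMA moves every byte twice (a read leg into the TC buffer, then a write leg to the destination), each path costs double its wire rate. The example numbers below are illustrative assumptions, not measurements: SPI0 at 25 MHz is about 3.125 MB/s, and one 50 Mbit/s McBSP link is about 6.25 MB/s.

```c
#include <stddef.h>

/* Sum the EDMA load over all DMA paths, in MB/s. Each path is counted
 * twice: once for the TC read and once for the TC write. */
static double total_edma_load_mb_per_s(const double *paths_mb_per_s,
                                       size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += 2.0 * paths_mb_per_s[i];  /* read leg + write leg */
    return sum;
}
```

Compare the result against roughly half of each bridge's capacity, per the 50% rule of thumb above.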

    If the system's required bandwidth is well below the device capacity, I would suspect a delay issue. We saw it on a different device (with a different customer) where a long transfer with low priority blocked a higher-priority short transfer, which caused it to lose a value. I suggest that you make sure that all other EDMA transfers are as short as possible and then see if the problem goes away. Shorter transfers are indeed less efficient, so if delay is your issue, you want to tune the transfer size to the sweet spot where the system is efficient and no values are lost.

    Does it make sense? Mukul, what do you think?

    Ran

  • OOPS

    I did not mention that the clock of the SCR is HALF of the DSP/ARM clock, i.e. the 375-456 MHz clock divided by 2 (see the data sheet www.ti.com/.../omap-l138.pdf).

    Ran

  • Hi Norman,

    Thanks, we are already using L2. mDDR has indeed been shown to cause problems for SPI transfers.
  • Hi Mukul,

    Some of these suggestions will take some time:
    1) Does the EDMA3 cache snooping work for L3 SHRAM? If not, before moving to SHRAM I would have to think about adding cache maintenance instructions, since SHRAM in our system is cached.
    2) The QDMA traffic is not THAT time critical, although it is still part of a real-time system. Transfers are set up as AB transfers as bytes, which is probably fairly inefficient. I'll switch to word transfers. How can I throttle the QDMA?
    3) I've tried TC.RDRATE - it does not seem to make a difference.
    4) OK, are there any priorities that can be assigned to BR3 and BR4?
  • Sorry, 2) should read:
    The QDMA traffic is not THAT time critical, although it is still part of the real-time system. Transfers are set up with ACNT equal to the size of the transfer and BCNT as 1 (transfers are less than 64k). How should I set up the QDMA to throttle the transfers?
  • Hi Pirow

    On 1, no - the DSP caches will not stay coherent if another master touches Shared RAM/DDR (manual cache operations will be needed).

    On 2, it appears that for a QDMA event you are transferring a big ACNT (BCNT = 1)? If so, you are likely using the TC very "efficiently" and you may need to insert some dead cycles between transfers, either by manipulating RDRATE for this TC (though it looks like this is not helping), or by further breaking down your transfers into smaller chunks. The easiest way to do this would be to use a DMA channel (instead of QDMA) with chaining enabled, so that with a single ESR write and intermediate chaining you break your ACNT x BCNT transfer into smaller chunks, using ACNT = (smaller than current size) and BCNT = total transfer size / ACNT (this will also be the number of transfer requests that go out to the TC, instead of one big transfer request per chained event).
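The self-chaining Mukul describes is configured in the channel's PaRAM OPT word: set TCC to the channel's own number and enable ITCCHEN/TCCHEN, so each completed intermediate chunk chains the next transfer request. A sketch of building that OPT value; the bit positions follow the standard EDMA3 PaRAM OPT layout (TCC in bits 12-17, TCCHEN bit 22, ITCCHEN bit 23), and the channel number in the example is an assumption for illustration.

```c
#include <stdint.h>

#define OPT_TCC(ch)  (((uint32_t)(ch) & 0x3Fu) << 12)
#define OPT_TCCHEN   (1u << 22)  /* chain on final transfer completion */
#define OPT_ITCCHEN  (1u << 23)  /* chain on each intermediate chunk */

/* OPT value that makes a channel re-trigger itself after every
 * intermediate (and final) chunk completion. */
static uint32_t self_chaining_opt(unsigned channel)
{
    return OPT_TCC(channel) | OPT_TCCHEN | OPT_ITCCHEN;
}
```

With this in place, one ESR write starts the first chunk and the chain events feed out the rest, one short transfer request at a time.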

    On 3, see above

    On 4, the priorities that you assign to the various masters (in your case, the TC priorities via the MSTPRI registers) are what is "globally" seen and respected on all interconnect components (bridges and SCRs), *if* there is an opportunity to arbitrate. But remember: if the higher-priority request did not show up at the time your lower-priority QDMA request showed up, you could have several lower-priority access chunks slip into the bridge FIFO, where it is just first-in, first-out. The default arbitration size is 16-byte chunks for the EDMA TCs (i.e. the TC internally breaks each request into default burst size/DBS chunks).


    You may want to see if playing with the QDMA memory transfers' TC default burst size (via CFGCHIP0 in SYSCFG) makes any difference. It could work for or against you, depending on the nature of your SPI traffic, etc.


    Memory re-assignment and breaking your memory to memory transfers into smaller chunks by changing the EDMA programming are likely going to have the most impact to improve system traffic and reduce chances of head of line blocking etc.
  • Some confusion on my part. Are the SPI-->L2 transactions related to the L2-->mDDR transactions? Does the same packet travel through one then the other? If so, is the dropped word detected in L2 or mDDR? I've been assuming master to slave only. Is there slave to master as well?
  • Hi Norman,

    Yes. Basically there is a fast ping-pong buffer in L2 to receive incoming SPI words (each buffer is about 600 us deep). Once the DMA engine has finished one buffer, it triggers an interrupt that triggers a QDMA transfer to move the just-buffered block out to mDDR (a deep receive buffer). Another semaphore-driven process then brings in the next block for processing from mDDR to L2 (using QDMA).
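The handoff in that ping-pong scheme can be sketched as below. The buffer depth and names are illustrative assumptions; in the real system, the EDMA completion interrupt for the SPI0 RX channel would call this and then kick the QDMA copy of the returned half out to mDDR.

```c
#include <stdint.h>

#define HALF_WORDS 256              /* assumed depth of each half */

static uint16_t ping[HALF_WORDS], pong[HALF_WORDS];
static int filling_pong;            /* 0: DMA fills ping; 1: pong */

/* Called from the EDMA completion interrupt: returns the just-filled
 * half for the QDMA copy, while the DMA continues into the other. */
static uint16_t *on_half_complete(void)
{
    uint16_t *done = filling_pong ? pong : ping;
    filling_pong ^= 1;              /* swap which half the DMA fills */
    return done;                    /* caller QDMAs this half to mDDR */
}
```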

  • Hi Norman

    If you are confused about the path of contention or where the "common" end point is, I have tried to quickly draw out the SPI0 and QDMA traffic paths (I didn't draw the McBSP path and did not place much emphasis on the direction of the arrows, src/dst, etc.).

    Hope this helps. spi_edma.pdf

  • Hi Mukul,

    Yes, except there is one more contention path, as the DMA controller is also doing SPI TX from L2 (double buffered).
  • Mukul, thanks for the diagram. DMA has never been my strong point.
  • Pirow, is it possible that the ping-pong buffers are getting overwritten before being transferred out? Perhaps your dropped word is actually a word from the next packet. Admittedly, if that happened, you would likely have more than one dropped word.
  • Hi Norman, the SPI words received are actually synchronised pairs and the buffers are of even length. We are losing exactly one SPI word, i.e. one half of a synchronised pair. I'll add in some explicit checks to verify that the buffer the DMA engine is writing to is not the same as the buffer being copied out to mDDR.
  • Hi Norman,

    I've confirmed that it is NOT the ping-pong buffer mechanism becoming unsynchronized. I've added an assert check on the current destination address of the DMA engine for SPI0 RX, and the assert did not fire when the dropped word occurred.
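A check along those lines can be sketched as follows: read the live DST pointer from the SPI0 RX PaRAM entry (DST sits at offset 0x0C in each 32-byte PaRAM set) and assert it does not point into the half-buffer currently being copied out. `param_dst` here is a plain value standing in for that memory-mapped read.

```c
#include <stdint.h>
#include <stddef.h>

/* Does the DMA's current destination fall inside [buf, buf+bytes)?
 * In the ISR one would assert this is false for the half being
 * copied out to mDDR. */
static int dst_inside(uintptr_t param_dst, uintptr_t buf, size_t bytes)
{
    return param_dst >= buf && param_dst < buf + bytes;
}
```

A hypothetical helper reading the PaRAM DST field would feed `param_dst`; the buffer base and size come from the half currently queued for the QDMA copy.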

  • Hi Mukul,

    I've shifted the ping-pong buffers to L3 SHRAM, but this does not fix the issue. On this point, is the L3 SHRAM actually lower latency than L2 for the DMA engine to write to? It seems that for L2 there are the internal C6000 busses to traverse, and also L2 cache coherency to consider.

    I'll try the QDMA breakup and the transfer controller default burst size.
  • Hi Pirow
    Hmm. I find it a little strange that moving to Shared RAM did not change your failure rate at all.
    It would almost seem like the QDMAs from L2 to DDR are not causing the issue, because by moving the SPI RX/TX to SHRAM you should now have completely independent paths.

    Perhaps it is the McBSP traffic and how it interleaves in BR3/4 that is causing issues with the SPI traffic. If so, before trying the QDMA break-up (which is likely a good thing in the longer term anyway), can you confirm that if you were to disable your QDMA or McBSP traffic your SPI runs fine and without errors? With only a single data corruption, even though random, it could also be some sort of software/synchronization issue.

    From a topology perspective, the latency to the boundary of L2 and SHRAM is roughly the same for the TC/EDMA. The only advantage is likely that when the DSP has to access L2 memory it is "closer" to the DSP, compared to the DSP having to go over the interconnect to fetch data from SHRAM.
  • Hi Mukul,

    OK, I finally found why there is a missing word - the SPI peripheral silently discards an SPI word if there is a bit-length error. Apparently the SPI master (the OMAP is the slave) is exceeding the OMAP SPI's timing requirements, and the OMAP is missing a clock edge. The SPI peripheral then discards the SPI word. So it is not a DMA problem after all. I'll open a separate ticket about the OMAP's inability to run continuous SPI at 22 MHz+.