
L138 EDMA3 Linked transfers PaRAM change latency

Hello,

I'm currently using the EDMA3 controller on the OMAP L138 for 16-bit transfers to and from the SPI0 peripheral. The SPI0 peripheral is set up as a slave and is being driven at almost 25 MHz (so about 750 ns per 16-bit transfer). The DMA transfer in is set up as a ping-pong buffer, using PaRAM set 14 as the active set and sets 64 and 65 as the reload sets for ping and pong.

Initial streaming to memory works fine (mDDR in this case), but the DMA engine seems to lose a sample when updating PaRAM set 14 with either set 64 or 65.

I'll up the clock speed (I'm not currently running at full speed) and move the ping-pong buffers to L2 to see if things improve, but it would be useful to understand what the latency of reloading the PaRAM set is in terms of clock cycles.
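For reference, the linking arrangement I'm describing looks roughly like this (a minimal register-level sketch, not my actual driver code; the PaRAM base, SPIBUF address and field packing are my reading of the OMAP-L138 TRM, and the buffer names/sizes are placeholders):

    #include <stdint.h>

    /* EDMA3 CC0 PaRAM on OMAP-L138: CC base 0x01C00000 + 0x4000 -- verify in TRM */
    #define EDMA3_PARAM_BASE  0x01C04000u
    #define SPI0_SPIBUF       0x01C41040u      /* SPI0 receive buffer -- verify in TRM */

    /* One 32-byte PaRAM set as laid out in the EDMA3 CC */
    typedef struct {
        volatile uint32_t OPT;
        volatile uint32_t SRC;
        volatile uint32_t A_B_CNT;             /* ACNT [15:0], BCNT [31:16]    */
        volatile uint32_t DST;
        volatile uint32_t SRC_DST_BIDX;
        volatile uint32_t LINK_BCNTRLD;        /* LINK [15:0], BCNTRLD [31:16] */
        volatile uint32_t SRC_DST_CIDX;
        volatile uint32_t CCNT;
    } edma3_param_t;

    #define PARAM(n)  ((edma3_param_t *)(EDMA3_PARAM_BASE + (n) * 0x20u))

    #define BLOCK_LEN 512u                     /* placeholder block size (16-bit words) */
    static uint16_t ping_buf[BLOCK_LEN], pong_buf[BLOCK_LEN];

    /* A-synchronised RX: one 2-byte element per SPI0 RXEVT, BLOCK_LEN events per block */
    static void setup_param(unsigned n, void *dst, unsigned link)
    {
        edma3_param_t *p = PARAM(n);
        p->OPT          = (14u << 12) | (1u << 20);   /* TCC = 14, TCINTEN             */
        p->SRC          = SPI0_SPIBUF;
        p->A_B_CNT      = (BLOCK_LEN << 16) | 2u;     /* ACNT = 2, BCNT = BLOCK_LEN    */
        p->DST          = (uint32_t)(uintptr_t)dst;
        p->SRC_DST_BIDX = (2u << 16);                 /* SRC fixed, DST steps 2 bytes  */
        p->LINK_BCNTRLD = 0x4000u + link * 0x20u;     /* LINK = low 16 bits of linked set address */
        p->SRC_DST_CIDX = 0;
        p->CCNT         = 1;
    }

    void spi0_rx_pingpong_init(void)
    {
        setup_param(64, ping_buf, 65);   /* reload set for ping, links to pong         */
        setup_param(65, pong_buf, 64);   /* reload set for pong, links back to ping    */
        setup_param(14, ping_buf, 65);   /* active set starts on ping, reloads from 65 */
    }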

Thanks

  • Pirow

    In general, even though the SPI clock in slave mode supports up to 25 MHz, we have never seen customers able to run a constant-stream SPI using EDMA at 25 MHz.

    The best we have seen is 20 MHz standalone, and for loaded systems somewhere around 10-12.5 MHz.

    http://e2e.ti.com/support/embedded/tirtos/f/355/t/93428.aspx

    Increasing the CPU speed (SPI module speed) and moving buffers to internal memory might be helpful, but I do not think you will be able to run the system at a 25 MHz SPI clock in slave mode.

    I don't have exact latency data on PaRAM reload, but I do not think it is more than 5-10 CPU cycles.

    Just to give you some coarse/approximate numbers on EDMA latency:

    1. SPI event to CC event latch
    2. CC event latch to CC submitting a transfer request packet
    3. Transfer request packet submitted from CC to TC
    4. TC issues the first read command
    5. Read response latency = dependent on the source latency and competing accesses/traffic
    6. TC issues the first write request after the read data is returned
    7. TC write request to data stored at destination = dependent on the destination latency and competing accesses/traffic

    For 1)  ~ 20-30 CPU cycles

    For 2+3+4) ~ 50 CPU cycles 

    For 5), 7) ~ dependent on source/destination: I do not have numbers for SPI-to-DDR etc.

    For 6) ~ 18 CPU cycles 

    So you are looking at 140-150 CPU cycles + 5) + 7)

    I do not think you are running into PaRAM reload latency as the bottleneck; I think this is purely SPI source/destination fetch latency.
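    As a very rough sanity check, this is the kind of back-of-envelope math I would do with the totals above (the clock values are examples only, so plug in your actual CPU clock, and note that 5) and 7) are not included):

        /* Back-of-envelope: fixed EDMA latency vs. SPI 16-bit word period        */
        /* (example CPU clocks only; 5) and 7) from the list above are excluded)  */
        #include <stdio.h>

        int main(void)
        {
            const double fixed_cycles   = 150.0;  /* upper end of 1)+2)+3)+4)+6) above        */
            const double word_period_ns = 750.0;  /* ~16 bits at "almost 25 MHz", first post  */
            const double cpu_clk_mhz[]  = { 300.0, 456.0 };

            for (int i = 0; i < 2; i++) {
                double latency_ns = fixed_cycles * 1000.0 / cpu_clk_mhz[i];
                printf("CPU @ %3.0f MHz: ~%3.0f ns fixed latency per event vs. %3.0f ns word period\n",
                       cpu_clk_mhz[i], latency_ns, word_period_ns);
            }
            return 0;
        }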

    Regards

    Mukul 

  • Mukul,

    Thank you for the clarification. We've selected the OMAP L138 based on the fact that the datasheet electrical and software specifications deemed 25 MHz slave SPI possible. 

    Can you obtain the numbers for the SPI read response latency and the write latency to L2, please? This would clarify whether it is even possible to obtain the advertised throughput.

    If we dedicate a PRU core to manage the SPI peripheral, do you think that this would help us achieve 25 MHz TX and RX on SPI?

    Thanks

  • Hi Mukul,

    One additional detail: we're running 16-bit SPI words. Does your pessimistic view of the SPI throughput assume 8-bit words?

    Thanks

    I might have messed up a bit on the breakdown, but think of the total round-trip latency for an event, from the CC boundary to the data landing in L2 (ACNT = 1/2/4 will not make a difference), as about ~110 cycles to L2, and probably in the range of ~180-200 cycles to the shared RAM and DDR boundary (no refresh, no bank conflicts). This is for a standalone, completely unloaded system with no access conflicts from other sources on the memories used for the SPI payload.

    In a loaded CC/TC or memory system, these numbers start becoming a bit less meaningful.

    Pirow Engelbrecht said:
    We've selected the OMAP L138 based on the fact that the datasheet electrical and software specifications deemed 25 MHz slave SPI possible. 

    The datasheet projects SPI clock rate timings, but unfortunately that is not always a good proxy for SPI throughput, which is system and chip latency dependent.

    Additionally, there may be overheads in the system from the OS and other aspects.

    Pirow Engelbrecht said:
    Can you obtain the numbers for the SPI read response latency and write latency to L2 please? This would clarify whether it is impossible to obtain the actual throughput advertised?

    Unfortunately, the above data is all I have from a chip internal latency perspective. The device is about 7 years old.

    Pirow Engelbrecht said:
    If we dedicate a PRU core to manage the SPI peripheral, do you think that this would help us achieve 25MHz TX and RX on SPI?

    I don't have any previous data points on this. I think the chip topology being traversed is roughly the same, but perhaps it might give you some more lift. Additionally, if EDMA in your fully loaded system is also being used to service other peripherals or memory-to-memory transfers, putting SPI on the PRU might help isolate it from CC/TC sharing. The PRU example base is limited, though, so it might require more work on your side to build a driver.

    I'm curious what clock rates you are able to sustain reliably with both DDR2 and L2 as buffer locations. You mentioned that the clock is not sped up yet; which clocks? You should try to run the CPU/DDR2 etc. at the maximum entitlement of the devices you have. However, like I mentioned, I have only heard of 12-20 MHz maximum SPI slave DMA rates, nothing close to 25 MHz.

    Regards

    Mukul 

    So just to clarify: we are running at an actual SPI slave 16-bit word rate of 1.34 MHz (a paltry 2.68 Mbyte/s), and the EDMA3 engine running at 228 MHz is unable to cope with this data rate?

    Hmm, this is different from what you previously communicated on the SPI clock speeds? Perhaps there is some math here that I am not catching. Regardless, if you are running at 1.34 MHz there should not be any issues, so perhaps we need to understand your driver implementation further, and why you think it is in the ping-pong PaRAM reload area that the EDMA is messing up.

    PaRAM reload should really be happening in the background and will be done before your next SPI event comes in.

    Can you make sure that you are not seeing any error bits set in the EDMA (EMR etc.)?
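    A minimal sketch of that check (the offsets below are the standard EDMA3 CC global error registers and the helper name is just illustrative; please verify the addresses against the OMAP-L138 TRM):

        #include <stdint.h>
        #include <stdio.h>

        #define EDMA3CC_BASE  0x01C00000u     /* EDMA3 CC0 on OMAP-L138 -- verify in TRM */
        #define REG32(off)    (*(volatile uint32_t *)(EDMA3CC_BASE + (off)))

        #define EMR    0x0300u   /* Event missed register, channels 0-31      */
        #define EMRH   0x0304u   /* Event missed register, channels 32-63     */
        #define EMCR   0x0308u   /* Event missed clear,    channels 0-31      */
        #define EMCRH  0x030Cu   /* Event missed clear,    channels 32-63     */
        #define CCERR  0x0318u   /* CC error register (queue thresholds etc.) */

        void edma3_check_errors(void)
        {
            uint32_t emr  = REG32(EMR);
            uint32_t emrh = REG32(EMRH);
            uint32_t cerr = REG32(CCERR);

            if (emr || emrh || cerr)
                printf("EDMA3 errors: EMR=0x%08x EMRH=0x%08x CCERR=0x%08x\n",
                       (unsigned)emr, (unsigned)emrh, (unsigned)cerr);

            /* Write-1-to-clear so the next missed event is visible */
            REG32(EMCR)  = emr;
            REG32(EMCRH) = emrh;
        }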

    If the CPU is running at 456 MHz (DMA @ 228 MHz), I think you are fine from an internal clock speed perspective. How fast were you running the mDDR?

    Do you have additional traffic or transfers on the Queue-TC servicing the SPI transfers?

    Do you have additional traffic on the mDDR when you did these experiments, or is this a standalone test?

    Regards

    Mukul 

    No, my statements have been consistent. The SPI slave clock is still ~25 MHz (bit clock, externally driven); see my very first post.

    I'm seeing a missed event bit set for that channel.

    I have an additional memory-to-memory transfer running simultaneously (mDDR to mDDR).

    I originally clocked the CPU at 300 MHz, with the mDDR running at 150 MHz. I'll up this to 456 MHz.

    The DSP is executing from mDDR which is the only other source of traffic.

    Thanks

  • Are you using EDMA for memory to memory transfers? If you are, can you make sure they are not on the same Queue-TC as the SPI slave transfers?

    If you are seeing an EMR bit set (and the frequency of it being set is lower with buffers in L2 etc.), and it is not because of the RX event hitting a null entry, it is likely that you are getting another SPI RXEVT before the previous one has been dequeued.

    That could happen if you have got a big memory to memory transfer sitting ahead of it in the Queue-TC servicing the SPI traffic.

    Regards

    Mukul 

  • Hi Mukul,

    I'll switch the queue for the QDMA transfers to Q1.
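    For reference, I'm planning something along these lines (a sketch only; the DMAQNUM/QDMAQNUM offsets and the 4-bit field layout are my reading of the EDMA3 CC register map, to be verified against the TRM):

        #include <stdint.h>

        #define EDMA3CC_BASE  0x01C00000u            /* EDMA3 CC0 -- verify in TRM */
        #define REG32(off)    (*(volatile uint32_t *)(EDMA3CC_BASE + (off)))

        #define DMAQNUM(n)    (0x0240u + 4u * (n))   /* DMA channel->queue map, 8 channels/register */
        #define QDMAQNUM      0x0260u                /* QDMA channel->queue map, 4 bits per channel */

        /* Route a QDMA channel (memory-to-memory) to event queue 1 / TC1 */
        static void qdma_to_queue1(unsigned qdma_ch)
        {
            uint32_t shift = 4u * qdma_ch;           /* 4-bit field per QDMA channel */
            uint32_t val   = REG32(QDMAQNUM);
            val &= ~(0x7u << shift);
            val |=  (0x1u << shift);                 /* queue 1 */
            REG32(QDMAQNUM) = val;
        }

        /* Keep the SPI0 RX DMA channel (event 14) on queue 0 / TC0 */
        static void spi_rx_to_queue0(void)
        {
            unsigned ch    = 14u;                    /* SPI0 RX event */
            uint32_t shift = 4u * (ch % 8u);
            uint32_t val   = REG32(DMAQNUM(ch / 8u));
            val &= ~(0x7u << shift);                 /* queue 0 */
            REG32(DMAQNUM(ch / 8u)) = val;
        }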

    Can you clarify what you meant by 25 MHz SPI in your first reply? I think there is a mismatch between what I mean by 25 MHz SPI (the bit-rate clock) and what you mean by 25 MHz SPI.

    Thanks