
Problem of EMAC TXUNDERRUN With Ethernet speed 100MBit/s on TMS570LS3137

Other Parts Discussed in Thread: TMS570LS3137, DP83848YB, HALCOGEN

Dear all

I summarize the following:

 

Scenario:

1) 1 TMS570LS3137 on custom board connected to a PC (shell client + Wireshark) via an ethernet hub

2) 1 PHY DP83848YB connected via RMII with MDIO (1MHz)

3) FreeRTOS 8.2.1

4) LWIP 1.4.1 (all the pbuf buffers are in internal SRAM 256K)

5) Only 1 TCP/IP active connection (socket based) that implements a shell command server 

6.1) EMAC Flow Control disabled

6.2) Only 1 EMAC active channel (ch0).

6.3) Whole CPPI RAM 8K dedicated to channel (ch0) for rx/tx (4K/4K).

 

Transmission Problem:

1) With Ethernet speed 100 Mbit/s, "EMAC TXUNDERRUN" always occurred for larger packets (800 bytes <= size <= 1460), while smaller packets were sent successfully.

  1. These larger packets were never transmitted out of the PHY with CPU "GCLK" = 180 MHz and EMAC "VCLK3" = 45 MHz.
  2. Besides, with the debugger we noticed that the larger packets could be sent successfully only if we halted the CPU (via breakpoint) immediately after the update of the head descriptor pointer "EMAC_TXHDP". That means the EMAC DMA was still working while the CPU was paused. We therefore suppose that the EMAC DMA transfer of larger packets from internal SRAM to the EMAC FIFO is disturbed/interrupted by the CPU (with the breakpoint removed), and the EMAC hits a timeout that increments the TXUNDERRUN statistic.
  3. We then worked around the problem by increasing "VCLK3" from 45 MHz to 90 MHz in order to speed up the EMAC DMA transfer from SRAM to the EMAC FIFO. After this change some large-packet transmissions from LWIP to the EMAC succeeded, but not all of them (see Problem 2).
  4. Note that increasing "VCLK3" was the only option for us, because the CPU was already running at its maximum speed ("GCLK" = 180 MHz).

 

2) With Ethernet speed 100 Mbit/s, "EMAC TXUNDERRUN" sometimes occurred (and one or more TCP retransmissions were then performed by LWIP) for larger packets (800 bytes <= size <= 1460), while smaller packets were sent successfully.

  1. Note that we set the "FIFOCONTROL" register to "3" in order to maximize the FIFO cell threshold (3 x 64 bytes = 192 bytes, which seems too small!) and thereby delay the start of transmission, hoping the DMA could complete the transfer it had started.
  2. Note that at 10 Mbit/s Ethernet we have no problem at all (everything works).
  3. The problem is worked around when we reduce the maximum TCP packet size via the MSS parameter of LWIP (e.g. from 1460 to 512).
  4. Note that this is not a satisfying solution!
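The wire-time arithmetic behind these numbers can be checked with a few lines of host-side C (a sketch; `drain_time_us` is our own name, while the 64-byte cell size and the 3-cell threshold come from the discussion above):

```c
#include <assert.h>

/* Microseconds the MAC needs to put `bytes` on a 100 Mbit/s wire:
 * 100 Mbit/s is exactly 100 bits per microsecond. */
static double drain_time_us(unsigned bytes)
{
    return (double)(bytes * 8u) / 100.0;
}

/* One 64-byte FIFO cell covers only 5.12 us of wire time, and even
 * the maximum threshold of 3 cells (192 bytes) gives the DMA just a
 * 15.36 us head start on a full 1460-byte TCP segment. */
```

This makes it visible why a larger threshold only postpones, but cannot prevent, an underrun if the DMA cannot keep up with the line rate.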

 

Questions:

1) How can we set the EMAC DMA vs. CPU priority for access to internal RAM?

1.1) Where is the "master priority register" referred to in RM 29.2.14?

         29.2.14 Transfer Node Priority

         The device contains a chip-level master priority register that is used to set the priority of the transfer node
         used in issuing memory transfer requests to system memory.

2) Has the behavior of the TMS570LS3137 at Ethernet 100 Mbit/s been validated?

3) Is there an example of the TMS570LS3137 working at 100 Mbit/s?

 

  Can anyone help me?

  • Paolo, we are looking at your problem and will be back to you shortly.
  • Hi, there.

    According to the TRM, each 64-byte memory read/write request from the EMAC must be serviced in no more than 5.12 μs for 100 Mbps operation. This should be easily met even with 45 MHz. In validation, we have a loopback test that directly sends and receives large packets (without any Ethernet stack). We did not see any speed issue as long as the descriptors were in the CPPI RAM and the data buffers were located in the on-chip RAM. I am wondering if there is something else causing the problem.

    What do you mean by "Whole CPPI RAM 8K dedicated to channel (ch0) for rx/tx (4K/4K)."? The CPPI RAM is for the descriptors. To narrow down the debugging, can you directly send data without a stack, using a method similar to the following code snippet?

    // Set up descriptors
    pDesc      = EMAC_DescBuff;
    pDescTx[0] = pDesc;

    // Set up the descriptor for the Tx buffer. It is the only Tx
    // descriptor, so the chain ends here (pNext = 0).
    pDesc->pNext     = 0;
    pDesc->pBuffer   = (unsigned char *)EMAC_TxBuff;
    pDesc->BufOffLen = sizeof( EMAC_TxBuff );
    pDesc->PktFlgLen = EMAC_DSC_FLAG_OWNER |
                       EMAC_DSC_FLAG_EOP |
                       EMAC_DSC_FLAG_SOP |
                       EMAC_PacketSize;
    pDesc++;

    // Set up the descriptor for the Rx buffer (also a single-entry chain)
    pDescRx[0] = pDesc;

    pDesc->pNext     = 0;
    pDesc->pBuffer   = (unsigned char *)EMAC_RxBuff;
    pDesc->BufOffLen = sizeof( EMAC_RxBuff );
    pDesc->PktFlgLen = EMAC_DSC_FLAG_OWNER;

    // Write the head descriptor pointers to start the Rx/Tx transfers
    EMAC_REGS->RX0HDP = (Uint32) pDescRx[0];
    EMAC_REGS->TX0HDP = (Uint32) pDescTx[0];

    You can also toggle a GIO pin to check when the underrun occurs.
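For the no-stack test, completion can be detected by polling the OWNER flag, which the EMAC clears when it releases a descriptor. A minimal sketch (the flag value follows the standard CPPI descriptor format in the TRM; `emacTxDone` is our own name and the struct mirrors the snippet above):

```c
#include <assert.h>

/* OWNER bit in PktFlgLen, per the CPPI buffer descriptor format. */
#define EMAC_DSC_FLAG_OWNER  0x20000000u

/* Descriptor layout matching the snippet above. */
typedef struct EMAC_Desc {
    struct EMAC_Desc *pNext;
    unsigned char    *pBuffer;
    unsigned int      BufOffLen;
    unsigned int      PktFlgLen;
} EMAC_Desc;

/* The MAC clears OWNER once it has consumed the descriptor, so the
 * transmission is complete when the bit reads back as 0. */
static int emacTxDone(const EMAC_Desc *pDesc)
{
    return (pDesc->PktFlgLen & EMAC_DSC_FLAG_OWNER) == 0u;
}
```

In the test loop you would spin on `emacTxDone()` after writing TX0HDP, which avoids depending on the TX interrupt path while debugging.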

    Thanks and regards,

    Zhaohong

  • Dear Zhaohong

    Thank you for your quick reply.
    Can you please answer my questions 1 and 1.1 in the first post?
    In parallel, I will try your test.
    Best regards
  • Hi, there.

    On the TMS570LS3137, all accesses to the internal RAM (regardless of which master) go through B0TCM and B1TCM. The CPU always has the highest priority; however, the system grants a pending request from another master (DMA, EMAC, etc.) one complete access every 16 CPU clock cycles. The internal RAM always runs at 0 wait states, so in your case the RAM access speed is 180 MHz. I do not think that access to RAM is the issue here, since you said the EMAC is the only other master. I believe that something else is wrong.
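Zhaohong's arbitration figures can be turned into a rough worst-case bandwidth estimate with host-side arithmetic (a sketch; `emac_ram_mbytes_per_s` is our own name, and the one-grant-per-16-cycles policy is as described above):

```c
#include <assert.h>

/* Worst-case EMAC throughput into internal RAM, in Mbyte/s, if the
 * arbiter grants the EMAC one 64-bit (8-byte) access every 16 CPU
 * clock cycles. */
static double emac_ram_mbytes_per_s(double cpu_mhz)
{
    double grants_per_s = cpu_mhz * 1e6 / 16.0; /* one grant / 16 cycles */
    return grants_per_s * 8.0 / 1e6;            /* 8 bytes per grant    */
}

/* At GCLK = 180 MHz this is 90 Mbyte/s, far above the 12.5 Mbyte/s
 * needed to saturate a 100 Mbit/s link. */
```

This supports the point that RAM arbitration alone should not starve the EMAC.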

    You may need to check the basics first. Can you enable ECLK to check the system clock on a scope? Can you check the memory map to confirm that the EMAC descriptors are in the CPPI RAM and the TX and RX data buffers are in the internal RAM? Can you toggle an I/O pin in the TXUNDERRUN ISR to see whether the error happens at the beginning or in the middle of a message? Can you also save all the EMAC registers in the ISR so that we have a more complete picture of the failure? Can you use a scope to check the Ethernet PHY clock?

    Thanks and regards,

    Zhaohong

  • Dear Zhaohong

    thank you for your reply.

    I'm working on the matter and I will be back ASAP.

    What do you mean when you assert "one complete access every 16 CPU clock cycles"?

    What is a complete EMAC DMA access? How many bytes?

    Regards

    Paolo

  • Paolo,

    One complete access is a complete read or write to the RAM. The maximum access size to the internal RAM (B0TCM/B1TCM) is 64 bits. Reads/writes from the EMAC are 32 bits, but the bridge merges bursts of 32-bit requests into 64-bit accesses for the internal RAM.
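Combining the two answers, a 64-byte EMAC burst costs eight 64-bit grants, i.e. 8 × 16 = 128 CPU cycles in the worst case. A quick host-side check (a sketch; `worst_case_xfer_us` is our own name):

```c
#include <assert.h>

/* Worst-case time (us) to move `bytes` between internal RAM and the
 * EMAC, assuming one merged 64-bit (8-byte) access is granted every
 * 16 CPU clock cycles. */
static double worst_case_xfer_us(unsigned bytes, double cpu_mhz)
{
    unsigned grants = (bytes + 7u) / 8u;      /* 8 bytes per grant */
    return (double)(grants * 16u) / cpu_mhz;  /* cycles / MHz = us */
}

/* A 64-byte burst needs 8 grants = 128 CPU cycles; at 180 MHz that
 * is about 0.71 us, comfortably inside the 5.12 us budget. */
```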

    Thanks and regards,

    Zhaohong

  • Dear Zhaohong

    I'm back again.

    New questions for you:

    q1) Can one or more interrupt requests break/delay the DMA transfer?

    q2) What happens to the EMAC DMA transfer while the ARM is in a privileged mode?

     

    Scenario

    We have only IRQ interrupts (no FIQ).

    We have the following VIM channels:

    ch2) vPortPreemptiveTick: the real time clock for FreeRTOS

    ch9) gioHighLevelInterrupt: interrupt by an external device

    ch21) vPortYeildWithinAPI: the software interrupt used from both tasks and ISRs for task context switches

    ch23) gioLowLevelInterrupt: same context of channel 9

    ch27) linLowLevelInterrupt: characters from sci/lin (keyboard)

    ch33) dmaFTCAInterrupt: a dma software request for mibspi3 communication (we also tried to remove it, without changes for emac-dma tx)

    ch37) mibspi3HighInterruptLevel: same context of channel 33

    ch76) EMACCore0MiscIsr

    ch77) EMACCore0TxIsr

    ch78) EMACCore0ThreshIsr

    ch79) EMACCore0RxIsr

    4 Gio Pins driven by the software

    - gio1) 1 gio pin in the transmit routine: high at start routine, low at the end routine

    - gio2) 1 gio pin in the ISR TX (C0_TX_PULSE): high at start ISR, low at the end ISR

    - gio3) 1 gio pin in the ISR TX (C0_TX_PULSE): toggled at ISR start when a change in the EMAC_TXUNDERRUN statistics register is seen.

    - gio4) 1 gio pin in all the other ISRs to debug all the IRQs occurrences:  high at start ISR, low at the end ISR

    Note: we don't allow ISR nesting (during ISR, the I flag in CPSR remains 1)

     

    Trigger Condition

    By monitoring the gio pins on an oscilloscope, we triggered on the occurrence of "emac tx underrun" (EMAC_TXUNDERRUN) in the following way:

    Every time a C0_TX_PULSE ISR finishes, we check whether the "underrun statistics register" (EMAC_TXUNDERRUN) has incremented and, in that case, we toggle a pin (gio3). (I don't know if there is a hardware/ISR way to signal the underrun.)
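The check at the end of each TX ISR can be factored into a small, testable helper (a sketch; `txUnderrunSeen` is our own name, and in the real ISR `current` would be read from the EMAC_TXUNDERRUN statistics register before toggling gio3):

```c
#include <assert.h>

/* Returns nonzero when the TXUNDERRUN statistics counter has
 * advanced since the previous call, and remembers the new value.
 * Unsigned subtraction keeps the comparison correct even across a
 * counter wraparound. */
static int txUnderrunSeen(unsigned int current, unsigned int *pPrev)
{
    unsigned int delta = current - *pPrev;
    *pPrev = current;
    return delta != 0u;
}
```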

    Scope

    Before the trigger condition, we can see the pulse (gio1) of the transmission by the application, which updates the tx descriptor to start a new EMAC DMA transfer inside a critical region (interrupts disabled).

    Note: We use a semaphore to avoid multiple in-flight packet transmissions, so that every transmission (to the EMAC) is followed by exactly one TX interrupt for its acknowledgment (1:1).

    With the trigger condition, we can also see the pulse (gio2) of the transmission ISR "C0_TX_PULSE".

    Between the application tx and the following TX-completion ISR (when the trigger condition occurred), we can only see some interrupt occurrences (gio4). They also appear in the GOOD cases (without TXUNDERRUN).

     

    Furthermore

    It seems that reducing the number of interrupt occurrences makes TXUNDERRUN less frequent: it becomes less probable, but it still happens.

    Can you please answer q1 and q2.

    Have you any idea/suggestion?

    Thanks

    Paolo

  • Addendum

    Note that if we add a delay (about 80 µs) after the application transmission, inside the critical section before restoring interrupts (i.e. keeping IRQs disabled), the problem disappears.

    But this is not a satisfying solution!

  • Paolo,

    Since Zhaohong is out of the office for a while, Sunil asked me to look at this.

    First thing: do you have enough pbufs allocated? You need enough to keep the EMAC from running out between servicings by the CPU.

    If you leave it at 10 - the default in HALCoGen - that is likely not enough.

    See e2e.ti.com/.../1605983

    -Anthony
  • Dear Anthony

    Dear Zhaohong 

    I'm back again.

    As Zhaohong already guessed, the problem of transmitting long data frames at 100 Mbit/s was due to the data buffers being located in EMIF/SDRAM (Zhaohong said "data buffers are located in the on-chip RAM").

    But the internal SRAM is a very small memory (256 KB)!

    q1) Can you confirm whether the same issue exists for reception at 100 Mbit/s?

    q2) Is there any way to send/receive at 100 Mbit/s from/to EMIF/SDRAM (DMA transfers of 64 bytes in less than 5.12 µs)?

    Thanks

    Paolo

    Did you confirm that the issue is resolved when the data buffer is located in the internal SRAM? Can you use a scope to measure the SDRAM clock frequency? Is your code executed from internal Flash? I do not see how CPU privileged mode or interrupts could affect the EMAC DMA transfer. I believe that there is something wrong in your setup/software.

    Thanks and regards,

    Zhaohong
  • Dear Zhaohong 

    I can confirm both that my code is executed from internal Flash and that:

    The problem of transmitting long data frames at 100 Mbit/s was fixed by forcing the data transfers (and pbufs) from/to SRAM.

    But the internal SRAM is a very small memory (256 KB)!

    And because of various TMS570 errata (LDM/STM instructions in EMIF SDRAM) we are forced to put into SRAM all the OS stacks in addition to the system stack, and now also all the Ethernet buffers and descriptors!

    q1) Is there any way to send/receive at 100 Mbit/s from/to EMIF/SDRAM (DMA transfers of 64 bytes in less than 5.12 µs)?

    q2) What about a workaround for the LDM/STM instructions in EMIF SDRAM (erratum signaled in 2013)?

    Thanks

    Paolo

    Q1) From the speed of the 16-bit-wide SDRAM interface, I do not think it is an issue. For a worst-case analysis, the DMA can move 16 bits (2 bytes) from SDRAM on average about every 3 SDRAM clock cycles (with all overheads). With a 50 MHz SDRAM clock, it would take about 2 µs to transfer 64 bytes. You need to check (1) the SDRAM clock frequency (use a scope to measure it) and the EMIF settings, and (2) the Cortex-R4 MPU setting, making the SDRAM region "device" if you do not execute from SDRAM. Do you have any other data saved in SDRAM?
    Q2) The issue arises when the CPU is interrupted during consecutive (back-to-back) execution of STM instructions. I do not think it is applicable here.
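The Q1 estimate works out as follows in host-side arithmetic (a sketch; `sdram_xfer_us` is our own name, and the 2-bytes-per-3-cycles figure is the worst case quoted above):

```c
#include <assert.h>

/* Worst-case time (us) for the EMAC DMA to move `bytes` over a
 * 16-bit SDRAM interface delivering 2 bytes every `cyc_per_access`
 * SDRAM clock cycles. */
static double sdram_xfer_us(unsigned bytes, unsigned cyc_per_access,
                            double sdram_mhz)
{
    unsigned accesses = (bytes + 1u) / 2u;  /* 2 bytes per access */
    return (double)(accesses * cyc_per_access) / sdram_mhz;
}

/* 64 bytes at 3 cycles/access and 50 MHz: 32 * 3 / 50 = 1.92 us,
 * i.e. "about 2 us", within the 5.12 us budget; at the 90 MHz
 * setting mentioned later in the thread it drops to roughly 1.07 us. */
```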

    Thanks and regards,

    Zhaohong
    1) We have an ISSI IS45S16320F SDRAM with a 16-bit data bus. The clock is set in HALCoGen to 90 MHz (I can also put a scope on it to find the real rate, and I will). We have a lot of software application tasks using SDRAM, and we hope that is possible without affecting the DMA transfers for Ethernet at 100 Mbit/s.

    2) We have the SDRAM region set to "Device" in the MPU. The behavior seems unchanged between "Device shared" and "Device not shared", while with "Normal" the system behaved strangely until it hung.

    3) We ask for a workaround for the LDM/STM instructions in EMIF SDRAM (erratum signaled in 2013) independently of this matter (DMA/EMAC/SDRAM performance), so that we can use the SDRAM and avoid having to put as much as possible into the internal SRAM (only 256 KB)!

    Hope you can help us.
    Paolo

    (1) "We have a lot of software application tasks using SDRAM": you need to do an analysis of the SDRAM usage to make sure there is enough bandwidth for everything. There is a huge overhead in accessing individual data in SDRAM: for a single-word SDRAM read (LDR) there is a 12-VCLK-cycle internal delay in addition to the EMIF bus time. You need to put frequently used data, such as stacks, in internal SRAM. I would also suggest you output the clocks on the ECLK pin and use a scope to see whether they are set up correctly. You may want to disable all other SDRAM accesses while debugging the EMAC issue.
    (2) "Shared" or "not shared" only applies to systems with more than one independent CPU. It does not mean anything on the TMS570.
    (3) The TI compiler does not use the STM instruction when building user code. The only place I have seen consecutive STM instructions is in the C-library function memset(), which is normally used during initialization.

    Thanks and regards,

    Zhaohong
  • Dear all 

    in the "TMS570LS31x/21x Microcontroller Silicon Errata (Silicon Revision C) (Rev. F) " ("spnz195f.pdf" )

    it states that a "Compiler patch is available".

    Where can this "Compiler patch" be downloaded?

  • Hi Paolo,

    you can force the compiler not to use STM instructions (and also to use library functions without them) by specifying the compiler switch --no_stm.
    This works at least with compiler version 5.1.12 (and also with earlier versions, but I don't know the exact version in which this workaround was introduced).
    I think this compiler switch is what "Compiler patch is available" refers to.
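For reference, the switch is simply added to the TI ARM compiler (armcl) command line or to the project's compiler options; this is only a sketch, and the source file name below is a placeholder:

```shell
# Build with STM generation disabled (workaround for the EMIF SDRAM
# LDM/STM erratum); my_source.c is a placeholder name.
armcl --silicon_version=7R4 --no_stm --compile_only my_source.c
```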

    Best regards
    Christian