
lwIP does not run on SDRAM

hi, 

I ported lwIP to the RM48 HDK, running on FreeRTOS. It works well when lwIP's memp and mem pools are located in internal RAM.

It fails to send large packets (larger than about 200 bytes) when the memp and mem pools are located in SDRAM.

Is there any difference between RAM and SDRAM as far as lwIP is concerned? Is it related to DMA, since the EMAC uses DMA when sending?
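For anyone hitting the same issue, a minimal sketch of pinning just the lwIP heap into internal RAM while the rest of the application stays in SDRAM. `.intram` is a hypothetical section name (use whatever your linker command file maps to on-chip RAM), and `MY_MEM_SIZE` stands in for lwIP's `MEM_SIZE`; `LWIP_RAM_HEAP_POINTER` is the real hook in lwIP's mem.c:

```c
#include <stdint.h>

/* Stand-in for lwIP's MEM_SIZE from lwipopts.h. */
#define MY_MEM_SIZE (16 * 1024)

/* Force the heap buffer into internal RAM via a section attribute.
 * ".intram" is hypothetical: use the output section your linker
 * command file maps to the on-chip RAM. */
__attribute__((section(".intram"), aligned(8)))
uint8_t lwip_ram_heap[MY_MEM_SIZE];

/* In lwipopts.h, point lwIP's mem.c at this buffer:
 *   #define LWIP_RAM_HEAP_POINTER lwip_ram_heap
 * so mem_malloc()/pbuf payload allocations come from internal RAM. */
```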

  • Hi Eric,
    I think the reason is that the external SDRAM is much slower than the internal RAM in terms of access latency. The EMAC DMA needs to perform burst reads/writes to the SDRAM via the EMIF interface, which is only 16 bits wide. The bottleneck will be at the EMIF.
  • Hi Charles,

    If the reason is what you said, how do you explain that a small packet (< 200 bytes) can be sent correctly?

  • Hi Charles,

    Is there any difference between a normal CPU access and a DMA access to SDRAM?
    I can see that the burst length is fixed at 8 halfwords (8 × 16 bits). Does that mean that if I just read one byte, the other bytes in the burst access are dropped?
  • Hi Eric,

    There is not much difference between a normal CPU access and an EMAC DMA access to the SDRAM. If you look at the block diagram in the RM48 datasheet, you will find that an EMAC access first goes through a switched central resource -> main crossbar -> switched central resource before reaching the EMIF module. For the CPU it is just main crossbar -> switched central resource before reaching the EMIF module.

    The EMAC DMA is a 32-bit bus master; I think it sends bursts of 4 × 32 bits of data. However, the EMIF interface is only 16 bits wide, so the EMIF needs to break each 4 × 32-bit burst into 8 × 16-bit accesses to the SDRAM, a step that is not needed when transferring to the internal RAM. The latency of going through the internal interconnect (i.e., the main crossbar and others), the data-size conversion in the EMIF, and the fact that the external SDRAM is slower than the internal RAM are most likely why it can't keep up with the demand. I think running lwIP with the payload in the internal RAM is the best option.

  • Hi Charles,

    Now I am clearer about the difference between a normal CPU access and an EMAC DMA access. Thanks for your detailed explanation and chart.

    Where can I find the "chip-level master priority register"? According to the TRM, the masters' priority for accessing the slaves connected to the crossbar can be configured, right?

    I can see a table describing the bus master/slave access privileges. Can I say that the CPU read has the highest priority?

    For the second picture, how should I understand the memory-latency restriction? Why does a transmit underrun occur when it is violated?

    Does "underrun" here mean that transmission stops?

  • Hi Eric,
    Would you try disabling the cache?
  • Hi Eric,

     The bus matrix implemented in the RM48 gives equal priority to all masters, so it is not the case that the CPU has higher priority than the EMAC.

     As I mentioned, I think the issue has something to do with the latency difference between the internal RAM and the external SDRAM. Here is an excerpt from the TRM about the latency impact.

    29.2.13 Receive and Transmit Latency
    The transmit and receive FIFOs each contain three 64-byte cells. The EMAC begins transmission of a packet on the wire after TXCELLTHRESH (configurable through the FIFO control register) cells, or a complete packet, are available in the FIFO.
    Transmit underrun cannot occur for packet sizes of TXCELLTHRESH times 64 bytes (or less). For larger packet sizes, transmit underrun occurs if the memory latency is greater than the time required to transmit a 64-byte cell on the wire; this is 5.12 μs in 100 Mbps mode and 51.2 μs in 10 Mbps mode. The memory latency time includes all buffer descriptor reads for the entire cell data.
    Receive overrun is prevented if the receive memory cell latency is less than the time required to transmit a 64-byte cell on the wire: 5.12 μs in 100 Mbps mode, or 51.2 μs in 10 Mbps mode. The latency time includes any required buffer descriptor reads for the cell data.
    Latency to system's internal and external RAM can be controlled through the use of the transfer node priority allocation register available at the device level. Latency to descriptor RAM is low because RAM is local to the EMAC, as it is part of the EMAC control module.

  • Hi Eric,
    One thing I forgot to ask: where are your buffer descriptors stored? Are they in the CPPI memory, which is local to the EMAC, or do you also store the buffer descriptors in the external SDRAM? I suggest that you store the buffer descriptors in the local CPPI memory only.
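A minimal sketch of carving the TX/RX descriptor rings out of the EMAC-local CPPI RAM instead of SDRAM. The base address and the split offset below are assumptions; take the real values from the RM48 TRM or HALCoGen's emac.h. The four-word descriptor layout matches the TI EMAC CPPI format:

```c
#include <stdint.h>

#define EMAC_CTRL_RAM_BASE 0xFC520000U  /* assumed CPPI RAM base - verify in TRM */

/* TI EMAC CPPI buffer descriptor: four 32-bit words. */
typedef struct emac_bd {
    volatile uint32_t next;          /* next descriptor pointer (0 = end of queue) */
    volatile uint32_t buffer;        /* data buffer pointer (may point into SDRAM) */
    volatile uint32_t bufoff_len;    /* buffer offset / buffer length */
    volatile uint32_t flags_pktlen;  /* SOP/EOP/OWNER flags, packet length */
} emac_bd_t;

/* Carve the CPPI RAM: first part for TX descriptors, rest for RX.
 * The 0x1000 split is arbitrary here - size to your ring lengths. */
emac_bd_t *const tx_bd = (emac_bd_t *)EMAC_CTRL_RAM_BASE;
emac_bd_t *const rx_bd = (emac_bd_t *)(EMAC_CTRL_RAM_BASE + 0x1000U);
```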
  • Hi Charles,

    See my comments in the excerpt you showed; I copy it here.

    "29.2.13 Receive and Transmit Latency
    The transmit and receive FIFOs each contain three 64-byte cells. The EMAC begins transmission of a packet on the wire after TXCELLTHRESH (configurable through the FIFO control register) cells, or a complete packet, are available in the FIFO.
    Transmit underrun cannot occur for packet sizes of TXCELLTHRESH times 64 bytes (or less). For larger packet sizes, transmit underrun occurs if the memory latency is greater than the time required to transmit a 64-byte cell on the wire; this is 5.12 μs in 100 Mbps mode [Eric: I think the memory latency for SDRAM will not be that much greater than for RAM, if we just consider what you mentioned] and 51.2 μs in 10 Mbps mode. The memory latency time includes all buffer descriptor reads for the entire cell data.
    Receive overrun is prevented if the receive memory cell latency is less than the time required to transmit a 64-byte cell on the wire: 5.12 μs in 100 Mbps mode, or 51.2 μs in 10 Mbps mode. The latency time includes any required buffer descriptor reads for the cell data.
    Latency to system's internal and external RAM can be controlled through the use of the transfer node priority allocation register [Eric: how should I understand this description if you said the bus matrix has equal priority for all masters? Can you show me where I can find this register?] available at the device level. Latency to descriptor RAM is low because RAM is local to the EMAC, as it is part of the EMAC control module."

  • Hi Charles,
    The TX and RX buffer descriptors are both stored in the CPPI memory. The actual buffers the descriptors point to are located in SDRAM.
  • Hi,
    The RM48L doesn't have a cache. Maybe I will try it on the RM57L later.
  • Hi Charles,

    After I add a for loop right after starting the DMA, a larger packet can also be sent correctly. (Note: _MY_DEBUG_ is disabled.)

    From this result, I still feel that the CPU is kept busy by the for loop and does not access the SDRAM, so the EMAC DMA can access the SDRAM without running into a timeout.

    How do you see this? If all the masters in the bus matrix have equal priority, I think the timeout should not happen; if the CPU has a higher priority than the EMAC, it would.

  • Hi Eric,

    Eric said:

    Latency to system's internal and external RAM can be controlled through the use of the transfer node priority allocation register [Eric: how should I understand this description if you said the bus matrix has equal priority for all masters? Can you show me where I can find this register?]

     That statement is actually incorrect; the TRM needs to be updated. You cannot program a different priority scheme in our current implementation.

  • Hi Eric,

      I think I have a better understanding of your setup. Both the CPU and the EMAC are competing to use the EMIF, as your packets are stored in the SDRAM. When you finish transmitting packets, the Tx interrupt is generated. I am not sure what you have in the Tx handler; the CPU might be trying to set up new packets to transmit. At around the same time, when the EOQ is reached, you try to start another DMA transfer, and the EMAC needs to read through the EMIF to retrieve the payload to transmit. When you put a big loop on the CPU, the EMAC has all the bandwidth to itself.

      I'm not an EMAC expert, but if the CPU is writing to the SDRAM to set up the packets, is it possible for you to wait until the CPU is done before launching an EMAC transfer? This suggestion is based on the assumption that you are running into a bus-bandwidth issue.

      I'd like to know why you can't store the packets in the internal RAM.

      

  • Hi Charles,

    The setup of our project is that we use only SDRAM (stack, heap, bss, data), and we use FreeRTOS as well.

    I also think that both the CPU and the EMAC DMA are competing for access to the EMIF. The reason I added the for loop is the following: after the EMAC DMA is started, the EMAC transmit DMA engine reads the packet from SDRAM and writes it into the transmit FIFO (reading the CPPI descriptors in the meantime). At the same time, the CPU is still running and also needs to access the SDRAM (because we only use SDRAM). I guess this competition leads to an EMAC transmit DMA timeout, which in turn stops the transmission, right? If I add a simple for loop right after starting the EMAC DMA, the EMAC transmit DMA engine can finish reading from SDRAM, writing into the FIFO, and sending the packet.

    You mean the TRM's description of the transfer node priority register is not correct, and I can't adjust the privileges of the masters (the CPU and the EMAC)?

    Maybe I can consider your advice to use the internal RAM for the EMAC packets instead. (We don't use the internal RAM because we need 4 MB in our application, and we hoped to use the SDRAM only.)

    Any more suggestions about this?

    Best regards,

        Eric

  • Hi Eric,

      You just mentioned one more point that I didn't know about: you are using the SDRAM for the stack, etc., as well. This certainly introduces even more contention between the CPU and the EMAC. I suggest that you keep some portion of your application, especially the stack, in internal RAM. This will help alleviate the EMIF bottleneck.

      Yes, the arbitration scheme in the SCR interconnect is hardwired; it is not programmable.
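One way to follow this advice under FreeRTOS without relocating the whole application: provide the kernel heap yourself and pin it to internal RAM, so dynamically created task stacks and kernel objects stop contending with the EMAC for the EMIF. `configAPPLICATION_ALLOCATED_HEAP` and `ucHeap` are the real FreeRTOS hooks; `.intram` and `MY_HEAP_SIZE` are placeholders for your linker section and `configTOTAL_HEAP_SIZE`:

```c
#include <stdint.h>

/* Stand-in for configTOTAL_HEAP_SIZE from FreeRTOSConfig.h. */
#define MY_HEAP_SIZE (32 * 1024)

/* With configAPPLICATION_ALLOCATED_HEAP set to 1 in FreeRTOSConfig.h,
 * heap_4.c uses this application-provided buffer instead of declaring
 * its own, so task stacks and queues created via the kernel allocator
 * land in internal RAM rather than SDRAM. */
__attribute__((section(".intram"), aligned(8)))
uint8_t ucHeap[MY_HEAP_SIZE];
```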

  • Hi Charles,

    Thank you for your patient answers.

    Best regards,
    Eric