How to improve C6678 EVM Ethernet speed

Other Parts Discussed in Thread: SYSBIOS

Based on the demo image-processing project (IPC version), my measured Ethernet throughput is ~55 Mb/s when transmitting data from the DSP to the PC and ~220 Mb/s when receiving data from the PC.

Does anyone have experience improving the Ethernet speed on the C6678 EVM board?

Thanks a lot

  • Someone must be using a 1.0 Gb connection with the C6678 EVM; what Ethernet speed are you getting?

    Since my receive speed is 4x my transmit speed, there must be some way to improve the TX speed. Can anyone give some ideas about where the problem might be: the configuration, the queue number, etc.?

    Thanks

  • Increasing the TX buffer size from 8192 to 64000 doubles the speed to ~100 Mb/s.

    Increasing the RX buffer size has no effect.

    Increasing PKT_NUM_FRAMEBUF in pbm_data.c also has no effect.

    Could the TI experts give any suggestions?

    Thanks

  • Hi,

    Any idea what could be causing this? I'm getting similar results.

    Thanks

  • Hi,

    There are multiple problems with the NIMU driver for the C6678:

    If you increase the TCP transmit buffer to 32 kB or more, the driver will start dropping packets, because the TX packet queue can hold only 16 packets.
    See also:
    http://e2e.ti.com/support/embedded/bios/f/355/p/253488/891759.aspx
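
    One way to avoid this (a sketch only, not necessarily the exact fix from the linked thread) is to make EmacSend() in nimu_eth.c wait briefly for a free TX descriptor instead of silently dropping the packet. The queue handle and QMSS calls below follow the PDK NIMU example and may differ in your version:

    #include <stdint.h>
    #include <ti/drv/qmss/qmss_drv.h>      /* Qmss_getQueueEntryCount, Qmss_queuePop */
    #include <ti/drv/cppi/cppi_desc.h>     /* Cppi_HostDesc                          */
    #include <ti/sysbios/knl/Task.h>       /* Task_sleep                             */

    extern Qmss_QueueHnd gTxFreeQHnd;      /* free TX descriptor queue (nimu_eth.c)  */

    /* Pop a free TX host descriptor, waiting (bounded) while the free queue is
     * empty so that a full queue does not turn into a dropped packet. */
    static Cppi_HostDesc* waitForFreeTxDesc(void)
    {
        uint32_t retries = 0;

        while ((Qmss_getQueueEntryCount(gTxFreeQHnd) == 0) && (retries++ < 100)) {
            Task_sleep(1);                 /* let the CPSW recycle descriptors; briefly blocks the caller */
        }

        /* The low 4 bits of the popped address encode a size hint; mask them off. */
        return (Cppi_HostDesc *)(((uint32_t)Qmss_queuePop(gTxFreeQHnd)) & ~0xFu);
    }

    If this still returns NULL, EmacSend() should report an error to the stack rather than dropping the packet.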

    The NIMU driver uses the CSL cache API, which doesn't implement a workaround for a silicon bug:
    http://e2e.ti.com/support/embedded/bios/f/355/t/253237.aspx

    There is also a potential problem with the prefetch buffer:
    http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/214649/762847.aspx

    After fixing these issues and making some other improvements, I'm getting about 920 Mbit/s for TCP transmission.

    Ralf

  • Hi Ralf, can you explain exactly how you changed the driver to use the SYS/BIOS API instead of the CSL API?

    Can I just replace the CSL cache functions with the analogous SYS/BIOS functions?


    Thanks

  • Hi,

    First, you have to include:

    #include <ti/sysbios/family/c66/Cache.h>

    Then, you can replace the function call.
    Example:

    Cache_inv((void *)pHostDesc, sizeof(Cppi_HostDesc), Cache_Type_L2, 1);

    Ralf

  • Thanks Ralf.

    I changed the cache API calls and added the prefetch buffer invalidate, rebuilt the corresponding NDK library, and recompiled my .out. Did you get a significant improvement by doing that?

    Here's a snippet of my code; did I change the cache API correctly?

    /* Invalidate cache based on where the memory is */
    if ((uint32_t)(pHostDesc) & EMAC_EXTMEM) {
        // CACHE_invL2((void *)pHostDesc, sizeof(Cppi_HostDesc), CACHE_WAIT);
        Cache_inv((void *)pHostDesc, sizeof(Cppi_HostDesc), Cache_Type_L2, CACHE_WAIT);
    }

    if ((uint32_t)(pHostDesc) & EMAC_MSMCSRAM) {
        // CACHE_invL1d((void *)pHostDesc, sizeof(Cppi_HostDesc), CACHE_WAIT);
        Cache_inv((void *)pHostDesc, sizeof(Cppi_HostDesc), Cache_Type_L1D, CACHE_WAIT);
    }

    if ((uint32_t)(pHostDesc->buffPtr) & EMAC_EXTMEM) {
        CSL_XMC_invalidatePrefetchBuffer();

        // CACHE_invL2((void *)pHostDesc->buffPtr, pHostDesc->buffLen, CACHE_WAIT);
        Cache_inv((void *)pHostDesc->buffPtr, pHostDesc->buffLen, Cache_Type_L2, CACHE_WAIT);
    }

    I only got 4 or 5 Mbit/s more by doing that. Can you give more tips on what you have done? It seems like you're the only person on the forum who has gotten a decent Ethernet speed.

    Thanks in advance.

  • Don't forget to change the other cache calls for 'accum_list_ptr' and replace CACHE_wbL2() too.
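
    For example (a sketch only; take the actual pointer and size arguments from the original CACHE_invL2()/CACHE_wbL2() calls in nimu_eth.c, 'listSize' is just a placeholder here):

    Cache_inv((void *)accum_list_ptr, listSize, Cache_Type_L2, 1);
    Cache_wb ((void *)accum_list_ptr, listSize, Cache_Type_L2, 1);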

    The next thing you can do is to decrease the interrupt delay in Setup_Rx():

    accCfg.timerLoadCount = 5;
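
    For context (a sketch, not the complete configuration): accCfg is the accumulator command structure that Setup_Rx() already fills in and passes to the QMSS PDSP, and a smaller timerLoadCount simply makes the RX interrupt fire sooner:

    Qmss_AccCmdCfg accCfg;
    /* ...all other fields stay exactly as Setup_Rx() sets them... */
    accCfg.timerLoadCount = 5;                           /* driver default is 40 */
    Qmss_programAccumulator(Qmss_PdspId_PDSP1, &accCfg); /* PDSP used by the driver's RX accumulator */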

    After this, I would try to increase the TCP transmit and receive buffer size (e.g. 64000). But first you have to make sure that the driver doesn't drop any outgoing packets if the transmit buffer is full:
    http://e2e.ti.com/support/embedded/bios/f/355/p/253488/891759.aspx#891759

    I also would change the ratio between RX and TX descriptors to get more room for outgoing packets. Simply put the following lines underneath the included header files:

    #if NIMU_NUM_TX_DESC + NIMU_NUM_RX_DESC != 126
    #error Sum of NIMU_NUM_TX_DESC and NIMU_NUM_RX_DESC must be same as defined in resource_mgr.h
    #endif
    #undef NIMU_NUM_TX_DESC
    #undef NIMU_NUM_RX_DESC
    #define NIMU_NUM_TX_DESC                48u /**< Maximum number of TX descriptors used by NIMU */
    #define NIMU_NUM_RX_DESC                78u /**< Maximum number of RX descriptors used by NIMU */

    Another modification I didn't mention yet is that I'm directly using PBM buffers as buffers for the RX descriptors. This removes an additional copy operation for each received packet in EmacRxPktISR(). But this change is more complicated.

    Ralf

  • I had already changed the cache calls for accum_list_ptr and replaced CACHE_wbL2().

    It's weird what's happening with the interrupt delay. If I change it from 40 (the default) to 5, the speed is almost the same. So I tried 2 ticks, and the speed increased by ~150 Mbit/s. Then I tried 1 tick and the speed increased by ~180 Mbit/s. Is that normal? Anyway, thanks for that tip, it's better now, but I'm hitting 480 Mbit/s, still far from what it should be.

    I tried the other suggestions too, but the speed didn't change very much.

    About the PBM buffers: did you replace all the buffer functions with the corresponding PBM packet functions? Or something like that?


    Thanks again!

  • Can you verify with Wireshark the window size the NDK is using? It seems that it is still using a smaller window, like 8 kB, which would require more interrupts.

    About the PBM Buffers:
    What the driver normally does is that it allocates memory for the RX descriptors in Setup_Rx(). Then, if it receives a packet, it allocates a PBM Buffer and copies the descriptor buffer into the PBM buffer, which is then processed by the stack.
    In my version, I'm directly allocating PBM buffers used by the RX descriptors. If a new packet arrives, I can directly give the PBM buffer to the stack. I only have to allocate a new PBM buffer and assign it to the RX descriptor.
    I think this modification will only lower the CPU load a little bit.
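
    In code, the ISR side of that change looks roughly like this (a sketch only, not my exact code; RX_BUF_SIZE is a placeholder, and using softwareInfo0 to remember the PBM handle is just one way to do it):

    #include <ti/ndk/inc/stkmain.h>            /* PBM_alloc, PBM_getDataBuffer, ... */

    #define RX_BUF_SIZE 1536u                  /* placeholder: use the driver's RX buffer size */

    /* In EmacRxPktISR(), for each filled RX descriptor: */
    PBM_Handle hPkt = (PBM_Handle)pHostDesc->softwareInfo0;   /* attached in Setup_Rx() */

    PBM_setValidLen(hPkt, pHostDesc->buffLen); /* length written by the CPSW */
    /* ...then hand hPkt to the stack via PBM_setIFRx()/PBMQ_enq()/STKEVENT_signal(),
     * exactly like the original ISR, just without the memcpy(). */

    /* Re-arm the descriptor with a fresh PBM buffer for the next packet. */
    PBM_Handle hNew = PBM_alloc(RX_BUF_SIZE);  /* check for NULL in real code */
    pHostDesc->buffPtr       = (uint32_t)PBM_getDataBuffer(hNew);
    pHostDesc->origBuffPtr   = pHostDesc->buffPtr;
    pHostDesc->origBufferLen = RX_BUF_SIZE;
    pHostDesc->softwareInfo0 = (uint32_t)hNew;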

    Ralf

  • I checked the window size, and it is indeed 8 kB (see the figure below).

    Is it possible to change this size to a more suitable value?

    Thanks!

  • If you are using the legacy API to configure the NDK, you can set the RX and TX buffers like this:

        int rc = 64000;
        // TCP Transmit buffer size
        CfgAddEntry( hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPTXBUF,
                     CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&rc, 0 );

        // TCP Receive buffer size (copy mode)
        CfgAddEntry( hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPRXBUF,
                     CFG_ADDMODE_UNIQUE, sizeof(uint), (UINT8 *)&rc, 0 );

    If you are using RTSC instead, it should work like this:

    var Tcp = xdc.useModule('ti.ndk.config.Tcp');
    Tcp.transmitBufSize = 64000;
    Tcp.receiveBufSize = 64000;
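
    The same sizes can also be set per socket with setsockopt(), if you only want the larger buffers on the data connection (a sketch; 's' is your TCP data socket):

    int bufSize = 64000;

    /* Per-socket override of the TCP send/receive buffer sizes. */
    setsockopt(s, SOL_SOCKET, SO_SNDBUF, &bufSize, sizeof(bufSize));
    setsockopt(s, SOL_SOCKET, SO_RCVBUF, &bufSize, sizeof(bufSize));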

    Ralf

  • That's it, now I'm getting more than 900 Mb/s.

    First I tried your suggestion of 64000 and got 620 Mb/s, then I increased the buffer size until I got more than 900 Mb/s. The buffer size is now 150000, and I think I have reached the maximum speed.

    Thanks for the help!!

  • Hi Ralf,

    I followed all your suggestions except the RX buffer changes. I got over 900 Mb/s too. The window size plays a big role here, on both the DSP and PC sides.

    Appreciate your replies.

  • Hello experts,

    Did anyone take a look at the load of the core running the IP stack? I found that when I receive data (approx. 170 Mbit/s) via Ethernet, the core0 load (measured by the Load module) reaches almost 40% (correcting the 70% in my previous statement), and even that seems too high to me. Core0 only receives data over Ethernet and sends it to core1 via MessageQ.

    I measured some execution times, and it looks like the core spends most of its time in code related to the IP stack (recv etc.); MessageQ also costs much more CPU time than I expected. Could you tell me whether the problems solved above have something to do with core load, or whether this is a matter of other IP-stack problems? If you have any experience with excessive load, I would appreciate any help.
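
    For reference, the execution times were measured roughly like this (a simplified sketch using xdc.runtime.Timestamp; 'sock', 'buf', 'bufLen', 'msg' and the MessageQ queue ID are placeholders for my actual objects):

    #include <stdint.h>
    #include <xdc/runtime/Timestamp.h>
    #include <xdc/runtime/Types.h>
    #include <ti/ipc/MessageQ.h>
    #include <ti/ndk/inc/netmain.h>

    Types_FreqHz freq;
    uint32_t t0, t1, rxCycles = 0, ipcCycles = 0;
    int len;

    Timestamp_getFreq(&freq);                        /* tick rate, to convert counts to time */

    t0  = Timestamp_get32();
    len = recv(sock, buf, bufLen, 0);                /* NDK socket receive */
    t1  = Timestamp_get32();
    rxCycles += (t1 - t0);

    t0 = Timestamp_get32();
    MessageQ_put(remoteQueueId, (MessageQ_Msg)msg);  /* hand the data to core1 */
    t1 = Timestamp_get32();
    ipcCycles += (t1 - t0);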

    Thanks Ondrej