
TMS320C6678: Sending Ethernet Packets increases CPU Usage significantly and is not fast enough

Part Number: TMS320C6678

Hi,

we have an embedded platform using the TMS320C6678 and several TI packages (SYSBIOS (bios_6_76_02_02 + XDC), NDK_3_61_01_01, IPC_3_50_04_07, PDK_c667x_2_0_10, DSPLIB_c66x_3_4_0_4, ...).

Our project is quite large and there is a lot going on in several other threads (e.g. signal analysis, PCIe transfers and, with those, a lot of memcpy calls).

All of the mentioned computations happen on core 0, while the other 7 cores are essentially idle.

We need to transfer some computed data to our host. We do this via TCP, using the NDK functions.

We use a daemon handler (NDK's DaemonNew(...)) and NDK_recv(...) to receive data and we send data with NDK_send(...).

For the NDK stack we use the following options (a configuration sketch follows the list):

  • NC_SystemOpen(NC_PRIORITY_HIGH, NC_OPMODE_INTERRUPT)

  • TCP Transmit buffer size: 64000

  • TCP Receive buffer size: 64000

  • TCP Receive limit: 64000

  • UDP Receive limit: 8192
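
For reference, these options can be set either in the XDC/RTSC configuration or at runtime through the NDK configuration API. A runtime sketch modeled on the NDK example code (illustrative only, not necessarily how our project does it; hCfg is the handle returned by CfgNew(), and the exact integer types vary slightly between NDK versions):

    /* Sketch based on the NDK examples: set the TCP/UDP socket buffer options.
     * CfgAddEntry copies the value, so the local variables can be reused. */
    uint32_t bufSize  = 64000;   /* TCP transmit/receive buffer and receive limit */
    uint32_t udpLimit = 8192;    /* UDP receive limit */

    CfgAddEntry(hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPTXBUF,
                CFG_ADDMODE_UNIQUE, sizeof(bufSize), (unsigned char *)&bufSize, 0);
    CfgAddEntry(hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPRXBUF,
                CFG_ADDMODE_UNIQUE, sizeof(bufSize), (unsigned char *)&bufSize, 0);
    CfgAddEntry(hCfg, CFGTAG_IP, CFGITEM_IP_SOCKTCPRXLIMIT,
                CFG_ADDMODE_UNIQUE, sizeof(bufSize), (unsigned char *)&bufSize, 0);
    CfgAddEntry(hCfg, CFGTAG_IP, CFGITEM_IP_SOCKUDPRXLIMIT,
                CFG_ADDMODE_UNIQUE, sizeof(udpLimit), (unsigned char *)&udpLimit, 0);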

Without the TCP communication, our core 0 utilization is around 20%. Enabling the Ethernet communication (sending around 50 Mbit/s) drives core 0 to 100% usage; when randomly pausing the program via a JTAG debugger, the core is most of the time in TcpPrSend(), which calls SBWrite(), which calls mmCopy(). This "kills" core 0 and we are not able to send enough data to our host, even though we are nowhere near the limits of the Ethernet port (theoretically capable of 1 Gbit/s throughput).

Do you have any idea why sending TCP packets via the NDK causes such high CPU usage?

Best

Paul

  • Hi,

    I am still struggling with this problem. Do any of you have an idea?

    Best

    Paul

  • Hi,

    As mentioned, I am still struggling with this issue. Can you please suggest a solution?

    Thanks

    Paul

  • Hi Paul,

    Sorry to hear that you are facing this issue. I am also wondering what could have caused it; nothing comes to mind immediately.

    Let me dig deeper into this and get back to you.

    Is it possible for you to run nothing but the NDK and see if the CPU load is still high? I want to understand whether there is some contention between the NDK and other processing in your application.

    -Thanks,

    Aravind

  • Hi Aravind,

    thanks for your reply.

    Aravind Batni said:
    Is it possible for you to run nothing but the NDK and see if the CPU load is still high? I want to understand whether there is some contention between the NDK and other processing in your application.

    Yes, we set up the NDK in an empty project and also tried that empty project on the eval board. We still see very high CPU usage, and it is roughly proportional to the amount of data we send (e.g. 10% CPU at ~100 Mbit/s up to 100% at ~1 Gbit/s, all at 1 GHz core frequency).
    Consequently, core 0 is not usable in such cases.

    I would really appreciate your help with that.

    Thanks

    Paul

  • Hi Paul,

    Are you seeing any packet loss, or just the CPU load increasing to 100% at 1 Gbit/s?

    My hunch is that it is running out of descriptors at high packet rates, and the CPU is spinning in a wait loop until descriptors become available for Rx to proceed.

    Can you increase the NIMU descriptors (to start with, double them) and rebuild NIMU? You can update the descriptor counts in the header file below and see if it improves the throughput for your case:

    https://git.ti.com/cgit/processor-sdk/pdk/tree/packages/ti/transport/ndk/nimu/src/v2/nimu_internal.h#n201

    Change them as below and rebuild (note the doubled values):

    #define NIMU_NUM_TX_DESC                (16u * 2u) /**< Maximum number of TX descriptors used by NIMU */
    #define NIMU_NUM_RX_DESC                (110u * 2u) /**< Maximum number of RX descriptors used by NIMU */

    You can increase these to the maximum possible values for your system (not just double them) if you see improvements in throughput.

    The reason I suggested this: if packets arrive at NIMU at a higher rate than there are descriptors to receive them, the system waits until a descriptor becomes available, so the CPU spends most of its time waiting for a descriptor. By increasing the descriptor count, we eliminate or at least minimize that situation.

    Also, you can update the Rx interrupt threshold (currently set to 4):

    /* High Priority QM Rx Interrupt Threshold */
    #define RX_INT_THRESHOLD                8u   /* default is 4u */

    Update it to 8, or some other value that helps you reduce the CPU loading. (Note that you should have sufficient descriptors to handle more packets, since the Rx ISR would then fire after every 8 received packets instead of the default 4.)

    Let me know if that helps your case.

    By the way, please note that I will be on vacation starting tomorrow, so my responses may be delayed.

    Happy holidays and merry Christmas. 

    Thanks.

  • Hi Paul,

    Any updates on this? Were you able to make changes and address the issue?

    -Thanks,
    Aravind

  • Hi Aravind,

    I am also just coming back from the Christmas holidays and will start working on this topic this week and next. I'll let you know if I have any success.

    Best

    Paul

  • Hi Aravind,

    I tested your suggestions: apparently it is not possible to increase

    NIMU_NUM_TX_DESC 
    NIMU_NUM_RX_DESC 

    since both of them are already at their maximum for the C6678 platform. In other words, I could not allocate more descriptors; as far as I know there are none left.

    Do you have any other suggestions for what I could try? Maybe some platform-specific optimizations?
    As I understand it, the NDK is written for various platforms, and I guess this leads to a lack of platform-specific optimization.

    Hope you can help

    Thanks
    Paul

  • Hi Paul,

    Yes, the NDK higher layers are very generic and are not aware of the platform; the NIMU layer abstracts the platform.

    By the way, doubling the descriptors was just a proposed starting point. Did you see any change in performance?

    I see there is a memory copy involved in the receive path, from NIMU to the NDK higher layer, after a packet is received. That can possibly be optimized: if you have larger packet sizes in the system, you could use EDMA instead of a CPU copy to free up the CPU.

    Thanks, 

  • Hi Aravind,

    As mentioned in my previous reply, there are not enough descriptors available, so I can't double them (consequently, I also do not get any speedup).

    I am currently working on the mmCopy and I will let you know if I have any performance success.

    Meanwhile, do you have any other suggestions for what I could improve within the NDK on the C667x platform?

    Best

    Paul

  • Hi Paul,

    Thanks. I do not have any other suggestions from my side. Let me know your observations.

    Thanks,

    Aravind

  • Hi Aravind,

    Indeed, I had success with the mmCopy. I could decrease CPU usage by around 10-20%. What I did:

    - replaced the code in mmCopy with something like this:

    extern void mmCopyModified( void* pDst, void* pSrc, uint32_t len );
    /* Memory Copy */
    void mmCopy( void* pDst, void* pSrc, uint32_t len )
    {
    	mmCopyModified(pDst, pSrc, len);
    }

    - I recompiled the NDK following this guideline: https://e2e.ti.com/support/processors/f/791/t/822409

    => In doing so I came across the following issue (maybe you can resolve it better than I did with the proposed workaround): https://e2e.ti.com/support/processors/f/791/t/974725

    - in our project I implemented mmCopyModified as follows, where mmCopyInL2Sram is a reimplementation of mmCopy placed into L2 SRAM (a sketch of it follows the code block) and qdmaMemcpy is our QDMA-based copy:

    void mmCopyModified( void* pDst, void* pSrc, uint32_t len ) {
    	if (len <= 512) {
    		mmCopyInL2Sram(pDst, pSrc, len);
    	}
    	else {
    		qdmaMemcpy(pDst, pSrc, len);
    	}
    }
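
    For completeness, a sketch of the small-copy helper (illustrative only, not the exact code from our project). The TI compiler's CODE_SECTION pragma places the routine into a dedicated section; the section name ".text:l2sram" is our own choice and must be mapped to L2 SRAM in the linker command file:

    #include <stdint.h>

    /* Sketch only: plain CPU copy whose code is placed in L2 SRAM via a
     * dedicated section (mapped in the linker command file). */
    #pragma CODE_SECTION(mmCopyInL2Sram, ".text:l2sram")
    void mmCopyInL2Sram( void* pDst, void* pSrc, uint32_t len )
    {
        uint8_t       *dst = (uint8_t *)pDst;
        const uint8_t *src = (const uint8_t *)pSrc;

        /* simple byte loop; the compiler unrolls/pipelines it */
        while (len--) {
            *dst++ = *src++;
        }
    }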



    Another point which could maybe be improved within the NDK is the TCP checksum calculation. Do you have any idea how this code can be optimized on the C6678 platform, or do you have an optimized version of it?

    Maybe you could also ask some other experts at TI whether they can suggest any improvements? It is actually very urgent for us to reduce the CPU usage of the Ethernet path further, since otherwise we can't really use the C6000 platform for our needs.

    Thanks for your help :)

    Best


    Paul

  • Hi Paul,

    Glad that you could reduce the CPU usage with the mmCopy optimizations.

    The current checksum calculation on the TCP header and payload is done as below:

    /* Checksum the header and payload */
    pw = (uint16_t *)pTcpHdr;

    TSum = 0;
    for( ; tmp1 > 1; tmp1 -= 2 )
        TSum += (uint32_t)*pw++;

    As you can see, the above is a generic implementation that works on any SoC.

    The C6678 is based on the C66x core.

    A few optimization suggestions I can provide for the C6678 (you can implement these using intrinsics; there is no need to write assembly code):

    1. Load 8 bytes at a time (a double-word load, using the _amemd8 intrinsic) instead of 2 bytes.

    2. Use the C66x pack instructions to extend the 16-bit values to 32 bits.

    3. Implement the accumulation using _dadd() (adds two 32-bit numbers at a time).

    A few materials for you:

    training.ti.com/.../c66x-corepac-instruction-set-reference-guide.pdf

    www.ti.com/.../sprui04b.pdf

    Sorry, I do not have a ready or optimal implementation of this code. I am sure you will gain further CPU cycles with C66x-specific optimizations for the C6678 DSP.

    If there is nothing else, please mark the thread as closed.

    Thanks for the feedback and working through the CPU cycle optimizations.

  • Hi Aravind,

    I did not try to work with intrinsics, since I am not too good at this and the code loses readability.
    Instead, I tried to let the compiler do the work and stumbled across an issue: the TCP header is not aligned to 8 bytes, so a uint64_t pointer won't work.

    I came up with the following function:

    void TcpChecksum(TCPHDR *pTcpHdr ) {
    
    	int     tmp1;
    	uint16_t  *pw;
    	uint32_t  TSum=0;
    
    	/* Get header size in bytes */
    	int len = (int)HNC16(tpseudo.Length);
    	tmp1 = len;
    
    	/* Checksum field is NULL in checksum calculations */
    	pTcpHdr->TCPChecksum = 0;
    
    	/* Checksum the header */
    
    	pw = (uint16_t *)pTcpHdr;
    	uint64_t pw64val;
    
    	for( ; tmp1 > 7; tmp1 -= 8 ) {
    		memcpy(&pw64val, pw, 8);
    		// TSum  += (uint32_t)*pw++;
    		TSum += (uint32_t) ( ( pw64val & 0x000000000000FFFF ));
    
    		// TSum  += (uint32_t)*pw++;
    		TSum += (uint32_t) ( ( pw64val & 0x00000000FFFF0000 )>> 16 );
    
    		// TSum += (uint32_t)*pw++;
    		TSum += (uint32_t) ( ( pw64val & 0x0000FFFF00000000 )>> 32 );
    
    		// TSum += (uint32_t)*pw++;
    		TSum += (uint32_t) ( ( pw64val & 0xFFFF000000000000 )>> 48 );
    		pw += 4;
    
    	}
    
    	for( ; tmp1 > 1; tmp1 -= 2 ){
    		TSum += (uint32_t)*pw++;
    	}
    	if( tmp1 ){
    		TSum += (uint32_t)(*pw & 0x00FF);
    	}
    
    
    
    	/* Checksum the pseudo header */
    
    	pw = (uint16_t *)&tpseudo;
    	for( tmp1=0; tmp1 < 6; tmp1++ ) {
    		TSum += (uint32_t)*pw++;
    	}
    
    	TSum = (TSum&0xFFFF) + (TSum>>16);
    	TSum = (TSum&0xFFFF) + (TSum>>16);
    	TSum = ~TSum;
    
    	/* Note checksum is Net/Host byte order independent */
    	pTcpHdr->TCPChecksum = (uint16_t)TSum;
    }


    Looking into the assembly, I can see that indeed an LDNDW is used to get the data. Unfortunately, this loop is not pipelined very well (ii = 9, 2 iterations in parallel) and the resulting function is slower than the original one. Do you know what I am missing here to improve performance?

    I also tried something different: optimizing the function to use both datapaths of the C66x core. I came up with the following modified function:

    void TcpChecksum(TCPHDR *pTcpHdr ) {
    	int     tmp1;
    	uint16_t  *pw;
    	uint32_t  TSum = 0;
    	uint32_t  TSum1 = 0;
    
    	/* Get header size in bytes */
    	tmp1 = (int)HNC16(tpseudo.Length);
    
    	/* Checksum field is NULL in checksum calculations */
    	pTcpHdr->TCPChecksum = 0;
    
    	/* Checksum the header */
    
    	pw = (uint16_t *)pTcpHdr;
    	
    
    	for( ; tmp1 > 3; tmp1 -= 4 ) {
    		TSum  += (uint32_t)*pw++;
    		TSum1 += (uint32_t)*pw++;
    	}
    
    	for( ; tmp1 > 1; tmp1 -= 2 ){
    		TSum += (uint32_t)*pw++;
    	}
    	if( tmp1 ){
    		TSum += (uint32_t)(*pw & 0x00FF);
    	}
    
    	/* Checksum the pseudo header */
    
    	pw = (uint16_t *)&tpseudo;
    	for( tmp1=0; tmp1 < 6; tmp1 += 2 ){
    		TSum  += (uint32_t)*pw++;
    		TSum1 += (uint32_t)*pw++;
    	}
    
    	TSum += TSum1;
    
    	TSum = (TSum&0xFFFF) + (TSum>>16);
    	TSum = (TSum&0xFFFF) + (TSum>>16);
    	TSum = ~TSum;
    
    	/* Note checksum is Net/Host byte order independent */
    	pTcpHdr->TCPChecksum = (uint16_t)TSum;
    }

    This function gets pipelined quite well (ii = 1, 6 iterations in parallel) and yields approximately 3 times better performance (for lengths > 100) than the original implementation.

    Note: both functions are written for Little Endian.

    Do you maybe have suggestions for what I can change in the first or second function to further improve performance?

    Best

    Paul

  • Hi Aravind,

    What I wanted to add: I still need to squeeze more performance out of the NDK, since it still demands too much CPU time for our use case.

    If you don't have any more ideas, maybe you can get in contact with someone else at TI.

    Some things I stumbled across while investigating this problem, and which helped other people on other TI platforms:

    • use PBM buffers for incoming packets
    • rework the packet drivers like e.g. NIMU
    • in PBMQ_enq() / PBMQ_deq(), avoid the generic critical sections: OEMSysCritOn() / OEMSysCritOff() => _disable_interrupts() / _restore_interrupts()

    Maybe you can comment on each of them (especially how to do this on the C6678) and bring up your own optimization points.

    Thanks a lot in advance

    Best

    Paul

  • Hi Paul,

    Thanks for providing the inputs.

    On your response:

    the TCP header is not aligned to 8 bytes, so a uint64_t pointer won't work

    - You can use the _mem8 intrinsic, which allows you to load and store unaligned 8-byte data.

    Looking into the assembly, I can see that indeed an LDNDW is used to get the data. Unfortunately, this loop is not pipelined very well (ii = 9, 2 iterations in parallel) and the resulting function is slower than the original one. Do you know what I am missing here to improve performance?

    - Yes, without the intrinsics the compiler may not be able to produce a very optimal loop. You will need to use the intrinsics to get the best loop performance and cycle gains. I understand that the code becomes very specific to C66x instead of generic, but that is the path you can take to further cycle-optimize the code (a rough sketch follows below).
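
    For illustration only (untested, and not from the NDK sources), a sketch of that direction: ChecksumAccum64 is a hypothetical helper that sums the 16-bit words of a buffer using unaligned 8-byte loads; the odd trailing byte, the pseudo header and the final fold/complement would stay in your existing TcpChecksum.

    #include <c6x.h>       /* TI C6000 intrinsics: _mem8, _loll, _hill */
    #include <stdint.h>

    /* Sketch only: accumulate 16-bit words into a 32-bit sum, 8 bytes per
     * iteration, using LDNDW via the _mem8 intrinsic. Little endian assumed. */
    static uint32_t ChecksumAccum64(uint16_t *pw, int lenBytes, uint32_t TSum)
    {
        while (lenBytes > 7) {
            unsigned long long d = _mem8(pw);   /* unaligned double-word load */
            uint32_t lo = _loll(d);             /* 16-bit words 0 and 1 */
            uint32_t hi = _hill(d);             /* 16-bit words 2 and 3 */

            TSum += (lo & 0xFFFFu) + (lo >> 16);
            TSum += (hi & 0xFFFFu) + (hi >> 16);

            pw       += 4;
            lenBytes -= 8;
        }

        /* remaining full 16-bit words */
        for ( ; lenBytes > 1; lenBytes -= 2 ) {
            TSum += (uint32_t)*pw++;
        }
        return TSum;
    }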

    - The options you have listed are all good but are big chunks of work (they involve major NDK changes, like merging the NIMU adaptation and NDK layers). However, you would need to make those NDK changes on your own at this point.

    You can make the NDK C66x-specific by making the replacement you listed below; it can save CPU cycles (a minimal sketch follows):

    OEMSysCritOn() / OEMSysCritOff() => _disable_interrupts() / _restore_interrupts()
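
    A minimal sketch of that replacement, assuming the usual key-returning/key-taking prototypes of the NDK OS adaptation layer (check your osal/oem sources for the exact integer types before dropping this in):

    #include <c6x.h>   /* _disable_interrupts / _restore_interrupts */

    /* Sketch only: C66x-specific critical section replacing the generic OEM hooks */
    unsigned int OEMSysCritOn(void)
    {
        return _disable_interrupts();   /* returns the previous interrupt state */
    }

    void OEMSysCritOff(unsigned int key)
    {
        _restore_interrupts(key);       /* restores the saved interrupt state */
    }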

    I assume you have replaced all mmCopy instances with your modified, SoC-specific mmCopy implementation, which saves CPU cycles.

    Thanks,

    Aravind

  • Hi Aravind,

    Do you think that using intrinsics will provide major performance improvements, or is the implementation already close to optimal?

    I also tried this:

    OEMSysCritOn() / OEMSysCritOff() => _disable_interrupts() / _restore_interrupts()

    but unfortunately this didn't improve the performance.

    Yes, I did already replace all calls to mmCopy.

    Looking into SPRS691, I found that the Network Coprocessor includes some sort of Packet Accelerator. Does the NDK make use of this coprocessor, and if not, how can I make use of it? Can I offload all of the Ethernet communication to it?

    Best

    Paul

  • Paul,

    1. Yes, using intrinsics would definitely help. In the past I have seen very good improvements in CPU cycle performance.

    2. Yes, you may be able to use the Packet Accelerator from the NDK to significant benefit; it can run hardware lookups for the L2, L3 and L4 entries. Note that the NDK has not been optimized to use the Packet Accelerator, in order to keep it generic across all SoCs.

    There are Packet Accelerator examples that you can study to understand how to submit the L2, L3 and L4 entries.

    You may need to do the study, feasibility analysis and implementation of the Packet Accelerator with the NDK on your own.

    Please note that there is no plan from TI to update the NDK to use the Packet Accelerator.

    Thanks

  • Hi Aravind,

    1. As mentioned earlier, I am not good with intrinsics. Maybe you can contact someone at TI who can provide an optimization that uses intrinsics? (I am not able to do this on my own...)

    2. Unfortunately, your response is not very helpful:

    There are Packet Accelerator examples that you can study to understand how to submit the L2, L3 and L4 entries.

    Where are these examples?

    You may need to do the study, feasibility analysis and implementation of the Packet Accelerator with the NDK on your own.

    Can you please provide entry points to this?

    Please note that there is no plan from TI to update the NDK to use the Packet Accelerator.

    I understand that you don't want to update the NDK for that. The question that comes to my mind is: why is the Packet Accelerator on the C6678 if I can't use it out of the box? Please comment on that and provide a good reason why I can't use it. If there is no good reason, I would really appreciate more effort from your side (or from one of your colleagues) to make this work, since in the end the Packet Accelerator is a selling point for the C6678, and consequently we as customers of TI paid not only for the C6678 itself but also for all of its features.

    Don't get me wrong here, but I am really annoyed that I have to engineer everything on my own, especially for features which, according to the datasheet, are part of the C6678.

    Thanks for your help

    Paul

  • Hi Paul,

    Thanks for the feedback. I understand your point, but please note that it is not possible to have a software release for every possible use case.

    The NDK <-> PA integration is not a planned activity at TI. It would be a tailored implementation for a given NDK <-> PA use case, hence a broad-market release of an NDK <-> PA design is a challenge.

    The NDK has its own collateral, including a developer's guide; the source code is available for customers to update and tweak as per their needs, and there is a basic example available showcasing NDK functionality on the C6678.

    The Packet Accelerator comes with tested firmware and user APIs in the Processor SDK releases, to help the broad market use the hardware accelerator for their needs.

    You would need to do the NDK <-> Packet Accelerator design yourself, by going through the NDK developer's guide, the Packet Accelerator LLD and the PA examples, to meet your needs.

    I have notified the respective organization for this thread.

    Thanks,

  • Hi friend, has some exception condition occurred that ends up in a while(1) branch?

    Even though a project calls mmCopy many times, CPU usage rarely reaches 100%.

    So my suggestion to you: check your code carefully, especially the driver layer. Good luck.

  • Hi Aravind,

    Thanks for your reply.
    I understand that broad-market releases are difficult for NDK <-> PA use cases, but since the PA is still a feature of the C6678, I would very much welcome detailed step-by-step documentation and detailed examples on how to use the PA, especially in combination with the NDK on the C6678. As mentioned, the PA is a distinct feature in the C6678 datasheet. Thus, I think I am not asking for too much here, since in the end we as customers have paid for the C6678 with ALL of its features, and not for a TI cherry-picked feature selection which makes development for TI as cheap as possible but makes development on our side super expensive.

    Hence, I would appreciate it if the team you notified would give a detailed explanation of what I have to do to make use of the PA, and whether and how this can improve the performance of the C6678's Ethernet communication.

    Thanks

    Paul

  • Hi Aravind,
    do you have any news from the respective organization? This is still an unresolved issue for me.
    Thanks for keeping me up to date.

    Best
    Paul

  • Hi,

    I am still on this project and still need help.
    Anyone here?

  • Hi Paul

    Unfortunately I do not see a way to help you further on this. NDK support is now extremely limited and there are no plans to offer any additional enhancements. 

    There seem to be some posts from 8 years back on areas to optimize, but nothing that I can directly point to that addresses your use-case scenario; please see if any of them help.

    https://e2e.ti.com/support/processors/f/processors-forum/271367/how-to-improve-c6678-evm-ethernet-speed