AM2432: Investigating EtherCAT receive interrupt stop (approximately 350ms) when using EnetDma_submitTxPktQ

Part Number: AM2432
Other Parts Discussed in Thread: TMDS243EVM

Hello,

I am working with [enet_layer2_icssg] which I modified for EtherCAT communication experiments.

In this project, I created a transmit task that calls EnetDma_submitTxPktQ three times every 500us to send three EtherCAT frames per cycle,
and a receive task that, upon receiving the returned frame, uses uDMA to transfer the received data to another memory location.

During this process, the receive task occasionally pauses for approximately 350ms. The pause shows up after anywhere from a few minutes to an hour of operation.
When this occurs, it always lasts for approximately 348-352ms, giving the impression that something is controlling it.
(The transmit task continues to run during this pause.)

Looking more closely, it appears that no receive interrupts occur during this approximately 350ms period, which leads me to suspect that the frames themselves are being dropped due to a CRC error or something similar.
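One way to pin down when the receive side stalls is to timestamp every receive event and track the gap between consecutive events. The following is only a minimal host-side sketch: `RxGapMonitor`, `rx_gap_init` and `rx_gap_update` are hypothetical names that are not part of the SDK, and the timestamp is assumed to come from any free-running microsecond source available on the target.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not an SDK type): tracks gaps between RX events. */
typedef struct {
    uint64_t last_us;     /* timestamp of the previous RX event */
    uint64_t max_gap_us;  /* largest gap seen so far */
    uint32_t long_gaps;   /* number of gaps above the threshold */
    bool     primed;      /* false until the first event is recorded */
} RxGapMonitor;

static void rx_gap_init(RxGapMonitor *m)
{
    assert(m != NULL);
    m->last_us = 0U;
    m->max_gap_us = 0U;
    m->long_gaps = 0U;
    m->primed = false;
}

/* Record one RX event; returns the gap since the previous event in
 * microseconds (0 for the very first event). */
static uint64_t rx_gap_update(RxGapMonitor *m, uint64_t now_us,
                              uint64_t threshold_us)
{
    uint64_t gap = 0U;

    assert(m != NULL);
    if (m->primed) {
        gap = now_us - m->last_us;
        if (gap > m->max_gap_us) {
            m->max_gap_us = gap;
        }
        if (gap > threshold_us) {
            m->long_gaps++;
        }
    }
    m->last_us = now_us;
    m->primed = true;
    return gap;
}
```

Calling `rx_gap_update` from the receive path with a threshold of a few cycle times (for example 1000us for a 500us cycle) would separate ordinary jitter from a stall of the size described above.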

I'd like to identify the cause of this behavior, so I'd appreciate any information that could provide a clue.

For example, is there a restriction such as EnetDma_submitTxPktQ not being callable again until the previous frame has been completely transmitted?

Supplementary Note
When sending three frames, the first one is sent, followed by the second and third ones a short time later.
EnetDma_submitTxPktQ is called at the following timings: first frame → 100us has elapsed → second frame → third frame.

Also, if the second and third frames are sent simultaneously using EnetQueue_enq, this issue has not yet occurred. (I've run it for about 16 hours.)


SDK: mcu_plus_sdk_am243x_11_00_00_15
Boards used: AM243xEVM / custom board (AM2432)
The board has one ICSSG port connected to the slave; the other port is disconnected

Configuration: AM2432(ethercat master) - profishark1G(LAN analyzer) - AM2432(ethercat slave1) - AM2432(ethercat slave2)

  • Hi, 

    I’ve assigned this thread to the corresponding expert for further investigation. Ping here if you don’t get a response by Tuesday.

    Also, from the above description, you’re trying to implement EtherCAT MainDevice demo with a cycle time of 500us using enet_layer2_icssg implementation. Is my understanding correct?
    If possible, please also share the Wireshark logs, which can help with a detailed investigation.

    Regards,
    Aaron

  • Hello,

    Also, if the second and third frames are sent simultaneously using EnetQueue_enq, this issue has not yet occurred. (I've run it for about 16 hours.)

    Can you please clarify if the issue is seen when the 3 packets are sent without delay, and is not seen when there is a delay between first packet and second packet? Further, please provide more details on the task priorities, and the application design? We will try to replicate the behavior on our side to understand the issue better.

    Regards,
    Teja.

  • I've learned something new, so I'll update the information.

    It appears that the 350ms pause is not caused by the way the three frames are sent.
    The receiving task calls ClockP_usleep(1); after receiving the frame and completing the DMA transfer,
    but it sometimes takes a long time to exit this function, perhaps because the tick isn't updated.
    Furthermore, after pausing with ClockP_usleep(1);, the receiving task doesn't seem to be working properly for 350ms.
    I tried using a manual busy loop like this:

    IRQ_busy_wait = GTC_getCount64() + 150;
    while (IRQ_busy_wait > spent_time) {
        spent_time = GTC_getCount64();
    }

    The pause no longer occurs.
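For reference, the same busy wait can be written so that the counter is always read before the first comparison and the comparison is done on elapsed counts. As posted, the first loop test uses whatever value `spent_time` held beforehand. This is a hedged sketch, not SDK code: `busy_wait_counts` and `CounterReadFn` are hypothetical names, and `fake_counter` is only a host-side stand-in for a hardware counter. On the target one would pass a counter read function such as `GTC_getCount64`, assuming its signature is compatible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function-pointer type for a free-running counter read. */
typedef uint64_t (*CounterReadFn)(void);

/* Spin until at least `counts` counter increments have elapsed.
 * Returns the number of counts that actually elapsed. */
static uint64_t busy_wait_counts(CounterReadFn read_counter, uint64_t counts)
{
    uint64_t start;
    uint64_t elapsed;

    assert(read_counter != NULL);
    start = read_counter();
    do {
        /* Unsigned subtraction stays correct even if the counter wraps. */
        elapsed = read_counter() - start;
    } while (elapsed < counts);
    return elapsed;
}

/* Host-only stand-in for a hardware counter: advances by 7 per read and
 * starts close to the wrap point to exercise the wrap-safe arithmetic. */
static uint64_t fake_counter(void)
{
    static uint64_t c = UINT64_MAX - 100U;
    c += 7U;
    return c;
}
```

With the value from the post, the call would be `busy_wait_counts(GTC_getCount64, 150U)`. How long 150 counts lasts depends on the counter clock, which is not stated in this thread.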

    Am I using ClockP_usleep incorrectly?

    Configuration

    Receiving task: TaskP_PRIORITY_HIGHEST
    Sending task: TaskP_PRIORITY_HIGHEST-1

    The send task calls this code three times:

    EnetQueue_initQ(&txSubmitQ);
    
    /* Retrieve TX packets from driver and recycle them */
    EnetMp_retrieveFreeTxPkts(perCtxt);
    
    /* Dequeue one free TX Eth packet */
    txPktInfo = (EnetDma_Pkt *)EnetQueue_deq(&gEnetMp.txFreePktInfoQ);
    
    /* Fill the TX Eth frame with test content */
    txFrame = (EthFrame *)txPktInfo->sgList.list[0].bufPtr;
    
    /* make send frame */
    pre_fill_ecat_frame(txFrame);
    fill_ecat_frame1(txFrame);
    
    txPktInfo->sgList.list[0].segmentFilledLen = txFrame->payload[0]+16+(txFrame->payload[1] & 0x07)*256;
    txLen = txPktInfo->sgList.list[0].segmentFilledLen;
    txPktInfo->sgList.numScatterSegments = 1;
    txPktInfo->chkSumInfo = 0U;
    txPktInfo->appPriv = &gEnetMp;
    
    EnetDma_checkPktState(&txPktInfo->pktState , ENET_PKTSTATE_MODULE_APP , ENET_PKTSTATE_APP_WITH_FREEQ , ENET_PKTSTATE_APP_WITH_DRIVER);
    
    /* Enqueue the packet for later transmission */
    EnetQueue_enq(&txSubmitQ, &txPktInfo->node);
    status = EnetDma_submitTxPktQ(perCtxt->hTxCh[0], &txSubmitQ2); // 2024/8/20: status=0 keeps being returned
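The `segmentFilledLen` expression in the code above appears to combine the 14-byte Ethernet header, the 2-byte EtherCAT header and the 11-bit length field stored little-endian in the first two payload bytes (14 + 2 = 16). A small sketch that names those parts is shown below. `ecat_frame_len` and the two macros are hypothetical names introduced here for illustration, and the sketch assumes `payload` points at the first byte after the EtherType.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HEADER_LEN   14U  /* dst MAC (6) + src MAC (6) + EtherType (2) */
#define ECAT_HEADER_LEN   2U  /* length (11 bits) + reserved (1) + type (4) */

/* Total frame length in bytes, excluding FCS, derived from the EtherCAT
 * header at the start of the Ethernet payload. */
static uint32_t ecat_frame_len(const uint8_t *payload)
{
    uint32_t ecat_len;

    assert(payload != NULL);
    /* The 11-bit length is little-endian: low 8 bits in payload[0],
     * high 3 bits in the low bits of payload[1]. */
    ecat_len = (uint32_t)payload[0] | (((uint32_t)payload[1] & 0x07U) << 8);
    return ETH_HEADER_LEN + ECAT_HEADER_LEN + ecat_len;
}
```

This yields the same value as `payload[0] + 16 + (payload[1] & 0x07) * 256`; the helper only makes the origin of the constant 16 explicit.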
    

  • I've also attached the WireShark log from when it was unable to operate for 350ms.

    The first frame sent drift compensation (CMD:ARMW),
    the second frame sent LRW, and the third frame also sent LRW.

    This data was captured using profishark1G,
    and as you can see, the third frame was not captured correctly.
    (The profishark options Transmit CRC errors, KeepCRC32, and Capture full frames are enabled.)

    This may be something I should ask profitap about, but since a bug in the analyzer seems unlikely,
    I would like to first eliminate any possibilities, such as API usage issues.

    Are there any restrictions on using EnetDma_submitTxPktQ?
    For example, after sending, do I need to receive data before sending the next data?

    Addendum: I was unable to attach the packet capture from Wireshark, so I pasted the image.

  • Hi,

    The receiving task calls ClockP_usleep(1)

    The minimum sleep period for ClockP_usleep is 1ms (1000 usec). Anything shorter than that is handled by a busy loop inside the call rather than a real sleep, which won't give consistent results. If your application needs a precision of 1us, we suggest using timers instead.
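The behavior described here can be modeled roughly as a decision between sleeping for whole ticks and spinning. The sketch below is a simplified host-side model and not the SDK source: `DelayPlan` and `plan_delay` are hypothetical names, the 1000us tick comes from the statement above, and any handling of a remainder after the tick sleep is deliberately not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of how a requested delay would be served. */
typedef struct {
    uint32_t sleep_ticks;  /* whole ticks the task would block for */
    bool     busy_wait;    /* true if the delay is served by spinning */
} DelayPlan;

/* Simplified model: delays of at least one tick block the task for whole
 * ticks; shorter delays are served by polling the clock in a loop. */
static DelayPlan plan_delay(uint32_t usec, uint32_t usec_per_tick)
{
    DelayPlan plan = { 0U, false };

    assert(usec_per_tick > 0U);
    if (usec >= usec_per_tick) {
        plan.sleep_ticks = usec / usec_per_tick;
    } else {
        plan.busy_wait = true;
    }
    return plan;
}
```

Under this model `plan_delay(1, 1000)` selects the busy-wait path, so a 1us request from the highest-priority task never yields the CPU and depends entirely on the clock reading advancing.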

    From the wireshark logs image, I cannot locate packets with 350ms difference. To share the pcap files, you can compress the capture file, and attach to the post. That will help us analyze the traffic better.

    Are there any restrictions on using EnetDma_submitTxPktQ?
    For example, after sending, do I need to receive data before sending the next data?

    There are no restrictions in using EnetDma_submitTxPktQ API to use it back to back. It can also be used in applications where we only transmit traffic. If you are using this to exclusively send single packets, you can alternatively use EnetDma_submitTxPkt, which is for submitting single packet at a time.

    With the alternative approach to ClockP_usleep, are you still having the issue of intermittent delays?

    Regards,
    Teja.

  • There are no restrictions in using EnetDma_submitTxPktQ API to use it back to back.

    I understand. Thanks.

    With the alternative approach to ClockP_usleep, are you still having the issue of intermittent delays?

    This issue no longer occurs when I use a busy loop with GTC instead of ClockP_usleep.

    Since a precision of 1-2 us is sufficient to keep the signal going, this isn't a problem,
    but I'd like to know the cause to prevent it from recurring.

    I did a little research with a debug build here, and it seems like the tick isn't updated inside uint64_t ClockP_getTimeUsec(void),
    and there's a delay until it's updated.

    Do you know why this is occurring?

    As already noted, the priority is set to the maximum value for the receiving task,
    and I'm not using any interrupt disable functions such as HwiP_disable/HwiP_disableInt/HwiP_destruct.
    (I'm using the MCU+SDK API, so it's possible they're being used there.)

    packetcapture.zip

  • Hi,

    From the capture, I am observing jumps of around 250us, but not 350ms as mentioned earlier. Can you please confirm this observation from your side as well?

    I have not been able to test the behavior of the ClockP_usleep() API yet. As you may have already seen in the code, for delays shorter than 1ms ClockP_usleep does not put the task to sleep; it busy-waits inside the function. Please let me profile the API in our test setup, which should give some insight into where the issue is.

    I will keep this thread posted about the results. Please expect a response by end of this week.

    Thanks and regards,
    Teja.

  • From the capture, I am observing jumps of around 250us, but not 350ms as mentioned earlier. Can you please confirm this observation from your side as well?

    Thank you for confirming.

    Sorry, there was some information I didn't mention.

    Please pay attention to data numbers 501-1901 in the WireShark data.

    Even though they are EtherCAT frames, the return frame could not be captured.

    I believe this is the reason the receive task was unable to operate for 350 ms.

    Just before this 350ms issue occurred, ClockP_usleep(1) was delayed for several tens of microseconds, after which we were no longer able to observe any return frames.
    By slightly changing the measurement environment, from master (custom board)-slave1 (custom board)-slave2 (ti AM243xEVAboard) to
    master (custom board)-slave1 (custom board)-slave2 (custom board) or
    master (custom board)-slave1 (ti AM243xEVAboard)-slave2 (ti AM243xEVAboard)
    the 350ms interval was no longer observed, but ClockP_usleep(1) started looping indefinitely.

    It took about 5-10 minutes to reproduce the problem.

    The ClockP_getTimeUsec function also contains the following comment, which makes me wonder if there are any restrictions on setting priorities.
    /** Check if the timer has overflowed.
    * This is to handle cases in which this function is invoked from critical sections with
    * interrupts disabled, and hence `gClockCtrl.ticks` won't increment even in case of
    * overflow in the timer count.
    * When the ISR increments `gClockCtrl.ticks`, it will clear the overflow status. */
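To make the quoted comment concrete, here is a hedged host-side model of a tick-plus-counter time source with the overflow check it describes. It is not the SDK implementation: `model_time_usec` and all of its parameters are hypothetical, and the model assumes a counter that restarts from zero every tick period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: time = whole ticks + fraction of the current tick.
 * `overflow_pending` means the counter has wrapped but the tick ISR has
 * not yet run to increment `ticks`. */
static uint64_t model_time_usec(uint64_t ticks,
                                uint32_t count_in_period,
                                uint32_t counts_per_tick,
                                uint32_t usec_per_tick,
                                bool overflow_pending)
{
    assert(counts_per_tick > 0U);
    assert(count_in_period < counts_per_tick);

    if (overflow_pending) {
        /* Account for the one tick the ISR has not added yet. */
        ticks += 1U;
    }
    return (ticks * usec_per_tick) +
           (((uint64_t)count_in_period * usec_per_tick) / counts_per_tick);
}
```

In this model a single pending overflow is compensated, but if the tick ISR were held off across two or more overflows, the extra periods would be missing from the reading. Whether that happens in the setup reported here is not established in this thread.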

  • Hi,

    When both the slaves are am243x boards, then the issue is not seen without any changes to application. This could be potentially a setup issue related to the receive frames. Can you shift the read point from between master and slave1 (master-profishark-slave1) to between the 2 slaves (slave1-profishark-slave2)? I would also like to get a clarification that the observation is targeted on am243x EVA boards. Also, please provide statistics of the ICSSG port to understand the link status and more details of the application. 

    I will check with our experts regarding the timer implementation. It is unlikely that timer overflow is causing the 350ms delays, but a critical section in a different part of the application could be delaying the tick increments.

    I will get back to you with test results soon.

    Regards,
    Teja.

  • Can you shift the read point from between master and slave1 (master-profishark-slave1) to between the 2 slaves (slave1-profishark-slave2)?

    As for the 350ms issue, we can no longer reproduce it here, and it may be due to the custom board we were using, so it's difficult to track it down any further.
    If it does reoccur, I'll gather the information and get back to you.

    Instead, I'd like to confirm the cause of the usleep delay.
    I've created an environment with the following: TIBoard(master)(TMDS243EVM) --- TIBoard(slave)(TMDS243EVM) --- TIBoard(slave)(TMDS243EVM)

    I've attached the respective projects for testing purposes.
    Please check them here if necessary.
    The delay occurs at ClockP_usleep(1); on line 3944 of the master source code, enet_layer2_icssg.c.
    The delay occurs a few minutes after starting operation.

  • Hi,

    It is good to know that the issue is not persistent now. Thank you for the update. Due to internal commitments, I have not been able to test the timer behavior until now. I will try to do it within a week and update this thread with the results.

    Regards,
    Teja.

  • Hi,

    I will try to do it in a week, and update the same in this thread.

    I'm contacting you to ask about the progress.
    Could you please let me know what the current situation is?

  • Hi,

    Due to internal commitments, I have not been able to test this yet, but it is being tracked. I will send the analysis details within one week. Please let us know if this timeline works for you.

    Thanks and regards,
    Teja.

  • Hi,

    If possible, I would like to know the results by the end of this month (9/30).

    Regards,
    Nishimori.

  • Hi,

    I cannot promise for 9/30, but I will try to give you an update on the reproducibility by that date since this is a short week for India teams. 

    Thanks and regards,
    Teja.

  • Hi,

    I understand. Thanks.
    I look forward to your reply.

    Regards,
    Nishimori.

  • Hi,

    Thank you for understanding. I will update this thread by EoD of 9/30. 

    Regards,
    Teja.