AM3356: PRU firmware issue (packet losses)

Part Number: AM3356
Other Parts Discussed in Thread: TMDSICE3359, AM3359, TLK110

Hello all,

 

My customer is working with the AM3356 device and has reported the problem below. Before going into the detailed description, I would like to highlight the following information:

 

  • The customer's lead engineer is the main contact person assigned to manage this post. Please make sure to communicate any updates with him;
  • TI has unfortunately not yet been able to reproduce the problem seen on the customer's side; this is where most of the effort has been focused;
  • Please consider the customer's offer to send their board to TI for further analysis in case the problem cannot be reproduced at all.

 

Dear customer, please fill in any information you feel is missing. I hope this clarifies the current project status and helps us work as a unit, with centralized updates for everyone.

Here is the problem description:

 

"We have a problem with your PRU firmware for AM3356.

We are currently in the test lab with our Profinet card. There we have received feedback that our Profinet IRT switch in MRP mode drops telegrams under high bus load and thus cuts off downstream components on the bus.

This is extremely unfavorable: it is not always just one packet that is discarded; sometimes several packets are missing, and other components in the network then fail.

 

As you already know, we developed our Profinet card with the help of the TI third-party partner.

They have already looked into the problem and determined that the cause lies in the PRU firmware.

The following error description has already been sent to your colleagues in India:

 

Our customer has a PROFINET door control product that is based on the AM3356 and provides either a PROFINET IRT or a PROFINET RT with MRP interface. I will simply call this product the DUT. It is being tested in the test laboratory of a large automotive company, running PROFINET RT in an MRP ring. The problem is that PROFINET devices behind the DUT sometimes lose the PROFINET connection, even though the DUT is expected to simply switch through the cyclic data frames for all devices behind it. The DUT itself does not lose its connection. If the DUT is replaced by a third-party device, there are no connection losses.

 

This automotive company is a very important customer for this product, and it is important for us to find a solution.

 

The topology is shown in the appended PDF. The MRP ring contains 5 clients. There is a stub line attached via a PROFINET switch; on this stub line, about 100 PROFINET devices are simulated using a Siemens SIMBA PNIO box. During connection setup, the cable between the MRP manager and Device 3 is open, so that all MRP ring traffic and all traffic from/to the 100 simulated devices goes through the DUT. After connection setup, the MRP ring is closed (cable between the MRP manager and Device 3 connected) and this installation is run for multiple hours. The cyclic frames before and after the DUT are inspected by a Siemens BANY PNIO network analyser, with which the appended PCAP file was generated. Each cyclic frame that goes through the DUT is logged twice, before and after the DUT, with a slight time difference. This PCAP file shows a single cyclic frame loss:

  • Use eth.src == 08:00:06:9D:37:BB as filter in Wireshark
  • Go to frame number 124415.

 

This frame was (presumably) dropped by the Ethernet switch in the DUT. This is a single frame loss. Because there were connection losses on the devices behind the DUT, it is assumed that three or more consecutive cyclic frames of one device are sometimes lost.

 

The intended average network load is about 25%, with some peaks. I have analysed the PCAP file, which is 1 s long; Wireshark statistics state that the network load in this file is about 70%. There are 160 different MAC addresses. I had a look at the ICSS EMAC learning: it uses 256 buckets with 4 entries each. Using these MAC addresses and the ICSS EMAC hash (a simple byte-wise XOR), I get a maximum of 3 entries per bucket, so there is no bucket overflow.
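That bucket check can be sketched as follows; this assumes, as stated above, that the ICSS EMAC hash is a plain byte-wise XOR of the six MAC octets (the function and macro names are illustrative, not from the SDK):

```c
#include <stdint.h>

#define NUM_BUCKETS 256  /* ICSS EMAC learning table: 256 buckets x 4 entries */

/* Byte-wise XOR of the six MAC octets; the 8-bit result directly
 * selects one of the 256 buckets. */
uint8_t icss_emac_hash(const uint8_t mac[6])
{
    uint8_t h = 0;
    for (int i = 0; i < 6; i++)
        h ^= mac[i];
    return h;
}
```

Feeding all 160 observed MAC addresses through this hash and counting hits per bucket is how the "maximum of 3 entries per bucket" figure can be checked against the 4-entry limit.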

 

The DUT is based on:

  • AM3356
  • Hardware is built quite similar to the TMDSICE3359
  • DP83822 Ethernet PHY
  • PRU-ICSS-Profinet_Slave_01.00.03.04
  • pdk_am335x_1_0_12

 

Do you have an idea what could cause these cyclic frame losses?

Is it possible for you to reproduce this, e.g. using a TMDSICE3359?

We don’t have a SIMBA simulator box; do you have one?"

 

Thank you for your ongoing support and efforts.

  • Hi Abassin,
    We are able to reproduce this issue and suspect it to be a port buffer overflow caused by the increased cycle counts required in specific scenarios. We have observed that increasing the port queue size does not have a significant impact on performance, and we are currently looking into further optimizations. We hope to provide an update by 24-Oct.
    Thanks for your patience.
    Laxman

  • Dear Laxman

    My name is Aravindh Vishnu and I work as a software developer at ABB Force Measurement in Sweden. We also have a product based on PRU-ICSS_Profinet.

    Our DUT is based on:

    • AM3356
    • Hardware is built quite similar to the TMDSICE3359
    • DP83822 Ethernet PHY
    • PRU-ICSS-Profinet_Slave_01.00.05.00
    • pdk_am335x_1_0_17
    • TMG Profinet stack

    And we have also recently seen faulty behavior similar to what Abassin describes.

    The difference is that we do not even connect the device in an MRP ring network, and we do not support IRT (only RT).

    We just have three DUTs connected in daisy-chain. 

    In the PLC, we set the IO cycle time to be 4ms. So the PLC requests a new value every 4ms from each DUT (and also writes values to the DUT every 4ms).

    What we notice is that we get a Profinet error on all three DUTs. Sometimes the error comes after a few hours, sometimes after a few days. Our DUT shows an error and the PLC also shows an error. We have seen this error with two different PLCs (ABB AC800M and Siemens S7-1515).

    We define a "Profinet error" as follows:

    In the dua.c file, in the main_pn function, there is a state machine. If, after communication has been established, the state at runtime takes any value other than PND_DUA_STATUS_CONNECTED, we signal an error.
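That check can be sketched roughly as below; PND_DUA_STATUS_CONNECTED is the constant from the TMG stack's dua.c, but its numeric value here is a placeholder and the helper function is purely illustrative:

```c
/* Placeholder value: the real constant is defined by the TMG stack. */
#define PND_DUA_STATUS_CONNECTED 1

/* Returns 1 when communication has been established but the main_pn
 * state machine has left the CONNECTED state, i.e. what we log as a
 * "Profinet error". */
int pn_error_active(int comm_established, int dua_state)
{
    return comm_established && (dua_state != PND_DUA_STATUS_CONNECTED);
}
```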

    After the error has occurred, it takes between 1 and 6 hours before the error disappears and communication is somehow re-established.

    After a lot of testing, we have found that the probability of the error occurring (how fast/often it occurs) is related to the IO cycle time.

    The shorter the cycle time, the higher the probability of the fault occurring.

    The fault probability is also related to the number of devices in the daisy chain: the more devices, the higher the probability.

    If point-to-point communication is used (one PLC and only one DUT), we do not get the Profinet error. That is why we suspect the PRU switch software.

    @Laxman: If you come to any conclusions, we would appreciate it if you could share them with us as well.

    Kind regards

    Aravindh

  • Hi Aravindh,

    We are currently running some tests on this issue and will provide an update within the next week.

    Kind Regards

    Laxman

  • Dear Laxman

    Thanks for looking into this issue. We have done some more testing on our end, and I thought it would be a good idea to share the results. We have logged the PRU statistics; enclosed are the PRU statistics logs and a readme file which describes the test setup and the test observations.

    What can be observed is that rxOverSizedFrames increments a lot on the DUT that has the communication fault active. Note that rxOverSizedFrames also increments when the communication fault is not active (when communication is up and working), but then it increments very rarely.

    We would appreciate it if you could look at the enclosed files and provide some analysis, e.g. what does the rxOverSizedFrames counter mean?

    Perhaps you also have suggestions on other tests/measurements we could do. PRU statistics.zip

    Kind regards

    Aravindh 

  • Dear Aravindh,

    We would like to get more information regarding your exact test setup.

    1. Have you tried running the test with an ICE_AM3359 device?
    2. Could you let us know which version of the TMG stack you are currently using? We are using v5.6x.

    We are currently running a long test run and will provide an update on any progress made. The setup we are currently using:

    • A PLC connected to three ICE_AM3359 boards in a daisy-chain network.
    • IO cycle time set to 1 ms to reproduce the problem faster.

    Thanks and Regards,
    Laxman

  • Dear Laxman

    Thanks for the update!

    The TMG Profinet stack version we are using is v5.6.0.0, but we have applied some patches that TMG recommended to this code.

    The TI Profinet example application version is v1.0.4.7, but we have applied some patches that TI recommended to this code.

    One of these patches is that we use the latest version of the PRU firmware.

    As you suggest, we will also run a long-term test with one Siemens S7-1515 PLC and three ICEv2 boards.

    We will set the IO cycle time to 1ms.

    For this test we will use the latest version of both the TMG Profinet stack and TI Profinet example application.

    Kind regards

    Aravindh

  • Dear Laxman

    The test with the S7-1515 PLC and the three ICEv2 DUTs connected in daisy-chain is up and running. The IO cycle time is set to 1 ms.

  • Dear Aravindh,

    We have also been running a similar test on our side; it has now been running for around 40 hours and we have not observed any disconnection issues.
    Meanwhile, could you also share with us the ICSSM memory dump from when this issue occurs? The ICSSM memory spans locations 0x4A30_0000 to 0x4A38_0FFF.
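For reference, that address range covers 0x81000 bytes (516 KiB); a trivial helper (names illustrative, not from the SDK) makes the size of the region to capture explicit:

```c
#include <stdint.h>
#include <stddef.h>

/* ICSSM shared-memory window quoted above; the end address is inclusive. */
#define ICSSM_START 0x4A300000UL
#define ICSSM_END   0x4A380FFFUL

/* Number of bytes to dump from the ICSSM region. */
size_t icssm_dump_len(void)
{
    return (size_t)(ICSSM_END - ICSSM_START + 1UL);  /* 0x81000 = 528384 bytes */
}
```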

    Thanks and Regards,
    Laxman  

  • Dear Laxman

    I had some questions...

    1) Are there any ready-made APIs to print out the ICSS memory dump (just like there are APIs for printing out the PRU statistics)?

    2) Did you have a chance to look at the PRU statistics that we logged and attached in a previous entry? The rxOverSizedFrames counter increased a lot during the communication failure (and also, rarely, while communication was working). I would appreciate it if you could look at the log files and provide an analysis. What does the rxOverSizedFrames counter mean and why is it incrementing?

    3) In a previous entry (as a reply to Mr. Abassin Mumand) you wrote the following:

    We are able to reproduce this issue and suspect this to be port buffer overflow caused due to increased cycles required for specific scenarios. We have observed that increasing the port queue size does not have any significant impact on the performance and are currently looking into further optimizations.

    Now you write that you are not able to reproduce the same issue, which is a bit confusing.

    Are you, or are you not, able to reproduce the issue that Mr. Abassin Mumand and we are facing?

    Kind regards

    Aravindh

  • Dear Aravindh
    1) You can print out the memory dump using CCS:

    • Open the Memory Browser in CCS and start a memory save operation.

    • Specify the start address and the length of the region to save.

    • Save and finish, and you will find the dump in the default location.

    2) We are currently trying to reproduce the issue. Once we are able to reproduce it, we will check how the communication fault affects the rxOverSizedFrames counter.

    3) The two issues are different from each other. The issue they are facing is related to packet drops due to buffer queue overflow, while in your case the problem seems to be caused by connection losses.

    Could you let us know the status of the test you are running currently? 

    Regards,
    Laxman

  • Dear Laxman

    The test has run for approximately 4 days (96 h). There has been no Profinet communication failure on any of the three ICEv2 DUTs.

    I think we can conclude that the communication is stable and therefore there is no bug in the PRU firmware (we use the same PRU firmware version in our product). But there are hardware differences between the ICEv2 and our custom board (PFEA122): the PFEA122 uses the DP83822 Ethernet PHY (the ICEv2 uses the TLK110) and a 25 MHz oscillator (the ICEv2 uses 24 MHz).

    Regarding the ICSS memory dump, for which of the three DUTs is this needed?

    For DUT3 only, for DUT2 and DUT3, or for all three?

    As you know, we use the below topology:

    PLC <=> DUT1 <=> DUT2 <=> DUT3

    Kind regards

    Aravindh

  • Dear Aravindh,

    This issue is not reproducible with the TMDSICE3359 DUT, as has also been confirmed from your side.
    Please raise any other issues regarding this in a separate thread.

    Kind Regards,
    Laxman 

  • Hi Abassin,
    We have shared the patch by mail. Please give us an update after you have tested the new patch.

    Regards,
    Laxman