6678 GbeSw issue

Sergey Vasilev

Hi.

I have a custom 6678 board and a network application based on the NDK demo from MCSDK 2.1.2.5. The application runs from ROM at power-on.

Usually it works just fine, but I found a strange effect when interacting with another custom network device (a switch): the GBESW stops sending packets to the outside, even though it still accepts and processes incoming packets.

I have saved a dump of the statistics registers starting at address 0x2090B00.

STATB shows that only 1 packet was transmitted, and it was a broadcast. But STATA shows that many packets were received from the host, including 2 broadcasts, i.e. at least 2 packets should have been transmitted regardless of the lookup table settings. Nothing is flagged in the error statistics registers.

After this situation occurs, all incoming traffic is still accepted and successfully reaches the NIMU ISR, but outgoing packets are not sent to the network at all. The outgoing packet statistics do not change either.

The custom switch device has a peculiarity: a few seconds after launch, it resets all of its ports, i.e. the link goes down and then up. As a result, the PHY chip restarts the SGMII link to the DSP (it puts the link in the not-linked state and starts the auto-negotiation process when it syncs back).

In other words, the SGMII link periodically restarts while a packet is being transmitted. I think the strange effect appears after the SGMII link goes down. If I connect another switch that does not interrupt the connection, the effect is not observed.

How can I protect against this kind of effect?
Is it possible to restart something when the problem occurs?
Permanent link status monitoring does not look like a solution either: the link can drop while a packet is already queued for processing, and software cannot stop it.

over 12 years ago

0 Sergey Vasilev over 12 years ago

Intellectual 280 points

1184.gbe_stat.dat

0 Katherine Kelsch over 12 years ago

TI__Intellectual 1585 points

Hey Sergey,

This sounds like a descriptor problem to me.

As in, not enough descriptors in the free descriptor queue (FDQ). Whenever the switch wants to send packets out to the network, it doesn't have enough descriptors to attach to the outgoing packets. It's unable to send any more packets, and the ACK bit saying that the switch received it causes it to never send anymore, breaking the link. Check that to see if it solves your issue.

0 Sergey Vasilev over 12 years ago in reply to Katherine Kelsch

Intellectual 280 points

Hi, Katherine.

I did not understand your answer.

Do you mean the "Gigabit Ethernet Switch Subsystem for KeyStone Devices" (GBESW) when you write "switch"?

Does the GBESW have its own FDQ?

Can you explain a little more about "the ACK bit saying that the switch received it causes it to never send anymore, breaking the link"?

I'll try again to describe my situation, perhaps my description was not fully understood.

On the DSP board, we have the chain: QMSS (with its DQ) --- CPPI --- PA --- GBESW --- SGMII --- PHY.

In some situations, I think when the SGMII level breaks, the GBESW level begins to behave unusually.

The GBESW subsystem has two sets of STAT registers: STATA contains statistics for traffic between the host (PA+CPPI+QMSS) and the GBESW, and STATB contains statistics for traffic between the GBESW and the external world (SGMII+PHY).

The 1184.gbe_stat.dat file attached to the post above shows that the GBESW subsystem received many packets from the host, including two broadcasts. The GBESW subsystem started sending these packets to the external world, and after the first packet it simply hangs.

0 Katherine Kelsch over 12 years ago in reply to Sergey Vasilev

TI__Intellectual 1585 points

Sergey,

"Switch" is another shorthand for GBESW, or GBE Switch, or GBE. Sorry for the confusion.

No, the GBESW doesn't have its own FDQ, but I was thinking you were sending packets from the QMSS (which your chain shows it does). What I'm saying is that the packet is first encapsulated with a descriptor from the FDQ, then sent out to the GBESW via the destination queue. If the FDQ is empty, there isn't a descriptor to send the packet out with, and therefore it gets stuck. It's waiting for the GBE to say it received (acknowledged bit) that last packet, so it never sends another one.

But you're saying that the subsystem received "many packets", but does that mean all of them? How many are you intending to send and how many are actually sent? If it's not the same number, then most likely the sender (i.e. the QMSS) had to stop sending packets for some reason. If the SGMII breaks, then the GBE switch cannot work. The SGMII is the only way out to the network for the module. If it did break, it would stop sending. But we need to find out if the packet stops at the SGMII or at the QMSS. If the QMSS has one more packet waiting in the queue with a descriptor, then it's the GBE side. If the GBE has one packet waiting to go out on SGMII, then it's the SGMII side.

It's better to mention ingress/egress versus RX TX when talking about these status registers. Because the SGMII Rx good frames means ingress (network into the modules) but the RX for CPPI is egress (going out to network) so it can get confusing and frustrating. You probably already know this, but I feel it's good practice to mention it.

I tried opening your .dat file, but it was giving me gibberish. Could you possibly put it in another format? xls maybe? Or you could just put your values for STATSA/B etc. here in text.

Hope this helps!

Kat

0 Sergey Vasilev over 12 years ago in reply to Katherine Kelsch

Intellectual 280 points

Hi, Katherine.

This is a text form of the gbe_stat.dat file

4010.gbe_stat.txt

0000000:    00000124    00000002    00000101    00000000
0000010:    00000000    00000000    00000000    00000000
0000030:    0000810e    000001ca    000001ca    00000000
0000040:    00000000    00000000    00000000    00000000
0000060:    00000000    000072e4    000001c6    000000ec
0000070:    00000000    0000003c    00000000    00000000
0000080:    0000f3f2    00000000    00000000    00000000
00000b0:    0000f3fe    000001ca    000001ca    00000000
00000c0:    0000f3f2    00000000    00000000    00000000
00000e0:    0000f3f2    000072e4    000001c6    000000ec
00000f0:    0000f3f2    0000003c    00000000    00000000
0000100:    000001ca    000001ca    00000000    00000000
0000110:    00000000    00000000    00000000    00000000
0000130:    000072e4    00000001    00000001    00000000
0000140:    00000000    00000000    00000000    00000000
0000160:    00000000    00000044    000001c6    00000005
0000170:    00000000    00000000    00000000    00000000
0000180:    00007328    00000000    00000000    00000000
00001b0:    000073ec    00000001    00000001    00000000
00001c0:    00007328    00000000    00000000    00000000
00001e0:    00007328    00000044    000001c6    00000005
00001f0:    00007328    00000000    00000000    00000000

The original file was a simple raw dump, saved by CCS.

If you look at the statistics for ingress packets, you will see that the STATA counters match the STATB counters. So all ingress packets received on Port2 were passed to Port0.

For egress packets, the statistics are very different. Port0 received from CPPI (PA?) 0x124 packets, but only one of them (I think the first one) was passed to Port2. In fact, even this packet did not reach the network and was lost somewhere.

The NIMU driver uses direct port forwarding, so packets should not be lost during routing. Furthermore, broadcast packets should have been transmitted correctly in any case.

Keeping in mind your opinion on the FDQ, I checked the behavior of the queues. The result was somewhat unexpected.

Taking the demo application, I added code that cyclically sends a 64-byte broadcast packet every microsecond. The NIMU driver uses 16 descriptors in the FDQ and 2 more for PA control.

When the SGMII interface is up, all packets are transmitted successfully. The packet counts reported by the program and by the switch match. All packets queued to the PA are returned to the QMSS in about 600 nanoseconds.

When the SGMII interface is down, packets queued to the PA are returned to the QMSS in about 5-6 milliseconds (I think by an internal timeout). Such a long delay means that the FDQ runs out of descriptors and some packets are discarded by the NIMU driver. But as the release handles, new packets are still sent.

I played around with taking the SGMII link up and down. In most cases, packet transfer resumed after link-up. But on several occasions I was able to reproduce the situation where packets queued to the PA stayed there, handled by no one, and never came back to the return queue.

0 Katherine Kelsch over 12 years ago in reply to Sergey Vasilev

TI__Intellectual 1585 points

Thanks Sergey,

Good to know that you were able to see some more information here.

So you're seeing a loss of packets mid routing. I'm glad you checked the queues and got more information.

"0x124 packets, but only one of them (I think the first one) was passed to Port2" So you're receiving packets from the network, but only the first one is processed (this is typically an RX FDQ not having enough handles to move the packet into the IP, but since I have already mentioned this possibility I'll ignore it).

I was most curious about " But as the release handles, new packets are still sent.". You're saying that even though the SGMII link is down, it's still sending packets? Or are you saying that once the link is back up the packets are immediately sent? Which makes a lot of sense to me. It can't go until the path is open (unless you have it configured to loop back via the SGMII and you're closing off the receiving SGMII port). What are you expecting otherwise?

Here's what your text for your memory is:

0000000: 00000124 00000002 00000101 00000000
0000010: 00000000 00000000 00000000 00000000
0000030: 0000810e 000001ca 000001ca 00000000
0000040: 00000000 00000000 00000000 00000000
0000060: 00000000 000072e4 000001c6 000000ec
0000070: 00000000 0000003c 00000000 00000000
0000080: 0000f3f2 00000000 00000000 00000000
00000b0: 0000f3fe 000001ca 000001ca 00000000
00000c0: 0000f3f2 00000000 00000000 00000000
00000e0: 0000f3f2 000072e4 000001c6 000000ec
00000f0: 0000f3f2 0000003c 00000000 00000000
0000100: 000001ca 000001ca 00000000 00000000
0000110: 00000000 00000000 00000000 00000000
0000130: 000072e4 00000001 00000001 00000000
0000140: 00000000 00000000 00000000 00000000
0000160: 00000000 00000044 000001c6 00000005
0000170: 00000000 00000000 00000000 00000000
0000180: 00007328 00000000 00000000 00000000
00001b0: 000073ec 00000001 00000001 00000000
00001c0: 00007328 00000000 00000000 00000000
00001e0: 00007328 00000044 000001c6 00000005
00001f0: 00007328 00000000 00000000 00000000

(I'm assuming that these are offsets of offsets because STATSA RXSGOODFRAMES egress is really at 0xB00, if not then you need to look back at the user guides) .

At 0x100 (aka STATSB RX good frames ingress, or 0xc00) I have you with 0x1ca good packets from the network. And your STATSA (your ingress packets to send out to Port 2 for STATSB to show) shows 0x124. They don't match. I've highlighted the differences above.

I think you're not really looping back. You're receiving more than you're sending. Check your SGMII config to make sure you're not just receiving from network random packets, when you're really looking to loop back and receive from Port 1.

If you're not looping back, then I don't see an issue.

Hope this helps,

Kat

0 Sergey Vasilev over 12 years ago in reply to Katherine Kelsch

Intellectual 280 points

Hi, Katherine.

I see we have completely failed to understand each other.

Let's start from the beginning.

I am not looping back packets. All ingress packets are generated by the external network using ping. All egress packets are generated by the NDK (ARP packets and ping responses) and directed to the external network. The ingress and egress counters will not necessarily be the same, and that's fine.

How do I interpret statistics:

Offset 00000000 is the absolute address 0x2090b00.

ingress: STATB RX good frames (at 0x100) = 0x1ca, STATA TX good frames (at 0x34) = 0x1ca. So all ingress packets were passed. That's OK.

egress: STATA RX good frames (at 0x00) = 0x124, STATB TX good frames (at 0x134) = 0x1. Here's the issue: only one packet was transmitted, and all the rest were lost.

When the SGMII link is restored, the ingress counters start to grow again in both STATA and STATB, i.e. incoming traffic is restored. STATB TX good frames never increases, even though STATA RX good frames shows new packets.
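
The same comparison can be made directly from the text dump. The following is only a sketch: parse_dump and the constant names are hypothetical helpers made up for illustration, and the offsets are the ones listed above (relative to 0x2090b00). Only the four dump rows that contain the good-frame counters are included.

```python
def parse_dump(text):
    """Parse 'offset: w0 w1 w2 w3' lines into a {byte_offset: value} dict."""
    regs = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        addr, words = line.split(":", 1)
        base = int(addr, 16)
        for i, word in enumerate(words.split()):
            regs[base + 4 * i] = int(word, 16)
    return regs

# Rows copied from gbe_stat.txt (only those holding the good-frame counters).
DUMP = """\
0000000:    00000124    00000002    00000101    00000000
0000030:    0000810e    000001ca    000001ca    00000000
0000100:    000001ca    000001ca    00000000    00000000
0000130:    000072e4    00000001    00000001    00000000
"""

# Offsets as interpreted in this post (assumed names, not taken from a header file).
STATA_RX_GOOD, STATA_TX_GOOD = 0x000, 0x034
STATB_RX_GOOD, STATB_TX_GOOD = 0x100, 0x134

regs = parse_dump(DUMP)
# Ingress: frames received on the SGMII side vs. frames handed to the host.
ingress_missing = regs[STATB_RX_GOOD] - regs[STATA_TX_GOOD]
# Egress: frames received from the host vs. frames transmitted on the SGMII side.
egress_missing = regs[STATA_RX_GOOD] - regs[STATB_TX_GOOD]
print(f"ingress missing: {ingress_missing}, egress missing: {egress_missing}")
```

With the values from the dump, the ingress difference is zero and the egress difference is 0x124 - 1 frames.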

Now as for my experiment with queues.

Queue handling in the NIMU driver is implemented as follows.

There is an FDQ containing 16 descriptors. This queue is controlled by software; no hardware uses it. The Send routine takes a descriptor from the FDQ, attaches a data buffer to it and pushes it to the CPPI PA queue. CPPI handles the descriptor and, once the data has been transferred to the PA, places the descriptor in the return queue. The next Send call frees the buffer and moves the descriptor from the return queue to the FDQ. If the FDQ is empty when Send is called, the packet is discarded.

As far as I can see, this mechanism knows nothing about the GBESW and SGMII. In fact the exchange is between the QMSS and the PA, and CPPI executes it. Egress packets that have already been queued to the PA should no longer require the FDQ.

In my experiment, I sent a new packet every microsecond.

If the SGMII link is up, the packet is transmitted successfully and the descriptor is back in the return queue in less than a microsecond, i.e. the FDQ is never empty.

If the SGMII link is down, the descriptor comes back to the return queue in 5-6 milliseconds (10000 times longer, but it does come back). Since Send is called every microsecond, the FDQ quickly becomes empty. But when descriptors are returned, they go back into the FDQ, are taken again and are queued to the PA again. That's what I meant by writing " But as the release handles, new packets are still sent."
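
The recycling behaviour described above can be modelled in a few lines. This is a toy simulation, not the NIMU driver code: the function name and parameters are hypothetical, the 16 descriptors, 1 microsecond send period and 600 ns return time come from this thread, and 5.5 ms is assumed as the midpoint of the observed 5-6 ms. It only covers the case where descriptors do eventually come back, not the case where they stay stuck.

```python
from collections import deque

def simulate(return_latency_ns, send_period_ns=1_000, n_calls=20_000, n_desc=16):
    """Count sent and discarded packets for a given descriptor return latency."""
    fdq = deque(range(n_desc))   # free descriptor queue, owned by software
    in_flight = deque()          # (time the descriptor reaches the return queue, descriptor)
    sent = dropped = 0
    for call in range(n_calls):
        now = call * send_period_ns
        # Each Send call first recycles descriptors already in the return queue.
        while in_flight and in_flight[0][0] <= now:
            fdq.append(in_flight.popleft()[1])
        if fdq:
            # Take a free descriptor and queue it to the PA.
            in_flight.append((now + return_latency_ns, fdq.popleft()))
            sent += 1
        else:
            # FDQ is empty: the packet is discarded.
            dropped += 1
    return sent, dropped

link_up = simulate(600)            # descriptor returns in about 600 ns
link_down = simulate(5_500_000)    # descriptor returns in about 5.5 ms
print("link up   (sent, dropped):", link_up)
print("link down (sent, dropped):", link_down)
```

In this model nothing is discarded while the link is up, and while the link is down packets go out only in bursts of 16, each time the descriptors come back.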

But the experiment revealed another problem. After multiple SGMII link up-down cycles, descriptors stopped coming back to the return queue, even after 5-6 milliseconds. They remain in the queue to the PA, unprocessed.

Incidentally, the 10,000-fold increase in descriptor processing time is interesting in itself. It suggests that there is some feedback from the SGMII to CPPI. Perhaps CPPI is trying to communicate with the application through an unconfigured flow/queue. But I have not found anything like this in the documentation, and TI's NIMU driver does not support such a mechanism.

0 Sergey Vasilev over 12 years ago in reply to Sergey Vasilev

Intellectual 280 points

Maybe I made a mess by misapplying the ingress/egress terms.

In my posts

ingress - all incoming packets from network to application,

egress - outgoing traffic from application to network.

0 Katherine Kelsch over 12 years ago in reply to Sergey Vasilev

TI__Intellectual 1585 points

Sergey,

Thanks for giving us a more detailed explanation.

Sorry for the delayed response. I've been discussing this with another TIer.

I do have one question I want to clarify first: do you know why the SGMII link is going up and down? Are you controlling that? If so, why are you disabling it?

If not, then the reason the packet isn't being sent from SGMII is because of that.

Thanks!

0 Sergey Vasilev over 12 years ago in reply to Katherine Kelsch

Intellectual 280 points

Hi, Katherine.

Yes, I know why the SGMII link is going up and down. As I wrote in the first post, "the custom switch device, connected to my 6678 board, has a peculiarity: a few seconds after launch, it resets all of its ports, i.e. the link goes down and then up. As a result, the PHY chip restarts the SGMII link to the DSP (it puts the link in the not-linked state and starts the auto-negotiation process when it syncs back)."

The Marvell 88E1322 PHY used on my 6678 board always breaks the SGMII link when the external copper link is lost.

In general, loss and recovery of an Ethernet link is a natural process that should not lead to the collapse of the system. I cannot control this process; it is random and asynchronous with respect to my board and software. At most, I can find out, with considerable delay, that the connection was lost or restored.

My concern is not link loss or the loss of individual packets. The main problem is that packets cease to be transmitted altogether, even after the link comes back.

Since the problem shows up only under very marginal conditions, I want to understand what else could affect the DSP this way. Maybe interference on the power supply, PLL instability or something like that.

In addition, experiments with queues and SGMII link highlighted another problem - stopping the transmission path when multiple breaks SGMII link. Are these two manifestations of the same problem or two different problems?

0 Katherine Kelsch over 12 years ago in reply to Sergey Vasilev

TI__Intellectual 1585 points

Sergey,

I think that's your issue, the fact that the SGMII link isn't stable. The device is trying to send it out while it's going up and down, and it stops sending packets in an effort to reduce packet loss. I know it may not matter to you if the packets are lost, but it does to the SGMII link.

Why don't you just have your custom switch device send an interrupt to the 6678 when you're ready to receive and send packets? That way it wouldn't start sending until your device is ready, and you also won't have any packets lost.

" with queues and SGMII link highlighted another problem - stopping the transmission path when multiple breaks SGMII link. Are these two manifestations of the same problem or two different problems?" I read this as "multiple transmission breaks will break the SGMII link". Yes, it will.

0 Sergey Vasilev over 12 years ago in reply to Katherine Kelsch

Intellectual 280 points

Katherine.

I understand that the problem is quite specific. Perhaps it is due to my lack of understanding of how the GBESW works.

I will try to divide the question into several simpler ones.

1. In what situations does the GBESW not transmit packets received on port 0 to ports 1 and 2?

2. Are packets inside the switch held (going nowhere) until the link is restored, or are they dropped by the switch when the SGMII link is down?

3. Should the GBESW start transmitting packets to port 2 after the SGMII link is restored? In my situation it never starts transmitting. I'm worried about the situation where STATA at offset 0x0 = 0x124 (and may continue to grow when the program tries to send new packets) while STATB at offset 0x34 is always 1. I think this behavior is abnormal. I interpret it this way: the switch is hung or drops all the packets.

4. Why can loss of the SGMII link lead to a situation where packets queued to the PA remain there unprocessed?

5. Is there any feedback between the SGMII (or ports 1 and 2) and the QMSS? How can it be used in the program?
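
To make the condition in question 3 concrete, it can be written as a check on two successive samples of the counters. This is only a sketch under assumptions: the function name and the sample layout are hypothetical, the counters are assumed to be 32-bit and free-running, and reading the registers and the link state on the target is not shown.

```python
def tx_stalled(prev, curr, link_up):
    """Return True if the host keeps feeding the switch but nothing leaves on the wire.

    prev and curr are (stata_rx_good, statb_tx_good) samples taken some time apart.
    """
    # Mask the differences so a wrapped 32-bit counter still gives a valid delta.
    host_delta = (curr[0] - prev[0]) & 0xFFFFFFFF
    wire_delta = (curr[1] - prev[1]) & 0xFFFFFFFF
    return bool(link_up and host_delta > 0 and wire_delta == 0)

# Values in the spirit of the dump: host side grows, wire side stays at 1.
print(tx_stalled((0x100, 1), (0x124, 1), link_up=True))
```

While the link is down the check reports nothing, since packets are expected not to leave in that state.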

0 Sergey Vasilev over 12 years ago in reply to Sergey Vasilev

Intellectual 280 points

Katherine.

Please let me know whether you consider the subject closed, or whether answering my questions requires more time.

0 Katherine Kelsch over 12 years ago in reply to Sergey Vasilev

TI__Intellectual 1585 points

Hey Sergey,

Sorry for the delay.

1. So, typically the ports don't really care about each other. That way you can TX on one and RX on another and not have them discuss it with each other simultaneously. If you RX on port 0 and can't send on other ports, it's usually the fault of the port that isn't sending (i.e. port 1/2).

A couple of things that could cause this:

a. You receive from the network. The packet has a MAC address in the descriptor of where it wants to go. The packet couldn't be sent out because 1) the MAC address is going to the wrong place or is wrong, 2) the network packet isn't supported by NETCP to classify in its PDSPs, or 3) the PKTDMA channel is not enabled, so it cannot be sent from NETCP to the module it's looking for.

b. You receive on port 0/the CPSW the packet that needs to be sent out. It is not being sent out to the network on the SGMII port. a) The SGMII port channel is down and disconnected from the network. It cannot send. This is the most common problem. The switch needs a steady port to send it out on. b) The ALE classification table can't classify the packet. Usually this is because the packet isn't in a proper Ethernet format or your device doesn't support it, for example when the header doesn't have a good ethertype (like VLAN) or has no valid destination/source MAC address.

2. In between the receive port 0 and the port 1/2 to the network, there is the packet streaming interface. It's basically a MAC module. One for each port. The TX ports have a 22K byte buffer, but the RX port does not. If the packet makes it to the TX buffer for port 0/1 to send out, then yeah it can hang there unless it's full. If it does then it probably will be dropped. (Also this RX port on the PSI has a PS_FLAGS register which may be useful for your debug).

3. Most users don't restore the SGMII link for fear of losing the packets, or because the link may not be secure. I don't think there is an option otherwise in the module. You're right in that the switch will receive more packets, and not send them out. If you have another SGMII module on your outer device, you could try an autonegotiation or a forced-link procedure.

4. The GBE switch is connected to the QMSS (through its PKTDMA/queues), the PA and the SA. The main reason a switch would send it to the PA would be because it needs to complete a task (according to the ALE) before sending out. Usually this is a UDP checksum type of process. Or to the SA to encrypt prior to sending out. If you set a bit in the protocol specific field that shouldn't be set, it could result in weird behavior. Possibly you have CRC or encryption enabled when it shouldn't be. It wouldn't stop your SGMII link, but you never know.

5. SGMII and QMSS have no direct communication. You could send an interrupt to the CPU to talk to both modules if you wanted them to communicate, but that's it. QMSS's only thought is to put a packet in the transfer queue to send out.

-Kat

0 Katherine Kelsch over 12 years ago in reply to Sergey Vasilev

TI__Intellectual 1585 points

It's not "closed" if you're still having issues; I don't want you to leave without an answer. But I think this is more of a design situation, not a switch problem.
