
Interesting problem with 1Gbit, NDK2.0, UDP send, DM648

Folks-

I have a "fun" problem with NDK 2.0, on custom hardware (DM648, Marvell PHY), sending UDP packets. It is a test setup for video transmission, derived from the "helloworld" project in the NDK. Upon receiving a packet from the host, the DSP responds by sending 10000 UDP packets back (in a while loop), each 1472 bytes (the same issue exists with sizes of 1024 and 512). The Ethernet connection is dedicated (through a hub/switch) between the DSP and a host PC.

At first, I was dropping sent packets (they never went out to the host, according to Wireshark) with either a 100Mbit link or a 1Gbit link, much more often in the 100Mbit case. Deciding that this was because my loop was throwing packets at the BIOS/drivers too fast, I added a sleep of 1 millisecond in the send loop, and then never lost a packet at either speed. Going with this, I increased PKT_MAX in ethdriver.c to allow more buffering of the packets I was throwing at the low level, and sure enough got no lost packets consistently with no sleep, but only with a 100Mbit/sec link to the host. [I averaged about 95Mbit/sec payload transfer rate in this case.]
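
For concreteness, here is a minimal sketch of the send loop in question (not the actual code; the function name is just for illustration). It assumes the socket and destination address were already set up the usual NDK way (fdOpenSession(), socket(), ...) and a 1 ms DSP/BIOS system tick, so TSK_sleep(1) gives roughly the 1 millisecond throttle mentioned above:

    #include <std.h>        /* DSP/BIOS */
    #include <tsk.h>        /* TSK_sleep() */
    #include <netmain.h>    /* NDK sockets: SOCKET, sendto(), ... */

    #define N_PACKETS   10000
    #define PAYLOAD_SZ  1472

    /* Blast N_PACKETS UDP datagrams at the stack; with 'throttle' set,
     * sleep one tick between packets (this made the drops go away). */
    void blast_udp(SOCKET s, struct sockaddr_in *sin, int throttle)
    {
        static char buf[PAYLOAD_SZ];
        int i;

        for (i = 0; i < N_PACKETS; i++) {
            if (sendto(s, buf, PAYLOAD_SZ, 0,
                       (struct sockaddr *)sin, sizeof(*sin)) < 0)
                break;          /* the stack refused the packet */
            if (throttle)
                TSK_sleep(1);   /* ~1 ms with a 1 ms system tick */
        }
    }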

When the link to the host is 1Gbit, I still get a few UDP packets (on average maybe 20 out of 10000, and always in bunches) that I hand to DSP/BIOS via "sendto" but that never make it to the PC. The PC does not get a partial packet, a corrupt packet, or anything. The reason this is interesting is that there is now never a problem on a 100Mbit/sec physical link. One would think that ANY issues the DSP has with latency, buffer sizes, etc. would be alleviated by a 1Gbit link (compared to 100Mbit), not aggravated, as was the case before I increased PKT_MAX. [I average about 300Mbit/sec payload transfer rate, lost packets notwithstanding.]

I tried setting the priority of the NETCTRL task to high, and increasing PKT_NUM_FRAMEBUF in pbm.c (per SPRU523G), but neither had any effect.
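
For reference, one documented way to run the NETCTRL scheduler at high priority is via NC_SystemOpen(). A minimal sketch, assuming the standard helloworld-style stack thread (the wrapper function is just for illustration; the rest of the configuration, CfgNew(), NC_NetStart(), etc., is unchanged):

    #include <netmain.h>
    #include <stdio.h>

    /* Open the NDK system with the network event scheduler at high
     * priority, in the usual interrupt-driven mode. */
    int open_stack_high_priority(void)
    {
        int rc = NC_SystemOpen(NC_PRIORITY_HIGH, NC_OPMODE_INTERRUPT);
        if (rc)
            printf("NC_SystemOpen failed (%d)\n", rc);
        return rc;
    }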

Does anyone have any idea what the problem could be, or suggestions for how I can diagnose it in the context of NDK 2.0? Again: the DSP occasionally fails to send bunches of UDP packets (e.g., 8 lost in a row) when the physical connection is 1Gbit, but it sends every packet in the same loop when there is a sleep in it (so it is likely not a physical-layer problem, or the chance of corruption would not depend on the send rate), and it sends every packet in the same loop with no sleep when the physical connection is 100Mbit/sec (so I can't imagine it being a problem with the amount of buffering in the low-level DSP code, or with task latencies).

Thank you for any suggestions,

Jim

  • Update/More info:

    I examined the 3PSW statistics register that counts transmitted frames (TXGOODFRAMES), and it always contains the correct number of frames that should have been transmitted, even in cases where some were never sent out by the PHY.

    I tried slowing down the maximum rate at which packets can be handed to the EMAC in the send loop, by slowing down the DSP clock and/or building in debug mode, and both of these methods made the missing-packet problem disappear.

    I tried looking at the minimum "burst" required at max speed to cause the problem (i.e., in the loop, send N packets consecutively as fast as possible, then sleep, then repeat); missing packets can occur for any N >= 2. (A sketch of this probe is below.)

    So... I never get the problem if the EMAC is getting data significantly slower than it is sending it out to the PHY (e.g., because of a sleep in the loop, or a slowed-down DSP clock), and I never get the problem if the EMAC is getting data significantly faster than it is sending it out to the PHY (e.g., when the DSP is going flat out and the link is 100Mbit/sec). And I always get the proper number of TXGOODFRAMES in the EMAC regardless of whether the packets actually went out or not.

    This leads me to believe there is a timing problem/race condition in the FIFO between the EMAC and the PHY that only happens when the FIFO is neither banging on empty nor banging on full. The symptom is that the EMAC sees that the packet arrived in the FIFO, but for some reason throws it out instead of sending it to the PHY. Has anyone seen anything like this on the DM648?
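
    Roughly what the burst probe looks like (a sketch, not the actual code; the names are illustrative). It assumes the same socket setup as the loop in my original post and a 1 ms system tick; the running tally of frames handed to the stack is what I compare against the hardware TXGOODFRAMES count:

        #include <std.h>        /* DSP/BIOS */
        #include <tsk.h>        /* TSK_sleep() */
        #include <netmain.h>    /* NDK sockets: SOCKET, sendto(), ... */

        #define N_PACKETS   10000
        #define PAYLOAD_SZ  1472

        static unsigned long framesQueued = 0;  /* compare against TXGOODFRAMES */

        /* Send N_PACKETS datagrams in bursts of burstLen, sleeping one tick
         * between bursts.  Drops were observed for any burstLen >= 2. */
        void burst_udp(SOCKET s, struct sockaddr_in *sin, int burstLen)
        {
            static char buf[PAYLOAD_SZ];
            int sent = 0, i;

            while (sent < N_PACKETS) {
                for (i = 0; i < burstLen && sent < N_PACKETS; i++, sent++) {
                    if (sendto(s, buf, PAYLOAD_SZ, 0,
                               (struct sockaddr *)sin, sizeof(*sin)) < 0)
                        return;             /* the stack refused the packet */
                    framesQueued++;
                }
                TSK_sleep(1);               /* pause between bursts */
            }
        }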

    Jim

  • Jim,

     

    This is not unlike the problem I currently have with a DM648 on a custom board. We are taking in raw video, processing it, and then converting it to JPEG for upload via Ethernet TCP sockets, with the DM648 as the server. The sent file regularly fails to arrive at the client.

    Thinking that the problem was with our code, I used the NDK benchmark example (NDK 2.0) and sent it data with iperf, as described here:

    http://processors.wiki.ti.com/index.php/NDK_benchmarks

    The EVM gets a good benchmark, but our board is very slow and regularly crashes before completion. Putting a printf in the receive loop of the benchmark testee program, we get a low benchmark, but it doesn't crash as often. 100Mbit is worse than 1Gbit. Resetting the DM648 (but not the PHY) and reloading behaves the same as a cold boot of the board. The hardware around the network is identical, with the same PHY as the EVM, but only one rather than the two on the EVM. The other big difference is that we have a 27MHz clock for the DM648 and the EVM has 33MHz. What is your hardware setup? Do you have one or two PHYs, and which crystal are you using?

     

    We have tested network reliability with a constant ping of the IP address from a PC, and our board is regularly unreachable with identical code that runs fine on the EVM. I think the problem might be related either to the cache not being set up correctly or to the different input clock, because in our app the ping test often fails when the JPEG is enabled, but with it disabled the ping is quite reliable (no upload running in either case). I've tried NDK 1.92 and NDK 2.0; both have the same problem.
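
    To be concrete about the cache theory, here is a minimal sketch of the kind of check I mean, assuming the DSP/BIOS 5.x BCACHE API; the address, size, and function name are placeholders for illustration, not our actual memory map:

        #include <std.h>
        #include <bcache.h>     /* DSP/BIOS 5.x cache API */

        /* Placeholder address/size: use whatever external-RAM region
         * actually holds the Ethernet packet buffers. */
        #define PKT_BUF_BASE   ((Ptr)0x80000000)
        #define PKT_BUF_SIZE   (0x01000000)        /* one 16 MB MAR page */

        /* Diagnostic: rule the cache out by marking the packet-buffer region
         * non-cacheable.  If the drops/crashes stop, the problem is cache
         * coherency (e.g., missing BCACHE_wbInv()/BCACHE_inv() calls around
         * the EMAC DMA). */
        void make_pkt_buffers_uncached(void)
        {
            BCACHE_setMar(PKT_BUF_BASE, PKT_BUF_SIZE, BCACHE_MAR_DISABLE);
        }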

     

    Any suggestions of things to try would be much appreciated.

     

    Simon

  • I can't offer any help, but I'm looking at a similar problem. Perhaps the same issue existed in 1.9x -- this is a problem we've had for more than a year and have blamed on a bug somewhere in our system. Recently we converted to NDK 2.0, and we're revisiting the issue.

    The symptoms are identical: packets go into sendto() and fail to make it out on the wire.
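
    To separate "the stack refused the packet" from "it was queued but never hit the wire", one simple check is to look at the sendto() return value and fdError(). A minimal sketch, assuming the NDK BSD-style socket layer (the wrapper name is just for illustration):

        #include <netmain.h>    /* NDK sockets: SOCKET, sendto(), fdError() */
        #include <stdio.h>

        /* Illustrative wrapper: returns 0 if the stack accepted the datagram,
         * otherwise logs the NDK error code from fdError() and returns -1. */
        int checked_sendto(SOCKET s, void *buf, int len, struct sockaddr_in *to)
        {
            if (sendto(s, buf, len, 0,
                       (struct sockaddr *)to, sizeof(*to)) < 0) {
                printf("sendto failed, fdError() = %d\n", fdError());
                return -1;
            }
            return 0;
        }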