NDK stops responding

Kevin Alden

I am using NDK 2_20_04_26 on a C6472 DSP.

I have an issue where after some length of time (~8 hours?), I am unable to ping the DSP.

I was able to catch the problem with the debug pod attached and I traced down the following in the CSL. It appears that RxPacket is called from EMAC_RxServiceCheck. My ethernet driver is responsible for calling EMAC_RxServiceCheck whenever there is an interrupt for an incoming packet. I put a breakpoint in my ISR and it never hits. So now the question is, why does my packet receive ISR stop getting called?

I checked the IER register and it looks like the EMAC interrupts are enabled (IE9 and IE10). I also see these interrupts in the Hwi section of the ROV. The IFR register never shows interrupt 9 or 10 go high, so it looks to me that something happened to the EMAC on the chip to stop interrupts from triggering.
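The IER/IFR check described above can be written down as a small helper. This is only a sketch: it works on register snapshots read in the debugger, the interrupt numbers 9 and 10 are the ones named in the post, and the function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Interrupt numbers taken from the post (IE9 and IE10). */
#define EMAC_INT_A 9u
#define EMAC_INT_B 10u

/* True when bit n of a 32-bit register snapshot is set. */
static bool reg_bit_is_set(uint32_t reg, unsigned n)
{
    return ((reg >> n) & 1u) != 0u;
}

/* Classify one interrupt line from IER and IFR snapshots.
 * Returns 0 = masked, 1 = enabled but idle, 2 = enabled and pending. */
static int int_line_state(uint32_t ier, uint32_t ifr, unsigned n)
{
    if (!reg_bit_is_set(ier, n))
        return 0;
    return reg_bit_is_set(ifr, n) ? 2 : 1;
}
```

A result of 1 for both lines while traffic is arriving matches the symptom described here: the CPU side is configured, but the peripheral is no longer raising the event.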

Has anyone seen something like this before?

over 13 years ago

0 Steven Connell over 13 years ago

TI__Mastermind 45025 points

Are you using the MCSDK? If so, which version?

Steve

0 Kevin Alden over 13 years ago in reply to Steven Connell

Intellectual 645 points

Here is a list of all the TI libraries I use:

bios_6_31_04_27

ipc_1_22_03_23

xdais_7_10_00_06

xdctools_3_20_08_88

ndk_2_20_04_26

pdk_c64x_1_00_00_06

csl_c6472_03_00_07_01

Also, I am using CGT 7.3.1

It doesn't look like I'm actually using the MCSDK. Our board doesn't have an actual PHY, so we had to modify some of the EMAC initialization code. We are compiling that modified Ethernet source directly into our application instead of using the MCSDK.

0 Steven Connell over 13 years ago in reply to Kevin Alden

TI__Mastermind 45025 points

Sounds like you are using your own custom hardware as well as your own driver implementation? If so, it would be a bit difficult for me to help much with that. We also don't see any issue like this on our 6472 EVM board and drivers that ship with the MCSDK.

Are you able to run your app on the 6472 EVM and see if the same problem exists? If so, this would help to pinpoint where the problem is.

Another thing I'd suggest in either case would be to use the ROV tool in CCS. You can find it under the CCS tools menu.

You can use ROV to get some insight into the system, such as heap usage, stack usage, see which tasks are running in the system, and much more (note that the target must be halted in order to use it).

You might specifically check to see if the NDK stack thread is still running when this happens, as well as your task thread stack usage.

Steve

0 Kevin Alden over 12 years ago in reply to Steven Connell

Intellectual 645 points

I have verified that everything looks OK using the ROV. Stack and heap usage are good and all my tasks are still running. I just stop getting interrupts from the EMAC.

Our driver actually came from TI. We just tweaked it because we don't have a PHY on our board so there is no link speed negotiation. We are hard wired up to a switch on our board. We are compiling directly into our code the following:

/*
 * File: ethdriver.c 
 *
 * Description:
 * Ethernet Packet Driver written using the CSL and NIMU
 * Packet Architecture guidelines.
 *
 * Copyright (C) 2009 Texas Instruments Incorporated - http://www.ti.com/

...

*/

/*
 * File: nimu_eth.c 
 *
 * Description: Ethernet Packet Driver rewritten using the NIMU Packet
 * Architecture guidelines.
 *
 * Copyright (C) 2009 Texas Instruments Incorporated - http://www.ti.com/

...

*/

At this point I probably need to reproduce using the EVM, but it may be a while before I get the chance to do that.

0 Victor Ivanov over 12 years ago in reply to Kevin Alden

Expert 1160 points

Hi Kevin and Steven!

It seems that I have a similar problem but with a different DSP and configuration.

I use a DM648 on our custom board, but the Ethernet-related part is an exact copy of the EVM board and I use an unmodified driver.

Also I use:

bios_5_42_00_07

ndk_20_20_06_35

pspdrivers_1_10_03

Our board sends and receives data to/from a PC. In the beginning, everything worked well. Then I started using an application on the PC which sends quite a lot of multicast messages (dozens per second) to other PCs (these messages are not for our board). Since then I have had the same issue that Kevin described: I can't ping the board, and it doesn't receive or send messages. I'm not absolutely sure that the multicast messages cause this problem. It is still under investigation and requires some time. However, our company has already had a problem where a third-party device stopped working after being exposed to multicast messages.

The board always tries to send UDP messages and, using Wireshark on the PC, I can see that the board still sends ARP requests but is not able to receive the response. (I haven't checked the interrupt, but EMAC_RxServiceCheck is not called.) So, the board doesn't know the PC's MAC. After I added the PC's MAC statically, the board started sending UDP messages without problems, but of course that didn't help with receiving.

Best Regards

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Kevin has mentioned interrupts in his messages, so I have tried using NC_OPMODE_POLLING to avoid using interrupts.

NC_SystemOpen( NC_PRIORITY_LOW, NC_OPMODE_POLLING);

Unfortunately it didn't help. The situation looks the same. I have not checked the details yet, but EMAC_RxServiceCheck is never called.

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Also I tried to filter out multicast messages by replacing ETH_PKTFLT_MULTICAST with ETH_PKTFLT_DIRECT, but it didn't help.

Now I'm testing the board while there is no multicast on the net. So far it works well but I have to wait at least 24 hours to be more or less sure.

By the way, I said before that I use an unmodified driver. Actually, I did a couple of changes in NDK code to be able to send UDP datagrams over MTU size. The changes were made according to spru523_ug.pdf, Section 3.4.1:

1. in "pbm.c" file. #define MMALLOC_MAXSIZE 16500
2. in "mem.c" file. #define RAW_PAGE_SIZE 16504

0 Kevin Alden over 12 years ago in reply to Victor Ivanov

Intellectual 645 points

I ran out of time to chase this issue further, but this is good information. It seemed to me that the issue is tied directly to the amount of network traffic sent to the DSP. If I sent no data, it continued to work. In your test, disabling multicast may just give the illusion of fixing the problem because there is less network traffic.

Although, I guess it is possible there is a bug in the NDK with handling multicast or broadcast packets. My environment didn't have any multicast traffic, but there was some broadcast traffic. However, the broadcast data alone never caused the issue.

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Victor,

The DM648 driver has a known bug that caused the stack to become unresponsive or even crash after several hours. There is a patch for this bug. Have you already done that by chance?

Also, I believe the driver code may have a bug in which there are printf() calls within ISR context, which can cause a lot of problems.

Please see the following thread which discusses these issues:

http://e2e.ti.com/support/embedded/bios/f/355/t/158336.aspx

Steve

0 Kevin Alden over 12 years ago in reply to Steven Connell

Intellectual 645 points

I looked at that patch and it just seems to move Interrupt_init up to the top of HwPktOpen. Did this actually resolve a longevity problem?

Also, has anything similar been seen/fixed on the 6472?

0 Victor Ivanov over 12 years ago in reply to Kevin Alden

Expert 1160 points

Thanks a lot for your responses!

Steven, this patch has already been installed. Also, today I removed all printf from ethss_dm648 but it didn't help.

Kevin, you are right about multicast messages. Indeed, they just increase incoming traffic. Our board sends a lot of data but receives only a few messages. I waited longer and I got the same problem without multicast messages.

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

During my experiments I have noticed a strange behavior of NDK. I'm not sure that this is related to the problem, which we are investigating, but still.

Experiment 1.
A PC sends UDP messages to the board. The board doesn't send anything.
When I start an application on the PC, the PC sends an ARP request, gets an ARP response from the board and starts sending UDP messages, one short message every two seconds. The board receives them successfully. If I check the ARP table on the board at this moment, I can see a record with the PC's IP and MAC.

Address         Subnet Mask     Flags  Gateway
--------------- --------------- ------ -----------------
192.168.1.0     255.255.255.0   U C    if-1
192.168.1.1     255.255.255.255 U H L  local (if-1)
192.168.1.126   255.255.255.255 U H    00:1D:60:38:E8:10

After about 2 minutes, the PC's record disappears from the ARP table:

Address         Subnet Mask     Flags  Gateway
--------------- --------------- ------ -----------------
192.168.1.0     255.255.255.0   U C    if-1
192.168.1.1     255.255.255.255 U H L  local (if-1)

In about another 7 minutes the board stops receiving messages. Function recvfrom returns SOCKET_ERROR and fdError() returns 35 which means timeout. Function EMAC_RxServiceCheck is not called.

After about 1 minute the PC sends an ARP request, EMAC_RxServiceCheck is called, the PC gets an ARP response from the board and the board resumes receiving UDP messages. Its ARP table again contains a record with the PC's IP and MAC:

Address         Subnet Mask     Flags  Gateway
--------------- --------------- ------ -----------------
192.168.1.0     255.255.255.0   U C    if-1
192.168.1.1     255.255.255.255 U H L  local (if-1)
192.168.1.126   255.255.255.255 U H    00:1D:60:38:E8:10

Then all the sequence repeats.

Experiment 2.
The IP and MAC of the board are written into the ARP table of the PC statically, so the PC never sends ARP requests. When I start an application, the PC starts sending UDP messages to the board. The board receives the messages successfully. If I check the ARP table of the board, I can see that it doesn't contain any information about the PC.

Address         Subnet Mask     Flags  Gateway
--------------- --------------- ------ -----------------
192.168.1.0     255.255.255.0   U C    if-1
192.168.1.1     255.255.255.255 U H L  local (if-1)

In about 10 minutes the board stops receiving messages and it never resumes it. EMAC_RxServiceCheck is never called. However, if I remove a static record about IP and MAC of the board from PC's ARP table and PC sends an ARP request, the board resumes receiving the messages.

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Victor,

Victor Ivanov said:
Actually, I did a couple of changes in NDK code to be able to send UDP datagrams over MTU size.

Are you trying to enable jumbo frames? The DM648 hardware doesn't support jumbo frames. Please see here for more info: http://processors.wiki.ti.com/index.php/Network_Developers_Kit_FAQ#Q:_Does_DM648_EVM_support_Gigabit_Ethernet.3F_What_about_Jumbo_Packets.3F

This thread also has more info on jumbo frames: http://e2e.ti.com/support/embedded/bios/f/355/t/149900.aspx

If not, I think the next step is to try to debug the stack a bit to figure out how far the packets are getting in the case in which you stop receiving them. You can do this by adding break points, starting with the driver. Do you see the packets coming into the driver's receive ISR?

If they are, then the next place a received packet will go is into the NIMU layer. In there you will find a switch statement that passes the packet up the stack based on the Type field of the Ethernet header. The code looks like this:

    if (Type == 0x8100)
        Type = VLANReceivePacket (hPkt);

    /* Dispatch the Packet to the appropriate protocol layer. */
    switch( Type )
    {
        case 0x800:
        {
            /* Received packet is an IP Packet. */
            IPRxPacket( ptr_pkt );
            break;
        }
        case 0x806:
        {
            /* Received packet is an ARP Packet. */
            LLIRxPacket( ptr_pkt );
            break;
        }
...

Do you see the ARP packet being received here?

Note that you may need to build the NDK libraries for debug mode.
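If stepping through the NIMU code is awkward, the same classification can be reproduced on a captured frame. The following is a minimal sketch, not NDK code: the names are hypothetical, and it only mirrors the EtherType values shown in the excerpt above (0x800, 0x806, and one 0x8100 VLAN tag).

```c
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IP   0x0800u
#define ETHERTYPE_ARP  0x0806u
#define ETHERTYPE_VLAN 0x8100u

typedef enum { PKT_IP, PKT_ARP, PKT_OTHER, PKT_TOO_SHORT } PktClass;

/* Classify a raw Ethernet frame by its EtherType.
 * The type field follows the 6-byte destination and 6-byte source MAC.
 * A single 802.1Q tag (4 bytes) is skipped if present. */
static PktClass classify_frame(const uint8_t *frame, size_t len)
{
    size_t off = 12u;
    unsigned type;

    if (len < off + 2u)
        return PKT_TOO_SHORT;
    type = ((unsigned)frame[off] << 8) | frame[off + 1u];

    if (type == ETHERTYPE_VLAN) {
        off += 4u;
        if (len < off + 2u)
            return PKT_TOO_SHORT;
        type = ((unsigned)frame[off] << 8) | frame[off + 1u];
    }

    switch (type) {
    case ETHERTYPE_IP:  return PKT_IP;
    case ETHERTYPE_ARP: return PKT_ARP;
    default:            return PKT_OTHER;
    }
}
```

Counting the results per class in the driver's receive path gives a quick view of whether ARP frames still reach the dispatch point once the board stops answering.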

Another thing to try is the Windows application called testudp.exe that is shipped with the NDK. This app is meant to run with the NDK's client example and sends UDP data to an echo server running on the board. A good sanity check would be to make sure that the client example's functionality is working with your setup as well as with the testudp app.

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Hi Steven!

Thanks a lot for your response!

No, I'm not trying to use jumbo frames. I just want to use UDP messages which are bigger than MTU size. In this case one big UDP message is split to several short messages (each is smaller than MTU) on the sending side and then they are assembled back to one UDP message on the receiving side. I made only the changes which are described in spru523_ug.pdf Section 3.4.1. This mechanism works well on DM648. Moreover, I made a test with small UDP messages and unmodified NDK and the result was the same.

I have not made the debugging of the stack yet, I'll do it tomorrow. However, I have tried testudp.exe with the NDK's client example.

For the experiment I used the evaluation board TMDXDVP648, the example and testudp.exe from NDK 2.0 and the gel file EVMDM648.gel provided by Lyrtech with the board .

The example was originally designed for CCS v3.3, so I had to convert the project for CCS v5.1. Also I replaced using of DHCP with a static IP address (char *LocalIPAddr = "192.168.1.1";)

First of all, I compiled the example using libraries from NDK 2.0. Then I copied source files of ethss_dm648 into my project and removed all printf. And finally I used NDK 2.20.06.35 (the latest NDK which I was able to use with DSP/BIOS 5). In all cases the behavior of the test was the following:

For several minutes (1-10) it worked well, but then testudp.exe reported an error

...................................................

Test loop passed - resetting
Test loop passed - resetting
Failed on size 123

I ran the test several times and size mentioned by it seems to be random.

As I saw using Wireshark, the board hadn't sent the response.

.........................................................................................................

26689 4.285658 192.168.1.126 192.168.1.1 ECHO Request

26690 4.285704 192.168.1.1 192.168.1.126 ECHO Response

26691 4.285730 192.168.1.126 192.168.1.1 ECHO Request

Then I pinged the board from the PC and it responded. Using Wireshark I saw an interesting thing

26692 28.998115 192.168.1.126 192.168.1.1 ICMP Echo (ping) request (id=0x0400, seq(be/le)=36286/48781, ttl=128)

26693 28.998177 192.168.1.1 192.168.1.126 ECHO Response

26694 28.998203 192.168.1.1 192.168.1.126 ICMP Echo (ping) reply (id=0x0400, seq(be/le)=36286/48781, ttl=255)

As you can see, before responding to ping, the board sent the response to testudp.exe. Then I restarted testudp.exe (without restarting the board) and it worked for a few minutes and then the board stopped again. So, unlike my previous experiment, in this test the board didn't stop completely.

However, when I started the application which sends plenty of UDP messages to the board (to another port which is not used in the client example, and without waiting for the response), the board stopped responding to ping as it did before.

I will continue the investigation tomorrow.

Best Regards

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Victor,

A couple of comments ...

Victor Ivanov said:
No, I'm not trying to use jumbo frames. I just want to use UDP messages which are bigger than MTU size. In this case one big UDP message is split to several short messages (each is smaller than MTU) on the sending side and then they are assembled back to one UDP message on the receiving side

Yes, this sounds completely normal and is how all TCP/IP stacks (UDP in your case) should work. All physical media have an MTU defined; in the case of Ethernet it is 1500 bytes for the payload, though it may be bigger or smaller for other physical media. In your application, when you send a large chunk of data over a UDP socket, the stack's IP layer will fragment the datagram into pieces that each fit into the 1500-byte Ethernet frame payload (the Ethernet MTU is the bottleneck).

I thought based on what you said that you wanted to use jumbo frames (to enable larger than 1500 byte Ethernet payload).
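To put numbers on the fragmentation described above, here is a small sketch that estimates how many IPv4 fragments one UDP datagram needs. It assumes a 20-byte IPv4 header without options and the 8-byte UDP header; the function name is made up for illustration.

```c
#include <stdint.h>

#define IPV4_HEADER_BYTES 20u /* assumes no IP options */
#define UDP_HEADER_BYTES   8u

/* Number of IPv4 fragments needed to carry one UDP datagram with
 * udp_payload bytes of data over a link with the given MTU.
 * Every fragment except the last must carry a multiple of 8 bytes.
 * Returns 0 if the MTU is too small to carry any fragment data. */
static uint32_t udp_fragment_count(uint32_t udp_payload, uint32_t mtu)
{
    uint32_t per_fragment;
    uint32_t total;

    if (mtu < IPV4_HEADER_BYTES + 8u)
        return 0u;

    per_fragment = ((mtu - IPV4_HEADER_BYTES) / 8u) * 8u;
    total = udp_payload + UDP_HEADER_BYTES;
    return (total + per_fragment - 1u) / per_fragment;
}
```

With a 1500-byte MTU, a 1472-byte payload still fits in one frame, 1473 bytes needs two, and a 16000-byte payload needs eleven. Losing any one fragment drops the whole datagram, which is worth keeping in mind when the receive path is already struggling.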

Victor Ivanov said:
Also I replaced using of DHCP with a static IP address (char *LocalIPAddr = "192.168.1.1";)

Are you sure that this address is valid? 192.168.1.1 is usually the IP address for the router on your private network. If that's the case then I'm worried that you have a scenario of duplicate IP addresses, with multiple hosts on your LAN having the same IP address (the Router and your NDK host). This scenario could produce unpredictable results.

I'd recommend that you find an IP address that you know is free (192.168.1.100?) and then retry your test.

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Steven,

for all my latest experiments I'm using a direct connection between the PC and the board. To avoid complications, I set PC's and board's addresses statically and connected them by a single cable without a router. However, I changed the IP according to your suggestion (just in case) but it didn't help.

Also, today I tried to debug the driver. I added a break point in EMAC ISR void HwInt(void *stub). After the board had stopped receiving messages, the program never reached the break point.

Thanks

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Yesterday I wrote that HwInt is not called, because a breakpoint in HwInt was never reached after the board stopped receiving messages, but today I see that HwInt is called. I believe that I didn't change anything, so I can't explain the difference in behavior.

Anyway, the situation today is the following :

when the board stops receiving messages, HwInt is still called.

It calls EMAC_TxServiceCheck(hEMAC) which returns CPSW3G_ERR_MACFATAL.

Thereafter, emac_fatal_error counter is increased.

Inside EMAC_TxServiceCheck there is reading of the error

/* Read the error status - we'll decode it by hand */
pd->FatalError = CPSW3G_REGS->DMASTATUS;

The value is 0x2000, which indicates (according to spruf57b.pdf) that "OWNERSHIP bit not set in input buffer".
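For reference, decoding the status word by hand can be sketched as below. The bit positions are an assumption (they follow the layout commonly used by TI EMAC status registers: TX error code in bits 23..20, TX channel in 18..16, RX error code in 15..12, RX channel in 10..8) and should be checked against spruf57b.pdf before relying on them. The type and function names are hypothetical.

```c
#include <stdint.h>

/* Fields of the DMA status word. Bit positions are ASSUMED, see above. */
typedef struct {
    unsigned tx_err_code;    /* assumed bits 23..20 */
    unsigned tx_err_channel; /* assumed bits 18..16 */
    unsigned rx_err_code;    /* assumed bits 15..12 */
    unsigned rx_err_channel; /* assumed bits 10..8  */
} DmaStatusFields;

/* Split a raw DMASTATUS value into its error fields. */
static DmaStatusFields decode_dma_status(uint32_t value)
{
    DmaStatusFields f;

    f.tx_err_code    = (unsigned)((value >> 20) & 0xFu);
    f.tx_err_channel = (unsigned)((value >> 16) & 0x7u);
    f.rx_err_code    = (unsigned)((value >> 12) & 0xFu);
    f.rx_err_channel = (unsigned)((value >> 8) & 0x7u);
    return f;
}
```

Under that assumed layout, 0x2000 decodes to RX error code 2 on channel 0 with no TX error, which would be consistent with the "input buffer" wording of the message quoted from the manual.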

Thanks

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Hi Victor,

I've asked for some help from the author of the 6472 Ethernet driver. Hopefully he is familiar with these errors you have encountered.

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Hi Steven,

Thanks a lot, but, I have a comment:

Kevin started this discussion about the C6472 DSP and I joined it because it seems that we have similar problems. But I use a DM648 (as I specified in my first post) and all my experiments were done with the DM648 and its driver. I don't know how these two DSPs are similar or different because I have used only the DM648 so far.

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Victor,

My mistake, I did accidentally mistake Kevin's h/w set up for yours.

I've been hunting around and there's a couple of possibilities.

First, I believe you mentioned you are using the PSP drivers? If so, we have seen conflicts in the event combiner configurations of the PSP drivers and the Ethernet driver for DM648. Can you check if there's overlap with the DM648 HWI mappings and event combiner settings in your configuration file? See this post for details (for different h/w but it still applies):

http://processors.wiki.ti.com/index.php/Network_Developers_Kit_FAQ#Q:_It_seems_that_the_NSP_for_OMAPL137_EVM_is_incompatible_with_the_EDMA3_LLD._How_can_I_fix_it.3F

Second, I found another post describing a problem that was found for DM648. Please check this to see if it may be what you're hitting:

http://e2e.ti.com/support/embedded/bios/f/355/p/57649/205966.aspx#205966

Also:

http://e2e.ti.com/support/embedded/bios/f/355/t/59642.aspx

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Steven,

I did mention pspdrivers_1_10_03 in one of the posts at the beginning of the discussion. I use soc.h and cslr_*.h files from pspdrivers in my project, but I don't use any *.lib or source code from pspdrivers.

Anyway, since you recommended that I try the helloWorld example, I have been doing all my experiments with it. As far as I know it doesn't use the PSP drivers. I copied this example from NDK2.0 and made only a few changes: I converted the project from CCS v3.3 to CCS v5.1 and replaced DHCP with a static IP. The config file remains untouched. Later I copied the ethss_dm648 source code to be able to remove the printf calls, and then I copied the NDK code to be able to set breakpoints. In this way I tried to make as few changes to the example as possible, and I'm now using the EVM board, not our custom board.

Unfortunately, a helloWorld.out file is not provided with the example, so there could be some differences between the executable which was tested by TI and the executable that I'm working with. I use a different version of CCS, a different version of the compiler, and a different DSP/BIOS version. Of course, we all expect that it should not matter, but still something goes wrong on my side and I can't figure out the reason.

About http://e2e.ti.com/support/embedded/bios/f/355/p/57649/205966.aspx#205966. This thread is about a bug which prevented the device to properly recognize 10/100 Mb connection. I saw this thread and the bug is fixed in my working project. Moreover, I work with 1Gb connection because we need to send quite a lot of data. So, I think that I see another problem.

About http://e2e.ti.com/support/embedded/bios/f/355/t/59642.aspx. In this case there was a bug in the user's code. But in my case, when I use the helloWorld example, none of my code is involved at all. Moreover, in my case the board doesn't halt completely.

Here I have tried to summarize the most useful information about my problem:

I use DM648 Evaluation Module made by Lyrtech ID00103, CCS v5.1, bios_5_42_00_07, ndk_20_20_06_35, ethss_dm648 with removed printf, EVMDM648.gel provided with the board, helloWorld example and testudp.exe from NDK2.0, The board has a static IP (192.168.1.100) and directly connected to the PC (192.168.1.126).

Experiment 1.

I load the helloWorld example onto the board and start testudp.exe on the PC. Everything works well for several minutes (1-10), but then testudp.exe reports an error

...................................................

Test loop passed - resetting
Test loop passed - resetting
Failed on size 143

I ran the test several times and size mentioned by it seems to be random.

As I saw using Wireshark, the board hadn't sent the response.

Then I pinged the board from the PC and it responded. Using Wireshark I saw an interesting thing

As you can see, before responding to ping, the board sent the response to testudp.exe. It seems that the latest message from the testudp.exe stuck somewhere and then it was pushed by the ping request. Then I restarted testudp.exe (without restarting the board) and it worked for another few minutes and then it didn't respond again. (It can be repeated many times)

Experiment 2.

Using the same helloWorld example on the board, on the PC I started the application which sends plenty of UDP messages to the board (to another port which is not used in the helloWorld example and without waiting for the response)

while(1)
{
    sendto(RawDataSocket, (const char*)Buffer, 1000, 0, (sockaddr*)&Sockaddrin, sizeof(Sockaddrin));
    sendto(RawDataSocket, (const char*)Buffer, 1000, 0, (sockaddr*)&Sockaddrin, sizeof(Sockaddrin));
    Sleep(1);
}

To monitor the board's state I pinged it continuously.

After several minutes (varies from a few minutes to a couple of hours) the board stopped responding to ping.

I set a breakpoint in HwInt and found out that it calls EMAC_TxServiceCheck(hEMAC), which returns CPSW3G_ERR_MACFATAL.

Thereafter, emac_fatal_error counter is increased.

Inside EMAC_TxServiceCheck there is reading of the error

/* Read the error status - we'll decode it by hand */
pd->FatalError = CPSW3G_REGS->DMASTATUS;

The value is 0x2000, which indicates (according to spruf57b.pdf) that "OWNERSHIP bit not set in input buffer".

Thanks for your interest in my problem!

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Victor,

Ok. I think at this point it's best that I try to reproduce the issue you are seeing here. Can you please attach your project to this thread? (if your project contains sensitive code that should not be public, then we will need to share it another way. But since you say it's just the basic hello world app, then I think it should be ok. But if not please let me know).

Also if you could just reconfirm/clarify the following again:

Victor Ivanov said:
helloWorld example and testudp.exe from NDK2.0, The board has a static IP (192.168.1.100) and directly connected to the PC (192.168.1.126).

Do you also see the issue with DHCP assigned addresses?

Is there a router or switch in between the PC and board?

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Steven,

Thanks a lot for your help!

We don't have any switches or routers between the PC and the board. They are directly connected with one cable. Also, I have never tried using DHCP because so far we use the board only with a statically configured IP address. As an experiment, I connected the PC and the board via a switch but it didn't change the situation.

However, I found by chance what does change the situation! I connected the board to another PC and now everything has been working fine for more than 24 hours. It seems that my first PC sometimes sends something inappropriate and the board can't handle it. Now I'm trying my boards with other PCs and so far everything works fine.

For this reason I believe that you will not be able to reproduce the situation on your side.

Well, I'm glad that in most cases the board works well, but at the same time I'm worried that incorrect behavior of an external system can affect mine. I'll try to find out what exactly this PC sends, but I'm not sure how to do it.

Thanks again!

Victor

0 Kevin Alden over 12 years ago in reply to Victor Ivanov

Intellectual 645 points

I know we don't have the same board, but I have suspected that something similar is happening on my end. I would be very interested in any types of traffic that cause your board to choke.

0 Victor Ivanov over 12 years ago in reply to Kevin Alden

Expert 1160 points

Steven, I found out why HwInt was not called at the beginning of my experiments when the board stopped receiving data, whereas now it is called and reports "OWNERSHIP bit not set in input buffer". There was a printf in the function StatusUpdate, which is called from the interrupt handler.

static void StatusUpdate( Handle hApplication )
{
    EMAC_Status Status;

    if( (Uint32)hApplication != 0x12345678 )
    {
        return;
    }

    EMAC_getStatus(hEMAC, &Status);
    printf("Tx: %d Rx: %d FatalError: %d \n", Status.DmaStatus.txPending, Status.DmaStatus.rxPending, Status.DmaStatus.errPending);

    emac_fatal_error++;
}

Since I removed the printf as you recommended, HwInt is called and reports an error. So the driver knows that something went wrong, but can it restore normal operation without restarting the board?
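One way to keep the diagnostic without calling printf in interrupt context is to record the error in the ISR and print it later from a task. The sketch below only illustrates that pattern: the type and function names are hypothetical and are not part of the driver.

```c
#include <stdint.h>
#include <stdio.h>

/* Error state shared between the ISR and a reporting task. */
typedef struct {
    volatile uint32_t fatal_errors;    /* incremented by the ISR */
    volatile uint32_t last_dma_status; /* last raw status seen by the ISR */
    uint32_t reported;                 /* errors already printed by the task */
} EmacErrorLog;

/* ISR side: store the facts, do no I/O. */
static void emac_error_record(EmacErrorLog *log, uint32_t dma_status)
{
    log->last_dma_status = dma_status;
    log->fatal_errors++;
}

/* Task side: print only when new errors have arrived.
 * Returns the number of errors recorded since the previous report. */
static uint32_t emac_error_report(EmacErrorLog *log)
{
    uint32_t seen = log->fatal_errors;
    uint32_t fresh = seen - log->reported;

    if (fresh != 0u) {
        printf("EMAC fatal errors: %lu (last DMASTATUS 0x%08lX)\n",
               (unsigned long)seen, (unsigned long)log->last_dma_status);
        log->reported = seen;
    }
    return fresh;
}
```

On a real target the two ISR-written fields are not read atomically together, so the printed status may belong to a slightly newer error than the count; for a debug counter that is usually acceptable.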

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Hello Steven!

My conclusion that everything works fine except in one case was premature. Indeed, the helloWorld test worked for more than 24 hours without problems. However, after I modified the helloWorld example to bring it closer to what I do in my working project, the problem appeared again. In my working project I send quite a lot of data (UDP) from the board to the PC, and there is also an exchange of control messages between the board and the PC. So, in the helloWorld project I removed the dtask_udp_hello daemon and added a thread which sends UDP messages in a loop.

int MainTask(Arg arg0)
{
    SOCKET RawDataSocket;
    struct sockaddr_in Sockaddrin;
    static Uint8 Buffer[1000];
    struct timeval Timeout;
    int i;
    int iRet;

    fdOpenSession( (HANDLE)TSK_self() );

    bzero( &Sockaddrin, sizeof(struct sockaddr_in) );
    Sockaddrin.sin_family = AF_INET;
    Sockaddrin.sin_len = sizeof(Sockaddrin);
    Sockaddrin.sin_addr.s_addr = inet_addr("192.168.1.126");
    Sockaddrin.sin_port = htons(15100);

    for(i=0;i<1000;i++)
    {
        Buffer[i]=(Uint8)i;
    }

    Timeout.tv_sec = 5;
    Timeout.tv_usec = 0;

    // Create a socket for sending raw data:
    RawDataSocket = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if( RawDataSocket == INVALID_SOCKET )
    {
        printf("\r\nERROR (Can't create RawDataSocket, Error=%d)",fdError());
        return -1;
    }

    // Configure timeout:
    setsockopt( RawDataSocket, SOL_SOCKET, SO_SNDTIMEO, &Timeout, sizeof( Timeout ) );
    setsockopt( RawDataSocket, SOL_SOCKET, SO_RCVTIMEO, &Timeout, sizeof( Timeout ) );

    while(1)
    {
        for(i=0;i<25;i++)
        {
            iRet=sendto(RawDataSocket, Buffer, 1000, 0, (PSA)&Sockaddrin, sizeof(Sockaddrin));
            if(iRet==-1)
            {
                printf("\r\nERROR (sendto failed, Error=%d)",fdError());
                break;
            }
        }
        TSK_sleep(1);
    }
}

This thread is created in NetworkOpen function:

static void NetworkOpen()
{
    // Create our local server
    //hHello = DaemonNew( SOCK_DGRAM, 0, 7, dtask_udp_hello, OS_TASKPRINORM, OS_TASKSTKNORM, 0, 1 );

    TSK_Handle tskhMainTask;
    TSK_Attrs attr = TSK_ATTRS; /* task attributes */

    attr.priority=3; /* execution priority */
    attr.name = "MainTask"; /* printable name */
    attr.exitflag = 1; /* prog termination requires */
    attr.stacksize = 0x10000;

    tskhMainTask = TSK_create((Fxn)MainTask, &attr, 0);
    if(tskhMainTask == NULL)
    {
        printf("\r\nERROR (Can't create MainTask)");
    }
}

When I start this example on the board and ping the board continuously from the PC (ping 192.168.1.100 -t), the board stops responding to ping within 1-2 hours. Also, the sendto function starts returning code 65, which means "No route to host". At the same time, the board sends ARP requests to obtain the PC's MAC address and the PC sends ARP responses (I can see this with Wireshark). I think this happens because the board doesn't receive packets, including ARP responses. Looking into HwInt with the debugger, I can see that EMAC_TxServiceCheck(hEMAC) is never called, but EMAC_RxServiceCheck(hEMAC) is called regularly. In contrast to my previous experiments, NO errors are reported.
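When logging these failures it helps to print a readable text next to the raw number. The helper below is a sketch that covers only the two codes whose meaning is stated in this thread (35 and 65); the function name is hypothetical, and any other code should be looked up in the stack's own error header.

```c
/* Translate the two fdError() values discussed in this thread.
 * The meanings are the ones given in the posts, not a full table. */
static const char *socket_error_text(int code)
{
    switch (code) {
    case 35:
        return "timeout";
    case 65:
        return "no route to host";
    default:
        return "unknown";
    }
}
```

For example, the existing message could become printf("sendto failed, Error=%d (%s)", err, socket_error_text(err)) with err holding the value returned by fdError().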

I have uploaded this example (helloworld.zip) along with evmdm648.gel and helloWorld.out. It seems that I can't solve this problem without your help. Could you please have a look at the code to check whether I did it right? If everything is correct, could you please try to reproduce the problem on your side?

Thank you very much for the help!

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Hi Victor,

I didn't see the helloworld.zip + gel file, etc. Could you please attach? Then I can run on my side and try to reproduce the issue you are seeing.

Thanks,

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Hi Steven,

I probably did something wrong with the attachment last time. I hope I did it right this time.

0044.helloWorld.zip

8688.EVMDM648.gel

Thanks a lot!

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Hi Steven,

have you had an opportunity to try the example?

Thanks

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Hi Steven!

I still can't solve my problem. Could you please try my example? I need to find out what is wrong, but I can't do it without your help.

Thanks!

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Hi Steven!

Just a reminder...

Victor

0 Damian Paklos over 12 years ago in reply to Victor Ivanov

Intellectual 465 points

Dear Team, Steven,

would you be so kind and continue to support Victor, if possible?

Thank you in advance,

Damian

0 Gilen over 12 years ago in reply to Damian Paklos

Intellectual 300 points

Hi all,

I have the same issue on a BeagleBone with NDK and SYS/BIOS.

The board hangs and halts when frames flood the network. I can reproduce it with a "ping flood" attack or with RTP video streaming using multicast frames.

Is there any way to avoid this effect?

Regards,

Guillermo

0 Victor Ivanov over 12 years ago in reply to Gilen

Expert 1160 points

Hi All,

it seems that there is a difference between a 1Gb connection and a 100Mb connection. I connected the same board with the same test to a PC with a 100 Mb card, and the test has been running for more than 48 hours so far. Of course, the amount of data sent via the 100Mb connection is less than via 1Gb.

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

I modified the test. Now it sends 80 Mb per second, so I send the same amount of data via the 1Gb connection and via the 100Mb connection.

Reducing the amount of data didn't change the situation. The board connected via 1Gb stopped responding to ping after about half an hour.

Victor

0 Kevin Alden over 12 years ago in reply to Victor Ivanov

Intellectual 645 points

As another data point, my problem on the 6472 occurs using 1gbit. My board is designed such that only 1gbit is available so I cannot try 100 mbit.

0 Victor Ivanov over 12 years ago in reply to Kevin Alden

Expert 1160 points

The board connected to the PC with 100 Mb card is still working...

Meanwhile, I connected two other boards to two other PCs with 1Gb cards via 100 Mb switches, so in both cases 100Mb links are established. The switches and cards are identical but the PCs are different. Let's call them PC1 and PC2.

The card which is connected to PC1 has worked for about a day and is still working. The card which is connected to PC2 stopped responding after about half an hour. I restarted it several times and each time it stopped responding.

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Hi All.

We continued our investigation.

First of all, I slightly misunderstood my previous experiments. When a PC receives a UDP message and no application is waiting for messages on that port, the PC sends an ICMP reply to the sender. So, in my previous experiments, the PC sent not only pings but also these ICMP packets. We are therefore back where we started: if the board sends a lot of data to a PC and the PC sends data to the board, NDK stops working. When we configured the PC not to send responses to UDP, the board worked for at least a day. It doesn't matter whether the interface is 1Gb or 100Mb; what matters is that data is sent to the board.

Because sending data to the board is important for reproducing the problem, we replaced the OS on the PC. We replaced Windows with FreeBSD, which has a set of utilities for network testing. Our configuration was again a direct connection between the board and the PC, with statically set IP addresses. We tested both 100 Mb and 1Gb connections. We also tested two cases: the board sending UDP packets (80Mb/s), and the board doing nothing, with none of my code in the project at all. To generate a stream of data from the PC to the board, the following command was used:

hping --flood --udp --ipproto 1 -d 20 192.168.1.100

To make waiting shorter we started hping in two threads.

In both cases, whether the board is sending data or doing nothing, the NDK stopped working. It took seconds when the board was sending data and minutes when it was not.

We also tested sending other types of packets (ICMP, TCP, raw) and the NDK stopped in all cases, so it seems that the type of traffic is not important. (These tests were done only with the board that was sending data.)

Then I decided to check whether restarting the NDK makes it work again. I restarted the NDK using NC_NetStop(1) and it resumed working, but when it stopped working the second time, restarting didn't help. However, I found out that the NDK successfully restarts only once on DM648. See the thread http://e2e.ti.com/support/embedded/bios/f/355/t/239456.aspx.

Please pay attention to this problem! You can now easily reproduce it on your side, and it appears without a long wait. The problem appears when only the NDK stack is running and there is none of my code in the project at all! It appears on a board which was bought from TI.

Best Regards

Victor

0 krishnamaiden over 12 years ago in reply to Victor Ivanov

Prodigy 155 points

Hi Victor,

I am not an expert on the eth driver and NDK, but going through your email I had a few questions:

1. What do you mean by NDK stops?
2. Do you receive the Ethernet Tx and Rx interrupts? Is 'emacDequeueRx' being called?
3. If the answer to 2 is yes, then add 'memory_squeeze_error' to the watch window and please track its value. It should not be greater than 0.
4. In your application, does any module disable/enable global interrupts?

Regards

Krishna

0 Victor Ivanov over 12 years ago in reply to krishnamaiden

Expert 1160 points

Hi Krishna,

1. I mean that NDK stops receiving any packets. It is able to send but can't receive and it never recovers.

2. I receive Tx interrupts but not Rx. emacDequeueRx is not called.

3. I can easily get memory_squeeze_error > 0 when I test the system with an hping flood. The NDK receives more packets than it is able to process and drops them. That's normal. However, after some time the NDK stops receiving packets completely (memory_squeeze_error is not increasing in this situation) and doesn't recover when the load is removed. This is the problem. In my working project I don't overload the system but the problem still happens.

4. Yes, in my application I enable several interrupts but in a test example I don't. Actually, most of the experiments were done with the project which almost doesn't have my code.
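The symptom described in point 3 (no packets arriving and the drop counter frozen) could be detected automatically by watching a running RX packet count. The following is only a rough sketch of that idea; the struct, the function names and the threshold are all hypothetical and are not part of the NDK or the driver.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical monitor state; none of these names come from the NDK. */
typedef struct {
    uint32_t last_rx_count;   /* RX packet count seen at the previous poll */
    uint32_t idle_polls;      /* consecutive polls without a new RX packet */
    uint32_t stall_threshold; /* idle polls needed before reporting a stall */
} rx_stall_monitor;

void rx_stall_init(rx_stall_monitor *m, uint32_t threshold)
{
    m->last_rx_count = 0;
    m->idle_polls = 0;
    m->stall_threshold = threshold;
}

/* Call periodically with the driver's running RX packet count.
 * Returns true once the count has not advanced for
 * stall_threshold consecutive polls. */
bool rx_stall_poll(rx_stall_monitor *m, uint32_t rx_count)
{
    if (rx_count != m->last_rx_count) {
        /* Progress was made: remember the new count and reset. */
        m->last_rx_count = rx_count;
        m->idle_polls = 0;
        return false;
    }
    if (m->idle_polls < m->stall_threshold) {
        m->idle_polls++;
    }
    return m->idle_polls >= m->stall_threshold;
}
```

A quiet network also produces no RX progress, so a check like this is only meaningful while traffic is known to be arriving, for example while the ping from the PC is running.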

Best Regards

Victor

0 krishnamaiden over 12 years ago in reply to Victor Ivanov

Prodigy 155 points

Hi Victor,

Thanks for your response.

Do the eth statistics registers from 0x2d03400 onward report any kind of errors, either CRC or frame overruns? Also, does DMASTATUS show anything when your app stops receiving RX interrupts? I recollect you mentioning once that DMASTATUS shows the HOST_PEND interrupt; does that happen in the sample code that you are working on?
Since your problem arises when you start transmitting too: I have observed that while enqueuing the Tx packets in 'EMAC_sendPacket', the driver disables both Rx and Tx interrupts. I am not sure why Rx interrupts should also be disabled for the same period of time.
When the system does not receive any RX interrupts, does the system show the RX interrupt as enabled?

0 Victor Ivanov over 12 years ago in reply to krishnamaiden

Expert 1160 points

Hi Krishna,

I have repeated the simple test several times:

1. There weren't any CRC errors or frame overrun errors.

2. DMASTATUS sometimes contained 0x2000 (OWNERSHIP bit not set in input buffer) and in other cases 0x80000000 (IDLE).

3. About interrupts:

RX_CONTROL->RX_EN =1

RX_INTSTAT_RAW=0

RX_INTSTAT_MASKED=0

RX_INTMASK_SET=1

RX_INTMASK_CLEAR=1

DMA_INTSTAT_RAW=0/1/2/3

DMA_INTSTAT_MASKED=0/1/2/3

DMA_INTSTAT_SET=3

DMA_INTSTAT_CLEAR=3
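The register values above can be read with a small decoder. This is a sketch under stated assumptions: the field positions (IDLE in bit 31, RX error code in bits 15:12) are inferred only from the two DMASTATUS values quoted in this post and should be checked against the device documentation, and all names are hypothetical. The second helper uses the usual convention that a masked interrupt status is the raw status ANDed with the enable mask.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed field positions, inferred from the values quoted in the post. */
#define SKETCH_DMASTATUS_IDLE_MASK    0x80000000u
#define SKETCH_DMASTATUS_RX_ERR_SHIFT 12u
#define SKETCH_DMASTATUS_RX_ERR_MASK  0xFu

typedef struct {
    bool     idle;        /* bit 31 set */
    uint32_t rx_err_code; /* bits 15:12; 2 is read in the post as
                             "ownership bit not set" */
} sketch_dma_status;

/* Splits a raw DMASTATUS value into the two fields of interest. */
sketch_dma_status sketch_decode_dmastatus(uint32_t reg)
{
    sketch_dma_status s;
    s.idle = (reg & SKETCH_DMASTATUS_IDLE_MASK) != 0u;
    s.rx_err_code = (reg >> SKETCH_DMASTATUS_RX_ERR_SHIFT)
                    & SKETCH_DMASTATUS_RX_ERR_MASK;
    return s;
}

/* Masked status: the raw status gated by the interrupt enable mask. */
uint32_t sketch_masked_status(uint32_t raw, uint32_t mask)
{
    return raw & mask;
}
```

With RX_INTSTAT_RAW=0 the masked value is 0 whatever the mask is, which matches the dump: the RX interrupt is enabled, but no raw RX event is pending, so the stall appears to lie upstream of the interrupt mask.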

By the way, if I use NC_OPMODE_POLLING instead of NC_OPMODE_INTERRUPT in NC_SystemOpen I have the same problem.

Thanks!

Victor

0 krishnamaiden over 12 years ago in reply to Victor Ivanov

Prodigy 155 points

Hi Victor,

Thanks again for the response. The surprising thing is that Rx interrupts are not received. The interrupts are generated when the port writes to the completion pointer. Wonder why that's not happening.

In the sample app that you are trying, can you place the packet buffers in internal memory (L2) and see the response? I mean placing .far:NDK_PACKETMEM in L2.

Regards

Krishna

0 Victor Ivanov over 12 years ago in reply to krishnamaiden

Expert 1160 points

Hi Krishna,

thanks for your advice. As I understand it, you mean moving pBufMem and pHdrMem to L2. I was not able to do that because they are too big, but I did another experiment: I placed them in a non-cacheable part of memory. It seems that this helps. I was not able to reproduce the problem with this configuration. Of course, the maximum speed is lower. I was able to send up to 100 Mb. I'll continue the experiment tomorrow.
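One reason cacheability can matter for DMA packet buffers is that, when they stay in cacheable memory, the CPU has to invalidate every cache line covering a buffer before reading data the EMAC wrote into it, and cache operations work on whole lines. The sketch below only shows the address arithmetic for that rounding. It assumes a 128-byte line size, all names are hypothetical, and it is not taken from the driver; whether cache coherence is related to the problem in this thread is not established.

```c
#include <stdint.h>
#include <stddef.h>

#define SKETCH_CACHE_LINE 128u /* assumed cache line size in bytes */

typedef struct {
    uintptr_t start; /* address rounded down to a line boundary */
    size_t    len;   /* length rounded up to whole lines */
} sketch_cache_range;

/* Expands [addr, addr + len) to the smallest range of whole
 * cache lines that covers it. */
sketch_cache_range sketch_cache_align(uintptr_t addr, size_t len)
{
    sketch_cache_range r;
    uintptr_t mask = (uintptr_t)SKETCH_CACHE_LINE - 1u;
    uintptr_t end = (addr + len + mask) & ~mask;

    r.start = addr & ~mask;
    r.len = (size_t)(end - r.start);
    return r;
}
```

For example, a 10-byte region starting at 0x1005 expands to the full line starting at 0x1000. A buffer that shares a line with unrelated data is a classic source of intermittent corruption, which is why packet buffers are usually aligned and padded to the line size.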

Best regards

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Hi Victor,

I do apologize for the lack of response. But I am back on your problem now. I have your helloworld application downloaded, just need to rebuild it to get the static IP addresses updated.

Before I go too far, I wanted to make sure that this app that you provided is still the best way to reproduce the issue. Should I continue with the helloworld app? If you have something better by now, please attach.

Also, what's the best way to test/reproduce from the client/PC side? Should I run the hping app you show or is there another client app or script you are using? If so please attach.

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Hi Steven,

I'm glad that you are going to reproduce my problem, thank you very much!

We still use the same helloWorld application because it contains only a few changes made by me. I think this minimizes the possibility of a mistake on my part. So, the project's configuration is the following:

CCS 5.1.1.00031

bios_5_42_00_07

ndk_2_20_06_35

ethss_dm648 from ndk_2_0_0. Following TI's advice, all "printf" calls were removed.

The project was taken from

ndk_2_0_0\packages\ti\ndk\example\network\helloWorld

This project originally was for CCS 3.3 so I had to convert it to use with CCS 5.1.

I made the following changes in the project:

- an IP address is set statically

- the dtask_udp_hello daemon and its creation were commented out

- A MainTask thread was created

MainTask sends UDP packets (1000 bytes). The stream is about 150 Mb/s. We also reproduced the problem with 80 Mb/s. However, with 30 Mb/s I was not able to reproduce it but maybe I just had to wait longer.
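Since the packet count may matter more than the byte count, it helps to convert these stream rates into packets per second. The helper below is a small sketch with a hypothetical name; it assumes "Mb/s" means 10^6 bits per second and that the rate is measured on the 1000-byte payload, ignoring UDP, IP and Ethernet headers.

```c
#include <stdint.h>

/* Packets per second for a stream of rate_mbps megabits per second
 * carrying payload_bytes of payload per packet (headers ignored).
 * Returns 0 for a zero payload size to avoid dividing by zero. */
uint32_t sketch_packets_per_second(uint32_t rate_mbps, uint32_t payload_bytes)
{
    uint64_t bits_per_second = (uint64_t)rate_mbps * 1000000u;

    if (payload_bytes == 0u) {
        return 0u;
    }
    return (uint32_t)(bits_per_second / ((uint64_t)payload_bytes * 8u));
}
```

Under these assumptions, 150 Mb/s corresponds to 18750 packets per second, 80 Mb/s to 10000 and 30 Mb/s to 3750.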

The DM648 DVDP evaluation module is directly connected to a PC. To have more control over the PC, we now use FreeBSD. On the PC side we use hping in flood mode to send packets to the board, netstat to see what is going on on the network, and normal ping to test whether the board responds (when hping works in flood mode it reports nothing). hping can generate a very large number of packets, which makes the problem appear faster, so I recommend using hping. Usually we start hping in two threads to use both CPUs of the PC and create maximum traffic. In this case the problem appears within minutes.

There is no application on the PC which would receive data from the DM648. As my colleagues told me, if a system receives a UDP packet and no application is waiting for data on the specified port, the system, by default, responds to the sender with an ICMP message. I didn't know this, and it caused me to misunderstand the results of one of my earliest experiments. Anyway, we configured the PC not to respond in this case, which makes the situation clearer. So now the PC sends only packets created by hping, normal ping and ARP.

As we understand the situation now, packets must be sent in both directions, from the DM648 and to the DM648. The problem happens with both 100 Mb and 1Gb connections. (At one point I thought that 100 Mb was more stable, but it is not.)

We also tried to find out which is more important: the number of packets or the number of bytes. It seems that the number of packets is more important, but we are not sure.

So, the easiest way to reproduce the problem is the following:

1. Connect the board and the PC directly with a cable. Modern network cards no longer require a crossover cable.

2. Set the desired IP addresses on the board and the PC.

3. Run the project as it is on the board.

4. Run ping on the PC to watch the state of the board.

5. Start two terminals on the PC and in each run hping --flood --udp -d 200 192.168.1.100

The board will stop responding to the ping within a few minutes. After another few minutes, when the timeout expires in the route table on the board, the application will not be able to send data and sendto function will return 64.

I hope you will easily reproduce the problem on your side.

By the way, while I was looking for a workaround for this problem, I tried restarting the NDK. It helped, but the NDK can be restarted only once on the DM648. This is a separate problem and I created a separate thread for it: http://e2e.ti.com/support/embedded/bios/f/355/t/239456.aspx

Thanks a lot and I hope to hear from you soon!

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Hi Krishna,

during the night the system stopped responding to ping again. So, it seems that moving the buffers didn't solve the problem but only postponed it.

Victor

0 Victor Ivanov over 12 years ago in reply to Victor Ivanov

Expert 1160 points

Hi Steven!

Have you been able to reproduce the problem?

Victor

0 Steven Connell over 12 years ago in reply to Victor Ivanov

TI__Mastermind 45025 points

Hi Victor,

I haven't been able to reproduce it yet as I have only been getting it set up including fighting with build problems when trying to rebuild the project you attached. I should have an update for you tomorrow.

Steve

0 Victor Ivanov over 12 years ago in reply to Steven Connell

Expert 1160 points

Hi Steven,

thanks a lot for the response. I'm looking forward to the result.

Victor
