
NDK stops working under heavy bidirectional traffic.

Other Parts Discussed in Thread: OMAPL138

Hi All!

This thread was created as a replacement for http://e2e.ti.com/support/embedded/bios/f/355/t/223029.aspx?pi239031349=4. Our local TI distributor told me that the previous thread was not getting enough attention because it had been marked as answered.

Brief description of the problem:
The Network Development Kit (NDK) running on a DM648 stops receiving packets under significant bidirectional traffic and does not recover until restart.

In our project, we digitize an analog signal and send it to a PC via Ethernet. The traffic is about 150 Mb/s - 200 Mb/s. The board and the PC also exchange commands and status messages. While testing the system we noticed that communication stops after a few hours. Later we reproduced the problem on a TMS320DM648DVDP purchased from TI, using a slightly modified version of the example project provided by TI (helloWorld). With the help of the hping tool we were able to reproduce the problem faster, without waiting for hours.

Configuration:

CCS 5.1.1.00031

bios_5_42_00_07

ndk_2_20_06_35

ethss_dm648 from ndk_2_0_0. Following TI's advice, all "printf" calls were removed.

The project was taken from

ndk_2_0_0\packages\ti\ndk\example\network\helloWorld

The project was originally for CCS 3.3, so I had to convert it for use with CCS 5.1.

I made the following changes to the project:

- the IP address is set statically

- the dtask_udp_hello daemon and its creation were commented out

- a MainTask thread was created

MainTask sends UDP packets of 1000 bytes each; the stream is about 150 Mb/s. We have also reproduced the problem at 80 Mb/s.
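
For reference, here is a minimal sketch of such a task, assuming the standard NDK BSD-style socket calls; the destination address, port and buffer contents are placeholders, not our actual configuration:

/* Hypothetical sketch of a UDP streaming task (NDK 2.x socket layer). */
#include <netmain.h>
#include <string.h>

#define DEST_IP   "192.168.1.1"   /* placeholder PC address */
#define DEST_PORT 5000            /* placeholder UDP port   */

void MainTask(void)
{
    SOCKET s;
    struct sockaddr_in sin;
    static char buf[1000];              /* 1000-byte payload */

    fdOpenSession(TaskSelf());          /* every NDK task needs a file session */

    s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);
    if (s != INVALID_SOCKET) {
        memset(&sin, 0, sizeof(sin));
        sin.sin_family      = AF_INET;
        sin.sin_addr.s_addr = inet_addr(DEST_IP);
        sin.sin_port        = htons(DEST_PORT);

        /* Stream packets as fast as possible (~150 Mb/s in our setup) */
        for (;;) {
            if (sendto(s, buf, sizeof(buf), 0, (PSA)&sin, sizeof(sin)) < 0)
                break;                  /* e.g. error 64 once the route expires */
        }
        fdClose(s);
    }
    fdCloseSession(TaskSelf());
}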

The DM648 DVDP evaluation module is directly connected to a PC. To have more control over the PC side, we now use FreeBSD UNIX. On the PC side we use hping in flood mode to send packets to the board, netstat to see what is happening on the network, and a normal ping to test whether the board responds (when hping works in flood mode it reports nothing). hping can generate a very large number of packets, which makes the problem appear faster. Usually we start hping in two threads to use both CPUs of the PC and create maximum traffic. In this case the problem appears within minutes.

There is no application on the PC that receives the data from the DM648.

As we understand the situation now, packets must be flowing in both directions, both from and to the DM648. The problem happens with both 100 Mb/s and 1 Gb/s connections.

How to reproduce the problem:

1. Connect the board and the PC directly with a cable. Modern network cards no longer require a crossover cable.

2. Set the desired IP addresses on the board and the PC.

3. Run the project as it is on the board.

4. Run ping on the PC to watch the state of the board.

5. Start two terminals on the PC and in each run hping --flood --udp 192.168.1.100 (where 192.168.1.100 is the board's IP address)

The board will stop responding to ping within a few minutes. After another few minutes, when the timeout expires in the board's route table, the application is no longer able to send data and the sendto function returns error 64.


Investigation:
While investigating this problem we found that there are no Rx interrupts and that the emacDequeueRx function is never called.

Previous communication with TI:
I posted the information about this problem on TI's e2e forum: http://e2e.ti.com/support/embedded/bios/f/355/t/223029.aspx?pi239031349=2 on the 24th of November. That thread had been created by another engineer who has a similar problem, but on a different processor.

Then there was a short discussion with Steven Connell from TI, followed by a pause until the 23rd of January.

On the 23rd of January Steven told me that he was ready to try to reproduce the issue and asked for details.

On the 7th of February Steven wrote the following:

Hi Victor,

I just wanted to update you.  We are able to reproduce this issue and are seeing it on different hardware platforms.  We are currently debugging the issue and will let you know as soon as we have something ... thanks again for your patience.

Steve

Since that moment we haven't received any updates about the status of the issue.

Also, I tried to communicate with TI's support [Service Request# 1-926930884] but they advised me to continue using e2e.

We are running out of time; our project is behind schedule and we need at least to know what to expect:
1. Is TI working on it?
2. Is it a software or a hardware problem?
3. Is it going to be fixed, and when?

Best Regards
Victor Ivanov

  • Just replying so I get email updates.

  • Today I got information from the local TI distributor that TI is working on this issue. This gives us hope and we are glad to hear it. At the same time, we are disappointed that we don't see any status updates here...

    Victor

  • It has been two weeks since I created a new thread on the e2e forum as you recommended. Unfortunately, there have been no responses from TI in this thread. For this reason I think that the lack of response in the previous thread was not because it was marked as answered; I think that was just a pretext.
    It seems that, for some reason, TI doesn't want to discuss the problem and reveal its status.
    The situation is critical. We have a batch of devices in stock which cannot be delivered to our customers before the issue is fixed, so we have postponed the delivery several times. We are losing money. It is too bad that we don't have any time estimate. Moreover, we still don't know whether this is a software problem or a hardware one. If it is a hardware problem, then all our stock is just trash. And finally, we still don't know whether it is going to be fixed at all. We still hope that it will be fixed, or that a sufficient workaround will be suggested, but we need to be sure. Right now we are not able to make any plans.
    Have we made a mistake by choosing TI?...
    Victor
    Victor
  • Victor

    Sorry for the inconvenience caused thus far - we are analyzing the issue and, based on observations so far, it seems that the Rx DMA handling (and the DMA re-trigger portion) could potentially be causing the hang condition. We are currently checking whether this can be handled with a SW patch - we will get back to you with more updates shortly (hopefully in a day or two).

     

    Regards

    Sriram

  • Sriram, thanks a lot for your message!

    We are waiting for the updates.

    Victor

  • Yes, on the 10th of April I got a patch from TI but unfortunately it didn't help; I haven't noticed any change in the behavior. I reported this to TI and now I'm waiting for a response. I don't really understand why the patch was not posted here; I was told that this is because the NSP (driver) is under the TI Commercial License, but it is available as part of NDK 2.0 on TI's site.

    I'll let you know if there is any progress.

    Victor

  • Dear all,

    I have the same problem with a different platform: C6670.

    I'm using EVMC6670 revision 3A.

    - XDCTools version: 3.24.7.73

    - SYS/BIOS version: 6.35.1.29

    - MCSDK version: 2.1.2.6

    - PDK for C6670 version: 1.1.2.6

    - NDK version: 2.22.2.16

    I reported this at the following link but did not get any replies:

    http://e2e.ti.com/support/embedded/bios/f/355/t/258234.aspx

    Maybe the topic of my post is a bit confusing, but apart from the debug message "Illegal priority call to llEnter", the result is the same: the NDK dies under high bidirectional UDP traffic.

    Any help is welcome.

    Thanks in advance.

    Kind regards,

    Ricardo

     

  • We are working on this and will update here as soon as we have information. Thanks for waiting.

    Sincerely,

    Sudhakar Ayyasamy

  • Dear Sudhakar,

    is there any news for us?

    Best Regards

    Victor

  • I was wondering if there is any update on the status of this investigation. We are having a similar problem with an OMAPL138: with lots of UDP packets coming in, the stack seems to lock up.

  • Hello Alan!

    We regularly ask the same question via our local distributor. Unfortunately, there has been no real progress so far. The only useful information we have now is the following: "an engineer has now been put on the case and has been issued with an eval board and the appropriate SW." We got this message on the 6th of May, so we do hope that the work has finally started. However, we haven't received any updates about the current status, so we can only believe that it is in progress.

    Unfortunately, we found this issue at the final stage of the project, when almost everything was ready. If we hadn't invested so much money and hadn't produced the batch of devices, we would have already dropped this project and moved to another platform (not TI's, of course).

  • Dear Victor,

    Did you take a look at the workaround provided in http://e2e.ti.com/support/embedded/bios/f/355/t/258234.aspx?

    We identified data coherency problems (DDR3 <-> L1D cache) as the main cause of the NDK crash.

    The workaround consists of replacing the (buggy) CSL_Cache functions in the NDK source with the equivalent SYS/BIOS functions.

    Please give feedback if it helps.

  • Dear Ricardo,

    in the Ethernet driver for the DM648, the cache-related functions just call the DSP/BIOS functions. We also tried to find out whether the cache is involved by moving the PBM buffers into a non-cached part of memory. At first we thought it helped, but later we realized that it didn't.

    By the way, Steven Connell mentioned the OMAPL138 as a platform which handles our tests well:


    Victor Ivanov
    By the way, you have mentioned that you were able to reproduce this issue on different hardware platforms. I'm curious whether there were any platforms you tried on which you were not able to reproduce the issue.
    Steven Connell
    The evmOMAPL138 handles the flood well. 


    Victor

  • Hi Alan Colantino,

    Which version of the NDK are you using? I was able to solve the ping flood issue for the evmOMAPL138. I found it was due to the NDK using a counting semaphore instead of a binary semaphore (see the sketch after this post). Unfortunately, this did not solve the issue on the DM648, but I suspect it is the issue you are running into.

    The fix was made in NDK 2.22.02.16, but you should try the latest release, 2.22.03.20. You can download it from here:

    http://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/ndk/index.html

    Steve
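
    To illustrate the difference Steve describes: the NDK's Rx event semaphore is posted once per interrupt, while the stack drains all pending packets per wakeup. A rough sketch of that pattern using the SYS/BIOS 6 Semaphore API (which NDK 2.22 runs on); rxPacketPending() and processRxPacket() are hypothetical placeholders:

    #include <ti/sysbios/BIOS.h>
    #include <ti/sysbios/knl/Semaphore.h>

    static Semaphore_Handle hRxEvent;

    void createRxEvent(void)
    {
        Semaphore_Params prms;
        Semaphore_Params_init(&prms);
        /* With the default COUNTING mode every post adds one count, but a
           single pend drains many packets, so under a flood the count climbs
           into the tens of thousands. BINARY mode just latches "work pending"
           and cannot run away. */
        prms.mode = Semaphore_Mode_BINARY;
        hRxEvent  = Semaphore_create(0, &prms, NULL);
    }

    void rxIsrHook(void)                     /* called once per Rx interrupt */
    {
        Semaphore_post(hRxEvent);
    }

    void schedulerTask(void)
    {
        for (;;) {
            Semaphore_pend(hRxEvent, BIOS_WAIT_FOREVER);
            while (rxPacketPending())        /* drain everything per wakeup */
                processRxPacket();
        }
    }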

  • Victor,

    I am working on reproducing this issue on a DM648 board. I have a project that I am able to compile and run. Since Steven is already able to reproduce this, I will work with him on understanding the issue and how to reproduce it. I will let you know once I am able to reproduce this on my side. Thank you for your patience.

    Sincerely,

    Sudhakar Ayyasamy

  • Victor,

    I ran the test for around four hours and I was not able to reproduce the issue. Here is more information on my test environment:

    Hardware: DM648 EVM

    CCS 5.4.0.00091

    bios_5_42_00_07

    ndk_2_20_06_35

    I ran the project, started ping on one terminal, and did hping on seven different terminals. The ping does slow down (round trip time increases after hping is started), but the board did not stop responding.

    I am attaching the binary (in the .rar file) for you to try on your side.

    1803.helloWorld.rar

    The board's IP address will be set to 192.168.1.100 in this case.

    I will run this test on a different board, for a longer time, and will let you know if this issue occurs.

    Thank you.


    Sincerely,

    Sudhakar Ayyasamy

  • Dear Sudhakar,

    I have tried the binary which you sent me and I have reproduced the issue.
    To do so, I performed the following steps:

    1. Connect DM648 EVM with a UNIX PC via Ethernet cable

    2. Load EVMDM648.gel to the DM648 EVM (the GEL file was received together with the EVM)
    C64XP_0: GEL Output: Connecting Target...
    C64XP_0: GEL Output: Setup Cache...
    C64XP_0: GEL Output: L1P = 32K
    C64XP_0: GEL Output: L1D = 32K
    C64XP_0: GEL Output: L2 = ALL SRAM
    C64XP_0: GEL Output: Setup Cache... Done.
    C64XP_0: GEL Output: PLL1 Setup...
    C64XP_0: GEL Output: PLL1 Setup for DSP @ 891 MHz, SYSCLK2 = 222.75 MHz, SYSCLK4 = 222.75 MHz.
    C64XP_0: GEL Output: PLL1 Setup... Done.
    C64XP_0: GEL Output: Power on all PSC modules and DSP domains...
    C64XP_0: GEL Output: In DVR mode.
    C64XP_0: GEL Output: Power on all PSC modules and DSP domains... Done.
    C64XP_0: GEL Output: DDR2 Setup for 32 bits DDR @ 265.9 MHz...
    C64XP_0: GEL Output: DDR2 Setup... Done.
    C64XP_0: GEL Output: Set Board and DSP Pin Mux...
    C64XP_0: GEL Output: Set EVM muxes for 5 video ports SD capture...
    C64XP_0: GEL Output: Set EVM muxes for 5 video ports SD capture... Done.
    C64XP_0: GEL Output: Set EVM muxes for McASP to AIC access...
    C64XP_0: GEL Output: Set EVM muxes for McASP to AIC access... Done.
    C64XP_0: GEL Output: Set EVM mux for UART mode.
    C64XP_0: GEL Output: Set Board and DSP Pin Mux... Done.
    C64XP_0: GEL Output: EMIFB setup ( 16 bits bus )...
    C64XP_0: GEL Output: EMIFB setup... Done.
    C64XP_0: GEL Output: Set 27 MHz clock for SD video output/capture...
    C64XP_0: GEL Output: Set 27 MHz clock for SD video output/capture... Done.
    C64XP_0: GEL Output: Configure PCI...
    C64XP_0: GEL Output: Configure PCI... Done.
    C64XP_0: GEL Output: Setup Board Peripherals...
    C64XP_0: GEL Output: Setup Board Peripherals... Done.
    C64XP_0: GEL Output: Connecting Target... Done.

    3. Load and run helloWorld.out which you have sent me
    [C64XP_0]
    [C64XP_0] TCP/IP Stack 'Hello World!' Application
    [C64XP_0]
    [C64XP_0] Here we go
    [C64XP_0] Using MAC Address: 08-00-28-2c-68-0b
    [C64XP_0] Network Added: If-1:192.168.1.100
    [C64XP_0] Link Status: 1000Mb/s Full Duplex on PHY 0

    4. Start ping on the UNIX PC to see if the board is responding
    ping 192.168.1.100
    64 bytes from 192.168.1.100: icmp_seq=1372 ttl=255 time=0.492 ms
    64 bytes from 192.168.1.100: icmp_seq=1373 ttl=255 time=0.474 ms
    64 bytes from 192.168.1.100: icmp_seq=1374 ttl=255 time=0.478 ms
    64 bytes from 192.168.1.100: icmp_seq=1375 ttl=255 time=0.446 ms
    64 bytes from 192.168.1.100: icmp_seq=1376 ttl=255 time=0.478 ms
    64 bytes from 192.168.1.100: icmp_seq=1377 ttl=255 time=0.486 ms
    64 bytes from 192.168.1.100: icmp_seq=1378 ttl=255 time=1.586 ms
    64 bytes from 192.168.1.100: icmp_seq=1379 ttl=255 time=0.292 ms
    64 bytes from 192.168.1.100: icmp_seq=1380 ttl=255 time=0.289 ms
    64 bytes from 192.168.1.100: icmp_seq=1381 ttl=255 time=0.485 ms
    64 bytes from 192.168.1.100: icmp_seq=1382 ttl=255 time=0.282 ms
    64 bytes from 192.168.1.100: icmp_seq=1383 ttl=255 time=0.281 ms
    64 bytes from 192.168.1.100: icmp_seq=1384 ttl=255 time=0.348 ms
    64 bytes from 192.168.1.100: icmp_seq=1385 ttl=255 time=0.136 ms
    64 bytes from 192.168.1.100: icmp_seq=1386 ttl=255 time=1.842 ms
    64 bytes from 192.168.1.100: icmp_seq=1387 ttl=255 time=2.210 ms
    64 bytes from 192.168.1.100: icmp_seq=1388 ttl=255 time=2.128 ms
    64 bytes from 192.168.1.100: icmp_seq=1389 ttl=255 time=1.915 ms
    64 bytes from 192.168.1.100: icmp_seq=1390 ttl=255 time=1.979 ms
    64 bytes from 192.168.1.100: icmp_seq=1392 ttl=255 time=2.020 ms
    64 bytes from 192.168.1.100: icmp_seq=1393 ttl=255 time=2.075 ms
    64 bytes from 192.168.1.100: icmp_seq=1394 ttl=255 time=2.299 ms
    64 bytes from 192.168.1.100: icmp_seq=1395 ttl=255 time=2.233 ms
    64 bytes from 192.168.1.100: icmp_seq=1396 ttl=255 time=2.093 ms
    64 bytes from 192.168.1.100: icmp_seq=1397 ttl=255 time=2.004 ms
    64 bytes from 192.168.1.100: icmp_seq=1398 ttl=255 time=2.185 ms
    64 bytes from 192.168.1.100: icmp_seq=1399 ttl=255 time=2.024 ms
    64 bytes from 192.168.1.100: icmp_seq=1400 ttl=255 time=2.117 ms
    64 bytes from 192.168.1.100: icmp_seq=1401 ttl=255 time=2.086 ms
    64 bytes from 192.168.1.100: icmp_seq=1402 ttl=255 time=2.179 ms
    64 bytes from 192.168.1.100: icmp_seq=1403 ttl=255 time=2.120 ms
    64 bytes from 192.168.1.100: icmp_seq=1405 ttl=255 time=2.029 ms
    64 bytes from 192.168.1.100: icmp_seq=1407 ttl=255 time=2.027 ms
    64 bytes from 192.168.1.100: icmp_seq=1408 ttl=255 time=2.180 ms
    64 bytes from 192.168.1.100: icmp_seq=1409 ttl=255 time=2.027 ms
    64 bytes from 192.168.1.100: icmp_seq=1411 ttl=255 time=2.085 ms
    64 bytes from 192.168.1.100: icmp_seq=1413 ttl=255 time=2.049 ms
    64 bytes from 192.168.1.100: icmp_seq=1414 ttl=255 time=2.117 ms
    64 bytes from 192.168.1.100: icmp_seq=1415 ttl=255 time=2.142 ms
    64 bytes from 192.168.1.100: icmp_seq=1416 ttl=255 time=2.076 ms
    64 bytes from 192.168.1.100: icmp_seq=1417 ttl=255 time=2.008 ms
    64 bytes from 192.168.1.100: icmp_seq=1418 ttl=255 time=2.152 ms
    64 bytes from 192.168.1.100: icmp_seq=1419 ttl=255 time=2.173 ms
    64 bytes from 192.168.1.100: icmp_seq=1420 ttl=255 time=2.187 ms
    64 bytes from 192.168.1.100: icmp_seq=1421 ttl=255 time=2.150 ms

    5. Start netstat on the UNIX PC to see what is happening on the LAN
    netstat -w1 -Ire0

    input (re0) output
    packets errs idrops bytes packets errs bytes colls
    2 0 0 162 3 0 140 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    0 0 0 0 0 0 98 0
    1 0 0 98 1 0 0 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    1 0 0 98 1 0 98 0
    9741 0 0 681898 9741 0 2357178 0
    56259 0 0 3938158 61867 0 14971912 0
    57434 0 0 4020408 64842 0 15691378 0
    57718 0 0 4040288 65958 0 15962418 0
    57697 0 0 4038818 66260 0 16034534 0
    57686 0 0 4038048 65813 0 15926118 0
    48505 0 0 3395378 98356 0 23802492 0
    46201 0 0 3234070 106957 0 25883208 0
    input (re0) output
    packets errs idrops bytes packets errs bytes colls
    46911 0 0 3283798 103766 0 25111470 0
    46149 0 0 3230458 106958 0 25883208 0
    45969 0 0 3217858 107556 0 26029618 0
    45831 0 0 3208198 107911 0 26113350 0
    46387 0 0 3247118 105853 0 25616282 0
    45949 0 0 3216458 107582 0 26034700 0
    45890 0 0 3212328 107822 0 26093022 0
    45904 0 0 3213308 108046 0 26146988 0
    46171 0 0 3231998 106341 0 25733894 0
    46079 0 0 3225558 107272 0 25959680 0
    45849 0 0 3209458 107557 0 26029860 0
    45924 0 0 3214708 107694 0 26062288 0
    45787 0 0 3205090 107811 0 26088424 0
    45962 0 0 3217368 107461 0 26005418 0
    45935 0 0 3215450 107829 0 26094232 0
    45810 0 0 3206728 106891 0 25869172 0
    45758 0 0 3203088 107098 0 25916362 0
    46000 0 0 3220028 107728 0 26069548 0
    46646 0 0 3265150 104980 0 25406710 0
    46974 0 0 3288278 102940 0 24909642 0
    46157 0 0 3230990 105610 0 25559654 0
    input (re0) output
    packets errs idrops bytes packets errs bytes colls
    45942 0 0 3215968 107643 0 26047284 0
    45873 0 0 3211138 107781 0 26083100 0
    45965 0 0 3217578 107336 0 25974926 0
    45815 0 0 3207078 107755 0 26076566 0
    45823 0 0 3207638 107313 0 25971296 0
    45865 0 0 3210578 108096 0 26157394 0
    45879 0 0 3211558 107698 0 26063982 0
    46018 0 0 3221288 107516 0 26017518 0
    5381 0 0 376698 160746 0 38904502 0
    2 0 0 480 165606 0 40077258 0
    0 0 0 0 166198 0 40219046 0
    0 0 0 0 167311 0 40490328 0
    0 0 0 0 167610 0 40560508 0
    0 0 0 0 165564 0 40065376 0
    0 0 0 0 166748 0 40349968 0
    0 0 0 0 167145 0 40449672 0
    0 0 0 0 167731 0 40592210 0
    0 0 0 0 166701 0 40339562 0
    0 0 0 0 167192 0 40466612 0
    0 0 0 0 166629 0 40318266 0
    0 0 0 0 167244 0 40478954 0


    6. Start hping in two terminals on the UNIX PC to put load on the board

    hping --flood --udp -d 200 192.168.1.100
    HPING 192.168.1.100 (re0 192.168.1.100): udp mode set, 28 headers + 200 data bytes
    hping in flood mode, no replies will be shown

    After about one minute the board stopped responding to ping. I have repeated the experiment several times with the same result. I can also reproduce this issue on our custom board.

    By the way, Steven also reproduced this issue.

    I hope this will help you reproduce the issue on your side. Please let me know about your further results.

    Best Regards

    Victor

  • Victor,

    I am able to reproduce the issue on a Linux desktop. Previously I was using a Linux virtual machine on my Windows laptop, and hping was not creating much traffic due to virtual-machine limitations. Thanks.

    Sincerely,

    Sudhakar Ayyasamy

  • Hello Sudhakar,

    thanks a lot for the update. I'm very glad that you have reproduced the issue; now I'm sure that you will be able to find the cause of the problem and fix it. We really need the issue to be fixed soon because our project has already been delayed by half a year.

    FYI: flood ping is not the only way to provoke this issue. In our real project, the board sends a data stream of about 150-200 Mb/s to the PC. In that situation, significantly less traffic from the PC produces the same effect.

    Thanks!

    Victor

  • Hello Sudhakar!

    Do you have any updates? What could be the cause of the problem? Is it a software or a hardware issue?

    Victor

  • Victor,

    We are still looking into the issue. One thing we noticed is that the receive interrupt (RX_STAT register) is disabled when the problem occurs. There are no changes in the System Module registers. The other changes we observed in the Ethernet registers are listed below (a small logging sketch follows this post):

    0x02D03190: CPDMA Input Vector Register (read only): CPDMA_IN_VECTOR
        0x00000000 before ping    
        0x00000001 during ping
        0x00000000 after crash
        Bit 0-31: DMA input vector

    0x02D03194: CPDMA End Of Interrupt Vector Register: CPDMA_EOI_VECTOR
        0x00000000 before ping
        0x00000001 during ping
        0x00000002 after crash
        Bit 0-4: DMA end of interrupt vector

    0x02D031A0: CPDMA Rx Interrupt Status Register (raw value): RX_INTSTAT_RAW
        0x00000000 before ping
        0x00000001 during ping
        0x00000000 after crash
        Bit 0: RX0_PEND raw interrupt read (before mask): RX0_PEND

    0x02D031A4: CPDMA Rx Interrupt Status Register (masked value): RX_INTSTAT_MASKED
        0x00000000 before ping
        0x00000001 during ping
        0x00000000 after crash
        Bit 0: RX0_PEND raw interrupt read: RX0_PEND

    We expect this to be a software issue, possibly related to the driver. I will update you on what is causing this issue within a week.

    Thank you.

    Sincerely,

    Sudhakar Ayyasamy
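
    For anyone who wants to watch these registers while the test runs, here is a trivial way to log them from a debug task; the addresses are taken from the list above, and this is only an illustration, not TI-provided code:

    #include <stdio.h>

    #define REG(addr) (*(volatile unsigned int *)(addr))

    void dumpCpdmaRegs(void)
    {
        printf("IN_VECTOR=0x%08x EOI_VECTOR=0x%08x RX_RAW=0x%08x RX_MASKED=0x%08x\n",
               REG(0x02D03190),   /* CPDMA_IN_VECTOR    */
               REG(0x02D03194),   /* CPDMA_EOI_VECTOR   */
               REG(0x02D031A0),   /* RX_INTSTAT_RAW     */
               REG(0x02D031A4));  /* RX_INTSTAT_MASKED  */
    }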

  • Hello Sudhakar!

    Thank you very much for the update! We are looking forward to learning the cause of the issue.

    Best Regards

    Victor

  • Hello Sudhakar!

    Could you please update us on the status?

    Thanks!

    Victor

  • Victor,

    We have been working on this and have observed a few things. The semaphore counts can be monitored using the ROV tool in CCS, and one of the semaphores goes to a very large value (tens of thousands) when flooded with packets. We are trying to modify the NDK stack to use a binary semaphore to see if this fixes the issue; Steven Connell is helping with this. We are also checking whether interrupt thrashing, interrupt pacing or a backlog might cause this issue. We still have not found the root cause; I will let you know once we do.

    Meanwhile, I have the following question for you:

    Flooding with hping (I use hping3) floods with smaller frames. Have you tried flooding with larger frames, say frames of 1500 bytes or more, and were you still able to reproduce the issue? If so, how do I do it?

    Thank you.

    Sincerely,

    Sudhakar Ayyasamy

  • Hello Sudhakar,

    thanks a lot for the update. We do appreciate any information about the progress.

    Regarding your question: yes, I have tried flooding with various frame sizes and I was able to reproduce the issue with all the sizes I tried. We use FreeBSD UNIX for our tests, and hping on UNIX has the -d parameter, which sets the packet body size.

    hping --flood --udp -d 200 192.168.1.100
    HPING 192.168.1.100 (re0 192.168.1.100): udp mode set, 28 headers + 200 data bytes
    hping in flood mode, no replies will be shown


    As far as I know, hping3 also has the same parameter. This is an extract from the hping3 man page:

    -d --data data size

    Set packet body size. Warning, using --data 40 hping3 will not generate 0 byte packets but protocol_header+40 bytes. hping3 will display packet size information as first line output

    Thanks!

    Victor


  • Victor,

    Thank you for the response. Can you confirm you are seeing this issue when flooding with frames of 1500 data bytes? On our side, I do not see the issue when I run:

    sudo hping3 --flood --udp -d 1500 <ip_address>

    I used hping3 on seven terminals; the response to a normal ping was delayed but never stopped. After a few minutes, I stopped all the hping3 instances, and ping was still working.

    However, with zero data bytes, the issue occurs immediately when flooding with hping3 from a single terminal.

    Thank you.

    Sincerely,

    Sudhakar Ayyasamy

  • Hello Sudhakar,

    unfortunately I don't have the UNIX PC which I used for the tests in the winter, so today I used hping3 on a Linux PC. I have not been able to reproduce the issue today with frames bigger than 600 bytes. However, according to my memory and my notes, I have seen the issue with all frame sizes. That was a few months ago and I can't be 100% sure, so tomorrow I'll try to obtain that UNIX PC and repeat the tests.

    Thanks!

    Victor

  • Hello Sudhakar,

    today I flooded the board from a UNIX PC and I did not see the issue with big frames.

    Victor

  • Hello Sudhakar,

    Any news about the progress?

    Victor

  • Victor,

    The problem occurs because packets are received faster than the host can process them. This disturbs the equilibrium between the hardware and the software. When the peripheral receives a packet, it clears the OWNERSHIP flag of the corresponding buffer descriptor (buffer descriptors point to the data in the buffer) to let the host know that reception is complete. The host looks at the OWNERSHIP flag and, if it is cleared, understands that a packet has been received, processes the packet corresponding to that buffer descriptor, and sets the OWNERSHIP flag again. Buffer descriptors are circularly linked (the 'next' field of the last descriptor points to the first descriptor). A sketch of this handshake follows this post.

    If there is equilibrium (flooding with larger packets), the host can process a packet and set the OWNERSHIP flag before the port tries to access the same buffer descriptor. So, when the port accesses it, the OWNERSHIP bit has been set by the host and the descriptor can be used by the port.

    If there is no equilibrium (flooding with smaller packets), the port clears the OWNERSHIP flag of a buffer descriptor and, before the host can service it, the port wraps around and tries to access the same buffer descriptor, whose OWNERSHIP flag is still cleared. This causes an error.

    We are trying to increase the CPPI RAM to allocate more buffer descriptors to see if it helps. The driver is not designed to recover from this error, but with a hardware reset, recovery is possible.

    Thank you.

    Sincerely,

    Sudhakar Ayyasamy
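
    To make the handshake concrete, here is a rough sketch of the descriptor layout and the host-side loop described above. The field names and the OWNER bit follow the usual TI CSL EMAC conventions but should be treated as illustrative; consumePacket() is a placeholder:

    /* One Rx buffer descriptor: four 32-bit words, circularly linked */
    typedef struct _EMAC_Desc {
        struct _EMAC_Desc *pNext;      /* Word 0: next BD in the ring      */
        unsigned char     *pBuffer;    /* Word 1: pointer to packet data   */
        unsigned int       BufOffLen;  /* Word 2: buffer offset and length */
        unsigned int       PktFlgLen;  /* Word 3: flags and packet length  */
    } EMAC_Desc;

    #define EMAC_DSC_FLAG_OWNER 0x20000000u  /* set = BD owned by the port */

    /* Host side: consume every BD the port has released, then return it */
    void hostRxService(EMAC_Desc *pDesc)
    {
        while ((pDesc->PktFlgLen & EMAC_DSC_FLAG_OWNER) == 0) {
            consumePacket(pDesc->pBuffer, pDesc->PktFlgLen & 0xFFFF);
            pDesc->PktFlgLen = EMAC_DSC_FLAG_OWNER;  /* hand BD back */
            pDesc = pDesc->pNext;
        }
        /* Under a flood the port can wrap the ring and reach a BD whose
           OWNER bit the host has not set yet - the error case above. */
    }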

  • Hello Sudhakar,

    Thanks a lot for the detailed explanation! 

    Is it possible for the host to check whether the OWNERSHIP flag is already cleared and simply drop the next packet, avoiding corruption of the buffer? In my opinion that would be the right behavior. I'm afraid that a larger buffer will not solve the problem but only make it less frequent.

    You also mentioned that the driver is not designed to recover from this error but that recovery is possible with a hardware reset (I hope you mean a reset of the Ethernet-related hardware, not of the whole board). I think that the ability to recover is essential for embedded systems, and it would be very useful to implement it. Our device must work for months without any attention, so it has to be able to recover. However, I don't know how to catch the error. Is there any reliable indicator of this error which could be used as a trigger for the Ethernet reset? If the recovery is fast enough, it could even be the solution for us.

    Thank you!

    Victor

  • Hello Sudhakar!

    What is the status now? Has increasing the buffer size helped?

    Also, what do you think about protection against buffer overruns?

    Victor 

  • Victor,

    As expected, increasing the CPPI RAM (Rx buffer descriptor memory) did not help. The hardware is sending out error messages, and we cannot change the behavior of the hardware. The host sets the OWNERSHIP bit after processing a BD, and the port clears it after receiving a packet and associating it with the BD. So the port should see the OWNERSHIP bit set by the host when receiving a packet for a BD. From the error message, we understand that it sees a cleared OWNERSHIP bit and reports an error in the DMASTATUS register. If you can help, can you please provide the following information?

    Run the test program, flood it with pings, wait for the issue to occur. Then

    1. Check the value of the DMASTATUS register (location 0x02D03124), especially bits 12-15. They read 0010 when the issue occurs.

    2. Look at the CPPI RAM for the Rx channel (ranging from 0x02D00C00 to 0x02D00FFF). There are 64 BDs, each consisting of four words (each word is four bytes): Word 0 - pointer to the next BD, Word 1 - pointer to the data buffer, Word 2 - buffer length and offset, Word 3 - flags and packet length. When the issue occurs, please check whether there are two adjacent BDs whose Word 0 is 0x00000000. This can cause the Rx channel to stop receiving packets.

    You will see one of these two situations when the issue occurs. Please run the test three to five times and note how many times you see situation 1 and how many times situation 2 (a small helper for both checks follows this post).

    We are currently working on changing the CSL code where the BDs are handled during the interrupt, to make the driver more robust. We hope this will fix the issue.

    Please let me know if there are any questions. Thank you.

    Sudhakar
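
    A quick sketch of how both checks could be automated from a debug task; the addresses and BD layout are as given above, it assumes the BDs are laid out sequentially in CPPI RAM, and 0010 is the value observed in bits 12-15 when the issue occurs (illustration only, not TI code):

    #include <stdio.h>

    #define DMASTATUS   (*(volatile unsigned int *)0x02D03124)
    #define CPPI_RX_RAM ((volatile unsigned int *)0x02D00C00)
    #define NUM_RX_BDS  64

    void checkHangSignature(void)
    {
        int i, pairs = 0;

        /* Situation 1: error code in DMASTATUS bits 12-15 */
        printf("DMASTATUS=0x%08x, bits 12-15 = 0x%x\n",
               DMASTATUS, (DMASTATUS >> 12) & 0xF);

        /* Situation 2: two adjacent BDs whose Word 0 ('next') is zero */
        for (i = 0; i < NUM_RX_BDS; i++) {
            unsigned int thisNext = CPPI_RX_RAM[4 * i];
            unsigned int nextNext = CPPI_RX_RAM[4 * ((i + 1) % NUM_RX_BDS)];
            if (thisNext == 0 && nextNext == 0)
                pairs++;
        }
        printf("adjacent BD pairs with NULL 'next': %d\n", pairs);
    }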

  • Hello Sudhakar,

    I have run the test 10 times: situation 1 occurred 3 times and situation 2 occurred 7 times.

    Thanks!

    Victor

  • Victor,

    Thank you very much for checking this. One last question: do you remember whether, in the tests you ran, the two situations were mutually exclusive? That is, did you ever see DMASTATUS bits 12-15 reading 0010 and, at the same time, two BDs with a NULL 'next' pointer?

    The new modification we now have in the driver eliminates situation two (two BDs with zero 'next' pointers), but we still get the OWNERSHIP error. We now suspect it to be a hardware issue, and we are still debugging the driver as well. I will let you know if there is any news.

    Thank you.

    Sincerely,

    Sudhakar Ayyasamy

  • Hello Sudhakar,

    yes, in my tests the situations were mutually exclusive.

    Thanks a lot for the new information! It is great that the second situation has been fixed. At the same time, we are very concerned about the possibility that situation one is the result of a hardware problem. If the OWNERSHIP error is a hardware error, will it be possible to find a workaround, or will TI fix it in new batches of chips?

    Thanks!

    Victor

  • Hello Sudhakar,

    could you please send me the fixed driver (the one which eliminates the situation with two BDs with zero 'next' pointers)? I will test it in our installation.

    Thanks!

    Victor

  • Victor,

    Please use the attached file. Thank you.

    10/25/2013: This issue is fixed. Please follow the Wiki link posted at the end of this thread to get the final version of the driver that provides the fix.

    Sudhakar

  • Hello Sudhakar!

    I have tested the fixed version. Situation 2 never occurred.

    In all my tests, bits 12-15 in DMASTATUS were non-zero, but they were not always 0010; in about 40% of the cases they were 0110.

    The good thing about these results is that when the problem occurs it can easily be detected in EMAC_TxServiceCheck. If there is no way to avoid the problem, we can try to recover the driver using DMASTATUS as a trigger; a hypothetical sketch of such a watchdog follows this post.

    Victor
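
    Such a watchdog might look roughly like the sketch below, under the assumption that some emacReinit() routine exists to reset the Ethernet subsystem; no such routine has been confirmed in this thread, so both the name and the trigger condition are hypothetical:

    #include <std.h>
    #include <tsk.h>    /* DSP/BIOS task API */

    #define DMASTATUS (*(volatile unsigned int *)0x02D03124)

    void etherWatchdogTask(void)
    {
        for (;;) {
            /* A non-zero Rx error code (bits 12-15) marks the hang */
            if ((DMASTATUS >> 12) & 0xF)
                emacReinit();    /* placeholder for an Ethernet reset */
            TSK_sleep(1000);     /* poll roughly once per second */
        }
    }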

  • FYI

    Quite by chance, I disabled the miscellaneous interrupt

    /* enable stats interrupt in the cpsw_3gss_s wrapper (host-error interrupt commented out) */
    ECTL_REGS->MISC_EN = /* CPSW3G_ECTL_HOSTERR_INTMASK |*/ CPSW3G_ECTL_STATPEND_INTMASK;

    and I noticed that the Ethernet resumes working between a few seconds and a few minutes after it stops.

     

  • Hello Sudhakar!

    Is there any news about the OWNERSHIP error? You mentioned that it could be a hardware issue. Is the situation any clearer now?

    Victor

  • Victor,

    Sorry for the delayed response, and thanks for your patience. Please let me know the procedure to reproduce the recovery you saw when you disabled the miscellaneous interrupts. I disabled the interrupts, ran a flood, stopped it after a few seconds, and I could not observe a recovery.

    We also don't have confirmation yet whether it is a hardware-related issue; experts are looking at the design topology. I shall let you know once we have some news. Please let me know if there are any questions.

    Thank you.

    Sincerely,

    Sudhakar Ayyasamy

  • Hello Sudhakar,

    I have tried to reproduce the recovery and was not able to. The procedure was as you described, but it doesn't work now and I can't figure out why. Maybe I misunderstood something. Sorry about this.

  • Victor,

    We just found that, when flooded, the hardware generates an interrupt before the previous ISR has completed and the interrupt has been acknowledged. In other words, the ISR is re-entered before the current ISR is completed. The hardware team is looking at why this is happening, as it is not supposed to happen.

    Using a check to ignore the ISR when it is re-entered prevented the issue from happening on our side (a sketch of such a guard follows this post). You can try it on your side, and if you are fine with this and don't see the issue anymore, let us know.

    Please replace helloworld/ethss_dm648/ethdriver.c with the attached file, use the original csl_emac.c (not the modified one I sent a few days back), and see if the issue still occurs. If it does, we must wait for the hardware team to fix it.

    10/25/2013: This issue is fixed. Please follow the Wiki link posted at the end of this thread to get the final version of the driver that provides the fix.

    Thank you.

    Sincerely,

    Sudhakar Ayyasamy
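
    For readers following along, the guard described above has roughly this shape; the real fix is in the attached ethdriver.c, and serviceEmacInterrupt() and ackEmacInterrupt() below are placeholders:

    static volatile int inIsr = 0;

    void HwIntEmac(void)
    {
        /* The hardware was observed to raise a new interrupt before the
           previous ISR had finished; simply ignore the re-entered call. */
        if (inIsr)
            return;
        inIsr = 1;

        serviceEmacInterrupt();   /* placeholder: normal Rx/Tx processing */
        ackEmacInterrupt();       /* placeholder: acknowledge / write EOI */

        inIsr = 0;
    }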

  • Great news! Thanks a lot! I will start testing now.

    Victor

  • Hello Sudhakar!

    In the test project the fix worked fine, but in the real system the Ethernet stopped again. I haven't found out what went wrong. The DMASTATUS register (0x02D03124) read 0x80000000 and there were no two adjacent BDs with Word 0 equal to 0x00000000.

    I will try to find out what the difference is between the test and the real system, and modify the test to make it reproduce the situation.

    Thanks

    Victor

  • Victor,

    Thanks for confirming the fix on the test project. Please let me know once you find out what went wrong. Thanks.

    Sudhakar

  • Hello Sudhakar and Victor,

    We have the same issue; our platform is the DM648 and the NDK version is 2.0.

    We have done some tests according to your discussion (the project is our real system), and we found that there are three kinds of values in the DMASTATUS register when the NDK stops responding:

    0x00020000

    0x00060000

    0x80000000

    When the value is 0x80000000, I noticed that the value of RX_HDP is always 0, and the situation is not recoverable.