Intermittent "No Tx free descriptor" error after integrating EMAC example project

Other Parts Discussed in Thread: SYSBIOS

TI Experts, 

I am running into an intermittent problem after integrating a modified version of the EMAC multicore example project in the PDK 1.1.2.6.

I am working with Advantech's DSPC-8681E card. I started with PA_multicoreExample_exampleProject included in the PDK and modified it so that it works with the 8681 card and sends packets out on the network instead of doing an internal loopback. I increased the TX and RX buffer sizes to allow for a max packet size of 1514 and increased the number of packets sent by the test code to 1000. I also added a delay between calls to SendPacket() as my capture interface was having trouble keeping up with the rate.

I used TI's Desktop Linux SDK and the dsp_utils app included to do the following:
1. Reset all 4 DSPs using the dspresetall.sh script
2. Initialize all 4 DSPs using the init_dsps.sh script
3. Load the PA_multicoreExample_exampleProject .out file to cores 1-7
4. Repeat #3 for core 0 as core 0 should be loaded last since it starts the other cores
5. Wait for packets on capture interface

I can repeat this test many times and it seems to be mostly reliable except for the occasional missed packets on the capture interface, but this doesn't seem to be an issue with the C66x side. 

I have two projects that I am trying to integrate this network I/O code from the PA_multicoreExample_exampleProject into. After integrating it with either project, I run the same above test substituting out the PA_multicoreExample_exampleProject .out file with the .out file for either project, but run into some problems.

Project A:
Intermittently the test will fail. I get a "No Tx free descriptor" error message from the SendPacket() function around the 289th packet, and I don't see any packets at all from the C66x on the capture interface. It almost always occurs on the 289th packet, but I have seen a few cases where it occurred on a nearby packet instead (e.g. 287). This can happen on the first test after a server reboot or after several iterations of the test have run successfully. If I replace Step #5 above with a sleep command and run the test in an infinite loop, the problem occurs in fewer iterations of the test. I have stripped down the project to the bare minimum as far as what code gets executed and have made it as close as possible to the modified version of PA_multicoreExample_exampleProject that I had working for the code that gets executed, but the problem still occurs.

When the problem occurs, the code continues to run and I can continue to receive packets. The next time I try to reset the DSP, I get the following error:

setPscState: dsp_id 0: Current transition in progress pid 2 mid 7 state: 0

setPscState: dsp_id 0: MD stat for pid 2 mid 7 expected state: 0 state: 10 timeout

setPDState: dsp_id 0: Previous transition in progress pid 2 state: 0

setPDState: dsp_id 0: Current transition in progress pid 2 state: 0

After looking in the source where this message is produced and understanding the numbers, it looks like there's a problem with resetting the PA. One thing I don't understand though is how the state is 10 as the only non-reserved values according to the PSC UG are 0 and 3.

Project B: 
This project is a combination of the modified PA_multicoreExample_exampleProject and a modified version of the H.264 encoder example code. I have spent less time debugging this project, but the problem appears to be similar to what happens in project A. In this project the TI Desktop Linux SDK is not used, but the test procedure is similar to what I have outlined above. In this case the "No Tx free descriptor" message from the SendPacket() function generally occurs for the 129th packet, but sometimes occurs slightly later (around packet 150). When the problem occurs, the behavior seems to be similar to project A: the code keeps running, but no packets are seen on the network. In this case, the C66x chip can still be reset over PCIe with no errors from the function accessing the PSC, but the "No Tx free descriptor" message will be received from the SendPacket() function for the 129th packet for every subsequent run of the test.

In both projects, a server reboot is required to get out of the bad state. I have tried to reproduce the problem with just the modified PA_multicoreExample_exampleProject, but haven't been able to do so.

Has anyone encountered an issue similar to this before?

What are some specific areas to look at while trying to debug this problem?

What areas in the PA_multicoreExample_exampleProject code or in the PA itself might be sensitive to other code?

Regards,
Chris
Signalogic

  • Chris,
    It seems you have modified the PA example code to suit your requirements.
    It is difficult to understand the problem and give a technical suggestion from the description alone.
    Could you provide your modified code and test scenario, so that an expert can give detailed information for your query?

  • Pubesh,

    I was able to cut out all of our software and was left with only the network I/O portion which is a modification of the original EMAC multicore example project in the PDK. The problem still occurs with this project. 

    I have attached both the project and the script (loader.sh) that I use to run the project. I place the script and the .out file in the dsp_utils binary directory inside the TI Desktop Linux SDK and run it with the following command:

    while true; do ./loader.sh 1 8 0 C66xx_RTAF_SYSBIOS_CCSv54.out ../../scripts/initcfg.txt; done

    It can also be run without an infinite loop:

    ./loader.sh 1 8 0 C66xx_RTAF_SYSBIOS_CCSv54.out ../../scripts/initcfg.txt

    Running it in an infinite loop seems to make the problem occur quicker. 

    The script usage is as follows:

    Usage: ./loader.sh <number_of_chips> <number_of_cores> <boot_entry_address> <image_file> <config_file>

    The script will use the dspallreset.sh and init_dsps.sh scripts in the scripts directory in the TI Desktop Linux SDK then use the dsp_utils load command to get the .out file running on the cores. I have only tested this on one chip as shown in the above example commands.

    The project source can be modified to change the packet contents if needed, but capturing the packets on another device on the same LAN as the PCIe card is not necessary to reproduce the problem.

    Chris

    3056.tx_desc_debug.rar

  • Pubesh,

    Is there any update on this issue? Has it been reproduced?

    Regards,
    Chris
  • Hi Chris,

    Sorry for the delayed responses here.

    I don't currently have a DSPC-8681E card with me, but I am trying to reproduce your problem on the Shannon EVM (C6678), since you are running on only one DSP.

    I'm trying to reproduce the problem using the "PA_multicore_example" project.

    Are you able to send and receive packets with TI's "PA_multicore_example" project when the only source changes are the packet size and the number of iterations?

    I.e., do you see no issues with the internal loopback code when only the number of packets is modified?


    I can repeat this test many times and it seems to be mostly reliable except for the occasional missed packets on the capture interface, but this doesn't seem to be an issue with the C66x side.


    What capture interface are you using to receive the packets? Something like Wireshark?
  • Titus S,

    Thanks for looking into this issue for us.

    I didn't see any issues with testing the internal loop back code, but I did not extensively test this. I will do some more testing with that mode today. At this point, I have further simplified the project that I previously attached. I can not find a difference between that project and what I had working from the PA_multicore_example, but I am not able to reproduce the problem using the modified PA_multicore_example. I even ran that in a loop with the TI SDK overnight.

    The capture interface I am using is Wireshark. I don't think there's anything to look into there. I think the issue with occasionally missing packets is just due to the rate of packets and the hardware I am using or other congestion on the network (In some tests I am sending packets between public IPs instead of within our LAN).

    Regards,
    Chris

  • Titus, Pubesh-

    Thanks for assisting.  If we need to get to you a DSPC-8681E card please let us know and we can coordinate that.

    At this point, if you guys can simply reproduce the problem on the 6678 EVM and/or DSPC-8681E with Chris' minimal project code, and if you don't think it's a C66x chip level issue, give us your best guess at where to look or how to debug, that would be great.  We can then get Advantech guys and our customers who are using high performance network I/O more involved.

    Some update notes:

    1) We have tried power cycling the server, and a full POR does not help.

    2) If we disconnect the RJ-45 on the card, we don't see the problem.  It might have something to do with Broadcom PHY initialization or reset, or flow control from PHY to EMAC0 to PA.

    Thanks.

    -Jeff
     Signalogic

  • Hi Chris and Jeff,

    I received the DSPC-8681E board from my colleague and am trying to reproduce this issue.

    Have you tried to increase the "NUM_HOST_DESC" to 512 in "net_io.h" ?


    I have further simplified the project that I previously attached. I can not find a difference between that project and what I had working from the PA_multicore_example, but I am not able to reproduce the problem using the modified PA_multicore_example. I even ran that in a loop with the TI SDK overnight.


    So you are not able to reproduce the problem after simplifying the code (i.e., with the newly modified PA_multicore_example)?

    In that case, which portion of the code did you change? Could you revert it to the version that showed the issue and check whether the same problem comes back?

    Also, please refer to the following TI E2E posts, which are similar to this issue.

    e2e.ti.com/.../276938

    e2e.ti.com/.../1410252

    e2e.ti.com/.../646905
  • Titus,

    After more closely comparing the example multicore project with my own project, I finally found some differences in the compiler settings that seemed to be the reasons for the different behavior I was seeing between the two projects. It seems that for the cpsw_mgmt.c file specifically, I need to use the compiler options --opt_level=off and --symdebug:dwarf. The rest of the project uses --opt_level=3 and --symdebug:skeletal.

    The simplest case that I have found to reproduce this problem is this:
    1. Start with clean PA_multicore_example project
    2. Change NUM_CORES defined in multicore_example.h to 1

    #define         NUM_CORES      1

    3. Change MAX_NUM_PACKETS defined in multicore_example.c to 300u

    #define                     MAX_NUM_PACKETS                         300u

    4. Change cpswLpbkMode declared/initialized in multicore_example.c to CPSW_LOOPBACK_NONE

    Int cpswLpbkMode = CPSW_LOOPBACK_NONE;

    5. Set project optimization level to 3 (--opt_level=3)

    With these few changes, I am able to reproduce the problem using the previously sent loader.sh script run in an infinite shell loop. If I then change the optimization level just for the cpsw_mgmt.c file to off (--opt_level=off), the problem does not occur.

    I am not sure if 300 packets is the minimum needed to reproduce the problem, but I have tested at values of 10 and 100 and didn't see the problem even after 100+ iterations.

    I have also tested with adding a 10ms delay after SendPacket() is called in multicore_example.c, but this did not make a difference.

    Even with these test results, I do not think that the optimization level was actually causing the problem, but somehow removing any optimization for cpsw_mgmt.c helps avoid the problem. This is because in one of our full projects that we are using this network I/O code in, simply changing the optimization levels for the cpsw_mgmt.c file did not help. We still need to find the root cause of this problem.

    Please let me know if you are able to reproduce the problem or if you have any suggestions on what to test further or possibly how to better debug this problem.

    Titusrathinaraj Stalin said:

    Have you tried to increase the "NUM_HOST_DESC" to 512 in "net_io.h" ?

    When I try this in the original PA_multicore_example, I see this from core 0:

    ************************************************
    *** PA Multi Core Example Started on Core 0 ***
    ************************************************
    Initializing Free Descriptors.
    QMSS successfully initialized
    CPPI successfully initialized
    PASS successfully initialized
    Ethernet subsystem successfully initialized
    Error allocating Tx free descriptors
    Tx setup failed

    Regards,
    Chris

  • Hi Chris,


    Initializing Free Descriptors.
    QMSS successfully initialized
    CPPI successfully initialized
    PASS successfully initialized
    Ethernet subsystem successfully initialized
    Error allocating Tx free descriptors
    Tx setup failed


    This is because the variable "i" is declared as an 8-bit type, which cannot count up to 256.

    Declare "i" as a 32-bit type in both the "Setup_Tx" and "Setup_Rx" functions so that 256 free descriptors can be allocated in each.

    UInt8 isAllocated,i;

    TO

    UInt8 isAllocated;
    UInt32 i;
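
A small standalone sketch of why the counter width matters here. This is not the PDK code: the function names and the `guard` parameter are made up for illustration, and the guard exists only so the demo terminates. An 8-bit index wraps from 255 back to 0, so a loop bounded by 256 can never see its index reach the limit.

```c
#include <assert.h>
#include <stdint.h>

/* Count loop iterations using an 8-bit index, the way a UInt8 `i` would.
 * `guard` caps the iterations so the demo terminates even when the
 * index wraps and `i < limit` can never become false. */
unsigned count_with_u8(unsigned limit, unsigned guard)
{
    assert(guard > 0);
    unsigned iterations = 0;
    for (uint8_t i = 0; i < limit && iterations < guard; i++) {
        iterations++;
    }
    return iterations;
}

/* Same loop with a 32-bit index: it stops at `limit` as intended. */
unsigned count_with_u32(unsigned limit, unsigned guard)
{
    assert(guard > 0);
    unsigned iterations = 0;
    for (uint32_t i = 0; i < limit && iterations < guard; i++) {
        iterations++;
    }
    return iterations;
}
```

With a limit of 128 both versions stop after 128 iterations. With a limit of 256 only the 32-bit version stops on its own; the 8-bit version runs until the guard cuts it off.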
  • Titus,

    I made that change to the Setup_Tx and Setup_Rx functions and am able to run the demo with 512 total descriptors, but this doesn't prevent the problem. With the few changes I mentioned above to the PA_multicore_example project and the changes you suggested to increase the number of descriptors I have to send more packets before seeing the "No Tx Free descriptor" message, but even with sending 300 packets and not seeing that message, I eventually get a case where no packets are actually sent out and the next iteration fails during reset.
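
One reading that fits the observation above (more descriptors means more packets before the message appears, while nothing reaches the wire) is that each send takes a descriptor from the Tx free queue and a stalled transmitter never gives it back. The toy model below only illustrates that assumption. It is not the PDK or QMSS API, and all names in it are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a Tx free-descriptor pool (hypothetical, not the PDK API). */
typedef struct {
    unsigned free_count;   /* descriptors currently on the free queue */
    unsigned in_flight;    /* descriptors handed to the transmitter   */
} tx_pool;

void tx_pool_init(tx_pool *p, unsigned num_desc)
{
    assert(p != 0);
    p->free_count = num_desc;
    p->in_flight = 0;
}

/* Sending takes one descriptor from the free queue; fails when none is left. */
bool tx_pool_send(tx_pool *p)
{
    if (p->free_count == 0) {
        return false;              /* the "No Tx free descriptor" case */
    }
    p->free_count--;
    p->in_flight++;
    return true;
}

/* Transmit completion returns one descriptor to the free queue. */
void tx_pool_complete(tx_pool *p)
{
    if (p->in_flight > 0) {
        p->in_flight--;
        p->free_count++;
    }
}

/* Returns the 1-based index of the first send that fails, or 0 if all
 * `packets` sends succeed. `tx_stalled` models a transmitter that never
 * completes (and so never recycles) any descriptor. */
unsigned first_failed_send(unsigned num_desc, unsigned packets, bool tx_stalled)
{
    tx_pool p;
    tx_pool_init(&p, num_desc);
    for (unsigned n = 1; n <= packets; n++) {
        if (!tx_pool_send(&p)) {
            return n;
        }
        if (!tx_stalled) {
            tx_pool_complete(&p);
        }
    }
    return 0;
}
```

In this model a stalled pool of N descriptors always fails on send N+1, and a larger pool only postpones the failure. If that is what is happening on the device, the error message is a symptom of the stalled transmit path, not of a pool that is too small.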

    If I then change the optimization level for cpsw_mgmt.c to --opt_level=off, I don't see the problem. Same as before.

    Regards,
    Chris
  • I have some more info on this problem:

    The problem seems to be related to other network traffic. If I use tcpreplay to blast a large number of packets at a high speed on the LAN, then the problem occurs instantaneously. This seems to be in line with what Jeff had mentioned where if we initially disconnect the cable running to the card and reconnect it after the C66x code has finished initialization and has already sent some packets then we don't see the problem.

    Chris Johnson1 said:

    3. Change MAX_NUM_PACKETS defined in multicore_example.c to 300u

    #define                     MAX_NUM_PACKETS                         300u

    Chris Johnson1 said:

    I am not sure if 300 packets is the minimum needed to reproduce the problem, but I have tested at values of 10 and 100 and didn't see the problem even after 100+ iterations.

    With MAX_NUM_PACKETS set to the default value of 10u and the aforementioned method of blasting a larger number of packets at a high rate on the network, I am able to reproduce the problem.

    Also, as a note, the change to setting NUM_CORES to 1 was done to simplify the test and to avoid having to use any extra code to get the other cores started since I am loading/running over PCIe with the TI SDK. 

    Regards,
    Chris

  • Do you have a Shannon EVM (C6678) with you?
    If yes, are you able to reproduce this issue on that board?
    Could you share the current modified project?
  • Titus,

    I do have a Shannon EVM C6678 board, but I don't currently have access to a JTAG Debugger. I can try to get a hold of one, but this will take some time on our part to locate one.

    I have attached the modified project that is modified from the original project with the few steps I laid out previously.

    I noticed that in the case of blasting a large number of packets at a high rate, the problem occurs even with the original unmodified PA_multicore_example project. In this case there isn't a "No Tx free descriptor" error message, but the dspresetall.sh script in the TI SDK will fail with the same aforementioned pscstate error messages. It seems like network traffic can have a negative effect when initializing the network I/O portion of the chip. 

    Were you able to reproduce the problem at all on the EVM or PCIe card with the previous project that I sent?

    If you are able to reproduce the problem on the PCIe card, but not the EVM, then we can get Advantech more involved in debugging this issue. 

    Also, if you are trying to reproduce this problem on the EVM board, it will be better if you use a soft reset similar to what takes place with the TI SDK for the PCIe card. I think the problem is less likely to occur (if it will occur at all) when a hard reset takes place.

    Note: I am blasting a large number of packets at a high rate using tcpreplay with the following command:

    tcpreplay -i Auto_eth0 --unique-ip --loop 2 bigFlows_bcast.pcap

    The .pcap file that I use can be found here: 

    I used tcprewrite to change all destination MAC addresses in the original pcap file to broadcast MAC.
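
For readers unfamiliar with the rewrite step, here is a minimal sketch of what changing the destination MAC to broadcast means for a single frame. It is not how tcprewrite is implemented; the function names are hypothetical. The only facts relied on are that an Ethernet header starts with the 6-byte destination MAC and that the broadcast address is ff:ff:ff:ff:ff:ff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAC_LEN     6    /* bytes in an Ethernet MAC address        */
#define ETH_HDR_LEN 14   /* 6 dst + 6 src + 2 EtherType             */

/* Overwrite the destination MAC (the first 6 bytes of the frame) with the
 * broadcast address. Returns 0 on success, -1 if the buffer is too short
 * to hold an Ethernet header. */
int set_broadcast_dst(uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < ETH_HDR_LEN) {
        return -1;
    }
    memset(frame, 0xFF, MAC_LEN);
    return 0;
}

/* Nonzero if the frame's destination MAC is the broadcast address. */
int is_broadcast_dst(const uint8_t *frame, size_t len)
{
    static const uint8_t bcast[MAC_LEN] = {0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF};
    assert(frame != NULL);
    return len >= MAC_LEN && memcmp(frame, bcast, MAC_LEN) == 0;
}
```

Because every frame is then addressed to broadcast, each port on the LAN segment, including the card's, receives the whole replayed stream.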

    Regards,
    Chris

    pdk_C6678_1_1_2_6.rar

  • Titus-

    Chris mentions that he can reproduce the issue with the TI unmodified "PA_multicore_example" project, using tcpreplay to simulate network congestion.

    Can you try this on your EVM board and PCIe card?  It should work on both unmodified (i.e. TI distribution code, no changes), since the example is using internal loopback.  It's unclear how external congestion can affect this example.

    Thanks.

    -Jeff

  • Hi Jeff and Chris

    Chris mentions that he can reproduce the issue with the TI unmodified "PA_multicore_example" project, using tcpreplay to simulate network congestion.

    Can you try this on your EVM board and PCIe card?  It should work on both unmodified (i.e. TI distribution code, no changes), since the example is using internal loopback.  It's unclear how external congestion can affect this example.

    Jeff, it's not possible.

    I've tried it, but I'm not seeing any problem on the Shannon EVM (C6678).

    I think that Chris might have used "CPSW_LOOPBACK_NONE".

    Chris, Is that right ?

    Or were you seeing the problem even with TI's unmodified "PA_multicore_example" project while broadcasting a large number of packets at a high rate using "tcpreplay", as Jeff said?

    Note that I'm using the C6678 Shannon EVM with an XDS560 emulator and the "System Reset" (Ctrl+Shift+S) option in CCS to reset the board.

    On one occasion, I saw the "No Tx free descriptor. Cant run send/rcv test" error when I enabled "CPSW_LOOPBACK_NONE".

    I also used tcprewrite on the original pcap file to change the destination MAC addresses to the broadcast MAC, and to set the source and destination IP addresses to those of my machine and the DSP.

    I'm still trying to reproduce the issue on the C6678 Shannon EVM by enabling "CPSW_LOOPBACK_NONE" and broadcasting a large number of packets at a high rate using "tcpreplay".

    I'm not seeing any problem at all with "CPSW_LOOPBACK_INTERNAL".

    Also, I'm not yet able to work with the DSPC8681E board, since this board is new to me.

    I've used your loader script to load the .out to the DSP.

    ./loader.sh 1 4 0x0084f6e0 pa.out initcfg_1000.txt

    Here, 0x0084f6e0 is the entry point taken from the pa.map file.
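
The loader log below shows this core-local entry point being turned into a per-core address (1184f6e0 for core 1, 1284f6e0 for core 2, and so on). A minimal sketch of that mapping, assuming the usual C66x convention that a core's local memory is also visible at the global alias 0x10000000 + (core << 24) + local address; the function name is hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Convert a core-local address (such as an L2 address 0x008xxxxx) to the
 * global alias for the given core: 0x1n000000 + local, where n is the
 * core number. */
uint32_t local_to_global(uint32_t local_addr, unsigned core)
{
    assert(core < 8);                    /* C6678 has cores 0..7          */
    assert((local_addr >> 24) == 0);     /* must be a core-local address  */
    return UINT32_C(0x10000000) | ((uint32_t)core << 24) | local_addr;
}
```

For example, `local_to_global(0x0084f6e0, 1)` gives 0x1184f6e0, which matches the value the loader prints for core 1.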

    I've attached the log from my attempt on the DSPC8681E board; I was not able to get any console log to check for the error.

    Could you please tell me where I can get the console log, if you know?

    Note that I checked the onboard COM1 (UART) port for console prints but got nothing, so I'm not able to check on the DSPC board.

    titus@titus-desktop:~/ti/Titus_PA_issue$ sudo ./loader.sh 1 4 0x0084f6e0 pa.out initcfg_1000.txt
    [sudo] password for titus:
    Num of devices 4

     Iterations waited for entry point to clear 1
    Dsp 0:  DSP Reset success !

     Iterations waited for entry point to clear 1
    Dsp 1:  DSP Reset success !

     Iterations waited for entry point to clear 1
    Dsp 2:  DSP Reset success !

     Iterations waited for entry point to clear 1
    Dsp 3:  DSP Reset success !
    Number of devices: 4
    Device : 0 : dspc8681

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 860000 Download image success !
    Device : 1 : dspc8681

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 860000 Download image success !
    Device : 2 : dspc8681

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 860000 Download image success !
    Device : 3 : dspc8681

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 860000 Download image success !
    Load to chip 0, core 1

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 1184f6e0 Download image success !
    Load to chip 0, core 2

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 1284f6e0 Download image success !
    Load to chip 0, core 3

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 1384f6e0 Download image success !
    Load to chip 0, core 0

    boot config file initcfg_1000.txt
     DSP boot config addr 0x86ff00
     DSP boot config size in bytes 8
    Boot config words: 0xbabeface,
    Boot config words: 0x14,
     Overriding image entry point with input 1084f6e0 Download image success !
    titus@titus-desktop:~/ti/Titus_PA_issue$

  • Titus,

    Titusrathinaraj Stalin said:

    I think that Chris might have used "CPSW_LOOPBACK_NONE".

    Chris, Is that right ?

    In the modified version where I send 300 packets and am not using tcpreplay to create network congestion, I am using CPSW_LOOPBACK_NONE. In the other case where I use tcpreplay to create network congestion, I am using the original project unmodified which is CPSW_LOOPBACK_INTERNAL.

    Titusrathinaraj Stalin said:

    Could you please tell me where I can get the console log, if you know?

    To get the console log on a PCIe card, I modify the .cfg file to use a SysMin buffer instead of SysStd for logging, and save the memory location used by the SysMin buffer. I don't think the dsp_utils app in the TI SDK has the capability to do this so I use Advantech's dsp_loader app with the savebinary command instead.

    /* var SysStd   = xdc.useModule('xdc.runtime.SysStd');
    System.SupportProxy = SysStd; */
    var SysMin = xdc.useModule('xdc.runtime.SysMin');
    SysMin.bufSize = 10240; 

    The buffer will have the symbol name: "xdc_runtime_SysMin_Module_State_0_outbuf__A" 
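
Once that memory region has been saved to a file, the log text still has to be pulled out of the raw bytes. A minimal sketch of that step, under two assumptions that should be checked against your setup: the buffer was zero-filled before logging started, and it has not wrapped, so the text runs from offset 0 to the first NUL byte. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Copy the text held in a raw dump of the SysMin output buffer into `out`
 * as a NUL-terminated string. Assumes the text starts at offset 0 and ends
 * at the first NUL byte (zero-filled buffer, no wraparound).
 * Returns the number of characters copied, excluding the terminator. */
size_t sysmin_dump_to_text(const unsigned char *dump, size_t dump_len,
                           char *out, size_t out_cap)
{
    assert(dump != NULL && out != NULL && out_cap > 0);
    size_t n = 0;
    while (n < dump_len && n + 1 < out_cap && dump[n] != '\0') {
        out[n] = (char)dump[n];
        n++;
    }
    out[n] = '\0';
    return n;
}
```

A caller would read the saved file into `dump`, pass a text buffer of at least the SysMin.bufSize value plus one, and print the result. If the log has wrapped, the oldest text is overwritten and this simple scan no longer returns the lines in order.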

    I don't usually check the console log to determine that the error occurred. I look for the reset to fail on the next iteration of the test. 

    Regards,
    Chris

  • Hi Chris,

    Sorry for the delayed response here.

    I'm able to reproduce the problem on the C6678 Shannon EVM and am working on it.

    Could you try the following sequence when you test?

    1) Run the PA_multicore example on the DSPC8681 board.
    2) Run the same code 2 or 3 more times, with a reset before each run.
    3) Now create network congestion on the destination host machine using "tcpreplay":
    sudo tcpreplay -i eth1 -tK --loop 50 bigFlows.pcap

    4) Now run the PA_multicore example code on the DSPC8681 board again.

    Then check whether you are able to see the problem.

    Is there any update or any different behavior from your end with a different test case?
  • That is, create the congestion (network traffic) after running the program (PA_multicore).
  • Titus,

    I don't have any update from my end. I've been trying to find a workaround using the PHY's registers, but haven't had any luck yet.

    I tried the sequence that you mentioned and I was able to reproduce the problem. That sequence is actually a test sequence that I have already tried. I am using a modified version of the bigFlows.pcap file with the destination MAC address for all packets changed to broadcast.

    Note: Previously when I mentioned using the original unmodified PA_multicore_example project, I have actually changed the NUM_CORES define to 1. This is the only change I have made and is necessary to avoid having to add additional code for core 0 to start cores 1-7 since I am loading over PCIe.

    Regards,
    Chris

  • Titus-

    I would like to add to Chris' update.  It's good to keep in mind that the problem can be avoided by simply disconnecting the Shannon EVM RJ-45 cable during EMAC / PA initialization, and then re-connecting.  For example, disconnect, run C6678 code (wait a couple of seconds), reconnect.  This works regardless of the level of network congestion.

    This seems to imply a flow control issue between PHY and 6678 during initialization.  To achieve a work-around, I think the question might be:  how to make the PHY "appear to be" disconnected?  Internal loopback?  Other method? 

    Thanks.

    -Jeff
    Signalogic

  • Titus-

    Can you give some suggestions on this?  What register(s) to set in the PHY to simulate "no cable connected"?  Thanks.

    -Jeff

  • Jeff,
    We are looking into this, and will reply to the thread once we have more details.

    Lali
  • Hi Lali,

    I am working with Jeff and Chris.
    Thank you very much for continuing to look into it, the issue remains pending and critical.

    Thank you,
    Harshal Patel
    HPC Systems Engineer
    Signalogic Inc.
  • Titus, Pubesh-

    This is becoming a critical thing for us.  One of our customers is considering using an Enet switch with remote control power, so they can "programmatically disconnect" the card's RJ-45 during initialization.  As you can guess they're not happy with that.

    Here is some more info, plus related questions:

    1) After a server reboot, which resets the PCIe card and C6678, it appears that PA init can sometimes succeed one time regardless of network traffic / congestion.  After that, successive C6678 soft reboots and PA inits can fail depending on network traffic (as documented in this thread).

    2) Given this, we have tried the following "shutdown sequence":

      -disable packet output, delay for 1 sec
      -SGMII and PHY reset, delay for 1 sec
      -soft reset C6678

    and that did seem to help on some systems (reduce the init failure rate), but not on all systems / networks.  Can you suggest additional shutdown steps or sequence?

    3) The issue may at least be somewhat dependent on variation in C6678 devices.  We got a new shipment of PCIe cards, and found one card that was somewhat more resilient than others.  (This was a 1.25 GHz card, but other 1.25 GHz cards from the same batch did not show any improvement).  My understanding is there have been no errata changes on 6678 for some time, so what would you suggest that we look for or measure?   Anything at all that you want us to probe, please let us know.

    4) Have you tried programming the PHY in loopback or other modes that might simulate an RJ-45 disconnected condition?  Was there any difference in the problem with this?  Although the problem has been demonstrated in conjunction with both Marvell and Broadcom PHYs, and we don't think it's a PHY problem, any suggestions you can give us for PHY programming we're happy to try.

    Thanks.

    -Jeff

  • Hi Jeff,

    I have taken a look at this issue, and it seems that the switch port does not know how to handle the large influx of packets prior to the device execution. I’m going off the example in this thread where tcpreplay is sending the packets specifically to the C6678 Eth port, and using the PA_multicoreExample in CPSW_LOOPBACK_NONE. The screenshot below was taken with Core0 connected, code loaded (prior to running) and tcpreplay running.


    Questions:

    -          Is this what you also observe?
    -          Are you using a GEL to initialize?
    -          I’m not sure what version of the C6678 GEL you are using, but is there a configSGMIISerdes() step in the sequence?

    Additionally, I have been thinking of ways to work around this problem: essentially, find a way to reset the SGMII just before transmitting the packets. The thought here is to somehow clear the switch status by doing the SGMII re-init. Please see if the suggestion below helps.

    1) In cpsw_mgmt.c (around line 134), please add a routine that basically does the configSGMIISerdes step in C code instead of in the GEL. You can comment out this step in the GEL if it is being done there. You will now be doing it in the code instead, as described below.

    if (!cpswSimTest)
    {
        do
        {
            Init_SGMII_SERDES(); //SGMII config SERDES routine
            CSL_SGMII_getStatus(macPortNum, &sgmiiStatus);
        } while (sgmiiStatus.bIsLinkUp != 1);
    }   /* closes the if; keep any existing code in this block before it */

    2) Place the Init_SGMII_SERDES function somewhere in cpsw_mgmt.c. It's essentially the same routine as in the GEL. The changeRegister function is a quick way to poke the relevant registers without much hassle.

    /* Write val to the memory-mapped register at address x. The pointer is
       volatile so that the compiler does not optimize the store away. */
    void changeRegister(unsigned int x, unsigned int val)
    {
        volatile unsigned int *p5 = (volatile unsigned int *) x;
        *p5 = val;
    }

    int Init_SGMII_SERDES(void)
    {
        printf("Initializing SGMII SERDES...\n");

        CSL_BootCfgUnlockKicker(); //unlock

        changeRegister(0x2090210, 0x0); //SGMII_SERDES_CONTROL_PORT1 = 0x0;
        changeRegister(0x2090110, 0x0); //SGMII_SERDES_CONTROL_PORT0 = 0x0;

        CSL_BootCfgSetSGMIIConfigPLL(0x00000041); //SGMII_SERDES_CFGPLL = 0x00000041;
        CycleDelay(100);

        CSL_BootCfgSetSGMIIRxConfig(0, 0x00700621); //SGMII_SERDES_CFGRX0 = 0x00700621;
        CSL_BootCfgSetSGMIIRxConfig(1, 0x00700621); //SGMII_SERDES_CFGRX1 = 0x00700621;
        CSL_BootCfgSetSGMIITxConfig(0, 0x000108A1); //SGMII_SERDES_CFGTX0 = 0x000108A1;
        CSL_BootCfgSetSGMIITxConfig(1, 0x000108A1); //SGMII_SERDES_CFGTX1 = 0x000108A1;

        changeRegister(0x2090138, 0x41); //SGMII_SERDES_AUX_CFG_PORT0 = 0x00000041;
        changeRegister(0x2090238, 0x41); //SGMII_SERDES_AUX_CFG_PORT1 = 0x00000041;
        CycleDelay(100);

        changeRegister(0x2090118, 0x1); //SGMII_SERDES_MR_ADV_PORT0 = 0x1;
        changeRegister(0x2090218, 0x1); //SGMII_SERDES_MR_ADV_PORT1 = 0x1;
        CycleDelay(100);

        changeRegister(0x2090210, 0x1); //SGMII_SERDES_CONTROL_PORT1 = 0x1;
        changeRegister(0x2090110, 0x1); //SGMII_SERDES_CONTROL_PORT0 = 0x1;

        changeRegister(0x2090950, 0x2520); //SGMII_SLIVER_MAXLEN2 = 0x2520;
        changeRegister(0x2090910, 0x2520); //SGMII_SLIVER_MAXLEN1 = 0x2520;
        changeRegister(0x2090944, 0xA1); //SGMII_SLIVER_MACCONTROL2 = 0xA1;
        changeRegister(0x2090904, 0xA1); //SGMII_SLIVER_MACCONTROL1 = 0xA1;

        CycleDelay(100000000);

        CSL_BootCfgLockKicker(); //lock

        /* SGMII SERDES Configuration complete. Return. */
        return 0;
    }

    Mind you, this isn't polished code, just an experiment to see whether the descriptor errors go away if you re-init the SGMII SERDES before TX.
    Please let us know if this helped.

    Thanks,
    Lali

     

  • Lali-

    Thanks very much, we'll try this right away.

    In addition, we've been working on a parallel approach:  isolate the PHY from incoming packets during initialization.  We tried placing the TRD+/- lines in high impedance state but somehow that impacted initialization.  Now we're trying to isolate the PHY from the GMII interface, which seems to show early promise, but needs a lot more testing.
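
    A minimal sketch of the "isolate the PHY from the GMII interface" idea, assuming the PHY follows the standard IEEE 802.3 Clause 22 register layout, where bit 10 of register 0 (the basic control register) is the Isolate bit. The mdio_read/mdio_write helpers and the simulated register file are hypothetical stand-ins for whatever MDIO access the board actually provides, and any vendor-specific PHY settings are not covered:

```c
#include <stdint.h>

/* IEEE 802.3 Clause 22: register 0 is the basic control register. */
#define PHY_REG_BMCR      0u
#define PHY_BMCR_ISOLATE  (1u << 10)   /* isolate PHY from the MII/GMII */

/* HYPOTHETICAL backend: simulated PHY registers so the sketch runs on a
 * host. On hardware, replace these two helpers with real MDIO accesses. */
static uint16_t sim_phy_regs[32] = { [PHY_REG_BMCR] = 0x1140 };

static uint16_t mdio_read(unsigned phy_addr, unsigned reg)
{
    (void)phy_addr;                    /* single simulated PHY */
    return sim_phy_regs[reg & 31u];
}

static void mdio_write(unsigned phy_addr, unsigned reg, uint16_t val)
{
    (void)phy_addr;
    sim_phy_regs[reg & 31u] = val;
}

/* Read-modify-write so the other control bits (speed, duplex,
 * auto-negotiation) are left untouched. */
void phy_set_isolate(unsigned phy_addr, int isolate)
{
    uint16_t bmcr = mdio_read(phy_addr, PHY_REG_BMCR);

    if (isolate)
        bmcr |= PHY_BMCR_ISOLATE;
    else
        bmcr &= (uint16_t)~PHY_BMCR_ISOLATE;

    mdio_write(phy_addr, PHY_REG_BMCR, bmcr);
}

/* Returns 1 if the Isolate bit is currently set, 0 otherwise. */
int phy_is_isolated(unsigned phy_addr)
{
    return (mdio_read(phy_addr, PHY_REG_BMCR) & PHY_BMCR_ISOLATE) != 0;
}
```

    Calling phy_set_isolate(addr, 1) before initialization and phy_set_isolate(addr, 0) once setup has finished would bracket the quiet window. Whether the Isolate bit alone is sufficient on a given PHY is something to verify on hardware.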

    -Jeff

  • Lali,

    I observe similar behavior in my tests where those stats registers have values like you show. I run the reset script, the init script, then start tcpReplay and see those values increment.

    In my test case, I am using Advantech's 8681 card and loading code over PCIe with the TI Desktop Linux SDK. I am not using a GEL file during this.

    I tried adding the Init_SGMII_SERDES() function as you had shown. I had to change a couple lines for it to work at all with the 8681 card:

        CSL_BootCfgSetSGMIIConfigPLL (0x00000051); //SGMII_SERDES_CFGPLL = 0x00000051 (was 0x00000041)
        changeRegister(0x2090904,0x400A1); //SGMII_SLIVER_MACCONTROL1 = 0x400A1 (was 0xA1)

    Without these changes, I wasn't able to see any packets being transmitted at all. After making those changes to the code you sent, I tried calling Init_SGMII_SERDES() in the location that you suggested as well as prior to init and immediately before sending packets. In all these cases, I still ran into the same problem: once I started running tcpReplay, within a few iterations of my test, the chip soft reset would fail.
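
    One way to keep board differences like these in one place is a small per-board table. This is only a sketch: the struct, enum and function names are hypothetical, and the values are simply the ones quoted in this thread (0x41 / 0xA1 from the routine posted earlier, 0x51 / 0x400A1 from the two changed lines), not values checked against a datasheet:

```c
#include <stddef.h>
#include <stdint.h>

/* HYPOTHETICAL board identifiers for this sketch. */
enum board_id { BOARD_C6678_EVM, BOARD_DSPC8681, BOARD_COUNT };

struct sgmii_board_cfg {
    uint32_t serdes_cfgpll;   /* value written to SGMII_SERDES_CFGPLL */
    uint32_t maccontrol1;     /* value written to SGMII_SLIVER_MACCONTROL1 */
};

/* Values as reported in the thread for each board. */
static const struct sgmii_board_cfg board_cfgs[BOARD_COUNT] = {
    [BOARD_C6678_EVM] = { 0x00000041u, 0x000000A1u },
    [BOARD_DSPC8681]  = { 0x00000051u, 0x000400A1u },
};

/* Look up the settings for a board; NULL for an unknown id. */
const struct sgmii_board_cfg *sgmii_cfg_for(enum board_id id)
{
    if ((unsigned)id >= (unsigned)BOARD_COUNT)
        return NULL;
    return &board_cfgs[id];
}

/* Bits that differ between two register values. */
uint32_t changed_bits(uint32_t a, uint32_t b)
{
    return a ^ b;
}
```

    The XOR helper shows that each change touches a single bit: bit 4 of the PLL value (0x41 vs 0x51) and bit 18 of MACCONTROL1 (0xA1 vs 0x400A1).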

    Did the modifications that you suggested help in your tests? I do not have access to a JTAG and an EVM board right now so I want to find out if there are differences seen with this code added between the EVM board and the 8681 card.

    Regards,

    Chris

  • Hi Chris,

    Here are some thoughts on the problem.

    In the initialization sequence, if the SGMII is initialized before all the other configurations (namely, the queues and the descriptors that are part of the PA LLD) are done, and there’s an influx of packets, the unready CPPI and QMSS system seemingly cannot handle packets.

    So our solution is to postpone the initialization of the SGMII serdes until after all other initializations are completed.

    The Linux Desktop SDK's platform_lib library has routines that configure the SGMII serdes. If these routines are called before all other initializations are done, you may observe the error that you are seeing.

    Our suggestion is to eliminate the SGMII serdes initialization routine in the platform_lib source. Please take a look at /ti/desktop-linux-sdk_01_00_03_00/platform_patch/dspc8681/platform_lib/src/platform.c (around line 493) where configSerdes() gets called. This function points to dspc8681_phy.c (around line 48). You might want to remove this step in your initialization, and initialize the SGMII from the code as you have done from my earlier suggestion (you may have to re-build platform_lib after the modifications).

    Please report if this works.

    In my experiments I commented out the configSGMIISerdes() in the GEL and called it in the C code instead. It worked on the C6678 EVM without issue with tcpreplay activated.

    Initializing Free Descriptors.
    QMSS successfully initialized
    CPPI successfully initialized
    PASS successfully initialized
    Initialzing SGMII SERDES...
    Ethernet subsystem successfully initialized
    Tx setup successfully done
    Rx setup successfully done
    PASS setup successfully done
    Waiting for all cores to reach the barrier before transmission starts ...
    Packet Transmission Start ...
    Packet Transmission Done.
    Wait for all packets to be Received ...
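
    The ordering above can be sketched roughly as follows. All function names are hypothetical stubs rather than PDK or platform_lib calls; the only point is the guard that refuses to bring up the SGMII SERDES until QMSS, CPPI and PASS have all reported ready:

```c
/* One ready flag per subsystem that must be up before packets arrive. */
enum {
    READY_QMSS = 1 << 0,
    READY_CPPI = 1 << 1,
    READY_PASS = 1 << 2,
    READY_ALL  = READY_QMSS | READY_CPPI | READY_PASS
};

static unsigned ready_mask;  /* subsystems initialized so far */
static int serdes_up;        /* 1 once the (simulated) link is enabled */

/* HYPOTHETICAL stubs standing in for the real QMSS/CPPI/PASS init calls. */
static int init_qmss_stub(void) { ready_mask |= READY_QMSS; return 0; }
static int init_cppi_stub(void) { ready_mask |= READY_CPPI; return 0; }
static int init_pass_stub(void) { ready_mask |= READY_PASS; return 0; }

/* Enable the SERDES only when every packet consumer is ready. */
int init_serdes_guarded(void)
{
    if ((ready_mask & READY_ALL) != READY_ALL)
        return -1;           /* too early: Rx traffic could hit unready queues */
    serdes_up = 1;           /* real code would call Init_SGMII_SERDES() here */
    return 0;
}

/* Bring-up in the order shown in the log: QMSS, CPPI, PASS, then SERDES. */
int bring_up(void)
{
    if (init_qmss_stub() != 0) return -1;
    if (init_cppi_stub() != 0) return -1;
    if (init_pass_stub() != 0) return -1;
    return init_serdes_guarded();
}
```

    Calling init_serdes_guarded() on its own before bring_up() returns -1, which is the situation the reordering is meant to rule out.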

    Regards,

    Lali

  • Chris,
    To add to my above posts, please also note:

    In the original cpsw_mgmt.c file around line 451, a call to Init_SGMII_SERDES() exists inside Init_Cpsw. Please remove this so as to avoid calling Init_SGMII_SERDES() twice in your code.

    Please let me know if these suggestions fixed the issue.
    I have tried this on my C6678 EVM and they work.

    Lali

  • Lali,

    I tried your suggestion of modifying platform_lib. I rebuilt both platform_lib and platform_init. I ended up taking the code from platform_lib and using that as the Init_SGMII_SERDES() function, and left the call at line 451. I didn't add the function to that do-while loop in Init_SGMII(). This seems to work for the simplified test case of using tcpReplay, but doesn't fix the problem in our main project.

    Your explanation of the problem has helped though. I think the problem should be similar in the case of our main project because, as Jeff mentioned, disconnecting the Ethernet cable and reconnecting it after init has finished and the first hundred packets or so have been "sent" prevents the issue from occurring. 

    We're currently looking into a way of "disconnecting" the cable through software by writing to the PHY registers as a workaround in this case.

    Chris

  • Lali,

    After working with Broadcom to figure out what we needed to do with the PHY and using what you suggested in combination, we now have a fix/workaround that prevents the issue from occurring in all of our current projects.

    Lali, Titus, and Pubesh, thank you all for your help in resolving this issue.

    Regards,
    Chris
  • Hi Chris,
    Great! Nice to hear.
    Sounds good.
    Thanks for your update.
  • Hi Chris,

    In our project we have a problem similar to the one you had. Could you shed some light on "what we needed to do with the PHY"? Just changing the location of the SerDes initialization didn't work in our case.

    Best regards,
    Pavlo!
  • Pavlo-

    We put the PHY in high-impedance mode; i.e. as if external Rx and TX were disconnected (same as if cable was disconnected).  This ensures that no Rx packet activity occurs during PA initialization.

    The c66x NetCP and PA truly need "absolute quiet" to be reliably/consistently initialized.

    -Jeff
    Signalogic