
Problem when two C6678 DSPs communicate with each other over SRIO

Hello

        I am testing the performance of our new DSP board, which has two C6678 devices on it. DSP-A sends 1 MB of data to DSP-B with DIO NWRITE, then reads it back with DIO NREAD. Most of the time everything is OK, but sometimes an error occurs, as described below:

        A writes to B successfully; we confirmed this by checking B's memory over JTAG. But when A reads, it seems to read nothing at all, because the values in the receive buffer do not change. When this happens, A takes much longer than usual to finish the read (measured as the time until it gets the completion code in LSU_STAT_REG): about 1000000 cycles when the read succeeds, and 5600000000 cycles when it fails. The completion code A gets is 0, which means the read succeeded, doesn't it?

        At the same time, we read ERR_DETECT_REG and it is 0x01800000. According to the documentation, this means a time-out for an expected response and an unexpected response received. When the error does not occur, this register is 0x0 (no error).

        Because the error does not happen every time, we think it may come from the hardware. But why does NREAD fail while NWRITE succeeds? And if A does not get a response in time, how can it get a completion code in LSU_STAT_REG?

        Our environment is CCS 5.0.3, the SRIO rate is 3.125 Gbaud, and the DSP frequency is 1 GHz. Looking forward to your response. Thank you very much.
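One way to make the "read nothing" case detectable in software, rather than by inspection, is to pre-fill the receive buffer with a known pattern before issuing the NREAD and check afterwards whether the pattern is still intact. The sketch below is only an illustration: the names `RX_SENTINEL`, `rx_prefill` and `rx_untouched` are hypothetical, and on a real device cache maintenance around the transfer may be needed, which is not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical fill pattern; pick a value the remote data is unlikely
 * to consist of entirely. */
#define RX_SENTINEL 0xA5u

/* Fill the receive buffer with the sentinel before starting the NREAD. */
void rx_prefill(uint8_t *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        buf[i] = RX_SENTINEL;
    }
}

/* Returns 1 if every byte still holds the sentinel (nothing was written
 * into the buffer), 0 as soon as one byte differs. */
int rx_untouched(const uint8_t *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len; i++) {
        if (buf[i] != RX_SENTINEL) {
            return 0;
        }
    }
    return 1;
}
```

If `rx_untouched` returns 1 after the LSU reports completion code 0, the transfer reported success without delivering data, which is the failure described above.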
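As a rough illustration of how 0x01800000 breaks down into those two conditions, here is a small decode sketch. The bit positions (24 and 23) are an assumption based on the RapidIO Logical/Transport Layer Error Detect CSR layout and should be checked against the SRIO user guide for the device; the macro and function names are hypothetical.

```c
#include <stdint.h>
#include <stdio.h>

/* Assumed bit positions, LSB = bit 0. Verify against the device documentation. */
#define ERR_DET_PKT_RESP_TIMEOUT (1u << 24) /* expected response timed out   */
#define ERR_DET_UNSOLICITED_RESP (1u << 23) /* unexpected response received  */

/* Returns 1 if the given flag is set in the register value, else 0. */
int err_det_has(uint32_t value, uint32_t flag)
{
    return (value & flag) != 0;
}

/* Prints the two conditions discussed here and reports any other set bits. */
void err_det_print(uint32_t value)
{
    uint32_t other = value & ~(ERR_DET_PKT_RESP_TIMEOUT | ERR_DET_UNSOLICITED_RESP);

    printf("ERR_DETECT = 0x%08lX\n", (unsigned long)value);
    if (value == 0) {
        puts("  no error latched");
        return;
    }
    if (err_det_has(value, ERR_DET_PKT_RESP_TIMEOUT)) {
        puts("  packet response time-out");
    }
    if (err_det_has(value, ERR_DET_UNSOLICITED_RESP)) {
        puts("  unsolicited response");
    }
    if (other != 0) {
        printf("  other bits set: 0x%08lX\n", (unsigned long)other);
    }
}
```

Under that assumed layout, 0x01800000 has exactly these two bits set and no others.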

 

  • Everything you describe suggests that there are link-level errors that are delaying your transactions.  These link-level errors cause the HW state machines to try to recover through low-level handshaking.  They can be caused by multiple things, such as bit errors, error states, or ackID misalignment.  Please take a look at the following threads and make sure you have implemented/followed them:

    http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/196080/850001.aspx#850001 - VMIN setting

    http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/170264/752157.aspx#752157 - Software error recovery

    http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/267043/937560.aspx#937560 - disable C66x port-writes

    and the following may help decode any error conditions:

    http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/264325/927003.aspx#927003 - Error status and debug GEL

    Regards,

    Travis

  • Hello 

              Sorry for the late reply. I have read the pages you pointed us to. So you think the error comes from link-level handshake errors, is that right?

              For this problem we have some further findings:

               We continued to send NWRITE and NREAD after the error occurred, and this is what happened:

               1st time, NREAD: 560000000 cycles, read nothing, as we described before;

               2nd time, NREAD: 280000000 cycles, got the right data.

               3rd time, NREAD: 13000 cycles, got the right data.

               After that, the error we described disappears. So the link seems to be unstable only for several seconds after we send the first data packet.
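To put these cycle counts in terms of wall-clock time, a small conversion helper can be used. This is only a sketch: the function name `cycles_to_us` is hypothetical, and the 1 GHz clock is the DSP frequency stated earlier in the thread.

```c
#include <stdint.h>

/* Converts a CPU cycle count to whole microseconds (truncating).
 * cpu_hz is the core clock in Hz, e.g. 1000000000 for a 1 GHz DSP.
 * The multiplication overflows for counts above roughly 1.8e13 cycles. */
uint64_t cycles_to_us(uint64_t cycles, uint64_t cpu_hz)
{
    return (cycles * 1000000u) / cpu_hz;
}
```

Taking the figures above at face value, at 1 GHz the three reads correspond to about 560 ms, 280 ms and 13 us.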

                But I still want to ask these questions:

                1) Is this error caused only by signal quality, or could there be a bug in the software?

                2) If link-level errors delay transactions, why is NWRITE not affected?

                3) Since the 1st NREAD transaction read nothing, why does LSU_STAT show that it completed correctly?

                Thank you very much.

                Regards.

  • So you have set VMIN appropriately, made sure the input/output error-stopped states are cleared after port_ok is achieved, and made sure port-writes are disabled?   Please confirm.   If you are getting a timeout between two DSPs, then something is interrupting the link, especially if it works sometimes but not others.
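    A minimal sketch of the status check implied here, assuming you have already read the port's error and status register into a 32-bit value. The bit positions follow the RapidIO Port n Error and Status CSR layout (LSB = bit 0) and are an assumption to verify against the device documentation; the names are hypothetical. The recovery procedure itself is covered in the linked thread and is not shown.

```c
#include <stdint.h>

/* Assumed bit positions, LSB = bit 0. Verify against the device documentation. */
#define SP_PORT_OK         (1u << 1)  /* port initialized and link is up  */
#define SP_PORT_ERR        (1u << 2)  /* unrecoverable port error         */
#define SP_INPUT_ERR_STOP  (1u << 8)  /* input port in error-stopped state  */
#define SP_OUTPUT_ERR_STOP (1u << 16) /* output port in error-stopped state */

/* Returns 1 if the port reports OK and is in neither error-stopped state,
 * 0 otherwise. Takes the raw register value read by the caller. */
int port_ready_for_traffic(uint32_t sp_err_stat)
{
    if ((sp_err_stat & SP_PORT_OK) == 0) {
        return 0;
    }
    if (sp_err_stat & (SP_PORT_ERR | SP_INPUT_ERR_STOP | SP_OUTPUT_ERR_STOP)) {
        return 0;
    }
    return 1;
}
```

    Calling this after port_ok is reported, and again just before the first NWRITE/NREAD, would show whether the port is still in a stopped state when traffic starts.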

    Please run the debug GEL script and provide the register dump.

    1) We need the register dump to understand which errors you are seeing.

    2) NWRITEs don't have timeouts. They will eventually go through if the link HW can recover from the issue.

    3) Not sure.  Are you using more than one LSU?  There is an erratum on this, Advisory 14 in http://www.ti.com/lit/er/sprz334f/sprz334f.pdf.  Maybe try reading in 256B NREADs.
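    The 256B suggestion can be sketched as a loop that splits one large read into 256-byte requests. This is an illustration only: `nread_fn` stands for whatever routine in your code issues a single DIO NREAD and waits for its completion code, and `counting_nread` is a host-side stand-in so the loop can be exercised without hardware. None of these names come from a TI library.

```c
#include <stddef.h>
#include <stdint.h>

#define NREAD_CHUNK_BYTES 256u

/* Hypothetical callback: issue one NREAD of `bytes` bytes from remote_addr
 * into `local`, wait for completion, return 0 on success. */
typedef int (*nread_fn)(uint32_t remote_addr, void *local, size_t bytes, void *ctx);

/* Reads `total` bytes in chunks of at most 256 bytes. Stops at the first
 * failing chunk and returns its error code; *done receives the number of
 * bytes read successfully. */
int nread_in_chunks(nread_fn do_nread, void *ctx, uint32_t remote_addr,
                    uint8_t *local, size_t total, size_t *done)
{
    size_t offset = 0;

    while (offset < total) {
        size_t n = total - offset;
        int rc;

        if (n > NREAD_CHUNK_BYTES) {
            n = NREAD_CHUNK_BYTES;
        }
        rc = do_nread(remote_addr + (uint32_t)offset, local + offset, n, ctx);
        if (rc != 0) {
            *done = offset;
            return rc;
        }
        offset += n;
    }
    *done = offset;
    return 0;
}

/* Host-side stand-in for the hardware call: records call count and the
 * largest request size, and always reports success. */
struct nread_stats {
    size_t calls;
    size_t max_len;
};

int counting_nread(uint32_t remote_addr, void *local, size_t bytes, void *ctx)
{
    struct nread_stats *s = (struct nread_stats *)ctx;

    (void)remote_addr;
    (void)local;
    s->calls++;
    if (bytes > s->max_len) {
        s->max_len = bytes;
    }
    return 0;
}
```

    With this split, the 1 MB read becomes 4096 requests of 256 bytes each, and a failure can be narrowed down to a specific offset.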

    Regards,

    Travis

  • Hello

            I am curious why we need to disable port-writes. What is their function, and what effect can they have on the link? Thank you.

  • Port-writes are used to notify a system host of error conditions.  You can read about them in the SRIO error management (Part 8) spec:

    http://www.rapidio.org/specs/current/rev2.1_spec_stack.zip

    If your system is not set up to handle these messages, you will be wasting bandwidth, since they can be sent multiple times. In addition, if the ackIDs are not aligned when they are sent, there can be complications and error states.  I recommend disabling them unless you know what you are doing and have a system host that can respond to them.

    Regards,

    Travis