update: Problems Remain - C6455 SRIO interoperability problem

Marc S.

We've got a C6455 here that's connected to an MPC8548 via SRIO. I've got a nasty message passing issue that has been very difficult to debug. We're using one lane (lane 0) at 3.125Gb and physical connectivity is good. Maintenance operations also work and the 8548 is able to read and write C6455 SRIO registers over the bus.

No messages can make it across properly, however. When the 8548 attempts to originate the first message to the C6455, only one instance of a changed register value can be seen at the C6455. The Port Local Ack ID Status CSR (for port 0) does indeed change to reflect the inbound ack ID value sent by the 8548, but the rest of the register doesn't change. Only the upper 2 nibbles of this register (INBOUND_ACKID) change. Meanwhile, no other bits in this register or anywhere else in the C6455 SRIO internal register space change, and the acknowledgement is never sent back to the 8548. The 8548 then times out waiting for the acknowledgement.

I've tried a variety of register settings in order to capture any physical layer errors, logical layer errors, or otherwise determine what's going wrong. Promiscuous destination address decode is being used (so the 6455 should receive for any destination address seen on the bus), and all capturing is enabled both at the physical and the logical level as per the 6455 SRIO register definitions. Hardware error recovery is also enabled, so the chip is not waiting for software to generate the ack.

Can anyone point out how it's possible for physical layer and maintenance operations to properly work, but for only the INBOUND_ACKID field of the Port Local Ack ID Status register to be updated (and no other registers whatsoever) when a message is received? No error status is available anywhere and I'm not quite sure where to look next...

Thanks for all info...

over 16 years ago

0 tscheck over 16 years ago

TI__Mastermind 23525 points

Marc,

There is a known bug when reading the ACKID_STAT register: it will always return a zero value for OUTBOUND_ACKID. You can still write to the OUTBOUND_ACKID field in the case of error recovery to align ACKIDs; it is just that when read it returns zero. If maintenance packets are working you should see the INBOUND_ACKID and OUTSTANDING_ACKID fields incrementing.

By your description of the problem, the messages aren't even getting to the logical layer RXU. If they were, you would at least be getting an ERROR response if there was some sort of RXU configuration error. I'm assuming you have the TXU and RXU BLK_ENs (0x50 and 0x58) asserted or else you wouldn't even have been able to configure the RXU. I think the problem is device ID related. The maintenance packets are responded to by the physical layer automatically. The physical layer matches the incoming maintenance DESTID with the value found in the BASE_ID register (0x1060). You obviously have that programmed correctly. The logical layer checks the incoming message's DESTID against the DEVICEID_REG1 value (0x80). If it does not match, the packet is destroyed before even reaching the logical layer, and no response packet is sent. Make sure software writes both of these registers with the correct deviceID value. I'm not sure what you mean by "promiscuous destination address decode", but the C6455 will only accept messages with DESTID equal to the DEVICEID_REG1 value. On the C6455, the only DESTID checking control available is bit 26 of SP_IP_MODE (0x12004). There were two additional control bits available on later devices that offer greater flexibility, but not on the C6455.

Hope that helps,

Travis

0 Marc S. over 16 years ago in reply to tscheck

Prodigy 70 points

Thanks for the response, Travis.

The info on the ACKID_STAT register is appreciated.

You're right that it doesn't look like anything is getting delivered to the logical layer unit. I'm getting no indication of errors at the Logical/Transport Layer Error or Capture CSRs. I'm also not getting any indication of packet error or capture at the Port 0 Packet/Control Symbol Error Capture CSRs.

My BASE_ID register (LARGE_BASE_DEVICEID field) is set to 0x0001 for extended addressing and my DEVICE_ID_REG1 and DEVICE_ID_REG2 (16BNODEID field) are both also set to 0x0001. I do have bit 4 of the PE_FEAT register set for large address support and also have bit 26 of the SP_IP_MODE register set to provide for delivery of all packets to the logical layer irrespective of destination address (this is indeed what I was alluding to earlier when I talked about promiscuously receiving, along with the following paragraph).

I am also enabling PROMISCUOUS bit 1 in each mailbox-to-queue mapping register that I have configured so that the source address of originated packets should be disregarded as well. With the destination ID being ignored by the physical layer (bit 26 of SP_IP_MODE) and subsequently the source ID being ignored by the logical layer (bit 1 of the mailbox-to-queue mapping registers), I'm confused as to why every packet received at the physical layer shouldn't be at least attempted for dispatch to the logical layer. And if they are getting dispatched to the logical layer, then I'm confused about why I'm not seeing any values at all in the Logical/Transport Layer Error or Capture CSRs. I do have Logical/Transport Layer Errors enabled with RIO_ERR_EN (addr 0x02d0 200c) set to 0xFFC000C0.

The C6455 does indeed appear to be acknowledging the inbound packet now, but I can't discern just what it did with said packet besides acknowledging it to the originating peer. The remote peer (MPC8548) is sending the packet to destination address 0x0001, just as I have the C6455 DEST_ID relevant registers programmed.

Anywhere else I can look?

Marc

0 tscheck over 16 years ago in reply to Marc S.

TI__Mastermind 23525 points

Marc,

It is good that you have the BASE_ID and the DEVICEID_REG1 values equal. It is important to verify that you are in fact sending the messages with the tt field = 0b01 though. FYI, bit 4 of PE_FEAT doesn't control the address size for generated or received packets; it is simply an indicator of capability or preferred ID size. The ID size can be controlled on a per packet basis for TX by programming the tt field in the message descriptor, and RX will decode the IDs based on the received tt encoding. I'm not sure how the MPC8548 sets the tt and ID sizes, but make sure it is sending 16b IDs. What are the 8BNODEID values of those registers set to? What DESTID was used for the working maintenance packets?
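A minimal sketch of the ID comparison being discussed, to show why the tt value matters. The tt encodings (0b00 for 8-bit IDs, 0b01 for 16-bit IDs) come from the RapidIO transport specification. The field layout (8-bit ID in bits 23:16, 16-bit ID in bits 15:0 of both BASE_ID and DEVICEID_REG1) and the helper names `id_for_tt` and `ids_match` are assumptions made for illustration and should be checked against the C6455 SRIO user's guide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Transport-type (tt) encodings from the RapidIO transport specification. */
enum { TT_8BIT_IDS = 0x0, TT_16BIT_IDS = 0x1 };

/* ASSUMED layout: bits 23:16 hold the 8-bit ID, bits 15:0 the 16-bit ID.
 * Returns the ID field that applies to a packet with the given tt. */
static uint16_t id_for_tt(uint32_t reg, unsigned tt)
{
    return (tt == TT_16BIT_IDS) ? (uint16_t)(reg & 0xFFFFu)
                                : (uint16_t)((reg >> 16) & 0xFFu);
}

/* True when both BASE_ID and DEVICEID_REG1 match destid for this tt. */
static bool ids_match(uint32_t base_id, uint32_t deviceid_reg1,
                      unsigned tt, uint16_t destid)
{
    return id_for_tt(base_id, tt) == destid &&
           id_for_tt(deviceid_reg1, tt) == destid;
}
```

With both registers holding 0x00000001, `ids_match(..., TT_16BIT_IDS, 0x0001)` is true, while the same DESTID sent with 8-bit IDs does not match because the 8-bit field is still zero.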

The implementation of the C6455 peripheral is such that bit 26 of the SP_IP_MODE register controls whether the physical layer passes a packet to the logical layer. When this bit is zero, it will only pass non-maintenance packets with DESTID = BASE_ID to the logical layer. The logical layer will accept and process this packet because BASE_ID = DEVICEID_REG1. All other DESTID packets will be destroyed in the physical layer. When this bit is set, the physical layer will pass all non-maintenance packets, regardless of DESTID, to the logical layer. The logical layer will then accept and process packets with DESTID equal to DEVICEID_REG1 or the multicastID in DEVICEID_REG2. Additionally, it will forward packets if the DESTID falls within the range of the packet forwarding registers. All other DESTID packets are destroyed in the logical layer and not responded to. No matter what this bit is set to, the physical layer will only respond to maintenance packets with DESTID = BASE_ID and destroy all other maintenance packets.
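The filtering rules in the paragraph above can be written down as a small decision function. This is only a host-side model of the description, not driver code. The type and function names (`srio_id_cfg`, `pkt_fate`, `classify`) are hypothetical, and a single inclusive forwarding range stands in for the packet forwarding registers.

```c
#include <stdbool.h>
#include <stdint.h>

typedef enum {
    PKT_MAINT_RESPONDED,    /* physical layer answers the maintenance request */
    PKT_ACCEPTED,           /* logical layer accepts and processes the packet */
    PKT_FORWARDED,          /* DESTID falls in the packet forwarding range */
    PKT_DESTROYED_PHY,      /* dropped in the physical layer, no response */
    PKT_DESTROYED_LOGICAL   /* dropped in the logical layer, no response */
} pkt_fate;

typedef struct {
    bool     pass_all;       /* models bit 26 of SP_IP_MODE */
    uint16_t base_id;        /* BASE_ID */
    uint16_t device_id;      /* DEVICEID_REG1 */
    uint16_t multicast_id;   /* DEVICEID_REG2 */
    uint16_t fwd_lo, fwd_hi; /* forwarding range, inclusive; lo > hi disables */
} srio_id_cfg;

static pkt_fate classify(const srio_id_cfg *c, bool maintenance, uint16_t destid)
{
    /* Maintenance packets are handled by the physical layer regardless of bit 26. */
    if (maintenance)
        return destid == c->base_id ? PKT_MAINT_RESPONDED : PKT_DESTROYED_PHY;

    /* With bit 26 clear, only DESTID == BASE_ID reaches the logical layer. */
    if (!c->pass_all && destid != c->base_id)
        return PKT_DESTROYED_PHY;

    if (destid == c->device_id || destid == c->multicast_id)
        return PKT_ACCEPTED;
    if (destid >= c->fwd_lo && destid <= c->fwd_hi)
        return PKT_FORWARDED;
    return PKT_DESTROYED_LOGICAL;
}
```

In this model a mismatched DESTID is silently dropped whether bit 26 is set or clear; only the layer that drops it changes, which is consistent with no error or response being visible in either case.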

I would suggest you set the bit 26 to zero, and make sure you get it working with this setting first.

If you are confident that you have all the deviceID checking setup correctly then I would ask the following questions.

1) Is the message payload getting transferred to DSP memory? Is there ever an RX queue interrupt (are the interrupts set up correctly)? I'm assuming no, or else you'd get a DONE response message. If the payload is not reaching memory and an RX interrupt didn't fire, it doesn't mean that the RXU didn't get the packet. It could have sent an error response.

2) Are you sure the MPC8548 isn't getting a response? Seems to me I vaguely remember that the MPC8548 may treat ERROR responses differently than a normal DONE response. It may not notify the processor in the same way with this type of response. Something to check. Let me know.

Regards,

Travis

0 Marc S. over 16 years ago in reply to tscheck

Prodigy 70 points

Thanks again for the further good information, Travis.

Things are working here now. To close the loop, it appears that things were configured properly for at least a day or two previous as well. The reason I wasn't seeing any errors (or capture information) was more obscure than I had given credit for... It was because, in actuality, there weren't any errors. I was so busy looking for anomalous behaviors that I missed the fact that the ICCR was actually showing that a packet had indeed been received on the expected CPPI queue. The interrupt was routed to the correct destination, ultimately causing the IFR to be set as well.

The problem was that other [non-SRIO related] threads were initializing subsequently (but prior to SRIO packet ingress time), and one of them had caused an instruction packet fetch exception that I had failed to notice. The default NMI handler HWI was thus invoked and was just sitting there spinning. Consequently the GIE bit in the CSR had been cleared, preventing the generated SRIO receive interrupt from causing the appropriate HWI vector to be taken (and thus my RX handler from being called).

We resolved the NMI situation and now bidirectional messages are being passed via SRIO.
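For anyone who hits the same symptom (IFR bit set, handler never called), here is a rough sketch of the gating conditions for a maskable interrupt on the C64x+ core. It is written as a plain function over register snapshots so it can be run on a host. GIE as bit 0 of CSR and NMIE as bit 1 of IER follow the C6000 CPU documentation; the function name `maskable_int_taken` is made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define CSR_GIE  (1u << 0)   /* global interrupt enable, bit 0 of CSR */
#define IER_NMIE (1u << 1)   /* NMI enable, bit 1 of IER */

/* Returns true when CPU interrupt n (4..15) is pending and nothing masks it.
 * csr, ier and ifr are snapshots of the corresponding control registers. */
static bool maskable_int_taken(uint32_t csr, uint32_t ier, uint32_t ifr,
                               unsigned n)
{
    uint32_t bit = 1u << n;
    return (csr & CSR_GIE) && (ier & IER_NMIE) && (ier & bit) && (ifr & bit);
}
```

With the IFR and IER bits for the SRIO interrupt both set but GIE cleared, as in the situation described above, the function returns false: the interrupt stays pending and the vector is never taken.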

Thanks very much for your prompt assistance, Travis.

Regards,

Marc

0 Marc S. over 16 years ago in reply to Marc S.

Prodigy 70 points

Looks like I counted the chickens too soon, Travis.

There is now another problem that looks similar to the original one I referenced.

Messages are properly passing in both directions, from 8548 to 6455 and vice versa. However, there seems to be some kind of 6455 resource exhaustion problem that is preventing the 6455 SRIO peripheral from acknowledging any further messages subsequent to the receipt of 8 segments.

I eliminated all traffic from the 6455 to 8548 to focus on only traffic headed toward the 6455. In every case the 6455 peripheral receives as many packets as are sent from the 8548 (acknowledging them all) and passes them up to my driver via the appropriate interrupt. However, as soon as 8 segments worth of messages have been received the 6455 peripheral goes out to lunch and refuses to acknowledge the subsequent message.

When the 8548 sends packets less than 256 bytes, the number of packets received, acknowledged, and passed up to my driver is 8. The 9th packet sent then results in the 6455 LASCSR showing that the proper ackID value was indeed received from the 8548, but that the actual acknowledgement was never sent back to the 8548. When the packets are between 256 and 512 bytes (thus requiring 2 segments per packet), the 6455 peripheral dies after receiving only 4 packets. When packets are between 512 and 1024 bytes (thus requiring either 3 or 4 segments), the peripheral dies after receiving only 2 packets.

I've checked virtually the entire SRIO peripheral register space subsequent to the problem manifestation and can see no errors of any sort. I've also changed a variety of 6455 peripheral settings in an attempt to resolve the problem, to no avail. I've walked my RX descriptor list both before and after being processed by the peripheral and everything looks fine. It really doesn't seem like anything that could be caused by the driver operation at receive time, since the number of receive descriptors isn't related to the number of SRIO message segments being sent.
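The counts above line up with a fixed budget of 8 segments. A small sketch of that arithmetic, assuming 256-byte segments as described; the budget of 8 is the empirical figure from these tests, not a documented limit, and the function names are made up for illustration.

```c
#include <stddef.h>

#define SEGMENT_BYTES     256u
#define OBSERVED_SEGMENTS 8u   /* empirical figure from the tests, not a spec limit */

/* Number of 256-byte segments needed for a message of msg_bytes. */
static unsigned segments_for(size_t msg_bytes)
{
    return (unsigned)((msg_bytes + SEGMENT_BYTES - 1) / SEGMENT_BYTES);
}

/* Whole messages that fit in the observed 8-segment budget. */
static unsigned messages_before_stall(size_t msg_bytes)
{
    unsigned segs = segments_for(msg_bytes);
    return segs ? OBSERVED_SEGMENTS / segs : 0;
}
```

For example, a 200-byte message needs 1 segment and 8 messages fit; a 400-byte message needs 2 segments and 4 fit; 700-byte and 1000-byte messages need 3 and 4 segments respectively, and in both cases only 2 whole messages fit.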

This problem is definitely tracking the number of [max 256 byte] SRIO message segments and appears to be a resource exhaustion issue of some type at the 6455 peripheral. Yet, no error or status of any kind is being reported when the dysfunction occurs. Is there any resource mechanism in the peripheral itself that would sensibly account for this behavior where the peripheral stops acknowledging after clearly receiving a packet from the remote peer?

Thanks for all info,

Marc

0 tscheck over 16 years ago in reply to Marc S.

TI__Mastermind 23525 points

Just to be clear, are you referring to logical layer message response packets or physical layer acknowledge control symbols?

If physical layer Acks...

Is the port_ok bit set? What does the SP(n)_ERR_STAT register read on both devices? Actually it is helpful to know these regardless of whether you are talking about physical layer Acks.
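A minimal sketch of decoding the status bits asked about above from a raw SP(n)_ERR_STAT value read on either device. The bit positions (PORT_UNINITIALIZED bit 0, PORT_OK bit 1, PORT_ERROR bit 2, counting from the LSB) are taken from the RapidIO LP-Serial Port n Error and Status CSR layout and are an assumption here; confirm them against each device's register documentation. The macro and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED bit positions, LSB = bit 0; verify against the device documentation. */
#define SP_ERR_STAT_PORT_UNINIT (1u << 0)
#define SP_ERR_STAT_PORT_OK     (1u << 1)
#define SP_ERR_STAT_PORT_ERROR  (1u << 2)

/* True when the port reports OK with neither the uninitialized
 * nor the error flag set. */
static bool port_healthy(uint32_t sp_err_stat)
{
    return (sp_err_stat & SP_ERR_STAT_PORT_OK) &&
           !(sp_err_stat & (SP_ERR_STAT_PORT_UNINIT | SP_ERR_STAT_PORT_ERROR));
}
```

Any other bits set in the register (retry, stopped, and similar flags) are ignored by this check and are worth inspecting by hand.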

If message response packets...

There should be no reason for the peripheral to stop sending message response packets mid message for multi-segment messages or after 8 single segment messages. I would concentrate on getting the single segment messages working first since there are added complexities with multi-segment. What is the CC in the DSP RX descriptor when it fails? Is there any indication that the peripheral starts to send ERROR responses to the MPC8548 midway? Is all the message data correctly arriving in the DSP memory (just not sending responses for all segments)? Is the RX completion pointer advancing past the error packet? You may need to single step through this by sending one message at a time to the DSP. Once it is received correctly (look at the DSP descriptor, interrupts, completion pointer, etc.) and the MPC8548 TX side is happy, send the next packet and repeat.

Let me know when you have more details.

Regards,

Travis

0 Marc S. over 16 years ago in reply to tscheck

Prodigy 70 points

When I wrote that the C6455 doesn't acknowledge the 9th packet (when packets are less than 256 bytes each) I meant the physical layer acknowledgement. As in, the LASCSR value is showing that the 9th packet did actually arrive inbound from the remote peer (LASCSR field INBOUND_ACKID); however, the C6455 never responds with the same acknowledgement ID (OUTSTANDING_ACKID). In the previous 8 packets the C6455 always immediately updated the LASCSR OUTSTANDING_ACKID field to reflect the fact that it had acknowledged the INBOUND_ACKID value just received from the remote peer. The 9th packet case acts differently... These fields are difficult to understand, even in the context of the RapidIO Trade Association documents on interconnection, so it may be that I'm misunderstanding how the OUTSTANDING_ACKID field works. The empirical data does show, however, that irrespective of what the field is really used for, the C6455 is behaving differently on the 9th packet (these are all single segment packets less than 256 bytes total).

I worked exclusively with single segment messages today as you suggested. The mailbox to queue mapping registers all look correct and it makes no difference how many descriptors I'm providing to the CPPI queue in question (I'm always using CPPI queue 0 for these tests, by the way). I've tried increasing the number of RX descriptors to sixteen and reducing them to four with no change in behavior. I have watched the descriptor list and HDP and completion pointers for queue 0 before and after each packet receive (in lockstep with the sending peer), and there are no errors to speak of in the received packets. All the packet data is present and correct in the proper buffer location as provided in word two of the RX descriptor currently being used by the peripheral. So, the CC in the RX descriptor for packets 1 through 8 is 0, and again, all payload data is exactly correct. Packet 9 is never delivered to the driver... The expected interrupt is never generated, and the RX_CPPI_ICSR never reflects that the interrupt condition exists.

After examining more closely today I saw that the 9th packet *is* actually DMA'd into memory by the peripheral, but that appears to be the last thing it does. So, the 9th packet does indeed make it to memory every time (and properly so, with all payload data intact as expected) at the appropriate buffer location that was specified in the next queue descriptor used (which is the current one shown in the HDP register for that CPPI queue). The peripheral does not, however, update the descriptor status (OWNERSHIP bit, Message Length, etc.), nor does it update the CPPI queue HDP or completion pointer when this 9th packet arrives.

So, the peripheral IS definitely receiving the 9th packet, updating the LASCSR to reflect receipt, and writing the payload data to memory as instructed by the current HDP descriptor Buffer Pointer word. The peripheral IS NOT, however, updating the descriptor status (words 2 and 3) in any way to reflect packet receipt, updating the CPPI queue HDP or completion pointer, updating the RX_CPPI_ICSR register, or acknowledging the packet/message at the physical layer (neither is the LASCSR updated to reflect the OUTBOUND_ACKID as occurred for all previous packets, nor did the remote peer get the expected ACK, as evidenced by the fact that it timed out on the packet/message send operation).

Regards,

Marc

0 tscheck over 16 years ago in reply to Marc S.

TI__Mastermind 23525 points

Marc,

Thanks for the detailed description, it makes things clearer. I still don't have a clear answer, but hopefully some of this helps out. I can definitely say there is no limit to the number of RX packets which can be received.

The OUTSTANDING ACKID field will read a value of the last acknowledged packet. For example, say the OUTBOUND ACKID = OUTSTANDING ACKID = 5. When the DSP sends a packet, the OUTBOUND will immediately go to 6, but the OUTSTANDING will stay at 5 until it is acknowledged by the connected device. Since the ACKID field is 5 bits, you could in theory have sent 31 packets without an acknowledge back. So the fact that this field is not changing doesn't necessarily mean that the 9th packet response wasn't sent from the DSP; it could mean that the response was sent to the MPC8548 but the physical layer ack for the response packet wasn't received back by the DSP.

If the 9th packet payload is arriving in memory, that is good because it means there were no transmit errors and the RXU was configured properly to route it there. The 9th packet response should be sent. If it is a single segment message, then the descriptor should be updated (words 2 and 3), the CP will be updated, and an interrupt should be generated. Did you get an interrupt for the first 8 packets? The interrupt won't be cleared unless the CP written by software during the ISR is equal to the CP written by the port for the last received descriptor. The interrupt will remain asserted if the port has processed additional packets. Also, the interrupt pacing register must be written to fire the interrupt to the CPU.

Regarding CPPI interrupts, I remember an issue that was raised, and I don't remember whether it was fixed on the C6455 or not. I'll see if I can dig up the info again. The issue was that if the software wrote CP=N at the end of the ISR (so that the port would clear the ICSR) and on the exact same clock cycle the port wrote CP=N+1 to indicate another descriptor had been processed, the port actually cleared the ICSR instead of leaving it set. If that occurred, the only way to know that N+1 was complete was to wait for the N+2 interrupt to occur. I'll let you know what I find on this.

Some of your description leads me to believe that your expectation of the HDP is not exactly correct. The RX queue HDP is written initially by software to point to the 1st descriptor in the queue. Software never has to write this HDP again, unless an empty queue condition has occurred where the peripheral has reached a descriptor with NEXT_DESCRIPTOR_POINTER=0. If this happens, the port sets the EOQ bit in the current descriptor and writes all zeros into the HDP. Then, in order to receive any more messages to that queue, the HDP must be written by software again to kick off the new RX buffer descriptor location. Software should not update the HDP in each ISR unless this empty queue condition has occurred. The HDP is not updated after each received message by the port either. The CP is updated by the port after each message is received, so that during the ISR the software can read the CP and determine how many RX buffer descriptors it must process.
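To make the OUTBOUND/OUTSTANDING relationship concrete, here is a tiny host-side model of the two 5-bit counters as described in the paragraph above. It only illustrates the bookkeeping, with hypothetical names (`ackid_state`, `on_packet_sent`, `on_ack_received`, `unacked`), and does not touch real registers.

```c
#include <stdint.h>

#define ACKID_MASK 0x1Fu   /* ackIDs are 5 bits wide and wrap at 32 */

typedef struct {
    uint8_t outbound;      /* advances as soon as a packet is sent */
    uint8_t outstanding;   /* advances only when the link partner acks */
} ackid_state;

static void on_packet_sent(ackid_state *s)
{
    s->outbound = (uint8_t)((s->outbound + 1u) & ACKID_MASK);
}

static void on_ack_received(ackid_state *s)
{
    if (s->outstanding != s->outbound)   /* ignore acks with nothing in flight */
        s->outstanding = (uint8_t)((s->outstanding + 1u) & ACKID_MASK);
}

/* Packets sent but not yet acknowledged, modulo the 5-bit wrap. */
static unsigned unacked(const ackid_state *s)
{
    return (unsigned)(s->outbound - s->outstanding) & ACKID_MASK;
}
```

Starting from OUTBOUND = OUTSTANDING = 5, one send gives OUTBOUND = 6 with OUTSTANDING still 5 and one packet unacknowledged. A frozen OUTSTANDING value therefore points at a missing ack from the link partner rather than at a packet that was never sent.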
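The HDP/CP division of labour described above can be sketched as a host-side model of an RX ISR pass. Everything here is illustrative: the `rx_desc` and `rx_queue` structures are simplified stand-ins rather than the real CPPI descriptor layout, `hdp` and `cp` are plain fields instead of memory-mapped registers, and `sw_next` is bookkeeping a driver would have to keep itself.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct rx_desc {
    struct rx_desc *next;    /* NEXT_DESCRIPTOR_POINTER, NULL ends the queue */
    uint8_t        *buffer;  /* buffer pointer */
    uint32_t        status;  /* ownership, EOQ, CC, length (not modeled) */
} rx_desc;

typedef struct {
    rx_desc *hdp;      /* written by software at init; zeroed by the port at end of queue */
    rx_desc *cp;       /* written by the port after each received message */
    rx_desc *sw_next;  /* next descriptor software expects to see completed */
} rx_queue;

/* One ISR pass: handle every descriptor from sw_next up to and including CP.
 * Returns the number handled. refill, if non-NULL, is a fresh chain used to
 * restart the queue only when the port has zeroed the HDP (empty queue). */
static unsigned rx_isr(rx_queue *q, rx_desc *refill)
{
    rx_desc *last = q->cp;   /* snapshot of the completion pointer */
    unsigned handled = 0;
    rx_desc *d;

    /* Only proceed if CP lies ahead of us in the chain. */
    for (d = q->sw_next; d != NULL && d != last; d = d->next) { }
    if (d != NULL) {
        for (d = q->sw_next; ; d = d->next) {
            handled++;       /* a real driver would pass d->buffer upward here */
            if (d == last)
                break;
        }
        q->sw_next = last->next;
        /* Real hardware: write `last` back to the CP register here. */
    }

    /* HDP is rewritten only after an end-of-queue condition. */
    if (q->hdp == NULL && refill != NULL) {
        q->hdp = refill;
        q->sw_next = refill;
    }
    return handled;
}
```

The point of the sketch is that software learns how far the port got from the CP alone; the HDP is left untouched unless the queue has run dry.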

Regards,

Travis

0 Marc S. over 16 years ago in reply to tscheck

Prodigy 70 points

I interspersed my comments below, Travis. Thanks for your continued support. It is greatly appreciated...

tscheck said:

Marc,

Thanks for the detailed description, it makes things clearer. I still don't have a clear answer, but hopefully some of this helps out. I can definitely say there is no limit to the number of RX packets which can be received.

The OUTSTANDING ACKID field will read a value of the last acknowledged packet. For example, say the OUTBOUND ACKID = OUTSTANDING ACKID = 5. When the DSP sends a packet, the OUTBOUND will immediately go to 6, but the OUTSTANDING will stay at 5 until it is acknowledged by the connected device. Since the ACKID field is 5 bits, you could in theory have sent 31 packets without an acknowledge back. So the fact that this field is not changing doesn't necessarily mean that the 9th packet response wasn't sent from the DSP; it could mean that the response was sent to the MPC8548 but the physical layer ack for the response packet wasn't received back by the DSP.

Ok, this makes sense. So, when you talk about the prospect of the DSP having sent a response that wasn't acked at the physical layer by the 8548 are you referencing a logical (or transport) layer response [by the DSP]? Thus, does the logical layer message from the 8548 then result in both a physical layer ack and a separate logical layer response from the DSP? Then does the DSP logical layer response subsequently result in a physical layer ack from the 8548? If this is the case then how do I discern whether the physical layer ack to the 8548 logical layer message never made it back to the 8548 or whether the logical layer response never made it back? The 8548 is reporting that its outbound messaging unit is busy before timing out. So, it's clear that the logical layer response isn't coming back, at least... It's unfortunate that the 6455 bug prevents the ability to see what the OUTBOUND value is in the LASCSR to help determine the last ID actually sent by the DSP...

I do suspect, in any event, that a detailed understanding of serial peripheral operations in the context of single segment ingress messages would help to better pinpoint just where things are going awry. The fact that the 9th packet is actually DMA'd to memory, but neither the associated descriptor, HDP or completion pointer updated should potentially shed some light. Then again, if the Buffer Pointer from the 9th descriptor that's being used was actually pre-staged by the peripheral at the end of the 8th packet receive time (so it could be used immediately at next packet receipt time without fetching) then the situation is a bit more ambiguous...

If we presume, for the sake of analysis, that the DSP actually did send the 9th packet logical (or transport) layer response back to the 8548, but failed to receive the associated physical layer ack then how would that explain the fact that there was no update by the DSP of the RX descriptor, HDP, or completion pointer? It doesn't seem to make sense that these three updates would be dependent upon receiving back the physical layer ack to the message response...

tscheck said:

If the 9th packet payload is arriving in memory, that is good because it means there were no transmit errors and the RXU was configured properly to route it there. The 9th packet response should be sent. If it is a single segment message, then the descriptor should be updated (words 2 and 3), the CP will be updated, and an interrupt should be generated. Did you get an interrupt for the first 8 packets? The interrupt won't be cleared unless the CP written by software during the ISR is equal to the CP written by the port for the last received descriptor. The interrupt will remain asserted if the port has processed additional packets. Also, the interrupt pacing register must be written to fire the interrupt to the CPU.

Yes, the first 8 packets each result in an interrupt. Then the driver handles each packet individually (one per interrupt in this case since I'm running the 8548 originator in lockstep via manual intervention), updates the CP appropriately and finally writes the interrupt pacing register with a 0. I write a 0 to the pacing register so that every packet will be delivered immediately to the driver. Everything works as expected with no anomalies on the first 8 packets...

tscheck said:

Regarding CPPI interrupts, I remember an issue that was raised, and I don't remember whether it was fixed on the C6455 or not. I'll see if I can dig up the info again. The issue was that if the software wrote CP=N at the end of the ISR (so that the port would clear the ICSR) and on the exact same clock cycle the port wrote CP=N+1 to indicate another descriptor had been processed, the port actually cleared the ICSR instead of leaving it set. If that occurred, the only way to know that N+1 was complete was to wait for the N+2 interrupt to occur. I'll let you know what I find on this.

I do believe this problem still exists in the C6455, since the test program driver for it provided by TI does indeed have a check for this condition (thus I am checking for it as well). In any event, I can't see how this issue could be related to my particular problem since I am running in lockstep with the 8548 originator. Thus, there is never another packet arriving at the DSP until I have first exited the driver RX ISR (each interrupt results in no more and no less than one packet being serviced at present)...

tscheck said:

Some of your description leads me to believe that your expectation of the HDP is not exactly correct. The RX queue HDP is written initially by software to point to the 1st descriptor in the queue. Software never has to write this HDP again, unless an empty queue condition has occurred where the peripheral has reached a descriptor with NEXT_DESCRIPTOR_POINTER=0. If this happens, the port sets the EOQ bit in the current descriptor and writes all zeros into the HDP. Then, in order to receive any more messages to that queue, the HDP must be written by software again to kick off the new RX buffer descriptor location. Software should not update the HDP in each ISR unless this empty queue condition has occurred. The HDP is not updated after each received message by the port either. The CP is updated by the port after each message is received, so that during the ISR the software can read the CP and determine how many RX buffer descriptors it must process.

I think we missed each other on the HDP operation. I do understand that the HDP operates the way you described. The peripheral is never running into a NULL NEXT_DESCRIPTOR_POINTER since I'm running in lockstep (and checking via breakpoint at that code location). The HDP is only being written one time at initialization in my environment. When I previously talked about the value of the HDP and CP not changing after the 9th packet is DMA'd to memory I only meant to note that the peripheral was going out to lunch prior to updating either of those locations (as well as prior to updating any of the descriptor data)... on the prior 8 packets the peripheral always immediately updates the HDP and CP to accurately reflect the way it's walking the descriptor chain (of course, it also updates the appropriate descriptor data itself, too, on the first 8 packets). Does that make sense?

It really is strange how the lockup condition directly seems to track the number of [maximum 256 byte each] message segments sent from the 8548 to the DSP. Is there any chance that the peripheral has a buffer region internally (or externally) of 8 message segments to deal with ingress traffic during times of bus contention? I'm having great difficulty getting my arms around the relationship here... and try as I might, I can't seem to conceive of how any of the configuration settings could cause this relationship to manifest... I certainly may be programming something wrong, or there may be some kind of interoperability issue with 8548 (in the context of its configuration), which is adversely influencing the situation, but I'm struggling to understand why 4 packets of 2 message segments each cause the same failure as 2 packets of 4 message segments each as 8 packets of 1 message segment each...

Thanks again for all of your support, Travis.

Marc

0 tscheck over 16 years ago in reply to Marc S.

TI__Mastermind 23525 points

Marc,

As promised, attached is a small doc I created with the registers that have to be programmed for correct initialization. It also contains the SerDes settings for a 125 MHz reference clock with 2.5 Gbps operation. The cause of the behavior you are seeing is still unclear to me. I hope the MQT example that you can run without this problem helps pinpoint the issue.

You might also try to initiate some NREAD transactions to see if the 8 packet limitation is restricted to the messaging, or all traffic types. If it exists with the NREADs as well, it has to be a setup issue. If only the messages, then there is probably something wrong in your driver.

Regards,

Travis

8032.Boot_regs.pdf
