timeout on sending SRIO doorbell.

Peter Robertson

I have two C6678 DSPs with SRIO connected by a CPS-1848 switch from IDT.
This is standard, working hardware from a board supplier.

DSP-1 is in a loop waiting for a doorbell from DSP2 by polling the
      DOORBELL INT ICSR and then acknowledging it. When it receives
      this doorbell it sends one back to DSP2 and waits again.
DSP-2 is doing the inverse: sending a doorbell to DSP-1 and waiting
      for one coming back.

DSP-1 sends the doorbell by writing the following values to LSU[0]:
      R1=0, R2=0, R3=0, R4=0x30010440, R5=0x000000A0

DSP-2 sends the doorbell by writing the following values to LSU[0]:
      R1=0, R2=0, R3=0, R4=0x30020440, R5=0x000100A0

On both DSPs, the LSU is properly locked each time before writing
the registers.
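
For readers following along, here is a minimal C sketch that splits the two register values above into fields. The field positions are an assumption based on the LSU_REG4/LSU_REG5 layout as commonly documented for KeyStone SRIO and should be checked against the SRIO user guide. The struct and function names are hypothetical and are not part of any TI library.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed field layout (verify against the KeyStone SRIO user guide). */
typedef struct {
    uint16_t dest_id;     /* REG4[31:16] destination device ID          */
    uint8_t  src_id_map;  /* REG4[15:12] source ID selection            */
    uint8_t  id_size;     /* REG4[11:10] 0 = 8-bit IDs, 1 = 16-bit IDs  */
    uint8_t  out_port;    /* REG4[9:8]   output port                    */
    uint8_t  priority;    /* REG4[7:4]   request priority               */
    uint16_t drbll_info;  /* REG5[31:16] doorbell info field            */
    uint8_t  hop_count;   /* REG5[15:8]  hop count (maintenance only)   */
    uint8_t  ftype;       /* REG5[7:4]   RapidIO FTYPE, 10 = DOORBELL   */
    uint8_t  ttype;       /* REG5[3:0]   RapidIO TTYPE                  */
} lsu_doorbell_fields;

/* Hypothetical helper: split raw LSU_REG4/LSU_REG5 words into fields. */
static lsu_doorbell_fields lsu_decode(uint32_t reg4, uint32_t reg5)
{
    lsu_doorbell_fields f;
    f.dest_id    = (uint16_t)(reg4 >> 16);
    f.src_id_map = (uint8_t)((reg4 >> 12) & 0xFu);
    f.id_size    = (uint8_t)((reg4 >> 10) & 0x3u);
    f.out_port   = (uint8_t)((reg4 >> 8) & 0x3u);
    f.priority   = (uint8_t)((reg4 >> 4) & 0xFu);
    f.drbll_info = (uint16_t)(reg5 >> 16);
    f.hop_count  = (uint8_t)((reg5 >> 8) & 0xFFu);
    f.ftype      = (uint8_t)((reg5 >> 4) & 0xFu);
    f.ttype      = (uint8_t)(reg5 & 0xFu);
    return f;
}
```

Under that assumed layout, both values decode as FTYPE 10 (DOORBELL) requests with 16-bit device IDs, and they differ only in the destination ID and the doorbell info field.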

This runs for an indeterminate number of loops (varying from 2 to >50)
but then everything stops and I get solid errors when trying to
send doorbells. The relevant field from LSU_STAT shows a completion
code of usually 1 (Transaction Timeout) but occasionally 6
("Retry" response).

After these errors, I have tried flushing the LSU. This seems to work
once or twice, but then the error becomes permanent.

There is only ever one outstanding LSU transaction at a time on each DSP.
There are no interrupts involved.
Interrupt pacing is disabled.

This is a cut-down version of a larger example that is also doing data
transfers across SRIO. This works for a random number of transfers (from
five to several thousand) before getting stuck in the same way as
the doorbell example.

Does anyone have a suggestion as to what might be going on and why?

over 12 years ago

0 tscheck over 12 years ago

TI__Mastermind 23525 points

Peter,

It sounds like something is happening at the link level (physical layer). Are you getting other errors identified too, i.e. SPn_ERR_STAT, ERR_DET, SPn_ERR_DET? If you would like, there is a SRIO debug gel script that you can run to dump this info:

http://processors.wiki.ti.com/index.php/Keystone_Device_Architecture look for Keystone SRIO Debug Gel. Instructions are included.

This would be the first thing I check.

The doorbell timeout CC is self-explanatory, but the Retry CC occurs when a DSP receives a Retry doorbell response. A retry response is sent by our DSP when an incoming doorbell request tries to set an ICSR bit that is already set, i.e. the RX device has not serviced/cleared the preceding interrupt. So if you modified your example to poll the ICSR bit, clear it, then send back a doorbell request, you should never see this CC.

Regards,

Travis

0 Brandy Jabkiewicz over 12 years ago in reply to tscheck

Mastermind 6325 points

Also, don't forget to use the stop error recovery if the errors tscheck asked about (SPn_ERR_STAT, ERR_DET, SPn_ERR_DET) are set.

Here is a post:

http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/170264/675812.aspx#675812

0 Peter Robertson over 12 years ago in reply to Brandy Jabkiewicz

Expert 2770 points

Thanks for that.

I now think I understand why I've been getting the results I have observed in several different experiments.

It appears that at least some of the timeouts I am seeing happen after the message has been successfully sent.
When the transmitter sees the error response and retries sending the message, the receiver gets two (or more)
copies of the message, breaking any synchronisation that is assumed between the two ends.

Can you confirm that it is possible to get an error response even when the transaction has been successfully
completed (at least from the other end's point of view)?

If this is true, is there some other status that can be used to determine this?

If there is no way to detect that the receiver has actually received the message, it means that I will have to
assume that the hardware is completely unreliable and impose yet another protocol on top of the complexity
already needed to drive this device.

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

Peter Robertson said:
Can you confirm that it is possible to get an error response even when the transaction has been successfully
completed (at least from the other end's point of view)?

I want to make sure I understand exactly what you are asking... Let me know if this doesn't address the question:

- If you are asking if it is possible to receive a NWRITE_R Error response, but still have the payload actually land in the RX device's memory, then the answer is no.

- If you are asking if it is possible to receive a doorbell Retry response, but still have the interrupt ICSR bit actually set in the RX device for that doorbell request, then the answer is no.

- If you are asking if it is possible to get an LSU timeout CC (meaning that the response packet wasn't received back at the LSU before the timer expired), but still have the payload actually land in the RX device's memory (NWRITE) or the ICSR bit get set (Doorbell), then the answer is yes. This could happen in the situation discussed above, where something happening on the link causes an error state. Maybe HW recovers, maybe not, but the request packets get sent and are received, while the response packets are delayed long enough that the timer expires. Again, this isn't normal behavior (the timer can be set to 3s before expiring); it would only happen if things are fouled up. In this case you would get a non-solicited response, flagged in the RIO_ERR_DET register.

For data packets, NWRITE_R does guarantee reception of the payload in the receiver's memory. However, even with NWRITEs, the packets will not be dropped; if you send them, they will arrive.
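
The three cases above can be summarised in a small C sketch that maps an LSU completion code to what the sender may conclude about delivery. Only the codes mentioned in this thread are named (0 for success, 1 for Transaction Timeout, 6 for "Retry" response); every other code is deliberately treated as unknown. The enum and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Completion codes named in this thread; other values are not decoded here. */
enum lsu_cc {
    LSU_CC_OK      = 0,  /* transaction completed without error            */
    LSU_CC_TIMEOUT = 1,  /* response did not arrive before timer expired   */
    LSU_CC_RETRY   = 6   /* doorbell "Retry" response: ICSR bit still set  */
};

enum delivery {
    DELIVERED,        /* receiver definitely saw the request               */
    NOT_DELIVERED,    /* receiver definitely did not act on the request    */
    DELIVERY_UNKNOWN  /* request may or may not have taken effect          */
};

/* Hypothetical helper: what can the sender conclude from the CC alone? */
static enum delivery doorbell_delivery(uint32_t cc)
{
    switch (cc) {
    case LSU_CC_OK:      return DELIVERED;
    case LSU_CC_RETRY:   return NOT_DELIVERED;    /* ICSR bit was not set again */
    case LSU_CC_TIMEOUT: return DELIVERY_UNKNOWN; /* request may have landed    */
    default:             return DELIVERY_UNKNOWN; /* codes not covered here     */
    }
}

/* Blindly resending is only free of duplicates when nothing was delivered. */
static int resend_cannot_duplicate(uint32_t cc)
{
    return doorbell_delivery(cc) == NOT_DELIVERED;
}
```

The point of the sketch is that a timeout is the one case where a blind resend can produce a duplicate at the receiver.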

Hope that helps,

Travis

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

The timeouts are from LSU operations and the evidence I have is that the transfer has happened.

I'm not sure what you mean by the ERR_DET flag. Are you saying it will get set along with the timeout, or do you mean that, after the timeout response, a late-arriving response will subsequently be flagged?

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

Ok, so you are falling into that third category I mentioned above. The sequence of events would look something like this...

1) LSU on DSP 1 sends NWRITE_R and starts the response timer counting

2) The NWRITE_R request lands at the DSP 2 successfully and is written to memory successfully

3) Some sort of error happens on the link, causing a long delay and HW tries to recover

4) DSP1 LSU response timer expires, the LSU sets the CC = timeout, and moves on to the next transaction. It is done with the transaction and did not receive a response from DSP2.

5) DSP2 now sends the response since the physical layer link error (or whatever caused the long delay and timeout) was solved.

6) DSP1 LSU is not expecting a response anymore for that transaction since it has moved on. The late response is considered an unsolicited response and is marked in the RIO_ERR_DET register bit 23.
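
As a rough illustration of step 6, the following C sketch tests a RIO_ERR_DET value for the unsolicited-response flag. The bit position (23) is taken from the description above; the macro and function names are hypothetical, and reading the actual register is left out because it depends on the platform's register map.

```c
#include <assert.h>
#include <stdint.h>

/* Bit 23 of RIO_ERR_DET, per the sequence described above. */
#define RIO_ERR_DET_UNSOLICITED_RESP (UINT32_C(1) << 23)

/* Returns 1 if a response arrived that no LSU transaction was waiting for. */
static int late_response_seen(uint32_t rio_err_det)
{
    return (rio_err_det & RIO_ERR_DET_UNSOLICITED_RESP) != 0;
}

/*
 * Hypothetical check after a timeout completion code (1 in this thread):
 * a late response suggests the request did reach the other DSP, so a
 * resend would probably be seen as a duplicate.
 */
static int timeout_probably_delivered(uint32_t cc, uint32_t rio_err_det)
{
    return cc == 1u && late_response_seen(rio_err_det);
}
```

This can only confirm delivery after the fact: if the response is lost rather than late, the flag never appears.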

Regards,

Travis

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

Thanks - that makes it clear.

I've done some more investigation and I'm finding that, after sending a doorbell to the other DSP, the port ERR_DET register is very often (but not always) getting set to one of three values:

0x00100000, 0x00100010, or 0x00100011. The NOT_ACC bit is always set.

These are also sometimes set after the DSP has received the doorbell event (I'm polling, so there is no interrupt).

Trying to clear these bits by writing 0 to the register takes several attempts.

In nearly all cases, the completion code from the LSU_STATUS registers is 0.

Even with these bits being set, the doorbells seem to get through and the program works as before. I find this strange as the description of the bits seems to suggest that packets have been rejected.

Do you have a better description of what these bits are trying to say?

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

Peter,

Bit 20 is the received Packet-not-accepted control symbol. This indicates that the link partner received a bad packet, for example the CRC wasn't correct. So the link partner goes into input-error stopped state and sends the packet-not-accepted CS. There is a normal HW handshake mechanism to overcome this.

Bit 4 is the protocol error. Don't see this one often, but it indicates bad control symbol or unexpected packet. Again, this is link layer, not logical layer error, so could be bit errors again. When you see this set, I believe you should see the input error stopped encountered bit set in the SP(n)_ERR_STAT.

Bit 0 is the link timeout. Basically the link partner physical layer ACK didn't occur before the timer expired (SP_LT_CTL). This could be occurring because of the bit errors and HW recovery discussed above.

You can still have errors indicated even though the HW may have recovered. The error indicators that mean you are "stuck" and require software to intervene are the fatal port error (bit 2), output error stopped (bit 16), or input error stopped (bit 8) of SP(n)_ERR_STAT.
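
The bit assignments described above can be written down as a short C sketch. The bit positions come from this post; the macro and function names are hypothetical, and the register values are passed in as plain integers rather than read from hardware.

```c
#include <assert.h>
#include <stdint.h>

/* SP(n)_ERR_DET bits discussed above. */
#define SP_ERR_DET_PKT_NOT_ACC   (UINT32_C(1) << 20) /* packet-not-accepted CS received */
#define SP_ERR_DET_PROTOCOL_ERR  (UINT32_C(1) << 4)  /* bad or unexpected control symbol */
#define SP_ERR_DET_LINK_TIMEOUT  (UINT32_C(1) << 0)  /* link-level ACK timed out         */

/* SP(n)_ERR_STAT bits that mean software has to intervene. */
#define SP_ERR_STAT_OUTPUT_ERR_STOPPED (UINT32_C(1) << 16)
#define SP_ERR_STAT_INPUT_ERR_STOPPED  (UINT32_C(1) << 8)
#define SP_ERR_STAT_PORT_FATAL         (UINT32_C(1) << 2)

/* Returns any ERR_DET bits that are NOT one of the three explained above. */
static uint32_t sp_err_det_unexplained(uint32_t err_det)
{
    const uint32_t known = SP_ERR_DET_PKT_NOT_ACC
                         | SP_ERR_DET_PROTOCOL_ERR
                         | SP_ERR_DET_LINK_TIMEOUT;
    return err_det & ~known;
}

/* Returns 1 if the port is in a state that hardware will not leave by itself. */
static int sp_port_is_stuck(uint32_t err_stat)
{
    const uint32_t stuck = SP_ERR_STAT_OUTPUT_ERR_STOPPED
                         | SP_ERR_STAT_INPUT_ERR_STOPPED
                         | SP_ERR_STAT_PORT_FATAL;
    return (err_stat & stuck) != 0;
}
```

With these definitions, the three values reported earlier (0x00100000, 0x00100010 and 0x00100011) contain only the three explained bits, so `sp_err_det_unexplained` returns 0 for each of them.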

What I would suggest, if you aren't already doing this, is to follow that software error recovery document (from the other e2e thread referenced above) and make sure after port_ok is set, you clear any error conditions and reset any error flags, do the ACKID alignment if needed, and then do your normal tests. If you are getting these errors at that point, you may have a marginal board. I'm also going to point you to a couple of other useful threads that I don't think I mentioned to you yet...

http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/p/196080/850001.aspx#850001 - VMIN setting

http://e2e.ti.com/support/dsp/c6000_multi-core_dsps/f/639/t/255031.aspx - Refclk port_ok discussions, VRANGE, MSYNC Serdes settings

Regards,

Travis

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

The program already has:

LONG_CS_TX1 = 0x2003F044.

VMIN_EXP set to 0x0F030300

MSYNC 1 for SERDES CRG TX[0] only

Does anything else need to be changed?

Is it normal to need to repeat setting ERR_DET to zero to clear the bits?

The real program (from which my test is a much-simplified extract) runs for many many iterations of correct transfers (over 2**25) before it fails in a bizarre way. The failure mode is when transferring a 9-word message. Words 0,1,2,3 & 8 are transferred correctly but the memory of words 4,5,6 & 7 is not written to.

It's beginning to look like your comment about a marginal board may be the case.

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

Peter Robertson said:
Does anything else need to be changed?

Not that I can think of right now.

Peter Robertson said:
Is it normal to need to repeat setting ERR_DET to zero to clear the bits?

I don't think this is normal, only thing I can think of is that they are immediately being set again.

Peter Robertson said:
The real program (from which my test is a much-simplified extract) runs for many many iterations of correct transfers (over 2**25) before it fails in a bizarre way. The failure mode is when transferring a 9-word message. Words 0,1,2,3 & 8 are transferred correctly but the memory of words 4,5,6 & 7 is not written to.

By failure, you are saying the memory contents at the RX device aren't expected, not that you are seeing a SRIO error reported, correct? Not sure, but it could be boundary related, i.e. cache, or endian ???

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

The program that is failing is a much larger test, the essence of which is as follows:

DSP2 repeats the following forever:

send a 9-word (36-byte) message to DSP1 followed by a doorbell;

wait for a doorbell from DSP1.

DSP1 repeats the following:

clear each byte of the receiving buffer to 0.

wait for a doorbell from DSP2;

check the received message.

The 9 words of the messages are: 32, A, A+1, A+2, A+3, A+4, A+5, A+6, A+7.

A starts at 0 and is incremented by 10 after sending each message, so the first three messages contain the following words:

00000020, 00000000, 00000001, 00000002, 00000003, 00000004, 00000005, 00000006, 00000007
00000020, 0000000A, 0000000B, 0000000C, 0000000D, 0000000E, 0000000F, 00000010, 00000011
00000020, 00000014, 00000015, 00000016, 00000017, 00000018, 00000019, 0000001A, 0000001B
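
A minimal C sketch of the message pattern just described, useful for checking a received buffer word by word. The function names are hypothetical; only the pattern itself (32 followed by A through A+7) comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

#define MSG_WORDS 9

/* Build one message: word 0 is 32, words 1..8 are A, A+1, ..., A+7. */
static void fill_message(uint32_t msg[MSG_WORDS], uint32_t a)
{
    uint32_t i;
    assert(msg != 0);
    msg[0] = 32u;
    for (i = 1; i < MSG_WORDS; i++) {
        msg[i] = a + (i - 1u);
    }
}

/* Return the index of the first word that differs from the pattern, or -1. */
static int first_bad_word(const uint32_t rx[MSG_WORDS], uint32_t a)
{
    uint32_t expected[MSG_WORDS];
    int i;
    assert(rx != 0);
    fill_message(expected, a);
    for (i = 0; i < MSG_WORDS; i++) {
        if (rx[i] != expected[i]) {
            return i;
        }
    }
    return -1;
}
```

Calling `fill_message(msg, 10)` reproduces the second line of the listing above, and `first_bad_word` gives the receiver a single place to detect any deviation from it.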

The receiving buffer in DSP1 is 128-byte aligned and 128 bytes long, so there shouldn't be any alignment issues.

The buffer is in internal memory (actually at address 0x10807300).
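
The alignment claim is easy to confirm with a little arithmetic, sketched here in C. The helper names are hypothetical; the address and sizes are the ones quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of align (align must be a power of two). */
static int is_aligned(uint32_t addr, uint32_t align)
{
    assert(align != 0u && (align & (align - 1u)) == 0u);
    return (addr & (align - 1u)) == 0u;
}

/* Byte address of 32-bit word number `index` in a buffer starting at `base`. */
static uint32_t word_address(uint32_t base, uint32_t index)
{
    return base + 4u * index;
}
```

For the buffer at 0x10807300, `is_aligned(0x10807300, 128)` holds, and the 36-byte message occupies addresses 0x10807300 through 0x10807323, well inside the 128-byte buffer.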

The cache is flushed in DSP2 before sending the data and is invalidated in DSP1 before it sends the doorbell.

As both the sending and receiving buffers are in internal memory, this should not be necessary, but I do it anyway.

If I completely ignore the ERR_DET status, this program runs for many iterations until it fails having received something like this in the input buffer:

00000020 00A8EE08 00A8EE09 00A8EE0A 00000000 00000000 00000000 00000000 00A8EE0B

Changing DSP1 to set the bytes of the input buffer to 1 instead of 0 results in a failure like this:

00000020 017FCE92 017FCE93 017FCE94 01010101 01010101 01010101 01010101 017FCE99

The failure is detected by the incorrect data, not by any SRIO status; the SRIO status appears to be no different from the successful transfers.

After the failure, the 9th word seems to be randomly either the 5th word sent or the 9th word sent.
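
To make the two failure variants concrete, here is a hedged C sketch that classifies a received buffer against the expected pattern and the value the buffer was pre-filled with. The enum and function names are hypothetical; the classification rules simply restate the observations above.

```c
#include <assert.h>
#include <stdint.h>

#define MSG_WORDS 9

enum rx_result {
    RX_OK,               /* all nine words match the pattern                  */
    RX_HOLE_THEN_WORD5,  /* words 4..7 untouched, word 8 holds 5th word sent  */
    RX_HOLE_THEN_WORD9,  /* words 4..7 untouched, word 8 holds 9th word sent  */
    RX_OTHER             /* anything else                                     */
};

/* rx: received buffer, a: expected value of A, fill: pre-fill word value. */
static enum rx_result classify_rx(const uint32_t rx[MSG_WORDS],
                                  uint32_t a, uint32_t fill)
{
    uint32_t i;
    int intact = 1;     /* words 4..7 hold the expected values   */
    int untouched = 1;  /* words 4..7 still hold the pre-fill    */

    assert(rx != 0);
    if (rx[0] != 32u) return RX_OTHER;
    for (i = 1; i <= 3; i++) {
        if (rx[i] != a + (i - 1u)) return RX_OTHER;
    }
    for (i = 4; i <= 7; i++) {
        if (rx[i] != a + (i - 1u)) intact = 0;
        if (rx[i] != fill) untouched = 0;
    }
    if (intact) return (rx[8] == a + 7u) ? RX_OK : RX_OTHER;
    if (!untouched) return RX_OTHER;
    if (rx[8] == a + 3u) return RX_HOLE_THEN_WORD5;
    if (rx[8] == a + 7u) return RX_HOLE_THEN_WORD9;
    return RX_OTHER;
}
```

Applied to the two dumps above, the first (pre-fill 0) classifies as `RX_HOLE_THEN_WORD5` and the second (pre-fill 0x01010101) as `RX_HOLE_THEN_WORD9`.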

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

Peter,

I'm traveling, so sorry for the late reply. This is strange and I don't have a lot of ideas. If the packet arrives and is error free, SRIO simply uses the address in the packet and moves the payload across an internal bus. Please try your same test with cache completely disabled on the TX and Rx DSP and let me know what you see. That is all I can think of right now.

Regards,

Travis

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

Travis,

As everything on both DSPs is in internal memory, turning the cache off (on both DSPs) had no effect on the problem, which is what I expected.

There must be something really strange happening, as my test runs for a huge number of successful transfers (more than 7 million in the last test).

Although interrupt driven, the code is essentially purely sequential.

Is there anything to do with SRIO which works in blocks of 4 words (16 bytes)? The puzzling thing is that the failure either writes 5 correct values but skips 4 memory words (32, X, X+1, X+2, 0, 0, 0, 0, X+3), or writes 9 correct values but 4 of them never reach memory (32, X, X+1, X+2, 0, 0, 0, 0, X+7).

There is still the unexplained oddity of nearly every transaction apparently working, but setting ERR_DET bit 20 (and sometimes bits 4 and 0).

Regards,

Peter

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

Peter,

When you say internal memory, are you referring to MSMC or L2 memory? The MSMC address range is cached by default and can not be disabled. So if you are reading from this space on the RX DSP, caching could still be an issue. The MAR bits control whether the L2 address range is cached or not. So if you are using L2 and have disabled all the MAR for that region, you shouldn't have a problem. This is still a strange issue, and nothing else is coming to mind. There should be no 4 word block dependency by SRIO.

Regarding your occasional setting of bits 20, 4, and 0 of the SP(n)_ERR_DET register, these can show up due to a transmission bit error (CRC). So I wouldn't put any more thought behind those. I apologize if I already told you this (I may be mixing threads), but please disable port-writes with: RIO_EM_DEV_PW_EN bit 0 and RIO_ERR_EN = 0x00000000 and RIO_SPn_RATE_EN = 0x00000000.

Regards,

Travis

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

Travis,

Everything (code & data) is in L2 (DSP1: 800000..81A000, DSP2: 800000..810F00)

You say "The MAR bits control whether the L2 address range is cached or not". This contradicts the corepac documentation that shows MAR0, which would cover this address range, as being read-only.

I find it difficult to believe that the cache has anything to do with this. The code is essentially sequential and runs for a very long time before failing, and the error is affecting only part of a 128-byte-aligned memory area that is 128 bytes long. Even if the cache were causing problems, I would expect the whole of the area to appear to be corrupted rather than a block of 4 words (always in the same position) being either inserted or skipped.

Regards,

Peter

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

Peter,

Sorry for the confusion... You are right, the MAR bits only control the cacheability of other cores' L2. Whether using local or global L2 addresses, when a core accesses its own L2 space it is cached. The only way to make sure it is not cached in L1D is either 1) make L1D all SRAM or 2) use the MPAX registers to map the L2 physical addresses to a different virtual address. I tend to agree with you that it wouldn't seem to be a cache issue, though I have seen unexpected cache issues cause many things. Thus I was trying to rule it out. A quick check, if you can do it, would be to set L1D to SRAM, or just keep the program as is and, when you hit a failure, read the memory region directly via the CCS memory window with the cache box unchecked. Do you know if you are using rev 2.0 silicon?

Regards,

Travis

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

I tried setting the whole of L1P as RAM (it then gets used for various memory areas in the code) but this makes no difference to the failure.
Similarly, doing the same for L1P ends up with various bits of code in that memory and has no effect on the failure.
I think we've ruled out cache problems.

I can't use CCS to debug as my emulator (SEED) doesn't support the C6678. It's a pain, but such is life.

I'll have to check with the board manufacturer to see which silicon version I have - I can't see the chips because of heat sinks and other things that I don't want to remove.

I have also now seen a few variants of the failure condition. They all involve the second block of 4 32-bit words in the transfer, but sometimes the other words have missed values.

I have been trying to think of ways I could provoke this behaviour deliberately in software, but so far, I can't think of anything.

0 Peter Robertson over 12 years ago in reply to Peter Robertson

Expert 2770 points

The board has PG1 silicon.

0 tscheck over 12 years ago in reply to Peter Robertson

TI__Mastermind 23525 points

I think it was a typo above, just confirming you tried L1D as SRAM?

Is there any chance you will be able to try PG2.0 silicon?

I'm getting another set of eyes to look through this to see if any ideas come up.

Regards,

Travis

0 Peter Robertson over 12 years ago in reply to tscheck

Expert 2770 points

Oops - you are right. The first L1P should have been L1D.

To repeat, I have tried making BOTH L1P and L1D SRAM and this has no effect on the problem.

I don't know if the manufacturer has different hardware available; I shall have to ask them.

Regards,

Peter
