Recovering SPI Communication after Failure

Matt Riggs

Other Parts Discussed in Thread: DLPC200

Hello again!

I have been working with the DLPC200 in several custom board builds now for over a year, and have been using the SPI to command the DLPC200 successfully for everything we've needed.

I've now got two boards (out of a total of ~25 boards in three different builds) that will occasionally have a problem where the 'echo' back from the DLPC200 will be incorrect (the echo byte on the MISO line will not mirror what was on the MOSI line during the previous byte). This specifically occurs during either the SetDataSource, WriteImageOrderLut, or DisplayPatternForAutoStep commands. If the DLPC200's USB interface is plugged into a PC, when the 'echo' error occurs the DLPC200 device drops off the USB bus and enumerates as an 'unknown device', which makes me think this is a deeper problem than the SPI bus.
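For readers following along, the echo check described above can be sketched in a few lines. This is only an illustration: the function name and the sample bytes are hypothetical, and it assumes the capture is available as two equal-length byte lists where each MISO byte should mirror the MOSI byte sent one position earlier.

```python
def find_echo_mismatches(mosi, miso):
    """Return the MOSI indices whose echo on MISO (one byte later) is wrong."""
    mismatches = []
    # The echo of mosi[i] is expected at miso[i + 1]; the echo of the last
    # MOSI byte is only visible if one more byte is clocked afterwards.
    for i in range(min(len(mosi), len(miso) - 1)):
        if miso[i + 1] != mosi[i]:
            mismatches.append(i)
    return mismatches


# Hypothetical capture: the third byte (0x30) is echoed back as 0x31.
mosi = [0x10, 0x20, 0x30, 0x40, 0x50]
miso = [0x00, 0x10, 0x20, 0x31, 0x40]
print(find_echo_mismatches(mosi, miso))
```

Logging the index of the first mismatch makes it easier to tell whether a failure is tied to a particular byte position in the command.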

I am still investigating this issue and may have further questions soon, but for now I'd like to try recovering from the 'echo' failure. Right now, when this 'echo' error occurs I stop communicating with the DLPC200 as the SPI specification doesn't offer a way to recover from this situation.

Is there a way that I can attempt to recover the SPI communication? Sending 0x00 bytes a certain number of times, or sending some kind of reset packet, etc? I'd like to avoid resetting the DLPC200 (via the RESET signal), because with 120 8-bit images in the default solution it takes a little under a minute to boot.

Thanks for any help you can provide!

-Matt Riggs

over 11 years ago

0 Sanjeev over 11 years ago

TI__Mastermind 30875 points

Hello Matt Riggs,

Just to make sure, please confirm that you are on the latest firmware version, 2.2.0.

Per the DLPC200 SPI interface command specification, there is a DLPC200 reset command (section 7.7, DLPC200 Reset), but as I understand it you want to avoid a complete chip reset and initialization.

when the 'echo' error occurs the DLPC200 device drops off the USB bus and enumerates as an 'unknown device', which makes me think this is a deeper problem than the SPI bus.

[Sanjeev] We haven't noticed such a problem, but would like to understand more about it.

The DLPC200 SPI works on a command-response protocol. As I understand it, the error occurs while the command data is being sent and the echo data comes back wrong. The command parser starts working on a received command only after the complete command packet has been received, or once the number of data bytes in the packet exceeds 512. The 512-byte limit guards against erroneous conditions, since none of the commands is longer than 512 bytes.

So my suggestion is that when you see a wrong echo, you still complete the command message transfer. Does the command response then return some kind of error? The other way is to send 0x00 bytes. Looking at the command definition, there are 6 bytes (cmd1 - cmd2, then 2 bytes of length), followed by the payload and checksum, so sending something like 0x00 0x00 0x00 0x00 0x02 0x00 0x00 0x00 0x02 should be flagged as a wrong command, and the device should then be ready to execute the following command properly.
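A minimal sketch of building that nine-byte "flush" packet is shown below. The layout used here (four command bytes, a two-byte length sent low byte first, the payload, then a checksum equal to the modulo-256 sum of all preceding bytes) is an assumption inferred only from the example bytes in this post; it should be checked against the DLPC200 SPI specification. `build_packet`, `send_flush_packet`, and `transfer_byte` are hypothetical names.

```python
def build_packet(cmd_bytes, payload):
    """Assemble a packet under the ASSUMED layout described above."""
    if len(cmd_bytes) != 4:
        raise ValueError("expected 4 command bytes")
    length = len(payload)
    body = (
        bytes(cmd_bytes)
        + bytes([length & 0xFF, (length >> 8) & 0xFF])  # ASSUMPTION: LSB first
        + bytes(payload)
    )
    checksum = sum(body) & 0xFF  # ASSUMPTION: modulo-256 sum of prior bytes
    return body + bytes([checksum])


def send_flush_packet(transfer_byte):
    """Clock out the all-zero command; transfer_byte is a hypothetical
    callable that sends one byte over SPI and returns the byte read back."""
    packet = build_packet([0x00, 0x00, 0x00, 0x00], [0x00, 0x00])
    return [transfer_byte(b) for b in packet]


flush = build_packet([0x00, 0x00, 0x00, 0x00], [0x00, 0x00])
print(" ".join(f"0x{b:02X}" for b in flush))
```

Under these assumptions the builder reproduces the byte sequence quoted above, which is at least consistent with the inferred layout.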

Regards,

Sanjeev

0 Matt Riggs over 11 years ago in reply to Sanjeev

Prodigy 170 points

Sanjeev-

We are using DLP firmware version 2.2.0, yes!

Thank you for the information; I'm still trying to determine exactly what is happening. I've learned that only one set of hardware exhibits the 'echo' problem and drops off the USB bus -- I need more time to investigate this.

For the second board set, communication was failing on the same three commands (SetDataSource, WriteImageOrderLut, or DisplayPatternForAutoStep). I was able to look at the response Data[0] and Data[1] and noticed that while Data[0] was 0x00, Data[1] was 0x80. This occurred three separate times, making it unlikely that the response was corrupted.

Since all three packets that I am sending to the DLPC200 are extended commands, this is strange since the documentation states bit 7 is "Not applicable for Extended Packet". Does bit 7 have any meaning with regard to extended packets?

I am going to mark your first post as the answer since it did answer the core question. Thanks!

-Matt Riggs

0 Sanjeev over 11 years ago in reply to Matt Riggs

TI__Mastermind 30875 points

Hello Matt,

[Sanjeev] It depends on what the DLPC200 saw as part of the command. What was shown in the command echo for these commands, CMD1 through CMD4? In particular, is the echoed CMD2 byte 0xAA? If not, then it was not processed as an extended command in the first place.

Regards,

Sanjeev

0 Matt Riggs over 11 years ago in reply to Sanjeev

Prodigy 170 points

Hello Sanjeev-

I have been able to confirm that I have two separate issues -- one single instance with an apparently noisy SPI bus, where the echo from the DLP is incorrect, and now three instances where the echo is fine, but the DLP responds with a 0x80 in Data[1].

I don't have access to the 'noisy SPI bus' hardware at the moment, and so I am focusing on the 0x80 in Data[1]. With the echo being correct for the commands, what could that value mean? The DLPC200's SPI Slave Interface Specification lists bit 7 as "CMD_ERR_CORRUPT_PKT_RCVD - Not applicable for Extended Packet."

For all four cases, the hardware is writing the same set of three commands and then issuing a single hardware trigger over and over again. This change in Data[1] value happens seemingly at random during this test.

I apologize for being a little scattered with this thread, and appreciate any insight you can give!

Thanks,

-Matt Riggs

0 Sanjeev over 11 years ago in reply to Matt Riggs

TI__Mastermind 30875 points

Hello Matt Riggs,

We appreciate your keen observation into the behavior.

As you know, the DLPC200 controller's embedded software command handler works on two physical interfaces, namely USB (via the Cypress USB chip) and the SPI slave port.

In the firmware, this flag is specifically set only when there is activity on the USB side. The flag CMD_ERR_CORRUPT_PKT_RCVD means that, in the received payload, CMD1 is neither 0x02 (command) nor 0x04 (response).

By any chance, does your board have the USB interface enabled, and is there any activity going on?

I also see a minor issue in the packet parsing logic: irrespective of whether the packet came from USB or SPI, the code checks the correctness of the first byte in the command buffer. Technically, this should be checked only for the USB source, so you can safely ignore the flag while communicating via the SPI interface alone.

If we fix the above condition, you will ideally stop seeing DATA[1] bit 7 being set while communicating on the SPI side.
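The parsing issue described here can be modelled roughly as follows. This is an illustrative model of the behaviour as described in this post, not the actual DLPC200 firmware; the function names and the "usb"/"spi" source labels are hypothetical.

```python
CMD1_COMMAND = 0x02   # from the post: CMD1 value for a command
CMD1_RESPONSE = 0x04  # from the post: CMD1 value for a response


def corrupt_pkt_flag_as_described(cmd_buffer, source):
    """Described behaviour: the first byte is checked whatever the source."""
    return cmd_buffer[0] not in (CMD1_COMMAND, CMD1_RESPONSE)


def corrupt_pkt_flag_usb_only(cmd_buffer, source):
    """Proposed behaviour: apply the first-byte check to USB packets only."""
    return source == "usb" and cmd_buffer[0] not in (CMD1_COMMAND, CMD1_RESPONSE)


# Hypothetical SPI buffer whose first byte is neither 0x02 nor 0x04.
buf = [0x00, 0xAA, 0x0E, 0x00]
print(corrupt_pkt_flag_as_described(buf, "spi"))  # flag raised on SPI traffic
print(corrupt_pkt_flag_usb_only(buf, "spi"))      # flag suppressed on SPI
```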

Since it is not an urgent issue, I think we should be OK.

Let me know if I answered your question.

Regards,
Sanjeev

0 Matt Riggs over 11 years ago in reply to Sanjeev

Prodigy 170 points

[Attachment: 8182.psx dsp checksum error.zip]

Thanks Sanjeev!

I do not have the USB interface plugged in -- the interface is implemented and functional, but when these issues are occurring the USB interface is not plugged in.

Taking your advice, for Extended Packets I selectively ignored the CMD_ERR_CORRUPT_PKT_RCVD error. Going the next step, for Extended Packets I also ignored bits 3 and 5 in Data[0] and all Data[1] bits except bit 3 -- these are bits listed as not applicable in the table on Page 7 of DLPU005A.
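The masking described in this paragraph might look like the sketch below. The bit positions are the ones stated in the post (ignore bits 3 and 5 of Data[0], and every Data[1] bit except bit 3); the constant and function names are hypothetical.

```python
# Bits to ignore for Extended Packets, as described in the post.
DATA0_IGNORED_BITS = (1 << 3) | (1 << 5)
DATA1_KEPT_BITS = 1 << 3


def mask_extended_status(data0, data1):
    """Clear the status bits that are treated as not applicable."""
    return data0 & ~DATA0_IGNORED_BITS & 0xFF, data1 & DATA1_KEPT_BITS


# Data[1] = 0x80 (bit 7 only) is masked away entirely.
print(mask_extended_status(0x00, 0x80))
# A checksum error in Data[0] bit 0 survives the mask.
print(mask_extended_status(0x01, 0x80))
```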

Having done that, the DLP will now report a CMD_ERR_CHK_SUM_ERROR error (Data[0] = 0x01), both with and without a CMD_ERR_CORRUPT_PKT_RCVD error. I can see packets with all four variations -- neither error bit, checksum only, corrupt packet only, and both. I'm not seeing any other bits set.
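To tally the four variations over a long run, each response can be classified with something like the sketch below. The bit positions come from this thread (checksum error in Data[0] bit 0, corrupt packet in Data[1] bit 7); the function name and the returned labels are hypothetical.

```python
from collections import Counter

CMD_ERR_CHK_SUM_ERROR = 0x01     # Data[0] bit 0
CMD_ERR_CORRUPT_PKT_RCVD = 0x80  # Data[1] bit 7


def classify_response(data0, data1):
    """Label a response by which of the two error bits are set."""
    checksum = bool(data0 & CMD_ERR_CHK_SUM_ERROR)
    corrupt = bool(data1 & CMD_ERR_CORRUPT_PKT_RCVD)
    if checksum and corrupt:
        return "both"
    if checksum:
        return "checksum only"
    if corrupt:
        return "corrupt packet only"
    return "neither"


# Hypothetical log of (Data[0], Data[1]) pairs.
log = [(0x00, 0x00), (0x01, 0x00), (0x00, 0x80), (0x01, 0x80), (0x00, 0x00)]
counts = Counter(classify_response(d0, d1) for d0, d1 in log)
print(counts)
```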

The checksum error happens about as regularly as the CMD_ERR_CORRUPT_PKT_RCVD error used to, which is to say on affected hardware it will occur after a few hours of continuous operation and then regularly (within 100 commands) after that.

Assuming I had a checksum error, I obtained a logic analyzer capture of the checksum failure. Looking at the SPI data, my checksum is correct, and the DLP's echo byte for the checksum (and all bytes in the command) is also correct. In this case, I see that CMD_ERR_CHK_SUM_ERROR and CMD_ERR_CORRUPT_PKT_RCVD are both set in the response.

While this isn't an oscilloscope so I don't have a good look into what the electrical signals look like, the logic analyzer shows signals that look like they start and stop on time and don't vary much, down to the 20ns resolution I was using. I have looked at oscilloscope captures of non-erroneous SPI data and everything looked healthy, but I wasn't able to induce this error while the oscilloscope was attached.

I've attached the logic analyzer capture in case you'd like to take a look -- you'll need Tektronix's free TLA viewer, which can be downloaded here: http://www.tek.com/tla5201-software-0

Do you have any ideas? I really am stumped about how the DLP could echo back the correct checksum and then flag a checksum error. I plan to reduce the SPI clock 50% (from 4MHz to 2MHz) and see if that makes a difference. I'm open to any suggestions you have!

Thanks again,

-Matt Riggs

0 Sanjeev over 11 years ago in reply to Matt Riggs

TI__Mastermind 30875 points

Hello Matt,

I went through the details provided by you.

Assuming I had a checksum error, I obtained a logic analyzer capture of the checksum failure. Looking at the SPI data, my checksum is correct, and the DLP's echo byte for the checksum (and all bytes in the command) is also correct. In this case, I see that CMD_ERR_CHK_SUM_ERROR and CMD_ERR_CORRUPT_PKT_RCVD are both set in the response.

[Sanjeev] When the echoed data is correct, it doesn't make sense to me either on why it should report error flags. Are you able to reproduce this consistently on the faulty board?

but I wasn't able to induce this error while the oscilloscope was attached.

[Sanjeev] Do you mean no errors occur on the SPI bus? Is this something to do with noise on the signal? By the way, how are you inducing the errors?

I really am stumped about how the DLP could echo back the correct checksum and then flag a checksum error.

[Sanjeev] The echo is simple: each received byte is copied to the TX buffer and then copied to the command buffer for processing. The logic is very well tested.

You mentioned,

SPI CMD send -> command echo is correct -> wait for command process complete -> Read response message (this message shows DATA[1] BIT7 = '1' and also DATA[0] BIT0 = '1').

Once this error occurs, does it quit working? I mean, do all further commands fail or go unexecuted? Does it become totally unusable, so that the only way out is to reboot the board?

It is a good idea to check with a slower 2MHz clock instead of 4MHz.

Regards,
Sanjeev

0 Matt Riggs over 11 years ago in reply to Sanjeev

Prodigy 170 points

Hello Sanjeev-

Here are my responses:

[Sanjeev] When the echoed data is correct, it doesn't make sense to me either on why it should report error flags. Are you able to reproduce this consistently on the faulty board?

Yes, I can reproduce it consistently on two of our boards here. A third board has seen this issue twice in 3 weeks of solid (>4 hours/day) use. On the two more consistent boards, these are the reproduction steps:

1. Send the SetDataSource command (0x000E), SL_EXT1P8. Accept response.
2. Send the WriteImageOrderLUT command (0x000D), setting the LUT to a single image (which image varies). Accept response.
3. Send the DisplayPatternAutoStepForSinglePass (0x0033) command. Accept response.
4. Using a hardware trigger pulse 10us in duration, trigger the DLP.
5. Wait 200ms.

Repeat steps (1)-(5) until one of the command responses in step (1), (2), or (3) has one of the grayed bits in the table on page 7 of the DLPU005A document set in Data[0] or Data[1].
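The reproduction procedure above can be sketched as a loop like the following. The command IDs are the ones given in the steps; `send_command`, `pulse_trigger`, and `has_error` are hypothetical callables standing in for the real SPI driver, the 10us hardware trigger, and the check for the grayed bits, and command payloads are left to `send_command`.

```python
import time

SET_DATA_SOURCE = 0x000E
WRITE_IMAGE_ORDER_LUT = 0x000D
DISPLAY_PATTERN_AUTO_STEP_SINGLE_PASS = 0x0033


def run_until_error(send_command, pulse_trigger, has_error,
                    max_cycles=None, wait_s=0.2, sleep=time.sleep):
    """Repeat steps 1-5 until a response has an unexpected bit set.

    send_command(cmd_id) must return the (Data[0], Data[1]) response pair.
    Returns (cycle, cmd_id, response) for the first error, or None if
    max_cycles is reached without one.
    """
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        cycle += 1
        for cmd_id in (SET_DATA_SOURCE, WRITE_IMAGE_ORDER_LUT,
                       DISPLAY_PATTERN_AUTO_STEP_SINGLE_PASS):
            response = send_command(cmd_id)       # steps 1-3
            if has_error(*response):
                return cycle, cmd_id, response
        pulse_trigger()                           # step 4
        sleep(wait_s)                             # step 5
    return None
```

Recording the cycle count and the failing command ID helps separate the first failure after power-up from the more frequent failures that follow it.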

During a single power cycle, the first error (failure bits being set) occurs between 15 minutes and 6 hours after the above process is started. After that, it takes < 100 commands to cause an error. Power cycling the system causes this to 'reset', and another 15 minutes-6 hours is needed to reproduce the error.

[Sanjeev] Do you mean no errors occur on the SPI bus? Is this something to do with noise on the signal? By the way, how are you inducing the errors?

I was able to trap the error (failure bits being set) using a logic analyzer, and the capture I sent in my previous post shows the error occurring. Using an oscilloscope to try and characterize signal noise, I was not able to see the actual error occur, but I was able to confirm that each of the SPI signals look clean. My personal theory is that it is noise related somehow, but from the captures I saw that was not the case.

Also note that the error (failure bits being set) that I caught with the logic analyzer was not the first error (the one that takes 15 minutes-6 hours to reproduce), but one of the errors that takes <100 commands to occur. So this error may be different from that initial error.

[Sanjeev] Once this error occurs, does it quit working? I mean, do all further commands fail or go unexecuted? Does it become totally unusable, so that the only way out is to reboot the board?

No, it does not become unusable. I can continue to communicate with and use the DLP, but as I mentioned above, the checksum error then occurs much more frequently. My control application stops executing when the error is seen, but if I restart it (without restarting the board) and re-send those commands described above it continues to work. I have implemented a 'retry' mechanism which will identify errors and re-send the problem command up to 10 times; I am still testing this out. This obviously masks the issue instead of resolving it and is not an ideal resolution.
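A retry wrapper along the lines described might look like the sketch below. It assumes "up to 10 times" means ten attempts in total, and `send_command` and `is_error` are hypothetical callables for the SPI driver and the error-bit check; this is not the actual control application code.

```python
def send_with_retry(send_command, cmd_id, is_error, max_attempts=10):
    """Send cmd_id, re-sending while the response reports an error.

    Returns (response, attempts_used); raises RuntimeError if every
    attempt fails so the failure is not silently swallowed.
    """
    response = None
    for attempt in range(1, max_attempts + 1):
        response = send_command(cmd_id)
        if not is_error(*response):
            return response, attempt
    raise RuntimeError(
        f"command 0x{cmd_id:04X} failed {max_attempts} times, "
        f"last response {response}"
    )


# Simulated device: two checksum errors, then a clean response.
replies = iter([(0x01, 0x00), (0x01, 0x00), (0x00, 0x00)])
result = send_with_retry(lambda cmd: next(replies), 0x000E,
                         lambda d0, d1: bool(d0 & 0x01))
print(result)
```

Returning the attempt count makes it possible to log how often retries are needed, so the underlying issue stays visible rather than hidden.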

[Sanjeev] It is a good idea to check with a slower 2MHz clock instead of 4MHz.

I just got this test running today; it has run 7 hours so far without a failure, I will let it run overnight and see what happens.

Any other suggestions or questions are welcome!

Thanks,

-Matt

0 Sanjeev over 11 years ago in reply to Matt Riggs

TI__Mastermind 30875 points

Hello Matt

On your comment above

No, it does not become unusable. I can continue to communicate with and use the DLP, but as I mentioned above, the checksum error then occurs much more frequently.

This is important input. I suspect it happens when there is an error or noise on the bus, but since you mention that the DLPC200 always echoes properly, I am not able to explain it.

Also, I saw your steps. At the end you add a 200ms delay; I presume you do this on the assumption that the DisplayPatternAutoStepForSinglePass command will have displayed ALL patterns by then, before proceeding to step #1 (set data source). Is that correct?

I will wait for your final result with the 2MHz clock. Since it is failing on 3 out of 25 boards, if the failing boards start working at 2MHz, you can reduce the clock for all the boards to be on the safer side.

Regards,
Sanjeev
