We have found an MSP430F2617 microcontroller with faulty bits in RAM. After a lot of debugging, it appears that the faulty bits are influenced by the values of their neighboring bits (as described below).
The batch is the following:
76ARXZTG4
In detail, the faulty bit is at address 0x1A6A, where bit5 does not retain the value written to it:
After writing 0x10, we read 0x30
After writing 0x20, we read 0x00
... and so on. It appears that bit5 copies the value of bit4. We found exceptions to this rule, which apparently depend on the adjacent word. There is also a faulty bit at address 0x1A6C with similar behavior.
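To make the pattern concrete, here is a small host-side C model of what we see. It is only a sketch of the observations, not a description of the silicon: it assumes bit5 always takes the value of bit4 on a write and ignores the exceptions mentioned above. The name `faulty_store` is made up for illustration.

```c
#include <stdint.h>

/* Hypothetical model of the faulty byte: returns the value that would be
 * read back after writing `value`, assuming bit5 always copies bit4.
 * The exceptions tied to the adjacent word are not modeled. */
uint8_t faulty_store(uint8_t value)
{
    uint8_t bit4 = (uint8_t)((value >> 4) & 1u);

    /* Clear bit5, then set it to whatever bit4 holds. */
    return (uint8_t)((value & (uint8_t)~0x20u) | (uint8_t)(bit4 << 5));
}
```

With this model, writing 0x10 reads back as 0x30 and writing 0x20 reads back as 0x00, matching the two cases listed above.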
While testing code that should detect the fault, we came across even stranger behavior. The following assembly code writes 0x5555 to address 0x1A6A (the address held in p) and then compares the content of 0x1A6A with 0x5555 to check retention.
\ 000008 1D42.... MOV.W &??p, R13
\ 00000C 8D930000 CMP.W #0x0, 0(R13)
\ 000010 BD4055550000 MOV.W #0x5555, 0(R13)
\ 00001C BD9055550000 CMP.W #0x5555, 0(R13)
Watching address 0x1A6A in the debugger, we see that after the write instruction the value stored at 0x1A6A is 0x7555 rather than 0x5555, because of the faulty bit. We expected the subsequent comparison to catch this, but the compare instruction reports equality, as if 0x5555 were stored at 0x1A6A. Is the value being fetched from a buffer rather than from the RAM itself?
We found a workaround: inserting a dummy memory access (a write to address 0x1100) between the write and compare instructions:
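For reference, the bit that differs between the written and the stored word can be isolated with an XOR. The helper below is a minimal sketch with a made-up name, `differing_bits`; it only does the arithmetic and says nothing about the cause.

```c
#include <stdint.h>

/* Returns a mask with a 1 in every bit position where the value read
 * back differs from the value written. */
uint16_t differing_bits(uint16_t written, uint16_t read_back)
{
    return (uint16_t)(written ^ read_back);
}
```

For 0x5555 written and 0x7555 stored, the mask is 0x2000, i.e. bit 13 of the word. Since the MSP430 is little-endian, that is bit5 of the high byte, which lives at 0x1A6B.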
\ 000008 1D42.... MOV.W &??p, R13
\ 00000C 8D930000 CMP.W #0x0, 0(R13)
\ 000010 BD4055550000 MOV.W #0x5555, 0(R13)
\ 000016 B24005000011 MOV.W #0x5, &0x1100
\ 00001C BD9055550000 CMP.W #0x5555, 0(R13)
With this code the failure is detected (the comparison of 0x5555 with the content of address 0x1A6A reports a mismatch), although it is logically equivalent to the previous code. Why does the first version fail to detect the fault?
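In C terms, the sequence that did detect the fault looks roughly like the sketch below. The function and parameter names are made up; on the target, `cell` would point to 0x1A6A and `scratch` to 0x1100. We do not know why the intervening access matters, so the code only mirrors the order of accesses and is not a validated RAM test.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical retention check: write a pattern, touch an unrelated
 * location, then read the pattern back. `volatile` keeps the compiler
 * from merging or dropping any of the three accesses. */
bool word_retains(volatile uint16_t *cell,
                  volatile uint16_t *scratch,
                  uint16_t pattern)
{
    *cell = pattern;          /* write the test pattern            */
    *scratch = 0x0005u;       /* dummy access, as in the listing   */
    return *cell == pattern;  /* true if the cell kept the pattern */
}
```

On healthy memory this returns true for any pattern; on the faulty word we would expect it to return false for 0x5555.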
I would appreciate it if you could answer these questions:
1. Have you seen RAM faults of this kind in this batch or in other batches? Do you have an estimate of the percentage of faulty RAMs shipped in the MSP430F2617 line?
2. Can you list the tests you run on the RAMs and their coverage percentage?
3. Can you explain the behavior described, especially why the first code does not detect the fault?
Thanks in advance