RM48L952: Speculative fetch error appears rarely during debugging even though the whole memory has been filled once

Jarkko Silvasti

Part Number: RM48L952
Other Parts Discussed in Thread: SEGGER

Hi,

To ask the right question of Segger (or IAR), I would appreciate it if you could guess what kind of operation may put the CPU into a state where the ECC is corrupted, since I am pretty sure this is somehow caused by our development environment. I am basically repeating the same "compile, download and debug" pattern, and usually everything works, but very rarely this prefetch issue raises its head even though the flash has been filled once.

Question 1: If the debugger sometimes performed a whole chip erase for some reason, would the ECC be corrupted (does erase clear the ECC area)? If yes, could this unexpected whole chip erase be a potential reason for such behavior?

I am using the IAR IDE and Segger's J-Link and have encountered the speculative fetching problem a couple of times (3 or 4 to be exact, so not many compared to how many times I have downloaded code) after I filled the whole memory as instructed here (my colleague has also encountered this once, so this should not be a user or individual device-chain issue).
https://e2e.ti.com/support/microcontrollers/hercules/f/312/t/588269

After the one whole-memory fill I have always restored the image to its normal size, so in normal debugging the whole memory is not programmed again and again. I also use Segger's compare setting "use fastest method", so it should flash only the required sections, and typical programming takes a few seconds. Since this has happened only a couple of times and I download code a lot, the probability of this prefetch error is most likely less than one per mille, but when it hits it takes a while before you understand the root cause. That is why it is rather annoying and I would like to solve it.

The FUNC_ERR_ADD value has been the same every time as in the initial problem (0x135a98), when the CPU memory had never been filled. Our code is still only ~0x20000 long, so normal programming should not touch that problematic area, and if I understood correctly the ECC bits for those 2 quite separate memory locations are not adjacent, so a "minor mistake" in ECC writing should not corrupt the value either... I filled the image with 0xFF when I performed the full chip flashing, in case that matters. The problem is removed immediately when you erase the chip, enable binary fill and use IAR's flash loaders to flash the whole chip; after that you can return to "optimized" downloading.

Related to this:

- Question 2: In production you need to program the whole flash (since it is the first use of the CPU), so does one need to generate a filled image to be programmed, or configure the programming tool so that it fills to the end of the flash?

- Question 3: Firmware update (I haven't looked at that side yet; I know there is a library available which handles the programming including ECC): do you need to manually program 0xFF (or some other data) to the end of the last erased sector if the code does not fill it up, or does the library do this? The other option would be to use a filled binary in the firmware update as well...

- Question 4: If there is still a hidden ECC error somewhere in the unused flash area, will it always be triggered "immediately", or could it take weeks or months (or never) to pop up? Mine has never appeared in the start-up routines before main(); it looks like the error is always activated around when the OS is started and tasks start to run, which is still a good phase since you notice quite fast that nERROR is down and immediately know that there is a problem.

over 8 years ago

QJ Wang, over 8 years ago

TI Guru, 197446 points

Hello,

1. When a flash bank/sector is erased, the corresponding ECC area is also erased. ECC is not enabled by default, and the application must enable the CPU's ECC checking for accesses on the CPU's ATCM and BTCM interfaces. The CPU signals an ECC error via its Event bus; this signaling mechanism is also not enabled by default and must be enabled by the application. We will not see an ECC-related error at the erase/program stage as in another device (RM57). I am not sure whether the error is caused by the whole chip erase (bank1/2 and bank7).

2. The unused flash is filled, with its corresponding ECC, to deal with speculative accesses by the CPU. We don't have to fill all unused flash (which reduces the download time). I think it might be enough to fill the last used sector and the one after it.

3. The Flash API (Fapi_issueProgrammingCommand(...)) programs the data buffer and automatically generates and programs the ECC. After the application code is programmed, we can manually program the unused sector with 0xFFFFFFFF, and the Flash API will program its ECC to the ECC area.

4. If an unused flash location is speculatively fetched, we can get an uncorrectable ECC error. If the hidden ECC error (if any) is far away from the end of the code, it will not be triggered.
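
As a rough illustration of points 2 and 3, here is a minimal host-side sketch of padding an image with an erased-state fill value up to a chosen boundary before it is handed to a programming tool. The function name `pad_image` and its parameters are hypothetical, not part of any TI or IAR tool. The example numbers (a 0x24DBA-byte image, sector 4 ending at 0x3FFFF) are the ones reported later in this thread. The sketch only builds the data; generating and programming the ECC is still left to the flash loader or the Flash API.

```python
def pad_image(image: bytes, base: int, fill_end: int, fill: int = 0xFF) -> bytes:
    """Return `image` extended with `fill` bytes so that it covers [base, fill_end).

    `base` is the load address of the image and `fill_end` is the first
    address that should NOT be filled (for example the start of the next sector).
    """
    end = base + len(image)
    if end > fill_end:
        raise ValueError("image already extends past fill_end")
    return image + bytes([fill]) * (fill_end - end)


# Placeholder contents standing in for real code (assumption for illustration).
image = bytes(0x24DBA)

# Pad up to the end of sector 4, i.e. up to but not including 0x40000.
padded = pad_image(image, base=0x0, fill_end=0x40000)
print(hex(len(padded)))
```

Whether padding to the end of the last used sector is enough is exactly what the rest of this thread questions, so `fill_end` is left as a parameter rather than derived from the image size.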

Regards,

Jarkko Silvasti, over 8 years ago, in reply to QJ Wang

Expert 1395 points

Hi,

1. We have ECC checking activated "at an early stage" in start-up, as suggested in the CPU initialization document. So could a full chip erase be one potential cause of the speculative fetch error? (***, see also the bottom of the post)

QUESTION: Since in our case FUNC_ERR_ADD has been 0x135a98 every time (***, see the bottom of the post), would that give any hints? Our code is only ~0x20000 long (6 digits vs 5 digits), so our code should use at most sectors 0-4 of bank 0, maybe not even sector 4.

2. In this post you suggest that, to prevent the speculative fetch error, the whole flash (both banks) should be filled once (this solved our initial problem, and it took a really long time until the speculative fetch error hit again):
https://e2e.ti.com/support/microcontrollers/hercules/f/312/t/588269

QUESTION: It would be quite important to know whether it is the "whole flash" (as suggested in the previous thread) or only the "last used sector" and perhaps the "next one" as well. I do not understand why you would need to fill the "last used sector and the next one", since the "next one" should be a sector just like every other sector after it (in a normal CPU there is no need to erase that "next" sector since it is not programmed/used). If the "next one" is important, could you briefly tell why (is the answer in my question for item 4)? (***, see also the bottom of the post)

3. OK, so some "manual work" may be needed.
QUESTION: Will the API still automatically "fill" the last used (or programmed, I don't know the correct word for it) sector, or do you need to fill the "last used sector" manually? Example: I have code which is 0x15000 long; do I need to manually flash from 0x15000 to 0x1FFFF, or does the API do that automatically? (I haven't even looked at the API, since our development is not at that stage yet.) An automatic fill would require that you somehow tell the API "I am ready", which is why I guess you need to do the fill manually (I am just trying to find out what needs to be done to prevent this rather annoying feature from popping up in a real product).

4. I tried to ask about the "time frame" for that error. Based on your answer I understand that it could be 10 us, 1 s, 1 min, 1 week, 1 month or never, so it is pure luck which addresses are speculatively fetched?
QUESTION: Now you say that the distance of the ECC error location from the code matters for whether a speculative fetch error is even possible. Does the speculative fetching mechanism somehow monitor which area of code is actually used and try to read addresses near those areas?

*** In the previous thread (https://e2e.ti.com/support/microcontrollers/hercules/f/312/t/588269) you said about the FUNC_ERR_ADD register: "If the register value is 0x135a98, the UNC_ERR_ADD should be 0x25B53 (0x135a98 >> 3)."
After re-reading the register description (multiple times) I tend to somewhat agree with you, but keep reading, since I managed to reproduce the problem and that indicates something else.

"This register captures the full 32 bit incoming address when there is a bus parity error. It only captures address of 22:3 for multiple bit ECC errors"
I understood this originally (and again after the latest testing) to mean that we are talking about the whole FUNC_ERR_ADD register, not the UNC_ERR_ADD field which is inside the FUNC_ERR_ADD register (I am guessing there cannot be registers inside registers). That would mean a completely different thing: basically bits 0-2 are masked away from the erroneous address and the remaining bits are written as is into UNC_ERR_ADD (no shifting).

QUESTION: Could you verify whether 0x135a98 as the whole register value really means that the failure occurred at address 0x25B53, or whether the address is actually 0x135a98? This is really important information in our debug problem case in order to find the root cause. Keep reading and see also my experiment...

NOTE: I just managed to reproduce the issue with code of length 0x24DBA. I erased the flash and then programmed that code, and after a while SR3 = 0x00000080 (and there is also a single-bit error in SR1). Now FUNC_ERR_ADD is 0x139C10, and that /8 == 0x27382 in your interpretation (which would be in the same sector as the last part of the code, but beyond the actual code).
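
To make the two readings of FUNC_ERR_ADD concrete, here is a small sketch that applies both to the two register values seen in this thread. The function names are made up for illustration, and neither reading is confirmed by TI here; the code only does the arithmetic. Note that 0x135a98 >> 3 evaluates to 0x26B53, which differs by one digit from the 0x25B53 quoted from the previous thread.

```python
def shifted(reg_value: int) -> int:
    """Reading 1: treat the register value as needing a right shift by 3."""
    return reg_value >> 3


def masked(reg_value: int) -> int:
    """Reading 2: the register holds the byte address with bits 2:0 cleared."""
    return reg_value & ~0x7


# The two FUNC_ERR_ADD values reported in this thread.
for reg in (0x135A98, 0x139C10):
    print(f"{reg:#x}: shifted={shifted(reg):#x} masked={masked(reg):#x}")
```

Under reading 1 both addresses fall inside sector 4 (0x20000 to 0x3FFFF); under reading 2 they lie far beyond the code, above 0x130000.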

After I fill the code to 0x3FFFF (to the end of sector 4), this should help, since it would write that 0x27382 area... but I still receive this speculative fetch error. So this does not help, which has to mean that 0x27382 is not the correct address range. Here is the memory window view which proves that the memory is filled (I used 0xAA as fill to distinguish it from the erased state 0xFF). I don't know why it starts to show "garbage" after some 0xFFs in sector 5 (in the IAR memory viewer), which has not been touched after erase, and J-Link refuses to read beyond that address. IAR's memory viewer shows -- -- -- for many bytes after that, and some are 0xFF and some 0xFD (do the debuggers somehow know that this address is not valid, that the ECC is not set and that reading it could cause uncorrectable errors?). Both viewing methods (the IAR IDE and Segger's Jlink.exe command line) show similar results, except that Segger refuses to read from the point where IAR starts to show '--':
J-Link>mem32 0x3FF80 256
0003FF80 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0003FF90 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0003FFA0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0003FFB0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0003FFC0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0003FFD0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0003FFE0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0003FFF0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
00040000 = FFFFFFFF FFFFDFFF

Then I also filled the "next sector", so the fill goes up to 0x5FFFF. This does not help either; the speculative fetch error still comes (and yes, at the reset vector I manually acknowledge the previous ESM error with the debugger, and the code manages to reach main(). I have a bunch of SafeTI testing in start-up which also checks at function entries that the error pin is up, so the code would not enter main() if the CPU did not behave properly)...
J-Link>mem32 0x5FF80 256
0005FF80 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0005FF90 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0005FFA0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0005FFB0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0005FFC0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0005FFD0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0005FFE0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA
0005FFF0 = AAAAAAAA AAAAAAAA AAAAAAAA AAAAAAAA

Then I filled up to 0x13FFFF (6 digits) (and removed these pending ESM errors), and the speculative fetch error disappeared and the code runs again as it should.

QUESTION: This would mean that the UNC_ERR_ADD field of the FUNC_ERR_ADD register stores the offending address with just bits 0-2 masked away, so in pseudocode the address storing mechanism is the following:
F_UNC_ERR_ADD = failed_address & ~0x7
and the error has occurred in the 0x139C10 address range, which is sector 11 in bank 0?
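
For the sector question, here is a sketch of an address-to-sector lookup. The layout in it is an ASSUMPTION, not taken from the datasheet: bank 0 starts at address 0 with four 32 KiB sectors, followed by 128 KiB sectors up to 0x17FFFF. That is at least consistent with the boundaries used in this thread (sector 4 ending at 0x3FFFF, sector 5 ending at 0x5FFFF). The names are hypothetical, and the datasheet's flash memory map remains the authority.

```python
# ASSUMED bank 0 layout (verify against the device datasheet):
SMALL_SECTOR = 0x8000             # sectors 0-3: 32 KiB each
LARGE_SECTOR = 0x20000            # sectors 4 and up: 128 KiB each
LARGE_BASE = 4 * SMALL_SECTOR     # 0x20000, start of sector 4
BANK0_END = 0x180000              # first address past the assumed bank 0


def bank0_sector(addr: int) -> int:
    """Return the bank 0 sector number containing `addr` under the assumed layout."""
    if not 0 <= addr < BANK0_END:
        raise ValueError("address outside the assumed bank 0 range")
    if addr < LARGE_BASE:
        return addr // SMALL_SECTOR
    return 4 + (addr - LARGE_BASE) // LARGE_SECTOR


for addr in (0x3FFFF, 0x40000, 0x139C10, 0x13FFFF):
    print(f"{addr:#x} -> sector {bank0_sector(addr)}")
```

Under this assumed layout 0x139C10 lands in the sector spanning 0x120000 to 0x13FFFF, which the function numbers as 12 rather than 11. It is also the last sector covered by the fill to 0x13FFFF that made the error disappear.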

QUESTION: So based on my testing, it looks like your instructions in the previous thread are correct: the "whole flash" shall be filled once (in production), and in a firmware update all erased sectors must be filled. Speculative fetch reads from very far away from the actual code location, so the only option is practically to fill the whole chip at least once, as you originally suggested in the previous thread?

PS. Sorry about the long post, this was short until I managed to reproduce the issue :)
PPS. As for our rare debugging issue, based on this latest experience I am pretty sure that this "occasional prefetch error comes back" issue must be related to a full chip erase (either the debugger accidentally erases the flash sometimes, or I have done it manually for some reason. I have no reason to manually erase the flash when debugging code :) and cannot even do that easily while not using IAR's flash loaders, but of course I may still have done it, especially because only now do I understand at least one way to reproduce this)... I need to keep watching whether it comes back and focus at that point on what I have done.

42Bastian, over 8 years ago

Expert 1430 points

Hi
I did not read everything in detail, just my two cents:
- Flash breakpoints may destroy ECC
- "unused Flash": Why not at least read all Flash once?
- Full chip erase: At least if you use J-Link to flash (disable Flash download in C-Spy options), you can choose to erase all or only parts of the flash
