Hello,
Sorry about the long post, but the problem is quite complicated...
We have discovered that a VIM RAM vector gets corrupted during or after the VIM RAM parity test, but only when the "moon is properly aligned towards Mars when doing the test"...
1. It does not matter whether the test is the SafeTI test (SL_SelfTest_VIM( VIM_SRAM_PARITY_TEST )) or the HalCoGen test (vimParityCheck()).
2. The behavior can be reproduced in both IAR and CCStudio when running the code from main() in a while loop, so OS activity or the real application does not matter.
3. This is really timing and configuration sensitive. For example, the attached CCStudio project does not fail if the start-up VIM PBIST and VIM parity test are not enabled (I haven't tested those individually). With the current project it always fails after 571 tests (u32TestCnt == 571) in main().
4. The vector under test is corrupted. It looks like at least the lowest byte is always turned to either 0x00 or 0x01 (weirdly, IAR shows that only the lowest byte is altered, but CCStudio shows that the whole vector is changed). It does not matter whether the tested vector is 0xFFF82000 or 0xFFF82008.
5. It looks like you also need some other interrupt activity/sources (like RTI) arriving while the test is being run with interrupts disabled. Without that I couldn't crash our IAR main-loop tester either.
Other notices:
a) The VIM_FBPARERR function gets called a lot: on almost every test, but not exactly every time (typically PAR_FLG == 0, so this is a "ghost call"). It gets called even when the setup is such that VIM RAM never gets corrupted
- this is also quite weird
b) Sometimes only the VIM vector gets corrupted, and if you fix it in the "after" branch everything is OK. BUT sometimes it also triggers VIM_FBPARERR with FLG=1, and in that case you also get an ESM error like this
status (984599651|1|1|1007999222)<CR><LF>
after: VIM_RAM differs: 0x1 vs 0x53bec -- 1007999431<CR><LF> // the last number is the current test number
VIM_FBPARERR: Fixing ch: 2 --- 1007999431<CR><LF> // the last number is the current test number, so FLG=1 happened after the test
ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF> // this is ESM low interrupt handler
ESM: activated failsafe: G: 1 ch: 15<CR><LF>
status (1002149607|2|2|1025999160)<CR><LF>
And sometimes you only get the vector corruption, which does not make sense. Our print buffer is very big, and it has been tested so thoroughly that I am 99.999999% sure no prints are lost in this case
status (2054919|0|0|2099865)<CR><LF>
after: VIM_RAM differs: 0x0 vs 0x53bec -- 2099909<CR><LF>
status (2347396|1|0|2399839)<CR><LF>
Our IAR project main() looks similar to the CCStudio one (I decided to do OS-independent testing, but exactly the same behavior can be seen at run time with the OS active):
/////////////////// CODE /////////////////
#define vimRAMLoc ((volatile uint32 *)0xFFF82008U)
uint32 u32Address = *vimRAMLoc;
boolean bFail = FALSE;
uint32 u32Time = 0U;
while( bFail == FALSE )
{
    u32Temppi++;
    if( *vimRAMLoc != u32Address )
    {
        DBG_PRINT( "entry : VIM_RAM loc differs: 0x%x vs 0x%x\r\n", *vimRAMLoc, u32Address );
    }
    SL_VIM_FailInfo failInfoVIM = { .stResult = ST_FAIL }; /* error injection test may not set this at all */ /*lint !e785 */ /* only one instance initialized */
    _disable_IRQ_interrupt_();
#if 0
    boolean bRetVal = SL_SelfTest_VIM( VIM_SRAM_PARITY_TEST, &failInfoVIM );
    SL_SelfTest_Result failInfo = failInfoVIM.stResult;
#else
    vimParityCheck();
    boolean bRetVal = TRUE;
    SL_SelfTest_Result failInfo = ST_PASS;
#endif
    if( *vimRAMLoc != u32Address )
    {
        DBG_PRINT( "after: VIM_RAM differs: 0x%x vs 0x%x -- %u\r\n", *vimRAMLoc, u32Address, u32Temppi );
        //bFail = TRUE; // force stop
        u32VIMRAM_corr++;
        *vimRAMLoc = u32Address;
    }
    if( (!bRetVal) || (failInfo != ST_PASS) )
    {
        DBG_PRINT( "VIM test fail: %u\r\n", failInfoVIM.failInfo );
    }
    _enable_interrupt_();
    if( HAL_u32TimeGet()-u32Time > 60U*1000U*1000U )
    {
        u32Time = HAL_u32TimeGet();
        extern uint32 u32Calls;
        extern uint32 u32Fixes;
        DBG_PRINT( "status (%u|%u|%u|%u)\r\n",u32Calls, u32VIMRAM_corr, u32Fixes, u32Temppi );
    }
}
while( 1 );
////////////////// CODE ENDS /////////////////////
CCStudio project which fails on the Hercules RM48 MCU demo kit (TMDSRM48HDK):
- This project uses purely HalCoGen files, with zero manual modifications to anything
- In this project the interrupts are not disabled "properly" (the I-bit state is not saved on disable and restored on enable), but since the interrupt disable & enable is used in only 2 places in this CCStudio project, that is OK: IRQs cannot get enabled early through a nested disable-disable-enable-enable pattern.
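For comparison, a nesting-safe critical section saves the I-bit state on entry and restores it on exit. Below is a minimal host-side sketch of that idea, using a simulated CPSR variable. The names `sim_cpsr`, `crit_enter()`, `crit_exit()` and `irq_masked()` are hypothetical and are not HalCoGen or SafeTI functions; on the real Cortex-R4 the status register would be read and modified with the corresponding CPU instructions or compiler intrinsics instead.

```c
#include <stdint.h>

#define CPSR_I_BIT 0x80u  /* ARM CPSR bit 7: IRQs are masked when set */

/* Simulated CPSR; on target this would be the real status register. */
static uint32_t sim_cpsr = 0u;

/* Mask IRQs and return the previous I-bit state. */
static uint32_t crit_enter(void)
{
    uint32_t prev = sim_cpsr & CPSR_I_BIT;
    sim_cpsr |= CPSR_I_BIT;
    return prev;
}

/* Restore the I-bit state captured by the matching crit_enter(). */
static void crit_exit(uint32_t prev)
{
    sim_cpsr = (sim_cpsr & ~CPSR_I_BIT) | prev;
}

/* Returns 1 if IRQs are currently masked in the simulation. */
static int irq_masked(void)
{
    return (sim_cpsr & CPSR_I_BIT) != 0u;
}
```

With this pattern an inner `crit_exit()` leaves IRQs masked when the outer section had already masked them, so nesting is safe without counting enable/disable pairs by hand.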
6886.VIM_RAM_TEST.zip
Please note that the "debug printing" is also radically simplified here. It triggers one "FOO\r\n" print when initializing the debug prints, and that is enough to get the failure. If you request more printing, it always sends that same FOO\r\n regardless of the arguments you give. Printing is done using DMA. The code enters the "after test RAM vector corruption" branch when 571 tests have been made, so the initial printing setup, which also starts the DMA, cannot be what alters the VIM RAM content, since 570 tests can be made without corruption....
I guess that in the CCStudio code there is zero possibility that the application would write into VIM RAM; in our real application there is a 0.00001% chance.
- In IAR I have checked with a Segger watchpoint that no one writes to that VIM RAM vector. (The watchpoint triggers after the test, when the vector is corrected back inside the if-statement where the corruption has already been detected. That proves that the watchpoint works, and it should also prove that our real IAR application does not write into that VIM RAM area either.)
Also, the VIM RAM test is protected by interrupt disable/enable, so nothing else could write into that location. The only possibility would be DMA, but since we have only 1 channel in use, it writes to the UART, and you can see the text on the UART, it is not possible that DMA would write into VIM RAM (I don't know / haven't tested whether the watchpoint is capable of seeing DMA writes too)...
In our real IAR application and in the main() test loop, the probability of a VIM RAM failure depends radically on the debug print activity: the more often you print, the more VIM RAM corruption errors you get. The RTI period also seems to have an effect: a 1 ms period gives fewer errors than a 100 us period, but then a 10 us period looks to give fewer than 100 us...
- With a 5 s test interval we can run the real application code from hours to multiple days until the corruption happens, seemingly at random from 4 h to a week, which strongly indicates that the error requires some magic event & timing to coincide before it turns into VIM RAM corruption...
Here I have printed the status every 60 s in the main while loop to illustrate that only a few errors occurred:
status (1546199563|2|2|1583998758)<CR><LF>
The format is: total number of FB_PARERR calls | RAM corruptions detected after the test | FB_PARERR calls with FLG == 1 | tests made
- So those FLG == 0 calls have been triggered on almost every test, RAM has been corrupted only 2 times, and both times a FLG=1 call came as well
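If someone wants to post-process a captured UART log, the status line format above can be parsed on the host side roughly like this. This is only a sketch: `vim_status_t` and `parse_status()` are hypothetical names introduced here, and it assumes the counters fit in a 32-bit `unsigned int`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Counters carried by one "status (...)" line, in the order given above. */
typedef struct {
    unsigned int fbparerr_calls;   /* total FB_PARERR calls */
    unsigned int ram_corruptions;  /* corruptions detected after the test */
    unsigned int fbparerr_flg1;    /* FB_PARERR calls with FLG == 1 */
    unsigned int tests_made;       /* tests made */
} vim_status_t;

/* Parse one status line; returns 1 on success, 0 for any other line. */
static int parse_status(const char *line, vim_status_t *out)
{
    assert(line != NULL && out != NULL);
    return sscanf(line, "status (%u|%u|%u|%u)",
                  &out->fbparerr_calls, &out->ram_corruptions,
                  &out->fbparerr_flg1, &out->tests_made) == 4;
}
```

Lines such as "after: VIM_RAM differs: ..." are rejected by the function, so it can be run over the whole log to pull out only the counters.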
Here is the same status printing in the same main while loop, but printed every 1 s; the ratio of errors is much greater
status (17815128|1155|29|780321957)<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780322779<CR><LF>
status (17819918|1156|29|780531940)<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780532761<CR><LF>
status (17824708|1157|29|780741923)<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780742218<CR><LF>
status (17829498|1158|29|780951906)<CR><LF>
after: VIM_RAM differs: 0x51301 vs 0x513f8 -- 780952137<CR><LF>
VIM_FBPARERR: Fixing ch: 2 --- 780952137<CR><LF> // this does not check whether the vector value is valid or not; it just writes the value from the backed-up table to the given vector
ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF>
ESM: activated failsafe: G: 1 ch: 15<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780953681<CR><LF>
status (17834280|1160|30|781161858)<CR><LF> // note that it has now turned from 29 to 30, since we also got a FLG=1 FBPARERR call
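To see which bytes of a vector actually changed in "after:" lines like the ones above, a small helper such as the following can be used. It is just an illustration; `changed_bytes()` is a hypothetical name and not part of the test project.

```c
#include <stdint.h>

/* Return a 4-bit mask: bit n is set when byte n (0 = lowest) differs. */
static unsigned changed_bytes(uint32_t expected, uint32_t actual)
{
    uint32_t diff = expected ^ actual;
    unsigned mask = 0u;

    for (unsigned n = 0u; n < 4u; n++) {
        if ((diff >> (8u * n)) & 0xFFu) {
            mask |= 1u << n;
        }
    }
    return mask;
}
```

For the pair 0x513f8 (expected) vs 0x51300 (read back) this returns 0x1, i.e. only the lowest byte differs, while for the earlier pair 0x53bec vs 0x1 it returns 0x7, i.e. the three lowest bytes differ.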
During the experiments I also noticed the following:
- If, after VIM init and before main(), you set the "reserved" vector under test to 0x1 or 0x0 (at least those values) instead of "phantomInterrupt", then the SafeTI VIM RAM test starts to fail occasionally, and the reason for the failure is that the ESM channel has not become active (I didn't test this with the HalCoGen test).
status (7892|0|0|3367759)<CR><LF>
status (9473|0|0|3579331)<CR><LF>
after: VIM_RAM differs: 0x1 vs 0x0 -- 3579770<CR><LF> // here we have manually set the vector to 0x0, and after the test it has changed to 0x1, so just fix it back
status (11046|1|0|3790893)<CR><LF>
status (11046|1|0|4000888)<CR><LF>
VIM test fail: 5<CR><LF> // now the actual VIM test failed after 4 million tests had been made; 5 == the ESM channel was not activated during the test (the branch commented /* Check if ESM group1 channel 15 is not flagged */ in sl_selftest.c). There were more failures after that, so this wasn't the only one
VIM_FBPARERR: Fixing ch: 2 --- 4001073<CR><LF> // no "after" print, so the RAM content is OK, but we went into the real FB_PARERR handler after the SafeTI test (we have our own function which prints this only if FLG=1)
ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF> // and we also got the ESM error
ESM: activated failsafe: G: 1 ch: 15<CR><LF>
status (12622|1|1|4212427)<CR><LF>
status (14201|1|1|4424000)<CR><LF>
So in this case the RAM content is not corrupted, but FLG=1 was still set, yet the SafeTI test couldn't see the ESM error...
Initially we noticed this problem because our test that checks the VIM RAM content (CRC over VIM RAM) failed very occasionally. We then reduced the tests so that we had only the VIM parity test & the VIM RAM content test and shortened the test interval -> it started to fail more often -> then we started to print the content of VIM RAM on CRC failure -> noticed that the 2nd vector had a wrong value -> added a check before & after every test that the vector has the proper value. After that I spent ~4 days testing different combinations with IAR, and lately built the CCStudio project from scratch and worked out what needs to be done there to get it to fail as well, and now I have it...
I still cannot say accurately why this fails. Basically the start-up VIM PBIST & RTI & DMA & some DMA activity are needed to get the failure; if you remove one item you most likely won't get any failures, but in our real application every element is in use. To me this looks like a CPU problem which cannot be prevented...
We need to understand what happens and WHY. I guess that removing the VIM RAM parity test (it does not give any FIT benefit) from the runtime tests removes the problem. But that only removes the symptom, not the problem behind it. Can this same problem also appear in some other way? If yes, then our VIM RAM CRC test fails again and the system trips...
- When I implemented the VIM_FBPARERR handler to use the SafeTI ESM vectors and restore the proper content (vectors are modified after vimInit()), I noticed those "ghost" FLG == 0 calls, but at the time decided to ignore them since the occurrence was rather small and we had the FLG == 0 guard to prevent any processing, so effectively they only consumed a minor amount of CPU time. But now I think that those FLG == 0 calls are a symptom of a much more critical problem, and they exist in the pure HalCoGen project as well when doing HalCoGen's VIM RAM parity test...
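To make the FLG == 0 guard and the restore-from-backup behavior concrete, here is a minimal host-side model of such a handler. It is only a sketch under assumptions: `sim_vim_ram`, `sim_backup`, `sim_fbparerr()` and the two counters are hypothetical stand-ins (the counters play the role of the first and third fields of the status line), the table has only 4 entries, and on real hardware the flag and the failing index would be read from the VIM parity registers rather than passed as arguments.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_VECTORS 4u  /* small table, for the simulation only */

static uint32_t sim_vim_ram[SIM_VECTORS];        /* stands in for VIM RAM */
static const uint32_t sim_backup[SIM_VECTORS] =  /* known-good copy */
    { 0x1000u, 0x2000u, 0x3000u, 0x4000u };

static uint32_t calls_total; /* every handler entry */
static uint32_t calls_fixed; /* entries where the flag was set */

/* Hypothetical fallback parity-error handler. */
static void sim_fbparerr(uint32_t par_flg, size_t index)
{
    calls_total++;
    if (par_flg == 0u) {
        return; /* "ghost" call: count it, but do not touch VIM RAM */
    }
    assert(index < SIM_VECTORS);
    /* Restore from the backup without checking the current value. */
    sim_vim_ram[index] = sim_backup[index];
    calls_fixed++;
}
```

In this model a ghost call only increments `calls_total`, which is why a large first counter next to a small third counter in the status line points at FLG == 0 entries.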
Question 1: Is there some error/problem in the CPU's VIM RAM handling which causes corruption during parity testing, and which should be documented in the errata?
Question 2: Why does the SafeTI test start to fail when the vector is not "phantomInterrupt"? Or rather, why does the ESM channel not always react in that situation (the SafeTI test itself looks to work properly, since it signals an error)?
- With the "phantomInterrupt" vector in use in these accelerated experiments, I haven't received any SafeTI failures. (In our real application we have seen 2 times in total that the VIM RAM parity test failed, but since the test by default does not say why it failed, we cannot know the reason. Most likely either nERROR was activated due to a speculative fetch, or the same "ESM channel not activated" problem happened. I added failure reasons to SafeTI but haven't got that error since.)
Question 3: Why do those FBPARERR calls come with FLG=0? They shouldn't come at all, since interrupts are disabled while the test is run and the test acknowledges the FLG away. And since they do come, why don't they come on _every_ test?
The screenshot from CCStudio was lost, here it is again: see how 0xFFF82000 has changed to 0x1 inside the "after check" if-branch
I had a couple of minor mistakes: the DMA actually didn't get kicked at all with CCStudio, and despite that it went (always once) into the error. So basically I can rule out that it is "the DMA" which causes something...
Then I made a slight code change (moved the before-check inside the interrupt-disabled section) -> it didn't go into the error anymore -> really timing critical.
Reverted that change -> it went into the error again...
Then I changed/fixed the printing a bit (still didn't start any DMA) -> it didn't go into the error -> really timing critical...
Overall I then made a couple of changes, and now it goes into that error consistently and multiple times. (I commented that while(1) away to gather multiple errors automatically -> errors can be seen by putting a breakpoint into the error branch (or somewhere else), waiting for it to re-trigger, and then reading the number of errors from the u32VIMRAM_corr variable.)
See the number of errors vs. tests performed in this run: 11 out of 4.8 million test runs
Here is a more robust project:
6170.VIM_RAM_TEST.zip
Great that you were able to duplicate the issue.
Quick questions:
- Did you use the .zip package from the 1st post or the later one (4th post)? You should use the later .zip package; it should be more robust in making the error appear, whereas in the 1st package the compiler version will most likely have an effect. My goal was to provide a package which fails "consistently": the 2nd package should take seconds to reach the first error, and subsequent errors should come less than 10 s apart.
The later .zip package should consistently reach this point (I haven't tested/looked at the actual UART pin on the demo board, since I was happy to get the BTC interrupt):
#pragma WEAK(dmaGroupANotification)
void dmaGroupANotification(dmaInterrupt_t inttype, uint32 channel)
{
/* enter user code between the USER CODE BEGIN and USER CODE END. */
/* USER CODE BEGIN (54) */
#ifdef DEBUG_MON
#include "DMON.h"
    if( inttype == BTC )
    {
        if( channel == (uint32)DMA_CH_DBG_TX )
        {
            DMON_vPacketSent(); // here
        }
    }
#endif
Regarding stack corruption:
- Good idea, I have to check that more closely too, but our real application has stack monitors in place for every CPU stack and for every OS task stack. The monitor is not real time but runs periodically, and it has a 20% margin (or at least nn stack items) relative to the actual stack size, so it should warn/trigger if the IRQ, FIQ or abort stack has got near its real size.
- Interrupts shall be disabled during the VIM parity test. The after-check (and also the before-check in the 2nd .zip) is inside the critical section, so the only item which should run is the VIM parity tester, and while it runs the VIM shall have interrupt sources so that it can register incoming interrupts.
I am not familiar with CCStudio and couldn't find how/where the IRQ stack is located from the linker output, but when inside dmaBTCAInterrupt->dmaGroupANotification->DMON_vPacketSent->vStartDmaTransfer the SP is 0x08001278, and in the memory browser there are zeroes down to 0x08001000 (assuming that the sys/user mode stack starts from there)
The .map file does not show what is before 0x08001500 (most likely the stacks), so since 0x08001278->0x08001000 is all zeros I assume the stacks are OK in this "demo" as well
00006f74 00000008 (__TI_handler_table)
00006f7c 00000010 (__TI_cinit_table)
.bss 0 08001500 000007f8 UNINITIALIZED
08001500 000007d0 DMON_Main.obj (.bss:au8DataBuf)
08001cd0 00000028 sci.obj (.bss:g_sciTransfer_t)
.data 0 08001cf8 0000002c UNINITIALIZED
08001cf8 0000000c rtsv7R4_T_le_v3D16_eabi.lib : exit.obj (.data:$O1$$)
08001d04 0000000c sys_main.obj (.data)
08001d10 00000009 DMON_Main.obj (.data)
08001d19 00000003 --HOLE--
08001d1c 00000008 rtsv7R4_T_le_v3D16_eabi.lib : _lock.obj (.data:$O1$$)
Then I found this in the HalCoGen files: the IRQ stack runs from 0x08001300 down to 0x08001200, so an SP of 0x08001278 inside the IRQ is clearly in the safe area --> this problem "cannot" be related to stack pointer corruption...
userSp .word 0x08000000+0x00001000
svcSp .word 0x08000000+0x00001000+0x00000100
fiqSp .word 0x08000000+0x00001000+0x00000100+0x00000100
irqSp .word 0x08000000+0x00001000+0x00000100+0x00000100+0x00000100
abortSp .word 0x08000000+0x00001000+0x00000100+0x00000100+0x00000100+0x00000100
undefSp .word 0x08000000+0x00001000+0x00000100+0x00000100+0x00000100+0x00000100+0x00000100
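The address arithmetic behind that conclusion can be checked with a few lines of host-side C. This is only a sketch: the constant names and `sp_in_stack()` are hypothetical, and it assumes the mode stacks are laid out back to back and grow downwards, so that the IRQ stack occupies the region between the FIQ and IRQ initial stack pointers.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_BASE 0x08000000u

/* Initial (top) stack pointers as listed above; stacks grow downwards. */
static const uint32_t user_sp = STACK_BASE + 0x1000u;
static const uint32_t svc_sp  = STACK_BASE + 0x1000u + 0x100u;
static const uint32_t fiq_sp  = STACK_BASE + 0x1000u + 0x100u + 0x100u;
static const uint32_t irq_sp  = STACK_BASE + 0x1000u + 0x100u + 0x100u + 0x100u;

/* Returns 1 when sp lies inside [bottom, top] (both ends inclusive). */
static int sp_in_stack(uint32_t sp, uint32_t bottom, uint32_t top)
{
    assert(bottom < top);
    return (sp >= bottom) && (sp <= top);
}
```

Calling `sp_in_stack(0x08001278u, fiq_sp, irq_sp)` returns 1, which matches the observation that the SP seen inside the DMA interrupt is within the IRQ stack region.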
Hello Jarkko,
Although we have not clearly nailed down which interrupt is causing the interference, we have identified a change to the code and the way interrupts are disabled/enabled that appears to resolve the issue. Essentially, we have to disable and re-enable by writing directly to the VIM as shown in the code snippet below.
uint32 reqmaskbackup0;
uint32 reqmaskbackup1;
uint32 reqmaskbackup2;
boolean bFail = FALSE;
while( bFail == FALSE )
{
    // _disable_IRQ_interrupt_();
    // Backup Interrupt Enable
    reqmaskbackup0 = vimREG->REQMASKSET0;
    reqmaskbackup1 = vimREG->REQMASKSET1;
    reqmaskbackup2 = vimREG->REQMASKSET2;
    // Clear Interrupt Enable
    vimREG->REQMASKCLR0 = vimREG->REQMASKCLR0;
    vimREG->REQMASKCLR1 = vimREG->REQMASKCLR1;
    vimREG->REQMASKCLR2 = vimREG->REQMASKCLR2;
    u32TestCnt++;
    if( VIMRAMLOC != u32ExpectedAddress )
    {
        while(1){};
    }
    vimParityCheck();
    if( VIMRAMLOC != u32ExpectedAddress )
    {
        //bFail = TRUE; // force stop
        u32VIMRAM_corr++;
        VIMRAMLOC = u32ExpectedAddress;
        while(1){};
    }
    // _enable_interrupt_();
    // Restore Interrupt Enable
    vimREG->REQMASKSET0 = reqmaskbackup0; // = vimREG->REQMASKSET0;
    vimREG->REQMASKSET1 = reqmaskbackup1; // = vimREG->REQMASKSET1;
    vimREG->REQMASKSET2 = reqmaskbackup2; // = vimREG->REQMASKSET2;
}
I have made this change to the sample project provided and have executed for 48 hours without failure.
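A note on the `vimREG->REQMASKCLR0 = vimREG->REQMASKCLR0;` lines, which can look like a no-op at first glance. The simplified host-side model below assumes the usual set/clear register pair behavior: reading either register returns the current enable mask, writing 1s to the SET register enables channels, and writing 1s to the CLR register disables them. That assumption should be confirmed against the device TRM; the `sim_*` names are hypothetical and only exist for this illustration.

```c
#include <stdint.h>

/* Simulated enable state behind one REQMASKSET/REQMASKCLR pair. */
static uint32_t sim_enabled;

static uint32_t sim_read_set(void)    { return sim_enabled; }
static uint32_t sim_read_clr(void)    { return sim_enabled; }
static void sim_write_set(uint32_t v) { sim_enabled |= v; }   /* 1 = enable  */
static void sim_write_clr(uint32_t v) { sim_enabled &= ~v; }  /* 1 = disable */

/* Disable everything currently enabled and return the previous mask. */
static uint32_t sim_mask_all(void)
{
    uint32_t backup = sim_read_set();
    sim_write_clr(sim_read_clr());  /* reads back the enabled bits, clears them */
    return backup;
}

/* Re-enable exactly the channels that were enabled before. */
static void sim_unmask(uint32_t backup)
{
    sim_write_set(backup);
}
```

Under that assumption, writing the CLR register with its own read-back value disables every enabled channel, and writing the saved mask to the SET register afterwards restores the original configuration.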
Hi and great,
Correct me if I am wrong, but wouldn't that experiment show that there is some unidentified problem in the VIM peripheral, since both methods prevent interrupts but using the CPU core's I-bit causes a VIM problem while doing the parity test for the VIM?
Would that experiment also rule out a coding error / stack corruption / wrong return from IRQ?
With that approach, did you get any VIM_FB_PAR_ERR calls with FLG == 0?
Just to clarify, did you leave the other debug-printing critical section using _enable_interrupt_()? If yes, that would narrow the problem down to just the parity test, so the regular core I-bit can be used without problems anywhere else (at least we haven't discovered any other VIM RAM corruption in our system, or any other problems with regular I-bit usage)?
So one way to get rid of this problem could be the following pseudo-code (using the I-bit in other places of the code is quite mandatory, for example due to OS restrictions: we are using a certified OS, so we cannot alter its critical-section entry/exit methods):
1. Enter generic SafeTI test harness (the whole real harness is not as simple as the steps below :))
2. CPU I-bit disable
3. if test == VIM RAM PARITY
3a. backup & disable VIM req masks
4. Do 1 SafeTI test
5. if test == VIM RAM PARITY
5a. restore VIM req masks
6. CPU I-bit enable
7. Exit generic SafeTI test-harness
Basically, just handling the VIM RAM parity test a little bit differently from any other SafeTI test would most likely provide the easiest/best solution, also in terms of code modification?
I quickly tried that pseudo-code approach in the same CCStudio project (left your commented-out IRQ enable/disable lines uncommented, just moved the I-bit enable after the restore) and didn't receive any errors, not even the VIM_FB_PAR_ERR with FLG == 0 side effect...
So it looks like the FLG == 0 call is also a real side effect, which indicates that everything is not OK in the setup if it comes?
So it looks like your solution (& using the pseudo-code above in real code) would provide a way to eliminate the problem & side effects completely, so there is no need for manual VIM RAM content checking & repairing after the test, or for handling/ignoring the ESM IRQ after that if the vector content is OK...
BUT unless you are able to dig out the root cause, how can we be sure that the problem has really disappeared and won't pop up after running for 1 month? I am of course pretty confident in this fix, since to me it initially looked like incoming IRQs and/or DMA transfers (maybe several are needed & maybe simultaneously) during the test were the cause of the failure. With the fix the VIM most likely works a bit differently, since it does not register those IRQs immediately when they come; it registers them when the masks are re-enabled, and that most likely changes things...
Are you going to make an errata entry or something regarding this issue? (It would help to properly annotate the code, since I doubt that these e2e links work for "eternity".)
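The ordering in steps 2 to 6 can be sketched as a host-side model that only records the call order. All names here (`test_id_t`, `run_one_test()` and the stub functions) are hypothetical placeholders; the real harness would call the actual I-bit intrinsics, the VIM mask backup/restore code and the SafeTI test in their place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef enum { TEST_OTHER, TEST_VIM_RAM_PARITY } test_id_t;

/* Call trace, one character per step, so the order can be inspected. */
static char trace[16];
static size_t trace_len;

static void step(char c)
{
    assert(trace_len < sizeof trace - 1u);
    trace[trace_len++] = c;
    trace[trace_len] = '\0';
}

/* Stubs standing in for the real operations. */
static void cpu_irq_disable(void)            { step('d'); } /* step 2  */
static void vim_masks_backup_and_clear(void) { step('m'); } /* step 3a */
static void do_selftest(test_id_t id)        { (void)id; step('t'); } /* step 4 */
static void vim_masks_restore(void)          { step('r'); } /* step 5a */
static void cpu_irq_enable(void)             { step('e'); } /* step 6  */

/* Run one test; only the VIM RAM parity test gets the extra VIM masking. */
static void run_one_test(test_id_t id)
{
    trace_len = 0u;
    trace[0] = '\0';

    cpu_irq_disable();
    if (id == TEST_VIM_RAM_PARITY) {
        vim_masks_backup_and_clear();
    }
    do_selftest(id);
    if (id == TEST_VIM_RAM_PARITY) {
        vim_masks_restore();
    }
    cpu_irq_enable();
}
```

The point of the ordering is that the VIM masks are restored while the I-bit is still set, so the I-bit enable remains the last step, as in the pseudo-code.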