Other Parts Discussed in Thread: HALCOGEN, CCSTUDIO, TMDSRM48HDK, SEGGER
Hello,
Sorry about the long post, but the problem is quite complicated...
We have discovered that a VIM RAM vector gets corrupted during/after the VIM RAM parity test when "the moon is properly aligned towards Mars" during the test...
1. It does not matter whether the test is the SafeTI test (SL_SelfTest_VIM( VIM_SRAM_PARITY_TEST )) or the HalCoGen test (vimParityCheck()).
2. The behavior can be reproduced both in IAR and CCStudio, and also when running the code from main() in a while loop, so OS activity / the real application does not matter.
3. This is really timing/configuration sensitive. For example, the attached CCStudio project does not fail if the start-up VIM PBIST and VIM parity test are not enabled (haven't tested those individually); with the current project it always fails after 571 tests (u32TestCnt == 571) in main().
4. The vector under test is corrupted; it looks like at least the lowest byte is always turned to either 0x00 or 0x01 (weirdly, IAR shows that only the lowest byte is altered but CCStudio shows that the whole vector is changed). It does not matter whether the tested vector is 0xFFF82000 or 0xFFF82008.
5. It looks like you also need some other interrupt activity/source (like RTI) to occur while the test runs with interrupts disabled - without it we couldn't crash our IAR main-loop tester either.
Other observations:
a) The VIM_FBPARERR function gets called a lot - almost on every test, but not exactly every time (typically PAR_FLG == 0, so this is a "ghost call"). It gets called even with a setup in which VIM RAM never corrupts
- this is also quite weird
b) Sometimes only the VIM vector gets corrupted, and if you fix it in the "after" branch everything is OK. But sometimes it also triggers VIM_FBPARERR with FLG=1, and in that case you also get an ESM error like this:
status (984599651|1|1|1007999222)<CR><LF>
after: VIM_RAM differs: 0x1 vs 0x53bec -- 1007999431<CR><LF> // the last number is the current test number
VIM_FBPARERR: Fixing ch: 2 --- 1007999431<CR><LF> // same test number, so the FLG=1 call happened after the test
ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF> // this is ESM low interrupt handler
ESM: activated failsafe: G: 1 ch: 15<CR><LF>
status (1002149607|2|2|1025999160)<CR><LF>
And sometimes you only get the vector corruption - this does not make sense. Our print buffer is very large and was thoroughly tested at the time, so I am 99.999999% sure that no prints are lost in this case:
status (2054919|0|0|2099865)<CR><LF>
after: VIM_RAM differs: 0x0 vs 0x53bec -- 2099909<CR><LF>
status (2347396|1|0|2399839)<CR><LF>
Our IAR project main() looks similar to the CCStudio one (we decided to do OS-independent testing, but exactly the same behavior can be seen at run time with the OS active):
/////////////////// CODE /////////////////
#define vimRAMLoc ((volatile uint32 *)0xFFF82008U) /* 2nd VIM RAM vector */

uint32 u32Address = *vimRAMLoc;   /* expected vector value, captured once */
boolean bFail = FALSE;
uint32 u32Time = 0U;

while( bFail == FALSE )
{
    u32Temppi++;                  /* test counter (file-scope) */
    if( *vimRAMLoc != u32Address )
    {
        DBG_PRINT( "entry : VIM_RAM loc differs: 0x%x vs 0x%x\r\n", *vimRAMLoc, u32Address );
    }
    SL_VIM_FailInfo failInfoVIM = { .stResult = ST_FAIL }; /* error injection test may not set this at all */ /*lint !e785 */ /* only one instance initialized */
    _disable_IRQ_interrupt_();
#if 0
    /* SafeTI variant */
    boolean bRetVal = SL_SelfTest_VIM( VIM_SRAM_PARITY_TEST, &failInfoVIM );
    SL_SelfTest_Result failInfo = failInfoVIM.stResult;
#else
    /* HalCoGen variant */
    vimParityCheck();
    boolean bRetVal = TRUE;
    SL_SelfTest_Result failInfo = ST_PASS;
#endif
    if( *vimRAMLoc != u32Address )
    {
        DBG_PRINT( "after: VIM_RAM differs: 0x%x vs 0x%x -- %u\r\n", *vimRAMLoc, u32Address, u32Temppi );
        //bFail = TRUE; // force stop
        u32VIMRAM_corr++;         /* corruption counter (file-scope) */
        *vimRAMLoc = u32Address;  /* restore the expected vector */
    }
    if( (!bRetVal) || (failInfo != ST_PASS) )
    {
        DBG_PRINT( "VIM test fail: %u\r\n", failInfoVIM.failInfo );
    }
    _enable_interrupt_();
    if( HAL_u32TimeGet() - u32Time > 60U*1000U*1000U )   /* every 60 s */
    {
        u32Time = HAL_u32TimeGet();
        extern uint32 u32Calls;   /* total FBPARERR calls */
        extern uint32 u32Fixes;   /* FLG==1 calls */
        DBG_PRINT( "status (%u|%u|%u|%u)\r\n", u32Calls, u32VIMRAM_corr, u32Fixes, u32Temppi );
    }
}
while( 1 );
////////////////// CODE ENDS /////////////////////
The CCStudio project which fails on the Hercules RM48 MCU demo kit (TMDSRM48HDK):
- This project uses purely HalCoGen files, with zero manual modifications to anything.
- In this project the interrupts are not disabled "properly" (the I-bit state is not preserved on disable and restored on enable), but since interrupt disable & enable is used in only 2 places in this CCStudio project that is OK: the IRQ cannot get enabled due to the nested disable-disable-enable-enable pattern.
6886.VIM_RAM_TEST.zip
Please note that the "debug printing" is also radically simplified here: it triggers one "FOO\r\n" print when initializing the debug prints - that is enough to get the failure. If you request more prints, it will always print that same "FOO\r\n" regardless of the arguments you give. Printing is done using DMA. And it enters the "after test RAM vector corruption" branch once 571 tests have been made, so the initial print "setup" which also starts the DMA cannot be what alters the VIM RAM content, since 570 tests can be made without corruption....
Guessing that in the CCStudio code there is zero possibility that the application would write into VIM RAM; in our real application there is a 0.00001% chance.
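Side note on the "properly" disabled interrupts mentioned above: the I-bit-preserving pattern is roughly the following. This is only a sketch with a plain variable standing in for the CPSR I bit so the logic can run anywhere; on the real Cortex-R target the reads/writes would be compiler intrinsics (their names are toolchain specific), and `irq_save`/`irq_restore` are my own placeholder names.

```c
#include <stdint.h>

#define CPSR_I_BIT 0x80u          /* 1 = IRQ disabled, as in the ARM CPSR */

static uint32_t sim_cpsr = 0u;    /* simulated status register (IRQ enabled) */

/* Disable IRQs and return the previous state for a later restore. */
static uint32_t irq_save(void)
{
    uint32_t prev = sim_cpsr;
    sim_cpsr |= CPSR_I_BIT;
    return prev;
}

/* Re-enable IRQs only if they were enabled when irq_save() was called. */
static void irq_restore(uint32_t prev)
{
    if ((prev & CPSR_I_BIT) == 0u)
    {
        sim_cpsr &= ~CPSR_I_BIT;
    }
}
```

With this shape, nested critical sections need no depth counter: an inner save/restore pair leaves IRQs disabled, and only the outermost restore re-enables them - which is why the plain disable-disable-enable-enable pattern in the CCStudio project happens to be safe too.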
- In IAR I have checked with a Segger watchpoint that nothing writes to that VIM RAM vector (the watchpoint triggers after the test when correcting the vector back inside the if statement where the corruption has already been detected - that proves the watchpoint works, and it should also prove that our real IAR application does not write into that VIM RAM area either).
Also, the VIM RAM test is protected by interrupt disable/enable, so nothing should be able to write into that register. The only other possibility would be DMA, but since we have only 1 channel in use and it writes to the UART - and you see the text in the UART - it is not possible that the DMA would write into VIM RAM (I don't know / haven't tested whether the watchpoint is capable of seeing DMA writes too)...
In our real IAR application and in the main() test loop, the probability of the VIM RAM failure depends radically on the debug print activity: the more often you print, the more VIM RAM corruption errors you get. The RTI period also seems to have an effect: a 1 ms period gives fewer errors than a 100 us period, but then 10 us looks to give fewer than 100 us...
- With a 5 s test interval we can run the real application code from hours to multiple days until the corruption happens, seemingly randomly, anywhere from 4 h to a week - which gives a strong indication that the error requires some magic event & timing to match in order to materialize as VIM RAM corruption...
Here I have printed the status every 60 s in the main while loop to illustrate that only some errors occurred:
status (1546199563|2|2|1583998758)<CR><LF>
The format is: total number of FB_PARERR calls | detected RAM corruptions after the test | FB_PARERR FLG==1 calls | tests made.
- So those FLG==0 calls have been triggered on almost every test; the RAM has corrupted only 2 times, and both times an FLG=1 call came as well.
Here is the same status printed in the same main while loop, but every 1 s; the error ratio is much greater:
status (17815128|1155|29|780321957)<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780322779<CR><LF>
status (17819918|1156|29|780531940)<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780532761<CR><LF>
status (17824708|1157|29|780741923)<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780742218<CR><LF>
status (17829498|1158|29|780951906)<CR><LF>
after: VIM_RAM differs: 0x51301 vs 0x513f8 -- 780952137<CR><LF>
VIM_FBPARERR: Fixing ch: 2 --- 780952137<CR><LF> // this does not check whether the vector value is valid or not; it just writes the value from the backup table to the given vector
ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF>
ESM: activated failsafe: G: 1 ch: 15<CR><LF>
after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780953681<CR><LF>
status (17834280|1160|30|781161858)<CR><LF> // note it now went from 29 to 30, since we also got an FLG=1 FBPARERR call
During the experiments I also noticed the following:
- If, after VIM init and before main(), you set the "reserved" vector under test to 0x1 or 0x0 (at least these values) instead of the "phantomInterrupt" handler, then the SafeTI VIM RAM test starts to fail occasionally, and the reason for the failure is that the ESM channel has not become active (didn't test that with the HalCoGen test).
status (7892|0|0|3367759)<CR><LF>
status (9473|0|0|3579331)<CR><LF>
after: VIM_RAM differs: 0x1 vs 0x0 -- 3579770<CR><LF> // Here we have manually set the vector to 0x0, so after the test it has changed to 0x1; just fix it back
status (11046|1|0|3790893)<CR><LF>
status (11046|1|0|4000888)<CR><LF>
VIM test fail: 5<CR><LF> // now the actual VIM test failed after 4 million tests; 5 == the ESM channel wasn't activated during the test (the /* Check if ESM group1 channel 15 is not flagged */ branch in sl_selftest.c) // there are more failures after that, so this wasn't the only one
VIM_FBPARERR: Fixing ch: 2 --- 4001073<CR><LF> // no "after" print, so the RAM content is OK, but it went into a real FB_PARERR handler after the SafeTI test // we have our own function which prints this only if FLG=1
ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF> // and got also ESM error
ESM: activated failsafe: G: 1 ch: 15<CR><LF>
status (12622|1|1|4212427)<CR><LF>
status (14201|1|1|4424000)<CR><LF>
So in this case the RAM content is not corrupted, but it still set FLG=1, and yet SafeTI couldn't see the ESM error...
We initially noticed this problem because a test of ours which checks the VIM RAM content (a CRC over VIM RAM) failed very occasionally. We then reduced the test set so that we had only the VIM parity test & the VIM RAM content test and sped up the test interval -> it started to fail more often -> then we started printing the content of VIM RAM on CRC failure -> noticed that the 2nd vector had the wrong value -> added a check before & after every test that the vector holds the proper value. After that I have spent ~4 days testing different combinations with IAR, and lately built the CCStudio project from scratch and evaluated what needs to be done there to make it fail too - and now I got it...
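For reference, the content check described above is of roughly the following shape. This is a minimal sketch, not our production code: the CRC-32 routine, the function names and the table size are my own placeholders, and on the target the "live" table would of course be the VIM RAM at 0xFFF82000.

```c
#include <stddef.h>
#include <stdint.h>

/* Plain bitwise CRC-32 (reflected, polynomial 0xEDB88320) over a word array. */
static uint32_t crc32_words(const uint32_t *words, size_t count)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < count; i++)
    {
        crc ^= words[i];
        for (int bit = 0; bit < 32; bit++)
        {
            crc = (crc >> 1) ^ ((crc & 1u) ? 0xEDB88320u : 0u);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Return the index of the first entry that differs from the backup copy,
 * or -1 when the live table still matches - this is the "print the content
 * on CRC failure" step that located the bad 2nd vector. */
static int find_corrupted_entry(const uint32_t *live, const uint32_t *backup,
                                size_t count)
{
    for (size_t i = 0; i < count; i++)
    {
        if (live[i] != backup[i])
        {
            return (int)i;
        }
    }
    return -1;
}
```

The CRC taken at init time is compared against a fresh CRC over the live table on every test round; since a single corrupted 32-bit word always changes the CRC-32, any hit then triggers the per-entry comparison to locate the bad vector.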
I still cannot say accurately why this fails. Basically, some combination of VIM PBIST at startup & RTI & some DMA activity is needed to get the failure; if you remove one item you most likely won't get any failures, but in our real application every element is in use. To me this looks like a CPU problem which cannot be prevented...
We would need to understand what happens and WHY. I'm guessing that removing the VIM RAM parity test (it does not give any FIT benefit) from the runtime tests removes the problem. But that only works around the issue, not the root cause behind it. Can this same problem also appear in some other way? If yes, then our VIM RAM CRC test fails again and the system trips...
- When I implemented the VIM_FBPARERR handler to use the SafeTI ESM vectors and restore the proper content (the vectors are modified after vimInit()), I noticed those "ghost" FLG==0 calls, but at the time I decided to ignore them, since their occurrence was rather rare and we had the FLG==0 guard to prevent any processing - so effectively they only consumed a minor amount of CPU time. But now I'm thinking those FLG=0 calls are a symptom of a much more critical problem, and those FLG==0 calls exist in a pure HalCoGen project too when running HalCoGen's VIM RAM parity test...
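To make the FLG guard concrete, our handler is roughly the following shape. This is a sketch with the hardware replaced by plain variables so the control flow can run anywhere; on the real part the flag and the failing channel come from the VIM parity registers (HalCoGen names them along the lines of VIM_PARFLG / VIM_ADDERR - check sys_vim.h for the exact definitions), and all names below are illustrative placeholders.

```c
#include <stdint.h>

#define VEC_COUNT 96u                     /* placeholder channel count */

static uint32_t vim_ram[VEC_COUNT];       /* stands in for the VIM RAM vectors */
static uint32_t vim_backup[VEC_COUNT];    /* known-good copy taken after vimInit() */
static uint32_t ghost_calls;              /* FLG==0 entries ("ghost calls") */
static uint32_t real_fixes;               /* FLG==1 entries that restored a vector */

/* parflg: parity flag as read on entry (1 = real parity error latched).
 * error_channel: channel decoded from the error-address register.
 * Returns 1 when a vector was restored, 0 on a ghost call. */
static int fbparerr_handler(uint32_t parflg, uint32_t error_channel)
{
    if (parflg == 0u)
    {
        /* Ghost call: no parity error latched, do nothing but count it. */
        ghost_calls++;
        return 0;
    }
    /* Restore the vector from the backup table. Note this blindly trusts
     * the backup and does not validate the current vector value - exactly
     * the "Fixing ch: 2" behavior seen in the logs above. */
    vim_ram[error_channel] = vim_backup[error_channel];
    real_fixes++;
    return 1;
}
```

The FLG==0 guard is what keeps the ghost calls harmless in practice, but it also means the handler silently absorbs them - which is why they went unnoticed as a symptom for so long.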
Question 1: Is there some error/problem in the CPU's VIM RAM handling which causes corruption during parity testing and should be documented in the errata?
Question 2: Why does the SafeTI test start to fail when the vector is not "phantomInterrupt" - or should the question rather be why the ESM channel does not always react in that situation (the SafeTI test itself looks to work properly, since it signals the error)?
- With "phantomInterrupt" in use in these accelerated experiments, I haven't received any SafeTI failures (in our real application we have seen, 2 times in total, the VIM RAM parity test fail, but since the test by default does not say why it failed, we cannot know the reason; most likely either nERROR was activated due to a speculative fetch, or that same ESM-channel-not-activated problem happened - I added failure reasons to SafeTI but haven't got that error since).
Question 3: Why do those FBPARERR calls come with FLG=0? They shouldn't come at all, since interrupts are disabled while the test is made and the test acks the FLG away - and given that they do come, why don't they come on _every_ test?