RM48L952: VIM RAM corrupts during VIM RAM parity test - looks to be very time sensitive and requires simultaneous DMA usage (or DMA IRQ)

Part Number: RM48L952
Other Parts Discussed in Thread: HALCOGEN, CCSTUDIO, TMDSRM48HDK, SEGGER

Hello,

Sorry about the long post, but the problem is quite complicated...

We have discovered that a VIM RAM vector gets corrupted during/after the VIM RAM parity test when the "moon is properly aligned towards mars when doing the test"...
1. It does not matter whether the test is the SafeTI test (SL_SelfTest_VIM( VIM_SRAM_PARITY_TEST )) or the HalCoGen test (vimParityCheck()).
2. The behavior can be reproduced in both IAR and CCStudio by running the code from main() in a while loop, so OS activity / the real application does not matter.
3. This is really timing / configuration sensitive. For example, the attached CCStudio project does not fail if the start-up VIM PBIST and VIM parity test are not enabled (haven't tested those individually). With the current project it always fails after 571 tests (u32TestCnt == 571) in main().
4. The vector under test is corrupted. It looks like at least the lowest byte is always turned into either 0x00 or 0x01 (weirdly, IAR shows that only the lowest byte is altered but CCStudio shows that the whole vector is changed). It does not matter whether the tested vector is 0xFFF82000 or 0xFFF82008.
5. It looks like you also need some other interrupt activity/sources (like RTI) arriving while the test runs with interrupts disabled - without that I couldn't get our IAR main-loop tester to fail either.


Other notes:
a) The VIM_FBPARERR function gets called a lot - on almost every test but not exactly every time (typically PAR_FLG == 0, so this is a "ghost call"). It gets called even when the setup is such that VIM RAM never gets corrupted
- this is also quite weird
b) Sometimes only the VIM vector gets corrupted, and if you fix it in the "after" branch everything is ok. BUT sometimes it also triggers VIM_FBPARERR with FLG=1, and in that case you also get an ESM error like this

status (984599651|1|1|1007999222)<CR><LF>
after: VIM_RAM differs: 0x1 vs 0x53bec -- 1007999431<CR><LF> // the last number is the current test number
VIM_FBPARERR: Fixing ch: 2 --- 1007999431<CR><LF> // the last number is the current test number, so FLG=1 happened after the test
ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF> // this is ESM low interrupt handler
ESM: activated failsafe: G: 1 ch: 15<CR><LF>
status (1002149607|2|2|1025999160)<CR><LF>

And sometimes you only get the vector corruption - this does not make sense. Our printing buffer is very big and was thoroughly tested at that time, so I am 99,999999% sure that prints are not lost in this case
status (2054919|0|0|2099865)<CR><LF>
after: VIM_RAM differs: 0x0 vs 0x53bec -- 2099909<CR><LF>
status (2347396|1|0|2399839)<CR><LF>



Our IAR project's main() looks similar to the CCStudio one (decided to do OS-independent testing, but exactly the same "behavior" can be seen at run time with the OS active):

/////////////////// CODE /////////////////
    #define vimRAMLoc       ((volatile uint32 *)0xFFF82008U)

    uint32 u32Address = *vimRAMLoc;

    boolean bFail = FALSE;
    uint32 u32Time = 0U;
    while( bFail == FALSE )
    {
        u32Temppi++;

        if( *vimRAMLoc != u32Address )
        {
            DBG_PRINT( "entry : VIM_RAM loc differs: 0x%x vs 0x%x\r\n", *vimRAMLoc, u32Address );
        }
        SL_VIM_FailInfo failInfoVIM = { .stResult = ST_FAIL }; /* error injection test may not set this at all */ /*lint !e785 */ /* only one instance initialized */
        _disable_IRQ_interrupt_();
#if 0
        boolean bRetVal = SL_SelfTest_VIM( VIM_SRAM_PARITY_TEST, &failInfoVIM );

        SL_SelfTest_Result failInfo = failInfoVIM.stResult;
#else
        vimParityCheck();
        boolean bRetVal = TRUE;
        SL_SelfTest_Result failInfo = ST_PASS;
#endif

        if( *vimRAMLoc != u32Address )
        {
            DBG_PRINT( "after: VIM_RAM differs: 0x%x vs 0x%x -- %u\r\n", *vimRAMLoc, u32Address, u32Temppi );
            //bFail = TRUE; // force stop

            u32VIMRAM_corr++;
            *vimRAMLoc = u32Address;
        }

        if( (!bRetVal) || (failInfo != ST_PASS) )
        {
            DBG_PRINT( "VIM test fail: %u\r\n", failInfoVIM.failInfo );
        }

        _enable_interrupt_();

        if( HAL_u32TimeGet()-u32Time > 60U*1000U*1000U )
        {
            u32Time = HAL_u32TimeGet();
            extern uint32 u32Calls;
            extern uint32 u32Fixes;
            DBG_PRINT( "status (%u|%u|%u|%u)\r\n",u32Calls, u32VIMRAM_corr, u32Fixes, u32Temppi );
        }
    }

    while( 1 );
////////////////// CODE ENDS /////////////////////

CCStudio project which fails on the Hercules RM48 MCU demo kit (TMDSRM48HDK):

- This project uses purely HalCoGen files, 0 manual modifications to anything
- In this project the interrupts are not disabled "properly" (the I-bit state is not preserved on disable and restored on enable), but since the interrupt disable & enable is used in only 2 places in this CCStudio project that is ok: IRQs cannot get enabled by accident, because the nested disable-disable-enable-enable pattern never occurs.
6886.VIM_RAM_TEST.zip
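
For comparison, a "proper" critical section that preserves the I-bit state could look roughly like the sketch below. The CPSR is simulated with a plain variable so that the sketch runs on a host, and the helper names (irq_save_and_disable, irq_restore, irq_enabled) are hypothetical, not HalCoGen or SafeTI functions. On the target the two helpers would read and write the real CPSR.

```c
#include <assert.h>
#include <stdint.h>

#define CPSR_I_BIT (1u << 7)      /* IRQ mask bit of the ARM CPSR */

/* Simulated CPSR (assumption for host testing): I bit clear = IRQs enabled. */
static uint32_t sim_cpsr = 0u;

/* Mask IRQs and return the previous state of the I bit. */
static uint32_t irq_save_and_disable(void)
{
    uint32_t previous = sim_cpsr & CPSR_I_BIT;
    sim_cpsr |= CPSR_I_BIT;
    return previous;
}

/* Put the I bit back to the state returned by irq_save_and_disable(). */
static void irq_restore(uint32_t previous)
{
    assert((previous & ~CPSR_I_BIT) == 0u);   /* only the I bit may be passed in */
    sim_cpsr = (sim_cpsr & ~CPSR_I_BIT) | previous;
}

/* Non-zero when IRQs are currently enabled. */
static int irq_enabled(void)
{
    return (sim_cpsr & CPSR_I_BIT) == 0u;
}
```

With this pattern a nested disable-disable-enable-enable sequence keeps IRQs masked until the outermost restore, which is exactly what a plain disable/enable pair cannot guarantee.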

Please note that the "debug printing" is also radically simplified here: it triggers 1 "FOO\r\n" print when initializing the debug prints - that is enough to get the failure. If you request more to print, it will always send that same FOO\r\n regardless of the arguments. Printing is done using DMA. The code reaches the "after test RAM vector corruption" branch once 571 tests have been made, so the initial printing "setup", which also starts the DMA, cannot be what alters the VIM RAM content, since 570 tests can be made without corruption....

Guessing that in the CCStudio code there is zero possibility that the application would write into VIM RAM; in our real application there is a 0,00001% chance.

- In IAR I have checked with a Segger watchpoint that no one writes to that VIM RAM vector (the watchpoint triggers after the test, when the vector is corrected back inside the if-statement where the corruption has already been detected - that also proves that the watchpoint works, and it should also prove that our real IAR application does not write into that VIM RAM area either).

Also, the VIM RAM test is protected by interrupt disable/enable, so no one could write into that location. The only possibility would be DMA, but since we have only 1 channel in use, it writes to the UART, and you see the text on the UART, it is not possible that the DMA would write into VIM RAM (don't know/haven't tested whether the watchpoint is capable of seeing DMA writes too)...

In our real IAR application and in the main() test loop, the probability of a VIM RAM failure depends radically on the debug print activity: the more often you print, the more VIM RAM corruption errors you get. The RTI period also seems to have an effect: a 1ms period gives fewer errors than a 100us period, but then again a 10us period looks to give fewer than 100us...
- With a 5sec test interval we can run the real application code from hours to multiple days until the corruption happens, seemingly at random anywhere from 4h to a week, which strongly indicates that the error requires some magic event & timing to coincide before it turns into a VIM RAM corruption...

Here I have printed the status every 60sec in the main while-loop to illustrate that only a few errors occurred:
status (1546199563|2|2|1583998758)<CR><LF>

The format is: total number of FB_PARERR calls | RAM corruptions detected after the test | FB_PARERR calls with FLG == 1 | tests made
- So those FLG == 0 calls have been triggered on almost every test, the RAM has been corrupted only 2 times, and both times a FLG=1 call came as well
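
To compare runs with different print intervals, the status lines can be parsed offline. The following is a minimal host-side sketch, assuming the four-field format described above; the struct and function names are made up for illustration.

```c
#include <stdio.h>

/* One parsed "status (a|b|c|d)" line; field order follows the format above. */
typedef struct {
    unsigned long fbparerr_calls;   /* total VIM_FBPARERR calls            */
    unsigned long ram_corruptions;  /* corruptions detected after the test */
    unsigned long flg1_calls;       /* VIM_FBPARERR calls with FLG == 1    */
    unsigned long tests;            /* tests made                          */
} status_line_t;

/* Returns 1 when the line matches the status format, 0 otherwise. */
static int parse_status(const char *line, status_line_t *out)
{
    return sscanf(line, "status (%lu|%lu|%lu|%lu)",
                  &out->fbparerr_calls, &out->ram_corruptions,
                  &out->flg1_calls, &out->tests) == 4;
}

/* Detected RAM corruptions per one million tests. */
static double corruptions_per_million(const status_line_t *s)
{
    if (s->tests == 0ul) {
        return 0.0;
    }
    return (1.0e6 * (double)s->ram_corruptions) / (double)s->tests;
}
```

For the 60sec line above (2 corruptions in 1583998758 tests) this gives roughly 0.001 corruptions per million tests.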

Here is the same status print in the same main while-loop but printed every 1sec; the ratio of errors is much greater

status (17815128|1155|29|780321957)<CR><LF>

after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780322779<CR><LF>

status (17819918|1156|29|780531940)<CR><LF>

after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780532761<CR><LF>

status (17824708|1157|29|780741923)<CR><LF>

after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780742218<CR><LF>

status (17829498|1158|29|780951906)<CR><LF>

after: VIM_RAM differs: 0x51301 vs 0x513f8 -- 780952137<CR><LF>

VIM_FBPARERR: Fixing ch: 2 --- 780952137<CR><LF> // this does not check whether the vector value is valid or not, it just writes the value from the backup table to the given vector

ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF>

ESM: activated failsafe: G: 1 ch: 15<CR><LF>

after: VIM_RAM differs: 0x51300 vs 0x513f8 -- 780953681<CR><LF>

status (17834280|1160|30|781161858)<CR><LF> // note now it turned from 29 to 30 since we got FLG=1 FBPARERR call also
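
To see which bytes of the vector were altered in each "after:" line, the expected and observed values can be XORed. A small sketch, with a hypothetical helper name:

```c
#include <stdint.h>

/* Returns a 4-bit mask: bit n is set when byte n (0 = least significant)
 * differs between the expected and the observed vector value. */
static unsigned changed_bytes(uint32_t expected, uint32_t observed)
{
    uint32_t diff = expected ^ observed;
    unsigned mask = 0u;

    for (unsigned byte = 0u; byte < 4u; byte++) {
        if (((diff >> (8u * byte)) & 0xFFu) != 0u) {
            mask |= 1u << byte;
        }
    }
    return mask;
}
```

For the IAR lines above (0x51300 or 0x51301 vs 0x513f8) only byte 0 differs, while for the earlier 0x1 vs 0x53bec case bytes 0 to 2 differ.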



During the experiments I also noticed the following:

- If, after VIM init and before main(), you set the "reserved" vector under test to a value such as 0x1 or 0x0 instead of "phantomInterrupt", the SafeTI VIM RAM test starts to fail occasionally, and the reason for the failure is that the ESM channel did not become active (didn't test that with the HalCoGen test).

status (7892|0|0|3367759)<CR><LF>

status (9473|0|0|3579331)<CR><LF>

after: VIM_RAM differs: 0x1 vs 0x0 -- 3579770<CR><LF> // Here we have set the vector manually to 0x0, and after the test it has changed to 0x1; just fix it back

status (11046|1|0|3790893)<CR><LF>

status (11046|1|0|4000888)<CR><LF>

VIM test fail: 5<CR><LF> // now the actual VIM test failed after 4 million tests had been made; 5 == ESM channel hasn't activated during the test (the branch commented /* Check if ESM group1 channel 15 is not flagged */ in sl_selftest.c) // there are more failures after that, so this wasn't the only one

VIM_FBPARERR: Fixing ch: 2 --- 4001073<CR><LF> // no "after" print, so the RAM content is ok, but it went into the real FB_PARERR handler after the SafeTI test // we have our own function which prints this only if FLG=1

ESM:1, ch: 15 p1: 0xfff82008, p2: 0x0, p3: 0x0 @ 0 ms<CR><LF> // and got also ESM error

ESM: activated failsafe: G: 1 ch: 15<CR><LF>

status (12622|1|1|4212427)<CR><LF>

status (14201|1|1|4424000)<CR><LF>


So in this case the RAM content is not corrupted, but FLG=1 was still set and the SafeTI test couldn't see the ESM error...

Initially we noticed this problem because our test which checks the VIM RAM content (CRC over VIM RAM) failed very occasionally. We then reduced the tests so that we had only the VIM parity test & the VIM RAM content test and shortened the test interval -> started to fail more often -> then started to print the content of VIM RAM on CRC failure -> noticed that the 2nd vector had a wrong value -> added a check before & after every test that the vector has the proper value. After that I spent ~4 days testing different combinations with IAR, and lately built the CCStudio project from scratch and evaluated what needs to be done there to get it to fail too, and now I got it...
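
The two checks mentioned above (a CRC over the vector table and a per-vector comparison) can be sketched as follows. This is only an illustration: the CRC variant used in the real application is not stated in this post, so a plain bitwise CRC-32 is assumed here, and the function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320) over 32-bit words,
 * feeding each word least significant byte first. */
static uint32_t crc32_words(const volatile uint32_t *words, size_t count)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0u; i < count; i++) {
        uint32_t word = words[i];
        for (unsigned byte = 0u; byte < 4u; byte++) {
            crc ^= (word >> (8u * byte)) & 0xFFu;
            for (unsigned bit = 0u; bit < 8u; bit++) {
                crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
            }
        }
    }
    return ~crc;
}

/* Returns the index of the first vector that differs from the backup
 * table, or count when both tables are identical. */
static size_t first_mismatch(const volatile uint32_t *ram,
                             const uint32_t *backup, size_t count)
{
    size_t i = 0u;

    while ((i < count) && (ram[i] == backup[i])) {
        i++;
    }
    return i;
}
```

The CRC tells that something changed; the per-vector comparison tells which vector it was, which is how the wrong value in the 2nd vector was spotted.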

I still cannot say accurately why this fails. Basically the VIM PBIST at startup & RTI & DMA & some DMA activity are needed to get the failure; if you remove 1 item you most likely won't get any failures, but in our real application every element is in use. To me this looks like a CPU problem which cannot be prevented...

We would need to understand what happens and WHY. I'm guessing that removing the VIM RAM parity test (it does not give any FIT benefit) from the runtime tests removes the problem. But that just solves the issue, not the problem behind it. Can this same problem also appear some other way -> if yes, then our VIM RAM CRC test fails again and the system trips...
- When I implemented the VIM_FBPARERR handler to use the SafeTI ESM vectors and restore the proper content (vectors are modified after vimInit()) I noticed those "ghost" FLG == 0 calls, but at the time decided to ignore them since the occurrence was rather small and we had the FLG == 0 guard to prevent any processing, so effectively they only consumed a minor amount of CPU time. But now I'm thinking that those FLG=0 calls are a symptom of a much more critical problem, and those FLG == 0 calls exist in the pure HalCoGen project as well when doing HalCoGen's VIM RAM parity test...
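
The handler logic described above (ignore calls with FLG == 0, otherwise restore the vector from a backup table) could look roughly like this sketch. The parity flag register, the error address register and VIM RAM are simulated with plain variables so that it runs on a host; all names are hypothetical, and the channel decoding (low address bits divided by 4) is an assumption that matches the log above, where address 0xfff82008 is reported as ch: 2.

```c
#include <stdint.h>

#define VECTOR_COUNT 8u            /* small table, enough for the sketch */

/* Simulated stand-ins for the VIM parity flag register, the parity error
 * address register, and VIM RAM (assumptions for host testing). */
static volatile uint32_t sim_parflg;
static volatile uint32_t sim_adderr;
static volatile uint32_t sim_vim_ram[VECTOR_COUNT];

static uint32_t backup_table[VECTOR_COUNT];   /* known-good vectors        */
static uint32_t ghost_calls;                  /* handler calls with FLG==0 */
static uint32_t fixes;                        /* vectors restored          */

static void fbparerr_handler(void)
{
    if ((sim_parflg & 1u) == 0u) {
        ghost_calls++;             /* "ghost call": nothing to repair */
        return;
    }

    /* Word index of the faulty vector, taken from the low address bits. */
    uint32_t channel = (sim_adderr & 0x1FFu) >> 2;

    if (channel < VECTOR_COUNT) {
        sim_vim_ram[channel] = backup_table[channel];
        fixes++;
    }
    sim_parflg = 0u;               /* on hardware the flag is cleared by writing 1 */
}
```

Counting the ghost calls separately, as done here, is what makes the first and third fields of the status line comparable.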


Question 1: Is there some error/problem in the CPU's VIM RAM handling which causes corruption during parity testing and which should be documented in the errata?
Question 2: Why does the SafeTI test start to fail when the vector is not "phantomInterrupt" - or rather, why does the ESM channel not always react in such a situation (the SafeTI test itself looks to work properly, since it signals an error)?

- When the "phantomInterrupt" vector is used in these accelerated experiments, I haven't received any SafeTI failures (in our real application we have seen 2 times in total that the VIM RAM parity test has failed, but since the test by default does not say why it failed, we cannot know the reason; most likely either nERROR was activated due to a speculative fetch or that same "ESM channel not activated" problem happened - added failure reasons to SafeTI but haven't got that error since)
Question 3: Why do those FBPARERR calls come with FLG=0? They shouldn't come at all, since interrupts are disabled while the test is made and the test acks that FLG away - and since they do come, why don't they come on _every_ test?

  • The screenshot from CCStudio was lost, here it is again; see how 0xFFF82000 is changed to 0x1 when inside the "after check" if-branch

  • Hmm. The picture is visible in the writing view (copy & paste) but it gets lost when sending (new try with the insert media feature)...

  • Had a couple of minor mistakes: the DMA actually didn't get kicked off in the CCStudio project at all, and despite that it went (always once) into the error. So basically we can rule out that it is "the DMA" which causes something...

    Then I made a slight code change (moved the before-check inside the interrupt-disabled section) -> didn't go into the error anymore --> really timing critical.

    Reverted that change -> went into the error again...

    Then tried changing/fixing the printing a bit (still didn't start any DMA) -> didn't go into the error -> really timing critical...

    Overall I then made a couple of changes and now it goes consistently and multiple times into that error (commented the while(1) away to automatically gather multiple errors) -> errors can be seen by putting a breakpoint into the error branch (or somewhere else), waiting for it to re-trigger, and then reading the number of errors from the u32VIMRAM_corr variable.

    See the number of errors vs. tests performed in this run: 11 out of 4,8 million test runs



    Here is more robust project:
    6170.VIM_RAM_TEST.zip

  • Hello Jarkko,

    Thank you for the very detailed post. I am going to forward to our software team so they can have a look at it and also see if I can reproduce the issue on my bench.
  • Hello Jarkko,

    A quick update...

    I was able to download, compile and load the code that you provided (from your last post). However, I have not yet been able to see it catch an error. I am going to kick it off and let it run over night to see if it is something that will happen more rarely that I can catch after a long run time.

    As you also mentioned in your posts, the DMA never kicks off so I never see any data transmitted on the SCI. I don't know if this is somehow affecting things or not. It seems that this should not be different between the two IDEs. I have tried playing with a few parts of the initialization of the code to see if I could get it to start, but I haven't had any success with this. If you were able to recreate the failure without it, maybe this isn't so critical.

    It is interesting to me that there is different behavior between the IAR tools and the TI tools. Since it is a timing critical thing, it could have something to do with the compilation/binary being slightly different between them that is creating a slight difference in timing that results in the different behavior.
  • Hello Jarkko,

    Another quick update. I am able to recreate the issue with the code supplied without issue now. I am able to see the VIM RAM content be corrupted resulting in a content of 0x00000001 and also 0x00000000. I have not seen any relationship to timing of the tests (in other words there is no predictability that every nth test run will result in a corruption or every nth test will be 0x0 and mth test 0x1, etc). I have observed similar things as you have described such as small changes to code have an impact on the frequency of the corruption and can even cause it to seemingly go away completely. I have also noted that removal of the DBG_PRINT code also causes the issue to go away even though it doesn't appear the DMA is working. In addition to this, I have also observed that disabling interrupts also causes the issue to go away.

    With what I have learned so far, I still am not certain of root cause, but I believe it has something to do with stack corruption due to an interrupt happening during the VIM RAM test. I am going to dig into this more tomorrow and will report my findings at my COB or earlier.
  • Great that you were able to duplicate the issue.

    Quick questions:
    - Did you use the .zip package from the 1st post or the later one (4th post)? You should use the later .zip package; it should be more robust when it comes to making the error appear, whereas in the 1st package the compiler version will most likely have an effect. My goal was to provide a package which fails "consistently", and the 2nd package should take seconds to reach the first error, with subsequent errors coming less than 10sec apart.

    The later .zip package should consistently come here - haven't tested/looked at the actual UART pin on the demo board since I was happy to get that BTC interrupt
    #pragma WEAK(dmaGroupANotification)
    void dmaGroupANotification(dmaInterrupt_t inttype, uint32 channel)
    {
    /*  enter user code between the USER CODE BEGIN and USER CODE END. */
    /* USER CODE BEGIN (54) */
    #ifdef DEBUG_MON
    #include "DMON.h"
        if( inttype == BTC )
        {
            if( channel == (uint32)DMA_CH_DBG_TX )
            {
                DMON_vPacketSent(); // here
            }
        }
    #endif
    /* USER CODE END */
    }


    Regarding stack corruption:
    - Good idea, I have to check that more closely too, but our real application has stack monitors in place for every CPU stack and for every OS task stack. The monitor is not real-time but runs periodically, and it has a 20% limit (or at least nn stack items) relative to the actual stack size, so it should warn/trigger if the IRQ, FIQ or abort stack has gone near its real size.
    - Interrupts shall be disabled during the VIM parity test; the after-check (and also the before-check in the 2nd .zip) is inside the critical section, so the only item which should run is the VIM parity tester, and while it runs the VIM shall still have its interrupt sources enabled to register incoming interrupts.

    I am not familiar with CCStudio and couldn't find how/where the IRQ stack is located from the linker output, but when inside dmaBTCAInterrupt->dmaGroupANotification->DMON_vPacketSent->vStartDmaTransfer the SP is 0x08001278, and in the memory browser there are zeroes down to 0x08001000 (assuming that the sys/usr mode stack starts from there)

    The .map file does not show what is before 0x08001500 (most likely the stacks), so since 0x08001278->0x08001000 is all zeros I assume the stacks are ok in this "demo" too
                      00006f74    00000008     (__TI_handler_table)
                      00006f7c    00000010     (__TI_cinit_table)

    .bss       0    08001500    000007f8     UNINITIALIZED
                      08001500    000007d0     DMON_Main.obj (.bss:au8DataBuf)
                      08001cd0    00000028     sci.obj (.bss:g_sciTransfer_t)

    .data      0    08001cf8    0000002c     UNINITIALIZED
                      08001cf8    0000000c     rtsv7R4_T_le_v3D16_eabi.lib : exit.obj (.data:$O1$$)
                      08001d04    0000000c     sys_main.obj (.data)
                      08001d10    00000009     DMON_Main.obj (.data)
                      08001d19    00000003     --HOLE--
                      08001d1c    00000008     rtsv7R4_T_le_v3D16_eabi.lib : _lock.obj (.data:$O1$$)

    Then found this in the HalCoGen files: the IRQ stack spans 0x08001300 down to 0x08001200, so an SP of 0x08001278 inside the IRQ is clearly in the safe area --> this problem "cannot" be related to stack pointer corruption...
    userSp  .word 0x08000000+0x00001000
    svcSp   .word 0x08000000+0x00001000+0x00000100
    fiqSp   .word 0x08000000+0x00001000+0x00000100+0x00000100
    irqSp   .word 0x08000000+0x00001000+0x00000100+0x00000100+0x00000100
    abortSp .word 0x08000000+0x00001000+0x00000100+0x00000100+0x00000100+0x00000100
    undefSp .word 0x08000000+0x00001000+0x00000100+0x00000100+0x00000100+0x00000100+0x00000100
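
The arithmetic behind that conclusion can be written out as a small host-side check. It assumes full-descending stacks (each mode's stack grows down from its initial SP to the previous mode's initial SP) and that the user/system stack starts at 0x08000000; the enum and function names are hypothetical.

```c
#include <stdint.h>

#define STACK_BASE 0x08000000u     /* lowest address of the stack area (assumed) */

enum { MODE_USER, MODE_SVC, MODE_FIQ, MODE_IRQ, MODE_ABORT, MODE_UNDEF, MODE_COUNT };

/* Initial stack pointers, i.e. the top of each stack, as in the listing above. */
static const uint32_t stack_top[MODE_COUNT] = {
    STACK_BASE + 0x1000u,                                            /* userSp  */
    STACK_BASE + 0x1000u + 0x100u,                                   /* svcSp   */
    STACK_BASE + 0x1000u + 0x100u + 0x100u,                          /* fiqSp   */
    STACK_BASE + 0x1000u + 0x100u + 0x100u + 0x100u,                 /* irqSp   */
    STACK_BASE + 0x1000u + 0x100u + 0x100u + 0x100u + 0x100u,        /* abortSp */
    STACK_BASE + 0x1000u + 0x100u + 0x100u + 0x100u + 0x100u + 0x100u /* undefSp */
};

/* Lowest address that still belongs to the given mode's stack. */
static uint32_t stack_bottom(int mode)
{
    return (mode == MODE_USER) ? STACK_BASE : stack_top[mode - 1];
}

/* Non-zero when sp lies inside the given mode's stack. */
static int sp_in_stack(uint32_t sp, int mode)
{
    return (sp >= stack_bottom(mode)) && (sp <= stack_top[mode]);
}

/* Bytes left before the stack would run into the one below it. */
static uint32_t stack_headroom(uint32_t sp, int mode)
{
    return sp - stack_bottom(mode);
}
```

With the SP value 0x08001278 seen inside the DMA interrupt, the IRQ stack has used 0x88 bytes of its 0x100 and still has 0x78 bytes of headroom.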

  • Also checked the IAR project (where the test code runs inside a while-loop in main() before the real application would start); there we have much bigger stacks than in CCStudio.

    IRQ stack is 0x08010f80 - 0x08010a80
    SP 0x08010f08 when quite deep in the interrupt handler.

    Also the memory window shows the lowest non-zero content at 0x08010f00 (so this is most likely the deepest point we have reached in the while()-loop tester; the stack monitor, which fills the stacks with a special pattern, is not yet initialized) and below that it is only zeroes all the way to 0x08010a80. The next non-zero content comes at 0x08010400 (this is our sys-stack), zeroes begin again from 0x080101b0, and the sys-stack ends at 0x08010020, so plenty of room here also...

    Between the IRQ and sys stacks are the FIQ and SVC stacks, 0x300 each (in case you wonder why 0x08010a80-0x600 is not ...400: that is because we also added an extra empty safety gap between each CPU stack to give some extra margin to the stack monitor checks, since the monitor is primarily used for task-stack size checking, and task stacks are also protected against leaking by the MPU). So there would be quite a lot of room for the IRQ stack to leak in the test code, since at that point FIQ & SVC are not used (we don't use the SafeTI stack init in IAR, which keeps/puts the CPU in SVC mode), so basically IRQ mode has room all the way down to the sys-stack...
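
The "zeroes below the deepest point" inspection done by hand in the memory window amounts to a high-water-mark scan. A minimal sketch, assuming the unused stack area still holds its fill value (zero here, or the special pattern once the stack monitor is running); the function name is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* stack[0] is the lowest address of the stack area. Counts how many words,
 * starting from the low end, still hold the fill value, i.e. how much of a
 * downward-growing stack has never been written. */
static size_t untouched_words(const uint32_t *stack, size_t words, uint32_t fill)
{
    size_t count = 0u;

    while ((count < words) && (stack[count] == fill)) {
        count++;
    }
    return count;
}
```

For the IRQ stack above (0x08010a80 to 0x08010f80, lowest non-zero word at 0x08010f00) this would report 0x480 bytes, or 288 words, as never touched. A zero that was legitimately pushed looks untouched to this scan, which is why a dedicated fill pattern is the more reliable choice.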

    So after inspecting both projects (IAR & CCStudio) I would rule out a stack problem for several reasons
    1) the parity test is done inside a critical section (the test code's critical sections should also work even though they are not the best/most robust possible, as explained in the 1st post)
    2) the SPs are plausible when quite deep in the interrupt handler
    3) the stack memory areas have lots of zeros before the stack boundaries
    4) Segger's watchpoint write checker targeted at the VIM vector under test does not trigger until the code really restores the vector in the after-if-check

    Of course there could be some other similar "stupid mistake" which causes this, but since the example project is so simple compared to the real application it should be quite easy to find if one exists... Also, if one exists, Segger's watchpoint should trigger.
  • Thanks Jarkko for the additional information.

    This has been quite difficult to narrow down since as I add markers and debug code within the project, the point at which the issue occurs moves or goes away. This is really why I believe it is interrupt related. i.e., I wonder if there is an interrupt firing at a point in time that is creating a type of semaphore causing the corruption. This is why I mentioned the stack and the possibility, not that the stack is corrupted but more along the lines that the wrong value is loaded for the return value when causing the issue. Of course this is just speculation since I haven't been able to correlate the occurrence with a particular time interval or execution of an ISR. I am going to keep looking on Monday and try to capture some timestamps when events happen to see if I can associate the occurrence of the issue with one of the interrupts happening.

    By the way I used the project from your 4th post which seems to generate the issue quite easily.
  • Hello Jarkko,

    Although we have not clearly nailed down which interrupt is causing the interference, we have identified a change to the code and the way interrupts are disabled/enabled that appears to resolve the issue. Essentially, we have to disable and re-enable by writing directly to the VIM as shown in the code snippet below.

        uint32 reqmaskbackup0;
        uint32 reqmaskbackup1;
        uint32 reqmaskbackup2;
    
        boolean bFail = FALSE;
        while( bFail == FALSE )
        {
        // _disable_IRQ_interrupt_();
    
           // Backup Interrupt Enable
           reqmaskbackup0 = vimREG->REQMASKSET0;
           reqmaskbackup1 = vimREG->REQMASKSET1;
           reqmaskbackup2 = vimREG->REQMASKSET2;
    
           // Clear Interrupt Enable (REQMASKCLRn reads back the enabled channels, so writing that value back clears them all)
           vimREG->REQMASKCLR0 = vimREG->REQMASKCLR0;
           vimREG->REQMASKCLR1 = vimREG->REQMASKCLR1;
           vimREG->REQMASKCLR2 = vimREG->REQMASKCLR2;
    
           u32TestCnt++;
    
            if( VIMRAMLOC != u32ExpectedAddress )
            {
                  while(1){};
            }
    
            vimParityCheck();
    
            if( VIMRAMLOC != u32ExpectedAddress )
            {
                //bFail = TRUE; // force stop
    
                u32VIMRAM_corr++;
                VIMRAMLOC = u32ExpectedAddress;
    
               while(1){};
            }
    
            //   _enable_interrupt_();
            
            // Restore Interrupt Enable
            vimREG->REQMASKSET0 = reqmaskbackup0;// = vimREG->REQMASKSET0;
            vimREG->REQMASKSET1 = reqmaskbackup1; //= vimREG->REQMASKSET1;
            vimREG->REQMASKSET2 = reqmaskbackup2; //= vimREG->REQMASKSET2;
        } // end of while( bFail == FALSE )
    

     

    I have made this change to the sample project provided and have executed for 48 hours without failure.

  • Hi and great,

    Correct me if I am wrong, but wouldn't that experiment show that there is some unidentified problem in the VIM peripheral, since both methods prevent interrupts but using the CPU core's I-bit causes a VIM problem while doing the parity test for the VIM?

    That experiment would also rule out a coding error / stack corruption / wrong return from IRQ?

    With that approach, did you get any VIM_FB_PAR_ERR calls with FLG == 0?

    Just to clarify, did you leave the other debug-printing critical section using _enable_interrupt_()? If yes, that would narrow the problem down to just the parity test, so the regular core I-bit can be used without problems everywhere else (at least we haven't discovered any other VIM RAM corruption in our system or any other problems with regular I-bit usage)?

    So 1 way to get rid of this problem could be the following pseudo-code (since using the I-bit in other places of the code is quite mandatory, for example due to OS restrictions: we are using a certified OS so we cannot alter its critical section entry/exit methods):
    1. Enter generic SafeTI test-harness (the whole real harness is not as simple as the steps below :))
    2. CPU I-bit disable
    3. if test == VIM RAM PARITY
    3a. backup & disable VIM req masks
    4. Do 1 SafeTI test
    5. if test == VIM RAM PARITY
    5a. restore VIM req masks
    6. CPU I-bit enable
    7. Exit generic SafeTI test-harness
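
The steps above can be sketched in C as follows. This is a host-side illustration only: the VIM request mask registers and the I-bit are simulated, the three mask registers follow the REQMASKSET0..2 usage in the snippet earlier in the thread, and every name in the block is hypothetical rather than a SafeTI or HalCoGen API.

```c
#include <stdint.h>

#define VIM_MASK_REGS 3u           /* REQMASKSET0..2, as in the earlier snippet */

typedef enum { TEST_OTHER, TEST_VIM_RAM_PARITY } test_id_t;
typedef int (*selftest_fn)(test_id_t test);

/* Simulated hardware state (assumptions for host testing). */
static uint32_t sim_reqmask[VIM_MASK_REGS];   /* enabled VIM channels */
static int sim_irq_masked;                    /* CPU I-bit            */

static uint32_t vim_mask_read(unsigned reg)                { return sim_reqmask[reg]; }
static void     vim_mask_set(unsigned reg, uint32_t bits)   { sim_reqmask[reg] |= bits; }   /* REQMASKSETn write */
static void     vim_mask_clear(unsigned reg, uint32_t bits) { sim_reqmask[reg] &= ~bits; }  /* REQMASKCLRn write */

/* Example test body: passes only if it runs with everything masked. */
static int probe_all_masked(test_id_t test)
{
    (void)test;
    for (unsigned reg = 0u; reg < VIM_MASK_REGS; reg++) {
        if (vim_mask_read(reg) != 0u) {
            return 0;
        }
    }
    return sim_irq_masked;
}

/* Steps 1 to 7: I-bit handling for every test, VIM mask handling only
 * around the VIM RAM parity test. */
static int run_selftest(test_id_t test, selftest_fn run_one)
{
    uint32_t backup[VIM_MASK_REGS] = { 0u };
    int was_masked = sim_irq_masked;

    sim_irq_masked = 1;                                   /* step 2  */
    if (test == TEST_VIM_RAM_PARITY) {                    /* step 3  */
        for (unsigned reg = 0u; reg < VIM_MASK_REGS; reg++) {
            backup[reg] = vim_mask_read(reg);             /* step 3a */
            vim_mask_clear(reg, backup[reg]);
        }
    }

    int passed = run_one(test);                           /* step 4  */

    if (test == TEST_VIM_RAM_PARITY) {                    /* step 5  */
        for (unsigned reg = 0u; reg < VIM_MASK_REGS; reg++) {
            vim_mask_set(reg, backup[reg]);               /* step 5a */
        }
    }
    sim_irq_masked = was_masked;                          /* step 6  */
    return passed;
}
```

Keeping the mask handling inside the harness means the OS critical section code stays untouched, and only the one test that needs it pays for the extra register accesses.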

    Basically, just handling the VIM RAM parity test a little bit differently from every other SafeTI test would most likely be the easiest/best solution, also in terms of code modifications?

    I quickly tried that pseudo approach in the same CCStudio project (uncommented your commented-out IRQ enable/disable lines and just moved the I-bit enable after the restore) and didn't receive any errors, not even that VIM_FB_PAR_ERR with FLG == 0 side effect...

    So it looks like the FLG == 0 call is also a real side effect which, if it comes, indicates that everything is not ok in the setup?

    So it looks like your solution (& using the pseudo-code above in real code) would provide a way to eliminate the problem & side effects completely, so there is no need for manual VIM RAM content checking & repairing after the test & handling/ignoring the ESM IRQ after that if the vector content is ok...

    BUT unless you are able to dig out the root cause, how can we be sure that the problem has really disappeared and won't pop up after running for 1 month? I am of course pretty sure of this fix, since to me it initially looked like incoming IRQs and/or DMA transfers (maybe several are needed & maybe simultaneously) during the test were the cause of the failure. With the fix the VIM most likely works a bit differently, since it does not register those IRQs immediately when they come; it registers them when the masks are re-enabled, and that most likely changes things...

    Are you going to make an errata entry or something regarding this issue (it would help to properly mark the code, since I doubt that these e2e links work for "eternity")?

  • Jarkko,

    I am going to request to see if we can get some design resources to simulate the issue, but I am not hopeful that will happen since the design teams are focused on high priority, new designs. What this means is that I will have to try to see if there is a way for me to better isolate the cause of the issue. As you noted, it is likely that this is a result of interactions with either other IRQs or the DMA. Since we don't see this issue in other cases where we have IRQs and DMA transactions happening, I would speculate that it is something specific to this VIM test and its interactions with the ongoing activities. Further speculation is that it is specific to the DMA activity since it is a second master in the architecture and has the ability to modify addressable memory. I am going to set up another test without the direct VIM interrupt access but with the DMA MPU enabled to disallow writes to the VIM RAM space. If it is related to the DMA transfer, I should get an exception. Once I can narrow this down, I can focus more on the DMA and its operation.
  • Hi Jarkko
    I am marking this as "TI thinks it is resolved".
    Please do feel free to open another post - if there are pending issues around this - the system locks out old threads on a daily basis so you may not be able to respond to this one, in a day or two.

    Regards
    Mukul
  • Hi,

    Chuck's post "Feb 22, 2018 4:54 PM" using the VIM req masks solved the problem on our end too (using the pseudo-code steps mentioned in post "Feb 23, 2018 7:23 AM"), but the root cause was still open in post "Feb 23, 2018 3:01 PM".

    If something is marked as resolved, I think it should be the post "Feb 22, 2018 4:54 PM" - I'll do it now. However, it would have been nice to receive the root cause for this, or an errata entry or something, since now our own code just has a reference to this thread to explain why something is done the way it is.