RM48L952: N2HET: reading from N2HET RAM occasionally returns the real value shifted by roughly << 1

Part Number: RM48L952
Other Parts Discussed in Thread: TMDSRM48HDK

Hello,

We are seeing a rare, intermittent problem which we have now managed to reproduce with the attached project, using the CCS IDE and an RM48 dev board (Hercules Safety MCU development kit, RM4 MCU).

What we are trying to do in real life:

1. The HET runs and calculates a millisecond time
2. The application reads the HET time once per second and compares the elapsed time to the equivalent elapsed RTI time

About once a month we get an error. The following log was received for one failure (1000 == rti_diff, 2842175 == n2het_diff, d: = diff, t: = allowed tolerance, f: = sequential failures, t1: = RTI 'new'-'prev', t2: = N2HET 'new'-'prev'):

DRIFT_ERR: Elaps: 1000 vs 2842175, d: 2841175, t: 3, f: 1<CR><LF>

TIMES: t1: 2843242-2842242, t2: 5682352-2840177<CR><LF>

DRIFT_ERR: Elaps: 1000 vs 4292127120, d: 4292126120, t: 3, f: 2<CR><LF>

TIMES: t1: 2844242-2843242, t2: 2842176-5682352<CR><LF>

As can be seen, the time 5682352 was received from the N2HET. That is wrong: the previous time was 2840177 and the next one 2842176, which are 1999 apart, so either 2841177 or 2841176 should have been received. The second time comparison of course fails as well, since the garbage value received before is used as the previous time (the second "new" time is correct).
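The arithmetic behind the two log entries can be checked with a small sketch. It assumes that both timestamps are 32-bit unsigned millisecond counters and that the monitor subtracts 'prev' from 'new' in unsigned arithmetic. The struct and function names are made up for illustration and are not taken from the attached project.

```c
#include <stdint.h>

/* Hypothetical result record; the fields mirror the labels in the log. */
typedef struct {
    uint32_t rti_elapsed;  /* t1: RTI 'new' - 'prev'            */
    uint32_t het_elapsed;  /* t2: N2HET 'new' - 'prev'          */
    uint32_t diff;         /* d: absolute difference of the two */
    int      failed;       /* nonzero if diff exceeds tolerance */
} drift_result_t;

/* Compare the elapsed time of two 32-bit millisecond counters. */
drift_result_t drift_check(uint32_t rti_new, uint32_t rti_prev,
                           uint32_t het_new, uint32_t het_prev,
                           uint32_t tolerance)
{
    drift_result_t r;

    /* Unsigned subtraction wraps modulo 2^32, so a 'new' value that is
     * smaller than 'prev' produces a huge elapsed time, not a negative one. */
    r.rti_elapsed = rti_new - rti_prev;
    r.het_elapsed = het_new - het_prev;

    r.diff = (r.het_elapsed > r.rti_elapsed)
                 ? (r.het_elapsed - r.rti_elapsed)
                 : (r.rti_elapsed - r.het_elapsed);
    r.failed = (r.diff > tolerance);
    return r;
}
```

Called with the values from the first TIMES line, `drift_check(2843242u, 2842242u, 5682352u, 2840177u, 3u)` gives an N2HET elapsed time of 2842175 and d = 2841175. With the second line, 2842176 - 5682352 wraps to 4292127120. Note also that 5682352 is exactly 2 x 2841176, i.e. twice one of the two expected values.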

In the real application we use EXT_CLK as the RTI clock source, and we also monitor via DCC that OSCIN and EXT_CLK are valid; DCC is happy all the time. We also need to monitor the timestamps, since the clock sources do not produce the timestamps themselves, they only help to "generate" them. Based on this monitoring, we appear to have a real problem with the N2HET-based timestamp.


Exactly the same behavior occurs with the attached simplified project, which basically does the same thing as our real application but "much faster". Just run it and you will see that 'u32ReReadSuccess' matches 'u32Fails'. We have tested this with 2 different RM48 dev boards and 2 different computers, which are under different IT organizations and in different geographic locations. Note that the re-read was added to illustrate the problem: the first access may fail (returning roughly 2x the expected result) while the second access gives the expected time.

Steps to reproduce:
1. Use attached project and download it to RM48 board
2. Set CPU to "run", wait for example 5sec and stop
3. Check 'u32ReReadSuccess' and 'u32Fails' variables

You can also put a breakpoint on the u32ReReadSuccess++ line and see that 'u32Time' is roughly 2x the 'u32ReRead' time.
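The re-read check described above can be sketched roughly as follows. This is not the code from the attached project: the `time_source_t` callback stands in for `HET_TimeGet()` (whose real signature is not shown in this post), the plausibility rule and its `max_step` margin are assumptions, and `fake_het_time()` is only a stub that replays one doubled value followed by a sane one.

```c
#include <stdint.h>

/* Hypothetical stand-in for the N2HET time read function. */
typedef uint32_t (*time_source_t)(void);

typedef struct {
    uint32_t fails;           /* first read was implausible            */
    uint32_t reread_success;  /* re-read was plausible after a failure */
    uint32_t reread_fails;    /* re-read was implausible as well       */
} read_stats_t;

/* A value is plausible if it is at most max_step ms ahead of prev.
 * The unsigned subtraction keeps this valid across counter wrap-around. */
int is_plausible(uint32_t prev, uint32_t now, uint32_t max_step)
{
    return (uint32_t)(now - prev) <= max_step;
}

/* Read twice back to back, classify the outcome and return the value used. */
uint32_t read_with_reread(time_source_t get_time, uint32_t prev,
                          uint32_t max_step, read_stats_t *stats)
{
    uint32_t first  = get_time();
    uint32_t second = get_time();

    if (is_plausible(prev, first, max_step)) {
        return first;
    }
    stats->fails++;
    if (is_plausible(prev, second, max_step)) {
        stats->reread_success++;
        return second;
    }
    stats->reread_fails++;
    return prev;  /* nothing usable was read, keep the previous value */
}

/* Stub source for host-side experiments: a doubled value, then sane ones. */
static uint32_t fake_calls;
uint32_t fake_het_time(void)
{
    return (fake_calls++ == 0u) ? 50u : 26u;
}
```

With a previous time of 25 and a margin of 3 ms, the stub's first value (50) is counted as a failure and the re-read (26) as a success, which is the pattern where the two counters stay equal.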

Note1: if you cut and paste the code from dma.c into sys_main.c, the behavior changes
Note2: if you comment out _enable_interrupt_(); then no failures are detected
Note3: if you modify the _enable_interrupt_() function so that only FIQ or only IRQ is enabled (but not both), the errors remain
Note4: if you put a __nop(); between the 2 HET_TimeGet() calls, the error appears to go away.

I see 2 possibilities

A) The N2HET code has a bug that we just can't find. Still, I can't understand how a bug in the N2HET code could cause the RAM content to multiply "itself" by 2 and then revert...

B) Some kind of N2HET RAM access problem. I have understood from the TRM that the access is atomic and allowed from the CPU, and also that the ADD instruction is atomic ("case 110:Immediate Data Field[31:0] = IR2"), so by default what we are doing should be OK.

TRM 20.2.2.1: "N2HET accesses to its own internal RAM are given priority over accesses from an external host (CPU or DMA),"


Please point out the problem, since using that re-read in real code would be "fixing the symptom", not the root cause. Experiments with the demo code also show that with some modifications 'u32ReReadFails' are encountered as well. I also do not understand how enabling FIQ and IRQ can affect this, or why the code starts to behave differently when you cut and paste code from one file to another or add a 'nop'... Is this some kind of pipeline problem, so that a 'DMB' or 'DSB' instruction should be issued, or something like that?

 
Here is the project for RM48 dev board
5444.N2HET_read_problem.zip

  • Hi Jarkko,

    As an FYI, I have forwarded your post to our NHET expert so they can set up the test and try to find the root cause. I would expect that you should hear back from them within the next day or so. If you do not, please let me know and I will ping them again.
  • Hello Jarkko,

     

    I ran the test case (attached) on my machine and an RM48Lx HDK, but I could not reproduce the problem.

    Attached is the project you posted, with several minor modifications I made to get rid of the compile errors:

    1. imported to CCS7.1, the compiler version is 16.9

    2. added " uint32 u32HetClkDiv;"

    3. changed "lkDiv = u32HetClkDivider();" to "u32HetClkDiv = u32HetClkDivider();"

    https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/312/5850.N2HET_5F00_RAN_5F00_READ_5F00_TEST.7z

     

  • Is it not possible to use the .out file from the .zip directly, or to use the same CCS IDE version as we are using? We did not have to do anything on the other computer to import this project, run it and detect the same failures.

    Note that we initially encountered this issue with the IAR compiler, so the compiler as such should not be the cause, since we see it with CCS as well. However, we have noticed with this "demo project" that even a slight change can alter the code's behavior, which is why the setup should be "identical". I think that can be achieved by using the .out file from the .zip and the same RM48L952ZWT CPU...

    Our development board package has a sticker on the back with the following data: TMDSRM48HDK

    This is what we get (with 2 different boards and computers; the other computer just imported the project and used it as is, and we also verified that, for example, removing the interrupt enable function in the imported project made the failures go away, as on the first computer): the code had run for ~6814 ms before it was stopped (just run the code after system reset/download, wait a bit, and press "suspend" (Alt+F8)), and during that time we encountered 253 failures in the "primary read" which do not appear in the "secondary read".

    And here you can see what was read the first time and the re-read time when the first failure is encountered (prevtime == 25, time == 50 and read == 26)

    Based on your project, you have not actually changed anything, even though you said you had to make changes to get the code to compile; you just declare the variable before assigning a value to it (u32HetC is not used). Where did that u32HetC or lkDiv come from? It should not be in our .zip.

    Yours:

        uint32 u32HetC;
        uint32 u32HetClkDiv;
        //lkDiv = u32HetClkDivider();
        u32HetClkDiv = u32HetClkDivider();


    Ours:

    uint32 u32HetClkDiv = u32HetClkDivider();

    With your sys_main.c (in our project) the failures still exist.


    Let's also clarify a couple of things:

    A) Is it OK to read from HET RAM the way we are doing, with nothing extra needed?
    B) Can you see any problem in the N2HET code or its usage?

    C) Is there any reason to use, for example, the HTU to transfer data from N2HET RAM to CPU RAM? (We planned to try this to see whether it always gives valid values.)

    And yes, we know that this whole thread seems very weird, but as you can see from the screenshots, this is the real deal and an actual problem (also with CCS and the dev board). The reason it took us so long to find the "symptom" is that initially we did not have debug prints showing the current and previous values; we only printed the diffs, tolerance and failure count (the 1st line). Originally we thought that this might be some kind of "glitch" with the debugger or a "wrap-around issue" in the 32-bit N2HET data, but we initialized the N2HET time to near the wrap-around point and saw that it works OK, so we let it be until it hit again. Then we added more prints, and just last week we received the first real failure with the improved prints and noticed that the value read from the N2HET is "crazy". Then we "sped up" our monitoring (from 1 s to 1 ms) and noticed that it fails very quickly. Then we added the re-read and noticed that it is always OK. After that we decided to eliminate our application (OS, interrupts etc.) and built this CCS-based demo, only to find that the errors still exist, and that with slight changes which should not affect the RAM read (like changing the IRQ enable) they appear to go away...

  • Hello,

    1. I used your out file directly (compiler version: ARM 16.3.0.STS), and got the problem:

    2. I compiled the project with ARM16.9.1.LTS on my machine, and could not reproduce the problem:

    3. I don't see any problem in your NHET microcode. I'd like to use MOV32 or MOV64 to update the data field and control field.

    4. The way you read the NHET RAM is fine.

    5. HTU is NHET DMA (transfer NHET data to or from SRAM in the background of PCU operation).  Using HTU will improve your code performance, but I don't think it will solve this problem.

    Could you please try the new compiler?

  • Hi,

    QJ Wang said:

    1. I used your out file directly (compiler version: ARM 16.3.0.STS), and got the problem:

    Great that the error was reproduced!

    QJ Wang said:

    3. I don't see any problem in your NHET microcode. I'd like to use MOV32 or MOV64 to update the data field and control field.

    Sorry for my poor English. Do you mean that the MOV32 operation here is "suboptimal" because it is not used at all? How else could we get the ADD instruction to increase the data field by 1? Currently the MOV32's data field is just used as a placeholder for the value 1; the MOV32 instruction is not executed at all (see the BR instruction branching). Do you think this could have any influence on the read problem?

    QJ Wang said:

    5. HTU is NHET DMA (transfer NHET data to or from SRAM in the background of PCU operation).  Using HTU will improve your code performance, but I don't think it will solve this problem.

    My motivation for trying the HTU was to circumvent the direct CPU read of N2HET RAM, since it looks like there is some problem with it, which may or may not exist depending on the code/compiler. With the HTU, the CPU would read that same value from its own RAM (DMA could also be used for that transfer, but since the HET has only 1 DMA trigger we would like to save it for future use; still, it may be worth testing). Of course we can't be sure that the problem is in the read; it just looks that way based on the code behavior. If the HTU method worked, that would be an indication of something (I don't know what, but something)...

    QJ Wang said:

    Could you please try the new compiler?

    Of course I can try a newer CCS (I cannot try a newer IAR), but since we have this same issue with IAR (which is our primary compiler), I can't see any benefit in trying a newer CCS version. I trust your test result that the same code does not detect failures there, because the .out was different (I diffed it and the .map file from your project), but the failure may still reappear if you modify the code a bit. Even with our CCS version, a slight code change (like adding a 'nop' or commenting out the IRQ/FIQ enable) leads to code which does not encounter failures, but if you cut and paste the dma.c content into sys_main.c, then the re-read also starts to find failures occasionally. We have broadly similar results with the IAR test code (with it, getting rid of the errors looks much harder). The common thing is that none of the changes made to the code should in any way influence the N2HET RAM read, but somehow they still affect it.


    Since you managed to reproduce the issue, I hope that TI can tell us what is wrong and how to fix it. This goes way beyond our expertise, and I think it is not solvable with a basic debugger...

    Just trying "something" is not effective, since that way you most likely won't find the root cause... This could be something as simple as adding a pipeline flush or similar before the read (but based on the TRM it looks like nothing special should be needed, as you also confirmed in 3. and 4.), or it is errata material and some special procedure is needed to work around it so that the problem never occurs...


    This issue will be a blocker for us in the future, since that time is needed for a safety fieldbus (it requires 2 timestamps from different clock sources; the fieldbus stack calculates the elapsed time from both sources and takes the longer one). Just as our "monitor" fails, most likely the same will happen inside the fieldbus stack (we have not yet got that fieldbus running on this device): it calculates a crazy elapsed time, the fieldbus protocol raises a timeout and the device trips... With the current code we would need to "sanity check" the values given to the stack and, if a crazy value is received, re-read until a "better" value is obtained and pass that to the stack. This is something we do not want to do now, especially because we currently do not understand the root cause...
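    For reference, the kind of "sanity check" guard mentioned above could look roughly like the sketch below. It only treats the symptom and is not a fix for the root cause. The names, the plausibility window and the retry limit are all assumptions for illustration; the stub source replays the bad value from the log followed by 2841177, one of the two values that should have been read.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the N2HET millisecond read function. */
typedef uint32_t (*ms_source_t)(void);

/* Read the timestamp up to max_attempts times and accept the first value
 * that is at most max_elapsed_ms ahead of prev. Returns false if no
 * attempt was plausible, so that a persistent fault is reported to the
 * caller instead of being masked by endless re-reading. */
bool read_time_guarded(ms_source_t read_ms, uint32_t prev,
                       uint32_t max_elapsed_ms, unsigned max_attempts,
                       uint32_t *out)
{
    for (unsigned i = 0u; i < max_attempts; i++) {
        uint32_t now = read_ms();
        /* Unsigned subtraction stays correct across counter wrap-around. */
        if ((uint32_t)(now - prev) <= max_elapsed_ms) {
            *out = now;
            return true;
        }
    }
    return false;
}

/* Stub source for host-side experiments: alternates a doubled value
 * (taken from the log) with a sane one (assumed). */
static const uint32_t stub_values[2] = { 5682352u, 2841177u };
static unsigned stub_index;
uint32_t stub_read_ms(void)
{
    uint32_t value = stub_values[stub_index % 2u];
    stub_index++;
    return value;
}
```

    Bounding the number of attempts is a deliberate choice here: a value that stays implausible should still trip the device rather than stall the caller.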


    Initially we had an RM44 series device, and with it we used 2 eQEP modules (the 1st divided the external clock down to a ms base, and its output was looped to the 2nd eQEP's input, which then counted milliseconds directly) to produce that second millisecond time. It worked without issues, but this RM48 series does not have eQEPs, so we came up with this N2HET time method, which initially appeared to work...