Other Parts Discussed in Thread: TMDSRM48HDK
Hello,
We are seeing a rare, intermittent problem that we have now managed to reproduce with the attached project, using the CCS IDE and an RM48 dev board (Hercules Safety MCU development kit, RM4 MCU).
What we are trying to do in the real application:
1. HET runs and keeps a millisecond time count
2. The application reads the HET time once per second and compares the elapsed HET time to the equivalent elapsed RTI time
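To make the comparison in step 2 concrete, here is a minimal sketch of such a drift check. The function name `drift_ok` and the exact pass condition (`d <= tolerance_ms`) are assumptions for illustration, not the code from our application; the arithmetic is plain unsigned 32-bit.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper: returns true when the elapsed N2HET time and the
 * elapsed RTI time (both in ms) agree within tolerance_ms.
 * Assumption: a difference equal to the tolerance still passes. */
bool drift_ok(uint32_t rti_prev, uint32_t rti_new,
              uint32_t het_prev, uint32_t het_new,
              uint32_t tolerance_ms)
{
    /* Unsigned subtraction is modulo 2^32, so a counter rollover is handled. */
    uint32_t rti_diff = rti_new - rti_prev;
    uint32_t het_diff = het_new - het_prev;

    /* Absolute difference between the two elapsed times. */
    uint32_t d = (rti_diff > het_diff) ? (rti_diff - het_diff)
                                       : (het_diff - rti_diff);
    return d <= tolerance_ms;
}
```

With a tolerance of 3 ms, a pair of elapsed times such as 1000 and 999 passes, while 1000 versus 2842175 does not.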
About once a month we get an error. The following log was received for one failure (1000 == rti_diff, 2842175 == n2het_diff, d: = difference, t: = allowed tolerance, f: = sequential failures, t1: = RTI 'new'-'prev', t2: = N2HET 'new'-'prev'):
DRIFT_ERR: Elaps: 1000 vs 2842175, d: 2841175, t: 3, f: 1<CR><LF>
TIMES: t1: 2843242-2842242, t2: 5682352-2840177<CR><LF>
DRIFT_ERR: Elaps: 1000 vs 4292127120, d: 4292126120, t: 3, f: 2<CR><LF>
TIMES: t1: 2844242-2843242, t2: 2842176-5682352<CR><LF>
As can be seen, the time 5682352 was received from N2HET. That is wrong: the previous time was 2840177 and the next one 2842176, which are 1999 apart, so either 2841177 or 2841176 should have been received from N2HET. The second comparison of course also fails, since the garbage value received before is used as the previous time (the second "new" time is correct).
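The numbers in the log can be reproduced with plain unsigned 32-bit subtraction. The sketch below assumes that the elapsed times are computed this way (the helper name `elapsed_u32` is made up for illustration); the values in the comments are taken from the log.

```c
#include <stdint.h>

/* Elapsed time between two 32-bit timestamps.
 * Unsigned subtraction wraps modulo 2^32, so a "new" value that is
 * smaller than "prev" yields a huge positive result. */
uint32_t elapsed_u32(uint32_t prev, uint32_t now)
{
    return now - prev;
}

/* First DRIFT_ERR:  elapsed_u32(2840177, 5682352) == 2842175
 * Second DRIFT_ERR: elapsed_u32(5682352, 2842176) == 4292127120
 *                   (that is 2^32 - 2840176, a wrapped negative difference)
 * Bad sample:       5682352 == 2 * 2841176, i.e. exactly twice one of
 *                   the two values that would have been expected. */
```

So the second failure is only a consequence of the first bad read, and the bad read is not just roughly but exactly double a plausible value.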
In the real application we use EXT_CLK as the RTI clock source, and we also monitor via DCC that OSCIN and EXT_CLK are valid; DCC is happy all the time. We need to monitor the timestamps as well, since the clock sources do not produce the timestamps, they only help to "generate" them. Based on this monitoring, we appear to have a real problem with the N2HET-based timestamp.
Exactly the same behavior occurs with the attached simplified project, which basically does the same thing as our real application but "much faster". Just run it and you will see that 'u32ReReadSuccess' matches 'u32Fails'. We have tested this with 2 different RM48 dev boards and 2 different computers, which are under different IT organizations and in different geographic locations. Note that the re-read was added to illustrate the problem: the first access might fail (returning roughly 2x the expected result) while the second access gives the expected time.
Steps to reproduce:
1. Use the attached project and download it to the RM48 board
2. Set the CPU to "run", wait for example 5 s, then stop
3. Check the 'u32ReReadSuccess' and 'u32Fails' variables
You can also put a breakpoint on the u32ReReadSuccess++ line and see that 'u32Time' is roughly 2x the 'u32ReRead' time.
Note1: you can cut&paste the code in dma.c to sys_main.c and the behavior changes
Note2: if you comment out _enable_interrupt_(); then no failures are detected
Note3: if you modify the _enable_interrupt_() function so that only FIQ or IRQ is enabled (but not both), the errors remain
Note4: if you put __nop(); between the 2 HET_TimeGet() calls, the error seems to disappear.
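For readers who do not want to open the zip, the double-read check can be sketched roughly as below. This is not the code from the attached project: `HET_TimeGet()` is replaced by a stub that replays a canned sequence so the sketch runs on a PC, and `poll_once`, `MAX_STEP_MS` and the plausibility test are assumptions made for illustration. Only the counter and variable names follow the project.

```c
#include <stdint.h>
#include <stddef.h>

/* Stub standing in for the real HET_TimeGet(), which reads the ms counter
 * from N2HET RAM. It replays fixed samples; 2004 mimics a "2x" bad read. */
static const uint32_t k_samples[] = { 1000u, 1001u, 2004u, 1002u, 1003u };
static size_t s_next;

uint32_t HET_TimeGet(void)
{
    uint32_t v = k_samples[s_next];
    if (s_next + 1u < sizeof k_samples / sizeof k_samples[0]) {
        s_next++;                      /* stay on the last sample at the end */
    }
    return v;
}

uint32_t u32Fails, u32ReReadSuccess, u32ReReadFails;

#define MAX_STEP_MS 10u                /* assumed plausibility window */

/* Reads the time once; on an implausible jump, counts a failure and reads
 * again. Returns the value to use as the next "previous" time. */
uint32_t poll_once(uint32_t u32Prev)
{
    uint32_t u32Time = HET_TimeGet();
    if ((uint32_t)(u32Time - u32Prev) <= MAX_STEP_MS) {
        return u32Time;                /* first read is plausible */
    }

    u32Fails++;
    uint32_t u32ReRead = HET_TimeGet();
    if ((uint32_t)(u32ReRead - u32Prev) <= MAX_STEP_MS) {
        u32ReReadSuccess++;            /* second access gave a sane time */
        return u32ReRead;
    }
    u32ReReadFails++;
    return u32Prev;                    /* keep the last known good value */
}
```

On the target, the symptom is that 'u32ReReadSuccess' keeps pace with 'u32Fails', meaning the immediate second read is good almost every time the first one is bad.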
I see 2 possibilities:
A) The N2HET code has a bug that we just can't find. Still, I can't understand how a bug in the N2HET code could cause the RAM content to multiply "itself" by 2 and then revert...
B) Some kind of N2HET RAM access problem. I have understood from the TRM that access is atomic and allowed from the CPU, and also that the ADD instruction is atomic ("case 110:Immediate Data Field[31:0] = IR2"), so by default what we are doing should be OK.
TRM 20.2.2.1: "N2HET accesses to its own internal RAM are given priority over accesses from an external host (CPU or DMA),"
Please point out the problem, since using the re-read in real code would be "fixing the symptom", not the root cause. Also, experiments with the demo code show that with some modifications 'u32ReReadFails' are encountered as well. I also do not understand how enabling FIQ and IRQ can affect this, or why the code starts to behave differently when you cut&paste code from one file to another or add a 'nop'. Is this some kind of pipeline problem, so that a 'DMB' or 'DSB' instruction should be issued, or something?
Here is the project for the RM48 dev board:
5444.N2HET_read_problem.zip