OMAP-L137: Sporadic kernel NULL pointer dereference in Davinci Linux at higher ambient temperatures

Hello,

We are working with an OMAP-L137 processor on custom hardware similar to the OMAP L137 EVM evaluation board. We are using Davinci Linux and SYSLINK for the communication between the ARM and DSP cores. Our hardware prototypes carry different versions of the OMAP processor. On some of them the Linux kernel crashes sporadically, with error messages like this:

Unable to handle kernel NULL pointer dereference at virtual address 00000005
pgd = c1fd0000
[00000005] *pgd=c1ff4031, *pte=00000000, *ppte=00000000
Internal error: Oops: 801 [#1] PREEMPT
last sysfs file: /sys/kernel/uevent_seqnum
Modules linked in: syslink
CPU: 0    Not tainted  (2.6.33 #1)
PC is at get_page_from_freelist+0x250/0x51c
LR is at 0x41
pc : [<c0077328>]    lr : [<00000041>]    psr: 60000093
sp : c0fd7c50  ip : 00000000  fp : c0fd7cc4
r10: 00000000  r9 : c03c3f84  r8 : c03c3f60
r7 : 60000013  r6 : c03c3f60  r5 : c0fd6000  r4 : c03c3f78
r3 : 00200200  r2 : 00000001  r1 : c03c3f84  r0 : c03c3f78
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  ISA ARM  Segment user
Control: 0005317f  Table: c1fd0000  DAC: 00000015
Process SlaveThread (pid: 894, stack limit = 0xc0fd6270)
Stack: (0xc0fd7c50 to 0xc0fd8000)

The address at which the NULL pointer dereference is reported varies. A simple memory test on the ARM core, which allocates a few MB of RAM, writes data, reads it back and verifies it, also fails sporadically. The problem appears to depend on the ambient temperature and on the temperature range of the OMAP. The crashes become more frequent especially when there is high activity on the external SDRAM and both the DSP and ARM cores are running. At normal ambient temperatures, the problem occurs on the OMAP L137BZKB3, which has a temperature range of 0°C to 90°C. When this processor is cooled down to about 5°C ambient temperature, it works properly. With an OMAP L137BZKBA3, we can raise the ambient temperature a little higher before the problem appears. With an OMAP L137DZKBA3 we could not observe the problem up to an ambient temperature of about 80°C. Please note that these processors differ in processor version and temperature range.
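For readers who want to reproduce this kind of check, the following is a minimal sketch of such a write/read-back/verify test in user space. It is not the test application used in this thread; the function names, the pattern and the buffer size are assumptions chosen for illustration.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Expected value for word i. The value changes with the index, so a word
 * that ends up at the wrong address shows up as a mismatch. */
static uint32_t pattern_word(size_t i, uint32_t seed)
{
    return seed ^ ((uint32_t)i * 2654435761u);
}

/* Write the pattern into the whole buffer. */
static void fill_pattern(volatile uint32_t *buf, size_t words, uint32_t seed)
{
    for (size_t i = 0; i < words; i++)
        buf[i] = pattern_word(i, seed);
}

/* Read the buffer back and count the words that differ from the pattern. */
static size_t verify_pattern(const volatile uint32_t *buf, size_t words,
                             uint32_t seed)
{
    size_t mismatches = 0;

    for (size_t i = 0; i < words; i++) {
        if (buf[i] != pattern_word(i, seed))
            mismatches++;
    }
    return mismatches;
}

/* Allocate `bytes` of RAM and run `passes` fill/verify rounds with
 * alternating seeds. Returns the total number of mismatching words,
 * or -1 if the allocation fails. */
static long run_memtest(size_t bytes, unsigned passes)
{
    size_t words = bytes / sizeof(uint32_t);
    uint32_t *buf = malloc(words * sizeof(uint32_t));
    long total = 0;

    if (buf == NULL)
        return -1;

    for (unsigned p = 0; p < passes; p++) {
        uint32_t seed = ((p & 1u) ? 0x55555555u : 0xAAAAAAAAu) + p;

        fill_pattern(buf, words, seed);
        total += (long)verify_pattern(buf, words, seed);
    }

    free(buf);
    return total;
}
```

Called from a small `main`, for example as `run_memtest(8u << 20, 4)`, a non-zero result indicates corrupted words. The buffer should be much larger than the ARM data cache, otherwise most accesses never reach the external SDRAM.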

At the moment we assume that we have a timing problem with the SDRAM interface or the processor itself that gets worse at higher temperatures. We have no passive or active cooling for the processor. The ARM and DSP cores are running at a 300 MHz clock. We are using the SDRAM timings of an older ARM UBL and have an SDRAM that is similar to the ISSI RAM on our OMAP EVM board. We already tried to match the timing of the SDRAM interface better to our SDRAM, but without success. The problem still exists.
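As a rough illustration of what matching the timing to the SDRAM involves, the sketch below converts a datasheet minimum time in nanoseconds into whole memory clock cycles. The helper names are hypothetical, and the "cycles minus one" field encoding is an assumption that has to be checked against the EMIF documentation of the device. The numbers in the comments are example values, not the settings used in this thread.

```c
#include <stdint.h>

/* Smallest number of whole clock cycles that covers t_ns at clk_hz.
 * Rounds up, so the programmed time is never shorter than required. */
static uint32_t ns_to_cycles(uint32_t t_ns, uint32_t clk_hz)
{
    uint64_t num = (uint64_t)t_ns * clk_hz;

    return (uint32_t)((num + 999999999ull) / 1000000000ull);
}

/* Register field value, ASSUMING the controller applies
 * "programmed value + 1" cycles. Verify this for the real hardware. */
static uint32_t timing_field(uint32_t t_ns, uint32_t clk_hz)
{
    uint32_t cycles = ns_to_cycles(t_ns, clk_hz);

    return (cycles > 0) ? cycles - 1 : 0;
}

/* Example: a 20 ns minimum at a 133 MHz memory clock is 2.66 cycles,
 * which rounds up to 3 cycles and, under the assumption above, a field
 * value of 2. */
```

Rounding up is the conservative choice here: rounding down would program a time shorter than the datasheet minimum.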

Has anybody experienced similar problems? Could this be a problem of an older version of the OMAP processor, e.g. the B version that we received as samples? At the moment we have one newer OMAP, the D version, on a prototype that seems to work properly.

Thanks in advance!

Best regards!

Dom

  • Hi,

    Are you getting the kernel panic only when you are accessing the RAM?

    Do you get the same behavior at any time, while running any application?

    Did your board reboot at higher temperatures?

    I have not seen this behavior before.

    Let me check the errata for this device, "OMAP L137BZKBA3".

  • Hi!

    Thanks for your fast reply.

    I did not see any automatic reboot, but at higher temperatures, when I experience the problem with my test application and then manually reboot the system, the kernel often crashes during the boot process with similar NULL pointer dereference errors. When I cool the processor down again, the kernel boots correctly again.


    In my opinion, the kernel may read a pointer back from RAM that is erroneously NULL, either because the read access to the external SDRAM fails or because an earlier write operation went wrong and overwrote the pointer value.

    At first we suspected a marginal timing of the SDRAM interface that is just okay at low temperatures but becomes more critical with rising temperatures. So I modified the RAM timings in a way I thought would be better, but with no effect. The various versions of the ARM UBL, where the timing is set up, use different settings, so I am not sure which one is the correct or best one. But as the problem seems to be correlated with the OMAP version and temperature grade, is it possibly something different?


    Thanks in advance!

    Best regards,

    Dom

  • Hi,

    I did not see any automatic reboot, but at higher temperatures, when I experience the problem with my test application and then manually reboot the system, the kernel often crashes during the boot process with similar NULL pointer dereference errors. When I cool the processor down again, the kernel boots correctly again.

    Thanks for your information.

    Could you please tell us at which point the panic occurs while the Linux kernel is loading/initializing its drivers?

    Or does it occur randomly?

    How did you increase or decrease the temperature?

    With a temperature chamber?

    So I modified the RAM timings as I thought it would be better

    So now you are not getting any kernel panic, right?

  • Hi!

    Yes, we have a chamber here for temperature tests. But with the OMAP L137BZKB3 I don't need the chamber, because the error happens at a normal ambient temperature of about 25°C once the processor has been running for a while.

    I think the error during booting happens randomly, at different stages of the boot process, but only when the processor is warm.

    After modifying the RAM timing, I still get the error.

    Thanks for your help!

    Best regards,

    Dom

  • Hi Dom,
    I am experiencing a similar problem on an OMAP-L137 custom board derived from the DSK.
    The board works fine for some minutes, then suddenly the ARM side reports a bad frame error or the DSP gets stuck in weird memory regions like 0x3eef0000. I added some checks to the DSP-side code and found that the stack pointer gets bad values, while the stack does not seem to overflow and the rest of the memory seems good.
    Do you have any further news about your problem? I also suspect a chip temperature or SDRAM timing related problem, so I wonder if you can help me.
  • Hey Michele,

    We did not solve the problem on the B-version (L137BZKBA3), but we replaced all B-versions with the D-version (L137DZKBA3) of the OMAP, and after that we never experienced the problem again.

    So in my opinion, it was possibly a timing problem on the SDRAM interface of that version at higher temperatures, when heavy RAM accesses were made from the ARM side and the DSP side in parallel.

    So which version of the processor are you working on? Do you have the possibility to test the temperature dependency of your problem?

    Best regards,
    Dom
  • Hi Dom,
    many many thanks for your reply; on Friday I will check the silicon version and let you know. I suspect it is quite an old version.
    I will run some temperature tests too. BTW, I noticed that the problem arises less frequently in the morning than in the afternoon, so maybe there really is a temperature correlation, or maybe I am just forcing my mind to believe in the silicon bug, which at this moment is the only explanation I have :).
    Do you perhaps remember the approximate maximum working temperature of the OMAP case? Today I checked the case temperature by touching the OMAP and it was IMO approx. 40°C, so nothing unusual. Do you think that can be too much for the buggy silicon version?
    Many thanks again!
  • Hi Dom, an update: the problem seems to be related not to bad reads from the SDRAM, but to something going wrong when a floating point operation is interrupted by an external interrupt; I am still struggling to fully understand what is happening...