AM625: kernel errors "rcu stall" running memory test

Jhon Q

Part Number: AM625

Hi,

We are using AM6254 SoC with 16Gb/8Gb DDR4-3200 Samsung P/Ns: K4AAG165WA-BIWE/K4A8G165WC-BIWE.
We have run memtester as a memory test for more than 2 days. On both configurations, we receive a few kernel errors during the test reporting an "rcu stall":

[ 1119.092938] rcu: INFO: rcu_preempt self-detected stall on CPU
[ 1119.098740] rcu: 0-....: (5624 ticks this GP) idle=3be/1/0x4000000000000004 softirq=12719/12725 fqs=2625
[ 1119.108401] (t=5254 jiffies g=14553 q=144)
[ 1119.112595] Task dump for CPU 0:
[ 1119.115831] task:memtester state:R running task stack: 0 pid: 960 ppid: 1 flags:0x00000202

We are unsure whether this is a kernel issue or a DDR timing issue. Note that we are using the latest DDR Config version, v9.08.

Referring to the 16Gb part: as per TI's feedback for the Samsung part, we have modified tCCD_L to 5 tCK and left the rest of the registers unchanged from TI's default settings.
I have attached the 16Gb dmesg logs showing the errors and our DDR config files. The DDR datasheet can be found here: https://www.memorydistri.com/pub/media/downloads/datasheets/K4AAG165WA_BITD.pdf
(Note: in the 16Gb test, our script left 550MB of memory free and ran 23 threads of 50MB each for more than 2 days.)
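
A minimal sketch of the sizing arithmetic described in the note above. This is not the script that was actually used; the function name `memtester_instances` and the 1700 MB free-memory figure are assumptions chosen only so that the result matches the 23 instances mentioned.

```shell
# Hypothetical helper: how many memtester instances of a given size fit
# into the currently free memory while keeping a reserve untouched.
# Arguments (all in MB): free memory, reserve to leave free, size per instance.
memtester_instances() {
    local free_mb=$1 reserve_mb=$2 chunk_mb=$3
    if [ "$free_mb" -le "$reserve_mb" ]; then
        # Nothing can be started without eating into the reserve.
        echo 0
        return
    fi
    echo $(( (free_mb - reserve_mb) / chunk_mb ))
}

# Assumed numbers: 1700 MB free, 550 MB reserve, 50 MB per instance.
memtester_instances 1700 550 50
```

On a real board the first argument would come from `free`, as in the reproduction command later in this thread.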

Has TI validated the latest DDR config v9.08 on its EVK? Have you encountered such errors?

We would appreciate your assistance in resolving this issue.

dmesg-16Gb.txt DDR_Config_0.09.08.0000.zip

Thank you for your support,

Jhon Q.

over 3 years ago

0 Anshu Choudhary over 3 years ago

TI__Genius 12430 points

Jhon,

Thanks for your query. I have assigned the thread to our kernel expert. Please allow some time for a response.

Regards

Anshu

0 JJD over 3 years ago in reply to Anshu Choudhary

TI__Guru* 93451 points

From the log, it doesn't appear the kernel is crashing, just that you are getting the rcu stall errors. I would say that if the test continues to run, it is probably not a DDR issue. You are not getting any memtester errors, just kernel errors. I'll let our software team comment on the kernel errors.

Regards,

James

0 Jhon Q over 3 years ago in reply to JJD

Intellectual 265 points

Hi James,

Thanks for your reply, we will wait for the software team's response.

Also, are you able to check our DDR configuration (attached above) against the DDR part we use (K4AAG165WA-BIWE) and verify that the settings are OK?

Thank you,

Jhon

0 Nate Drude over 3 years ago

Prodigy 180 points

JJD, Anshu Choudhary,

We have discovered the RCU stalls are reproducible on the TI AM62 EVK. Here are the steps:

Reproduce steps on TI AM62 EVK:

1. Plug an HDMI display into the EVK

2. Boot https://dr-download.ti.com/software-development/software-development-kit-sdk/MD-PvdSyIiioq/08.06.00.42/tisdk-default-image-am62xx-evm.wic.xz

3. Start 23 instances of memtester (it may happen with more or fewer; our scripts have been using 23 to leave a certain amount of memory free):

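# Start 23 background memtester instances of 50 MB each, logging only FAIL lines,
# then print the remaining free memory in MB.
for i in {1..23}; do memtester 50m | grep FAIL >> /tmp/memtest.log & done && \
   sleep 1 && \
   echo $(expr $(free | awk '/^Mem:/{print $4}') / 1024)MB Free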

4. The RCU errors usually appear within 24 hours of running memtester.
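
To check a long run for stalls after the fact, something like the following could be used. It is only a sketch: the function name `count_rcu_stalls` is hypothetical, and the pattern is based on the message format shown in the log excerpt at the top of this thread.

```shell
# Hypothetical helper: count RCU stall reports in a saved kernel log.
# The pattern matches lines such as
#   "rcu: INFO: rcu_preempt self-detected stall on CPU"
# $1 is the path to a file containing dmesg output.
count_rcu_stalls() {
    grep -c -E 'rcu: INFO: .*detected stall' "$1"
}
```

For example, save the log with `dmesg > /tmp/dmesg.txt` and then run `count_rcu_stalls /tmp/dmesg.txt`.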

We discovered it only happens if the display is enabled. If the HDMI cable is not attached (and therefore fb0 not initialized) or if dss and dss_ports device tree nodes are disabled, the RCU errors will not occur.
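
As a rough way to record which of the two cases a given run falls into, the display state can be checked before starting memtester. This is a sketch under assumptions: the function name `display_connected` is hypothetical, and it assumes the display driver exposes its connectors under /sys/class/drm with a `status` file, as DRM drivers normally do.

```shell
# Hypothetical helper: succeed if any DRM connector reports "connected".
# $1 is the sysfs DRM directory; it defaults to /sys/class/drm.
display_connected() {
    local drm_dir=${1:-/sys/class/drm} f
    for f in "$drm_dir"/*/status; do
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" = "connected" ]; then
            return 0
        fi
    done
    return 1
}

# Treat the display as active only if a connector is attached and fb0 exists.
if display_connected && [ -e /dev/fb0 ]; then
    echo "display active"
else
    echo "display inactive"
fi
```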

After reviewing the reference manual section in the screenshot below, we are wondering if this is related to the QoS configuration. From "3.1.12 QoS Programming Guide":

> AM62 has two real-time peripherals DSS and CSI_RX, both of them have real time deadline.

...

> In addition, GPU can burst out a long back to back burst of transactions to completely exhaust the DDR bandwidth. In order to mitigate the risk of missing real-time peripheral deadline, bandwidth limiters are added to AM62 to control the transactions from both GPU and A53. User can configure and tune the bandwidth limiter parameters based on the use cases.

Can you please confirm this behavior on your side, and advise if 1) the QoS parameters need to be adjusted or 2) some other configuration should be changed?

Thanks,

Nate

0 Krunal Bhargav34 over 3 years ago in reply to Nate Drude

TI__Mastermind 45035 points

Hi Nate,

I will test the above on my setup and get back to you. In general, could you please provide more background on your test case? I am curious why you are running 20+ instances of memtester and attaching an HDMI monitor. Also, is the GPU activated and rendering any objects on the screen?

Regards,
Krunal

0 Nate Drude over 3 years ago in reply to Krunal Bhargav34

Prodigy 180 points

Hi Krunal,

> I am curious why you are running 20+ instances of memtester

We are running 20+ instances of memtester to verify our DDR configuration.

> and attaching a HDMI monitor.

On our custom hardware, an LVDS display is always enabled. We discovered that the RCU errors did not occur on our hardware if the dss and dss_ports device tree nodes were disabled. We also discovered that the RCU errors do not occur on the TI AM62 EVK unless the HDMI monitor is attached, since without an HDMI monitor the framebuffer is not initialized.

> Also, is GPU activated and rendering any objects on the screen.

During our normal memtester testing, the display is showing the weston desktop.

I did try running /usr/bin/SGX/demos/Wayland/OpenGLESBinaryShaders while also running memtester; it did not seem to have any impact.

0 Krunal Bhargav34 over 3 years ago in reply to Nate Drude

TI__Mastermind 45035 points

Thanks! I am trying the above on my setup and will let you know if I observe the same behavior. In general, if HDMI is not connected, I am assuming no stalls are observed even if we increase the number of memtester instances?

Regards,
Krunal

0 Nate Drude over 3 years ago in reply to Krunal Bhargav34

Prodigy 180 points

> In general, if HDMI is not connected, I am assuming no stalls are observed even if we increase the number of memtester instances?

I don't know if adding more threads will make it worse. If I recall correctly, I was even able to reproduce this on a 1GB system with ~5 threads.

Thanks for your help investigating this!

Nate

0 Krunal Bhargav34 over 3 years ago in reply to Nate Drude

TI__Mastermind 45035 points

Hi Nate,

Just wanted to update you that I am observing the issue on my side as well and still trying to debug what's causing the stall.

Regards,
Krunal

0 Krunal Bhargav34 over 3 years ago in reply to Krunal Bhargav34

TI__Mastermind 45035 points

Hi Nate,

With that many memtester instances running, the only way to avoid the stall is to add swap memory. I added 1GB of swap memory and did not see the issue.

Regards,
Krunal

0 Nate Drude over 3 years ago in reply to Krunal Bhargav34

Prodigy 180 points

Krunal Bhargav34 said:
With that many memtester instances running, the only way to avoid the stall is to add swap memory. I added 1GB of swap memory and did not see the issue.

Hi Krunal,

I don't think it's related to swap, or running out of memory. During the test, we make sure to leave 500MB of memory free.

Regards,

Nate

0 Krunal Bhargav34 over 2 years ago in reply to Nate Drude

TI__Mastermind 45035 points

Hi Nate,

I came across this document and it might be useful for your use case: https://docs.kernel.org/RCU/stallwarn.html#fine-tuning-the-rcu-cpu-stall-detector
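
As a starting point for the tuning described in that document, the current stall timeout can be read from sysfs. This is a minimal sketch: the function name `rcu_stall_timeout` is hypothetical, and the sysfs path is the one given in the linked kernel documentation, so it may be absent on kernels built without the stall detector.

```shell
# Hypothetical helper: print the RCU CPU stall timeout in seconds.
# $1 is the parameter file; it defaults to the standard sysfs location.
rcu_stall_timeout() {
    local param=${1:-/sys/module/rcupdate/parameters/rcu_cpu_stall_timeout}
    if [ -r "$param" ]; then
        cat "$param"
    else
        echo "unknown"
    fi
}

rcu_stall_timeout
```

Note that raising the timeout only changes when the warning is printed; it does not remove whatever is keeping the CPU from reaching a quiescent state.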

Regards,
Krunal
