AM67A: EDGE-AI image randomly fails to boot

Part Number: AM67A
Other Parts Discussed in Thread: SYSCONFIG

Hello,

We built a custom board with a different DDR: 4GB instead of the 8GB on the J722S EVM.

I have created the EDGE-AI images using the Linux and RTOS SDKs and set the memory addresses correctly.

The problem is this:
Most of the time the kernel boots, and I can access the terminal via HDMI and use Linux. But only very rarely (twice in total) has the EDGE-AI App Gallery screen appeared. As I said, it usually just displays the Linux terminal.

I don't understand why the EDGE-AI App Gallery works only randomly. I'm using the same image every time, so I can't explain this randomness.

When the App Gallery doesn't appear, it reports that some Qt elements are missing or that some .so files are invalid. Why would these files be deleted or corrupted?

Could it be related to the DDR configuration? 
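One way to tell whether the Qt libraries are really changing on storage, rather than misbehaving only in memory, is to fingerprint every shared object after a good boot and compare after a bad one. The sketch below is a hypothetical helper, not part of the Edge AI SDK: it records a SHA-256 per `*.so*` file and marks files that no longer start with the ELF magic. The directory you scan (for example `/usr/lib`) is an assumption about where the Qt libraries live on your image.

```python
import hashlib
from pathlib import Path

ELF_MAGIC = b"\x7fELF"  # every valid ELF shared object starts with these 4 bytes


def scan_shared_objects(root: str) -> dict[str, str]:
    """Map each regular *.so* file under root to its SHA-256, or 'NOT-ELF' if the header is wrong."""
    results = {}
    for path in sorted(Path(root).rglob("*.so*")):
        # Skip symlinks (e.g. libfoo.so -> libfoo.so.6) so each file is hashed once.
        if path.is_symlink() or not path.is_file():
            continue
        data = path.read_bytes()
        if not data.startswith(ELF_MAGIC):
            results[str(path)] = "NOT-ELF"
        else:
            results[str(path)] = hashlib.sha256(data).hexdigest()
    return results


def changed_files(before: dict[str, str], after: dict[str, str]) -> list[str]:
    """Paths whose digest differs between two scans, including added or removed files."""
    return sorted(p for p in set(before) | set(after) if before.get(p) != after.get(p))
```

Save the result of `scan_shared_objects("/usr/lib")` (for example as JSON) after a boot where the gallery works, then compare it with a fresh scan after a failing boot. An empty `changed_files` list would suggest the files on disk are intact and the problem happens at run time.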

  • Hello,

    It could be related to DDR or something else as well. Is there a crash or some random hang?
    If you are always booting to Linux consistently and the failure is with 

    I don't understand why the EDGE-AI App Gallery randomly works. I'm using the same image every time, and this is a strange thing. I can't explain this randomness.

    This could happen if some ordering requirement is not met, but I am not sure either. Do you see a random crash, or is the missing gallery the only issue?

    - Keerthy

  • It could be related to DDR or something else as well. Is there a crash or some random hang?

    Both the Linux boot issue and the gallery issue were present. After changing the DDRs, the kernel now boots every time, which is fine. However:

    Do you see a random crash or only missing gallery is the issue?

    The problem for now is the missing gallery. Sometimes the gallery doesn't open at all, and sometimes it opens but freezes when I select anything from its menu (for example, the Image Classification button).

    We tried a few more tests using this image in case it was related to the DDR configuration.

    We soldered two different 4GB LPDDR4 memories onto otherwise identical boards. Edge AI works perfectly on one; the other has the problems I mentioned above.

    I'm not sure whether I've configured the DDR correctly. Everything looks correct according to the datasheet. Is there any documentation on which settings are critical?

    These are the logs (from the board we are having problems with) when I click any button in the gallery:

    root@j722s-evm:/opt/edgeai-gst-apps# watch dmesg | tail
    [   67.427094] kauditd_printk_skb: 5 callbacks suppressed
    [   67.427115] audit: type=1701 audit(1741187672.628:19): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=615 comm="edgeai-gui-app" exe="/usr/bin/edgeai-gui-app" sig=11 res=1
    [   67.467089] audit: type=1334 audit(1741187672.668:20): prog-id=20 op=LOAD
    [   67.473978] audit: type=1334 audit(1741187672.672:21): prog-id=21 op=LOAD
    [   67.480827] audit: type=1334 audit(1741187672.680:22): prog-id=22 op=LOAD
    [   80.386657] audit: type=1334 audit(1741187685.588:23): prog-id=22 op=UNLOAD
    [   80.393675] audit: type=1334 audit(1741187685.588:24): prog-id=21 op=UNLOAD
    [   80.400660] audit: type=1334 audit(1741187685.588:25): prog-id=20 op=UNLOAD

    After this, the buttons disappear, only the background image is shown and the app freezes, but I can still access the terminal over the UART pins.
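    To see at a glance which process died and how, the `type=1701` audit records can be pulled out of a dmesg dump: 1701 is AUDIT_ANOM_ABEND (abnormal process termination) and `sig=` carries the signal number, so `sig=11` above means `edgeai-gui-app` was terminated by SIGSEGV. The helper below is a hypothetical sketch, not part of the Edge AI SDK.

```python
import re
import signal

# Fields of an AUDIT_ANOM_ABEND (type=1701) record as printed in dmesg.
ABEND_RE = re.compile(
    r'type=1701 .*?\bpid=(?P<pid>\d+) comm="(?P<comm>[^"]*)" '
    r'exe="(?P<exe>[^"]*)" sig=(?P<sig>\d+)'
)


def signal_name(num: int) -> str:
    """Map a signal number to its name, e.g. 11 -> 'SIGSEGV'."""
    try:
        return signal.Signals(num).name
    except ValueError:
        return f"signal {num}"


def find_crashes(log_text: str) -> list[dict]:
    """Return one dict per abnormal-termination audit record in a dmesg dump."""
    crashes = []
    for line in log_text.splitlines():
        m = ABEND_RE.search(line)
        if m:
            crashes.append({
                "pid": int(m["pid"]),
                "comm": m["comm"],
                "exe": m["exe"],
                "signal": signal_name(int(m["sig"])),
            })
    return crashes
```

    Feeding it the output of `dmesg` lists every crashed process; the `type=1334` lines (BPF program load/unload) are ignored.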

  • I am looping in a DDR expert to make sure that's fine.

  • Thank you.

    For the DDR expert, this is the DDR we are having problems with: RS1G32LX4D4BNR-53BT

    DDR on the working board: B3221PM3BDGUI-U

    Aren't these two DDRs more or less identical?

  • Here are the extra logs I get when I click the "Image Classification" button. The first part of the log comes from connecting a USB hub.

    root@j722s-evm:/opt/edgeai-gst-apps# watch dmesg | tail
    [   73.038480] usb 1-1.3: new high-speed USB device number 4 using xhci-hcd
    [   73.132080] hub 1-1.3:1.0: USB hub found
    [   73.136176] hub 1-1.3:1.0: 4 ports detected
    [   73.148852] kauditd_printk_skb: 5 callbacks suppressed
    [   73.148868] audit: type=1701 audit(1741187678.280:26): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=1227 comm="(udev-worker)" exe="/usr/bin/udevadm" sig=6 res=1
    [   73.177635] audit: type=1334 audit(1741187678.308:27): prog-id=23 op=LOAD
    [   73.184508] audit: type=1334 audit(1741187678.316:28): prog-id=24 op=LOAD
    [   73.191351] audit: type=1334 audit(1741187678.324:29): prog-id=25 op=LOAD
    [   73.418621] usb 1-1.3.1: new full-speed USB device number 5 using xhci-hcd
    [   73.525464] input: Logitech USB Receiver as /devices/platform/bus@f0000/f920000.usb/31200000.usb/xhci-hcd.8.auto/usb1/1-1/1-1.3/1-1.3.1/1-1.3.1:1.0/0003:046D:C52B.0001/input/input1
    [   73.662691] audit: type=1334 audit(1741187678.796:30): prog-id=25 op=UNLOAD
    [   73.669704] audit: type=1334 audit(1741187678.796:31): prog-id=24 op=UNLOAD
    [   73.670860] hid-generic 0003:046D:C52B.0001: input: USB HID v1.11 Keyboard [Logitech USB Receiver] on usb-xhci-hcd.8.auto-1.3.1/input0
    [   73.676734] audit: type=1334 audit(1741187678.796:32): prog-id=23 op=UNLOAD
    [   73.692448] input: Logitech USB Receiver Mouse as /devices/platform/bus@f0000/f920000.usb/31200000.usb/xhci-hcd.8.auto/usb1/1-1/1-1.3/1-1.3.1/1-1.3.1:1.1/0003:046D:C52B.0002/input/input2
    [   73.712938] input: Logitech USB Receiver Consumer Control as /devices/platform/bus@f0000/f920000.usb/31200000.usb/xhci-hcd.8.auto/usb1/1-1/1-1.3/1-1.3.1/1-1.3.1:1.1/0003:046D:C52B.0002/input/input3
    [   73.786843] input: Logitech USB Receiver System Control as /devices/platform/bus@f0000/f920000.usb/31200000.usb/xhci-hcd.8.auto/usb1/1-1/1-1.3/1-1.3.1/1-1.3.1:1.1/0003:046D:C52B.0002/input/input4
    [   73.804491] hid-generic 0003:046D:C52B.0002: input: USB HID v1.11 Mouse [Logitech USB Receiver] on usb-xhci-hcd.8.auto-1.3.1/input1
    [   73.819054] hid-generic 0003:046D:C52B.0003: device has no listeners, quitting
    [   73.898487] usb 1-1.3.3: new low-speed USB device number 6 using xhci-hcd
    [   74.002990] input: Primax Lenovo Traditional USB Keyboard as /devices/platform/bus@f0000/f920000.usb/31200000.usb/xhci-hcd.8.auto/usb1/1-1/1-1.3/1-1.3.3/1-1.3.3:1.0/0003:17EF:6099.0004/input/input6
    [   74.146728] hid-generic 0003:17EF:6099.0004: input: USB HID v1.10 Keyboard [Primax Lenovo Traditional USB Keyboard] on usb-xhci-hcd.8.auto-1.3.3/input0
    [   74.198946] audit: type=1701 audit(1741187679.332:33): auid=4294967295 uid=0 gid=0 ses=4294967295 subj=kernel pid=1230 comm="(udev-worker)" exe="/usr/bin/udevadm" sig=11 res=1
    [   74.227322] audit: type=1334 audit(1741187679.360:34): prog-id=26 op=LOAD
    [   74.234220] audit: type=1334 audit(1741187679.364:35): prog-id=27 op=LOAD
    [   85.330507] ti-sci 44043000.system-controller: Mbox timedout in resp(caller: ti_sci_cmd_get_device_exclusive+0x18/0x24)
    [   85.341341] ti-sci 44043000.system-controller: Mbox send fail -110
    [   86.482485] ti-sci 44043000.system-controller: Mbox timedout in resp(caller: ti_sci_cmd_get_device_exclusive+0x18/0x24)
    [   86.493320] ti-sci 44043000.system-controller: Mbox send fail -110
    [   86.502843] Internal error: synchronous external abort: 0000000096000010 [#1] PREEMPT SMP
    [   86.511028] Modules linked in: overlay bluetooth ecdh_generic ecc cfg80211 rfkill onboard_usb_dev wave5 cdns3 cdns_usb_common videobuf2_dma_contig rpmsg_ctrl v4l2_mem2mem rpmsg_char videobuf2_v4l2 snd_soc_hdmi_codec pci_endpoint_test videobuf2_memops crct10dif_ce videobuf2_common tidss snd_soc_simple_card at24 snd_soc_davinci_mcasp snd_soc_simple_card_utils cdns3_ti videodev tps65219_pwrbutton pwm_fan rti_wdt k3_j72xx_bandgap rtc_ds1307 drm_dma_helper display_connector snd_soc_ti_udma ti_k3_dsp_remoteproc rtc_ti_k3 ti_k3_r5_remoteproc mc drm_display_helper sii902x snd_soc_ti_edma ti_k3_common snd_soc_ti_sdma mcrc64 sa2ul drm_kms_helper pwm_tiehrpwm pwm_tiecap omap_mailbox omap_hwspinlock drm fuse drm_panel_orientation_quirks backlight ipv6
    [   86.576582] CPU: 1 UID: 0 PID: 1326 Comm: multifilesrc0:s Not tainted 6.12.33-g3f6fedd7142e-dirty #2
    [   86.585697] Hardware name: Custom-AI (DT)
    [   86.591167] pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
    [   86.598114] pc : wave5_vdi_read_register+0x8/0x20 [wave5]
    [   86.603519] lr : wave5_vpu_is_init+0x14/0x28 [wave5]
    [   86.608480] sp : ffff8000865b3a90
    [   86.611782] x29: ffff8000865b3a90 x28: ffff000005d3a010 x27: 0000000000000000
    [   86.618907] x26: ffff0008416ec930 x25: ffff0008419b4300 x24: ffff000005d3bcc0
    [   86.626032] x23: ffff0008416ec950 x22: ffff0008416ec880 x21: ffff000005d3a000
    [   86.633156] x20: 0000000000000000 x19: ffff8000865b3ae8 x18: 0000000000000006
    [   86.640281] x17: 3a72656c6c616328 x16: 00000000f0000000 x15: 0000000000000001
    [   86.647406] x14: 0000000000000000 x13: 0000000000000002 x12: 0000000000020be9
    [   86.654530] x11: 0000000000000068 x10: 0000000000000000 x9 : 0000000000000000
    [   86.661654] x8 : ffff8000871b7000 x7 : 0000000000000000 x6 : 000000000000003f
    [   86.668778] x5 : 0000000000000040 x4 : 0000000000000000 x3 : 0000000000000000
    [   86.675902] x2 : 0000000000000000 x1 : 0000000000000004 x0 : ffff800083810004
    [   86.683027] Call trace:
    [   86.685464]  wave5_vdi_read_register+0x8/0x20 [wave5]
    [   86.690511]  wave5_vpu_dec_open+0x88/0x158 [wave5]
    [   86.695299]  wave5_vpu_dec_start_streaming+0x1ac/0x334 [wave5]
    [   86.701126]  vb2_start_streaming+0x68/0x178 [videobuf2_common]
    [   86.706961]  vb2_core_streamon+0x100/0x1c4 [videobuf2_common]
    [   86.712701]  vb2_streamon+0x18/0x64 [videobuf2_v4l2]
    [   86.717669]  v4l2_m2m_ioctl_streamon+0x38/0x98 [v4l2_mem2mem]
    [   86.723419]  v4l_streamon+0x24/0x30 [videodev]
    [   86.727919]  __video_do_ioctl+0x330/0x3fc [videodev]
    [   86.732916]  video_usercopy+0x2e0/0x67c [videodev]
    [   86.737737]  video_ioctl2+0x18/0x28 [videodev]
    [   86.742212]  v4l2_ioctl+0x40/0x60 [videodev]
    [   86.746513]  __arm64_sys_ioctl+0xac/0xf0
    [   86.750431]  invoke_syscall+0x48/0x10c
    [   86.754173]  el0_svc_common.constprop.0+0xc0/0xe0
    [   86.758868]  do_el0_svc+0x1c/0x28
    [   86.762174]  el0_svc+0x28/0x98
    [   86.765223]  el0t_64_sync_handler+0x120/0x12c
    [   86.769569]  el0t_64_sync+0x190/0x194
    [   86.773227] Code: b9000022 d65f03c0 f940b000 8b214000 (b9400000)
    [   86.779305] ---[ end trace 0000000000000000 ]---
    

    root@j722s-evm:/opt/edgeai-gst-apps# watch dmesg | tail
    [  239.987152] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
    [  239.993261] rcu:     1-...0: (4 GPs behind) idle=4964/1/0x4000000000000000 softirq=11347/11347 fqs=14717
    [  240.002467] rcu:     (detected by 3, t=41850 jiffies, g=12601, q=5932 ncpus=4)
    [  240.009415] Sending NMI from CPU 3 to CPUs 1:
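    This second log fails differently from the first: after two `ti_sci_cmd_get_device_exclusive` mailbox timeouts (-110 is -ETIMEDOUT on Linux), the kernel takes a synchronous external abort inside `wave5_vdi_read_register`. The sketch below decodes the ESR value printed on the abort line using the AArch64 ESR_ELx field layout; it is an illustrative helper, not a kernel or TI tool, and only a few codes are tabulated.

```python
# Exception Class values (ESR_ELx.EC, bits [31:26]) for data aborts, per the Arm ARM.
EC_NAMES = {
    0x24: "Data Abort from a lower Exception level",
    0x25: "Data Abort taken without a change in Exception level",
}

# A few Data Fault Status Codes (ISS.DFSC, bits [5:0]).
DFSC_NAMES = {
    0x10: "Synchronous External abort, not on translation table walk",
    0x21: "Alignment fault",
}


def decode_data_abort_esr(esr: int) -> dict:
    """Split an AArch64 ESR value from a data abort into its main fields."""
    ec = (esr >> 26) & 0x3F   # Exception Class
    il = (esr >> 25) & 0x1    # 1 = the faulting instruction was 32 bits long
    wnr = (esr >> 6) & 0x1    # Write-not-Read: 0 means the access was a read
    dfsc = esr & 0x3F         # Data Fault Status Code
    return {
        "ec": ec,
        "ec_name": EC_NAMES.get(ec, "not tabulated here"),
        "il": il,
        "access": "write" if wnr else "read",
        "dfsc": dfsc,
        "dfsc_name": DFSC_NAMES.get(dfsc, "not tabulated here"),
    }


# Value from the "synchronous external abort: 0000000096000010" line above.
info = decode_data_abort_esr(0x96000010)
print(f"EC=0x{info['ec']:02x} ({info['ec_name']}), {info['access']}, "
      f"DFSC=0x{info['dfsc']:02x} ({info['dfsc_name']})")
```

    Decoded this way, 0x96000010 is a same-level data abort on a read with DFSC 0x10, i.e. the memory system returned an error for the load in `wave5_vdi_read_register`. The log alone does not say why that access was rejected.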

  • Did you get any response from the expert?

  • Hi,

    A kind reminder.

    Best regards,

    Daniel Waleniak

  • Hi,

    Both the Linux boot issue and the gallery issue were present. After changing the DDRs, the kernel now boots every time.

    Do you mean that you replaced the physical DDR on the board and the behavior of the system changed?

    We soldered two different 4GB LPDDR4 memories onto otherwise identical boards. Edge AI works perfectly on one; the other has the problems I mentioned above.

    Do you mean that you have only tested two boards, and they have different memories?

    Or do you have two different sets of boards, where ALL boards with one DDR vendor fail and ALL boards with the other DDR vendor pass?

    For the DDR expert, this is the DDR we are having problems with: RS1G32LX4D4BNR-53BT

    DDR on the working board: B3221PM3BDGUI-U

    Aren't these two DDRs more or less identical?

    I am not sure what you mean by identical, but since they come from two different memory vendors, I wouldn't call them identical even if they had the same bus width, number of ranks/die, density, etc.

    But if you look at page 6 of the Kingston datasheet and page 5 of the Rayson datasheet, these memories do not even appear to have the same architecture. The Kingston memory is single rank with 16Gb channel density, whereas the Rayson memory is dual rank with 8Gb channel density.

    Are you using the same DDR register configuration for both? If so, that would definitely be an issue for whichever memory has the incorrect configuration.

    Regards,
    Kevin

  • Thank you for your replies.

    Do you mean that you replaced the physical DDR on the board and the behavior of the system changed?

    Yes, that's the case. We didn't find any problems on Linux other than Edge AI. I think that if the DDR configuration were wrong, the board would get stuck in U-Boot or during kernel startup.

    Do you mean that you have only tested 2x boards, and they have different memories?

    OR, you have 2x different set of boards, and ALL boards with 1x DDR vendor fails, and ALL boards with the other memory vendor pass?

    Sorry for the confusion. We built two of our custom AM67A boards and, for the experiment, fitted them with different DDRs, so the boards are identical except for the DDR.
    The first board has a 4GB Rayson DDR.
    The second board has a 4GB Kingston DDR.

    Are you using the same DDR register configuration for both? If so, that would definitely be an issue for whichever memory has the incorrect configuration.

    You're right, my earlier message was misleading. We already use two different configurations for these two DDRs:

    Rayson:

    LPDDR4 Boot Frequency (MHz) : 50
    LPDDR4 Operating Frequency (MHz) : 1866
    DDR Data Bus Width (Bits) : 32
    DDR Density (per x16-bit channel, per rank)(Gbit) : 8
    Chip Selects / Ranks : 2
    Max Operating Temperature : <= 85C

    Kingston:

    LPDDR4 Boot Frequency (MHz) : 50
    LPDDR4 Operating Frequency (MHz) : 1866
    DDR Data Bus Width (Bits) : 32
    DDR Density (per x16-bit channel, per rank)(Gbit) : 16
    Chip Selects / Ranks : 1
    Max Operating Temperature : > 85C
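    As a cross-check of the two parameter sets above, total capacity is the density per x16 channel per rank, times the number of x16 channels, times the number of ranks. The sketch below uses only the values listed in this post; the `Lpddr4Config` class is a hypothetical helper, not part of SysConfig.

```python
from dataclasses import dataclass


@dataclass
class Lpddr4Config:
    name: str
    bus_width_bits: int   # total DDR data bus width
    density_gbit: int     # per x16 channel, per rank (as entered in SysConfig)
    ranks: int            # chip selects

    @property
    def channels(self) -> int:
        # LPDDR4 channels are 16 bits wide, so a 32-bit bus has two channels.
        return self.bus_width_bits // 16

    @property
    def total_gbit(self) -> int:
        return self.density_gbit * self.channels * self.ranks

    @property
    def total_gbyte(self) -> float:
        return self.total_gbit / 8


rayson = Lpddr4Config("Rayson", bus_width_bits=32, density_gbit=8, ranks=2)
kingston = Lpddr4Config("Kingston", bus_width_bits=32, density_gbit=16, ranks=1)

for cfg in (rayson, kingston):
    print(f"{cfg.name}: {cfg.channels} ch x {cfg.ranks} rank(s) x "
          f"{cfg.density_gbit} Gbit = {cfg.total_gbyte:g} GB")
```

    Both come to 4 GB, but through a different split of density and rank count, so the two parts cannot share one register configuration.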

    I don't know how to send you the SysConfig file with the full configuration, so here are screenshots instead:

    [SysConfig DDR configuration screenshots for Rayson and Kingston; images not preserved]
  • Hi,

    Sorry for the confusion. We built two of our custom AM67A boards and, for the experiment, fitted them with different DDRs, so the boards are identical except for the DDR.
    The first board has a 4GB Rayson DDR.
    The second board has a 4GB Kingston DDR.

    So you only have 2x physical boards built, both of which are the same board design except 1x currently has a Kingston memory and 1x has a Rayson memory?

    If so, I would suggest building more boards to give insight as to whether this may be a systematic issue or unique to the one failing board.

    You could also replace the Rayson memory with a Kingston memory to see if that resolves the issue, and then put the Rayson memory back on the board to confirm the original issue still exists. This would also give insight as to whether the problem is unique to the Rayson memory. However, I would suggest building additional boards first as removing / replacing the memory may start to wear out the PCB pads.

    Regards,
    Kevin

  • I understand your approach. Thanks for the advice. Many new boards are already being produced.

    But is there a way (tools, scripts?) to test the DDR memories before the new boards arrive? We tried the memtest tool, but I'm not sure whether it tested all the DDR addresses.

  • If you are able to get to a Linux prompt, you should run memtester to test the integrity of your DDR interface.

    If you can't get to a Linux prompt, you can add the patch described here: https://e2e.ti.com/support/processors-group/processors/f/processors-forum/1358039/faq-board-bring-up-tips-for-sitara-devices-am64x-am243x-am62x-am62l-am62ax-am62d-q1-am62px#:~:text=Running%20memtester%20(DDR%20memory%20test)%20from%20R5

    This adds memtester to the R5 code so that it runs immediately after DDR initialization.

    Initially, you can run it at room temperature overnight to see if you get any errors. Ideally, you should eventually run it across your operating temperature range to ensure the robustness of your design and DDR configuration. If you don't get any errors, you can be fairly confident that you are not dealing with DDR errors, and you can focus on possible software problems.
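    For an overnight run from the Linux prompt, a small wrapper can repeat memtester and record any failing pass. This is a hypothetical sketch, not part of the TI SDK: the `512M` buffer size and the duration are placeholder choices, and the exit-code bits are the ones memtester's source defines. Run it as root so memtester can lock its buffer.

```python
import subprocess
import time

# Exit-code bits defined in memtester's source (memtester.c).
EXIT_FAIL_NONSTARTER = 0x01    # could not allocate/lock memory, or bad arguments
EXIT_FAIL_ADDRESSLINES = 0x02  # the stuck address test failed
EXIT_FAIL_OTHERTEST = 0x04     # one of the data-pattern tests failed


def describe_exit(code: int) -> list[str]:
    """Translate a memtester exit code into human-readable reasons."""
    if code == 0:
        return ["all tests passed"]
    if code < 0:
        # subprocess reports death by signal N as return code -N.
        return [f"killed by signal {-code}"]
    reasons = []
    if code & EXIT_FAIL_NONSTARTER:
        reasons.append("memtester could not start (allocation, locking or arguments)")
    if code & EXIT_FAIL_ADDRESSLINES:
        reasons.append("stuck address test failed")
    if code & EXIT_FAIL_OTHERTEST:
        reasons.append("a data-pattern test failed")
    return reasons or [f"unexpected exit code {code}"]


def soak(hours: float, size: str = "512M") -> list[tuple[int, list[str]]]:
    """Run single-loop memtester passes until `hours` elapse; return the failing passes."""
    deadline = time.monotonic() + hours * 3600
    failures = []
    run = 0
    while time.monotonic() < deadline:
        run += 1
        result = subprocess.run(["memtester", size, "1"], capture_output=True, text=True)
        if result.returncode != 0:
            failures.append((run, describe_exit(result.returncode)))
    return failures
```

    For example, `print(soak(hours=8))` prints an empty list if every pass succeeded.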

    Regards,

    James

  • This is the tool I needed. Thank you so much.

  • Hello JJD,

    Is it generally necessary to exclude reserved memory addresses when using memtester? I ask because I normally ran memtest from the U-Boot prompt while avoiding reserved memory addresses.

    Does this patch access all DDR addresses, including reserved memory?

    I'm asking because I don't know whether I should remove the reserved-memory nodes from the U-Boot DTS.
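    For the U-Boot memory test mentioned above, one way to stay clear of reserved regions is to subtract them from the DDR window and test each remaining gap separately. The sketch below does that interval subtraction; the window and reserved ranges in the example are placeholders, not values from an AM67A device tree.

```python
Range = tuple[int, int]  # (start, end), end exclusive


def subtract_ranges(window: Range, reserved: list[Range]) -> list[Range]:
    """Return the parts of `window` not covered by any reserved range."""
    start, end = window
    free = []
    cursor = start
    for r_start, r_end in sorted(reserved):
        if r_end <= cursor or r_start >= end:
            continue  # reserved range lies outside what is left of the window
        if r_start > cursor:
            free.append((cursor, r_start))
        cursor = max(cursor, r_end)
    if cursor < end:
        free.append((cursor, end))
    return free


# Hypothetical example: a 2 GB window with two made-up reserved carve-outs.
for lo, hi in subtract_ranges((0x8000_0000, 0x1_0000_0000),
                              [(0xA000_0000, 0xA100_0000), (0x9C80_0000, 0x9E80_0000)]):
    print(f"0x{lo:09x} - 0x{hi:09x}")
```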

    Also, does the patch work with 4GB DDR? On the AM67A the lower and upper 2GB are not contiguous, and I see that the AM625-SK has only 2GB of DDR.
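    The non-contiguous layout can be sketched as follows, assuming the usual K3 arrangement of a 2 GB DDR window at 0x8000_0000 and a second window at 0x8_8000_0000 for anything above 2 GB. These base addresses are an assumption here, so check them against the AM67A TRM memory map and your device tree. The hypothetical helper turns a total DDR size into the `reg` cells of a device-tree memory node with two address cells and two size cells.

```python
LOW_BASE = 0x8000_0000       # assumed start of the 32-bit-addressable DDR window
LOW_SIZE = 0x8000_0000       # assumed size of that window (2 GB)
HIGH_BASE = 0x8_8000_0000    # assumed start of the upper DDR window

GiB = 1 << 30


def ddr_regions(total_bytes: int) -> list[tuple[int, int]]:
    """Split total DDR into (base, size) pairs: low window first, remainder high."""
    regions = [(LOW_BASE, min(total_bytes, LOW_SIZE))]
    if total_bytes > LOW_SIZE:
        regions.append((HIGH_BASE, total_bytes - LOW_SIZE))
    return regions


def dt_reg_cells(regions: list[tuple[int, int]]) -> str:
    """Format (base, size) pairs as <addr-hi addr-lo size-hi size-lo> entries."""
    return ", ".join(
        "<0x%08x 0x%08x 0x%08x 0x%08x>"
        % (base >> 32, base & 0xFFFF_FFFF, size >> 32, size & 0xFFFF_FFFF)
        for base, size in regions
    )


print(dt_reg_cells(ddr_regions(4 * GiB)))
```

    With these assumptions, a 4 GB part maps as 2 GB at 0x8000_0000 plus 2 GB at 0x8_8000_0000, so a test that allocates a small buffer (as the memtester patch does) never needs to span the gap.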

  • Memtester isn't intended to test individual memory cells or the full DDR capacity. It is intended to stress the interface, which helps identify possible signal-integrity issues. Thus it is not necessary to change anything about the memory map of the device. The memtester patch allocates 32MB of memory to run the tests.
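    To make the pattern-based approach concrete, here is a much-simplified Python rendering of three of memtester's test ideas (stuck address, random value and walking ones) on two small buffers standing in for the two halves memtester splits its allocation into. It is illustrative only: Python arrays sit behind caches and the interpreter, so this cannot stress a DDR interface the way memtester or the R5 patch does.

```python
import random
from array import array

WORD_BITS = array("I").itemsize * 8   # width of one test word (normally 32)
MASK = (1 << WORD_BITS) - 1


def stuck_address(buf: array) -> list[int]:
    """Write each word's own index into it, then read back; mismatches hint at address-line faults."""
    for i in range(len(buf)):
        buf[i] = i & MASK
    return [i for i in range(len(buf)) if buf[i] != (i & MASK)]


def random_value(a: array, b: array, seed: int = 0) -> list[int]:
    """Write identical random words to both buffers and compare them."""
    rng = random.Random(seed)
    for i in range(len(a)):
        a[i] = b[i] = rng.getrandbits(WORD_BITS)
    return [i for i in range(len(a)) if a[i] != b[i]]


def walking_ones(a: array, b: array) -> list[int]:
    """Walk a single set bit across the word, alternating it with its complement."""
    failures = []
    for bit in range(WORD_BITS):
        one = 1 << bit
        for i in range(len(a)):
            value = one if i % 2 == 0 else (~one & MASK)
            a[i] = b[i] = value
        failures += [i for i in range(len(a)) if a[i] != b[i]]
    return failures


def run_tests(words: int = 1024) -> dict[str, int]:
    """Run each simplified test once and count mismatching words."""
    a = array("I", [0]) * words
    b = array("I", [0]) * words
    return {
        "stuck_address": len(stuck_address(a)),
        "random_value": len(random_value(a, b)),
        "walking_ones": len(walking_ones(a, b)),
    }


print(run_tests())  # every count is 0 when reads return what was written
```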

    Regards,

    James