TDA4VH-Q1: REPLY: After NFS is mounted, the system occasionally encounters the “serror 0xbf000002” exception.

Part Number: TDA4VH-Q1
Other Parts Discussed in Thread: TDA4VH

Hi, Diwakar

We modified the M4 trace configuration as follows:

.trace_dst_enables = BOARDCFG_TRACE_DST_UART0,
.trace_src_enables = BOARDCFG_TRACE_SRC_SEC,

With this configuration, we reproduced the SError issue twice. The M4 output was as follows:

First time:
FWL Bit 0x1E
Exception addr 0x45B0B000
FWL Exception 0x110E000
0x10000
0x33D4000C
0x0
0x2201
0x4
Second time:
FWL Bit 0x1E
Exception addr 0x45B0B000
FWL Exception 0x110E000
0x70000
0x33D3D004
0x0
0x2201
0x4

This shows that the SError is caused by accesses to the DMA interrupt configuration registers (0x33D4000C and 0x33D3D004).

We added debug prints in the kernel and confirmed that register 0x33D3D004 is accessed when configuring mpu_1_0_ethmac-device-6, and register 0x33D4000C when configuring mpu_1_0_ethmac-device-7.

From these observations, it appears that during network DMA initialization, the firewall permissions for the DMA interrupt configuration registers occasionally fail to be configured.

Please help us resolve this issue. Thank you very much.

Best regards

Alex

  • Hi, TI experts,

    We have also previously encountered an MCU2_0 crash. The Oops information we captured is as follows:

    Exception data abort on task context!!!
    Exception pc : 0x99fc07e4
    Exception sp : 0x99917f60
    Current task : Lwip2Enet_RxPacketTask
    r0 : 0x00000001 r1 : 0x00000000
    r2 : 0x00000001 r3 : 0x00000000
    r4 : 0x00000000 r5 : 0x33db7000
    r6 : 0x00000001 r7 : 0x00000000
    r8 : 0x08080808 r9 : 0x99911d70
    r10 : 0x99910d34 r11 : 0x9a080899
    r12 : 0x00000008 r14 : 0x99fc07d0
    dfsr : 0x00001808 dfar : 0x33db7000
    ifsr : 0x00000000 ifar : 0x00000000

    We captured the M4 log using the same method as above; it is as follows:

    FWL Bit 0x1E
    Exception addr 0x45B0B000
    FWL Exception 0x110E000
    0x10000
    0x33DB7004
    0x0
    0xA8122D4
    0x4

    Based on the information obtained so far, this issue has the same cause as the SError problem.

    Please investigate the cause of the problem. Thank you very much.

    Best regards

    Alex

  • Hi Alex,

    Which SDK release are you using? We do not see these crashes on the EVM, so I am wondering what has changed. Have you modified anything in the SDK, for example in the resource manager or in the Ethernet driver?

    Regards,

    Brijesh

  • Hi, Brijesh

    We are using SDK 8.6, and the boot mode is SPL.

    We modified the rm_cfg.c file as shown below. Note that in this diff “+” marks the original SDK content and “-” marks our modification.

    @@ -187,42 +187,42 @@
    /* Main GPIO Interrupt router */
    {
    .start_resource = 0,
    - .num_resource = 24,
    + .num_resource = 4,
    .type = RESASG_UTYPE (J784S4_DEV_GPIOMUX_INTRTR0,
    RESASG_SUBTYPE_IR_OUTPUT),
    - .host_id = HOST_ID_ALL,
    + .host_id = HOST_ID_MAIN_0_R5_0,
    },
    {
    - .start_resource = 24,
    - .num_resource = 0,
    + .start_resource = 4,
    + .num_resource = 4,
    .type = RESASG_UTYPE (J784S4_DEV_GPIOMUX_INTRTR0,
    RESASG_SUBTYPE_IR_OUTPUT),
    .host_id = HOST_ID_MAIN_0_R5_2,
    },
    {

    All our modifications to the Linux network drivers are listed below; in each comparison, the left side is the original SDK content and the right side is our modification.

    linux/drivers/phy/ti/phy-gmii-sel.c:

    linux/drivers/soc/ti/k3-ringacc.c:

    linux/drivers/dma/ti/k3-udma.c:

    linux/drivers/net/phy/phy_device.c:

    We haven’t made any modifications to the RTOS network driver.

    Best regards

    Alex

  • Hi,

    I am coming up to speed on this issue; here are some thoughts based on the FWL exception provided above.

    The default firewall settings (priv IDs, etc.) for TDA4VH are available at the link below:

    The log above contains the following data, which can be mapped to registers in the TRM XLS:

    • Hdr[0] - 0x110E000
    • Hdr[1] - 0x10000
    • Data[0] - 0x33DB7004
    • Data[1] - 0x0
    • Data[2]  - 0xA8122D4
    • Data[3] - 0x4

    This should give more information about where and why the error is occurring. A quick look at Data[2] shows a priv ID of 0xD4 (212), which seems to indicate that a write access from MCU2_0 was not permitted.
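    As a minimal illustration of the reading above, the C sketch below extracts the priv ID from the logged Data[2] word. It assumes, as the reply does, that the priv ID is the least significant byte of Data[2]; that bit position is inferred from the values in this thread rather than taken from the TRM, so please confirm it against the register XLS. The two names in priv_id_name() (0xD4 = MCU2_0, 0x01 = non-secure A72) are the only mappings stated in this thread, and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Assumption: the priv ID is bits [7:0] of the logged Data[2] word.
 * Verify this against the FWL exception register description in the TRM XLS. */
uint32_t fwl_data2_priv_id(uint32_t data2)
{
    return data2 & 0xFFu;
}

/* Only the two priv IDs identified in this thread are named here. */
const char *priv_id_name(uint32_t priv_id)
{
    switch (priv_id) {
    case 0x01u:
        return "A72 (non-secure)";
    case 0xD4u:
        return "MCU2_0";
    default:
        return "unknown (look up in the TISCI priv ID table)";
    }
}

/* Print the decoded originator of a logged firewall exception. */
void print_fwl_source(uint32_t data2)
{
    uint32_t id = fwl_data2_priv_id(data2);
    printf("Data[2]=0x%08X -> priv ID 0x%02X (%u): %s\n",
           (unsigned)data2, (unsigned)id, (unsigned)id, priv_id_name(id));
}
```

    Calling print_fwl_source(0xA8122D4) reports priv ID 0xD4 (212), matching the MCU2_0 crash log, while the 0x2201 words in the SError captures decode to priv ID 0x01.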

    Given that a parallel thread mentioned that the North Bridge (NB) firewalls were modified, and that TI has not been able to reproduce the issue, the modified firewall settings should be reviewed:

    • "We configured the North Bridge (NB) firewall to allow only MCU1_0 to access the 1GB memory starting from address 0x80000000. After using the firewall configuration, the Universal DMA (UDMA) inside NAVSS is unable to access the memory. "

    Regards,

    kb

  • Hi, KB

    Thank you for explaining the meaning of the M4 log.

    Regarding your comment that the NB bridge firewall modifications mentioned in the parallel thread should be reviewed: that was only a test demo and has not been applied in the actual project. All relevant modifications we made are documented above.

    In addition, our tests show that on MCU2_0, Sciclient_rmIrqSet must be called before the 0x33db7000 register is accessed. The implementation of Sciclient_rmIrqSet appears to be closed source; we only know that it configures firewall permissions, not the specific process. You said that TI has not been able to reproduce the issue, but our environment may be exposing a latent problem in how the firewall permissions are configured. Can TI help investigate whether there is an issue with configuring the firewall permissions?

    Best regards

    Alex

  • Hi Alex,

    Please refer to the link below, which describes how to decode the M4 log, rather than the explanation above.

    Based on that link, line 4 of the M4 log above would seem to indicate that error 0x1 is occurring; please double-check.
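    The "line 4" check can be expressed as a small C helper. It assumes, following the later reply in this thread (Line 4[16-23]), that the error code sits in bits [23:16] of the word printed right after "FWL Exception" (Hdr[1]). The helper name is hypothetical, and the meaning of each code still has to come from the decoding reference.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption (from the "Line 4[16-23]" reading in this thread): the
 * error code is bits [23:16] of Hdr[1], the word after "FWL Exception". */
uint32_t fwl_hdr1_error_code(uint32_t hdr1)
{
    return (hdr1 >> 16) & 0xFFu;
}
```

    Applied to the logs above, it returns 0x1 for the 0x10000 entries and 0x7 for the second capture in the first post (0x70000), so that capture may be worth decoding separately.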

    If this is the case, the next step would be to read back the relevant firewall and, if it is indeed active and not enabled for this access, try deactivating it and running the test again. If the test passes, a decision can then be made whether the firewall should be deactivated or programmed so that it can remain active.

    The Sciclient_rmIrqSet() implementation is part of the PDK, at path packages/ti/drv/sciclient/src/sciclient/sciclient_rm.c, in the SDK RTOS downloads.

    Regarding the GPIO changes shared above, I wonder whether moving the GPIO resources for J784S4_DEV_WKUP_GPIOMUX_INTRTR0 from MCU2_0 to A72 is having an impact. Could you test without those changes and see whether the issue is still reproduced?

    Regards,

    kb

  • Hi KB,

    1. Sciclient_rmIrqSet asks MCU1_0 to handle the request by proxy, and the M4 then handles the security part. The source code of the M4 firewall configuration is not available to us. Could you provide an M4 firmware build that does not configure the firewall when Sciclient_rmIrqSet is called?

    Sciclient_service(TISCI_MSG_RM_IRQ_SET) [A72/MCU2_0, indirect mode] ---> Sciclient_service [MCU1_0] ---> Sciclient_serviceSecureProxy(pReqPrm, pRespPrm) ---> M4

    2. How can I read the configuration information of the 4320U firewall? We have attempted to read it but keep receiving an error indicating that we do not have permission to access the information.

    3. GPIO changes: we are rolling back the code to check whether the issue still reproduces.

    Best regards

    Alex

  • (1) The M4 firmware source code is not available for distribution on TDA4VH.

    (2) The 4320 channelized firewall has the default settings below. Which channel is being read? I suggest handling this on a separate E2E thread if further discussion is required.

    J784S4 Firewall Descriptions — TISCI User Guide

    (3) Thank you. We understand this effort will take some time; hopefully it will show whether the GPIO changes are related to the issue being seen.

    Regards,

    kb

  • Hi KB,

    (2) The 4320 channelized firewall has the default settings below. Which channel is being read?

    We tried to read the firewall information from MCU1_0 and A72 through the SCI interface.

    Pseudocode:

    firewallreq.fwl_id = 4320;
    firewallreq.region = 0, 1, 256; // Inputting both channel number and region number as parameters has been attempted.
    firewallreq.n_permission_regs = cfg->n_permission_regs;

    ret = Sciclient_firewallGetRegion(&firewallreq,&firewallresp,SCICLIENT_SERVICE_WAIT_FOREVER);

    May I ask if there is an issue with the parameters being passed? If so, could you please provide an example of a successful read with the correct parameters? Thank you.

  • We have another question we would like to ask:

    Why do the network driver's TX and RX paths need to disable and re-enable the DMA interrupts, while other modules that use DMA do not? Can we avoid disabling and enabling interrupts in the DMA-critical functions of the network driver in Linux and the PDK?

    Most of the issues we are encountering now involve accesses to the DMA interrupt (INTR) enable/disable registers.

    Linux:

    am65_cpsw_nuss_tx_irq

         -->disable_irq_nosync(irq)

    am65_cpsw_nuss_tx_poll

          ---> enable_irq(tx_chn->irq);

    PDK:

    Lwip2Enet_notifyRxPackets

         -->EnetDma_disableRxEvent(rx->hFlow)

    Lwip2Enet_rxPacketTask

        -->EnetDma_enableRxEvent(rx->hFlow)
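    For context, the Linux call sequence above has the shape of the standard NAPI interrupt-mitigation scheme: the hard IRQ handler masks the interrupt and schedules polling, and the poll routine unmasks it only after a poll handles fewer packets than its budget, meaning the queue has been drained. The PDK lwIP path (disable in the notify callback, re-enable in the RX task) follows a similar shape. The self-contained C simulation below is only a model of that scheme; struct rx_queue, rx_irq_handler(), rx_poll(), simulate_burst() and POLL_BUDGET are hypothetical names, and none of this is the am65-cpsw or EnetDma code.

```c
#include <assert.h>
#include <stdbool.h>

#define POLL_BUDGET 4 /* hypothetical per-poll packet budget */

struct rx_queue {
    int pending;         /* packets waiting in the simulated DMA ring */
    bool irq_enabled;    /* models the DMA event/interrupt enable state */
    bool poll_scheduled; /* models a scheduled NAPI poll or RX task wakeup */
    int irq_toggles;     /* number of mask + unmask operations performed */
};

/* Hard IRQ: mask the source and defer the work (cf. disable_irq_nosync). */
void rx_irq_handler(struct rx_queue *q)
{
    if (!q->irq_enabled)
        return; /* masked: a real interrupt would not be delivered */
    q->irq_enabled = false;
    q->irq_toggles++;
    q->poll_scheduled = true;
}

/* Poll: handle up to POLL_BUDGET packets; unmask only when fewer than the
 * budget were found (cf. enable_irq after napi_complete_done). */
int rx_poll(struct rx_queue *q)
{
    int done = 0;
    assert(q->poll_scheduled);
    while (q->pending > 0 && done < POLL_BUDGET) {
        q->pending--;
        done++;
    }
    if (done < POLL_BUDGET) {
        q->poll_scheduled = false;
        q->irq_enabled = true;
        q->irq_toggles++;
    }
    return done;
}

/* Deliver a burst of n packets, then poll until the interrupt is re-armed.
 * Returns the number of poll calls needed. */
int simulate_burst(struct rx_queue *q, int n)
{
    int polls = 0;
    q->pending += n;
    rx_irq_handler(q);
    while (q->poll_scheduled) {
        rx_poll(q);
        polls++;
    }
    return polls;
}
```

    In this model, a burst of 10 packets with a budget of 4 takes three polls, and each burst costs exactly one mask and one unmask regardless of its size, so the enable/disable path is exercised once per burst of traffic rather than once per packet.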

  • Hi,

    Can you please open a new thread for the network driver question?

    Regarding firewall 4320, the screenshot included above shows that, by default, channels 0 and 256 are configured to the values shown.

    The Firewall TISCI Description page of the TISCI User Guide describes where the channel number should be specified to read back the settings.
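    As we understand the TISCI firewall description, for a channelized firewall the region field of the get-region request carries the channel number, so each request reads back a single channel. Note also that the pseudocode line firewallreq.region = 0, 1, 256; would, if compiled literally, set region to 0 only, because the C comma operator binds more loosely than assignment. The sketch below builds one request per channel; struct fwl_region_req, build_channel_reqs() and the value 3 for n_permission_regs are hypothetical stand-ins, and the real PDK request type and the Sciclient_firewallGetRegion() call itself are not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the "get firewall region" request fields used
 * in the pseudocode above; the real PDK type is not reproduced here. */
struct fwl_region_req {
    uint16_t fwl_id;
    uint16_t region;            /* channelized firewall: the channel number */
    uint32_t n_permission_regs;
};

#define FWL_ID_INTA0_GCNTRTI 4320u /* firewall ID from the CSL define in this thread */
#define FWL_N_PERM_REGS      3u    /* assumption: number of permission slots */

/* Fill one request per channel, since one request addresses one channel. */
size_t build_channel_reqs(const uint16_t *channels, size_t count,
                          struct fwl_region_req *out)
{
    for (size_t i = 0; i < count; i++) {
        out[i].fwl_id = (uint16_t)FWL_ID_INTA0_GCNTRTI;
        out[i].region = channels[i];
        out[i].n_permission_regs = FWL_N_PERM_REGS;
    }
    return count;
}
```

    Each filled entry would then be passed to its own Sciclient_firewallGetRegion() call, as in the pseudocode above.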

    Regards,

    kb

  • Hi KB,

    We have created a new thread to inquire about the network issues.

    In addition we have reverted all modifications made to rm_cfg.c, but the issue still persists.

    Best regards

    Alex

  • Hi KB:

    While reproducing the SError issue, we used JTAG (CCS) to read the channelized firewall register configuration.

    M4 log:

    /* Properties of channelized firewall at: NAVSS0_UDMASS_INTA_0_UDMASS_INTA0_CFG_GCNTRTI */
    #define CSL_CH_FW_NAVSS0_UDMASS_INTA_0_UDMASS_INTA0__CFG__GCNTRTI_ID (4320U)
    #define CSL_CH_FW_NAVSS0_UDMASS_INTA_0_UDMASS_INTA0__CFG__GCNTRTI_TYPE (CSL_FW_CHANNEL)
    #define CSL_CH_FW_NAVSS0_UDMASS_INTA_0_UDMASS_INTA0__CFG__GCNTRTI_MMR_BASE (0x00000045438000U)

    0x33D3C004 --> VINT#(0x34), so the channel firewall address for VINT#(0x34) is 0x45438000 + 0x3c * 0x20 = 0x45438780. The firewall configuration read via CCS is as follows:

    Based on the registers read back, the firewall configuration appears to be correct. So why does the M4 raise a firewall exception when Linux accesses this register?

    Register interpretation:

    0x0000000A: enable

    0x00018B88: PRIV-ID 01 (non-secure A72), 8B88 (NONSEC_SUPV_READ & WRITE).
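    The address arithmetic and register reading above can be captured in a short C sketch. The MMR base 0x45438000 and the 0x20 per-channel stride come from this thread. The remaining points are assumptions to verify against the TRM and the TISCI User Guide: that each VINT occupies a 4 KB page, so the channel index is bits [19:12] of the faulting address (this gives 0x3C for 0x33D3C004, matching the 0x3c used in the calculation above); that 0xA in control bits [3:0] means enabled; and that the permission word holds the priv ID in bits [23:16] plus four 4-bit groups (secure supervisor, secure user, non-secure supervisor, non-secure user, from bit 0 upward), each with write, read, cacheable and debug bits from bit 0 upward. All function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FWL_4320_MMR_BASE 0x45438000u /* ..._GCNTRTI_MMR_BASE from the CSL define */
#define FWL_CH_STRIDE     0x20u       /* per-channel stride used in the calculation */

/* Assumption: one 4 KB page per VINT, so bits [19:12] give the channel. */
uint32_t vint_channel_from_addr(uint32_t addr)
{
    return (addr >> 12) & 0xFFu;
}

/* Address of a channel's firewall registers, as computed above. */
uint32_t fwl_channel_regs(uint32_t channel)
{
    return FWL_4320_MMR_BASE + channel * FWL_CH_STRIDE;
}

/* Assumption: control bits [3:0] == 0xA means the firewall is enabled. */
bool fwl_enabled(uint32_t control)
{
    return (control & 0xFu) == 0xAu;
}

/* Assumed permission word layout: [23:16] priv ID; 4-bit groups at
 * [3:0] secure supervisor, [7:4] secure user, [11:8] non-secure
 * supervisor, [15:12] non-secure user; within a group bit 0 = write,
 * bit 1 = read, bit 2 = cacheable, bit 3 = debug. */
uint32_t perm_priv_id(uint32_t perm)
{
    return (perm >> 16) & 0xFFu;
}

uint32_t perm_group(uint32_t perm, unsigned shift)
{
    return (perm >> shift) & 0xFu;
}

/* True if the non-secure supervisor group grants both read and write. */
bool perm_nonsec_supv_rw(uint32_t perm)
{
    return (perm_group(perm, 8) & 0x3u) == 0x3u;
}
```

    Under these assumptions, the values read via CCS decode as an enabled channel (0xA) whose permission slot grants non-secure supervisor read and write to priv ID 0x01, consistent with the interpretation above.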

    Best regards

    Alex

  • Hi Alex,

    Thank you for updates.

    The most recent log shows A72 as the priv ID, whereas the previous one I looked at had MCU2_0 as the originator. In both cases, error '1' (line 4, bits [16-23]) is logged.

    One theory is that the DMA channel being used is not the correct one, since the firewall associated with it has not been enabled.

    Regards,

    kb

  • Hi KB,

    What are the next steps we can take to troubleshoot this issue?

    Regards,

    Alex

  • Hi Alex,

    Ideally, the issue would be reproduced with the TI SDK on the EVM, as that enables the TI team to recreate and work on the issue directly.

    I will loop in some other colleagues for their thoughts.

    Regards,

    kb

  • Hi Alex,

    I have not tried NFS with TDA4VH, but on the other TDA4 platforms we have used NFS and not seen the crash above. It would really help if you could reproduce it on the EVM.

    Best Regards,

    Keerthy 

  • Hi Alex,

    Did you get the chance to reproduce this on EVM? It would be really helpful if you could.

    Regards,

    Brijesh

  • Hi Brijesh

    We are currently using the Fast Startup solution on the customer board rather than the systemd-based one, so it will take some time to test NFS mounting on the EVM by booting with a tiny file system. In addition, I have two related questions:

    1. It is impossible to guarantee that the code on the customer board is exactly the same as that on the EVM. If there is an issue that can be replicated on the customer board but not on the EVM, how does TI typically resolve it?

    2. For the current issue, we used the same RM-cfg as EVM, but the problem still occurs. Based on the above analysis, it is quite clear that it is a firewall issue. Does TI have any further troubleshooting ideas?

    Best regards

    Alex

  • Hi,

     It is impossible to guarantee that the code on the customer board is exactly the same as that on the EVM. If there is an issue that can be replicated on the customer board but not on the EVM, how does TI typically resolve it?

    Please share the complete boot logs. Also, are you using the Ethernet firmware to control CPSW, or the native Linux driver?

    For the current issue, we used the same RM-cfg as EVM, but the problem still occurs. Based on the above analysis, it is quite clear that it is a firewall issue. Does TI have any further troubleshooting ideas?

    Can you disable the MCU2_0 firmware and check whether this still happens? If it does not, we need to see why MCU2_0 is accessing that firewalled region.

    - Keerthy

  • Hi, Keerthy

    The normal boot log and dmesg are shown below:

    boot log:
    U-Boot SPL 2021.01-svn522639 (Dec 26 2023 - 11:46:05 +0800)
    SYSFW ABI: 3.1 (firmware rev 0x0008 '8.6.3--v08.06.03 (Chill Capybar')
    SPL initial stack usage: 13472 bytes
    Trying to boot from SPI
    bootpart :[main]
    NOTICE: BL31: v2.8(release):
    NOTICE: BL31: Built : 11:45:33, Dec 26 2023
    U-Boot 2021.01-svn522639 (Dec 26 2023 - 11:45:39 +0800), Build: jenkins-AutoElec-cci-Pipeline-13339
    SoC: J784S4 SR1.0 GP
    Model: HIKAUTO AE-B50038-S
    DRAM: 4 GiB
    Flash: 0 Bytes
    MMC: mmc@4fb0000: 1
    In: serial@2880000
    Out: serial@2880000
    Err: serial@2880000
    Net: eth1: ethernet@c200000, eth0: ethernet@c200000port@1

    dmesg:
    dmesg
    [ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x411fd080]
    [ 0.000000] Linux version 5.10.162-g76b3e88d56 (ci@CI-AutoElec-Slave-71-138) (aarch64-none-linux-gnu-gcc (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 9.2.1 20191025, GNU ld (GNU Toolchain for the A-profile Architecture 9.2-2019.12 (arm-9.10)) 2.33.1.20191209) #1 SMP PREEMPT Tue Dec 26 11:46:35 CST 2023
    [ 0.000000] Machine model: HIKAUTO AE-B50038-S
    [ 0.000000] earlycon: ns16550a0 at MMIO32 0x0000000002880000 (options '')
    [ 0.000000] printk: bootconsole [ns16550a0] enabled
    [ 0.000000] efi: UEFI not found.
    [ 0.000000] Reserved memory: created DMA memory pool at 0x0000000088000000, size 176 MiB
    [ 0.000000] OF: reserved mem: initialized node dsp-multicore-dma-memory@88000000, compatible id shared-dma-pool
    [ 0.000000] Reserved memory: created DMA memory pool at 0x0000000093000000, size 16 MiB
    [ 0.000000] OF: reserved mem: initialized node bsp-multicore-dma-memory@93000000, compatible id shared-dma-pool
    [ 0.000000] Reserved memory: created DMA memory pool at 0x0000000094000000, size 64 MiB
    [ 0.000000] OF: reserved mem: initialized node vision-apps-core-heap-memory-lo@94000000, compatible id shared-dma-pool
    [ 0.000000] Reserved memory: created DMA memory pool at 0x0000000098000000, size 1 MiB
    [ 0.000000] OF: reserved mem: initialized node r5f2_0-dma-memory@98000000, compatible id shared-dma-pool
    [ 0.000000] Reserved memory: created DMA memory pool at 0x0000000098100000, size 47 MiB
    [ 0.000000] OF: reserved mem: initialized node r5f2_0-memory@98100000, compatible id shared-dma-pool
    [ 0.000000] Reserved memory: created DMA memory pool at 0x000000009b000000, size 1 MiB
    [ 0.000000] OF: reserved mem: initialized node r5f1_0-dma-memory@9b000000, compatible id shared-dma-pool
    [ 0.000000] Reserved memory: created DMA memory pool at 0x000000009b100000, size 15 MiB

    We are using the Ethernet firmware on MCU2_0 to control CPSW.

    After disabling MCU2_0, the SError still occurs on Linux.

    Best regards

    Alex

  • Hi,

    After disabling MCU2_0, the SError still occurs on Linux.

    The issue description is that MCU2_0 is trying to access some firewalled memory, right?
    I am not sure whether you are getting the same exception or a different one after disabling MCU2_0.

    Also, the attached logs do not clearly show when the issue happens.

    - Keerthy

  • Hi Brijesh, 

    We have successfully reproduced the SError issue on the EVM. Please refer to the attached file for the reproduction method.
    If you have any questions, please contact us.

    Best regards
    Alex

    reproduce serror exception annex.zip

  • Hi Alex 

    Please allow us some time to replicate the issue; we will update you on the progress by the end of next week.

    Regards
    Diwakar

  • Hi Diwakar

    A gentle reminder to keep this issue on your schedule. The customer is pushing us hard, as it is affecting the project's SOP deadline.

    Please make sure to replicate the issue on your side and give us an update by the end of the week.

    Regards

    Zekun

  • Hi, Diwakar

    How is the replication going? If you run into any problems, please post updates here.

    The customer mentioned that this step is not needed if the mount succeeds.

    Regards

    Zekun

  • Hi Zekun 

    We would like to have a call with customer for better understanding on this issue.

    Regards
    Diwakar

  • Hi, Diwakar

    The attached file contains the modified rootfs. Please adjust the etc/init.d/nfs_test.sh script for your environment. Note in particular that the U-Boot ENV parameters need to match ours, as our testing has shown that they are also relevant to the issue.

    Additionally, our kernel boot logs and serial logs are also attached for your reference.

    Best regards

    Alex

    tinyrootfs_and_logs.zip

  • Hi Alex 

    Thanks for sharing the rootfs.

    Are the logs you shared from the EVM, and do they include both passing and failing runs?

    Regards
    Diwakar

  • Hi Diwakar

    The kernel log indicates a normal startup, while the serial port log is the log recorded when the Serror occurs.

    Best regards

    Alex

  • Hi Alex

    Are both of those logs from the TI EVM?

    Regards
    Diwakar

  • Hi Divakar

    Those logs are all from the TI EVM. 

    Are there any issues with the current reproduction environment? Can you reproduce the issue now?

    Best regards

  • Hi Divakar

    May I ask how the current issue is progressing?

    Best regards

    Alex

  • Hello,

    This thread is assigned to our engineer in the India office. Due to a regional holiday, half of our team is out of the office. Please expect a 1~2 day delay in responses.

    Apologies for the delay, and thank you for your patience.

    Thanks.

  • Hi Praveen

    Has there been any new progress on the issue?

    Best regards

    Alex

  • Hi, Experts

    Several days have passed; may I know the current status of this issue?

    Regards

    Zekun

  • Hi Zekun, Alex,

    Can you please try the patch below? I have tried 1000 times so far and have not been able to reproduce the issue.

    https://patchwork.kernel.org/project/linux-arm-kernel/list/?series=845384

    net-net-ethernet-ti-am65-cpsw-nuss-cleanup-DMA-Channels-before-using-them.patch

    Regards
    Diwakar

  • Hi Divakar, Zekun

    We have applied this patch to SDK 8.6, but the SError issue still recurs when testing on the EVM. Could you please clarify whether this patch is based on SDK 8.6 or another version?

    Attached is the reproduction log.

    After patch SError log.txt
    [16:15:04:987]U-Boot SPL 2021.01-g62a9e51344 (Mar 13 2023 - 15:43:18 +0000)
    [16:15:05:061]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:062]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:063]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:074]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:074]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:075]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:093]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:094]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:096]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:118]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [16:15:05:119]SYSFW ABI: 3.1 (firmware rev 0x0008 '8.6.3--v08.06.03 (Chill Capybar')
    [16:15:05:276]SPL initial stack usage: 13472 bytes
    [16:15:05:277]Trying to boot from MMC2
    [16:15:05:497]Starting ATF on ARM64 core...
    [16:15:05:497]
    [16:15:05:512]NOTICE: BL31: v2.8(release):v2.8-226-g2fcd408bb3-dirty
    [16:15:05:513]NOTICE: BL31: Built : 15:42:56, Mar 13 2023
    [16:15:05:524]I/TC:

  • Hi Li 

    We have tested the patch on SDK 9.2, and with it applied we no longer see any issue related to the corruption; ideally, it should behave the same on SDK 8.6.

    Can you make sure the changes are applied properly, and tell us the address range for which the firewall exception is reported in the TIFS traces?

    Regards
    Diwakar

  • Hi Divakar

    1. We can confirm that the current patch has been applied to the kernel. The following are the patch and kernel logs.

    2. The exception log captured during the occurrence of the SError issue is as follows, and the firewall output is the same as before.

    Exception addr 0x45B0B000 FWL Exception 0x110E000 0x10000 0x33D38004 0x0 0x2201 0x4

    uart log:

    am65-cpsw-nuss_patched_log.txt
    [15:15:43:305]U-Boot SPL 2021.01-dirty (Apr 30 2024 - 14:42:16 +0800)
    [15:15:43:385]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:396]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:396]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:400]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:413]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:413]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:416]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:427]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:428]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:432]ti_i2c_eeprom_am6_get: Ignoring record id 255
    [15:15:43:450]SYSFW ABI: 3.1 (firmware rev 0x0008 '8.6.3--v08.06.03 (Chill Capybar')
    [15:15:43:609]SPL initial stack usage: 13472 bytes
    [15:15:43:610]Trying to boot from MMC2
    [15:15:43:834]Starting ATF on ARM64 core...
    [15:15:43:835]
    [15:15:43:849]NOTICE: BL31: v2.8(release):v2.8-226-g2fcd408bb3-dirty
    [15:15:43:850]NOTICE: BL31: Built : 15:42:56, Mar 13 2023
    [15:15:43:871]I/TC:

    3. To resolve this issue, besides the network driver patch, is there anything else that needs to be updated in the SDK 8.6 version, such as the firewall firmware?

    4. Please confirm that the SError issue does not recur on SDK 9.2. Was the reproduction environment set up using the tiny rootfs we provided earlier? When we test with the tiny rootfs, SDK 8.6 reproduces the issue reliably and quite frequently.

  • Hi Li 

    We tested a normal NFS boot on SDK 9.1 and 9.2 with the default SDK rootfs (not the tiny one). Without this patch, we were able to see the firewall exception.

    TIFS traces:

    A72 logs:

    After applying the shared patch, we were not able to reproduce the issue.

    We will also keep a device under test with SDK 8.6 to see whether the issue persists there, and will update you on our findings for 8.6.

    Regards
    Diwakar

  • Hi Li 

    Also, I assume you tested the above patch on the EVM setup. Have you tested it on your own hardware setup as well, where one board boots the other?

    Regards
    Diwakar

  • Hi Divakar

    Yes, we have also tested it on the product board, and the issue can be reproduced as well.

    Best Regards

  • Hi Li 

    One more thing: when you do not make the fixed-link changes in the dtsi, is the issue reproducible?

    We have made our setup somewhat similar to yours but have not seen any issue so far. The differences between your setup and our setup on the EVM are:

    • We are not using fixed link 
    • NFS script changes

    PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin

    insmod /lib/modules/5.10.162-g76b3e88d56/kernel/drivers/rpmsg/rpmsg_char.ko
    insmod /lib/modules/5.10.162-g76b3e88d56/kernel/drivers/rpmsg/virtio_rpmsg_bus.ko

    insmod /lib/modules/5.10.162-g76b3e88d56/kernel/drivers/remoteproc/ti_k3_dsp_remoteproc.ko
    insmod /lib/modules/5.10.162-g76b3e88d56/kernel/drivers/remoteproc/ti_k3_m4_remoteproc.ko
    insmod /lib/modules/5.10.162-g76b3e88d56/kernel/drivers/remoteproc/ti_k3_r5_remoteproc.ko

    lsmod

    echo "mount nfs start----------"
    ifconfig eth0 hw ether 68:e7:4a:08:15:5f
    ifconfig eth0 up
    #sleep 4
    #udhcpc -i eth0
    ifconfig eth0 10.24.68.42 netmask 255.255.254.0

    route add default gw 10.24.68.254
    mkdir -p /mnt/nfs0

    Regards
    Diwakar

  • Hi Divakar

    We explained the fixed-link requirement earlier: our product uses two VH chips connected directly through a MAC-to-MAC link, so the network node in the device tree needs to be configured in fixed-link mode.

    On the EVM, without fixed-link mode we did not encounter the SError issue. With fixed-link mode enabled, however, the Linux kernel boots faster because PHY auto-negotiation is skipped, so the subsequent driver modules are also loaded earlier.

    We suspect that loading the rpmsg and remoteproc drivers concurrently with network data transfer leads to the SError issue.

    Therefore, we still recommend testing in fixed-link mode. The modification to the NFS script has no impact.

    Best Regards

    Alex

  • Hi Alex,

    On the EVM, if we configure either of the two CPSW ports as fixed-link, we cannot get the link working, because the PHY is not bootstrapped.

    When you recreated the issue on the EVM, were you able to ping and transfer data, or was the issue seen even without a working link?

    Regards,
    Tanmay

  • Hi Tanmay,

    Our EVM is connected to a switch. After the MCU RGMII port is set to fixed-link mode, the EVM and the NFS server PC can ping each other, and NFS mounts successfully. The df command in the nfs_test.sh script confirms the NFS mount status, as shown below:

    [15:15:55:755]data1/zhangjing24/nfs
    [15:15:55:755] 20971520 8434688 12536832 40% /mnt/nfs0

    You can ask Divakar how the SError issue was reproduced on the EVM board before.

    Best Regards

    Alex

  • Hi Alex,

    Thanks for the info.

    In an earlier response I saw your hypothesis that this might have something to do with the fixed link taking less time to come up. But the issue is still seen only when you run your NFS script; is that correct? If so, there could be a race condition involving the rproc or rpmsg modules, both of which use the same interrupt router as UDMA.

    Have you tried not loading the modules before mounting the NFS partitions? Can you check whether the error still reproduces in that case?

    Regards,
    Tanmay

  • Hi Tanmay,

    We compiled the rpmsg and remoteproc drivers into the kernel so that network transmission and driver loading no longer overlap, and encountered no issues when running nfs_test.sh on the EVM. However, with the same strategy we can still reproduce the SError on our product; it occurs during network data transmission, at roughly the time the DSP service performs DMA initialization.

    As described previously, the core issue is that the firewall unexpectedly blocks the access. Could you analyze whether there is a problem in the firewall component?

    Best Regards

    Alex
