PROCESSOR-SDK-AM64X: M4F IRAM memory issue

Gergely Korcsák

Hi,

We encountered an issue on the M4F core: after it is initialized with the 'sbl_null.release.hs_fs.tiimage' bootloader, the following addresses always hold these exact values:

Address 0x400: BF30BF30
Address 0xFFC: 0001F2D8

I checked on our hardware and on the SK board as well, and these values are always present at these addresses.

I noticed this because our firmware runs into 'hwip_undefined_handler_exception' in one part of a function in some builds, and in another segment of the same function in other builds.

It helps when the mentioned code segment is placed further up in memory, but that is not the desired solution, because we are not sure what the exact cause of the issue is.

For some reason, making the char buffers in the function larger also helped, but that raises the question 'why', because those should be created in stack memory and have nothing to do with program memory.

Our linker.cmd matches the hello_world project's linker file, but I still notice that the .sysmem heap segment is usually linked after the .text code segment in our case, while in the example it becomes the first segment at 0x200-0x8200. As I understand it, though, that shouldn't cause this issue.

In the example I also see the first value present at 0x400, where that memory is allocated for the heap, and in our case as well if that segment is left unallocated.

Can you suggest what could cause the issue, i.e. why our code sometimes runs into the fault handler near address 0xFFC, and why these values are always set in the mentioned memory areas?

The assembly code of the mentioned segment:

00000ff8:   980A                ldr        r0, [r13, #0x28]
00000ffa:   F44F7180            mov.w      r1, #0x100
00000ffe:   F006FEB7            bl         PromPrint
00001002:   E7FF                b          #0x1004
00001004:   F89D02FF            ldrb.w     r0, [r13, #0x2ff]
00001008:   07C0                lsls       r0, r0, #0x1f
0000100a:   B1C8                cbz        r0, #0x1040
0000100c:   E7FF                b          #0x100e

The function PromPrint jumps to 0x7d70 now, and it works, but if the crash happens none of that code executes, because it always crashes at address 0xFF(x).

I also noticed that if I try to step through that function with the debugger, I can only follow that code section in the assembly view.

Our Debug.map file output:

******************************************************************************
            TI ARM Clang Linker PC v2.1.3                      
******************************************************************************
>> Linked Fri Jul  7 13:31:14 2023

OUTPUT FILE NAME:   <SAF.out>
ENTRY POINT SYMBOL: "_c_int00"  address: 0000e7ab


MEMORY CONFIGURATION

         name            origin    length      used     unused   attr    fill
----------------------  --------  ---------  --------  --------  ----  --------
  M4F_VECS              00000000   00000200  00000140  000000c0  RWIX
  M4F_IRAM              00000200   0002fe00  0001d2d8  00012b28  RWIX
  M4F_DRAM              00030000   00010000  00003440  0000cbc0  RWIX
  PP_SOM_SHM_MEM        701c0000   00000100  000000bc  00000044  RWIX
  PP_BASE_SHM_MEM       701c0100   00000100  00000000  00000100  RWIX
  EC_PDO_SHM_MEM        701c0200   00000200  00000000  00000200  RWIX
  EC_SDO_SHM_MEM        701c0400   0000fc00  00000000  0000fc00  RWIX
  LOG_SHM_MEM           701d0000   00004000  00000000  00004000  RWIX
  RTOS_NORTOS_IPC_SHM_M 701d4000   0000b000  00000000  0000b000  RWIX
  USER_IPC_SHM_MEM1     701df000   00000100  00000000  00000100  RWIX
  USER_IPC_SHM_MEM2     701df100   00000100  00000000  00000100  RWIX
  USER_IPC_SHM_MEM3     701df200   00000100  00000000  00000100  RWIX
  USER_IPC_SHM_MEM4     701df300   00000400  00000000  00000400  RWIX


SEGMENT ALLOCATION MAP

run origin  load origin   length   init length attrs members
----------  ----------- ---------- ----------- ----- -------
00000000    00000000    00000140   00000140    rw-
  00000000    00000000    00000140   00000140    rw- .vectors
00000200    00000200    000102d8   000102d8    r-x
  00000200    00000200    000102d8   000102d8    r-x .text
000104d8    000104d8    0000d000   00000000    rw-
  000104d8    000104d8    00008000   00000000    rw- .sysmem
  000184d8    000184d8    00005000   00000000    rw- .stack
00030000    00030000    00001a50   00001a50    r--
  00030000    00030000    00001a50   00001a50    r-- .rodata
00031a50    00031a50    00001600   00000000    rw-
  00031a50    00031a50    00001600   00000000    rw- .bss
00033050    00033050    000003f0   000003f0    rw-
  00033050    00033050    000003f0   000003f0    rw- .data


SECTION ALLOCATION MAP

 output                                  attributes/
section   page    origin      length       input sections
--------  ----  ----------  ----------   ----------------
.vectors   0    00000000    00000140     
                  00000000    00000140     nortos.am64x.m4f.ti-arm-clang.debug.lib : HwiP_armv7m_handlers_nortos.obj (.vectors)

.text      0    00000200    000102d8     
                  00000200    00000e4c     first  object
                  0000104c    00000c00     second object
                  ...

Example Debug.map file:

******************************************************************************
            TI ARM Clang Linker PC v2.1.3                      
******************************************************************************
>> Linked Fri Jul  7 11:09:50 2023

OUTPUT FILE NAME:   <hello_world_am64x-sk_m4fss0-0_nortos_ti-arm-clang.out>
ENTRY POINT SYMBOL: "_c_int00"  address: 0000e99b


MEMORY CONFIGURATION

         name            origin    length      used     unused   attr    fill
----------------------  --------  ---------  --------  --------  ----  --------
  M4F_VECS              00000000   00000200  00000140  000000c0  RWIX
  M4F_IRAM              00000200   0002fe00  00013b10  0001c2f0  RWIX
  M4F_DRAM              00030000   00010000  000010a0  0000ef60  RWIX
  USER_SHM_MEM          701d0000   00000080  00000000  00000080  RWIX
  LOG_SHM_MEM           701d0080   00003f80  00000000  00003f80  RWIX
  IPC_VRING_MEM         701d4000   0000c000  00000000  0000c000  RWIX


SEGMENT ALLOCATION MAP

run origin  load origin   length   init length attrs members
----------  ----------- ---------- ----------- ----- -------
00000000    00000000    00000140   00000140    rw-
  00000000    00000000    00000140   00000140    rw- .vectors
00000200    00000200    00008000   00000000    rw-
  00000200    00000200    00008000   00000000    rw- .sysmem
00008200    00008200    00007b10   00007b10    r-x
  00008200    00008200    00007b10   00007b10    r-x .text
0000fd10    0000fd10    00004000   00000000    rw-
  0000fd10    0000fd10    00004000   00000000    rw- .stack
00030000    00030000    00000708   00000000    rw-
  00030000    00030000    00000708   00000000    rw- .bss
00030708    00030708    000005d8   000005d8    r--
  00030708    00030708    000005d8   000005d8    r-- .rodata
00030ce0    00030ce0    000003c0   000003c0    rw-
  00030ce0    00030ce0    000003c0   000003c0    rw- .data


SECTION ALLOCATION MAP

 output                                  attributes/
section   page    origin      length       input sections
--------  ----  ----------  ----------   ----------------
.vectors   0    00000000    00000140     
                  00000000    00000140     nortos.am64x.m4f.ti-arm-clang.debug.lib : HwiP_armv7m_handlers_nortos.obj (.vectors)

.sysmem    0    00000200    00008000     UNINITIALIZED
                  00000200    00000010     libc.a : memory.c.obj (.sysmem)
                  00000210    00007ff0     --HOLE--

.text      0    00008200    00007b10     
                  00008200    00000a30     nortos.am64x.m4f.ti-arm-clang.debug.lib : printf.obj (.text._vsnprintf)
                  00008c30    00000640                                             : printf.obj (.text._etoa)
                  00009270    00000580                                             : printf.obj (.text._ftoa)
                  000097f0    000004a8                                             : HwiP_armv7m.obj (.text.hwi)
                  00009c98    000002da     drivers.am64x.m4f.ti-arm-clang.debug.lib : sciclient.obj (.text.Sciclient_service)
                  00009f72    00000286                                              : uart_v0.obj (.text.UART_open)
                  0000a1f8    00000242                                              : uart_v0.obj (.text.UART_fifoConfig)

Edit:

I managed to run into it again, and the disassembly shows me this code:

00000ff2:   9A1B                ldr        r2, [r13, #0x6c]
00000ff4:   991C                ldr        r1, [r13, #0x70]
00000ff6:   981D                ldr        r0, [r13, #0x74]
00000ff8:   F8B230B0            ldrh.w     r3, [r2, #0xb0]
00000ffc:   C850                ldm        r0!, {r4, r6}
00000ffe:   0001                movs       r1, r0
00001000:   46EC                mov        r12, r13
00001002:   F8CC2000            str.w      r2, [r12]
00001006:   F640127A            movw       r2, #0x97a
0000100a:   F2C00203            movt       r2, #3

The exception is thrown at 0x0ffc.

The memory of that area:

0x00000FE0	F8CCE004	F6402000	F2C01238	F00E0203	9A1BF9FB	981D991C	30B0F8B2
0x00000FFC	0001C850	F8CC46EC	F6402000	F2C0127A	F00E0203	E7FFF9EB	0308F89D
0x00001018	B19007C0	F64CE7FF	F2C02023	F2400000	F2C042A9	21000203	F00E9109
0x00001034	9809F9D9	7180F44F	FE9DF006
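The dump above shows 32-bit words, while the disassembly lists 16-bit Thumb halfwords, so the two views are easy to misread against each other. The following is a minimal host-side sketch (not part of the original post; the helper name `thumb_halfwords` is hypothetical) that splits little-endian dump words into the halfwords a disassembler would see:

```python
import struct

def thumb_halfwords(base_addr, words):
    """Split 32-bit little-endian dump words into (address, halfword) pairs."""
    out = []
    for i, word in enumerate(words):
        raw = struct.pack("<I", word)        # bytes in the order they sit in memory
        lo, hi = struct.unpack("<HH", raw)   # two consecutive 16-bit units
        addr = base_addr + 4 * i
        out.append((addr, lo))
        out.append((addr + 2, hi))
    return out

# Two words copied from the dump above, starting at 0xFF8.
dump = [0x30B0F8B2, 0x0001C850]
for addr, hw in thumb_halfwords(0xFF8, dump):
    print(f"{addr:08x}: {hw:04X}")
```

For these two words the halfwords come out as F8B2, 30B0, C850 and 0001, which lines up with the `ldrh.w`, `ldm` and `movs` lines at 0xff8 to 0xffe in the disassembly above.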

The register state that causes it to happen:

The register state after the exception occurs:

The second time, it occurs in another part of the function, although that function was not changed, but it always happens in the exact same memory area.

over 2 years ago

0 Prashant Shivhare over 2 years ago

TI__Guru* 76031 points

Hello Gergely,

The SBL sets the value 0xBF30BF30 at the address 0x400 while initializing the M4F IRAM with a valid reset vector and a wait instruction (0xBF30 is the Thumb encoding of WFI). This is shown below

This ensures the M4F core is in a valid state on the reset & release of the core.

Gergely Korcsák said:
Our linker.cmd matches the hello_world project's linker file, but I still notice that the .sysmem heap segment is usually linked after the .text code segment in our case, while in the example it becomes the first segment at 0x200-0x8200. As I understand it, though, that shouldn't cause this issue.

I also believe this shouldn't be the cause. But just for confirmation, could you please group the .text & .sysmem sections together, in that order, as shown below:

SECTIONS
{
    /* This has the M4F entry point and vector table, this MUST be at 0x0 */
    .vectors:{} palign(8) > M4F_VECS

    GROUP
    {
        .text:   {} palign(8)     /* This is where code resides */
        .sysmem: {} palign(8)     /* This is where the malloc heap goes */
    } > M4F_IRAM

    .bss:    {} palign(8) > M4F_DRAM     /* This is where uninitialized globals go */
    RUN_START(__BSS_START)
    RUN_END(__BSS_END)

    .data:   {} palign(8) > M4F_DRAM     /* This is where initialized globals and static go */
    .rodata: {} palign(8) > M4F_DRAM     /* This is where const's go */
    .stack:  {} palign(8) > M4F_IRAM     /* This is where the main() stack goes */

    /* Sections needed for C++ projects */
    .ARM.exidx:     {} palign(8) > M4F_IRAM  /* Needed for C++ exception handling */
    .init_array:    {} palign(8) > M4F_IRAM  /* Contains function pointers called before main */
    .fini_array:    {} palign(8) > M4F_IRAM  /* Contains function pointers called after main */

    /* General purpose user shared memory */
    .bss.user_shared_mem (NOLOAD) : {} > USER_SHM_MEM
    /* this is used when Debug log's to shared memory are enabled, else this is not used */
    .bss.log_shared_mem  (NOLOAD) : {} > LOG_SHM_MEM
    /* this is used only when IPC RPMessage is enabled, else this is not used */
    .bss.ipc_vring_mem   (NOLOAD) : {} > IPC_VRING_MEM
}

This should ensure that the .text section comes before the .sysmem section. Afterwards, if by any chance the firmware works, we can try to reason about why it works this way and not the other way.

Thanks!

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Hi Prashant,

Grouping the sections doesn't change the behavior. We still see the FW crash.

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hello Marc,

Thanks for doing the testing!

So, this rules out the ordering of the sections. Next, can we have a look at the MPU configuration and experiment with different settings? At times, these MPU configuration settings have been the source of bugs. The default M4F Hello World example defines two MPU configurations, one of which is of particular interest.

I would like to know if you have the same MPU configurations in your firmware. If not, could you please use the same MPU configurations as the Hello World example and then try running your firmware? If yes, can you experiment with different Region Attributes options, such as Strongly Ordered, to see if this changes the behaviour?

Thanks!

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Hi,

The original MPUs from the sample are there, but we also added one MPU config for access to MSRAM on main domain. We use shared memory here.

The RATs are as in the sample, so the MSRAM MPU region is accessed via CONFIG_ADDR_TRANSLATE_REGION3 at a local address which is identical to the system address.

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hi Marc,

Could you please try running the firmware with different Region Attributes settings, such as Strongly Ordered, for CONFIG_MPU_REGION1? I would also like to know if the firmware is trying to access MSRAM at the point of failure.

One more thing: was this firmware working before, and did it break after some changes?

Thanks!

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

I changed to Strongly Ordered, but the crash is still there.

The crash occurs in a function which also accesses the shared memory, but it doesn't crash upon accessing this shared memory, and for some lines of code the access to shared memory works.

When looking at the mixed source/disassembly view, CCS loses track of the source-to-assembly linkage at some point in this function, close to the crash. It doesn't display any source anymore.

When I change the start address of IRAM from 0x200 to 0x2000 with exactly the same code, all is fine: the source/assembly view is correct and no crash happens.

I shared a movie of the debug session good+bad with our TI application engineer; as well as the .map files; but I cannot share this here on the forum.

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hello Marc,

Thank you for all the testing & sharing the screen recordings of the observations! It helps in ruling out the suspected causes.

Though I can't reproduce the issue on my side, this time I dug deep into the library source code. There, I see that we are resetting the Stack Pointer (SP) manually in the startup function (_c_int00) of the M4F core. This seems somewhat unusual.

So, could we comment out the code that sets the SP, as shown below, in the M4F NORTOS kernel library code and try running the firmware with these changes?

After the changes, rebuild the libraries for the right profile, or build only the M4F NORTOS kernel library for the right profile, since we have made changes only in that library code. Assuming we are building for the debug profile, please run the below command from the MCU+ SDK installation directory to build only the M4F NORTOS kernel library.

gmake -s -C source\kernel\nortos -f makefile.am64x.m4f.ti-arm-clang PROFILE=debug all

Once the library is built then rebuild the example and try running the same. Let me know if the issue still persists.

Thanks & Regards,

Prashant

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Hi,

Unfortunately the change didn't solve the issue.

I shared the screencapture video via our application team.

Looking forward to more suggestions!

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hello Marc,

I realize that in my very first response, I suggested to group the sections (.text & .sysmem), in the Linker Command file like this:

GROUP
{
    .text:   {} palign(8)     /* This is where code resides */
    .sysmem: {} palign(8)     /* This is where the malloc heap goes */
} > M4F_IRAM

However, this ordering was already there with your builds. What I wanted to suggest is to group the sections (.text & .sysmem) like this:

GROUP
{
    .sysmem: {} palign(8)     /* This is where the malloc heap goes */
    .text:   {} palign(8)     /* This is where code resides */
} > M4F_IRAM

This ensures that the .sysmem comes before the .text section.

Actually, this ordering can be relevant here, as the Hard Fault is occurring somewhere around 0xFF(x), and if .text comes before .sysmem then this address belongs to the .text section. However, if the .text section does not cover this address, then the Hard Fault does not occur. So, I guess something in .text is getting overwritten when .text covers that address.
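Which section owns 0xFFC can be read straight off the SEGMENT ALLOCATION MAP of each build. As a rough sketch (the function name `covering_section` is made up for illustration; the origins and lengths are copied from the two Debug.map files earlier in this thread):

```python
def covering_section(addr, sections):
    """Return the name of the first section whose [origin, origin + length) holds addr."""
    for name, origin, length in sections:
        if origin <= addr < origin + length:
            return name
    return None

# (name, run origin, length) from the failing build's Debug.map.
failing_build = [(".vectors", 0x000, 0x140), (".text", 0x200, 0x102D8),
                 (".sysmem", 0x104D8, 0x8000), (".stack", 0x184D8, 0x5000)]

# (name, run origin, length) from the hello_world example's Debug.map.
hello_world = [(".vectors", 0x000, 0x140), (".sysmem", 0x200, 0x8000),
               (".text", 0x8200, 0x7B10), (".stack", 0xFD10, 0x4000)]

print(covering_section(0xFFC, failing_build))  # .text
print(covering_section(0xFFC, hello_world))    # .sysmem
```

In the failing build the address falls inside code, while in the example it falls inside the (still unused) heap, which would fit the observation that only the first layout faults.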

So, could you please try the above ordering, which is what I wanted to suggest to you in my very first response? Sorry for the mishap. Let me know if this works.

Hoping for the best!!

Regards,

Prashant

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Hi Prashant,

I tested your proposal and indeed; in this case I don't see the hard fault.

But is it really solving the problem or just masking it? Maybe now the heap in .sysmem gets overwritten instead of the .text code?

The workaround is similar to what I did earlier; namely let the IRAM start from 0x2000 instead of 0x200; which of course reduces the amount of available code memory.

Overall I have the same feeling, namely that something is overwriting memory around address 0xFFx. Could it be SYSFW? Could it be related to SBL? Is there any code being used from SBL after the image is in place and execution starts on the separate cores? I would believe not?

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hi Marc,

Marc Schouteeten said:
I tested your proposal and indeed; in this case I don't see the hard fault.

This definitely suggests that whatever section covers 0xFF(x) somehow gets overwritten around that same address.

Marc Schouteeten said:
Overall I have the same feeling, namely that something is overwriting memory around address 0xFFx. Could it be SYSFW? Could it be related to SBL? Is there any code being used from SBL after the image is in place and execution starts on the separate cores? I would believe not?

I also believe that Sysfw & SBL have nothing to do with this. It must be the M4F code, most probably the startup code that executes before main, which overwrites the memory around 0xFF(x).

There is actually a way to debug this startup code and find if it is overwriting the memory. The way to do this is shown in the below attached screen recording:

Following this, could you please try verifying the program at each statement of this startup function. This verification checks if the sections are intact in the memory or not. If they are it reports "Verification Successful" otherwise reports "Failures" with the address at which the change is detected.

Let me know if you get a verification failure in this startup function.

Regards,

Prashant

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Hi Prashant,

The verification fails in Dpl_init(), so already after the jump to main.

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hi Marc,

This verification failure is expected, as it is caused by the HwiP_init call, which initializes the interrupt handlers in the .vectors section. I see the same verification failure too.

So, we want to verify only the .text section and ignore all other sections. To do that, we would need to create a new ELF file from the original ELF file. The steps would be:

Take the original ELF file and make a copy of it, so that the original is preserved.
Use the objcopy tool to keep only the .text section in the copied ELF file. The command for this is:
```
tiarmobjcopy --only-section=.text ${Copied ELF File Path}
```
Confirm that the copied ELF file now really contains only the .text section. Ignore the sections .symtab, .strtab & .shstrtab if present. The command is:
```
tiarmreadelf.exe -S ${Copied ELF File Path}
```

These tiarm tools come with the TI ARM CLANG compiler and are available at ${TI_ARM_CLANG_PATH}/bin.

Now, after the original ELF file has been loaded on the core, we will use the copied ELF file for verification. We know the .text section content is same in both ELF files. Please ignore verification failures, if any, because of the other sections (.symtab, .strtab & .shstrtab).

For example,

Here, I can ignore this verification error as the corresponding address does not belong to .text section. And we know the verification of .text must be successful as other sections in the copied ELF file come later.

The above can be tried to see what is causing the change. I was also thinking that, before finding out what caused the change, we should first verify whether the .text section is indeed overwritten. For that, let the program run into the Hard Fault, then do the verification step using the copied ELF file. If the verification fails in the .text section, then we can debug it; otherwise we have to look for something else.
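Conceptually, the verification step compares what the ELF file says .text should contain against what is actually in target memory. The sketch below only illustrates that idea on toy data; it is not how CCS implements verification, and the function name `mismatched_words` and the byte values are assumptions made for the example:

```python
def mismatched_words(expected, actual, base_addr):
    """Return (address, expected_word, actual_word) for each differing 32-bit word."""
    if len(expected) != len(actual):
        raise ValueError("images must have the same length")
    diffs = []
    for off in range(0, len(expected), 4):
        want = expected[off:off + 4]
        got = actual[off:off + 4]
        if want != got:
            diffs.append((base_addr + off,
                          int.from_bytes(want, "little"),
                          int.from_bytes(got, "little")))
    return diffs

# Toy data: 16 arbitrary bytes standing in for .text contents loaded at 0xFF0.
expected = bytes(range(16))

# Simulated target memory where the word at 0xFFC has been overwritten.
actual = bytearray(expected)
actual[12:16] = (0x0001C850).to_bytes(4, "little")

for addr, want, got in mismatched_words(expected, bytes(actual), 0xFF0):
    print(f"0x{addr:04X}: expected {want:08X}, found {got:08X}")
```

With real data, `expected` would come from the .text section of the copied ELF file and `actual` from a memory dump taken after the Hard Fault.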

Regards,

Prashant

0 Marc Schouteeten over 2 years ago in reply to Marc Schouteeten

Prodigy 236 points

Hi Prashant,

Since HwiP_init() installs the interrupt handlers, this is causing the verification failure, of course.

However, even without starting execution, it seems the disassembly is already not correct. So after a CPU reset and reload of the program, even before stepping into _c_int00() and main(), I'm seeing the following (with missing source code in the mixed disassembly window):

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Dear Prashant,

When I load my program as always with the CCS debug button and then verify against the stripped ELF, I get a verification failure at 0xFFC. In this case the program will crash when executing code from the function which overlaps with this location.

When I follow the instruction in your video and in GEL menu I first disable verification and auto run; then this memory location 0xFFC is NOT overwritten and the code runs correctly.

kind regards,
Marc

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hello Marc,

Finally, I have some good news as I have reproduced the issue. I also understand now the root cause at least on my end & the fix for it. Before discussing this further, I sincerely thank you for assisting throughout the debugging. The latest reply really helped me think through the issue & understand the root cause.

Reproducing the Issue

I can reproduce the issue even with the default hello world example, but with the following changes in the Linker Command file to make sure the .text section overlaps the memory around 0xFFC.

GROUP
{
    .text:   {} palign(8)     /* This is where code resides */
    .sysmem: {} palign(8)     /* This is where the malloc heap goes */
} > M4F_IRAM

Then, I load the hello world executable on the M4F core. The core gets suspended at the start of main. Then, on doing the verification, CCS reports failure as the value at address 0xFFC mismatches as shown below.

However, even though there is a mismatch, the program runs fine and prints "Hello World!" as expected. As it happens, the control actually never goes to the section which is overwritten & thus the program runs fine. This behaviour prevented me from reproducing the issue earlier as I was only running the application and not doing the verification. This time I did the verification and could reproduce the issue.

Now, there is just one difference: I can reproduce the issue irrespective of whether or not the Fast Verification & Auto Run option in the CCS are enabled. Due to this behaviour, I am unable to root cause the issue at your end.

Understanding the root cause of the issue at my end

As it happens, the issue is linked to the startup code (_c_int00) only. This is the disassembly view of the startup code at my end

As we can see, the register R13, aka the Stack Pointer (SP), is at 0xFF8 at statement 92. Now, let's analyse the disassembly of this statement. The first two instructions load the value 0x13D18 into register R0. The third instruction is a store instruction that stores the value in R0 at the effective address (R13 + 4). This expression evaluates to (0xFF8 + 4) = 0xFFC, so the store is performed at address 0xFFC, which overlaps the .text section.

So, this statement is ultimately the root cause of the issue at my end.
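The effect of that store can be modelled on the host. The snippet below is only a sketch under stated assumptions: the byte buffer is a hypothetical stand-in for the start of M4F_IRAM filled with a marker value, and `store_word` is an illustrative helper, not SDK code. The SP and R0 values are the ones quoted above.

```python
import struct

IRAM_BASE = 0x200  # origin of M4F_IRAM in the map files in this thread

def store_word(mem, base, addr, value):
    """Model a 32-bit little-endian store (like `str r0, [sp, #4]`) into a byte image."""
    struct.pack_into("<I", mem, addr - base, value)

# Hypothetical 4 KiB stand-in for the start of IRAM, filled with a marker byte.
iram = bytearray(b"\xAA" * 0x1000)

sp = 0xFF8        # stack pointer value seen at the faulting statement
r0 = 0x13D18      # value loaded by the first two instructions
target = sp + 4   # effective address of the store

store_word(iram, IRAM_BASE, target, r0)

# Read back the two Thumb halfwords that now sit at 0xFFC and 0xFFE.
lo, hi = struct.unpack_from("<HH", iram, target - IRAM_BASE)
print(f"store hits 0x{target:X}; halfwords now {lo:04X} {hi:04X}")
```

The store lands on 0xFFC and leaves the halfwords 3D18 and 0001 there. The values reported earlier in the thread at 0xFFC (0001F2D8 and 0001C850) have the same 0001 upper halfword, which is consistent with, though not proof of, the same kind of address-like value being stored in those builds.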

Fix for the issue

A simple fix for this issue is to make sure the .sysmem section comes before the .text section and that .sysmem is large enough to cover the address 0xFFC. Then, even if the startup code writes to .sysmem, it is safe, as there is no data in the heap before main.

Regards,

Prashant

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Hi Prashant,

yes, this solves the issue. Thank you!

0 Prashant Shivhare over 2 years ago in reply to Marc Schouteeten

TI__Guru* 76031 points

Hi Marc,

If I may ask, I would really like to understand what really is causing the issue in your case and take this thread to its end. As I understand it, you replied that there are no verification errors in the startup code in case the verification & auto run are turned off. So, there still is the mystery of how the value is getting overwritten at your end.

So, if you permit, may I ask for some more help to find the root cause at your end.

Thanks!

0 Marc Schouteeten over 2 years ago in reply to Prashant Shivhare

Prodigy 236 points

Sure, I have proposed our FAE to setup an online meeting tomorrow; that would maybe be the most efficient?
