AM2434: Data abort for first context switch

Benedikt Schmidt

Part Number: AM2434

Hello,

I'm running FreeRTOS from ind_comms_sdk_am243x_11_00_00_08 on an AM2434. On one of the R5 cores I get, in about 50% of cases, a data abort for the first context switch, which happens to be the timer task.

I've already read out the corresponding registers:

DFAR: 0x00000000
DFSR: 0x00001C06

According to the documentation of the fault registers (https://developer.arm.com/documentation/ddi0460/c/System-Control/Register-descriptions/Fault-Status-and-Address-Registers) this seems to be an Asynchronous External Abort.
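For readers who want to check this decoding themselves, here is a minimal host-side sketch in Python. It assumes the DFSR bit layout from the Cortex-R5 TRM (SD in bit 12, RW in bit 11, S in bit 10 as Status[4], Domain in bits [7:4], Status[3:0] in bits [3:0]). The function name and the small status table are made up for illustration and only cover the two external abort encodings.

```python
# Sketch: split a Cortex-R5 DFSR value into its fields.
# Bit positions are assumed from the Cortex-R5 TRM; verify against your copy.

def decode_dfsr(dfsr: int) -> dict:
    status = ((dfsr >> 10) & 0x1) << 4 | (dfsr & 0xF)  # S bit acts as Status[4]
    return {
        "status": status,
        "domain": (dfsr >> 4) & 0xF,
        "write": bool((dfsr >> 11) & 0x1),        # RW: 1 = write access
        "slave_error": bool((dfsr >> 12) & 0x1),  # SD: 1 = SLVERR, 0 = DECERR
    }

# Only the encodings relevant here; other status values are left undecoded.
STATUS_NAMES = {
    0b01000: "synchronous external abort",
    0b10110: "asynchronous external abort",
}

fields = decode_dfsr(0x00001C06)  # DFSR value from this post
print(STATUS_NAMES.get(fields["status"], "other"), fields)
```

Under these assumptions, the value 0x1C06 yields status 0b10110 with both the RW and SD bits set.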

When the abort occurs, it always happens during execution of the RFEIA instruction at the end of the context switch. That is, if I single-step with PC=0x7010ace4, I end up either in the timer task function or (in ~50% of the cases) at the data abort entry of the vector table.

vPortRestoreTaskContext():
7010acb0: F102001F cps #0x1f
7010acb4: E59F01C8 ldr r0, [pc, #0x1c8]
7010acb8: E5901000 ldr r1, [r0]
7010acbc: E591D000 ldr r13, [r1]
7010acc0: E59F01C0 ldr r0, [pc, #0x1c0]
7010acc4: E49D1004 pop {r1}
7010acc8: E5801000 str r1, [r0]
7010accc: E3510000 cmp r1, #0
7010acd0: 149D0004 popne {r0}
7010acd4: 1CBD0B20 vpopne {d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15}
7010acd8: 1EE10A10 vmsrne fpscr, r0
7010acdc: F57FF01F clrex
7010ace0: E8BD5FFF pop {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r14}
7010ace4: F8BD0A00 rfeia r13!

As it is an asynchronous external abort, I'm assuming that the actual trigger might be a different instruction that just happens to be in the pipeline at the same time? To check this, I have already tried moving the stack of this task from external DDR into MSRAM, without success.

What could be causing this issue? Are there any additional fault registers or something similar which could explain in more detail what is triggering this issue?

4 months ago

0 Meet Thakar 4 months ago

TI__Mastermind 22815 points

Hi,

Could you let me know if you are testing this on a Custom board or an EVM?

Also, are you seeing the issue with a specific example? If you are using an EVM and example code from the SDK, I can run the same on my end to check whether I can reproduce the issue.

Best Regards,

Meet.

0 Benedikt Schmidt 4 months ago in reply to Meet Thakar

Prodigy 20 points

It is a custom board and a custom application, so unfortunately I cannot really offer something for reproducing the issue. But I would already be really glad if you could offer me some guidance on how I can investigate this further.

0 Meet Thakar 4 months ago in reply to Benedikt Schmidt

TI__Mastermind 22815 points

You mentioned that moving the task's stack from DDR to MSRAM gave the same result. Are you using DDR for your FreeRTOS kernel code as well? Could you share the linker file for your application? I just want to check whether DDR is somehow responsible for this issue.

Also which DDR part number are you using on your custom board?

0 Benedikt Schmidt 4 months ago in reply to Meet Thakar

Prodigy 20 points

 --stack_size=16384
 --heap_size=32768
-evector_table

__HEAP_SIZE = 32768;
__STACK_SIZE = 16384;
__IRQ_STACK_SIZE = 256;
__FIQ_STACK_SIZE = 256;
__SVC_STACK_SIZE = 4096;
__ABORT_STACK_SIZE = 256;
__UNDEFINED_STACK_SIZE = 256;

MEMORY
{
    TCMA_VECTOR                : ORIGIN = 0x00000000, LENGTH = 0x00040
    TCMA                       : ORIGIN = 0x00000040, LENGTH = 0x07FC0
    TCMB                       : ORIGIN = 0x41010000, LENGTH = 0x08000
    CORE0_BSS_CACHED           : ORIGIN = 0x70000000, LENGTH = 0x10000
    CORE1_BSS_CACHED           : ORIGIN = 0x70010000, LENGTH = 0x10000
    CORE0_DATA                 : ORIGIN = 0x70020000, LENGTH = 0x10000
    CORE1_DATA                 : ORIGIN = 0x70030000, LENGTH = 0x10000
    CORE0_CODE                 : ORIGIN = 0x70040000, LENGTH = 0x2F9EC
    CORE0_CODE_HEADER          : ORIGIN = 0x7006F9EC, LENGTH = 0x00614
    CORE0_CODE_RODATA          : ORIGIN = 0x70070000, LENGTH = 0x0FDE4
    CORE0_CODE_RODATA_HEADER   : ORIGIN = 0x7007FDE4, LENGTH = 0x0021C
    CORE1_CODE                 : ORIGIN = 0x70080000, LENGTH = 0x4F5F4
    CORE1_CODE_HEADER          : ORIGIN = 0x700CF5F4, LENGTH = 0x00A0C
    CORE2_BSS_UNCACHED         : ORIGIN = 0x700D0000, LENGTH = 0x10000
    CORE3_BSS_CACHED           : ORIGIN = 0x700E0000, LENGTH = 0x10000
    CORE2_CODE                 : ORIGIN = 0x700F0000, LENGTH = 0x7C000
    CORE2_SHARED               : ORIGIN = 0x7016C000, LENGTH = 0x02000
    CORE3_SHARED               : ORIGIN = 0x7016E000, LENGTH = 0x02000
    CORE3_CODE                 : ORIGIN = 0x70170000, LENGTH = 0x50000
    CORE0_SHARED               : ORIGIN = 0x701C0000, LENGTH = 0x08000
    CORE1_SHARED               : ORIGIN = 0x701C8000, LENGTH = 0x08000
    CORE2_BSS_CACHED           : ORIGIN = 0x80000000, LENGTH = 0x200000
    CORE0_PLC_APP_CODE         : ORIGIN = 0x80200000, LENGTH = 0x08000
    CORE0_PLC_APP_DATA         : ORIGIN = 0x80210000, LENGTH = 0x08000
    CORE1_PLC_APP_CODE         : ORIGIN = 0x80300000, LENGTH = 0x08000
    CORE1_PLC_APP_DATA         : ORIGIN = 0x80310000, LENGTH = 0x08000
    CORE2_DATA                 : ORIGIN = 0x80400000, LENGTH = 0x10000
    CORE3_DATA                 : ORIGIN = 0x80500000, LENGTH = 0x10000
}

SECTIONS
{
    .vector : {
        *(.vector_table)
    } > TCMA_VECTOR, palign(8) 

    .stack : {
    } > TCMA, palign(8) 

    GROUP : {
        .bss : {
        } palign(8)
    } > CORE2_BSS_CACHED

    GROUP : {
        .bss.uncached : {
        } palign(8)
        .bss.nocache : {
        } palign(8)
    } > CORE2_BSS_UNCACHED

    GROUP : {
        .text : {
        } palign(8)
        .rodata : {
        } palign(8)
        .cinit : {
        } palign(8)
    } > CORE2_CODE

    GROUP : {
        .irqstack : {
            . = . + __IRQ_STACK_SIZE;
        } align(8)
        RUN_START(__IRQ_STACK_START)
        RUN_END(__IRQ_STACK_END)
        .fiqstack : {
            . = . + __FIQ_STACK_SIZE;
        } align(8)
        RUN_START(__FIQ_STACK_START)
        RUN_END(__FIQ_STACK_END)
        .svcstack : {
            . = . + __SVC_STACK_SIZE;
        } align(8)
        RUN_START(__SVC_STACK_START)
        RUN_END(__SVC_STACK_END)
        .abortstack : {
            . = . + __ABORT_STACK_SIZE;
        } align(8)
        RUN_START(__ABORT_STACK_START)
        RUN_END(__ABORT_STACK_END)
        .undefinedstack : {
            . = . + __UNDEFINED_STACK_SIZE;
        } align(8)
        RUN_START(__UNDEFINED_STACK_START)
        RUN_END(__UNDEFINED_STACK_END)
    } > CORE2_DATA

    GROUP : {
        .data : {
        } palign(8)
    } > CORE2_DATA

    GROUP : {
        .shared_core0 : {
            *(.spsc_queue_ErrorItemsToREHCore0_write_index)
			...
        } palign(8), type = NOINIT
    }> CORE0_SHARED

    GROUP : {
        .shared_core1 : {
            *(.spsc_queue_ErrorItemsToREHCore1_write_index)
			...
        } palign(8), type = NOINIT
    }> CORE1_SHARED

    GROUP : {
        .shared_core2 : {
            *(.spsc_queue_ErrorItemsToREHCore0_read_index)
			...
        } palign(8)
    }> CORE2_SHARED

    GROUP : {
        .shared_core3 : {
            *(.shared_atomic_variable_bootCounter)
			...
        } palign(8), type = NOINIT
    }> CORE3_SHARED

}

This is our linker script for this core. I have shortened it a bit (... in COREX_SHARED) to be able to upload it here directly.

As you can see in the linker script, we have moved BSS and DATA to DDR for this core, so this also applies to the BSS and DATA of the FreeRTOS kernel. CODE itself is currently still in MSRAM, but part of it will also be moved into DDR in the near future.

The DDR part we are using is MT40A1G16KD-062E IT.

0 Meet Thakar 4 months ago in reply to Benedikt Schmidt

TI__Mastermind 22815 points

Can you also share the value of the ADFSR register?

Can you configure the MSRAM region from which the code is executing as strongly ordered instead of cached, and see whether you then get the exact instruction that causes the abort rather than an asynchronous abort?

0 Benedikt Schmidt 4 months ago in reply to Meet Thakar

Prodigy 20 points

I've reconfigured the MPU region where the affected stack is located to have this region access control register value: 0x300. This should be strongly ordered, without cache?

With these settings I get this information:
ADFSR = 0x3F
DFAR = 0x0
DFSR = 0x1C06

If I understood the documentation of ADFSR (Cortex-R5 Technical Reference Manual) correctly this means the error is coming from Cache/AXIM?
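The reading above can be checked with a small Python sketch. It assumes the ADFSR layout from the Cortex-R5 TRM, in which the Side field occupies bits [23:22]. The function name is hypothetical, and the sketch does not establish whether the ADFSR fields are meaningful for this abort type.

```python
# Assumed layout (Cortex-R5 TRM, ADFSR/AIFSR): Side is bits [23:22].
SIDE_NAMES = {
    0b00: "cache or AXI master interface",
    0b01: "ATCM",
    0b10: "BTCM",
    0b11: "reserved",
}

def adfsr_side(adfsr: int) -> str:
    """Return the error source named by the Side field of an ADFSR value."""
    return SIDE_NAMES[(adfsr >> 22) & 0x3]

print(adfsr_side(0x3F))  # ADFSR value reported in this post
```

With ADFSR = 0x3F the Side field is 0b00. The bits that are set all lie below bit 22 and are not interpreted by this sketch.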

0 Yashraj Motwani 4 months ago in reply to Benedikt Schmidt

TI__Intellectual 1430 points

Hello,
Thank you for your query. The concerned expert is Out of Office due to a TI India Holiday.
Please expect a delay in response. We appreciate your patience and understanding.

Best regards,
TI E2E Support Team

This is an automated notification.

0 Meet Thakar 3 months ago in reply to Yashraj Motwani

TI__Mastermind 22815 points

Hi,

Benedikt Schmidt said:
I've reconfigured the MPU region where the affected stack is located to have this region access control register value: 0x300. This should be strongly ordered, without cache?

I meant to configure MSRAM memory from where vPortRestoreTaskContext() is executing to strongly ordered to check if it can give us the exact instruction at which the abort is triggered. For a quick test you can make the entire MSRAM strongly ordered. To trace the exact instruction causing the abort, inspect the instruction near (R14 − 8) in your abort handler stack frame; please refer to section 5.2.2.4 here: https://www.ti.com/lit/an/sprad28/sprad28.pdf#page=13

Benedikt Schmidt said:
a data abort for the first context switch, which happens to be the timer task.

You mentioned that the first context switch is into the timer task, so it might be worth checking in your map file at which address uxTimerTaskStack is allocated. To confirm whether the timer task is actually causing the issue or it is something else, you can try disabling it and see whether the issue persists. You can disable it by setting configUSE_TIMERS to 0 in your FreeRTOSConfig.h file.

Benedikt Schmidt said:
If I understood the documentation of ADFSR (Cortex-R5 Technical Reference Manual) correctly this means the error is coming from Cache/AXIM?

The combined register values point to an asynchronous external abort triggered by an AXI Slave error (SLVERR) during a write access. This could be caused by incorrect MPU settings. Can you share your current MPU settings or, if possible, the syscfg file?

Best Regards,

Meet.

0 Benedikt Schmidt 3 months ago in reply to Meet Thakar

Prodigy 20 points

As soon as I configure the code section as strongly ordered as well, I run into prefetch aborts, which is kind of weird; I haven't yet figured out why.

uxTimerTaskStack is located where I would expect it, in BSS at 0x8009b700.
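A quick way to cross-check such addresses against the linker script is a lookup over the MEMORY regions. The following Python sketch copies a few origin/length pairs from the MEMORY block posted earlier in this thread; the helper name `region_of` is made up for illustration.

```python
# Regions copied from the MEMORY block of the linker script in this thread
# (only a subset; name -> (origin, length)).
MEMORY = {
    "CORE2_BSS_UNCACHED": (0x700D0000, 0x10000),
    "CORE2_CODE": (0x700F0000, 0x7C000),
    "CORE2_SHARED": (0x7016C000, 0x02000),
    "CORE2_BSS_CACHED": (0x80000000, 0x200000),
    "CORE2_DATA": (0x80400000, 0x10000),
}

def region_of(address):
    """Return the name of the MEMORY region containing address, or None."""
    for name, (origin, length) in MEMORY.items():
        if origin <= address < origin + length:
            return name
    return None

print(region_of(0x8009B700))  # address of uxTimerTaskStack
print(region_of(0x7010ACE4))  # address of the rfeia instruction
```

With these values, the stack address falls into CORE2_BSS_CACHED and the rfeia instruction into CORE2_CODE.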

I'll see if I can ensure that a different task is scheduled first.

This is a dump of all MPU regions of the faulting core:

[0]
    baseAddress 0x00000000
    sizeAndEnable 0x0000003F
    attributes 0x00001020
[1]
    baseAddress 0x00000000
    sizeAndEnable 0x0000803D
    attributes 0x00001204
[2]
    baseAddress 0x00000000
    sizeAndEnable 0x0000001F
    attributes 0x00001329
[3]
    baseAddress 0x00000000
    sizeAndEnable 0x0000000B
    attributes 0x00000629
[4]
    baseAddress 0x70000000
    sizeAndEnable 0x0000F727
    attributes 0x00001624
[5]
    baseAddress 0x70080000
    sizeAndEnable 0x0000FB25
    attributes 0x00001324
[6]
    baseAddress 0x70080000
    sizeAndEnable 0x0000F725
    attributes 0x00001624
[7]
    baseAddress 0x70080000
    sizeAndEnable 0x0000EF25
    attributes 0x00000324
[8]
    baseAddress 0x70080000
    sizeAndEnable 0x0000BF25
    attributes 0x00001329
[9]
    baseAddress 0x70100000
    sizeAndEnable 0x0000F827
    attributes 0x00000324
[10]
    baseAddress 0x80000000
    sizeAndEnable 0x0000FE2F
    attributes 0x00000300
[11]
    baseAddress 0x80800000
    sizeAndEnable 0x0000FE2D
    attributes 0x00000324
[12]
    baseAddress 0
    sizeAndEnable 0
    attributes 0
[13]
    baseAddress 0
    sizeAndEnable 0
    attributes 0
[14]
    baseAddress 0
    sizeAndEnable 0
    attributes 0
[15]
    baseAddress 0
    sizeAndEnable 0
    attributes 0

0 Meet Thakar 3 months ago in reply to Benedikt Schmidt

TI__Mastermind 22815 points

Thanks for sharing the MPU config. I see that the 512 kB region starting at 0x70080000 has multiple MPU settings; is there any reason for that? I also see that the region starting at 0x70000000 is configured as read-only. Any write access to this memory might lead to a data abort; you can try changing this to RD+WR access.

Is there any reason why the sub-region disable mask is configured for most of these?

0 Benedikt Schmidt 3 months ago in reply to Meet Thakar

Prodigy 20 points

The region at 0x70080000 only appears at first glance to have multiple settings; these are actually distinct regions, distinguished by their enabled subregions. We use subregions rather extensively because we generate these settings automatically from another configuration file, so this is just the output of an algorithm that aligns sizes and subregions so that a given memory map needs a minimum number of MPU regions. But if I understood this correctly, this should actually be fine?
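To make the subregion argument concrete, here is a Python sketch that expands a sizeAndEnable value into the address ranges it actually enables. It assumes the Cortex-R5 MPU Region Size and Enable Register layout (enable in bit 0, size field N in bits [5:1] meaning 2^(N+1) bytes, subregion disable mask in bits [15:8] with 1 = disabled). The function name is hypothetical.

```python
def enabled_ranges(base, size_and_enable):
    """Return (start, end) pairs for the enabled subregions of one MPU region."""
    if not size_and_enable & 0x1:            # bit 0: region enable
        return []
    size = 1 << (((size_and_enable >> 1) & 0x1F) + 1)  # bits [5:1]: 2**(N+1) bytes
    srd = (size_and_enable >> 8) & 0xFF      # bits [15:8]: 1 = subregion disabled
    sub = size // 8                          # eight equally sized subregions
    return [(base + i * sub, base + (i + 1) * sub - 1)
            for i in range(8) if not (srd >> i) & 0x1]

# Regions [5]..[8] from the dump, all with baseAddress 0x70080000.
for index, value in [(5, 0xFB25), (6, 0xF725), (7, 0xEF25), (8, 0xBF25)]:
    print(index, [(hex(s), hex(e)) for s, e in enabled_ranges(0x70080000, value)])
```

Under these assumptions, each of the four regions enables exactly one 64 KB subregion, and the four resulting ranges do not overlap.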

Regarding the one starting at 0x70000000, it has the following settings:

XN=0b1 -> no instruction fetches
AP=0b110 -> Privileged/User read-only
TEX=0b100
S=0b1
C=0b0
B=0b0 -> TEX,C,B=0b10000 -> Non-cacheable inner and outer policy
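The field breakdown above can be reproduced with a short Python sketch. It assumes the Cortex-R5 MPU Region Access Control Register layout (XN in bit 12, AP in bits [10:8], TEX in bits [5:3], S, C and B in bits 2, 1 and 0). The function names are made up for illustration, and the AP table lists only the non-reserved encodings.

```python
AP_NAMES = {
    0b000: "no access",
    0b001: "privileged read/write, user no access",
    0b010: "privileged read/write, user read-only",
    0b011: "privileged/user read/write",
    0b101: "privileged read-only, user no access",
    0b110: "privileged/user read-only",
}

def decode_access_control(value):
    """Split an MPU region access control value into its fields."""
    return {
        "XN": (value >> 12) & 0x1,
        "AP": (value >> 8) & 0x7,
        "TEX": (value >> 3) & 0x7,
        "S": (value >> 2) & 0x1,
        "C": (value >> 1) & 0x1,
        "B": value & 0x1,
    }

def is_strongly_ordered(value):
    """TEX=0b000, C=0, B=0 is the strongly-ordered encoding."""
    f = decode_access_control(value)
    return (f["TEX"], f["C"], f["B"]) == (0, 0, 0)

# 0x1624 is region [4] from the dump; 0x300 is the value mentioned earlier for the stack region.
for value in (0x1624, 0x300):
    f = decode_access_control(value)
    print(hex(value), f, AP_NAMES.get(f["AP"], "reserved"), is_strongly_ordered(value))
```

For 0x1624 this gives the same fields as listed above. For 0x300 it gives full read/write access with the strongly-ordered encoding.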

This memory region is intended to be written by other cores and read by this one, hence these settings. I don't understand what you mean by it being read-only and by "any write access". Could you please clarify this statement?

0 Meet Thakar 3 months ago in reply to Benedikt Schmidt

TI__Mastermind 22815 points

Benedikt Schmidt said:
This memory region is intended to be written by other cores and read by this one, hence these settings. I don't understand what you mean by it being read-only and by "any write access". Could you please clarify this statement?

What I meant is that this particular core has only read access to this memory region; if this core tries to write to memory in this region, it can lead to an abort. As long as this CPU is not writing to this memory region, it should be fine.

Benedikt Schmidt said:
I'll see if I can ensure that a different task is scheduled first.

Let me know your observations on this. The abort is most likely due to a store instruction trying to access a memory region that is invalid or to which your core has no access, but your MPU settings don't indicate such an issue.

Is it possible for you to provide reference code for this that can run on the EVM, so that I can try to reproduce it on my end and see whether I can find the root cause?

0 Benedikt Schmidt 3 months ago in reply to Meet Thakar

Prodigy 20 points

Unfortunately I cannot break it down any further; it really only appears in the whole system on our custom hardware. Another very interesting part is that, for reasons unknown to me, I can avoid the issue via a breakpoint. To change the priority of the timer task I added a breakpoint at xTimerCreateTimerTask, changed the priority in the call to xTaskCreateStatic and let the system run again. With this, the problem no longer appears. Even more confusingly, I get the same result when I don't change the priority at all and just step into the task creation and let the system run again afterwards.

So to sum it up: The data abort appears only once. Retrying the exact same thing succeeds. Hence I conclude that the MPU settings must be correct, because otherwise the data abort would appear again. Besides that, I can avoid the data abort with some instruction steps somewhere way earlier, not really related to the data abort later on.

Overall, this is a very unclear and confusing error pattern. Do you have any idea what could be causing it?

0 Benedikt Schmidt 3 months ago in reply to Benedikt Schmidt

Prodigy 20 points

Sorry, I fooled myself. The problem only appears sometimes, so my previous conclusion that stepping manually through the task creation made a difference was wrong. On the plus side, I was now able to demote the timer task and let another task be the target of the first context switch. The problem occurs there as well, so it seems to be independent of the task being switched into. The data abort just appears, at least sometimes, on the first context switch into a task.

But the other conclusions are still valid. I don't think it can be a misconfiguration of the MPU, because then it would happen always and a retry wouldn't succeed.

0 Benedikt Schmidt 3 months ago in reply to Meet Thakar

Prodigy 20 points

Meet Thakar said:
I meant to configure MSRAM memory from where vPortRestoreTaskContext() is executing to strongly ordered to check if it can give us the exact instruction at which the abort is triggered.

I think I now understand why the system failed when I configured the code region as strongly ordered:

Any address in an MPU region with device or strongly-ordered memory type attributes is implicitly given execute-never (XN) permissions.

from the Cortex-R5 Technical Reference Manual

0 Meet Thakar 3 months ago in reply to Benedikt Schmidt

TI__Mastermind 22815 points

Hi Benedikt,

Apologies for the delay. I am checking internally to get some more ideas and will let you know once I have an update.

Best Regards,

Meet.
