AM2434: Data abort for first context switch

Part Number: AM2434

Hello,

I'm running FreeRTOS from ind_comms_sdk_am243x_11_00_00_08 on an AM2434. On one of the R5 cores, roughly 50% of the time I get a data abort on the first context switch, which happens to be into the timer task.

I've already read out the corresponding registers:

  • DFAR: 0x00000000
  • DFSR: 0x00001C06

According to the documentation of the fault registers (https://developer.arm.com/documentation/ddi0460/c/System-Control/Register-descriptions/Fault-Status-and-Address-Registers) this seems to be an Asynchronous External Abort.
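For reference, a minimal sketch of how the DFSR value above splits into fields. It assumes the Cortex-R5 DFSR layout as I read it in the TRM (status in bit 10 plus bits [3:0], RW in bit 11, SD in bit 12); the type and function names are hypothetical and not part of the SDK.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical helper type, not part of the SDK. */
typedef struct {
    uint32_t status;      /* 5-bit fault status: bit 10 (S) joined with bits [3:0] */
    bool     is_write;    /* bit 11 (RW): true if the aborting access was a write */
    bool     slave_error; /* bit 12 (SD): true = AXI SLVERR, false = AXI DECERR;
                             only meaningful for external aborts */
} dfsr_fields_t;

/* Split a raw DFSR value into the fields above. */
static dfsr_fields_t dfsr_decode(uint32_t dfsr)
{
    dfsr_fields_t f;
    f.status      = (((dfsr >> 10) & 0x1u) << 4) | (dfsr & 0xFu);
    f.is_write    = ((dfsr >> 11) & 0x1u) != 0u;
    f.slave_error = ((dfsr >> 12) & 0x1u) != 0u;
    return f;
}

/* Map the 5-bit status to a readable name (encodings assumed from the TRM). */
static const char *dfsr_status_name(uint32_t status)
{
    switch (status) {
    case 0x00u: return "background";
    case 0x01u: return "alignment";
    case 0x02u: return "debug event";
    case 0x08u: return "synchronous external abort";
    case 0x0Du: return "permission";
    case 0x16u: return "asynchronous external abort";
    case 0x18u: return "asynchronous parity/ECC error";
    case 0x19u: return "synchronous parity/ECC error";
    default:    return "unknown/reserved";
    }
}
```

With these assumptions, `dfsr_decode(0x00001C06)` gives status 0x16 (asynchronous external abort), a write access, and SD set, which for an external abort would indicate an AXI slave error.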

When the abort occurs, it always happens during execution of the RFEIA instruction at the end of the context switch. That is, if I single-step with PC=7010ace4, I end up either in the timer task function or, in ~50% of the cases, in the data abort entry of the vector table.

          vPortRestoreTaskContext():
7010acb0:   F102001F            cps        #0x1f
7010acb4:   E59F01C8            ldr        r0, [pc, #0x1c8]
7010acb8:   E5901000            ldr        r1, [r0]
7010acbc:   E591D000            ldr        r13, [r1]
7010acc0:   E59F01C0            ldr        r0, [pc, #0x1c0]
7010acc4:   E49D1004            pop        {r1}
7010acc8:   E5801000            str        r1, [r0]
7010accc:   E3510000            cmp        r1, #0
7010acd0:   149D0004            popne      {r0}
7010acd4:   1CBD0B20            vpopne     {d0, d1, d2, d3, d4, d5, d6, d7, d8, d9, d10, d11, d12, d13, d14, d15}
7010acd8:   1EE10A10            vmsrne     fpscr, r0
7010acdc:   F57FF01F            clrex      
7010ace0:   E8BD5FFF            pop        {r0, r1, r2, r3, r4, r5, r6, r7, r8, r9, r10, r11, r12, r14}
7010ace4:   F8BD0A00            rfeia      r13!

As it is an asynchronous external abort, I assume the actual trigger might be a different instruction that just happens to be in the pipeline at the same time? To test this, I have already tried moving the stack of this task from the external DDR into the MSRAM, without success.

What could be causing this issue? Are there any additional fault registers or something similar which could explain in more detail what is triggering this issue?

  • Hi,

    Could you let me know whether you are testing this on a custom board or an EVM?

    Also, are you seeing the issue with a specific example from the SDK? If you are using an EVM and example code from the SDK, I can run the same on my end to check whether I can reproduce the issue.

    Best Regards,

    Meet.

  • It is a custom board and a custom application, so unfortunately I cannot really offer anything to reproduce the issue. But I would already be really glad if you could give me some guidance on how to investigate this further.

  • You mentioned that moving the task's stack from DDR to MSRAM gave the same result. Are you using DDR for your FreeRTOS kernel code as well? Could you share the linker file for your application? I just want to check whether DDR is somehow responsible for this issue.

    Also which DDR part number are you using on your custom board?

  •  --stack_size=16384
     --heap_size=32768
    -evector_table
    
    __HEAP_SIZE = 32768;
    __STACK_SIZE = 16384;
    __IRQ_STACK_SIZE = 256;
    __FIQ_STACK_SIZE = 256;
    __SVC_STACK_SIZE = 4096;
    __ABORT_STACK_SIZE = 256;
    __UNDEFINED_STACK_SIZE = 256;
    
    MEMORY
    {
        TCMA_VECTOR                : ORIGIN = 0x00000000, LENGTH = 0x00040
        TCMA                       : ORIGIN = 0x00000040, LENGTH = 0x07FC0
        TCMB                       : ORIGIN = 0x41010000, LENGTH = 0x08000
        CORE0_BSS_CACHED           : ORIGIN = 0x70000000, LENGTH = 0x10000
        CORE1_BSS_CACHED           : ORIGIN = 0x70010000, LENGTH = 0x10000
        CORE0_DATA                 : ORIGIN = 0x70020000, LENGTH = 0x10000
        CORE1_DATA                 : ORIGIN = 0x70030000, LENGTH = 0x10000
        CORE0_CODE                 : ORIGIN = 0x70040000, LENGTH = 0x2F9EC
        CORE0_CODE_HEADER          : ORIGIN = 0x7006F9EC, LENGTH = 0x00614
        CORE0_CODE_RODATA          : ORIGIN = 0x70070000, LENGTH = 0x0FDE4
        CORE0_CODE_RODATA_HEADER   : ORIGIN = 0x7007FDE4, LENGTH = 0x0021C
        CORE1_CODE                 : ORIGIN = 0x70080000, LENGTH = 0x4F5F4
        CORE1_CODE_HEADER          : ORIGIN = 0x700CF5F4, LENGTH = 0x00A0C
        CORE2_BSS_UNCACHED         : ORIGIN = 0x700D0000, LENGTH = 0x10000
        CORE3_BSS_CACHED           : ORIGIN = 0x700E0000, LENGTH = 0x10000
        CORE2_CODE                 : ORIGIN = 0x700F0000, LENGTH = 0x7C000
        CORE2_SHARED               : ORIGIN = 0x7016C000, LENGTH = 0x02000
        CORE3_SHARED               : ORIGIN = 0x7016E000, LENGTH = 0x02000
        CORE3_CODE                 : ORIGIN = 0x70170000, LENGTH = 0x50000
        CORE0_SHARED               : ORIGIN = 0x701C0000, LENGTH = 0x08000
        CORE1_SHARED               : ORIGIN = 0x701C8000, LENGTH = 0x08000
        CORE2_BSS_CACHED           : ORIGIN = 0x80000000, LENGTH = 0x200000
        CORE0_PLC_APP_CODE         : ORIGIN = 0x80200000, LENGTH = 0x08000
        CORE0_PLC_APP_DATA         : ORIGIN = 0x80210000, LENGTH = 0x08000
        CORE1_PLC_APP_CODE         : ORIGIN = 0x80300000, LENGTH = 0x08000
        CORE1_PLC_APP_DATA         : ORIGIN = 0x80310000, LENGTH = 0x08000
        CORE2_DATA                 : ORIGIN = 0x80400000, LENGTH = 0x10000
        CORE3_DATA                 : ORIGIN = 0x80500000, LENGTH = 0x10000
    }
    
    SECTIONS
    {
        .vector : {
            *(.vector_table)
        } > TCMA_VECTOR, palign(8) 
    
        .stack : {
        } > TCMA, palign(8) 
    
        GROUP : {
            .bss : {
            } palign(8)
        } > CORE2_BSS_CACHED
    
        GROUP : {
            .bss.uncached : {
            } palign(8)
            .bss.nocache : {
            } palign(8)
        } > CORE2_BSS_UNCACHED
    
        GROUP : {
            .text : {
            } palign(8)
            .rodata : {
            } palign(8)
            .cinit : {
            } palign(8)
        } > CORE2_CODE
    
        GROUP : {
            .irqstack : {
                . = . + __IRQ_STACK_SIZE;
            } align(8)
            RUN_START(__IRQ_STACK_START)
            RUN_END(__IRQ_STACK_END)
            .fiqstack : {
                . = . + __FIQ_STACK_SIZE;
            } align(8)
            RUN_START(__FIQ_STACK_START)
            RUN_END(__FIQ_STACK_END)
            .svcstack : {
                . = . + __SVC_STACK_SIZE;
            } align(8)
            RUN_START(__SVC_STACK_START)
            RUN_END(__SVC_STACK_END)
            .abortstack : {
                . = . + __ABORT_STACK_SIZE;
            } align(8)
            RUN_START(__ABORT_STACK_START)
            RUN_END(__ABORT_STACK_END)
            .undefinedstack : {
                . = . + __UNDEFINED_STACK_SIZE;
            } align(8)
            RUN_START(__UNDEFINED_STACK_START)
            RUN_END(__UNDEFINED_STACK_END)
        } > CORE2_DATA
    
        GROUP : {
            .data : {
            } palign(8)
        } > CORE2_DATA
    
        GROUP : {
            .shared_core0 : {
                *(.spsc_queue_ErrorItemsToREHCore0_write_index)
    			...
            } palign(8), type = NOINIT
        }> CORE0_SHARED
    
        GROUP : {
            .shared_core1 : {
                *(.spsc_queue_ErrorItemsToREHCore1_write_index)
    			...
            } palign(8), type = NOINIT
        }> CORE1_SHARED
    
        GROUP : {
            .shared_core2 : {
                *(.spsc_queue_ErrorItemsToREHCore0_read_index)
    			...
            } palign(8)
        }> CORE2_SHARED
    
        GROUP : {
            .shared_core3 : {
                *(.shared_atomic_variable_bootCounter)
    			...
            } palign(8), type = NOINIT
        }> CORE3_SHARED
    
    }

    This is our linker script for this core. I have shortened it a bit (... in COREX_SHARED) to be able to upload it here directly.

    As you can see in the linker script, we have moved BSS and DATA to the DDR for this core, and hence also those of the FreeRTOS kernel code. CODE itself is currently still in the MSRAM, but will in the near future also be partially moved to the DDR.

    As DDR we are using MT40A1G16KD-062E IT.

  • Can you also share the value of ADFSR register?

    Can you configure the MSRAM region from which the code is executing as strongly ordered instead of cached, and see whether you then get the exact instruction that causes the abort rather than an asynchronous abort?

  • I've reconfigured the MPU region where the affected stack is located to have this region access control register value: 0x300. This should be strongly ordered, without cache?

    With these settings I get this information:
    ADFSR = 0x3F
    DFAR = 0x0
    DFSR = 0x1C06

    If I understood the documentation of ADFSR (Cortex-R5 Technical Reference Manual) correctly this means the error is coming from Cache/AXIM?
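A minimal sketch of that reading, assuming the Side field sits in ADFSR bits [23:22] with 0b00 = cache or AXI master, 0b01 = ATCM, 0b10 = BTCM (my recollection of the TRM layout, so please double-check it). The names are hypothetical. It may also be worth confirming in the TRM whether ADFSR is updated for external aborts at all, or only for parity/ECC errors.

```c
#include <stdint.h>

/* Hypothetical names; encodings assumed from the Cortex-R5 TRM. */
typedef enum {
    ADFSR_SIDE_CACHE_OR_AXIM = 0,
    ADFSR_SIDE_ATCM          = 1,
    ADFSR_SIDE_BTCM          = 2,
    ADFSR_SIDE_RESERVED      = 3
} adfsr_side_t;

/* Extract the Side field, assumed to be ADFSR bits [23:22]. */
static adfsr_side_t adfsr_side(uint32_t adfsr)
{
    return (adfsr_side_t)((adfsr >> 22) & 0x3u);
}
```

Under this assumption, `adfsr_side(0x3F)` is `ADFSR_SIDE_CACHE_OR_AXIM`, because only the low six bits of 0x3F are set.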

  • Hi,

    "I've reconfigured the MPU region where the affected stack is located to have this region access control register value: 0x300. This should be strongly ordered, without cache?"

    I meant configuring the MSRAM region from which vPortRestoreTaskContext() executes as strongly ordered, to check whether that gives us the exact instruction at which the abort is triggered. For a quick test you can make the entire MSRAM strongly ordered. To trace the exact instruction causing the abort, inspect the instruction near (R14 − 8) in your abort handler stack frame; please refer to section 5.2.2.4 here: https://www.ti.com/lit/an/sprad28/sprad28.pdf#page=13
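As a small illustration of the (R14 − 8) rule: on data abort entry the banked LR of abort mode holds the address of the aborting instruction plus 8. The helper below is a hypothetical sketch, not SDK code. For an asynchronous abort the result only shows where execution was when the abort was taken, not necessarily the access that caused it.

```c
#include <stdint.h>

/* Hypothetical helper for a data abort handler: converts the saved LR_abt
 * into the address of the instruction the core reports for the abort. */
static uint32_t data_abort_candidate_pc(uint32_t lr_abt)
{
    return lr_abt - 8u;
}
```

For example, if the abort were reported precisely on the RFEIA at 0x7010ace4, LR_abt would read 0x7010acec.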

    "... a data abort for the first context switch, which happens to be the timer task."

    You mentioned that the first context switch is into the timer task, so it might be worth checking at what address uxTimerTaskStack is allocated in your map file. To confirm whether the timer task is actually causing the issue or whether it is something else, you can try disabling it and see whether the issue still occurs. You can disable it by setting configUSE_TIMERS to 0 in your FreeRTOSConfig.h file.

    "If I understood the documentation of ADFSR (Cortex-R5 Technical Reference Manual) correctly this means the error is coming from Cache/AXIM?"

    The combined register values point to an asynchronous external abort triggered by an AXI slave error (SLVERR) during a write access. This could be caused by incorrect MPU settings. Can you share your current MPU settings or, if possible, the syscfg file?

    Best Regards,

    Meet.

  • As soon as I configure the code section as strongly ordered as well, I run into prefetch aborts, which is kind of weird; I haven't figured out why yet.

    uxTimerTaskStack is located where I would expect it, in BSS at 0x8009b700.

    I'll see if I can ensure that a different task is scheduled first.

    This is a dump of all MPU regions of the faulting core:

    [0]
        baseAddress 0x00000000
        sizeAndEnable 0x0000003F
        attributes 0x00001020
    [1]
        baseAddress 0x00000000
        sizeAndEnable 0x0000803D
        attributes 0x00001204
    [2]
        baseAddress 0x00000000
        sizeAndEnable 0x0000001F
        attributes 0x00001329
    [3]
        baseAddress 0x00000000
        sizeAndEnable 0x0000000B
        attributes 0x00000629
    [4]
        baseAddress 0x70000000
        sizeAndEnable 0x0000F727
        attributes 0x00001624
    [5]
        baseAddress 0x70080000
        sizeAndEnable 0x0000FB25
        attributes 0x00001324
    [6]
        baseAddress 0x70080000
        sizeAndEnable 0x0000F725
        attributes 0x00001624
    [7]
        baseAddress 0x70080000
        sizeAndEnable 0x0000EF25
        attributes 0x00000324
    [8]
        baseAddress 0x70080000
        sizeAndEnable 0x0000BF25
        attributes 0x00001329
    [9]
        baseAddress 0x70100000
        sizeAndEnable 0x0000F827
        attributes 0x00000324
    [10]
        baseAddress 0x80000000
        sizeAndEnable 0x0000FE2F
        attributes 0x00000300
    [11]
        baseAddress 0x80800000
        sizeAndEnable 0x0000FE2D
        attributes 0x00000324
    [12]
        baseAddress 0
        sizeAndEnable 0
        attributes 0
    [13]
        baseAddress 0
        sizeAndEnable 0
        attributes 0
    [14]
        baseAddress 0
        sizeAndEnable 0
        attributes 0
    [15]
        baseAddress 0
        sizeAndEnable 0
        attributes 0
  • Thanks for sharing the MPU config. I see that the 512 kB region starting at 0x70080000 has multiple MPU settings; is there any reason for that? I also see that the region starting at 0x70000000 is configured as read-only. Any write access to this memory might lead to a data abort; you can try changing this to RD+WR access.

    Is there any reason why the sub-region disable mask is configured for most of these?

  • The region at 0x70080000 only appears to have multiple settings at first glance; these are actually distinct regions, distinguished by their enabled subregions. We use subregions rather extensively because we generate these settings automatically from another configuration file, so this is just the output of an algorithm that aligns sizes and subregions so that a given memory map needs a minimum number of MPU regions. But if I understood this correctly, this should actually be fine?
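To make the subregion argument concrete, here is a minimal sketch that decodes the sizeAndEnable values from the dump. It assumes the Cortex-R5 MPU Region Size and Enable Register layout (bit 0 = enable, bits [5:1] = size field with region size 2^(field + 1) bytes, bits [15:8] = sub-region disable mask). All function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit 0: region enable. */
static bool mpu_region_enabled(uint32_t size_and_enable)
{
    return (size_and_enable & 0x1u) != 0u;
}

/* Bits [5:1]: size field; the region spans 2^(field + 1) bytes. */
static uint64_t mpu_region_size_bytes(uint32_t size_and_enable)
{
    uint32_t field = (size_and_enable >> 1) & 0x1Fu;
    return (uint64_t)1u << (field + 1u);
}

/* Bits [15:8] are a disable mask; return the inverse, so that bit n set
 * means sub-region n is enabled. */
static uint8_t mpu_enabled_subregions(uint32_t size_and_enable)
{
    return (uint8_t)(~(size_and_enable >> 8) & 0xFFu);
}

/* Start address of sub-region `index` (0..7); each one is 1/8 of the region. */
static uint64_t mpu_subregion_start(uint32_t base, uint32_t size_and_enable,
                                    unsigned index)
{
    uint64_t sub_size = mpu_region_size_bytes(size_and_enable) / 8u;
    return (uint64_t)base + (uint64_t)index * sub_size;
}
```

Applied to the dump, regions 5 to 8 (0xFB25, 0xF725, 0xEF25, 0xBF25) are all 512 KB at 0x70080000 but enable only sub-regions 2, 3, 4 and 6 respectively, i.e. four non-overlapping 64 KB windows starting at 0x700A0000, 0x700B0000, 0x700C0000 and 0x700E0000.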

    Regarding the one starting at 0x70000000, it has the following settings:

    • XN=0b1 -> no instruction fetches
    • AP=0b110 -> Privileged/User read-only
    • TEX=0b100
    • S=0b1
    • C=0b0
    • B=0b0 -> TEX,C,B=0b10000 -> Non-cacheable inner and outer policy

    This memory region is intended to be written by other cores and read by this one, hence these settings. I don't understand what you mean by this being read-only and any write access possibly leading to a data abort. Could you please clarify this statement?
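The field breakdown above can be reproduced with a short sketch. It assumes the Cortex-R5 MPU Region Access Control Register layout (XN in bit 12, AP in bits [10:8], TEX in bits [5:3], S, C, B in bits 2, 1, 0); the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical view of the region access control ("attributes") value. */
typedef struct {
    bool     xn;  /* bit 12: execute never */
    uint32_t ap;  /* bits [10:8]: access permissions */
    uint32_t tex; /* bits [5:3]: type extension */
    bool     s;   /* bit 2: shareable */
    bool     c;   /* bit 1: C attribute */
    bool     b;   /* bit 0: B attribute */
} mpu_access_ctrl_t;

static mpu_access_ctrl_t mpu_access_ctrl_decode(uint32_t attributes)
{
    mpu_access_ctrl_t a;
    a.xn  = ((attributes >> 12) & 0x1u) != 0u;
    a.ap  = (attributes >> 8) & 0x7u;
    a.tex = (attributes >> 3) & 0x7u;
    a.s   = ((attributes >> 2) & 0x1u) != 0u;
    a.c   = ((attributes >> 1) & 0x1u) != 0u;
    a.b   = (attributes & 0x1u) != 0u;
    return a;
}

/* True if the AP encoding allows privileged writes
 * (0b001, 0b010 and 0b011 in the ARMv7-R encoding). */
static bool mpu_ap_privileged_write(uint32_t ap)
{
    return ap == 0x1u || ap == 0x2u || ap == 0x3u;
}
```

For the attributes value 0x1624 of region 4 this yields XN=1, AP=0b110, TEX=0b100, S=1, C=0, B=0, and no privileged write permission, matching the list above.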

    "This memory region is intended to be written by other cores and read by this one, hence these settings. I don't understand what you mean by this being read-only and any write access possibly leading to a data abort. Could you please clarify this statement?"

    What I meant is that this particular core only has read access to this memory region; if this core tries to write to memory in this region, it can lead to an abort. As long as this CPU is not writing to this memory region, it should be fine.

    "I'll see if I can ensure that a different task is scheduled first."

    Let me know your observations on this. The abort is most likely due to a store instruction accessing a memory region that is invalid or to which your core has no access, but your MPU settings don't indicate such an issue.

    Is it possible for you to provide reference code for this that can run on the EVM? Then I can try to reproduce this on my end and see if I can find the root cause.

  • Unfortunately, I cannot break it down any further; it really only appears in the whole system on our custom hardware. The very interesting part is that, for reasons unknown to me, I can avoid the issue via a breakpoint. To change the priority of the timer task, I added a breakpoint at xTimerCreateTimerTask, changed the priority in the call to xTaskCreateStatic and let the system run again. With this, the problem doesn't appear anymore. Even more confusing, I get the same result when I don't change the priority at all and just step into the task creation and let the system run again afterwards.

    So to sum it up: The data abort appears only once. Retrying the exact same thing succeeds. Hence I conclude that the MPU settings must be correct, because otherwise the data abort would appear again. Besides that, I can avoid the data abort with some instruction steps somewhere way earlier, not really related to the data abort later on.

    Overall, this is a very unclear and confusing error pattern. Do you have any idea what could be causing this?

  • Sorry, I fooled myself. The problem appears only sometimes, so my previous conclusion that stepping manually during the task creation made a difference was wrong. On the plus side, I was now able to demote the timer task and let another task be the target of the first context switch. The problem occurs there as well, so it seems to be independent of the task being switched into. The data abort just appears, at least sometimes, on the first context switch into a task.

    But the other conclusions are still valid. I don't think it can be a misconfiguration of the MPU, because then it would always happen and a retry wouldn't succeed.

    "I meant to configure MSRAM memory from where vPortRestoreTaskContext() is executing to strongly ordered to check if it can give us the exact instruction at which the abort is triggered."

    I think I now understand why the system failed when I configured the code region as strongly ordered:

    "Any address in an MPU region with device or strongly-ordered memory type attributes is implicitly given execute-never (XN) permissions."

    from the Cortex-R5 Technical Reference Manual
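A minimal sketch of that rule, assuming the ARMv7-R TEX/C/B encodings (TEX=0b000, C=0, B=0 is strongly ordered; TEX=0b000, C=0, B=1 and TEX=0b010, C=0, B=0 are device). The names are hypothetical, and the check deliberately ignores the AP field.

```c
#include <stdint.h>
#include <stdbool.h>

typedef enum {
    MPU_MEM_STRONGLY_ORDERED,
    MPU_MEM_DEVICE,
    MPU_MEM_NORMAL_OR_OTHER /* normal memory or a reserved encoding */
} mpu_mem_type_t;

/* Classify the memory type from TEX (bits [5:3]) and C, B (bits 1, 0). */
static mpu_mem_type_t mpu_mem_type(uint32_t attributes)
{
    uint32_t tex = (attributes >> 3) & 0x7u;
    uint32_t cb  = attributes & 0x3u;

    if (tex == 0u && cb == 0u) {
        return MPU_MEM_STRONGLY_ORDERED;
    }
    if ((tex == 0u && cb == 1u) || (tex == 2u && cb == 0u)) {
        return MPU_MEM_DEVICE;
    }
    return MPU_MEM_NORMAL_OR_OTHER;
}

/* Instruction fetch is blocked if XN (bit 12) is set explicitly, or
 * implicitly because the region is device or strongly ordered. */
static bool mpu_region_executable(uint32_t attributes)
{
    bool xn = ((attributes >> 12) & 0x1u) != 0u;
    return !xn && mpu_mem_type(attributes) == MPU_MEM_NORMAL_OR_OTHER;
}
```

With these assumptions, the value 0x300 is strongly ordered and therefore not executable even though its XN bit is clear, while 0x324 (as used for the code regions in the dump) is executable.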

  • Hi Benedikt,

    Apologies for the delay. I am checking internally to get some more ideas and will let you know once I have an update.

    Best Regards,

    Meet.