M3 subsystem caching issues

Dear TI employees and others involved,

Currently I have software running on the Cortex-M3 subsystem of the OMAP4460. It works fine but rather slowly, so I want to make sure the shared L1 cache is working as expected.

According to the OMAP4460 datasheet, I have to use the SCACHE_CFG registers for configuration. However, I cannot access them. Trying to read the registers with the TI omapconf tool results in the following message:
[ 3334.851654] Unhandled fault: external abort on non-linefetch (0x1018) at 0xb6e9201c

!!! OUPS... MEMORY ERROR @ 0x00000000 !!!

Are you sure that:
MEMORY ADDRESS IS VALID?
TARGETED MODULE IS CLOCKED?

This error occurs even though I am able to access SCACHE_MMU, which is in the same domain.

I want to make sure that the "everything is cacheable" option is enabled (the BYPASS bit of the SCACHE_CONFIG register).

Can anyone help me out?

Thanks in advance!

Kind regards,

Richard van Berkel.

  • Hello Richard,

    The error you got occurs when you try to read registers in an inactive power domain, or when the clock signals to the module are off.

    The SCACHE_CFG belongs to PD_CORE. The interface/functional clock is MPU_M3_CLK.

    Try checking the power state of the PD_CORE domain with these register fields:

    PM_CORE_PWRSTCTRL[1:0] POWERSTATE
    PM_CORE_PWRSTCTRL[11] MPU_M3_UNICACHE_RETSTATE
    PM_CORE_PWRSTCTRL[23:22] MPU_M3_UNICACHE_ONSTATE

    MPU_M3_CLK clock status: CM_MPU_M3_CLKSTCTRL[8] CLKACTIVITY_MPU_M3_CLK
    Clock domain state transition control: CM_MPU_M3_CLKSTCTRL[1:0] CLKTRCTRL

    I read 0xB6E9201C on my Blaze board:

    omapconf read 0xB6E9201C ==>  C0A80D6B

    Could you provide more information about your hardware and software platforms?

    Best regards,

    Yanko

  • Hello Yanko,

    Thanks for your reply!

    My HW platform is the Pandaboard ES.
    I run the preinstalled Ubuntu Server 12.04 on the A9 subsystem and FreeRTOS on the M3 subsystem. The M3 subsystem is initialized by the A9 subsystem (I basically added the M3 subsystem to the Ubuntu device tree).

    I checked the registers you asked for:
    PM_CORE_PWRSTCTRL ==> 0x03FF0F07
    PM_CORE_PWRSTCTRL[1:0] POWERSTATE ==> 0x3 (ON State)
    PM_CORE_PWRSTCTRL[11] MPU_M3_UNICACHE_RETSTATE ==> 0x1 (Memory bank is retained when domain is in RETENTION state)
    PM_CORE_PWRSTCTRL[23:22] MPU_M3_UNICACHE_ONSTATE ==> 0x3 (Memory bank is on when the domain is ON)

    CM_MPU_M3_CLKSTCTRL ==> 0x102
    CM_MPU_M3_CLKSTCTRL[8]  ==> 0x1 (Corresponding clock is running or gating/ungating transition is ongoing)
    CM_MPU_M3_CLKSTCTRL[1:0] ==> 0x2 (SW_WKUP: Start a software forced wake-up transition on the domain.)
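
The field values above follow directly from the two raw register reads. A minimal sketch of the decoding, using only the bit positions quoted in this thread (the helper names are made up for illustration and are not part of omapconf or any TI header):

```c
#include <stdint.h>

/* Extract a bit-field of `width` bits (width < 32) starting at bit `lsb`. */
static uint32_t reg_field(uint32_t reg, unsigned lsb, unsigned width)
{
    return (reg >> lsb) & ((1u << width) - 1u);
}

/* PM_CORE_PWRSTCTRL fields, bit positions as quoted in the thread. */
static uint32_t core_powerstate(uint32_t r)      { return reg_field(r, 0, 2); }
static uint32_t m3_unicache_retstate(uint32_t r) { return reg_field(r, 11, 1); }
static uint32_t m3_unicache_onstate(uint32_t r)  { return reg_field(r, 22, 2); }

/* CM_MPU_M3_CLKSTCTRL fields, bit positions as quoted in the thread. */
static uint32_t m3_clktrctrl(uint32_t r)         { return reg_field(r, 0, 2); }
static uint32_t m3_clkactivity(uint32_t r)       { return reg_field(r, 8, 1); }
```

Feeding in 0x03FF0F07 and 0x102 reproduces the values listed above (POWERSTATE 0x3, RETSTATE 0x1, ONSTATE 0x3, CLKACTIVITY 0x1, CLKTRCTRL 0x2).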

    Kind regards,

    Richard.

  • Hello Richard,

    Try setting CM_MPU_M3_CLKSTCTRL[1:0] CLKTRCTRL to 0x3 (HW_AUTO: automatic transition is enabled; sleep and wake-up transitions are based on hardware conditions).
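
A minimal sketch of changing only the CLKTRCTRL field while leaving the other bits of the register as they were read. The macro and function names are illustrative assumptions; the field position [1:0] and the values 0x2 (SW_WKUP) and 0x3 (HW_AUTO) are the ones quoted in this thread:

```c
#include <stdint.h>

#define CLKTRCTRL_MASK     0x3u  /* CM_MPU_M3_CLKSTCTRL[1:0] */
#define CLKTRCTRL_SW_WKUP  0x2u
#define CLKTRCTRL_HW_AUTO  0x3u

/* Return `reg` with the CLKTRCTRL field replaced by `mode`;
 * all bits outside [1:0] are preserved. */
static uint32_t clkstctrl_with_mode(uint32_t reg, uint32_t mode)
{
    return (reg & ~CLKTRCTRL_MASK) | (mode & CLKTRCTRL_MASK);
}
```

In a Linux kernel module the result would typically be written back with an ioread32()/iowrite32() pair on the mapped register address, rather than writing the bare mode constant over the whole register.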

    Best regards,

    Yanko

  • Yanko,

    Thanks for your reply again!

    I changed the CLKTRCTRL mode from SW_WKUP to HW_AUTO and rebooted Ubuntu. I received the following kernel messages multiple times:
    Modules linked in: cpufreq_ondemand dm_crypt joydev dm_multipath hid_logitech_dj wl12xx_sdio wl12xx mac80211 cfg80211 twl6040_vibra ff_memless leds_gpio omap_m3cortex(O+) usbhid hid dm_raid45 xor dm_mirror dm_region_hash dm_log btrfs zlib_deflate libcrc32c

    In order to initialize the M3 subsystem and to load an image including FreeRTOS, I wrote an Ubuntu kernel module which includes the following lines of code (the kernel module is automatically inserted at boot time):

    -------------------------------------------------------------------CODE SNIPPED-----------------------------------------------------------------
    /* Enable power saving via automatic interface clock gating */
    iowrite32(MODULEMODE_AUTO, cm_mpu_m3_mpu_m3_clkctrl_virtual);

    /* Brings the M3 clock domain out of reset */
    iowrite32(CLKTRCTRL_HW_AUTO, cm_mpu_m3_clkstctrl_virtual);

    /* Reset Cortex-M3 */
    m3cortex_reset(1);

    /* Enable the MMU and CACHE (p 630) */
    temp = ioread32(rm_mpu_m3_rstctrl_virtual);
    iowrite32(temp & ~RST3, rm_mpu_m3_rstctrl_virtual);

    /* Flush TLB (p 4468) */
    iowrite32(GLOBALFLUSH, cortexm3_l2mmu_gflush_virtual);

    /* Execute software reset of MMU */
    iowrite32(SOFTRESET, cortexm3_l2mmu_sysconfig_virtual);

    /* Wait for finalization of software reset */
    while (!(ioread32(cortexm3_l2mmu_sysstatus_virtual) & RESETDONE)) {}

    /* Disable power saving via automatic interface clock gating */
    temp = ioread32(cortexm3_l2mmu_sysconfig_virtual);
    iowrite32(temp & ~AUTOIDLE, cortexm3_l2mmu_sysconfig_virtual);

    /* Write virtual address of image location */
    iowrite32(0x00000000 | MMU_P | MMU_V | PAGESIZE_1MB, cortexm3_l2mmu_cam_virtual);

    /* Write physical address of image location */
    iowrite32(M3_LOAD_ADDRESS | (ELEMENTSIZE_NOTRANSLATION << ELEMENTSIZE_SHIFT), cortexm3_l2mmu_ram_virtual);

    /* Define TLB entry we would like to write to */
    iowrite32(0x0 << CURRENTVICTIM_SHIFT, cortexm3_l2mmu_lock_virtual);

    /* Load both virtual and physical into the just defined TLB entry */
    iowrite32(LDTLBITEM, cortexm3_l2mmu_ld_tlb_virtual);

    iowrite32(MULTIHITFAULT | TLBMISS, cortexm3_l2mmu_irqenable_virtual);

    /* Write all valid I/O mappings to the TLB, assuming a 4KB page size */
    for (i = 0; valid_io[i] != 0; i++)
    {
        /* Write virtual address */
        iowrite32((valid_io[i] & 0xfffff000) | MMU_P | MMU_V | PAGESIZE_4KB, cortexm3_l2mmu_cam_virtual);

        /* Write physical address */
        iowrite32((valid_io[i] & 0xfffff000) | (ELEMENTSIZE_NOTRANSLATION << ELEMENTSIZE_SHIFT), cortexm3_l2mmu_ram_virtual);

        /* Define TLB entry we would like to write to */
        iowrite32((i+1) << CURRENTVICTIM_SHIFT, cortexm3_l2mmu_lock_virtual);

        /* Load both virtual and physical into the just defined TLB entry */
        iowrite32(LDTLBITEM, cortexm3_l2mmu_ld_tlb_virtual);
    }

    /* Protect TLB entries from overwriting */
    iowrite32((i+1) << BASEVALUE_SHIFT, cortexm3_l2mmu_lock_virtual);

    /* Enable MMU, table walking is disabled (p 4456) */
    iowrite32(MMUENABLE, cortexm3_l2mmu_cntl_virtual);

    /* Now upload the image to the M3 cortex, it must be present in the imgbuf already */
    m3cortex_imgupdate();

    /* Release MMU and CACHE */
    temp = ioread32(rm_mpu_m3_rstctrl_virtual);
    iowrite32(temp & ~RST3, rm_mpu_m3_rstctrl_virtual);

    pr_crit("%s: Done!\n", devname);

    /* Release Cortex-M3 from reset only when there is an image downloaded */
    if (imgbufsize_logical > 0)
    {
        m3cortex_reset(0);
    }

    pr_debug("%s: Cortex-M3 subsystem driver is now up and running!\n", devname);
    -------------------------------------------------------------------CODE SNIPPED----------------------------------------------------------------

    Note the CLKTRCTRL_HW_AUTO in the code snippet.
    Since I now receive a lot of error messages and the system becomes largely unresponsive, I believe I have to reconsider my code. I might have screwed things up. Anyway, if you have any advice at this point, please let me know.
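
One detail in the snippet that may be worth hardening independently of the clock question: the wait for RESETDONE spins forever if the soft reset never completes, which by itself can make a system appear hung. A minimal sketch of a bounded poll follows. The accessor type, the helper name, and the fake register used for demonstration are all hypothetical; in the kernel module the accessor would wrap ioread32() on the SYSSTATUS address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical register accessor; in a kernel module this would
 * wrap ioread32() on the mapped register address. */
typedef uint32_t (*reg_read_fn)(void *ctx);

/* Poll until any bit in `mask` is set, giving up after `max_polls` reads.
 * Returns true if the bit was seen, false on timeout. */
static bool poll_until_set(reg_read_fn read_reg, void *ctx,
                           uint32_t mask, unsigned max_polls)
{
    while (max_polls-- > 0) {
        if (read_reg(ctx) & mask)
            return true;
    }
    return false;
}

/* Stand-in register for demonstration only: it reports bit 0 as set
 * once `reads_until_done` earlier reads have returned zero. */
struct fake_reg {
    unsigned reads_until_done;
};

static uint32_t fake_sysstatus_read(void *ctx)
{
    struct fake_reg *r = ctx;

    if (r->reads_until_done == 0)
        return 0x1u;
    r->reads_until_done--;
    return 0x0u;
}
```

On a timeout the caller can log an error and abort the bring-up instead of blocking the boot. Real kernel code would normally also relax or delay between polls, for example with cpu_relax().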

    Kind regards,

    Richard.

  • Hi,

    When CLKTRCTRL is in HW_AUTO mode I have to assert a wake-up request. Can you tell me how?

    Kind regards,

    Richard.

  • Hello Richard,

    The main function of the HW_AUTO value (0x3) of the CM_<Clock domain>_CLKSTCTRL[x] CLKTRCTRL bit-field is hardware-controlled automatic sleep and wake-up: the transition is initiated by the PRCM module when the associated hardware conditions are satisfied.

    When HW_AUTO mode is set, the clock domain wakes up and changes state to ACTIVE. As the module mode control is still disabled, this event has no effect on the module state. The functional and interface clocks may still be restarted automatically, based on the requirements of other modules sharing these clocks.
    Software changes the module mode to enabled. The clocks to the module are restarted automatically. The PRCM module then deasserts a hardware idle request signal to the module. The module sends an idle request acknowledge to the PRCM module. The module is now effectively awake.

    #Q: I have to assert a wake-up request. Can you tell me how?
    If the clock domain has no dependency on other clock domains, there is no need to assert a wake-up request.

    However, when the clock domain does depend on other clock domains, a wake-up request must be asserted so that the wake-up dependency is processed.

    Clock Domain Dependency
    A domain dependency is a binary relationship between two clock domains. For example, clock domain A
    depends on clock domain B when a module in clock domain B provides services to a module in clock
    domain A. As a result, clock domain B must be active when clock domain A is active so that the module in
    clock domain B is accessible by the module in clock domain A.
    The dependency between two clock domains may also exist if one clock domain serves to ensure
    communication between two modules (for example, the clock domain of the device interconnect).

    Static Dependency
    If clock domain A has a master module that can access a slave module in clock domain B, then clock
    domain A can have a static dependency with clock domain B. Similarly, a static dependency can also exist
    between domains A and B if domain B conveys the transactions from domain A module toward a module
    in any other domain. For example, CD_DSP can have a static dependency with CD_L3_1 because this
    domain has the L3 interconnect to carry the transactions from the DSP module.
    The static dependency between a source clock domain and a destination clock domain is configured in the
    PRCM module by setting the CM_<Source Clock domain>_STATICDEP[x] <Destination Clock
    domain>_STATDEP bit.

    Dynamic Dependency
    When clock domains A and B contain modules directly linked to a common device interconnect, these
    clock domains can have a dynamic dependency.
    A dynamic dependency consists of forcing clock domain B to stay active as long as a module from clock
    domain A is communicating with the module in clock domain B through the interconnect. Clock domain B
    becomes active as soon as the communication is initiated. This is automatically managed by the PRCM
    module by monitoring the communication on the interconnect between the modules of the two clock
    domains.

    The dynamic dependency between a source clock domain and a destination clock domain can be read in
    the PRCM module from the corresponding read-only CM_<Source Clock domain>_DYNAMICDEP[x]
    <Destination Clock domain>_DYNDEP bit.

    Best regards,

    Yanko

  • Today I found that neither reading nor writing the shared cache configuration registers is possible from the A9 subsystem. However, I succeeded in reading and writing those registers from my code running on the M3 subsystem itself. My next step is to enable caching of CPU instructions, which should really speed up my application. I have already tried several different configurations, but the CPU still fetches each and every instruction from the external SDRAM (where my code is located). Does anyone have an example configuration for the shared cache that actually enables caching of CPU instructions?

    Thanks in advance!

    Kind regards,

    Richard.

  • I have now tried multiple cache settings, but nothing really changes. I would expect to see a difference between execution with and without caching enabled, but this is not the case. Any ideas? A procedure (or programming guide) on how to configure the caches would be nice, since I may be missing some essential steps.

    Thanks in advance!

    Kind regards,

    Richard. 

  • Ok, let me try to be more specific:

    I want the L1 cache from the M3 subsystem to cache M3 instructions.
    My binary for the M3 subsystem is located at physical address 0x40300000 (OCM_RAM). The physical address of the OCM_RAM is translated by the L2 MMU to logical address 0x00000000 (which the subsystem uses to boot from). I configure the L1 MMU (aka the AMMU) as follows:
    *SCACHE_MMU_SMALL_ADDR_0 = 0x00000000;
    *SCACHE_MMU_SMALL_XLTE_0 = 0x00000000;
    *SCACHE_MMU_SMALL_POLICY_0 = 0x00090001; /* page enabled, cacheable, write back */

    Afterwards I enable the cache:
    *SCACHE_CONFIG = 0x1E;

    The just described configurations are done from software running on the M3 subsystem. The configurations are done first, then the main loop of the program starts running. 
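
To make the policy word easier to inspect, here is a minimal sketch that splits it into the three properties named in the comment above. The bit positions (0, 16 and 19) are an assumption inferred only from the fact that 0x00090001 has exactly those three bits set and is described as "page enabled, cacheable, write back"; the field names are made up and should be checked against the shared cache register description in the TRM.

```c
#include <stdbool.h>
#include <stdint.h>

/* ASSUMED bit positions, inferred from the value 0x00090001 and its
 * description in the post; verify against the TRM before relying on them. */
#define POLICY_ENABLE        (1u << 0)
#define POLICY_L1_CACHEABLE  (1u << 16)
#define POLICY_L1_WRITEBACK  (1u << 19)

struct small_page_policy {
    bool     enabled;
    bool     l1_cacheable;
    bool     l1_write_back;
    uint32_t other_bits;   /* any set bits this sketch does not model */
};

static struct small_page_policy decode_policy(uint32_t reg)
{
    struct small_page_policy p;

    p.enabled       = (reg & POLICY_ENABLE) != 0;
    p.l1_cacheable  = (reg & POLICY_L1_CACHEABLE) != 0;
    p.l1_write_back = (reg & POLICY_L1_WRITEBACK) != 0;
    p.other_bits    = reg & ~(POLICY_ENABLE | POLICY_L1_CACHEABLE |
                              POLICY_L1_WRITEBACK);
    return p;
}
```

Reading the register back on the M3 and decoding it this way is a quick check that the write actually took effect; a non-zero other_bits would point at fields not covered here.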

    The program runs fine, but I do not believe anything is cached by the L1 cache. Can anyone point me to a programming guide? For example, I don't activate the maintenance mode of the cache before configuring it, because I don't know whether that is required.

    It would be nice if anyone could help me solve my last issues...

    Thanks in advance.

    Kind regards,

    Richard. 

      

  • Hello Richard,

    There is no guide for L1 cache configuration. I would like to note that:

    The CPU power domains (CPU0 and CPU1) and L1 cache are controlled by the local PRCM module.
    I suggest checking the following registers to see if L1 is in the ON state:

    Clock transition depends on the StandbyWFI assertion + the value of 3 bit fields:
    • PWRCTLO from the SCU power status register (internal to ARM)
    • CLKTRCTRL from the CM_PDA_CPUi_CLKSTCTRL register (local PRCM module)
    • POWERSTATE from the PM_PDA_CPUi_PWRSTCTRL register (local PRCM module)

    The PM_PDA_CPUi_PWRSTCTRL register is static over any power transition. To wake up CPUx, the
    user must:
    1. Execute a forced wake-up transition to the CPUx: CM_PDA_CPUx_CLKSTCTRL[1:0] CLKTRCTRL =
    0x2.
    2. The CPUx interrupt handler must set back the automatic hardware transition
    CM_PDA_CPUx_CLKSTCTRL[1:0] CLKTRCTRL = 0x3.

    Best regards,

    Yanko

  • Hi Yanko,

    Thanks for your reply. Unfortunately, your answer is related to the caches from the A9 subsystem. I can find the content from your reply in Chapter 4 from the OMAP4460 TRM, which is a chapter about the A9 subsystem. The L1 cache I want to use is from the Cortex-M3 subsystem (Chapter 7).

    Kind regards,

    Richard.

  • Hello Richard,

    For more information, see Chapter 5, DSP, Section 5.4.2.2.3.1, Direct Maintenance of Caches.

    Could you check whether your M3 clock signals are available?

    Check registers:

    - CM_MPU_M3_CLKSTCTRL[1:0] CLKTRCTRL

    - CM_MPU_M3_CLKSTCTRL[8] CLKACTIVITY_MPU_M3_CLK
    - CM_MPU_M3_MPU_M3_CLKCTRL[1:0] MODULEMODE

    Software clears the RM_MPU_M3_RSTCTRL[2] RST3 bit in the PRCM module to release the MPU M3 cache and MMU from reset.

    The dual Cortex-M3 subsystem receives only one clock, MPU_M3_CLK, which is divided in two for each
    ARM Cortex-M3 processor, ROM and RAM memory and the L2 MMU. The shared cache and the L2 MIF
    are directly clocked by the MPU_M3_CLK, without any division.

    Did you check Shared Cache Configuration registers for Cortex M3?

    The shared cache MMU in Cortex-M3 has four large pages, two medium pages, and ten small pages. See section 5.4.2.2.3.1 Direct Maintenance of Caches in OMAP4 TRM.

    Best regards,

    Yanko

  • Hi Richard,

    Did you sort this problem out? I'm having the exact same issue! I'm rolling my own bare-metal OS, and have code running fine on both A9 and M3, but there appears to be no way that I can enable caching on the M3.

    I've had no trouble programming either the L2 MMU or the shared L1 MMU, and enabling page entries within it. However none of the cacheability options appear to do anything.

    Using the default clock speeds that U-boot leaves with the system (700MHz A9, 200 MHz M3, 350 MHz? DDR) I've managed to confirm that the system is clocked the way I think it is. Using SysClk on the M3, 200 MHz does appear to be correct. However running a simple loop in asm to count the number of instructions issued in one second gives me ~58 clocks/inst when my code is in DRAM. Moving the code to the 64KB in-module RAM reduces that to 5 clocks/inst.
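
The clocks/inst figures come from dividing the core clock by the number of instructions retired in one second. A minimal sketch of that arithmetic (the function name is illustrative):

```c
#include <stdint.h>

/* Average clock cycles per instruction, given the core clock in Hz and
 * the number of instructions counted over one second.
 * Returns 0.0 if no instructions were counted. */
static double cycles_per_instruction(uint64_t clock_hz,
                                     uint64_t instr_per_second)
{
    if (instr_per_second == 0)
        return 0.0;
    return (double)clock_hz / (double)instr_per_second;
}
```

At 200 MHz, ~58 clocks/inst corresponds to roughly 3.4 million instructions per second and 5 clocks/inst to 40 million, so the DRAM case is about 11 to 12 times slower. These instruction counts are back-computed from the figures in the post, not measured values.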

    What am I missing?!

    Cheers,

    Simon

  • Hi Simon,

    Sorry for my late response, I was on vacation for several weeks.

    I have not spent much time working with the OMAP4 in the last several months. My plan is to continue with my project within two to three weeks, and I will probably have a look at the caching issue again.

    Have you found any solution for the problem in the meantime? By the way, I am also curious how you managed to use in-module RAM for your instructions.

    Kind regards,

    Richard.

  • Hi there,

    Sorry for the delay.

    Regarding running code from the embedded memory, if you look in section 2.4 of the 4460 TRM you'll see that private RAM is mapped by default at 0x55020000. As far as I can tell, even if you have programmed the L2 MMU to have something else in this range, it won't take effect and you'll still see ROM/RAM/MMU config/WKUGEN etc in the 0x550xxxxx range. You should be able to read and write to it from your M3 but you won't be able to execute code there. If you try you'll get an ARMv7M exception (BusFault?).

    According to the ARMv7-M reference manual (section B3.1), 0x55020000 is in the XN 'peripheral' region (0x40000000-0x5fffffff). If you use the L1 shared cache MMU to map 0x55020000 into the 0x00000000-0x1fffffff executable region, you'll be able to run code from the embedded memory. I got ~5 cycles/instruction running from there versus ~58 from normal memory mapped in via the L2 MMU (and no mappings in the L1).
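
A minimal sketch of the rule being applied here, based on the ARMv7-M default system address map: the Peripheral region (0x40000000-0x5FFFFFFF), the Device region (0xA0000000-0xDFFFFFFF) and the System region (0xE0000000-0xFFFFFFFF) are execute-never, while the Code, SRAM and RAM regions are executable. The function name is illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true if `addr` falls in a region that the ARMv7-M default
 * memory map marks execute-never (XN). */
static bool armv7m_default_xn(uint32_t addr)
{
    if (addr >= 0x40000000u && addr <= 0x5FFFFFFFu)
        return true;   /* Peripheral region */
    if (addr >= 0xA0000000u)
        return true;   /* Device and System regions */
    return false;      /* Code, SRAM and RAM regions */
}
```

Both 0x55020000 (the private RAM) and 0x40300000 (the OCM_RAM mentioned earlier in the thread) land in the Peripheral region as seen by the M3 core, which is why code has to be reached through an alias in the 0x00000000-0x1FFFFFFF range to be executed.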

    In my quest for "turning the cache on", the one thing I did not try is turning off the L2 MMU, as this makes running code a bit tricky. Perhaps this is the thing to do?

    Another interesting thing you can do is use the SCACHE_CTADDR and SCACHE_CTDATA registers to see what the L1 cache is holding. You program your target address into CTADDR and then read out the value from CTDATA. I'd always get junk... but the same junk. I wouldn't get the real contents of my memory.

    Anyway please let me know how you get on!
    Anyone from TI here who can help?

    Cheers

    Simon