In Circuit Emulation for J6

Other Parts Discussed in Thread: DRA744

I'm working with the IPU/M4 core of the DRA744 processor.  I'm running the Green Hills RTOS, uVelosity, and building and debugging the binary with the Green Hills toolchain.

The M4 processor is used to marshal data back and forth to the A15, which is running Linux.  In the Linux device tree we use /memreserve/ to block off a section of physical memory, load the M4 image into it, set up the various M4 registers, and kick it off.

The MMUs on the J6 are set up such that the M4 executes out of the L3 DDR memory that was reserved with /memreserve/.

The problem is that after running for a few minutes, we throw a fault and get a run-time error that has eluded us for some time now.  Since the M4 has no MPU and no processor trace, this is very difficult to track down, and the Green Hills debugger is not revealing any useful information.  Static analysis on our code base is not producing anything useful either.

I'm highly suspicious that, somehow, Linux might be corrupting L3-DDR memory despite the /memreserve/.  I've not been able to prove this.  The Green Hills tools have a way to read back memory and at least when we read back the .text segment, it checks out (no memory corruption).
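
One way to extend the manual readback check is to checksum the regions that should never change (such as .text and .rodata) once at boot and re-check them periodically or from the fault handler. The following is a minimal sketch, assuming a standard bitwise CRC-32; the function names `crc32_region` and `region_is_intact` are hypothetical, and the region bounds would have to come from linker symbols specific to the build.

```c
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), no lookup table. */
static uint32_t crc32_region(const void *start, size_t length)
{
    const uint8_t *p = (const uint8_t *)start;
    uint32_t crc = 0xFFFFFFFFu;

    while (length--) {
        crc ^= *p++;
        for (int bit = 0; bit < 8; bit++) {
            /* Mask is all ones when the low bit is set, else zero. */
            uint32_t mask = 0u - (crc & 1u);
            crc = (crc >> 1) ^ (0xEDB88320u & mask);
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Returns nonzero when the region still matches the CRC taken at boot. */
static int region_is_intact(const void *start, size_t length,
                            uint32_t expected_crc)
{
    return crc32_region(start, length) == expected_crc;
}
```

Note that the M4 reads through the IPU_UNICACHE, so if the region is mapped cacheable this may check the cached copy rather than the contents of DDR itself.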

Question:  Is there an in-circuit emulator we can purchase, or one that TI can recommend, to help track things down?  I'm looking for help on tools and the type of hardware socket I would need to specify for our circuit boards.

Are there any debug tools or techniques TI can suggest to help track this down?

Another issue I ran into, which I believe has been solved but where I could use a TI expert to chime in:

The TI docs do not really explain it. This has to do with an implementation dependent feature of the ARMv7-M (i.e., Cortex-M4) architecture. (And “implementation” in this case is TI’s implementation – specifically the SPI peripheral.)

During a SPI Tx sequence, three SPI interrupts are enabled. When the ISR runs, it checks which SPI interrupts are active, services them, and clears the corresponding active bit(s). However, it clears the bits in two steps: first the Rx and Tx status bits, then the EOW (End of Word Count) bit. The problem is that when any of these bits is cleared while others are still active (and enabled), the CM4 NVIC sees a new interrupt event, even though the remaining active bit is cleared shortly afterwards. By then it is too late, because the interrupt event has already been latched in the CM4 NVIC. So when the ISR returns, the SPI interrupt is still pending as far as the CM4 NVIC is concerned, and it immediately invokes the ISR again. This time, however, the source bits are cleared and disabled, so they don’t cause yet another interrupt event. (Note that the NVIC automatically clears a pending interrupt on ISR entry. The trouble is that the ISR itself was causing an additional, accidental interrupt event to the NVIC.)

So, the fix is to clear the active bits all at once, instead of partially in multiple steps, per ISR invocation.
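
The difference between the two clearing strategies can be illustrated with a small host-side model. This is only a sketch of the behaviour described above, not the real driver: the bit positions, the `spi_model_t` type, and all function names are hypothetical, and the real status register layout has to be taken from the TRM.

```c
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define SPI_IRQ_TX   (1u << 0)
#define SPI_IRQ_RX   (1u << 2)
#define SPI_IRQ_EOW  (1u << 17)
#define SPI_IRQ_ALL  (SPI_IRQ_TX | SPI_IRQ_RX | SPI_IRQ_EOW)

/* Stand-in for the SPI status/enable registers plus a count of the
 * extra interrupt events the NVIC would latch. */
typedef struct {
    uint32_t status;       /* active source bits                 */
    uint32_t enable;       /* enabled source bits                */
    unsigned nvic_events;  /* extra events latched during an ISR */
} spi_model_t;

/* Write-1-to-clear. If enabled bits are still active after the write,
 * model the re-asserted interrupt event described in the post. */
static void spi_status_w1c(spi_model_t *spi, uint32_t mask)
{
    spi->status &= ~mask;
    if (spi->status & spi->enable)
        spi->nvic_events++;
}

/* Original behaviour: Rx/Tx cleared first, EOW cleared afterwards. */
static unsigned isr_two_step(spi_model_t *spi)
{
    uint32_t active = spi->status & spi->enable;
    spi_status_w1c(spi, active & (SPI_IRQ_TX | SPI_IRQ_RX));
    spi_status_w1c(spi, active & SPI_IRQ_EOW);
    return spi->nvic_events;
}

/* Fixed behaviour: every active bit cleared with a single write. */
static unsigned isr_single_step(spi_model_t *spi)
{
    uint32_t active = spi->status & spi->enable;
    spi_status_w1c(spi, active);
    return spi->nvic_events;
}
```

With all three sources active, the two-step variant records one spurious event in this model and the single-write variant records none.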

Thanks for your time!

Best,

Eric Mayo

  • Hi Eric,

    Your question has been forwarded to TI experts.

    Best regards
    Lucy
  • Hi Eric,

    Concerning your question:

    Question: Is there an in-circuit emulator we can purchase, or one that TI can recommend, to help track things down? I'm looking for help on tools and the type of hardware socket I would need to specify for our circuit boards.

    Answer from experts: The XDS200 USB Debug Probe is the most economical. You can reference the Vayu EVM schematics for debug header hookup.

    Please check also the below links for more information:
    www.ti.com/.../tmdsemu200-u
    processors.wiki.ti.com/.../XDS200

    Best regards
    Lucy
  • I'm not sure how this response from TI helps me.  I specifically asked for in-circuit emulation because the M4 does not have system trace capabilities (am I wrong here?).

    Moreover, our problems occur under external load after taking data off the SPI line.  We can't "simulate" this load out of circuit without changing our code.  I'm also having trouble understanding how the IPU Unicache works, which is another reason for asking for an emulator (and hoping there are no bugs in the emulator).

    I have to imagine that someone at TI must be comparing CPUs to in-circuit emulators, perhaps as part of a QC audit?

    TI’s documentation on the IPU_UNICACHE is pretty sparse:

    1)      Does the Cortex-M4 in the IPU need to worry about invalidating/flushing the IPU_UNICACHE when the shared L3 memory used for communication between the Cortex-A15s and the Cortex-M4s is configured as write-back cacheable in the IPU_UNICACHE?

    2)      When performing an IPU_UNICACHE maintenance operation, should the length be a multiple of the cache line size? I.e., if the start address is aligned, does the end address need to be a multiple of the cache line size, minus one? If so, what exactly is the cache line size – 32 bytes?

    3)      When performing an IPU_UNICACHE maintenance operation, what happens if the processor begins exception execution, such as from an interrupt? Is it recommended to temporarily disable interrupts (by setting PRIMASK to one) while IPU_UNICACHE operations are in progress?

  • Hello, Lucy asked for some inputs:


    Scanning the problem description, it seems you are trying to track down the context in which the M4 aborted.  For this kind of debug I would typically use an instance of TRACE32 (Lauterbach) per active core, enabling PTM for the A15 and a DWT-based trace on the M4.  I would set a HW breakpoint in the M4 error handler, which can be used to halt the M4 and A15, and then collect the trace information for each core.  If corruption is seen in DDR, then a mix of HW range breakpoints and STM bus snooping could be employed to try to stop on an access to the corrupted memory.


    TRACE32 allows the M4's DWT unit to be sampled, extracting PCs at run time using its 'snooper trace'.  At the error trigger you can then get an OK sample (a PC every few µs) of the execution leading into the event.  If the error reproduces with the M4 running at a reduced frequency, the sample holes will be relatively smaller.  If the error path is consistent, I will sometimes take multiple runs and merge them in an ensemble average to overcome some of the sampling holes.


    The A15 traces that were stopped on the M4 error can then be inspected to see whether they have any relation to the M4 activity.
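
    As a complement to halting in the error handler with a debugger, the handler itself can record the register frame that the Cortex-M4 stacks on exception entry, which includes the faulting PC. The sketch below covers only the C part, assuming the handler has already obtained a pointer to the stacked frame; how that pointer is obtained (MSP or PSP, selected by the EXC_RETURN value) is toolchain- and RTOS-specific, and the names `fault_frame_t`, `g_last_fault`, and `fault_capture` are hypothetical.

    ```c
    #include <stdint.h>

    /* Basic ARMv7-M exception stack frame, in the order the hardware
     * pushes it: R0-R3, R12, LR, return address (PC), xPSR. */
    typedef struct {
        uint32_t r0, r1, r2, r3, r12, lr, pc, xpsr;
    } fault_frame_t;

    /* Kept in RAM so a debugger (or a later dump) can inspect it. */
    static volatile fault_frame_t g_last_fault;

    /* Copy the stacked registers. 'stacked' must point at the frame that
     * was pushed on exception entry (an assumption of this sketch). */
    static void fault_capture(const uint32_t *stacked)
    {
        g_last_fault.r0   = stacked[0];
        g_last_fault.r1   = stacked[1];
        g_last_fault.r2   = stacked[2];
        g_last_fault.r3   = stacked[3];
        g_last_fault.r12  = stacked[4];
        g_last_fault.lr   = stacked[5];
        g_last_fault.pc   = stacked[6];  /* address of the faulting context */
        g_last_fault.xpsr = stacked[7];
    }
    ```

    The captured PC and LR can then be looked up in the linker map to see which code was running when the fault was taken.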

  • Thanks for the response.  Still looking for information on IPU_UNICACHE.  Along with my original set of questions, I'm pasting the functions we're using for cache maintenance operations.

    1)      Does the Cortex-M4 in the IPU need to worry about invalidating/flushing the IPU_UNICACHE when the shared L3 memory used for communication between the Cortex-A15s and the Cortex-M4s is configured as write-back cacheable in the IPU_UNICACHE?

    2)      When performing an IPU_UNICACHE maintenance operation, should the length be a multiple of the cache line size? I.e., if the start address is aligned, does the end address need to be a multiple of the cache line size, minus one? If so, what exactly is the cache line size – 32 bytes?

    3)      When performing an IPU_UNICACHE maintenance operation, what happens if the processor begins exception execution, such as from an interrupt? Is it recommended to temporarily disable interrupts (by setting PRIMASK to one) while IPU_UNICACHE operations are in progress?

    4)      Below is our cache maintenance code for the IPU_UNICACHE. Would TI recommend any changes to it?

    ------------------------------------------------------------------------------------------------------------------------------------------------------------------

    #ifndef __IPU_H
    #define __IPU_H
    
    // Number of interrupt priority bits is implementation defined. This is the
    // value for the Cortex-M4's in the IPU.
    //   16 levels; i.e., bits [7:4]
    #define N_INT_PRI_BITS 4
    
    // NOTE: N_INT_PRI_BITS must be defined before including cm4.h.
    #include "cm4.h"
    
    // NOTE: Most of the registers defined herein are outside the Cortex-M4 proper.
    //       Therefore they must be accessed using virtual addresses as determined
    //       by the translations performed by the IPUx_UNICACHE_MMU and/or
    //       IPUx_MMU. The following macro is used to translate the physical
    //       addresses of these registers to virtual addresses the Cortex-M4 must
    //       use to access them. If this macro is not defined outside this file,
    //       then assume virtual = physical.
    #ifndef BSP_PhysicalToVirtual
    #define BSP_PhysicalToVirtual(addr) (addr)
    #endif
    
    //=============================================================================
    // Interrupt Management
    
    // N_IRQS (in cm4.h) requires run-time evaluation since it is based on reading
    // the ICTR. For the Cortex-M4 implementation in the IPU the number of IRQs is
    // known, so define a new macro that can be used where constants are required.
    #define N_IPU_IRQS 64
    
    // IRQ's for Cortex-M4's in the IPU.
    #define XLATE_MMU_FAULT_IRQ 0
    #define UNICACE_MMU_IRQ     1
    #define HWSEM_M4_IRQ        3 // from CORTEXM4_CTRL_REG of other IPU Cortex-M4
    
    //=============================================================================
    // IPU Registers
    
    // IPUx_UNICACHE
    //-----------------------------------------------------------------------------
    
    // Configuration Register
    #define IPUx_UNICACHE_CONFIG_ADDR BSP_PhysicalToVirtual(0x55080004)
    #define IPUx_UNICACHE_CONFIG REG_32(IPUx_UNICACHE_CONFIG_ADDR)
    #define IPUx_UNICACHE_CONFIG_BYPASS 0x00000002
    
    // Interface Configuration Register
    #define IPUx_UNICACHE_OCP_ADDR BSP_PhysicalToVirtual(0x5508000C)
    #define IPUx_UNICACHE_OCP REG_32(IPUx_UNICACHE_OCP_ADDR)
    #define IPUx_UNICACHE_OCP_WRAP     0x00000001
    #define IPUx_UNICACHE_OCP_WRBUFFER 0x00000002
    
    // Interrupt Register
    #define IPUx_UNICACHE_INT_ADDR BSP_PhysicalToVirtual(0x55080008)
    #define IPUx_UNICACHE_INT REG_32(IPUx_UNICACHE_INT_ADDR)
    #define IPUx_UNICACHE_INT_MAINT 0x00000004
    
    // Maintenance Configuration Register
    #define IPUx_UNICACHE_MAINT_ADDR BSP_PhysicalToVirtual(0x55080010)
    #define IPUx_UNICACHE_MAINT REG_32(IPUx_UNICACHE_MAINT_ADDR)
    #define IPUx_UNICACHE_MAINT_CLEAN      0x00000008
    #define IPUx_UNICACHE_MAINT_INVALIDATE 0x00000010
    #define IPUx_UNICACHE_MAINT_INTERRUPT  0x00000020
    
    // Maintenance Start Configuration Register
    #define IPUx_UNICACHE_MTSTART_ADDR BSP_PhysicalToVirtual(0x55080014)
    #define IPUx_UNICACHE_MTSTART REG_32(IPUx_UNICACHE_MTSTART_ADDR)
    
    // Maintenance End Configuration Register
    #define IPUx_UNICACHE_MTEND_ADDR BSP_PhysicalToVirtual(0x55080018)
    #define IPUx_UNICACHE_MTEND REG_32(IPUx_UNICACHE_MTEND_ADDR)
    
    #ifndef __LANGUAGE_ASM__
    #include <armv7m_ghs.h>
    
    // Cache Maintenance Operations
    
    #ifdef __cplusplus
    #define INLINE inline
    extern "C" {
    #else
    #define INLINE __inline
    #endif
    
    INLINE void ipu_unicache_maint(void *start, size_t length, uint32_t op)
    {
        unsigned old_state;

        // An empty range would set MTEND to start - 1, so do nothing.
        if (length == 0)
            return;

        // Disable interrupts.
        old_state = __MRS(__PRIMASK);
        __MSR(__PRIMASK, old_state | 1);
    
        // Clear maintenance complete status.
        IPUx_UNICACHE_INT = IPUx_UNICACHE_INT_MAINT;
    
        // Perform operation.
        IPUx_UNICACHE_MTSTART = (uint32_t) start;
        IPUx_UNICACHE_MTEND   = (uint32_t) start + length - 1;
        IPUx_UNICACHE_MAINT   = IPUx_UNICACHE_MAINT_INTERRUPT | op;
    
        // Wait for maintenance operation to complete.
        while (!(IPUx_UNICACHE_INT & IPUx_UNICACHE_INT_MAINT))
            ;
    
        // Restore previous interrupt enable state.
        __MSR(__PRIMASK, old_state);
    }
    
    INLINE void ipu_unicache_invalidate(void *start, size_t length)
    {
        ipu_unicache_maint(start, length, IPUx_UNICACHE_MAINT_INVALIDATE);
    }
    
    INLINE void ipu_unicache_flush(void *start, size_t length)
    {
        ipu_unicache_maint(start, length, IPUx_UNICACHE_MAINT_CLEAN);
    }
    
    #ifdef __cplusplus
    }
    #endif
    
    #endif // __LANGUAGE_ASM__
    
    // IPUx_UNICACHE_MMU
    //-----------------------------------------------------------------------------
    
    // Maintenance Configuration Register
    #define IPUx_UNICACHE_MMU_MAINT_ADDR BSP_PhysicalToVirtual(0x55080CA8)
    #define IPUx_UNICACHE_MMU_MAINT REG_32(IPUx_UNICACHE_MMU_MAINT_ADDR)
    #define IPUx_UNICACHE_MMU_MAINT_G_FLUSH 0x00000400
    
    // Maintenance Status Register
    #define IPUx_UNICACHE_MMU_MAINTST_ADDR BSP_PhysicalToVirtual(0x55080CB4)
    #define IPUx_UNICACHE_MMU_MAINTST REG_32(IPUx_UNICACHE_MMU_MAINTST_ADDR)
    #define IPUx_UNICACHE_MMU_MAINTST_STATUS 0x00000001
    
    // Misc
    //-----------------------------------------------------------------------------
    
    // Inter-Cortex-M4 interrupts.
    #define CORTEXM4_CTRL_REG_ADDR BSP_PhysicalToVirtual(0x55081000)
    #define CORTEXM4_CTRL_REG REG_32(CORTEXM4_CTRL_REG_ADDR)
    //   Interrupt to IPU_C0
    #define INT_CORTEX_1 0x00000001
    //   Interrupt to IPU_C1
    #define INT_CORTEX_2 0x00010000
    
    // Unique register for each core.
    #define CORTEXM4_RW_PIDx_ADDR 0xE00FE000
    #define CORTEXM4_RW_PIDx REG_32(CORTEXM4_RW_PIDx_ADDR)
    
    #endif
    

  • There is no hardware coherency between the IPU cache and the A15's or other entities' caches. Cache flushing and posting-buffer maintenance, along with cross-core synchronization, are needed when exchanging cached data between the A15 and the M4. It's not uncommon for people to map regions as non-cached to avoid some of the maintenance. On the A15 side in particular, care is needed because the write-posting schemes are more elaborate. The IPU side is less strict, as the M4s have a much simpler path with no L1. The TI reference code images will be mature in these areas and serve as a good practical reference.   In your debugging, have you disabled the caches, and do you still see issues?


    The unicache parameters are listed in the TRM along with other descriptive information (the line size is 256 bits, i.e. 32 bytes).  The StarterWare reference code TI provides to customers would be a good reference to cross-check your code against. Whether you need to mask interrupts depends on your RTOS buffer management strategy.  On the A15, older buffer ownership schemes that depended on 'no explicit access means no allocations' were broken by aggressive processor speculation, which allows prediction engines to access any valid non-I/O MMU mapping.  The unicaches have provision for some stride prediction, but it's less aggressive, so simpler code will likely work.
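
    Given the 32-byte line size, a maintenance range can be widened to whole cache lines before it is written to the start and end registers. The thread does not say whether the hardware rounds partial lines on its own, so rounding in software is a conservative assumption here; the names `ipu_maint_range_t` and `ipu_unicache_line_range` are hypothetical.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    #define IPU_UNICACHE_LINE_SIZE 32u  /* 256-bit line, per the reply above */

    typedef struct {
        uint32_t first;  /* address of first byte, line aligned         */
        uint32_t last;   /* address of last byte (inclusive end address) */
    } ipu_maint_range_t;

    /* Widen [start, start + length - 1] to whole cache lines.
     * Returns 0 for an empty range or one that wraps past 4 GiB. */
    static int ipu_unicache_line_range(uint32_t start, size_t length,
                                       ipu_maint_range_t *out)
    {
        if (length == 0 || length - 1 > UINT32_MAX - start)
            return 0;

        uint32_t end = start + (uint32_t)(length - 1);

        out->first = start & ~(IPU_UNICACHE_LINE_SIZE - 1u);
        out->last  = end | (IPU_UNICACHE_LINE_SIZE - 1u);
        return 1;
    }
    ```

    Widening an invalidate also discards whatever else shares the first and last lines, so buffers exchanged with the A15 are best aligned to, and padded out to, a multiple of the line size.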

    I don't work on or own the M4 area, so I can't provide a quick review today. My comments are primarily on the emulation and debug questions. Maybe another engineer can do this with you, or it will have to be scheduled by the TI forum owner.

  • This response doesn’t help. At least, it doesn’t help me. And some of the statements seem to be incorrect. E.g., “The IPU side is less strict as the M4's have a much simpler path with no L1.” That doesn’t seem correct to me – the IPUx_UNICACHE *is* L1 the way I read it.
  • The unicache as instantiated is a shared L2 (there are a couple of M4s hooked to the unicache). As it's hooked up, it's a shared 'next-level' cache. The unicache controller IP has provision to support a (private) L1 and a (shared) L2, and the AMMU registers and tables show this. The hookup here is as an L2. Maybe this is non-obvious, but it is correct.