L2 of ARM A8 as SRAM issue

Hi, supporters:

When processing images on the A8, I may need to use on-chip memory (L2) to improve performance.

Could the A8's 512K L2 be used as SRAM for DMA + ping-pong processing, just like on the DSP?

Is there a register or memory-map setting that does this?

  

  • Joey,

    Joey Lin said:
    Could the A8's 512K L2 be used as SRAM for DMA + ping-pong processing, just like on the DSP?

    No, I do not think you can use the Cortex-A8 L2 cache as RAM for DMA transfer. The DSP L2 is stated as cache and/or RAM and is mapped at start address 0x40800000/0x00800000, while Cortex-A8 L2 is stated as just cache and no start address is available in the Memory Map.

    See also this E2E thread:

    http://e2e.ti.com/support/dsp/davinci_digital_media_processors/f/716/t/243883.aspx

    There is RAM inside the Cortex-A8, mapped at start address 0x402F0400, but I do not think you can use it for DMA transfers: it is small (64KB) and documented as internal to the Cortex-A8 (accessible only by the Cortex-A8).

    Joey Lin said:
    When processing images on the A8, I may need to use on-chip memory (L2) to improve performance.

    You can try with the OCMC L3 SRAM (128KB) mapped at start address 0x40300000.

    Regards,
    Pavel

  • Joey Lin said:

    When processing images on the A8, I may need to use on-chip memory (L2) to improve performance.

    Could the A8's 512K L2 be used as SRAM for DMA + ping-pong processing, just like on the DSP?

    Certainly.  You can use L2 Cache Lockdown to effectively turn part of L2 cache into local RAM (in steps of 1/8-th of the total L2 cache), and then use the PreLoad Engine (PLE) to move data in/out of cache in the background while you process the previously loaded data in parallel.

    More information can be found in the ARM Cortex-A8 Technical Reference Manual, specifically section 3.2.54 for the details of cache lockdown, section 8.4 for an overview of the preload engine and sections 3.2.59-3.2.67 for its details. Useful background info is also in the ARMv7-A/R Architecture Reference Manual, for example section B.2.2 on caches in general.
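    To make the ping-pong idea concrete, here is a minimal, hardware-untested sketch of the loop structure in C. The names preload_start, preload_wait, process_tile and TILE_BYTES are hypothetical and do not come from the TRM. On the target, preload_start would program a PLE channel to pull the next tile into a locked L2 way; here both hooks are empty stubs so the code also builds and runs on a PC. The sketch assumes the PLE preloads data into L2 at its normal address, so unlike DSP-style DMA there is no separate destination buffer:

```c
#include <stddef.h>
#include <stdint.h>

enum { TILE_BYTES = 4096 };  /* assumed tile size; pick one that fits a locked way */

/* Hypothetical hook: ask the PLE to preload [addr, addr + n) into L2 way `way`.
 * Stubbed out here; on the target this would write the PLE registers. */
static void preload_start(const uint8_t *addr, size_t n, unsigned way)
{
    (void)addr; (void)n; (void)way;
}

/* Hypothetical hook: block until the most recently started preload is done. */
static void preload_wait(void) { }

/* Placeholder per-tile kernel: sums the bytes of one tile. */
static uint32_t process_tile(const uint8_t *p, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += p[i];
    return sum;
}

/* Process `tiles` consecutive tiles, preloading tile i+1 while tile i
 * is being processed. */
static uint32_t process_image(const uint8_t *img, size_t tiles)
{
    uint32_t acc = 0;
    if (tiles == 0)
        return 0;

    preload_start(img, TILE_BYTES, 0);        /* prime the first tile */
    for (size_t i = 0; i < tiles; i++) {
        preload_wait();                       /* tile i is now resident */
        if (i + 1 < tiles)                    /* start tile i+1 in the other way */
            preload_start(img + (i + 1) * TILE_BYTES, TILE_BYTES,
                          (unsigned)((i + 1) & 1));
        acc += process_tile(img + i * TILE_BYTES, TILE_BYTES);
    }
    return acc;
}
```

    Ways 0 and 1 are arbitrary choices in this sketch; which ways are actually available depends on how the lockdown was configured.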

  • Hi, Matthijs:

    Thank you for your information.  I have been reading the ARM Cortex-A8 Technical Reference Manual (DDI0344K_cortex_a8_r3p2_trm.pdf) for quite some time.

    I am currently stuck on modifying the CP15 registers to configure the PLE.  It seems to me that I need to write assembly code to do this. Do you think I need to start with an ARM assembly tutorial, or can I resolve this without knowing assembly?

    Thank you very much,

    Joey from Altek

  • You don't really need to learn ARM assembly for this in any detail, as the TRM explicitly shows the instruction needed, which you can use in GCC inline assembly.  Some (untested) examples:

    // get bitmap of channels running
    // (volatile: a status read must not be merged or dropped by the optimizer)
    u32 running;
    asm volatile( "mrc p15, 0, %0, c11, c0, 2" : "=r" (running) );
    
    // get and set current channel
    u32 channel;
    asm volatile( "mrc p15, 0, %0, c11, c2, 0" : "=r" (channel) );
    asm( "mcr p15, 0, %0, c11, c2, 0" : : "r" (channel) );
    
    // start engine
    asm( "mcr p15, 0, %0, c11, c3, 1" : : "r" (0) );

    I personally often use the clang compiler, which has intrinsics for mrc and mcr which, using a tiny wrapper class, allow me to make coprocessor registers accessible as if they were global variables:

    template< uint p, uint n, uint op1, uint m, uint op2 >
    class cp {
    public:
            cp() {}
            operator uint () {
                    return __builtin_arm_mrc( p, op1, n, m, op2 );
            }
            uint operator = ( uint val ) {
                    __builtin_arm_mcr( p, op1, val, n, m, op2 );
                    return val;
            }
            void operator |= ( uint val ) { *this = *this | val; }
            void operator &= ( uint val ) { *this = *this & val; }
            void operator ^= ( uint val ) { *this = *this ^ val; }
    };
    
    static cp<15,11,0, 0,0> ple_present;    //r-
    static cp<15,11,0, 0,2> ple_running;    //r-
    static cp<15,11,0, 0,3> ple_stopping;   //r-
    static cp<15,11,0, 1,0> ple_useraccess; //rw
    static cp<15,11,0, 2,0> ple_select;     //rw
    static cp<15,11,0, 3,0> ple_stop;       //-w
    static cp<15,11,0, 3,1> ple_start;      //-w
    static cp<15,11,0, 3,2> ple_clear;      //-w
    static cp<15,11,0, 4,0> ple_control;    //rw
    static cp<15,11,0, 5,0> ple_vaddr;      //rw
    static cp<15,11,0, 7,0> ple_size;       //rw
    static cp<15,11,0, 8,0> ple_status;     //rw
    static cp<15,11,0,15,0> ple_context;    //rw
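    For plain C, roughly the same convenience can be had with a macro that generates one getter and one setter per register. This is only a sketch, untested on hardware: the PLE_REG macro and the _get/_set naming are made up for illustration, and the encodings are simply copied from the table above. The #else branch substitutes an ordinary variable so the code can be compiled and exercised off-target:

```c
#include <stdint.h>

#if defined(__arm__) && defined(__GNUC__)
/* Target build: each accessor is a single mrc/mcr on CP15 c11.
 * "volatile" stops the optimizer from dropping or merging accesses. */
#define PLE_REG(name, crm, op2)                                         \
    static inline uint32_t name##_get(void)                             \
    {                                                                   \
        uint32_t v;                                                     \
        __asm__ volatile("mrc p15, 0, %0, c11, " #crm ", " #op2         \
                         : "=r"(v));                                    \
        return v;                                                       \
    }                                                                   \
    static inline void name##_set(uint32_t v)                           \
    {                                                                   \
        __asm__ volatile("mcr p15, 0, %0, c11, " #crm ", " #op2         \
                         : : "r"(v));                                   \
    }
#else
/* Host build (assumption for testing only): a plain variable stands in
 * for the coprocessor register. */
#define PLE_REG(name, crm, op2)                                         \
    static uint32_t name##_sim;                                         \
    static inline uint32_t name##_get(void) { return name##_sim; }      \
    static inline void name##_set(uint32_t v) { name##_sim = v; }
#endif

/* A few of the registers listed above; add the rest the same way. */
PLE_REG(ple_select,  c2, 0)
PLE_REG(ple_control, c4, 0)
PLE_REG(ple_vaddr,   c5, 0)
```

    Usage is then ple_vaddr_set(addr) or ple_select_get(). The macro emits both accessors for every register, so for read-only or write-only registers it is up to the caller to use only the valid one.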

  • Hi, Matthijs:

    Thank you for replying. The class wrapper is kinda neat, but unfortunately I have not used C++ for quite some time, so I am currently writing a C API function for each operation. Right now I am trying to work out how to get the value of a C variable (a virtual address) into r0 in inline assembly, as below, to set the start address. Could you provide your version for reference?

    int func( int start_add, int byte_count,..){

    asm("    LDR r0, =start_add );  <== Load virtual address to r0?

    asm("    MCR p15, #0, r0, c11, c5, #0 ; Write PLE Internal Start Address Register");

    ...

    }

    Best regards,

    Joey from Altek

  • Look more closely at my examples:  the value being written to a coprocessor register is a C expression, and likewise when reading a coprocessor register you simply name the variable where you want the result to end up.  The "%0" in the assembly instruction will be replaced with the register which the compiler allocated for the argument.

    So you don't need any other assembly instructions than the mrc and mcr, just

    asm( "mcr p15, 0, %0, c11, c5, 0" : : "r" (start_addr) );

    Some more notes:

    1. Mind the double colon when using mcr versus a single colon when using mrc.  This is because the general format is:   asm( "..." : output arguments : input arguments : other stuff affected );

    2. The preload engine affects memory from the compiler's point of view, so you need to tell it this to prevent it from e.g. moving memory loads/stores across a PLE operation. You can do this by placing a "compiler barrier"

      • right before starting an eviction (L2 -> memory), and
      • after completion of a preload, before accessing the data.

      The syntax for such a barrier is:

      asm( "" : : : "memory" );  // compiler barrier

      You can also mark an instruction itself as "affecting memory", but since in this case it's not really any single instruction which is affecting memory I think using a barrier is clearer.

    3. A little detail about GCC inline asm which can be important to know:  if an instruction has one or more outputs but you do not use any of them, the optimizer will think the instruction wasn't needed and is allowed to remove it as dead code.  You can use   asm volatile( ... );   to prevent this. If the instruction has no outputs then it is implicitly marked volatile.
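    Notes 2 and 3 combined into a start/wait pair could look roughly like the sketch below (untested on hardware). The names ple_kick, ple_wait and COMPILER_BARRIER are hypothetical; the encodings are the ones used earlier in this thread. It assumes a channel's bit in the running bitmap clears once its transfer finishes, which should be checked against the channel status register description in the TRM. The #else branch is only a stand-in so the code also compiles and runs off-target:

```c
#include <stdint.h>

/* Compiler-only barrier: no instruction is emitted, but the compiler may
 * not move memory loads/stores across this point. */
#define COMPILER_BARRIER() __asm__ volatile("" : : : "memory")

#if defined(__arm__) && defined(__GNUC__)
static inline uint32_t ple_running(void)
{
    uint32_t bits;
    /* volatile: every call must really re-read the register */
    __asm__ volatile("mrc p15, 0, %0, c11, c0, 2" : "=r"(bits));
    return bits;
}
static inline void ple_select(uint32_t ch)
{
    __asm__ volatile("mcr p15, 0, %0, c11, c2, 0" : : "r"(ch));
}
static inline void ple_start(void)
{
    __asm__ volatile("mcr p15, 0, %0, c11, c3, 1" : : "r"(0));
}
#else
/* Host stand-in (assumption for testing only): a started channel shows up
 * as running for exactly one poll, then reads back as idle. */
static uint32_t sim_selected, sim_running;
static inline uint32_t ple_running(void)
{
    uint32_t bits = sim_running;
    sim_running = 0;
    return bits;
}
static inline void ple_select(uint32_t ch) { sim_selected = ch; }
static inline void ple_start(void) { sim_running |= 1u << sim_selected; }
#endif

/* Start channel `ch`. The barrier keeps earlier stores (e.g. results that
 * are about to be evicted) from being moved after the start command. */
static void ple_kick(uint32_t ch)
{
    COMPILER_BARRIER();
    ple_select(ch);
    ple_start();
}

/* Busy-wait until channel `ch` is no longer running. The barrier keeps
 * later loads of the preloaded data from being moved before the wait. */
static void ple_wait(uint32_t ch)
{
    while (ple_running() & (1u << ch))
        ;
    COMPILER_BARRIER();
}
```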
  • Note that all this is assuming you are using GCC.  I have no experience with TI's own compiler for ARM.

  • Hi, Matthijs:

    Thank you very much for your notes and explanation. Indeed, the TI compiler does not accept C arguments in inline assembly the way GCC does.

    I will start a new thread for this issue.

    Best regards,

    Joey from Altek