TMS320F280038C-Q1: CLA array placed in LSRAM4 through 7 read access time extremely slow

Part Number: TMS320F280038C-Q1


Hello,

I would like to store a couple of constant float LUT arrays in LS RAM to be accessed by the CLA. The LUTs are initially stored in flash and copied to RAM at startup, and this all worked correctly when the LS RAM was read by the CPU. However, once I handed RAM access over to the CLA and had the CLA do the reads, I noticed that a single read (in the CLA) took several hundred clock cycles, so something seems off. The value read is correct, so I am guessing my setup is incorrect.

Here are the relevant code sections:

 

///// LUT C File 

#pragma DATA_SECTION (LUT_D2,"LUT_DUTY_MEM1")

const float LUT_D2[16][71] = {..}

///// 


///// LUT H File 

extern const float LUT_D2[LUT_P_MAX+1][LUT_Vr_MAX+1];

//////


///// CLA memory config 


void initCLA(void)
{
    //
    // Copy the program and constants from FLASH to RAM before configuring
    // the CLA
    //
#if defined(_FLASH)
    memcpy((uint32_t *)&Cla1ProgRunStart, (uint32_t *)&Cla1ProgLoadStart,
           (uint32_t)&Cla1ProgLoadSize);
    memcpy((uint32_t *)&Cla1ConstRunStart, (uint32_t *)&Cla1ConstLoadStart,
        (uint32_t)&Cla1ConstLoadSize );

    memcpy((uint32_t *)&LUTRUNSTART, (uint32_t *)&LUTRUNSTART,
        (uint32_t)&LUTRUNSTART );

//    memcpy(&LUTRUNSTART,&LUTRUNSTART,(Uint32)&LUTRUNSTART);

#endif //defined(_FLASH)

    //
    // CLA Program will reside in RAMLS0 and data in RAMLS1, RAMLS2
    //
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS0, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS1, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS2, MEMCFG_LSRAMMASTER_CPU_CLA1);

    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS4, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS5, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS6, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS7, MEMCFG_LSRAMMASTER_CPU_CLA1);


    MemCfg_setCLAMemType(MEMCFG_SECT_LS0, MEMCFG_CLA_MEM_PROGRAM);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS1, MEMCFG_CLA_MEM_PROGRAM);

    MemCfg_setCLAMemType(MEMCFG_SECT_LS2, MEMCFG_CLA_MEM_DATA);

    MemCfg_setCLAMemType(MEMCFG_SECT_LS4, MEMCFG_CLA_MEM_DATA);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS5, MEMCFG_CLA_MEM_DATA);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS6, MEMCFG_CLA_MEM_DATA);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS7, MEMCFG_CLA_MEM_DATA);

//
// Suppressing #770-D conversion from pointer to smaller integer
// The CLA address range is 16 bits so the addresses passed to the MVECT
// registers will be in the lower 64KW address space. Turn the warning
// back on after the MVECTs are assigned addresses
//
#pragma diag_suppress=770

    //
    // Assign the task vectors and set the triggers for task 1
    // and 8
    //
    CLA_mapTaskVector(CLA1_BASE, CLA_MVECT_1, (uint16_t)&Cla1Task1);
//    CLA_mapTaskVector(CLA1_BASE, CLA_MVECT_2, (uint16_t)&Cla1Task2);
    CLA_mapTaskVector(CLA1_BASE, CLA_MVECT_8, (uint16_t)&Cla1Task8);
    //CLA_setTriggerSource(CLA_TASK_1, CLA_TRIGGER_ADCA1);
//    CLA_setTriggerSource(CLA_TASK_1, CLA_TRIGGER_ADCB1);
    CLA_setTriggerSource(CLA_TASK_1, CLA_TRIGGER_EPWM7INT);
//    CLA_setTriggerSource(CLA_TASK_2, CLA_TRIGGER_EPWM7INT);
    CLA_setTriggerSource(CLA_TASK_8, CLA_TRIGGER_SOFTWARE);

#pragma diag_warning=770

    //
    // Enable Tasks 1 and 8
    //
//    CLA_enableTasks(CLA1_BASE, (CLA_TASKFLAG_1 | CLA_TASKFLAG_2| CLA_TASKFLAG_8));
    CLA_enableTasks(CLA1_BASE, (CLA_TASKFLAG_1 | CLA_TASKFLAG_8));

    //
    // Force task 8, the one time initialization task
    //
    CLA_forceTasks(CLA1_BASE, CLA_TASKFLAG_8);
}


//////////////


////// Linker File 
   RAMLS4_7           : origin = 0x0000A000, length = 0x00002000
   
   FLASH_BANK1_SEC567  : origin = 0x095000, length = 0x003000

      LUT_MEM_RAM1        : LOAD = FLASH_BANK1_SEC567,
 	  RUN = RAMLS4_7,
      LOAD_START(LUTLOADSTART),
      RUN_START(LUTRUNSTART),
      LOAD_SIZE(LUTLOADSIZE)
	  ALIGN(8)



/////// CLA Program 

#include "TPS_LUT.h"
float Lut_test;


__attribute__((interrupt)) void Cla1Task1(void)
{
    // x, y: two uint16_t values with correct limits
    Lut_test = LUT_D2[x][y];
}

///////



  • Hi Jason,

    Let me look into this and I'll respond shortly.

    Regards,

    Ozino

  • Hello Ozino,

    I found one error in my code. The line

        memcpy((uint32_t *)&LUTRUNSTART, (uint32_t *)&LUTRUNSTART,
            (uint32_t)&LUTRUNSTART );

    should be 

        memcpy((uint32_t *)&LUTRUNSTART, (uint32_t *)&LUTLOADSTART,
            (uint32_t)&LUTLOADSIZE );

    but I am still seeing the delay after fixing the above line.
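The corrected call can be wrapped in a small helper that refuses the self-copy that slipped into the original post. This is only a host-side sketch: `copy_load_to_run` and its status codes are hypothetical names, not part of driverlib, and on the device the three arguments would come from the linker symbols `LUTRUNSTART`, `LUTLOADSTART` and `LUTLOADSIZE`.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes for the sketch below. */
typedef enum {
    COPY_OK = 0,
    COPY_SAME_ADDRESS = -1, /* source and destination are identical */
    COPY_EMPTY = -2         /* nothing to copy */
} copy_status_t;

/* Hypothetical helper: copy a section from its load address to its run
 * address. It rejects the two argument mistakes that would silently turn
 * the copy into a no-op. The size is in the units memcpy uses on the
 * target. */
copy_status_t copy_load_to_run(void *run_start, const void *load_start,
                               size_t size)
{
    if (run_start == load_start) {
        return COPY_SAME_ADDRESS;
    }
    if (size == 0) {
        return COPY_EMPTY;
    }
    memcpy(run_start, load_start, size);
    return COPY_OK;
}
```

Checking the return value once at startup would flag a mistyped symbol immediately instead of leaving the run-time section with whatever RAM happened to contain.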


  • Hi Jason,

    I suspect your linker is not quite set up correctly.

    Can you confirm how you are mapping the CLA-specific memory sections (scratchpad, bss_Cla, const_cla)? I also noticed that you are attempting to enable CLA1Task2, but it is not defined in your application (it is commented out). Can you also confirm whether you are using task 8 as a normal task or as a background task? If it is used as a background task, different APIs need to be called to set it up. See the academy link at the end for more details.

    Please note that the C2000 CLA examples demonstrate the points I have raised.

    Please refer to this guide to ensure you have properly enabled the CLA in your application: /cfs-file/__key/communityserver-discussions-components-files/171/CLAProjectStructureUG.pdf 

    You can also refer to the CLA academy section on how to configure the CLA.

    https://dev.ti.com/tirex/explore/node?node=A__ASh.QBtmaD.DbEzgWFwnEw__C28X-ACADEMY__1sbHxUB__LATEST

    Regards,

    Ozino

  • "Can you confirm how you are mapping the CLA specific memory sections (scratchpad, bss_Cla, const_cla)"

    I am storing the CLA program in LS0-1 and the non-array data in LS2. The arrays are placed in LS4-7. That seems to be mapping correctly (see picture).

    "enable CLA1Task2 but its not defined in your application (commented out)"

    CLA1Task2 isn't used in my program at the moment and should be commented out everywhere, which I think it is. 

    "Can you also confirm if you are using task 8 as a normal task or background task."

    Task 8 is a normal task used to do variable initialization and called once at startup. 

    Looking at the assembly, there are lots of MNOPs following the single read. Is it possible I am running into arbitration issues even though the CPU isn't actively reading or writing to this memory range? Alternatively, I am aware of CLA pipeline hazards such as a write followed by a read, but I am only doing a read here, so I would not expect that to be the issue.

    Thanks again,
    Jason




    	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 564,column 5,is_stmt,isa 0
            MMOVIZ    MR1,#0                ; [CPU_FPU] |564| 
    	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 563,column 5,is_stmt,isa 0
            MMOV16    @clav,MR3             ; [CPU_FPU] |563| 
    	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 564,column 5,is_stmt,isa 0
            MMOVXI    MR3,#102              ; [CPU_FPU] |564| 
            MMOVZ16   MR2,@LUTP_cla         ; [CPU_FPU] |564| 
            MLSL32    MR2,#16               ; [CPU_FPU] |564| 
            MASR32    MR2,#16               ; [CPU_FPU] |564| 
    $C$L21:    
            MMOVXI    MR1,#1                ; [CPU_FPU] |564| 
            MAND32    MR1,MR1,MR3           ; [CPU_FPU] |564| 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            MBCNDD    $C$L22,EQ             ; [CPU_FPU] |564| 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            ; branchcc occurs ; [] |564| 
            MADD32    MR0,MR0,MR2           ; [CPU_FPU] |564| 
    $C$L22:    
            MLSL32    MR2,#1                ; [CPU_FPU] |564| 
            MLSR32    MR3,#1                ; [CPU_FPU] |564| 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            MBCNDD    $C$L21,NEQ            ; [CPU_FPU] |564| 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            MNOP      ; [CPU_FPU] 
            ; branchcc occurs ; [] |564| 
            MMOVZ16   MR2,@LUTV_cla         ; [CPU_FPU] |564| 
            MLSL32    MR2,#1                ; [CPU_FPU] |564| 
            MADD32    MR2,MR2,MR0           ; [CPU_FPU] |564| 
            MMOV16    MAR0,MR2,#LUT_D2      ; [CPU_FPU] |564| 
            MNOP      ; [CPU_FPU] 
    	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 569,column 5,is_stmt,isa 0
            MMOVIZ    MR0,#15523            ; [CPU_FPU] |569| 
            MMOVXI    MR0,#55050            ; [CPU_FPU] |569| 
    	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 564,column 5,is_stmt,isa 0
            MMOV32    MR2,*MAR0             ; [CPU_FPU] |564| 
            MMOV32    @LUT_Test,MR2         ; [CPU_FPU] |564| 
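Reading the listing above, the loop between `$C$L21` and `$C$L22` looks like a shift-and-add multiply of the row index (`LUTP_cla`) by the constant 102, after which twice the column index (`LUTV_cla`) and the address of `LUT_D2` are added to form the pointer in `MAR0`. The C sketch below mirrors that sequence on a host so the iteration count can be seen. The function names are hypothetical, the indices are treated as unsigned for simplicity, and it assumes MR0 holds zero on entry, which the excerpt does not show.

```c
#include <stddef.h>
#include <stdint.h>

/* Mirror of the $C$L21/$C$L22 loop: add the (shifted) multiplicand to the
 * accumulator for every set bit of the multiplier. If `iterations` is not
 * NULL, it receives the number of times the loop body ran. */
uint32_t shift_add_multiply(uint32_t multiplicand, uint32_t multiplier,
                            unsigned *iterations)
{
    uint32_t acc = 0u;      /* MR0, assumed zero on entry */
    unsigned count = 0u;

    do {
        if (multiplier & 1u) {      /* MAND32 MR1,MR1,MR3 */
            acc += multiplicand;    /* MADD32 MR0,MR0,MR2 */
        }
        multiplicand <<= 1;         /* MLSL32 MR2,#1 */
        multiplier >>= 1;           /* MLSR32 MR3,#1 */
        count++;
    } while (multiplier != 0u);     /* MBCNDD $C$L21,NEQ */

    if (iterations != NULL) {
        *iterations = count;
    }
    return acc;
}

/* Offset added to the base of LUT_D2 in the listing: row * 102 + col * 2.
 * The constant 102 is taken from the MMOVXI MR3,#102 instruction. */
uint32_t lut_word_offset(uint32_t row, uint32_t col)
{
    return shift_add_multiply(row, 102u, NULL) + (col << 1);
}
```

With a multiplier of 102 (binary 1100110) the loop body runs seven times, and each pass in the listing carries two delayed branches with their MNOP slots. The index arithmetic, rather than the `MMOV32 MR2,*MAR0` read itself, may therefore account for a good share of the measured time.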

  • Hi,

    Based on the CLA architecture, the MNOPs are required. The CLA does not have a protected pipeline, so MNOPs must be inserted to provide the normal pipeline delays and ensure that a write followed by a read occurs correctly.

    Related thread: MNOP instructions required for CLA memory or register access

    Can you confirm whether you have optimization turned on? I believe MNOPs are replaced with other instructions when optimization (-O2) is used. Related thread: TMS320F2837x CLA MNOP

    Regards,
    Ozino

    "Can you confirm if you have optimization turned on? I believe NOPs are replaced with other instructions when optimization (-O2) is utilized."

    Optimization is turned on at level 2 (Global Optimizations).



    Did you notice a performance improvement when you used optimization -O2? By how many did the number of MNOPs drop?

    Sorry, I should have clarified. I have always had optimization on at -O2. My previous results were obtained with -O2 enabled.

  • Let me forward this post to the compiler team for further comment.

  • Hi Jason,

    Can you confirm that your system initialization is correct? Our compiler team has confirmed that turning optimization on makes MNOP instructions less likely to be used but does not eliminate them. The MNOP instructions can only account for a few cycles, not hundreds. Therefore, I suspect there is an issue with the system configuration.

    You can try using our CLA examples as a starting template to ensure that you have correctly configured the device. 

    You can also refer to the steps in this file to ensure that you have gone through the CLA setup correctly: /cfs-file/__key/communityserver-discussions-components-files/171/CLAProjectStructureUG.pdf 

    Regards,

    Ozino
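To check a figure such as "several hundred cycles" against "a few cycles" of MNOPs, one option is to bracket the read with two samples of a free-running counter and subtract. A minimal sketch of that arithmetic follows, assuming a 32-bit down-counting timer that runs over its full range and wraps at most once between the two samples. `elapsed_cycles` and its `overhead` parameter are hypothetical, and how the counter is sampled on the device is left open.

```c
#include <stdint.h>

/* Cycles elapsed between two samples of a 32-bit down counter.
 * Unsigned subtraction is modulo 2^32, so a single wrap between the
 * samples is handled without a special case. `overhead` is the cost of
 * taking the two samples back to back with nothing in between. */
uint32_t elapsed_cycles(uint32_t start_count, uint32_t end_count,
                        uint32_t overhead)
{
    uint32_t raw = start_count - end_count;

    /* Clamp at zero so a short measurement never underflows. */
    return (raw > overhead) ? (raw - overhead) : 0u;
}
```

Measuring the overhead first, with the two samples taken back to back, keeps the sampling cost out of the reported number.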