TMS320F280038C-Q1: CLA array placed in LSRAM4 through 7 read access time extremely slow

Jason Galtieri1

Hello,

I would like to store a couple of constant float LUT arrays in LS RAM to be accessed by the CLA. The LUTs are initially stored in flash and copied to RAM at startup, and this all worked correctly when the LS RAM was read by the CPU. However, once I handed RAM access over to the CLA and had the CLA do the reads, I noticed that the access time for a single read (in the CLA) was several hundred clock cycles, so something seems off. The value read is correct, so I am guessing my setup is incorrect.

Here are the relevant code sections

///// LUT C File 

#pragma DATA_SECTION (LUT_D2,"LUT_DUTY_MEM1")

const float LUT_D2[16][71] = {..}

///// 


///// LUT H File 

extern const float LUT_D2[LUT_P_MAX+1][LUT_Vr_MAX+1];

//////


///// CLA memory config 


void initCLA(void)
{
    //
    // Copy the program and constants from FLASH to RAM before configuring
    // the CLA
    //
#if defined(_FLASH)
    memcpy((uint32_t *)&Cla1ProgRunStart, (uint32_t *)&Cla1ProgLoadStart,
           (uint32_t)&Cla1ProgLoadSize);
    memcpy((uint32_t *)&Cla1ConstRunStart, (uint32_t *)&Cla1ConstLoadStart,
        (uint32_t)&Cla1ConstLoadSize );

    memcpy((uint32_t *)&LUTRUNSTART, (uint32_t *)&LUTRUNSTART,
        (uint32_t)&LUTRUNSTART );

//    memcpy(&LUTRUNSTART,&LUTRUNSTART,(Uint32)&LUTRUNSTART);

#endif //defined(_FLASH)

    //
    // CLA program will reside in RAMLS0-1, data in RAMLS2, and the LUT arrays in RAMLS4-7
    //
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS0, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS1, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS2, MEMCFG_LSRAMMASTER_CPU_CLA1);

    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS4, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS5, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS6, MEMCFG_LSRAMMASTER_CPU_CLA1);
    MemCfg_setLSRAMMasterSel(MEMCFG_SECT_LS7, MEMCFG_LSRAMMASTER_CPU_CLA1);


    MemCfg_setCLAMemType(MEMCFG_SECT_LS0, MEMCFG_CLA_MEM_PROGRAM);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS1, MEMCFG_CLA_MEM_PROGRAM);

    MemCfg_setCLAMemType(MEMCFG_SECT_LS2, MEMCFG_CLA_MEM_DATA);

    MemCfg_setCLAMemType(MEMCFG_SECT_LS4, MEMCFG_CLA_MEM_DATA);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS5, MEMCFG_CLA_MEM_DATA);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS6, MEMCFG_CLA_MEM_DATA);
    MemCfg_setCLAMemType(MEMCFG_SECT_LS7, MEMCFG_CLA_MEM_DATA);

//
// Suppressing #770-D conversion from pointer to smaller integer
// The CLA address range is 16 bits so the addresses passed to the MVECT
// registers will be in the lower 64KW address space. Turn the warning
// back on after the MVECTs are assigned addresses
//
#pragma diag_suppress=770

    //
    // Assign the task vectors and set the triggers for task 1
    // and 8
    //
    CLA_mapTaskVector(CLA1_BASE, CLA_MVECT_1, (uint16_t)&Cla1Task1);
//    CLA_mapTaskVector(CLA1_BASE, CLA_MVECT_2, (uint16_t)&Cla1Task2);
    CLA_mapTaskVector(CLA1_BASE, CLA_MVECT_8, (uint16_t)&Cla1Task8);
    //CLA_setTriggerSource(CLA_TASK_1, CLA_TRIGGER_ADCA1);
//    CLA_setTriggerSource(CLA_TASK_1, CLA_TRIGGER_ADCB1);
    CLA_setTriggerSource(CLA_TASK_1, CLA_TRIGGER_EPWM7INT);
//    CLA_setTriggerSource(CLA_TASK_2, CLA_TRIGGER_EPWM7INT);
    CLA_setTriggerSource(CLA_TASK_8, CLA_TRIGGER_SOFTWARE);

#pragma diag_warning=770

    //
    // Enable Tasks 1 and 8
    //
//    CLA_enableTasks(CLA1_BASE, (CLA_TASKFLAG_1 | CLA_TASKFLAG_2| CLA_TASKFLAG_8));
    CLA_enableTasks(CLA1_BASE, (CLA_TASKFLAG_1 | CLA_TASKFLAG_8));

    //
    // Force task 8, the one time initialization task
    //
    CLA_forceTasks(CLA1_BASE, CLA_TASKFLAG_8);
}


//////////////


////// Linker File 
   RAMLS4_7           : origin = 0x0000A000, length = 0x00002000
   
   FLASH_BANK1_SEC567  : origin = 0x095000, length = 0x003000

   LUT_MEM_RAM1        : LOAD = FLASH_BANK1_SEC567,
                         RUN = RAMLS4_7,
                         LOAD_START(LUTLOADSTART),
                         RUN_START(LUTRUNSTART),
                         LOAD_SIZE(LUTLOADSIZE)
                         ALIGN(8)



/////// CLA Program 

#include "TPS_LUT.h"
float Lut_test;


__attribute__((interrupt)) void Cla1Task1(void)
{
    // x, y: two uint16_t values with correct limits
    Lut_test = LUT_D2[x][y];
}


///////

3 months ago

Ozino Odharo 3 months ago

TI__Mastermind 21275 points

Hi Jason,

Let me look into this and I'll respond shortly.

Regards,

Ozino

Jason Galtieri1 3 months ago in reply to Ozino Odharo

Prodigy 165 points

Hello Ozino,

I found one error in my code:

memcpy((uint32_t *)&LUTRUNSTART, (uint32_t *)&LUTRUNSTART,
(uint32_t)&LUTRUNSTART );

should be

memcpy((uint32_t *)&LUTRUNSTART, (uint32_t *)&LUTLOADSTART,
(uint32_t)&LUTLOADSIZE );

but I am still seeing the delay after fixing that line.
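
A minimal sketch of the corrected copy, wrapped in a helper that also checks the RAM copy against the flash image before LS4-7 is handed over to the CLA. This is not code from the project: `lut_copy_and_verify` is a hypothetical name, and the helper takes plain pointers so that it can be tried on a host with stand-in arrays. On the target, the arguments would be the linker-generated symbols `LUTRUNSTART`, `LUTLOADSTART` and `LUTLOADSIZE` from the linker file above. On the C28x, `memcpy` counts in 16-bit words, and that is assumed here to be the same unit the linker reports for `LOAD_SIZE`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy `size` addressable units from the load image to the run address,
 * then confirm that the run copy matches the load image.
 * Returns 0 on success, -1 if the two regions differ after the copy. */
static int lut_copy_and_verify(void *run, const void *load, size_t size)
{
    assert(run != NULL && load != NULL);

    memcpy(run, load, size);
    return (memcmp(run, load, size) == 0) ? 0 : -1;
}

/* On the target the call would look like this (symbols from the linker file):
 *
 *   lut_copy_and_verify(&LUTRUNSTART, &LUTLOADSTART, (size_t)&LUTLOADSIZE);
 */
```

The check only makes sense while the CPU still has access to the block, so it would have to run before the `MemCfg_setLSRAMMasterSel` and `MemCfg_setCLAMemType` calls for LS4-7.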

Ozino Odharo 3 months ago in reply to Jason Galtieri1

TI__Mastermind 21275 points

Hi Jason,

I suspect your linker command file is not quite set up correctly.

Can you confirm how you are mapping the CLA specific memory sections (scratchpad, bss_Cla, const_cla)? I also notice that you are attempting to enable CLA1Task2 but it's not defined in your application (commented out). Can you also confirm if you are using task 8 as a normal task or background task? If you are using it as a background task, there are different APIs that need to be called to set it up. See the academy link at the end for more details.

Please note that the C2000 CLA examples demonstrate each of the points I have raised.

Please refer to this guide to ensure you've properly enabled the CLA in your application: /cfs-file/__key/communityserver-discussions-components-files/171/CLAProjectStructureUG.pdf

You can also refer to the CLA academy section on how to configure the CLA.

https://dev.ti.com/tirex/explore/node?node=A__ASh.QBtmaD.DbEzgWFwnEw__C28X-ACADEMY__1sbHxUB__LATEST

Regards,

Ozino

Jason Galtieri1 3 months ago in reply to Ozino Odharo

Prodigy 165 points

"Can you confirm how you are mapping the CLA specific memory sections (scratchpad, bss_Cla, const_cla)"

I am storing the CLA program in LS0-1 and the non-array data in LS2. The arrays are placed in LS4-7. That seems to be mapping correctly (see picture).

"enable CLA1Task2 but it's not defined in your application (commented out)"

CLA1Task2 isn't used in my program at the moment and should be commented out everywhere, which I think it is.

"Can you also confirm if you are using task 8 as a normal task or background task."

Task 8 is a normal task used to do variable initialization and called once at startup.

Looking at the assembly, there are lots of MNOPs following the single read. Is it possible I am running into arbitration issues even though the CPU isn't actively reading or writing this memory range? I am also aware of various CLA pipeline hazards, such as a write followed by a read, but I am only doing a read here, so I wouldn't expect that to be an issue.

Thanks again,
Jason

	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 564,column 5,is_stmt,isa 0
        MMOVIZ    MR1,#0                ; [CPU_FPU] |564| 
	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 563,column 5,is_stmt,isa 0
        MMOV16    @clav,MR3             ; [CPU_FPU] |563| 
	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 564,column 5,is_stmt,isa 0
        MMOVXI    MR3,#102              ; [CPU_FPU] |564| 
        MMOVZ16   MR2,@LUTP_cla         ; [CPU_FPU] |564| 
        MLSL32    MR2,#16               ; [CPU_FPU] |564| 
        MASR32    MR2,#16               ; [CPU_FPU] |564| 
$C$L21:    
        MMOVXI    MR1,#1                ; [CPU_FPU] |564| 
        MAND32    MR1,MR1,MR3           ; [CPU_FPU] |564| 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        MBCNDD    $C$L22,EQ             ; [CPU_FPU] |564| 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        ; branchcc occurs ; [] |564| 
        MADD32    MR0,MR0,MR2           ; [CPU_FPU] |564| 
$C$L22:    
        MLSL32    MR2,#1                ; [CPU_FPU] |564| 
        MLSR32    MR3,#1                ; [CPU_FPU] |564| 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        MBCNDD    $C$L21,NEQ            ; [CPU_FPU] |564| 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        MNOP      ; [CPU_FPU] 
        ; branchcc occurs ; [] |564| 
        MMOVZ16   MR2,@LUTV_cla         ; [CPU_FPU] |564| 
        MLSL32    MR2,#1                ; [CPU_FPU] |564| 
        MADD32    MR2,MR2,MR0           ; [CPU_FPU] |564| 
        MMOV16    MAR0,MR2,#LUT_D2      ; [CPU_FPU] |564| 
        MNOP      ; [CPU_FPU] 
	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 569,column 5,is_stmt,isa 0
        MMOVIZ    MR0,#15523            ; [CPU_FPU] |569| 
        MMOVXI    MR0,#55050            ; [CPU_FPU] |569| 
	.dwpsn	file "../CPU1CLA1_cla_VS.cla",line 564,column 5,is_stmt,isa 0
        MMOV32    MR2,*MAR0             ; [CPU_FPU] |564| 
        MMOV32    @LUT_Test,MR2         ; [CPU_FPU] |564|
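
One way to read the listing above is that the loop between `$C$L21` and `$C$L22` multiplies the sign-extended `LUTP_cla` by the constant 102 held in MR3, using only AND, ADD and shifts, and that the result plus `2 * LUTV_cla` becomes the word offset added to `LUT_D2`. The C model below is only an interpretation of the posted excerpt, not a diagnosis confirmed in this thread. It assumes MR0 starts at zero (its initialization is not visible in the excerpt), and the names `shift_add_mul` and `lut_word_offset` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C model of the $C$L21/$C$L22 loop: multiply `value` by `factor` using
 * one loop pass per bit of `factor`. If `passes` is not NULL, it receives
 * the number of loop passes executed. */
static int32_t shift_add_mul(int32_t value, uint32_t factor, unsigned *passes)
{
    uint32_t acc = 0u;                  /* MR0: running sum (assumed 0 on entry) */
    uint32_t addend = (uint32_t)value;  /* MR2: value shifted left once per pass */
    unsigned n = 0u;

    do {
        if (factor & 1u) {              /* MAND32 MR1,MR1,MR3 / MBCNDD $C$L22,EQ */
            acc += addend;              /* MADD32 MR0,MR0,MR2 */
        }
        addend <<= 1;                   /* MLSL32 MR2,#1 */
        factor >>= 1;                   /* MLSR32 MR3,#1 */
        n++;
    } while (factor != 0u);             /* MBCNDD $C$L21,NEQ */

    if (passes != NULL) {
        *passes = n;
    }
    return (int32_t)acc;
}

/* Offset, in 16-bit words, of element [row][col] of a float table whose
 * rows are `row_stride_words` apart. A float occupies two 16-bit words. */
static int32_t lut_word_offset(int16_t row, uint16_t col, uint32_t row_stride_words)
{
    assert((row_stride_words % 2u) == 0u);
    return shift_add_mul(row, row_stride_words, NULL) + 2 * (int32_t)col;
}
```

With the constant 102 from the listing (binary 1100110), the loop runs seven passes, and every pass includes the MNOP padding around both `MBCNDD` branches. The cost of the index calculation therefore grows with the bit length of the row stride, separately from the `MMOV32 MR2,*MAR0` read itself.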

Ozino Odharo 3 months ago in reply to Jason Galtieri1

TI__Mastermind 21275 points

Hi,

Based on the CLA architecture, the MNOPs are required. The CLA does not have a protected pipeline, so the normal pipeline delays have to be inserted explicitly; the MNOPs ensure that a write followed by a read occurs correctly.

See: MNOP instructions required for CLA memory or register access

Can you confirm if you have optimization turned on? I believe NOPs are replaced with other instructions when optimization (-O2) is utilized. See also: TMS320F2837x CLA MNOP

Regards,
Ozino

Jason Galtieri1 3 months ago in reply to Ozino Odharo

Prodigy 165 points

Ozino Odharo said:
Can you confirm if you have optimization turned on? I believe NOPs are replaced with other instructions when optimization (-O2) is utilized.

Optimization is turned on at 2 - Global Optimizations

Ozino Odharo 3 months ago in reply to Jason Galtieri1

TI__Mastermind 21275 points

Did you notice a performance improvement when you used optimization -O2? By how much did the number of NOPs decrease?

Jason Galtieri1 3 months ago in reply to Ozino Odharo

Prodigy 165 points

Sorry, I should have clarified. I have always had optimization on at O2. My previous results were obtained with it at O2.

Ozino Odharo 3 months ago in reply to Jason Galtieri1

TI__Mastermind 21275 points

Let me forward this post to the compiler team for further comment.

Ozino Odharo 3 months ago in reply to Ozino Odharo

TI__Mastermind 21275 points

Hi Jason,

Can you confirm that your system initialization is correct? Our compiler team has confirmed that turning optimization on makes it less likely that an MNOP instruction is used, but does not eliminate MNOPs. The MNOP instructions can only account for a few cycles, not hundreds. Therefore, I suspect there is some issue with the system configuration.

You can try using our CLA examples as a starting template to ensure that you have correctly configured the device.

You can also refer to the steps in this file to ensure that you've gone through the CLA setup correctly: /cfs-file/__key/communityserver-discussions-components-files/171/CLAProjectStructureUG.pdf

Regards,

Ozino
