CCS/MSP432P401R: using __attribute__((ramfunc)) and realizing a reduction in code execution time.

Part Number: MSP432P401R

Tool/software: Code Composer Studio

Part #: MSP432P401R (specifically, the version 2 MSP432 LaunchPad)

IDE: Code Composer Studio v7.2.0.00013

Compiler: TI v16.9.4.LTS

There are two computationally intensive functions that I need to run from RAM in order to reduce execution time.

__attribute__((ramfunc)) void Proj_BP2_Detection_Filter(RECEVIVE_APP *pRx)

and a function which is called within Proj_BP2_Detection_Filter(RECEVIVE_APP *pRx)

__attribute__((ramfunc)) void Biquad_Section_Bandpass(INT16 *pIO, BIQUAD_SECTION *pBQ).
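For readers following along, here is a minimal sketch of how the attribute might be wrapped so that both the caller and the callee are tagged consistently and the same source still builds on other compilers. The macro name `RAMFUNC` and the functions `inner_scale` and `outer_sum` are hypothetical stand-ins, not the poster's code; the version check mirrors the one used in the linker command file posted later in this thread.

```c
#include <stdint.h>

/* Apply the attribute only where the TI ARM compiler supports it.
   On any other compiler the macro expands to nothing, so the code
   still builds (and simply runs from wherever .text is placed). */
#if defined(__TI_COMPILER_VERSION__) && (__TI_COMPILER_VERSION__ >= 15009000)
#define RAMFUNC __attribute__((ramfunc))
#else
#define RAMFUNC
#endif

/* Hypothetical callee: tagged as well, so the call made from the
   RAM-resident caller does not go back out to flash. */
RAMFUNC int32_t inner_scale(int32_t x)
{
    return x * 3;
}

/* Hypothetical caller, standing in for the outer filter routine. */
RAMFUNC int32_t outer_sum(const int32_t *p, uint32_t n)
{
    int32_t acc = 0;
    uint32_t i;

    for (i = 0; i < n; i++) {
        acc += inner_scale(p[i]);
    }
    return acc;
}
```

Calling `outer_sum` on the three values 1, 2, 3 returns 18 regardless of where the code is placed; only the placement (and therefore the timing) changes on the target.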

The project *.map file indicates

.TI.ramfunc
*          0    0000ec58    00000464     RUN ADDR = 01008218
                0000ec58    000003d4     Proj_BP2_Detection_Filter.obj (.TI.ramfunc)
                0000f02c    00000004     --HOLE-- [fill = 0]
                0000f030    0000008c     Biquad_Section.obj (.TI.ramfunc)

FAR CALL TRAMPOLINES

callee name                 trampoline name
   callee addr  tramp addr  call addr  call info
--------------  ----------- ---------  ----------------
Proj_BP2_Detection_Filter   $Tramp$TT$S$$Proj_BP2_Detection_Filter
   01008219     0000d740    00006d6c   Proj_BP2_Receiver_Signal_Detect.obj (.text)

[1 trampolines]
[1 trampoline calls]

NO IMPROVEMENT in execution time has been realized. HELP?

  • Scott,
    Can you confirm that you are comparing the execution time at 48 MHz between flash, which requires a wait state of 1, and SRAM, which requires no wait states?

    Thanks,
    Chris
  • I will provide you with relevant code sections below.

    I am not sure if this confirms the assumption or not.

    The following function is called early in main.c to set the 48 MHz clock and 1 wait state for both flash banks.

    I do not know any specific DriverLib call for setting zero wait states for RAM execution. Can you elaborate on specific settings required to result in 0 wait state execution from RAM?

    BOOL Proj_BP2_Receiver_Clock(void)
    {
        //BOOL        bReturnVal;
    
    #ifdef MSP432WARE_CONFIGURATION
    
        // Configuring pins for HFXT peripheral/crystal usage
        MAP_GPIO_setAsPeripheralModuleFunctionOutputPin(GPIO_PORT_PJ, GPIO_PIN2 | GPIO_PIN3, GPIO_PRIMARY_MODULE_FUNCTION);
    
        // Explicitly set external clock frequencies
        MAP_CS_setExternalClockSourceFrequency(32768,48000000);
        // use DC-DC converter, VCORE1 to allow 48MHz operation, place MCU into ACTIVE mode
        if (!MAP_PCM_setPowerState(PCM_AM_DCDC_VCORE1))
        {
            return RSC_FALSE;
        }
        MAP_PCM_enableRudeMode();
    
        // according to page 31 of MSP432P401 data sheet, a wait state of one is allowed at 48MHz, DCDC
        MAP_FlashCtl_setWaitState(FLASH_BANK0, 1);
        MAP_FlashCtl_setWaitState(FLASH_BANK1, 1);
        // Starting HFXT in non-bypass mode with timeout
        if (!MAP_CS_startHFXTWithTimeout(false,0x00010000))
        {
            return RSC_FALSE;
        }
    
        // Enable peripheral module clock requests for the active mode clocks
        MAP_CS_enableClockRequest((CS_MCLK | CS_HSMCLK | CS_SMCLK));
    
        // Initializing MCLK to HFXT, 48MHz
        MAP_CS_initClockSignal(CS_MCLK, CS_HFXTCLK_SELECT, CS_CLOCK_DIVIDER_1);
        // MCLK is the source for the CPU clock
    
        // Initializing HSMCLK to HFXT divided by 2, 24MHz
        MAP_CS_initClockSignal(CS_HSMCLK, CS_HFXTCLK_SELECT, CS_CLOCK_DIVIDER_2);
        // HSMCLK may be the clock source for certain peripherals: ADC14
    
        // Initializing SMCLK to HFXT divided by 4, 12MHz
        MAP_CS_initClockSignal(CS_SMCLK, CS_HFXTCLK_SELECT, CS_CLOCK_DIVIDER_4);
        // SMCLK may be the clock source for certain peripherals: SPI, UARTs, Timers, etc.
    
        // Initializing BCLK to REFO, 32768 Hz
        MAP_CS_setReferenceOscillatorFrequency(CS_REFO_32KHZ);
        MAP_CS_initClockSignal(CS_BCLK, CS_REFOCLK_SELECT, CS_CLOCK_DIVIDER_1);
    #endif
    
        return RSC_TRUE;
    
    }
    

  • No, there are no wait states in the SRAM execution. This does confirm you are accurately comparing FLASH to SRAM and you should see an improvement in performance. Let me look into this some more and see if there are any implications from making calls within the SRAM space as well as how the code is aligned.

    Chris
  • To speed things up and make your response as relevant as possible, I am providing additional code segments as well as the linker command file.

    //***********
    // Linker Command
    //***************

    --retain=flashMailbox

    MEMORY
    {
        INTVECS  (RX) : origin = 0x00000000, length = 0x00000100
        APP_MAIN (RX) : origin = 0x00000100, length = 0x0001EF00
        NV1      (RX) : origin = 0x0001F000, length = 0x00001000
        UPDATE   (RX) : origin = 0x00020000, length = 0x0001EF00
        NV2      (RX) : origin = 0x0003F000, length = 0x00001000
        INFO     (RX) : origin = 0x00200000, length = 0x00004000
    #ifdef __TI_COMPILER_VERSION__
    #if __TI_COMPILER_VERSION__ >= 15009000
        ALIAS
        {
            SRAM_CODE (RWX): origin = 0x01000000
            SRAM_DATA (RW) : origin = 0x20000000
        } length = 0x00010000
        // Provision for running code from RAM: total code lenght < 0x00000400
        //SRAM_VEC_TABLE (RWX) : origin = 0x20000000, length = 0x00000100
        //SRAM_RESERVED (RWX) : origin = 0x20000100, length = 0x00000400
        //SRAM_DATA (RW) : origin = 0x20000500, length = 0x00010000 - 0x00000500
        //SRAM_CODE_RESERVED (RWX) : origin = 0x01000100, length = 0x00000400
    #else
        /* Hint: If the user wants to use ram functions, please observe that SRAM_CODE             */
        /* and SRAM_DATA memory areas are overlapping. You need to take measures to separate       */
        /* data from code in RAM. This is only valid for Compiler version earlier than 15.09.0.STS.*/
        SRAM_CODE (RWX): origin = 0x01000000, length = 0x00010000
        SRAM_DATA (RW) : origin = 0x20000000, length = 0x00010000
    #endif
    #endif
    }

    /* The following command line options are set as part of the CCS project.    */
    /* If you are building using the command line, or for some reason want to    */
    /* define them here, you can uncomment and modify these lines as needed.     */
    /* If you are using CCS for building, it is probably better to make any such */
    /* modifications in your CCS project and leave this file alone.              */
    /*                                                                           */
    /* A heap size of 1024 bytes is recommended when you plan to use printf()    */
    /* for debug output to the console window.                                   */
    /*                                                                           */
    /* --heap_size=0                                                             */
    /* --stack_size=1024                                                         */
    /* --library=rtsv7M4_T_le_eabi.lib                                           */

    /* Section allocation in memory */

    SECTIONS
    {
        .intvecs    : > INTVECS
        .text       : > APP_MAIN
        //.code_romToram: RUN_START(code_romToram_run_start), RUN_SIZE(code_romToram_run_size) > APP_MAIN
        .const      : > APP_MAIN
        .cinit      : > APP_MAIN
        .pinit      : > APP_MAIN
        .init_array : > APP_MAIN
        .binit      : {} > APP_MAIN

        /* The following sections show the usage of the INFO flash memory */
        /* INFO flash memory is intended to be used for the following     */
        /* device specific purposes:                                      */
        /* Flash mailbox for device security operations                   */
        .flashMailbox : > 0x00200000
        /* TLV table for device identification and characterization       */
        .tlvTable     : > 0x00201000
        /* BSL area for device bootstrap loader                           */
        .bslArea      : > 0x00202000

        .vtable     : > 0x20000000
        //.sram_reserved: RUN_START(sram_reserved_run_start) > SRAM_RESERVED
        //.sram_code_reserved: RUN_START(sram_code_reserved_run_start) > SRAM_CODE_RESERVED
        .data       : > SRAM_DATA
        .bss        : > SRAM_DATA
        .sysmem     : > SRAM_DATA
        .stack      : > SRAM_DATA (HIGH)

    #ifdef __TI_COMPILER_VERSION__
    #if __TI_COMPILER_VERSION__ >= 15009000
        .TI.ramfunc : {} load=APP_MAIN, run=SRAM_CODE, table(BINIT)
    #endif
    #endif
    }

    /* Symbolic definition of the WDTCTL register for RTS */
    WDTCTL_SYM = 0x4000480C;
    //***********************
    // BIQUAD SECTION
    //******************
    //============================================================
    // Description: Biquad Digital Filter (one iteration)
    // Function is declared with NO passed variables so it is
    // capable of being copied and run from RAM.
    //
    // NOTE: This has been optimized for bandpass applications, i.e a0 == 1 and b1 == 0.
    //		x2 = x1;
    //		x1 = x0;
    //		x0 = (FLOAT) *pIO;
    //		y2 = y1;
    //		y1 = y0
    //		y0 = a0*(b0*x0 + b1*x1 + b2*x2) - a2*y2 -a1*y1;
    //
    // NOTE: This has been optimized for 10 bit ADC and 16 bit storage.
    // NOTE: A further speed improvement is made by tolerating truncation versus rounding
    //		*pIO = ((INT16) pBQ->y0);
    //
    // ALTERNATIVELY:
    //		*pIO = MATH_RoundToInt16(y0);
    //
    //============================================================
    #if 0
    #if __TI_COMPILER_VERSION__ >= 15009000
    __attribute__((ramfunc))
    #endif
    #endif
    
    void Biquad_Section_Bandpass(INT16 *pIO, BIQUAD_SECTION *pBQ)
    {
    	// input shifts
    	pBQ->x2 = pBQ->x1;
    	pBQ->x1 = pBQ->x0;
    	pBQ->x0 = (FLOAT) *pIO;
    
    	// output shifts
    	pBQ->y2 = pBQ->y1;
    	pBQ->y1 = pBQ->y0;
    
    	// filter calculation
    	// pBQ->y0 = (pBQ->a0 * ((pBQ->b0 * pBQ->x0) + (pBQ->b1 * pBQ->x1) + (pBQ->b2 * pBQ->x2))) - (pBQ->a2 * pBQ->y2) - (pBQ->a1 * pBQ->y1);
    	//
    	// NOTE: Optimized for bandpass applications, i.e b1 == 0.
    	// + (pBQ->b1 * pBQ->x1) has been omitted;
    	// NOTE: Optimized for bandpass applications, i.e a0 == 1.
    	// pBQ->a0 *  has been omitted;
    	pBQ->y0 = ((pBQ->b0 * pBQ->x0) + (pBQ->b2 * pBQ->x2)) - (pBQ->a2 * pBQ->y2) - (pBQ->a1 * pBQ->y1);
    
    #if 0
    	*pIO = MATH_RoundToInt16(pBQ->y0);
    	// A significant speed improvement is made by in-lining the relevant portions of MATH_RoundToInt16.
    	// Also, since the ADC is at 10bit and we are using INT16's to store the filter results
    	// I am also forgoing overflow protection.
    	if (pBQ->y0 >= 0)
    	{
    		*pIO = ((INT16) (pBQ->y0 + 0.5));
    	}
    	else
    	{
    		*pIO = ((INT16) (pBQ->y0 - 0.5));
    	}
    #endif
    
    	// A further speed improvement is made by tolerating truncation versus rounding
    	*pIO = ((INT16) pBQ->y0);
    }
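The difference equation in the header comment above can be exercised on a host PC, independent of flash or RAM placement, to confirm the arithmetic before timing it on the target. The sketch below is a stand-alone restatement, not the poster's build: the `FLOAT` and `INT16` typedefs, the `BIQUAD_SECTION` field layout, and the function name `biquad_bandpass_step` are assumptions made for illustration, since the real definitions are not shown in the thread.

```c
#include <stdint.h>

typedef float   FLOAT;   /* assumption: single-precision float */
typedef int16_t INT16;   /* assumption: 16-bit signed sample   */

/* Hypothetical layout; the real BIQUAD_SECTION is not shown in the thread. */
typedef struct {
    FLOAT b0, b2;        /* feed-forward coefficients (b1 == 0, omitted) */
    FLOAT a1, a2;        /* feedback coefficients (a0 == 1, omitted)     */
    FLOAT x0, x1, x2;    /* input history                                */
    FLOAT y0, y1, y2;    /* output history                               */
} BIQUAD_SECTION;

/* One filter iteration: y0 = b0*x0 + b2*x2 - a2*y2 - a1*y1,
   with the result truncated (not rounded) back into *pIO. */
void biquad_bandpass_step(INT16 *pIO, BIQUAD_SECTION *pBQ)
{
    /* shift input history */
    pBQ->x2 = pBQ->x1;
    pBQ->x1 = pBQ->x0;
    pBQ->x0 = (FLOAT)*pIO;

    /* shift output history */
    pBQ->y2 = pBQ->y1;
    pBQ->y1 = pBQ->y0;

    pBQ->y0 = ((pBQ->b0 * pBQ->x0) + (pBQ->b2 * pBQ->x2))
            - (pBQ->a2 * pBQ->y2) - (pBQ->a1 * pBQ->y1);

    *pIO = (INT16)pBQ->y0;   /* truncation toward zero */
}
```

With b0 = 0.5, b2 = -0.5 and no feedback, feeding the samples 10, 0, 0 through a zero-initialized section yields 5, 0, -5, which is a quick sanity check that the input history shifts in the intended order.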
    
    

    //********************

    // Detection Filter

    //**********************

    //============================================================
    // Description:
    //
    // X,Y,Z BPF (FUNDAMENTAL only).
    // X,Y,Z RMS calculations
    //============================================================
    #if 0
    #if __TI_COMPILER_VERSION__ >= 15009000
    __attribute__((ramfunc))
    #endif
    #endif
    
    void Proj_BP2_Detection_Filter(RECEVIVE_APP *pRx)
    {
        WORD		wi;
    #if 0
        WORD		wj;
    
        DWORD		dwX_Sum;
        DWORD		dwY_Sum;
        DWORD		dwZ_Sum;
    #endif
    
        INT16		*pX_IO;
        INT16		*pY_IO;
        INT16		*pZ_IO;
    
    #if 0
        FLOAT		fX_RMS;
        FLOAT		fY_RMS;
        FLOAT		fZ_RMS;
    #endif
    
    	// initialize channel RMS filters
        pRx->fX_RMS = 0;
        pRx->fY_RMS = 0;
        pRx->fZ_RMS = 0;
    
    	// initialize channel IO starting points
    	pX_IO = &pRx->X_Rx[pRx->wSpStart];
        pY_IO = &pRx->Y_Rx[pRx->wSpStart];
        pZ_IO = &pRx->Z_Rx[pRx->wSpStart];
    
    	for (wi = pRx->wSpStart; wi < pRx->wSpEnd; wi++)
    	{
    		Biquad_Section_Bandpass(pX_IO, &_X_F0_BPF_S0);
    		Biquad_Section_Bandpass(pX_IO, &_X_F0_BPF_S1);
    
    		Biquad_Section_Bandpass(pY_IO, &_Y_F0_BPF_S0);
    		Biquad_Section_Bandpass(pY_IO, &_Y_F0_BPF_S1);
    
    		Biquad_Section_Bandpass(pZ_IO, &_Z_F0_BPF_S0);
    		Biquad_Section_Bandpass(pZ_IO, &_Z_F0_BPF_S1);
    
    #if 0
    		if ((wi >= pRx->wRMS_Valid_Index) && !(wi % pRx->wRMS_Interval))
    		{
    			// calculate RMS over wRMS_Window samples
    			dwX_Sum = 0;
    			dwY_Sum = 0;
    			dwZ_Sum = 0;
    
    			for (wj = wi - pRx->wRMS_Window; wj < wi; wj++)
    			{
    				dwX_Sum += (pRx->X_Rx[wj] * pRx->X_Rx[wj]);
    				dwY_Sum += (pRx->Y_Rx[wj] * pRx->Y_Rx[wj]);
    				dwZ_Sum += (pRx->Z_Rx[wj] * pRx->Z_Rx[wj]);
    			}
    			// RMS
    			fX_RMS = sqrtf(((FLOAT) dwX_Sum) / ((FLOAT) pRx->wRMS_Window));
    			fY_RMS = sqrtf(((FLOAT) dwY_Sum) / ((FLOAT) pRx->wRMS_Window));
    			fZ_RMS = sqrtf(((FLOAT) dwZ_Sum) / ((FLOAT) pRx->wRMS_Window));
    
    			// RMS filters
    			pRx->fX_RMS = (0.875 * pRx->fX_RMS) + (0.125 * fX_RMS);
    			pRx->fY_RMS = (0.875 * pRx->fY_RMS) + (0.125 * fY_RMS);
    			pRx->fZ_RMS = (0.875 * pRx->fZ_RMS) + (0.125 * fZ_RMS);
    
    			// Filtered RMS threshold test
    			if (pRx->fX_RMS > MINIMUM_RMS)
    			{
    				pRx->bX_CarrierDetect = RSC_TRUE;
    			}
    			if (pRx->fY_RMS > MINIMUM_RMS)
    			{
    				pRx->bY_CarrierDetect = RSC_TRUE;
    			}
    			if (pRx->fZ_RMS > MINIMUM_RMS)
    			{
    				pRx->bZ_CarrierDetect = RSC_TRUE;
    			}
    
    #if 0
    			// !!WIP looking for peaks during no signal conditions to adjust filter coefficients
    			if (pRx->fX_RMS > pRx->fXYZ_RMS)
    			{
    				pRx->fXYZ_RMS = pRx->fX_RMS;
    			}
    			if (pRx->fY_RMS > pRx->fXYZ_RMS)
    			{
    				pRx->fXYZ_RMS = pRx->fY_RMS;
    			}
    			if (pRx->fZ_RMS > pRx->fXYZ_RMS)
    			{
    				pRx->fXYZ_RMS = pRx->fZ_RMS;
    			}
    #endif
    		}
    		// A significant speed improvement is made by in-lining "MATH_RoundToInt16(fRMS)".
    		// Since the ADC is at 10 bits and we are using WORD's to store results,
    		// I am also forgoing overflow protection. A further speed improvement is
    		// made by tolerating truncation versus rounding.
    		// This also fills in gaps created by pRx->wRMS_Interval
    		pRx->X_RMS[wi] = ((WORD) pRx->fX_RMS);
    		pRx->Y_RMS[wi] = ((WORD) pRx->fY_RMS);
    		pRx->Z_RMS[wi] = ((WORD) pRx->fZ_RMS);
    #endif
    
    		// if we achieve carrier detect on any channel,
    		// must also be able to start looking for and
    		// keeping track of Header Sync
    		pX_IO++;
    	    pY_IO++;
    	    pZ_IO++;
    	}
    }
    
    

    Additionally, the sample ISR may benefit from RAM execution:

    //*******************

    // Sample ISR

    //****************

    //************************************************************
    //	Description:
    // 	This routine is TIME CRITICAL. IRQ interval may be as short as 5usec.
    //
    //	Clear TA3 CCR0 capture compare interrupt.
    //	Read XYZ
    //	increment index
    //	Trigger ADC sequence.
    //************************************************************
    #if 0
    //************************************************************
    //	Using the "ramfunc" attribute has added 50% to the execution time of the ISR
    //************************************************************
    #if __TI_COMPILER_VERSION__ >= 15009000
    __attribute__((ramfunc))
    #endif
    #endif
    
    void Proj_Dev_BP2_Receiver_TA3_CCR0_ISR(void)
    {
    	//*******************************************************************************************
    	//LED1_RED_ON();
    	//HWREG16((uint32_t)P1 + ((uint32_t)&P1->OUT - (uint32_t)P1)) |= LED1_RED_BIT;
    	//*******************************************************************************************
    
    #if 0
    	//*******************************************************************************************
    	// debug - check for ADC14 TOV or OV errors
    	if (ADC14->IFGR1 & 0x00000030)
    	{
    		// get error flags
    		_Receive_App.dwdebug = (ADC14->IFGR1 & 0x00000030);
    		ADC14->CLRIFGR1 = 0x00000030;
    		// count error flags
    		_Receive_App.werror_cnt++;
    	}
    #endif
    
    	//*******************************************************************************************
    	//MAP_Timer_A_clearCaptureCompareInterrupt(TIMER_A3_BASE, TIMER_A_CAPTURECOMPARE_REGISTER_0);
    	//
    	//idx = (TIMER_A_CAPTURECOMPARE_REGISTER_0>>1) - 1;
    	//BITBAND_PERI(TIMER_A_CMSIS(TIMER_A3_BASE)->CCTL[idx],TIMER_A_CCTLN_CCIFG_OFS) = 0;
    	BITBAND_PERI(TIMER_A_CMSIS(TIMER_A3_BASE)->CCTL[(TIMER_A_CAPTURECOMPARE_REGISTER_0>>1) - 1],TIMER_A_CCTLN_CCIFG_OFS) = 0;
    	//*******************************************************************************************
    
    	// read last conversion
    	_Receive_App.X_Rx[_Receive_App.wRxIndex] = ADC14->MEM[3];
    	_Receive_App.Z_Rx[_Receive_App.wRxIndex] = ADC14->MEM[4];
    	_Receive_App.Y_Rx[_Receive_App.wRxIndex] = ADC14->MEM[5];
    
    	//	Manage index
    	_Receive_App.wRxIndex++;
    	if (_Receive_App.wRxIndex > RX_BUFFER_MAX)
    	{
    		//MAP_Timer_A_stopTimer(TIMER_A3_BASE);
    	    TIMER_A_CMSIS(TIMER_A3_BASE)->CTL &= ~TIMER_A_CTL_MC_3;
    
    	    //MAP_ADC14_clearInterruptFlag(0xFFFFFFFFFFFFFFFF);
    	    ADC14->CLRIFGR0 |= 0xFFFFFFFF;
    	    ADC14->CLRIFGR1 |= 0xFFFFFFFF;
    	}
    	else
    	{
    		//*******************************************************************************************
    		//MAP_ADC14_toggleConversionTrigger();
    		// start next ADC14 sequence
    	    BITBAND_PERI(ADC14->CTL0, ADC14_CTL0_SC_OFS) = 1;
    	}
    
        //*******************************************************************************************
    	//LED1_RED_OFF();
    	//HWREG16((uint32_t)P1 + ((uint32_t)&P1->OUT - (uint32_t)P1)) &= ~LED1_RED_BIT;
    	//*******************************************************************************************
    }
    
    
  • Scott,
    I am going to try to set up a fairly simple example to recreate the issue. What are you using to measure the clocks and/or execution time?

    Thanks,
    Chris
  • I am using a 432 Launchpad. I use the GPIO for LED 1 (On/Off) for measuring execution times.
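As a complement to the LED and oscilloscope method, a cycle counter can give finer resolution when comparing flash and RAM runs. The sketch below only shows the arithmetic and is host-runnable. It assumes the readings would come from the Cortex-M4 DWT cycle counter on the target (in CMSIS terms, `DWT->CYCCNT` after setting `CoreDebug_DEMCR_TRCENA_Msk` and `DWT_CTRL_CYCCNTENA_Msk`), and the helper names `elapsed_cycles` and `cycles_to_ns` are hypothetical.

```c
#include <stdint.h>

#define CPU_HZ 48000000u   /* MCLK = 48 MHz, as configured in this thread */

/* Difference between two readings of a free-running 32-bit counter.
   Unsigned subtraction wraps modulo 2^32, so a single counter
   rollover between the two readings is handled correctly. */
uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert a cycle count to nanoseconds at CPU_HZ.
   The multiply is done in 64 bits to avoid overflow; the result is
   truncated to 32 bits, which limits it to roughly 4.29 seconds. */
uint32_t cycles_to_ns(uint32_t cycles)
{
    return (uint32_t)(((uint64_t)cycles * 1000000000u) / CPU_HZ);
}
```

At 48 MHz, 240 cycles correspond to 5000 ns, which is the shortest ISR interval mentioned in the code comments above, so this gives a rough cycle budget for the sample ISR.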
  • Thanks, Scott. I believe that I have been able to recreate the issue and am still investigating. You may want to check the state of the pre-fetch buffer (FLCTL_BANK0_RDCTL), although I have not been able to reach a satisfactory resolution based on that setting.

    Chris
  • Chris,

    Thank you for the update. Please continue to investigate.

    I am not sure what you want me to look for concerning the pre-fetch buffer.

    Scott

  • I just wanted to confirm that the pre-fetch buffer is not enabled in your code.

    Thanks,
    Chris
  • Could you give a clue as to which section of project properties this is under in CCS7?
  • This setting is in the device's flash control register, not in the IDE.
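One way to check this is to read the register in the debugger or in code and decode the relevant fields. The sketch below decodes a raw register value on the host. The bit positions (instruction buffer enable at bit 4, data buffer enable at bit 5, wait-state field at bits 15:12) are my reading of the MSP432P4xx flash controller register description and should be treated as assumptions to verify against the technical reference manual. The type `rdctl_info` and the function `decode_rdctl` are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

/* Assumed FLCTL_BANKx_RDCTL field positions; verify against the TRM. */
#define RDCTL_BUFI        (1u << 4)            /* instruction fetch buffering */
#define RDCTL_BUFD        (1u << 5)            /* data read buffering         */
#define RDCTL_WAIT_SHIFT  12u
#define RDCTL_WAIT_MASK   (0xFu << RDCTL_WAIT_SHIFT)

typedef struct {
    bool     instr_buffer;   /* true if instruction buffering is enabled */
    bool     data_buffer;    /* true if data buffering is enabled        */
    uint32_t wait_states;    /* configured flash wait states             */
} rdctl_info;

/* Decode a raw read-control register value into its fields. */
rdctl_info decode_rdctl(uint32_t rdctl)
{
    rdctl_info info;

    info.instr_buffer = (rdctl & RDCTL_BUFI) != 0u;
    info.data_buffer  = (rdctl & RDCTL_BUFD) != 0u;
    info.wait_states  = (rdctl & RDCTL_WAIT_MASK) >> RDCTL_WAIT_SHIFT;
    return info;
}
```

On the target, the argument would be the register value itself (for example `FLCTL->BANK0_RDCTL`, assuming the CMSIS-style device header), and the same check would be repeated for bank 1.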
