Incorrect Real FFT Output on MSP432P401R of BOOSTXL-EDUMKII Edu BoosterPack Microphone Example

Other Parts Discussed in Thread: BOOSTXL-EDUMKII, MSPWARE, MSP430F5638

Is anyone else experiencing FFT computation problems with the Microphone example on the BOOSTXL-EDUMKII Educational BoosterPack for the MSP432P401R LaunchPad, or does anyone know of a problem with the real FFT routines in the current CMSIS library?

 

The CMSIS real FFT routine used in the Microphone example in the BOOSTXL-EDUMKII Educational BoosterPack for the MSP432P401R LaunchPad does not calculate the FFT results correctly.

 

The routine that Texas Instruments uses to generate the frequency spectrum on the 128x128 LCD display is the "arm_rfft_q15()" routine from the current CMSIS library. This routine should take the 512 data points sampled from the microphone on this board and produce the frequency spectrum of the waveform, which is displayed from zero to 2 kHz.
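The zero to 2 kHz span follows from the demo's own constants (SAMPLE_FREQUENCY of 8000 and SAMPLE_LENGTH of 512 in the code below) and the 128 columns drawn on the LCD. Here is a minimal host-side sketch of that arithmetic; the function names are illustrative and are not part of the TI project or CMSIS.

```c
#include <assert.h>
#include <stdint.h>

/* Width of one FFT bin in Hz: sample rate divided by FFT length. */
static double fft_bin_width_hz(double sample_rate_hz, uint32_t fft_size)
{
    assert(fft_size > 0);
    return sample_rate_hz / (double)fft_size;
}

/* Centre frequency of bin k in Hz. */
static double fft_bin_center_hz(double sample_rate_hz, uint32_t fft_size,
                                uint32_t k)
{
    return (double)k * fft_bin_width_hz(sample_rate_hz, fft_size);
}
```

With 8000 samples/second and 512 points, each bin is 15.625 Hz wide, so the 128 displayed bins cover 0 to 2000 Hz and the "1kHz" label at mid-screen corresponds to bin 64.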

 

When tested using forced sine waves that are integer multiples of the sampling frequency, "arm_rfft_q15()" correctly calculates the FFT. However, any frequency that is not an integer multiple of the sampling frequency does not generate a correct FFT.

 

When the "arm_rfft_q15()" routine is replaced with its "q31" equivalent, "arm_rfft_q31()", the FFT is correctly calculated for any input waveform. This version of the real FFT also produces correct output on the LCD display of the BoosterPack.
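For readers unfamiliar with the two formats: Q15 stores a fraction in a 16-bit integer (value = integer / 2^15) and Q31 stores it in a 32-bit integer (value = integer / 2^31). The sketch below is a host-side illustration with hypothetical helper names. The typedefs mirror the CMSIS names, but nothing here is taken from the CMSIS sources. Note that the code posted below simply casts the 16-bit samples to q31_t without rescaling, so this is background on the number formats rather than a description of that code.

```c
#include <assert.h>
#include <stdint.h>

typedef int16_t q15_t;  /* 1 sign bit, 15 fraction bits */
typedef int32_t q31_t;  /* 1 sign bit, 31 fraction bits */

/* Encode a fraction in [-1, 1) as Q15. */
static q15_t q15_from_double(double x)
{
    assert(x >= -1.0 && x < 1.0);
    return (q15_t)(x * 32768.0);
}

/* Decode Q15 and Q31 back to a fraction. */
static double q15_to_double(q15_t v) { return (double)v / 32768.0; }
static double q31_to_double(q31_t v) { return (double)v / 2147483648.0; }

/* Widen Q15 to Q31 while keeping the same represented fraction
 * (multiply by 2^16; multiplication avoids shifting a negative value). */
static q31_t q15_to_q31_scaled(q15_t v) { return (q31_t)v * 65536; }
```

A plain cast such as (q31_t)16384 keeps the integer value but represents a much smaller fraction in Q31 terms than the 0.5 it stood for in Q15. CMSIS provides arm_q15_to_q31() for the rescaled conversion.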

 

It is most likely that the problem with the "arm_rfft_q15()" routine on the MSP432P401R processor is due to an anomaly in the CMSIS real FFT routine and not the M4F processor in the MSP432P401R.

 

The main() code below shows the modifications made to the Texas Instruments Microphone FFT demo (in red) to change from the "q15" version of the real FFT routine to the "q31" version.

 

//****************************************************************************

//

// main.c - MSP-EXP432P401R + Educational Boosterpack MkII - Microphone FFT

//

//         CMSIS DSP Software Library is used to perform 512-point FFT on

//         the audio samples collected with MSP432's ADC14 from the Education

//         Boosterpack's onboard microphone. The resulting frequency bin data

//         is displayed on the BoosterPack's 128x128 LCD.

//

//****************************************************************************

 

#include "msp.h"

#include <driverlib.h>

#include <grlib.h>

#include "Crystalfontz128x128_ST7735.h"

#include <stdio.h>

#include <arm_math.h>

#include "arm_const_structs.h"

 

 

#define TEST_LENGTH_SAMPLES 512

#define SAMPLE_LENGTH 512

 

/* ------------------------------------------------------------------

* Global variables for FFT Bin Example

* ------------------------------------------------------------------- */

uint16_t fftSize = SAMPLE_LENGTH;

uint8_t ifftFlag = 0;

uint8_t doBitReverse = 1;

volatile arm_status status;

 

/* Graphic library context */

Graphics_Context g_sContext;

 

#define SMCLK_FREQUENCY     48000000

#define SAMPLE_FREQUENCY   8000

 

/* DMA Control Table */

#ifdef ewarm

#pragma data_alignment=256

#else

#pragma DATA_ALIGN(controlTable, 256)

#endif

uint8_t controlTable[256];

 

/* FFT data/processing buffers*/

float hann[SAMPLE_LENGTH];

int16_t data_array1[SAMPLE_LENGTH];

int16_t data_array2[SAMPLE_LENGTH];

int16_t data_input[SAMPLE_LENGTH*2];

int16_t data_output[SAMPLE_LENGTH];

 

// q31 variables to fix q15 FFT problem

q31_t Buffer_In1_Real[SAMPLE_LENGTH];

q31_t Buffer_In1_Real_Copy[SAMPLE_LENGTH];

q31_t Buffer_In2_Real[SAMPLE_LENGTH];

q31_t Buffer_Out_Complex[SAMPLE_LENGTH * 2];

q31_t Buffer_Magnitude_Real[SAMPLE_LENGTH];

 

arm_rfft_instance_q31 Instance_Real;

 

volatile int switch_data = 0;

 

/* Timer_A PWM Configuration Parameter */

Timer_A_PWMConfig pwmConfig =

{

       TIMER_A_CLOCKSOURCE_SMCLK,

       TIMER_A_CLOCKSOURCE_DIVIDER_1,

       (SMCLK_FREQUENCY/SAMPLE_FREQUENCY),

       TIMER_A_CAPTURECOMPARE_REGISTER_1,

       TIMER_A_OUTPUTMODE_SET_RESET,

       (SMCLK_FREQUENCY/SAMPLE_FREQUENCY)/2

};

 

void main(void)

{

   /* Halting WDT and disabling master interrupts */

   MAP_WDT_A_holdTimer();

   MAP_Interrupt_disableMaster();

 

   /* Initializes Clock System */

   MAP_CS_setDCOCenteredFrequency(CS_DCO_FREQUENCY_48);

   MAP_CS_initClockSignal(CS_MCLK, CS_DCOCLK_SELECT, CS_CLOCK_DIVIDER_1 );

   MAP_CS_initClockSignal(CS_HSMCLK, CS_DCOCLK_SELECT, CS_CLOCK_DIVIDER_1 );

   MAP_CS_initClockSignal(CS_SMCLK, CS_DCOCLK_SELECT, CS_CLOCK_DIVIDER_1 );

   MAP_CS_initClockSignal(CS_ACLK, CS_REFOCLK_SELECT, CS_CLOCK_DIVIDER_1);

 

   /* Initializes display */

   Crystalfontz128x128_Init();

 

   /* Set default screen orientation */

   Crystalfontz128x128_SetOrientation(LCD_ORIENTATION_UP);

 

   /* Initializes graphics context */

   Graphics_initContext(&g_sContext, &g_sCrystalfontz128x128);

   Graphics_setForegroundColor(&g_sContext, GRAPHICS_COLOR_RED);

   Graphics_setBackgroundColor(&g_sContext, GRAPHICS_COLOR_WHITE);

   GrContextFontSet(&g_sContext, &g_sFontFixed6x8);

   Graphics_clearDisplay(&g_sContext);

   Graphics_drawStringCentered(&g_sContext,

                                   "512-Point FFT",

                                   AUTO_STRING_LENGTH,

                                   64,

                                    6,

                                   OPAQUE_TEXT);

   Graphics_drawStringCentered(&g_sContext,

                                   "1kHz",

                                   AUTO_STRING_LENGTH,

                                   64,

                                    122,

                                   OPAQUE_TEXT);

   Graphics_drawStringCentered(&g_sContext,

                                   "0",

                                   AUTO_STRING_LENGTH,

                                  6,

                                   122,

                                   OPAQUE_TEXT);

   Graphics_drawStringCentered(&g_sContext,

                                   "2",

                                   AUTO_STRING_LENGTH,

                                   122,

                                   122,

                                   OPAQUE_TEXT);

 

   // Initialize Hann Window

   int n;

   for (n = 0; n < SAMPLE_LENGTH; n++)

   {

       hann[n] = 0.5 - 0.5 * cosf((2*PI*n)/(SAMPLE_LENGTH-1));

   }

 

   // Configuring Timer_A to have a period of SMCLK_FREQUENCY/SAMPLE_FREQUENCY

   // = 6000 ticks (125 us, i.e. 8 kHz) and a duty cycle of 50% (3000 ticks)

   Timer_A_generatePWM(TIMER_A0_MODULE, &pwmConfig);

 

   // Initializing ADC (MCLK/1/1)

   ADC14_enableModule();

   ADC14_initModule(ADC_CLOCKSOURCE_MCLK, ADC_PREDIVIDER_1, ADC_DIVIDER_1, 0);

 

   ADC14_setSampleHoldTrigger(ADC_TRIGGER_SOURCE1, false);

 

   // Configuring GPIOs (4.3 A10)

   GPIO_setAsPeripheralModuleFunctionInputPin(GPIO_PORT_P4, GPIO_PIN3,

   GPIO_TERTIARY_MODULE_FUNCTION);

 

   // Configuring ADC Memory

   ADC14_configureSingleSampleMode(ADC_MEM0, true);

   ADC14_configureConversionMemory(ADC_MEM0, ADC_VREFPOS_AVCC_VREFNEG_VSS, ADC_INPUT_A10, false);

 

   // Set ADC result format to signed binary

   ADC14_setResultFormat(ADC_SIGNED_BINARY);

 

   // Configuring DMA module

   DMA_enableModule();

   DMA_setControlBase(controlTable);

 

 

   DMA_disableChannelAttribute(DMA_CH7_ADC12C,

                                 UDMA_ATTR_ALTSELECT | UDMA_ATTR_USEBURST |

                                 UDMA_ATTR_HIGH_PRIORITY |

                                 UDMA_ATTR_REQMASK);

 

 

   // Setting Control Indexes. In this case we will set the source of the

   // DMA transfer to ADC14 Memory 0

   // and the destination to the

   // destination data array.

   MAP_DMA_setChannelControl(UDMA_PRI_SELECT | DMA_CH7_ADC12C,

       UDMA_SIZE_16 | UDMA_SRC_INC_NONE | UDMA_DST_INC_16 | UDMA_ARB_1);

   MAP_DMA_setChannelTransfer(UDMA_PRI_SELECT | DMA_CH7_ADC12C,

       UDMA_MODE_PINGPONG, (void*) (ADC14_BASE + OFS_ADC14MEM0),

       data_array1, SAMPLE_LENGTH);

 

   MAP_DMA_setChannelControl(UDMA_ALT_SELECT | DMA_CH7_ADC12C,

       UDMA_SIZE_16 | UDMA_SRC_INC_NONE | UDMA_DST_INC_16 | UDMA_ARB_1);

   MAP_DMA_setChannelTransfer(UDMA_ALT_SELECT | DMA_CH7_ADC12C,

       UDMA_MODE_PINGPONG, (void*) (ADC14_BASE + OFS_ADC14MEM0),

       data_array2, SAMPLE_LENGTH);

 

   // Assigning/Enabling Interrupts

   MAP_DMA_assignInterrupt(DMA_INT1, 7);

   MAP_Interrupt_enableInterrupt(INT_DMA_INT1);

   MAP_DMA_assignChannel(DMA_CH7_ADC12C);

   MAP_DMA_clearInterruptFlag(7);

   MAP_Interrupt_enableMaster();

 

   // Now that the DMA is primed and setup, enabling the channels. The ADC14

   // hardware should take over and transfer/receive all bytes

   MAP_DMA_enableChannel(7);

   MAP_ADC14_enableConversion();

 

   while(1)

   {

       MAP_PCM_gotoLPM0();

 

       int i = 0;

 

/*

       for (i=0; i<SAMPLE_LENGTH; i++)

       {

         // Generate a 1000 Hz sine wave with DC offset.

         // Sample rate is 8000 samples/second

         Buffer_In1_Real[i] = (q31_t)( 1000.0 * sinf( 6.283185308 *

                              (float)i * 1000.0 / 8000.0 ) + 0X000 );

 

         // Save a copy of Buffer_In1_Real[] since it will be manipulated by the FFT

         Buffer_In2_Real[i] = Buffer_In1_Real[i];

       }

*/

 

       // Copy TI's data arrays to the new FFT working buffers

       for (i = 0; i < SAMPLE_LENGTH; i++)

       {

         Buffer_In1_Real[i] = (q31_t)data_array1[i];

         Buffer_In2_Real[i] = (q31_t)data_array2[i];

       }

 

       //Compute real FFT using the completed data buffer

       if (switch_data & 1)

      {

           arm_rfft_instance_q31 Instance_Real;

           status = arm_rfft_init_q31(&Instance_Real, fftSize, ifftFlag, doBitReverse);

           arm_rfft_q31(&Instance_Real, Buffer_In1_Real, Buffer_Out_Complex);

       }

       else

       {

           arm_rfft_instance_q31 Instance_Real;

           status = arm_rfft_init_q31(&Instance_Real, fftSize, ifftFlag, doBitReverse);

           arm_rfft_q31(&Instance_Real, Buffer_In2_Real, Buffer_Out_Complex);

       }

 

       // The following ARM function calculates the real magnitudes of the complex

       // data output from the RFFT but does not work correctly.

       //arm_cmplx_mag_q31(Buffer_Out_Complex, Buffer_Magnitude_Real, (uint32_t)SAMPLE_LENGTH);

 

 

       // Calculate magnitude of FFT complex output

       for (i = 0; i < SAMPLE_LENGTH * 2; i+=2)

       {

           // Convert to float before squaring so the products cannot overflow 32-bit integers

           float re = (float)Buffer_Out_Complex[i];

           float im = (float)Buffer_Out_Complex[i+1];

           Buffer_Magnitude_Real[i/2] = (q31_t)sqrtf(re * re + im * im);

       }

 

       // Draw frequency bin graph

       for (i = 0; i < 128; i++)

       {

               // Clip the magnitude so that it doesn't go off the display

           int x = min (100, (int)(Buffer_Magnitude_Real[i] / 10));

 

           Graphics_setForegroundColor(&g_sContext, GRAPHICS_COLOR_WHITE);

           Graphics_drawLineV(&g_sContext, i, 115-x, 15);

           Graphics_setForegroundColor(&g_sContext, GRAPHICS_COLOR_RED);

           Graphics_drawLineV(&g_sContext, i, 115, 115 - x);

       }

 

   } // end of while(1)

 

} // end of main()

 

 

/* Completion interrupt for ADC14 MEM0 */

void DMA_1_ISR(void)

{

   /* Switch between primary and alternate buffers with DMA's PingPong mode */

   if (DMA_getChannelAttribute(7) & UDMA_ATTR_ALTSELECT)

   {

       DMA_setChannelControl(UDMA_PRI_SELECT | DMA_CH7_ADC12C,

           UDMA_SIZE_16 | UDMA_SRC_INC_NONE | UDMA_DST_INC_16 | UDMA_ARB_1);

       DMA_setChannelTransfer(UDMA_PRI_SELECT | DMA_CH7_ADC12C,

           UDMA_MODE_PINGPONG, (void*) (ADC14_BASE + OFS_ADC14MEM0),

           data_array1, SAMPLE_LENGTH);

     switch_data = 1;

   }

   else

   {

       DMA_setChannelControl(UDMA_ALT_SELECT | DMA_CH7_ADC12C,

           UDMA_SIZE_16 | UDMA_SRC_INC_NONE | UDMA_DST_INC_16 | UDMA_ARB_1);

       DMA_setChannelTransfer(UDMA_ALT_SELECT | DMA_CH7_ADC12C,

           UDMA_MODE_PINGPONG, (void*) (ADC14_BASE + OFS_ADC14MEM0),

           data_array2, SAMPLE_LENGTH);

       switch_data = 0;

   }

}

 

  • Hi Ronald,

    Thank you very much for reporting your findings. I've taken a look at your code and can confirm your observations, both the incorrect arm_rfft_q15 output and the correct arm_rfft_q31 output.

    Searching the E2E led me to this post:
    https://e2e.ti.com/support/microcontrollers/tiva_arm/f/908/t/399448

    It appears the current "cmsis_ccs.h" included in CCS and used in this demo project still contains the incorrect mapping of the ARM Cortex M4 instruction "Signed Dual Multiply Subtract Reversed". I rebuilt the dsp library with the fix in cmsis_ccs.h, and Q15 FFT worked as expected.
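As background on the instruction named above: "Signed Dual Multiply Subtract Reversed" is the Cortex-M4 SMUSDX instruction, which CMSIS exposes as the __SMUSDX intrinsic. The sketch below is a plain-C model of what SMUSDX computes, with the non-reversed SMUSD shown for contrast. It is an illustration only: the helper names are hypothetical, and it does not reproduce the contents of cmsis_ccs.h or the specific incorrect mapping, which this thread does not show.

```c
#include <assert.h>
#include <stdint.h>

/* Signed low and high halfwords of a packed 32-bit word
 * (assumes the usual two's-complement narrowing conversion). */
static int16_t lo16(uint32_t w) { return (int16_t)(w & 0xFFFFu); }
static int16_t hi16(uint32_t w) { return (int16_t)(w >> 16); }

/* Pack two signed 16-bit values into one 32-bit word. */
static uint32_t pack16(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint16_t)lo;
}

/* SMUSD model: lo(x)*lo(y) - hi(x)*hi(y) */
static int32_t smusd_model(uint32_t x, uint32_t y)
{
    return (int32_t)lo16(x) * lo16(y) - (int32_t)hi16(x) * hi16(y);
}

/* SMUSDX model: halfwords of the second operand are exchanged first,
 * giving lo(x)*hi(y) - hi(x)*lo(y) */
static int32_t smusdx_model(uint32_t x, uint32_t y)
{
    return (int32_t)lo16(x) * hi16(y) - (int32_t)hi16(x) * lo16(y);
}

/* Small self-check: the two variants give different results. */
static void smusdx_selfcheck(void)
{
    uint32_t x = pack16(2, 3);
    uint32_t y = pack16(5, 7);
    assert(smusdx_model(x, y) == 3 * 5 - 2 * 7);
    assert(smusd_model(x, y) == 3 * 7 - 2 * 5);
}
```

Because the Q15 FFT code packs pairs of 16-bit values into 32-bit words, an intrinsic that computes anything other than the expression above would plausibly corrupt its complex multiplies, while the q31 code path would not depend on this intrinsic in the same way.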

    Attached are the new dsplib-msp432.lib and cmsis_ccs.h that you can drop into the current project using Q15 to fix the issue.
    dsplib-msp432.lib
    cmsis_ccs.h

    I've filed an internal bug for cmsis_ccs.h to be fixed in CCS and also for the BOOSTXL-EDUMKII Microphone FFT Example to be fixed in the next package update.

    Thanks again for your feedback and sorry for the inconveniences this caused.

    Best Regards,

    Eric C

  • Eric:

    Thanks for finding the CMSIS "q15" real FFT fix for the Microphone example.  The new DSP library works well.

    Please keep in mind that those of us who use libraries and functions developed by others do so to speed up our development, and they may not do that if the code is faulty, as in this case.  Too bad we don't have the time to test our code before it is used by others.  This has not been a good experience for me but luckily, I am not in a design phase at this point in time.

    Sincerely,

    Ronald S. Lisiecki, M.D.

  • Dr. Lisiecki,

    Thanks so much for finding this bug. I built the application this morning, and using a frequency generator on my phone the results were nonsensical until I found this thread and updated the lib and header file per Eric's attachments.

    Almost a month later and the package had still not been updated with the corrected files :(

    Much appreciated!

    R.
  • Rando:

    Glad that you were able to find the thread to fix this problem.  This is what this E2E Community is all about.  I'm very thankful that TI responded as quickly as they did.

    Just a few warnings about this new MSP432 processor and the CMSIS DSP routines that are used in the demo.  There appears to be a bottleneck in the bus from the ARM M4 processor to TI's peripherals and RAM so that you will probably find the processing speed of this chip not that impressive even at 48 MHz.  Other Cortex M4 processors from Freescale and Atmel seem to process 6-8 times faster although their processor speeds are 120 MHz.  However, the MSP432 is probably the lowest power processor in this ARM class.

    Also, if you ever get into any of the CMSIS routines for MSP432 that are used for DSP applications, be aware that CMSIS has changed a lot of these and that the older versions of the routines are no longer recommended.  It's real easy to bump into these older routines.

    Be careful out there!

    Ronald S. Lisiecki, M.D.

  • Thanks for posting this!

    I'm still getting up to speed on all things MSP432, and I just wish I had found this before spending extra hours discovering the same issues.

    After getting some help from TI, I discovered that there is a new release of MSPWare (MSPWare_2_30_00_49) that includes the fix to the issue you identified.

  • David:

    I'm glad that you found this MSP432 fix before having to spend a lot of time trying to troubleshoot the problem.  TI was very responsive to my initial discovery of this problem.

    Good luck with experiments using the new MSP432.  It's a good 32-bit processor but as I have warned others, there appear to be some bottlenecks between the processor and the TI peripherals.  So, do some performance testing to make sure that this processor has the horsepower that you will need in your application(s).

    Ronald S. Lisiecki, M.D.

  • Hi Ronald,

    Could you share some more details on your "bottlenecks" concern?
  • LPRF:

    Here are some of the tests that I made with the MSP432P401R processor and comparisons to the Atmel SAM4E16E.  These processors both have the same Cortex M4 core processor.  The MSP432 runs at 48 MHz and the SAM4E16E runs at 120 MHz.  A little unfair comparing apples and oranges here but I think that we are really comparing Fuji apples to McIntosh apples.  You draw your own conclusions and/or make your own tests.

    MSP432P401R @ 48 MHz:

    Direct toggle of a port line (P5.6) low-high-low = 350 ns

    TI library toggle of a port line (P5.6) low-high-low = 1400 ns (lots of overhead if you use the TI library routines due to multiple subroutine calls)

    Floating point add with direct number (not variables) = 320 ns

    Floating point add with variables = 850 ns

    512 point real FFT = 3.6 ms

    To me, these are kind of slow processor times considering that the MSP432P401 is working at 48 MHz.  I would have expected most of these numbers to be on the order of 50-100 ns.  The FFT is exceptionally slow.

    Now, compare (if you can) the above times to the Atmel SAM4E16E @ 120 MHz

    Direct toggle of a port line, low-high-low = 50 ns

    Floating point add with direct number (not variables) = 120 ns

    Complex 256 point FFT = 180 us

    Note: A complex 256 point FFT on a TRUE DSP processor like the Blackfin BF533 takes 4-10 us.

    By the above comparisons, I just think that there is some type of memory and/or peripheral bottleneck that exists in the MSP432P401 that makes its overall processing speed much slower than other Cortex M4 processors.  Even if you scale the above numbers for the higher clock speed of the Atmel SAM4E16E processors, it still shows that the MSP432 is slower.
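One way to do the clock-speed scaling mentioned above is to convert each measured time into CPU clock cycles. The sketch below is a host-side helper with a hypothetical name; the input figures are the ones quoted in this post.

```c
#include <assert.h>

/* Convert a measured duration in nanoseconds into clock cycles
 * at the given clock frequency in Hz. */
static double ns_to_cycles(double time_ns, double clock_hz)
{
    assert(clock_hz > 0.0);
    return time_ns * (clock_hz / 1e9);
}
```

Using the numbers above, the 350 ns direct toggle at 48 MHz is about 16.8 cycles, while the 50 ns toggle at 120 MHz is about 6 cycles. The 3.6 ms FFT at 48 MHz is about 172,800 cycles and the 180 us FFT at 120 MHz is about 21,600 cycles, bearing in mind that the two FFTs are not the same size or type.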

    I still have a lot of respect for the new MSP432 processor but I would just be careful of how you use it, particularly in applications where you might need some fast processing.  Just because you have a Cortex M4 processor with an on-board floating point unit and pseudo-DSP instructions doesn't mean that it's a number cruncher.

    Also, I sure wish that they would have put their on-chip USB system on this first MSP432.  That would have made me much more interested in using this processor in future designs.

    Good luck,

    Ronald S. Lisiecki, M.D.

  • Hi Ronald,
    Assuming you're running code from flash memory, have you set the required number of flash wait states and tried enabling the instruction read buffer?

    The wait-state setting defaults to 3 cycles on the current silicon revision, but 2 cycles is sufficient for 48MHz operation. The saved cycle at 48MHz probably won't make a huge improvement, but it's worth knowing about for lower clock speeds where zero wait-state operation is possible.

    The instruction read buffer will read 128 bits at a time from flash, and the wait state will only occur when crossing into a new 128 bit block. That might help with the FFT, since there shouldn't be any problems with peripheral access bottlenecks there.

    EDIT: Also, would you mind sharing the code used for the "Floating point add with direct number (not variables)" and "Floating point add with variables" tests?

  • Robert:

    I checked that the flash banks are both set at 3 wait states each as they should be.

    I also set the instruction read buffering to on for both banks.  (TI is missing the BUFD and BUFI defines for this register)

    Here's the speeds that I measure on the MSP432P401R:

    Direct I/O bit toggle (P5.6) = 290 ns

    I/O bit toggle (P5.6) using TI library functions = 1100 ns

    Direct floating point addition = 540 ns

    Floating point addition using variables = 660 ns

    512 point real FFT = 1.8 ms

    Not much improvement.

    Here's the bad news.  If I run the same bit toggle test and floating point add on an MSP430F5638 processor at 24 MHz, then here's what I measure:

    Direct I/O bit toggle (P4.0) = 160 ns

    Direct floating point addition = 500 ns

    You can see that the 16 bit MSP430F5638 processor at 24 MHz is faster than the 32 bit Cortex M4 processor MSP432P401R at 48 MHz.  The 16 bit MSP430F5638 processor doesn't have a floating point unit.

    I still think that this speed discrepancy here is probably due to bottlenecks out of the Cortex M4 to the memories and peripherals.

    Sincerely,

    Ronald S. Lisiecki, M.D.

  • Robert:

    I forgot to send you the code that I used for the speed tests. Here it is.

    // Set the instruction read buffering to on in bank0 and bank1 flash memories
    FLCTL_BANK0_RDCTL |= 0X30;
    FLCTL_BANK1_RDCTL |= 0X30;

    // Set P5.6 low-high-low
    P5OUT &= ~(0X40);
    P5OUT |= 0X40;
    P5OUT &= ~(0X040);

    // Test the speed of toggling an I/O pin using the TI library functions
    P5OUT |= 0X40;
    GPIO_setOutputLowOnPin (GPIO_PORT_P5, GPIO_PIN6);
    GPIO_setOutputHighOnPin (GPIO_PORT_P5, GPIO_PIN6);
    GPIO_setOutputLowOnPin (GPIO_PORT_P5, GPIO_PIN6);
    P5OUT &= ~(0X040);

    // Test the speed of a floating point load operation
    P5OUT |= 0X40;
    fTest1 = 1.234;
    P5OUT &= ~(0X040);

    // Test the speed of another floating point operation
    P5OUT |= 0X40;
    fTest1 = 1.234 + 7.89;
    P5OUT &= ~(0X040);

    // Test the speed of another floating point operation
    P5OUT |= 0X40;
    fTest1 = 1.234 + 7.89E10;
    P5OUT &= ~(0X040);

    // Test the speed of another floating point operation
    fTest1 = 1.234;
    fTest2 = 99.4567;
    P5OUT |= 0X40;
    fTest1 = fTest1 + fTest2;
    P5OUT &= ~(0X040);

    Here's the 512 point real FFT code:

    //Compute real FFT using the completed data buffer
    if (switch_data & 1)
    {
    // Test the speed of the real 512 point FFT
    P5OUT |= 0X40;
    arm_rfft_instance_q31 Instance_Real;
    status = arm_rfft_init_q31(&Instance_Real, fftSize, ifftFlag, doBitReverse);
    arm_rfft_q31(&Instance_Real, Buffer_In1_Real, Buffer_Out_Complex);
    P5OUT &= ~(0X040);
    }

    Sincerely,

    Ronald S. Lisiecki, M.D.
  • Ronald Lisiecki said:
    I checked that the flash banks are both set at 3 wait states each as they should be.

    Sorry, I should have been clearer in my last post. The default wait state setting is 3 cycles, but that is higher than it ever needs to be. The correct setting for 48MHz is 2 wait states.

    EDIT: Thanks for the test code, I'll take a look at that later.

    By the way, what are the definitions of the fTest1/fTest2 variables used in the example code?

  • Robert:

    I ran my code again for the MSP432P401R using flash memory wait states = 2 for both flash banks.  I also had the instruction/data read buffering enabled.

    This resulted in about a 15-20% decrease in execution times for the past numbers that I gave you.  Some improvement.

    However, with wait states set = 2, the debugger became unstable.  I could not restart my code after pausing the debugger.

    fTest1 and fTest2 were both float (32 bit word).

    Good luck,

    Ronald S. Lisiecki, M.D.

  • Ronald Lisiecki said:
    However, with wait states set = 2, the debugger became unstable.  I could not restart my code after pausing the debugger.

    Hmm, it works fine for me. Are you setting VCORE to 1? The default is VCORE=0, and the datasheet specifies that can only be used up to 24MHz.

    Regarding the test code, I'm not having much luck reproducing your results accurately. What optimization level are you using in the compiler?

    Also, is the "Direct toggle of a port line (P5.6) low-high-low" timing measured from the first falling edge on P5.6 to the second? In other words, is the measurement covering a complete cycle of square wave output?

    Finally, do the timings for the floating point operations have the IO operation overhead subtracted from them to give just the time for the calculations?

  • Robert:
    I rechecked my measurements. Although TI had set Vcore=0 @ 48 MHz, changing it to 1 made no difference.
    Here's some of the same numbers as before, but I did a little rounding to keep the math simple and noted the inclusion of the bit toggle time where applicable.

    MSP432P401R @ 48 MHz, Vcore=1, Flash wait states both = 2, instr/data read buffering = on
    Direct toggle of P5.6 = 300 ns (observed pulse on oscilloscope)
    TI library toggle of P5.6 = 900 ns (observed pulse on oscilloscope)
    Floating point add of direct numbers; toggle of P5.6 = 600 ns (observed pulse on oscilloscope; subtract 300 ns for P5.6 toggle time)
    Floating point add of two floating point variables (32 bit) and assign to 3rd floating point variable; toggle of P5.6 = 800 ns (observed pulse on oscilloscope; subtract 300 ns for P5.6 toggle)
    512 point FFT; toggle of P5.6 = 1.6 milliseconds

    So, even with the Vcore, wait-state and instr/data buffering changes, the numbers still seem to show some type of bottleneck of the Cortex M4.

    Now, the surprising thing still is if I double check the same benchmarks on the 24 MHz MSP430F5638, a 16-bit machine, then things look faster except for some of the floating point operations. This is due to the 16-bit busses and lack of a floating point unit:
    Direct toggle of P4.0 = 160 ns
    Floating point add of direct numbers; toggle of P4.0 = 500 ns (observed pulse on oscilloscope; subtract 160 ns for P4.0 toggle time)
    Floating point add of two floating point variables (32 bit) and assign to 3rd floating point variable; toggle of P4.0 = 15,000 ns

    Why is the 24 MHz , 16-bit processor still faster with most of these operations? Of course the last floating point add takes a long time because three, 32 bit variables need to get moved in/out of memory and there is no floating point unit on this processor.

    I have optimization turned completely off. I don't think that the optimizer would modify any of the above code operations and speed them up.

    Regarding the toggle operations, it should be understood that the toggle time is due to just two lines of code:
    P5OUT = P5OUT | 0X40; // P5.6 goes high = delay of bitset operation and eventual rising edge of pulse
    P5OUT = P5OUT & ~0X40; // P5.6 goes low = top of pulse due to delay of bitclr operation and eventual falling edge of pulse

    I believe that I addressed your question above about whether the floating point timings have the bit toggle time removed from the total pulse time.

    I still believe that this MSP432P401R processor is a tad bit slow for a Cortex M4 processor at 48 MHz. I hope that someone finds something wrong that I am doing that can show improved performance. I see a much higher performance on the Atmel Cortex M4 processors with similar tests.

    Sincerely,

    Ronald S. Lisiecki, M.D.
  • Ronald Lisiecki said:
    I rechecked my measurements. Although TI had set Vcore=0 @ 48 MHz, changing it to 1 made no difference.

    It made no difference to the speed, or no difference to the instability? I'd expect the former to be true, but if you're still getting stability problems with VCORE=1 and 2 wait states at 48MHz then there might be a problem that needs investigating further.

    Ronald Lisiecki said:
    Now, the surprising thing still is if I double check the same benchmarks on the 24 MHz MSP430F5638, a 16-bit machine, then things look faster except for some of the floating point operations [...] Why is the 24 MHz , 16-bit processor still faster with most of these operations?

    I don't find it particularly surprising that the MSP430F5638 can bit-bang IO pins faster than the MSP432.

    Although the MSP432's CPU and DMA can run at 48MHz, its peripheral modules are limited to 24MHz. That neutralises the benefit of the higher clock speed in this scenario. The memory-mapped GPIO registers on both MCUs are 16-bit, so the 32-bit support in the MSP432 gives no advantage. Finally, the ARM architecture is a pure load/store design. That means the GPIO registers can't be modified in place, but must be loaded into a register, changed and then stored back. Each modification requires 3 instructions minimum, which takes at least 3 cycles plus any flash wait states required.

    The MSP430 architecture allows in-place modifications, so a BIC or BIS instruction will do the same job in 4 or 5 cycles.
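The load/modify/store sequence described above can be written out explicitly in C. This is a host-side sketch only: fake_port_out is a hypothetical stand-in for a memory-mapped register such as P5OUT, and the helper names are illustrative. On a load/store architecture, a statement like P5OUT |= 0x40 is compiled into the same three steps.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a 16-bit memory-mapped port output register. */
static volatile uint16_t fake_port_out;

/* Set bits: the register cannot be changed in place, so it is
 * loaded, modified in a CPU register, and stored back. */
static void set_bits_rmw(volatile uint16_t *reg, uint16_t mask)
{
    uint16_t value = *reg;           /* load   */
    value = (uint16_t)(value | mask); /* modify */
    *reg = value;                    /* store  */
}

/* Clear bits using the same three steps. */
static void clear_bits_rmw(volatile uint16_t *reg, uint16_t mask)
{
    uint16_t value = *reg;                        /* load   */
    value = (uint16_t)(value & (uint16_t)~mask);   /* modify */
    *reg = value;                                 /* store  */
}

/* Small self-check mirroring the low-high-low toggle test. */
static void rmw_selfcheck(void)
{
    fake_port_out = 0;
    set_bits_rmw(&fake_port_out, 0x40);
    assert(fake_port_out == 0x40);
    clear_bits_rmw(&fake_port_out, 0x40);
    assert(fake_port_out == 0);
}
```

Each call costs at least one load and one store to the peripheral bus, which is why the speed of that bus, rather than the CPU clock alone, bounds the toggle rate in this kind of test.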

    Regarding the Atmel SAM results, the 50ns toggle time works out at just 6 cycles. That means the GPIO module and associated bus must be clocked at 120MHz and the code is running out of the 2 KB cache with no wait states. In terms of specification that's closer to TI's TM4C chips than the MSP432.

    By the way, optimisation does have an effect in this case. I find that optimisation level 0 causes the compiler to reload the GPIO port address for every modification. With optimisation level 2 it keeps the address in a register and reuses the value, saving an instruction on each modification. If you do this when benchmarking you do need to take care that the compiler doesn't optimise out the thing you're trying to measure, however.

  • Robert:

    Thanks for the insight into all of these processors.

    I guess that one needs to be real careful with any one processor and make sure that its configuration is correct when doing any performance testing.  This is why I make a few of these tests on new processors so that I don't end up with a throughput problem down the road.  I like all of these processors.  They sure beat the throughput of all the old guys like the 6502 and 8080.

    Thanks again,

    Ronald S. Lisiecki, M.D.

  • Have you tried moving the functions you want to run faster into SRAM?
     (No, I haven't looked for any examples of this yet, but if you have an application where you can trade some SRAM for speed, it might help.)

    http://www.ti.com/lit/an/slaa668/slaa668.pdf


    ..." SRAMs are standard read-write memories on MCUs. Particularly on MSP432P401xx, SRAMs work at the same speed as the CPU clock frequency so code execution from SRAM clearly gives a performance and throughput advantage compared to code execution from flash. There is a power consumption advantage for SRAM compared to flash as well. Power consumption (commonly measured in µA/MHz) for SRAM is significantly lower than the µA/MHz consumption of flash memory. Therefore, TI recommends executing small loops or functions that are frequently used from SRAM instead of flash for performance and power benefits."

  • David:

    No, I currently don't have an application where I need to put code into SRAM for performance reasons.  But, it is a good idea for possible future applications.

    I guess that my issue in all the above conversations is why this new MSP432 processor, with a more sophisticated Cortex M4 32-bit processor at 48 MHz, has a lower performance (mostly in I/O) than the 16-bit MSP430 processor at 24 MHz that is executing similar code, both out of flash memory.  I would have thought that the access times to the flash memory would be comparable for the two processors and that the MSP432 would have been significantly faster in my benchmarks.  This doesn't seem to be the case which is why I have been discussing possible bottleneck memory issues in this new MSP432 processor.

    Sincerely,

    Ronald S. Lisiecki, M.D.
