
MSP430FR5994: Efficient rd/wr large amount of data between FRAM/SRAM

Part Number: MSP430FR5994


Hi,

I have a code segment like below:

void test_func(uint rows, fixed mat_data[], uint mat_idx[],
               uint mptr[], mat *vec, mat *dest) {
    uint i, j;
    for (i = 0; i < rows; i++) {
        for (j = mptr[i]; j < mptr[i + 1]; j++) {
            fixed *ptr1 = &mat_data[j];
            uint  *ptr2 = &mat_idx[j];
            fixed w = F_ADD(MAT_GET(dest, i),
                            F_MUL((signed short)__data20_read_short((unsigned long int)ptr1),         // read from FRAM
                                  MAT_GET(vec, (uint)__data20_read_short((unsigned long int)ptr2))));  // read from FRAM and FRAM2
            MAT_SET(dest, w, i); // write back to FRAM
        }
    }
}

The variables are placed in FRAM/SRAM as follows:

- "dest" : lower FRAM

- "mat_data", "mat_idx" : FRAM2 (above the 0x10000 boundary), hence the need for the __data20_read_short() intrinsic for access

- the rest of the variables are in SRAM

The above section of code works as intended but is quite slow. I suspect this is because of multiple read/writes to/from FRAM/FRAM2.

I read in slaa498b.pdf that the DMA can be used to make data access faster. Would it help in this scenario? Do I need to use assembly code (as given in the document) to get the DMA working, or is there any sample C code I can have a look at?

Also, would it be faster if I first pre-buffered a chunk of data from FRAM to SRAM and worked directly out of SRAM in a micro-batch fashion, or would the copying overhead outweigh working directly out of FRAM?
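To make the micro-batch idea concrete, this is roughly what I have in mind (untested sketch: BATCH and the buf_* staging buffers are placeholders I made up, and the element-by-element staging loop is exactly the part I hope DMA could replace):

#define BATCH 32                      /* placeholder chunk size, must fit in SRAM */

static fixed buf_data[BATCH];         /* SRAM staging buffers */
static uint  buf_idx[BATCH];

void test_func(uint rows, fixed mat_data[], uint mat_idx[],
               uint mptr[], mat *vec, mat *dest) {
    uint i, j, k, n;
    for (i = 0; i < rows; i++) {
        for (j = mptr[i]; j < mptr[i + 1]; j += n) {
            n = mptr[i + 1] - j;
            if (n > BATCH) n = BATCH;
            /* stage one chunk from FRAM2 into SRAM (element by element here;
               this copy is the part a DMA block transfer could take over) */
            for (k = 0; k < n; k++) {
                buf_data[k] = (signed short)__data20_read_short((unsigned long int)&mat_data[j + k]);
                buf_idx[k]  = (uint)__data20_read_short((unsigned long int)&mat_idx[j + k]);
            }
            /* work out of the SRAM copies for this chunk */
            for (k = 0; k < n; k++) {
                fixed w = F_ADD(MAT_GET(dest, i),
                                F_MUL(buf_data[k], MAT_GET(vec, buf_idx[k])));
                MAT_SET(dest, w, i);
            }
        }
    }
}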

As you may have guessed, F_MUL and F_ADD perform multiplication and addition, so it is probably worthwhile moving these operations to the LEA, but the FRAM access bottleneck would still be there.

Are there any other ways I can make the FRAM rd/wr access faster?

Many thanks

Rosh

  • There's a memory-to-memory DMA example in msp430fr599x_dma_01.c (SLAC710).

    I don't recognize what this code is computing, but if it's one of the things the LEA does, the LEA will do it very quickly (and in parallel).

    In my case, I had to segment my problem (didn't fit in LEA RAM) so I was constantly swapping data in and out of LEA RAM. I replaced memcpy() with a DMA-based function and that ran much faster. Your problem is somewhat different, but having such a DMA_memcpy would allow you to move things around in the caller of this function and compare the time for each combination.

    Unsolicited: You appear to have a loop invariant (dest[i]) which the optimizer might or might not be able to recognize. By introducing an explicit accumulator in the inner loop you might save 25% (for roughly zero cost).
  • Thank you Bruce, I'll check out the code you pointed me to, as well as your loop optimization suggestion.
  • Hey Rosh,

    Please let me know if you were able to resolve your question.

    Thanks!

    -Mitch
  • I have had to focus on another project and will be returning to this issue soon. But looking at the size of my data inputs (for matrix multiplication / 2D convolution), I would also need to keep most of the data in FRAM and swap it in and out of SRAM like Bruce did, so I'm not sure there will be much speedup once the swapping overhead is added. Still, as his advice sounds logical, I'm marking this as resolved for the time being. For reference, my reading of his suggestions is sketched below.
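    Applying his accumulator suggestion to my loop would look something like this (untested sketch, same types and macros as my original post):

    void test_func(uint rows, fixed mat_data[], uint mat_idx[],
                   uint mptr[], mat *vec, mat *dest) {
        uint i, j;
        for (i = 0; i < rows; i++) {
            fixed acc = MAT_GET(dest, i);          /* read dest[i] from FRAM once per row */
            for (j = mptr[i]; j < mptr[i + 1]; j++) {
                acc = F_ADD(acc,
                            F_MUL((signed short)__data20_read_short((unsigned long int)&mat_data[j]),
                                  MAT_GET(vec, (uint)__data20_read_short((unsigned long int)&mat_idx[j]))));
            }
            MAT_SET(dest, acc, i);                 /* write dest[i] back to FRAM once per row */
        }
    }

    And my reading of a DMA-based copy, following the register sequence in msp430fr599x_dma_01.c, that could stand in for memcpy() when staging chunks (dma_memcpy_words is just a name I made up; also untested):

    #include <msp430.h>
    #include <stdint.h>

    /* Word-wise memory-to-memory copy on DMA channel 0, software-triggered block transfer. */
    static void dma_memcpy_words(unsigned long dst, unsigned long src, unsigned int nwords)
    {
        DMA0CTL = 0;                                               /* make sure the channel is idle */
        __data20_write_long((uintptr_t)&DMA0SA, src);              /* 20-bit source address */
        __data20_write_long((uintptr_t)&DMA0DA, dst);              /* 20-bit destination address */
        DMA0SZ  = nwords;                                          /* transfer size in 16-bit words */
        DMA0CTL = DMADT_1 | DMASRCINCR_3 | DMADSTINCR_3 | DMAEN;   /* one-shot block, increment src and dst */
        DMA0CTL |= DMAREQ;                                         /* software trigger; CPU is halted until the block completes */
    }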
