TMS320F28388D: Move memcpy to RAM - measure execution time

inno

Part Number: TMS320F28388D
Other Parts Discussed in Thread: TMDSCNCD28388D, C2000WARE

Hello

I wrote a function called HandleCmToCpu1IpcRequests inside my CPU1 project, which copies data from the CM->CPU1 message RAM into a structure in the following way:

memcpy(&s_CanTelegramReceivedFromMaster, (const void *)ipcAddr, (size_t)sizeof(s_CanTelegramReceivedFromMaster));

In the near future I have to run the function HandleCmToCpu1IpcRequests, and all functions it calls, from RAM. This is required because HandleCmToCpu1IpcRequests may be called from an interrupt while the flash of CPU1 is in the middle of erasing a sector. Therefore, I first tried to move the memcpy function itself to RAM via the linker command file, e.g. for the .TI.ramfunc section:

--library=rts2800_fpu64_eabi.lib<memcpy.c.obj>(.text)

But this is obviously not a good idea, since I received a linker warning and the firmware didn't work anymore:

"../2838x_FLASH_lnk_cpu1.cmd", line 92: warning #10068-D: no matching section

warning #10278-D: LOAD placement specified for section ".text:rts2800_fpu64_eabi.lib<memcpy.c.obj>". This section contains decompression routines required for linker generated copy tables and C/C++ auto-initialization. Must ensure that this section is copied to run address before the C/C++ boot code is executed or is placed with single allocation specifier (ex. "> MEMORY").

Afterwards, I copied the memcpy function from C:\ti\ccs1040\ccs\tools\compiler\ti-cgt-c2000_20.2.4.LTS\lib\src\memcpy.c to my project, renamed that function into innoMemcpy and moved it to the RAM:

I then measured the execution time of the HandleCmToCpu1IpcRequests function via CPU timer 1. No matter which additional pragma I use for innoMemcpy, such as --opt_for_speed or --opt_level=2 (see above picture), the innoMemcpy function is always twice as slow as the memcpy of the rts2800_fpu64_eabi.lib library. So in the following picture, the first line is always faster than the second line by a factor of 2.

Is there any idea what could cause the difference in the execution time? How can I make the function innoMemcpy become as fast as memcpy?

Thanks,

Inno

over 3 years ago

0 George Mock over 3 years ago

TI__Guru**** 251490 points

The memcpy function cannot be part of the output section .TI.ramfunc. That is because it is used to copy .TI.ramfunc from flash to RAM. It must be part of a different output section. The output section which contains memcpy can only be allocated to one memory range. It is loaded into this memory range, and it runs from there. You probably want the output section which contains memcpy to be allocated to the fastest part of memory.

Please let me know if these suggestions resolve the problem.

Thanks and regards,

-George

0 inno over 3 years ago in reply to George Mock

Expert 1700 points

Hello George,

I created my own function called innoMemcpy, which is the same function as memcpy but just renamed, because of what you wrote above (memcpy is used for copying .TI.ramfunc). My question is a different one:

Why is the function memcpy (which resides in flash) faster than my innoMemcpy function (which resides in RAM), no matter which pragma I use in the 2nd picture that I posted above? I am talking about:

#pragma FUNCTION_OPTIONS(innoMemcpy,"--opt_for_speed"); or #pragma FUNCTION_OPTIONS(innoMemcpy,"--opt_level=2");

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Hi,

inno said:
I created my own function called innoMemcpy, which is the same function as memcpy but just renamed

Have you copied the same function from TI library ?

Regards,

Vivek Singh

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Yes, I did. I first checked inside the map file where the memcpy function comes from:

Afterwards I opened the file and copied it over to my project:

You can compare the above picture with the 1st picture in my initial post.

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Is it possible to send me your sample CCS project which I could run on my setup ?

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hello Vivek Singh,

It seems that you can easily reproduce the issue in a simple blinking LED example program. I can give you the code, which you can simply copy over to the example project.

First of all, I run the code on the TMDSCNCD28388D evaluation board and I imported the CPU1/CM blinking LED example from C2000Ware. I compiled the blinking LED CPU1 project in configuration Flash, see here:

I use CPU1 timer 1 for the time measurement.

Add the following code before the main() function

#define MEMCOPY_DEBUG_CODE 1

#if MEMCOPY_DEBUG_CODE == 1

#pragma CODE_SECTION(innoMemcpy, ".TI.ramfunc");
#pragma FUNCTION_OPTIONS(innoMemcpy,"--opt_for_speed");
//#pragma FUNCTION_OPTIONS(innoMemcpy,"--opt_for_space");
//#pragma FUNCTION_OPTIONS(innoMemcpy,"--opt_level=2");
void *innoMemcpy(void *to, const void *from, size_t n)
{
     register char *rto   = (char *) to;
     register char *rfrom = (char *) from;
     register unsigned int rn;
     register unsigned int nn = (unsigned int) n;

     /***********************************************************************/
     /*  Copy up to the first 64K. At the end compare the number of chars   */
     /*  moved with the size n. If they are equal return. Else continue to  */
     /*  copy the remaining chars.                                          */
     /***********************************************************************/
     for (rn = 0; rn < nn; rn++) *rto++ = *rfrom++;
     if (nn == n) return (to);

     /***********************************************************************/
     /*  Write the memcpy of size >64K using nested for loops to make use   */
     /*  of BANZ instruction.                                               */
     /***********************************************************************/
     {
        register unsigned int upper = (unsigned int)(n >> 16);
        register unsigned int tmp;
        for (tmp = 0; tmp < upper; tmp++)
        {
           for (rn = 0; rn < 65535; rn++)
              *rto++ = *rfrom++;
           *rto++ = *rfrom++;   /* 65536th char of this 64K block */
        }
     }

     return (to);
}

uint16_t myMemcpyBuff[40];
uint16_t myInnoMemcpyBuff[40];

// The elements of the struct hold the CPU Timer 1 value in raw counts (200[MHz])
struct ExecutionTimeMeasurement
{
    int32_t s32_time_at_start_of_function;      // Time at start of function
    int32_t s32_prev_time_at_start_of_function; // Previous time at start of function
    int32_t s32_current_sample_time;            // Current sample rate of the task
    int32_t s32_current_execution_time;         // Current execution time
    int32_t s32_max_execution_time;             // maximum execution time
};

volatile struct ExecutionTimeMeasurement memcpy_ExecutionTime = {0, 0, 0, 0, 0};
volatile struct ExecutionTimeMeasurement innoMemcpy_ExecutionTime = {0, 0, 0, 0, 0};

#endif

Inside the main function, enable CPU timer 1 as a free-running counter just before entering the endless loop (the for(;;)):

    // *****************************************************************************
    // Here we set up TIMER 1 as a free running timer for time measurement purposes.
    // *****************************************************************************
    // enable timer 1 to free run, can be read via: CPUTimer_getTimerCount(CPUTIMER1_BASE);
    //                                              CpuTimer1Regs.TIM.all; when adding the C-files f2838x_XYZ (e.g. f2838x_globalvariabledefs.c, f2838x_gpio.c...) to the project
    // Timer 1 runs per default with 200[MHz] system clock speed and counts down.
    CPUTimer_setPeriod(CPUTIMER1_BASE, 0xFFFFFFFF); // Initialize timer period to maximum
    CPUTimer_setPreScaler(CPUTIMER1_BASE, 0); // Initialize pre-scale counter to divide by 1 (SYSCLKOUT)
    CPUTimer_stopTimer(CPUTIMER1_BASE); // Make sure timer is stopped
    CPUTimer_reloadTimerCounter(CPUTIMER1_BASE); // Reload all counter register with period value
    CPUTimer_startTimer(CPUTIMER1_BASE); // start timer (counts down)

Now add the time measurement code inside the endless loop:

#if MEMCOPY_DEBUG_CODE == 1

        memcpy_ExecutionTime.s32_time_at_start_of_function = CPUTimer_getTimerCount(CPUTIMER1_BASE);
        memcpy_ExecutionTime.s32_current_sample_time = memcpy_ExecutionTime.s32_prev_time_at_start_of_function - memcpy_ExecutionTime.s32_time_at_start_of_function;
        memcpy_ExecutionTime.s32_prev_time_at_start_of_function = memcpy_ExecutionTime.s32_time_at_start_of_function;

        // Memcopy 40 words from CM to CPU1 MSG RAM 0
        memcpy(&myMemcpyBuff[0], (const void *)0x38000, (size_t)sizeof(myMemcpyBuff));

        memcpy_ExecutionTime.s32_current_execution_time = memcpy_ExecutionTime.s32_time_at_start_of_function - CPUTimer_getTimerCount(CPUTIMER1_BASE);
        if(memcpy_ExecutionTime.s32_current_execution_time > memcpy_ExecutionTime.s32_max_execution_time)
        {
            memcpy_ExecutionTime.s32_max_execution_time = memcpy_ExecutionTime.s32_current_execution_time;
        }



        innoMemcpy_ExecutionTime.s32_time_at_start_of_function = CPUTimer_getTimerCount(CPUTIMER1_BASE);
        innoMemcpy_ExecutionTime.s32_current_sample_time = innoMemcpy_ExecutionTime.s32_prev_time_at_start_of_function - innoMemcpy_ExecutionTime.s32_time_at_start_of_function;
        innoMemcpy_ExecutionTime.s32_prev_time_at_start_of_function = innoMemcpy_ExecutionTime.s32_time_at_start_of_function;

        // innoMemcpy 40 words from CM to CPU1 MSG RAM 1
        innoMemcpy(&myInnoMemcpyBuff[0], (const void *)0x38400, (size_t)sizeof(myInnoMemcpyBuff));

        innoMemcpy_ExecutionTime.s32_current_execution_time = innoMemcpy_ExecutionTime.s32_time_at_start_of_function - CPUTimer_getTimerCount(CPUTIMER1_BASE);
        if(innoMemcpy_ExecutionTime.s32_current_execution_time > innoMemcpy_ExecutionTime.s32_max_execution_time)
        {
            innoMemcpy_ExecutionTime.s32_max_execution_time = innoMemcpy_ExecutionTime.s32_current_execution_time;
        }
#endif

You can check the maximum execution time in the debugger, which for me is as follows:

Note that the measured execution time is still OK if timer 1 rolls over, since the subtraction that produces the value in e.g. memcpy_ExecutionTime.s32_current_execution_time wraps around in the same way.

One additional remark: I will be out of the office from tomorrow until August 11th and will therefore not be able to respond to this post. Would it be possible to keep the post open until you have been able to reproduce the issue and find the root cause?

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Inno,

Have you tried using the CCS profiling feature to know the execution time of a function ? You can find more detail about this on link below -

https://software-dl.ti.com/ccs/esd/documents/c2000_profiling-on-c28x-targets.html

The comment in the code snapshot says that you are copying data from CM to CPU1 MSG ram but the Memcpy function is executed on CPU1 which only has READ permission to CMtoCPU1 MSG RAM. Am I missing something ?

Also have you checked that there is no interrupt which is interrupting the innoMemcpy function execution ?

I would still prefer if you could export your CCS project and send it to me because it's not just about the C code but also linker cmd file and build configuration which will impact the execution time.

Regards,

Vivek Singh

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hello Vivek Singh.

1) I can also use the clock cycle measurement method of CCS after my return, but in your above link you will also find the method I applied, explained in chapter 'Profiling in Sys/BIOS'. So I see no difference between the two methods.

2) My memcpy function reads data from CMtoCPU1 message RAM and copies it over to a variable inside the CPU1 RAM (e.g. RAMLS5 memory). That means that this memcpy instruction simulates a part of the IPC data exchange from CM to CPU1, therefore CPU1 must of course read data from read-only message RAM memory.

3) I will try to provide you with my project in a private message within the next days.

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

inno said:
2) My memcpy function reads data from CMtoCPU1 message RAM and copies it over to a variable inside the CPU1 RAM (e.g. RAMLS5 memory). That means that this memcpy instruction simulates a part of the IPC data exchange from CM to CPU1, therefore CPU1 must of course read data from read-only message RAM memory.

Ok, that makes sense.

Will wait for the CCS project to check this one.

Regards,

Vivek Singh

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hello Vivek Singh,

I believe my colleague has sent you the project in a file called 'MemcpyProjects.zip'.

Have you been able to reproduce the issue inside the CPU1 project (see file 'led_ex1_c28x_cm_blinky_cpu1.c')?

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

I got the project and could run the .out in flash configuration, but I am not sure how to reproduce the issue (BTW, when I tried to compile it, it gave an error). What should I look for in this project?

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hello Vivek Singh,

That is weird, I can compile the project without any issues. The important code to look at is placed between the following hash defines:

#if MEMCOPY_DEBUG_CODE == 1

#endif

When I run the project, then I continuously copy 40 words from the MSG RAM to an array, one time with the "memcpy" from the library (line 344 in the below picture) and one time with the same function code but just a different name "innoMemcpy" (line 359 in the below picture). I measure the worst case execution time of the memory copy code via CPU timer 1.

As you can see in the following picture, I get different results in the worst case execution time:

And this although I tried my best to ensure maximum speed for "innoMemcpy":

So what can I do to make "innoMemcpy" run as fast as "memcpy"? As I said, the "memcpy" function from the library has been copied to my project and renamed into "innoMemcpy", so "innoMemcpy" and "memcpy" must contain the same code.

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Ok, will try this tomorrow and get back to you. Is this in the RAM build configuration or Flash?

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

It is compiled in CPU1_FLASH build configuration.

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Ok, thanks. I have the .out from the project that was sent, so I can reuse it, but if I need to recompile, I'll have a problem. Will let you know.

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Alternatively you can do the following:

1) Take a TMDSCNCD28388D evaluation board.

2) Import the project from C2000Ware, e.g.: C:\ti\c2000\C2000Ware_3_04_00_00\driverlib\f2838x\examples\c28x_cm\led

3) Compile the CPU1 project in configuration Flash (compiling the CM project is not required) and move my additional code over to that project.

a) The code within the hash define "#if MEMCOPY_DEBUG_CODE == 1" ... "#endif"

b) The few lines of code where I set up CPU timer 1 to be a free running counter.

4) Flash CPU1 project to the board and run (again, the CM project is not required).

I believe it does not matter which C2000Ware version you use.

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Ok, I am still not able to reproduce this issue.

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hello Vivek Singh,

Do you mean that you did not yet run the example code or do you run the example but you see identical time values for the time consumption in the Expressions window?

If you did not yet run the example (due to lack of time), then no problem. The situation is not yet super urgent for me.

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

I am yet to run with the steps you have mentioned.

0 Vivek Singh over 3 years ago in reply to Vivek Singh

TI__Guru** 115881 points

Hi,

I looked at the compiled code for both functions and here is what I see -

memcpy(&myMemcpyBuff[0], (const void *)0x38000, (size_t)sizeof(myMemcpyBuff));

It uses an RPT loop, so there is no branch instruction.

innoMemcpy(&myInnoMemcpyBuff[0], (const void *)0x38400, (size_t)sizeof(myInnoMemcpyBuff));

Here there is no RPT loop instruction. Instead it uses a branch instruction to loop, which takes multiple cycles (4 cycles). This means that for every copy there are additional cycles in the 2nd function.

This is why the innoMemcpy function is taking more cycles.

Hope it is clear.

Regards,

Vivek Singh

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hi Vivek Singh,

Thanks for finding the precise cause. But of course the generated assembler must have been different, otherwise there would have been no further explanation about the difference in the execution time (except where the code is located, RAM or Flash). My question is actually a different one.

The memcpy function is pre-compiled code coming out of the library. That means that TI has compiled that function in a different way (most likely with the same compiler but with certain attributes during compilation). I am in general interested in making our own code as time efficient as possible, and therefore I would like to learn certain techniques.

My original question is therefore as follows:

What do I have to do (e.g. which #pragma do I have to use) to make the C2000 compiler generate the same efficient assembler code for "innoMemcpy" as for the TI pre-compiled "memcpy" function?

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Hi,

inno said:
Thanks for finding the precise cause. But of course the generated assembler must have been different, otherwise there would have been no further explanation about the difference in the execution time (except where the code is located, RAM or Flash). My question is actually a different one.

Your query got diverted into a design issue related to FLASH vs RAM execution, hence it came back to me from the compiler team. I have just shown why you are getting a longer execution time when the function is executed from RAM.

inno said:
What do I have to do (e.g. which #pragma do I have to use) to make the C2000 compiler generate the same efficient assembler code for "innoMemcpy" as for the TI pre-compiled "memcpy" function?

For this I have to move this post again to compiler team.

Regards,

Vivek Singh

0 JohnS over 3 years ago in reply to Vivek Singh

TI__Guru**** 162895 points

Inno,

Our compiler expert is not available.

However we do ship the source and build utility for the C runtime libraries with the compiler. The toolchain actually builds the libraries dynamically as needed.

Thus you can look at the source and also see how they are being built.

I am not sure which version of the compiler you are using but you can look in the compiler users guide and search for "Building Standard Libraries". There will be information on how to invoke mklib.

Regards,

John

0 inno over 3 years ago in reply to JohnS

Expert 1700 points

Hello JohnS,

I use the following:

If your compiler expert would be available in the next week and may identify the situation right on the spot, then please let me know. Otherwise I will check myself in the compiler users guide and search for "Building Standard Libraries".

Thanks,

Inno

0 JohnS over 3 years ago in reply to inno

TI__Guru**** 162895 points

Inno,

They are out until next Tuesday.

Regards,

John

0 inno over 3 years ago in reply to JohnS

Expert 1700 points

That would be absolutely OK for me. If the compiler expert knows the required #pragma instruction for generating the identical code during this or next week, then this would be a big help.

0 Ki over 3 years ago in reply to inno

TI__Guru**** 473001 points

Our compiler expert should be back tomorrow to take a look. Thank you for your patience.

0 George Mock over 3 years ago in reply to inno

TI__Guru**** 251490 points

inno said:
What do I have to do (e.g. which #pragma do I have to use) to make the C2000 compiler generate the same efficient assembler code for "innoMemcpy" as for the TI pre-compiled "memcpy" function?

I can generate the same code by using optimization level 2, and no other special settings. Here is a demonstration.

C:\examples>ar2000 -x C:\ti\compilers\ti-cgt-c2000_20.2.4.LTS\lib\rts2800_fpu64_eabi.lib memcpy.c.obj

C:\examples>dis2000 --quiet memcpy.c.obj > rts_dis.txt

C:\examples>copy C:\ti\compilers\ti-cgt-c2000_20.2.4.LTS\lib\src\memcpy.c .
        1 file(s) copied.

C:\examples>cl2000 --opt_level=2 --abi=eabi --float_support=fpu64 memcpy.c

C:\examples>dis2000 --quiet memcpy.obj > build_dis.txt

C:\examples>fc rts_dis.txt build_dis.txt
Comparing files rts_dis.txt and BUILD_DIS.TXT
FC: no differences encountered

In your first post you indicate that you use version 20.2.4.LTS and RTS library rts2800_fpu64_eabi.lib. These commands use that same version and library. The first command extracts memcpy.c.obj from the RTS library. The second command disassembles it, and saves the results in the file rts_dis.txt. The copy command copies memcpy.c from the compiler install directory. The cl2000 command builds it. Notice the only setting related to optimization is --opt_level=2. The last dis2000 command disassembles the memcpy.obj just built, and saves the results in the file build_dis.txt. The last command compares the two disassembly files, and shows they match.

Thanks and regards,

-George

0 inno over 3 years ago in reply to George Mock

Expert 1700 points

Hi George,

I gave Vivek Singh a simple blinking LED example project, in which I copied the memcpy function from the file "C:\ti\ccs1040\ccs\tools\compiler\ti-cgt-c2000_20.2.4.LTS\lib\src\memcpy.c" over to a C file and renamed it into innoMemcpy.

I called memcpy and innoMemcpy inside that project and measured the execution time when copying the same amount of data to a certain buffer. I ensured that both functions reside in flash and I enabled compiler optimization level 2 inside my project as you suggested.

For this test all #pragma instructions for innoMemcpy have been commented:

If I now measure the execution time of the function calls using CPU timer 1, then I still see a huge difference in the measured execution time:

Would it be possible for you to reproduce the issue on an evaluation board? The issue is not very urgent but still I would like to reach a state where the measured execution time for calling memcpy and innoMemcpy is the same.

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Did you check the assembly to see if it's generating the RPT block instruction in your local function now?

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hi Vivek Singh,

I am getting a bit confused. The code looks identical to me, BANZ is used in both (BTW, when using LCR, doesn't the stack pile up, possibly reaching a critical level?). Anyway, here are the functions from the Disassembly window:

I checked again the location of the relevant variables and the code inside the map file:

The code that measures the execution time is also identical C-code as you could see in one of my first posts.

So everything appears to be the same but the measured time via timer 1 is different:

Thanks,

Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

The assembly code looks different from that of the project you gave me. memcpy, which is part of the library, was using an RPT block earlier, but in this snapshot I don't see that. If you have made any modification to the project, please send me the updated project.

0 George Mock over 3 years ago in reply to inno

TI__Guru**** 251490 points

inno said:
What do I have to do (e.g. which #pragma do I have to use) to make the C2000 compiler generate the same efficient assembler code for "innoMemcpy" as for the TI pre-compiled "memcpy" function?

Make sure innoMemcpy is inlined the same way memcpy gets inlined. A good way to do that is to supply these lines in a header file ...

#pragma FUNC_ALWAYS_INLINE(innoMemcpy)
static inline void *innoMemcpy(void *to, const void *from, size_t n)
{
    /* implementation here */
}

#include this header file in any source file that calls innoMemcpy.

The best build options to use are --opt_level=2 (or higher) and --unified_memory. At lower levels of optimization, the compiler does not inline any functions. The option --unified_memory allows the compiler to implement the memory-to-memory copy with the PREAD instruction, which in turn allows the loop to be implemented with the single RPT instruction.

I'm sorry I didn't realize the importance of inlining before now.

Thanks and regards,

-George

0 inno over 3 years ago in reply to Vivek Singh

Expert 1700 points

Hi Vivek Singh,

I used the same project but just with the following setting:

Thanks,

Inno

0 inno over 3 years ago in reply to George Mock

Expert 1700 points

Hello George,

I did the following:

1) I verified that --unified_memory is activated inside the compiler settings of CCS.

2) I moved innoMemcpy to a header file and declared it as ALWAYS_INLINE as you suggested above.

3) I enabled for the main function (which calls innoMemcpy) the compiler optimization level 2

#pragma FUNCTION_OPTIONS(main,"--opt_level=2");
void main(void)

My results look as follows:

As you can see now, innoMemcpy is 5-7 times faster than memcpy. That means that there are still differences between memcpy and innoMemcpy, but this time the innoMemcpy function is much faster. Of course I won't complain about it ;-)

Thanks a lot for your help.

-Inno

0 Vivek Singh over 3 years ago in reply to inno

TI__Guru** 115881 points

Glad to see you were able to get this working finally.

Regards,

Vivek Singh
