TMS320F28388D: TMS320F28388D code optimization related to memory handling, data transfers of memory blocks or structs (issues related to compiler and/or linker?)

Nenad Tezak

Part Number: TMS320F28388D

Dear TI experts,

after iterative attempts to improve the efficiency of the final assembly code generated by the TI tools (C/C++ compiler/linker), I have not found a combination of options and system settings (compiler, linker, C28x...) that achieves sufficiently fast data transfers. Efficiency here is measured as the number of CPU SYSCLK cycles needed to perform these extensively used operations.

The fact is that we were forced to shrink the data interface to 16-bit wide EMIF interface (as a compromise to enable integration of an internet interface in our ECS). There is of course an additional (aggravating) issue regarding the asynchronous character of the EMIFx data interface.

There are many CPU performance features available, such as the many 1-2 cycle arithmetic/logical (read / test&modify / write) instructions that use enhancements of the new C2000 MCU family infrastructure when operating between internal registers, or between registers and internal memory (including automatic post-increment and post-decrement addressing options). However, none of these instructions provides similar performance when the source and destination data are both in memory locations rather than in registers.

For example, consider an external block of data accessed over the EMIF16 interface and stored in external memory locations
(e.g. a buffer with consecutive addresses containing immediate ADC results acquired by an additional, independently controlled, external neighboring DAQ subsystem integrated on the same PCB, and/or their independently computed post-processing derivatives)
which should be delivered as soon as possible to data memory within the MCU chip
(e.g. GS or LS memory assigned to either of the C28x cores).

Each such transfer requires a rather high number of CPU cycles, most of which seem to be wasted, since the C28x core
does not perform any other useful task during that time (according to the assembly code generated from C/C++).

For illustration, a data block of 8 words (16-bit data) requires more than 170 cycles (averaged value),
with a tendency toward further degradation when the total CPU load reaches about 75% or a bit more.

Besides, it turned out that data coherence (of the data within such a block transfer) can only be guaranteed if the block transfer operation is executed in the top-priority task.

Is there any chance to improve the performance of such de facto "memory-to-memory" data transfers by employing some advanced options/switches of the TI C2000 compiler/linker?

Since consecutive addresses are involved, would performance improve if (for the block transfer example above) only 4 EMIF transfers were made (4×32-bit data, internally split into high/low words if necessary)
instead of twice as many EMIF data transfers (8×16-bit data), both using the EMIF16 data interface?
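
To make the question concrete, here is a minimal sketch of the 4×32-bit variant in C. The function name `copy_pairs_32` is hypothetical, the volatile source pointer merely stands in for an EMIF-mapped buffer, and the "low word first" ordering is an assumption about how the two 16-bit halves are laid out. I would expect the EMIF16 to still perform two 16-bit bus cycles for each 32-bit read, so any gain would come from fewer CPU instructions and not from fewer bus cycles; that is an assumption to be confirmed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: read n_pairs 32-bit words from a volatile source
   (standing in for an EMIF-mapped buffer) and split each one into two
   16-bit words in the destination, low word first (assumed ordering). */
void copy_pairs_32(const volatile uint32_t *src, uint16_t *dst, size_t n_pairs)
{
    size_t i;
    for (i = 0; i < n_pairs; i++) {
        uint32_t w = src[i];                        /* one 32-bit read per pair */
        dst[2 * i]     = (uint16_t)(w & 0xFFFFu);   /* low 16 bits  */
        dst[2 * i + 1] = (uint16_t)(w >> 16);       /* high 16 bits */
    }
}
```

For the 8-word block above this would be called with `n_pairs` equal to 4; the cycle count of both variants would still have to be measured on the target.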

Furthermore, what would you recommend (apart from DMA) to optimize the performance of data exchange between internal peripherals (internal ADCs, SDFM subsystems, EPWMs...)
and the C28x cores (preferably applicable to both CLA cores as well)?

Is there any way to instruct the (pre)compiler to maximize the use of "atomic instructions" (i.e. to recognize and emit as many atomic instructions as possible, wherever applicable)?
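
As an illustration of the kind of source pattern I have in mind, the sketch below uses compound assignments directly on 16-bit variables in data memory. The variable and function names are hypothetical. My assumption is that the compiler can in principle map each of these statements to a single read-modify-write instruction with a memory operand (for example an OR, AND or ADD on that location); whether it actually does so has to be checked in the generated assembly listing.

```c
#include <stdint.h>

/* Hypothetical 16-bit status word and counter located in data memory. */
volatile uint16_t status_flags = 0;
volatile uint16_t event_count = 0;

/* Set bit 2 of the status word (candidate for a single OR on memory). */
void mark_ready(void)  { status_flags |= 0x0004u; }

/* Clear bit 2 of the status word (candidate for a single AND on memory). */
void clear_ready(void) { status_flags &= (uint16_t)~0x0004u; }

/* Increment the counter in place (candidate for a single ADD/INC on memory). */
void count_event(void) { event_count += 1u; }
```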

Is there any particular reason why the assembly RPT instruction can repeat only a single instruction (rather than at least two, in which case I strongly believe that many of the aforementioned CPU code execution efficiency problems might be resolved)?

That's all for now. Hope to hear from you soon. Thanks in advance.

Time is of the essence (particularly execution time for real-time embedded control systems), as it always is :-)

Best regards

Nenad Težak

P.S.: As strange as it may sound, from our current point of view it appears that the CPU performance of the F28388D with C/C++ generated assembly lags significantly behind the F28335 running hand-written assembly code!?
Does this have to do with less efficient passing of input/output arguments between functional blocks (each of which is written in C/C++ and then compiled and linked into executable code)?

11 months ago

George Mock 11 months ago

Your post covers several different topics. I am unable to respond to all of them. Please pick one topic to focus on.

If a topic such as ...

Nenad Tezak said:
we were forced to shrink the data interface to 16-bit wide EMIF interface

... is of most interest, then I will ask other experts to help.

If the topic of interest can be expressed as some C code that performs poorly, then I can probably help. For the source file which contains this code, please follow the directions in the article How to Submit a Compiler Test Case. Indicate where in the source the problem starts and stops. Perhaps it makes sense to name the function. Or maybe it makes sense to add a comment like /* PROBLEM HERE */.
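
For illustration only, a marked-up source file might look like the sketch below. The function name `copy_block`, the `BLOCK_WORDS` constant and the plain copy loop are hypothetical placeholders based on the 8-word example in the original post, not the actual code in question; the volatile source pointer merely stands in for the external buffer.

```c
#include <stdint.h>

#define BLOCK_WORDS 8   /* hypothetical block size (8 words of 16-bit data) */

/* Hypothetical function: copies one block from the external buffer (src)
   to internal data memory (dst), one 16-bit word at a time. */
void copy_block(const volatile uint16_t *src, uint16_t *dst)
{
    int i;
    /* PROBLEM HERE: start of the slow region */
    for (i = 0; i < BLOCK_WORDS; i++) {
        dst[i] = src[i];
    }
    /* PROBLEM HERE: end of the slow region */
}
```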

Thanks and regards,

-George

Nenad Tezak 11 months ago in reply to George Mock

Hi George,

the first "topic of interest" you mentioned (regarding the usage of the EMIF16 interface) was only included to provide context for the actual topic of interest (code optimization issues...). However, there are several questions about the principles (recommended practices) for maximizing the CPU performance of the C28x core within the F28388D.

Before reaching out to the E2E site I consulted all the available literature I could find (including the C28x extended instruction sets, C/C++ optimization LTS guides, and TI workshop materials related to this topic). I have not found compiler/linker options that maximize the usage of "atomic instructions", nor other "tips and tricks" regarding atomic instructions that might help improve the CPU performance of the C28x core within the F28388D. The aim is at least to eliminate the performance lag of the F28388D (operating at 200 MHz and running assembly code created by the TI C/C++ compiler/linker) behind the combination of the F28335 MCU (operating at 150 MHz) and directly written, manually optimized assembly code for the C28x core within the F28335 MCU.

Regarding the particular memory copy problem, i.e. the transfer of data blocks when neither location (source or destination) is an internal MCU register, I will ask my colleagues to provide both the C/C++ code and the generated assembly code for submission as a test case.

Best regards

Nenad
