Dear TI experts,
After repeated attempts to improve the efficiency of the assembly code generated by the TI tools (C/C++ compiler/linker), I have not found a combination of options and system settings (compiler, linker, C28x...) that achieves sufficiently fast data transfers. I measure efficiency as the number of CPU SYSCLK cycles needed to perform these heavily used operations.
We were forced to shrink the data interface to a 16-bit wide EMIF interface (as a compromise to enable integration of an internet interface in our ECS). An additional, aggravating issue is the asynchronous nature of the EMIFx data interface.
There are many CPU performance features available, such as the many 1-2 cycle arithmetic/logical (read / test&modify / write) instructions that use the enhancements of the new C2000 MCU family infrastructure when operating between internal registers, or between internal memory and registers (including automatic post-increment and post-decrement addressing options). However, none of these instructions provides similar performance when source and destination data are both in memory locations that are not register-allocated.
For example, using the EMIF16 interface, consider a block of data stored in external memory locations
(e.g. a buffer at consecutive addresses containing immediate ADC results acquired by an additional, independently controlled, external DAQ subsystem integrated on the same PCB, and/or their independently computed post-processing derivatives)
which should be delivered as soon as possible to data memory within the MCU chip
(e.g. GS or LS memory assigned to either of the C28x cores).
Each such transfer requires a rather high number of CPU cycles, most of which seem to be wasted, since during that time the C28x core
does not perform any other useful task (according to the assembly code generated from C/C++).
For illustration, a data block of 8 words (16-bit data) requires more than 170 cycles (averaged value), i.e. more than 21 cycles per word,
with a tendency toward further degradation when the total CPU load reaches about 75% or a bit more.
Besides, it turned out that data coherence (of the data within such a block transfer) can only be guaranteed if the block transfer operation is executed in the top-priority task.
Is there any chance to improve the performance of such de facto "memory-to-memory" data transfers by employing some advanced options/switches of the TI C2000 compiler/linker?
Since consecutive addresses are involved, would performance improve if (for the block transfer example above) only 4 EMIF transfers were made (4×32-bit data, internally split into high/low words if necessary)
instead of twice as many EMIF data transfers (8×16-bit data), both using the EMIF16 data interface?
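To make the comparison concrete, here is a minimal sketch of the two variants I have in mind. It is only an illustration: the function names and `BLOCK_WORDS` are made up for this post, the source pointers stand in for the external buffer mapped through EMIF16, and the 32-bit variant assumes the low 16-bit word sits at the lower address. Whether variant B actually needs fewer cycles on EMIF16 is exactly what I am asking.

```c
#include <stdint.h>
#include <stddef.h>

#define BLOCK_WORDS 8u  /* hypothetical name; block size from the example above */

/* Variant A: eight 16-bit reads from the external buffer. */
static void copy_8x16(volatile const uint16_t *src, uint16_t *dst)
{
    size_t i;
    for (i = 0; i < BLOCK_WORDS; i++) {
        dst[i] = src[i];
    }
}

/* Variant B: four 32-bit reads, each split into low/high 16-bit words.
 * Assumption: the low word is stored at the lower address. */
static void copy_4x32(volatile const uint32_t *src, uint16_t *dst)
{
    size_t i;
    for (i = 0; i < BLOCK_WORDS / 2u; i++) {
        uint32_t pair = src[i];                        /* one 32-bit access */
        dst[2u * i]      = (uint16_t)(pair & 0xFFFFu); /* low word  */
        dst[2u * i + 1u] = (uint16_t)(pair >> 16);     /* high word */
    }
}
```

In the real system `src` would point into the EMIF address range and `dst` into GS or LS RAM; both functions should leave the same 8 words in the destination buffer.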
Furthermore, what would you recommend (apart from DMA) to optimize the performance of data exchange between internal peripherals (internal ADCs, SDFM subsystems, EPWMs...)
and C28x cores (preferably applicable to both CLA cores as well)?
Is there any way to instruct the (pre)compiler to maximize the use of "atomic instructions" (i.e. to recognize and emit as many atomic instructions as possible, whenever applicable)?
Is there any particular reason why the assembly RPT instruction can repeat only a single instruction (instead of at least two instructions, in which case I strongly believe that many of the aforementioned code execution efficiency problems might be resolved)?
That's all for now. Hope to hear from you soon. Thanks in advance.
Time is of the essence (particularly execution time for real-time embedded control systems), as it always is :-)
Best regards
Nenad Težak
P.S.: As strange as it may sound, from our current point of view it appears that the CPU performance of the F28388D with C/C++ generated assembly lags significantly behind the F28335 running hand-written assembly code!?
Does this have to do with less efficient passing of input/output arguments between the functional blocks (each of which is written in C/C++ and then compiled and linked into executable code)?