
Peripheral register access time

We are monitoring the CPU usage of our application. 
It appears that the access time to a peripheral register is over 8 times slower than a RAM access on the TMS570LS2 at 160 MHz. 
This measurement affects all peripheral registers: GIO, ADC, NHET...  Is it normal to get that access time overhead?


With the sample code below we measured:

 inHwDeltaCycle  =  1022 cycles;
 inRamDeltaCycle =  135 cycles;

 We compiled it using program-mode compilation (-pm -O2 -mf3) to force inlining of gioSetBit.


 

==================== resulting assembly =================================

The resulting assembly code for the HET IO is:
00086e20:   EB000F4B BL              $../PerformanceMonitor.asm:216:220$
00086e24:   E59F6340 LDR             R6, $C$CON263
00086e28:   E5860000 STR             R0, [R6]
00086e2c:   E584500C STR             R5, [R4, #12]    ; DSET
00086e30:   E5845010 STR             R5, [R4, #16]    ; DCLR
....  repeats 48 more times
00086e34:   E584500C STR             R5, [R4, #12]
00086e38:   E5845010 STR             R5, [R4, #16]

The resulting assembly code for the pseudo RAM IO is:
00086e20:   EB000F4B BL              $../PerformanceMonitor.asm:216:220$
00086e24:   E59F6340 LDR             R6, $C$CON265
00086e28:   E5860000 STR             R0, [R6]
00086e2c:   E584500C STR             R5, [R4, #12]    ; DSET
00086e30:   E5845010 STR             R5, [R4, #16]    ; DCLR
....  repeats 48 more times
00086e34:   E584500C STR             R5, [R4, #12]
00086e38:   E5845010 STR             R5, [R4, #16]

 

==================== Source Code =================================

/* USER CODE BEGIN (0) */

// expand the argument l 50 times

#define LOOP_50(l)   l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;

/* USER CODE END */


/* Include Files */

#include "sys_common.h"
#include "system.h"
#include "PerformanceMonitor.h"

/* USER CODE BEGIN (1) */
#include "het.h"
/* USER CODE END */


/* USER CODE BEGIN (2) */

// ==============================  In ram pseudo gioport
gioPORT_t myport;
#define ramPORT ((gioPORT_t *)&myport)
 


// ==============================  Performance monitoring results
volatile unsigned int PMU_counter1_result = 0;
volatile unsigned int PMU_counter2_result = 0;
volatile unsigned int PMU_counter3_result = 0;
volatile unsigned int PMU_cycle_count = 0;

unsigned int inRamDeltaCycle = 0;
unsigned int inHwDeltaCycle = 0;

 /* USER CODE END */


/** @fn void main(void)
*   @brief Application main function
*   @note This function is empty by default.
*
*   This function is called after startup.
*   The user can use this function to implement the application.
*/
void main(void)
{
   /* USER CODE BEGIN (3) */
   int i;
   unsigned value = 0;

   /* Enable IRQ */
   _enable_IRQ();
   /* Set HET port pins to output */
   gioSetDirection(hetPORT, 0xFFFFFFFF);
   /* Set HET port pin 1 high */
   gioSetBit(hetPORT, 1, 1);
   /* Send user prompt */


   PmInit( PMCC_CYCLE_COUNT, PMCC_CYCLE_COUNT, PMCC_CYCLE_COUNT );
   PmClearAndStartCycleCounter();


   // ==============================  In HW nhet loop
   PMU_counter1_result = PmReadCycleCounter();
  
   LOOP_50( gioSetBit(hetPORT, 1, value ^= 1 ))   
   PMU_counter2_result = PmReadCycleCounter();
   
   // ==============================  In RAM pseudo IO loop
   LOOP_50(gioSetBit(ramPORT, 1, value ^= 1 ))
   PMU_counter3_result = PmReadCycleCounter();

 

   // ==============================  Measurement
   inHwDeltaCycle =  PMU_counter2_result-PMU_counter1_result;         // 1022 cycles
   inRamDeltaCycle = PMU_counter3_result-PMU_counter2_result;       //  135 cycles


   asm(" nop");

   while(1);

   /* USER CODE END */
}

  • I measure the same delay, but other forum members apparently don't, perhaps a setup issue? I don't know.

  • Generally, an access to a peripheral register is on the order of 19 - 25 cycles, depending on whether it is a read or a write and on the clock divider ratios.  (At 160 MHz you'll have to run the peripherals at /2, so you'll get the longer access times.)

    If you have the MPU set for strongly ordered, or not set at all (the default is strongly ordered for the region where the peripheral registers are), then you will get this result if you're just in a loop accessing peripheral registers (or, I think more generally, in a loop accessing memory not on a TCM interface like the flash or SRAM... some target that you have to reach through the AXI master port).

    If you change the MPU setting for the specific area to 'device' type, then writes can be buffered, which has a sort of 'pipelining' effect, and you'll see fewer cycles between two writes (although the latency of a single write doesn't change).

    Device is probably appropriate for most peripherals - we just happen to have ours mapped at the high addresses to be compatible with the memory map of the TMS470R1 devices, which predated the definition of the MPU default memory map.

    If your application has quite a lot of IO activity, then you might need to consider bringing in the DMA to assist with the data transfers, since the DMA can move the data to the CPU's RAM where it can be accessed very quickly.

    Regarding your RAM accesses, they should be (effectively) single cycle most of the time in a real application. There is some latency, but this is hidden by the pipeline except when there are data dependencies.  It's pretty common in simple test code to have the data dependency problem, but you may want to benchmark an algorithm more typical of your actual application before you worry too much about the RAM access being more than one cycle in the simple test program.

  • Thanks, that is interesting and explains a lot, but where is that 19 - 25 cycle peripheral register access time documented?

    I am transferring large amounts of data from the A/D via DMA to RAM, but I still need to calculate worst-case transfer times.

     

  • Hi Steve,

    I'm not aware of any one document that captures this information.  We got this by checking the specifications of the individual components (CPU - ARM's spec,  switch fabric - internal specification docs).  And we tested on the bench and correlated the two.    This was for the path from the CPU to the peripheral.
    The DMA->Peripheral path is different.  

    I think the next step here should be to get a list of the paths you need timing information on (maybe prioritized) and we can take this as a documentation update request.

    Thanks and Best Regards,

    Anthony

  • Anthony,

    I am most interested in DMA performance, as my CPU-to-peripheral access is minimal and usually done only on power-up.

    Typically I want two DMA performance metrics: worst-case transfer rates (when moving large amounts of data) and worst-case request rates (when moving small amounts of data frequently).

    So basically, how long does it take to move a large chunk of data, and how many requests for data can I make in a certain amount of time?

    The metrics are to be defined from SRAM to SRAM, Peripheral to SRAM and SRAM to Peripheral.

    TI usually documents this stuff in app notes like

    http://www.ti.com/lit/an/spraa00/spraa00.pdf

    steve

     

  • Quick clarification on ARM v7 memory ordering:

    • Strongly ordered memory regions will hold the bus until an "ok" or "error" bus response/write response is returned.  This is the equivalent of non-cached/non-buffered (NCNB) on previous ARM cores.  Slow, but safe, as a bad transaction can be retried without loss of context.
    • Device memory regions will allow buffered write transactions - as soon as the data transaction hits the CPU write buffer, CPU assumes it will complete properly and advances to the next pending bus transaction.  Potential to increase performance, but you lose the ability to rewind and retry the transaction if something goes wrong.

    In general, you should keep your peripheral region in the default strongly ordered state unless you understand the costs and benefits of implementing a device mode MPU region.

     

  • Thanks, in general for safety critical applications we disable any automatic recovery modes, as they can get stuck (due to hard over failure) and completely take down the entire system.

  • Hello Steve

    Using the MPU settings you suggested, I redid the benchmark.

    CPU cycles to write 50 times to a GIO register, with _mpuInit_():

    Het GIO 210
    GIO 189
    RAM 131

    CPU cycles to write 50 times to a GIO register, without _mpuInit_():

    Het GIO 1023
    GIO 1014
    RAM 136

     

    Thanks again for your help.

  • Hello Sylvestre,

    Is this topic now resolved for you?  Can we consider it closed?

  • I assume that this thread is now closed.  Thanks!