
Peripheral register access time

We are monitoring the CPU usage of our application. 
It appears that the access time to a peripheral register is over 8 times slower than a RAM access on the TMS570LS2 at 160 MHz. 
This measurement affects all peripheral registers: GIO, ADC, NHET...  Is it normal to get that access time overhead?


With the sample code below we measured:

 inHwDeltaCycle  =  1022 cycles;
 inRamDeltaCycle =  135 cycles;

 We compiled it using program-mode compilation (-pm -O2 -mf3) to force inlining of gioSetBit.


 

==================== resulting assembly =================================

The resulting assembly code for the HET IO is:
00086e20:   EB000F4B BL              $../PerformanceMonitor.asm:216:220$
00086e24:   E59F6340 LDR             R6, $C$CON263
00086e28:   E5860000 STR             R0, [R6]
00086e2c:   E584500C STR             R5, [R4, #12]    ; DSET
00086e30:   E5845010 STR             R5, [R4, #16]    ; DCLR
....  repeats 48 more times
00086e34:   E584500C STR             R5, [R4, #12]
00086e38:   E5845010 STR             R5, [R4, #16]

The resulting assembly code for the pseudo RAM IO is:
00086e20:   EB000F4B BL              $../PerformanceMonitor.asm:216:220$
00086e24:   E59F6340 LDR             R6, $C$CON265
00086e28:   E5860000 STR             R0, [R6]
00086e2c:   E584500C STR             R5, [R4, #12]    ; DSET
00086e30:   E5845010 STR             R5, [R4, #16]    ; DCLR
....  repeats 48 more times
00086e34:   E584500C STR             R5, [R4, #12]
00086e38:   E5845010 STR             R5, [R4, #16]

 

==================== Source Code =================================

/* USER CODE BEGIN (0) */

// expand the argument l 50 times

#define LOOP_50(l)   l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;  \
                                  l;l;l;l;l;l;l;l;l;l;

/* USER CODE END */


/* Include Files */

#include "sys_common.h"
#include "system.h"
#include "PerformanceMonitor.h"

/* USER CODE BEGIN (1) */
#include "het.h"
/* USER CODE END */


/* USER CODE BEGIN (2) */

// ==============================  In ram pseudo gioport
gioPORT_t myport;
#define ramPORT ((gioPORT_t *)&myport)
 


// ==============================  Performance monitoring results
volatile unsigned int PMU_counter1_result = 0;
volatile unsigned int PMU_counter2_result = 0;
volatile unsigned int PMU_counter3_result = 0;
volatile unsigned int PMU_cycle_count = 0;

unsigned int inRamDeltaCycle = 0;
unsigned int inHwDeltaCycle = 0;

 /* USER CODE END */


/** @fn void main(void)
*   @brief Application main function
*   @note This function is empty by default.
*
*   This function is called after startup.
*   The user can use this function to implement the application.
*/
void main(void)
{
   /* USER CODE BEGIN (3) */
   int i;
   unsigned value = 0;

   /* Enable IRQ */
   _enable_IRQ();
   /* Set HET port pins to output */
   gioSetDirection(hetPORT, 0xFFFFFFFF);
   /* Set HET port pin 1 high */
   gioSetBit(hetPORT, 1, 1);
   /* Send user prompt */


   PmInit( PMCC_CYCLE_COUNT, PMCC_CYCLE_COUNT, PMCC_CYCLE_COUNT );
   PmClearAndStartCycleCounter();


   // ==============================  In HW nhet loop
   PMU_counter1_result = PmReadCycleCounter();
  
   LOOP_50( gioSetBit(hetPORT, 1, value ^= 1 ))   
   PMU_counter2_result = PmReadCycleCounter();
   
   // ==============================  In RAM pseudo IO loop
   LOOP_50(gioSetBit(ramPORT, 1, value ^= 1 ))
   PMU_counter3_result = PmReadCycleCounter();

 

   // ==============================  Measurement
   inHwDeltaCycle =  PMU_counter2_result-PMU_counter1_result;         // 1022 cycles
   inRamDeltaCycle = PMU_counter3_result-PMU_counter2_result;       //  135 cycles


   asm(" nop");

   while(1);

   /* USER CODE END */
}

  • I measure the same delay, but other forum members apparently don't, perhaps a setup issue? I don't know.

  • Generally, an access to a peripheral register is on the order of 19 - 25 cycles, depending on whether it is a read or a write and on the clock divider ratios.  (At 160 MHz you'll have to run the peripherals at /2, so you'll get the longer access times.)

    If you have the MPU set for strongly ordered, or not set at all (the default is strongly ordered for the region where the peripheral registers are), then you will get this result if you're just in a loop accessing peripheral registers (or, I think more generally, in a loop accessing memory not on a TCM interface like the flash or SRAM... some target that you have to reach through the AXI master port).

    If you change the MPU setting for the specific area to 'device' type, then writes can be buffered, which has a sort of 'pipelining' effect, and you'll see fewer cycles between two writes (although the latency of a single write doesn't change).

    Device is probably appropriate for most peripherals - we just happen to have ours mapped at the high addresses to be compatible with the memory map of the TMS470R1 devices, which predated the definition of the MPU default memory map.

    If your application has quite a lot of IO activity, then you might need to consider bringing in the DMA to assist with the data transfers, since the DMA can move the data to the CPU's RAM where it can be accessed very quickly.

    Regarding your RAM accesses, they should be (effectively) single cycle most of the time in a real application. There is some latency, but this is hidden by the pipeline except when there are data dependencies.  It's pretty common in simple test code to have the data dependency problem, but you may want to benchmark an algorithm more typical of your actual application before you worry too much about the RAM access being more than one cycle in the simple test program.

  • Thanks, that is interesting and explains a lot, but where is that 19 - 25 cycle peripheral register access time documented?

    I am transferring large amounts of data from the A/D via DMA to RAM, but I still need to calculate worst-case transfer times.

     

  • Hi Steve,

    I'm not aware of any one document that captures this information.  We got this by checking the specifications of the individual components (CPU - ARM's spec,  switch fabric - internal specification docs).  And we tested on the bench and correlated the two.    This was for the path from the CPU to the peripheral.
    The DMA->Peripheral path is different.  

    I think the next step here should be to get a list of the paths you need timing information on (maybe prioritized) and we can take this as a documentation update request.

    Thanks and Best Regards,

    Anthony

  • Anthony,

    I am most interested in DMA performance, as my CPU-to-peripheral access is minimal and usually done only on power-up.

    Typically I want two DMA performance metrics: worst-case transfer rates (when moving large amounts of data) and worst-case request rates (when moving small amounts of data frequently).

    So basically, how long does it take to move a large chunk of data, and how many requests for data can I make in a certain amount of time?

    The metrics are to be defined from SRAM to SRAM, Peripheral to SRAM and SRAM to Peripheral.

    TI usually documents this stuff in app notes like

    http://www.ti.com/lit/an/spraa00/spraa00.pdf

    steve

     

  • Quick clarification on ARM v7 memory ordering:

    • Strongly ordered memory regions will hold the bus until an "ok" or "error" bus response/write response is returned.  This is the equivalent of non-cached/non-buffered (NCNB) on previous ARM cores.  Slow, but safe, as a bad transaction can be retried without loss of context.
    • Device memory regions will allow buffered write transactions - as soon as the data transaction hits the CPU write buffer, CPU assumes it will complete properly and advances to the next pending bus transaction.  Potential to increase performance, but you lose the ability to rewind and retry the transaction if something goes wrong.

    In general, you should keep your peripheral region in the default strongly ordered state unless you understand the costs and benefits of implementing a device mode MPU region.

     

  • Thanks, in general for safety critical applications we disable any automatic recovery modes, as they can get stuck (due to hard over failure) and completely take down the entire system.

  • Hello Steve

    Using the MPU settings you suggested, I redid the benchmark.

    CPU cycles to write 50 times to a GIO register, with _mpuInit_():

    Het GIO 210
    GIO 189
    RAM 131

    CPU cycles to write 50 times to a GIO register, without _mpuInit_():

    Het GIO 1023
    GIO 1014
    RAM 136

     

    Thanks again for your help.

  • Hello Sylvestre,

    Is this topic now resolved for you?  Can we consider it closed?

  • I assume that this thread is now closed.  Thanks!