Waitstates for RTI register

Other Parts Discussed in Thread: TMDXRM46HDK, RM46L852

Hello,

I use the RTI module of the TMDXRM46HDK as a timer. I am surprised that a read-modify-write of a timer register takes extremely long.
To measure it, I set an output pin high before the access and low after it. With my oscilloscope I measured that the read-modify-write, including the pin toggling, takes 340ns.
   //340ns
   SET_TEST_PIN_1_HIGH();   //(gioPORTA->DSET = 1 << 1)
   STOP_CPU_TIMER_2();      //rtiREG1->GCTRL &= ~(1U << (rtiCOUNTER_BLOCK1 & 3U))
   SET_TEST_PIN_1_LOW();    //(gioPORTA->DCLR = 1 << 1)
   //\340ns

Just toggling the pin takes 100ns:
   //100ns
   SET_TEST_PIN_1_HIGH();    //(gioPORTA->DSET = 1 << 1)
   SET_TEST_PIN_1_LOW();    //(gioPORTA->DCLR = 1 << 1)
   //\100ns

Even if I subtract the 100ns, the read-modify-write still takes 240ns, which is 53 cycles at 220MHz!
I had a look at the assembler listing; according to it, the read-modify-write should take only 4 cycles = 18ns.
   ||$C$CON38||:    .field    -541644,32            //Pointer to structure gioPORTA=0xFFF7BC34
        MOV       A1, #0                ; |862|                     //A1 = 0
        LDR       V9, $C$CON38          ; |866|     CYC=1,LDR=2  0+1=1      //V9 = gioPORTA = 0xFFF7BC34
        MOV       A2, #2                ; |866|     CYC=1,LDR=1  1+1=2      //A2 = 1<<1 = 2
        STR       A2, [V9, #12]         ; |866|     CYC=1,LDR=2  2+1=3      //V9[12] = A2 (LDR of V9 already finished, so no waitstates)

        LDR       A3, [A1, #-1024]      ; |867|     CYC=1,LDR=2  3+1=4      //A3 = A1[0xFFFFFC00] = *0xFFFFFC00
                                                 4+1=5    //1Waitstate for A3
        BIC       A3, A3, #2            ; |867|     CYC=1,LDR=1  5+1=6      //A3 &= ~2
        STR       A3, [A1, #-1024]      ; |867|     CYC=1,LDR=2  6+1=7      //A1[0xFFFFFC00] = *0xFFFFFC00 = A3

        STR       A2, [V9, #16]         ; |868|     CYC=1,LDR=2  7+1=8      //V9[16] = A2
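
The conversions between scope readings and CPU cycles used above can be checked with a small host-side helper. This is only a sketch: it assumes the 220MHz CPU clock stated in the post, and the function names are made up for illustration.

```c
/* CPU clock assumed from the post: 220 MHz. */
#define CPU_HZ 220000000.0

/* Convert a duration in nanoseconds to CPU cycles. */
static double ns_to_cycles(double ns)
{
    return ns * 1e-9 * CPU_HZ;
}

/* Convert a number of CPU cycles to nanoseconds. */
static double cycles_to_ns(double cycles)
{
    return cycles / CPU_HZ * 1e9;
}
```

With these helpers, `ns_to_cycles(340.0 - 100.0)` gives roughly 52.8 cycles, and `cycles_to_ns(4.0)` gives roughly 18.2ns, which matches the 53 cycles and 18ns quoted above.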

Where do the extra waitstates for the RTI register come from?
I couldn't find anything in the datasheets. The information I took from the ARM document (Cortex-R4-white-paper.pdf) doesn't match what I measure.

Is there any possibility to avoid the waitstates?

Is there a document that describes the RM46L852 instruction set (similar to spru430d.pdf "TMS320C28x DSP CPU and Instruction Set Reference Guide" for F28xx)?

Thank you,

Norbert

  • Hi,

    If you are looking to use the RTI for benchmarking, I would like to suggest the use of the PMU that is a part of the Cortex core itself. This should give you more accurate results as reads/writes to the PMU are faster.

    Not all ARM instructions execute in 1 cycle. Here is the link to the Cortex R4 TRM where you can find the details of the instruction set: http://infocenter.arm.com/help/topic/com.arm.doc.ddi0363e/index.html

    The latency of an access to a peripheral can be in the 12 cycle range.  This is not the same as wait states.  A wait state by definition is the time required by a bus slave to respond to a transaction after receipt.  In this case the cycles are consumed in the interconnect and the peripheral is responding at zero wait states.

    One aspect that has an influence on the peripheral access performance is the MPU setup. In order to get better performance you should consider configuring the MPU regions to "device" for specific peripheral regions.

    Here are a few threads that discuss the MPU setup along similar lines. Hope this helps.

    http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/129897.aspx

    http://e2e.ti.com/support/microcontrollers/hercules/f/312/t/140039.aspx

    - Forum support
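
The "device" versus "strongly ordered" distinction mentioned in the reply above is selected through the TEX, C and B bits of the ARMv7-R MPU Region Access Control Register. The following host-side sketch shows how such a value could be assembled. The struct, macro and function names are hypothetical (they are not HALCoGen identifiers), and the bit layout should be checked against the Cortex-R4 TRM before use.

```c
#include <stdint.h>

/* Bit positions in the ARMv7-R MPU Region Access Control Register. */
#define RACR_B_SHIFT    0u   /* bufferable */
#define RACR_C_SHIFT    1u   /* cacheable */
#define RACR_S_SHIFT    2u   /* shareable */
#define RACR_TEX_SHIFT  3u   /* type extension, 3 bits */
#define RACR_AP_SHIFT   8u   /* access permissions, 3 bits */
#define RACR_XN_SHIFT   12u  /* execute never */

/* Hypothetical container for the attribute fields of one MPU region. */
typedef struct {
    uint32_t tex;
    uint32_t c;
    uint32_t b;
    uint32_t s;
    uint32_t ap;
    uint32_t xn;
} region_attr_t;

/* Pack the attribute fields into a register value. */
static uint32_t racr_encode(region_attr_t a)
{
    return ((a.xn  & 1u) << RACR_XN_SHIFT)
         | ((a.ap  & 7u) << RACR_AP_SHIFT)
         | ((a.tex & 7u) << RACR_TEX_SHIFT)
         | ((a.s   & 1u) << RACR_S_SHIFT)
         | ((a.c   & 1u) << RACR_C_SHIFT)
         | ((a.b   & 1u) << RACR_B_SHIFT);
}

/* Strongly ordered: TEX=000, C=0, B=0. AP=011 is privileged RW, user RW. */
static const region_attr_t ATTR_STRONGLY_ORDERED =
    { .tex = 0u, .c = 0u, .b = 0u, .s = 0u, .ap = 3u, .xn = 1u };

/* Non-shareable device: TEX=010, C=0, B=0, same permissions, no execute. */
static const region_attr_t ATTR_DEVICE_NONSHAREABLE =
    { .tex = 2u, .c = 0u, .b = 0u, .s = 0u, .ap = 3u, .xn = 1u };
```

Under these assumptions, `racr_encode(ATTR_DEVICE_NONSHAREABLE)` yields 0x1310 and `racr_encode(ATTR_STRONGLY_ORDERED)` yields 0x1300. Only the TEX field differs between the two.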

  • Hi,

    I use the RTI as the system timer for my timeslice OS, not for benchmarking. The RTI registers were just one example.
    I have the same problem with other peripherals such as MibSPI, where I have many read and write accesses to the buffers in MibSPI RAM.

    To get more accurate benchmarks, I measured the timing of my RTI example with the PMU.

    CYCLE_COUNTER_START();            //_pmuResetCycleCounter_();_pmuStartCounters_(pmuCYCLE_COUNTER)
    CYCLE_COUNTER_GET(cycleCount);    //_pmuStopCounters_(pmuCYCLE_COUNTER);cycleCount = _pmuGetCycleCount_()
    //cycleCount is at 24 cycles

    /////////////////////////////////////////////////////////////////////
    //MPU deactivated: 73 cycles: 73-24=49 cycles = 223ns
    CYCLE_COUNTER_START();            //_pmuResetCycleCounter_();_pmuStartCounters_(pmuCYCLE_COUNTER)
        STOP_CPU_TIMER_2();                //rtiREG1->GCTRL &= ~(1U << (rtiCOUNTER_BLOCK1 & 3U))
    CYCLE_COUNTER_GET(cycleCount);    //_pmuStopCounters_(pmuCYCLE_COUNTER);cycleCount = _pmuGetCycleCount_()

    As you can see, this is nearly the same time as I measured with my test pin (49 vs 53 cycles).

    I also ran the benchmark with changed MPU settings (DEVICE_NON_SHAREABLE, PRIV_RW_USER_RW_NOEXEC).
    With the change, I measured 50 cycles: 50-24=26 cycles = 118ns. So the read-modify-write still takes 26 cycles.

    I understand that I have to wait 12 cycles to get the value from the RTI register into a CPU register. But with the changed MPU settings, I wouldn't expect any more latency for writing it back from the CPU into the RTI register.

    So, why does the read-modify-write of a peripheral register still take 26 cycles?

    Thank you,

    Norbert

  • Hi Norbert,

    Do you have optimizations enabled for your benchmarking, and if so, at which level?

    I have benchmarked the following code:

        volatile uint32 cycleCount1;
        volatile uint32 cycleCount2;

        _mpuInit_();
        _pmuInit_();

        _mpuEnable_();
        //_mpuDisable_();

        asm(" dsb");

        CYCLE_COUNTER_START();            //_pmuResetCycleCounter_();_pmuStartCounters_(pmuCYCLE_COUNTER)
        CYCLE_COUNTER_GET(cycleCount1);    //_pmuStopCounters_(pmuCYCLE_COUNTER);cycleCount = _pmuGetCycleCount_()

        //              no Opt    -o2
        // MPU Enable:  25        25
        // MPU Disable: 25        25

        asm(" dsb");

        //Norbert's result with MPU deactivated: 73 cycles: 73-24=49 cycles = 223ns
        CYCLE_COUNTER_START();            //_pmuResetCycleCounter_();_pmuStartCounters_(pmuCYCLE_COUNTER)
            STOP_CPU_TIMER_2();                //rtiREG1->GCTRL &= ~(1U << (rtiCOUNTER_BLOCK1 & 3U))
        CYCLE_COUNTER_GET(cycleCount2);    //_pmuStopCounters_(pmuCYCLE_COUNTER);cycleCount = _pmuGetCycleCount_()

        //              no Opt    -o2
        // MPU Enable:  46        42
        // MPU Disable: 71        67

        cycleCount1 = cycleCount1;
        cycleCount2 = cycleCount2;

        while(1);

    Please find the benchmark results embedded in the code above (in the comments).

    If optimization is disabled, the compiler generates the following code for STOP_CPU_TIMER_2();.

            LDR       A1, $C$CON1           ; [DPU_4_PIPE0] |66|
            LDR       V9, [A1, #0]          ; [DPU_4_PIPE0] |66|
            BIC       V9, V9, #2            ; [DPU_4_PIPE0] |66|
            STR       V9, [A1, #0]          ; [DPU_4_PIPE0] |66|

    If -o2 is used, the code looks like this:

            MOV       A1, #0                ; [DPU_4_PIPE0] |66|
            LDR       V9, [A1, #-1024]      ; [DPU_4_PIPE0] |66|
            BIC       V9, V9, #2            ; [DPU_4_PIPE0] |66|
            STR       V9, [A1, #-1024]      ; [DPU_4_PIPE0] |66|

    Please note the difference in the first instruction. The LDR instruction can take much longer than the MOV instruction, which explains the difference of 4 cycles between no optimization and -o2.

    In the best case, the execution of STOP_CPU_TIMER_2(); took 42-25=17 cycles for me, which I think is a good result if you keep in mind that the LDR V9, [A1, #-1024] alone consumes about 12 cycles (it is a peripheral access that has to finish before the next instruction can execute in this case).

    Does this explanation help you to understand what is going on in the HW?

    Best Regards,
    Christian

  • Norbert,

    It takes the CPU 12 VCLK cycles to complete an LDR instruction that reads from a peripheral register/RAM. An STR instruction also takes the CPU 12 VCLK cycles if the peripheral space is configured as strongly ordered memory. The CPU takes only 2 VCLK cycles to complete an STR instruction if the peripheral space is configured as device memory.

    Thanks and regards,

    Zhaohong
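
The figures quoted in the reply above can be turned into a rough cost model for one read-modify-write. This is only a sketch under simplifying assumptions: it treats the quoted numbers as plain cycle counts in a single clock domain (the reply quotes VCLK cycles, while the PMU counts CPU cycles), it charges one cycle for the BIC, and all names are invented for illustration.

```c
/* Cycle figures taken from the thread; treated as assumptions of this model. */
enum {
    PERIPH_LOAD_CYCLES            = 12, /* LDR from peripheral register/RAM */
    STORE_STRONGLY_ORDERED_CYCLES = 12, /* STR to strongly ordered memory */
    STORE_DEVICE_CYCLES           = 2,  /* STR to device memory */
    ALU_CYCLES                    = 1   /* the BIC between load and store */
};

typedef enum { MEM_STRONGLY_ORDERED, MEM_DEVICE } mem_type_t;

/* Estimate the cycles for LDR + BIC + STR on a peripheral register. */
static unsigned rmw_estimate(mem_type_t type)
{
    unsigned store = (type == MEM_DEVICE) ? STORE_DEVICE_CYCLES
                                          : STORE_STRONGLY_ORDERED_CYCLES;
    return PERIPH_LOAD_CYCLES + ALU_CYCLES + store;
}
```

The model gives 15 cycles for device memory and 25 cycles for strongly ordered memory. The device estimate is close to the 42-25=17 cycles Christian measured with the MPU enabled. The model does not account for everything measured with the MPU disabled.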

  • Hi,

    I did my tests with optimization -o2.

    There seems to be some variation in the measurements with the PMU. I repeated the tests with the following code:

    /* USER CODE BEGIN (1) */
    #include "sys_mpu.h"
    #include "sys_pmu.h"
    #include "reg_rti.h"

    #define CYCLE_COUNTER_INIT()                 _pmuInit_();_pmuEnableCountersGlobal_()
    #define CYCLE_COUNTER_START()                _pmuResetCycleCounter_();_pmuStartCounters_(pmuCYCLE_COUNTER)
    #define CYCLE_COUNTER_STOPGET(cycleCount)    _pmuStopCounters_(pmuCYCLE_COUNTER);cycleCount = _pmuGetCycleCount_()
    #define CYCLE_COUNTER_STOP()                _pmuStopCounters_(pmuCYCLE_COUNTER)
    #define CYCLE_COUNTER_GET(cycleCount)        cycleCount = _pmuGetCycleCount_()


    #define STOP_CPU_TIMER_2(rtiReg)    (rtiReg->GCTRL &= ~2)

    /* USER CODE END */

    /** @fn void main(void)
    *   @brief Application main function
    *   @note This function is empty by default.
    *
    *   This function is called after startup.
    *   The user can use this function to implement the application.
    */

    /* USER CODE BEGIN (2) */
        volatile uint32 cycleCountStart, cycleCountEnd;
        volatile uint32 cycleCount00;
        volatile uint32 cycleCount01;
        volatile uint32 cycleCount02;
        volatile uint32 cycleCount03;
        volatile uint32 cycleCount04;
        volatile uint32 cycleCountRtiDummy;
        volatile uint32 cycleCountRtiPeriph;
        rtiBASE_t  m_rtiBASE_Dummy; //Dummy register in RAM
        uint32 m_initMpu;
        uint32 m_enableMpu;

    /* USER CODE END */
    void main(void)
    {
    /* USER CODE BEGIN (3) */
        static rtiBASE_t *const mp_rtiBASEDummy = &m_rtiBASE_Dummy;
        static rtiBASE_t *const mp_rtiBASEPeriph = rtiREG1;

        m_initMpu = 1;

        m_enableMpu = 2;

        CYCLE_COUNTER_INIT();
        CYCLE_COUNTER_START();
        if(m_initMpu)
        {
            _mpuInit_();
        }

        while(1)
        {
            if(m_initMpu)
            {
                if(m_enableMpu == 0)
                {
                }
                else if(m_enableMpu == 1)
                {
                    _mpuDisable_();
                }
                else
                {
                    _mpuEnable_();
                }
            }

            //Just try if CYCLE_COUNTER_GET() always needs the same time
            asm(" dsb");
            CYCLE_COUNTER_GET(cycleCountStart);
            CYCLE_COUNTER_GET(cycleCountEnd);
            asm(" dsb");
            cycleCount00 = cycleCountEnd - cycleCountStart;

            asm(" dsb");
            CYCLE_COUNTER_GET(cycleCountStart);
            CYCLE_COUNTER_GET(cycleCountEnd);
            asm(" dsb");
            cycleCount01 = cycleCountEnd - cycleCountStart;

            asm(" dsb");
            CYCLE_COUNTER_GET(cycleCountStart);
            CYCLE_COUNTER_GET(cycleCountEnd);
            asm(" dsb");
            cycleCount02 = cycleCountEnd - cycleCountStart;

            asm(" dsb");
            CYCLE_COUNTER_GET(cycleCountStart);
            CYCLE_COUNTER_GET(cycleCountEnd);
            asm(" dsb");
            cycleCount03 = cycleCountEnd - cycleCountStart;

            asm(" dsb");
            CYCLE_COUNTER_GET(cycleCountStart);
            CYCLE_COUNTER_GET(cycleCountEnd);
            asm(" dsb");
            cycleCount04 = cycleCountEnd - cycleCountStart;

            //Do dummy access to RAM
            asm(" dsb");
            CYCLE_COUNTER_GET(cycleCountStart);
                STOP_CPU_TIMER_2(mp_rtiBASEDummy);                //rtiReg->GCTRL &= ~2
            CYCLE_COUNTER_GET(cycleCountEnd);
            asm(" dsb");
            cycleCountRtiDummy = cycleCountEnd - cycleCountStart;

            //Do access to Register
            asm(" dsb");

            CYCLE_COUNTER_GET(cycleCountStart);
                STOP_CPU_TIMER_2(mp_rtiBASEPeriph);                //rtiReg->GCTRL &= ~2
            CYCLE_COUNTER_GET(cycleCountEnd);
            asm(" dsb");
            cycleCountRtiPeriph = cycleCountEnd - cycleCountStart;

            cycleCount00 = cycleCount00;
            cycleCount01 = cycleCount01;
            cycleCount02 = cycleCount02;
            cycleCount03 = cycleCount03;
            cycleCount04 = cycleCount04;
            cycleCountRtiDummy = cycleCountRtiDummy;
            cycleCountRtiPeriph = cycleCountRtiPeriph;
        }
    /* USER CODE END */
    }
    and I got these results:

    After all, I get a difference of about 16 cycles in access time between RAM and register. This fits the theory.
    I think measuring the time for the PMU access and subtracting it from the register access time was not the right way.
    The new method, taking the difference between the register access time and the RAM access time, works better.

    Thanks,

    Norbert
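
The differential method from the last post (time a RAM access and a register access the same way, then subtract) can be sketched as two small helpers. The function names are hypothetical, and the counter readings are assumed to come from a free-running 32-bit cycle counter such as the one read by `_pmuGetCycleCount_()`.

```c
#include <stdint.h>

/* Cycles elapsed between two readings of a free-running 32-bit counter.
 * Unsigned subtraction is modulo 2^32, so a single counter wrap between
 * the two readings still gives the right result. */
static uint32_t elapsed_cycles(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Extra cycles of the access under test compared with a reference access
 * (for example the same read-modify-write on a dummy structure in RAM).
 * The overhead of reading the counter cancels out in the subtraction. */
static int32_t extra_cycles(uint32_t ref_start, uint32_t ref_end,
                            uint32_t test_start, uint32_t test_end)
{
    return (int32_t)(elapsed_cycles(test_start, test_end)
                   - elapsed_cycles(ref_start, ref_end));
}
```

Because both measurements include the same counter-read overhead, no separate calibration value (like the 24 or 25 cycles subtracted earlier in the thread) is needed.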