TMS570 peripheral speed

Other Parts Discussed in Thread: HALCOGEN

We are using a TMS570LS20x USB Stick for development hardware and the Code Composer Studio compiler (with HALCOGEN providing the setup parameters).

This seems to be running much slower than I would expect for a processor of this power. To test the speed of the GPIO pins, we have implemented the pin-toggle code in assembler from application note SPNA138. This toggles bits 0, 2, 4 and 6, with a jump between the set and the clear.

Monitoring a pin on the oscilloscope, we see a high time of 158 ns and a low time of 154 ns. With an internal clock frequency of 140 MHz (7.14 ns per cycle), this implies about 22 system clock cycles per instruction! Also, with this code, shouldn't we see a 2:1 high-to-low ratio, given that there is a jump after the set?

I would have expected a pulse width closer to 21 ns. Even with VCLK at 100 MHz (10 ns per cycle) I would expect around 30 ns.
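For reference, the rough arithmetic behind those figures (assuming, for illustration, roughly three instructions per edge: store, branch, store):

    measured:  158 ns / 7.14 ns per cycle  =  ~22 cycles
    expected:  3 cycles x 7.14 ns  =  ~21 ns   (internal clock at 140 MHz)
               3 cycles x 10 ns    =   30 ns   (VCLK at 100 MHz)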

Can you suggest what we might have done wrong?

Philip

6740.testcode1.txt

  • Hi Philip,

    The L2 AXI interface used on current TMS570LS Cortex-R devices is optimized for high throughput rather than for low latency.  However, there are a couple of things you can review that will improve the overall performance.

    • Check the memory protection unit and review the memory attributes set for the peripheral region.  In particular, if write buffering is not enabled, the L2 AXI interface will wait for one write to complete before issuing the next. The default configuration of the peripheral region after boot is strongly-ordered, non-buffered.  For the fastest writes to peripherals, configure the peripheral memory region as Device, buffered; this can often cut write times in half (see the sketch after this list).
    • Confirm the clock configuration of the peripherals as opposed to the CPU and interconnect.  For the lowest latency, you want the divider between the clock domains to be as small as the datasheet allows.
    • Take advantage of the 64-bit interface: use word or double-word accesses rather than the byte or half-word transactions in the example code.  A 16-bit transaction takes the same amount of time as a 64-bit transaction from the CPU's and SCR's perspective (see the ARM Cortex-R4 TRM r1p3, sections 9.3.5 "Non-cacheable reads" and 9.3.6 "Non-cacheable writes").
    • Use bursts rather than singles when possible (i.e. LDM rather than LDR); the interface is optimized for burst transactions.
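    As an illustration only (this is not the attached test code), here is a minimal C sketch of the word-wide toggle described above, assuming the HALCoGen-generated gio.h definitions (gioPORTA with its DSET/DCLR registers). The buffering change itself is made in the HALCoGen MPU configuration (or the generated _mpuInit_, if present in your version) by marking the peripheral frame as Device, buffered instead of strongly-ordered:

        /* Sketch only: assumes HALCoGen has generated gio.h for this device. */
        #include "gio.h"

        void toggle_test(void)
        {
            gioInit();                            /* HALCoGen GIO setup                  */
            for (;;)
            {
                /* One 32-bit store per edge: set bits 0, 2, 4, 6, then clear them. */
                gioPORTA->DSET = 0x00000055U;
                gioPORTA->DCLR = 0x00000055U;
            }
        }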

    Regards,

    Karl

  • Hi Karl

    Thank you for your reply. 

    I have now been through your suggestions and seen no improvement.

    Pipeline mode was enabled previously, so I disabled it to see if there was any impact; there was none. I have also used the ECLK pin to check all of the internal clocks and they are all as expected:

    140 MHz for GCLK/HCLK/VCLK2

    70 MHz for VCLK1 and AVCLK1

     

    Given the simplicity of the test program, the core clock speed and the bus speeds, I would expect to be able to toggle the I/O much faster. I have also tried the same tests with the sample code that comes with the TMS570 Safety demo, with the same outcome.

    Do you have any more insight into this?

     

    Thank you

     

  • Hi Philip,

    The pipeline buffer control in the flash wrapper affects only flash accesses, not peripheral accesses.

    Can you please check the settings of your MPU and provide feedback?  Changing the buffering settings should make a big difference on writes; I have not seen a case where changing this setting made no difference.

    Are you measuring the clock cycles with the PMU or with the RTI?  If you use the RTI, it inserts additional peripheral accesses which can affect your cycle count. I would recommend using the PMU internal to the CPU if possible.
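    For reference, a minimal sketch of cycle-counting with the PMU, assuming the HALCoGen-generated sys_pmu.h helpers (_pmuInit_, _pmuEnableCountersGlobal_, _pmuResetCycleCounter_, _pmuStartCounters_, _pmuStopCounters_, _pmuGetCycleCount_) are available in your project:

        /* Sketch only: assumes HALCoGen has generated sys_pmu.h and gio.h for this device. */
        #include "sys_pmu.h"
        #include "gio.h"

        uint32 measure_toggle_cycles(void)
        {
            uint32 cycles;

            _pmuInit_();                           /* reset the PMU counters              */
            _pmuEnableCountersGlobal_();           /* enable counting globally            */
            _pmuResetCycleCounter_();
            _pmuStartCounters_(pmuCYCLE_COUNTER);

            gioPORTA->DSET = 0x00000055U;          /* the accesses being measured         */
            gioPORTA->DCLR = 0x00000055U;

            _pmuStopCounters_(pmuCYCLE_COUNTER);
            cycles = _pmuGetCycleCount_();         /* CPU cycles, including some overhead */
            return cycles;
        }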

    If you apply the suggestions that I note here and in my last message, you should be able to get the average peripheral access time down to roughly 10 CPU cycles.  The interface is designed for high throughput from multiple masters rather than for low latency.

     

    Regards,

    Karl

  • Philip,

    Shall we keep this thread open?  Or have you applied Karl's suggestions already?