
Low code execution speed in 6443 ARM

Other Parts Discussed in Thread: OMAPL138

We are seeing low code execution speed on the ARM side. PLL1 is set so that the DSP runs at 594 MHz (the value in the PLL1 multiplier register is 21), and the ARM therefore automatically gets half of that, 297 MHz. The DSP subsystem clock has been verified to be correct. On the ARM side, we use the following simple test code to measure the speed:

#pragma CODE_SECTION(".text3")
void speedTest()
{
    Uint32 debC = TIMER0->TIM12;    // timer count before the NOPs
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    debC = TIMER0->TIM12 - debC;    // elapsed timer ticks
}

Code below is the assembly listing of the code above:

00000008 E59FC074    LDR V9, CON1            ; |129|
0000000c E59CC000    LDR V9, [V9, #0]        ; |129|
00000010 E59CC010    LDR V9, [V9, #16]       ; |129|
00000014 E58DC000    STR V9, [SP, #0]        ; |129|
                     .dwpsn "arm.cpp",130,2
00000018 E1A00000    NOP
                     .dwpsn "arm.cpp",131,2
0000001c E1A00000    NOP
                     .dwpsn "arm.cpp",132,2
00000020 E1A00000    NOP
                     .dwpsn "arm.cpp",133,2
00000024 E1A00000    NOP
                     .dwpsn "arm.cpp",134,2
00000028 E1A00000    NOP
                     .dwpsn "arm.cpp",135,2
0000002c E1A00000    NOP
                     .dwpsn "arm.cpp",136,2
00000030 E1A00000    NOP
                     .dwpsn "arm.cpp",137,2
00000034 E1A00000    NOP
                     .dwpsn "arm.cpp",138,2
00000038 E1A00000    NOP
                     .dwpsn "arm.cpp",139,2
0000003c E1A00000    NOP
                     .dwpsn "arm.cpp",140,1
00000040 E59F003C    LDR A1, CON1            ; |140|
00000044 E5900000    LDR A1, [A1, #0]        ; |140|
00000048 E59DC000    LDR V9, [SP, #0]        ; |140|
0000004c E5900010    LDR A1, [A1, #16]       ; |140|
00000050 E040C00C    SUB V9, A1, V9          ; |140|
00000054 E58DC000    STR V9, [SP, #0]        ; |140|


The execution times measured with different PLL multiplier register values are as follows:
  • PLL mult 21: 15 timer ticks (550ns)
  • PLL mult 15: 18 timer ticks (670ns)
  • PLL mult 10: 22 timer ticks (810ns)
  • PLL mult 5: 33 timer ticks (1200ns)
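One way to read the four measurements above is to fit them to a straight line, time = fixed part + cycles x ARM clock period, which separates the part that scales with the ARM clock from the part that does not. The sketch below does this with a plain least-squares fit. It rests on two assumptions that are not stated in the post: a 27 MHz reference clock feeding both PLL1 and Timer0 (consistent with 15 ticks being about 550 ns), and an effective PLL multiplier of register value + 1 (consistent with 21 giving 594 MHz). All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumptions (not stated in the thread): 27 MHz reference clock for both
// PLL1 and Timer0, and effective PLL multiplier = register value + 1.
constexpr double kRefMHz = 27.0;

struct Sample {
    unsigned pllmReg;   // PLL multiplier register value from the post
    unsigned ticks;     // elapsed Timer0 ticks from the post
};

struct Fit {
    double cycles;      // slope: ARM cycles that scale with the ARM clock
    double fixedNs;     // intercept: time that does not scale with it
};

// ARM clock period in ns for a given multiplier register value.
double armPeriodNs(unsigned pllmReg) {
    double armMHz = kRefMHz * (pllmReg + 1) / 2.0;  // ARM clock = PLL1 / 2
    return 1000.0 / armMHz;
}

// Least-squares fit of  t = fixedNs + cycles * armPeriod.
Fit fitSamples(const Sample* s, std::size_t n) {
    assert(n >= 2);
    const double tickNs = 1000.0 / kRefMHz;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double x = armPeriodNs(s[i].pllmReg);   // ns per ARM cycle
        double y = s[i].ticks * tickNs;         // measured time in ns
        sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    return Fit{slope, (sy - slope * sx) / n};
}

// The four measurements listed above.
const Sample kMeasured[] = {{21, 15}, {15, 18}, {10, 22}, {5, 33}};
```

Under these assumptions, `fitSamples(kMeasured, 4)` comes out at roughly 70 to 75 ARM cycles plus about 300 ns that does not change with the ARM clock. The timer resolution is only one tick (about 37 ns), so this is a rough indication that part of the time is not spent executing instructions at ARM clock speed, not a diagnosis.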
 

We also measured this with an oscilloscope: we drive one I/O pin to 1 before the first NOP and pull it back down after the timer value has been read again. For this we used PLL multiplier 21 (ARM clock 297 MHz). The execution time was 700 ns, which tells us that the timer tick measurement was correct: about 150 ns of extra time comes from setting the I/O pin to logic 1 and back to 0. Enabling or disabling the cache has no effect on the execution speed. If the code is placed in DDR memory, enabling the cache has a clear effect.
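The oscilloscope cross-check described above could be sketched as follows. This is only an illustration: `GpioPin`, `pinHigh`, `pinLow` and `scopeBracket` are hypothetical names, and the set/clear register pointers stand in for the device's GPIO registers, whose addresses are not given in the post.

```cpp
#include <cstdint>

// Hypothetical register view: setReg and clrReg stand in for a GPIO bank's
// write-1-to-set and write-1-to-clear registers. On real hardware they would
// point at the device's GPIO registers (addresses not given in the post).
struct GpioPin {
    volatile std::uint32_t* setReg;
    volatile std::uint32_t* clrReg;
    std::uint32_t mask;             // bit of the pin to toggle
};

inline void pinHigh(const GpioPin& p) { *p.setReg = p.mask; }
inline void pinLow(const GpioPin& p)  { *p.clrReg = p.mask; }

// Runs `body` between a rising and a falling edge on the pin, so the pulse
// width seen on the oscilloscope is the body's run time plus the cost of the
// two register writes.
template <typename F>
void scopeBracket(const GpioPin& pin, F body) {
    pinHigh(pin);
    body();
    pinLow(pin);
}
```

Calling `scopeBracket` once with an empty body gives the pulse width of the two register writes alone, which can then be subtracted from the pulse measured around the real test code.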

But, why is the execution speed so low? Or are we expecting too much?

Where on earth can we find a definition of the ARM instruction set that includes the number of cycles each instruction takes? We found one incomplete document, marked "Retired" on the infocenter.arm website, which did not include all instructions. We estimated that the assembly code above takes about 30 cycles, which corresponds to about 100 ns (assuming 1 clock cycle = 1 machine cycle). So by this estimate the performance is too low by a factor of about 5.
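The estimate above is simple arithmetic and can be written down as a small helper. This is a sketch with hypothetical function names; the 30-cycle count is the poster's own rough estimate, not a figure from an ARM document.

```cpp
// Time in nanoseconds for `cycles` clock cycles at `clockMHz`.
constexpr double cyclesToNs(double cycles, double clockMHz) {
    return cycles * 1000.0 / clockMHz;
}

// How many times slower the measured time is than the cycle-count estimate.
constexpr double slowdown(double measuredNs, double cycles, double clockMHz) {
    return measuredNs / cyclesToNs(cycles, clockMHz);
}
```

With the numbers from the post, `cyclesToNs(30, 297.0)` is about 101 ns and `slowdown(550.0, 30, 297.0)` is about 5.4.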


Any ideas?




 

  • Seppo,

    I cannot comment much on why you are seeing this performance, but here are some pointers:


    - The instruction set is documented in the ARM "Architecture Reference Manual" (ARM DDI 0100). It does not seem to be widely available on the ARM infocenter:
    http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.set.arm9/index.html
    but you should be able to find it with a search engine (though probably not the latest version).
    For example, I found a version that documents ARMv5T here:
        http://www.altera.com/literature/third-party/archives/ddi0100e_arm_arm.pdf
    It is probably the same one you looked at.

    - Regarding performance, you could contact ARM to see whether they can provide some typical benchmarks or even some test code.

    I think the performance will depend on several factors: which memory is used for code and data (i.e. on-chip memory vs. external memory), whether the instruction and data caches are enabled, whether the MMU is used, etc.
    Since ARMv5T-based devices are often used with a high-level OS such as Linux, all of these blocks are already configured there and the user usually does not need to look at them.
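In a setup without an OS, the cache and MMU settings mentioned above can be checked by reading the CP15 control register (c1) and looking at its enable bits. The sketch below only decodes a value that has already been read on the target (for example with `MRC p15, 0, <Rd>, c1, c0, 0`); the struct and function names are hypothetical. The `dcacheEffective` field reflects my understanding that on the ARM926EJ-S the data cache only takes effect while the MMU is enabled; please verify this against the ARM926EJ-S Technical Reference Manual.

```cpp
#include <cstdint>

// Enable bits in the ARM926EJ-S CP15 c1 control register.
constexpr std::uint32_t kMmuEnable    = 1u << 0;   // M bit
constexpr std::uint32_t kDCacheEnable = 1u << 2;   // C bit
constexpr std::uint32_t kICacheEnable = 1u << 12;  // I bit

struct CacheState {
    bool mmu;
    bool dcache;
    bool icache;
    bool dcacheEffective;  // assumption: D-cache needs the MMU to be on
};

// `ctrl` is the control register value read on the target with
// MRC p15, 0, <Rd>, c1, c0, 0 (the read itself is not shown here).
CacheState decodeControl(std::uint32_t ctrl) {
    CacheState s{};
    s.mmu    = (ctrl & kMmuEnable) != 0;
    s.dcache = (ctrl & kDCacheEnable) != 0;
    s.icache = (ctrl & kICacheEnable) != 0;
    s.dcacheEffective = s.mmu && s.dcache;
    return s;
}
```

For example, a control value with only the C and I bits set would decode as both caches enabled but `dcacheEffective` false, which would be one possible reason for a cache setting appearing to have no effect.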

    - For AM1xxx/OMAP-L13x devices we do provide some non-OS-based examples of how to use the peripherals. They do not apply directly to DM644x, but since the same ARM926 CPU is used, it is likely that all the CPU-side configuration could be reused.
    SitaraWare (for AM1xxx) / StarterWare (for OMAP-L13x):
     http://software-dl.ti.com/dsps/dsps_public_sw/c6000/web/omapl138_starterware/latest/index_FDS.html
     http://focus.ti.com/docs/toolsw/folders/print/sitaraware.html
     http://processors.wiki.ti.com/index.php/SitaraWare
    StarterWare (for ARM + C674x) and SitaraWare (for ARM) are separate packages today, but they will be merged in the future. Note that the two packages may provide different software examples.

    - Be aware that the standard TI software for DM644x is the DVSDK, which uses Linux. We do not plan to make SitaraWare/StarterWare available for DM644x.

    Hope it helps,

    Anthony