The problem we are having is slow code execution on the ARM side. PLL1 is set so that the DSP runs at 594 MHz (the value in the PLL1 multiplier register is 21), and the ARM therefore automatically gets half of that, 297 MHz. We have verified that the DSP subsystem runs at the correct speed. On the ARM side, we use the following simple test code to measure the speed:
#pragma CODE_SECTION(".text3")
void speedTest()
{
    Uint32 debC = TIMER0->TIM12;      // timer count before the NOPs
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    __asm (" NOP");
    debC = TIMER0->TIM12 - debC;      // elapsed timer ticks
}
Below is the assembly listing for the code above:
00000008 E59FC074    LDR    V9, CON1         ; |129|
0000000c E59CC000    LDR    V9, [V9, #0]     ; |129|
00000010 E59CC010    LDR    V9, [V9, #16]    ; |129|
00000014 E58DC000    STR    V9, [SP, #0]     ; |129|
                     .dwpsn "arm.cpp",130,2
00000018 E1A00000    NOP
                     .dwpsn "arm.cpp",131,2
0000001c E1A00000    NOP
                     .dwpsn "arm.cpp",132,2
00000020 E1A00000    NOP
                     .dwpsn "arm.cpp",133,2
00000024 E1A00000    NOP
                     .dwpsn "arm.cpp",134,2
00000028 E1A00000    NOP
                     .dwpsn "arm.cpp",135,2
0000002c E1A00000    NOP
                     .dwpsn "arm.cpp",136,2
00000030 E1A00000    NOP
                     .dwpsn "arm.cpp",137,2
00000034 E1A00000    NOP
                     .dwpsn "arm.cpp",138,2
00000038 E1A00000    NOP
                     .dwpsn "arm.cpp",139,2
0000003c E1A00000    NOP
                     .dwpsn "arm.cpp",140,1
00000040 E59F003C    LDR    A1, CON1         ; |140|
00000044 E5900000    LDR    A1, [A1, #0]     ; |140|
00000048 E59DC000    LDR    V9, [SP, #0]     ; |140|
0000004c E5900010    LDR    A1, [A1, #16]    ; |140|
00000050 E040C00C    SUB    V9, A1, V9       ; |140|
00000054 E58DC000    STR    V9, [SP, #0]     ; |140|
The execution times for different PLL multiplier register values are as follows:
- PLL mult 21: 15 timer ticks (550ns)
- PLL mult 15: 18 timer ticks (670ns)
- PLL mult 10: 22 timer ticks (810ns)
- PLL mult 5: 33 timer ticks (1200ns)
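For reference, the tick-to-nanosecond conversion behind the list above can be sketched roughly as follows. The 27 MHz timer clock is an assumption: it is not stated anywhere in this post and is only inferred from 550 ns / 15 ticks, which is about 36.7 ns per tick. The names `ticks_to_ns` and `ASSUMED_TIMER_HZ` are hypothetical and are not part of our project code.

```c
#include <assert.h>

/* ASSUMPTION: timer input clock of 27 MHz. This value is not given in the
 * post; it is inferred from the quoted 550 ns for 15 ticks. */
#define ASSUMED_TIMER_HZ 27000000.0

/* Convert a timer tick count to nanoseconds for a timer clocked at timer_hz. */
double ticks_to_ns(unsigned long ticks, double timer_hz)
{
    assert(timer_hz > 0.0);               /* reject a nonsensical clock rate */
    return (double)ticks * 1.0e9 / timer_hz;
}
```

With the assumed 27 MHz clock, `ticks_to_ns(15, ASSUMED_TIMER_HZ)` gives about 556 ns, which is in line with the rounded 550 ns quoted for PLL multiplier 21.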
We also measured the execution time with an oscilloscope: we drive one I/O pin high before the first NOP and pull it low after the timer value has been read again. For this we used PLL multiplier 21 (ARM clock 297 MHz). The execution time was 700 ns, which tells us that the timer tick measurement was correct: about 150 ns of extra time comes from setting the I/O pin to logic 1 and back to 0. Enabling or disabling the cache has no effect on the execution speed. If the code is placed in DDR memory, enabling the cache has a clear effect.
But why is the execution speed so low? Or are we expecting too much?
Where on earth can we find a definition of the ARM instruction set, including the number of cycles each assembly instruction takes? We found one incomplete document, marked "Retired", on the ARM infocenter website, but it did not cover all instructions. We estimated that the assembly code above takes about 30 cycles, which corresponds to about 100 ns (assuming 1 clock cycle = 1 machine cycle). By this estimate, performance is too low by a factor of about 5.
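To make the estimate explicit, here is a minimal sketch of the arithmetic, using the 30-cycle guess, the 297 MHz ARM clock and the 550 ns timer measurement from above. The function names `cycles_to_ns` and `slowdown_factor` are hypothetical helpers written for this post, and the one-clock-per-instruction-cycle assumption is our own simplification, not a documented figure.

```c
#include <assert.h>

/* Time in nanoseconds for `cycles` core clock cycles at core_hz.
 * ASSUMPTION: 1 clock cycle = 1 machine cycle, as stated in the post. */
double cycles_to_ns(unsigned long cycles, double core_hz)
{
    assert(core_hz > 0.0);                /* reject a nonsensical clock rate */
    return (double)cycles * 1.0e9 / core_hz;
}

/* How many times slower the measured time is than the expected time. */
double slowdown_factor(double measured_ns, double expected_ns)
{
    assert(expected_ns > 0.0);            /* avoid division by zero */
    return measured_ns / expected_ns;
}
```

For example, `cycles_to_ns(30, 297000000.0)` is about 101 ns, and `slowdown_factor(550.0, 101.0)` is roughly 5.4, which is where the "factor of 5" comes from.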
Any ideas?