Starterware on beaglebone is going slow?

Karl Albertsson

Hello!

I have some code running on the beaglebone, but it seems that performance is not that great. I am currently running a timer interrupt with some code in it. But the code that is being run in this is not going especially fast. I have one function that toggles a gpio pin. The difference between running the register command and running the register command via a function is 1.5us. So what I am saying is basically that a single function call is taking up 1.5us of processing time. That seems very bad.

Is there anything I'm missing here? Is the default clock rate of the BeagleBone with StarterWare not set to 500 or 720 MHz? I've been troubleshooting this for several hours and I just cannot find what's wrong, because surely a function call should not take 1.5 us?
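
To put the 1.5 us figure in perspective, here is a rough conversion from time to CPU cycles. It assumes the core really runs at one of the two clock rates mentioned above; the helper name is made up for illustration and the code touches no hardware.

```c
#include <stdint.h>

/* Number of CPU cycles that elapse in `nanoseconds` at a core clock of
 * `cpu_hz`. Pure arithmetic, so it can be tried on a PC. */
uint64_t cycles_for_ns(uint64_t nanoseconds, uint64_t cpu_hz)
{
    return (nanoseconds * cpu_hz) / 1000000000ULL;
}
```

Calling cycles_for_ns(1500, 500000000) gives 750 and cycles_for_ns(1500, 720000000) gives 1080. Several hundred cycles for one call and return is far more than the instructions themselves should need, which suggests the time is going somewhere else.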

Regards

Karl

over 13 years ago

0 Karl Albertsson over 13 years ago

Prodigy 85 points

I have done some more tests. I have also enabled caches as in the demo application. Even then an application that takes 0.3 seconds to run on a blackfin at 400MHz takes about 5 seconds with the BeagleBone running starterware... There must be something wrong, anybody have any ideas?

0 Madhvapathi Sriram over 13 years ago in reply to Karl Albertsson

Intellectual 485 points

Hmm... this problem sounds interesting, though I know it's paining you :-)

Just putting my understanding of the problem here.

You have a function which does some task (which is unusual, but let's have it that way), in the context of the timer interrupt.

You set/reset a GPIO on entering the interrupt handler, and reset/set the GPIO before exiting. You see that the pulse duration is 5 secs.

Are you doing any intensive computation? Like a lot of divisions, or complex math etc?

Though I have never worked on Blackfin processors, my two cents:

I see that Blackfin has a DSP integrated, and that could be one reason the task finishes faster and thus the rate of toggling is faster? The Sitara processors are ARM only and use runtime libraries. Did you try different compiler tool chains (TI, GCC, IAR) on the Sitara?
How about comparing the two performances with just an empty handler. That way we will analyze the interrupt response times and plain interrupt latencies. Would that be a better way of comparison?

While I continue to think, hope this helps..

Regards,

Madhvapathi Sriram

0 Karl Albertsson over 13 years ago in reply to Madhvapathi Sriram

Prodigy 85 points

Hi Madhvapathi!

Sorry about the confusion, I will try to clarify. I currently have two programs.

The first program consists of basically just a timer interrupt (dmtimer2) running at 50 kHz. The code inside the interrupt is not especially complex: it toggles a GPIO on and off and does some other calculations in between. While looking at it with my analyzer I noticed that the interrupts did not occur at a steady 50 kHz, but rather at about 30 kHz. So I began stripping down the code. During this time I noticed that just removing one function call and replacing it with what was inside of it would reduce the computation time of the interrupt by 1.5 us, which seemed awfully long for a function call.
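
As a rough sketch of the time budget involved here (helper names are hypothetical, and the 500 MHz core clock used below is an assumption, not a measured value):

```c
#include <stdint.h>

/* Period of a periodic interrupt in nanoseconds for a rate given in Hz.
 * Returns 0 for a rate of 0 to avoid dividing by zero. */
uint64_t period_ns(uint64_t rate_hz)
{
    return rate_hz ? 1000000000ULL / rate_hz : 0;
}

/* CPU cycles available between two interrupts at a given core clock.
 * Returns 0 for a rate of 0 to avoid dividing by zero. */
uint64_t cycles_per_period(uint64_t rate_hz, uint64_t cpu_hz)
{
    return rate_hz ? cpu_hz / rate_hz : 0;
}
```

A 50 kHz timer leaves 20000 ns per interrupt, or 10000 cycles at an assumed 500 MHz. An observed rate near 30 kHz corresponds to roughly 33333 ns between interrupts, so the handler appears to be overrunning its 20 us slot.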

So, I went back to my other program I wrote earlier (it is just a bunch of integer computations, very few floats). I had never benchmarked this before, I just made sure it could be run properly. Now, while running the program I notice that the execution time of it is very slow as well. The same program running on a blackfin is around 40 times faster.

By now I'm basically thinking that it is not the interrupt in particular that is slow, but the computation of the processor as a whole. So I looked around on the forums, and I saw that some people were able to improve performance by enabling the caches as written in the demo application. So I took that code and applied it to my programs. The first program with the timer is now able to run at 50 kHz, but replacing a single function call with its content still tells me that function calls are very slow (>1 us), so it did not seem to offer that much of an improvement.

I now try to apply the cache code to my second program, this helps a bit, and basically cuts the execution time in half (from 10 seconds down to 5), but the blackfin is still 20 times faster.

Now I'm starting to wonder if the processor is not running at full speed? Is the bootloader putting it at 500 MHz? Is there some other pipelining issues at hand? It just doesn't seem right that a single function call takes over 1us, or that the performance of the processor is basically at least 20 times slower than I would expect.

I have currently only used CCS and the TMS470 compiler. I have not been digging into any optimization options (although there do not seem to be many). I have tried both running the code via the debugger and booting from memory card, and the performance seems to be the same. To me it seems like some sort of initialization is missing, because surely the performance must be greater...

Thanks for the help!

Regards

Karl

0 Karl Albertsson over 13 years ago in reply to Karl Albertsson

Prodigy 85 points

Ok, I finally fixed it!

It seems like the demo application did not use the D-cache, only the I-cache. When I enabled the D-cache (I used the code from the uartEdma_Cache project), things sped up drastically; it is now 50% faster than the Blackfin :)
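
One way to double-check which caches are actually on is to look at the ARMv7-A System Control Register (SCTLR). The bit positions below come from the ARM architecture; the helper names are hypothetical and not part of StarterWare. The helpers take the register value as a parameter so they can be tried on a PC. On the target the value would have to be read in a privileged mode, for example with "mrc p15, 0, r0, c1, c0, 0".

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit positions in the ARMv7-A System Control Register (SCTLR). */
#define SCTLR_M (1u << 0)   /* MMU enable */
#define SCTLR_C (1u << 2)   /* data and unified cache enable */
#define SCTLR_I (1u << 12)  /* instruction cache enable */

/* Each helper decodes one enable bit from a raw SCTLR value. */
bool sctlr_mmu_enabled(uint32_t sctlr)    { return (sctlr & SCTLR_M) != 0; }
bool sctlr_dcache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_C) != 0; }
bool sctlr_icache_enabled(uint32_t sctlr) { return (sctlr & SCTLR_I) != 0; }
```

A value with bit 12 set and bit 2 clear would match the situation described here: I-cache on, D-cache off.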

Thanks for the help!

0 Thomas Laudan over 11 years ago in reply to Karl Albertsson

Prodigy 75 points

Hi,

I have a similar problem with my Beaglebone and Starterware. To check out the performance of Starterware and Beaglebone I do the following in the main-Function:

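while(1)
{
   /* Drive the pin high, then burn time with 35000 sqrt() calls */
   GPIOPinWrite(GPIO_INSTANCE_ADDRESS, 28, GPIO_PIN_HIGH);

   for(i=0; i<35000; i++)
   {
      x = sqrt(3.141592654);
   }

   /* Drive the pin low and repeat the same workload */
   GPIOPinWrite(GPIO_INSTANCE_ADDRESS, 28, GPIO_PIN_LOW);

   for(i=0; i<35000; i++)
   {
      x = sqrt(3.141592654);
   }
}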

As you can see, I toggle a pin and in between I do sqrt(pi) 35000 times. Using an oscilloscope I measure the period/frequency on that pin.

At first I got: T = 18 s !!!!

Then I found this post and enabled Cache and MMU like in the example: \\StarterWare\examples\evmskAM335x\uart_edma\uartEcho_edma.c

After that I measured: T = 500 ms / f = 2 Hz

This is still quite slow! I run a program with the same functionality using the Linux Distribution Angström on the beaglebone, there I measured a period of T = 1.1 ms / f = 1 kHz .
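
To compare the three measurements on a per-iteration basis, here is a small sketch. It assumes one pin period covers both loops above (2 x 35000 iterations) and that the measured period is given in nanoseconds; the names are hypothetical.

```c
#include <stdint.h>

/* Iterations per full pin period in the loop above: two loops of 35000. */
#define CALLS_PER_PERIOD (2u * 35000u)

/* Average time per loop iteration in nanoseconds, given the measured pin
 * period in nanoseconds. Loop and GPIO overhead are included in the
 * result, and integer division rounds down. */
uint64_t ns_per_call(uint64_t measured_period_ns)
{
    return measured_period_ns / CALLS_PER_PERIOD;
}
```

This works out to about 257 us per iteration for T = 18 s, about 7.1 us for T = 500 ms, and about 16 ns for T = 1.1 ms under Linux. Since the argument to sqrt is a constant, a compiler may be free to evaluate it once or at build time, so the Linux figure might not reflect 70000 real sqrt calls. That is an assumption worth checking in the generated assembly.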

So what is going on with the Starterware being so slow? Do I have to enable the Cache / MMU differently? Does anybody have an idea?

0 Thomas Laudan over 11 years ago in reply to Thomas Laudan

Prodigy 75 points

I found out that the FPU (Neon and VFP) is not enabled at startup. I guess this might be the reason for Starterware being so slow!

I'll change my program by doing just integer calculations and compare the speed between Linux and Starterware again. Then I'll try to enable the FPU and try my float calculations again.
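
For reference, on ARMv7-A cores such as the Cortex-A8, switching on the VFP/NEON unit at runtime comes down to two register updates: granting access to coprocessors 10 and 11 in CPACR, then setting the EN bit in FPEXC. The sketch below only computes the new register values so that it can be tried on a PC. The function names are hypothetical, and whether a given StarterWare startup file already performs these steps is not confirmed here.

```c
#include <stdint.h>

/* CPACR: cp10 and cp11 together control VFP/NEON access.
 * 0b11 in each 2-bit field means full access. */
#define CPACR_CP10_FULL (3u << 20)
#define CPACR_CP11_FULL (3u << 22)

/* FPEXC.EN: global enable bit for the VFP/NEON unit. */
#define FPEXC_EN (1u << 30)

/* Value to write back to CPACR. On the target the register is read
 * and written via CP15 (c1, c0, 2) in a privileged mode. */
uint32_t cpacr_with_fpu_access(uint32_t cpacr)
{
    return cpacr | CPACR_CP10_FULL | CPACR_CP11_FULL;
}

/* Value to write back to FPEXC (vmrs/vmsr on the target). */
uint32_t fpexc_with_enable(uint32_t fpexc)
{
    return fpexc | FPEXC_EN;
}
```

Starting from zero, the results are 0x00F00000 for CPACR and 0x40000000 for FPEXC. Other bits in the input values are left untouched.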

0 qxc over 11 years ago in reply to Thomas Laudan

Genius 5820 points

Thomas Laudan said:
I found out that the FPU (Neon and VFP) is not enabled at startup.

That's interesting...I recently removed some floating point calculations just because they have been incredibly slow. Seems this is the same reason. So how can FPU be enabled?

0 Thomas Laudan over 11 years ago in reply to qxc

Prodigy 75 points

Hey Hans,

The Cortex-A8 integrates two FPUs: the VFP coprocessor and the Neon engine. For more detailed information read this:

http://processors.wiki.ti.com/index.php/StarterWare_NeonVFP

There these functionalities are explained and it also explains the neonVFPBenchmarkApp.c example.

But here is my short description of how to enable the FPU:

I guess you also have the projects "drivers", "platform", "system" and "utils" in your Project Explorer in Code Composer Studio. Right-click on "system", go to Properties, Build, ARM Compiler, Processor Options, and select VFPv3 under "Specify floating point support". Then click OK. Also check that under Build, ARM Compiler, Advanced Options, Runtime Model Options the option "Generate SIMD instructions targeting Neon" is selected.

Now you rebuild the Project system.

Next do the same for your Project. You can choose to either just select the VFP coprocessor by just selecting VFPv3 or if you also want the Compiler to use Neon, you also have to select the "Generate SIMD instructions targeting Neon".

Now rebuild your project too. There you go.
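
A quick way to check that the settings above actually reached the compiler is to test its predefined macros. This is only a sketch: __ARM_FP and __ARM_NEON are the ACLE macros used by GCC and Clang, while __TI_VFP_SUPPORT__ and __TI_NEON_SUPPORT__ are what I understand the TI ARM compiler to define, so treat the TI names as an assumption to verify against the compiler manual. The function names are hypothetical.

```c
/* Returns 1 if this translation unit was compiled for hardware floating
 * point, 0 otherwise. The TI macro name is an assumption (see above). */
int built_with_hw_float(void)
{
#if defined(__TI_VFP_SUPPORT__) || defined(__ARM_FP)
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the compiler was allowed to generate NEON instructions,
 * 0 otherwise. The TI macro name is an assumption (see above). */
int built_with_neon(void)
{
#if defined(__TI_NEON_SUPPORT__) || defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```

Printing both values once at startup, for example over the UART, shows whether the "system" library and the application were really built with the same floating point options.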

Moreover, your performance depends on whether you build your project as Debug or Release. My loop ran at 290 kHz in Debug and at 380 kHz in Release mode!
