BEAGLEBK: Slow execution on Beagle Bone Black

Part Number: BEAGLEBK

Hi;

I would like to use the AM335x for some hard real-time applications where I need to do some speedy number crunching. Before going to an RTOS I wanted to check the processing speed to make sure I can achieve the performance I need. I'm using a BeagleBone Black with a TMS320-XDS100v3 JTAG emulator/debugger in CCS 8.2.0.00007.

I started with the boot loader program from the StarterWare 02_00_01_01 software package. I modified the boot loader so that it does not load an app but instead runs a small delay loop and a pin toggle directly in bl_main.

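/* Prototype needed because Delay() is defined after main(). */
static void Delay(volatile unsigned int count);

int main(void)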
{
    /* Configures PLL and DDR controller*/
    BlPlatformConfig();

    UARTPuts("StarterWare ", -1);
    UARTPuts(deviceType, -1);
    UARTPuts(" Boot Loader\n\r", -1);

    /* Copies application from non-volatile flash memory to RAM */
    //ImageCopy();

    UARTPuts("Jumping to StarterWare Application...\r\n\n", -1);

    /* Do any post-copy config before leaving boot loader */
    //BlPlatformConfigPostBoot();

    /* Giving control to the application */
    //appEntry = (void (*)(void)) entryPoint;

    //(*appEntry)( );

    /* prepare GPIO for set and reset */
    ////////////////////////////////////////////////////////////////
    /* Enabling functional clocks for GPIO1 instance. */
    GPIO1ModuleClkConfig();

    /* Selecting GPIO1[23] pin for use. */
    GPIO1Pin23PinMuxSetup();

    /* Selecting GPIO1[17] pin for use as output */
    GPIO1Pin17PinMuxSetup();

    /* Enabling the GPIO module. */
    GPIOModuleEnable(GPIO_INSTANCE_ADDRESS);

    /* Resetting the GPIO module. */
    GPIOModuleReset(GPIO_INSTANCE_ADDRESS);

    /* Setting the GPIO pins as output pins. */
    GPIODirModeSet(GPIO_INSTANCE_ADDRESS,
                   GPIO_INSTANCE_PIN_NUMBER_LED,
                   GPIO_DIR_OUTPUT);
    GPIODirModeSet(GPIO_INSTANCE_ADDRESS,
                   GPIO_INSTANCE_PIN_NUMBER_OUT,
                   GPIO_DIR_OUTPUT);

    while(1){
        HWREG(0x4804C194) = 0x20000;        // GPIO1 SETDATAOUT, bit 17: set pin high
        Delay(0x8000);
        HWREG(0x4804C190) = 0x20000;        // GPIO1 CLEARDATAOUT, bit 17: set pin low
        Delay(0x8000);
    }

    return 0;
}
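/*
** A function which is used to generate a delay.
** Because count is volatile, the compiler must read and write it in memory
** on every iteration, so each loop step includes a load and a store.
*/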
static void Delay(volatile unsigned int count)
{
    while(count--);
}

The functionality has been taken from the gpioLEDBlink application. I changed it to toggle an accessible port pin, and I measure the pin output cycle time with an oscilloscope. What I want to estimate is the time needed for one iteration of the Delay function loop ( while(count--); ). I measure 8.7ms for the high or the low phase of the pin. Count is set to 0x8000 (32768), so a single loop step takes about 265ns.

I would assume that this type of loop takes at most 2 processor cycles per step (maybe even one). However, 265ns per step corresponds to an effective clock of about 3.8MHz at one cycle per step, or 7.5MHz at 2 cycles per step. From all I can see, the processor is set up to run the ARM core at 1GHz, which means each loop step takes over 260 clock cycles. This sounds very slow. I understand that the port pin functions are slower, but with 0x8000 for the delay loop that additional delay should be insignificant (toggling a pin without the Delay function takes about 250ns in my setup).
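For anyone who wants to check the arithmetic above, here is a minimal sketch in C. The function names are made up for illustration and are not part of StarterWare; the inputs are the figures quoted in the post (8.7ms per phase, 0x8000 iterations, 1GHz core clock).

```c
#include <assert.h>

/* Time spent per loop iteration, in nanoseconds, given the measured
 * duration of one pin phase and the number of loop iterations in it. */
double ns_per_iteration(double phase_seconds, unsigned long iterations)
{
    assert(iterations > 0);
    return phase_seconds * 1e9 / (double)iterations;
}

/* Core clock cycles consumed per iteration at the given clock rate. */
double cycles_per_iteration(double ns_per_iter, double clock_hz)
{
    return ns_per_iter * 1e-9 * clock_hz;
}

/* Expected duration of one phase, in microseconds, if every iteration
 * cost the assumed number of cycles. */
double expected_phase_us(unsigned long iterations,
                         double cycles_per_iter, double clock_hz)
{
    assert(clock_hz > 0.0);
    return (double)iterations * cycles_per_iter / clock_hz * 1e6;
}
```

With the numbers from the post, ns_per_iteration(8.7e-3, 0x8000) comes out at roughly 265.5ns, which is about 265 cycles at 1GHz, while expected_phase_us(0x8000, 2.0, 1e9) gives 65.536us for the assumed 2 cycles per step.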

The question is whether I'm doing something wrong or whether my expectations/assumptions are wrong. I have used a 200MHz MCU before which runs the same application far faster than this BeagleBone setup (it runs at one clock cycle per loop step). However, I'll need about 4 times that speed for my final application.

Thanks!

  • Hi,

    AM335x GPIO1 interface is connected to the L4 interconnect, which runs at 100MHz. This is not a valid way of testing ARM performance.
  • Hi Biser;

    thanks for the quick response. As mentioned in my question, I'm aware that the GPIO interface runs much slower, and I'm not worried about that part. If I toggle the GPIO as fast as I can, I get 250ns per phase. But I'm using a Delay function (just a while loop that counts down) between the changes of the port pin. The value for count is set high enough (0x8000) to ensure that the time spent toggling the port pin is insignificant compared to the complete loop time.

    If I assume 2 clock cycles per iteration of the while loop and a 1GHz processor clock, I would expect 65.536us for one complete run of the delay loop; adding 250ns wouldn't make much of a difference. However, I see 8.7ms for one full execution of the Delay function plus one port-pin change, which suggests to me that the while loop runs much slower than 2 clock cycles per iteration. Therefore I'm not sure why I cannot use this very simple program to test the performance of the ARM core.

    My final program will look very much the same, just with many double-precision floating-point operations inside the loop, which will run through an array of data. So my main question remains: is there a way to make a simple while loop run at the predicted speed (something like 1-3 clock cycles at 1GHz per iteration)?

    Thanks!

    Best regards

    Hartmut

  • Hi,

    >>>> I have used a 200MHz MCU before.

    Is this also an ARM Cortex-A8 core?

    The BBB board has a TI AM335x SoC, which contains an ARM Cortex-A8 core. If you run a simple loop or math algorithm, you are measuring pure CPU performance. The StarterWare you mentioned and the Processor SDK RTOS are TI driver packages with examples; unfortunately we don't have CPU benchmarking in them.

    I would think you need to find some benchmarking algorithm and see whether you can use NEON, VFP and the compiler settings to get the best results. There is some info here: processors.wiki.ti.com/.../Cortex-A8

    For measuring the loop of 0x8000 iterations, you may use the A8 performance counter to see how many cycles are used, instead of a GPIO toggle.
    infocenter.arm.com/.../index.jsp
    stackoverflow.com/.../measure-executing-time-on-arm-cortex-a8-using-hardware-counter

    Also, you need to enable the I-cache and D-cache.

    Regards, Eric
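
    As a rough illustration of the performance-counter suggestion above, the following sketch reads the Cortex-A8 cycle counter (PMCCNTR) through CP15 around the same Delay loop. It assumes a GCC-style toolchain and privileged (bare-metal) execution; the function names are hypothetical and not StarterWare APIs. On any other target the counter is replaced by a clock() stand-in so the file still builds, but those numbers are not cycles.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__arm__) && defined(__GNUC__) && !defined(__linux__)
/* Bare-metal ARMv7-A path: PMU registers are accessed through CP15. */
static inline void cycle_counter_start(void)
{
    uint32_t pmcr;
    __asm__ volatile("mrc p15, 0, %0, c9, c12, 0" : "=r"(pmcr));
    pmcr |= (1u << 0) | (1u << 2);   /* E: enable counters, C: reset cycle counter */
    pmcr &= ~(1u << 3);              /* D = 0: count every cycle, not every 64th */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 0" : : "r"(pmcr));
    /* PMCNTENSET bit 31 enables the cycle counter itself. */
    __asm__ volatile("mcr p15, 0, %0, c9, c12, 1" : : "r"(1u << 31));
}

static inline uint32_t cycle_counter_read(void)
{
    uint32_t value;
    __asm__ volatile("mrc p15, 0, %0, c9, c13, 0" : "=r"(value));
    return value;
}
#else
/* Stand-in for other targets: clock() ticks, NOT core cycles. */
static inline void cycle_counter_start(void) { }
static inline uint32_t cycle_counter_read(void) { return (uint32_t)clock(); }
#endif

/* Cycles per loop step; unsigned subtraction tolerates one 32-bit wrap. */
double cycles_per_loop_step(uint32_t start, uint32_t end, uint32_t steps)
{
    assert(steps > 0);
    return (double)(uint32_t)(end - start) / (double)steps;
}

/* Same busy-wait loop as in the question. */
static void Delay(volatile unsigned int count)
{
    while (count--);
}

/* Runs Delay(count) once and reports the measured cost per loop step. */
double measure_delay(unsigned int count)
{
    cycle_counter_start();
    uint32_t start = cycle_counter_read();
    Delay(count);
    uint32_t end = cycle_counter_read();
    return cycles_per_loop_step(start, end, count);
}
```

    Calling measure_delay(0x8000) once with the caches disabled and once with them enabled should show directly how many cycles each loop step costs in each configuration, without involving the GPIO at all.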
  • Hi Eric;

    thanks a lot for your answer. Yes, you are right, I'm after the pure CPU performance. I'll only benefit from the 1GHz clock speed if I can do certain operations (including double-precision floating-point operations) very efficiently (as efficiently as on my current MCU). I'll have a closer look at your suggestions and see how far I get. I'll let you know once I have the results.

    Best regards

    Hartmut

  • lding said:
    Also, you need to enable the I-cache and D-cache.

    In addition to what Eric mentioned, the MMU has to be enabled as well as the caches.

    See Beaglebone Black CPI for the difference in clocks per instruction depending on whether the MMU and caches are enabled or disabled.
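
    To make the MMU point a bit more concrete: on ARMv7-A the D-cache only takes effect for memory that the translation table marks as cacheable, so a common bare-metal approach is a flat (virtual = physical) table of 1MB sections. The sketch below only builds such a table using the short-descriptor section format. The function names and the RAM range passed in are assumptions for illustration, and the CP15 steps that actually install the table (TTBR0, DACR, TLB and cache invalidation, then the M, C and I bits in SCTLR) are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_SECTIONS   4096u        /* 4096 x 1MB covers the 4GB address space */
#define DESC_SECTION   0x2u         /* bits [1:0] = 0b10: section descriptor */
#define DESC_B         (1u << 2)    /* bufferable */
#define DESC_C         (1u << 3)    /* cacheable */
#define DESC_AP_RW     (3u << 10)   /* AP[1:0] = 0b11: full read/write access */

/* Builds one first-level section descriptor (domain 0, TEX = 0).
 * Cacheable sections get C = 1, B = 1 (normal memory, write-back);
 * all others get C = 0, B = 0 (strongly ordered, suits peripherals). */
uint32_t section_descriptor(uint32_t base, int cacheable)
{
    uint32_t desc = (base & 0xFFF00000u) | DESC_AP_RW | DESC_SECTION;
    if (cacheable) {
        desc |= DESC_C | DESC_B;
    }
    return desc;
}

/* Fills a flat-mapped table. Only sections inside
 * [ram_start, ram_start + ram_size) are marked cacheable.
 * On the target the table must be 16KB aligned, for example:
 *   static uint32_t table[NUM_SECTIONS] __attribute__((aligned(16384))); */
void build_flat_table(uint32_t *table, uint32_t ram_start, uint32_t ram_size)
{
    assert(table != NULL);
    for (uint32_t i = 0; i < NUM_SECTIONS; i++) {
        uint32_t base = i << 20;    /* section index -> physical base address */
        int in_ram = (base >= ram_start) && (base - ram_start < ram_size);
        table[i] = section_descriptor(base, in_ram);
    }
}
```

    For a BeagleBone Black one might call build_flat_table(table, 0x80000000u, 0x20000000u) to cover the 512MB of DDR. Whatever memory actually holds the code and the stack has to fall inside the cacheable range for the Delay loop to benefit, so the range would need adjusting if the program runs from on-chip RAM.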