Beagle Tool Chain Doubt!!

Hi,

I am using a tool chain to compile source code for the Cortex-A8. The tool chain and the enabled switches are given below:

CPP=arm-none-linux-gnueabi-gcc
SW=-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -flax-vector-conversions

I have the following doubts regarding this tool chain:

How can I disable NEON vectorization for the given C-only source? I still need an executable for the BeagleBoard, because the BeagleBoard is my target.

If I do not use any NEON intrinsics or NEON assembly, will the compiler still generate NEON code by default for the given C source when targeting the BeagleBoard with the above tool chain?

My aim is to compare the performance of C code without any NEON or vectorization against NEON-enabled C code (C NEON intrinsics + NEON assembly) for an image viewer application.

Rgds

Dave

  • Hi Dave,

    You need some form of -ftree-vectorize to autovectorize for NEON. Since you do not have -ftree-vectorize, you should not generate autovectorized NEON code. You also mentioned that you have no NEON intrinsics or NEON assembly included, so the only way NEON code could be generated is through compiler autovectorization.

    If you add -ftree-vectorizer-verbose=2, you should be able to see where the NEON vectorization occurs, if any. Then, when you remove the -ftree-vectorize option, you should see the NEON code go away. You can verify this by dumping out the assembly, since NEON uses different registers than ARM: d0-d31 for 64-bit SIMD operations and q0-q15 for 128-bit SIMD operations.

    Are you using any floating-point operations? If so, it might be a little more complicated to verify, because the VFP also uses d0-d15. However, the VFP is not SIMD, so you can still tell whether NEON is used or not.

    Regards,
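
As a rough illustration of the check described above, the sketch below shows the kind of plain C loop the autovectorizer targets, plus a compile-time test for NEON. The names `brighten_u8` and `built_with_neon` are made up for this example and are not from Dave's code. Note that GCC's documentation lists -ftree-vectorize as enabled by default at -O3, so if -O3 is in use, passing -fno-tree-vectorize explicitly is likely the more reliable way to get a build with no autovectorized NEON code. `__ARM_NEON__` is the macro GCC predefines when NEON is enabled (for example with -mfpu=neon).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: add a constant to every pixel, saturating at 255.
   A simple counted loop like this is a typical autovectorization candidate. */
void brighten_u8(uint8_t *dst, const uint8_t *src, size_t n, uint8_t delta)
{
    size_t i;
    for (i = 0; i < n; i++) {
        unsigned int v = (unsigned int)src[i] + delta;
        dst[i] = (uint8_t)(v > 255u ? 255u : v);
    }
}

/* Returns 1 if this file was compiled with NEON available, 0 otherwise. */
int built_with_neon(void)
{
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    return 1;
#else
    return 0;
#endif
}
```

Compiling such a file with -S, once with and once without -fno-tree-vectorize, and searching the output for d or q registers is one way to apply the register check from the post above.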

  • Hi Jeff

    Thanks for your reply, but my problem still exists.
    Let me explain my issue once again.

    I am trying to develop an image viewer application to view RGB and bitmap images.
    My target is the BeagleBoard, running the Angstrom distribution.

    I have two sets of source code (all versions are in fixed point):

    Version 1 : Pure ANSI C

    Only the C code is considered.
    The makefile is given below:

    OBJFILES = # objfiles.o
    INCLUDE = -I./../Header
    ABC=arm-none-linux-gnueabi-gcc
    PQR=-march=armv7-a -mtune=cortex-a8
    CFLAGS = -O3 -Wall $(INCLUDE) $(PQR)
    HOME = IMViewer.so
    $(HOME) : $(OBJFILES)
     $(ABC) -o $@ $^ $(CFLAGS) -fPIC -L. -shared
     ${ABC} -o IMView $(OBJFILES) -ldl -L. -lIMViewer
     install $(HOME) ../lib/
     mv IMView ../lib/
     rm -rf IMViewer.so
     @echo "C Version completed..."
    %.o : %.c
     $(ABC) -c $(CFLAGS) $< -o $@
     
    Version 2 : C + Neon Intrinsics

    In this version I use NEON intrinsics wherever applicable,
    so the resulting source is a mix of C and NEON intrinsics.
    The makefile used for compiling it is given below:

    OBJFILES = # objfiles.o
    INCLUDE = -I./../Header
    ABC=arm-none-linux-gnueabi-gcc
    PQR=-march=armv7-a -mtune=cortex-a8 -mfpu=neon -mfloat-abi=softfp -ftree-vectorizer-verbose=2 -flax-vector-conversions
    CFLAGS = -O3 -Wall $(INCLUDE) $(PQR)
    HOME = IMViewer.so
    $(HOME) : $(OBJFILES)
     $(ABC) -o $@ $^ $(CFLAGS) -fPIC -L. -shared
     ${ABC} -o IMView $(OBJFILES) -ldl -L. -lIMViewer
     install $(HOME) ../lib/
     mv IMView ../lib/
     rm -rf IMViewer.so
     @echo "C Neon Version completed..."
    %.o : %.c
     $(ABC) -c $(CFLAGS) $< -o $@
     
    I hope this makes my actual setup clear.
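
For readers unfamiliar with the "C + NEON intrinsics" style of version 2, here is a minimal sketch of what such a function can look like. It is not Dave's code: `add_pixels_sat` is a hypothetical kernel, and the `#if` guard with a scalar fallback is an assumption added so the same file also builds without NEON (as in version 1). The intrinsics used (`vld1q_u8`, `vqaddq_u8`, `vst1q_u8`) come from `<arm_neon.h>`.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical kernel: dst[i] = saturate(a[i] + b[i]) for 8-bit pixels. */
void add_pixels_sat(uint8_t *dst, const uint8_t *a, const uint8_t *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON__) || defined(__ARM_NEON)
    /* NEON path: 16 pixels per iteration in one 128-bit q register. */
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(dst + i, vqaddq_u8(va, vb));
    }
#endif
    /* Scalar path: handles the tail, or everything when NEON is absent. */
    for (; i < n; i++) {
        unsigned int s = (unsigned int)a[i] + b[i];
        dst[i] = (uint8_t)(s > 255u ? 255u : s);
    }
}
```

Keeping both paths in one function also gives a simple way to confirm that the two versions produce identical output before comparing their speed.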

    Then, in my IMViewer application, I measure the performance of both versions.
    The code fragment is given below:

    #include <stdio.h>
    #include <sys/time.h>

    void IMViewer(void);   /* assumed prototype; the real one comes from the IMViewer headers */

    int main(int argc, char **argv)
    {
        struct timeval First, Last;
        long elapsed;      /* Time in millisecond units */

        gettimeofday(&First, NULL);

        IMViewer();

        gettimeofday(&Last, NULL);

        /* Subtract before scaling so tv_sec * 1000 cannot overflow a 32-bit long */
        elapsed = (Last.tv_sec - First.tv_sec) * 1000
                + (Last.tv_usec - First.tv_usec) / 1000;
        printf("The effective time in milliseconds is %ld\n", elapsed);

        return 0;
    }

    This code fragment is common to both versions and measures the time each takes to complete.

    Sadly, the performance of version 2 is not good; it is close to that of the C version.
    I cannot spot what the problem is here.
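
A single wall-clock measurement like the one above can be noisy, which makes two versions hard to compare. The sketch below is one possible refinement, not part of the original program: it subtracts the two `gettimeofday` samples before scaling and keeps the fastest of several runs. The helper names `elapsed_us` and `best_of_us` are hypothetical.

```c
#include <stddef.h>
#include <sys/time.h>

/* Elapsed microseconds between two samples. Subtracting first keeps the
   intermediate values small; with a 32-bit long this still assumes the
   interval is shorter than roughly 35 minutes. */
long elapsed_us(const struct timeval *start, const struct timeval *end)
{
    return (long)(end->tv_sec - start->tv_sec) * 1000000L
         + (long)(end->tv_usec - start->tv_usec);
}

/* Hypothetical helper: call fn() `runs` times and return the fastest run
   in microseconds, or -1 if runs < 1. */
long best_of_us(void (*fn)(void), int runs)
{
    long best = -1;
    int r;
    for (r = 0; r < runs; r++) {
        struct timeval t0, t1;
        long us;
        gettimeofday(&t0, NULL);
        fn();
        gettimeofday(&t1, NULL);
        us = elapsed_us(&t0, &t1);
        if (best < 0 || us < best)
            best = us;
    }
    return best;
}
```

For example, `best_of_us(IMViewer, 10)` would time ten calls, assuming `IMViewer` takes no arguments and returns nothing.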

    I checked the following and both are OK:

    1. The OS kernel is NEON enabled (OMAP 3530)
    2. The assembly files generated from the NEON code contain the NEON instructions corresponding to the intrinsics

    The following doubts still remain:

    1. Can I configure the L1 and L2 cache sizes in the OS kernel?
    2. Is any hand-written assembly needed to enable the NEON unit on the BeagleBoard?

    Kindly look into my issue and please help me!

    Rgds
    Dave

  • Dave,

    Now I understand your setup better. I don't have a BeagleBoard to test with, so you may also want to post this on beagleboard.org. Since you have NEON intrinsic code, you can ignore the previous post about autovectorization. If your code runs, then NEON is enabled; if NEON were not enabled, you would get an illegal instruction when executing NEON code. Can you check the L1NEON bit and make sure NEON is able to cache in L1? You can find it at bit 5 of the Auxiliary Control Register:

    MRC p15, 0, <Rd>, c1, c0, 1 ; Read Auxiliary Control Register

    Regards,

    Jeff
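
Once the register value has been read (in a privileged mode, since CP15 accesses are not available to ordinary user code), checking the bit is simple masking. The sketch below is illustrative only: the macro and function names are invented here, bit 5 (L1NEON) is taken from the post above, and the L2EN position (bit 1) is an assumption based on the Cortex-A8 Technical Reference Manual rather than on this thread.

```c
#include <stdint.h>

/* Cortex-A8 Auxiliary Control Register bits (names are hypothetical). */
#define ACTLR_L2EN    (1u << 1)  /* assumption from the Cortex-A8 TRM */
#define ACTLR_L1NEON  (1u << 5)  /* bit 5, as stated in the post above */

/* Returns 1 if every bit in mask is set in the given register value. */
int actlr_bit_is_set(uint32_t actlr, uint32_t mask)
{
    return (actlr & mask) == mask;
}
```

A call such as `actlr_bit_is_set(value, ACTLR_L1NEON)` then tells whether NEON data is allowed to be cached in L1 for the value that was read.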

  • Hi Jeff,

    Happy to see your reply.

    Please look at the following code section:

    #include <stdio.h>

    int main(void)
    {
        /* Declarations */
        unsigned long aux;

        /* CP15 accesses are privileged; from ordinary user space these instructions trap */
        __asm__ __volatile__("mrc p15, 0, %0, c1, c0, 1" : "=r" (aux));
        printf("Aux control Before: %08lX\n", aux);

        /* Enabling ASA */
        aux |= 0x10;
        /* Enable L1NEON */
        aux |= 1 << 5;
        /* Write the modified value back; without this the register is unchanged */
        __asm__ __volatile__("mcr p15, 0, %0, c1, c0, 1" : : "r" (aux));

        __asm__ __volatile__("mrc p15, 0, %0, c1, c0, 1" : "=r" (aux));
        printf("Aux control After: %08lX\n", aux);

        /* IMViewer Code */

        return 0;
    }

    Here bit 5 is set to enable L1 caching for NEON, but I don't know what ASA means.

    Is it necessary for enabling the L1 cache and getting an optimized result?

    Similarly, does the L2 cache have any dependency on the NEON core?

    Thanks,

    Dave