C6run performance

Other Parts Discussed in Thread: DM3730

Hello all,

 

I managed to get C6Run to compile its example code, as well as a trivial matrix multiply program I wrote in an attempt to profile the performance improvement gained from doing these operations on the DSP. Both my matrix multiply program and all of the c6runapp and c6runlib example code provided in the DVSDK perform worse on the DSP than on the ARM. Why would this be? Is it a configuration issue?

 

Thanks a lot,

Woody

  • Woody,

    Can you tell us what platform you are running on (OMAP-L13x or OMAP35x)?  What kind of numerical operations are you doing (fixed or floating point)?  Are you aware that there are overheads associated with calling to the DSP, such that operations on small data sizes may not make as much sense?  What size matrices are we talking about? You can find information here about minimizing the cache operations needed to handle data between the cores.

    If you are doing fixed-point operations, you may want to use the optimized matrix multiply code that comes as part of the C64x+ DSPLib package. The C compiler may not be generating optimal code.

    Regards, Daniel

  • #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>


    void matfill(int size, int **mat) {
            int i, j;
            for (i = 0; i < size; i++) {
                    for (j = 0; j < size; j++) {
                            mat[i][j] = rand() % 100;
                    }
            }
    }



    void printmat(int size, int **mat) {
    #if 0
            int i, j;
            for (i = 0; i < size; i++) {
                    for (j = 0; j < size; j++) {
                            printf("%6d ", mat[i][j]);
                    }
                    printf("\n");
            }
    #endif
    }

    void matmult(int size, int **A, int **B, int **C) {
            int i, j, k;
            for (i = 0; i < size; i++) {
                    for (j = 0; j < size; j++) {
                            C[i][j] = 0;
                            for (k = 0; k < size; k++) {
                                    C[i][j] += (A[i][k] * B[k][j]);
                            }
                    }
            }
    }

    //square matrices only

    int main(int argc, char *argv[]) {
            int **A, **B, **C;
            int i;
            int matsize;
            time_t t;

            srand(time(NULL));

            if (argc != 2) exit(-1);

            matsize = atoi(argv[1]);

            if (matsize == 0) exit(-1);

            A = calloc(matsize, sizeof(int *));
            for (i = 0; i < matsize; i++) {
                    A[i] = calloc(matsize, sizeof(int));
            }
            matfill(matsize, A);
            B = calloc(matsize, sizeof(int *));
            for (i = 0; i < matsize; i++) {
                    B[i] = calloc(matsize, sizeof(int));
            }
            matfill(matsize, B);
            C = calloc(matsize, sizeof(int *));
            for (i = 0; i < matsize; i++) {
                    C[i] = calloc(matsize, sizeof(int));
            }
            t = time(NULL);
            matmult(matsize, A, B, C);
            t = time(NULL) - t;
            /* time_t is not guaranteed to be int; cast to long to match the format */
            printf("time of multiply: %ld seconds\n", (long)t);

    #if 0
            printf("\nmatrix A:\n");
            printmat(matsize, A);

            printf("\nmatrix B:\n");
            printmat(matsize, B);

            printf("\nmatrix C:\n");
            printmat(matsize, C);
    #endif
            for (i = 0; i < matsize; i++) {
                    free(A[i]);
            }
            free(A);
            for (i = 0; i < matsize; i++) {
                    free(B[i]);
            }
            free(B);
            for (i = 0; i < matsize; i++) {
                    free(C[i]);
            }
            free(C);
            return 0;
    }


    The code I'm using (with c6runapp) is above. As you can see, it takes an argument giving the size of the two matrices to be multiplied, and does all integer math. I'm printing the time value to get a rough estimate of how long the matmult function runs. The ARM native code usually reports about 14 seconds, the DSP code about 19. I'm calling the program with an argument of 500, and compiling it with -O2 and -O3 on both the ARM and C6Run platforms.
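Because time(NULL) only has one-second resolution, short runs are reported as 0 seconds. A minimal sketch of a finer-grained timer is shown here, assuming a POSIX clock_gettime() with CLOCK_MONOTONIC is available (true on ARM Linux; whether the DSP-side runtime under c6runapp provides it is an assumption to verify). The helper name now_seconds is hypothetical.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime() on strict compilers */
#endif
#include <time.h>

/* Hypothetical helper: current monotonic time in seconds, as a double.
 * Assumes POSIX clock_gettime(); returns -1.0 if the call fails. */
static double now_seconds(void) {
    struct timespec ts;
    if (clock_gettime(CLOCK_MONOTONIC, &ts) != 0) {
        return -1.0;
    }
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}
```

In main(), the two time(NULL) calls could then be replaced by `double t0 = now_seconds();` before matmult() and `printf("time of multiply: %.3f seconds\n", now_seconds() - t0);` after it.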

    I would like to try those DSP libs, but I have very little experience writing code for the DSP and I don't know exactly how to call or link them.

     

    Thank you for your time,

    Woody

  • Woody,

    I took your code and modified it slightly to fit the calling conventions of the DSPLib DSP_mat_mul() function. I used the attached makefile and C source file to build and run the code on both the ARM and DSP cores of the DM3730 (ARM Cortex-A8 running at 1 GHz, C64x+ DSP running at 800 MHz). The ARM version uses the natural C code in the C file, while the DSP version calls into the DSPLib to accomplish the same task. Note that the data type of the matrix elements has been changed to short (from int), and we also have to apply a shift factor so that the results don't overflow. Compiling and running that code gives me the following:

    root@dm37x-evm:/opt# for i in 100 200 300 400 500 600; do ./matmult_gpp $i; done
    time of multiply: 0 seconds                                                    
    time of multiply: 0 seconds                                                    
    time of multiply: 2 seconds                                                    
    time of multiply: 6 seconds                                                    
    time of multiply: 18 seconds                                                   
    time of multiply: 39 seconds                                                   
    root@dm37x-evm:/opt# for i in 100 200 300 400 500 600; do ./matmult_dsp $i; done
    time of multiply: 0 seconds                                                    
    time of multiply: 0 seconds                                                    
    time of multiply: 0 seconds                                                    
    time of multiply: 0 seconds                                                    
    time of multiply: 1 seconds                                                    
    time of multiply: 3 seconds  

    As you can see, we can get a fairly significant advantage for larger matrix sizes.  And the included makefile shows that linking to the optimized DSP library is not actually very difficult.
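To illustrate the short data type and the shift factor mentioned above, here is a plain C sketch of a 16-bit matrix multiply with a right shift applied to each result. It is not the DSPLib source and not the code from the attached zip; the name ref_mat_mul is hypothetical, and the argument order (matrices stored contiguously in row-major order, dimensions r1, c1, c2, shift qs) is only modeled on my understanding of the DSP_mat_mul() calling convention, so check the DSPLib header before relying on it. As a worked example of why the shift is needed: with elements in 0..99 and 500x500 matrices, one result can reach 99 * 99 * 500 = 4,900,500, which does not fit in a 16-bit short (maximum 32,767) but does after a shift of 8 (4,900,500 >> 8 = 19,142).

```c
/* Hypothetical reference routine (not DSPLib source): r = (x * y) >> qs.
 * x is r1 x c1, y is c1 x c2, r is r1 x c2, all row-major and contiguous.
 * Products are accumulated in a 32-bit int, then shifted right by qs
 * before being narrowed to short. The caller must pick qs so the shifted
 * value fits in 16 bits; this sketch does not saturate. Right-shifting a
 * negative int is implementation-defined in C. */
static void ref_mat_mul(const short *x, int r1, int c1,
                        const short *y, int c2, short *r, int qs) {
    int i, j, k;
    for (i = 0; i < r1; i++) {
        for (j = 0; j < c2; j++) {
            int sum = 0;
            for (k = 0; k < c1; k++) {
                sum += (int)x[i * c1 + k] * (int)y[k * c2 + j];
            }
            r[i * c2 + j] = (short)(sum >> qs);
        }
    }
}
```

Note that this layout uses one contiguous block per matrix rather than the array of row pointers in the original program, which is also friendlier to the cache.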

    Regards, Daniel

    matmult_dsplib.zip
  • It worked, thanks! One more question:

     

    To run those DSP functions from a regular ARM application, would I need to wrap them in a library and compile it with c6runlib? Would it be more efficient to use the C6Accel kernel and Codec Engine?

     

    Thanks again,

    Woody.

  • Woody Douglass said:

    It worked, thanks! One more question:

     

    To run those DSP functions from a regular ARM application, would I need to wrap them in a library and compile it with c6runlib? Would it be more efficient to use the C6Accel kernel and Codec Engine?

     

    Thanks again,

    Woody.

    Yes, you could wrap any DSPLib functions in a small C wrapper and use c6runlib to generate an ARM static lib to call these functions. Yes, you can use Codec Engine with C6Accel to do the same. I honestly don't know which of those two options would be more efficient. It would be fun to set up some experiments and find out.

    As a side note, just for fun I ran the example apps again with larger matrices.

    800x800 -> 16 seconds on the DSP, 134 seconds on the ARM

    900x900 -> 23 seconds on the DSP, 198 seconds on the ARM

    1000x1000 -> 32 seconds on the DSP, 286 seconds on the ARM

    1100x1100 -> 42 seconds on the DSP, 402 seconds on the ARM

    So with the optimized DSP libraries you can get roughly an 8-10x performance advantage at these sizes.
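The ratios behind that figure can be checked directly from the four timings listed above. This is just a small arithmetic sketch; the function and array names are hypothetical.

```c
#include <stdio.h>

/* Timings copied from the list above, in seconds. */
static const int    sizes[4]    = {800, 900, 1000, 1100};
static const double dsp_secs[4] = {16.0, 23.0, 32.0, 42.0};
static const double arm_secs[4] = {134.0, 198.0, 286.0, 402.0};

/* Speedup of the DSP over the ARM: ARM time divided by DSP time. */
static double speedup(double arm_seconds, double dsp_seconds) {
    return arm_seconds / dsp_seconds;
}

/* Print one line per matrix size with the computed ratio. */
static void print_speedups(void) {
    int i;
    for (i = 0; i < 4; i++) {
        printf("%dx%d: %.1fx\n", sizes[i], sizes[i],
               speedup(arm_secs[i], dsp_secs[i]));
    }
}
```

Calling print_speedups() gives 8.4x, 8.6x, 8.9x and 9.6x for the four sizes, so the advantage grows slowly with matrix size over this range.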

    Regards, Daniel

  • Thanks again, you're a lifesaver. This DSP stuff is very exciting!

  • Hi all,

    I have a BeagleBoard-xM and would like to use it to compare the performance of the DSP and the ARM, as described in this post. I was looking into which tools to start with and found this very useful guide:


    http://processors.wiki.ti.com/index.php/Getting_Started_With_C6Run_On_Beagleboard


    Is this the right guide to follow in order to reproduce exactly the example Woody made?