C6run performance

Woody Douglass

Other Parts Discussed in Thread: DM3730

Hello all,

I managed to get c6run to compile its example code, as well as a trivial matrix multiply program I wrote in an attempt to profile the performance improvement gained from doing these operations on the DSP. Both my matrix multiply program and all of the c6runapp and c6runlib example code provided in the DVSDK perform worse on the DSP than on the ARM. Why would this be? Is it a configuration issue?

Thanks a lot,

Woody

over 15 years ago

0 Daniel Allred over 15 years ago

TI__Genius 17355 points

Woody,

Can you tell us what platform you are running on (OMAP-L13x or OMAP35x)? What kind of numerical operations are you doing (fixed or floating point)? Are you aware that there are overheads associated with calling to the DSP, such that operations on small data sizes may not make as much sense? What size matrices are we talking about? You can find information here about minimizing the cache operations needed to handle data between the cores.

If you are doing fixed point operations, you may want to use the optimized matrix multiply code that comes as part of the C64x+ DSPLib package. The C compiler may not be generating the most optimal code.

Regards, Daniel

0 Woody Douglass over 15 years ago in reply to Daniel Allred

Intellectual 405 points

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void matfill(int size, int **mat) {
        int i, j;
        for (i = 0; i < size; i++) {
                for (j = 0; j < size; j++) {
                        mat[i][j] = rand() % 100;
                }
        }
}

void printmat(int size, int **mat) {
#if 0
        int i, j;
        for (i = 0; i < size; i++) {
                for (j = 0; j < size; j++) {
                        printf("%6d ", mat[i][j]);
                }
                printf("\n");
        }
#endif
}

void matmult(int size, int **A, int **B, int **C) {
        int i, j, k;
        for (i = 0; i < size; i++) {
                for (j = 0; j < size; j++) {
                        C[i][j] = 0;
                        for (k = 0; k < size; k++) {
                                C[i][j] += (A[i][k] * B[k][j]);
                        }
                }
        }
}

//square matrices only

int main(int argc, char *argv[]) {
        int **A, **B, **C;
        int i;
        int matsize;
        time_t t;

        srand(time(NULL));

        if (argc != 2) exit(-1);

        matsize = atoi(argv[1]);

        if (matsize == 0) exit(-1);

        A = calloc(matsize, sizeof(int *));
        for (i = 0; i < matsize; i++) {
                A[i] = calloc(matsize, sizeof(int));
        }
        matfill(matsize, A);
        B = calloc(matsize, sizeof(int *));
        for (i = 0; i < matsize; i++) {
                B[i] = calloc(matsize, sizeof(int));
        }
        matfill(matsize, B);
        C = calloc(matsize, sizeof(int *));
        for (i = 0; i < matsize; i++) {
                C[i] = calloc(matsize, sizeof(int));
        }
        t = time(NULL);
        matmult(matsize, A, B, C);
        t = time(NULL) - t;
        /* time_t is not guaranteed to be int, so cast before printing */
        printf("time of multiply: %ld seconds\n", (long)t);

#if 0
        printf("\nmatrix A:\n");
        printmat(matsize, A);

        printf("\nmatrix B:\n");
        printmat(matsize, B);

        printf("\nmatrix C:\n");
        printmat(matsize, C);
#endif
        for (i = 0; i < matsize; i++) {
                free(A[i]);
        }
        free(A);
        for (i = 0; i < matsize; i++) {
                free(B[i]);
        }
        free(B);
        for (i = 0; i < matsize; i++) {
                free(C[i]);
        }
        free(C);
        return 0;
}

The code I'm using (with c6runapp) is above. As you can see, it takes one argument, the size of the two square matrices to generate and multiply, and does all integer math. I print the elapsed time to get a rough estimate of how long the matmult function runs. The native ARM code usually reports about 14 seconds, the DSP code about 19. I'm calling the program with an argument of 500 and compiling it with -O2 and -O3 on both the ARM and c6run platforms.
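Because time() only has one-second resolution, short runs all report "0 seconds" and longer ones are rounded. A minimal sketch of a finer-grained timer follows, assuming POSIX clock_gettime() is available on the side being timed (it is on ARM Linux; whether the C6Run DSP runtime provides it is not confirmed anywhere in this thread). The helper names now_monotonic and elapsed_ms are hypothetical.

```c
#define _POSIX_C_SOURCE 199309L  /* expose clock_gettime() under strict -std= modes */
#include <assert.h>
#include <time.h>

/* Read the monotonic clock, which is not affected by wall-clock adjustments. */
struct timespec now_monotonic(void)
{
    struct timespec ts;
    int rc = clock_gettime(CLOCK_MONOTONIC, &ts);
    assert(rc == 0);
    (void)rc; /* avoid an unused-variable warning when NDEBUG is defined */
    return ts;
}

/* Elapsed time from *start to *end in milliseconds. */
double elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    assert(start != NULL && end != NULL);
    return (double)(end->tv_sec - start->tv_sec) * 1000.0
         + (double)(end->tv_nsec - start->tv_nsec) / 1.0e6;
}
```

In the program above, this would mean calling now_monotonic() immediately before and after matmult() and printing elapsed_ms(&start, &end) with a "%.1f" format. On older glibc versions the link step may also need -lrt.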

I would like to try those DSP libs, but I have very little experience writing code for the DSP and I don't exactly know how to call or link them.

Thank you for your time,

Woody

0 Daniel Allred over 15 years ago in reply to Woody Douglass

TI__Genius 17355 points

Woody,

I took your code and modified it slightly to fit the calling conventions of the DSPLib DSP_mat_mul() function. I used the attached makefile and C source file to build and run the code on both the ARM and DSP core of the DM3730 (ARM Cortex-A8 running at 1 GHz, C64x+ DSP running at 800 MHz). The ARM version uses the natural C code in the C file, while the DSP version calls into the DSPLib to accomplish the same task. Note that the data type of the matrix elements has been changed to short (from int), and we also have to account for a shift factor to get results that don't overflow. Anyway, compiling and running that code gives me the following:

root@dm37x-evm:/opt# for i in 100 200 300 400 500 600; do ./matmult_gpp $i; done
time of multiply: 0 seconds
time of multiply: 0 seconds
time of multiply: 2 seconds
time of multiply: 6 seconds
time of multiply: 18 seconds
time of multiply: 39 seconds
root@dm37x-evm:/opt# for i in 100 200 300 400 500 600; do ./matmult_dsp $i; done
time of multiply: 0 seconds
time of multiply: 0 seconds
time of multiply: 0 seconds
time of multiply: 0 seconds
time of multiply: 1 seconds
time of multiply: 3 seconds

As you can see, we can get a fairly significant advantage for larger matrix sizes. And the included makefile shows that linking to the optimized DSP library is not actually very difficult.
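The shift factor mentioned above can be worked out by hand. This is a small sketch, assuming the elements are still drawn from rand() % 100 as in the original program (the attached source is not shown here, so that is an assumption) and that each result element must fit in a signed 16-bit short. The function name min_shift_for_int16 is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Smallest right shift qs such that the worst-case dot product of n terms,
 * each at most max_elem * max_elem, fits in a signed 16-bit value. */
int min_shift_for_int16(int max_elem, int n)
{
    assert(max_elem >= 0 && n > 0);
    int64_t worst = (int64_t)max_elem * max_elem * n;
    int qs = 0;
    while ((worst >> qs) > INT16_MAX)
        qs++;
    return qs;
}
```

For 500x500 matrices with elements up to 99, the worst-case sum is 99 * 99 * 500 = 4,900,500, which needs a shift of 8 to drop below 32767. The results are then scaled down by 2^8, so low-order bits are lost compared with the int version.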

Regards, Daniel

matmult_dsplib.zip

0 Woody Douglass over 15 years ago in reply to Daniel Allred

Intellectual 405 points

It worked, thanks! One more question:

To run those DSP functions from a regular ARM application, would I need to wrap them in a library and compile it with c6runlib? Would it be more efficient to use the C6Accel kernel and Codec Engine?

Thanks again,

Woody.

0 Daniel Allred over 15 years ago in reply to Woody Douglass

TI__Genius 17355 points

Yes, you could wrap any DSPLib functions in a small C wrapper and use c6runlib to generate an ARM static lib to call these functions. Yes, you can also use Codec Engine with C6Accel to do the same. I honestly don't know which of those two options would be more efficient. It would be fun to set up some experiments and find out.
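As a rough illustration of what such a wrapper might look like, here is a sketch with a plain C stand-in kernel so that it compiles anywhere. The names matmult_square_q and mat_mul_kernel are hypothetical, and the kernel's parameter list (flat row-major short arrays plus a shift) is an assumption modeled on the calling convention described earlier in the thread, not a copy of the DSPLib prototype. On a DSP build, the stand-in call is where the DSPLib routine would go.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in kernel: r = (x * y) >> qs on row-major flat arrays.
 * Assumes the worst-case sum fits in a 32-bit int and that the
 * products are non-negative (right-shifting negatives is
 * implementation-defined in C). */
static void mat_mul_kernel(const short *x, int r1, int c1,
                           const short *y, int c2, short *r, int qs)
{
    assert(x != NULL && y != NULL && r != NULL);
    for (int i = 0; i < r1; i++) {
        for (int j = 0; j < c2; j++) {
            int sum = 0;
            for (int k = 0; k < c1; k++)
                sum += x[i * c1 + k] * y[k * c2 + j];
            r[i * c2 + j] = (short)(sum >> qs);
        }
    }
}

/* Hypothetical entry point that a c6runlib build could export.
 * Returns 0 on success, -1 on invalid arguments. */
int matmult_square_q(const short *a, const short *b, short *c,
                     int size, int qs)
{
    if (a == NULL || b == NULL || c == NULL ||
        size <= 0 || qs < 0 || qs > 31)
        return -1;
    mat_mul_kernel(a, size, size, b, size, c, qs);
    return 0;
}
```

The wrapper takes flat arrays rather than the int ** row pointers of the original program, since a single contiguous buffer per matrix is simpler to hand to another core. Whether c6runlib imposes further requirements on how those buffers are allocated is not covered in this thread.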

As a side note, just for fun I ran the example apps again with larger matrices.

800x800 -> 16 seconds on the DSP, 134 seconds on the ARM

900x900 -> 23 seconds on the DSP, 198 seconds on the ARM

1000x1000 -> 32 seconds on the DSP, 286 seconds on the ARM

1100x1100 -> 42 seconds on the DSP, 402 seconds on the ARM

So using the optimized DSP libraries you can get roughly an 8-10x performance advantage at these sizes.
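The ratios behind that figure can be checked directly from the timings listed above. This is a small sketch with hypothetical helper names; it only does the arithmetic and makes no claim about why the two cores scale differently.

```c
#include <assert.h>

/* ARM time divided by DSP time for the same matrix size. */
double speedup(double arm_seconds, double dsp_seconds)
{
    assert(dsp_seconds > 0.0);
    return arm_seconds / dsp_seconds;
}

/* Expected growth in run time for an O(n^3) algorithm when the
 * matrix dimension goes from n_small to n_large. */
double cubic_growth(double n_small, double n_large)
{
    assert(n_small > 0.0);
    double r = n_large / n_small;
    return r * r * r;
}
```

With the numbers above, speedup() works out to about 8.4x at 800x800, 8.6x at 900x900, 8.9x at 1000x1000, and 9.6x at 1100x1100. Going from 800 to 1100, cubic_growth() predicts about 2.6x more work; the DSP time grows by 42/16, also about 2.6x, while the ARM time grows by 402/134 = 3.0x.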

Regards, Daniel

0 Woody Douglass over 15 years ago in reply to Daniel Allred

Intellectual 405 points

Thanks again, you're a lifesaver. This DSP stuff is very exciting!

0 Maurizio Porpiglia over 13 years ago in reply to Daniel Allred

Prodigy 135 points

Hi all,

I have a BeagleBoard-xM and would like to use it to evaluate the performance of the DSP and ARM, as described in this post. While looking for tools to get started, I found this very useful guide:

http://processors.wiki.ti.com/index.php/Getting_Started_With_C6Run_On_Beagleboard

Is this the right guide to follow to reproduce exactly the example Woody made?
