Question about C6678 computational efficiency

user4263705

Intellectual 530 points

Hi All

I am new to multicore DSPs. I am programming a single core of the C6678 like this:

#define SIZE 100

int result[SIZE][SIZE], src[SIZE][SIZE], dest[SIZE][SIZE];
int i, j, k;

init_Matrix(result);
init_Matrix(src);
init_Matrix(dest);

for (i = 0; i < SIZE; i++)
for (j = 0; j < SIZE; j++)
for (k = 0; k < SIZE; k++)
{
result[i][j] += src[i][k] * dest[k][j];
}

Someone told me that, with good optimization, this program can run more than one hundred times faster on all eight cores of the C6678 than on a single core (C6678 core 0).

I find that hard to believe. Is it possible?

I saw a figure of 22.4 GFLOP/core in the CorePac datasheet.

How many multiply operations can the C6678 do in one cycle?

I found a table in a document; here it is:

Does it mean that the C6678 can do 64 32-bit x 32-bit multiplies in a single clock cycle?

If so, a 100 x 100 matrix multiplication is about 1,000,000 multiply operations, so it would cost at least 1,000,000/64 = 15625 cycles?

I ran an experiment on the C6678 with OpenMP to do this work. The source code is here:

#include <ti/omp/omp.h>
#include <c6x.h>
#include <stdlib.h>
#include<stdio.h>
#include <time.h>
#define SIZE 100
#define NTHREADS 8

unsigned long long start,finish;

void main()
{
start =0 ; finish = 0 ;
TSCL = 0;

int (*A)[SIZE], (*b)[SIZE], (*c)[SIZE];
A =malloc(sizeof(int)*SIZE*SIZE);
b = malloc(sizeof(int)*SIZE*SIZE);
c = malloc(sizeof(int)*SIZE*SIZE);
int i, j,k;
srand(time(NULL));

/* Initialize the matrices */
for (i=0; i < SIZE; i++)
{
for (j=0; j < SIZE; j++)
{
A[i][j] = rand()%10; // fill the matrices with random values
b[i][j] = rand()%10;
c[i][j] = 0;
}
}
start = TSCL;
omp_set_num_threads(NTHREADS);
#pragma omp parallel shared(A,b,c) private(i)
{
// tid = omp_get_thread_num(); // get the current thread's id
#pragma omp for private(j,k) // split the for loop across the threads
for (i=0; i < SIZE; i++)
{
for (j=0; j < SIZE; j++)
{
for(k=0; k < SIZE; k++)
{
c[i][j] += (A[i][k] * b[k][j]); // matrix multiply-accumulate

}
}
}
}
#pragma omp critical
{
finish = TSCL;
printf("\n time cost %llu \n",finish-start);

for(i = 0; i<SIZE;i++)
{
for(j =0; j<SIZE;j++)
{
printf(" c[%d][%d]=%4d ",i,j,c[i][j]);
}
printf("\n");
}

free(A);
free(b);
free(c);
}

}

I set the optimization level to -O3, but the result is that the 1,000,000 multiply operations cost 26597707 cycles.

Did I test it the wrong way? Or is there a better optimization method?
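A quick check of the arithmetic in the post, written as a small C sketch. The helper names are made up for illustration, and the figure of 64 multiplies per cycle is only the post's reading of the table, not a confirmed device number.

```c
#include <stdint.h>

/* Multiply-accumulates in a naive n x n matrix product (three nested loops). */
uint64_t mac_count(uint64_t n)
{
    return n * n * n;
}

/* Ideal cycle count if `macs_per_cycle` multiplies retire every cycle,
 * rounded up. `macs_per_cycle` is an assumed figure supplied by the caller. */
uint64_t ideal_cycles(uint64_t macs, uint64_t macs_per_cycle)
{
    return (macs + macs_per_cycle - 1) / macs_per_cycle;
}

/* Measured cost of one multiply-accumulate, in cycles. */
double cycles_per_mac(uint64_t measured_cycles, uint64_t macs)
{
    return (double)measured_cycles / (double)macs;
}
```

With the numbers from the post, `mac_count(100)` is 1,000,000 and `ideal_cycles(1000000, 64)` is 15625, while `cycles_per_mac(26597707, 1000000)` comes out at roughly 26.6. The measured run is therefore about 1700 times slower than that assumed bound.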

over 10 years ago

0 Chad Courtney over 10 years ago

TI__Mastermind 30825 points

Optimization is probably the biggest factor here. To get optimized code for tight data-processing loops, as is typical of digital signal processing, you'd want to do things like 'unrolling' the loop, using pragmas to define data alignment, etc. There's a good wiki page that discusses this material.

Here's a link to a good app note that discusses the optimization techniques and how to apply them: www.ti.com/.../sprabf2.pdf
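As a rough illustration of the kind of loop restructuring meant here, the sketch below is one possible single-core rewrite of the posted kernel, not code from the app note. It assumes `c` starts at zero as in the post, so a plain store replaces `+=`. The function name `matmul_bt` and the scratch buffer `bt` are hypothetical. The `MUST_ITERATE` pragma is specific to the TI compiler, is guarded so the code also builds elsewhere, and its literal values assume SIZE is 100.

```c
#define SIZE 100

/* c = a * b for SIZE x SIZE int matrices stored row-major in flat arrays.
 * bt is caller-provided scratch space of SIZE*SIZE ints. */
void matmul_bt(const int *restrict a, const int *restrict b,
               int *restrict bt, int *restrict c)
{
    int i, j, k;

    /* Transpose b once so the inner loop reads both operands with unit stride. */
    for (k = 0; k < SIZE; k++)
        for (j = 0; j < SIZE; j++)
            bt[j * SIZE + k] = b[k * SIZE + j];

    for (i = 0; i < SIZE; i++) {
        const int *row = &a[i * SIZE];
        for (j = 0; j < SIZE; j++) {
            const int *col = &bt[j * SIZE];
            int sum = 0;  /* accumulate locally, store to memory once */
#ifdef __TI_COMPILER_VERSION__
            /* Tell the TI compiler the trip count (must match SIZE) and that
             * it is a multiple of 4, so it can unroll without cleanup code. */
#pragma MUST_ITERATE(100, 100, 4)
#endif
            for (k = 0; k < SIZE; k++)
                sum += row[k] * col[k];
            c[i * SIZE + j] = sum;
        }
    }
}
```

The `restrict` qualifiers promise the compiler that the four buffers do not overlap, which is what lets it schedule loads and stores freely. Whether this actually pipelines well on the C66x has to be confirmed from the compiler's own feedback.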

Best Regards,
Chad

0 Ganapathi Dhandapani95 over 10 years ago

TI__Mastermind 28085 points

Hi,

Refer to "Chapter 5 Optimization" in the KeyStone lab manual. This chapter demonstrates some basic optimization techniques for KeyStone devices.
www.ti.com/.../sprp820.pdf

Thanks,

0 Johannes over 10 years ago

Mastermind 6240 points

I'm not familiar with OMP, so this is just a guess... When you measured the execution time, you measured not only the computation but also the memory accesses, cache line invalidation and so on. The docs Chad and Ganapathi posted are a good place to look.
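One way to see how much of a measurement is memory and cache effects is to time the same work twice in a row and compare the first (cold) run with the second (warm) one. The sketch below is single-threaded and only illustrative. `read_ticks`, `sum_job`, `sum_kernel` and `time_cold_and_warm` are hypothetical names. On the TI compiler it reads TSCL/TSCH and assumes the counter was already started by a write to TSCL, as the posted code does. Elsewhere it falls back to `clock()`.

```c
#include <stddef.h>
#include <time.h>
#ifdef __TI_COMPILER_VERSION__
#include <c6x.h>  /* declares the TSCL/TSCH control registers */
#endif

/* Read a free-running tick counter. */
unsigned long long read_ticks(void)
{
#ifdef __TI_COMPILER_VERSION__
    unsigned int lo = TSCL;  /* reading TSCL latches the upper half in TSCH */
    unsigned int hi = TSCH;
    return ((unsigned long long)hi << 32) | lo;
#else
    return (unsigned long long)clock();  /* stand-in for host builds */
#endif
}

/* Stand-in workload: sums `len` ints from `data` into `total`. */
struct sum_job { const int *data; size_t len; long total; };

void sum_kernel(void *ctx)
{
    struct sum_job *job = (struct sum_job *)ctx;
    long total = 0;
    size_t i;
    for (i = 0; i < job->len; i++)
        total += job->data[i];
    job->total = total;
}

/* Run the same kernel twice and report each run's tick count separately.
 * The first run includes whatever cache misses occur on first touch;
 * the second run sees data that may already be cached. */
void time_cold_and_warm(void (*run)(void *), void *ctx,
                        unsigned long long *cold, unsigned long long *warm)
{
    unsigned long long t0, t1, t2;
    t0 = read_ticks();
    run(ctx);
    t1 = read_ticks();
    run(ctx);
    t2 = read_ticks();
    *cold = t1 - t0;
    *warm = t2 - t1;
}
```

If the warm number is much smaller than the cold one, a large part of the original measurement was probably memory traffic and not arithmetic. The sketch does not separate out OpenMP thread startup.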

Regards

J

0 Bo Wang1 over 10 years ago in reply to Ganapathi Dhandapani95

Intellectual 265 points

Hi Ganapathi,
Where can I get the source code for the lab projects of the KeyStone Multicore Workshop you referred to?

0 Johannes over 10 years ago in reply to Bo Wang1

Mastermind 6240 points

Hi,

Much of the code (maybe all of it, I didn't check) comes with the MCSDK:

Also, check this Optimization Workshop

Regards

0 Ganapathi Dhandapani95 over 10 years ago in reply to Bo Wang1

TI__Mastermind 28085 points

I will check with the training team and get back to you.

0 Ganapathi Dhandapani95 over 10 years ago in reply to Ganapathi Dhandapani95

TI__Mastermind 28085 points

The source and solution code for the K1 labs is available for download here: learningmedia.ti.com/.../K1labcode.zip

Thanks,

0 Bo Wang1 over 10 years ago in reply to Ganapathi Dhandapani95

Intellectual 265 points

Hi,

The link you provided seems to be unavailable. Can you verify that?

Thanks!

0 user4263705 over 10 years ago in reply to Ganapathi Dhandapani95

Intellectual 530 points

Hi

I appreciate the update, but I can't open the link either.
Can you mail it to me?
My e-mail address: wangyaohui@wanji.net.cn
Thank you!

Best Regards
Yewkui Wang

0 Ganapathi Dhandapani95 over 10 years ago in reply to user4263705

TI__Mastermind 28085 points

Hi,

I will file the IR to fix the issue. I have attached the Keystone 1 labs example .zip file in this post.

K1labcode.zip

Thanks,
