Hi experts,
I have recently been using the C7x MMA for some calculations. The SDK version is 8.01, and MMALIB does not seem to reach the 8 TOPS of compute.
Here is the sample I use to test the compute throughput:
void testMatrixMultiply(tivxRadarFft1DTransParams *prms)
{
    int i, j;
    MMALIB_kernelHandle handle = malloc(MMA_MULTIPLY_HANDLE_SIZE);

    matrixMultiplyInit(&handle, 1, 32, 160, 0, MMALIB_INT32);

    int32_t *matA = prms->pL2_FFTData;
    int32_t *matB = matA+32;
    int32_t *matC = matB+32*160;

    MMALIB_LINALG_matrixMatrixMultiply_ixX_ixX_oxX_exec_checkParams(handle, matA, matB, matC);

    uint64_t time1 = tivxPlatformGetTimeInUsecs();
    for(i=0; i<2000; i++)
    {
        matrixMultiExec(&handle, matA, matB, matC);
    }
    uint64_t time2 = tivxPlatformGetTimeInUsecs();

    printf("matrix multi total time= %lu. \n", time2-time1);
    free(handle);
}
1. I use the matrix multiplication kernel to compute [1*160] = [1*32] * [32*160] in a loop of 2000 iterations. The data is of type int32_t and all of it is in L2. This takes about 990 us.
2. Since the MMA can, as described, perform [1*16] = [1*16] * [16*16] with int32_t data in one clock cycle, the computation should take 2*2*10*2000/10^9 s = 80 us at a frequency of 1 GHz.
That is a difference of more than 10 times.
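To make the estimate easier to check, here is a minimal sketch of the arithmetic in plain C. The function names (tile_count, estimate_cycles, cycles_to_us) are made up for illustration and are not part of MMALIB. It assumes one 16-wide tile operation per clock cycle and keeps the extra factor of 2 exactly as it appears in the formula above.

```c
#include <assert.h>
#include <stdint.h>

/* Number of tiles of width `tile` needed to cover `n` elements (ceiling division). */
uint64_t tile_count(uint64_t n, uint64_t tile)
{
    assert(tile > 0);
    return (n + tile - 1) / tile;
}

/*
 * Estimated cycle count for `iterations` products [1*k] * [k*n],
 * assuming one [1*tile] * [tile*tile] operation per clock cycle.
 * `extra_factor` is the leading factor from the hand calculation
 * (2 in the formula above); it is an input here, not something the
 * hardware documentation is quoted for.
 */
uint64_t estimate_cycles(uint64_t k, uint64_t n, uint64_t tile,
                         uint64_t extra_factor, uint64_t iterations)
{
    return extra_factor * tile_count(k, tile) * tile_count(n, tile) * iterations;
}

/* Convert a cycle count to microseconds for a given clock frequency in Hz. */
double cycles_to_us(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz * 1e6;
}
```

With k = 32, n = 160, tile = 16, extra_factor = 2 and 2000 iterations, estimate_cycles gives 80000 cycles, which cycles_to_us converts to 80 us at 1 GHz. The measured 990 us is roughly 12 times that.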
What is the reason for this? How should I use the MMA to reach 8 TOPS of compute?
Best regards
Tao