Part Number: TDA4VM
Hi experts,
I recently used the C7x MMA to perform some calculations (SDK version 8.01), and MMALIB does not seem to reach 8 TOPS of compute.
Here is the sample I used to measure the throughput:
void testMatrixMultiply(tivxRadarFft1DTransParams *prms)
{
    int i;
    MMALIB_kernelHandle handle = malloc(MMA_MULTIPLY_HANDLE_SIZE);
    if (handle == NULL)
    {
        return; /* handle allocation failed */
    }
    matrixMultiplyInit(&handle, 1, 32, 160, 0, MMALIB_INT32);

    /* All three matrices are int32_t and live back to back in L2. */
    int32_t *matA = prms->pL2_FFTData; /* A: 1 x 32   */
    int32_t *matB = matA + 32;         /* B: 32 x 160 */
    int32_t *matC = matB + 32 * 160;   /* C: 1 x 160  */
    MMALIB_LINALG_matrixMatrixMultiply_ixX_ixX_oxX_exec_checkParams(handle, matA, matB, matC);

    /* Time 2000 back-to-back kernel calls. */
    uint64_t time1 = tivxPlatformGetTimeInUsecs();
    for (i = 0; i < 2000; i++)
    {
        matrixMultiExec(&handle, matA, matB, matC);
    }
    uint64_t time2 = tivxPlatformGetTimeInUsecs();
    printf("matrix multi total time= %llu us.\n", (unsigned long long)(time2 - time1));
    free(handle);
}

1. I use the matrix multiplication kernel to compute [1*160] = [1*32] * [32*160] in a loop of 2000 iterations. The data is of type int32_t and all of it resides in L2. The whole loop takes about 990 us.
2. Since the MMA can, as described, perform [1*16] = [1*16] * [16*16] with int32_t data in one clock cycle, the loop should take 2*2*10*2000/10^9 s = 80 us at a frequency of 1 GHz.
That is more than a 10x difference.
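As a sanity check on the arithmetic above, here is a minimal sketch in plain C that recomputes the estimate and puts the measurement in per-call terms. All function names are hypothetical helpers, not part of MMALIB or the SDK. The `cycles_per_tile` parameter is an assumption: a value of 2 reproduces the 2*2*10 figure used above, since the origin of that extra factor of 2 is not stated.

```c
#include <assert.h>
#include <stdint.h>

/* Tiles needed to cover `dim` elements with `tile`-wide tiles (rounds up). */
static uint32_t tiles_for(uint32_t dim, uint32_t tile)
{
    assert(tile > 0u);
    return (dim + tile - 1u) / tile;
}

/* Tile operations for one [1 x k] * [k x n] product. */
static uint32_t tiles_per_product(uint32_t k, uint32_t n, uint32_t tile)
{
    return tiles_for(k, tile) * tiles_for(n, tile);
}

/* Ideal loop time in microseconds.
 * cycles_per_tile is an ASSUMPTION (2 matches the 2*2*10 estimate). */
static double ideal_loop_us(uint32_t k, uint32_t n, uint32_t tile,
                            uint32_t cycles_per_tile, uint32_t iterations,
                            double freq_mhz)
{
    double cycles = (double)tiles_per_product(k, n, tile)
                    * cycles_per_tile * iterations;
    return cycles / freq_mhz; /* freq_mhz == cycles per microsecond */
}

/* Average cycles spent per kernel call, from a measured loop time. */
static double measured_cycles_per_call(double loop_us, uint32_t iterations,
                                       double freq_mhz)
{
    return loop_us * freq_mhz / iterations;
}

/* Achieved throughput in GOPS, counting a multiply and an add per MAC. */
static double achieved_gops(uint32_t k, uint32_t n, uint32_t iterations,
                            double loop_us)
{
    double ops = 2.0 * k * n * iterations;
    return ops / (loop_us * 1e3); /* ops per nanosecond == GOPS */
}
```

With the numbers from this post, `ideal_loop_us(32, 160, 16, 2, 2000, 1000.0)` gives 80 us, while `measured_cycles_per_call(990.0, 2000, 1000.0)` gives 495 cycles per call against the 40 cycles the estimate allows. `achieved_gops(32, 160, 2000, 990.0)` comes out at roughly 20.7 GOPS.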
What is the reason for this? How can I use the MMA to reach 8 TOPS?
Best regards
Tao