TDA4VM: How to reach 8 TOPS with the C7x MMA

Part Number: TDA4VM
Hi experts,
    I have recently been using the C7x MMA for some calculations with SDK version 8.01, and MMALIB does not seem to reach the advertised 8 TOPS.
    Here is the sample I use to measure the throughput:
void testMatrixMultiply(tivxRadarFft1DTransParams *prms)
{
    int i, j;
    MMALIB_kernelHandle handle = malloc(MMA_MULTIPLY_HANDLE_SIZE);

    matrixMultiplyInit(&handle, 1, 32, 160, 0, MMALIB_INT32);
    
    
    int32_t *matA = prms->pL2_FFTData;
    int32_t *matB = matA+32;
    int32_t *matC = matB+32*160;

    MMALIB_LINALG_matrixMatrixMultiply_ixX_ixX_oxX_exec_checkParams(handle, matA, matB, matC);


    uint64_t time1 = tivxPlatformGetTimeInUsecs();

    for(i=0; i<2000; i++)
    {
        matrixMultiExec(&handle, matA, matB, matC);
    }

    uint64_t time2 = tivxPlatformGetTimeInUsecs();
    printf("matrix multi total time= %lu. \n", time2-time1);



    free(handle);
}
 1. I use the matrix multiplication kernel to compute [1*160] = [1*32] * [32*160] in a loop of 2000 iterations. The int32_t data are all in L2, and the whole loop takes about 990 us.
 2. Since the MMA can, as described, compute [1*16] = [1*16] * [16*16] with int32_t data in one clock cycle, the loop should take 2*2*10*2000/10^9 s = 80 us at a frequency of 1 GHz.
    That is more than a 10x difference.
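
For reference, the estimate above can be reproduced with a small host-side sketch in plain C. It does not call MMALIB. The function names are hypothetical, and the interpretation of the formula (2 blocks along K, 10 blocks along N, and an extra factor of 2 whose origin the post does not state) is an assumption. The last helper turns the measured 990 us into an achieved operation rate, counting one multiply-accumulate as two operations, which is also an assumed convention.

```c
#include <stdint.h>

/* Cycle estimate following the post's model: one [1x16] * [16x16] int32
 * block per clock cycle. extra_factor is the unexplained leading 2 in the
 * post's formula and is passed in unchanged. */
static uint64_t estimate_cycles(uint32_t k, uint32_t n, uint32_t tile,
                                uint32_t extra_factor, uint32_t iterations)
{
    uint64_t k_blocks = (k + tile - 1u) / tile;  /* K = 32  -> 2 blocks  */
    uint64_t n_blocks = (n + tile - 1u) / tile;  /* N = 160 -> 10 blocks */
    return (uint64_t)extra_factor * k_blocks * n_blocks * iterations;
}

/* Convert a cycle count to microseconds at the given clock frequency. */
static double cycles_to_usec(uint64_t cycles, double clock_hz)
{
    return (double)cycles / clock_hz * 1e6;
}

/* Achieved rate in TOPS for an [m x k] * [k x n] product repeated
 * `iterations` times, assuming 2 operations per multiply-accumulate. */
static double achieved_tops(uint32_t m, uint32_t k, uint32_t n,
                            uint32_t iterations, double elapsed_usec)
{
    double ops = 2.0 * (double)m * (double)k * (double)n * (double)iterations;
    return ops / (elapsed_usec * 1e-6) / 1e12;
}
```

Calling estimate_cycles(32, 160, 16, 2, 2000) gives 80000 cycles, which cycles_to_usec converts to 80 us at 1e9 Hz. With the measured time, achieved_tops(1, 32, 160, 2000, 990.0) comes out at roughly 0.02 TOPS under the stated convention.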
 
    What is the reason for this, and how should the MMA be used to reach 8 TOPS?
Best regards
Tao