TDA4VM: C7x MMA 8TOPS implement

Tao Xie

Part Number: TDA4VM

Hi experts,

Recently I have been using the C7x MMA to perform some calculations. The SDK version is 8.01, and MMALIB does not seem to reach the stated computing performance of 8 TOPS.

Here is the sample I use to test the performance:

void testMatrixMultiply(tivxRadarFft1DTransParams *prms)
{
    int i;
    MMALIB_kernelHandle handle = malloc(MMA_MULTIPLY_HANDLE_SIZE);

    /* Configure the kernel for [1x32] * [32x160] with int32 data */
    matrixMultiplyInit(&handle, 1, 32, 160, 0, MMALIB_INT32);

    /* A, B and C are laid out back to back in the L2 buffer */
    int32_t *matA = prms->pL2_FFTData;   /* 1 x 32   */
    int32_t *matB = matA + 32;           /* 32 x 160 */
    int32_t *matC = matB + 32 * 160;     /* 1 x 160  */

    MMALIB_LINALG_matrixMatrixMultiply_ixX_ixX_oxX_exec_checkParams(handle, matA, matB, matC);

    /* Time 2000 back-to-back executions of the kernel */
    uint64_t time1 = tivxPlatformGetTimeInUsecs();

    for (i = 0; i < 2000; i++)
    {
        matrixMultiExec(&handle, matA, matB, matC);
    }

    uint64_t time2 = tivxPlatformGetTimeInUsecs();
    printf("matrix multi total time= %llu. \n", (unsigned long long)(time2 - time1));

    free(handle);
}

1. I use the matrix multiplication kernel to calculate [1*160] = [1*32] * [32*160] in a loop of 2000 iterations. All data is of type int32_t and resides in L2. The computation takes about 990 us.

2. Since the MMA can, as described, perform the calculation [1*16] = [1*16] * [16*16] with int32_t data in one clock cycle, the whole computation should take 2*2*10*2000 cycles / 10^9 Hz = 80 us at a frequency of 1 GHz.

That is a difference of more than 10 times.

What is the reason for this? How should the MMA be used to reach the computing performance of 8 TOPS?
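
To put numbers on this comparison, here is a minimal sketch in C of the arithmetic from points 1 and 2. The helper names are hypothetical (they are not part of MMALIB or TIOVX), and the figure of 2*2*10 cycles per call is taken as-is from the estimate in point 2, not from the MMA documentation.

```c
#include <stdint.h>

/* Hypothetical helpers for the estimate; not MMALIB or TIOVX APIs. */

/* Number of blocks of size `tile` needed to cover `n` elements (ceiling division). */
static inline uint64_t tiles(uint64_t n, uint64_t tile)
{
    return (n + tile - 1) / tile;
}

/* Total multiply-accumulates for `calls` products of [m x k] * [k x n]. */
static inline uint64_t total_macs(uint64_t m, uint64_t k, uint64_t n, uint64_t calls)
{
    return m * k * n * calls;
}

/* Convert a cycle count to microseconds for a clock given in MHz. */
static inline double cycles_to_us(uint64_t cycles, double clock_mhz)
{
    return (double)cycles / clock_mhz;
}

/* Achieved throughput in GMAC/s: MACs per microsecond is MMAC/s, so divide by 1e3. */
static inline double gmacs_per_second(uint64_t macs, double elapsed_us)
{
    return (double)macs / (elapsed_us * 1e3);
}

/* Ratio of measured time to estimated time. */
static inline double slowdown(double measured_us, double estimated_us)
{
    return measured_us / estimated_us;
}
```

For example, cycles_to_us(2*2*10*2000, 1000.0) reproduces the 80 us estimate, slowdown(990.0, 80.0) gives about 12.4, and gmacs_per_second(total_macs(1, 32, 160, 2000), 990.0) gives roughly 10.3 GMAC/s for the measured run.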

Best regards

Tao

over 2 years ago
