
TMS320C6748: Matrix multiplication benchmark

Part Number: TMS320C6748
Other Parts Discussed in Thread: PROCESSOR-SDK-OMAPL138

I am trying to find out how long a matrix multiplication takes on different processors. I found this link, which gave me some basic numbers for 16x16 timing.

https://www.ti.com/processors/digital-signal-processors/core-benchmarks/core-benchmarks.html

However, I am more interested in multiplication at bigger sizes. Do you happen to have more info that you can provide?

I currently have access to a C6748 with DSPLIB. I did some timing analysis, and it scaled badly as the matrix size grew. I have a couple of suspicions about why that is the case, but do you happen to have a good explanation for it?

Here are the timings that I have collected with different dimensions:

16x16 real, buffers in DDR : 19us
16x16 cplx, buffers in DDR : 36us
32x32 real, buffers in DDR : 85us
32x32 cplx, buffers in DDR : 215us
64x64 real, buffers in DDR : 520us
64x64 cplx, buffers in DDR : 1830us
128x128 real, buffers in DDR : 15.2ms
128x128 cplx, buffers in DDR : 60.0ms
256x256 real, buffers in DDR : 1642ms
256x256 cplx, buffers in DDR : 3345ms
16x16 real, buffers in L2 : 12us
16x16 cplx, buffers in L2 : 27us
32x32 real, buffers in L2 : 70us
32x32 cplx, buffers in L2 : 180us
Is there a way to do matrix multiplication with large dimensions effectively?
  • Hi,

    We are looking into this issue.

    Can you please provide details for below?

    1. What is the DSPLIB version you used?

    2. Which example did you pick for benchmarking the above results?

    3. What is the Cache setting? (L1P/L1D/L2 Cache sizes)

    4. Have you tried any other cache settings, and do you have the results to compare?

    5. Is DDR cached? 

    6. What is the DDR Speed?

    7. Is the code (DSP library) used as is, or recompiled and rebuilt with other compiler options? If yes, please share the details.

    Thanks,

    Aravind

  • Here is the info that you were asking for:

    1. DSPLIB 3.4.0.0

    2. The benchmarking code is something that I wrote. It looks like the following:

    if (init == 0)
    {
        for (i = 0; i < 2*TEST_DIM; i++)
        {
            for (j = 0; j < TEST_DIM; j++)
            {
                x1[i][j] = 1.0;
                x2[i][j] = 2.0;
                y[i][j] = 0.0;
            }
        }
        init = 1;
    }

    time1 = Timestamp_get32();

    switch (iter % 2)
    {
        case 0:
            DSPF_sp_mat_mul(&x1[0][0], TEST_DIM, TEST_DIM, &x2[0][0], TEST_DIM, &y[0][0]);
            break;
        case 1:
            DSPF_sp_mat_mul_cplx(&x1[0][0], TEST_DIM, TEST_DIM, &x2[0][0], TEST_DIM, &y[0][0]);
            break;
        case 2:
            //DSPF_sp_mat_mul_gemm(&x1[0][0], 3.0, 1000, 100, &x2[0][0], 1000, &y[0][0]);
            break;
        case 3:
            //DSPF_sp_mat_mul_gemm_cplx(&x1[0][0], 3.0, 1000, 50, &x2[0][0], 1000, &y[0][0]);
            break;
        case 4:
            //DSPF_sp_mat_trans(&x1[0][0], 1000, 100, &y[0][0]);
            break;
        default:
            break;
    }

    iter++;

    time2 = Timestamp_get32();
    delta1 = time2 - time1;
    // printf("Time %u, %u => %u\n", time1, time2, delta1);

    Timestamp_getFreq(&freq);
    time1_ms = (float)delta1 * 1000.0 / (float)freq.lo;
    printf("DSPF_sp_mat_mul : %f ms\n", time1_ms);

    3. 32K/32K/128K

    4. No

    5. Yes

    6. 144MHz

    7. As is

  • Thanks for providing the details. From the code you shared, I see that the benchmark application you are using is a SYS/BIOS application.

    Can you also please share,

    The PROCESSOR-SDK-OMAPL138 release version (in case I need to check any other component versions)
    SYS/BIOS version and XDC version that you used for your exercise
    RTSC cfg file for the application (if it is an RTSC application)
    Linker command file and the generated map file for the application
    Any custom platform and/or memory map
    Is the LITTLE_ENDIAN or BIG_ENDIAN code used from the DSPLIB?

    Can you go through the below application note:
    www.ti.com/.../sprabf2.pdf

    Especially the Compiler and Profiling sections. The profiling section in the app note demonstrates the TSCL-based benchmark; the TSCL information is the data in DSP core cycles.
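    For reference, below is a minimal sketch of that TSCL-based measurement, assuming the TI C6000 compiler (which exposes TSCL through c6x.h) and reusing the DSPF_sp_mat_mul() call from your benchmark code above; the DSPLIB header name and the TEST_DIM value are illustrative placeholders:

    #include <stdio.h>
    #include <c6x.h>                  /* exposes the TSCL time-stamp counter register */
    #include "DSPF_sp_mat_mul.h"      /* assumed DSPLIB header name */

    #define TEST_DIM 64               /* example size, matching one of the runs above */

    static float x1[TEST_DIM][TEST_DIM], x2[TEST_DIM][TEST_DIM], y[TEST_DIM][TEST_DIM];

    void benchmark_in_cycles(void)
    {
        unsigned int t0, t1;

        TSCL = 0;                     /* any write starts the free-running counter */
        t0 = TSCL;
        DSPF_sp_mat_mul(&x1[0][0], TEST_DIM, TEST_DIM, &x2[0][0], TEST_DIM, &y[0][0]);
        t1 = TSCL;

        printf("DSPF_sp_mat_mul : %u DSP core cycles\n", t1 - t0);
    }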

    Can you please provide the register dumps for the below?
    1. L2CFG register --> Can you make sure the L2MODE field in L2CFG is set to the maximum allowed cache size (value = 7)?
    2. L1DCFG Level 1 Data Configuration Register (Address: 0x01840040)
    3. L1PCFG Level 1 Program Configuration Register (Address: 0x01840020)

    You can refer to the C66x CorePac user guide for more details on the above registers.
    https://www.ti.com/lit/sprugw0
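    If it is easier, here is a hedged sketch for dumping those registers from the application via their memory-mapped addresses; the L1DCFG and L1PCFG addresses are the ones listed above, while the L2CFG address (0x01840000) is an assumption taken from the corepac register map, so please verify it against the user guide:

    #include <stdio.h>

    /* L2CFG address is an assumption; L1PCFG/L1DCFG addresses are from the list above */
    #define L2CFG   (*(volatile unsigned int *)0x01840000)
    #define L1PCFG  (*(volatile unsigned int *)0x01840020)
    #define L1DCFG  (*(volatile unsigned int *)0x01840040)

    void dump_cache_cfg(void)
    {
        printf("L2CFG  = 0x%08x (L2MODE = %u)\n", L2CFG, L2CFG & 0x7u);
        printf("L1PCFG = 0x%08x\n", L1PCFG);
        printf("L1DCFG = 0x%08x\n", L1DCFG);
    }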

    From the details, it appears to me that you are hitting cache misses for the higher-order matrix multiplications. So, it may be good for you to refer to the TMS320C66x DSP Cache User Guide (Literature Number: SPRUGY8): https://www.ti.com/lit/sprugy8

    Is it possible for you to place DSPF_sp_mat_mul() and DSPF_sp_mat_mul_cplx() next to each other in memory using the linker command file and benchmark the results?

    As per my understanding, you are looking to run the DSPLIB matrix multiplication effectively for large N values. As N gets large, you will see an impact due to cache misses.

  • Here is some of the info that you are looking for.

    The other stuff will take me a bit of time to obtain. But regarding your suspicion about the cause: I was suspecting the cache as well, and that is what I am concerned with. If the matrix multiplication function does not account for the cache limitation when the matrix size grows large and ends up causing a lot of cache thrashing, then there is really no easy way to do large matrix multiplication effectively on the DSP using the function as is.

    We are in the middle of selecting a processor, and the C6748 is just something we already have, so I was using it for a quick test. Given that we are open to picking a different processor, is there a processor/DSPLIB that can do large matrix multiplication effectively, or do they all suffer from the same limitation?

  • Hi,

    To minimize the impact of cache misses, you can divide and conquer the matrix multiplication using sub-matrix (partition) multiplications and adding the results.

    High-order matrix multiplication is always very expensive. The DSPLIB is very generic and can be used efficiently by a higher-layer application.

    Example: I have shown partition-based multiplication for a simple 4x4 matrix multiplication.

    Conceptually, this can be extended to higher orders as well (a code sketch of the same idea follows the worked example below).

    Let's say matrix A is:

              A1     A2     A3     A4
        1    567    456     56     45
        2     43     23     52    214
        3    309    435   3459   6439
        4    490     49     68   4305

    and matrix B is:

              B1     B2     B3     B4
        1     58    539    235    657
        2     45     68   -345    450
        3   3405    304     68   4305
        4      2     24  50957   5307

    If C = AB, then the matrix C is:

               C1        C2         C3         C4
        1    244176    354725    2272798    1057614
        2    181017     45685   10910504    1398159
        3  11828270   1402203  328269875   49461531
        4    270775    391434  219472754   23483355

    This can be computed by partitioning A into column blocks and B into row blocks, computing the intermediate C matrices, and then adding them.

    That is, C = Ci + Cj, where Ci = Ap x Bp and Cj = Aq x Bq.

    Ap is the first two columns of A:

              A1     A2
        1    567    456
        2     43     23
        3    309    435
        4    490     49

    and Bp is the first two rows of B:

              B1     B2     B3     B4
        1     58    539    235    657
        2     45     68   -345    450

    The resulting partial matrix Ci = Ap x Bp is:

              C1      C2      C3      C4
        1   53406  336621  -24075  577719
        2    3529   24741    2170   38601
        3   37497  196131  -77460  398763
        4   30625  267442   98245  343980

    You can compute the next partial matrix Cj = Aq x Bq, where Aq is the last two columns of A:

              A3     A4
        1     56     45
        2     52    214
        3   3459   6439
        4     68   4305

    and Bq is the last two rows of B:

              B1     B2      B3     B4
        1   3405    304      68   4305
        2      2     24   50957   5307

    The resulting partial matrix Cj = Aq x Bq is:

               C1       C2         C3         C4
        1    190770    18104    2296873     479895
        2    177488    20944   10908334    1359558
        3  11790773  1206072  328347335   49062768
        4    240150   123992  219374509   23139375

    The final answer is computed by adding the two intermediate results element by element, C = Ci + Cj, which gives back exactly the matrix C shown above (for example, C1 of row 1 is 53406 + 190770 = 244176).

    Note that the result computed with the partitioned multiply-and-add is the same as the one obtained with the native matrix multiplication.
    The advantage of the partitioning approach is that the data can be fetched once and the results computed while the data is still in the cache lines.
    The trick is to stop partitioning once all of the data fits in the cache lines. (In this case, you can probably stop partitioning the large matrix multiplication when the partitioned matrix order is 16, or I think even 32 should be okay. You can experiment.)
    I hope you got the essence of how we can effectively compute the matrix multiplication for a higher-order matrix.
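    To make the idea concrete in code, below is a hedged sketch of the same blocked approach built on top of DSPF_sp_mat_mul() as it is used earlier in this thread; N, BLK, the scratch-buffer placement and the header name are illustrative assumptions, not library requirements. Tiles of A and B are copied into small contiguous buffers (ideally placed in L2 SRAM), multiplied with DSPLIB, and the partial products accumulated into C:

    #include <string.h>
    #include "DSPF_sp_mat_mul.h"      /* assumed DSPLIB header name */

    #define N    256                  /* full matrix order (example) */
    #define BLK  32                   /* tile order; pick so the working set fits in cache/L2 */

    /* Scratch tiles; ideally placed in L2 SRAM via the linker command file */
    static float tileA[BLK*BLK], tileB[BLK*BLK], tileC[BLK*BLK];

    /* Copy a BLK x BLK tile starting at (row, col) of an N x N matrix into dst */
    static void copy_tile(const float *src, int row, int col, float *dst)
    {
        int i;
        for (i = 0; i < BLK; i++)
            memcpy(&dst[i*BLK], &src[(row + i)*N + col], BLK*sizeof(float));
    }

    void blocked_mat_mul(const float *A, const float *B, float *C)
    {
        int i, j, k, r, c;

        memset(C, 0, N*N*sizeof(float));

        for (i = 0; i < N; i += BLK)
            for (j = 0; j < N; j += BLK)
                for (k = 0; k < N; k += BLK)
                {
                    copy_tile(A, i, k, tileA);
                    copy_tile(B, k, j, tileB);

                    /* tileC = tileA x tileB (DSPF_sp_mat_mul overwrites its output) */
                    DSPF_sp_mat_mul(tileA, BLK, BLK, tileB, BLK, tileC);

                    /* Accumulate the partial product into the full-size result */
                    for (r = 0; r < BLK; r++)
                        for (c = 0; c < BLK; c++)
                            C[(i + r)*N + (j + c)] += tileC[r*BLK + c];
                }
    }

    The best BLK value depends on the cache sizes discussed above, so it is worth experimenting as mentioned.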

  • Thanks for the suggestion. I am aware of the divide and conquer method. I am just hoping that there is a version of the library that has that feature built-in. But if not, that is okay then. Thanks again for answering my questions.

  • By the way, is there an equivalent of a BLAS library that can be run on the DSP side?

  • Hi Charles

    Please take a look at some discussion of BLAS in the following thread:

    https://e2e.ti.com/support/tools/ccs/f/81/t/928242

    BLAS is available for some of our higher-end multicore devices, but not for C674x.

    Regards

    Mukul 

  • Thanks. That is the information that I am looking for.

  • Hi Charles

    Ok.
    Do let us know if you have more questions as you go through the processor/DSP selection.

    We are also soliciting feedback on areas to improve in DSPLIB for the C66x family, etc.

    Regards

    Mukul