
TMS320C6748: Matrix multiplication benchmark

Part Number: TMS320C6748
Other Parts Discussed in Thread: PROCESSOR-SDK-OMAPL138

I am trying to find out how long a matrix multiplication takes on different processors. I found this link, which gave me some basic numbers for 16x16 timing.

https://www.ti.com/processors/digital-signal-processors/core-benchmarks/core-benchmarks.html

However, I am more interested in multiplication at bigger sizes. Do you happen to have more info that you can provide?

I currently have access to a C6748 with DSPLIB. I did some timing analysis, and it scaled badly as the matrix size grew. I have a couple of suspicions about why that is the case, but do you happen to have a good explanation for it?

Here are the timings that I have collected with different dimensions:

16x16 real, buffers in DDR : 19us
16x16 cplx, buffers in DDR : 36us
32x32 real, buffers in DDR : 85us
32x32 cplx, buffers in DDR : 215us
64x64 real, buffers in DDR : 520us
64x64 cplx, buffers in DDR : 1830us
128x128 real, buffers in DDR : 15.2ms
128x128 cplx, buffers in DDR : 60.0ms
256x256 real, buffers in DDR : 1642ms
256x256 cplx, buffers in DDR : 3345ms
16x16 real, buffers in L2 : 12us
16x16 cplx, buffers in L2 : 27us
32x32 real, buffers in L2 : 70us
32x32 cplx, buffers in L2 : 180us
Is there a way to do matrix multiplication with large dimensions effectively?
  • Hi,

    We are looking into this issue.

    Can you please provide details for below?

    1. What is the DSPLIB version you used?

    2. Which example did you pick for benchmarking the above results?

    3. What is the Cache setting? (L1P/L1D/L2 Cache sizes)

    4. Have you tried any other cache settings, and do you have the results to compare?

    5. Is DDR cached? 

    6. What is the DDR Speed?

    7. Is the code (DSP library) used as is, or recompiled and rebuilt with other compiler options? If yes, please share the details.

    Thanks,

    Aravind

  • Here is the info that you were asking for:

    1. DSPLIB 3.4.0.0

    2. The benchmarking code is something that I wrote. It looks like the following:

    if (init == 0)
    {
        for (i = 0; i < 2*TEST_DIM; i++)
        {
            for (j = 0; j < TEST_DIM; j++)
            {
                x1[i][j] = 1.0;
                x2[i][j] = 2.0;
                y[i][j] = 0.0;
            }
        }
        init = 1;
    }

    time1 = Timestamp_get32();

    switch (iter % 2)
    {
        case 0:
            DSPF_sp_mat_mul(&x1[0][0], TEST_DIM, TEST_DIM, &x2[0][0], TEST_DIM, &y[0][0]);
            break;
        case 1:
            DSPF_sp_mat_mul_cplx(&x1[0][0], TEST_DIM, TEST_DIM, &x2[0][0], TEST_DIM, &y[0][0]);
            break;
        case 2:
            //DSPF_sp_mat_mul_gemm(&x1[0][0], 3.0, 1000, 100, &x2[0][0], 1000, &y[0][0]);
            break;
        case 3:
            //DSPF_sp_mat_mul_gemm_cplx(&x1[0][0], 3.0, 1000, 50, &x2[0][0], 1000, &y[0][0]);
            break;
        case 4:
            //DSPF_sp_mat_trans(&x1[0][0], 1000, 100, &y[0][0]);
            break;
        default:
            break;
    }

    iter++;

    time2 = Timestamp_get32();
    delta1 = time2 - time1;
    // printf("Time %u, %u => %u\n", time1, time2, delta1);

    Timestamp_getFreq(&freq);
    time1_ms = (float)delta1 * 1000.0 / (float)freq.lo;
    printf("DSPF_sp_mat_mul : %f ms\n", time1_ms);

    3. 32K/32K/128K

    4. No

    5. Yes

    6. 144MHz

    7. As is

  • Thanks for providing the details. From the code you shared, I see that the benchmark application you are using is a SYS/BIOS application.

    Can you also please share,

    The PROCESSOR-SDK-OMAPL138 release version (in case I need to check any other component versions)
    SYS/BIOS version and XDC version that you used for your exercise
    RTSC cfg file for the application (if it is an RTSC application)
    Linker command file and the generated map file for the application
    Any custom platform and/or memory map
    Is the LITTLE_ENDIAN or BIG_ENDIAN code used from the DSPLIB?

    Can you go through the below application note:
    www.ti.com/.../sprabf2.pdf

    Especially the Compiler and Profiling sections. The profiling section in the app note demonstrates the TSCL-based benchmark; the TSCL information is the data in DSP core cycles.
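    For reference, below is a minimal sketch of that TSCL-based measurement, assuming the TI C6000 compiler (which exposes TSCL through c6x.h) and reusing the DSPF_sp_mat_mul() call from your benchmark code above; the DSPLIB header name and the TEST_DIM value are illustrative placeholders:

    #include <stdio.h>
    #include <c6x.h>                  /* exposes the TSCL time-stamp counter register */
    #include "DSPF_sp_mat_mul.h"      /* assumed DSPLIB header name */

    #define TEST_DIM 64               /* example size, matching one of the runs above */

    static float x1[TEST_DIM][TEST_DIM], x2[TEST_DIM][TEST_DIM], y[TEST_DIM][TEST_DIM];

    void benchmark_in_cycles(void)
    {
        unsigned int t0, t1;

        TSCL = 0;                     /* any write starts the free-running counter */
        t0 = TSCL;
        DSPF_sp_mat_mul(&x1[0][0], TEST_DIM, TEST_DIM, &x2[0][0], TEST_DIM, &y[0][0]);
        t1 = TSCL;

        printf("DSPF_sp_mat_mul : %u DSP core cycles\n", t1 - t0);
    }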

    Can you please provide the register dumps for the below?
    1. L2CFG register --> Can you make sure the L2MODE field in L2CFG is set to the maximum allowed cache size (value = 7)?
    2. L1DCFG Level 1 Data Configuration Register (Address: 0x01840040)
    3. L1PCFG Level 1 Program Configuration Register (Address: 0x01840020)

    You can refer to the C66x CorePac user guide for more details on the above registers.
    https://www.ti.com/lit/sprugw0
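    If it is easier, here is a hedged sketch for dumping those registers from the application via their memory-mapped addresses; the L1DCFG and L1PCFG addresses are the ones listed above, while the L2CFG address (0x01840000) is an assumption taken from the corepac register map, so please verify it against the user guide:

    #include <stdio.h>

    /* L2CFG address is an assumption; L1PCFG/L1DCFG addresses are from the list above */
    #define L2CFG   (*(volatile unsigned int *)0x01840000)
    #define L1PCFG  (*(volatile unsigned int *)0x01840020)
    #define L1DCFG  (*(volatile unsigned int *)0x01840040)

    void dump_cache_cfg(void)
    {
        printf("L2CFG  = 0x%08x (L2MODE = %u)\n", L2CFG, L2CFG & 0x7u);
        printf("L1PCFG = 0x%08x\n", L1PCFG);
        printf("L1DCFG = 0x%08x\n", L1DCFG);
    }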

    From the details, it appears to me that you are hitting cache misses for the higher-order matrix multiplications. So, it may be good for you to refer to the TMS320C66x DSP Cache User Guide (Literature Number: SPRUGY8): https://www.ti.com/lit/sprugy8

    Is it possible for you to place DSPF_sp_mat_mul() and DSPF_sp_mat_mul_cplx() next to each other in memory using the linker command file and benchmark the results?

    As per my understanding, you are looking to run the DSPLIB matrix multiplication effectively for large N values. As N gets large, you will see an impact due to cache misses.

  • Here is some of the info that you are looking for.

    The other stuff will take me a bit of time to obtain. But regarding your suspicion about the cause: I was suspecting the cache as well, and that is what I am concerned with. If the matrix multiplication function does not account for the cache limitation when the matrix size grows large and ends up causing a lot of cache thrashing, then there is really no easy way to do large matrix multiplication effectively on the DSP using the function as is.

    We are in the middle of selecting a processor, and the C6748 is just something we already have, so I was using it for a quick test. Given that we are open to picking a different processor, is there a processor/DSPLIB that can do large matrix multiplication effectively, or do they all suffer from the same limitation?

  • Hi,

    To minimize the impact of cache misses, you can divide and conquer the matrix multiplication using sub-matrix (partition) multiplications and adding the results.

    High-order matrix multiplication is always very expensive. The DSPLIB is very generic and can be used efficiently by a higher-layer application.

    Example: I have shown partition-based multiplication for a simple 4x4 matrix multiplication.

    Conceptually, this can be extended to higher orders as well (a code sketch of the same idea follows the worked example below).

    Let's say matrix A is:

              A1     A2     A3     A4
        1    567    456     56     45
        2     43     23     52    214
        3    309    435   3459   6439
        4    490     49     68   4305

    and matrix B is:

              B1     B2     B3     B4
        1     58    539    235    657
        2     45     68   -345    450
        3   3405    304     68   4305
        4      2     24  50957   5307

    If C = AB, then the matrix C is:

               C1        C2         C3         C4
        1    244176    354725    2272798    1057614
        2    181017     45685   10910504    1398159
        3  11828270   1402203  328269875   49461531
        4    270775    391434  219472754   23483355

    This can be computed by partitioning A into column blocks and B into row blocks, computing the intermediate C matrices, and then adding them.

    That is, C = Ci + Cj, where Ci = Ap x Bp and Cj = Aq x Bq.

    Ap is the first two columns of A:

              A1     A2
        1    567    456
        2     43     23
        3    309    435
        4    490     49

    and Bp is the first two rows of B:

              B1     B2     B3     B4
        1     58    539    235    657
        2     45     68   -345    450

    The resulting partial matrix Ci = Ap x Bp is:

              C1      C2      C3      C4
        1   53406  336621  -24075  577719
        2    3529   24741    2170   38601
        3   37497  196131  -77460  398763
        4   30625  267442   98245  343980

    You can compute the next partial matrix Cj = Aq x Bq, where Aq is the last two columns of A:

              A3     A4
        1     56     45
        2     52    214
        3   3459   6439
        4     68   4305

    and Bq is the last two rows of B:

              B1     B2      B3     B4
        1   3405    304      68   4305
        2      2     24   50957   5307

    The resulting partial matrix Cj = Aq x Bq is:

               C1       C2         C3         C4
        1    190770    18104    2296873     479895
        2    177488    20944   10908334    1359558
        3  11790773  1206072  328347335   49062768
        4    240150   123992  219374509   23139375

    The final answer is computed by adding the two intermediate results element by element, C = Ci + Cj, which gives back exactly the matrix C shown above (for example, C1 of row 1 is 53406 + 190770 = 244176).

    Note that the result computed with the partitioned multiply-and-add is the same as the one obtained with the native matrix multiplication.
    The advantage of the partitioning approach is that the data can be fetched once and the results computed while the data is still in the cache lines.
    The trick is to stop partitioning once all of the data fits in the cache lines. (In this case, you can probably stop partitioning the large matrix multiplication when the partitioned matrix order is 16, or I think even 32 should be okay. You can experiment.)
    I hope you got the essence of how we can effectively compute the matrix multiplication for a higher-order matrix.
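    To make the idea concrete in code, below is a hedged sketch of the same blocked approach built on top of DSPF_sp_mat_mul() as it is used earlier in this thread; N, BLK, the scratch-buffer placement and the header name are illustrative assumptions, not library requirements. Tiles of A and B are copied into small contiguous buffers (ideally placed in L2 SRAM), multiplied with DSPLIB, and the partial products accumulated into C:

    #include <string.h>
    #include "DSPF_sp_mat_mul.h"      /* assumed DSPLIB header name */

    #define N    256                  /* full matrix order (example) */
    #define BLK  32                   /* tile order; pick so the working set fits in cache/L2 */

    /* Scratch tiles; ideally placed in L2 SRAM via the linker command file */
    static float tileA[BLK*BLK], tileB[BLK*BLK], tileC[BLK*BLK];

    /* Copy a BLK x BLK tile starting at (row, col) of an N x N matrix into dst */
    static void copy_tile(const float *src, int row, int col, float *dst)
    {
        int i;
        for (i = 0; i < BLK; i++)
            memcpy(&dst[i*BLK], &src[(row + i)*N + col], BLK*sizeof(float));
    }

    void blocked_mat_mul(const float *A, const float *B, float *C)
    {
        int i, j, k, r, c;

        memset(C, 0, N*N*sizeof(float));

        for (i = 0; i < N; i += BLK)
            for (j = 0; j < N; j += BLK)
                for (k = 0; k < N; k += BLK)
                {
                    copy_tile(A, i, k, tileA);
                    copy_tile(B, k, j, tileB);

                    /* tileC = tileA x tileB (DSPF_sp_mat_mul overwrites its output) */
                    DSPF_sp_mat_mul(tileA, BLK, BLK, tileB, BLK, tileC);

                    /* Accumulate the partial product into the full-size result */
                    for (r = 0; r < BLK; r++)
                        for (c = 0; c < BLK; c++)
                            C[(i + r)*N + (j + c)] += tileC[r*BLK + c];
                }
    }

    The best BLK value depends on the cache sizes discussed above, so it is worth experimenting as mentioned.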

  • Thanks for the suggestion. I am aware of the divide and conquer method. I am just hoping that there is a version of the library that has that feature built-in. But if not, that is okay then. Thanks again for answering my questions.

  • By the way, is there an equivalent of a BLAS library that can be run on the DSP side?

  • Hi Charles

    Please take a look at some discussion of BLAS in the following thread:

    https://e2e.ti.com/support/tools/ccs/f/81/t/928242

    BLAS is available for some of our higher-end multicore devices, but not for C674x.

    Regards

    Mukul 

  • Thanks. That is the information that I am looking for.

  • Hi Charles

    Ok.
    Do let us know if you have more questions as you go through the processor/DSP selection.

    We are also soliciting feedback on areas to improve in DSPLIB for the C66x family, etc.

    Regards

    Mukul