This thread has been locked.

If you have a related question, please click the "Ask a related question" button in the top right corner. The newly created question will be automatically linked to this question.

Compiler/TMS320C6678: The fastest way to transpose a huge matrix?

Part Number: TMS320C6678

Tool/software: TI C/C++ Compiler

Hello, everyone! I am currently working on a project that involves a matrix transposition. Specifically, given a complex matrix with 32 rows and 16*1024 columns, when I tried the function DSPF_sp_mat_trans_cplx() on DDR3, it took 50 ms to finish. I also tried to do it with DMA, but the row length is larger than the maximum stride in the A dimension, so I had to use two DMA transfers for the transposition. Unfortunately, that doubled the time to almost 100 ms.

Now I am desperate for any help. Please tell me any way to finish this task faster and more efficiently. I look forward to your advice!

  • Hi,

    Assuming you are using TI DSPLIB 3.4.0.x and calling the API below:

    /**
    * This function transposes the input matrix x[] and writes the
    * result to matrix y[].
    *
    * @param x[2*r1*c1] Input matrix containing 2*r1*c1 floating-point numbers
    * @param rows Number of rows in matrix x. Also number of columns in matrix y
    * @param cols Number of columns in matrix x. Also number of rows in matrix y
    * @param y[2*c1*r1] Output matrix containing 2*c1*r1 floating-point numbers
    *
    * @par Algorithm:
    * DSPF_sp_mat_trans_cplx.c is the natural C equivalent of the optimized intrinsic
    * C code without restrictions. Note that the intrinsic C code is optimized
    * and restrictions may apply.
    *
    * @par Assumptions:
    * The number of rows and columns is >=2. <BR>
    *
    * @par Implementation Notes:
    * @b Interruptibility : The code is interruptible. <BR>
    * @b Endian support : supports both Little and Big endian modes. <BR>
    *
    */

    void DSPF_sp_mat_trans_cplx(const float *restrict x, const int rows,
                                const int cols, float *restrict y);
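
    For reference, a minimal call for your matrix size might look like the sketch below (the array names and placement are just an example; each complex element occupies two consecutive floats):

    #define ROWS 32
    #define COLS (16 * 1024)

    /* 2 floats (re, im) per complex element; at this size the arrays
       are 4 MB each and must live in DDR3, e.g. via the linker command file */
    float x[2 * ROWS * COLS];   /* input:  ROWS x COLS */
    float y[2 * COLS * ROWS];   /* output: COLS x ROWS */

    /* ... fill x ... */
    DSPF_sp_mat_trans_cplx(x, ROWS, COLS, y);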

    A test example is given under dsplib_c66x_3_4_0_4\packages\ti\dsplib\src\DSPF_sp_mat_trans_cplx\c66\DSPF_sp_mat_trans_cplx_d.c. Do you use this? If yes, the test report dsplib_c66x_3_4_0_4/docs/DSPLIB_C66x_TestReport.html gives the cycle count as 1*R*C + 6*R + 28.

    Your R is 32 and your C is 16384, so it should take 524508 cycles. Assuming your CPU runs at 1 GHz, that is about 0.524 ms. Your number is far too big, so please check:

    1) Your CPU PLL is set to 1 GHz or higher, and your timestamp function uses TSCL/TSCH (see the first sketch after this list).

    2) You build the code with -O3.

    3) L1D is configured as a 32K cache, part of L2 is configured as cache, and the DDR3 region is cacheable (see the second sketch after this list).

    4) Try placing the input/output arrays in MSMC to see if that helps.
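
    For items 1) and 3), minimal sketches follow. Timestamping with the 64-bit time stamp counter, assuming the TI compiler's c6x.h intrinsics:

    #include <c6x.h>   /* TSCL/TSCH control registers, _itoll() */

    unsigned long long t0, t1;

    TSCL = 0;                   /* any write starts the free-running counter */
    t0 = _itoll(TSCH, TSCL);
    DSPF_sp_mat_trans_cplx(x, 32, 16 * 1024, y);
    t1 = _itoll(TSCH, TSCL);
    /* elapsed cycles = t1 - t0; at a 1 GHz CPU clock, one cycle is 1 ns */

    And the cache setup via CSL (a sketch; the MAR loop assumes DDR3 at 0x80000000 and makes the first 256 MB cacheable, 16 MB per MAR bit):

    #include <ti/csl/csl_cacheAux.h>

    int i;
    CACHE_setL1DSize(CACHE_L1_32KCACHE);   /* all of L1D as cache     */
    CACHE_setL2Size(CACHE_256KCACHE);      /* part of L2 as cache     */
    for (i = 128; i < 144; i++)            /* MAR128+ map 0x80000000+ */
        CACHE_enableCaching(i);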

    Regards, Eric

  • Sorry to bother you again, but did you run it in DDR3 or MSMC to achieve this kind of speed?

  • I enabled caching for that part of DDR3 and successfully cut the time down to 35 ms. By the way, I also tried adding the compiler option -O3; it did not improve the efficiency.

  • Hi,

    The code/data was placed into L2 for testing, as you can see in the linker file. I ran the test with up to 64 rows and 64 columns using the DSPF_sp_mat_trans_cplx_66_LE_ELF project from dsplib_c66x_3_4_0_4:

    DSPF_sp_mat_trans_cplx Iter#: 18 Result Successful NR = 8 NC = 64 natC: 6817 optC: 1456
    DSPF_sp_mat_trans_cplx Iter#: 19 Result Successful NR = 16 NC = 2 natC: 430 optC: 227
    DSPF_sp_mat_trans_cplx Iter#: 20 Result Successful NR = 16 NC = 4 natC: 834 optC: 285
    DSPF_sp_mat_trans_cplx Iter#: 21 Result Successful NR = 16 NC = 8 natC: 1642 optC: 399
    DSPF_sp_mat_trans_cplx Iter#: 22 Result Successful NR = 16 NC = 16 natC: 3269 optC: 630
    DSPF_sp_mat_trans_cplx Iter#: 23 Result Successful NR = 16 NC = 32 natC: 6501 optC: 1309
    DSPF_sp_mat_trans_cplx Iter#: 24 Result Successful NR = 16 NC = 64 natC: 12961 optC: 4282
    DSPF_sp_mat_trans_cplx Iter#: 25 Result Successful NR = 32 NC = 2 natC: 814 optC: 403
    DSPF_sp_mat_trans_cplx Iter#: 26 Result Successful NR = 32 NC = 4 natC: 1602 optC: 487
    DSPF_sp_mat_trans_cplx Iter#: 27 Result Successful NR = 32 NC = 8 natC: 3185 optC: 705
    DSPF_sp_mat_trans_cplx Iter#: 28 Result Successful NR = 32 NC = 16 natC: 6341 optC: 1277
    DSPF_sp_mat_trans_cplx Iter#: 29 Result Successful NR = 32 NC = 32 natC: 12645 optC: 2977
    DSPF_sp_mat_trans_cplx Iter#: 30 Result Successful NR = 32 NC = 64 natC: 25276 optC: 9034
    DSPF_sp_mat_trans_cplx Iter#: 31 Result Successful NR = 64 NC = 2 natC: 1593 optC: 755
    DSPF_sp_mat_trans_cplx Iter#: 32 Result Successful NR = 64 NC = 4 natC: 3145 optC: 894
    DSPF_sp_mat_trans_cplx Iter#: 33 Result Successful NR = 64 NC = 8 natC: 6257 optC: 1463
    DSPF_sp_mat_trans_cplx Iter#: 34 Result Successful NR = 64 NC = 16 natC: 12485 optC: 2817
    DSPF_sp_mat_trans_cplx Iter#: 35 Result Successful NR = 64 NC = 32 natC: 24960 optC: 6597
    DSPF_sp_mat_trans_cplx Iter#: 36 Result Successful NR = 64 NC = 64 natC: 50079 optC: 19733

    The numbers are higher than the formula predicts. I also tried R=32, C=16384 with the arrays placed in DDR3 via .far, as L2 or MSMC is not big enough:

    DSPF_sp_mat_trans_cplx Iter#: 63 Result Successful NR = 32 NC = 128 natC: 1102068 optC: 356268
    DSPF_sp_mat_trans_cplx Iter#: 64 Result Successful NR = 32 NC = 256 natC: 2054774 optC: 674668
    DSPF_sp_mat_trans_cplx Iter#: 65 Result Successful NR = 32 NC = 512 natC: 4164560 optC: 1348516
    DSPF_sp_mat_trans_cplx Iter#: 66 Result Successful NR = 32 NC = 1024 natC: 8480970 optC: 2696460
    DSPF_sp_mat_trans_cplx Iter#: 67 Result Successful NR = 32 NC = 2048 natC: 16989048 optC: 5393238
    DSPF_sp_mat_trans_cplx Iter#: 68 Result Successful NR = 32 NC = 4096 natC: 34038450 optC: 10785414
    DSPF_sp_mat_trans_cplx Iter#: 69 Result Successful NR = 32 NC = 8192 natC: 68295506 optC: 21569062
    DSPF_sp_mat_trans_cplx Iter#: 70 Result Successful NR = 32 NC = 16384 natC: 136590800 optC: 43137004

    That is 43 ms, similar to your numbers (50 and 35 ms). I looked at the cache and PLL settings and at -O3; it all looked right to me. The engineer supporting DSPLIB is out of the office and will be back around the end of the month. I will check with him whether anything else is missing. Sorry for the wait!

    Regards, Eric 

  • The performance in the DSPLIB Test Report was measured with the C66x cycle-accurate simulator, which assumes a flat memory model. On actual silicon, memory and bus latency contribute to the benchmark, so the only way to obtain results close to those in the Test Report is to make sure the data fits in L1 and L2 memory and to enable the cache. Besides the compiler option -O3, also make sure that any data buffers placed in MSMC or DDR are aligned to a cache-line boundary to minimize the effect of cache latency, as in the sketch below.
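
    For example, with the TI compiler the buffers can be pinned to a section and a cache-line boundary (a sketch; 128 bytes is the C66x L2 cache-line size, and the section name is just an example):

    /* align the DDR3 buffers to the 128-byte L2 cache line */
    #pragma DATA_SECTION(x, ".far")
    #pragma DATA_ALIGN(x, 128)
    float x[2 * 32 * 16384];

    #pragma DATA_SECTION(y, ".far")
    #pragma DATA_ALIGN(y, 128)
    float y[2 * 32 * 16384];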

    Given that each C66x core has only 512K of L2, your input and output buffers will not both fit in L2 memory, so you will need to put the input in L2 and the output in MSMC/DDR, which will impact your latency. Another way to reduce the latency is to use OpenMP and distribute the complex matrix transpose across multiple cores; a sketch follows the link below:

    http://downloads.ti.com/mctools/esd/docs/openmp-dsp/index.html
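
    A sketch of how the work could be split across cores with OpenMP (this is a plain natural-C transpose distributed by input row, not the DSPLIB routine; the function name is just an example):

    #include <omp.h>

    void par_mat_trans_cplx(const float *restrict x, int rows, int cols,
                            float *restrict y)
    {
        int r;
        /* each core handles a block of input rows */
        #pragma omp parallel for
        for (r = 0; r < rows; r++) {
            int c;
            for (c = 0; c < cols; c++) {
                /* complex element (r,c) -> (c,r), 2 floats per element */
                y[(c * rows + r) * 2]     = x[(r * cols + c) * 2];
                y[(c * rows + r) * 2 + 1] = x[(r * cols + c) * 2 + 1];
            }
        }
    }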

    Regards,

    Rahul