Compiler: the speed limit of c6748

user4756843

Other Parts Discussed in Thread: TMS320C6748

Tool/software: TI C/C++ Compiler

I want to know the speed limit of floating-point multiplication. Is it 2746 MFLOPS or 456 MHz? If it is 2746 MFLOPS, how can I achieve it?

I wrote a for loop that performs a=b*c 100000 times. After optimization, each multiplication takes one instruction cycle but 6 CPU cycles. Why?

The 100000 iterations of a=b*c take 0.22 ms.

over 8 years ago

0 Chester Gillon over 8 years ago

Guru 92251 points

user4756843 said:
Is it 2746 MFLOPS or 456 MHz? If it is 2746 MFLOPS, how can I achieve it?

The maximum 2746 MFLOPS quoted for the TMS320C6748 assumes that 6 floating point operations can be performed every clock cycle (with a clock speed of 456-MHz). The maximum quoted MFLOPS for devices are a theoretical maximum, unlikely to be achieved in user code due to a combination of:

a) While the C674x has multiple Functional Units which can execute instructions in parallel, each type of Functional Unit supports different instructions. E.g. the two functional units .M1 and .M2 support the MPYSP "Multiply Two Single-Precision Floating-Point Values" instruction with a Functional Unit Latency of one. That means the maximum rate of single-precision floating-point multiplies is two per cycle, or 912 MFLOPS at 456 MHz.

b) Unless all the input and output data will fit in the L1 cache, there will be some slowdown in accessing data from slower memory.

c) A Software Pipelined Loop (SPLOOP) Buffer to keep all the execution units busy may not be achievable, depending upon the algorithm.

user4756843 said:
I want to know the speed limit of floating-point multiplication.

What is the actual algorithm you need to implement?

Core Benchmarks has the execution time in microseconds for different functions, from which the achieved MFLOPS can be calculated. E.g. for the C674x DSP core running at 456-MHz:

- "Real Matrix SGEMM 16x16" which is 8192 floating-point operations takes 7.69 microseconds, which equates to 1065 MFLOPS

- "Complex Matrix SGEMM 16x16" which is 32768 floating-point operations takes 23.87 microseconds, which equates to 1372 MFLOPS
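The MFLOPS figures above follow from dividing the operation count by the execution time in microseconds, since 1 MFLOPS is one operation per microsecond. A minimal sketch of that arithmetic follows. The function names are made up for illustration, and the 2*n^3 operation count for a real n x n SGEMM is an assumption that happens to match the 8192 quoted for 16x16.

```c
#include <assert.h>

/* Hypothetical helper: floating-point operations in a real n x n matrix
 * multiply, assuming one multiply and one add per inner-product term. */
double sgemm_flops(unsigned n)
{
    return 2.0 * (double)n * (double)n * (double)n;
}

/* Achieved MFLOPS = operations / elapsed microseconds. */
double achieved_mflops(double flop_count, double elapsed_us)
{
    assert(elapsed_us > 0.0);
    return flop_count / elapsed_us;
}
```

For example, achieved_mflops(sgemm_flops(16), 7.69) gives roughly 1065, and achieved_mflops(32768.0, 23.87) gives roughly 1372, in line with the two benchmark entries.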

0 RandyP over 8 years ago

TI__Guru* 84110 points

The 2746 MFLOPS number looks like a typo. It should be 456 MHz * 6 FLOPs/cycle = 2736 MFLOPS. I had never noticed that error before, but it is not very important to your question.

What you have measured in your test is program implementation performance, not raw processor benchmarking. But what you have measured is a more realistic number to be aware of since the performance of your program and algorithm is what really matters.

Adding to Chester's item a), the .L units and .S units can also do floating-point operations. The 912 number he quoted applies to each pair of .M, .L, and .S units, so the three pairs (6 of the 8 functional units) give a total of 2736.
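The peak numbers in this thread are just clock rate times floating-point operations per cycle. A small sketch of that calculation, with hypothetical function names, assuming one operation per unit per cycle as described above:

```c
#include <assert.h>

/* Theoretical peak in MFLOPS: clock in MHz times operations per cycle. */
double peak_mflops(double clock_mhz, unsigned flops_per_cycle)
{
    assert(clock_mhz > 0.0);
    return clock_mhz * (double)flops_per_cycle;
}

/* How much of a theoretical peak a measured figure reaches, in percent. */
double percent_of_peak(double measured_mflops, double peak)
{
    assert(peak > 0.0);
    return 100.0 * measured_mflops / peak;
}
```

peak_mflops(456.0, 2) gives 912 for one pair of units, and peak_mflops(456.0, 6) gives 2736 for all three pairs. percent_of_peak can then be used to put a measured or benchmark figure in context against either number.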

The benchmarks Chester points you to may help you understand how to achieve the maximum performance for your application. For now, you might not even be using compiler optimization or other algorithmic optimization.

2736 is a target to aim for in your code. You will not exceed that number, but with the items Chester has listed, you can see what other system limitations may exist. It is a useful skill to develop proficiency at writing optimized DSP code, so please continue your work on developing those skills for your continued success.

Regards,
RandyP

0 user4756843 over 8 years ago in reply to Chester Gillon

Prodigy 140 points

How can I use the .M1 and .M2 functional units at the same time? I looked at the assembly window and found that only one of the .M1/.M2 units is used for each a=b*c.

0 user4756843 over 8 years ago in reply to user4756843

Prodigy 140 points

0 Chester Gillon over 8 years ago in reply to user4756843

Guru 92251 points

user4756843 said:
How can I use the .M1 and .M2 functional units at the same time? I looked at the assembly window and found that only one of the .M1/.M2 units is used for each a=b*c.

I am not an expert on writing optimized algorithms for the C674x DSP, but your example code consists of scalar multiplies where the arguments of each multiply are the results of multiplies a few statements earlier, which creates dependencies between the instructions.
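To illustrate the dependency point, here is a hypothetical pair of functions (not the code from the original post). The first forms one chain in which every multiply needs the previous result. The second keeps two independent chains, which at least gives a compiler the freedom to issue the multiplies on different units. Whether a given compiler actually schedules them on .M1 and .M2 is not guaranteed, and the two functions can round differently for general inputs.

```c
#include <assert.h>
#include <stddef.h>

/* One dependency chain: multiply i cannot start until multiply i-1 is done. */
float chained_product(const float *x, size_t count)
{
    float acc = 1.0f;
    size_t i;

    assert(x != NULL || count == 0);
    for (i = 0; i < count; i++)
    {
        acc *= x[i];
    }
    return acc;
}

/* Two independent chains: acc0 and acc1 never depend on each other
 * inside the loop, so their multiplies may overlap. */
float split_product(const float *x, size_t count)
{
    float acc0 = 1.0f;
    float acc1 = 1.0f;
    size_t i;

    assert(x != NULL || count == 0);
    for (i = 0; i + 1 < count; i += 2)
    {
        acc0 *= x[i];
        acc1 *= x[i + 1];
    }
    if (i < count)
    {
        acc0 *= x[i];   /* odd element left over */
    }
    return acc0 * acc1;
}
```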

My guess is that the easiest way to get the C compiler to use multiple functional units is to operate on vectors (arrays of variables) where the inputs and outputs are independent memory locations which the compiler can see don't alias. E.g. the following short example:

#define NUM_ITERATIONS 100

float a_vec[NUM_ITERATIONS];
float b_vec[NUM_ITERATIONS];
float c_vec[NUM_ITERATIONS];

int main(void)
{
    unsigned int iteration;

    for (iteration = 0; iteration < NUM_ITERATIONS; iteration++)
    {
        a_vec[iteration] = b_vec[iteration] * c_vec[iteration];
    }

    return 0;
}

When compiled with Optimization Level 2, this resulted in the following inter-listed assembly file, which shows the .M1 and .M2 functional units in use at the same time:

;******************************************************************************
;* FUNCTION NAME: main                                                        *
;*                                                                            *
;*   Regs Modified     : A3,A4,A5,A6,A7,B4,B5,B6,B7,B8                        *
;*   Regs Used         : A3,A4,A5,A6,A7,B3,B4,B5,B6,B7,B8                     *
;*   Local Frame Size  : 0 Args + 0 Auto + 0 Save = 0 byte                    *
;******************************************************************************
main:
;** --------------------------------------------------------------------------*
;** 15	-----------------------    L$1 = 50;
;**  	-----------------------    U$19 = &a_vec[-2];
;**  	-----------------------    U$11 = &c_vec;
;**  	-----------------------    U$6 = &b_vec;
;**  	-----------------------    #pragma MUST_ITERATE(50, 50, 50)
;**  	-----------------------    // LOOP BELOW UNROLLED BY FACTOR(2)
;**  	-----------------------    #pragma LOOP_FLAGS(4098u)
;**	-----------------------g2:
;** 17	-----------------------    VEC$f32x2$001 = *U$6++{8};
;** 17	-----------------------    VEC$f32x2$002 = *U$11++{8};
;** 17	-----------------------    *(U$19 += 2) = __subvec(0, VEC$f32x2$001)*__subvec(0, VEC$f32x2$002);
;** 17	-----------------------    U$19[1] = __subvec(1, VEC$f32x2$001)*__subvec(1, VEC$f32x2$002);
;** 15	-----------------------    if ( L$1 = L$1-1 ) goto g2;
	.dwpsn	file "../main.c",line 15,column 25,is_stmt,isa 0
           MVK     .S2     0x32,B4           ; [B_Sb674] |15| 

           SUB     .L2     B4,2,B4           ; [B_L674] 
||         MVKL    .S2     b_vec,B5          ; [B_Sb674] 

           MVC     .S2     B4,ILC            ; [B_Sb674] 
||         MVKL    .S1     c_vec,A5          ; [A_S674] 

;*----------------------------------------------------------------------------*
;*   SOFTWARE PIPELINE INFORMATION
;*
;*      Loop found in file               : ../main.c
;*      Loop source line                 : 15
;*      Loop opening brace source line   : 16
;*      Loop closing brace source line   : 18
;*      Loop Unroll Multiple             : 2x
;*      Known Minimum Trip Count         : 50                    
;*      Known Maximum Trip Count         : 50                    
;*      Known Max Trip Count Factor      : 50
;*      Loop Carried Dependency Bound(^) : 0
;*      Unpartitioned Resource Bound     : 2
;*      Partitioned Resource Bound(*)    : 2
;*      Resource Partition:
;*                                A-side   B-side
;*      .L units                     0        0     
;*      .S units                     0        0     
;*      .D units                     2*       2*    
;*      .M units                     1        1     
;*      .X cross paths               1        1     
;*      .T address paths             2        2     
;*      Logical  ops (.LS)           0        0     (.L or .S unit)
;*      Addition ops (.LSD)          0        0     (.L or .S or .D unit)
;*      Bound(.L .S .LS)             0        0     
;*      Bound(.L .S .D .LS .LSD)     1        1     
;*
;*      Searching for software pipeline schedule at ...
;*         ii = 2  Schedule found with 5 iterations in parallel
;*      Done
;*
;*      Loop will be splooped
;*      Collapsed epilog stages       : 0
;*      Collapsed prolog stages       : 0
;*      Minimum required memory pad   : 0 bytes
;*
;*      Minimum safe trip count       : 1 (after unrolling)
;*----------------------------------------------------------------------------*
$C$L1:    ; PIPED LOOP PROLOG

           SPLOOPD         2                 ;10 ; [A_L674] (P) 
||         MVKH    .S1     c_vec,A5          ; [A_S674] 
||         MVKH    .S2     b_vec,B5          ; [B_Sb674] 

;** --------------------------------------------------------------------------*
$C$L2:    ; PIPED LOOP KERNEL
	.dwpsn	file "../main.c",line 17,column 9,is_stmt,isa 0

           LDDW    .D2T2   *B5++(8),B7:B6    ; [B_D64P] |17| (P) <0,0> 
||         LDDW    .D1T1   *A5++(8),A7:A6    ; [A_D64P] |17| (P) <0,0> 

           NOP             1                 ; [A_L674] 

           SPMASK                            ; [] 
||^        MVKL    .S2     a_vec,B4          ; [B_Sb674] 

           SPMASK                            ; [] 
||^        MVKH    .S2     a_vec,B4          ; [B_Sb674] 

           SPMASK                            ; [] 
||^        SUB     .L2     B4,8,B4           ; [B_L674] 

           MPYSP   .M2X    A6,B6,B8          ; [B_M674] |17| (P) <0,5> 
||         MPYSP   .M1X    A7,B7,A3          ; [A_M674] |17| (P) <0,5> 

           SPMASK                            ; [] 
||^        ADD     .L1X    12,B4,A4          ; [A_L674] 

           SPMASK                            ; [] 
||^        ADD     .L2     8,B4,B4           ; [B_L674] 

           NOP             1                 ; [A_L674] 

           SPKERNEL        1,0               ; [] 
||         STW     .D2T2   B8,*B4++(8)       ; [B_D64P] |17| <0,9> 
||         STW     .D1T1   A3,*A4++(8)       ; [A_D64P] |17| <0,9> 

;** --------------------------------------------------------------------------*
$C$L3:    ; PIPED LOOP EPILOG
;** 20	-----------------------    return 0;
           NOP             1                 ; [A_L674] 
	.dwpsn	file "../main.c",line 21,column 1,is_stmt,isa 0
$C$DW$9	.dwtag  DW_TAG_TI_branch
	.dwattr $C$DW$9, DW_AT_low_pc(0x00)
	.dwattr $C$DW$9, DW_AT_TI_return

           RETNOP          B3,4              ; [] |21| 
	.dwpsn	file "../main.c",line 20,column 5,is_stmt,isa 0
           ZERO    .L1     A4                ; [A_L674] |20| 
	.dwpsn	file "../main.c",line 21,column 1,is_stmt,isa 0
           ; BRANCH OCCURS {B3}              ; [] |21|
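The software pipeline information in the listing above can be read as simple arithmetic. The kernel issues 2 LDDW and 2 STW on the two .D units and 2 MPYSP on the two .M units, the loop is unrolled 2x, and the schedule found is ii = 2. A rough sketch of what that implies, with hypothetical function names, ignoring prolog, epilog, and any memory stalls:

```c
#include <assert.h>

/* Minimum cycles per kernel iteration imposed by one kind of unit:
 * ceil(operations / units available). */
unsigned resource_bound(unsigned ops_per_iteration, unsigned units)
{
    assert(units > 0);
    return (ops_per_iteration + units - 1) / units;
}

/* Steady-state MFLOPS of a pipelined loop: floating-point operations
 * per kernel iteration, divided by ii cycles, times the clock in MHz. */
double steady_state_mflops(double clock_mhz,
                           unsigned flops_per_iteration,
                           unsigned ii_cycles)
{
    assert(ii_cycles > 0);
    return clock_mhz * (double)flops_per_iteration / (double)ii_cycles;
}
```

resource_bound(4, 2) is 2 for the .D units while resource_bound(2, 2) is 1 for the .M units, which suggests this loop is limited by loads and stores rather than by the multipliers. steady_state_mflops(456.0, 2, 2) gives 456, i.e. one multiply per cycle at 456 MHz under these assumptions.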

Edit: The TMS320C674x DSP CPU and Instruction Set Reference Guide has information on the DSP Pipeline and Software Pipelined Loop (SPLOOP) Buffer which may help in understanding how to maximize performance.
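The example above uses global arrays, so the compiler can see for itself that the inputs and outputs don't alias. When the same loop is written as a function taking pointers, that information is lost unless it is stated. One common way to state it is the C99 restrict qualifier. This is a sketch only: the function name is hypothetical, and it has not been checked what schedule the TI compiler produces for it.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise multiply. The restrict qualifiers are a promise by the
 * caller that out, lhs and rhs do not overlap, which allows the compiler
 * to reorder loads and stores across iterations. */
void vec_mul(float *restrict out,
             const float *restrict lhs,
             const float *restrict rhs,
             size_t count)
{
    size_t i;

    assert(out != NULL && lhs != NULL && rhs != NULL);
    for (i = 0; i < count; i++)
    {
        out[i] = lhs[i] * rhs[i];
    }
}
```

The promise is the caller's responsibility: passing overlapping buffers to a function declared this way is undefined behavior.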

0 RandyP over 8 years ago in reply to user4756843

TI__Guru* 84110 points

For more information on optimization techniques, please go to TI.com and search for "c6000 optimization" (no quotes) to find some articles and archived training courses that you can read through.

Please look to the libraries provided by TI for implementing components of your product's required algorithms. These libraries have functions that have been either optimized in C or hand-written in assembly to achieve maximum performance. You can study these implementations to learn good techniques.

Regards,
RandyP

0 user4756843 over 8 years ago in reply to Chester Gillon

Prodigy 140 points

Thanks for your help. I really appreciate it. Is there any other way to use .M1 and .M2 in parallel?

0 user4756843 over 8 years ago in reply to RandyP

Prodigy 140 points

Could you point me to a website or PDF for the libraries?

If I make the .M1 and .M2 units run in parallel, will the speed be doubled?

Please tell me how to make the units run in parallel in CCS 5.5.

I read the "TMS320C674x DSP CPU and Instruction Set Reference Guide" and want to know how to write parallel operations.

0 RandyP over 8 years ago in reply to user4756843

TI__Guru* 84110 points

With apologies, I am not sure what is being asked above. The red highlighting is not readable on my computer screen, at least with my old eyes.

I cannot do a better job of explaining how to write parallel instructions than the CPU & Instruction Set Reference Guide.

Please go to the training material I referenced earlier to find the instruction and pointers and examples and guidance that you need. You might also search for "c6748 training" or "c6748 workshop" (no quotes) to find more basic programming and application training.

Regards,
RandyP
