c64x+ 32-bit x 32-bit multiply with add



Hi,

This is my first post! My name is Fabian and I am currently trying to speed up a multi-precision library I wrote in C by using intrinsics.

I have been looking at the instruction set of the c64x+ as described in the CPU and Instruction Set Reference Guide (spru732j), as well as the list of intrinsics in the Programmer's Guide (spru198k), but I couldn't find a 32-bit x 32-bit multiply-with-add intrinsic.

I have read several times that the c64x+ should be able to do two 32-bit x 32-bit MACs per cycle, as stated for example in the document Optimizing Loops on the C66x DSP (sprabg7).

My multi-precision numbers are stored in a struct with a uint32_t array containing the number.

I have implemented a multi-precision multiplication algorithm called "Comba multiplication", which features a triple-precision step in its core loop of the form a[i] * b[j] + c.
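To make that step concrete, here is a stripped-down sketch of the triple-precision accumulation in plain C (the names are illustrative, not my actual code):

    #include <stdint.h>

    /* Illustrative sketch only: one Comba step.  The 64-bit product of two
       32-bit words is added into a triple-precision (96-bit) accumulator
       held in c0 (low), c1 (middle), c2 (high). */
    static inline void comba_step(uint32_t ai, uint32_t bj,
                                  uint32_t *c0, uint32_t *c1, uint32_t *c2)
    {
        uint64_t p = (uint64_t)ai * bj;
        uint64_t t = (uint64_t)*c0 + (uint32_t)p;    /* add low product word  */
        *c0 = (uint32_t)t;
        t = (t >> 32) + *c1 + (uint32_t)(p >> 32);   /* add high word + carry */
        *c1 = (uint32_t)t;
        *c2 += (uint32_t)(t >> 32);                  /* propagate final carry */
    }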

How would you do this using intrinsics?

It would be great if someone could help me out.

regards and thanks in advance,

fabian

  • Fabian,

    Welcome to the TI E2E forum. I hope you will find many good answers here and in the TI.com documents and in the TI Wiki Pages. Be sure to search those for helpful information and training material, and browse for the questions others have asked on similar topics.

    Write your program in C and use the best optimization techniques with just C. If your program runs fast enough, do not try to go deeper into optimization using intrinsics or C6000 assembly; this is very difficult and it is very unlikely that we can teach you assembly programming through the forum. Maybe someone else will have a different opinion, but I strongly recommend you start with a search of the TI Wiki Pages for C6000 Optimization and follow what you learn there.

    You may find other training materials on TI.com, also.

    Once you have highly optimized C code that builds and runs successfully, the easiest way to understand how C6000 assembly code works is to generate a listing file and look at the code the compiler generates.

    Regards,
    RandyP

     

    If you need more help, please reply back. If this answers the question, please click Verify Answer below.

  • Hey Randy,

    thanks for the welcome :)

    I am writing my program in C. However, there is no way I can say that my program runs fast enough: I am using the c64x+ for my bachelor thesis, and I am expected to make my code as fast as possible. I have to admit that I am not going any deeper than the intrinsics level because writing assembly really looks difficult, although I am familiar with assembly on other machines.

    I am always surprised by what the compiler does to my code.

    In my opinion, programming with intrinsics is surprisingly easy. I was able to reduce the cycle count of my 192-bit multi-precision multiplication from 424 to just 125 cycles by using intrinsics, loop unrolling and aligned data accesses.

    However, I am using the 32-bit x 32-bit multipliers in this piece of code, which makes it impossible for me to use a MAC operation: accumulating the 64-bit products can yield a result wider than 64 bits (a 65-bit number in the worst case), which renders it useless for me, since there is no type with more precision than 64 bits.

    I am going to write a version of the multi-precision multiplication algorithm that uses 16-bit MACs, because I want to see how many cycles it is going to take.

    What I thought when I wrote my post was that there is some kind of intrinsic like result = _mac(a, b, c); that just computes a MAC. I was surprised that the C6000 does this with separate multiply and add instructions.
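    For what it's worth, this is roughly how I picture the separate multiply and add in C (just a sketch; it assumes the TI compiler's _mpy32u built-in, which maps to MPY32U, and it splits the accumulator so the carry out of 64 bits is not lost; a plain (uint64_t)a * b cast multiply should end up as MPY32U as well):

        #include <stdint.h>

        /* Sketch only: one 32x32 column step.  The multiply is a single
           intrinsic; the accumulate is ordinary C addition, which the
           compiler turns into separate ADDU instructions. */
        static inline uint64_t mac32(uint32_t a, uint32_t b,
                                     uint64_t acc, uint32_t *carry)
        {
            uint64_t p = (uint64_t)_mpy32u(a, b);   /* MPY32U: 32x32 -> 64 */
            acc += p;                               /* separate 64-bit add */
            if (acc < p)                            /* carry out of bit 63 */
                (*carry)++;
            return acc;
        }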

    regards,

    fabian

  • Fabian,

    Fabian Büttner said:
    I am writing my program in C...
    I shall write my code as fast as possible...
    i am not going any deeper than the intrinsics level...
    writing assembler really looks difficult

    "as fast as possible" is not always an achievable goal. Nor is it realistic, since an application requires a certain performance and writing an algorithm to run faster than that may not improve the application.

    C6000 assembly is difficult, and intrinsics are a method to force the compiler to use a specific assembly instruction. Intrinsics are not a replacement for the full set of features that can be implemented in assembly, but they provide a good tradeoff.

    C6000 assembly is difficult because the architecture uses an unprotected pipeline and up to 8 functional units.

    Best of luck with your thesis.

    Regards,
    RandyP

  • While the improvement from 424 to 125 cycles appears impressive, it doesn't answer the question of how far it is from the theoretical limit and whether an assembler programming effort would [still] be justifiable.

    I have a nested SPLOOP assembler implementation that operates at 2*N*(N+1)+10 cycles for N>=6, and slower for smaller N because of read-after-write penalties. N is the number of 32-bit words in a vector; in the 192-bit case N is 6, resulting in 94 cycles. Well, in real life you'd have some additional call overhead, but the order of magnitude is there. At the very least it's not worse than 125 cycles, despite the fact that it's a straightforward, non-Comba implementation storing intermediate results in memory. On the pro side, it's 160 bytes large [or should one say small?].

    As for the Comba algorithm: I reckon that a fully unrolled, register-based Comba implementation should end up close to N*(N+K)+L, where K is 2-3 and L is somewhere around 10. "Should" means that I haven't actually attempted to implement it, nor plan to...

  • Hi,

    thank you for this answer, Andy! Wow, 94 sounds unbelievably fast to me!

    My solution is a fully unrolled Comba written with intrinsics, using the 32-bit multiplies as well as 8-byte-aligned memory accesses. I have no idea how to speed it up any further, since the compiler already seems to shuffle the instructions around as it likes.

    I'd love to write an unrolled, register-based Comba, but I guess I am low on time (I'm writing this for my thesis). If I computed the formula (with K = 3) correctly, the multiplication should execute in just 64 cycles; that's insane! :)

    By the way, what are the shift left / shift right intrinsics good for? Are they somehow superior to the "<<" and ">>" operators?

    regards,

    fabian

    EDIT: Just out of curiosity, how much time would it take you to write such a register-based Comba routine?

  • First of all, keep in mind that Comba is a person, so it's the Comba algorithm, not comba.

    Fabian Büttner said:
    If i computed the formula(with K = 3) correctly the multiplication should execute in just 64 cycles, that's insane! :)

    One can as well argue that 120 cycles is insane... I mean it's all about perspective. Keep in mind that 64 cycles is an estimate and might turn out optimistic, for example because of cross-path limitations, but I'm confident that it will be noticeably faster than 94 cycles.

    Fabian Büttner said:
    By the way. What are the shift left, shift right intrinsics good for? are they somehow superior to the "<< " and " >>" operators?

    I'm not the right person to answer such a question. The only thing I can say is that my formula accounts for 4 different instructions: load, 32x32 unsigned multiply, 40-bit addition, store. That's all it takes [in assembler].

     

  • Andy Polyakov said:
    Keep in mind that 64 cycles is an estimate and might turn out optimistic, for example because of cross-path limitations, ...

    Or it can turn out pessimistic. As an exercise I've implemented a 4x4 Comba multiplication and it operates at N*(N+1)+12, or 32 cycles. Below is a snippet that calculates the first 4 result words, so it should give the idea.

    	.text
    	.asg	B3,RA
    	.asg	A4,ARG0
    	.asg	B4,ARG1
    	.asg	A6,ARG2
    
    	.global	_bn_mul_comba4
    _bn_mul_comba4:
    	.asmfunc
    	LDW	*ARG1[0],B16	; a[0]
    ||	LDW	*ARG2[0],A16	; b[0]
    	LDW	*ARG1[1],B17	; a[1]
    ||	LDW	*ARG2[1],A17	; b[1]
    	LDW	*ARG1[2],B18	; a[2]
    ||	LDW	*ARG2[2],A18	; b[2]
    	LDW	*ARG1[3],B19	; a[3]
    ||	LDW	*ARG2[3],A19	; b[3]
    	NOP
    	MPY32U	A16,B16,A1:A0	; a[0]*b[0]
    	MPY32U	A17,B16,A23:A22	; a[0]*b[1]
    	MPY32U	A16,B17,A25:A24	; a[1]*b[0]
    	MPY32U	A16,B18,A27:A26	; a[2]*b[0]
    	STW	A0,*ARG0[0]
    ||	MPY32U	A17,B17,A29:A28	; a[1]*b[1]
    	MPY32U	A18,B16,A31:A30	; a[0]*b[2]
    ||	ADDU	A22,A1,A1:A0
    	MV	A23,B0
    ||	MPY32U	A19,B16,A21:A20	; a[3]*b[0]
    ||	ADDU	A24,A1:A0,A1:A0
    	ADDU	A25,B0,B1:B0
    ||	STW	A0,*ARG0[1]
    ||	MPY32U	A18,B17,A23:A22	; a[2]*b[1]
    ||	ADDU	A26,A1,A9:A8
    	ADDU	A27,B1,B9:B8
    ||	MPY32U	A17,B18,A25:A24	; a[1]*b[2]
    ||	ADDU	A28,A9:A8,A9:A8
    	ADDU	A29,B9:B8,B9:B8
    ||	MPY32U	A16,B19,A27:A26	; a[0]*b[3]
    ||	ADDU	A30,A9:A8,A9:A8
    	ADDU	A31,B9:B8,B9:B8
    ||	ADDU	B0,A9:A8,A9:A8
    	STW	A8,*ARG0[2]
    ||	ADDU	A20,A9,A1:A0
    	ADDU	A21,B9,B1:B0
    ||	MPY32U	A19,B17,A21:A20	; a[3]*b[1]
    ||	ADDU	A22,A1:A0,A1:A0
    	ADDU	A23,B1:B0,B1:B0
    ||	MPY32U	A18,B18,A23:A22	; a[2]*b[2]
    ||	ADDU	A24,A1:A0,A1:A0
    	ADDU	A25,B1:B0,B1:B0
    ||	MPY32U	A17,B19,A25:A24	; a[1]*b[3]
    ||	ADDU	A26,A1:A0,A1:A0
    	ADDU	A27,B1:B0,B1:B0
    ||	ADDU	B8,A1:A0,A1:A0
    	STW	A0,*ARG0[3]
    	...
    

     

  • Hi Andy,

    I am sorry I didn't reply earlier. Wow, this is great! I hope I find enough time to get this into my thesis!

    If I succeed with this, I'll acknowledge your help in my thesis :)

    Do you use Code Composer Studio for developing c64x+ assembly? I reckon there is also a technique known as linear assembly. Is this piece of code linear assembly?

    regards,

    fabian

  • Fabian Büttner said:
    Do you use the Code Composer Studio for developing c64x+ assembly?

    I don't understand. I use CCS to compile and run the code on the board, but for actual development I use my head :-)

    Fabian Büttner said:
    I reckon that there is also some technique known as linear assembly. Is this piece of code linear assembly?

    No, it's not. Admittedly linear assembly is likely to be appropriate in this case, maybe more than appropriate. The code would be more readable and you probably won't have to worry about explicitly packing the instructions. Signing off...

     

  • Haha,

    I was just curious whether I am using the right tool for assembly development :)

    Yesterday I briefly created an empty assembly project and pasted the code into an .asm file. It assembled correctly, but I got a warning: warning #10202-D: no suitable entry-point found; setting to 0

    That's why I created a .c file in the same project, hoping that the main() routine could serve as an entry point to get started with testing. But it failed, complaining that the Comba function was declared implicitly (which is correct, since I don't have a header file). I am a bit lost, since I didn't find a good example of how to interface C and asm in CCS.
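    Is it just a matter of adding a prototype on the C side, something like the following sketch (guessing the argument order from the comments in the listing: result pointer first, then the two operands)?

        #include <stdint.h>

        /* Hypothetical glue: declare the assembly routine so the call is no
           longer implicit.  The result buffer must hold 2*4 words. */
        extern void bn_mul_comba4(uint32_t *r, uint32_t *a, uint32_t *b);

        int main(void)
        {
            uint32_t a[4] = {1, 2, 3, 4};
            uint32_t b[4] = {5, 6, 7, 8};
            uint32_t r[8];

            bn_mul_comba4(r, a, b);
            return 0;
        }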

    sorry for the noobish questions.

    By the way: I am surprised that you didn't have to specify the functional units in the assembly.

    regards,

    fabian

  • Hi it's me again hehe,

    I just created a C project with the asm file inside and was able to call the function by removing the leading "_" in the .c file. I also added a B B3 and a NOP 5 at the end of the snippet to branch to the return address.

    Now I still have to figure out how to pass parameters.

    Now let's say I want to do a 17*17 Comba multiplication. Are there registers that I have to save before using them? IIRC B15 holds the stack pointer, but are there any others that I shouldn't modify?

    regards,

    fabian

  • Fabian Büttner said:
    I just created a C Project with the asm file inside and was able to call the function by removing the leading "_" in the c file file. I also added a B B3 and a NOP 5 at the end of the snippet to branch to the return address.

    Now i still have to figure out how to pass parameters.

    Now let's say I want to do a 17*17 Comba multiplication. Are there registers that i have to save before using? IIRC B15 holds the stack pointer. but any others that i shouldn't modify?

    It was never my intention to guide you (or anybody else) through such details, so don't expect much. If you are struggling with the calling convention, which is documented in publication SPRAB89, then I'd say do consider linear assembly. Among other things it even liberates you from dealing with the calling sequence, including preserving and restoring non-volatile registers. As for 17*17: I'd argue that if you want to cover multiple N*N sizes, then you should consider automation, i.e. a script or program that emits assembler for a specified N.

     

  • Fabian,

    In addition to the EABI Application Report mentioned by Andy, you will want to read the C Compiler User's Guide. Please see Section 7.5 "Interfacing C and C++ With Assembly Language".

    Regards,
    RandyP 

  • Fabian Büttner said:
    Now i still have to figure out how to pass parameters.

    I just realized that I assumed the bn_mul_comba4 prototype is common knowledge, which might be wrong to assume. On the other hand, Google gives the correct answer, the function prototype, in the first hit... Once again, keep in mind that the snippet is truncated to the first 4 result values and there are 4 remaining. The complete code will be available at some later point as part of OpenSSL.

     

  • Hi Randy,

    Thanks for the pointer! Must have missed it.

    regards,

    fabian

  • Andy Polyakov said:

    I just created a C Project with the asm file inside and was able to call the function by removing the leading "_" in the c file file. I also added a B B3 and a NOP 5 at the end of the snippet to branch to the return address.

    Now i still have to figure out how to pass parameters.

    Now let's say I want to do a 17*17 Comba multiplication. Are there registers that i have to save before using? IIRC B15 holds the stack pointer. but any others that i shouldn't modify?

    It was never my intention to guide you (or anybody else) through such details, so don't expect much. If you are striving with calling convention, which is documented in publication SPRAB89, then I'd say do consider linear assembly. It among other things even liberates you from dealing with calling sequence including preserving and restoring of non-volatile registers. As for 17*17. I'd argue that if you want to cover multiple N*N, then you should consider automation, i.e. a script, or program, then emits assembler for specified N.


    I think I just finished my 6*6 Comba multiplication, thanks.

    	.text
    	.asg	B3,RA
    	.asg	A4,ARG0
    	.asg	B4,ARG1
    	.asg	A6,ARG2

    	.global	_bn_mul_comba4
    _bn_mul_comba4:
    	.asmfunc
    	LDW	*ARG1[0],B16	; a[0]
    ||	LDW	*ARG2[0],A16	; b[0]
    	LDW	*ARG1[1],B17	; a[1]
    ||	LDW	*ARG2[1],A17	; b[1]
    	LDW	*ARG1[2],B18	; a[2]
    ||	LDW	*ARG2[2],A18	; b[2]
    	LDW	*ARG1[3],B19	; a[3]
    ||	LDW	*ARG2[3],A19	; b[3]
    	LDW	*ARG1[4],B20	; a[4]
    ||	LDW	*ARG2[4],A20	; b[4]
    	MPY32U	A16,B16,A1:A0	; a[0]*b[0]
    ||	LDW	*ARG1[5],B21	; a[5]
    ||	LDW	*ARG2[5],A21	; b[5]
    	MPY32U	A17,B16,A23:A22	; a[0]*b[1]
    	MPY32U	A16,B17,A25:A24	; a[1]*b[0]
    	MPY32U	A18,B16,A27:A26	; a[0]*b[2]
    	STW	A0,*ARG0[0]
    ||	MPY32U	A17,B17,A29:A28	; a[1]*b[1]
    	MPY32U	A16,B18,A31:A30	; a[2]*b[0]
    ||	ADDU	A22,A1,A1:A0
    	MV	A23,B0
    ||	MPY32U	A19,B16,A3:A2	; a[0]*b[3]
    ||	ADDU	A24,A1:A0,A1:A0
    	ADDU	A25,B0,B1:B0
    ||	STW	A0,*ARG0[1]
    ||	MPY32U	A18,B17,A23:A22	; a[1]*b[2]
    ||	ADDU	A26,A1,A9:A8
    	ADDU	A27,B1,B9:B8
    ||	MPY32U	A17,B18,A25:A24	; a[2]*b[1]
    ||	ADDU	A28,A9:A8,A9:A8
    	ADDU	A29,B9:B8,B9:B8
    ||	MPY32U	A16,B19,A27:A26	; a[3]*b[0]
    ||	ADDU	A30,A9:A8,A9:A8
    	ADDU	A31,B9:B8,B9:B8
    ||	ADDU	B0,A9:A8,A9:A8
    	STW	A8,*ARG0[2]
    ||	ADDU	A2,A9,A1:A0
    	ADDU	A3,B9,B1:B0
    ||	MPY32U	A20,B16,A3:A2	; a[0]*b[4]
    ||	ADDU	A22,A1:A0,A1:A0
    	ADDU	A23,B1:B0,B1:B0
    ||	MPY32U	A19,B17,A23:A22	; a[1]*b[3]
    ||	ADDU	A24,A1:A0,A1:A0
    	ADDU	A25,B1:B0,B1:B0
    ||	MPY32U	A18,B18,A25:A24	; a[2]*b[2]
    ||	ADDU	A26,A1:A0,A1:A0
    	ADDU	A27,B1:B0,B1:B0
    ||	ADDU	B8,A1:A0,A1:A0
    	MPY32U	A17,B19,A27:A26	; a[3]*b[1]
    ||	STW	A0,*ARG0[3]
    	MPY32U	A16,B20,A29:A28	; a[4]*b[0]
    ||	ADDU	A2,A1,A9:A8
    	ADDU	A3,B1,B9:B8
    ||	MPY32U	A21,B16,A3:A2	; a[0]*b[5]
    ||	ADDU	A22,A9:A8,A9:A8
    	ADDU	A23,B9:B8,B9:B8
    ||	MPY32U	A20,B17,A23:A22	; a[1]*b[4]
    ||	ADDU	A24,A9:A8,A9:A8
    	ADDU	A25,B9:B8,B9:B8
    ||	MPY32U	A19,B18,A25:A24	; a[2]*b[3]
    ||	ADDU	A26,A9:A8,A9:A8
    	ADDU	A27,B9:B8,B9:B8
    ||	MPY32U	A18,B19,A27:A26	; a[3]*b[2]
    ||	ADDU	A28,A9:A8,A9:A8
    	ADDU	A29,B9:B8,B9:B8
    ||	MPY32U	A17,B20,A29:A28	; a[4]*b[1]
    	ADDU	B0,A9:A8,A9:A8
    	MPY32U	A16,B21,A31:A30	; a[5]*b[0]
    ||	STW	A8,*ARG0[4]
    ||	ADDU	A2,A9,A1:A0
    	ADDU	A3,B9,B1:B0
    ||	MPY32U	A21,B17,A3:A2	; a[1]*b[5]
    ||	ADDU	A22,A1:A0,A1:A0
    	ADDU	A23,B1:B0,B1:B0
    ||	MPY32U	A20,B18,A23:A22	; a[2]*b[4]
    ||	ADDU	A24,A1:A0,A1:A0
    	ADDU	A25,B1:B0,B1:B0
    ||	MPY32U	A19,B19,A25:A24	; a[3]*b[3]
    ||	ADDU	A26,A1:A0,A1:A0
    	ADDU	A27,B1:B0,B1:B0
    ||	MPY32U	A18,B20,A27:A26	; a[4]*b[2]
    ||	ADDU	A28,A1:A0,A1:A0
    	ADDU	A29,B1:B0,B1:B0
    ||	MPY32U	A17,B21,A29:A28	; a[5]*b[1]
    ||	ADDU	A30,A1:A0,A1:A0
    	ADDU	A31,B1:B0,B1:B0
    ||	ADDU	B8,A1:A0,A1:A0
    	STW	A0,*ARG0[5]
    ||	ADDU	A2,A1,A9:A8
    	ADDU	A3,B1,B9:B8
    ||	MPY32U	A21,B18,A3:A2	; a[2]*b[5]
    ||	ADDU	A22,A9:A8,A9:A8
    	ADDU	A23,B9:B8,B9:B8
    ||	MPY32U	A20,B19,A23:A22	; a[3]*b[4]
    ||	ADDU	A24,A9:A8,A9:A8
    	ADDU	A25,B9:B8,B9:B8
    ||	MPY32U	A19,B20,A25:A24	; a[4]*b[3]
    ||	ADDU	A26,A9:A8,A9:A8
    	ADDU	A27,B9:B8,B9:B8
    ||	MPY32U	A18,B21,A27:A26	; a[5]*b[2]
    ||	ADDU	A28,A9:A8,A9:A8
    	ADDU	A29,B9:B8,B9:B8
    ||	ADDU	B0,A9:A8,A9:A8
    	STW	A8,*ARG0[6]
    ||	ADDU	A2,A9,A1:A0
    	ADDU	A3,B9,B1:B0
    ||	MPY32U	A21,B19,A3:A2	; a[3]*b[5]
    ||	ADDU	A22,A1:A0,A1:A0
    	ADDU	A23,B1:B0,B1:B0
    ||	MPY32U	A20,B20,A23:A22	; a[4]*b[4]
    ||	ADDU	A24,A1:A0,A1:A0
    	ADDU	A25,B1:B0,B1:B0
    ||	MPY32U	A19,B21,A25:A24	; a[5]*b[3]
    ||	ADDU	A26,A1:A0,A1:A0
    	ADDU	A27,B1:B0,B1:B0
    ||	ADDU	B8,A1:A0,A1:A0
    	STW	A0,*ARG0[7]
    ||	ADDU	A2,A1,A9:A8
    	ADDU	A3,B1,B9:B8
    ||	MPY32U	A21,B20,A3:A2	; a[4]*b[5]
    ||	ADDU	A22,A9:A8,A9:A8
    	ADDU	A23,B9:B8,B9:B8
    ||	MPY32U	A20,B21,A23:A22	; a[5]*b[4]
    ||	ADDU	A24,A9:A8,A9:A8
    	ADDU	A25,B9:B8,B9:B8
    ||	ADDU	B0,A9:A8,A9:A8
    	STW	A8,*ARG0[8]
    ||	ADDU	A2,A9,A1:A0
    	ADDU	A3,B9,B1:B0
    ||	MPY32U	A21,B21,A3:A2	; a[5]*b[5]
    ||	ADDU	A22,A1:A0,A1:A0
    	ADDU	A23,B1:B0,B1:B0
    ||	ADDU	B8,A1:A0,A1:A0
    	MV	A4,B2
    	STW	A0,*ARG0[9]
    ||	ADDU	A2,A1,A9:A8
    	ADDU	A3,B1,B9:B8
    ||	ADDU	B0,A9:A8,A9:A8
    	STW	A8,*ARG0[10]
    ||	STW	B8,*B2[11]
    	B	B3
    	NOP	5

    It executes in 74 cycles (with call + return). I don't know whether one counts the cycles needed for call + return too, or whether one states only the raw cycle count the function needs in its body.

    I think I am going to be _very_ low on registers if I want to code a 17*17 unrolled, register-based Comba multiplication.

    regards,

    fabian

  • Fabian Büttner said:
    ...
    ||  STW B8,*B2[11]
        B   B3
        NOP 5
    

    It executes in 74 cycles(with call + return).

    Note that you can move "B B3" upwards 5 execution packets, pair it with || and thus eliminate return overhead.

    Fabian Büttner said:
    I don't know if one measures the cylces needed for call + return, too or if one states only the raw cycle count the function needs in its body.

    I usually inject MVC TSCL,BX in the first and last execution packets. But I invoke the function a second time with the same arguments and look at the results from the second invocation, when it operates on a "hot" cache and the result is free from set-up noise.
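    For reference, roughly the same measurement can be done from C (a sketch, assuming the TI toolchain's c6x.h, which exposes TSCL as a control register; measured this way, the numbers include call/return overhead):

        #include <stdint.h>
        #include <c6x.h>                  /* TSCL control register (TI cl6x) */

        extern void bn_mul_comba4(uint32_t *r, uint32_t *a, uint32_t *b);

        uint32_t time_comba4(uint32_t *r, uint32_t *a, uint32_t *b)
        {
            uint32_t t0, cycles = 0;
            int i;

            TSCL = 0;                     /* any write starts the counter   */
            for (i = 0; i < 2; i++) {     /* keep the second, "hot" reading */
                t0 = TSCL;
                bn_mul_comba4(r, a, b);
                cycles = TSCL - t0;
            }
            return cycles;
        }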

    Fabian Büttner said:
    I think i am going to be _very_ low on registers if i want to code a 17*17 unrolled register-based Comba multiplication.

    You would need 17 A-registers and 17 B-registers to hold the input values, but then you need only 4 A-pairs to hold multiplication results and 2+2 pairs to accumulate them. It appears tight, but it's totally doable. 4 pairs because you can reuse a pair after you accumulate its result. This is where code generation automation can be helpful: you take 4 pairs and arrange them as a circular buffer... The number 4 is determined by the multiplication latency.

     

  • Hi Andy,

    this is really extremely helpful! Thanks!

    Andy Polyakov said:

    ...
    ||  STW B8,*B2[11]
        B   B3
        NOP 5
    

    It executes in 74 cycles(with call + return).

    Note that you can move "B B3" upwards 5 execution packets, pair it with || and thus eliminate return overhead.


    Wow, that's a cool trick!

    Andy Polyakov said:

    I don't know if one measures the cylces needed for call + return, too or if one states only the raw cycle count the function needs in its body.

    I usually inject MVC TSCL,BX in first and last execution packets. But I invoke the function second time with same arguments and look at results from the second invocation, when it operates on "hot" cache and result is free from set-up noise.


    Does the CCS debugger simulate such set-up noise too?

    Andy Polyakov said:

    I think i am going to be _very_ low on registers if i want to code a 17*17 unrolled register-based Comba multiplication.

    You would need 17 A-registers and 17 B-registers to hold input values, but then you need only 4 A-pairs to hold multiplication results and 2+2 pairs to accumulate them. It appears tight, but it's totally doable. 4 pairs because you can reuse a pair after you accumulate the result.This is where code generation automation can be helpful: you take 4 pairs and arrange them as circular buffer... Number of 4 is determined by multiplication latency.


    I was wondering why you were always using A-pairs to hold the multiplication results. I figured that maybe cross-path limitations were the reason for that decision.

    I think I am going to preserve A15, A11, A10 and B15 on the stack to be able to use A15-A31 and B15-B31 for the input values. For the accumulator pairs I am going to use registers A1:A0, B1:B0, A3:A2, and B3:B2.

    I am going to move RA from B3 to B5 and ARG2 from A6 to A5 to make room for another A-pair, that is A7:A6. As the four A-pairs I am then able to use registers A3:A2, A7:A6, A9:A8 and A11:A10.

    regards,

    fabian

    EDIT: I just realized that saving B15 on the stack doesn't make much sense, since B15 holds the stack pointer. Instead I am going to use B6, B7 and B8 to preserve A10, A11 and B15.

  • Fabian Büttner said:
    Does the CCS debugger simulate such set-up noise too?

    You mean the simulator? Cycle-accurate, they call it? Or something... I don't know, I don't have it and run the code on hardware...

    Fabian Büttner said:
    I was wondering why were you always using A-pairs to hold multiplication results? I figured that maybe cross-path limitations were the reasoning for that decision.

    It's surely possible to arrange it differently, but it won't give you any advantage. The most recurring execution packet is ADDU||MUL||ADDU. Two of them use a cross-path, in opposite directions, and that's the maximum you can have... If you want to alternate between A and B for the multiplication result, then you'd have to alternate the direction of the ADDUs. Performing two multiplications in one cycle won't result in any improvement, because the code is dominated by ADDU pairs. In other words, you can do it differently, but the code would only be harder to follow; it won't be any faster.

     

  • Fabian and Andy,

    This is a remarkable thread. I do not understand the final purpose of the algorithm you are optimizing, but the discussion is insightful and productive.

    Fabian,

    I recommend you look through the material we have on the Wiki for optimization. Search there for "C6000 optimization" (no quotes), and you will find Wiki topics and full workshops. It can be tough to learn some of the optimization techniques just by reading, but some of the workshop material has some good text in the student guides and helpful diagrams in the presentation slides that are copied into the student guides.

    You can get a lot of insights into optimization techniques by writing your algorithm in C and using the best optimization methods for C (see some of those Wiki topics that your search finds), then look at the assembly output to see how it was written. The compiler can make some very good choices for instruction and register selection, and it can be very enlightening to see how the program flow is implemented.

    Since the simulator was mentioned, be careful that you choose a simulator that includes all of the features you will need. For example, if program and data are large and will need to be stored in external memory, choose a simulator that says it supports external memory. If your program and data are small enough to fit into L1P and L1D SRAM (no cache), then you can use lower-level simulators. I always go for a Device simulator because it has the most functionality; it may also run slower than some of the others, but I think for your application you would never be bothered by the difference.

    The only comment I might make for the code listings above is to consider using the LDDW instruction, or if alignment is not guaranteed use the LDNDW instruction. I cannot offer a guarantee that things will run faster, but you have a good chance of it.
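    In C, the same idea can be expressed with the 8-byte memory intrinsics (a sketch, assuming the _amem8_const/_mem8_const intrinsics together with _loll/_hill; note that the order of the two words within the double word depends on endianness):

        #include <stdint.h>

        /* Sketch: fetch two adjacent 32-bit words with one aligned 8-byte
           access (LDDW).  Use _mem8_const instead if alignment is not
           guaranteed (LDNDW). */
        static inline void load_pair(const uint32_t *p,
                                     uint32_t *w0, uint32_t *w1)
        {
            uint64_t d = _amem8_const(p);   /* one LDDW instead of two LDWs */
            *w0 = _loll(d);                 /* lower 32 bits of the pair    */
            *w1 = _hill(d);                 /* upper 32 bits of the pair    */
        }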

    Regards,
    RandyP

  • RandyP said:
    You can get a lot of insights into optimization techniques by writing your algorithm in C and using the best optimization methods for C (see some of those Wiki topics that your search finds), then look at the assembly output to see how it was written. The compiler can make some very good choices for instruction and register selection, and it can be very enlightening to see how the program flow is implemented.

    Just looking at compiler output [in order to learn the architecture] won't allow you to become better than the compiler. So my recipe is to imagine how to do it in machine code and then see if the compiler does anything similar. At the initial stage the compiler might be better, but soon enough you start to see when it does a proper job and when it's time to devote special effort.

    RandyP said:
    The only comment I might make for the code listings above is to consider using the LDDW instruction, or if alignment is not guaranteed use the LDNDW instruction. I cannot offer a guarantee that things will run faster, but you have a good chance of it.

    You should also mention that it makes the code byte-order dependent. It's not really hard to deal with, but one has to do it. You might be able to shave off one cycle with LDDW (one out of the whole procedure), but you'll have to clutter the code with details not related to the algorithm. LDW-based code, on the other hand, is endian-neutral.

     

     

  • Andy Polyakov said:

    Does the CCS debugger simulate such set-up noise too?

    You mean simulator? Cycle-accurate the call it? Or something... I don't know, I don't have it and run code on hardware...


    Yes, exactly, I meant the cycle-accurate c64x+ simulator.

    Andy Polyakov said:

    I was wondering why were you always using A-pairs to hold multiplication results? I figured that maybe cross-path limitations were the reasoning for that decision.

    It's surely possible to arrange it differently, but it won't give you any advantage. Most recurring execution packet is ADDU||MUL||ADDU. Two of them are cross-path, opposite ones, and it's the maximum you can have... If you want to alter between A and B for multiplication result, then you'd have to alter direction for ADDU. Performing two multiplications in one cycle won't result in any improvement, because code is dominated by ADDU pairs. In other words, you can do it differently, but the code would only be harder to follow, it won't be any faster.


    I see. Thanks for the explanation. I am curious how you figured out the formula that describes what the theoretical best cycle count would/could be.

    I am definitely giving automated code generation a shot.

    Are you somehow involved with the OpenSSL project? Because you said the code will be published in OpenSSL later.

    regards,

    fabian

     

  • Hi Randy,

    The Comba multiplication algorithm is used to compute the product of two large numbers, as needed, e.g., in cryptography. In my case the numbers are 192 bits each. This method is superior to the standard schoolbook multiplication since it reduces the number of carries. It does, however, not change the O(n^2) complexity compared to the standard schoolbook method; it only improves the hidden constant in the O notation.
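    In code, the structure looks roughly like this (a plain-C sketch with illustrative names, not my actual implementation): the result is produced one column at a time, so each result word is written to memory exactly once.

        #include <stdint.h>

        /* Sketch of a product-scanning (Comba) multiply of two n-word
           numbers, least-significant word first.  The running column sum is
           kept in acc (64 bits) plus an extra carry word, i.e. in triple
           precision. */
        static void comba_mul(const uint32_t *a, const uint32_t *b,
                              uint32_t *r, int n)
        {
            uint64_t acc = 0, p;
            uint32_t carry = 0;
            int i, j, k;

            for (k = 0; k < 2 * n - 1; k++) {
                for (i = 0; i < n; i++) {
                    j = k - i;
                    if (j < 0 || j >= n)
                        continue;
                    p = (uint64_t)a[i] * b[j];
                    acc += p;
                    if (acc < p)                /* 64-bit sum overflowed      */
                        carry++;
                }
                r[k] = (uint32_t)acc;           /* one store per result word  */
                acc = (acc >> 32) | ((uint64_t)carry << 32);
                carry = 0;
            }
            r[2 * n - 1] = (uint32_t)acc;
        }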

    I think my C is fairly well optimized.

    I am simulating only small, time-critical routines from my application. The simulator in CCS 5.1 seems to be leaking memory: while testing, my test scenario sometimes crashes, and when I restart the simulation I often find all of my 4 GB of RAM occupied.

    regards,

    fabian

  • Fabian Büttner said:
    I am curious about how you figured out that formula that describes what the theoretical best cycle count would/could be?

    Come on, you've got to earn the degree, figure it out... As already mentioned, the execution is dominated by ADDU pairs: how many are there, how often can the processor schedule them, what are the additional dependencies? Note that I never claimed that my estimate is perfectly accurate, but the hardware did execute my 4x4 subroutine in 32 cycles.

  • Andy Polyakov said:

    I am curious about how you figured out that formula that describes what the theoretical best cycle count would/could be?

    Come on, you've got to earn the degree, figure it out... As already mentioned, the execution is dominated by ADDU pairs, how many are there, how often can processor schedule them, what are additional dependencies? Note that I never claimed that my estimate is perfectly accurate, but hardware did execute my 4x4 subroutine in 32 cycles.


    Hehe,

    I think I have an idea how to do that!

    I integrated your suggestion and moved the B B3 instruction up 5 execution packets, leaving a NOP 1 at the end. It is also possible to move the B B3 instruction up 6 execution packets and get rid of the last NOP 1, isn't it? With the first solution the cycle count dropped to about 60 cycles, with the second to 58, which is absolutely crazy!

    I guess that the two STWs at the end of the routine may not yet have finished writing the register values to memory, so one needs to be careful not to write to those registers; is this correct? The compiler does not have a clue that it is not allowed to modify these registers, does it? So maybe it is safer to just leave the B B3 at the end of the routine, or to place it such that the STWs are guaranteed to have finished their writes.

    At the moment I am writing a script to emit assembly for arbitrary NxN multiplications.

    regards,

    fabian

  • Fabian Büttner said:
    I guess that the two STWs at the end of the routine may not yet have finished writing the register values to memory, so one needs to be careful not to write to those registers; is this correct? The compiler does not have a clue that it is not allowed to modify these registers, does it? So maybe it is safer to just leave the B B3 at the end of the routine, or to place it such that the STWs are guaranteed to have finished their writes.

    The manual is explicit about stores having no delay slots, which should mean that a store's source register can be used as the destination register of another instruction as if it were any other 0-delay instruction, i.e. even in the same execution packet. I've used this pattern (store and target the same register in the same execution packet) in actual code and it does work. In other words, I for one see no reason whatsoever to wait for a store to "finish its writing." Loads are a different story in the context of returning from a subroutine. For example, if you offload some registers to the stack, you have to ensure that the restore from the stack has been committed to the registers by the time execution control is transferred back to the caller. In other words, generally you ought to have 4 cycles after the last load in any particular subroutine.