Mixing asm and C on C6474

Hello,

I want to write an assembly function with the prototype: long long fastmul(long long a, long long b); which I declared in my main.c.

If I write the following code for the assembly function :

.global _fastmul

_fastmul:

...

B B3

That function returns only the contents of the 32-bit register A4; it is supposed to return 64-bit data (perhaps in A5:A4). What am I missing?

 

Thanks

  • Mounir,

    There are many possibilities: incorrect assignments and incorrect coding practices. C6x assembly code is not easily learned and is not easily taught. 

    The best solution is to write the function in C, compile it with the -k option to keep the assembly output (may vary with CCS version), and use that as your starting point.

    Regards,
    RandyP

     

    If you need more help, please reply back. If this answers the question, please click  Verify Answer  , below.

  • Thanks RandyP for your reply,

    The C code doesn't meet my timing requirements. It is a complex matrix product; knowing that the DSP can execute 2 CMPY instructions per cycle, I reach only 32% efficiency with the C code, so I think the compiler isn't optimizing my C code well.

    Here is my asm function [prototype declared in the .c file: long long fastmul(long long a, long long b)]; [inputs: four 32-bit complex numbers x1, x2, x3, x4; the output should be the 64-bit complex number x1*x2 + x3*x4]:

     

    .global _fastmul

    _fastmul:

    CMPY .M1X A4,B4,A7:A6
    || CMPY .M2X B5,A5,B7:B6
    NOP 3
    ADD .L1X A6,B6,A4
    || ADD .L2X B7,A7,B5
    MV .S1X B5,A5
    B B3
    That should do the same (almost the same) as:

    long long cmpy_mul(long long a,long long b) {
    long long resA,resB;
    resA=_cmpy(_loll(a),_loll(b));
    resB=_cmpy(_hill(a),_hill(b));
    return resA+resB;
    }

    Running these statements:

    printf("asm : %ld\n",fastmul(0x3950503940408040,0x4090404039393940));
    printf("c : %ld\n",cmpy_mul(0x3950503940408040,0x4090404039393940));

    gives:

    asm : 342741584
    c : 4637708880

    I checked that A4 contains 342741584, so only 32 bits are returned!
    What do you suggest, please?
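The two printed values can be reproduced on a host machine with a small model, which helps separate a return-value problem from an arithmetic or printing one. The sketch below is not TI code: cmpy_model, model_c, model_asm and low40 are hypothetical helper names, saturation in CMPY is not modelled, and it assumes that long is 40 bits wide on the target toolchain, so that %ld shows only the low 40 bits of a long long.

```c
#include <stdint.h>

/* Host-side model of CMPY (assumption: saturation is not modelled).
 * Each operand packs a signed 16-bit real part in the upper half and a
 * signed 16-bit imaginary part in the lower half. The result holds the
 * real part in the upper 32 bits and the imaginary part in the lower 32. */
static uint64_t cmpy_model(uint32_t x, uint32_t y)
{
    int64_t xr = (int16_t)(x >> 16), xi = (int16_t)(x & 0xFFFFu);
    int64_t yr = (int16_t)(y >> 16), yi = (int16_t)(y & 0xFFFFu);
    uint32_t re = (uint32_t)(xr * yr - xi * yi);
    uint32_t im = (uint32_t)(xr * yi + xi * yr);
    return ((uint64_t)re << 32) | im;
}

/* Mirrors cmpy_mul(): a single 64-bit add, so a carry out of the
 * imaginary word propagates into the real word. */
static uint64_t model_c(uint64_t a, uint64_t b)
{
    uint64_t resA = cmpy_model((uint32_t)a, (uint32_t)b);
    uint64_t resB = cmpy_model((uint32_t)(a >> 32), (uint32_t)(b >> 32));
    return resA + resB;
}

/* Mirrors the two parallel ADDs in the assembly: each 32-bit word is
 * summed on its own, with no carry between the words. */
static uint64_t model_asm(uint64_t a, uint64_t b)
{
    uint64_t resA = cmpy_model((uint32_t)a, (uint32_t)b);
    uint64_t resB = cmpy_model((uint32_t)(a >> 32), (uint32_t)(b >> 32));
    uint32_t im = (uint32_t)resA + (uint32_t)resB;
    uint32_t re = (uint32_t)(resA >> 32) + (uint32_t)(resB >> 32);
    return ((uint64_t)re << 32) | im;
}

/* Low 40 bits of a value: what %ld would print if long is 40 bits wide
 * (assumption about the target toolchain). */
static uint64_t low40(uint64_t v)
{
    return v & 0xFFFFFFFFFFull;
}
```

For the inputs in the post, low40(model_asm(a, b)) is 342741584 and low40(model_c(a, b)) is 4637708880, and the two full 64-bit results differ by exactly 2^32. Under these assumptions the difference in the printout comes from the carry in the 64-bit add plus the %ld format, not from a missing upper word. The model says nothing about branch delay slots, so it cannot confirm that the assembly as posted behaves correctly on the device.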

     

  • The compiler's assembly output will show you how to write the function to accept the arguments and to return the result, and also how to handle the rest of the control of the function. You do not have to use the math contents of the compiler's assembly output, but you can learn from the compiler's techniques.

    Also, you need to learn to use the CCS debugger if you plan to write assembly code. Using printf's for debugging assembly makes your job that much harder.

    And writing the function in C first will also allow you to test your C testbench to find out what else you need to modify.

    Here is my testbench C code:

    #include <stdio.h>

    long long fastmul(long long a,long long b);

    void main()
    {
        long long x = 0;
        long long a          = 0x01aaaaaabbbbbbbbLL;
        unsigned long long b = 0x02ccccccddddddddULL;
    //    printf("a: %lld, b: %llu\n", a, b);
       
        x = fastmul( a, b );
    }

    Here is my testbench asm code:

        .global _fastmul

    _fastmul:
            mvkl    0x03eeeeee, a5
            mvkh    0x03eeeeee, a5
            mvkl    0xffffffff, a4
            mvkh    0xffffffff, a4

            B        B3
            NOP        5

    In CCS, I can see in the Core Registers display window that when I enter _fastmul, the correct registers are loaded as expected for the parameters passed through the function argument list. And I can see that after completing the assignment to x in main() the correct value is loaded into x.
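To know which values to look for in the Core Registers window, it can help to split the 64-bit test values into their 32-bit halves first. The sketch below runs on a host and assumes the usual C6000 register convention as I understand it: the first 64-bit argument arrives in A5:A4, the second in B5:B4, and a 64-bit result is returned in A5:A4. The names reg_pair, split64 and join64 are hypothetical helpers, not part of any TI header.

```c
#include <stdint.h>

/* Hypothetical helper type: the odd (hi) and even (lo) register of a pair. */
typedef struct {
    uint32_t hi;   /* e.g. A5 or B5 (assumed convention) */
    uint32_t lo;   /* e.g. A4 or B4 (assumed convention) */
} reg_pair;

/* Split a 64-bit value into the two words a register pair would hold. */
static reg_pair split64(uint64_t v)
{
    reg_pair p;
    p.hi = (uint32_t)(v >> 32);
    p.lo = (uint32_t)v;
    return p;
}

/* Rebuild the 64-bit value that a register pair represents. */
static uint64_t join64(uint32_t hi, uint32_t lo)
{
    return ((uint64_t)hi << 32) | lo;
}
```

With the testbench values above, split64(0x01aaaaaabbbbbbbb) gives hi = 0x01aaaaaa and lo = 0xbbbbbbbb for the first argument, and join64(0x03eeeeee, 0xffffffff) gives 0x03eeeeeeffffffff, the value expected in x after the call if the return convention is as assumed.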

    Finally, you need to search the TI Wiki Pages for C6000 Optimization and study the Wiki topics and the workshop material. It would be best to take a class in C6000 optimization, but that may not be likely. Other than for trivial examples like my C code above, I always start with the C version, keep the assembly, strip it down to the parts I need, then start optimizing from there. Your code needs further optimization, as does mine above.

    Regards,
    RandyP

     


  • Thanks so much,

    I was missing the "NOP 5" instruction after the branch instruction, so I was getting weird behaviour.

  • I am glad you found your problem.

    As a parting gift, two more suggestions:

    1. Try linear assembly. The assembly optimizer will automatically insert the right number of NOPs and will optimize the code.

    2. Get rid of the NOP 5 and move the B B3 to be paralleled with the CMPY instructions.

    Regards,
    RandyP

  • Yes, I was planning to follow your second suggestion. It worked as well, and improved efficiency.

    Thanks.

    It's so exciting to control the DSP's parallel architecture the way we want.

    Best regards, Mounir