TMS320C6678: calling assembly language from C

Part Number: TMS320C6678
Other Parts Discussed in Thread: SYSBIOS

Hello everyone,

I have a problem: when I call an assembly-language function from C, I can't set breakpoints and debug inside the assembly code.

Please help me.

in C:

in Assembly Language :

  • Hi,

    Is this Code Composer Studio or another debugger tool?

    Best Regards,
    Yordan
  • hi,
    Yordan Kovachev.

    Thank you very much for your answer.

    It is Code Composer Studio.

    I have since been able to debug in Code Composer Studio, but now I have a new problem.

    Under SYS/BIOS I can't step through code such as the following:

    start=Timestamp_get32();

    sum(ARRAY_A,ARRAY_B,ARRAY_C,ARRAY_D,LENGTH);
    end=Timestamp_get32();
    cost=end-start;

    I set a breakpoint in the sum() function and ran to it at full speed. When I then single-stepped, the following error occurred:

    No source available for "C$$EXIT() at E:\biyework\asm_test1\Debug\asm_test1.out:{3} 0x82723760{4}"

    and the debug session exited.

    If I instead run to the breakpoint, open the Disassembly window and single-step there, I get the correct results, but on exit this appears:

    Can't find a source file at "/tmp/scratch/build_jenkins/workspace/BuildToolsLinux/build/c60/product/linux/lib-internal/src/dtor_list.cpp"

    and execution never returns to the calling code in main.


    I'm confused about it.

    Thanks again.
    Best Regards,
  • This is the assembly code:

    .global sum
    ;A4 B4 A6 is src, B6 is dst
    ;MVK .L1 500,B1
    ; loop:
    sum:
    MVK .L1 2,A3
    MV .L2X A8,B1

    loop:
    LDDW .D1 *A4++[1],A9:A8
    || LDDW .D2 *B4++[1],B9:B8
    ;NOP 4

    LDDW .D1 *A6++[1],A11:A10
    NOP 4

    DMPYSP .M2X A9:A8,B9:B8,B9:B8
    ;MPY32 .M2X A0,B3,B4
    || DMPYSP .M1 A9:A8,A11:A10,A11:A10
    NOP 3
    SUB .L2X B1,A3,B1
    DADDSP .L1X B9:B8,A11:A10,A11:A10
    [B1] B .S1 loop
    NOP 1
    ;ADD .L1 A0,B4,A0
    STDW .D2 A11:A10,*B6++[1]
    NOP 3

    ;NOP 5

    .end
  • Hi,

    The RTOS team have been notified. Feedback will be posted directly here.

    Best Regards,
    Yordan
  • hi,
    Yordan Kovachev.
    I have it running successfully now.

    Here is my working assembly code:

    .global sum
    ;A4 B4 A6 is src, B6 is dst
    ;MVK .L1 500,B1
    ; loop:
    .asmfunc
    sum:
    MVK .L1 2,A3
    MV .L2X A8,B1

    loop:
    LDDW .D1 *A4++[1],A9:A8
    || LDDW .D2 *B4++[1],B9:B8

    LDDW .D1 *A6++[1],A1:A0
    NOP 4

    DMPYSP .M2X A9:A8,B9:B8,B9:B8

    || DMPYSP .M1 A9:A8,A1:A0,A1:A0

    NOP 4
    DADDSP .L1X B9:B8,A1:A0,A1:A0
    NOP 2

    STDW .D2 A1:A0,*B6++[1]

    SUB .L2X B1,A3,B1
    [B1] B .S1 loop

    NOP 5
    B B3
    .endasmfunc
  • Thanks for updating the thread.

    Best Regards,
    Yordan
  • Hi,

    A NOP to fill the delay slots must also be inserted after the return:

    [B1] B.S1 loop

    NOP 5    ; delay slot for previous B loop

    B B3      ; return

    NOP 5    ; delay slot for return branch

    .endasmfunc

    The branch and the NOP can be condensed into a single instruction:  BNOP  B3,5.

  • hi,
    Alberto Chessa.
    Thank you for your answer.

    I will pay attention to this in future programming.

    But now I have a new puzzle, about computing A*B+C*D (where A, B, C and D are floating point).

    The C version compiled with -O3 runs in less time than my hand-optimized assembly.

    In principle, hand-optimized assembly should run faster than C.

    I don't know whether my assembly is poorly written, or whether -O3 optimization of C is now better than hand-written assembly.

    Please help me if you know.

    My code is attached.

    C language:

    for(i=0;i<LENGTH;i++) //LENGTH IS 10000
    {
    ARRAY_E[i]=ARRAY_A1[i]*ARRAY_B1[i]+ARRAY_A1[i]*ARRAY_C1[i];

    }


    assembly language:

    .global sum

    .asmfunc
    sum:
    MVK .L1 2,A3
    MV .L2X A8,B1

    loop:
    LDDW .D1 *A4++[1],A9:A8
    || LDDW .D2 *B4++[1],B9:B8

    LDDW .D1 *A6++[1],A1:A0
    NOP 4

    DMPYSP .M2X A9:A8,B9:B8,B9:B8

    || DMPYSP .M1 A9:A8,A1:A0,A1:A0

    NOP 4
    DADDSP .L1X B9:B8,A1:A0,A1:A0
    NOP 2

    STDW .D2 A1:A0,*B6++[1]

    SUB .L2X B1,A3,B1
    [B1] B .S1 loop

    NOP 5

    BNOP B3,5
    .endasmfunc


    best regards.
    thank you.
  • In my experience this is quite normal: the C optimizer works better than my mind, so I never write assembler.
    In your case, I guess the compiler is using a so-called software-pipelined loop.

    Look at SPRABG7 and SPRABF2. You'll see that if you instruct the C compiler correctly, it can do an excellent job.

    Besides the normal, generic optimization techniques, the key is to inform the compiler correctly about preconditions on the data, that is, to make good use of C "restrict" and "const", and of the loop pragma about data alignment and number of iterations (if possible, always align your data to 8 bytes, which enables some SIMD instructions). "--speculate_loads" can also help.

    In the worst case, to force the compiler to use some SIMD instructions, you can use C with intrinsic operations.
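As a rough illustration of the intrinsics route mentioned above, here is a minimal sketch (not tested on hardware) that handles two floats per step. The name `sum_pairs` and the portable fallback branch are assumptions added for illustration. The TI branch uses the C66x intrinsics `_amem8_f2_const`, `_amem8_f2`, `_dmpysp` and `_daddsp` from `c6x.h` as I understand them from the compiler manual, so check them against your compiler version. It assumes 8-byte-aligned, non-overlapping arrays and an even element count.

```c
#include <stddef.h>

#if defined(_TMS320C6600)
#include <c6x.h>

/* TI C66x branch: 64-bit aligned loads/stores, two packed multiplies
   (DMPYSP) and one packed add (DADDSP) per pair of floats.
   Assumption: all arrays are 8-byte aligned and n is even. */
void sum_pairs(float *restrict e, const float *restrict a,
               const float *restrict b, const float *restrict c, size_t n)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        __float2_t va = _amem8_f2_const(&a[i]);
        __float2_t vb = _amem8_f2_const(&b[i]);
        __float2_t vc = _amem8_f2_const(&c[i]);
        _amem8_f2(&e[i]) = _daddsp(_dmpysp(va, vb), _dmpysp(va, vc));
    }
}
#else

/* Portable fallback with the same pairwise structure, so the sketch
   also builds with an ordinary host compiler. Assumption: n is even. */
void sum_pairs(float *restrict e, const float *restrict a,
               const float *restrict b, const float *restrict c, size_t n)
{
    size_t i;
    for (i = 0; i + 1 < n; i += 2) {
        e[i]     = a[i]     * b[i]     + a[i]     * c[i];
        e[i + 1] = a[i + 1] * b[i + 1] + a[i + 1] * c[i + 1];
    }
}
#endif
```

In most cases the plain C loop with `restrict` and alignment hints should be tried first. Intrinsics are the fallback when the compiler does not pick the packed instructions on its own.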
  • hi,

    Alberto Chessa.

    thank you very much for your answer. I'll read the SPRABG7 and SPRABF2 carefully.

    It's great for me to get started.
    About your answer: I don't understand "loop pragma about data alignment". Do you mean loadAlign in the memory configuration, or something else?

    Thanks again
    best regards.
  • I'm talking about the alignment of the vector data you pass to sum(). I don't know what loadAlign is (I suppose it is something in TI SYSBIOS). This will become clear when you read the optimization manual.


    Anyway, this is an example (not tested):

    #define LENGTH 10000
    
    #pragma FUNC_CANNOT_INLINE(sum);
    void sum(float* E, const float* A1, const float* B1, const float* C1)
    {
      int i;
    
      float* const restrict ARRAY_E=E;
      const float* const restrict ARRAY_A1=A1;
      const float* const restrict ARRAY_B1=B1;
      const float* const restrict ARRAY_C1=C1;
    
      _nassert((int)ARRAY_E % 8 ==0);   //precondition, not enforced by the compiler
      _nassert((int)ARRAY_A1 % 8 ==0);  //...
      _nassert((int)ARRAY_B1 % 8 ==0);
      _nassert((int)ARRAY_C1 % 8 ==0);
    
    #pragma MUST_ITERATE(10000, 10000);   //not required here: LENGTH is constant, so the compiler can work out the trip count by itself
      for(i=0;i<LENGTH;i++) //LENGTH IS 10000
      {
        ARRAY_E[i]=ARRAY_A1[i]*ARRAY_B1[i]+ARRAY_A1[i]*ARRAY_C1[i];
      }
    }
    
    #pragma DATA_ALIGN(F_E, 8);   //align to 8 bytes, so to meet the precondition of sum()
    float F_E[LENGTH];
    
    #pragma DATA_ALIGN(F_A1, 8);
    float F_A1[LENGTH];
    
    #pragma DATA_ALIGN(F_B1, 8);
    float F_B1[LENGTH];
    
    #pragma DATA_ALIGN(F_C1, 8);
    float F_C1[LENGTH];
    
    
    void call_sum()
    {
      sum(F_E, F_A1, F_B1, F_C1);
    }
    

    Try compiling with "-O3, --opt_for_speed=5, --optimize_with_debug=on, --speculate_loads=auto, --debug_software_pipeline" and then look at the generated assembler (--keep_asm).

    Estimated total cycle should be 10020, that is one cycle per iteration (two multiplications and one sum per iteration)! Hard to do better by direct asm coding (at least for me).

    Of course, in "real life" you have to be sure to meet the preconditions: data alignment, data dependency (the "restrict" qualifier) and the iteration count properties (see MUST_ITERATE in the compiler manual).
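Since `_nassert()` only makes a promise to the compiler and does not check anything, one way to guard that promise is to test the alignment at run time before calling sum(). This is a minimal sketch: the helper names `is_aligned8` and `sum_preconditions_ok` are hypothetical and not part of any TI library.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns 1 when p sits on an 8-byte boundary, 0 otherwise. */
static int is_aligned8(const void *p)
{
    return ((uintptr_t)p % 8u) == 0u;
}

/* Hypothetical guard: returns 1 only when all four buffers meet the
   8-byte alignment that the _nassert() lines promise to the compiler.
   It does not check the "restrict" (no overlap) precondition. */
static int sum_preconditions_ok(const float *e, const float *a,
                                const float *b, const float *c)
{
    return is_aligned8(e) && is_aligned8(a) &&
           is_aligned8(b) && is_aligned8(c);
}
```

A caller could check `sum_preconditions_ok(F_E, F_A1, F_B1, F_C1)` in a debug build and fall back to an unoptimized path, or stop, when it returns 0.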

  • hi,
    Alberto Chessa.
    Thank you very much for your answer. Your code style and way of writing have broadened my horizons. Thanks again.

    I turned on "-O3, --opt_for_speed=5, --optimize_with_debug=on, --speculate_loads=auto, --debug_software_pipeline". "-O3" seems very useful, but "--opt_for_speed=5, --optimize_with_debug=on, --speculate_loads=auto, --debug_software_pipeline" do not seem to make much difference. Could it be that I have not enabled them correctly?

    The loadAlign I mentioned is a SYS/BIOS memory configuration setting, but it doesn't seem very useful.

    I ran your code and it took just 28582 clock cycles, whereas my assembly code takes 77502 clock cycles (measured with the CCS 5.2 menu Run -> Clock). I don't understand how the measured 28582 cycles relate to your "Estimated total cycle should be 10020". Perhaps I am not measuring correctly, but in the generated assembler (--keep_asm) I do see "10020" on the last line:

    ;*----------------------------------------------------------------------------*
    ;* SOFTWARE PIPELINE INFORMATION
    ;*
    ;* Loop found in file : ../main.c
    ;* Loop source line : 50
    ;* Loop opening brace source line : 51
    ;* Loop closing brace source line : 53
    ;* Loop Unroll Multiple : 10x
    ;* Known Minimum Trip Count : 1000
    ;* Known Maximum Trip Count : 1000
    ;* Known Max Trip Count Factor : 1000
    ;* Loop Carried Dependency Bound(^) : 0
    ;* Unpartitioned Resource Bound : 10
    ;* Partitioned Resource Bound(*) : 10
    ;* Resource Partition:
    ;*                            A-side   B-side
    ;* .L units                      0        0
    ;* .S units                      0        0
    ;* .D units                     10*      10*
    ;* .M units                      5        5
    ;* .X cross paths                7        4
    ;* .T address paths             10       10
    ;* Logical ops (.LS)             3        2     (.L or .S unit)
    ;* Addition ops (.LSD)           0        0     (.L or .S or .D unit)
    ;* Bound(.L .S .LS)              2        1
    ;* Bound(.L .S .D .LS .LSD)      5        4
    ;*
    ;* Searching for software pipeline schedule at ...
    ;* ii = 10 Schedule found with 3 iterations in parallel
    ;*
    ;* Register Usage Table:
    ;* +-----------------------------------------------------------------+
    ;* |AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA|BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB|
    ;* |00000000001111111111222222222233|00000000001111111111222222222233|
    ;* |01234567890123456789012345678901|01234567890123456789012345678901|
    ;* |--------------------------------+--------------------------------|
    ;* 0: | * **** ** ** ******* | ** ** ********** |
    ;* 1: | ******* ** ** ******* | ** ** ** ********** |
    ;* 2: | ******* ** ******* | ** ************** |
    ;* 3: | ******* ** ******* | ****** ************** |
    ;* 4: | ******* ** ******* | ****** ********** |
    ;* 5: | * **** ** ******* | **** ** ********** |
    ;* 6: | *** ** ** ******* | ** ** ********** |
    ;* 7: | ******* **** ******* | ****** ************** |
    ;* 8: | ******* *************** | ****** ************ |
    ;* 9: | ***** ************* | ****** ********** |
    ;* +-----------------------------------------------------------------+
    ;*
    ;* Done
    ;*
    ;* Loop will be splooped
    ;* Collapsed epilog stages : 0
    ;* Collapsed prolog stages : 0
    ;* Minimum required memory pad : 0 bytes
    ;*
    ;* Minimum safe trip count : 1 (after unrolling)
    ;* Min. prof. trip count (est.) : 3 (after unrolling)
    ;*
    ;* Mem bank conflicts/iter(est.) : { min 0.000, est 1.500, max 6.000 }
    ;* Mem bank perf. penalty (est.) : 13.0%
    ;*
    ;* Effective ii : { min 10.00, est 11.50, max 16.00 }
    ;*
    ;*
    ;* Total cycles (est.) : 20 + min_trip_cnt * 10 = 10020
    ;*----------------------------------------------------------------------------*



    Finally, I found that I could not assign values to F_A1, F_B1 and F_C1. How should I assign to them if I want to use them?

    thanks again.
    best regards.
    Frank Zach.
  • Hi,

    It is normal for the actual cycle count to be greater than the estimate, since the effective time is also "data bound". The difference is more or less the time required to read the data from memory. If the data are not already in the L1 data cache, the CPU will stall while reading them from DDR or from other level 2 memory (such as MSMC or the L2 cache).

    About assigning the data vectors, I cannot figure out your problem. You have to be more specific: is it a compilation error? Where are the data allocated? How do you fill them?
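To put a number on the memory-stall point in this reply, here is a minimal sketch that works out how much data one call to sum() touches. The function names are hypothetical, and the 32 KB figure used in the note below is the L1D size from the C6678 documentation, assuming all of L1D is configured as cache.

```c
#include <stddef.h>

/* Bytes touched by one call: three input vectors plus one output
   vector, each holding `length` floats. */
size_t working_set_bytes(size_t length)
{
    return 4u * length * sizeof(float);
}

/* Largest vector length for which all four vectors fit together
   in a cache of cache_bytes bytes. */
size_t max_length_for_cache(size_t cache_bytes)
{
    return cache_bytes / (4u * sizeof(float));
}
```

With LENGTH at 10000 and 4-byte floats, `working_set_bytes` gives 160000 bytes, while `max_length_for_cache(32 * 1024)` gives 2048 elements per vector, so the posted test cannot run entirely out of a 32 KB L1D.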

  • hi,

    Alberto Chessa.

    thank you for your answer.

    The mistake was due to my own negligence; the code compiles now.

    thanks again.

    best regards.

    Frank Zach

    To help other community members, I attach my project:

    ti_asm_test.zip

  • Alberto Chessa said:

    It is normal for the actual cycle count to be greater than the estimate, since the effective time is also "data bound". The difference is more or less the time required to read the data from memory. If the data are not already in the L1 data cache, the CPU will stall while reading them from DDR or from other level 2 memory (such as MSMC or the L2 cache).

    I'd say that it's not "also data bound" but simply "data bound", in every sense. Of all the types of execution units, it is the data load/store units that get used up in this algorithm. The multiplication units are fully utilized too, but one multiplication can be omitted, since a*b + a*c = a*(b+c), and the addition units are underutilized. And then of course, as correctly pointed out, the processor has to stall waiting for data to be delivered to the cache. One should recognize that the data set in the presented example is larger than the cache, which naturally means that the actual execution time is bound to be noticeably higher than an estimate based solely on the instruction schedule.

    Alberto Chessa said:

    Estimated total cycle should be 10020, that is one cycle per iteration (two multiplication and one sum per iteration)! Hard to do better by direct asm coding (at least for me).

    While ~10000 is indeed the best possible estimate for this algorithm, that doesn't make the assembly programming exercise meaningless. When it comes to performance, the real question is not whether the result is sufficient under the circumstances, but how far it is from the theoretical best. You commonly find yourself asking what the best is that the system can do, and the next question is inevitably whether the compiler does an adequate job. The best way to get acquainted with the architecture well enough to make such a judgement is to do some assembly programming. In other words, it makes sense to do some now in order to avoid doing it later on, or at least to know when you don't have to. So in this particular case you, Frank, should aim for the ~10000 estimate. You are currently at ~40000, i.e. 4 times below processor capacity. To make the comparison more meaningful, I would suggest reducing the data set size so that it fits in cache. In that case the measured and estimated results would be close to each other...
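The identity a*b + a*c = a*(b+c) mentioned above can be sketched in C as follows. The function names are made up for illustration. Note that floating-point rounding means the two forms can differ in the last bits for general inputs, so this is only a valid substitution when that is acceptable.

```c
#include <stddef.h>

/* As posted in the thread: two multiplications and one addition
   per element. */
void sum_two_mul(float *e, const float *a, const float *b,
                 const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        e[i] = a[i] * b[i] + a[i] * c[i];
    }
}

/* Factored form: one addition and one multiplication per element,
   which frees a multiplier and uses the otherwise idle adders. */
void sum_one_mul(float *e, const float *a, const float *b,
                 const float *c, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) {
        e[i] = a[i] * (b[i] + c[i]);
    }
}
```

Both versions still need three loads and one store per element, so if the loop is limited by the load/store units, as argued above, the factoring alone may not reduce the cycle count.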

  • Andy Polyakov said:

    So in this particular case you, Frank, should aim for the ~10000 estimate. You are currently at ~40000, i.e. 4 times below processor capacity. To make the comparison more meaningful, I would suggest reducing the data set size so that it fits in cache. In that case the measured and estimated results would be close to each other...

    Just to clarify, 10000 and 40000 are for the specific input vector length [of 10000]. When you shorten the inputs, the estimated result will naturally shrink proportionally. In other words, the estimate is better thought of as cycles per vector element: in the above example, 1 cycle per float vs. 4 cycles per float.
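The per-element view can be written down as a one-line helper. This is only a sketch: the function name is hypothetical, and the figures in the note below are the ones quoted earlier in the thread for a vector length of 10000.

```c
#include <stddef.h>

/* Normalizes a cycle count by the vector length so that runs with
   different lengths can be compared directly. */
double cycles_per_element(unsigned long cycles, size_t length)
{
    return (double)cycles / (double)length;
}
```

For example, the compiler's estimate of 10020 cycles comes to about 1.0 cycle per float, the measured 28582 cycles for the C version to about 2.9, and the measured 77502 cycles for the assembly version to about 7.8.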