Other Parts Discussed in Thread: SYSBIOS
Hello everyone,
I have a problem: when I call an assembly-language function from C, I cannot set breakpoints to debug inside the assembly code.
Please help me.
In C:
In assembly language:
Hi,
A NOP to fill the delay slots must also be inserted after the return:
[B1] B.S1 loop
NOP 5 ; delay slot for previous B loop
B B3 ; return
NOP 5 ; delay slot for return branch
.endasmfunc
The branch and the NOP can be condensed into a single instruction: BNOP B3,5.
I am talking about the alignment of the vector data you pass to sum(). I don't know what loadAlign is (I suppose it is something related to TI SYSBIOS). This will become clear once you read the optimization manual.
Anyway, this is an example (not tested):
#define LENGTH 10000

#pragma FUNC_CANNOT_INLINE(sum);
void sum(float* E, const float* A1, const float* B1, const float* C1)
{
    int i;
    float* const restrict ARRAY_E=E;
    const float* const restrict ARRAY_A1=A1;
    const float* const restrict ARRAY_B1=B1;
    const float* const restrict ARRAY_C1=C1;

    _nassert((int)ARRAY_E % 8 ==0); //precondition, not enforced by the compiler
    _nassert((int)ARRAY_A1 % 8 ==0); //...
    _nassert((int)ARRAY_B1 % 8 ==0);
    _nassert((int)ARRAY_C1 % 8 ==0);

    #pragma MUST_ITERATE(10000, 10000); //not required since LENGTH is constant and the compiler can figure out it by himself
    for(i=0;i<LENGTH;i++) //LENGTH IS 10000
    {
        ARRAY_E[i]=ARRAY_A1[i]*ARRAY_B1[i]+ARRAY_A1[i]*ARRAY_C1[i];
    }
}

#pragma DATA_ALIGN(F_E, 8); //align to 8 bytes, so to meet the precondition of sum()
float F_E[LENGTH];
#pragma DATA_ALIGN(F_A1, 8);
float F_A1[LENGTH];
#pragma DATA_ALIGN(F_B1, 8);
float F_B1[LENGTH];
#pragma DATA_ALIGN(F_C1, 8);
float F_C1[LENGTH];

void call_sum()
{
    sum(F_E, F_A1, F_B1, F_C1);
}
Try compiling with "-O3, --opt_for_speed=5, --optimize_with_debug=on, --speculative_loads=auto, --debug_software_pipeline" and then look at the generated assembly (--keep_asm).
The estimated total cycle count should be 10020, that is, one cycle per iteration (two multiplications and one addition per iteration)! Hard to do better by direct asm coding (at least for me).
Of course, in "real life" you have to be sure the preconditions are met: data alignment, no dependencies between the arrays (the "restrict" qualifier) and the iteration count properties (see MUST_ITERATE in the compiler manual).
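As a rough illustration of checking those preconditions at run time before calling sum(), here is a minimal sketch in portable C. The helper names (is_aligned, ranges_overlap, sum_preconditions_hold) are hypothetical and not part of any TI library; it only assumes a C99 compiler with <stdint.h>.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if ptr is aligned to `alignment` bytes.
 * `alignment` must be a non-zero power of two. */
static int is_aligned(const void *ptr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)ptr & (alignment - 1)) == 0;
}

/* Hypothetical helper: returns 1 if two buffers of `bytes` bytes each
 * share at least one byte (which would violate the restrict promise). */
static int ranges_overlap(const void *p, const void *q, size_t bytes)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    return a < b + bytes && b < a + bytes;
}

/* Returns 1 when the four buffers are 8-byte aligned and the output
 * buffer does not overlap any of the inputs. */
static int sum_preconditions_hold(const float *e, const float *a1,
                                  const float *b1, const float *c1,
                                  size_t length)
{
    size_t bytes = length * sizeof(float);
    return is_aligned(e, 8) && is_aligned(a1, 8) &&
           is_aligned(b1, 8) && is_aligned(c1, 8) &&
           !ranges_overlap(e, a1, bytes) &&
           !ranges_overlap(e, b1, bytes) &&
           !ranges_overlap(e, c1, bytes);
}
```

A debug build could call sum_preconditions_hold(F_E, F_A1, F_B1, F_C1, LENGTH) inside an assert before sum(); a release build would normally drop the check.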
Hi,
It is normal for the actual cycle count to be greater than the estimate, since the effective time is also "data bound". The difference is more or less the time required to read the data from memory. If the data are not already in the L1 data cache, the CPU will stall while reading them from DDR or from other level 2 memory (such as MSMC or L2 cache).
About the assignment of the data vectors, I cannot figure out your problem. You have to be more specific: is it a compilation error? Where is the data allocated? How do you fill it?
Hi Alberto Chessa,
Thank you for your answer.
The mistake was due to my negligence; the code now compiles.
Thanks again.
Best regards,
Frank Zach
To help other community members, I am attaching my project:
Alberto Chessa said: It is normal for the actual cycle count to be greater than the estimate, since the effective time is also "data bound". The difference is more or less the time required to read the data from memory. If the data are not already in the L1 data cache, the CPU will stall while reading them from DDR or from other level 2 memory (such as MSMC or L2 cache).
I'd say that it's not "also data bound", but just "data bound", in every sense. Of all the execution port types, it is the data load/store ports that get used up in this algorithm. The multiplication ports are fully utilized too, but it's possible to omit one multiplication, since a*b + a*c = a*(b+c), and the addition ports are underutilized. And then of course, as correctly pointed out, the processor has to stall waiting for data to be delivered to cache. One should recognize that the data set in the presented example is larger than the cache, which naturally means that the actual execution time is bound to be noticeably higher than an estimate based solely on the instruction schedule.
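To make the a*b + a*c = a*(b+c) remark concrete, here is a minimal sketch in portable C of both forms. The function names are hypothetical, and the length is passed as a parameter instead of the fixed LENGTH used earlier in the thread. Note that in floating point the two forms are not guaranteed to round identically, so results may differ in the last bits for general inputs.

```c
#include <assert.h>
#include <stddef.h>

/* Form used in the thread: two multiplications and one addition
 * per element. */
void sum_two_mul(float *e, const float *a, const float *b,
                 const float *c, size_t n)
{
    size_t i;
    assert(e != NULL && a != NULL && b != NULL && c != NULL);
    for (i = 0; i < n; i++)
        e[i] = a[i] * b[i] + a[i] * c[i];
}

/* Factored form: one multiplication and one addition per element.
 * The number of loads and stores per element is unchanged. */
void sum_factored(float *e, const float *a, const float *b,
                  const float *c, size_t n)
{
    size_t i;
    assert(e != NULL && a != NULL && b != NULL && c != NULL);
    for (i = 0; i < n; i++)
        e[i] = a[i] * (b[i] + c[i]);
}
```

Since both versions still perform three loads and one store per element, the factoring frees a multiplier but is not expected to change the picture if the load/store ports are the limiting resource.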
Alberto Chessa said: The estimated total cycle count should be 10020, that is, one cycle per iteration (two multiplications and one addition per iteration)! Hard to do better by direct asm coding (at least for me).
While ~10000 is indeed the best possible estimate for this algorithm, that doesn't make the assembly programming exercise meaningless. When it comes to performance, the real question is not whether the result is sufficient under the circumstances, but how far it is from the theoretical best. You commonly find yourself asking what the best the system can do is, and the next question is inevitably whether the compiler does an adequate job. The best way to get acquainted with the architecture well enough to make such a judgement is to do some assembly programming. In other words, it makes sense to do some now in order to avoid doing it later on, or at least to know when you don't have to. So in this particular case you, Frank, should aim for an estimate of ~10000. You are currently at ~40000, i.e. 4 times below processor capacity. To make the comparison more meaningful, I would suggest reducing the data set size so that it fits in cache. In that case the measured and estimated results would be close to each other...
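As a back-of-the-envelope aid for picking a data set that fits in cache, here is a small sketch in portable C. The function names are hypothetical, and the cache size is a parameter you have to take from your device's data sheet (the thread does not state it). The sketch ignores associativity and anything else competing for the cache, so treat the result as an upper bound. With 4-byte floats, the four arrays of 10000 elements from the example occupy 160000 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* The example touches four float arrays: E, A1, B1 and C1. */
#define NUM_ARRAYS 4u

/* Total bytes touched by sum() for a given vector length. */
size_t working_set_bytes(size_t length)
{
    return NUM_ARRAYS * length * sizeof(float);
}

/* Largest vector length whose four arrays fit in `cache_bytes`.
 * `cache_bytes` is an assumption supplied by the caller. */
size_t max_length_for_cache(size_t cache_bytes)
{
    assert(cache_bytes > 0);
    return cache_bytes / (NUM_ARRAYS * sizeof(float));
}
```

For instance, assuming a 32 KB data cache and 4-byte floats, max_length_for_cache(32 * 1024) gives 2048 elements per array.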
Andy Polyakov said: So in this particular case you, Frank, should aim for an estimate of ~10000. You are currently at ~40000, i.e. 4 times below processor capacity. To make the comparison more meaningful, I would suggest reducing the data set size so that it fits in cache. In that case the measured and estimated results would be close to each other...
Just to clarify, 10000 and 40000 are for the specific input vector length of 10000. When you shorten the inputs, the estimated result will naturally shrink proportionally. In other words, the estimate should rather be thought of as cycles per vector element: in the example above, 1 cycle per float vs. 4 cycles per float.
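The normalization is simple enough to sketch in portable C. The function names below are hypothetical; the figures used in the usage note are the ones quoted in this thread.

```c
#include <assert.h>

/* Cycles per vector element for a given total cycle count. */
double cycles_per_element(unsigned long total_cycles,
                          unsigned long n_elements)
{
    assert(n_elements > 0);
    return (double)total_cycles / (double)n_elements;
}

/* Rescale a per-element figure to a different vector length.
 * Fixed overhead (loop prolog and epilog) is ignored here. */
double scaled_cycles(double per_element, unsigned long n_elements)
{
    return per_element * (double)n_elements;
}
```

With the numbers above, cycles_per_element(40000, 10000) is 4 cycles per float and cycles_per_element(10020, 10000) is about 1.002, so at 4 cycles per float a vector of 2000 elements would be expected to take roughly 8000 cycles.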