Hi,
I'm using CCS v4.2.4. I wrote a short test in linear assembly, in a file test_pipe.sa,
which simply implements: for (i=0; i<N; i++) p_u[i] = p_u[i] + p_v[i]
_test:  .cproc p_u, N, p_v
        .reg j, u, ref_u, break_flag
        .no_mdep
        MV N, j
loop:   .trip 16
  [j]   SUB j, 1, j
        LDH *+p_u[j], u
        LDH *+p_v[j], ref_u
        ADD u, ref_u, u
        STH u, *+p_u[j]
  [j]   B loop
        .endproc
After compiling with --debug_software_pipeline, the compiler reports Loop Carried Dependency Bound(^) : 0
Then I modified the code slightly, so that the loop ends with:
        ZERO break_flag
  [!j]  MVK 1, break_flag
  [!break_flag] B loop
The compiler now reports Loop Carried Dependency Bound(^) : 3
This doesn't make sense to me. Can somebody tell me where the dependency is?
The attached file is the asm file produced by the compiler for the second test code:
1738.test_pipe.txt
Thanks
Shi Tianqi
Shi,
A loop carried dependency means that the results of one iteration of the loop are required as inputs in the next iteration of the loop. It might not look like you have such a dependency, but the compiler is protecting against the case when the arrays p_v and p_u overlap.
Are you using the "restrict" keyword to describe p_v and p_u? I think that will fix your problem.
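For readers unfamiliar with the keyword, here is a minimal C sketch of what "restrict" expresses. It assumes a C equivalent of the loop in test_pipe.sa; the function name add_arrays is hypothetical and does not come from the original code. Qualifying both pointers with restrict is a promise from the programmer that the two arrays do not overlap, which lets the compiler drop the memory dependence between the store to p_u and later loads from p_v.

```c
#include <assert.h>

/* Hypothetical C equivalent of the loop in test_pipe.sa.
 * "restrict" promises that, inside this function, the memory reached
 * through p_u is not also reached through p_v, so a store to p_u[i]
 * cannot change any p_v[k] that a later iteration loads. */
void add_arrays(short *restrict p_u, const short *restrict p_v, int n)
{
    assert(n >= 0);                 /* precondition: non-negative trip count */
    for (int i = 0; i < n; i++) {
        p_u[i] = (short)(p_u[i] + p_v[i]);
    }
}
```

Calling add_arrays with two distinct arrays is valid. Passing overlapping regions breaks the promise and gives undefined behavior, so the keyword should only be used when the caller can guarantee no overlap.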
Luke Postema
DSP Engineer D3 Engineering
www.d3engineering.com
Please see if this app note is helpful. -George
Thanks for the quick reply.
Can I ask what exactly the "restrict" keyword is?
The linear assembly already uses .no_mdep, and the compiler option -mt is used.
And as I wrote, for the first sample code the compiler reports that the loop carried dependency bound _is_ 0,
but for the second sample code it reports 3. Please refer to the attached file.
I cannot figure out how the two instructions highlighted below (annotated "instruction 1" and "instruction 2") are loop dependent,
or why the loop carried dependency bound is 3 when the two instructions take only 2 cycles.
;** --------------------------------------------------------------------------*
; EXCLUSIVE CPU CYCLES: 7
;
; _test: .cproc p_u, N, p_v
; .reg j, u, ref_u, break_flag
; .no_mdep
; loop: .trip 16
MV .L1X N,N' ; |2|
MV .L1X N,j ; |2|
MV .L1X N,j$5 ; |2|
[ j$1] ADD .L1X 0xffffffff,N,j ; |7|
|| ZERO .L2 break_flag ; |13| (P) <0,2>
|| MVC .S2 CSR,B16
|| MV .S1 p_v',p_v ; |2|
MV .L1 j,j$1 ; |7|
|| LDH .D1T2 *+p_v[j],ref_u ; |10| (P) <0,4>
|| MVK .L2 0x1,B1
|| MV .S1 p_u'',p_u ; |2|
|| MV .S2X p_u'',p_u' ; |2|
|| MV .D2 N,j$2 ; |2|
[ j$1] ADD .L1 0xffffffff,j,j ; |7| (P) <1,1> ^ instruction 1
|| AND .L2 -2,B16,B4
|| LDH .D1T1 *+p_u[j],u ; |9| (P) <0,3>
|| MV .S1 j,j$5 ; |7|
|| [!j] MVK .S2 0x1,break_flag ; |15| (P) <0,3>
|| [ j$5] ADD .D2 0xffffffff,N,j$2 ; |7|
MV .L1 j,j$6 ; |7| (P) <1,2> ^ Split a long life(pre-sched) instruction 2
|| ZERO .L2 break_flag ; |13| (P) <1,2>
|| MVC .S2 B4,CSR ; interrupts off
|| [ break_flag] ZERO .D2 B1 ; |15| (P) <0,5>
;* SETUP CODE
;*
;* MVK 0x1,B1
;* MV A1,B8
;* MV A3,B5
;* MV B1,B2
;* MV A1,A0
;* SINGLE SCHEDULED ITERATION
;* $C$C29:
;* 0 NOP 1
;* 1 [ A0] ADD .S1 0xffffffff,A1,A1 ; |7| ^ instruction 1
;* 2 MV .L1 A0,A2 ; |7| Split a long life(pre-sched)
;* || MV .D1 A1,A0 ; |7| ^ Split a long life(pre-sched) instruction 2
;* || ZERO .S2 B0 ; |13|
;* 3 [ A2] ADD .S2 0xffffffff,B8,B8 ; |7| Define a twin register
;* || [ B1] LDH .D1T1 *+A3[A1],A4 ; |9|
;* || [!A1] MVK .D2 0x1,B0 ; |15|
;* 4 [ B1] LDH .D1T2 *+A5[A1],B6 ; |10|
;* 5 [ B0] ZERO .L2 B1 ; |15|
;* 6 MV .L2 B8,B9 ; |7| Split a long life(pre-sched)
;* 7 MV .S2 B9,B7 ; |7| Split a long life(pre-sched)
;* 8 MV .D2 B1,B4 ; |15| Split a long life(pre-sched)
;* || [ B1] B .S1 $C$C29 ; |16|
;* 9 ADD .L1X A4,B6,A6 ; |11|
;* 10 [ B2] STH .D2T1 A6,*+B5[B7] ; |12|
;* || MV .L2 B4,B2 ; |15| Split a long life(pre-sched)
;* 11 NOP 3
;* 14 ; BRANCHCC OCCURS {$C$C29} ; |16|
In C code, indexed addressing (i.e. p_u[i]) is preferred because it is a bit easier for the compiler to know such references cannot overlap, i.e. are not aliases. Such code is often turned into auto-increment addressing (i.e. *A1++) in the generated assembly, because that requires only one register. You should do the same thing, even in linear assembly. Rewrite your linear assembly to use auto-increment addressing instead of indexed addressing, and I think most of your problems will go away.
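To make the two addressing styles concrete, here is a small C sketch of the same loop written both ways. The function names are hypothetical. The second function is only a C model of what auto-increment addressing (*A1++) does in assembly; in actual C source the indexed form remains the preferred one, as described above.

```c
#include <assert.h>

/* Indexed addressing: every access is base + index, so one index
 * register is shared by all loads and stores. */
void add_indexed(short *p_u, const short *p_v, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        p_u[i] = (short)(p_u[i] + p_v[i]);
    }
}

/* Auto-increment addressing, modeled in C: each pointer is advanced as
 * it is used, so no separate index is needed to form the addresses. */
void add_autoinc(short *p_u, const short *p_v, int n)
{
    assert(n >= 0);
    while (n-- > 0) {
        *p_u = (short)(*p_u + *p_v++);  /* load both, add, store back */
        p_u++;                          /* advance the destination pointer */
    }
}
```

Both functions compute the same result. The difference is only in how addresses are formed, which is the part that would change in the linear assembly.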
Thanks and regards,
-George
Hi George,
Thanks for your reply.
I'm tuning the code so that the compiler can do a better job of software pipelining.
My goal is to reduce the LOOP CARRIED DEPENDENCY BOUND to zero,
but sometimes it's hard to find where the loop dependency is: whatever I do, the LOOP CARRIED DEPENDENCY BOUND remains the same.
So I started to suspect that the compiler is not reporting the right thing.
I wrote some very simple test code and found that the compiler's output doesn't seem to make sense.
At least for the code I posted, I cannot see how the instructions marked with (^) are loop dependent.
So can you help me understand how the instructions marked with (^) above are loop dependent?
Thanks.
Shi Tianqi
        ZERO break_flag
  [!j]  MVK 1, break_flag
  [!break_flag] B loop
The output of the ZERO instruction is considered an input to the MVK instruction because the MVK is a conditional write: when its condition is false, the value written by ZERO must survive. Therefore, the ZERO must finish before the MVK starts; this is one cycle.
The output of MVK is read by the branch, so clearly it must come before the branch. This is the second cycle.
The branch reads the value in its first cycle. We must make sure that the next iteration of the loop does not clobber this value before the branch reads it. The compiler considers this a write-after-read hazard between the branch and the ZERO in the next iteration. This is the third cycle, and closes the loop-carried dependence graph.
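The arithmetic in the three steps above can be sketched in C. This is only an illustration under stated assumptions: the edge latencies (one cycle each) are taken from the explanation above, while the struct, the function name cycle_bound, and the formula (total latency around the cycle divided by the number of iterations it spans, rounded up) are the usual textbook recurrence bound for modulo scheduling, not something confirmed to be the compiler's exact computation.

```c
#include <stddef.h>

/* One edge of a dependence cycle (hypothetical model, not compiler data). */
struct dep_edge {
    const char *from;
    const char *to;
    int latency;   /* cycles that must separate "from" and "to" */
    int distance;  /* iterations crossed: 0 = same iteration, 1 = next */
};

/* Bound imposed by one cycle: ceil(total latency / total distance).
 * Returns -1 if the edges do not cross an iteration boundary. */
int cycle_bound(const struct dep_edge *edges, size_t count)
{
    int latency = 0;
    int distance = 0;
    for (size_t i = 0; i < count; i++) {
        latency += edges[i].latency;
        distance += edges[i].distance;
    }
    if (distance <= 0) {
        return -1;  /* not a loop-carried cycle */
    }
    return (latency + distance - 1) / distance;
}

/* The cycle described above, with the one-cycle latencies it states. */
static const struct dep_edge break_flag_cycle[] = {
    { "ZERO break_flag",        "[!j] MVK 1, break_flag",       1, 0 },
    { "[!j] MVK 1, break_flag", "[!break_flag] B loop",         1, 0 },
    { "[!break_flag] B loop",   "ZERO break_flag (next iter.)", 1, 1 },
};
```

Under these assumptions, cycle_bound(break_flag_cycle, 3) evaluates to 3, matching the reported Loop Carried Dependency Bound(^) : 3.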
Hi Archaeologist,
Now I understand: an instruction like j=j-1 is loop dependent, so every instruction that depends on j is on the loop-carried path.
Thanks, all, for the kind replies.