Hello everyone
I plan to compute the dot product for multiple vectors, so I modified the example function DSPF_sp_dotp_cplx. The modified function is below. I want to know whether I am using MUST_ITERATE(48,,48) correctly. That is, will the outer FOR loop be unrolled 48n times? Will the inner FOR loop be unrolled at the same time? If I want the inner FOR loop to be unrolled, how can I do that? I am using compiler version 8.0.4, with the optimization level set to 3.
Thanks
Xining Yu
void DSPF_sp_dotp_cplx_new(const float * x, const float * y,
                           unsigned int nx, unsigned int ny,
                           float * restrict re, float * restrict im)
{
    unsigned int i, j;
    __float2_t x0_im_re, y0_im_re, result0 = 0;
    __float2_t x1_im_re, y1_im_re, result1 = 0;
    __float2_t x2_im_re, y2_im_re, result2 = 0;
    __float2_t x3_im_re, y3_im_re, result3 = 0;
    __float2_t result;

    _nassert(nx % 4 == 0);
    _nassert(nx > 0);
    _nassert((int)x % 8 == 0);
    _nassert((int)y % 8 == 0);

    #pragma MUST_ITERATE(48,,48);
    for (j = 0; j < nx; j += 48)
    {
        for(i = 0; i < 2 * ny; i += 8)
        {
            /* load 4 sets of input data */
            x0_im_re = _amem8_f2((void*)&x[i+j]);
            y0_im_re = _amem8_f2((void*)&y[i]);
            x1_im_re = _amem8_f2((void*)&x[i+2+j]);
            y1_im_re = _amem8_f2((void*)&y[i+2]);
            x2_im_re = _amem8_f2((void*)&x[i+4+j]);
            y2_im_re = _amem8_f2((void*)&y[i+4]);
            x3_im_re = _amem8_f2((void*)&x[i+6+j]);
            y3_im_re = _amem8_f2((void*)&y[i+6]);

            /* calculate 4 running sums */
            result0 = _daddsp(_complex_mpysp(x0_im_re, y0_im_re), result0);
            result1 = _daddsp(_complex_mpysp(x1_im_re, y1_im_re), result1);
            result2 = _daddsp(_complex_mpysp(x2_im_re, y2_im_re), result2);
            result3 = _daddsp(_complex_mpysp(x3_im_re, y3_im_re), result3);
        }

        result = _daddsp(_daddsp(result0,result1),_daddsp(result2,result3));
        result0 = 0;
        result1 = 0;
        result2 = 0;
        result3 = 0;

        *re = -_hif2(result);
        *im = _lof2(result);
        re += 2;
        im += 2;
    }
}
Xining Yu said: Will the outer FOR loop be unrolled 48n times?
No. That's a bad idea. If the loop is unrolled too many times, then too many values are being computed at once. The compilation would take a long time, then it would finally give up and emit a schedule for the loop that is not software pipelined. That means it will perform very poorly.
Xining Yu said: Will the inner FOR loop be unrolled at the same time?
No. A MUST_ITERATE pragma applies only to the next loop, and not any subsequent loops.
Note the inner loop is already manually unrolled 4 times. In my experiments, I did not find any way to improve on that.
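To illustrate the point that MUST_ITERATE applies only to the loop that immediately follows it, here is a minimal sketch in which each loop of a nest gets its own pragma. The function `row_sums` and the `TI_PRAGMA` wrapper macro are hypothetical names introduced for this example; they are not part of DSPLIB or the original post. The wrapper assumes the TI compiler defines `__TI_COMPILER_VERSION__` and accepts the C99 `_Pragma` operator, so that other compilers simply skip the pragmas. Keep in mind that the pragma arguments describe the loop's trip count (the number of iterations), not the value of the loop bound.

```c
#include <assert.h>

/* Hypothetical portability wrapper: emit the pragma only when building
 * with a TI compiler; expand to nothing elsewhere. */
#if defined(__TI_COMPILER_VERSION__)
#define TI_PRAGMA(x) _Pragma(#x)
#else
#define TI_PRAGMA(x)
#endif

/* Hypothetical helper: sums each row of a rows x cols matrix.
 * Assumed contract: rows >= 1, cols >= 4 and cols is a multiple of 4. */
void row_sums(const float *restrict m, float *restrict out,
              unsigned rows, unsigned cols)
{
    unsigned r, c;

    assert(rows >= 1 && cols >= 4 && cols % 4 == 0);

    /* This pragma describes the outer loop only: at least 1 iteration. */
    TI_PRAGMA(MUST_ITERATE(1))
    for (r = 0; r < rows; r++) {
        float acc = 0.0f;

        /* The inner loop needs its own pragma: the trip count is
         * at least 4 and always a multiple of 4. */
        TI_PRAGMA(MUST_ITERATE(4, , 4))
        for (c = 0; c < cols; c++) {
            acc += m[r * cols + c];
        }
        out[r] = acc;
    }
}
```

Without the second pragma, the compiler would have no trip-count information for the inner loop, regardless of what was stated for the outer one.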
You can force the compiler to unroll a loop with the UNROLL pragma; read about it in the C6000 compiler manual. Generally speaking, forcing unrolling is not a good idea, but it is a useful way to experiment. In this case, you can use it to see that unrolling the outer or inner loop does not improve performance.
Start by using #pragma UNROLL(1) on the inner loop, then increase it by multiples of 2. Build with the compiler switch --debug_software_pipeline. After each build, inspect the resulting .asm file. There is a large block comment before the inner loop. Focus on two numbers: the ii and the Loop Unroll Multiple. ii stands for initiation interval. If the Loop Unroll Multiple is not present, presume it is one. You want the ratio ii / Loop Unroll Multiple to be as small as possible. In the experiments I tried, that number never improved as I increased the unrolling of the inner loop.
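As a rough sketch of the experiment described above, the following shows where an UNROLL pragma would sit on a simple loop, plus a small helper for the ii / Loop Unroll Multiple figure. The names `dot_sp`, `cycles_per_source_iteration`, and `TI_PRAGMA` are hypothetical and exist only for this example. The loop is a plain real-valued dot product rather than the complex DSPLIB kernel, and the wrapper assumes the TI compiler defines `__TI_COMPILER_VERSION__` and accepts `_Pragma`.

```c
#include <assert.h>

/* Hypothetical portability wrapper: emit the pragma only when building
 * with a TI compiler; expand to nothing elsewhere. */
#if defined(__TI_COMPILER_VERSION__)
#define TI_PRAGMA(x) _Pragma(#x)
#else
#define TI_PRAGMA(x)
#endif

/* Hypothetical test loop. Assumed contract: n >= 8, n a multiple of 8.
 * Change the UNROLL argument (1, 2, 4, ...) between builds and compare
 * the software pipeline comment block in the generated .asm file. */
float dot_sp(const float *restrict x, const float *restrict y, unsigned n)
{
    unsigned i;
    float acc = 0.0f;

    assert(n >= 8 && n % 8 == 0);

    TI_PRAGMA(MUST_ITERATE(8, , 8))
    TI_PRAGMA(UNROLL(2))
    for (i = 0; i < n; i++) {
        acc += x[i] * y[i];
    }
    return acc;
}

/* The figure to minimize: ii divided by the Loop Unroll Multiple.
 * Pass 0 when the multiple is not reported; it is then treated as 1. */
double cycles_per_source_iteration(unsigned ii, unsigned unroll_multiple)
{
    if (unroll_multiple == 0) {
        unroll_multiple = 1;
    }
    return (double)ii / (double)unroll_multiple;
}
```

For instance, with made-up numbers, an ii of 6 at a Loop Unroll Multiple of 2 works out to 3 cycles per original iteration, which is only an improvement if the build without unrolling reported an ii above 3.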
Thanks and regards,
-George