Compiler very slow - any suggestions?

I am working on some code for a C6678 that does some linear filters, using cl6x 7.4.1 and plain C.

The basic filters actually compile quite fast (several seconds), even with -O3 --opt_for_speed=5, and I'm happy with the results.

One specific file with some decimation filters takes a ludicrous amount of time to compile (434 seconds) for four little functions.

The annoying part of this is that most of the time seems to be spent exhaustively searching for schedules for loops on which the compiler eventually gives up.  I wouldn't mind the long compile times if it were actually succeeding...  I've tried forcing #pragma UNROLL(1) but this doesn't help.

If I build it with -O3 --disable_software_pipeline [1] it takes just 3.5 seconds to compile.  That's perhaps enough to speed up my edit-compile-debug cycle and get me through for now, while I work on optimising other files in the library that are actually more performance-critical.  It does disable pipelining for a handful of loops in the file that can be successfully pipelined, so it's not ideal. 

Is there something else I could do here?

The underlying issue is that the filter coefficients in question almost but don't quite fit in the register file, which is obviously very frustrating for the compiler.  I suspect that I could fix it by (for example) exploiting the symmetry in the filter coefficients, but given this is well tested and working code I feel like that's optimisation work I can't justify just now.
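
For reference, the symmetry idea could look roughly like the sketch below. It assumes an even-length filter with even symmetry (h[k] == h[ntaps-1-k]), which I haven't stated above, so only half the coefficients need to be kept live. The names fir_symmetric_sample, h_half and ntaps are hypothetical and not taken from the actual library.

```c
#include <stddef.h>

/* Hypothetical helper: one output sample of an even-length symmetric FIR.
 * Assumption: h[k] == h[ntaps-1-k], so h_half holds only the first
 * ntaps/2 coefficients. */
double fir_symmetric_sample(const double *restrict pin,
                            const double *restrict h_half,
                            size_t ntaps)
{
    double acc = 0.0;
    size_t k;

    for (k = 0; k < ntaps / 2; k++) {
        /* Taps k and ntaps-1-k share one coefficient, so add the two
         * input samples first and multiply once. */
        acc += (pin[k] + pin[ntaps - 1 - k]) * h_half[k];
    }
    return acc;
}
```

This halves the number of coefficients (16 instead of 32 for the filter in question), which might be enough to relieve the register pressure. Note that the additions happen in a different order than in the direct form, so results can differ in the last few bits.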

Thanks for any tips,

Gordon

  • By the way, the compiler User Guide (spru187u 7.4) incorrectly refers to --disable_software_pipelining rather than --disable_software_pipeline.

  • Gordon Deane said:
    If I build it with -O3 --disable_software_pipeline [1] it takes just 3.5 seconds to compile.  That's perhaps enough to speed up my edit-compile-debug cycle and get me through for now, while I work on optimising other files in the library that are actually more performance-critical.

    I think this is a reasonable way to handle things.  

    When you do get back to this one file, please see if any of the source annotations described in this wiki article are helpful.

    Thanks and regards,

    -George

  • I have a partial solution.  The problem seemed to be several FIR filter loops of this form:

    #define D3_N1 32
    const double D3_h1[D3_N1] = { ... }; /* filter coefficients */
        const double * restrict pin;
        double * restrict pout1;
    ...
        for (n = 0; n < len1; n += 1)
        {
            acc = 0.0;
            #pragma UNROLL(4) 
            for (k = 0; k < D3_N1; k++)
            {
                acc += pin[k] * D3_h1[D3_N1-1-k];
            }
            pin += D3_DECIMATION1;
            pout1[n] = acc;
        }

    Interestingly, the 7.3.1 compiler I used before pipelined the inner loop and produced a better solution without help (i.e. without the #pragma UNROLL(4) line).  7.4.1 seems to fully unroll the inner loop (length 32) and then fails to find any pipelined schedule for the outer loop, because the register pressure of 2*32 coefficients is just too high.  It spends an enormously long time trying, and the resulting performance was disappointing.

    I found that adding the #pragma UNROLL(4) line to force a partial unroll of the inner loop produced much better results.  The inner loop pipelines very well, and the resulting speedup is great enough that not pipelining the outer loop explicitly doesn't matter.
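
    In case it helps anyone else, here is a self-contained sketch of the same loop that should also build with a host compiler.  The wrapper fir_decimate, its parameter names and FIR_NTAPS are hypothetical stand-ins for the real code, and the pragma is guarded with the TI predefined macro __TI_COMPILER_VERSION__ on the assumption that other compilers should simply skip it.

    ```c
    #include <stddef.h>

    #define FIR_NTAPS 32   /* same length as D3_N1 in the post */

    /* Hypothetical wrapper around the decimating FIR loop.
     * pin must hold at least (nout - 1) * decimation + FIR_NTAPS samples. */
    void fir_decimate(const double *restrict pin,
                      double *restrict pout,
                      const double *restrict h,   /* FIR_NTAPS coefficients */
                      size_t nout,
                      size_t decimation)
    {
        size_t n;
        int k;

        for (n = 0; n < nout; n++) {
            double acc = 0.0;
            /* UNROLL is a TI pragma: request a partial unroll by 4 of the
             * inner loop instead of letting the compiler unroll all 32 taps. */
    #ifdef __TI_COMPILER_VERSION__
    #pragma UNROLL(4)
    #endif
            for (k = 0; k < FIR_NTAPS; k++) {
                acc += pin[k] * h[FIR_NTAPS - 1 - k];
            }
            pin += decimation;   /* advance the input by the decimation factor */
            pout[n] = acc;
        }
    }
    ```

    On a host build this is just the plain loop, which is handy for checking results against the DSP build.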

    It's a bit of a shame this had to be done by trial and error, though.  If I hadn't seen better results with the old compiler I might have given up.  Anyway, I'm pleased with the final result.

    Thanks,

    Gordon