What is the difference between pipeline & software pipeline in C674x?



I'd like to know the difference between pipeline & software pipeline.

Is software pipeline a subset of pipeline?

And does the software pipelined loop buffer give more benefit in looping?

I ask because I'm comparing cycle counts between C674 & Xtensa, and I see ~2.5x better performance on C674.

I wonder if this better performance is coming from the software pipelined loop buffer.

Thanks,

Dongho

  • Software pipelining means overlapping different iterations of the same loop. Yes, this is faster than just looping. This lets us take better advantage of the C6000 parallelism. See en.wikipedia.org/.../Software_pipelining

    It's hard to say where the performance gain is coming from without seeing the source code for your application.
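To make "overlapping different iterations of the same loop" concrete, here is a minimal hand-written sketch in C. It is not compiler output and not the poster's code: the function names `scale_plain` and `scale_pipelined` and the three-stage load/multiply/store split are made up for illustration. In the pipelined version, the three statements in the kernel belong to three different iterations and do not depend on each other, which is what a wide-issue machine can exploit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Plain loop: each iteration loads, multiplies, then stores, in order. */
void scale_plain(const int16_t *in, int32_t *out, size_t n, int32_t gain)
{
    for (size_t i = 0; i < n; i++) {
        int32_t x = in[i];      /* stage 1: load     */
        int32_t y = x * gain;   /* stage 2: multiply */
        out[i] = y;             /* stage 3: store    */
    }
}

/* Hand-pipelined version of the same loop (illustrative only). */
void scale_pipelined(const int16_t *in, int32_t *out, size_t n, int32_t gain)
{
    assert(in != NULL && out != NULL);
    if (n < 2) {                /* too short to overlap anything */
        scale_plain(in, out, n, gain);
        return;
    }

    /* Prologue: fill the pipeline. */
    int32_t x = in[0];          /* load     iteration 0 */
    int32_t y = x * gain;       /* multiply iteration 0 */
    x = in[1];                  /* load     iteration 1 */

    /* Kernel: one stage each from three different iterations. */
    size_t i;
    for (i = 0; i + 2 < n; i++) {
        out[i] = y;             /* store    iteration i     */
        y = x * gain;           /* multiply iteration i + 1 */
        x = in[i + 2];          /* load     iteration i + 2 */
    }

    /* Epilogue: drain the pipeline (here i == n - 2). */
    out[i] = y;                 /* store iteration n - 2 */
    out[i + 1] = x * gain;      /* multiply and store iteration n - 1 */
}
```

Both functions produce the same output; the pipelined one pays for the overlap with a prologue, an epilogue, and extra live values, which is the code-size and register cost usually associated with the technique.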
  • Below is the source code. So with software pipelining, does TI run the code below faster than other processors that use a normal pipeline?

    void undo( int16_t * p_buffer )
    {
        int32_t *tmp_ptr;
        uint16_t i;
        uint32_t tmp_S;
        uint32_t tmp_SW;

        tmp_ptr = (int32_t *) p_rx;

        for (i = 0; i < 200; i++)
        {
            //1st data
            tmp_S = (p_buffer[i*3+1]) & 0xE000;
            tmp_ptr[i*2] = (p_buffer[i*3] << 16);
            tmp_ptr[i*2] = (tmp_ptr[i*2] | tmp_S);

            //2nd data
            tmp_ptr[i*2+1] = (p_buffer[i*3+1] << 20);
            tmp_S = ((p_buffer[i*3+2]) & 0xFE00) << 4;
            tmp_ptr[i*2+1] = (tmp_ptr[i*2+1] | tmp_S);

            //3rd data
            tmp_S = ((p_buffer[i*3+1]) & 0x1000) << 3;
            tmp_SW = ((p_buffer[i*3+2] & 0x1FF) << 3);
            p_status[i] = ((tmp_SW) | tmp_S);
        }
    }
  • I read the article at en.wikipedia.org/.../Software_pipelining.
    It says software pipelining and modulo scheduling are different, and that modulo scheduling is currently the most effective technique.
    But when I read sprufe8b, ch. 7, the software pipelined loop buffer uses software pipelining + modulo scheduling.
    So does this mean TI uses software pipelining & modulo scheduling to run the loop?
  • I'm not familiar with the Xtensa architecture, so I can't say anything about that architecture.

    Any CPU that has parallel instructions or a deep pipeline can do some software pipelining, but it's much more effective on a wide-issue, deep pipeline machine like C6000.

    I can say that the C6000 compiler can easily software-pipeline this loop, and you should see roughly 5x the performance of non-software-pipelined code in the steady state.
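For anyone who wants to experiment with the posted loop, here is a self-contained sketch of it in standard C. Several things are assumptions, not taken from the post: `p_rx` and `p_status` are never declared in the posted snippet, so the sketch passes them in as the parameters `out` and `status` with guessed types; the name `undo_sketch` and the `count` parameter are hypothetical. The `restrict` qualifiers promise the compiler that the three buffers do not overlap, which generally makes it easier for a compiler to overlap loads of later iterations with stores of earlier ones. The inputs are widened to unsigned values first, so the shifts never operate on negative numbers.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the posted loop with explicit, non-aliasing buffers.
 * in     : 3 * count packed 16-bit input words (p_buffer in the post)
 * out    : 2 * count unpacked 32-bit words (stands in for p_rx, assumed type)
 * status : count status words (stands in for p_status, assumed type) */
void undo_sketch(const int16_t *restrict in,
                 int32_t *restrict out,
                 uint16_t *restrict status,
                 int count)
{
    assert(count >= 0);

    for (int i = 0; i < count; i++) {
        /* Read each input word once, as an unsigned 16-bit pattern. */
        uint32_t w0 = (uint16_t)in[3 * i];
        uint32_t w1 = (uint16_t)in[3 * i + 1];
        uint32_t w2 = (uint16_t)in[3 * i + 2];

        /* 1st data: w0 in the high half, top 3 bits of w1 below it. */
        out[2 * i] = (int32_t)((w0 << 16) | (w1 & 0xE000u));

        /* 2nd data: low 12 bits of w1 on top, top 7 bits of w2 below. */
        out[2 * i + 1] = (int32_t)((w1 << 20) | ((w2 & 0xFE00u) << 4));

        /* 3rd data: bit 12 of w1 and low 9 bits of w2, both shifted by 3. */
        status[i] = (uint16_t)(((w1 & 0x1000u) << 3) | ((w2 & 0x01FFu) << 3));
    }
}
```

Calling `undo_sketch(p_buffer, out, status, 200)` corresponds to the trip count in the posted code. Each iteration only reads its own three inputs and writes its own three outputs, so nothing is carried from one iteration to the next.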
  • The question was a little ambiguous, and as a result the answer was a little misleading. The ambiguity arises because there is the general term "software pipelining", which denotes overlapping of loop iterations, and there is the software pipelining facility in contemporary TI DSPs (SPLOOP), which lets you express that optimization efficiently in machine language. The keyword is "efficiently": it's possible to use the technique without relying on TI's SPLOOP facility, on TI and non-TI processors alike. It's just that the technique by itself normally requires more registers (which you might not have) and increases code size. For a simple enough algorithm, code that doesn't use SPLOOP can be just as fast, only not as compact and elegant.

    In other words, there are situations where you can achieve the same performance with the software pipelining technique but without SPLOOP. And if it can be done without SPLOOP, then the technique is clearly not limited to TI processors. It then boils down to whether the compiler you use will do it for you. What I'm trying to say is that the reported difference between the results might not be specifically because the target TI processor has the SPLOOP facility, but because the TI compiler uses (or is better at) software pipelining in general.

    If both compilers are equivalent in that respect, other factors can come into play. Software pipelining works only if there are independent instructions that can be overlapped. Put differently, an algorithm can be characterized by its instruction-level parallelism, i.e. how many instructions could execute in parallel. On the other hand, a processor has a limited amount of resources; simply put, it can't execute arbitrarily many operations in parallel. So for algorithms with high instruction-level parallelism, performance is limited by the amount of computational resources, i.e. how many instructions the processor can actually execute in parallel. If another processor can't execute as many instructions in parallel, it will deliver a worse result. But this, again, doesn't necessarily have anything to do with TI's SPLOOP facility...
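The resource limit described above can be put into rough numbers. The sketch below is a deliberately simplified model, not a description of any real compiler or chip: it assumes every operation can go to any execution unit and ignores loop-carried dependencies and memory stalls. The function names and the operation counts in the usage note are made up for illustration.

```c
#include <assert.h>

/* Lower bound on cycles per loop iteration in the steady state when
 * ops_per_iteration independent operations compete for issue_width
 * execution slots per cycle: ceil(ops_per_iteration / issue_width). */
int resource_bound(int ops_per_iteration, int issue_width)
{
    assert(ops_per_iteration > 0 && issue_width > 0);
    return (ops_per_iteration + issue_width - 1) / issue_width;
}

/* Steady-state cycle estimate for a whole loop under the same model
 * (prologue and epilogue are ignored). */
long steady_state_cycles(int ops_per_iteration, int issue_width,
                         long iterations)
{
    assert(iterations >= 0);
    return (long)resource_bound(ops_per_iteration, issue_width) * iterations;
}
```

With hypothetical numbers, a loop body of 12 independent operations needs at least `resource_bound(12, 8)` = 2 cycles per iteration on a machine that issues 8 operations per cycle, and at least `resource_bound(12, 3)` = 4 cycles on one that issues 3. Under this model the wider machine is twice as fast on that loop even if both compilers software-pipeline equally well.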