When does the compiler add 'NOP's?

 Hi,

I find that there are some extra 'NOP's in the disassembly of my C source code, compared with the corresponding generated .asm file. Below the dotted line, I show part of the disassembly. (Because I cannot copy lines from the mixed-mode view of the source file, I modified the .asm listing to match the disassembly.)

I am just curious about the reason for the extra 'NOP's. I know that the program fetch gets 8 bytes at a time. Is the 8-byte fetch length fixed or not? I have not found the answer yet, but I guess it is fixed. Then, some fetches hold one execute packet and some hold more, and some stalls can occur. The disassembly lines are shown as they reach the execute stage, right? In the code below, the last part has 8 instructions executing at the same time. The last 3 'NOP's must have been added by the compiler, because there are no such 'NOP's in the .asm file. So my question is: why are there no such 'NOP's at lines 237-239? Are the other functional units, apart from S1, L1 and L2, busy at that time? Or does it not matter what those functional units do?

My question can be summarized as: if an execute packet has fewer than 8 instructions executing at the same time, when does the compiler add 'NOP's?

 

Thanks a lot.

 ......................................

 L1:    ; PIPED LOOP PROLOG
     234 0000002c 00000410             B       .S1     L2                ; |29| (P) <0,8>
     235 00000030 009C52C6             LDH     .D2T2   *++B7(4),B1       ; |31| (P) <0,0>
     236                   
     237 00000034 008001A9             MVK     .S1     0x3,A1            ; init prolog collapse predicate
     238 00000038 01901059  ||         MV      .L1X    B4,A3
     239 0000003c 0210405A  ||         ADD     .L2     2,B4,B4
     240                   
     241                    ;** --------------------------------------------------------------------------*
     242 00000040           L2:    ; PIPED LOOP KERNEL
     243 00000040           DW$L$_lesson2_c$3$B:
     244                   
     245 00000040 0291307B             ADD     .L2X    B9,A4,B5          ; |31| <0,11>
     246 00000044 20000011  || [ B0]   B       .S1     L2                ; |29| <1,8>
     247 00000048 0498AC83  ||         MPY     .M2     B5,B6,B9          ; |31| <1,8>
     248 0000004c 029C5245  ||         LDH     .D1T1   *++A7(4),A5       ; |31| <3,2>
     249 00000050 029C22C6  ||         LDH     .D2T2   *+B7(2),B5        ; |31| <3,2>
     250                    000000001 ||         NOP
                            000000001 ||         NOP
                            000000000 ||         NOP

 

  • NOPs are placed after an LDH instruction because LDH has 4 delay slots: the half-word read from memory is not yet in the destination register during the cycles that immediately follow the load.  Before the destination register (in this case B5) can be used in further operations, the compiler needs to ensure the result is actually there.

    Details of delay slots associated with instructions can be found in the TMS320C64x/C64x+ DSP CPU and Instruction Set Reference Guide (Rev. J).
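
As a rough illustration of the delay-slot rule, the sketch below counts how many NOP cycles would be needed before a result can be consumed. It is written in Python purely for illustration; the function name `nops_required` and the parameter `useful_cycles_between` are made up here, and the table lists only three instructions with their delay slots as given in the reference guide.

```python
# Delay slots for a few C6000 instructions (see the CPU reference guide).
DELAY_SLOTS = {"LDH": 4, "MPY": 1, "ADD": 0}


def nops_required(producer, useful_cycles_between=0):
    """Return the NOP cycles needed before the result of `producer` is used.

    `useful_cycles_between` is the number of cycles already filled with
    independent work between the producer and its first consumer
    (hypothetical parameter, for illustration only).
    """
    if useful_cycles_between < 0:
        raise ValueError("useful_cycles_between must not be negative")
    return max(0, DELAY_SLOTS[producer] - useful_cycles_between)


print(nops_required("LDH"))      # nothing scheduled in the delay slots
print(nops_required("LDH", 6))   # delay slots fully covered by other work
```

When the scheduler can fill every delay slot with independent instructions, the count drops to zero and no NOP is emitted.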

     

  • Robert W said:

    I am just curious about the reason for the extra 'NOP's. I know that the program fetch gets 8 bytes at a time. Is the 8-byte fetch length fixed or not?

    Actually, the fetch packet is 32 bytes (8 instructions * 4 bytes), and must be aligned to a 32-byte boundary.  The assembler automatically adds padding NOPs when the next execute packet will not fit in the current fetch packet.  The padding NOPs force the next execute packet to start at an aligned boundary.  Padding NOPs can be easily identified because they are in parallel with another instruction, so they do not cost an extra cycle.  Padding NOPs aren't needed as much for C64x+.

    There can be up to 8 parallel instructions in one execute packet.  Depending on the size of each execute packet, there can be up to 8 execute packets in each fetch packet.  Fetch packets are always 32 bytes.
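
The packing rule can be sketched as a small model. This is only an illustration in Python, not the assembler's actual algorithm: the function name `padding_nops` is made up, the model assumes an execute packet may never cross a fetch-packet boundary, and the size of the execute packet that follows the kernel packet (4 here) is an assumption, since the listing does not show it.

```python
FETCH_PACKET_SLOTS = 8  # 8 instructions per 32-byte fetch packet
WORD_BYTES = 4          # each instruction is 4 bytes


def padding_nops(start_address, packet_sizes):
    """Return, per execute packet, how many padding NOPs follow it.

    Simplified model: if the next execute packet does not fit in the
    slots left in the current fetch packet, the current execute packet
    is padded with parallel NOPs up to the fetch-packet boundary.
    """
    slot = (start_address // WORD_BYTES) % FETCH_PACKET_SLOTS
    pads = []
    for i, size in enumerate(packet_sizes):
        if not 1 <= size <= FETCH_PACKET_SLOTS:
            raise ValueError("an execute packet holds 1 to 8 instructions")
        if slot + size > FETCH_PACKET_SLOTS:
            raise ValueError("execute packet does not fit at its start slot")
        slot = (slot + size) % FETCH_PACKET_SLOTS
        next_size = packet_sizes[i + 1] if i + 1 < len(packet_sizes) else 0
        pad = 0
        if slot and slot + next_size > FETCH_PACKET_SLOTS:
            pad = FETCH_PACKET_SLOTS - slot  # fill the rest of the fetch packet
            slot = 0
        pads.append(pad)
    return pads


# Execute packets from the listing, starting at 0x2c: B, LDH, the
# 3-instruction packet at 0x34, the 5-instruction kernel packet at 0x40,
# then an ASSUMED 4-instruction packet that the listing does not show.
example = padding_nops(0x2C, [1, 1, 3, 5, 4])
print(example)
```

Under this model the three-instruction packet at 0x34 ends exactly at the 0x40 boundary, so it needs no padding, while the five-instruction packet at 0x40 leaves three free slots that are filled with NOPs if the following packet is too large to fit.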

    You are showing us an assembly fragment for a software-pipelined loop, which cannot be easily understood in isolation.  In particular, instruction latencies are non-intuitive in a software-pipelined loop.  No NOPs are needed in the actual assembly code of the kernel because the latency is hidden by the modulo schedule.
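
To see how a modulo schedule can hide latency, here is a minimal sketch in Python. It does not reproduce the loop in the listing: the two-instruction loop body, the iteration interval of 1, and the function name `issue_table` are all assumptions chosen for illustration. Each iteration issues a load and, 5 cycles later, an add that consumes the loaded value, and a new iteration starts every `ii` cycles.

```python
def issue_table(ops, ii, iterations):
    """Map absolute cycle -> list of 'NAME[iteration]' issued in that cycle.

    ops: list of (name, cycle_within_iteration) pairs.
    Iteration k starts at absolute cycle k * ii.
    """
    table = {}
    for k in range(iterations):
        for name, offset in ops:
            table.setdefault(k * ii + offset, []).append(f"{name}[{k}]")
    return table


# Hypothetical loop body: the ADD uses the LDH result, so it is scheduled
# 5 cycles after the load (4 delay slots).
OPS = [("LDH", 0), ("ADD", 5)]
table = issue_table(OPS, ii=1, iterations=8)

for cycle in sorted(table):
    print(cycle, " || ".join(table[cycle]))
```

In the steady-state cycles, each cycle issues the add of an older iteration together with the load of a newer one, so every cycle does useful work and no NOP is needed to wait for the load.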