TMS320C6657: c66x SIMD performance

Part Number: TMS320C6657

Hello,

The C66x has SIMD capability, so it can carry out 8 floating-point MAC operations per instruction.

However, it has only two 64-bit data buses from L1 data memory.

Therefore, when the dot product of two arrays is computed, only two MAC operations per cycle can be used because data cannot be loaded into registers any faster, which is the same performance as the C674x.

If so, I think the 8-MAC capability of the C66x's SIMD is not very useful for many common arithmetic operations.

Am I right?
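
To make the question concrete, here is a minimal sketch of the kernel in plain C. The function name `dot_product_ref` and the `op_counts` counters are hypothetical, added only for illustration; this is not TI library code and says nothing about how the compiler schedules the loop. It only shows that every MAC consumes two freshly loaded 32-bit floats, which is the basis of the concern above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical counters for illustration: tally memory loads and MACs. */
typedef struct {
    size_t loads; /* number of 32-bit float loads */
    size_t macs;  /* number of multiply-accumulate operations */
} op_counts;

/* Reference dot product: each iteration loads a[i] and b[i], then does one MAC. */
static float dot_product_ref(const float *a, const float *b, size_t n,
                             op_counts *counts)
{
    assert(a != NULL && b != NULL && counts != NULL);

    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        acc += a[i] * b[i];   /* one MAC */
        counts->loads += 2;   /* two operands fetched from memory */
        counts->macs += 1;
    }
    return acc;
}
```

For an N-point dot product this counts 2N loads against N MACs, so the ratio of loads to MACs is fixed at 2 regardless of how many multipliers are available.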

  • Hi,

    There is a training module that explains the differences between the C674x and C66x ISAs: https://training.ti.com/keystone-i-training-instruction-set-architecture-isa?context=15819-1138812-975

    Regards, Eric

  • Thank you for your reply.

    I have already watched that training video.

    I understand that the performance of the C66x is enhanced for some types of operations, such as complex arithmetic and matrix operations.

    What I am wondering is the performance of other types of arithmetic operations such as the dot-product.

    To clarify, let me ask the following:

    "Is the number of cycles required for the C66x to perform an N-point dot product on the order of (1/8)N?"

    For reference, for the C674x it is (1/2)N.

    If the required number of cycles is the same as for the C674x, then my understanding of the characteristics of the C66x SIMD is right and my curiosity will be satisfied.

    Thanks in advance.

  • Hi,

    For C66x, each side (A and B) has 4 32x32 fixed-point MACs, so it is true that the number of cycles required for the C66x to perform the N-point dot product is on the order of (1/8)N.

    For C674x+, it is 1/2*N.

    See page 9 of the attached slides.

    1243_presentation.pptx

    Regards, Eric

  • Hi,

    Thank you for the quick reply.

    I know that the C66x can do 4 floating-point MACs on each side, for a total of 8 MACs per cycle.

    Therefore, 8 floating-point MACs can be performed if all the data are already in registers.

    (I am not a novice at TI DSPs. I have been using C6000 DSPs since the C6201 was introduced about 20 years ago.)

    For maximum arithmetic performance, all data must already be in registers.

    My point is not the raw MAC capability but the data supply capability.

    Consider the dot product of A[n] and B[n], each of length N.

    If the number of cycles required for the C66x to perform the N-point dot product is on the order of (1/8)N, a total of 16 32-bit floats must be loaded into registers in a single cycle.

    As far as I know, the C66x can load at most 4 floats from L1 memory or cache into registers in a single cycle.

    If so, how can the number of cycles be on the order of (1/8)N?
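
    The argument above can be checked with a short calculation. The sketch below uses a hypothetical helper, `load_bound_cycles` (not a TI API). It assumes all data sits in L1D, ignores loop prologue/epilogue and any register reuse, and simply divides the 2N floats that must be loaded by the number of floats that can be loaded per cycle.

    ```c
    #include <assert.h>
    #include <stddef.h>

    /* Lower bound on cycles for an n-point float dot product when only
     * `floats_per_cycle` 32-bit values can be loaded per cycle.
     * The kernel must load 2*n floats in total (n from each array). */
    static size_t load_bound_cycles(size_t n, size_t floats_per_cycle)
    {
        assert(floats_per_cycle > 0);

        size_t total_loads = 2 * n;
        /* Round up: a partially used cycle still costs a full cycle. */
        return (total_loads + floats_per_cycle - 1) / floats_per_cycle;
    }
    ```

    With 4 floats per cycle the bound is N/2 cycles (512 for N = 1024); reaching N/8 (128 cycles for N = 1024) would require loading 16 floats per cycle.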

    Please don't refer me to slides or videos this time; I already went through them quite a while ago.

    Please explain if my understanding is right or wrong. If wrong, please explain what is wrong.

    Thanks in advance.

  • Hi,

    I read http://www.ti.com/lit/ug/sprugh7/sprugh7.pdf, which states:

    1.2.2 Internal Memory

    The DSP has a 32-bit, byte-addressable address space. L1 memory for each DSP CPU memory is organized in separate data and program spaces, with unified memory for L2 and higher.

    The DSP has a 256-bit read-only port to access internal program memory and two 256-bit ports (read and write) to access internal data memory.

    Regards, Eric

  • Thank you for your reply.

    I read the sentence from the document.

    However, I also found the following figure in the TMS320C66x DSP CorePac User Guide (SPRUGW0C).

    According to the figure, there are two 64-bit buses between the CPU and L1 data memory.

    Which is correct?

  • Sorry, the figure did not paste.

    The figure is "Figure 1-1 C66x CorePac Block Diagram" in 1.1 Introduction.

  • Hi,

    Besides http://www.ti.com/lit/ug/sprugh7/sprugh7.pdf, I also checked Table 3, "Theoretical Bandwidth of Different Memories", in the "Throughput Performance Guide for C66x KeyStone Devices" (attached); it also says 256-bit. I believe the diagram in the "TMS320C66x DSP CorePac User Guide" is inaccurate.

    4452.sprabk5a_throughput.pdf

    Regards, Eric

  • According to Table 3 in the document you referred to, the L1D bandwidth listed is the total bandwidth of the memory.

    What matters is the bandwidth between the DSP core and L1D.

    According to Table 2, the bus between the C66x core and L1D is 128-bit.

    Moreover, Figure 1, "TeraNet and Memory Access Diagram", in the same document shows that the bus between the DSP core and L1D is 128-bit (two 64-bit buses).

    Is this figure inaccurate too?

  • Hi,

    Those figures are copied back and forth. I have asked the design team for clarification. I thought it should be 256-bit; otherwise there is no way to load the data for 8 32x32-bit multiplications in one cycle.

    Regards, Eric

  • TI provides DSP libraries for the C6000 DSPs.

    The function name for single-precision dot product is "DSPF_sp_dotprod( )".

    According to the DSPLIB User's Manual (C674x), the cycle benchmark of the function is (1/2)N+58.

    However, I could not find the cycle benchmark of the same function in the DSPLIB User's Manual (C66x).

    What is the cycle benchmark of "DSPF_sp_dotprod( )" provided in the C66x DSPLIB?

  • Please open a new E2E thread for the question: What is the cycle benchmark of "DSPF_sp_dotprod( )" provided in the C66x DSPLIB?

    Regards, Eric

  • Hi,

    For the bus width between the C66x CPU core and L1D, here is what I got from the design team:

    There are 4 ports between CPU <-> L1D:

    •  2 read ports, each port is 64-bit wide
    •  2 write ports, each port is 64-bit wide.

    However, only 2 ports can be active in the same cycle, i.e., you can have 2 64-bit loads, 2 64-bit stores, or 1 64-bit load + 1 64-bit store. So the total bus width is 256 bits, but the bandwidth is 128 bits/cycle.

    That's correct. The 8 32x32 MACs listed in the training table are the theoretical maximum, which only a few special instructions such as CMATMPY, CCMATMPY, etc. can achieve. For regular 32-bit multiplication, only 4 32x32 multiplications can be achieved per CPU cycle.
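
    The port rule from the design team can be written down as a small model. This is only a sketch of the rule as stated in this reply; the names `l1d_access_is_legal` and `l1d_bits_per_cycle` are hypothetical and do not correspond to any TI API or tool.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    enum {
        PORT_WIDTH_BITS  = 64, /* each read or write port is 64 bits wide */
        MAX_READ_PORTS   = 2,
        MAX_WRITE_PORTS  = 2,
        MAX_ACTIVE_PORTS = 2   /* only two ports may be active in one cycle */
    };

    /* True if `loads` 64-bit loads and `stores` 64-bit stores can issue
     * in the same cycle under the rule above. */
    static bool l1d_access_is_legal(int loads, int stores)
    {
        if (loads < 0 || stores < 0)
            return false;
        if (loads > MAX_READ_PORTS || stores > MAX_WRITE_PORTS)
            return false;
        return loads + stores <= MAX_ACTIVE_PORTS;
    }

    /* Bits moved between the CPU and L1D in one cycle for a legal combination. */
    static int l1d_bits_per_cycle(int loads, int stores)
    {
        assert(l1d_access_is_legal(loads, stores));
        return (loads + stores) * PORT_WIDTH_BITS;
    }
    ```

    Under this model, (2 loads, 0 stores), (0, 2) and (1, 1) are legal and each moves 128 bits per cycle, while (2, 1) is rejected even though the four ports together are 256 bits wide.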

    Regards, Eric