This thread has been locked.

66AK2H12: Does the LINALG library support a 2-byte data format for matrix functions?

Part Number: 66AK2H12

On the EVMK2H12, I used "cblas_sgemm" from the Processor SDK to compute large matrix products. I know the interface is optimized with OpenCL across the 8 DSP cores, but the result does not meet my performance requirement.

I've checked the performance for M=N=K=1000: it is 0.027 s, the same as TI reports (www.ti.com/.../linear-algebra-libraries.page).

However, for shapes such as M=10, N=200,000, K=30, it takes 0.277 s. Maybe this kind of calculation has already reached the performance limit of the DSP cores. If so, I think it could still be improved if the input data format were short or fp16 (half-precision floating point), which is 2 bytes long instead of 4 bytes like float.

Does anyone know whether the SDK supports sgemm with a 2-byte data format? I didn't find anything in the SDK manual or in cblas.h. If not, does TI have any plans in this area?

BTW, I have used the fp16 data format for matrix calculations on an NVIDIA platform, and it improves performance significantly (NVIDIA supports fp16 in its CUDA libraries).

Thanks very much for any help on this topic!

  • Hi Hao Yang,

    I've forwarded this to the experts. Their feedback should be posted here.

    BR

    Tsvetolin Shulev

  • Still no feedback here?

  • Hi Hao

This is a very interesting question. I will answer for the DSP and then say something about the ARM.

The DSP functional units do not have hardware 16-bit floating-point support, so 32-bit floating point will execute much faster than 16-bit floating point (16-bit floating point would have to be emulated in software).

It was unclear from your post what your observation is when you run a large matrix operation: are you concerned with the execution time or with the accuracy of the results? The only case where your idea can help is if the bottleneck is the I/O. In that case it is conceivable to move the data as 16-bit floating point and convert it to 32-bit before the actual arithmetic operations (which are much faster in 32-bit).
Does that make sense to you?

About the ARM: I know that the floating-point unit in the A15 supports 16-bit half-precision floating point in hardware, but I am not sure there is TI software that exercises any 16-bit function. I suspect that such code could be developed.

    Ran
  • Hi, Ran

    Thanks for your explanation on FP16 topic.

Actually, what I am concerned about is the execution time; I want a shorter run time for large matrix operations. I had better think about other solutions, since TI does not support 16-bit floating point in hardware.

But I think hardware FP16 support is a direction worth considering for TI in the future, to improve application performance in cases where the precision of the matrix operation is not very strict.

I understand and appreciate what you say. Thank you for your suggestion.

    Best Regards

    Ran