Does TI support a 2-byte data format for matrix functions?

Hi, all
On the EVMK2H12, I am using "cblas_sgemm" from the Processor SDK to multiply large matrices. I know the routine has been optimized with OpenCL to run on the 8 DSP cores, but the performance still does not meet my requirements.

I checked the performance for M=N=K=1000: it takes 0.027s, the same as TI reports (

However, when the dimensions are large, such as M=10, N=200,000, K=30, it takes 0.277s. Maybe this kind of calculation has already reached the performance limit of the DSP cores. So I think it could be improved if the input data format were short or fp16 (half-precision floating point), which is 2 bytes per element instead of 4 bytes for float.
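
As a back-of-the-envelope check on the two timings above, here is a small sketch in plain C. The helper names are made up for illustration and are not part of any TI SDK. It assumes each matrix is moved through memory exactly once, which ignores caching and any OpenCL buffer transfer overhead.

```c
/* Rough cost model for C = A * B, with A (M x K), B (K x N), C (M x N).
 * All helper names here are hypothetical, for illustration only. */

/* One multiply and one add per inner-product term. */
static double gemm_flops(double m, double n, double k)
{
    return 2.0 * m * n * k;
}

/* Bytes moved if A and B are each read once and C is written once.
 * elem_size is 4 for float, 2 for short or fp16. */
static double gemm_bytes(double m, double n, double k, double elem_size)
{
    return (m * k + k * n + m * n) * elem_size;
}

/* Arithmetic intensity in flops per byte: low values suggest the
 * operation is limited by data movement rather than by compute. */
static double gemm_intensity(double m, double n, double k, double elem_size)
{
    return gemm_flops(m, n, k) / gemm_bytes(m, n, k, elem_size);
}
```

Plugging in the numbers: M=N=K=1000 gives 2e9 flops over 12 MB of float data (about 167 flops per byte), while M=10, N=200,000, K=30 gives 1.2e8 flops over about 32 MB (about 3.75 flops per byte). So the second case does roughly 17x less arithmetic but takes about 10x longer, which is consistent with it being limited by data movement or call overhead rather than by compute. With 2-byte elements the byte count under this model would halve to about 16 MB.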

Does anyone know whether the SDK supports sgemm with a 2-byte data format? I didn't find anything when searching the SDK manual and cblas.h. If not, does TI have any plans for this?
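
To make the "short" variant of the question concrete, here is a rough reference sketch of a matrix multiply with 16-bit inputs. This is plain, unoptimized C and not an existing SDK routine: the names `quantize_q` and `gemm_i16` and the fixed-point scaling scheme are assumptions chosen only for illustration.

```c
#include <stdint.h>

/* Hypothetical helper: convert a float to a 16-bit fixed-point value
 * using a caller-chosen scale, rounding to nearest and saturating. */
static int16_t quantize_q(float x, float scale)
{
    float v = x * scale + (x >= 0.0f ? 0.5f : -0.5f);
    if (v > 32767.0f)  v = 32767.0f;
    if (v < -32768.0f) v = -32768.0f;
    return (int16_t)v;  /* the cast truncates toward zero */
}

/* Hypothetical reference kernel: row-major C = A * B, where A (m x k)
 * and B (k x n) are stored as 2-byte values. Products are accumulated
 * in 64 bits and scaled back to float at the end. */
static void gemm_i16(int m, int n, int k,
                     const int16_t *a, float scale_a,
                     const int16_t *b, float scale_b,
                     float *c)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            int64_t acc = 0;
            for (int p = 0; p < k; ++p) {
                acc += (int32_t)a[i * k + p] * (int32_t)b[p * n + j];
            }
            c[i * n + j] = (float)acc / (scale_a * scale_b);
        }
    }
}
```

The inputs take half the storage of float, at the cost of precision and a limited value range set by the scale. Whether such a layout would actually run faster on the DSP cores depends on the library implementation, which this sketch does not model.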

BTW, I have used the fp16 data format for matrix calculations on the NVIDIA platform, and it improved performance significantly (NVIDIA supports fp16 in its CUDA libraries).

Thanks very much for any help on this topic!