Does TI support a 2-byte data format for matrix functions?

hao yang95

Hi all,
On the EVMK2H12, I used "cblas_sgemm" from the Processor SDK to multiply large matrices. I know the interface has been optimized with OpenCL to run on the 8 DSP cores, but the result does not meet my performance requirement.

I checked the performance for M=N=K=1000: it takes 0.027s, the same as TI reports (www.ti.com/.../linear-algebra-libraries.page).

However, for a shape such as M=10, N=200,000, K=30, it takes 0.277s. Maybe this kind of calculation has reached the performance limit of the DSP cores. So I think it could be improved if the input data format were short or fp16 (half-precision floating point), which is 2 bytes long instead of 4 bytes like float.
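As a rough sketch of that reasoning, the arithmetic below counts the bytes held by A (M x K), B (K x N) and C (M x N) at 4 bytes and at 2 bytes per element. The helper names `gemm_bytes` and `print_footprint` are made up for illustration. Halving the footprint only helps if the kernel is limited by data movement and the hardware can multiply the narrower type.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: total bytes occupied by A (m x k), B (k x n) and
 * C (m x n) when every element takes elem_size bytes. */
static size_t gemm_bytes(size_t m, size_t n, size_t k, size_t elem_size)
{
    assert(elem_size > 0);
    return (m * k + k * n + m * n) * elem_size;
}

/* Print the footprint for 4-byte and 2-byte elements side by side. */
static void print_footprint(size_t m, size_t n, size_t k)
{
    printf("M=%zu N=%zu K=%zu: %zu bytes as float, %zu bytes as 2-byte data\n",
           m, n, k, gemm_bytes(m, n, k, 4), gemm_bytes(m, n, k, 2));
}
```

Calling `print_footprint(10, 200000, 30)` gives roughly 32 MB for float against roughly 16 MB for a 2-byte type, almost all of it in B and C.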

Does anyone know whether the SDK supports sgemm with a 2-byte data format? I didn't find anything by searching the SDK manual and cblas.h. If not, does TI have any plans in this area?

BTW, I have used the fp16 data format for matrix calculation on the NVIDIA platform, where it improves performance significantly (NVIDIA supports fp16 in the CUDA libraries).

Thanks very much for any help on this topic!

over 7 years ago

0 Yuan Zhao over 7 years ago

TI__Expert 3705 points

Hi Hao,

I see that you got answers about the half data type on other forums. The C66 DSP on the K2H EVM does not have half-precision multiplication support.

I want to comment on the performance numbers that you got for your particular problem size (M=10, N=200,000, K=30). Have you tried forcing the computation to be dispatched to the DSP? See the "TI_CBLAS_OFFLOAD" environment variable on this wiki page:

I tested your problem size with the sgemm example shipped with our OpenCL product (/usr/share/ti/examples/opencl/sgemm) and got about 0.086s. That's why I want to check whether your particular matrix size is actually dispatched to the DSP side. I know CBLAS has tuned parameters that determine whether the computation stays on the ARM or is dispatched to the DSP. Maybe it didn't make the best decision for your matrix size.

sgemm# ./sgemm -M 10 -K 30 -N 200000
C[10,200000] = alpha * A[10,30] * B[30,200000] + beta * C[10,200000], use col-major storage
alpha=1.000000, beta=0.000000

Generating Input Data ...Complete
   8 DSPs: 1.393 Gflops (0.086138 s) 
   1 CPU : 1.159 Gflops (0.103579 s) with ATLAS library
PASS!
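The Gflops figures in this log are consistent with the usual operation count for a matrix multiply, 2*M*N*K (K multiplies and K adds for each of the M*N outputs), divided by the elapsed time. A minimal sketch of that arithmetic, with `gemm_gflops` as a hypothetical helper name:

```c
#include <assert.h>

/* Gflops for an M x K by K x N multiply that took `seconds`:
 * 2*M*N*K floating-point operations, scaled to 1e9 operations per second. */
static double gemm_gflops(double m, double n, double k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * m * n * k / seconds / 1e9;
}
```

For M=10, N=200000, K=30 this is 120 million operations, so 0.086138 s works out to about 1.393 Gflops and 0.103579 s to about 1.159 Gflops, matching the two lines printed above.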

If you want to run sgemm example in our shipped OpenCL product installation, you need to apply the following patch and rebuild with "make clean; make".

--- a/sgemm/sgemm.c
+++ b/sgemm/sgemm.c
@@ -74,7 +74,7 @@ void sgemm(
     int kCntPrev, nCntPrev, mCntPrev;
 #endif
     int innerIndex_m, innerIndex_n;
-    int flagLastK, flagLastM, flagLastN;
+    int flagLastK, flagLastM, flagLastN, flagLastMXfers, flagLastNXfers;
     float * restrict ptrA, * restrict ptrB, * restrict ptrC;
     float * restrict ptrASeg1, * restrict ptrASeg2;
     float * restrict ptrBSeg1, * restrict ptrBSeg2;
@@ -185,8 +185,11 @@ void sgemm(
         {
             mCnt = ((m-mIndex) < MPARTITION) ? (m-mIndex) : MPARTITION;
             flagLastM = ((mIndex+MPARTITION)<m) ? 0 : 1;
+            flagLastMXfers = ((mIndex+2*MPARTITION)<m) ? 0 : 1;
             mCntNext = ((m-mIndex-MPARTITION) < MPARTITION) ?
                        (m-mIndex-MPARTITION) : MPARTITION;
+            mCntNext = (mCntNext <= 0) ? (m < MPARTITION ? m : MPARTITION)
+                                       :  mCntNext;
             if(flagLastM) mCntNext = (m < MPARTITION) ? m : MPARTITION;
 
             // bring in A into MSMC SRAM (a new parallel transfer)
@@ -215,7 +218,7 @@ void sgemm(
             {
                 if ((!flagLastM) || (!flagLastK))
                 {
-                    if (mIndex == 0)
+                    if (mIndex == 0 || flagLastMXfers)
                     {
 #if USE_EDMA
                         EdmaMgr_copy2D2DSep(chan0,
@@ -280,7 +283,10 @@ void sgemm(
             {
                 nCnt = ((n-nIndex) < NPARTITION) ? (n-nIndex) : NPARTITION;
                 nCntNext = ((n-nIndex-NPARTITION) < NPARTITION) ? (n-nIndex-NPARTITION) : NPARTITION;
+                nCntNext = (nCntNext <= 0) ? (n < NPARTITION ? n : NPARTITION)
+                                           : nCntNext;
                 flagLastN = ((nIndex+NPARTITION)<n) ? 0 : 1;
+                flagLastNXfers = ((nIndex+2*NPARTITION)<n) ? 0 : 1;
                 if(flagLastN) nCntNext = (n < NPARTITION) ? n : NPARTITION;
 
                 // bring in B into L1 SRAM (a new parallel transfer)
@@ -297,7 +303,7 @@ void sgemm(
                     nXferIndex = (!flagLastN) ? nXferIndex: kIndex;
                     nXferIndex = ((!flagLastN) || (!flagLastM)) ? nXferIndex: (kIndex+kCnt);
                     ptrB = (indexBNext == 0) ? ptrBSeg1: ptrBSeg2;
-                    if (nIndex == 0)
+                    if (nIndex == 0 || flagLastNXfers)
                     {
 #if USE_EDMA
                         EdmaMgr_copy2D2DSep(chan1,

0 hao yang95 over 7 years ago in reply to Yuan Zhao

Intellectual 500 points

Hi Yuan,

Thanks so much for your effort on this topic!

I want to apply the patch you showed me, but what you pasted in your post does not seem to be the complete code. Could you please tell me where I can download the complete sgemm patch?

0 Yuan Zhao over 7 years ago in reply to hao yang95

TI__Expert 3705 points

Hi Hao,

What I enclosed is the complete diff (generated by git). I'll attach the patch as a file to this post as well. 3566.sgemm.diff

I tried it on my local EVM and it works. Which Processor SDK are you using? You can install the latest here:

Or you can download OpenCL examples from here:

Here are the steps for applying the patch to the installed OpenCL examples:

### save patch as ~/sgemm.diff
~# cd /usr/share/ti/examples/opencl
~# git apply ~/sgemm.diff
~# cd sgemm
~# make clean; make
~# ./sgemm -M 10 -K 30 -N 200000

0 hao yang95 over 7 years ago in reply to Yuan Zhao

Intellectual 500 points

Hi Yuan,
Sorry for the late reply.
I've checked sgemm from opencl_examples with your patch. The result is close to yours (1.326 Gflops, 0.090784s at M=16, K=27, N=200704).
However, when I test cblas_sgemm from the LINALG lib, it takes more time than sgemm: time=0.196028s, gflops=0.884610. I'd like to use cblas_sgemm in my project because it is easier to integrate. So could you please help check why cblas_sgemm is not as fast as sgemm?
Thanks for your kind support!!

0 Yuan Zhao over 7 years ago in reply to hao yang95

TI__Expert 3705 points

Hi Hao,

I do not know the details of cblas implementation. I have forwarded your message to the cblas team.

Could you call cblas_sgemm() a few times in the same program and measure the time for each call? It is very possible that your first cblas call also included the time to load the program onto the DSP. If that is the case, you should skip the first measurement and only use the subsequent measurements for comparison. Please see "/usr/share/ti/examples/opencl/null" for an explanation.

- Yuan

0 Jianzhong Xu over 7 years ago in reply to Yuan Zhao

TI__Mastermind 40215 points

Hi Hao,

I did some experiments with cblas_sgemm on the K2H EVM. Below is what I got:

1. force sgemm to run on DSP (TI_CBLAS_OFFLOAD=001): 0.22 second

2. let sgemm choose by itself where to run (TI_CBLAS_OFFLOAD=002): 0.096 second

3. force sgemm to run on ARM (TI_CBLAS_OFFLOAD=000): 0.095 second. This means sgemm chooses to run on ARM when TI_CBLAS_OFFLOAD=002.

4. let sgemm choose by itself where to run and configure 3 ARM cores to run BLAS (BLIS_IC_NT=3): 0.048 second.

So the optimum way to run cblas_sgemm for (M=16, K=27, N=200704) is to run it on ARM with 3 cores, if the ARM is available. Usually we don't configure 4 ARM cores for BLAS since 1 ARM core is generally always used by the system.
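The four settings above can also be applied from inside a test program instead of the shell. The sketch below is an assumption-laden illustration: `configure_blas_env` and the `offload_mode` enum are hypothetical names, and it assumes the library reads TI_CBLAS_OFFLOAD and BLIS_IC_NT during its first CBLAS call, which this thread does not confirm (here they are only ever exported in the shell).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L   /* for setenv() */
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Offload modes as described in the list above:
 * 000 = force ARM, 001 = force DSP, 002 = let the library choose. */
enum offload_mode { OFFLOAD_ARM = 0, OFFLOAD_DSP = 1, OFFLOAD_AUTO = 2 };

/* Hypothetical helper: set TI_CBLAS_OFFLOAD and, when arm_threads > 0,
 * BLIS_IC_NT for the current process. Returns 0 on success, -1 on failure.
 * Assumption: must run before the first CBLAS call to have any effect. */
static int configure_blas_env(enum offload_mode mode, int arm_threads)
{
    static const char *const values[] = { "000", "001", "002" };
    char threads[16];

    assert((int)mode >= 0 && (int)mode <= 2);
    if (setenv("TI_CBLAS_OFFLOAD", values[mode], 1) != 0)
        return -1;
    if (arm_threads > 0) {
        snprintf(threads, sizeof threads, "%d", arm_threads);
        if (setenv("BLIS_IC_NT", threads, 1) != 0)
            return -1;
    }
    return 0;
}
```

For example, `configure_blas_env(OFFLOAD_AUTO, 3)` corresponds to experiment 4, and `configure_blas_env(OFFLOAD_DSP, 0)` to experiment 1 without touching BLIS_IC_NT.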

The reason why cblas_sgemm in LINALG is slower than the sgemm in the OpenCL example when running on DSP may be that LINALG is built on top of BLIS, which may not have optimum performance for extremely rectangular matrices.

Regards,

Jianzhong

0 hao yang95 over 7 years ago in reply to Jianzhong Xu

Intellectual 500 points

@Yuan:
Yes. Your idea is correct, as my test below confirms. Thanks so much for your support!!

@jianzhong:
Thanks for your effort on this topic.
My test results differ from yours for most of the items you listed, except the first one. I summarized them here:

[Case1]TI_CBLAS_OFFLOAD=001, avg:0.19s
[Case2]TI_CBLAS_OFFLOAD=002, avg:0.19s
[Case3]TI_CBLAS_OFFLOAD=002, BLIS_IC_NT=3, avg:0.18s
[Case4]TI_CBLAS_OFFLOAD=000, BLIS_IC_NT=3, avg:0.17s

1. Take case 2 as an example: I have no idea why it (0.19s) is so much worse than your result (0.096s). I have attached the execution log and source code at the end of this post. Could you please check them for any incorrect code or options?
BTW, to make sure the system is clean for each test, I rebooted the K2H12 before launching each test case.

2. As you said, "LINALG is built on top of BLIS, which may not have optimum performance for extremely rectangular matrix." Does the LINALG team have any plan to optimize BLIS?

[Log]
[Case1]TI_CBLAS_OFFLOAD=001

k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=001
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.933559   gflops=0.185750   //The first call
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.191916   gflops=0.903565
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.199393   gflops=0.869683

---------------------------------------------------------------

[Case2]TI_CBLAS_OFFLOAD=002

k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=002
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.936287   gflops=0.185209
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.198556   gflops=0.873347
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.202608   gflops=0.855882
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.198693   gflops=0.872745
Passed.

---------------------------------------------------------------

[Case3]TI_CBLAS_OFFLOAD=002, BLIS_IC_NT=3

k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=000
k2hk-evm:/home/sgemm# export BLIS_IC_NT=3
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.918656   gflops=0.188763
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.185593   gflops=0.934347
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.177238   gflops=0.978394
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.287727   gflops=0.602683
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.184236   gflops=0.941229
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.176801   gflops=0.980812
Passed.

---------------------------------------------------------------

[Case4]TI_CBLAS_OFFLOAD=000, BLIS_IC_NT=3

k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=000
k2hk-evm:/home/sgemm# export BLIS_IC_NT=3
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.897856   gflops=0.193136
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.172612   gflops=1.004615
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.172352   gflops=1.006131
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.175291   gflops=0.989258
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.169396   gflops=1.023685
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test

    16   200704        27   time=0.176328   gflops=0.983443
Passed.

---------------------------------------------------------------
---------------------------------------------------------------

[Code]
8103.sgemm_test.c

[Command]
arm-linux-gnueabihf-gcc -c -I/opt/ti-processor-sdk-linux-k2hk-evm-03.00.00.04/linux-devkit/sysroots/cortexa15hf-neon-linux-gnueabi/usr/include -I/opt/ti-processor-sdk-linux-k2hk-evm-03.00.00.04/linux-devkit/sysroots/cortexa15hf-neon-linux-gnueabi/usr/share/ti/ti-linalg-tree/packages/ti/linalg -O3 sgemm_test.c

0 Jianzhong Xu over 7 years ago in reply to hao yang95

TI__Mastermind 40215 points

Hi Hao,

Thanks for sharing your code. TI CBLAS needs to be set up and initialized during the first call in an application. Your test makes a single call to CBLAS and then exits, so in your experiments CBLAS does its initialization every time.

I recommend trying the following:

if (specified_mnk)
{
    M = 16;
    N = 200704;
    K = 27;
    run_sgemm(M, N, K, &time_secs, &gflops); // DON'T COUNT THIS ONE FOR MEASUREMENT

    run_sgemm(M, N, K, &time_secs, &gflops); // USE THIS ONE FOR MEASUREMENT
    printf("\n%6d\t%6d\t%6d\ttime=%f\tgflops=%f\n", M, N, K, time_secs, gflops);
}

What's even better is to call run_sgemm in a loop to get an averaged measurement.
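A minimal sketch of such a loop, assuming a POSIX system with `clock_gettime`. The names `kernel_fn`, `average_seconds` and `count_calls` are hypothetical; in a real test the kernel would be a small wrapper around the cblas_sgemm call, and `count_calls` is only a stand-in so the harness can be exercised on its own.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() */
#endif
#include <assert.h>
#include <time.h>

/* Signature of the code under test; ctx carries whatever it needs. */
typedef void (*kernel_fn)(void *ctx);

/* Current monotonic time in seconds. */
static double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Call the kernel once untimed so one-time initialization is excluded,
 * then time `runs` further calls and return the mean seconds per call. */
static double average_seconds(kernel_fn kernel, void *ctx, int runs)
{
    double total = 0.0;

    assert(kernel != NULL && runs > 0);
    kernel(ctx);                       /* warm-up call, not measured */
    for (int i = 0; i < runs; ++i) {
        double start = now_seconds();
        kernel(ctx);
        total += now_seconds() - start;
    }
    return total / runs;
}

/* Stand-in kernel for trying the harness: counts how often it is called. */
static void count_calls(void *ctx)
{
    ++*(int *)ctx;
}
```

With `int calls = 0;`, a call to `average_seconds(count_calls, &calls, 5)` invokes the kernel six times in total: one discarded warm-up and five timed runs.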

Regards,
Jianzhong

0 hao yang95 over 7 years ago in reply to Jianzhong Xu

Intellectual 500 points

Ah, I see.
Thanks for your comments.
I tested it again, and the interesting thing is that on my side sgemm chooses the DSP when TI_CBLAS_OFFLOAD=002, yet the performance is better on ARM than on DSP. In other words, sgemm didn't choose the better place to run... Is this a problem?

Here is my log:

k2hk-evm:/home/mcw/opencl_example/sgemm# export TI_CBLAS_OFFLOAD=002
k2hk-evm:/home/mcw/opencl_example/sgemm# ./sgemm_test

16 200704 27 time=0.195576 gflops=0.886654

16 200704 27 time=0.062049 gflops=2.794703

16 200704 27 time=0.062210 gflops=2.787481
Passed.
k2hk-evm:/home/mcw/opencl_example/sgemm# export TI_CBLAS_OFFLOAD=001
k2hk-evm:/home/mcw/opencl_example/sgemm# ./sgemm_test

16 200704 27 time=0.197267 gflops=0.879053

16 200704 27 time=0.062037 gflops=2.795220

16 200704 27 time=0.062194 gflops=2.788201
Passed.
k2hk-evm:/home/mcw/opencl_example/sgemm# export TI_CBLAS_OFFLOAD=000
k2hk-evm:/home/mcw/opencl_example/sgemm# ./sgemm_test

16 200704 27 time=0.178331 gflops=0.972398

16 200704 27 time=0.041344 gflops=4.194293

16 200704 27 time=0.041137 gflops=4.215423
Passed.

0 Jianzhong Xu over 7 years ago in reply to hao yang95

TI__Mastermind 40215 points

The reasons could be:
1. The device you're using may have a different CPU speed from the device that was used to tune BLAS. Please follow this link for more information:
processors.wiki.ti.com/.../Processor_SDK_Linear_Algebra_Library

2. For this specific matrix size, running on DSP and running on ARM (3 cores) do not differ much in speed. Execution time on ARM can be affected by other running tasks. In addition, tuning was performed on matrix sizes that are powers of 2. So the closest size that was tuned is actually (16, 262144, 32), which may run faster on DSP. Given all these factors, sgemm may not be running at optimum speed.
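To illustrate where (16, 262144, 32) comes from: rounding each dimension of (16, 200704, 27) up to the next power of 2 reproduces that triple. The sketch below shows only this rounding. `next_pow2` is a hypothetical helper, and how LINALG actually maps a problem size to a tuned entry is not shown in this thread.

```c
#include <assert.h>
#include <limits.h>

/* Smallest power of two that is >= x, for 1 <= x <= ULONG_MAX/2 + 1. */
static unsigned long next_pow2(unsigned long x)
{
    unsigned long p = 1;

    assert(x >= 1 && x <= (ULONG_MAX >> 1) + 1);
    while (p < x)
        p <<= 1;   /* double until p reaches or passes x */
    return p;
}
```

Here `next_pow2(16)` is 16, `next_pow2(200704)` is 262144 and `next_pow2(27)` is 32, so the tuned problem is about 1.5 times larger in M*N*K than the one actually being run.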

Hope this makes sense.

Regards,
Jianzhong

0 Jianzhong Xu over 7 years ago in reply to Jianzhong Xu

TI__Mastermind 40215 points

Sorry, one clarification: Given all these factors, sgemm may not be running with optimum speed for this specific matrix size.

0 hao yang95 over 7 years ago in reply to Jianzhong Xu

Intellectual 500 points

Thanks for your information, jianzhong!

Everything is clear on my side now.
