
Does TI support 2-bytes data format on matrix function?

Hi, all
On EVMK2H12, I used "cblas_sgemm" from the Processor SDK to calculate large matrices. I know the interface has been optimized based on OpenCL across the 8 DSP cores, but the result does not meet my performance requirement.

I've checked the performance for M=N=K=1000: it is 0.027s, the same as TI reported (www.ti.com/.../linear-algebra-libraries.page)

However, when the dimensions are large, such as M=10, N=200,000, K=30, it takes 0.277s. Maybe this kind of calculation has already reached the performance limit of the DSP cores. If so, I think it could still be improved if the input data format were short or fp16 (half-precision floating point), which is 2 bytes long instead of 4 bytes like float.

Does anyone know if the SDK supports sgemm with a 2-byte data format? I didn't find anything by searching the SDK manual and cblas.h. If not, does TI have any plans on this topic?

BTW, I have used the fp16 data format for matrix calculation on an NVIDIA platform, and it improves performance significantly (NVIDIA supports fp16 in its CUDA libraries).

Thanks very much for any help on this topic!

  • Hi Hao,

    I see that you got answers about the half data type on other forums. The C66 DSP on the K2H EVM does not have half-precision multiplication support.

    I want to comment on the performance numbers that you got for your particular problem size (M=10, N=200,000, K=30). Have you tried forcing the computation to be dispatched to the DSP? See the "TI_CBLAS_OFFLOAD" environment variable on this wiki page:


    I tested your problem size with the sgemm example shipped with the OpenCL product (/usr/share/ti/examples/opencl/sgemm) and got about 0.086s. That's why I want to see whether your particular matrix size is actually dispatched to the DSP side. CBLAS uses tuned parameters to decide whether the computation should stay on the ARM or be dispatched to the DSP; maybe it didn't make the best decision for your matrix size.

    sgemm# ./sgemm -M 10 -K 30 -N 200000
    C[10,200000] = alpha * A[10,30] * B[30,200000] + beta * C[10,200000], use col-major storage
    alpha=1.000000, beta=0.000000
    
    Generating Input Data ...Complete
       8 DSPs: 1.393 Gflops (0.086138 s) 
       1 CPU : 1.159 Gflops (0.103579 s) with ATLAS library
    PASS!
    


    If you want to run the sgemm example in the shipped OpenCL product installation, you need to apply the following patch and rebuild with "make clean; make".

    --- a/sgemm/sgemm.c
    +++ b/sgemm/sgemm.c
    @@ -74,7 +74,7 @@ void sgemm(
         int kCntPrev, nCntPrev, mCntPrev;
     #endif
         int innerIndex_m, innerIndex_n;
    -    int flagLastK, flagLastM, flagLastN;
    +    int flagLastK, flagLastM, flagLastN, flagLastMXfers, flagLastNXfers;
         float * restrict ptrA, * restrict ptrB, * restrict ptrC;
         float * restrict ptrASeg1, * restrict ptrASeg2;
         float * restrict ptrBSeg1, * restrict ptrBSeg2;
    @@ -185,8 +185,11 @@ void sgemm(
             {
                 mCnt = ((m-mIndex) < MPARTITION) ? (m-mIndex) : MPARTITION;
                 flagLastM = ((mIndex+MPARTITION)<m) ? 0 : 1;
    +            flagLastMXfers = ((mIndex+2*MPARTITION)<m) ? 0 : 1;
                 mCntNext = ((m-mIndex-MPARTITION) < MPARTITION) ?
                            (m-mIndex-MPARTITION) : MPARTITION;
    +            mCntNext = (mCntNext <= 0) ? (m < MPARTITION ? m : MPARTITION)
    +                                       :  mCntNext;
                 if(flagLastM) mCntNext = (m < MPARTITION) ? m : MPARTITION;
     
                 // bring in A into MSMC SRAM (a new parallel transfer)
    @@ -215,7 +218,7 @@ void sgemm(
                 {
                     if ((!flagLastM) || (!flagLastK))
                     {
    -                    if (mIndex == 0)
    +                    if (mIndex == 0 || flagLastMXfers)
                         {
     #if USE_EDMA
                             EdmaMgr_copy2D2DSep(chan0,
    @@ -280,7 +283,10 @@ void sgemm(
                 {
                     nCnt = ((n-nIndex) < NPARTITION) ? (n-nIndex) : NPARTITION;
                     nCntNext = ((n-nIndex-NPARTITION) < NPARTITION) ? (n-nIndex-NPARTITION) : NPARTITION;
    +                nCntNext = (nCntNext <= 0) ? (n < NPARTITION ? n : NPARTITION)
    +                                           : nCntNext;
                     flagLastN = ((nIndex+NPARTITION)<n) ? 0 : 1;
    +                flagLastNXfers = ((nIndex+2*NPARTITION)<n) ? 0 : 1;
                     if(flagLastN) nCntNext = (n < NPARTITION) ? n : NPARTITION;
     
                     // bring in B into L1 SRAM (a new parallel transfer)
    @@ -297,7 +303,7 @@ void sgemm(
                         nXferIndex = (!flagLastN) ? nXferIndex: kIndex;
                         nXferIndex = ((!flagLastN) || (!flagLastM)) ? nXferIndex: (kIndex+kCnt);
                         ptrB = (indexBNext == 0) ? ptrBSeg1: ptrBSeg2;
    -                    if (nIndex == 0)
    +                    if (nIndex == 0 || flagLastNXfers)
                         {
     #if USE_EDMA
                             EdmaMgr_copy2D2DSep(chan1,
    

  • In reply to Yuan Zhao:

    Hi. Yuan,

    Thanks so much for your effort on this topic!

    I want to apply the patch you showed me, but it seems the code you pasted in your post is not complete. Could you please tell me where I can download the complete patch for sgemm?

  • In reply to hao yang95:

    Hi Hao,

      What I enclosed is the complete diff (generated by git). I'll also attach the patch as a file to this post: sgemm.diff

      I tried it on my local EVM and it works.  Which Processor SDK are you using?  You can install the latest here:

    Or you can download OpenCL examples from here:

    Here are the steps to apply the patch to the installed OpenCL examples:

    ### save patch as ~/sgemm.diff
    ~# cd /usr/share/ti/examples/opencl
    ~# git apply ~/sgemm.diff
    ~# cd sgemm
    ~# make clean; make
    ~# ./sgemm -M 10 -K 30 -N 200000

  • In reply to Yuan Zhao:

    Hi, Yuan
    Sorry for the late reply.
    I've checked sgemm from the OpenCL examples with your patch. The result is indeed close to yours (1.326 Gflops, 0.090784s at M=16, K=27, N=200704).
    However, when I test cblas_sgemm from the LINALG lib, it takes more time than sgemm: time=0.196028s, gflops=0.884610. I'd like to use cblas_sgemm in my project because it is easier to integrate, so could you please help check why cblas_sgemm is not as fast as sgemm?
    Thanks for your kind support!!

  • In reply to hao yang95:

    Hi Hao,

    I do not know the details of the CBLAS implementation. I have forwarded your message to the CBLAS team.

    Could you call cblas_sgemm() a few times in the same program and measure the time for each call? It is very likely that your first CBLAS call also included the time to load the program onto the DSP. If that is the case, you should skip the first measurement and only use the subsequent measurements for comparison. Please see "/usr/share/ti/examples/opencl/null" for an explanation.

    - Yuan
  • In reply to Yuan Zhao:

    Hi Hao,

    I did some experiments with cblas_sgemm on K2H EVM. Below are what I got:

    1. force sgemm to run on DSP (TI_CBLAS_OFFLOAD=001): 0.22 second

    2. let sgemm choose by itself where to run (TI_CBLAS_OFFLOAD=002): 0.096 second 

    3. force sgemm to run on ARM (TI_CBLAS_OFFLOAD=000): 0.095 second. This means sgemm chooses to run on the ARM when TI_CBLAS_OFFLOAD=002.

    4. let sgemm choose by itself where to run and configure 3 ARM cores to run BLAS (BLIS_IC_NT=3): 0.048 second. 

    So the optimal way to run cblas_sgemm for (M=16, K=27, N=200704) is on the ARM with 3 cores, if the ARM is available. Usually we don't configure all 4 ARM cores for BLAS, since 1 ARM core is generally always used by the system.

    The reason cblas_sgemm in LINALG is slower than the sgemm in the OpenCL example when running on the DSP may be that LINALG is built on top of BLIS, which may not have optimal performance for extremely rectangular matrices.

    Regards,

    Jianzhong

  • In reply to jianzhongxu:

    @Yuan:
    Yes, your idea is correct, as I have verified in my test below. Thanks so much for your support!!


    @jianzhong:
    Thanks for your effort on this topic.
    My test results for most of the items you listed in your post differ from yours, except the first one. I summarize them here:

    [Case1]TI_CBLAS_OFFLOAD=001, avg:0.19s
    [Case2]TI_CBLAS_OFFLOAD=002, avg:0.19s
    [Case3]TI_CBLAS_OFFLOAD=002, BLIS_IC_NT=3, avg:0.18s
    [Case4]TI_CBLAS_OFFLOAD=000, BLIS_IC_NT=3, avg:0.17s

    1. Taking case 2 as an example, I have no idea why my result (0.19s) is so much worse than yours (0.096s). I also attached the execution log and source code at the end of this post. Could you please check them for any incorrect code or options?
    BTW, to ensure the system is clean enough for the test, I rebooted the K2H12 before launching each test case.

    2. As you said, "LINALG is built on top of BLIS, which may not have optimum performance for extremely rectangular matrix." Does the LINALG team have any plan to optimize BLIS?

    [Log]
    [Case1]TI_CBLAS_OFFLOAD=001

    k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=001
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.933559    gflops=0.185750    //The first call
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.191916    gflops=0.903565
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.199393    gflops=0.869683

    ---------------------------------------------------------------

    [Case2]TI_CBLAS_OFFLOAD=002

    k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=002
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.936287    gflops=0.185209
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.198556    gflops=0.873347
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.202608    gflops=0.855882
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.198693    gflops=0.872745
    Passed.

    ---------------------------------------------------------------

    [Case3]TI_CBLAS_OFFLOAD=002, BLIS_IC_NT=3

    k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=000
    k2hk-evm:/home/sgemm# export BLIS_IC_NT=3
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.918656    gflops=0.188763
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.185593    gflops=0.934347
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.177238    gflops=0.978394
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.287727    gflops=0.602683
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.184236    gflops=0.941229
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.176801    gflops=0.980812
    Passed.

    ---------------------------------------------------------------

    [Case4]TI_CBLAS_OFFLOAD=000, BLIS_IC_NT=3

    k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=000
    k2hk-evm:/home/sgemm# export BLIS_IC_NT=3
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.897856    gflops=0.193136
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.172612    gflops=1.004615
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.172352    gflops=1.006131
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.175291    gflops=0.989258
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.169396    gflops=1.023685
    Passed.
    k2hk-evm:/home/sgemm# ./sgemm_test

        16    200704        27    time=0.176328    gflops=0.983443
    Passed.

    ---------------------------------------------------------------
    ---------------------------------------------------------------

    [Code]
    sgemm_test.c


    [Command]
    arm-linux-gnueabihf-gcc -c -I/opt/ti-processor-sdk-linux-k2hk-evm-03.00.00.04/linux-devkit/sysroots/cortexa15hf-neon-linux-gnueabi/usr/include -I/opt/ti-processor-sdk-linux-k2hk-evm-03.00.00.04/linux-devkit/sysroots/cortexa15hf-neon-linux-gnueabi/usr/share/ti/ti-linalg-tree/packages/ti/linalg -O3 sgemm_test.c


  • In reply to hao yang95:

    Hi Hao,

    Thanks for sharing your code. TI CBLAS needs to be set up and initialized during the first call in an application. Your test makes a single call to CBLAS and then exits, so in your experiments CBLAS does its initialization every time.

    I recommend trying the following:

    if (specified_mnk)
    {
        M = 16;
        N = 200704;
        K = 27;
        run_sgemm(M, N, K, &time_secs, &gflops); // DON'T COUNT THIS ONE FOR MEASUREMENT

        run_sgemm(M, N, K, &time_secs, &gflops); // USE THIS ONE FOR MEASUREMENT
        printf("\n%6d\t%6d\t%6d\ttime=%f\tgflops=%f\n", M, N, K, time_secs, gflops);
    }

    Even better, call run_sgemm in a loop to get an averaged measurement.

    Regards,
    Jianzhong
  • In reply to jianzhongxu:

    Ah, I see.
    Thanks for your comments.
    I tested it again, and the interesting thing is that on my side sgemm chooses the DSP when TI_CBLAS_OFFLOAD=002, but we can see that the performance is better on the ARM than on the DSP. In other words, sgemm didn't choose the better place to run... Is this a problem?

    Here is my log:

    k2hk-evm:/home/mcw/opencl_example/sgemm# export TI_CBLAS_OFFLOAD=002
    k2hk-evm:/home/mcw/opencl_example/sgemm# ./sgemm_test

    16 200704 27 time=0.195576 gflops=0.886654

    16 200704 27 time=0.062049 gflops=2.794703

    16 200704 27 time=0.062210 gflops=2.787481
    Passed.
    k2hk-evm:/home/mcw/opencl_example/sgemm# export TI_CBLAS_OFFLOAD=001
    k2hk-evm:/home/mcw/opencl_example/sgemm# ./sgemm_test

    16 200704 27 time=0.197267 gflops=0.879053

    16 200704 27 time=0.062037 gflops=2.795220

    16 200704 27 time=0.062194 gflops=2.788201
    Passed.
    k2hk-evm:/home/mcw/opencl_example/sgemm# export TI_CBLAS_OFFLOAD=000
    k2hk-evm:/home/mcw/opencl_example/sgemm# ./sgemm_test

    16 200704 27 time=0.178331 gflops=0.972398

    16 200704 27 time=0.041344 gflops=4.194293

    16 200704 27 time=0.041137 gflops=4.215423
    Passed.
  • In reply to hao yang95:

    The reasons could be:
    1. The device you're using may have a different CPU speed from the device that was used to tune BLAS. Please follow this link for more information:
    processors.wiki.ti.com/.../Processor_SDK_Linear_Algebra_Library

    2. For this specific matrix size, running on the DSP and running on the ARM (3 cores) are not very different in speed, and execution time on the ARM can be affected by other running tasks. In addition, tuning was performed on matrix sizes that are multiples of powers of 2, so the closest size that was tuned is actually (16, 262144, 32), which may run faster on the DSP. Given all these factors, sgemm may not be running at optimum speed.

    Hope this makes sense.

    Regards,
    Jianzhong