Hi all,

On the EVMK2H12, I used "cblas_sgemm" from the Processor SDK to multiply large matrices. I know the interface has been optimized with OpenCL across the 8 DSP cores, but the performance does not meet my requirements.

For M=N=K=1000 it takes 0.027s, the same as TI reports (www.ti.com/.../linear-algebra-libraries.page). However, for a shape such as M=10, N=200,000, K=30, it takes 0.277s. Maybe this kind of calculation has already reached the performance limit of the DSP cores. I think it could be improved if the input data were short or fp16 (half-precision floating point), which is 2 bytes per element instead of 4 bytes for float.

Does anyone know whether the SDK supports sgemm with a 2-byte data format? I didn't find anything in the SDK manual or in cblas.h. If not, does TI have any plans for this?

BTW, I have used the fp16 data format for matrix calculations on an NVIDIA platform, where it improves performance significantly (NVIDIA supports fp16 in its CUDA library).

Thanks very much for any help on this topic!
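A rough way to see why a smaller element type might help for this shape is to compare the arithmetic with the minimum amount of data that has to be moved. The sketch below is only a back-of-the-envelope model: it assumes A and B are each read once and C is written once, and it ignores caching, blocking, and the read of C when beta is nonzero. The function names are illustrative and are not part of any TI library.

```c
#include <assert.h>
#include <stddef.h>

/* Floating-point operations in C = A*B, counting one multiply and one add
 * per (m, n, k) triple. The alpha/beta scaling is ignored. */
double gemm_flops(size_t m, size_t n, size_t k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Minimum bytes moved if A (m x k) and B (k x n) are read once and
 * C (m x n) is written once, with elem_size bytes per element.
 * This is an idealized lower bound, not a measurement. */
double gemm_min_bytes(size_t m, size_t n, size_t k, size_t elem_size)
{
    assert(elem_size > 0);
    double elems = (double)m * (double)k
                 + (double)k * (double)n
                 + (double)m * (double)n;
    return (double)elem_size * elems;
}

/* Arithmetic intensity in flops per byte under the model above. */
double gemm_intensity(size_t m, size_t n, size_t k, size_t elem_size)
{
    return gemm_flops(m, n, k) / gemm_min_bytes(m, n, k, elem_size);
}
```

Under these assumptions, gemm_intensity(10, 200000, 30, 4) is about 3.75 flops per byte, against roughly 167 for gemm_intensity(1000, 1000, 1000, 4). That is consistent with the suspicion that the wide shape is limited by data movement rather than arithmetic, and a 2-byte element would double the ratio in this model.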
I see that you got answers about the half data type on other forums. The C66 DSP on the K2H EVM does not have half-precision multiplication support.
I want to comment on the performance numbers that you got for your particular problem size (M=10, N=200,000, K=30). Have you tried forcing the computation to be dispatched to the DSP? See the "TI_CBLAS_OFFLOAD" environment variable on this wiki page:
I tested your problem size with our sgemm example shipped with OpenCL product (/usr/share/ti/examples/opencl/sgemm). I got about 0.086s. That's why I want to see if your particular matrix size is actually dispatched to the DSP side. I know CBLAS has tuned some parameters to determine whether the computation should stay on ARM or be dispatched to DSP. Maybe it didn't make the best decision for your matrix size.
sgemm# ./sgemm -M 10 -K 30 -N 200000
C[10,200000] = alpha * A[10,30] * B[30,200000] + beta * C[10,200000], use col-ma
Generating Input Data ...Complete
8 DSPs: 1.393 Gflops (0.086138 s)
1 CPU : 1.159 Gflops (0.103579 s) with ATLAS library
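As a sanity check on the figures above, the reported Gflops are consistent with counting 2*M*N*K floating-point operations and dividing by the elapsed time. A minimal C sketch under that assumption (the function name gemm_gflops is made up for illustration and does not come from the example source):

```c
#include <assert.h>

/* Gflops for an M x K by K x N matrix product that took `seconds`,
 * assuming the conventional count of 2*M*N*K operations. */
double gemm_gflops(long m, long n, long k, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)m * (double)n * (double)k / seconds / 1e9;
}
```

With M=10, N=200000, K=30, a time of 0.086138 s gives about 1.393 Gflops and 0.103579 s gives about 1.159 Gflops, matching the two lines printed by the example.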
If you want to run the sgemm example from our shipped OpenCL product installation, you need to apply the following patch and rebuild with "make clean; make".
@@ -74,7 +74,7 @@ void sgemm(
int kCntPrev, nCntPrev, mCntPrev;
int innerIndex_m, innerIndex_n;
- int flagLastK, flagLastM, flagLastN;
+ int flagLastK, flagLastM, flagLastN, flagLastMXfers, flagLastNXfers;
float * restrict ptrA, * restrict ptrB, * restrict ptrC;
float * restrict ptrASeg1, * restrict ptrASeg2;
float * restrict ptrBSeg1, * restrict ptrBSeg2;
@@ -185,8 +185,11 @@ void sgemm(
mCnt = ((m-mIndex) < MPARTITION) ? (m-mIndex) : MPARTITION;
flagLastM = ((mIndex+MPARTITION)<m) ? 0 : 1;
+ flagLastMXfers = ((mIndex+2*MPARTITION)<m) ? 0 : 1;
mCntNext = ((m-mIndex-MPARTITION) < MPARTITION) ?
(m-mIndex-MPARTITION) : MPARTITION;
+ mCntNext = (mCntNext <= 0) ? (m < MPARTITION ? m : MPARTITION)
+ : mCntNext;
if(flagLastM) mCntNext = (m < MPARTITION) ? m : MPARTITION;
// bring in A into MSMC SRAM (a new parallel transfer)
@@ -215,7 +218,7 @@ void sgemm(
if ((!flagLastM) || (!flagLastK))
- if (mIndex == 0)
+ if (mIndex == 0 || flagLastMXfers)
@@ -280,7 +283,10 @@ void sgemm(
nCnt = ((n-nIndex) < NPARTITION) ? (n-nIndex) : NPARTITION;
nCntNext = ((n-nIndex-NPARTITION) < NPARTITION) ? (n-nIndex-NPARTITION) : NPARTITION;
+ nCntNext = (nCntNext <= 0) ? (n < NPARTITION ? n : NPARTITION)
+ : nCntNext;
flagLastN = ((nIndex+NPARTITION)<n) ? 0 : 1;
+ flagLastNXfers = ((nIndex+2*NPARTITION)<n) ? 0 : 1;
if(flagLastN) nCntNext = (n < NPARTITION) ? n : NPARTITION;
// bring in B into L1 SRAM (a new parallel transfer)
@@ -297,7 +303,7 @@ void sgemm(
nXferIndex = (!flagLastN) ? nXferIndex: kIndex;
nXferIndex = ((!flagLastN) || (!flagLastM)) ? nXferIndex: (kIndex+kCnt);
ptrB = (indexBNext == 0) ? ptrBSeg1: ptrBSeg2;
- if (nIndex == 0)
+ if (nIndex == 0 || flagLastNXfers)
In reply to Yuan Zhao:
Thanks so much for your effort on this topic!
I'd like to apply the patch you showed me, but what you pasted in your post does not seem to be the complete code. Could you please tell me where I can download the complete sgemm patch?
In reply to hao yang95:
What I enclosed is the complete diff (generated by git). I'll attach the patch as a file to this post as well. sgemm.diff
I tried it on my local EVM and it works. Which Processor SDK are you using? You can install the latest here:
Or you can download OpenCL examples from here:
Here are the steps to apply the patch to the installed OpenCL examples:
### save patch as ~/sgemm.diff
~# cd /usr/share/ti/examples/opencl
~# git apply ~/sgemm.diff
~# cd sgemm
~# make clean; make
~# ./sgemm -M 10 -K 30 -N 200000
Hi Yuan,

Sorry for the late reply. I've tested sgemm from opencl_examples with your patch. The result is close to yours (1.326 Gflops, 0.090784s at M=16, K=27, N=200704).

However, when I test cblas_sgemm from the LINALG library, it takes more time than sgemm: time=0.196028s, gflops=0.884610. I'd like to use cblas_sgemm in my project because it is easier to integrate. Could you please help check why cblas_sgemm is not as fast as sgemm?

Thanks for your kind support!
I did some experiments with cblas_sgemm on the K2H EVM. Below is what I got:
1. Force sgemm to run on the DSP (TI_CBLAS_OFFLOAD=001): 0.22 second
2. Let sgemm choose by itself where to run (TI_CBLAS_OFFLOAD=002): 0.096 second
3. Force sgemm to run on ARM (TI_CBLAS_OFFLOAD=000): 0.095 second. This means sgemm chooses to run on ARM when TI_CBLAS_OFFLOAD=002.
4. Let sgemm choose by itself where to run and configure 3 ARM cores to run BLAS (BLIS_IC_NT=3): 0.048 second.
So the optimum way to run cblas_sgemm for (M=16, K=27, N=200704) is to run it on ARM with 3 cores, if ARM is available. Usually we don't configure 4 ARM cores for BLAS, since 1 ARM core is generally in use by the system.
The reason cblas_sgemm in LINALG is slower than the sgemm in the OpenCL example when running on the DSP may be that LINALG is built on top of BLIS, which may not have optimum performance for extremely rectangular matrices.
In reply to jianzhongxu:
@Yuan:
Yes. Your idea is correct, as I have confirmed in my tests described below. Thanks so much for your support!

@jianzhong:
Thanks for your effort on this topic. My test results for most of the items you listed differ from yours, except for the first one. I summarized them here:

[Case1] TI_CBLAS_OFFLOAD=001, avg: 0.19s
[Case2] TI_CBLAS_OFFLOAD=002, avg: 0.19s
[Case3] TI_CBLAS_OFFLOAD=002, BLIS_IC_NT=3, avg: 0.18s
[Case4] TI_CBLAS_OFFLOAD=000, BLIS_IC_NT=3, avg: 0.17s

1. Taking Case2 as an example, I have no idea why it (0.19s) is so much worse than your result (0.096s). I have also attached the execution log and source code at the end of this post. Could you please check them for any incorrect code or options? BTW, to make sure the system is clean enough for testing, I reboot the K2H12 before launching each test case.
2. As you said, "LINALG is built on top of BLIS, which may not have optimum performance for extremely rectangular matrix." Does the LINALG team have any plans to optimize BLIS?

[Log]
[Case1] TI_CBLAS_OFFLOAD=001
k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=001
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.933559 gflops=0.185750 //The first call
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.191916 gflops=0.903565
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.199393 gflops=0.869683
---------------------------------------------------------------
[Case2] TI_CBLAS_OFFLOAD=002
k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=002
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.936287 gflops=0.185209
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.198556 gflops=0.873347
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.202608 gflops=0.855882
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.198693 gflops=0.872745
Passed.
---------------------------------------------------------------
[Case3] TI_CBLAS_OFFLOAD=002, BLIS_IC_NT=3
k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=000
k2hk-evm:/home/sgemm# export BLIS_IC_NT=3
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.918656 gflops=0.188763
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.185593 gflops=0.934347
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.177238 gflops=0.978394
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.287727 gflops=0.602683
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.184236 gflops=0.941229
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.176801 gflops=0.980812
Passed.
---------------------------------------------------------------
[Case4] TI_CBLAS_OFFLOAD=000, BLIS_IC_NT=3
k2hk-evm:/home/sgemm# export TI_CBLAS_OFFLOAD=000
k2hk-evm:/home/sgemm# export BLIS_IC_NT=3
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.897856 gflops=0.193136
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.172612 gflops=1.004615
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.172352 gflops=1.006131
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.175291 gflops=0.989258
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.169396 gflops=1.023685
Passed.
k2hk-evm:/home/sgemm# ./sgemm_test 16 200704 27
time=0.176328 gflops=0.983443
Passed.
---------------------------------------------------------------

[Code]
sgemm_test.c

[Command]
arm-linux-gnueabihf-gcc -c -I/opt/ti-processor-sdk-linux-k2hk-evm-03.00.00.04/linux-devkit/sysroots/cortexa15hf-neon-linux-gnueabi/usr/include -I/opt/ti-processor-sdk-linux-k2hk-evm-03.00.00.04/linux-devkit/sysroots/cortexa15hf-neon-linux-gnueabi/usr/share/ti/ti-linalg-tree/packages/ti/linalg -O3 sgemm_test.c
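In the log, the first run after each reboot takes about 0.9 s while later runs take about 0.17 to 0.20 s, so any one-time setup cost can easily distort an average. One possible way to separate the two inside a single process is to make a few untimed warm-up calls before timing. The following is only a sketch and is not the attached sgemm_test.c: the names sgemm_ref, run_gemm, and time_gemm_avg are hypothetical, the USE_CBLAS macro is an assumption for switching between a naive reference loop and the standard cblas_sgemm call, and the timer is the C11 wall clock.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>
#ifdef USE_CBLAS
#include <cblas.h>
#endif

/* Naive column-major reference: C (m x n) = A (m x k) * B (k x n). */
void sgemm_ref(int m, int n, int k, const float *a, const float *b, float *c)
{
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int l = 0; l < k; ++l)
                acc += a[i + (size_t)l * m] * b[l + (size_t)j * k];
            c[i + (size_t)j * m] = acc;
        }
}

/* One GEMM call: cblas_sgemm when built with -DUSE_CBLAS, else the reference. */
static void run_gemm(int m, int n, int k, const float *a, const float *b, float *c)
{
#ifdef USE_CBLAS
    cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, m, n, k,
                1.0f, a, m, b, k, 0.0f, c, m);
#else
    sgemm_ref(m, n, k, a, b, c);
#endif
}

/* Wall-clock time in seconds (C11 timespec_get). */
static double now_seconds(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

/* Average seconds per call over `runs` timed calls, after `warmup`
 * untimed calls. Returns -1.0 if allocation fails. */
double time_gemm_avg(int m, int n, int k, int warmup, int runs)
{
    assert(m > 0 && n > 0 && k > 0 && warmup >= 0 && runs > 0);
    size_t na = (size_t)m * k, nb = (size_t)k * n, nc = (size_t)m * n;
    float *a = malloc(na * sizeof *a);
    float *b = malloc(nb * sizeof *b);
    float *c = malloc(nc * sizeof *c);
    if (!a || !b || !c) { free(a); free(b); free(c); return -1.0; }
    for (size_t i = 0; i < na; ++i) a[i] = 1.0f;
    for (size_t i = 0; i < nb; ++i) b[i] = 0.5f;

    for (int r = 0; r < warmup; ++r)      /* untimed: absorbs one-time setup */
        run_gemm(m, n, k, a, b, c);

    double total = 0.0;
    for (int r = 0; r < runs; ++r) {      /* timed steady-state calls */
        double t0 = now_seconds();
        run_gemm(m, n, k, a, b, c);
        total += now_seconds() - t0;
    }
    free(a); free(b); free(c);
    return total / runs;
}
```

A call such as time_gemm_avg(16, 200704, 27, 1, 5) would then report the steady-state average for the shape used in the log. Whether an in-process warm-up removes all of the first-run cost seen after a reboot is not something the log settles.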