[OpenMP 2.x, C6678] Slow performance on matvec sample

Hi,

In the past, I saw poor performance with the matvec sample on OMP 1.x.
I recently learned that TI has moved support from OMP 1.x to OMP 2.x, so I re-checked matvec performance with the new OMP 2.x runtime.

I followed the porting guide described below:

I have now successfully rebuilt the required environment for OMP 2.x.

Following the user's guide delivered with openmp_dsp_2_01_16_03, I created the attached OpenMP 2.x CCS project and ran it on the C6678 EVM. The source code is based on the matvec sample in the openmp_dsp_2_01_16_03 package, but I changed it slightly to show the performance gap between running on core 0 only and running on all cores.

sample_openmp_matvec_keystone.zip

The CCS console log (result) follows:

Core0 Only : 964790 cycles
sum of all c[] = 250500235264.00

Full core : 2754332 cycles
sum of all c[] = 250500251648.00

The two "sum of all c[]" values above are nearly identical (they differ only in the eighth significant digit, presumably from floating-point rounding), so the OpenMP runtime itself is working correctly. Its performance is still poor, however: the full-core run takes about 2.85 times as many cycles as the core 0 only run. Please note that I changed the definition of SIZE from the default (10) to a larger value (say, 1000) to get better performance from the OpenMP runtime, but I still saw bad benchmark numbers.
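
For reference, the comparison has roughly the following shape. This is only a minimal sketch, not the code in the attached project: the function names `matvec`, `sum_all`, and `speedup` are made up for illustration, and whether the sample sums c[] with a reduction is an assumption. If it does, the different order of float additions would explain the small difference between the two sums.

```c
#include <stddef.h>

/* c = A * b for a row-major n x n matrix. Rows are shared among the
 * OpenMP threads; when built without OpenMP support the pragma is
 * skipped and the loop runs on one core. */
void matvec(const float *a, const float *b, float *c, int n)
{
    int i;
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (i = 0; i < n; i++) {
        float acc = 0.0f;
        for (int j = 0; j < n; j++)
            acc += a[(size_t)i * n + j] * b[j];
        c[i] = acc;
    }
}

/* Sum of all c[]. With a reduction, each thread adds up its own part
 * first, so the float additions happen in a different order than in a
 * single-core loop and the last digits of the result may differ. */
float sum_all(const float *c, int n)
{
    float sum = 0.0f;
    int i;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (i = 0; i < n; i++)
        sum += c[i];
    return sum;
}

/* Speedup > 1.0 means the multi-core run was faster. */
double speedup(double single_core_cycles, double multi_core_cycles)
{
    return single_core_cycles / multi_core_cycles;
}
```

With the cycle counts from my log, `speedup(964790.0, 2754332.0)` is about 0.35, so the full-core run is slower, not faster.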

My questions are:

  1. Do you have any other OpenMP sample code that shows its benefit (a faster execution time than single-core execution)?

  2. Our customers might consider using OpenMP, but I think performance can get worse, as you can see with my matvec. Do you have any guidelines for getting a better execution time from an application that uses OpenMP?

  3. At link time, I saw the following warnings:

    warning #10247-D: creating output section ".tdata" without a SECTIONS specification
    warning #10247-D: creating output section ".tbss" without a SECTIONS specification

    I assume this is because .tdata and .tbss are not explicitly mapped to an existing memory region in the cfg file. How should we handle these warnings?

Best Regards,
Naoki Kawada

  • Additional question:

           4. I noticed a project named "openmpbench_C_v3" in the openmp_dsp_2_01_16_0/examples directory.
               I'm not 100% sure what this project is for, but I tried building and running it on the C6678 EVM and saw the following messages on the CCS console:

    Running OpenMP benchmark version 3.0
    	1 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 1
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 1
    Computing PARALLEL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.059842   1.059749   1.059893    0.000037      0
    
    PARALLEL time     = 1.059842 microseconds +/- 0.000073, mean cycles = 1059
    PARALLEL overhead = 0.942824 microseconds +/- 0.000074, mean cycles = 942
    
    --------------------------------------------------------
    Threads: 1
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.921270   1.920733   1.923097    0.000770      0
    
    FOR time     = 1.921270 microseconds +/- 0.001509, mean cycles = 1921
    FOR overhead = 1.804252 microseconds +/- 0.001510, mean cycles = 1804
    
    --------------------------------------------------------
    Threads: 1
    Computing PARALLEL FOR time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.207475   1.207426   1.207525    0.000025      0
    
    PARALLEL FOR time     = 1.207475 microseconds +/- 0.000048, mean cycles = 1207
    PARALLEL FOR overhead = 1.090457 microseconds +/- 0.000050, mean cycles = 1090
    
    --------------------------------------------------------
    Threads: 1
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.722069   1.721248   1.727477    0.001464      1
    
    BARRIER time     = 1.722069 microseconds +/- 0.002869, mean cycles = 1722
    BARRIER overhead = 1.605051 microseconds +/- 0.002871, mean cycles = 1605
    
    --------------------------------------------------------
    Threads: 1
    Computing SINGLE time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.680365   2.679973   2.680669    0.000245      0
    
    SINGLE time     = 2.680365 microseconds +/- 0.000480, mean cycles = 2680
    SINGLE overhead = 2.563347 microseconds +/- 0.000481, mean cycles = 2563
    
    --------------------------------------------------------
    Threads: 1
    Computing CRITICAL time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.655303   2.655220   2.655372    0.000047      0
    
    CRITICAL time     = 2.655303 microseconds +/- 0.000091, mean cycles = 2655
    CRITICAL overhead = 2.538285 microseconds +/- 0.000093, mean cycles = 2538
    
    --------------------------------------------------------
    Threads: 1
    Computing LOCK/UNLOCK time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.247904   3.247575   3.248278    0.000173      0
    
    LOCK/UNLOCK time     = 3.247904 microseconds +/- 0.000339, mean cycles = 3248
    LOCK/UNLOCK overhead = 3.130885 microseconds +/- 0.000340, mean cycles = 3131
    
    --------------------------------------------------------
    Threads: 1
    Computing ORDERED time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.595613   2.595513   2.595686    0.000044      0
    
    ORDERED time     = 2.595613 microseconds +/- 0.000085, mean cycles = 2596
    ORDERED overhead = 2.478595 microseconds +/- 0.000087, mean cycles = 2479
    
    --------------------------------------------------------
    Threads: 1
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 1
    Computing ATOMIC time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.525881   2.525786   2.525945    0.000042      0
    
    ATOMIC time     = 2.525881 microseconds +/- 0.000083, mean cycles = 2526
    ATOMIC overhead = 2.521881 microseconds +/- 0.000083, mean cycles = 2522
    
    --------------------------------------------------------
    Threads: 1
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 1
    Computing REDUCTION time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.563366   3.559178   3.567547    0.004123      0
    
    REDUCTION time     = 3.563366 microseconds +/- 0.008082, mean cycles = 3564
    REDUCTION overhead = 3.446348 microseconds +/- 0.008083, mean cycles = 3447
    Running OpenMP benchmark version 3.0
    	2 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 2
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 2
    Computing PARALLEL time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                5.585338   5.583128   5.588791    0.001714      0
    
    PARALLEL time     = 5.585338 microseconds +/- 0.003359, mean cycles = 5585
    PARALLEL overhead = 5.468319 microseconds +/- 0.003361, mean cycles = 5468
    
    --------------------------------------------------------
    Threads: 2
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.077540   2.074889   2.082720    0.002170      0
    
    FOR time     = 2.077540 microseconds +/- 0.004253, mean cycles = 2077
    FOR overhead = 1.960522 microseconds +/- 0.004255, mean cycles = 1960
    
    --------------------------------------------------------
    Threads: 2
    Computing PARALLEL FOR time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                5.833653   5.825503   5.848056    0.004669      1
    
    PARALLEL FOR time     = 5.833653 microseconds +/- 0.009151, mean cycles = 5834
    PARALLEL FOR overhead = 5.716635 microseconds +/- 0.009153, mean cycles = 5717
    
    --------------------------------------------------------
    Threads: 2
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.884255   1.881975   1.890247    0.002376      0
    
    BARRIER time     = 1.884255 microseconds +/- 0.004656, mean cycles = 1884
    BARRIER overhead = 1.767237 microseconds +/- 0.004658, mean cycles = 1767
    
    --------------------------------------------------------
    Threads: 2
    Computing SINGLE time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.731013   2.687625   2.749788    0.027111      0
    
    SINGLE time     = 2.731013 microseconds +/- 0.053138, mean cycles = 2731
    SINGLE overhead = 2.613995 microseconds +/- 0.053140, mean cycles = 2614
    
    --------------------------------------------------------
    Threads: 2
    Computing CRITICAL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.402575   1.395294   1.412257    0.008248      0
    
    CRITICAL time     = 1.402575 microseconds +/- 0.016165, mean cycles = 1402
    CRITICAL overhead = 1.285557 microseconds +/- 0.016167, mean cycles = 1285
    
    --------------------------------------------------------
    Threads: 2
    Computing LOCK/UNLOCK time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.864586   1.864473   1.864787    0.000085      0
    
    LOCK/UNLOCK time     = 1.864586 microseconds +/- 0.000166, mean cycles = 1865
    LOCK/UNLOCK overhead = 1.747567 microseconds +/- 0.000168, mean cycles = 1748
    
    --------------------------------------------------------
    Threads: 2
    Computing ORDERED time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.306332   3.295553   3.317913    0.005878      0
    
    ORDERED time     = 3.306332 microseconds +/- 0.011520, mean cycles = 3307
    ORDERED overhead = 3.189314 microseconds +/- 0.011522, mean cycles = 3190
    
    --------------------------------------------------------
    Threads: 2
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 2
    Computing ATOMIC time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.397373   1.397320   1.397405    0.000022      0
    
    ATOMIC time     = 1.397373 microseconds +/- 0.000043, mean cycles = 1397
    ATOMIC overhead = 1.393373 microseconds +/- 0.000043, mean cycles = 1393
    
    --------------------------------------------------------
    Threads: 2
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 2
    Computing REDUCTION time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                8.270118   8.259981   8.279244    0.004865      0
    
    REDUCTION time     = 8.270118 microseconds +/- 0.009535, mean cycles = 8271
    REDUCTION overhead = 8.153100 microseconds +/- 0.009537, mean cycles = 8154
    Running OpenMP benchmark version 3.0
    	3 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 3
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 3
    Computing PARALLEL time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                5.819970   5.818516   5.823234    0.001265      0
    
    PARALLEL time     = 5.819970 microseconds +/- 0.002479, mean cycles = 5820
    PARALLEL overhead = 5.702951 microseconds +/- 0.002481, mean cycles = 5703
    
    --------------------------------------------------------
    Threads: 3
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.225003   2.222820   2.232733    0.002196      1
    
    FOR time     = 2.225003 microseconds +/- 0.004304, mean cycles = 2225
    FOR overhead = 2.107985 microseconds +/- 0.004305, mean cycles = 2108
    
    --------------------------------------------------------
    Threads: 3
    Computing PARALLEL FOR time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.059567   6.050641   6.065444    0.004029      0
    
    PARALLEL FOR time     = 6.059567 microseconds +/- 0.007897, mean cycles = 6060
    PARALLEL FOR overhead = 5.942549 microseconds +/- 0.007898, mean cycles = 5943
    
    --------------------------------------------------------
    Threads: 3
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.029729   2.028578   2.036858    0.001773      1
    
    BARRIER time     = 2.029729 microseconds +/- 0.003476, mean cycles = 2029
    BARRIER overhead = 1.912711 microseconds +/- 0.003477, mean cycles = 1912
    
    --------------------------------------------------------
    Threads: 3
    Computing SINGLE time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.866747   2.846355   2.874797    0.009125      0
    
    SINGLE time     = 2.866747 microseconds +/- 0.017885, mean cycles = 2866
    SINGLE overhead = 2.749729 microseconds +/- 0.017886, mean cycles = 2749
    
    --------------------------------------------------------
    Threads: 3
    Computing CRITICAL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.401200   1.394692   1.478257    0.018851      1
    
    CRITICAL time     = 1.401200 microseconds +/- 0.036947, mean cycles = 1400
    CRITICAL overhead = 1.284182 microseconds +/- 0.036949, mean cycles = 1283
    
    --------------------------------------------------------
    Threads: 3
    Computing LOCK/UNLOCK time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.988831   1.876412   2.113611    0.081642      0
    
    LOCK/UNLOCK time     = 1.988831 microseconds +/- 0.160019, mean cycles = 1988
    LOCK/UNLOCK overhead = 1.871813 microseconds +/- 0.160020, mean cycles = 1871
    
    --------------------------------------------------------
    Threads: 3
    Computing ORDERED time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.310024   3.305519   3.315800    0.003661      0
    
    ORDERED time     = 3.310024 microseconds +/- 0.007176, mean cycles = 3310
    ORDERED overhead = 3.193005 microseconds +/- 0.007178, mean cycles = 3193
    
    --------------------------------------------------------
    Threads: 3
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 3
    Computing ATOMIC time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.354010   1.352858   1.355841    0.001180      0
    
    ATOMIC time     = 1.354010 microseconds +/- 0.002313, mean cycles = 1353
    ATOMIC overhead = 1.350009 microseconds +/- 0.002313, mean cycles = 1349
    
    --------------------------------------------------------
    Threads: 3
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 3
    Computing REDUCTION time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                9.771400   9.762087   9.781500    0.005036      0
    
    REDUCTION time     = 9.771400 microseconds +/- 0.009871, mean cycles = 9773
    REDUCTION overhead = 9.654382 microseconds +/- 0.009873, mean cycles = 9656
    Running OpenMP benchmark version 3.0
    	4 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 4
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000001 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 4
    Computing PARALLEL time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.056951   6.053416   6.060103    0.002065      0
    
    PARALLEL time     = 6.056951 microseconds +/- 0.004047, mean cycles = 6057
    PARALLEL overhead = 5.939933 microseconds +/- 0.004049, mean cycles = 5940
    
    --------------------------------------------------------
    Threads: 4
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.373155   2.366277   2.384739    0.006521      0
    
    FOR time     = 2.373155 microseconds +/- 0.012781, mean cycles = 2373
    FOR overhead = 2.256137 microseconds +/- 0.012783, mean cycles = 2256
    
    --------------------------------------------------------
    Threads: 4
    Computing PARALLEL FOR time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.186251   6.182666   6.193269    0.002511      0
    
    PARALLEL FOR time     = 6.186251 microseconds +/- 0.004921, mean cycles = 6186
    PARALLEL FOR overhead = 6.069233 microseconds +/- 0.004923, mean cycles = 6069
    
    --------------------------------------------------------
    Threads: 4
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.180477   2.176070   2.190273    0.005141      0
    
    BARRIER time     = 2.180477 microseconds +/- 0.010076, mean cycles = 2180
    BARRIER overhead = 2.063459 microseconds +/- 0.010078, mean cycles = 2063
    
    --------------------------------------------------------
    Threads: 4
    Computing SINGLE time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.976714   2.970547   2.984936    0.004177      0
    
    SINGLE time     = 2.976714 microseconds +/- 0.008186, mean cycles = 2976
    SINGLE overhead = 2.859696 microseconds +/- 0.008188, mean cycles = 2859
    
    --------------------------------------------------------
    Threads: 4
    Computing CRITICAL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.427155   1.394809   1.476102    0.025281      0
    
    CRITICAL time     = 1.427155 microseconds +/- 0.049550, mean cycles = 1427
    CRITICAL overhead = 1.310137 microseconds +/- 0.049551, mean cycles = 1310
    
    --------------------------------------------------------
    Threads: 4
    Computing LOCK/UNLOCK time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.860653   1.842442   1.883656    0.014441      0
    
    LOCK/UNLOCK time     = 1.860653 microseconds +/- 0.028304, mean cycles = 1860
    LOCK/UNLOCK overhead = 1.743635 microseconds +/- 0.028305, mean cycles = 1743
    
    --------------------------------------------------------
    Threads: 4
    Computing ORDERED time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.312502   3.304959   3.321581    0.004326      0
    
    ORDERED time     = 3.312502 microseconds +/- 0.008479, mean cycles = 3313
    ORDERED overhead = 3.195484 microseconds +/- 0.008480, mean cycles = 3196
    
    --------------------------------------------------------
    Threads: 4
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 4
    Computing ATOMIC time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.711760   1.710056   1.715572    0.001771      0
    
    ATOMIC time     = 1.711760 microseconds +/- 0.003470, mean cycles = 1711
    ATOMIC overhead = 1.707759 microseconds +/- 0.003470, mean cycles = 1707
    
    --------------------------------------------------------
    Threads: 4
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 4
    Computing REDUCTION time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                11.328404   11.323000   11.337756    0.003746      0
    
    REDUCTION time     = 11.328404 microseconds +/- 0.007342, mean cycles = 11330
    REDUCTION overhead = 11.211386 microseconds +/- 0.007343, mean cycles = 11213
    Running OpenMP benchmark version 3.0
    	5 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 5
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000001 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 5
    Computing PARALLEL time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.196515   6.188159   6.204772    0.004221      0
    
    PARALLEL time     = 6.196515 microseconds +/- 0.008273, mean cycles = 6197
    PARALLEL overhead = 6.079497 microseconds +/- 0.008274, mean cycles = 6080
    
    --------------------------------------------------------
    Threads: 5
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.528982   2.523859   2.543095    0.004844      0
    
    FOR time     = 2.528982 microseconds +/- 0.009495, mean cycles = 2529
    FOR overhead = 2.411964 microseconds +/- 0.009496, mean cycles = 2412
    
    --------------------------------------------------------
    Threads: 5
    Computing PARALLEL FOR time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.432599   6.426756   6.438325    0.002753      0
    
    PARALLEL FOR time     = 6.432599 microseconds +/- 0.005396, mean cycles = 6434
    PARALLEL FOR overhead = 6.315581 microseconds +/- 0.005397, mean cycles = 6317
    
    --------------------------------------------------------
    Threads: 5
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.338413   2.333514   2.349305    0.005609      0
    
    BARRIER time     = 2.338413 microseconds +/- 0.010994, mean cycles = 2338
    BARRIER overhead = 2.221395 microseconds +/- 0.010996, mean cycles = 2221
    
    --------------------------------------------------------
    Threads: 5
    Computing SINGLE time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.132868   3.126409   3.144072    0.004611      0
    
    SINGLE time     = 3.132868 microseconds +/- 0.009037, mean cycles = 3133
    SINGLE overhead = 3.015851 microseconds +/- 0.009038, mean cycles = 3016
    
    --------------------------------------------------------
    Threads: 5
    Computing CRITICAL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.418253   1.418191   1.418310    0.000029      0
    
    CRITICAL time     = 1.418253 microseconds +/- 0.000058, mean cycles = 1418
    CRITICAL overhead = 1.301235 microseconds +/- 0.000059, mean cycles = 1301
    
    --------------------------------------------------------
    Threads: 5
    Computing LOCK/UNLOCK time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.852150   1.843994   1.871095    0.007201      0
    
    LOCK/UNLOCK time     = 1.852150 microseconds +/- 0.014114, mean cycles = 1852
    LOCK/UNLOCK overhead = 1.735132 microseconds +/- 0.014115, mean cycles = 1735
    
    --------------------------------------------------------
    Threads: 5
    Computing ORDERED time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.316784   3.305528   3.331538    0.006494      0
    
    ORDERED time     = 3.316784 microseconds +/- 0.012728, mean cycles = 3317
    ORDERED overhead = 3.199766 microseconds +/- 0.012729, mean cycles = 3200
    
    --------------------------------------------------------
    Threads: 5
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 5
    Computing ATOMIC time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.376536   1.376487   1.376634    0.000038      0
    
    ATOMIC time     = 1.376536 microseconds +/- 0.000074, mean cycles = 1376
    ATOMIC overhead = 1.372535 microseconds +/- 0.000074, mean cycles = 1372
    
    --------------------------------------------------------
    Threads: 5
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 5
    Computing REDUCTION time using 160 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                12.991490   12.974688   13.003075    0.007381      0
    
    REDUCTION time     = 12.991490 microseconds +/- 0.014467, mean cycles = 12995
    REDUCTION overhead = 12.874472 microseconds +/- 0.014469, mean cycles = 12878
    Running OpenMP benchmark version 3.0
    	6 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 6
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 6
    Computing PARALLEL time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.355777   6.347206   6.362444    0.004493      0
    
    PARALLEL time     = 6.355777 microseconds +/- 0.008806, mean cycles = 6357
    PARALLEL overhead = 6.238758 microseconds +/- 0.008808, mean cycles = 6240
    
    --------------------------------------------------------
    Threads: 6
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.693511   2.686883   2.708286    0.005433      0
    
    FOR time     = 2.693511 microseconds +/- 0.010648, mean cycles = 2693
    FOR overhead = 2.576493 microseconds +/- 0.010650, mean cycles = 2576
    
    --------------------------------------------------------
    Threads: 6
    Computing PARALLEL FOR time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.609851   6.601313   6.619550    0.005196      0
    
    PARALLEL FOR time     = 6.609851 microseconds +/- 0.010184, mean cycles = 6611
    PARALLEL FOR overhead = 6.492833 microseconds +/- 0.010186, mean cycles = 6494
    
    --------------------------------------------------------
    Threads: 6
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.501581   2.496703   2.514939    0.005536      0
    
    BARRIER time     = 2.501581 microseconds +/- 0.010850, mean cycles = 2501
    BARRIER overhead = 2.384563 microseconds +/- 0.010852, mean cycles = 2384
    
    --------------------------------------------------------
    Threads: 6
    Computing SINGLE time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.316304   3.308684   3.326856    0.005499      0
    
    SINGLE time     = 3.316304 microseconds +/- 0.010778, mean cycles = 3317
    SINGLE overhead = 3.199286 microseconds +/- 0.010780, mean cycles = 3200
    
    --------------------------------------------------------
    Threads: 6
    Computing CRITICAL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.427989   1.427950   1.428036    0.000027      0
    
    CRITICAL time     = 1.427989 microseconds +/- 0.000052, mean cycles = 1428
    CRITICAL overhead = 1.310971 microseconds +/- 0.000054, mean cycles = 1311
    
    --------------------------------------------------------
    Threads: 6
    Computing LOCK/UNLOCK time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.855418   1.845097   1.875791    0.008659      0
    
    LOCK/UNLOCK time     = 1.855418 microseconds +/- 0.016971, mean cycles = 1855
    LOCK/UNLOCK overhead = 1.738400 microseconds +/- 0.016973, mean cycles = 1738
    
    --------------------------------------------------------
    Threads: 6
    Computing ORDERED time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.317848   3.307437   3.326697    0.004729      0
    
    ORDERED time     = 3.317848 microseconds +/- 0.009269, mean cycles = 3318
    ORDERED overhead = 3.200830 microseconds +/- 0.009271, mean cycles = 3201
    
    --------------------------------------------------------
    Threads: 6
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 6
    Computing ATOMIC time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.389887   1.389393   1.390483    0.000509      0
    
    ATOMIC time     = 1.389887 microseconds +/- 0.000998, mean cycles = 1389
    ATOMIC overhead = 1.385886 microseconds +/- 0.000998, mean cycles = 1385
    
    --------------------------------------------------------
    Threads: 6
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 6
    Computing REDUCTION time using 160 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                14.604939   14.597913   14.616338    0.005458      0
    
    REDUCTION time     = 14.604939 microseconds +/- 0.010698, mean cycles = 14608
    REDUCTION overhead = 14.487920 microseconds +/- 0.010699, mean cycles = 14491
    Running OpenMP benchmark version 3.0
    	7 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 7
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 7
    Computing PARALLEL time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.577366   6.571906   6.588081    0.003691      0
    
    PARALLEL time     = 6.577366 microseconds +/- 0.007234, mean cycles = 6579
    PARALLEL overhead = 6.460348 microseconds +/- 0.007236, mean cycles = 6462
    
    --------------------------------------------------------
    Threads: 7
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.853039   2.842598   2.867686    0.006840      0
    
    FOR time     = 2.853039 microseconds +/- 0.013406, mean cycles = 2853
    FOR overhead = 2.736021 microseconds +/- 0.013408, mean cycles = 2736
    
    --------------------------------------------------------
    Threads: 7
    Computing PARALLEL FOR time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.800647   6.793344   6.810975    0.004931      0
    
    PARALLEL FOR time     = 6.800647 microseconds +/- 0.009665, mean cycles = 6802
    PARALLEL FOR overhead = 6.683629 microseconds +/- 0.009666, mean cycles = 6685
    
    --------------------------------------------------------
    Threads: 7
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.662619   2.652877   2.675330    0.006578      0
    
    BARRIER time     = 2.662619 microseconds +/- 0.012893, mean cycles = 2662
    BARRIER overhead = 2.545601 microseconds +/- 0.012895, mean cycles = 2545
    
    --------------------------------------------------------
    Threads: 7
    Computing SINGLE time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.464547   3.456684   3.472103    0.005234      0
    
    SINGLE time     = 3.464547 microseconds +/- 0.010259, mean cycles = 3465
    SINGLE overhead = 3.347529 microseconds +/- 0.010261, mean cycles = 3348
    
    --------------------------------------------------------
    Threads: 7
    Computing CRITICAL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.391868   1.391668   1.392073    0.000169      0
    
    CRITICAL time     = 1.391868 microseconds +/- 0.000331, mean cycles = 1391
    CRITICAL overhead = 1.274850 microseconds +/- 0.000332, mean cycles = 1274
    
    --------------------------------------------------------
    Threads: 7
    Computing LOCK/UNLOCK time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.847362   1.838678   1.857870    0.006143      0
    
    LOCK/UNLOCK time     = 1.847362 microseconds +/- 0.012041, mean cycles = 1847
    LOCK/UNLOCK overhead = 1.730344 microseconds +/- 0.012042, mean cycles = 1730
    
    --------------------------------------------------------
    Threads: 7
    Computing ORDERED time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.325697   3.311838   3.353878    0.009365      1
    
    ORDERED time     = 3.325697 microseconds +/- 0.018355, mean cycles = 3326
    ORDERED overhead = 3.208679 microseconds +/- 0.018357, mean cycles = 3209
    
    --------------------------------------------------------
    Threads: 7
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 7
    Computing ATOMIC time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.407279   1.407235   1.407327    0.000026      0
    
    ATOMIC time     = 1.407279 microseconds +/- 0.000051, mean cycles = 1407
    ATOMIC overhead = 1.403279 microseconds +/- 0.000051, mean cycles = 1403
    
    --------------------------------------------------------
    Threads: 7
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 7
    Computing REDUCTION time using 160 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                16.272913   16.260200   16.287163    0.006981      0
    
    REDUCTION time     = 16.272913 microseconds +/- 0.013682, mean cycles = 16276
    REDUCTION overhead = 16.155895 microseconds +/- 0.013684, mean cycles = 16159
    Running OpenMP benchmark version 3.0
    	8 thread(s)
    	20 outer repetitions
    	1000.00 test time (microseconds)
    	26 delay length (iterations)
    	0.100000 delay time (microseconds)
    
    --------------------------------------------------------
    Threads: 8
    Computing reference time 1 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 1 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 8
    Computing PARALLEL time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                6.803014   6.793306   6.807987    0.003187      1
    
    PARALLEL time     = 6.803014 microseconds +/- 0.006246, mean cycles = 6804
    PARALLEL overhead = 6.685996 microseconds +/- 0.006248, mean cycles = 6687
    
    --------------------------------------------------------
    Threads: 8
    Computing FOR time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.046861   3.035508   3.060453    0.005960      0
    
    FOR time     = 3.046861 microseconds +/- 0.011682, mean cycles = 3046
    FOR overhead = 2.929843 microseconds +/- 0.011684, mean cycles = 2929
    
    --------------------------------------------------------
    Threads: 8
    Computing PARALLEL FOR time using 320 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                7.012879   7.006631   7.021144    0.004255      0
    
    PARALLEL FOR time     = 7.012879 microseconds +/- 0.008340, mean cycles = 7014
    PARALLEL FOR overhead = 6.895861 microseconds +/- 0.008342, mean cycles = 6897
    
    --------------------------------------------------------
    Threads: 8
    Computing BARRIER time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.853963   2.841664   2.863036    0.007432      0
    
    BARRIER time     = 2.853963 microseconds +/- 0.014567, mean cycles = 2854
    BARRIER overhead = 2.736945 microseconds +/- 0.014569, mean cycles = 2737
    
    --------------------------------------------------------
    Threads: 8
    Computing SINGLE time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.638552   3.633037   3.642400    0.002710      0
    
    SINGLE time     = 3.638552 microseconds +/- 0.005312, mean cycles = 3639
    SINGLE overhead = 3.521534 microseconds +/- 0.005313, mean cycles = 3522
    
    --------------------------------------------------------
    Threads: 8
    Computing CRITICAL time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.421305   1.421235   1.421355    0.000030      0
    
    CRITICAL time     = 1.421305 microseconds +/- 0.000059, mean cycles = 1421
    CRITICAL overhead = 1.304287 microseconds +/- 0.000061, mean cycles = 1304
    
    --------------------------------------------------------
    Threads: 8
    Computing LOCK/UNLOCK time using 1280 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                2.048067   2.038444   2.064064    0.006806      0
    
    LOCK/UNLOCK time     = 2.048067 microseconds +/- 0.013339, mean cycles = 2048
    LOCK/UNLOCK overhead = 1.931049 microseconds +/- 0.013341, mean cycles = 1931
    
    --------------------------------------------------------
    Threads: 8
    Computing ORDERED time using 640 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                3.382601   3.359741   3.457400    0.029729      0
    
    ORDERED time     = 3.382601 microseconds +/- 0.058269, mean cycles = 3383
    ORDERED overhead = 3.265583 microseconds +/- 0.058271, mean cycles = 3266
    
    --------------------------------------------------------
    Threads: 8
    Computing reference time 2 time using 655360 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.004001   0.004001   0.004001    0.000000      0
    
    reference time 2 time     = 0.004001 microseconds +/- 0.000000 mean cycles = 4
    
    --------------------------------------------------------
    Threads: 8
    Computing ATOMIC time using 2560 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                1.417196   1.417158   1.417262    0.000029      0
    
    ATOMIC time     = 1.417196 microseconds +/- 0.000056, mean cycles = 1417
    ATOMIC overhead = 1.413196 microseconds +/- 0.000056, mean cycles = 1413
    
    --------------------------------------------------------
    Threads: 8
    Computing reference time 3 time using 20480 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                0.117018   0.117017   0.117019    0.000001      0
    
    reference time 3 time     = 0.117018 microseconds +/- 0.000002 mean cycles = 117
    
    --------------------------------------------------------
    Threads: 8
    Computing REDUCTION time using 160 reps
    
    Sample_size       Average     Min         Max          S.D.          Outliers
     20                18.016152   18.005050   18.031850    0.006064      0
    
    REDUCTION time     = 18.016152 microseconds +/- 0.011885, mean cycles = 18020
    REDUCTION overhead = 17.899134 microseconds +/- 0.011887, mean cycles = 17903
    
    

    This project looks like a benchmark of OpenMP overhead. How should we read these logs?

    Best Regards,
    Naoki 

               

  • Hi Naoki Kawada,

    I am working with experts to answer this post. We will get back to you shortly on this.

    Thank you.
  • Hello Raja,

    Have you made any progress on this?

    Best Regards,
    Naoki
  • Hi Naoki,
    We have reproduced the performance issue and are working to resolve it. Thank you for your patience.
  • Hi Naoki,

    The matvec performance degradation with OpenMP occurs because the cost of the serial code in the critical section (combined with the printfs) outweighs the benefit of running across multiple cores. Please move the serial code out of the for loop and compute the running total separately.

    #pragma omp critical
    {
        total = total + c[i];
        printf("  thread %d did row %d\t c[%d]=%.2f\t",tid,i,i,c[i]);
        printf("Running total= %.2f\n",total);
    }

    Here are the cycle counts after the change, with matrix size 1000x1000:

    Core0 Only : 964074 cycles

    sum of all c[] = 250500235264.00

    Full core : 281352 cycles

    sum of all c[] = 250500235264.00
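
    One possible way to restructure the loop as described above is sketched here. It is a minimal sketch, not the exact change that produced the numbers above: the function name matvec_sum, the row-major layout, and the parameter names are assumptions for illustration. The parallel loop only fills c[], and the total is then accumulated by a single thread in index order, so no critical section or printf runs inside the parallel loop.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not from the TI sample): computes c = A * b for a
 * row-major rows x cols matrix, then returns the sum of all c[].
 */
float matvec_sum(const float *a, const float *b, float *c, int rows, int cols)
{
    float total = 0.0f;
    int i;

    assert(a != NULL && b != NULL && c != NULL);

    /* Rows are independent, so this loop needs no synchronization. */
    #pragma omp parallel for
    for (i = 0; i < rows; i++) {
        int j;              /* declared in the loop body, so private per thread */
        float acc = 0.0f;
        for (j = 0; j < cols; j++)
            acc += a[(size_t)i * cols + j] * b[j];
        c[i] = acc;
    }

    /* Serial part moved out of the parallel loop: one thread, index order. */
    for (i = 0; i < rows; i++)
        total += c[i];

    return total;
}
```

    Because the additions into total happen in the same order as in a plain single-core loop that sums c[] in index order, the float rounding is the same in both cases. An OpenMP reduction(+:total) clause would be an alternative, but its summation order can depend on the number of threads.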

    openmpbench_C_v3 (the OpenMP micro-benchmark suite) measures the overhead of commonly used OpenMP constructs. For example, the excerpt below indicates that the overhead of starting and terminating a PARALLEL FOR construct across 8 threads/DSP cores is 7169 cycles.

    Threads: 8

    Computing PARALLEL FOR time using 320 reps

    Sample_size       Average     Min         Max          S.D.          Outliers

    20                7.284269   7.273963   7.294844    0.005279      0

    PARALLEL FOR time     = 7.284269 microseconds +/- 0.010347, mean cycles = 7286

    PARALLEL FOR overhead = 7.167252 microseconds +/- 0.010349, mean cycles = 7169
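
    As a rough guide to reading these logs: each "overhead" line is the measured time of the construct minus the matching reference time. Here 7.284269 - 7.167252 leaves about 0.117 microseconds, which matches the reference time reported in the logs above. The cycle counts are consistent with about 1000 cycles per microsecond (0.117018 microseconds is reported as 117 cycles). The helper names and the clock_mhz parameter below are assumptions for illustration, not part of the benchmark.

```c
#include <assert.h>

/* Overhead = time measured with the construct - reference time without it. */
double overhead_us(double construct_us, double reference_us)
{
    return construct_us - reference_us;
}

/* Converts microseconds to cycles; clock_mhz is an assumed core clock. */
double us_to_cycles(double us, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return us * clock_mhz;
}
```

    For example, overhead_us(7.284269, 0.117018) is about 7.167 microseconds, and us_to_cycles of that value at an assumed 1000 MHz is about 7167 cycles, close to the 7169 reported.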

    Regards, Garrett

  • Hi Garrett,

    Thank you very much for your confirmation.
    I'm currently tied up with other work. Once that is done, I'll come back to this thread. If I see the same result in the benchmark, I'll mark your answer as verified.

    Best Regards,
    Naoki

  • Hi,

    I'm back on this task.
    I noticed that the computed result actually differed between the matvec runs without OpenMP and with OpenMP.
    Before moving the omp critical section out of the loop for speed, I need to understand why the results differ.

    Below are my investigation and results. Please take a look at the following shared CCS project.

    5381.sample_openmp_matvec_keystone.zip

    The common compile options for omp_matvec.c are as follows:

    "C:/ti/ti-cgt-c6000_8.0.3/bin/cl6x" -mv6600 --abi=eabi --opt_for_speed=5 --include_path="C:/ti/ti-cgt-c6000_8.0.3/include" -g --display_error_number --diag_warning=225 --diag_wrap=off --openmp -k --preproc_with_compile --preproc_dependency="omp_matvec.pp" --cmd_file="./configPkg/compiler.opt" "../omp_matvec.c"

    I then added an optimization option (-O2) and changed the array size to see what happens.

    ============================================================================
    Optimization level : off
    Size definition for arrays in omp_matvec.c: SIZE=128
    Logs on CCS console:

        Core0 Only : 573456 cycles
        sum of all c[] = 68161536.00

        Full core : 395846 cycles
        sum of all c[] = 68161536.00

        Success: Result matches!
    ============================================================================

    In this case, the result looks correct. Next, I enabled software pipelining with -O2.

    ============================================================================
    Optimization level : -O2 (Added to the above compile options)
    Size definition for arrays in omp_matvec.c: SIZE=128
    Logs on CCS console:

        Core0 Only : 28018 cycles
        sum of all c[] = 68161536.00

        Full core : 372921 cycles
        sum of all c[] = 68161536.00

        Success: Result matches!
    ============================================================================

    OK, the result still looks correct.
    Next, I increased the array size from 128 to 1280.

    ============================================================================
    Optimization level : -O2 (Added to the above compile options)
    Size definition for arrays in omp_matvec.c: SIZE=1280
    Logs on CCS console:

        Core0 Only : 2466835 cycles
        sum of all c[] = 672137216000.00

        Full core : 3620381 cycles
        sum of all c[] = 672137674752.00

        Failure: Result does not matche!
    ============================================================================

    ...the result became incorrect. Why does this happen?

    Best Regards,
    Naoki

  • Hi Naoki,
    The response may be delayed due to holidays and timebank. Thank you for your patience.
  • Hi

    Do you have any updates on this?
    Sorry to rush you, but we are unsure whether we can recommend the OpenMP 2.x runtime to customers as an easy way to do multi-core development on the KeyStone I platform, because I saw an incorrect result in the matvec example.

    Best Regards,
    Naoki
  • Hi Naoki,

    See here: e2e.ti.com/.../1775317
    "It looks like the result mismatch between single core and OMP is caused by rounding of the floating point data when optimization level is -o2 and size = 512. If you change the data type of 'total' for single core from float to double, you should be able to see result matches."

    Regards,
    Garrett
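
    The rounding effect described in the quote can be reproduced in isolation. The sketch below is a hypothetical model, not the matvec sample: it sums the same values once in array order and once as per-chunk partial sums (roughly what happens when several threads contribute to the total in a different order), with float and with double accumulators. Note also that the two mismatching totals in the SIZE=1280 log above, 672137216000 and 672137674752, differ by 458752 = 7 x 65536, and 65536 is the spacing between adjacent float values at that magnitude, which is consistent with the rounding explanation.

```c
#include <assert.h>
#include <stddef.h>

/* Sums v[0..n-1] in array order with a float accumulator. */
float sum_float_serial(const float *v, size_t n)
{
    float total = 0.0f;
    size_t i;
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Sums v in `parts` contiguous chunks, then adds the float partial sums. */
float sum_float_chunked(const float *v, size_t n, size_t parts)
{
    size_t chunk, p, i;
    float total = 0.0f;
    assert(parts > 0);
    chunk = (n + parts - 1) / parts;
    for (p = 0; p < parts; p++) {
        float partial = 0.0f;
        for (i = p * chunk; i < n && i < (p + 1) * chunk; i++)
            partial += v[i];
        total += partial;
    }
    return total;
}

/* Same array-order sum, but with a double accumulator. */
double sum_double_serial(const float *v, size_t n)
{
    double total = 0.0;
    size_t i;
    for (i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Same chunked sum, but with double partial sums and a double total. */
double sum_double_chunked(const float *v, size_t n, size_t parts)
{
    size_t chunk, p, i;
    double total = 0.0;
    assert(parts > 0);
    chunk = (n + parts - 1) / parts;
    for (p = 0; p < parts; p++) {
        double partial = 0.0;
        for (i = p * chunk; i < n && i < (p + 1) * chunk; i++)
            partial += v[i];
        total += partial;
    }
    return total;
}
```

    With v = {16777216, 1, 1, 1}, the float versions return 16777216 in array order and 16777218 with two chunks, while both double versions return 16777219. The order of additions changes the float result, and the wider accumulator removes the difference for these values.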