I am working on a curve-fitting application. I generate a data set of 4096k elements of type float and store it in DDR3. With one core, computing the mean takes 0.59 sec, but this does not scale well with OpenMP.
I have attached the platform file. The memory configuration is:
Stack: L2
Data: MSMCSRAM_NOCACHE
Code: MSMCSRAM
L2 Cache: 256k
L1D Cache: 32k
L1P Cache: 32k
Is there any optimization that would give a scalable speedup? With 4 cores it takes 0.25 sec, and with 8 cores it takes 0.31 sec.
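
For reference, the mean computation being timed is presumably a parallel sum reduction over the float array. The following is only a minimal sketch of that kind of loop, not the actual code: the function name mean_f32, its parameters, the static schedule, and the double-precision accumulator are all assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (name and signature are illustrative only):
 * returns the mean of n floats using an OpenMP sum reduction.
 * Without OpenMP enabled, the pragma is ignored and the loop runs serially. */
float mean_f32(const float *data, size_t n)
{
    double sum = 0.0;   /* assumption: accumulate in double to limit rounding error */
    long i;             /* signed loop index, accepted by all OpenMP versions */

    if (n == 0) {
        return 0.0f;    /* avoid dividing by zero on an empty array */
    }

    /* Each thread sums a contiguous chunk into a private copy of sum;
     * the private copies are added together when the loop ends. */
    #pragma omp parallel for reduction(+:sum) schedule(static)
    for (i = 0; i < (long)n; i++) {
        sum += data[i];
    }

    return (float)(sum / (double)n);
}
```

A call would look like mean_f32(buffer, 4096 * 1024), where buffer is the array placed in DDR3. A loop like this does one addition per element loaded, so its run time may depend more on how fast the data can be read from DDR3 than on the number of cores.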