What is the proper way to calculate execution times within a single core for a multi-core application using OpenMP?
In a single-core application, I use the Timestamp_get32() function to count the cycles between lines of code. However, it doesn't seem to return the correct value for code inside the #pragma omp parallel private(nthreads, tid) block in my code.
I ran the multiplication for a Hanning window both inside and outside the OMP pragma and roughly got the results I was seeing on my single-core application. The times measured inside the pragma are around 6-8 times those measured outside it. Check out the simple C code below. There isn't anything fancy going on; it is built around the HelloWorld template for OpenMP. The time difference is the same whether I configure the application for 1 core, 4 cores, or 8 cores.
Can Timestamp_get32() be trusted within the pragma statement?
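As a portable cross-check, here is a minimal sketch of the same measurement that does not depend on Timestamp_get32(). It is not my original code: it swaps in omp_get_wtime(), the standard OpenMP wall-clock call, and falls back to clock() when built without OpenMP. The names now_seconds, apply_window, time_serial and time_parallel, and the sizes N and MAX_THREADS, are hypothetical and chosen only for illustration. Each thread times the window multiplication on its own private buffer, so the threads do not share data.

```c
#include <stddef.h>
#ifdef _OPENMP
#include <omp.h>
#else
#include <time.h>
#endif

#define N 1024          /* assumed window length, for illustration only */
#define MAX_THREADS 8   /* assumed upper bound on the team size */

/* Wall-clock seconds; omp_get_wtime() stands in for Timestamp_get32(). */
static double now_seconds(void)
{
#ifdef _OPENMP
    return omp_get_wtime();
#else
    return (double)clock() / CLOCKS_PER_SEC;
#endif
}

/* Multiply x by the window w in place and return the sum of the result,
 * so the compiler cannot discard the loop as dead code. */
static float apply_window(float *x, const float *w, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++) {
        x[i] *= w[i];
        sum += x[i];
    }
    return sum;
}

/* Time one window multiplication outside any parallel region. */
double time_serial(const float *w, float *checksum)
{
    float x[N];
    for (size_t i = 0; i < N; i++)
        x[i] = 1.0f;

    double t0 = now_seconds();
    *checksum = apply_window(x, w, N);
    return now_seconds() - t0;
}

/* Time the same multiplication once per thread inside a parallel region.
 * Returns the number of threads that actually ran. */
int time_parallel(const float *w, double elapsed[MAX_THREADS],
                  float checksum[MAX_THREADS])
{
    int used = 1;

    #pragma omp parallel num_threads(MAX_THREADS)
    {
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
        #pragma omp single
        used = omp_get_num_threads();
#endif
        float x[N];                 /* private to each thread */
        for (size_t i = 0; i < N; i++)
            x[i] = 1.0f;

        double t0 = now_seconds();  /* start and stop on the same thread */
        checksum[tid] = apply_window(x, w, N);
        elapsed[tid] = now_seconds() - t0;
    }
    return used;
}
```

Calling time_serial() and then time_parallel() with the same window gives one elapsed time outside the pragma and one per thread inside it, which can be compared against the cycle counts from Timestamp_get32(). Note that the OpenMP specification describes omp_get_wtime() as a per-thread time, so I only subtract values taken on the same thread.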