One of the first topics you stumble over when discussing multicore and performance is Amdahl's law. This law is often used in parallel computing to calculate the theoretical maximum speedup from using multiple cores. For example, if 50% of your code is serial and the other half can be executed in parallel, the maximum speedup versus a single-core implementation is a factor of two, no matter how many cores you use.
Personally, I believe it's good to be reminded that throwing more parallel computing power at a problem doesn't always speed up your code further. Still, the result that Amdahl's formula provides is too simplistic. Even though Amdahl's law is theoretically correct, the serial fraction cannot really be determined in practice. A simple example is speculative execution of serial code, which lets that code run in parallel: if the results of the speculative execution can be used, the code should be considered parallel; if not, it should be considered serial. Also, Amdahl's law does not take load balancing issues or synchronization overhead into account. Sometimes you might also need a temporary performance boost to achieve quick response times, and then fall back to a lower-performance, lower-power mode by putting the cores you don't need to sleep.
There are other laws, such as Gustafson's law, that take a different approach to estimating what parallel computing power can deliver. But is speedup really all we care about when we turn to multicore? Maybe not. A device should unleash its full multicore computing power while consuming a limited amount of electrical power. What more do you expect to get out of a multicore device?
Comments are welcome. I hope you enjoyed this read, and stay tuned for more.
Kind regards,
one and zero
re: "Amdahl's law does not take into account the load balancing issues or the synchronization overhead." Quite so.
Moreover, AL only accounts for *contention* effects in the sense of queueing theory, e.g., waiting for a lock, which is why it approaches a plateau (a speedup of 2 in your example, i.e., the inverse of the 1/2 seriality factor). Including "synchronization overhead" or, I would say, any kind of point-to-point exchange required to reach coherency or consistency, requires yet another parameter, as in the Universal Scalability Law (USL) en.wikipedia.org/.../Neil_J._Gunther. With coherency included, the USL subsumes AL. There is generally little virtue in modeling *retrograde* scalability beyond the scalability maximum; better to try to remove it.
I would also remark that GL was proposed at a time when people were frustrated by the AL bound, but GL scaled-sizing of workloads is exceedingly difficult to attain in practice. It certainly cannot be applied to most commercial workloads.
We have long applied the USL successfully as a *statistical regression* model in general www.perfdynamics.com/.../USLscalability.html, and a forthcoming book will address its application to multicores www.perfdynamics.com/books.html. I would be interested in analyzing more multicore data, if it were made available.
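To make the relationship between AL and the USL concrete, here is a minimal Python sketch. It assumes the commonly cited form of the USL, C(N) = N / (1 + alpha*(N - 1) + beta*N*(N - 1)), where alpha is the contention parameter and beta the coherency parameter. The function names and the parameter values are hypothetical, chosen only for illustration and not fitted to any measured data.

```python
import math


def usl_capacity(n: int, alpha: float, beta: float) -> float:
    """Relative capacity (speedup) on n cores under the assumed USL form.

    alpha: contention (serial) parameter, 0 <= alpha <= 1
    beta:  coherency parameter, beta >= 0; beta == 0 reduces to Amdahl's law
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    return n / (1.0 + alpha * (n - 1) + beta * n * (n - 1))


def usl_peak(alpha: float, beta: float) -> float:
    """Core count at which the assumed USL curve reaches its maximum.

    With beta == 0 there is no maximum, only the Amdahl plateau of 1/alpha.
    """
    if beta <= 0.0:
        return math.inf
    return math.sqrt((1.0 - alpha) / beta)


if __name__ == "__main__":
    alpha, beta = 0.1, 0.001  # illustrative values, not measurements
    print(f"peak near N = {usl_peak(alpha, beta):.1f} cores")
    for n in (1, 8, 30, 60, 120):
        print(f"N={n:4d}  USL={usl_capacity(n, alpha, beta):6.2f}"
              f"  Amdahl={usl_capacity(n, alpha, 0.0):6.2f}")
```

With a nonzero beta the curve turns over beyond the peak, which is the retrograde region mentioned above; with beta set to zero the same function gives the AL plateau.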
As far as AL is concerned, I feel it is a good way of quickly getting rough estimates of speedup. It has some deficiencies, as pointed out.
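For such a rough estimate, a few lines of Python are enough. This is only a sketch, assuming the usual statement of Amdahl's law, speedup = 1 / (s + (1 - s) / N), with s the serial fraction and N the number of cores; the function name is made up for illustration.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Theoretical speedup over one core for a given serial fraction."""
    if not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("serial_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    parallel_fraction = 1.0 - serial_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # The 50% serial example from the post: the speedup stays below 2x.
    for n in (1, 2, 4, 16, 1024):
        print(f"{n:5d} cores -> {amdahl_speedup(0.5, n):.3f}x")
```

The estimate ignores load balancing and synchronization overhead, so it is best read as an upper bound rather than a prediction.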
We expect not just fast computations but also diverse usage capability out of multicores, which would result in lower manufacturing costs for these devices.