Many studies and publications claim large speedups on High Performance Computing (HPC) hardware. In reality, almost any speedup can be reported, depending on the baseline used. What does this mean in practice? How should one interpret these numbers? And, more importantly, what baseline should results be compared to? To clear up some of the confusion, we look at how benchmark results can be skewed, and give our recommendations on how to choose a “fair baseline” and how to report benchmark results.
There are many ways to tweak benchmark results to make them look more favourable. For GPU benchmarks, for example, one could measure only the pure computation on the GPU for a single kernel execution, use single precision, or leave the CPU baseline unoptimised and intentionally inefficient, to name just a few. The article “Ten Ways to Trick the Masses when giving Performance Results on GPUs” lists different ways to “cheat” when publishing GPU benchmark results. Similar techniques can also be used for other types of HPC hardware (e.g. Intel Xeon Phi or FPGAs) or in other types of benchmarks, for example when comparing one acceleration framework to another.
Here are a few general guidelines we suggest for publishing benchmark results:
Use a reasonably-optimised baseline
It is certainly not fair to compare against an intentionally inefficient baseline, but it is also not worth fine-tuning the baseline code with intrinsics to squeeze out the absolute maximum performance. The baseline code should reflect the level of optimisation that an average user would perform, no more and no less. The same applies to the code it is compared against: spending months fine-tuning every last detail of the GPU code is not what the average user would do, and is therefore not a fair comparison either.
Prefer full user applications
Measuring only the speedup of the pure computation, without considering setup overhead, data transfers, synchronisation, etc., is not a fair comparison and is meaningless to users. Benchmarks should always look at the full application, starting the clock before the inputs are passed in and stopping it when the result is available and ready to use. This is what really matters to users, and it makes the comparison to the baseline fair.
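As a rough illustration of where the clock should start and stop, the sketch below times a compute stage on its own and then the full pipeline including transfers. The function names (`transfer_to_device`, `compute`, `transfer_to_host`) are hypothetical stand-ins written in plain Python, not calls into any real GPU library; on real hardware they would be replaced by the actual copies and kernel launches.

```python
import time


def timed(fn, *args):
    """Run fn(*args) once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical stand-ins for the stages of an accelerated application.
def transfer_to_device(data):
    return list(data)  # stands in for a host-to-device copy


def compute(device_data):
    return [x * x for x in device_data]  # stands in for the kernel itself


def transfer_to_host(device_result):
    return list(device_result)  # stands in for a device-to-host copy


def full_application(data):
    """Everything the user waits for: copy in, compute, copy out."""
    return transfer_to_host(compute(transfer_to_device(data)))


data = list(range(100_000))

# "Kernel only": inputs are already on the device when the clock starts.
device_data = transfer_to_device(data)
_, kernel_only = timed(compute, device_data)

# "End to end": clock starts before inputs are passed, stops when the
# result is ready to use on the host.
result, end_to_end = timed(full_application, data)

print(f"kernel only: {kernel_only:.6f} s")
print(f"end to end:  {end_to_end:.6f} s")
```

The end-to-end figure is the one to report. With a real GPU framework, kernel launches are typically asynchronous, so the device must also be synchronised before the clock is stopped.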
Clearly state the compilers and environment used
Performance is affected not only by the system’s hardware configuration, but also by the software and tools used. For example, moving from Visual C++ 2013 to 2015 can make a huge performance difference on some code, and the compiler flags used are equally important. As a particular example, the “fast math” optimisation, which allows compilers to violate some IEEE floating point rules for better performance, can make a big difference. The same applies to the operating system, the C library and other libraries used, GPU driver versions, CUDA toolkit versions, etc. Vendors constantly improve performance and the optimisations applied in new versions of tools and drivers, so it is important that the toolset used is clearly stated alongside the benchmark results.
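One way to make this routine is to capture the environment programmatically and publish it next to the timings. The sketch below uses only Python's standard `platform` module; the helper name `describe_environment` and the keys passed in `extra` are assumptions for illustration, and details such as compiler flags or driver versions have to be filled in by whoever runs the benchmark, since they cannot be detected this way.

```python
import json
import platform


def describe_environment(extra=None):
    """Collect basic system details to publish with benchmark results."""
    info = {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "python_version": platform.python_version(),
        "python_implementation": platform.python_implementation(),
        "python_compiler": platform.python_compiler(),
    }
    # Details that cannot be detected here are supplied by the caller.
    info.update(extra or {})
    return info


# Placeholder values: replace with what was actually used for the run.
env = describe_environment({
    "compiler_flags": "<flags used to build the benchmark>",
    "gpu_driver_version": "<driver version>",
    "cuda_toolkit_version": "<toolkit version>",
})

print(json.dumps(env, indent=2, sort_keys=True))
```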
Clearly mention the baseline used
Overall, it doesn’t really matter which baseline you use, as long as you state it clearly. If you compare to a sequential, unoptimised baseline, that is fine, but indicate what level of optimisation it went through when publishing benchmarks.
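To see why the baseline has to be named, the sketch below computes the speedup of one accelerated run against several differently optimised baselines. All timings are invented purely for illustration and do not come from any measurement; `speedup_report` is a hypothetical helper.

```python
def speedup_report(accelerated_seconds, baselines):
    """Return {baseline description: baseline time / accelerated time}."""
    if accelerated_seconds <= 0:
        raise ValueError("accelerated_seconds must be positive")
    return {
        description: seconds / accelerated_seconds
        for description, seconds in baselines.items()
    }


# Invented timings in seconds, for illustration only.
baselines = {
    "sequential, unoptimised": 120.0,
    "sequential, compiler-optimised": 40.0,
    "multi-threaded, compiler-optimised": 8.0,
}

report = speedup_report(2.0, baselines)

# Always print the baseline description next to the number.
for description, speedup in report.items():
    print(f"{speedup:.1f}x vs {description}")
```

With these made-up numbers, the same accelerated run yields a speedup of 60, 20 or 4 depending on the baseline, which is why a speedup figure without its baseline description tells the reader very little.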