Nvidia’s Pascal generation GPUs, in particular the flagship compute-grade Tesla P100, are said to be a game-changer for compute-intensive applications. Compared to the Kepler generation flagship Tesla K80, the P100 provides 1.6x the double-precision GFLOPs. The P100’s stacked memory offers 3x the memory bandwidth of each K80 GPU (720 GB/s versus 240 GB/s), an important factor for memory-intensive applications. In dense GPU configurations, i.e. 2-4 GPUs per machine, NVLink can offer a 3x performance boost in GPU-GPU communication compared to traditional PCI Express.
In the following, we compare the performance of the Tesla P100 to the previous Tesla K80 card using selected applications from the Xcelerit Quant Benchmarks.
The table below shows the key hardware differences between the two cards.
| Processor | Cores | CUDA Cores | Frequency | GFLOPs (double)¹ | Memory | Memory B/W |
|---|---|---|---|---|---|---|
| NVIDIA Tesla K80 GPU (Kepler) | 2 x 13 (SMX) | 2 x 2,496 | 562 MHz | 2 x 1,455 | 2 x 12 GB | 2 x 240 GB/s |
| NVIDIA Tesla P100 GPU (Pascal) | 56 (SM) | 3,584 | 1,126 MHz | 4,670 | 16 GB | 720 GB/s |

¹ Note that the FLOPs are calculated by assuming purely fused multiply-add (FMA) instructions and counting those as 2 operations (even though they map to just a single processor instruction).
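The ratios quoted in the text can be reproduced from the hardware table above. The sketch below also applies the FMA counting rule from the footnote. The FP64 unit count (1,792) and boost clock (1.303 GHz) used for the P100 estimate are assumptions, not values from the table, which lists 1,126 MHz; they are chosen because they appear consistent with the 4,670 GFLOPs figure.

```python
# Peak figures copied from the hardware table (double precision).
K80_GFLOPS_PER_GPU = 1455.0  # the K80 card carries two such GPUs
K80_BW_PER_GPU = 240.0       # GB/s
P100_GFLOPS = 4670.0
P100_BW = 720.0              # GB/s


def peak_gflops(fp64_units: int, clock_ghz: float) -> float:
    """Peak GFLOPs if every FP64 unit retires one FMA (2 FLOPs) per cycle."""
    return 2.0 * fp64_units * clock_ghz


# Assumed inputs, not taken from the table: 1,792 FP64 units at 1.303 GHz.
p100_estimate = peak_gflops(1792, 1.303)

# Card-versus-card FLOPs ratio, and bandwidth ratios on both bases.
flops_ratio = P100_GFLOPS / (2 * K80_GFLOPS_PER_GPU)
bw_ratio_per_gpu = P100_BW / K80_BW_PER_GPU
bw_ratio_per_card = P100_BW / (2 * K80_BW_PER_GPU)

print(f"P100 peak estimate: {p100_estimate:.0f} GFLOPs")
print(f"FLOPs ratio (P100 vs K80 card): {flops_ratio:.2f}x")
print(f"Bandwidth ratio (P100 vs one K80 GPU): {bw_ratio_per_gpu:.2f}x")
print(f"Bandwidth ratio (P100 vs K80 card): {bw_ratio_per_card:.2f}x")
```

Note that the 3x bandwidth figure holds against a single K80 GPU; against both GPUs of the card combined, the ratio is 1.5x.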
Xcelerit Quant Benchmarks
The peak GFLOPs given above are rarely reached in real-world applications. Beyond compute instructions, many other factors influence performance, such as memory and cache latencies, thread synchronisation, instruction-level parallelism, GPU occupancy, and branch divergence. To give an indication of real-world performance, we use selected applications from the Xcelerit Quant Benchmarks, a representative set of applications widely used in Quantitative Finance. These applications have been hand-tuned for maximum performance as native implementations by code optimisation experts, often in collaboration with the relevant processor maker.
| Financial Instrument | Numerical Method | Description | Parameters |
|---|---|---|---|
| LIBOR Swaption Portfolio | Monte-Carlo | Prices a portfolio of LIBOR swaptions on a LIBOR Market Model and computes sensitivities. | |
| American Options | Binomial Lattice | Prices a batch of American call options under the Black-Scholes model using a Binomial lattice (Cox, Ross and Rubinstein method). | |
| European Options | Closed form | Prices a batch of European call and put options using the Black-Scholes-Merton formula. We repeat the formula 100 times to increase the overall runtime for performance measurements. | |
| Barrier Options | Monte-Carlo | Prices a portfolio of up-and-in barrier options under the Black-Scholes model using a Monte-Carlo simulation. | |
Selected Applications from the Xcelerit Quant Benchmarks
We compare the performance of each application on the K80 and P100 cards. The system configuration is as follows:
- CPU: 2 sockets, Haswell (Intel Xeon E5-2698 v3)
- GPU: NVIDIA Tesla K80 and NVIDIA Tesla P100 (ECC on)
- OS: RedHat Enterprise Linux 7.2 (64bit)
- RAM: 128GB (K80 system) and 256GB (P100 system)
- CUDA Version: 8.0
- CPU Backend Compiler: GCC 4.8
- GPU clock: maximum boost
- Precision: double
To measure the performance, the application is executed repeatedly, recording the wall-clock time for each run, until the estimated timing error is below a specified value. The measurement covers the full algorithm execution time from inputs to outputs, including setup of the GPU and data transfers. We report the speedup versus a sequential implementation on a single CPU core, averaged over varying numbers of paths or options.
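The measurement procedure described above can be sketched as follows. This is a minimal illustration, not the benchmark harness itself: the source does not state which error estimate is used, so the standard error of the mean is an assumption, and the names `measure` and `speedup` and the thresholds `rel_error`, `min_runs` and `max_runs` are hypothetical.

```python
import math
import statistics
import time
from typing import Callable, List


def measure(run: Callable[[], None],
            rel_error: float = 0.02,
            min_runs: int = 5,
            max_runs: int = 1000) -> float:
    """Return the mean wall-clock time of `run` in seconds.

    Repeats `run` until the standard error of the mean falls below
    `rel_error` times the mean, or until `max_runs` is reached.
    """
    samples: List[float] = []
    while len(samples) < max_runs:
        start = time.perf_counter()
        run()  # full execution: setup, transfers, and compute
        samples.append(time.perf_counter() - start)
        if len(samples) >= min_runs:
            mean = statistics.fmean(samples)
            sem = statistics.stdev(samples) / math.sqrt(len(samples))
            if sem <= rel_error * mean:
                break
    return statistics.fmean(samples)


def speedup(baseline_seconds: float, accelerated_seconds: float) -> float:
    """Speedup of the accelerated run relative to the baseline run."""
    return baseline_seconds / accelerated_seconds
```

With the sequential single-core time as `baseline_seconds` and the GPU time as `accelerated_seconds`, `speedup` gives the kind of figure reported here.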
We observe that the P100 gives a boost of between 1.3x and 2.3x over the K80 (1.7x on average). This high variation in speedup across applications can be explained by their different characteristics, in particular the ratio of compute instructions to memory access operations. In peak performance, the P100 has 1.6x the double-precision FLOPs and 3x the memory bandwidth of the K80 GPU.
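The quoted average can be cross-checked against the approximate per-application speedups given in the rest of this section. The sketch assumes that "Black-Scholes option pricers" refers to the closed-form European options benchmark, and it uses the rounded figures from the text rather than the measured values.

```python
from statistics import fmean

# Approximate P100-over-K80 speedups as quoted in this section (rounded).
speedups = {
    "LIBOR swaption portfolio (Monte-Carlo)": 1.3,
    "European options (closed form)": 1.3,  # assumed mapping, see lead-in
    "American options (binomial lattice)": 2.3,
    "Barrier options (Monte-Carlo)": 1.8,
}

lowest = min(speedups.values())
highest = max(speedups.values())
average = fmean(speedups.values())

print(f"range {lowest}x to {highest}x, mean {average:.2f}x")
```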
Both the LIBOR swaption portfolio and Black-Scholes option pricers are heavy in compute instructions and need fewer memory accesses. These applications therefore benefit mainly from the increased GFLOPs and less from the memory bandwidth improvement, which explains the speedup of around 1.3x over the K80.
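To illustrate why closed-form pricing is dominated by compute, here is a minimal scalar sketch of the Black-Scholes-Merton formula for a non-dividend-paying underlying. It is not the benchmark's tuned implementation; the function and parameter names are illustrative. Each option reads five inputs and writes two outputs, with logarithm, exponential, square root and error function evaluations in between.

```python
import math


def norm_cdf(x: float) -> float:
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def black_scholes(spot: float, strike: float, rate: float,
                  vol: float, maturity: float) -> tuple:
    """Return (call, put) prices of a European option, no dividends."""
    sqrt_t = math.sqrt(maturity)
    d1 = (math.log(spot / strike)
          + (rate + 0.5 * vol * vol) * maturity) / (vol * sqrt_t)
    d2 = d1 - vol * sqrt_t
    discount = math.exp(-rate * maturity)
    call = spot * norm_cdf(d1) - strike * discount * norm_cdf(d2)
    put = strike * discount * norm_cdf(-d2) - spot * norm_cdf(-d1)
    return call, put


call, put = black_scholes(spot=100.0, strike=100.0, rate=0.05,
                          vol=0.2, maturity=1.0)
print(f"call {call:.4f}, put {put:.4f}")
```

On a GPU, one thread would typically price one option, so the work per thread is almost entirely arithmetic.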
The Binomial American option pricer is memory-intensive, making heavy use of global memory, shared memory, and cache. It also relies heavily on thread synchronisation operations. The performance of these operations has increased significantly on the P100, which explains the highest gain, 2.3x.
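The memory and synchronisation pressure follows from the structure of the lattice method. Below is a minimal sequential sketch of a Cox-Ross-Rubinstein lattice for an American call, with illustrative names; the benchmark's actual GPU implementation is not shown in the source. Every backward step reads and rewrites the whole array of node values, and step `k` cannot start before step `k + 1` has finished, which is where a parallel version would need to synchronise its threads.

```python
import math


def american_call_crr(spot: float, strike: float, rate: float,
                      vol: float, maturity: float, steps: int) -> float:
    """Price an American call on a Cox-Ross-Rubinstein binomial lattice."""
    dt = maturity / steps
    up = math.exp(vol * math.sqrt(dt))
    down = 1.0 / up
    growth = math.exp(rate * dt)
    prob_up = (growth - down) / (up - down)  # risk-neutral up probability
    discount = 1.0 / growth

    # Payoffs at expiry; node j has seen j up-moves and steps - j down-moves.
    values = [max(spot * up ** j * down ** (steps - j) - strike, 0.0)
              for j in range(steps + 1)]

    # Backward induction: each level depends on the one after it.
    for step in range(steps - 1, -1, -1):
        for j in range(step + 1):
            continuation = discount * (prob_up * values[j + 1]
                                       + (1.0 - prob_up) * values[j])
            exercise = spot * up ** j * down ** (step - j) - strike
            values[j] = max(continuation, exercise)
    return values[0]


price = american_call_crr(100.0, 100.0, 0.05, 0.2, 1.0, steps=500)
print(f"American call: {price:.4f}")
```

The in-place update is safe here because nodes are visited in ascending order; a parallel version would need a barrier or a second buffer between levels.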
The Monte-Carlo Barrier options application benefits from both the compute and memory performance increases to some extent, resulting in a speedup of around 1.8x.
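For reference, a Monte-Carlo pricer for a single up-and-in barrier call under the Black-Scholes model can be sketched as below. This is a plain sequential illustration with hypothetical names and discrete barrier monitoring, not the benchmark code. Each path mixes arithmetic (the exponential and the random number generation) with repeated state updates, which is consistent with the application gaining from both the compute and the memory improvements.

```python
import math
import random


def up_and_in_call_mc(spot: float, strike: float, barrier: float,
                      rate: float, vol: float, maturity: float,
                      steps: int, paths: int, seed: int = 0) -> float:
    """Estimate an up-and-in call price by simulating geometric Brownian motion.

    The barrier is checked at the start and at each of the `steps` time points.
    """
    rng = random.Random(seed)
    dt = maturity / steps
    drift = (rate - 0.5 * vol * vol) * dt
    diffusion = vol * math.sqrt(dt)

    payoff_sum = 0.0
    for _path in range(paths):
        price = spot
        knocked_in = price >= barrier
        for _step in range(steps):
            price *= math.exp(drift + diffusion * rng.gauss(0.0, 1.0))
            knocked_in = knocked_in or price >= barrier
        if knocked_in:  # the option only pays if the barrier was reached
            payoff_sum += max(price - strike, 0.0)
    return math.exp(-rate * maturity) * payoff_sum / paths


estimate = up_and_in_call_mc(spot=100.0, strike=100.0, barrier=120.0,
                             rate=0.05, vol=0.2, maturity=1.0,
                             steps=10, paths=20000)
print(f"up-and-in call estimate: {estimate:.4f}")
```

On a GPU, paths are independent and map naturally to threads, with a final reduction to form the average.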