In this post, we compare the performance of the Nvidia Tesla P100 (Pascal) GPU with the brand-new V100 GPU (Volta) for recurrent neural networks (RNNs) using TensorFlow, for both training and inference.

## Recurrent Neural Networks (RNNs)

Most financial applications for deep learning involve time-series data as inputs. Examples include the development of a stock price over time as input for an algorithmic trading predictor, or the development of revenue as input for a default probability predictor. Recurrent Neural Networks (RNNs) are well suited to learning temporal dependencies, both long and short term, and are therefore a natural choice for the task.

The figure below depicts one neuron in an RNN. The output of a neuron depends not only on the current input but also on the previous state stored in the network (the feedback loop). It is this loop that enables RNNs to learn temporal dependencies.
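The feedback loop can be made concrete with a minimal sketch of a single recurrent unit. It assumes the common tanh formulation, in which the new state is `tanh(w_x * x_t + w_h * h_prev + b)`. The function name `rnn_step` and all weight values are illustrative and not taken from the benchmark.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One time step of a scalar vanilla RNN unit (assumed tanh activation)."""
    # The new state mixes the current input with the previous state (the feedback loop).
    return math.tanh(w_x * x_t + w_h * h_prev + b)


# Illustrative weights and a short input sequence (hypothetical values).
w_x, w_h, b = 0.5, 0.8, 0.0
inputs = [1.0, 0.0, 0.0, 0.0]

h = 0.0  # initial state
states = []
for x_t in inputs:
    h = rnn_step(x_t, h, w_x, w_h, b)
    states.append(h)

# Only the first input is non-zero, yet every later state is still non-zero:
# the information is carried forward through the recurrent connection.
print(states)
```

In this toy setting the influence of the first input decays step by step, which already hints at why long-range dependencies are hard for plain RNNs.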

These RNN neurons are organised in layers and stacked to form a deep learning RNN model, as shown in the figure below.

RNNs have difficulties learning long-term dependencies in the inputs due to the vanishing / exploding gradient problem. This occurs during the back-propagation algorithm used for training, where gradients of the cost function are calculated backwards from outputs to inputs. Due to the feedback loop, small gradients vanish quickly and large gradients increase dramatically.
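As a rough illustration of this effect, the sketch below assumes a strongly simplified model in which every time step scales the back-propagated gradient by the same constant factor. Real networks multiply by a different Jacobian at each step, so the helper `backprop_factor` and the gains 0.5 and 1.5 are purely hypothetical.

```python
def backprop_factor(recurrent_gain, steps):
    """Scale applied to a gradient after flowing back through `steps` time steps,
    assuming each step multiplies it by the same constant `recurrent_gain`."""
    return recurrent_gain ** steps


steps = 32  # sequence length used in the benchmark later in this post
shrinking = backprop_factor(0.5, steps)  # gain below 1: gradient vanishes
growing = backprop_factor(1.5, steps)    # gain above 1: gradient explodes

print(f"gain 0.5 over {steps} steps: {shrinking:.3e}")
print(f"gain 1.5 over {steps} steps: {growing:.3e}")
```

Even over a modest 32 steps, the two cases differ by many orders of magnitude, which is what makes training on long sequences unstable.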

The vanishing gradient problem prevents RNNs from learning longer-term temporal dependencies. Long Short-Term Memory networks (LSTMs) are a specialised form of RNN designed to bypass this problem. They introduce an input gate, a forget gate, an input modulation gate, and a memory unit. These allow LSTMs to learn highly complex long-term dynamics in the input data, which makes them well suited to learning from financial time series. Further, LSTMs can be stacked into multiple layers to learn even more complex dynamics.

The computational complexity of a deep RNN scales linearly with the number of layers employed (assuming they are of the same size). A single RNN or LSTM layer can therefore be seen as the fundamental building block for deep RNNs in quantitative finance, which is why we benchmark the performance of one such layer below.
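The linear scaling can be sketched with a simple operation count. The estimate below is an assumption-laden approximation: it counts only the matrix multiplications, treats each multiply-add as 2 FLOPs, ignores biases and activation functions, and assumes the input size equals the hidden size so that all layers are identical. The function names are hypothetical.

```python
def matmul_flops_per_step(hidden, inputs, gates=1):
    """Approximate matrix-multiply FLOPs for one time step of one sample in one layer.

    gates=1 models a vanilla RNN, gates=4 an LSTM (four gate pre-activations).
    Each multiply-add is counted as 2 FLOPs; biases and activations are ignored.
    """
    return 2 * gates * hidden * (inputs + hidden)


def stack_flops_per_step(layers, hidden, gates=1):
    """Total for `layers` identical layers, assuming input size equals hidden size."""
    return layers * matmul_flops_per_step(hidden, hidden, gates)


for layers in (1, 2, 4):
    print(layers, stack_flops_per_step(layers, hidden=1024, gates=4))
```

Doubling the number of layers doubles the estimated cost, so the timing of a single layer gives a first-order estimate for a deeper stack.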

## Hardware Comparison

The table below shows the key hardware differences between Nvidia’s P100 and V100 GPUs.

Processor | SMs | CUDA Cores | Tensor Cores | Frequency | TFLOPs (double)^{1} | TFLOPs (single)^{1} | TFLOPs (half/Tensor)^{1,2} | Cache | Max. Memory | Memory B/W |
---|---|---|---|---|---|---|---|---|---|---|
Nvidia P100 PCIe (Pascal) | 56 | 3,584 | N/A | 1,126 MHz | 4.7 | 9.3 | 18.7 | 4 MB L2 | 16 GB | 720 GB/s |
Nvidia V100 PCIe (Volta) | 80 | 5,120 | 640 | 1,530 MHz | 7 | 14 | 112 | 6 MB L2 | 16 GB | 900 GB/s |

^{1}Note that the FLOPs are calculated by assuming purely fused multiply-add (FMA) instructions and counting those as 2 operations (even though they map to just a single processor instruction).

^{2}On P100, half-precision (FP16) FLOPs are reported. On V100, tensor FLOPs are reported, which run on the Tensor Cores in mixed precision: a matrix multiplication in FP16 and accumulation in FP32 precision.

Perhaps the most interesting hardware feature of the V100 GPU in the context of deep learning is its *Tensor Cores*. These are specialised cores that can compute a 4×4 matrix multiplication in half-precision and accumulate the result to a single-precision (or half-precision) 4×4 matrix – *in one clock cycle*. This means one Tensor Core can perform 128 FLOPs per clock cycle, and a Streaming Multiprocessor (SM) with 8 Tensor Cores can do 1024 FLOPs/cycle. This is 8x faster than using the regular single precision CUDA cores. To benefit from this specialised hardware, deep learning models should be written in mixed precision (half and single), or purely in half precision and leverage a deep learning framework which efficiently uses V100 Tensor Cores.
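The per-cycle figures above follow from simple arithmetic, reproduced in the sketch below using the numbers from the text and the hardware table. The final peak-throughput line is an assumption on our part: it uses the 1,370 MHz GPU clock listed in the benchmark configuration later in this post, which happens to reproduce the 112 TFLOPs quoted in the table.

```python
# Figures taken from the text and the hardware table.
MATRIX_DIM = 4                     # Tensor Cores operate on 4x4 matrices
TENSOR_CORES_PER_SM = 8
SMS = 80
CUDA_CORES = 5120
FP32_CORES_PER_SM = CUDA_CORES // SMS  # 64 CUDA cores per SM

# A 4x4 by 4x4 product needs 4*4*4 = 64 multiply-adds, each counted as 2 FLOPs.
flops_per_tensor_core = 2 * MATRIX_DIM ** 3
flops_per_sm_tensor = TENSOR_CORES_PER_SM * flops_per_tensor_core

# Regular FP32 path: one fused multiply-add (2 FLOPs) per CUDA core per cycle.
flops_per_sm_fp32 = 2 * FP32_CORES_PER_SM
speedup = flops_per_sm_tensor / flops_per_sm_fp32

# Assumed clock (from the benchmark configuration, not the hardware table).
ASSUMED_CLOCK_HZ = 1370e6
peak_tflops = SMS * flops_per_sm_tensor * ASSUMED_CLOCK_HZ / 1e12

print(flops_per_tensor_core, flops_per_sm_tensor, speedup, round(peak_tflops, 1))
```

These are theoretical peaks. They are only reached when the workload consists almost entirely of suitably sized half-precision matrix multiplications.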

## TensorFlow

TensorFlow is a Google-maintained open source software library for numerical computation using data flow graphs, primarily used for machine learning applications. It allows computations to be deployed to one or more CPUs or GPUs in a desktop, server, or mobile device. Users describe their machine learning models and training algorithms in Python, and TensorFlow maps these to a computation graph whose nodes are implemented in C++, CUDA, or OpenCL. RNNs and LSTMs are supported natively in TensorFlow.

As of version 1.4 (released in November 2017), half precision (FP16) data type support has been added and the GPU backend has been configured to use the V100 Tensor Cores for half or mixed-precision matrix multiplications. In addition to the 1.4 mainline release, Nvidia maintains a custom and optimised version as a Docker container in their GPU Cloud (NGC) Docker registry. The latest version of this container is 17.11. For best performance, we used this NGC container for our benchmarks.

## Benchmark Setup

We benchmark the performance of a single layer network for varying hidden sizes for both vanilla RNNs (using TensorFlow’s `BasicRNNCell`) and LSTMs (using TensorFlow’s `BasicLSTMCell`). The weights are initialised randomly and we use random input sequences for benchmarking purposes.

We compare the performance on the Pascal and Volta GPUs, with the system configuration given below:

Component | Pascal System | Volta System |
---|---|---|
CPU | 2 x Intel Xeon E5-2680 v3 | 2 x Intel Xeon E5-2686 v4 |
GPU | Nvidia Tesla P100 PCIe | Nvidia Tesla V100 PCIe |
OS | RedHat Enterprise Linux 7.4 | RedHat Enterprise Linux 7.4 |
RAM | 64 GB | 128 GB |
NGC TensorFlow | 17.11 | 17.11 |
Clock Boost | GPU: 1328 MHz, memory: 715 MHz | GPU: 1370 MHz, memory: 1750 MHz |
ECC | on | on |

## Performance

To measure the performance, the application is executed repeatedly, recording the wall-clock time for each run, until the estimated timing error is below a specified value. The measurements cover the full algorithm execution time (training using gradient descent, and inference) for 100,000 batches of input data, with a batch size of 128 and a sequence length of 32 samples. This gives approximately 13 million training samples (100,000 × 128 = 12.8 million), which is equivalent, for example, to taking the daily closing prices of 5,000 stocks over 10 years (assuming 250 business days per year, i.e. 12.5 million data points) and using overlapping windows for the sequences. A deep learning predictor would look 32 days into the past to predict the future, e.g. stock movements or the probability of default. We vary the number of RNN/LSTM units in the hidden layer for the benchmark.
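The repeat-until-converged timing approach can be sketched as follows. This is not the harness used for the results in this post. It assumes that the "estimated timing error" is the standard error of the mean run time relative to the mean, and the names `benchmark` and `dummy_workload` as well as the thresholds are hypothetical.

```python
import math
import statistics
import time


def benchmark(workload, rel_error=0.05, min_runs=5, max_runs=100):
    """Run `workload` repeatedly until the relative standard error of the mean
    wall-clock time drops below `rel_error`, or until `max_runs` is reached."""
    times = []
    while len(times) < max_runs:
        start = time.perf_counter()
        workload()
        times.append(time.perf_counter() - start)
        if len(times) >= min_runs:
            mean = statistics.mean(times)
            std_err = statistics.stdev(times) / math.sqrt(len(times))
            if std_err <= rel_error * mean:
                break
    return statistics.mean(times), len(times)


def dummy_workload():
    # Stand-in for the real training or inference run (hypothetical).
    sum(i * i for i in range(20_000))


mean_time, runs = benchmark(dummy_workload)
print(f"mean wall-clock time {mean_time:.6f} s over {runs} runs")
```

Stopping on an error estimate rather than a fixed run count keeps short, noisy configurations from being under-sampled while avoiding unnecessary repeats of long, stable ones.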

### Training

The figures below show the speedup of the V100 vs the P100 GPU in training mode for vanilla RNNs and LSTMs, using the NGC container, for both single precision (FP32) and half precision (FP16). The number of hidden units is given in the chart.

#### RNN Training Performance

#### LSTM Training Performance

### Inference

The figures below show the speedup of the V100 vs the P100 GPU in inference mode for vanilla RNNs and LSTMs, using the NGC container, for both single precision (FP32) and half precision (FP16). The number of hidden units is given in the chart.

#### RNN Inference Performance

#### LSTM Inference Performance

### Discussion

For the tested RNN and LSTM deep learning applications, we notice that the relative performance of V100 vs. P100 increases with network size (128 to 1024 hidden units) and complexity (RNN to LSTM). We record a maximum speedup in FP16 precision mode of 2.05x for V100 compared to the P100 in training mode, and 1.72x in inference mode. These figures are several times lower than the performance expected for the V100 based on its hardware specifications.

The reason for this disappointing performance is that the powerful Tensor Cores in the V100 are only used for matrix multiplications in half-precision (FP16) or mixed-precision mode. Profiling the tested applications showed that matrix multiplications account for only around 20% of the overall training time in the LSTM case, and even less in the other configurations. The other operations (such as softmax and scalar products) cannot use the Tensor Cores. This is in contrast to the convolutional networks used for image recognition, for example, where the runtime is dominated by large matrix multiplications, which can make full use of the Tensor Cores.
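The limit this places on the Tensor Core benefit can be estimated with Amdahl's law. The sketch below is a back-of-the-envelope calculation, not a measurement. It takes the 20% matrix-multiplication share from the profiling result above and the 8x per-SM Tensor Core advantage quoted in the hardware section, and it assumes that nothing else in the run gets faster.

```python
def amdahl_speedup(accelerated_fraction, factor):
    """Overall speedup when only `accelerated_fraction` of the runtime
    becomes `factor` times faster (Amdahl's law)."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / factor)


matmul_fraction = 0.20   # share of LSTM training time in matrix multiplications
tensor_core_gain = 8.0   # per-SM Tensor Core advantage over FP32 CUDA cores

with_tensor_cores = amdahl_speedup(matmul_fraction, tensor_core_gain)
upper_bound = amdahl_speedup(matmul_fraction, float("inf"))

print(f"8x faster matmuls:         {with_tensor_cores:.2f}x overall")
print(f"infinitely fast matmuls:   {upper_bound:.2f}x overall")
```

Under these assumptions, even infinitely fast matrix multiplications would yield at most a 1.25x overall gain from the Tensor Cores. The measured speedups are higher than this, presumably because the V100's other improvements in the hardware table (more SMs, higher memory bandwidth) also accelerate the remaining operations.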

While V100 displays impressive hardware improvements compared to P100, some deep learning applications, such as RNNs dealing with financial time series, might not be able to fully exploit the very specialised hardware in the V100, and hence will only get a limited performance boost.