Large financial services companies have vast compute resources available, organised into computing grids (i.e., federations of compute resources working towards a common goal). But, as Murex’s Pierre Spatz explains [1], they don’t use them as supercomputers. “They use them as grids of small computers. It means that most of their code is in fact mono-threaded.” In contrast, the world’s top supercomputing sites [2] often use clusters of machines in a very different – and more efficient – way. In the following, we explain why, and illustrate how financial services firms can improve the efficiency of their existing compute grids. We look only at compute-intensive workloads, such as XVA, capital, risk management and derivatives pricing.
Most of the code used today in financial services is mono-threaded. This is partly for historical reasons (legacy code from the days of single-core processors) and partly because mono-threaded code is simpler to develop and maintain. For large parallel computation tasks, for example when computing XVAs, each processor core is given a subset of the trade portfolio to be valued under a set of market scenarios. The individual results from each core are aggregated later, gathered through files or network communication.
Supercomputers [2], used for example in energy exploration or weather forecasting, are viewed as single machines running distributed applications. The applications themselves are multi-threaded and/or multi-process, using message passing [3], thread synchronisation, or other techniques to communicate between parallel tasks.
But what difference does this make to the performance and efficiency of financial applications?
Running individual mono-threaded processes on each processor core means that each process has its own memory space, and all common data is replicated in each process. In an XVA computation, for example, the instrument data and scenario data used by the processes are largely the same, yet each process stores its own copy. This results in higher memory usage than necessary, which can be substantial in risk computations. It can also lead to congestion on the memory bus, as many independent memory requests have to be served by the system.
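To make the replication cost concrete, here is a back-of-the-envelope sketch comparing the two memory footprints. All figures (4 GB of shared scenario and instrument data, 0.5 GB of private working memory per worker, 16 cores) are hypothetical assumptions chosen for illustration, not measurements from a real system.

```python
def grid_memory_gb(shared_gb, private_gb, workers):
    """Multi-process grid: every process holds its own copy of the shared data."""
    return workers * (shared_gb + private_gb)


def threaded_memory_gb(shared_gb, private_gb, workers):
    """Multi-threaded: one copy of the shared data plus per-thread working memory."""
    return shared_gb + workers * private_gb


# Hypothetical figures, for illustration only
SHARED_GB = 4.0    # scenarios and instrument data, identical for all workers
PRIVATE_GB = 0.5   # working memory private to each worker
WORKERS = 16       # one worker per core

grid_total = grid_memory_gb(SHARED_GB, PRIVATE_GB, WORKERS)
threaded_total = threaded_memory_gb(SHARED_GB, PRIVATE_GB, WORKERS)

print(f"multi-process grid: {grid_total:.1f} GB")      # 16 * (4.0 + 0.5)
print(f"multi-threaded:     {threaded_total:.1f} GB")  # 4.0 + 16 * 0.5
```

Under these assumptions the footprint scales with the number of cores in the grid model, whereas the shared data is counted only once in the multi-threaded model.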
As shown in the figure below, CPU memory access occurs through multiple levels of cache, the last of which is shared between the cores. If each process runs mono-threaded on its own core and with its own independent memory, it cannot benefit from data that another process has already brought into the last-level cache. As cache access is typically 10–100x faster than main memory access, this results in a major inefficiency for applications which could share data (e.g. the scenarios or instrument data in an XVA Monte-Carlo calculation).
Because the individual processes tend to operate on large chunks of work (i.e., parallelism with a coarse granularity), load-balancing between processes is difficult to achieve. One process might work on a smaller counterparty than another, causing it to finish its computation earlier and leaving its processor core idle. This results in under-utilised compute resources.
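The effect of granularity on load balance can be sketched with a small scheduling simulation. It compares a static up-front split of counterparties across cores with a shared work queue from which each core takes the next counterparty as soon as it is free. The per-counterparty costs and the core count are hypothetical, and the simulation is a simplified model, not a measurement of any real grid.

```python
import heapq


def static_makespan(costs, workers):
    """Coarse-grained grid: tasks are split into contiguous chunks up front."""
    chunk = -(-len(costs) // workers)  # ceiling division
    return max(sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk))


def dynamic_makespan(costs, workers):
    """Shared work queue: the first worker to become free takes the next task."""
    finish_times = [0] * workers       # min-heap of worker finish times
    heapq.heapify(finish_times)
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


# Hypothetical compute cost per counterparty: two large ones, ten small ones
costs = [9, 6] + [1] * 10
WORKERS = 4

static_time = static_makespan(costs, WORKERS)
dynamic_time = dynamic_makespan(costs, WORKERS)

# Utilisation = useful work / (workers * elapsed time)
static_util = sum(costs) / (WORKERS * static_time)
dynamic_util = sum(costs) / (WORKERS * dynamic_time)

print(f"static split:  time {static_time}, utilisation {static_util:.0%}")
print(f"dynamic queue: time {dynamic_time}, utilisation {dynamic_util:.0%}")
```

In this toy example the static split leaves three of the four cores idle for most of the run, because one chunk happens to contain both large counterparties. The dynamic queue finishes as soon as the largest single counterparty is done.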
For aggregation, the individual results need to be stored and gathered. Depending on the application, this might be done by saving intermediate data to files or through network communication. This can add significant overhead compared to multi-threaded implementations, where the intermediate data can be shared in memory, and hence also in the processor caches.
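A minimal sketch of in-memory aggregation follows: each thread sums its share of the trades into a private buffer, then merges it once into a shared result under a lock, with no files or network involved. The function names and the sample values are hypothetical. Note also that in standard CPython the global interpreter lock prevents threads from running Python code in parallel, so this shows the structure of the approach rather than its performance; a production implementation would typically use a compiled language.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def aggregate_values(trade_values, n_threads=4):
    """Sum per-trade values into one netted value per scenario, in shared memory.

    trade_values: list of per-trade lists, one value per scenario.
    """
    n_scenarios = len(trade_values[0])
    total = [0.0] * n_scenarios          # shared result, visible to all threads
    lock = threading.Lock()

    def worker(chunk):
        local = [0.0] * n_scenarios      # thread-private partial sum
        for values in chunk:
            for s, v in enumerate(values):
                local[s] += v
        with lock:                       # merge once per thread, in memory
            for s in range(n_scenarios):
                total[s] += local[s]

    # Round-robin assignment of trades to threads
    chunks = [trade_values[i::n_threads] for i in range(n_threads)]
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        list(pool.map(worker, chunks))   # list() re-raises any worker exception
    return total


def expected_exposure(netted_values):
    """Average of the positive part of the netted value across scenarios."""
    return sum(max(v, 0.0) for v in netted_values) / len(netted_values)


# Hypothetical values: 3 trades, each valued under 3 scenarios
trade_values = [
    [1.0, -2.0, 3.0],
    [2.0, 1.0, -4.0],
    [-1.0, 0.5, 2.0],
]
netted = aggregate_values(trade_values)
exposure = expected_exposure(netted)
print(netted, exposure)
```

Merging a private buffer once per thread keeps lock contention low; the alternative of locking on every update would serialise the threads.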
In practical implementations that compute the expected future exposure for all counterparties in the bank’s portfolio (as needed for example in XVA or regulatory capital calculations), speedups of around 2x have been achieved by moving from a multi-process grid approach to a multi-threaded implementation with shared memory. This gain comes purely from higher processor utilisation due to better load balancing, lower memory requirements due to less replication, and better caching. Note that a multi-threaded implementation also opens up further optimisation opportunities: cache-efficiency techniques such as cache-blocking, vectorisation, and the integration of accelerator processors such as GPUs or Intel’s Xeon Phi.
More details about optimising the calculation performance of specific financial applications can be found in our compute resources.