GTC 2016 TRANSCRIPT
STANFORD CENTER FOR FINANCIAL AND RISK ANALYTICS
HPC IN THE POST-2008 CRISIS WORLD
Pierre SPATZ
© MUREX 2016
© 2015 Murex S.A.S. All rights reserved
FINANCIAL MARKETS
THE PICTURE BEFORE 2008
Margins are high and regulation costs are low
Tool flexibility, the invention of new exotic features, and time to market count for more than performance
Tier 1 and big Tier 2 banks have no budget issues and invest in huge grids of computers
Other banks act more as intermediaries, reselling products, and need only an indicative present value
Code is mainly single-threaded
Most quants focus only on the mathematics, disregarding IT problems, and we are no different
MUREX POSITION
THE PICTURE BEFORE 2008
We are already a leader in our market
Tier 1 banks plug their own models into our system and like it for being fully integrated from the front office through to processing
Murex front-office teams invest heavily in risk measures, scenario flexibility, complex sensitivities for nested-calibration cases, and automatic grid management
The quality of our financial model library is close to that of the biggest banks
Our customers who want to challenge Tier 1 banks like our models but do not want to invest in a huge infrastructure
COMPUTATION NEEDS IN FINANCE
THE PICTURE BEFORE 2008
Pricing and front office risk management of
Exotic structured products with scripted payoffs evaluated by Monte-Carlo
Credit derivatives
American and barrier options evaluated by partial differential equations
1-year historical value at risk as a nightly batch
COMPUTERS, CHIPS AND TOOLS
THE PICTURE IN 2008
Xeon and Opteron have 4 cores and we have no experience with parallel programming
Sun Microsystems does not yet belong to Oracle, and Solaris on SPARC processors is still preferred by our customers
Quants love Excel and IT wants us to do everything in Java
The PlayStation 3, with its Cell processor, is available worldwide and can be used and programmed as a workstation under Yellow Dog Linux
Roadrunner, featuring a double-precision-friendly Cell processor, becomes the first computer to break the petaflop barrier
NVIDIA gaming GPUs are said to be programmable using something called CUDA, and the first Unix servers with Tesla cards are delivered to universities and research centers
We are playing with our first iPhones, powered by low-power ARM processors
CELL & ARM
Like Intel Xeons, they are both CPUs
ARM processors achieve better performance per watt by implementing simpler instructions and running at lower frequencies
Cell processors achieve better performance for the same number of transistors by implementing wide vector units inside simpler and slower cores, replacing cache with cores, and making the programmer responsible for accessing memory through explicit, high-latency instructions
The Cell processor was extremely complex to program and is deprecated today, but Xeon Phi can be considered its natural descendant, featuring a cache
CPU & GPU
THEY ARE BOTH BUNCHES OF CORES
CPU multi-cores run at high frequency and are optimized for fast execution of single-threaded code with unpredictable execution paths
CPUs are not specialized in computation
CPUs can address a huge amount of memory
CPU cores have fast access to memory thanks to huge and fast L2/L3 caches
CPU cores have a fast L1 cache managed automatically
CPU parallelization is best implemented at the task level
CPU multithreading is software-managed
GPU many-cores run at low frequency and are optimized for batch execution of the same set of instructions across the board
GPUs are flops machines
GPU memory is limited but has high bandwidth
GPU cores access memory with high latency but hide it by doing something else
GPU cores have a fast local memory managed by the programmer
GPU parallelization is best implemented at the data level
GPU multithreading is hardware-managed
EXOTIC STRUCTURED PRODUCTS MONTE-CARLO WITH
SCRIPTED PAYOFFS WITH GPU
Monte-Carlo is embarrassingly parallel
Best performance: payoff scripting/DSL by path
Generate and compile CUDA/OpenCL kernels; in practice you are limited by the number of registers per CUDA core and by the complexity of the payoff
Best flexibility: payoff scripting/DSL by date
Use your preferred interpreted scripting language on the CPU and implement vector-based operations on the GPU; in practice you are limited by the GPU's memory bandwidth
Choose a good random number generator to cope with flexible implementations and to be able to replay part of the Monte-Carlo for optimization purposes; in practice, D. E. Shaw Research's Philox is great
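A small CPU-side sketch of why a counter-based generator like Philox matters here: because the stream is fully determined by a key, any block of paths can be regenerated later in isolation, without re-running the whole Monte-Carlo. NumPy ships Philox as a bit generator; the model parameters (flat-vol GBM, spot and strike of 100) are invented for illustration.

```python
import numpy as np

def path_block(seed, block_id, n_paths, n_steps):
    # Philox is counter-based: the (seed, block_id) pair fully determines
    # the stream, so this block can be regenerated later in isolation.
    rng = np.random.Generator(np.random.Philox(key=(seed << 32) + block_id))
    s0, sigma, dt = 100.0, 0.2, 1.0 / n_steps
    z = rng.standard_normal((n_paths, n_steps))
    drift = -0.5 * sigma**2 * dt
    return s0 * np.exp(np.cumsum(drift + sigma * np.sqrt(dt) * z, axis=1))

# Price a call payoff over four independently generated blocks of paths.
blocks = [path_block(2016, b, 512, 64) for b in range(4)]
paths = np.vstack(blocks)
pv = np.maximum(paths[:, -1] - 100.0, 0.0).mean()

# Replay only block 2 (e.g. to drill into a few paths): bit-for-bit identical.
replayed = path_block(2016, 2, 512, 64)
```

The same keying scheme works with cuRAND's Philox implementation on the GPU, where each thread derives its own counter offset.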
THE LATENCY PROBLEM
GPUs are only efficient when treating big problems, and there is real latency when launching kernels
In practice, reshape your code to see more problems at the same time (sensitivities, scenarios, trades, ...), but keep in mind that GPU memory is limited
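The reshaping idea can be sketched on the CPU with NumPy broadcasting: rather than one tiny pricing call (one kernel launch) per trade and scenario, a single batched evaluation covers the whole cross product. The scenario count, strikes and volatility below are made up for illustration.

```python
import numpy as np

rng = np.random.Generator(np.random.Philox(7))

# Hypothetical batch: 20 market scenarios x 50 trades, priced in one shot.
spots = 100.0 * (1.0 + 0.01 * rng.standard_normal(20))   # scenario spots
strikes = np.linspace(80.0, 120.0, 50)                   # one strike per trade
z = rng.standard_normal(4000)                            # shared normal draws

# One "big launch": broadcasting builds a (paths, scenarios, trades) cube,
# then the reduction over paths leaves a (scenarios, trades) PV matrix.
vol = 0.2
st = spots[None, :, None] * np.exp(vol * z[:, None, None] - 0.5 * vol**2)
pv = np.maximum(st - strikes[None, None, :], 0.0).mean(axis=0)

# The cube here is 4000 x 20 x 50 doubles (~32 MB); on a GPU the batch has
# to be chunked so each cube still fits in device memory.
```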
OPTION PRICING AND CALIBRATION SOLVED BY PARTIAL
DIFFERENTIAL EQUATIONS
LU solvers are not GPU friendly since they are sequential
Choose instead a divide-and-conquer algorithm like parallel cyclic reduction (PCR): N log(N) operations, but only log(N) steps
Stencil computation is more about accessing inputs than doing arithmetic
Keep your data in local memory as much as possible
1D problems are not big enough to feed a GPU on their own, but your portfolios contain many options
A worked cyclic-reduction example on a 7-unknown tridiagonal system (the multipliers on the right combine each equation with its neighbours):

2a - b = 1             (x 1/2)
-a + 2b - c = 1        (x 1)
-b + 2c - d = 1        (x 1/2, x 1/2)
-c + 2d - e = 1        (x 1)
-d + 2e - f = 1        (x 1/2, x 1/2)
-e + 2f - g = 1        (x 1)
-f + 2g = 1            (x 1/2)

After one reduction step, only the even unknowns remain:

b - 1/2 d = 2          (x 1/2)
-1/2 b + d - 1/2 f = 2 (x 1)
-1/2 d + f = 2         (x 1/2)

After the second step, a single equation is left:

1/2 d = 4, i.e. d = 8
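A minimal serial sketch of PCR in NumPy, checked against the worked example above. On a GPU the inner loop over rows is what runs in parallel, one thread per equation per sweep; this toy version just emulates the sweeps.

```python
import numpy as np

def pcr_solve(sub, diag, sup, rhs):
    # Parallel cyclic reduction for a tridiagonal system. Each sweep
    # combines every equation with its two neighbours at distance
    # `stride`, doubling the coupling distance: log2(n) sweeps, and
    # inside a sweep every row is independent (GPU-friendly).
    a, b, c, d = (np.asarray(v, float).copy() for v in (sub, diag, sup, rhs))
    n, stride = len(b), 1
    while stride < n:
        an, bn, cn, dn = a.copy(), b.copy(), c.copy(), d.copy()
        for i in range(n):  # conceptually one GPU thread per row
            lo, hi = i - stride, i + stride
            alpha = -a[i] / b[lo] if lo >= 0 else 0.0
            beta = -c[i] / b[hi] if hi < n else 0.0
            an[i] = alpha * a[lo] if lo >= 0 else 0.0
            cn[i] = beta * c[hi] if hi < n else 0.0
            bn[i] = b[i] + (alpha * c[lo] if lo >= 0 else 0.0) \
                         + (beta * a[hi] if hi < n else 0.0)
            dn[i] = d[i] + (alpha * d[lo] if lo >= 0 else 0.0) \
                         + (beta * d[hi] if hi < n else 0.0)
        a, b, c, d, stride = an, bn, cn, dn, stride * 2
    return d / b  # every equation is now diagonal: b[i] * x[i] = d[i]

# The 7-unknown example above: 2 on the diagonal, -1 off it, rhs of 1.
x = pcr_solve([0] + [-1] * 6, [2] * 7, [-1] * 6 + [0], [1] * 7)
# middle unknown: x[3] == 8.0, consistent with 1/2 d = 4 above
```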
FINANCIAL MARKETS
THE PICTURE TODAY
Lower margins, higher volumes, and high regulation costs
We see a trend toward exotic standardization, but we still have 40-year PRDCs on our books
Tier 1 banks and Murex have had GPUs in production for some time and are continuing to invest, while other experiments, like FPGAs for Monte-Carlo, have failed
GPUs are mainstream in supercomputers and are there to stay
Medium-size banks are obliged to manage their risk and run their VaR on exotic portfolios, even when trades are asset-swapped and theoretically risk-free
CVA is our day-to-day topic, and investing only in computers, without a rewrite into efficient and parallel-friendly code, is no longer an option
A good quant is also a good computer-science expert
CVA & PFE
A Monte-Carlo with a reduced set of paths over all the trades done with a counterparty
Where we need to retrieve all PVs for all future paths and dates, for flexible aggregation and drill-down analysis later on
Where counterparty trade composition and volume may be very different
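A toy sketch of what the aggregation side consumes: a PV cube indexed by path, future date and trade, kept whole so it can be netted, turned into an exposure profile, and drilled into per trade. The cube here is random noise and the hazard rate, LGD and discount curve are invented; only the shape of the computation is the point.

```python
import numpy as np

rng = np.random.Generator(np.random.Philox(11))

# Hypothetical PV cube for one netting set: 1000 paths x 40 future dates
# x 3 trades. In production it comes from the pricing farm and is kept,
# not collapsed, so it can be re-aggregated and drilled into later.
pv = rng.standard_normal((1000, 40, 3)).cumsum(axis=1)
dates = np.linspace(0.25, 10.0, 40)

netted = pv.sum(axis=2)                      # net PV per path and date
epe = np.maximum(netted, 0.0).mean(axis=0)   # expected positive exposure

# Illustrative CVA: LGD x sum over dates of discounted EPE x default prob.
hazard, lgd = 0.02, 0.6
df = np.exp(-0.01 * dates)                   # toy discount factors
surv = np.exp(-hazard * dates)
dpd = -np.diff(surv, prepend=1.0)            # default prob per date bucket
cva = lgd * np.sum(df * epe * dpd)

# Drill-down is just another slice of the same cube, e.g. trade 0 alone:
epe_trade0 = np.maximum(pv[:, :, 0], 0.0).mean(axis=0)
```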
CVA A FLAVOR OF THE DIFFICULTY
(Diagram: counterparties range from LCH and foreign branches to many other small counterparties; their portfolios mix swaps, caps and exotics)
1 TB of results is generated when computing sensitivities for a medium-size bank, and far more for a Tier 1
Treating all trades or all counterparties as equivalent would be a mistake when building such a system
Recomputing everything in case of a failure is not an option
HPC FOR CVA
Group vanilla trades and evaluate them together on GPUs, independently of their counterparty, in a compute-centric cluster, aka small nodes
Use GPU American Monte-Carlo with non-linear regression for exotic trades
Use specific boxes with enough memory for aggregation in a big data-centric sub-cluster, aka big nodes
Use fast parallel flash storage as an intermediate buffer and checkpoint for the calculation chain, to ensure performance and reliability
Use an InfiniBand network as the interconnect, able to convey several GB per second
THE PICTURE TOMORROW
FRTB, with up to 15K scenarios using front-office models
MVA, which leads to computing a historical VaR inside a Monte-Carlo, in a scalable manner, over all trades done with a CCP
AD and AAD are back in the game but are no game changers yet
Ever faster and more flexible GPUs
Cars become self-aware
AD AND AAD IN A NUTSHELL
AD is the good old forward pathwise method for computing sensitivities, but done automatically by tools
AAD is about the same method, but generates sensitivities to all inputs and intermediate values in a single additional backward sweep, at a ridiculously low compute cost
AAD can be implemented using special compilers, which are only partially compatible with GPUs, or by overloading the basic C++ scalar operators used to program the Monte-Carlo, which is totally GPU-friendly
Operator execution records all operations and intermediate results of the forward sweep on a tape. The tape is then played backward on all paths in parallel, and the per-path derivatives are computed using the chain rule, keeping future results constant
The final sensitivities are the expectation of the sensitivities computed on each path
(Diagram: a computation graph 𝜃, 𝑥, 𝑦 → 𝑧 → … → 𝑝)
Forward sweep: 𝑧 is computed from 𝑥 and 𝑦 (𝑧 ↦ 𝑥, 𝑦 on the tape)
Backward sweep: 𝜕𝑝/𝜕𝑧 is propagated back to 𝜕𝑝/𝜕𝑥 and 𝜕𝑝/𝜕𝑦 by the chain rule:
𝜕𝑝/𝜕𝑥 = (𝜕𝑝/𝜕𝑧)(𝜕𝑧/𝜕𝑥)
𝜕𝑝/𝜕𝑦 = (𝜕𝑝/𝜕𝑧)(𝜕𝑧/𝜕𝑦)
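The tape mechanism can be sketched with a toy scalar class in Python (the production version described on the slide is C++ operator overloading running on the GPU; the class and variable names here are invented). Overloaded operators record the forward sweep; one pass backward over the tape accumulates sensitivities to all inputs at once.

```python
class Var:
    """Toy tape-based reverse-mode AD ('AAD'): operators record every
    operation of the forward sweep; one backward pass over the tape
    yields the sensitivities to ALL inputs via the chain rule."""
    _tape = []

    def __init__(self, value, parents=()):
        # parents: list of (parent_node, local_derivative) pairs
        self.value, self.parents, self.adjoint = value, parents, 0.0
        Var._tape.append(self)

    def __add__(self, o):
        return Var(self.value + o.value, [(self, 1.0), (o, 1.0)])

    def __mul__(self, o):
        return Var(self.value * o.value, [(self, o.value), (o, self.value)])

    def backward(self):
        self.adjoint = 1.0
        for node in reversed(Var._tape):  # play the tape backward
            for parent, local_grad in node.parents:
                parent.adjoint += node.adjoint * local_grad

x, y = Var(3.0), Var(4.0)
p = x * y + x * x   # forward sweep records the tape
p.backward()        # one backward sweep: dp/dx and dp/dy together
# x.adjoint == 10.0 (y + 2x), y.adjoint == 3.0 (x)
```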
AD AND AAD PROMISING, GOOD FOR VANILLAS, BUT …
The method is simple, but the implementation can be tricky
Everything should be done to have kernels generic enough to keep the GPU fed while avoiding race conditions
To obtain the best performance one still needs to tweak the order of operations inside the computation tree, which often makes the method incompatible with cases where we want to keep full flexibility in the post-aggregation of detailed results from several Monte-Carlos
AAD is not applicable to all complex exotics, even if the Vibrato smoothing method helps
AAD doesn't solve the stress-test and historical-VaR problems
AAD is also said to be memory bound; well implemented, it is only memory-bandwidth bound
PASCAL: THE MEMORY BANDWIDTH JUMP IN A SINGLE GPU
(Chart: single-GPU memory bandwidth in GB/s across generations)
C1060: 102
M2090: 178
K40: 288
Pascal: 1000
It is the first time since 2008 that the number of bytes per flop has increased for a single GPU across a generation change (and maybe the last)
Our AAD code will simply be 3.5x faster on the next generation
Most of our other algorithms are at least partially limited by GPU memory bandwidth and will also show huge benefits
SIERRA SUPERCOMPUTER 2017-2018: A FULL-FLEDGED CVA RISK SYSTEM IN A NODE
The revival of the big nodes
The flops of 8 K40s
A lot of CPU cores and memory to prepare inputs, convert outputs, interpret
scripts, aggregate, query, …
Enough GPU/CPU interconnect speed to retrieve CVA or MVA profiles
unnoticed
NVRAM to replace external flash array storage
Enough network bandwidth to have the flexibility of keeping results locally or
remotely
Bilateral MVA with SIMM at the same cost
CCP MVA with full revaluation using only a few nodes
THANK
YOU!
PARIS NEW YORK SINGAPORE
linkedin.com/company/murex
twitter.com/Murex_Group