
GTC 2016
HPC IN THE POST 2008 CRISIS WORLD
Pierre SPATZ
© MUREX 2016

STANFORD CENTER FOR FINANCIAL AND RISK ANALYTICS


BACK TO 2008


FINANCIAL MARKETS

THE PICTURE BEFORE 2008

Margins are high, regulation costs are small

Flexibility of the tools, invention of new exotic features and time to market count more than performance

Tier 1 and big Tier 2 banks have no budget issues and invest in huge grids of computers

Other banks act more as intermediaries, reselling products, and need only an informative present value

Code is mainly mono-threaded

Most quants focus only on the mathematics, disregarding IT problems, and we are no different


MUREX POSITION

THE PICTURE BEFORE 2008

We are already a leader in our market

Tier 1 banks plug their own models inside our system and like it for being fully integrated from front office to processing

Murex front office teams invest heavily in risk measures, scenario flexibility, complex sensitivities for nested calibration cases and automatic grid management

The quality of our financial model library is close to that of the biggest banks

Our customers who want to challenge Tier 1 banks like our models but do not want to invest in a huge infrastructure


COMPUTATION NEEDS IN FINANCE

THE PICTURE BEFORE 2008

Pricing and front office risk management of:

Exotic structured products with scripted payoffs evaluated by Monte-Carlo

Credit derivatives

American and barrier options evaluated by partial differential equations

1-year historical value at risk as a nightly batch


COMPUTERS, CHIPS AND TOOLS

THE PICTURE IN 2008

Xeon and Opteron have 4 cores and we have no practice of parallel programming

Sun Microsystems does not yet belong to Oracle, and Solaris on SPARC processors is still preferred by our customers

Quants love Excel and IT wants us to do everything in Java

The PlayStation 3 with its Cell processor is available worldwide and can be used and programmed as a workstation under Yellow Dog Linux

RoadRunner, featuring a double-precision-friendly Cell processor, becomes the first computer to pass the PetaFlop barrier

NVIDIA gaming GPUs are said to be programmable using something called CUDA, and the first Unix servers with Tesla cards are delivered to some universities and research centers

We are playing with our first iPhones and they are powered by a low-consumption ARM processor


CELL & ARM

They are mostly CPUs like Intel Xeons

ARM processors achieve better performance per watt by implementing simpler instructions and running at lower frequency

Cell processors achieve better performance for the same number of transistors by implementing wide vector functions inside simpler and slower cores, replacing cache with cores, and by making the programmer responsible for accessing memory through explicit, high-latency instructions

The Cell processor was extremely complex to program and is deprecated today, but Xeon Phi can be considered its natural descendant, featuring a cache


CPU & GPU

THEY ARE BOTH BUNCHES OF CORES

CPUs:
CPU multi-cores run at high frequency and are optimized for fast execution of mono-threaded code with an unpredictable execution stack
CPUs are not specialized in computation
CPUs can handle a huge amount of memory
CPU cores have fast access to memory thanks to a huge and fast L2/L3 cache
CPU cores have a fast L1 cache managed automatically
CPU parallelization is better implemented at the level of the task
CPU multithreading is software managed

GPUs:
GPU many-cores run at low frequency and are optimized for batch execution of the same set of instructions across the board
GPUs are flops machines
GPU memory is limited but has high bandwidth
GPU cores access memory with a latency but hide it by doing something else
GPU cores have a fast local memory managed by the programmer (see the sketch below)
GPU parallelization is better implemented at the level of the data
GPU multithreading is hardware managed
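As a concrete (and hypothetical, not from the talk) illustration of the GPU column: a tiny data-parallel reduction in CUDA where each block sums 256 values in the programmer-managed shared memory, the CUDA flavor of the "fast local memory" mentioned above.

```cuda
// Minimal sketch, assuming blocks of exactly 256 threads: each block reduces
// 256 input values in on-chip shared memory and writes one partial sum.
__global__ void block_sum(const float* in, float* out)
{
    __shared__ float s[256];                       // fast local memory, managed by hand
    int tid = threadIdx.x;
    s[tid] = in[blockIdx.x * blockDim.x + tid];    // data-parallel load, one value per thread
    __syncthreads();
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) s[tid] += s[tid + stride];   // tree-shaped pairwise sums
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = s[0];          // one partial sum per block
}
// launch: block_sum<<<numBlocks, 256>>>(d_in, d_out);
```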

GPU & 2008 PROBLEMS


EXOTIC STRUCTURED PRODUCTS: MONTE-CARLO WITH SCRIPTED PAYOFFS ON GPU

Monte-Carlo is embarrassingly parallel

Best performance with payoff scripting/DSL by path:
Generate and compile CUDA/OpenCL kernels
In practice you are limited by the number of registers per CUDA core and by the complexity of the payoff

Best flexibility with payoff scripting/DSL by date:
Use your preferred interpreted scripting language on the CPU and implement vector-based operations on the GPU
In practice you are limited by the memory bandwidth of the GPU

Choose a good random number generator to cope with a flexible implementation and to be able to replay a part of the Monte-Carlo for optimization purposes
In practice D. E. Shaw's Philox is great (see the sketch below)
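A minimal sketch of such a per-path kernel, assuming the Philox generator exposed by cuRAND and a hard-coded European call in place of a compiled scripted payoff (illustration only, not the payoff compiler described above):

```cuda
// One thread per Monte-Carlo path; counter-based Philox RNG from cuRAND so any
// path can be replayed independently (useful for partial re-runs).
#include <curand_kernel.h>

__global__ void mc_call(float S0, float K, float r, float sigma, float T,
                        unsigned long long seed, int nPaths, float* payoffs)
{
    int path = blockIdx.x * blockDim.x + threadIdx.x;
    if (path >= nPaths) return;

    curandStatePhilox4_32_10_t state;
    curand_init(seed, path, 0, &state);            // one subsequence per path

    float z  = curand_normal(&state);              // single N(0,1) draw
    float ST = S0 * __expf((r - 0.5f * sigma * sigma) * T + sigma * sqrtf(T) * z);
    payoffs[path] = __expf(-r * T) * fmaxf(ST - K, 0.0f);   // discounted payoff
}
// The price is the average of payoffs[0..nPaths), e.g. via a reduction on the GPU.
```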


THE LATENCY PROBLEM

GPUs are only efficient when treating big problems and there is a real latency when launching kernels

In practice, reshape your code to see more problems at the same time (sensitivities, scenarios, trades, ...) but keep in mind that GPU memory is limited (see the sketch below)
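A minimal sketch of this batching idea, reusing the hypothetical call-pricing kernel above but folding a scenario dimension into the launch so that a single kernel call amortizes the latency over many small problems (assumed names, not the talk's code):

```cuda
// Scenarios along the y-dimension of the grid, paths along the x-dimension:
// one launch covers nScenarios * nPaths evaluations.
#include <curand_kernel.h>

__global__ void mc_call_batched(const float* S0, const float* sigma,   // one value per scenario
                                float K, float r, float T,
                                unsigned long long seed,
                                int nScenarios, int nPaths, float* payoffs)
{
    int path = blockIdx.x * blockDim.x + threadIdx.x;
    int scen = blockIdx.y;
    if (path >= nPaths || scen >= nScenarios) return;

    curandStatePhilox4_32_10_t state;
    curand_init(seed, (unsigned long long)scen * nPaths + path, 0, &state);

    float z   = curand_normal(&state);
    float vol = sigma[scen];
    float ST  = S0[scen] * __expf((r - 0.5f * vol * vol) * T + vol * sqrtf(T) * z);
    payoffs[scen * nPaths + path] = __expf(-r * T) * fmaxf(ST - K, 0.0f);
}
// launch: dim3 grid((nPaths + 255) / 256, nScenarios);
//         mc_call_batched<<<grid, 256>>>(d_S0, d_sigma, K, r, T, seed, nScenarios, nPaths, d_payoffs);
```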


OPTION PRICING AND CALIBRATION SOLVED BY PARTIAL DIFFERENTIAL EQUATIONS

LU solvers are not GPU friendly since they are sequential

Choose instead a divide-and-conquer algorithm like PCR (parallel cyclic reduction): N log(N) operations but only log(N) steps (see the worked example and the kernel sketch below)

Stencil computation is more about accessing inputs than doing computation

Keep as much of your data as possible in local memory

1D problems are not big enough to feed a GPU, but you have many options in your portfolios

A worked example of the reduction (reconstructed from the slide): start from the tridiagonal system

2a - b = 1
-a + 2b - c = 1
-b + 2c - d = 1
-c + 2d - e = 1
-d + 2e - f = 1
-e + 2f - g = 1
-f + 2g = 1

Adding to each even equation one half of each neighbouring equation (the x1/2 and x1 weights on the slide) eliminates the odd unknowns:

b - (1/2)d = 2
-(1/2)b + d - (1/2)f = 2
-(1/2)d + f = 2

One more step of the same reduction leaves a single decoupled equation:

(1/2)d = 4, hence d = 8
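A minimal PCR kernel sketch along these lines, assuming one tridiagonal system per thread block with a power-of-two size (illustrative only, not Murex code):

```cuda
// a = sub-diagonal, b = diagonal, c = super-diagonal, d = right-hand side;
// one thread per row, n = blockDim.x rows, log2(n) reduction steps.
__global__ void pcr_tridiag(const float* a, const float* b, const float* c,
                            const float* d, float* x, int n)
{
    extern __shared__ float s[];
    float *sa = s, *sb = s + n, *sc = s + 2 * n, *sd = s + 3 * n;
    int i = threadIdx.x;
    sa[i] = a[i]; sb[i] = b[i]; sc[i] = c[i]; sd[i] = d[i];
    __syncthreads();

    for (int stride = 1; stride < n; stride *= 2) {
        // Combine row i with rows i-stride and i+stride to eliminate their unknowns.
        float k1 = (i - stride >= 0) ? -sa[i] / sb[i - stride] : 0.0f;
        float k2 = (i + stride <  n) ? -sc[i] / sb[i + stride] : 0.0f;
        float na = (i - stride >= 0) ? k1 * sa[i - stride] : 0.0f;
        float nc = (i + stride <  n) ? k2 * sc[i + stride] : 0.0f;
        float nb = sb[i] + ((i - stride >= 0) ? k1 * sc[i - stride] : 0.0f)
                         + ((i + stride <  n) ? k2 * sa[i + stride] : 0.0f);
        float nd = sd[i] + ((i - stride >= 0) ? k1 * sd[i - stride] : 0.0f)
                         + ((i + stride <  n) ? k2 * sd[i + stride] : 0.0f);
        __syncthreads();                              // finish all reads before writing
        sa[i] = na; sb[i] = nb; sc[i] = nc; sd[i] = nd;
        __syncthreads();
    }
    x[i] = sd[i] / sb[i];                             // every row is now decoupled
}
// launch (one system): pcr_tridiag<<<1, n, 4 * n * sizeof(float)>>>(a, b, c, d, x, n);
// In practice many independent systems (one per option) are solved in the same launch.
```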

BACK TO TODAY


FINANCIAL MARKETS

THE PICTURE TODAY

Lower margins, higher volumes, high regulation costs

We see a trend towards exotic standardization, but we still have 40-year PRDCs in our books

Tier 1 banks and Murex have had GPUs in production for some time and continue to invest, while other experiments such as FPGAs for Monte-Carlo have failed

GPUs are mainstream in supercomputers and are there to stay

Medium-size banks are obliged to manage their risk and run their VAR on exotic portfolios, even when trades are asset-swapped and theoretically risk free

CVA is our day-to-day topic, and investing only in computers without rewriting the code to be efficient and parallel-friendly is no longer an option

A good quant is also a good computer science expert


CVA & PFE

A Monte-Carlo with a reduced set of paths on all the trades done with a counterpart

Where we need to retrieve all PVs for all future paths and dates for later flexible aggregation and drill-down analysis (see the sketch below)

Where counterpart trade composition and volume may be very different
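One possible shape for that PV cube and one aggregation over it, the expected positive exposure per date for a netting set; the layout and names are illustrative assumptions, not the system described in the talk:

```cuda
// PV cube stored as pv[date][path][trade]; the kernel nets all trades of a
// netting set per (date, path) and averages the positive part over paths.
__global__ void epe_profile(const float* pv, int nDates, int nPaths, int nTrades,
                            float* epe)                  // one value per date, zero-initialized
{
    int date = blockIdx.x;
    int path = threadIdx.x;                              // assumes nPaths == blockDim.x
    if (date >= nDates || path >= nPaths) return;

    float netted = 0.0f;
    const float* row = pv + ((size_t)date * nPaths + path) * (size_t)nTrades;
    for (int t = 0; t < nTrades; ++t) netted += row[t];  // net the trades of the set

    atomicAdd(&epe[date], fmaxf(netted, 0.0f) / nPaths); // E[max(netted PV, 0)] per date
}
// launch: epe_profile<<<nDates, nPaths>>>(d_pv, nDates, nPaths, nTrades, d_epe);
```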


CVA: A FLAVOR OF THE DIFFICULTY

[Diagram: counterparts range from LCH and foreign branches to many other small counterparts; trade types range from swaps and caps to exotics]

1 TB of results is generated when computing sensitivities for a medium-size bank, and far more for a Tier 1

Considering all trades or all counterparts as equivalent would be a mistake when building a system

Re-computing everything in case of a failure is not an option


HPC FOR CVA

Group vanilla trades and evaluate them together on GPUs, independently of their counterpart, in a compute-centric cluster, aka small nodes

Use GPU American Monte-Carlo with non-linear regression for exotic trades

Use specific boxes with enough memory for aggregation in a big data-centric sub-cluster, aka big nodes

Use parallel fast flash file storage as an intermediate buffer and checkpoint for the calculation chain to ensure performance and reliability

Use an InfiniBand network as interconnect, able to convey several GB per second

BACK TO THE FUTURE


THE PICTURE TOMORROW

FRTB and up to 15K scenarios using front office models

MVA, which leads to the computation of a historical VAR inside a Monte-Carlo, in a scalable manner, on all trades done with a CCP

AD and AAD are back in the game but are not game changers yet

Ever faster and more flexible GPUs

Cars become self-aware


AD AND AAD IN A NUTSHELL

AD is the good old forward pathwise method for computing sensitivities, but done automatically by tools

AAD is about the same method but generates sensitivities to all inputs and intermediate values in a single additional backward sweep, at a ridiculously small compute cost

AAD can be implemented using special compilers, which are only partially compatible with GPUs, or by overloading the C++ basic scalar operators used to program the MC, which is totally GPU friendly

The operator execution keeps a record (the tape) of all operations and intermediate results of the forward sweep. The tape is played backward on all paths in parallel and the derivatives per path are computed using the chain rule, keeping future results constant (a minimal illustration follows the diagram below)

The resulting sensitivities are finally the expectation of the sensitivities computed for each path

Forward sweep (computation graph): θ → x, y → z → ... → p

Backward sweep (chain rule, per path):
∂p/∂x = (∂p/∂z)(∂z/∂x)
∂p/∂y = (∂p/∂z)(∂z/∂y)

Each forward operation (x, y) ↦ z thus has a backward counterpart ∂p/∂z ↦ (∂p/∂x, ∂p/∂y)
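A minimal operator-overloading tape sketch of this idea in plain C++ (scalar, CPU-only; the names and structure are illustrative assumptions, not Murex's implementation):

```cpp
// Global tape: each node stores its (up to two) parents and the local partial
// derivatives of the operation that produced it.
#include <cmath>
#include <cstdio>
#include <vector>

struct Node { int lhs, rhs; double dlhs, drhs; };
static std::vector<Node> tape;

struct Adouble {
    double val; int idx;
    Adouble(double v = 0.0) : val(v), idx((int)tape.size()) {
        tape.push_back({-1, -1, 0.0, 0.0});                      // input (leaf) node
    }
    Adouble(double v, int l, int r, double dl, double dr) : val(v), idx((int)tape.size()) {
        tape.push_back({l, r, dl, dr});                          // recorded operation
    }
};

Adouble operator+(const Adouble& a, const Adouble& b) { return Adouble(a.val + b.val, a.idx, b.idx, 1.0, 1.0); }
Adouble operator*(const Adouble& a, const Adouble& b) { return Adouble(a.val * b.val, a.idx, b.idx, b.val, a.val); }
Adouble exp(const Adouble& a) { double e = std::exp(a.val); return Adouble(e, a.idx, -1, e, 0.0); }

// Backward sweep: seed the output adjoint with 1 and play the tape in reverse,
// accumulating each parent's adjoint with the chain rule.
std::vector<double> backward(const Adouble& out) {
    std::vector<double> adj(tape.size(), 0.0);
    adj[out.idx] = 1.0;
    for (int i = (int)tape.size() - 1; i >= 0; --i) {
        if (tape[i].lhs >= 0) adj[tape[i].lhs] += adj[i] * tape[i].dlhs;
        if (tape[i].rhs >= 0) adj[tape[i].rhs] += adj[i] * tape[i].drhs;
    }
    return adj;                                                  // adj[v.idx] == dp/dv for every v
}

int main() {
    Adouble x(2.0), y(3.0);
    Adouble p = exp(x * y) + x;                  // forward sweep records the tape
    std::vector<double> adj = backward(p);       // one backward sweep, all sensitivities
    std::printf("dp/dx = %g  dp/dy = %g\n", adj[x.idx], adj[y.idx]);
    // analytic check: dp/dx = y*exp(x*y) + 1, dp/dy = x*exp(x*y)
}
```

In a Monte-Carlo setting the same recording would run per path, the backward sweep would be played on all paths in parallel, and the per-path sensitivities would then be averaged, as described above.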


AD AND AAD: PROMISING, GOOD FOR VANILLAS, BUT ...

The method is simple but the implementation can be tricky

Everything should be done to have generic enough kernels to keep the GPU fed while avoiding race conditions

To obtain the best performance one still needs to tweak the order of operations inside the computation tree, which often makes the method incompatible with cases where we want to keep full flexibility at the level of the post-aggregation of detailed results from several Monte-Carlos

AAD is not applicable to all complex exotics, even if the Vibrato smoothing method helps

AAD doesn't solve the stress test and historical VAR problems

AAD is also said to be memory bound; well implemented, it is only memory bandwidth bound


PASCAL: THE MEMORY BANDWIDTH JUMP IN A SINGLE GPU

[Bar chart: single-GPU memory bandwidth in GB/s. C1060: 102, M2090: 178, K40: 288, X80 TITAN: 1000]

It is the first time since 2008 that the number of bytes per flop has increased for a single GPU across a generation change, and maybe the last

Our AAD code will simply be 3.5x faster on the next generation

Most of our other algorithms are at least partially limited by the memory bandwidth of the GPUs and will also show huge benefits


SIERRA SUPERCOMPUTER 2017-2018: A FULL-FLEDGED CVA RISK SYSTEM IN A NODE

The revival of the big nodes

The flops of 8 K40s

A lot of CPU cores and memory to prepare inputs, convert outputs, interpret scripts, aggregate, query, ...

Enough GPU/CPU interconnect speed to retrieve CVA or MVA profiles unnoticed

NVRAM to replace external flash array storage

Enough network bandwidth to have the flexibility of keeping results locally or remotely

Bilateral MVA with SIMM at the same cost

CCP MVA with full revaluation using only a few nodes