
  • 1

    Counterparty Credit Risk and IM Computation for CCP

    on Multicore Systems

    Prasad Pawar

    Nishant Kumar

    Amit Kalele

    Tata Consultancy Services Limited

  • 2

    Overview

    Introduction

    Counterparty Credit Risk

    • Basic Terminology

    Sequential Algorithm

    Parallel algorithm using CUDA

    Optimizations applied on GPGPU

    Parallelized and Optimized algorithm on Intel Platform

    Comparison Results

    Conclusion and future work

  • 3

    Introduction

    Counterparty credit risk

    Counterparty credit risk is defined as the risk that the counterparty to a

    transaction could default before the final settlement of the transaction’s cash flows.

    Basic Terminology

    • IRS trade - Interest rate swap trades are done by market participants primarily for
    hedging or for speculation on the direction of interest rates

    • Cash Flow - In the context of IRS trades, a cash flow is the sum of money to be paid or
    received on the predefined cash flow dates specified in the trade

    • Zero Coupon Yield Curve - A curve of zero coupon bond yields plotted against time to
    maturity; it provides the forward rates and spot rates used for computing and
    discounting cash flows

    • MTM value - The mark to market value of an IRS trade, reflecting the monetary gain or
    loss on the trade for each of the two parties

  • 4

    Counter Party Credit Risk

    Mark to Market Computations

    • The central counterparty (CCP) values, using the current yield curve, the complete
    portfolio of interest rate swap trades received from all members on an intraday basis.

    • It calculates the mark to market (MTM) margin requirement for each member.

    • It blocks the margin from a member's collateral and, if the collateral is not sufficient,
    makes a margin call to the member.

    Initial Margin Computations

    • The IM computation requires valuing the member's current portfolio 250 times, using
    250 different yield curves picked from historical data.

  • 5

    Challenges

    On traditional (database) systems, MTM computation takes ~25 min for
    20,000 trades, each with ~150 cash flows.

    Initial Margin (IM) computation takes ~10 min for 250 different yield curves,
    each applied to 20,000 trades with ~150 cash flows, on a .NET based solution.

    Such high timings lead to the following problems:

    • The process is inefficient, as the user is unproductive during the 25 minutes
    in which the valuation runs

    • The timings are too high when information is required urgently by senior
    executives or regulators

    • If trade volumes grow to, say, 100,000 (a realistic possibility), the time
    taken will exceed 2 hours, which is virtually unacceptable

    • Until the IM result is computed, a member can continue trading, but the
    trades are guaranteed for settlement from the point of trade in TS, which
    increases the risk carried by the CCP

    An efficient solution is required to solve this problem

  • 6

    Computational Steps

    Yield curve generations:

    1. Using linear interpolation, compute the intermediate swap rates for tenors whose
    swap rates are not provided, where (x, y) represents a tenor and its corresponding
    interest rate.

    2. Zero rates for tenors up to one year are computed using the continuous
    compounding method (see the formulas below).

    3. The standard bootstrapping method is used to compute zero rates for tenors
    beyond one year.
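
    The formulas behind steps 1 and 2 did not survive in this transcript. A minimal sketch of
    the standard textbook forms, assuming simple (money-market style) rates below one year,
    would be:

    % step 1: linear interpolation between known points (x_1, y_1) and (x_2, y_2)
    y = y_1 + \frac{(x - x_1)(y_2 - y_1)}{x_2 - x_1}

    % step 2: a common continuous-compounding conversion for a tenor t <= 1 year
    % with simple rate r
    z(t) = \frac{1}{t}\,\ln(1 + r\,t), \qquad t \le 1 \text{ year}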

  • 7

    Computational Steps

    MTM values

    The input to the mark to market computation is business date, immediate previous and

    future cash flow dates, principal amount, accrued interest, fixed rate, floating rate and

    zero rate.

    1. Compute fixed cash flows, and compute floating cash flows using the discrete
    equivalent formula for future floating interest rates, i.e. forward rates (see the
    formulas below).

    2. Calculate the MTM value of each trade by discounting the fixed and floating cash
    flows using the zero rates and netting them off.

    Discounted Value = Present value of cash flows
    Trade MTM = Sum (Discounted Value)

    3. The MTM value, i.e. the margin requirement for each member, is obtained by
    aggregating the MTM values of all the member's trades.

    MTM for Member = Sum (Trade MTM of that Member)
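
    The forward-rate and discounting formulas themselves are not preserved here. A hedged
    reconstruction, assuming continuously compounded zero rates z(t) and leaving the sign of
    the netting to depend on which leg the member pays, is:

    % forward rate between cash flow dates t_1 < t_2
    f(t_1, t_2) = \frac{z(t_2)\,t_2 - z(t_1)\,t_1}{t_2 - t_1}
    \qquad \hat{f}(t_1, t_2) = e^{f(t_1, t_2)} - 1 \;\; \text{(discrete equivalent)}

    % discounting and netting
    PV(CF_i) = CF_i\, e^{-z(t_i)\,t_i}, \qquad
    \text{Trade MTM} = \sum_i PV(CF_i^{\text{fixed}}) - \sum_i PV(CF_i^{\text{float}})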

  • 8

    Computational Steps

    Initial Margin calculation

    1. Value the complete IRS trade portfolio of a Member 250 times using 250

    historical zero rates

    2. Compute the daily percentage change in the MTM value of the portfolio and

    record the 249 results

    3. Assign a weight to each result using an EWMA scheme, such that the more recent
    the result, the higher its weight

    4. Sort the 249 results in ascending order

    5. Accumulate the weights from the top; where the cumulative weight reaches 0.05,
    i.e. the worst 5 percent, the corresponding percentage change is multiplied by the
    portfolio value and adjusted for the holding period to compute the IM (see the
    sketch below)
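
    A minimal C sketch of steps 2-5 above. The EWMA decay factor lambda, the holding-period
    adjustment, the ordering of the historical valuations and all names here are illustrative
    assumptions, not the production implementation.

    #include <stdlib.h>
    #include <math.h>

    #define NRATE 250            /* number of historical valuations  */
    #define NCHG  (NRATE - 1)    /* 249 daily percentage changes     */

    typedef struct { double pct_change; double weight; } scenario_t;

    static int cmp_asc(const void *a, const void *b) {
        double d = ((const scenario_t *)a)->pct_change -
                   ((const scenario_t *)b)->pct_change;
        return (d > 0) - (d < 0);
    }

    /* mtm[] is assumed ordered oldest to newest */
    double compute_im(const double mtm[NRATE], double portfolio_value,
                      double lambda, double holding_adj) {
        scenario_t s[NCHG];
        double wsum = 0.0;

        /* daily percentage change and EWMA weight (more recent => larger weight) */
        for (int i = 0; i < NCHG; ++i) {
            s[i].pct_change = (mtm[i + 1] - mtm[i]) / mtm[i];
            s[i].weight     = pow(lambda, NCHG - 1 - i);
            wsum           += s[i].weight;
        }
        for (int i = 0; i < NCHG; ++i) s[i].weight /= wsum;   /* normalise to 1 */

        /* ascending order: worst (most negative) changes come first */
        qsort(s, NCHG, sizeof(scenario_t), cmp_asc);

        /* walk from the top until the cumulative weight reaches 5% */
        double cum = 0.0, worst_pct = s[0].pct_change;
        for (int i = 0; i < NCHG; ++i) {
            cum += s[i].weight;
            if (cum >= 0.05) { worst_pct = s[i].pct_change; break; }
        }
        /* sign flipped so a loss produces a positive margin requirement */
        return -worst_pct * portfolio_value * holding_adj;
    }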

  • 9

    Implementation Details

  • 10

    Input and Output of IRS

    • Input - Swap Rate, Swap Tenor

    • Output - Yield curve

    • Input for MTM Computation -

    Cash_flow_dates, Prev_cash_flow_dates, Notional_amt, accrued_int,

    fixed_rate

    • Output – Present CashFlow, MTM value

  • 11

    Sequential Algorithm

    Single MTM Computations

    Compute zero rates
    MTMfinal = 0
    for trade = 0 to nTrade do
        MTM[trade] = 0
        for CF = 0 to nCf do
            Compute Present_CashFlow[CF]
            MTM[trade] = MTM[trade] + Present_CashFlow[CF]
        end for
        MTMfinal = MTMfinal + MTM[trade]
    end for
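
    A minimal C rendering of the pseudocode above, assuming a hypothetical
    present_cash_flow() helper that stands in for the per-cash-flow steps detailed on the
    next slide:

    #define NTRADE 20000
    #define NCF    150

    /* hypothetical stand-in: present value of cash flow `cf` of trade `trade` */
    static double present_cash_flow(int trade, int cf) {
        (void)trade; (void)cf;
        return 0.0;   /* placeholder value */
    }

    double compute_mtm_final(void) {
        double mtm_final = 0.0;                        /* MTMfinal             */
        for (int trade = 0; trade < NTRADE; ++trade) {
            double mtm = 0.0;                          /* MTM[trade]           */
            for (int cf = 0; cf < NCF; ++cf)
                mtm += present_cash_flow(trade, cf);   /* Present_CashFlow[CF] */
            mtm_final += mtm;
        }
        return mtm_final;
    }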

  • 12

    Sequential Algorithm

    Single MTM Computation

    for CF = 0 to nCf do
        if (CF == 0)
            Read eff_date, curr_cash_flow_date
        else
            Read last_cash_flow_date, curr_cash_flow_date

        1. Calculate the number of days between curr_date and last_date to obtain the tenor.
        2. Calculate intermediate values such as fw_rate, comp_fw_rate, dist_fw_rate, etc.
        3. Calculate the floating cash flow of the trade from the values above and inputs
           such as notional_amt, accrued_int and fixed_rate.
        4. Compute the present value of the cash flow using the fixed/floating cash flow
           and the yield curve.
        5. MTM = MTM + Present_CashFlow[CF]
    end for

  • 13

    Sequential Algorithm

    Initial Margin Computations using 250 MTM values:

    Compute the current date's zero rates and retrieve 249 different zero rates from the
    database.

    MTMfinal[0..nRate-1] = 0
    for rate = 0 to nRate do
        for trade = 0 to nTrade do
            MTM[trade] = 0
            for CF = 0 to nCf do
                Compute Present_CashFlow[CF]
                MTM[trade] = MTM[trade] + Present_CashFlow[CF]
            end for
            MTMfinal[rate] = MTMfinal[rate] + MTM[trade]
        end for
    end for
    Compute IM using MTMfinal[nRate]

    (The two inner loops are the single MTM computation from the previous slide.)

  • 14

    NVIDIA GPU Systems

    Kepler K20x -
    Device: NVIDIA Kepler K20x GPU, 796 MHz, 2496 cores, 5 GB RAM.
    Host: Intel Xeon CPU E5-2697 v3 @ 2.1 GHz, dual socket, 6 cores/socket, 16 GB RAM.

    Kepler K40 -
    Device: NVIDIA Kepler K40 GPU, 745 MHz, 2880 cores, 12 GB RAM.
    Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.

    Kepler K80 -
    Device: NVIDIA Kepler K80 GPU, 562 MHz, 2x2496 cores, 2x12 GB RAM.
    Host: Intel Xeon CPU E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket, 64 GB RAM.

  • 15

    GPU Algorithm for single MTM

    MTMfinal = 0
    Compute zero rates
    Launch CUDA kernel with nTrade = 20000 threads (T1, T2, ..., T20000), one per trade

    Kernel computation (thread Ti):
        MTM[Ti] = 0
        for CF = 0 to nCf do
            Compute discount rate[CF]
            MTM[Ti] = MTM[Ti] + discount rate[CF]
        end for

    MTMfinal = ∑ MTM
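
    A minimal CUDA sketch of this one-thread-per-trade scheme. The per-cash-flow work is
    abstracted into a hypothetical present_cash_flow_dev() stub, d_mtm is assumed to be a
    cudaMalloc'd array of nTrade doubles, and the final MTMfinal sum is left to a separate
    reduction (the slides do not say how the authors perform it).

    /* hypothetical stand-in for the per-cash-flow present value computation */
    __device__ double present_cash_flow_dev(int ti, int cf) {
        (void)ti; (void)cf;
        return 0.0;   /* placeholder value */
    }

    /* one thread per trade: thread Ti accumulates the MTM of trade Ti */
    __global__ void mtm_kernel(double *mtm, int n_trade, int n_cf) {
        int ti = blockIdx.x * blockDim.x + threadIdx.x;
        if (ti >= n_trade) return;

        double acc = 0.0;
        for (int cf = 0; cf < n_cf; ++cf)
            acc += present_cash_flow_dev(ti, cf);
        mtm[ti] = acc;                              /* MTM[Ti] */
    }

    /* host side: 20000 trades, 150 cash flows; MTMfinal is then the sum of
       mtm[0..n_trade-1], e.g. via a reduction kernel or on the host */
    int threads = 256;
    int blocks  = (20000 + threads - 1) / threads;
    mtm_kernel<<<blocks, threads>>>(d_mtm, 20000, 150);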

  • 16

    Results

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
    2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. yield curves on K20x | 9.612 | 8.44x

    Results taken on Kepler K20x system

  • 17

    Further Optimization

    Nvidia GPU optimization:

    Multi level parallelism using Hyper-Q

    Using Shared Memory with coalesced memory access

    Modified data structure

    Resolved the issue of warp divergence

    Using constant memory

    Read-only cache memory using const __restrict__

  • 18

    Multi level parallelism using Hyper-Q

    Allows connection from multiple CUDA streams, Message Passing Interface (MPI)

    processes, or multiple threads of the same process.

    32 concurrent work queues, which can receive work from 32 processes/cores at the
    same time.

    A 1.5x performance benefit was achieved.

    Figure source: nvidia.com

  • 19

    GPU Algorithm for 250 MTM & IM Computation

    Set up 32 CUDA streams (S0, S1, ..., S31) and nRate = 250
    Distribute nRate/32 of the valuations to each of the 32 streams
    MTMfinal[0..nRate-1] = 0
    Compute zero rates and retrieve 249 previous zero rates
    Each stream runs the single-MTM computation for its share of the nRate zero rates
    Compute IM using MTMfinal[nRate]
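
    A minimal host-side sketch of spreading the 250 valuations over 32 CUDA streams.
    mtm_kernel, d_mtm, n_trade and n_cf are the assumed pieces from the earlier kernel
    sketch; how the zero rates of each scenario are passed is not specified in the slides.

    #define NSTREAM 32
    #define NRATE   250

    cudaStream_t streams[NSTREAM];
    for (int s = 0; s < NSTREAM; ++s)
        cudaStreamCreate(&streams[s]);

    int threads = 256;
    int blocks  = (n_trade + threads - 1) / threads;

    /* scenario r is issued on stream r % 32, so up to 32 independent
       valuations can be in flight at once (Hyper-Q) */
    for (int r = 0; r < NRATE; ++r) {
        mtm_kernel<<<blocks, threads, 0, streams[r % NSTREAM]>>>(
            d_mtm + r * n_trade, n_trade, n_cf /* plus scenario r's zero rates */);
    }

    for (int s = 0; s < NSTREAM; ++s) {
        cudaStreamSynchronize(streams[s]);
        cudaStreamDestroy(streams[s]);
    }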

  • 20

    Hyper-Q using default streaming

    nvcc --default-stream per-thread -c MTM_value.cu -arch sm_35 -w -Xcompiler -fopenmp

  • 21

    Results

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
    2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
    3 | Using default streaming flag | 6.24 | 1.54x

    Results taken on Kepler K20x system

  • 22

    Using Shared Memory

    [Figure: CUDA memory hierarchy - host, grid, blocks with shared memory, threads with
    registers, global and constant memory. Source: nvidia.com]

    Shared memory:

    • Read/write per block
    • Speed equivalent to local cache
    • 100x faster than global memory
    • Limited to 48KB
    • The zero rate and swap tenor arrays are kept in shared memory

  • 23

    Coalesced global memory access

    Threads0 Threads1Threads30 Threads31

    . . . . .. . . . .

    . . . . . .

    . . . . . .

    128 192 256 2112 2176

    Threads0 Threads1Threads30 Threads31

    . . . . . .

    . . . . . .

    128 132 136 252 256

    Shared Memory

    T0 T1 T2 T30 T31

    Global Memory

    Global Memory

    Threads

  • 24

    Using Shared Memory

    Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, ...

    Because the zero_rate and swap_tenor data are used multiple times per thread, they are
    stored in shared memory (see the sketch below).
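
    The original kernel listing is not preserved. A minimal CUDA sketch of this shared-memory
    scheme, with ARRAY_COUNT taken from the slide and all other names (mtm_kernel_shared,
    present_cash_flow_sh, the argument layout) assumed for illustration:

    #define ARRAY_COUNT 22   /* curve points, as stated on the slide */

    /* hypothetical helper: present value of one cash flow, reading the curve
       from the shared-memory copies instead of global memory */
    __device__ double present_cash_flow_sh(int ti, int cf,
                                           const double *zero_rate,
                                           const double *swap_tenor) {
        (void)ti; (void)cf; (void)zero_rate; (void)swap_tenor;
        return 0.0;   /* placeholder value */
    }

    __global__ void mtm_kernel_shared(const double *g_zero_rate,
                                      const double *g_swap_tenor,
                                      double *mtm, int n_trade, int n_cf) {
        /* per-block copies of the curve, loaded once by the first ARRAY_COUNT
           threads and reused by every cash-flow computation in the block */
        __shared__ double s_zero_rate[ARRAY_COUNT];
        __shared__ double s_swap_tenor[ARRAY_COUNT];

        if (threadIdx.x < ARRAY_COUNT) {
            s_zero_rate[threadIdx.x]  = g_zero_rate[threadIdx.x];
            s_swap_tenor[threadIdx.x] = g_swap_tenor[threadIdx.x];
        }
        __syncthreads();

        int ti = blockIdx.x * blockDim.x + threadIdx.x;
        if (ti >= n_trade) return;

        double acc = 0.0;
        for (int cf = 0; cf < n_cf; ++cf)
            acc += present_cash_flow_sh(ti, cf, s_zero_rate, s_swap_tenor);
        mtm[ti] = acc;
    }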

  • 25

    Results

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
    2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
    3 | Using default streaming flag | 6.24 | 1.54x
    4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x

    Results taken on Kepler K20x system

  • 26

    NVVP Profiling

  • 27

    Data Structure of Dates

    Data Structure -

    Cash_flow_dates and prev_cash_flow_date are stored as a structure
    (3 x 4-byte ints = 12 bytes per date):

    typedef struct _d {
        int day;    /* 4 bytes */
        int mon;    /* 4 bytes */
        int year;   /* 4 bytes */
    } dt;

  • 28

    Issue with Data Structure

    [Figure: threads T0..T31 reading 12-byte date structures from global memory]

    In one cycle the cache fetches 128 bytes, which yields only 128/12 ≈ 10 date elements.

    A large amount of data has to be fetched from global memory.

    The access pattern to global memory is strided.

  • 29

    Modified Data Structure

    Original Data Structure -

    Cash_flow_dates and prev_cash_flow_date were stored as a structure (12 bytes per date):

    typedef struct _d {
        int day;    /* 4 bytes */
        int mon;    /* 4 bytes */
        int year;   /* 4 bytes */
    } dt;

    Instead of the structure, the date information is stored in a single integer, e.g.:

    dt date;
    int tmp_date;
    date.day = 12;
    date.mon = 3;
    /* pack year, month and day into one integer (layout matches the extraction on the next slide) */
    tmp_date = (date.year << 16) | (date.mon << 8) | date.day;

  • 30

    Modified Data Structure

    To extract the day, month and year back from the packed integer we use:

    date.day  = tmp_date & 0xFF;
    date.mon  = (tmp_date >> 8) & 0xFF;
    date.year = (tmp_date >> 16) & 0xFFFF;
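
    A small self-contained pack/unpack example consistent with the bit layout above; the
    example date 12-3-2015 and the helper names are purely illustrative:

    #include <stdio.h>

    typedef struct _d { int day; int mon; int year; } dt;

    /* year in the high 16 bits, month in bits 8-15, day in the low 8 bits */
    static int pack_date(dt d)  { return (d.year << 16) | (d.mon << 8) | d.day; }

    static dt unpack_date(int tmp_date) {
        dt d;
        d.day  = tmp_date & 0xFF;
        d.mon  = (tmp_date >> 8) & 0xFF;
        d.year = (tmp_date >> 16) & 0xFFFF;
        return d;
    }

    int main(void) {
        dt d = { 12, 3, 2015 };
        int packed = pack_date(d);
        dt back = unpack_date(packed);
        printf("%d-%d-%d -> 0x%08X -> %d-%d-%d\n",
               d.day, d.mon, d.year, packed, back.day, back.mon, back.year);
        return 0;
    }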

  • 31

    Modified Data Structure

    [Figure: threads T0..T31 reading packed 4-byte dates from consecutive global memory
    addresses]

    In one cycle the cache fetches 128 bytes, which now yields 128/4 = 32 date elements.

    Less data has to be fetched from global memory.

    The access pattern to global memory is coalesced.

  • 32

    Results

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
    2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
    3 | Using default streaming flag | 6.24 | 1.54x
    4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
    5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x

    Results taken on Kepler K20x system

  • 33

    NVVP Profiling

  • 34

    Warp Divergence

    Threads are executed in warps of 32, with all threads in the

    warp executing the same instruction at the same time

  • 35

    Warp Divergence

  • 36

    Warp Divergence - solution
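
    The code snippets on these two slides did not survive extraction. The sketch below is a
    generic, hypothetical illustration of warp divergence and of the kind of if-else
    restructuring referred to in the results table; the condition, the helper names
    (discounted_cash_flow, accrued_part) and the data layout are assumptions, not the
    authors' kernel.

    /* BEFORE: the branch condition depends on per-trade data, so threads of one
       warp can take different paths and the two arms execute one after the other
       for the whole warp */
    for (int cf = 0; cf < n_cf; ++cf) {
        if (day_count[ti * n_cf + cf] > 0)
            acc += discounted_cash_flow(ti, cf);   /* future cash flow      */
        else
            acc += accrued_part(ti);               /* already settled flow  */
    }

    /* AFTER: both contributions are formed unconditionally and one is chosen
       with a select, so every thread in the warp follows the same instruction
       stream inside the loop (at the cost of some redundant work) */
    for (int cf = 0; cf < n_cf; ++cf) {
        double future  = discounted_cash_flow(ti, cf);
        double settled = accrued_part(ti);
        acc += (day_count[ti * n_cf + cf] > 0) ? future : settled;
    }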

  • 37

    Results

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
    2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
    3 | Using default streaming flag | 6.24 | 1.54x
    4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
    5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
    6 | Resolved the issue of warp divergence (by separating the if-else position in the code) | 3.148 | 3.05x

    Results taken on Kepler K20x system

  • 38

    NVVP Profiling

  • 39

    Constant Memory

    Where is constant memory?

    - Data is stored in device global memory
    - Data is read through the per-multiprocessor constant cache
    - 64KB of constant memory and 8KB of constant cache per SM

    // declare constant memory
    __constant__ float cst_ptr[size];

    // copy data from host to constant memory
    cudaMemcpyToSymbol(cst_ptr, host_ptr, data_size);
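
    A minimal sketch of how the curve data from the shared-memory version might instead be
    placed in constant memory, consistent with result 7 ("changed shared memory to constant
    memory"). The array names and the use of double precision are assumptions; h_zero_rate
    and h_swap_tenor are hypothetical host arrays.

    #define ARRAY_COUNT 22

    __constant__ double c_zero_rate[ARRAY_COUNT];
    __constant__ double c_swap_tenor[ARRAY_COUNT];

    /* host side, once before launching the kernels */
    cudaMemcpyToSymbol(c_zero_rate,  h_zero_rate,  ARRAY_COUNT * sizeof(double));
    cudaMemcpyToSymbol(c_swap_tenor, h_swap_tenor, ARRAY_COUNT * sizeof(double));

    /* device side: kernels read c_zero_rate / c_swap_tenor directly; when all
       threads of a warp read the same element, the constant cache serves it */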

  • 40

    Constant memory

    Here ARRAY_COUNT = 22 and the number of threads per block is 32, 64, ...

    Issue with the above implementation - warp divergence

  • 41

    Results

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
    2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
    3 | Using default streaming flag | 6.24 | 1.54x
    4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
    5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
    6 | Resolved the issue of warp divergence (by separating the if-else position in the code) | 3.148 | 3.05x
    7 | Changed shared memory to constant memory | 2.892 | 3.32x

    Results taken on Kepler K20x system

  • 42

    NVVP Profiling

  • 43

    Read-Only Cache Memory

    The read-only data cache was introduced with Compute Capability 3.5 architectures
    (e.g. Tesla K20c/K20X and GeForce GTX Titan/780 GPUs).

    Each SMX has a 48KB read-only cache.

    The CUDA compiler automatically accesses data via the read-only cache when it can
    determine that the data is read-only for the lifetime of the kernel.

    In practice, you need to qualify pointers as const and __restrict__ before the
    compiler can satisfy this condition.

    A read-only data cache access can also be specified explicitly with the __ldg()
    intrinsic function.

  • 44

    Read-Only Cache Memory

    Without Read-only Cache

    With Read-only Cache
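
    The code comparison originally shown here is not preserved. A hypothetical before/after
    of the kernel signature follows; only the const/__restrict__ qualifiers and the __ldg()
    intrinsic are the point, the kernel and parameter names are assumed.

    /* without read-only cache: plain pointers, global loads take the normal path */
    __global__ void mtm_kernel_plain(const double *zero_rate, const int *cf_date,
                                     double *mtm, int n_trade, int n_cf);

    /* with read-only cache: const + __restrict__ lets the compiler route these
       loads through the 48KB read-only data cache on CC 3.5 */
    __global__ void mtm_kernel_readonly(const double * __restrict__ zero_rate,
                                        const int    * __restrict__ cf_date,
                                        double       * __restrict__ mtm,
                                        int n_trade, int n_cf);

    /* inside the kernel, a load can also be forced through the read-only cache: */
    double z = __ldg(&zero_rate[k]);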

  • 45

    NVVP Profiling

  • 46

    Results

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Sequential computation of 20000 trades with 150 cash flows each on 250 diff. yield curves | 81.15 | -
    2 | Parallel computation of 20000 trades with 150 cash flows each on 250 diff. zero rates | 9.612 | 1x
    3 | Using default streaming flag | 6.24 | 1.54x
    4 | Shared memory used to store zero_rate and swap_tenor data | 5.221 | 1.84x
    5 | Change in data structure – cash flow dates stored as a single integer value instead of a structure | 4.229 | 2.27x
    6 | Resolved the issue of warp divergence (by separating the if-else position in the code) | 3.148 | 3.05x
    7 | Changed shared memory to constant memory | 2.892 | 3.32x
    8 | Read-only cache memory using const __restrict__ | 2.765 | 3.47x

    Results taken on Kepler K20x system

  • 47

    Intel Multi-core Systems

    Experimental Setup

    Ivy Bridge:
    Intel Xeon E5-2650 v2 (Ivy Bridge) @ 2.6 GHz, dual socket, 8 cores/socket with
    Hyper-Threading, 24 GB RAM.

    Haswell:
    Intel Xeon E5-2697 v3 @ 2.60 GHz, dual socket, 14 cores/socket with
    Hyper-Threading, 64 GB RAM.

  • 48

    Intel OpenMP

    Parallelization using OpenMP: the trade loop becomes a parallel section in which each
    thread (T1, T2, ..., Tn) calculates the MTM value of one trade, while the code before
    and after the loop remains a sequential region (see the sketch below).

    Syntax:

    #pragma omp parallel for clauses(private, firstprivate, ...)
    for (trades = 0 to nTrade)
    {
        for (CF = 0 to nCf)
        {
            computation steps
        }
    }
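
    A minimal OpenMP rendering of the loop above in C. present_cash_flow() is the same
    hypothetical helper used in the sequential sketch; the dynamic schedule with a chunk of
    64 trades anticipates the schedule(dynamic, chunk) clause mentioned on the next slide and
    is only illustrative.

    #include <omp.h>

    double present_cash_flow(int trade, int cf);   /* hypothetical helper */

    void compute_mtm_openmp(int n_trade, int n_cf, double *mtm, double *mtm_final_out) {
        double mtm_final = 0.0;

        /* each thread takes chunks of trades; the reduction combines the
           per-thread partial sums into MTMfinal */
        #pragma omp parallel for schedule(dynamic, 64) reduction(+ : mtm_final)
        for (int trade = 0; trade < n_trade; ++trade) {
            double m = 0.0;                          /* private to the thread */
            for (int cf = 0; cf < n_cf; ++cf)
                m += present_cash_flow(trade, cf);
            mtm[trade] = m;
            mtm_final += m;
        }
        *mtm_final_out = mtm_final;
    }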

  • 49

    Intel Optimizations

    Multi-thread settings

    Optimization using compiler flags:

    – -no-prec-div

    – -unroll-aggressive

    – -ipo

    schedule(dynamic, chunk)

  • 50

    Intel Optimizations

    -no-prec-div

    -prec-div improves the precision of floating-point divides. -no-prec-div disables this
    option and enables optimizations that give slightly less precise results than full IEEE
    division.

    -unroll-aggressive

    This option determines whether the compiler uses more aggressive unrolling for certain
    loops.

    -ipo[n]

    This option permits inlining and other interprocedural optimizations among multiple
    source files.

  • 51

    Intel Optimizations Results on HSW-EP

    Sr. No. | Experiment | Time in Sec | Performance Gain
    1 | Parallel computation using OpenMP of 20000 trades with 150 cash flows each | 10.6647 | -
    2 | Using KMP_AFFINITY=granularity=fine,compact,1,0 | 7.7874 | 1.37x
    3 | Using optimization flag -O2 | 3.4197 | 3.12x
    4 | Using optimization flags -no-prec-div -unroll-aggressive | 3.1642 | 3.37x
    5 | Using optimization flag -ipo | 2.7451 | 3.87x

    Results taken on Intel Haswell System

  • 52

    Comparative Performance

  • 53

    Conclusion and Future Work

    Traditional computing approaches are not suitable for the workload involved in
    computing MTM and IM for CCP risk assessment.

    Using HPC, 750 million cash flows can be computed for identifying the liquidity
    requirement in a few seconds.

    We achieved more than a 100x improvement over the best known system for a single
    MTM computation.

    HPC is well suited for complex calculations such as credit value adjustment to price
    counterparty risk in a deal (our next PoC use case), expected shortfall calculation
    for market risk measurement, potential future exposure and exposure at default
    calculations, collateral valuation, basis risk computation, etc.

    Risk management is moving towards intraday risk computation for all major risk
    categories (credit risk, market risk, liquidity risk and counterparty credit risk),
    and HPC is well suited to meeting the performance demands of these computations.

    The performance of the NVIDIA Kepler K80 is the best among all systems compared in
    our benchmarking results.

  • 54

    Acknowledgement

    We are thankful to Vinay Deshpande from NVIDIA Pune, India for providing access to
    the latest K40 and K80 GPU systems to benchmark and fine-tune the application.

    We are thankful to HPC Advisory Council for providing access to the Thor

    system for evaluating our application.

  • 55

    THANK YOU