Machine Learning Applications in Credit Risk

Location: ARPM Open Source Conference 8/13/2017 Machine Learning applications in Credit Risk 2017 Copyright QuantUniversity LLC. Presented By: Sri Krishnamurthy, CFA, CAP [email protected] www.analyticscertificate.com

Upload: quantuniversity

Post on 21-Jan-2018


TRANSCRIPT

Page 1: Machine Learning Applications in Credit Risk

Location:

ARPM Open Source Conference

8/13/2017

Machine Learning applications in Credit Risk

2017 Copyright QuantUniversity LLC.

Presented By:

Sri Krishnamurthy, CFA, CAP

[email protected]

www.analyticscertificate.com

Page 2: Machine Learning Applications in Credit Risk

Slides will be available at: www.analyticscertificate.com/MachineLearning

Page 3: Machine Learning Applications in Credit Risk

Sri Krishnamurthy, Founder and CEO

• Founder of QuantUniversity LLC. and www.analyticscertificate.com

• Advisory and Consultancy for Financial Analytics

• Prior experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers

• Regular Columnist for the Wilmott Magazine

• Author of forthcoming book “Financial Modeling: A case study approach”, published by Wiley

• Chartered Financial Analyst and Certified Analytics Professional

• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston

Page 4: Machine Learning Applications in Credit Risk

Quantitative Analytics and Big Data Analytics Onboarding

• Data Science, Quant Finance and Machine Learning Advisory

• Trained more than 1000 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

• Launching

▫ Analytics Certificate Program, Spring 2018

▫ Fintech Certification program, Fall 2017

• Building

Page 5: Machine Learning Applications in Credit Risk


Page 6: Machine Learning Applications in Credit Risk

Credit risk in consumer credit

Credit-scoring models and techniques assess the risk in lending to customers.

Typical decisions:

• Grant or deny credit to new applicants

• Increase or decrease spending limits

• Increase or decrease lending rates

• What new products can be given to existing applicants?

Page 7: Machine Learning Applications in Credit Risk

Credit assessment in consumer credit

History:

• Gut feel

• Social network

• Communities and influence

Traditional:

• Scoring mechanisms through credit bureaus

• Bank assessments through business rules

Newer approaches (FINTECH):

• Peer-to-Peer lending

• Lending Club, Prosper Marketplace

Page 8: Machine Learning Applications in Credit Risk


Page 9: Machine Learning Applications in Credit Risk

Types of algorithms

Machine learning

• Supervised Learning

▫ Prediction

▫ Classification

• Unsupervised Learning

▫ Clustering

Page 10: Machine Learning Applications in Credit Risk

Supervised Learning

Used to derive a relationship between dependent and independent variables

• Prediction

▫ Regression

▫ Decision Trees (CART)

▫ Neural Networks

• Classification

▫ Logistic Regression

▫ CART, Random Forest, SVM

▫ Neural Networks
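To make the classification case concrete, here is a minimal sketch of logistic regression fitted by batch gradient descent, using only the Python standard library. The toy data, the feature meanings (debt-to-income ratio, number of late payments) and the learning-rate settings are hypothetical choices for illustration and are not taken from the presentation.

```python
import math

def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n, n_features = len(X), len(X[0])
    w, b = [0.0] * n_features, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_features, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log-loss w.r.t. the score
            for j in range(n_features):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def predict_proba(w, b, x):
    """Estimated probability that x belongs to class 1."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

# Hypothetical toy data: [debt-to-income ratio, late payments]; label 1 = default
X = [[0.1, 0], [0.2, 0], [0.3, 1], [0.6, 2], [0.7, 3], [0.8, 4]]
y = [0, 0, 0, 1, 1, 1]
w, b = fit_logistic(X, y)
low_risk = predict_proba(w, b, [0.1, 0])
high_risk = predict_proba(w, b, [0.8, 4])
```

In practice a library implementation with regularization would normally be preferred; the sketch only shows how the dependent variable (default or not) is tied to the independent variables.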

Page 11: Machine Learning Applications in Credit Risk

Methodology

• Data pre-processing

• Split data into Training and Testing sets

• Train the model on Training data

• Test the model using Testing data to evaluate model performance
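The split, train and test steps of this methodology can be sketched as follows. The data, the 25% test fraction and the one-feature threshold "model" are all hypothetical stand-ins chosen to keep the example short; any supervised learner could take the place of `fit_threshold`.

```python
import random

def train_test_split(rows, labels, test_fraction=0.25, seed=0):
    """Shuffle indices reproducibly, then split into training and testing sets."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, round(len(rows) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return ([rows[i] for i in train_idx], [rows[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

def fit_threshold(x_train, y_train):
    """'Train' a one-feature rule: the midpoint between the two class means."""
    class0 = [x for x, y in zip(x_train, y_train) if y == 0]
    class1 = [x for x, y in zip(x_train, y_train) if y == 1]
    return (sum(class0) / len(class0) + sum(class1) / len(class1)) / 2

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical data: one risk feature in [0, 1); label 1 when it is at least 0.5
x = [i / 20 for i in range(20)]
y = [1 if v >= 0.5 else 0 for v in x]

x_train, x_test, y_train, y_test = train_test_split(x, y)
threshold = fit_threshold(x_train, y_train)            # train on training data only
y_pred = [1 if v >= threshold else 0 for v in x_test]  # predict on held-out data
test_accuracy = accuracy(y_test, y_pred)
```

The key point is that `threshold` never sees the testing rows, so `test_accuracy` estimates performance on unseen data.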

Page 12: Machine Learning Applications in Credit Risk

Unsupervised Learning

• No distinction between independent variables and dependent variables

• No result labels to determine “correct” results

• Goals:

▫ Data Reduction

▫ Clustering

Page 13: Machine Learning Applications in Credit Risk

Types of Clustering

• Partitioning Clustering

▫ Starts with K, the number of clusters sought

▫ Observations are randomly divided, then reshuffled to form cohesive clusters

▫ Example: K-means

• Hierarchical Agglomerative Clustering

▫ Each observation starts as its own cluster

▫ Clusters are combined two at a time until only one cluster remains

▫ Example: Hierarchical clustering using single linkage, Ward’s method etc.

Page 14: Machine Learning Applications in Credit Risk

K-means

• Tries to separate samples into K groups with the goal of maximizing between-group variance and minimizing within-group variance

• Requires K to be specified up front

• Starts with K initial centroids and optimizes to minimize the criterion, or until the specified number of iterations is reached

• Suited for larger datasets
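A rough sketch of the K-means loop described on this slide, in plain Python. Using the first K points as the initial centroids is a simplifying assumption made here so the example is deterministic; random or smarter initialization is more usual in practice.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Alternate assignment and centroid update until labels stop changing."""
    centroids = [list(p) for p in points[:k]]  # assumption: first k points as seeds
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no point changed cluster
        labels = new_labels
        # Update step: each centroid moves to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Hypothetical 2-D data with two well-separated groups
points = [[1, 1], [8, 8], [1.2, 0.8], [8.2, 7.9], [0.9, 1.1], [7.8, 8.1]]
labels, centroids = kmeans(points, k=2)
```

Each pass costs roughly (number of points) x K distance evaluations, which is why the method scales to larger datasets better than approaches that need all pairwise distances.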

Page 15: Machine Learning Applications in Credit Risk

Hierarchical clustering

• Goal is to derive a dendrogram starting from each record being its own cluster

• Works well for smaller data sets

• Proximity is measured in multiple ways (more later)
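As an illustration of the agglomerative idea, the sketch below starts with every record as its own cluster and repeatedly merges the two closest clusters, recording the merge history that a dendrogram would draw. Single linkage with Euclidean distance is assumed here as one of several possible proximity choices, and the data are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(points):
    """Return the merge history as (cluster_a, cluster_b, distance) tuples."""
    clusters = [[i] for i in range(len(points))]  # each record is its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: distance between the closest pair of members
                d = min(euclidean(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return merges

# Hypothetical one-dimensional records
points = [[0.0], [1.0], [5.0], [6.5]]
merges = single_linkage(points)
merge_distances = [d for _, _, d in merges]
```

The nested loops over all cluster pairs are what make this approach better suited to smaller data sets.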

Page 16: Machine Learning Applications in Credit Risk

The notion of distance

How do you measure similarity between two entities?

▫ Apples and Bananas

▫ Coke and Pepsi vs Orange juice

▫ Honda Civic vs Toyota Corolla

▫ New York and Boston

Page 17: Machine Learning Applications in Credit Risk

Distance measures

• Euclidean distance

• Cosine distance
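The two measures named on this slide can be written in a few lines. This is a minimal sketch; cosine distance is taken here as one minus the cosine similarity, which is a common convention but an assumption about what the slide intended.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """One minus the cosine of the angle between the two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)

d_euclid = euclidean_distance([0, 0], [3, 4])
d_orthogonal = cosine_distance([1, 0], [0, 1])
d_parallel = cosine_distance([1, 2], [2, 4])
```

Euclidean distance is sensitive to magnitude, while cosine distance depends only on direction: `[1, 2]` and `[2, 4]` are far apart in the first sense and essentially identical in the second.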

Page 18: Machine Learning Applications in Credit Risk

Other distance measures

• Manhattan distance (Taxi-cab distance)

• Jaccard distance

▫ Used to measure similarity or dissimilarity between binary and non-binary variables

▫ http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
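A short sketch of both measures follows. The Jaccard distance is written over sets, which covers binary variables if each record is represented by the set of attributes it has; returning 0.0 for two empty sets is a convention assumed here.

```python
def manhattan_distance(a, b):
    """Sum of absolute coordinate differences (taxi-cab distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(a, b):
    """One minus the ratio of shared items to all items across both sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)

d_manhattan = manhattan_distance([1, 2], [4, 6])
d_jaccard = jaccard_distance({"a", "b", "c"}, {"b", "c", "d"})
```

In the second call the sets share two of four distinct items, so the similarity is 2/4 and the distance is 0.5.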

Page 19: Machine Learning Applications in Credit Risk

Working with mixed-data

• Gower distance is used for calculating distances when we have mixed types of variables (continuous and categorical)

• Variables can be:

▫ Quantitative (such as rating scale)

▫ Binary (such as present/absent)

▫ Nominal (such as worker/teacher/clerk)

• The metrics used for each data type are described below:

▫ Quantitative: range-normalized Manhattan distance

▫ Ordinal: variable is first ranked, then Manhattan distance is used with a special adjustment for ties

▫ Nominal: variables of k categories are first converted into k binary columns and then the Dice coefficient is used (https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient )
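A simplified sketch of a Gower-style distance for quantitative and nominal variables. Ordinal handling is left out, all variables are weighted equally, and the field values are hypothetical; these are assumptions made to keep the example small. For a nominal variable, the Dice coefficient on one-hot columns is 1 when the categories match and 0 otherwise, so the dissimilarity reduces to a simple mismatch indicator.

```python
def gower_distance(a, b, kinds, ranges):
    """Average per-variable dissimilarity between two mixed-type records.

    kinds[i] is "quantitative" or "nominal"; ranges[i] is max - min of
    column i over the data set for quantitative columns, None otherwise.
    """
    total = 0.0
    for x, y, kind, value_range in zip(a, b, kinds, ranges):
        if kind == "quantitative":
            # Range-normalized Manhattan distance, always in [0, 1]
            total += abs(x - y) / value_range if value_range else 0.0
        elif kind == "nominal":
            # Mismatch indicator (equivalent to 1 - Dice on one-hot columns)
            total += 0.0 if x == y else 1.0
        else:
            raise ValueError(f"unsupported variable kind: {kind}")
    return total / len(kinds)

# Hypothetical loan records: (amount, interest rate in %, home ownership)
kinds = ["quantitative", "quantitative", "nominal"]
ranges = [40000, 20.0, None]  # assumed column ranges over the full data set
loan_a = (10000, 10.0, "RENT")
loan_b = (20000, 15.0, "OWN")
d_gower = gower_distance(loan_a, loan_b, kinds, ranges)
```

Here the per-variable dissimilarities are 0.25, 0.25 and 1, giving an average of 0.5. Because every term lies between 0 and 1, no single variable dominates because of its units.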

Page 20: Machine Learning Applications in Credit Risk

Support in R

• Daisy: Compute all the pairwise dissimilarities (distances) between observations in the data set

• Pam: Partitioning (clustering) of the data into k clusters “around medoids”, a more robust version of K-means

• Agnes: Computes agglomerative nesting (hierarchical clustering) of the dataset

Page 21: Machine Learning Applications in Credit Risk


Page 22: Machine Learning Applications in Credit Risk

Lending Club

Page 23: Machine Learning Applications in Credit Risk


The Data

https://www.lendingclub.com/info/download-data.action

Page 24: Machine Learning Applications in Credit Risk


The Data

https://www.kaggle.com/wendykan/lending-club-loan-data

Page 25: Machine Learning Applications in Credit Risk

Variable description

Page 26: Machine Learning Applications in Credit Risk

Objective

• Calculate dissimilarity between observations

• Select algorithm to group observations together

• Choose the best number of clusters

• Visualize clusters on reduced dimensions
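The transcript does not say which criterion was used to choose the number of clusters. One commonly used option is the average silhouette width, sketched below under that assumption: it is computed from a precomputed dissimilarity matrix and a candidate labeling, and the labeling (or value of k) with the highest score is preferred. The data are hypothetical.

```python
def average_silhouette(dist, labels):
    """Mean silhouette width; needs at least two distinct cluster labels."""
    n = len(labels)
    clusters = set(labels)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue  # singleton cluster contributes a silhouette of 0
        # a: mean distance to the other members of the same cluster
        a = sum(dist[i][j] for j in own) / len(own)
        # b: mean distance to the members of the nearest other cluster
        b = min(
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / n

# Hypothetical one-dimensional observations and their pairwise distances
values = [0, 1, 10, 11]
dist = [[abs(u - v) for v in values] for u in values]

good = average_silhouette(dist, [0, 0, 1, 1])  # natural grouping
bad = average_silhouette(dist, [0, 1, 0, 1])   # mixed-up grouping
```

Scores near 1 indicate tight, well-separated clusters; negative scores suggest observations sit in the wrong cluster.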

Page 27: Machine Learning Applications in Credit Risk

Selecting number of clusters

• Partitioning around medoids (PAM) is used in this case.

• PAM is an iterative clustering procedure with the following steps:

▫ Step 1: Choose k random entities to become the medoids.

▫ Step 2: Assign every entity to its closest medoid (using the distance matrix we have calculated).

▫ Step 3: For each cluster, identify the observation that would yield the lowest average distance if it were to be re-assigned as the medoid. If it differs from the current medoid, make this observation the new medoid.

▫ Step 4: If at least one medoid has changed, return to step 2. Otherwise, end the algorithm.
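The four steps listed on this slide can be sketched directly against a precomputed distance matrix. This follows the steps as stated rather than any particular library's implementation; the `initial_medoids` parameter is an addition made here so the example is reproducible, and the data are hypothetical.

```python
import random

def assign(dist, medoids):
    """Step 2: label each observation with the index of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def pam(dist, k, initial_medoids=None, seed=0, max_iter=100):
    """Iterate assignment and medoid update until the medoids stop changing."""
    n = len(dist)
    # Step 1: choose k starting medoids (random unless supplied)
    medoids = (list(initial_medoids) if initial_medoids is not None
               else random.Random(seed).sample(range(n), k))
    for _ in range(max_iter):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:
                new_medoids.append(medoids[c])  # keep medoid of an empty cluster
                continue
            # Step 3: the member with the lowest total (hence average) distance
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members)))
        # Step 4: stop when no medoid has changed
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

# Hypothetical one-dimensional observations and their pairwise distances
values = [0, 1, 2, 10, 11, 12]
dist = [[abs(u - v) for v in values] for u in values]
medoids, labels = pam(dist, k=2, initial_medoids=[0, 1])
```

Because only the distance matrix is used, the same code works with any dissimilarity, including one computed for mixed data types.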

Page 28: Machine Learning Applications in Credit Risk

Visualization with reduced dimension

• One way to visualize many variables in a lower dimensional space is with t-distributed stochastic neighbor embedding (t-SNE)

• This method is a dimension reduction technique that tries to preserve local structure so as to make clusters visible in a 2D or 3D visualization.

• https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
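For reference, the standard t-SNE formulation can be summarized as follows. The notation is introduced here and does not appear on the slide: x_i are the original observations, y_i their low-dimensional positions, n the number of observations, and sigma_i a per-point bandwidth set from the chosen perplexity.

```latex
% Similarities in the original space (Gaussian kernel, then symmetrized)
p_{j|i} = \frac{\exp\!\left(-\lVert x_i - x_j \rVert^2 / 2\sigma_i^2\right)}
               {\sum_{k \neq i} \exp\!\left(-\lVert x_i - x_k \rVert^2 / 2\sigma_i^2\right)},
\qquad
p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}

% Similarities in the embedding (Student-t kernel with one degree of freedom)
q_{ij} = \frac{\left(1 + \lVert y_i - y_j \rVert^2\right)^{-1}}
              {\sum_{k \neq l} \left(1 + \lVert y_k - y_l \rVert^2\right)^{-1}}

% Objective minimized over the positions y_i
C = \mathrm{KL}(P \,\Vert\, Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}
```

Pairs with large p_ij (close neighbors) are penalized heavily if they end up far apart in the embedding, which is why local structure is what the method preserves.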

Page 29: Machine Learning Applications in Credit Risk


Page 30: Machine Learning Applications in Credit Risk


Alternative Credit scoring in the news

Page 31: Machine Learning Applications in Credit Risk


Fintech being noticed by Regulators

Page 32: Machine Learning Applications in Credit Risk

Regulatory Sandboxes

• The regulatory sandbox allows businesses to test innovative products, services, business models and delivery mechanisms in the real market, with real consumers.

• The sandbox is a supervised space, open to both authorized and unauthorized firms, that provides firms with:

▫ reduced time-to-market at potentially lower cost

▫ appropriate consumer protection safeguards built in to new products and services

▫ better access to finance

• https://www.fca.org.uk/firms/regulatory-sandbox

Page 33: Machine Learning Applications in Credit Risk


US Regulators catching up

Page 34: Machine Learning Applications in Credit Risk

Model Validation

• “Model risk is the potential for adverse consequences from decisions based on incorrect or misused model outputs and reports.” [1]

• “Model validation is the set of processes and activities intended to verify that models are performing as expected, in line with their design objectives and business uses.” [1]

Ref:

[1] Supervisory Letter SR 11-7, Guidance on Model Risk Management

Page 35: Machine Learning Applications in Credit Risk


Popularity of Open-source software in the enterprise increasing

Page 36: Machine Learning Applications in Credit Risk

Cloud maturing

• Financial Services customers like Capital One, FINRA, and Pacific Life are moving critical workloads to AWS

Page 37: Machine Learning Applications in Credit Risk

Challenges in adopting Open-source software in the enterprise

• Versions and packages

Page 38: Machine Learning Applications in Credit Risk

Challenges in adopting Open-source software in the enterprise

• Difficulty in replicating and reconciling differences in environments

Page 39: Machine Learning Applications in Credit Risk

Challenges in adopting Open-source software in the enterprise

• Deploying models built by Data Scientists is still a problem

[Figure labels: Data Scientists, Enterprise IT]

Page 40: Machine Learning Applications in Credit Risk

Challenges in adopting Open-source software in the enterprise

• The try-before-adopt model is difficult with unproven open-source solutions

Page 41: Machine Learning Applications in Credit Risk


www.QuSandbox.com

Page 42: Machine Learning Applications in Credit Risk

Use cases

Quant/Enterprise use cases

• Create an environment that can support multiple platforms and programming languages

• Enable remote running of applications

• Ability to try out a GitHub submission or someone else’s code

• Facilitate creation of Docker images to create replicable containers

• Create prototyping environments for Data Science/Quant teams

• Enable Data scientists/Quants to deploy their solutions

• Enable running multiple experiments concurrently

• Integrate seamlessly with the cloud to scale up computations

Page 43: Machine Learning Applications in Credit Risk

Use cases

Fintech use cases

• To demonstrate solutions to enterprises

• Create customized enterprise trials for companies that don’t permit installation of vendor software prior to procurement

• To manage quick updates

• Enable effective integration and hosting of services (REST APIs)

Page 44: Machine Learning Applications in Credit Risk

Use cases

Academic use cases

• Enable creation of course material and exercises that could be shared

• Enable students and workshop participants to focus on the data science experiments rather than on setting up environments

Page 45: Machine Learning Applications in Credit Risk

Creating replicable environments

Create and manage replicable environments (Code + software + data) in a single portal

Page 46: Machine Learning Applications in Credit Risk

Creating replicable environments

• Create replicable environments (Code + software + data) through an easy point & click tool

• Publish to Docker Hub or manage internally

• Share it with target users

Page 47: Machine Learning Applications in Credit Risk

User portal

• Run multiple experiments in pre-created environments (Code + software + data)

• Deploy your own solutions

• Run any Docker image or GitHub submission on the cloud

Page 48: Machine Learning Applications in Credit Risk


Run Jupyter notebooks and prototype applications

Page 49: Machine Learning Applications in Credit Risk

Run RStudio and Shiny applications

Page 50: Machine Learning Applications in Credit Risk


Run any Docker application

Page 51: Machine Learning Applications in Credit Risk


Manage tasks and errors

Page 52: Machine Learning Applications in Credit Risk


User portal

• Dockerize and deploy applications on AWS in just a few steps

Page 53: Machine Learning Applications in Credit Risk


Deploy applications with ease

Page 54: Machine Learning Applications in Credit Risk


QU’s open source project – Project Mozaic

Page 55: Machine Learning Applications in Credit Risk


www.QuSandbox.com

Page 56: Machine Learning Applications in Credit Risk

www.analyticscertificate.com/MachineLearning

Page 57: Machine Learning Applications in Credit Risk

Thank you ARPM and enjoy the boot camp!

Check out our programs at: www.analyticscertificate.com/fintech

www.qusandbox.com

Sri Krishnamurthy, CFA, CAP

Founder and CEO

QuantUniversity LLC.

srikrishnamurthy

www.QuantUniversity.com

Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.