credit card analysis of czech bank

35
Credit Card Analysis of Czech Bank Goals Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer. The Berka dataset is from the 1999 PKDD Discovery Challenge. The Berka dataset is a collection of financial

Upload: byron-mccullough

Post on 31-Dec-2015

84 views

Category:

Documents


0 download

DESCRIPTION

Credit Card Analysis of Czech Bank. Goals. Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankGoals

• Our goal for this project is to analyze customer and credit-card information, from the Berka dataset, to extrapolate the type of customer who makes a good candidate for a credit-card, and what level of credit to extend to that customer.

• The Berka dataset is from the 1999 PKDD Discovery Challenge.

• The Berka dataset is a collection of financial information from a Czech bank.

Page 2: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankDomain Description

Entity-Relationship Description

• Each account has both static characteristics (e.g. date of creation, address of the branch) given in relation "account" and dynamic characteristics (e.g. payments debited or credited, balances) given in relations "permanent order" and "transaction".

• Relation "client" describes characteristics of persons who can manipulate with the accounts.

• Relations "loan" and "credit card" describe some services which the bank offers to its clients;

• Relation "demographic data" gives some publicly available information about the districts (e.g. the unemployment rate); additional information about the clients can be deduced from this.

Page 3: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankDomain Description - Continued

The dataset contains the following tables:• Accounts

o Each record describes static characteristics of an account

o Size: 4500 records• Clients

o Each record describes characteristics of a cliento Size : 5369 records

• Disposition (Disp)o Each record relates a client with an account and

describes the client’s right to operate that accounto Size: 5369 records

Page 4: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankDomain Description - Continued

Dataset Table Description, Continued• Permanent Orders, Debit Only (Orders)

o Each record describes characteristics of a payment order

o Size : 6471 records• Transactions (Trans)

o Each record describes one transaction on an account

o Size: 1056320 records • Loans

o Each record describes a loan granted for a given account

o Size: 682 records

Page 5: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankDomain Description - Continued

Dataset Table Description, Continued• Credit Cards (Cards)

o Each record describes a credit card issued to an account

o Size : 892 records• Demographic Data (District)

o Each record describes demographic characteristics of a district

o Size: 77 records

Page 6: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities

1. Converted the ascii files to:

a. MS Excel and/or MS Word files for cleaning data

b. MS Access database for use ini. data mining, ii. de-normalizing or ‘flattening’ files, and iii. basic querying to learn more about the data.

c. Put all modified files into file types recognized by Weka. i. These files are comma delimited with a ‘heading’ of

attribute definition information.

Page 7: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

2. Verified all table relationships:

a. Every account has an Owner via Disp and Account tables

b. Order and Loan records are duplicated in transaction records. That is, the transactions include Order records and Loan payments.

i. Loan records in Trans are identified by k_symbol=”LP”

Page 8: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

3. De-normalize, or ‘flatten’, files for mining. Our database is relational. In order to mine or cluster attributes, those attributes must be in a single table. We have created a de-normalized table based on our goals.

a. Goal: Analyze credit-card information to extrapolate the type of customer who makes a good candidate for a credit-card.

                  i.      Account-Client-Disp-Card-District-Loan-Transaction Table• Using information we discovered about accounts from

previous clustering, cluster customer information        Using card type as clustering attribute       Added "N" (None) as a possible value to the Loan Status

attribute• In order to better understand customers, we looked at this

table in two ways: • To identify, from all customers, which were credit card

holders and which were not.• To examine the variances that exist between all credit

card holding customers.             

Page 9: Credit Card Analysis of Czech Bank

4. Add, change, remove, and descretize attributes as necessary. The tables shown below describe the changes made:

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Account Table:Changes Made With Excel

Column Description Changes Missing or Invalid Values

Notes

Frequency Frequency of Statement Issuance

Translated values as follows: •POPLATEK MESICNE changed to MONTHLY ISSUANCE (MI) •POPLATEK TYDNE changed to WEEKLY ISSUANCE (WI) •POPLATEK PO OBRATU change to ISSUANCE AFTER TRANSACTION (TI)

N/A Translated for ease of use

Date Date of account creation Removed Ignore This attribute is not used in our mining effort

Page 10: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Client Table:Changes Made With Excel

Column Description Changes Missing or Invalid Values

Notes

BirthNumber Birthday and gender Removed This is a 6-digit number. The documentation says that its format is as follows:   - YYMMDD    (Men)   - YYMM50+DD    (Women) Analysis suggests that the format is as follows:   - YYMMDD    (Men)   - YY50+MMDD    (Women) Format changed to:   - MM/DD/YYYY   - Created a new field for Client_Sex

N/A Attribute removed from dataset and replaced by attributes Client_Sex and Client_Age

Page 11: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Client Table, Continued:

Client_Sex Gender derived from BirthNumber attribute

Added Built during data pre-processing. Values are MALE (M) and FEMALE (F)

N/A Distribution of Values: - 49% Male- 51% Female

Client_Age Discretized value derived from BirthNumber attribute Values: - 1 = YOUTH    (0 - 24) - 2 = ADULT    (24 - 35) - 3 = MIDDLE-AGE    (36 - 64) - 4 = SENIOR    (65 - *)

Added Built during data pre-processing.

N/A None

Column Description Changes Missing or Invalid Values

Notes

Page 12: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Disposition Table

Changes Made With Excel

Column Description Changes Missing or Invalid Values

Notes

Type Type of Disposition(owner/user)

Translated as follows: •Disponent = User •Owner = Owner

User Distribution of Values: - 84% Owner - 16% User

Preprocessing Activities – Order Table

Changes Made With Excel

Column Description Changes Missing or Invalid Values

Notes

K_Symbol Characterization of the payment

Removed 22% of the values are missing

Removed attribute due to missing values

Page 13: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Loan Table

Column Description Changes Missing or Invalid Values

Notes

Date Date the loan was granted

Changed format from YYMMDD to MM/DD/YYYY

Ignore Correlation with Status Attribute: •  R2 = 98%• r = 0.49

Amount Amount of loan •Removed •Discretized values stored in Amt attribute

N/A Correlation with Status Attribute: •  R2 = 68%• r = 0.34

Duration •Duration of the loan •Possible Values:

•  12 months•  24 months•  36 months•  48 months•  60 months

•Removed •Discretized values stored in Dur attribute

N/A •Correlation with Status attribute:

•  R2 = 100%• r = 0.51

•Distribution of Values:

•  12 = 19%•  20 = 20%•  36 = 19%•  40 = 20%•  60 = 21%

Changes Made With Excel

Page 14: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Loan Table, ContinuedColumn Description Changes Missing or

Invalid Values

Notes

Payments Monthly loan payment •Removed •Discretized values stored in Amt attribute

N/A  

Dur Discretized Duration Attribute: •  (*,30)•  (31,42)•  (43,54)•  (55,*)

•Added •Discretized in Rosetta using Entropy Algorithm

N/A Distribution of values: •  (*,30) = 39%•  (31,42) = 19%•  (43,54) = 20%•  (55,*) = 21%

Pmt Discretized Payment Attribute: •  (*,8041) = 1•  (8041,*) = 2

•Added •Discretized in Rosetta using Entropy Algorithm •Rosetta discretized into 50+ values.. We merged values using <1%

N/A Distribution of values: •  (*,8041) = 94%•  (8041,*) = 6%

Amt Discretized Amount Attribute: •  (*,30708) = 1•  (30709,49380) = 2•  (49381,76926) = 3•  (76927,230310) = 4•  (230311,*) = 5

•Added •Discretized in Rosetta using Entropy Algorithm •Rosetta discretized into 100+ values.. We merged values using <1%

N/A Distribution of values: •  (*,30708) = 9%•  (30709,49380) = 10%•  (49381,76926) = 13%•  (76927,230310) = 47%•  (230311,*) = 21%

Page 15: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Transaction Table

Column Description Changes Missing or Invalid Values

Notes

Date Date of transaction Format changed from YYMMDD to MM/DD/YYYY

Ignore  

Type +/- transaction Translated Values: • PRIJEM = Credit • VYDAJ = Withdrawal

Flag Record - Do not use

 

Operation Mode of transaction Translated values: • VYBER KARTOU = Credit Card Withdrawal (CCW) • VKLAD = Credit in Cash (CRC) • PREVOD Z UCTU = Collection from Another Bank (CAB) • VYBER = Withdrawal in Cash (WC) • PREVOD NA UCET = Remittance to Another Bank (RAB)

Flag record - Do not use

 

Changes Made With Excel

K_Symbol Characterization of transaction

Translated Values: • POJISTNE = Insurance Payment (IP) • SLUZBY = Payment on Statement (PS) • UROK = Interest Credited (ICR) • SANKC. UROK - Sanction Interest (SI) • SIPO - Household (H) • DUCHOD - Old-age Pension (OP) • UVER - Loan Payment (LP)

Flag record - Do not use

Translated for ease of use

Page 16: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Demographic TableChanges Made With Excel

Column Description Changes Missing or Invalid Values

Notes

A2 District Name Replaced Spaces with underscore ("_")

N/A  

A3 Region Name Replaced Spaces with underscore ("_")

N/A Distribution of Values: •  Prague = 1%•  Central_Bohemia = 16%•  South_Bohemia = 10%•  West_Bohemia = 13%•  North_Bohemia = 13%•  East_Bohemia = 16%•  South_Moravia = 18%•  North_Moravia = 14%

Issued Date card issued •Removed •Eliminated null time-stamp and changed date format from YYMMDD to MM/DD/YYYY

Ignore For our purposes, date card issued is not material

Page 17: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankData Preprocessing Activities - Continued

Preprocessing Activities – Credit Card Table

Changes Made With Excel

Column Description Changes Missing or Invalid Values

Notes

Issued Date card issued •Removed •Eliminated null time-stamp and changed date format from YYMMDD to MM/DD/YYYY

Ignore For our purposes, date card issued is not material

Page 18: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankMethodology

Methodology Overview

1. Attribute Ranking

2. Classification Analysis

3. Clustering Analysis

Page 19: Credit Card Analysis of Czech Bank

• Classification is a process where a model is built describing a predetermined set of data classes.

• The model is constructed by analyzing all the records in the database.

• Each tuple is assumed to belong to a predefined class, as determined by one of the attributes called the class label.

• The tuples analyzed to build the model form the training data set.

• Typically, the learned model is expressed in terms of decision trees or classsification rules.

• These rules can be used to predict the test data set.

Credit Card Analysis of Czech BankMethodology – Classification Analysis

Page 20: Credit Card Analysis of Czech Bank

• We used See5 as the classification tool for our project. A brief description of the algorithm is as follows:

• The tree starts as a single node representing the training samples.

• If the samples are all of the same class, then the node becomes a leaf and is labeled with that class.

• Otherwise, the algorithm uses an entropy-based measure known as Information Gain as a heuristic for selecting the attribute that will best separate the samples into individual classes. This attribute becomes the "test" or "decision" attribute at the node.

• A branch is created for each known value of the test attribute and the samples are partitioned accordingly.

Credit Card Analysis of Czech BankMethodology – Classification Analysis

Page 21: Credit Card Analysis of Czech Bank

• See5 Algorithm, cont.• The algorithm uses the same process recursively to form a

decision tree for the samples at each partition.

• The recursive partition stops when all the samples for a given node belong to the same class or if there are no remaining attributes on which samples may be further partitioned.

Credit Card Analysis of Czech BankMethodology – Classification Analysis

Page 22: Credit Card Analysis of Czech Bank

Trans_Avg_Balance <= 37097.18

Class :0 (1828.3/93.4)

Client_District_ID = 10

Class: 1(6.5)

Client_District ID = 38

Class: 1(8.5 /0.9)

Trans_Avg_Balance > 46123.76

Class: 1(8.3)

Account_Opened = 1995

Client_District ID = 1 Client_District ID = 15

Client_District ID = 13

Class: 1(5.3)

Class: 1(5.3)

Trans_Avg_Balance > 66899.77

Class: 1(8.3)

Client_Age = Senior

Class: 1(5.3)

Credit Card Analysis of Czech BankMethodology – Classification Analysis

We ran See5 using the RuleSets option and the Boost option with 3

trials. This represents simplified version of the decision tree generated

by See5. To build this tree, we used Rules obtained, which had high

confidence (>85%).

Page 23: Credit Card Analysis of Czech Bank

See5 also has the ability to express the clasifiers as Rule sets, which are easier to understand. 96 rules were generated, out of which we have listed 3 rules. All of these have confidence of more than 90%

Rule 1 Rule 1/1: (1828.4/93.4, lift 1.3)    Trans_Avg_Balance <= 37097.18 --> class 0 [0.948]

If Trans_Avg_Balance <= 37097.18, Then Class Prediction is taken as Non-CardHolder

Credit Card Analysis of Czech BankMethodology – Classification Analysis

Page 24: Credit Card Analysis of Czech Bank

Rule 2 Rule 1/3: (8.3, lift 3.7)    Client_Age = M    Client_District_ID = 1    Trans_Avg_Balance > 66899.77 --> class 1 [0.903]

If (Client_Age = Middle-Age) and (Client_District_ID = 1) and (Trans_Avg_Balance = 66899.77), Then Class Prediction is that of a Card Holder.

Credit Card Analysis of Czech BankMethodology – Classification Analysis

Page 25: Credit Card Analysis of Czech Bank

Rule 3

Rule 1/4: (8.3, lift 3.7)    Client_District_ID = 60    Trans_Avg_Balance > 46123.76    Loan_Status = none --> class 1 [0.903]

If (Client_District_ID = 60) and (Trans_Avg_Balance > 46123.76) and (Loan_Status = none), Then Class Prediction is that of a Card Holder.

Credit Card Analysis of Czech BankMethodology – Classification Analysis

Page 26: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankMethodology – Classification Analysis

• The estimated predictive error = (64 + 390)/3600 = 12.6% • The percentage of instances that were correctly classified as Non-card holders

= (2816/2880) = 97%• The percentage of instances that were incorrectly classified as Non-card

holders = (64/2880) = 22.2% • The percentage of instances that were in-correctly classified as Card holders =

(390/720) = 54% The percentage of instances that were correctly classified as Card holders = (330/720) = 45%

Training Set Evaluation

Page 27: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankMethodology – Classification Analysis

Test Set Evaluation

• The estimated predictive error on the Test set is = ( 59 +119)/900 = 19.7% • The percentage of instances that were correctly classified as Non-card holders

= (669/728) =91% • The percentage of instances that were incorrectly classified as Non-card

holders = (59/728) = 8.1% • The percentage of instances that were in-correctly classified as Card holders =

(119/172) = 69% • The percentage of instances that were correctly classified as Card holders =

(53/172)= 73.6%

Page 28: Credit Card Analysis of Czech Bank

• We used cluster analysis to partition the data into a set of classes, grouping together customers or attributes with similar characteristics.

• The purpose of cluster analysis is to place observations into groups or clusters suggested by the data such that observations in a given cluster tend to be similar to each other in some sense, and objects in different clusters tend to be dissimilar.

• The COBWEB Algorithm was chosen for this task. The COBWEB algorithm used was implemented in the Weka Toolkit.

Credit Card Analysis of Czech BankMethodology – Cluster Analysis, Conceptual Method

Page 29: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankMethodology – Cluster Analysis , Conceptual Method

• The COBWEB algorithm was run twice. Once with a cutoff value of 0.18 and once with a cutoff value of 0.17.

• This is done because Weka returns the cluster information only for the leaf clusters and prunes the tree by levels.

• The results of the COBWEB Clustering analysis are interpreted in the following chart.

Cluster 0# Instances: 892

Cluster 2# Instances: 71

Cluster 3# Instances: 185

Cluster 5# Instances: 83

Cluster 6# Instances: 84

Cluster 7# Instances: 469

Cluster 57# Instances: 278

Cluster 59# Instances: 88

Cluster 60# Instances: 103

Cluster 1# Instances: 423

Cluster 4# Instances: 167

Cluster 58# Instances: 191

Page 30: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankMethodology – Cluster Analysis , Partitioning Method

Description All Card

Holders Cluster 2 Cluster 3 Cluster 5 Cluster 6 Cluster 7 Cluster 57 Cluster 59 Cluster 60

Number of

Instances 892  75  182  83  83  469  277  94  83 

Percentage Distributions within Clusters by AGE

Youth 21.00%  12.00%  0.00%  98.00%  0.00%  20.00%  0.00%  100.00%  0.00% 

Adult 21.00%  24.00%  0.00%  0.00%  94.00%  19.00%  0.00%  0.00%  93.00% 

Middle Age 56.00%  60.00%  100.00%  0.00%  0.00%  59.00%  100.00%  0.00%  0.00% 

Senior 2.00%  4.00%  0.00%  1.00%  6.00%  2.00%  0.00%  0.00%  6.00% 

Percentage Distributions within Clusters by GENDER

Female 47.00%  100.00%  100.00%  100.00%  100.00%  0.00%  0.00%  0.00%  0.00% 

Male 53.00%  0.00%  0.00%  0.00%  0.00%  100.00%  100.00%  100.00%  100.00% 

Mean Value of AVERAGE BALANCE

Mean 51653.23  54304.88  51668.22  46762.24  51136.43  52179.92  53217.44  48965.39  51136.43 

Std. Deviation 11109.32  12241.39  9367.23  13780.03  10064.00  10983.18  11055.47  11241.49  10064.01 

Median 51292.00  54205.72  51580.11  43661.80  50401.42  51449.19  52323.52  48804.16  50401.42 

Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT

Single 83.00%  0.00%  100.00%  100.00%  100.00%  83.00%  82.00%  88.00%  100.00% 

Joint 17.00%  100.00%  0.00%  0.00%  0.00%  17.00%  18.00%  12.00%  0.00% 

Percentage Distributions Within Clusters by CARDTYPE

Junior 17.00%  10.00%  0.00%  74.00%  1.00%  16.00%  0.00%  78.00%  1.00% 

Classic 74.00%  80.00%  91.00%  21.00%  88.00%  73.00%  86.00%  18.00%  88.00% 

Gold 10.00%  10.00%  9.00%  4.00%  11.00%  11.00%  13.00%  3.00%  11.00% 

Page 31: Credit Card Analysis of Czech Bank

• The distinction of each cluster was not obvious beyond the coupling of age group and card type.

• To further investigate the characteristics of customers, we ran another clustering analysis using partitioning methods. We used the SAS Enterprise Miner product. SAS uses a partitioning clustering tool implementing the WARD (Minimum variance) method.

Credit Card Analysis of Czech BankMethodology – Cluster Analysis , Partitioning Method

Page 32: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankMethodology – Cluster Analysis , Conceptual Method

Description All Card Holders Cluster 1 Cluster 2 Cluster 4 Cluster 6 Cluster 19 Cluster 21

Percentage Distributions within Clusters by AGE

Youth 21.00%  80.00%  18.18%  17.65%  62.50%  0.00%  0.00% 

Adult 21.00%  0.00%  18.00%  11.76%  0.00%  100.00%  50.00% 

Middle Age 56.00%  20%  63.64%  70.59%  37.50%  0.00%  50.00% 

Senior 2.00%  0.00%  0.00%  0.00%  0.00%  0.00%  0.00% 

Percentage Distributions within Clusters by GENDER

Female 47.00%  20.00%  45.45%  29.41%  37.50%  0.00%  50.00% 

Male 53.00%  80.00%  54.55%  70.59%  62.50%  100.00%  50.00% 

Mean Value of AVERAGE BALANCE

Mean 51653.23  26887.36  75684.47  73667.43  30521.51  78237.63  80549.51 

Std. Deviation 11109.32  328.09  508.51  579.88  601.73  0.00%  909.67 

Minimum 0.00%  26522.30  74787.91  72794.25  29640.67  78237.63  79906.27 

Maximum 0.00%  27362.37  76453.60  7424.22  31253.80  78237.63  81192.75 

Percentage Distributions Within Clusters by NUMBER OF USERS PER ACCOUNT

Single 83.00%  80.00%  54.55%  70.59%  100.00%  100.00%  100.00% 

Joint 17.00%  20.00%  45.45%  29.41%  0.00%  0.00%  0.00% 

Percentage Distributions Within Clusters by CARDTYPE

Junior 16.00%  60.00%  9.09%  11.76%  62.50%  0.00%  0.00% 

Classic 74.00%  20.00%  45.45%  47.06%  37.50%  100.00%  100.00% 

Gold 10.00%  20.00%  45.45%  41.18%  0.00%  0.00%  0.00% 

Page 33: Credit Card Analysis of Czech Bank

Credit Card Analysis of Czech BankResults

The following rules represent our findings in analyzing whether or not a customer will/will not have a credit card:

1. If the average account balance is less-than-or-equal 37097.18, then the client will not be a card holder.

2. If the client is middle-age and lives in Prague, and his/her averageaccount balance is greater-than 66899.77, then that client will be a card holder.

3. If the client lives in Prostejov, and has an average account balance in excess of 46123.76, then that client will be a card holder.

Page 34: Credit Card Analysis of Czech Bank

The following are characterizations of cardholders within the current population of bank clients

1. There is no apparent difference between men and women in the

issuance of credit cards.

2. Middle aged single females have a higher percentage of classic

and gold cards than the general population.

3. All credit card customers that have taken a loan have good status in repayment.

4. Customers who are in the age bracket less than 24 and whose average balance is under 50000.00 are likely candidates for a JUNIOR Card.

5. Otherwise, it is not easily determined from customer information

which card type has been selected. Perhaps further examination of

customer behavior is warranted.

Credit Card Analysis of Czech BankResults

Page 35: Credit Card Analysis of Czech Bank

• Future continuation of this analysis might include • identifying customers with a junior or classic credit card to whom the

bank can offer a higher-limit card.

• Other Opportunities• Which accounts are likely to default on loan payments? Why?

• What are the characteristics of a good bank client ("good" is defined here, in the most general of terms, as who will make payments on their outstanding balances in a timely manner)?

• What are the identifying characteristics of a good bank branch (again, we define "good" only in the most general terms of which branches are most successful in the collection of loan payments)?

Credit Card Analysis of Czech BankNext Steps