Machine Learning for Fraud Detection


Upload: nkumards

Post on 05-Aug-2015


TRANSCRIPT

Page 1: Machine Learning for Fraud Detection

Bigger Data. Better Results.™

Machine Learning for Fraud Detection

Nitesh Kumar, PhD [email protected]

Page 2: Machine Learning for Fraud Detection


Who Am I?

• Applied Math PhD
• Derivative/Options Pricing Background
• 7 years doing analytics
• Data Science at Skytree for 2 years

Page 3: Machine Learning for Fraud Detection


Skytree Inc.

• Came out of Alex Gray’s (CTO) FastLab @ Georgia Tech
• Software company that provides machine learning software
• Built to function on top of Hadoop
• Automation, speed, and scalability
• Users can interact through command line interface, APIs, and GUI
• 20 million dollars in Series A
• TAB: Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan

Page 4: Machine Learning for Fraud Detection

What is Skytree?

• Machine Learning Platform: GBM, K-means, RF, SVD/PCA, Linear/Logistic, SVM, collaborative filtering, etc.
• Built for Big Data: scales linearly with data size and compute nodes (MapReduce, Hadoop)
• Usability: SDK in Python, Java, REST, even GUI; data preparation through Spark
• Automation: 1-click modeling
• ML on Bigger Data produces Better Results: larger datasets lead to higher accuracy

Page 5: Machine Learning for Fraud Detection


Outline

• Introduction: Why Skytree, Big Data, and Machine Learning for Fraud?
• Machine Learning in Financial Services: issues, methods, and solution
• Live demo of Skytree on a real-world dataset (command line, API, GUI), time and setup permitting

Page 6: Machine Learning for Fraud Detection


Introduction

• Fraud is a Big problem (Big Data, Big Cost)
• Why is Machine Learning necessary?
• Comprehensive solution?

Page 7: Machine Learning for Fraud Detection

Fraud is a Big Data Problem

• “More than 23 billion credit card transactions are processed annually in USA” (CreditCards.com)
• Credit card transactions alone generate multiple terabytes of data a year
• Each transaction has 100-300 attributes
• Data is distributed across multiple nodes

Page 8: Machine Learning for Fraud Detection

Fraud is a Big Cost Problem

• “Businesses lose an estimated $3.5 billion annually to fraud and financial crime.” (Forbes, 2014)
• “Total value of credit card transactions in the U.S. in 2012: $2.48 trillion” (CreditCards.com) http://www.federalreserve.gov/releases/g19/Current/

Page 9: Machine Learning for Fraud Detection

Why Machine Learning?

• Traditional pattern finding through careful, hand-crafted querying does not scale to large datasets
• Prior rule-based engines do not make use of information from multiple attributes at the same time
• Machine Learning is concerned with algorithms that can learn from data: multivariate statistics, automated predictive analytics
• Even a tiny increase in accuracy can lead to millions of dollars in savings

Page 10: Machine Learning for Fraud Detection

Gap between Machine Learning and Big Data

• Awakening to Big Data, experimenting with ML?
• ML is necessary to derive value out of Big Data

Page 11: Machine Learning for Fraud Detection

ML on Bigger Data produces Better Results

• Weak and Strong Law of Large Numbers
• “We have shown that for a prototypical natural language classification task, the performance of learners can benefit significantly from much larger training sets.” Banko and Brill, Proceedings of ACL, 2001.
• “Breiman’s procedure (random forest) is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present.” Gérard Biau, JMLR, 2012
• Sometimes Big Data is all you need!

Page 12: Machine Learning for Fraud Detection

Experiment: ML on Bigger Data produces Better Results

• Source dataset: DNA dataset from the Pascal Large Scale Learning Challenge.
• A 4M-row dataset was held out for testing. Training datasets with 20M, 40M, 80M, 160M, 320M, 640M, and 5120M elements, arranged into 200 columns, were used. No featurization was applied.
• The optimal model for each training dataset size was found by tuning a Gradient Boosting Machine on a holdout dataset with Skytree smart-search.
• AUC (area under the ROC curve) was used for evaluation.
• Experiment by Skytree Inc., 2015
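As a reminder of what the evaluation metric measures, the following is a minimal sketch of AUC computed directly from its pairwise definition: the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The function name `auc` and the toy labels and scores are illustrative assumptions, not part of the Skytree experiment, and the quadratic loop is for clarity rather than for datasets of the size discussed here.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison.

    Counts, over all (positive, negative) pairs, how often the positive
    example receives the higher score; ties count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): 1 marks the positive class.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(labels, scores))  # 3 of the 4 positive/negative pairs are ordered correctly: 0.75
```

A production implementation would sort the scores once and use ranks, which brings the cost down from quadratic to O(n log n).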

Page 13: Machine Learning for Fraud Detection

Bigger Data, Better Results on Real World Data

Dataset Size (elements)    AUC
20,000,000                 93.9%
40,000,000                 95.0%
80,000,000                 95.6%
160,000,000                96.2%
320,000,000                96.7%
640,000,000                97.2%
5,120,000,000              98.1%

Page 14: Machine Learning for Fraud Detection

Machine Learning Solution for Financial Services

Multiple algorithms for higher accuracy
• Gradient Boosting
• Random Decision Forest
• SVM
• Stacked models (combined models)
• Mixed models (combine supervised and unsupervised models)

Automatic Parameter Selection
• Automatically create the best-performing model for any algorithm in fewer iterations
• Allow for usage by domain experts (non data scientists)
• Higher accuracy: the machine can tune better than humans

Speed and Scalability
• Big Data scale
• Catch latest trends in fraud
• Improve accuracy
• Iterate over multiple algorithms and parameters
• Faster model creation and model update

Visualization and Optimization
• Optimize directly for dollars
• Visualize model performance
• Provide knobs to choose a model
• Ensure optimality of models without overfitting
• Visualize models to interpret results

Page 15: Machine Learning for Fraud Detection


Machine Learning for Fraud Detection

• Countering Fraud is a Machine Learning Problem
• Challenges
• Solution (GBM and advanced)

Page 16: Machine Learning for Fraud Detection

Fraud Detection

• Counter complex and transient fraud patterns
• Analyze multiple and large datasets to discover and predict fraud: “More than 23 billion credit card transactions are processed annually in USA” (CreditCards.com)

Page 17: Machine Learning for Fraud Detection

Machine Learning Problem

Supervised Learning: Predict Fraud
• Collect historical transactions
• Learn from past examples of fraud
• Predict fraud (in real time)

Unsupervised Learning: Discover Fraud
• Segment transactions
• Investigate potentially new fraud
• Detect outliers

Mixed Approach: Discover and Predict Fraud
• Detect “Points of Compromise” to prevent fraud
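To make the "detect outliers" step of the unsupervised approach concrete, here is a minimal sketch that flags unusual transaction amounts with a modified z-score based on the median and the median absolute deviation (MAD). This is one simple, well-known outlier rule chosen for illustration; the slides do not say which method Skytree uses. The function name `robust_outliers`, the 3.5 cutoff (a conventional default), and the sample amounts are all assumptions.

```python
import statistics


def robust_outliers(amounts, threshold=3.5):
    """Return indices of values whose modified z-score exceeds `threshold`.

    The modified z-score is 0.6745 * |x - median| / MAD, which is less
    distorted by the outliers themselves than a mean/stdev z-score.
    """
    med = statistics.median(amounts)
    mad = statistics.median([abs(a - med) for a in amounts])
    if mad == 0:
        return []  # no spread around the median, nothing can be ranked
    return [
        i for i, a in enumerate(amounts)
        if 0.6745 * abs(a - med) / mad > threshold
    ]


# Hypothetical transaction amounts for one card; index 6 is far from the rest.
amounts = [12.5, 9.99, 14.0, 11.25, 13.4, 10.8, 950.0, 12.1]
flagged = robust_outliers(amounts)
print(flagged)  # [6]
```

Flagged transactions are candidates for investigation, not confirmed fraud; confirmed cases can then feed the supervised model as new labelled examples.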

Page 18: Machine Learning for Fraud Detection

Common Issues

• Imbalanced datasets: too few examples of ‘known’ fraud
• What to optimize? Fraud capture rate; false positive rate (what is the cost associated?); total loss incurred due to fraud; what loss function to use
• How to handle missing values?
• Which algorithm to use?
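The "what to optimize" question above can be made concrete with a small sketch that scores one decision threshold three ways: fraud capture rate, false positive rate, and a dollar cost. The cost model here is an assumption made for illustration (missed fraud costs the transaction amount, each false alarm costs a fixed `false_alarm_cost`), as are the function name and the toy data.

```python
def evaluate_threshold(labels, scores, amounts, threshold, false_alarm_cost=5.0):
    """Summarise a score cutoff as (capture rate, false positive rate, dollar cost).

    Assumed cost model: a missed fraud loses the full transaction amount,
    and every legitimate transaction flagged costs `false_alarm_cost`.
    """
    tp = fp = fn = tn = 0
    missed_dollars = 0.0
    for y, s, amt in zip(labels, scores, amounts):
        flagged = s >= threshold
        if y == 1 and flagged:
            tp += 1
        elif y == 1:
            fn += 1
            missed_dollars += amt
        elif flagged:
            fp += 1
        else:
            tn += 1
    capture_rate = tp / (tp + fn) if tp + fn else 0.0
    false_positive_rate = fp / (fp + tn) if fp + tn else 0.0
    total_cost = missed_dollars + false_alarm_cost * fp
    return capture_rate, false_positive_rate, total_cost


# Hypothetical scored transactions: label 1 = fraud.
labels = [1, 0, 0, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.1, 0.3]
amounts = [500.0, 20.0, 35.0, 1200.0, 15.0, 60.0]

print(evaluate_threshold(labels, scores, amounts, threshold=0.5))
print(evaluate_threshold(labels, scores, amounts, threshold=0.35))
```

In this toy data both thresholds have the same false positive rate, but the lower one catches the $1,200 fraud, so ranking thresholds by dollars gives a different answer than ranking by false positives alone.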

Page 19: Machine Learning for Fraud Detection

[Current] Industry Standard Solution

GBM algorithm (Friedman, 2001 and variants)

• Sequentially combines simple models, with each “new” model correcting the mistakes of the previous ones
• Base model in this case is a decision tree
• Inspired by gradient descent in optimization
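The sequential idea described above can be sketched in a few lines of plain Python. This is a teaching sketch under simplifying assumptions, not Friedman's full algorithm or Skytree's implementation: it uses a single numeric feature, one-split trees (stumps) as base models, and squared loss, for which the negative gradient is simply the residual. Real fraud models would use a classification loss and deeper trees. All names (`fit_stump`, `fit_gbm`, `mse`) and the toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression tree (stump) to the residuals by least squares."""
    best = None
    for t in sorted(set(xs))[:-1]:  # thresholds that leave both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    if best is None:  # all inputs identical: fall back to a constant
        mean = sum(residuals) / len(residuals)
        return lambda x: mean
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Boosting loop: each new stump is fit to the current residuals."""
    base = sum(ys) / len(ys)
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For squared loss, the negative gradient is the residual y - prediction.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
    return lambda x: base + learning_rate * sum(s(x) for s in stumps)


def mse(model, xs, ys):
    """Mean squared error of `model` on the given data."""
    return sum((y - model(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)


# Toy data (hypothetical): one feature, label 1 = fraud.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 0, 1, 0, 1, 1, 1]
one_round = fit_gbm(xs, ys, n_rounds=1)
fifty_rounds = fit_gbm(xs, ys, n_rounds=50)
print(mse(one_round, xs, ys), mse(fifty_rounds, xs, ys))
```

The training error falls as rounds are added, which is exactly why the overfitting and tuning concerns raised later in the deck matter: the loop will keep fitting noise unless the number of rounds, the learning rate, and the tree depth are checked against held-out data.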

Page 20: Machine Learning for Fraud Detection

GBM Pros

• Automatically handles missing values
• Highly accurate models
• Captures nonlinearity in the data
• Does not require deep understanding of the data

     

Page 21: Machine Learning for Fraud Detection

GBM Cons

• Does not handle datasets with high dimensions well
• Minimizes bias, not necessarily variance
• Chance of overfitting the training data when data is noisy
• Not the best at handling very high imbalance in the data
• Requires extensive parameter tuning
• Not simple to distribute

Page 22: Machine Learning for Fraud Detection

GBM: overcoming the odds

• Does not handle datasets with high dimensions well
   • SVMs handle datasets with high dimensionality
• Minimizes bias, not necessarily variance
   • Ensemble of GBM (eGBM, Skytree, 2013) and stochastic GBM (sGBM)
   • eGBM: the idea is to use ensembles of GBMs where each GBM is built using bootstrap samples
   • sGBM: each base learner (decision tree) uses different samples
   • Mixed models: combine Linear/Logistic models with GBM by blending/stacking
• High chance of overfitting the training data
   • Carefully check for generalization error
   • Restrict to simple base learners (shallow decision trees), etc.
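The bootstrap idea behind the ensemble remedy above can be sketched generically: fit several models, each on a sample drawn with replacement from the training rows, and average their predictions to reduce variance. This is an illustration of bootstrap aggregation only, not Skytree's eGBM. The `fit` argument stands in for any model-fitting routine (a GBM in the slide's setting); here a trivial mean predictor, `fit_mean`, is used as a hypothetical placeholder so the example stays self-contained.

```python
import random


def bootstrap_ensemble(fit, rows, n_models=10, seed=0):
    """Fit `n_models` models on bootstrap samples and average their predictions.

    `fit` takes a list of (x, y) rows and returns a function x -> prediction.
    """
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: same size as the data, drawn with replacement.
        sample = [rng.choice(rows) for _ in rows]
        models.append(fit(sample))

    def predict(x):
        return sum(m(x) for m in models) / len(models)

    return predict


def fit_mean(sample):
    """Placeholder base learner: always predicts the mean target of its sample."""
    mean_y = sum(y for _, y in sample) / len(sample)
    return lambda x: mean_y


# Hypothetical rows of (feature, label).
rows = [(1, 0.0), (2, 0.0), (3, 1.0), (4, 1.0)]
predict = bootstrap_ensemble(fit_mean, rows, n_models=25, seed=42)
print(predict(1))
```

Stochastic GBM applies a related idea inside a single model: each tree in the boosting sequence sees a different random subsample of the rows.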

Page 23: Machine Learning for Fraud Detection

GBM: overcoming the odds

• Not the best at handling very high imbalance in the data
   • Ensemble GBMs, stochastic GBMs, Random Forests, etc.
• Requires extensive parameter tuning
   • Smart-Search (Skytree Inc., 2014)
   • Patent-pending technology
   • Optimization that iteratively learns from the previous iterations
   • Successively improves the space in which to search for the best solution
   • Faster way to obtain the optimal set of parameters
• Not simple to distribute
   • Bring in High Performance Computing (HPC) techniques for distribution
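Smart-Search itself is described only as patent-pending, so its algorithm is not available here. The general notion of a search that learns from earlier iterations and successively narrows the search space can, however, be illustrated with a generic sketch: random search over one parameter that re-centres and shrinks its range around the best value found so far. Everything below (`narrowing_search`, its parameters, and the toy objective) is a hypothetical illustration and should not be read as Skytree's method.

```python
import random


def narrowing_search(objective, low, high, rounds=5, samples_per_round=10,
                     shrink=0.5, seed=0):
    """Minimise `objective` on [low, high] by random search with a shrinking range."""
    rng = random.Random(seed)
    best_x, best_val = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples_per_round):
            x = rng.uniform(low, high)
            val = objective(x)
            if val < best_val:
                best_x, best_val = x, val
        # Use what was learned so far: shrink the range around the best point,
        # never extending beyond the current bounds.
        half_width = (high - low) * shrink / 2
        low = max(low, best_x - half_width)
        high = min(high, best_x + half_width)
    return best_x, best_val


def toy_error(x):
    """Stand-in for a validation error as a function of one tuning parameter."""
    return (x - 0.3) ** 2


best_x, best_val = narrowing_search(toy_error, 0.0, 1.0)
print(best_x, best_val)
```

In a real tuning run the objective would train a model and return its error on a holdout set, so each evaluation is expensive and spending later samples near promising regions is what saves time.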

       

Page 24: Machine Learning for Fraud Detection

Machine Learning Solution for Financial Services

Multiple algorithms for higher accuracy
• Gradient Boosting
• Random Decision Forest
• SVM
• Stacked models (combined models)
• Mixed models (combine supervised and unsupervised models)

Automatic Parameter Selection
• Automatically create the best-performing model for any algorithm in fewer iterations
• Allow for usage by domain experts (non data scientists)
• Higher accuracy: the machine can tune better than humans

Speed and Scalability
• Big Data scale
• Catch latest trends in fraud
• Improve accuracy
• Iterate over multiple algorithms and parameters
• Faster model creation and model update

Visualization and Optimization
• Optimize directly for dollars
• Visualize model performance
• Provide knobs to choose a model
• Ensure optimality of models without overfitting
• Visualize models to interpret results

Page 25: Machine Learning for Fraud Detection


Let's see how it works!

• Skytree Workspace
• Demo
• CLI
• Python SDK
• GUI

Page 26: Machine Learning for Fraud Detection

Unified Data Scientist Workspace