machine learning for fraud detection

Post on 05-Aug-2015

Category:

Technology

TRANSCRIPT

1. Bigger Data. Better Results. Machine Learning for Fraud Detection. Nitesh Kumar, PhD. nitesh@skytree.net
2. Who Am I? Applied Math PhD. Derivative/Options Pricing background. 7 years doing analytics. Data Science at Skytree for 2 years.
3. Skytree Inc. Came out of Alex Gray's (CTO) FastLab @ Georgia Tech. Software company that provides machine learning software. Built to function on top of Hadoop. Automation, speed, and scalability. User can interact through command line interface, APIs, and GUI. 20 million dollars in series A. TAB: Michael Jordan, James Demmel, Dave Patterson, Pat Hanrahan.
4. What is Skytree? Machine Learning Platform: GBM, K-means, RF, SVD/PCA, Linear/Logistic, SVM, collaborative filtering etc. Built for Big Data: scales linearly with data size and compute nodes (map-reduce, Hadoop). Usability: SDK in Python, Java, REST, even GUI; data preparation through Spark. Automation: 1-click modeling. ML on Bigger Data produces Better Results: larger datasets lead to higher accuracy.
5. Outline. Introduction: Why Skytree, Big Data, and Machine Learning for Fraud? Machine Learning in Financial Services: issues, methods, and solution. Live demo of Skytree on real-world dataset (command line, API, GUI), time and setup permitting.
6. Introduction. Fraud is a Big problem (Big Data, Big Cost). Why is Machine Learning necessary? Comprehensive solution?
7. Fraud is a Big Data Problem. More than 23 billion credit card transactions are processed annually in USA (CreditCards.com). Credit card transactions alone generate multiple terabytes of data a year. Each transaction has 100-300 attributes. Distributed data across multiple nodes.
8. Fraud is a Big Cost Problem. Businesses lose an estimated $3.5 billion annually to fraud and financial crime (Forbes, 2014). Total value of credit card transactions in the U.S. in 2012: $2.48 trillion (CreditCards.com, http://www.federalreserve.gov/releases/g19/Current/).
9.
Why Machine Learning? Traditional ideas of finding patterns through hand-crafted, careful querying do not scale to large datasets. Prior rule-based engines do not make use of information from multiple attributes at the same time. Machine learning is concerned with algorithms that can learn from data. Multivariate statistics. Automated predictive analytics. Even a tiny increase in accuracy can lead to millions of dollars in savings.
10. Gap between Machine Learning and Big Data. Awakening to Big Data, experimenting with ML? ML is necessary to derive value out of Big Data.
11. ML on Bigger Data produces Better Results. Weak and Strong Law of Large Numbers. We have shown that for a prototypical natural language classification task, the performance of learners can benefit significantly from much larger training sets. (Banko and Brill, Proceedings of ACL, 2001.) Breiman's procedure (random forest) is consistent and adapts to sparsity, in the sense that its rate of convergence depends only on the number of strong features and not on how many noise variables are present. (Gerard Biau, JMLR, 2012.) Sometimes Big Data is all you need!
12. Experiment: ML on Bigger Data produces Better Results. Source dataset: DNA dataset from Pascal Large Scale Learning Challenge. A 4M-row dataset was held out for testing. Training datasets with 20M, 40M, 80M, 160M, 320M, 640M, 5120M elements, arranged into 200 columns, were used. No featurization was applied. Optimal model for each training dataset size was found by tuning Gradient Boosting Machine on a holdout dataset with Skytree smart-search. AUC (area under ROC curve) was used for evaluation. Experiment by Skytree Inc, 2015.
13. Bigger Data, Better Results on Real World Data.
    Dataset size      AUC
    20,000,000        93.9%
    40,000,000        95.0%
    80,000,000        95.6%
    160,000,000       96.2%
    320,000,000       96.7%
    640,000,000       97.2%
    5,120,000,000     98.1%
14.
Machine Learning Solution for Financial Services. Multiple algorithms for higher accuracy: Gradient Boosting, Random Decision Forest, SVM, stacked models (combined models), mixed models (combine supervised and unsupervised models). Automatic Parameter Selection: automatically create best performing model for any algorithm in fewer iterations; allow for usage by domain experts (non data scientists); higher accuracy, machine can tune better than humans. Speed and Scalability: Big Data scale; catch latest trends in fraud; improve accuracy; iterate over multiple algorithms and parameters; faster model creation and model update. Visualization and Optimization: optimize directly for dollars; visualize model performance; provide knobs to choose a model; ensure optimality of models without overfitting; visualize models to interpret results.
15. Machine Learning for Fraud Detection. Countering Fraud is a Machine Learning Problem. Challenges. Solution (GBM and advanced).
16. Fraud Detection. Counter complex and transient fraud patterns. Analyze multiple and large datasets to discover and predict fraud. More than 23 billion credit card transactions are processed annually in USA (CreditCards.com).
17. Machine Learning Problem. Supervised Learning, Predict Fraud: collect historical transactions; learn from past examples of fraud; predict fraud (in real-time). Unsupervised Learning, Discover Fraud: segment transactions; investigate potentially new fraud; detect outliers. Mixed Approach, Discover and Predict Fraud: detect Points of Compromise to prevent fraud.
18. Common Issues. Imbalanced datasets: too few examples of known fraud. What to optimize? Fraud capture rate; false positive rate (what is the cost associated?); total loss incurred due to fraud. What loss function to use? How to handle missing values? Which algorithm to use?
19.
[Current] Industry Standard Solution: GBM algorithm (Friedman, 2001 and variants). Sequentially combines simple models, with each new model correcting the mistakes of the previous ones. Base model in this case is decision trees. Inspired by gradient descent in optimization.
20. GBM Pros. Automatically handles missing values. Highly accurate models. Captures nonlinearity in the data. Does not require deep understanding of the data.
21. GBM Cons. Does not handle datasets with high dimensions well. Minimizes bias, not necessarily variance. Chance of overfitting the training data when data is noisy. Not the best at handling very high imbalance in the data. Requires extensive parameter tuning. Not simple to distribute.
22. GBM: overcoming the odds. Does not handle datasets with high dimensions well: SVMs handle datasets with high dimensionality. Minimizes bias, not necessarily variance: ensemble of GBM (eGBM, Skytree, 2013) and stochastic GBM (sGBM). eGBM: idea is to use ensembles of GBMs where each GBM is built using bootstrap samples. sGBM: each base learner (decision tree) uses different samples. Mixed models: combine Linear/Logistic models with GBM by blending/stacking. High chance of overfitting the training data: carefully check for generalization error; restrict to simple base learners (shallow decision trees) etc.
23. GBM: overcoming the odds. Not the best at handling very high imbalance in the data: ensemble GBMs, stochastic GBMs, Random Forests etc. Requires extensive parameter tuning: Smart-Search (Skytree Inc., 2014), patent-pending technology; optimization that iteratively learns from the previous iterations; successively improves the space in which to search for the best solution; faster way to obtain the optimal set of parameters. Not simple to distribute: bring High Performance Computing (HPC) distributing.
24.
Machine Learning Solution for Financial Services. Multiple algorithms for higher accuracy: Gradient Boosting, Random Decision Forest, SVM, stacked models (combined models), mixed models (combine supervised and unsupervised models). Automatic Parameter Selection: automatically create best performing model for any algorithm in fewer iterations; allow for usage by domain experts (non data scientists); higher accuracy, machine can tune better than humans. Speed and Scalability: Big Data scale; catch latest trends in fraud; improve accuracy; iterate over multiple algorithms and parameters; faster model creation and model update. Visualization and Optimization: optimize directly for dollars; visualize model performance; provide knobs to choose a model; ensure optimality of models without overfitting; visualize models to interpret results.
25. Let's see how it works! Skytree Workspace Demo: CLI, Python SDK, GUI.
26. Unified Data Scientist Workspace
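
Slide 19 describes GBM as sequentially combining simple models, with each new model correcting the mistakes of the previous ones. The following is a minimal sketch of that idea in plain Python, not Skytree's implementation. It assumes one-split decision trees (stumps) as the base model, logistic loss, and a fixed learning rate; the function names, the toy transactions (amount, flag) and the hyperparameters are all hypothetical and chosen only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, residuals):
    """Least-squares fit of a one-split tree to the current residuals."""
    best = None  # (sse, feature, threshold, left_value, right_value)
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            lv = sum(left) / len(left)
            rv = sum(right) / len(right)
            sse = (sum((r - lv) ** 2 for r in left)
                   + sum((r - rv) ** 2 for r in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, lv, rv)
    return best[1:]

def fit_gbm(X, y, n_rounds=50, learning_rate=0.3):
    """Boosting loop: each stump is fit to what the ensemble still gets wrong."""
    pos = sum(y) / len(y)
    base = math.log(pos / (1.0 - pos))      # start from the prior log-odds
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of logistic loss: label minus predicted probability.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lv, rv))
        scores = [s + learning_rate * (lv if row[j] <= t else rv)
                  for row, s in zip(X, scores)]
    return base, learning_rate, stumps

def predict_proba(model, row):
    base, lr, stumps = model
    score = base + sum(lr * (lv if row[j] <= t else rv)
                       for j, t, lv, rv in stumps)
    return sigmoid(score)

# Hypothetical toy data: [transaction amount, card-present flag], label 1 = fraud.
X = [[20, 1], [35, 0], [15, 1], [50, 0], [900, 0],
     [40, 1], [1200, 1], [25, 0], [30, 1], [1000, 0]]
y = [0, 0, 0, 0, 1, 0, 1, 0, 0, 1]

model = fit_gbm(X, y)
print(round(predict_proba(model, [1100, 0]), 3))  # large amount
print(round(predict_proba(model, [22, 1]), 3))    # small amount
```

Keeping the base learners this shallow follows the advice on slide 22 to restrict to simple base learners; the learning rate plays the role of the step size in the gradient-descent analogy from slide 19.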

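
Slide 18 asks what to optimize (fraud capture rate, false positive rate, or total loss), and slides 14 and 24 mention optimizing directly for dollars. One way to read that, sketched below under stated assumptions, is to pick the alert threshold that minimizes a dollar cost on held-out scored transactions. The cost model (a flat review cost per alert, and the full transaction amount lost for each missed fraud), the function names, and the toy scores are hypothetical; they are not taken from the slides or from Skytree's product.

```python
def expected_cost(scores, labels, amounts, threshold, review_cost=5.0):
    """Dollar cost of flagging every transaction with score >= threshold."""
    cost = 0.0
    for score, is_fraud, amount in zip(scores, labels, amounts):
        if score >= threshold:
            cost += review_cost   # every alert costs analyst time
        elif is_fraud:
            cost += amount        # missed fraud: assume the amount is lost
    return cost

def capture_and_fpr(scores, labels, threshold):
    """Fraud capture rate (recall) and false positive rate at a threshold."""
    tp = sum(1 for s, l in zip(scores, labels) if l and s >= threshold)
    fp = sum(1 for s, l in zip(scores, labels) if not l and s >= threshold)
    n_fraud = sum(labels)
    n_legit = len(labels) - n_fraud
    return tp / n_fraud, fp / n_legit

def best_threshold(scores, labels, amounts, review_cost=5.0):
    """Try each observed score (plus 'flag nothing') and keep the cheapest."""
    candidates = sorted(set(scores)) + [float("inf")]
    return min(candidates,
               key=lambda t: expected_cost(scores, labels, amounts, t, review_cost))

# Hypothetical held-out model scores, true labels (1 = fraud), and amounts in dollars.
scores = [0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.95]
labels = [0, 0, 0, 1, 0, 0, 1, 1]
amounts = [20, 35, 15, 400, 50, 25, 900, 1200]

t = best_threshold(scores, labels, amounts)
print(t, expected_cost(scores, labels, amounts, t))
print(capture_and_fpr(scores, labels, t))
```

On this toy data the cheapest threshold accepts a fairly high false positive rate because a missed fraud costs far more than a review, which is the trade-off slide 18 raises when it asks what cost is associated with false positives.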