Python as Part of a Production Machine Learning Stack, by Michael Manapat, PyData SV 2014
DESCRIPTION
Over the course of three years, we've built Stripe from scratch and scaled it to process billions of dollars of transaction volume a year by making it easy and painless for merchants to get set up and start accepting payments. While the vast majority of transactions facilitated by Stripe are honest, we do need to protect our merchants from rogue individuals and groups seeking to "test" or "cash" stolen credit cards. To combat this sort of activity, Stripe uses Python (together with Scala and Ruby) as part of its production machine learning pipeline to detect and block fraud in real time. In this talk, I'll go through the scikit-learn-based modeling process for a sample data set derived from production data to illustrate how we train and validate our models. We'll also walk through how we deploy the models and monitor them in our production environment, and how Python has allowed us to do this at scale.
TRANSCRIPT
![Page 1: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/1.jpg)
Python as part of a production machine learning stack
Michael Manapat
@mlmanapat
Stripe
![Page 2: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/2.jpg)
Outline
- Why we need ML at Stripe
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service
![Page 3: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/3.jpg)
Stripe is a technology company focusing on making payments easy
- Short application
![Page 4: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/4.jpg)
Tokenization
- Customer browser → Stripe (Stripe.js) → Token
- Merchant server → Stripe (API call) → Result
![Page 5: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/5.jpg)
API Call

    import stripe

    stripe.Charge.create(
        amount=400,
        currency="usd",
        card="tok_103xnl2gR5VxTSB",
        # final argument obscured in the source: [email protected]
    )
![Page 6: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/6.jpg)
Fraud / business violations
- Terms of service violations (weapons)
- Merchant fraud (card "cashers")
- Transaction fraud
- No machine learning a year ago
![Page 7: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/7.jpg)
Fraud / business violations
- Terms of service violations: e-cigarettes, drugs, weapons, etc.

How do we find these automatically?
![Page 8: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/8.jpg)
Merchant sign up flow
Application submission → Website scraped → Text scored → Application reviewed
![Page 9: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/9.jpg)
Merchant sign up flow
Application submission → Website scraped → Text scored → Application reviewed
Machine learning pipeline and service
![Page 10: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/10.jpg)
Building a classifier: e-cigarettes

    import pandas

    # pandas loads pickled frames with read_pickle
    data = pandas.read_pickle('ecigs')
    data.head()

       text                                                 violator
    0  " please verify your age i am 21 years or older ...  True
    1  coming soon toggle me drag me with your mouse ...    False
    2  drink moscow mules cart 0 log in or create an ...    False
    3  vapors electronic cigarette buy now insuper st...    True
    4  t-shirts shorts hawaii about us silver coll...       False

    [5 rows x 2 columns]
![Page 11: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/11.jpg)
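Features for text classification

    from sklearn.feature_extraction.text import CountVectorizer

    cv = CountVectorizer()  # instantiate the vectorizer before fitting
    features = cv.fit_transform(data['text'])

Sparse matrix of word counts from input text (omitting feature selection)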
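To make "sparse matrix of word counts" concrete, here is a minimal standard-library sketch of the idea. It is not scikit-learn's implementation: the two-document corpus, `tokenize`, and `count_row` are hypothetical names for illustration, and the tokenizer only approximates the vectorizer's default behavior (lowercasing, tokens of two or more word characters).

```python
import re
from collections import Counter

# Hypothetical two-document corpus standing in for scraped website text.
docs = ["buy vapor now", "buy shirts and shorts now now"]

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

# Vocabulary: each distinct token gets a column index (alphabetical order).
vocabulary = {
    tok: i
    for i, tok in enumerate(sorted({t for d in docs for t in tokenize(d)}))
}

def count_row(text):
    # A sparse row stores only the columns with a nonzero count.
    counts = Counter(tokenize(text))
    return {vocabulary[tok]: n for tok, n in counts.items() if tok in vocabulary}

rows = [count_row(d) for d in docs]
```

Each row maps a column index to a count, so a page that never mentions most of the vocabulary costs almost nothing to store.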
![Page 12: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/12.jpg)
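Features for text classification

    # located in sklearn.cross_validation in 2014-era releases
    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(
        features, data['violator'], test_size=0.2)

- Avoid leakage
- Other cross-validation methods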
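One common reading of "avoid leakage" for text features is that anything learned from the data, including the vocabulary, should come from the training split only. The sketch below illustrates that reading with the standard library; it is an assumption about what the slide intends, and `split_then_fit` is a hypothetical helper, not part of scikit-learn.

```python
import random
import re

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters.
    return re.findall(r"\b\w\w+\b", text.lower())

def split_then_fit(texts, test_size=0.2, seed=0):
    # Shuffle indices reproducibly and hold out the last test_size fraction.
    idx = list(range(len(texts)))
    random.Random(seed).shuffle(idx)
    n_test = max(1, int(round(len(idx) * test_size)))
    train_idx, test_idx = idx[:-n_test], idx[-n_test:]
    # Build the vocabulary from training text only, so no token
    # statistics from the held-out documents reach the model.
    vocab = sorted({t for i in train_idx for t in tokenize(texts[i])})
    return train_idx, test_idx, vocab
```

Tokens that appear only in held-out documents are simply absent from the vocabulary, which is the same situation the model faces with unseen websites in production.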
![Page 13: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/13.jpg)
Training

    from sklearn.linear_model import LogisticRegression

    model = LogisticRegression()
    model.fit(X_train, y_train)

Serializer reads from model.intercept_ and model.coef_
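The slides do not show the serializer itself. As a rough sketch of why the intercept and coefficients are enough, the following standard-library code stores them as JSON and rescores from the stored values. The JSON layout and the `serialize` and `score` helpers are assumptions for illustration, not the format used in the talk.

```python
import json
import math

def serialize(intercept, coef, vocabulary):
    # A fitted binary logistic regression is fully described by one
    # intercept, one weight per feature, and the token-to-column map.
    # With scikit-learn these would be float(model.intercept_[0]) and
    # model.coef_[0].tolist().
    return json.dumps(
        {"intercept": intercept, "coef": coef, "vocabulary": vocabulary}
    )

def score(blob, token_counts):
    params = json.loads(blob)
    z = params["intercept"]
    for token, count in token_counts.items():
        col = params["vocabulary"].get(token)
        if col is not None:  # ignore tokens unseen at training time
            z += params["coef"][col] * count
    # Logistic function maps the linear score to a probability.
    return 1.0 / (1.0 + math.exp(-z))
```

Because scoring needs only a dot product and a logistic function, a serialized model of this kind can be loaded without scikit-learn being present.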
![Page 14: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/14.jpg)
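Validation

    import matplotlib.pyplot
    from sklearn.metrics import roc_curve

    probs = model.predict_proba(X_test)
    # column 1 holds the probability of the positive (violator) class
    fpr, tpr, thresholds = roc_curve(y_test, probs[:, 1])
    matplotlib.pyplot.plot(fpr, tpr)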
![Page 15: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/15.jpg)
ROC: Receiver operating characteristic
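For readers new to the curve, this standard-library sketch shows how its points arise: lower the decision threshold one example at a time and record the false positive rate and true positive rate after each step. It assumes no tied scores and at least one example of each class; `roc_points` and `auc` are hypothetical helpers, not scikit-learn functions.

```python
def roc_points(labels, scores):
    # Count positives and negatives once; they are the rate denominators.
    pos = sum(1 for y in labels if y)
    neg = len(labels) - pos
    # Visit examples from highest to lowest score. Each visit corresponds
    # to lowering the threshold just enough to flag one more example.
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    tp = fp = 0
    points = [(0.0, 0.0)]
    for _, y in ranked:
        if y:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    # Area under the curve by the trapezoid rule.
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )
```

A classifier that ranks every violator above every non-violator traces the top-left corner and has an area of 1.0; random ranking gives about 0.5.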
![Page 16: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/16.jpg)
Pipeline
- Fetch website snapshots from S3
- Fetch classifications from SQL/Impala
- Sanitize text (strip HTML)
- Run feature generation and selection
- Train and serialize model
- Export validation statistics
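The "sanitize text (strip HTML)" step can be sketched with the standard library alone. The talk does not say which tool was used, so the class and function names below are hypothetical, and dropping script/style content and lowercasing are assumptions based on how the sample data looks.

```python
import re
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def sanitize(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered text
    # Collapse whitespace and lowercase, matching the sample rows' style.
    return re.sub(r"\s+", " ", " ".join(parser.chunks)).strip().lower()
```

For example, `sanitize("<h1>Vapors</h1><p>Buy  now</p>")` returns `"vapors buy now"`.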
![Page 17: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/17.jpg)
Luigi

    import luigi

    class GetSnapshots(luigi.Task):
        def run(self):
            ...

    class GenFeatures(luigi.Task):
        # Luigi runs GetSnapshots before this task
        def requires(self):
            return GetSnapshots()
![Page 18: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/18.jpg)
Luigi runs tasks on Hadoop cluster
![Page 19: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/19.jpg)
Scoring as a service
Application submission → Website scraped → Text scored → Application reviewed
Thrift RPC → Scoring Service
![Page 20: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/20.jpg)
Scoring as a service

    struct ScoringRequest {
      1: string text
      2: optional string model_name
    }

    struct ScoringResponse {
      1: double score
      // Experiments?
      2: double request_duration
    }
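A handler behind these structs might look roughly like the sketch below. It uses plain dataclasses in place of Thrift-generated classes so that it runs on its own. The `ScoringHandler` class, the model registry, and the choice of seconds for `request_duration` are all assumptions; the slides specify only the field names and types.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class ScoringRequest:
    text: str
    model_name: Optional[str] = None  # optional, as in the Thrift struct

@dataclass
class ScoringResponse:
    score: float
    request_duration: float  # seconds here; the unit is an assumption

class ScoringHandler:
    def __init__(self, models: Dict[str, Callable[[str], float]], default_model: str):
        # Each model is any callable from text to a score (hypothetical).
        self.models = models
        self.default_model = default_model

    def score(self, request: ScoringRequest) -> ScoringResponse:
        start = time.perf_counter()
        # Fall back to the default model when the caller names none.
        model = self.models[request.model_name or self.default_model]
        value = model(request.text)
        return ScoringResponse(
            score=value, request_duration=time.perf_counter() - start
        )
```

Keeping `model_name` optional lets callers stay unaware of which model serves them, which is one way a service could route some traffic to an experimental model.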
![Page 21: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/21.jpg)
Why a service?
- Same code base for training/scoring
- Reduced duplication/easier deploys
- Experimentation
![Page 22: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/22.jpg)
- Log requests and responses (Parquet/Impala)
- Centralized monitoring (Graphite)
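As a small illustration of the two bullets above, the sketch below formats one log record and one metric line per scored request. The metric path, the record fields, and the use of JSON are hypothetical (the slides name Parquet/Impala for storage, which this sketch does not attempt). The metric line follows Graphite's plaintext format of path, value, and Unix timestamp.

```python
import json
import time

def graphite_line(path, value, timestamp=None):
    # Graphite's plaintext protocol: "<path> <value> <unix timestamp>\n".
    ts = int(time.time() if timestamp is None else timestamp)
    return f"{path} {value} {ts}\n"

def log_record(text, model_name, score, request_duration):
    # One self-describing record per request/response pair (JSON as a
    # stand-in for whatever columnar format the logs end up in).
    return json.dumps(
        {
            "text": text,
            "model_name": model_name,
            "score": score,
            "request_duration": request_duration,
        },
        sort_keys=True,
    )
```

Logging both the input text and the returned score means past requests can later be replayed against a new model and compared offline.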
![Page 23: Python as part of a production machine learning stack by Michael Manapat PyData SV 2014](https://reader033.vdocuments.mx/reader033/viewer/2022050905/54c6c8624a795938448b459a/html5/thumbnails/23.jpg)
Summary
- Simple models with sklearn
- Pipelines with Luigi
- Scoring as a service

Thanks! @mlmanapat