Robust and Declarative Machine Learning Pipelines for Predictive Buying at Barclays


Robust and declarative machine learning pipelines for predictive buying
Gianmario Spacagna, Advanced Data Analytics, Barclays
2016/04/07


The latest technology advancements have made data processing accessible to everyone, cheaply and quickly. We believe better value can be produced at the intersection of engineering practices and the correct application of the scientific method.

DATA SCIENCE MANIFESTO*

Principle 2: “All validation of data, hypotheses and performance should be tracked, reviewed and automated.”

Principle 3: “Prior to building a model, construct an evaluation framework with end-to-end business focused acceptance criteria.”

* The manifesto is still in beta; check the full list of principles at datasciencemanifesto.org

MLlib

• Machine Learning library for Apache Spark
• Scalable, production-oriented (theoretically can handle any size of data given enough cluster resources)
• Built on top of RDD collections of vector objects
• Provides:
  • Linear algebra
  • Basic statistics
  • Classification and regression
  • Collaborative filtering
  • Clustering
  • Dimensionality reduction
  • Feature extraction and transformation
  • Pattern mining
  • Evaluation

ML pipelines
• ML is the MLlib evolution on top of DataFrame
• DataFrame: SQL-like schema for representing data, similar to the original concept in R and Python Pandas
• Offers optimised execution plans without extra engineering effort
• High-level APIs for machine learning pipelines:
  • Transformer: DataFrame => DataFrame
    • Features transformer
    • Fitted model usable for inference
  • Estimator: def fit(data: DataFrame): Transformer
    • The algorithm that trains your model
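For concreteness, here is a minimal example of this API, assuming trainingDF and testDF are pre-existing DataFrames with a text column and a label column, along the lines of the official Spark documentation example:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

// Tokenizer and HashingTF are Transformers; LogisticRegression is an Estimator.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
val lr = new LogisticRegression().setMaxIter(10)

val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
val model = pipeline.fit(trainingDF)  // Estimator: DataFrame => PipelineModel (a Transformer)
val scored = model.transform(testDF)  // Transformer: DataFrame => DataFrame
```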

Great but…
• No type safety at compile time (only dynamic types)
• A features transformer and a model have the same API?
• Not fully functional:
  • functions must be registered as UDFs
  • explicit casting between SQL Row objects and JVM classes
  • complex logic is harder to test and express in isolation
• Can’t explicitly distinguish between metadata and feature fields, though you can have as many columns as you want (and also remove them at run-time!)
• Harder to “safely” implement bespoke algorithms
• The underlying technology is very robust, but the higher-level API, although declarative, can be error-prone when refactoring and maintaining a real production system

Barclays use case

• Binary classifier for predicting users willing to buy a financial product in the following 3 months
• Each user has different characteristics at different points in time (multiple samples of the same user)
• A sample is uniquely defined by the customer id and date
• Features consist of a mix of sequences of events, categorical attributes and numerical values
• Each sample is labeled as true or false depending on whether the user bought or not in the following horizon window
• It must be production-ready, but we want to estimate performance before deployment based on retrospective analysis

Our Requirements

• Nested data structures (mix of sequences and dictionaries) with some meta information attached
• Abstraction between the implementation of a binary classifier in vector form and its domain-specific feature extraction/selection
• A fitted model should simply be a function Features => Double
• Robust cross-fold validation, no data leakage:
  • No correlation between training and test set
  • Temporal order: only predict the future based on the past
  • Don’t predict the same users multiple times
• Scientifically correct: every procedure, every stage in the workflow should be transparent, tested and reproducible.

Sparkz

• Extension of Spark for better functional programming
• Type-safe, everything checked at compile time
• Re-implements some components to work around limitations due to the non-functional nature of the development
• Currently just a proof-of-concept of designs and patterns for data science applications, for example:
  • Data validation using monads and applicative functors
  • Binary classifier evaluation using sceval
  • A bunch of utils (lazy loggers, pimps, monoids…)
  • A typed API for machine learning pipelines

https://github.com/gm-spacagna/sparkz

Our domain-specific data type is UserFeaturesWithBooleanLabel and the information contained in it can be separated into:
• Features: events, attributes and numericals
• Metadata: user id and date
• Target label: did the user buy the product in the next 3 months?
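As an illustration only, the domain type could be shaped roughly like this (field names and types are assumptions, not the actual Barclays schema):

```scala
// Hypothetical shape of the domain type: features, metadata and target label.
case class UserFeatures(
  events: Seq[String],               // sequence of events
  attributes: Map[String, String],   // categorical attributes
  numericals: Map[String, Double]    // numerical values
)

case class UserFeaturesWithBooleanLabel(
  features: UserFeatures,
  userId: Long,                      // metadata
  date: java.sql.Date,               // metadata
  label: Boolean                     // bought the product in the next 3 months?
)
```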

By expressing our API in functions of the raw data type, we don’t drop any information from the source and we can always explain the results at any stage of the pipeline.

A binary classifier is generic in the type Features and offers two ways of implementing it:
• Generic typed API (for domain-specific algorithms)
• Vector API (for standard machine learning algorithms)
  • LabeledPoint is a pair of vector and label (no metadata though)

Splitting the vector-form implementation from the transformation of the source data into a vector gives us more control over, and robustness in, the process. The generic typed API could be used for bespoke implementations that are not based on standard linear algebra (and thus don’t require vectors).
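A rough sketch of how the two routes could be expressed as traits; the signatures below are assumptions for illustration, the actual definitions live in the Sparkz repository:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.regression.LabeledPoint

// Generic typed API: training on the domain type yields a scoring function.
trait BinaryClassifierTrainer[Features] {
  def train(trainingData: RDD[(Features, Boolean)]): Features => Double
}

// Vector API: standard algorithms are expressed on LabeledPoint and lifted to
// the domain type through a features transformer (described later).
trait BinaryClassifierVectorTrainer {
  def train(trainingData: RDD[LabeledPoint]): Vector => Double
}
```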

The simplest implementation of a generic model is a random classifier: no matter what the features look like, it returns a random number.

No vector transformation required.

We can optionally specify a seed in case we want it to be systematically repeatable.

In this example the train method does nothing and the anonymous implementation of the trained model simply returns a random number between 0 and 1.
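Against the trait assumed above, a sketch of such a random classifier could look like this:

```scala
import scala.util.Random
import org.apache.spark.rdd.RDD

// train() ignores the data; the fitted model is an anonymous function returning
// a uniform score in [0, 1], optionally seeded for repeatability.
case class RandomClassifier[Features](seed: Option[Long] = None)
    extends BinaryClassifierTrainer[Features] {

  def train(trainingData: RDD[(Features, Boolean)]): Features => Double = {
    val rng = seed.fold(new Random())(new Random(_))
    _ => rng.nextDouble()
  }
}
```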

This vector classifier wraps the MLlib implementation, but the score function is re-implemented in DecisionTreeInference in such a way that it traverses the tree and computes the proportion of true and false samples. Thus the fitted model only needs to store the top node of the tree.

We had to implement it ourselves because the MLlib API is limited to returning only the output class (true/false) without the associated score.
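As a hedged sketch of the idea (not the actual Sparkz DecisionTreeInference code): walk an MLlib decision tree from its top node down to a leaf and read the class probability stored there.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.tree.configuration.FeatureType
import org.apache.spark.mllib.tree.model.Node

// Traverse the tree for a given feature vector and return the probability of
// the positive class at the reached leaf.
def scoreWithTree(topNode: Node)(features: Vector): Double = {
  var node = topNode
  while (!node.isLeaf) {
    val split = node.split.get
    val goLeft = split.featureType match {
      case FeatureType.Continuous  => features(split.feature) <= split.threshold
      case FeatureType.Categorical => split.categories.contains(features(split.feature))
    }
    node = if (goLeft) node.leftNode.get else node.rightNode.get
  }
  // Predict.prob is the probability of the predicted class at the leaf.
  if (node.predict.predict == 1.0) node.predict.prob else 1.0 - node.predict.prob
}
```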

A model implemented via the vector API then requires the features type to be transformed into a vector. The FeaturesTransformer API is generic in the Features type and has a sort of train method that, given the full training dataset, returns the mapping function Features => Vector. This API allows us to learn the transformation rules directly from the underlying data and can be used to turn vector-based models into our domain-specific models.

The original structure of the source data is always preserved; the transformation function is applied lazily, on the fly, when needed.
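Sketched as a trait (signature assumed for illustration):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.Vector

// Learns transformation rules (term indices, category dictionaries, scalers…)
// from the training data and returns a pure mapping function.
trait FeaturesTransformer[Features] {
  def featuresToVector(trainingData: RDD[Features]): Features => Vector
}
```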

If our data type is hybrid, we want to apply a different transformation to each subset of features and then combine them by concatenating the resulting sub-vectors.

The abstract class SubFeaturesTransformer specifies both the SubFeatures type and extends the base transformer on the type Features. That means it can be treated as a transformer of the base type while only considering a subset of the features.

The EnsembleTransformer will concatenate multiple subvectors together.
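Continuing the sketch under the same assumptions (the real Sparkz classes may differ):

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// A sub-features transformer projects the SubFeatures out of the full Features
// type and learns a sub-vector mapping on that projection only.
abstract class SubFeaturesTransformer[Features, SubFeatures: ClassTag](
    getter: Features => SubFeatures) extends FeaturesTransformer[Features] {

  protected def subFeaturesToVector(trainingData: RDD[SubFeatures]): SubFeatures => Vector

  def featuresToVector(trainingData: RDD[Features]): Features => Vector = {
    val toVector = subFeaturesToVector(trainingData.map(getter))
    features => toVector(getter(features))
  }
}

// The ensemble concatenates the sub-vectors produced by each transformer.
case class EnsembleTransformer[Features](transformers: Seq[FeaturesTransformer[Features]])
    extends FeaturesTransformer[Features] {

  def featuresToVector(trainingData: RDD[Features]): Features => Vector = {
    val fns = transformers.map(_.featuresToVector(trainingData))
    features => Vectors.dense(fns.flatMap(f => f(features).toArray).toArray)
  }
}
```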

The simplest numerical transformer takes a getter function that extracts a Map[Key, Double] from the generic Features type and returns a SubFeaturesTransformer where the key-value map is flattened into a vector with the original values (no normalization).

If we want to apply, for example, standard scaling, we could simply train an OriginalNumericalsTransformer and pipe the function returned by subFeaturesToVector into a function Vector => Vector which would represent our scaler.
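A possible sketch of the numerical transformer, extending the base transformer directly with a getter (rather than the SubFeaturesTransformer machinery above) and learning the key order from the training data:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Flattens a Map[Key, Double] into a dense vector, using a key index learnt
// from the training data so that positions are consistent across samples.
case class OriginalNumericalsTransformer[Features, Key: Ordering: ClassTag](
    getter: Features => Map[Key, Double]) extends FeaturesTransformer[Features] {

  def featuresToVector(trainingData: RDD[Features]): Features => Vector = {
    val keys = trainingData.flatMap(f => getter(f).keys).distinct().collect().sorted
    features => {
      val values = getter(features)
      Vectors.dense(keys.map(k => values.getOrElse(k, 0.0)))
    }
  }
}
```

Piping a scaler would then be plain function composition, e.g. featuresToVector(training) andThen scale, where scale is any Vector => Vector function.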

The One Hot Transformer will flatten each distinct combination of (key, value) pairs into a vector of 1s and 0s.

The values represent categories. We encoded them as Strings, but you could easily extend this to a generic type as long as it has an equality function or a singleton property (e.g. case objects).

Hierarchies and granularities can also be pushed in this logic.
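A hedged sketch of the same idea, learning the distinct (key, value) combinations from the training data:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// One-hot encodes the categorical attributes: each distinct (key, value) pair
// seen during training gets its own position, set to 1.0 when present.
case class OneHotTransformer[Features, Key: Ordering: ClassTag](
    getter: Features => Map[Key, String]) extends FeaturesTransformer[Features] {

  def featuresToVector(trainingData: RDD[Features]): Features => Vector = {
    val index: Map[(Key, String), Int] =
      trainingData.flatMap(f => getter(f).toSeq).distinct().collect().sorted.zipWithIndex.toMap
    features => {
      val active = getter(features).toSeq.flatMap(kv => index.get(kv)).map(_ -> 1.0)
      Vectors.sparse(index.size, active)
    }
  }
}
```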

The Term Frequency Transformer takes a sequence of terms and turns them into frequency counts.

The index of each term is learnt during training and shared via a broadcast variable, so that we don’t have to use the hashing trick, which may lead to incorrect results.

The Term type must have an Ordering defined so that we can preserve the indexing on different runs.

This transformer can easily be piped with an IDF function and be turned into a TF-IDF transformer.
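Again a sketch under the same assumptions, with the term index learnt during training and shared through a broadcast variable:

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Maps a sequence of terms into term-frequency counts, using an ordered term
// index learnt from the training data and broadcast to the executors.
case class TermFrequencyTransformer[Features, Term: Ordering: ClassTag](
    getter: Features => Seq[Term]) extends FeaturesTransformer[Features] {

  def featuresToVector(trainingData: RDD[Features]): Features => Vector = {
    val termIndex: Map[Term, Int] =
      trainingData.flatMap(getter).distinct().collect().sorted.zipWithIndex.toMap
    val broadcastIndex = trainingData.sparkContext.broadcast(termIndex)
    features => {
      val counts = getter(features).groupBy(identity).mapValues(_.size.toDouble)
      val active = counts.toSeq.flatMap { case (term, count) =>
        broadcastIndex.value.get(term).map(_ -> count)
      }
      Vectors.sparse(broadcastIndex.value.size, active)
    }
  }
}
```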

We can easily create multiple typed-API classifiers by combining the vector trainer algorithm with any combination of sub-features transformers.
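As an illustrative composition only, using the hypothetical types and transformer sketches from above and mirroring the pipeline diagram later in the deck:

```scala
// events -> term frequencies, demographic attributes -> one-hot encoding,
// financial statements -> raw numerical values, concatenated into one vector.
val transformer = EnsembleTransformer(Seq(
  TermFrequencyTransformer[UserFeatures, String](_.events),
  OneHotTransformer[UserFeatures, String](_.attributes),
  OriginalNumericalsTransformer[UserFeatures, String](_.numericals)
))
// Combining a vector trainer (e.g. the decision-tree wrapper) with this
// transformer would then yield a BinaryClassifierTrainer[UserFeatures].
```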

The cross-fold validation uses metadata to prevent data leakage and simulate the training/inference stages we would run in the live production environment.

It requires the following parameters:
• data: full dataset with features, label and meta information
• k: number of folds
• classifiers: list of classifiers to be evaluated
• uniqueId: function extracting the unique id from the meta data
• orderingField: function extracting the temporal order of the samples for the before/after splitting
• singleInference: if enabled, the test set contains at most one sample for each unique id
• seed: seed of the random generator (for repeatability)

It returns a map of Classifier -> Collection of test samples and associated score (no metrics aggregation).
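Putting the parameter list and return value together, the entry point could be sketched as the following signature (names and types are assumptions, not the literal Sparkz API):

```scala
import org.apache.spark.rdd.RDD

// Signature sketch only: evaluates each classifier over k leakage-free folds and
// returns, per classifier, the (score, label) pairs of its test samples.
def crossFoldValidation[Features, Meta](
    data: RDD[(Features, Meta, Boolean)],
    k: Int,
    classifiers: Seq[BinaryClassifierTrainer[Features]],
    uniqueId: Meta => Long,
    orderingField: Meta => Long,
    singleInference: Boolean,
    seed: Long
): Map[BinaryClassifierTrainer[Features], Seq[(Double, Boolean)]] = ???
```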

We used the customer id as the unique identifier for our samples and the date as the temporal ordering field. We enabled singleInference so that we only predict each user once during the tests (at a random point in time).

The returned evaluation outputs are then mapped into collections of scores and boolean labels so that we can use sceval to compute the confusion matrices and, finally, the area under the ROC curve.
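Purely as an illustration of that last step, and using MLlib's BinaryClassificationMetrics in place of sceval (whose API is not shown here), the area under the ROC could be computed as:

```scala
import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics
import org.apache.spark.rdd.RDD

// Given the (score, boolean label) pairs produced by the validation, compute
// the area under the ROC curve.
def areaUnderROC(scoresAndLabels: RDD[(Double, Boolean)]): Double =
  new BinaryClassificationMetrics(
    scoresAndLabels.map { case (score, label) => (score, if (label) 1.0 else 0.0) }
  ).areaUnderROC()
```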

As general advice, we discourage using AUC as the evaluation metric in a real business context. A pool of measures should be created mapping the true business needs, e.g. precision @ 1000, uplift w.r.t. the prior, % of new buyers predicted when running the inference after a couple of weeks, variance of performance across multiple runs, seasonality...

[Pipeline diagram: Source Data splits into Meta (user id, date) and Features (events sequence, demographic attributes, financial statements); the features feed the sub-features transformers (TF, one-hot, numerical values), whose concatenated vector feeds the Decision Tree vector trainer wrapped as a Binary Classifier Trainer producing a Binary Classifier Fitted Model; cross-fold validation fits Decision Tree models 1…k, producing scores and labels from which the performance metrics are computed.]

Conclusions
1. In this PoC we wanted to show a specific use case for which we could not find suitable out-of-the-box machine learning frameworks.
2. We focused on the production-quality aspects rather than quick and dirty prototypes.
3. We wanted the whole workflow to be transparent, tested and reproducible.
4. Nevertheless, we discourage implementing any premature abstractions until you get your first MVP released, and possibly a couple of extra iterations.
5. Our final requirements of what to implement were defined after an iterative and incremental data exploration and knowledge discovery.
6. We used the same production stack together with notebooks during the MVP development/investigation, but in a simpler and flat code structure.
7. Regardless of the programming language/technology, a thoughtful methodology really makes the difference.

Further References
• Sparkz: https://github.com/gm-spacagna/sparkz
• Other examples and tutorials available at Data Science Vademecum: https://datasciencevademecum.wordpress.com/
• Data Science Manifesto: http://www.datasciencemanifesto.org