Data Science Workshop
Post on 13-Jul-2015
TRANSCRIPT
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Data Science with Hadoop
Fall, 2014
Ajay Singh Director, Technical Alliance
Page 2
Agenda
• Data Science
• Machine Learning – quick overview
• Data Science with Hadoop
• Demo
Page 4
What is Data Science?
Data: facts and statistics collected together for reference or analysis.
Science: the intellectual and practical activity encompassing the systematic study of the structure and behavior of the physical and natural world through observation and experiment.
Data Science: the scientific exploration of data to extract meaning or insight, and the construction of software systems to utilize such insight in a business context.
Someone who does this… a Data Scientist.
Page 5
Where Can We Use Data Science?
Healthcare • Predict diagnosis • Prioritize screenings • Reduce re-admittance rates
Financial services • Fraud Detection/prevention • Predict underwriting risk • New account risk screens
Public Sector • Analyze public sentiment • Optimize resource allocation • Law enforcement & security
Retail • Product recommendation • Inventory management • Price optimization
Telco/mobile • Predict customer churn • Predict equipment failure • Customer behavior analysis
Oil & Gas • Predictive maintenance • Seismic data management • Predict well production levels
Page 6
Data Science is an Iterative Activity
Visualize, Explore
Hypothesize; Model
Measure/Evaluate Acquire Data
Clean Data
Formulate the question Deploy
Page 7
Data Science combines proficiencies…
Data Exploration

The data science process comprises three main task areas, each requiring different skills: technical, analytical, and programming.

Pre-processing (technical):
- Feature Engineering: Raw Transforms, Signal Processing, OCR, Geo-spatial, Normalize, Transform/aggregate, Sample
- Feature Selection: Dimensionality reduction, NLP, Mutual Information

Data Modeling (analytical):
- Supervised Learning: Regression, Classification
- Unsupervised Learning: Frequent Itemset, Anomaly Detection, Clustering
- Collaborative Filter

Reporting: Visualization, Data Quality

A data scientist needs to be proficient in all these tasks.
Page 8
Data Science with Big Data…
Very large raw datasets are now available:
- Log files
- Sensor data
- Sentiment information
With more raw data, we can build better models with improved predictive performance.
To handle these larger datasets, we need a scalable processing platform such as Hadoop with YARN.
Page 9
Data scientists master many skills

Applied Science
• Statistics, applied math
• Machine Learning
• Tools: Python, R, SAS, SPSS

Big data engineering
• Big data pipeline engineering
• Statistics and machine learning over large datasets
• Tools: Hadoop, PIG, HIVE, Cascading, SOLR, etc.

Business Analysis
• Data Analysis, BI
• Business/domain expertise
• Tools: SQL, Excel, EDW

Data engineering
• Database technologies
• Computer science
• Tools: Java, Scala, Python, C++
Page 10
Which makes them hard to find…
(The same four skill areas as above: Applied Science, Business Analysis, Data engineering, and Big data engineering.)
Page 11
The Data Science Team
• Business Analyst
• Data Engineer
• Applied Scientist
Page 13
What is Machine Learning?
WALL-E was a machine that learned how to feel emotions after 700 years of experiences on Earth collecting human artifacts.
Machine learning is the science of getting computers to learn from data and act without being explicitly programmed.
• Machine learning is about the construction and study of systems that can learn from data.
• The core of machine learning deals with representation and generalization, so that the system will perform well on unseen data instances and predict unknown events.
• There is a wide variety of machine learning tasks and successful applications.
Page 14
Supervised vs. Unsupervised learning
Data Modeling:
- Supervised Learning: Regression, Classification
- Unsupervised Learning: Frequent Itemset, Anomaly Detection, Clustering
- Collaborative Filter

Supervised learning: applications in which the training data is a set of "labeled" examples: input vectors along with their corresponding target variables (labels).
Unsupervised learning: applications in which the training data comprises examples of input vectors WITHOUT any corresponding target variables. The goal is to unearth "naturally occurring patterns" in the data, as in clustering.
Collaborative filtering (recommendation engines): uses techniques from both the supervised and unsupervised worlds.
Page 15
Supervised Learning: learn from examples
Labeled dataset (Feature Matrix):

Patient Age | Tumor Size | Clump Thickness | … | Malignant?
55          | 5          | 3               | … | TRUE
70          | 4          | 7               | … | TRUE
85          | 4          | 6               | … | FALSE
35          | 2          | 1               | … | FALSE
…           | …          | …               | … | …

Test data (Feature Vectors):

Patient Age | Tumor Size | Clump Thickness | … | Malignant?
72          | 3          | 3               | … | ?
66          | 4          | 4               | … | ?

Cancer model (Target function): f(V1, V2, V3, …) = Malignant?
Page 16
Classification: predicting a category
Some techniques:
- Naïve Bayes
- Decision Tree
- Logistic Regression
- SGD
- Support Vector Machines
- Neural Network
- Ensembles
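As a minimal sketch of the idea (not the deck's code), a logistic regression classifier can be trained with Scikit-learn, one of the tools named later in this deck. The data values here are hypothetical, loosely echoing the tumor example:

```python
# Minimal classification sketch with Scikit-learn (hypothetical toy data).
from sklearn.linear_model import LogisticRegression

# Features per patient: [age, tumor size, clump thickness]
X_train = [[55, 5, 3], [70, 4, 7], [85, 4, 6], [35, 2, 1]]
y_train = [1, 1, 0, 0]  # 1 = malignant, 0 = benign

clf = LogisticRegression().fit(X_train, y_train)
prediction = clf.predict([[72, 3, 3]])  # classify an unseen patient
```

Real datasets would of course have far more rows and engineered features; the shape of the API stays the same.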
Page 17
Regression: predict a continuous value
Some techniques:
- Linear Regression / GLM
- Decision Trees
- Support vector regression
- SGD
- Ensembles
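To make "predict a continuous value" concrete, here is a bare-bones sketch (not from the deck) of a one-variable linear regression fitted by the closed-form least-squares formula; the data points are made up:

```python
# Simple linear regression via closed-form least squares (toy data).
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # slope = cov(x, y) / var(x); the intercept follows from the means
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
            sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

xs = [1, 2, 3, 4]
ys = [2.1, 4.0, 6.2, 7.9]        # roughly y = 2x
slope, intercept = fit_line(xs, ys)
predict = lambda x: intercept + slope * x
```

The listed techniques (GLMs, trees, SVR, ensembles) generalize this same fit-then-predict pattern to richer model families.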
Page 18
Example: Ad Click-Through Rates in Ad Search
Rank = bid * CTR

Predict CTR for each ad to determine placement, based on:
- Historical CTR
- Keyword match
- Etc.
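The ranking rule can be sketched in a few lines (the bids and CTRs below are hypothetical, not from the deck): score each ad by bid * predicted CTR and place the highest-scoring ads first.

```python
# Rank ads by bid * predicted CTR (hypothetical toy numbers).
ads = [
    {"ad": "A", "bid": 2.00, "ctr": 0.01},
    {"ad": "B", "bid": 0.50, "ctr": 0.08},
    {"ad": "C", "bid": 1.00, "ctr": 0.03},
]
for ad in ads:
    ad["rank"] = ad["bid"] * ad["ctr"]

# Highest expected revenue per impression is shown first.
placement = sorted(ads, key=lambda a: a["rank"], reverse=True)
```

Note that the cheapest bid can still win the top slot if its predicted CTR is high enough, which is exactly why CTR prediction matters.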
Page 19
Unsupervised Learning: detect natural patterns
Age | State | Annual Income | Marital status
25  | CA    | $80,000       | M
45  | NY    | $150,000      | D
55  | WA    | $100,500      | M
18  | TX    | $85,000       | S
…   | …     | …             | …

No labels -> Model -> naturally occurring (hidden) structure
Page 20
Clustering: detect similar instance groupings
Some techniques:
- k-means
- Spectral clustering
- DBSCAN
- Hierarchical clustering
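As an illustration of the core idea (a self-contained sketch with made-up points, not the deck's code), a tiny k-means on one-dimensional data with fixed initial centers alternates between assigning points to their nearest center and moving each center to its cluster mean:

```python
# Minimal 1-D k-means sketch with fixed initial centers (toy data).
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Update step: each center moves to the mean of its cluster
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

points = [1.0, 1.5, 2.0, 10.0, 10.5, 11.0]
centers, clusters = kmeans_1d(points, centers=[0.0, 5.0])
```

Production implementations (Spark ML-Lib, Mahout) parallelize exactly this loop: the assignment step is embarrassingly parallel across data partitions.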
Page 22
Outlier Detection: identify abnormal patterns

Example: identify engine anomalies
Features:
- Heat generated
- Vibration of engine
Page 23
Outlier Detection Target Function: outlier factor
Outlier factor (0…1)

ID  | Total$ | Age | City | OF
101 | $200   | 25  | SF   | 0.1
102 | $350   | 35  | LA   | 0.05
103 | $25    | 15  | LA   | 0.2
…   | …      | …   | …    | …

Some techniques:
- Statistical techniques
- Local outlier factor
- One-class SVM
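The simplest of the listed "statistical techniques" can be sketched directly (a toy illustration, not the deck's method): flag any value whose z-score, its distance from the mean in standard deviations, exceeds a threshold.

```python
# Z-score outlier flagging, a simple statistical technique (toy data).
import statistics

def zscore_outliers(values, threshold=2.0):
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    return [v for v in values if abs(v - mean) / stdev > threshold]

spend = [200, 350, 25, 300, 280, 5000]  # one transaction looks suspicious
outliers = zscore_outliers(spend)
```

Local outlier factor and one-class SVMs refine this idea for data where "abnormal" depends on local density rather than a single global mean.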
Page 25
Affinity Analysis: identifying frequent item sets
      Item 1 | Item 2 | Item 3 | Item 4 | Item 5 | …
Tx 1  Y      | N      | N      | Y      | N
Tx 2  Y      | N      | N      | Y      | N
Tx 3  Y      | Y      | N      | Y      | N
Tx 4  N      | N      | Y      | Y      | Y
…

Goal: identify frequent item sets
Techniques: FP-Growth, Apriori
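A toy sketch in the Apriori spirit (brute force over candidate sizes, which is fine at this scale but not how FP-Growth or a real Apriori implementation works), using transactions like the table above:

```python
# Brute-force frequent-itemset mining (Apriori spirit, toy scale).
from itertools import combinations

transactions = [
    {"item1", "item4"},           # Tx 1
    {"item1", "item4"},           # Tx 2
    {"item1", "item2", "item4"},  # Tx 3
    {"item3", "item4", "item5"},  # Tx 4
]

def frequent_itemsets(txs, min_support=0.5):
    items = sorted(set().union(*txs))
    frequent = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # Support = fraction of transactions containing the whole candidate
            support = sum(set(candidate) <= tx for tx in txs) / len(txs)
            if support >= min_support:
                frequent[candidate] = support
    return frequent

freq = frequent_itemsets(transactions)
```

Here {item1, item4} appears in three of four transactions, so it is frequent at 50% minimum support; Apriori's pruning insight is that a set can only be frequent if all its subsets are.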
Page 26
Example: Affinity Analysis
Use affinity analysis for:
- Store layout design
- Coupons
Page 27
Product recommendation: predicting “preference”
Collaborative Filtering Identify users with similar “taste”
Page 28
Collaborative filtering -> matrix completion
Ratings matrix with missing entries (rows: users 101, 102, 103, …; columns: Harry Potter, X-Men, Hobbit, Argo, Pirates):

     Harry Potter | X-Men | Hobbit | Argo | Pirates
101  5            | 2     | 4      | ?    | ?
102  ?            | ?     | 5      | 2    | ?
103  1            | 2     | ?      | ?    | 3
…

Completed matrix:

101  5            | 2     | 4      | 1    | 3
102  4            | 1     | 5      | 2    | 3
103  1            | 2     | 4      | 1    | 3
…
Page 31
Hadoop Improves Data Scientist Productivity

• Data Lake: all the data in one place
  – Ability to store ALL the data in raw format
  – Data silo convergence
• Data/compute capabilities available as a shared asset
  – Data scientists can quickly prototype a new idea without an up-front request for funding
  – YARN enables multiple processing applications
Page 32
A "schema change" project:
- Start: "I need new data"
- 6 months: "Finally, we start collecting"
- 9 months: "Let me see… is it any good?"

With HDFS:
- Start: "Let's just put it in a folder on HDFS"
- "Let me see… is it any good?"
- 3 months: "My model is awesome!"

"Schema on read" Accelerates Data Innovation
Page 33
Hadoop is ideal for pre-processing
Pre-processing:
- Feature Engineering: Raw Transforms, Signal Processing, OCR, Geo-spatial, Normalize, Transform/aggregate, Sample
- Feature Selection: Dimensionality reduction, NLP, Mutual Information

Data Modeling:
- Supervised Learning: Regression, Classification
- Unsupervised Learning: Frequent Itemset, Anomaly Detection, Clustering
- Collaborative Filter

Build a better feature matrix:
- More/new features
- More instances
- Faster and at more scale
Page 34
Training a Supervised Learning model with Hadoop
• Typically the "training set" is not that large
  – In this case, it's very common to train on a single high-memory node
  – Using existing tools: R, Python Scikit-learn, or SAS
• For really large training sets that don't fit in memory:
  – SAS
  – Spark ML-Lib is a promising (albeit new) solution
  – Mahout is workable in some cases (but its future is unclear)
• Hadoop is also useful in parameter tuning:
  – Grid search: optimizing the model's parameters
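The grid-search point can be sketched in a few lines (a hand-rolled illustration with made-up data, not the deck's tooling): try each parameter value, score it on held-out data, and keep the best. Because every parameter setting is evaluated independently, each combination can run as a separate Hadoop/YARN task.

```python
# Hand-rolled grid search over k for a tiny 1-D nearest-neighbor classifier.
# Each parameter setting is scored independently, which is what makes grid
# search easy to parallelize as separate cluster tasks.
train = [(1.0, 0), (1.5, 0), (2.0, 0), (2.5, 1),   # (2.5, 1) is a noisy label
         (8.0, 1), (8.5, 1), (9.0, 1)]
valid = [(1.2, 0), (2.4, 0), (8.8, 1)]

def knn_predict(x, k):
    nearest = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0          # majority vote

def accuracy(k):
    return sum(knn_predict(x, k) == y for x, y in valid) / len(valid)

best_k = max([1, 3, 5], key=accuracy)         # k=1 overfits the noisy point
```

Here k=1 chases the mislabeled training point and misclassifies a validation example, while k=3 smooths it out; picking the parameter on the validation set is exactly the over-fitting guard described on the training slide later.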
Page 35
Scoring a Supervised Learning Model with Hadoop
• Scoring of a single instance is usually fast
• Some use cases require frequent batch re-scoring of a large population (e.g., 20M customers):
  - Use a PMML scoring engine (e.g., Zementis, Pattern)
  - Custom implementation with Python, R, Java, etc.
Page 36
Unsupervised learning with Hadoop
• Clustering:
  – Many clustering algorithms are parallelizable
  – Distributed k-means is popular and available in Spark ML-Lib & Mahout
• Collaborative Filtering:
  – Alternating Least Squares (ALS): very parallelizable
  – ALS implemented in Mahout, Spark ML-Lib, and others
  – Item-based and user-based collaborative filtering available in Mahout
Page 37
Deployment Considerations: Hadoop and Spark
• User runs a Spark (or ML-Lib) job directly from the edge node
• Scala and Java APIs; the Python API is also good
• Spark runs directly as a YARN job
• No need to install anything else

[Diagram: Spark ML-Lib on an edge node; Spark executors running across the cluster under YARN]
Page 38
Deployment Considerations: Hadoop and R
• R and relevant packages installed on each node
• User runs R on a high-memory node
  - RStudio or RStudio Server
  - RCloud
• Interfaces to Hadoop:
  - RMR: run map-reduce jobs from R
  - RHDFS: access HDFS files from R
  - RHive: run Hive queries from R
  - RHBase: HBase from R
  - RODBC

[Diagram: RStudio/RCloud, RHadoop, and RHive on an R high-memory node; R processes running across the cluster under YARN]
Page 39
Deployment Considerations: Hadoop and Python
• Python and relevant packages installed on each node and on the high-memory nodes
• User runs Python on a high-memory node
  - IPython notebook is a great UI
• Interfaces to Hadoop:
  - PyDoop: access HDFS from Python
  - Map-reduce jobs with Hadoop streaming
  - Python UDFs with PIG
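To give a flavor of the Hadoop-streaming bullet, here is a mapper and reducer written as plain Python functions (real streaming jobs read stdin and emit tab-separated key/value lines; the flight-delay counting and input format here are hypothetical, loosely anticipating the demo):

```python
# Hadoop-streaming-style mapper/reducer sketch (plain Python functions).
from collections import defaultdict

def mapper(line):
    # Hypothetical input: "origin,delay_minutes" per flight record
    origin, delay = line.split(",")
    yield origin, 1 if int(delay) >= 15 else 0   # 1 = delayed flight

def reducer(pairs):
    # Sum the delayed-flight flags per origin airport
    totals = defaultdict(int)
    for origin, delayed in pairs:
        totals[origin] += delayed
    return dict(totals)

records = ["ORD,20", "ORD,5", "JFK,30", "ORD,45"]
pairs = [kv for line in records for kv in mapper(line)]
delayed_counts = reducer(pairs)
```

In a real streaming job, Hadoop shuffles and sorts the mapper output between the two phases, so the reducer sees all pairs for one key together.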
[Diagram: IPython, Pandas, Scikit-learn, NumPy, SciPy, Matplotlib, and PyDoop on a Python high-memory node; Python processes running across the cluster under YARN]
Page 40
Supervised Learning with Hadoop: More Details + Demo
Page 41
Supervised Learning Workflow

Training: Raw Data (Train) + Labels -> Feature Extraction -> Feature Matrix -> Train the Model -> Eval Model -> Model
Predicting: New Data -> Feature Extraction -> Feature Vector -> Model -> Predict
Page 42
Close up: Feature Extraction
Raw Data (TB, PB) -> Feature Extraction -> Feature Matrix (MB, GB)

Raw Data:

ID  | Total$ | Age | City | Target
101 | 200    | 25  | SF   |
102 | 350    | 35  | LA   |
103 | 25     | 15  | LA   |
…   | …      | …   | …    |

Feature Engineering: Raw Transforms, Signal Processing, OCR, Geo-spatial, Normalize, Transform/aggregate, Sample
Feature Selection: Dimensionality reduction, NLP, Mutual Information
Page 43
How Big is your Feature Matrix?
Example:
• 10M rows, 100 features
• Each feature = 8 bytes (double)
• Total memory = ~7.5GB
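The slide's estimate checks out as a quick back-of-the-envelope calculation (using 1 GB = 2^30 bytes):

```python
# Back-of-the-envelope size of the feature matrix from the slide.
rows, features, bytes_per_value = 10_000_000, 100, 8   # 8-byte doubles
total_bytes = rows * features * bytes_per_value        # 8,000,000,000 bytes
total_gb = total_bytes / 2**30                         # about 7.45 GB
```

So after pre-processing, even a 10M-row matrix fits comfortably in RAM on a single high-memory node, which is why the training slide recommends exactly that.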
Page 44
Close-Up: Training the Model
[Diagram: Training Set -> Train the Model -> Model; Validation Set -> Eval Model -> Model Metric]

- The feature matrix is randomly split into a "training" set (70%) and a "validation" set (30%)
- The model is built using the training set, and an error measure is computed over the validation set
- An iterative process or grid search determines the best algorithm and choice of parameters, so that:
  - We get optimal model accuracy
  - We prevent over-fitting
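The 70/30 split can be sketched in a few lines of plain Python (a generic illustration; Scikit-learn's train_test_split does the same job):

```python
# Random 70/30 train/validation split (plain Python sketch).
import random

def split(rows, train_frac=0.7, seed=42):
    rows = rows[:]                     # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)  # seeded shuffle for reproducibility
    cut = int(len(rows) * train_frac)
    return rows[:cut], rows[cut:]

data = list(range(100))                # stand-in for feature-matrix rows
train_set, validation_set = split(data)
```

Shuffling before cutting matters: if the raw data is sorted (say, by date), a straight cut would put systematically different rows in each set.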
Page 45
Evaluating Performance of a Classifier
• Determine the "confusion matrix"
• Compute metrics: precision, recall, accuracy, and specificity

Confusion Matrix:

              | Actual: Yes     | Actual: No
Predicted Yes | True positives  | False positives
Predicted No  | False negatives | True negatives

From the confusion matrix, we can compute these metrics:
- Precision = % of positive predictions that are correct
- Recall = % of positive instances that were predicted as positive
- F1 score = a measure of the test's accuracy, combining precision and recall
- Accuracy = % of correct classifications
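The metrics follow directly from the four confusion-matrix counts (the counts below are hypothetical; the formulas are the standard definitions):

```python
# Metrics from a confusion matrix (hypothetical counts).
tp, fp, fn, tn = 40, 10, 5, 45

precision = tp / (tp + fp)                  # positive predictions that are correct
recall = tp / (tp + fn)                     # positive instances actually caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
accuracy = (tp + tn) / (tp + fp + fn + tn)  # all correct classifications
specificity = tn / (tn + fp)                # negatives correctly rejected
```

Precision and recall pull in opposite directions (predicting "Yes" more often raises recall but usually lowers precision), which is why the F1 score is used to balance them.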
Page 46
Demo overview
• Datasets:
  – Airline delay data (we're using only the 2007 and 2008 years): http://stat-computing.org/dataexpo/2009/the-data.html
  – Weather data from http://ncdc.noaa.gov/
• Goal:
  – Predict delays (delayTime >= 15 mins) in flights
  – For simplicity, limited to flights originating from ORD
• Tools:
  – Pre-processing: PIG or Spark on Hadoop
  – Modeling: Scikit-learn, Spark ML-Lib, or R
Page 47
Demo Flow
Training: Airline + Weather data (2007) -> ORD_2007 -> Feature Extraction (+ Labels) -> Train the Model -> Model
Prediction: Airline + Weather data (2008) -> ORD_2008 -> Feature Extraction -> Predict / Score -> Model