Applications of Machine Learning at UCSB

APPLICATIONS OF MACHINE LEARNING Alex Tellez + Amy Wang + H2O Team UC Santa Barbara, 4/6/15

Upload: sri-ambati

Post on 17-Jul-2015



AGENDA

1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
   a) Beat Bill Belichick
   b) Fight Crime in Chicago
   c) Ham/Spam Text Messages
   d) Cycling Article Search
4. Data Science Competition

1. INTRO TO BIG DATA / ML

BIG DATA IS LIKE TEENAGE SEX:

everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…

Dan Ariely, Prof. @ Duke

BIG VS. SMALL DATA

When you try to open the file in Excel, Excel CRASHES

SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM

Basically… Big Data is data too big to process using conventional methods (e.g. Excel, Access)
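One common way around the "does not fit in RAM" problem is to stream the data instead of loading it all at once. The sketch below is only an illustration: the column names and the tiny in-memory "file" are made up, and a real dataset would be read from disk the same way, row by row.

```python
import csv
import io

def streaming_mean(rows, column):
    """Average one numeric column while holding only one row in memory at a time."""
    total = 0.0
    count = 0
    for row in rows:
        total += float(row[column])
        count += 1
    return total / count if count else float("nan")

# Stand-in for a large CSV on disk (made-up data); csv.DictReader yields rows lazily.
fake_file = io.StringIO("steps,heart_rate\n1000,62\n2500,71\n400,80\n")
mean_hr = streaming_mean(csv.DictReader(fake_file), "heart_rate")
print(mean_hr)
```

Memory use stays flat no matter how many rows the file has, because only running totals are kept.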

V + V + V

Today, we have access to more data than we know what to do with!

1. Wearables (Fitbit, iWatch, etc.)
2. Click streams from web visitors
3. Sensor readings
4. Social media outlets (e.g. Twitter, Facebook, etc.)

Volume - Data volumes are becoming unmanageable
Variety - More data types being captured
Velocity - Data arrives rapidly and must be processed / stored

THE HOPE OF BIG DATA

1. Data contains information of great business / personal value

Examples:
a) Predicting future stock movements = $$$
b) Netflix movie recommendations = Better experience = $$$

2. IF you can extract those insights from the data, you can make better decisions

So how the hell do you do it?

Enter, Machine Learning (ML)…

MACHINE LEARNING

The Wikipedia Definition:

…a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model…. ZZZzzzzzZZZzzzzzz

My Definition:

The development, analysis, and application of algorithms that enable machines to make predictions and / or better understand data

2 Types of Learning:

SUPERVISED + UNSUPERVISED

SUPERVISED LEARNING

What is it?

Methods that infer a function from labeled training data. Key task: Predicting ________ . (Insert your task here)

Examples of supervised learning tasks:

1. Classification Tasks - Benign / Malignant tumor
2. Regression Tasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
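To make "infer a function from labeled training data" concrete, here is a minimal sketch using 1-nearest-neighbor, about the simplest supervised method there is. The feature names and numbers are made up for illustration and are not from the talk.

```python
import math

def nearest_neighbor_predict(train, query):
    """Return the label of the training point closest to `query` (1-nearest-neighbor)."""
    _, closest_label = min(train, key=lambda pair: math.dist(pair[0], query))
    return closest_label

# Made-up labeled data: (tumor size in cm, irregularity score) -> label
training_data = [
    ((1.0, 0.2), "benign"),
    ((1.3, 0.1), "benign"),
    ((3.8, 0.9), "malignant"),
    ((4.2, 0.7), "malignant"),
]

prediction = nearest_neighbor_predict(training_data, (3.5, 0.8))
print(prediction)
```

The labels in `training_data` are what make this supervised: the "learned function" is just a lookup of the closest labeled example.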

UNSUPERVISED LEARNING

What is it?

Methods to understand the general structure of input data where no prediction is needed.

NO CURATION NEEDED!

Examples of unsupervised learning tasks:

1. Clustering - Discovering customer segments
2. Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
4. Anomaly Detection - Detecting irregular heart-beats
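As a small illustration of learning without labels, the sketch below flags anomalies with a z-score rule. This is one simple technique chosen for the example, not necessarily what a production heart-beat detector would use, and the interval values are made up.

```python
import statistics

def find_anomalies(values, threshold=2.0):
    """Return indices of values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [i for i, v in enumerate(values) if abs(v - mean) / stdev > threshold]

# Made-up seconds between heart beats; one interval is clearly irregular.
beat_intervals = [0.80, 0.82, 0.79, 0.81, 0.80, 1.60, 0.78, 0.81, 0.80, 0.79]
anomalies = find_anomalies(beat_intervals)
print(anomalies)
```

No one had to mark which beat was "bad": the structure of the data itself singles it out.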

2. WHAT IS H2O?

What is H2O? (water, duh!)

It is ALSO an open-source, parallel processing engine for machine learning.

What makes H2O different?

Cutting-edge algorithms + parallel architecture + ease-of-use = Happy Data Scientists / Analysts

TEAM @ H2O.AI

16,000 commits
H2O World Conference 2014

COMMUNITY REACH

120 meetups in 2014
11,000 installations
2,000 corporations
First Friday Hack-A-Thons

TRY IT!

Don’t take my word for it… www.h2o.ai

Simple Instructions

1. cd to download location
2. unzip h2o file
3. java -jar h2o.jar
4. Point browser to: localhost:54321

GUI

R

3. USE CASES (LOTS OF EM)

BEAT BILL BELICHICK

TB + BB

Tom Brady + Bill Belichick = 15 years together, 3 Super Bowls

PASS OR RUN?

On any given offensive play…

Coach Bill can either call a PASS or a RUN

What determines this?

Game situation
Opposing team
Personnel
Time remaining, etc, etc
Yards to go (until 1st down)

Basically, LOTS of stuff.

BUT WHAT IF??

Question:

Can we try to predict whether the next play will be PASS or RUN using historical data?

Approach:

Download every offensive play from Belichick-Brady era since 2000
Extract known features to build model inputs
Use various Machine Learning approaches to model PASS / RUN

Disclaimer: I’m not a Seahawks fan!

DATA COLLECTION

Data:

13 years of data (2002-2013 season)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)

Response Variable: PASS / RUN

Model Inputs:
Quarter, Minutes, Seconds, Opposing Team, Down, Distance, Line of Scrimmage, NE-Score, Opposing Team Score, Season, Formation, Game Status (is NE losing / winning / tied)
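A rough sketch of how one play might be turned into model inputs. The dictionary keys, the formation names, and the example play are all hypothetical; the talk does not show its encoding. Opposing Team and Season are left out only to keep the example short.

```python
STATUSES = ["losing", "tied", "winning"]
FORMATIONS = ["SHOTGUN", "UNDER_CENTER", "NO_HUDDLE"]  # hypothetical category values
NUMERIC_KEYS = ("quarter", "minutes", "seconds", "down", "distance",
                "line_of_scrimmage", "ne_score", "opp_score")

def game_status(ne_score, opp_score):
    """Derive the Game Status input (is NE losing / winning / tied) from the scores."""
    if ne_score > opp_score:
        return "winning"
    if ne_score < opp_score:
        return "losing"
    return "tied"

def one_hot(value, categories):
    """1.0 in the slot of the matching category, 0.0 elsewhere."""
    return [1.0 if value == c else 0.0 for c in categories]

def encode_play(play):
    """Flatten one play (a dict) into a numeric feature vector."""
    numeric = [float(play[key]) for key in NUMERIC_KEYS]
    status = game_status(play["ne_score"], play["opp_score"])
    return numeric + one_hot(play["formation"], FORMATIONS) + one_hot(status, STATUSES)

# Made-up play: 4th quarter, 3rd and 8, NE trailing 17-21.
example_play = {
    "quarter": 4, "minutes": 2, "seconds": 15, "down": 3, "distance": 8,
    "line_of_scrimmage": 35, "ne_score": 17, "opp_score": 21,
    "formation": "SHOTGUN", "play_type": "PASS",
}
features = encode_play(example_play)
label = 1 if example_play["play_type"] == "PASS" else 0  # response variable
```

Categorical inputs get one slot per category so the model never treats "SHOTGUN" as numerically larger or smaller than another formation.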

FIGHTING CRIME IN CHICAGO

Spark + H2O

OPEN CITY, OPEN DATA

“…my kind of town” - F. Sinatra

Crime Data: ~4.6 million rows of crimes from 2001, updated weekly*

External data source considerations???

Weather Data?
U.S. Census Data?

ML WORKFLOW

1. Collect datasets (Crime + Weather + Census)
2. Do some feature extraction (e.g. dates, times)
3. Join Crime data, Weather data, and Census data
4. Build deep learning model to predict arrest / no arrest made

GOAL:

For a given crime, predict if an arrest is more / less likely to be made!

SPARK SQL + H2O RDD

3 table join using Spark SQL

Convert joined table to H2O RDD
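The talk runs the join in Spark SQL. As a stand-in that needs no cluster, the sketch below shows the same idea (a three-table join plus date/time feature extraction) with Python's built-in sqlite3. Table names, column names, and every value are made up for illustration, and the H2O conversion step is not shown.

```python
import sqlite3
from datetime import datetime

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE crime   (id INTEGER, occurred TEXT, community_area INTEGER, arrest INTEGER);
CREATE TABLE weather (day TEXT, mean_temp REAL);
CREATE TABLE census  (community_area INTEGER, pct_below_poverty REAL);
""")
# All rows below are invented sample data.
conn.executemany("INSERT INTO crime VALUES (?, ?, ?, ?)", [
    (1, "2015-02-08 23:43:00", 46, 0),
    (2, "2015-02-09 08:15:00", 25, 1),
])
conn.executemany("INSERT INTO weather VALUES (?, ?)",
                 [("2015-02-08", 31.0), ("2015-02-09", 27.5)])
conn.executemany("INSERT INTO census VALUES (?, ?)", [(46, 29.8), (25, 28.6)])

# Three-table join: weather joins on the crime's calendar day, census on community area.
rows = conn.execute("""
SELECT c.id, c.occurred, w.mean_temp, s.pct_below_poverty, c.arrest
FROM crime c
JOIN weather w ON w.day = date(c.occurred)
JOIN census  s ON s.community_area = c.community_area
ORDER BY c.id
""").fetchall()

def time_features(timestamp):
    """Feature extraction from a timestamp string: hour, weekday (Mon=0), month."""
    t = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    return {"hour": t.hour, "weekday": t.weekday(), "month": t.month}

joined = [
    dict(id=r[0], mean_temp=r[2], pct_below_poverty=r[3], arrest=r[4], **time_features(r[1]))
    for r in rows
]
```

Each entry of `joined` is one training row: crime-time features, weather, census context, and the arrest / no arrest label.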

HOW’D WE DO?

nice!

~ 10 mins

NEW: TEXT CLASSIFICATION

Text Processing in Spark + H2O Deep Learning!

HAM / SPAM TEXTS

Problem:

No one likes to be spammed. Can we look at text messages and come up with a ham (real text) / spam classifier using Spark feature processing + H2O deep learning?

ML Workflow:

1. Tokenize words in text messages (1,024 texts)
2. Transform each text using Spark’s implementation of TF-IDF
3. Convert TF-IDF Spark RDD to H2O RDD
4. Run Deep Learning on Train / Test Data

FEATURE EXTRACTION

Original Text:

“Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2 remove.”

Post Data Cleaning & Tokenization:

(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)

lower case
ignore stopwords
strip punctuation
remove numbers
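The four cleaning steps can be sketched in a few lines of plain Python. The stopword list and the slang table below are hypothetical and deliberately tiny: they are chosen so that the slide's example comes out the same way, and the talk does not say how "mayb" became "maybe".

```python
import re

STOPWORDS = {"ok", "i", "ive", "n", "a", "the", "to"}  # tiny illustrative list (assumption)
SLANG = {"mayb": "maybe"}                              # hypothetical normalization table

def tokenize(text):
    """Lower-case, strip punctuation and numbers, normalize slang, drop stopwords."""
    text = text.lower()
    text = text.replace("’", "").replace("'", "")      # i've -> ive
    text = re.sub(r"[^a-z\s]", " ", text)              # punctuation and digits -> spaces
    tokens = [SLANG.get(tok, tok) for tok in text.split()]
    return [tok for tok in tokens if tok not in STOPWORDS]

message = "Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2 remove."
tokens = tokenize(message)
print(tokens)
```

A real pipeline would use a much larger stopword list, which would likely also drop words such as "but" and "they".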

FEATURE TRANSFORMATION

Post Data Cleaning & Tokenization:

(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)

Term Frequency - Inverse Document Frequency (TF-IDF)

1. TF - How often does “wisdom” occur in the above text?
2. IDF - Normalization which calculates the frequency of “wisdom” across all other text messages.

tf-idf(t, d) = tf(t, d) x idf(t) WHERE idf(t) = log(N / n), N = total number of text messages, n = number of text messages containing t
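The formula translates almost directly into code. This is a minimal sketch of the unsmoothed formula as written on the slide, not a copy of Spark's implementation; the three extra messages are made up so that there is a corpus to compute IDF against.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # n for each term: the number of documents that contain it at least once.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)  # raw count of each term in this document
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

texts = [
    ["but", "they", "said", "got", "wisdom", "teeth", "hidden", "inside", "maybe", "need", "remove"],
    ["free", "prize", "call", "now"],    # made-up message
    ["need", "call", "you", "now"],      # made-up message
    ["they", "said", "call", "later"],   # made-up message
]
weights = tf_idf(texts)
print(round(weights[0]["wisdom"], 3), round(weights[0]["need"], 3))
```

"wisdom" appears in only one message, so it gets a larger weight than "need", which appears in two.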

SO…WHAT JUST HAPPENED?

(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)

Bag-O-Words:

[ 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …, 0 ]
(nonzero entries: wisdom, teeth, remove)

TF-IDF:

[ 0, 0, 0, 0, 3.5, 2.9, 0, 0, 0.85, 0, 0, …, 0 ]
(nonzero entries: wisdom, teeth, remove)

DO IT LIVE!

Let’s fire up H2O and run a model to predict ham / spam!

DEEP AUTOENCODERS + K-MEANS EXAMPLE

Help cyclists with their health related questions!

CYCLING + __________

Problem:

New and experienced cyclists have questions about cycling + ______ (given topic). Let’s build a question + answer system to help!

ML Workflow:

1) Scrape thousands of article titles from the internet about cycling / cycling tips / cycling health, etc. from various sources.
2) Build Bag-of-Words dataset on article titles corpus
3) Reduce # of dimensions via deep autoencoder
4) Extract ‘last layer’ of deep features and cluster using k-means
5) Inspect Results!

BAG-OF-WORDS

Build dataset of cycling-related articles from various sources:

Article Title: The Basics of Exercise Nutrition

lower case
remove ‘stopwords’
remove punctuation

basics exercise nutrition

[ 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …, 0 ]
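A bag-of-words vector has one slot per vocabulary word and holds the count of that word in the title. Here is a minimal sketch; the stopword list and the second title are made up, and the real corpus had roughly 2,700 vocabulary words rather than four.

```python
import string

STOPWORDS = {"the", "of", "a", "and", "in", "to", "for"}  # illustrative list (assumption)

def clean_title(title):
    """Lower-case, remove punctuation, remove stopwords."""
    lowered = title.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in lowered.split() if word not in STOPWORDS]

def build_vocabulary(titles):
    """Sorted list of every distinct word left after cleaning."""
    return sorted({word for title in titles for word in clean_title(title)})

def bag_of_words(title, vocabulary):
    """One count per vocabulary word, in vocabulary order."""
    words = clean_title(title)
    return [words.count(term) for term in vocabulary]

titles = [
    "The Basics of Exercise Nutrition",
    "Nutrition for Cycling: The Basics",  # made-up second title
]
vocab = build_vocabulary(titles)
vector = bag_of_words(titles[0], vocab)
print(vocab, vector)
```

With thousands of titles the vectors become long and mostly zero, which is what motivates the dimensionality reduction step.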

DIMENSIONALITY REDUCTION

Use deep autoencoder to reduce # features (~2,700 words!)

Encoder: 2,700 Words -> 500 hidden features -> 250 H.F. -> 125 H.F. -> 50
Decoder: 50 -> 125 H.F. -> 250 H.F. -> 500 hidden features -> 2,700 Words

Example input: The Basics of Exercise Nutrition
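To show the shape of the computation, here is a sketch of the forward pass only, with random untrained weights. The tanh activation and the weight initialization are assumptions, training is not shown, and the layer sizes are scaled down from the slide's 2,700 -> 500 -> 250 -> 125 -> 50 so the example runs instantly in plain Python.

```python
import math
import random

def make_layers(sizes, seed=0):
    """Random (untrained) weights and zero biases, one pair per consecutive size pair."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
        biases = [0.0] * n_out
        layers.append((weights, biases))
    return layers

def forward(x, layers):
    """Apply each layer in turn: x <- tanh(W x + b)."""
    for weights, biases in layers:
        x = [math.tanh(sum(w * v for w, v in zip(row, x)) + b)
             for row, b in zip(weights, biases)]
    return x

# Scaled-down stand-in for the slide's [2700, 500, 250, 125, 50] encoder.
encoder_sizes = [27, 10, 5, 2]
decoder_sizes = encoder_sizes[::-1]
encoder = make_layers(encoder_sizes, seed=0)
decoder = make_layers(decoder_sizes, seed=1)

title_vector = [0.0] * encoder_sizes[0]
title_vector[4] = title_vector[5] = title_vector[8] = 1.0  # three words present

deep_features = forward(title_vector, encoder)      # the narrow middle layer
reconstruction = forward(deep_features, decoder)    # back to vocabulary size
```

Training would adjust the weights so that `reconstruction` matches `title_vector` as closely as possible; the middle layer then serves as the compressed representation passed on to clustering.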

K-MEANS CLUSTERING

For each article: Extract ‘last’ layer of autoencoder (50 deep features)

The Basics of Exercise Nutrition -> 50 ‘deep features’

Title                             DF1          DF2          DF3           DF4          DF5           DF6
The Basics of Exercise Nutrition  -0.09330833  0.167881429  -0.234307408  0.247723639  -0.067700267  -0.094107866

K-Means Clustering

Inputs: Extracted 50 deep features for each cycling-related article
K = 50 clusters after grid-search of values
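For reference, k-means itself is short enough to sketch in plain Python (Lloyd's algorithm: assign each point to its nearest centroid, then move each centroid to the mean of its points). This is a toy version on made-up 2-D points with K = 2, not H2O's implementation; the talk used 50 deep features per article and K = 50.

```python
import math
import random

def k_means(points, k, iterations=20, seed=0):
    """Lloyd's algorithm. Returns (cluster index per point, list of centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Made-up "deep features": two obvious groups.
features = [(0.1, 0.2), (0.0, 0.1), (0.2, 0.1), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
assignments, centroids = k_means(features, k=2)
print(assignments)
```

Choosing K is the hard part in practice, which is why the talk grid-searched over several values before settling on 50.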

RESULT: CYCLING + A.I.

Now we inspect the clusters!

Test Article Title:
Fluid & Carbohydrate Ingestion Improve Performance During 1 Hour of Intense Exercise

Result:
Clustered w/ 17 other titles (out of ~5,700)

Top 5 similar titles within cluster:

Caffeine ingestion does not alter performance during a 100-km cycling time-trial performance

Immuno-endocrine response to cycling following ingestion of caffeine and carbohydrate

Metabolism and performance following carbohydrate ingestion late in exercise

Increases in cycling performance in response to caffeine ingestion are repeatable

Fluid ingestion does not influence intense 1-h exercise performance in a mild environment

HOW TO GET FASTER?

Test Article Title:
Muscle Coordination is Key to Power Output & Mechanical Efficiency of Limb Movements

Result:
Clustered w/ 29 other titles (out of ~5,700)

Top 5 similar titles within cluster:

Muscle fibre type efficiency and mechanical optima affect freely chosen pedal rate during cycling.

Standard mechanical energy analyses do not correlate with muscle work in cycling.

The influence of body position on leg kinematics and muscle recruitment during cycling.

Influence of repeated sprint training on pulmonary O2 uptake and muscle deoxygenation kinetics in humans

Influence of pedaling rate on muscle mechanical energy in low power recumbent pedaling using forward dynamic simulations
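The slides do not say how the "top 5 similar titles" inside a cluster were ranked. One plausible approach, shown here purely as an assumption, is to sort the cluster's members by cosine similarity between their deep-feature vectors and the test article's vector. The titles and vectors below are placeholders.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_similar(query_vec, cluster, n=5):
    """cluster: list of (title, vector) pairs. Returns the n most similar titles."""
    ranked = sorted(cluster,
                    key=lambda item: cosine_similarity(query_vec, item[1]),
                    reverse=True)
    return [title for title, _ in ranked[:n]]

# Placeholder cluster members with made-up 2-D deep features.
cluster = [
    ("title A", (0.9, 0.1)),
    ("title B", (0.1, 0.9)),
    ("title C", (0.7, 0.3)),
]
best = top_similar((1.0, 0.0), cluster, n=2)
print(best)
```

Euclidean distance to the test article would be an equally reasonable choice; the ranking only matters within a single cluster, so either is cheap to compute.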

4. DATA SCIENCE COMPETITION

Apply / Learn More @: apps.h2o.ai

Check out our YouTube Channel for last year’s talks @ H2O World