Applications of Machine Learning at UCSB
AGENDA
1. Introduction to Big Data / ML
2. What is H2O.ai?
3. Use Cases:
a) Beat Bill Belichick
b) Fight Crime in Chicago
c) Ham/Spam Text Messages
d) Cycling Article Search
4. Data Science Competition
1. INTRO TO BIG DATA / ML
BIG DATA IS LIKE TEENAGE SEX:
everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it…
Dan Ariely, Prof. @ Duke
BIG VS. SMALL DATA
SMALL = Data fits in RAM
BIG = Data does NOT fit in RAM
Basically… Big Data is data too big to process using conventional methods (e.g. Excel, Access).
When you try to open the file in Excel, Excel CRASHES.
V + V + V
Today, we have access to more data than we know what to do with!
1) Wearables (fitbit, iWatch, etc.)
2) Click streams from web visitors
3) Sensor readings
4) Social Media Outlets (e.g. Twitter, Facebook, etc.)
Volume - Data volumes are becoming unmanageable
Variety - More data types being captured
Velocity - Data arrives rapidly and must be processed / stored
THE HOPE OF BIG DATA
1. Data contains information of great business / personal value
Examples:
a) Predicting future stock movements = $$$
b) Netflix movie recommendations = Better experience = $$$
2. IF you can extract those insights from the data, you can make better decisions
So how the hell do you do it?
Enter Machine Learning (ML)…
MACHINE LEARNING
The Wikipedia Definition:
…a scientific discipline that explores the construction and study of algorithms that can learn from data. Such algorithms operate by building a model…. ZZZzzzzzZZZzzzzzz
My Definition:
The development, analysis, and application of algorithms that enable machines to make predictions and / or better understand data
2 Types of Learning:
SUPERVISED + UNSUPERVISED
SUPERVISED LEARNING
What is it?
Methods that infer a function from labeled training data. Key task: Predicting ________ . (Insert your task here)
Examples of supervised learning tasks:
1. Classification Tasks - Benign / Malignant tumor
2. Regression Tasks - Predicting future stock market prices
3. Image Recognition - Highlighting faces in pictures
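To make "infer a function from labeled training data" concrete, here is a minimal sketch of a classifier. It uses a nearest-centroid rule, chosen only because it fits in a few lines; it is not a method from the talk. The tumor measurements and feature names are made up for illustration.

```python
from collections import defaultdict
from math import dist

def fit_centroids(features, labels):
    """Learn one mean vector (centroid) per label from labeled training data."""
    grouped = defaultdict(list)
    for x, y in zip(features, labels):
        grouped[y].append(x)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }

def predict(centroids, x):
    """Predict the label whose centroid is closest to x."""
    return min(centroids, key=lambda label: dist(centroids[label], x))

# Hypothetical training data: (tumor size, irregularity score) with known labels.
train_x = [(1.0, 0.2), (1.2, 0.1), (3.8, 0.9), (4.1, 0.8)]
train_y = ["benign", "benign", "malignant", "malignant"]

model = fit_centroids(train_x, train_y)
print(predict(model, (1.1, 0.3)))  # a small, regular tumor
print(predict(model, (4.0, 0.7)))  # a large, irregular tumor
```

The labels are what make this supervised: without `train_y` there would be nothing to fit the centroids to.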
UNSUPERVISED LEARNING
What is it?
Methods to understand the general structure of input data where no prediction is needed.
NO CURATION NEEDED!
Examples of unsupervised learning tasks:
1. Clustering - Discovering customer segments
2. Topic Extraction - What topics are people tweeting about?
3. Information Retrieval - IBM Watson: Question + Answer
4. Anomaly Detection - Detecting irregular heart-beats
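As a small illustration of the anomaly detection example, the sketch below flags values that sit far from the mean of an unlabeled series. The z-score rule and the threshold of 2 standard deviations are assumptions made for this example, and the beat-to-beat intervals are invented.

```python
from statistics import mean, stdev

def find_anomalies(values, threshold=2.0):
    """Return indices of values more than `threshold` std devs from the mean."""
    mu = mean(values)
    sigma = stdev(values)
    return [i for i, v in enumerate(values) if abs(v - mu) > threshold * sigma]

# Hypothetical beat-to-beat intervals in seconds; no labels are provided.
intervals = [0.80, 0.82, 0.79, 0.81, 0.80, 1.60, 0.78, 0.81, 0.80, 0.79]

print(find_anomalies(intervals))  # index of the irregular beat
```

No one told the code which beat was irregular; it only looked at the structure of the data.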
2. WHAT IS H2O?
What is H2O? (water, duh!)
It is ALSO an open-source, parallel processing engine for machine learning.
What makes H2O different?
Cutting-edge algorithms + parallel architecture + ease-of-use = Happy Data Scientists / Analysts
TRY IT!
Don’t take my word for it… www.h2o.ai
Simple Instructions
1. cd to download location
2. unzip h2o file
3. java -jar h2o.jar
4. Point browser to: localhost:54321
[Screenshots: H2O GUI and R interfaces]
PASS OR RUN?
On any given offensive play…
Coach Bill can either call a PASS or a RUN
What determines this?
Game situation
Opposing team
Personnel
Time remaining, etc, etc
Yards to go (until 1st down)
Basically, LOTS of stuff.
BUT WHAT IF??
Question:
Can we try to predict whether the next play will be PASS or RUN using historical data?
Approach:
Download every offensive play from Belichick-Brady era since 2000
Extract known features to build model inputs
Use various Machine Learning approaches to model PASS / RUN
Disclaimer: I’m not a Seahawks fan!
DATA COLLECTION
Data:
13 years of data (2002 - 2013 season)
194 games total
14,547 total offensive plays (excludes punts, kickoffs, returns)
Response Variable: PASS / RUN
Model Inputs:
Quarter, Minutes, Seconds, Opposing Team, Down, Distance, Line of Scrimmage, NE-Score, Opposing Team Score, Season, Formation, Game Status (is NE losing / winning / tied)
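One way to picture a single training row is sketched below: a raw play record is mapped to the model inputs listed above plus the PASS / RUN response, with Game Status derived from the two scores. The field names and the example play are hypothetical; the actual dataset layout is not shown in the slides.

```python
def game_status(ne_score, opp_score):
    """Derive the Game Status input: is NE losing, winning, or tied?"""
    if ne_score > opp_score:
        return "winning"
    if ne_score < opp_score:
        return "losing"
    return "tied"

def play_to_row(play):
    """Map one raw play record to (model inputs, response)."""
    inputs = {
        "quarter": play["quarter"],
        "minutes": play["minutes"],
        "seconds": play["seconds"],
        "opponent": play["opponent"],
        "down": play["down"],
        "distance": play["distance"],
        "line_of_scrimmage": play["line_of_scrimmage"],
        "ne_score": play["ne_score"],
        "opp_score": play["opp_score"],
        "season": play["season"],
        "formation": play["formation"],
        "game_status": game_status(play["ne_score"], play["opp_score"]),
    }
    return inputs, play["play_type"]

# Hypothetical play: 3rd and 7, NE trailing by a touchdown late in the game.
example_play = {
    "quarter": 4, "minutes": 2, "seconds": 15, "opponent": "NYJ",
    "down": 3, "distance": 7, "line_of_scrimmage": 45,
    "ne_score": 14, "opp_score": 21, "season": 2012,
    "formation": "SHOTGUN", "play_type": "PASS",
}

inputs, response = play_to_row(example_play)
print(inputs["game_status"], response)
```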
OPEN CITY, OPEN DATA
“…my kind of town” - F. Sinatra
Crime Data
~4.6 Million rows of crimes from 2001, updated weekly*
External data source considerations???
Weather Data?
U.S. Census Data?
ML WORKFLOW
1. Collect datasets (Crime + Weather + Census)
2. Do some feature extraction (e.g. dates, times)
3. Join Crime Data + Weather Data + Census Data
4. Build deep learning model to predict arrest / no arrest made
GOAL:
For a given crime, predict if an arrest is more / less likely to be made!
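Steps 2 and 3 of the workflow can be sketched in plain Python as follows. The column names, the join keys (calendar date for weather, community area for census) and all values are assumptions made for illustration; the talk does not say which keys were used.

```python
from datetime import datetime

# Hypothetical records standing in for the three datasets.
crimes = [
    {"timestamp": "2015-01-10 23:15", "area": 8, "type": "THEFT", "arrest": False},
    {"timestamp": "2015-01-11 02:40", "area": 25, "type": "NARCOTICS", "arrest": True},
]
weather = {"2015-01-10": {"max_temp_f": 21}, "2015-01-11": {"max_temp_f": 30}}
census = {8: {"median_income": 88000}, 25: {"median_income": 31000}}

def build_row(crime):
    """Extract date/time features, then join weather and census columns."""
    ts = datetime.strptime(crime["timestamp"], "%Y-%m-%d %H:%M")
    row = {
        "type": crime["type"],
        "hour": ts.hour,            # feature extraction from the timestamp
        "weekday": ts.weekday(),    # Monday = 0 ... Sunday = 6
        "arrest": crime["arrest"],  # response: arrest / no arrest made
    }
    row.update(weather.get(ts.strftime("%Y-%m-%d"), {}))  # join on date
    row.update(census.get(crime["area"], {}))             # join on area
    return row

rows = [build_row(c) for c in crimes]
for r in rows:
    print(r)
```

Each joined row is what a model in step 4 would consume, with `arrest` as the value to predict.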
HAM / SPAM TEXTS
Problem:
No one likes to be spammed. Can we look at text messages and come up with a ham (real text) / spam classifier using Spark feature processing + H2O deep learning?
ML Workflow:
1. Tokenize words in text messages (1,024 texts)
2. Transform each text using Spark’s implementation of TF-IDF
3. Convert the TF-IDF Spark RDD to an H2O RDD
4. Run Deep Learning on Train / Test Data
FEATURE EXTRACTION
Original Text:
“Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2 remove.”
lower case, ignore stopwords, strip punctuation, remove numbers
Post Data Cleaning & Tokenization:
(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)
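The four cleaning steps can be sketched with the standard library as below. The stopword list is a tiny hypothetical one picked for this example, not the list used in the talk. Note also that the slide's output spells "mayb" as "maybe", a normalization this sketch does not attempt.

```python
import re

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"ok", "i", "ve", "n"}

def tokenize(text):
    """Lower-case, remove numbers, strip punctuation, drop stopwords."""
    text = text.lower()                      # lower case
    text = re.sub(r"[0-9]+", " ", text)      # remove numbers
    text = re.sub(r"[^a-z\s]", " ", text)    # strip punctuation
    return [w for w in text.split() if w not in STOPWORDS]  # ignore stopwords

message = "Ok…But they said i’ve got wisdom teeth hidden inside n mayb need 2 remove."
print(tokenize(message))
```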
FEATURE TRANSFORMATION
Post Data Cleaning & Tokenization:
(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)
Term Frequency - Inverse Document Frequency (TF-IDF)
1. TF - How often does “wisdom” occur in above text?
2. IDF - Normalization based on how often “wisdom” occurs across all the text messages.
tf-idf(t, d) = tf(t, d) x idf(t), WHERE idf(t) = log(N / n), N = total number of text messages, n = number of text messages containing term t
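The formula above translates almost directly into code. This is a minimal sketch that takes tf(t, d) to be the raw count of t in d and uses idf(t) = log(N / n) exactly as written on the slide; libraries often apply a smoothed variant, so their numbers may differ. The example messages are invented.

```python
from collections import Counter
from math import log

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per doc."""
    n_docs = len(docs)  # N in the formula
    # n in the formula: number of docs that contain each term at least once
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this doc
        weights.append({t: tf[t] * log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical tokenized text messages.
docs = [
    ["wisdom", "teeth", "remove"],
    ["free", "prize", "call"],
    ["call", "me", "maybe"],
    ["free", "call"],
]

w = tf_idf(docs)
print(w[0]["wisdom"])  # rare term: high weight
print(w[1]["call"])    # common term: low weight
```

A term that appears in every message gets idf = log(1) = 0, which is how TF-IDF discounts words that carry no discriminating information.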
SO…WHAT JUST HAPPENED?
(but, they, said, got, wisdom, teeth, hidden, inside, maybe, need, remove)
Bag-O-Words:
[ 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …, 0 ]
(non-zero entries, in order: wisdom, teeth, remove)
TF-IDF:
[ 0, 0, 0, 0, 3.5, 2.9, 0, 0, 0.85, 0, 0, …, 0 ]
(non-zero entries, in order: wisdom, teeth, remove)
CYCLING + __________
Problem:
New and experienced cyclists have questions about cycling + ______ (given topic). Let’s build a question + answer system to help!
ML Workflow:
1) Scrape thousands of article titles from the internet about cycling / cycling tips / cycling health, etc. from various sources.
2) Build Bag-of-Words Dataset on article titles corpus
3) Reduce # of dimensions via deep autoencoder
4) Extract ‘last layer’ of deep features and cluster using k-means
5) Inspect Results!
BAG-OF-WORDS
Build dataset of cycling-related articles from various sources:
Article Title: The Basics of Exercise Nutrition
lower case, remove ‘stopwords’, remove punctuation
[ 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, …, 0 ]
(non-zero entries, in order: basics, exercise, nutrition)
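The title-to-vector step above can be sketched as follows: build a vocabulary over the whole corpus, then count each title's tokens into a fixed-length vector. The second title is made up, and the alphabetical vocabulary order is an arbitrary choice for this example.

```python
def build_vocabulary(token_lists):
    """Assign each distinct term in the corpus a column index."""
    terms = sorted({t for tokens in token_lists for t in tokens})
    return {term: i for i, term in enumerate(terms)}

def bag_of_words(tokens, vocab):
    """Count occurrences of each vocabulary term in one title."""
    vec = [0] * len(vocab)
    for t in tokens:
        if t in vocab:
            vec[vocab[t]] += 1
    return vec

# Titles after lower-casing and removing stopwords and punctuation.
titles = [
    ["basics", "exercise", "nutrition"],  # "The Basics of Exercise Nutrition"
    ["cycling", "nutrition", "tips"],     # hypothetical second title
]

vocab = build_vocabulary(titles)
for tokens in titles:
    print(bag_of_words(tokens, vocab))
```

With a real corpus the vectors have one column per distinct word, which is where the ~2,700 features on the next slide come from.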
DIMENSIONALITY REDUCTION
Use deep autoencoder to reduce # features (~2,700 words!)
Example input: The Basics of Exercise Nutrition
Encoder: 2,700 Words -> 500 hidden features -> 250 H.F. -> 125 H.F. -> 50
Decoder: 50 -> 125 H.F. -> 250 H.F. -> 500 hidden features -> 2,700 Words
K-MEANS CLUSTERING
For each article: Extract ‘last’ layer of autoencoder (50 deep features)
The Basics of Exercise Nutrition -> 50 ‘deep features’
DF1 = -0.09330833, DF2 = 0.167881429, DF3 = -0.234307408, DF4 = 0.247723639, DF5 = -0.067700267, DF6 = -0.094107866, …
K-Means Clustering
Inputs: Extracted 50 deep features for each cycling-related article
K = 50 clusters after grid-search of values
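For readers who have not seen k-means, here is a minimal pure-Python sketch of the standard assign-then-update loop (Lloyd's algorithm). It runs on toy 2-D points with K = 2 rather than on 50 deep features with K = 50, and the random initialization, iteration count and seed are assumptions for this example. It is not H2O's implementation.

```python
import random
from math import dist

def k_means(points, k, iterations=20, seed=0):
    """Cluster points into k groups; returns (assignments, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignments, centroids

# Toy stand-in for the deep features: two well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

assignments, centroids = k_means(points, k=2)
print(assignments)
```

Cluster ids are arbitrary labels; what matters is which points end up sharing an id, which is how similar article titles get grouped together.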
RESULT: CYCLING + A.I.
Now we inspect the clusters!
Test Article Title:
Fluid & Carbohydrate Ingestion Improve Performance During 1 Hour of Intense Exercise
Result:
Clustered w/ 17 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Caffeine ingestion does not alter performance during a 100-km cycling time-trial performance
Immuno-endocrine response to cycling following ingestion of caffeine and carbohydrate
Metabolism and performance following carbohydrate ingestion late in exercise
Increases in cycling performance in response to caffeine ingestion are repeatable
Fluid ingestion does not influence intense 1-h exercise performance in a mild environment
HOW TO GET FASTER?
Test Article Title:
Muscle Coordination is Key to Power Output & Mechanical Efficiency of Limb Movements
Result:
Clustered w/ 29 other titles (out of ~5,700)
Top 5 similar titles within cluster:
Muscle fibre type efficiency and mechanical optima affect freely chosen pedal rate during cycling.
Standard mechanical energy analyses do not correlate with muscle work in cycling.
The influence of body position on leg kinematics and muscle recruitment during cycling.
Influence of repeated sprint training on pulmonary O2 uptake and muscle deoxygenation kinetics in humans
Influence of pedaling rate on muscle mechanical energy in low power recumbent pedaling using forward dynamic simulations