Machine Learning with Hadoop
Boston HUG, 2012
TRANSCRIPT
Why Now?
• But Moore’s law has applied for a long time
• Why is Hadoop/Big Data exploding now?
• Why not 10 years ago?
• Why not 20?
8/9/2013
Size Matters, but …
• If it were just availability of data then existing big companies would adopt big data technology first
They didn’t
Or Maybe Cost
• If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
They didn’t
Backwards adoption
• Under almost any threshold argument startups would not adopt big data technology first
They did
Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
Analytics Scaling Laws
• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
[Figure: value vs. scale, 0–2000. A diminishing-returns value curve is annotated, in order of increasing scale: "Anybody with eyes", "Intern with a spreadsheet", "In-house analytics", "Industry-wide data consortium", "NSA, non-proliferation". Initially, linear cost scaling actually makes things worse; then a tipping point is reached and things change radically. The net value optimum has a sharp peak well before maximum effort.]
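The sharp-peak claim can be sketched numerically. A minimal Java sketch, assuming an illustrative diminishing-returns value curve (the 80-20 shape) and a linear cost with a low constant; the curve shapes and constants here are made up for illustration, not taken from the talk:

```java
public class NetValue {
    // Illustrative assumptions: value shows rapidly diminishing
    // returns (big gains for little initial effort), while cost
    // scales linearly with a low constant.
    static double value(double scale) { return 1.0 - Math.exp(-scale / 300.0); }
    static double cost(double scale)  { return 0.0005 * scale; }
    static double net(double scale)   { return value(scale) - cost(scale); }

    // Scan scales 0..2000 for the net-value optimum.
    static double bestScale() {
        double best = 0;
        for (int s = 1; s <= 2000; s++) {
            if (net(s) > net(best)) best = s;
        }
        return best;
    }
}
```

With these constants the optimum lands well before the maximum scale of 2000, matching the sharp peak in the figure.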
Prerequisites for Tipping
• To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
Agenda
• Mahout outline
– Recommendations
– Clustering
– Classification
• Supervised on-line learning
• Feature hashing
• Hybrid Parallel/Sequential Systems
• Real-time learning
Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
– Now with MORE topping!
How it Works
• We are given “features”
– Often binary values in a vector
• Algorithm learns weights
– The weighted sum of feature × weight values is the key
• Each weight is a single real value
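As a concrete sketch of "algorithm learns weights", here is a minimal on-line logistic regression in plain Java. This is hand-rolled for illustration, not Mahout's actual SGD API; feature vectors are represented as arrays of active (binary) slot indices, and the learning rate is an arbitrary choice:

```java
public class OnlineLogistic {
    final double[] weights;      // one real value per feature slot
    final double learningRate;

    OnlineLogistic(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Score is the sigmoid of the weighted sum over active features.
    double predict(int[] activeFeatures) {
        double sum = 0;
        for (int f : activeFeatures) sum += weights[f];
        return sigmoid(sum);
    }

    // One SGD step: nudge each active weight toward the target (0 or 1).
    void train(int[] activeFeatures, int target) {
        double error = target - predict(activeFeatures);
        for (int f : activeFeatures) weights[f] += learningRate * error;
    }
}
```

Training is sequential and fast: each example touches only the weights of its active features.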
Features
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur thirty-three million one hundred
thousand euros, only) into a foreign company's
bank account for our favor.
...
Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
But …
• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots
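One standard way to get binary vectors with lots of slots is feature hashing: hash each token to a slot index and set that bit. The sketch below is a minimal hand-rolled version of the idea (Mahout ships dedicated feature encoders that elaborate on it, e.g. with multiple hash probes); the class name and slot count are illustrative:

```java
public class HashedEncoder {
    // Hash each token into one of `slots` positions and set that bit.
    // Collisions are tolerated: with enough slots they rarely matter.
    static boolean[] encode(String[] tokens, int slots) {
        boolean[] vector = new boolean[slots];
        for (String token : tokens) {
            int slot = Math.floorMod(token.hashCode(), slots);
            vector[slot] = true;
        }
        return vector;
    }
}
```

The payoff is that no dictionary has to be built or shipped around: the hash function is the dictionary.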
Training Data
[Diagram: raw data is joined, combined, and transformed into training examples with target values; parsing turns the examples into tokens; encoding turns tokens into vectors, which feed the training algorithm.]
Full Scale Training
[Diagram: input flows through map-reduce feature extraction and down-sampling, is joined with side-data, and then feeds sequential SGD learning; the hand-off from map-reduce is now via NFS.]
Hybrid Model Development
[Diagram: on the big-data cluster, logs are grouped by user into user sessions, transaction patterns are counted, and the result is merged with account info into training data; a shared filesystem hands the training data to legacy modeling (PROC LOGISTIC), which produces the model.]
Enter the Pig Vector
• Pig UDF’s for
– Vector encoding
– Model training
define EncodeVector
    org.apache.mahout.pig.encoders.EncodeVector(
        '10', 'x+y+1',
        'x:numeric, y:numeric, z:numeric');

vectors = foreach docs generate newsgroup, EncodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;
Real-time Developments
• Storm + Hadoop + MapR
– Real-time with Storm
– Long-term with Hadoop
– State checkpoints with MapR
• Add the Bayesian Bandit for on-line learning
Mobile Network Monitor
[Diagram: geo-dispersed ingest servers feed transaction data into batch aggregation and HBase, which serve a real-time dashboard and alerts plus a retro-analysis interface.]
A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
A First Conclusion
• Probability as expressed by humans is subjective and depends on information and experience
A Second Conclusion
• A single number is a bad way to express uncertain knowledge
• A distribution of values might be better
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
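The four steps above amount to Thompson sampling with Beta posteriors. A minimal sketch in plain Java, using the order-statistic trick to sample a Beta with integer parameters; the class names and the uniform Beta(1, 1) prior are illustrative choices:

```java
import java.util.Arrays;
import java.util.Random;

public class BayesianBandit {
    final int[] wins, losses;   // observed outcomes per bandit
    final Random rand;

    BayesianBandit(int arms, long seed) {
        wins = new int[arms];
        losses = new int[arms];
        rand = new Random(seed);
    }

    // Sample from Beta(a, b) for integer a, b >= 1: the a-th smallest
    // of a+b-1 independent uniforms has exactly that distribution.
    double sampleBeta(int a, int b) {
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) u[i] = rand.nextDouble();
        Arrays.sort(u);
        return u[a - 1];
    }

    // Sample a payoff rate from each bandit's posterior and play the
    // bandit whose sample is largest.
    int choose() {
        int best = 0;
        double bestSample = -1;
        for (int arm = 0; arm < wins.length; arm++) {
            double p = sampleBeta(wins[arm] + 1, losses[arm] + 1);
            if (p > bestSample) { bestSample = p; best = arm; }
        }
        return best;
    }

    void update(int arm, boolean win) {
        if (win) wins[arm]++; else losses[arm]++;
    }
}
```

Uncertain bandits get sampled optimistically often enough to be explored; well-understood bad bandits almost never win the sample, so play concentrates on the best arm.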
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models
Deployment with Storm/MapR
[Diagram: impression logs and click logs feed a targeting engine and a conversion detector; a model selector routes RPC traffic among several online models, each with its own training process; results flow via RPC to a conversion dashboard. All state managed transactionally in MapR file system.]
Service Architecture
[Diagram: the same deployment layered over MapR lockless storage services and MapR pluggable service management; Storm runs the online components (targeting engine, conversion detector, model selector, online models) and Hadoop runs the batch side (impression and click logs, training), with RPC connecting them to the conversion dashboard.]
Find Out More
• Me: [email protected]
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning