Boston HUG
DESCRIPTION
Talk about why and how machine learning works with Hadoop, with recent developments for real-time operation.
TRANSCRIPT
Machine Learning with Hadoop
Agenda
• Why Big Data? Why now?
• What can you do with big data?
• How does it work?
Slow Motion Explosion
Why Now?
• But Moore’s law has applied for a long time
• Why is Hadoop/Big Data exploding now?
• Why not 10 years ago?
• Why not 20?
2/15/12
Size Matters, but …
• If it were just availability of data, then existing big companies would have adopted big data technology first
They didn’t
Or Maybe Cost
• If it were just a net positive value, then finance companies should have adopted first because they have a higher opportunity value per byte
They didn’t
Backwards Adoption
• Under almost any threshold argument, startups would not adopt big data technology first
They did
Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
Analytics Scaling Laws
• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
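As a toy illustration of the scaling argument above (every function and constant here is made up for the demo, not taken from the talk), a diminishing-returns value curve set against exponential vs. linear cost shows the net-value optimum moving far out when cost becomes linear with a low constant:

```python
import math

# Illustrative curves: value shows 80-20 diminishing returns, while cost
# scales either exponentially (old school) or linearly with a low
# constant (big data). All numbers are invented for the sketch.
def value(effort):
    return 100 * math.log1p(effort)

def old_cost(effort):
    return math.exp(0.5 * effort)     # exponential scaling

def bigdata_cost(effort):
    return 2.0 * effort               # linear scaling, low constant

def optimum(cost, efforts):
    # Effort level that maximizes net value = value - cost.
    return max(efforts, key=lambda e: value(e) - cost(e))

efforts = [e / 10 for e in range(1, 1000)]
print("old-school optimum effort:", optimum(old_cost, efforts))
print("big-data optimum effort:  ", optimum(bigdata_cost, efforts))
```

With these made-up curves the exponential cost pins the optimum at small effort, while linear cost pushes it roughly an order of magnitude further out.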
[Chart: net value vs. analytic effort, with effort levels annotated “We knew that” / “We should have known that” / “We didn’t know that!” / “You’re kidding, people do that?” and, in parallel, “Anybody with eyes” / “Intern with a spreadsheet” / “In-house analytics” / “Industry-wide data consortium” / “NSA, non-proliferation”]
• Net value optimum has a sharp peak well before maximum effort
• But scaling laws are changing both slope and shape
– More than just a little – they are changing a LOT!
• Initially, linear cost scaling actually makes things worse
• A tipping point is reached and things change radically …
Pre-requisites for Tipping
• To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
So that is why, and why now
What can you do with it? And how?
Agenda
• Mahout outline
– Recommendations
– Clustering
– Classification
• Supervised on-line learning
• Feature hashing
• Hybrid Parallel/Sequential Systems
• Real-time learning
Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– Fast on-line (sequential) training
– Now with MORE topping!
How it Works
• We are given “features”
– Often binary values in a vector
• The algorithm learns weights
– The weighted sum of feature * weight is the key
• Each weight is a single real value
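A minimal sketch of the scoring step described above, using made-up features and weights (a real classifier such as Mahout’s SGD learns these weights from training data):

```python
import math

# Invented weights for the sketch: positive pushes toward "spam",
# negative toward "not spam".
weights = {"viagra": 2.1, "million": 1.7, "lunch": -1.3, "hadoop": -2.0}

def score(features):
    # Weighted sum: each present binary feature contributes its weight.
    return sum(weights.get(f, 0.0) for f in features)

def probability_spam(features):
    # A logistic link turns the weighted sum into a probability.
    return 1.0 / (1.0 + math.exp(-score(features)))

print(probability_spam({"million", "viagra"}))  # high
print(probability_spam({"lunch", "hadoop"}))    # low
```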
An Example
Features
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....
Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
But …
• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots
Feature Encoding
Hashed Encoding
Feature Collisions
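A sketch of what hashed encoding might look like (the slot count, probe scheme, and hash choice here are illustrative; Mahout’s encoders use Murmur hashing and far larger vectors, and multiple probes per feature soften the effect of collisions):

```python
import hashlib

NUM_SLOTS = 16  # illustrative; real systems use many more slots

def slot(word, probe, num_slots=NUM_SLOTS):
    # Deterministic hash of (probe, word) into a slot index.
    digest = hashlib.md5(f"{probe}:{word}".encode()).hexdigest()
    return int(digest, 16) % num_slots

def encode(words, num_slots=NUM_SLOTS, probes=2):
    # Binary vector with `probes` hashed locations set per feature.
    # Collisions simply land two features in the same slot.
    vector = [0] * num_slots
    for word in words:
        for p in range(probes):
            vector[slot(word, p, num_slots)] = 1
    return vector

print(encode(["hadoop", "user", "group"]))
```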
Training Data
Full Scale Training
[Diagram: input and side-data pass through feature extraction and down-sampling, then a data join, then sequential SGD learning; the map-reduce stages now run via NFS]
Hybrid Model Development
[Diagram: on the big-data cluster, logs are grouped by user into user sessions and transaction patterns are counted to produce training data; across a shared filesystem, the legacy modeling side merges that training data with account info and runs PROC LOGISTIC to produce the model]
Enter the Pig Vector
• Pig UDFs for
– Vector encoding
– Model training

define EncodeVector org.apache.mahout.pig.encoders.EncodeVector(
    '10', 'x+y+1', 'x:numeric, y:numeric, z:numeric');
vectors = foreach docs generate newsgroup, encodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;
Real-time Developments
• Storm + Hadoop + MapR
– Real-time with Storm
– Long-term with Hadoop
– State checkpoints with MapR
• Add the Bayesian Bandit for on-line learning
Aggregate Splicing
• Hadoop handles the past
• Storm handles the present
Mobile Network Monitor
[Diagram: geo-dispersed ingest servers feed transaction data through batch aggregation into HBase, which serves a real-time dashboard and alerts plus a retro-analysis interface]
A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and, while it is in the air, ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
A First Conclusion
• Probability as expressed by humans is subjective and depends on information and experience
A Second Conclusion
• A single number is a bad way to express uncertain knowledge
• A distribution of values might be better
[Chart: three distributions over the heads probability – “I dunno”, “5 and 5”, and “2 and 10”]
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
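The four steps above are Thompson sampling with beta posteriors; a small simulation with two bandits (the payoff rates are invented for the demo) shows the loop steering coins toward the better bandit:

```python
import random

# True payoff rates are unknown to the algorithm; invented for the demo.
true_rates = [0.12, 0.20]
wins = [0, 0]
losses = [0, 0]

random.seed(42)
for _ in range(10000):
    # Compute distributions based on data: Beta(1 + wins, 1 + losses)
    # is the posterior for each bandit's payoff probability.
    # Sample p1 and p2 from these distributions ...
    samples = [random.betavariate(1 + wins[i], 1 + losses[i])
               for i in (0, 1)]
    # ... and put the coin in bandit 1 if p1 > p2, else in bandit 2.
    choice = 0 if samples[0] > samples[1] else 1
    if random.random() < true_rates[choice]:
        wins[choice] += 1
    else:
        losses[choice] += 1

plays = [wins[i] + losses[i] for i in (0, 1)]
print("plays per bandit:", plays)  # most plays go to the better bandit
```

Early on the posteriors are wide, so both bandits get explored; as evidence accumulates the worse bandit’s samples rarely win, which is how sampling unifies exploration and exploitation.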
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models
Deployment with Storm/MapR
All state managed transactionally in MapR file system
Service Architecture
[Diagram: Storm and Hadoop running on MapR Lockless Storage Services and MapR Pluggable Service Management]
Find Out More
• Me: [email protected], [email protected], [email protected]
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning