data mining-2011-09
DESCRIPTION
Talk given on September 20 to the Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.
TRANSCRIPT
Data-mining, Hadoop and the Single Node
Map-Reduce
[Diagram: the map-reduce data flow: input, map, shuffle, reduce, output]
MapR's Streaming Performance
[Bar chart: read and write throughput in MB per sec (0-2250) for MapR vs. Hadoop on two hardware configurations: 11 x 7200rpm SATA and 11 x 15Krpm SAS. Higher is better.]
Tests: i. 16 streams x 120GB ii. 2000 streams x 1GB
Terasort on MapR
[Bar charts: elapsed time in minutes for a 1.0 TB sort (0-60) and a 3.5 TB sort (0-300), MapR vs. Hadoop. Lower is better.]
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
Data Flow Expected Volumes
[Diagram: expected data volumes into and out of a single node]
Network: 6 x 1Gb/s = 600 MB/s
Storage: 12 x 100MB/s = 900 MB/s
MUCH faster for some operations
[Chart: file create rate vs. number of files (millions), on the same 10 nodes]
Universal export to self
[Diagram: every cluster node runs its own NFS server and exports the cluster to itself; tasks on each node do their I/O through that local NFS mount. Nodes are identical.]
Sharded text Indexing
[Diagram: input documents go to map tasks; reducers build each shard's index on local disk, then copy it to clustered index storage; search engines copy the index back to local disk before loading it]
• Assign documents to shards
• Index text to local disk and then copy index to distributed file store
• Copy to local disk typically required before index can be loaded
Conventional data flow
[Diagram: the same flow: map, reduce to local disk, copy to clustered index storage, download to the search engine's local disk]
• Failure of a reducer causes garbage to accumulate on the local disk
• Failure of a search engine requires another download of the index from clustered storage
Simplified NFS data flows
[Diagram: input documents go to map tasks; reducers write the index into their task work directories via NFS; search engines read the mirrored index directly from clustered index storage]
• Index to task work directory via NFS
• Failure of a reducer is cleaned up by the map-reduce framework
• Search engine reads mirrored index directly
K-means, the movie
[Diagram: input points and current centroids feed "assign to nearest centroid"; assignments are aggregated into new centroids, which loop back for the next iteration]

But …

Parallel Stochastic Gradient Descent
[Diagram: input feeds several "train sub-model" tasks; the sub-models are averaged into a single model]

Variational Dirichlet Assignment
[Diagram: input feeds "gather sufficient statistics"; the statistics are used to update the model]
Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Annotations: centroids are read from HDFS to local disk by the distributed cache; the mapper reads them from local disk via the distributed cache; output is written by map-reduce]
Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to MapR FS (instead of HDFS)
[Annotations: centroids are read via NFS; output is written by map-reduce]
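A minimal plain-Java sketch of the mapper and combiner logic from these two slides, with the Hadoop plumbing omitted; all class and method names here are illustrative, not Mahout's actual k-means implementation:

    import java.util.Arrays;

    public class KMeansSketch {

      // Mapper logic: assign a point to its nearest centroid, then
      // emit (cluster id, (1, point)).
      static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
          double d = 0;
          for (int j = 0; j < point.length; j++) {
            double diff = point[j] - centroids[i][j];
            d += diff * diff;
          }
          if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
      }

      // Combiner/reducer logic: merge (n1, mean1) with (n2, mean2)
      // into (n1 + n2, weighted mean): the "emit (n, sum/n)" step.
      static double[] merge(int n1, double[] mean1, int n2, double[] mean2) {
        double[] out = new double[mean1.length];
        for (int j = 0; j < out.length; j++) {
          out[j] = (n1 * mean1[j] + n2 * mean2[j]) / (n1 + n2);
        }
        return out;
      }

      public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {5, 5}};
        System.out.println(nearest(new double[] {4.5, 5.2}, centroids)); // 1
        System.out.println(Arrays.toString(
            merge(1, new double[] {4, 4}, 3, new double[] {6, 6})));     // [5.5, 5.5]
      }
    }

Because the weighted mean is associative, the same merge works for both the combiner and the reducer.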
Poor man’s Pregel
• Mapper
• Lines in bold can use conventional I/O via NFS
while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
Mahout
• Scalable Data Mining for Everybody
What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of similar things)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
So What?
• Online training has low overhead for small and moderate size data-sets
[Chart annotation: "big" starts here]
An Example

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....

And Another

Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
Mahout’s SGD
• Learns on-line, per example
– O(1) memory
– O(1) time per training example
• Sequential implementation
– fast, but not parallel
Special Features
• Hashed feature encoding
• Per-term annealing
– learn the boring stuff once
• Auto-magical learning knob turning
– learns correct learning rate, learns correct learning rate for learning learning rate, ...
Feature Encoding
Hashed Encoding
Feature Collisions
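In code, hashed encoding looks roughly like the sketch below, using the encoder classes from Mahout's org.apache.mahout.vectorizer.encoders package (the dimension and sample text are made up, and exact APIs may vary by Mahout version). Each term is hashed directly into a fixed-size vector, so no dictionary is needed, and the occasional collisions pictured on the previous slide are simply tolerated by the learner:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // Sketch: hash words into a fixed-size feature vector. Two words
    // may collide in a slot; with enough dimensions this rarely hurts.
    public class HashedEncodingSketch {
      public static void main(String[] args) {
        Vector features = new RandomAccessSparseVector(1000); // fixed size

        StaticWordValueEncoder words = new StaticWordValueEncoder("words");
        ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");

        for (String w : "propose a confidential business deal".split(" ")) {
          words.addToVector(w, features);    // hashes w into the vector
        }
        intercept.addToVector("1", features); // bias term

        System.out.println(features);
      }
    }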
Learning Rate Annealing
[Chart: learning rate vs. # training examples seen, decaying as training proceeds]
Per-term Annealing
[Chart: learning rate vs. # training examples seen, with a common feature's rate decaying quickly while a rare feature's rate stays high]
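In formulas, the two charts contrast schedules of roughly this shape (illustrative only; eta_0, tau, and alpha are free parameters here, and Mahout's actual schedule differs in detail):

    \eta(t) = \frac{\eta_0}{(1 + t/\tau)^{\alpha}}
    \qquad \text{vs.} \qquad
    \eta_k(t) = \frac{\eta_0}{(1 + n_k(t)/\tau)^{\alpha}}

where t counts all training examples seen, and n_k(t) counts only those containing feature k, so rare features keep learning long after common ones have cooled off.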
General Structure
• OnlineLogisticRegression
– Traditional logistic regression
– Stochastic Gradient Descent
– Per-term annealing
– Too fast (for the disk + encoder)
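A minimal training loop against OnlineLogisticRegression, as a sketch (the constructor, chained setters, and L1 prior are from Mahout's SGD package; the tiny inline data set is invented):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // Sketch: train a 2-class logistic regression one example at a
    // time. Each call to train() is O(1) in time and memory.
    public class OlrSketch {
      public static void main(String[] args) {
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 3, new L1())
                .learningRate(1)
                .lambda(1.0e-4);

        double[][] x = {{1, 0, 1}, {1, 1, 0}, {1, 0, 0}, {1, 1, 1}};
        int[] y = {1, 0, 0, 1};              // toy labels: x[2] drives y

        for (int pass = 0; pass < 100; pass++) {
          for (int i = 0; i < x.length; i++) {
            learner.train(y[i], new DenseVector(x[i]));
          }
        }
        Vector probe = new DenseVector(new double[] {1, 0, 1});
        System.out.println(learner.classifyScalar(probe)); // P(class 1)
      }
    }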
Next Level
• CrossFoldLearner
– contains multiple primitive learners
– online cross validation
– 5x more work
And again
• AdaptiveLogisticRegression
– 20 x CrossFoldLearner
– evolves good learning and regularization rates
– 100 x more work than basic learner
– still faster than disk + encoding
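As a sketch, usage looks about like this (the data is synthetic, and the getBest() chain follows Mahout's API but may vary by version):

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.DenseVector;

    // Sketch: AdaptiveLogisticRegression runs a pool of
    // CrossFoldLearners and evolves learning and regularization
    // rates while it trains. This data is trivially separable.
    public class AlrSketch {
      public static void main(String[] args) {
        AdaptiveLogisticRegression learner =
            new AdaptiveLogisticRegression(2, 2, new L1());
        Random rand = new Random(42);
        for (int i = 0; i < 10000; i++) {
          double x = rand.nextGaussian();
          int label = x > 0 ? 1 : 0;
          learner.train(label, new DenseVector(new double[] {1, x}));
        }
        learner.close();

        CrossFoldLearner best = learner.getBest().getPayload().getLearner();
        System.out.println(best.classifyScalar(new DenseVector(new double[] {1, 2})));
      }
    }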
A comparison
• Traditional view
– 400 x (read + OLR)
• Revised Mahout view
– 1 x (read + mu x 100 x OLR) x eta
– mu = efficiency from killing losers early
– eta = efficiency from stopping early
Click modeling architecture
[Diagram: input and side-data flow through feature extraction and down-sampling (map-reduce) and a data join (map-reduce) into sequential SGD learning; one path is labeled "now via NFS"]
Click modeling architecture
[Diagram: the same architecture, but the data join fans out to several sequential SGD learners running side by side; map-reduce cooperates with NFS]
Deployment
• Training
– ModelSerializer.writeBinary(..., model)
• Deployment
– m = ModelSerializer.readBinary(...)
– r = m.classifyScalar(featureVector)
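Fleshed out slightly (the file path is a placeholder, and ModelSerializer's exact signatures vary across Mahout versions, so treat this as illustrative):

    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.mahout.classifier.sgd.ModelSerializer;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Sketch: load a trained model once at startup, then score
    // incoming feature vectors with classifyScalar().
    public class ServeModel {
      public static void main(String[] args) throws IOException {
        OnlineLogisticRegression m = ModelSerializer.readBinary(
            new FileInputStream("model.bin"),  // placeholder path
            OnlineLogisticRegression.class);

        Vector features = new RandomAccessSparseVector(1000); // must match training dimension
        // ... encode one incoming example into `features` ...
        double r = m.classifyScalar(features);
        System.out.println("score = " + r);
      }
    }

The model is read once; each classifyScalar() call after that is just a dot product and a logistic, which is what makes a simple server farm enough.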
The Upshot
• One machine can go fast
– SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
– simple sample server farm