scal(a)ingup*machine*learning and*biggraph*analycs · 2012-04-19 · develop a platform for machine...
TRANSCRIPT
![Page 1: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/1.jpg)
Scal(a)ing up Machine Learningand Big Graph Analy7cs
Tyson Condie, Markus Weimer and Raghu RamakrishnanYahoo! Research
Vinayak Borkar, Yingyi Bu and Mike CareyU.C. Irvine
Joshua Rosen and Neoklis Polyzo7sU.C. Santa Cruz
![Page 2: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/2.jpg)
Develop a platform for Machine Learningand Graph analytics on a cluster of
shared-nothing machines
Programming environment› Specific to ML and Graph domain
Data-parallel runtime› Supports ML and Graph analytic workflows
![Page 3: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/3.jpg)
Feature Extraction› ETL workflow
Modeling› Iterative algorithm that fits a model to the data
Evaluation› Checks model fidelity
Characteristic Workflow for ML Analytics
11/1/113
FeatureExtraction Modeling Evaluation
![Page 4: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/4.jpg)
What is a Model?
Global Model› Aggregate summary of all data points
› Small in size compared to data set
› e.g., regression, classification
Local Model› Interdependent parameters for each data point
› Size proportional to the data set
› e.g., topic (graphical) models, clustering, PageRank
11/1/114
![Page 5: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/5.jpg)
Outline
Case Study Modeling in the Cloud Our work Project roadmap
11/1/115
![Page 6: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/6.jpg)
Outbound Spam Filter for Yahoo! Mail
11/1/116
![Page 7: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/7.jpg)
Ingredient: Client Classification
Goal:
› Train a spam classifier that can classify outbound messages
Problem:
› Other mail services won’t tell us if our users send spam
Solution:
› Take the mail from Yahoo! to Yahoo! and observe how it wasclassified by our users
11/1/117
![Page 8: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/8.jpg)
EMail Flow
11/1/118
YMailSMTPServer
Mail DB
From: …@yahoo.comTo: …@yahoo.com
YMailWeb
ServerVote
![Page 9: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/9.jpg)
EMail Flow: Data Sets
11/1/119
YMailSMTPServer
Mail DBYMailWeb
Server
OUTBOUNDLOG
INBOUNDLOG
WEB_LOG
USER_DB NET_DB
Vote
![Page 10: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/10.jpg)
Step I Extract Features
OUTBOUND_LOG› [OUT_ID, FROM, IP]
USER_DB› [USER_ID, …]
NET_DB› [IP, …]
11/1/1110
[OUT_ID,USER_FEATURENET_FEATURES]
JOIN
![Page 11: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/11.jpg)
Step II: Extract Labels
OUTBOUND_LOG› [OUT_ID,FROM,TO,TIME]
INBOUND_LOG› [IN_ID,FROM,TO,TIME]
WEB_LOG› [IN_ID,VOTE]
11/1/1111
[OUT_ID,VOTE]
JOIN
![Page 12: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/12.jpg)
Step II: Finding a Label
OUTBOUND_LOG› [OUT_ID,FROM,TO,TIME]
INBOUND_LOG› [IN_ID,FROM,TO,TIME]
WEB_LOG› [IN_ID,VOTE]
11/1/1112
[OUT_ID,VOTE]This needs afuzzy join
JOIN
![Page 13: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/13.jpg)
Step III: Creating the training data
JOIN labels and features Subsample the training data Copy to a single machine
11/1/1113
![Page 14: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/14.jpg)
Step IV: Training the model
Until a satisfactory model is found:› Train the model using sequential code› Evaluate the found model in Pig again› Apply insights to feature extraction, label
generation and learning algorithm
11/1/1114
![Page 15: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/15.jpg)
Recap
11/1/1115
![Page 16: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/16.jpg)
Take-Away
Usability is bad (took an intern 3 months)› Many tools and technologies needed› Fractured workflow, captured in e.g. Oozie
Subpar solution› Subsampling hurts classifier fidelity› Copying data to a single machine is slow
11/1/1116
![Page 17: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/17.jpg)
Now: Parallelize the middle
11/1/1117
![Page 18: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/18.jpg)
What do people do?
Gradient Boosted Decision Trees
› Popular non-linear modeling algorithm
› Used in Yahoo! search engine to learn the ranking function
Gradient Descent
› A method for minimizing objective functions
› Written as a sum of differentiable functions
Latent Dirichlet Allocation (LDA)
› Example of a topic model
› Usually expressed via Graphical Model
11/1/1118
![Page 19: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/19.jpg)
What do people do?
Gradient Boosted Decision Trees
› Popular non-linear modeling algorithm
› Used in Yahoo! search engine to learn the ranking function
› Solution: Fake mappers that implement MPI
Gradient Descent
› A method for minimizing objective functions
› Written as a sum of differentiable functions
› Solution: Fake mappers that run aggregation trees
Latent Dirichlet Allocation (LDA)
› Example of a topic model
› Usually expressed via Graphical Model
› Solution: Fake mappers that implement this graphical model
11/1/1119
![Page 20: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/20.jpg)
Framework I (Global Models): Spark
Data Model: Resilient Distributed Datasets› Support for hash and range partitioning
› Transformations or Actions can be added to them
› The results can be materialized in memory(RDD.cache()) or to disk (RDD.save(…))
Caching allows fast iteration
11/1/1120
points.cache()var w = Vector.random(D)for (i <- 1 to ITERATIONS) { val gradient = points.map { p => f(p, w) } .reduce ((a, b) => a + b) w -= gradient}
Worker
Worker
Worker
Drivertasks
results
$ points 1
$ points 2
$ points 3
model
![Page 21: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/21.jpg)
Framework II (Local Models): Pregel Graph-oriented API
› UDF that implements single vertex update
› Input messages from neighboring vertices
› Output: outgoing messages and new vertex state
Bulk synchronous parallel computing› Runtime executes all vertices with messages
› Model updated and cached in memory
› Checkpoint for fault-tolerance
› Continues until everyone votes to halt
11/1/1121
go
halt?
msg
msg
Worker
Worker
Worker
Driver
model
model
model
![Page 22: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/22.jpg)
Framework: Pros/Cons
Pros› Adds the notion of a working set for fast iterations
› Exports a clean API specific to the problem domain
› Abstracts (many) low-level implementation details
Cons› Point solutions that mainly focus on modeling
› Limited optimization support
› No external memory operators
11/1/1122
![Page 23: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/23.jpg)
ScalOps + HyracksML
ScalOps› Scala DSL for the entire analytic pipeline
› Supports Pig Latin
› Looping construct that captures iteration
HyracksML + Algebricks› Data-parallel runtime that runs on a cluster of shared-nothing
machines
› Deductive database extensions to directly support iteration
› Relational algebra and query optimizer that captures the entireworkflow
11/1/1123
Runtime EngineHyracksML
Scalable FileSystem
e.g. HyracksFS, HDFS
Compiler/OptimizerAlgebricks
Programming ModelScalOps
![Page 24: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/24.jpg)
Spam Filter: ScalOps
11/1/1124
object OutboundSpamFilter extends Query {
// Join Targets case class Features(outID: String, features: Vector) case class Label(outID: String, label: Double)
def getLabels(inboundLogs: Queryable[YMail.InboundRecord], outboundLogs: Queryable[YMail.OutboundRecord], webLogs: Queryable[YMail.WebRecord]) ={ val inoutmap = join(inboundLogs by (YMail.keyFor(_)), outboundLogs by (YMail.keyFor(_))) join(inoutmap by (_._1.inID), webLogs by (_.inID)).map(e => Label(e._1._2.outID, e._2.vote)) }
def getFeatures(outboundLogs: Queryable[YMail.OutboundRecord], userDB: Queryable[YMail.User], netDB: Queryable[YMail.IP]) = { case class IPInformation(outID: String, ipInfo: YMail.IP) case class UserInformation(outID: String, userInfo: YMail.User) val users = join(outboundLogs by (_.from), userDB by (_.id)).map(e => UserInformation(e._1.outID, e._2)) val ips = join(outboundLogs by (_.ip), netDB by (_.ip)).map(e => IPInformation(e._1.outID, e._2)) join(outboundLogs by (_.outID), users by (_.outID), ips by (_.outID)).map(e => Features(e._1.outID, YMail.extractFeatures(e._1, e._2.userInfo,e._3.ipInfo))) }
def run { val inboundLogs = load[YMail.InboundRecord]("hdfs://ymail/inbound.dat").filter(YMail.isYMail) val outboundLogs = load[YMail.OutboundRecord]("hdfs://ymail/outbound.dat").filter(YMail.isYMail) val webLogs = load[YMail.WebRecord]("hdfs://ymail/web.dat") val userDB = load[YMail.User]("hdfs://y/users.dat") val netDB = load[YMail.IP]("hdfs://y/network.dat")
// Extract labels val labels = getLabels(inboundLogs, outboundLogs, webLogs)
// Extract Features val features = getFeatures(outboundLogs, userDB, netDB)
// Assemble a dataset val dataset = join(labels by (_.outID), features by (_.outID)).map(e => new Example(e._1.outID, e._1.label, e._2.features))
// Train a model val model = ML.train(dataset)
// Evaluate val score = dataset.map(ML.evaluate(model, _)).reduce(_ + _) / dataset.count
}
}
![Page 25: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/25.jpg)
Spam Filter: Hyracks RQ
11/1/1125
LabelExtractio
n
Modeling
Evaluation
FeatureExtractio
n
![Page 26: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/26.jpg)
Spam Filter: Label Extraction
11/1/1126
val inoutmap = join(in by (YMail.keyFor(_)), out by (YMail.keyFor(_)))val result = join(inoutmap by (_._1.inID), web by (_.inID))return result.map(e => Label(e._1._2.outID, e._2.vote))
![Page 27: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/27.jpg)
Spam Filter: Feature Extraction
11/1/1127
val users = join(out by(_.from),udb by(_.id)).map(UserInfo(_._1.outID, _._2))val ips = join(out by (_.ip), net by (_.ip)).map(e => IPInformation(e._1.outID, e._2))val results = join(users by (_.outID), ips by (_.outID))
return results.map(Features(_._1.outID, YMail.extractFeatures(_._1, _._2.userInfo, _._3.ipInfo)))
![Page 28: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/28.jpg)
11/1/1128
class MyModel(w: VectorType, funcValue: DoubleType) extends Modelval init = new MyModel(VectorType.zeros(1000), Double.MaxValue)val result = loop(init, (mdl: Model) => mdl.funcValue > eps) { mdl => val gradients = data.map(x => computeGradient(x, mdl)) mdl.funcValue = gradients.map(x => x._1).reduce(_+_) mdl.funcValue += lambda * env.w.norm2 mdl.w -= gradients.map(x => x._2).reduce(_+_) / gradients.count mdl.w *= (1.0 – 0.5 * lambda) mdl}return result
Spam Filter: Modeling
![Page 29: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/29.jpg)
Spam Filter: Evaluation
11/1/1129
dataset.map(ML.evaluate(model, _)).reduce(_ + _) / dataset.count
![Page 30: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/30.jpg)
Conclusion: Target Algorithms
Linear Models› Batch Gradient Descent
› Stochastic Gradient Descent
› Logistic Regression
Clustering› K-Means
Matrix Factorization Graph Analysis› PageRank
› LDA (Graphical Models)
11/1/1130
![Page 31: Scal(a)ingup*Machine*Learning and*BigGraph*Analycs · 2012-04-19 · Develop a platform for Machine Learning and Graph analytics on a cluster of shared-nothing machines Programming](https://reader033.vdocuments.mx/reader033/viewer/2022042302/5ecd617beaac6c5f67389d0e/html5/thumbnails/31.jpg)
Coming Attractions
Recursive data-parallel runtime (HyracksML)
› Deductive Databases, BOOM, Flying Fixed-Point
› Yingyi Bu and Joshua Rosen
Extended Relational Algebra (Algebricks)
› Captures the entire workflow in an optimizer friendly representation
› Vinayak Borkar
High-level language (ScalOps)
› Abstractions for Machine learning and Graph-based algorithms
› Markus Weimer
Executive Producers
› Mike Carey, Tyson Condie, Neoklis Polyzotis and Raghu Ramakrishnan
11/1/1131