January 11, 2012
Scaling the Mobile Millennium System in the Cloud
Timothy Hunter, Teodor Moldovan, Matei Zaharia, Samy Merzgui, Justin Ma, Michael J. Franklin, Pieter Abbeel, Alexandre M. Bayen
UC Berkeley
January 11, 2012 MM on the cloud - AMPLab retreat winter 2012 2/26
Machine learning at scale
● One goal of the AMPLab software: enable ML researchers to work at scale
● Issues with more traditional frameworks:
  ● Algorithms are often iterative in nature
  ● This is not taken into account in traditional Map-Reduce
  ● A number of alternatives: Pregel, Twister, HaLoop, Spark
● We report lessons from applying one of these frameworks (Spark) to a real-world application (car traffic estimation)
● We identified 3 challenges not widely studied before:
  ● Framework-level memory management
  ● Sharing large parameters
  ● Access to storage system
Plan
● Need for traffic estimation
● Overview of Mobile Millennium
● Presentation of the algorithm
● Programming with the Spark framework
● Challenges, current solutions
Need for good traffic estimation
● Traffic congestion affects everyone
● Up-to-date estimation is critical
● Well-studied in the case of highways
● More complex for urban streets (arterial roads)
● Most promising source of data: cellphone GPS
Real-time processing of fleet data
● Input: sampled position of taxicabs
● Observed every minute
● Covers the whole SF Bay
● 0.5 million points / day (60M / day total)
● 0.1 Million road links
Estimating the travel times
Filtering of fleet data
● Trajectories need to be recovered
● Done using a Conditional Random Field
● Output: segments of most likely trajectories between GPS points
Mobile Millennium
● A cyberphysical system for participatory sensing
Today's talk: batch ML jobs outsourced to a cluster
Estimation of arterial traffic
● Input:
  ● Description of road network
  ● Observations: start time, end time, route followed
● Output: probability distributions of travel time
  ● For each link
  ● At different time intervals
  ● Parameter vector θ (for example: mean and variance of link travel time)
© Google, Inc.
Estimation of traffic: Graphical Model
[Figure: graphical model with a layer of links and a layer of observations. Legend: hidden random variables, observed random variables.]
● Link states (50k multidimensional variables)
● Partial travel time distributions for each link (about 400k-200M variables)
● Travel time observations: pairs of travel time and path (about 100k-50M observations)
System workflow
[Diagram: database, worker nodes, master node]
Start link parameters (on master node)
Observations (distributed, persisted across nodes)
Network parameters (distributed over the nodes)
Travel time samples for each observation link
Travel time samples aggregated on a link basis
New parameters are generated: they maximize the likelihood of the sampled travel times for each link.
The master collects the vector of new parameters.
Using the Spark programming model
● Spark: open-source cluster computing system
  ● Can persist datasets in memory across the cluster
  ● In a fault-tolerant manner
  ● Written in Scala (emphasizes functional programming)
Main loop of the program:

    val observations = spark.textFile("hdfs:...")
      .map(parseObservation _)
      .cache()
    var params = // Initialize parameters
    while (!converged) {
      // Step 1 (E step): generate travel time samples from each observation
      val samples = observations.map(
        obs => generateSamples(obs, params))
      // Step 2 (M step): pick the most likely parameters for each link
      params = samples.groupByKey().map {
        case (linkId, vals) => mostLikelyParam(linkId, vals)
      }.collect()
    }
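The shape of this loop can be tried without a cluster by replacing the distributed dataset with a local Scala collection. The sketch below is only an illustration under stated assumptions: `Observation`, `generateSamples`, and `mostLikelyParam` are hypothetical stand-ins (the slides do not show their bodies), the travel time is split deterministically in proportion to the current link means rather than sampled, and each link parameter is a single mean.

```scala
object LocalEmLoop {
  // Hypothetical observation: total travel time (seconds) over a path of link ids.
  final case class Observation(path: Seq[Int], travelTime: Double)

  // E step stand-in: split the observed time across the links of the path
  // in proportion to the current per-link mean. Returns (linkId, time) pairs.
  def generateSamples(obs: Observation, params: Map[Int, Double]): Seq[(Int, Double)] = {
    val total = obs.path.map(params).sum
    obs.path.map(id => id -> obs.travelTime * params(id) / total)
  }

  // M step stand-in: the new mean of a link is the average of its samples.
  def mostLikelyParam(linkId: Int, vals: Seq[Double]): (Int, Double) =
    linkId -> vals.sum / vals.size

  // Same structure as the slide: generate samples, group by link, update.
  def run(observations: Seq[Observation], init: Map[Int, Double], iterations: Int): Map[Int, Double] = {
    var params = init
    for (_ <- 1 to iterations) {
      val samples = observations.flatMap(obs => generateSamples(obs, params))
      val updated = samples.groupBy(_._1).map {
        case (linkId, pairs) => mostLikelyParam(linkId, pairs.map(_._2))
      }
      params = params ++ updated
    }
    params
  }
}
```

Because this stand-in returns several pairs per observation, the sketch flattens them with `flatMap` before grouping by link; a fixed iteration count replaces the convergence test.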
Challenges
Efficient utilization of memory
● The observation data is stored in memory:
  ● Be careful with the memory footprint
  ● Diagnose when the cluster runs out of memory
● We cache pointer-based structures
  ● Significant overhead in the JVM
● Solution: keep serialized data in memory
● Lesson: need for more tools for understanding memory bottlenecks
● Lesson: provide more memory-efficient primitives
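The idea of keeping serialized data in memory can be sketched in plain Scala. This illustrates the trade-off only and is not Spark's caching code; `Observation` is a hypothetical record type. The records are packed into one byte array and decoded again on each pass, which costs CPU time but avoids holding one heap object per record.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

object SerializedCache {
  // Hypothetical record type: a link id and a travel time in seconds.
  final case class Observation(linkId: Int, travelTime: Double)

  // Pack all records into a single byte array (4 + 8 = 12 bytes per record)
  // instead of keeping one heap object per record.
  def pack(obs: Seq[Observation]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    obs.foreach { o => out.writeInt(o.linkId); out.writeDouble(o.travelTime) }
    out.flush()
    bytes.toByteArray
  }

  // Decode the records one at a time, on each pass over the data.
  def unpack(data: Array[Byte]): Iterator[Observation] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    Iterator.fill(data.length / 12)(Observation(in.readInt(), in.readDouble()))
  }
}
```

Each iteration of the learning loop would call `unpack` again, so the decoded objects are short-lived and only the compact byte array stays resident.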
Broadcast of large parameters
● Need to broadcast data to all workers:
  ● At the start of the job (network description)
  ● Between iterations (updated parameters θ)
  ● Common to many ML problems
● Network description larger than 40MB
● Solution: "broadcast variables" before starting computations:
  ● Using Cornet (BitTorrent-like protocol)
With a broadcast variable:

    val network = // load network
    val bv = spark.broadcast(network)
    val observations = spark.textFile("...")
      .map(parseObservation(_, bv.value))

Without broadcast:

    val network = // load network
    val observations = spark.textFile("...")
      .map(parseObservation(_, network))
[Figure: Data loading time. Bar chart of loading time (sec), vertical axis from 0 to 2500, comparing no broadcast with BT broadcast.]
Access to storage system
● Mobile Millennium uses PostgreSQL
  ● Reliable, previous experience, PostGIS extensions
● We moved the DB to the cloud:
  ● Still 75% of the running time spent on data loading
  ● Bursty access pattern
  ● Small dataset overall (1GB)
● Solution: export data to HDFS
  ● Stale snapshot
  ● Distributed, much faster
● Ideal solution: same storage system for on-site and cloud applications
[Figure: Data loading throughput. Average throughput (records / sec), logarithmic axis from 1 to 1,000,000, for on-site DB, cloud DB, and HDFS.]
Conclusion
● We presented a first application of Spark to a real-world ML problem
● This test exposed some strengths and weaknesses:
● Improved memory management
● Distributing common large parameters
● Difficulty of integration with common storage solutions
● Implementation now faster than real time
● Used to learn more sophisticated travel time distributions
● Evaluating the quality of the output
[Figure: runtime versus number of cores]
Thank you
System workflow
Observations (1M-100M), persisted in memory
Travel time samples: 1k per observation
Link distributions (100k)
Link distribution parameters
Estimation of arterial traffic
● We lack travel times on individual links:
  ● One way around it: Expectation Maximization
● Randomly partition the total travel time amongst links
● Weight each partition by likelihood according to model
● Aggregate weighted samples for each link
● For each link, update parameters to maximize likelihood of link samples
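The four steps above can be sketched in plain Scala. This is a minimal illustration under assumptions that the slides do not state: each link's travel time is modelled as an independent Gaussian with parameters (mean, variance), random partitions are drawn by normalizing uniform random numbers, and all names (`LinkParam`, `WeightedSample`, `sample`, `update`) are hypothetical.

```scala
import scala.util.Random

object EmStepSketch {
  final case class LinkParam(mean: Double, variance: Double)
  final case class WeightedSample(linkId: Int, time: Double, weight: Double)

  // Gaussian density, used here as the per-link likelihood (an assumption).
  def density(x: Double, p: LinkParam): Double =
    math.exp(-(x - p.mean) * (x - p.mean) / (2 * p.variance)) /
      math.sqrt(2 * math.Pi * p.variance)

  // For one observation: draw n random partitions of the total travel time
  // among the links of the path, and weight each partition by its likelihood.
  def sample(path: Seq[Int], total: Double, params: Map[Int, LinkParam],
             n: Int, rng: Random): Seq[WeightedSample] =
    Seq.fill(n) {
      val cuts = path.map(_ => rng.nextDouble() + 1e-9)
      val times = cuts.map(_ * total / cuts.sum)
      val w = path.zip(times).map { case (id, t) => density(t, params(id)) }.product
      path.zip(times).map { case (id, t) => WeightedSample(id, t, w) }
    }.flatten

  // For one link: weighted mean and variance of its aggregated samples,
  // which maximize the weighted Gaussian likelihood of those samples.
  def update(samples: Seq[WeightedSample]): LinkParam = {
    val wSum = samples.map(_.weight).sum
    val mean = samples.map(s => s.weight * s.time).sum / wSum
    val variance = samples.map(s => s.weight * (s.time - mean) * (s.time - mean)).sum / wSum
    LinkParam(mean, math.max(variance, 1e-6))
  }
}
```

Aggregation per link is then a `groupBy(_.linkId)` over the samples of all observations, followed by `update` on each group. The sketch does not guard against all weights underflowing to zero, which a real implementation would need to handle.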