a berkeley view of big data
DESCRIPTION
A Berkeley View of Big Data. Ion Stoica UC Berkeley BEARS February 17, 2011. Big Data is Massive…. Facebook : 130TB/day: user logs 200-400TB/day: 83 million pictures Google: > 25 PB/day processed data Data generated by LHC: 1 PB/sec - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/1.jpg)
A Berkeley View of Big Data
Ion StoicaUC Berkeley
BEARSFebruary 17, 2011
![Page 2: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/2.jpg)
Big Data is Massive…• Facebook:
– 130TB/day: user logs– 200-400TB/day: 83 million pictures
• Google: > 25 PB/day processed data
• Data generated by LHC: 1 PB/sec
• Total data created in 2010: 1.ZettaByte (1,000,000 PB)/year– ~60% increase every year
2
![Page 3: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/3.jpg)
…and Grows Bigger and Bigger!• More and more devices
• More and more people
• Cheaper and cheaper storage– ~50% increase in GB/$ every year
3
![Page 4: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/4.jpg)
…and Grows Bigger and Bigger!• Log everything!
– Don’t always know what question you’ll need to answer
• Hard to decide what to delete
– Thankless decision: people know only when you are wrong! – “Climate Research Unit (CRU) scientists admit they threw
away key data used in global warming calculations”
• Stored data grows faster than GB/$4
![Page 5: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/5.jpg)
What is Big Data?
• You don’t need to be big to have big data problem!– Inadequate tools to analyze data– Data management may dominate infrastructure cost
5
Data that is expensive to manage, and hard to extract value from
![Page 6: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/6.jpg)
Big Data is not Cheap!• Storing and managing 1PB
data: $500K-$1M/ year– Facebook: 200 PB/year
6
• “Typical” cloud-based service startup (e.g., Conviva)– Log storage dominates
infrastructure cost2007 2008 2009 2010
0%10%20%30%40%50%60%70%80%90%
100%
Storage cluster Other
Infra
stru
ctur
e co
st
~1PB storage capacity
![Page 7: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/7.jpg)
Hard to Extract Value from Data!• Data is
– Diverse, variety of sources– Uncurated, no schema, inconsistent semantics, syntax– Integration a huge challenge
• No easy way to get answers that are– High-quality– Timely
• Challenge: maximize value from data by getting best possible answers7
![Page 8: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/8.jpg)
Requires Multifaceted Approach• Three dimensions to improve data
analysis– Improving scale, efficiency, and quality of
algorithms (Algorithms)– Scaling up datacenters (Machines)– Leverage human activity and intelligence
(People)
• Need to adaptively and flexibly combine all three dimensions
8
![Page 9: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/9.jpg)
Algorithms, Machines, People• Today’s apps: fixed point in solution space
9
Algorithms
Machines
People
Need techniques to dynamically pick best operating point
search
Watson/IBM
![Page 10: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/10.jpg)
The AMP Lab
10
search
Watson/IBM
Machines
People
Algorithms
Make sense of data at scale by tightly integrating algorithms, machines, and people
![Page 11: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/11.jpg)
AMP Faculty and Sponsors• Faculty
– Alex Bayen (mobile sensing platforms)– Armando Fox (systems)– Michael Franklin (databases): Director– Michael Jordan (machine learning): Co-director– Anthony Joseph (security & privacy)– Randy Katz (systems)– David Patterson (systems)– Ion Stoica (systems): Co-director– Scott Shenker (networking)
• Sponsors:
11
![Page 12: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/12.jpg)
Algorithms• State-of-art Machine Learning (ML)
algorithms do not scale– Prohibitive to process all data points
12
How do you know when to stop?
true answer
Est
imat
e
# of data points
![Page 13: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/13.jpg)
Algorithms• Given any problem, data and a budget
– Immediate results with continuous improvement– Calibrate answer: provide error bars
13
Error bars on every answer!
Est
imat
e
# of data points
true answer
![Page 14: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/14.jpg)
Algorithms• Given any problem, data and a time budget
– Immediate results with continuous improvement– Calibrate answer: provide error bars
14
Stop when error smaller than a given threshold
Est
imat
e
# of data pointstime
true answer
![Page 15: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/15.jpg)
Algorithms• Given any problem, data and a time budget
– Automatically pick the best algorithm
15
Est
imat
e
time
pick sophisticated pick simple
error too high
true answersophisticated
simple
![Page 16: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/16.jpg)
Machines• “The datacenter as a computer” still in its
infancy– Special purpose clusters, e.g., Hadoop cluster– Highly variable performance– Hard to program– Hard to debug
16
=?
![Page 17: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/17.jpg)
Machines• Make datacenter a real computer!
17
Node OS(e.g. Linux)
Node OS(e.g. Windows)
Node OS(e.g. Linux)
…
Datacenter “OS” (e.g., Mesos)
• Share datacenter between multiple cluster computing apps
• Provide new abstractions and servicesAMP stack
Existingstack
![Page 18: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/18.jpg)
Machines• Make datacenter a real computer!
18
Node OS(e.g. Linux)
Node OS(e.g. Windows)
Node OS(e.g. Linux)
…
Datacenter “OS” (e.g., Mesos)
Hado
op
MPI
Hype
rtbal
e
…Ca
ssan
draHive
Support existing cluster computing apps
AMP stack
Existingstack
![Page 19: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/19.jpg)
Machines• Make datacenter a real computer!
19
Node OS(e.g. Linux)
Node OS(e.g. Windows)
Node OS(e.g. Linux)
…
Spar
kSCADS
…
Datacenter “OS” (e.g., Mesos)
Hado
op
MPI
Hype
rtbal
e
…Ca
ssan
draHive PIQL
Support interactive and iterative data analysis (e.g., ML algorithms)
Consistency adjustable data store
Predictive & insightful query language
AMP stack
Existingstack
![Page 20: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/20.jpg)
Machines• Make datacenter a real computer!
20
Node OS(e.g. Linux)
Node OS(e.g. Windows)
Node OS(e.g. Linux)
…
Spar
kSCADS
…
Datacenter “OS” (e.g., Mesos)
Applications, tools
Hado
op
MPI
Hype
rtbal
e
…Ca
ssan
draHive PIQL• Advanced ML algorithms
• Interactive data mining• Collaborative
visualizationAMP stack
Existingstack
![Page 21: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/21.jpg)
People• Humans can make sense of messy data!
21
![Page 22: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/22.jpg)
People• Make people an integrated part of
the system!– Leverage human activity– Leverage human intelligence
(crowdsourcing):• Curate and clean dirty data• Answer imprecise questions• Test and improve algorithms
• Challenge– Inconsistent answer quality in all
dimensions (e.g., type of question, time, cost) 22
Machines + Algorithms
data
, ac
tivity
Que
stio
ns Answ
ers
![Page 23: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/23.jpg)
Real Applications• Mobile Millennium Project
– Alex Bayen, Civil and Environment Engineering, UC Berkeley
• Microsimulation of urban development– Paul Waddell, College of
Environment Design, UC Berkeley• Crowd based opinion formation
– Ken Goldberg, Industrial Engineering and Operations Research, UC Berkeley
• Personalized Sequencing– Taylor Sittler, UCSF
23
![Page 24: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/24.jpg)
Personalized Sequencing
24
![Page 25: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/25.jpg)
Sequencing
Microsimulation Mobile Millennium
The AMP Lab
25
Machines
People
Algorithms
Make sense of data at scale by tightly integrating algorithms, machines, and people
![Page 26: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/26.jpg)
Big Data in 2020
Almost Certainly:• Create a new
generation of big data scientist
• A real datacenter OS • ML becoming an
engineering discipline• People deeply
integrated in big data analysis pipeline
If We’re Lucky:• System will know
what to throw away• Generate new
knowledge that an individual person cannot
![Page 27: A Berkeley View of Big Data](https://reader036.vdocuments.mx/reader036/viewer/2022062305/56816935550346895de08dfe/html5/thumbnails/27.jpg)
Summary• Goal: Tame Big Data Problem
– Get results with right quality at the right time • Approach: Holistically integrate
Algorithms, Machines, and People • Huge research issues across many
domains
2727