Why Exploring Big Data Is Hard - Danyel Fisher
![Page 1: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/1.jpg)
WHY EXPLORING BIG DATA IS HARD
(& WHAT WE CAN DO ABOUT IT)
DANYEL FISHER, MICROSOFT RESEARCH
![Page 2: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/2.jpg)
![Page 3: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/3.jpg)
/tiles/r02123002133111.png
![Page 4: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/4.jpg)
![Page 5: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/5.jpg)
![Page 6: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/6.jpg)
One of the most popular spots in the world.
Based on a table with a few billion rows
![Page 7: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/7.jpg)
![Page 8: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/8.jpg)
Can you distinguish American users from international?
![Page 9: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/9.jpg)
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
By hand!
SLOW!
NETWORK!
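The stages listed on this slide can be sketched for a single histogram as follows. This is a minimal illustration, not the speaker's implementation: the function name, its parameters, and the dict-based row format are all hypothetical, and the final "render to screen" step is left to whatever drawing layer is in use.

```python
from collections import Counter


def histogram_pipeline(rows, dimension, keep, lo, hi, n_buckets, bar_max_px=100):
    """Walk the slide's stages for one histogram (all names are hypothetical)."""
    # Relevant dimensions + filter data: pull one field from the rows that pass.
    values = [row[dimension] for row in rows if keep(row)]
    # Choose bucket bounds: n_buckets equal-width buckets over [lo, hi].
    width = (hi - lo) / n_buckets
    # Aggregate data: count values per bucket; v == hi is clamped into the last one.
    counts = Counter(
        min(int((v - lo) / width), n_buckets - 1) for v in values if lo <= v <= hi
    )
    # Create shapes: one bar per bucket.
    bars = [
        {"x0": lo + i * width, "x1": lo + (i + 1) * width, "count": counts.get(i, 0)}
        for i in range(n_buckets)
    ]
    # Assign scales to shapes: map counts onto pixel heights.
    tallest = max((b["count"] for b in bars), default=0) or 1
    for b in bars:
        b["height_px"] = round(bar_max_px * b["count"] / tallest)
    return bars  # rendering is left to the drawing layer
```

Every stage before the return touches all of the data, which is why doing the whole chain by hand, on each interaction, is slow.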
![Page 10: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/10.jpg)
Defining “Big” Volume
“…200,000 magnetic tape reels
which represent over 900 billion
characters of data”
1975
![Page 11: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/11.jpg)
“the size of the dataset is part of
the problem”
![Page 12: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/12.jpg)
Why is Big Data different?
REPRESENTATION
What visualizations are suitable for big data?
INTERACTION
What do we need to do to make that visualization useful for interaction?
![Page 13: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/13.jpg)
And it’s costly!
Big data has the potential to cost unlimited amounts of money
A query on 100 cores for an hour costs 100 core-hours … and an analyst-hour.
Massive savings for doing less, or early termination
![Page 14: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/14.jpg)
A Note on Infrastructure
![Page 15: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/15.jpg)
You Won’t Plot Every Point…
Screen space to draw each data point [10^6 points]
Every data point in memory [10^9 bytes]
Store all the data points [10^12 bytes]
![Page 16: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/16.jpg)
… Even If You Tried
x
y
Scatterplot (at least one pixel per point)
Network Diagram
Parallel Coordinates
(individual lines)
![Page 17: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/17.jpg)
Aggregation
What is the aggregation equivalent of a bar graph?
What is an aggregated line chart, or a scatterplot?
N. Elmqvist and J.-D. Fekete. Hierarchical aggregation for information visualization: Overview, techniques, and design guidelines. IEEE Transactions on Visualization and Computer Graphics, 16(3):439–454, May 2010.
![Page 18: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/18.jpg)
Some things aggregate well
![Page 19: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/19.jpg)
[Figure: two charts of the same series. Left: "Daily values". Right: "Monthly aggregate: min and max". x-axis: yearly ticks from 3/13/1986 to 3/13/2014; y-axis: 0 to 200.]
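The daily-to-monthly aggregation shown on this slide can be sketched as a single streaming pass that keeps only a (min, max) pair per month. The function name and the `(date, value)` input format are assumptions made for illustration.

```python
from datetime import date


def monthly_min_max(daily):
    """Collapse (date, value) pairs into {(year, month): (min, max)} in one pass."""
    envelope = {}
    for day, value in daily:
        key = (day.year, day.month)
        lo, hi = envelope.get(key, (value, value))
        envelope[key] = (min(lo, value), max(hi, value))
    return envelope


series = [(date(1986, 3, 13), 50), (date(1986, 3, 14), 70), (date(1986, 4, 1), 20)]
print(monthly_min_max(series))
```

Min and max aggregate well because combining two partial results needs nothing but the partial results themselves.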
![Page 20: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/20.jpg)
Multiple dimensions
Liu, Jiang, Heer: imMens (2013)
![Page 21: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/21.jpg)
Wattenberg: PivotGraph (2005)
![Page 22: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/22.jpg)
![Page 23: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/23.jpg)
Treemaps (mostly)
![Page 24: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/24.jpg)
“Generalized Histograms”
Select buckets on data
then
Examine points, placing them into buckets
then
Create shapes based on buckets
Hadley Wickham: "Bin, Summarize, Smooth: A Framework for Visualizing Large Data"
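The three steps above (select buckets, place points into them, create shapes) can be sketched for a two-dimensional case, that is, a binned scatterplot. This is an illustrative sketch only; the function names, the grid parameters, and the choice of opacity as the visual encoding are assumptions, not taken from the talk or from Wickham's paper.

```python
from collections import Counter


def bin_2d(points, x_range, y_range, nx, ny):
    """Count (x, y) points per cell of an nx-by-ny grid; returns {(ix, iy): count}."""
    (x0, x1), (y0, y1) = x_range, y_range
    counts = Counter()
    for x, y in points:
        if not (x0 <= x <= x1 and y0 <= y <= y1):
            continue  # point falls outside the selected buckets
        ix = min(int((x - x0) / (x1 - x0) * nx), nx - 1)
        iy = min(int((y - y0) / (y1 - y0) * ny), ny - 1)
        counts[(ix, iy)] += 1
    return counts


def cells_to_shapes(counts):
    """Turn each non-empty cell into a rectangle whose opacity encodes its count."""
    peak = max(counts.values(), default=1)
    return [{"cell": cell, "opacity": n / peak} for cell, n in sorted(counts.items())]
```

The number of shapes is bounded by the grid size rather than by the number of points, which is what makes the approach workable at large scale.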
![Page 25: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/25.jpg)
Big Data Exploration
EXPLORATION
Learn about the dataset
Explore multiple hypotheses
Manipulate data freely
May be discarded after completion
Rapid iteration
Examples: Some of Tableau, PowerView, GGPLOT, etc
PRESENTATION
Communicate a specific view
Constrain interaction
Visual style important
Examples: visual dashboards, data storytelling
![Page 26: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/26.jpg)
![Page 27: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/27.jpg)
The Story of Walt
the hypothetical histogram
![Page 28: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/28.jpg)
The Story of Walt
ASSUMPTION
The dataset is too big to fit into memory
ASSUMPTION
Every query takes a full minute
![Page 29: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/29.jpg)
Creating Walt
(Min, Max)
Bucket all points
Total time: 2 minutes
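Under the assumptions stated earlier (data too big for memory, one minute per full scan), the two minutes correspond to two passes: one to find the range, one to bucket. A minimal sketch, assuming a hypothetical `scan()` callable that returns a fresh iterator over the data on each call:

```python
def create_histogram(scan, n_buckets):
    """Build a histogram in two full passes over the data.

    `scan` is a hypothetical callable that returns a fresh iterator each call.
    """
    # Pass 1: find (min, max).
    lo = hi = None
    for v in scan():
        lo = v if lo is None else min(lo, v)
        hi = v if hi is None else max(hi, v)
    if lo is None:
        return None, None, []
    # Pass 2: bucket all points into equal-width buckets.
    width = (hi - lo) / n_buckets or 1.0  # avoid zero width when all values match
    counts = [0] * n_buckets
    for v in scan():
        counts[min(int((v - lo) / width), n_buckets - 1)] += 1
    return lo, hi, counts
```

Nothing is kept between the passes except the two bounds, so memory use is independent of the dataset size.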
![Page 30: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/30.jpg)
Interact With Walt
CHANGE BUCKET COUNT
One pass.
Re-bucket every point
Or maybe we were clever…
CROSS-FILTER WALT WITH ANOTHER HISTOGRAM
One pass.
Check filter on every point
Or maybe we were clever…
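One possible form of "being clever" for the bucket-count case, offered here as an assumption rather than as what the talk had in mind: bucket once at a fine resolution, then answer later bucket-count changes by merging fine buckets, with no further pass over the data.

```python
def rebucket(fine_counts, n_buckets):
    """Merge fine-grained bucket counts into n_buckets coarser ones.

    Exact only when n_buckets divides len(fine_counts) evenly, so other
    values are rejected rather than approximated.
    """
    if len(fine_counts) % n_buckets:
        raise ValueError("n_buckets must divide the number of fine buckets")
    step = len(fine_counts) // n_buckets
    return [sum(fine_counts[i:i + step]) for i in range(0, len(fine_counts), step)]


fine = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]  # counts from one expensive pass
print(rebucket(fine, 3))
print(rebucket(fine, 4))
```

The trade-off is that only bucket counts dividing the fine resolution are available instantly; anything else still needs a new pass.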
![Page 31: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/31.jpg)
How clever do we have to be?
Which operations are worth pre-caching?
◦ Change number of buckets, or their size
◦ Zoom in on a single bar
◦ Filter out some data
◦ Cross-filter into other visualizations
◦ Cross-filter from other visualizations
◦ Show sample rows from the histogram
OLAP!
![Page 32: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/32.jpg)
The Moral of Walt’s Story
Decide what operations we will support rapidly … and which we’ll tolerate being slow
![Page 33: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/33.jpg)
Solution Space
◦ Work Offline
◦ Index
OLAP: Pentaho
imMens, Nanocubes
◦ Restrict Data
◦ Sample (or Stream)
◦ Divide & Conquer
◦ Multiple passes across the data in parallel
Limited exploration!
![Page 34: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/34.jpg)
Trade accuracy for latency
Time
100%
Online
Traditional
Image adapted from Hellerstein
![Page 35: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/35.jpg)
Computing Confidence Bounds
$\text{bounds} \sim \dfrac{\text{variance}}{\text{samples}}$
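The slide relates the bounds to the variance and the number of samples seen. One standard concrete form, assumed here rather than taken from the slide, is the normal-approximation interval mean ± z·sqrt(variance / n). The sketch below yields a running estimate with that half-width as a stream is consumed; it assumes the stream arrives in random order, and the function name and parameters are hypothetical.

```python
import math


def progressive_mean(stream, report_every=1000, z=1.96):
    """Yield (n, mean, half_width) while consuming a randomly ordered stream.

    Uses Welford's running variance and a normal-approximation interval
    (an assumption; the talk does not specify the exact estimator).
    """
    n, mean, m2 = 0, 0.0, 0.0
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # running sum of squared deviations
        if n > 1 and n % report_every == 0:
            variance = m2 / (n - 1)
            yield n, mean, z * math.sqrt(variance / n)
```

An analyst can stop consuming the generator as soon as the half-width is small enough for the decision at hand, which is the accuracy-for-latency trade described on the previous slide.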
![Page 36: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/36.jpg)
The Progressive Pitch
“Trust Me, I'm Partially Right: Incremental Visualization
Lets Analysts Explore Large Datasets Faster"
![Page 37: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/37.jpg)
CHI 2012
![Page 38: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/38.jpg)
What We Learned
Users made lots of mistakes
…carried out lots of queries
…and cut them off early
Users were fearless about exploration
Most numbers are rough
Randomness in databases is a pain
![Page 39: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/39.jpg)
![Page 40: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/40.jpg)
Supporting Streams
RESERVOIR SAMPLE
Keep a sample of k elements of the data such that each element has a k/size chance of being in the reservoir
EQUI-DEPTH HISTOGRAM
Good one-pass algorithms exist
… but we have no idea how to visualize them
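The reservoir sample described above can be sketched with the classic one-pass scheme often called Algorithm R. The talk does not name a specific algorithm, so this choice, the function name, and the optional `rng` parameter are assumptions.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep k items so that each of the n items seen is retained with probability k/n."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item  # replace a random slot with probability k/(i+1)
    return reservoir


print(reservoir_sample(range(1_000_000), 5, random.Random(42)))
```

The stream length does not need to be known in advance, and memory use stays at k items throughout.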
![Page 41: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/41.jpg)
Incremental changes the rules
Categorical: add categories on the fly
Numerical: changing bounds
Any color map or scale can change
![Page 42: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/42.jpg)
SAMPLING: You’ll never know it all
TASKS
Find extreme
Compare bars
Bar to constant
Bar to range
Order (top-K)
![Page 43: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/43.jpg)
SAMPLING: Probabilistic Views
“Sample-Oriented Task-Driven Visualizations: Allowing Users to Make Better, More Confident Decisions”
Design Goals
Easy to interpret
Consistency across tasks
Spatial Stability
Minimize Visual Noise (overhead)
![Page 44: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/44.jpg)
“Is Bar A > Bar B”
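With sampled data, the question "Is Bar A > Bar B" has a probabilistic answer. A minimal sketch, assuming each bar's estimate is an independent normal variable described by a mean and a standard error; this modelling choice and the function name are illustrative assumptions, not the method of the cited paper.

```python
import math


def prob_a_greater_than_b(mean_a, se_a, mean_b, se_b):
    """P(A > B) when both bar estimates are treated as independent normals."""
    se_diff = math.hypot(se_a, se_b)  # standard error of the difference A - B
    if se_diff == 0:
        # No uncertainty left: the comparison is exact.
        return 1.0 if mean_a > mean_b else 0.0 if mean_a < mean_b else 0.5
    z = (mean_a - mean_b) / se_diff
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF at z


print(prob_a_greater_than_b(12.0, 1.0, 10.0, 1.0))
```

A view built on this can show the probability directly instead of asking the analyst to compare overlapping error bars by eye.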
![Page 45: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/45.jpg)
Compare to constant
![Page 46: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/46.jpg)
Other Tasks
Find extreme
Compare to Range
![Page 47: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/47.jpg)
A Tentative Framework
![Page 48: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/48.jpg)
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
CACHE or INDEX
NETWORK!
SAMPLE
Place These!
![Page 49: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/49.jpg)
Using the Framework
HOTMAP
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
CACHE
NETWORK!
CACHE
D3
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
NETWORK!
CACHE
![Page 50: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/50.jpg)
Using the Framework
SERVER-SIDE RENDER
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
NETWORK!
D3
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
NETWORK!
![Page 51: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/51.jpg)
Using the Framework
OLAP & PRE-INDEX
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
SAMPLE ACTION
Raw data
Relevant dimensions
Filter data
Choose bucket bounds
Aggregate data
Create shapes
Assign scales to shapes
Render to screen
CACHE
NETWORK!
NETWORK!
SAMPLE
OLAP: Pentaho, Mondrian
imMens, NanoCubes
![Page 52: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/52.jpg)
Cross-Disciplinarity
This isn’t the way SQL, or Hadoop, works today
Infovis needs to be very integrated with the back-end
New skills, new training
Close collaboration across fields
![Page 53: Why Exploring Big Data is Hard - Danyel Fisher](https://reader031.vdocuments.mx/reader031/viewer/2022020302/58a2484f1a28ab7b3c8b75a5/html5/thumbnails/53.jpg)
Let’s Build Cool Stuff!
@fisherdanyel
http://research.microsoft.com/bigdataux