(A bit about) Data Visualization
ICPSR Biennial Meeting
October 2, 2015
Ryan Womack ([email protected])
Data Librarian, Rutgers University
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
Ryan Womack ([email protected]) Data Librarian, Rutgers University(A bit about) Data Visualization 1 / 52
Introduction
What this talk IS:
Discusses standard techniques of data visualization, the day-to-day power tools for understanding data
Reviews various graphical techniques, from early to recent, from simple to advanced
Presents principles of good data presentation, and shows the R implementation of many functions
Introduction
What this talk is NOT:
It is not about “infographics”, the beautiful, heavily customized products of expert graphic designers. [See 1 and 2 for more discussion]
It is not about the cognitive science aspects of data perception [wish I knew more about this!]
It is not about how to use R or other software [although code is provided for those who are interested]
It is not necessarily a balanced survey of all data visualization. In particular, it is light on graph networks, clustering, and trees [not my expertise]
Very little mapping, too [Others are better at this]
Setup
Most of the graphics examples that are not web accessible are run in R.
R is open source software available at http://r-project.org
RStudio is a useful, freely available editor, available at http://rstudio.com
Workshop materials, including R scripts, supplemental images and data, are available for download from http://ryanwomack.com/ICPSR2015
The R script file contains working demonstrations of many of the concepts mentioned here for you to try on your own.
Outline
Why?
Whirlwind tour of historical data viz
Standard visualization vs. some less commonly used examples
3-D and Animation
Interactivity, data exploration
A little bit of big data
Why Data Visualization?
Data visualization can:
provide clear understanding of patterns in data
detect hidden structures in data
condense information
Anscombe’s Quartet
For example, see Anscombe’s quartet (image source: http://commons.wikimedia.org/wiki/File:Anscombe%27s_quartet_3.svg):
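The point of the quartet can also be checked numerically. The following is a minimal Python sketch, not taken from the workshop's R script (R ships the same values as the built-in anscombe data set); the helper name pearson is only illustrative. It computes the mean of x, the mean of y, and the correlation for each of the four data sets, which agree to two decimal places even though the four plots look completely different.

```python
from statistics import mean

# Anscombe's quartet: sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# (mean of x, mean of y, correlation), each rounded to two decimals.
summary = {
    name: (round(mean(xs), 2), round(mean(ys), 2), round(pearson(xs, ys), 2))
    for name, (xs, ys) in quartet.items()
}
for name, stats in summary.items():
    print(name, stats)
```

Only plotting the data reveals the curve in set II and the single influential points in sets III and IV.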
Links to DataViz sites
Some examples of good data visualization (and fancy infographics) can be found at:
Information Aesthetics
Chart Porn
Eagereyes
DataVis.ca
VizWiz
US Census Data Visualization Gallery
Bad Graphs
Pie Charts are known to be problematic
Clutter and other issues can ruin graphics
Novel or nonsensical?
For more bad ideas, try:
Junk Charts
Ten Worst Graphs
WTFviz
Pie Chart Examples
image source: http://peltiertech.com/WordPress/3d-pie-charts/
Pie Chart Examples
image source: http://ndevisual.wordpress.com/tag/uses-of-pie-charts/
Pie Chart Examples
image source: http://www.nbcchicago.com/news/local/FOX-News-Chart-Fails-Math-73711092.html
Pie Chart Examples
image source: http://tips.vovici.com/content/111031 swb
Clutter Example
image source: http://junkcharts.typepad.com/junk_charts/2013/03/which-software-is-responsible-for-this.html
Playfair
Astronomical observations, charts, and maps led in graphical innovation prior to 1800. See also Classic Data Visualizations
William Playfair is the pioneer of the line chart, bar chart, time series plots, and pie chart.
Playfair, W. (1786). Commercial and Political Atlas: Representing, by Copper-Plate Charts, the Progress of the Commerce, Revenues, Expenditure, and Debts of England, during the Whole of the Eighteenth Century.
Playfair, W. (1801). Statistical Breviary.
Both republished in The Commercial and Political Atlas and Statistical Breviary, 2005, Cambridge University Press.
Playfair Examples
Minard
Charles Joseph Minard was the next influential data graphic creator after Playfair.
Minard’s flow map of Napoleon’s Russian campaign is celebrated by Tufte and others as one of the greatest information graphics.
It embodies an ideal of highly compressed informative elements, presented with style.
Six variables: size, location in 2 dimensions, the direction of the army, temperature, date [and group]
However, this is a one-off design that crosses into infographics, although it can be reproduced in R and other software.
Minard Examples
Fisher and Tukey
In the 20th century, statisticians such as Ronald Fisher and John Tukey continued to advance graphical methods for the analysis of data.
Fisher emphasized plotting the data to understand relationships.
Tukey’s Exploratory Data Analysis emphasized the use of graphics to understand the data during analysis, rather than the final presentation to an outside audience.
Tukey created the box and whiskers plot and the stem and leaf plot.
Tufte
Edward R. Tufte’s books, beginning with The Visual Display of Quantitative Information, have become the most widely known works on data visualization.
There is considerable overlap between the various publications
Tufte’s ideal is highly compressed, elegant, and informative data, as expressed in dense printed graphics
Tufte sometimes emphasizes beauty and design to the detriment of simplicity and clarity [e.g., train schedules]
“Graphical elegance is often found in simplicity of design and complexity of data.”
“Beautiful graphics do not traffic with the trivial.”
Train Schedule from Marey
Tufte’s principles
Tufte has developed and popularized numerous principles and terminology:
Graphics reveal data - show the data without distorting it - “above all else show the data”
Small multiple - understanding one slice makes understanding others easier
Lie factor - size of the effect shown in the graphic / size of the effect in the data
Graphical Integrity - no lies, let data vary, not design
Data density - maximize data/ink ratio
Sparklines - seems they haven’t caught on
chartjunk - self-explanatory
PowerPoint is responsible for most of the world’s sorrows [The Cognitive Style of PowerPoint]
Lie Factor
image source: http://www.datavis.ca/gallery/lie-factor.php
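As a worked illustration of the lie factor, here is a minimal Python sketch (not from the workshop's R script). It treats the "effect" as a relative change, both in the drawn graphic and in the underlying data. The chart dimensions and data values are hypothetical, chosen only to make the arithmetic easy to follow, and the function names are illustrative.

```python
def relative_change(first, second):
    """Relative change from first to second, e.g. 0.5 for a 50% increase."""
    return (second - first) / first


def lie_factor(shown_first, shown_second, data_first, data_second):
    """Lie factor: size of effect shown in the graphic / size of effect in the data."""
    return relative_change(shown_first, shown_second) / relative_change(data_first, data_second)


# Hypothetical chart: the data rise from 100 to 150 (+50%),
# but the bars are drawn 2 cm and 6 cm tall (+200%).
factor = lie_factor(shown_first=2.0, shown_second=6.0,
                    data_first=100.0, data_second=150.0)
print(f"lie factor = {factor:.1f}")
```

A value close to 1 means the graphic represents the change faithfully; values well above 1 mean the graphic exaggerates it.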
Cleveland
William Cleveland’s Elements of Graphing Data and Visualizing Data pioneered systematic considerations of data legibility
Cleveland is particularly known for promoting the dot plot as an alternative to bars and pies.
The dot plot provides clarity and easy comparison of data.
Cleveland also pioneered Trellis graphics
Trellis graphics emphasizes comparison of multiple panels of data
The lattice package implements Trellis graphics in R
See Cleveland.pdf for a summary of Cleveland’s recommendations
Scatterplot matrix
The Grammar of Graphics
The Grammar of Graphics, by Leland Wilkinson, was extremely influential in thinking about graphics
Grammar means “rules for art and science”
The Grammar of Graphics specifies rules both mathematical and aesthetic
Earlier graph producers focused on aesthetics of static content
Dynamic graphics and scientific visualization, by contrast, require sophisticated designs to enable brushing, drill-down, zooming, linking
The Grammar of Graphics is easily adapted to this approach
ggplot2 was developed by Hadley Wickham as an implementation of the Grammar of Graphics
From Barchart to Dot Plot
The Cleveland dot plot
use to compare labeled quantities, ordered lists
Figure: Bar chart v. Dot Plot
Visualizing Distributions of Data
Box and Whiskers Plot
illustrate quantiles and outliers. There is also a Tufte version.
Violin plot
Blends density information with box and whiskers style (in an artistic manner)
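To make concrete what a box and whiskers plot draws, the following Python sketch (not from the workshop's R script) computes the quartiles, the whisker ends, and the outliers using the common 1.5 x IQR fence convention. The sample values are made up, and quartile conventions differ between software packages, so other tools may report slightly different numbers for the same data.

```python
from statistics import quantiles


def box_plot_stats(values):
    """Quartiles, whisker ends and outliers under the 1.5 * IQR fence rule."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers extend to the most extreme values still inside the fences.
    inside = [v for v in values if low_fence <= v <= high_fence]
    return {
        "whisker_low": min(inside),
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": max(inside),
        "outliers": sorted(v for v in values if v < low_fence or v > high_fence),
    }


# Made-up sample with one extreme value.
stats = box_plot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
for name, value in stats.items():
    print(name, value)
```

A violin plot replaces the box outline with a mirrored density estimate of the same values, so it shows the shape of the distribution as well as these summary points.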
Figure: Box Plot v. Violin Plot
Visualizing Categorical Data
Beyond the pie chart
The mosaic plot allows multiple categories to be displayed on the same graph, but can be complicated to interpret.
The spineplot is a variant of the mosaic plot, plotting proportions in 2 dimensions.
Figure: Pie Chart v. Mosaic Plot
Maps and Glyphs
Maps are obviously an important and widespread way of presenting data.
We examine a few examples of choropleth maps, in which shading indicates data levels
See also Interactive Maps in R and 5 kinds of Interactive maps in Plot.ly for further exploration
Glyphs present iconic representations of data elements.
Weather maps often use glyphs.
A more dynamic example is here.
As an R example, consider Chernoff faces and the aplpack package. Also, Smiley faces [and many more graph variants in this chapter].
Figure: Choropleth Map v. Chernoff Faces
3-D
3-D scatterplots
cloud (lattice)
contour plots
to plot standardized levels of data
wireframe plots
to present a 3-D surface representation of data
rgl (a separate package containing several 3d plotting functions and animation)
mosaic3d extends the mosaic paradigm to three dimensions
Animation
Animation is an easy way to step through data over time
or to provide comparisons of different views of data
R makes animation easy with the animation package
Just enclose a sequence of graphics in the animation command to generate interactive HTML (or GIF, SWF, LaTeX, Video).
Interactive DataViz - Principles
Why aren’t all of our graphs interactive?
Brushing is used to select data points and track them through various analyses.
Drilling down, zooming, and subsetting are also interactive techniques.
Data displays can be linked so that a selection in one panel modifies the output displayed in another panel.
Interactivity is especially useful for data exploration, studying multidimensional relationships.
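The logic behind brushing and linking can be sketched without any graphics at all. In the following Python sketch (not from the workshop's R script), a rectangular brush on one "panel" selects record ids, and a second "panel" recomputes its summary from that selection. The records, field names, and function names are all hypothetical and exist only for illustration.

```python
# Hypothetical records: (id, income, age, region). All values are made up.
records = [
    ("a", 20, 25, "north"),
    ("b", 35, 40, "south"),
    ("c", 50, 38, "north"),
    ("d", 80, 60, "south"),
    ("e", 42, 45, "north"),
]


def brush(rows, x_range, y_range):
    """Panel 1: ids of rows whose (income, age) lies inside the brushed rectangle."""
    (x_lo, x_hi), (y_lo, y_hi) = x_range, y_range
    return {rid for rid, x, y, _ in rows if x_lo <= x <= x_hi and y_lo <= y <= y_hi}


def linked_region_counts(rows, selected_ids):
    """Panel 2: count of the brushed rows in each region."""
    counts = {}
    for rid, _, _, region in rows:
        if rid in selected_ids:
            counts[region] = counts.get(region, 0) + 1
    return counts


selected = brush(records, x_range=(30, 60), y_range=(35, 50))
region_counts = linked_region_counts(records, selected)
print(sorted(selected), region_counts)
```

An interactive graphics package does the same thing on every mouse movement: it updates the selection, then redraws every linked display from it.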
Interactive Data in Practice
There are many R packages that allow for interactive data work in a graphical user interface, including:
playwith - versatile package that works with any graphics function. Graphics can be explored, edited, and exported.
requires separate installation of GTK+ on your computer [method varies by OS]
googleVis
In many contexts, visualizing the relationships between data elements is made easier by viewing related data interactively.
Making this easy are googleVis and other “Vis” packages, e.g. bdvis for biodiversity or rainfreq.
A Library example - comparing selected ARL Statistics for public CIC universities
Interactive Data on the Web - rCharts
rCharts is a package that uses JavaScript to create interactive visualizations.
Lattice-style commands are used.
The package can output JavaScript for use in an HTML page.
Some commands depend on supplemental JavaScript libraries that must be installed, such as NVD3
Can embed in documents too, with slidify
Interactive Data on the Web - shiny
The shiny package is developed by the RStudio folks
You can learn shiny in half a day via the online tutorial
More custom control of the design is possible with shiny, in comparison to other do-it-all packages
Graphics use familiar R syntax (including ggplot2), with wrappers to implement web functionality
Every shiny app has the same structure: two R scripts saved together in a directory [ui and server files]
You must install the shiny server to deliver pages via the web
Interactive Data on the Web - shiny, cont.
There are samples built into the shiny package.
You can build a Census Explorer of your own with these instructions from Ari Lamstein.
You can see more in the shiny gallery
rCharts works with shiny too.
Interactive Data on the Web - ggvis
The ggvis package is ALSO developed by the RStudio folks
Think ggplot meets shiny
Similar syntax to ggplot
Some ability to add interactive controls
Can embed in shiny for web access
Interactive Data on the Web - radiant
Radiant is another new R interface built with shiny
The following links demonstrate capabilities:
vnijs.shinyapps.io/base
vnijs.shinyapps.io/quant
vnijs.shinyapps.io/marketing
By automating the mechanics of interacting with data, we can focus on exploring and understanding.
Other (non-R) options for Web visualization
D3.js, free at http://d3js.org/
Inkscape, free at https://inkscape.org/
Tableau, free 1-year student license at http://www.tableau.com/academic/students
Plot.ly environment at http://plot.ly
Interactive Power
Population pyramids are one example where interactivity + animation = insight.
Populationpyramid.net - for all countries, basic animation
The German Population Pyramid from Destatis is even more interactive
Doing it in R is possible with these instructions (Part 1) and (Part 2)
Big Data
Big data presents special issues for data visualization
While many techniques and graphics are the same, exploration and plotting must be optimized for the size of the data set
Representation of the complexity of the data may require special techniques
hexbin
bigvis
bigvis
bigvis was an experimental package by Hadley Wickham to deal with the issues of Big Data
There is a Preprint and R Meetup presentation by Hadley Wickham
Complete code is available at https://github.com/hadley/bigvis-infovis
Target: process 100 million observations in under 5 seconds.
Fundamental principle: No need for more data points than there are pixels on the screen.
“ggstat” package has been mentioned as a future project that will incorporate these ideas.
bigvis steps
Condense (bin, condense)
Smooth (smooth, best_h, peel)
Visualize (autoplot plus standard methods)
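The first two steps can be illustrated with a small Python sketch. This is not the bigvis API: the function names condense_counts and smooth_counts are hypothetical, the smoother is a plain three-bin moving average standing in for whatever smoother a real package would use, and the six input values stand in for millions of observations. The idea it shows is that only one number per bin needs to be kept for plotting.

```python
import math
from collections import Counter


def condense_counts(values, width, origin=0.0):
    """Condense: bin values into fixed-width bins, returning {bin_left_edge: count}.

    Empty bins between the lowest and highest occupied bin are kept with count 0.
    """
    counts = Counter(math.floor((v - origin) / width) for v in values)
    lo, hi = min(counts), max(counts)
    return {origin + k * width: counts[k] for k in range(lo, hi + 1)}


def smooth_counts(binned):
    """Smooth: three-bin moving average over the counts (edge bins use two bins)."""
    ys = list(binned.values())
    smoothed = []
    for i in range(len(ys)):
        window = ys[max(0, i - 1): i + 2]
        smoothed.append(sum(window) / len(window))
    return dict(zip(binned.keys(), smoothed))


raw = [0.1, 0.2, 0.9, 1.5, 2.2, 2.9]  # stand-in for a very large data set
condensed = condense_counts(raw, width=1.0)
smoothed = smooth_counts(condensed)
print(condensed)
print(smoothed)
```

The visualize step then plots the condensed or smoothed summaries, whose size depends on the number of bins rather than the number of observations.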
Trelliscope (Tessera)
Tessera is developed by Purdue, Pacific Northwest National Laboratory, and Mozilla. Launched in November 2014, this project holds a lot of promise.
Running in the R environment, Tessera provides its own commands that execute across a cluster, easing the burden of analysis in this environment.
The datadr package “divides and recombines” in a manner similar to MapReduce, providing a simplified interface to Hadoop.
Tessera has its own visualization interface, Trelliscope, that can handle views across many variables and observations. Described in this paper.
Tessera’s Bootcamp is a good introduction, or try the quickstart.
Live demo is here.
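The "divide and recombine" idea mentioned above can be sketched in a few lines of Python. This is not the datadr API and it involves no cluster: the function names, the field names, and the records are hypothetical, and every subset is processed serially in one process. It only shows the pattern of splitting data by a key, analyzing each subset independently, and collecting the results.

```python
from collections import defaultdict
from statistics import mean


def divide(rows, key):
    """Divide: split the rows into subsets that share the same value of key."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[key]].append(row)
    return dict(subsets)


def recombine(subsets, analyze):
    """Recombine: apply analyze to every subset and collect the results."""
    return {name: analyze(rows) for name, rows in subsets.items()}


# Made-up records for illustration.
rows = [
    {"state": "NJ", "value": 10.0},
    {"state": "NJ", "value": 14.0},
    {"state": "MI", "value": 7.0},
    {"state": "MI", "value": 9.0},
    {"state": "MI", "value": 11.0},
]

by_state = divide(rows, "state")
means = recombine(by_state, lambda subset: mean(r["value"] for r in subset))
print(means)
```

Because each subset is analyzed on its own, the analyze step is the part that a cluster back end could run in parallel.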
Infographics links
Although not covered here, the following links are a sampling of infographics sites for your later enjoyment:
Data Storytelling in Video
Art of Data Visualization - in spite of its title, more on the infographics side
Parisian Subway Traffic and New York Subway Inequality
Tulp Interactive
Mapping London and London Riots + Twitter
YouTube Trends Map
Global Burden of Disease Visualizations
and the Tree of Life
Keep Exploring
Data Visualization represents a nearly infinite world of possibility for exploration:
plunge into programming
deep dives into data
indulge in interactivity
...have fun and keep learning! [e.g., R-bloggers.com]
References
There is also an online bibliography of references to accompany this presentation on my home page.