Statistical Visualization
David Zeitler
May 2015
Big Data and Hadoop Users Group of West Michigan
Note that in order to get PDF output you will need to have LaTeX installed. On your own computer you can download and install LaTeX from the sources below:
• Windows: MiKTeX - http://www.miktex.org/
• OSX: MacTeX - https://www.tug.org/mactex/
If you don’t have LaTeX, you can change the output to word_document or html_document and things should mostly render well. For Word, you’ll need to make some modifications since it doesn’t handle xtable output or the newthought commands.
There are many galleries of R graphics available online where you can find code along with the image. Just find the one that looks like what you want, copy the code and edit it to use your data.
Gallery links: Google search ‘r graphics gallery’
• http://rgraphgallery.blogspot.com/
• http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html
• http://scs.math.yorku.ca/index.php/R_Graphs_Gallery
• http://www.statmethods.net/advgraphs/parameters.html
• Shiny: http://shiny.rstudio.com/gallery/
Preliminaries [1]
[1] The following packages are used in this code and may need to be installed: lattice, latticeExtra, ggplot2, mosaic, mosaicData, vcd, vcdExtra, mapproj, scatterplot3d, mvtnorm, rgl, dplyr, ggvis, lubridate, nycflights13, animation.
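A quick way to see which of these packages still need installing is sketched below. It is only an illustration: the name nycflights13 is taken from the require() call later in the handout, xtable is added because the first table chunk loads it, and missing_pkgs is a made-up variable name. The install line is left commented out because it assumes a CRAN mirror is configured.

```r
# Packages named in the note above, plus xtable (loaded by the first table chunk).
pkgs <- c("lattice", "latticeExtra", "ggplot2", "mosaic", "mosaicData",
          "vcd", "vcdExtra", "mapproj", "scatterplot3d", "mvtnorm", "rgl",
          "dplyr", "ggvis", "lubridate", "nycflights13", "animation", "xtable")

# Compare against what is already installed in the current library paths.
installed <- rownames(installed.packages())
missing_pkgs <- setdiff(pkgs, installed)
print(missing_pkgs)

# Uncomment to install whatever is missing (assumes a CRAN mirror is set):
# if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```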
This presentation is about understanding a bit more about graphics in R and Statistical Visualization in general. The first thing to understand is that the purpose of a graphic is to convey information. That may be information we know, for example when creating graphics for publication. Often, though, the information conveyed was unknown before seeing the graphic, as during exploratory data analysis (John Tukey [2]).
[2] http://www.amstat.org/about/statisticiansinhistory/index.cfm?fuseaction=biosinfo&BioID=14
The difference between art and visualization is that art inherently depends on the viewer, invoking in each viewer a different experience. The artist strives to encourage the viewer to combine the sensory experience with their own experiences to make each perception of the art unique. Visualization is meant to convey or uncover specific meaning in all viewers. Human perception is a function of both the sensory experience and the viewer’s model of the object being perceived. For example, the first time I went up in an airplane, nothing outside the windows was recognizable. Not until I had developed a sufficient 3D model of the environment could I recognize places that were totally familiar to me from the ground. R graphics software uses underlying statistical models to give the graphics a common model that allows the analyst to convey information to the viewer.
The graphics capabilities of R have developed over decades of open source development. Although there are a lot of stand-alone graphics packages and functions available in R, there are 3 main graphics systems available.
System   Comment
base     Simplest and oldest but still used frequently
lattice  Mature, publication quality, easy to use
ggplot2  Most sophisticated, hardest to learn, publication quality
We’ll first go rather quickly through a couple of examples focusing on the differences.
One of the datasets we will use for quite a few of these examples is the mtcars data included with R. Let’s take a quick look at the first few rows of that data:
library(xtable)
options(xtable.comment = FALSE)
options(xtable.booktabs = TRUE)
options(xtable.type = knitr::opts_knit$get("rmarkdown.pandoc.to"))
xtable(head(mtcars[, 1:6]), caption = "First rows of mtcars")
mpg cyl disp hp drat wt
Mazda RX4 21.00 6.00 160.00 110.00 3.90 2.62
Mazda RX4 Wag 21.00 6.00 160.00 110.00 3.90 2.88
Datsun 710 22.80 4.00 108.00 93.00 3.85 2.32
Hornet 4 Drive 21.40 6.00 258.00 110.00 3.08 3.21
Hornet Sportabout 18.70 8.00 360.00 175.00 3.15 3.44
Valiant 18.10 6.00 225.00 105.00 2.76 3.46
Table 2: First rows of mtcars
Base or traditional graphics is the simplest and oldest form of graphics. Examples are simple box plots (boxplot), scatter plots (plot) or histograms (hist). Box plots and histograms are single quantitative variable graphics; a scatter plot shows two quantitative variables.
hist(mtcars$mpg)
Figure 1: Base graphics histogram of mpg
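The other two base functions named above work the same way. The sketch below is only an illustration on mtcars; the titles and axis labels are arbitrary choices. It also shows that hist() returns its bins, so the counts can be inspected without drawing anything.

```r
# Box plot: one quantitative variable
boxplot(mtcars$mpg, ylab = "mpg", main = "Box plot of mpg")

# Scatter plot: two quantitative variables
plot(mtcars$wt, mtcars$mpg, xlab = "wt", ylab = "mpg", main = "mpg versus wt")

# hist() returns the bin breaks and counts; plot = FALSE skips the drawing
h <- hist(mtcars$mpg, plot = FALSE)
print(h$breaks)
print(h$counts)
```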
Lattice
Lattice graphics [3] is the system developed by Deepayan Sarkar [4]. They focus mainly on trellis plots, which are multiple panels allowing us to explore one or more dependent variables with respect to one or many independent variables. The lattice package is included with all R distributions by default, so we only need to load it into the current session to use its commands. It can however do single variable graphics just like the histogram we did with base graphics.
[3] Lattice Tutorial: http://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf
[4] http://www.isid.ac.in/~deepayan/
require(lattice)
## Loading required package: lattice
histogram(~mpg, mtcars)
Figure 2: Lattice histogram of mpg
Or we can look at mpg by cylinders.
histogram(~mpg | cyl, mtcars)
Figure 3: Lattice histogram of mpg (one panel per value of cyl)
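In the conditioned histogram, cyl is numeric, so each panel strip carries only the variable name rather than the cylinder count. A minimal sketch, assuming labelled strips are wanted: converting the conditioning variable to a factor makes lattice print the level (4, 6 or 8) in each strip. The object name p is just for illustration.

```r
library(lattice)

# Conditioning on a factor gives one panel per level, labelled with that level
p <- histogram(~mpg | factor(cyl), data = mtcars)
print(p)
```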
ggplot2: An Implementation of the Grammar of Graphics
The ggplot system is the system developed by Hadley Wickham [5]. It is a highly object oriented graphics system based on The Grammar of Graphics by Leland Wilkinson [6].
[5] http://had.co.nz/
[6] http://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html
Shameless plug. . .
For those that might want to go further... We’re offering a 3 credit course this summer [7]. STA 380/580 Statistical Visualization: An exploration of statistical visualization using R, Tableau, SAS JMP and SAS Enterprise Guide. Emphasis will be on the grammar of graphics and the graphical communication of quantitative information, including graphical depictions for most standard statistical procedures and the use of animations. A semester of statistics and only limited familiarity with programming will be assumed.
[7] Summer 6-weeks session 2015. PEW Campus Eberhard Center room 612, 6:00-9:20 Tuesday/Thursday June 22 - August 4
Ok. Commercial break over, back to work... Let’s look at our histograms using ggplot2.
# install.packages('ggplot2', dependencies = TRUE)
require(ggplot2)
## Loading required package: ggplot2
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 3)
Figure 4: ggplot2 histogram of mpg
require(ggplot2)
ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 3) +
    facet_wrap(~cyl)
Figure 5: ggplot2 histogram of mpg (panels: cyl = 4, 6, 8)
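Because a ggplot is an ordinary R object, the two histograms above can also be built up step by step. The sketch below is only an illustration of that idea; the names base, hist_plot and faceted are made up for the example.

```r
library(ggplot2)

# Data and aesthetic mapping only: nothing is drawn yet and there are no layers
base <- ggplot(mtcars, aes(x = mpg))

# Adding a geom creates a new plot object with one layer
hist_plot <- base + geom_histogram(binwidth = 3)

# Faceting changes the panel layout without adding a layer
faceted <- hist_plot + facet_wrap(~cyl)

print(length(base$layers))
print(length(faceted$layers))
print(faceted)
```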
Donner Party with Logistic Regression
require(mosaic)
require(mosaicData)
require(vcd)
require(vcdExtra)
The Donner data frame in vcdExtra gives details on the survival of 90 members of the Donner party, a group of people who attempted to migrate to California in 1846. They were trapped by an early blizzard on the eastern side of the Sierra Nevada mountains, and before they could be rescued, nearly half of the party had died. What factors affected who lived and who died?
data(Donner)
# separate logistic regression fits on age for M/F
# (current ggplot2 takes the glm family through method.args)
ggplot(Donner, aes(age, survived, color = sex)) +
    geom_point(position = position_jitter(height = 0.02, width = 0)) +
    stat_smooth(method = "glm", method.args = list(family = binomial),
        formula = y ~ x, alpha = 0.2, size = 2, aes(fill = sex))
[Figure: survived (0 to 1) versus age (0 to 60), with jittered points and fitted curves colored by sex (Female, Male)]
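The curves in this graphic are logistic regressions fitted separately for each sex. A minimal sketch of the equivalent model fitted directly with glm(), assuming vcdExtra is installed: the age * sex interaction gives each sex its own intercept and age slope. The example age of 25 and the names fit and new_people are arbitrary.

```r
library(vcdExtra)   # provides the Donner data frame
data(Donner)

# age * sex allows a separate intercept and age slope for each sex
fit <- glm(survived ~ age * sex, family = binomial, data = Donner)
print(coef(summary(fit)))

# Predicted survival probability for a 25-year-old of each sex
new_people <- data.frame(age = 25,
                         sex = factor(c("Female", "Male"),
                                      levels = levels(Donner$sex)))
print(cbind(new_people,
            p_survive = predict(fit, new_people, type = "response")))
```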
Now let’s just do some graphics for the fun of it!
First: Multipanel base graphics
We can build up graphics using multiple panels with a traditional graphic in each panel. For example we can look at various aspects of the mtcars data in a 2x2 layout. Arbitrarily complex layouts can be created, but we really don’t have time for that here.
Note that I didn’t put this on the side margin. These did not scale well.
par(mfrow = c(2, 2))
hist(mtcars$mpg, main = "", xlab = "mpg")
boxplot(mtcars$mpg, main = "miles/gallon")
plot(mtcars$hp, mtcars$mpg, xlab = "horsepower",
ylab = "mpg")
plot(density(mtcars$mpg), main = "")
[Figure: 2x2 panel of mtcars plots: histogram of mpg, box plot of miles/gallon, scatter plot of mpg versus horsepower, and density of mpg (N = 32, bandwidth = 2.477)]
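For layouts that are not a regular grid, base graphics also provides layout(). The sketch below is only an illustration of one such arrangement (one wide panel on top, two below); the choice of plots is arbitrary.

```r
# Matrix entries give the figure number that occupies each cell:
# figure 1 spans the whole top row, figures 2 and 3 share the bottom row.
n_fig <- layout(matrix(c(1, 1,
                         2, 3), nrow = 2, byrow = TRUE))
print(n_fig)   # layout() returns the number of figures

plot(mtcars$hp, mtcars$mpg, xlab = "horsepower", ylab = "mpg")  # top, full width
hist(mtcars$mpg, main = "", xlab = "mpg")                       # bottom left
boxplot(mtcars$mpg, main = "miles/gallon")                      # bottom right

layout(1)   # back to a single panel
```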
A bit more advanced use of base graphics
It might be the old base stuff, but there’s a lot that can be, and has been, done with it.
with(mtcars, {
    # radius chosen so that circle area is proportional to displacement
    r <- sqrt(disp/pi)
    # wt is on the x axis and mpg on the y axis, so label them accordingly
    symbols(wt, mpg, circles = r, inches = 0.3,
        main = "Bubble Plot with point size\nproportional to displacement",
        fg = "white", bg = "lightblue", xlab = "Weight of Car (lbs/1000)",
        ylab = "Miles Per Gallon")
    text(wt, mpg, rownames(mtcars), cex = 0.6)
})
[Figure: bubble plot with point size proportional to displacement; each bubble is labeled with the car name]
A lattice of cloud plots
We’ve seen several of these lattice graphics today, so we’ll jump right into some more interesting capabilities. Remember this graphic; we’ll come back to this data later with a dynamic 3D graphic that does an even better job than this one. This is a single quantitative dependent variable viewed by two quantitative independent variables and a categorical independent variable. That’s why we see the formula [8] in the cloud function call. One of the great things about lattice is this explicit model in the function call, making the underlying model clear.
[8] Sepal.Length ~ Petal.Length * Petal.Width | Species
## Lattice of 3d graphics using the Iris data
cloud(Sepal.Length ~ Petal.Length * Petal.Width |
Species, data = iris, screen = list(x = -90,
y = 70), distance = 0.4, zoom = 0.6, zlab = list(rot = 90,
cex = 0.5), xlab = list(rot = 45, cex = 0.5),
ylab = list(cex = 0.5))
[Figure: lattice cloud plots of Sepal.Length against Petal.Length and Petal.Width, one panel per species: setosa, versicolor, virginica]
The obligatory map graphic
What would a visualization discussion be without one of those map graphics? This one looks at U.S. cancer rates.
library(latticeExtra)  # USCancerRates, mapplot
library(maps)          # map
library(mapproj)       # needed for projection = "tetra"
data(USCancerRates)
rng <- with(USCancerRates, range(rate.male, rate.female, finite = TRUE))
nbreaks <- 50
breaks <- exp(do.breaks(log(rng), nbreaks))
suppressWarnings(print(mapplot(rownames(USCancerRates) ~ rate.male + rate.female,
    data = USCancerRates, breaks = breaks,
    map = map("county", plot = FALSE, fill = TRUE, projection = "tetra"),
    scales = list(draw = FALSE), xlab = "",
    main = "Average yearly deaths due to cancer\nper 100000")))
[Figure: county-level maps titled "Average yearly deaths due to cancer per 100000", panels rate.male and rate.female, color key from 100 to 600]
An advanced application of Lattice graphics
This image is going to be complex [9], so don’t worry too much about the preparations. They’ll go on for a page or so setting up the data (pulled from the web), defining a special strip function for the time series ‘rows’ in the graphic, setting up limits, etc.
[9] Source Author: Alastair Sanderson - Astrophysics & Space Research (ASR) Group at the University of Birmingham. Source URL: http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html
Plotting time series using lattice graphics
An R chart of daily weather measurements taken at the University of Birmingham Wast Hills Observatory, using the excellent R lattice graphics package. The measurements are averaged over the 5 hours around midday, except for rainfall, which is averaged over a week.
# --Load previously saved data from
# Sanderson’s website
path <- "http://www.sr.bham.ac.uk/~ajrs/R/datasets"
a <- load(url(paste(path, "middayweather.RData",
sep = "/")))
close(url(paste(path, "middayweather.RData", sep = "/")))
print(a) # list names of saved objects
## [1] "middayweather"
# --Load extra libraries:
require(lattice)
# --Define plot titles:
lab.wind.speed <- "Wind speed (mph)"
lab.hum <- "Humidity (%)"
lab.rain <- "Rainfall (mm/day, averaged over a week)"
lab.bar <- "Air pressure (mb)"
lab.T.out <- as.expression(expression(paste("Outside temperature (",
degree * C, ")")))
# --Custom strip function: (NB the colour used
# is the default lattice strip background
# colour)
my.strip <- function(which.given, which.panel,
...) {
strip.labels <- c(lab.wind.speed, lab.hum,
lab.rain, lab.bar, lab.T.out)
panel.rect(0, 0, 1, 1, col = "#ffe5cc", border = 1)
panel.text(x = 0.5, y = 0.5, adj = c(0.5,
0.55), cex = 0.95, lab = strip.labels[which.panel[which.given]])
}
# --Define X axis date range:
xlim <- range(middayweather$Date)
# --Define annual quarters for plot grid line
# markers:
d <- seq(from = as.Date("2006-01-01"), to = as.Date("2011-01-01"),
by = 365/4)
# --Define colours for raw & smoothed data:
col.raw <- "#377EB8" #colset[2] } see note above
col.smo <- "#E41A1C" #colset[1] }
col.lm <- "grey20"
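The grid-line dates above step by 365/4 days, so they drift slightly away from the calendar quarter starts. A sketch of an alternative, assuming exact quarter starts are wanted: seq() on Date objects accepts by = "quarter". The name d_quarter is made up for this example and is not used by the plotting code that follows.

```r
# One date per calendar quarter, from the start of 2006 to the start of 2011
d_quarter <- seq(from = as.Date("2006-01-01"), to = as.Date("2011-01-01"),
                 by = "quarter")
print(head(d_quarter))
print(length(d_quarter))
```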
Note the formula [10] in the xyplot call, depicting this as a graphic with five dependent variables and one independent variable.
[10] wind.speed + hum.out + rain + bar + T.out ~ Date
# --Create multipanel plot:
xyplot(wind.speed + hum.out + rain + bar + T.out ~ Date,
    data = middayweather, scales = list(y = "free", rot = 0),
    xlim = xlim, strip = my.strip, outer = TRUE,
    layout = c(1, 5, 1), ylab = "",
    panel = function(x, y, ...) {
        # horizontal gridlines at the default tick positions
        panel.grid(h = -1, v = 0)
        # vertical reference lines at the quarterly dates in d
        panel.abline(v = d, col = "grey90")
        # raw data
        panel.xyplot(x, y, ..., type = "l", col = col.raw, lwd = 0.5)
        # smoothed data
        panel.loess(x, y, ..., col = col.smo, span = 0.14, lwd = 0.5)
        # median value
        panel.abline(h = median(y, na.rm = TRUE), lty = 2,
            col = col.lm, lwd = 1)
    },
    key = list(title = "Birmingham Wast Hills Observatory\naverage midday weather",
        text = list(c("raw data", "smoothed curve", "median value")),
        col = c(col.raw, col.smo, col.lm), lty = c(1, 1, 2),
        columns = 2, cex = 0.8, lines = TRUE))
[Figure: "Birmingham Wast Hills Observatory average midday weather", five stacked time-series panels against Date (2007-01 to 2010-01): Wind speed (mph), Humidity (%), Rainfall (mm/day, averaged over a week), Air pressure (mb), Outside temperature (°C); key: raw data, smoothed curve, median value]
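The same multi-response formula idea can be tried on a small built-in dataset before tackling the weather data. This is only a sketch on mtcars, with two arbitrary responses (mpg and hp) against wt; outer = TRUE puts each response in its own panel, and the free y scales let each panel use its own range.

```r
library(lattice)

# Two dependent variables, one independent variable, one panel per response
p <- xyplot(mpg + hp ~ wt, data = mtcars, outer = TRUE,
            scales = list(y = "free"), layout = c(1, 2),
            ylab = "", type = c("p", "smooth"))
print(p)
```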
Other advanced graphics
scatterplot3d
3D Scatterplot with fit linear regression plane and reference points.
require(scatterplot3d)
## Loading required package: scatterplot3d
data(trees)
# Create the scatterplot
s3d <- scatterplot3d(trees, type = "h", color = "blue",
angle = 55, scale.y = 0.7, pch = 16, main = "Trees")
# fit a multiple regression model
my.lm <- lm(trees$Volume ~ trees$Girth + trees$Height)
# add the model plane to the graphic
s3d$plane3d(my.lm)
# mark red reference points at (10,85,60),
# (12,80,50), etc.
s3d$points3d(seq(10, 20, 2), seq(85, 60, -5),
seq(60, 10, -10), col = "red", type = "h",
pch = 8)
[Figure: "Trees", 3D scatter plot of Volume against Girth and Height with the fitted regression plane and reference points]
Fancy 3D plot of a bivariate normal distribution with equations.
# install.packages('mvtnorm')
require(mvtnorm)
## Loading required package: mvtnorm
require(scatterplot3d)
x1 <- x2 <- seq(-10, 10, length = 51)
dens <- matrix(dmvnorm(expand.grid(x1, x2), sigma = rbind(c(3,
2), c(2, 3))), ncol = length(x1))
# Plot the bounding box
s3d <- scatterplot3d(x1, x2, seq(min(dens), max(dens),
length = length(x1)), type = "n", grid = FALSE,
angle = 70, main = "Bivariate normal distribution",
zlab = expression(f(x[1], x[2])), xlab = expression(x[1]),
ylab = expression(x[2]))
# add a little fancy equation labeling
text(s3d$xyz.convert(-1, 10, 0.07), labels = expression(f(x) ==
frac(1, sqrt((2 * pi)^n * phantom(".") * det(Sigma[X]))) *phantom(".") * exp * {
bgroup("(", -scriptstyle(frac(1, 2) *phantom(".")) * (x - mu)^T * Sigma[X]^-1 *(x - mu), ")")
}), cex = 0.5)
text(s3d$xyz.convert(1.5, 10, 0.05), labels = expression("with" *phantom("m") * mu == bgroup("(", atop(0, 0),
")") * phantom(".") * "," * phantom(0) * {
Sigma[X] == bgroup("(", atop(3 * phantom(0) *2, 2 * phantom(0) * 3), ")")
}), cex = 0.5)
# Add the density surface
for (i in length(x1):1) s3d$points3d(rep(x1[i],
length(x2)), x2, dens[i, ], type = "l")
for (i in length(x2):1) s3d$points3d(x1, rep(x2[i],
length(x1)), dens[, i], type = "l")
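[Figure: "Bivariate normal distribution", a 3D surface of f(x1, x2) over x1 and x2 from -10 to 10, annotated with the density equation]
$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma_X)}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma_X^{-1} (x-\mu)\right), \quad \text{with } \mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix},\ \Sigma_X = \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}$$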
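As a quick check on the density formula annotated on this surface, it can be evaluated by hand in base R. The sketch below is only an illustration: mvn_density is a hypothetical helper written for this example, the mean and covariance come from the dmvnorm() call above, and the comparison with mvtnorm runs only if that package is installed.

```r
# Parameters from the dmvnorm() call above: mean (0, 0), Sigma = [3 2; 2 3]
mu <- c(0, 0)
Sigma <- rbind(c(3, 2), c(2, 3))

# Hypothetical helper: multivariate normal density at a single point x
mvn_density <- function(x, mu, Sigma) {
    n <- length(mu)
    dev <- x - mu
    quad <- drop(t(dev) %*% solve(Sigma) %*% dev)  # (x - mu)^T Sigma^-1 (x - mu)
    exp(-quad / 2) / sqrt((2 * pi)^n * det(Sigma))
}

peak <- mvn_density(c(0, 0), mu, Sigma)
print(peak)                    # density at the mean
print(1 / (2 * pi * sqrt(5)))  # closed form at the mean, since det(Sigma) = 5

# Optional cross-check against mvtnorm, if it is installed
if (requireNamespace("mvtnorm", quietly = TRUE)) {
    print(all.equal(peak, mvtnorm::dmvnorm(c(0, 0), mean = mu, sigma = Sigma)))
}
```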
Note that we’re using R Markdown for the following, but the R output cannot be compiled using RStudio since there is a lot of user interaction and animation going on throughout that cannot be rendered into a static document. So only the code comes out in the handout.
rgl: 3D visualization device system (OpenGL)
A common visualization is a 3D point cloud. We saw some static point clouds in the previous work. Here the point cloud is dynamic, allowing us to move it around to see the structure.
require(rgl)
with(iris, plot3d(Sepal.Length, Sepal.Width, Petal.Length,
type = "s", col = as.numeric(Species)))
lattice interactive plotting
With the lattice package we can interact with the plot, highlighting interesting points and extracting those highlighted points for further analysis.
Using environmental data
env <- environmental
env$ozone <- env$ozone^(1/3)
splom(env, pscales = 0, col = "grey")
trellis.focus("panel", 1, 1, highlight = FALSE)
interesting <- panel.link.splom(pch = 16, col = "black")
trellis.unfocus()
env[interesting, ]
ggvis
The new ggvis graphics system is being developed by Winston Chang for interactive exploration.
ggvis works seamlessly with R Markdown v2 and interactive documents, so you can easily add interactive graphics to your R Markdown documents. ggvis plots are inherently reactive and they render in the browser, so they can take advantage of the capabilities provided by modern web browsers.
We’ll need to pull in some packages first for these examples.
require(dplyr)
require(ggvis)
require(lubridate)
Note that ggvis graphics will go to a web browser if you’re not knitting the file or using RStudio.
ggvis basics
The goal of ggvis is to make it easy to build interactive graphics for exploratory data analysis. ggvis has a similar underlying theory to ggplot2 (the grammar of graphics), but it’s expressed a little differently, and adds new features to make your plots interactive. ggvis also incorporates shiny’s reactive programming model and dplyr’s grammar of data transformation.
The graphics produced by ggvis are fundamentally web graphics and work very differently from traditional R graphics. This allows us to implement exciting new features like interactivity, but it comes at a cost. For example, every interactive ggvis plot must be connected to a running R session (static plots do not need a running R session to be viewed). This is great for exploration, because you can do anything in your interactive plot you can do in R, but it’s not so great for publication. We will overcome these issues in time, but for now be aware that we have many existing tools to re-implement before you can do everything with ggvis that you can do with base graphics.
Every ggvis graphic starts with a call to ggvis(). The first argument is the data set that you want to plot, and the other arguments describe how to map variables to visual properties.
p <- ggvis(mtcars, x = ~wt, y = ~mpg)
This doesn’t actually plot anything because you haven’t told ggvis how to display your data. You do that by layering visual elements, for example with layer_points():
layer_points(p)
(If you’re not using RStudio, you’ll notice that this plot opens in your web browser. That’s because all ggvis graphics are web graphics, and need to be shown in the browser. RStudio includes a built-in browser so it can show you the plots directly.)
All ggvis functions take the visualization as the first argument and return a modified visualization. This seems a little bit awkward. Either you have to create temporary variables and modify them, or you have to use a lot of parentheses:
layer_points(ggvis(mtcars, x = ~wt, y = ~mpg))
To make life easier ggvis uses the %>% function from the magrittr package to chain commands together, much like the ‘+’ used in ggplot. That allows you to rewrite the previous function call as:
mtcars %>% ggvis(x = ~wt, y = ~mpg) %>% layer_points()
# disp by mpg, converting engine displacement
# to litres
mtcars %>% ggvis(x = ~mpg, y = ~disp) %>% mutate(disp = disp/61.0237) %>%
layer_points()
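The pipe itself is independent of ggvis: x %>% f(y) is just another way of writing f(x, y). A minimal sketch using only magrittr and base R functions; the variable names are arbitrary.

```r
library(magrittr)

# x %>% f(y) is evaluated as f(x, y)
a <- head(mtcars, 3)
b <- mtcars %>% head(3)
print(identical(a, b))

# A nested call and the equivalent left-to-right chain
nested  <- round(mean(sqrt(mtcars$mpg)), 2)
chained <- mtcars$mpg %>% sqrt() %>% mean() %>% round(2)
print(c(nested, chained))
```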
The format of the visual properties needs a little explanation. We use ~ before the variable name to indicate that we don’t want to literally use the value of the mpg variable (which doesn’t exist), but instead we want to use the mpg variable inside the dataset. This is a common pattern in ggvis: we’ll always use formulas to refer to variables inside the dataset.
The first two arguments to ggvis() are usually the position, so by convention you can drop x and y:
mtcars %>% ggvis(~mpg, ~disp) %>% layer_points()
You can add more variables to the plot by mapping them to other visual properties like fill, stroke, size and shape.
mtcars %>% ggvis(~mpg, ~disp, stroke = ~vs) %>%
layer_points()
mtcars %>% ggvis(~mpg, ~disp, fill = ~vs) %>%
layer_points()
mtcars %>% ggvis(~mpg, ~disp, size = ~vs) %>%
layer_points()
mtcars %>% ggvis(~mpg, ~disp, shape = ~factor(cyl)) %>%
layer_points()
If you want to make the points a fixed colour or size, you need to use := instead of =. The := operator means to use a raw, unscaled value. This seems like something that ggvis() should be able to figure out by itself, but making it explicit allows you to create some useful plots that you couldn’t otherwise. See the ggvis documentation on properties and scales for more details.
mtcars %>% ggvis(~wt, ~mpg, fill := "red", stroke := "black") %>%
    layer_points()
mtcars %>% ggvis(~wt, ~mpg, size := 300, opacity := 0.4) %>%
    layer_points()
mtcars %>% ggvis(~wt, ~mpg, shape := "cross") %>%
    layer_points()
mtcars %>% ggvis(~mpg, ~disp, stroke = ~wt, fill = ~hp,
size = ~vs, shape = ~factor(cyl)) %>% layer_points() %>%
add_legend("fill", properties = legend_props(legend = list(y = 50))) %>%
add_legend("size", properties = legend_props(legend = list(y = 150))) %>%
add_legend("shape", properties = legend_props(legend = list(y = 250)))
ggvis Interaction
As well as mapping visual properties to variables or setting them to specific values, you can also connect them to interactive controls.
The following example allows you to control the size and opacity of points with two sliders:
mtcars %>% ggvis(~wt, ~mpg, size := input_slider(10, 100),
    opacity := input_slider(0, 1)) %>%
    layer_points()
You can also connect interactive components to other plot parameters like the width and centers of histogram bins:
mtcars %>% ggvis(~wt) %>% layer_histograms(width = input_slider(0,
2, step = 0.1, label = "width"), center = input_slider(0,
2, step = 0.05, label = "center"))
Behind the scenes, interactive plots are built with shiny, and you can currently only have one running at a time in a given R session. To finish with a plot, press the stop button in RStudio, or close the browser window and then press Escape or Ctrl + C in R.
As well as input_slider(), ggvis provides input_checkbox(), input_checkboxgroup(), input_numeric(), input_radiobuttons(), input_select() and input_text(). See the examples in the documentation for how you might use each one.
You can also use keyboard controls with left_right() and up_down(). Press the left and right arrows to control the size of the points in the next example.
keys_s <- left_right(10, 1000, step = 50)
mtcars %>% ggvis(~wt, ~mpg, size := keys_s, opacity := 0.5) %>%
    layer_points()
You can also add on more complex types of interaction like tooltips:
mtcars %>% ggvis(~wt, ~mpg) %>% layer_points() %>%
add_tooltip(function(df) df$wt)
As well as input_slider(), which produces a slider (or a double-ended range slider), there are a number of other interactive controls:
• input_checkbox(): a check-box
• input_checkboxgroup(): a group of check boxes
• input_numeric(): a spin box
• input_radiobuttons(): pick one from a set of options
• input_select(): create a drop-down text box
• input_text(): arbitrary text input
Note that all interactive inputs start with input_ so that you can always use tab completion to remind you of the options.
Here is an example of a plot that uses both a slider and a select box:
mtcars %>% ggvis(x = ~wt) %>% layer_densities(adjust = input_slider(0.1,
2, value = 1, step = 0.1, label = "Bandwidth adjustment"),
kernel = input_select(c(Gaussian = "gaussian",
Epanechnikov = "epanechnikov", Rectangular = "rectangular",
Triangular = "triangular", Biweight = "biweight",
Cosine = "cosine", Optcosine = "optcosine"),
label = "Kernel"))
slicing animations
require(ggplot2)
require(dplyr)
# require(animation)
require(nycflights13)
Generate an animation
Looking at diamonds price by carats
# axis limits are computed first because the plot below uses them
max.carat <- max(diamonds$carat)
max.price <- max(diamonds$price)
ggplot(diamonds, aes(carat, price)) + geom_point(aes(color = clarity,
    shape = cut)) + ylim(0, max.price) + xlim(0, max.carat) + stat_smooth()
Instead of trying to put it all on the display at once, slice it into layers and look at it in an animation. We can do a simple animation here in the RStudio plot window. A better display is to save the series of graphics in an animated GIF. However, the saveGIF function requires a copy of the ImageMagick software, which is not on the lab computers. We can run the animation in the plot window and look at the animated GIFs I generated on my laptop.
# saveGIF( movie.name = 'diamonds.gif',
for (this.cut in c("Fair", "Good", "Very Good", "Premium", "Ideal")) {
    for (this.clarity in c("I1", "SI1", "SI2", "VS1", "VS2", "VVS1", "VVS2", "IF")) {
        print(diamonds %>% filter(cut == this.cut & clarity == this.clarity) %>%
            ggplot(aes(carat, price)) + geom_point() + ylim(0, max.price) +
            xlim(0, max.carat) + stat_smooth() +
            ggtitle(paste("Cut is", this.cut, "and clarity is", this.clarity)))
    }
}
# )
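Before running an animation like this it can help to see how many frames there are and how many points fall in each one. The sketch below is only an illustration: it takes the cut and clarity values from the factor levels rather than typing them out, and slices is a made-up name for the resulting table.

```r
library(ggplot2)   # for the diamonds data

# One row per (cut, clarity) combination, i.e. one row per animation frame
slices <- expand.grid(cut = levels(diamonds$cut),
                      clarity = levels(diamonds$clarity),
                      stringsAsFactors = FALSE)

# Number of diamonds that fall in each slice
slices$n <- mapply(function(ct, cl) sum(diamonds$cut == ct & diamonds$clarity == cl),
                   slices$cut, slices$clarity)

print(nrow(slices))                      # number of frames
print(head(slices[order(slices$n), ]))   # the sparsest slices
```

Slices with very few points are the ones where a smoother is least informative.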
A more traditional slicing is based on time. For this we look at the NYC flights data by month.
ggplot(flights, aes(dep_delay, arr_delay)) + geom_point(aes(color = origin)) +
facet_wrap(~origin) + stat_smooth(method = "lm")
max.dep.delay <- max(flights$dep_delay, na.rm = TRUE)
min.dep.delay <- min(flights$dep_delay, na.rm = TRUE)
max.arr.delay <- max(flights$arr_delay, na.rm = TRUE)
min.arr.delay <- min(flights$arr_delay, na.rm = TRUE)
# saveGIF( movie.name = 'nycflights.gif',
for (i in 1:12) {
    print(flights %>% filter(month == i) %>%
        ggplot(aes(dep_delay, arr_delay)) + geom_point() +
        xlim(min.dep.delay, max.dep.delay) +
        ylim(min.arr.delay, max.arr.delay) +
        stat_smooth(method = "lm") + facet_wrap(~origin) +
        ggtitle(paste("Flights for", month.name[i])))
}
# )
Let’s now take a quick look at a couple of commercial visualization packages.
1) SAS JMP
2) Tableau
SAS JMP
The SAS product JMP (pronounced jump) was developed somewhat independently from the main SAS statistical products as a new approach to statistical analysis. From the start it has incorporated graphics as an integral part of the statistical analysis process. It has caught on in industry because of its ease of use and relatively low cost.
The JMP graph builder is a powerful tool for exploring data. We’ll take a quick look at some of its capabilities using the Diamonds data set we’ve already been working with.
Tableau
Tableau is a ‘relatively’ cheap software system available for Windows and Mac that can work with what the company calls ‘medium’ data. As with R, you will want to preprocess your data using the database system to select subsets and/or aggregate before bringing it into the software. It does, however, have facilities to connect to database systems and a Tableau server engine for larger analysis problems.
Extracting the R source code
Running the following chunk will extract the source code from above into the file StatisticalVisualization.R.
knitr::knit("StatisticalVisualization.Rmd", tangle = TRUE)
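knitr also provides purl() for the same job. The sketch below is a self-contained illustration rather than the handout's own workflow: it writes a tiny throwaway R Markdown file to a temporary location and extracts its code, so it runs without StatisticalVisualization.Rmd being present. The file contents and variable names are made up.

```r
library(knitr)

# The chunk delimiter is built at run time so the example holds no literal fence
fence <- strrep("`", 3)

# A throwaway R Markdown file: one line of prose and one code chunk
rmd <- tempfile(fileext = ".Rmd")
writeLines(c("Some prose.", "", paste0(fence, "{r}"), "x <- 1 + 1", fence), rmd)

# documentation = 0 keeps only the code, with no chunk headers or prose
out <- purl(rmd, output = tempfile(fileext = ".R"),
            documentation = 0, quiet = TRUE)
print(readLines(out))
```

For the handout itself the equivalent call would be purl("StatisticalVisualization.Rmd"), assuming that file is in the working directory.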