Statistical Visualization

David Zeitler

May 2015

Big Data and Hadoop Users Group of West Michigan

Note that in order to get PDF output you will need to have LaTeX installed. On your own computer you can download and install LaTeX from the sources below:

• Windows: MiKTeX - http://www.miktex.org/

• OSX: MacTeX - https://www.tug.org/mactex/

If you don’t have LaTeX, you can change the output to word_document or html_document and things should mostly render well. For Word, you’ll need to make some modifications since it doesn’t handle xtable output or the newthought commands.

There are many galleries of R graphics available online where you can find code along with the image. Just find the one that looks like what you want, copy the code and edit it to use your data.

Gallery links: Google search ‘r graphics gallery’

• http://rgraphgallery.blogspot.com/

• http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html

• http://scs.math.yorku.ca/index.php/R_Graphs_Gallery


• http://www.statmethods.net/advgraphs/parameters.html

• Shiny: http://shiny.rstudio.com/gallery/

Preliminaries

The following packages are used in this code and may need to be installed: lattice, latticeExtra, ggplot2, mosaic, mosaicData, vcd, vcdExtra, mapproj, scatterplot3d, mvtnorm, rgl, dplyr, ggvis, lubridate, nycflights13, animation.
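A quick way to see which of these packages are still missing is sketched below. The helper name missing_packages is made up for this illustration, and the install call is left commented out so nothing is downloaded by accident.

```r
# Packages named in the list above.
pkgs <- c("lattice", "latticeExtra", "ggplot2", "mosaic", "mosaicData",
          "vcd", "vcdExtra", "mapproj", "scatterplot3d", "mvtnorm",
          "rgl", "dplyr", "ggvis", "lubridate", "nycflights13", "animation")

# Hypothetical helper: return the subset of `pkgs` that is not installed.
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

to_install <- missing_packages(pkgs)
print(to_install)

# Uncomment to install from your configured CRAN mirror:
# if (length(to_install) > 0) install.packages(to_install)
```

Running it first means a long knit does not stop halfway because one package is absent.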

This presentation is about understanding a bit more about graphics in R and Statistical Visualization in general. The first thing to understand is that the purpose of a graphic is to convey information. That may be information we know, for example when creating graphics for publication. Often, though, the information conveyed was unknown before seeing the graphic, as during exploratory data analysis (John Tukey [2]).

[2] http://www.amstat.org/about/statisticiansinhistory/index.cfm?fuseaction=biosinfo&BioID=14

The difference between art and visualization is that art inherently depends on the viewer, invoking in each viewer a different experience. The artist strives to encourage the viewer to combine the sensory experience with their own experiences to make each perception of the art unique. Visualization is meant to convey or uncover specific meaning in all viewers. Human perception is a function of both the sensory experience and the viewer’s model of the object being perceived. For example, the first time I went up in an airplane, nothing outside the windows was recognizable. Not until I had developed a sufficient 3D model of the environment could I recognize places that were totally familiar to me from the ground. R graphics software uses underlying statistical models to give the graphics a common model that allows the analyst to convey information to the viewer.

The graphics capabilities of R have developed over decades of open source development. Although there are a lot of stand-alone graphics packages and functions available in R, there are 3 main graphics systems available.

System    Comment
base      Simplest and oldest but still used frequently
lattice   Mature, publication quality, easy to use
ggplot2   Most sophisticated, hardest to learn, publication quality

We’ll first go rather quickly through a couple of examples focusing on the differences.

One of the datasets we will use for quite a few of these examples is the mtcars data included with R. Let’s take a quick look at the first few rows of that data:


library(xtable)

options(xtable.comment = FALSE)

options(xtable.booktabs = TRUE)

options(xtable.type = knitr::opts_knit$get("rmarkdown.pandoc.to"))

xtable(head(mtcars[, 1:6]), caption = "First rows of mtcars")

                      mpg   cyl    disp      hp  drat    wt
Mazda RX4           21.00  6.00  160.00  110.00  3.90  2.62
Mazda RX4 Wag       21.00  6.00  160.00  110.00  3.90  2.88
Datsun 710          22.80  4.00  108.00   93.00  3.85  2.32
Hornet 4 Drive      21.40  6.00  258.00  110.00  3.08  3.21
Hornet Sportabout   18.70  8.00  360.00  175.00  3.15  3.44
Valiant             18.10  6.00  225.00  105.00  2.76  3.46

Table 2: First rows of mtcars

Base or traditional graphics is the simplest and oldest form of graphics. Examples are simple box plots (boxplot), scatter plots (plot) or histograms (hist). Box plots and histograms summarize a single quantitative variable; a scatter plot relates two.

hist(mtcars$mpg)

Figure 1: Base graphics histogram of mpg (x axis: mtcars$mpg, y axis: Frequency)
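Base graphics functions usually compute a summary and then draw it. As a small sketch of that split, hist() can be asked to return the bins without plotting, which makes it easy to inspect what the figure is built from. The variable name h is arbitrary.

```r
# Compute the histogram bins for mpg without drawing anything.
h <- hist(mtcars$mpg, plot = FALSE)

# Bin edges chosen by the default rule, and the count in each bin.
print(h$breaks)
print(h$counts)

# The stored object can be drawn later with ordinary base graphics arguments.
plot(h, main = "Histogram of mtcars$mpg", xlab = "mpg")
```

The counts always sum to the number of observations, which is a handy sanity check when you change the breaks.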

Lattice

Lattice graphics [3] is the system developed by Deepayan Sarkar [4]. It focuses mainly on trellis plots, which are multiple panels allowing us to explore one or more dependent variables with respect to one or many independent variables. The lattice package is included with all R distributions by default, so we only need to load it into the current session to use its commands. It can, however, do single variable graphics just like the histogram we did with base graphics.

[3] Lattice Tutorial: http://www.isid.ac.in/~deepayan/R-tutorials/labs/04_lattice_lab.pdf

[4] http://www.isid.ac.in/~deepayan/

require(lattice)

## Loading required package: lattice

histogram(~mpg, mtcars)

Figure 2: Lattice histogram of mpg (x axis: mpg, y axis: Percent of Total)

Or we can look at mpg by cylinders.

histogram(~mpg | cyl, mtcars)

Figure 3: Lattice histogram of mpg in three panels, each strip labeled cyl (x axis: mpg, y axis: Percent of Total)
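In the call above cyl is numeric, so the panel strips show only the variable name. A minimal sketch of one way to get the level in each strip, assuming it is acceptable to convert cyl to a factor in a copy of the data (mtcars_cyl is a name invented here):

```r
library(lattice)

# Treat cylinder count as a categorical conditioning variable so each
# strip shows its level (4, 6 or 8).
mtcars_cyl <- transform(mtcars, cyl = factor(cyl))

p <- histogram(~mpg | cyl, data = mtcars_cyl, layout = c(3, 1))
print(p)  # lattice plots are objects; print() draws them
```

Because lattice returns an object, the plot can be stored, modified with update(), and printed later.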


ggplot2: An Implementation of the Grammar of Graphics

The ggplot system is the system developed by Hadley Wickham [5]. It is a highly object-oriented graphics system based on The Grammar of Graphics by Leland Wilkinson [6].

[5] http://had.co.nz/

[6] http://www.cs.uic.edu/~wilkinson/TheGrammarOfGraphics/GOG.html

Shameless plug...

For those that might want to go further... We’re offering a 3 credit course this summer [7]. STA 380/580 Statistical Visualization: An exploration of statistical visualization using R, Tableau, SAS JMP and SAS Enterprise Guide. Emphasis will be on the grammar of graphics and the graphical communication of quantitative information, including graphical depictions for most standard statistical procedures and the use of animations. A semester of statistics and only limited familiarity with programming will be assumed.

[7] Summer 6-weeks session 2015. PEW Campus Eberhard Center room 612, 6:00-9:20 Tuesday/Thursday June 22 - August 4

Ok. Commercial break over, back to work... Let’s look at our histograms using ggplot2.

# install.packages('ggplot2', dependencies = TRUE)

require(ggplot2)

## Loading required package: ggplot2

ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 3)

Figure 4: ggplot2 histogram of mpg (x axis: mpg, y axis: count)

require(ggplot2)

ggplot(mtcars, aes(x = mpg)) + geom_histogram(binwidth = 3) +

  facet_wrap(~cyl)

Figure 5: ggplot2 histogram of mpg, faceted by cyl (panels 4, 6 and 8; x axis: mpg, y axis: count)

Donner Party with Logistic Regression

require(mosaic)

require(mosaicData)

require(vcd)

require(vcdExtra)

The Donner data frame in vcdExtra gives details on the survival of 90 members of the Donner party, a group of people who attempted to migrate to California in 1846. They were trapped by an early blizzard on the eastern side of the Sierra Nevada mountains, and before they could be rescued, nearly half of the party had died. What factors affected who lived and who died?


data(Donner)

# separate logistic regression fits on age for M/F
# (ggplot2 >= 2.0 takes the GLM family through method.args)
ggplot(Donner, aes(age, survived, color = sex)) +
  geom_point(position = position_jitter(height = 0.02, width = 0)) +
  stat_smooth(method = "glm", method.args = list(family = binomial),
              formula = y ~ x, alpha = 0.2, size = 2, aes(fill = sex))

Figure: survived (0.00 to 1.00) against age (0 to 60), jittered points with fitted curves, colored by sex (Female, Male)
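The smooth curves in this graphic are fitted probabilities from a logistic regression. The sketch below shows the same kind of fit in base R. It uses mtcars as a stand-in (transmission type against weight) only because that data ships with R; it is not the Donner analysis, and the names fit and grid are illustrative.

```r
# Stand-in data: mtcars ships with R. `am` is 0/1 (automatic/manual).
fit <- glm(am ~ wt, data = mtcars, family = binomial)

# Fitted probabilities on a grid of weights; type = "response" applies
# the inverse logit so values are on the 0-1 scale.
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 50))
grid$p_manual <- predict(fit, newdata = grid, type = "response")

plot(am ~ wt, data = mtcars, xlab = "Weight (1000 lbs)",
     ylab = "P(manual transmission)")
lines(grid$wt, grid$p_manual, lwd = 2)
```

The S-shaped line is what stat_smooth draws for each group when it is given a binomial GLM.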

Now let’s just do some graphics for the fun of it!

First: Multipanel base graphics

We can build up graphics using multiple panels with a traditional graphic in each panel. For example we can look at various aspects of the mtcars data in a 2x2 layout. Arbitrarily complex layouts can be created, but we really don’t have time for that here.

Note that I didn’t put this on the side margin. These did not scale well.

par(mfrow = c(2, 2))

hist(mtcars$mpg, main = "", xlab = "mpg")

boxplot(mtcars$mpg, main = "miles/gallon")

plot(mtcars$hp, mtcars$mpg, xlab = "horsepower",

ylab = "mpg")

plot(density(mtcars$mpg), main = "")

Figure: 2x2 panel of base graphics: histogram of mpg, box plot titled miles/gallon, scatter plot of mpg against horsepower, and density of mpg (N = 32, Bandwidth = 2.477)
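For layouts beyond a regular grid, base graphics offers layout(). The following is a small sketch, not part of the original talk, in which one panel spans the top row and two panels share the bottom row. The matrix m and the name n_panels are illustrative.

```r
# Layout matrix: panel 1 spans the top row, panels 2 and 3 share the bottom.
m <- matrix(c(1, 1,
              2, 3), nrow = 2, byrow = TRUE)

n_panels <- layout(m)   # returns the number of panels defined

plot(mtcars$hp, mtcars$mpg, xlab = "horsepower", ylab = "mpg")
hist(mtcars$mpg, main = "", xlab = "mpg")
boxplot(mtcars$mpg, main = "miles/gallon")

layout(1)               # back to a single panel for later plots
```

Each number in the matrix names the panel that occupies that cell, so repeating a number makes a panel span several cells.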

A bit more advanced use of base graphics

It might be the old base stuff, but there’s a lot that can be, and has been, done with it.

with(mtcars, {
    # radius chosen so that circle area is proportional to displacement
    r <- sqrt(disp/pi)
    # wt is on the x axis and mpg on the y axis, so label them accordingly
    symbols(wt, mpg, circles = r, inches = 0.3,
        main = "Bubble Plot with point size\nproportional to displacement",
        fg = "white", bg = "lightblue", xlab = "Weight of Car (lbs/1000)",
        ylab = "Miles Per Gallon")
    text(wt, mpg, rownames(mtcars), cex = 0.6)
})

Figure: Bubble plot with point size proportional to displacement; each bubble is labeled with the name of the car (Mazda RX4 through Volvo 142E)
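The radius in the bubble plot is sqrt(disp/pi) so that the area of each circle, not its radius, tracks displacement. A short check of that arithmetic, with variable names chosen for illustration:

```r
# Radius chosen so that circle area (pi * r^2) equals displacement.
r <- sqrt(mtcars$disp / pi)
area <- pi * r^2

# Areas reproduce disp, so a car with twice the displacement gets twice the area.
print(all.equal(area, mtcars$disp))

# Using disp directly as the radius would exaggerate: area would grow with disp^2.
ratio_area   <- max(area) / min(area)
ratio_radius <- (max(mtcars$disp) / min(mtcars$disp))^2
print(c(area_scaling = ratio_area, radius_scaling = ratio_radius))
```

The inches argument of symbols() rescales all circles so the largest has the given radius, which keeps the areas proportional.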


A lattice of cloud plots

We’ve seen several of these lattice graphics today, so we’ll jump right into some more interesting capabilities. Remember this graphic: we’ll come back to this data later with a dynamic 3D graphic that does an even better job than this one. This is a single quantitative dependent variable viewed by two quantitative independent variables and a categorical independent variable. That’s why we see the formula in the cloud function call [8]. One of the great things about lattice is this explicit model in the function call making the underlying model clear.

[8] Sepal.Length ~ Petal.Length * Petal.Width | Species

## Lattice of 3d graphics using the Iris data

cloud(Sepal.Length ~ Petal.Length * Petal.Width |

Species, data = iris, screen = list(x = -90,

y = 70), distance = 0.4, zoom = 0.6, zlab = list(rot = 90,

cex = 0.5), xlab = list(rot = 45, cex = 0.5),

ylab = list(cex = 0.5))

Figure: cloud plots of Sepal.Length against Petal.Length and Petal.Width, one panel per species (setosa, versicolor, virginica)


The obligatory map graphic

What would a visualization discussion be without one of those map graphics? This one looks at U.S. cancer rates.

# mapplot() and USCancerRates come from latticeExtra; map() comes from maps
require(latticeExtra)
require(maps)
require(mapproj)
data(USCancerRates)
rng <- with(USCancerRates, range(rate.male, rate.female, finite = TRUE))
nbreaks <- 50
# break points equally spaced on the log scale
breaks <- exp(do.breaks(log(rng), nbreaks))
suppressWarnings(print(mapplot(rownames(USCancerRates) ~ rate.male + rate.female,
    data = USCancerRates, breaks = breaks,
    map = map("county", plot = FALSE, fill = TRUE, projection = "tetra"),
    scales = list(draw = FALSE), xlab = "",
    main = "Average yearly deaths due to cancer\nper 100000")))

Figure: county-level maps titled “Average yearly deaths due to cancer per 100000”, with panels rate.male and rate.female and a color key running from 100 to 600
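The breaks in the map code are equally spaced on the log scale, which gives finer color resolution at low rates. A minimal sketch of that construction follows; the range 100 to 600 is an assumption read off the color key, not the actual data range.

```r
library(lattice)

# Assumed stand-in range, matching the color key limits (100 to 600).
rng <- c(100, 600)
nbreaks <- 50

# do.breaks() returns nbreaks + 1 equally spaced points; spacing them on
# the log scale and exponentiating gives geometrically spaced breaks.
breaks <- exp(do.breaks(log(rng), nbreaks))

# Each break is the previous one times a constant factor.
ratios <- breaks[-1] / breaks[-length(breaks)]
print(range(ratios))
```

With a constant ratio between breaks, each color step represents the same percentage change in the rate.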


An advanced application of Lattice graphics

This image is going to be complex [9], so don’t worry too much about the preparations. They’ll go on for a page or so setting up the data (pulled from the web), defining a special strip function for the time series ‘rows’ in the graphic, setting up limits, etc.

[9] Source Author: Alastair Sanderson - Astrophysics & Space Research (ASR) Group at the University of Birmingham, Source URL: http://www.sr.bham.ac.uk/~ajrs/R/r-gallery.html

Plotting time series using lattice graphics

An R chart of daily weather measurements taken at the University of Birmingham Wast Hills Observatory, using the excellent R lattice graphics package. The measurements are averaged over the 5 hours around midday except for rainfall, which is averaged over a week.

# --Load previously saved data from
# Sanderson's website
path <- "http://www.sr.bham.ac.uk/~ajrs/R/datasets"
# keep a handle on the connection so that the same one can be closed
con <- url(paste(path, "middayweather.RData", sep = "/"))
a <- load(con)
close(con)
print(a) # list names of saved objects

## [1] "middayweather"

# --Load extra libraries:

require(lattice)

# --Define plot titles:

lab.wind.speed <- "Wind speed (mph)"

lab.hum <- "Humidity (%)"

lab.rain <- "Rainfall (mm/day, averaged over a week)"

lab.bar <- "Air pressure (mb)"

lab.T.out <- as.expression(expression(paste("Outside temperature (",

degree * C, ")")))

# --Custom strip function: (NB the colour used

# is the default lattice strip background

# colour)

my.strip <- function(which.given, which.panel,

...) {

strip.labels <- c(lab.wind.speed, lab.hum,

lab.rain, lab.bar, lab.T.out)

panel.rect(0, 0, 1, 1, col = "#ffe5cc", border = 1)

panel.text(x = 0.5, y = 0.5, adj = c(0.5,

0.55), cex = 0.95, lab = strip.labels[which.panel[which.given]])


}

# --Define X axis date range:

xlim <- range(middayweather$Date)

# --Define annual quarters for plot grid line

# markers:

d <- seq(from = as.Date("2006-01-01"), to = as.Date("2011-01-01"),

by = 365/4)

# --Define colours for raw & smoothed data:

col.raw <- "#377EB8" #colset[2] } see note above

col.smo <- "#E41A1C" #colset[1] }

col.lm <- "grey20"

Note the equation [10] in the xyplot depicting this as a graphic with five dependent variables and one independent variable.

[10] wind.speed + hum.out + rain + bar + T.out ~ Date

# --Create multipanel plot:

xyplot(wind.speed + hum.out + rain + bar + T.out ~

Date, data = middayweather, scales = list(y = "free",

rot = 0), xlim = xlim, strip = my.strip, outer = TRUE,

layout = c(1, 5, 1), ylab = "", panel = function(x,

y, ...) {

# plot default horizontal gridlines

panel.grid(h = -1, v = 0)

# plot vertical gridlines at the quarter dates

panel.abline(v = d, col = "grey90")

# raw data

panel.xyplot(x, y, ..., type = "l", col = col.raw,

lwd = 0.5)

# smoothed data

panel.loess(x, y, ..., col = col.smo,

span = 0.14, lwd = 0.5)

# median value

panel.abline(h = median(y, na.rm = TRUE),

lty = 2, col = col.lm, lwd = 1)

}, key = list(title = "Birmingham Wast Hills Observatory\naverage midday weather",

text = list(c("raw data", "smoothed curve",

"median value")), col = c(col.raw,

col.smo, col.lm), lty = c(1, 1, 2),

columns = 2, cex = 0.8, lines = TRUE))

Figure: Birmingham Wast Hills Observatory average midday weather, plotted against Date (2007-01 to 2010-01) in five stacked panels: Wind speed (mph), Humidity (%), Rainfall (mm/day, averaged over a week), Air pressure (mb) and Outside temperature (°C). Key: raw data, smoothed curve, median value.
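The multi-response formula can be tried on a much smaller scale. The sketch below uses the airquality data that ships with R (daily New York readings from 1973) as a stand-in for the weather data; the data frame name aq and the constructed Date column are illustrative.

```r
library(lattice)

# Build a Date column from the Month and Day columns (the data are from 1973).
aq <- airquality
aq$Date <- as.Date(ISOdate(1973, aq$Month, aq$Day))

# Three responses against one time axis. outer = TRUE gives each response
# its own panel, and free y scales let each panel use its own range.
p <- xyplot(Ozone + Wind + Temp ~ Date, data = aq, outer = TRUE,
            type = "l", scales = list(y = "free"),
            layout = c(1, 3), ylab = "")
print(p)
```

Without outer = TRUE the three series would be superposed in a single panel on a common y scale.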


Other advanced graphics

scatterplot3d

3D scatterplot with a fitted linear regression plane and reference points.

require(scatterplot3d)

## Loading required package: scatterplot3d

data(trees)

# Create the scatterplot

s3d <- scatterplot3d(trees, type = "h", color = "blue",

angle = 55, scale.y = 0.7, pch = 16, main = "Trees")

# fit a multiple regression model

my.lm <- lm(trees$Volume ~ trees$Girth + trees$Height)

# add the model plane to the graphic

s3d$plane3d(my.lm)

# mark red reference points at (10,85,60),

# (12,80,50), etc.

s3d$points3d(seq(10, 20, 2), seq(85, 60, -5),

seq(60, 10, -10), col = "red", type = "h",

pch = 8)

Figure: 3D scatterplot titled “Trees” with axes Girth, Height and Volume
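The plane added by plane3d() is just the fitted regression equation. As a hedged sketch, the same model can be written with a data argument, which also lets predict() evaluate the plane at new points. The names fit, ref and by_hand are illustrative.

```r
# Same model as above, written with a data argument so that predict()
# can be given new observations.
fit <- lm(Volume ~ Girth + Height, data = trees)
print(coef(fit))

# Height of the plane at the first reference point (Girth 10, Height 85).
ref <- data.frame(Girth = 10, Height = 85)
plane_at_ref <- predict(fit, newdata = ref)

# The same value computed directly from the plane equation.
by_hand <- sum(coef(fit) * c(1, ref$Girth, ref$Height))
print(c(predict = unname(plane_at_ref), by_hand = by_hand))
```

Writing the formula with bare column names and data = trees is generally preferable to trees$Volume style terms, because prediction on new data then works as expected.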


Fancy 3D plot of a bivariate normal distribution with equations.

# install.packages('mvtnorm')

require(mvtnorm)

## Loading required package: mvtnorm

require(scatterplot3d)

x1 <- x2 <- seq(-10, 10, length = 51)

dens <- matrix(dmvnorm(expand.grid(x1, x2), sigma = rbind(c(3,

2), c(2, 3))), ncol = length(x1))

# Plot the bounding box

s3d <- scatterplot3d(x1, x2, seq(min(dens), max(dens),

length = length(x1)), type = "n", grid = FALSE,

angle = 70, main = "Bivariate normal distribution",

zlab = expression(f(x[1], x[2])), xlab = expression(x[1]),

ylab = expression(x[2]))

# add a little fancy equation labeling

text(s3d$xyz.convert(-1, 10, 0.07), labels = expression(f(x) ==

frac(1, sqrt((2 * pi)^n * phantom(".") * det(Sigma[X]))) *phantom(".") * exp * {

bgroup("(", -scriptstyle(frac(1, 2) *phantom(".")) * (x - mu)^T * Sigma[X]^-1 *(x - mu), ")")

}), cex = 0.5)

text(s3d$xyz.convert(1.5, 10, 0.05), labels = expression("with" *phantom("m") * mu == bgroup("(", atop(0, 0),

")") * phantom(".") * "," * phantom(0) * {

Sigma[X] == bgroup("(", atop(3 * phantom(0) *2, 2 * phantom(0) * 3), ")")

}), cex = 0.5)

# Add the density surface

for (i in length(x1):1) s3d$points3d(rep(x1[i],

length(x2)), x2, dens[i, ], type = "l")

for (i in length(x2):1) s3d$points3d(x1, rep(x2[i],

length(x1)), dens[, i], type = "l")

Figure: 3D surface plot titled “Bivariate normal distribution”, with axes x1 and x2 (each from −10 to 10) and f(x1, x2) (from 0.00 to 0.08), annotated with the density

$$f(x) = \frac{1}{\sqrt{(2\pi)^n \det(\Sigma_X)}} \exp\left(-\frac{1}{2}(x-\mu)^T \Sigma_X^{-1} (x-\mu)\right)$$

with $\mu = \begin{pmatrix} 0 \\ 0 \end{pmatrix}$, $\Sigma_X = \begin{pmatrix} 3 & 2 \\ 2 & 3 \end{pmatrix}$
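The density formula shown on the plot can be evaluated directly in base R. This is a minimal sketch with a helper name (dens_by_hand) invented here; it handles one point at a time and is meant for checking the formula, not for replacing dmvnorm.

```r
mu    <- c(0, 0)
Sigma <- rbind(c(3, 2), c(2, 3))

# Hypothetical helper: multivariate normal density at a single point x.
dens_by_hand <- function(x, mu, Sigma) {
  n <- length(mu)
  d <- x - mu
  quad <- drop(t(d) %*% solve(Sigma) %*% d)   # (x - mu)^T Sigma^-1 (x - mu)
  exp(-quad / 2) / sqrt((2 * pi)^n * det(Sigma))
}

peak <- dens_by_hand(c(0, 0), mu, Sigma)
print(peak)
print(dens_by_hand(c(1, 1), mu, Sigma))
```

At the mean the exponent is zero, so the peak is 1 / (2 * pi * sqrt(5)), about 0.071, which agrees with the vertical axis of the plot.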

Note that we’re using R Markdown for the following, but the R output cannot be compiled using RStudio since there is a lot of user interaction and animation going on throughout that cannot be rendered into a static document. So only the code comes out in the handout.

rgl: 3D visualization device system (OpenGL)

A common visualization is a 3D point cloud. We saw some static point clouds in the previous work. Here the point cloud is dynamic, allowing us to move it around to see the structure.

require(rgl)

with(iris, plot3d(Sepal.Length, Sepal.Width, Petal.Length,

type = "s", col = as.numeric(Species)))

lattice interactive plotting

With the lattice package we can interact with the plot, highlighting interesting points and extracting those highlighted points for further analysis.

Using environmental data


env <- environmental

env$ozone <- env$ozone^(1/3)

splom(env, pscales = 0, col = "grey")

trellis.focus("panel", 1, 1, highlight = FALSE)

interesting <- panel.link.splom(pch = 16, col = "black")

trellis.unfocus()

env[interesting, ]

ggvis

The new ggvis graphics system is being developed by Winston Chang for interactive exploration.

ggvis works seamlessly with R Markdown v2 and interactive documents, so you can easily add interactive graphics to your R Markdown documents. ggvis plots are inherently reactive and they render in the browser, so they can take advantage of the capabilities provided by modern web browsers.

We’ll need to pull in some packages first for these examples.

require(dplyr)

require(ggvis)

require(lubridate)

Note that ggvis graphics will go to a web browser if you’re not knitting the file or using RStudio.

ggvis basics

The goal of ggvis is to make it easy to build interactive graphics for exploratory data analysis. ggvis has a similar underlying theory to ggplot2 (the grammar of graphics), but it’s expressed a little differently, and adds new features to make your plots interactive. ggvis also incorporates shiny’s reactive programming model and dplyr’s grammar of data transformation.

The graphics produced by ggvis are fundamentally web graphics and work very differently from traditional R graphics. This allows us to implement exciting new features like interactivity, but it comes at a cost. For example, every interactive ggvis plot must be connected to a running R session (static plots do not need a running R session to be viewed). This is great for exploration, because you can do anything in your interactive plot you can do in R, but it’s not so great for publication. We will overcome these issues in time, but for now be aware that we have many existing tools to re-implement before you can do everything with ggvis that you can do with base graphics.


Every ggvis graphic starts with a call to ggvis(). The first argument is the data set that you want to plot, and the other arguments describe how to map variables to visual properties.

p <- ggvis(mtcars, x = ~wt, y = ~mpg)

This doesn’t actually plot anything because you haven’t told ggvis how to display your data. You do that by layering visual elements, for example with layer_points():

layer_points(p)

(If you’re not using RStudio, you’ll notice that this plot opens in your web browser. That’s because all ggvis graphics are web graphics, and need to be shown in the browser. RStudio includes a built-in browser so it can show you the plots directly.)

All ggvis functions take the visualization as the first argument and return a modified visualization. This seems a little bit awkward. Either you have to create temporary variables and modify them, or you have to use a lot of parentheses:

layer_points(ggvis(mtcars, x = ~wt, y = ~mpg))

To make life easier ggvis uses the %>% function from the magrittr package to chain commands together, much like the ‘+’ used in ggplot. That allows you to rewrite the previous function call as:

mtcars %>% ggvis(x = ~wt, y = ~mpg) %>% layer_points()

# disp by mpg, converting engine displacement
# (cubic inches) to litres

mtcars %>% ggvis(x = ~mpg, y = ~disp) %>% mutate(disp = disp/61.0237) %>%

layer_points()

The format of the visual properties needs a little explanation. We use ~ before the variable name to indicate that we don’t want to literally use the value of the mpg variable (which doesn’t exist), but instead we want to use the mpg variable inside the dataset. This is a common pattern in ggvis: we’ll always use formulas to refer to variables inside the dataset.
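What the ~ does can be seen in base R without ggvis. The sketch below is an illustration of the general mechanism (a one-sided formula captures an expression for later evaluation), not a description of ggvis internals; the names f and wt_values are arbitrary.

```r
# A one-sided formula stores the expression `wt` without evaluating it.
f <- ~wt
print(class(f))      # "formula"
print(all.vars(f))   # the variable name it refers to

# The stored expression can be evaluated later inside a data frame,
# where a column called wt does exist.
wt_values <- eval(f[[2]], envir = mtcars)
print(head(wt_values))
```

Deferring evaluation this way is what lets a plotting function look the name up in whatever data set it is handed.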

The first two arguments to ggvis() are usually the position, so by convention you can drop x and y:

mtcars %>% ggvis(~mpg, ~disp) %>% layer_points()

You can add more variables to the plot by mapping them to other visual properties like fill, stroke, size and shape.


mtcars %>% ggvis(~mpg, ~disp, stroke = ~vs) %>%

layer_points()

mtcars %>% ggvis(~mpg, ~disp, fill = ~vs) %>%

layer_points()

mtcars %>% ggvis(~mpg, ~disp, size = ~vs) %>%

layer_points()

mtcars %>% ggvis(~mpg, ~disp, shape = ~factor(cyl)) %>%

layer_points()

If you want to make the points a fixed colour or size, you need to use := instead of =. The := operator means to use a raw, unscaled value. This seems like something that ggvis() should be able to figure out by itself, but making it explicit allows you to create some useful plots that you couldn’t otherwise. See the ggvis documentation on properties and scales for more details.

mtcars %>% ggvis(~wt, ~mpg, fill := "red", stroke := "black") %>%
  layer_points()

mtcars %>% ggvis(~wt, ~mpg, size := 300, opacity := 0.4) %>%
  layer_points()

mtcars %>% ggvis(~wt, ~mpg, shape := "cross") %>%
  layer_points()

mtcars %>% ggvis(~mpg, ~disp, stroke = ~wt, fill = ~hp,

size = ~vs, shape = ~factor(cyl)) %>% layer_points() %>%

add_legend("fill", properties = legend_props(legend = list(y = 50))) %>%

add_legend("size", properties = legend_props(legend = list(y = 150))) %>%

add_legend("shape", properties = legend_props(legend = list(y = 250)))

ggvis Interaction

As well as mapping visual properties to variables or setting them to specific values, you can also connect them to interactive controls.

The following example allows you to control the size and opacity of points with two sliders:

mtcars %>% ggvis(~wt, ~mpg, size := input_slider(10, 100),
                 opacity := input_slider(0, 1)) %>%
  layer_points()

You can also connect interactive components to other plot parameters like the width and centers of histogram bins:

mtcars %>% ggvis(~wt) %>% layer_histograms(width = input_slider(0,

2, step = 0.1, label = "width"), center = input_slider(0,

2, step = 0.05, label = "center"))


Behind the scenes, interactive plots are built with shiny, and you can currently only have one running at a time in a given R session. To finish with a plot, press the stop button in RStudio, or close the browser window and then press Escape or Ctrl + C in R.

As well as input_slider(), ggvis provides input_checkbox(), input_checkboxgroup(), input_numeric(), input_radiobuttons(), input_select() and input_text(). See the examples in the documentation for how you might use each one.

You can also use keyboard controls with left_right() and up_down(). Press the left and right arrows to control the size of the points in the next example.

keys_s <- left_right(10, 1000, step = 50)

mtcars %>% ggvis(~wt, ~mpg, size := keys_s, opacity := 0.5) %>%
  layer_points()

You can also add on more complex types of interaction like tooltips:

mtcars %>% ggvis(~wt, ~mpg) %>% layer_points() %>%

add_tooltip(function(df) df$wt)

As well as input_slider(), which produces a slider (or a double-ended range slider), there are a number of other interactive controls:

• input_checkbox(): a check-box
• input_checkboxgroup(): a group of check boxes
• input_numeric(): a spin box
• input_radiobuttons(): pick one from a set of options
• input_select(): create a drop-down text box
• input_text(): arbitrary text input

Note that all interactive inputs start with input_ so that you can always use tab completion to remind you of the options.

Here is an example of a plot that uses both a slider and a select box:

mtcars %>% ggvis(x = ~wt) %>% layer_densities(adjust = input_slider(0.1,

2, value = 1, step = 0.1, label = "Bandwidth adjustment"),

kernel = input_select(c(Gaussian = "gaussian",

Epanechnikov = "epanechnikov", Rectangular = "rectangular",

Triangular = "triangular", Biweight = "biweight",

Cosine = "cosine", Optcosine = "optcosine"),

label = "Kernel"))


slicing animations

require(ggplot2)

require(dplyr)

# require(animation)

require(nycflights13)

Generate an animation

Looking at diamonds price by carats

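# the axis limits must be defined before they are used
max.carat <- max(diamonds$carat)
max.price <- max(diamonds$price)
ggplot(diamonds, aes(carat, price)) +
  geom_point(aes(color = clarity, shape = cut)) +
  ylim(0, max.price) + xlim(0, max.carat) + stat_smooth()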

Instead of trying to put it all on the display at once, slice it into layers and look at it in an animation. We can do a simple animation here in the RStudio plot window. A better display is to save the series of graphics in an animated GIF. However, the saveGIF function requires a copy of the ImageMagick software, which is not on the lab computers. We can run the animation in the plot window and look at the animated GIFs I generated on my laptop.

max.carat <- max(diamonds$carat)

max.price <- max(diamonds$price)

# saveGIF( movie.name = 'diamonds.gif',

for (this.cut in c("Fair", "Good", "Very Good",

"Premium", "Ideal")) for (this.clarity in c("I1",

"SI1", "SI2", "VS1", "VS2", "VVS1", "VVS2",

"IF")) print(diamonds %>% filter(cut == this.cut &

clarity == this.clarity) %>% ggplot(aes(carat,

price)) + geom_point() + ylim(0, max.price) +

xlim(0, max.carat) + stat_smooth() + ggtitle(paste("Cut is",

this.cut, "and clarity is", this.clarity)))

# )
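The key to a readable slicing animation is that every frame shares the same axis limits, so that only the data changes between frames. A stripped-down sketch of the same idea in base graphics, using mtcars sliced by cylinder count as a stand-in for the diamonds data (all names here are illustrative):

```r
# Shared axis limits, computed once from the full data set.
xlim_all <- range(mtcars$wt)
ylim_all <- range(mtcars$mpg)

slices <- sort(unique(mtcars$cyl))
frames_drawn <- 0

for (this.cyl in slices) {
  slice_df <- subset(mtcars, cyl == this.cyl)
  plot(slice_df$wt, slice_df$mpg, xlim = xlim_all, ylim = ylim_all,
       xlab = "wt", ylab = "mpg",
       main = paste("Cylinders:", this.cyl))
  frames_drawn <- frames_drawn + 1
  # Sys.sleep(1)  # uncomment to pause between frames when run interactively
}
```

If each frame were allowed to pick its own limits, the points would appear to jump around even when the underlying relationship is unchanged.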

A more traditional slicing is based on time. For this we look at the NYC flights data by month.

ggplot(flights, aes(dep_delay, arr_delay)) + geom_point(aes(color = origin)) +

facet_wrap(~origin) + stat_smooth(method = "lm")

max.dep.delay <- max(flights$dep_delay, na.rm = TRUE)

min.dep.delay <- min(flights$dep_delay, na.rm = TRUE)

max.arr.delay <- max(flights$arr_delay, na.rm = TRUE)

min.arr.delay <- min(flights$arr_delay, na.rm = TRUE)

# saveGIF( movie.name = 'nycflights.gif',

for (i in 1:12) print(flights %>% filter(month ==


i) %>% ggplot(aes(dep_delay, arr_delay)) +

geom_point() + xlim(min.dep.delay, max.dep.delay) +

ylim(min.arr.delay, max.arr.delay) + stat_smooth(method = "lm") +

facet_wrap(~origin) + ggtitle(paste("Flights for",

month.name[i])))

# )

Let’s now take a quick look at a couple of commercial visualization packages.

1) SAS JMP

2) Tableau

SAS JMP

The SAS product JMP (pronounced jump) was developed somewhat independently from the main SAS statistical products as a new approach to statistical analysis. From the start it has incorporated graphics as an integral part of the statistical analysis process. It has caught on in industry because of its ease of use and relatively low cost.

The JMP graph builder is a powerful tool for exploring data. We’ll take a quick look at some of its capabilities using the Diamonds data set we’ve already been working with.


Tableau

Tableau is a ‘relatively’ cheap software system available for Windows and Mac that can work with what the company calls ‘medium’ data. As with R, you will want to preprocess your data using the database system to select subsets and/or aggregate before bringing it into the software. It does, however, have facilities to connect to database systems and a Tableau server engine for larger analysis problems.

Extracting the R source code

Running the following chunk will extract the source code from above into the StatisticalVisualization.R file.

knitr::knit("StatisticalVisualization.Rmd", tangle = TRUE)