data visualization and graphic design special topics

Data visualization and graphic design Special topics Allan Just and Andrew Rundle EPIC Short Course June 24, 2011 Wickham 2008


Page 1: Data visualization and graphic design Special topics

Data visualization and graphic design
Special topics

Allan Just and Andrew Rundle
EPIC Short Course
June 24, 2011

Wickham 2008

Page 2: Data visualization and graphic design Special topics

2

Agenda

Quick hits
• Layer order in Deducer
• Bubble charts
• ggplot2 quasi-beanplot

Being on your own with ggplot2 and R – getting unstuck

Small datasets revisited
Large datasets
Displaying uncertainty

Automated generation of many plots

Extending ggplot2 – direct labels and scatterplot matrices

New geoms

More practice exercises!

Wrap up

Page 3: Data visualization and graphic design Special topics

3

A theory about practice…

Page 4: Data visualization and graphic design Special topics

4

Getting unstuck…
• Check the str() of your data
• Check the console for error messages
• Look at the call for your plot – is that what you wanted?
• Easier to start with something that works but is too simple
  1. Simplify the plot until it works
  2. Add back components one-by-one to isolate the problem
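The checklist above can be sketched in a few lines of R. This is only an illustration, assuming a current ggplot2 and the built-in airquality data; the object names p and p_box are made up for the example.

```r
library(ggplot2)

# Step 1: inspect the structure of the data before plotting.
data(airquality)
str(airquality)   # Month is stored as an integer, not a factor

# Step 2: start from the simplest plot that works.
p <- ggplot(airquality, aes(x = Month, y = Ozone)) +
  geom_point(na.rm = TRUE)
p

# Step 3: add components back one at a time, printing after each,
# so the layer that triggers a problem is easy to identify.
p_box <- p + geom_boxplot(aes(group = Month), na.rm = TRUE, fill = NA)
p_box
```

Printing after every added layer keeps the distance between "last working plot" and "first broken plot" to a single component.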

Page 5: Data visualization and graphic design Special topics

5

Reproducible examples and the ggplot2 mailing list

http://groups.google.com/group/ggplot2

Compose your question well and you might figure out the answer in the process!
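One common way to make a question reproducible is to shrink the data and paste it as R code. The sketch below assumes the built-in airquality data stands in for your own; the name small is illustrative.

```r
# Reduce the problem to a few rows and only the columns involved.
small <- head(airquality[, c("Month", "Ozone")], 5)

# dput() prints R code that recreates the object, so readers can
# paste it straight into their own session.
dput(small)

# Report the R version alongside the question.
R.version.string
```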

Page 6: Data visualization and graphic design Special topics

6

Data + summary
Loss of information

Page 7: Data visualization and graphic design Special topics

7

Data + summary – building this ourselves…

Better than bar charts…

data(airquality)
# open the plot builder and add geom_point
# with x = Month and y = Ozone
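Outside the plot builder, the same "data + summary" idea might look like the sketch below. It assumes ggplot2 3.3.0 or later (for the fun argument of stat_summary); the jitter width, colours, and the names p_summary and monthly_means are arbitrary choices for illustration.

```r
library(ggplot2)
data(airquality)

# Raw observations as points, with the monthly mean overlaid as a red cross.
p_summary <- ggplot(airquality, aes(x = factor(Month), y = Ozone)) +
  geom_point(alpha = 0.5, na.rm = TRUE,
             position = position_jitter(width = 0.1, height = 0)) +
  stat_summary(fun = mean, geom = "point", na.rm = TRUE,
               colour = "red", size = 4, shape = 4) +
  xlab("Month")
p_summary

# The same summary as a table, for checking the plotted means.
monthly_means <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)
monthly_means
```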

Page 8: Data visualization and graphic design Special topics

8

Pseudo beanplots

g_violin_bean <- ggplot(sleep, aes(x = extra)) +
  geom_ribbon(aes(ymax = ..density.., ymin = -..density..),
              stat = "density", fill = "black") +
  geom_segment(aes(y = -.05, yend = .05, xend = extra), color = "grey90") +
  facet_grid(. ~ group, as.table = FALSE, scales = "free_y") +
  opts(panel.margin = unit(0 , "lines")) +
  xlab(NULL) +
  theme_bw(base_size = 20) +
  coord_flip() +
  opts(axis.text.x = theme_blank()) +
  expand_limits(x = c(-5, 9))

g_violin_bean

Page 9: Data visualization and graphic design Special topics

What about large datasets?

Page 10: Data visualization and graphic design Special topics

10

Playing with diamonds…

data(diamonds)
str(diamonds)

With your neighbor: how do we show the data on the carat – price relationship…

Page 11: Data visualization and graphic design Special topics

11

Strategies for large datasets

– Use smaller points - use circles

– Use partial transparency

– Jitter (small random noise) if data take discrete values

– Overlay a smoother to show the trend

– Display a random sample from your data
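The strategies above can be combined in one sketch. It assumes a current ggplot2; the sample size of 2000, the point size, and the alpha values are arbitrary illustrative choices, and the object names are made up.

```r
library(ggplot2)
data(diamonds)

# Display a random sample instead of all rows.
set.seed(1)  # make the sample reproducible
diamonds_sample <- diamonds[sample(nrow(diamonds), 2000), ]

# Small solid circles, partial transparency, and a smoother for the trend.
p_large <- ggplot(diamonds_sample, aes(x = carat, y = price)) +
  geom_point(shape = 16, size = 0.8, alpha = 0.2) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE)

# Jitter horizontally when one variable takes discrete values (here: cut).
p_jitter <- ggplot(diamonds_sample, aes(x = cut, y = price)) +
  geom_jitter(width = 0.25, height = 0, size = 0.8, alpha = 0.2)

p_large
p_jitter
```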

Page 12: Data visualization and graphic design Special topics

12

How do you show 54,000 diamonds?

Partial transparency
Alpha = 0.01
Contours for density
Alpha = 0.1
Hexagonal bins with legend
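Plots along these lines might be built as follows. This is a sketch, not the code behind the slide: it assumes a current ggplot2, that the MASS package is available for the density contours, and it falls back to rectangular bins (geom_bin_2d) when the hexbin package needed by geom_hex is not installed.

```r
library(ggplot2)
data(diamonds)

base <- ggplot(diamonds, aes(x = carat, y = price))

# Partial transparency: dense regions build up darker.
p_alpha <- base + geom_point(alpha = 0.01)

# Contours of the estimated 2D density over lightly shaded points.
p_contour <- base + geom_point(alpha = 0.1) + geom_density_2d()

# Binned counts with a fill legend; hexagons if hexbin is installed,
# otherwise rectangles as a stand-in.
bin_layer <- if (requireNamespace("hexbin", quietly = TRUE)) {
  geom_hex(bins = 40)
} else {
  geom_bin_2d(bins = 40)
}
p_bins <- base + bin_layer
```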

Page 13: Data visualization and graphic design Special topics

13

Displaying uncertainty

• Confidence intervals (uniformly shaded or bounded)
• Pointwise errorbars
• Bayesian simulations
• Resampling based estimates

Page 14: Data visualization and graphic design Special topics

14

Model shouldn’t extend beyond the range of your data
xkcd.com/605/

Page 15: Data visualization and graphic design Special topics

15

Page 16: Data visualization and graphic design Special topics

16

Page 17: Data visualization and graphic design Special topics

17

Page 18: Data visualization and graphic design Special topics

18

Graph your uncertainty
Informal Bayesian Simulation

1. Run regression

2. Draw random numbers based on uncertainty of your regression

3. Plot some lines!

4. Uses the sim() function in package “arm”

$\sigma = \hat{\sigma}\sqrt{(n-k)/X}$ for $X \sim \chi^2_{n-k}$

Gelman and Hill 2007
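The steps can be written out by hand without arm, which may help show what sim() is doing. The sketch below is an assumption-laden illustration: it uses the built-in mtcars data, draws the residual standard deviation by scaling the estimate with a chi-squared draw on n - k degrees of freedom, and then draws coefficients from a multivariate normal around the fitted values. All object names are made up for the example.

```r
library(ggplot2)

set.seed(42)
fit <- lm(mpg ~ wt, data = mtcars)        # 1. run the regression

n_sims <- 20
n <- nrow(mtcars)
k <- length(coef(fit))
sigma_hat <- summary(fit)$sigma
V_beta <- summary(fit)$cov.unscaled        # unscaled covariance, (X'X)^-1

# 2. draw sigma, then coefficients given each sigma
sigma_sim <- sigma_hat * sqrt((n - k) / rchisq(n_sims, df = n - k))
L <- t(chol(V_beta))                       # V_beta = L %*% t(L)
z <- matrix(rnorm(n_sims * k), nrow = k)
noise <- sweep(L %*% z, 2, sigma_sim, "*") # column j scaled by sigma_sim[j]
beta_sim <- t(coef(fit) + noise)           # one row per simulation draw

sims <- data.frame(intercept = beta_sim[, 1], slope = beta_sim[, 2])

# 3. plot some lines: simulated fits in grey, the point estimate in black
p_sim <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_abline(data = sims, aes(intercept = intercept, slope = slope),
              colour = "grey60") +
  geom_abline(intercept = unname(coef(fit)[1]),
              slope = unname(coef(fit)[2])) +
  geom_point()
p_sim
```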

Page 19: Data visualization and graphic design Special topics

19

Informal Bayesian simulation

Figure 3. Association between DEP concentrations in personal air and the urinary metabolite MEP concentrations (adjusted for specific gravity) stratified by perfume use using linear regression of log transformed values. Lighter lines represent predictive uncertainty in regression parameters from informal Bayesian simulations (20 simulation draws with uniform priors). Boxplots show the distribution of MEP with means (“X”). Just et al 2010

Page 20: Data visualization and graphic design Special topics

20

Resampling - Spline after bootstrap

Cosma Shalizi 2010
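A spline-after-bootstrap display can be sketched as below. This is not the code behind the figure: it assumes the built-in mtcars data, 50 resamples, and a smoothing spline with 4 degrees of freedom, all chosen only for illustration.

```r
library(ggplot2)

set.seed(7)
x_grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 50)
n_boot <- 50

# Resample rows with replacement, refit the spline, predict on a common grid.
boot_curves <- do.call(rbind, lapply(seq_len(n_boot), function(b) {
  resampled <- mtcars[sample(nrow(mtcars), replace = TRUE), ]
  fit_b <- smooth.spline(resampled$wt, resampled$mpg, df = 4)
  data.frame(draw = b, wt = x_grid, mpg = predict(fit_b, x_grid)$y)
}))

# Each faint line is one bootstrap fit; their spread shows the uncertainty.
p_boot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_line(data = boot_curves, aes(group = draw), alpha = 0.2) +
  geom_point()
p_boot
```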

Page 21: Data visualization and graphic design Special topics

21

How random is random - the qq-plot

qreference() from package DAAG

Page 22: Data visualization and graphic design Special topics

22

a Q-Q envelope – show range from 19 draws of random normal

Venables and Ripley
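An envelope of this kind can be approximated by hand. The sketch below assumes standardized mtcars$mpg as the sample under test; it sorts 19 simulated standard normal samples of the same size and takes the smallest and largest value at each rank as the band.

```r
library(ggplot2)

set.seed(19)
y <- mtcars$mpg
n <- length(y)
z <- as.numeric(scale(y))                 # standardize the observed sample

# 19 sorted draws of n standard normals: an n x 19 matrix of order statistics.
sims <- replicate(19, sort(rnorm(n)))

envelope <- data.frame(
  theoretical = qnorm(ppoints(n)),
  observed    = sort(z),
  lower       = apply(sims, 1, min),
  upper       = apply(sims, 1, max)
)

# Observed points that leave the grey band look less normal than
# any of the 19 truly normal samples.
p_qq <- ggplot(envelope, aes(x = theoretical)) +
  geom_ribbon(aes(ymin = lower, ymax = upper), fill = "grey80") +
  geom_abline(intercept = 0, slope = 1, linetype = 2) +
  geom_point(aes(y = observed))
p_qq
```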

Page 23: Data visualization and graphic design Special topics

23

Generating many graphs

Example: suppose we wanted to save a separate plot of mileage for each car manufacturer in "mpg"

Start with data formatted so that it is long…

     manufacturer  cty  hwy
1    audi           18   29
2    audi           21   29
25   chevrolet      15   23
26   chevrolet      16   26
100  honda          28   33
101  honda          24   32

Use the magic of R and ggplot2…

Page 24: Data visualization and graphic design Special topics

24

Generating many graphs

• Use d_ply (from the plyr package – also by Hadley Wickham) to split up the dataframe by our subsetting variable

• Define a function to run on subsets; we name these smaller dataframes "dat"

• Call ggplot() and ggsave() within this function to generate and save our plot

Page 25: Data visualization and graphic design Special topics

25

Generating many graphs

# d_ply takes a dataframe, splits it apart, applies a function
d_ply(mpg, .(manufacturer), function(dat) {
  # create a ggplot2 object named figure using 'dat'
  figure <- ggplot(dat, aes(cty, hwy)) +
    geom_smooth(method = "lm") +
    geom_point(alpha = 0.7, size = 2.5,
               position = position_jitter(height = 0.1, width = 0.1)) +
    annotate("text", x = -Inf, y = Inf, hjust = -.1, vjust = 1.2,
             label = paste("n =", nrow(dat))) +
    opts(title = dat$manufacturer[1]) # unique title can help
  # create a unique filename for each subset (e.g. "MPG_Audi.png")
  filename <- paste("MPG_", dat$manufacturer[1], ".png", sep = "")
  # by default this saves to your working directory; see ?getwd
  ggsave(filename, figure, height = 6.5, width = 10)
})
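The same split-apply-save pattern can be sketched without plyr. The version below is an assumption, not the course code: it uses base split() and lapply(), a current ggplot2 (ggtitle() in place of opts()), writes to a temporary directory, and saves PDF rather than PNG only because the pdf device is available on every R installation. The names out_dir, make_plot, plots, and files are made up.

```r
library(ggplot2)

# Write into a temporary folder so the working directory stays clean.
out_dir <- file.path(tempdir(), "mpg_plots")
dir.create(out_dir, showWarnings = FALSE)

# Function applied to each subset; 'dat' is one manufacturer's rows.
make_plot <- function(dat) {
  ggplot(dat, aes(cty, hwy)) +
    geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
    geom_point(alpha = 0.7, size = 2.5,
               position = position_jitter(height = 0.1, width = 0.1)) +
    annotate("text", x = -Inf, y = Inf, hjust = -0.1, vjust = 1.2,
             label = paste("n =", nrow(dat))) +
    ggtitle(dat$manufacturer[1])
}

# Split by manufacturer, then build one plot per subset.
plots <- lapply(split(mpg, mpg$manufacturer), make_plot)

# One file per manufacturer, e.g. "MPG_audi.pdf".
files <- file.path(out_dir, paste0("MPG_", names(plots), ".pdf"))
for (i in seq_along(plots)) {
  ggsave(files[i], plots[[i]], height = 6.5, width = 10)
}
```

Keeping the plots in a named list before saving makes it easy to inspect a single one, for example plots[["honda"]], without rerunning the loop.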

Page 26: Data visualization and graphic design Special topics

26

Extending ggplot2

Let's get some more packages:
install.packages(c("directlabels", "GGally"))

Page 27: Data visualization and graphic design Special topics

27

Extending ggplot2: directlabels

Page 28: Data visualization and graphic design Special topics

28

# original code adapted from http://learnr.wordpress.com

library(ggplot2)
# define the dataset
df <- structure(list(
  City = structure(c(2L, 3L, 1L),
                   .Label = c("Minneapolis", "Phoenix", "Raleigh"),
                   class = "factor"),
  January = c(52.1, 40.5, 12.2), February = c(55.1, 42.2, 16.5),
  March = c(59.7, 49.2, 28.3), April = c(67.7, 59.5, 45.1),
  May = c(76.3, 67.4, 57.1), June = c(84.6, 74.4, 66.9),
  July = c(91.2, 77.5, 71.9), August = c(89.1, 76.5, 70.2),
  September = c(83.8, 70.6, 60), October = c(72.2, 60.2, 50),
  November = c(59.8, 50, 32.4), December = c(52.5, 41.2, 18.6)),
  .Names = c("City", "January", "February", "March", "April", "May",
             "June", "July", "August", "September", "October",
             "November", "December"),
  class = "data.frame", row.names = c(NA, -3L))
# and season labels
seasons <- data.frame(month = c(1.5, 4.5, 7.5, 10.5), value = 97,
                      season = c("Winter", "Spring", "Summer", "Autumn"))

# melt the dataset to a long format
dfm <- melt(df, variable_name = "month")
levels(dfm$month) <- month.abb

# build the basic plot
p <- ggplot(dfm, aes(month, value, group = City, colour = City))
p1 <- p + geom_line(size = 1)
dgr_fmt <- function(x, ...) {
  parse(text = paste(x, "*degree", sep = ""))
}
none <- theme_blank()
p2 <- p1 + theme_bw() +
  scale_y_continuous(formatter = dgr_fmt, limits = c(0, 100),
                     expand = c(0, 0)) +
  xlab(NULL) + ylab(NULL) +
  opts(title = expression("Average Monthly Temperatures (" * degree * "F)"),
       panel.grid.major = none, panel.grid.minor = none,
       legend.position = "none",
       panel.background = none,
       panel.border = none,
       axis.line = theme_segment(colour = "grey50"))

(p3 <- p2 +
  geom_vline(xintercept = c(2.9, 5.9, 8.9, 11.9), colour = "grey85",
             alpha = 0.5) +
  geom_hline(yintercept = 32, colour = "grey80", alpha = 0.5) +
  annotate("text", x = 1.2, y = 35, label = "Freezing",
           colour = "grey80", size = 4) +
  geom_text(data = seasons, aes(label = season, group = NULL),
            colour = "grey70", size = 4))

(p4 <- p3 +
  geom_text(data = dfm[dfm$month == "Dec", ], aes(label = City),
            hjust = 0.7, vjust = 1))

data_table <- ggplot(dfm, aes(x = month, y = factor(City),
                              label = format(value, nsmall = 1),
                              colour = City)) +
  geom_text(size = 3.5) +
  theme_bw() +
  scale_y_discrete(formatter = abbreviate,
                   limits = c("Minneapolis", "Raleigh", "Phoenix")) +
  xlab(NULL) + ylab(NULL) +
  opts(panel.grid.major = none, legend.position = "none",
       panel.border = none, axis.text.x = none, axis.ticks = none,
       plot.margin = unit(c(-0.5, 1, 0, 0.5), "lines"))

Layout <- grid.layout(nrow = 2, ncol = 1,
                      heights = unit(c(2, 0.25), c("null", "null")))
grid.show.layout(Layout)
vplayout <- function(...) {
  grid.newpage()
  pushViewport(viewport(layout = Layout))
}
subplot <- function(x, y) viewport(layout.pos.row = x, layout.pos.col = y)
mmplot <- function(a, b) {
  vplayout()
  print(a, vp = subplot(1, 1))
  print(b, vp = subplot(2, 1))
}

mmplot(p4, data_table)

# to save - run the following code - see ?png
# png("temperature_plot.png")
# mmplot(p4, data_table)
# dev.off()

# note that when we were at the p3 stage we didn't yet have labels for the data
p3

library(directlabels) # code to put labels into your ggplot2 objects
p3.labelled <- direct.label(p3, list(last.points, hjust = 0.7, vjust = 1))
p3.labelled

A fully polished plot probably took a lot of coding

Page 29: Data visualization and graphic design Special topics

29

Extending ggplot2: GGally

Scatterplot matrix: 36 plots showing ~9K measures

bivariate densities and correlations

Page 30: Data visualization and graphic design Special topics

30

Page 31: Data visualization and graphic design Special topics

31

Making a scatterplot matrix

library(GGally)
data(iris)
head(iris[, 3:5]) # iris columns 3 to 5

# example 1 - defaults
ggpairs(iris[, 3:5])

# example 2 – more customized by data type
ggpairs(iris[, 3:5],
        upper = list(continuous = "density", combo = "box"),
        lower = list(continuous = "points", combo = "dot"),
        diag = list(continuous = "bar", discrete = "bar"))

# example 3 – some new stuff!!!
dat <- data.frame(x = rnorm(100),
                  y = rnorm(100),
                  z = rnorm(100))

plotmatrix <- GGally::ggpairs(dat,
  lower = list(continuous = "density",
               aes_string = aes_string(fill = "..level..")),
  upper = "blank")

plotmatrix
#EOF

Page 32: Data visualization and graphic design Special topics

32

Thinking about some new geoms

Page 33: Data visualization and graphic design Special topics

33

Showing density surfaces from stat_density2d

Let's make a plot of x and y from data.frame dat with stat_density2d

What is the default geom?

In the previous plot, which aesthetic was showing those colors?

What geom would we need to make that plot?
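One possible way to work through these questions is sketched below. It assumes ggplot2 3.3.0 or later (for after_stat()) and the MASS package for the density estimate; dat is regenerated here with a fixed seed so the example stands alone.

```r
library(ggplot2)

set.seed(3)
dat <- data.frame(x = rnorm(100), y = rnorm(100))

# With no geom specified, stat_density_2d draws contour lines.
p_lines <- ggplot(dat, aes(x, y)) +
  stat_density_2d()

# Mapping the computed contour level to fill and switching to a
# polygon geom gives filled density bands instead of lines.
p_filled <- ggplot(dat, aes(x, y)) +
  stat_density_2d(aes(fill = after_stat(level)), geom = "polygon")

p_lines
p_filled
```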

Page 34: Data visualization and graphic design Special topics

34

geom_rug to show marginal distribution
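A minimal sketch of a rug layer, assuming the built-in mtcars data; the sides and alpha settings are illustrative choices.

```r
library(ggplot2)

# Tick marks along the bottom ("b") and left ("l") axes show the
# marginal distribution of each variable next to the scatterplot.
p_rug <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_rug(sides = "bl", alpha = 0.5)
p_rug
```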

Page 35: Data visualization and graphic design Special topics

35

Page 36: Data visualization and graphic design Special topics

36

Page 37: Data visualization and graphic design Special topics

37

Page 38: Data visualization and graphic design Special topics

38

Page 39: Data visualization and graphic design Special topics

39

geom_polygon after computing the convex outer hull, labels at the centroids, moved the legend to the top
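A plot of this kind might be assembled as in the sketch below. It is not the code behind the slide: it assumes the built-in iris data, uses chull() from base R for the hulls, and takes the mean of each group's points as its "centroid" (the centroid of the hull polygon itself would differ slightly).

```r
library(ggplot2)

# Rows on the convex hull of one group, in drawing order.
hull_rows <- function(d) d[chull(d$Petal.Length, d$Petal.Width), ]
hulls <- do.call(rbind, lapply(split(iris, iris$Species), hull_rows))

# Label positions: mean of the points in each group.
centroids <- aggregate(cbind(Petal.Length, Petal.Width) ~ Species,
                       data = iris, FUN = mean)

p_hull <- ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_polygon(data = hulls, aes(fill = Species), alpha = 0.2) +
  geom_point() +
  geom_text(data = centroids, aes(label = Species), colour = "black") +
  theme(legend.position = "top")   # move the legend above the plot
p_hull
```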

Page 40: Data visualization and graphic design Special topics

40

Page 41: Data visualization and graphic design Special topics

41

“Hey, what did you learn in that EPIC class you took?”

Page 42: Data visualization and graphic design Special topics

42

Recap: Why we did this

Visualization is important for communicating information and promoting your ideas

Effective designs will be noticed

We make many graphs quickly for discovery and choose the best ones to polish for communication

With a theory of visualization we can create sophisticated graphics using basic components

Page 43: Data visualization and graphic design Special topics

Recap: Designing a good scientific figure

1. Answer a question – usually a comparison

2. Use an appropriate design (emphasize comparisons of position before length, angle, area or color)

3. Make it self-sufficient (annotation & figure legend)

4. Show your data – tell its story

Page 44: Data visualization and graphic design Special topics

44

Recap: ggplot2 and R

R is a powerful language for statistics and data analysis

ggplot2 implements a “grammar of graphics”

ggplot2: Builds plots using data,

and layers of geometric objects,

mapping variables to aesthetic features,

which have been transformed by scales,

summarized with statistics,

projected into a coordinate system,

and subset into adjacent plots with facets
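The components listed above can be seen together in one call. This is an illustrative sketch using the mpg data that ships with ggplot2; the particular variables and settings are arbitrary.

```r
library(ggplot2)

p_grammar <- ggplot(mpg, aes(x = displ, y = hwy)) +        # data + aesthetic mappings
  geom_point(aes(colour = class)) +                        # layer of geometric objects
  stat_smooth(method = "lm", formula = y ~ x, se = FALSE) + # statistical summary
  scale_x_log10() +                                        # scale transformation
  coord_cartesian(ylim = c(10, 45)) +                      # coordinate system
  facet_wrap(~ drv)                                        # facets: adjacent subplots
p_grammar
```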

Page 45: Data visualization and graphic design Special topics

45

Recap: JGR and Deducer

JGR: a graphical interface for R programming

Deducer: adds menu driven analysis and plotting

Page 46: Data visualization and graphic design Special topics

46

Deducer: Plot Builder

Send R code to Console

Save or import .ggp file

View call to see R code

ggsave("plot.png", height = 6.5, width = 10)

Page 47: Data visualization and graphic design Special topics

47

Deducer: Plot Builder

Geom
Data
Stat
Order of drawing layers
Mapped vars
More options by component
Switch to map to a var
Right-click to get info
Right-click to edit, toggle, remove
Adjust position
Set to a constant value

Page 48: Data visualization and graphic design Special topics

Questions?

[email protected]