a visual guide to r graphics and data munging · r graphics and data munging with 65 illustrations!...

59
Christophe Lalanne A Visual Guide to R Graphics and Data Munging With 65 illustrations 7d26af9 2015/04/10

Upload: others

Post on 12-Oct-2020

5 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Christophe Lalanne

A Visual Guide to

R Graphics andData Munging

With 65 illustrations

7d26af9

20150410

Christophe Lalanne

xyplotdata

list c

scal

es dist

colpanel

type

pch

mea

n

function

aspe

ct boxratiocut

dataframepanelxyplot

rnorm

zoo

qqmath

stripplot

alphahistogram

logts

alternating

border

densityplot

dotplot

factor

groups

jitterdatalwd

over

lap

rot

sample

sd

tck

xlab

xy

ylab

abline

aggregate

amount

barchart

breaks

brew

erp

al

bycex

color

colorkey

cov

cumsum

ecdfplot

groupnumberhorizonplot

includelowestjitterx

num

ber

panelbwplot panelxyareaplotpoints

qnorm

quantile

replace

replicate

seqspan

strip

subs

et

supe

rpos

e

with

The greatest value of a picture is when it forces us to notice what we never expected to see mdash John Tukey

At their best graphics are instruments for reasoning mdash Edward Tufte

Contents

1 Getting started with R graphics 3

11 Why R 3

12 The R graphical model 4

13 Base vs lattice graphics 4

14 The grammar of graphics 4

2 Data management 5

21 Structuring data 5

22 Managing data 5

221 Variable recoding and annotation 6

222 Variable transformation 8

23 Indexing subsetting conditioning 9

24 Summarizing data 9

241 Base R functions 10

242 The Hmisc package 10

243 The plyr package 14

3 Univariate distributions 15

31 Stripchart 15

32 Histograms 17

33 Density plots 19

34 Quantile and related probability plots 21

35 Boxplots 24

36 Time series 25

iii

4 Two-way graphics 31

41 Lineplots 32

42 Scatterplots 32

43 Barcharts 37

44 Dotcharts 37

45 Line fits 39

46 Time series 40

47 Level plot 42

5 Multi-way graphics 45

51 Parallel displays 45

52 Scatterplot matrix 45

53 Three-way tabular data 45

54 N-way data 45

6 Customizing theme and panels 47

7 The Hmisc plotting functions 49

71 Two-way graphics 49

8 The ggplot2 package 51

81 Why ggplot2 51

82 Core components of a ggplot graphic 51

9 Interactive and dynamic displays 53

91 Exploratory data analysis 53

92 Brushing and linking 53

93 The ggobi toolbox 53

CHAPTER1

Getting started with R graphics

11 Why R

In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example

Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows

$ Rscript -e cat(rnorm(500) sep=n) gt rnddat

Then in gnuplot we can run

bw=02

n=500

bin(xwidth)=widthint(xwidth)

tstr(n)=sprintf(Binwidth = 11fn n)

set xrange [-33]

set yrange [01]

set boxwidth bw

plot rnddat using (bin($1bw))(1(bwn)) smooth frequency

with boxes title tstr(bw)

to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)

sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399

3

The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks

The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2

12 The R graphical model

13 Base vs lattice graphics

R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function

14 The grammar of graphics

sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399

Tidying

CHAPTER2

Data management

21 Structuring data

In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice

package which is entirely based on formula even if this is not apparent at first sight

22 Managing data

Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning

data(birthwt package=MASS)

str(birthwt)

sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8

5

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 2: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

xyplotdata

list c

scal

es dist

colpanel

type

pch

mea

n

function

aspe

ct boxratiocut

dataframepanelxyplot

rnorm

zoo

qqmath

stripplot

alphahistogram

logts

alternating

border

densityplot

dotplot

factor

groups

jitterdatalwd

over

lap

rot

sample

sd

tck

xlab

xy

ylab

abline

aggregate

amount

barchart

breaks

brew

erp

al

bycex

color

colorkey

cov

cumsum

ecdfplot

groupnumberhorizonplot

includelowestjitterx

num

ber

panelbwplot panelxyareaplotpoints

qnorm

quantile

replace

replicate

seqspan

strip

subs

et

supe

rpos

e

with

The greatest value of a picture is when it forces us to notice what we never expected to see mdash John Tukey

At their best graphics are instruments for reasoning mdash Edward Tufte

Contents

1 Getting started with R graphics 3

11 Why R 3

12 The R graphical model 4

13 Base vs lattice graphics 4

14 The grammar of graphics 4

2 Data management 5

21 Structuring data 5

22 Managing data 5

221 Variable recoding and annotation 6

222 Variable transformation 8

23 Indexing subsetting conditioning 9

24 Summarizing data 9

241 Base R functions 10

242 The Hmisc package 10

243 The plyr package 14

3 Univariate distributions 15

31 Stripchart 15

32 Histograms 17

33 Density plots 19

34 Quantile and related probability plots 21

35 Boxplots 24

36 Time series 25

iii

4 Two-way graphics 31

41 Lineplots 32

42 Scatterplots 32

43 Barcharts 37

44 Dotcharts 37

45 Line fits 39

46 Time series 40

47 Level plot 42

5 Multi-way graphics 45

51 Parallel displays 45

52 Scatterplot matrix 45

53 Three-way tabular data 45

54 N-way data 45

6 Customizing theme and panels 47

7 The Hmisc plotting functions 49

71 Two-way graphics 49

8 The ggplot2 package 51

81 Why ggplot2 51

82 Core components of a ggplot graphic 51

9 Interactive and dynamic displays 53

91 Exploratory data analysis 53

92 Brushing and linking 53

93 The ggobi toolbox 53

CHAPTER1

Getting started with R graphics

11 Why R

In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example

Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows

$ Rscript -e cat(rnorm(500) sep=n) gt rnddat

Then in gnuplot we can run

bw=02

n=500

bin(xwidth)=widthint(xwidth)

tstr(n)=sprintf(Binwidth = 11fn n)

set xrange [-33]

set yrange [01]

set boxwidth bw

plot rnddat using (bin($1bw))(1(bwn)) smooth frequency

with boxes title tstr(bw)

to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)

sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399

3

The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks

The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2

12 The R graphical model

13 Base vs lattice graphics

R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function

14 The grammar of graphics

sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399

Tidying

CHAPTER2

Data management

21 Structuring data

In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice

package which is entirely based on formula even if this is not apparent at first sight

22 Managing data

Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning

data(birthwt package=MASS)

str(birthwt)

sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8

5

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 3: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Contents

1 Getting started with R graphics 3

11 Why R 3

12 The R graphical model 4

13 Base vs lattice graphics 4

14 The grammar of graphics 4

2 Data management 5

21 Structuring data 5

22 Managing data 5

221 Variable recoding and annotation 6

222 Variable transformation 8

23 Indexing subsetting conditioning 9

24 Summarizing data 9

241 Base R functions 10

242 The Hmisc package 10

243 The plyr package 14

3 Univariate distributions 15

31 Stripchart 15

32 Histograms 17

33 Density plots 19

34 Quantile and related probability plots 21

35 Boxplots 24

36 Time series 25

iii

4 Two-way graphics 31

41 Lineplots 32

42 Scatterplots 32

43 Barcharts 37

44 Dotcharts 37

45 Line fits 39

46 Time series 40

47 Level plot 42

5 Multi-way graphics 45

51 Parallel displays 45

52 Scatterplot matrix 45

53 Three-way tabular data 45

54 N-way data 45

6 Customizing theme and panels 47

7 The Hmisc plotting functions 49

71 Two-way graphics 49

8 The ggplot2 package 51

81 Why ggplot2 51

82 Core components of a ggplot graphic 51

9 Interactive and dynamic displays 53

91 Exploratory data analysis 53

92 Brushing and linking 53

93 The ggobi toolbox 53

CHAPTER1

Getting started with R graphics

11 Why R

In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example

Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows

$ Rscript -e cat(rnorm(500) sep=n) gt rnddat

Then in gnuplot we can run

bw=02

n=500

bin(xwidth)=widthint(xwidth)

tstr(n)=sprintf(Binwidth = 11fn n)

set xrange [-33]

set yrange [01]

set boxwidth bw

plot rnddat using (bin($1bw))(1(bwn)) smooth frequency

with boxes title tstr(bw)

to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)

sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399

3

The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks

The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2

12 The R graphical model

13 Base vs lattice graphics

R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function

14 The grammar of graphics

sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399

Tidying

CHAPTER2

Data management

21 Structuring data

In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice

package which is entirely based on formula even if this is not apparent at first sight

22 Managing data

Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning

data(birthwt package=MASS)

str(birthwt)

sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8

5

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 4: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

4 Two-way graphics 31

41 Lineplots 32

42 Scatterplots 32

43 Barcharts 37

44 Dotcharts 37

45 Line fits 39

46 Time series 40

47 Level plot 42

5 Multi-way graphics 45

51 Parallel displays 45

52 Scatterplot matrix 45

53 Three-way tabular data 45

54 N-way data 45

6 Customizing theme and panels 47

7 The Hmisc plotting functions 49

71 Two-way graphics 49

8 The ggplot2 package 51

81 Why ggplot2 51

82 Core components of a ggplot graphic 51

9 Interactive and dynamic displays 53

91 Exploratory data analysis 53

92 Brushing and linking 53

93 The ggobi toolbox 53

CHAPTER1

Getting started with R graphics

11 Why R

In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example

Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows

$ Rscript -e cat(rnorm(500) sep=n) gt rnddat

Then in gnuplot we can run

bw=02

n=500

bin(xwidth)=widthint(xwidth)

tstr(n)=sprintf(Binwidth = 11fn n)

set xrange [-33]

set yrange [01]

set boxwidth bw

plot rnddat using (bin($1bw))(1(bwn)) smooth frequency

with boxes title tstr(bw)

to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)

sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399

3

The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks

The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2

12 The R graphical model

13 Base vs lattice graphics

R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function

14 The grammar of graphics

sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399

Tidying

CHAPTER2

Data management

21 Structuring data

In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice

package which is entirely based on formula even if this is not apparent at first sight

22 Managing data

Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning

data(birthwt package=MASS)

str(birthwt)

sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8

5

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 5: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

CHAPTER1

Getting started with R graphics

11 Why R

In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example

Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows

$ Rscript -e cat(rnorm(500) sep=n) gt rnddat

Then in gnuplot we can run

bw=02

n=500

bin(xwidth)=widthint(xwidth)

tstr(n)=sprintf(Binwidth = 11fn n)

set xrange [-33]

set yrange [01]

set boxwidth bw

plot rnddat using (bin($1bw))(1(bwn)) smooth frequency

with boxes title tstr(bw)

to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)

sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399

3

The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks

The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2

12 The R graphical model

13 Base vs lattice graphics

R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function

14 The grammar of graphics

sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399

Tidying

CHAPTER2

Data management

21 Structuring data

In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice

package which is entirely based on formula even if this is not apparent at first sight

22 Managing data

Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning

data(birthwt package=MASS)

str(birthwt)

sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8

5

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 6: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks

The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2

12 The R graphical model

13 Base vs lattice graphics

R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function

14 The grammar of graphics

sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399

Tidying

CHAPTER2

Data management

21 Structuring data

In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice

package which is entirely based on formula even if this is not apparent at first sight

22 Managing data

Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning

data(birthwt package=MASS)

str(birthwt)

sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8

5

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 7: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Tidying

CHAPTER2

Data management

21 Structuring data

In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice

package which is entirely based on formula even if this is not apparent at first sight

22 Managing data

Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning

data(birthwt package=MASS)

str(birthwt)

sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8

5

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 8: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Tidying221 Variable recoding and annotation

The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)

Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)

setseed(101)

birthwt$age[5] lt- NA

birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA

yesno lt- c(No Yes)

birthwt lt- within(birthwt

smoke lt- factor(smoke labels = yesno)

low lt- factor(low labels = yesno)

ht lt- factor(ht labels = yesno)

ui lt- factor(ui labels = yesno)

race lt- factor(race levels = 13 labels = c(White Black Other))

lwt lt- lwt22 weight in kg

)

It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well

library(Hmisc)

label(birthwt$age) lt- Mother age

units(birthwt$age) lt- years

label(birthwt$bwt) lt- Baby weight

units(birthwt$bwt) lt- grams

label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study

These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures

The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R

Once we have a working data frame the functions contents and describe provide twoquick summary of the data

gt contents(birthwt)

Data framebirthwt 189 observations and 10 variables Maximum NAs5

Labels Units Levels Class Storage NAs

low 2 integer 0

age Mother age years integer integer 1

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 9: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Tidyinglwt double 0

race 3 integer 0

smoke 2 integer 0

ptl integer 0

ht 2 integer 0

ui 2 integer 0

ftv integer 5

bwt Baby weight grams integer integer 0

+--------+-----------------+

|Variable|Levels |

+--------+-----------------+

| low |NoYes |

| smoke | |

| ht | |

| ui | |

+--------+-----------------+

| race |WhiteBlackOther|

+--------+-----------------+

As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output

gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)

4 Variables 189 Observations

--------------------------------------------------------------------------------

low

n missing unique

189 0 2

No (130 69) Yes (59 31)

--------------------------------------------------------------------------------

age Mother age [years]

n missing unique Info Mean 05 10 25 50 75

188 1 24 1 233 16 17 19 23 26

90 95

31 32

lowest 14 15 16 17 18 highest 33 34 35 36 45

--------------------------------------------------------------------------------

race

n missing unique

189 0 3

White (96 51) Black (26 14) Other (67 35)

--------------------------------------------------------------------------------

ftv

n missing unique Info Mean

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 10: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Tidying184 5 6 083 0783

0 1 2 3 4 6

Frequency 98 45 30 7 3 1

53 24 16 4 2 1

--------------------------------------------------------------------------------

The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies

222 Variable transformation

In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example

birthwt$low lt- relevel(birthwt$low Yes)

levels(birthwt$race)[23] lt- Black+Other

Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe

birthwt lt- transform(birthwt agec=scale(age scale=FALSE))

Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)

birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))

ftvc=factor(ifelse(ftv lt 2 1 2+)))

Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2

function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals

table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))

If there is some reason to treat ftv as an ordered factor a command like

asordered(cut2(birthwt$ftv c(0 1 2 6)))

might do the job

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 11: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Tidying23 Indexing subsetting conditioning

A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection

The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))

The following instruction will print the age of hypertensive mothers whose baby was un-derweight

subset(birthwt low == Yes amp ht == Yes age)

It should be noted that subset returns a data frame

Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs

library(sqldf)

sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)

Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055

There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable

lwt lt- birthwt$lwt

lwt[sample(length(lwt) 10)] lt- NA

lwti lt- impute(lwt)

summary(lwti)

Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option

24 Summarizing data

Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization

sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 12: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Tidying241 Base R functions

Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)

There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD

f lt- function(x) c(mean = mean(x) sd = sd(x))

aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)

242 The Hmisc package

There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)

The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame

Note that enhanced results can be printed on the console using the prn command

Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable

f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x

narm = naopts))

out lt- with(birthwt summarize(bwt race f))

Here is the output produced by R

Average baby weight by ethnicity out

race bwt sd

3 White 3103 7279

1 Black 2720 6387

2 Other 2805 7222

⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 13: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame

Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command

with(birthwt summarize(bwt llist(race smoke) f))

The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed

With the following instruction

with(birthwt bystats(cbind(bwt lwt) smoke race))

we would get

Mean of cbind(bwt lwt) by smoke

N bwt lwt

No White 44 3429 6311

Yes White 52 2827 5741

No Black 16 2854 6793

Yes Black 10 2504 6482

No Other 55 2816 5416

Yes Other 12 2757 5636

ALL 189 2945 5901

whereas

with(birthwt bystats2(lwt smoke race))

gives

Mean of lwt by smoke

+----+

|N |

|Mean|

+----+

+---+-----+-----+-----+-----+

|No | 44 | 16 | 55 |115 |

| |6311|6793|5416|5950|

+---+-----+-----+-----+-----+

|Yes| 52 | 10 | 12 | 74 |

| |5741|6482|5636|5824|

+---+-----+-----+-----+-----+

|ALL| 96 | 26 | 67 |189 |

| |6002|6673|5455|5901|

+---+-----+-----+-----+-----+

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 14: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups

Here are some examples of use with their outputs

gt summary(bwt ~ race + ht + lwt data = birthwt)

Baby weight N=189

+-------+------------+---+----+

| | |N |bwt |

+-------+------------+---+----+

|race |White | 96|3103|

| |Black | 26|2720|

| |Other | 67|2805|

+-------+------------+---+----+

|ht |No |177|2972|

| |Yes | 12|2537|

+-------+------------+---+----+

|lwt |[364 509)| 53|2656|

| |[509 555)| 43|3059|

| |[555 641)| 46|3075|

| |[6411136]| 47|3038|

+-------+------------+---+----+

|Overall| |189|2945|

+-------+------------+---+----+

gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)

mean by race bwt

+-------+

|N |

|Missing|

|lwt |

|age |

+-------+

+-----+-----------+-----------+-----------+-----------+-----+

| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |

+-----+-----------+-----------+-----------+-----------+-----+

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 15: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Tidying|White| 19 | 23 | 20 | 33 | 95 |

| | 0 | 1 | 0 | 0 |1 |

| | 5555 | 5767 | 6223 | 6325 |6014|

| | 2274 | 2478 | 2450 | 2491 |2436|

+-----+-----------+-----------+-----------+-----------+-----+

|Black| 9 | 9 | 6 | 2 | 26 |

| | 0 | 0 | 0 | 0 |0 |

| | 6510 | 5970 | 7083 | 9341 |6673|

| | 2344 | 2089 | 2000 | 2050 |2154|

+-----+-----------+-----------+-----------+-----------+-----+

|Other| 20 | 16 | 19 | 12 | 67 |

| | 0 | 0 | 0 | 0 |0 |

| | 5123 | 5295 | 5890 | 5534 |5455|

| | 2220 | 2269 | 2226 | 2250 |2239|

+-----+-----------+-----------+-----------+-----------+-----+

|ALL | 48 | 48 | 45 | 47 |188 |

| | 0 | 1 | 0 | 0 |1 |

| | 5554 | 5648 | 6197 | 6251 |5906|

| | 2265 | 2335 | 2296 | 2411 |2327|

+-----+-----------+-----------+-----------+-----------+-----+

gt summary(low ~ race + ht data = birthwt fun = table)

low N=189

+-------+-----+---+---+---+

| | |N |No |Yes|

+-------+-----+---+---+---+

|race |White| 96| 73|23 |

| |Black| 26| 15|11 |

| |Other| 67| 42|25 |

+-------+-----+---+---+---+

|ht |No |177|125|52 |

| |Yes | 12| 5| 7 |

+-------+-----+---+---+---+

|Overall| |189|130|59 |

+-------+-----+---+---+---+

Finally here is amore complexTable produced by summaryformulawith method = reverse

out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE

test = TRUE)

print(out prmsd = TRUE digits = 2)

Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function

latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)

Here is the final output

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 16: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

TidyingN No Yes Combined Test Statistic

119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576

race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)

Other 32 (42) 42 (25) 35 (67)

Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568

Yes 11 ( 14) 24 ( 14) 15 ( 28)

119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test

243 The plyr package

The plyr package⁵ offers a general solution to those kind of tasks

⁵Ibid

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 17: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

CHAPTER3

Univariate distributions

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

31 Stripchart

Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections

x lt- sample (130 25 replace=TRUE)

stripplot(sim x jitterdata=TRUE factor=8 aspect=5)

With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation

x

5 10 15 20 25 30

15

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 18: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)

x

5 10 15 20 25 30

A better way of flattening the display is to use anldquoxyrdquo aspect

stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12

scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)

x

5

10

15

20

25

30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication

stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)

x

5 10 15 20 25 30

In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 19: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

x lt- sample(seq (1 60 by=2) 75 replace=TRUE)

stripplot (1 sim x panel=panelsunflowerplot col=black

segcol=black seglwd =1 size=08)

With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)

x

0 10 20 30 40 50 60

32 Histograms

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

histogram(sim waiting data=faithful)

faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA

A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 20: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum

Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)

type=count

waiting

Count

0

20

40

60

40 50 60 70 80 90 100

col=steelblue

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

border=NA nint=12

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

40 50 60 70 80 90 100

nint=nrow(dataset)

waiting

Perc

ent

of

Tota

l

0

2

4

6

40 50 60 70 80 90 100

p lt- histogram(sim waiting data=faithful lwd=5 type=percent

border=white)

xbreaks lt- p$panelargscommon$breaks

yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))

update(p panel=function( )

panelhistogram()

paneltext(xbreaks+diff(xbreaks )[1]2

yvaluesnrow(faithful)100

ascharacter(yvalues ) cex=8 adj=c(5 0))

)

waiting

Perc

ent

of

Tota

l

0

5

10

15

20

25

40 50 60 70 80 90 100

13

3133

22

14

67

63

24

5

13

The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 21: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

x lt- rnorm (80 mean =12 sd=2)

histogram(sim x type=density border=NA

panel=function(x )

panelhistogram(x )

panelmathdensity(dmath=dnorm col=BF3030

args=list(mean=mean(x)sd=sd(x)))

)

An example where we superimposed a normaldensity with parameters estimated from the sam-ple)

x

Densi

ty

000

005

010

015

6 8 10 12 14 16

33 Density plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

sup1

sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 22: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

densityplot(sim waiting data=faithful)

waiting

Densi

ty

000

001

002

003

40 60 80 100

The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian

with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ

densityplot(sim waiting data=faithful plotpoints=rug)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points

densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)

waiting

Densi

ty

000

001

002

003

40 60 80 100

Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics

An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature

sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 23: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor

bwplot(sim waiting data=faithful panel=panelviolin)

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

bwplot(sim waiting data=faithful

panel=function( boxratio )

panelviolin( col=transparent varwidth=FALSE

boxratio=boxratio)

panelbwplot( fill=NULL boxratio=15 ))

A simple ldquoviolinrdquo panel

waiting

50 60 70 80 90

34 Quantile and related probability plots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 24: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

qqmath(sim rnorm (60))

qnorm

rnorm

(60

)

-3

-2

-1

0

1

2

-2 -1 0 1 2

A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis

x lt- rt(500 df=20)

qqmath(sim x pch=+ abline=c(01) dist=qnorm)

qnorm

x

-4

-2

0

2

-2 0 2

+

+ +

+++

+++++++

+++++++++++

++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

+++++++++++++++++++++++++++++++++++++

++++++++++++++++++++++

++++++++++++++++++++++++

+++++++++++++++++++

+++++++++

+++

+

+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 25: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

qqmath(sim x pch=+ abline=c(01) dist=qnorm

fvalue=ppoints (100))

Same as above but subsampling data points

qnormx

-2

0

2

-2 -1 0 1 2

+

++

+++++

++++++

++++++

++++++++

+++++++++++

+++++++++

+++++++++++++

++++++++

+++++

++++++++

++++++

+++++

++

++ +

+

+

qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)

cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight

Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath

to use the Uniform distribution as a reference

qunif

Hw

t

6

8

10

12

00 02 04 06 08 10

ecdfplot(sim Hwt data=cats subset=Sex == F)

An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package

Hwt

Em

pir

ical C

DF

00

02

04

06

08

10

6 8 10 12

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 26: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

35 Boxplots

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

bwplot(sim height data=women)

height

60 65 70

women This data set gives theaverage heights and weights forAmerican women aged 30-39

Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)

bwplot(sim height data=women boxratio=5)

height

60 65 70

This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 27: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

bwplot(sim height data=women boxratio=5

scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))

A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument

height

50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80

bwplot(sim height data=women aspect=5

panel=function(x )

panelbwplot(x pch=| )

panelpoints(mean(x) 1 pch=19 cex =1)

)

It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE

when calling mean(x)height

60 65 70

36 Time series

Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 28: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(sunspotyear)

Time

05

01

00

15

0

1700 1750 1800 1850 1900 1950 2000

sunspotyear Yearly numbers ofsunspots from 1700 to 1988

A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements

xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))

Time

0

50

100

150

1700 1750 1800 1850 1900 1950 2000

A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component

xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))

Time

0

50

100

150

1700 1750 1800 1850

1850 1900 1950

0

50

100

150

The same time series cut into two pieces with 5of overlap

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 29: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(sunspotyear strip=FALSE stripleft=TRUE

cut=list(number =2 overlap=05))

Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period

Time

0

50

100

150

1700 1750 1800 1850

tim

e

1850 1900 1950

0

50

100

150

tim

e

xyplot(zoo(discoveries ))

discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959

The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

xyplot(zoo(discoveries ) panel=panelxyarea)

It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age

Time

02

46

81

01

2

1860 1880 1900 1920 1940 1960

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 30: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xt lt- ts(cumsum(rnorm (200 12)))

p lt- xyplot(xt)

Time

-20

02

04

06

08

0

0 500 1000 1500 2000 2500

The lattice package works with ts objetcs too

xt lt- zoo(accdeaths)

xyplot(xt type=c(lg) ylab=Total accidental deaths)

Time

Tota

l acc

identa

l d

eath

s

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

accdeaths A regular time seriesgiving the monthly totals of accidental

deaths in the USA

Another regular time series is

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 31: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(xt

panel=function(x y )

panelxyplot(x y )

panellines(rollmean(zoo(y x) 3) lwd=2 col =1)

)

The same dataset with a rolling mean (of width3)

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

xyplot(xt) +

layer(paneltskernel(x y c=3 col=1 lwd =2))

Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6

Time

70

00

80

00

90

00

10

00

01

10

00

1973 1974 1975 1976 1977 1978 1979

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 32: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))

superpose=TRUE panel=panelsuperpose

panelgroups=function( groupnumber )

if ( groupnumber == 1) panelxyarea()

else panelxyplot()

border=NA cut=list(n=3 overlap =0) aspect=xy

parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))

Time

050

100150

1700 1720 1740 1760 1780

time

1800 1820 1840 1860 1880

050100150

time

050

100150

1900 1920 1940 1960 1980

time

sunspotyearlag10sunspotyear Yearly numbers of

sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)

flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))

xyplot(flow

panel=function(x y )

panelxblocks(x y gt mean(y) col=lightgray)

panelxyplot(x y )

)

Time

01

02

03

04

05

0

0 50 100 150 200

blabla blabla

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 33: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

CHAPTER4

Two-way graphics

This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot

31

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 34: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

41 Lineplots

xyplot(uptake sim conc data=CO2 groups=Treatment type=a)

conc

up

take

10

20

30

40

200 400 600 800 1000

Automatic averaging

42 Scatterplots

The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty

xyplot(dist sim speed data=cars)

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

cars The data give the speed of carsand the distances taken to stop Note

that the data were recorded in the1920s

This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 35: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)

Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars

col=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars pch=19

groups=with(cars cut(speed breaks=quantile(speed)

includelowest=TRUE )))

This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 36: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xy lt- asdataframe(replicate (2 rnorm (1000)))

xyplot(V1 sim V2 data=xy pch=19 alpha=5)

V2

V1

-2

0

2

-2 0 2

When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)

dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))

cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))

xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)

X2

X1

-2

0

2

-3 -2 -1 0 1 2

Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer

package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors

xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE

amount=2)

PetalLength

Sep

alLe

ng

th

5

6

7

8

1 2 3 4 5 6 7

iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the

measurements in centimeters of thevariables sepal length and width andpetal length and width respectively

for 50 flowers from each of 3 species ofiris The species are Iris setosa

versicolor and virginica

As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 37: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(dist sim speed data=cars

panel=function(x y )

panelxyplot(x y )

panelrug(x y )

)

It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(log(Volume ) sim log(Girth) data=trees)

trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground

Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis

log(Girth)

log

(Volu

me)

25

30

35

40

22 24 26 28 30

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 38: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot ((1200)20 sim (1200)20 type=c(p g)

scales=list(x=list(log =10) y=list(log =10))

xscalecomponents=xscalecomponentslog103

yscalecomponents=yscalecomponentslog103)

(1200)20

(12

00

)2

0

01

03

1

3

10

01 03 1 3 10

Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing

xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

To get ride of ticks on opposite axes we canchange default values for tck= in the scales=

component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately

xyplot(dist sim speed data=cars scales=list(alternating =3))

speed

dis

t

5 10 15 20 25

0

20

40

60

80

100

120

5 10 15 20 25

0

20

40

60

80

100

120

To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2

would reverse the annotation of axis (righttopinstead of leftbottom)

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 39: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

43 Barcharts

spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)

barchart(count sim spray data=spraydf)

InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides

Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider

count

5

10

15

A B C D E F

barchart(spray sim count data=spraydf)

To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula

count

A

B

C

D

E

F

5 10 15

44 Dotcharts

low ink-ratio

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 40: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

dotplot(spray sim count data=InsectSprays)

count

A

B

C

D

E

F

0 5 10 15 20 25

Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations

dotplot(spray sim count data=InsectSprays lty =2)

count

A

B

C

D

E

F

0 5 10 15 20 25

The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn

dotplot(spray sim count data=InsectSprays type=c(pa))

count

A

B

C

D

E

F

0 5 10 15 20 25

It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 41: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

45 Line fits

In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot

xyplot(dist sim speed data=cars type=c(psmooth))

An adaptive loess smoother superimposed on topof a standard scatterplot

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

xyplot(dist sim speed data=cars type=c(psmooth) span=13)

Window span can be controlled using the span=

argument where lower valuemeansmore sentiv-ity to local variations

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 42: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(dist sim speed data=cars type=c(pr))

speed

dis

t

0

20

40

60

80

100

120

5 10 15 20 25

Regression fit can be shown using the same ideathrough the type= argument

46 Time series

xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))

xyplot(xt)

Time

-30

-20

-10

01

0

Series 1

-50

-40

-30

-20

-10

01

0

0 200 400 600 800 1000 1200

Series 2

blabla blabla

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 43: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(xt scales=list(y=same) type=c(lg))

blabla blabla

Time

-40

-20

0

Series 1

0 200 400 600 800 1000 1200

-40

-20

0

Series 2

xyplot(xt layout=c(2 1))

blabla blabla

Time

-30

-20

-10

01

0

0 200 400 600 800 1200

Series 1

0 200 400 600 800 1200

-50

-40

-30

-20

-10

01

0

Series 2

xyplot(EuStockMarkets scales=list(y=same))

EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted

blabla blabla

Time

2000400060008000

DAX

2000400060008000

SMI

2000400060008000

CAC

1992 1994 1996 1998

2000400060008000

FTSE

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 44: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))

Time

20

00

40

00

60

00

80

00

1992 1994 1996 1998

DAXSMI

CACFTSE blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE)

Time

DA

XSM

IC

AC

1992 1994 1996 1998

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers

Time

15191

16288

DA

X

22446

16781

SM

I

8719

17728

CA

C

1992 1994 1996 1998

12451

24436

FTSE

minus 3Δ i

minus 2Δ i

minus 1Δ i

origin

+ 1Δ i

+ 2Δ i

+ 3Δ i blabla blabla

47 Level plot

Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 45: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

data(Harman23cor)

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab=)

Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976

Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

cols lt- colorRampPalette(brewerpal (8 RdBu))

levelplot(Harman23cor$cov scales=list(x=list(rot =45))

xlab= ylab= colregions=cols)

The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)

height

armspan

forearm

lowerleg

weight

bitrodiameter

chestgirth

chestwidth

heig

ht

arm

spa

n

fore

arm

lower

leg

wei

ght

bitro

diam

eter

ches

tgirt

h

ches

twid

th

02

03

04

05

06

07

08

09

10

Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like

It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 46: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 47: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

CHAPTER5

Multi-way graphics

This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor

51 Parallel displays

52 Scatterplot matrix

53 Three-way tabular data

54 N-way data

45

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 48: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 49: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

CHAPTER6

Customizing theme and panels

47

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 50: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Lattice

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 51: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

HmiscCHAPTER7

The Hmisc plotting functions

71 Two-way graphics

Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels

package

Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval

se lt- function(x) sd(x)sqrt(length(x))

f lt- function(x) c(mean = mean(x)

lwr = mean(x) - 196 se(x)

upr = mean(x) + 196 se(x))

49

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 52: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Hmisc

d lt- with(birthwt summarize(bwt race f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)

data = d type = b keys = lines

ylim = range(apply(d[ 34] 2 range )) + c(-11) 100

scales = list(x = list(at = 13 labels = levels(d$race ))))

Ethnicity

Baby w

eig

ht

2600

2800

3000

3200

White Black Other

The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter

Here is another example using a grouping factor

d lt- with(birthwt summarize(bwt llist(race smoke) f))

xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke

data = d type = l keys = lines method = alt bars

ylim = c(2200 3600)

scales = list(x = list(at = 13 labels = levels(d$race ))))

race

Baby w

eig

ht

2400

2600

2800

3000

3200

3400

White Black Other

NoYes

bla bla

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 53: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Ggplot

CHAPTER8

The ggplot2 package

81 Why ggplot2

How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1

82 Core components of a ggplot graphic

sup1L Wilkinson The Grammar of Graphics Springer 2005

51

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 54: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Ggplot

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 55: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Ggplot

CHAPTER9

Interactive and dynamic displays

91 Exploratory data analysis

92 Brushing and linking

93 The ggobi toolbox

53

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 56: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Ggplot

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55

Page 57: A Visual Guide to R Graphics and Data Munging · R Graphics and Data Munging With 65 illustrations! 7d26af9 2015/04/10. xyplot data listc scales dist col panel type pch mean function

Ggplot

Index

abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38

barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21

c 9 14 18 19 21 24ndash26 3234ndash37 45 46

cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8

data 13 14 16 17 19ndash2128ndash36 39 45 46

dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35

36dmath 15dotplot 34draw 12

ecdfplot 19else 26

fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25

26 31

groupnumber 26groups 28 29 45 46

histogram 13ndash15horizonplot 38horizontal 12

if 26includelowest 8 29

jitterdata 11 12jitterx 30jittery 30

labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32

37ndash39 45 46log 31 32lty 34lwd 14 25 26

matrix 36mean 15 21 26 33 34median 34method 26 45 46

narm 21ncol 36nint 14nrow 14 29number 22 23

overlap 22 23 26

panel 12ndash15 17 21 23 2526 31

panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19

qnorm 18qqmath 18 19

quantile 29qunif 19

range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18

sample 11 13 30scales 12 13 21 22 31 32

37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39

table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash

37 45 46

update 14

varwidth 17

with 29 45 46within 8

xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38

ylab 24 39yscalecomponents 32yscalecomponentslog103

32

zoo 5 23ndash25

55