a visual guide to r graphics and data munging · r graphics and data munging with 65 illustrations!...
TRANSCRIPT
Christophe Lalanne
A Visual Guide to
R Graphics andData Munging
With 65 illustrations
7d26af9
20150410
xyplotdata
list c
scal
es dist
colpanel
type
pch
mea
n
function
aspe
ct boxratiocut
dataframepanelxyplot
rnorm
zoo
qqmath
stripplot
alphahistogram
logts
alternating
border
densityplot
dotplot
factor
groups
jitterdatalwd
over
lap
rot
sample
sd
tck
xlab
xy
ylab
abline
aggregate
amount
barchart
breaks
brew
erp
al
bycex
color
colorkey
cov
cumsum
ecdfplot
groupnumberhorizonplot
includelowestjitterx
num
ber
panelbwplot panelxyareaplotpoints
qnorm
quantile
replace
replicate
seqspan
strip
subs
et
supe
rpos
e
with
The greatest value of a picture is when it forces us to notice what we never expected to see mdash John Tukey
At their best graphics are instruments for reasoning mdash Edward Tufte
Contents
1 Getting started with R graphics 3
11 Why R 3
12 The R graphical model 4
13 Base vs lattice graphics 4
14 The grammar of graphics 4
2 Data management 5
21 Structuring data 5
22 Managing data 5
221 Variable recoding and annotation 6
222 Variable transformation 8
23 Indexing subsetting conditioning 9
24 Summarizing data 9
241 Base R functions 10
242 The Hmisc package 10
243 The plyr package 14
3 Univariate distributions 15
31 Stripchart 15
32 Histograms 17
33 Density plots 19
34 Quantile and related probability plots 21
35 Boxplots 24
36 Time series 25
iii
4 Two-way graphics 31
41 Lineplots 32
42 Scatterplots 32
43 Barcharts 37
44 Dotcharts 37
45 Line fits 39
46 Time series 40
47 Level plot 42
5 Multi-way graphics 45
51 Parallel displays 45
52 Scatterplot matrix 45
53 Three-way tabular data 45
54 N-way data 45
6 Customizing theme and panels 47
7 The Hmisc plotting functions 49
71 Two-way graphics 49
8 The ggplot2 package 51
81 Why ggplot2 51
82 Core components of a ggplot graphic 51
9 Interactive and dynamic displays 53
91 Exploratory data analysis 53
92 Brushing and linking 53
93 The ggobi toolbox 53
CHAPTER1
Getting started with R graphics
11 Why R
In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example
Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows
$ Rscript -e cat(rnorm(500) sep=n) gt rnddat
Then in gnuplot we can run
bw=02
n=500
bin(xwidth)=widthint(xwidth)
tstr(n)=sprintf(Binwidth = 11fn n)
set xrange [-33]
set yrange [01]
set boxwidth bw
plot rnddat using (bin($1bw))(1(bwn)) smooth frequency
with boxes title tstr(bw)
to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)
sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399
3
The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks
The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2
12 The R graphical model
13 Base vs lattice graphics
R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function
14 The grammar of graphics
sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399
Tidying
CHAPTER2
Data management
21 Structuring data
In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice
package which is entirely based on formula even if this is not apparent at first sight
22 Managing data
Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning
data(birthwt package=MASS)
str(birthwt)
sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8
5
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
xyplotdata
list c
scal
es dist
colpanel
type
pch
mea
n
function
aspe
ct boxratiocut
dataframepanelxyplot
rnorm
zoo
qqmath
stripplot
alphahistogram
logts
alternating
border
densityplot
dotplot
factor
groups
jitterdatalwd
over
lap
rot
sample
sd
tck
xlab
xy
ylab
abline
aggregate
amount
barchart
breaks
brew
erp
al
bycex
color
colorkey
cov
cumsum
ecdfplot
groupnumberhorizonplot
includelowestjitterx
num
ber
panelbwplot panelxyareaplotpoints
qnorm
quantile
replace
replicate
seqspan
strip
subs
et
supe
rpos
e
with
The greatest value of a picture is when it forces us to notice what we never expected to see mdash John Tukey
At their best graphics are instruments for reasoning mdash Edward Tufte
Contents
1 Getting started with R graphics 3
11 Why R 3
12 The R graphical model 4
13 Base vs lattice graphics 4
14 The grammar of graphics 4
2 Data management 5
21 Structuring data 5
22 Managing data 5
221 Variable recoding and annotation 6
222 Variable transformation 8
23 Indexing subsetting conditioning 9
24 Summarizing data 9
241 Base R functions 10
242 The Hmisc package 10
243 The plyr package 14
3 Univariate distributions 15
31 Stripchart 15
32 Histograms 17
33 Density plots 19
34 Quantile and related probability plots 21
35 Boxplots 24
36 Time series 25
iii
4 Two-way graphics 31
41 Lineplots 32
42 Scatterplots 32
43 Barcharts 37
44 Dotcharts 37
45 Line fits 39
46 Time series 40
47 Level plot 42
5 Multi-way graphics 45
51 Parallel displays 45
52 Scatterplot matrix 45
53 Three-way tabular data 45
54 N-way data 45
6 Customizing theme and panels 47
7 The Hmisc plotting functions 49
71 Two-way graphics 49
8 The ggplot2 package 51
81 Why ggplot2 51
82 Core components of a ggplot graphic 51
9 Interactive and dynamic displays 53
91 Exploratory data analysis 53
92 Brushing and linking 53
93 The ggobi toolbox 53
CHAPTER1
Getting started with R graphics
11 Why R
In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example
Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows
$ Rscript -e cat(rnorm(500) sep=n) gt rnddat
Then in gnuplot we can run
bw=02
n=500
bin(xwidth)=widthint(xwidth)
tstr(n)=sprintf(Binwidth = 11fn n)
set xrange [-33]
set yrange [01]
set boxwidth bw
plot rnddat using (bin($1bw))(1(bwn)) smooth frequency
with boxes title tstr(bw)
to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)
sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399
3
The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks
The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2
12 The R graphical model
13 Base vs lattice graphics
R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function
14 The grammar of graphics
sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399
Tidying
CHAPTER2
Data management
21 Structuring data
In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice
package which is entirely based on formula even if this is not apparent at first sight
22 Managing data
Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning
data(birthwt package=MASS)
str(birthwt)
sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8
5
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Contents
1 Getting started with R graphics 3
11 Why R 3
12 The R graphical model 4
13 Base vs lattice graphics 4
14 The grammar of graphics 4
2 Data management 5
21 Structuring data 5
22 Managing data 5
221 Variable recoding and annotation 6
222 Variable transformation 8
23 Indexing subsetting conditioning 9
24 Summarizing data 9
241 Base R functions 10
242 The Hmisc package 10
243 The plyr package 14
3 Univariate distributions 15
31 Stripchart 15
32 Histograms 17
33 Density plots 19
34 Quantile and related probability plots 21
35 Boxplots 24
36 Time series 25
iii
4 Two-way graphics 31
41 Lineplots 32
42 Scatterplots 32
43 Barcharts 37
44 Dotcharts 37
45 Line fits 39
46 Time series 40
47 Level plot 42
5 Multi-way graphics 45
51 Parallel displays 45
52 Scatterplot matrix 45
53 Three-way tabular data 45
54 N-way data 45
6 Customizing theme and panels 47
7 The Hmisc plotting functions 49
71 Two-way graphics 49
8 The ggplot2 package 51
81 Why ggplot2 51
82 Core components of a ggplot graphic 51
9 Interactive and dynamic displays 53
91 Exploratory data analysis 53
92 Brushing and linking 53
93 The ggobi toolbox 53
CHAPTER1
Getting started with R graphics
11 Why R
In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example
Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows
$ Rscript -e cat(rnorm(500) sep=n) gt rnddat
Then in gnuplot we can run
bw=02
n=500
bin(xwidth)=widthint(xwidth)
tstr(n)=sprintf(Binwidth = 11fn n)
set xrange [-33]
set yrange [01]
set boxwidth bw
plot rnddat using (bin($1bw))(1(bwn)) smooth frequency
with boxes title tstr(bw)
to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)
sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399
3
The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks
The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2
12 The R graphical model
13 Base vs lattice graphics
R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function
14 The grammar of graphics
sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399
Tidying
CHAPTER2
Data management
21 Structuring data
In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice
package which is entirely based on formula even if this is not apparent at first sight
22 Managing data
Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning
data(birthwt package=MASS)
str(birthwt)
sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8
5
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
4 Two-way graphics 31
41 Lineplots 32
42 Scatterplots 32
43 Barcharts 37
44 Dotcharts 37
45 Line fits 39
46 Time series 40
47 Level plot 42
5 Multi-way graphics 45
51 Parallel displays 45
52 Scatterplot matrix 45
53 Three-way tabular data 45
54 N-way data 45
6 Customizing theme and panels 47
7 The Hmisc plotting functions 49
71 Two-way graphics 49
8 The ggplot2 package 51
81 Why ggplot2 51
82 Core components of a ggplot graphic 51
9 Interactive and dynamic displays 53
91 Exploratory data analysis 53
92 Brushing and linking 53
93 The ggobi toolbox 53
CHAPTER1
Getting started with R graphics
11 Why R
In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example
Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows
$ Rscript -e cat(rnorm(500) sep=n) gt rnddat
Then in gnuplot we can run
bw=02
n=500
bin(xwidth)=widthint(xwidth)
tstr(n)=sprintf(Binwidth = 11fn n)
set xrange [-33]
set yrange [01]
set boxwidth bw
plot rnddat using (bin($1bw))(1(bwn)) smooth frequency
with boxes title tstr(bw)
to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)
sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399
3
The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks
The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2
12 The R graphical model
13 Base vs lattice graphics
R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function
14 The grammar of graphics
sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399
Tidying
CHAPTER2
Data management
21 Structuring data
In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice
package which is entirely based on formula even if this is not apparent at first sight
22 Managing data
Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning
data(birthwt package=MASS)
str(birthwt)
sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8
5
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
CHAPTER1
Getting started with R graphics
11 Why R
In the ldquoGNUworldrdquomost of the plotting programexpect data from text file (tab-delimitedor csv) arranged by columns with extra grouping levels denoted by string or integer codesThis is the case with gnuplotsup1 (httpwwwgnuplotinfo) or plotutils (httpwwwgnuorgsoftwareplotutils) for example
Here is howwe could create an histogram in gnuplot froma series of 500 gaussian variatesgenerate using R as follows
$ Rscript -e cat(rnorm(500) sep=n) gt rnddat
Then in gnuplot we can run
bw=02
n=500
bin(xwidth)=widthint(xwidth)
tstr(n)=sprintf(Binwidth = 11fn n)
set xrange [-33]
set yrange [01]
set boxwidth bw
plot rnddat using (bin($1bw))(1(bwn)) smooth frequency
with boxes title tstr(bw)
to get the picture shown below (Note that we didnrsquot try to customize anything except thetitle)
sup1PK Janert Gnuplot in Action Understanding Data with Graphs Manning Publications 2009 isbn 978-1933988399
3
The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks
The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2
12 The R graphical model
13 Base vs lattice graphics
R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function
14 The grammar of graphics
sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399
Tidying
CHAPTER2
Data management
21 Structuring data
In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice
package which is entirely based on formula even if this is not apparent at first sight
22 Managing data
Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning
data(birthwt package=MASS)
str(birthwt)
sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8
5
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
The above example shows one important aspect of using a dedicated statistical packageGnuplot has no function to draw an histogram and we have to write some code to per-form additional tasks like binning in this case The same applies for plotutils We couldmake use of external programs to do that like the GSL library (see example 2211 fromthe manual httpwwwgnuorgsoftwaregslmanual) but this two-stage approachis rather likely to be cumbersome for repetitive tasks
The author found that Stata is one of the only great alternative to R but it has its costIn fact this textbook is largely inspired from one of Stata Press book on Stata graphicalcapabilitiessup2
12 The R graphical model
13 Base vs lattice graphics
R comes with several graphical system including so-called ldquoBaserdquo graphics which relyon grDevices and While these plotting functions are quite useful they lack severalinteresting aspects that are available through grid-based packages including lattice andggplot2 and in particular the update and print function
14 The grammar of graphics
sup2MN Mitchell A Visual Guide to Stata Graphics Stata Press 2008 isbn 978-1597180399
Tidying
CHAPTER2
Data management
21 Structuring data
In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice
package which is entirely based on formula even if this is not apparent at first sight
22 Managing data
Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning
data(birthwt package=MASS)
str(birthwt)
sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8
5
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Tidying
CHAPTER2
Data management
21 Structuring data
In R the most common structure used to store a data set with mixed-type variables is adataframe Such an R object presents several characteristics that makes it most appro-priate for managing statistical data structure with few exceptions (eg when one onlyhas to work with aggregated data or two-way tables) It should be noted that other datastructures might be more appropriate for example when one is interested in time seriesanalysis but see the zoo packagesup1Many R functions accept dataframe as input and further allow to subset or index it forcomputation or visualization purpose In addition to receiving a dataframe some Rcommands allow to use a formula notation where the right and left-hand side are sepa-rated by the sim (tilde) operator The use of together with a dataframe simplify the acces-sion of variable in a given environment This is especially true when using the lattice
package which is entirely based on formula even if this is not apparent at first sight
22 Managing data
Consider for example the low birth study which is discussed at length in Hosmer andLemeshowrsquos textbook on logistic regressionsup2 A quick look at the variables shouldmake itclear that they wonrsquot be treated the way we like them to be considered motherrsquos ethnicitystatus (race) takes three integer values without any explicit meaning
data(birthwt package=MASS)
str(birthwt)
sup1A Zeilis and G Grothendieck ldquozoo S3 Infrastructure for Regular and Irregular Time Seriesrdquo In Journal ofStatistical Software 146 (2005) url httpwwwjstatsoftorgv14i06sup2D Hosmer and S Lemeshow Applied Logistic Regression New York Wiley 1989 isbn 0-471-35632-8
5
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Tidying221 Variable recoding and annotation
The Hmisc package includes numerous R functions that will facilitate the task of datachecking (describe provides ldquocodebookrdquo facilities) summarizing (summaryformula) oraggregating data (summaryformula)
Here are some examples of use with the birthwt data For illustration purpose we setsome observations as missing on two variables (age and ftv)
setseed(101)
birthwt$age[5] lt- NA
birthwt$ftv[sample(1nrow(birthwt) 5)] lt- NA
yesno lt- c(No Yes)
birthwt lt- within(birthwt
smoke lt- factor(smoke labels = yesno)
low lt- factor(low labels = yesno)
ht lt- factor(ht labels = yesno)
ui lt- factor(ui labels = yesno)
race lt- factor(race levels = 13 labels = c(White Black Other))
lwt lt- lwt22 weight in kg
)
It often helps to keep variable names short and informative and to have separated labelsandor units in case of continuous measurements This is available via the label andunits functions Note that label allows to annotate the data frame as well
library(Hmisc)
label(birthwt$age) lt- Mother age
units(birthwt$age) lt- years
label(birthwt$bwt) lt- Baby weight
units(birthwt$bwt) lt- grams
label(birthwt self = TRUE) lt- Hosmer amp Lemeshows low birth weight study
These labels can thenbe used in almost every Hmisc functions evenwhenusing listtreein place of str However we will see that there mostly useful when generating Tables orFigures
The contents command offers a quick summary of data format and missing values andit provides a list of labels associated to variables treated as factor by R
Once we have a working data frame the functions contents and describe provide twoquick summary of the data
gt contents(birthwt)
Data framebirthwt 189 observations and 10 variables Maximum NAs5
Labels Units Levels Class Storage NAs
low 2 integer 0
age Mother age years integer integer 1
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Tidyinglwt double 0
race 3 integer 0
smoke 2 integer 0
ptl integer 0
ht 2 integer 0
ui 2 integer 0
ftv integer 5
bwt Baby weight grams integer integer 0
+--------+-----------------+
|Variable|Levels |
+--------+-----------------+
| low |NoYes |
| smoke | |
| ht | |
| ui | |
+--------+-----------------+
| race |WhiteBlackOther|
+--------+-----------------+
As can be seen contents displays storage mode and class of R variables that are presentin the data frame The number of missing values is also reported for each variable Thelevels of each factor-type variable is also printed at the end of the output
gt describe(subset(birthwt select = c(low age race ftv)) digits = 3)
4 Variables 189 Observations
--------------------------------------------------------------------------------
low
n missing unique
189 0 2
No (130 69) Yes (59 31)
--------------------------------------------------------------------------------
age Mother age [years]
n missing unique Info Mean 05 10 25 50 75
188 1 24 1 233 16 17 19 23 26
90 95
31 32
lowest 14 15 16 17 18 highest 33 34 35 36 45
--------------------------------------------------------------------------------
race
n missing unique
189 0 3
White (96 51) Black (26 14) Other (67 35)
--------------------------------------------------------------------------------
ftv
n missing unique Info Mean
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Tidying184 5 6 083 0783
0 1 2 3 4 6
Frequency 98 45 30 7 3 1
53 24 16 4 2 1
--------------------------------------------------------------------------------
The describe function gives more details and in particular provides useful summarystatistics for each variable Here we only considered four variables a binary variable(low) a continuous measure (age) a three-level factor (race) and a count variable (ftv)The output will be different depending on the type of variable and for all but continuousmeasures (defined as variable taking at least 10 distinct values) describe will display atable of counts and frequencies
222 Variable transformation
In case we would like to consider one of the above factors as a numerical variable wecan now use asnumeric and R will take care of attributing the lowest integer score to thebaseline category Of course there might be occasion where we would like to change thatreference level or we might want to collapse two discrete categories Again there aresimple commands to do that for example
birthwt$low lt- relevel(birthwt$low Yes)
levels(birthwt$race)[23] lt- Black+Other
Another common task consists in transforming some predictors either for visualizationpurpose or when building an explantory or predictive model As a simple example wecan imagine centering some of the predictors of interest like age in the above exampleThe within or transform command can be used to append the centered variable to thelist of variables present in the dataframe
birthwt lt- transform(birthwt agec=scale(age scale=FALSE))
Likewise we may want to recode previous premature labours (ptl) as yesno and num-ber of physician visits during the first trimester (ftv) as onemore than one like shownbelow (we show two different syntax that basically perform the same task by relying on Rindexing)
birthwt lt- transform(birthwt ptlyn=factor(ptl gt 0 labels=c(NoYes))
ftvc=factor(ifelse(ftv lt 2 1 2+)))
Hmisc provides a replacement for Rrsquos cut function with better default options (especiallythe infamous includelowest = FALSE) to discretize a continuous variable The cut2
function has many useful options including g= and levelsmean= to return 119892 classes andreport center of each class instead of class intervals
table(cut2(birthwt$age g = 3 levelsmean = TRUE digits = 3))
If there is some reason to treat ftv as an ordered factor a command like
asordered(cut2(birthwt$ftv c(0 1 2 6)))
might do the job
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Tidying23 Indexing subsetting conditioning
A lot of statistical operations that practictioners use to apply on a given dataset are mostlyvariations around the idea of indexing or subsetting By comparison SQL-like operationswould be selection and projection
The subset command offers a simple and elegant way to combine both operations for agiven data frame the subset = option is used to filter rows according to logical expressionor simple row indexes while the select = option is used to return only selected variablesbased on their index (eg column 1 and 3) or an unquoted name (eg c(lowlwt))
The following instruction will print the age of hypertensive mothers whose baby was un-derweight
subset(birthwt low == Yes amp ht == Yes age)
It should be noted that subset returns a data frame
Some people might prefer to use their favorite SQL-like language so something like thiswould perfectly fit their needs
library(sqldf)
sqldf(SELECT age FROM birthwt WHERE low = 0 AND ht = 1 LIMIT 3 rownames = TRUE)
Unfortunately this doesnrsquot work with ldquolabelledrdquo objects as typically returned by Hmiscalthough a solution is readily available at httpstackoverflowcomq2394902420055
There are also a bunch of command dedicated to variables clustering analysis of missingpatterns or simple (impute) or multiple (aregImpute transcan) imputation methodsHere is how we would impute missing values with the median in the case of a continuousvariable
lwt lt- birthwt$lwt
lwt[sample(length(lwt) 10)] lt- NA
lwti lt- impute(lwt)
summary(lwti)
Missing observations will be marked with an asterisk when we print the whole object inR To use the mean instead of the median we just have to add the fun = mean option
24 Summarizing data
Statisticians generally spend a great part of their time in data cleansing data transforma-tion or re-expressionsup3 and data visualization
sup3DC Hoaglin F Mosteller and JW Tukey Understanding Robust and Exploratory Data Analysis Wiley-Interscience 1983 isbn 0-471-09777-2
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Tidying241 Base R functions
Both the by and tapply function allow to apply a builtin or custom function to help sum-marizing a numeric variable by a categorical variable Most of the time these two func-tions are less handy than aggregate since the latter offers a formula interface and returnsa data frame
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt mean)
There is however one caveat when using aggregate even if you can pass a custom func-tion that returns multiple values that can be printed on screen the resulting data framewill still have only one column for its output For exemple the dimensions of the dataframe created by the following call to aggregate is 4 by 3 even if it looks like there twoseparate columns for bwt mean and SD
f lt- function(x) c(mean = mean(x) sd = sd(x))
aggregate(bwt ~ ui + I(ftv gt 1) data = birthwt f)
242 The Hmisc package
There are three useful commands that provide summary statistics for a list of variablesThey implement the ldquosplit-apply-combinerdquo strategy⁴ in the spirit of Rrsquos built-in functions(unlike plyr)
The first one summarize can be seen as an equivalent to Rrsquos aggregate command Givena response variable and one or more classification factors it applies a specific function toall data chunk where each chunk is defined based on factor levels The results are storedin a matrix which can easily be coerced to a data frame using eg asdataframe orHmiscmatrix2dataFrame
Note that enhanced results can be printed on the console using the prn command
Here is a first example using a dedicated function to print the mean and standard devia-tion (SD) of a numerical variable
f lt- function(x naopts = TRUE) c(mean = mean(x narm = naopts) sd = sd(x
narm = naopts))
out lt- with(birthwt summarize(bwt race f))
Here is the output produced by R
Average baby weight by ethnicity out
race bwt sd
3 White 3103 7279
1 Black 2720 6387
2 Other 2805 7222
⁴H Wickham ldquoThe Split-Apply-Combine Strategy for Data Analysisrdquo In Journal of Statistical Software 401(2011) url httpwwwjstatsoftorgv40i01
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
TidyingContrary to aggregate this command provides multiway data structure in case we ask tocompute more than one quantity In this case the dimension of out is a 3 by 3 data frame
Summarizing multivariate responses or predictors is also possible with either summarizeor mApply Of course any built-in functions such as colMeans could be used in place ofour custom summary command
with(birthwt summarize(bwt llist(race smoke) f))
The second command bystats (or bystats2 for two-way tabular output) allows to de-scribe with any custom or built-in function one or multiple outcome by two explanatoryvariables or even more Sample size and the number of missing values are also printed
With the following instruction
with(birthwt bystats(cbind(bwt lwt) smoke race))
we would get
Mean of cbind(bwt lwt) by smoke
N bwt lwt
No White 44 3429 6311
Yes White 52 2827 5741
No Black 16 2854 6793
Yes Black 10 2504 6482
No Other 55 2816 5416
Yes Other 12 2757 5636
ALL 189 2945 5901
whereas
with(birthwt bystats2(lwt smoke race))
gives
Mean of lwt by smoke
+----+
|N |
|Mean|
+----+
+---+-----+-----+-----+-----+
|No | 44 | 16 | 55 |115 |
| |6311|6793|5416|5950|
+---+-----+-----+-----+-----+
|Yes| 52 | 10 | 12 | 74 |
| |5741|6482|5636|5824|
+---+-----+-----+-----+-----+
|ALL| 96 | 26 | 67 |189 |
| |6002|6673|5455|5901|
+---+-----+-----+-----+-----+
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
TidyingThe third and last command is summaryformula which can be abbreviated as summary aslong as formula is used to describe variables relationships There are three main configu-rations (method=) rdquoresponserdquo where a numerical variable is summarized for each level ofone or more variables (numerical variables will be discretized in 4 classes) as summarizedoes rdquocrossrdquo to compute conditional and marginal means of several response variablesdescribed by at most 3 explanatory variables (again continuous predictors are repre-sented as quartiles) rdquoreverserdquo to summarize univariate distribution of a set of variables foreach level of a classification variable (which appears on the left-hand side of the formula)Variables are viewed as continuous as long as they have more than 10 distinct values butthis can be changed by setting eg continuous = 5 When method = reverse it ispossible to add overall = TRUE test = TRUE to add overall statistics and correspondingstatistical tests of null effect between the groups
Here are some examples of use with their outputs
gt summary(bwt ~ race + ht + lwt data = birthwt)
Baby weight N=189
+-------+------------+---+----+
| | |N |bwt |
+-------+------------+---+----+
|race |White | 96|3103|
| |Black | 26|2720|
| |Other | 67|2805|
+-------+------------+---+----+
|ht |No |177|2972|
| |Yes | 12|2537|
+-------+------------+---+----+
|lwt |[364 509)| 53|2656|
| |[509 555)| 43|3059|
| |[555 641)| 46|3075|
| |[6411136]| 47|3038|
+-------+------------+---+----+
|Overall| |189|2945|
+-------+------------+---+----+
gt summary(cbind(lwt age) ~ race + bwt data = birthwt method = cross)
mean by race bwt
+-------+
|N |
|Missing|
|lwt |
|age |
+-------+
+-----+-----------+-----------+-----------+-----------+-----+
| race|[ 7092424)|[24243005)|[30053544)|[35444990]| ALL |
+-----+-----------+-----------+-----------+-----------+-----+
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Tidying|White| 19 | 23 | 20 | 33 | 95 |
| | 0 | 1 | 0 | 0 |1 |
| | 5555 | 5767 | 6223 | 6325 |6014|
| | 2274 | 2478 | 2450 | 2491 |2436|
+-----+-----------+-----------+-----------+-----------+-----+
|Black| 9 | 9 | 6 | 2 | 26 |
| | 0 | 0 | 0 | 0 |0 |
| | 6510 | 5970 | 7083 | 9341 |6673|
| | 2344 | 2089 | 2000 | 2050 |2154|
+-----+-----------+-----------+-----------+-----------+-----+
|Other| 20 | 16 | 19 | 12 | 67 |
| | 0 | 0 | 0 | 0 |0 |
| | 5123 | 5295 | 5890 | 5534 |5455|
| | 2220 | 2269 | 2226 | 2250 |2239|
+-----+-----------+-----------+-----------+-----------+-----+
|ALL | 48 | 48 | 45 | 47 |188 |
| | 0 | 1 | 0 | 0 |1 |
| | 5554 | 5648 | 6197 | 6251 |5906|
| | 2265 | 2335 | 2296 | 2411 |2327|
+-----+-----------+-----------+-----------+-----------+-----+
gt summary(low ~ race + ht data = birthwt fun = table)
low N=189
+-------+-----+---+---+---+
| | |N |No |Yes|
+-------+-----+---+---+---+
|race |White| 96| 73|23 |
| |Black| 26| 15|11 |
| |Other| 67| 42|25 |
+-------+-----+---+---+---+
|ht |No |177|125|52 |
| |Yes | 12| 5| 7 |
+-------+-----+---+---+---+
|Overall| |189|130|59 |
+-------+-----+---+---+---+
Finally here is amore complexTable produced by summaryformulawith method = reverse
out lt- summary(low ~ race + age + ui data = birthwt method = reverse overall = TRUE
test = TRUE)
print(out prmsd = TRUE digits = 2)
Although we display the output on the console directly it is also possible to build a 119871A119879E119883Table using the latex function
latex(out file = tab-summarytex ctable = TRUE digits = 1 exclude1 = FALSE caption = NULL)
Here is the final output
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
TidyingN No Yes Combined Test Statistic
119873 = 111356811135701113567 119873 = 11135721113576 119873 = 111356811135751113576
race White 189 56 (73) 39 (23) 51 (96) 12059411135691113569 = 5 119875 = 0081113568Black 12 (15) 19 (11) 14 (26)
Other 32 (42) 42 (25) 35 (67)
Mother age years 188 19 23 28 20 22 25 19 23 26 1198651113568111356811135751113573 = 1 119875 = 021113569ui No 189 89 (116) 76 ( 45) 85 (161) 12059411135691113568 = 5 119875 = 0021113568
Yes 11 ( 14) 24 ( 14) 15 ( 28)
119886 119887 119888 represent the lower quartile 119886 the median 119887 and the upper quartile 119888 for continuous variables119873 is the number of nonndashmissing valuesNumbers after percents are frequenciesTests used1Pearson test 2Wilcoxon test
243 The plyr package
The plyr package⁵ offers a general solution to those kind of tasks
⁵Ibid
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
CHAPTER3
Univariate distributions
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
31 Stripchart
Stripchart aims at showing the distribution of a series of continuousmeasurements (muchlike scatterplot for 2D data discussed in sect 42) They are useful for small to moderatedataset With large119873 it is proably better to switch to alternative displays see next sections
x lt- sample (130 25 replace=TRUE)
stripplot(sim x jitterdata=TRUE factor=8 aspect=5)
With this synthetic dataset where several obser-vations can take the same value jittering pointlocations on the horizontal and vertical axes en-sures a better representation
x
5 10 15 20 25 30
15
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
stripplot(sim x jitterdata=TRUE factor=8 aspect=xy)
x
5 10 15 20 25 30
A better way of flattening the display is to use anldquoxyrdquo aspect
stripplot(x sim 1 horizontal=FALSE jitterdata=TRUE aspect =12
scales=list(x=list(draw=F)) xlab= pch=15 alpha=5)
x
5
10
15
20
25
30 This is basically the same picture but the 119909 and119910 axis have been transposed We used a differ-ent symbol and transparency to highlight wherereplication occurs in the data Obviously thatwonrsquot work so nicely with larger sample size ora higher density of replication
stripplot(sim x panel=HH paneldotplottb cex=12 factor=2)
x
5 10 15 20 25 30
In contrast to the base stripchart functionthere is no way of imposing a stacked displayin lattice However there is some convenientpanel function in the HH package
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
x lt- sample(seq (1 60 by=2) 75 replace=TRUE)
stripplot (1 sim x panel=panelsunflowerplot col=black
segcol=black seglwd =1 size=08)
With possible replicates it is also interesting touse ldquosunflowersrdquo where multiple leaves are usedfor each duplicate The custom panel functionmimics the base sunflowerplot function For analternative way of embedding ldquosunflowersrdquo intoa lattice display see the following thread onR-help httpbitlyIg4RTq Note that weremove 119910-axis annotation using commands pre-sented before (ie using scales=)
x
0 10 20 30 40 50 60
32 Histograms
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
histogram(sim waiting data=faithful)
faithful Waiting time betweeneruptions and the duration of theeruption for the Old Faithful geyser inYellowstone National Park WyomingUSA
A simple histogram of waiting time expressed asdensity Note that forgetting the sim operator willraise an error message
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
Box 31 shows some custom settings with the faithful dataset Lorem ipsum dolor sitamet consectetuer adipiscing elit Ut purus elit vestibulum ut placerat ac adipiscingvitae felis Curabitur dictum gravida mauris Nam arcu libero nonummy eget con-
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
sectetuer id vulputate a magna Donec vehicula augue eu neque Pellentesque habitantmorbi tristique senectus et netus etmalesuada fames ac turpis egestas Mauris ut leo Crasviverrametus rhoncus sem Nulla et lectus vestibulumurna fringilla ultrices Phasellus eutellus sit amet tortor gravida placerat Integer sapien est iaculis in pretium quis viverraac nunc Praesent eget sem vel leo ultrices bibendum Aenean faucibus Morbi dolornulla malesuada eu pulvinar at mollis ac nulla Curabitur auctor semper nulla Donecvarius orci eget risus Duis nibh mi congue eu accumsan eleifend sagittis quis diamDuis eget orci sit amet orci dignissim rutrum
Box 31Common options for histogram include displaying counts instead of percents or varying bar color It is also possible tochange default bin size When nint=nrow(dataset) we have a so-called high-density vertical lines much like when usingplot( type=h)
type=count
waiting
Count
0
20
40
60
40 50 60 70 80 90 100
col=steelblue
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
border=NA nint=12
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
40 50 60 70 80 90 100
nint=nrow(dataset)
waiting
Perc
ent
of
Tota
l
0
2
4
6
40 50 60 70 80 90 100
p lt- histogram(sim waiting data=faithful lwd=5 type=percent
border=white)
xbreaks lt- p$panelargscommon$breaks
yvalues lt- table(cut(faithful$waiting breaks=xbreaks ))
update(p panel=function( )
panelhistogram()
paneltext(xbreaks+diff(xbreaks )[1]2
yvaluesnrow(faithful)100
ascharacter(yvalues ) cex=8 adj=c(5 0))
)
waiting
Perc
ent
of
Tota
l
0
5
10
15
20
25
40 50 60 70 80 90 100
13
3133
22
14
67
63
24
5
13
The following example demonstrates how a de-fault histogram displaying percent data can beannotated with counts data This is intended toshow howwe can steal away default setting to dis-play the distribution of discrete values)
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
x lt- rnorm (80 mean =12 sd=2)
histogram(sim x type=density border=NA
panel=function(x )
panelhistogram(x )
panelmathdensity(dmath=dnorm col=BF3030
args=list(mean=mean(x)sd=sd(x)))
)
An example where we superimposed a normaldensity with parameters estimated from the sam-ple)
x
Densi
ty
000
005
010
015
6 8 10 12 14 16
33 Density plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
sup1
sup1BW Silverman Density Estimation London Chapman and Hall 1986 isbn 978-0412246203
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
densityplot(sim waiting data=faithful)
waiting
Densi
ty
000
001
002
003
40 60 80 100
The default panel relies on Rrsquos density func-tion As such the default kernel is gaussian
with 119899 = 512 equally spaced at which thedensity is estimated The choice of the band-with follows Silvermanrsquos rule of thumb namelymin(09SD IQR134119899) An alternative bandwithcan be selected using bw=SJ
densityplot(sim waiting data=faithful plotpoints=rug)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Instead of a mini stripchart displayed at 119910 = 0a ldquorugplotrdquo can be preferred It might help spot-ting possible local concentration of data pointscompared to simple jittered points
densityplot(sim waiting data=faithful plotpoints=FALSE ref=TRUE)
waiting
Densi
ty
000
001
002
003
40 60 80 100
Sometimes adding a reference line crossing at119910 = 0 may be informative especially for multi-modal distributions It is advised to avoid plot-ting individual observations like was done in thepreceding graphics
An alternative way of presenting density plots are so-called ldquoviolin plotsrdquosup2 which feature
sup2JL Hintze and RD Nelson ldquoViolin Plots A Box Plot-Density Trace Synergismrdquo In The American Statistician522 (1998) pp 181ndash184
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
the main components of a boxplot (sect 35) and a kernel density estimation and ldquobeanplotsrdquosup3 where density trace are mirrored to form a polygon shape The latter presents theadvantage of allowing asymmetrical plotting depending on a grouping factor
bwplot(sim waiting data=faithful panel=panelviolin)
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
bwplot(sim waiting data=faithful
panel=function( boxratio )
panelviolin( col=transparent varwidth=FALSE
boxratio=boxratio)
panelbwplot( fill=NULL boxratio=15 ))
A simple ldquoviolinrdquo panel
waiting
50 60 70 80 90
34 Quantile and related probability plots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnasup3P Kampstra ldquoBeanplot A Boxplot Alternative for Visual Comparison of Distributionsrdquo In Journal of Statis-tical Software 28 (2008) url httpwwwjstatsoftorgv28c01
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
fringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
qqmath(sim rnorm (60))
qnorm
rnorm
(60
)
-3
-2
-1
0
1
2
-2 -1 0 1 2
A simple quantile plot of 60 gaussian variatesThe theoretical quantiles of an119977 (0 1) are shownon the 119909-axis
x lt- rt(500 df=20)
qqmath(sim x pch=+ abline=c(01) dist=qnorm)
qnorm
x
-4
-2
0
2
-2 0 2
+
+ +
+++
+++++++
+++++++++++
++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+++++++++++++++++++++++++++++++++++++
++++++++++++++++++++++
++++++++++++++++++++++++
+++++++++++++++++++
+++++++++
+++
+
+ +This is basically asking to draw the sameQQ-plotbut with a higher number of data points com-ing from a different distribution (Student 119905(20))A reference line is added to facilitate compar-ison This basically shows how closely the 119905-distribution can be approximated by an 119977 (0 1)when 119899 gets very large
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
qqmath(sim x pch=+ abline=c(01) dist=qnorm
fvalue=ppoints (100))
Same as above but subsampling data points
qnormx
-2
0
2
-2 -1 0 1 2
+
++
+++++
++++++
++++++
++++++++
+++++++++++
+++++++++
+++++++++++++
++++++++
+++++
++++++++
++++++
+++++
++
++ +
+
+
qqmath(sim Hwt data=cats subset=Sex == F dist=qunif)
cats The heart and body weights ofsamples of male and female cats usedfor digitalis experiments The cats wereall adult over 2 kg body weight
Here is one way to show the cumulative distri-bution function (CDF) of some sample datasetNote that we need to explicitly need to ask qqmath
to use the Uniform distribution as a reference
qunif
Hw
t
6
8
10
12
00 02 04 06 08 10
ecdfplot(sim Hwt data=cats subset=Sex == F)
An alternative way of plotting the empirical CDFis to rely on the ecdfplot function from thelatticeExtra package
Hwt
Em
pir
ical C
DF
00
02
04
06
08
10
6 8 10 12
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
35 Boxplots
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
bwplot(sim height data=women)
height
60 65 70
women This data set gives theaverage heights and weights forAmerican women aged 30-39
Boxplot are shown in horizontal mode bydefault Internally the boxplotstats func-tion is used so that hinges corresponds to thefirst and third quartile while notches extendsto plusmn158IQRradic119899 It provides a visual summaryanalogous to Tukeyrsquos five number (see fivenum)
bwplot(sim height data=women boxratio=5)
height
60 65 70
This is the same picture but with a thinner boxWhen there are several boxplots to draw side byside this might be useful
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
bwplot(sim height data=women boxratio=5
scales=list(x=list(limits=c(50 80) at=seq (50 80 by =2))))
A more detailed 119909-scale has been created (with-out prejudice to its usefulness) by simply updat-ing the scales= argument
height
50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80
bwplot(sim height data=women aspect=5
panel=function(x )
panelbwplot(x pch=| )
panelpoints(mean(x) 1 pch=19 cex =1)
)
It is possible to alter the way boxplot are drawnbut also the statistical summaries that are dis-played For instance in the above code we com-puted themean (drawn as a vertical bar inside thebox) in addition to the median Of note if thereare missing values we should add narm=TRUE
when calling mean(x)height
60 65 70
36 Time series
Lorem ipsum dolor sit amet consectetuer adipiscing elit Ut purus elit vestibulum utplacerat ac adipiscing vitae felis Curabitur dictum gravida mauris Nam arcu liberononummy eget consectetuer id vulputate a magna Donec vehicula augue eu nequePellentesque habitantmorbi tristique senectus et netus et malesuada fames ac turpis eges-tas Mauris ut leo Cras viverra metus rhoncus sem Nulla et lectus vestibulum urnafringilla ultrices Phasellus eu tellus sit amet tortor gravida placerat Integer sapien estiaculis in pretiumquis viverra ac nunc Praesent eget semvel leo ultrices bibendum Ae-nean faucibus Morbi dolor nulla malesuada eu pulvinar at mollis ac nulla Curabiturauctor semper nulla Donec varius orci eget risus Duis nibh mi congue eu accumsaneleifend sagittis quis diam Duis eget orci sit amet orci dignissim rutrum
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(sunspotyear)
Time
05
01
00
15
0
1700 1750 1800 1850 1900 1950 2000
sunspotyear Yearly numbers ofsunspots from 1700 to 1988
A simple time-series is displayed as a lineplotbut taking care of arranging the 119909-axis for timemeasurements
xyplot(sunspotyear aspect=3 scales=list(y=list(rot =0)))
Time
0
50
100
150
1700 1750 1800 1850 1900 1950 2000
A more comprehensive picture after aspect ra-tio has been lowered so as to better highlight thecyclic component
xyplot(sunspotyear strip=FALSE cut=list(number =2 overlap=05))
Time
0
50
100
150
1700 1750 1800 1850
1850 1900 1950
0
50
100
150
The same time series cut into two pieces with 5of overlap
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(sunspotyear strip=FALSE stripleft=TRUE
cut=list(number =2 overlap=05))
Now we highlight explicitely that the two seriesof measurements are related by adding a coloredribbon denoting time period
Time
0
50
100
150
1700 1750 1800 1850
tim
e
1850 1900 1950
0
50
100
150
tim
e
xyplot(zoo(discoveries ))
discoveries The numbers of ldquogreatrdquoinventions and scientific discoveries ineach year from 1860 to 1959
The zoo has to be loaded before using theabove command Briefly it takes care of han-dling time-series data correctly and it is inter-faced to latticersquos xyplot as an S3 method (seexyplotzoo)
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
xyplot(zoo(discoveries ) panel=panelxyarea)
It is possible to add a shaded area by using a spe-cific panel function from the latticeExtra pack-age
Time
02
46
81
01
2
1860 1880 1900 1920 1940 1960
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xt lt- ts(cumsum(rnorm (200 12)))
p lt- xyplot(xt)
Time
-20
02
04
06
08
0
0 500 1000 1500 2000 2500
The lattice package works with ts objetcs too
xt lt- zoo(accdeaths)
xyplot(xt type=c(lg) ylab=Total accidental deaths)
Time
Tota
l acc
identa
l d
eath
s
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
accdeaths A regular time seriesgiving the monthly totals of accidental
deaths in the USA
Another regular time series is
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(xt
panel=function(x y )
panelxyplot(x y )
panellines(rollmean(zoo(y x) 3) lwd=2 col =1)
)
The same dataset with a rolling mean (of width3)
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
xyplot(xt) +
layer(paneltskernel(x y c=3 col=1 lwd =2))
Discrete symmetric smoothing kernels availablein latticeExtra can be used instead of a rollingmean Here an approximate gaussian filter wasused to highlight the seasonal component Ofnote the above R commandmakes use of a layersee also Chapter 6
Time
70
00
80
00
90
00
10
00
01
10
00
1973 1974 1975 1976 1977 1978 1979
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(tsunion(sunspotyear lag10=lag(sunspotyear 10))
superpose=TRUE panel=panelsuperpose
panelgroups=function( groupnumber )
if ( groupnumber == 1) panelxyarea()
else panelxyplot()
border=NA cut=list(n=3 overlap =0) aspect=xy
parsettings=simpleTheme(col=c(greyblack) lwd=c(5 2)))
Time
050
100150
1700 1720 1740 1760 1780
time
1800 1820 1840 1860 1880
050100150
time
050
100150
1900 1920 1940 1960 1980
time
sunspotyearlag10sunspotyear Yearly numbers of
sunspots between 1700 and 1988More complex arrangement can be done againwith the latticeExtra panel function Hereyearly numbers of sunspots are shown togetherwith a lagged version (10 years)
flow lt- ts(filter(rlnorm (200 mean = 1) 08 method = r))
xyplot(flow
panel=function(x y )
panelxblocks(x y gt mean(y) col=lightgray)
panelxyplot(x y )
)
Time
01
02
03
04
05
0
0 50 100 150 200
blabla blabla
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
CHAPTER4
Two-way graphics
This chapter covers graphical displays for two-way relationships possibly by consideringadditional variables (numerical or categorical) to highlight ternary relationships Two-way graphics are not limited to numerical variables as we may be interested in showingthe relationships between two ordered categorical variables two unordered or ldquonominalrdquovariables Moreover as stated above categorical or discretized variables can be used toprovide additional information on top of a line- or scatter-plot by simply varying pointsize point or line colors and so on Of course we could extend this idea to the pointof displaying sixth dimensions in a single graph (eg using symbol with varying lengthand width color and shading pattern) But such a complex graph would likely be poorlyreadable and uninformative in the end So this chapter basically provides necessary Rcommand to create line-plot scatter-plot bar-plot dot-plot
31
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
41 Lineplots
xyplot(uptake sim conc data=CO2 groups=Treatment type=a)
conc
up
take
10
20
30
40
200 400 600 800 1000
Automatic averaging
42 Scatterplots
The basic R command for displaying a two-way scatterplot is xyplot A command likexyplot(y sim x) will produce a 2D plot almost identical to what would be obtained usingbase graphics plot(x y) However the default layout is generally better and it looksmore pretty
xyplot(dist sim speed data=cars)
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
cars The data give the speed of carsand the distances taken to stop Note
that the data were recorded in the1920s
This basic scatterplot show default options whencalling the xyplot command The formula inter-face is used to plot dist (119910-axis) as a function ofspeed (119909-axis) with automatic determination ofaxis units
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(dist sim speed data=cars pch=rbinom(nrow(cars) 1 5)+1)
Wepick a random symbol ( = 1 = 2) for eachobservation using the pch= argument In factthis argument is transferred to the panelxyplotfunction that acts as the default panel functionNote that the vector of symbols should have thesame length as the 119909 and 119910 components oth-erwise recycling occurs It is however not agood idea to manipulate the pch= (or col= seebelow) argument and it is better to rely on theparsettings= parameter since it allows to usecustom themes which will facilitate the display oflegend speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars
col=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
Color (col=) of each observation depends of thequartile they belong to Note that passing colorsthis way will override default theming options
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars pch=19
groups=with(cars cut(speed breaks=quantile(speed)
includelowest=TRUE )))
This is basically the same code as previouslyshown except that we replaced the col= argumentby groups= This has the advantage of observingthe current theme and this will further facilitatethe insertion of an automatic legend
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xy lt- asdataframe(replicate (2 rnorm (1000)))
xyplot(V1 sim V2 data=xy pch=19 alpha=5)
V2
V1
-2
0
2
-2 0 2
When there are a high proportion of pointsthat overlap using transparent color may beuseful We replaced the default symbol withits filled counterpart An equivalent way ofspecifying transparent color would be to usergb(22 49 72 alpha=5)
dat lt- dataframe(replicate (2 rnorm (500)) z=sample (040 500 T))
cols lt- colorRampPalette(brewerpal (11 RdBu))( diff(range(dat$z)))
xyplot(X1 sim X2 data=dat col=cols[dat$z] pch=19 alpha=5)
X2
X1
-2
0
2
-3 -2 -1 0 1 2
Alpha-blending and color palette might be com-bined as well Here we used a pre-defined colorscheme (Red to Blue) from the RColorBrewer
package As the selected palette has only 11 dif-ferent colors whereas the grouping factor z has40 levels we use linear interpolation to increasethe number of available colors
xyplot(SepalLength sim PetalLength data=iris jitterx=TRUE
amount=2)
PetalLength
Sep
alLe
ng
th
5
6
7
8
1 2 3 4 5 6 7
iris This famous (Fisherrsquos orAndersonrsquos) iris data set gives the
measurements in centimeters of thevariables sepal length and width andpetal length and width respectively
for 50 flowers from each of 3 species ofiris The species are Iris setosa
versicolor and virginica
As an alternative to transparent colors one mayresort on ldquojitteringrdquo This is also useful when notso many points are available but show few varia-tions on one dimension The panelxyplot func-tion uses jitterx= and jittery= to vary 119909 and119910 coordinates by adding a random shift drawnfrom a uniform distribution 119984(minus119886 119886) where 119886stands for the amount= parameter
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(dist sim speed data=cars
panel=function(x y )
panelxyplot(x y )
panelrug(x y )
)
It is possible to superimpose the univariate distri-bution of both series of measurement using ldquorugrdquoplots Usually they remain quite discreet (readnon-invasive) but provide additional informationto spot possible asymmetry We need to ask ex-plicitly for a custom panel though
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(log(Volume ) sim log(Girth) data=trees)
trees This data set providesmeasurements of the girth height andvolume of timber in 31 felled blackcherry trees Note that girth is thediameter of the tree (in inches)measured at 4 ft 6 in above the ground
Asimple log-log plot Note thatwewould have tomanually update the scales= component to pro-vide more suitable annotations for the 119909 and 119910-axis
log(Girth)
log
(Volu
me)
25
30
35
40
22 24 26 28 30
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot ((1200)20 sim (1200)20 type=c(p g)
scales=list(x=list(log =10) y=list(log =10))
xscalecomponents=xscalecomponentslog103
yscalecomponents=yscalecomponentslog103)
(1200)20
(12
00
)2
0
01
03
1
3
10
01 03 1 3 10
Instead of transforming variables in the formulait is easier and safer to do this through thescales= parameter The [x|y]scalecomponentsare convenient functions that help to annotateaxes with correct units and tick marks spacing
xyplot(dist sim speed data=cars scales=list(tck=c(1 0)))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
To get ride of ticks on opposite axes we canchange default values for tck= in the scales=
component The tck= parameter controls thelength of tick marks however with a vector oflength 2 it can be used to deal with leftbottomand righttop axis separately
xyplot(dist sim speed data=cars scales=list(alternating =3))
speed
dis
t
5 10 15 20 25
0
20
40
60
80
100
120
5 10 15 20 25
0
20
40
60
80
100
120
To annotate both axes we can alter thealternating= parameter In most case howeveradding grid lines in the background should pro-vide enough information Using alternating=2
would reverse the annotation of axis (righttopinstead of leftbottom)
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
43 Barcharts
spraydf lt- aggregate(count sim spray data=InsectSprays FUN=mean)
barchart(count sim spray data=spraydf)
InsectSprays The counts of insects inagricultural experimental units treatedwith different insecticides
Before using barchart with one continuous andone categorical variable we need to consider howto aggregate data in other words what summarymeasure to consider
count
5
10
15
A B C D E F
barchart(spray sim count data=spraydf)
To reverse 119909 and 119910 axis we just need to use theexchange the right-hand and left-hand side of thepreceding formula
count
A
B
C
D
E
F
5 10 15
44 Dotcharts
low ink-ratio
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
dotplot(spray sim count data=InsectSprays)
count
A
B
C
D
E
F
0 5 10 15 20 25
Instead of barcharts it is usually more easy to useClevelandrsquos dotcharts as they allow to show indi-vidual (within level) variations
dotplot(spray sim count data=InsectSprays lty =2)
count
A
B
C
D
E
F
0 5 10 15 20 25
The main panel can be customized easily For ex-ample we can change the way horizontal lines aredrawn
dotplot(spray sim count data=InsectSprays type=c(pa))
count
A
B
C
D
E
F
0 5 10 15 20 25
It is also possible to show aggregated data like av-erage count per level using panelaverage Thedefault fun= argument is mean but we could usemedian instead
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
45 Line fits
In this section we discuss the addition of model fit to existing two-way graphics Forexample it may be interesting to show a regression or lowesssup1 line when using xyplot
xyplot(dist sim speed data=cars type=c(psmooth))
An adaptive loess smoother superimposed on topof a standard scatterplot
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
xyplot(dist sim speed data=cars type=c(psmooth) span=13)
Window span can be controlled using the span=
argument where lower valuemeansmore sentiv-ity to local variations
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
sup1WS Cleveland ldquoRobust locally weighted regression and smoothing scatterplotsrdquo In Journal of the AmericanStatistical Association 74 (1979) pp 829ndash836
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(dist sim speed data=cars type=c(pr))
speed
dis
t
0
20
40
60
80
100
120
5 10 15 20 25
Regression fit can be shown using the same ideathrough the type= argument
46 Time series
xt lt- ts(matrix(cumsum(rnorm (200 12)) ncol =2))
xyplot(xt)
Time
-30
-20
-10
01
0
Series 1
-50
-40
-30
-20
-10
01
0
0 200 400 600 800 1000 1200
Series 2
blabla blabla
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(xt scales=list(y=same) type=c(lg))
blabla blabla
Time
-40
-20
0
Series 1
0 200 400 600 800 1000 1200
-40
-20
0
Series 2
xyplot(xt layout=c(2 1))
blabla blabla
Time
-30
-20
-10
01
0
0 200 400 600 800 1200
Series 1
0 200 400 600 800 1200
-50
-40
-30
-20
-10
01
0
Series 2
xyplot(EuStockMarkets scales=list(y=same))
EuStockMarkets Contains the dailyclosing prices of major European stockindices Germany DAX (Ibis)Switzerland SMI France CAC and UKFTSE The data are sampled in businesstime ie weekends and holidays areomitted
blabla blabla
Time
2000400060008000
DAX
2000400060008000
SMI
2000400060008000
CAC
1992 1994 1996 1998
2000400060008000
FTSE
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
xyplot(EuStockMarkets superpose=TRUE autokey=list(columns =2))
Time
20
00
40
00
60
00
80
00
1992 1994 1996 1998
DAXSMI
CACFTSE blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE)
Time
DA
XSM
IC
AC
1992 1994 1996 1998
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
horizonplot(EuStockMarkets colorkey=TRUE ) + infolayers
Time
15191
16288
DA
X
22446
16781
SM
I
8719
17728
CA
C
1992 1994 1996 1998
12451
24436
FTSE
minus 3Δ i
minus 2Δ i
minus 1Δ i
origin
+ 1Δ i
+ 2Δ i
+ 3Δ i blabla blabla
47 Level plot
Level plot can be used to display data summaries as ldquoheatmaprdquo Here are two examples
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
data(Harman23cor)
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab=)
Harman23cor A correlation matrix ofeight physical measurements on 305girls between ages seven andseventeen HH Harman Modern FactorAnalysis Third ed Table 23 Universityof Chicago Press 1976
Level plots can be used to display correlationma-trix in a concise way (not unlike symnum)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
cols lt- colorRampPalette(brewerpal (8 RdBu))
levelplot(Harman23cor$cov scales=list(x=list(rot =45))
xlab= ylab= colregions=cols)
The default color scheme can be changed easilyHere we are using a Red-Blue color palette fromthe RColorBrewer package (Note that it intro-duces little changes compared to the precedingplot because a custom theme is currently in usefor the whole textbook)
height
armspan
forearm
lowerleg
weight
bitrodiameter
chestgirth
chestwidth
heig
ht
arm
spa
n
fore
arm
lower
leg
wei
ght
bitro
diam
eter
ches
tgirt
h
ches
twid
th
02
03
04
05
06
07
08
09
10
Sometimes it is helpful to reorder the entries in a correlation matrix or a two-way cross-classification This can be done using order or custom packages like
It is also possible to combine a dendrogram with a distance matrix and add this cluster-ing information to an heatmap Here is an illustration using the mtcars dataset and thedendrogramGrob function from latticeExtra
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
CHAPTER5
Multi-way graphics
This chapter focus on multi-variable displays where usually two-way graphics are con-ditioned on values taken by one or more variables or a combination thereof These so-called ldquo displaysrdquo are very good at conveying information about trend or variation be-tween two numerical variables across the levels of a third factor
51 Parallel displays
52 Scatterplot matrix
53 Three-way tabular data
54 N-way data
45
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
CHAPTER6
Customizing theme and panels
47
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Lattice
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
HmiscCHAPTER7
The Hmisc plotting functions
71 Two-way graphics
Hmisc provides automatic labelling of curves or levels of grouping factor which are usedas in standard lattice graphics (groups=) without the need to rely on the directlabels
package
Let us assume that we defined the following helper functions to compute standard errorof a mean and the corresponding 95 confidence interval
se lt- function(x) sd(x)sqrt(length(x))
f lt- function(x) c(mean = mean(x)
lwr = mean(x) - 196 se(x)
upr = mean(x) + 196 se(x))
49
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Hmisc
d lt- with(birthwt summarize(bwt race f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race label = Ethnicity)
data = d type = b keys = lines
ylim = range(apply(d[ 34] 2 range )) + c(-11) 100
scales = list(x = list(at = 13 labels = levels(d$race ))))
Ethnicity
Baby w
eig
ht
2600
2800
3000
3200
White Black Other
The race variable needs to be converted to a nu-merical variable for proper scaling on the 119909-axisNote that labeling of the 119909-axis can be done at thesame time The keys= argument controls the typeof drawing while type= behaves similarly to thelattice plotting parameter
Here is another example using a grouping factor
d lt- with(birthwt summarize(bwt llist(race smoke) f))
xYplot(Cbind(bwt lwr upr) sim numericScale(race) groups = smoke
data = d type = l keys = lines method = alt bars
ylim = c(2200 3600)
scales = list(x = list(at = 13 labels = levels(d$race ))))
race
Baby w
eig
ht
2400
2600
2800
3000
3200
3400
White Black Other
NoYes
bla bla
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Ggplot
CHAPTER8
The ggplot2 package
81 Why ggplot2
How does ggplot2 differ from lattice Both packages rely on the grid system How-ever unlike lattice ggplot2 makes heavy use of the so-called ldquoGrammar of Graphicsrdquoapproachsup1
82 Core components of a ggplot graphic
sup1L Wilkinson The Grammar of Graphics Springer 2005
51
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Ggplot
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Ggplot
CHAPTER9
Interactive and dynamic displays
91 Exploratory data analysis
92 Brushing and linking
93 The ggobi toolbox
53
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Ggplot
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55
Ggplot
Index
abline 18 19adj 14aggregate 9 33alpha 12 30alternating 32amount 30args 15ascharacter 14asnumeric 8aspect 11 12 21 22 26at 21 45 46autokey 38
barchart 33border 14 15 26boxratio 17 20 21boxplotstats 20breaks 14 29brewerpal 30 39bw 16by 9 13 21
c 9 14 18 19 21 24ndash26 3234ndash37 45 46
cex 12 14 21col 13ndash15 17 25 26 29 30colregions 39colorkey 38colorRampPalette 30 39columns 38contents 6 7cumsum 24 36cut 8 14 22 23 26 29cut2 8
data 13 14 16 17 19ndash2128ndash36 39 45 46
dataframe 5 8 30density 16densityplot 16describe 6 8df 18diff 14 30dist 18 19 28 29 31 32 35
36dmath 15dotplot 34draw 12
ecdfplot 19else 26
fvalue 19factor 7 11 12fill 17filter 26fivenum 20formula 5FUN 33fun 34function 14 15 17 21 25
26 31
groupnumber 26groups 28 29 45 46
histogram 13ndash15horizonplot 38horizontal 12
if 26includelowest 8 29
jitterdata 11 12jitterx 30jittery 30
labels 45 46lag 26layer 25layout 37levels 45 46limits 21list 12 15 21ndash23 26 32
37ndash39 45 46log 31 32lty 34lwd 14 25 26
matrix 36mean 15 21 26 33 34median 34method 26 45 46
narm 21ncol 36nint 14nrow 14 29number 22 23
overlap 22 23 26
panel 12ndash15 17 21 23 2526 31
panelaverage 34panelbwplot 17 21panelgroups 26panelhistogram 14 15panellines 25panelmathdensity 15panelpoints 21panelrug 31paneltext 14paneltskernel 25panelxblocks 26panelxyarea 23 26panelxyplot 25 26 29ndash31parsettings 26 29pch 12 18 19 21 29 30plot 14 28plotpoints 16ppoints 19
qnorm 18qqmath 18 19
quantile 29qunif 19
range 30rbinom 29ref 16replace 11 13replicate 30rgb 30rlnorm 26rnorm 15 18 24 30 36rollmean 25rot 22 39rt 18
sample 11 13 30scales 12 13 21 22 31 32
37 39 45 46sd 15segcol 13seglwd 13seq 13 21simpleTheme 26span 35strip 22 23stripleft 23stripchart 12stripplot 11ndash13subset 9 19summaryformula 6sunflowerplot 13superpose 26 38symnum 39
table 14tapply 9tck 32transform 8treillis 41ts 24 26 36tsunion 26type 14 15 24 28 32 34ndash
37 45 46
update 14
varwidth 17
with 29 45 46within 8
xlab 12 39xscalecomponents 32xy 12 30xyplot 22ndash26 28ndash32 35ndash38
ylab 24 39yscalecomponents 32yscalecomponentslog103
32
zoo 5 23ndash25
55