sta 326 2.0 programming and data analysis with r
TRANSCRIPT
STA 326 2.0 Programming and Data Analysis with R ∗
Exploring iris dataset with qplot
Contents
Stage 1: Planning your analysis 2
Step 1: Dataset overview and description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Step 2: One-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Step 3: Two-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Step 4: Three-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Stage 2: Getting started with qplot() in the ggplot2 package. 3
Recap: some important arguments in qplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
One-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Load packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1. Summarizing qualitative variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2. Summarizing quantitative variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Two-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1. Visualizing two qualitative variables at a time . . . . . . . . . . . . . . . . . . . . . . 6
2. Visualizing qualitative and quantitative variables . . . . . . . . . . . . . . . . . . . . 8
Three-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
patchwork package in R 13
Arrange multiple graphs row-wise use “/” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
Arrange multiple graphs column-wise: use “|” . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
Arrange multiple graphs both row-wise and column-wise . . . . . . . . . . . . . . . . . . . . . 15
Stage 3: Final analysis 16
1. Composition of the sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2. Distribution of the features of sepal and petal by species . . . . . . . . . . . . . . . . . . . 17
3. Relationship between features of sepal and petal by species . . . . . . . . . . . . . . . . . . 18
∗All rights reserved by Thiyanga S. Talagala
1
Stage 1: Planning your analysis
Step 1: Dataset overview and description
Before we get started let’s look at the data and plan a analysis.
Load iris dataset
data(iris)
Here is a glimpse of the dataset.
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species1 5.1 3.5 1.4 0.2 setosa2 4.9 3.0 1.4 0.2 setosa3 4.7 3.2 1.3 0.2 setosa4 4.6 3.1 1.5 0.2 setosa5 5.0 3.6 1.4 0.2 setosa6 5.4 3.9 1.7 0.4 setosa
We have four quantitative variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and onequalitative variable: Species
Step 2: One-way analysis
Let’s look at the graphs we could use to explore variables one by one.
Plots that could be used to to summarize qualitative variables
• Bar chart
• Pie chart
Plots that could be used to to summarize quantitative variables
• Box and whisker plot
• Histograms
• Dot plots
• Density plot
• Stem and leaf displays
Note: Stem and leaf displays are best-suited for small to moderate datasets, whereas others such ashistograms and Box and whisker plots are best-suited for large datasets. Box and whisker plots andhistograms are also good at depicting differences between distributions and identifying outliers.
2
Step 3: Two-way analysis
Next, we will look at two variables at a time.
• Quantitative vs Quantitative: Scatter plots
• Quantitative vs Qualitative: Box plots/ Histograms/ Dot plots/ Density plots with groups allowus to compare across different levels of the qualitative variable. Faceting can be used to generatethe same plot for different levels of the qualitative variable.
Step 4: Three-way analysis
Now, let’s look at three variables at a time.
• Two quantitative variables and one qualitative variable: Scatter plot with different markers (eg:size, shapes, colours) for different levels of the qualitative variable.
Stage 2: Getting started with qplot() in the ggplot2 package.
Now we are going to use the qplot function to make some quick plots. This section demonstrates howdifferent graphs can be plotted for various purposes using the qplot.
Recap: some important arguments in qplot
qplot(x, # X variabley, # Y variabledata, # name of the dataframefacets = NULL,margins = FALSE,geom = "auto",xlim = c(NA, NA), # numeric vector of length 2 giving the x coordinatesylim = c(NA, NA), # numeric vector of length 2 giving the 7 coordinateslog = "",main = NULL, # Figure titlexlab = NULL, # X-axis titleylab = NULL, # Y-axis titleasp = NA,stat = NULL,position = NULL,
)
3
One-way analysis
Load packages
library(ggplot2)
1. Summarizing qualitative variables
qplot(x = Species, data = iris, geom = "bar", ylab = "Count",main = "Composition of Species")
0
10
20
30
40
50
setosa versicolor virginicaSpecies
Cou
nt
Composition of Species
Figure 1: Composition of the sample
4
2. Summarizing quantitative variables
Here, I have drawn plots only for Sepal.Width. Please take suitable graphs for other quantitativevariables in the iris dataset.
qplot(x = Sepal.Width, data = iris, geom = "histogram")
0
10
20
2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width
Figure 2: Histogram of sepal width
Histogram
qplot(x = Sepal.Width, data = iris, geom = "density")
Density plot
5
0.0
0.3
0.6
0.9
2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width
Figure 3: Density plot of sepal width
qplot(y = Sepal.Width, data = iris, geom = "boxplot")
2.0
2.5
3.0
3.5
4.0
4.5
−0.4 −0.2 0.0 0.2 0.4
Sep
al.W
idth
Figure 4: Boxplot of sepal width
Box and whisker plot
Two-way analysis
1. Visualizing two qualitative variables at a time
6
qplot(x = Sepal.Length, y = Sepal.Width, data = iris, geom = "point")
2.0
2.5
3.0
3.5
4.0
4.5
5 6 7 8Sepal.Length
Sep
al.W
idth
Figure 5: Scatterplot of sepal length and sepal width
7
2. Visualizing qualitative and quantitative variables
qplot(x = Species, y = Sepal.Width, data = iris, geom = "boxplot")
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolor virginicaSpecies
Sep
al.W
idth
Figure 6: Boxplot of sepal width by species
qplot(x = Species, y = Sepal.Width, data = iris, geom = "boxplot", fill = Species)
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolor virginicaSpecies
Sep
al.W
idth Species
setosa
versicolor
virginica
Figure 7: Boxplot of sepal width by species
Different ways to modify your graph
8
qplot(x = Species, y = Sepal.Width, data = iris, geom = c("point","jitter","boxplot"),alpha = 0.5, colour = Species)
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolor virginicaSpecies
Sep
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
Figure 8: Boxplot of sepal width by species
qplot(x = Sepal.Width, data = iris, geom = c("histogram"),colour = Species)
0
10
20
2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width
Species
setosa
versicolor
virginica
Figure 9: Histogram of sepal width
9
qplot(x = Sepal.Width, data = iris, geom = c("histogram"),fill = Species)
0
10
20
2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width
Species
setosa
versicolor
virginica
Figure 10: Histogram of sepal width
qplot(x = Sepal.Width, data = iris, geom = c("density"),colour = Species)
0.0
0.5
1.0
2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width
Species
setosa
versicolor
virginica
Figure 11: Density plot of sepal width
10
Three-way analysis
Everything on a single panel
qplot(x = Sepal.Length, y = Sepal.Width, data = iris,geom = "point", color = Species)
2.0
2.5
3.0
3.5
4.0
4.5
5 6 7 8Sepal.Length
Sep
al.W
idth Species
setosa
versicolor
virginica
Figure 12: Scatterplot of sepal length and sepal width by species
Separate panels for each species: column-wise
qplot(x = Sepal.Length, y = Sepal.Width,facets = .~Species, data = iris, geom = "point")
setosa versicolor virginica
5 6 7 8 5 6 7 8 5 6 7 82.0
2.5
3.0
3.5
4.0
4.5
Sepal.Length
Sep
al.W
idth
Figure 13: Scatterplot of sepal length and sepal width by species
11
Separate panels for each species: row-wise
qplot(x = Sepal.Length, y = Sepal.Width,facets = Species~., data = iris, geom = "point")
setosaversicolor
virginica
5 6 7 8
2.0
2.5
3.0
3.5
4.0
4.5
2.0
2.5
3.0
3.5
4.0
4.5
2.0
2.5
3.0
3.5
4.0
4.5
Sepal.Length
Sep
al.W
idth
Figure 14: Scatterplot of sepal length and sepal width by species
12
patchwork package in R
library(patchwork)
First you need to assign a name for each graph. Here, I use q1 and q2.
q1 <- qplot(x = Species, y = Sepal.Width, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "Distribution of Sepal.Width") + geom_boxplot(outlier.size = NA)
q2 <- qplot(x = Species, y = Sepal.Length, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "Distribution of Sepal.Length") + geom_boxplot(outlier.size = NA)
Arrange multiple graphs row-wise use “/”
q1/q2
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolor virginicaSpecies
Sep
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Width
5
6
7
8
setosa versicolor virginicaSpecies
Sep
al.L
engt
h
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Length
Figure 15: Arrange multiple graphs row-wise
13
Arrange multiple graphs column-wise: use “|”
q1|q2
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolorvirginicaSpecies
Sep
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Width
5
6
7
8
setosa versicolorvirginicaSpecies
Sep
al.L
engt
h
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Length
Figure 16: Arrange multiple graphs column-wise
14
Arrange multiple graphs both row-wise and column-wise
(q1|q2)/(q1|q2)
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolorvirginicaSpecies
Sep
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Width
5
6
7
8
setosa versicolorvirginicaSpecies
Sep
al.L
engt
h
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Length
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolorvirginicaSpecies
Sep
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Width
5
6
7
8
setosa versicolorvirginicaSpecies
Sep
al.L
engt
h
alpha
0.5
Species
setosa
versicolor
virginica
Distribution of Sepal.Length
Figure 17: Arrange multiple graphs both row-wise and column-wise
15
Stage 3: Final analysis
You do not need to include all the graphs to your final analysis. Please select only the useful graphswhich help you to tell the story in your dataset. Here is mine.
1. Composition of the sample
qplot(x = Species, data = iris, geom = "bar", ylab = "Count",colour = Species, fill = Species,main = "Composition of Species")
0
10
20
30
40
50
setosa versicolor virginicaSpecies
Cou
nt
Species
setosa
versicolor
virginica
Composition of Species
Figure 18: Composition of the sample
16
2. Distribution of the features of sepal and petal by species
q1 <- qplot(x = Species, y = Sepal.Width, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(a) Distribution of Sepal Width") + geom_boxplot(outlier.size = NA)
q2 <- qplot(x = Species, y = Sepal.Length, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(b) Distribution of Sepal Length") + geom_boxplot(outlier.size = NA)
q3 <- qplot(x = Species, y = Petal.Width, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(c) Distribution of Petal Width") + geom_boxplot(outlier.size = NA)
q4 <- qplot(x = Species, y = Petal.Length, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(d) Distribution of Petal Length") + geom_boxplot(outlier.size = NA)
(q1|q2)/(q3|q4)
2.0
2.5
3.0
3.5
4.0
4.5
setosa versicolor virginicaSpecies
Sep
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
(a) Distribution of Sepal Width
5
6
7
8
setosa versicolor virginicaSpecies
Sep
al.L
engt
h
alpha
0.5
Species
setosa
versicolor
virginica
(b) Distribution of Sepal Length
0.0
0.5
1.0
1.5
2.0
2.5
setosa versicolor virginicaSpecies
Pet
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
(c) Distribution of Petal Width
2
4
6
setosa versicolor virginicaSpecies
Pet
al.L
engt
h
alpha
0.5
Species
setosa
versicolor
virginica
(d) Distribution of Petal Length
Figure 19: Distribution of features related to sepal and petal by species
17
3. Relationship between features of sepal and petal by species
p1 <- qplot(x = Sepal.Length, y = Sepal.Width, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main="(a) Sepal Length and Sepal Width")
p2 <- qplot(x = Petal.Length, y = Petal.Width, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main = "(b) Petal Length and Petal Width")
p3 <- qplot(x = Sepal.Length, y = Petal.Length, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main = "(c) Sepal Length and Petal Length")
p4 <- qplot(x = Sepal.Length, y = Petal.Width, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main = "(d) Sepal length and Petal Width")
(p1|p2)/(p3|p4)
2.0
2.5
3.0
3.5
4.0
4.5
5 6 7 8Sepal.Length
Sep
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
(a) Sepal Length and Sepal Width
0.0
0.5
1.0
1.5
2.0
2.5
2 4 6Petal.Length
Pet
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
(b) Petal Length and Petal Width
2
4
6
5 6 7 8Sepal.Length
Pet
al.L
engt
h
alpha
0.5
Species
setosa
versicolor
virginica
(c) Sepal Length and Petal Length
0.0
0.5
1.0
1.5
2.0
2.5
5 6 7 8Sepal.Length
Pet
al.W
idth
alpha
0.5
Species
setosa
versicolor
virginica
(d) Sepal length and Petal Width
Figure 20: Relationship between features of sepal and petal by species
Note: Interpret all figures in your final analysis.
18