sta 326 2.0 programming and data analysis with r

18
STA 326 2.0 Programming and Data Analysis with R Exploring iris dataset with qplot Contents Stage 1: Planning your analysis 2 Step 1: Dataset overview and description ............................. 2 Step 2: One-way analysis ...................................... 2 Step 3: Two-way analysis ...................................... 3 Step 4: Three-way analysis ..................................... 3 Stage 2: Getting started with qplot() in the ggplot2 package. 3 Recap: some important arguments in qplot ............................ 3 One-way analysis ........................................... 4 Load packages ......................................... 4 1. Summarizing qualitative variables ............................. 4 2. Summarizing quantitative variables ............................ 5 Two-way analysis ........................................... 6 1. Visualizing two qualitative variables at a time ...................... 6 2. Visualizing qualitative and quantitative variables .................... 8 Three-way analysis .......................................... 11 patchwork package in R 13 Arrange multiple graphs row-wise use “/” ............................. 13 Arrange multiple graphs column-wise: use “|” ........................... 14 Arrange multiple graphs both row-wise and column-wise ..................... 15 Stage 3: Final analysis 16 1. Composition of the sample .................................... 16 2. Distribution of the features of sepal and petal by species ................... 17 3. Relationship between features of sepal and petal by species .................. 18 All rights reserved by Thiyanga S. Talagala 1

Upload: others

Post on 20-Jun-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: STA 326 2.0 Programming and Data Analysis with R

STA 326 2.0 Programming and Data Analysis with R ∗

Exploring iris dataset with qplot

Contents

Stage 1: Planning your analysis 2

Step 1: Dataset overview and description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Step 2: One-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2

Step 3: Two-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Step 4: Three-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

Stage 2: Getting started with qplot() in the ggplot2 package. 3

Recap: some important arguments in qplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

One-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

Load packages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

1. Summarizing qualitative variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4

2. Summarizing quantitative variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

Two-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1. Visualizing two qualitative variables at a time . . . . . . . . . . . . . . . . . . . . . . 6

2. Visualizing qualitative and quantitative variables . . . . . . . . . . . . . . . . . . . . 8

Three-way analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

patchwork package in R 13

Arrange multiple graphs row-wise use “/” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Arrange multiple graphs column-wise: use “|” . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

Arrange multiple graphs both row-wise and column-wise . . . . . . . . . . . . . . . . . . . . . 15

Stage 3: Final analysis 16

1. Composition of the sample . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

2. Distribution of the features of sepal and petal by species . . . . . . . . . . . . . . . . . . . 17

3. Relationship between features of sepal and petal by species . . . . . . . . . . . . . . . . . . 18

∗All rights reserved by Thiyanga S. Talagala

1

Page 2: STA 326 2.0 Programming and Data Analysis with R

Stage 1: Planning your analysis

Step 1: Dataset overview and description

Before we get started let’s look at the data and plan a analysis.

Load iris dataset

data(iris)

Here is a glimpse of the dataset.

head(iris)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species1 5.1 3.5 1.4 0.2 setosa2 4.9 3.0 1.4 0.2 setosa3 4.7 3.2 1.3 0.2 setosa4 4.6 3.1 1.5 0.2 setosa5 5.0 3.6 1.4 0.2 setosa6 5.4 3.9 1.7 0.4 setosa

We have four quantitative variables: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width and onequalitative variable: Species

Step 2: One-way analysis

Let’s look at the graphs we could use to explore variables one by one.

Plots that could be used to to summarize qualitative variables

• Bar chart

• Pie chart

Plots that could be used to to summarize quantitative variables

• Box and whisker plot

• Histograms

• Dot plots

• Density plot

• Stem and leaf displays

Note: Stem and leaf displays are best-suited for small to moderate datasets, whereas others such ashistograms and Box and whisker plots are best-suited for large datasets. Box and whisker plots andhistograms are also good at depicting differences between distributions and identifying outliers.

2

Page 3: STA 326 2.0 Programming and Data Analysis with R

Step 3: Two-way analysis

Next, we will look at two variables at a time.

• Quantitative vs Quantitative: Scatter plots

• Quantitative vs Qualitative: Box plots/ Histograms/ Dot plots/ Density plots with groups allowus to compare across different levels of the qualitative variable. Faceting can be used to generatethe same plot for different levels of the qualitative variable.

Step 4: Three-way analysis

Now, let’s look at three variables at a time.

• Two quantitative variables and one qualitative variable: Scatter plot with different markers (eg:size, shapes, colours) for different levels of the qualitative variable.

Stage 2: Getting started with qplot() in the ggplot2 package.

Now we are going to use the qplot function to make some quick plots. This section demonstrates howdifferent graphs can be plotted for various purposes using the qplot.

Recap: some important arguments in qplot

qplot(x, # X variabley, # Y variabledata, # name of the dataframefacets = NULL,margins = FALSE,geom = "auto",xlim = c(NA, NA), # numeric vector of length 2 giving the x coordinatesylim = c(NA, NA), # numeric vector of length 2 giving the 7 coordinateslog = "",main = NULL, # Figure titlexlab = NULL, # X-axis titleylab = NULL, # Y-axis titleasp = NA,stat = NULL,position = NULL,

)

3

Page 4: STA 326 2.0 Programming and Data Analysis with R

One-way analysis

Load packages

library(ggplot2)

1. Summarizing qualitative variables

qplot(x = Species, data = iris, geom = "bar", ylab = "Count",main = "Composition of Species")

0

10

20

30

40

50

setosa versicolor virginicaSpecies

Cou

nt

Composition of Species

Figure 1: Composition of the sample

4

Page 5: STA 326 2.0 Programming and Data Analysis with R

2. Summarizing quantitative variables

Here, I have drawn plots only for Sepal.Width. Please take suitable graphs for other quantitativevariables in the iris dataset.

qplot(x = Sepal.Width, data = iris, geom = "histogram")

0

10

20

2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width

Figure 2: Histogram of sepal width

Histogram

qplot(x = Sepal.Width, data = iris, geom = "density")

Density plot

5

Page 6: STA 326 2.0 Programming and Data Analysis with R

0.0

0.3

0.6

0.9

2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width

Figure 3: Density plot of sepal width

qplot(y = Sepal.Width, data = iris, geom = "boxplot")

2.0

2.5

3.0

3.5

4.0

4.5

−0.4 −0.2 0.0 0.2 0.4

Sep

al.W

idth

Figure 4: Boxplot of sepal width

Box and whisker plot

Two-way analysis

1. Visualizing two qualitative variables at a time

6

Page 7: STA 326 2.0 Programming and Data Analysis with R

qplot(x = Sepal.Length, y = Sepal.Width, data = iris, geom = "point")

2.0

2.5

3.0

3.5

4.0

4.5

5 6 7 8Sepal.Length

Sep

al.W

idth

Figure 5: Scatterplot of sepal length and sepal width

7

Page 8: STA 326 2.0 Programming and Data Analysis with R

2. Visualizing qualitative and quantitative variables

qplot(x = Species, y = Sepal.Width, data = iris, geom = "boxplot")

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolor virginicaSpecies

Sep

al.W

idth

Figure 6: Boxplot of sepal width by species

qplot(x = Species, y = Sepal.Width, data = iris, geom = "boxplot", fill = Species)

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolor virginicaSpecies

Sep

al.W

idth Species

setosa

versicolor

virginica

Figure 7: Boxplot of sepal width by species

Different ways to modify your graph

8

Page 9: STA 326 2.0 Programming and Data Analysis with R

qplot(x = Species, y = Sepal.Width, data = iris, geom = c("point","jitter","boxplot"),alpha = 0.5, colour = Species)

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolor virginicaSpecies

Sep

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

Figure 8: Boxplot of sepal width by species

qplot(x = Sepal.Width, data = iris, geom = c("histogram"),colour = Species)

0

10

20

2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width

Species

setosa

versicolor

virginica

Figure 9: Histogram of sepal width

9

Page 10: STA 326 2.0 Programming and Data Analysis with R

qplot(x = Sepal.Width, data = iris, geom = c("histogram"),fill = Species)

0

10

20

2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width

Species

setosa

versicolor

virginica

Figure 10: Histogram of sepal width

qplot(x = Sepal.Width, data = iris, geom = c("density"),colour = Species)

0.0

0.5

1.0

2.0 2.5 3.0 3.5 4.0 4.5Sepal.Width

Species

setosa

versicolor

virginica

Figure 11: Density plot of sepal width

10

Page 11: STA 326 2.0 Programming and Data Analysis with R

Three-way analysis

Everything on a single panel

qplot(x = Sepal.Length, y = Sepal.Width, data = iris,geom = "point", color = Species)

2.0

2.5

3.0

3.5

4.0

4.5

5 6 7 8Sepal.Length

Sep

al.W

idth Species

setosa

versicolor

virginica

Figure 12: Scatterplot of sepal length and sepal width by species

Separate panels for each species: column-wise

qplot(x = Sepal.Length, y = Sepal.Width,facets = .~Species, data = iris, geom = "point")

setosa versicolor virginica

5 6 7 8 5 6 7 8 5 6 7 82.0

2.5

3.0

3.5

4.0

4.5

Sepal.Length

Sep

al.W

idth

Figure 13: Scatterplot of sepal length and sepal width by species

11

Page 12: STA 326 2.0 Programming and Data Analysis with R

Separate panels for each species: row-wise

qplot(x = Sepal.Length, y = Sepal.Width,facets = Species~., data = iris, geom = "point")

setosaversicolor

virginica

5 6 7 8

2.0

2.5

3.0

3.5

4.0

4.5

2.0

2.5

3.0

3.5

4.0

4.5

2.0

2.5

3.0

3.5

4.0

4.5

Sepal.Length

Sep

al.W

idth

Figure 14: Scatterplot of sepal length and sepal width by species

12

Page 13: STA 326 2.0 Programming and Data Analysis with R

patchwork package in R

library(patchwork)

First you need to assign a name for each graph. Here, I use q1 and q2.

q1 <- qplot(x = Species, y = Sepal.Width, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "Distribution of Sepal.Width") + geom_boxplot(outlier.size = NA)

q2 <- qplot(x = Species, y = Sepal.Length, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "Distribution of Sepal.Length") + geom_boxplot(outlier.size = NA)

Arrange multiple graphs row-wise use “/”

q1/q2

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolor virginicaSpecies

Sep

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Width

5

6

7

8

setosa versicolor virginicaSpecies

Sep

al.L

engt

h

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Length

Figure 15: Arrange multiple graphs row-wise

13

Page 14: STA 326 2.0 Programming and Data Analysis with R

Arrange multiple graphs column-wise: use “|”

q1|q2

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolorvirginicaSpecies

Sep

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Width

5

6

7

8

setosa versicolorvirginicaSpecies

Sep

al.L

engt

h

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Length

Figure 16: Arrange multiple graphs column-wise

14

Page 15: STA 326 2.0 Programming and Data Analysis with R

Arrange multiple graphs both row-wise and column-wise

(q1|q2)/(q1|q2)

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolorvirginicaSpecies

Sep

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Width

5

6

7

8

setosa versicolorvirginicaSpecies

Sep

al.L

engt

h

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Length

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolorvirginicaSpecies

Sep

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Width

5

6

7

8

setosa versicolorvirginicaSpecies

Sep

al.L

engt

h

alpha

0.5

Species

setosa

versicolor

virginica

Distribution of Sepal.Length

Figure 17: Arrange multiple graphs both row-wise and column-wise

15

Page 16: STA 326 2.0 Programming and Data Analysis with R

Stage 3: Final analysis

You do not need to include all the graphs to your final analysis. Please select only the useful graphswhich help you to tell the story in your dataset. Here is mine.

1. Composition of the sample

qplot(x = Species, data = iris, geom = "bar", ylab = "Count",colour = Species, fill = Species,main = "Composition of Species")

0

10

20

30

40

50

setosa versicolor virginicaSpecies

Cou

nt

Species

setosa

versicolor

virginica

Composition of Species

Figure 18: Composition of the sample

16

Page 17: STA 326 2.0 Programming and Data Analysis with R

2. Distribution of the features of sepal and petal by species

q1 <- qplot(x = Species, y = Sepal.Width, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(a) Distribution of Sepal Width") + geom_boxplot(outlier.size = NA)

q2 <- qplot(x = Species, y = Sepal.Length, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(b) Distribution of Sepal Length") + geom_boxplot(outlier.size = NA)

q3 <- qplot(x = Species, y = Petal.Width, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(c) Distribution of Petal Width") + geom_boxplot(outlier.size = NA)

q4 <- qplot(x = Species, y = Petal.Length, data = iris, geom = c("jitter","boxplot"),alpha = 0.5, colour = Species, main = "(d) Distribution of Petal Length") + geom_boxplot(outlier.size = NA)

(q1|q2)/(q3|q4)

2.0

2.5

3.0

3.5

4.0

4.5

setosa versicolor virginicaSpecies

Sep

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

(a) Distribution of Sepal Width

5

6

7

8

setosa versicolor virginicaSpecies

Sep

al.L

engt

h

alpha

0.5

Species

setosa

versicolor

virginica

(b) Distribution of Sepal Length

0.0

0.5

1.0

1.5

2.0

2.5

setosa versicolor virginicaSpecies

Pet

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

(c) Distribution of Petal Width

2

4

6

setosa versicolor virginicaSpecies

Pet

al.L

engt

h

alpha

0.5

Species

setosa

versicolor

virginica

(d) Distribution of Petal Length

Figure 19: Distribution of features related to sepal and petal by species

17

Page 18: STA 326 2.0 Programming and Data Analysis with R

3. Relationship between features of sepal and petal by species

p1 <- qplot(x = Sepal.Length, y = Sepal.Width, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main="(a) Sepal Length and Sepal Width")

p2 <- qplot(x = Petal.Length, y = Petal.Width, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main = "(b) Petal Length and Petal Width")

p3 <- qplot(x = Sepal.Length, y = Petal.Length, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main = "(c) Sepal Length and Petal Length")

p4 <- qplot(x = Sepal.Length, y = Petal.Width, data = iris, geom = c("point","jitter"),alpha = 0.5, colour = Species,main = "(d) Sepal length and Petal Width")

(p1|p2)/(p3|p4)

2.0

2.5

3.0

3.5

4.0

4.5

5 6 7 8Sepal.Length

Sep

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

(a) Sepal Length and Sepal Width

0.0

0.5

1.0

1.5

2.0

2.5

2 4 6Petal.Length

Pet

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

(b) Petal Length and Petal Width

2

4

6

5 6 7 8Sepal.Length

Pet

al.L

engt

h

alpha

0.5

Species

setosa

versicolor

virginica

(c) Sepal Length and Petal Length

0.0

0.5

1.0

1.5

2.0

2.5

5 6 7 8Sepal.Length

Pet

al.W

idth

alpha

0.5

Species

setosa

versicolor

virginica

(d) Sepal length and Petal Width

Figure 20: Relationship between features of sepal and petal by species

Note: Interpret all figures in your final analysis.

18