Data Visualization in R

L. Torgo

ltorgo@fc.up.pt

Faculdade de Ciências / LIAAD-INESC TEC, LA
Universidade do Porto

Oct, 2014

Introduction

Motivation for Data Visualization

- Humans are outstanding at detecting patterns and structures with their eyes
- Data visualization methods try to exploit these capabilities
- In spite of all their advantages, visualization methods also have several problems, particularly with very large data sets

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 2 / 58

Outline of what we will learn

- Tools for univariate data
- Tools for bivariate data
- Tools for multivariate data
  - Multidimensional scaling methods

R Graphics

Introduction to Standard Graphics

Standard Graphics (the graphics package)

- R standard graphics, available through package graphics, include several functions that provide standard statistical plots, like:
  - Scatterplots
  - Boxplots
  - Pie charts
  - Barplots
  - etc.
- These graphs can typically be obtained with a single function call
- Example of a scatterplot

plot(1:10,sin(1:10))

- These graphs can easily be augmented by adding further elements to them (lines, text, etc.)
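As a minimal sketch of what augmenting a base plot can look like, the code below starts from the same scatterplot and adds a few elements on top of it. The title, the labels and the choice of annotations are illustrative assumptions, not part of the original slides; pdf(NULL) is only there so that the code also runs in a non-interactive session.

```r
# Draw off-screen; remove this line (and dev.off()) to draw on the screen
pdf(NULL)

x <- 1:10
y <- sin(x)

plot(x, y, main = "sin(x) at the integers 1 to 10")  # the base scatterplot
lines(x, y, lty = 2)                                  # join the points with a dashed line
abline(h = 0, col = "grey")                           # horizontal reference line at y = 0
peak <- which.max(y)                                  # index of the largest value
text(x[peak], y[peak], "maximum", pos = 1)            # label placed below that point
legend("bottomleft", legend = "sin(x)", pch = 1, lty = 2)

invisible(dev.off())
```

Each call after plot() draws on the existing graph instead of starting a new one, which is how base graphics are usually built up step by step.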

Graphics Devices

- R graphics functions produce output that depends on the active graphics device
- The default and most frequently used device is the screen
- There are many more graphical devices in R, like the pdf device, the jpeg device, etc.
- The user just needs to open (and in the end close) the graphics output device she/he wants. R takes care of producing the type of output required by the device
- This means that the R code that produces a certain plot on the screen or as a JPEG graphics file is exactly the same. You only need to open the target output device beforehand!
- Several devices may be open at the same time, but only one is the active device
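The idea of several open devices with a single active one can be sketched as follows. This is only an illustration under the assumption that two off-screen PDF devices are used (pdf(NULL) draws without creating a file); the variable names are hypothetical.

```r
# Open two devices; each newly opened device becomes the active one
pdf(NULL)
first <- dev.cur()
pdf(NULL)
second <- dev.cur()

open_devices <- dev.list()   # numbers of all devices that are currently open

dev.set(first)               # make the first device the active one again
active <- dev.cur()
plot(1:10, sin(1:10))        # this plot goes to the first device only

# Close both devices once the output is complete
dev.off(second)
dev.off(first)
```

The plotting call itself never mentions a device: it always draws on whichever device is active at that moment.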

A few examples

A scatterplot

plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))

The same but stored on a jpeg file

jpeg('exp.jpg')
plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))
dev.off()

And now as a pdf file

pdf('exp.pdf',width=6,height=6)
plot(seq(0,4*pi,0.1),sin(seq(0,4*pi,0.1)))
dev.off()

Introduction to ggplot

Package ggplot2

- Package ggplot2 implements the ideas created by Wilkinson (2005) on a grammar of graphics
- This grammar is the result of a theoretical study on what is a statistical graphic
- ggplot2 builds upon this theory by implementing the concept of a layered grammar of graphics (Wickham, 2009)
- The grammar defines a statistical graphic as:
  - a mapping from data into aesthetic attributes (color, shape, size, etc.) of geometric objects (points, lines, bars, etc.)

L. Wilkinson (2005). The Grammar of Graphics. Springer.

H. Wickham (2009). A layered grammar of graphics. Journal of Computational and Graphical Statistics.

The Basics of the Grammar of Graphics

- Key elements of a statistical graphic:
  - data
  - aesthetic mappings
  - geometric objects
  - statistical transformations
  - scales
  - coordinate system
  - faceting

Aesthetic Mappings

- Controls the relation between data variables and graphic variables
  - map the Temperature variable of a data set into the x variable in a scatter plot
  - map the Species of a plant into the colour of dots in a graphic
  - etc.

Geometric Objects

- Controls what is shown in the graphics
  - show each observation by a point using the aesthetic mappings that map two variables in the data set into the x, y variables of the plot
  - etc.

Statistical Transformations

- Allows us to calculate and do statistical analysis over the data in the plot
  - Use the data and approximate it by a regression line on the x, y coordinates
  - Count occurrences of certain values
  - etc.
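The two transformations mentioned above can be sketched in ggplot2 as follows. This is an illustrative example on the built-in iris data (an assumption, chosen so that the code is self-contained); layer_data() is used only to inspect what each layer computed.

```r
library(ggplot2)

# Counting: geom_bar() counts the rows for each value of x before drawing the bars
p_count <- ggplot(iris, aes(x = Species)) + geom_bar()
counts <- layer_data(p_count)$count

# Regression line: geom_smooth() fits a model (here a straight line via lm)
# and draws the fitted values on top of the raw points
p_line <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE)
fitted_line <- layer_data(p_line, 2)   # data computed for the second layer
```

In both cases the data frame passed to ggplot() is unchanged; the transformation happens inside the layer when the plot is built.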

Scales

- Maps the data values into values in the coordinate system of the graphics device

Coordinate System

- The coordinate system used to plot the data
  - Cartesian
  - Polar
  - etc.

Faceting

Split the data into sub-groups and draw sub-graphs for each group
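A small sketch of faceting, assuming the built-in iris data and facet_wrap() as the faceting function (other choices, such as facet_grid(), would work along the same lines):

```r
library(ggplot2)

# One sub-graph per species: facet_wrap() splits the data by the given variable
p <- ggplot(iris, aes(x = Petal.Length, y = Sepal.Length)) +
  geom_point() +
  facet_wrap(~ Species)

# Every observation is assigned to one panel; list the panels that were created
panels <- unique(layer_data(p)$PANEL)
```

Printing p draws three panels side by side, one for each level of Species, all sharing the same axes by default.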

A Simple Example

data(iris)
library(ggplot2)
ggplot(iris,
       aes(x=Petal.Length,y=Sepal.Length,colour=Species)
       ) + geom_point()

[Figure: scatterplot of Sepal.Length against Petal.Length, with the points coloured by Species (setosa, versicolor, virginica)]

Univariate Plots: Distribution of Values of Nominal Variables

The Distribution of Values of Nominal Variables: The Barplot

- The Barplot is a graph whose main purpose is to display a set of values as heights of bars
- It can be used to display the frequency of occurrence of different values of a nominal variable as follows:
  - First obtain the number of occurrences of each value
  - Then use the Barplot to display these counts
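The two steps can be sketched in base R as below. The built-in mtcars data set is used here purely as a stand-in (an assumption made so that the code runs without any extra data), with the number of cylinders playing the role of the nominal variable.

```r
pdf(NULL)   # draw off-screen; remove this line (and dev.off()) to use the screen

# Step 1: obtain the number of occurrences of each value
counts <- table(mtcars$cyl)

# Step 2: display these counts as the heights of the bars
barplot(counts, main = "Number of cars by number of cylinders",
        xlab = "cylinders", ylab = "count")

invisible(dev.off())
```

Keeping the counting step separate makes it easy to inspect or reorder the counts before they are drawn.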

Barplots in base graphics

barplot(table(algae$season),main='Distribution of the Water Samples across Seasons')

[Figure: barplot titled "Distribution of the Water Samples across Seasons", with one bar each for autumn, spring, summer and winter; vertical axis from 0 to 60]

Barplots in ggplot2

ggplot(algae,aes(x=season)) + geom_bar() +
    ggtitle('Distribution of the Water Samples across Seasons')

[Figure: barplot titled "Distribution of the Water Samples across Seasons"; x axis "season" (autumn, spring, summer, winter), y axis "count" from 0 to 60]

Pie Charts

Pie charts serve the same purpose as bar plots but present the information in the form of a pie.

pie(table(algae$season),
    main='Distribution of the Water Samples across Seasons')

[Figure: pie chart titled "Distribution of the Water Samples across Seasons", with slices for autumn, spring, summer and winter]

Note: Avoid pie charts!

Pie Charts in ggplot

ggplot(algae,aes(x=factor(1),fill=season)) + geom_bar(width=1) +
    ggtitle('Distribution of the Water Samples across Seasons') +
    coord_polar(theta="y")

[Figure: pie chart in polar coordinates titled "Distribution of the Water Samples across Seasons"; angular axis "count" from 0 to 200, radial axis "factor(1)", fill legend "season" (autumn, spring, summer, winter)]

Univariate Plots: Distribution of the Values of a Continuous Variable

The Distribution of Values of a Continuous Variable: The Histogram

- The Histogram is a graph whose main purpose is to display how the values of a continuous variable are distributed
- It is obtained as follows:
  - First the range of the variable is divided into a set of bins (intervals of values)
  - Then the number of occurrences of values in each bin is counted
  - Then this number is displayed as a bar
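The three steps can be reproduced by hand and compared with what hist() computes. The values and the bin limits below are made up for illustration; the sketch also shows how the density scale relates to the counts.

```r
x <- c(2, 7, 12, 15, 18, 21, 24, 33, 38, 47)   # made-up sample
breaks <- seq(0, 50, by = 10)                  # step 1: bins (0,10], (10,20], ...

# Step 2: assign each value to its bin and count the values per bin
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
manual_counts <- as.vector(table(bins))

# hist() performs the same computation (plot = FALSE returns it without drawing)
h <- hist(x, breaks = breaks, plot = FALSE)

# On the density scale each count is divided by n times the bin width,
# so that the total area of the bars is 1
manual_density <- manual_counts / (length(x) * diff(breaks))

# Step 3: display the counts as bars
pdf(NULL)
barplot(manual_counts, names.arg = levels(bins), space = 0,
        xlab = "bin", ylab = "count")
invisible(dev.off())
```

With these values the bins contain 2, 3, 2, 2 and 1 observations, and the same numbers are found in h$counts.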

Two examples of Histograms in R

hist(algae$a1,main='Distribution of Algae a1',xlab='a1',
     ylab='Nr. of Occurrences')

hist(algae$a1,main='Distribution of Algae a1',xlab='a1',
     ylab='Density',prob=TRUE)

[Figure: two histograms titled "Distribution of Algae a1", x axis "a1" from 0 to 80. First panel: y axis "Nr. of Occurrences" from 0 to 100. Second panel: y axis on the density scale from 0.00 to 0.05]

Two examples of Histograms in ggplot2

ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10) +
    ggtitle("Distribution of Algae a1") + ylab("Nr. of Occurrences")

ggplot(algae,aes(x=a1)) + geom_histogram(binwidth=10,aes(y=..density..)) +
    ggtitle("Distribution of Algae a1") + ylab("Density")

[Figure: two histograms titled "Distribution of Algae a1", x axis "a1" from 0 to 100. First panel: y axis "Nr. of Occurrences" from 0 to 90. Second panel: y axis on the density scale from 0.00 to 0.04]

Problems with Histograms

- Histograms may be misleading in small data sets
- Another key issue is how the limits of the bins are chosen
  - There are several algorithms for this
  - Check the “Details” section of the help page of function hist() if you want to know more about this and to obtain references on alternatives
  - Within ggplot2 you may control this through the binwidth parameter
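To give an idea of how much the choice of algorithm can matter, the sketch below compares the number of bins suggested by three rules available in base R (Sturges, Scott and Freedman-Diaconis) on a synthetic sample. The sample and the seed are assumptions made for illustration.

```r
set.seed(42)
x <- rnorm(100)   # synthetic sample of 100 values

# Number of bins suggested by each rule (functions from package grDevices)
n_bins <- c(Sturges = nclass.Sturges(x),
            Scott   = nclass.scott(x),
            FD      = nclass.FD(x))

# hist() accepts the name of a rule through its breaks argument;
# the suggested number is then rounded to "pretty" bin limits
h_sturges <- hist(x, breaks = "Sturges", plot = FALSE)
h_fd      <- hist(x, breaks = "FD", plot = FALSE)
```

Comparing h_sturges$breaks with h_fd$breaks shows the bin limits that each rule actually leads to for the same data.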

Univariate Plots: Kernel Density Estimation

Showing the Distribution of Values: Kernel Density Estimates

- Some of the problems of histograms can be tackled by smoothing the estimates of the distribution of the values. That is the purpose of kernel density estimates
- Kernel estimates calculate the estimate of the distribution at a certain point by smoothly averaging over the neighboring points
- Namely, the density is estimated by

$$\hat{f}(x) = \frac{1}{nh} \sum_{i=1}^{n} K\left(\frac{x - x_i}{h}\right)$$

  where K() is a kernel function and h a bandwidth parameter
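The estimator can be written out directly in R. The sketch below assumes a Gaussian kernel (dnorm) and the usual 1/(nh) normalisation, under which the estimate integrates to one; the data values and the bandwidth are made up for illustration.

```r
# Kernel density estimate at the points x0, given the sample x and bandwidth h:
# for each point, average the kernel values K((x0 - x_i) / h) and divide by h
kde <- function(x0, x, h) {
  sapply(x0, function(p) sum(dnorm((p - x) / h)) / (length(x) * h))
}

x <- c(6.8, 7.4, 7.9, 8.0, 8.1, 8.3, 8.6, 9.1)   # made-up sample
h <- 0.4                                          # bandwidth chosen by hand

# Evaluate the estimate on a fine grid that extends beyond the data
step <- 0.01
grid <- seq(min(x) - 5 * h, max(x) + 5 * h, by = step)
f_hat <- kde(grid, x, h)

# Numerical check that the estimate behaves like a density
area <- sum(f_hat) * step
```

In practice density(x, bw = h) provides this kind of estimate through a faster approximation; the hand-written version is only meant to make the formula concrete. A larger h gives a smoother curve, a smaller h a more wiggly one.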

An Example and how to obtain it in R basic graphics

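hist(algae$mxPH,main='The Histogram of mxPH (maximum pH)',
     xlab='mxPH Values',prob=TRUE)
lines(density(algae$mxPH,na.rm=TRUE))
rug(algae$mxPH)

[Figure: histogram titled "The Histogram of mxPH (maximum pH)" with the kernel density estimate drawn as a line and a rug of the observed values; x axis "mxPH Values" from 6 to 10, y axis "Density" from 0.0 to 0.7]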

An Example and how to obtain it in ggplot2

ggplot(algae,aes(x=mxPH)) + geom_histogram(binwidth=.5, aes(y=..density..)) +
    geom_density(color="red") + geom_rug() +
    ggtitle("The Histogram of mxPH (maximum pH)")

[Figure: histogram titled "The Histogram of mxPH (maximum pH)" with the density estimate in red and a rug; x axis "mxPH" from 5 to 10, y axis "density" from 0.0 to 0.6]

Showing the Distribution of Values: Graphing Quantiles

- The x% quantile of a continuous variable is the value below which x% of the data lie
- Examples of this concept are the 1st (25%) and 3rd (75%) quartiles and the median (50%)
- We can calculate these quantiles at different values of x and then plot them to provide an idea of how the values in a sample of data are distributed
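A short base R sketch of these ideas, using a tiny made-up sample so that the results are easy to check by hand:

```r
x <- 1:9   # made-up sample

# The quartiles and the median
quartiles <- quantile(x, probs = c(0.25, 0.50, 0.75))

# Quantiles at many different proportions, plotted against those proportions
probs <- seq(0, 1, by = 0.05)
q <- quantile(x, probs = probs)

pdf(NULL)   # draw off-screen; remove this line (and dev.off()) to use the screen
plot(probs, q, type = "b",
     xlab = "proportion of the data", ylab = "quantile")
invisible(dev.off())

# The empirical cumulative distribution function answers the reverse question:
# which proportion of the data is at or below a given value?
Fn <- ecdf(x)
at_or_below_5 <- Fn(5)
```

For this sample the quartiles are 3, 5 and 7, and five of the nine values are at or below 5.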

The Cumulative Distribution Function in Base Graphics

plot(ecdf(algae$NO3),main='Cumulative Distribution Function of NO3',xlab='NO3 values')

[Figure: empirical cumulative distribution function titled "Cumulative Distribution Function of NO3"; x axis "NO3 values" from 0 to 50, y axis "Fn(x)" from 0.0 to 1.0]

The Cumulative Distribution Function in ggplot

ggplot(algae,aes(x=NO3)) + stat_ecdf() + xlab('NO3 values') +
    ggtitle('Cumulative Distribution Function of NO3')

[Figure: empirical cumulative distribution function titled "Cumulative Distribution Function of NO3"; x axis "NO3 values" from 0 to 40, y axis "y" from 0.00 to 1.00]

QQ Plots

- Graphs that can be used to compare the observed distribution against the Normal distribution
- Can be used to visually check the hypothesis that the variable under study follows a normal distribution
- Obviously, more formal tests also exist
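As a minimal sketch using only base R, the code below draws a normal QQ plot for a synthetic sample and then applies one possible formal test. The sample, the seed and the choice of the Shapiro-Wilk test are assumptions made for illustration.

```r
set.seed(1)
x <- rnorm(200, mean = 8, sd = 0.6)   # synthetic, roughly normal sample

pdf(NULL)   # draw off-screen; remove this line (and dev.off()) to use the screen
qq <- qqnorm(x, main = "Normal QQ plot of a synthetic sample")
qqline(x, col = "red")   # reference line through the first and third quartiles
invisible(dev.off())

# qq$x holds the theoretical normal quantiles, qq$y the observed values;
# points close to the reference line suggest an approximately normal variable

# One example of a more formal check: the Shapiro-Wilk normality test
sw <- shapiro.test(x)
```

A small p-value in sw$p.value would count as evidence against normality, while the QQ plot shows where the departures occur (for example in the tails).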

QQ Plots from package car using base graphics

library(car)
qqPlot(algae$mxPH,main='QQ Plot of Maximum PH Values')

[Figure: QQ plot titled "QQ Plot of Maximum PH Values"; x axis "norm quantiles" from -3 to 3, y axis "algae$mxPH" from 6 to 9]

QQ Plots using ggplot2

ggplot(algae,aes(sample=mxPH)) + stat_qq() +
    ggtitle('QQ Plot of Maximum PH Values')

[Figure: QQ plot titled "QQ Plot of Maximum PH Values"; x axis "theoretical" from -3 to 3, y axis "sample" from 6 to 9]

QQ Plots using ggplot2 (2)

q.x <- quantile(algae$mxPH,c(0.25,0.75),na.rm=TRUE)q.z <- qnorm(c(0.25,0.75))b <- (q.x[2] - q.x[1])/(q.z[2] - q.z[1])a <- q.x[1] - b * q.z[1]ggplot(algae,aes(sample=mxPH)) + stat_qq() +

ggtitle('QQ Plot of Maximum PH Values') +geom_abline(intercept=a,slope=b,color="red")

●●

●●

●●●●

●●●●●●

●●●●●●●●●●●●●●

●●●●●●

●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●●

●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●●●●●●●●●●●●●●●●●

●●●●●●●●●●●●

●●●●

●●●● ●

6

7

8

9

−3 −2 −1 0 1 2 3theoretical

sam

ple

QQ Plot of Maximum PH Values

© L.Torgo (FCUP - LIAAD / UP) Data Visualization Oct, 2014 35 / 58
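The reference line in this slide is the straight line through the first and third quartiles of the sample and of the standard normal. The same construction can be checked in base R; the data and names below are assumptions for illustration only.

```r
# Simulated data standing in for a numeric variable (assumption)
set.seed(10)
x <- rnorm(150, mean = 8, sd = 0.6)

# Quartiles of the sample and of the standard normal distribution
q.x <- quantile(x, c(0.25, 0.75), names = FALSE)
q.z <- qnorm(c(0.25, 0.75))

# Line through the points (q.z[1], q.x[1]) and (q.z[2], q.x[2])
b <- (q.x[2] - q.x[1]) / (q.z[2] - q.z[1])
a <- q.x[1] - b * q.z[1]

qqnorm(x, main = "QQ plot with quartile line")
abline(a = a, b = b, col = "red")
```

Because the normal quartiles are symmetric around zero, the intercept is simply the midpoint of the two sample quartiles. Base R's qqline() uses the same pair of quartiles by default.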

Showing the Distribution of Values: Box Plots

Box plots provide interesting summaries of a variable's distribution
For instance, they inform us of the interquartile range and of the outliers (if any)
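The two quantities a box plot exposes can also be computed directly. The sketch below uses simulated data (an assumption) and the common 1.5 x IQR rule for flagging outliers; boxplot() itself works with hinges, so its limits can differ slightly from these quantile-based ones.

```r
# Simulated values plus two artificial extreme observations (assumption)
set.seed(2)
x <- c(rnorm(100, mean = 9, sd = 2), 1.5, 25)

# Interquartile range: distance between the 1st and 3rd quartiles
q <- quantile(x, c(0.25, 0.75), names = FALSE)
iqr <- IQR(x)

# Values beyond 1.5 * IQR from the quartiles are flagged as outliers
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
outliers <- x[x < lower | x > upper]

print(iqr)
print(outliers)
boxplot(x, ylab = "Simulated values")
```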


An Example with base graphics

boxplot(algae$mnO2,ylab='Minimum values of O2')

[Figure: box plot of algae$mnO2; y axis "Minimum values of O2" (2 to 12)]

An Example with ggplot2

ggplot(algae,aes(x=factor(0),y=mnO2)) + geom_boxplot() +
    ylab("Minimum values of O2") + xlab("") + scale_x_discrete(breaks=NULL)

[Figure: box plot of mnO2; y axis "Minimum values of O2" (ticks at 5 and 10)]

Hands on Data Visualization - the Algae data set

Using the Algae data set from package DMwR, answer the following questions:

1 Create a graph that you find adequate to show the distribution of the values of algae a6
2 Show the distribution of the values of size
3 Check visually if it is plausible to consider that oPO4 follows a normal distribution

Solution of Exercise 1

Create a graph that you find adequate to show the distribution of the values of algae a6

hist(algae$a6,main="Distribution of Algae a6",
     xlab="Frequency of a6",ylab="Nr. of Occurrences")

[Figure: histogram titled "Distribution of Algae a6"; x axis "Frequency of a6" (0 to 80), y axis "Nr. of Occurrences" (0 to 150)]

Solution of Exercise 1 with ggplot2

Create a graph that you find adequate to show the distribution of the values of algae a6

ggplot(algae,aes(x=a6)) + geom_histogram(binwidth=10) +
    ggtitle("Distribution of Algae a6")

[Figure: histogram titled "Distribution of Algae a6"; x axis "a6" (0 to 75), y axis "count" (0 to 150)]

Solution to Exercise 2

Show the distribution of the values of size

barplot(table(algae$size), main = "Distribution of River Size")

[Figure: bar plot titled "Distribution of River Size"; bars for large, medium and small; y axis from 0 to 80]

Solution to Exercise 2 with ggplot2

Show the distribution of the values of size

ggplot(algae,aes(x=size)) + geom_bar() +
    ggtitle("Distribution of River Size")

[Figure: bar plot titled "Distribution of River Size"; x axis "size" (large, medium, small), y axis "count" (0 to 80)]

Solution to Exercise 3

Check visually if it is plausible to consider that oPO4 follows a normal distribution

library(car)
qqPlot(algae$oPO4,main = "QQ Plot of oPO4")

[Figure: normal QQ plot titled "QQ Plot of oPO4"; x axis "norm quantiles" (-3 to 3), y axis "algae$oPO4" (0 to 500)]

Conditioned Graphs

Data sets frequently have nominal variables that can be used to create sub-groups of the data according to the values of these variables
    e.g. the sub-group of male clients of a company
Some of the visual summaries described before can be obtained on each of these sub-groups
Conditioned plots present these sub-group graphs simultaneously, which makes it easier to find possible differences between the sub-groups
Base graphics offer only limited conditioning (e.g. the formula interface of boxplot()), whereas ggplot2 supports it generally through the concept of facets
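Conditioning amounts to splitting the data by a nominal variable and drawing the same summary for each sub-group. A rough base R sketch of that idea follows; the data frame and its column names are made up for illustration and are not the algae data.

```r
# Hypothetical data: one nominal variable and one numeric variable
set.seed(3)
d <- data.frame(
  group = factor(rep(c("low", "medium", "high"), each = 50),
                 levels = c("low", "medium", "high")),
  value = c(rexp(50, 1 / 5), rexp(50, 1 / 10), rexp(50, 1 / 20))
)

# One vector of values per sub-group, in the order of the factor levels
groups <- split(d$value, d$group)

# Same summary (a histogram) drawn side by side for each sub-group
op <- par(mfrow = c(1, length(groups)))
for (g in names(groups)) {
  hist(groups[[g]], main = g, xlab = "value")
}
par(op)

# Numerical counterpart: one median per sub-group
med <- sapply(groups, median)
print(med)
```

Facets in ggplot2 automate the split, the panel layout and the shared axes.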


Conditioned Histograms

Goal: Contrast the distribution of data sub-groups

algae$speed <- factor(algae$speed,levels=c("low","medium","high"))
ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) + facet_wrap(~ speed) +
    ggtitle("Distribution of A3 by River Speed")

[Figure: histograms of a3 titled "Distribution of A3 by River Speed", one panel per speed (low, medium, high); x axis "a3" (0 to 50), y axis "count" (0 to 60)]
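The call to factor() with explicit levels matters because panels follow the order of the factor levels, which by default is alphabetical. A small base R illustration, using a made-up vector rather than the algae data:

```r
# Made-up values of a nominal variable (assumption)
speed <- c("high", "low", "medium", "low", "high")

# Default: levels are sorted alphabetically
default_levels <- levels(factor(speed))
print(default_levels)

# Explicit levels give the natural low < medium < high ordering
ordered_levels <- levels(factor(speed, levels = c("low", "medium", "high")))
print(ordered_levels)
```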

Conditioned Histograms (2)

algae$season <- factor(algae$season,levels=c("spring","summer","autumn","winter"))
ggplot(algae,aes(x=a3)) + geom_histogram(binwidth=5) + facet_grid(speed~season) +
    ggtitle('Distribution of a3 as a function of River Speed and Season')

[Figure: grid of histograms of a3 titled "Distribution of a3 as a function of River Speed and Season"; rows are speed (low, medium, high), columns are season (spring, summer, autumn, winter); x axis "a3" (0 to 50), y axis "count" (0 to 15)]

Conditioned Box Plots on base graphics

boxplot(a6 ~ season,algae,main='Distribution of a6 as a function of Season',
        xlab='Season',ylab='Distribution of a6')

[Figure: box plots titled "Distribution of a6 as a function of Season"; x axis "Season" (spring, summer, autumn, winter), y axis "Distribution of a6" (0 to 80)]

Conditioned Box Plots on ggplot2

ggplot(algae,aes(x=season,y=a6))+geom_boxplot() +
    ggtitle('Distribution of a6 as a function of Season')

[Figure: box plots of a6 for each season; x axis "season" (spring, summer, autumn, winter), y axis "a6" (0 to 80)]

Hands on Data Visualization - Algae data set

Using the Algae data set from package DMwR, answer the following questions:

1 Produce a graph that allows you to understand how the values of NO3 are distributed across the sizes of river
2 Try to understand (using a graph) if the distribution of algae a1 varies with the speed of the river

Solution to Exercise 1

Produce a graph that allows you to understand how the values of NO3 are distributed across the sizes of river.

ggplot(algae,aes(x=NO3)) + geom_histogram(binwidth=5) + facet_wrap(~size) +
    ggtitle("Distribution of NO3 for different river sizes")

[Figure: histograms of NO3 titled "Distribution of NO3 for different river sizes", one panel per size (large, medium, small); x axis "NO3" (0 to 40), y axis "count" (0 to 60)]

Solution to Exercise 2

Try to understand (using a graph) if the distribution of algae a1 varies with the speed of the river

ggplot(algae,aes(x=speed,y=a1)) + geom_boxplot() +
    ggtitle("Distribution of a1 for different River Speeds")

[Figure: box plots titled "Distribution of a1 for different River Speeds"; x axis "speed" (high, low, medium), y axis "a1" (0 to 75)]

Other Plots

Scatterplots in base graphics

The Scatterplot is the natural graph for showing the relationship between two numerical variables

plot(algae$a1,algae$a6,main='Relationship between the frequency of a1 and a6',
     xlab='a1',ylab='a6')

[Figure: scatterplot titled "Relationship between the frequency of a1 and a6"; x axis "a1" (0 to 80), y axis "a6" (0 to 80)]

Scatterplots in ggplot2

ggplot(algae,aes(x=a1,y=a6)) + geom_point() +
    ggtitle('Relationship between the frequency of a1 and a6')

[Figure: scatterplot titled "Relationship between the frequency of a1 and a6"; x axis "a1" (0 to 75), y axis "a6" (0 to 80)]

Conditioned Scatterplots using the ggplot2 package

ggplot(algae,aes(x=a4,y=Chla)) + geom_point() + facet_wrap(~season)

[Figure: scatterplots of Chla against a4, one panel per season (autumn, spring, summer, winter); x axis "a4" (0 to 40), y axis "Chla" (0 to 90)]

Scatterplot matrices through package GGally

These graphs try to present all pairwise scatterplots in a data set.
They become impractical for data sets with many variables.

library(GGally)
ggpairs(algae,columns=12:16,
        diag=list(continuous="density", discrete="bar"),axisLabels="show")

[Figure: 5 x 5 scatterplot matrix of a1 to a5, with the pairwise correlations printed in the upper panels: a1/a2 -0.294, a1/a3 -0.147, a1/a4 -0.038, a1/a5 -0.292, a2/a3 0.0321, a2/a4 -0.172, a2/a5 -0.16, a3/a4 0.0123, a3/a5 -0.108, a4/a5 -0.109]
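The reason these matrices stop being practical is that the number of distinct pairs grows quadratically with the number of variables. A quick check in base R (the variable counts are arbitrary examples):

```r
# Number of variables in some hypothetical data sets
p <- c(5, 10, 20, 50)

# Distinct pairwise scatterplots: p choose 2 = p * (p - 1) / 2
panels <- choose(p, 2)
print(data.frame(variables = p, pairwise_plots = panels))
```

With 5 variables there are 10 distinct pairs, which is still readable; with 50 there are 1225.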

Scatterplot matrices involving nominal variables

ggpairs(algae,columns=c("season","speed","a1","a2"),axisLabels="show")

[Figure: 4 x 4 plot matrix for season, speed, a1 and a2; the a1/a2 panel reports Corr: -0.294]

Parallel Plots

Parallel plots are also interesting for visualizing a data set

ggparcoord(algae,columns=12:16,groupColumn="speed")

[Figure: parallel coordinates plot; x axis "variable" (a1, a2, a3, a4, a5), y axis "value" (0.0 to 10.0), lines coloured by speed (high, low, medium)]
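A parallel plot draws one line per observation across the variables, after putting the variables on a comparable scale. The sketch below builds one by hand in base R with simulated data. Standardizing each column is an assumption made here for illustration; check the scale argument of ggparcoord for what that function actually does.

```r
# Simulated numeric variables on very different scales (assumption)
set.seed(4)
d <- data.frame(v1 = rexp(30, 1 / 10),
                v2 = rexp(30, 1 / 5),
                v3 = rexp(30, 1 / 2))

# Standardize each column: subtract the mean, divide by the standard deviation
z <- scale(d)

# One line per observation: after transposing, each column is one observation
matplot(t(z), type = "l", lty = 1, col = "grey40", xaxt = "n",
        xlab = "variable", ylab = "standardized value")
axis(1, at = seq_len(ncol(z)), labels = colnames(z))
```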