canadian bioinformatics workshops

53
Canadian Bioinformatics Workshops www.bioinformatics.ca

Upload: wallace-hawkins

Post on 31-Dec-2015

34 views

Category:

Documents


1 download

DESCRIPTION

Canadian Bioinformatics Workshops. www.bioinformatics.ca. Module #: Title of Module. 2. Module 2 Exploratory Data Analysis (EDA). Exploratory Data Analysis of Biological Data using R Boris Steipe Toronto, May 23. and 24. 2013. †. - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops

www.bioinformatics.ca

Page 2: Canadian Bioinformatics Workshops

2Module #: Title of Module

Page 3: Canadian Bioinformatics Workshops

Module 2Exploratory Data Analysis (EDA)

DEPARTMENT OF

BIOCHEMISTRY

DEPARTMENT OF

MOLECULAR GENETICS

DEPARTMENT OF

BIOCHEMISTRY

DEPARTMENT OF

MOLECULAR GENETICS

††

Odysseus listening to the song of the sirens. Late Archaic (500–480 BCE)

Exploratory Data Analysis of Biological Data using R

Boris SteipeToronto, May 23. and 24. 2013

This workshop includes material originally developed by Raphael Gottardo, FHCRC and by Sohrab Shah, UBC

††

Page 4: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Exploratory Data Analysis (EDA)

EDA is an approach to data analysis without a particular statistical model or hypothesis.

It is therefore often the first step of data analysis.

EDA is an approach to data analysis without a particular statistical model or hypothesis.

It is therefore often the first step of data analysis.

Page 5: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Exploratory Data Analysis (EDA)

The objectives of EDA include:

• uncovering underlying structure and identifying trends and patterns;

• extracting important variables;

• detecting outliers and anomalies;

• testing underlying assumptions;

• developing statistical models.

The objectives of EDA include:

• uncovering underlying structure and identifying trends and patterns;

• extracting important variables;

• detecting outliers and anomalies;

• testing underlying assumptions;

• developing statistical models.

Page 6: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Exploratory Data Analysis (EDA)

The practice of EDA emphasizes looking at data in different ways through:

• computing and tabulating basic descriptors of data properties such as ranges, means, and variances;

• generating graphics, such as boxplots, histograms, scatter plots;

• applying transformations, such as log or rank;

• comparing observations to statistical models, such as the QQ-plot, or regression;

• identifying underlying structure through clustering;

• simplifying data through dimension reduction ...

... all with the final goal of defining a statistical model and using the model for hypothesis testing and prediction.

The practice of EDA emphasizes looking at data in different ways through:

• computing and tabulating basic descriptors of data properties such as ranges, means, and variances;

• generating graphics, such as boxplots, histograms, scatter plots;

• applying transformations, such as log or rank;

• comparing observations to statistical models, such as the QQ-plot, or regression;

• identifying underlying structure through clustering;

• simplifying data through dimension reduction ...

... all with the final goal of defining a statistical model and using the model for hypothesis testing and prediction.

Page 7: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Graphics

Good graphics are immensely valuable. Poor graphics are worse than none.

If you want to learn more about good graphics and information design, find a copy of Edward Tufte's The Visual Display of Quantitative Information. You can also visit his Web site to get a sense of the field (www.edwardtufte.com).

Fundamentally, there is one simple rule.

Good graphics are immensely valuable. Poor graphics are worse than none.

If you want to learn more about good graphics and information design, find a copy of Edward Tufte's The Visual Display of Quantitative Information. You can also visit his Web site to get a sense of the field (www.edwardtufte.com).

Fundamentally, there is one simple rule.

The rule has many corollaries.The rule has many corollaries.

Use less ink.

Page 8: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Use less ink

Make sure that all elements on your graphics are necessary.

Make sure that all elements on your graphics are informative.

Make sure that all information in your data is displayed.

Make sure that all elements on your graphics are necessary.

Make sure that all elements on your graphics are informative.

Make sure that all information in your data is displayed.

Not all of R's defaults use as little ink as possible...

Page 9: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

refined graphics

... for a popular alternative to R's defaults, check out the ggplot2 package (http://www.ggplot2.org).

... for a popular alternative to R's defaults, check out the ggplot2 package (http://www.ggplot2.org).

> install.packages("ggplot2")...> library(ggplot2)...

Example: Bubble chart.Example: Bubble chart.http://ygc.name/2010/12/01/bubble-chart-by-using-ggplot2/http://ygc.name/2010/12/01/bubble-chart-by-using-ggplot2/

crime <- read.csv("crimeRatesByState2008.csv", header=TRUE, sep="\t")p <- ggplot(crime, aes(murder,burglary,size=population, label=state))p <- p+geom_point(colour="red") +scale_area(to=c(1,20))+geom_text(size=3)p + xlab("Murders per 100,000 population") + ylab("Burglaries per 100,000")

crime <- read.csv("crimeRatesByState2008.csv", header=TRUE, sep="\t")p <- ggplot(crime, aes(murder,burglary,size=population, label=state))p <- p+geom_point(colour="red") +scale_area(to=c(1,20))+geom_text(size=3)p + xlab("Murders per 100,000 population") + ylab("Burglaries per 100,000")

Page 10: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Graphics

R Graph Gallery: http://gallery.r-enthusiasts.com/ R Graph Gallery: http://gallery.r-enthusiasts.com/

Page 11: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

exploring data

• When teaching (or learning new procedures) I prefer to work with synthetic data.

• Synthetic data has the advantage that I know what the outcome of the analysis should be.

• Typically one would create values acording to a function and then add noise.

• R has several functions to create sequences of values – or you can write your own ...

• When teaching (or learning new procedures) I prefer to work with synthetic data.

• Synthetic data has the advantage that I know what the outcome of the analysis should be.

• Typically one would create values acording to a function and then add noise.

• R has several functions to create sequences of values – or you can write your own ...

0:10 seq(0, pi, 5*pi/180) rep(1:3, each=3, times=2)for (i in 1:10) { print(i*i) }

Page 12: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

synthetic data

Function ...Function ...

Explore functions and noise.

Noise ...Noise ...

Noisy Function ...Noisy Function ...

Page 13: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

discrete data

TGGGACGATCAACTCGAAGCGGATGCCCCTATTGATCCTCTCCGAACACAATGAGAATGCTCTCAAAGGCTGTAAGCTCAGTGAACGGGACTAAATCAGACGTTTGTCGTCGATACGGGTGCCCTACCTCTACCTAATCTGATCGAGGCCAAGTATCTGGAAGGACTGCTAATTTTGCCGTAAACCCCGTTCTTGTTCAGTATAATCTAAGTAGATGTGTTAGCAGTCATATGACGATAGTTTGGGGGTCCTCCTATAGGAAAAAGAAAGAGTCGCCACACCAGAGCCTGTAGCGCTTTCTAGATAGTGCTGCATATTTATATGTCGGCCCCGAACCAGAGCGCTCCAATGGTAGCCCCTTTAATCTTCGTATCTTACCTTTTATGAGTGCACAGGTTTCCATCGAGGGAGAAAAACCTCAGCAACGTGGGTGGGTAGAAGAGCCCTAGTTTGAAAGCCGCACCATAACCCGCACATCGTCAGATCAGTAACCCAAGATCGGTGGGCTGTAACTAGGCTCCGTGACACAGCGTGGTATTCCGAGTTCCCGAAATCGTTTCACCTATAGAACGCCACCCCGGACGGGGTTGTTAGTTTTTCTACCTTTTAAGAAGAAAAGCAAAGTGTGTGGACACGAGAACTAGTGTGAGTACGGTTTTGTATGTGGCCCTACTGTGGAAACTCAGTAGTACGAAGGGGATAGCGAGACTTAGCTTTGCCCCAACTGCCGTCACGCACCCGCTTGTGCCGGTACGCAGAGCTCCGCCGGGTGTCCAAGTGCCGTTCTACGATAAGAACCTGTGTATCTAGCGCGCCCGATATGAATAAAGCCTACTCTTATCCAGATTTTGCGGACTGGTAAGCGTGACAATTATTGCGCAGCTTCGACTTAGTTCTCCTTGCCTTGCTTTAGGGGAGTTCTCCACTCAAAAGTCGTTGACGTACAATCGCAGATTTTGTAATCCCTTAAACCTCTGATTAGTCTCAGCCGTATTCACTA

Exercise: Write a function to output random nucleotides.

Page 14: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

random nucleotides: solution 1

Generating uniformly distributed random nucleotides using sample(). Generating uniformly distributed random nucleotides using sample().

rNucUnif <- function(n) {nuc <- c("A", "C", "G", "T")rSeq <- sample(nuc, n, replace=TRUE)return(rSeq)

}

Task: Explore a function that samples according to arbitrary probabilities.

Page 15: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

sample()

• Random samples from a vector with/without replacement.• Sampling a vector over its length without replacement is a permutation

or shuffle of the vector.

• Random samples from a vector with/without replacement.• Sampling a vector over its length without replacement is a permutation

or shuffle of the vector.

sums <- rep(0, 10)digits <- 0:9rsd <- vector()for (i in 1:10000) {

perm <- sample(digits, 10)sums <- sums + permif (i %% 1000 == 0) {

rsd <- c(rsd, sd(sums)/mean(sums))}

}

Task: Test whether the permutation is biased.

Page 16: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

probability distributions

• Can be either discrete or continuous (uniform, Bernoulli, normal, etc)

• Defined by a density function, p(x) or f(x)

• Bernoulli distribution Be(p): flip a coin (T=0, H=1). Assume probability of H is 0.1 ...

• Can be either discrete or continuous (uniform, Bernoulli, normal, etc)

• Defined by a density function, p(x) or f(x)

• Bernoulli distribution Be(p): flip a coin (T=0, H=1). Assume probability of H is 0.1 ...

x <- 0:1 f <- dbinom(x, size=1, prob=0.1) plot(x, f, xlab="x", ylab="density", type="h", lwd=5)

Page 17: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

probability distributions

Random sampling: Generate 100 observations from a Be(0.1)

Random sampling: Generate 100 observations from a Be(0.1)

set.seed(100)x <- rbinom(100, size=1, prob=0.1)hist(x)

Page 18: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

probability distributions

Normal distribution N(μ,σ2)μ is the mean and σ2 is the variance.

Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small random variables, it will be normally distributed.

Normal distribution N(μ,σ2)μ is the mean and σ2 is the variance.

Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small random variables, it will be normally distributed.

x <- seq(-4, 4, 0.1)f <- dnorm(x, mean=0, sd=1)plot(x, f, xlab="x", ylab="density", lwd=5, type="l")

The area under the curve is the probability of observing a value between 0 and 2.

Page 19: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

probability distributions

Normal distribution N(μ,σ2)μ is the mean and σ2 is the variance.

Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small random variables, it will be normally distributed.

Normal distribution N(μ,σ2)μ is the mean and σ2 is the variance.

Extremely important because of the Central Limit Theorem: if a random variable is the sum of a large number of small random variables, it will be normally distributed.

x <- seq(-4, 4, 0.1)f <- dnorm(x, mean=0, sd=1)plot(x, f, xlab="x", ylab="density", lwd=5, type="l")

The area under the curve is the probability of observing a value between 0 and 2.

Task:

Explore line parameters

Page 20: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

probability distributions

Random sampling: Generate 100 observations from a N(0,1)

Random sampling: Generate 100 observations from a N(0,1)

set.seed(100)x <- rnorm(100, mean=0, sd=1)hist(x)lines(seq(-3,3,0.1),50*dnorm(seq(-3,3,0.1)), col="red")

Histograms can be used to estimate densities!Histograms can be used to estimate densities!

Page 21: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

quantiles

(Theoretical) Quantiles:

The p-quantile has the property that there is a probability p of getting a value less than or equal to it.

(Theoretical) Quantiles:

The p-quantile has the property that there is a probability p of getting a value less than or equal to it.

The 50% quantile is called the median.The 50% quantile is called the median.

90% of the probability (area under the curve) is to the left of the red vertical line.

q90 <- qnorm(0.90, mean = 0, sd = 1)x <- seq(-4, 4, 0.1)f <- dnorm(x, mean=0, sd=1)plot(x, f, xlab="x", ylab="density", type="l", lwd=5)abline(v=q90, col=2, lwd=5)

Page 22: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

descriptive statistics

Empirical Quantiles:

The p-quantile has the property that p% of the observations are less than or equal to it.

Empirical quantiles can be easily obtained in R.

Empirical Quantiles:

The p-quantile has the property that p% of the observations are less than or equal to it.

Empirical quantiles can be easily obtained in R.

> set.seed(100)> x <- rnorm(100, mean=0, sd=1)> quantile(x) 0% 25% 50% 75% 100% -2.2719255 -0.6088466 -0.0594199 0.6558911 2.5819589 > quantile(x, probs=c(0.1, 0.2, 0.9)) 10% 20% 90% -1.1744996 -0.8267067 1.3834892

Page 23: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

descriptive statistics

We often need to quickly 'quantify' a data set, and this can be done using a set of summary statistics (mean, median, variance, standard deviation).

We often need to quickly 'quantify' a data set, and this can be done using a set of summary statistics (mean, median, variance, standard deviation).

> mean(x)[1] 0.002912563> median(x)[1] -0.0594199> IQR(x)[1] 1.264738> var(x)[1] 1.04185> summary(x) Min. 1st Qu. Median Mean 3rd Qu. Max. -2.272000 -0.608800 -0.059420 0.002913 0.655900 2.582000

Exercise: what are the units of variance and standard deviation?

Page 24: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Box-plot

Descriptive statistics can be intuitively summarized in a Box-plot.

Descriptive statistics can be intuitively summarized in a Box-plot.

> boxplot(x)

IQRIQR

1.5 x IQR1.5 x IQR

1.5 x IQR1.5 x IQR

Everything above and below 1.5 x IQR is considered an "outlier".

75% quantile

Median

25% quantile

IQR = Inter Quantile Range = 75% quantile – 25% quantileIQR = Inter Quantile Range = 75% quantile – 25% quantile

Page 25: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Violinplot

Internal structure of a data-vector can be made visible in a violin plot. The principle is the same as for a boxplot, but a width is calculated from a smoothed histogram.

Internal structure of a data-vector can be made visible in a violin plot. The principle is the same as for a boxplot, but a width is calculated from a smoothed histogram.

p <- ggplot(X, aes(1,x))p + geom_violin()

Page 26: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

plotting data in R

Task: Explore types of plots.

Page 27: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

QQ–plot

One of the first things we may ask about data is whether it deviates from an expectation e.g. to be normally distributed.

The quantile-quantile plot provides a way to visually verify this.

The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the distribution assumed (theoretical one) is indeed the correct one, we should observe a straight line.

R provides qqnorm() and qqplot().

One of the first things we may ask about data is whether it deviates from an expectation e.g. to be normally distributed.

The quantile-quantile plot provides a way to visually verify this.

The QQ-plot shows the theoretical quantiles versus the empirical quantiles. If the distribution assumed (theoretical one) is indeed the correct one, we should observe a straight line.

R provides qqnorm() and qqplot().

Page 28: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

QQ–plot: sample vs. Normal

Only valid for the normal distribution!Only valid for the normal distribution!

qqnorm(x)qqline(x, col=2)

Page 29: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Plotting: lines and legendsExample: compare the t-distribution with the Normal distribution.

Example: compare the t-distribution with the Normal distribution.

x <- seq(-4, 4, 0.1)f1 <- dnorm(x, mean=0, sd=1)f2 <- dt(x, df=2)plot(x, f1, xlab="x", ylab="density", lwd=5, type="l")lines(x, f2, lwd=5, col=2)legend(-4, .4, c("Normal", "t2"), col=1:2, lwd=5)

Page 30: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

QQ–plot: sample vs. Normal

Clearly the t distribution with two degrees of freedom is not Normal.Clearly the t distribution with two degrees of freedom is not Normal.

set.seed(100)t <- rt(100, df=2)qqnorm(t)qqline(t, col=2)

Page 31: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

QQ–plot

set.seed(101)generateVariates <- function(n) { Nvar <- 10000 Vout <- c() for (i in 1:n) { x <- runif(Nvar, -0.01, 0.01) Vout <- c(Vout, sum(x) ) } return(Vout)}

x <- generateVariates(1000)y <- rnorm(1000, mean=0, sd=1)qqnorm(x)qqline(x, y, col=2)

Verify the CLT.Verify the CLT.

Page 32: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

QQ–plot: sample vs. sample

Comparing two samples: are their distributions the same?

... or ...

compare a sample vs. a synthetic dataset.

Comparing two samples: are their distributions the same?

... or ...

compare a sample vs. a synthetic dataset.

set.seed(100)x <- rt(100, df=2)y <- rnorm(100, mean=0, sd=1)qqplot(x, y)

Exercise: try different values of df for rt() and compare the vectors.

Page 33: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

scatter plots

Biological data sets often contain several variables, they are multivariate.

Scatter plots allow us to look at two variables at a time.

Biological data sets often contain several variables, they are multivariate.

Scatter plots allow us to look at two variables at a time.

This can be used to assess independence and identify subgroups.This can be used to assess independence and identify subgroups.

# GvHD flow cytometry datagvhd <- read.table("GvHD.txt", header=TRUE)# Only extract the CD3 positive cellsgvhdCD3p <- as.data.frame(gvhd[gvhd[, 5]>280, 3:6])plot(gvhdCD3p[, 1:2])

Page 34: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

scatter plots

Task: Explore scatter plots.

Page 35: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

trellis graphics

Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets.Trellis Graphics is a family of techniques for viewing complex, multi-variable data sets.

plot(gvhdCD3p, pch=".")

Tip:

Many more possibilities are available in the "lattice" package. See ?Lattice

Page 36: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Boxplots

The boxplot function can be used to display several variables at a time.

The boxplot function can be used to display several variables at a time.

boxplot(gvhdCD3p)

Exercise: Interpret this plot.

Page 37: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Histograms

We discover a mix of distinct cell subpopulations!We discover a mix of distinct cell subpopulations!

par(mfrow=c(2, 2))hist(gvhdCD3p[, 1], 50,

main=names(gvhdCD3p)[1], xlab="fluorescent intensity", ylim=c(0, 120))

hist(gvhdCD3p[, 2], 50, main=names(gvhdCD3p)[2], xlab="fluorescent intensity", ylim=c(0, 120))

hist(gvhdCD3p[, 3], 50, main=names(gvhdCD3p)[3], xlab="fluorescent intensity", ylim=c(0, 120))

hist(gvhdCD3p[, 4], 50, main=names(gvhdCD3p)[4], xlab="fluorescent intensity", ylim=c(0, 120))

Page 38: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: HIV microarray data

HIV Data:

The expression levels of 7680 genes were measured in CD4-T-cell lines at time t = 24 hours after infection with HIV type 1 virus.

• 12 positive controls (HIV genes).

• 4 replicates (2 with a dye swap)

HIV Data:

The expression levels of 7680 genes were measured in CD4-T-cell lines at time t = 24 hours after infection with HIV type 1 virus.

• 12 positive controls (HIV genes).

• 4 replicates (2 with a dye swap)

Page 39: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: HIV microarray data

We get the data after the image analysis is done.

For each spot we have an estimate of the intensity in both channels.

The data matrix therefore is of size 7680 x 8.

We get the data after the image analysis is done.

For each spot we have an estimate of the intensity in both channels.

The data matrix therefore is of size 7680 x 8.

One of the array chips.One of the array chips.

Page 40: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: HIV microarray data

data <- read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE)

summary(data)boxplot(data)

Page 41: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: HIV microarray data

opar <- par(mfrow=c(2, 2))hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity")hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity")hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity")hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity")par(oPar)

Page 42: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: HIV microarray data

# 'apply' will apply the function to all rows of the data matrixmean <- apply(data[, 1:4], 1, "mean")sd <- apply(data[, 1:4], 1, "sd")plot(mean, sd)trend <- lowess(mean, sd)lines(trend, col=2, lwd=5)

lowess fit

LOcally WEighted Scatter plot Smoother; used to estimate the trend in a scatter plot. Non parametric!

lowess fit

LOcally WEighted Scatter plot Smoother; used to estimate the trend in a scatter plot. Non parametric!

Page 43: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: Transformations

Observations:

The data are highly skewed. The standard deviation is not constant, as it increases with the mean.

Solution:

Look for a transformation that will make the data more symmetric and the variance more constant.

With positive data the log transformation is often appropriate.

Observations:

The data are highly skewed. The standard deviation is not constant, as it increases with the mean.

Solution:

Look for a transformation that will make the data more symmetric and the variance more constant.

With positive data the log transformation is often appropriate.

Page 44: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: Transformations

data <- log(read.table(file="hiv.raw.data.24h.txt", sep="\t", header=TRUE))summary(data)boxplot(data)

Page 45: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: HIV microarray data

opar <- par(mfrow=c(2, 2))hist(data[, 1], 50, main=names(data)[1], xlab="fluorescent intensity")hist(data[, 2], 50, main=names(data)[2], xlab="fluorescent intensity")hist(data[, 5], 50, main=names(data)[5], xlab="fluorescent intensity")hist(data[, 6], 50, main=names(data)[6], xlab="fluorescent intensity")par(oPar)

Page 46: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA: Transformations

The standard deviation has become almost independent of the mean.

The standard deviation has become almost independent of the mean.

# 'apply' will apply the function to all rows of the data matrixmean <- apply(data[, 1:4], 1, "mean")sd <- apply(data[, 1:4], 1, "sd")plot(mean, sd)trend <- lowess(mean, sd)lines(trend, col=2, lwd=5)

Page 47: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA for microarray data: always log

• Makes the data more symmetric, large observations are not as influential

• The variance is (more) constant

• Turns multiplication into addition (log(ab)=log(a)+log(b))

• In practice use log base 2, log2(x)=log(x)/log(2)

• Makes the data more symmetric, large observations are not as influential

• The variance is (more) constant

• Turns multiplication into addition (log(ab)=log(a)+log(b))

• In practice use log base 2, log2(x)=log(x)/log(2)

Page 48: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA for gene expression

Is this a good way to look at the data?Is this a good way to look at the data?

# scatter plotplot(data[, 1], data[, 5], xlab=names(data)[1], ylab=names(data)[5])

Page 49: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA for gene expression: MA plots

M (minus) is the log ratio.

A (average) is overall intensity.

M (minus) is the log ratio.

A (average) is overall intensity.

# MA plotA <- (data[, 1]+data[, 5])/2M <- (data[, 1]-data[, 5])plot(A, M, xlab="A", ylab="M")

Page 50: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

EDA for gene expression: MA plots

The data suggests that a small group of genes are interesting. How do we define/find such differentially expressed genes? More on this in Module 6: Hypothesis Testing.The data suggests that a small group of genes are interesting. How do we define/find such differentially expressed genes? More on this in Module 6: Hypothesis Testing.

# MA plots per replicatepar(mfrow=c(2, 2))A1 <- (data[, 1]+data[, 5])/2M1 <- (data[, 1]-data[, 5])plot(A1, M1, xlab="A", ylab="M", main="rep 1")trend <- lowess(A1, M1)lines(trend, col=2, lwd=5)A2 <- (data[, 2]+data[, 6])/2M2 <- (data[, 2]-data[, 6])plot(A2, M2, xlab="A", ylab="M", main="rep 2")trend <- lowess(A2, M2)lines(trend, col=2, lwd=5)A3 <- (data[, 3]+data[, 7])/2M3 <- (data[, 3]-data[, 7])plot(A3, M3, xlab="A", ylab="M", main="rep 3")trend <- lowess(A3, M3)lines(trend, col=2, lwd=5)A4 <- (data[, 4]+data[, 8])/2M4 <- (data[, 4]-data[, 8])plot(A4, M4, xlab="A", ylab="M", main="rep 4")trend <- lowess(A4, M4)lines(trend, col=2, lwd=5)par(OldPar)

Page 51: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

More plots

Task: Explore 3D plots.

Page 52: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

Summary

EDA should be the first step in any statistical analysis!

Good modeling starts and ends with EDA.

R provides a great framework for EDA.

EDA should be the first step in any statistical analysis!

Good modeling starts and ends with EDA.

R provides a great framework for EDA.

Page 53: Canadian Bioinformatics Workshops

Module 2: Exploratory Data Analysis (EDA) bioinformatics.ca

[email protected]@utoronto.ca