Statistics for Microarray Data Analysis with R

Session 3: More features of R and the Central Limit Theorem

Class web site: http://ludwig-sun2.unil.ch/~darlene/



Today’s Outline

• Further features of the R language

• Preliminary data analysis exercise

• Central Limit Theorem (CLT)

• CLT exercise

• Some material included here was adapted from materials available at http://www.bioconductor.org/ and is used by permission




R: factors

• Categorical variables in R should be specified as factors

• Factors can take on a limited number of values, called levels

• Levels of a factor may have a natural order

• Functions in R for creating factors: factor(), ordered()
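
As a minimal sketch of the two functions named above, the following uses a made-up vector of tumour grades (the variable names and values are illustrative assumptions, not part of the course data):

```r
# Hypothetical tumour grades for five samples (illustrative values only)
grade <- c("low", "high", "medium", "low", "high")

# Unordered factor: levels are sorted alphabetically by default
grade.f <- factor(grade)
levels(grade.f)   # "high" "low" "medium"
table(grade.f)    # counts per level

# Ordered factor: state the natural order of the levels explicitly
grade.o <- ordered(grade, levels = c("low", "medium", "high"))
grade.o[1] < grade.o[2]   # TRUE, because low < high
```

Comparisons such as `<` are only meaningful for ordered factors, which is why the level order is given explicitly in the second call.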

R: data frames (review)

• data frame: the type of R object normally used to store a data set

• A data frame is a rectangular table with rows and columns

– data within each column has the same type (e.g. number, character, logical)

– different columns may have different types

• Example:

> tumor.info
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX234       distal       8.0     TRUE
XX987     proximal      10.0    FALSE

R: making data frames

• Data frames can be created in R by importing a data set

• A data frame can also be created from pre-existing variables

• Example:

> localisation <- c("proximal","distal","proximal")
> tumorsize <- c(6.3,8,10)
> progress <- c(FALSE,TRUE,FALSE)
> tumor.info <- data.frame(localisation,tumorsize,progress)
> rownames(tumor.info) <- c("XX348","XX234","XX987")
> tumor.info$tumorsize
[1]  6.3  8.0 10.0

R: more on subsetting

• Subset rows by a vector of indices:

> tumor.info[c(1,3),]
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE

• Subset rows by a logical vector:

> tumor.info[c(TRUE,FALSE,TRUE),]
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE

• Subset a column:

> tumor.info$localisation
[1] "proximal" "distal"   "proximal"

• Comparison resulting in a logical vector:

> tumor.info$localisation=="proximal"
[1]  TRUE FALSE  TRUE

• Subset the selected rows:

> tumor.info[ tumor.info$localisation=="proximal", ]
      localisation tumorsize progress
XX348     proximal       6.3    FALSE
XX987     proximal      10.0    FALSE

R: loops

• Loops are used when the same or similar tasks need to be performed multiple times in an iterative fashion

• Examples:

> for(i in 1:10) {
    print(i*i)
  }

> i = 1
> while(i<=10) {
    print(i*i)
    i=i+sqrt(i)
  }

• Explicit loops such as these should be avoided where possible

R: lapply, sapply

• When the same or similar tasks need to be performed multiple times for all elements of a list or for all columns of an array

• These implicit loops are usually more concise than explicit ‘for’ loops and are often at least as fast

• lapply(the.list,the.function)

– the.function is applied to each element of the.list

– result is a list whose elements are the individual results for the.function

• sapply(the.list,the.function)

– Like lapply, but tries to simplify the result, by converting it into a vector or array of appropriate size
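
The difference between the two functions can be sketched on a small list. The list `expr` and its gene names are hypothetical and only serve as an illustration:

```r
# Hypothetical list of expression measurements for three genes
expr <- list(geneA = c(2.1, 3.4, 2.8),
             geneB = c(7.0, 6.5),
             geneC = c(0.9, 1.1, 1.0, 1.2))

# lapply always returns a list, one element per input element
means.list <- lapply(expr, mean)

# sapply simplifies the same result to a named numeric vector
means.vec <- sapply(expr, mean)

# Equivalent explicit loop, for comparison
means.loop <- numeric(length(expr))
for (i in seq_along(expr)) {
  means.loop[i] <- mean(expr[[i]])
}
```

All three versions compute the same means; only the shape of the result differs.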

R: apply

• apply(array, margin,the.function)

– applies the.function along the dimension of array specified by margin

– result is a vector or matrix of the appropriate size

• Example:

> x
     [,1] [,2] [,3]
[1,]    5    7    0
[2,]    7    9    8
[3,]    4    6    7
[4,]    6    3    5

> apply(x, 1, sum)
[1] 12 24 17 14

> apply(x, 2, sum)
[1] 22 25 20

R: sweep and scale

• sweep(...) removes a statistic from dimensions of an array

• Example: Subtract column medians

> col.med<-apply(my.data,2,median)

> sweep(my.data,2,col.med)

• scale(...) centers and/or rescales columns of a matrix
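
A short sketch of both functions follows. The matrix `my.data` is filled with the same numbers as the apply() example, which is an assumption made only so that the code runs on its own:

```r
# Hypothetical 4 x 3 data matrix (filled column by column)
my.data <- matrix(c(5, 7, 4, 6,
                    7, 9, 6, 3,
                    0, 8, 7, 5), nrow = 4)

# Subtract each column's median; "-" is the default operation for sweep()
col.med  <- apply(my.data, 2, median)
centered <- sweep(my.data, 2, col.med)
apply(centered, 2, median)   # all zero after centering

# scale() centers each column on its mean and divides by its SD (the defaults)
z <- scale(my.data)
colMeans(z)        # approximately 0
apply(z, 2, sd)    # 1 for every column
```

scale() works with means and SDs, so median-based centering as in the first half still needs sweep().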

R: importing and exporting data (review)

• Many ways to get data into and out of R

• One straightforward way is to use tab-delimited text files (e.g. save an Excel sheet as tab-delimited text, for easy import into R)

• Useful R functions: read.delim(), read.table(), read.csv(), write.table()

• Example:

> x = read.delim("filename.txt")
> write.table(x, file="x.txt", sep="\t")
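
A self-contained round trip might look like the sketch below. The data frame contents are invented for illustration, and a temporary file is used instead of a real file name:

```r
# Small hypothetical data frame to write out
x <- data.frame(gene = c("g1", "g2", "g3"),
                intensity = c(10.5, 8.2, 12.0))

# Write to a temporary tab-delimited file (tempfile() avoids touching real files)
out.file <- tempfile(fileext = ".txt")
write.table(x, file = out.file, sep = "\t", quote = FALSE, row.names = FALSE)

# Read it back; read.delim() assumes tab separators and a header row
y <- read.delim(out.file)
str(y)
```

Setting row.names = FALSE keeps the row labels out of the file, so the header line has one name per column when the file is read back.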

R: introduction to object orientation

• Primitive (or atomic) data types in R are:

– numeric (integer, double, complex)

– character

– logical

– function

• From these, vectors, arrays, lists can be built

• An object is an abstract term for anything that can be assigned to a variable

• Components of objects are called slots

• Example: a microarray experiment

– probe intensities

– patient data (tissue location, diagnosis, follow-up)

– gene data (sequence, IDs, annotation)

R: classes and generic functions

• Object-oriented programming aims to create coherent data systems and methods that work on them

• In general, there is a class of data objects and a (print, plot, etc.) method for that class

• Generic functions, such as print, act differently depending on the class of the function argument

• This means that we don’t need to worry about a lot of the programming details

• In R, an object has a (character vector) class attribute which determines the mode of action for the generic function
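
How the class attribute steers a generic function can be sketched with a small example. The class name "tumor" and the method below are hypothetical and chosen only for illustration:

```r
# Build an object of a hypothetical class "tumor" (the class name is illustrative)
tumor <- list(id = "XX348", localisation = "proximal", tumorsize = 6.3)
class(tumor) <- "tumor"

# A print method for that class; print() dispatches on the class attribute
print.tumor <- function(x, ...) {
  cat("Tumor", x$id, ":", x$localisation, ", size", x$tumorsize, "\n")
  invisible(x)
}

print(tumor)            # uses print.tumor
print(unclass(tumor))   # without the class attribute, default list printing
```

The same call, print(), produces different output for the two objects because only the first one carries the class attribute.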

Exercises: Bittner et al. dataset

• You should have downloaded the dataset gene_list-Cutaneous_Melanoma.xls from the web

• Use the handout as a guide to get this dataset into R and do some preliminary analyses

• If you do not have this dataset, you can use your own data

Sample surveys

• Surveys are carried out with the aim of learning about characteristics (or parameters) of a target population, the group of interest

• The survey may select all population members (census) or only a part of the population (sample)

• Typically studies sample individuals (rather than obtain a census) because of time, cost, and other practical constraints

Sampling variability

• Say we sample from a population in order to estimate the population mean of some (numerical) variable of interest (e.g. weight, height, number of children, etc.)

• We would use the sample mean as our guess for the unknown value of the population mean

• Our sample mean is very unlikely to be exactly equal to the (unknown) population mean just due to chance variation in sampling

• Thus, it is useful to quantify the likely size of this chance variation (also called ‘chance error’ or ‘sampling error’, as distinct from ‘nonsampling errors’ such as bias)

Sampling variability of the sample mean

• Say the SD in the population for the variable is known to be some number σ

• If a sample of n individuals has been chosen ‘at random’ from the population, then the likely size of chance error of the sample mean (called the ‘standard error’) is

SE(mean) = σ/√n

• If σ is not known, you can substitute an estimate
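
The standard error formula can be checked by simulation. In this sketch the population is assumed to be normal with mean 10 and SD 2, and the sample size of 25 is equally arbitrary:

```r
set.seed(1)   # arbitrary seed so the simulation is repeatable

sigma <- 2    # assumed population SD
n     <- 25   # sample size

# Theoretical standard error of the sample mean
se.theory <- sigma / sqrt(n)   # 0.4

# Draw 10000 samples of size n and record each sample mean
means <- replicate(10000, mean(rnorm(n, mean = 10, sd = sigma)))

# The SD of the simulated means should be close to se.theory
sd(means)

# With sigma unknown, plug in the SD of a single sample
one.sample <- rnorm(n, mean = 10, sd = sigma)
se.est <- sd(one.sample) / sqrt(n)
```

The plug-in estimate se.est varies from sample to sample, whereas se.theory is fixed once the population SD and n are known.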

Sampling variability of the sample proportion

• Similarly, we could use the sample proportion as a guess for the unknown population proportion p with some characteristic (e.g. proportion of females)

• If a sample of n individuals has been chosen ‘at random’ from the population, then the likely size of chance error of the sample proportion is

SE(proportion) = √(p(1-p)/n)

• Of course, we don’t know p (or we would not need to estimate it), so we substitute our estimate
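
As a small worked example with invented counts (116 females in a sample of 200; both numbers are assumptions for illustration):

```r
n       <- 200   # sample size
females <- 116   # hypothetical number of females in the sample

# Sample proportion, used in place of the unknown p
p.hat <- females / n                      # 0.58

# Estimated standard error of the sample proportion
se <- sqrt(p.hat * (1 - p.hat) / n)       # about 0.035
```

So the sample proportion of 0.58 is likely to be off from the population proportion by roughly 0.035 in either direction.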

Central Limit Theorem (CLT)

• The CLT says that if we

– repeat the sampling process many times

– compute the sample mean (or proportion) each time

– make a histogram of all the means (or proportions)

• then that histogram of sample means (or proportions) should look like the normal distribution

• Of course, in practice we only get one sample from the population

• The CLT provides the basis for making confidence intervals and hypothesis tests for means or proportions
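
The repeated-sampling idea can be sketched in R. The population here is assumed to be exponential with mean 1, a deliberately skewed choice, and `skew` is a helper defined only for this illustration:

```r
set.seed(42)   # arbitrary seed

n    <- 50     # size of each sample
reps <- 5000   # number of repeated samples

# Skewed population: exponential with mean 1 (and SD 1)
sample.means <- replicate(reps, mean(rexp(n, rate = 1)))

# Simple moment-based skewness; 0 for a symmetric distribution
skew <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew(rexp(reps, rate = 1))   # clearly positive: individual values are skewed
skew(sample.means)           # much closer to 0

mean(sample.means)   # near the population mean, 1
sd(sample.means)     # near 1/sqrt(50)
```

Calling hist(sample.means) gives the histogram described above; a histogram of the individual values from rexp() stays skewed no matter how many values are drawn.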

What the CLT does not say

• The CLT does not say that the histogram of variable values will look like the normal distribution

• The distribution of the individual variable values will look like the population distribution of variable values for a big enough sample

• This population distribution does not have to be normal, and in practice is typically not normal

CLT: technical details

• A few technical conditions must be met for the CLT to hold

• The most important ones in practice are that

– the sampling should be random (in a carefully defined sense)

– the sample size should be ‘big enough’

• How big is ‘big enough’? There is no single answer because it depends on the variable’s distribution in the population: the less symmetric the distribution, the larger the sample you need

Exercises: CLT simulations

• Here, you will simulate flipping coins

• The coins will have differing probabilities of ‘heads’

• The object is to see how the number of coin flips required for the distribution of the proportion of heads in the simulated flips to become approximately normal depends on the probability of ‘heads’

• See the handout for details
