Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics kjarcher@vcu.edu

Upload: wilfred-moody

Post on 17-Jan-2016


Page 1: Introduction to Microarrays Kellie J. Archer, Ph.D. Assistant Professor Department of Biostatistics kjarcher@vcu.edu

Introduction to Microarrays

Kellie J. Archer, Ph.D.

Assistant Professor

Department of Biostatistics

kjarcher@vcu.edu

Page 2

Microarrays

A snapshot that captures the activity pattern of thousands of genes at once.  

Custom spotted arrays Affymetrix GeneChip

Kellie Archer
Microarrays assess gene expression for thousands of genes simultaneously in a single sample. Currently, two different technologies dominate the field of microarray research: custom spotted arrays and oligonucleotide arrays. Custom spotted or cDNA arrays consist of hundreds to thousands of spots, where each spot contains a PCR product. Typically, two samples are labeled with different dyes and hybridized to the same array. Affymetrix GeneChips are commonly used oligonucleotide microarrays. Rather than one spot representing a gene, several short oligos interrogate the same gene. Only one sample is hybridized to an Affymetrix GeneChip.

Page 3

Spotted Microarray Process

[Figure: spotted microarray process, showing control (CTRL) and test (TEST) samples]

Page 4

Affymetrix GeneChip® Probe Arrays

[Figure: GeneChip probe array (1.28 cm), hybridized probe cell (24 µm), and image of a hybridized probe array. Image tag: BGT108_DukeUniv]

• Each probe cell or feature contains millions of copies of a specific oligonucleotide probe

• Over 250,000 different probes complementary to genetic information of interest

• Labels in the figure: single-stranded, fluorescently labeled DNA target; oligonucleotide probe

Page 5

Applications of microarrays

• Cancer research: Molecular characterization of tumors on a genomic scale; more reliable diagnosis and effective treatment of cancer

• Immunology: Study of host genomic responses to bacterial infections

• Model organisms: Multifactorial experiments monitoring expression response to different treatments and doses, over time or in different cell types

• etc.

Page 6

Applications of Microarrays

• Compare mRNA transcript levels in different types of cells, i.e., vary
– Tissue (liver vs. brain);
– Treatment (Drugs A, B, and C);
– State (tumor vs. normal);
– Organism (yeast, different strains);
– Timepoint;
– etc.

Page 7

Page 8

Affymetrix Design

PM: GCGCCGGCTGCAGGAGCAGGAGGAG

MM: GCGCCGGCTGCACGAGCAGGAGGAG

11–20 probe pairs interrogate each gene

Kellie Archer
This is an example of one gene on an Affymetrix GeneChip. Each gene is interrogated by several perfect match (PM) probes, where each PM is complementary to the gene region of interest. Each PM has a corresponding mismatch (MM) probe, which is identical to the PM except that the middle base has been replaced with its complement. The purpose of the MM probe is to identify non-specific binding.
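
Comparing the two 25-base sequences on this slide, the MM differs from the PM only at the 13th base, where G becomes its complement C. The sketch below reproduces that relationship. The function name `mismatch_probe` is hypothetical and is not part of any Affymetrix software.

```python
# Watson-Crick complements for DNA bases.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def mismatch_probe(pm: str) -> str:
    """Return the MM probe: the PM sequence with its middle base complemented."""
    if len(pm) % 2 == 0:
        raise ValueError("probe length must be odd to have a single middle base")
    mid = len(pm) // 2  # index 12, i.e. the 13th base, for a 25-mer
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]


pm = "GCGCCGGCTGCAGGAGCAGGAGGAG"  # PM sequence shown on the slide
print(mismatch_probe(pm))         # GCGCCGGCTGCACGAGCAGGAGGAG, the slide's MM
```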
Page 9

Image Analysis: Pixel Level Data

6 x 6 matrix of pixels for each PM and MM probe

HG-U133A GeneChip

Kellie Archer
On the left is a portion of an Affymetrix GeneChip image. I have zoomed in to reveal the PM and MM cells. A grid has also been overlaid for cell addressing, that is, to identify the location of the PM and MM cells. On the right is a PM cell from the left-hand figure, which I have zoomed in on to reveal the pixel-level data. Each PM and MM cell is composed of a 6 x 6 matrix of pixel intensities. These pixels must be summarized in some way to report a probe-level intensity measure.
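
The slides do not say which summary is applied to the 6 x 6 pixel matrix. As a minimal sketch, the code below assumes one plausible choice: drop the outer ring of pixels (which may pick up signal from neighboring cells) and take the 75th percentile of the remaining interior pixels. The function name, the quantile, and the pixel values are all illustrative assumptions.

```python
import math


def summarize_cell(pixels, quantile=0.75, drop_border=True):
    """Summarize a square grid of pixel intensities as one probe-level value."""
    if drop_border:
        # Assumption: border pixels are discarded before summarizing.
        pixels = [row[1:-1] for row in pixels[1:-1]]
    values = sorted(v for row in pixels for v in row)
    # Nearest-rank quantile: the smallest value with at least
    # `quantile` of the data at or below it.
    rank = max(1, math.ceil(quantile * len(values)))
    return values[rank - 1]


# Made-up 6 x 6 cell with intensities 0..35, row by row.
cell = [[r * 6 + c for c in range(6)] for r in range(6)]
print(summarize_cell(cell))                     # uses the 16 interior pixels
print(summarize_cell(cell, drop_border=False))  # uses all 36 pixels
```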
Page 10

Expression Quantification

PM: GCGCCGGCTGCAGGAGCAGGAGGAG

MM: GCGCCGGCTGCACGAGCAGGAGGAG

PM and MM intensities are combined to form an expression measure for the probe set (gene)

Kellie Archer
Once an intensity for each PM and MM probe has been obtained, the probe pairs within a probe set (where a probe set represents one gene) must be summarized by some method. This summary is a measure of gene expression.

Page 11

Expression Quantification

• Initially, the Affymetrix signal was calculated as

$\text{AvgDiff} = \frac{1}{|A|} \sum_{j \in A} (PM_j - MM_j)$

where j indexes the probe pairs in probe set A. This is known as the “Average Difference” method.

• Problems:
– Large variability in PM − MM
– MM probes may be measuring signal for another gene/EST
– PM − MM calculations are sometimes negative
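
As a small illustration of the Average Difference method, the sketch below averages the PM minus MM differences over the probe pairs of one probe set. The intensities are made up for illustration, and the second call shows how the result can come out negative when MM exceeds PM.

```python
def average_difference(pm, mm):
    """Mean of PM_j - MM_j over the probe pairs j of one probe set."""
    if len(pm) != len(mm) or not pm:
        raise ValueError("need one MM intensity for every PM intensity")
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)


# Hypothetical intensities for a probe set with four probe pairs.
pm = [120.0, 340.0, 95.0, 410.0]
mm = [100.0, 150.0, 130.0, 200.0]
print(average_difference(pm, mm))                      # 96.25

# MM brighter than PM in every pair gives a negative "expression" value.
print(average_difference([50.0, 60.0], [80.0, 90.0]))  # -30.0
```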

Page 12

Expression Quantification

• The mean of a random variable X is a measure of central location of the density of X.

• The variance of a random variable is a measure of spread or dispersion of the density of X.

• $\mathrm{Var}(X) = E[(X - \mu)^2] = E(X^2) - \mu^2$, where $\mu = E(X)$

• Standard deviation: $\sigma = \sqrt{\mathrm{Var}(X)}$
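
A quick numeric check of these definitions, treating a small made-up list of values as the whole population (each value equally likely). Both forms of the variance, the mean squared deviation and the mean of squares minus the squared mean, should agree, and the standard deviation is the square root of either.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    """E[(X - mu)^2] for equally likely values."""
    mu = mean(xs)
    return mean([(x - mu) ** 2 for x in xs])


def variance_shortcut(xs):
    """E(X^2) - mu^2, the algebraically equivalent form."""
    return mean([x * x for x in xs]) - mean(xs) ** 2


xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # illustrative values only
print(mean(xs))                 # 5.0
print(variance(xs))             # 4.0
print(variance_shortcut(xs))    # 4.0
print(math.sqrt(variance(xs)))  # 2.0
```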

Page 13

Expression Quantification

Illustration: Average Difference.xls

Page 14

Page 15

Sources of Obscuring Variation in Microarray Measurements

• Sample handling (degree of physical manipulation, time from extirpation to freezing)

• Microarray manufacture

• Sample processing (extraction procedure, RNA integrity & purity, RNA labeling)

• Processing differences (hybridization chambers, washing modules, scanners)

• Personnel differences

• Random differences in signal intensity in a data set which covary with the biological process

Page 16

Normalization

• The purpose of normalization is to remove experimental artifacts of no direct interest, that is, the removal of systematic effects other than differential expression. Normalization procedures often include
– background subtraction,
– detection of outliers,
– and removal of variation due to
  • differences in sample preparation,
  • array differences,
  • differences in dye labeling efficiencies,
  • and scanning differences.
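
The slides do not name a specific normalization method. As one simple illustration of removing an array-wide systematic effect, the sketch below rescales every array to a common median intensity. This median scaling is an assumption chosen for brevity, and the intensities are made up.

```python
from statistics import median


def median_scale(arrays):
    """Rescale each array so that all arrays share the same median intensity.

    Assumes every array has a non-zero median.
    """
    medians = [median(a) for a in arrays]
    target = median(medians)  # common reference level
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]


# Three hypothetical replicate arrays with the same expression pattern
# but different overall brightness (e.g. scanner or labeling differences).
arrays = [
    [100.0, 200.0, 300.0],
    [200.0, 400.0, 600.0],
    [50.0, 100.0, 150.0],
]
for normalized in median_scale(arrays):
    print(normalized)  # each row becomes [100.0, 200.0, 300.0]
```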

Page 17

16 Replicate HG-U133A GeneChips, Before Normalization

Page 18

16 Replicate HG-U133A GeneChips, After Normalization

Page 19

Page 20

Taxonomy of Microarray Data Analysis Methods

• Unsupervised Learning: The statistical analysis seeks to find structure in the data without knowledge of class labels.

• Supervised Learning: Class or group labels are known a priori and the goal of the statistical analysis pertains to identifying differentially expressed genes (AKA feature selection) or identifying combinations of genes that are predictive of class or group membership.

Page 21

Unsupervised Learning

• Unsupervised learning or clustering involves the aggregation of samples into groups based on similarity of their respective expression patterns without knowledge of class labels.

• Examples of unsupervised learning methods include
– Hierarchical clustering
– k-means
– k-medoids
– Self-organizing maps
– Principal components
– Multidimensional scaling
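
To make one of these methods concrete, here is a minimal k-means sketch (Lloyd's algorithm) that groups samples by their expression profiles without using class labels. The starting centroids are supplied by the caller to keep the run deterministic, and the two-gene profiles are made up for illustration.

```python
import math


def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm with caller-supplied starting centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(list(centroids[k]))  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Hypothetical expression profiles (two genes) for six samples.
samples = [[1.0, 1.2], [0.8, 1.0], [1.1, 0.9], [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]]
labels, centers = kmeans(samples, centroids=[samples[0], samples[3]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```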

Page 22

Supervised Learning

• Example methods for class comparison / feature selection include
– T-test / Wilcoxon rank sum test
– F-test / Kruskal-Wallis test
– etc.

• Example methods for class prediction include
– Weighted voting
– K nearest neighbors
– Compound covariate predictors
– Classification trees
– Support vector machines
– etc.
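
As a sketch of class comparison, the code below computes a two-sample t statistic for each gene and ranks genes by its absolute value. It uses the unequal-variance (Welch) form, which is one common variant and an assumption here. The gene names and expression values are hypothetical.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Two-sample t statistic allowing unequal variances.

    Assumes at least two values per group and non-zero pooled spread.
    """
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes(expr, labels):
    """Return (gene, t) pairs sorted by |t|, largest first."""
    scores = []
    for gene, values in expr.items():
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores.append((gene, welch_t(group0, group1)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)


# Hypothetical expression values for two genes across six samples.
expr = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],  # differs between the classes
    "geneB": [3.0, 4.0, 5.0, 3.0, 4.0, 5.0],  # same in both classes
}
labels = [0, 0, 0, 1, 1, 1]
for gene, t in rank_genes(expr, labels):
    print(gene, round(t, 3))
```

In practice a p-value and a multiple-testing adjustment would follow; those steps are outside this sketch.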

Page 23

Supervised Learning: Class Prediction

• Risk of over-fitting the data: may have a perfect discriminator for the data set at hand but the same model may perform poorly on independent data sets.

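• Most prediction methods are intended for large ‘n’ (samples), small ‘p’ (covariates) datasets, whereas microarray datasets typically have few samples and thousands of genes (small ‘n’, large ‘p’).

• Process is to
– Fit model
– Check model adequacy
– Make an inference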

Page 24

Class Prediction: Checking Model Adequacy

• Regardless of the algorithm used, once the prediction rule has been defined, it is essential to calculate an unbiased estimate of the true error rate.

Page 25

Class Prediction: Checking Model Adequacy

• In a data-rich situation,
– randomly divide the dataset into two parts, representing a training and a test dataset;
– build the prediction algorithm using the training dataset;
– once a final model has been developed, apply the prediction rule to the test dataset to estimate the misclassification error.
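
The hold-out procedure can be sketched as follows. The classifier here is a nearest-centroid rule, chosen only because it is short; it is an assumption and not a method the slides prescribe. The data are synthetic and well separated, so the test error is expected to be zero.

```python
import math
import random


def train_test_split(samples, labels, test_fraction=0.5, seed=0):
    """Randomly divide the data into a training part and a test part."""
    idx = list(range(len(samples)))
    random.Random(seed).shuffle(idx)
    n_test = int(round(test_fraction * len(idx)))

    def pick(ids):
        return [samples[i] for i in ids], [labels[i] for i in ids]

    return pick(idx[n_test:]), pick(idx[:n_test])


def fit_centroids(samples, labels):
    """Training step: one mean profile per class."""
    centroids = {}
    for lab in set(labels):
        members = [s for s, l in zip(samples, labels) if l == lab]
        centroids[lab] = [sum(col) / len(members) for col in zip(*members)]
    return centroids


def predict(centroids, sample):
    return min(centroids, key=lambda lab: math.dist(sample, centroids[lab]))


def misclassification_error(centroids, samples, labels):
    wrong = sum(predict(centroids, s) != l for s, l in zip(samples, labels))
    return wrong / len(samples)


# Synthetic two-gene profiles: class 0 near (1, 1), class 1 near (5, 5).
samples = [[1 + 0.01 * i, 1 - 0.01 * i] for i in range(20)]
samples += [[5 + 0.01 * i, 5 - 0.01 * i] for i in range(20)]
labels = [0] * 20 + [1] * 20

(train_x, train_y), (test_x, test_y) = train_test_split(samples, labels, seed=1)
model = fit_centroids(train_x, train_y)           # built on training data only
print(misclassification_error(model, test_x, test_y))
```

The key design point is that the test part plays no role in fitting, so its error is an honest estimate for the fitted rule.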

Page 26

Class Prediction: Checking Model Adequacy

• For small sample sizes, withholding a large portion of the data for validation purposes may limit the ability to develop a prediction rule. Therefore, use cross-validation techniques to assess the error.

Page 27

Class Prediction: Checking Model Adequacy

• K-fold cross-validation requires one to randomly split the dataset into K equally sized groups.

• Thereafter, the model is fit to K-1 parts of the data and the generalization error is calculated using the Kth remaining part of the data.

• This procedure is repeated so that the generalization error is estimated for each of the K parts of the data, providing an overall estimate of the generalization error and its associated standard error.

Page 28

Class Prediction: Checking Model Adequacy

Groups: 1 2 3 4 5 6 7 8 9 10

• Leave out data in group 3
• Fit the model to the data in groups 1–2 and 4–10 (learning dataset)
• Calculate the error using observations in group 3 as the test dataset
• Do this for each of the 10 partitions
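
A minimal sketch of 10-fold cross-validation along these lines is shown below. The classifier (nearest class mean on a single made-up variable) and the helper names are assumptions for illustration. The standard error is taken as the standard deviation of the per-fold errors divided by the square root of K, which is a common convention rather than something the slides specify.

```python
import math
import random
from statistics import mean, stdev


def k_fold_indices(n, k, seed=0):
    """Randomly split indices 0..n-1 into k groups whose sizes differ by at most one."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(samples, labels, k, fit, predict, seed=0):
    """Return (mean error, standard error) over k held-out folds; needs k >= 2."""
    errors = []
    for held_out in k_fold_indices(len(samples), k, seed):
        held = set(held_out)
        train = [i for i in range(len(samples)) if i not in held]
        # Fit on the other k-1 groups (learning dataset).
        model = fit([samples[i] for i in train], [labels[i] for i in train])
        # Calculate the error on the group that was left out.
        wrong = sum(predict(model, samples[i]) != labels[i] for i in held_out)
        errors.append(wrong / len(held_out))
    return mean(errors), stdev(errors) / math.sqrt(k)


def fit_class_means(xs, ys):
    return {y: mean(x for x, lab in zip(xs, ys) if lab == y) for y in set(ys)}


def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(x - model[y]))


# Synthetic one-variable data: class 0 near 1, class 1 near 5.
xs = [1.0 + 0.1 * i for i in range(10)] + [5.0 + 0.1 * i for i in range(10)]
ys = [0] * 10 + [1] * 10
err, se = cross_validate(xs, ys, k=10, fit=fit_class_means, predict=predict_nearest_mean)
print(err, se)
```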