introduction to microarray gene expression
Introduction to Microarray Gene Expression. Shyamal D. Peddada, Biostatistics Branch, National Inst. Environmental Health Sciences (NIH), Research Triangle Park, NC. Outline of the four talks. A general overview of microarray data. Some important terminology and background. Various platforms.
![Page 1: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/1.jpg)
Introduction to Microarray Gene Expression
Shyamal D. Peddada
Biostatistics Branch
National Inst. Environmental Health Sciences (NIH)
Research Triangle Park, NC
![Page 2: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/2.jpg)
Outline of the four talks
A general overview of microarray data
– Some important terminology and background
– Various platforms
– Sources of variation
– Normalization of data
Analysis of gene expression data - Nominal explanatory variables
– Two types of explanatory variables
– Scientific questions of interest
– A brief discussion on false discovery rate (FDR) analysis
– Some existing methods of analysis
![Page 3: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/3.jpg)
Outline of the four talks
Analysis of ordered gene expression data
– Common experimental designs
– Some existing statistical methods
– An example
– Demonstration of ORIOGEN
– Some open research problems
Analysis of data from cell-cycle experiments
– Some background on cell-cycle experiments
– Modeling the data
– Data from multiple experiments
– Some open research problems
![Page 4: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/4.jpg)
Talk 1: An overview of microarray data
![Page 5: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/5.jpg)
To perform statistical analysis of any given data
It is important to understand all sources of (i) bias, (ii) variability.
– Some basic understanding of the underlying technology!
– Understand the sampling/experimental design
![Page 6: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/6.jpg)
Some Important Terminology and Background
![Page 7: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/7.jpg)
Central Dogma of Molecular Biology
![Page 8: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/8.jpg)
Some background terminology: DNA and RNA
DNA (Deoxyribonucleic acid) - Contains the genetic code, or instructions, for the development and function of living organisms. It is double stranded.
Four Nucleotides (building blocks of DNA)
– Adenine (A), Guanine (G)
– Thymine (T), Cytosine (C)
Base pairs: (A, T) (G, C)
E.g.
5’ ---AAATGCAT---3’
3’ ---TTTACGTA---5’
![Page 9: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/9.jpg)
Some background terminology: DNA and RNA
RNA (Ribonucleic acid) - transcribed (or copied) from DNA. It is single stranded. (Complementary copy of one of the strands of DNA)
RNA polymerase - An enzyme that helps in the transcription of DNA to form RNA.
Four Nucleotides (building blocks of RNA)
– Adenine (A), Guanine (G)
– Uracil (U), Cytosine (C)
Base pairs: (A, U) (G, C)
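The pairing rules on the DNA and RNA slides can be sketched in a few lines of Python. This is only an illustration: the function names `complement_strand` and `transcribe` are made up for this example, and the sequence is the one from the DNA slide.

```python
# Watson-Crick pairing within DNA, and the DNA-to-RNA pairing used in transcription.
DNA_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}


def complement_strand(strand_5_to_3):
    """Return the paired DNA strand, written 3' to 5' (position by position)."""
    return "".join(DNA_COMPLEMENT[base] for base in strand_5_to_3)


def transcribe(template_3_to_5):
    """Return the RNA (written 5' to 3') copied from a template given 3' to 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)


top = "AAATGCAT"                  # 5' ---AAATGCAT--- 3'
bottom = complement_strand(top)   # 3' ---TTTACGTA--- 5'
rna = transcribe(bottom)          # 5' ---AAAUGCAU--- 3'
print(bottom, rna)
```

The RNA produced from the bottom strand has the same sequence as the top strand with U in place of T, which is what "complementary copy of one of the strands" means in practice.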
![Page 10: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/10.jpg)
Some background terminology: Types of RNA
Types of RNA - (transfer) tRNA, (ribosomal) rRNA, etc.
mRNA - messenger RNA. Carries information from DNA to ribosomes where protein synthesis takes place (less stable than DNA).
![Page 11: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/11.jpg)
Some background terminology: Oligos
Oligonucleotide - a short segment of DNA consisting of a few base pairs. It is commonly called an “Oligo” for short.
“mer” - unit of measurement for an Oligo. It is the number of base pairs. So a 30 base pair Oligo would be a 30-mer.
![Page 12: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/12.jpg)
Some background terminology: Probes
cDNA - complementary DNA. DNA sequence that is complementary to the given mRNA.
– Obtained using an enzyme called reverse transcriptase.
Probes - a short segment of DNA (about 100-mer or longer) used to detect DNA or RNA that complements the sequence present in the probe.
![Page 13: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/13.jpg)
Some background terminology: “Blots” - Origins of Microarrays
Southern blot (Edwin Southern, 1975 J. Molec. Biol.)
– A method used to identify the presence of a DNA sequence in a sample of DNA.
Western blot (immunoblot)
– to identify a specific protein from a tissue extract.
![Page 14: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/14.jpg)
Some background terminology
Southwestern blot
– to identify and characterize DNA-binding proteins.
Northern blot
– A method used to study the gene expression from a sample of mRNA.
![Page 15: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/15.jpg)
Microarrays …
![Page 16: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/16.jpg)
Northern blot Vs Microarray

| | Microarray | Northern blot |
|---|---|---|
| Rate of expression analysis | Thousands of genes at a time (high throughput) | Few genes at a time |
| Automation | Automation possible | Manual |
| Scope | Allows exploring relationships among several hundred genes at the same time | Limited |
![Page 17: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/17.jpg)
What is a Microarray?
Sequences from thousands of different genes are immobilized, or attached, at fixed locations.
Spotted, or actually synthesized directly onto the support.
![Page 18: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/18.jpg)
Microarray Technology
Two color dye array (Spotted array)
– Spotted cDNA microarrays– Spotted oligo microarrays
Single dye array
– In situ oligo microarrays
![Page 19: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/19.jpg)
Microarray Technology
![Page 20: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/20.jpg)
Spotted Microarrays
![Page 21: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/21.jpg)
Spotted DNA Microarray
Spotted DNA array is typically “home made” so you need to think about:
– cDNA or Oligo
– Location of the Oligo in a given gene
– Oligo length - number of bp?
![Page 22: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/22.jpg)
Spotted DNA Microarray
Gene expression:
$Y = \log_2\left(\dfrac{\text{Red}}{\text{Green}}\right)$
– Y < 0; gene is over expressed in green-labeled sample compared to red-labeled sample
– Y = 0; gene is equally expressed in both samples
– Y > 0; gene is over expressed in red-labeled sample compared to green-labeled sample
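The log-ratio and its three-way reading can be sketched as follows. The function names and the intensity values are illustrative assumptions, not part of the slides.

```python
import math


def log_ratio(red, green):
    """Y = log2(Red / Green) for one spot; both intensities must be positive."""
    if red <= 0 or green <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(red / green)


def interpret(y, tol=1e-9):
    """Translate the sign of Y into the reading given on the slide."""
    if y > tol:
        return "over expressed in red-labeled sample"
    if y < -tol:
        return "over expressed in green-labeled sample"
    return "equally expressed in both samples"


for red, green in [(800.0, 200.0), (500.0, 500.0), (200.0, 800.0)]:
    y = log_ratio(red, green)
    print(red, green, y, interpret(y))
```

Working on the log2 scale makes a 4-fold increase (Y = 2) and a 4-fold decrease (Y = -2) symmetric around zero.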
![Page 23: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/23.jpg)
Single Dye Microarrays
![Page 24: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/24.jpg)
Major Commercial Platforms
More than 50 companies are currently offering various DNA microarray platforms, reagents and software
Affymetrix dominated the market for many years
| Manufacturer | Code | Protocol | Platform | # of Probes |
|---|---|---|---|---|
| Applied Biosystems | ABI | One-color microarray | Human Genome Survey Microarray v2.0 | 32878 |
| Affymetrix | AFX | One-color microarray | HG-U133 Plus 2.0 GeneChip | 54675 |
| Agilent* | AG1 | One-color microarray | Whole Human Genome Oligo Microarray, G4112A | 43931 |
| Eppendorf | EPP | One-color microarray | DualChip Microarray | 294 |
| GE Healthcare | GEH | One-color microarray | CodeLink Human Whole Genome, 300026 | 54359 |
| Illumina | ILM | One-color microarray | Human-6 BeadChip, 48K v1.0 | 47293 |

*Agilent has one- and two-color microarray platforms
![Page 25: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/25.jpg)
Affymetrix GeneChip
Each gene is represented by 11 to 20 oligos of 25-mers
Probe: An oligo of 25-mer
Probe Pair: a PM and MM pair
Perfect match (PM): A 25-mer complementary to a reference sequence of interest (part of the gene)
Mismatch (MM): same as PM with a single base change for the middle (13th) base (G <-> C, A <-> T)
Probe set: a collection of probe pairs (11 to 20) related to a fraction of a gene
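The PM/MM construction described above can be sketched directly: the mismatch probe is the perfect-match 25-mer with its 13th base swapped (G <-> C, A <-> T). The function name and the example sequence are hypothetical.

```python
# Base swap used for the middle position of the mismatch probe.
MIDDLE_SWAP = {"A": "T", "T": "A", "G": "C", "C": "G"}


def mismatch_probe(perfect_match):
    """Return the MM probe for a 25-mer PM probe (13th base swapped)."""
    if len(perfect_match) != 25:
        raise ValueError("expected a 25-mer")
    middle = 12  # 0-based index of the 13th base
    swapped = MIDDLE_SWAP[perfect_match[middle]]
    return perfect_match[:middle] + swapped + perfect_match[middle + 1:]


pm = "ACGTACGTACGTACGTACGTACGTA"   # made-up 25-mer, not a real probe
mm = mismatch_probe(pm)
probe_pair = (pm, mm)
print(probe_pair)
```

A probe set would then be a list of 11 to 20 such pairs taken from different positions of the same gene.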
![Page 26: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/26.jpg)
Affymetrix call for the presence of a signal
Affymetrix detection algorithm uses probe pair intensities to obtain detection p-value
Using this p-value they decide whether the signal is
– “present”, “marginal” or “absent”
![Page 27: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/27.jpg)
Affy call
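Detection p-value
– Calculate the discrimination score R for each probe pair
R = (PM − MM) / (PM + MM)
– Determine the statistical significance of the gene by computing the p-value: the scores are compared against a small threshold τ (tau) using a one-sided Wilcoxon signed-rank test.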
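A minimal sketch of the two ends of this procedure: the per-probe-pair score (PM − MM)/(PM + MM), and the mapping from a detection p-value to a call. The cut-offs `alpha1=0.04` and `alpha2=0.06` are assumed defaults that should be checked against the Affymetrix manual, the intensities are made up, and the significance test that turns scores into a p-value is deliberately left out.

```python
def discrimination_scores(pm, mm):
    """Score (PM - MM) / (PM + MM) for each probe pair of a probe set."""
    return [(p - m) / (p + m) for p, m in zip(pm, mm)]


def detection_call(p_value, alpha1=0.04, alpha2=0.06):
    """Map a detection p-value to a call; alpha1 and alpha2 are assumed cut-offs."""
    if p_value < alpha1:
        return "present"
    if p_value < alpha2:
        return "marginal"
    return "absent"


pm = [300.0, 120.0, 900.0]   # hypothetical PM intensities
mm = [100.0, 120.0, 100.0]   # hypothetical MM intensities
print(discrimination_scores(pm, mm))
print(detection_call(0.01), detection_call(0.05), detection_call(0.30))
```

Scores near 1 mean the perfect-match probe is much brighter than its mismatch partner; scores near 0 mean the pair cannot be told apart.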
![Page 28: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/28.jpg)
Affy call
Ref: Affymetrix Technical Manual
![Page 29: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/29.jpg)
Affymetrix Vs Illumina
Ref: Pan Du & Simon Lin
![Page 30: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/30.jpg)
Microarray Data Analysis
![Page 31: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/31.jpg)
Why Normalize Data?
To “calibrate”/adjust data so as to reduce or eliminate the effects arising from variation in technology and other sources rather than due to true biological differences between test groups.
![Page 32: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/32.jpg)
Sources of bias/variation
Tissue or cell lines
mRNA
– It can degrade over time - so there is a potential batch effect if portions of experiment are performed at different times
– Purity and quantity
Dye color effect (spotted arrays)
Variation due to technology - is substantially reduced with improved technology
Etc.
![Page 33: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/33.jpg)
A useful graphical representation of data
Data matrix: $X_{m \times n}$, with $\operatorname{Rank}(X) = r \le \min(m, n) \le n$,
where $m$ = # genes and $n$ = # samples.
Let $S$ be the $m \times m$ sample covariance matrix.
![Page 34: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/34.jpg)
A useful graphical representation of data
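Let its spectral decomposition be given by
$$S = \Gamma \Lambda \Gamma'$$
where
$\Gamma$: $m \times r$ matrix of eigenvectors
$\Lambda$: $r \times r$ diagonal matrix of non-zero eigenvalues
$\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_r > 0$.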
![Page 35: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/35.jpg)
A useful graphical representation of data
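Then
$Z = \Gamma' X$: $r \times n$ matrix of "eigengenes", where $\Gamma$ is the matrix of eigenvectors of $S$.
$Z_i = \gamma_i' X$: the $i$th eigengene, where $\gamma_i$ is the $i$th column of $\Gamma$.
Plot $Z_1$ vs $Z_2$.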
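The construction above can be sketched without any numerical library. This is a toy illustration under stated assumptions: the leading eigenvectors are found by power iteration with deflation (a stand-in for a proper eigen-solver, which is what one would use on real data), the function names are invented, and the 3-gene by 4-sample matrix is made up so that the answer is easy to check by hand.

```python
import math


def sample_covariance(X):
    """m x m covariance between genes (rows of X); samples are the columns."""
    n = len(X[0])
    centered = []
    for row in X:
        mean = sum(row) / n
        centered.append([x - mean for x in row])
    return [[sum(a * b for a, b in zip(ri, rj)) / (n - 1) for rj in centered]
            for ri in centered]


def leading_eigenpairs(S, k, iterations=500):
    """Top-k (eigenvalue, eigenvector) pairs of a symmetric PSD matrix."""
    m = len(S)
    work = [row[:] for row in S]
    pairs = []
    for _ in range(k):
        v = [float(i + 1) for i in range(m)]      # arbitrary non-zero start
        for _ in range(iterations):
            w = [sum(work[i][j] * v[j] for j in range(m)) for i in range(m)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(work[i][j] * v[j] for j in range(m)) for i in range(m))
        pairs.append((lam, v))
        for i in range(m):                        # deflate: remove this component
            for j in range(m):
                work[i][j] -= lam * v[i] * v[j]
    return pairs


def eigengenes(X, k=2):
    """Rows Z_i = gamma_i' X for the k leading eigenvectors of S."""
    pairs = leading_eigenpairs(sample_covariance(X), k)
    return [[sum(v[g] * X[g][s] for g in range(len(X))) for s in range(len(X[0]))]
            for _, v in pairs]


X = [[13.0, 7.0, 13.0, 7.0],    # gene 1 across 4 samples
     [10.0, 10.0, 6.0, 6.0],    # gene 2
     [6.0, 4.0, 4.0, 6.0]]      # gene 3
Z = eigengenes(X)
print(list(zip(Z[0], Z[1])))    # one (Z1, Z2) point per sample, ready to plot
```

Each sample becomes one point in the Z1 versus Z2 plane, so samples that cluster by batch or dye rather than by treatment stand out visually before any normalization is chosen.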
![Page 36: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/36.jpg)
Common Normalization Methods
Internal Control Normalization
Global Normalization
Linear Normalization (Spotted arrays)
Non-linear Normalization Method (Spotted arrays) - LOWESS curve.
ANOVA
COMBAT (for batch effect)
![Page 37: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/37.jpg)
Internal control normalization (Housekeeping gene(s))
Expression of each gene is measured relative to the average of housekeeping genes.
– Basic assumption: Expression of housekeeping genes does not change.
Disadvantage:
– Housekeeping genes may sometimes be highly expressed. Unexpected regulation of housekeeping gene(s) leads to misinterpretation.
![Page 38: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/38.jpg)
Global Normalization
Basic assumption
– Mean/Median expression ratio of all monitored mRNAs is constant across a chip.
Regression of $\log(R/G)$ on a constant
In simple terms the log ratios are corrected by a common “mean” or “median”
This method can also be applied to single-dye data
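Regressing the log ratios on a constant amounts to subtracting one common value from all of them. A minimal sketch, assuming base-2 logs and the median as the common value (the mean would be the least-squares choice); the function name and intensities are illustrative.

```python
import math
from statistics import median


def global_normalize(red, green):
    """Subtract the chip-wide median from every log2(R/G)."""
    log_ratios = [math.log2(r / g) for r, g in zip(red, green)]
    shift = median(log_ratios)
    return [m - shift for m in log_ratios]


red = [200.0, 400.0, 800.0]      # hypothetical red intensities
green = [100.0, 100.0, 100.0]    # hypothetical green intensities
print(global_normalize(red, green))
```

After this step the median log ratio on the chip is zero, which is exactly the "constant across a chip" assumption stated above.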
![Page 39: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/39.jpg)
Linear Normalization (for spotted arrays)
Basic assumption
– Mean/Median expression ratio of all monitored mRNAs depends upon the average intensity
Regression of $\log(R/G)$ on $\tfrac{1}{2}\log(RG)$
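The regression described here can be sketched with ordinary least squares: fit log(R/G) against (1/2) log(RG) and keep the residuals as the normalized log ratios. Base-2 logs, the function names and the intensity values are assumptions made for the example.

```python
import math


def ma_values(red, green):
    """M = log2(R/G) and A = (1/2) log2(R*G) for each spot."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a


def linear_normalize(red, green):
    """Residuals of the least-squares line M = intercept + slope * A."""
    m, a = ma_values(red, green)
    n = len(m)
    a_bar = sum(a) / n
    m_bar = sum(m) / n
    sxx = sum((x - a_bar) ** 2 for x in a)
    sxy = sum((x - a_bar) * (y - m_bar) for x, y in zip(a, m))
    slope = sxy / sxx
    intercept = m_bar - slope * a_bar
    return [y - (intercept + slope * x) for x, y in zip(a, m)]


red = [120.0, 400.0, 1500.0, 5200.0]     # hypothetical intensities
green = [100.0, 300.0, 1000.0, 3000.0]
print(linear_normalize(red, green))
```

The residuals average to zero and are uncorrelated with the average intensity, so any straight-line dye bias has been removed.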
![Page 40: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/40.jpg)
Non-Linear Normalization (for spotted arrays)
Basic assumption
– Mean/Median expression ratio of all monitored mRNAs depends upon the average intensity
Regression of $\log(R/G)$ on $C(\log(RG))$
where $C(\log(RG))$ is estimated by the robust scatter plot smoother LOWESS (LOcally WEighted Scatterplot Smoothing)
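The idea of estimating the curve C by local fitting can be sketched as below. This is a simplified stand-in, not full LOWESS: it does tricube-weighted local linear fits but omits the robustness re-weighting iterations that make LOWESS resistant to outliers. The function names, the `frac` parameter and base-2 logs are assumptions for the example.

```python
import math


def tricube(u):
    """Tricube kernel weight for a scaled distance u >= 0."""
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0


def local_linear_smooth(x, y, frac=0.5):
    """Fitted value at each x from a tricube-weighted local straight line."""
    n = len(x)
    k = max(2, math.ceil(frac * n))          # neighbours used in each local fit
    fitted = []
    for x0 in x:
        h = sorted(abs(xi - x0) for xi in x)[k - 1] or 1e-12   # bandwidth
        w = [tricube(abs(xi - x0) / h) for xi in x]
        sw = sum(w)
        xb = sum(wi * xi for wi, xi in zip(w, x)) / sw
        yb = sum(wi * yi for wi, yi in zip(w, y)) / sw
        sxx = sum(wi * (xi - xb) ** 2 for wi, xi in zip(w, x))
        sxy = sum(wi * (xi - xb) * (yi - yb) for wi, xi, yi in zip(w, x, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(yb + slope * (x0 - xb))
    return fitted


def smooth_normalize(red, green, frac=0.5):
    """log2(R/G) minus the smoothed trend in log2(R*G)."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [math.log2(r * g) for r, g in zip(red, green)]
    trend = local_linear_smooth(a, m, frac)
    return [mi - ti for mi, ti in zip(m, trend)]


red = [120.0, 260.0, 400.0, 900.0, 1500.0, 2800.0, 5200.0, 9000.0]
green = [100.0, 200.0, 300.0, 600.0, 1000.0, 1700.0, 3000.0, 5000.0]
print(smooth_normalize(red, green))
```

Unlike the straight-line fit, the local fit can follow a curved intensity-dependent dye bias; `frac` controls how much of the data each local fit uses and therefore how smooth the estimated curve is.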
![Page 41: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/41.jpg)
Analysis of Variance (ANOVA)
Standard Analysis of Variance model
– Response variable - Gene expression
– Explanatory variables:
  – Dye color
  – Batch
  – Other potential effects?
Advantage: Statistically significant genes can be identified while controlling for the various experimental conditions/factors.
![Page 42: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/42.jpg)
Some important experimental designs
Pooled Samples versus Separate samples
– Sometimes there may not be sufficient biological sample/specimen from a given animal. In such cases biological samples are pooled from several identical animals to form a sample.
![Page 43: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/43.jpg)
An example of a pooling design (for each treatment group)
[Diagram columns: Subjects, Pool, Observations (microarray chips)]
![Page 44: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/44.jpg)
The pooling design

| | Subjects | Pools | Observations (microarray chips) |
|---|---|---|---|
| Example | 9 | 3 (3 per pool) | 6 |
| More generally | n | p (r = n/p per pool) | m |
![Page 45: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/45.jpg)
The standard design

| | Subjects | # Pools | Observations (microarray chips) |
|---|---|---|---|
| Example | 9 | 9 (r = 1) | 9 |
| More generally | n | p = n (r = 1) | m = n |
![Page 46: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/46.jpg)
Some issues
• What are the underlying parameters?
• Effect of pooling on power.
• The basic assumption. Validity of the assumption.
![Page 47: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/47.jpg)
Parameters
• Total variation in the expression of a gene can be decomposed into:
– Biological variation
– Technical variation
• Biological samples (n)
• Number of pools (p)
• Biological samples per pool (r=n/p)
• Observed number of samples (e.g. microarrays) (m)
![Page 48: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/48.jpg)
Some comments about pooling
Variance of the estimated mean expression of a gene depends on:
– number of pools (p)
– number of bio samples per pool (r)
– number of arrays (m)
– biological variation
– technical variation
Pooling works well when the biological variation in the gene expression is substantially larger than the technical variation.
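One common way to write this variance, which is an assumption for illustration rather than a formula taken from the slides, is Var = σ_b²/(p·r) + σ_t²/m. It assumes that a pool's expression is the average of its r members, that pools are of equal size, that arrays are spread evenly over pools, and that biological and technical errors are independent. The variance components below are made up, with biological variation 4 times the technical variation.

```python
def variance_of_mean(p, r, m, sigma2_bio, sigma2_tech):
    """Var of the estimated mean: sigma2_bio / (p * r) + sigma2_tech / m."""
    return sigma2_bio / (p * r) + sigma2_tech / m


sigma2_bio, sigma2_tech = 4.0, 1.0   # hypothetical variance components

# Three arrays either way: 9 subjects pooled 3 per array vs 3 subjects, 1 per array.
pooled = variance_of_mean(p=3, r=3, m=3, sigma2_bio=sigma2_bio, sigma2_tech=sigma2_tech)
standard = variance_of_mean(p=3, r=1, m=3, sigma2_bio=sigma2_bio, sigma2_tech=sigma2_tech)
print(pooled, standard)
```

With the number of arrays held fixed, pooling shrinks only the biological term, so the gain is large when biological variation dominates and negligible when technical variation does.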
![Page 49: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/49.jpg)
Power comparisons

| # Bio | # Micro | Pool size | Power |
|---|---|---|---|
| 5/group | 5/group | 1 (Standard design) | 0.81 |
| 6/group | 6/group | 1 (Standard design) | 0.95 |
| 6/group | 3/group | 2 (i.e. 3 pools/group) | 0.30 |
| 8/group | 4/group | 2 (i.e. 4 pools/group) | 0.80 |
| 10/group | 5/group | 2 (i.e. 5 pools/group) | 0.98 |

- Zhang and Gant (2005)
![Page 50: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/50.jpg)
Power comparisons
Conditions of the simulation study:
Biological variation is 4 times the technical variation.
False positive rate is 0.001.
Detect 2-fold expression.
Data are normally distributed.
![Page 51: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/51.jpg)
A fundamental assumption
Biological averaging:
Suppose an experiment consists of pooling “r” samples. Then the expression of a gene in the pooled sample is assumed to be the average of the gene’s expression in the “r” samples.
This assumption need not be true, especially if the expression values are transformed non-linearly.
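A small numerical illustration of why a non-linear transformation breaks biological averaging: if pooling averages the raw expression values, the log of the pooled value is not the average of the individual logs. The three expression values are made up.

```python
import math

raw = [100.0, 400.0, 1600.0]          # hypothetical raw expression in r = 3 samples

pooled_raw = sum(raw) / len(raw)      # what physical pooling is assumed to measure
log_of_pool = math.log2(pooled_raw)   # log scale value observed on the pooled array
mean_of_logs = sum(math.log2(x) for x in raw) / len(raw)   # average on the log scale

print(log_of_pool, mean_of_logs)
```

Because the logarithm is concave, the log of the average is never smaller than the average of the logs, and the gap grows with the spread of the individual values.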
![Page 52: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/52.jpg)
Some important experimental designs
Reference designs (Spotted array)
– Each treatment sample is hybridized against a common reference control.
Loop designs (Spotted array)
– Suppose we have a control and three experimental groups A, B and C. Then hybridize Control and A, A with B, B with C and C with A.
![Page 53: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/53.jpg)
Data Analysis - Preliminaries
Normalization
Transformation of data (usual methods)
– Perhaps first fit ANOVA and plot the residuals
– Log transformation
– Square root
– More generally, Box-Cox family of transformations
Identify potential outliers in the data (again, perhaps use the residuals)
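The Box-Cox family mentioned above can be written as (x^λ − 1)/λ for λ ≠ 0 and log(x) for λ = 0, so the log transformation is the λ = 0 member and the square root corresponds to λ = 0.5 up to a shift and scale. A minimal sketch with an illustrative function name:

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if x <= 0:
        raise ValueError("Box-Cox requires positive values")
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam


values = [1.0, 4.0, 16.0]
print([box_cox(v, 0.0) for v in values])   # natural log
print([box_cox(v, 0.5) for v in values])   # 2 * (sqrt(x) - 1)
print([box_cox(v, 1.0) for v in values])   # x - 1, i.e. no real change
```

In practice λ is chosen by looking at the residuals, for example after the ANOVA fit suggested above, and picking the value that makes them look most symmetric with stable variance.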
![Page 54: Introduction to Microarray Gene Expression](https://reader030.vdocuments.mx/reader030/viewer/2022033021/56814bee550346895db8d115/html5/thumbnails/54.jpg)
Data Analysis
Method of Analysis depends upon the scientific question of interest.
In the next three lectures we describe several general methods and illustrate some using real data!