lecturew2_intro_array

Friday (1/15) computer lab session: Location: 3073 (3rd floor), Department of Computational Biology, BST3, 3501 Fifth Avenue. Time: 9:30-10:45 AM. Play with R (tutorial) at home before the lab session.

Upload: many87

Post on 11-May-2015


TRANSCRIPT

Page 1: LectureW2_Intro_Array

Friday (1/15) computer lab session:

Location: 3073 (3rd floor), Department of Computational Biology, BST3, 3501 Fifth Avenue.

Time: 9:30-10:45 AM

Play with R (tutorial) at home before the lab session.

Page 2: LectureW2_Intro_Array

Agenda

• Introduction to microarray
– Motivation & previous techniques
  • Concept of biological pathway
  • Northern blot, RT-PCR and real-time RT-PCR
– Affymetrix microarray experiment
– cDNA microarray experiment
– Comparison of the two
– Codelink, Illumina & Agilent
– MAQC (Microarray Quality Control) Project

• Introduction to next generation sequencing (RNA-seq, ChIP-seq etc.)

Page 3: LectureW2_Intro_Array

Review

Page 4: LectureW2_Intro_Array

The central dogma of molecular biology:

DNA --(transcription)--> RNA --(translation)--> Protein

[Figure: DNA is transcribed into mRNA (messenger), rRNA (ribosomal) and tRNA (transfer); mRNA is translated into protein at the ribosome.]

Microarray is a technology to detect mRNA expression levels globally (thousands of genes simultaneously).

Page 5: LectureW2_Intro_Array

Why detect expression level of protein or mRNA?

Page 6: LectureW2_Intro_Array

Cell cycle

Cancer cells are malignant cells that do not die but reproduce rapidly instead.

It is important to repair problematic mutations during cell division.

Page 7: LectureW2_Intro_Array

Example 1: p53 Pathway (an important tumor suppressor)

Cancer cells are malignant cells that do not die but reproduce rapidly instead.

(DNA damaged)

http://breast-cancer-research.com/content/pdf/bcr426.pdf

Page 8: LectureW2_Intro_Array

Example 2: KRas Pathway (an oncogene)

(Legend: upregulation; downregulation)

Normal cell

-- p53 properly suppresses cell replication.
-- Ras genes properly activate cell replication.

Cancerous cell

-- p53 doesn’t suppress cell replication.
-- Ras genes are overexpressed. Cells replicate excessively.

From http://www.icnet.uk/axp/mphh/biomed/lemoine.html

Page 9: LectureW2_Intro_Array

Prediction of a disease:

If the mechanism is known, detecting expression levels can help identify cancer patients (e.g. unusual p53 or KRas expression activity).

Exploratory:

In general, microarray can help identify candidate genes that contribute to tumor progression and propose hypotheses about the underlying genetic network.

Why detect expression level of protein or mRNA?

Page 10: LectureW2_Intro_Array

http://www.escience.ws/b572/L13/north.html

Northern Blot (an old technique for measuring mRNA expression)

mRNA is extracted and purified.

mRNA is loaded for electrophoresis.

Lane 1: size standards. Lane 2: RNA to be tested.

The gel is charged and the RNA “swims” through the gel, from the negative (-) end toward the positive (+) end, separating by size.

The mRNA is transferred from the gel to a membrane.

A labelled probe specific for the RNA fragment is incubated with the blot, so the RNA of interest can be detected.

See next page for the details of this step.

Page 11: LectureW2_Intro_Array

http://www.escience.ws/b572/L13/northupclose.html

Northern blot close-up (color staining)

In this simplified cartoon, two mRNAs (A and B) are bound on the membrane.

Labeled DNA complementary to A is prepared and hybridized to all the mRNA on the membrane.

The labeled complementary DNA will bind to A but not B.

After washing and detection, the abundance of the target mRNA can be seen.

Page 12: LectureW2_Intro_Array

See animation of RT-PCR: http://www.bio.davidson.edu/courses/Immunology/Flash/RT_PCR.html

RT-PCR (reverse transcription-polymerase chain reaction)

http://www.ambion.com/techlib/basics/rtpcr/

1. RNA is reverse transcribed to DNA.
2. PCR can be used to amplify the DNA at an exponential rate.
3. Gel quantification of the amplified product.

---- a semi-quantitative method. A smaller amount of sample is needed.

Real-time RT-PCR

1. The PCR amplification can be monitored by fluorescence in “real time”.
2. The fluorescence values recorded in each cycle represent the amount of amplified product.

---- a quantitative method. Currently the most advanced and accurate analysis of mRNA abundance. Often used to validate microarray results.
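
The exponential amplification behind RT-PCR and real-time RT-PCR can be sketched numerically. This is only an illustration, assuming an idealized PCR in which every cycle multiplies the product by (1 + efficiency); the function names, the efficiency parameter and the detection threshold are assumptions made for this example and do not come from the lecture.

```python
import math

def copies_after(n0, cycles, efficiency=1.0):
    """Idealized PCR: each cycle multiplies the product by (1 + efficiency)."""
    return n0 * (1.0 + efficiency) ** cycles

def threshold_cycle(n0, threshold, efficiency=1.0):
    """First cycle at which the amplified product reaches the threshold."""
    cycles = math.log(threshold / n0) / math.log(1.0 + efficiency)
    return max(0, math.ceil(cycles))

# Two hypothetical samples; the second starts with 8x more template.
low = threshold_cycle(n0=100, threshold=1e9)
high = threshold_cycle(n0=800, threshold=1e9)
print(low, high, low - high)
```

Under perfect doubling, an 8-fold difference in starting template shifts the threshold cycle by 3 cycles, which is why the cycle at which fluorescence crosses a threshold can serve as a quantitative readout.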

Page 13: LectureW2_Intro_Array

Limitation of the old techniques

1. Labor intensive

2. Can only detect up to dozens of genes. (gene-by-gene analysis)

3. The target sequences must be known. For RT-PCR, at least the primer sequences must be known to start the PCR.

Page 14: LectureW2_Intro_Array

Various microarrays

A new view on genomic level

Page 15: LectureW2_Intro_Array

Affymetrix GeneChip

Page 16: LectureW2_Intro_Array

from Affymetrix Inc.

Overview of the Affymetrix GeneChip technology

Page 17: LectureW2_Intro_Array

From experiments to analysis

Page 18: LectureW2_Intro_Array

Details of labeling and hybridization

[Figure: RNA --(reverse transcriptase)--> DNA --(RNA polymerase)--> RNA. Example sequences shown: TACGTATTGCAAAA, TTTTGCAATACGTA, TACGTATTGCAAAA. Biotin label at C and T.]
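
The two example sequences on this slide are reverse complements of each other, which is what allows a labeled strand to hybridize to its target. A minimal sketch of that relationship follows; the function name is an assumption, and DNA letters are used throughout, as on the slide.

```python
# Watson-Crick pairing for DNA letters.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Complement every base and read the result in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

target = "TACGTATTGCAAAA"
print(reverse_complement(target))  # TTTTGCAATACGTA
```

Applying the function twice returns the original sequence, which matches the third sequence in the figure.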

Page 19: LectureW2_Intro_Array

Notes

• Only the pyrimidines (C and T) are biotin-labeled. This is where the color intensities come from.

• The fragmentation makes the biotin-labeled cRNA shorter and improves hybridization efficiency.

• The sequence of the target mRNA must be known so that the complementary sequence can be prepared on the array.

Page 20: LectureW2_Intro_Array

25-mer unique oligo

mismatch at the middle nucleotide

multiple probes (11~16) for each gene

from Affymetrix Inc.

Array Design
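
The perfect-match/mismatch probe pair can be sketched as follows. The slide only says that the mismatch sits at the middle nucleotide; replacing that base with its complement is an assumption made here for illustration, and the 25-mer below is a made-up sequence, not a real probe.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def mismatch_probe(perfect_match):
    """Return a copy of the probe whose middle base is swapped (assumed: to its complement)."""
    middle = len(perfect_match) // 2  # index 12 of a 25-mer, i.e. the 13th base
    swapped = COMPLEMENT[perfect_match[middle]]
    return perfect_match[:middle] + swapped + perfect_match[middle + 1:]

pm = "ATGCGTACCTGAAGTCCATGGCTAA"  # hypothetical 25-mer
mm = mismatch_probe(pm)
print(pm)
print(mm)
```

The pair differs at exactly one position, so the mismatch probe can serve as a probe-specific estimate of non-specific binding.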

Page 21: LectureW2_Intro_Array

                                                         

   

from Affymetrix Inc.

Needs at most 4 × 25 = 100 masking and coupling steps (4 bases × 25 positions).

Technology adapted from semiconductor industry.(photolithography and combinatorial chemistry)

Array Manufacturing
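
The bound of 4 × 25 = 100 masking and coupling steps can be illustrated with a small count. This sketch assumes a simplified model of the synthesis in which probes grow one position (layer) at a time and each layer needs one mask per base that at least one probe requires at that position; the function names and the toy probes are assumptions.

```python
BASES = "ACGT"

def masking_steps(probes):
    """Count mask/coupling steps when bases are added layer by layer."""
    length = len(probes[0])
    assert all(len(p) == length for p in probes)
    steps = 0
    for position in range(length):
        needed = {p[position] for p in probes}
        # One mask per base that at least one probe needs at this layer.
        steps += sum(1 for base in BASES if base in needed)
    return steps

def upper_bound(length):
    """Worst case: all four bases are needed at every position."""
    return len(BASES) * length

toy_probes = ["ACGT", "ACGA", "TCGA"]
print(masking_steps(toy_probes), upper_bound(4), upper_bound(25))
```

However many probes are placed on the array, the count never exceeds four steps per position in this model, which is where the limit of 100 for 25-mers comes from.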

Page 22: LectureW2_Intro_Array

(columns: HG-U95 | HG-U133 Set | HG-U133 Plus 2.0 Array)

Sequence source: Build 95 UniGene database (Oct 2, 1999??) | Build 133 UniGene database (April 20, 2001) | Build 133 UniGene database (April 20, 2001)

Probe uniqueness: 21/25 bases | Two 8-mers including at least one 12-mer | Two 8-mers including at least one 12-mer

# of probes: ~16 | 11 | 11

# of arrays: 5 | 2 | 1

# of transcripts: ~54000 genes (HG-U95Av2: ~12000; HG-U95B-E: ~44000 EST) | ~33,000 genes | ~38500 genes

Feature size: 20 µm | 18 µm | 11 µm

Chip Advances

Page 23: LectureW2_Intro_Array

A few years ago, the U95 set had 5 arrays. Normally only U95Av2 was used.

Improved probe selection algorithm to avoid non-specific binding. Decreased # of probes in each probe set (20 => 11).

Smaller feature size: 20 µm => 11 µm

More genes on each array and lower cost (only one array for HG-U133 Plus)

Chip Advances

Page 24: LectureW2_Intro_Array

Background adjustment => Normalization => Summarization

Give an expression measure for each probe set on each array.

The result will greatly affect subsequent analysis (e.g. clustering and classification). If it is not modeled properly:

=> “Garbage in, garbage out”

Array Probe Level Analysis

Details will be discussed in the next lecture.
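
The three stages can be sketched end to end with deliberately simple stand-ins. None of the choices below (minimum-intensity background, median scaling, median of log2 intensities) is taken from the lecture or from any particular published method; they are assumptions chosen only to show how data flow from probe intensities to one value per probe set per array. All intensities and gene names are made up.

```python
import math
import statistics

def background_adjust(intensities, floor=1.0):
    """Subtract the lowest intensity on the array (a crude background estimate)."""
    background = min(intensities)
    return [max(x - background, floor) for x in intensities]

def normalize(arrays):
    """Scale every array so that all arrays share the same median."""
    medians = [statistics.median(a) for a in arrays]
    target = statistics.mean(medians)
    return [[x * target / m for x in a] for a, m in zip(arrays, medians)]

def summarize(array, probe_sets):
    """One expression value per probe set: median of log2 probe intensities."""
    return {name: statistics.median(math.log2(array[i]) for i in idx)
            for name, idx in probe_sets.items()}

# Two hypothetical arrays; the second was scanned about twice as bright.
raw_arrays = [[120, 200, 180, 900, 950, 1000],
              [240, 400, 360, 1800, 1900, 2000]]
probe_sets = {"geneA": [0, 1, 2], "geneB": [3, 4, 5]}

adjusted = [background_adjust(a) for a in raw_arrays]
normalized = normalize(adjusted)
expression = [summarize(a, probe_sets) for a in normalized]
print(expression)
```

After normalization the two arrays have the same median, so the remaining differences between probe sets reflect the probes rather than the overall brightness of each scan.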

Page 25: LectureW2_Intro_Array

Spotted cDNA microarray

Page 26: LectureW2_Intro_Array

From experiments to analysis

Page 27: LectureW2_Intro_Array

1. 48 grids in a 12x4 pattern.

2. Each grid has 12x16 features (spots).

3. Total 9216 features (spots).

4. Each pin prints 3 grids.

Probe (array) printing
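
The numbers on this slide can be checked with a few lines of arithmetic. The pin count is not stated on the slide; it is inferred here from 48 grids at 3 grids per pin, so treat it as an assumption.

```python
grids = 12 * 4                  # 48 grids in a 12x4 pattern
features_per_grid = 12 * 16     # spots in each grid
total_features = grids * features_per_grid

grids_per_pin = 3
pins = grids // grids_per_pin   # inferred, not stated on the slide

print(total_features, pins)
```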

Page 28: LectureW2_Intro_Array

Probe design and printing

Page 29: LectureW2_Intro_Array

From Y. Chen et al. 1997

The experiment

Page 30: LectureW2_Intro_Array

From: http://www.techfak.uni-bielefeld.de/ags/ai/projects/microarray/

An image example

Image analysis is more difficult than for Affymetrix arrays: the probes are spotted by a robot rather than synthesized in place, so the exact physical location of each spot is not known.

Page 31: LectureW2_Intro_Array

(columns: cDNA | GeneChip)

Probe preparation: Probes are cDNA fragments, usually amplified by PCR and spotted by robot. | Probes are short oligos synthesized using a photolithographic approach.

Colors: Two-color (measures relative intensity) | One-color (measures absolute intensity)

Gene representation: One probe per gene | 11-16 probe pairs per gene

Probe length: Long, varying lengths (hundreds to 1K bp) | 25-mers

Density: Maximum of ~15000 probes | 38500 genes * 11 probes = 423500 probes

Comparison of cDNA array and GeneChip

Page 32: LectureW2_Intro_Array

Affymetrix GeneChip: one-color design

cDNA microarray: two-color design

Why the difference?

Page 33: LectureW2_Intro_Array

Affymetrix GeneChip: photolithography

(The amount of oligos on a probe is well controlled)

cDNA microarray: robotic spotting

(The amount of cDNA spotted on a probe may vary greatly)
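
Why a two-color design suits spotted arrays can be seen in a toy model. The sketch assumes that the signal in each channel is simply proportional to both the expression level and the amount of cDNA on the spot; this proportionality, the function names and the numbers are assumptions for illustration only.

```python
import math

def spot_signal(expression, cdna_amount):
    """Toy model: signal is proportional to expression and to spotted cDNA."""
    return expression * cdna_amount

def log_ratio(expr_sample, expr_reference, cdna_amount):
    """log2 ratio of the two channels measured on the same spot."""
    red = spot_signal(expr_sample, cdna_amount)
    green = spot_signal(expr_reference, cdna_amount)
    return math.log2(red / green)

# The same gene on three spots that received different amounts of cDNA.
for amount in (0.5, 1.0, 3.0):
    print(amount, spot_signal(200.0, amount), log_ratio(200.0, 100.0, amount))
```

The single-channel signal changes with the spotted amount, while the ratio of the two channels on the same spot does not, so the relative measurement is robust to spotting variation in this model.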

Page 34: LectureW2_Intro_Array

Advantages and disadvantages of cDNA array and GeneChip

(columns: cDNA microarray | Affymetrix GeneChip)

The data can be noisy and of variable quality. | Specific and sensitive. Results are very reproducible.

Cross (non-specific) hybridization can often happen. | Hybridization is more specific.

May need an RNA amplification procedure. | Can use a small amount of RNA.

Image analysis is more difficult. | Image analysis and intensity extraction are easier.

Need to search the database for gene annotation. | More widely used. Better quality of gene annotation.

Cheap (both initial cost and per-slide cost). | Expensive (~$400 per array + labeling and hybridization).

Can be custom made for special species. | Only several popular species are available.

Do not need to know the exact DNA sequence. | Need the DNA sequence for probe selection.

Page 35: LectureW2_Intro_Array

Other platforms of microarray

• GE Codelink (no longer on the market)

• Illumina

• Agilent

Page 36: LectureW2_Intro_Array

Codelink

Page 37: LectureW2_Intro_Array

Fig. End-point attachment orients the DNA while the polymeric coating holds it away from the surface of the slide, making the DNA readily available for hybridization.

Codelink’s gel matrix

Page 38: LectureW2_Intro_Array

(columns: cDNA | GeneChip | Codelink | Agilent)

Probe preparation: Probes are cDNA fragments, usually amplified by PCR and spotted by robot. | Probes are short oligos synthesized using a photolithographic approach. | 3-D aqueous gel matrix | Probes are printed by inkjet technology from HP.

Colors: Two-color (measures relative intensity) | One-color (measures absolute intensity) | One-color | One- or two-color

Gene representation: One probe per gene | 11-16 probe pairs per gene | One probe per gene | One probe per gene

Probe length: Long, varying lengths (hundreds to 1K bp) | 25-mers | 30-mers | 60-mers

Density: Maximum of ~15000 probes | 38500 genes * 11 probes = 423500 | ~57000 | ~22000 probes

Manufacturer: Stanford and many labs | Affymetrix | GE | Agilent

Comparisons

Page 39: LectureW2_Intro_Array

Mechanisms in microarray

Important mechanisms that make microarray work:

1. Reverse transcription: mRNA => cDNA. This is usually also the step at which the dyes are attached.

(Protein cannot be reverse translated to mRNA or to another form, so it is difficult to label with dyes.)

2. Double-strand binding of complementary DNA sequences.

(Protein does not have such a convenient property; there are 20 amino acids and no complementary binding.)

Page 40: LectureW2_Intro_Array

Microarray Quality Control (MAQC) Project

a series of papers published in Nature Biotechnology (Sep 2006)

Page 41: LectureW2_Intro_Array

Previous paper in NAR 2003

• Evaluation of gene expression measurements from commercial microarray platforms. Tan et al. Nucleic Acids Research. 2003. 31:5676-5684.

• Poor consistency made it a concern for precise science and routine clinical use.

• Three commercial platforms were compared.

• Inconsistent results were found across platforms.

Page 42: LectureW2_Intro_Array

Experiment Design

• 7 microarray platforms; each platform was implemented at 3 test sites; 4 pools of RNA, each with 5 replicates, were run (3*4*5=60 arrays for each platform).

• The 4 pools of RNA are: A. 100% UHRR; B. 100% HBRR; C. 75% UHRR + 25% HBRR; D. 25% UHRR + 75% HBRR.
  UHRR: Universal Human Reference RNA from Stratagene
  HBRR: Human Brain Reference RNA from Ambion

• 3 RT-PCR based alternative gene expression platforms were also tested: TaqMan, StaRT-PCR and QuantiGene assays.
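
The design can be written down as a small calculation. The pool compositions and the 3*4*5 count come from the slide; the idea that a gene's signal in a mixed pool is the weighted average of its signals in the pure pools is an idealization assumed here (it presumes a linear response and comparable mRNA content), and the function names and example levels are hypothetical.

```python
# Fraction of (UHRR, HBRR) in each pool, as given in the design.
POOLS = {"A": (1.00, 0.00), "B": (0.00, 1.00),
         "C": (0.75, 0.25), "D": (0.25, 0.75)}

def expected_signal(pool, uhrr_level, hbrr_level):
    """Idealized signal of one gene in a pool (assumes a linear response)."""
    w_uhrr, w_hbrr = POOLS[pool]
    return w_uhrr * uhrr_level + w_hbrr * hbrr_level

def arrays_per_platform(sites=3, pools=4, replicates=5):
    return sites * pools * replicates

# A hypothetical gene expressed at 400 in UHRR and 100 in HBRR.
for pool in "ABCD":
    print(pool, expected_signal(pool, 400.0, 100.0))
print(arrays_per_platform())
```

Under this idealization a gene that is higher in A than in B should be ordered A > C > D > B, which gives a built-in consistency check for each platform.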

Page 43: LectureW2_Intro_Array

Experiment Design

• NCI has only 2 test sites. AGL has only 2 samples. Some problematic arrays were removed.

• AGL is not included in this paper. A total of 386 arrays were analyzed.

Page 44: LectureW2_Intro_Array

Difficulties in comparing multiple platforms

• Each platform has a different probe design.

• Sensitivity and specificity of the probes. (Some cross-platform variability may be due to this annotation problem.)

• The database (NCBI RefSeq) changes often, making it difficult to match.

• Probes may bind to multiple alternatively spliced transcripts, which may have different functions and expression patterns.

Page 45: LectureW2_Intro_Array

Kuo (2006): probe matching within one exon for Gas1

Gene matching across different platforms is not easy.

Essentially each platform detects different targets.

Page 46: LectureW2_Intro_Array

Match genes across platforms

• All probes were mapped to the RefSeq and AceView databases.

• Each platform assayed 15,429-16,990 Entrez genes.

• 23,971 of 24,157 RefSeq NM accessions were assayed on at least one platform. Among them, 15,615 accessions (which correspond to 12,091 Entrez genes) were assayed on all platforms.

• When multiple probes match one RefSeq, only the probe closest to the 3’ end is used.

• Finally, each platform has 12,091 probes matching a common set of 12,091 RefSeq accessions from 12,091 different genes.
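
The matching procedure above can be sketched with toy data. The probe identifiers, accession labels and distances below are invented, and the data layout (a list of tuples per platform) is an assumption; the sketch only shows the two rules described: keep the probe closest to the 3' end for each RefSeq, then keep the RefSeq accessions assayed on every platform.

```python
def pick_probe_per_refseq(probes):
    """probes: list of (probe_id, refseq_id, distance_to_3prime_end)."""
    best = {}
    for probe_id, refseq, distance in probes:
        if refseq not in best or distance < best[refseq][1]:
            best[refseq] = (probe_id, distance)
    return {refseq: probe_id for refseq, (probe_id, _) in best.items()}

def common_refseqs(platforms):
    """Keep, per platform, one probe for each RefSeq assayed on all platforms."""
    chosen = {name: pick_probe_per_refseq(p) for name, p in platforms.items()}
    shared = set.intersection(*(set(c) for c in chosen.values()))
    return {name: {r: c[r] for r in sorted(shared)} for name, c in chosen.items()}

# Hypothetical probes on two hypothetical platforms.
platforms = {
    "platform1": [("p1", "NM_0001", 350), ("p2", "NM_0001", 120), ("p3", "NM_0002", 80)],
    "platform2": [("q1", "NM_0001", 40), ("q2", "NM_0003", 10)],
}
print(common_refseqs(platforms))
```

Every platform ends up with the same number of probes, one per shared accession, which mirrors the common set of 12,091 described on the slide.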

Page 47: LectureW2_Intro_Array

Number of detected genes called by manufacturers’ software

CV of 5 technical replicates

Page 48: LectureW2_Intro_Array

Blue: CV of 5 technical replicates

Red: CV of all 15 replicates (5 technical replicates × 3 test sites)
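
The two quantities in this figure can be computed as below. The coefficient of variation (CV) is the standard deviation divided by the mean; using the sample standard deviation is an assumption here, and the intensities are invented to show a site effect.

```python
import statistics

def coefficient_of_variation(values):
    """CV = sample standard deviation / mean."""
    return statistics.stdev(values) / statistics.mean(values)

# Hypothetical signal of one gene: 5 technical replicates at each of 3 sites.
site_replicates = {
    "site1": [100, 102, 98, 101, 99],
    "site2": [110, 112, 108, 111, 109],
    "site3": [90, 92, 88, 91, 89],
}

within_site = {site: coefficient_of_variation(v) for site, v in site_replicates.items()}
all_values = [x for v in site_replicates.values() for x in v]
across_sites = coefficient_of_variation(all_values)

print(within_site)
print(across_sites)
```

When the sites differ systematically, the CV over all 15 replicates is larger than any within-site CV, which is the gap between the red and blue values.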

Page 49: LectureW2_Intro_Array

Blue dot: percentage of genes concordantly called detected in each test site.

Blue bar: percentage of genes concordantly called detected in all three test sites.

Page 50: LectureW2_Intro_Array

Conclusions

• Microarray provides an opportunity to measure thousands of genes simultaneously and makes global monitoring of cellular activities possible.

• The method produces noisier data, and the choice of an adequate design and analysis is key.

• RT-PCR is used for validation of a small number of genes.

• Data obtained from different platforms and centers are consistent. Ready for routine clinical use.

Page 51: LectureW2_Intro_Array

Limitation

• The method measures mRNA instead of proteins. The actual protein abundance and post-translational modification cannot be detected.

• The method usually does not measure spatial or temporal dynamics of the cellular activity.

• The method is suitable for global monitoring and should be used to generate further hypotheses or be combined with other carefully designed experiments.

Page 52: LectureW2_Intro_Array

Introduction to next generation sequencing

Page 53: LectureW2_Intro_Array

Introduction

• What is next generation sequencing?
– Short reads (35~70 bps)
– Higher throughput
– Faster
– Cheaper

Page 54: LectureW2_Intro_Array

Introduction

• Comparing to traditional sequencing
– Traditional sequencing
  • No reference sequence available (ab initio)
  • Longer reads and additional linkage information required to assemble the entire sequence
– Next generation sequencing
  • Reference sequence available (sequenced by traditional sequencing)
  • No need for assembly; just map the short reads back to the reference sequence.
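
Mapping short reads back to a reference can be illustrated with exact string matching. This is a naive sketch on a made-up reference and made-up reads; practical read mappers use indexed references and tolerate mismatches, neither of which is attempted here, and the function name is an assumption.

```python
def map_reads(reference, reads):
    """Return {read: [0-based start positions of exact matches]}."""
    hits = {}
    for read in reads:
        positions = []
        start = reference.find(read)
        while start != -1:
            positions.append(start)
            start = reference.find(read, start + 1)
        hits[read] = positions
    return hits

reference = "ACGTTAGCCGATACGTTAGG"        # hypothetical reference sequence
reads = ["ACGTTAG", "CCGATA", "TTTTTT"]   # hypothetical short reads

for read, positions in map_reads(reference, reads).items():
    print(read, positions)
```

A read that matches in more than one place is ambiguous, and a read that matches nowhere stays unmapped; both cases appear in this toy example.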

Page 55: LectureW2_Intro_Array

Technology

Page 56: LectureW2_Intro_Array

Technology

Page 57: LectureW2_Intro_Array

Technology

Page 58: LectureW2_Intro_Array

Technology

Page 59: LectureW2_Intro_Array

Technology

Page 60: LectureW2_Intro_Array

Technology

Page 61: LectureW2_Intro_Array
Page 62: LectureW2_Intro_Array
Page 63: LectureW2_Intro_Array

Major Applications

• ChIP-Seq (Chromatin Immunoprecipitation)
– A substitute for ChIP-chip
– To find the binding sequences of proteins (TFBS)

• RNA-Seq
– A substitute for microarray
– To measure the amount of RNA expressed

Page 64: LectureW2_Intro_Array

RNA-Seq

• Comparing to microarray
– Microarray
  • Closed technology: prior knowledge required
  • Affected by pseudo-genes (homologs of real genes)
  • Cheap and mature
– RNA-Seq
  • Open technology: no prior knowledge required
  • Not affected by pseudo-genes because the exact sequence is measured
  • Other information can be obtained (SNP, alternative splicing)
  • Still more expensive than microarray

Page 65: LectureW2_Intro_Array

See also the following introduction slides:

http://biocluster.ucr.edu/~tgirke/HTML_Presentations/Manuals/HT-Seq/HT-Seq.pdf