sage- serial analysis of gene expression
TRANSCRIPT
Serial Analysis of Gene Expression (SAGE) Technology
By: Dr. Ashish C Patel, Assistant Professor, Vet College, AAU, Anand
Serial Analysis of Gene Expression
It is believed that most biological phenomena observed in a variety of organisms can be explained by the quantity of gene products.
To understand cellular functions under given conditions at a given time, we measure which genes are transcribed into mRNA and how many mRNA copies of each gene are present at that point in time.
Each cell contains mRNAs from more than 10,000 different genes, with one to more than 10,000 copies per gene and up to half a million mRNA transcript copies in total. It is therefore practically impossible to determine them all one by one.
Large-scale random cDNA sequencing in EST projects was very useful for identifying unknown genes expressed in given cells or tissues (Adams et al., 1991).
[Figure: EST workflow. mRNA species 1 … n are converted to cDNA, inserted into plasmids, and cloned; restriction enzyme (RE) digestion and sequencing of the cDNA clones yield ESTs 1 … n, which are assembled across all sequencing projects. Hence, sequencing = n x n times.]
• However, this approach was not designed to quantify expressed genes.
• The body mapping project (Okubo et al., 1992) attempted to construct gene expression profiles of a number of cells and tissues by random sequencing of a 3’-directed cDNA library.
• Fragments of about 300 bp from these 3’ regions were called gene signatures, and each represented a particular mRNA species.
• By sequencing 1000 or more cDNA clones, they could draw a rough pattern of gene expression and identify the highly abundant class of mRNAs.
• However, both the EST and body mapping projects share a weakness: one sequencing process yields only one cDNA sequence.
• Mainly because of this low throughput, the profiles obtained by the body mapping project inevitably fell far short of what was expected and demanded.
• The more recent hybridization-based methods (DNA microarrays), which use immobilized cDNAs or oligonucleotides, can potentially examine the expression patterns of a relatively large number of genes, but they can only examine expressed sequences that have already been identified.
• In contrast, the SAGE method allows a quantitative and simultaneous analysis of a large number of transcripts in any particular cells or tissues, without prior knowledge of the genes.
• Like the body mapping procedure, this method uses the 3’ portion of the mRNA as the gene tag, but in a much shorter form (9–10 bp). These tags can be serially connected before cloning into a plasmid vector.
• Since the resulting plasmid clones contain multiple tags, the sequences of several dozen mRNAs can be obtained from a single sequencing reaction.
• The rapid and cost-saving sequencing made possible by this original design allows quantification and identification of a large number of cellular transcripts.
• SAGE is based mainly on two principles: representation of mRNAs (cDNAs) by short sequence tags, and concatenation of these tags for cloning to allow efficient sequencing analysis.
• A hypothetical eukaryotic cell that contains seven mRNA molecules belonging to four species is depicted.
• To describe the gene expression profile of this cell by conventional means, one would have to carry out several cDNA sequencing reactions.
• However, if each mRNA species can be represented by a short unique sequence stretch (such as a 9 bp tag), sequencing the tags is sufficient, because a stretch as short as 9 bp can distinguish 4^9 = 262,144 transcripts, assuming a random nucleotide distribution throughout the genome.
• If these tags could be connected into one long DNA molecule, only a single sequencing reaction would be needed.
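The counting argument in the bullets above can be checked with a short sketch. The tag sequences and species labels are made up for illustration; only the 9 bp tag length and the seven-molecule, four-species cell come from the text.

```python
from collections import Counter


def tag_space(tag_length: int) -> int:
    """Number of distinct tags of a given length over the 4-letter DNA alphabet."""
    return 4 ** tag_length


# Hypothetical 9 bp tags for four mRNA species (illustrative sequences only).
species_tags = {
    "species_1": "ACGTACGTA",
    "species_2": "TTGACCGTA",
    "species_3": "GGCATTACG",
    "species_4": "CATTGGACC",
}

# A model cell with seven mRNA molecules drawn from the four species.
cell_mrnas = [
    "species_1", "species_1", "species_1",
    "species_2", "species_2",
    "species_3",
    "species_4",
]

# Join all tags into one long stretch, as a single sequencing read would see them.
concatenated = "".join(species_tags[s] for s in cell_mrnas)

# Cut the read back into 9 bp tags and count them to recover the profile.
tags = [concatenated[i:i + 9] for i in range(0, len(concatenated), 9)]
profile = Counter(tags)

print(tag_space(9))  # 262144
for tag, copies in profile.items():
    print(tag, copies)
```

One read of 63 bases is enough here to recover both which species are present and how many copies of each, which is the point of concatenating the tags.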
Principle of SAGE
The principle of SAGE. A hypothetical eukaryotic cell that contains seven mRNA molecules belonging to four species is shown as a model. The boxed regions are the tags specific to each mRNA species.
SAGE Scheme
The SAGE method allows a quantitative and simultaneous analysis of a large number of transcripts in any particular cells or tissues.
[Figure: SAGE scheme]
(1) A 9–10 bp tag is taken from the 3’ portion of each mRNA (mRNA species 1, 2, 3).
(2) The tags are extracted, concatenated in a plasmid, and cloned.
(3) The inserted sequence is isolated from the plasmid and sequenced, giving the tags in series (e.g. TAGCGG.. for species 1, ATGCGGC.. for species 2, TATTTTAGC… for species 3).
(4) Each tag is searched against the human genome with the BLAST service and matched to an annotated gene (here Annotated Gene 1, 12, and 34).
Result: genes 1, 12, and 34 are expressed at a certain time, say during mitosis.
SAGE procedure
[Figure: mRNA with a 3’ poly(A) tail is primed with oligo(dT) to give an mRNA-cDNA hybrid; the RNA is removed by RNase H; double-stranded (ds) cDNA is then synthesized.]
Double-stranded cDNA is synthesized from mRNA using a biotinylated oligo(dT) primer, because oligo(dT) anneals with high efficiency to the 3’ poly(A) region present in most eukaryotic mRNAs.
SAGE procedure
[Figure: the ds cDNA is cut to leave a cohesive end (shown as 5’ GTAC in the diagram), bound to streptavidin beads, and divided in half.]
The cDNA is then cleaved with a restriction enzyme (called the anchoring enzyme; here NlaIII).
The cDNA fragment with the cohesive end at its 5’ terminus is immobilized by binding to streptavidin-coated beads.
SAGE procedure
[Figure: linker A is ligated to one half and linker B to the other. Each linker carries the recognition site for the tagging enzyme (TE), such as BsmFI or FokI. Cleavage with the TE (e.g. BsmFI) releases the linker together with a short stretch of cDNA (the tag, shown as N's) and leaves an overhanging end, which T4 DNA polymerase converts to a blunt end.]
Two independent linkers are ligated, through the NlaIII cohesive termini, one to each half of the sample.
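The tag that results from these steps can be sketched in silico. Because the cDNA is held on the beads by its biotinylated 3’ end, the fragment that remains after NlaIII cleavage starts at the 3’-most CATG site, and the tagging enzyme then cuts a fixed distance into the cDNA. The function name, the example sequence, and the default tag length of 10 bp are assumptions for illustration, not part of any SAGE software.

```python
from typing import Optional


def extract_sage_tag(cdna: str, anchor_site: str = "CATG",
                     tag_length: int = 10) -> Optional[str]:
    """Return the tag that follows the 3'-most anchoring-enzyme site.

    `cdna` is the sense strand written 5' to 3'. Returns None when the
    site is missing or fewer than `tag_length` bases follow it.
    """
    cdna = cdna.upper()
    site_start = cdna.rfind(anchor_site)  # 3'-most NlaIII site
    if site_start == -1:
        return None
    tag_start = site_start + len(anchor_site)
    tag = cdna[tag_start:tag_start + tag_length]
    return tag if len(tag) == tag_length else None


# Hypothetical cDNA with two CATG sites; only the 3'-most one defines the tag.
example_cdna = "GGCATGAAACCCGGGTTTCATGACGTACGTTAGGCTAAAAAAAA"
print(extract_sage_tag(example_cdna))  # ACGTACGTTA
```

A transcript without any NlaIII site yields no tag at all, which is one way a gene can be missed by this approach.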
SAGE procedure
[Figure: the blunt-ended linker-tag molecules from the two halves are ligated in tail-to-tail orientation, giving linker A, two tags, and linker B in one molecule, which is amplified by PCR with primers A and B.]
The two portions are mixed again and ligated. Because the 5’ ends of the linkers are blocked by an amino group, only the mRNA-derived termini can be ligated, in a tail-to-tail orientation.
SAGE procedure
[Figure: after one round of amplification, each product carries an anchoring enzyme (AE) recognition site (CATG) on either side of the ditag; cleavage at these sites releases the ditags, which are then isolated.]
The amplified product is cleaved with NlaIII, the anchoring enzyme.
Ditag fragments flanked at both ends by NlaIII cohesive termini are isolated and ligated to obtain concatemers.
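Reading tags back out of a sequenced concatemer can be sketched as follows. This is a simplified illustration, not the SAGE2000 algorithm: it assumes 10 bp tags, splits the read at every CATG, accepts ditags of 20–26 bp, and skips repeated ditags. All sequences and function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq: str) -> str:
    """Reverse complement of a DNA sequence made of A, C, G, T."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def split_ditags(concatemer: str, anchor_site: str = "CATG",
                 min_len: int = 20, max_len: int = 26) -> list:
    """Split a concatemer at the anchoring-enzyme site and keep ditag-sized pieces."""
    pieces = concatemer.upper().split(anchor_site)
    return [p for p in pieces if min_len <= len(p) <= max_len]


def tags_from_ditag(ditag: str, tag_length: int = 10) -> tuple:
    """A ditag holds one tag at its left end and a second, tail-to-tail, at its right end."""
    left = ditag[:tag_length]
    right = reverse_complement(ditag[-tag_length:])
    return left, right


def count_tags(concatemer: str, tag_length: int = 10) -> Counter:
    """Count tags, using each distinct ditag only once."""
    counts = Counter()
    seen = set()
    for ditag in split_ditags(concatemer):
        if ditag in seen:  # repeated ditag, skipped in this sketch
            continue
        seen.add(ditag)
        counts.update(tags_from_ditag(ditag, tag_length))
    return counts


# Two hypothetical ditags; the first appears twice in the concatemer.
ditag_1 = "ACGTACGTTA" + reverse_complement("GGATCCTTAA")
ditag_2 = "TTGACCGGTA" + reverse_complement("AGCTTGCAAC")
concatemer = "CATG" + ditag_1 + "CATG" + ditag_2 + "CATG" + ditag_1 + "CATG"

tag_counts = count_tags(concatemer)
for tag, n in sorted(tag_counts.items()):
    print(tag, n)
```

A tag that happens to contain CATG internally would be split incorrectly by this simple approach; real pipelines have to handle that case.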
SAGE procedure
[Figure: isolated ditags, each flanked by CATG/GTAC cohesive ends, are concatenated; the concatemer is inserted into a plasmid and cloned.]
Tags from any number (n) of mRNA species can be concatenated.
One mRNA species gives two ds cDNA fragments, joined by palindromic sequences.
SAGE procedure
[Figure: a plasmid carrying a concatemer in which each CATG-delimited segment corresponds to one mRNA species (mRNA species no. 1, 2, 3, … n).]
• SAGE is a tool for the study of gene expression, and it has been used to analyze a variety of biological phenomena. By the year 2000, close to five million tags in total had been analyzed by this method.
• Table 1 shows the highly diverse types of cells and tissues that were studied under a variety of physiological and pathological conditions. The total number of tags collected varied from study to study.
Cancer studies (Lal et al., 1999)
• By comparing the gene expression profiles derived from cancer and normal tissue of interest, a large number of genes were identified as tumor specific.
• Usually, Northern blot hybridization was performed to confirm the differential expression of these genes in a number of independently isolated tissue samples of a similar nature.
• About half of the overrepresented genes identified by SAGE were reproducibly present in these samples, while the behavior of the other half was quite different. This may reflect the heterogeneity among tumors from different individuals.
Immunological studies
• A few SAGE analyses have been applied directly to the study of immunological phenomena.
• Chen et al. (1998) reported the changes in gene expression in rat mast cells before and after stimulation through the high-affinity receptors for immunoglobulin E.
• Genes that had not previously been associated with mast cells included macrophage migration inhibitory factor and the receptors for growth hormone-releasing factor and melatonin.
• Many other differentially expressed genes were related to cell structure and cell motility, and numerous unknown genes showed no database match.
Yeast
• Yeast is widely used to clarify the biochemical and physiological parameters underlying eukaryotic cellular functions.
• The entire genome sequence has been determined (Goffeau, 1997), and the number of genes has been estimated at about 6300.
• The total number of mRNA molecules has also been estimated, at 15,000 per cell (Hereford and Rosbash, 1977).
• Yeast was therefore chosen as a model organism to evaluate the power of the SAGE technology.
Drawbacks, problems and technical modifications
• Technical problems include the need for a relatively large amount of mRNA and the relative difficulty of constructing tag libraries, among others.
• MicroSAGE (Datson et al., 1999) requires 500–5000-fold less starting input RNA and is simplified by a ‘one-tube’ procedure for all steps from RNA isolation to tag release.
• SAGE-lite is another, similarly devised protocol that allows the global analysis of transcription from less than 100 ng of total starting RNA (Peters et al., 1999).
Technical difficulty of the procedure
• In the original SAGE protocol, the major PCR products are often linker-dimers. To minimize contaminating linker molecules, biotinylated PCR primers were introduced; these generate biotinylated ditag products, which allows the unwanted linkers to be removed at a later stage by binding to streptavidin beads.
• Simply introducing a heating step at the final ligation step yields cloned concatemers with an average of 67 tags, compared with 22 tags with the original protocol.
• A major problem of the SAGE approach is how to further analyze the unknown tags.
• A conventional oligonucleotide-based plaque lift method has been used successfully to isolate and clone a number of genes.
• However, with oligonucleotides of only 13–14 bp, it is almost impossible to discriminate single-base mismatches by temperature-regulated DNA–DNA hybridization, which results in numerous false positives.
• An RT-PCR-based method was also developed to analyze the corresponding genes; this approach uses the identified tag sequences and oligo-dT as PCR primers.
• Matsumura et al. (1999) reported a procedure to recover a longer cDNA fragment by PCR using the SAGE tag sequence as a primer, thereby facilitating the analysis of unknown genes identified by tag sequence in SAGE.
• Sequencing error: the sequencing error rate affects a SAGE experiment; its impact can be reduced by using phred scores and discarding ambiguous sequences.
• Short SAGE tags comprise 14 bp and Long SAGE tags comprise 21 bp.
• About 12% of C. elegans tags cannot be identified unambiguously using 14 bp tags (McKay et al., 2003). Empirical data suggest that Long SAGE gives far greater resolution, but at an increased cost.
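The phred-based filtering mentioned above can be illustrated with a small sketch. A phred score Q corresponds to an error probability P = 10^(-Q/10). The cut-off of 20, the function names, and the example reads are assumptions chosen for illustration, not values prescribed by any SAGE protocol.

```python
def error_probability(phred: int) -> float:
    """Convert a phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-phred / 10)


def keep_tag(tag: str, phred_scores: list, min_phred: int = 20) -> bool:
    """Keep a tag only if every base is unambiguous and meets the quality cut-off."""
    if len(tag) != len(phred_scores):
        raise ValueError("one phred score is needed per base")
    unambiguous = all(base in "ACGT" for base in tag.upper())
    well_supported = all(q >= min_phred for q in phred_scores)
    return unambiguous and well_supported


# Hypothetical tags with per-base phred scores.
reads = [
    ("ACGTACGTTA", [40, 38, 35, 40, 39, 37, 36, 40, 40, 38]),  # clean
    ("ACGTNCGTTA", [40, 38, 35, 40, 5, 37, 36, 40, 40, 38]),   # ambiguous base N
    ("GGATCCTTAA", [40, 38, 12, 40, 39, 37, 36, 40, 40, 38]),  # one low-quality base
]

kept = [tag for tag, scores in reads if keep_tag(tag, scores)]
print(kept)  # ['ACGTACGTTA']
```

A stricter cut-off discards more tags but lowers the chance that a single miscalled base turns one tag into another.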
SAGE Data Analysis Strategies
• The sequence files generated by the automated sequencer are analyzed using the SAGE2000 software (www.sagenet.org).
• The three steps involved in obtaining a differential gene expression list are as follows:
(1) Interpreting the SAGE tags from the sequence data files, using the SAGE2000 software to extract ditags and check for duplicate ditags;
(2) Downloading a reference sequence database from the NCBI Web site (SAGEmap, www.ncbi.nlm.nih.gov); and
(3) Associating the tags with the expressed gene database.
• The relative transcript abundance can then be calculated by dividing the count of each unique tag by the total number of tags sequenced, and the fold change can be determined from the ratio of tag counts between libraries.
• The initial analysis is usually limited to a predefined tag ratio of greater than 5-fold and a value of P≤0.05.
• The rates of false-positives associated with different probability values have been computed by Monte-Carlo test to validate confidence intervals.
• Depending on the preliminary results, the SAGE data can be reanalyzed by varying the P values and the fold-change thresholds.
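The abundance and fold-change calculation described above can be sketched as follows. The two libraries, their tag counts, and the function names are hypothetical. Normalizing the ratio by library size is an assumption made here so that libraries of different depth can be compared, and no P value is computed in this sketch.

```python
from collections import Counter


def relative_abundance(library: Counter) -> dict:
    """Count of each unique tag divided by the total number of tags sequenced."""
    total = sum(library.values())
    return {tag: count / total for tag, count in library.items()}


def fold_change(tag: str, library_a: Counter, library_b: Counter) -> float:
    """Ratio of a tag's abundance in library A to that in library B."""
    count_a, count_b = library_a.get(tag, 0), library_b.get(tag, 0)
    total_a, total_b = sum(library_a.values()), sum(library_b.values())
    if count_b == 0:
        # Tag absent from B: the ratio is undefined (or unbounded if present in A).
        return float("inf") if count_a > 0 else float("nan")
    return (count_a * total_b) / (count_b * total_a)


# Hypothetical tag counts for two libraries of 100 tags each.
normal = Counter({"AAAAAAAAAA": 10, "CCCCCCCCCC": 80, "GGGGGGGGGG": 10})
tumor = Counter({"AAAAAAAAAA": 60, "CCCCCCCCCC": 40})

all_tags = sorted(set(normal) | set(tumor))
changes = {tag: fold_change(tag, tumor, normal) for tag in all_tags}

# Apply the predefined ratio threshold from the text (greater than 5-fold).
candidates = [tag for tag, fc in changes.items() if fc > 5]
print(candidates)  # ['AAAAAAAAAA']
```

In practice the fold-change filter would be combined with a significance test, and both thresholds can be varied when the data are reanalyzed.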
SAGEmap
[Screenshots: SAGEmap; SAGE resources and SAGE data, http://www.sagenet.org/]
SAGE APPLICATION
• SAGE is useful in comparative expression studies to identify differences in gene expression between two or more cellular sources of RNA.
• Gene discovery
• Determining changes in gene expression as a consequence of an experimental treatment (e.g. carcinogen, hormone)
• Provides quantitative data on both known and unknown genes
• Analyzes all transcripts (the transcriptome) without prior selection of known genes
• Analysis of cardiovascular gene expression
• Gene expression in carcinogenesis
• Substance abuse studies
• Cell, tissue, and developmental stage profiling
• Profiling of human diseases
SAGE – Advantages & Disadvantages
Advantages
• No hybridization step, so no cross-hybridization can occur.
• Can help identify new genes by using a tag as a PCR primer.
Disadvantages
• Cost and time required to perform so many PCR and sequencing reactions.
• Type IIS restriction enzymes can yield fragments of the wrong length depending on temperature.
• Multiple genes could have the same tag.
• As with microarrays, mRNA levels may not represent protein levels in a cell.
Microarray Vs. SAGE