Accurate Estimation of Gene Expression Levels from Digital Gene Expression Sequencing Data
Marius Nicolae and Ion Măndoiu (University of Connecticut, USA)
Outline
• DGE/SAGE-Seq protocol
• EM algorithm
• Experimental results
• Conclusions
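RNA-Seq Protocol
• Make cDNA & shatter into fragments
• Sequence fragment ends
• Map reads
• Applications: Gene Expression (GE), Isoform Expression (IE), Isoform Discovery (ID)
[Figure: reads mapped to a gene with segments A B C D E; the combinations A B C, A C, and D E are shown in the diagram]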
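DGE Protocol
• Cleave with anchoring enzyme (AE)
• Attach primer for tagging enzyme (TE)
• Cleave with tagging enzyme
• Map tags
• Gene Expression (GE)
[Figure: mRNA with poly(A) tail (AAAAA) and AE site CATG; TE primer sequence TCCRAC attached next to the AE site; tags mapped to a gene with segments A B C D E]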
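The protocol above determines which tags a given mRNA can produce. As a minimal sketch, assuming that a tag starts at the AE recognition site and extends a fixed number of bases toward the 3' end, the candidate tags of a transcript could be enumerated as follows. The function name `candidate_tags`, the default site `CATG`, and the default tag length of 21 are illustrative assumptions and do not come from the slides.

```python
def candidate_tags(mrna, ae_site="CATG", tag_len=21):
    """Return (j, tag) pairs for every AE site of an mRNA sequence.

    Sites are numbered from the 3' end, so j = 1 is the site closest
    to the poly(A) tail. Assumption: the tag includes the AE site and
    the bases immediately 3' of it, tag_len bases in total.
    """
    # Collect the 0-based start positions of all (possibly overlapping) sites.
    positions = []
    start = mrna.find(ae_site)
    while start != -1:
        positions.append(start)
        start = mrna.find(ae_site, start + 1)

    tags = []
    # Walk the sites from the 3' end toward the 5' end.
    for j, pos in enumerate(reversed(positions), start=1):
        tag = mrna[pos:pos + tag_len]
        # A site too close to the 3' end cannot yield a full-length tag;
        # it still keeps its number j because it can still be cut.
        if len(tag) == tag_len:
            tags.append((j, tag))
    return tags


if __name__ == "__main__":
    toy = "GGCATGAAACCCGGGTTTCATGACGTACGTAAAA"
    for j, tag in candidate_tags(toy, tag_len=10):
        print(j, tag)
```

With complete digestion only the site numbered 1 would produce a tag; partial digestion is what makes the sites further from the 3' end relevant.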
Our Approach
Previous methods
• Discard ambiguous tags [Asmann et al. 09, Zaretzki et al. 10]
• Heuristics to rescue some ambiguous tags [Wu et al. 10]
New DGE-EM algorithm
• Uses all tags, including all ambiguous ones
• Uses quality scores
• Takes into account partial digest and gene isoforms
Tag Formation Probability
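[Figure: mRNA drawn 5' to 3' with its AE sites numbered 1, 2, …, k starting from the 3' end]
Tag formation probability, where p is the probability that the AE cuts a given site:
• Site 1: $p$
• Site 2: $p(1-p)$
• Site k: $p(1-p)^{k-1}$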
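The probabilities above follow a geometric pattern: the tag comes from site j when site j is cut and the j-1 sites closer to the 3' end are not. A minimal sketch that evaluates them is shown below; the function names are illustrative assumptions.

```python
def tag_formation_probabilities(k, p):
    """Probability that the tag is formed at site j, for j = 1..k.

    Sites are numbered from the 3' end; p is the probability that the
    anchoring enzyme cuts any given site (0 < p <= 1).
    """
    if not 0 < p <= 1:
        raise ValueError("p must be in (0, 1]")
    return [p * (1 - p) ** (j - 1) for j in range(1, k + 1)]


def no_tag_probability(k, p):
    """Probability that none of the k sites is cut, so no tag forms."""
    return (1 - p) ** k


if __name__ == "__main__":
    probs = tag_formation_probabilities(3, 0.5)
    print(probs, no_tag_probability(3, 0.5))
```

The site probabilities sum to 1 - (1-p)^k, the probability that a transcript with k sites yields a tag at all.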
Tag-Isoform Compatibility
$$w_{t,i,j} = Q_a \, p \, (1-p)^{j-1}$$
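The weight of tag t for site j of isoform i combines an alignment term Q_a with the tag formation probability of site j. The slides do not define Q_a beyond stating that the algorithm uses quality scores, so the sketch below assumes Q_a is the probability of observing the tag given the aligned reference bases, with per-base error rates derived from Phred scores and errors spread uniformly over the three other bases. Both function names are hypothetical.

```python
def alignment_probability(tag, reference, phred):
    """Assumed form of Q_a: product of per-base match/mismatch probabilities.

    phred[k] is the Phred quality of base k, giving an error probability
    of 10 ** (-phred[k] / 10). A mismatch is assumed to be one of three
    equally likely substitutions.
    """
    prob = 1.0
    for base, ref_base, q in zip(tag, reference, phred):
        err = 10 ** (-q / 10)
        prob *= (1 - err) if base == ref_base else err / 3
    return prob


def tag_weight(q_a, p, j):
    """w_{t,i,j} = Q_a * p * (1 - p) ** (j - 1), with j counted from the 3' end."""
    return q_a * p * (1 - p) ** (j - 1)


if __name__ == "__main__":
    q_a = alignment_probability("CATGAC", "CATGAC", [30] * 6)
    print(tag_weight(q_a, p=0.5, j=2))
```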
DGE-EM Algorithm
assign random values to all f(i)
while not converged
  E-step
    init all n(i,j) to 0
    for each tag t
      $s = \sum_{(i,j,w) \in t} w \, f(i)$
      for (i,j,w) in t
        $n(i,j) \mathrel{+}= w \, f(i) / s$
  M-step
    for each isoform i
      $N_i = \sum_{j=1}^{sites(i)} n_{i,j}$
      $f(i) = N_i / \left(1 - (1-p)^{sites(i)}\right)$
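The E-step and M-step above can be turned into a short program. The following is a sketch under stated assumptions rather than the authors' implementation: each tag is represented as a list of `(i, j, w)` triples, frequencies are renormalized to sum to 1 after every M-step, and convergence is declared when no frequency changes by more than `tol`. The renormalization, the convergence test, and all parameter names are assumptions added for the example.

```python
import random


def dge_em(tags, sites, p, max_iter=1000, tol=1e-9, seed=0):
    """Estimate isoform frequencies f(i) with the EM scheme above.

    tags  : list of tags, each a list of (i, j, w) triples giving the
            compatible isoform i, its site j, and the weight w
    sites : dict mapping isoform i to its number of AE sites (>= 1)
    p     : probability that the anchoring enzyme cuts a site (0 < p <= 1)
    """
    rng = random.Random(seed)
    # Random positive start, normalized to sum to 1 (assumption).
    f = {i: rng.random() + 1e-3 for i in sites}
    total = sum(f.values())
    f = {i: v / total for i, v in f.items()}

    for _ in range(max_iter):
        # E-step: split each tag among its compatible (isoform, site) pairs.
        n = {}
        for t in tags:
            s = sum(w * f[i] for i, j, w in t)
            if s == 0:
                continue  # tag has no support under the current f
            for i, j, w in t:
                n[(i, j)] = n.get((i, j), 0.0) + w * f[i] / s

        # M-step: correct each isoform's count by the chance it yields a tag.
        new_f = {}
        for i, k in sites.items():
            n_i = sum(n.get((i, j), 0.0) for j in range(1, k + 1))
            new_f[i] = n_i / (1 - (1 - p) ** k)
        total = sum(new_f.values())
        if total == 0:
            return f  # no tags were assigned; keep the current estimate
        new_f = {i: v / total for i, v in new_f.items()}

        delta = max(abs(new_f[i] - f[i]) for i in sites)
        f = new_f
        if delta < tol:
            break
    return f


if __name__ == "__main__":
    # Toy input: 3 tags unique to A, 1 unique to B, 2 shared by both.
    tags = ([[("A", 1, 1.0)]] * 3 + [[("B", 1, 1.0)]]
            + [[("A", 1, 1.0), ("B", 1, 1.0)]] * 2)
    print(dge_em(tags, {"A": 1, "B": 1}, p=1.0))
```

In the toy input the shared tags end up split in the same 3:1 ratio as the unique ones, which is the behavior that discarding ambiguous tags cannot reproduce.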
MAQC Data (UHRR, HBRR)
DGE
• 9 Illumina libraries, 238M 20bp tags [Asmann et al. 09]
• Anchoring enzyme DpnII (GATC)
RNA-Seq
• 6 libraries, 47-92M 35bp reads each [Bullard et al. 10]
qPCR
• Quadruplicate measurements for 832 Ensembl genes [MAQC Consortium 06]
Compared Algorithms
DGE
• Uniq [Asmann et al. 09, Zaretzki et al. 10]
• DGE-EM
RNA-Seq
• IsoEM [Nicolae et al. 10]
• Cufflinks [Trapnell et al. 10]
DGE-EM vs. Uniq on HBRR Library 4
[Figure: Median Percent Error (y-axis, 65 to 85) over an x-axis running from 0 to 60,000,000, for Uniq and DGE-EM with 0, 1, and 2 mismatches]
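The comparison above is reported as Median Percent Error. The slides do not spell out the formula, so the sketch below assumes the usual definition: the median, over genes, of the absolute difference between the estimated and reference values, relative to the reference and expressed in percent. The function name and the handling of missing or zero reference values are assumptions.

```python
from statistics import median


def median_percent_error(estimated, reference):
    """Median over genes of 100 * |estimated - reference| / reference.

    estimated, reference : dicts mapping gene id to expression level.
    Genes missing from `estimated` or with a non-positive reference
    value are skipped (assumption).
    """
    errors = [
        100 * abs(estimated[g] - ref) / ref
        for g, ref in reference.items()
        if g in estimated and ref > 0
    ]
    if not errors:
        raise ValueError("no genes with a usable reference value")
    return median(errors)


if __name__ == "__main__":
    est = {"g1": 110.0, "g2": 80.0, "g3": 100.0}
    ref = {"g1": 100.0, "g2": 100.0, "g3": 100.0}
    print(median_percent_error(est, ref))
```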
DGE vs. RNA-Seq
[Figure: Median Percent Error (y-axis, 60 to 100) vs. Million Mapped Bases (x-axis). Series: IsoEM on RNA-Seq libraries HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, and UHRR 2 to 5; DGE-EM on DGE libraries HBRR 1 to 8 and UHRR 1]
DGE vs. RNA-Seq
[Figure: Median Percent Error (y-axis, 60 to 100) vs. Million Mapped Bases (x-axis). Series: IsoEM and Cufflinks on RNA-Seq libraries HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, and UHRR 2 to 5; DGE-EM and Uniq on DGE libraries HBRR 1 to 8 and UHRR 1]
DGE vs. RNA-Seq
[Figure: R² (y-axis, 0.35 to 0.85) vs. Million Mapped Bases (x-axis). Series: IsoEM and Cufflinks on RNA-Seq libraries HBRR 1X, HBRR 1A, UHRR 1X, UHRR 1A, and UHRR 2 to 5; DGE-EM and Uniq on DGE libraries HBRR 1 to 8 and UHRR 1]
Synthetic Data
• 1-30M tags, lengths 14-26bp
• UCSC hg19 genome and known isoforms
• Simulated expression levels
  – Gene expression for 5 tissues from the GNFAtlas2
  – Geometric expression for the isoforms of each gene
• Anchoring enzymes from REBASE
  – DpnII (GATC) [Asmann et al. 09]
  – NlaIII (CATG) [Wu et al. 10]
  – CviJI (RGCY, R=G or A, Y=C or T)
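Degenerate recognition sites such as RGCY match several concrete sequences, which is why such enzymes cut more often. A minimal sketch of locating sites for a pattern written with IUPAC ambiguity codes is shown below. It translates the pattern into a regular expression and uses a lookahead so that overlapping sites are also reported. The helper names are illustrative assumptions, and only a subset of the IUPAC codes is included.

```python
import re

# Subset of the IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "N": "[ACGT]",
}


def site_regex(site):
    """Compile a recognition site into a regex that finds overlapping hits."""
    body = "".join(IUPAC[code] for code in site)
    # The lookahead consumes no characters, so overlapping sites match too.
    return re.compile("(?=(" + body + "))")


def find_sites(sequence, site):
    """Return the 0-based start positions of all occurrences of the site."""
    return [m.start() for m in site_regex(site).finditer(sequence)]


if __name__ == "__main__":
    seq = "AAGCTTGGCCAGCC"
    print(find_sites(seq, "RGCY"))  # degenerate CviJI site
    print(find_sites(seq, "GATC"))  # DpnII site
```

On the toy sequence the degenerate pattern matches AGCT, GGCC, and AGCC, while the fixed site GATC does not occur at all.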
MPE for 30M 21bp tags
RNA-Seq: 8.3 MPE
[Figure: Median Percent Error (y-axis, 0 to 30) by anchoring enzyme recognition site (GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY) for Uniq and DGE-EM at p=1.0 and p=0.5]
Conclusions
Introduced new DGE-EM algorithm
• Improves accuracy over previous methods by using ambiguous tags and considering isoforms and partial digestion
• Source code freely available at http://www.dna.engr.uconn.edu/software/DGE-EM
First direct comparison of RNA-Seq and DGE protocols
• Best inference algorithms yield comparable cost-normalized accuracy on MAQC data
Simulations suggest possible DGE protocol improvements
• Enzymes with degenerate recognition sites (e.g. CviJI)
• Optimizing cutting probability
Questions?
ACKNOWLEDGEMENTS
Work supported in part by NSF awards IIS-0546457 and IIS-0916948
Anchoring Enzyme Statistics
[Figure: % Genes Cut, % Unique Tags (p=1.0), and % Unique Tags (p=0.5) (y-axis, 75 to 100) by anchoring enzyme recognition site (GATC, GGCC, CATG, TGCA, AGCT, YATR, ASST, RGCY)]
RNA-Seq
[Figure: MPE (0 to 25) as a function of #Reads (1M, 5M, 10M, 15M, 30M) and Read Length (14, 18, 21, 26, 36, 50, 75, 100)]
DGE enzyme GATC p=1.0
[Figure: MPE (0 to 16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]
DGE enzyme CATG p=1.0
[Figure: MPE (0 to 16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]
DGE enzyme RGCY p=1.0
[Figure: MPE (0 to 14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]
DGE enzyme GATC p=.5
[Figure: MPE (0 to 16) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]
DGE enzyme CATG p=.5
[Figure: MPE (0 to 14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]
DGE enzyme RGCY p=.5
[Figure: MPE (0 to 14) as a function of Tag Length (14, 18, 21, 24, 26) and #Tags (1M, 5M, 10M, 15M, 30M)]