1
Q1-Q3 results
2
RF lengths
3
Filtered RF length distribution
      Min   Max    Average
Q1    28    1462   78.2
Q2    20    372    42.3
Q3    28    826    77.2
4
Q1 filtered RF length distribution
5
Q2 filtered RF length distribution
6
Q3 filtered RF length distribution
7
RF position when compared to genes and exons
8
Q1-Q2-Q3: Projected filtered RF distribution (internal = overlaps target gene; projection done by pool)
                                            Q1     Q2     Q3
Internal                                    39%    86%    21%
  Exonic                                    46%    88%    47%
  Intronic                                  54%    12%    53%
External                                    61%    14%    79%
  Genic                                     71%    78%    78%
    Exonic                                  79%    88%    69%
      Overlap most 5' exon of a transcript  22%    47%    23%
    Intronic                                21%    12%    31%
  Intergenic                                29%    22%    22%
-> chimeric transcripts?
9
Why are Q3 RF mostly external (79%)?
Is there a systematic swap between certain pairs of pools?
For each RF we computed the overlap with all Q3 genes and then compared:
the RF pool with the pool of the gene(s) the RF overlaps
10
RF overlapping gene pool, per RF pool:
Pool 1: 1, 8, 2, 12, 6, 5
Pool 2: 2, 13, 15, 1, 5, 11
Pool 3: 3, 16, 11, 12, 2, 8
Pool 8: 8, 11, 1, 13, 16, 4, 6, 7
Pool 9: 9, 6, 11, 1, 10, 13, 11
Q3 RF compared to Q3 genes
A = Nb RF with at least one overlapping gene in its pool
B = Nb RF with no overlapping gene in its pool
C = Nb RF with at least one overlapping gene
D = Proportion of RF with at least one overlapping gene in its pool (%)
Pool   A     B    C     D
1      73    31   104   70.19
2      136   31   167   81.44
3      30    26   56    53.57
4      25    0    25    100
5      26    12   38    68.42
6      41    19   60    68.33
7      49    29   78    62.82
8      10    17   27    37.04
9      16    42   58    27.59
10     42    35   77    54.55
11     47    16   63    74.6
12     58    7    65    89.23
13     32    38   70    45.71
14     40    36   76    52.63
15     26    25   51    50.98
16     56    20   76    73.68
-> Q3 RF overlap genes of their own pool more often than genes of other pools (no clear pool swap)
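The pool comparison on this slide can be sketched as below. This is a minimal illustration, not the authors' pipeline: it assumes each RF is stored as a pair of its own pool and the set of pools of the Q3 genes it overlaps. The record layout and the name `own_pool_overlap` are hypothetical, and the toy input is made up.

```python
from collections import defaultdict

def own_pool_overlap(rfs):
    """rfs: iterable of (rf_pool, gene_pools) pairs (hypothetical layout), where
    gene_pools is the set of pools of the Q3 genes the RF overlaps.
    Returns {pool: (n_own, n_other_only, n_any, pct_own)}."""
    own = defaultdict(int)
    any_gene = defaultdict(int)
    for rf_pool, gene_pools in rfs:
        if not gene_pools:
            continue  # RF overlapping no Q3 gene are left out of this table
        any_gene[rf_pool] += 1
        if rf_pool in gene_pools:
            own[rf_pool] += 1
    return {
        pool: (own[pool], n - own[pool], n, round(100 * own[pool] / n, 2))
        for pool, n in any_gene.items()
    }

# Toy input, made up for illustration only
toy = [(1, {1}), (1, {1, 8}), (1, {2}), (1, set()), (9, {6}), (9, {9})]
summary = own_pool_overlap(toy)
print(summary)
```

A systematic swap between two pools would show up as a low own-pool percentage for both pools, with most of the remaining overlaps falling in the partner pool.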
12
6 genes of Q3 are in two different pools
-> generates pool unspecific RF
Nb RF with two Q3    Q3 genes overlapping the RF   Pools of      Are these genes
overlapping genes                                  these genes   the same one?
55                   CHODL, NM_024944              6, 10         Same gene
47                   NM_021090, MTMR3              15, 16        Same gene
27                   NM_015367, BCL2L13            12, 3         Same gene
16                   RP5-1042K10.2, NM_015705      14, 15        Same gene
12                   ZNF74, NM_003426              10, 13        Same gene
2                    NM_005446, P2RXL1             8, 9          Same gene
1                    NM_001013676, MTMR3           14, 16        Not the same gene
Problematic pools are:
● 6-10-13
● 8-9
● 12-3
● 14-15-16
13
Q3 RF overlapping Q3 genes
Pool    Nb assigned RF   RF with at least one overlapping gene   RF overlapping its assigned gene
1       261              104 (39.85%)                            73 (27.97%)
2       260              167 (64.23%)                            136 (52.31%)
3       164              56 (34.15%)                             30 (18.29%)
4       172              25 (14.53%)                             25 (14.53%)
5       354              38 (10.73%)                             26 (7.34%)
6       229              60 (26.2%)                              41 (17.9%)
7       222              78 (35.14%)                             49 (22.07%)
8       183              27 (14.75%)                             10 (5.46%)
9       212              58 (27.36%)                             16 (7.55%)
10      184              77 (41.85%)                             42 (22.83%)
11      256              63 (24.61%)                             47 (18.36%)
12      173              65 (37.57%)                             58 (33.53%)
13      169              70 (41.42%)                             32 (18.93%)
14      209              76 (36.36%)                             40 (19.14%)
15      104              51 (49.04%)                             26 (25%)
16      259              76 (29.34%)                             56 (21.62%)
Total   3411             1091 (32.95%)                           707 (20.8%)
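A note on reading the Total row: its percentages appear to be the unweighted mean of the 16 per-pool percentages rather than the ratio of the totals. The sketch below recomputes both from the counts in the table; the variable names are ours.

```python
# Counts copied from the table above (pools 1 to 16)
assigned = [261, 260, 164, 172, 354, 229, 222, 183,
            212, 184, 256, 173, 169, 209, 104, 259]
with_gene = [104, 167, 56, 25, 38, 60, 78, 27,
             58, 77, 63, 65, 70, 76, 51, 76]

# Percentage per pool, then two ways of summarising them
per_pool = [100 * g / a for g, a in zip(with_gene, assigned)]
mean_of_pools = sum(per_pool) / len(per_pool)      # unweighted mean over pools
pooled = 100 * sum(with_gene) / sum(assigned)      # ratio of the two totals

print(f"mean of per-pool percentages: {mean_of_pools:.2f}")
print(f"pooled percentage:            {pooled:.2f}")
```

The first figure reproduces the 32.95 shown in the Total row; the pooled ratio 1091/3411 is about one point lower because large pools with few overlapping RF (for example pool 5) weigh more in it.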
14
Position of Q3 filtered projected RF when filtering RF shorter than a threshold (values in %)
                All    > 30 bp   > 60 bp   > 90 bp
Internal        21     21.04     29.1      34.86
  Exonic        47     52.08     55.29     54.87
  Intronic      53     47.92     44.71     42.75
External        79     78.96     70.9      65.14
  Genic         78     79.09     83.12     85.3
    Exonic      69     70.44     73.03     73.82
    Intronic    31     29.56     26.97     26.18
  Intergenic    22     20.91     16.88     14.7
15
Q2 vs Encode 2005
                                            Q2                        Encode 2005
Internal                                    86%                       68%
  Exonic                                    88%                       49%
  Intronic                                  12%                       51%
External                                    14%                       32%
  Genic                                     78%                       80%
    Exonic                                  88%                       70%
      Overlap most 5' exon of a transcript  47%                       23%
    Intronic                                12%                       30%
  Intergenic                                22%                       20%
Novel projected RF                          433 out of 1577 (27.5%)   2859 out of 4951 (57.8%)
16
Distance of RF to closest gene within pool (target gene)
17
Q1, Q3: proportion of RF > 3Mb away from target gene
Q1:
983/10387= 9.4% filtered RF > 3Mb away from target gene
Q3:
1789/3411 = 52.4% RF > 3Mb away from target gene
839/1249 = 67.2% external non exonic RF > 3Mb away from target gene
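As a minimal sketch of how such a proportion could be computed, assume each RF and its target gene are closed (start, end) intervals on the same chromosome. The function names and the toy coordinates below are hypothetical; the slides do not describe the actual implementation.

```python
def interval_distance(a_start, a_end, b_start, b_end):
    """Gap in bp between two closed intervals; 0 if they overlap."""
    if a_end < b_start:
        return b_start - a_end
    if b_end < a_start:
        return a_start - b_end
    return 0

def fraction_far(rfs, gene, threshold=3_000_000):
    """Fraction of RF lying more than `threshold` bp from the target gene."""
    far = sum(1 for start, end in rfs
              if interval_distance(start, end, *gene) > threshold)
    return far / len(rfs)

# Toy coordinates, made up for illustration only
gene = (10_000_000, 10_050_000)
rfs = [
    (10_010_000, 10_010_080),  # inside the gene: distance 0
    (9_000_000, 9_000_050),    # about 1 Mb upstream
    (14_000_000, 14_000_100),  # 3.95 Mb downstream
    (2_000_000, 2_000_040),    # about 8 Mb upstream
]
print(fraction_far(rfs, gene))
```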
18
Min: 0; Max: 10.7 Mb; Average: 556 kb
19
Min: 0; Max: 32 Mb; Average: 5.1 Mb
20
Min: 0; Max: 32 Mb; Average: 6.8 Mb
21
Proportion of Q3 filtered RF > 3 Mb away from target gene
          Projected RF   > 3 Mb away   Proportion (%)
All       3411           1789          52.4
> 30 bp   3302           1726          52.3
> 60 bp   1536           689           44.9
> 90 bp   835            328           39.3
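The length sweep in this table can be sketched as follows, assuming each projected RF is summarised by its length in bp and its distance to the target gene in bp. The function name, the data layout and the toy values are hypothetical.

```python
def far_percentage_by_min_length(rfs, min_lengths, far=3_000_000):
    """rfs: list of (length_bp, distance_bp) pairs (hypothetical layout).
    For each threshold, keep RF strictly longer than it and report
    (n_kept, n_far, pct_far)."""
    result = {}
    for min_len in min_lengths:
        kept = [dist for length, dist in rfs if length > min_len]
        n_far = sum(1 for dist in kept if dist > far)
        pct = round(100 * n_far / len(kept), 2) if kept else None
        result[min_len] = (len(kept), n_far, pct)
    return result

# Toy data, made up for illustration only
toy = [(25, 5_000_000), (45, 100_000), (70, 4_000_000), (95, 0), (120, 3_500_000)]
sweep = far_percentage_by_min_length(toy, [0, 30, 60, 90])
for min_len, row in sweep.items():
    print(min_len, row)
```

A threshold of 0 corresponds to the "All" row of the table.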
22
23
24
25
Do external exonic projected RF overlap most 5' exons of transcripts more than
other exons of transcripts ?
26
Proportion of external exonic projected RF overlapping most 5' exons of transcripts
                      Q1           Q2            Q3
Real                  22.3% (63)   46.5% (206)   23.0% (335)
  Same strand         68.3% (43)   45.6% (94)    62.1% (208)
  Opposite strand     31.7% (20)   54.4% (112)   37.9% (127)
Random                19.8% (56)   30.7% (136)   15.8% (230)
  Same strand         41.1% (23)   54.4% (74)    49.1% (113)
  Opposite strand     58.9% (33)   45.6% (62)    50.9% (117)
27
Does the most 5' RF of a particular gene and a particular tissue overlap most 5'
exons of transcripts more than other RF?
All filtered assignable RF
              Total RF   Projected RF   External RF     External exonic RF   Overlapping a most 5'   Overlapping a most 5' exon of a
                                                                             exon of a transcript    transcript on the same strand
Q1            10387      821            504 (61.39%)    282 (55.95%)         63 (22.34%)             43 (68.25%)
Q2            17477      4561           645 (14.14%)    443 (68.68%)         206 (46.5%)             94 (45.63%)
Q3            3796       3411           2704 (79.27%)   1455 (53.81%)        335 (23.02%)            208 (62.09%)
Encode 2005   24361      4951           1605 (32.42%)   901 (56.14%)         207 (22.97%)            101 (48.79%)
Most 5' filtered assignable RF for each gene and each tissue
              Total RF   Projected RF   External RF     External exonic RF   Overlapping a most 5'   Overlapping a most 5' exon of a
                                                                             exon of a transcript    transcript on the same strand
Q1            575        134            92 (68.66%)     57 (61.96%)          16 (28.07%)             7 (43.75%)
Q2            416        174            111 (63.79%)    102 (91.89%)         49 (48.04%)             47 (95.92%)
Q3            87         78             74 (94.87%)     31 (41.89%)          7 (22.58%)              3 (42.86%)
Encode 2005
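When reading these tables, note that each percentage is relative to the count in the column immediately to its left, not to the total. The short sketch below recomputes the Q1 row of the first table; the function name is ours.

```python
def nested_percentages(counts):
    """Percentage of each count relative to the count just before it."""
    return [round(100 * child / parent, 2)
            for parent, child in zip(counts, counts[1:])]

# Q1, all filtered assignable RF: projected, external, external exonic,
# overlapping a most 5' exon, same strand
q1 = [821, 504, 282, 63, 43]
q1_pct = nested_percentages(q1)
print(q1_pct)
```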
28
Correlation of most 5' RF with CAGE tags
29
Correlation of most 5' RACEfrags with CAGE tags
       Most 5' RF   Random1   Random2   Random3   Random4   Random5   Random6   Random7   Random8   Random9   Random10
50     7            1         1         2.5       0         0         3.5       0         1         1         1
250    22           3.5       7         7         7         3.5       7         1         7         1         8
500    29           5.5       11.5      10.5      10.5      4.5       10.5      4.5       9         8         12.5
1000   45           10.5      18.5      17        14        9         15        10.5      12.5      12.5      18.5
30
Pool unspecific RF
31
Pool unspecific unique RF (USPP-filtered)
Most pool unspecific unique RF are:
Q1: internal exonic (72%)
Q2: internal exonic (87%)
Q3: external (91%) (of which 63% are exonic)
                                 Q1              Q2              Q3
Unique RF in 1 pool              1736 (94.40%)   3032 (62.70%)   3217 (93.00%)
Unique RF in more than 2 pools   102 (5.60%)     1804 (37.30%)   243 (7.00%)
20 unique RF are in more than 4 pools
32
Pool unspecific unique Q3 RF (filtered)
- Hits found by BLAT.
- Need to be done again using our highlighted probe simulator.
Unique RF              Total   With multiple >80% Id hits in chr21 and 22
In more than 4 pools   20      14
In 1 pool              20      5
33
[Figure: one histogram each for Q1, Q2 and Q3]
Q1-Q3: Number of pools a unique RF appears in (unfiltered/filtered)
34
Pool-unspecific RFs in Q3: possibly due to...
1 - Cross-hybridization? Is there a correlation between the number of pools an RF is found in and the number of non-unique probes it overlaps?
-> No. By the way, 135,380 / 2,191,331 (6%) of the probes from the chr21/22 chip have multiple perfect matches in the genome.
35
Pool-unspecific RFs in Q3: possibly due to...
2 - High GC content?
Number of pools the RACEfrag appears in   Number of RACEfrags   Average GC content
1                                         3413                  52%
2                                         191                   55%
3                                         41                    55%
4                                         12                    57%
5                                         4                     56%
6                                         3                     57%
7                                         0                     N/A
8                                         1                     42%
-> Answer: NO!
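A table like this one could be produced along the following lines. This is a sketch under stated assumptions: each RACEfrag is available as a pair of the number of pools it appears in and its genomic sequence, GC content is computed over A/C/G/T only, and the function names are hypothetical.

```python
from collections import defaultdict

def gc_content(seq):
    """GC percentage over unambiguous bases; None if there are none."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None
    return 100 * (seq.count("G") + seq.count("C")) / acgt

def mean_gc_by_pool_count(rfs):
    """rfs: list of (n_pools, sequence) pairs (hypothetical layout).
    Returns {n_pools: (n_rf, mean_gc)}."""
    values = defaultdict(list)
    for n_pools, seq in rfs:
        gc = gc_content(seq)
        if gc is not None:
            values[n_pools].append(gc)
    return {n: (len(v), sum(v) / len(v)) for n, v in values.items()}

# Toy sequences, made up for illustration only
toy = [(1, "ATGC"), (1, "AATT"), (2, "GGCC")]
gc_table = mean_gc_by_pool_count(toy)
print(gc_table)
```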
36
Pool-unspecific RFs in Q3: possibly due to...
3 - Mis-priming on unknown transcripts of chr21 or chr22 (missed by the simulator)?
4 - Genuine chimeric transcripts?
5 - Pooling errors: the same gene is present in >1 pool because it has 2 different identifiers (UCSC Known Genes / RefSeq nomenclature discrepancy).
We found a few cases like this; not sure yet how widespread it is (systematic survey to come).
37
Genes present in several pools
5 genes present in 2 pools:
RP5-1042K10.2,NM_015705 (pools 14,15)
CHODL,NM_024944 (pools 6,10)
NM_005446,P2RXL1 (pools 8,9)
ZNF74,NM_003426 (pools 10,13)
NM_015367,BCL2L13 (pools 12,3)
1 gene present in 3 pools:
NM_021090,NM_001013676,MTMR3 (pools 15,14,16)
Eliminate RF present in these pairs/triplets of pools (problematic pool RF)
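One possible reading of this elimination rule is sketched below: an RF is dropped when it is seen in more than one pool and all of its pools fall within a single pair or triplet listed above. The exact rule is not stated on the slide, so both this interpretation and the names `PROBLEMATIC_GROUPS` and `is_problematic` are assumptions.

```python
# Pairs and triplet of pools sharing a gene, copied from the list above
PROBLEMATIC_GROUPS = [
    {14, 15},      # RP5-1042K10.2 / NM_015705
    {6, 10},       # CHODL / NM_024944
    {8, 9},        # NM_005446 / P2RXL1
    {10, 13},      # ZNF74 / NM_003426
    {3, 12},       # NM_015367 / BCL2L13
    {14, 15, 16},  # NM_021090 / NM_001013676 / MTMR3
]

def is_problematic(pools):
    """True if the RF is in several pools that all belong to one group (assumed rule)."""
    pools = set(pools)
    return len(pools) > 1 and any(pools <= group for group in PROBLEMATIC_GROUPS)

def filter_problematic(rf_pools):
    """rf_pools: {rf_id: set of pools}. Returns the RF that are kept."""
    return {rf: pools for rf, pools in rf_pools.items() if not is_problematic(pools)}

# Toy input, made up for illustration only
toy = {"rf1": {6, 10}, "rf2": {14, 16}, "rf3": {6, 13}, "rf4": {6}, "rf5": {6, 10, 13}}
kept = filter_problematic(toy)
print(sorted(kept))
```

Under this reading, an RF found in pools 6, 10 and 13 is kept because no single shared gene accounts for all three pools; a looser rule would drop it too.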
38
Effect of filtering problematic pool RF on Q3 pool unspecificity
[Figure: bar chart of # RF by # pools (2 to 8), initial RF vs. after filtering problematic pool RF; annotated differences: -48 and -1]
Genes present in several pools do not explain all pool unspecific RF of Q3
39
Distribution of pool specific and pool unspecific unique Q3 RF
                                      Total   Q3 genic       Q3 exonic      Genic           Exonic
Pool specific unique Q3 RF            3217    977 (30.37%)   504 (15.67%)   2668 (82.93%)   1664 (51.73%)
Pool unspecific unique Q3 RF          243     52 (21.4%)     37 (15.23%)    206 (84.77%)    154 (63.37%)
Same without problematic pool Q3 RF   194     34 (17.53%)    21 (10.82%)    158 (81.44%)    116 (59.79%)
Pool unspecific Q3 RF are more:
● external to Q3 genes,
● exonic,
compared to pool specific Q3 RF
40
Pool specific and unspecific RF regarding gene overlap
                     Total   Unassigned     Overlap target gene   Overlap a Q3 gene   Overlap a gene   Do not overlap a gene
Pool specific RF     3217    324 (10.07%)   661 (20.55%)          234 (7.27%)         1498 (46.57%)    500 (15.54%)
Pool unspecific RF   480     55 (11.46%)    29 (6.04%)            39 (8.13%)          269 (56.04%)     88 (18.33%)
Pool specific RF overlap their target gene more than pool unspecific RF
41
Two other criteria for comparing Q3 pool specific and unspecific RF
Overlap with gene in same orientation as target gene
                     Total   Overlap a gene other than target   Overlap a gene other than target but in same orientation
Pool specific RF     3217    1732 (53.84%)                      956 (55.2%)
Pool unspecific RF   480     308 (64.17%)                       172 (55.84%)
Pool unspecific RF behave similarly to pool specific RF
Distance to target gene
                     Min   Max     Avg
Pool specific RF     0     32 Mb   4.9 Mb
Pool unspecific RF   191   28 Mb   7 Mb
Pool unspecific RF are more distant from their target gene
42
43
Impact of index exon position on RF coverage
44
45
USPP filter results
46
Initial set of RF
              Exonic   Intronic   Intergenic   Total
Pool1         3567     1192       1043         5802
Pool1 (%)     61.48    20.54      17.98        100
Pool2         4100     1196       336          5632
Pool2 (%)     72.8     21.24      5.97         100
Total         7667     2388       1379         11434
Total (%)     67.05    20.89      12.06        100
Filtered RF
              Exonic   Intronic   Intergenic   Total
Pool1         3023     1038       747          4808
Pool1 (%)     62.87    21.59      15.54        100
Pool2         4100     1143       336          5579
Pool2 (%)     73.49    20.49      6.02         100
Total         7123     2181       1083         10387
Total (%)     68.58    21         10.43        100
Removed (%)   7.1      8.67       21.46        9.16
The USPP filter removes more intergenic than genic RF
Q1: proportion of exonic, intronic and intergenic RF before and after USPP-based filtering
47
Initial set of RF
              Genic   <50kb   <100kb   <200kb   <500kb   <1Mb    >1Mb    >5Mb    Non genic   Total
Pool1         3618    762     777      1126     1148     1176    1008    295     2184        5802
Pool1 (%)     -       34.89   35.58    51.56    52.56    53.85   46.15   13.51   100         -
Pool2         4026    764     773      787      900      935     671     196     1606        5632
Pool2 (%)     -       47.57   48.13    49       56.04    58.22   41.78   12.2    100         -
Total         7644    1526    1550     1913     2048     2111    1679    491     3790        11434
Total (%)     -       40.26   40.9     50.47    54.04    55.7    44.3    12.96   100         -
Filtered RF
              Genic   <50kb   <100kb   <200kb   <500kb   <1Mb    >1Mb    >5Mb    Non genic   Total
Pool1         3300    752     767      820      842      870     638     272     1508        4808
Pool1 (%)     -       49.87   50.86    54.38    55.84    57.69   42.31   18.04   100         -
Pool2         3973    764     773      787      900      935     671     196     1606        5579
Pool2 (%)     -       47.57   48.13    49       56.04    58.22   41.78   12.2    100         -
Total         7273    1516    1540     1607     1742     1805    1309    468     3114        10387
Total (%)     -       48.68   49.45    51.61    55.94    57.96   42.04   15.03   100         -
Removed (%)   4.85    0.66    0.65     16       14.94    14.5    22.04   4.68    17.84       9.16
The USPP filter removes more RF located:
- from 100 to 200 kb
- from 1 to 5 Mb
to closest gene within pool
Q1: Distance of RF to closest gene within pool before and after the USPP-based filter
48
Number of RACE product vs. probe mismatches allowed for the probe to be considered positive in simulation:   0   1   2
                                                                                                0              1              2
Number of Simulated Positive Probes (SPPs)                                                      21629          21917          22316
Number of positive probes outside of target locus (Unspecific Simulated Positive Probes, USPPs)   6969 / 21629   7063 / 21917   7279 / 22316
Number of genes generating USPPs                                                                19 / 41        23 / 41        34 / 41
Number of experimental RFs overlapping USPPs                                                    259            263            283
Number of genes with RFs overlapping USPPs                                                      8              8              8
Class 0 RFs removed by USPP filter                                                              0: 0%          0: 0%          0: 0%
Class 1 RFs removed by USPP filter                                                              2: 0%          2: 0%          2: 0%
Class 3 RFs removed by USPP filter                                                              185: 18.4%     185: 18.4%     188: 18.7%
Class 5 RFs removed by USPP filter                                                              72: 0.5%       76: 0.5%       93: 0.6%
Q2: Class 0, 1, 3, 5 RF removed by USPP-based filter (using 0, 1 and 2 RACE/probe mismatches)
The USPP filter:
- removes 37 times more 3' RF than 5' RF
- is ~ independent of the number of RACE/probe mismatches
49
Proportion of RF and projected RF eliminated by the USPP-based filter
(projections made by pool)
              Initial number   Number of RF   Percentage of   Initial number    Number of projected   Percentage of projected
              of RF            after filter   RF filtered     of projected RF   RF after filter       RF filtered
Q1            11434            10387          9.16            785               734                   6.5
Q2            17740            17477          1.48            1620              1577                  2.65
Q3            4012             3796           5.38            3102              2933                  5.45
ENCODE 2005   25605            24361          4.86            5104              4951                  3
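The "percentage filtered" columns follow directly from the counts before and after the filter. A small sketch that recomputes them from the table (the function name is ours):

```python
def pct_removed(initial, remaining):
    """Percentage of items eliminated by a filter."""
    return round(100 * (initial - remaining) / initial, 2)

# (initial RF, RF after filter, initial projected RF, projected RF after filter)
datasets = {
    "Q1": (11434, 10387, 785, 734),
    "Q2": (17740, 17477, 1620, 1577),
    "Q3": (4012, 3796, 3102, 2933),
    "ENCODE 2005": (25605, 24361, 5104, 4951),
}

removed = {
    name: (pct_removed(rf0, rf1), pct_removed(proj0, proj1))
    for name, (rf0, rf1, proj0, proj1) in datasets.items()
}
for name, (rf_pct, proj_pct) in removed.items():
    print(f"{name}: {rf_pct}% of RF, {proj_pct}% of projected RF")
```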
50
Proportion of RF and projected RF eliminated by the USPP-based filter
(projections made by pool)
                                             Initial number   Number of RF   Percentage of   Initial number    Number of projected   Percentage of projected
                                             of RF            after filter   RF filtered     of projected RF   RF after filter       RF filtered
ENCODE 2005 initial                          35129            30688          12.64           9178              8569                  6.64
ENCODE 2005 fragmented initial               38001            33426          12.04           9178              8671                  5.52
ENCODE 2005 fragmented filtered by France    25605            24361          4.86            5104              4951                  3
51
Tissue specificity results
52
[Figure] Q1: Number of tissues a unique RF appears in (unfiltered/filtered)
53
[Figure] Q2: Number of tissues a unique RF appears in (unfiltered/filtered)
54
Generating RF from probes
55
Generating RF from probes
                    Initial   Filtered        Genic           Exonic
Exp Q3              4012      3796 (94.62%)   3322 (82.8%)    2174 (65.44%)
Sim Q3 (2 probes)   7448      7178 (96.37%)   5786 (77.69%)   3562 (61.56%)
Sim Q3 (3 probes)   4024      3806 (94.58%)   3333 (82.83%)   2191 (65.74%)
56
Comparison between Encode 2005 and Q2
57
Intersection between Encode 2005 and Q2 RF sets
                                                                    #RF in Encode 2005   #RF in Q2
Total RF                                                            25605                17740
RF corresponding to genes common to Encode 2005 and Q2
  All                                                               3020                 17740
  Globally common                                                   1407 (46.59%)        3579 (20.17%)
RF corresponding to experiments common to Encode 2005 and Q2
  All                                                               1096                 807
  Common in corresponding experiment                                615 (56.11%)         561 (69.52%)
RF corresponding to 14 experiments common to Encode 2005 and Q2
  All                                                               1096                 807
  Common in corresponding experiment and tissue                     339 (30.93%)         460 (57%)
Novel RF corresponding to experiments common to Encode 2005 and Q2
  All                                                               280                  64
  Common in corresponding experiment                                46 (16.43%)          33 (51.56%)
58
Comparison between Q1 and Q3
59
Overlap between Q1 and Q3 RF assigned to genes common to Q1-Q3
Gene       Q1 RF of    Overlapped by a Q3 RF    Q3 RF of    Overlapped by a Q1 RF
           this gene   assigned to this gene    this gene   assigned to this gene
TCP10L     1213        397 (32.73%)             25          18 (72%)
TPTE       588         302 (51.36%)             53          30 (56.6%)
C21orf13   998         632 (63.33%)             47          30 (63.83%)
C21orf59   1316        665 (50.53%)             53          18 (33.96%)
YWHAH      995         392 (39.4%)              73          13 (17.81%)
APP        1446        981 (67.84%)             43          18 (41.86%)
RFPL3S     1008        2 (0.2%)                 22          2 (9.09%)
TBX1       232         121 (52.16%)             15          8 (53.33%)
DSCR2      948         357 (37.66%)             45          18 (40%)
TSPO       645         205 (31.78%)             30          8 (26.67%)
All        9389        4054 (43.18%)            406         163 (40.15%)
40% overlap between Q1 and Q3 RF assigned to genes common to both experiments -> problem in gene assignment?
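The two "overlapped by" columns are directional counts, so they need not agree. A minimal sketch of such a reciprocal comparison, assuming RF of one gene are closed (start, end) intervals on the same chromosome; the function names and toy coordinates are hypothetical.

```python
def overlaps(a, b):
    """True if closed intervals a and b share at least one base."""
    return a[0] <= b[1] and b[0] <= a[1]

def count_overlapped(query, reference):
    """Number of query RF overlapped by at least one reference RF."""
    return sum(1 for q in query if any(overlaps(q, r) for r in reference))

# Toy RF sets for one gene, made up for illustration only
q1_rf = [(100, 150), (300, 380), (1000, 1040)]
q3_rf = [(140, 200), (2000, 2050)]

q1_hit = count_overlapped(q1_rf, q3_rf)   # Q1 RF overlapped by a Q3 RF
q3_hit = count_overlapped(q3_rf, q1_rf)   # Q3 RF overlapped by a Q1 RF
print(f"Q1: {q1_hit}/{len(q1_rf)}, Q3: {q3_hit}/{len(q3_rf)}")
```

The nested loop is quadratic in the number of RF; for sets of the size shown in the table this is fine, while larger sets would call for sorting or an interval tree.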