Evolution of alternative splicing
Mikhail Gelfand
Institute for Information Transmission Problems,
Russian Academy of Sciences
Workshop “Gene Annotation Analysis and Alternative Splicing”
Berlin, December 2004
Overview
• Exon-intron structure of orthologous genes
  – human–mouse
  – Drosophila–Anopheles
• Sequence divergence in alternative and constitutive regions
• Evolution of splicing and regulatory sites
• Alternative splicing and protein structure
Alternative splicing of human (and mouse) genes
5% Sharp, 1994 (Nobel lecture)
35% Mironov-Fickett-Gelfand, 1999
38% Brett-…-Bork, 2000 (ESTs/mRNA)
22% Croft et al., 2000 (ISIS database)
55% Kan et al., 2001 (11% AS patterns conserved in mouse ESTs)
42% Modrek et al., 2001 (HASDB)
~33% CELERA, 2001
59% Human Genome Consortium, 2001
28% Clark and Thanaraj, 2002
all? Kan et al., 2002 (17-28% with total minor isoform frequency > 5%)
41% (mouse) FANTOM & RIKEN, 2002
60% (mouse) Zavolan et al., 2003
• Exon-intron structure of orthologous genes
  – human–mouse
  – Drosophila–Anopheles
• Sequence divergence in alternative and constitutive regions
• Alternative splicing and protein structure
Data
• known alternative splicing
  – HASDB (human, ESTs+mRNAs)
  – ASMamDB (mouse, mRNAs+genes)
• additional variants
  – UniGene (human and mouse EST clusters)
• complete genes and genomic DNA
  – GenBank (full-length mouse genes)
  – human genome
Methods
• TBLASTN (initial identification of orthologs: mRNAs against genomic DNA)
• BLASTN (human mRNAs against genome)
• Pro-EST (spliced alignment, ESTs and mRNA against genomic DNA)
• Pro-Frame (spliced alignment, proteins against genomic DNA)
  – confirmation of orthology
    • same exon-intron structure
    • >70% identity over the entire protein length
  – analysis of conservation of alternative splicing
    • conservation of exons or parts of exons
    • conservation of sites
166 gene pairs
Known alternative splicing: 126 human genes, 124 mouse genes
(42 in human only, 84 in both, 40 in mouse only)
Elementary alternatives
Cassette exon
Alternative donor site
Alternative acceptor site
Retained intron
Human genes
                        mRNA                 EST
                    cons.  non-cons.    cons.  non-cons.
Cassette exons        56      25          74      26
Alt. donors           18       7          16      10
Alt. acceptors        13       5          19      15
Retained introns       4       3           5       0
Total                 96      30         114      51
Total genes           45      28          41      44
Conserved elementary alternatives: 69% (EST) - 76% (mRNA)
Genes with all isoforms conserved: 57 (45%)
Mouse genes
                        mRNA                 EST
                    cons.  non-cons.    cons.  non-cons.
Cassette exons        70       5          39       9
Alt. donors           24       6          17       6
Alt. acceptors        15       6          16       9
Retained introns       8       7          10       4
Total                117      24          82      28
Total genes           68      22          30      26
Conserved elementary alternatives: 75% (EST) - 83% (mRNA)
Genes with all isoforms conserved: 79 (64%)
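The conservation percentages quoted for human and mouse can be recomputed from the "Total" rows of the two tables. A minimal Python sketch, using only those totals (the dictionary layout and function names are illustrative assumptions, not part of the original analysis):

```python
# (conserved, non-conserved) elementary alternatives, taken from the "Total" rows.
TOTALS = {
    "human": {"mRNA": (96, 30), "EST": (114, 51)},
    "mouse": {"mRNA": (117, 24), "EST": (82, 28)},
}


def conserved_percent(conserved, non_conserved):
    """Percentage of elementary alternatives that are conserved."""
    return 100.0 * conserved / (conserved + non_conserved)


def summarize(totals=TOTALS):
    """Map (species, data source) to the rounded percentage of conserved alternatives."""
    return {
        (species, source): round(conserved_percent(*counts))
        for species, by_source in totals.items()
        for source, counts in by_source.items()
    }


if __name__ == "__main__":
    for (species, source), pct in summarize().items():
        print(f"{species:5s} {source:4s} {pct}% conserved")
```

The rounded values reproduce the 69-76% (human) and 75-83% (mouse) ranges stated on the slides.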
Real or aberrant non-conserved AS?
• 24-31% human vs. 17-25% mouse elementary alternatives are not conserved
• 55% human vs. 36% mouse genes have at least one non-conserved variant
• denser coverage of human genes by ESTs:
  – pick up rare (tissue- and stage-specific) => younger variants
  – pick up aberrant (non-functional) variants
• 17-24% mRNA-derived elementary alternatives are non-conserved (compared to 25-32% EST-derived ones)
smoothelin
Figure: human, common, and mouse isoforms, showing a human-specific donor site and a mouse-specific cassette exon.
autoimmune regulator
Figure: human, common, and mouse isoforms, showing a retained intron; downstream exons are read in two frames.
Na/K-ATPase gamma subunit (Fxyd2)
Figure: human, common, and mouse isoforms, showing a (deleted) intron and an alternative acceptor site within the (inserted) intron.
Comparison to other studies.
Modrek and Lee, 2003: skipped exons
• 98% constitutive exons are conserved
• 98% major form exons are conserved
• 28% minor form exons are conserved
• inclusion level is a good predictor of conservation
• inclusion level of conserved exons in human and mouse is highly correlated
Are non-conserved minor form exons errors? No:
• minor form exons are supported by multiple ESTs
• 28% of minor form exons are upregulated in one specific tissue
• 70% of tissue-specific exons are not conserved
• splicing signals of conserved and non-conserved exons are similar
Thanaraj et al., 2003: extrapolation from EST comparisons
• 61% (47-86%) alternative splice junctions are conserved
• 74% (71-78%) constitutive splice junctions are conserved
• the former number is consistent with other studies, whereas the latter seems to be an underestimate
Regulation of alternative splicing: introns
• Brudno et al., 2001: UGCAUG is over-represented downstream of tissue-specific exons (brain, muscle).
• Sorek and Ast, 2003: Enhanced conservation (between human and mouse) in intronic sequences flanking alternatively spliced exons. UGCAUG is over-represented in conserved regions.
• Exon-intron structure of orthologous genes
  – human–mouse
  – Drosophila–Anopheles
• Sequence divergence in alternative and constitutive regions
• Alternative splicing and protein structure
Fruit fly and mosquito
• Technically more difficult than human-mouse:
  – incomplete genomes
  – difficulties in alignment, especially at gene termini
  – changes in exon-intron structure irrespective of alternative splicing (~4.7 introns per gene in Drosophila vs. ~3.5 introns per gene in Anopheles)
Filtering of the dataset
• FlyBase: alternatively spliced fruit fly gene and all its protein isoforms
  – Non-canonical sites: exclude isoform
  – Pro-Frame alignment of all isoforms with the fruit fly genome. Frameshift or in-frame stop for at least one isoform: exclude gene
  – No constitutive segments inside gene: exclude gene
• Result: list of filtered fruit fly genes
• ENSEMBL: mosquito genes; list of orthologous pairs
  – Pro-Frame alignment of all fruit fly isoforms with the mosquito genome
  – Similarity for all isoforms <30%: exclude orthologous pair
  – Poly-N within aligned region in the mosquito genome for at least one isoform: exclude orthologous pair
• Result: set of filtered orthologous pairs
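The filtering steps of this flowchart can be sketched as two small predicates. This is only an illustration in Python: the `Isoform` and `Gene` classes and all field names are hypothetical, the alignment results are assumed to have been computed already (for example by Pro-Frame), and the frameshift check is assumed to apply to the isoforms that survive the non-canonical-site filter.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Isoform:
    name: str
    canonical_sites: bool = True       # all splice sites are canonical
    frameshift_or_stop: bool = False   # seen in the alignment with the fruit fly genome
    mosquito_similarity: float = 0.0   # similarity (%) in the alignment with the mosquito genome
    poly_n_in_alignment: bool = False  # poly-N inside the aligned mosquito region


@dataclass
class Gene:
    name: str
    isoforms: List[Isoform] = field(default_factory=list)
    has_constitutive_segment: bool = True


def filter_fly_gene(gene: Gene) -> Optional[Gene]:
    """Drop isoforms with non-canonical sites; return None if the gene is excluded."""
    kept = [iso for iso in gene.isoforms if iso.canonical_sites]
    if any(iso.frameshift_or_stop for iso in kept):
        return None  # frameshift or in-frame stop for at least one isoform
    if not gene.has_constitutive_segment:
        return None  # no constitutive segments inside the gene
    return Gene(gene.name, kept, gene.has_constitutive_segment)


def keep_orthologous_pair(gene: Gene, min_similarity: float = 30.0) -> bool:
    """Apply the two mosquito-side exclusion rules to a filtered fruit fly gene."""
    if all(iso.mosquito_similarity < min_similarity for iso in gene.isoforms):
        return False  # similarity for all isoforms below 30%
    if any(iso.poly_n_in_alignment for iso in gene.isoforms):
        return False  # poly-N within the aligned region for at least one isoform
    return True
```

A gene would pass through `filter_fly_gene` first and, if it survives, through `keep_orthologous_pair` once its isoforms have been aligned to the mosquito genome.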
Classification of exons and coding segments
• for each pair of isoforms define: mutually Exclusive exon, Cassette exon, retained Intron, alternative Acceptor site, alternative Donor site; then merge these definitions over all pairs for a gene
Figure: three isoforms (isoform 1, isoform 2, isoform 3) in which every exon and coding segment carries a five-position code in the order E, C, I, A, D (for example ----D, ---AD, -C---, EC---, --I--, ---A-, E----, -----). Exons and coding segments are grouped as left marginal, internal, and right marginal. Legend: exon; alternative coding segment; constitutive coding segment; constitutive exon.
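The slide says that the per-pair definitions are merged over all pairs of isoforms but does not say how. One plausible reading, sketched below in Python, is a union of flags: a letter is set in the merged code if it is set in the code from any pair. The function name and the union rule are assumptions made for illustration.

```python
FLAGS = "ECIAD"  # mutually Exclusive, Cassette, retained Intron, alt. Acceptor, alt. Donor


def merge_codes(codes):
    """Merge five-position codes such as '---A-' and '----D' into one code ('---AD').

    Assumed rule: a flag is set in the result if it is set in at least one input code.
    """
    merged = ["-"] * len(FLAGS)
    for code in codes:
        if len(code) != len(FLAGS):
            raise ValueError(f"expected {len(FLAGS)} characters, got {code!r}")
        for pos, char in enumerate(code):
            if char == "-":
                continue
            if char != FLAGS[pos]:
                raise ValueError(f"unexpected {char!r} at position {pos} in {code!r}")
            merged[pos] = char
    return "".join(merged)
```

For example, an exon that shows an alternative acceptor site in one pairwise comparison and an alternative donor site in another would be labelled `---AD`, while an exon with no flags in any comparison keeps the code `-----`.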
How to define conservation of fruit fly alternative exons
• Alignment of an exon may depend on the isoform. In the cases listed below, shorter exons are assumed to be conserved, whereas longer ones are considered missing
Figure: isoform 1 and isoform 2, with missing exons marked by asterisks. Legend:
  – missing: similarity in alignments of all isoforms including this segment was less than 35%
  – conserved: similarity in alignment of at least one isoform including this segment was greater than 35%
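The 35% rule in the legend amounts to an "any isoform" test. A minimal Python sketch, assuming the per-isoform similarities (in percent) are already available; the function name is hypothetical, and treating a similarity of exactly 35% as missing is an assumption, since the slide does not cover that case:

```python
def segment_is_conserved(similarities, threshold=35.0):
    """Return True if at least one isoform alignment containing the segment
    has similarity above the threshold; otherwise the segment counts as missing."""
    return any(similarity > threshold for similarity in similarities)
```

With this rule a segment aligned at 20% in one isoform and 40% in another is called conserved, because the better of the isoform-specific alignments decides.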
Conservation of fruit fly coding segments in the mosquito genome. Small (curated) sample
Type of segment                  Missing      Conserved    Total
left marginal (alternative)      46 (77%)     14 (23%)     60 (12%)
internal alternative             22 (55%)     18 (45%)     40 (8%)
internal constitutive            83 (24%)     264 (76%)    347 (69%)
right marginal (alternative)     31 (56%)     24 (44%)     55 (11%)
Total                            182 (36%)    320 (64%)    502 (100%)
Conservation of fruit fly coding segments in the mosquito genome. Large (non-curated) sample
Type of segment                  Missing       Conserved     Total
left marginal (alternative)      858 (57%)     639 (43%)     1497 (23%)
internal alternative             215 (55%)     178 (45%)     393 (6%)
internal constitutive            903 (23%)     2999 (77%)    3902 (59%)
right marginal (alternative)     414 (53%)     369 (47%)     783 (12%)
Total                            2390 (36%)    4185 (64%)    6575 (100%)
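To compare alternative and constitutive segments directly, the rows of the large sample can be pooled by type. A short Python sketch using the counts from the table (the variable and function names are illustrative assumptions):

```python
# (missing, conserved) counts from the large (non-curated) sample.
LARGE_SAMPLE = {
    "left marginal (alternative)": (858, 639),
    "internal alternative": (215, 178),
    "internal constitutive": (903, 2999),
    "right marginal (alternative)": (414, 369),
}


def conserved_fraction(rows):
    """Fraction of segments that are conserved, pooled over the given rows."""
    missing = sum(m for m, _ in rows)
    conserved = sum(c for _, c in rows)
    return conserved / (missing + conserved)


alternative = [v for k, v in LARGE_SAMPLE.items() if "alternative" in k]
constitutive = [v for k, v in LARGE_SAMPLE.items() if "constitutive" in k]

if __name__ == "__main__":
    print(f"alternative:  {conserved_fraction(alternative):.1%}")
    print(f"constitutive: {conserved_fraction(constitutive):.1%}")
    print(f"all segments: {conserved_fraction(LARGE_SAMPLE.values()):.1%}")
```

Pooled this way, roughly 44% of alternative segments and 77% of constitutive segments are conserved, against 64% overall.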
Classification of slice events for fruit fly exons
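• divided exon
• joined exon
• exactly conserved exon
• mixed
Figure: Drosophila (Dr) and Anopheles (An) exons aligned in slices, with exons labelled d (divided), e (exactly conserved), j (joined), and m (mixed).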
Different types of events for the same exon dependent on an isoform
Figure: the same Drosophila (Dr) exons aligned to Anopheles (An) in isoform 1 and isoform 2, receiving different labels (d, j, e) depending on the isoform.
Types of elementary alternatives and conservation of fruit fly exons in the mosquito genome. Large (non-curated) sample, internal exons
                   missing      mixed       joined       divided      exact
constitutive       728 (23%)    212 (7%)    754 (23%)    407 (13%)    1356 (42%)
Donor site         229 (50%)    21 (5%)     52 (11%)     47 (10%)     130 (28%)
Acceptor site      390 (43%)    45 (5%)     133 (15%)    124 (14%)    250 (28%)
retained Intron    37 (70%)     3 (6%)      2 (4%)       8 (15%)      6 (11%)
Cassette exon      90 (59%)     4 (3%)      9 (6%)       6 (4%)       50 (33%)
Exclusive exon     10 (15%)     1 (1%)      1 (1%)       1 (1%)       55 (82%)
Figure: the same data as a stacked bar chart (0-100%) for constant exons, donor sites, acceptor sites, retained introns, cassette exons, and exclusive exons; categories: exact, divided, joined, mixed, missing.
Fruit fly and mosquito
• The general results are the same as for the human-mouse comparison: more conservation of constitutive segments than alternative ones:
  – 75% const. and 45% alt. segments are conserved
  – constitutive exons: >50% conserved exactly, ~25% intron in Drosophila, ~8% intron in Anopheles
  – conservation of alternatives: 36% cassette exons, 51% donor sites, 63% acceptor sites, 83% mutually exclusive exons
• Exon-intron structure of orthologous genes
  – human–mouse
  – Drosophila–Anopheles
• Sequence divergence in alternative and constitutive regions
• Alternative splicing and protein structure
Concatenates of constitutive and alternative regions in all genes: different evolutionary rates
Region                   dN/dS    Amino-acid identity
Constitutive             0.176    0.886
N-end alternative        0.199    0.874
Internal alternative     0.187    0.878
C-end alternative        0.301    0.807
• Relatively more non-synonymous substitutions in alternative regions (higher dN/dS ratio)
• Lower amino acid identity in alternative regions
Genes with length of both const. and alt. reg. > 80 nt
• Horizontal axis: difference in dN/dS in const. and alt. regions
• Vertical axis: number of genes
• Violet: dN/dS in const. regions > dN/dS in alt. regions
• Yellow: dN/dS in const. regions < dN/dS in alt. regions
Bar heights (number of genes) for the two series:
Difference    0.0-0.1   0.1-0.2   0.2-0.3   0.3-0.4   0.4-0.5   >0.5
Series 1        658       207        79        27        19      27
Series 2        773       333       140        71        44      58
279 proteins from SwissProt+TREMBL with “varsplic” features
              constitutive    alternative    % alt. to all
length        199270          66054          25%
all SNPs      1126            368            25%
synonymous    576 (51%)       167 (45%)      22%
benign        401 (36%)       141 (38%)      26%
damaging      149 (13%)       60 (16%)       29%
Again, there is some evidence of positive selection towards diversity. This is not due to aberrant ESTs (only protein data are considered).
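The "% alt. to all" column is the share of each quantity that falls in alternative regions, and the argument rests on comparing it with the 25% share of sequence length. A minimal Python sketch using the counts from the table (names are illustrative assumptions):

```python
# (constitutive, alternative) values from the table.
LENGTH = (199270, 66054)
SNPS = {
    "synonymous": (576, 167),
    "benign": (401, 141),
    "damaging": (149, 60),
}


def alt_share(constitutive, alternative):
    """Percentage of the total that falls in alternative regions."""
    return 100.0 * alternative / (constitutive + alternative)


if __name__ == "__main__":
    print(f"length     {alt_share(*LENGTH):.0f}%")
    for kind, counts in SNPS.items():
        print(f"{kind:10s} {alt_share(*counts):.0f}%")
```

Synonymous SNPs are under-represented in alternative regions relative to length (22% vs. 25%), while damaging SNPs are over-represented (29%), which is the pattern the slide interprets as selection towards diversity.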
• Exon-intron structure of orthologous genes
  – human–mouse
  – Drosophila–Anopheles
• Sequence divergence in alternative and constitutive regions
• Alternative splicing and protein structure
Data
• Alternatively spliced genes (proteins) from SwissProt
  – human
  – mouse
• Protein structures from PDB
• Domains from InterPro
  – SMART
  – Pfam
  – Prosite
  – etc.
Figure a): stacked bars (0-100%), Expected vs. Observed, with categories:
  – Non-domain functional units partially
  – Domains partially
  – No annotated unit affected
  – Non-domain functional units completely
  – Domains completely
Segment labels: 6%, 10%, 15%, 37%, 40%, 34%, 21%, 19%, 6%, 13%
Alternative splicing avoids disrupting domains (and non-domain units)
Control:
fix the domain structure; randomly place alternative regions
… and this is not simply a consequence of the (disputed) exon-domain correlation
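The control described above (fix the domain structure; randomly place alternative regions) can be sketched as a small Monte Carlo simulation. This Python example is only an illustration: the three-way classification, the half-open residue intervals, uniform placement, and all names are assumptions, and a region that covers one domain completely is counted as "complete" even if it also clips another.

```python
import random


def classify(region, domains):
    """Classify a region (start, end) against domain intervals (half-open coordinates)."""
    r_start, r_end = region
    partial = False
    for d_start, d_end in domains:
        if r_start <= d_start and d_end <= r_end:
            return "complete"  # the region removes or inserts a whole domain
        if r_start < d_end and d_start < r_end:
            partial = True     # the region overlaps part of a domain
    return "partial" if partial else "none"


def expected_distribution(protein_length, domains, region_length, n_trials=10000, seed=0):
    """Place a region of fixed length uniformly at random and tabulate the outcomes."""
    rng = random.Random(seed)
    counts = {"complete": 0, "partial": 0, "none": 0}
    for _ in range(n_trials):
        start = rng.randint(0, protein_length - region_length)
        counts[classify((start, start + region_length), domains)] += 1
    return {kind: count / n_trials for kind, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical 400-residue protein with two annotated domains.
    print(expected_distribution(400, [(50, 150), (220, 300)], region_length=60))
```

The observed distribution for real alternative regions would then be compared with this expected distribution, protein by protein or pooled over the dataset.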
AS & exon boundaries and SMART domains
Figure: ratio (observed/expected), on an axis from 0 to 1, of boundaries inside domains and outside domains, for nonAS_Exons, AS_Exons, and AS in mouse and human.
Positive selection towards domain shuffling (not simply avoidance of disrupting domains)
Figure b): Expected vs. Observed, with categories:
  – Domains completely
  – Non-domain units completely
  – No annotated units affected
Short (<50 aa) alternative splicing events within domains target protein functional sites
Figure c): Expected vs. Observed, with categories:
  – Prosite patterns unaffected
  – Prosite patterns affected
  – FT positions unaffected
  – FT positions affected
An attempt at integration
• AS is often young (as opposed to degenerating)
• young AS isoforms are often minor and tissue-specific
• … but still functional
  – although unique isoforms may be the result of aberrant splicing
• AS regions show evidence for positive selection
  – excess damaging SNPs
  – excess non-synonymous codon substitutions
What to do
• Each isoform (alternative region) can be characterized:
  – by conservation (between genomes)
  – if conserved, by selection (positive vs negative)
    • human-mouse, also add rat; compare species of Drosophila and Caenorhabditis
  – pattern of SNPs (synonymous, benign, damaging)
  – tissue-specificity
    • in particular, whether it is cancer-specific
  – degree of inclusion (major/minor)
  – functionality (for isoforms)
    • whether it generates a frameshift
    • how bad it is (the distance between the stop-codon and the last exon-exon junction)
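The last criterion, the distance between the stop codon and the last exon-exon junction, is simple to compute from exon lengths. The Python sketch below assumes mRNA coordinates counted from the start of the first exon; the 50-nt cutoff is the commonly used rule of thumb for nonsense-mediated decay and does not come from the slides, and all function names are hypothetical.

```python
def last_junction_position(exon_lengths):
    """mRNA coordinate of the last exon-exon junction, or None for a single-exon transcript."""
    if len(exon_lengths) < 2:
        return None
    return sum(exon_lengths[:-1])


def stop_to_last_junction(exon_lengths, stop_end):
    """Distance (nt) from the end of the stop codon to the last junction.

    Positive values mean the stop codon lies upstream of the junction.
    """
    junction = last_junction_position(exon_lengths)
    if junction is None:
        return None
    return junction - stop_end


def likely_nmd_target(exon_lengths, stop_end, threshold=50):
    """Flag isoforms whose stop codon ends more than `threshold` nt upstream of the
    last junction (assumed cutoff, not taken from the slides)."""
    distance = stop_to_last_junction(exon_lengths, stop_end)
    return distance is not None and distance > threshold
```

For a transcript with exons of 100, 200, and 150 nt the last junction is at position 300, so a stop codon ending at position 200 lies 100 nt upstream and the isoform would be flagged, whereas a stop codon in the last exon would not.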
What to expect
• Cancer-specific isoforms will be less functional and more often non-conserved
• Set of non-conserved isoforms will contain a larger fraction of non-functional isoforms; and this may influence evolutionary conclusions on the sequence level
• Still, after removal of non-functional isoforms, one would see positive selection in alternative regions (more non-synonymous substitutions compared to constant regions etc.), especially in tissue-specific ones
References
Nurtdinov RN, Artamonova II, Mironov AA, Gelfand MS (2003) Low conservation of alternative splicing patterns in the human and mouse genomes. Human Molecular Genetics 12: 1313-1320.
Kriventseva EV, Koch I, Apweiler R, Vingron M, Bork P, Gelfand MS, Sunyaev S (2003) Increase of functional diversity by alternative splicing. Trends in Genetics 19: 124-128.
Brudno M, Gelfand MS, Spengler S, Zorn M, Dubchak I, Conboy JG (2001) Computational analysis of candidate intron regulatory elements for tissue-specific alternative pre-mRNA splicing. Nucleic Acids Research 29: 2338-2348.
Mironov AA, Fickett JW, Gelfand MS (1999) Frequent alternative splicing of human genes. Genome Research 9: 1288-1293.
Acknowledgements
• Discussions
  – Vsevolod Makeev (GosNIIGenetika)
  – Eugene Koonin (NCBI)
  – Igor Rogozin (NCBI)
  – Dmitry Petrov (Stanford)
• Support
  – Ludwig Institute of Cancer Research
  – Howard Hughes Medical Institute
  – Russian Fund of Basic Research
  – Russian Academy of Sciences
Authors
• Andrei Mironov (Moscow State University) – spliced alignment
• Ramil Nurtdinov (Moscow State University) – human/mouse comparison
• Irena Artamonova (Institute of Bioorganic Chemistry, now Institute of Bioinformatics, GSF) – human/mouse comparison, MAGEA family
• Dmitry Malko (GosNIIGenetika) – Drosophila/Anopheles comparison
• Inna Dubchak (Lawrence Berkeley Lab) – sites
• Michael Brudno (UC Berkeley, now Stanford) – sites
• Ekaterina Ermakova (Moscow State University) – evolution of alternative/constitutive regions
• Vasily Ramensky (Institute of Molecular Biology) – SNPs
• Eugenia Kriventseva (EBI, now BASF) – protein structure
• Shamil Sunyaev (EMBL, now Harvard University Medical School) – protein structure