Regulation of Alternative Splicing
Jihye Kim
Oral Preliminary Exam (May 7, 2007)
Outline
• Alternative Splicing Overview
• Goal : Investigate “regulation” of AS
• Method : Association Rule Mining
• Part I : Finding association rules of cis-regulatory elements involved in alternative splicing
• Part II : Cis-regulatory Motif Combinations Associated with Tissue-specific Alternative Splicing
• Summary
• Future Work
Splicing
• Introns are removed and flanking exons are concatenated
• Spliceosome
- snRNPs and other proteins
[image from http://fig.cox.miami.edu/~cmallery/150/gene/c7.17.11.spliceosome.jpg]
Splice Sites
• Recognized by spliceosome
• Splice sites are too weak to predict intron location accurately
[image from http://web-books.com/MoBio/Free/Ch5A4.htm]
Splicing Factors and Binding Sites
• Help the spliceosome identify splice sites
• Splicing factors
– SR (serine/arginine-rich) proteins
• Exonic and intronic enhancers and silencers (cis-acting)
– ESE (A/G rich motifs), ESS (hnRNP), ISE (G triples, UGCAUG), ISS
[Source from Katherina Kechris in Rocky’05 Conference]
Alternative Splicing
• Over 70% of genes in the human genome
• Major mechanism to generate protein diversity
• Highly relevant to disease
– 15% of disease-causing mutations affect splicing [Krawczak 1992]
[Krawczak 1992] Krawczak, M., Reiss, J., and Cooper, D.N. 1992 Hum. Genet. 90: 41-54
[figure labels: pre-mRNA, mRNA, protein]
Types of Alternative Splicing
[Source from Cartegni et al. 2002]
Cassette Exon
Investigating Alternative Splicing
• Traditionally, align ESTs and mRNAs to genomic sequences
• Recently, microarray technology (splice arrays)
– Exon skipping is measured
– Hard to measure other types of AS
Previous Work on AS Regulation
• Most methods
– use only sequence data
– focus on the effect of individual motifs
• Brain-specific exon skipping [Brudno 2001]
– 25 brain-specific cassette exons from literature
– Over-representation of UGCAUG in downstream intron
• RESCUE-ESE [Fairbrother 2002]
– Frequent hexamers in exons with weak splice sites
– 10 ESE motifs show enhancer activity in experiment
[Brudno 2001] Brudno M., Gelfand M.S., et al., 2001 NAR 29(11):2338-2348
[Fairbrother 2002] Fairbrother WG., et al., 2002 Science 297(5583):1007-13
What We Have Done So Far
• Investigate cis-regulatory motifs that influence amount of AS or tissue-specific AS
[Jihye Kim, Sihui Zhao, Steffen Heber, “Finding association rules of cis-regulatory elements involved in alternative splicing”, Proceedings of the 45th annual southeast regional conference (ACM-SE) pp. 232 – 237]
[Jihye Kim, Sihui Zhao, Steffen Heber, “Cis-regulatory Motif Combinations Associated with Tissue-specific Alternative Splicing”, 7th workshop on Algorithms in Bioinformatics (WABI 2007) (submitted)]
– Use mouse splice array data
– Apply Association Rule Mining
– Investigate motif combinations involved in tissue-specific AS
AS Datasets in Mouse
• Dataset
– Splice Array [Pan 2004] with 6 probes
– 3126 exon skipping genes in mouse
– %ASex : percentage of exon skipping in 10 tissues
[Pan 2004] Pan, Q., et al., 2004 Mol Cell 16(6):929-942
Aim I-I : representing data context
Association Rule Mining
• By Agrawal et al. in 1993
• Initially used for Market Basket Analysis
• An association rule is a pattern stating that when X occurs, Y occurs with a certain probability
• X : antecedent (left-hand-side, lhs), Y : consequent (right-hand-side, rhs)
• Goal: Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf)
X => Y
Rule Strength Measures
• Given a rule X => Y,
– Support = Pr(X∧Y)
– Confidence = Pr(Y | X)
– Lift = Pr(X∧Y) / (Pr(X) Pr(Y))
• Lift measures the dependency of lhs and rhs
• Generally, lhs and rhs have positive dependency if lift > 1.0
ARM Example
Cart 1 : Milk, Bread, Diaper, Beer, Jam, Banana
Cart 2 : Beer, Nuts, Tissue, Diaper
Cart 3 : Apple, Beer
Cart 4 : Jam, Beer, Diaper
Cart 5 : Bread, Butter, Tissue, Jam
Min supp = 0.5 Min conf = 0.7
Frequent Itemset = itemset whose support ≥ min supp (0.5)
Frequent Itemsets (support)
– Beer (0.8), Jam (0.6), Diaper (0.6)
– {Beer, Diaper} (0.6)
– Not frequent, e.g. Bread (2/5 < 0.5)
Association Rules (confidence)
– Beer => Diaper (0.75)
– Rejected: Beer => Jam (2/4 < 0.7)
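The numbers in the cart example can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `count`, `support`, `confidence` and `lift` are illustrative, and exact fractions are used so the results are not blurred by floating-point rounding.

```python
from fractions import Fraction

# The five shopping carts from the example.
carts = [
    {"Milk", "Bread", "Diaper", "Beer", "Jam", "Banana"},
    {"Beer", "Nuts", "Tissue", "Diaper"},
    {"Apple", "Beer"},
    {"Jam", "Beer", "Diaper"},
    {"Bread", "Butter", "Tissue", "Jam"},
]

def count(itemset, transactions):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in transactions if set(itemset) <= t)

def support(itemset, transactions):
    """Pr(itemset): fraction of transactions containing the itemset."""
    return Fraction(count(itemset, transactions), len(transactions))

def confidence(lhs, rhs, transactions):
    """Pr(rhs | lhs); assumes lhs occurs in at least one transaction."""
    both = count(set(lhs) | set(rhs), transactions)
    return Fraction(both, count(lhs, transactions))

def lift(lhs, rhs, transactions):
    """Pr(lhs and rhs) / (Pr(lhs) * Pr(rhs))."""
    return confidence(lhs, rhs, transactions) / support(rhs, transactions)

print("support {Beer, Diaper}:", float(support({"Beer", "Diaper"}, carts)))
print("conf Beer => Diaper:", float(confidence({"Beer"}, {"Diaper"}, carts)))
print("conf Beer => Jam:", float(confidence({"Beer"}, {"Jam"}, carts)))
print("lift Beer => Diaper:", float(lift({"Beer"}, {"Diaper"}, carts)))
```

Beer => Diaper has lift 5/4, i.e. above 1.0, so Beer and Diaper are positively dependent in this toy data.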
Apriori Algorithm
• Most popular algorithm
• Two steps:
– Find all itemsets that satisfy min_supp (frequent itemsets)
• any subset of a frequent itemset is also frequent
• Find all 1-item frequent itemsets; then all 2-item frequent itemsets, and so on.
– Generate Rules
• A => B is an association rule if Confidence(A => B) ≥ min_conf
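The two steps can be sketched as a small level-wise search over the shopping carts of the ARM example. This is an illustrative, unoptimized sketch of the Apriori idea and not the implementation used in this work; the function names `apriori` and `rules` are assumptions.

```python
from fractions import Fraction
from itertools import combinations

carts = [
    {"Milk", "Bread", "Diaper", "Beer", "Jam", "Banana"},
    {"Beer", "Nuts", "Tissue", "Diaper"},
    {"Apple", "Beer"},
    {"Jam", "Beer", "Diaper"},
    {"Bread", "Butter", "Tissue", "Jam"},
]

def apriori(transactions, min_supp):
    """Step 1: return {itemset: support} for all frequent itemsets."""
    n = len(transactions)

    def supp(items):
        return Fraction(sum(1 for t in transactions if items <= t), n)

    singles = {frozenset([i]) for t in transactions for i in t}
    level = {c: supp(c) for c in singles}
    level = {c: s for c, s in level.items() if s >= min_supp}
    frequent = dict(level)
    k = 2
    while level:
        # Join: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {c: supp(c) for c in candidates}
        level = {c: s for c, s in level.items() if s >= min_supp}
        frequent.update(level)
        k += 1
    return frequent

def rules(frequent, min_conf):
    """Step 2: list (lhs, rhs, confidence) for rules meeting min_conf."""
    found = []
    for itemset, s in frequent.items():
        for r in range(1, len(itemset)):
            for lhs in map(frozenset, combinations(itemset, r)):
                conf = s / frequent[lhs]  # lhs is frequent by downward closure
                if conf >= min_conf:
                    found.append((lhs, itemset - lhs, conf))
    return found

freq = apriori(carts, Fraction(1, 2))
for lhs, rhs, conf in rules(freq, Fraction(7, 10)):
    print(sorted(lhs), "=>", sorted(rhs), float(conf))
```

On this data the sketch reports Beer => Diaper (0.75) as in the example, and additionally Diaper => Beer (1.0), which the slide does not list.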
Part I : Finding association rules of cis-regulatory elements involved in alternative splicing
[Proceedings of the 45th annual southeast regional conference (ACM-SE) Winston-Salem, North Carolina pp. 232 – 237]
K-mers Around Cassette Exon (items)
• Pre-mRNA sequences
– Transcripts from NCBI
– BLAT to align transcripts to mouse genome
– 200 bps from 7 regions around cassette exon
– 2565 genes in total
• Items (6-mers) : AAAAAA to TTTTTT in region 1 … 7
Aim I-I : representing data context
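One way to turn a gene into a transaction of items is to slide a 6-base window over each region and prefix every hexamer with its region number, giving items such as 7_TGAAGA. The sketch below is an assumption about the encoding, not the original pipeline; `kmer_items` and the toy sequences are hypothetical, and the real regions are 200 bp long.

```python
def kmer_items(regions, k=6):
    """Return the set of items '<region>_<kmer>' for all k-mers in each region.

    `regions` maps a region number (1..7) to its DNA sequence.
    """
    items = set()
    for region, seq in regions.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
                items.add(f"{region}_{kmer}")
    return items

# Hypothetical toy gene with only two (very short) regions.
gene = {4: "AGAAGAA", 7: "TGAAGAAN"}
print(sorted(kmer_items(gene)))
# With 4**6 = 4096 hexamers and 7 regions there are 28672 possible items.
print(4 ** 6 * 7)
```

Each gene then becomes one transaction, and the same hexamer in two different regions counts as two different items.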
ARM in Finding AS Motif Rule
• Items : all possible hexamers (motifs)
• Transactions : 2565 AS genes
• Goal : finding motif association rules in AS genes (e.g., AGGATA => TTAGCT)
• By Apriori algorithm [Agrawal 1993]
Find All Frequent Hexamers
Generate Hexamer Rules
[Agrawal 1993] Agrawal R., Imielinski T., Swami AN., 1993 SIGMOD 22(2):207-216
ARM Example
[Example]
Seq 1 : ACGATTAGG
Seq 2 : GAATAGG
Seq 3 : TGCAGG
Seq 4 : GGATTAGG
Seq 5 : CAGAT
Min support = 0.5
Min confidence = 0.7
- Frequent 3-mer sets (support)
AGG (0.8), GAT (0.6), TAG (0.6), {AGG, TAG} (0.6)
- Rules (confidence)
AGG => TAG (0.75)
TAG => AGG (1.0)
Rejected: AGG => GAT, conf = 2 / 4 = 0.5 < minconf
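The sequence example can be reproduced by treating every 3-mer of a sequence as an item. A minimal sketch under that assumption (the helper names `kmers` and `supp` are illustrative); it only searches single 3-mers and pairs, which is enough here because only one pair is frequent.

```python
from fractions import Fraction
from itertools import combinations

seqs = ["ACGATTAGG", "GAATAGG", "TGCAGG", "GGATTAGG", "CAGAT"]

def kmers(seq, k=3):
    """Set of all k-mers occurring in `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

transactions = [kmers(s) for s in seqs]
n = len(transactions)

def supp(items):
    """Fraction of sequences containing every k-mer in `items`."""
    return Fraction(sum(1 for t in transactions if set(items) <= t), n)

min_supp, min_conf = Fraction(1, 2), Fraction(7, 10)

# Frequent single 3-mers, then frequent pairs built from them.
singles = sorted(m for m in set().union(*transactions) if supp({m}) >= min_supp)
pairs = [p for p in combinations(singles, 2) if supp(p) >= min_supp]

# Rules in both directions for each frequent pair.
rules = {}
for a, b in pairs:
    for lhs, rhs in ((a, b), (b, a)):
        conf = supp({lhs, rhs}) / supp({lhs})
        if conf >= min_conf:
            rules[(lhs, rhs)] = conf

print({m: float(supp({m})) for m in singles})
print({p: float(supp(p)) for p in pairs})
print({f"{l} => {r}": float(c) for (l, r), c in rules.items()})
```

Note that AGG => GAT would fail even before the confidence test, since {AGG, GAT} occurs in only 2 of 5 sequences and is therefore not frequent.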
Motif Association Rules from AS Genes
[figure: regions 1 to 7 around the cassette exon]
Frequent 6-mers, Minsup = 0.05 (129 genes)
- 7_TGAAGA, 7_GAAGAA (ASF/SF2, SRp55)
- 6_TTTTCT, 6_AATAAA, …
- Among 6,000 6-mers, 1/3 are in AEDB
- Candidates of regulatory motifs
Association Rules, Minconf = 0.4
- 7_AAAAAT => 7_TGAAGA, 7_AAAGGA => 7_AGAAGA,
- 7_GAAAAA => 7_AAGAAG, 7_CTGCCT => 7_CTGGAG,
- 7_AGGAAA => 7_AAGAAG, 7_AATAAA => 7_AAGAAG
- Candidates of regulatory combinations for AS
Aim I-II : finding motif association rules for all AS genes
Clustering by AS Pattern in 10 Tissues
• Hypothesis : Motif combinations “cause” AS profile
• Cluster genes based on AS profile. We use
– Euclidean distance / Correlation
– Average linkage clustering
• Frequent 6-mers in cluster are motif candidates
Aim I-III : finding motif association rules for cluster
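Clustering genes by AS profile with a correlation distance and average linkage can be sketched in plain Python as follows. This is a naive illustration, not the clustering code used in this work: the distance 1 - Pearson r is one common choice of correlation distance (an assumption here), the toy profiles are hypothetical and cover 4 tissues instead of 10, and a constant profile would cause a division by zero.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equally long profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlation_distance(x, y):
    return 1.0 - pearson(x, y)

def average_linkage(profiles, n_clusters, dist=correlation_distance):
    """Merge the two closest clusters until `n_clusters` remain.

    Cluster distance is the mean pairwise distance between members.
    Returns a list of clusters, each a sorted list of row indices.
    """
    clusters = [[i] for i in range(len(profiles))]

    def cluster_dist(a, b):
        total = sum(dist(profiles[i], profiles[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        index_pairs = ((i, j) for i in range(len(clusters))
                       for j in range(i + 1, len(clusters)))
        i, j = min(index_pairs,
                   key=lambda p: cluster_dist(clusters[p[0]], clusters[p[1]]))
        clusters[i] = sorted(clusters[i] + clusters[j])
        del clusters[j]
    return clusters

# Hypothetical %ASex profiles (rows: genes, columns: tissues).
profiles = [
    [10, 20, 30, 40],
    [12, 22, 33, 41],
    [40, 30, 20, 10],
    [39, 28, 22, 12],
]
print(average_linkage(profiles, 2))
```

Genes 0 and 1 rise across tissues while genes 2 and 3 fall, so the two groups end up in separate clusters; frequent 6-mers would then be mined within each cluster.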
Association Rules from Clusters
[figure: regions 1 to 7 around the cassette exon]
• Lift(X => Y) > 2.0
• Comparison with outside the cluster (p-value < 2.13e-10)
• Association rules are candidates of motif combinations for the corresponding AS pattern
Correlation based clusters
Aim I-III : finding motif association rules for cluster
Part II : Cis-regulatory Motif Combinations Associated with Tissue-specific Alternative Splicing
[7th workshop on Algorithms in Bioinformatics (WABI 2007) (submitted)]
Finding Motifs Involved in Tissue-Specific AS
• Items :
– hexamers in gene regions and
– exon skipping rate in tissues
• Transactions :
– 2565 genes from Pan’s data set
• Goal : find associations, e.g. AGGATA in cassette exon => High exon skipping in Brain
• We focus on complex rules, e.g. {AGGATA in cassette exon, CCTGCG in downstream intron} => High exon skipping in Brain
Aim II-I : finding motif association rules for tissue-specific AS
AS profile items
• Use quartiles to convert numeric %ASex values to categorical AS profile items
– BrainLow : the first %ASex quartile in Brain
– BrainHigh : the last %ASex quartile in Brain
[figure labels: BrainLow, BrainHigh]
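The quartile conversion could look like the sketch below. It is an assumption about the procedure, not the original code: `as_profile_items` is a hypothetical helper, the %ASex values are made up, the quartile cut points come from Python's `statistics.quantiles` with its default method, and how ties at the cut points were handled in the actual analysis is not stated in the slides.

```python
from statistics import quantiles

def as_profile_items(pct_asex, tissue):
    """Map each gene's %ASex in one tissue to '<tissue>Low', '<tissue>High' or None."""
    q1, _, q3 = quantiles(pct_asex, n=4)  # the three quartile cut points
    items = []
    for value in pct_asex:
        if value <= q1:
            items.append(f"{tissue}Low")    # first quartile
        elif value >= q3:
            items.append(f"{tissue}High")   # last quartile
        else:
            items.append(None)              # middle half yields no item
    return items

brain = [5, 12, 20, 33, 47, 58, 71, 90]     # hypothetical %ASex values
print(as_profile_items(brain, "Brain"))
```

Genes in the middle half of the distribution contribute no AS profile item for that tissue, so each Low or High item is carried by roughly a quarter of the genes.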
Motif Combination ARM Example
[Sequence] + [AS profile]
Seq 1 : ACGATTAGG + BH, HH
Seq 2 : GAATAGG + BH, HL
Seq 3 : TGCAGG + BH, HH
Seq 4 : GGATTAGG + BL, HH
Seq 5 : CAGAT + BH, HL
BH : BrainHigh, BL : BrainLow, HH : HeartHigh, HL : HeartLow
Min support = 0.5
Min confidence = 0.7
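To illustrate how motif items and AS profile items are mined together, the sketch below merges each sequence's 3-mers with its AS profile items into one transaction and keeps only rules whose rhs is an AS profile item. It assumes that the i-th sequence belongs with the i-th AS profile row; the helper names are illustrative, and only single-motif antecedents are searched.

```python
from fractions import Fraction

seqs = ["ACGATTAGG", "GAATAGG", "TGCAGG", "GGATTAGG", "CAGAT"]
# BH = BrainHigh, BL = BrainLow, HH = HeartHigh, HL = HeartLow
profiles = [{"BH", "HH"}, {"BH", "HL"}, {"BH", "HH"}, {"BL", "HH"}, {"BH", "HL"}]
PROFILE_ITEMS = {"BH", "BL", "HH", "HL"}

def kmers(seq, k=3):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# One transaction per gene: its motifs plus its AS profile items.
transactions = [kmers(s) | p for s, p in zip(seqs, profiles)]
n = len(transactions)

def supp(items):
    return Fraction(sum(1 for t in transactions if set(items) <= t), n)

min_supp, min_conf = Fraction(1, 2), Fraction(7, 10)
motifs = sorted(set().union(*map(kmers, seqs)))

# Rule appearance: motif on the lhs, AS profile item on the rhs.
found = []
for m in motifs:
    for y in sorted(PROFILE_ITEMS):
        s = supp({m, y})
        if s >= min_supp:
            conf = s / supp({m})
            if conf >= min_conf:
                found.append((m, y, conf, conf / supp({y})))

for m, y, conf, lift in found:
    print(f"{m} => {y}  conf={float(conf)}  lift={float(lift)}")
```

Both AGG => BH and AGG => HH pass the support and confidence thresholds, but only AGG => HH has lift above 1.0 (1.25 versus 0.9375), because BH is already present in 4 of the 5 transactions. This is why a lift threshold is useful in addition to confidence.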
Tissue-Specific AS Motif Combinations
• With strict thresholds
– Min_supp = 0.01, Min_conf = 0.5, Min_lift = 1.2
– MinLen of lhs = 2 (for complex rule)
• Rule appearance
– lhs : hexamers, rhs : AS profile items
• 197 association rules are found in total
• 27 complex rules are found
– lhs : combinations of 34 frequent hexamers
– rhs : AS profile items in tissues
– All rules have >1.9 lift
– 23 rules show motif combinations in different regions
Aim II-I : finding motif association rules for tissue-specific AS
Antecedent Consequent Support Confidence Lift
{X4_GCTGGA, X4_TGCTGG} {IntestineLow} 0.016 0.519 2.006
{X4_GCTGGA, X4_TGCTGG} {LungLow} 0.016 0.506 1.961
{X4_TGCTGG, X4_CTGGAG} {IntestineLow} 0.011 0.539 2.083
{X4_TGCTGG, X4_CTGGAG} {LungLow} 0.010 0.5 1.937
{X5_TTTTTA, X7_AGAGGA} {HeartHigh} 0.010 0.510 2.043
{X1_AGCAGC, X5_TTTTTA} {MuscleHigh} 0.010 0.54 2.220
{X1_GAGCAG, X3_TTTTAA} {MuscleHigh} 0.010 0.510 2.096
{X1_GAGCAG, X3_TTCTTT} {LiverHigh} 0.013 0.508 2.048
{X4_AGAAGA, X5_TTATTT} {SalivaryLow} 0.011 0.528 2.066
{X4_AGAAGA, X5_TTATTT} {HeartLow} 0.011 0.528 2.075
{X4_AGAAGA, X5_TTATTT} {KidneyLow} 0.011 0.528 2.023
{X4_AGAAGA, X5_TTATTT} {LiverLow} 0.011 0.528 2.041
{X3_ATTTTT, X6_TTCCTG} {SalivaryHigh} 0.011 0.509 2.031
{X3_TTGTTT, X6_TGTCTC} {LiverHigh} 0.011 0.5 2.017
{X2_GCCTGG, X3_CCTCTG} {LiverLow} 0.011 0.542 2.092
{X2_GTGGGG, X5_TTGTTT} {MuscleHigh} 0.013 0.516 2.120
{X5_ATTTTA, X6_TGCTGT} {SalivaryHigh} 0.010 0.510 2.034
{X5_TCTTTT, X6_TTGTCT} {SalivaryHigh} 0.010 0.634 2.530
{X3_TCTGTT, X6_TTGTCT} {HeartHigh} 0.012 0.527 2.110
{X5_TTTTTA, X6_TTGTCT} {HeartHigh} 0.014 0.507 2.032
{X3_CTCTTT, X5_TTAAAA} {KidneyHigh} 0.010 0.5 2.042
{X2_GGGTGG, X5_TTATTT} {SalivaryHigh} 0.011 0.510 2.032
{X5_TCTTTT, X6_TTTTCA} {IntestineHigh} 0.011 0.5 2.007
{X3_TTTATT, X6_TTTCCT} {IntestineHigh} 0.014 0.522 2.094
{X5_TCTTTT, X5_TTATTT, X5_TTTTTA} {HeartHigh} 0.010 0.5 2.004
{X5_TTCTTT, X5_TATTTT, X5_TTTTCT} {SalivaryHigh} 0.011 0.527 2.104
{X3_TATTTT, X3_ATTTTT, X5_TTGTTT} {BrainHigh} 0.011 0.510 2.084
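The table values can be read back into gene counts and consequent frequencies with simple arithmetic. The sketch below assumes the 2565 transactions stated for Part II and uses lift = confidence / Pr(rhs); the helper names are illustrative, and because the table values are rounded the results are approximate.

```python
N_GENES = 2565  # transactions in Part II, as stated on the slides

def gene_count(support, n=N_GENES):
    """Approximate number of genes containing both lhs and rhs."""
    return round(support * n)

def implied_rhs_frequency(confidence, lift):
    """Pr(rhs) recovered from lift = confidence / Pr(rhs)."""
    return confidence / lift

rows = [
    ("{X4_GCTGGA, X4_TGCTGG} => {IntestineLow}", 0.016, 0.519, 2.006),
    ("{X5_TCTTTT, X6_TTGTCT} => {SalivaryHigh}", 0.010, 0.634, 2.530),
]
for rule, supp, conf, lift in rows:
    print(rule, gene_count(supp), round(implied_rhs_frequency(conf, lift), 3))
```

The implied rhs frequencies come out near 0.25, which is what one would expect if each Low or High item marks one quartile of the genes, and a support of 0.010 corresponds to roughly 26 genes.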
[figure: regions 1 to 7 around the cassette exon]
Aim II-I : finding motif association rules for tissue-specific AS
{5_TTTTTA, 7_AGAGGA} => {HeartHigh}
AS Profile of Motif Combinations
Aim II-II : analyzing motif combination
[figure: regions 1 to 7 around the cassette exon]
Summary of Graphs
• In some cases, genes containing a single motif show no difference in AS profile from all AS genes
• However, genes containing all motifs of a combination often show significantly changed exon skipping levels
• Combinations of cis-regulatory motifs can influence the AS profile in tissues
Comparison with AEDB
• AEDB in EBI
– Transcript regulatory sequences from literature
– 292 enhancers and silencers
• >60% of extracted frequent hexamers are part of AEDB motifs
• >97% of hexamers involved in complex rules are part of AEDB motifs
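A comparison of this kind can be sketched as a substring test: a hexamer counts as "part of" a known motif if it occurs inside that motif's sequence. The matching criterion is an assumption, since the slides do not spell it out, and the motif strings below are hypothetical stand-ins rather than real AEDB entries.

```python
def fraction_in_known_motifs(hexamers, known_motifs):
    """Return (fraction, hits): hexamers found as a substring of any known motif."""
    hits = [h for h in hexamers if any(h in m for m in known_motifs)]
    return len(hits) / len(hexamers), hits

# Hypothetical stand-ins for database motifs and mined hexamers.
known = ["GGTGAAGAAC", "CTGCATGCA"]
mined = ["TGAAGA", "GAAGAA", "TTTTCT", "GCATGC"]

frac, hits = fraction_in_known_motifs(mined, known)
print(frac, hits)
```

A fuller comparison would also need to settle details such as RNA versus DNA alphabets (U versus T) and whether the region of the motif has to match.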
Summary
• Association rule mining (ARM) applied
• Finding motif association rules for AS
• Finding motif association rules for AS clusters
• Finding motif combinations for tissue-specific AS
Future Work
Improve method
• Improve motif representation, e.g.
– variable motif length, gapped k-mers
– results from motif finding tools
• Improve AS profile representation
• Add more features, e.g.
– position and distance between motifs
– splice site
– exon / intron length
– conservation, gene information
• Statistical analysis
– Thresholds
– Multiple testing
Future Work
• Systematic analysis of simple & complex motifs
• Other data sources
– Human splice array [Johnson 2003]
– ESTs
• Investigate discovered motifs
– Apply motif discovery tools
– Analyze genome occurrence
– Analyze gene and protein structure
• Build predictive model and apply it (if I have enough time)
• Experimental verification
[Johnson 2003] Science. 2003 Dec 19;302(5653):2141-4
Acknowledgements
• Dr. Steffen Heber
• Dr. Eric A. Stone
• Dr. Zhao-Bang Zeng
• Dr. Barbara Sherry
• Sihui Zhao
• Li Zhang
• Hyunmin Kim
THANK YOU