ecs 289a - lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-w03/lec16and17.pdf• filkov et...
TRANSCRIPT
![Page 1: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/1.jpg)
ECS289A, WQ03, Filkov
Data Integration
Lectures 16 & 17
![Page 2: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/2.jpg)
ECS289A, WQ03, Filkov
Lectures Outline• Goals for Data Integration• Homogeneous data integration
– time series data (Filkov et al. 2002)• Heterogeneous data integration
– microarray + sequence– microarray + protein– microarray + location– is integration always beneficial?
• Data Integration for Developmental Networks (Davidson et al., 2002)
![Page 3: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/3.jpg)
ECS289A, WQ03, Filkov
Integrating data from various experiments should yield better understanding of the data compared to that of individual data sets.
![Page 4: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/4.jpg)
ECS289A, WQ03, Filkov
Goals for Data Integration
We integrate data sets with specific goals in mind:– better gene classification– better gene clustering– better regulatory networks
Methods used are the same (modeling):– SVMs– Bayesian inference– Clustering/Classification– Graph models and algorithms– Statistical Significance
![Page 5: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/5.jpg)
ECS289A, WQ03, Filkov
Homogeneous Data
• Expression Data (microarrays)• Sequence Data• Location Data (ChIP)• Protein Expression Data
Common platforms for storage, retrieval and comparison across similar data type
![Page 6: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/6.jpg)
ECS289A, WQ03, Filkov
Homogeneous IntegrationEg. Microarray expression data is compared across treatments to discover differential gene expression, i.e. genes that behave differently under treatment w.r.t control
treatment control
![Page 7: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/7.jpg)
ECS289A, WQ03, Filkov
How to integrate time-series data? Warp the curves so that they all have the same
• amplitude• phase and• period
Plot of ACT1 (YFL039C)
Spellman et al. 1998
Homogeneous Integration: Time-series (Filkov et al., 2002)
Yeast cells in different experiments are synchronized differently
![Page 8: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/8.jpg)
ECS289A, WQ03, Filkov
Warping Time-series
vs.
Final
PhaseAmplitude
Period
![Page 9: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/9.jpg)
ECS289A, WQ03, Filkov
Period and Phase Warp��������������� ������������������������������������ ������������� ��������� �������������
��������������������� ���� ����� ���� ����������� ���
�������������� ������������������ ����
��������������������� ���������������������
��������������������������������
�� ���� �� �!� �!�
�� ���� �� �!� �!�
�� ���� �� �!� �!�
�� ���� �� �!� �!�
������������������������������������������������
����������������������������������
��������������������
![Page 10: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/10.jpg)
ECS289A, WQ03, Filkov
������������� �������� ���������������������������������������������"
�������� �������������
������������
��� ��������� �� ������ ��
������ ##�±����� � $%�±�$�� � $� �&� ��#��
����&��'��(� )%�±��%�� � �%%�±��%�� � �%� �$� ��&&�
����*� $%�±��%�� � )%�±��%�� � �%+�%� �,� �,*��
���� ���� ���� �%� �,� ,$*��
(phase shift determined similarely...)
![Page 11: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/11.jpg)
ECS289A, WQ03, Filkov
-1
-0.5
0
0.5
1
1 3 5 7 9 11 13 15 17
-1-0.5
00.5
1
1 3 5 7 9 11 13 15 17
-1-0.5
00.5
1
1 2 3 4 5 6 7 8 9 10
>gi|12004594|gb|AF217406.1| Saccharomyces cerevisiae uridine nucleosidase (URH1) gene, complete cdsATGGAATCTGCTGATTTTTTTACCTCACGAAACTTATTAAAACAGATAATTTCCCTCATCTGCAAGGTTG GGGAAGGGTTGGACAGCAATAACCAGCGACTGTTTGAAGAAAAGATGACTGTTAGTAAAATACCCATATG GCTAGATTGTGATCCTGGTCATGATGATGCCATAGCCATTTTATTAGGCTGTTTCCATCCAGCTTTCAAT CTTCTAGGAATCAGCACGTGTTTTGGTAACGCACCGCCAGAGAATACTGACTACAACGCCCGTTCTCTTT TGACTGCGATGGGCAAAGCACAAGCGATTCCAGTTTATAAAGGCGCACAGAGACCTTGGAAAAGGGAACC TCATTATGCTCCTGACATTCATGGTATATCAGGTTTAGACGGCACTTCTTTGCTACCTAAGCCAACATTT GAGGCAAGAACTGATAAAACGTATATTGAGGCCATTGAAGAGGCGATTCTAGCTAACAATGGAGAGATAT CCTTTGTGTCTACTGGTGCACTTACCACATTAGCAACAGTTTTTAGGTGTAAACCATACCTAAAAAAATC
>gi|13534|emb|V00696.1|MISC16 Yeast (S. cerevisiae) mitochondrial gene for cytochromeATATATATAATTATAAATATATATATATATAATAAGTATTAATTAATAATATATATTTATATATTTTTTA TTAATTAATATATATAAAATATTAGTAATAAATAATATTATTAATATTTTATAAATAAATAATAATAATA TGGCATTTAGAAAATCAAATGTGTATTTAAGTTTAGTGAATAGTTATATTATTGATTCACCACAACCATC ATCAATTAATTATTGATGAAATATGGGTTCATTATTAGGTTTATGTTTAGTTATTCAAATTGTAACAGGT ATTTTTATGGCTATGCATTATTCATCTAATATTGAATTAGCTTTTTCATCTGTTGAACATATTATAAGAG ATGTGCATAATGGTTATATTTTAAGATATTTACATGCAAATGGTGCATCATTCTTTTTTATGGTAATGTT TATGCATATGGCTAAAGGTTTATATTATGGTTCATATAGATCACCAAGAGTACTATTATGAAATGTAGGT GTTATTATTTTCATTTTAACTATTGCTACAGCTTTTTTAGGTTATTGTTGTGTTTATGGACAGATGTCAC ATTGAGGTGCACTAGTTATTACTAATTTATTCTCAGCAATTCCATTTGTAGGTAACGATATTGTATCTTG
Heterogeneous Data Integration
• Microarray
-������� ��
• DNA Sequence
• Proteomics
������������ �����!
![Page 12: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/12.jpg)
ECS289A, WQ03, Filkov
Why Does It Pay to Integrate?
• Gifford, Computational FunctionalGenomics, Lecture 18
• “Multiple independent constraints can dramatically increase the significance of otherwise elusive effects”
• Dependent vs. Independent
![Page 13: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/13.jpg)
ECS289A, WQ03, Filkov
1. Classification
• Simple, intuition based classifications– compare to the leukemia classification of
Golub et al.,• Machine Learning classifiers (SVMs)
– compare to Cristianini et al.
![Page 14: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/14.jpg)
ECS289A, WQ03, Filkov
Eg. Gene Expression + Protein Interaction Data
(Ge et al., 2001)
Goals 1. To compare the levels of
interaction between proteins encoded by co-expressed genes vs. proteins not encoded by co-expressed genes
2. Improved modeling of protein-protein interactions
Methods
Calculate protein interaction density, and corresponding significance within and between co-expressed clusters of genes
![Page 15: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/15.jpg)
ECS289A, WQ03, Filkov
Transcriptome – Interactome Correlation Maps
![Page 16: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/16.jpg)
ECS289A, WQ03, Filkov
a) Protein-protein interaction data
b) Protein Interaction + Gene Expression Data
More Knowledge Yields Better Models
Stress response proteins
![Page 17: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/17.jpg)
ECS289A, WQ03, Filkov
Eg. Protein Function Prediction(Marcotte et al., 1999)
or
Combining various strategies to link functionally related proteins. Total: 93750 links
Link confidence:• highest confidence (4130 links)• high confidence (19521 links)• rest
![Page 18: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/18.jpg)
ECS289A, WQ03, Filkov
2. Integrating Clusterings(Filkov and Skiena, 2003)
�� ��
Data sets are usefully summarized as clusterings� Functional� Structural� Data Driven
By using multiple clusterings we can learn more, but how?
![Page 19: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/19.jpg)
ECS289A, WQ03, Filkov
����
./������� ������������� ������������0�1����� ������ �����!����2
��
�$
�� ��
�,
�&
�#�*
�
Problem: Find a Consensus Clustering that describes the givenclusterings well
Approach: Integrate theclusterings of data by minimizing the sum of distances between them and a consensus
�=i
i CPdS ),(min
![Page 20: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/20.jpg)
ECS289A, WQ03, Filkov
Solving the Consensus Problem• min S consensus is NP-complete even for a very simple
distance function (Rand Index)• Simple Heuristics based on random element move
between clusters work well on large data sets• a measure of benefit of integration
(artificial data)
�Integrating Spellman’s Data
–Alpha, Avg. SoD = 0.1121
–cdc15, Avg. SoD = 0.1042
–elu, Avg. SoD = 0.1073
–Overall, Avg. SoD = 0.107, benefit
�Spellman + Phylogeny = No benefit
�Spellman + Yeast Stress = Benefit
(real data)
![Page 21: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/21.jpg)
ECS289A, WQ03, Filkov
3. Gene Network Inference
• Data Integration for Link (Graph) Modeling in General
• Probabilistic Setting• Each data source is an “expert” proposing
a model• Independent experts: easy (Gifford 2002)
– independent significances, p1, p2
– combined significance, p=f(p1,p2)
![Page 22: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/22.jpg)
ECS289A, WQ03, Filkov
Graph Models
• Dependent experts (Hartemink et al., 2002)– Joint probability distributions– Bayesian Networks– Model scoring
• Maximizing a Bayesian scoring function• simulated annealing optimizer• averaging over high-scoring models
– Location+expression data used as priors
![Page 23: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/23.jpg)
ECS289A, WQ03, Filkov
![Page 24: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/24.jpg)
ECS289A, WQ03, Filkov
4. Putting It All TogetherDavidson et al., 2002
![Page 25: ECS 289A - Lectures16and17web.cs.ucdavis.edu/~filkov/classes/289a-W03/lec16and17.pdf• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational](https://reader036.vdocuments.mx/reader036/viewer/2022071403/60f7ae164897ea3c620c75f8/html5/thumbnails/25.jpg)
ECS289A, WQ03, Filkov
Bibliography• Chiang et al., Visualizing Associations Between Genome Sequences and
Gene Expression Data Using Genome-Mean Expression Profiles, Bioinformatics, v. 17, 2001, S49-S55.
• Davidson et al., A Genomic Regulatory Network for Development. Science 295 (5560): 1669-2002
• Filkov et al., Analysis Techniques for Microarray Time-Series Data, Journal of Computational Biology 9(2): 317-330 (2002).
• Filkov and Skiena, Integrating Heterogeneous Data Sets via Consensus Clustering, 2003 (in progress)
• Ge et al., Correlation Between Transcriptome and Interactome Mapping Data from Saccharomyces Cerevisiae, Nature Genetics, v. 29, 2001, 482-486.
• Hartemnik et al., Combining Location and Expression Data for Principled Discovery of Genetic Regulatory Network Models, Pacific Symposium onBiocomputing 2002.
• Marcotte et al., A Combined Algorithm for Genome-wide Prediction of Protein Function, Nature, v. 402, 1999, 83-86.
• Pavlidis et al., Learning Gene Functional Classification from Multiple Data Types, Journal of Computational Biology, v. 9, 2002, 401-411.
• Giffrod’s course at MIT: Comp. Funct. Genomics, 2002 http://www.psrg.lcs.mit.edu/6892/handouts/lecture-18-1.pdf