The rest of bioinformatics
Prof. William Stafford Noble, Department of Genome Sciences and Department of Computer Science and Engineering, University of Washington. [email protected]
![Page 1: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/1.jpg)
The rest of bioinformatics
Prof. William Stafford Noble
Department of Genome Sciences
Department of Computer Science and Engineering
University of Washington
![Page 2: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/2.jpg)
One-minute responses
• I always like it when we ask questions and you first say good question, even though the question is not good.
• I liked the lecture although the concepts were a bit advanced for me.
• I understood about 90% of everything.
• The Python is more challenging but it is good to get confused sometimes.
• Python was more interesting!
• The comprehension of Python is improved at 95%.
• Today’s program (first one) was really challenging. I thought the second one was easier to understand.
• Python problem 3 was really challenging for me.
• The Python today was completely different from the rest and needed more time.
• Do your students at home write one-minute responses for the whole semester every day?
– Yes.
• How did we discover the first mutation?
– I am not sure I understand the question. We can observe mutations happening in microorganisms in the lab by sequencing their DNA from one generation to the next.
• Are you going to be readily available in future for consultations in case I get stuck?
– Yes, you can always email me at [email protected].
• I do not think species are related because I believe in creation.
![Page 3: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/3.jpg)
Outline
• Parsimony
• Distance methods
– Computing distances
– Finding the tree
• Maximum likelihood
![Page 4: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/4.jpg)
Revision
• How do we compute the probability of observing this column, given this tree and an assumed model of evolution?
ACGCGTTGGG
ACGCGTTGGG
ACGCAATGAA
ACACAGGGAA
[Figure: column 6 of the alignment (T, T, A, G) placed at the leaves of the tree, giving Pr(column | tree, model).]
![Page 5: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/5.jpg)
Revision
• We enumerate all possible assignments to the internal nodes, compute the probability of each tree, and sum.
[Figure: three copies of the tree with leaves T, T, A, G, each with a different assignment of nucleotides to the internal nodes.]
![Page 6: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/6.jpg)
Revision
• How do we compute the probability of observing this column, given this assigned tree and an assumed model of evolution?
ACGCGTTGGG
ACGCGTTGGG
ACGCAATGAA
ACACAGGGAA
[Figure: the tree with leaves T, T, A, G and internal nodes assigned T, A, A, giving Pr(column | tree, model).]
![Page 7: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/7.jpg)
Revision
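[Figure: the assigned tree with leaves T, T, A, G and internal nodes T, A, A. The base frequencies πA, πC, πG, πT are shown, and the terms of the likelihood are labeled L0 through L6.]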
• We use our evolutionary model to assign a probability to each branch, and then take the product of the probabilities of the branches.
• L(tree) = L0 L1 L2 L3 L4 L5 L6
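The product above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name an evolutionary model, so the Jukes-Cantor model is used here, with equal base frequencies, a made-up branch length of 0.1 on every branch, and a hypothetical placement of the internal nodes T, A, A (T at the root).

```python
import math

def jc69_prob(parent, child, t):
    """Jukes-Cantor probability that `parent` is replaced by `child`
    along a branch of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    if parent == child:
        return 0.25 + 0.75 * decay
    return 0.25 - 0.25 * decay

def assigned_tree_likelihood(root_state, branches, base_freqs):
    """Root base frequency times the product of the branch probabilities."""
    likelihood = base_freqs[root_state]        # root term
    for parent, child, length in branches:     # one term per branch
        likelihood *= jc69_prob(parent, child, length)
    return likelihood

# Assumed values, not taken from the slides.
base_freqs = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
t = 0.1
# Hypothetical layout: root T with two internal children A and A;
# the first A leads to leaves T and T, the second to leaves A and G.
branches = [
    ("T", "A", t), ("T", "A", t),   # root to the two internal nodes
    ("A", "T", t), ("A", "T", t),   # first internal node to its leaves
    ("A", "A", t), ("A", "G", t),   # second internal node to its leaves
]
example_likelihood = assigned_tree_likelihood("T", branches, base_freqs)
print(example_likelihood)
```

The seven factors in the code (one root term and six branch terms) match the seven terms in the product on the slide.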
![Page 8: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/8.jpg)
Revision
• In maximum likelihood estimation, are mutations that occur on branches of a single tree considered independent or mutually exclusive events?
– Independent.
• What do different labelings of internal nodes of a tree represent?
– Different possible evolutionary histories.
• Are the different labelings independent or mutually exclusive?
– Mutually exclusive.
• Are the columns of a multiple alignment considered independent or mutually exclusive?
– Independent.
![Page 9: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/9.jpg)
Maximum likelihood revisited
for each possible tree
    for each column of the alignment
        for each assignment of internal nodes
            for each branch
                compute the probability of that branch
            assigned tree probability ← multiply branch probabilities
        column probability ← sum assigned tree probabilities
    tree probability ← multiply column probabilities
return the tree with the highest probability
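The loop structure above translates almost line for line into Python. The sketch below is a brute-force illustration, not an efficient implementation. It assumes the Jukes-Cantor model with equal base frequencies, a single made-up branch length for every branch, and only three candidate rooted trees. The sequence names s1 to s4 and the node names root, x, y are hypothetical.

```python
import itertools
import math

BASES = "ACGT"

def jc69_prob(parent, child, t):
    """Jukes-Cantor probability of `parent` -> `child` over branch length t."""
    decay = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * decay if parent == child else 0.25 - 0.25 * decay

def column_probability(edges, column, t):
    """Sum the assigned-tree probabilities over all internal-node labelings.
    edges: list of (parent, child) names; column: dict leaf name -> base."""
    internal = sorted({parent for parent, _ in edges})
    total = 0.0
    for states in itertools.product(BASES, repeat=len(internal)):
        labels = dict(column)
        labels.update(zip(internal, states))
        prob = 0.25                    # root term (equal base frequencies)
        for parent, child in edges:    # branches are independent: multiply
            prob *= jc69_prob(labels[parent], labels[child], t)
        total += prob                  # labelings are mutually exclusive: add
    return total

def tree_probability(edges, alignment, t):
    """Multiply the column probabilities (columns treated as independent)."""
    names = list(alignment)
    prob = 1.0
    for i in range(len(alignment[names[0]])):
        column = {name: alignment[name][i] for name in names}
        prob *= column_probability(edges, column, t)
    return prob

def best_tree(candidate_trees, alignment, t=0.1):
    """Return the name of the candidate tree with the highest probability."""
    return max(candidate_trees,
               key=lambda name: tree_probability(candidate_trees[name], alignment, t))

def rooted_quartet(a, b, c, d):
    """Edges of the rooted tree ((a, b), (c, d))."""
    return [("root", "x"), ("root", "y"), ("x", a), ("x", b), ("y", c), ("y", d)]

alignment = {"s1": "ACGCGTTGGG", "s2": "ACGCGTTGGG",
             "s3": "ACGCAATGAA", "s4": "ACACAGGGAA"}
candidate_trees = {
    "((s1,s2),(s3,s4))": rooted_quartet("s1", "s2", "s3", "s4"),
    "((s1,s3),(s2,s4))": rooted_quartet("s1", "s3", "s2", "s4"),
    "((s1,s4),(s2,s3))": rooted_quartet("s1", "s4", "s2", "s3"),
}
print(best_tree(candidate_trees, alignment))
```

Practical programs usually add log probabilities instead of multiplying raw ones and also optimize branch lengths; this sketch skips both to stay close to the pseudocode.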
![Page 10: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/10.jpg)
Sequence analysis tasks
• Protein structure prediction
• Remote homology detection
• Gene finding
![Page 11: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/11.jpg)
Protein structure prediction
• Given: amino acid sequence
• Return: protein structure
A complex of earthworm hemoglobin, composed of 144 globin chains.
Source: Protein Data Bank.
![Page 12: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/12.jpg)
Remote homology detection
• The hidden Markov model generalizes the PSSM used by PSI-BLAST.
• The model is trained using expectation-maximization.
[Figure: a profile hidden Markov model with begin state B, end state E, match states M1 to M8, insert states I0 to I8, and delete states D1 to D8.]
![Page 13: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/13.jpg)
Gene finding
Pedersen and Hein, Bioinformatics 2003.
![Page 14: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/14.jpg)
Mass spectrometry
• Spectrum identification
• Protein inference
• Biomarker discovery
![Page 15: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/15.jpg)
EAMPK
GDIFYPGYCPDVK
LPLENENQGK
ASVYNSFVSNGVK
YVMTFK
ENQGVVNR
![Page 16: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/16.jpg)
Biological networks
• Functional networks
• Protein-protein interaction networks
• Metabolic networks
• Regulatory networks
![Page 17: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/17.jpg)
Adai et al. JMB 340:179-190 (2004).
![Page 18: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/18.jpg)
Protein-protein interactions
• Each node is a protein.
• Each edge is a physical interaction.
• Edges are measured via
– Yeast two-hybrid
– TAP tagging plus MS/MS
Jeong et al. Nature. 2001.
![Page 19: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/19.jpg)
Regulatory networks
• Mammalian cell cycle.
• Colors represent different types of interactions
– Black: binding
– Red: covalent modifications and gene expression
– Green: enzyme actions
– Blue: stimulations and inhibitions
Kohn. Mol Cell Biol. 1999
![Page 20: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/20.jpg)
Metabolic networks
• Nodes are enzymes or metabolites.
• Edges represent interactions.
• This network represents the Arabidopsis TCA cycle.
![Page 21: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/21.jpg)
Gene expression
• Clustering
• Predictive modeling
• Clinical applications
![Page 22: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/22.jpg)
Gene expression matrix
The matrix entry at (i, j) is the expression level of gene i in experiment j.
[Figure: the gene expression matrix, with genes as rows and experiments as columns.]
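As a small illustration of the matrix layout, the Python sketch below stores a made-up expression matrix as a list of rows (one row per gene, one column per experiment) and compares genes by Pearson correlation, which is one common similarity measure used as a starting point for clustering. The values and the helper names `pearson` and `most_similar_pair` are assumptions for the example, not taken from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)

def most_similar_pair(matrix):
    """Return the indices (i, j) of the two genes with the highest correlation."""
    best = None
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            r = pearson(matrix[i], matrix[j])
            if best is None or r > best[0]:
                best = (r, i, j)
    return best[1], best[2]

# Made-up data: expression[i][j] is the level of gene i in experiment j.
expression = [
    [1.0, 2.0, 3.0, 4.0],   # gene 0: rising
    [2.1, 3.9, 6.2, 8.0],   # gene 1: rising, similar shape to gene 0
    [4.0, 3.0, 2.0, 1.0],   # gene 2: falling
]
print(expression[2][0])               # gene 2 in experiment 0
print(most_similar_pair(expression))  # the pair a clustering step would merge first
```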
![Page 23: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/23.jpg)
Fibroblast gene clustering
• Cholesterol biosynthesis
• Cell cycle
• Immediate-early response
• Signaling and angiogenesis
• Wound healing and tissue remodeling
Iyer et al. “The transcriptional program in the response of human fibroblasts to serum.” Science. 283:83-7, 1999.
![Page 24: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/24.jpg)
Achieves >75% accuracy.
![Page 25: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/25.jpg)
Next generation sequencing
Next generation sequencing video
![Page 26: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/26.jpg)
Spaced seed alignment
• Tags and tag-sized pieces of reference are cut into small “seeds.”
• Pairs of spaced seeds are stored in an index.
• Look up spaced seeds for each tag.
• For each “hit,” confirm the remaining positions.
• Report results to the user.
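The steps above can be sketched in Python as follows. This is a simplified illustration, not the scheme of any particular aligner. It assumes that each tag-sized window is cut into four equal seeds, that the reference (rather than the tags) is indexed, and that up to two mismatches are allowed. The reference string and the function names are made up.

```python
from collections import defaultdict
from itertools import combinations

def cut_into_seeds(seq, n_seeds=4):
    """Cut a tag-sized sequence into n_seeds equal pieces."""
    size = len(seq) // n_seeds
    return [seq[i * size:(i + 1) * size] for i in range(n_seeds)]

def build_index(reference, tag_len, n_seeds=4):
    """Index every tag-sized window of the reference by each pair of seeds."""
    index = defaultdict(set)
    for pos in range(len(reference) - tag_len + 1):
        seeds = cut_into_seeds(reference[pos:pos + tag_len], n_seeds)
        for i, j in combinations(range(n_seeds), 2):
            index[(i, j, seeds[i], seeds[j])].add(pos)
    return index

def align_tag(tag, reference, index, max_mismatches=2, n_seeds=4):
    """Look up the tag's seed pairs, then confirm the remaining positions."""
    seeds = cut_into_seeds(tag, n_seeds)
    candidates = set()
    for i, j in combinations(range(n_seeds), 2):
        candidates |= index.get((i, j, seeds[i], seeds[j]), set())
    hits = []
    for pos in sorted(candidates):
        window = reference[pos:pos + len(tag)]
        mismatches = sum(a != b for a, b in zip(tag, window))
        if mismatches <= max_mismatches:
            hits.append((pos, mismatches))
    return hits

# Made-up reference and tag; the tag differs from reference[8:16] at one base.
reference = "ACGTACGTTTGACCATGGCAAGTCCGATAG"
tag = "TTGACGAT"
index = build_index(reference, tag_len=len(tag))
print(align_tag(tag, reference, index))
```

With four seeds and at most two mismatches, at least two seeds of a true hit are error-free, so looking up every pair of seeds cannot miss it.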
![Page 27: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/27.jpg)
Burrows-Wheeler
• Store entire reference genome.
• Align tag base by base from the end.
• When tag is traversed, all active locations are reported.
• If no match is found, then back up and try a substitution.
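A minimal Python sketch of the first three bullets is given below. It builds the Burrows-Wheeler transform by sorting suffixes, which is fine for a toy string but far too slow and memory-hungry for a genome, and it handles exact matches only; the back-up-and-substitute step from the last bullet is left out. The function names and the `$` end marker (assumed not to occur in the reference) are choices made for this example.

```python
def build_fm_index(reference):
    """Build the suffix array, first-column offsets and occurrence counts."""
    text = reference + "$"                      # '$' sorts before A, C, G, T
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in suffix_array)
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first = {}                                  # first[c]: row where c starts
    total = 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    # occ[c][i]: number of times c appears in bwt[:i]
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, b in enumerate(bwt):
        for ch in counts:
            occ[ch][i + 1] = occ[ch][i] + (1 if b == ch else 0)
    return suffix_array, first, occ

def backward_search(tag, fm_index):
    """Align the tag base by base from its end; return all match positions."""
    suffix_array, first, occ = fm_index
    lo, hi = 0, len(suffix_array)               # half-open range of active rows
    for ch in reversed(tag):
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:                            # no active locations remain
            return []
    return sorted(suffix_array[lo:hi])

fm_index = build_fm_index("ACGTACGT")
print(backward_search("ACG", fm_index))  # prints [0, 4]
```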
![Page 28: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/28.jpg)
Spliced-read mapping
• Used for processed mRNA data.
• Reports reads that span introns.
• Examples: TopHat, ERANGE
![Page 29: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/29.jpg)
Beyond the genome
• Epigenetics
• Chromatin state assignment
• Genome 3D architecture
![Page 30: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/30.jpg)
Next generation assays
ENCODE Project Consortium 2011. PLoS Biol 9:e1001046
![Page 31: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/31.jpg)
Rediscovering genes
![Page 32: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/32.jpg)
![Page 33: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/33.jpg)
Population genetics
• Genotype to phenotype
• Human disease genetics
• Population history
![Page 34: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/34.jpg)
jbiol.com
Human migrations
![Page 35: The rest of bioinformatics](https://reader036.vdocuments.mx/reader036/viewer/2022062316/56816770550346895ddc5bd2/html5/thumbnails/35.jpg)
Other topics
• Natural language processing
• Image analysis