Algorithms in Computational Biology (236522)
Spring 2002
Lecturer: Prof. Shlomo Moran
TA: Ydo Wexler
Lecture: Tuesday 12:30-14:30, Taub 6
Tutorial: Tuesday 11:30-12:30, Taub 6
Course Information (pages with this and more info will be distributed by next week)
Requirements & Grades:
• 15-25% homework, in five theoretical question sets [submit within two weeks]. Homework is obligatory.
• 75-85% test. The test grade must be above 55 for the homework grade to count.
• Exam date: to be decided, after coordination with the students.
Bibliography
Biological Sequence Analysis, R.Durbin et al. , Cambridge University Press, 1998
Introduction to Molecular Biology, J. Setubal, J. Meidanis, PWS publishing Company, 1997
A brochure of Prof. Geiger's course from last semester will be available at the Taub library (this semester fewer topics will be covered, some of them possibly in more detail)
url: www.cs.technion.ac.il/~cs236522
Course Prerequisites
Computer Science and Probability Background:
• Data Structures 1 (cs234218)
• Algorithms 1 (cs234247)
• Probability (any course)
Some Biology Background:
• Formally: None, to allow CS students to take this course.
• Recommended: Molecular Biology 1 (especially for those in the Bioinformatics track), or a similar Biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material (see the course web site).
Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.
Biological Background
This class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran.
First homework assignment: Read the first chapter (pages 1-30) of Setubal and Meidanis, 1997 (a copy is available in the Taub building library, and one for loan at Fishbach).
Solve questions 1-3, p. 30 (to be posted on the course web site).
Due: tutorial class of 29.10.02 (2 weeks from today), or earlier in the teaching assistant's mail slot.
Computational Biology
Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics.
Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly, restricting the field to molecular biology only.
Examples of Areas of Interest
• Building evolutionary trees from molecular (and other) data
• Efficiently constructing genomes of various organisms
• Understanding the structure of genomes (SNP, SSR, Genes)
• Understanding function of genes in the cell cycle and disease
• Deciphering structure and function of proteins
_____________________
SNP: Single Nucleotide Polymorphism
SSR: Simple Sequence Repeat
Exponential growth of biological information: growth of sequences, structures, and literature.
Course Goals
Learning about computational tools for (primarily) molecular biology.
Cover computational tasks that are posed by modern molecular biology
Discuss the biological motivation and setup for these tasks
Understand the kinds of solutions that exist and what principles justify them
Topics I
Dealing with DNA/Protein sequences:
• Genome projects and how sequences are found
• Finding similar sequences
• Models of sequences: Hidden Markov Models
• Transcription regulation
• Protein Families
• Gene finding
Topics II
Models of genetic change:
• Long term: evolutionary changes among species
  Reconstructing evolutionary trees from sequences
• Short term: genetic variations in a population
  Finding genes by linkage and association
Topics III (if time allows)
Protein World:
• How proteins fold: secondary & tertiary structure
• How to predict protein folds from sequence data
• How to analyze protein changes from raw experimental measurements (MassSpec)
Human Genome
Most human cells contain
46 chromosomes:
2 sex chromosomes (X,Y):
XY – in males.
XX – in females.
22 pairs of chromosomes named autosomes.
DNA Organization
[Figure: DNA organization. Source: Alberts et al.]
The Double Helix
[Figure: the double helix. Source: Alberts et al.]
DNA Components
Four nucleotide types:
• Adenine
• Guanine
• Cytosine
• Thymine
Hydrogen bonds (electrostatic connection):
• A-T
• C-G
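The pairing rule (A-T, C-G) means one strand determines the other. A minimal sketch of that idea follows; the function names are illustrative, and the reversal step assumes the standard fact that the two strands run in opposite directions, which the slide itself does not state.

```python
# Base pairing from the slide: A pairs with T, C pairs with G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(PAIR[base] for base in strand)

def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction
    (assumes the two strands are antiparallel)."""
    return complement(strand)[::-1]

print(complement("AAGC"))          # TTCG
print(reverse_complement("AAGC"))  # GCTT
```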
Genome Sizes
Organism / region            Size (bases)
E. coli (bacterium)          4.6 x 10^6
Yeast (simple fungus)        15 x 10^6
Smallest human chromosome    50 x 10^6
Entire human genome          3 x 10^9
Genetic Information
Genome – the collection of genetic information.
Chromosomes – storage units of genes.
Gene – basic unit of genetic information. Genes determine the inherited traits.
Genes
The DNA strings include:
• Coding regions ("genes")
  E. coli has ~4,000 genes
  Yeast has ~6,000 genes
  C. elegans has ~13,000 genes
  Humans have ~32,000 genes
• Control regions
  These are typically adjacent to the genes
  They determine when a gene should be "expressed"
• "Junk" DNA (unknown function; ~90% of the DNA in human chromosomes)
The Cell
All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.
Example: Tissues in Stomach
How is this variety encoded and expressed?
Central Dogma
Gene --(transcription)--> mRNA --(translation)--> Protein
Cells express different subsets of the genes in different tissues and under different conditions.
Transcription
Coding sequences can be transcribed to RNA
RNA nucleotides:
• Similar to DNA, slightly different backbone
• Uracil (U) instead of Thymine (T)
[Figure source: Mathews & van Holde]
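At the level of letters, the only change from DNA to RNA described here is U in place of T. The sketch below illustrates this under one simplifying assumption that is not in the slide: the input is the coding (sense) strand, so the RNA has the same sequence with T replaced by U. The function name is illustrative.

```python
def transcribe(coding_strand: str) -> str:
    """Return the RNA string for a DNA coding strand.

    Assumption: the input is the coding (sense) strand, so the RNA
    reads the same except that Uracil (U) replaces Thymine (T).
    """
    return coding_strand.replace("T", "U")

print(transcribe("ATGGCCTAA"))  # AUGGCCUAA
```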
Transcription: RNA Editing
1. Transcribe to RNA
2. Eliminate introns
3. Splice (connect) exons
* Alternative splicing exists
Exons hold information; they are more stable during evolution.
This process takes place in the nucleus. The mRNA molecules then diffuse through the nuclear membrane into the cytoplasm.
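Eliminating introns and connecting exons can be pictured as cutting intervals out of a string. The following is only a toy sketch: the sequences and exon coordinates are made up, and in a real cell the exon boundaries are not handed over as a list.

```python
def splice(pre_mrna: str, exons: list[tuple[int, int]]) -> str:
    """Keep only the exon intervals (start inclusive, end exclusive)
    and join them in order; everything else is treated as intron."""
    return "".join(pre_mrna[start:end] for start, end in exons)

# Made-up example: exon, intron, exon.
exon1, intron, exon2 = "AUGGCC", "GUAAGUACAG", "UUUUAA"
pre_mrna = exon1 + intron + exon2
mature = splice(pre_mrna, [(0, 6), (16, 22)])
print(mature)  # AUGGCCUUUUAA
```

Alternative splicing corresponds to passing a different subset of exon intervals for the same transcript.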
RNA roles
• Messenger RNA (mRNA): Encodes protein sequences. Each group of three nucleotides (a codon) translates to one amino acid (the protein building block).
• Transfer RNA (tRNA): Decodes the mRNA molecules to amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side.
• Ribosomal RNA (rRNA): Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the amino acid hanging from the tRNA to the amino acid chain being created.
...
Translation
• Translation is mediated by the ribosome.
• The ribosome is a complex of protein & rRNA molecules.
• The ribosome attaches to the mRNA at a translation initiation site.
• The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (a polypeptide), which is released and folds into a protein.
Genetic Code
There are 20 amino acids from which proteins are built.
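The genetic code can be written as a lookup table from codons to amino acids. The sketch below builds the standard code table in compact form (one-letter amino acid codes, `*` for stop) and translates a short mRNA. It is a simplification: reading starts at the first letter rather than searching for an initiation site, and the names `CODON_TABLE` and `translate` are illustrative.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed in UCAG order for the first, second,
# and third codon positions; "*" marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon or the end."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate("AUGGCCUAA"))  # MA
```

The 64 codons map onto 20 distinct amino acids plus the stop signal, which is why several codons encode the same amino acid.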
Protein Structure
Proteins are polypeptides of 70-3000 amino acids.
A protein's structure is (mostly) determined by the sequence of amino acids that make it up.
Evolution
Related organisms have similar DNA:
• Similarity in sequences of proteins
• Similarity in organization of genes along the chromosomes
Evolution plays a major role in biology:
• Many mechanisms are shared across a wide range of organisms
• During the course of evolution existing components are adapted for new functions
Evolution
Evolution of new organisms is driven by:
• Diversity: different individuals carry different variants of the same basic blueprint
• Mutations: the DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc.
• Selection bias
The Tree of Life
[Figure: the tree of life. Source: Alberts et al.]
Example for Phylogenetic Analysis
Input: four nucleotide sequences, AAG, AAA, GGA, AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): Pick a tree that minimizes the total number of symbol substitutions between species and their ancestors in the evolutionary tree (also called a phylogenetic tree).
[Figure: an example tree with leaves AGA, AAA, GGA, AAG; internal nodes AAA and AAA; root AAA; edge substitution counts 2, 1, 1.]
Total #substitutions = 4
Example Continued
There are many possible trees. For example:
[Figure, left tree: leaves AGA, GGA, AAA, AAG; internal nodes AAA and AGA; root AAA; edge substitution counts 1, 1, 1.]
Total #substitutions = 3
[Figure, right tree: leaves GGA, AAA, AGA, AAG; internal nodes AAA and AAA; root AAA; edge substitution counts 1, 1, 2.]
Total #substitutions = 4
The left tree is "better" than the right tree.
Questions:
• Does this principle yield realistic phylogenetic trees? (Evolution)
• How can we compute the best tree efficiently? (Computer Science)
• What is the probability of substitutions given the data? (Learning)
• Is the best tree found significantly better than others? (Statistics)
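The substitution counts in this example can be checked mechanically. The sketch below scores each of the three possible pairings of the four sequences using Fitch's small-parsimony procedure, a standard technique that these slides do not name; the tree encoding (nested tuples) and function names are illustrative choices.

```python
def fitch_site(tree, site):
    """Return (candidate ancestral symbols, substitutions) for one column.

    A tree is either a leaf string or a pair (left_subtree, right_subtree).
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, site)
    right_set, right_cost = fitch_site(right, site)
    common = left_set & right_set
    if common:
        return common, left_cost + right_cost
    # No shared symbol: one substitution is needed at this node.
    return left_set | right_set, left_cost + right_cost + 1

def parsimony_score(tree, length):
    """Sum the minimum substitutions over all sequence positions."""
    return sum(fitch_site(tree, i)[1] for i in range(length))

topologies = {
    "((AAA,AAG),(AGA,GGA))": (("AAA", "AAG"), ("AGA", "GGA")),
    "((AAA,AGA),(AAG,GGA))": (("AAA", "AGA"), ("AAG", "GGA")),
    "((AAA,GGA),(AAG,AGA))": (("AAA", "GGA"), ("AAG", "AGA")),
}
scores = {name: parsimony_score(t, 3) for name, t in topologies.items()}
best = min(scores, key=scores.get)
print(scores)
print(best)  # ((AAA,AAG),(AGA,GGA))
```

Enumerating all topologies works for four species but grows very quickly with the number of species, which is the efficiency question raised on the slide.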
Werner's Syndrome
A successful application of genetic analysis for gene hunting
The Disease
• First references in the 1960s
• Causes premature ageing
• Autosomal recessive
• Linkage studies from 1992
• WRN gene cloned in 1996
• Subsequent discovery of mechanisms involved in wild-type and mutant proteins
Identifying the Marker/s
Marker    Distance from prior (cM)    Distance from first (cM)
DHS133                                 0.0
D8S136     7.6                         7.6
D8S137     7.4                        15.0
D8S131     0.9                        15.9
D8S339     6.7                        22.6
D8S259     1.6                        24.2
FGFR       2.5                        26.7
D8S255     2.8                        29.5
ANK        2.1                        31.6
PLAT       2.8                        34.4
D8S165    11.4                        45.8
D8S166     1.0                        46.8
D8S164    43.8                        90.6
Match the most 'likely' cumulative distance against the cumulative distances from the marker file.
Distance 22.6 cM (centimorgans) fell exactly on the marker D8S339.
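The matching step amounts to accumulating the per-marker distances and finding the marker closest to the target. A minimal sketch, using the distances from the marker table; the function names and the nearest-match rule are assumptions for illustration, not a description of the software actually used in the study.

```python
from itertools import accumulate

# (marker, distance from the previous marker in cM), from the marker table.
MARKERS = [
    ("DHS133", 0.0), ("D8S136", 7.6), ("D8S137", 7.4), ("D8S131", 0.9),
    ("D8S339", 6.7), ("D8S259", 1.6), ("FGFR", 2.5), ("D8S255", 2.8),
    ("ANK", 2.1), ("PLAT", 2.8), ("D8S165", 11.4), ("D8S166", 1.0),
    ("D8S164", 43.8),
]

def cumulative_map(markers):
    """Pair each marker with its cumulative distance from the first one."""
    names = [name for name, _ in markers]
    totals = accumulate(dist for _, dist in markers)
    return list(zip(names, totals))

def nearest_marker(markers, target_cm):
    """Return the (marker, cumulative cM) entry closest to target_cm."""
    return min(cumulative_map(markers), key=lambda item: abs(item[1] - target_cm))

name, position = nearest_marker(MARKERS, 22.6)
print(name, round(position, 1))  # D8S339 22.6
```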
Locating D8S339
• The position of marker D8S339 was unknown.
• But the positions of the adjacent markers D8S131 and D8S259 were known.
• Recombination distances from D8S339 to both D8S131 and D8S259 are given.
• By assuming that recombination distance is proportional to physical distance, we estimate the position of D8S339 in the next drawing.
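Under the proportionality assumption, the estimate is a linear interpolation between the two flanking markers. The sketch below uses the recombination distances from the marker table (6.7 cM to D8S131, 1.6 cM to D8S259); the base-pair coordinates are made-up placeholders, since the slides do not give the markers' physical positions.

```python
def interpolate_position(left_bp, right_bp, cm_to_left, cm_to_right):
    """Estimate a physical position between two flanking markers,
    assuming recombination distance is proportional to physical distance."""
    fraction = cm_to_left / (cm_to_left + cm_to_right)
    return left_bp + fraction * (right_bp - left_bp)

# Placeholder coordinates (NOT the real positions of D8S131 and D8S259).
estimate = interpolate_position(1_000_000, 9_300_000, 6.7, 1.6)
print(round(estimate))  # 7700000
```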
Results
[Figure: map showing D8S131 (marker, known position), D8S259 (marker, known position), D8S339 (estimated position, 1993), and WRN (actual position, 1996).]
http://genome.ucsc.edu/cgi-bin/hgTracks?position=chr8:32213515-38608031
Linkage accuracy: ~1,250,000 bp