bioinformatics practical for biochemists - max planck institute for developmental biology ·...
Bioinformatics Practical for Biochemists
01. DNA & Genomics
1
Description
• Lectures about general topics in Bioinformatics & History
• Tutorials will provide you with a toolbox of bioinformatics programs to analyse data
• Hands-On sessions will give you the opportunity to use these tools
WLAN: MPI Lounge
Password: mpi-2011
2
Course Outline
• Mon – DNA & Genomics
• Tue – Introduction to Proteins
• Wed – Annotation of Sequence Features
• Thu – Protein Classification
• Fri – Evolution & Design
Course Material: eb.mpg.de/research/departments/protein-evolution/teaching
3
Course Outline
• 13:00-14:00 Presentation
• 14:15-17:30 Tutorial (2 x 30min) & hands-on practical
• You will need to keep an electronic lab notebook
• Fri afternoon: Test Exercises
4
Software Requirements
• Browser (e.g. Firefox)
• “Advanced” Word Processor
• PyMOL (www.pymol.org – free for teaching)
• Java (https://www.java.com/verify)
• add http://toolkit.tuebingen.mpg.de as ‘Exception Site’ in Java’s security settings
5
DNA & Genomics
1953 Model of DNA (F. Crick)
6
wikipedia.org
What is the “genetic material”?
• 1865 Gregor Mendel
• basic rules of heredity
• 1869 Friedrich Miescher
• discovery of ‘nuclein’ (DNA), Hoppe-Seyler repeated all experiments
• 1881 Edward Zacharias
• chromosomes are composed of nuclein
• 1899 Richard Altmann
• renaming nuclein to nucleic acid
7
• 1928 Frederick Griffith
• “transforming principle” - Str. pneumoniae experiment
• 1944 Avery, MacLeod & McCarty
• Griffith’s “transforming principle” is DNA
Essent. of Molec. Biology, Malathi V, 2012
DNA is the “transforming material”
8
bacteriophagetherapy.info / www.lifesciencesfoundation.org
DNA is the genetic material
• 1950 Erwin Chargaff
• A = T and C = G in amount; base composition is the same in different tissues
• 1952 Hershey & Chase
• showed that DNA is the genetic material using the 32P/35S phage/E. coli experiment
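Chargaff's equalities can be illustrated with a minimal sketch. It assumes (with hindsight from the later double-helix model, not from Chargaff's own data) that the DNA is a Watson-Crick duplex; the toy sequence and the function name `duplex_composition` are hypothetical.

```python
from collections import Counter

# Watson-Crick complement table for DNA
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def duplex_composition(strand: str) -> Counter:
    """Count bases over both strands of a duplex built from one strand."""
    partner = strand.translate(COMPLEMENT)[::-1]  # reverse complement
    return Counter(strand + partner)

# Hypothetical toy sequence; a single strand alone need not obey A = T.
counts = duplex_composition("ATGCGGA")
print(counts["A"], counts["T"], counts["G"], counts["C"])
```

Whatever single strand is supplied, the duplex counts satisfy A = T and G = C, while the A+T to G+C ratio remains free to differ between species.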
9
http://osulibrary.oregonstate.edu/specialcollections/coll/pauling/dna/notes/1952a.22-ms-01.html
Solving the DNA structure
• 1952/53 Linus Pauling
• beat Cavendish Lab in discovery of α-helix
• Cavendish Lab (Cambridge): Watson & Crick allowed to work full-time on DNA
• Pauling shared manuscript with Cavendish Lab before publication (via his son Peter Pauling)
10
Solving the DNA structure
• 1951/1952 Franklin & Wilkins
• 1951 Lecture with Watson attending
• A-DNA / B-DNA
• periodicity, phosphates are outside
• 1953 X-ray of B-DNA (Photo 51)
• Wilkins showed image to Watson
• Perutz showed a confidential committee report to Watson & Crick
11
Nature, 1953
Solving the DNA structure
original papers
NATURE | VOL 421 | 23 JANUARY 2003 | www.nature.com/nature 397© 2003 Nature Publishing Group
12
DNA structure
13
Getting the “code”
• 1953 George E. Palade
• “RNA organelles” (ribosomes)
• 1957 Crick et al.
• suggest non-overlapping triplets
• only 20 out of 64 triplets code for an amino acid
• “comma-free code”
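The figure of 20 out of 64 can be reproduced with a short counting sketch. It assumes the usual argument for a comma-free code: a codeword must never appear in a shifted reading frame, so the four homopolymers (AAA, CCC, GGG, UUU) are excluded and at most one triplet from each set of cyclic rotations (e.g. ACG, CGA, GAC) may be used. The helper name `rotations` is hypothetical.

```python
from itertools import product

BASES = "ACGU"
triplets = ["".join(p) for p in product(BASES, repeat=3)]

def rotations(t: str) -> frozenset:
    """All cyclic rotations of a triplet, e.g. ACG -> {ACG, CGA, GAC}."""
    return frozenset({t, t[1:] + t[:1], t[2:] + t[:2]})

# Homopolymers read the same in every frame, so they cannot be codewords.
homopolymers = [t for t in triplets if len(set(t)) == 1]

# At most one member of each rotation class can be a codeword.
classes = {rotations(t) for t in triplets if len(set(t)) > 1}

print(len(triplets), len(homopolymers), len(classes))
```

The count of rotation classes, (64 - 4) / 3, is only an upper bound on the size of such a code; that it matches the number of amino acids is what made the idea attractive, although the real genetic code turned out not to be comma-free.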
14
Getting the “code”
• 1961 Nirenberg & Matthaei
• polyU mRNA produces polyF protein
• complete genetic code
• 1961 Sydney Brenner
• no overlapping codes
• concept of mRNA
• triplet Code (Crick, Brenner, Barnett, Watts-Tobin)
No. 4809 December 30, 1961 NATURE 1227
GENERAL NATURE OF THE GENETIC CODE FOR PROTEINS
Dr. R. J. WATTS-TOBIN, Medical Research Council Unit for Molecular Biology, Cavendish Laboratory, Cambridge
THERE is now a mass of indirect evidence which suggests that the amino-acid sequence along the polypeptide chain of a protein is determined by the sequence of the bases along some particular part of the nucleic acid of the genetic material. Since there are twenty common amino-acids found throughout Nature, but only four common bases, it has often been surmised that the sequence of the four bases is in some way a code for the sequence of the amino-acids. In this article we report genetic experiments which, together with the work of others, suggest that the genetic code is of the following general type:
(a) A group of three bases (or, less likely, a multiple of three bases) codes one amino-acid.
(b) The code is not of the overlapping type (see Fig. 1).
(c) The sequence of the bases is read from a fixed starting point. This determines how the long sequences of bases are to be correctly read off as triplets. There are no special ‘commas’ to show how to select the right triplets. If the starting point is displaced by one base, then the reading into triplets is displaced, and thus becomes incorrect.
(d) The code is probably ‘degenerate’; that is, in general, one particular amino-acid can be coded by one of several triplets of bases.
The Reading of the Code
The evidence that the genetic code is not overlapping (see Fig. 1) does not come from our work, but from that of Wittmann¹ and of Tsugita and Fraenkel-Conrat on the mutants of tobacco mosaic virus produced by nitrous acid. In an overlapping triplet code, an alteration to one base will in general change three adjacent amino-acids in the polypeptide chain. Their work on the alterations produced in the protein of the virus shows that usually only one amino-acid at a time is changed as a result of treating the ribonucleic acid (RNA) of the virus with nitrous acid. In the rarer cases where two amino-acids are altered (owing presumably to two separate deaminations by the nitrous acid on one piece of RNA), the altered amino-acids are not in adjacent positions in the polypeptide chain.
Brenner³ had previously shown that, if the code were universal (that is, the same throughout Nature), then all overlapping triplet codes were impossible. Moreover, all the abnormal human haemoglobins studied in detail⁴ show only single amino-acid changes. The newer experimental results essentially rule out all simple codes of the overlapping type.
If the code is not overlapping, then there must be some arrangement to show how to select the correct triplets (or quadruplets, or whatever it may be) along the continuous sequence of bases. One obvious suggestion is that, say, every fourth base is a ‘comma’. Another idea is that certain triplets make ‘sense’, whereas others make ‘nonsense’, as in the comma-free codes of Crick, Griffith and Orgel⁵. Alternatively, the correct choice may be made by starting at a fixed point and working along the sequence of bases three (or four, or whatever) at a time. It is this possibility which we now favour.
Experimental Results
Our genetic experiments have been carried out on the B cistron of the rII region of the bacteriophage T4, which attacks strains of Escherichia coli. This is the system so brilliantly exploited by Benzer. The rII region consists of two adjacent genes, or ‘cistrons’, called cistron A and cistron B. The wild-type phage will grow on both E. coli B (here called B) and on E. coli K12 (λ) (here called K), but a phage which has lost the function of either gene will not grow on K. Such a phage produces an r plaque on B. Many point mutations of the genes are known which behave in this way. Deletions of part of the region are also found. Other mutations, known as ‘leaky’, show partial function; that is, they will grow on K but their plaque-type on B is not truly wild. We report here our work on the mutant P 13 (now renamed FC 0) in the B1 segment of the B cistron. This mutant was originally produced by the action of proflavin.
We have previously argued that acridines such as proflavin act as mutagens because they add or delete a base or bases. The most striking evidence in favour of this is that mutants produced by acridines are seldom ‘leaky’; they are almost always completely lacking in the function of the gene. Since our note was published, experimental data from two sources have been added to our previous evidence: (1) we have examined a set of 126 rII mutants made with acridine yellow; of these only 6 are leaky (typically about half the mutants made with base analogues are leaky); (2) Streisinger¹⁰ has found that whereas mutants of the lysozyme of phage T4 produced by base-analogues are usually leaky, all lysozyme mutants produced by proflavin are negative, that is, the function is completely lacking.
If an acridine mutant is produced by, say, adding a base, it should revert to ‘wild-type’ by deleting a base. Our work on revertants of FC 0 shows that it usually
Fig. 1. To show the difference between an overlapping code and a non-overlapping code. The short vertical lines represent the bases of the nucleic acid. The case illustrated is for a triplet code
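The difference between the two kinds of code in Fig. 1 can be made concrete with a small sketch. It is an illustration under simple assumptions, not the authors' analysis: the letters stand in for bases, and the function names are hypothetical.

```python
def non_overlapping(seq: str, start: int = 0) -> list:
    """Read triplets from a fixed starting point, three bases at a time."""
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

def overlapping(seq: str) -> list:
    """Read a triplet at every position (fully overlapping code)."""
    return [seq[i:i + 3] for i in range(len(seq) - 2)]

def changed(before: list, after: list) -> int:
    """Number of triplets that differ between two readings."""
    return sum(a != b for a, b in zip(before, after))

seq = "ABCDEFGHI"                      # toy sequence, letters stand for bases
mutant = seq[:4] + "x" + seq[5:]       # single-base alteration at index 4

print(non_overlapping(seq))            # fixed starting point at index 0
print(non_overlapping(seq, start=1))   # starting point displaced by one base
print(changed(non_overlapping(seq), non_overlapping(mutant)))
print(changed(overlapping(seq), overlapping(mutant)))
```

A single altered base changes one triplet in the non-overlapping reading but three adjacent triplets in the overlapping one, which is the signature the tobacco mosaic virus data failed to show. Displacing the starting point by one base changes every triplet, as in point (c).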
15
E. coli
Getting the “code” – incl. start & stop codons
• Alternative start codon
• AUG (83%)
• GUG (14%)
• UUG (3%)
• Alternative stops
• UAA (63%, ‘ochre’)
• UGA (29%, ‘opal’) / or Sec (selenocysteine)
• UAG (8%, ‘amber’)
➡ Start of protein is not easy to determine
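Why the start is ambiguous can be seen in a minimal sketch that lists every in-frame start codon preceding a stop codon. It assumes DNA-alphabet input, forward frames only, and the three start and three stop codons listed above; the function name and the toy sequence are hypothetical.

```python
STARTS = {"ATG", "GTG", "TTG"}   # AUG, GUG, UUG written as DNA
STOPS = {"TAA", "TGA", "TAG"}    # UAA, UGA, UAG written as DNA

def candidate_starts(dna: str) -> list:
    """For each forward frame, collect in-frame start codons before each stop.

    Returns a list of (start_positions, stop_position) tuples.
    """
    results = []
    for frame in range(3):
        candidates = []
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if codon in STARTS:
                candidates.append(i)
            elif codon in STOPS:
                if candidates:
                    results.append((candidates, i))
                candidates = []   # a stop closes the current open frame
    return results

# Toy sequence: GTG AAA TTG CCC ATG AAA TAA
print(candidate_starts("GTGAAATTGCCCATGAAATAA"))
```

One stop codon here is preceded by three possible starts in the same frame, so sequence alone does not say where the protein begins; additional evidence such as a ribosome binding site or homology is needed.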
16
Gene Structure – Eukaryotes / Prokaryotes
17
Miller, O. L. et al. Visualization of bacterial genes in action. Science 169, 392–395
Gene Structure – Polysomes in Prokaryotes
• EM picture of polysomes on a chromosome
23
Transcription initiation
DNA
mRNA with Ribosomes
Gene Structure – Transcription in Prokaryotes
18
• Promoter immediately adjacent to genes in the upstream direction
• e.g. -35 & -10 regions: TTGACA, TATAAT (Pribnow box)
• Promoter is recognised by the sigma (σ) factor of the RNA polymerase
• different σ factors bind different promoters
• affinity for the sequence motifs regulates the transcription level
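A promoter of this kind can be searched for with a simple consensus scan. The sketch below is a rough illustration, not a real promoter predictor: it counts matches to the two hexamers and assumes a spacer of 15 to 19 bp between them (the spacer range, function names and test sequence are assumptions, not taken from the slides).

```python
def matches(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings agree."""
    return sum(x == y for x, y in zip(a, b))

def best_promoter(seq, box35="TTGACA", box10="TATAAT", spacers=range(15, 20)):
    """Return (score, pos35, pos10) for the best-matching -35/-10 pair.

    The score is the number of matching bases, 12 at most.
    """
    best = None
    for i in range(len(seq) - 5):
        for gap in spacers:
            j = i + 6 + gap
            if j + 6 > len(seq):
                break
            score = matches(seq[i:i + 6], box35) + matches(seq[j:j + 6], box10)
            if best is None or score > best[0]:
                best = (score, i, j)
    return best

# Hypothetical test sequence: perfect boxes separated by a 17 bp spacer
seq = "CCC" + "TTGACA" + "G" * 17 + "TATAAT" + "CCC"
print(best_promoter(seq))
```

Real promoters rarely match the consensus perfectly, and the match score is one crude proxy for the affinity that sets the transcription level; position weight matrices built from aligned promoters are the usual refinement.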
The -10 region of 350 E. coli promoters (sequence logo, weblogo.berkeley.edu; y-axis: 0-2 bits; x-axis: position, 5′ to 3′; top-ranked base per position: T, A, T, A, A, T)
Ermolaeva et al., 2000
Gene Structure – Transcription in Prokaryotes
19
• Rho-independent
• palindromic sequence - RNA forms hairpin
• G+C rich region followed by A+T rich region
• RNA-Pol stalls at hairpin
• falls off since rU-dA region follows
• Rho-dependent
• rho protein binds to nascent RNA (no simple consensus)
• moves towards transcription bubble
• breaks RNA/DNA helix by pulling RNA away
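The Rho-independent signal lends itself to a simple pattern search: a G+C rich inverted repeat (the hairpin stem) followed by a run of U. The sketch below is a deliberately naive illustration; the stem length, loop range, U-run length and GC threshold are assumptions chosen for the example, and only perfect Watson-Crick stems are accepted.

```python
PAIR = str.maketrans("ACGU", "UGCA")

def revcomp(rna: str) -> str:
    """Reverse complement of an RNA string."""
    return rna.translate(PAIR)[::-1]

def find_terminators(rna, stem=6, loops=range(3, 9), min_u=4, min_gc=0.5):
    """Return (stem_start, loop_length) for hairpins followed by a U run."""
    hits = []
    for i in range(len(rna) - 2 * stem):
        left = rna[i:i + stem]
        if sum(b in "GC" for b in left) / stem < min_gc:
            continue                      # stem must be G+C rich
        for loop in loops:
            j = i + stem + loop           # start of the right stem arm
            right = rna[j:j + stem]
            tail = rna[j + stem:j + stem + min_u]
            if right == revcomp(left) and tail == "U" * min_u:
                hits.append((i, loop))
    return hits

# Hypothetical transcript: stem GCCGCC, loop UUCG, stem GGCGGC, then U run
rna = "CCC" + "GCCGCC" + "UUCG" + "GGCGGC" + "UUUUUU" + "A"
print(find_terminators(rna))
```

Dedicated terminator finders additionally score hairpin stability and tolerate mismatches and G-U pairs, which this sketch ignores.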
Ermolaeva et al., 2000
Gene Structure – Translation in Prokaryotes
20
• Initiation
• the start codon (AUG) is guided to the P-site in the 30S subunit by base-pairing of the upstream ribosome binding site with the 3’ end of the 16S rRNA
• Termination
• stop codon is reached
• ribosome stalls
• ‘Release factors’ recognise stop-codons and terminate translation
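The base-pairing step in initiation can be sketched as a search for a purine-rich motif just upstream of the start codon. The sketch assumes that the pairing partner is the CCUCCU core near the 3’ end of E. coli 16S rRNA and that the motif lies within 20 nt of the start codon; the function names, window size and minimum length are assumptions for illustration.

```python
ANTI_SD = "CCUCCU"   # assumption: core of the 16S rRNA 3' end, written 5'->3'
PAIR = str.maketrans("ACGU", "UGCA")

def revcomp(rna: str) -> str:
    return rna.translate(PAIR)[::-1]

def best_sd_match(mrna: str, start: int, window: int = 20, min_len: int = 4):
    """Longest stretch upstream of `start` that can pair with ANTI_SD.

    Returns (motif, position) or None if nothing of min_len pairs.
    """
    sd = revcomp(ANTI_SD)                 # the complementary motif, AGGAGG
    offset = max(0, start - window)
    upstream = mrna[offset:start]
    for k in range(len(sd), min_len - 1, -1):      # longest match first
        for a in range(len(sd) - k + 1):
            pos = upstream.rfind(sd[a:a + k])
            if pos != -1:
                return sd[a:a + k], offset + pos
    return None

# Leader of lacZ written as RNA (assumed example); AUG starts at index 16
mrna = "CACACAGGAAACAGCUAUGACCAUG"
print(best_sd_match(mrna, start=16))
```

Because the match is usually partial, gene finders treat the motif as supporting evidence for a start codon rather than as a strict requirement.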
Bacterial Gene Regulation and Transcriptional Networks, M. Madan Babu
E. coli
Gene Structure – Operons ‘organize’ genes in Prokaryotes
• Operons are transcription units
• polycistronic, i.e. only one promoter (though there are indications that this is not always true)
• up to 15 genes in an operon
• ~60 % of genes
• distance between genes is short; genes might even overlap
• expression level & transcript stability often decrease in 5’-3’ direction
• gene products often with functional association (e.g. enzymes of one biosynthesis pathway)
• not well conserved, gene order might change
22
Griswold, A. (2008) Nature Education 1(1)
Understanding Bioinformatics, Zvelebil & Baum, 2007
Gene Structure – Prokaryotic Operons
lac Operon
1: Regulatory gene
Promoter region
3: β-galactosidase
4: β-gal permease
8: β-gal transacetylase
24
u-tokyo.ac.jp, Chaptman
Gene Structure – Prokaryotes
• Lac Operon
• Promoter 5’ of β-galactosidase (lacZ)
21
Ribosome binding site (rbs)
Promoter / Operator
End of lacI
Start of lacZ
mRNA
CAP-cAMP binding site
RNA polymerase binding site
LacI binding site
-35 / -10
5′-GGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATG-3′
3′-CCTTTCGCCCGTCACTCGCGTTGCGTTAATTACACTCAATCGAGTGAGTAATCCGTGGGGTCCGAAATGTGAAATACGAAGGCCGAGCATACAACACACCTTAACACTCGCCTATTGTTAAAGTGTGTCCTTTGTCGATACTGGTAC-5′
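The labelled features can be located in the 5′→3′ strand of the sequence above (the part beginning GGAAAGCGGG) with plain string searches. This is a sketch under assumptions: the motifs TTTACA (-35), TATGTT (-10) and AGGA (ribosome binding site) are the commonly cited lac elements rather than values read from the slide, and positions are 0-based indices into this fragment.

```python
# 5'->3' strand of the lac regulatory region, copied from the slide
TOP = (
    "GGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCAT"
    "TAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGG"
    "AATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATG"
)

# Assumed motifs for the labelled features (not stated on the slide)
features = {
    "-35 box": "TTTACA",
    "-10 box": "TATGTT",
    "ribosome binding site": "AGGA",
    "lacZ start": "ATGACCATG",
}

positions = {name: TOP.find(motif) for name, motif in features.items()}
spacer = positions["-10 box"] - (positions["-35 box"] + 6)

complement = TOP.translate(str.maketrans("ACGT", "TGCA"))  # other strand, 3'->5'

for name, pos in positions.items():
    print(f"{name:>22}: index {pos}")
print("spacer between -35 and -10:", spacer, "bp")
```

Both lac hexamers deviate from the TTGACA / TATAAT consensus, which fits the picture of a promoter that depends on activation through the CAP-cAMP site.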
Gene Structure – Eukaryotes / Prokaryotes
25
Essent. of Molecular Biol.
Gene Structure – Transcription in Eukaryotes
26
• RNA Pol II transcribes mRNA
• several transcription factors (TF) assist RNA Pol II
• Promoters
• using different TF
• TATA box – TATAAAA (-25 bp)
• Initiator (Inr) sequence - start point of transcription
• GC-box – GGGCG (one or more copies, -40 to -100 bp)
• CAAT-box – CCAAT (-40 to -100 bp)
Essent. of Molecular Biol.
Gene Structure – Transcription in Eukaryotes
27
• Initiation
• Initial step is the binding of Transcription Factor IID (TFIID) to TATA-box
• Termination
• AAUAAA signal
• leads to cleavage of RNA 15-30nt downstream
• poly(A) tail added
• Post-transcriptional modification
• capping (7-methylguanosine)
• tailing (see above)
• splicing
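The termination signal can be turned into a small search that reports where cleavage would be expected. The sketch assumes the 15-30 nt distance is measured from the 3’ end of the AAUAAA hexamer; the function name and the toy transcript are hypothetical.

```python
def polya_cleavage_windows(rna, signal="AAUAAA", lo=15, hi=30):
    """Return (signal_pos, window_start, window_end) for each signal found.

    The window is the half-open index range in which cleavage is expected,
    clipped to the end of the transcript.
    """
    windows = []
    pos = rna.find(signal)
    while pos != -1:
        end = pos + len(signal)
        windows.append((pos, end + lo, min(end + hi, len(rna))))
        pos = rna.find(signal, pos + 1)
    return windows

# Toy transcript: signal at index 10, followed by 40 nt of downstream sequence
rna = "G" * 10 + "AAUAAA" + "C" * 40
print(polya_cleavage_windows(rna))
```

The hexamer also occurs by chance inside transcripts, so a real analysis would combine it with downstream U/GU-rich elements and expression data before calling a poly(A) site.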
Essent. of Molecular Biol.
Gene Structure – Transcription in Eukaryotes
28
• Enhancers/Silencers
• protein-coding genes typically contain several enhancers
• usually 700-1000bp up/downstream or within intron
• control gene expression pattern (e.g. tissue specificity)
• Insulators
• block the interaction of enhancers/silencers with the promoter
• limit the influence of enhancers
Essent. of Molecular Biol.
Gene Structure – Translation in Eukaryotes
29
• Initiation
• eIF-4F binds to Cap-Structure of mRNA
• followed by binding/aligning to preinitiation complex
• Termination
• termination codons
• release factors (eRFs)
• Post-translational Modification
• proteolytic cleavage, acylation, myristoylation, methylation, phosphorylation, acetylation, formylation, sulphation, prenylation…
zazzle.com
30
Gene Structure – Eukaryotes
Gene Structure – Comparison
Eukaryote vs. Prokaryote
Genes
Eukaryote:
• Often have introns
• Intraspecific gene order and number generally relatively stable
• many non-coding (RNA) genes
• There is NOT generally a relationship between organism complexity and gene number
Prokaryote:
• No introns
• Gene order and number may vary between strains of a species
Gene regulation
Eukaryote:
• Promoters, often with distal long range enhancers/silencers, MARs, transcriptional domains
• Generally mono-cistronic
Prokaryote:
• Promoters
• Enhancers/silencers rare
• Genes often regulated as polycistronic operons
Repetitive sequences
Eukaryote:
• Generally highly repetitive with genome wide families from transposable element propagation
Prokaryote:
• Generally few repeated sequences
• Relatively few transposons
Organelle (subgenomes)
Eukaryote:
• Mitochondrial (all)
• chloroplasts (in plants)
Prokaryote:
• Absent
32
Genomic era
• 1975 Frederick Sanger
• dideoxy sequencing
• 1986 Human Genome Initiative
• Genomes
• 1995 H. influenzae 1.8 Mb 1.7k genes
• 1997 E. coli 4.6 Mb 4.3k genes
• 1996 S. cerevisiae 12.5 Mb 5.7k genes
• 1998 C. elegans 100 Mb 21.7k genes
• 2000 D. melanogaster 121 Mb 17k genes
33
Kavanoff, Nature Education : Supercoiled chromosome of E. coli.
Prokaryotic Genome
• E. coli
• 4.6 Mbp (~1.6 mm long as linear DNA)
• Cell: ~1 x 2 µm
• clumps called ‘nucleoid’
• ~80% DNA
• ~ 100 supercoiled domains
• Phosphate charge compensated by e.g. spermidine
34
Science (2001), Nature (2001)
The human genome
• 2001 Draft H. sapiens 2.9 Gb 20-30k genes
35
Qui, Nature 2006
Genome – Packing Problem
• human:
• 2 x 3e9 base pairs
• packed in a nucleus of 6µm ∅
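The scale of the packing problem follows from a short calculation. The sketch assumes an axial rise of 0.34 nm per base pair (B-DNA) and 46 chromosomes in a diploid cell; both values are assumptions added here, while the base-pair count and nucleus diameter are the ones given above.

```python
RISE_PER_BP_NM = 0.34      # assumption: axial rise per base pair of B-DNA
BP_HAPLOID = 3e9           # base pairs per haploid genome (from the slide)
NUCLEUS_DIAMETER_UM = 6    # nucleus diameter in micrometres (from the slide)
N_CHROMOSOMES = 46         # assumption: diploid human cell

# Total contour length of the diploid genome in metres
total_m = 2 * BP_HAPLOID * RISE_PER_BP_NM * 1e-9

# Mean contour length per chromosome in centimetres
mean_chromosome_cm = total_m / N_CHROMOSOMES * 100

# How many nucleus diameters the stretched-out DNA would span
ratio = total_m / (NUCLEUS_DIAMETER_UM * 1e-6)

print(f"{total_m:.2f} m of DNA, {mean_chromosome_cm:.1f} cm per chromosome, "
      f"{ratio:.0f} nucleus diameters")
```

About two metres of DNA therefore has to fit into a compartment several hundred thousand times smaller in diameter, which is what histones and higher-order chromatin structure accomplish.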
Histones
Chromosome
Histone tails
36
The human genome
37
Average length of chromosome: 5 cm; from 8.5 cm (chr1) to 2 cm (chrY)
Genome Structure – Comparison
Eukaryote vs. Prokaryote
Size
Eukaryote:
• Large (10 Mb – 100,000 Mb)
• There is not generally a relationship between organism complexity and its genome size (many plants have larger genomes than human!)
Prokaryote:
• Generally small (<10 Mb; most <5 Mb)
• Complexity (as measured by # of genes and metabolism) generally proportional to genome size
Content
Eukaryote:
• Most DNA is non-coding
Prokaryote:
• DNA is “coding gene dense”
Telomeres / Centromeres
Eukaryote:
• Present (linear DNA)
Prokaryote:
• Circular DNA, doesn't need telomeres
• Don’t have mitosis, hence, no centromeres
Number of chromosomes
Eukaryote:
• More than one, (often) including those discriminating sexual identity
Prokaryote:
• Often one, sometimes more, but plasmids are not true chromosomes
Chromatin
Eukaryote:
• Histone bound (which serves as a genome regulation point)
Prokaryote:
• No histones
• Uses supercoiling to pack genome
38
Gene content
39
Hu et al, 2009; Reichardt, 2007
Gene content
• ≥ 33% of genes in E. coli of unknown function
• 5% of these orphan genes - unique to E. coli
• ~ 40% of genes in human of unknown function
• 5% orphan genes
40
Gregory (2005), Nature
Human Genome Content
• LINEs: 20.4%
• SINEs: 13.1%
• LTR retrotransposons: 8.3%
• DNA transposons: 2.9%
• Simple sequence repeats: 3%
• Segmental duplications: 5%
• Miscellaneous heterochromatin: 8%
• Miscellaneous unique sequences: 11.6%
• Introns: 25.9%
• Protein-coding genes: 1.5%
41
Next Generation Sequencing (NGS)
43
Jason M. Rizzo, and Michael J. Buck Cancer Prev Res 2012;5:887-900
Next Generation Sequencing (NGS)
44
Jason M. Rizzo, and Michael J. Buck Cancer Prev Res 2012;5:887-900
Major New Microbial Groups Expand Diversity and Alter our Understanding of the Tree of Life - Castelle & Banfield, 2018, Cell
Genomic Era – steady exploration of the Tree of Life