bioinformatics practical for biochemists - max planck institute for developmental biology ·...

Bioinformatics Practical for

Biochemists

01. DNA & Genomics


Description

• Lectures about general topics in Bioinformatics & History

• Tutorials will provide you with a toolbox of bioinformatics programs to analyse data

• Hands-On sessions will give you the opportunity to use these tools

WLAN: MPI Lounge

Password: mpi-2011


Course Outline

• Mon – DNA & Genomics

• Tue – Introduction to Proteins

• Wed – Annotation of Sequence Features

• Thu – Protein Classification

• Fri – Evolution & Design

Course Material: eb.mpg.de/research/departments/protein-evolution/teaching


Course Outline

• 13:00-14:00 Presentation

• 14:15-17:30 Tutorial (2 x 30min) & hands-on practical

• You will need to keep an electronic lab notebook

• Fri afternoon: Test Exercises


Software Requirements

• Browser (e.g. Firefox)

• “Advanced” Word Processor

• PyMOL (www.pymol.org – free for teaching)

• Java (https://www.java.com/verify)

• add http://toolkit.tuebingen.mpg.de as an ‘Exception Site’ in Java’s security settings


DNA & Genomics

1953 Model of DNA (F. Crick)


wikipedia.org

What is the “genetic material”?

• 1865 Gregor Mendel

• basic rules of heredity

• 1869 Friedrich Miescher

• discovery of ‘nuclein’ (DNA), Hoppe-Seyler repeated all experiments

• 1881 Edward Zacharias

• chromosomes are composed of nuclein

• 1889 Richard Altmann

• renaming nuclein to nucleic acid


• 1928 Frederick Griffith

• “transforming principle” - Str. pneumoniae experiment

• 1944 Avery, MacLeod & McCarty

• Griffith’s “transforming principle” is DNA

Essent. of Molec. Biology, Malathi V, 2012

DNA is the “transforming material”


bacteriophagetherapy.info / www.lifesciencesfoundation.org

DNA is the genetic material

• 1950 Erwin Chargaff

• amounts of A = T and C = G; base composition is the same in different tissues of a species

• 1952 Hershey & Chase

• showed that DNA is the genetic material, using 32P/35S-labelled phage infecting E. coli


http://www.lifesciencesfoundation.or

http://osulibrary.oregonstate.edu/specialcollections/coll/pauling/dna/notes/1952a.22-ms-01.html

Solving the DNA structure

• 1952/53 Linus Pauling

• beat Cavendish Lab in discovery of α-helix

• Cavendish Lab (Cambridge): Watson & Crick allowed to work full-time on DNA

• Pauling shared his manuscript with the Cavendish Lab before publication (via his son Peter Pauling)



• 1951/1952 Franklin & Wilkins

• 1951 Lecture with Watson attending

• A-DNA / B-DNA

• periodicity, phosphates are outside

• 1953 X-ray of B-DNA (Photo 51)

• Wilkins showed the image to Watson

• Perutz showed a confidential committee report to Watson & Crick


DNA structure


Getting the “code”

• 1953 George E. Palade

• “RNA organelles” (ribosomes)

• 1957 Crick et al.

• suggest non-overlapping triplets

• only 20 of the 64 triplets code for an amino acid

• “comma-free code”
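The number 20 in the comma-free proposal follows from a short counting argument, sketched here in Python. The function names are illustrative and not part of the course material. The sketch assumes the standard reasoning: a homopolymer triplet such as AAA cannot be a codeword (AAAAAA would read the same in every frame), and of the three cyclic rotations of any other triplet at most one can be a codeword.

```python
from itertools import product

BASES = "ACGU"

def cyclic_class(triplet):
    """Return the lexicographically smallest rotation of a triplet."""
    rotations = [triplet[i:] + triplet[:i] for i in range(3)]
    return min(rotations)

def comma_free_upper_bound():
    """Count all triplets, the non-homopolymer triplets, and their rotation classes."""
    triplets = ["".join(p) for p in product(BASES, repeat=3)]
    # Homopolymers (AAA, CCC, GGG, UUU) are excluded from a comma-free code.
    candidates = [t for t in triplets if len(set(t)) > 1]
    # At most one member of each rotation class can be a codeword.
    classes = {cyclic_class(t) for t in candidates}
    return len(triplets), len(candidates), len(classes)

total, candidates, upper_bound = comma_free_upper_bound()
print(total, candidates, upper_bound)  # 64 60 20
```

The upper bound of 20 matches the number of amino acids, which is what made the idea attractive at the time; the bullets that follow describe how experiments later replaced it.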


Getting the “code”

• 1961 Nirenberg & Matthaei

• poly(U) mRNA is translated into poly(Phe) protein

• first codon assigned (UUU = Phe); starting point for deciphering the complete genetic code

• 1961 Sydney Brenner

• no overlapping codes

• concept of mRNA

• triplet Code (Crick, Brenner, Barnett, Watts-Tobin)

No. 4809 December 30, 1961 NATURE 1227

GENERAL NATURE OF THE GENETIC CODE FOR PROTEINS

[...] Dr. R. J. Watts-Tobin, Medical Research Council Unit for Molecular Biology, Cavendish Laboratory, Cambridge

There is now a mass of indirect evidence which suggests that the amino-acid sequence along the polypeptide chain of a protein is determined by the sequence of the bases along some particular part of the nucleic acid of the genetic material. Since there are twenty common amino-acids found throughout Nature, but only four common bases, it has often been surmised that the sequence of the four bases is in some way a code for the sequence of the amino-acids. In this article we report genetic experiments which, together with the work of others, suggest that the genetic code is of the following general type:

(a) A group of three bases (or, less likely, a multiple of three bases) codes one amino-acid.

(b) The code is not of the overlapping type (see Fig. 1).

(c) The sequence of the bases is read from a fixed starting point. This determines how the long sequences of bases are to be correctly read off as triplets. There are no special ‘commas’ to show how to select the right triplets. If the starting point is displaced by one base, then the reading into triplets is displaced, and thus becomes incorrect.

(d) The code is probably ‘degenerate’; that is, in general, one particular amino-acid can be coded by one of several triplets of bases.

The Reading of the Code

The evidence that the genetic code is not overlapping (see Fig. 1) does not come from our work, but from that of Wittmann and of Tsugita and Fraenkel-Conrat on the mutants of tobacco mosaic virus produced by nitrous acid. In an overlapping triplet code, an alteration to one base will in general change three adjacent amino-acids in the polypeptide chain. Their work on the alterations produced in the protein of the virus show that usually only one amino-acid at a time is changed as a result of treating the ribonucleic acid (RNA) of the virus with nitrous acid. In the rarer cases where two amino-acids are altered (owing presumably to two separate deaminations by the nitrous acid on one piece of RNA), the altered amino-acids are not in adjacent positions in the polypeptide chain.

Brenner had previously shown that, if the code were universal (that is, the same throughout Nature), then all overlapping triplet codes were impossible. Moreover, all the abnormal human haemoglobins studied in detail show only single amino-acid changes. The newer experimental results essentially rule out all simple codes of the overlapping type.

If the code is not overlapping, then there must be some arrangement to show how to select the correct triplets (or quadruplets, or whatever it may be) along the continuous sequence of bases. One obvious suggestion is that, say, every fourth base is a ‘comma’. Another idea is that certain triplets make ‘sense’, whereas others make ‘nonsense’, as in the comma-free codes of Crick, Griffith and Orgel. Alternatively, the correct choice may be made by starting at a fixed point and working along the sequence of bases three (or four, or whatever) at a time. It is this possibility which we now favour.

Experimental Results

Our genetic experiments have been carried out on the B cistron of the rII region of the bacteriophage T4, which attacks strains of Escherichia coli. This is the system so brilliantly exploited by Benzer. The rII region consists of two adjacent genes, or ‘cistrons’, called cistron A and cistron B. The wild-type phage will grow on both E. coli B (here called B) and on E. coli K12 (λ) (here called K), but a phage which has lost the function of either gene will not grow on K. Such a phage produces an r plaque on B. Many point mutations of the genes are known which behave in this way. Deletions of part of the region are also found. Other mutations, known as ‘leaky’, show partial function; that is, they will grow on K but their plaque-type on B is not truly wild. We report here our work on the mutant P 13 (now renamed FC 0) in the B1 segment of the B cistron. This mutant was originally produced by the action of proflavin.

We have previously argued that acridines such as proflavin act as mutagens because they add or delete a base or bases. The most striking evidence in favour of this is that mutants produced by acridines are seldom ‘leaky’; they are almost always completely lacking in the function of the gene. Since our note was published, experimental data from two sources have been added to our previous evidence: (1) we have examined a set of 126 rII mutants made with acridine yellow; of these only 6 are leaky (typically about half the mutants made with base analogues are leaky); (2) Streisinger has found that whereas mutants of the lysozyme of phage T4 produced by base-analogues are usually leaky, all lysozyme mutants produced by proflavin are negative, that is, the function is completely lacking.

If an acridine mutant is produced by, say, adding a base, it should revert to ‘wild-type’ by deleting a base. Our work on revertants of FC 0 shows that it usually

Fig. 1. To show the difference between an overlapping code and a non-overlapping code. The short vertical lines represent the bases of the nucleic acid. The case illustrated is for a triplet code. (Diagram labels: starting point, overlapping code, nucleic acid, non-overlapping code.)
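Point (c) of the excerpt, reading from a fixed starting point, can be illustrated with a small Python sketch. The repeated CAT sequence and the helper names are made up for illustration. Inserting one base shifts every downstream triplet, and deleting a nearby base restores the original reading frame after the deletion.

```python
def read_triplets(bases, start=0):
    """Split bases into non-overlapping triplets, reading from a fixed start."""
    return [bases[i:i + 3] for i in range(start, len(bases) - 2, 3)]

def insert_base(bases, pos, base):
    """Return a copy of bases with one base inserted before index pos."""
    return bases[:pos] + base + bases[pos:]

def delete_base(bases, pos):
    """Return a copy of bases with the base at index pos removed."""
    return bases[:pos] + bases[pos + 1:]

wild_type = "CAT" * 5                          # hypothetical message
plus_mutant = insert_base(wild_type, 4, "G")   # a "+" mutation
revertant = delete_base(plus_mutant, 8)        # a nearby "-" mutation

for label, seq in [("wild type", wild_type),
                   ("insertion", plus_mutant),
                   ("insertion + deletion", revertant)]:
    print(f"{label:22s}", " ".join(read_triplets(seq)))
```

Only the triplets between the two mutations stay altered in the double mutant, which is the logic behind the suppressor experiments on the rII region described in the excerpt.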


E. coli

Getting the “code” – incl. start & stop codons

• Alternative start codon

• AUG (83%)

• GUG (14%)

• UUG (3%)

• Alternative stops

• UAA (63%, ‘ochre’)

• UGA (29%, ‘opal’), or Sec (selenocysteine)

• UAG (8%, ‘amber’)

➡ Start of protein is not easy to determine
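A minimal Python sketch of why the start of a protein is hard to determine from sequence alone: with three possible start codons, a single stop codon can be preceded by several in-frame candidate starts. The sequence is invented for illustration, codons are written in the DNA alphabet (T instead of U), and the function name is hypothetical.

```python
START_CODONS = {"ATG", "GTG", "TTG"}
STOP_CODONS = {"TAA", "TGA", "TAG"}

def candidate_starts(dna, frame=0):
    """For each stop codon in one reading frame, list all in-frame start
    codon positions found since the previous stop."""
    results = []
    starts = []
    for i in range(frame, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in START_CODONS:
            starts.append(i)
        elif codon in STOP_CODONS:
            if starts:
                results.append((i, list(starts)))
            starts = []
    return results

# Hypothetical sequence: TTG AAA GTG CCC ATG GGT AAA TAA
dna = "TTGAAAGTGCCCATGGGTAAATAA"
print(candidate_starts(dna))  # [(21, [0, 6, 12])]
```

All three positions are formally valid starts for the same open reading frame, so gene finders usually need extra evidence such as a ribosome binding site upstream.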

Gene Structure – Eukaryotes / Prokaryotes

Miller, O. L. et al. Visualization of bacterial genes in action. Science 169, 392–395

Gene Structure – Polysomes in Prokaryotes

• EM picture of polysomes on a chromosome


Transcription initiation

DNA

mRNA with Ribosomes

Gene Structure – Transcription in Prokaryotes


• Promoter immediately adjacent to genes in the upstream direction

• e.g. -35 & -10 regions: TTGACA, TATAAT (Pribnow box)

• Promoter is recognised by the sigma (σ) factor of the RNA polymerase

• different σ factors bind different promoters

• affinity for the sequence motifs regulates the transcription level

The -10 region of 350 E. coli promoters

weblogo.berkeley.edu

[Figure: sequence logo of the -10 region; x-axis: position, 5′ to 3′; y-axis: information content in bits (0 to 2)]

Ermolaeva et al., 2000
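The height of each column in a sequence logo such as the one for the -10 region is its information content in bits. A minimal Python sketch of that calculation follows, assuming the simple form 2 - H per DNA column without the small-sample correction that logo tools may apply. The four aligned hexamers are hypothetical and are not the 350 promoters from the figure.

```python
from collections import Counter
from math import log2

def information_content(aligned_sites):
    """Return bits per column (2 - Shannon entropy) for aligned DNA sites."""
    length = len(aligned_sites[0])
    bits = []
    for col in range(length):
        counts = Counter(site[col] for site in aligned_sites)
        total = sum(counts.values())
        entropy = -sum((n / total) * log2(n / total) for n in counts.values())
        bits.append(2.0 - entropy)
    return bits

# Hypothetical aligned -10 hexamers
sites = ["TATAAT", "TATGAT", "TACAAT", "GATAAT"]
for position, value in enumerate(information_content(sites), start=1):
    print(position, round(value, 2))
```

A fully conserved column reaches the maximum of 2 bits, while a column with all four bases at equal frequency carries 0 bits.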

Gene Structure – Transcription in Prokaryotes


• Rho-independent

• palindromic sequence - RNA forms hairpin

• G+C rich region followed by A+T rich region

• RNA-Pol stalls at hairpin

• falls off since an rU-dA region follows

• Rho-dependent

• rho protein binds to nascent RNA (no simple consensus)

• moves towards transcription bubble

• breaks RNA/DNA helix by pulling RNA away
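The features listed for Rho-independent termination (a palindromic, G+C-rich stem followed by an A+T-rich stretch) can be turned into a simple search. This is a rough sketch only: the stem length, loop range, G+C cut-off and T-run length are assumed values, the test sequence is invented, and real terminator predictors score imperfect stems and folding energy.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]

def gc_fraction(dna):
    """Return the fraction of G and C in a DNA string."""
    return (dna.count("G") + dna.count("C")) / len(dna)

def find_hairpin_terminators(dna, stem=6, loop_range=(3, 8), min_gc=0.6, t_run=4):
    """Return (stem_start, loop_length) for perfect G+C-rich hairpins that are
    directly followed by a run of T (U in the RNA)."""
    hits = []
    for i in range(len(dna) - 2 * stem - loop_range[0] - t_run + 1):
        left = dna[i:i + stem]
        if gc_fraction(left) < min_gc:
            continue
        for loop in range(loop_range[0], loop_range[1] + 1):
            j = i + stem + loop
            right = dna[j:j + stem]
            tail = dna[j + stem:j + stem + t_run]
            if right == reverse_complement(left) and tail == "T" * t_run:
                hits.append((i, loop))
    return hits

# Hypothetical terminator: stem GCCCGC, loop TAAT, stem GCGGGC, then a T-run
seq = "CCGCCCGCTAATGCGGGCTTTTTTGA"
print(find_hairpin_terminators(seq))  # [(2, 4)]
```

The sequence is written as the non-template DNA strand, so the T-run corresponds to the U-run of the transcript that pairs only weakly with dA in the template.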

Ermolaeva et al., 2000

Gene Structure – Translation in Prokaryotes


• Initiation

• the start codon (AUG) is positioned at the P-site of the 30S subunit by base-pairing of the upstream ribosome binding site (Shine-Dalgarno sequence) with the 3’ end of the 16S rRNA

• Termination

• stop codon is reached

• ribosome stalls

• ‘Release factors’ recognise stop-codons and terminate translation
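The initiation step can be sketched as a search for a ribosome binding site a few bases upstream of a start codon. The example uses the last 28 bases of the lac sequence shown later in these slides (ending at the start of lacZ). The core motif AGGA (part of the Shine-Dalgarno consensus AGGAGG) and the spacer range of 4 to 12 bases are assumptions chosen for illustration.

```python
def find_sd_sites(dna, core="AGGA", min_spacer=4, max_spacer=12):
    """Return (start_codon_pos, motif_pos, spacer) for every ATG that has the
    core motif within the allowed spacer distance upstream."""
    hits = []
    start = dna.find("ATG")
    while start != -1:
        for spacer in range(min_spacer, max_spacer + 1):
            motif_pos = start - spacer - len(core)
            if motif_pos >= 0 and dna[motif_pos:motif_pos + len(core)] == core:
                hits.append((start, motif_pos, spacer))
        start = dna.find("ATG", start + 1)
    return hits

# End of the lac promoter region, up to the first codons of lacZ
lac_tail = "TTTCACACAGGAAACAGCTATGACCATG"
print(find_sd_sites(lac_tail))  # [(19, 8, 7)]
```

With these settings only the first ATG is supported by an upstream motif; the second ATG lies 13 bases from it and falls outside the assumed spacer range.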

Bacterial Gene Regulation and Transcriptional Networks, M. Madan Babu; E. coli

Gene Structure – Operons ‘organize’ genes in Prokaryotes

• Operons are transcription units

• polycistronic, i.e. only one promoter, though there are indications that this is not always true

• up to 15 genes in an operon

• ~60 % of genes

• distance between genes is short; genes might even overlap

• expression level & transcript stability often decrease in 5’-3’ direction

• gene products often with functional association (e.g. enzymes of one biosynthesis pathway)

• not well conserved, gene order might change
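The short distance between genes in an operon is the basis of a common first-pass prediction heuristic, sketched below. The gene coordinates, the names and the 50 bp gap threshold are all hypothetical; real operon predictors also use conservation, functional annotation and expression data.

```python
from typing import NamedTuple

class Gene(NamedTuple):
    name: str
    start: int
    end: int
    strand: str

def group_into_operons(genes, max_gap=50):
    """Group neighbouring genes on the same strand whose intergenic distance
    is at most max_gap (negative distances mean overlapping genes)."""
    operons = []
    for gene in sorted(genes, key=lambda g: g.start):
        if operons:
            last = operons[-1][-1]
            if gene.strand == last.strand and gene.start - last.end <= max_gap:
                operons[-1].append(gene)
                continue
        operons.append([gene])
    return operons

# Hypothetical annotation
genes = [
    Gene("geneA", 100, 1000, "+"),
    Gene("geneB", 1010, 2000, "+"),
    Gene("geneC", 1995, 2500, "+"),   # overlaps geneB by a few bases
    Gene("geneD", 3000, 3500, "+"),   # 500 bp gap: new transcription unit
    Gene("geneE", 3520, 4000, "-"),   # opposite strand: new transcription unit
]
for operon in group_into_operons(genes):
    print([g.name for g in operon])
```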


Griswold, A. (2008) Nature Education 1(1)

Understanding Bioinformatics, Zvelebil & Baum, 2007

Gene Structure – Prokaryotic Operons

lac Operon

1: Regulatory gene

Promoter region

3: β-galactosidase

4: β-gal permease

8: β-gal transacetylase


u-tokyo.ac.jp, Chaptman

Gene Structure – Prokaryotes

• Lac Operon

• Promoter 5’ of β-galactosidase (lacZ)

Ribosome binding site (rbs)

Promoter

Operator

End of lacI

Start of lacZ

mRNA

CAP-cAMP binding site

RNA polymerase binding site

LacI binding site

-35 -10

3′-CCTTTCGCCCGTCACTCGCGTTGCGTTAATTACACTCAATCGAGTGAGTAATCCGTGGGGTCCGAAATGTGAAATACGAAGGCCGAGCATACAACACACCTTAACACTCGCCTATTGTTAAAGTGTGTCCTTTGTCGATACTGGTAC-5′

5′-GGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATG-3′
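The lac sequence above contains both strands of the region: the 147 bases ending in ATGACCATG and their base-by-base complement. The Python sketch below checks that relationship and then scans the strand ending in ATGACCATG for hexamer pairs resembling the -35 (TTGACA) and -10 (TATAAT) consensus. The spacer range of 15 to 19 bases and the score cut-off of 9 matches out of 12 are assumptions for illustration, and hits from such a scan are candidates only.

```python
# Strand of the lac region that ends with the start of lacZ (ATGACCATG)
TOP = ("GGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCAT"
       "TAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGG"
       "AATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATG")

MINUS_35 = "TTGACA"
MINUS_10 = "TATAAT"

def complement(dna):
    """Return the base-by-base complement (not reversed)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))

def matches(a, b):
    """Count identical positions between two equally long strings."""
    return sum(x == y for x, y in zip(a, b))

def promoter_candidates(dna, spacers=range(15, 20), min_score=9):
    """Return (pos_35, pos_10, score) for hexamer pairs close to the consensus."""
    hits = []
    for i in range(len(dna) - 6 + 1):
        score_35 = matches(dna[i:i + 6], MINUS_35)
        for spacer in spacers:
            j = i + 6 + spacer
            if j + 6 > len(dna):
                break
            score = score_35 + matches(dna[j:j + 6], MINUS_10)
            if score >= min_score:
                hits.append((i, j, score))
    return hits

print(complement(TOP)[:10])  # first bases of the other strand
for pos_35, pos_10, score in promoter_candidates(TOP):
    print(pos_35, TOP[pos_35:pos_35 + 6], pos_10, TOP[pos_10:pos_10 + 6], score)
```

Among the candidates is the pair TTTACA and TATGTT, separated by 18 bases, which lies in the region labelled -35 and -10 on the slide.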

http://u-tokyo.ac.jp

Gene Structure – Eukaryotes / Prokaryotes


Essent. of Molecular Biol.

Gene Structure – Transcription in Eukaryotes


• RNA Pol II transcribes mRNA

• several transcription factors (TF) assist RNA Pol II

• Promoters

• using different TF

• TATA box – TATAAAA (-25 bp)

• Initiator (Inr) sequence - start point of transcription

• GC-box – GGGCG (one or more copies, -40 to -100 bp)

• CAAT-box – CCAAT (-40 to -100 bp)




• Initiation

• Initial step is the binding of Transcription Factor IID (TFIID) to TATA-box

• Termination

• AAUAAA signal

• leads to cleavage of RNA 15-30nt downstream

• poly(A) tail added

• Post-transcriptional modification

• capping (7-methylguanosine)

• tailing (see above)

• splicing
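The termination signal described above lends itself to a simple search: find AAUAAA and report the window 15 to 30 nt downstream in which cleavage is expected. This is a sketch with an invented transcript; measuring the distance from the end of the signal is an assumption, and the function name is hypothetical.

```python
def polyadenylation_windows(rna, signal="AAUAAA", window=(15, 30)):
    """Return (signal_pos, window_start, window_end) for each signal found.
    Window coordinates are 0-based and clipped to the transcript length."""
    hits = []
    pos = rna.find(signal)
    while pos != -1:
        signal_end = pos + len(signal)
        first = min(signal_end + window[0], len(rna))
        last = min(signal_end + window[1], len(rna))
        hits.append((pos, first, last))
        pos = rna.find(signal, pos + 1)
    return hits

# Hypothetical 3' end of a pre-mRNA
rna = "GGC" + "AAUAAA" + "C" * 40
print(polyadenylation_windows(rna))  # [(3, 24, 39)]
```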




• Enhancers/Silencers

• protein-coding genes typically contain several enhancers

• usually 700-1000 bp upstream/downstream or within an intron

• control gene expression pattern (e.g. tissue specificity)

• Insulators

• block the interaction of enhancers/silencers with the promoter

• limit the influence of enhancers


Gene Structure – Translation in Eukaryotes


• Initiation

• eIF-4F binds to Cap-Structure of mRNA

• followed by binding/aligning to preinitiation complex

• Termination

• termination codons

• release factors (eRFs)

• Post-translational Modification

• proteolytic cleavage, acylation, myristoylation, methylation, phosphorylation, acetylation, formylation, sulphation, prenylation…

zazzle.com


Gene Structure – Eukaryotes

Gene Structure – Comparison

Genes

Eukaryote:

• Often have introns

• Intraspecific gene order and number generally relatively stable

• Many non-coding (RNA) genes

• There is NOT generally a relationship between organism complexity and gene number

Prokaryote:

• No introns

• Gene order and number may vary between strains of a species

Gene regulation

Eukaryote:

• Promoters, often with distal long range enhancers/silencers, MARS, transcriptional domains

• Generally mono-cistronic

Prokaryote:

• Promoters

• Enhancers/silencers rare

• Genes often regulated as polycistronic operons

Repetitive sequences

Eukaryote:

• Generally highly repetitive with genome wide families from transposable element propagation

Prokaryote:

• Generally few repeated sequences

• Relatively few transposons

Organelle (subgenomes)

Eukaryote:

• Mitochondrial (all)

• Chloroplasts (in plants)

Prokaryote:

• Absent


Genomic era

• 1975 Frederick Sanger

• dideoxy sequencing

• 1986 Human Genome Initiative

• Genomes

• 1995 H. influenzae 1.8 Mb 1.7k genes

• 1997 E. coli 4.6 Mb 4.3k genes

• 1996 S. cerevisiae 12.5 Mb 5.7k genes

• 1998 C. elegans 100 Mb 21.7k genes

• 2000 D. melanogaster 121 Mb 17k genes
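The genome sizes and gene counts in the list above can be compared directly by computing genes per megabase. The Python sketch below uses only the figures from the list; the dictionary and function names are illustrative.

```python
# name: (genome size in Mb, number of genes), values taken from the list above
GENOMES = {
    "H. influenzae": (1.8, 1700),
    "E. coli": (4.6, 4300),
    "S. cerevisiae": (12.5, 5700),
    "C. elegans": (100, 21700),
    "D. melanogaster": (121, 17000),
}

def genes_per_mb(size_mb, genes):
    """Return the gene density in genes per megabase."""
    return genes / size_mb

density = {name: round(genes_per_mb(size, genes))
           for name, (size, genes) in GENOMES.items()}
for name, value in density.items():
    print(f"{name:16s} {value:4d} genes/Mb")
```

The two bacteria come out at roughly 900 to 950 genes per Mb, while the animal genomes are several times less dense.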


Kavanoff, Nature Education : Supercoiled chromosome of E. coli.

Prokaryotic Genome

• E. coli

• 4.6 Mbp (~1.5 mm long)

• Cell: ~1 x 2 µm

• clumps called ‘nucleoid’

• ~80% DNA

• ~ 100 supercoiled domains

• Phosphate charge compensated by e.g. spermidine


Science (2001), Nature (2001)

The human genome

• 2001 Draft H. sapiens 2.9 Gb 20-30k genes


Qui, Nature 2006

Genome – Packing Problem

• human:

• 2 x 3e9 base pairs

• packed in a nucleus of 6µm ∅

Histones

Chromosome

Histone tails


The human genome


Average length of a chromosome: 5 cm; range: 8.5 cm (chr1) to 2 cm (chrY)
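The packing problem can be put into numbers with a short calculation. This sketch assumes the textbook rise of 0.34 nm per base pair for B-DNA and roughly 249 Mb for chromosome 1; neither value is stated on the slides. The 4.6 Mb for E. coli comes from the genome list earlier.

```python
RISE_PER_BP_NM = 0.34  # assumed rise per base pair of B-DNA, in nanometres

def contour_length_m(base_pairs):
    """Return the length of fully extended B-DNA in metres."""
    return base_pairs * RISE_PER_BP_NM * 1e-9

diploid_m = contour_length_m(2 * 3e9)       # human diploid genome
chr1_cm = contour_length_m(249e6) * 100     # chromosome 1, assumed 249 Mb
ecoli_mm = contour_length_m(4.6e6) * 1000   # E. coli chromosome

print(f"human diploid genome: {diploid_m:.2f} m in a 6 µm nucleus")
print(f"chromosome 1:         {chr1_cm:.1f} cm")
print(f"E. coli chromosome:   {ecoli_mm:.2f} mm in a ~1 x 2 µm cell")
```

About 2 m of DNA has to fit into a nucleus of 6 µm diameter, a length-to-diameter ratio of more than 300,000, which is why histones and higher-order chromatin structure are needed.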

Genome Structure – Comparison

Size

Eukaryote:

• Large (10 Mb – 100,000 Mb)

• There is not generally a relationship between organism complexity and its genome size (many plants have larger genomes than human!)

Prokaryote:

• Generally small (<10 Mb; most < 5 Mb)

• Complexity (as measured by # of genes and metabolism) generally proportional to genome size

Content

Eukaryote:

• Most DNA is non-coding

Prokaryote:

• DNA is “coding gene dense”

Telomeres/Centromeres

Eukaryote:

• Present (linear DNA)

Prokaryote:

• Circular DNA, doesn't need telomeres

• Don’t have mitosis, hence, no centromeres

Number of chromosomes

Eukaryote:

• More than one, (often) including those discriminating sexual identity

Prokaryote:

• Often one, sometimes more, but plasmids are not true chromosomes

Chromatin

Eukaryote:

• Histone bound (which serves as a genome regulation point)

Prokaryote:

• No histones

• Uses supercoiling to pack genome


Gene content


Hu et al, 2009; Reichardt, 2007

Gene content

• ≥ 33% of genes in E. coli of unknown function

• 5% of these orphan genes - unique to E. coli

• ~ 40% of genes in human of unknown function

• 5% orphan genes


Gregory (2005), Nature

Human Genome Content

• LINEs: 20.4%

• SINEs: 13.1%

• Protein-coding genes: 1.5%

• Introns: 25.9%

• Miscellaneous unique sequences: 11.6%

• Miscellaneous heterochromatin: 8%

• Segmental duplications: 5%

• Simple sequence repeats: 3%

• DNA transposons: 2.9%

• LTR retrotransposons: 8.3%


Next Generation Sequencing (NGS)


Jason M. Rizzo, and Michael J. Buck Cancer Prev Res 2012;5:887-900

Next Generation Sequencing (NGS)


Jason M. Rizzo, and Michael J. Buck Cancer Prev Res 2012;5:887-900

Major New Microbial Groups Expand Diversity and Alter our Understanding of the Tree of Life - Castelle & Banfield, 2018, Cell

Genomic Era – steady exploration of the Tree of Life
