introduction to bioinformatics 2. genetics background course 341 department of computing imperial...

28
Introduction to Bioinformatics 2. Genetics Background Course 341 Department of Computing Imperial College, London © Simon Colton

Post on 19-Dec-2015

227 views

Category:

Documents


1 download

TRANSCRIPT

Introduction to Bioinformatics2. Genetics Background

Course 341

Department of Computing

Imperial College, London

© Simon Colton

Coursework

1 coursework – worth 20 marks– Work in pairs

Retrieving information from a database

Using Perl to manipulate that information

The Robot Scientist

Performs experiments Learns from results

– Using machine learning Plans more experiments Saves time and money

Team member:– Stephen Muggleton

Biological Nomenclature

Need to know the meaning of:– Species, organism, cell, nucleus, chromosome, DNA– Genome, gene, base, residue, protein, amino acid– Transcription, translation, messenger RNA– Codons, genetic code, evolution, mutation, crossover– Polymer, genotype, phenotype, conformation– Inheritance, homology, phylogenetic trees

Substructure and Effect(Top Down/Bottom Up)

Species

Organism

Cell

Nucleus

Chromosome

DNA strand

Gene

Base

Protein

Amino AcidFoldsinto

Affects theFunction of

Affects theBehaviour of

Prescribes

Cells

Basic unit of life Different types of cell:

– Skin, brain, red/white blood– Different biological function

Cells produced by cells– Cell division (mitosis)– 2 daughter cells

Eukaryotic cells– Have a nucleus

Nucleus and Chromosomes

Each cell has nucleus Rod-shaped particles inside

– Are chromosomes– Which we think of in pairs

Different number for species– Human(46),tobacco(48)– Goldfish(94),chimp(48)– Usually paired up

X & Y Chromosomes– Humans: Male(xy), Female(xx)– Birds: Male(xx), Female(xy)

DNA Strands

Chromosomes are same in every cell of organism– Supercoiled DNA (Deoxyribonucleic acid)

Take a human, take one cell– Determine the structure of all chromosonal DNA– You’ve just read the human genome (for 1 person)– Human genome project

13 years, 3.2 billion chemicals (bases) in human genome

Other genomes being/been decoded:– Pufferfish, fruit fly, mouse, chicken, yeast, bacteria

DNA Structure

Double Helix (Crick & Watson)– 2 coiled matching strands– Backbone of sugar phosphate pairs

Nitrogenous Base Pairs – Roughly 20 atoms in a base– Adenine Thymine [A,T]– Cytosine Guanine [C,G]– Weak bonds (can be broken)– Form long chains called polymers

Read the sequence on 1 strand– GATTCATCATGGATCATACTAAC

Differences in DNA

2% tiny

Roughly 4%

Share

Materia

l

DNA differentiates:– Species/race/gender– Individuals

We share DNA with– Primates,mammals– Fish, plants, bacteria

Genotype– DNA of an individual

Genetic constitution

Phenotype– Characteristics of the

resulting organism Nature and nurture

Genes

Chunks of DNA sequence– Between 600 and 1200 bases long– 32,000 human genes, 100,000 genes in tulips

Large percentage of human genome – Is “junk”: does not code for proteins

“Simpler” organisms such as bacteria– Are much more evolved (have hardly any junk)– Viruses have overlapping genes (zipped/compressed)

Often the active part of a gene is spit into exons– Seperated by introns

The Synthesis of Proteins

Instructions for generating Amino Acid sequences– (i) DNA double helix is unzipped– (ii) One strand is transcribed to messenger RNA – (iii) RNA acts as a template

ribosomes translate the RNA into the sequence of amino acids

Amino acid sequences fold into a 3d molecule Gene expression

– Every cell has every gene in it (has all chromosomes)– Which ones produce proteins (are expressed) & when?

Transcription

Take one strand of DNA Write out the counterparts to each base

– G becomes C (and vice versa)– A becomes T (and vice versa)

Change Thymine [T] to Uracil [U] You have transcribed DNA into messenger RNA Example:

Start: GGATGCCAATGIntermediate: CCTACGGTTACTranscribed: CCUACGGUUAC

Genetic Code

How the translation occurs

Think of this as a function:– Input: triples of three base letters (Codons)– Output: amino acid– Example: ACC becomes threonine (T)

Gene sequences end with: – TAA, TAG or TGA

Genetic CodeA=Ala=Alanine

C=Cys=Cysteine

D=Asp=Aspartic acid

E=Glu=Glutamic acid

F=Phe=Phenylalanine

G=Gly=Glycine

H=His=Histidine

I=Ile=Isoleucine

K=Lys=Lysine

L=Leu=Leucine

M=Met=Methionine

N=Asn=Asparagine

P=Pro=Proline

Q=Gln=Glutamine

R=Arg=Arginine

S=Ser=Serine

T=Thr=Threonine

V=Val=Valine

W=Trp=Tryptophan

Y=Tyr=Tyrosine

Example Synthesis

TCGGTGAATCTGTTTGAT Transcribed to:

AGCCACUUAGACAAACUATranslated to:

SHLDKL

Proteins

DNA codes for – strings of amino acids

Amino acids strings– Fold up into complex 3d molecule – 3d structures:conformations– Between 200 & 400 “residues”– Folds are proteins

Residue sequences– Always fold to same conformation

Proteins play a part– In almost every biological process

Evolution of Genes: Inheritance

Evolution of species– Caused by reproduction and survival of the fittest

But actually, it is the genotype which evolves– Organism has to live with it (or die before reproduction)– Three mechanisms: inheritance, mutation and crossover

Inheritance: properties from parents– Embryo has cells with 23 pairs of chromosomes– Each pair: 1 chromosome from father, 1 from mother– Most important factor in offspring’s genetic makeup

Evolution of Genes: Mutation

Genes alter (slightly) during reproduction– Caused by errors, from radiation, from toxicity– 3 possibilities: deletion, insertion, alteration

Deletion: ACGTTGACTC ACGTGACTC Insertion: ACGTTGACTC AGCGTTGACTC Substitution: ACGTTGACTC ACGATGACTT Mutations are almost always deleterious

– A single change has a massive effect on translation– Causes a different protein conformation

Evolution of Genes: Crossover (Recombination)

DNA sections are swapped – From male and female genetic input to offspring DNA

Bioinformatics Application #1Phylogenetic trees

Understand our evolution Genes are homologous

– If they share a common ancestor By looking at DNA seqs

– For particular genes– See who evolved from who

Example:– Mammoth most related to

African or Indian Elephants?

LUCA:– Last Universal Common Ancestor– Roughly 4 billion years ago

Genetic Disorders

Disorders have fuelled much genetics research– Remember that genes have evolved to function

Not to malfunction

Different types of genetic problems Downs syndrome: three chromosome 21s Cystic fibrosis:

– Single base-pair mutation disables a protein– Restricts the flow of ions into certain lung cells– Lung is less able to expel fluids

Bioinformatics Application #2Predicting Protein Structure

Proteins fold to set up an active site– Small, but highly effective (sub)structure– Active site(s) determine the activity of the protein

Remember that translation is a function– Always same structure given same set of codons– Is there a set of rules governing how proteins fold?– No one has found one yet– “Holy Grail” of bioinformatics

Protein Structure Knowledge

Both protein sequence and structure– Are being determined at an exponential rate

1.3+ Million protein sequences known– Found with projects like Human Genome Project

20,000+ protein structures known– Found using techniques like X-ray crystallography

Takes between 1 month and 3 years– To determine the structure of a protein– Process is getting quicker

Sequence versus Structure

009590850

100000

200000

300000

400000

500000

Year

Number

Protein sequence

Protein structure

Database Approaches

Slow(er) rate of finding protein structure– Still a good idea to pursue the Holy Grail

Structure is much more conservative than sequence– 1.3m genes, but only 2,000 – 10,000 different conformations

First approach to sequence prediction:– Store [sequence,structure] pairs in a database– Find ways to score similarity of residue sequences– Given a new sequence, find closest matches

A good match will possibly mean similar protein shape E.g., sequence identity > 35% will give a good match

– Rest of the first half of the course about these issues

Potential (Big) Payoffsof Protein Structure Prediction

Protein function prediction– Protein interactions and docking

Rational drug design– Inhibit or stimulate protein activity with a drug

Systems biology– Putting it all together: “E-cell” and “E-organism”– In-silico modelling of biological entities and process

Further Reading

Human Genome Project at Sanger Centre– http://www.sanger.ac.uk/HGP/

Talking glossary of genetic terms– http://www.genome.gov/glossary.cfm

Primer on molecular genetics– http://www.ornl.gov/TechResources/Human_Genome/publicat/primer/toc.html