msc bioinformatics medical genetics tutorial

Bioinformatics in human genetics

- what causes human genetic disease?

- how do we identify the genes responsible?

- how do we find the molecular mechanism?

Richard Adams,

Psychiatric Genetics Group,

Medical Genetics Section,

Molecular Medicine Centre,

Western General Hospital.

[email protected]

http://www.genetics.med.ed.ac.uk

Sources of mutation

Type of mutation: point mutation
Mechanism: 1. mistakes in DNA replication; 2. DNA damage by chemical mutagens (or by radiation) and misrepair
Frequency per cell division: ~10^-10 per base pair, ~10^-5 per gene, ~0.5 per cell

Type of mutation: submicroscopic deletion, insertion or duplication
Mechanism: 1. unequal crossing over; 2. misalignment during DNA replication; 3. insertion of mobile element; 4. DNA damage by chemical mutagens (or by radiation) and misrepair
Frequency per cell division: included in the above

Type of mutation: microscopically visible deletion, translocation or inversion
Mechanism: 1. unequal crossing over; 2. DNA damage by chemical mutagens (or by radiation) and misrepair
Frequency per cell division: 6 x 10^-4

Type of mutation: loss/gain of a whole chromosome
Mechanism: missegregation at mitosis
Frequency per cell division: 1 in 100

- mutations are a primary cause of evolution

- most are detrimental but some lead to novel and advantageous function.

Mutations, disease and polymorphisms

• Direct loss of protein function (most common), e.g., phenylketonuria, cystic fibrosis (triplet coding for F508 removed).

• Loss of splicing leading to protein dysfunction, e.g., spinal muscular atrophy.

• Loss/alteration of gene regulatory regions, e.g., type I diabetes, CCR5 and HIV; few confirmed examples yet.

Many mutations are frequent in the population. Those with a frequency of >1% in the population are called polymorphisms. Types of polymorphism include Variable Number Tandem Repeats (VNTRs), e.g.,

AGCTGGTACATATATATATATCGTTACGTGA maternal
AGCTGGTACATATAT------CGTTACGTG paternal

or single nucleotide polymorphisms (SNPs), e.g.,

AGCTGGTTCAGCACTAGCAGTCT maternal
AGCTGGTTCAGTACTAGCAGTCT paternal

These polymorphisms allow us to distinguish homologous chromosomes.
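As a small illustration, the SNP example above can be located programmatically by comparing the two aligned sequences base by base. This is only a sketch: the function name `snp_positions` is made up here, and it assumes the two sequences are already aligned to the same length.

```python
def snp_positions(seq_a: str, seq_b: str) -> list:
    """Return (1-based position, base in seq_a, base in seq_b) for every mismatch.

    Assumes the sequences are already aligned; gap characters ("-") are skipped
    because they indicate insertions/deletions rather than SNPs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, a, b)
        for i, (a, b) in enumerate(zip(seq_a, seq_b))
        if a != b and a != "-" and b != "-"
    ]


# The maternal and paternal sequences from the SNP example in the slide.
maternal = "AGCTGGTTCAGCACTAGCAGTCT"
paternal = "AGCTGGTTCAGTACTAGCAGTCT"
print(snp_positions(maternal, paternal))  # [(12, 'C', 'T')]
```

The single difference at position 12 (C on the maternal copy, T on the paternal copy) is what lets the two homologous chromosomes be told apart at this site.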

Genetic diseases differ in their mode of inheritance

• Simple Mendelian (easy to analyse), e.g. Huntington's disease, cystic fibrosis, Duchenne muscular dystrophy. There is a complete correlation between genotype and phenotype. If you've got the mutant gene, you'll get the disease. Dysfunction of these genes is SUFFICIENT and NECESSARY for the disease to occur.

• Oligogenic (more difficult), e.g., Alzheimer’s disease. A small number of genes contribute most of the genetic risk. Dysfunction of these genes is SUFFICIENT but not NECESSARY for the disease to occur.

• Complex or multifactorial (hard to analyse), e.g. many common diseases, such as cancer, asthma, schizophrenia, hypertension, heart disease. The risk of getting the disease is modified by the individual's genotype. Other factors, especially other genes and environment, also influence the risk of getting a disease. Many genes are involved. Dysfunction of any one gene is NEITHER NECESSARY NOR SUFFICIENT for the disease to occur.

Classic mapping strategy for Mendelian disease genes

Examine families with inherited disease“Linkage studies”

Identify region of interest

Identify candidate genes in ROI

Identify possible functional mutations

Perform an “Association study” of a gene of interest

Identify the causative mutation experimentally

The human genome

- approx. 30 000 genes

- genes are only 1.5% of all DNA

- only ~5% of genome under selection

- 23 pairs of chromosomes = 46 altogether

- During production of eggs and sperm, chromosomes pair up and recombine with each other ("shuffling").

Many polymorphic markers are used in a real mapping study.

Mapping disease genes by linkage analysis

Blue = paternal, Red = maternal

A, B and C are paternal alleles; a, b and c are maternal alleles.

During meiosis recombination occurs randomly.

The disease mutation will segregate most often with A, then B, then C.

Is successful when:

1) High penetrance of the mutation

2) Unambiguous disease diagnosis

3) Reproducible between studies

Results in:

- small chromosome region defined

- ~1 cM ≈ 1 Mb ≈ 10 genes

- basis for gene association study.

Observe recombination events in a multi-generational family
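Linkage evidence from observed recombination events is usually summarised as a LOD score: the log10 ratio of the likelihood of the data at a recombination fraction theta to the likelihood under no linkage (theta = 0.5). The sketch below covers only the simplest textbook case, where every meiosis can be scored unambiguously as recombinant or non-recombinant. The function name and the counts in the example are hypothetical, and real pedigree analysis is considerably more involved.

```python
import math


def lod_score(recombinants: int, meioses: int, theta: float) -> float:
    """LOD = log10( L(theta) / L(0.5) ) for fully informative, phase-known meioses."""
    if not 0 <= recombinants <= meioses:
        raise ValueError("recombinants must lie between 0 and meioses")
    if not 0 < theta <= 0.5:
        raise ValueError("theta must be in (0, 0.5]")
    non_recombinants = meioses - recombinants
    log_l_theta = (recombinants * math.log10(theta)
                   + non_recombinants * math.log10(1 - theta))
    log_l_null = meioses * math.log10(0.5)
    return log_l_theta - log_l_null


# Hypothetical family data: 1 recombinant out of 20 scored meioses.
for theta in (0.01, 0.05, 0.1, 0.2, 0.5):
    print(f"theta={theta:<4} LOD={lod_score(1, 20, theta):.2f}")
```

By convention a LOD score of 3 or more is taken as evidence for linkage. With few recombinants, the score peaks at a small theta, meaning the marker lies close to the disease mutation.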

Gene association studies

• Define more precisely the disease locus.

• Use SNPs as these are most frequent polymorphisms.

• Family studies do not contain enough recombination events.

• Therefore use large numbers of unrelated people and examine how well particular SNPs have co-segregated with a disease.

• Very expensive/laborious to examine hundreds of SNPs in hundreds of people.

• Software attempts to reconstruct the recombination events that have occurred to generate the current genotypic diversity.

• Results in the identification of a disease “haplotype”

• Basis for determining the causative mutation.

What’s a Haplotype?

• Combination of adjacent alleles along a chromosome that are inherited as a unit.

• Recombination is not perfectly random.

• Therefore it is not usually possible to identify a single mutation.

• Useful as a mapping tool: it is only necessary to examine a subset of ‘tagging’ SNPs in order to cover the genomic ROI (www.hapmap.org).

SNP1  SNP2  SNP3  Count
A     A     G     27
A     C     G     1
G     A     A     7
G     A     G     1
G     C     A     52

Alleles at each SNP: A/G (SNP1), A/C (SNP2), A/G (SNP3)
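To make the idea of a 'tagging' SNP concrete, the sketch below takes the five haplotypes listed above and computes r², a standard measure of linkage disequilibrium, for each pair of SNPs. It assumes the numbers in the last column are counts of observed chromosomes, which the slide does not state explicitly. The function name `r_squared` is illustrative.

```python
from itertools import combinations

# Haplotypes from the slide: (SNP1, SNP2, SNP3) -> number observed (assumed counts).
haplotypes = {
    ("A", "A", "G"): 27,
    ("A", "C", "G"): 1,
    ("G", "A", "A"): 7,
    ("G", "A", "G"): 1,
    ("G", "C", "A"): 52,
}
total = sum(haplotypes.values())


def r_squared(i: int, j: int) -> float:
    """Pairwise linkage disequilibrium (r^2) between SNP i and SNP j (0-based)."""
    # Pick one allele per SNP as the reference; r^2 does not depend on which.
    a = min(h[i] for h in haplotypes)
    b = min(h[j] for h in haplotypes)
    p_a = sum(n for h, n in haplotypes.items() if h[i] == a) / total
    p_b = sum(n for h, n in haplotypes.items() if h[j] == b) / total
    p_ab = sum(n for h, n in haplotypes.items() if h[i] == a and h[j] == b) / total
    d = p_ab - p_a * p_b  # deviation from independence
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


for i, j in combinations(range(3), 2):
    print(f"SNP{i + 1} vs SNP{j + 1}: r^2 = {r_squared(i, j):.2f}")
```

On these counts SNP1 and SNP3 are almost perfectly correlated (r² of roughly 0.95), so genotyping one of them captures most of the information carried by the other.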

Not so simple for complex diseases

Unfortunately the approach doesn’t work nearly as well for complex (polygenic) disorders like schizophrenia, diabetes, asthma etc.

From Glazier et al. 2002

Why are genes involved in complex disease hard to find?

• Low correlation of genotype with phenotype => poor linkage scores => large regions.

• Studies often are not reproducible between populations.

• E.g., for bipolar disorder ~1/3 of the genome has been implicated in >= 1 study.

• Typically the region of interest contains hundreds of genes; it is currently unfeasible to examine all of them in association studies.

[Figure: linkage intervals for families F48, F50, F59 and F22 along chromosome 4p (scale in Mb), with Minimal Region I and Minimal Region II marked.]

Figure 2: Four families display linkage to chromosome 4p15-16. Minimal regions I and II are defined by the regions where three of the four linkage signals overlap.

Case study: Bipolar disorder and chromosome 4

- approx. 9.5 Mb (33 known genes) is shared between 3 of the 4 families

- but 22 Mb (65 known genes) may be implicated
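The "shared by three of four families" rule can be expressed as a simple sweep over interval endpoints. The sketch below is illustrative only: the family names and coordinates are invented, not the real chromosome 4 intervals, and `shared_regions` is a hypothetical helper.

```python
def shared_regions(intervals: dict, min_families: int) -> list:
    """Return (start, end) segments covered by at least `min_families` intervals."""
    events = []
    for start, end in intervals.values():
        events.append((start, 1))   # interval opens
        events.append((end, -1))    # interval closes
    # Sort by position; at equal positions process openings before closings.
    events.sort(key=lambda e: (e[0], -e[1]))

    regions, depth, region_start = [], 0, None
    for pos, delta in events:
        depth += delta
        if delta == 1 and depth == min_families:
            region_start = pos
        elif delta == -1 and depth == min_families - 1 and region_start is not None:
            if pos > region_start:
                regions.append((region_start, pos))
            region_start = None
    return regions


# Hypothetical linkage intervals in Mb for four families (not the real data).
linkage = {
    "family_1": (0.0, 10.0),
    "family_2": (4.0, 18.0),
    "family_3": (6.0, 30.0),
    "family_4": (15.0, 25.0),
}
print(shared_regions(linkage, min_families=3))  # [(6.0, 10.0), (15.0, 18.0)]
```

With these made-up intervals two separate minimal regions emerge, which mirrors how a single set of linkage signals can yield both a Minimal Region I and a Minimal Region II.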

Approaches to prioritize candidate gene selections

• Use genome annotation and clinical knowledge to predict what may be a good candidate. E.g.,

– “I’m looking for a gene involved in schizophrenia and I know that dopamine levels are elevated in schizophrenia patients – so I’ll screen the dopamine receptor genes first.”

– May be useful when we have some idea of molecular mechanism.

– But for bipolar disorder and schizophrenia we have only a basic idea of the cellular mechanisms involved.

Let computers do the monkey work

• Growing interest in automated methods to prioritize candidate genes

• Half a dozen systems freely available (some on the www)

• E.g.,

– POCUS

– GeneSeeker

– PROSPECTR

GeneSeeker – correlating gene expression with disease

• Van Driel et al. 2005

• Takes as input a genomic region and an expression pattern

• Assumes that genes involved in a particular disease will be expressed in the same tissues.

• Compares expression profiles of genes in the region of interest, returns subset that are expressed in relevant tissues.

• No statistics, is entirely qualitative – just uses boolean querying of database.
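The kind of boolean, expression-based filtering described above can be sketched in a few lines. This is not GeneSeeker's actual code or database schema; the gene names, tissue annotations and function name are all hypothetical.

```python
# Hypothetical expression annotations for genes in a region of interest.
expression = {
    "GENE_A": {"brain", "liver"},
    "GENE_B": {"kidney"},
    "GENE_C": {"brain"},
    "GENE_D": set(),  # no expression data available for this gene
}


def filter_by_expression(annotations: dict, required_tissues: set) -> list:
    """Keep genes annotated as expressed in every required tissue (boolean AND)."""
    required = set(required_tissues)
    return sorted(gene for gene, tissues in annotations.items() if required <= tissues)


print(filter_by_expression(expression, {"brain"}))  # ['GENE_A', 'GENE_C']
```

Note that GENE_D is dropped simply because it has no annotation, and that the surviving genes are returned unranked. Both are consequences of a purely qualitative filter.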

GeneSeeker : Pros & Cons

• Easy assumption to understand (psychiatric illnesses associated with genes expressed in the brain, for example…)

• Fast

• Must have sufficient knowledge about disease (which tissues involved?)

• Not all genes have reliable, normalized expression data

• Eliminates unlikely genes but doesn’t prioritize likely ones…

• Assumes new gene will be similar to existing known disease genes.

POCUS: Prioritization Of Candidates Using Statistics

• Premise : Several genes may act in a pathway – partial disruption of several of these may result in disease.

• Takes as input two or more regions of interest (identified through linkage analysis).

• Works on oligogenic diseases (genes not necessary, but sufficient) e.g., Alzheimer’s disease, hypertension, inflammatory bowel disease.

• Looks for significantly overrepresented Gene Ontology terms or protein domains within those regions.

• Returns all genes with those functions. For a locus with ~500 genes, a 20-fold enrichment for disease genes is produced.
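One common way to ask whether an annotation term is overrepresented in a set of genes is a hypergeometric tail probability. The sketch below shows that generic test; it is not POCUS's own scoring scheme, which is described in Turner et al., and all the numbers in the example are hypothetical.

```python
from math import comb


def hypergeom_tail(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k): chance of seeing at least k annotated genes among n drawn
    at random from N genes, of which K carry the annotation."""
    upper = min(n, K)
    favourable = sum(comb(K, x) * comb(N - K, n - x) for x in range(k, upper + 1))
    return favourable / comb(N, n)


# Hypothetical example: 500 genes in total, 10 carry a given GO term,
# and 4 of the 40 genes inside the linked loci carry it.
p = hypergeom_tail(k=4, n=40, K=10, N=500)
print(f"P(at least 4 annotated genes by chance) = {p:.4f}")
```

A small probability suggests the term is shared between the loci more often than expected by chance, which is the kind of signal used to pick out candidate genes.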

Gene ontologies

• The Gene Ontology Consortium (GO, www.geneontology.org)

• An attempt to create a controlled vocabulary for biology.

– Biological process, e.g., glucose metabolism

– Molecular function, e.g., dehydrogenase

– Cellular component, e.g., mitochondria

• Terms are arranged in a directed acyclic graph, e.g.,

[Figure: directed acyclic graph of GO terms: Cell-cell signalling, Neurophysiological Process, Synaptic Transmission, Regulation of Action Potential, Neurotransmitter Regulation, Regulation of Synapse Structure.]
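Because GO terms form a directed acyclic graph, a gene annotated with a specific term is implicitly annotated with every ancestor of that term. The sketch below walks such a graph. The parent/child edges are an assumed reading of the figure, used purely for illustration, and should be checked against GO itself.

```python
# Child term -> list of parent terms. These edges are assumptions for illustration.
parents = {
    "synaptic transmission": ["cell-cell signalling", "neurophysiological process"],
    "regulation of action potential": ["synaptic transmission"],
    "neurotransmitter regulation": ["synaptic transmission"],
    "regulation of synapse structure": ["synaptic transmission"],
}


def ancestors(term: str, graph: dict) -> set:
    """Collect every term reachable by following parent links from `term`."""
    seen = set()
    stack = [term]
    while stack:
        for parent in graph.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("regulation of action potential", parents)))
```

A term can have more than one parent (here "synaptic transmission" has two), which is why the structure is a graph rather than a strict tree. Comparing genes by their full ancestor sets, rather than by exact term matches, is one way of taking that structure into account.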

POCUS : Pros & Cons

• No prior knowledge of disease etiology needed

• No assumption of mechanism

• Fast

• Not all genes have adequate functional annotation

• Matches need to be exact: POCUS doesn’t take the tree-like structure of GO into account.

Problems with candidate gene /annotation based approaches

• Dependence on functional annotation and prior knowledge.

- this is variable between genes.

- about 1/6 of human genes have almost no functional annotation

• Little possibility of discovering novelty

Sequence based methods to identify disease genes

• Initial observation: genes disrupted in schizophrenia were very long.

• Examine sequence features of known disease genes: gene length, sequence identity, number of homologues, sequence motifs etc.

• Some statistically significant differences are apparent between disease and non-disease genes.

Proteins encoded by disease genes are on average longer than non-disease genes.

[Figure: normalized frequency (y axis) against protein length in aa (x axis, 0 to 2500) for disease and non-disease genes.]

Sequence properties differ between disease and non-disease genes

Can these sequence features be used to predict ‘unknown’ disease genes?

Homology definitions

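[Figure: an initial gene duplication event in ancestral DNA gives rise to genes X and Y. Human and chimp each carry both X and Y. X and Y are paralogues; the copies of the same gene in human and chimp are orthologues.]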

Machine learning in biology

• Based on these data we can predict which other genes in the genome also have these properties, using machine learning approaches.

• Machine learning approaches therefore learn from a known training set in order to classify unknown instances into one of 2 classes.

• Especially useful in biology where training data comes from difficult, expensive experiments and we want to extrapolate across the genome.

• Machine learning approaches include support vector machines, bayesian statistics, nearest neighbour instance based learning, and decision trees.

• Decision trees are particularly useful as they produce a human interpretable set of decisions whose biological significance can be analysed.
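To show what "learning from a training set" means in the simplest possible terms, the sketch below fits a one-split decision "stump" on a single feature. It is far simpler than a full decision-tree learner such as C4.5, and the protein lengths and labels are invented for illustration.

```python
def best_threshold(values: list, labels: list) -> tuple:
    """Find the cut-off on one numeric feature that misclassifies the fewest
    training examples, predicting class 1 when value > threshold.

    Returns (threshold, number_of_training_errors)."""
    best = (None, len(values) + 1)
    for threshold in sorted(set(values)):
        errors = sum((v > threshold) != bool(y) for v, y in zip(values, labels))
        if errors < best[1]:
            best = (threshold, errors)
    return best


# Hypothetical training set: protein length (aa) and class (1 = disease gene).
lengths = [900, 1200, 750, 1500, 300, 450, 800, 500]
classes = [1, 1, 1, 1, 0, 0, 0, 0]

threshold, errors = best_threshold(lengths, classes)
print(f"predict 'disease' if length > {threshold} aa ({errors} training error)")
```

The learned rule is a single human-readable decision. A decision tree stacks many such tests on different features, which is what makes its output open to biological interpretation.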

PRiOritization by Sequence & PhylogEnetic features of CandidaTe Regions - PROSPECTR

- build classifier using a decision tree (based on C4.5) using the Weka machine learning package (http://www.cs.waikato.ac.nz/~ml/weka/)

- can input an arbitrary number of attributes

- gives human readable classifier

- also happens to give better results than SVM and Bayes based methods

- trained on approx. 1084 disease genes versus 1084 randomly selected genes

- Decision tree implemented in Perl and applied to whole genome.

- Results stored in MySQL database.

- Database queryable from web.

http://www.genetics.med.ed.ac.uk/prospectr

[Figure: the PROSPECTR decision tree. Starting from START, each node applies a yes/no test and each branch carries a numeric score. Tests shown: Mouse homol > 42%; Has signal peptide?; Mouse homol > 95%; Gene length > 997 bp; Exons > 32; Gene length > 563; Best paralog > 78%; Rat %id > 59%; 3’UTR < 647 bp; CDS len > 704 bp; Hs/Mm Ka/Ks < 0.195; GC > 37.5%; Mouse %id > 68.3%; Worm %id > 55%.]

CLASS is DISEASE if Score < 0

Classifier performance:

DATASET                                            RECALL  MISCLASSIFICATION
On training set                                    70%     41%
10-fold cross-validation                           70%     42%
On 675 genes from Human Gene Mutation DB           71%     40%
On 54 genes associated with ‘oligogenic’ diseases  72%     42%

[Figure: true positives and false positives on the training set.]

Performance on genomic data

For each disease gene D in HGMD disease set:

examine 30Mb locus L surrounding gene D

Score each gene X in L

Rank genes.

Where is D in list?:
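The evaluation loop above can be sketched as follows. Everything here is hypothetical: the gene names, positions and scores are made up, the 30 Mb window is assumed to be centred on the disease gene, and a higher score is assumed to mean "more disease-like".

```python
def rank_in_locus(genes: dict, target: str, window: int = 30_000_000) -> tuple:
    """Rank `target` among all genes within a window centred on it.

    `genes` maps gene name -> (position in bp, score).
    Returns (rank of target, number of genes in the locus); rank 1 is best."""
    centre = genes[target][0]
    half = window / 2
    locus = [name for name, (pos, _) in genes.items() if abs(pos - centre) <= half]
    ranked = sorted(locus, key=lambda name: genes[name][1], reverse=True)
    return ranked.index(target) + 1, len(ranked)


# Hypothetical genes on one chromosome: name -> (position, normalised score).
genes = {
    "D":  (50_000_000, 0.90),  # the known disease gene
    "X1": (40_000_000, 0.40),
    "X2": (60_000_000, 0.95),
    "X3": (70_000_000, 0.99),  # outside the 30 Mb window around D
    "X4": (52_000_000, 0.10),
}
print(rank_in_locus(genes, "D"))  # (2, 4)
```

Repeating this for every known disease gene and summarising the ranks shows how often the classifier would have placed the true gene near the top of a realistic candidate list.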

Score every gene in the genome, and normalise scores between 0 and 1

33% of genes (36/61) scoring > 0.75 are disease genes (8 fold enrichment)

0.8% of genes (35/4357) scoring < 0.3 are disease genes (6 fold enrichment)

Prospectr - Pros and Cons

Pros

• Applicable to all gene sequences

• Fast

• Hypothesis independent

Cons

• May be too unbiased

• How well will it predict complex disease genes?

Breakdown of disease causing mutations.

(from Human Gene Mutation Database 2002)

mutation type number % of total

deletion 6085 21.8

insertion/duplication 1911 6.8

rearrangement 512 1.8

repeat variations 38 0.1

missense/nonsense 16441 58.9

splicing 2727 9.8

regulatory 213 0.8

Most known disease mutations affect protein sequence.

Identifying causative mutations

Polymorphic moderate to low risk variants

Disease              Locus    Change    Frequency  Relative risk
Alzheimers           APOE     C112R     0.09-0.22  4-15
Thrombosis           factorV  R506Q     0.00-0.08  5-10
Haemochromatosis     Hfe      H63D      0.02-0.22  4.0
NIDDM                PPAR     P12A      0.85       1.25
IDDM                 INS      promoter  0.85       1.5-2.5
HIV                  CCR5     promoter  0.01-0.14  high
Crohns disease       NOD2     G908R     0.01       6.0
                              R702W     0.04       3.0
Breast cancer        BRCA2    N372H     0.25       1.3
Colon cancer         APC      I1307K    0.03       2.0
Neural tube defects  MTHFR    C677T     0.3-2.0
Graves disease       CTLA4    T17A      0.35       1.5-2.0
CJD                  PRNP     M129V     0.65       3.0
FMF                  MEFV     P369S     0.02       7.0
Bipolar disorder     XPD1     promoter  ??         ??

again, mainly mis-sense mutations

Breakdown of SNP distribution Source : www.ensembl.org 2005

Total 9 100 000 confirmed SNPs currently

mutation type number % of total

Mis-sense/ nonsense 58119 0.64

Synonymous 44041 0.48

Splice site changing 7023 0.08

‘Regulatory’ SNPS 818215 9.2

Intronic 3201212 35.2

Intergenic 4971390 54.1

But most mutations are intronic or between genes! => MAJOR CHALLENGE


How to identify functional variants

In proteins - many homology based approaches e.g.,

SIFT blocks.fhcrc.org/sift/SIFT.html

PolyPhen www.bork.embl-heidelberg.de/PolyPhen/

Identifying functional regulatory and splicing sites computationally

- much harder due to:

• fewer tools: no equivalent of cDNA libraries, ESTs

• short sequences: occur very often but only some are functional

• sequences often only present in vertebrates; slower experiments


Use of protein sequence alignments

IQDDPTLFDYNVDERVAKFTKYA-------KDQAAFFQTDNIIMTMGSDFQYENANAWYK
PIND----DLESP--DYNVDDRVEKLVKYAQLQAIFYKTNNVIFTMGEDFNYQHAEMWFT
-------PVVDNPRSPENAKTLVNYFLKLASSQKGFYRTNHTVMTMGSDFHYENANMWFK
PIRDD--PDLED----YNVDEIVQKFLNASHKQADYYKTNHIIMTMGSDFQYENANLWYK
-------PVVDDPTSPENANKLVDYFLNLASSQKKYYRTNHTVMTMGSDFQYENANMWFK
-----DQPLVEDPRSPENAKELVDYFLNVATAQGRYYRTNHTVMTMGSDFQYENANMWFK
FVEDRRSPEYN---AEELVNYFLQLATA----QGQHFRTNHTIMTMGSDFQYENANMWFR
FVED---PRSPEYNAKELVNYFLQLATA----QSEHYRTNHTIMTMGSDFQYENANMWFK
PIIDGKHS------PDNNVKERVDAFLAYVTEMAEHFRTPNVILTMGEDFHYQNADMWYK
VVEDTRSPEY-------NAKELVRYFLKLATDQGKLYRTKHTVMTMGSDFQYENANTWFK
----GVPP---ETIHLGNVQKRAEMLLDQYRKKSKLFRTTVVLAPLGDDFRYCERTEQFK
----GVPP---ETIHPGNVQSRARMLLDQYRKKSKLFRTKVLLAPLGDDFRYCEYTLQFK
VQDD---PDLFD----YNVQERVNAFVAAALDQANITRINHIMFTMGTDFRYQYAHTWYR

For each position in an alignment, calculate the probability of an amino acid substitution being tolerated.
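A heavily simplified version of this idea is to count how often each amino acid is seen in each alignment column and treat rarely or never observed residues as unlikely to be tolerated. The sketch below does only that, on the last 20 columns of the alignment above. It is not SIFT's actual algorithm, which builds a more sophisticated position-specific probability model, and the 0.05 cut-off used here is an arbitrary choice for this sketch.

```python
from collections import Counter

# Last 20 columns of the 13 aligned sequences shown on the slide.
block = [
    "NIIMTMGSDFQYENANAWYK", "NVIFTMGEDFNYQHAEMWFT", "HTVMTMGSDFHYENANMWFK",
    "HIIMTMGSDFQYENANLWYK", "HTVMTMGSDFQYENANMWFK", "HTVMTMGSDFQYENANMWFK",
    "HTIMTMGSDFQYENANMWFR", "HTIMTMGSDFQYENANMWFK", "NVILTMGEDFHYQNADMWYK",
    "HTVMTMGSDFQYENANTWFK", "VVLAPLGDDFRYCERTEQFK", "VLLAPLGDDFRYCEYTLQFK",
    "HIMFTMGTDFRYQYAHTWYR",
]


def column_frequencies(alignment: list, column: int) -> dict:
    """Observed frequency of each residue in one alignment column (gaps ignored)."""
    residues = [row[column] for row in alignment if row[column] != "-"]
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


def is_tolerated(alignment: list, column: int, residue: str, cutoff: float = 0.05) -> bool:
    """Crude call: a substitution is 'tolerated' if the residue is seen often enough."""
    return column_frequencies(alignment, column).get(residue, 0.0) >= cutoff


print(column_frequencies(block, 6))   # column 7 of the block: glycine in every sequence
print(is_tolerated(block, 6, "A"))    # alanine never observed here -> False
print(is_tolerated(block, 7, "E"))    # glutamate seen in 2 of 13 sequences -> True
```

Columns that are invariant across homologues, such as the glycine here, are the positions where a substitution is most likely to damage the protein.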

Conclusions

• Bioinformatics indispensable for modern medical genetics

• ‘Simple’ disease genes identified routinely

• Major challenges in dissecting ‘complex’ diseases

• Identifying functional sequences in non-coding DNA

• Predicting combined effect of multiple polymorphisms

Bibliography and further reading

• Data Mining. Ian Witten and Eibe Frank. (www.mkp.com)

• Finding disease genes computationally:
General: Tabor et al., Nature Reviews Genetics (May 2002).
POCUS: Turner et al., Genome Biology (2003) 4(11):R75.
Prospectr: Adie et al., BMC Bioinformatics (2005) 6(1):55.

• Polymorphisms and haplotype mapping: Crawford, Annual Review of Medicine (2005) 56:303-320.

• Web based tutorials on linkage/disease gene mapping: http://www.ucl.ac.uk/~ucbhjow/b241/biochemical.html

• Identifying detrimental amino acid substitutions: SIFT: Ng and Henikoff, Nucleic Acids Research (2003) 3812-14.

