msc bioinformatics medical genetics tutorial
TRANSCRIPT
Bioinformatics in human genetics
- what causes human genetic disease?
- how do we identify the genes responsible?
- how do we find the molecular mechanism?
Richard Adams,
Psychiatric Genetics Group,
Medical Genetics Section,
Molecular Medicine Centre,
Western General Hospital.
http://www.genetics.med.ed.ac.uk
Sources of mutation
Type of mutation, mechanism, and frequency per cell division:
- Point mutation
  Mechanism: 1. mistakes in DNA replication; 2. DNA damage by chemical mutagens (or by radiation) and misrepair
  Frequency: ~10^-10/basepair, ~10^-5/gene, ~0.5/cell
- Submicroscopic deletion, insertion or duplication
  Mechanism: 1. unequal crossing over; 2. misalignment during DNA replication; 3. insertion of mobile element; 4. DNA damage by chemical mutagens (or by radiation) and misrepair
  Frequency: included in the above
- Microscopically visible deletion, translocation or inversion
  Mechanism: 1. unequal crossing over; 2. DNA damage by chemical mutagens (or by radiation) and misrepair
  Frequency: 6 x 10^-4
- Loss/gain of a whole chromosome
  Mechanism: missegregation at mitosis
  Frequency: 1 in 100
- mutations are a primary cause of evolution
- most are detrimental but some lead to novel and advantageous function.
Mutations, disease and polymorphisms
- Direct loss of protein function (most common), e.g., phenylketonuria, cystic fibrosis (triplet coding for F508 removed).
- Loss of splicing leading to protein dysfunction, e.g., spinal muscular atrophy.
- Loss/alteration of gene regulatory regions, e.g., type I diabetes, CCR5 and HIV; few confirmed examples yet.
Many mutations are frequent in the population. Those with a frequency of >1% in the population are called polymorphisms. Types of polymorphism include Variable Number Tandem Repeats (VNTRs), e.g.,
AGCTGGTACATATATATATATCGTTACGTGA maternal
AGCTGGTACATATAT------CGTTACGTG paternal
or single nucleotide polymorphisms (SNPs), e.g.,
AGCTGGTTCAGCACTAGCAGTCT maternal
AGCTGGTTCAGTACTAGCAGTCT paternal
These polymorphisms allow us to distinguish homologous chromosomes.
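As a small illustration of how a SNP distinguishes the two homologous chromosomes, the sketch below compares the maternal and paternal SNP example sequences from the slide base by base. The function name `snp_positions` is hypothetical, and the sketch assumes the two sequences are already aligned and of equal length.

```python
def snp_positions(maternal: str, paternal: str) -> list[tuple[int, str, str]]:
    """Return (1-based position, maternal base, paternal base) for each mismatch."""
    if len(maternal) != len(paternal):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (i + 1, m, p)
        for i, (m, p) in enumerate(zip(maternal, paternal))
        if m != p
    ]


# SNP example sequences taken from the slide.
maternal = "AGCTGGTTCAGCACTAGCAGTCT"
paternal = "AGCTGGTTCAGTACTAGCAGTCT"
print(snp_positions(maternal, paternal))  # [(12, 'C', 'T')]
```

The single C/T difference at position 12 is what lets us tell the maternal copy from the paternal one.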
Genetic diseases differ in their mode of inheritance
• Simple Mendelian (easy to analyse), e.g., Huntington's disease, cystic fibrosis, Duchenne muscular dystrophy.
  There is a complete correlation between genotype and phenotype: if you've got the mutant gene, you'll get the disease. Dysfunction of these genes is SUFFICIENT and NECESSARY for the disease to occur.
• Oligogenic (more difficult), e.g., Alzheimer’s disease.
  A small number of genes contribute most of the genetic risk. Dysfunction of these genes is SUFFICIENT but not NECESSARY for the disease to occur.
• Complex or multifactorial (hard to analyse), e.g., many common diseases, such as cancer, asthma, schizophrenia, hypertension, heart disease.
  - The risk of getting the disease is modified by the individual's genotype.
  - Other factors, especially other genes and environment, also influence the risk of getting a disease.
  - Many genes are involved.
  Dysfunction of any one gene is NEITHER NECESSARY NOR SUFFICIENT for the disease to occur.
Classic mapping strategy for Mendelian disease genes
Examine families with inherited disease (“Linkage studies”)
Identify region of interest
Identify candidate genes in ROI
Identify possible functional mutations
Perform an “Association study” of a gene of interest
Identify the causative mutation experimentally
The human genome
- approx. 30 000 genes
- genes are only 1.5% of all DNA
- only ~5% of genome under selection
- 23 pairs of chromosomes = 46 altogether
- During production of eggs and sperm, chromosomes pair up and recombine with each other – “shuffling”.
Many polymorphic markers are used in a real mapping study.
Mapping disease genes by linkage analysis
Blue = paternal, Red = maternal
A, B and C are paternal alleles; a, b and c are maternal alleles.
During meiosis recombination occurs randomly.
The disease mutation will segregate most often with A, then B, then C.
Is successful when:
1) High penetrance of the mutation
2) Unambiguous disease diagnosis
3) Reproducible between studies
Results in:
- small chromosome region defined
- ~1 cM ≈ 1 Mb ≈ 10 genes
- basis for gene association study.
Observe recombination events in a multi-generational family
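Linkage between a marker and the disease is usually summarised as a LOD score. The sketch below uses the standard two-point formula for phase-known meioses, LOD(theta) = log10( theta^k (1 - theta)^(n - k) / 0.5^n ), where n is the number of informative meioses and k the number of recombinants. The function name and the example counts are illustrative assumptions, not data from the slides; real pedigree analysis uses dedicated linkage software.

```python
import math


def lod_score(n_meioses: int, n_recombinants: int, theta: float) -> float:
    """Two-point LOD score for phase-known meioses at recombination fraction theta."""
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie between 0 and 0.5")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("recombinants must be between 0 and the number of meioses")
    non_recombinants = n_meioses - n_recombinants
    # Likelihood of the observations if marker and disease locus are linked at theta.
    linked = (theta ** n_recombinants) * ((1.0 - theta) ** non_recombinants)
    # Likelihood under free recombination (theta = 0.5), i.e. no linkage.
    unlinked = 0.5 ** n_meioses
    if linked == 0.0:
        return float("-inf")  # observed recombinants are impossible at theta = 0
    return math.log10(linked / unlinked)


# Hypothetical family: 10 informative meioses, 1 recombinant.
for theta in (0.01, 0.1, 0.2, 0.5):
    print(theta, round(lod_score(10, 1, theta), 3))
```

A LOD score of 3 or more is the conventional threshold for declaring linkage, which is why low penetrance or ambiguous diagnosis (which add apparent recombinants) weaken the signal so quickly.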
Gene association studies
• Define more precisely the disease locus.
• Use SNPs as these are most frequent polymorphisms.
• Family studies do not contain enough recombination events.
• Therefore use large numbers of unrelated people – examine how well particular SNPs have co-segregated with a disease.
• Very expensive/laborious to examine hundreds of SNPs in hundreds of people.
• Software attempts to reconstruct the recombination events that have occurred to generate the current genotypic diversity.
• Results in the identification of a disease “haplotype”
• Basis for determining the causative mutation.
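To make "how well a SNP has co-segregated with a disease" concrete, here is a minimal sketch of a single-SNP allelic association test: a 2x2 chi-square statistic and odds ratio computed from allele counts in cases and controls. The function name and the counts are hypothetical, and this is only one simple test; the slides do not say which statistic a given study uses.

```python
def allelic_association(case_a: int, case_b: int,
                        control_a: int, control_b: int) -> tuple[float, float]:
    """Return (chi-square with 1 d.f., odds ratio) for a 2x2 table of allele counts.

    Rows are cases/controls; columns are allele A / allele B.
    """
    a, b, c, d = case_a, case_b, control_a, control_b
    n = a + b + c + d
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0 or b * c == 0:
        raise ValueError("table has an empty margin or a zero cell in the odds ratio")
    chi2 = n * (a * d - b * c) ** 2 / margins
    odds_ratio = (a * d) / (b * c)
    return chi2, odds_ratio


# Hypothetical counts: allele A seen 60 times in cases and 40 times in controls.
chi2, odds_ratio = allelic_association(60, 40, 40, 60)
print(chi2, odds_ratio)  # 8.0 2.25
```

With hundreds of SNPs tested, the resulting p-values would need correcting for multiple testing, which is part of why these studies need so many people.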
What’s a Haplotype?
• Combination of adjacent alleles along a chromosome that are inherited as a unit.
• Recombination is not perfectly random.
• Therefore it is not usually possible to identify a single mutation.
• Useful as a mapping tool - only necessary to examine a subset of ‘tagging’ SNPs in order to cover the genomic ROI (www.hapmap.org).

SNP1   SNP2   SNP3
A      A      G      27
A      C      G       1
G      A      A       7
G      A      G       1
G      C      A      52
A/G    A/C    A/G
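The idea of a ‘tagging’ SNP can be illustrated with the haplotype table above. The sketch below computes the pairwise linkage disequilibrium measure r^2 between two SNPs, treating the final column of the table as haplotype counts (an assumption; the slide does not label that column). The function name `r_squared` is hypothetical.

```python
# Haplotypes (SNP1, SNP2, SNP3) and their counts, as read from the slide table.
HAPLOTYPES = {"AAG": 27, "ACG": 1, "GAA": 7, "GAG": 1, "GCA": 52}


def r_squared(haplotypes: dict[str, int], i: int, j: int) -> float:
    """Pairwise r^2 between SNP i and SNP j (0-based) from haplotype counts."""
    total = sum(haplotypes.values())
    first = next(iter(haplotypes))
    allele_i, allele_j = first[i], first[j]  # arbitrary reference alleles
    p_i = sum(n for h, n in haplotypes.items() if h[i] == allele_i) / total
    p_j = sum(n for h, n in haplotypes.items() if h[j] == allele_j) / total
    p_ij = sum(n for h, n in haplotypes.items()
               if h[i] == allele_i and h[j] == allele_j) / total
    d = p_ij - p_i * p_j  # linkage disequilibrium coefficient D
    return d * d / (p_i * (1 - p_i) * p_j * (1 - p_j))


print(round(r_squared(HAPLOTYPES, 0, 2), 3))  # SNP1 vs SNP3: 0.949
print(round(r_squared(HAPLOTYPES, 0, 1), 3))  # SNP1 vs SNP2: 0.625
```

Under that reading of the table, SNP1 and SNP3 are almost perfectly correlated, so genotyping one of them would largely capture the other; SNP2 carries more independent information.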
Not so simple for complex diseases
Unfortunately the approach doesn’t work nearly as well for complex (polygenic) disorders
like schizophrenia, diabetes, asthma etc…
From Glazier et al. 2002
Why are genes involved in complex disease hard to find?
• Low correlation of genotype with phenotype => poor linkage scores
=> large regions
• Studies often are not reproducible between populations.
• E.g., for bipolar disorder ~ 1/3 of genome has been implicated in >= 1 study.
• Typically the region of interest contains hundreds of genes – currently unfeasible to examine all of them in association studies.
[Figure: linkage intervals for families F48, F50, F59 and F22, with Minimal Region I and Minimal Region II marked.]
Figure 2: Four families display linkage to chromosome 4p15-16. Minimal region I and II are defined by the regions where three of the four linkage signals overlap.
Case study: Bipolar disorder and chromosome 4
- approx. 9.5 Mb (33 known genes) is shared between 3 of the 4 families
- but 22 Mb (65 known genes) may be implicated
Approaches to prioritize candidate gene selections
• Use genome annotation and clinical knowledge to predict what may be a good candidate, e.g.,
– “I’m looking for a gene involved in schizophrenia and I know that dopamine levels are elevated in schizophrenia patients – so I’ll screen the dopamine receptor genes first.”
– May be useful when we have some idea of molecular mechanism.
– But for bipolar disorder and schizophrenia we have only a basic idea of the cellular mechanisms involved.
Let computers do the monkey work
• Growing interest in automated methods to prioritize candidate genes
• Half a dozen systems freely available (some on the www)
• E.g.,
– POCUS
– GeneSeeker
– PROSPECTR
GeneSeeker – correlating gene expression with disease
• Van Driel et al. 2005
• Takes as input a genomic region and an expression pattern
• Assumes that genes involved in a particular disease will be expressed in the same tissues.
• Compares expression profiles of genes in the region of interest, returns subset that are expressed in relevant tissues.
• No statistics, is entirely qualitative – just uses boolean querying of database.
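The boolean, purely qualitative style of query described above can be sketched in a few lines. This is not GeneSeeker's code or database schema; the gene names, tissue labels and the function `filter_by_expression` are all hypothetical, and the example only shows the kind of set-based filtering involved.

```python
def filter_by_expression(genes: dict[str, set[str]],
                         relevant_tissues: set[str]) -> list[str]:
    """Keep genes expressed in at least one of the relevant tissues."""
    return sorted(
        gene for gene, tissues in genes.items()
        if tissues & relevant_tissues  # non-empty intersection
    )


# Hypothetical genes in a region of interest and where they are expressed.
region_genes = {
    "GENE_A": {"brain", "liver"},
    "GENE_B": {"kidney"},
    "GENE_C": {"brain"},
    "GENE_D": set(),  # no expression data available
}

print(filter_by_expression(region_genes, {"brain"}))  # ['GENE_A', 'GENE_C']
```

Note how `GENE_D` is silently dropped: a gene without expression data can never pass the filter, and the genes that do pass are returned unranked.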
GeneSeeker : Pros & Cons
• Easy assumption to understand (psychiatric illnesses associated with genes expressed in the brain, for example…)
• Fast
• Must have sufficient knowledge about disease (which tissues involved?)
• Not all genes have reliable, normalized expression data
• Eliminates unlikely genes but doesn’t prioritize likely ones…
• Assumes new gene will be similar to existing known disease genes.
POCUS: Prioritization Of Candidates Using Statistics
• Premise : Several genes may act in a pathway – partial disruption of several of these may result in disease.
• Takes as input two or more regions of interest (identified through linkage analysis).
• Works on oligogenic diseases (genes not necessary, but sufficient) e.g., Alzheimer’s disease, hypertension, inflammatory bowel disease.
• Looks for significantly overrepresented Gene Ontology terms or protein domains within those regions.
• Returns all genes with those functions – for a locus with ~500 genes a 20 fold enrichment for disease genes is produced.
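One generic way to ask whether an annotation is "significantly overrepresented" is a hypergeometric tail probability. The sketch below is a stand-in for that idea, not the statistic POCUS itself uses (the slides do not give it); the function name and the example numbers are hypothetical.

```python
from math import comb


def enrichment_p(population: int, annotated: int, sample: int, hits: int) -> float:
    """P(at least `hits` annotated genes in a random sample of size `sample`).

    population: genes in the genome; annotated: genes carrying the GO term
    or domain; sample: genes in the loci; hits: annotated genes in the loci.
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return favourable / comb(population, sample)


# Hypothetical: 30 000 genes, 150 carry the term, 500 genes in the loci, 12 hits.
print(enrichment_p(30_000, 150, 500, 12))
```

A small probability suggests the term is shared across the loci more often than chance would predict, so the genes carrying it become the prioritized candidates.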
Gene ontologies
• The Gene Ontology consortium (GO, www.geneontology.org)
• An attempt to create a controlled vocabulary for biology.
  – Biological process, e.g., glucose metabolism
  – Molecular function, e.g., dehydrogenase
  – Cellular component, e.g., mitochondria
• Terms are arranged in a directed acyclic graph.
[Figure: example graph containing the terms Neurophysiological Process, Cell-cell signalling, Synaptic Transmission, Regulation of Action Potential, Neurotransmitter Regulation and Regulation of Synapse Structure.]
POCUS : Pros & Cons
• No prior knowledge of disease etiology needed
• No assumption of mechanism
• Fast
• Not all genes have adequate functional annotation
• Matches need to be exact: POCUS doesn’t take the tree-like structure of GO into account.
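The last limitation can be made concrete with a small sketch of what "taking the graph structure into account" would mean: walking from a term up to all of its ancestors, so that two genes annotated with different but related terms can still be matched. The term names are those on the Gene Ontology slide, but the parent links below are illustrative assumptions, not the real GO edges, and `ancestors` is a hypothetical helper.

```python
# Assumed child -> parents links, for illustration only (a term may have
# several parents, which is what makes GO a directed acyclic graph, not a tree).
PARENTS = {
    "synaptic transmission": {"neurophysiological process", "cell-cell signalling"},
    "regulation of action potential": {"synaptic transmission"},
    "neurotransmitter regulation": {"synaptic transmission"},
    "regulation of synapse structure": {"synaptic transmission"},
}


def ancestors(term: str, parents: dict[str, set[str]]) -> set[str]:
    """Return every term reachable by following parent links upwards."""
    seen: set[str] = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


print(sorted(ancestors("regulation of action potential", PARENTS)))
```

An exact-match comparison would treat "regulation of action potential" and "neurotransmitter regulation" as unrelated; comparing ancestor sets shows that, under the assumed links, both sit below "synaptic transmission".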
Problems with candidate gene /annotation based approaches
• Dependence on functional annotation and prior knowledge.
- this is variable between genes.
- about 1/6 of human genes have almost no functional annotation
• Little possibility of discovering novelty
Sequence based methods to identify disease genes
• Initial observation - genes disrupted in schizophrenia were very long.
• Examine sequence features of known disease genes:
  - gene length, sequence identity, number of homologues, sequence motifs etc.
• Some statistically significant differences are apparent between disease and non-disease genes.
Proteins encoded by disease genes are on average longer than non-disease genes.
[Figure: normalized frequency against protein length (aa), from 0 to 2500, for disease and non-disease genes.]
Sequence properties differ between disease and non-disease genes
Can these sequence features be used to predict ‘unknown’ disease genes?
Homology definitions
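[Figure: an initial gene duplication event in ancestral DNA turns gene X into X and Y; human and chimp each carry both X and Y. X and Y within one species are paralogues; the same gene in human and chimp is a pair of orthologues.]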
Machine learning in biology
• Based on these data we can predict which other genes in the genome also have these properties, using machine learning approaches.
• Machine learning approaches therefore learn from a known training set in order to classify unknown instances into one of 2 classes.
• Especially useful in biology where training data comes from difficult, expensive experiments and we want to extrapolate across the genome.
• Machine learning approaches include support vector machines, Bayesian statistics, nearest neighbour instance-based learning, and decision trees.
• Decision trees are particularly useful as they produce a human interpretable set of decisions whose biological significance can be analysed.
PRiOritization by Sequence & PhylogEnetic features of CandidaTe Regions - PROSPECTR
- build classifier using a decision tree (based on C4.5) using the Weka machine learning package (http://www.cs.waikato.ac.nz/~ml/weka/)
- can input an arbitrary number of attributes
- gives human readable classifier
- also happens to give better results than SVM and Bayes based methods.
- train on approx. 1084 disease genes versus 1084 randomly selected genes
- Decision tree implemented in Perl and applied to whole genome.
- Results stored in MySQL database.
- Database queryable from web.
http://www.genetics.med.ed.ac.uk/prospectr
[Figure: PROSPECTR decision tree. From START, a series of yes/no (Y/N) tests on gene features each contribute to a score. Tests shown: Mouse homol > 42%; Has signal peptide?; Mouse homol > 95%; Gene length > 997bp; Exons > 32; Gene length > 563; Best paralog > 78%; Rat %id > 59%; 3’UTR < 647bp; CDS len > 704bp; Hs/Mm Ka/Ks < 0.195; GC > 37.5%; Mouse %id > 68.3%; Worm %id > 55%.]
CLASS is DISEASE if Score < 0
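The decision rule "DISEASE if Score < 0" can be sketched as a sum of per-test contributions. The two thresholds below (gene length > 997 bp, GC > 37.5%) and the signal peptide test appear on the slide, but every weight is a made-up placeholder, as are the feature names and the flat, non-nested layout of the tests. This shows only the shape of such a classifier, not the trained PROSPECTR tree.

```python
# Each rule: (feature name, test, score if the test fails, score if it passes).
# All four score values per rule are HYPOTHETICAL placeholders.
RULES = [
    ("gene_length_bp", lambda v: v > 997, 0.10, -0.20),
    ("has_signal_peptide", lambda v: bool(v), 0.05, -0.15),
    ("gc_percent", lambda v: v > 37.5, -0.05, 0.10),
]


def score(gene: dict) -> float:
    """Sum the contribution of every yes/no test for one gene."""
    return sum(yes if test(gene[name]) else no for name, test, no, yes in RULES)


def classify(gene: dict) -> str:
    """Apply the slide's decision rule: DISEASE if the total score is below 0."""
    return "DISEASE" if score(gene) < 0 else "NON-DISEASE"


long_secreted = {"gene_length_bp": 2000, "has_signal_peptide": True, "gc_percent": 45.0}
short_plain = {"gene_length_bp": 500, "has_signal_peptide": False, "gc_percent": 30.0}
print(classify(long_secreted), classify(short_plain))  # DISEASE NON-DISEASE
```

Because each test contributes a visible amount to the total, a biologist can read off which sequence features pushed a gene towards the disease class.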
Classifier performance:
DATASET                                              RECALL   MISCLASSIFICATION
On training set                                      70%      41%
10 fold cross validation                             70%      42%
On 675 genes from Human Gene Mutation DB             71%      40%
On 54 genes associated with ‘oligogenic’ diseases    72%      42%
[Figure: training set, with true positives (True +ve) and false positives (False +ve) marked.]
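For readers unfamiliar with the two performance measures, the sketch below computes them from a confusion matrix. It assumes "misclassification" means the overall error rate, (false negatives + false positives) / all genes, which the slide does not spell out. The counts are hypothetical, chosen so that a balanced set of 100 disease and 100 non-disease genes reproduces the 70% / 41% training-set figures.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of true disease genes that the classifier labels as disease."""
    return tp / (tp + fn)


def misclassification_rate(tp: int, fn: int, fp: int, tn: int) -> float:
    """Fraction of all genes given the wrong label."""
    return (fn + fp) / (tp + fn + fp + tn)


# Hypothetical counts per 100 disease genes and 100 non-disease genes.
tp, fn, fp, tn = 70, 30, 52, 48
print(recall(tp, fn))                          # 0.7
print(misclassification_rate(tp, fn, fp, tn))  # 0.41
```

Under this reading, a 41% error rate alongside 70% recall implies that roughly half of the non-disease genes are also flagged, which is why the classifier is used to enrich candidate lists and not to give a verdict on a single gene.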
Performance on genomic data
For each disease gene D in HGMD disease set:
    examine 30Mb locus L surrounding gene D
    Score each gene X in L
    Rank genes.
    Where is D in list?
Score every gene in the genome, and normalise scores between 0 and 1:
- 33% of genes (36/61) scoring > 0.75 are disease genes (8 fold enrichment)
- 0.8% of genes (35/4357) scoring < 0.3 are disease genes (6 fold enrichment)
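The evaluation loop and the notion of fold enrichment can be sketched as follows. The sketch assumes that, after normalisation, a higher score means "more disease-like" (consistent with the > 0.75 figure above); the gene names, scores and both function names are hypothetical.

```python
def rank_of(target: str, scores: dict[str, float]) -> int:
    """1-based rank of `target` when genes are sorted by descending score."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return ordered.index(target) + 1


def fold_enrichment(hits_selected: int, n_selected: int,
                    hits_total: int, n_total: int) -> float:
    """Disease-gene rate in the selected subset divided by the overall rate."""
    return (hits_selected / n_selected) / (hits_total / n_total)


# Hypothetical normalised scores for the genes in one 30Mb locus around gene D.
locus_scores = {"D": 0.81, "X1": 0.20, "X2": 0.93, "X3": 0.47}
print(rank_of("D", locus_scores))  # 2

# Hypothetical: 5 of 10 top-scoring genes are disease genes, against 10 of 100 overall.
print(fold_enrichment(5, 10, 10, 100))
```

Repeating `rank_of` for every known disease gene gives a distribution of ranks; the closer those ranks sit to 1, the fewer genes in a locus need to be screened before reaching the true one.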
Prospectr - Pros and Cons
Pros
• Applicable to all gene sequences
• Fast
• Hypothesis independent
Cons
• Maybe too unbiased
• How well will it predict complex disease genes?
Breakdown of disease causing mutations.
(from Human Gene Mutation Database 2002)
mutation type number % of total
deletion 6085 21.8
insertion/duplication 1911 6.8
rearrangement 512 1.8
repeat variations 38 0.1
missense/nonsense 16441 58.9
splicing 2727 9.8
regulatory 213 0.8
Most known disease mutations affect protein sequence.
Identifying causative mutations
Polymorphic moderate to low risk variants
Disease               Locus      Change     Frequency   Relative risk
Alzheimer's           APOE       C112R      0.09-0.22   4-15
Thrombosis            factor V   R506Q      0.00-0.08   5-10
Haemochromatosis      Hfe        H63D       0.02-0.22   4.0
NIDDM                 PPAR       P12A       0.85        1.25
IDDM                  INS        promo      0.85        1.5-2.5
HIV                   CCR5       promo      0.01-0.14   high
Crohn's disease       NOD2       G908R      0.01        6.0
                                 R702W      0.04        3.0
Breast cancer         BRCA2      N372H      0.25        1.3
Colon cancer          APC        I1307K     0.03        2.0
Neural tube defects   MTHFR      C677T      0.3-2.0
Graves disease        CTLA4      T17A       0.35        1.5-2.0
CJD                   PRNP       M129V      0.65        3.0
FMF                   MEFV       P369S      0.02        7.0
Bipolar disorder      XPD1       promoter   ??          ??
again, mainly mis-sense mutations
Breakdown of SNP distribution Source : www.ensembl.org 2005
Total 9 100 000 confirmed SNPs currently
mutation type number % of total
Mis-sense/ nonsense 58119 0.64
Synonymous 44041 0.48
Splice site changing 7023 0.08
‘Regulatory’ SNPS 818215 9.2
Intronic 3201212 35.2
Intergenic 4971390 54.1
But most mutations are intronic or between genes! => MAJOR CHALLENGE
How to identify functional variants
In proteins - many homology based approaches e.g.,
SIFT blocks.fhcrc.org/sift/SIFT.html
PolyPhen www.bork.embl-heidelberg.de/PolyPhen/
Identifying functional regulatory and splicing sites computationally
- much harder due to:
  - fewer tools – no equivalent of cDNA libraries, ESTs
  - short sequences – occur very often but only some are functional
  - sequences often only present in vertebrates; slower experiments
Use of protein sequence alignments
IQDDPTLFDYNVDERVAKFTKYA-------KDQAAFFQTDNIIMTMGSDFQYENANAWYK
PIND----DLESP--DYNVDDRVEKLVKYAQLQAIFYKTNNVIFTMGEDFNYQHAEMWFT
-------PVVDNPRSPENAKTLVNYFLKLASSQKGFYRTNHTVMTMGSDFHYENANMWFK
PIRDD--PDLED----YNVDEIVQKFLNASHKQADYYKTNHIIMTMGSDFQYENANLWYK
-------PVVDDPTSPENANKLVDYFLNLASSQKKYYRTNHTVMTMGSDFQYENANMWFK
-----DQPLVEDPRSPENAKELVDYFLNVATAQGRYYRTNHTVMTMGSDFQYENANMWFK
FVEDRRSPEYN---AEELVNYFLQLATA----QGQHFRTNHTIMTMGSDFQYENANMWFR
FVED---PRSPEYNAKELVNYFLQLATA----QSEHYRTNHTIMTMGSDFQYENANMWFK
PIIDGKHS------PDNNVKERVDAFLAYVTEMAEHFRTPNVILTMGEDFHYQNADMWYK
VVEDTRSPEY-------NAKELVRYFLKLATDQGKLYRTKHTVMTMGSDFQYENANTWFK
----GVPP---ETIHLGNVQKRAEMLLDQYRKKSKLFRTTVVLAPLGDDFRYCERTEQFK
----GVPP---ETIHPGNVQSRARMLLDQYRKKSKLFRTKVLLAPLGDDFRYCEYTLQFK
VQDD---PDLFD----YNVQERVNAFVAAALDQANITRINHIMFTMGTDFRYQYAHTWYR
For each position in an alignment, calculate the probability of an amino acid substitution being tolerated.
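A very crude proxy for "is this substitution tolerated?" is simply how often each amino acid is observed in a column of the alignment. The sketch below applies that to the last ten columns of the thirteen aligned sequences above. It is not the SIFT or PolyPhen algorithm (those use more elaborate, weighted models); `column_frequencies` is a hypothetical helper.

```python
from collections import Counter

# Last ten alignment columns of each of the thirteen sequences on the slide.
ALIGNMENT_TAIL = [
    "QYENANAWYK", "NYQHAEMWFT", "HYENANMWFK", "QYENANLWYK", "QYENANMWFK",
    "QYENANMWFK", "QYENANMWFR", "QYENANMWFK", "HYQNADMWYK", "QYENANTWFK",
    "RYCERTEQFK", "RYCEYTLQFK", "RYQYAHTWYR",
]


def column_frequencies(alignment: list[str], column: int) -> dict[str, float]:
    """Observed amino acid frequencies in one column, ignoring gap characters."""
    residues = [seq[column] for seq in alignment if seq[column] != "-"]
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


print(column_frequencies(ALIGNMENT_TAIL, 1))  # {'Y': 1.0}
print(column_frequencies(ALIGNMENT_TAIL, 0))  # Q, N, H and R all occur
```

The second column is tyrosine in every sequence, so a substitution there is a stronger candidate for being deleterious than one in the first column, where four different residues are already seen.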
Conclusions
• Bioinformatics is indispensable for modern medical genetics
• ‘Simple’ disease genes identified routinely
• Major challenges in dissecting ‘complex’ diseases
• Identifying functional sequences in non-coding DNA
• Predicting combined effect of multiple polymorphisms
Bibliography and further reading
• Data Mining. Ian Witten and Eibe Frank. (www.mkp.com)
• Finding disease genes computationally:
  General: Tabor et al., Nature Reviews Genetics (May 2002)
  POCUS: Turner et al., Genome Biology (2003) 4(11):R75.
  Prospectr: Adie et al., BMC Bioinformatics (2005) 6(1), p55.
• Polymorphisms and haplotype mapping: Crawford, Annual Review of Medicine (2005) 56, 303-320.
• Web based tutorials on linkage/disease gene mapping: http://www.ucl.ac.uk/~ucbhjow/b241/biochemical.html
• Identifying detrimental amino acid substitutions: SIFT: Ng and Henikoff, Nucleic Acids Research (2003) 3812-14.