Lecture 10. Topics in Omic Studies (Basics)
The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology
CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 2
Lecture outline
1. Genome-wide association studies
2. Omic studies
3. Case studies
Last update: 3-Nov-2015
GENOME-WIDE ASSOCIATION STUDIES
Part 1

Useful terms in genetics
• Locus (plural: loci)
  – A specific location, region, or gene on a chromosome
    • E.g., chr1:20376; human CFTR gene
• Allele
  – A variant of the DNA sequence at a given locus
    • E.g., A; wild-type
• Genotype
  – The set of alleles at a certain locus
    • E.g., A/A; A/C; wild-type/mutant
• Character
  – An observable property
    • E.g., eye color; shape of pea
    • Multiple levels (from expression level to growth rate)
• Trait
  – A variant of a character
    • E.g., blue (eye color); round (pea shape)
• Phenotype
  – An observed trait of an organism due to a combination of its genotype and environmental factors

Research problems: several directions
• Given a phenotype
  – Find the locus associated with it
  – Find an allele associated with it
  – Find the allele(s) that cause(s) it
• Given an allele/genotype
  – Determine the resulting trait/phenotype
• Given the alleles of different loci
  – Study how they are related to each other

From phenotype to genotype
• Suppose we observe a certain phenotype (e.g., a disease) in some individuals. How do we find the relevant loci/alleles?
• Things to consider:
  – Genetic vs. non-genetic factors
  – Single-locus vs. multi-locus
  – Homozygous vs. heterozygous (dominant vs. recessive)
• Difficulties:
  – Sample size
  – Controls
  – Data availability
  – Association vs. causality
  – Multiple hypothesis testing
Last update: 10-Nov-2015

Some well-known disease-related genes
• β-globin: mutations related to sickle-cell anemia (relationship discovered by Linus Pauling in 1949)
• p53: tumor suppressor
• BRCA1, BRCA2 (breast cancer 1/2, early onset): mutations related to breast cancer
• CFTR (cystic fibrosis transmembrane conductance regulator): mutations related to cystic fibrosis (first mutation found in 1988 by Francis Collins, Lap-Chee Tsui and John Riordan)
• CCR5 (C-C chemokine receptor type 5): a mutation related to protection against infection by M-tropic strains of HIV-1
Database: OMIM (Online Mendelian Inheritance in Man)
Image credit: Wikipedia

Different approaches
[Figure legend: affected male; unaffected male; affected female; unaffected female]
Image credit: Mullen et al., Neurology 72(6):558-565, (2009)

Family linkage study
• Main ideas:
  – Identify members with/without the phenotype
  – Study the inheritance patterns of genetic markers
    • Fine: SNP (single nucleotide polymorphism)
    • Coarse: RFLP (restriction fragment length polymorphism)
    • Many other types: AFLP, DArT, RAD, RAPD, SFP, SSLP, SSR, STR, VNTR, ... (see the Wikipedia article “Genetic marker”)
  – Deduce possible loci related to the phenotype
    • Usually not very precise
  – Also deduce homozygosity/heterozygosity

Restriction fragment length polymorphism
• Simple case for illustration: [figure; the marked positions are restriction sites]
• Reality: lots of irrelevant data
Image credit: Wikipedia, http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/TechRFLP.shtml

Case-control studies
• Instead of studying families, another way is to recruit unrelated individuals with the phenotype as cases, and individuals without the phenotype (or random individuals with a low chance of having the phenotype) as controls
• Advantage over family studies:
  – Easier to get large samples
• Disadvantage: more diverse background
  – Genotypic differences may be due to ethnicity, gender, etc. (more later)
  – Need to balance the case and control groups, or perform special analysis to separate out different factors (e.g., principal component analysis)
• Large studies may also have issues with diverse experimental protocols, data quality, etc.
  – Example: retraction of a 2010 Science paper about potential genetic signatures related to longevity
  – Retraction note: Science 333:404, (2011)

Genome-wide association studies
• With high-throughput technologies, it is now possible to check many potential variants at the same time
  – SNP arrays
    • Predefined SNPs
    • Relatively inexpensive
  – Array-comparative genomic hybridization (array CGH) for copy number variations
  – Whole-genome sequencing
    • High coverage
    • High cost, especially when read depth needs to be high for confident SNP calling
    • Can also detect other types of variants (e.g., indels)
  – Exome sequencing
    • Only sequence captured exons
    • Compromise between coverage and cost

Copy number variation (CNV)
[Figure panels: Array-CGH; Aneuploidy (e.g., in cancer)]
Image credit: http://clincancerres.aacrjournals.org/content/10/24/8204/F3.medium.gif, Chial, Nature Education 1(1), (2008)

Measures of association (allele)
• Suppose there are n1 cases and n2 controls, among which m1 and m2 have a given allele, respectively
• Is the allele likely associated with the phenotype?
• Contingency table:

                      With the phenotype   Without the phenotype   Total
  With the allele     m1                   m2                      m1+m2
  Without the allele  n1-m1                n2-m2                   (n1+n2)-(m1+m2)
  Total               n1                   n2                      n1+n2

• Null hypothesis H0: allele and phenotype are independent (hypergeometric distribution; row and column totals are fixed, others are variables)

  $\Pr(m_1 \mid H_0) = \frac{\binom{n_1}{m_1}\binom{n_2}{m_2}}{\binom{n_1+n_2}{m_1+m_2}}$

Measures of association (cont’d)
• p-value:
  – If the null hypothesis is true (that the allele is really unrelated to the phenotype), what is the probability of observing a value of m1 equal to or larger than the observed value?
[Figure: probability density of the null distribution against m1, with the observed value and the p-value marked]

Measures of association (cont’d)
• What probability (p-value) to compute:
  – One-sided Fisher’s exact test (when we think having the allele has a positive effect on the phenotype): reject H0 if Pr(m1 or more | H0) is small
    • Infeasible if numbers are large
  – Two-sided chi-square test (when we think having the allele has either a positive or a negative effect on the phenotype): reject H0 if a deviation from expectation at least as large as the observed one has a low probability
    • Expectation: e.g., # with both allele and phenotype = (m1+m2)n1/(n1+n2)
    • Test statistic: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where O_i and E_i are the observed and expected counts in cell i of the table. It follows a chi-square distribution with 1 degree of freedom when n1 and n2 are large
  – The “power” of a test is the probability that the null hypothesis will be rejected when it is actually false
    • The probability that we will not miss a real association

Example
• Suppose there are n1=12 cases and n2=8 controls, among which m1=6 and m2=3 have a given allele, respectively
• Contingency table:

                      With the phenotype   Without the phenotype   Total
  With the allele     6                    3                       9
  Without the allele  6                    5                       11
  Total               12                   8                       20

• Pr(m1=6 | H0) = (12C6)(8C3)/(20C9) = (924)(56)/167960 = 0.3081
• Pr(m1≥6 | H0) = (12C6)(8C3)/(20C9) + (12C7)(8C2)/(20C9) + (12C8)(8C1)/(20C9) + (12C9)(8C0)/(20C9) = 0.4650
  – Even if the phenotype is independent of the allele, by chance we still have a high probability of observing m1≥6
  – Therefore H0 cannot be rejected: the association between the allele and the phenotype is not statistically significant
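
The two probabilities in this example can be reproduced with a few lines of Python. This is a minimal sketch of the one-sided Fisher's exact test under the hypergeometric null described above; the function names are illustrative and do not come from the lecture.

```python
from math import comb

def hypergeom_pmf(k, n1, n2, with_allele):
    """Pr(exactly k of the n1 cases carry the allele | H0), all margins fixed."""
    return comb(n1, k) * comb(n2, with_allele - k) / comb(n1 + n2, with_allele)

def fisher_one_sided(m1, m2, n1, n2):
    """Pr(m1 or more cases carry the allele | H0)."""
    with_allele = m1 + m2
    upper = min(n1, with_allele)  # largest possible number of carrier cases
    return sum(hypergeom_pmf(k, n1, n2, with_allele) for k in range(m1, upper + 1))

# Numbers from the example: n1=12 cases, n2=8 controls, m1=6, m2=3
p_exact = hypergeom_pmf(6, 12, 8, 9)
p_tail = fisher_one_sided(6, 3, 12, 8)
print(f"Pr(m1=6 | H0)  = {p_exact:.4f}")
print(f"Pr(m1>=6 | H0) = {p_tail:.4f}")
```

The sum runs over every table that is at least as extreme as the observed one in the direction of "more carriers among cases", which is why the cost grows with the counts.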

Example (cont’d)
• Suppose there are n1=12 cases and n2=8 controls, among which m1=6 and m2=3 have a given allele, respectively
• Contingency table:

                      With the phenotype   Without the phenotype   Total
  With the allele     6                    3                       9
  Without the allele  6                    5                       11
  Total               12                   8                       20

• Expectations if phenotype is independent of allele:

                      With the phenotype      Without the phenotype   Total
  With the allele     (9/20)(12/20)20=5.4     (9/20)(8/20)20=3.6      9
  Without the allele  (11/20)(12/20)20=6.6    (11/20)(8/20)20=4.4     11
  Total               12                      8                       20

• Chi-square statistic: (6 - 5.4)^2 / 5.4 + (3 - 3.6)^2 / 3.6 + (6 - 6.6)^2 / 6.6 + (5 - 4.4)^2 / 4.4 = 0.3030
  – Pr(χ² > 0.3030 | H0) = 0.5820
  – Even if phenotype is independent of the allele, there is still a high chance of getting values that deviate from expectation as much as, or more than, the observed ones
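
The same calculation can be sketched in Python. The helper names below are illustrative. The tail probability uses the identity Pr(χ² > x) = erfc(sqrt(x/2)), which holds for 1 degree of freedom only, so no statistics library is assumed.

```python
from math import erfc, sqrt

def expected_counts(table):
    """Expected cell counts under independence, from row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_stat(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def chi2_tail_df1(x):
    """Pr(chi-square with 1 degree of freedom > x)."""
    return erfc(sqrt(x / 2.0))

observed = [[6, 3],   # with the allele: cases, controls
            [6, 5]]   # without the allele: cases, controls
stat = chi_square_stat(observed)
p_value = chi2_tail_df1(stat)
print(f"chi-square = {stat:.4f}, p-value = {p_value:.4f}")
```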

Measures of association (genotype)
• If we also want to know whether the homozygous and heterozygous situations are different, we need a 3x2 table instead (χ²: 2 degrees of freedom):

  Genotype   With the phenotype   Without the phenotype
  AA         m11                  m21
  Aa         m12                  m22
  aa         m13                  m23

• Other tests are available for phenotypes that are not binary
  – Quantitative trait locus (QTL): a locus associated with a continuous trait
    • eQTL: the trait is the expression level of a gene
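
A 3x2 genotype table can be tested in the same way. The sketch below uses made-up counts, which are purely hypothetical and not from the lecture, and relies on the fact that for 2 degrees of freedom the chi-square tail probability is exactly exp(-x/2).

```python
from math import exp

# Hypothetical genotype counts: rows AA, Aa, aa; columns (cases, controls)
genotype_counts = [[30, 50],
                   [50, 40],
                   [20, 10]]

def chi_square_stat(table):
    """Pearson chi-square statistic for an r x c contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

stat_3x2 = chi_square_stat(genotype_counts)
# (3-1) x (2-1) = 2 degrees of freedom; the tail probability is exp(-x/2)
p_value_3x2 = exp(-stat_3x2 / 2.0)
print(f"chi-square = {stat_3x2:.4f}, p-value = {p_value_3x2:.4f}")
```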

A problem of p-value
• A p-value only tells how likely it is that data at least as extreme as the observed data arise by random chance under the null hypothesis, not how much the actual situation deviates from the null hypothesis
• Example:

  n1       n2       m1      m2      p-value (2-sided chi-square with Yates’ correction)
  100      100      51      49      0.8875
  1000     1000     510     490     0.3955
  10000    10000    5100    4900    0.0049
  100000   100000   51000   49000   <0.0001

• When sample size is large, it is quite clear that the allele and phenotype are not independent
  – However, the association is weak (51% vs. 49%)

Effect size
• In addition to p-value, we also want to know how much more likely the individuals in the case group are to have the allele than those in the control group
• Usual measures for this “effect size”:
  – Relative risk,
    RR = (fraction of cases among people with the allele) / (fraction of cases among people without the allele)
    [Cannot be computed from a case-control study, since the case/control ratio in the sample is different from the case/non-case ratio in the population]
  – Odds ratio,
    OR = [(# with the allele in case group) / (# without the allele in case group)] / [(# with the allele in control group) / (# without the allele in control group)]
       = [m1/(n1 - m1)] / [m2/(n2 - m2)]
       = [m1(n2-m2)] / [m2(n1-m1)]
Last update: 11-Nov-2015

Example

  n1       n2       m1      m2      p-value (2-sided chi-square with Yates’ correction)   Odds ratio
  100      100      51      49      0.8875                                                1.08
  1000     1000     510     490     0.3955                                                1.08
  10000    10000    5100    4900    0.0049                                                1.08
  100000   100000   51000   49000   <0.0001                                               1.08
  10       10       7       3       0.1797                                                5.44
  1000     1000     700     300     <0.0001                                               5.44
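
The rows of this table can be recomputed as follows. This is a sketch that assumes the standard form of Yates' continuity correction (subtract 0.5 from each |observed - expected| before squaring); the function names are illustrative.

```python
from math import erfc, sqrt

def yates_chi2_p(m1, m2, n1, n2):
    """Two-sided chi-square p-value (1 df) with Yates' continuity correction."""
    observed = [[m1, m2], [n1 - m1, n2 - m2]]
    row_totals = [m1 + m2, (n1 + n2) - (m1 + m2)]
    col_totals = [n1, n2]
    n = n1 + n2
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            deviation = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            stat += deviation ** 2 / expected
    return erfc(sqrt(stat / 2.0))  # tail of chi-square with 1 degree of freedom

def odds_ratio(m1, m2, n1, n2):
    """OR = [m1(n2-m2)] / [m2(n1-m1)]."""
    return (m1 * (n2 - m2)) / (m2 * (n1 - m1))

rows = [(100, 100, 51, 49), (1000, 1000, 510, 490),
        (10000, 10000, 5100, 4900), (100000, 100000, 51000, 49000),
        (10, 10, 7, 3), (1000, 1000, 700, 300)]
for n1, n2, m1, m2 in rows:
    p = yates_chi2_p(m1, m2, n1, n2)
    print(f"n1={n1:<7} m1={m1:<6} m2={m2:<6} p={p:.4g}  OR={odds_ratio(m1, m2, n1, n2):.2f}")
```

The p-value shrinks as the sample grows while the odds ratio stays fixed, which is why both numbers are reported.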

Cases explained
• Are p-value and effect size sufficient?
• Consider this situation:

                      With the phenotype   Without the phenotype
  With the allele     100                  10
  Without the allele  9900                 9990

  – p-value (two-sided chi-square with Yates’ correction): <0.0001
  – OR: (100/9900) / (10/9990) = 10.1
• Is this allele very important?
  – It only explains 1% of the individuals with the phenotype (100 of the 10,000 cases carry it)
  – As of 2010, genetic variants discovered only explain 10% of type-2 diabetes heritability (Billings and Florez, Annals of the New York Academy of Sciences 1212:59-77, 2010)
• Further complicated by environmental factors
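
The contrast between a large odds ratio and a small fraction of cases explained is simple arithmetic on the table above. The sketch below makes it explicit; the variable names are illustrative.

```python
# Counts from the table above
cases_with, controls_with = 100, 10
cases_without, controls_without = 9900, 9990

# Odds ratio: odds of carrying the allele among cases vs. among controls
odds_ratio = (cases_with / cases_without) / (controls_with / controls_without)

# Fraction of all cases that carry the allele ("cases explained")
cases_explained = cases_with / (cases_with + cases_without)

print(f"OR = {odds_ratio:.1f}, cases explained = {cases_explained:.1%}")
```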

Stratification
• Now consider this case:

                      With the phenotype   Without the phenotype
  With the allele     3000                 1000
  Without the allele  7000                 9000

• p-value, effect size, and cases explained are all good
  – OR = (3000 / 7000) / (1000 / 9000) = 3.86 > 1
• What can still be wrong?
• If we consider males and females separately:

                      With the phenotype     Without the phenotype
                      Male      Female       Male      Female
  With the allele     2900      100          50        950
  Without the allele  6000      1000         100       8900

  – OR(male) = (2900 / 6000) / (50 / 100) = 0.97 < 1
  – OR(female) = (100 / 1000) / (950 / 8900) = 0.94 < 1
  – Phenotype is associated with gender, not allele (“Simpson’s paradox”)
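
A short Python sketch of the stratified analysis, using the counts from the tables above. The tuple layout (cases with allele, cases without, controls with, controls without) is a choice made for this illustration.

```python
def odds_ratio(case_with, case_without, ctrl_with, ctrl_without):
    """Odds of carrying the allele among cases divided by the odds among controls."""
    return (case_with / case_without) / (ctrl_with / ctrl_without)

# (cases with allele, cases without, controls with allele, controls without)
strata = {
    "male":   (2900, 6000, 50, 100),
    "female": (100, 1000, 950, 8900),
}

# Pooling the strata gives back the 2x2 table shown first
pooled = tuple(sum(counts) for counts in zip(*strata.values()))

print(f"pooled: OR = {odds_ratio(*pooled):.2f}")
for name, counts in strata.items():
    print(f"{name}: OR = {odds_ratio(*counts):.2f}")
```

The pooled odds ratio is well above 1 while both per-stratum odds ratios are below 1, so the apparent association comes from the imbalance of males and females between cases and controls.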

Multiple hypothesis testing
• If many loci are studied at the same time, another issue is multiple hypothesis testing
• If the p-value for an allele to be associated with a phenotype is 0.01,
  – If the allele is in fact not associated with the phenotype, there is a 1% chance that we can get the observed case and control counts or more extreme by chance
  – If we consider 100 loci, we expect to encounter one such situation on average
  – In reality, we are considering millions of loci at the same time

Correction for multiple hypothesis testing
• Bonferroni correction (family-wise error rate)
  – Suppose we have tested N loci, none of which is associated with the phenotype
  – What is the chance that at least one of them has a p-value ≤ p?
  – Pr(locus 1 has a p-value ≤ p OR locus 2 has a p-value ≤ p OR ... OR locus N has a p-value ≤ p)
    ≤ Pr(locus 1 has a p-value ≤ p) + Pr(locus 2 has a p-value ≤ p) + ... + Pr(locus N has a p-value ≤ p)
    = p + p + ... + p (N times)
    = Np
  – For example, instead of 0.01, we only consider a p-value of 0.01/N or less to be significant
    • Interpretation: if no locus is truly associated, the probability that one or more loci are called associated with the phenotype purely by random chance is at most 0.01
  – Other commonly used correction methods: 1) false discovery rate based on the Benjamini-Hochberg procedure, 2) q-value
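
The corrections above can be sketched in a few lines of Python. The p-values in the example are hypothetical, the function names are illustrative, and the Benjamini-Hochberg function follows the standard step-up formulation, which is an assumption about which variant is meant.

```python
def chance_of_any_false_positive(n_tests, p):
    """Pr(at least one of n_tests null loci has p-value <= p), assuming independent tests."""
    return 1.0 - (1.0 - p) ** n_tests

def bonferroni(p_values, alpha=0.01):
    """Call a locus significant only if its p-value is at most alpha / N."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, fdr=0.05):
    """Step-up procedure: find the largest rank k with p_(k) <= (k/N) * fdr,
    then call the k smallest p-values significant."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / n * fdr:
            k_max = rank
    significant = [False] * n
    for idx in order[:k_max]:
        significant[idx] = True
    return significant

# Hypothetical p-values for six loci
p_values = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.6]
print(chance_of_any_false_positive(100, 0.01))
print(bonferroni(p_values, alpha=0.01))
print(benjamini_hochberg(p_values, fdr=0.05))
```

Bonferroni controls the chance of any false call and is therefore stricter; the Benjamini-Hochberg procedure controls the expected fraction of false calls among the loci reported.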

Association between different loci
• In some cases, a disease-associated allele is due to somatic mutation not inherited from parents
• More often, it is inherited from parents. In this case, alleles at some loci not causing the disease may also appear to be disease-associated
• Reasons:
  – Genetic linkage
  – Linkage disequilibrium
• Consequence: statistical association does not necessarily imply
  – Biological association
  – Causality

Genetic linkage
• For diploid organisms, there are two copies of each chromosome
• During meiosis, the two homologous chromosomes can exchange genetic material by recombination (chromosomal crossover)
• If two loci are close, the chance of crossover between them is small
  – Their alleles are likely passed on to daughter cells together
  – The rate of crossover can be used as a distance measure between two loci
[Figure label: sister chromatids]
Image source: http://www.tokresource.org/tok_classes/biobiobio/biomenu/meiosis/Crossover.gif

Genetic linkage (cont’d)
• 1 centimorgan (cM) or map unit (m.u.) is the distance between two chromosomal locations for which the expected number of intervening crossovers per generation is 0.01 (1%)
• Notice that if two loci are far apart, it is possible to have multiple crossover events between them in one generation
• Testing whether two loci are linked, with recombination rate θ, given x non-recombinant offspring and y recombinant offspring according to a pedigree:
  – Log of odds,

    $\mathrm{LOD} = \max_\theta \log_{10} \frac{(1-\theta)^x \theta^y}{0.5^{x+y}} = \log_{10} \frac{\left(1-\frac{y}{x+y}\right)^x \left(\frac{y}{x+y}\right)^y}{0.5^{x+y}} = \log_{10} \frac{\left(\frac{x}{x+y}\right)^x \left(\frac{y}{x+y}\right)^y}{0.5^{x+y}}$
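
The closed form of the LOD score, with the maximizing value θ = y/(x+y), can be evaluated directly. The sketch below follows that formula as written, so it does not restrict θ to at most 0.5; the function name and the example counts are illustrative.

```python
from math import log10

def lod_score(x, y):
    """LOD for x non-recombinant and y recombinant offspring,
    evaluated at the maximum-likelihood estimate theta = y / (x + y)."""
    n = x + y
    if n == 0:
        raise ValueError("need at least one offspring")
    theta = y / n

    def log_power(base, exponent):
        # Treat 0^0 as 1, i.e. contribute 0 to the log-likelihood
        return exponent * log10(base) if exponent > 0 else 0.0

    log_linked = log_power(1.0 - theta, x) + log_power(theta, y)
    log_unlinked = n * log10(0.5)   # theta = 0.5 means no linkage
    return log_linked - log_unlinked

# Hypothetical pedigree: 9 non-recombinant and 1 recombinant offspring
print(f"LOD = {lod_score(9, 1):.2f}")
```

Working in log space avoids underflow when x+y is large.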

Linkage disequilibrium (LD)
• Sometimes the alleles at two loci can co-occur with a frequency that deviates from expectation (even if they are on different chromosomes)
• Some reasons:
  – Selection
  – Population bottleneck
  – Non-uniform rate of recombination
  – Non-random mating
  – ...

Representing GWAS results
• Allele list and LD map
Image credit: Altshuler et al., Science 322(5903):881-888, (2008)

Representing GWAS results (cont’d)
• Manhattan plot: -log10(p) against chromosomal locations
• Quantile-quantile plot (Q-Q plot): chi-square values expected under the theoretical (null) distribution vs. observed values
Image credit: Samani et al., New England Journal of Medicine 357:443-453, (2007)

Combinatorial effects [project]
• It is possible that two loci have a joint effect on the phenotype
• Example: fraction of individuals with a phenotype

                     Locus 1
  Locus 2     AA      Aa      aa
  BB          0.9     0.1     0.1
  Bb          0.1     0.9     0.1
  bb          0.1     0.1     0.9

  – Each locus is not strongly associated with the phenotype (“main effect” not strong)
  – The two loci together are more strongly associated with the phenotype (the “interaction effect” is strong)
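
To see why the main effects vanish in this table, one can average the phenotype fractions over the other locus. The sketch below assumes, purely for illustration, that all nine genotype combinations are equally frequent; the lecture does not state the genotype frequencies.

```python
# Fraction of individuals with the phenotype, indexed [locus 2][locus 1]
phenotype_fraction = {
    "BB": {"AA": 0.9, "Aa": 0.1, "aa": 0.1},
    "Bb": {"AA": 0.1, "Aa": 0.9, "aa": 0.1},
    "bb": {"AA": 0.1, "Aa": 0.1, "aa": 0.9},
}
locus1_genotypes = ["AA", "Aa", "aa"]

# Main effect of locus 1: average over locus 2 (equal frequencies assumed)
locus1_marginal = {
    g1: sum(row[g1] for row in phenotype_fraction.values()) / len(phenotype_fraction)
    for g1 in locus1_genotypes
}
# Main effect of locus 2: average over locus 1 (equal frequencies assumed)
locus2_marginal = {
    g2: sum(row.values()) / len(row) for g2, row in phenotype_fraction.items()
}

print("locus 1:", {g: round(f, 3) for g, f in locus1_marginal.items()})
print("locus 2:", {g: round(f, 3) for g, f in locus2_marginal.items()})
```

Under that assumption every single-locus genotype shows the same phenotype fraction, so a locus-by-locus test sees nothing, while the joint table ranges from 0.1 to 0.9.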

Follow-up
• Validations:
  – Sanger sequencing to confirm the presence of allele in the tested samples
  – Functional enrichment and network analysis
  – Replication in larger cohorts
  – Knock-out or (for genes) knock-down/over-expression experiments
    • In vitro
    • Animal models
• Ultimate applications:
  – Biomarker for diagnosis/prognosis prediction
  – Druggable targets

NHGRI GWA Catalog
Image source: http://www.genome.gov/gwastudies/

OMIC STUDIES
Part 2

Ome, omic, and omics
• Traditionally, biologists study one or a few biological objects at a time
  – Hypothesis driven
• Now it is possible to study many biological objects at the same time
  – Data driven
• Suppose we want to study a type of objects or phenomena, X
  – “X-ome”: A large amount of data related to X, or the whole set of X
  – “X-omic”: To study a large amount of data related to X
  – “X-omics”: The area of studying a large amount of data related to X

Different kinds of X-omics

  Object/phenomenon type, X                                  X-ome           X-omics
  Genes/DNA                                                  Genome          Genomics (The study of all genes/whole set of DNA)
  Transcripts/transcription                                  Transcriptome   Transcriptomics (The study of gene expression levels)
  Exons/transcription                                        Exome           Exomics (The study of all exons)
  Proteins                                                   Proteome        Proteomics (The study of protein identity and abundance)
  Metabolism                                                 Metabolome      Metabolomics (The study of metabolic reactions)
  DNA methylation                                            Methylome       Methylomics (The study of whole-genome DNA methylation)
  Non-coding RNAs, DNA methylation, histone modifications    Epigenome       Epigenomics (The study of inheritable non-DNA signals)
  Population of co-existing species in an environment        Metagenome      Metagenomics (The study of different genomes, transcriptomes, etc. in a common environment)
  Phenotypes                                                 Phenome         Phenomics (The comprehensive study of phenotypes)
  Interactions                                               Interactome     Interactomics (The study of all interactions of a certain type)
  ...                                                        ...             ...

Why omics?
• Unbiased and complete
  – If a hypothesis can explain an observation, it is not necessarily the only hypothesis that can explain it
  – May discover something surprising
• Easy (good or bad?)
  – From data to hypothesis
• Rapidly decreasing cost, becoming affordable
  – ‘If it does not cost much, why not?’

Why some people are reluctant
• Traditionally, science follows the “hypothesis test” pattern
• Outputs of high-throughput experiments contain irrelevant data, secondary effects, and noise
• Costly in the sense that
  – Not guaranteed that anything will be found at the time the experiments are performed
  – Many hypotheses need to be validated
Need to be well aware of these potential pitfalls

Types of studies
• With a specific question. Examples:
  – What are the genetic factors that increase the susceptibility to type-2 diabetes?
    • Survey genetic variants in the whole genome
  – How common is RNA editing?
    • Compare all transcripts with DNA
• With a broad question. Examples:
  – What are the characteristics of domains defined by chromatin features?
    • Define domains using genome-wide chromatin features, then correlate with other features
  – Is there anything special about the distribution of protein binding sites?
    • Compare all regions bound by proteins with other regions

Some specific CBB problems
• More “downstream” than standard data processing tasks
  – Haplotype phasing [project]
  – Genotype imputation

Haplotype
• Haplotype: The combination of alleles on the same chromosome
  – Example: if an individual has A/C genotype at a locus, and G/T at another, there are four possible haplotypes:
    • AG
    • AT
    • CG
    • CT
  – The individual actually has two of them, one from each parent
  – For k loci, there are up to 2^k possible haplotypes
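
The enumeration in this example can be sketched with `itertools.product`. Genotypes are represented here as a list of unphased allele pairs, one pair per locus; that representation and the function names are choices made for this illustration.

```python
from itertools import product

def possible_haplotypes(genotypes):
    """All allele combinations, one allele per locus (up to 2^k for k loci)."""
    alleles_per_locus = [sorted(set(g)) for g in genotypes]
    return ["".join(h) for h in product(*alleles_per_locus)]

def possible_phasings(genotypes):
    """All ways to split the genotypes into two complementary haplotypes."""
    pairs = set()
    for choice in product((0, 1), repeat=len(genotypes)):
        hap1 = "".join(g[c] for g, c in zip(genotypes, choice))
        hap2 = "".join(g[1 - c] for g, c in zip(genotypes, choice))
        pairs.add(tuple(sorted((hap1, hap2))))
    return sorted(pairs)

genotypes = [("A", "C"), ("G", "T")]   # A/C at one locus, G/T at another
print(possible_haplotypes(genotypes))  # the four haplotypes listed above
print(possible_phasings(genotypes))    # the individual carries one of these pairs
```

Deciding which of the returned pairs the individual actually carries is the phasing problem.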

Haplotype phasing
• Haplotype phasing: Finding the alleles that were inherited together
  – Even better, from which parent
• Main ideas:
  – Single individual: Look for sequencing reads that cover more than one variant
    • Difficult due to short read lengths (two human SNPs are about 1000bp apart on average)
    • Paired-end reads help, but not much
  – Multiple individuals
  – Family analysis (comparing with parents/siblings/children)

Haplotype inference
• All cases for a trio (child genotypes are written as paternal allele|maternal allele):

  Description                                                          Father   Child        Mother
  F: hom.; M: same hom. (C: hom.)                                      AA       A|A          AA
  F: hom.; M: diff. hom. (C: het.)                                     AA       A|C          CC
  F: hom.; M: het. with com.; C: hom.                                  AA       A|A          AC
  F: hom.; M: het. with com.; C: het.                                  AA       A|C          AC
  F: hom.; M: het. without com. (C: het.)                              AA       A|C          CG
  F: het.; M: het. with 2 com.; C: hom.                                AC       A|A          AC
  F: het.; M: het. with 2 com.; C: het. (the only unresolved case)     AC       A|C or C|A   AC
  F: het.; M: het. with 1 com.; C: hom.                                AC       A|A          AG
  F: het.; M: het. with 1 com.; C: het. with the com.                  AC       A|G          AG
  F: het.; M: het. with 1 com.; C: het. without the com.               AC       C|G          AG
  F: het.; M: het. without com. (C: het.)                              AC       A|G          GT

  Abbreviations: F: father; M: mother; C: child; hom.: homozygous; het.: heterozygous; com.: common allele
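
The case analysis in the table can be expressed as a small function that tries both parental assignments of the child's two alleles. This is a sketch for a single locus that ignores genotyping errors and de novo mutations; the function name and the two-letter string encoding of genotypes are assumptions made for this illustration.

```python
def phase_child(father, child, mother):
    """Return the child's alleles as (paternal, maternal) for one locus,
    or None when both assignments are possible (the unresolved case).
    Genotypes are two-letter strings such as "AC"."""
    first, second = child
    options = set()
    for paternal, maternal in ((first, second), (second, first)):
        if paternal in father and maternal in mother:
            options.add((paternal, maternal))
    if not options:
        raise ValueError("genotypes are not consistent with Mendelian inheritance")
    return options.pop() if len(options) == 1 else None

print(phase_child("AA", "AC", "CC"))  # father hom., mother different hom.
print(phase_child("AC", "AG", "AG"))  # both het. with one common allele
print(phase_child("AC", "AC", "AC"))  # all het. with the same alleles: unresolved
```

Applying this locus by locus phases most positions of a trio; the unresolved positions need other evidence, such as reads covering several variants or population data.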

Haplotype inference (cont’d)
• Parsimony: assuming no intervening recombination
Image credit: Roach et al., The American Journal of Human Genetics 89:382-397, (2011)

Haplotype blocks
• A related problem: Given the SNPs of a list of loci from many unrelated individuals, identify haplotype blocks (SNPs that are usually inherited together)
  – Before whole-genome sequencing became popular, one purpose of identifying haplotype blocks was to find representatives of each block (the “tagging SNPs”), which reduces the number of SNPs that need to be tested; the others can be inferred (statistically) from them
  – Two main goals of the International HapMap project: Cataloging SNPs and constructing haplotype blocks

Missing genotypes
• Sometimes we want to guess the genotype/haplotype at certain loci:
  – They are missing because of the method used (e.g., a SNP array only checks a pre-defined set of SNPs)
  – Information at some neighboring loci is available
  – Full information for some other samples is available (e.g., those with whole-genome sequencing data)

Genotype imputation
Image credit: Howie et al., PLoS Genetics 5(6):e1000529, (2009)

Some common omic analysis techniques
• Aggregation: Aligning different elements, then aggregating their features
  – E.g., The methylation level at TSS, first 10% of 5’UTR, second 10% of 5’UTR, etc.
• Correlation: Finding relationships between different datasets
  – E.g., Binding patterns of different DNA-binding proteins
Image credit: Lister et al., Nature 462:315-322, (2009); Filion et al., Cell 143:212-224, (2010)

CASE STUDIES
Part 3

Case studies
• GWAS studies
• Whole-genome sequencing of individuals from different populations
• Large integrative projects
• Personal genomics

First GWAS study
• Reported in 2005, about age-related macular degeneration
• 96 patients, 50 controls
• Genotyped 116,204 SNPs
• In one paper, two SNPs were reported
  – p-values: 4.1 x 10^-8 and 1.4 x 10^-6
  – Dominant (i.e., 1,2 vs. 0 copies of risk allele)
    • Expected odds ratios: 4.6 and 4.7
  – Recessive (i.e., 2 vs. 0,1 copies of risk allele)
    • Expected odds ratios: 7.4 and 6.2
• Three papers by different groups:
  – Complement Factor H Polymorphism in/Polymorphism and/Variant Increases the Risk of Age-Related Macular Degeneration
Image credit: Klein et al., Science 308(5720):385-389, (2005)

One of the largest GWAS studies
• Wellcome Trust Case Control Consortium Study
• 14,000 cases of 7 common diseases
  – Bipolar disorder
  – Coronary heart disease
  – Crohn’s disease
  – Hypertension
  – Rheumatoid arthritis
  – Type 1 diabetes
  – Type 2 diabetes
• 3,000 shared controls
Image credit: The Wellcome Trust Case Control Consortium, Nature 447(7145):661-678, (2007)

Human whole-genome sequencing
• 2001: Draft human genome finished
• 2007: J. Craig Venter
• 2008: James Watson
• 2008: Han Chinese (YH)
• 2009: Yoruban male (NA18507)
• 2009: Korean male (SJK)
• 2009: Korean male (AK1)
• ...

SNPs of different populations
Image credit: Paschou et al., PLoS Genetics 3(9):e160, (2007)

18 Korean genomes & 17 transcriptomes
Image credit: Ju et al., Nature Genetics 43(8):745-752, (2011)

1000 Genomes Project [Project]
• Pilot phase:
  – Low-coverage whole-genome sequencing of 179 individuals from 4 populations
  – High-coverage sequencing of two mother-father-child trios
  – Exon-targeted sequencing of 697 individuals from 7 populations
Image credit: The 1000 Genomes Project Consortium, Nature 467(7319):1061-1073, (2010)

Loss-of-function variants
• One interesting finding from the 1000 Genomes Project: Each person has on average about 100 loss-of-function (LoF) variants, with about 20 genes completely inactivated
  – Functions of the affected genes: blood type, muscle performance, drug metabolism, etc.
  – Less evolutionarily conserved
  – Fewer protein-protein interactions
  – Likely to have similar genes in the genome
Image credit: MacArthur et al., Science 335(6070):823-828, (2012)

Soybean project
• Sequenced 17 wild and 14 cultivated soybean genomes
• Data useful for identifying relevant alleles that exhibit desirable phenotypes
  – E.g., drought resistance
Image credit: Prof. Hon-Ming Lam, CUHK

Metagenomics
• Sequences from many different types of (microbial) species in a single sample
  – Soil
  – Sea water
  – Skin
  – Gut
  – ...
• Analysis tasks very different from standard sequencing projects
  – Difficult to tell which species each sequence read belongs to
  – Use highly-conserved sequences (e.g., rRNA) to estimate abundance of each species
  – Concept of shared resources (e.g., proteins)
  – Environmental factors play critical roles
Image credit: Wikipedia

ENCODE [project]
• Encyclopedia of DNA Elements
• Identifying and characterizing all human DNA elements
• Similar projects for worm and fly (model organism ENCODE)
Image credit: Darryl Leja (NHGRI) and Ian Dunham (EBI)

Personal genomics
• Studying your own genome
  – Whole genome sequencing
  – Exome sequencing
  – SNPs
  – ...
• From personal genomics to personal medicine
  – Easy to produce data
  – Difficult to interpret results (predict implications)
  – Even more difficult to do disease prevention
• Other issues
  – Genetics is not the only factor
  – Everything is probability
  – Psychological impacts

Some companies

Summary
• Association studies try to find loci/alleles related to phenotypes
  – Family studies
  – Case-control studies
• Criteria for evaluating the significance of a variant:
  – p-value, effect size, cases explained
• Confounding factors:
  – Stratification, linkage disequilibrium
• Levels of coverage:
  – Selected variants < exome sequencing < whole-genome sequencing
• The current challenge is to produce the right data and identify the useful information