
Lecture 10. Topics in Omic Studies (Basics)

The Chinese University of Hong Kong
CSCI5050 Bioinformatics and Computational Biology

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015

Lecture outline
1. Genome-wide association studies
2. Omic studies
3. Case studies

Last update: 3-Nov-2015

GENOME-WIDE ASSOCIATION STUDIES

Part 1


Useful terms in genetics
• Locus (plural: loci)
  – A specific location, region, or gene on a chromosome
    • E.g., chr1:20376; human CFTR gene
• Allele
  – A variant of the DNA sequence at a given locus
    • E.g., A; wild-type
• Genotype
  – The set of alleles at a certain locus
    • E.g., A/A; A/C; wild-type/mutant
• Character
  – An observable property
    • E.g., eye color; shape of pea
    • Multiple levels (from expression level to growth rate)
• Trait
  – A variant of a character
    • E.g., blue (eye color); round (pea shape)
• Phenotype
  – An observed trait of an organism due to a combination of its genotype and environmental factors


Research problems: several directions

• Given a phenotype
  – Find the locus associated with it
  – Find an allele associated with it
  – Find the allele(s) that cause(s) it
• Given an allele/genotype
  – Determine the resulting trait/phenotype
• Given the alleles of different loci
  – Study how they are related to each other



From phenotype to genotype
• Suppose we observe a certain phenotype (e.g., a disease) in some individuals. How do we find the relevant loci/alleles?
• Things to consider:
  – Genetic vs. non-genetic factors
  – Single-locus vs. multi-locus
  – Homozygous vs. heterozygous (dominant vs. recessive)
• Difficulties:
  – Sample size
  – Controls
  – Data availability
  – Association vs. causality
  – Multiple hypothesis testing



Some well-known disease-related genes

• β-globin: mutations related to sickle-cell anemia (relationship discovered by Linus Pauling in 1949)
• p53: tumor suppressor
• BRCA1, BRCA2 (breast cancer 1/2, early onset): mutations related to breast cancer
• CFTR (cystic fibrosis transmembrane conductance regulator): mutations related to cystic fibrosis (first mutation found in 1989 by Francis Collins, Lap-Chee Tsui and John Riordan)
• CCR5 (C-C chemokine receptor type 5): a mutation related to protection against infection by M-tropic strains of HIV-1

Database: OMIM (Online Mendelian Inheritance in Man)


Image credit: Wikipedia


Different approaches


Image credit: Mullen et al., Neurology 72(6):558-565, (2009)

[Figure: pedigree chart. Legend: affected male, unaffected male, affected female, unaffected female]


Family linkage study
• Main ideas:
  – Identify members with/without the phenotype
  – Study the inheritance patterns of genetic markers
    • Fine: SNP (single nucleotide polymorphism)
    • Coarse: RFLP (restriction fragment length polymorphism)
    • Many other types: AFLP, DArT, RAD, RAPD, SFP, SSLP, SSR, STR, VNTR, ... (wiki “genetic marker”)
  – Deduce possible loci related to the phenotype
    • Usually not very precise
  – Also deduce homozygosity/heterozygosity



Restriction fragment length polymorphism

• Simple case for illustration:

• Reality: lots of irrelevant data


[Figure legend: symbol marking a restriction site]

Image credit: Wikipedia, http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/TechRFLP.shtml


Case-control studies
• Instead of studying families, another way is to recruit unrelated individuals with the phenotype as cases, and individuals without the phenotype (or random individuals with a low chance of having the phenotype) as controls
• Advantage over family studies:
  – Easier to get large samples
• Disadvantage: more diverse background
  – Genotypic differences may be due to ethnicity, gender, etc. (more later)
  – Need to balance the case and control groups, or perform special analysis to separate out different factors (e.g., principal component analysis)
• Large studies may also have issues with diverse experimental protocols, data quality, etc.
  – E.g., a 2010 Science paper about potential genetic signatures related to longevity was retracted (retraction note: Science 333:404, 2011)



Genome-wide association studies
• With high-throughput technologies, it is now possible to check many potential variants at the same time
  – SNP arrays
    • Predefined SNPs
    • Relatively inexpensive
  – Array-comparative genomic hybridization (Array CGH) for copy number variations
  – Whole-genome sequencing
    • High coverage
    • High cost, especially when read depth needs to be high for confident SNP calling
    • Can also detect other types of variants (e.g., indels)
  – Exome sequencing
    • Only sequence captured exons
    • Compromise between coverage and cost



Copy number variation (CNV)


Image credit: http://clincancerres.aacrjournals.org/content/10/24/8204/F3.medium.gif, Chial, Nature Education 1(1), (2008)

[Figure panels: Array-CGH; aneuploidy (e.g., in cancer)]


Measures of association (allele)
• Suppose there are n1 cases and n2 controls, among which m1 and m2 have a given allele, respectively
• Is the allele likely associated with the phenotype?
• Contingency table:

                     With the phenotype   Without the phenotype   Total
With the allele      m1                   m2                      m1+m2
Without the allele   n1-m1                n2-m2                   (n1+n2)-(m1+m2)
Total                n1                   n2                      n1+n2

• Null hypothesis H0: allele and phenotype are independent (hypergeometric distribution: row and column totals are fixed, the other entries are variables)

$$\Pr(m_1 \mid H_0) = \frac{\binom{n_1}{m_1}\binom{n_2}{m_2}}{\binom{n_1+n_2}{m_1+m_2}}$$


Measures of association (cont’d)

• p-value:
  – If the null hypothesis is true (that the allele is really unrelated to the phenotype), what is the probability of observing a value of m1 equal to or larger than the observed value?

[Figure: probability density of the null distribution over m1, with the observed value and the p-value marked]


Measures of association (cont’d)
• What probability (p-value) to compute:
  – One-sided Fisher’s exact test (when we think having the allele has a positive effect on the phenotype): reject H0 if Pr(m1 or more | H0) is small
    • Infeasible if the numbers are large
  – Two-sided chi-square test (when we think having the allele has either a positive or a negative effect on the phenotype): reject H0 if a deviation from expectation at least as large as the one observed has a low probability
    • Expectation: e.g., # with both allele and phenotype = (m1+m2)n1/(n1+n2)
    • Test statistic (sum over the four cells i of the table, with observed count O_i and expected count E_i), which follows a chi-square distribution with 1 degree of freedom when n1 and n2 are large:

$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$$

  – The “power” of a test is the probability that the null hypothesis will be rejected when it is actually false
    • The probability that we will not miss a real association


Example
• Suppose there are n1=12 cases and n2=8 controls, among which m1=6 and m2=3 have a given allele, respectively
• Contingency table:

                     With the phenotype   Without the phenotype   Total
With the allele      6                    3                       9
Without the allele   6                    5                       11
Total                12                   8                       20

• Pr(m1 = 6 | H0) = (12C6)(8C3)/(20C9) = (924)(56)/167960 = 0.3081
• Pr(m1 ≥ 6 | H0) = (12C6)(8C3)/(20C9) + (12C7)(8C2)/(20C9) + (12C8)(8C1)/(20C9) + (12C9)(8C0)/(20C9) = 0.4650
  – Even if the phenotype is independent of the allele, by chance we still have a high probability of observing m1 ≥ 6
  – Therefore the allele is not statistically associated with the phenotype
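The two probabilities in this example can be checked with a few lines of Python. This is a minimal sketch using only the standard library; the function names `hypergeom_pmf` and `fisher_one_sided` are illustrative choices, not part of the lecture.

```python
from math import comb

def hypergeom_pmf(k, n1, n2, with_allele):
    """Pr(exactly k of the n1 cases carry the allele | H0), all margins fixed."""
    return comb(n1, k) * comb(n2, with_allele - k) / comb(n1 + n2, with_allele)

def fisher_one_sided(m1, m2, n1, n2):
    """One-sided p-value: Pr(m1 or more cases carry the allele | H0)."""
    with_allele = m1 + m2
    k_max = min(n1, with_allele)  # cases carrying the allele cannot exceed either margin
    return sum(hypergeom_pmf(k, n1, n2, with_allele) for k in range(m1, k_max + 1))

# Numbers from the example: n1=12 cases, n2=8 controls, m1=6, m2=3
point = hypergeom_pmf(6, 12, 8, 9)
p_value = fisher_one_sided(6, 3, 12, 8)
print(f"Pr(m1 = 6 | H0)  = {point:.4f}")
print(f"Pr(m1 >= 6 | H0) = {p_value:.4f}")
```

The sum enumerates every table at least as extreme as the observed one, which is why the exact test becomes expensive when the counts are large.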


Example (cont’d)
• Suppose there are n1=12 cases and n2=8 controls, among which m1=6 and m2=3 have a given allele, respectively
• Contingency table:

                     With the phenotype   Without the phenotype   Total
With the allele      6                    3                       9
Without the allele   6                    5                       11
Total                12                   8                       20

• Expectations if phenotype is independent of allele:

                     With the phenotype     Without the phenotype   Total
With the allele      (9/20)(12/20)20=5.4    (9/20)(8/20)20=3.6      9
Without the allele   (11/20)(12/20)20=6.6   (11/20)(8/20)20=4.4     11
Total                12                     8                       20

• Chi-square statistic: (6 – 5.4)^2 / 5.4 + (3 – 3.6)^2 / 3.6 + (6 – 6.6)^2 / 6.6 + (5 – 4.4)^2 / 4.4 = 0.3030
  – Pr(χ2 > 0.3030 | H0) = 0.5820
  – Even if the phenotype is independent of the allele, there is still a high chance of getting values that deviate from expectation as much as, or more than, the observed ones
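The expected counts, the chi-square statistic and its p-value can be reproduced as follows. This is a sketch under one stated shortcut: for 1 degree of freedom the upper tail of the chi-square distribution equals erfc(sqrt(x/2)), so no statistics library is needed. The helper names are illustrative.

```python
from math import erfc, sqrt

def expected_counts(table):
    """Expected cell counts under independence, from the row and column totals."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

def chi_square_stat(table):
    """Sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

def chi2_upper_tail_df1(x):
    """Pr(chi-square with 1 degree of freedom > x)."""
    return erfc(sqrt(x / 2))

# Rows: with / without the allele; columns: with / without the phenotype
observed = [[6, 3], [6, 5]]
stat = chi_square_stat(observed)
p_value = chi2_upper_tail_df1(stat)
print(expected_counts(observed))
print(f"chi-square = {stat:.4f}, p-value = {p_value:.4f}")
```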


Measures of association (genotype)

• If we also want to know whether the homozygous and heterozygous situations are different, we need a 3x2 table instead (χ2: 2 degrees of freedom):

Genotype   With the phenotype   Without the phenotype
AA         m11                  m21
Aa         m12                  m22
aa         m13                  m23

• Other tests are available for phenotypes that are not binary
  – Quantitative trait locus (QTL): a locus associated with a continuous trait
    • eQTL: the trait is the expression level of a gene


A problem of p-value
• The p-value only tells how likely the observed data can be generated by random chance, but not how much the actual situation deviates from the null hypothesis
• Example:

n1       n2       m1      m2      p-value (2-sided chi-square with Yates’ correction)
100      100      51      49      0.8875
1000     1000     510     490     0.3955
10000    10000    5100    4900    0.0049
100000   100000   51000   49000   <0.0001

• When the sample size is large, it is quite clear that the allele and phenotype are not independent
  – However, the association is weak (51% vs. 49%)


Effect size
• In addition to the p-value, we also want to know how much more likely individuals in the case group are to have the allele than those in the control group
• Usual measures of this “effect size”:
  – Relative risk, RR = (fraction of cases among people with the allele) / (fraction of cases among people without the allele)
    • Cannot be computed from a case-control study, since the case/control ratio in the sample is different from the case/non-case ratio in the population
  – Odds ratio, OR = [(# with the allele in case group) / (# without the allele in case group)] / [(# with the allele in control group) / (# without the allele in control group)]
    = [m1/(n1 - m1)] / [m2/(n2 - m2)]
    = [m1(n2 - m2)] / [m2(n1 - m1)]



Example


n1       n2       m1      m2      p-value (2-sided chi-square with Yates’ correction)   Odds ratio
100      100      51      49      0.8875                                                1.08
1000     1000     510     490     0.3955                                                1.08
10000    10000    5100    4900    0.0049                                                1.08
100000   100000   51000   49000   <0.0001                                               1.08
10       10       7       3       0.1797                                                5.44
1000     1000     700     300     <0.0001                                               5.44
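The rows of this table can be recomputed with the sketch below. It assumes the usual form of Yates’ correction (subtract 0.5 from each |O - E| before squaring) and again uses erfc for the 1-degree-of-freedom tail; the function names and argument order (n1, n2, m1, m2, matching the table columns) are illustrative.

```python
from math import erfc, sqrt

def yates_p_value(n1, n2, m1, m2):
    """Two-sided chi-square p-value (1 d.f.) with Yates' continuity correction."""
    observed = [[m1, m2], [n1 - m1, n2 - m2]]
    row_totals = [m1 + m2, (n1 + n2) - (m1 + m2)]
    col_totals = [n1, n2]
    n = n1 + n2
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            deviation = max(abs(observed[i][j] - expected) - 0.5, 0.0)
            stat += deviation ** 2 / expected
    return erfc(sqrt(stat / 2))

def odds_ratio(n1, n2, m1, m2):
    """OR = [m1 (n2 - m2)] / [m2 (n1 - m1)]."""
    return (m1 * (n2 - m2)) / (m2 * (n1 - m1))

for n1, n2, m1, m2 in [(100, 100, 51, 49), (1000, 1000, 510, 490),
                       (10000, 10000, 5100, 4900), (10, 10, 7, 3),
                       (1000, 1000, 700, 300)]:
    print(n1, n2, m1, m2,
          f"p = {yates_p_value(n1, n2, m1, m2):.4f}",
          f"OR = {odds_ratio(n1, n2, m1, m2):.2f}")
```

The odds ratio stays fixed when all counts are scaled by the same factor, while the p-value shrinks, which is the point the table makes.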


Cases explained
• Are p-value and effect size sufficient?
• Consider this situation:

                     With the phenotype   Without the phenotype
With the allele      100                  10
Without the allele   9900                 9990

  – p-value (two-sided chi-square with Yates’ correction): <0.0001
  – OR: (100 / 9900) / (10 / 9990) = 10.1
• Is this allele very important?
  – It only explains 1% of the individuals with the phenotype
  – As of 2010, genetic variants discovered only explain 10% of type-2 diabetes heritability (Billings and Florez, Annals of the New York Academy of Sciences 1212:59-77, 2010)
• Further complicated by environmental factors


Stratification
• Now consider this case:

                     With the phenotype   Without the phenotype
With the allele      3000                 1000
Without the allele   7000                 9000

• p-value, effect size, and cases explained are all good
  – OR = (3000 / 7000) / (1000 / 9000) = 3.86 > 1
• What can still be wrong?
• If we consider males and females separately:

                     With the phenotype      Without the phenotype
                     Male      Female        Male      Female
With the allele      2900      100           50        950
Without the allele   6000      1000          100       8900

  – OR(male) = (2900 / 6000) / (50 / 100) = 0.97 < 1
  – OR(female) = (100 / 1000) / (950 / 8900) = 0.94 < 1
  – Phenotype is associated with gender, not allele (“Simpson’s paradox”)
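The reversal can be verified by computing the odds ratio on the pooled counts and then within each stratum. A minimal sketch; the dictionary layout and names are illustrative.

```python
def odds_ratio(case_with, case_without, ctrl_with, ctrl_without):
    """Odds of carrying the allele among cases divided by the odds among controls."""
    return (case_with / case_without) / (ctrl_with / ctrl_without)

# (cases with allele, cases without, controls with allele, controls without)
strata = {
    "male":   (2900, 6000, 50, 100),
    "female": (100, 1000, 950, 8900),
}

# Pooling simply adds the strata cell by cell
pooled = tuple(sum(counts[i] for counts in strata.values()) for i in range(4))

print("pooled", pooled, f"OR = {odds_ratio(*pooled):.2f}")
for name, counts in strata.items():
    print(name, f"OR = {odds_ratio(*counts):.2f}")
```

Here most cases are male and most controls are female, and males carry the allele more often, so pooling creates an association that neither stratum shows.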


Multiple hypothesis testing
• If many loci are studied at the same time, another issue is multiple hypothesis testing
• If the p-value for an allele to be associated with a phenotype is 0.01,
  – If the allele is in fact not associated with the phenotype, there is a 1% chance of getting the observed case and control counts, or more extreme ones, by chance
  – If we consider 100 loci, we expect to encounter one such situation on average
  – In reality, we are considering millions of loci at the same time



Correction for multiple hypothesis testing

• Bonferroni correction (family-wise error rate)
  – Suppose we have tested N loci, none of which is associated with the phenotype
  – What is the chance that at least one of them has a p-value ≤ p?
  – Pr(locus 1 has a p-value ≤ p OR locus 2 has a p-value ≤ p OR ... OR locus N has a p-value ≤ p)
    ≤ Pr(locus 1 has a p-value ≤ p) + Pr(locus 2 has a p-value ≤ p) + ... + Pr(locus N has a p-value ≤ p)
    = p + p + ... + p (N times)
    = Np
  – For example, instead of 0.01, we only consider a p-value of 0.01/N or less to be significant
    • Interpretation: the probability that one or more of the loci we call associated with the phenotype are due to random chance is at most 0.01
  – Other commonly used correction methods: 1) false discovery rate based on the Benjamini-Hochberg procedure, 2) q-value
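Both corrections are easy to sketch in code. The Bonferroni part follows the derivation above directly. The Benjamini-Hochberg part is the standard step-up procedure (sort the p-values, find the largest rank k with p_(k) ≤ k/N × FDR, and call the k smallest significant), which the slide names but does not spell out; the function names, default thresholds and example p-values are made up for illustration.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Call a locus significant only if its p-value is at most alpha / N."""
    n = len(p_values)
    return [p <= alpha / n for p in p_values]

def benjamini_hochberg(p_values, fdr=0.05):
    """Step-up procedure controlling the false discovery rate at `fdr`."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    # Largest rank k (1-based) whose sorted p-value is below its threshold k/n * fdr
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / n * fdr:
            cutoff_rank = rank
    significant = [False] * n
    for rank, i in enumerate(order, start=1):
        if rank <= cutoff_rank:
            significant[i] = True
    return significant

# Hypothetical p-values for 8 loci
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(bonferroni_significant(p_values, alpha=0.05))
print(benjamini_hochberg(p_values, fdr=0.05))
```

Bonferroni controls the chance of any false call and is therefore stricter; Benjamini-Hochberg controls the expected fraction of false calls among those made, so it usually accepts more loci.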



Association between different loci
• In some cases, a disease-associated allele is due to a somatic mutation not inherited from the parents
• More often, it is inherited from the parents. In this case, alleles at some loci that do not cause the disease may also appear to be disease-associated
• Reasons:
  – Genetic linkage
  – Linkage disequilibrium
• Consequence: statistical association does not necessarily imply
  – Biological association
  – Causality



Genetic linkage
• For diploid organisms, there are two copies of each chromosome
• During meiosis, the two homologous chromosomes can exchange genetic material by recombination during chromosome crossover
• If two loci are close, the chance of crossover between them is small
  – Their alleles are likely passed on to daughter cells together
  – The rate of crossover can be used as a distance measure between two loci


Sister chromatids

Image source: http://www.tokresource.org/tok_classes/biobiobio/biomenu/meiosis/Crossover.gif


Genetic linkage (cont’d)
• 1 centimorgan (cM) or map unit (m.u.) equals the distance between chromosomal locations with, on average, 1% intervening crossover per generation
• Notice that if two loci are far apart, it is possible to have multiple crossover events between them in one generation
• Testing whether two loci are linked with recombination rate θ, if there are x non-recombinant offspring and y recombinant offspring according to a pedigree:
  – Log of odds,

$$\mathrm{LOD} = \max_{\theta} \log_{10} \frac{(1-\theta)^x \theta^y}{0.5^{x+y}} = \log_{10} \frac{\left(1-\frac{y}{x+y}\right)^x \left(\frac{y}{x+y}\right)^y}{0.5^{x+y}} = \log_{10} \frac{\left(\frac{x}{x+y}\right)^x \left(\frac{y}{x+y}\right)^y}{0.5^{x+y}}$$
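The closed form above translates directly into code. This sketch follows the slide in plugging in the maximizing value θ = y/(x+y) without restricting θ to at most 0.5 (an assumption carried over from the formula, not a recommendation), and works in log space so that large x and y do not underflow. The function name and example counts are illustrative.

```python
from math import log10

def lod_score(x, y):
    """LOD score for x non-recombinant and y recombinant offspring."""
    n = x + y
    if n == 0:
        raise ValueError("need at least one offspring")
    theta = y / n  # maximum-likelihood estimate of the recombination rate

    def term(count, prob):
        # count * log10(prob), with the convention 0 * log10(0) = 0
        return count * log10(prob) if count > 0 else 0.0

    return term(x, 1 - theta) + term(y, theta) - n * log10(0.5)

print(f"{lod_score(8, 2):.4f}")   # 8 non-recombinants, 2 recombinants
print(f"{lod_score(10, 0):.4f}")  # no recombinants observed
print(f"{lod_score(5, 5):.4f}")   # looks like free recombination
```

With equal numbers of recombinant and non-recombinant offspring the estimate is θ = 0.5 and the score is 0, i.e., no evidence of linkage.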


Linkage disequilibrium (LD)
• Sometimes the alleles at two loci can co-occur with a frequency that deviates from expectation (even if they are on different chromosomes)
• Some reasons:
  – Selection
  – Population bottleneck
  – Non-uniform rate of recombination
  – Non-random mating
  – ...



Representing GWAS results
• Allele list and LD map


Image credit: Altshuler et al., Science 322(5903):881-888, (2008)


Representing GWAS results (cont’d)
• Manhattan plot: -log(p) against chromosomal locations
• Quantile-quantile plot (Q-Q plot): theoretical distribution of chi-square values vs. observed values


Image credit: Samani et al., New England Journal of Medicine 357:443-453, (2007)


Combinatorial effects [project]
• It is possible that two loci have a joint effect on the phenotype
• Example: fraction of individuals with a phenotype, by genotype at locus 1 (columns) and locus 2 (rows)

           Locus 1
Locus 2    AA     Aa     aa
BB         0.9    0.1    0.1
Bb         0.1    0.9    0.1
bb         0.1    0.1    0.9

  – Each locus alone is not strongly associated with the phenotype (the “main effect” is not strong)
  – The two loci together are more strongly associated with the phenotype (the “interaction effect” is strong)
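A quick calculation shows why neither locus stands out on its own in this table. The sketch assumes, purely for illustration, that the nine genotype combinations are equally frequent; the slide does not give genotype frequencies.

```python
# Fraction of individuals with the phenotype; outer key: locus 2, inner key: locus 1
fraction = {
    "BB": {"AA": 0.9, "Aa": 0.1, "aa": 0.1},
    "Bb": {"AA": 0.1, "Aa": 0.9, "aa": 0.1},
    "bb": {"AA": 0.1, "Aa": 0.1, "aa": 0.9},
}

locus1_genotypes = ["AA", "Aa", "aa"]
locus2_genotypes = ["BB", "Bb", "bb"]

# Marginal fraction per genotype, averaging over the other locus
# (assumption: all genotype combinations equally frequent)
marginal_locus1 = {g1: sum(fraction[g2][g1] for g2 in locus2_genotypes) / 3
                   for g1 in locus1_genotypes}
marginal_locus2 = {g2: sum(fraction[g2][g1] for g1 in locus1_genotypes) / 3
                   for g2 in locus2_genotypes}

print("locus 1:", {g: round(v, 3) for g, v in marginal_locus1.items()})
print("locus 2:", {g: round(v, 3) for g, v in marginal_locus2.items()})
```

Under that assumption every single-locus genotype has the same marginal fraction, so a one-locus-at-a-time test sees nothing, while the joint table ranges from 0.1 to 0.9.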


Follow-up
• Validations:
  – Sanger sequencing to confirm the presence of the allele in the tested samples
  – Functional enrichment and network analysis
  – Replication in larger cohorts
  – Knock-out or (for genes) knock-down/over-expression experiments
    • In vitro
    • Animal models
• Ultimate applications:
  – Biomarker for diagnosis/prognosis prediction
  – Druggable targets



NHGRI GWA Catalog


Image source: http://www.genome.gov/gwastudies/

OMIC STUDIES

Part 2


Ome, omic, and omics
• Traditionally, biologists study one or a few biological objects at a time
  – Hypothesis driven
• Now it is possible to study many biological objects at the same time
  – Data driven
• Suppose we want to study a type of objects or phenomena, X
  – “X-ome”: A large amount of data related to X, or the whole set of X
  – “X-omic”: To study a large amount of data related to X
  – “X-omics”: The area of studying a large amount of data related to X



Different kinds of X-omics

Object/phenomenon type, X                                X-ome          X-omics
Genes/DNA                                                Genome         Genomics (The study of all genes/whole set of DNA)
Transcripts/transcription                                Transcriptome  Transcriptomics (The study of gene expression levels)
Exons/transcription                                      Exome          Exomics (The study of all exons)
Proteins                                                 Proteome       Proteomics (The study of protein identity and abundance)
Metabolism                                               Metabolome     Metabolomics (The study of metabolic reactions)
DNA methylation                                          Methylome      Methylomics (The study of whole-genome DNA methylation)
Non-coding RNAs, DNA methylation, histone modifications  Epigenome      Epigenomics (The study of inheritable non-DNA signals)
Population of co-existing species in an environment      Metagenome     Metagenomics (The study of different genomes, transcriptomes, etc. in a common environment)
Phenotypes                                               Phenome        Phenomics (The comprehensive study of phenotypes)
Interactions                                             Interactome    Interactomics (The study of all interactions of a certain type)
...                                                      ...            ...



Why omics?
• Unbiased and complete
  – If a hypothesis can explain an observation, it is not necessarily the only hypothesis that can explain it
  – May discover something surprising
• Easy (good or bad?)
  – From data to hypothesis
• Rapidly decreasing cost, becoming affordable
  – ‘If it does not cost much, why not?’



Why some people are reluctant
• Traditionally, science follows the “hypothesis, then test” pattern
• Outputs of high-throughput experiments contain irrelevant data, secondary effects, and noise
• Costly in the sense that
  – It is not guaranteed that anything will be found at the time the experiments are performed
  – Many hypotheses need to be validated

Need to be well aware of these potential pitfalls



Types of studies
• With a specific question. Examples:
  – What are the genetic factors that increase the susceptibility to type-2 diabetes?
    • Survey genetic variants in the whole genome
  – How common is RNA editing?
    • Compare all transcripts with DNA
• With a broad question. Examples:
  – What are the characteristics of domains defined by chromatin features?
    • Define domains using genome-wide chromatin features, then correlate with other features
  – Is there anything special about the distribution of protein binding sites?
    • Compare all regions containing protein binding sites with other regions



Some specific CBB problems
• More “downstream” than standard data processing tasks
  – Haplotype phasing [project]
  – Genotype imputation



Haplotype
• Haplotype: The combination of alleles on the same chromosome
  – Example: if an individual has A/C genotype at a locus, and G/T at another, there are four possible haplotypes:
    • AG
    • AT
    • CG
    • CT
  – The individual actually has two of them, one from each parent
  – For k loci, there are up to 2^k possible haplotypes
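The enumeration in the example can be written with `itertools.product`. A minimal sketch; the function name and the representation of a genotype as a pair of single-letter alleles are illustrative choices.

```python
from itertools import product

def possible_haplotypes(genotypes):
    """All allele combinations, one allele per locus.

    genotypes: list of (allele, allele) pairs, one pair per locus.
    Homozygous loci contribute a single choice, hence "up to" 2^k.
    """
    choices = [sorted(set(pair)) for pair in genotypes]
    return ["".join(combo) for combo in product(*choices)]

print(possible_haplotypes([("A", "C"), ("G", "T")]))
print(possible_haplotypes([("A", "A"), ("G", "T")]))
```

The count doubles with every heterozygous locus, which is why phasing by brute-force enumeration stops being practical beyond a handful of loci.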



Haplotype phasing
• Haplotype phasing: Finding the alleles that were inherited together
  – Even better, from which parent
• Main ideas:
  – Single individual: Look for sequencing reads that cover more than one variant
    • Difficult due to short read lengths (two human SNPs are about 1000bp apart on average)
    • Paired-end reads help, but not much
  – Multiple individuals
  – Family analysis (comparing with parents/siblings/children)


Haplotype inference
• All cases for a trio:

Description                                                    Father  Child         Mother
F:hom.; M:same hom. (; C:hom.)                                 AA      A|A           AA
F:hom.; M:diff hom. (; C:het.)                                 AA      A|C           CC
F:hom.; M:het. with com.; C:hom.                               AA      A|A           AC
F:hom.; M:het. with com.; C:het.                               AA      A|C           AC
F:hom.; M:het. without com. (; C:het.)                         AA      A|C           CG
F:het.; M:het. with 2 com.; C:hom.                             AC      A|A           AC
F:het.; M:het. with 2 com.; C:het. (The only unresolved case)  AC      A|C or C|A    AC
F:het.; M:het. with 1 com.; C:hom.                             AC      A|A           AG
F:het.; M:het. with 1 com.; C:het. with the com.               AC      A|G           AG
F:het.; M:het. with 1 com.; C:het. without the com.            AC      C|G           AG
F:het.; M:het. without com. (; C:het.)                         AC      A|G           GT

Abbreviations: F: father; M: mother; C: child; hom.: homozygous; het.: heterozygous; com.: common allele
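The rule behind the trio table is simple enough to sketch: try both ways of assigning the child's two alleles to the parents and keep the assignments where the paternal allele occurs in the father and the maternal allele occurs in the mother. The function name and the two-letter string encoding of genotypes are illustrative; the sketch handles a single locus and ignores genotyping errors and de novo mutations.

```python
def phase_child(father, child, mother):
    """Return the set of (paternal, maternal) allele pairs consistent with the trio.

    Each genotype is a two-character string such as "AC".
    One pair: phase resolved. Two pairs: ambiguous.
    Empty set: the trio is not consistent with Mendelian inheritance.
    """
    first, second = child
    candidates = {(first, second), (second, first)}
    return {(pat, mat) for pat, mat in candidates
            if pat in father and mat in mother}

print(phase_child("AA", "AC", "CC"))  # father hom., mother different hom.
print(phase_child("AC", "AG", "AG"))  # het. parents sharing one allele
print(phase_child("AC", "AC", "AC"))  # all three heterozygous for the same alleles
```

Only the last call returns two phasings, matching the single unresolved row of the table.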


Haplotype inference (cont’d)


Image credit: Roach et al., The American Journal of Human Genetics 89:382-397, (2011)

Parsimony: assuming no intervening recombination


Haplotype blocks
• A related problem: Given the SNPs at a list of loci from many unrelated individuals, identify haplotype blocks (sets of SNPs that are usually inherited together)
  – Before whole-genome sequencing became popular, one purpose of identifying blocks was to find representatives of each block (the “tagging SNPs”). This reduces the number of SNPs that need to be tested, since the others can be inferred (statistically) from them
  – Two main goals of the International HapMap Project: cataloging SNPs and constructing haplotype blocks



Missing genotypes
• Sometimes we want to guess the genotype/haplotype at certain loci:
  – They are missing because of the method used (e.g., a SNP array only checks a pre-defined set of SNPs)
  – Information at some neighboring loci is available
  – Full information for some other samples is available (e.g., those with whole-genome sequencing data)



Genotype imputation


Image credit: Howie et al., PLoS Genetics 5(6):e1000529, (2009)


Some common omic analysis techniques

• Aggregation: Aligning different elements, then aggregating their features
  – E.g., The methylation level at TSS, first 10% of 5’UTR, second 10% of 5’UTR, etc.
• Correlation: Finding relationships between different datasets
  – E.g., Binding patterns of different DNA-binding proteins

Image credit: Lister et al., Nature 462:315-322, (2009); Filion et al., Cell 143:212-224, (2010)

CASE STUDIES

Part 3


Case studies
• GWAS studies
• Whole-genome sequencing of individuals from different populations
• Large integrative projects
• Personal genomics



First GWAS study
• Reported in 2005, about age-related macular degeneration
• 96 patients, 50 controls
• Genotyped 116,204 SNPs
• In one paper, two SNPs were reported
  – p-values: 4.1 x 10^-8 and 1.4 x 10^-6
  – Dominant (i.e., 1 or 2 vs. 0 copies of the risk allele)
    • Expected odds ratios: 4.6 and 4.7
  – Recessive (i.e., 2 vs. 0 or 1 copies of the risk allele)
    • Expected odds ratios: 7.4 and 6.2
• Three papers by different groups:
  – Complement Factor H Polymorphism in Age-Related Macular Degeneration
  – Complement Factor H Polymorphism and Age-Related Macular Degeneration
  – Complement Factor H Variant Increases the Risk of Age-Related Macular Degeneration


Image credit: Klein et al., Science 308(5720):385-389, (2005)


One of the largest GWAS studies
• Wellcome Trust Case Control Consortium Study
• 14,000 cases of 7 common diseases
  – Bipolar disorder
  – Coronary heart disease
  – Crohn’s disease
  – Hypertension
  – Rheumatoid arthritis
  – Type 1 diabetes
  – Type 2 diabetes
• 3,000 shared controls


Image credit: The Wellcome Trust Case Control Consortium, Nature 447(7145):661-678, (2007)


Human whole-genome sequencing
• 2001: Draft human genome finished
• 2007: J. Craig Venter
• 2008: James Watson
• 2008: Han Chinese (YH)
• 2009: Yoruban male (NA18507)
• 2009: Korean male (SJK)
• 2009: Korean male (AK1)
• ...



SNPs of different populations


Image credit: Paschou et al., PLoS Genetics 3(9):e160, (2007)


18 Korean genomes & 17 transcriptomes


Image credit: Ju et al., Nature Genetics 43(8):745-752, (2011)


1000 Genomes Project [project]
• Pilot phase:
  – Low-coverage whole-genome sequencing of 179 individuals from 4 populations
  – High-coverage sequencing of two mother-father-child trios
  – Exon-targeted sequencing of 697 individuals from 7 populations


Image credit: The 1000 Genomes Project Consortium, Nature 467(7319):1061-1073, (2010)


Loss-of-function variants
• One interesting finding from the 1000 Genomes Project: Each person has on average about 100 loss-of-function (LoF) variants, with about 20 genes completely inactivated
  – Functions: blood type, muscle performance, drug metabolism, etc.
  – Less evolutionarily conserved
  – Fewer protein-protein interactions
  – Likely to have similar genes in the genome


Image credit: MacArthur et al., Science 335(6070):823-828, (2012)


Soybean project
• Sequenced 17 wild and 14 cultivated soybean genomes
• Data useful for identifying relevant alleles that exhibit desirable phenotypes
  – E.g., drought resistance


Image credit: Prof. Hon-Ming Lam, CUHK


Metagenomics
• Sequences from many different types of (microbe) species in a single sample
  – Soil
  – Sea water
  – Skin
  – Gut
  – ...
• Analysis tasks very different from standard sequencing projects
  – Difficult to tell which species each sequence read belongs to
  – Use highly-conserved sequences (e.g., rRNA) to estimate the abundance of each species
  – Concept of shared resources (e.g., proteins)
  – Environmental factors play critical roles


Image credit: Wikipedia


ENCODE [project]
• Encyclopedia of DNA Elements
• Identifying and characterizing all human DNA elements
• Similar projects for worm and fly (model organism ENCODE)


Image credit: Darryl Leja (NHGRI) and Ian Dunham (EBI)


Personal genomics
• Studying your own genome
  – Whole genome sequencing
  – Exome sequencing
  – SNPs
  – ...
• From personal genomics to personal medicine
  – Easy to produce data
  – Difficult to interpret results (predict implications)
  – Even more difficult to do disease prevention
• Other issues
  – Genetics is not the only factor
  – Everything is probability
  – Psychological impacts



Some companies



Summary
• Association studies try to find loci/alleles related to phenotypes
  – Family studies
  – Case-control studies
• Criteria for evaluating the significance of a variant:
  – p-value, effect size, cases explained
• Confounding factors:
  – Stratification, linkage disequilibrium
• Levels of coverage:
  – Selected variants < exome sequencing < whole-genome sequencing
• The current challenge is to produce the right data and identify the useful information
