Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Upload: bertram-stanley

Post on 18-Jan-2016


CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 2

Lecture outline
1. Genome-wide association studies
2. Omic studies
3. Case studies

Last update: 3-Nov-2015


GENOME-WIDE ASSOCIATION STUDIES

Part 1

Useful terms in genetics

• Locus (plural: loci)
  – A specific location, region, or gene on a chromosome
    • E.g., chr1:20376; human CFTR gene
• Allele
  – A variant of the DNA sequence at a given locus
    • E.g., A; wild-type
• Genotype
  – The set of alleles at a certain locus
    • E.g., A/A; A/C; wild-type/mutant
• Character
  – An observable property
    • E.g., eye color; shape of pea
    • Multiple levels (from expression level to growth rate)
• Trait
  – A variant of a character
    • E.g., blue (eye color); round (pea shape)
• Phenotype
  – An observed trait of an organism due to a combination of its genotype and environmental factors

Research problems: several directions

• Given a phenotype
  – Find the locus associated with it
  – Find an allele associated with it
  – Find the allele(s) that cause(s) it
• Given an allele/genotype
  – Determine the resulting trait/phenotype
• Given the alleles of different loci
  – Study how they are related to each other

From phenotype to genotype

• Suppose we observe a certain phenotype (e.g., a disease) in some individuals. How do we find the relevant loci/alleles?
• Things to consider:
  – Genetic vs. non-genetic factors
  – Single-locus vs. multi-locus
  – Homozygous vs. heterozygous (dominant vs. recessive)
• Difficulties:
  – Sample size
  – Controls
  – Data availability
  – Association vs. causality
  – Multiple hypothesis testing

Last update: 10-Nov-2015

Some well-known disease-related genes

• β-globin: mutations related to sickle-cell anemia (relationship discovered by Linus Pauling in 1949)
• p53: tumor suppressor
• BRCA1, BRCA2 (breast cancer 1/2, early onset): mutations related to breast cancer
• CFTR (cystic fibrosis transmembrane conductance regulator): mutations related to cystic fibrosis (first mutation found in 1989 by Francis Collins, Lap-Chee Tsui and John Riordan)
• CCR5 (C-C chemokine receptor type 5): a mutation related to protection against infection by M-tropic strains of HIV-1

Database: OMIM (Online Mendelian Inheritance in Man)

Image credit: Wikipedia

Different approaches

[Figure: pedigree chart. Legend: affected male, unaffected male, affected female, unaffected female]

Image credit: Mullen et al., Neurology 72(6):558-565, (2009)

Family linkage study

• Main ideas:
  – Identify members with/without the phenotype
  – Study the inheritance patterns of genetic markers
    • Fine: SNP (single nucleotide polymorphism)
    • Coarse: RFLP (restriction fragment length polymorphism)
    • Many other types: AFLP, DArT, RAD, RAPD, SFP, SSLP, SSR, STR, VNTR, ... (see the Wikipedia article “genetic marker”)
  – Deduce possible loci related to the phenotype
    • Usually not very precise
  – Also deduce homozygosity/heterozygosity

Restriction fragment length polymorphism

• Simple case for illustration:
  [Figure: RFLP illustration; the legend symbol marks a restriction site]
• Reality: lots of irrelevant data

Image credit: Wikipedia, http://www.ncbi.nlm.nih.gov/projects/genome/probe/doc/TechRFLP.shtml

Case-control studies

• Instead of studying families, another way is to find unrelated individuals with the phenotype as cases, and individuals without the phenotype (or random individuals with a low chance of having the phenotype) as controls
• Advantage over family studies:
  – Easier to get large samples
• Disadvantage: more diverse background
  – Genotypic differences may be due to ethnicity, gender, etc. (more later)
  – Need to balance the case and control groups, or perform special analysis to separate out different factors (e.g., principal component analysis)
• Large studies may also have issues with diverse experimental protocols, data quality, etc.
  – Example: retraction of a 2010 Science paper about potential genetic signatures related to longevity
  – Retraction note: Science 333:404, (2011)

Genome-wide association studies

• With high-throughput technologies, it is now possible to check many potential variants at the same time
  – SNP arrays
    • Predefined SNPs
    • Relatively inexpensive
  – Array-comparative genomic hybridization (array CGH) for copy number variations
  – Whole-genome sequencing
    • High coverage
    • High cost, especially when read depth needs to be high for confident SNP calling
    • Can also detect other types of variants (e.g., indels)
  – Exome sequencing
    • Only sequence captured exons
    • Compromise between coverage and cost

Copy number variation (CNV)

[Figure panels: array CGH; aneuploidy (e.g., in cancer)]

Image credit: http://clincancerres.aacrjournals.org/content/10/24/8204/F3.medium.gif, Chial, Nature Education 1(1), (2008)

Measures of association (allele)

• Suppose there are n1 cases and n2 controls, among which m1 and m2 have a given allele, respectively
• Is the allele likely associated with the phenotype?
• Contingency table:

|                    | With the phenotype | Without the phenotype | Total           |
|--------------------|--------------------|-----------------------|-----------------|
| With the allele    | m1                 | m2                    | m1+m2           |
| Without the allele | n1-m1              | n2-m2                 | (n1+n2)-(m1+m2) |
| Total              | n1                 | n2                    | n1+n2           |

• Null hypothesis H0: allele and phenotype are independent (hypergeometric distribution: row and column totals are fixed, the other entries are variables)

$$\Pr(m_1 \mid H_0) = \frac{\binom{n_1}{m_1}\binom{n_2}{m_2}}{\binom{n_1+n_2}{m_1+m_2}}$$

Measures of association (cont’d)

• p-value:
  – If the null hypothesis is true (that the allele is really unrelated to the phenotype), what is the probability of observing a value of m1 equal to or larger than the observed value?

[Figure: probability density of the null distribution plotted against m1; the p-value is the area at and beyond the observed value]

Measures of association (cont’d)

• What probability (p-value) to compute:
  – One-sided Fisher’s exact test (when we think having the allele has a positive effect on the phenotype): reject H0 if Pr(m1 or more | H0) is small
    • Infeasible if numbers are large
  – Two-sided chi-square test (when we think having the allele has either a positive or negative effect on the phenotype): reject H0 if a deviation from expectation at least as large as the one observed has a low probability
    • Expectation: e.g., #with both allele and phenotype = (m1+m2)n1/(n1+n2)
    • Test statistic: $\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i}$, where O_i and E_i are the observed and expected counts in cell i of the contingency table; it follows the chi-square distribution with 1 degree of freedom when n1 and n2 are large
  – The “power” of a test is the probability that the null hypothesis will be rejected when it is actually false
    • The probability that we will not miss a real association

Example

• Suppose there are n1=12 cases and n2=8 controls, among which m1=6 and m2=3 have a given allele, respectively
• Contingency table:

|                    | With the phenotype | Without the phenotype | Total |
|--------------------|--------------------|-----------------------|-------|
| With the allele    | 6                  | 3                     | 9     |
| Without the allele | 6                  | 5                     | 11    |
| Total              | 12                 | 8                     | 20    |

• Pr(m1=6 | H0) = (12C6)(8C3)/(20C9) = (924)(56)/167960 = 0.3081
• Pr(m1≥6 | H0) = (12C6)(8C3)/(20C9) + (12C7)(8C2)/(20C9) + (12C8)(8C1)/(20C9) + (12C9)(8C0)/(20C9) = 0.4650
  – Even if the phenotype is independent of the allele, by chance we still have a high probability of observing m1≥6
  – Therefore the association between the allele and the phenotype is not statistically significant
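The two probabilities in this example can be reproduced with exact binomial coefficients. The following is a minimal Python sketch using only the standard library; the function names `hypergeom_pmf` and `fisher_one_sided` are illustrative choices, not names from the lecture.

```python
from math import comb

def hypergeom_pmf(m1, n1, n2, carriers):
    """Pr(m1 | H0): m1 of the `carriers` allele carriers fall among the n1 cases."""
    return comb(n1, m1) * comb(n2, carriers - m1) / comb(n1 + n2, carriers)

def fisher_one_sided(m1, m2, n1, n2):
    """Pr(m1 or more | H0), with all row and column totals held fixed."""
    carriers = m1 + m2
    largest = min(n1, carriers)  # m1 cannot exceed either total
    return sum(hypergeom_pmf(i, n1, n2, carriers) for i in range(m1, largest + 1))

# Numbers from the example: n1=12 cases, n2=8 controls, m1=6, m2=3
p_point = hypergeom_pmf(6, 12, 8, 9)
p_tail = fisher_one_sided(6, 3, 12, 8)
print(f"Pr(m1=6 | H0)  = {p_point:.4f}")
print(f"Pr(m1>=6 | H0) = {p_tail:.4f}")
```

Summing every term of the tail is what makes the exact test expensive for large counts, which is why the chi-square approximation is used in that regime.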

Example (cont’d)

• Suppose there are n1=12 cases and n2=8 controls, among which m1=6 and m2=3 have a given allele, respectively
• Contingency table:

|                    | With the phenotype | Without the phenotype | Total |
|--------------------|--------------------|-----------------------|-------|
| With the allele    | 6                  | 3                     | 9     |
| Without the allele | 6                  | 5                     | 11    |
| Total              | 12                 | 8                     | 20    |

• Expectations if phenotype is independent of allele:

|                    | With the phenotype   | Without the phenotype | Total |
|--------------------|----------------------|-----------------------|-------|
| With the allele    | (9/20)(12/20)20=5.4  | (9/20)(8/20)20=3.6    | 9     |
| Without the allele | (11/20)(12/20)20=6.6 | (11/20)(8/20)20=4.4   | 11    |
| Total              | 12                   | 8                     | 20    |

• Chi-square statistic: (6 - 5.4)² / 5.4 + (3 - 3.6)² / 3.6 + (6 - 6.6)² / 6.6 + (5 - 4.4)² / 4.4 = 0.3030
  – Pr(χ² > 0.3030 | H0) = 0.5820
  – Even if the phenotype is independent of the allele, there is still a high chance of getting values that deviate from expectation as much as, or more than, the observed ones
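The chi-square calculation above can be checked numerically. This is a small sketch under the assumption of a 2x2 table and 1 degree of freedom, for which the upper-tail probability equals `erfc(sqrt(x / 2))`, so no statistics library is needed; the function name `chi_square_2x2` is hypothetical.

```python
from math import erfc, sqrt

def chi_square_2x2(m1, m2, n1, n2):
    """Return (statistic, two-sided p-value) for the allele/phenotype table."""
    total = n1 + n2
    carriers = m1 + m2
    observed = [m1, m2, n1 - m1, n2 - m2]
    # Expected count = (row total) * (column total) / (grand total)
    expected = [
        carriers * n1 / total,
        carriers * n2 / total,
        (total - carriers) * n1 / total,
        (total - carriers) * n2 / total,
    ]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(stat / 2))  # Pr(chi-square with 1 df > stat)
    return stat, p_value

stat, p_value = chi_square_2x2(m1=6, m2=3, n1=12, n2=8)
print(f"chi-square = {stat:.4f}, p-value = {p_value:.4f}")
```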

Measures of association (genotype)

• If we also want to know whether the homozygous and heterozygous situations are different, we need a 3x2 table instead (χ²: 2 degrees of freedom):

| Genotype | With the phenotype | Without the phenotype |
|----------|--------------------|-----------------------|
| AA       | m11                | m21                   |
| Aa       | m12                | m22                   |
| aa       | m13                | m23                   |

• Other tests are available for phenotypes that are not binary
  – Quantitative trait locus (QTL): locus associated with a continuous trait
    • eQTL: relating to the expression level of a gene

A problem of p-value

• p-value only tells how likely the observed data can be generated by random chance, but not how much the actual situation deviates from the null hypothesis
• Example:

| n1     | n2     | m1    | m2    | p-value (2-sided chi-square with Yates’ correction) |
|--------|--------|-------|-------|------------------------------------------------------|
| 100    | 100    | 51    | 49    | 0.8875                                               |
| 1000   | 1000   | 510   | 490   | 0.3955                                               |
| 10000  | 10000  | 5100  | 4900  | 0.0049                                               |
| 100000 | 100000 | 51000 | 49000 | <0.0001                                              |

• When sample size is large, it is quite clear that the allele and phenotype are not independent
  – However, the association is weak (51% vs. 49%)
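The p-values in this table can be recomputed to see how the same 51% vs. 49% split becomes significant as the sample grows. The sketch below assumes the usual form of Yates’ correction (subtract 0.5 from each |O - E| before squaring, never going below zero); the function name `yates_p_value` is an illustrative choice.

```python
from math import erfc, sqrt

def yates_p_value(n1, n2, m1, m2):
    """Two-sided chi-square p-value (1 df) with Yates' continuity correction."""
    total = n1 + n2
    carriers = m1 + m2
    cells = [  # (observed, expected) pairs for the four cells
        (m1, carriers * n1 / total),
        (m2, carriers * n2 / total),
        (n1 - m1, (total - carriers) * n1 / total),
        (n2 - m2, (total - carriers) * n2 / total),
    ]
    stat = sum(max(0.0, abs(o - e) - 0.5) ** 2 / e for o, e in cells)
    return erfc(sqrt(stat / 2))

for n in (100, 1000, 10000, 100000):
    m1, m2 = 51 * n // 100, 49 * n // 100
    print(n, n, m1, m2, f"{yates_p_value(n, n, m1, m2):.4f}")
```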

Effect size

• In addition to p-value, we also want to know how much more likely the individuals in the case group are to have the allele than those in the control group
• Usual measures for this “effect size”:
  – Relative risk,
    RR = (fraction of cases among people with the allele) / (fraction of cases among people without the allele)
    [Cannot be computed from a case-control sample, since the case/control ratio in the study is different from the case/non-case ratio in the population]
  – Odds ratio,
    OR = [(# with the allele in case group) / (# without the allele in case group)] / [(# with the allele in control group) / (# without the allele in control group)]
       = [m1/(n1 - m1)] / [m2/(n2 - m2)]
       = [m1(n2-m2)] / [m2(n1-m1)]

Last update: 11-Nov-2015

Example

| n1     | n2     | m1    | m2    | p-value (2-sided chi-square with Yates’ correction) | Odds ratio |
|--------|--------|-------|-------|------------------------------------------------------|------------|
| 100    | 100    | 51    | 49    | 0.8875                                               | 1.08       |
| 1000   | 1000   | 510   | 490   | 0.3955                                               | 1.08       |
| 10000  | 10000  | 5100  | 4900  | 0.0049                                               | 1.08       |
| 100000 | 100000 | 51000 | 49000 | <0.0001                                              | 1.08       |
| 10     | 10     | 7     | 3     | 0.1797                                               | 5.44       |
| 1000   | 1000   | 700   | 300   | <0.0001                                              | 5.44       |
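The odds-ratio column follows directly from OR = [m1(n2-m2)] / [m2(n1-m1)]. A short sketch, with `odds_ratio` as an illustrative function name:

```python
def odds_ratio(n1, n2, m1, m2):
    """Odds of carrying the allele among cases divided by the odds among controls."""
    # Undefined (division by zero) if m2 == 0 or every case carries the allele
    return (m1 * (n2 - m2)) / (m2 * (n1 - m1))

rows = [
    (100, 100, 51, 49),
    (100000, 100000, 51000, 49000),
    (10, 10, 7, 3),
    (1000, 1000, 700, 300),
]
for n1, n2, m1, m2 in rows:
    print(n1, n2, m1, m2, f"OR = {odds_ratio(n1, n2, m1, m2):.2f}")
```

The odds ratio depends only on the proportions, so it stays the same as the sample grows while the p-value shrinks.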

Cases explained

• Are p-value and effect size sufficient?
• Consider this situation:

|                    | With the phenotype | Without the phenotype |
|--------------------|--------------------|-----------------------|
| With the allele    | 100                | 10                    |
| Without the allele | 9900               | 9990                  |

  – p-value (two-sided chi-square with Yates’ correction): <0.0001
  – OR: (100/9900) / (10 / 9990) = 10.1
• Is this allele very important?
  – It only explains 1% of the individuals with the phenotype
  – As of 2010, the genetic variants discovered explained only 10% of type-2 diabetes heritability (Billings and Florez, Annals of the New York Academy of Sciences 1212:59-77, 2010)
• Further complicated by environmental factors

Stratification

• Now consider this case:

|                    | With the phenotype | Without the phenotype |
|--------------------|--------------------|-----------------------|
| With the allele    | 3000               | 1000                  |
| Without the allele | 7000               | 9000                  |

• p-value, effect size, and cases explained are all good
  – OR = (3000 / 7000) / (1000 / 9000) = 3.86 > 1
• What can still be wrong?
• If we consider males and females separately:

|                    | With the phenotype (male) | With the phenotype (female) | Without the phenotype (male) | Without the phenotype (female) |
|--------------------|---------------------------|-----------------------------|------------------------------|--------------------------------|
| With the allele    | 2900                      | 100                         | 50                           | 950                            |
| Without the allele | 6000                      | 1000                        | 100                          | 8900                           |

  – OR(male) = (2900 / 6000) / (50 / 100) = 0.97 < 1
  – OR(female) = (100 / 1000) / (950 / 8900) = 0.94 < 1
  – Phenotype is associated with gender, not the allele (“Simpson’s paradox”)
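The reversal between the pooled and per-gender odds ratios can be verified directly from the counts. This is an illustrative sketch; the names `odds_ratio`, `strata`, and `pooled` are assumptions introduced here.

```python
def odds_ratio(case_with, case_without, control_with, control_without):
    """Odds of carrying the allele among cases divided by the odds among controls."""
    return (case_with / case_without) / (control_with / control_without)

# (cases with allele, cases without, controls with allele, controls without)
strata = {
    "male": (2900, 6000, 50, 100),
    "female": (100, 1000, 950, 8900),
}

# Pooling adds the two strata cell by cell, giving the first table
pooled = tuple(sum(cells) for cells in zip(*strata.values()))

pooled_or = odds_ratio(*pooled)
stratum_or = {name: odds_ratio(*cells) for name, cells in strata.items()}

print(f"pooled OR = {pooled_or:.2f}")
for name, value in stratum_or.items():
    print(f"OR({name}) = {value:.2f}")
```

Most cases are male and most controls are female, and males carry the allele more often, so pooling creates an association that neither stratum shows.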

Multiple hypothesis testing

• If many loci are studied at the same time, another issue is multiple hypothesis testing
• Suppose the p-value for the association between an allele and a phenotype is 0.01:
  – If the allele is in fact not associated with the phenotype, there is a 1% chance of getting the observed case and control counts, or more extreme ones, by chance
  – If we consider 100 loci, we expect to encounter one such situation on average
  – In reality, we are considering millions of loci at the same time

Correction for multiple hypothesis testing

• Bonferroni correction (family-wise error rate)
  – Suppose we have tested N loci, none of which is associated with the phenotype
  – What is the chance that at least one of them has a p-value ≤ p?
  – Pr(locus 1 has a p-value ≤ p OR locus 2 has a p-value ≤ p OR ... OR locus N has a p-value ≤ p)
    ≤ Pr(locus 1 has a p-value ≤ p) + Pr(locus 2 has a p-value ≤ p) + ... + Pr(locus N has a p-value ≤ p)
    = p + p + ... + p (N times)
    = Np
  – For example, instead of 0.01, we only consider a p-value of 0.01/N or less to be significant
    • Interpretation: The probability that one or more of the loci we call associated with the phenotype are due to random chance is at most 0.01
  – Other commonly used correction methods: 1) false discovery rate based on the Benjamini-Hochberg procedure, 2) q-value
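To make the two corrections concrete, here is a minimal sketch of the Bonferroni threshold and the standard Benjamini-Hochberg step-up procedure. The p-values are made up for illustration, the 0.05 FDR level is an assumption, and the function names are hypothetical.

```python
def bonferroni_significant(p_values, alpha=0.01):
    """Call a locus significant only if its p-value is at most alpha / N."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]

def benjamini_hochberg(p_values, fdr=0.05):
    """Reject the k smallest p-values, where k is the largest rank with p_(k) <= k * fdr / N."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])  # indices by ascending p-value
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * fdr / n:
            largest_rank = rank
    rejected = [False] * n
    for idx in order[:largest_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for five loci (illustration only)
p_values = [0.0001, 0.004, 0.019, 0.03, 0.2]
bonferroni_calls = bonferroni_significant(p_values, alpha=0.01)
bh_calls = benjamini_hochberg(p_values, fdr=0.05)
print("Bonferroni:", bonferroni_calls)
print("Benjamini-Hochberg:", bh_calls)
```

Bonferroni controls the chance of any false call and is therefore stricter; Benjamini-Hochberg controls the expected fraction of false calls among the loci reported.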

Association between different loci

• In some cases, a disease-associated allele is due to a somatic mutation not inherited from parents
• More often, it is inherited from parents. In this case, alleles at some loci not causing the disease may also appear to be disease-associated
• Reasons:
  – Genetic linkage
  – Linkage disequilibrium
• Consequence: statistical association does not necessarily imply
  – Biological association
  – Causality

Genetic linkage

• For diploid organisms, there are two copies of each chromosome
• During meiosis, the two homologous chromosomes can exchange genetic material by recombination during chromosome crossover
• If two loci are close, the chance of crossover between them is small
  – Their alleles are likely passed on to daughter cells together
  – The rate of crossover can be used as a distance measure between two loci

[Figure: chromosome crossover; label: sister chromatids]

Image source: http://www.tokresource.org/tok_classes/biobiobio/biomenu/meiosis/Crossover.gif

Genetic linkage (cont’d)

• 1 centimorgan (cM) or map unit (m.u.) equals the distance between chromosomal locations for which the expected number of intervening crossovers per generation is 0.01 (1%)
• Notice that if two loci are far apart, it is possible to have multiple crossover events between them in one generation
• Testing whether two loci are linked with recombination rate θ, if there are x non-recombinant offspring and y recombinant offspring according to a pedigree:
  – Log of odds,

$$\mathrm{LOD} = \max_{\theta} \log_{10}\frac{(1-\theta)^{x}\,\theta^{y}}{0.5^{x+y}} = \log_{10}\frac{\left(1-\frac{y}{x+y}\right)^{x}\left(\frac{y}{x+y}\right)^{y}}{0.5^{x+y}} = \log_{10}\frac{\left(\frac{x}{x+y}\right)^{x}\left(\frac{y}{x+y}\right)^{y}}{0.5^{x+y}}$$
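The LOD formula above can be evaluated directly from the two offspring counts. The sketch below plugs in the maximizing value θ = y/(x+y); capping θ at 0.5 (free recombination) is an added assumption that goes beyond the formula as written, and `lod_score` is an illustrative name.

```python
from math import log10

def lod_score(x, y):
    """LOD for x non-recombinant and y recombinant offspring."""
    n = x + y
    # Maximum-likelihood estimate of theta; the cap at 0.5 is an assumption,
    # reflecting that unlinked loci recombine half of the time
    theta = min(y / n, 0.5)
    likelihood_linked = (1 - theta) ** x * theta ** y
    likelihood_unlinked = 0.5 ** n
    return log10(likelihood_linked / likelihood_unlinked)

# Hypothetical pedigree: 8 non-recombinant and 2 recombinant offspring
print(f"LOD(8, 2)  = {lod_score(8, 2):.4f}")
print(f"LOD(10, 0) = {lod_score(10, 0):.4f}")
print(f"LOD(5, 5)  = {lod_score(5, 5):.4f}")
```

For large pedigrees the raw likelihoods underflow, so a practical version would sum logarithms instead of multiplying powers.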

Linkage disequilibrium (LD)

• Sometimes the alleles at two loci can co-occur with a frequency that deviates from expectation (even if they are on different chromosomes)
• Some reasons:
  – Selection
  – Population bottleneck
  – Non-uniform rate of recombination
  – Non-random mating
  – ...

Representing GWAS results

• Allele list and LD map

Image credit: Altshuler et al., Science 322(5903):881-888, (2008)

Representing GWAS results (cont’d)

• Manhattan plot: -log(p) against chromosomal locations
• Quantile-quantile plot (Q-Q plot): theoretical distribution of chi-square values vs. observed values

Image credit: Samani et al., New England Journal of Medicine 357:443-453, (2007)

Combinatorial effects [project]

• It is possible that two loci have a joint effect on the phenotype
• Example: fraction of individuals with a phenotype

| Locus 2 \ Locus 1 | AA  | Aa  | aa  |
|-------------------|-----|-----|-----|
| BB                | 0.9 | 0.1 | 0.1 |
| Bb                | 0.1 | 0.9 | 0.1 |
| bb                | 0.1 | 0.1 | 0.9 |

  – Each locus alone is not strongly associated with the phenotype (“main effect” not strong)
  – The two loci together are more strongly associated with the phenotype (the “interaction effect” is strong)
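One way to see why the main effects are weak here is to average the table over the other locus. The sketch below assumes, purely for illustration, that the nine genotype combinations are equally frequent; under that assumption every single-locus genotype has the same fraction of affected individuals.

```python
locus1 = ["AA", "Aa", "aa"]
locus2 = ["BB", "Bb", "bb"]

# fraction[locus-2 genotype][locus-1 genotype], copied from the table
fraction = {
    "BB": {"AA": 0.9, "Aa": 0.1, "aa": 0.1},
    "Bb": {"AA": 0.1, "Aa": 0.9, "aa": 0.1},
    "bb": {"AA": 0.1, "Aa": 0.1, "aa": 0.9},
}

# Marginal fraction per genotype, assuming equally frequent combinations (an assumption)
marginal_locus1 = {
    g1: sum(fraction[g2][g1] for g2 in locus2) / len(locus2) for g1 in locus1
}
marginal_locus2 = {
    g2: sum(fraction[g2][g1] for g1 in locus1) / len(locus1) for g2 in locus2
}

print("Locus 1 alone:", {g: round(v, 3) for g, v in marginal_locus1.items()})
print("Locus 2 alone:", {g: round(v, 3) for g, v in marginal_locus2.items()})
```

Looking at either locus alone shows no difference between genotypes, while the joint table separates 0.9 from 0.1.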

Follow-up

• Validations:
  – Sanger sequencing to confirm the presence of the allele in the tested samples
  – Functional enrichment and network analysis
  – Replication in larger cohorts
  – Knock-out or (for genes) knock-down/over-expression experiments
    • In vitro
    • Animal models
• Ultimate applications:
  – Biomarker for diagnosis/prognosis prediction
  – Druggable targets

NHGRI GWA Catalog

Image source: http://www.genome.gov/gwastudies/

OMIC STUDIES

Part 2

Ome, omic, and omics

• Traditionally, biologists study one or a few biological objects at a time
  – Hypothesis driven
• Now it is possible to study many biological objects at the same time
  – Data driven
• Suppose we want to study a type of objects or phenomena, X
  – “X-ome”: A large amount of data related to X, or the whole set of X
  – “X-omic”: To study a large amount of data related to X
  – “X-omics”: The area of studying a large amount of data related to X

Different kinds of X-omics

| Object/phenomenon type, X | X-ome | X-omics |
|---------------------------|-------|---------|
| Genes/DNA | Genome | Genomics (The study of all genes/whole set of DNA) |
| Transcripts/transcription | Transcriptome | Transcriptomics (The study of gene expression levels) |
| Exons/transcription | Exome | Exomics (The study of all exons) |
| Proteins | Proteome | Proteomics (The study of protein identity and abundance) |
| Metabolism | Metabolome | Metabolomics (The study of metabolic reactions) |
| DNA methylation | Methylome | Methylomics (The study of whole-genome DNA methylation) |
| Non-coding RNAs, DNA methylation, histone modifications | Epigenome | Epigenomics (The study of inheritable non-DNA signals) |
| Population of co-existing species in an environment | Metagenome | Metagenomics (The study of different genomes, transcriptomes, etc. in a common environment) |
| Phenotypes | Phenome | Phenomics (The comprehensive study of phenotypes) |
| Interactions | Interactome | Interactomics (The study of all interactions of a certain type) |
| ... | ... | ... |

Why omics?

• Unbiased and complete
  – If a hypothesis can explain an observation, it is not necessarily the only hypothesis that can explain it
  – May discover something surprising
• Easy (good or bad?)
  – From data to hypothesis
• Rapidly decreasing cost, becoming affordable
  – ‘If it does not cost much, why not?’

Why some people are reluctant

• Traditionally, science follows the “hypothesis test” pattern
• Outputs of high-throughput experiments contain irrelevant data, secondary effects, and noise
• Costly in the sense that
  – Not guaranteed that anything will be found at the time the experiments are performed
  – Many hypotheses need to be validated

Need to be well aware of these potential pitfalls

Types of studies

• With a specific question. Examples:
  – What are the genetic factors that increase the susceptibility to type-2 diabetes?
    • Survey genetic variants in the whole genome
  – How common is RNA editing?
    • Compare all transcripts with DNA
• With a broad question. Examples:
  – What are the characteristics of domains defined by chromatin features?
    • Define domains using genome-wide chromatin features, then correlate with other features
  – Is there anything special about the distribution of protein binding sites?
    • Compare all regions containing protein binding sites with other regions

Some specific CBB problems

• More “downstream” than standard data processing tasks
  – Haplotype phasing [project]
  – Genotype imputation

Haplotype

• Haplotype: The combination of alleles on the same chromosome
  – Example: if an individual has A/C genotype at a locus, and G/T at another, there are four possible haplotypes:
    • AG
    • AT
    • CG
    • CT
  – The individual actually has two of them, one from each parent
  – For k loci, there are up to 2^k possible haplotypes
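The candidate haplotypes for a set of genotypes can be enumerated by taking one allele per locus. A small sketch using `itertools.product`; the function name `possible_haplotypes` is illustrative.

```python
from itertools import product

def possible_haplotypes(genotypes):
    """List every combination that takes one allele from each locus."""
    # A homozygous locus contributes a single choice, hence "up to" 2^k haplotypes
    allele_choices = [sorted(set(genotype)) for genotype in genotypes]
    return ["".join(alleles) for alleles in product(*allele_choices)]

# The example from the slide: A/C at one locus, G/T at another
haplotypes = possible_haplotypes([("A", "C"), ("G", "T")])
print(haplotypes)

# With one homozygous locus the count drops below 2^k
print(possible_haplotypes([("A", "A"), ("G", "T")]))
```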

Haplotype phasing

• Haplotype phasing: Finding the alleles that were inherited together
  – Even better, from which parent
• Main ideas:
  – Single individual: Look for sequencing reads that cover more than one variant
    • Difficult due to short read lengths (two human SNPs are about 1000bp apart on average)
    • Paired-end reads help, but not much
  – Multiple individuals
  – Family analysis (comparing with parents/siblings/children)

Haplotype inference

• All cases for a trio:

| Description | Father | Child | Mother |
|-------------|--------|-------|--------|
| F:hom.; M:same hom. (; C:hom.) | AA | A\|A | AA |
| F:hom.; M:diff hom. (; C:het.) | AA | A\|C | CC |
| F:hom.; M:het. with com.; C:hom. | AA | A\|A | AC |
| F:hom.; M:het. with com.; C:het. | AA | A\|C | AC |
| F:hom.; M:het. without com. (; C:het.) | AA | A\|C | CG |
| F:het.; M:het. with 2 com.; C:hom. | AC | A\|A | AC |
| F:het.; M:het. with 2 com.; C:het. (The only unresolved case) | AC | A\|C or C\|A | AC |
| F:het.; M:het. with 1 com.; C:hom. | AC | A\|A | AG |
| F:het.; M:het. with 1 com.; C:het. with the com. | AC | A\|G | AG |
| F:het.; M:het. with 1 com.; C:het. without the com. | AC | C\|G | AG |
| F:het.; M:het. without com. (; C:het.) | AC | A\|G | GT |

Abbreviations: F: father; M: mother; C: child; hom.: homozygous; het.: heterozygous; com.: common allele
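The trio table can be reproduced by checking which ordering of the child's two alleles is compatible with the parents. This is a minimal single-locus sketch; the function name `phase_child` and the string encoding of genotypes (e.g., "AC") are assumptions made for illustration, and the output is written as paternal|maternal to match the table.

```python
def phase_child(father, mother, child):
    """Return every paternal|maternal phasing of the child consistent with the parents."""
    first, second = child
    # The two possible assignments of the child's alleles to (father, mother);
    # a set removes the duplicate when the child is homozygous
    candidates = {(first, second), (second, first)}
    consistent = sorted(
        (paternal, maternal)
        for paternal, maternal in candidates
        if paternal in father and maternal in mother
    )
    return [f"{paternal}|{maternal}" for paternal, maternal in consistent]

print(phase_child("AA", "AC", "AC"))  # father can only give A
print(phase_child("AC", "AG", "CG"))  # C must be paternal, G maternal
print(phase_child("AC", "AC", "AC"))  # both orderings remain possible
```

An empty list would indicate genotypes that are inconsistent with Mendelian inheritance (for example, a genotyping error).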

Haplotype inference (cont’d)

• Parsimony: assuming no intervening recombination

Image credit: Roach et al., The American Journal of Human Genetics 89:382-397, (2011)

Page 47: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

CSCI5050 Bioinformatics and Computational Biology | Kevin Yip-cse-cuhk | Fall 2015 47

Haplotype blocks
• A related problem: Given the SNPs at a list of loci from many unrelated individuals, identify haplotype blocks (SNPs that are usually inherited together)
  – Before whole-genome sequencing became popular, one purpose of this was to find representatives of each block (the “tagging SNPs”) to reduce the number of SNPs that need to be tested, since the others can be inferred (statistically) from them
  – Two main goals of the International HapMap Project: cataloging SNPs and constructing haplotype blocks
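
The slide does not say how blocks or tagging SNPs are chosen. As a rough sketch only, one common ingredient is the pairwise r² measure of linkage disequilibrium between two biallelic SNPs, combined here with a simple greedy pass that keeps a SNP as a tag unless an earlier tag already predicts it well. The 0/1 allele coding, the function names, and the 0.8 threshold are all assumptions for illustration.

```python
def r_squared(snp_a, snp_b):
    """r^2 between two biallelic SNPs, given 0/1 alleles over the same haplotypes."""
    n = len(snp_a)
    p_a = sum(snp_a) / n
    p_b = sum(snp_b) / n
    p_ab = sum(1 for x, y in zip(snp_a, snp_b) if x == 1 and y == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return 0.0 if denom == 0 else d * d / denom


def greedy_tag_snps(haplotypes, threshold=0.8):
    """Pick tag SNP indices so that every SNP has r^2 >= threshold with some tag.

    haplotypes: list of haplotypes, each a list of 0/1 alleles (one per SNP).
    """
    n_snps = len(haplotypes[0])
    columns = [[h[j] for h in haplotypes] for j in range(n_snps)]
    tags = []
    for j in range(n_snps):
        covered = any(r_squared(columns[t], columns[j]) >= threshold for t in tags)
        if not covered:
            tags.append(j)
    return tags


# Toy data (invented): SNPs 0 and 1 always co-occur, as do SNPs 2 and 3.
toy = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 0, 0], [1, 1, 1, 1]]
print(greedy_tag_snps(toy))  # [0, 2]
```

In the toy data, genotyping only SNPs 0 and 2 is enough, because SNPs 1 and 3 can be inferred from them.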

Page 48: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Missing genotypes
• Sometimes we want to guess the genotype/haplotype at certain loci:
  – They are missing because of the method used (e.g., a SNP array only checks a pre-defined set of SNPs)
  – Information at some neighboring loci is available
  – Full information for some other samples is available (e.g., those with whole-genome sequencing data)
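
To make the idea concrete, here is a deliberately simplified Python sketch: a haplotype with missing alleles is filled in by copying from the fully known reference haplotype that agrees with it at the most observed loci. This nearest-match rule is an assumption chosen for brevity; practical imputation methods are probabilistic and can switch between reference haplotypes along the chromosome. The function name and the use of `None` for a missing allele are illustrative.

```python
def impute_haplotype(target, reference_panel):
    """Fill missing alleles (None) in `target` from the best-matching reference.

    target: list of alleles, with None where the allele was not measured.
    reference_panel: list of fully known haplotypes of the same length.
    """
    def n_matches(ref):
        # Count agreements at the loci that were actually observed.
        return sum(1 for t, r in zip(target, ref) if t is not None and t == r)

    best = max(reference_panel, key=n_matches)  # ties go to the earliest entry
    return [r if t is None else t for t, r in zip(target, best)]


panel = ["ACGTA", "ATGCA", "GCGTT"]          # invented reference haplotypes
observed = ["A", None, "G", "C", None]       # e.g., array data with two gaps
print(impute_haplotype(observed, panel))     # ['A', 'T', 'G', 'C', 'A']
```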

Page 49: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Genotype imputation

Image credit: Howie et al., PLoS Genetics 5(6):e1000529, (2009)

Page 50: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Some common omic analysis techniques

• Aggregation: Aligning different elements, then aggregating their features
  – E.g., the methylation level at the TSS, the first 10% of the 5’ UTR, the second 10% of the 5’ UTR, etc.
• Correlation: Finding relationships between different datasets
  – E.g., binding patterns of different DNA-binding proteins
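
Both techniques can be sketched in a few lines of Python. The code below assumes a per-base signal stored in a list, elements given as half-open (start, end) intervals at least `n_bins` bases long, and no strand handling; these choices and all names are illustrative rather than taken from the cited papers.

```python
import math


def binned_profile(signal, start, end, n_bins=10):
    """Mean signal in each of n_bins equal-width bins of the element [start, end)."""
    length = end - start
    profile = []
    for b in range(n_bins):
        lo = start + (b * length) // n_bins
        hi = start + ((b + 1) * length) // n_bins
        window = signal[lo:hi]
        profile.append(sum(window) / len(window))
    return profile


def aggregate(signal, elements, n_bins=10):
    """Align elements by relative position, then average their binned profiles."""
    profiles = [binned_profile(signal, s, e, n_bins) for s, e in elements]
    return [sum(column) / len(column) for column in zip(*profiles)]


def pearson(x, y):
    """Pearson correlation between two equally long signal vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


signal = list(range(20))                             # invented per-base signal
print(aggregate(signal, [(0, 10), (10, 20)], 2))     # [7.0, 12.0]
print(pearson([1, 2, 3], [2, 4, 6]))                 # 1.0
```

Binning by relative position is what lets elements of different lengths be compared, which matches the "first 10%, second 10%" example on the slide.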

Image credit: Lister et al., Nature 462:315-322, (2009); Filion et al., Cell 143:212-224, (2010)

Page 51: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

CASE STUDIES

Part 3

Page 52: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Case studies
• GWAS studies
• Whole-genome sequencing of individuals from different populations
• Large integrative projects
• Personal genomics

Page 53: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

First GWAS study
• Reported in 2005, about age-related macular degeneration
• 96 patients, 50 controls
• Genotyped 116,204 SNPs
• In one paper, two SNPs were reported
  – p-values: 4.1 × 10^-8 and 1.4 × 10^-6
  – Dominant (i.e., 1,2 vs. 0 copies of risk allele)
    • Expected odds ratios: 4.6 and 4.7
  – Recessive (i.e., 2 vs. 0,1 copies of risk allele)
    • Expected odds ratios: 7.4 and 6.2
• Three papers by different groups:
  – Complement Factor H Polymorphism in/Polymorphism and/Variant Increases the Risk of Age-Related Macular Degeneration
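
The dominant and recessive groupings differ only in where the genotype counts are split before the odds ratio is computed. A small Python sketch follows; the genotype counts in the example are invented for illustration and are not the study's data, and the function name is hypothetical.

```python
def odds_ratio(case_counts, control_counts, model):
    """Odds ratio for carrying the risk genotype group, cases vs. controls.

    case_counts / control_counts: (n0, n1, n2), the numbers of individuals
    with 0, 1 and 2 copies of the risk allele.
    model: "dominant" (1,2 vs. 0 copies) or "recessive" (2 vs. 0,1 copies).
    """
    if model == "dominant":
        split = 1
    elif model == "recessive":
        split = 2
    else:
        raise ValueError("model must be 'dominant' or 'recessive'")
    case_exposed = sum(case_counts[split:])
    case_unexposed = sum(case_counts[:split])
    control_exposed = sum(control_counts[split:])
    control_unexposed = sum(control_counts[:split])
    return (case_exposed * control_unexposed) / (case_unexposed * control_exposed)


cases = (20, 40, 40)      # invented counts for 0, 1, 2 risk-allele copies
controls = (50, 40, 10)   # invented counts
print(odds_ratio(cases, controls, "dominant"))   # 4.0
print(odds_ratio(cases, controls, "recessive"))  # 6.0
```

With small samples a cell of the 2×2 table can be zero, in which case this simple ratio is undefined and a continuity correction would be needed.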

Image credit: Klein et al., Science 308(5720):385-389, (2005)

Page 54: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

One of the largest GWAS studies
• Wellcome Trust Case Control Consortium Study
• 14,000 cases of 7 common diseases
  – Bipolar disorder
  – Coronary heart disease
  – Crohn’s disease
  – Hypertension
  – Rheumatoid arthritis
  – Type 1 diabetes
  – Type 2 diabetes
• 3,000 shared controls

Image credit: The Wellcome Trust Case Control Consortium, Nature 447(7145):661-678, (2007)

Page 55: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Human whole-genome sequencing
• 2001: Draft human genome finished
• 2007: J. Craig Venter
• 2008: James Watson
• 2008: Han Chinese (YH)
• 2009: Yoruban male (NA18507)
• 2009: Korean male (SJK)
• 2009: Korean male (AK1)
• ...

Page 56: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

SNPs of different populations

Image credit: Paschou et al., PLoS Genetics 3(9):e160, (2007)

Page 57: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

18 Korean genomes & 17 transcriptomes

Image credit: Ju et al., Nature Genetics 43(8):745-752, (2011)

Page 58: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

1000 Genomes Project [Project]
• Pilot phase:
  – Low-coverage whole-genome sequencing of 179 individuals from 4 populations
  – High-coverage sequencing of two mother-father-child trios
  – Exon-targeted sequencing of 697 individuals from 7 populations

Image credit: The 1000 Genomes Project Consortium, Nature 467(7319):1061-1073, (2010)

Page 59: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Loss-of-function variants
• One interesting finding from the 1000 Genomes Project: Each person has on average about 100 loss-of-function (LoF) variants, with about 20 genes completely inactivated
  – Functions: blood type, muscle performance, drug metabolism, etc.
  – Less evolutionarily conserved
  – Fewer protein-protein interactions
  – Likely to have similar genes in the genome

Image credit: MacArthur et al., Science 335(6070):823-828, (2012)

Page 60: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Soybean project
• Sequenced 17 wild and 14 cultivated soybean genomes
• Data useful for identifying relevant alleles that exhibit desirable phenotypes
  – E.g., drought resistance

Image credit: Prof. Hon-Ming Lam, CUHK

Page 61: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Metagenomics
• Sequences from many different types of (microbial) species in a single sample
  – Soil
  – Sea water
  – Skin
  – Gut
  – ...
• Analysis tasks very different from standard sequencing projects
  – Difficult to tell which species each sequence read comes from
  – Use highly conserved sequences (e.g., rRNA) to estimate abundance of each species
  – Concept of shared resources (e.g., proteins)
  – Environmental factors play critical roles
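
As a rough illustration of the rRNA-based abundance estimate, the Python sketch below turns per-species counts of reads assigned to a marker gene into relative abundances. It assumes the read-to-species assignment has already been done. The optional division by marker copy number is an added assumption (species can carry different numbers of rRNA gene copies), and the species names and counts are invented.

```python
def relative_abundance(marker_read_counts, copy_numbers=None):
    """Estimate relative abundance of each species from marker-gene read counts.

    marker_read_counts: dict mapping species name -> reads assigned to the marker.
    copy_numbers: optional dict mapping species name -> marker copies per genome;
                  species not listed are treated as having one copy.
    """
    copy_numbers = copy_numbers or {}
    adjusted = {
        species: count / copy_numbers.get(species, 1)
        for species, count in marker_read_counts.items()
    }
    total = sum(adjusted.values())
    if total == 0:
        raise ValueError("no marker reads to normalize")
    return {species: value / total for species, value in adjusted.items()}


counts = {"species_A": 60, "species_B": 40}          # invented read counts
print(relative_abundance(counts))                     # {'species_A': 0.6, 'species_B': 0.4}
print(relative_abundance(counts, {"species_A": 3}))   # species_A drops to 1/3
```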

Image credit: Wikipedia

Page 62: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

ENCODE [project]
• Encyclopedia of DNA Elements
• Identifying and characterizing all human DNA elements
• Similar projects for worm and fly (model organism ENCODE)

Image credit: Darryl Leja (NHGRI) and Ian Dunham (EBI)

Page 63: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Personal genomics
• Studying your own genome
  – Whole genome sequencing
  – Exome sequencing
  – SNPs
  – ...
• From personal genomics to personal medicine
  – Easy to produce data
  – Difficult to interpret results (predict implications)
  – Even more difficult to do disease prevention
• Other issues
  – Genetics is not the only factor
  – Everything is probability
  – Psychological impacts

Page 64: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Some companies

Page 65: Lecture 10. Topics in Omic Studies (Basics) The Chinese University of Hong Kong CSCI5050 Bioinformatics and Computational Biology

Summary
• Association studies try to find loci/alleles related to phenotypes
  – Family studies
  – Case-control studies
• Criteria for evaluating the significance of a variant:
  – p-value, effect size, cases explained
• Confounding factors:
  – Stratification, linkage disequilibrium
• Levels of coverage:
  – Selected variants < exome sequencing < whole-genome sequencing
• The current challenge is to produce the right data and identify the useful information