New perspectives in genomic analysis: implications of the ENCODE project

Rory Johnson, Bioinformatics and Genomics, Centre for Genomic Regulation

AEEH, 21 / 2 / 14



2

This talk:

• Our view of the human genome today thanks to ENCODE

• What it means for translational research

3

Epigenetics: the intermediate between genome and phenotype

4

(Hong Kong)

Our changing view of the genome

2000 2014

5

Our changing view of the genome

Chromatin: histones + modifications

Transcription factors

CAGGCATTAACCTTAGTCCTAATGGTTAGAGTCGTCCCTGATAATCTTAGTGAGGAAGGGACATTTCCAGAGTCGCCCAG CAGCAAATTCCAGATGTCTAAGGTCCCCAAACAGAACAAAATTGCATAAT

This organisation is encoded in non-protein coding genome sequence

Enhancers

6

The Genome and Epigenome

Genome sequence:

• Simple

• Static

Epigenome sequence:

• Multi-layered

• Dynamic

• Cell-specific

=> Hence ENCODE

7

The human genome in numbers

• 3 x 10^9 base pairs

• 20,345 protein coding genes

• 13,870 long noncoding RNA genes

• 9,013 small noncoding RNA genes

• 3 x 10^6 regulatory regions (enhancers)

• 12,460 known trait-associated SNPs (single nucleotide polymorphisms)

• 88% of trait-associated SNPs lie outside protein coding sequence

8

Next Generation Sequencing

The high throughput reading of DNA or RNA.

The main system now is the Illumina HiSeq

Statistics:

• Read length: ~150 nt

• Reads per lane: ~150 million

• Lanes per run: 16

• Total nt per run: ~400 billion

• Cost per run: ~16,000 euro

(The Human Genome Project took 13 years and $3 billion to sequence 3 billion nt, ending in 2003)
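As a rough check on the figures above, the short sketch below multiplies out the quoted run statistics. It assumes every lane yields the quoted number of reads at the quoted length, and "genome equivalents" here means raw 1x coverage of 3 billion nt, which is far less than a usable assembled genome.

```python
# Approximate figures quoted on the slide.
READ_LENGTH_NT = 150
READS_PER_LANE = 150_000_000
LANES_PER_RUN = 16
COST_PER_RUN_EUR = 16_000
GENOME_SIZE_NT = 3_000_000_000


def nt_per_run(read_length, reads_per_lane, lanes):
    """Total nucleotides read in one run, assuming every lane is full."""
    return read_length * reads_per_lane * lanes


total_nt = nt_per_run(READ_LENGTH_NT, READS_PER_LANE, LANES_PER_RUN)

# Raw 1x genome equivalents per run (not assembled, not quality filtered).
genome_equivalents = total_nt / GENOME_SIZE_NT
cost_per_genome_equivalent = COST_PER_RUN_EUR / genome_equivalents

print(f"nt per run: {total_nt:.2e}")
print(f"1x genome equivalents per run: {genome_equivalents:.0f}")
print(f"euro per 1x genome equivalent: {cost_per_genome_equivalent:.2f}")
```

The product comes to 3.6 x 10^11 nt, consistent with the rounded "~400 billion" on the slide.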

http://www.labome.com/method/RNA-seq-Using-Next-Generation-Sequencing.html

9

NGS based methods for genome analysis: towards the clinic

• ChIP-seq (chromatin immunoprecipitation): transcription factor binding / chromatin state

• DNase-seq: transcription factor binding / chromatin state

• RNAseq: mRNA transcription / splicing

• Ribosome footprinting: translation rate

• Hi-C: genome 3D structure

These methods have been demonstrated to be practical for continuous patient monitoring or diagnostics:

• Chen et al., Cell 148(6), 1293–1307 (2012), “iPOP”

• Buenrostro et al., Nature Methods 10, 1213–1218 (2013)

“Using ATAC-seq maps of human CD4+ T cells from a proband obtained on consecutive days, we demonstrated the feasibility of analyzing an individual's epigenome on a timescale compatible with clinical decision-making.”

 

10

The ENCODE Project

• ENCODE: Encyclopedia of DNA Elements (http://www.genome.gov/10005107)

• International consortium dedicated to comprehensively mapping the human epigenome.

• Created high quality ongoing gene annotations: GENCODE

• 32 laboratories, $400 million

• In Spain: Roderic Guigó (CRG) was one of the leaders (with Tom Gingeras, CSHL) of the transcriptomics section.

• 147 cell types (mainly transformed cell lines)

• 1640 genome-wide datasets

11

ENCODE integrates multiple data types across cell types

• RNAseq: gene expression

• ChIP: chromatin

• ChIP: transcription factors

• ChIA-PET: genome structure / folding

• GENCODE: gene annotation catalogue

12

Visualizing ENCODE data at the UCSC Genome Browser

http://genome-euro.ucsc.edu/

13

ENCODE data of relevance to hepatology

ENCODE Tier 2: HepG2 cell line (hepatocellular carcinoma); see http://www.genome.gov/26524238 for other cell types

Including:

• 8 RNAseq experiments

• 114 Transcription Factor ChIP experiments (including CEBPB, HNF4A, HNF4G)

http://genome-euro.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html

Browser tracks shown: Genes, Chromatin, Transcription Factors, RNA

14

Chromatin state is extremely cell type specific

15

Other projects of relevance: Epigenomics Roadmap Project

16

Other projects of relevance: eQTL

• GTEx – Genotype-Tissue Expression project

• Hunting for genetic variants that influence gene expression

Linking genetic variants to changes in gene expression – regulatory variants or “expression quantitative trait loci” (eQTL)

These will be different between tissues
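To make the eQTL idea concrete, here is a minimal sketch that regresses expression on allele dosage (0, 1 or 2 copies of the alternative allele) in two tissues. All genotypes, expression values and tissue names are invented for illustration, and simple least squares stands in for the more elaborate models that real eQTL studies use.

```python
from statistics import mean


def eqtl_fit(dosages, expression):
    """Least-squares slope and Pearson r of expression on allele dosage."""
    mx, my = mean(dosages), mean(expression)
    sxx = sum((x - mx) ** 2 for x in dosages)
    syy = sum((y - my) ** 2 for y in expression)
    sxy = sum((x - mx) * (y - my) for x, y in zip(dosages, expression))
    slope = sxy / sxx
    r = sxy / (sxx * syy) ** 0.5
    return slope, r


# Hypothetical data: six individuals, one SNP, one gene, two tissues.
dosages = [0, 0, 1, 1, 2, 2]
expression_tissue_a = [1.0, 1.2, 2.0, 2.2, 3.1, 2.9]  # tracks genotype
expression_tissue_b = [2.0, 2.1, 2.0, 2.1, 2.0, 2.1]  # unrelated to genotype

slope_a, r_a = eqtl_fit(dosages, expression_tissue_a)
slope_b, r_b = eqtl_fit(dosages, expression_tissue_b)

print(f"tissue A: slope={slope_a:.2f}, r={r_a:.2f}")
print(f"tissue B: slope={slope_b:.2f}, r={r_b:.2f}")
```

In this toy case the variant behaves as an eQTL in tissue A but not in tissue B, which is the tissue-specific pattern the slide describes.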

17

What does this mean for translational research?

• Protein-focussed studies will miss the majority of functional disease-causing variants / mutations

• Non-coding variants will usually be regulatory

• Non-coding variants will usually be cell type specific

• Large projects like ENCODE are producing rich data that can be used to interpret clinical results


18

How can genetic variants (SNPs) in noncoding regions cause phenotype?

• By altering the nucleotide sequence recognized by regulatory protein

Hawkins et al Nature Reviews Genetics 11, 476-486
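One common way to reason about this mechanism is to score the reference and alternative alleles against a position weight matrix (PWM) for the binding protein. The sketch below uses a made-up 4-position motif and a uniform background; neither the motif nor the site comes from the talk.

```python
import math

# Hypothetical motif: base probabilities at each of 4 positions.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
BACKGROUND = 0.25  # assumed uniform base frequency


def log_odds_score(site):
    """Sum of per-position log2(motif probability / background)."""
    return sum(math.log2(PWM[i][base] / BACKGROUND) for i, base in enumerate(site))


ref_site = "AGCT"  # matches the motif consensus
alt_site = "AGTT"  # hypothetical SNP: C>T at the third position

ref_score = log_odds_score(ref_site)
alt_score = log_odds_score(alt_site)

print(f"reference allele score:   {ref_score:.2f}")
print(f"alternative allele score: {alt_score:.2f}")
print(f"change caused by the SNP: {alt_score - ref_score:.2f}")
```

A lower score for the alternative allele suggests weaker predicted binding; whether that changes expression still has to be tested experimentally.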

19

How can genetic variants (SNPs) in noncoding regions cause phenotype?

Genetic variant (SNP) => Gene expression => Disease

20

How does ENCODE affect translational research projects?

• Genome wide association study (GWAS)

• Exome sequencing

• Gene expression profiling

21

Translational research approaches 1: Genetic approaches

Genomic approaches to identify genetic variants underlying disease:

GWAS: genome wide association study

Advantages:

• Genome wide

• Not biased towards coding regions

• Good at identifying common variants

Disadvantages:

• Depends on a limited number of marker SNPs

• Low resolution

• Does not yield insights into mechanism

Exome sequencing: targeted genome sequencing

Advantages:

• Proteome wide

• Can identify rare causative variants

• Usually yields mechanistic hypothesis

• High resolution

Disadvantages:

• No information about noncoding variants

• Likely missing most causative variants

22

Interpretation of GWAS results

GWAS gives an unbiased genome wide set of candidate SNPs

The majority of these lie outside protein coding regions

Two main challenges:

1. Identifying the causative SNP

2. Understanding the mechanism of action of that SNP

Li et al PLoS Genet 8(7): e1002791.

Hepatocellular carcinoma

23

Identifying the causative SNP using ENCODE data

Schaub et al Genome Res. 2012 Sep;22(9):1748-59. doi: 10.1101/gr.136127.111

Hunt for the likely functional SNP in linkage disequilibrium (LD) with the marker SNP
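The search described here can be sketched as a two-step filter: keep SNPs in strong LD with the GWAS marker, then keep those that fall inside an annotated regulatory region. The SNP names, positions, r² values, region labels and the 0.8 threshold below are all hypothetical; this is an illustration of the idea, not the procedure from Schaub et al.

```python
# Hypothetical LD (r^2) of nearby SNPs with the GWAS marker SNP.
LD_R2 = {"rs_marker": 1.0, "rs_a": 0.95, "rs_b": 0.85, "rs_c": 0.40}

# Hypothetical positions on a single chromosome.
POSITIONS = {"rs_marker": 1000, "rs_a": 1500, "rs_b": 2200, "rs_c": 5000}

# Hypothetical regulatory annotations as half-open intervals [start, end).
REGULATORY_REGIONS = [
    (1400, 1600, "DNase peak"),
    (2100, 2300, "TF ChIP-seq peak"),
    (4900, 5100, "enhancer"),
]


def candidate_functional_snps(ld_r2, positions, regions, min_r2=0.8):
    """Return (snp, r2, labels) for linked SNPs that overlap an annotation."""
    hits = []
    for snp, r2 in ld_r2.items():
        if r2 < min_r2:
            continue  # not tightly linked to the marker
        pos = positions[snp]
        labels = [label for start, end, label in regions if start <= pos < end]
        if labels:
            hits.append((snp, r2, labels))
    # Strongest LD first.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


candidates = candidate_functional_snps(LD_R2, POSITIONS, REGULATORY_REGIONS)
for snp, r2, labels in candidates:
    print(snp, r2, labels)
```

Note that the marker itself drops out in this toy example because it overlaps no annotation, while two linked SNPs remain as candidates.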

24

Understanding the mechanism of a noncoding SNP using ENCODE data

Schaub et al Genome Res. 2012 Sep;22(9):1748-59. doi: 10.1101/gr.136127.111

25

RegulomeDB: A web server for functional prediction of SNPs using ENCODE data

26

Exome sequencing

Exome sequencing: targeted genome sequencing of protein coding exons

Relies on capturing a selected subset of genome

Advantages:

• Lower cost

• Higher statistical power

• Can detect rare private mutations

Disadvantages:

• Presently ignores the noncoding genome (~99%)

27

Exome sequencing: what's next?

Whole genome sequencing is not likely to be practical: no statistical power

Exome technology is highly customisable and could be adapted to noncoding regions

The main question: what are the target regions?

• How to define the target space? Regulatory regions? Noncoding RNAs? Protein binding sites?

• Likely to be organ / disease specific

• Will require bioinformatic analysis to design reagents before experimental project begins.
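Part of that bioinformatic design step is combining candidate regions from several sources into one non-overlapping target set and checking its total size. The sketch below does this with a simple interval merge; the coordinates and the three source lists are invented, and a real design would start from annotation files and account for capture-probe constraints.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (chrom, start, end) intervals."""
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Extend the previous interval on the same chromosome.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(interval) for interval in merged]


# Hypothetical candidate regions from three annotation sources.
regulatory_regions = [("chr1", 100, 200), ("chr1", 150, 300)]
noncoding_rnas = [("chr1", 500, 700)]
protein_binding_sites = [("chr1", 280, 320), ("chr2", 50, 80)]

targets = merge_intervals(regulatory_regions + noncoding_rnas + protein_binding_sites)
total_bp = sum(end - start for _, start, end in targets)

print(targets)
print(f"target space: {total_bp} bp")
```

Tracking the merged size matters because capture cost and sequencing depth both scale with the total target space.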

28

Translational research approaches 2: Transcriptomic approaches

ENCODE has made a major contribution to gene expression studies, by providing high quality annotations of novel noncoding genes through GENCODE.

Microarray studies

• Microarrays are restricted by the catalogue of probes chosen

• Commercial arrays: usually protein coding genes

• MicroRNA arrays available

• Long noncoding RNA arrays available (CRG provide free designs) – based on ENCODE annotations

29

Translational research approaches 2: Transcriptomic approaches

RNAseq

• Unbiased => can discover novel RNAs

• Can quantify expression of known and novel genes, and discover RNA from non “genic” loci

• Requires more bioinformatic analysis

• Still more expensive than arrays
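As an illustration of how RNAseq read counts become comparable expression values, the sketch below computes transcripts per million (TPM), one widely used normalisation. The talk does not say which measure was used, and the gene names, counts and lengths here are invented.

```python
def tpm(read_counts, lengths_bp):
    """Transcripts per million: length-normalise, then scale to sum to 1e6."""
    # Reads per kilobase of transcript.
    rates = {gene: read_counts[gene] / (lengths_bp[gene] / 1000) for gene in read_counts}
    total = sum(rates.values())
    return {gene: rate * 1_000_000 / total for gene, rate in rates.items()}


# Hypothetical counts for an annotated gene, a known lncRNA and a novel locus.
read_counts = {"protein_coding_gene": 1000, "known_lncRNA": 200, "novel_locus": 50}
lengths_bp = {"protein_coding_gene": 2000, "known_lncRNA": 1000, "novel_locus": 500}

expression = tpm(read_counts, lengths_bp)
for gene, value in expression.items():
    print(f"{gene}: {value:.0f} TPM")
```

The same calculation applies to a novel locus as to an annotated gene, provided its transcript length can be estimated from the assembled reads.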

30

Translational research approaches 2: Transcriptomic approaches

Problems:

It is easy to discover and quantify the expression of novel genes

It is difficult to understand the function of such genes

We have no bioinformatic tools to predict the function of most novel ncRNAs

We have limited experimental tools to investigate them

31

What does ENCODE mean for these studies?

GWAS

• GWAS study design will not likely be affected

• ENCODE will allow better interpretation of discovered SNPs

Exome

• Whole genome cohort studies may never be feasible

• Capture sequence approach can be redesigned to study noncoding variants in disease of choice

• ENCODE and other public data will aid in the design of these projects

Gene expression

• New gene annotations can help in both microarray and RNAseq projects to discover novel noncoding gene targets.

• RNAseq will eventually replace arrays as costs drop, but right now new array designs are competitive in large experiments and given bioinformatic requirements

Nothing would have been possible without…

CRG Bioinformatics & Genomics

Roderic Guigó

Bioinformatics and Genomics group

ENCODE / GENCODE

Jennifer Harrow, Tim Hubbard (GENCODE, Sanger)

FUNDING

Ramón y Cajal RYC-2011-08851

Plan Nacional BIO2011-27220

32

33

The main message of ENCODE

To understand genotypes and phenotypes, we must look beyond the protein coding gene.

Further reading:

Interpreting noncoding genetic variation in complex traits and human disease. Lucas D Ward & Manolis Kellis. Nature Biotechnology 30, 1095–1106 (2012)

34

How could variants in noncoding regions cause phenotype?

• By altering the nucleotide sequence recognized by regulatory protein

• By altering a noncoding RNA gene, either in expression levels or mature sequence

Hawkins et al Nature Reviews Genetics 11, 476-486

Haas et al RNA Biol. 2012 Jun;9(6):924-37

Levels of genome regulation

We now appreciate the genome is regulated at multiple levels:

• “Epigenetically” – chromatin structure

• Transcriptionally – RNA production

• Post-transcriptionally – RNA processing (splicing, transport, stability)

• Translationally – protein production at ribosome

• Structurally – the folding structure of the genome

=> All of these are encoded in noncoding DNA sequence

=> These sequences all have effects on phenotype and thus may contribute to disease

35