nuevas perspectivas en análisis genomico: implicaciones del proyecto encode 1 rory johnson...
TRANSCRIPT
1
Nuevas perspectivas en análisis genomico: implicaciones del proyecto ENCODE
Rory Johnson
Bioinformatics and Genomics
Centre for Genomic Regulation
AEEH
21 / 2 / 14
2
This talk:
• Our view of the human genome today thanks to ENCODE
• What it means for translational research
5
Our changing view of the genome
ChromatinHistones,+ modifications
Transcription factors
CAGGCATTAACCTTAGTCCTAATGGTTAGAGTCGTCCCTGATAATCTTAGTGAGGAAGGGACATTTCCAGAGTCGCCCAG CAGCAAATTCCAGATGTCTAAGGTCCCCAAACAGAACAAAATTGCATAAT
This organisation is encoded in non-protein coding genome sequence
Enhancers
6
Genome sequence:
SimpleStatic
Epigenome sequence:
Multi-layeredDynamicCell-specific
=> Hence ENCODE
The Genome and Epigenome
7
The human genome in numbers
• 3 x10^9 base pairs
• 20,345 protein coding genes
• 13,870 Long noncoding RNA genes
• 9013 Small noncoding RNA genes
• 3x10^6 regulatory regions (enhancers)
• 12,460 known trait-associated SNPs (short nucleotide variants)
• 88% of trait-associated SNPs lie outside protein coding sequence
8
Next Generation Sequencing
The high throughput reading of DNA or RNA.
The main system now is Illumina Hiseq
Statistics:Read length: ~150ntReads per lane: ~150 millionLanes per run: 16Total nt per run: ~400 billionCost per run: ~16,000 euro
(Human genome project took 13 years and $3billion to sequence 3 billion nt, ending 2003)
http://www.labome.com/method/RNA-seq-Using-Next-Generation-Sequencing.html
9
NGS based methods for genome analysis: towards the clinic
ChIP-seq (chromatin immunoprecipation)
Transcription factor binding / chromatin state
Dnase-seq Transcription factor binding / chromatin state
RNAseq mRNA transcription / splicing
Ribosome footprinting Translation rate
Hiseq Genome 3D structure
These methods have been demonstrated to be practical for continuous patient monitoring or diagnostics:• Rui et al Cell, Volume 148, Issue 6, 1293-1307, “iPOP”• Buenrostro et al Nat Methods Nature Methods 10, 1213–1218 (2013)
“Using ATAC-seq maps of human CD4+ T cells from a proband obtained on consecutive days, we demonstrated the feasibility of analyzing an individual's epigenome on a timescale compatible with clinical decision-making.”
10
The ENCODE Project
• ENCODE: Encyclopedia of DNA Elements (http://www.genome.gov/10005107)
• International consortium dedicated to comprehensively mapping the human epigenome.
• Created high quality ongoing gene annotations: GENCODE
• 32 laboratories, $400million
• In Spain: Roderic Guigo (CRG) was one of the leaders (with Tom Gingeras, CSHL) of the transcriptomics section.
• 147 cell types (mainly transformed cell lines)
• 1640 genome-wide datasets
11
RNAseq Gene expression
ChIP Chromatin
ChIP Transcription Factors
ChIA-PET Genome structure / folding
GENCODE Gene annotation catalogue
ENCODE integrates multiple data types across cell types
13
ENCODE data of relevance to hepatology
ENCODE Tier 2: HepG2 cell line hepatocellular carcinoma(see http://www.genome.gov/26524238 for other cell types)
Including:8 RNAseq experiments 114 Transcription Factor ChIP experiments (inc CEBPB, HNF4A, HNF4G)http://genome-euro.ucsc.edu/ENCODE/dataMatrix/encodeDataMatrixHuman.html
Genes
Chromatin
TranscriptionFactors
RNA
16
Other projects of relevance: eQTL
• Gtex – Genotype Tissue Expression project
• Hunting for genetic variants that influence gene expression
Linking genetic variants to changes in gene expression – regulatory variants or “expression quantitative trait loci” (eQTL)
These will be different between tissues
17
What does this mean for translational research?
• Protein-focussed studies will miss the majority of functional disease causing variants / mutations
• Non-coding variants will usually be regulatory
• Non-coding variants will usually be cell type specific
• Large projects like ENCODE are producing rich data that can be used to interpret clinical results
`
18
How can genetic variants (SNPs) in noncoding regions cause phenotype?
• By altering the nucleotide sequence recognized by regulatory protein
Hawkins et al Nature Reviews Genetics 11, 476-486
19
Gene Expression DiseaseGenetic Variant (SNP)
How can genetic variants (SNPs) in noncoding regions cause phenotype?
20
How does ENCODE affect translational research projects?
• Genome wide association study (GWAS)
• Exome sequencing
• Gene expression profiling
21
Translational research approaches 1: Genetic approaches
Genomic approaches to identify genetic variants underlying disease:
GWAS – genome wide association study
Exome sequencing – target genome sequencing
Advantages Disadvantages
Genome wide Depends on limited # of marker SNPs
Not biased towards coding regions Low resolution
Good at identifying common variants Does not yield insights into mechanism
Advantages Disadvantages
Proteome wide No information about noncoding variants
Can identify rare causative variants Likely missing most causative variants
Usually yields mechanistic hypothesis
High resolution
22
Interpretation of GWAS results
GWAS gives an unbiased genome wide set of candidate SNPs
The majority of these lie outside protein coding regions
Two main challenges:
1. Identifying the causative SNP
2. Understanding the mechanism of action of that SNP
Li et al PLoS Genet 8(7): e1002791.
Hepatocellular carcinoma
23
Identifying the causative SNP using ENCODE data
Schaub et al Genome Res. 2012 Sep;22(9):1748-59. doi: 10.1101/gr.136127.111.e
Hunt for the likely functional SNP in LD with marker
24
Schaub et al Genome Res. 2012 Sep;22(9):1748-59. doi: 10.1101/gr.136127.111.e
Understanding the mechanism of a noncoding SNP using ENCODE data
26
Exome sequencing
Exome sequencing: targeted genome sequencing of protein coding exons
Relies on capturing a selected subset of genome
Advantages: • lower cost and • higher statistical power• can detect rare private mutations
Disadvantages:• Presently ignoring the noncoding genome
(~99%)
27
Exome sequencing: whats next?
Whole genome sequence not likely to be practical: no statistical power
Exome technology is highly customisable could be adapted to noncoding regions
The main question: what are the target regions?
• How to define the target space? • regulatory regions? • Noncoding RNAs? • Protein binding sites?
• Likely to be organ / disease specific
• Will require bioinformatic analysis to design reagents before experimental project begins.
28
Translational research approaches 2: Transcriptomic approaches
ENCODE has made a major contribution to gene expression studies, by providing high quality annotations of novel noncoding genes through GENCODE.
Microarray studies
• Microarrays are restricted by the catalogue of probes chosen
• Commercial arrays: usually protein coding genes
• MicroRNA arrays available
• Long noncoding RNA arrays available (CRG provide free designs) – based on ENCODE annotations
29
Translational research approaches 2: Transcriptomic approaches
RNAseq
• Unbiased > can discover novel RNAs
• Can quantify expression of known and novel genes, and discover RNA from non “genic” loci
• Analysis requires more bioinformatic analysis
• Still more expensive than arrays
30
Translational research approaches 2: Transcriptomic approaches
Problems:
It is easy to discover and quantify the expression of novel genes
It is difficult to understand the function of such genes
We have no bioinformatic tools to predict the function of most novel ncRNAs
We have limited experimental tools to investigate them
31
What does ENCODE mean for these studies?
GWAS • GWAS study design will not likely be affected• ENCODE will allow better interpretation of discovered
SNPs
Exome • Whole genome cohort studies may never be feasible• Capture sequence approach can be redesigned to
study noncoding variants in disease of choice• ENCODE and other public data will aid in the design of
these projects
Gene expression • New gene annotations can help in both microarray and RNAseq projects to discover novel noncoding gene targets.
• RNAseq will eventually replace arrays as costs drop, but right now new array designs are competitive in large experiments and given bioinformatic requirements
Nothing would have been possible without…
CRG Bioinformatics & Genomics
Roderic Guigó
Bioinformatics and Genomics group
ENCODE / GENCODE
Jennifer Harrow Tim Hubbard(GENCODE, Sanger)
FUNDING
Ramón y Cajal RYC-2011-08851
Plan Nacional BIO2011-27220
32
33
The main message of ENCODE
To understand genotypes and phenotypes, we must look beyond the protein coding gene.
Further reading:
Interpreting noncoding genetic variation in complex traits and human disease•Lucas D Ward & Manolis Kellis•AffiliationsNature Biotechnology 30, 1095–1106 (2012)
34
How could variants in noncoding regions cause phenotype?
• By altering the nucleotide sequence recognized by regulatory protein
• By altering a noncoding RNA gene, either in expression levels or mature sequence
Hawkins et al Nature Reviews Genetics 11, 476-486Haas et al RNA Biol. 2012 Jun;9(6):924-37
Levels of genome regulation
We now appreciate the genome is regulated at multiple levels:
• “Epigenetically” – chromatin structure
• Transcriptionally – RNA production
• Post-transcriptionally – RNA processing (splicing, transport, stability)
• Translationally – protein production at ribosome
• Structurally – the folding structure of the genome
=> These sequences all have effects on phenotype and thus may contribute to disease
=> All of these are encoded in noncoding DNA sequence
35
36
Karczewski KJ et al Proc Natl Acad Sci U S A. 2013 Jun 4;110(23):9607-12
A SNP for breast cancer creates a NFκB binding site
Case study: Studying disease-associated regulatory SNPs incorporating cohort epigenome data