THE NONCODING GENOME & HEALTH
Gerry Higgins1,2, Ari Allyn-Feuer1 and Brian Athey1
1Department of Computational Medicine and Bioinformatics
University of Michigan Medical School, Ann Arbor, MI
2Assurex Health, Inc., Mason, OH
Conflict of interest statement:
Dr. Gerald A. Higgins is an employee of Assurex Health, Inc. (Mason, OH) and holds options in the company. He serves as an Adjunct Research Professor at the University of Michigan (U-M) Medical School. Dr. Higgins has a Conflict of Interest Management Plan on file at the University of Michigan.
Dr. Brian D. Athey is chairman of the Department of Computational Medicine and Bioinformatics, University of Michigan Medical School. He also serves as Chair of the Scientific Advisory Board of Assurex Health, Inc. and holds options in the company. He has performed this work as a University of Michigan Professor. Assurex Health, Inc. and the University of Michigan have established a Master Research Agreement and a Conflict of Interest Management Plan for Dr. Athey.
TOPICS
→ Challenge: The Human Genome Data Tsunami
→ The Noncoding Genome & Health
→ Genome Informatics for In Silico Discovery
Challenge: The Human Genome Data Tsunami
THE HUMAN GENOME
1. Whole genome:
→ ~3.2 billion base pairs (bps)1
2. Whole exome:
→ ~18,000 – 22,500 protein-coding genes2
→ ~1.3 – 1.4% of whole genome2
3. Noncoding ‘regulome’:
→ Tissue-specific, but may be up to 30% of whole genome3
1International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004. 431, 931-945.
2Pertea M and Salzberg SL. Between a chicken and a grape: estimating the number of human genes. Genome Biology. 2010. 11, 206.
3Kellis M, Wold B, Snyder MP et al. Defining functional DNA elements in the human genome. PNAS. 2014. 111:6131–6138.
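As a rough worked example, the percentages on this slide can be converted into base pairs. This is only a sketch: the constant and function names are illustrative, and the figures are the approximate values quoted on the slide (~3.2 billion bp, 1.3 – 1.4% exome, up to 30% regulome).

```python
# Approximate values taken from the slide; names are illustrative only.
GENOME_BP = 3.2e9  # ~3.2 billion base pairs


def fraction_to_bp(fraction, genome_bp=GENOME_BP):
    """Convert a fraction of the genome into a number of base pairs."""
    return fraction * genome_bp


exome_low = fraction_to_bp(0.013)    # 1.3% of the genome
exome_high = fraction_to_bp(0.014)   # 1.4% of the genome
regulome_max = fraction_to_bp(0.30)  # upper estimate of 30%

print(f"Exome: {exome_low / 1e6:.1f} to {exome_high / 1e6:.1f} Mb")
print(f"Regulome (upper estimate): {regulome_max / 1e6:.0f} Mb")
print(f"Regulome is roughly {regulome_max / exome_high:.0f}x to "
      f"{regulome_max / exome_low:.0f}x the size of the exome")
```

On these numbers the noncoding regulome could be on the order of twenty times larger than the exome, which is the scale argument behind the slide.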
NEXT GENERATION SEQUENCING TECHNOLOGY GENERATES A MASSIVE AMOUNT OF DATA: THE SHORT READ ARCHIVE
Technology example
→ 2nd Generation NGS: SBS or degradation
→ 3rd Generation NGS: Many approaches
→ 4th Generation NGS: Direct inspection of the DNA molecule using current blockade and nanopore technology
Resolution
→ 2nd Generation NGS: Averaged across many copies of the DNA molecule being sequenced
→ 3rd Generation NGS: Long sequences corrected with short reads
→ 4th Generation NGS: Single-molecule resolution
Raw read accuracy
→ 2nd Generation NGS: High, with >60-fold coverage
→ 3rd Generation NGS: High, missed variant calls: 1 in 500 kb – 1M bases
→ 4th Generation NGS: Highest, theoretically 99.9999%
Read length
→ 2nd Generation NGS: Short (e.g., 150 bps)
→ 3rd Generation NGS: Long, 1,000 bps
→ 4th Generation NGS: Longest, 5,000 bps or longer
Throughput
→ 2nd Generation NGS: Moderate
→ 3rd Generation NGS: High with correction
→ 4th Generation NGS: High
Current cost
→ 2nd Generation NGS: Moderate cost per base
→ 3rd Generation NGS: Low cost per base
→ 4th Generation NGS: Lowest cost per base
Start-to-Finish
→ 2nd Generation NGS: Days
→ 3rd Generation NGS: Several hours
→ 4th Generation NGS: 1 hour per whole human genome
Sample preparation
→ 2nd Generation NGS: Complex, library and PCR amplification required
→ 3rd Generation NGS: Complex, library and PCR amplification required
→ 4th Generation NGS: Simple
Data analysis
→ 2nd Generation NGS: Complex because of large data volumes and because short reads complicate assembly and alignment algorithms
→ 3rd Generation NGS: Complex because of large data volumes; however, those can be handled by new high-speed camera and chip technologies
→ 4th Generation NGS: Very complex because of signal analysis and noise suppression, very large data volumes
Primary results
→ 2nd Generation NGS: Base calls with quality values
→ 3rd Generation NGS: Base calls with quality values, other information such as structural variants and phased haplotypes
→ 4th Generation NGS: Base calls with quality values, variants, phased haplotypes and epigenetic modifications
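To give a feel for the data volumes behind these read lengths, the sketch below applies the usual average-coverage relation (coverage = number of reads × read length / genome size) to the figures in the comparison. Applying 60-fold coverage to all three read lengths is an assumption made only for comparison, since the slide quotes that depth for 2nd generation sequencing alone, and the function name is illustrative.

```python
# Genome size and read lengths are the approximate slide values.
GENOME_BP = 3_200_000_000


def reads_needed(coverage, read_length_bp, genome_bp=GENOME_BP):
    """Reads required for a given average coverage (ceiling division)."""
    total_bases = coverage * genome_bp
    return -(-total_bases // read_length_bp)


# Assumption: the same 60-fold depth is used for every read length.
for label, length in [("150 bp", 150), ("1,000 bp", 1_000), ("5,000 bp", 5_000)]:
    print(f"{label} reads at 60x: {reads_needed(60, length):,}")

print(f"Total sequenced bases at 60x: {60 * GENOME_BP:,}")
```

Under this assumption the raw base count is the same in every case; what changes with read length is the number of fragments that assembly and alignment must handle.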
Relative Growth of Different Data Populations1
Short read archive: Raw sequence produced by 2nd generation sequencers. WGS: Whole genome sequence. UAVs: Unmanned aerial vehicles (drones).
1Higgins GA and Athey BD. “Emerging DNA Sequencing Technologies and Applications”, in Next-Generation DNA Sequencing Informatics, Second Edition. (2015). Cold Spring Harbor Laboratory Press (Oxford). ISBN 978-1-621821-23-6.
NANOPORE SEQUENCING WILL GREATLY INCREASE THROUGHPUT
1Higgins GA and Athey BD. “Emerging DNA Sequencing Technologies and Applications”, in Next-Generation DNA Sequencing Informatics, Second Edition. (2015). Cold Spring Harbor Laboratory Press (Oxford). ISBN 978-1-621821-23-6.
WHAT CONSTITUTES A REFERENCE HUMAN GENOME?
→ Is this sequence the same for all humans?
→ Is this sequence accurate for any human?
→ Are the coordinates absolute?
INTEGRATION CHALLENGES IN BIOMEDICAL DATA SCIENCE
Ontology & Terminology
Omics
→ Epigenomics
→ Genomics
→ Transcriptomics
→ Proteomics
→ Metabolomics
→ Metagenomics
→ Nutrigenomics
→ Pharmacogenomics
Clinical data
→ Family history
→ Medical history
→ Results of lab tests
→ Procedures
→ Pharmacy
→ Diagnostic codes
→ Age, weight, ethnicity
→ Gender
Environment & Epidemiology
→ Carcinogen exposure
→ Diet, exercise, lifestyle, medication adherence
→ Pathogen exposure
→ Physical & psychological trauma
The Noncoding Genome & Health
THE NONCODING GENOME IS NOT “JUNK DNA”
THE NONCODING TRANSCRIPTOME
Regulatory RNAs that are not ribosomal RNA, mRNA or tRNA:
→ ~60,000 long noncoding RNAs (lncRNAs)1
→ ~65,000 enhancer RNAs (eRNAs)2
→ ~2,600 microRNAs (miRNAs)3
→ ? Number of piwi RNAs (piRNAs)
1Iyer MK et al. The landscape of long noncoding RNAs in the human transcriptome. Nature Genet. (2015). doi:10.1038/ng.3192
2Arner E et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science. (2015). 347 (6225).
3Hammond SM. An overview of microRNAs. Advanced. Drug Delivery Rev. (2015). 87, 3-14.
THE EPIGENOME
Most tissues = CpG islands; mHC
Brain = CpG, CAC; mostly 5mHC
H3K27ac + H3K4me1 = active enhancer
H3K27ac + H3K4me3 = active promoter
SPATIAL DISTRIBUTION OF TRANSCRIPTIONAL DOMAINS IN THE NUCLEUS
HIGH RESOLUTION HI-C SHOWS SPATIAL GENOME CONSISTS OF ENHANCER-PROMOTER LOOPS1
1Rao SSP et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. (2014). doi.org/10.1016/j.cell.2014.11.021
Visualization of transcription provides a validation of loops
GENES MOVE IN SPACE OVER TIME1
→ Time-series Hi-C shows that the CLOCK genes loop together in the 3D chromatin environment with a circadian cycle1;
→ This rhythmic coupling is driven by the glucocorticoid receptor;
→ Disruption of this ‘4D nucleome’ is a symptom of many diseases;
→ Drugs have different effects depending on the time of administration2
1Chen H et al. Functional organization of the human 4D Nucleome. PNAS. (2015). doi: 10.1073/pnas.1505822112
2Zhang R et al. A circadian gene expression atlas in mammals: Implications for biology and medicine. PNAS. (2014). 111, 16219-16224.
EARLY AND/OR CHRONIC STRESS DEGRADE HEALTH
Wiley J et al. Disruption of glucocorticoid receptor signaling leads to inflammatory bowel disease. Gastroenterology. (2015), In press
>60% of GWAS DISEASE SNPs IMPACT ENHANCERS1
1Onengut-Gumuscu S et al. Fine mapping of type 1 diabetes susceptibility loci and evidence for co-localization of causal variants with lymphoid gene enhancers. Nature Genet. (2015). doi:10.1038/ng.3245;
Farh K K-H et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 518, 337-343 (2015). doi:10.1038/nature13835;
Yao L et al. Functional annotation of colon cancer risk SNPs. Nature Comm. 5:5114 (2015) doi: 10.1038/ncomms6114;
Aaltonen L et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nature Genet. (2015). doi:10.1038/ng.3335;
Melton C et al. Recurrent somatic mutations in regulatory regions of human cancer genomes. Nature Genet. (2015). doi:10.1038/ng.3332;
Darabi H et al. Polymorphisms in a putative enhancer at the 10q21.2 breast cancer risk locus regulate NRBF2 expression. Amer. J. Human Genetics. (2015). 97 (1), 22-34.
ONLY 6% OF SNPs IN NEUROPSYCHIATRIC PHARMACOGENOMIC GWAS ARE CODING VARIANTS1
→ Intron, 57%
→ Intergenic, 11%
→ 5'UTR, 16%
→ 3'UTR, 9%
→ Missense coding variant, 3%
→ Synonymous coding variant, 3%
1Higgins GA et al. Epigenomic mapping and effect sizes of noncoding variants associated with psychotropic drug response. Pharmacogenomics J. 2015, in press
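The 6% figure can be checked directly from the category percentages on this slide. The short tally below is only an illustration: the dictionary name is hypothetical and the values are the rounded chart labels, which sum to 99% rather than 100%, presumably because of rounding.

```python
# Rounded percentages as labelled on the slide's chart.
SNP_LOCATIONS = {
    "Intron": 57,
    "Intergenic": 11,
    "5'UTR": 16,
    "3'UTR": 9,
    "Missense coding variant": 3,
    "Synonymous coding variant": 3,
}

coding = sum(pct for name, pct in SNP_LOCATIONS.items()
             if "coding variant" in name)
total = sum(SNP_LOCATIONS.values())
noncoding = total - coding

print(f"Coding: {coding}%")
print(f"Noncoding: {noncoding}%")
print(f"Sum of labelled categories: {total}%")
```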
HEALTH CONSEQUENCES OF NEUROPSYCHIATRIC DRUGS
→ 50% of all drug-related ER visits and subsequent hospitalizations are due to prescribed medications used in psychiatry1;
→ These comprise more than 90,000 ER visits in the U.S. each year1;
→ The vast majority of the FDA’s pharmacogenomic drug label precautions are for medications used in psychiatry and neurology2;
→ The only government-required pharmacogenomic testing is for carbamazepine and lamotrigine in Singapore3.
1Hampton LH et al. Emergency department visits by adults for psychiatric medication adverse events. JAMA Psychiatry. (2014). 71, 1006-1014.
2www.fda.gov/drugs/scienceresearch/researchareas/pharmacogenetics/ucm083378.htm
3Mitropoulos, K. et al. Success stories in genomic medicine from resource-limited countries. Human Genomics. (2015). 9, 1 1-17.
Effect sizes of pharmacoepigenomic regulatory variants
→ Tagging SNPs associated with drug-induced cutaneous injury
→ Addiction SNPs for enhancers
→ Analgesia SNPs
→ Lithium-response SNP that disrupts a promoter
Genome Informatics for In Silico Discovery
IN SILICO DISCOVERY OF THE LITHIUM RESPONSE PATHWAY
Hypothesis [1]: Lithium response SNPs should overlap with SNPs associated with lithium side effects and with risk for bipolar disorder.
[1] For rationale, please see: PMID15694273, PMID21047205, PMID21254218, PMID21781277, PMID22057216, PMID23021822, PMID24108394, PMID24126708, PMID24626773, etc.
23,312 SNPs imputed from GWAS followed by epigenome mapping
Lithium Response; Lithium Side Effects; Bipolar Risk & Psych.
27% Overlap!
Is this a new pathway we discovered?
Send out gene SNPs for “blinded” pathway analysis
SNP            LOCATION
rs1481892      Intron of ARNTL
rs4237700      Intron of ARNTL
rs4756764      Intron of ARNTL
rs10832015     5’ UTR, ARNTL
rs4146387      Intron of ARNTL
rs7107287      Intron of ARNTL
rs2314339      Intron of NR1D1
rs12412727     Intron of ANK3
rs10821745     5’ UTR, ANK3
rs1016388      Intron of CACNA1C
rs2284017      Intron of CACNG2
rs2395655      5’ UTR, CDKN1A
rs6740584      5’ UTR, CREB1
rs78957301     5’ UTR, GRIA2
rs17035898     Intron of GRIA2
rs334558       5’ UTR, GSK3B
rs111885243    5’ UTR, SLC1A2
rs12418812     Intron of SLC1A2
rs4354668      5’ UTR, SLC1A2
19 SNPs IN 10 GENES FILTERED FROM 23,312 VARIANTS
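The counts in this heading can be reproduced from the SNP table itself. The snippet below is a minimal tally with the rows transcribed from the slide; the variable names and the tuple layout are illustrative choices, not part of the original analysis.

```python
from collections import Counter

# (rsID, region, gene) rows transcribed from the slide's SNP table.
SNPS = [
    ("rs1481892", "intron", "ARNTL"),
    ("rs4237700", "intron", "ARNTL"),
    ("rs4756764", "intron", "ARNTL"),
    ("rs10832015", "5' UTR", "ARNTL"),
    ("rs4146387", "intron", "ARNTL"),
    ("rs7107287", "intron", "ARNTL"),
    ("rs2314339", "intron", "NR1D1"),
    ("rs12412727", "intron", "ANK3"),
    ("rs10821745", "5' UTR", "ANK3"),
    ("rs1016388", "intron", "CACNA1C"),
    ("rs2284017", "intron", "CACNG2"),
    ("rs2395655", "5' UTR", "CDKN1A"),
    ("rs6740584", "5' UTR", "CREB1"),
    ("rs78957301", "5' UTR", "GRIA2"),
    ("rs17035898", "intron", "GRIA2"),
    ("rs334558", "5' UTR", "GSK3B"),
    ("rs111885243", "5' UTR", "SLC1A2"),
    ("rs12418812", "intron", "SLC1A2"),
    ("rs4354668", "5' UTR", "SLC1A2"),
]

genes = Counter(gene for _, _, gene in SNPS)
regions = Counter(region for _, region, _ in SNPS)

print(f"{len(SNPS)} SNPs in {len(genes)} genes")
for gene, count in genes.most_common():
    print(f"  {gene}: {count}")
print(f"Intronic: {regions['intron']}, 5' UTR: {regions[chr(39).join(['5', ' UTR'])]}")
```

All 19 variants fall in introns or 5’ UTRs, so none of them changes a protein sequence.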
GENE SET ENRICHMENT: all are contained within a single CNS network!
Figure 2. Pathway analysis from IPA™ using the 10 genes shown in color. The software called an integrated glutamatergic pathway, which is very significantly enriched for the “glutamate receptor” at p < 10^-27 (Fisher’s exact test).
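For readers unfamiliar with how an enrichment p-value of this kind arises, here is a minimal sketch of a one-sided Fisher's exact test, computed as a hypergeometric upper tail. It is not the IPA™ implementation, and it does not reproduce the p-value in the figure: the function name is illustrative and the example inputs (a universe of 20,000 genes and a category of 100 genes) are hypothetical placeholders.

```python
from math import comb


def enrichment_p_value(universe, category, selected, hits):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    universe: total number of genes considered
    category: genes in the universe annotated to the category
    selected: size of the gene list being tested
    hits:     genes in the list that belong to the category
    Returns P(X >= hits) when the list is drawn at random without replacement.
    """
    total = comb(universe, selected)
    upper = min(category, selected)
    tail = sum(
        comb(category, i) * comb(universe - category, selected - i)
        for i in range(hits, upper + 1)
    )
    return tail / total


# Hypothetical inputs for illustration only; not the values used by IPA.
p = enrichment_p_value(universe=20_000, category=100, selected=10, hits=10)
print(f"Illustrative enrichment p-value: {p:.2e}")
```

The point of the sketch is the mechanism: when every gene in a short list lands in a small category, the chance of that happening by random draw becomes extremely small.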
→ Data-driven discovery of the human epigenome revealed SNPs that may be of utility for stratification of lithium response;
→ These pharmacoepigenomic variants are all associated with a single regulatory network in human brain that includes enhancers and promoters;
→ Based on the experimental design used in this in silico method, the probability that this result is erroneous is vanishingly small;
→ An up-to-date understanding of the regulatory architecture of the human genome provides the best foundation for developing the science that will support the next generation of pharmacogenomic tests.
REQUIRES DEMONSTRATION OF CLINICAL UTILITY
Lots of venues for publishing peer-reviewed manuscripts…and they are growing!!!