noncoding genome 7 23_15_final_upload

THE NONCODING GENOME & HEALTH

Gerry Higgins1,2, Ari Allyn-Feuer1 and Brian Athey1

1Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI

2Assurex Health, Inc., Mason, OH

Conflict of interest statement:

Dr. Gerald A. Higgins is an employee of Assurex Health, Inc. (Mason, OH) and holds options in the company. He serves as an Adjunct Research Professor at the University of Michigan (U-M) Medical School. Dr. Higgins has a Conflict of Interest Management Plan on file at the University of Michigan.

Dr. Brian D. Athey is chairman of the Department of Computational Medicine and Bioinformatics, University of Michigan Medical School. He also serves as Chair of the Scientific Advisory Board of Assurex Health, Inc. and holds options in the company. He has performed this work as a University of Michigan Professor. Assurex Health, Inc. and the University of Michigan have established a Master Research Agreement and a Conflict of Interest Management Plan for Dr. Athey.

TOPICS

→ Challenge: The Human Genome Data Tsunami

→ The Noncoding Genome & Health

→ Genome Informatics for In Silico Discovery

Challenge: The Human Genome Data Tsunami

THE HUMAN GENOME

1. Whole genome:

→ ~3.2 billion base pairs (bps)1

2. Whole exome:

→ ~18,000 – 22,500 protein-coding genes2

→ ~1.3 – 1.4% of whole genome2

3. Noncoding ‘regulome’:

→ Tissue-specific, but may be up to 30% of whole genome3

1International Human Genome Sequencing Consortium. Finishing the euchromatic sequence of the human genome. Nature. 2004. 431, 931-945.

2Pertea M and Salzberg SL. Between a chicken and a grape: estimating the number of human genes. Genome Biology. 2010. 11, 206.

3Kellis M, Wold B, Snyder MP et al. Defining functional DNA elements in the human genome. PNAS. 2014. 111:6131–6138.
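To make the proportions above concrete, the following minimal sketch converts the quoted percentages into approximate base-pair counts. It assumes the ~3.2 billion bp genome size from the slide; the function name `span_bp` is illustrative and not part of the original material.

```python
# Approximate whole-genome size quoted on the slide (base pairs).
GENOME_BP = 3.2e9


def span_bp(fraction, genome_bp=GENOME_BP):
    """Return the number of base pairs covered by a fraction of the genome."""
    return fraction * genome_bp


# Whole exome: ~1.3 to 1.4% of the genome.
exome_low = span_bp(0.013)
exome_high = span_bp(0.014)

# Noncoding 'regulome': up to ~30% of the genome (tissue-specific upper bound).
regulome_max = span_bp(0.30)

print(f"Exome: {exome_low / 1e6:.1f} to {exome_high / 1e6:.1f} Mbp")
print(f"Regulome upper bound: {regulome_max / 1e6:.0f} Mbp")
```

Under these assumptions the exome spans roughly 42 to 45 million bp, while the regulome could cover up to about 960 million bp, which is why regulatory sequence dominates the search space for functional variants.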

NEXT GENERATION SEQUENCING TECHNOLOGY GENERATES A MASSIVE AMOUNT OF DATA: THE SHORT READ ARCHIVE

Technology example
→ 2nd Generation NGS: Sequencing by synthesis (SBS) or degradation
→ 3rd Generation NGS: Many approaches
→ 4th Generation NGS: Direct inspection of the DNA molecule using current blockade and nanopore technology

Resolution
→ 2nd Generation NGS: Averaged across many copies of the DNA molecule being sequenced
→ 3rd Generation NGS: Long sequences corrected with short reads
→ 4th Generation NGS: Single-molecule resolution

Raw read accuracy
→ 2nd Generation NGS: High, with >60-fold coverage
→ 3rd Generation NGS: High; missed variant calls: 1 in 500 kb – 1M bases
→ 4th Generation NGS: Highest; theoretically 99.9999%

Read length
→ 2nd Generation NGS: Short (e.g., 150 bps)
→ 3rd Generation NGS: Long, 1,000 bps
→ 4th Generation NGS: Longest, 5,000 bps or longer

Throughput
→ 2nd Generation NGS: Moderate
→ 3rd Generation NGS: High with correction
→ 4th Generation NGS: High

Current cost
→ 2nd Generation NGS: Moderate cost per base
→ 3rd Generation NGS: Low cost per base
→ 4th Generation NGS: Lowest cost per base

Start-to-finish
→ 2nd Generation NGS: Days
→ 3rd Generation NGS: Several hours
→ 4th Generation NGS: 1 hour per whole human genome

Sample preparation
→ 2nd Generation NGS: Complex; library and PCR amplification required
→ 3rd Generation NGS: Complex; library and PCR amplification required
→ 4th Generation NGS: Simple

Data analysis
→ 2nd Generation NGS: Complex because of large data volumes and because short reads complicate assembly and alignment algorithms
→ 3rd Generation NGS: Complex because of large data volumes; however, those can be addressed by new high-speed camera and chip technologies
→ 4th Generation NGS: Very complex because of signal analysis and noise suppression, and very large data volumes

Primary results
→ 2nd Generation NGS: Base calls with quality values
→ 3rd Generation NGS: Base calls with quality values, plus other information such as structural variants and phased haplotypes
→ 4th Generation NGS: Base calls with quality values, variants, phased haplotypes and epigenetic modifications
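The data volumes mentioned for 2nd generation NGS follow from the standard coverage relation, coverage = (number of reads × read length) / genome size. The sketch below applies that relation to figures quoted in this deck (60-fold coverage, 150 bp reads, ~3.2 billion bp genome). The calculation itself is not shown in the original slides, and the function names are illustrative.

```python
def reads_needed(coverage, genome_bp, read_length_bp):
    """Number of reads required so that reads * read_length / genome_bp >= coverage.

    Uses integer ceiling division to avoid floating-point rounding.
    """
    total_bases = coverage * genome_bp
    return (total_bases + read_length_bp - 1) // read_length_bp


def raw_bases(coverage, genome_bp):
    """Total number of sequenced bases at the given fold coverage."""
    return coverage * genome_bp


GENOME_BP = 3_200_000_000   # ~3.2 billion bp, as quoted earlier in the deck
READ_LENGTH_BP = 150        # example short-read length from the comparison
COVERAGE = 60               # fold coverage quoted for high raw read accuracy

n_reads = reads_needed(COVERAGE, GENOME_BP, READ_LENGTH_BP)
n_bases = raw_bases(COVERAGE, GENOME_BP)

print(f"Reads needed: {n_reads:,}")
print(f"Raw bases sequenced: {n_bases:,}")
```

With these inputs a single human genome requires on the order of a billion short reads, which illustrates why raw read archives grow so quickly.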

Relative Growth of Different Data Populations1

Short read archive: Raw sequence produced by 2nd generation sequencers.

WGS: Whole genome sequence.

UAVs: Unmanned aerial vehicles (drones).

1Higgins GA and Athey BD. “Emerging DNA Sequencing Technologies and Applications”, in Next-Generation DNA Sequencing Informatics, Second Edition. (2015). Cold Spring Harbor Laboratory Press (Oxford). ISBN 978-1-621821-23-6.

NANOPORE SEQUENCING WILL GREATLY INCREASE THROUGHPUT

1Higgins GA and Athey BD. “Emerging DNA Sequencing Technologies and Applications”, in Next-Generation DNA Sequencing Informatics, Second Edition. (2015). Cold Spring Harbor Laboratory Press (Oxford). ISBN 978-1-621821-23-6.

WHAT CONSTITUTES A REFERENCE HUMAN GENOME?

→ Is this sequence the same for all humans?

→ Is this sequence accurate for any human?

→ Are the coordinates absolute?

INTEGRATION CHALLENGES IN BIOMEDICAL DATA SCIENCE

Ontology & Terminology

Omics

→ Epigenomics
→ Genomics
→ Transcriptomics
→ Proteomics
→ Metabolomics
→ Metagenomics
→ Nutrigenomics
→ Pharmacogenomics

Clinical data

→ Family history
→ Medical history
→ Results of lab tests
→ Procedures
→ Pharmacy
→ Diagnostic codes
→ Age, weight, ethnicity
→ Gender

Environment & Epidemiology

→ Carcinogen exposure
→ Diet, exercise, lifestyle, medication adherence
→ Pathogen exposure
→ Physical & psychological trauma

The Noncoding Genome & Health

THE NONCODING GENOME IS NOT “JUNK DNA”

THE NONCODING TRANSCRIPTOME

Regulatory RNAs that are not ribosomal RNA, mRNA or tRNA:

→ ~60,000 long noncoding RNAs (lncRNAs)1

→ ~65,000 enhancer RNAs (eRNAs)2

→ ~2,600 microRNAs (miRNAs)3

→ ? Number of piwi RNAs (piRNAs)

1Iyer MK et al. The landscape of long noncoding RNAs in the human transcriptome. Nature Genet. (2015). doi:10.1038/ng.3192

2Arner E et al. Transcribed enhancers lead waves of coordinated transcription in transitioning mammalian cells. Science. (2015). 347 (6225).

3Hammond SM. An overview of microRNAs. Advanced. Drug Delivery Rev. (2015). 87, 3-14.

THE EPIGENOME

Most tissues = CpG islands; mHC

Brain = CpG, CAC; mostly 5mHC

H3K27ac + H3K4me1 = active enhancer

H3K27ac + H3K4me3 = active promoter

SPATIAL DISTRIBUTION OF TRANSCRIPTIONAL DOMAINS IN THE NUCLEUS

HIGH RESOLUTION HI-C SHOWS SPATIAL GENOME CONSISTS OF ENHANCER-PROMOTER LOOPS1

1Rao SSP et al. A 3D map of the human genome at kilobase resolution reveals principles of chromatin looping. Cell. (2014). doi.org/10.1016/j.cell.2014.11.021

Visualization of transcription provides a validation of loops

GENES MOVE IN SPACE OVER TIME1

→ Time-series Hi-C shows that the CLOCK genes loop together in the 3D chromatin environment with a circadian cycle1;

→ This rhythmic coupling is driven by the glucocorticoid receptor;

→ Disruption of this ‘4D nucleome’ is a symptom of many diseases;

→ Drugs have different effects depending on the time of administration2

1Chen H et al. Functional organization of the human 4D Nucleome. PNAS. (2015). doi: 10.1073/pnas.1505822112

2Zhang R et al. A circadian gene expression atlas in mammals: Implications for biology and medicine. PNAS. (2014). 111, 16219-16224.

EARLY AND/OR CHRONIC STRESS DEGRADE HEALTH

Wiley J et al. Disruption of glucocorticoid receptor signaling leads to inflammatory bowel disease. Gastroenterology. (2015), In press

>60% of GWAS DISEASE SNPs IMPACT ENHANCERS1

1Onengut-Gumuscu S et al. Fine mapping of type 1 diabetes susceptibility loci and evidence for co-localization of causal variants with lymphoid gene enhancers. Nature Genet. (2015). doi:10.1038/ng.3245;

Farh K K-H et al. Genetic and epigenetic fine mapping of causal autoimmune disease variants. Nature. 518, 337-343 (2015). doi:10.1038/nature13835;

Yao L et al. Functional annotation of colon cancer risk SNPs. Nature Comm. 5:5114 (2015) doi: 10.1038/ncomms6114;

Aaltonen L et al. CTCF/cohesin-binding sites are frequently mutated in cancer. Nature Genet. (2015). doi:10.1038/ng.3335;

Melton C et al. Recurrent somatic mutations in regulatory regions of human cancer genomes. Nature Genet. (2015). doi:10.1038/ng.3332;

Darabi H et al. Polymorphisms in a putative enhancer at the 10q21.2 breast cancer risk locus regulate NRBF2 expression. Amer. J. Human Genetics. (2015). 97 (1), 22-34.

ONLY 6% OF SNPs IN NEUROPSYCHIATRIC PHARMACOGENOMIC GWAS ARE CODING VARIANTS1

→ Intron, 57%
→ Intergenic, 11%
→ 5' UTR, 16%
→ 3' UTR, 9%
→ Missense coding variant, 3%
→ Synonymous coding variant, 3%

1Higgins GA et al. Epigenomic mapping and effect sizes of noncoding variants associated with psychotropic drug response. Pharmacogenomics J. 2015, in press
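The 6% headline figure can be checked directly against the category breakdown. The short sketch below sums the percentages as they appear on the slide. The dictionary name is illustrative, and grouping the categories by whether their label contains "coding" is a convenience of this example, not the authors' method.

```python
# Percentages of SNPs per genomic category, as listed on the slide.
snp_categories = {
    "Intron": 57,
    "Intergenic": 11,
    "5' UTR": 16,
    "3' UTR": 9,
    "Missense coding variant": 3,
    "Synonymous coding variant": 3,
}

# Coding categories are the two labelled as coding variants.
coding_pct = sum(pct for name, pct in snp_categories.items() if "coding" in name)
noncoding_pct = sum(pct for name, pct in snp_categories.items() if "coding" not in name)
total_pct = coding_pct + noncoding_pct

print(f"Coding: {coding_pct}%  Noncoding: {noncoding_pct}%  Total listed: {total_pct}%")
```

The listed values sum to 99% rather than 100%, presumably because each category was rounded to a whole percent on the slide.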

HEALTH CONSEQUENCES OF NEUROPSYCHIATRIC DRUGS

→ 50% of all drug-related ER visits and subsequent hospitalizations are due to prescribed medications used in psychiatry1;

→ These comprise more than 90,000 ER visits in the U.S. each year1;

→ The vast majority of the FDA’s pharmacogenomic drug label precautions are for medications used in psychiatry and neurology2;

→ The only government-required pharmacogenomic testing is for carbamazepine and lamotrigine in Singapore3.

1Hampton LH et al. Emergency department visits by adults for psychiatric medication adverse events. JAMA Psychiatry. (2014). 71, 1006-1014.

2www.fda.gov/drugs/scienceresearch/researchareas/pharmacogenetics/ucm083378.htm

3Mitropoulos, K. et al. Success stories in genomic medicine from resource-limited countries. Human Genomics. (2015). 9, 1 1-17.

Effect sizes of pharmacoepigenomic regulatory variants

→ Tagging SNPs associated with drug-induced cutaneous injury
→ Addiction SNPs for enhancers
→ Analgesia SNPs
→ Lithium-response SNP that disrupts a promoter

Genome Informatics for In Silico Discovery

IN SILICO DISCOVERY OF THE LITHIUM RESPONSE PATHWAY

Hypothesis [1]: Lithium-response SNPs overlap with SNPs associated with drug side effects and with risk for bipolar disorder.

[1] For rationale, please see: PMID15694273, PMID21047205, PMID21254218, PMID21781277, PMID22057216, PMID23021822, PMID24108394, PMID24126708, PMID24626773, etc.

23,312 SNPs imputed from GWAS followed by epigenome mapping

Lithium Response; Lithium Side Effects; Bipolar Risk & Psych.

27% Overlap!
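The slides do not state how the 27% overlap was calculated. One plausible way to quantify overlap among the three SNP sets is the share of all distinct SNPs that appear in every set (intersection over union). The sketch below shows that measure under this assumption, with placeholder identifiers rather than real rsIDs.

```python
def overlap_fraction(*snp_sets):
    """Fraction of all distinct SNPs that appear in every set (intersection / union)."""
    sets = [set(s) for s in snp_sets]
    union = set().union(*sets)
    if not union:
        return 0.0
    shared = set.intersection(*sets)
    return len(shared) / len(union)


# Placeholder identifiers for illustration only; these are not real SNPs.
lithium_response = {"snp_a", "snp_b", "snp_c"}
lithium_side_effects = {"snp_b", "snp_c", "snp_d"}
bipolar_risk = {"snp_c", "snp_e"}

fraction = overlap_fraction(lithium_response, lithium_side_effects, bipolar_risk)
print(f"Overlap: {fraction:.0%}")
```

Other definitions, such as the shared fraction relative to the smallest set, would give different percentages, so the definition used should be reported alongside the figure.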

Is this a new pathway we discovered?

Send out gene SNPs for “blinded” pathway analysis

SNP            LOCATION
rs1481892      Intron of ARNTL
rs4237700      Intron of ARNTL
rs4756764      Intron of ARNTL
rs10832015     5’ UTR, ARNTL
rs4146387      Intron of ARNTL
rs7107287      Intron of ARNTL
rs2314339      Intron of NR1D1
rs12412727     Intron of ANK3
rs10821745     5’ UTR, ANK3
rs1016388      Intron of CACNA1C
rs2284017      Intron of CACNG2
rs2395655      5’ UTR, CDKN1A
rs6740584      5’ UTR, CREB1
rs78957301     5’ UTR, GRIA2
rs17035898     Intron of GRIA2
rs334558       5’ UTR, GSK3B
rs111885243    5’ UTR, SLC1A2
rs12418812     Intron of SLC1A2
rs4354668      5’ UTR, SLC1A2

19 SNPs IN 10 GENES FILTERED FROM 23,312 VARIANTS
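The counts in the heading can be reproduced by tallying the SNP table. The sketch below transcribes the table into a list of (SNP, region, gene) tuples and counts SNPs per gene and per region. The variable names are illustrative, and the data come only from the slide.

```python
from collections import Counter

# (rsID, region, gene) transcribed from the SNP table on the slide.
SNP_TABLE = [
    ("rs1481892", "Intron", "ARNTL"),
    ("rs4237700", "Intron", "ARNTL"),
    ("rs4756764", "Intron", "ARNTL"),
    ("rs10832015", "5' UTR", "ARNTL"),
    ("rs4146387", "Intron", "ARNTL"),
    ("rs7107287", "Intron", "ARNTL"),
    ("rs2314339", "Intron", "NR1D1"),
    ("rs12412727", "Intron", "ANK3"),
    ("rs10821745", "5' UTR", "ANK3"),
    ("rs1016388", "Intron", "CACNA1C"),
    ("rs2284017", "Intron", "CACNG2"),
    ("rs2395655", "5' UTR", "CDKN1A"),
    ("rs6740584", "5' UTR", "CREB1"),
    ("rs78957301", "5' UTR", "GRIA2"),
    ("rs17035898", "Intron", "GRIA2"),
    ("rs334558", "5' UTR", "GSK3B"),
    ("rs111885243", "5' UTR", "SLC1A2"),
    ("rs12418812", "Intron", "SLC1A2"),
    ("rs4354668", "5' UTR", "SLC1A2"),
]

# Tally SNPs by gene and by genomic region.
per_gene = Counter(gene for _, _, gene in SNP_TABLE)
per_region = Counter(region for _, region, _ in SNP_TABLE)

print(f"{len(SNP_TABLE)} SNPs in {len(per_gene)} genes")
for gene, count in per_gene.most_common():
    print(f"  {gene}: {count}")
print(dict(per_region))
```

The tally confirms 19 SNPs across 10 genes, all of them in introns or 5' UTRs, with ARNTL carrying the most variants.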

GENE SET ENRICHMENT: all are contained within a single CNS network!

Figure 2. Pathway analysis from IPA™ using the 10 genes shown in color. The software calls this an integrated glutamatergic pathway. The pathway is very significantly enriched for the “glutamate receptor” at p < 10^-27 (Fisher’s exact test).
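For readers unfamiliar with how a pathway enrichment p-value arises, the sketch below implements the textbook right-tailed Fisher's exact test as a hypergeometric tail probability. How IPA computes its statistic internally is not described in the slides, so this is an illustration of the general technique. The function name and all counts in the example are hypothetical, not the values behind the reported p < 10^-27.

```python
from math import comb


def enrichment_p_value(population, annotated, drawn, hits):
    """Right-tailed Fisher's exact test for gene set enrichment.

    population: total number of genes considered
    annotated:  genes in the population that belong to the pathway
    drawn:      size of the query gene list
    hits:       query genes that belong to the pathway
    Returns P(X >= hits) under the hypergeometric null distribution.
    """
    max_hits = min(annotated, drawn)
    total_ways = comb(population, drawn)
    tail_ways = sum(
        comb(annotated, k) * comb(population - annotated, drawn - k)
        for k in range(hits, max_hits + 1)
    )
    return tail_ways / total_ways


# Hypothetical toy counts chosen for illustration only.
p = enrichment_p_value(population=20, annotated=5, drawn=5, hits=3)
print(f"p = {p:.4f}")
```

With realistic inputs (tens of thousands of background genes and a small pathway), a gene list that falls entirely within the pathway yields extremely small tail probabilities, which is the situation the figure describes.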

→ Data-driven discovery of the human epigenome revealed SNPs that may be of utility for stratification of lithium response;

→ These pharmacoepigenomic variants are all associated with a single regulatory network in human brain that includes enhancers and promoters;

→ Based on the experimental design used in this in silico method, the probability that this result is erroneous is vanishingly small;

→ An up-to-date understanding of the regulatory architecture of the human genome provides the best foundation for developing the science that will support the next generation of pharmacogenomic tests.

REQUIRES DEMONSTRATION OF CLINICAL UTILITY

Lots of venues for publishing peer-reviewed manuscripts…and they are growing!!!
