ashg sedlazeck grc_share

Structural Variation Characterization Across the Human Genome and Populations Fritz Sedlazeck October, 17, 2017

Upload: genome-reference-consortium

Post on 22-Jan-2018


TRANSCRIPT

Page 1: Ashg sedlazeck grc_share

Structural Variation Characterization Across the Human Genome and Populations

Fritz Sedlazeck

October 17, 2017

Page 2: Ashg sedlazeck grc_share

Scientific interests

Detection of Variants:
- Sniffles (in bioRxiv)
- SURVIVOR, Jeffares et al. (2017)
- BOD-Score, Sedlazeck et al. (2013)

Mapping/Assembly reads:
- NextGenMap-LR (in bioRxiv)
- Falcon Unzip, Chin et al. (2016)
- NextGenMap, Sedlazeck et al. (2013)

Benchmarking/Biases:
- DangerTrack, Dolgalev et al. (2017)
- Teaser, Smolka et al. (2015)
- Sequencing, Jünemann et al. (2013)

Applications

Model organisms:
- Cancer (SKBR3) (in bioRxiv)
- miRNA editing (Vesely et al. 2012)

Non-model organisms:
- Cottus transposons (Dennenmoser et al. 2017)
- Clunio (Kaiser et al. 2016)
- Seabass (Vij et al. 2016)
- Pineapple (Ming et al. 2015)


Page 3: Ashg sedlazeck grc_share

Structural Variations

• Genomic Disorders
• Evolution
• Impact on regulation
• Impact on phenotypes

[Figure: axes are Regulatory State by Cell Line; scale: # affected. Regulatory states cover CTCF binding sites, enhancers, open chromatin regions, promoters, promoter flanking regions, and TF binding sites, each labeled ACTIVE, INACTIVE, POISED, REPRESSED, or (where present) NA. Cell lines and tissues run alphabetically from A549 and Aorta to Thymus.]

Page 4: Ashg sedlazeck grc_share

Diploid genome

• Impact on Regulation

• Variability of genes

• Need to understand the full structure

Page 5: Ashg sedlazeck grc_share

Challenges: Pursuing the diploid genome

1. Accurate prediction of SVs

2. Comparison of SVs

3. Annotation and interpretation of SVs

4. Population analysis

5. Diploid Genome

Layer et al. (2014)

Page 6: Ashg sedlazeck grc_share

1.1 How to detect Structural Variations (SVs)

Page 7: Ashg sedlazeck grc_share

1.1 Long Read Technologies

• (+) SVs in repetitive regions
• (+) Span SVs
• (+) Uniform coverage
• (+) Can identify more complex SVs
• (-) Higher seq. error rate
• (-) Hard to align

Page 8: Ashg sedlazeck grc_share

1.1 Accurate mapping and SV calling

NextGenMap-LR (NGMLR):
• Long read mapper
• Convex gap costs
• Faster than BWA-MEM

Sniffles:
• SV caller for long reads
• All types of SVs
• Phasing of SVs
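To illustrate what "convex gap costs" means in practice, here is a minimal sketch comparing a standard affine gap penalty with one where the per-base extension cost decays as the gap grows. The function names and all parameter values (open, extend, floor, decay) are illustrative assumptions chosen for this example, not values taken from the slide or from NGMLR's source code.

```python
def affine_gap_cost(length, gap_open=5.0, gap_extend=5.0):
    """Affine model: every additional gap base costs the same amount."""
    if length <= 0:
        return 0.0
    return gap_open + gap_extend * length


def convex_gap_cost(length, gap_open=5.0, extend_max=5.0,
                    extend_min=1.0, decay=0.15):
    """Decaying-extension model: each further gap base is cheaper than the
    previous one, down to a floor of extend_min (all values are assumed)."""
    if length <= 0:
        return 0.0
    cost = gap_open
    for i in range(length):
        cost += max(extend_min, extend_max - decay * i)
    return cost


if __name__ == "__main__":
    # Short gaps cost about the same in both models; long gaps diverge.
    for n in (1, 10, 100, 1000):
        print(n, affine_gap_cost(n), round(convex_gap_cost(n), 2))
```

Under a model like this, one long gap (for example a large deletion or insertion spanned by a long read) is penalized far less than many scattered short gaps of the same total length, which is the behavior that helps an aligner keep a read intact across an SV.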

Page 9: Ashg sedlazeck grc_share

1.2 NA12878: SV calling

Tech.                     Coverage   Avg read len (bp)   Method                SVs      TRA
PacBio                    55x        4,334               Sniffles              22,877   119
Oxford Nanopore @Baylor   34x        4,982               Sniffles              12,596   46
Illumina                  50x        2 x 101             Manta, Delly, Lumpy   7,275    2,247

Sedlazeck et al. (2017)

Page 10: Ashg sedlazeck grc_share

1.1 NA12878: SV calling

Tech.                     Coverage   Avg read len (bp)   Method                SVs      TRA     DEL     INS
PacBio                    55x        4,334               Sniffles              22,877   119     9,933   12,052
Oxford Nanopore @Baylor   34x        4,982               Sniffles              12,596   46      7,102   5,166
Illumina                  50x        2 x 101             Manta, Delly, Lumpy   7,275    2,247   3,744   0

Sedlazeck et al. (2017)
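The contrast between technologies is easier to see as fractions of each call set. The short sketch below recomputes the share of TRA, DEL, and INS calls from the counts on this slide; the dictionary layout and the helper name `type_shares` are assumptions made for the example.

```python
# Counts copied from the NA12878 table on this slide.
calls = {
    "PacBio": {"SVs": 22877, "TRA": 119, "DEL": 9933, "INS": 12052},
    "Oxford Nanopore": {"SVs": 12596, "TRA": 46, "DEL": 7102, "INS": 5166},
    "Illumina": {"SVs": 7275, "TRA": 2247, "DEL": 3744, "INS": 0},
}


def type_shares(row):
    """Fraction of all SV calls that fall into each listed type."""
    total = row["SVs"]
    return {svtype: row[svtype] / total for svtype in ("TRA", "DEL", "INS")}


shares = {tech: type_shares(row) for tech, row in calls.items()}

if __name__ == "__main__":
    for tech, s in shares.items():
        print(tech, {k: f"{100 * v:.1f}%" for k, v in s.items()})
```

On these numbers, translocations make up well under 1% of the long-read call sets but close to a third of the Illumina call set, while the Illumina callers report no insertions at all.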

Page 11: Ashg sedlazeck grc_share

1.1 NA12878: check 2,247 vs 119 TRA

Illumina data

Translocation:

PacBio data

ONT data

Truncated reads:

Insertion In rep. region

Overlap          Illumina TRA (%)
Insertions       53.05
Deletions        12.06
Duplications     0.57
Nested           0.31
High coverage    1.87
Low complexity   9.79
Explained        77.65

Sedlazeck et al. (2017)
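As a quick consistency check, the per-category overlaps on this slide should add up to the "Explained" row. The sketch below verifies that and converts the percentage into an approximate number of Illumina TRA calls; the variable names are assumptions for this example.

```python
import math

# Percent of Illumina translocation calls overlapping each category (from the slide).
overlap_pct = {
    "Insertions": 53.05,
    "Deletions": 12.06,
    "Duplications": 0.57,
    "Nested": 0.31,
    "High coverage": 1.87,
    "Low complexity": 9.79,
}
reported_explained = 77.65
illumina_tra_calls = 2247

total_explained = sum(overlap_pct.values())
# Approximate count of Illumina TRA calls accounted for by the categories above.
approx_explained_calls = round(illumina_tra_calls * total_explained / 100)

if __name__ == "__main__":
    print(f"sum of categories: {total_explained:.2f}%")
    print("matches reported:", math.isclose(total_explained, reported_explained, abs_tol=1e-9))
    print("approx. calls explained:", approx_explained_calls)
```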

Page 12: Ashg sedlazeck grc_share

1.1 NA12878: check 2,247 vs 119 TRA

ONT data

PacBio data

Illumina data

Insertion In rep. region

Inversion:

Translocation:

Truncated reads:

Insertion In rep. region

Sedlazeck et al. (2017)

Page 13: Ashg sedlazeck grc_share

1.2 More complex SVs

Inverted tandem duplication:
• Pelizaeus-Merzbacher disease
• MECP2
• VIPR2

Sedlazeck et al. (2017)

PacBio data

Illumina data

Page 14: Ashg sedlazeck grc_share

1.2 More complex SVs

Inversion flanked by deletions:
• Haemophilia A
• Only found via long-range PCR! (2007)

Sedlazeck et al. (2017)

Illumina data

PacBio data

Page 15: Ashg sedlazeck grc_share

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs

3. Annotation and interpretation of SVs

4. Population analysis

5. Diploid Genome

Layer et al. (2014)

Page 16: Ashg sedlazeck grc_share

2. Comparison of SVs

SURVIVOR Framework:
• Compare SVs
  • GiaB: 95 VCF files: 1 minute
• Simulate SVs
• Simulate long reads
• Summarize SV results

Jeffares et al. (2017)

New SVs

Observed SVs
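Comparing SV call sets comes down to deciding when two calls describe the same event. The following is a simplified sketch of one common approach: cluster calls of the same type on the same chromosome whose breakpoints lie within a distance threshold, then keep clusters supported by a minimum number of callers. It is not SURVIVOR's actual algorithm; the `SVCall` type, the `merge_calls` function, and the default thresholds are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SVCall:
    caller: str   # name of the tool or VCF the call came from
    chrom: str
    start: int
    end: int
    svtype: str   # e.g. "DEL", "INS", "INV"


def merge_calls(calls, max_dist=1000, min_callers=2):
    """Greedily cluster calls with matching chromosome and type whose start
    and end breakpoints are both within max_dist of the cluster's first call.
    Return only clusters supported by at least min_callers distinct callers."""
    clusters = []
    ordered = sorted(calls, key=lambda c: (c.chrom, c.svtype, c.start, c.end))
    for call in ordered:
        for cluster in clusters:
            rep = cluster[0]
            if (rep.chrom == call.chrom and rep.svtype == call.svtype
                    and abs(rep.start - call.start) <= max_dist
                    and abs(rep.end - call.end) <= max_dist):
                cluster.append(call)
                break
        else:
            clusters.append([call])  # no compatible cluster found
    return [c for c in clusters if len({x.caller for x in c}) >= min_callers]


if __name__ == "__main__":
    example = [
        SVCall("callerA", "chr1", 1000, 2000, "DEL"),
        SVCall("callerB", "chr1", 1100, 2050, "DEL"),
        SVCall("callerC", "chr1", 50000, 51000, "DEL"),
    ]
    for cluster in merge_calls(example):
        print([(c.caller, c.start, c.end) for c in cluster])
```

The greedy comparison against each cluster's first call keeps the sketch short; a production tool would need to handle chained overlaps, strand, and type-specific rules more carefully.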

Page 17: Ashg sedlazeck grc_share

2. Genome in a Bottle: merging 95 vcfs (1 min)

10x Genomics

BioNano

Complete Genomics

Illumina

PacBio

Minimum 2 callers:

SV Caller Comparison:

Using PCR + Sanger to validate SVs from multiple categories.

Join CSHL + Baylor to help with validations!

Page 18: Ashg sedlazeck grc_share

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs

4. Population analysis

5. Diploid Genome

Page 19: Ashg sedlazeck grc_share

3. Annotation: SURVIVOR_ant

Annotating SVs with:
• Multiple GTF, BED, VCF

Genome in a Bottle:
• 63,677 genes (GTF)
• 1,733,686 regions (3 BED files)
• 22 seconds: 8,314 genes impacted

Sedlazeck et al. (2017)

[Figure: Genes impacted by SVs (histogram over genes impacted). x-axis: # SV hit gene; y-axis: # Genes (frequency).]
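The core of this kind of annotation is an interval-overlap query between SVs and annotated features. The sketch below counts how many SVs hit each gene and then builds the "SVs per gene" histogram shown on the slide. It is a naive illustration, not SURVIVOR_ant's implementation; the tuple layouts and function names are assumptions, and a real tool would use an indexed structure such as an interval tree rather than a nested loop.

```python
from collections import Counter


def overlaps(a_start, a_end, b_start, b_end):
    """True if the closed intervals [a_start, a_end] and [b_start, b_end] share a base."""
    return a_start <= b_end and b_start <= a_end


def count_sv_hits(genes, svs):
    """genes: list of (name, chrom, start, end); svs: list of (chrom, start, end).
    Returns a Counter mapping gene name -> number of overlapping SVs."""
    hits = Counter()
    for name, g_chrom, g_start, g_end in genes:
        for s_chrom, s_start, s_end in svs:
            if g_chrom == s_chrom and overlaps(g_start, g_end, s_start, s_end):
                hits[name] += 1
    return hits


def hits_histogram(hits):
    """Map 'number of SVs hitting a gene' -> 'number of genes with that count'."""
    return Counter(hits.values())


if __name__ == "__main__":
    genes = [("G1", "chr1", 100, 200), ("G2", "chr1", 500, 600)]
    svs = [("chr1", 150, 160), ("chr1", 190, 550)]
    hits = count_sv_hits(genes, svs)
    print(dict(hits), dict(hits_histogram(hits)))
```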

Page 20: Ashg sedlazeck grc_share

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs: SURVIVOR_ANT

4. Population analysis

5. Diploid Genome

Page 21: Ashg sedlazeck grc_share

4. SVs in Population: SURVIVOR

• Birth defect study (Karyn Meltz Steinberg, WashU: Wed. 9am: Room 310A)

• 4 callers, 114 samples

• CCDG (William Salerno, HGSC: poster on Friday, #1281)

• 5 callers, 22,600 samples

• Non-human:
  • S. pombe: 3 callers, 161 samples
  • Tomato: 3 callers, 846 samples

Page 22: Ashg sedlazeck grc_share

4. SVs in 22,600 Individuals

We need large SV studies:
• Common vs. rare SVs
• Inform GWAS studies
• Ethnicity specific SVs
• Catalog variability of regions
  • MHC, LPA, etc.

[Figure: CHR6: Average SV Allele Frequency per 100kb. x-axis: Position; y-axis: Allele frequency; marked regions: MHC, LPA.]

[Figure: # SVs shared across individuals, by Position.]
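A per-window average like the one plotted for chromosome 6 can be computed by binning SVs by position. The sketch below is one minimal way to do it, assuming each SV is reduced to a (position, allele frequency) pair on a single chromosome; the function name and input layout are assumptions, not the pipeline used for the slide.

```python
from collections import defaultdict


def mean_af_per_window(svs, window=100_000):
    """svs: iterable of (position, allele_frequency) pairs on one chromosome.
    Returns {window_start: mean allele frequency of SVs in that window}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pos, af in svs:
        w = pos // window          # index of the 100 kb bin containing pos
        sums[w] += af
        counts[w] += 1
    return {w * window: sums[w] / counts[w] for w in sorted(sums)}


if __name__ == "__main__":
    toy = [(10, 0.1), (99_999, 0.3), (100_000, 0.5)]
    print(mean_af_per_window(toy))
```

Windows without any SV are simply absent from the result, so a plot built from it would need to decide whether to show them as zero or as missing.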

Page 23: Ashg sedlazeck grc_share

Challenges

1. Accurate prediction of SVs: Sniffles (talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs: SURVIVOR_ANT

4. Population analysis: SURVIVOR

5. Diploid Genome

Page 24: Ashg sedlazeck grc_share

5.1 Diploid Genome

Challenges:
• Sequencing technology
• Computational methods
• Money

HGSC Approach: GADGET
1. Sequence 100 individuals: PacBio + 10x Genomics
2. SV detection/genotyping
3. Phasing of SVs + SNPs
4. Population-based genotyping of SVs using short reads

Page 25: Ashg sedlazeck grc_share

5.2 Diploid Genome

Selecting 100 samples

• We want to maximize the outcome/ $ spent

• Selection of samples (red)

• Select top 100 (red)

• Random selection of samples (boxplot)

[Figure: histogram of SVs per patient. x-axis: # SVs; y-axis: # Patients.]

[Figure: Random vs. informed choice of samples (CCDG). x-axis: Number of chosen samples; y-axis: SV in population (%); series: Informed, Top100, Random.]
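The slide does not say how the "informed" selection was made. One plausible reading, sketched below purely as an assumption, is a greedy maximum-coverage heuristic: repeatedly pick the sample that adds the most SVs not yet covered. For comparison, the sketch also includes a "top N by SV count" pick and a random pick. All function names and the toy data are hypothetical.

```python
import random


def coverage(chosen, samples):
    """Fraction of all SVs in the cohort carried by at least one chosen sample."""
    all_svs = set().union(*samples.values())
    covered = set().union(*(samples[s] for s in chosen))
    return len(covered) / len(all_svs)


def pick_top(samples, n):
    """Choose the n samples carrying the most SVs (ties broken by name)."""
    return sorted(samples, key=lambda s: (-len(samples[s]), s))[:n]


def pick_greedy(samples, n):
    """Greedy max coverage: each step adds the sample with the most new SVs."""
    chosen, covered = [], set()
    remaining = set(samples)
    while remaining and len(chosen) < n:
        best = max(sorted(remaining), key=lambda s: len(samples[s] - covered))
        chosen.append(best)
        covered |= samples[best]
        remaining.remove(best)
    return chosen


def pick_random(samples, n, seed=0):
    """Choose n samples uniformly at random (seeded for reproducibility)."""
    rng = random.Random(seed)
    return rng.sample(sorted(samples), min(n, len(samples)))


if __name__ == "__main__":
    # Toy cohort: sample name -> set of SV identifiers it carries.
    cohort = {"A": {1, 2, 3, 4}, "B": {1, 2, 3}, "C": {5, 6}, "D": {7}}
    for name, picker in (("top", pick_top), ("greedy", pick_greedy), ("random", pick_random)):
        chosen = picker(cohort, 2)
        print(name, chosen, round(coverage(chosen, cohort), 2))
```

In the toy cohort, picking the two samples with the most SVs covers fewer distinct SVs than the greedy pick, because the two largest samples are nearly redundant. That is the general reason a selection informed by what is already covered can outperform both "top N" and random choices.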

Page 26: Ashg sedlazeck grc_share

5.3 Diploid Genome (Prototype)

Page 27: Ashg sedlazeck grc_share

Challenges/ Summary

1. Accurate prediction of SVs: Sniffles (Talk on Thursday!)

2. Comparison of SVs: SURVIVOR

3. Annotation and interpretation of SVs: SURVIVOR_ANT

4. Population analysis: SURVIVOR

5. Diploid Genome: GADGET

All methods are available:

https://github.com/fritzsedlazeck

https://fritzsedlazeck.github.io/

[Figure: Random vs. informed choice of samples (CCDG). x-axis: Number of chosen samples; y-axis: SV in population (%); series: Informed, Top100, Random.]

Page 28: Ashg sedlazeck grc_share

Acknowledgments

William Salerno
Stephen Richards
Richard Gibbs
Michael Schatz
Schatz lab
Daniel Jeffares
Jürg Bähler
Christophe Dessimoz
Justin Zook
GiaB consortium