creating reference-grade human genome assemblies

Creating Reference-Grade Human Genome Assemblies Tina Graves Lindsay GRC Workshop at Genome Informatics Sept 19, 2016

Upload: genome-reference-consortium

Post on 17-Jan-2017


Page 1: Creating Reference-Grade Human Genome Assemblies

Creating Reference-Grade Human Genome Assemblies

Tina Graves Lindsay
GRC Workshop at Genome Informatics
Sept 19, 2016

Page 2: Creating Reference-Grade Human Genome Assemblies

The Human Reference is a Work in Progress!

• The current reference, GRCh38, is not optimal for some regions of the genome and/or some individuals/ancestries.

• GRCh38 is composed of DNA from several individual humans.

• Allelic diversity and structural variation present major challenges when assembling a representative diploid genome.

• New technologies, methods, and resources since 2003 have allowed for substantial improvements in the reference genome.

• Additional high-quality reference sequences are needed to represent the full range of genetic diversity in humans.

Page 3: Creating Reference-Grade Human Genome Assemblies

UGT2B17 – Conflicting Alleles

[Figure: chr4 tiling paths across the UGT2B17 region; Xue Y et al, 2008]

NCBI36 NC_000004.10 (chr4) Tiling Path
AC074378.4, AC079749.5, AC134921.2, AC147055.2, AC140484.1, AC019173.4, AC093720.2, AC021146.7
Gene labels: TMPRSS11E, TMPRSS11E2

GRCh37 NC_000004.11 (chr4) Tiling Path
AC074378.4, AC079749.5, AC134921.1, AC147055.2, AC093720.2, AC021146.7
Gene label: TMPRSS11E

GRCh37: NT_167250.1 (UGT2B17 alternate locus)
AC074378.4, AC140484.1, AC019173.4, AC226496.2, AC021146.7
Gene label: TMPRSS11E2

Other figure label: GAP

Page 4: Creating Reference-Grade Human Genome Assemblies

Samples to be Sequenced

Page 5: Creating Reference-Grade Human Genome Assemblies

Sequencing Plan

Page 6: Creating Reference-Grade Human Genome Assemblies

Definitions of Genome Level

• Platinum Genome
  • Haploid genome source
  • Contiguous, haplotype-resolved representation of entire genome
  • BAC library available

• Gold Genome
  • Diploid genome source
  • Part of a trio
    • Parents will be sequenced to help haplotype resolve some regions
  • BAC libraries available
    • Targeted regions sequenced using these BAC libraries
  • Will contain some haplotype resolved regions

Page 7: Creating Reference-Grade Human Genome Assemblies

CHM1: A Key Resource for Improving the Reference

• CHM1 cell line established from a haploid hydatidiform mole (complete, paternal; 46,XX) (U. Surti)

• CHORI-17 BAC library (P. de Jong)
• CHORI-17 BAC end sequences (n=325,659)
• CHORI-17 multiple enzyme fingerprint map (1,560 FPC contigs)
• CHORI-17 BACs
  • >750 have been sequenced
  • 664 of them in GenBank as phase 3 sequence

• CHM1 WGS assembly
  • Initial assembly produced from >100X coverage of Illumina data
  • Initial PacBio assembly produced using ~54X of P5 PacBio data
  • Latest PacBio assembly produced using ~60X of P6 PacBio data

Page 8: Creating Reference-Grade Human Genome Assemblies

Assembly Assessment Methods

• Assemblies will run through NCBI QA pipeline
  • Assessed for contiguity, annotation, and concordance with the finished BACs

• Assembly-assembly alignments will be generated between each PB assembly and GRCh38

• BioNano Genome Map
  • SV calls generated from comparing the BioNano data to each of the assemblies
  • Hybrid scaffolding conflicts will also point out potential assembly errors

• Alignment of the Illumina reads back to each of the assemblies
  • Heterozygous calls are likely indicative of a collapse in the assembly (for the haploid genomes)
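
The last check on this slide can be illustrated with a small sketch. In a haploid assembly, a cluster of heterozygous calls from read alignments suggests that two similar copies of a sequence were collapsed into one. The code below is only an illustration under stated assumptions, not the pipeline used in the talk: the function name `flag_het_windows`, the `(contig, position, genotype)` input tuples, the 1-based positions, and the `window` and `min_het` thresholds are all hypothetical.

```python
from collections import Counter


def flag_het_windows(calls, window=10_000, min_het=5):
    """Flag fixed-size windows that contain many heterozygous calls.

    calls   -- iterable of (contig, pos, genotype); pos is assumed 1-based,
               genotype is a string such as "0/1" or "1|1" (assumed format)
    window  -- window size in bp (hypothetical default)
    min_het -- minimum heterozygous calls needed to flag a window
               (hypothetical default)

    Returns a sorted list of (contig, start, end, het_count) tuples.
    """
    counts = Counter()
    for contig, pos, genotype in calls:
        alleles = genotype.replace("|", "/").split("/")
        if len(set(alleles)) > 1:  # more than one distinct allele: heterozygous
            counts[(contig, (pos - 1) // window)] += 1
    return sorted(
        (contig, idx * window + 1, (idx + 1) * window, n)
        for (contig, idx), n in counts.items()
        if n >= min_het
    )


if __name__ == "__main__":
    example_calls = [
        ("ctg1", 100, "0/1"),
        ("ctg1", 200, "0/1"),
        ("ctg1", 300, "0/1"),
        ("ctg1", 15_000, "1/1"),  # homozygous, ignored
        ("ctg2", 50, "0/1"),      # a single call, below the threshold
    ]
    for hit in flag_het_windows(example_calls, window=1_000, min_het=3):
        print(hit)
```

Flagged windows are candidates for manual review only; sensible thresholds would depend on coverage and on the variant caller, neither of which is specified in the slides.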

Page 9: Creating Reference-Grade Human Genome Assemblies
Page 10: Creating Reference-Grade Human Genome Assemblies

Hybrid Scaffolds – PacBio and BioNano

Metric                        CHM1 (P6) GCA_001297185    CHM1 (P6) GCA_001307025
                              MGI CHM1 map               MGI CHM1 map
                              (Jason’s version)          (Adam’s version)
Seq Assem: # of Contigs       3641                       4850
Seq Assem: Contig N50 (Mb)    26.9                       20.6
Seq Assem: Total Size (Gb)    2.99                       2.94
BN Hybrid: # of Scaffolds     161                        221
BN Hybrid: Scaff N50 (Mb)     47.6                       40.04
BN Hybrid: Total Size (Gb)    2.84                       2.82

Page 11: Creating Reference-Grade Human Genome Assemblies

Hybrid Scaffold

[Figure tracks: Hybrid Scaffold, PacBio Contigs, BioNano Contigs]

Page 12: Creating Reference-Grade Human Genome Assemblies

1q21 Region – GRCh38 vs GCA_001297185

[Figure tracks: GRCh38, GCA_001297185, Seg Dup Track; scale bar: 1 Megabase]

Page 13: Creating Reference-Grade Human Genome Assemblies

1q21 Region - GRCh38 vs GCA_001297185

GRCh38

GCA_001297185

Seg Dup Track

99.9+% identity
99.1% identity

Page 14: Creating Reference-Grade Human Genome Assemblies

CHM1 – Next Steps

• Move forward with improving GCA_001297185

• Based on alignment of BioNano data as well as comparisons to GRCh38, make additional breaks where possible

• Incorporate all finished BACs

• Final alignment to GRCh38 in order to produce chromosome AGPs and submit

Page 15: Creating Reference-Grade Human Genome Assemblies

First Gold Genome - NA19240

Initial Assembly Stats
# Seq Contigs          3569
Max Contig Length      20,393,869 bp
Total Assembly Size    2,745,634,789 bp
N50                    6,003,115 bp
N90                    848,151 bp
N95                    345,457 bp

• NA19240 – Yoruban sample

• Generated >70X raw PacBio data

Publication Pending
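
For readers unfamiliar with the N50, N90 and N95 values quoted on this slide: Nx is the contig length L such that contigs of length L or longer together contain at least x% of the assembled bases. A minimal sketch of that calculation follows. The function name `nx` and the toy contig lengths are illustrative assumptions, not data from the NA19240 assembly.

```python
def nx(lengths, x=50):
    """Return the Nx statistic for a list of contig lengths.

    Contigs are visited from longest to shortest; the result is the length
    of the contig at which the running total first reaches x% of all bases.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    if not 0 < x <= 100:
        raise ValueError("x must be in the range (0, 100]")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * x:
            return length


if __name__ == "__main__":
    toy_contigs = [10, 8, 6, 4, 2]  # toy lengths, 30 bases in total
    print("N50:", nx(toy_contigs, 50))
    print("N90:", nx(toy_contigs, 90))
```

The same function applies to scaffold lengths, which is how the scaffold N50 values elsewhere in this deck would be defined.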

Page 16: Creating Reference-Grade Human Genome Assemblies

NA19240 BioNano Hybrid and SV Stats

Metric                          NA19240
Seq Assem: # of Contigs         3569
Seq Assem: Contig N50 (Mb)      6.01
Seq Assem: Total Size (Gb)      2.75
BN Hybrid: # of Scaffolds       421
BN Hybrid: Scaffold N50 (Mb)    14.78
BN Hybrid: Total Size (Gb)      2.74
BN Hybrid: Conflicts WGS        49
BN Hybrid: Conflicts BN         60

                  Potential mis-assemblies    Breaks made
Conflicts         28                          22
Ends              13                          5
Insertions        5                           2
Translocations    74                          14

Initial curated assembly = GCA_001524155.1

Page 17: Creating Reference-Grade Human Genome Assemblies

Finished BACs Resolve This Region

GRCh38

PB Assembly

BAC Alignments

Seg Dup

Page 18: Creating Reference-Grade Human Genome Assemblies

Which Assembly is Best?

HG00733 Puerto Rican Assembly Stats

[Figure: Contig Length N50 (Mb), axis range 5.80 to 7.80, plotted against Total Assembly Size (Gb), axis range 2.815 to 2.850]

• Use other sources to assess multiple assemblies
  • BioNano
  • Long linked reads

Page 19: Creating Reference-Grade Human Genome Assemblies

Genome Status

Data Source    Origin             Level of Coverage    Status
CHM1           NA                 Platinum             Assembly Improvement
CHM13          NA                 Platinum             Assembly Assessment
NA19240        Yoruban            Gold                 Paper in Review
HG00733        Puerto Rican       Gold                 Assembly Assessment
HG00514        Han Chinese        Gold                 Assembly Assessment
NA12878        European           Gold                 Assembly Assessment
HG01352        Colombian          Gold                 Assembly Assessment
HG02818        Gambian            Gold                 Data Generation Completed
HG02059        Kinh Vietnamese    Gold                 Data Generation Completed
NA19434        Luhya              Gold                 Data Generation

Page 20: Creating Reference-Grade Human Genome Assemblies

Acknowledgements

The McDonnell Genome Institute at Washington University in St. Louis
Rick Wilson, Bob Fulton, Wes Warren, Karyn Meltz Steinberg, Vince Magrini, Sean McGrath, Derek Albracht, Milinn Kremitzki, Susan Rock, Debbie Scheer, Chad Tomlinson, Patrick Minx, Chris Markovic, Eddie Belter, Lee Trani, Sara Kohlberg, Susan Dutcher

University of Washington
Evan Eichler

NCBI
Valerie Schneider

University of Pittsburgh School of Medicine (CHM1 and CHM13 cell line)
Urvashi Surti

BioNano Genomics
Palak Sheth, Alex Hastie

Pacific Biosciences
Jason Chin, Nick Sisneros

UCSF
Pui-Yan Kwok, Yvonne Lai, Chin Lin, Catherine Chu

NHGRI
Adam Phillippy, Sergey Koren

10X Genomics
Deanna Church