Graph and assembly strategies for the MHC and ribosomal DNA regions Alexander Dilthey

Upload: genome-reference-consortium

Post on 17-Jan-2017


Page 1: Graph and assembly strategies for the MHC and ribosomal DNA regions

Graph and assembly strategies for the MHC and ribosomal DNA regions

Alexander Dilthey

Page 2: Graph and assembly strategies for the MHC and ribosomal DNA regions

The MHC is the zebrafish of the genome!

(model region)

Page 3: Graph and assembly strategies for the MHC and ribosomal DNA regions

PRGs – Population Reference Graphs

• Simple: acyclic, directed (sub-class of general variation graphs)

• Usually built from MSA; preserve gap positions (i.e. global homology between input sequences).

• Generative model: Recombination

• Ploidy well-defined (0, 1, 2)

[Figure: example PRG; edges labelled with nucleotides and gap characters (_)]
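A PRG of this kind can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the MHC-PRG implementation: each MSA column becomes one graph level, and any symbol at one level may be followed by any symbol at the next (free recombination). The function names `build_levels` and `haplotypes` are hypothetical, and the two toy rows are the ones used in the scrubbing example later in the talk.

```python
from itertools import product


def build_levels(msa_rows):
    """Turn equal-length MSA rows into PRG levels.

    Level i is the sorted list of symbols seen in MSA column i.
    The gap character '_' is kept as an ordinary symbol, so gap
    positions (global homology) are preserved.
    """
    length = len(msa_rows[0])
    if any(len(row) != length for row in msa_rows):
        raise ValueError("all MSA rows must have the same length")
    return [sorted({row[i] for row in msa_rows}) for i in range(length)]


def haplotypes(levels):
    """Enumerate every path through the levels, with gaps removed.

    Assumption: free recombination between adjacent levels, so the
    set of paths is the Cartesian product of the levels.
    """
    return {"".join(path).replace("_", "") for path in product(*levels)}


levels = build_levels(["AACAC_TTT", "TTCACGTTT"])
haps = haplotypes(levels)
```

Both input sequences are paths through the graph, and so are recombinants such as ATCACGTTT that appear in neither input row.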

Page 4: Graph and assembly strategies for the MHC and ribosomal DNA regions

Outline

• Quick recap: what we know about the utility of graph genome approaches

• New results:
  Haplotyping in hypervariable regions (HLA)
  Pseudo graph alignment

• De novo assembly of ribosomal DNA

Page 5: Graph and assembly strategies for the MHC and ribosomal DNA regions

In most of the MHC, single-reference approaches work just fine…

[Figure: number of kmers (millions), recovered vs. not recovered, for PGF reference, Platypus, PRG-Viterbi and PRG-Mapped]

+ long-read validation with consistent results (not shown)

Dilthey et al., Nature Genetics 2015
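The kmer-based comparison can be illustrated with a small sketch. The talk does not spell out the definition, so this assumes that a kmer of the inferred sequence counts as "recovered" when it also occurs in the sample's reads. The function names and the toy data are hypothetical.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_recovery(inferred, reads, k):
    """Count kmers of the inferred sequence found / not found in the reads."""
    read_kmers = set()
    for read in reads:
        read_kmers |= kmers(read, k)
    inferred_kmers = kmers(inferred, k)
    recovered = inferred_kmers & read_kmers
    not_recovered = inferred_kmers - read_kmers
    return len(recovered), len(not_recovered)


# Toy example: the inferred sequence ends in "TTC", the reads support "TAC".
result = kmer_recovery("ACGTTC", ["ACGT", "GTAC"], k=3)
```

Under this reading, a better inferred genome leaves fewer kmers unrecovered.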

Page 6: Graph and assembly strategies for the MHC and ribosomal DNA regions

… graph genomes outperform in the most complex sub-region of the MHC …

Dilthey et al., Nature Genetics 2015

Page 7: Graph and assembly strategies for the MHC and ribosomal DNA regions

… remaining problems driven by incomplete input haplotypes + algorithmics.

[Figure: aligned kmers; chromotype position (kb) vs. read position (kb)]

Incomplete input haplotypes: large uncharacterized inversion

Algorithmics: incorrect HLA haplotyping.

Dilthey et al., Nature Genetics 2015

Page 8: Graph and assembly strategies for the MHC and ribosomal DNA regions

HLA haplotyping

• Hypothesis: Whole-genome sequencing data contains the information necessary for accurate HLA typing

• “HLA typing”: HLA gene exon sequences
  • HLA class I: exons 2 and 3
  • HLA class II: exon 2

• Challenge: align reads to the right gene – homology hell.

• Proper read-to-graph alignment instead of k-mers.

Page 9: Graph and assembly strategies for the MHC and ribosomal DNA regions

Class I exon homology

[Figure panels: Exon 2, Exon 3]

HLA-A 3284 alleles
HLA-B 4077 alleles
HLA-C 2799 alleles

Page 10: Graph and assembly strategies for the MHC and ribosomal DNA regions

Approach: deep PRG + mapping

1) Gene-only PRG – 46 (pseudo) genes, mostly HLA

Gene 1 |--NNN--| Gene 2 |--NNN--| Gene 3

Padding | UTR | Exon 1 | Intron 1 | Exon 2 | UTR | Padding

2) Varying numbers of input sequences across PRG

[Figure: number of reference sequences along the PRG; region covered by 'genomic' sequences marked]

3) Use hierarchical MSA approach to combine in

Exonic MSA
T*01:01  _ _ A C G T A C T _ _
T*01:02  C A A C A T A C T _ _
T*01:03  _ _ A C G C G C T _ _
T*01:04  _ _ A T C C G C T A C
T*01:05  _ _ A T C C C C T _ _
T*01:06  _ _ _ C C T A C T _ _

Genomic MSA
T*01:01  A G C A _ _ A C G T A C T _ _ C C T A
T*01:02  A C C A C A A C A T A C T _ _ C C T A
T*01:04  _ T T A _ _ A T C C G C T A C C C T A

8 xMHC reference haplotypes
PGF (with T*01:01)   A C T A G C A _ _ A C G T A C T _ _ C C T A T G A
MANN (with T*01:04)  T T T _ T T A _ _ A T C C G C T A C C C T A T G A
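The combination of exonic and genomic alignments can be sketched with the toy alleles from this slide. This is a minimal illustration, not the actual pipeline: it assumes the exonic MSA maps onto a known contiguous block of genomic MSA columns (`exon_start`, a hypothetical parameter) and marks columns without sequence information with a made-up `*` symbol. The per-column depth then shows how the number of input sequences varies across the PRG.

```python
UNKNOWN = "*"  # hypothetical marker: no sequence information in this column

exonic = {
    "T*01:01": "__ACGTACT__",
    "T*01:02": "CAACATACT__",
    "T*01:03": "__ACGCGCT__",
    "T*01:04": "__ATCCGCTAC",
    "T*01:05": "__ATCCCCT__",
    "T*01:06": "___CCTACT__",
}
genomic = {
    "T*01:01": "AGCA__ACGTACT__CCTA",
    "T*01:02": "ACCACAACATACT__CCTA",
    "T*01:04": "_TTA__ATCCGCTACCCTA",
}


def combine(exonic, genomic, exon_start):
    """Embed exon-only alleles into genomic MSA coordinates."""
    width = len(next(iter(genomic.values())))
    exon_len = len(next(iter(exonic.values())))
    combined = {}
    for allele, exon_row in exonic.items():
        if allele in genomic:
            row = genomic[allele]
            # The genomic row must agree with the exonic row in the exon block.
            if row[exon_start:exon_start + exon_len] != exon_row:
                raise ValueError(f"exonic/genomic mismatch for {allele}")
        else:
            left = UNKNOWN * exon_start
            right = UNKNOWN * (width - exon_start - exon_len)
            row = left + exon_row + right
        combined[allele] = row
    return combined


def depth(combined):
    """Number of alleles that carry information in each MSA column."""
    width = len(next(iter(combined.values())))
    return [sum(row[i] != UNKNOWN for row in combined.values())
            for i in range(width)]


combined = combine(exonic, genomic, exon_start=4)
column_depth = depth(combined)
```

In this toy case the exon block is covered by six sequences and the flanking columns by three.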

Page 11: Graph and assembly strategies for the MHC and ribosomal DNA regions

Approach: deep PRG + mapping

[Figure: Level 1 PRG with numbered levels and an aligned read; edges labelled with nucleotides and gap characters (_)]

4) Seed-and-extend paired-end mapping to PRG

5) Likelihood-based inference: maximize L( aligned reads | HLA types ) (independently per locus)
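Step 5 can be illustrated with a toy model. The sketch below is not the authors' likelihood; it assumes that reads are already placed without gaps, that each base is read wrongly with a fixed probability `err`, and that a read is equally likely to come from either of the two haplotypes at the locus. All names are hypothetical, and the allele strings are the ungapped core of three toy alleles from the previous slide.

```python
import math
from itertools import combinations_with_replacement


def read_log_lik(read, start, allele, err=0.01):
    """log P(read | allele) for an ungapped placement at position start."""
    total = 0.0
    for offset, base in enumerate(read):
        if allele[start + offset] == base:
            total += math.log(1.0 - err)
        else:
            total += math.log(err / 3.0)
    return total


def best_genotype(reads, alleles, err=0.01):
    """Return the allele pair that maximizes L(aligned reads | HLA types)."""
    best_pair, best_ll = None, -math.inf
    for a, b in combinations_with_replacement(sorted(alleles), 2):
        total = 0.0
        for start, read in reads:
            la = read_log_lik(read, start, alleles[a], err)
            lb = read_log_lik(read, start, alleles[b], err)
            # Mixture over the two haplotypes, computed stably in log space.
            m = max(la, lb)
            total += m + math.log(0.5 * math.exp(la - m) + 0.5 * math.exp(lb - m))
        if total > best_ll:
            best_pair, best_ll = (a, b), total
    return best_pair


alleles = {"T*01:01": "ACGTACT", "T*01:03": "ACGCGCT", "T*01:05": "ATCCCCT"}
# (start position, read sequence); two reads per true haplotype.
reads = [(0, "ACGT"), (3, "TACT"), (0, "ATCC"), (3, "CCCT")]
call = best_genotype(reads, alleles)
```

Because each locus is scored on its own, the search is over allele pairs at one gene at a time, matching the "independently per locus" note above.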

Page 12: Graph and assembly strategies for the MHC and ribosomal DNA regions

High-quality WGS data enables gold-standard accuracy

(of note: 2/3 original discrepancies with validation data were errors in the validation data!)

Page 13: Graph and assembly strategies for the MHC and ribosomal DNA regions

… but not from exome, MiSeq data

Page 14: Graph and assembly strategies for the MHC and ribosomal DNA regions

Sequencing error?

Page 15: Graph and assembly strategies for the MHC and ribosomal DNA regions

Effective fragment length? [2 x read length + IS]
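As a worked example of the bracketed formula, assuming "IS" stands for insert size in the sense of the unsequenced stretch between the two mates (the slide does not define it), and using made-up numbers:

```python
def effective_fragment_length(read_length, insert_size):
    """Effective fragment length as 2 x read length + IS (assumed: insert size)."""
    return 2 * read_length + insert_size


# Hypothetical library: 2 x 100 bp reads with 150 bp between the mates.
example = effective_fragment_length(read_length=100, insert_size=150)
```

A longer effective fragment spans more polymorphic positions per read pair, which is why it matters for telling homologous genes apart.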

Page 16: Graph and assembly strategies for the MHC and ribosomal DNA regions

Conclusion (intermediate)

• If the input sequencing data is "good enough", we manage near-perfect haplotyping in the genome's most polymorphic region

• Effective fragment length likely the most important factor

• Not-so-good sequencing data: joint haplotyping + alignment (i.e. alignment location is not independent of inferred haplotype)

• Read mapping implementation SLOW

Page 17: Graph and assembly strategies for the MHC and ribosomal DNA regions

Pseudo graph mapping

Input sequences

Page 18: Graph and assembly strategies for the MHC and ribosomal DNA regions

Pseudo graph mapping

Input sequences

Graph

Page 19: Graph and assembly strategies for the MHC and ribosomal DNA regions

Pseudo graph mapping

Input sequences

Graph

Align short reads to input sequences...

Page 20: Graph and assembly strategies for the MHC and ribosomal DNA regions

Pseudo graph mapping

Input sequences

Graph

Align short reads to input sequences...

... transpose onto graph
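The "transpose onto graph" step amounts to a coordinate translation. A minimal sketch, assuming the graph levels correspond one-to-one to MSA columns and that the read was aligned without gaps to one input sequence (function names hypothetical, columns 0-based):

```python
def msa_columns(msa_row, gap="_"):
    """For each base of the ungapped sequence, its column in the MSA."""
    return [col for col, ch in enumerate(msa_row) if ch != gap]


def transpose(msa_row, read_start, read_length):
    """Map a gapless read alignment on the input sequence to MSA columns."""
    cols = msa_columns(msa_row)
    return cols[read_start:read_start + read_length]


# A 4-base read aligned at position 3 of the ungapped sequence AACACTTT.
graph_columns = transpose("AACAC_TTT", read_start=3, read_length=4)
```

The read skips column 5, because the input sequence it was aligned to has a gap there.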

Page 21: Graph and assembly strategies for the MHC and ribosomal DNA regions

Scrubbing, cutting, cleaning

      Input MSA    Lin. alignment    MSA coor.     Scrubbed
      123456789                      123456X789    123456789
Seq1  AACAC_TTT    Seq1 AACAC_TTT    AACAC__TTT    AACAC_TTT
Seq2  TTCACGTTT    Read AACACGTTT    AACAC_GTTT    AACACGTTT

Graph (from input MSA): {AA|TT} CAC {_|G} TTT

Scrubbing: get rid of INDEL-induced changes in the alignment coordinate system

Cutting: Examine alignment gap structure; cut in "bad" areas; use longest stretch

Cleaning: Find the best gap-less sequence-to-graph alignment + extension with gaps

Graph alignment

       123456789
Graph  AACACGTTT
Seq1   AACACGTTT
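One reading of the scrubbing step, using the example on this slide: when the read carries an insertion relative to the input sequence, and the MSA row of that sequence already has a gap column at that spot, the inserted base is moved into the existing column instead of opening a new one (the "X" column). The sketch below follows that reading only; the function name and the `(column, base)` output format are hypothetical.

```python
def scrub(msa_row, seq_aln, read_aln, gap="_"):
    """Place a read in MSA coordinates (0-based columns).

    seq_aln / read_aln are the two rows of the linear alignment between
    the ungapped input sequence and the read. Returns (column, base)
    pairs; column is None for an inserted base that fits no MSA column.
    """
    cols = [c for c, ch in enumerate(msa_row) if ch != gap]
    placed = []
    pending = []      # inserted read bases waiting for a column
    prev_col = None   # MSA column of the previous sequence base
    pos = 0           # index into the ungapped input sequence
    for s, r in zip(seq_aln, read_aln):
        if s == gap:              # insertion in the read
            pending.append(r)
            continue
        col = cols[pos]
        # MSA gap columns of this row between the two flanking bases.
        free = [] if prev_col is None else list(range(prev_col + 1, col))
        for i, base in enumerate(pending):
            placed.append((free[i] if i < len(free) else None, base))
        pending = []
        placed.append((col, r))   # r may be a gap, i.e. a deletion in the read
        prev_col = col
        pos += 1
    placed.extend((None, base) for base in pending)  # trailing insertions
    return placed


scrubbed = scrub("AACAC_TTT", "AACAC_TTT", "AACACGTTT")
```

With the slide's example the inserted G lands in column 5 (column 6 in the slide's 1-based numbering), so the read ends up in the original nine-column coordinate system.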

Page 22: Graph and assembly strategies for the MHC and ribosomal DNA regions

Accuracy slightly worse; fast!

Conclusion: perhaps there is a middle ground between graph and linear sequence alignment. Work in progress. Further tuning?

Highest Resolution

                             MHC-PRG-2                        HLA*PRG
Cohort         Locus  N      Inferred  Accuracy  Call Rate    Inferred  Accuracy  Call Rate
Platinum Trio  A      6      6         1.00      1.00         6         1.00      1.00
Platinum Trio  B      6      6         1.00      1.00         6         1.00      1.00
Platinum Trio  C      6      6         1.00      1.00         6         1.00      1.00
Platinum Trio  DQA1   6      6         1.00      1.00         6         1.00      1.00
Platinum Trio  DQB1   6      6         1.00      1.00         6         1.00      1.00
Platinum Trio  DRB1   6      6         1.00      1.00         6         1.00      1.00
1000 Genomes   A      22     22        0.86      1.00         22        1.00      1.00
1000 Genomes   B      22     22        1.00      1.00         22        1.00      1.00
1000 Genomes   C      22     22        1.00      1.00         22        1.00      1.00
1000 Genomes   DQA1   12     12        1.00      1.00         12        1.00      1.00
1000 Genomes   DQB1   22     22        1.00      1.00         22        1.00      1.00
1000 Genomes   DRB1   22     22        0.91      1.00         22        0.95      1.00
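The table columns can be read with the following definitions, which are an assumption here since the slide does not state them: call rate is the fraction of alleles for which a call was made, and accuracy is the fraction of made calls that agree with the validation data. The function name and the toy calls are hypothetical.

```python
def summarize(truth, calls):
    """Summarize allele calls; None in calls means 'no call made'."""
    called = [(t, c) for t, c in zip(truth, calls) if c is not None]
    correct = sum(t == c for t, c in called)
    accuracy = correct / len(called) if called else float("nan")
    return {
        "N": len(truth),
        "inferred": len(called),
        "accuracy": round(accuracy, 2),
        "call_rate": round(len(called) / len(truth), 2),
    }


# Hypothetical locus: 22 alleles, all called, 3 of them wrong.
summary = summarize(["a"] * 22, ["a"] * 19 + ["b"] * 3)
```

With 22 alleles per locus a single wrong call moves accuracy by about 0.05, so the differences in the table correspond to a handful of alleles.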

Page 23: Graph and assembly strategies for the MHC and ribosomal DNA regions

Towards additional high-quality reference haplotypes…

Remaining challenges: extreme repeats, haplotypes.

Sergey Koren

Page 24: Graph and assembly strategies for the MHC and ribosomal DNA regions

Ribosomal DNA

• Encodes ribosomal RNA

• Hundreds of copies (tandem repeat arrays)

• Variation poorly characterized

• Step 1: Targeted approach
• Step 2: WGS-based
• Step 3: Variation graph

Page 25: Graph and assembly strategies for the MHC and ribosomal DNA regions

Read error vs variation

… from whole-genome data?

Long reads
de Bruijn graph
Technology!

6% > 50k

Page 26: Graph and assembly strategies for the MHC and ribosomal DNA regions

Summary

• Variation graphs are worth the effort – at least in highly complex regions.

• Evidence: MHC "model system"
  + overall improvement of genome inference accuracy
  + complex-locus haplotyping

• Incorporate LD?

• Middle ground between full graph alignment and linear sequence alignment?

• Ribosomal DNA – let me know if you're also interested!

Page 27: Graph and assembly strategies for the MHC and ribosomal DNA regions

Acknowledgements

NIH: Adam Phillippy, Sergey Koren, Brian Walenz, Jung-Hyun Kim, Vladimir Larionov

Oxford: Gil McVean, Zam Iqbal, Alexander Mentzer

Histogenetics: Nezih Cereb

UCSF/Nantes: Pierre-Antoine Gourraud

GSK: Matt Nelson, Charles Cox