101717.kh miga ashg_grc

Centromere Sequence Assembly Karen H. Miga University of California, Santa Cruz 10/17/17 GRC GIAB Workshop ASHG

Upload: genome-reference-consortium

Post on 22-Jan-2018


Page 1: 101717.kh miga ashg_grc

Centromere Sequence Assembly

Karen H. Miga University of California, Santa Cruz

10/17/17 GRC GIAB Workshop

ASHG

Page 2: 101717.kh miga ashg_grc

HUMAN CENTROMERES: MULTI-MEGABASE SIZED GAPS IN ALL CHROMOSOME ASSEMBLIES

[Figure: chromosome schematic labeled P-ARM, CEN, Q-ARM, with the label "Megabase-sized gaps"]

Page 3: 101717.kh miga ashg_grc

CEN

Page 4: 101717.kh miga ashg_grc

PROGRESS UPDATE: CENTROMERE SEQUENCE ASSEMBLIES

1. GRCh38 Reference Models for Human Centromere Arrays

2. Efforts to Generate True, Linear Assemblies of Centromeric Regions: Chromosome Y

3. Future Perspective

Page 5: 101717.kh miga ashg_grc

CHALLENGE OF ASSEMBLING LONG TRACTS OF (NEAR IDENTICAL) TANDEM REPEATS

[Figure: multi-megabase sized arrays of satellite DNA between the p-arm and q-arm]

...ATCCGATTACG ATCCGATTACGATCCGATTACG... ...ATCCGATTACG ATCCGATTACGATCCGATTACG...
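Why near-identical tandem repeats resist assembly can be illustrated with a minimal Python sketch. It assumes a perfect (100% identical) array built from the 11 bp unit shown on this slide and an arbitrary k-mer size of 8; the function name and the value of k are illustrative and do not come from the talk.

```python
def distinct_kmers(seq: str, k: int) -> int:
    """Count the distinct k-mers present in seq."""
    return len({seq[i:i + k] for i in range(len(seq) - k + 1)})

UNIT = "ATCCGATTACG"          # 11 bp repeat unit from the slide

short_array = UNIT * 10       # 110 bp array
long_array = UNIT * 1000      # 11,000 bp array

K = 8                         # assumed k-mer size, chosen only for illustration
print(distinct_kmers(short_array, K), distinct_kmers(long_array, K))
```

Both arrays contain the same set of k-mers (never more than one per position in the unit), so k-mer or overlap evidence from reads shorter than the array cannot reveal how many copies separate two points.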

Page 6: 101717.kh miga ashg_grc

HUMAN CENTROMERES: ALPHA SATELLITE

~171 bp Tandem Repeat

Wide Range of Percent ID: ~60-100%

[Figure: alpha satellite monomers 1 2 3 4 in tandem between the p-arm and q-arm]

Page 7: 101717.kh miga ashg_grc

HIGHER ORDER REPEATS

“Higher Order Repeat”: Multi-monomeric Repeat Unit

Narrow Range of Percent ID: 94% - 100%

[Figure: monomers 1 2 3 4 repeated as a unit (1 2 3 4 1 2 3 4 1 2 3 4) between the p-arm and q-arm]
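The contrast between divergent neighboring monomers (~60-100% identity) and near-identical higher order repeat (HOR) copies (94-100% identity) can be sketched in Python. Everything below is synthetic: the 171 bp "ancestor", the mutation positions, and the helper names are hypothetical, and identity is computed without gaps, whereas real comparisons would rely on sequence alignment.

```python
from itertools import combinations

NEXT_BASE = {"A": "C", "C": "G", "G": "T", "T": "A"}

def percent_identity(a: str, b: str) -> float:
    """Ungapped percent identity between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def mutate(seq: str, positions) -> str:
    """Substitute the base at each given position with a different base."""
    bases = list(seq)
    for p in positions:
        bases[p] = NEXT_BASE[bases[p]]
    return "".join(bases)

# Synthetic 171 bp ancestor and four diverged monomers (hypothetical data).
ancestor = ("ACGT" * 43)[:171]
monomers = [mutate(ancestor, range(offset, 171, 6)) for offset in range(4)]

# One HOR unit = monomers 1-4 in tandem; a second copy carries sparse changes.
hor_copy_1 = "".join(monomers)
hor_copy_2 = mutate(hor_copy_1, range(0, len(hor_copy_1), 50))

monomer_ids = [percent_identity(a, b) for a, b in combinations(monomers, 2)]
hor_id = percent_identity(hor_copy_1, hor_copy_2)

print("monomer vs monomer:", [round(x, 1) for x in monomer_ids])
print("HOR copy vs HOR copy:", round(hor_id, 1))
```

In this toy data the monomers within a unit are far less similar to each other than one whole unit is to the next, which is the pattern that defines a higher order repeat.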

Page 8: 101717.kh miga ashg_grc

CHROMOSOME-SPECIFIC SATELLITE SEQUENCE ORGANIZATION

[Figure: chrX and chr3 schematics, each with a p-arm and q-arm; satellite arrays labeled Array “A”, Array “B”, Array “C”]

Page 9: 101717.kh miga ashg_grc

GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS

[Figure: satellite array between the p-arm and q-arm, with sites labeled -A- and -T-]

Page 10: 101717.kh miga ashg_grc

GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS

[Figure: satellite array between the p-arm and q-arm, with sites labeled -A- and -T-; added labels: LINE, SINE, OTHER, NON-ALPHA SATELLITE]

Page 11: 101717.kh miga ashg_grc

GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS

[Figure: satellite array between the p-arm and q-arm, with sites labeled -A- and -T-; added labels: LINE, SINE, OTHER, NON-ALPHA SATELLITE]

Unmapped (Yet Assembled) Scaffolds

Page 12: 101717.kh miga ashg_grc

1. GRCh38 Alpha Satellite Reference Models

Step 1: Characterize HORs in Human Genome

Page 13: 101717.kh miga ashg_grc

A B C D E F

1. GRCh38 Alpha Satellite Reference Models

Step 1: Characterize HORs in Human Genome

Page 14: 101717.kh miga ashg_grc

>200 ENCODE datasets

A B C D E F

1. GRCh38 Alpha Satellite Reference Models

Step 1: Characterize HORs in Human Genome

Step by Step Example For Single P-read, I

•  Example: p-read 78e9dc60_12326_5223 (internal read identifier)
–  p-read length 12257
–  Self-self dot plot

[Figure: self-self dot plot of the p-read; both axes run from 0 to 12000 with ticks every 2000]

α-Centauri (centromeric automated repeat identification)

12,257 bp PacBio Read with 70 adjacent monomers (NA12878, 10x Coverage HuPac Data)

[Figure: the read drawn 5’ to 3’ with labels A B C D E F and 10x]

http://github.com/volkansevim/alpha-CENTAURI
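A self-self dot plot like the one described for the p-read can be sketched in a few lines of Python. This is not the alpha-CENTAURI implementation: the exact k-mer matching, the function names, the k-mer size, and the short 66 bp toy read are all assumptions made for illustration (the real read is 12,257 bp with ~171 bp monomers and sequencing errors, which would call for inexact matching).

```python
from collections import Counter, defaultdict

def self_dot_plot(seq: str, k: int):
    """Return (i, j) pairs, i < j, where the k-mers starting at i and j are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    dots = []
    for hits in positions.values():
        for a, i in enumerate(hits):
            for j in hits[a + 1:]:
                dots.append((i, j))
    return sorted(dots)

def dominant_period(dots):
    """Most common diagonal offset (j - i), i.e. the apparent repeat period."""
    return Counter(j - i for i, j in dots).most_common(1)[0][0]

toy_read = "ATCCGATTACG" * 6   # hypothetical read built from an 11 bp unit
dots = self_dot_plot(toy_read, k=8)
print(len(dots), dominant_period(dots))
```

Each dot lies on a diagonal whose offset from the main diagonal is a multiple of the repeat period, so the spacing of the diagonals in a dot plot indicates the monomer or HOR length.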

Page 15: 101717.kh miga ashg_grc

Chromosome specific assignment

[Figure: HORs labeled A B C D E F, with chromosome assignment marked "?"]

Page 16: 101717.kh miga ashg_grc

Chromosome specific assignment

Experimental Evidence: FISH Hybridization/Mapping and Screening Somatic Cell Hybrid Panel

D7Z1 6-mer: Waye et al (1987), 98%, GenBank: M16101

Flow Sorted Chromosome Alignment/Enrichment: Sequence enrichment analysis of isolated human chromosomes

Long Range Paired Read Support: “Anchor” mapped to the assembled p-arm and/or q-arm

[Figure: HORs labeled A B C D E F]

Page 17: 101717.kh miga ashg_grc

Chromosome-assignment of Higher Order Repeats

Page 18: 101717.kh miga ashg_grc

1. GRCh38 Alpha Satellite Reference Models

Step 1: Characterize HORs in Human Genome

e.g. CENX: DXZ1 (12-mer), monomers 1 2 3 4 5 6 7 8 9 10 11 12

Step 2: Constructing WGS Read Libraries for each HOR array

[Figure labels: HuRef WGS Sanger read Db; HuRef; LINE; A/T]

Page 19: 101717.kh miga ashg_grc

1. GRCh38 Alpha Satellite Reference Models

Step 1: Characterize HORs in Human Genome

Step 2: Constructing WGS Read Libraries for each HOR array

Step 3: Model Array Variants in Sequence Graph

[Figure: sequence graph with nodes m1v1, m2v1, m2v2, m3v1, m4v1, m5v1, m6v1, m7v1, m8v1, m9v1, m10v1, m11v1, m12v1, and LINE; edges carry weights of 1.0, 0.7, 0.5, and 0.3; a LINE and an A/T site are marked]

Page 20: 101717.kh miga ashg_grc

linearSat
• 2nd Order Markov Chain
• Length determined by normalized array length estimates

[Figure: sequence graph with nodes m1v1, m2v1, m2v2, m3v1, m4v1, m5v1, m6v1, m7v1, m8v1, m9v1, m10v1, m11v1, m12v1, and LINE; edges carry weights of 1.0, 0.7, 0.5, and 0.3]

Not the “true” long-range organization, yet adequately represents the alpha satellite array sequence

https://github.com/JimKent/linearSat
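How a second-order Markov chain can turn a graph of monomer variants into a linear array model is sketched below in Python. This is not linearSat's code: the transition table, the node names reused from the figure, the 17,100 bp array length, and the fixed random seed are hypothetical values chosen for illustration.

```python
import random

MONOMER_BP = 171  # approximate alpha satellite monomer length

# Hypothetical second-order table: (previous two monomers) -> {next: probability}.
# The successor of m3v1 depends on which m2 variant preceded it.
TRANSITIONS = {
    ("m1v1", "m2v1"): {"m3v1": 1.0},
    ("m1v1", "m2v2"): {"m3v1": 1.0},
    ("m1v2", "m2v2"): {"m3v1": 1.0},
    ("m2v1", "m3v1"): {"m1v1": 1.0},
    ("m2v2", "m3v1"): {"m1v1": 0.5, "m1v2": 0.5},
    ("m3v1", "m1v1"): {"m2v1": 0.7, "m2v2": 0.3},
    ("m3v1", "m1v2"): {"m2v2": 1.0},
}

def generate_array(start, n_monomers, rng):
    """Sample a linear monomer order, conditioning each step on the previous two."""
    path = list(start)
    while len(path) < n_monomers:
        options = TRANSITIONS[(path[-2], path[-1])]
        names = list(options)
        weights = [options[name] for name in names]
        path.append(rng.choices(names, weights=weights, k=1)[0])
    return path[:n_monomers]

rng = random.Random(0)                 # fixed seed so the sketch is repeatable
array_bp = 17_100                      # hypothetical array length estimate
n_monomers = array_bp // MONOMER_BP    # model length follows the estimate
model = generate_array(("m1v1", "m2v1"), n_monomers, rng)
print(len(model), model[:9])
```

The output preserves local variant frequencies and adjacencies while its long-range order is only one sample among many, which matches the slide's caveat that such a model is not the "true" long-range organization.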

Page 21: 101717.kh miga ashg_grc

LINEAR ORDERING OF REFERENCE MODELS AND ASSEMBLED CONTIGS USING MATE PAIRS

chrX: Xp, CENX, Xq
3.8 Mb

chr3: 3p, CEN3.1, CEN3.2, 3q
2.25 Mb; ~860 HOR units
0.73 Mb; ~43 HOR units
0.3 Mb; Low Copy Repeat
Unmapped HuRef Assembled Contig(s) (e.g. ABBA01185959)

Page 22: 101717.kh miga ashg_grc

An Initial Draft of Human Centromere Sequence Composition

Alpha Satellite Reference Models: ~60 Mb (59571670 bp)

[Figure: centromere reference models drawn between the p-arm and q-arm of chromosomes 1-12, 15-20, X, and Y, with a 100 Kb scale bar; chromosome 7 carries an additional 7pq label; Acrocentric Chr (13, 14, 21, 22) are shown with labels 21p, 14q, 21q]

Page 23: 101717.kh miga ashg_grc

CENTROMERE SEQUENCE ASSEMBLY

1. GRCh38 Alpha Satellite Reference Models

2. Linear Assembly of a Human Centromere

Miga, KH., et al. Genome Research 24.4 (2014): 697-707.

Page 24: 101717.kh miga ashg_grc

LINEAR ASSEMBLY OF A HUMAN CENTROMERE ON THE Y CHROMOSOME

Small, haploid satellite array with well-characterized 5.8 kb repeat

p-arm q-arm

Page 25: 101717.kh miga ashg_grc

BACS: OVERLAP-LAYOUT-ASSEMBLY

p-arm q-arm

Collection of 9 BACs known to span the Y Centromere

Overlap determined by single copy sequence variants

Tilford et al 2001 Nature
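Ordering BACs by the single copy variants they share can be sketched in Python as a simple greedy layout. The four BAC names and variant labels below are hypothetical stand-ins (the slide refers to 9 real BACs), and the greedy walk is an assumed simplification, not the procedure used in the talk.

```python
def layout(bacs: dict[str, set[str]]) -> list[str]:
    """Order BACs by walking overlaps defined by shared single copy variants."""
    names = sorted(bacs)
    # Two BACs overlap if they share at least one variant.
    neighbors = {
        n: [m for m in names if m != n and bacs[n] & bacs[m]] for n in names
    }
    # Start from an end of the tiling path: a BAC with the fewest overlaps.
    current = min(names, key=lambda n: (len(neighbors[n]), n))
    order = [current]
    while True:
        candidates = [m for m in neighbors[current] if m not in order]
        if not candidates:
            break
        # Step to the unvisited neighbor sharing the most variants.
        current = max(candidates, key=lambda m: (len(bacs[current] & bacs[m]), m))
        order.append(current)
    return order

# Hypothetical BACs, each tagged with the single copy variants it contains.
toy_bacs = {
    "BAC_C": {"v4", "v5"},
    "BAC_A": {"v1", "v2"},
    "BAC_D": {"v5", "v6"},
    "BAC_B": {"v2", "v3", "v4"},
}
print(layout(toy_bacs))
```

Because the variants are single copy, a shared variant places two BACs at the same position in the array, which is information the repeat sequence itself cannot provide.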

Page 26: 101717.kh miga ashg_grc

HIGH QUALITY + LONG (100 kb +) READS

~100 kb

Collapsed Representation

Challenge of Assembling Identical Tandem Repeats with Short Reads

Page 27: 101717.kh miga ashg_grc

HIGH QUALITY + LONG (100 kb +) READS

High Quality Consensus Sequence

~100 kb

Page 28: 101717.kh miga ashg_grc

NANOPORE SEQUENCING: LONGBOARD (1D)

UCSC LONGBOARD 1D PROTOCOL

Page 29: 101717.kh miga ashg_grc

NANOPORE SEQUENCING: LONGBOARD (1D)

UCSC LONGBOARD 1D PROTOCOL

Page 30: 101717.kh miga ashg_grc

UCSC LONGBOARD 1D PROTOCOL

In total, we have generated 3500+ reads greater than 150 kb

NANOPORE SEQUENCING: LONGBOARD (1D)

Page 31: 101717.kh miga ashg_grc

MULTIPLE ALIGNMENT STRATEGY TO IMPROVE QUALITY BY CONSENSUS

High Quality Consensus Requires Modest Coverage

UCSC LONGBOARD 1D PROTOCOL
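The idea that a consensus over several error-prone reads beats any single read can be sketched in Python with a column-wise majority vote. The sketch assumes the reads are already multiply aligned, with "-" marking gaps; the five toy reads and the function name are hypothetical, and production consensus tools use more elaborate models.

```python
from collections import Counter

def column_consensus(aligned_reads):
    """Majority vote per alignment column; columns won by a gap are dropped."""
    if len({len(r) for r in aligned_reads}) != 1:
        raise ValueError("aligned reads must all have the same length")
    consensus = []
    for column in zip(*aligned_reads):
        base, _count = Counter(column).most_common(1)[0]
        if base != "-":
            consensus.append(base)
    return "".join(consensus)

truth = "ATCCGATTACG"
# Hypothetical aligned reads; four of the five carry one error each.
reads = [
    "ATCCGATTACG",
    "ATCGGATTACG",   # substitution
    "ATCC-ATTACG",   # deletion
    "ATCCGATTAAG",   # substitution
    "TTCCGATTACG",   # substitution
]
print(column_consensus(reads) == truth)
```

Since the errors fall in different columns, each is outvoted by the other reads, which is why modest coverage can already yield a high quality consensus.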

Page 32: 101717.kh miga ashg_grc

RP11 718M18 221.4 kb

Vector

Insert

634 Predicted Nucleotide Variants

2 Tandem Structural Rearrangements

38 CENY RPTS (>99% Identity to published consensus)

Homopolymers [A]n

Homopolymers [T]n

Page 33: 101717.kh miga ashg_grc

IDENTIFY SINGLE COPY VARIANTS USING ILLUMINA DATA

Identify informative, single copy sites in the array useful for overlap-based BAC assembly

RP11 718M18 221.4 kb

VALIDATE HIGH-CONFIDENCE SINGLE COPY VARIANTS WITH ILLUMINA

RP11 718M18 221.4 kb

Page 34: 101717.kh miga ashg_grc

VALIDATE HIGH-CONFIDENCE SINGLE COPY VARIANTS
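One plausible way to check a candidate single copy variant against short read data is to compare the read count of a k-mer spanning the variant with the expected per-copy depth. The Python sketch below assumes that approach; the talk does not spell out its validation procedure, and the depth, tolerance, k-mer size, and toy reads are all hypothetical.

```python
from collections import Counter

def kmer_counts(reads, k):
    """Count every k-mer occurrence across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def single_copy_kmers(candidates, counts, depth, tolerance=0.5):
    """Keep candidate k-mers whose count is within tolerance of one-copy depth."""
    low, high = depth * (1 - tolerance), depth * (1 + tolerance)
    return [kmer for kmer in candidates if low <= counts[kmer] <= high]

# Hypothetical short reads: the repeat consensus is sampled from many array
# copies, while the variant-bearing sequence is sampled from a single copy.
toy_reads = ["CCGATTACGAT"] * 40 + ["CCGATTTCGAT"] * 4
counts = kmer_counts(toy_reads, k=7)

candidates = ["GATTACG", "GATTTCG"]   # consensus k-mer, variant k-mer
print(single_copy_kmers(candidates, counts, depth=4))
```

Only the variant k-mer has a count near the assumed one-copy depth, so it would be retained as a candidate marker for BAC overlap.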

Page 35: 101717.kh miga ashg_grc

LINEAR ASSEMBLY OF HUMAN Y CENTROMERE

Page 36: 101717.kh miga ashg_grc

Future Perspective

1. Linear assemblies of human centromeric regions improve in step with sequencing technology (i.e. read length and quality)

2. One genome is not enough: Highly variable

3. Linear CEN assemblies present a mapping challenge to most genomic applications

Page 37: 101717.kh miga ashg_grc

True Linear Maps of Human CEN Regions

Y CEN

True Linear Arrangement

Informatics/Analysis Data Structure

Page 38: 101717.kh miga ashg_grc

Key Advantages of Satellite DNA Graphs

1. Eliminates sequence redundancy

Page 39: 101717.kh miga ashg_grc

Key Advantages of Satellite DNA Graphs

1. Eliminates sequence redundancy

Improves Unambiguous Short Read Mapping

[Figure: a read (5’ to 3’) containing REPEAT has an ambiguous placement (?) on a linear array of REPEAT REPEAT REPEAT]

Centromere Graphs: Demonstrate unambiguous mapping of the majority (> 98%) of 1000 Genomes alpha satellite reads

Benedict Paten, Adam Novak

Page 40: 101717.kh miga ashg_grc

Key Advantages of Satellite DNA Graphs

1. Eliminates sequence redundancy

2. Information describing long-range haplotypes is retained as defined “paths” in the graph

Page 41: 101717.kh miga ashg_grc

Key Advantages of Satellite DNA Graphs

1. Eliminates sequence redundancy

2. Information describing long-range haplotypes is retained as defined “paths” in the graph

3. Graph data structure and sequence analysis tools will be consistent with the rest of the human genome

The major histocompatibility complex (Kiran Garimella & Gil McVean)
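The first two advantages listed above can be illustrated with a small Python sketch: identical monomers collapse into single graph nodes, and the original linear arrangement survives as a path through those nodes. The monomer labels, the toy array, and the helper names are hypothetical, and real sequence graphs store sequences and variants in far richer form.

```python
from collections import Counter

def build_graph(array):
    """Collapse a linear monomer array into distinct nodes and counted edges."""
    nodes = sorted(set(array))
    edges = Counter(zip(array, array[1:]))
    return nodes, edges

def path_is_valid(path, edges):
    """A haplotype path is valid if every consecutive step is a graph edge."""
    return all((a, b) in edges for a, b in zip(path, path[1:]))

# Hypothetical array: seven 3-monomer units, one of them carrying variant m2b.
unit = ["m1", "m2", "m3"]
array = unit * 4 + ["m1", "m2b", "m3"] + unit * 2

nodes, edges = build_graph(array)
print(len(array), "monomers ->", len(nodes), "nodes,", len(edges), "edges")
print(path_is_valid(array, edges))
```

A short read from any m1-m2-m3 copy then maps to one place in the graph instead of many equivalent places in a linear array, while the stored path still records where the m2b variant sits along the haplotype.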

Page 42: 101717.kh miga ashg_grc

Acknowledgements

Creating (and mapping to) a Universal Reference Genome

Benedict Paten, Adam Novak, David Haussler, UC Santa Cruz

Mark Akeson, Miten Jain, Hugh Olsen, Benedict Paten, Dave Deamer

Robin AbuShumays, Andrew Smith, Ian Fiddes, Art Rand, Logan Mulroney

Jordan Eizenga, Rojin Safavi, Rachel Lawton, Andrew Bailey, Ariah Mackie

David Haussler, Benedict Paten

Jim Kent, Sofie Salama

UCSC Nanopore Analysis Group: Miten Jain, Hugh Olsen, Mark Akeson

Oxford Nanopore Technologies: Dan Turner, David Stoddart

Huntington F. Willard, David Page

Product          Version
Device           MinION MK1
Flow cell        FLO-MIN106
Kits             Rapid Sequencing Kit
Data analysis    Albacore 1.0.1 Metrichor 1D