101717.kh miga ashg_grc
TRANSCRIPT
Centromere Sequence Assembly
Karen H. Miga University of California, Santa Cruz
10/17/17 GRC GIAB Workshop
ASHG
[Figure: chromosome schematic labeled P-ARM, CEN, Q-ARM, with megabase-sized gaps at the centromere]
HUMAN CENTROMERES: MULTI-MEGABASE SIZED GAPS IN ALL CHROMOSOME ASSEMBLIES
PROGRESS UPDATE: CENTROMERE SEQUENCE ASSEMBLIES
1. GRCh38 Reference Models for Human Centromere Arrays
2. Efforts to Generate True, Linear Assemblies of Centromeric regions: Chromosome Y
3. Future Perspective
[Figure: p-arm and q-arm flanking multi-megabase sized arrays of satellite DNA, e.g. ...ATCCGATTACG ATCCGATTACG ATCCGATTACG...]
CHALLENGE OF ASSEMBLING LONG TRACTS OF (NEAR IDENTICAL) TANDEM REPEATS
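The ambiguity can be illustrated with a minimal sketch. It reuses the ATCCGATTACG unit from the slide; the array size and the flanking sequences are hypothetical values chosen only for this toy example. A read that lies entirely inside the array matches at almost every repeat copy, while a read that reaches unique flanking sequence matches once.

```python
def match_positions(reference, read):
    """Return every start offset where `read` matches `reference` exactly."""
    positions = []
    start = reference.find(read)
    while start != -1:
        positions.append(start)
        start = reference.find(read, start + 1)
    return positions

UNIT = "ATCCGATTACG"        # repeat unit shown on the slide
COPIES = 50                 # hypothetical array size
FLANK_P = "GGGTTTAAACCC"    # hypothetical unique p-arm sequence
FLANK_Q = "TTTGGGCCCAAA"    # hypothetical unique q-arm sequence
reference = FLANK_P + UNIT * COPIES + FLANK_Q

inside_read = UNIT * 2                     # lies entirely inside the array
anchored_read = FLANK_P[-6:] + UNIT * 2    # spans the p-arm/array junction

inside_hits = match_positions(reference, inside_read)
anchored_hits = match_positions(reference, anchored_read)
print(len(inside_hits), "placements inside the array;",
      len(anchored_hits), "placement for the anchored read")
```

Real satellite arrays are near identical rather than identical, so exact matching is a simplification here.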
[Figure: ALPHA SATELLITE array between the p-arm and q-arm, monomers numbered 1 2 3 4]
~171 bp Tandem Repeat
Wide Range of Percent ID: ~60-100%
HUMAN CENTROMERES: ALPHA SATELLITE
Narrow Range of Percent ID: 94% - 100%
“Higher Order Repeat”
Multi-monomeric Repeat Unit
[Figure: multi-monomeric repeat units 1 2 3 4, 1 2 3 4, 1 2 3 4 tiled between the p-arm and q-arm]
HIGHER ORDER REPEATS
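A rough sketch of how the two percent-identity ranges separate monomers from higher order repeats. The monomers below are synthetic, built deterministically for illustration and not taken from real alpha satellite. The 94% threshold comes from the range on the slide, and `hor_period` is a hypothetical helper name. Neighbouring monomers inside one HOR unit are divergent, while monomers one full HOR unit apart are nearly identical.

```python
SUB = {"A": "C", "C": "G", "G": "T", "T": "A"}   # always changes the base
MONOMER_LEN = 171

def substitute(seq, positions):
    """Return `seq` with a substitution applied at each listed position."""
    chars = list(seq)
    for p in positions:
        chars[p] = SUB[chars[p]]
    return "".join(chars)

def percent_identity(a, b):
    """Percent of matching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def hor_period(monomers, threshold=94.0):
    """Smallest monomer lag whose mean pairwise identity reaches `threshold`."""
    for lag in range(1, len(monomers)):
        pairs = list(zip(monomers, monomers[lag:]))
        mean = sum(percent_identity(a, b) for a, b in pairs) / len(pairs)
        if mean >= threshold:
            return lag
    return None

# Synthetic array: 3 copies of a 4-monomer HOR unit.
base = ("ACGTTGCA" * 22)[:MONOMER_LEN]
monomers = []
for copy in range(3):
    for j in range(4):
        # Monomer type j differs from the base at every 10th position.
        positions = [p for p in range(MONOMER_LEN) if p % 10 == j]
        if copy > 0:
            positions.append(9 + 10 * copy)   # one copy-specific variant
        monomers.append(substitute(base, positions))

print(hor_period(monomers))
print(round(percent_identity(monomers[0], monomers[1]), 1))   # adjacent monomers
print(round(percent_identity(monomers[0], monomers[4]), 1))   # one HOR unit apart
```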
[Figure: two chromosome schematics (chrX, chr3), each with p-arm and q-arm, carrying arrays labeled “A”, “B”, and “C”]
CHROMOSOME-SPECIFIC SATELLITE SEQUENCE ORGANIZATION
GENOME MODEL OF SEQUENCE ORGANIZATION IN CENTROMERE-ASSIGNED GAPS
[Figure: centromere-assigned gap between the p-arm and q-arm, with sites marked -A- and -T-. Labels: LINE, SINE, OTHER, NON-ALPHA SATELLITE, Unmapped (Yet Assembled) Scaffolds]
1. GRCh38 Alpha Satellite Reference Models
Step 1: Characterize HORs in Human Genome
[Figure: HOR arrays labeled A B C D E F]
>200 ENCODE datasets
Step by Step Example For Single P-read, I
• Example: p-read 78e9dc60_12326_5223 (internal read identifier)
– p-read length 12257
– Self-self dot plot
[Figure: self-self dot plot of the p-read, both axes spanning 0 to 12000 bp]
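A minimal sketch of what a self-self dot plot measures, using exact k-mer matches as a simplifying assumption (real reads contain errors, and this is not the alpha-CENTAURI implementation). Every pair of positions sharing a k-mer is a dot, and dots at the same offset form a diagonal. In a tandem array the diagonals fall at multiples of the repeat period. The toy sequence reuses the 11 bp unit from the earlier slide.

```python
from collections import Counter, defaultdict

def dotplot_offsets(seq, k):
    """Count off-diagonal dots (shared k-mers) by distance from the main diagonal."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    offsets = Counter()
    for positions in index.values():
        for a in positions:
            for b in positions:
                if b > a:
                    offsets[b - a] += 1
    return offsets

UNIT = "ATCCGATTACG"    # 11 bp toy repeat unit
seq = UNIT * 6          # toy read: six tandem copies
offsets = dotplot_offsets(seq, k=11)

for offset in sorted(offsets):
    print(offset, offsets[offset])
```

The smallest offset with a strong diagonal gives the repeat period. For alpha satellite one would expect diagonals near multiples of ~171 bp, with the strongest ones at the HOR length.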
α-Centauri (centromeric automated repeat identification)
12,257 bp PacBio Read with 70 adjacent monomers (NA12878, 10x Coverage HuPac Data)
[Figure: monomers along the read, 5’ to 3’; arrays A B C D E F, each labeled 10]
http://github.com/volkansevim/alpha-CENTAURI
Chromosome specific assignment
[Figure: HOR arrays A B C D E F, with a “?” marking the unknown chromosome assignment]
Experimental Evidence: FISH Hybridization/Mapping and Screening Somatic Cell Hybrid Panel
[Figure: HOR arrays A B C D E F]
D7Z1 6-mer
Waye et al (1987) 98% GenBank: M16101
Flow Sorted Chromosome Alignment/Enrichment: Sequence enrichment analysis of isolated human chromosomes
Long Range Paired Read Support: “Anchor” mapped to the assembled p-arm and/or q-arm
Chromosome-assignment of Higher Order Repeats
1. GRCh38 Alpha Satellite Reference Models
Step 1: Characterize HORs in Human Genome
Step 2: Constructing WGS Read Libraries for each HOR array
e.g. CENX, DXZ1 (12-mer): 1 2 3 4 5 6 7 8 9 10 11 12
[Figure: HuRef WGS Sanger read Db; labels: HuRef, LINE, A/T]
[Figure: sequence graph of monomer variants m1v1 through m12v1, with two variants of monomer 2 (m2v1, m2v2) and a LINE node; edge weights of 1.0, 0.7, 0.5, and 0.3; labels: LINE, A/T]
Step 3: Model Array Variants in Sequence Graph:
linearSat
• 2nd Order Markov Chain
• Length determined by normalized array length estimates
Not the “true” long-range organization, yet adequately represents the alpha satellite array sequence
https://github.com/JimKent/linearSat
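A hedged sketch of the second-order Markov chain idea. This is not the linearSat implementation. The monomer labels follow the m1v1/m2v2 naming from the graph slide, but the observed paths, the start state, and the target length are hypothetical. The next monomer variant is sampled given the previous two, with weights taken from how often each triple was observed in reads, until the model reaches the requested length.

```python
import random
from collections import Counter, defaultdict

def train_second_order(observed_paths):
    """Count which label follows each ordered pair of labels."""
    counts = defaultdict(Counter)
    for path in observed_paths:
        for a, b, c in zip(path, path[1:], path[2:]):
            counts[(a, b)][c] += 1
    return counts

def generate(counts, start, length, rng):
    """Extend `start` (two labels) by sampling from the observed transitions."""
    path = list(start)
    while len(path) < length:
        options = counts.get((path[-2], path[-1]))
        if not options:
            break                      # no observed continuation
        labels = list(options)
        weights = [options[label] for label in labels]
        path.append(rng.choices(labels, weights=weights, k=1)[0])
    return path

# Hypothetical read-derived paths over a 4-monomer HOR with two m2 variants.
U1 = ["m1v1", "m2v1", "m3v1", "m4v1"]
U2 = ["m1v1", "m2v2", "m3v1", "m4v1"]
observed = [U1 + U2 + U1, U2 + U1 + U1]

counts = train_second_order(observed)
rng = random.Random(0)                 # fixed seed for reproducibility
model_path = generate(counts, start=("m1v1", "m2v1"), length=40, rng=rng)
print(model_path[:12])
```

Every triple in the generated path was seen in the input, so local variant linkage is preserved while the long-range order is only a plausible sample. This matches the caveat on the slide.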
LINEAR ORDERING OF REFERENCE MODELS AND ASSEMBLED CONTIGS USING MATE PAIRS
[Figure: chrX panel: Xp, CENX, Xq; 3.8 Mb. chr3 panel: 3p, CEN3.1, CEN3.2, 3q; 2.25 Mb (~860 HOR units); 0.73 Mb (~43 HOR units); 0.3 Mb Low Copy Repeat; Unmapped HuRef Assembled Contig(s) (e.g. ABBA01185959)]
[Figure: per-chromosome panels with p-arm and q-arm labels for chromosomes 1-12, 15-20, X, and Y; scale bar 100 Kb; Acrocentric Chr (13, 14, 21, 22) labeled separately]
An Initial Draft of Human Centromere Sequence Composition
Alpha Satellite Reference Models: ~60 Mb (59571670 bp)
CENTROMERE SEQUENCE ASSEMBLY
1. GRCh38 Alpha Satellite Reference Models
2. Linear Assembly of a Human Centromere
Miga, K.H., et al. Genome Research 24.4 (2014): 697-707.
LINEAR ASSEMBLY OF A HUMAN CENTROMERE ON THE Y CHROMOSOME
Small, haploid satellite array with well-characterized 5.8 kb repeat
BACS: OVERLAP-LAYOUT-ASSEMBLY
[Figure: BACs spanning the array between the p-arm and q-arm]
Collection of 9 BACs known to span the Y Centromere
Overlap determined by single copy sequence variants
Tilford et al 2001 Nature
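A simplified sketch of overlap-layout using single copy variants as markers. The BAC names and variant labels are hypothetical, and each BAC is reduced to its ordered list of variants. Two BACs overlap when the end of one carries the same variants, in the same order, as the start of the other. Chaining the best overlaps gives a tiling path.

```python
def suffix_prefix_overlap(left, right):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left[-size:] == right[:size]:
            return size
    return 0

def tiling_path(bacs):
    """Greedy layout: follow each BAC's best overlap from the unique start."""
    best_next = {}
    for a in bacs:
        candidates = [(suffix_prefix_overlap(bacs[a], bacs[b]), b)
                      for b in bacs if b != a]
        size, b = max(candidates)
        if size > 0:
            best_next[a] = (b, size)
    successors = {b for b, _ in best_next.values()}
    starts = [name for name in bacs if name not in successors]
    if len(starts) != 1:
        raise ValueError("expected exactly one BAC without a predecessor")
    order = [starts[0]]
    merged = list(bacs[starts[0]])
    while order[-1] in best_next:
        nxt, size = best_next[order[-1]]
        if nxt in order:
            raise ValueError("cycle in overlap graph")
        merged.extend(bacs[nxt][size:])   # append only the non-overlapping part
        order.append(nxt)
    return order, merged

# Hypothetical BACs, each listed as its ordered single copy variants.
bacs = {
    "BAC_B": ["v3", "v4", "v5", "v6"],
    "BAC_A": ["v1", "v2", "v3", "v4"],
    "BAC_C": ["v5", "v6", "v7", "v8"],
}
order, merged = tiling_path(bacs)
print(order)
print(merged)
```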
HIGH QUALITY + LONG (100 kb +) READS
Challenge of Assembling Identical Tandem Repeats with Short Reads
Collapsed Representation
~100 kb
High Quality Consensus Sequence
NANOPORE SEQUENCING: LONGBOARD (1D)
UCSC LONGBOARD 1D PROTOCOL
In total, we have generated 3500+ reads greater than 150 kb
MULTIPLE ALIGNMENT STRATEGY TO IMPROVE QUALITY BY CONSENSUS
High Quality Consensus Requires Modest Coverage
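A minimal sketch of consensus by multiple alignment, assuming the reads are already aligned into rows of equal length with "-" marking gaps (the alignment step itself is not shown). The reads are toy data. Each column is decided by majority vote, so an error present in one read is outvoted as long as the other reads agree.

```python
from collections import Counter

def consensus(aligned_reads):
    """Column-wise majority vote over pre-aligned reads; gap-majority columns are dropped."""
    length = len(aligned_reads[0])
    if any(len(read) != length for read in aligned_reads):
        raise ValueError("aligned reads must all have the same length")
    bases = []
    for column in zip(*aligned_reads):
        base, _ = Counter(column).most_common(1)[0]
        if base != "-":
            bases.append(base)
    return "".join(bases)

# Toy alignment: three reads carry one error each (substitution, insertion, deletion).
aligned = [
    "ATCCG-ATTACG",
    "ATCCG-ATTACG",
    "ATCGG-ATTACG",   # substitution
    "ATCCGTATTACG",   # insertion
    "ATCCG-AT-ACG",   # deletion
]
print(consensus(aligned))
```

Five-fold coverage is enough here because the errors fall in different columns. Systematic errors such as homopolymer miscalls would not be corrected by voting alone.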
RP11 718M18 221.4 kb
Vector
Insert
634 Predicted Nucleotide Variants
2 Tandem Structural Rearrangements
38 CENY RPTS (>99% Identity to published consensus)
Homopolymers [A]n
Homopolymers [T]n
Identify informative, single copy sites in the array that are useful for overlap-based BAC assembly
IDENTIFY SINGLE COPY VARIANTS USING ILLUMINA DATA
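One way to sketch the idea is with k-mer counts. The slide does not state the actual validation procedure, so this is an assumed approach, and the sequences, read length, and tolerance are hypothetical. A k-mer spanning a true single copy variant should appear in short reads at roughly single-copy depth, whereas k-mers from the canonical repeat appear at many times that depth.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def single_copy_kmers(candidates, counts, depth, tolerance=0.5):
    """Keep candidates whose count lies within `tolerance` of single-copy depth."""
    low, high = depth * (1 - tolerance), depth * (1 + tolerance)
    return [kmer for kmer in candidates if low <= counts[kmer] <= high]

UNIT = "ATCCGATTACG"       # canonical toy repeat unit
VARIANT = "ATCCGTTTACG"    # same unit carrying one hypothetical variant
array = UNIT * 10 + VARIANT + UNIT * 10

READ_LEN, K = 30, 11
# Idealized error-free short reads: one read starting at every position.
reads = [array[i:i + READ_LEN] for i in range(len(array) - READ_LEN + 1)]
counts = count_kmers(reads, K)

depth = READ_LEN - K + 1   # reads covering one interior k-mer occurrence
validated = single_copy_kmers([VARIANT, UNIT], counts, depth)
print(validated, counts[VARIANT], counts[UNIT])
```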
RP11 718M18 221.4 kb
VALIDATE HIGH-CONFIDENCE SINGLE COPY VARIANTS WITH ILLUMINA
LINEAR ASSEMBLY OF HUMAN Y CENTROMERE
Future Perspective
1. Linear assemblies of human centromeric regions improve in step with sequencing technology (i.e. read length and quality)
2. One genome is not enough: Highly variable
3. Linear CEN assemblies present a mapping challenge to most genomic applications
True Linear Maps of Human CEN Regions
Y CEN
True Linear Arrangement
Informatics/Analysis Data Structure
Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
Improves Unambiguous Short Read Mapping
[Figure: a short read (“?”) placed against a linear REPEAT REPEAT REPEAT array, 5’ to 3’, compared with a graph containing a single REPEAT node]
Centromere Graphs (Benedict Paten, Adam Novak)
Demonstrate unambiguous mapping of the majority (> 98%) of 1000 Genomes alpha satellite reads
Key Advantages of Satellite DNA Graphs
1. Eliminates sequence redundancy
2. Information describing long-range haplotypes is retained as defined “paths” in the graph
3. Graph data structure and sequence analysis tools will be consistent with the rest of the human genome
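The first two advantages can be sketched with a toy data structure. This is an illustration only, not the graph format used by the groups credited on these slides, and the repeat units are hypothetical. Identical repeat copies collapse into one node, and the linear haplotype is kept as a path of node identifiers, so no information about the original order is lost.

```python
def build_graph(units):
    """Collapse identical units into nodes; record adjacency edges and the path."""
    node_ids = {}
    edges = set()
    path = []
    for unit in units:
        node = node_ids.setdefault(unit, len(node_ids))
        if path:
            edges.add((path[-1], node))
        path.append(node)
    nodes = {node: seq for seq, node in node_ids.items()}
    return nodes, edges, path

def spell(nodes, path):
    """Reconstruct the linear sequence by walking a path through the graph."""
    return "".join(nodes[node] for node in path)

UNIT = "ATCCGATTACG"       # canonical toy repeat unit
VARIANT = "ATCCGTTTACG"    # unit carrying one hypothetical variant
units = [UNIT] * 5 + [VARIANT] + [UNIT] * 4     # one haplotype, 10 repeat copies

nodes, edges, path = build_graph(units)
print(len(units), "linear copies ->", len(nodes), "nodes")
print(sorted(edges))
print(path)
```

A read equal to the canonical unit has nine candidate placements on the linear haplotype but a single node in the graph, while `spell(nodes, path)` still returns the full linear sequence.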
The major histocompatibility complex (Kiran Garimella & Gil McVean)
Creating (and mapping to) a Universal Reference Genome
Benedict Paten, Adam Novak, David Haussler, UC Santa Cruz
Acknowledgements
Mark Akeson, Miten Jain, Hugh Olsen, Benedict Paten, Dave Deamer
Robin AbuShumays, Andrew Smith, Ian Fiddes, Art Rand, Logan Mulroney
Jordan Eizenga, Rojin Safavi, Rachel Lawton, Andrew Bailey, Ariah Mackie
David Haussler, Benedict Paten
Jim Kent, Sofie Salama
UCSC Nanopore Analysis Group: Miten Jain, Hugh Olsen, Mark Akeson
Oxford Nanopore Technologies: Dan Turner, David Stoddart
Huntington F. Willard, David Page

Product          Version
Device           MinION MK1
Flow cell        FLO-MIN106
Kits             Rapid Sequencing Kit
Data analysis    Albacore 1.0.1, Metrichor 1D