Karen Miga UC Santa Cruz

Post on 20-Aug-2020


Page 1: Karen Miga UC Santa Cruz - Amazon S3...Primase, DNA, polypeptide 2 (prim2), mRNA chr3/chr6 Paralogous (non-sat) Region ~300 kb CHM1: LJ1101000307.1 Full Coverage (~60x) Low Coverage

Karen Miga

UC Santa Cruz

Page 2:

Assessing variation in the human genome

enables discovery research

“Much of the missing heritability (the ‘dark matter’ of the genome) will probably turn up as the technology advances.”

- Francis Collins

Nature 464, 674-675 (2010)

Page 3:

CEN

CENTROMERIC REGIONS

Millions of bases of repetitive DNA

The promise of long read sequences to improve

sequence variant discovery

Page 6:

NO LONGER CONSIDERED “JUNK DNA”

Function of Centromeric and Heterochromatic

DNA

CENTROMERE

FUNCTION CANCER AGING

Page 7:

PacBio Long Read Sequences to Predict

Satellite Sequence Variants (satVARs)

Quality Corrected

Reads

Automated

Sequence

Characterization

Genome

SatVAR Discovery

CEN

Generate a profile of satellite

variants for a given individual

genome

Page 8:

CEN

~171bp

Tandem Repeat

Wide Range of Percent ID: ~60-100%

ALPHA SATELLITE

1 2 3 4

Alpha satellite defines all normal human centromeres

Page 10:

CEN

~171bp

Tandem Repeat

Wide Range of Percent ID: ~60-100%

ALPHA SATELLITE

1 2 3 4 1 2 3 4 1 2 3 4

“Higher Order Repeat”: multi-monomeric repeat unit

Alpha satellite repeats (or monomers) are commonly found in long arrays of near-identical higher-order repeats

Page 11:

CEN

~171bp

Tandem Repeat

Wide Range of Percent ID: ~60-100%

ALPHA SATELLITE

1 2 3 4

Satellite DNA is the primary sequence in each gap

1 2 3 4 1 2 3 4

Narrow Range of Percent ID: 94-100%

Alpha satellite repeats (or monomers) are commonly found in long arrays of near-identical higher-order repeats
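
The two identity ranges on this slide (wide between neighbouring monomers, narrow between copies of the same monomer in successive higher-order repeats) can be illustrated with a small sketch. Everything here is an assumption for illustration: the function name `percent_identity` is hypothetical, the 20-base strings are toy stand-ins for ~171 bp monomers, and the comparison is positional on equal-length strings, whereas real monomers would first need to be aligned.

```python
def percent_identity(a: str, b: str) -> float:
    """Percent of matching positions between two pre-aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Toy 20-base stand-ins for ~171 bp monomers (hypothetical sequences).
monomer_1 = "ACGTTGCAAGCTTAGGCTAA"
monomer_1_next_copy = "ACGTTGCAAGCTTAGGCTAT"  # same position, next repeat unit
monomer_2 = "ACGATGCTAGGTTACGCAAA"            # neighbouring monomer, same unit

between_units = percent_identity(monomer_1, monomer_1_next_copy)
within_unit = percent_identity(monomer_1, monomer_2)
print(f"same monomer, adjacent units: {between_units:.1f}%")
print(f"neighbouring monomers:        {within_unit:.1f}%")
```

In this toy case the copy-to-copy comparison falls in the narrow range and the neighbour comparison falls in the wide range, which is the contrast the slide draws.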

Page 12:

CEN

CEN

Array 1

Array 2 Array 3

Each chromosome has different centromeric sequences

Page 13:

CEN

Array 1. Individual A

Array 1. Individual B

A

B

~0.5 Mb

~2.0 Mb

Higher-order arrays vary between individuals

Page 14:

Higher-order arrays can vary between homologous chromosomes in the same individual

CEN

Array 1. maternally inherited

Array 1. paternally inherited

~0.5 Mb

~2.0 Mb

mat

pat

Page 17:

CEN

Model of Centromere Sequence Organization

Array 1

] [ n

8-mer

Array 2

] [ n

4-mer

DELETION

(6-mer)

INSERTION

(12-mer)

Rearrangements in repeat structure Shifts in repeat orientation

?

Page 19:

DELETION

(6-mer)

INSERTION

(12-mer)

Rearrangements in repeat structure Shifts in repeat orientation

?

Sites of Interspersed Repeats

LINE

Junction with seemingly unique DNA

Transcribed Genes

Implement a strategy to characterize satellite sequence variants with

long-read sequences

Page 20:

Implement a strategy to characterize satellite sequence variants with

long-read sequences

github.com/volkansevim/alpha-CENTAURI

ALPHA satellite CENTromeric AUtomated Repeat Identification

• Reads shorter than the underlying repeat structure must rely on indirect inference methods; long reads allow direct inference of satellite higher-order repeat structure.

87606 error-corrected preads

Human Centromeric DNA

Variants:

Alpha Satellite

CHM1 GENOME

Page 21:

3′ 5′

1,9S1

2.5 kb quality corrected PacBio read

Page 24:

3′ 5′

1. Identifies clusters of monomers with high sequence similarity (FALCON error correction module)

2. Sets the cluster similarity threshold per read by evaluating a range of identity values (98% to 88%, in 1% decrements)

3. Evaluates the spacing between monomers involved in each cluster group

98% Identical

# bases # bases
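
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the alpha-CENTAURI implementation: monomers are assumed to be already extracted, ordered along the read, and of equal length, so identity is a simple positional match. The function names, the single-linkage clustering rule, the "no singleton clusters" acceptance rule, and the toy sequences are all hypothetical choices made for the example.

```python
from typing import List, Optional, Tuple


def identity(a: str, b: str) -> float:
    """Percent identity of two equal-length, pre-aligned sequences."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_monomers(monomers: List[str], threshold: float) -> List[List[int]]:
    """Step 1: single-linkage clusters of monomer indices at a given identity."""
    parent = list(range(len(monomers)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(monomers)):
        for j in range(i + 1, len(monomers)):
            if identity(monomers[i], monomers[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: dict = {}
    for i in range(len(monomers)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def infer_period(monomers: List[str]) -> Optional[Tuple[int, int]]:
    """Steps 2 and 3: sweep thresholds 98..88 and check monomer spacing.

    Returns (threshold, period) for the first threshold at which every
    monomer has a partner and all within-cluster spacings agree.
    """
    for threshold in range(98, 87, -1):
        clusters = cluster_monomers(monomers, threshold)
        if any(len(c) < 2 for c in clusters):
            continue  # some monomer is still unclustered; relax the threshold
        spacings = {b - a for c in clusters for a, b in zip(c, c[1:])}
        if len(spacings) == 1:
            return threshold, spacings.pop()
    return None


# Toy read: three divergent monomers (A, B, C) repeated three times, with the
# middle unit carrying one substitution per monomer (95% identity to its copies).
A, B, C = "ACGTTGCAAGCTTAGGCTAA", "ACGATGCTAGGTTACGCAAA", "TCGTAGCAAGCATAGCCTTA"
A2, B2, C2 = "ACGTTGCAAGCTTAGGCTAT", "TCGATGCTAGGTTACGCAAA", "TCGTAGCAATCATAGCCTTA"
read = [A, B, C, A2, B2, C2, A, B, C]
print(infer_period(read))
```

On this toy read the sweep rejects 98% to 96% because the diverged middle copies stay unclustered, then accepts 95% with a uniform spacing of three monomers, which corresponds to a regular 3-mer repeat.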

Page 25:

3′ 5′

“Regular” Repeat Structure

Page 26:

3′ 5′

5′

3′

Page 29:

D11Z1

5-mer

CEN

chr11

1680

PacBio

preads

REGULAR

Page 30:

IRREGULAR

CEN

chr11

6-mer (89.9%, 391 preads)

1

2

4

5

4-mer (1.4%, 39 preads)

1 2 3 4 5 1 2 3 4 5 1 2 3 4 5 2

15,559 bp pread

3 4-mer

(1.6%, 43 preads)

4-mer (1.4%, 40 preads)

INSERTION

(6-mer)
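
The regular versus irregular distinction can be illustrated on the monomer order shown for the 15,559 bp pread. The sketch below assumes monomers have already been labelled 1..5 by their position in the 5-mer unit; the function names are hypothetical, and the rule used (each label must be the cyclic successor of the previous one) is a simplification for illustration rather than the classification used in the talk.

```python
from typing import List


def find_breaks(labels: List[int], period: int) -> List[int]:
    """Indices where a label is not the cyclic successor of the previous one.

    Labels are assumed to run from 1 to `period`.
    """
    breaks = []
    for i in range(1, len(labels)):
        expected = labels[i - 1] % period + 1  # 5 is followed by 1 when period=5
        if labels[i] != expected:
            breaks.append(i)
    return breaks


def classify(labels: List[int], period: int) -> str:
    """Call a read REGULAR if monomer order never deviates, else IRREGULAR."""
    return "IRREGULAR" if find_breaks(labels, period) else "REGULAR"


# Monomer order as listed on the slide for the 15,559 bp pread.
pread_labels = [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 2]
print(find_breaks(pread_labels, period=5), classify(pread_labels, period=5))
```

Here the only deviation is the final step from monomer 5 to monomer 2, so the read is flagged as irregular at that position.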

Page 31:

CEN

chr11

INVERSION

6-mer 2

4

5

Inversion: 1 event, junction: 236 bp

4-mer 4-mer

4-mer

1

3

In total, ~5% (4493/87606) of all alpha satellite reads provide evidence for an inversion
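
One way to picture how a read provides evidence for an inversion is to look at the strand of each monomer along the read and count orientation switches. The sketch below assumes per-monomer strand calls are already available as a string of `+` and `-` characters; that input format and the function name are hypothetical. The last lines simply recompute the fraction quoted on the slide.

```python
def count_orientation_switches(strands: str) -> int:
    """Number of places where consecutive monomers change strand."""
    return sum(1 for a, b in zip(strands, strands[1:]) if a != b)


# Toy read: four forward monomers followed by four reverse monomers.
toy_read = "++++----"
print(count_orientation_switches(toy_read))

# Fraction of alpha satellite reads with inversion evidence, from the slide.
inversion_reads, total_reads = 4493, 87606
percent_with_inversion = round(100 * inversion_reads / total_reads, 1)
print(f"{percent_with_inversion}% of reads")
```

A read with a single switch corresponds to the "1 event" case shown for chr11; reads with no switch have a uniform orientation.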

Page 32:

TRACKING SITES OF INTERSPERSED REPEATS

LINE/L1 L1Hs (2384 bp) LINE/L1 L1P3 (1358 bp)

96% recent LINEs

L1Hs, L1P1, L1PA2-4

Page 33:

chr3 CEN

3918 preads that contain both alpha satellite and at least 10 kb of non-alpha satellite sequence

Identify Junctions with seemingly unique DNA

Primase, DNA, polypeptide 2 (prim2), mRNA

chr3/chr6 Paralogous (non-sat) Region

~300 kb

CHM1: LJ1101000307.1
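
The junction criterion on this slide (a pread containing alpha satellite plus at least 10 kb of non-alpha satellite sequence) can be written as a simple filter. This is a sketch under assumptions: alpha satellite annotations are taken to be half-open `(start, end)` intervals in read coordinates, and the function names and the example reads are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # half-open (start, end) in read coordinates


def covered_bases(intervals: List[Interval]) -> int:
    """Total bases covered by the intervals, counting overlaps once."""
    total, current_end = 0, 0
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip bases already counted
        if end > start:
            total += end - start
            current_end = end
    return total


def is_junction_read(read_length: int,
                     alpha_intervals: List[Interval],
                     min_non_alpha: int = 10_000) -> bool:
    """True if the read has alpha satellite and enough non-alpha sequence."""
    alpha = covered_bases(alpha_intervals)
    return alpha > 0 and (read_length - alpha) >= min_non_alpha


# Hypothetical reads: (length, alpha satellite intervals).
reads = {
    "read_a": (15_559, [(0, 5_000)]),            # 10,559 bp non-alpha: kept
    "read_b": (15_559, [(0, 6_000), (5_000, 9_000)]),  # 6,559 bp non-alpha
    "read_c": (15_559, []),                      # no alpha satellite at all
}
junction_reads = [name for name, (length, ivs) in reads.items()
                  if is_junction_read(length, ivs)]
print(junction_reads)
```

Merging overlapping annotations before counting avoids overstating the satellite content when monomer calls overlap.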

Page 34:

Full Coverage (~60x)

Low Coverage (10x)

Page 36:

“Unmapped”

Database

Page 37:

PacBio Long Read Sequences to Predict

Satellite Sequence Variants (satVARs)

CHM1

Genome

SatVAR Discovery

CEN

Profile of satellite DNA variants

CHM1, CHM13

TRIO data sets

(CEPH and GIAB)

Page 39:

1000 Genome Sequence Data

(400 male individuals)