
Post on 17-Jan-2017


Exploiting long read sequencing technology to build a substantially improved pig reference genome sequence

Alan Archibald

The Roslin Institute and R(D)SVS, University of Edinburgh

Draft reference pig genome sequence

Swine Genome Sequencing Consortium

Hybrid Shotgun Sequencing Strategy

Whole-genome shotgun reads

Combine overlapping whole-genome and BAC-derived reads

Assemble clone sequences to represent chromosomes and annotate using Ensembl automated pipeline

BAC shotgun reads

Minimal set of overlapping BACs selected from physical map

Sequence assembly

Sscrofa10.2 – chromosome assigned scaffolds only

Chromosomes 1-18, X, Y

Metric                     Length (bp)
Contigs N50                     80,720
Contigs N90                     13,487
Average contig length           31,604
Largest contig length        1,598,650
Scaffold N50                   637,332
Scaffold N90                   189,449
Average scaffold length        436,176
Largest scaffold length      3,862,550
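The N50 and N90 figures in the table follow the usual definition: sort the sequences from longest to shortest and report the length at which the running total first reaches 50% (or 90%) of all assembled bases. A minimal Python sketch of that calculation is below. The function name `nxx` and the toy lengths are illustrative assumptions, not taken from the Sscrofa10.2 pipeline.

```python
def nxx(lengths, fraction=0.5):
    """Return the Nxx length: the length L such that sequences of
    length >= L together hold at least `fraction` of all bases."""
    if not lengths:
        raise ValueError("no sequence lengths given")
    target = fraction * sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= target:
            return length


# Toy contig lengths, invented for illustration (not Sscrofa10.2 data).
toy_lengths = [100, 80, 60, 40, 20]
n50 = nxx(toy_lengths, 0.5)
n90 = nxx(toy_lengths, 0.9)
print(n50, n90)
```

Because N50 is weighted by bases rather than by sequence count, a handful of very long contigs can raise it sharply, which is why it is normally reported alongside N90 and the average length.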

BAC Contigs / Fragments

(paired) end sequences of subclone libraries

768 subclones / BAC
Av read: 707 bp

phrap

create fragment chains

Submission to EMBL/Genbank

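[Figure: BAC fragments A-G assembled into fragment chain 1 and fragment chain 2; in the submitted sequence the fragments are separated by runs of N and do not necessarily appear in their true order A B C D E F G]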

Limitations of Sscrofa10.2

• Missing coverage ~10%
  – Poorly captured in unplaced scaffolds

• Local scaffolding issues
  – Order & orientation of sequence contigs within BACs not resolved unambiguously
  – No BAC clone sequence assigned to > 1 scaffold

• Unresolved redundancy from overlapping BAC clones

• Project memory loss
  – e.g. unplaced FPC contigs listed at end of q-arm

http://geval.sanger.ac.uk/PGP_pig_10_2/Info/Index

Sscrofa10.2 – QC

• Illumina PE reads from same pig mapped to Sscrofa10.2

• Looked for indicators of structural variation
  – including high/low coverage, incorrect orientation and abnormal insert sizes

• Looked for homozygous variants
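One of the indicators listed above, unusually high or low read coverage, can be sketched as a simple windowed scan. The Python below is an illustration only: the function name, window size and the 0.5x / 2x thresholds are assumptions chosen for the toy example, not the settings used in the Sscrofa10.2 QC.

```python
from statistics import median


def flag_coverage_windows(depths, window=5, low=0.5, high=2.0):
    """Return (start, end, mean_depth, label) for windows whose mean
    depth is below low*median or above high*median of all windows."""
    means = []
    for start in range(0, len(depths), window):
        chunk = depths[start:start + window]
        means.append((start, start + len(chunk), sum(chunk) / len(chunk)))
    # The median window depth stands in for "typical" coverage.
    typical = median(m for _, _, m in means)
    flagged = []
    for start, end, mean in means:
        if mean < low * typical:
            flagged.append((start, end, mean, "low"))
        elif mean > high * typical:
            flagged.append((start, end, mean, "high"))
    return flagged


# Toy per-base depths (invented): one under-covered and one
# over-covered window among otherwise ~30x coverage.
toy_depths = [30] * 5 + [31] * 5 + [5] * 5 + [29] * 5 + [90] * 5
flagged = flag_coverage_windows(toy_depths)
print(flagged)
```

Low-coverage windows may point to sequence in the assembly that is absent from the sequenced animal or misassembled, while high-coverage windows are a common sign of collapsed repeats or duplications.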

Sscrofa10.2 – Chr 1

De novo genome assemblies using Pacific Biosciences long read technology

TJ Tabasco (Duroc 2-14): Duroc sow
MARC1423004: Duroc/Landrace/Yorkshire barrow

PacBio – draft WGS assembly

• Duroc 2-14 (same pig as most of Sscrofa10.2)
• 65x genome coverage
• Pacific Biosciences P6 chemistry
• Length cut-off for reads for assembly: 13 kbp
• Coverage of corrected reads for assembly: 19x
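Coverage figures such as "65x" are total sequenced bases divided by genome size, and a read-length cut-off reduces the coverage that actually enters the assembly. The Python sketch below shows that arithmetic on invented numbers. It does not reproduce the 19x figure above, which refers to corrected reads, and the function name and toy values are assumptions.

```python
def coverage(read_lengths, genome_size, min_length=0):
    """Fold coverage from reads at least `min_length` bases long."""
    kept = [n for n in read_lengths if n >= min_length]
    return sum(kept) / genome_size


# Toy example (invented): a 100 kbp "genome" and seven reads.
toy_reads = [20_000, 15_000, 13_000, 9_000, 5_000, 30_000, 8_000]
toy_genome = 100_000

all_reads_cov = coverage(toy_reads, toy_genome)
long_reads_cov = coverage(toy_reads, toy_genome, min_length=13_000)
print(all_reads_cov, long_reads_cov)
```

In this toy case the 13 kbp cut-off discards three of the seven reads but keeps most of the bases, because long reads carry a disproportionate share of the total.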

Contig QC

Variants

• Homozygous SNPs:
  – Sscrofa10.2: 415,056
  – PacBio contigs: 34,545

• Homozygous indels:
  – Sscrofa10.2: 168,037
  – PacBio contigs: 1,729,510

Scaffolding

• Scaffold by mapping contigs to Sscrofa10.2
  – using Nucmer
  – Assume Sscrofa10.2 gross structure is correct

• Radiation Hybrid and Linkage maps, 60K SNPs

• FPC physical map

• 2.36 Gb ungapped length

• 434 contigs
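The reference-guided scaffolding step above can be pictured as follows: anchor each contig by its best alignment to Sscrofa10.2, then order and orient contigs along each chromosome by anchor position. The Python below is a sketch of that idea only. The tuple layout is a hypothetical stand-in for parsed alignment output (it is not Nucmer's file format), and choosing the longest alignment as the anchor is an assumption.

```python
from collections import defaultdict


def order_contigs(alignments):
    """alignments: iterable of (contig, chrom, ref_start, ref_end, strand).
    Returns {chrom: [(contig, strand), ...]} ordered by reference start,
    using each contig's longest alignment as its anchor."""
    best = {}
    for contig, chrom, start, end, strand in alignments:
        span = end - start
        if contig not in best or span > best[contig][0]:
            best[contig] = (span, chrom, start, strand)
    layout = defaultdict(list)
    for contig, (_, chrom, start, strand) in best.items():
        layout[chrom].append((start, contig, strand))
    # Sort each chromosome's contigs by anchor start position.
    return {chrom: [(c, s) for _, c, s in sorted(items)]
            for chrom, items in layout.items()}


# Toy alignments (invented contig names and coordinates).
toy_alignments = [
    ("ctgA", "chr6", 5_000_000, 9_000_000, "+"),
    ("ctgB", "chr6", 100_000, 4_800_000, "-"),
    ("ctgB", "chr1", 50_000, 60_000, "+"),   # short secondary hit, ignored
    ("ctgC", "chr1", 1_000, 2_000_000, "+"),
]
layout = order_contigs(toy_alignments)
print(layout)
```

This approach inherits any large-scale errors in the reference, which is why the slide states the assumption explicitly and lists independent maps (RH, linkage, FPC) as further evidence.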

Chromosome 6


Gap Filling

• Gap filling was done using PBJelly

• Further gaps filled using large finished BACs from Sscrofa10.2 assembly
  – 7 had large sequenced BAC contigs crossing them
  – We sequenced 5 more

• Plus manual placing of some fiddly contigs

• 181 gaps remaining

• N50 increased to 35.8Mb #35MbCtgClub
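Gap counts like the one above are typically obtained by counting runs of N in the scaffold sequences. A minimal Python sketch under that assumption is below. The function name `find_gaps` and the toy sequence are illustrative, and real assemblies may also record gaps in a separate AGP file.

```python
import re


def find_gaps(sequence, min_run=1):
    """Return (start, end) half-open coordinates of each run of N/n
    that is at least `min_run` bases long."""
    return [
        (m.start(), m.end())
        for m in re.finditer(r"[Nn]+", sequence)
        if m.end() - m.start() >= min_run
    ]


# Toy scaffold (invented): three contigs separated by two gaps.
toy_scaffold = "ACGTNNNNACGTNNACGT"
gaps = find_gaps(toy_scaffold)
long_gaps = find_gaps(toy_scaffold, min_run=3)
print(len(gaps), gaps)
```

Each closed gap merges two contigs into one, so the contig count falls and contig N50 rises as gap filling proceeds.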

Targeted gap closure: CH242-323K10

Targeted gap closure: CH242-284F8

Sequencing Additional BACs

• 5 BACs with ends that appear to cross gaps in the assembly
  – Sequenced using the MinION and assembled into individual contigs using Canu
  – Polished using Pilon

• Mapping of the assembled BAC contigs to the scaffolds showed they could be placed in their expected regions

• Potential to fill 129 more gaps in this way

#porecamp

Error Correction

• Arrow (succeeds Quiver)
  – Using PacBio reads to error correct assembled sequence
  – Reduced homozygous SNPs from 34,545 to 27,018
  – Reduced homozygous indels from 1,729,510 to 1,036,696

• Pilon (currently running)
  – Using Illumina mate pair and Illumina paired end libraries
  – Can detect and correct SNPs and indels, structural abnormalities, plus potential for gap filling
  – Expecting to reduce the remaining false variants
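The Arrow figures above correspond to roughly a 22% drop in homozygous SNPs and a 40% drop in homozygous indels. The short Python calculation below reproduces that arithmetic from the counts on the slide; only the helper name `percent_reduction` is introduced here.

```python
def percent_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# Counts taken from the Arrow results on the slide.
snp_reduction = percent_reduction(34_545, 27_018)
indel_reduction = percent_reduction(1_729_510, 1_036_696)

print(f"Homozygous SNPs reduced by {snp_reduction:.1f}%")
print(f"Homozygous indels reduced by {indel_reduction:.1f}%")
```

Even after this pass more than a million homozygous indels remain, which fits the slide's expectation that polishing with Illumina reads would remove further false variants.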

Evaluate

• Order and orientation wrt RH map

• Order, orientation, distance between paired ends
  – CH242 BAC ends
  – Fosmid ends
  – Illumina mate pairs (5-7 Kbp, 9-11 Kbp)
  – Illumina paired ends (500-660 bp)

• Gene models
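The paired-end checks above amount to asking, for each pair of end sequences, whether the two ends map in the expected relative orientation and at a distance consistent with the library's insert size. The Python below is a simplified sketch of that test. It assumes both ends map to the same scaffold, uses leftmost mapping coordinates as a proxy for insert size, and takes the expected orientation as a parameter because it differs between library types. The function name and labels are hypothetical.

```python
def check_pair(pos1, strand1, pos2, strand2,
               min_insert, max_insert, expected="FR"):
    """Classify one mapped end pair on a single scaffold.
    'FR' means the leftmost end is on '+' and the rightmost on '-'."""
    (left_pos, left_strand), (right_pos, right_strand) = sorted(
        [(pos1, strand1), (pos2, strand2)]
    )
    if left_strand == right_strand:
        orientation = "same-strand"
    elif left_strand == "+":
        orientation = "FR"
    else:
        orientation = "RF"
    if orientation != expected:
        return "wrong_orientation"
    distance = right_pos - left_pos
    if distance < min_insert:
        return "too_close"
    if distance > max_insert:
        return "too_far"
    return "concordant"


# Toy pairs (invented coordinates) tested against the 500-660 bp
# paired-end library range listed on the slide.
ok = check_pair(1_000, "+", 1_550, "-", 500, 660)
far = check_pair(1_000, "+", 9_000, "-", 500, 660)
flipped = check_pair(1_000, "-", 1_550, "+", 500, 660)
print(ok, far, flipped)
```

Clusters of "too_far", "too_close" or "wrong_orientation" pairs at the same location are more informative than isolated ones, since individual pairs can be chimeric or mismapped.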

BAC end sequence alignments – orientation & insert size

BBS4

IGF2

CFTR – ST7

ST7

Sscrofa11 - a new pig reference genome sequence worthy of adoption by the GRC

Alan Archibald

The Roslin Institute and R(D)SVS, University of Edinburgh

Adding pig genome to GRC

• High quality, highly contiguous genome

• Resources for gap closure
  – Isogenic BAC library CHORI242, ends sequenced
  – Isogenic fosmid library WTSI_1005, ends sequenced

• User communities, incl. SGSC, FAANG

• Funding
  – BBSRC strategic funding (The Roslin Institute)
  – BBSRC BBR Ensembl
  – COST Action CA15112 (FAANG-Europe)

Acknowledgements

• Roslin Institute
  – Amanda Warr
  – Mick Watson
  – David Hume
  – Heather Finlayson
  – Christine Burkard
  – Lel Eory
  – Richard Talbot
  – John Hickey

• PacBio
  – Richard Hall
  – Jason Chin
  – Harold Lee
  – Regina Lam
  – Kirsti Kim
  – Jim Burrows

• USDA
  – Tim Smith
  – Derek Bickhart
  – Ben Rosen
  – Steve Schroeder

• gEVAL
  – Will Chow
  – Kerstin Howe

• Other
  – Sergey Koren
  – Chris Warkup
  – Swine Genome Sequencing Consortium

alan.archibald@roslin.ed.ac.uk

@AlanArchibald51

MARC BARC

@FAANGEurope
