grc workshop agbt2015_tg
TRANSCRIPT
![Page 1: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/1.jpg)
GRC Workshop at AGBT 2015
Tina Graves-Lindsay
![Page 2: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/2.jpg)
CHM1 PacBio Data and Initial Assembly Stats
• 54X Whole Genome Coverage in long reads
• 8.8 kb average read length
• P5-C3 Chemistry
• PacBio Assembly done by Jason Chin
• Initial assembly had a 4.5 Mb N50 contig length
• Have alignments of PacBio CHM1 assembly to CHM1_1.1 and
GRCh38
![Page 3: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/3.jpg)
PacBio CHM1 Assembly potentially fills GRCh38 Gaps
GRCh38
PacBio CHM1
Data exists in PacBio unitig, not present in GRCh38
![Page 4: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/4.jpg)
CHM1_1.1 WGS Assembly Contigs
PacBio Assembly Contig
Alignment of CHM1 PacBio assembly to CHM1_1.1
![Page 5: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/5.jpg)
BioNano Genome Map confirms assembly of PacBio Contig
PacBio Assembly Contig
BioNano Genome Map Contigs
![Page 6: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/6.jpg)
1q21
1q21 patch alignment to chromosome 1
1q32 1q21 1p21
![Page 7: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/7.jpg)
SRGAP2 Region in PacBio Assembly
1q21
![Page 8: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/8.jpg)
CHM1 Falcon vs MHAP Assembly Stats
• MHAP assembly available for download – GCA_000772585.3
| Statistic | Falcon Assembly | MHAP |
| --- | --- | --- |
| Number of Contigs | 5528 | 3434 |
| N50 Contig Length | 5,460,023 | 4,320,471 |
| Total Assembly Size | 2,818,296,359 | 2,828,300,545 |
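For readers less familiar with the contiguity metric reported here: N50 is the contig length such that contigs of that length or longer together contain at least half of the assembled bases. The following is a minimal sketch of that calculation, not the tool used for the talk. The function name `nx` and the toy contig lengths are illustrative assumptions and are not CHM1 data.

```python
def nx(contig_lengths, percent=50):
    """Return the Nx contig length (N50 when percent=50).

    Contigs are sorted from longest to shortest and summed until the
    running total reaches `percent` percent of the assembly size; the
    length of the contig that crosses that threshold is returned.
    """
    if not contig_lengths:
        raise ValueError("contig_lengths must not be empty")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in the range (0, 100]")

    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= total * percent:
            return length
    # Only reachable if the lengths are not positive numbers.
    raise ValueError("contig lengths must be positive")


# Toy example with made-up lengths (hypothetical, not from the assemblies above).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
print("N50:", nx(toy_lengths))
print("N90:", nx(toy_lengths, percent=90))
```

The same function gives N90 or N95 by changing `percent`, which is how the additional contiguity rows reported for CHM13 later in the deck would be derived from a list of contig lengths.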
![Page 9: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/9.jpg)
CHM1 Assemblies – More on the Way
• MHAP Assembly
• Done by Adam Phillippy
• 1-2 more assemblies will be generated
• Dazzler Assembly
• Gene Myers version
• Longer contig N50 length
• Believe we will be evaluating it, but haven’t seen it yet
• Falcon Assemblies
• Jason Chin generating 1-2 additional Falcon assemblies using
improved software
![Page 10: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/10.jpg)
CHM1 Assembly Assessment Methods
• Assemblies will be run through the NCBI QA pipeline
• Assessed for contiguity, annotation, and concordance with the
finished BAC paths
• Assembly-to-assembly alignments will be generated between each PacBio
assembly and the Illumina-based CHM1 assembly, as well as GRCh38
• BioNano Genome Map
• SV calls generated from comparing the map data to each of the
CHM1 assemblies
• Alignment of the Illumina reads back to the CHM1
assemblies
• Heterozygous calls are likely indicative of a collapse in the
assembly
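The last check works because a complete hydatidiform mole such as CHM1 is essentially homozygous, so Illumina reads aligned to a correct assembly should produce very few heterozygous calls. Clusters of them suggest that two or more similar copies of a sequence were collapsed into one. A rough sketch of how such clusters might be flagged is shown below, assuming the heterozygous calls are already available as (contig, position) pairs. The function name, window size, and threshold are hypothetical placeholders, not values from the talk.

```python
from collections import Counter


def flag_candidate_collapses(het_calls, window=10_000, min_calls=5):
    """Flag fixed-size windows that contain many heterozygous calls.

    het_calls: iterable of (contig_name, zero_based_position) pairs.
    Returns a sorted list of (contig, window_start, window_end, n_calls)
    for every window holding at least `min_calls` heterozygous calls.
    """
    # Count calls per (contig, window index).
    counts = Counter((contig, pos // window) for contig, pos in het_calls)
    flagged = [
        (contig, idx * window, (idx + 1) * window, n)
        for (contig, idx), n in counts.items()
        if n >= min_calls
    ]
    return sorted(flagged)


# Made-up calls for illustration only.
example_calls = [("ctg1", p) for p in (100, 900, 2_500, 7_000, 9_999)]
example_calls += [("ctg1", 50_000), ("ctg2", 10)]
for region in flag_candidate_collapses(example_calls):
    print(region)
```

Fixed windows are the simplest choice; a real analysis would more likely also consider read depth, since a collapsed repeat tends to show elevated coverage as well as heterozygous calls.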
![Page 11: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/11.jpg)
The Platinum Genome
• What is it?
• Contiguous
• Haplotype-resolved representation of entire genome
• Best assembly from mini-assemblathon will be picked and improved
• BAC clone paths will be incorporated into PacBio whole genome assembly
• Comparison back to CHM1_1.1 to see if portions of the Illumina assembly will fill in any gaps
• Pick additional BACs to cover regions of the assembly that are still very fragmented
![Page 12: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/12.jpg)
CHM13 – 2nd Platinum Genome
• CHM13 – another hydatidiform mole sample
• PacBio data generated
• 60X data was generated using P5 and P6 Chemistry
• Avg read length ~11kb, longer than CHM1 data
• Data available in SRA
• Generating Illumina coverage to use for assembly QA, SV
detection, and consensus base error correction
• Plan to use BACs to improve the assembly where needed
• Alignment of Assembly to BioNano Genome map
• Currently ~91% of CHM13 assembly aligns to BioNano map
contigs
![Page 13: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/13.jpg)
CHM13 Assembled by DNAnexus
• DNAnexus is a cloud-based genome informatics & data
management platform that enables:
• Large scale genomic analysis
• Easy and secure collaboration of data
• Governance and compliance
• Simple deployment of your own code or use of pre-packaged tools
• DNAnexus packaged FALCON so that it can be run without
complicated installation and at scale.
• DNAnexus gives access to massive computational resources
on-demand.
• During assembly of CHM13, FALCON made use of 350
concurrent workers and 1400 concurrent cores.
![Page 14: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/14.jpg)
DNAnexus FALCON Pipeline
![Page 15: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/15.jpg)
CHM13 – 2nd Platinum Genome
| Stats | PacBio | DNAnexus |
| --- | --- | --- |
| Number of Contigs | 2873 | 2203 |
| N50 | 12,981,785 | 11,909,487 |
| N90 | 2,100,287 | 1,745,715 |
| N95 | 743,427 | 808,675 |
| Max Contig Length | 63,148,543 | 53,079,926 |
| Total Sequence | 2,851,367,788 | 2,809,672,639 |
| Total Assembly Time | 5 days | 41 hours |
![Page 16: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/16.jpg)
RefSeq Analysis

| | GRCh38 | CHM1_1.1 | MHAP CHM1 | PacBio CHM1 | CHM13 |
| --- | --- | --- | --- | --- | --- |
| Number of sequences not aligning | 21 | 88 | 67 | 67 | 125 |
| Split transcripts | 8 | 35 | 1245 | 1131 | 285 |
| CDS coverage <95% | 17 | 266 | 1339 | 1212 | 265 |

Total sequences retrieved from Entrez: 49680
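To put the RefSeq counts above in proportion, each one can be expressed as a percentage of the 49680 sequences retrieved from Entrez. The short sketch below does only that arithmetic on the numbers from the slide. The variable and function names are illustrative, and the column labels "MHAP CHM1" and "PacBio CHM1" are read from the slide's stacked header.

```python
TOTAL_REFSEQ = 49_680  # total sequences retrieved from Entrez (from the slide)

# (not aligning, split transcripts, CDS coverage < 95%) per assembly.
COUNTS = {
    "GRCh38": (21, 8, 17),
    "CHM1_1.1": (88, 35, 266),
    "MHAP CHM1": (67, 1245, 1339),
    "PacBio CHM1": (67, 1131, 1212),
    "CHM13": (125, 285, 265),
}


def as_percent(count, total=TOTAL_REFSEQ):
    """Express a count as a percentage of the RefSeq sequences evaluated."""
    return 100.0 * count / total


for assembly, (not_aligning, split, low_cds) in COUNTS.items():
    print(
        f"{assembly:12s} "
        f"not aligning {as_percent(not_aligning):6.3f}%  "
        f"split {as_percent(split):6.3f}%  "
        f"CDS<95% {as_percent(low_cds):6.3f}%"
    )
```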
![Page 17: Grc workshop agbt2015_tg](https://reader033.vdocuments.mx/reader033/viewer/2022042817/55a624521a28ab193c8b46c0/html5/thumbnails/17.jpg)
Future Directions
• Improve assemblies of both CHM1 and CHM13 to result in a
completely resolved final assembly for each genome
• From both assemblies, add significant structural variants
to the reference as alternate loci
• Sequence additional genomes from underrepresented populations
to add even more diversity to the reference