genome assembly forensics
DESCRIPTION
Automated assemblies are one thing, good assemblies are another! This presentation covers the basic concepts of using paired-end and mate pair read data to identify mis-assemblies. It also covers some of the tools for visualising and correcting mis-assemblies. An attempt is made to rate these tools on their feature set and scalability beyond small (…
TRANSCRIPT
![Page 1: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/1.jpg)
Genome Assembly Forensics and Visualisation
Nathan S. Watson-Haigh
Fri 11th May 2012, ACPFG Journal Club
Schatz, M.C. et al., 2007. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology, 8(3), p.R34.
Phillippy, A.M., Schatz, M.C. & Pop, M., 2008. Genome assembly forensics: finding the elusive mis-assembly. Genome Biology, 9(3), p.R55.
Schatz, M.C. et al., 2011. Hawkeye and AMOS: Visualizing and Assessing the Quality of Genome Assemblies. Briefings in Bioinformatics. Available at: http://bib.oxfordjournals.org/content/early/2011/12/23/bib.bbr074.
![Page 2: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/2.jpg)
Overview
• Genome Assembly
• N50/N90/N95
• Paired-end and Matepair Reads
• Mis-assembly Signatures
• Assembly Validation and Manual Editing
![Page 3: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/3.jpg)
Genome Assembly – Shotgun Reads
aligned shotgun reads
DNA being sequenced
![Page 4: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/4.jpg)
Genome Assembly – Repeats
![Page 5: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/5.jpg)
Genome Assembly – Repeats
![Page 6: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/6.jpg)
Genome Assembly – Repeats
reads from different repeats can’t be resolved
double coverage
![Page 7: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/7.jpg)
Genome Assembly – Repeats
![Page 8: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/8.jpg)
Genome Assembly – Diploid
![Page 9: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/9.jpg)
Assembly Metrics – N50
• The N50 is the most widely reported metric for de novo assemblies
• It is a single measure of the contig length distribution of an assembly
– If contigs are sorted into descending length order, the N50 is the length of the contig at which the cumulative length first reaches at least 50% of the total length of all the contigs
– Commonly reported with the N90 and N95
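The definition above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `nx`, its `percent` parameter and the contig lengths are all made up for the example.

```python
from typing import Iterable


def nx(contig_lengths: Iterable[int], percent: int) -> int:
    """Return the Nx contig length, e.g. percent=50 gives the N50."""
    lengths = sorted(contig_lengths, reverse=True)  # longest contig first
    if not lengths:
        raise ValueError("no contigs supplied")
    total = sum(lengths)
    running = 0
    for length in lengths:
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return lengths[-1]  # only reached if percent > 100


# Hypothetical contig lengths (total = 320)
lengths = [100, 80, 60, 40, 20, 10, 5, 5]
n50, n90, n95 = nx(lengths, 50), nx(lengths, 90), nx(lengths, 95)
print(n50, n90, n95)
```

With these lengths the cumulative sums are 100, 180, 240, 280, 300, 310, ..., so the 50%, 90% and 95% thresholds (160, 288, 304) are first reached at the contigs of length 80, 20 and 10.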
![Page 10: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/10.jpg)
Assembly Metrics – N50
[Figure: contig lengths added together to reach the N50, N90 and N95]
![Page 11: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/11.jpg)
Assembly Metrics – N50
• The N50 is the most widely reported metric for de novo assemblies
• It is a single measure of the contig length distribution of an assembly
– If contigs are sorted into descending length order, the N50 is the length of the contig at which the cumulative length first reaches at least 50% of the total length of all the contigs
– Commonly reported with the N90 and N95
• These stats DO NOT imply anything about assembly quality
– Could simply concatenate contigs together to get a better N50!!
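A toy demonstration of the concatenation point, assuming made-up contig lengths: joining every contig end to end, in any order and with no biological justification, turns the N50 into the full assembly length.

```python
def n50(lengths):
    """Contig length at which the descending cumulative sum reaches half the total."""
    total, running = sum(lengths), 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs supplied")


contigs = [50, 40, 30, 20, 10]  # hypothetical assembly, total length 150
glued = [sum(contigs)]          # the same bases as one arbitrary concatenation

honest_n50 = n50(contigs)       # 40
inflated_n50 = n50(glued)       # 150
print(honest_n50, inflated_n50)
```

The sequence content is identical in both cases; only the N50 has changed, which is why it says nothing about correctness.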
![Page 12: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/12.jpg)
Paired-end Reads
![Page 13: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/13.jpg)
Matepair Reads
![Page 14: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/14.jpg)
Paired-end and Matepair Reads
Paired-end Matepair
reverse complement
![Page 15: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/15.jpg)
So, Why are Pairs so Useful?
![Page 16: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/16.jpg)
So, Why are Pairs so Useful?
![Page 17: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/17.jpg)
Pairs are Useful – Orientation and Separation
![Page 18: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/18.jpg)
Pairs are Useful – Orientation and Separation
![Page 19: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/19.jpg)
Pairs are Useful – Orientation and Separation
![Page 20: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/20.jpg)
Pairs are Useful – Orientation and Separation
![Page 21: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/21.jpg)
Pairs are Useful – Orientation and Separation
Incorrect orientation
Incorrect distance
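The two checks on this slide can be sketched as a small classifier for a pair whose reads both map to the same contig. Everything here is an assumption for illustration: the `MappedPair` fields, the 3-standard-deviation cutoff, and the default forward/reverse orientation (typical of paired-end libraries; a mate pair library would pass a different `expected` tuple).

```python
from dataclasses import dataclass


@dataclass
class MappedPair:
    left_pos: int       # leftmost aligned base of the upstream read
    right_end: int      # rightmost aligned base of the downstream read
    left_strand: str    # "+" or "-"
    right_strand: str   # "+" or "-"


def classify_pair(pair, lib_mean, lib_sd, expected=("+", "-"), n_sd=3.0):
    """Label a pair as consistent with the library or as one of the two violations."""
    if (pair.left_strand, pair.right_strand) != expected:
        return "incorrect orientation"
    separation = pair.right_end - pair.left_pos + 1  # implied insert size
    if separation < lib_mean - n_sd * lib_sd:
        return "incorrect distance (compressed)"
    if separation > lib_mean + n_sd * lib_sd:
        return "incorrect distance (stretched)"
    return "consistent"


# Hypothetical 500 bp library with a standard deviation of 50 bp
ok = classify_pair(MappedPair(1000, 1499, "+", "-"), 500, 50)
flipped = classify_pair(MappedPair(1000, 1499, "+", "+"), 500, 50)
short = classify_pair(MappedPair(1000, 1199, "+", "-"), 500, 50)
long_ = classify_pair(MappedPair(1000, 2499, "+", "-"), 500, 50)
print(ok, flipped, short, long_, sep="\n")
```

Orientation is tested first because a separation measured between misoriented reads is not meaningful.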
![Page 22: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/22.jpg)
Mis-assembly Signatures – Collapsed Tandem Repeat
Correct alignment
Incorrect alignment
![Page 23: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/23.jpg)
Mis-assembly Signatures – Collapsed Tandem Repeat
Mis-assembly
Correct assembly
![Page 24: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/24.jpg)
Mis-assembly Signatures – Collapsed (small) Tandem Repeat
Mis-assembly
Correct assembly
![Page 25: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/25.jpg)
Mis-assembly Signatures – Collapsed Repeat
Mis-assembly
Correct assembly
![Page 26: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/26.jpg)
Mis-assembly Signatures – Rearrangement
Mis-assembly
Correct assembly
![Page 27: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/27.jpg)
Automated Assemblies Are One Thing, Good Assemblies Are Another
• Given the computer resources you can generate an automated assembly in a few weeks
– Not necessarily good
– Need to optimise assembly parameters
• For small organisms (< ~15Mbases)
– Commodity hardware
– OLC (overlap-layout-consensus) assemblers
• For larger genomes
– More RAM (tens to hundreds of Gbytes) for OLC assemblers
– De Bruijn graph assemblers
– Read mapping step to generate contig read alignments
![Page 28: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/28.jpg)
Automated Assemblies Are One Thing, Good Assemblies Are Another
• Automated assemblies need to be checked for mis-assemblies
– Need paired-end/matepair reads
– Need viewers to visualise paired-end data
– Need editors to break/join/reassemble parts of the assembly deemed to be inconsistent with read pair info
– Need enough computer hardware to allow all this data to be loaded – especially with large volumes of Illumina paired-end data
![Page 29: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/29.jpg)
Automated Assemblies Are One Thing, Good Assemblies Are Another
• Very time consuming and laborious to check/edit
– Small assemblies (< ~15Mbases)
  • Several weeks/few months to move 1 scaffold/contig at a time
– Large assemblies need a team to do the same thing
  • Need enough RAM to load all the paired-end data
  • Need ways to identify regions requiring closer inspection
  • Identify possible mis-assemblies
• Major hurdles
– Software inadequacies
– Time
– File formats! Grrrr!
![Page 30: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/30.jpg)
Software Inadequacies
| Software | Contig View | Scaffold View | Editing | Reassemble | Clipping Info | Other |
|---|---|---|---|---|---|---|
| SeqMan Pro | 9 | 9 | 6 | 6 | 6 | $$, buggy, not for large assemblies (32bit), 1 template size |
| Gap5 | 6 | NA | 9 | NA | 8 | Free, join editor, contig comparator, poor visual support for many contigs, shuffle pads, ACE, multiple template sizes |
| Consed | 6 | 6 | 5 | 9 | 6 | Free/US$2500/US$10k, poor visual support for many contigs, multiple template sizes |
| Hawkeye | 9 | 9 | NA | NA | 7 | Leverages AMOS, automated detection of mis-assemblies, large assemblies, modular |
![Page 31: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/31.jpg)
SeqMan Pro – Strategy View
![Page 32: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/32.jpg)
SeqMan Pro
![Page 33: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/33.jpg)
![Page 34: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/34.jpg)
Gap5 – Template View
![Page 35: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/35.jpg)
Gap5 – Contig Comparator
![Page 36: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/36.jpg)
Gap5 – Join Editor
![Page 37: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/37.jpg)
Gap5 – Contig Editor
![Page 38: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/38.jpg)
![Page 39: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/39.jpg)
Consed – Assembly View
![Page 40: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/40.jpg)
Consed – Contig Viewer/Editor
![Page 41: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/41.jpg)
![Page 42: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/42.jpg)
![Page 43: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/43.jpg)
![Page 44: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/44.jpg)
![Page 45: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/45.jpg)
![Page 46: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/46.jpg)
![Page 47: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/47.jpg)
![Page 48: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/48.jpg)
Scaffold/Contig Length Distribution
![Page 49: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/49.jpg)
Library Stats
![Page 50: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/50.jpg)
Compression-Expansion (CE) Statistic
• A measure of the deviation of the local distribution of insert sizes from the global distribution of insert sizes
– 0 indicates no deviation
– ≤ −3 indicates much compression
– ≥ +3 indicates much expansion
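As a rough sketch, the CE statistic is usually written as a z-score: the mean size of the inserts spanning a position, minus the library mean, divided by the standard error (library standard deviation over the square root of the number of local inserts). The function name and the numbers below are hypothetical, and this is not taken from the AMOS source.

```python
import math


def ce_statistic(local_inserts, lib_mean, lib_sd):
    """Z-score of the local mean insert size against the library-wide distribution."""
    n = len(local_inserts)
    if n == 0:
        raise ValueError("no inserts span this position")
    local_mean = sum(local_inserts) / n
    standard_error = lib_sd / math.sqrt(n)
    return (local_mean - lib_mean) / standard_error


# Hypothetical 3 kb library (sd 300 bp); 16 inserts span the position of interest
compressed = ce_statistic([2700] * 16, lib_mean=3000, lib_sd=300)
unremarkable = ce_statistic([2900, 3100] * 8, lib_mean=3000, lib_sd=300)
print(compressed, unremarkable)
```

Here the 16 local inserts average 2700 bp, giving (2700 − 3000) / (300 / 4) = −4, which is past the compression threshold. Because the standard error shrinks with more spanning inserts, a modest but consistent shift becomes significant at high insert coverage.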
![Page 51: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/51.jpg)
Insert Coverage Read Coverage
![Page 52: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/52.jpg)
500bp inserts 3kb inserts
20kb inserts
![Page 53: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/53.jpg)
AMOSvalidate
• An assembly analysis pipeline to identify possible mis-assemblies
– Paired-end data
  • CE stats
  • Incorrect orientation
  • Missing mate
– Coverage
– SNP density
– Singletons
![Page 54: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/54.jpg)
![Page 55: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/55.jpg)
Hawkeye Cons
• Poor support for correcting mis-assemblies once detected
![Page 56: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/56.jpg)
![Page 57: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/57.jpg)
Closing Remarks
• Software exists to allow manual editing of assemblies
– Time consuming
– Different tools have different features
– Most fall over with assemblies > ~15Mbases or with many contigs/scaffolds (10k-100k)
![Page 58: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/58.jpg)
Closing Remarks
• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye)
![Page 59: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/59.jpg)
![Page 60: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/60.jpg)
![Page 61: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/61.jpg)
Closing Remarks
• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)
![Page 62: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/62.jpg)
Gap5 – Join Editor
![Page 63: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/63.jpg)
Closing Remarks
• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)
– Visualise clipped regions with consensus mismatches (like Gap5)
![Page 64: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/64.jpg)
Gap5 – Contig Editor
![Page 65: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/65.jpg)
Closing Remarks
• Ideal Tool
– Contig/scaffold viewer capable of displaying compressed/expanded mates, and which contigs mates map to when they are off contig/scaffold (like SeqMan Pro and Hawkeye)
– Contig join editor for manual alignment and editing of contigs (like Gap5)
– Visualise clipped regions with consensus mismatches (like Gap5)
– Automated analysis of assembly to identify regions requiring attention (like AMOSvalidate) and a way to navigate to those regions for editing
– Minimise mouse-clicks and keyboard presses!!
![Page 66: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/66.jpg)
![Page 67: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/67.jpg)
Newbler Plant Genome Assemblies
• Pretty conservative in contig construction
• Seems to split out repetitive regions into their own contigs pretty well
• Heterozygosity issues
– SNP alignment issues
– Indels break contigs
– Hidden in clipped regions
– Manual joining of neighbouring contigs can reduce scaffolded contig numbers by 60-70%
– Many unscaffolded contigs have high sequence similarity to scaffolded contigs – could collapse these and reduce the number of unscaffolded contigs by 50%
![Page 68: Genome Assembly Forensics](https://reader036.vdocuments.mx/reader036/viewer/2022062308/558e43981a28ab88668b4661/html5/thumbnails/68.jpg)
Gap5 – Contig Editor