Lectures11,12:DNASequencingHistoryandMethods
Fall2019October22,24,2019
IntroductionandHistory
SamplePreparation
SamplePreparation
Fragments
SamplePreparation
Sequencing
ACGTAGAATCGACCATG
GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA
Reads
Fragments
NextGenerationSequencing(NGS)
SamplePreparation
Sequencing
Assembly
ACGTAGAATACGTAGAAACAGATTAGAGAG…
Contigs
Fragments
Reads
ACGTAGAATCGACCATG GGGACGTAGAATACGAC
ACGTAGAATACGTAGAA
SamplePreparation
Sequencing
Assembly
Analysis
Fragments
Reads
Contigs
ReferenceGenome
9
Denovovs.Re-sequencing• Denovoassembly(“fromthebeginning”)impliesthatyouhavenopriorknowledgeofthegenome.
• Re-sequencingassemblyassumesyouhaveacopyofthereferencegenome(thathasbeenverifiedtoacertaindegree).
• Theprogramsthatworkforre-sequencingwillnotworkfordenovo.
Denovovs.Re-sequencing
SamplePreparation
Fragments
Re-sequencing (LOCAS, Shrimp) requires 15x to 30x coverage. Anything less and re-sequencing programs will not produce results or produce questionable results.
SamplePreparation
Fragments
De-novo assembly requires higher coverage. At least 30x but upwards to 100x’s coverage. Most de novo assemblers require paired-end data.
SamplePreparation
Sequencing
Assembly
Analysis
Fragments
Reads
Contigs
Ourfocusfortoday’slecture:1. Comparisonofsequencing
platforms2. Detailsofsamplepreparation3. Definitionsandterminologies
concerningdataandsequencingplatforms
HistoryandBackground
LandmarksinSequencingEfficiency(bp/person/year)
Year Event
1870 Miescher:DiscoversDNA1940 Avery:ProposesDNAas“GeneticMaterial”1953 Watson&Crick:DoubleHelixStructureofDNA
1 1965 Holley:transferRNAfromYeast1,500 1977 Maxam&Gilbert:"DNAsequencingbychemical
degradation”Sanger:“DNAsequencingwithchain-terminatinginhibitors”
15,000 1981 Messingandhiscolleaguesdeveloped“shotgunsequencing”method
25,000 1987 ABImarketsthefirstsequencingplatform,ABI370
LandmarksinSequencingEfficiency(bp/person/year)
Year Event
50,000 1990 NIHbeginslarge-scalesequencingbacteriagenomes.
200,000 1995 CraigVentureandHamiltonSmithattheInstituteforGenomicResearch(TIGR)publishedthefirstcompletegenomeofafree-livingorganisminScience.Thismarksthefirstuseofwhole-genomeshotgunsequencing,eliminatingtheneedforinitialmappingefforts.
2001 AdraftofthehumangenomewaspublishedinScience.
2001 AdraftofthehumangenomewaspublishedinNature.
50,000,000 2002 454LifeSciencescomesoutwithapyrosequencingmachine.
100,000,000 2008 Nextgenerationsequencingmachinesarrive.
Huge 2015+ OxfordNanopore:600Millionbasepairsperhour.
RobertHolleyandteamin1965
WatsonandCrick
Messing:World’smost-citedscientist
FrancisCollins:PrivateHumanGenomeproject.
Next-GenSequencingPlatforms
454/RocheGS-20/FLX(2005)
PacBioRS(2009-2010)3rdgeneration?
IlluminaHiSeq(2007)
23
ComparisonofPlatforms
Technology Readsperrun AverageReadLength
bpperrun Typesoferrors
454(Roche) 400,000 250-1000bp 70Million Indels
SoLiD(ABI) 88-132Million 35bp 1Billion
IlluminaHiSeq 2.5Billion 100–250bp 600Billion Substitution
PacBio 45,000 2000-10,000bp 45Million Insertionsanddeletions
\
SequencingMethodsandTerminology
Sangermethod(1977):labeledddNTPsterminateDNAcopyingatrandompoints.
Bothmethodsgeneratelabeledfragmentsofvaryinglengthsthatarefurtherelectrophoresed.
Gilbertmethod(1977):chemicalmethodtocleaveDNAatspecificpoints(G,G+A,T+C,C).
SangerSequencing
SangerSequencingVideo
SangerSequencingSHEAR DNA target sample
SangerSequencingSHEAR DNA target sample
A A A A
C GT
C GT
C GT
C GT
SangerSequencing
30
SHEAR DNA target sample
A A A A
C GT
C GT
C GT
C GTA C
G
T
SangerSequencing
AC GT A
DNApolymerase Primer
SangerSequencing
AC GT A
DNApolymerase Primer
Primer
DNApolymerase
AC GT A
SangerSequencing
AC GT A
Primer
DNApolymerase
A
A
A
G
G
C
C
CT
CTA
CT
G
SangerSequencing
AC GT A
Primer
DNApolymerase
A
A
A
G
G
C
C
CT
CTA
CT
G
G
SangerSequencing
AC GT A
PrimerA
A
A
G
G
C
C
CT
CTA
CT
CG
G
SangerSequencing
AC GT A
PrimerA
A
A
G
G
C
C
CT
CTA
CT
GA
CG
G
SangerSequencing
AC GT A
PrimerA
A
A
G
G
C
C
CT
CTA
CT
T
SangerSequencing
PrimerA
A
A
G
G
C
C
CT
CTA
CT
SangerSequencing
PrimerA
A
A
G
G
C
C
CT
CTA
CT
ContinueuntilallstrandsofDNAhaveundergonethisreaction.IfyouchoosethereagentscorrectlythenyoushouldhaveallpossibleA-terminatedstrands;resultinginsequencesofvaryinglengths.
SangerSequencing
SangerSequencing
Inthegel,thelongerDNAfragmentsmovefastertothebottomandtheshorteronesmoveslowerandremainatthetop.Thesequencecanbereadoffbygoingfromtoptobottom.
Challenges
• Requiresalotofspaceandtime:youneedaplacetorunthereaction,andthenyouneedageltodeterminethelengthoftheDNA– Youcouldonlyrunperhapsahundredofthesereactionsatanyonetime.
– Thereare3billionbasepairsofDNAinthehumangenome,meaningabout6million500-basepairfragmentsofDNA.
• Nonethelessitwasstillusedtocomeupwiththefirstcopyofthehumangenome
42
CeleraSequencing(2001)
• 300ABIDNAsequencingplatforms• 50productionstaff• 20,000squarefeetofwetlabspace• 1milliondollars/yearforelectricalservice• 10milliondollarsinreagents
Totalcostofhumangenome:2.7Billiondollars
CeleraSequencing(2001)
• 300ABIDNAsequencingplatforms• 50productionstaff• 20,000squarefeetofwetlabspace• 1milliondollars/yearforelectricalservice• 10milliondollarsinreagents
Currentcostofhumangenome:<10,000$
• SecondgenerationsequencingtechniquesovercometherestrictionsbyfindingwaystosequencetheDNAwithouthavingtomoveitaround.
• YoustickthebitofDNAyouwanttosequenceinalittledot,calledacluster,andyoudothesequencingthere;asaresult,youcanpackmanymillionsofclustersintoonemachine.
Second/NextGenerationSequencing
SequencingastrandofDNAwhilekeepingitheldinplaceistricky,andrequiresalotofcleverness.
IlluminaSequencing:Video
StepsinIlluminasequencing
• Turnonthesequencingmachineandwait(1week)…
48
StepsinIlluminasequencing
• Sampleprep:sizeselectfragments,addadapterstoensurethefragmentsligatetotheflowcell(1to5days)
49
ligateadapters
StepsinIlluminasequencing• Clustergenerationonflowcell
Whydoweneedclusters?
50
Aflowcellcontains8lanes
Eachlanecontainsthreecolumnsoftiles
Eachcolumncontains100tiles
20Kto30KclustersEachtileisimagedfourtimespercycle,whichisoneimageperbase
Wemultiplyupthetemplatestand,i.e.thebitofDNAthatwearesequencing,andstickonafewbasesof‘adaptorsequence’;thissequencesticksontocomplementarybitsofDNAstucktoasurface,whichholdstheDNAinplacewhilewesequenceit:
WethenfloodtheDNAwithReversibleTerminator(RT)-bases.Wealsoaddapolymeraseenzyme,whichincorporatestheRT-baseintothenewstrandthatiscomplementarytothetemplatestrand:
WethenwashawayalltheRT-bases,leavingjustthosethatwereincorporatedintothenewstrand;wecanreadoffwhatbasethisisbylookingatthecolorofthedye:
There exists a cleavage enzyme thatchops all the extramolecules off, andturns the RT-base into a normallyfunctioningnucleotide.
56
57
• IlluminausesthemodifiedversionofSangersequencingcalledreversibleterminatormethod.
• Thedyeiswashedafterimagingandthelastnucleotideisextendedinthenextround.
• InasingleIlluminamachinewehavehundredsofmillionsoftheseclusters;cameraslookatallofthesedotsandrecordhowtheychangecolorovertime,allowingyoutodeterminethesequenceofbasesofmillionsofbitsofDNAatonce.
IlluminaCharacteristics
• Sequencingmethodisactuallyprettyinefficient,however,themachineiscapableofsequencingmillionsoffragmentsofDNAatonce.
• Duetocontrolledsequenceoftermination,washing,andchemicaldeactivation/activationevents,Illuminareadshave(almost)onlysubstitutionerrors.
• Pairedreadswithsmallinsertsize(<800bp)canbereliablygenerated.Largeinsertmatepairscanbemadeusingunreliable,difficult,time-consuming,andexpensivechemicalhacks.
IlluminaCharacteristics
InsidetheIlluminaMachine
60
Pyrosequencing:Video454RocheSystem
https://www.youtube.com/watch?v=nFfgWGFe0aA
• PyrosequencingdiffersfromSangersequencing,inthatitreliesonthedetectionofpyrophosphatereleaseonnucleotideincorporation,ratherthanchainterminationwithdideoxynucleotides.
• Sincethereisnochainterminationinpyrosequencingotherthanbydesignedunavailabilityoftheother3nucleotides,pyrosequencingreadshaveinsertion/deletionerrorsparticularlyinornexttorunsofhomopolymers:hardtodistinguishbetweenAAAAAandAAAAAA
PyrosequencingCharacteristics
• Relativelylongreads:800-1000bp.
• Reliablepairedreadprotocolwithlargeinsertsizes:3kbp,8kbp,20kbp.
• Forinstance,apairof1000bpreadsbacktoback(insertsize=2kbp)essentiallygivesa2000bpread.
• Dealingwith2%-3%indelsin454readsisthemainchallengebesidehighersequencingcostsincomparisonwithIllumina.
PyrosequencingCharacteristics
SingleMoleculeSequencing:VideoPacificBiosciencesSystem
https://www.youtube.com/watch?v=v8p4ph2MAvIhttps://www.youtube.com/watch?v=NHCJ8PtYCFc
• PacBioreadsarelong,e.g.onaverageafewkilobases.
• SincePacBioreliesonthesignalfromasinglemolecule,thesignaltonoiseratioissmall,andPacBioreadshavelotsofuniformlyrandomerrors,upto15%.
• PacBioerrorsareprimarilyindels,whichmakesefficaciouscomputationalerrorcorrectioncurrentlyintractable.
• PacBioreadsarecurrentlyusedforlimitedvalidationofcontiguityinformationorhelpingdatasetsgeneratedwithothertechnologies.
PacBioCharacteristics
NanoporeSequencing:VideoOxfordNanoporeSystemhttps://www.youtube.com/watch?v=3UHw22hBpAk
• Nanoporereadsareprettylong,upto100+kbp.
• Theyhavelotsoferrors,10%-40%.
• ErrorsareprimarilyindelslikePacBio’sbuttheNanoporeerrormodelisnotclearyet[PacBioerrorsareprettymuchuniformlyrandom].
• Nanoporehasjuststartedaworld-widebenchmarkingprojectwhichisstillgoingon.
OxfordNanoporeCharacteristics