introduction to second- generation sequencing · second generation sequencing •aka:...

47
Introduction to second- generation sequencing

Upload: others

Post on 05-Jun-2020

48 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Introductiontosecond-generationsequencing

Page 2: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Review:DNAsequencing

• TechnologiestodeterminethenucleotidesequencesofaDNAmolecule.

• Motivation:decipherthegeneticcodeshiddeninDNAsequencesfordifferentbiologicalprocesses.

• Genomeprojects:determineDNAsequencesfordifferentspecies,e.g.,humangenomeproject.

• Genomicresearch(inanutshell):studythefunctionsofDNAsequencesandrelatedcomponents.

Page 3: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Sequencingtechnologies

• Traditionaltechnology:Sangersequencing.– Slow(lowthroughput)andexpensive:ittookHumanGenomeProject(HGP)13yearsand$3billiontosequencetheentirehumangenome.

– Relativelyaccurate.

• Newtechnology:differenttypesofhigh-throughputsequencing.

Page 4: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Secondgenerationsequencing

• Aka:high-throughputsequencing,nextgenerationsequencing(NGS).

• Abletosequencelargeamountofshortsequencesegmentsinashortperiod:– highthroughput:billionsofsequencesinarun.– Cheap:sequenceentirehumangenomecostsbelowonethousanddollarsnow.

– shortreadlength:uptoseveralhundredbps.

Page 5: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount
Page 6: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount
Page 7: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Availableplatforms

• Majorplayer:– Illumina:HiSeq,MiSeq.– LifeTech (nowThermoFisher):SOLiD,IonTorrent.

• Others(third-generationsequencing):– PacificBioscience(SMRT)– OxfordNanopore

Page 8: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

High-throughputsequencers

Page 9: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Second-generationsequencingtechnologies

Page 10: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Second-generationsequencingtechnologies

• Complicatedandinvolvesalotofbiochemicalreactions.– Sequencingbysynthesis:

https://www.youtube.com/watch?v=fCd6B5HRaZ8.– Sequencingbyligation.– Pyrosequencing.

• Inanutshell:– CutthelongDNAintosmallersegments(severalhundredstoseveral

thousandbases).– Sequenceeachsegment:startfromoneendandsequencealongthe

chain,basebybase.– Theprocessstopsafterawhilebecausethenoiselevelistoohigh.– Resultsfromsequencingaremanysequencepieces.Thelengthsvary,

usuallyafewthousandsfromSanger,andseveralhundredsfromNGS.– Thesequencepiecesarecalled“reads”forNGSdata.

Page 11: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Single-endvs.paired-endsequencing

• SequenceoneorbothendsoftheDNAsegments.• Single-endsequencing:sequenceoneendoftheDNA

segment.• Paired-endsequencing:sequencebothendsofaDNA

segments.– Resultreadsare“paired”,separatedbycertainlength(thelengthof

theDNAsegments,usuallyafewhundredbps).– Paired-enddatacanbeusedassingle-end,butcontainextra

informationwhichisusefulinsomecases,e.g.,detectingstructuralvariationsinthegenome.

– Modelingtechniqueismorecomplicated.

Page 12: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

ApplicationsofSecond-generationsequencing

Page 13: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Applications

• NGShasawiderangeofapplications.– DNA-seq:sequencegenomicDNA.– RNA-seq:sequenceRNAproducts.– ChIP-seq:detectprotein-DNAinteractionsites.– Bisulfitesequencing(BS-seq):measureDNAmethylationstrengths.

– Alotofothers.

• Basicallyreplacedmicroarrayswithbetterdata:greaterdynamicrangeandhighersignal-to-noiseratios.

Page 14: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount
Page 15: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

DNA-seq

• SequencetheuntreatedgenomicDNA.– ObtainDNAfromcells,cutintosmallpiecesthensequencethesegments.

• Goals:– Comparewiththereferencegenomeandlookforgeneticvariants:

• Singlenucleotidepolymorphisms(SNPs)• Insertions/deletions(indels),• Copynumbervariations(CNVs)• Otherstructuralvariations(genefusion,etc.).

– Denovoassemblyofanewgenome.

Page 16: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

VariationsofDNA-seq

• Targetedsequencing,e.g.,exomesequencing.– SequencethegenomicDNAattargetedgenomicregions.– CheaperthanwholegenomeDNA-seq,sothatmoneycanbespentto

getbiggersamplesize(moreindividuals).– Thetargetedgenomicregionsneedtobe“captured”firstusing

technologieslikemicroarrays.

• Metagenomicsequencing.– SequencetheDNAofamixtureofspecies,mostlymicrobes,inorder

tounderstandthemicrobialenvironments.– Thegoalistodeterminenumberofspecies,theirgenomeand

proportionsinthepopulation.– Denovoassemblyisrequired.Butthenumberandproportionsof

speciesareunknown,soitposeschallengetoassembly.

Page 17: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

RNA-seq

• Sequencethe“transcriptome”:thesetofRNAmolecules.

• Goals:– CatalogueRNAproducts.– Determinetranscriptionalstructures:alternativesplicing,genefusion,etc.

– Quantifygeneexpression:thesequencingversionofgeneexpressionmicroarray.

Page 18: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

ChIP-seq

• Chromatin-Immunoprecipitation(ChIP)followedbysequencing(seq):sequencingversionofChIP-chip.

• Usedtodetectlocationsofcertain“events”onthegenome:– Transcriptionfactorbinding.– DNAmethylationsandhistonemodifications.

• Atypeof“captured”sequencing.ChIPstepistocapturegenomicregionsofinterest.

• Similartechnologies:MeDIP-seq,hMe-seal,etc.

Page 19: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Second-generationsequencingdataanalyses

Page 20: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

WorkflowofsecondgenerationsequencingdataanalysisSec-gen sequencing data

analysis workflow

Raw images

Fluorescent intensities

sequence reads

aligned reads contigs

imaging analysis

base calling

alignment de novo assembly

variant calling(DNA seq)

DE/splicing (RNA seq)

peak/DMR detection(ChIP/MeDIP- seq)

...

Page 21: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Imaginganalysis

• Extractintensityvaluesfromimages.– OnIlluminaandSOLiDsystems,therearefourimagespercycle,oneforanucleotide/color.

• Similartothatinmicroarrays.• Involvesmanystatisticalmethodstoextractsignalsfromnoisydata.

• Resultsoftheimaginganalysis:a3-dimensionalmatrix:nreadsx4xnbases.

Page 22: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Basecalling

• Foreachread,ateachposition,convertfourfluorescentintensities(continuous)intoabaseorcolor(categorical).

• It’saclassificationproblem.

Base1 Base2 Base3 Base4 Base5 A 0.290 0.046 0.014 0.026 0.010 C 0.014 0.654 0.132 0.803 0.006 G 0.062 0.009 0.001 0.016 0.712 T 0.016 0.010 0.455 0.046 0.768

ACTCT…

Page 23: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Exampleofbasecallingmethod:Alta-CyclicforIlluminadata

The training process (green arrows) starts with creation of the training set, beginning with sequences generated by the standardIllumina pipeline, by linking intensity reads and a corresponding genome sequence (the 'correct' sequence). Then, two grid searches are used to optimize the parameters to call the bases. After optimization, a final SVM array is created, each of which corresponds to a cycle. In the base-calling stage (blue arrows), the intensity files of the desired library undergo deconvolution to correct for phasing noise using the optimized values and are sent for classification with the SVM array. The output is processed, and sequences and quality scores are reported.

Erlichetal.NatureMethods5:679-682(2008)

Page 24: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Rawsequencereadsfromsecondgenerationsequencingafterbasecalling

• Largetextfile(millionsoflines)withsimpleformat.– Mostfrequentlyused:fastq format

Secgen sequencing data: fastq format

@HWI-EAS165:1:1:50:908:1CTGCGGTCTCTAAAGTGCCATCTCATTGTGCTTTGTATCAGTCAGTGCTGGA+BCCBCB8ABBBBBBB:BC=8@BBA:@BB@BBBCBB<9BBAC;A<C?BAAB<#@HWI-EAS165:1:1:50:0:1NCAACCCCCACAGTAATATGTAAAACAAAAACTAAAACCAGGAGCTGAAGGG+#BABABBBBBB@08<@?A@7:A@CCBCCCCBBBCCBB=?BBBB@7@B=A>:2@HWI-EAS165:1:1:50:708:1GGTCAGCATGTCTTCTGTTAAGTGCTTGCACAAGCTAGCCTCTGCCTATGGG+BB@A;B>@A@@=BB=BB?A>@@>B?ABBA=A?@@>@@A:=?>?A@=B8@@AB@HWI-EAS165:1:1:50:1494:1CTGGTGTCACACAAGCAGGTCTCCTGTGTTGACTTCACCAGACACTGTCATT+BCBB@AB@1ABBBBBBAAB?BBBBAB<A?AA>BB@?1ABBA@BBBA@;B>>:

readnamereadsequenceseparatorqualityscores

Page 25: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Sequencealignmentandassembly

• Sequenceaknowngenome?--- Alignment– Usetheknowngenome(called“referencegenome”)asablueprint.– Determinewhereeachreadislocatedinthereferencegenome.

• Sequenceawholenewgenome?--- Assembly– Newgenome:aspecieswithunknowngenome,orthegenomeis

believedtobeverydifferentfromreference(e.g.,cancer).– Basicallytheshortreadsare“stitched”togethertoformlong

sequencescalled“contigs”.– Overlapsamongsequencereadsarerequired,soitneedsalotof

reads(deepcoverage).– Morecomputationallyintensive.

Page 26: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Alignment• Need:sequencereadsfileandareferencegenome.• Itisbasicallyastringsearchproblem:whereistheshort(50-

letter)stringlocatedwithinthereferencestringof3billionletters.

• Brute-forcesearchingisokayforasingleread,butcomputationallyinfeasibletoalignmentmillionsofreads.

• Cleveralgorithmsareneededtopreprocessthereferencegenome(indexing),whichisbeyondthescopeofthisclass.

Page 27: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Populargeneralalignmentsoftware

• Bowtie:fast,butlessaccurate.• BWA(Burrows-WheelerAligner):samealgorithmasbowtie,but

allowgapsinalignments.– about5-10timesslowerthanbowtie,butprovidebetterresults

especiallyforpairedenddata.• Maq (MappingandAssemblywithQualities):withSNPcalling

capabilities.• ELAND:Illumina’scommercialsoftware.• Alotofothers.See

http://en.wikipedia.org/wiki/List_of_sequence_alignment_software formoredetails.

Page 28: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Othertechnology-specificalignmentsoftware

• RNA-seq:– Tophat,STAR– Pseudoaligner:Salmon,kallisto.Veryfast.

• Bisulfitesequencing:– Bismark– BSMAP– Merman

Page 29: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

bowtieBowtie

Bowtie is an ultrafast, memory-efficient short read aligner. It aligns short DNA sequences (reads) to the human genome at a rate of over 25 million 35-bp reads per hour. Bowtie indexes the genome with a Burrows-Wheeler index to keep its memory footprint small: typically about 2.2 GB for the human genome (2.9 GB for paired-end).

Page 30: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Usebowtie:buildalignmentindex

• Alignmentindexfilesarebuiltbasedonreferencegenome(canbedownloadastextfilesfromUCSC).

• Notethatpre-builtindexesformanygenomesareavailablefrombowtiepage,checkthatbeforebuildingyourownindex.

• CommandexampleforHumanhg18genome.Assumewehavethehg18sequencefilereadycalledhg18.fa:bowtie-build hg18.fa hg18

• Results:severalebwtfiles.• Tips:theindexfilescanbestoredinacommonplaceand

sharedamongcolleagues.

Page 31: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Usebowtie:alignment

• Testwhetheritworks:bowtie -c hg18 GGTATATGCACAAAATGAGATGCTTGCTTA

• Alignareadfilesbowtie -v 3 -f hg18 reads.fa reads.map

Page 32: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

bowtie:commonlyusedparameters• Inputfileformat:

– -q:queryinputfilesareFASTQ.fq/.fastq(default)– -f:queryinputfilesare(multi-)FASTA.fa/.mfa– -r:queryinputfilesarerawone-sequence-per-line

• Aligment:– -v:allowingvmismatches.– -5:ignoringsomebasedfrom5’ end.– -3:ignoringsomebasedfrom3’ end.

• Outputformat:– -S:outputinSAM(sequencealignmentmap)format.

• Example:inputisasinglefafile,allowing3mismatches,ignore5basesfrom3’ end,outputinSAMformat:bowtie -v 3 -3 5 -S hg18 reads.fa reads.sam

Page 33: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Outputfrombowtie

• SAMformat

• Bowtieformat

Output from bowtie

• SAM output:@HD VN:1.0 SO:unsorted@SQ SN:phage LN:5386@PG ID:Bowtie VN:0.12.5 CL:"/Users/hwu/bin/bowtie-0.12.5/bowtie -v 3 -f -S phage reads.fa reads.sam"5_143_428_832 4 * 0 0 * * 0 0 GATATTGTAGCATAACGCAACTTGGGAGGTGAGCTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XM:i:05_143_984_487 0 phage 3948 255 36M * 0 0 GTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_963_690 0 phage 3503 255 36M * 0 0 GGTATATGCACAAAATGAGATGCTTGCTTATCAACA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_957_461 16 phage 3903 255 36M * 0 0 TTGTCTAGGAAATAACCGTCAGGATTGACACCCTCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_808_403 0 phage 4122 255 36M * 0 0 GATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_986_385 4 * 0 0 * * 0 0 GATGCTGAAGGAACTTGGTAAAATTTATCTGGAGAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XM:i:05_143_981_626 16 phage 1522 255 36M * 0 0 TCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_470_717 16 phage 2061 255 36M * 0 0 ATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:0

5_143_961_681 + phage 4397 GCTGCTGAACGCCCTCTTAAGGATATTCGCGATGAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5_143_996_500 + phage 2537 GGTTAATGCTGGTAATGGTGGTTTTCTTCATTTCAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 32:G>T5_143_468_916 + phage 339 GGATTACTATCTGAGTCCGATGCTGTTCAACCACTA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5_143_972_467 + phage 3021 GTGGCATTCAAGGTGATGTGCTTGCTACCGATAACA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5_143_953_471 - phage 5017 ATACGTTAACAAAAAGTCAGATATGGACCTTGCTGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5_143_687_97 + phage 1287 GACTGTTAACACTACTGGTTATATTGACCATGCCAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 34:G>A5_143_620_93 + phage 4463 GATGAGTGTTCAAGATTGCTGGAGGCCTCCACTATG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 5_143_766_307 + phage 3024 GCATTCTAGGCGATGTGCTTGCTACCGTTAACAATA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII 0 6:A>T,10:T>C,27:A>T

• Bowtie output:

:@HD VN:1.0 SO:unsorted@SQ SN:phage LN:5386@PG ID:Bowtie VN:0.12.7 CL:"bowtie -v 3 -f -S phage reads.fa reads.sam"5_143_428_832 4 * 0 0 * * 0 0 GATATTGTAGCATAACGCAACTTGGGAGGTGAGCTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XM:i:05_143_984_487 0 phage 3948 255 36M * 0 0 GTTTTCATGCCTCCAAATCTTGGAGGCTTTTTTATG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_963_690 0 phage 3503 255 36M * 0 0 GGTATATGCACAAAATGAGATGCTTGCTTATCAACA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_957_461 16 phage 3903 255 36M * 0 0 TTGTCTAGGAAATAACCGTCAGGATTGACACCCTCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_808_403 0 phage 4122 255 36M * 0 0 GATAACCGCATCAAGCTCTTGGAAGAGATTCTGTCT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_986_385 4 * 0 0 * * 0 0 GATGCTGAAGGAACTTGGTAAAATTTATCTGGAGAA IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XM:i:05_143_981_626 16 phage 1522 255 36M * 0 0 TCCTCCTGAGACTGAGCTTTCTCGCCAAATGACGAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_470_717 16 phage 2061 255 36M * 0 0 ATGCGCCTTCGTATGTTTCTCCTGCTTATCACCTTC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_992_626 4 * 0 0 * * 0 0 GCCCAGAAGGGGCGGTTAAATGGTTTTTGGAGAAAG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XM:i:05_143_400_771 0 phage 3816 255 36M * 0 0 GATATTTTTCATGGAATTGATAAAGCTGTTGCCGAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:1 MD:Z:14T21 NM:i:15_143_962_110 16 phage 5074 255 36M * 0 0 AATGGAACAACTCACTAAAAACCAAGCTGTCGCTAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:05_143_774_100 0 phage 3810 255 36M * 0 0 GTGGTTGATATTTTTCATGGTATTGATAAAGCTGTT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII XA:i:0 MD:Z:36 NM:i:0

Page 34: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Bioconductoralignmentpackages

• Rsubread,Rbowtie/Rbowtie2,QuasR,etc.• ExamplecodeforRsubread:

library('Rsubread')

## build referencebuildindex(basename="reference_index",

reference=”ref.fa")## align align.stat <- align(index="reference_index",

readfile1="reads.fa", output_file="alignResults.BAM",type="dna")

Page 35: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Oncethereadsarealigned

• Downstreamanalysesdependonpurpose.– WewillcovertheanalysesforRNA-seq,ChIP-seq,andBS-seqinnextseverallectures.

• Oftenonewantstomanipulatingandvisualizingthealignmentresults.Thereareseveralusefultools:– filemanipulating(formatconversion,counting,etc.):samtools/Rsamtools,BEDTools,bamtools,IGVtools.

– Visualizing:samtools(textversion),IGV(JavaGUI).

Page 36: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

samtools• samtoolsprovidevariousutilitiesformanipulatingalignments

intheSAMformat,includingsorting,merging,indexingandgeneratingalignmentsinaper-positionformat.

• Commandlinedriven,meaningoneneedstotypecommandinaterminalwindow.– Installationcouldbetricky.NeedstoinstallextratoolsonWindowsor

Mac,suchasCygwinandperlonWindowsandXcodeonMac.

• Mainfunctionalities:– view:SAM<->BAMconversion– sort:sortalignmentfile– mpileup:multi-waypileup– depth:computethecoveragedepth– tview:textalignmentviewer– index:indexalignment

Page 37: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

samtools:generatesorted,indexedbamfiles

• BAMfile:binarySAM.Smallerfilesizesandfasteroperations.

• Toconvertfromsamtobam:samtools view -bS reads.sam > reads.bam

• Sortandindexbamfile.Thissortsthereadsbychromosomeandpositionandmakessubsequenceanalysiseasier.samtools sort reads.bam reads.sorted samtools index reads.sorted.bam

Page 38: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Anotherusefulsoftware:BEDTools

• AsetofcommandstomanipulateBED/GFF/VCFfiles.• Conversiontools:pairToBed(BAM), bamToBed, bedToBam,etc.

• Countingtools:coverageBed(BAM), windowBed(BAM)

• Others: sortBed, overlap,etc.

Page 39: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Bioconductorpackage:Rsamtools

• ProvidefunctionstoimportBAMfilestoR.• Therearemanytools(samtools,BEDTools,bamtools)availabletoconvertdifferentformats(BED,SAM,fasta,fastq,etc.)toBAM.

• ReadalignmentresultsshouldalwaysbesavedinBAMformatbecausetheyaresmallerandfaster.

Page 40: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

ReadinaBAM file> bamFile="reads.sorted.bam"> bam <- scanBam(bamFile)> names(bam[[1]])[1] "qname" "flag" "rname" "strand" "pos" "qwidth" "mapq" "cigar" [9] "mrnm" "mpos" "isize" "seq" "qual”

ThisgivestheavailableinformationintheBAMfile.Onecanspecifywhattoreadin(tosavetimeandmemory):

> what <-c("rname", "strand", "pos”, ”qwidth”) ## fields to read in

> param <- ScanBamParam(what = what)> bam <- scanBam(bamFile, param=param)[[1]]> names(bam)[1] "rname" "strand" "pos” "qwidth”> bam$pos[1:10][1] 1 2 3 3 4 4 4 4 4 5> bam$strand[1:10][1] + + + + - - - - - +Levels: + - *

Page 41: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Summarizethereadcounts

• Remembereachalignedreadcanbetreatedasagenomicinterval.SotheresultsfromscanBamcanbeusedtoconstructanGRangesobject(ofmillionsofintervals):

> GRange.reads=GRanges(seqnames=Rle(bam$rname), ranges=IRanges(bam$pos, width=bam$qwidth))

• Thenitbecomesveryhandy,forexample,wecan:– computegenomecoverage:

> cc=coverage(IRange.reads)

– countnumberofreadsinintervals(suchasgenes):> countOverlaps(genes, GRange.reads)

Page 42: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Anexample:obtainingRNA-seqreadsmappedtoexonsandintrons

library(GenomicRanges)library(GenomicFeatures)library(Rsamtools)## get gene annotation, and extract exons/intronsrefGene.hg18=makeTranscriptDbFromUCSC(genom="hg18",tablename="refGene")ex=exonsBy(refGene.hg18, "tx")intr=intronsByTranscript(refGene.hg18)

## read in RNA-seq BAM filewhat=c("rname", "strand", "pos", "qwidth")TSS.counts=NULLparam=ScanBamParam(what = what)bam=scanBam(”RNA-seq.bam", param=param)[[1]]IRange.reads=GRanges(seqnames=Rle(bam$rname),

ranges=IRanges(bam$pos, width=bam$qwidth))

## obtain countscounts.exon=countOverlaps(ex, IRange.reads)counts.intron=countOverlaps(intr, IRange.reads)

Page 43: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Visualizationofsequencingdata–IntegratedGenomeViewer(IGV)

• “TheIntegrativeGenomicsViewer(IGV)isahigh-performancevisualizationtoolforinteractiveexplorationoflarge,integrateddatasets.Itsupportsawidevarietyofdatatypesincludingsequencealignments,microarrays,andgenomicannotations.”

• WritteninJavaandrunsonallOS.• Veryversatileandfast.• Abilitytoconnecttodataserveranddisplaysomepublicdata(fromENCODE,broad,etc.)

Page 44: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

AlignedreadsonIGV

Page 45: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

ChIP-seqdataonIGV

Page 46: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

RNA-seqjunctionreadsonIGV

Page 47: Introduction to second- generation sequencing · Second generation sequencing •Aka: high-throughput sequencing, next generation sequencing (NGS). •Able to sequence large amount

Review• We’vecovered

– Basicsofsecondgenerationsequencing:technology,data,application

– Alignmentusingbowtie– Manipulationofalignmentresultsusingsamtools– ImportalignmentintoRusingRsamtools.– VisualizationusingIGV.