Indel-based Realignment - UCLA · 2017. 2. 24.


Post on 18-Nov-2020


Page 1

Indel-based Realignment

Improving the original alignments of the reads based on multiple sequence (re-)alignment

Page 2

You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: GATK Best Practices workflow in three phases: Data Pre-processing >> Variant Discovery >> Callset Refinement. Some steps are marked Non-GATK.]

•  Data Pre-processing: Raw Reads -> Map to Reference (BWA mem) -> Mark Duplicates & Sort (Picard) -> Indel Realignment -> Base Recalibration -> Analysis-Ready Reads
•  Variant Discovery: Analysis-Ready Reads -> Var. Calling (HC in ERC mode) -> Genotype Likelihoods -> Joint Genotyping -> Raw Variants (SNPs, Indels) -> Variant Recalibration (separately per variant type) -> Analysis-Ready Variants (SNPs & Indels)
•  Callset Refinement: Genotype Refinement, Variant Annotation, Variant Evaluation -> look good? -> use in project / troubleshoot

Page 3

InDels = insertion/deletion

             Insertion                        Deletion
Ref seq      AGCTAGGGTC                       AGCTAGGGTC
Sample seq   AGCTAGGGTC with TTC inserted     AGCGGTC
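The two cases above can be reproduced with plain string slicing. This is only a sketch: the helper names are made up for illustration, and the insertion index is an assumption, since the slide does not preserve where TTC was inserted. The deletion index is the one that turns the slide's reference into the slide's sample sequence.

```python
REF = "AGCTAGGGTC"  # reference sequence from the slide


def insert_bases(ref, index, bases):
    """Return the sample sequence with `bases` inserted before ref[index]."""
    return ref[:index] + bases + ref[index:]


def delete_bases(ref, index, length):
    """Return the sample sequence with `length` reference bases removed at `index`."""
    return ref[:index] + ref[index + length:]


if __name__ == "__main__":
    # Insertion position 3 is hypothetical; only the inserted bases (TTC) come from the slide.
    print(insert_bases(REF, 3, "TTC"))
    # Removing the 3 bases "TAG" at index 3 gives the slide's deletion example.
    print(delete_bases(REF, 3, 3))
```

An insertion makes the sample longer than the reference and a deletion makes it shorter, which is why a mapper needs a gap to line the two up again.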

Page 4

The problem we want to fix

[Figure: reads over a 7 bp "T" homopolymer run, shown twice: alignment by BWA, and after realignment. Annotations:]

•  Several consecutive "SNPs" only found on reads ending on the right of the homopolymer
•  Several consecutive "SNPs" only found on reads ending on the left of the homopolymer
•  Adding a 1-bp insertion brings sanity to the entire alignment

Page 5

Why does this happen?

✓  Local realignment around indels -> most parsimonious alignment

✓  Improves accuracy of several downstream processing steps

Ref  T A C C C A T T T T T T T C T A A A A G C T
BWA          C C A T T T T T T C T A A A A A C T
IR         C C A - T T T T T T C T A A A A A C T

[Figure: ref / del / ins example: CATGCA CCA TGCA G]

•  Mappers cannot "see" indels near ends of reads
•  Because mismatches are "cheaper" than a gap in this context

Mismatch = -1
Open gap = -3
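A small sketch of the arithmetic behind "mismatches are cheaper", using the Ref and read sequences from this slide and the two penalties quoted on it. Several things are assumptions: matches score 0, a 1-bp gap costs only the gap-open penalty, and the two placements (ungapped starting at reference index 4, gapped starting at index 3 with the deletion after the third read base) are chosen by hand rather than found by an aligner. The function names are illustrative only.

```python
MISMATCH = -1   # penalty from the slide
GAP_OPEN = -3   # penalty from the slide

REF = "TACCCATTTTTTTCTAAAAGCT"
READ = "CCATTTTTTCTAAAAACT"


def count_mismatches(ref_segment, read_segment):
    """Number of positions at which two equal-length segments differ."""
    return sum(1 for r, q in zip(ref_segment, read_segment) if r != q)


def ungapped_score(ref, read, start):
    """Score of placing the read on ref[start:] without any gap (match = 0)."""
    return MISMATCH * count_mismatches(ref[start:start + len(read)], read)


def one_deletion_score(ref, read, start, read_offset, del_len=1):
    """Score of a placement with one deletion after `read_offset` read bases."""
    left = count_mismatches(ref[start:start + read_offset], read[:read_offset])
    right_start = start + read_offset + del_len
    right_len = len(read) - read_offset
    right = count_mismatches(ref[right_start:right_start + right_len],
                             read[read_offset:])
    return MISMATCH * (left + right) + GAP_OPEN


if __name__ == "__main__":
    print("ungapped:", ungapped_score(REF, READ, 4))          # 3 mismatches
    print("1-bp deletion:", one_deletion_score(REF, READ, 3, 3))  # 1 mismatch + gap
```

Under these assumptions the ungapped placement scores -3 and the gapped one scores -4, so a mapper that looks at this read alone prefers the three mismatches, even though the gapped placement leaves only a single mismatching base.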

Page 6

How do we identify where realignment is needed?

•  Known sites (e.g. dbSNP, 1000 Genomes)

•  Indels seen in original alignments (in CIGARs)

•  Sites where evidence suggests a hidden indel
   - Entropy calculation identifies "messy areas"
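To make the second bullet concrete, here is a minimal sketch of reading indel positions out of a CIGAR string. It is not how RealignerTargetCreator is implemented; the function name and the returned tuple layout are assumptions for illustration. The CIGAR operation letters and the rule for which operations consume reference bases follow the SAM format.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # operations that advance along the reference


def indel_sites(pos, cigar):
    """List (op, reference_position, length) for each I or D in a CIGAR.

    `pos` is the 1-based leftmost mapping position of the read.
    For a deletion, the position is the first deleted reference base.
    For an insertion, it is the reference base that follows the inserted bases.
    """
    sites = []
    ref_pos = pos
    for length_text, op in CIGAR_OP.findall(cigar):
        length = int(length_text)
        if op in ("I", "D"):
            sites.append((op, ref_pos, length))
        if op in REF_CONSUMING:
            ref_pos += length
    return sites


if __name__ == "__main__":
    # POS and CIGAR taken from the realigned read shown later in these slides.
    print(indel_sites(10011431, "87M1D14M"))
```

Positions collected this way from many reads could then be merged into candidate intervals; that merging step is left out here.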

Page 7

How does the realignment algorithm work?

1. Find the best alternate consensus sequence that, together with the reference, best fits the reads in a pile (maximum of 1 indel)

2. Score for alternate consensus = total sum of quality scores of mismatching bases

3. If best alternate consensus is sufficiently better than the original alignments (using LOD score threshold) -> accept proposed realignment

[Figure: two read piles against the reference]

Ref: AAGAGTAG

AAG---AGTAG

AAGAGTAG

•  Read pile consistent with a 3 bp insertion
•  Read pile consistent with the reference sequence
•  Three adjacent SNPs
•  Realigning determines which is better
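The three steps can be sketched roughly as follows. This is a simplified illustration, not the IndelRealigner implementation. The reference is the one from the slide; the inserted bases (TTC), the reads, their base qualities, the division by 10 to turn summed Phred qualities into a LOD-like number, and the threshold value of 5.0 are all assumptions made for the example.

```python
REF = "AAGAGTAG"      # reference from the slide
ALT = "AAGTTCAGTAG"   # hypothetical consensus: REF with TTC inserted after AAG


def mismatch_quality_sum(haplotype, seq, quals, offset):
    """Sum of base qualities at positions where the read disagrees with the haplotype."""
    return sum(q for i, (base, q) in enumerate(zip(seq, quals))
               if haplotype[offset + i] != base)


def best_placement_score(haplotype, seq, quals):
    """Lowest mismatch-quality sum over all ungapped placements on the haplotype."""
    last_offset = len(haplotype) - len(seq)
    return min(mismatch_quality_sum(haplotype, seq, quals, off)
               for off in range(last_offset + 1))


def evaluate_consensus(ref, alt, reads, lod_threshold=5.0):
    """Compare original alignments with 'alternate consensus plus reference'.

    Each read is (sequence, qualities, original offset on ref). A read keeps
    its original score when the alternate consensus does not fit it better.
    """
    original_total = 0
    consensus_total = 0
    for seq, quals, ref_offset in reads:
        original = mismatch_quality_sum(ref, seq, quals, ref_offset)
        alternate = best_placement_score(alt, seq, quals)
        original_total += original
        consensus_total += min(original, alternate)
    improvement = (original_total - consensus_total) / 10.0
    return improvement, improvement >= lod_threshold


if __name__ == "__main__":
    reads = [
        ("AAGTTCAG", [30] * 8, 0),  # carries the insertion: three adjacent "SNPs" on REF
        ("TTCAGTAG", [30] * 8, 0),  # carries the insertion, mismatches at its left end
        ("AAGAGTAG", [30] * 8, 0),  # consistent with the reference
    ]
    print(evaluate_consensus(REF, ALT, reads))
```

With these made-up reads, the two insertion-carrying reads each contribute three mismatching bases against the reference and none against the alternate consensus, while the reference-consistent read keeps its original alignment.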

Page 8

Indel Realignment steps/tools

•  Identify what regions need to be realigned
   ➔ RealignerTargetCreator

•  Perform the actual realignment
   ➔ IndelRealigner

Page 9

RealignerTargetCreator

•  Pre-processing step to find intervals that may need realignment

•  Input BAM file not necessary if processing only at known indels

•  Using a list of known indels will both speed up processing and improve accuracy, but is not required

[Figure: Input BAM + Known Sites -> RealignerTargetCreator -> Target Intervals -> IndelRealigner -> Realigned BAM]

java -jar GenomeAnalysisTK.jar \
    -T RealignerTargetCreator \
    -R human.fasta \
    -I original.bam \
    -known indels.vcf \
    -o realigner.intervals

Page 10

IndelRealigner

•  Attempts realignment at RealignerTargetCreator target intervals

•  Must use same input file(s) used in RealignerTargetCreator step

•  Processing options
   - Only at known indels: much faster, accurate for ~90-95% of indels
   - At indels seen in the original BAM alignments: the recommended mode
   - Using full Smith-Waterman realignment: most accurate, but heavy computational cost and not really necessary with newer technologies

[Figure: Input BAM + Target Intervals + Known Sites -> IndelRealigner -> Realigned BAM]

java -jar GenomeAnalysisTK.jar \
    -T IndelRealigner \
    -R human.fasta \
    -I original.bam \
    -known indels.vcf \
    -targetIntervals realigner.intervals \
    -o realigned.bam
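Since IndelRealigner must see the same inputs as RealignerTargetCreator, it can help to build both command lines from one set of arguments. The sketch below does that with the flags shown on these two slides and nothing else; the function names are hypothetical, the file names are the slides' placeholders, and nothing is executed unless `run_realignment` is called.

```python
import subprocess


def realignment_commands(reference, bam, known, intervals, realigned_bam,
                         jar="GenomeAnalysisTK.jar"):
    """Build the two command lines from the slides as argument lists."""
    base = ["java", "-jar", jar]
    target_creator = base + [
        "-T", "RealignerTargetCreator",
        "-R", reference,
        "-I", bam,
        "-known", known,
        "-o", intervals,
    ]
    indel_realigner = base + [
        "-T", "IndelRealigner",
        "-R", reference,
        "-I", bam,
        "-known", known,
        "-targetIntervals", intervals,  # output of the first step
        "-o", realigned_bam,
    ]
    return [target_creator, indel_realigner]


def run_realignment(commands, runner=subprocess.run):
    """Run the commands in order, stopping at the first failure."""
    for argv in commands:
        runner(argv, check=True)


if __name__ == "__main__":
    # Print only; running requires Java, the GATK jar and the input files.
    for argv in realignment_commands("human.fasta", "original.bam", "indels.vcf",
                                     "realigner.intervals", "realigned.bam"):
        print(" ".join(argv))
```

Passing the same `reference`, `bam` and `known` values to both steps is the point of the wrapper: the intervals file written by the first command is the one read by the second.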

Page 11

This is what a realigned BAM looks like

[Figure: read pileups before and after realignment, for old data (lower quality) and new data (higher quality)]

DePristo, M., Banks, E., Poplin, R. et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Gen.

Page 12

Can I see the effects of realignment?

•  IndelRealigner changes the CIGAR string of realigned reads but maintains the original CIGAR (with OC tag)
   -> Can grep for realigned regions and view in genome browser (IGV)

20GAVAAXX100126:1:67:10041:180738 99 20 10011431 70 87M1D14M = 10011720 390

TTAAATGTGTTTATCTATTGTTCTACTATTCAGTTACCTGATTATAAAATCAAAGATTATTTCATGAAACTCAGTACCCCTTCAGGGAAAAAAAAAAAAAT

HHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHHGGGGGGGG X0:i:1 X1:i:0 MC:Z:101M OC:Z:101M PG:Z:MarkDuplicates RG:Z:20GAV.1XG:i:0 AM:i:37

NM:i:1SM:i:37 XM:i:1 XO:i:0

BQ:Z:@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@cccddc``a`^\[Y MQ:i:60 XT:A:
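Beyond grep, the OC tag can be picked out of SAM text with a few lines of Python. This is a sketch with a made-up function name. It assumes tab-separated SAM records (for example from `samtools view`), and the sample record reuses the read above with SEQ and QUAL replaced by `*` and most tags dropped to keep it short.

```python
def realigned_reads(sam_lines):
    """Yield (name, contig, pos, original_cigar, new_cigar) for reads with an OC tag."""
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        # Optional fields look like TAG:TYPE:VALUE, e.g. OC:Z:101M
        tags = {f[:2]: f[5:] for f in fields[11:] if len(f) >= 5}
        if "OC" in tags:
            yield fields[0], fields[2], int(fields[3]), tags["OC"], fields[5]


if __name__ == "__main__":
    record = "\t".join([
        "20GAVAAXX100126:1:67:10041:180738", "99", "20", "10011431", "70",
        "87M1D14M", "=", "10011720", "390", "*", "*",
        "OC:Z:101M", "NM:i:1",
    ])
    for name, contig, pos, old_cigar, new_cigar in realigned_reads([record]):
        print(f"{contig}:{pos} {name} {old_cigar} -> {new_cigar}")
```

For the read on this slide, the original CIGAR 101M became 87M1D14M, so the contig and position it reports (20:10011431) is a place worth opening in IGV.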

Page 13

Is realignment still necessary with latest software?

•  Variant callers with reassembly step (HaplotypeCaller, MuTect2, Platypus) do not require indel realignment

•  BUT potential improvement for Base Quality Score Recalibration when run on realigned BAM files (artifactual SNPs are replaced with real indels).

•  Also still useful for legacy tools
   - UnifiedGenotyper
   - MuTect1

Page 14

You are here in the GATK Best Practices workflow for germline variant discovery

[Figure: the GATK Best Practices workflow diagram from Page 2, shown again]

Page 15

Further reading

http://www.broadinstitute.org/gatk/guide/best-practices

http://www.broadinstitute.org/gatk/guide/article?id=38

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_IndelRealigner.php

https://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_indels_RealignerTargetCreator.php