indel calling pipeline in the gatk - broad institute calling pipeline in the gatk guillermo del...
TRANSCRIPT
Indel Calling Pipeline in the GATK
Guillermo del Angel, Ph.D.
Genome Sequencing and Analysis Group Medical and Population Genetics Program
Feb 17, 2011
What are the GATK’s indel processing abilities?
GATK Tool Function IndelRealigner Runs multiple sequence alignment on
reads and forms consensus indels suitable for variant genotyping.
UnifiedGenotyper Determines consensus alternate alleles, optimal allele frequency distribution, determines whether sites should be called, assigns genotypes and annotations.
VariantFiltration Filters calls based on given expressions.
VariantEval Indel metrics and stratifications for analysis
Step 1: BAM data processing
Input BAMs Indel Realignment
BAMs used for calling
• Indel realignment is a critical step in preparing BAM’s for indel calling.
• We recommend full indel realigning (Smith Waterman) at all sites, realignment using only known sites is not enough!
Note: Exome BAM’s coming out of Picard have already been fully indel-realigned!
Step 2: Indel discovery
Genotype Likelihoo
ds Calculatio
n
BAMs used for calling
• The genotype likelihoods calculation is inspired by Dindel (with kind permission from C Albers and R Durbin).
• Typical command line: java -jar GenomeAnalysisTK.jar –R ref.fasta -T UnifiedGenotyper –L mytargets.list –I myreads.bam –o mycalls.vcf -B:dbsnp,VCF dbsnp.vcf -glm DINDEL!
Allele Frequenc
y Calculatio
n
Hard-filters
Unified Genotyper
Only difference with SNP calling!
Beagle
Same computation as
with SNP’s
Some details and caveats… • All standard parameters used in UG for SNP calling are also valid
for indels!
– E.G. –stand_call_conf for a calling threshold.
• Heuristic for controlling sensitivity:
– We’ll only consider indels for genotyping if they are present in N reads, controlled by –minIndelCnt parameter. Default value: 5, may want lower value for higher sensitivity in lowpass samples.
• Limitations:
– Only bi-allelic sites considered. If more than 2 alt. alleles detected at a site, the one with most supporting reads taken.
• NOTE: Application of BAQ will severely degrade indel caller performance. Make sure argument –baq is either not included or set to OFF!
Step 3: variant filtration (indels)
Genotype Likelihoo
ds Calculatio
n
BAMs used for calling
• Hard filters are needed for eliminating calls coming from read artifacts. • This is an ongoing area of improvement, stay tuned on the GATK Wiki for best
practice recommendations! • Example command line with current best practice:
Allele Frequenc
y Calculatio
n
Hard-filters Beagle
Unified Genotyper
java –jar ./dist/GenomeAnalysisTK.jar -T VariantFiltration ref.fasta –o out.vcf -B:variant,VCF input.vcf \!--filterExpression "QUAL<30.0" --filterName "LowQual” \!--filterExpression "SB>=-1.0" --filterName "StrandBias” \!--filterExpression "QD<1.0" --filterName "QualByDepth” \!--filterExpression "(MQ0 >= 4 && ((MQ0 / (1.0 * DP)) > 0.1))" --filterName "HARD_TO_VALIDATE” \!--filterExpression "HRun>=15" --filterName "HomopolymerRun"!
Step 4 (Optional): Genotype refinement
Genotype Likelihoo
ds Calculatio
n
BAMs used for calling
• Beagle can be used to refine genotypes of indel calls. Current recommended best practice is to merge Indel and SNP calls and running Beagle on combined set. More details our Wiki page.
Allele Frequenc
y Calculatio
n
Hard-filters Beagle
Unified Genotyper
Assessing indel callsets
• How do we know if the callset that we have is of high sensitivity and high specificity?
• How many variants should we typically get?
• How should indels be distributed in size, allele frequency and types of indels?
VariantEval’s support for Indels java –jar GenomeAnalysisTK.jar -B:eval,V!CF mycalls.vcf -T VariantEval –R reffile.fasta -EV IndelMetricsByAC -EV IndelStatistics -B:dbsnp,VCF dbsnp.vcf -o output.txt!
This produces a GATK report file with aggregated statistics.
##:GATKReport.v0.1 CountVariants : Counts different classes of variants in the sample!CountVariants CompRod CpG EvalRod JexlExpression Novelty nProcessedLoci nCalledLoci nRefLoci nVariantLoci vari!antRate variantRatePerBp nSNPs nInsertions nDeletions nComplex nNoCalls nHets nHomRef nHomVar nSingleton!s heterozygosity heterozygosityPerBp hetHomRatio indelRate indelRatePerBp deletionInsertionR!atio!CountVariants dbsnp CpG eval none all 63025520 1215 0 1215 0.00!001928 51872.00000000 0 611 604 0 0 724 0 491 0 ! 0.00001149 87051.00000000 1.47454175 0.00001928 51872.00000000 0.98854337 !
CountVariants dbsnp CpG eval none known 63025520 1000 0 1000 0.00!001587 63025.00000000 0 491 509 0 0 567 0 433 0 ! 0.00000900 111156.00000000 1.30946882 0.00001587 63025.00000000 1.03665988 !
CountVariants dbsnp CpG eval none novel 63025520 215 0 215 0.00!000341 293141.00000000 0 120 95 0 0 157 0 58 0 ! 0.00000249 401436.00000000 2.70689655 0.00000341 293141.00000000 0.79166667 !
CountVariants dbsnp all eval none all 63025520 13580 0 13580 0.00!021547 4641.00000000 0 6649 6931 0 0 8852 0 4728 0 ! 0.00014045 7119.00000000 1.87225042 0.00021547 4641.00000000 1.04241239 !
Key module! Produces indel size distributions as well as classification tables
How many indels should I get?
Lowpass example AC distribution (Chr20 only):
Exome example AC distribution:
High heterozygosity in lowpass call set is leading us to focus on improving specificity of our calls.
1 2 5 10 20 50 100 200
15
10
50
500
5000
Total indels by Allele Count, target captured exomes, N=96, 1/Het=65916.9
Allele count
Num
ber
of vari
ants
GATK UnifiedGenotyperNeutral expectation ( 0.9*0.000015*32950014.0*(1/c(1:192)) )
High number of singletons: maybe also a lot of false positives?
1 2 5 10 20 50 100 200 500
11
00
10
00
0
Total indels by Allele Count, pop= ASN, N=266, 1/Het=8000.0
Allele count
Nu
mb
er
of
va
ria
nts
GATK UnifiedGenotyperNeutral expectation ( 0.9*0.000125*63025520.0*(1/c(1:532)) )
Published estimates In Whole Genome ~ 1 indel/8000 bp
Empirical exome estimate: ~500 indels/exome (33 Mbp)
A typical plot of indel size distribution in whole genome sets
!!
!!
!!!
!!
!!
!
!!
!
!
!!!
!!!!!!
!
!
!!
!
!!!!!
!
!
!!
!
!
!!!!!!
!
!
!!!!
!
!
!
!!!
!
!
!!
!
!!
!
!
!
!
!
!
!
!
!!
!
!
!!!!!!!!
!
!!!!
!
!!!!!
!
!!
!!!
!
!
!
!!!!
!
!!!!!!
!!!!!!!
!!! !!
!!
!!!!
!!!!!
!
!
!!!!!!!!!
!!!
!!!!!!!!
!!!
!
!
!!!!!!!!
!!
!!!!
!
!!!!!!
!!!!!!!
!
!!
!
!!!!
!!!
!!!
!
!!
!!!!!!!!!
!!
!
!
!!!!
!!!
!!!!!!!!!!!!!
!!
!
!!!
!
!
!!
!
!!
!!!!!!!
!!!
!!!
!!!!
!!
!!!!
!!
!!!!!!
!!!
!
!
!!!!!!
!!!
!!
!
!!
!
!!
!
!!
!!!!!!!!!
!
!!!!!!
!!
!!
!!
!
!
!
!!
!!!!
!
!!!
!
!
!
!
!!
!
!
! !!!!!!!!!!!!
!
!
!
!
!!
!
!!!
!!!!
!!
!!!!
!
!!!!!!!!
!
!!!
!
!
!!!!!!
!
!
!
!
!!
!
!!!!
!!!!!!
!!!
!!!!!!!
!!
!
!!
!
!!
!
!!!
!!!
!
!
!!!!
!
! !
!!!
!
!!!!!!!!
!!
!!!!!
!!!
!
!
!!!!!!!!!!
!
!
!
!!
!
!!
!
! !!
!!
!
!!
!
!
!
!!
!
!!!
!
!
!
!!! !
!!!!!!!!!!!!!!!!!!!!
!!!
!!!
110
100
1000
10000
Indel Size Distribution for low!pass 1000G samples, GATK, pop = ASN
Event Size (!:deletion, +:insertion)
Event C
ount per
sam
ple
!3
0
!2
8
!2
6
!2
4
!2
2
!2
0
!1
8
!1
6
!1
4
!1
2
!1
0
!8
!6
!4
!2 0 2 4 6 8
10
12
14
16
18
20
22
24
26
28
30
Notice that: • Counts are very consistent among samples • Counts are mostly symmetric between insertions and deletions • Even counts higher than odd counts
!!!!!!!!!!!!!!!!!!!!
!!
!! !!!!!!!!!!!!!!!!
!
!
!
!!!!!!!!!!!!!!!!!!!!
!
!
!
!!!!!!!!!!!!!!!!!!!!
!!!!
!!!!
!!
!
!
!
!!!!!!!
!
!
!
!!!!!
!!
!!!!!!!
!
!!!!
!
!!!!!!!!!!!!!
!
!!!!
!
!!!!!!
!!
!!!!!!!!!!!!
!!
!
!!!!!!!!!!!!!!!!
!
!!!!!!!!!!!!!!!! !
!!!
!
!
!
!
!!!
!!!
!
!
!
!
!!
!!!!!!!!!!!!!!!!
!
!
!
!!
!!
!
!!!!!!!!
!
!
!!
!!
!
!!!!!!! !!
!
!
!
!
!!
!!
!
!!
!
!!
!! !
!
!
!
!
!
!
!
!!
!!
!!!!!! !!!!
!!!!!!
!!!!!!!!
!!
!
!
!
!!
!
!
!
!!!!!!
!!!!!!!!!!!!!!!!!!!!!!!!!!
12
510
20
50
100
Indel Size Distribution for 96 exome capture 1000G samples, GATK
Event Size (!:deletion, +:insertion)
Event C
ount per
sam
ple
!30
!28
!26
!24
!22
!20
!18
!16
!14
!12
!10
!8
!6
!4
!2 0 2 4 6 8
10
12
14
16
18
20
22
24
26
28
30
• Strong bias against frameshift indels! • Count per sample much lower than in whole genome (~500 indels per sample) • More deletions than insertions • We believe many of the 1-bp events should be artifacts
Indel size distribution in exomes
Different indel types come at different rates 1
10
10
01
00
01
00
00
Total indels by Indel type, pop= ASN, N=266#
of
Ind
els
!!!!!!!!!!!!!!!!!!
!
!!
!
!!
!!!!!!!!
!!
!
!
!!!
! !!!!!!!!!!!!!!!
!
!
!
!!
!!!!!!
!!!
!!
!!!!
!
!!
!
!!
!
!!!!!!!!!!!!!
!
!
!!!
!
!
!!!!
!
!
!!!!!!!!!!!!
!
!
!
!!!
!
!
!!!!!
!
!
!
!
!!
!
!!
!
! !
!
!
!!
!
!
!!!!
!
!
!!!!
!
!!
!!
!
!!!!
!!!!
!!
!!
!
!!
!
!!
!
!
!
!
!
!!!!!
!!
!
!
!
!!!
!
!
!
!!
!
!!!!!
!
!
!!
!
!
!
!
!!
!!!
!! !
!
!!
!!
!
!!!
!!!
!
!
!!
!!
!
!
!
!
!
!!
!
!
!
!!
!!!!
!!!
!
!
!
!
!
!!!
!
!
!!
!
!
!!
!
!!
!
!
!
!!!!!
!
!
!!!!!!!
!
!
!!
!
!!!!
!!
!
!!!
!
!!
!!
!!!
!!!!!!!!
!!
!
!!!
!
!!! !
!!!!!!!!!!!!
!
!!
!
!!!!
!!
!!!!!
!
!
!!
!
!!
!!
!!!
!!!
!!!!!!!!
!!!
! !
!
!!!
!!!!!!!
!
!!
!
!
!
!
!
!
!!
!
!
!!!!!!
!!
!
!
!!!!!!!!
!!!!!!!!
!
!!!
!!!
!!
!!!!!
!
!!!!
!!!
!
!
!!!!!!!
!
!!!
!!
!!!
!
!!!!
!
!!
!!!!
!!
!!
!
!!
!!
!!
!
!
!!!!!!!
!
!
!!
!
!!!
!
!!
!!!!
!!!!!!
!!!!!!!
!!!!!!!!!!!!!
!!
!
!!!!!!!!!!!!!!
!
!!!!
!!!!!!!
!
!!
!!!!!!!!!!!!!! !!!!
!
!!
!
!! !!
!!
!!
!
!
!
!!!!!!
!!!!
!
!!
!!
!!
!!
!
11
01
00
10
00
10
00
0
Novel (A
)
Novel (C
)
Novel (G
)
Novel (T
)
Novel L=
1
Novel L=
2
Novel L=
3
Novel L=
4
Novel L=
5
Novel L=
6
Novel L=
7
Novel L=
8
Novel L=
9
Novel L=
10+
RepE
xp (
A)
RepE
xp (
C)
RepE
xp (
G)
RepE
xp (
T)
RepE
xp (
AC
)
RepE
xp (
AG
)
RepE
xp (
AT
)
RepE
xp (
CA
)
RepE
xp (
CG
)
RepE
xp (
CT
)
RepE
xp (
GA
)
RepE
xp (
GC
)
RepE
xp (
GT
)
RepE
xp (
TA
)
RepE
xp (
TC
)
RepE
xp (
TG
)
RepE
xp L
=1
RepE
xp L
=2
RepE
xp L
=3
RepE
xp L
=4
RepE
xp L
=5
RepE
xp L
=6
RepE
xp L
=7
RepE
xp L
=8
RepE
xp L
=9
RepE
xp L
=10+
Oth
er
Consistent difference in the number of A,T vs. C,G repeat expansions: Ratio of (A+T)/(C+G) expansions can be used as a specificity metric!
These counts for the Phase 1 1000 Genome samples taken for illustration only
Other callers
– Aside from the GATK, SAMTools and DINDEL can be alternatively used for indel calling.
– Example command line using SAMTools’ mpileup caller: samtools mpileup -ugf ref.fasta reads.bam | ../samtools/bcftools/bcftools view -vc - > myout.vcf!
– More info at: • http://samtools.sourceforge.net/mpileup.shtml • http://www.sanger.ac.uk/resources/software/dindel/