eccb10 talk - nextgen sequencing and snps
Aim
To identify the SNP that causes a disease or phenotype:
– Find them all, so you don’t miss it (false negatives)
– Don’t find too many, so the result stays useful (false positives)
General principle
• Map reads to reference sequence
• Convert from read-based to base-based (i.e. pileup)
• Look at differences
This presentation
Factors in finding real SNPs:
– Sequencing technology
– Mapping algorithms and initial calling
– Post-mapping tweaking
– Calling
– Filtering
Based on experiences in exome resequencing; “experiment 5” on last slide Thomas
1. Sequencing
• Provides raw data
• Different technologies:
– Different accuracy (critical!)
– Different types of errors
Accuracy
Base quality drops along the read
Sanger > SOLiD > Illumina > 454 > Helicos
Base calling errors
Main source of error for Illumina, less in SOLiD & 454
Homopolymer runs
• Especially 454: 39% of errors are in homopolymers
• A5 motifs: 3.3% error rate
• A8 motifs: 50% error rate!
Reason: signal intensity is used as a measure of homopolymer length
Is it 4? Is it 5? Is it 4?
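To see how exposed a region is to this kind of error, one can list its homopolymer runs. A minimal Python sketch; the function name, the `min_len` threshold and the example sequence are illustrative assumptions, not from the talk:

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=4):
    """Return (start, base, length) for every run of identical bases >= min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs

# Made-up sequence containing one A5 and one T8 motif.
example = "ACGT" + "A" * 5 + "GGC" + "T" * 8 + "C"
print(homopolymer_runs(example))
```

Positions are 0-based offsets into the sequence.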
Consensus accuracy
Increase accuracy for SNP calling by increasing coverage:
– Illumina: 20X
– SOLiD: 12X
– 454: 7.4X
– Sanger: 3X
Factors: raw accuracy + read length
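Why coverage helps can be illustrated with a toy model. The sketch below is an assumption-laden simplification: it treats read errors at a site as independent, assumes all wrong reads agree with each other (pessimistic), and ignores read length and mapping. It is not how the coverage figures above were derived.

```python
from math import comb

def majority_error(per_base_error, depth):
    """P(more than half of `depth` independent reads are wrong at one site)."""
    need = depth // 2 + 1
    return sum(
        comb(depth, k) * per_base_error ** k * (1 - per_base_error) ** (depth - k)
        for k in range(need, depth + 1)
    )

# Hypothetical 1% raw error rate at increasing depth.
for depth in (1, 3, 7, 13, 21):
    print(depth, majority_error(0.01, depth))
```

Under these assumptions the consensus error falls quickly with depth, and a technology with lower raw accuracy needs more reads to reach the same consensus accuracy.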
2. Mapping: fastq => bam
• Maq and bwa: report only 1 mapping
  If multiple: mapQ = 0
  versus mosaik & mrFAST: report alternatives
• Maq and bwa: use paired-end information => might prefer correct distance over correct alignment
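Since ambiguously placed reads end up with mapQ = 0, they are easy to separate out before calling. A small sketch that reads SAM text lines (MAPQ is the fifth column); the read names and the `min_mapq` threshold are made up for illustration:

```python
def split_by_mapq(sam_lines, min_mapq=1):
    """Split read names into (confident, ambiguous) using the SAM MAPQ column."""
    confident, ambiguous = [], []
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])
        (confident if mapq >= min_mapq else ambiguous).append(fields[0])
    return confident, ambiguous

# Hypothetical alignments: read_b maps to several places (mapQ = 0).
sam_lines = [
    "@SQ\tSN:1\tLN:1000",
    "read_a\t0\t1\t100\t60\t36M\t*\t0\t0\t*\t*",
    "read_b\t0\t1\t200\t0\t36M\t*\t0\t0\t*\t*",
]
print(split_by_mapq(sam_lines))
```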
3. Post-mapping tweaking
Improve quality of mapped data:
– duplicate removal
– baseQ recalibration
– read clipping
– local realignment around indels
Genome Analysis Toolkit (GATK): http://bit.ly/9zIn4b
Duplicate removal
PCR amplification bias: multiple reads with same start/stop => keep only one (with highest mapping Q)

Picard:
java -Xmx2048m \
  -jar /path_to_picardtools/MarkDuplicates.jar \
  INPUT=input.bam \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  VALIDATION_STRINGENCY=LENIENT

samtools:
samtools rmdup input.bam output.bam
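The rule on this slide (same start/stop, keep the read with the highest mapping quality) can be sketched in a few lines. This is an illustration of the idea only, not what Picard or samtools do internally; real tools also take strand and mate position into account. The dictionary keys are hypothetical.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, end): the one with the highest mapQ."""
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["end"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())

# Made-up reads: r1 and r2 share start/stop, so only one survives.
reads = [
    {"name": "r1", "chrom": "1", "start": 100, "end": 135, "mapq": 37},
    {"name": "r2", "chrom": "1", "start": 100, "end": 135, "mapq": 60},
    {"name": "r3", "chrom": "1", "start": 101, "end": 136, "mapq": 60},
]
print([r["name"] for r in remove_duplicates(reads)])
```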
baseQ recalibration
• Why?
– correct for variation in quality with machine cycle, sequence context, lane, baseQ…
• Steps:
– Identify what to correct for (create plots)
– Calculate covariates
– Apply covariates
– Check (create plots)
java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv
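The idea behind the two steps can be shown with a deliberately simplified sketch: count mismatches to the reference at sites not in dbSNP, grouped here only by reported quality, and turn the observed error rate into an empirical Phred score. The add-one smoothing and the single covariate are assumptions of this sketch; GATK's actual model also uses read group, dinucleotide context and more.

```python
import math
from collections import defaultdict

def empirical_qualities(observations):
    """observations: iterable of (reported_q, is_mismatch) at non-dbSNP sites.

    Returns {reported_q: empirical_q} as rounded Phred scores.
    """
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [bases, mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    table = {}
    for reported_q, (n_bases, n_mismatches) in counts.items():
        error_rate = (n_mismatches + 1) / (n_bases + 1)  # smoothing avoids log(0)
        table[reported_q] = round(-10 * math.log10(error_rate))
    return table

# Made-up data: bases reported as Q30 that mismatch about 1% of the time.
obs = [(30, True)] * 9 + [(30, False)] * 990
print(empirical_qualities(obs))
```

In this toy input the machine was overconfident: bases reported as Q30 behave like Q20, which is the kind of shift the "apply" step would write back into the bam file.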
Read clipping
Remove:
– low quality strings of bases
– sections of reads
– reads containing user-provided sequences
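The first option, clipping a low-quality stretch, might look like the sketch below for the 3' end of a read. The threshold of 20 and the example read are arbitrary assumptions; this is not the algorithm of any particular clipping tool.

```python
def clip_low_quality_tail(seq, quals, min_q=20):
    """Trim bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

# Made-up read whose last three bases are poor.
seq = "ACGTACGT"
quals = [35, 36, 34, 30, 28, 12, 8, 5]
print(clip_low_quality_tail(seq, quals))
```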
Local realignment near indels
java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam
4. SNP calling
• Different callers:
– samtools
– GATK UnifiedGenotyper
– SOAPsnp
– …
• Read-based => base-based
pileup:
1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
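Looking at differences starts with counting what each read says at a position. A minimal Python sketch for the read-base column of a samtools pileup line: `.` and `,` are matches to the reference on the forward and reverse strand, letters are mismatches, `^` plus one character marks a read start, `$` a read end, and `+n`/`-n` introduce indels. The function name is an assumption, and indels are simply skipped here.

```python
import re

def count_bases(ref, bases):
    """Count the alleles observed in one pileup read-base string."""
    bases = re.sub(r"\^.", "", bases)  # drop read-start marker and its mapQ char
    bases = bases.replace("$", "")     # drop read-end markers
    counts = {}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c in "+-":                  # indel: sign, length, then that many bases
            length = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(length) + int(length)
            continue
        allele = ref if c in ".," else c.upper()
        counts[allele] = counts.get(allele, 0) + 1
        i += 1
    return counts

# Position 277 from the pileup example: reference T, depth 22.
print(count_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

At position 277 this gives 12 reads supporting T, 7 supporting C and 3 supporting G, which is the kind of site a caller has to weigh against base and mapping qualities.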
GATK:
java \
  -Xmx6g \
  -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod
samtools:
samtools pileup \
  -vcs \
  -r 0.001 \
  -l CCDS.txt \
  -f human_b36_plus.fasta \
  input.bam \
  > output.pileup
VCF file

header:
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration

column header:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam

data:
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .
VCF file
INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE

FORMAT a_a:bwa057_b:picard.bam
GT:DP:GQ 1/1:3:36.00
GT:DP:GQ 1/1:6:45.00
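Both columns are easy to unpack in a script: INFO is a semicolon-separated list of `key=value` pairs and bare flags, and the FORMAT column names the colon-separated fields of each sample column. A small sketch (the function names are assumptions; values are left as strings):

```python
def parse_info(info):
    """Turn 'DB;DP=3;...' into a dict; flags without '=' become True."""
    fields = {}
    for item in info.split(";"):
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True
    return fields

def parse_sample(format_col, sample_col):
    """Pair FORMAT keys (GT:DP:GQ) with one sample's values (1/1:3:36.00)."""
    return dict(zip(format_col.split(":"), sample_col.split(":")))

info = parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")
sample = parse_sample("GT:DP:GQ", "1/1:3:36.00")
print(info["DP"], info["DB"], sample["GT"])
```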
Pileup => VCF
Custom scripts, then annotate:
java -Xmx10g \
  -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  --assume_single_sample_reads sample \
  -R human_b36_plus.fasta \
  -D dbsnp_129_b36_plus.rod \
  -I input.bam \
  -B variant,VCF,unannotated.vcf \
  -o annotated.vcf \
  -A AlleleBalance \
  -A MappingQualityZero \
  -A LowMQ \
  -A RMSMappingQuality \
  -A HaplotypeScore \
  -A QualByDepth \
  -A DepthOfCoverage \
  -A HomopolymerRun
5. Filtering
• Aim: to reduce number of false positives
• Options:
– Depth of coverage
– Mapping quality
– SNP clusters
– Allelic balance
– Number of reads with mq0

java \
  -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < #{qual_cutoff}' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'
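The three filter expressions can be mirrored in plain Python to check what a given record would be tagged with. This is a sketch that restates the expressions from the command, not GATK code; the record layout is hypothetical, the default cutoff of 25.0 is taken from the VCF header shown earlier, and the SNP-cluster filter is left out.

```python
def apply_filters(record, qual_cutoff=25.0):
    """Return the names of the filters a record fails (empty list = passes)."""
    failed = []
    depth = record["DP"]
    if depth < 3 or depth > 1200:
        failed.append("DP")
    if record["QUAL"] < qual_cutoff:
        failed.append("QUAL")
    allele_balance = record.get("AB")  # only annotated for heterozygous calls
    if allele_balance is not None and allele_balance > 0.75 and depth > 40:
        failed.append("AB")
    return failed

# First record is rs9988021 from the VCF example; the second is made up.
print(apply_filters({"DP": 3, "QUAL": 36.0}))
print(apply_filters({"DP": 2, "QUAL": 12.0, "AB": 0.9}))
```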
Filtering - QC metrics (1)
Transition/transversion ratio:
– Random: Ti/Tv = 0.5
– Whole genome: 2.0-2.1
– Exome: 3-3.5
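The random expectation of 0.5 follows from counting: of the 12 possible single-base changes, 4 are transitions (A<->G, C<->T) and 8 are transversions. A short sketch for computing the ratio from a list of (ref, alt) pairs; the function names are assumptions:

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return {ref, alt} <= PURINES or {ref, alt} <= PYRIMIDINES

def ti_tv(snps):
    """Ti/Tv ratio for (ref, alt) pairs; needs at least one transversion."""
    ti = sum(1 for ref, alt in snps if is_transition(ref, alt))
    tv = len(snps) - ti
    return ti / tv

# Every possible substitution once: 4 transitions, 8 transversions.
all_changes = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
print(ti_tv(all_changes))  # 0.5
```

A call set whose Ti/Tv sits well below the expected range is a hint that it contains many false positives, since random errors pull the ratio towards 0.5.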
Filtering - QC metrics (2)
Number of novel SNPs
Exome: total 20k - 25k; novel 1-3k
Combining discovery pipelines
• Mapper: MAQ/bwa/stampy/…
• BaseQ recalibration? Local realignment?
• SNP caller: GATK/samtools/SOAPsnp
• Priors for SNP calling: heterozygosity (whole genome, exome, dbSNP)
• Filtering
Combining discovery pipelines
[Figure: ROC plot; axes: false positives, true positives]
[Figure labels: better; single; combinations]
Indels
Still more tricky than SNPs
– samtools/dindel/GATK
– Sample of 10 individuals: on average per individual:
• 2 novel functional high-quality SNPs
• 18 novel functional high-quality indels
“I trust manual interpretation of the reads more than the basic quality parameters we use”
4 snp_1 STOP_GAINED
1 snp_2 STOP_LOST
1 snp_3 STOP_GAINED
1 snp_4 ESSENTIAL_SPLICE_SITE,INTRONIC
1 snp_5 ESSENTIAL_SPLICE_SITE,INTRONIC
2 snp_6 STOP_GAINED
2 snp_7 STOP_GAINED
1 snp_8 STOP_GAINED
1 snp_9 STOP_GAINED
1 snp_10 STOP_GAINED
1 snp_11 STOP_GAINED
1 snp_12 STOP_GAINED
1 snp_13 STOP_LOST
1 snp_14 STOP_GAINED
178 indels FRAMESHIFT_CODING
Conclusions
• Different tools exist and new ones keep being created
• Best to combine (intersect) the results from different pipelines
• Genome Analysis ToolKit (GATK) provides useful bam-file processing tools:
– Realignment around indels
– Base quality recalibration
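Intersecting call sets is simple once every call is reduced to a comparable key. A minimal sketch using (chrom, pos, ref, alt) tuples; the two pipeline outputs below are invented for illustration, reusing the two positions from the VCF example plus one made-up site:

```python
def intersect_calls(*call_sets):
    """Return the calls present in every pipeline's output."""
    keyed = [set(calls) for calls in call_sets]
    return set.intersection(*keyed)

# Hypothetical outputs of two pipelines as (chrom, pos, ref, alt).
pipeline_a = [("1", 856182, "G", "A"), ("1", 866362, "A", "G")]
pipeline_b = [("1", 856182, "G", "A"), ("1", 900000, "C", "T")]
print(intersect_calls(pipeline_a, pipeline_b))
```

Intersecting trades sensitivity for specificity: a real SNP missed by one pipeline is lost, so the choice depends on which error the study can better afford.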
Use in resequencing
• Identify SNPs/indels
• Consequences (loss-of-function?)
• Prevalence in cases/controls
• Model:
– Dominant: any het
– Recessive: hom non-ref or compound het
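The two models translate into a per-gene rule over an individual's variants. The sketch below is a rough illustration with hypothetical gene names and genotype labels. It treats any two heterozygous variants in the same gene as a compound het, which is an assumption: without phase information they could sit on the same haplotype.

```python
from collections import defaultdict

def candidate_genes(variants, model):
    """variants: iterable of (gene, genotype), genotype 'het' or 'hom_nonref'."""
    het = defaultdict(int)
    hom = defaultdict(int)
    for gene, genotype in variants:
        if genotype == "het":
            het[gene] += 1
        elif genotype == "hom_nonref":
            hom[gene] += 1
    genes = set(het) | set(hom)
    if model == "dominant":
        return {g for g in genes if het[g] >= 1}
    if model == "recessive":
        return {g for g in genes if hom[g] >= 1 or het[g] >= 2}
    raise ValueError("model must be 'dominant' or 'recessive'")

# Made-up variants for one individual.
variants = [
    ("GENE_A", "het"),
    ("GENE_B", "het"), ("GENE_B", "het"),
    ("GENE_C", "hom_nonref"),
]
print(candidate_genes(variants, "dominant"))
print(candidate_genes(variants, "recessive"))
```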
References
• Chan E. In: Single Nucleotide Polymorphisms, Methods in Molecular Biology 578 (2009)
• McKenna et al. Genome Res 20:1297-1303 (2010)
• Li H & Durbin R. Bioinformatics 25:1754-1760 (2009)
• Li H et al. Bioinformatics 25:2078-2079 (2009)
• Li H et al. Genome Res 18:1851-1858 (2008)
Questions?