eccb10 talk - nextgen sequencing and snps

Post on 17-Jul-2015

Next-generation sequencing and SNPs

Jan Aerts
Wellcome Trust Sanger Institute

jan.aerts@gmail.com

Aim

To identify the SNP that causes a disease or phenotype:
– Find them all, so you don’t miss it (false negatives)
– Not find too many, so it’s useful (false positives)

General principle

Map reads to reference sequence
Convert from read-based to base-based (i.e. pileup)
Look at differences

This presentation

Factors in finding real SNPs:
– Sequencing technology
– Mapping algorithms and initial calling
– Post-mapping tweaking
– Calling
– Filtering

Based on experience with exome resequencing (“experiment 5” on Thomas's last slide)

1. Sequencing

• Provides raw data
• Different technologies
– Different accuracy (critical!)
– Different types of errors

Accuracy

Base quality drops along the read
Sanger > SOLiD > Illumina > 454 > Helicos

Base calling errors

Main source of error for Illumina, less in SOLiD & 454

Homopolymer runs

• Especially 454
  39% of errors are homopolymers
• A5 motifs: 3.3% error rate
• A8 motifs: 50% error rate!
Reason: signal intensity is used as a measure of homopolymer length

Is it 4? Is it 5? Is it 4?
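To make the A5/A8 motifs concrete, here is a minimal Python sketch that lists homopolymer runs in a sequence, so that calls falling inside long runs can be inspected with extra care. The function name `homopolymer_runs` and the `min_len` parameter are illustrative and not part of any tool mentioned in this talk.

```python
from itertools import groupby


def homopolymer_runs(seq, min_len=2):
    """Return (start, base, length) for every run of identical bases.

    `start` is 0-based. Only runs of at least `min_len` bases are reported.
    """
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs


# Example: report runs of 4 or more identical bases.
print(homopolymer_runs("ACGAAAAATTG", min_len=4))
```

The GATK annotation HRun shown later in the VCF header captures a related quantity per variant site.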

Consensus accuracy

Increase accuracy for SNP calling by increasing coverage:
– Illumina: 20X
– SOLiD: 12X
– 454: 7.4X
– Sanger: 3X

Factors: raw accuracy + read length
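A rough way to see why coverage compensates for raw accuracy is a majority-vote model. The sketch below assumes independent errors at one fixed per-base error rate and, as a worst case, that all erroneous reads agree on the same wrong base. These are simplifying assumptions made for illustration, not the model used by any caller named in this talk, and the function name `majority_error` is hypothetical.

```python
from math import comb


def majority_error(per_base_error, coverage):
    """Probability that erroneous reads form a strict majority at one site.

    Binomial model: each of `coverage` reads is wrong independently with
    probability `per_base_error` (illustrative assumption).
    """
    first_majority = coverage // 2 + 1
    return sum(
        comb(coverage, k)
        * per_base_error ** k
        * (1 - per_base_error) ** (coverage - k)
        for k in range(first_majority, coverage + 1)
    )


# Compare a low-coverage and a high-coverage site at a 1% raw error rate.
for cov in (1, 3, 20):
    print(cov, majority_error(0.01, cov))
```

Under this model a less accurate technology needs more reads per site to reach the same consensus error, which is the pattern in the coverage figures above.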

2. Mapping: fastq => bam

• Maq and bwa: only 1 mapping
  If multiple: mapQ = 0
  <=> mosaik & mrFAST: alternatives
• Maq and bwa: use paired-end information => might prefer correct distance over correct alignment

3. Post-mapping tweaking

Improve quality of mapped data:
– duplicate removal
– baseQ recalibration
– read clipping
– local realignment around indels

Genome Analysis Toolkit (GATK)
http://bit.ly/9zIn4b

Duplicate removal

PCR amplification bias
Multiple reads with same start/stop => keep only one (with highest mapping Q)

Picard:
java -Xmx2048m \
  -jar /path_to_picardtools/MarkDuplicates.jar \
  INPUT=input.bam \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  VALIDATION_STRINGENCY=LENIENT

samtools:
samtools rmdup input.bam output.bam
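The rule stated on this slide (same start/stop, keep the read with the highest mapping quality) can be sketched in a few lines of Python. This illustrates the principle only; Picard and samtools apply their own, more refined criteria, and the dictionary keys used here (`chrom`, `start`, `end`, `strand`, `mapq`, `name`) are hypothetical.

```python
def remove_duplicates(reads):
    """Keep one read per (chrom, start, end, strand): the highest mapq.

    `reads` is an iterable of dicts; ties keep the read seen first.
    """
    best = {}
    for read in reads:
        key = (read["chrom"], read["start"], read["end"], read["strand"])
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


reads = [
    {"name": "r1", "chrom": "1", "start": 100, "end": 150, "strand": "+", "mapq": 20},
    {"name": "r2", "chrom": "1", "start": 100, "end": 150, "strand": "+", "mapq": 37},
    {"name": "r3", "chrom": "1", "start": 200, "end": 250, "strand": "+", "mapq": 30},
]
print([r["name"] for r in remove_duplicates(reads)])
```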

baseQ recalibration

• Why?
– correct for variation in quality with machine cycle, sequence context, lane, baseQ…
• Steps:
– Identify what to correct for (create plots)
– Calculate covariates
– Apply covariates
– Check (create plots)
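The idea behind the "calculate covariates" step can be illustrated with a small Python sketch: group base calls by their covariates, count how often they disagree with the reference, and convert the observed error rate into an empirical Phred score. The add-one smoothing and the choice of covariates (reported quality and machine cycle) are assumptions made for this example and should not be read as GATK's exact procedure.

```python
from collections import defaultdict
from math import log10


def empirical_qualities(observations):
    """Map (reported_q, cycle) to an empirical Phred-scaled quality.

    `observations` is an iterable of (reported_q, cycle, is_mismatch),
    ideally taken from sites that are not known polymorphisms.
    """
    counts = defaultdict(lambda: [0, 0])  # bin -> [mismatches, total]
    for reported_q, cycle, is_mismatch in observations:
        bin_counts = counts[(reported_q, cycle)]
        bin_counts[0] += int(is_mismatch)
        bin_counts[1] += 1
    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing avoids log10(0) in bins without mismatches.
        error_rate = (mismatches + 1) / (total + 2)
        table[key] = round(-10 * log10(error_rate))
    return table


# Bases reported as Q30 at cycle 1 that are wrong about 1% of the time
# come out with an empirical quality near Q20.
obs = [(30, 1, True)] * 9 + [(30, 1, False)] * 989
print(empirical_qualities(obs))
```

Applying the covariates then amounts to replacing each reported quality with the empirical value of its bin.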

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv

Read clipping

Remove:
– low quality strings of bases
– sections of reads
– reads containing user-provided sequences

Local realignment near indels

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam

4. SNP calling

• Different callers:
– samtools
– GATK UnifiedGenotyper
– SOAPsnp
– …
• Read-based => base-based

pileup:
1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
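Reading the fifth pileup column by hand is error-prone, so here is a minimal Python sketch that tallies the bases seen at one position. It follows the samtools pileup conventions: "." and "," match the reference on the forward and reverse strand, letters are mismatches, "^" plus one character marks a read start, "$" marks a read end, and "+n"/"-n" introduce indels. The function name `count_pileup_bases` is illustrative, and the sketch ignores base qualities entirely.

```python
import re
from collections import Counter


def count_pileup_bases(ref, bases):
    """Count observed bases in the read-bases column of one pileup line."""
    counts = Counter()
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":
            # Read start: the next character encodes mapping quality.
            i += 2
            continue
        if c in "+-":
            # Indel: sign, length digits, then that many inserted/deleted bases.
            digits = re.match(r"\d+", bases[i + 1:]).group()
            i += 1 + len(digits) + int(digits)
            continue
        if c in ".,":
            counts[ref.upper()] += 1
        elif c.upper() in "ACGTN":
            counts[c.upper()] += 1
        # "$" (read end) and "*" (deleted base) are skipped in this sketch.
        i += 1
    return counts


# Position 277 from the pileup above: reference T, depth 22.
print(count_pileup_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

Position 277 shows three alleles in this tally, which is the kind of site a caller and the later filters have to sort out.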

GATK:
java \
  -Xmx6g \
  -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod

samtools:
samtools pileup \
  -vcs \
  -r 0.001 \
  -l CCDS.txt \
  -f human_b36_plus.fasta \
  input.bam \
  > output.pileup

VCF file

header:
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration

column header:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam

data:
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

VCF file

INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE

FORMAT a_a:bwa057_b:picard.bam
GT:DP:GQ 1/1:3:36.00
GT:DP:GQ 1/1:6:45.00
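The INFO and FORMAT columns are simple delimited strings, so they are easy to take apart in custom scripts. The following Python sketch shows one way to do it; the helper names `parse_info` and `parse_sample` are illustrative, and all values are kept as strings so that the caller decides how to convert them.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; flags such as DB map to True."""
    parsed = {}
    for field in info.split(";"):
        key, sep, value = field.partition("=")
        parsed[key] = value if sep else True
    return parsed


def parse_sample(format_keys, sample_values):
    """Pair the FORMAT keys (e.g. GT:DP:GQ) with one sample's values."""
    return dict(zip(format_keys.split(":"), sample_values.split(":")))


info = parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")
sample = parse_sample("GT:DP:GQ", "1/1:3:36.00")
print(info["DP"], info["DB"], sample["GT"])
```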

Pileup => VCF

Custom scripts, then annotate:
java -Xmx10g \
  -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  --assume_single_sample_reads sample \
  -R human_b36_plus.fasta \
  -D dbsnp_129_b36_plus.rod \
  -I input.bam \
  -B variant,VCF,unannotated.vcf \
  -o annotated.vcf \
  -A AlleleBalance \
  -A MappingQualityZero \
  -A LowMQ \
  -A RMSMappingQuality \
  -A HaplotypeScore \
  -A QualByDepth \
  -A DepthOfCoverage \
  -A HomopolymerRun

5. Filtering

• Aim: to reduce number of false positives
• Options:
– Depth of coverage
– Mapping quality
– SNP clusters
– Allelic balance
– Number of reads with mq0

java \
  -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < #{qual_cutoff}' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'
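For readers who want to see what the three filter expressions do to an individual record, here is a Python sketch that applies the same thresholds to a dictionary of annotations. It is a re-statement for illustration, not how VariantFiltration is implemented. The default `qual_cutoff` of 25.0 is taken from the QUAL filter line in the VCF header shown earlier, and treating a missing AB value as "filter not applicable" is an assumption.

```python
def failed_filters(record, qual_cutoff=25.0):
    """Return the names of the filters that a variant record fails.

    `record` is a dict with numeric DP and QUAL, and optionally AB.
    """
    failed = []
    if record["DP"] < 3 or record["DP"] > 1200:
        failed.append("DP")
    if record["QUAL"] < qual_cutoff:
        failed.append("QUAL")
    allele_balance = record.get("AB")
    if allele_balance is not None and allele_balance > 0.75 and record["DP"] > 40:
        failed.append("AB")
    return failed


print(failed_filters({"DP": 3, "QUAL": 36.0}))
print(failed_filters({"DP": 2, "QUAL": 20.0}))
print(failed_filters({"DP": 50, "QUAL": 40.0, "AB": 0.8}))
```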

Filtering - QC metrics (1)

Transition/transversion ratio
Random: Ti/Tv = 0.5
Whole genome: 2.0-2.1
Exome: 3-3.5
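The random expectation of 0.5 follows from counting: of the 12 possible single-base substitutions, 4 are transitions (A<->G, C<->T) and 8 are transversions. A short Python sketch, with illustrative function names, computes the ratio for any list of (ref, alt) pairs and reproduces that figure.

```python
PURINES = {"A", "G"}


def is_transition(ref, alt):
    """True for purine<->purine or pyrimidine<->pyrimidine changes."""
    return ref != alt and ((ref in PURINES) == (alt in PURINES))


def ti_tv_ratio(snps):
    """Ti/Tv for an iterable of (ref, alt) pairs with bases in ACGT."""
    snps = list(snps)
    transitions = sum(1 for ref, alt in snps if is_transition(ref, alt))
    transversions = len(snps) - transitions
    if transversions == 0:
        raise ValueError("no transversions: Ti/Tv is undefined")
    return transitions / transversions


# All 12 possible substitutions, each once: the "random" case.
all_substitutions = [(r, a) for r in "ACGT" for a in "ACGT" if r != a]
print(ti_tv_ratio(all_substitutions))
```

A call set whose Ti/Tv drifts from the expected whole-genome or exome range towards 0.5 is likely to contain many false positives.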

Filtering - QC metrics (2)

Number of novel SNPs
Exome: total 20k - 25k; novel 1-3k

Combining discovery pipelines

• Mapper: MAQ/bwa/stampy/…
• BaseQ recalibration? Local realignment?
• SNP caller: GATK/samtools/SOAPsnp
• Priors for SNP calling: heterozygosity (whole genome, exome, dbSNP)
• Filtering

[Figure: ROC plot; axes: false positives, true positives]

[Figure: ROC plot; labels: "better", "single", "combinations"]

Indels

Still more tricky than SNPs
– samtools/dindel/GATK
– Sample of 10 individuals: on average per individual:
• 2 novel functional high-quality SNPs
• 18 novel functional high-quality indels

“I trust manual interpretation of the reads more than the basic quality parameters we use”

4 snp_1 STOP_GAINED
1 snp_2 STOP_LOST
1 snp_3 STOP_GAINED
1 snp_4 ESSENTIAL_SPLICE_SITE,INTRONIC
1 snp_5 ESSENTIAL_SPLICE_SITE,INTRONIC
2 snp_6 STOP_GAINED
2 snp_7 STOP_GAINED
1 snp_8 STOP_GAINED
1 snp_9 STOP_GAINED
1 snp_10 STOP_GAINED
1 snp_11 STOP_GAINED
1 snp_12 STOP_GAINED
1 snp_13 STOP_LOST
1 snp_14 STOP_GAINED

178 indels FRAMESHIFT_CODING

Conclusions

Different tools exist, and new ones are being created
Best to combine (intersect) the results from different pipelines
Genome Analysis ToolKit (GATK) provides useful bam-file processing tools:
– Realignment around indels
– Base quality recalibration
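Intersecting pipelines is straightforward once every call is reduced to a comparable key. The Python sketch below assumes each call is a (chrom, pos, ref, alt) tuple. The function name `intersect_calls` is illustrative, the first two positions are taken from the VCF example earlier in the talk, and the third call is made up for the demonstration.

```python
def intersect_calls(*call_sets):
    """Return the calls present in every one of the given call sets.

    Each call set is an iterable of (chrom, pos, ref, alt) tuples.
    """
    if not call_sets:
        return set()
    return set.intersection(*(set(calls) for calls in call_sets))


gatk_calls = [("1", 856182, "G", "A"), ("1", 866362, "A", "G")]
samtools_calls = [("1", 866362, "A", "G"), ("1", 900000, "C", "T")]  # last call is hypothetical
print(intersect_calls(gatk_calls, samtools_calls))
```

Intersection trades sensitivity for specificity: a real SNP missed by one pipeline is lost, so the choice depends on which error matters more for the study.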

Use in resequencing

• Identify SNPs/indels
• Consequences (loss-of-function?)
• Prevalence in cases/controls
• Model:
– Dominant: any het
– Recessive: homnonref or compound het
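The two models can be expressed as a per-gene test on one individual's genotypes. The Python sketch below is a simplification under stated assumptions: genotypes are unphased GT strings for biallelic sites, two heterozygous variants in the same gene are counted as a compound het even though phase is unknown, and a homozygous non-reference genotype is also accepted under the dominant model. The function name `gene_fits_model` is illustrative.

```python
def gene_fits_model(genotypes, model):
    """Check whether candidate variants in one gene fit an inheritance model.

    `genotypes` is a list of GT strings such as "0/1" or "1/1" for the
    candidate variants of one gene in one individual.
    """
    hets = sum(1 for gt in genotypes if gt in ("0/1", "1/0"))
    hom_nonref = sum(1 for gt in genotypes if gt == "1/1")
    if model == "dominant":
        # Assumption: a homozygous non-reference call also counts.
        return hets + hom_nonref >= 1
    if model == "recessive":
        # Hom non-ref, or two hets in the gene (possible compound het).
        return hom_nonref >= 1 or hets >= 2
    raise ValueError(f"unknown model: {model}")


print(gene_fits_model(["0/1"], "dominant"))
print(gene_fits_model(["0/1"], "recessive"))
print(gene_fits_model(["0/1", "0/1"], "recessive"))
```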

References

• Chan E. In: Single Nucleotide Polymorphisms, Methods in Molecular Biology 578 (2009)
• McKenna et al. Genome Res 20:1297-1303 (2010)
• Li H & Durbin R. Bioinformatics 25:1754-1760 (2009)
• Li H et al. Bioinformatics 25:2078-2079 (2009)
• Li H et al. Genome Res 18:1851-1858 (2008)

Questions?
