eccb10 talk - nextgen sequencing and snps

Next-generation sequencing and SNPs

Jan Aerts
Wellcome Trust Sanger Institute

jan.aerts@gmail.com

To identify the SNP that causes a disease or phenotype:
– Find them all, so you don’t miss it (avoid false negatives)
– Don’t find too many, so the list stays useful (avoid false positives)

General principle

Map reads to reference sequence
Convert from read-based to base-based (i.e. pileup)
Look at differences

This presentation

Factors in finding real SNPs
– Sequencing technology
– Mapping algorithms and initial calling
– Post-mapping tweaking
– Calling
– Filtering

Based on experiences in exome resequencing; “experiment 5” on Thomas’s last slide

1. Sequencing

• Provides raw data
• Different technologies
  Different accuracy (critical!)
  Different types of errors

Accuracy

Base quality drops along the read
Sanger > SOLiD > Illumina > 454 > Helicos
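Base qualities are normally reported on the Phred scale, where Q = -10 log10(P(error)). The following is a minimal sketch of that conversion; the function names are made up for illustration, and the Phred+33 ASCII encoding (as used in Sanger-style fastq and in samtools pileup output) is an assumption about the input.

```python
import math

def phred_to_error_prob(q):
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)

def error_prob_to_phred(p):
    """Phred quality corresponding to an error probability p."""
    return -10 * math.log10(p)

def decode_qualities(qual_string, offset=33):
    """Decode an ASCII quality string (assumed Phred+33) into integer qualities."""
    return [ord(c) - offset for c in qual_string]

if __name__ == "__main__":
    for q in (10, 20, 30):
        print(q, phred_to_error_prob(q))
    print(decode_qualities("<<<+;"))
```

A base with Q20 has a 1 in 100 chance of being wrong, Q30 a 1 in 1000 chance, which is why a drop in quality along the read matters for SNP calling.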

Base calling errors

Main source of error for Illumina, less so in SOLiD & 454

Homopolymer runs

• Especially 454
  39% of errors are homopolymer errors
• A5 motifs: 3.3% error rate
• A8 motifs: 50% error rate!
Reason: signal intensity is used as a measure of homopolymer length

Is it 4? Is it 5? Is it 4?
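To see where such error-prone motifs sit in a sequence, homopolymer runs can be located with a few lines of Python. This is only an illustrative sketch; the function name and the `min_len` parameter are hypothetical and not taken from any of the tools in this talk.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=3):
    """Return (start, base, length) for every run of identical bases
    that is at least min_len long. Positions are 0-based."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = len(list(group))
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length
    return runs

if __name__ == "__main__":
    print(homopolymer_runs("ACGAAAAATTG", min_len=5))
```

Calls that fall inside or next to a long run (for example an A8 motif) deserve extra suspicion when the data come from 454.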

Consensus accuracy

Increase accuracy for SNP calling by increasing coverage
– Illumina: 20X
– SOLiD: 12X
– 454: 7.4X
– Sanger: 3X
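Why coverage helps can be illustrated with a deliberately crude model: assume every read has the same independent per-base error rate and the consensus is a simple majority vote. This sketch does not reproduce the coverage figures above (which also depend on read length and error type); the function name and the model itself are assumptions made for illustration.

```python
from math import comb

def majority_error(per_read_error, coverage):
    """Probability that more than half of `coverage` independent reads
    carry an error at a given position (simple binomial model)."""
    k_min = coverage // 2 + 1
    e = per_read_error
    return sum(
        comb(coverage, k) * e**k * (1 - e) ** (coverage - k)
        for k in range(k_min, coverage + 1)
    )

if __name__ == "__main__":
    for cov in (1, 3, 7, 12, 20):
        print(cov, majority_error(0.01, cov))
```

Under this model the consensus error falls quickly with depth, and a technology with a lower raw error rate needs fewer reads to reach the same consensus accuracy.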

Factors: raw accuracy + read length

2. Mapping: fastq => bam

• Maq and bwa: report only 1 mapping
  If multiple: mapQ = 0
  <=> mosaik & mrFAST: report alternatives
• Maq and bwa: use paired-end information => might prefer correct distance over correct alignment
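Because ambiguously placed reads get mapQ = 0, they can be recognised from the fifth column (MAPQ) of a SAM record. A minimal sketch, assuming tab-separated SAM text as input; the helper names and the example reads are hypothetical.

```python
def mapping_quality(sam_line):
    """Return the MAPQ value (5th column) of a tab-separated SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    return int(fields[4])

def keep_read(sam_line, min_mapq=1):
    """True if the read maps with at least min_mapq; mapQ 0 reads are dropped."""
    return mapping_quality(sam_line) >= min_mapq

if __name__ == "__main__":
    # Two made-up records: one unique hit, one multi-mapped read.
    unique = "read1\t0\t1\t272\t60\t4M\t*\t0\t0\tACGT\tIIII"
    ambiguous = "read2\t0\t1\t300\t0\t4M\t*\t0\t0\tACGT\tIIII"
    print(keep_read(unique), keep_read(ambiguous))
```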

3. Post-mapping tweaking

Improve quality of mapped data:
– duplicate removal
– baseQ recalibration
– read clipping
– local realignment around indels

Genome Analysis Toolkit (GATK)
http://bit.ly/9zIn4b

Duplicate removal

PCR amplification bias
Multiple reads with same start/stop => keep only one (with highest mapping Q)

Picard

java -Xmx2048m \
  -jar /path_to_picardtools/MarkDuplicates.jar \
  INPUT=input.bam \
  OUTPUT=output.bam \
  METRICS_FILE=output.metrics \
  VALIDATION_STRINGENCY=LENIENT

samtools

samtools rmdup input.bam output.bam
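The principle behind duplicate removal can be sketched in Python: group reads by chromosome, start, stop and strand, and keep the one with the highest mapping quality. This is a simplification for illustration only, not how Picard or samtools are implemented; the `Read` tuple and the function name are hypothetical.

```python
from collections import namedtuple

# Hypothetical minimal read record for illustration.
Read = namedtuple("Read", "name chrom start end strand mapq")

def remove_duplicates(reads):
    """Keep one read per (chrom, start, end, strand): the one with the
    highest mapping quality. Ties keep the first read seen."""
    best = {}
    for read in reads:
        key = (read.chrom, read.start, read.end, read.strand)
        if key not in best or read.mapq > best[key].mapq:
            best[key] = read
    return sorted(best.values(), key=lambda r: (r.chrom, r.start, r.end))

if __name__ == "__main__":
    reads = [
        Read("r1", "1", 100, 135, "+", 37),
        Read("r2", "1", 100, 135, "+", 60),  # duplicate of r1, better mapQ
        Read("r3", "1", 120, 155, "-", 60),
    ]
    print([r.name for r in remove_duplicates(reads)])
```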

baseQ recalibration

• Why?
– correct for variation in quality with machine cycle, sequence context, lane, baseQ…
• Steps:
– Identify what to correct for (create plots)
– Calculate covariates
– Apply covariates
– Check (create plots)

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  --DBSNP resources/dbsnp_129_hg18.rod \
  -I my_reads.bam \
  -T CountCovariates \
  -cov ReadGroupCovariate \
  -cov QualityScoreCovariate \
  -cov DinucCovariate \
  -recalFile my_reads.recal_data.csv

java -Xmx4g -jar GenomeAnalysisTK.jar \
  -l INFO \
  -R resources/Homo_sapiens_assembly18.fasta \
  -I my_reads.bam \
  -T TableRecalibration \
  -outputBam my_reads.recal.bam \
  -recalFile my_reads.recal_data.csv
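The idea behind recalibration can be sketched as follows: at positions not known to be polymorphic (hence the dbSNP file), count how often bases with a given reported quality mismatch the reference, and turn that observed rate back into a Phred score. This is a one-covariate illustration and not GATK's implementation; the function name and the +1 smoothing are assumptions.

```python
import math
from collections import defaultdict

def empirical_qualities(observations):
    """observations: iterable of (reported_quality, is_mismatch) pairs taken
    from sites assumed to be non-variant. Returns {reported_q: empirical_q}."""
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [n_bases, n_mismatches]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += 1
        counts[reported_q][1] += int(is_mismatch)
    # +1 smoothing (an assumption) avoids log10(0) when no mismatch was seen.
    return {
        q: -10 * math.log10((mismatches + 1) / (n + 1))
        for q, (n, mismatches) in counts.items()
    }

if __name__ == "__main__":
    # 1000 bases reported as Q30, 9 of which mismatch the reference.
    obs = [(30, i < 9) for i in range(1000)]
    print(empirical_qualities(obs))
```

In this made-up example the machine reports Q30 but the data only support roughly Q20, so the recalibrated file would carry the lower value. The real tools condition on several covariates at once (read group, reported quality, dinucleotide context, cycle).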

Read clipping

Remove:
– low quality strings of bases
– sections of reads
– reads containing user-provided sequences
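The first option, removing low-quality strings of bases, can be illustrated with a simple 3' trimmer. This is one possible strategy sketched for illustration, not the behaviour of any specific clipping tool; the function name and the default threshold of 20 are assumptions.

```python
def clip_low_quality_tail(seq, quals, min_q=20):
    """Trim bases from the 3' end of a read until the last remaining base
    has quality >= min_q. Returns the clipped (sequence, qualities)."""
    end = len(seq)
    while end > 0 and quals[end - 1] < min_q:
        end -= 1
    return seq[:end], quals[:end]

if __name__ == "__main__":
    seq = "ACGTACGT"
    quals = [30, 32, 31, 28, 25, 12, 8, 3]
    print(clip_low_quality_tail(seq, quals))
```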

Local realignment near indels

java -Xmx1g -jar /path/to/GenomeAnalysisTK.jar \
  -T RealignerTargetCreator \
  -R /path/to/reference.fasta \
  -o /path/to/output.intervals

java -Xmx4g -Djava.io.tmpdir=/path/to/tmpdir \
  -jar /path/to/GenomeAnalysisTK.jar \
  -I input.bam \
  -R ref.fasta \
  -T IndelRealigner \
  -targetIntervals /path/to/output.intervals \
  -o realignedBam.bam

4. SNP calling

• Different callers:
– samtools
– GATK UnifiedGenotyper
– SOAPsnp
– …

• Read-based => base-based

pileup

1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
1 277 T 22 ..CCggC,C,.C.,,CC,..g. +7<;<<<<<<<&<=<<:;<<&<
1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
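In the pileup columns (chromosome, position, reference base, depth, read bases, base qualities), `.` and `,` mean a match to the reference on the forward and reverse strand, a letter means a mismatch, `^` plus one character marks the start of a read, and `$` marks the end of one. A minimal sketch of counting alleles from the read-bases column; the function name is hypothetical.

```python
import re
from collections import Counter

def count_pileup_bases(ref_base, read_bases):
    """Count the bases observed at one position from the read-bases
    column of a samtools pileup line."""
    counts = Counter()
    i = 0
    while i < len(read_bases):
        c = read_bases[i]
        if c == "^":        # read start; the next character encodes mapping quality
            i += 2
        elif c == "$":      # read end
            i += 1
        elif c in "+-":     # indel: sign, length, then that many bases
            length_digits = re.match(r"\d+", read_bases[i + 1:]).group()
            i += 1 + len(length_digits) + int(length_digits)
        elif c in ".,":     # match to the reference (forward / reverse strand)
            counts[ref_base.upper()] += 1
            i += 1
        elif c == "*":      # placeholder for a base deleted in this read
            i += 1
        else:               # mismatch; lower case means reverse strand
            counts[c.upper()] += 1
            i += 1
    return counts

if __name__ == "__main__":
    # Position 277 from the pileup shown on this slide.
    print(count_pileup_bases("T", "..CCggC,C,.C.,,CC,..g."))
```

At position 277 this gives 12 reads supporting the reference T, 7 supporting C and 3 supporting G, which is the kind of base-level difference a SNP caller then has to evaluate.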

java \
  -Xmx6g \
  -jar /path_to/GenomeAnalysisTK.jar \
  -l INFO \
  -R human_b36_plus.fasta \
  -I input.bam \
  -T UnifiedGenotyper \
  --heterozygosity 0.001 \
  -pl Solexa \
  -varout output.vcf \
  -vf VCF \
  -mbq 20 \
  -mmq 10 \
  -stand_call_conf 30.0 \
  --DBSNP dbsnp_129_b36_plus.rod

samtools

samtools pileup \
  -vcs \
  -r 0.001 \
  -l CCDS.txt \
  -f human_b36_plus.fasta \
  input.bam \
  output.pileup

VCF file

header:
##fileformat=VCFv3.3
##FILTER=DP,"DP < 3 || DP > 1200"
##FILTER=QUAL,"QUAL < 25.0"
##FILTER=SnpCluster,"SNPs found in clusters"
##FORMAT=DP,1,Integer,"Read Depth"
##FORMAT=GQ,1,Integer,"Genotype Quality"
##FORMAT=GT,1,String,"Genotype"
##INFO=AB,1,Float,"Allele Balance for hets (ref/(ref+alt))"
##INFO=DB,0,Flag,"dbSNP Membership"
##INFO=DP,1,Integer,"Total Depth"
##INFO=HRun,1,Integer,"Largest Contiguous Homopolymer Run of Variant Allele In Either Direction"
##INFO=HaplotypeScore,1,Float,"Consistency of the site with two (and only two) segregating haplotypes"
##INFO=LowMQ,3,Integer,"3-tuple: <fraction of reads with MQ=0>,<fraction of reads with MQ<=10>,<total nubmer of reads>"
##INFO=MQ,1,Float,"RMS Mapping Quality"
##INFO=MQ0,1,Integer,"Total Mapping Quality Zero Reads"
##INFO=QD,1,Float,"Variant Confidence/Quality by Depth"
##annotatorReference=human_b36_plus.fasta
##reference=human_b36_plus.fasta
##source=VariantAnnotator
##source=VariantFiltration

column header:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT a_a:bwa057_b:picard.bam

data:
1 856182 rs9988021 G A 36.00 0;TARGET DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE GT:DP:GQ 1/1:3:36.00
1 866362 rs4372192 A G 45.00 0;TARGET DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE GT:DP:GQ 1/1:6:45.00
. . .

VCF file

INFO
DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE
DB;DP=6;HRun=6;MQ=60.00;MQ0=0;QD=7.50;OnTarget=FALSE

FORMAT a_a:bwa057_b:picard.bam
GT:DP:GQ 1/1:3:36.00
GT:DP:GQ 1/1:6:45.00
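The INFO column is a semicolon-separated list of `key=value` pairs plus bare flags such as `DB`, and the FORMAT column names the colon-separated fields of each sample column. A minimal parsing sketch; the function names are hypothetical and values are left as strings.

```python
def parse_info(info):
    """Split a VCF INFO string into a dict; bare flags (e.g. DB) map to True."""
    fields = {}
    for item in info.split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            fields[key] = value
        else:
            fields[item] = True
    return fields

def parse_sample(format_string, sample_string):
    """Pair the FORMAT keys (e.g. GT:DP:GQ) with one sample's values."""
    return dict(zip(format_string.split(":"), sample_string.split(":")))

if __name__ == "__main__":
    info = parse_info("DB;DP=3;HRun=0;MQ=60.00;MQ0=0;QD=12.00;OnTarget=FALSE")
    sample = parse_sample("GT:DP:GQ", "1/1:3:36.00")
    print(info["DP"], info["DB"], sample["GT"])
```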

Pileup => VCF

Custom scripts, then annotate

java -Xmx10g \
  -jar GenomeAnalysisTK.jar \
  -T VariantAnnotator \
  --assume_single_sample_reads sample \
  -R human_b36_plus.fasta \
  -D dbsnp_129_b36_plus.rod \
  -I input.bam \
  -B variant,VCF,unannotated.vcf \
  -o annotated.vcf \
  -A AlleleBalance \
  -A MappingQualityZero \
  -A LowMQ \
  -A RMSMappingQuality \
  -A HaplotypeScore \
  -A QualByDepth \
  -A DepthOfCoverage \
  -A HomopolymerRun

5. Filtering

• Aim: to reduce number of false positives
• Options:
– Depth of coverage
– Mapping quality
– SNP clusters
– Allelic balance
– Number of reads with mq0

java \
  -Xmx4g \
  -jar GenomeAnalysisTK.jar \
  -T VariantFiltration \
  -R human_b36_plus.fasta \
  -o output.vcf \
  -B variant,VCF,input.vcf \
  --clusterWindowSize 10 \
  --filterExpression 'DP < 3 || DP > 1200' \
  --filterName 'DP' \
  --filterExpression 'QUAL < #{qual_cutoff}' \
  --filterName 'QUAL' \
  --filterExpression 'AB > 0.75 && DP > 40' \
  --filterName 'AB'
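The three filter expressions in this command can be read as plain boolean tests on each call. A sketch of the same logic in Python, purely to make the thresholds explicit; the function name is hypothetical, the default `qual_cutoff` of 25.0 is taken from the `QUAL < 25.0` filter in the VCF header shown earlier, and the SNP cluster filter is not covered.

```python
def failed_filters(dp, qual, ab=None, qual_cutoff=25.0):
    """Return the names of the filters a call fails (empty list = passes).
    dp: total depth, qual: variant quality, ab: allele balance or None."""
    failed = []
    if dp < 3 or dp > 1200:                       # 'DP < 3 || DP > 1200'
        failed.append("DP")
    if qual < qual_cutoff:                        # 'QUAL < qual_cutoff'
        failed.append("QUAL")
    if ab is not None and ab > 0.75 and dp > 40:  # 'AB > 0.75 && DP > 40'
        failed.append("AB")
    return failed

if __name__ == "__main__":
    print(failed_filters(dp=3, qual=36.0))            # first data line of the VCF
    print(failed_filters(dp=2, qual=20.0))
    print(failed_filters(dp=80, qual=50.0, ab=0.9))
```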

Filtering - QC metrics (1)

Transition/transversion ratio
Random: Ti/Tv = 0.5
Whole genome: 2.0-2.1
Exome: 3-3.5
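Transitions are A<->G and C<->T changes; every other substitution is a transversion. Of the 12 possible substitutions, 4 are transitions and 8 are transversions, which is where the random expectation of 0.5 comes from. A minimal sketch for computing the ratio from a list of (ref, alt) pairs; the function name is hypothetical.

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def ti_tv_ratio(snps):
    """snps: list of (ref, alt) single-base pairs. Returns Ti/Tv."""
    ti = sum(1 for ref, alt in snps if (ref.upper(), alt.upper()) in TRANSITIONS)
    tv = len(snps) - ti
    return ti / tv if tv else float("inf")

if __name__ == "__main__":
    bases = "ACGT"
    all_substitutions = [(r, a) for r in bases for a in bases if r != a]
    print(ti_tv_ratio(all_substitutions))  # random expectation
    print(ti_tv_ratio([("G", "A"), ("A", "G"), ("A", "C"), ("C", "T")]))
```

A call set whose Ti/Tv drifts from the expected range towards 0.5 is likely to contain many random errors, i.e. false positives.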

Filtering - QC metrics (2)

Number of novel SNPs
Exome: total 20k - 25k; novel 1-3k

Combining discovery pipelines

• Mapper: MAQ/bwa/stampy/…
• BaseQ recalibration? Local realignment?
• SNP caller: GATK/samtools/SOAPsnp
• Priors for SNP calling: heterozygosity (whole genome, exome, dbSNP)
• Filtering

[Figure: false positives, single pipelines vs. combinations; labels: “better”, “single”, “combinations”]
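Combining pipelines by intersection means keeping only the calls that every pipeline reports. A minimal sketch, assuming each call set has already been reduced to (chromosome, position, alternate allele) tuples; the function name and the example positions are made up.

```python
def intersect_calls(*call_sets):
    """Return the calls present in every call set.
    Each call is a (chrom, pos, alt) tuple; at least one set is required."""
    return set.intersection(*(set(calls) for calls in call_sets))

if __name__ == "__main__":
    # Hypothetical call sets from two pipelines.
    pipeline_a = [("1", 856182, "A"), ("1", 866362, "G"), ("1", 900000, "T")]
    pipeline_b = [("1", 856182, "A"), ("1", 866362, "G"), ("2", 12345, "C")]
    print(sorted(intersect_calls(pipeline_a, pipeline_b)))
```

Calls made by only one pipeline are dropped, which trades some sensitivity for fewer false positives.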

Indels

Still more tricky than SNPs
– samtools/dindel/GATK
– Sample of 10 individuals: on average per individual:
  • 2 novel functional high-quality SNPs
  • 18 novel functional high-quality indels

“I trust manual interpretation of the reads more than the basic quality parameters we use”

4 snp_1 STOP_GAINED
1 snp_2 STOP_LOST
1 snp_3 STOP_GAINED
1 snp_4 ESSENTIAL_SPLICE_SITE,INTRONIC
1 snp_5 ESSENTIAL_SPLICE_SITE,INTRONIC
2 snp_6 STOP_GAINED
2 snp_7 STOP_GAINED
1 snp_8 STOP_GAINED
1 snp_9 STOP_GAINED
1 snp_10 STOP_GAINED
1 snp_11 STOP_GAINED
1 snp_12 STOP_GAINED
1 snp_13 STOP_LOST
1 snp_14 STOP_GAINED

178 indels FRAMESHIFT_CODING

Conclusions

Different tools exist and new ones are being created
Best to combine (intersect) the results from different pipelines
Genome Analysis ToolKit (GATK) provides useful bam-file processing tools:
– Realignment around indels
– Base quality recalibration

Use in resequencing

• Identify SNPs/indels
• Consequences (loss-of-function?)
• Prevalence in cases/controls
• Model:
– Dominant: any het
– Recessive: hom non-ref or compound het
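The two models can be expressed as a test on the genotypes of the candidate variants that one individual carries in one gene. This is a sketch under simplifying assumptions: genotypes are VCF-style GT strings, and a compound het is approximated as two or more heterozygous variants in the same gene without checking phase. The function names are hypothetical.

```python
import re

def classify_genotype(gt):
    """Classify a VCF GT string such as 0/1 or 1|1."""
    alleles = re.split(r"[/|]", gt)
    if "." in alleles:
        return "missing"
    if all(a == "0" for a in alleles):
        return "hom_ref"
    if len(set(alleles)) == 1:
        return "hom_nonref"
    return "het"

def gene_fits_model(genotypes, model):
    """genotypes: GT strings of one individual's candidate variants in one gene."""
    kinds = [classify_genotype(gt) for gt in genotypes]
    hets = kinds.count("het")
    if model == "dominant":    # any het
        return hets >= 1
    if model == "recessive":   # hom non-ref, or compound het (phase not checked)
        return "hom_nonref" in kinds or hets >= 2
    raise ValueError("model must be 'dominant' or 'recessive'")

if __name__ == "__main__":
    print(gene_fits_model(["0/1"], "dominant"))
    print(gene_fits_model(["0/1"], "recessive"))
    print(gene_fits_model(["0/1", "0/1"], "recessive"))
```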

References

• Chan E. In: Single Nucleotide Polymorphisms, Methods in Molecular Biology 578 (2009)
• McKenna et al. Genome Res 20:1297-1303 (2010)
• Li H & Durbin R. Bioinformatics 25:1754-1760 (2009)
• Li H et al. Bioinformatics 25:2078-2079 (2009)
• Li H et al. Genome Res 18:1851-1858 (2008)

Questions?
