aligning sgs reads and snp finding -...

96
Aligning SGS reads and SNP finding Aligning SGS reads and SNP finding Philippe Bardou & Olivier Rué

Upload: others

Post on 22-Aug-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

Aligning SGS reads and SNP findingAligning SGS reads and SNP finding

Philippe Bardou & Olivier Rué

Page 2: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

2

Organisation

Morning (9h00 -12h00) :

- Sequence quality– Theory + exercises

- Read mapping– Theory + exercises

Afternoon (13h30-17h) :

- SAM format– Theory + exercises

- Visualisation– Theory + exercises

- SNP calling– Theory + exercises

Page 3: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

3

Where are we?

Sequencing

De NovoAssembly

Alignment

Genome

Transcriptome

Genome

Transcriptome

Genome

Transcriptome

SNP Calling

Chip Seq

Genome

Transcriptome

Transcriptomic

Methylation

Page 4: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

4

What are you going to learn?

● To extract reads and reference genome from the NCBI

● To verify the read quality

● To format reference sequence

● To align the reads on the reference genome

● To index the reference sequence and the aligned reads

● To visualise the alignments and variations

● To improve alignment and to recalibrate SGS data

● To call SNPs

Page 5: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

5

What you should already know?

● How to connect to a remote unix server (putty)?

● What a unix command looks like?

● How to move around the unix environment?

● How to edit a file?

Page 6: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

6

The pieces of software

● Fastqc : quality control

● BWA : alignment

● Samtools & Picard-tools : manipulation of BAM files

● IGV : visualisation

● GATK : analysis of SGS data

Page 7: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

7

The 1000 genomes project

● Joint project NCBI / EBI

● Common data formats :

– fastq

– SAM (Sequence Alignment/Map)

Page 8: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

8BAM

SAM

VCF

SRASRAENAENA[…][…]

fastq

BWABWA(aln / bwsw /(aln / bwsw /

mem)mem)

BAM

FastQCFastQC

fastqfastq

SAMSAM

samtoolssamtools(view/merge/(view/merge/

sort/...)sort/...)

BAM

IGVIGV

rmduprmdup

BAM BAM

Overview

realignmentrealignmentrecalibrationrecalibrationVariantVariantcallingcalling

GATK Picard tools

Page 9: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

9

SGS platforms

● Two platforms :

– Illumina Solexa

– Roche 454

Page 10: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

10

SGS reads

2013

HiSeq 2000/2500600Gb / run2 x 100pb / lecture

Titanium XL+700Mb / run700pb / lecture

2013

Page 11: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

11

Sequencing bias bibliography

Page 12: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

12

Sequencing bias

● Platform related

● Roche 454 (data from Jean-Marc Aury CNS)

– 99,9% mapped reads

– Mean error rate : 0,55%

– 37% deletions, 53% insertions, 10% substitutions.

– homopolymers errors

– emPCR duplications

● Solexa (data from Jean-Marc Aury CNS)

– 98,5% mapped reads

– Mean error rate : 0,38%

– 3% deletions, 2% insertions, 95% substitutions

– Low A/T rich coverage

Page 13: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

13

What data will we use?

● The needed data :

– A reference sequence :● Genome● Parts of the genome● Transcriptome

– Short/Long reads

Page 14: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

14

Where to get a reference genome?

● Assemble your own

● Use a public assembly :

– NCBI : Genbank

– EMBL

Page 15: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

15

Where to get short reads?

● Produce your own sequences :

– CNS

– Local platform

– Private company● Use public data :

– SRA : NCBI Sequence Read Archive

– ENA : EMBL/EBI European Nucleotide Archive

Page 16: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

16

NCBI SRA?

Page 17: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

17

EBI ENA

Page 18: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

18

Meta data

● Meta data structure :

– Experiment– Sample– Study– Run– Data file

Page 19: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

19

What is a fastq file

Page 20: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

20

fastq file formats

Page 21: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

21

Sequence quality

● Phred : base calling

What is Phred Quality?

Traditionally, Phred quality is defined on base calls. Each base call is an estimate of the true nucleotide. It is a random variable and can be wrong. The probability that a base call is wrong is called error probability.

Explanation about the quality values :source http://maq.sourceforge.net/qual.shtml

Page 22: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

22

Which reads should I keep?

● All

● Some : what criteria and threshold should I use

– Composition (number of Ns, complexity,...),

– Quality,

– Alignment based criteria,● Should I trim the reads using :

– Composition

– Quality

Page 23: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

23

Basic reads statistics

● Number of reads

● Length histogram

● Number of Ns in the reads

● Reads quality

● Reads redundancy

● Reads complexity

Page 24: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

24

Sequence quality analysis

● FastQC :

Page 25: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

25

Sequence quality analysis

● FastQC :

Page 26: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

26

NG6 - Home

Page 27: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

27

NG6 - Runs

Page 28: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

28

NG6 - Run

Page 29: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

29

NG6 - Stats

Page 30: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

30

NG6 - Stats

K-mers

Page 31: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

31

NG6 - Stats

Sequence content across all bases

Page 32: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

32

NG6 - Stats

N content across all bases

Page 33: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

33

Exercises / set 1

● genotoul server connexion

● Short read retrieval (wget)

● Read statistics (fastqc)

● Data sets :

– SRX002048

– ERR003037

– ERR000017

Login Passwordanemone f1o2r3!aster f1o2r3!bleuet f1o2r3!iris ...lotusmuguetnarcissepenseepervencherosetulipeviolette

Page 34: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

34

Read alignment

● The different software generations :

– Smith-Waterman / Needleman-Wunch (1970)

– BLAST (1990)

– MAQ (2008)

– BWA (2009)

Page 35: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

35

BWA● Fast and moderate memory footprint (<4GB)● SAM output by default● Gapped alignment for both SE and PE reads● Effective pairing to achieve high alignment accuracy; suboptimal hits

considered in pairing.● Non-unique read is placed randomly with a mapping quality 0● Limited number of errors (2 for 32bp, 4 for 100 bp, ...)● The default conguration works for most typical input.

– Automatically adjust parameters based on read lengths and error rates.

– Estimate the insert size distribution on the fly

http://bio-bwa.sourceforge.net/

Page 36: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

36

Burrows-Wheeler transform

● Original text = “googol”

● Append ‘$’ to mark the end = X = “googol$”

● Sort all rotations of the text in lexicographic order

● Take the last column.

Page 37: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

37

BWA prefix trie

● Word 'googol'

● ^ = start character

● --- Search of 'lol' with one error

The prefix trie is compressed to fit in memory in most cases ( 1Go for the human genome).

Page 38: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

38

http://bio-bwa.sourceforge.net/bwa.shtml

Page 39: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

46

BWA – 3 algorithms

mem aln bwasw

Read length >70bp => few Mb Short reads (~ <100bp) Long reads (>100bp)

Genome length More than 4Gb < 4Gb < 4Gb

Time processing +++ ++ -

Paired Yes (+++) Yes (++) Yes (-)

Algorithm Auto: end-to-end/SW End-to-end SW

Commands

index bwa index ref.fasta

alignmentbwa mem ref.fa r1.fq [r2.fq] > o.sam

bwa aln ref.fa r1.fq > o1.saibwa aln ref.fa r2.fq > o2.sai

bwa bwasw ref.fa r1.fq [r2.fq] > o.sam

postprocess ­bwa samse ref.fa r1.fq o.sai > o.sam

bwa sampe ref.fa r1.fq r2.fq o1.sai o2.sai > o.sam ­

Page 40: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

48

Exercises / set 2

● Retrieving the reference sequence in fasta format :

– gi|224581838|ref|NC_012125.1| Salmonella enterica subsp. enterica serovar Paratyphi C strain RKS4594, complete genome

● Indexing the reference sequence

● Aligning the reads (fastq format)

● Formatting the alignment in SAM

Page 41: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

49

Sequence Alignment/Map (SAM) format

➢ Data sharing was a major issue with the 1000 genomes➢ Capture all of the critical information about NGS data in a single

indexed and compressed file➢ Sharing : data across and tools➢ Generic alignment format➢ Supports short and long reads (454 – Solexa – Solid)➢ Flexible in style, compact in size, efficient in random access

Website : http://samtools.sourceforge.net

Paper :Li H.*, Handsaker B.*, Wysoker A., Fennell T., Ruan J., Homer N., Marth G., Abecasis G., Durbin R. and 1000 Genome Project Data Processing Subgroup (2009) The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics, 25, 2078-9. [PMID: 19505943]

Page 42: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

50

Sequence Alignment/Map (SAM) format

Page 43: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

51

SAM formatHeader section

➢ Header lines start with @ followed by a two-letter TAG

➢ Header fields are TYPE:VALUE pairs

Page 44: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

52

SAM formatAlignment section

➢ 11 mandatory fields

➢ Variable number of optional fields

➢ Fields are tab delimited

Page 45: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

53

SAM formatFull example

<QNAME> <FLAG> <RNAME> <POS> <MAPQ> <CIGAR> <MRNM> <MPOS> <ISIZE> <SEQ> <QUAL>

[<TAG>:<VTYPE>:<VALUE> [...]]

Header

Alignement

X? : Reserved for end usersNM : Number of nuc. DifferenceMD : String for mismatching positionsRG : Read group[...]

A : Printable characteri : Signed 32­bit integerf : Single­precision float numberZ : Printable stringH : Hex string (high nybble first)

Page 46: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

54

SAM formatFlag field

http://picard.sourceforge.net/explain-flags.html

Page 47: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

55

SAM formatExtended CIGAR format

Ref: GCATTCAGATGCAGTACGC

Read:  ccTCAG­­GCATTAgtg

POS  CIGAR

5    2S4M2D6M3S

Page 48: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

56

SAM formatExtended CIGAR format

Page 49: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

57

BAM format

➢ Binary representation of SAM

➢ Compressed by BGZF library

➢ Greatly reduces storage space requirements to about 27% of original SAM

Page 50: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

58

SAMtools

➢ Library and software package

➢ Creating sorted and indexed BAM files from SAM files

➢ Removing PCR duplicates

➢ Merging alignments

➢ Visualization of alignments from BAM files

➢ SNP calling

➢ Short indel detection

http://samtools.sourceforge.net/samtools.shtml

Page 51: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

59

SAMtoolsExample usage

Page 52: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

60

SAMtoolsExample usage

➢ Create BAM from SAM

samtools view ­bS aln.sam ­o aln.bam

➢ Sort BAM file

samtools sort example.bam sortedExample

➢ Merge sorted BAM files

samtools merge sortedMerge.bam sorted1.bam sorted2.bam

➢ Index BAM file

samtools index sortedExample.bam

➢ Visualize BAM file

samtools tview sortedExample.bam reference.fa

Page 53: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

61

Picard

➢ A SAMtools complementary package

➢ More format conversion than SAMtools

➢ Visualization of alignments not available

➢ SNP calling & short indel detection not available

http://picard.sourceforge.net/

Page 54: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

62

PicardExample usage

➢ ValidateSamFile➢ SortSam➢ MarkDuplicates➢ EstimateLibraryComplexity➢ MergeSamFiles➢ ViewSam➢ ReplaceSamHeader

java ­Xmx2g ­jar PicardCmd.jar OPTION1=value1 OPTION2=value2...

➢ SamToFastq➢ FastqToSam➢ SamFormatConverter➢ CreateSequenceDictionary➢ CleanSam➢ CompareSAMs

Page 55: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

63

Exercise / set 3

Page 56: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

64

Visualizing the alignmentSAMtools : tview

samtools tview aln.sorted.bam ref.fasta 

Page 57: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

65

Visualizing the alignmentIGV

➢ IGV : Integrative Genomics Viewer

➢ Website : http://www.broadinstitute.org/igv

Page 58: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

66

Visualizing the alignmentIGV

➢High-performance visualization tool

➢Interactive exploration of large, integrated datasets

➢Supports a wide variety of data types

➢Documentations

➢Developed at the Broad Institute of MIT and Harvard

Page 59: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

67

Visualizing the alignmentIGV

Page 60: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

68

Visualizing the alignmentIGV - Loading the reference

Page 61: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

69

Visualizing the alignmentIGV - Loading the reference

Page 62: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

70

Visualizing the alignmentIGV - Loading the bam file

Page 63: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

71

Visualizing the alignmentIGV - Loading the bam file

Page 64: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

72

Visualizing the alignmentIGV - Zoom

Page 65: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

73

Visualizing the alignmentIGV - Zoom

Page 66: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

74

Visualizing the alignmentIGV - Loading an annotation file

Page 67: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

75

Visualizing the alignmentIGV - Coverage

Page 68: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

76

Exercise / set 4

Page 69: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

77

Variant calling

➢ A new file format :✔ VCF (variant calling format)

➢ Data pre-processing

➢ Variant calling

➢ IGV visualisation

Page 70: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

78

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">...##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">...##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

Page 71: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

79

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">...##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">...##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

VCF version

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

Page 72: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

80

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred­scaled likelihoods for genotypes as defined in the VCF specification">##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">##INFO=<ID=FS,Number=1,Type=Float,Description="Phred­scaled p­value using Fisher's exact test to detect strand bias">##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per­sample when compared against the Hardy­Weinberg expectation">##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z­score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt vs. Ref read position bias">##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

Fields description

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

Page 73: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

81

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred­scaled likelihoods for genotypes as defined in the VCF specification">##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">##INFO=<ID=FS,Number=1,Type=Float,Description="Phred­scaled p­value using Fisher's exact test to detect strand bias">##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per­sample when compared against the Hardy­Weinberg expectation">##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z­score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt vs. Ref read position bias">##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

Tools & options used

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

Page 74: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

82

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred­scaled likelihoods for genotypes as defined in the VCF specification">##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">##INFO=<ID=FS,Number=1,Type=Float,Description="Phred­scaled p­value using Fisher's exact test to detect strand bias">##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per­sample when compared against the Hardy­Weinberg expectation">##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z­score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt vs. Ref read position bias">##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

Genome informations

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

Page 75: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

83

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred­scaled likelihoods for genotypes as defined in the VCF specification">##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">##INFO=<ID=FS,Number=1,Type=Float,Description="Phred­scaled p­value using Fisher's exact test to detect strand bias">##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per­sample when compared against the Hardy­Weinberg expectation">##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z­score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt vs. Ref read position bias">##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

Header line

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

Page 76: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

84

The VCF format

✗ Header line

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred­scaled likelihoods for genotypes as defined in the VCF specification">##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">##INFO=<ID=FS,Number=1,Type=Float,Description="Phred­scaled p­value using Fisher's exact test to detect strand bias">##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per­sample when compared against the Hardy­Weinberg expectation">##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z­score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt vs. Ref read position bias">##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

CHROM : chromosome IDPOS : position (sorted)ID : variant IDREF : reference base(s)ALT : alternative allele(s)QUAL : phred-scaled variant qualityFILTER : [ . | PASS | type of filter ]INFO : List of additional informationsFORMAT : Genotype fields formatL1 : 1st genotypeL2 : 2nd genotype...

Page 77: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

85

The VCF format

✗ Tab-delimited text file

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred­scaled likelihoods for genotypes as defined in the VCF specification">##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">##INFO=<ID=FS,Number=1,Type=Float,Description="Phred­scaled p­value using Fisher's exact test to detect strand bias">##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per­sample when compared against the Hardy­Weinberg expectation">##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z­score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt vs. Ref read position bias">##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

Variants lines

http://vcftools.sourceforge.net/specs.html

http://gatkforums.broadinstitute.org/discussion/1268/how-should-i-interpret-vcf-files-produced-by-the-gatk

Page 78: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

86

The VCF format

✗ How to interpret a variant line ?

##fileformat=VCFv4.1##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred­scaled likelihoods for genotypes as defined in the VCF specification">##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">##INFO=<ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">##INFO=<ID=BaseQRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt Vs. Ref base qualities">##INFO=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth; some reads may have been filtered">##INFO=<ID=DS,Number=0,Type=Flag,Description="Were any of the samples downsampled?">##INFO=<ID=Dels,Number=1,Type=Float,Description="Fraction of Reads Containing Spanning Deletions">##INFO=<ID=FS,Number=1,Type=Float,Description="Phred­scaled p­value using Fisher's exact test to detect strand bias">##INFO=<ID=HRun,Number=1,Type=Integer,Description="Largest Contiguous Homopolymer Run of Variant Allele In Either Direction">##INFO=<ID=HaplotypeScore,Number=1,Type=Float,Description="Consistency of the site with at most two segregating haplotypes">##INFO=<ID=InbreedingCoeff,Number=1,Type=Float,Description="Inbreeding coefficient as estimated from the genotype likelihoods per­sample when compared against the Hardy­Weinberg expectation">##INFO=<ID=MQ,Number=1,Type=Float,Description="RMS Mapping Quality">##INFO=<ID=MQ0,Number=1,Type=Integer,Description="Total Mapping Quality Zero Reads">##INFO=<ID=MQRankSum,Number=1,Type=Float,Description="Z­score From Wilcoxon rank sum test of Alt vs. Ref read mapping qualities">##INFO=<ID=QD,Number=1,Type=Float,Description="Variant Confidence/Quality by Depth">##INFO=<ID=ReadPosRankSum,Number=1,Type=Float,Description="Z­score from Wilcoxon rank sum test of Alt vs. Ref read position bias">##INFO=<ID=SB,Number=1,Type=Float,Description="Strand Bias">##UnifiedGenotyper="analysis_type=UnifiedGenotyper input_file=[L1_RG_s_realign_recal_q30.bam, L2_RG_s_realign_recal_q30.bam] read_buffer_size=null phone_home=STANDARD gatk_key=null read_filter=[] [...] ##contig=<ID=scaffold376,length=1000>##reference=file:///work/banks/genome.fasta#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT l1 l2scaffold376 101 . G A 247.82 .AC=5;AF=0.417;AN=12;BaseQRankSum=4.588;DP=13;Dels=0.00;FS=0.000;HRun=1;HaplotypeScore=0.2974;MQ=45.38;MQ0=0;MQRankSum=­1.905;QD=10.77;ReadPosRankSum=­0.495;SB=­85.94 GT:AD:DP:GQ:PL ./. 1/1:0,13:13:9.01:87,9,0scaffold376 121 . G A 1443.70 .AC=6;AF=0.429;AN=14;BaseQRankSum=7.299;DP=57;Dels=0.00;FS=2.071;HRun=1;HaplotypeScore=0.1413;MQ=57.21;MQ0=0;MQRankSum=2.384;QD=10.94;ReadPosRankSum=­0.519;SB=­486.44 GT:AD:DP:GQ:PL 1/1:0,29:29:63.13:721,63,00/1:17,11:28:99:228,0,371

Variant at position 101 on scaffold376Not annotatedBase G in reference genome / A in the sequencesSNP quality : 247,82No filters appliedOther informations

GT : ./. → not defined l10/0 → homozygous reference0/1 → heterozygous1/1 → homozygous alternative l2

AD : allelic depth for REF → 0allelic depth for ALT → 13

DP : Read depth → 13

GQ : genotype quality → 9,01

PL : genotype 0/0 likelihood → 87genotype 0/1 likelihood → 9Genotype 1/1 likelihood → 0

Page 79: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

87

The VCF format

✗ Small INDELs

✗ Multi-allelic variants

scaffold376 500 . CTT C 424.60 ....

scaffold376 500 . C CT 434.60 ....

DeletionInsertion

scaffold376 577 . C A,T 2303.19 ....GT:AD:DP:GQ:PL 1/2:0,10,6:16:99:394,145,118,249,0,234 1/2:0,20,6:26:99:658,160,106,498,0,480

Page 80: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

88

THE GATK best practices

Page 81: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

89

THE GATK best practices

Page 82: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

90

Why GATK ?

➢ New technologies have difficulties to provide an accurate phred-scaled base quality

➢ The impact of a wrong base quality on SNP calling is very important

➢ Illustration with 5 variant calling methods :

Page 83: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

91

GATK pipeline

SAMSAMBAM

GATKGATKIndelRealignerIndelRealigner

GATKGATKRealignerTargetCreatorRealignerTargetCreator

SAMSAMBAM

VCF

Picard toolsPicard toolsAddOrReplaceReadGroupsAddOrReplaceReadGroups

SamtoolsSamtoolsindexindex Picard toolsPicard tools

MarkDuplicatesMarkDuplicates

SAMSAMBAM

SAMSAMBAM

GATKGATKBaseRecalibratorBaseRecalibrator

GATKGATKPrintReadsPrintReads

SAMSAMBAM GATKGATKUnifiedGenotyperUnifiedGenotyper

Page 84: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

92

Warning ! Huge temporary data

SAMSAMBAM

GATKGATKIndelRealignerIndelRealigner

GATKGATKRealignerTargetCreatorRealignerTargetCreator

SAMSAMBAM

VCF

Picard toolsPicard toolsAddOrReplaceReadGroupsAddOrReplaceReadGroups

SamtoolsSamtoolsindexindex Picard toolsPicard tools

MarkDuplicatesMarkDuplicates

SAMSAMBAM

SAMSAMBAM

GATKGATKBaseRecalibratorBaseRecalibrator

GATKGATKPrintReadsPrintReads

SAMSAMBAM GATKGATKUnifiedGenotyperUnifiedGenotyper

Page 85: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

93

Warning ! Huge temporary data

SAMSAMBAM

GATKGATKIndelRealignerIndelRealigner

GATKGATKRealignerTargetCreatorRealignerTargetCreator

SAMSAMBAM

VCF

Picard toolsPicard toolsAddOrReplaceReadGroupsAddOrReplaceReadGroups

SamtoolsSamtoolsindexindex Picard toolsPicard tools

MarkDuplicatesMarkDuplicates

SAMSAMBAM

SAMSAMBAM

GATKGATKBaseRecalibratorBaseRecalibrator

GATKGATKPrintReadsPrintReads

SAMSAMBAM GATKGATKUnifiedGenotyperUnifiedGenotyper

* HIGH CPU-MEMORY REQUIREMENTS* HIGH TIME-PROCESSING

Page 86: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

94

Add Read Groups

➢ GATK needs @RG field in the BAM header

Aim : keep meta-informations (« Where are sequences coming from? »)Usefull to recalibrate base qualities (co-variates) and to call variants

java ­jar ­Xmx4G AddOrReplaceReadGroups.jar I=file.bam O=out.bam ID=[val] RGLB=[val] PL=[val] PU=[val] SM=[val]

Page 87: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

95

Add Read Groups

➢ GATK needs @RG field in the BAM file to run

Page 88: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

96

Add Read Groups

➢ GATK needs @RG field in the BAM file to run

Possible to add the @RG tag directly with « bwa samse»

bwa samse ref.fasta file.sai file.fastq ­r '@RG\tID:[val]\tPL:[val]\tPU:[val]\tLB:[val]\tSM:[val] > file.sam

Page 89: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

97

Duplicate Marking

➢ Duplicates can propagate sequencing errors

Credits : GATK website

java ­jar ­Xmx4G MarkDuplicates.jar I=file.bam O=file_rmdup.bam REMOVE_DUPLICATES=boolean M=file.metrics

Page 90: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

98

Local Realignment

Read concerned

Page 91: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

99

Local Realignment

Raw BAM101M

2 mismatchs

BAM realigned98M25D3M1 mismatch

Read concerned

Page 92: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

100

Base recalibration

➢ The base quality provided by the sequencers is too inaccurate to be kept. They are re-computed.

➢ Needs knowledge of real SNP for recalibrating

Page 93: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

101

Variant Calling informations

Page 94: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

102

IGV : visualization of variants

Page 95: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

103BAM

SAM

VCF

SRASRAENAENA[…][…]

fastq

BWABWA(aln / bwsw /(aln / bwsw /

mem)mem)

BAM

FastQCFastQC

fastqfastq

SAMSAM

samtoolssamtools(view/merge/(view/merge/

sort/...)sort/...)

BAM

IGVIGV

rmduprmdup

BAM BAM

Synthesis

realignmentrealignmentrecalibrationrecalibrationVariantVariantcallingcalling

GATK Picard tools

Page 96: Aligning SGS reads and SNP finding - genotoul-bioinfobioinfo.genotoul.fr/wp-content/uploads/SGS_SNP_slides_Oct2013.pdf · 2 Organisation Morning (9h00 -12h00) : - Sequence quality

104

Exercise / set 5