Phylogeny - based on whole genome data
Johanne Ahrenfeldt, Research Assistant
Overview
• What is Phylogeny and what can it be used for
• Single Nucleotide Polymorphism (SNP) methods: snpTree and CSI Phylogeny
• Nucleotide Differences: NDtree
• Controlled Evolution study
• What services for which data
What are phylogenetic trees
• Trees are traditionally made using aligned sequences of single genes or proteins
• Whole genome data may be used to create trees based on
  – SNP calling
  – K-mer overlap
What is a SNP
• A Single Nucleotide Polymorphism (SNP) is a DNA sequence variation occurring commonly within a population (e.g. in at least 1% of it), in which a single nucleotide (A, T, C or G) in the genome (or other shared sequence) differs between members of a biological species or between paired chromosomes.
What is phylogeny used for
• Classify taxonomy – the classic use
• Outbreak detection – increasingly common with WGS data
Construct distance matrix
Strain A  ATTCAGTA
Strain B  ATGCAGTC
Strain C  ATGCAATC
Strain D  ATTCAGTC

    A  B  C  D
A   0  2  3  1
B   2  0  1  1
C   3  1  0  2
D   1  1  2  0
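The matrix on this slide can be reproduced by counting mismatching positions between each pair of aligned sequences. The following is a minimal sketch, not the code used by any of the services; the names `hamming` and `distance_matrix` are made up for illustration, and it assumes the sequences are already aligned to equal length.

```python
from itertools import combinations

# The four example strains from the slide (already aligned, equal length).
strains = {
    "A": "ATTCAGTA",
    "B": "ATGCAGTC",
    "C": "ATGCAATC",
    "D": "ATTCAGTC",
}

def hamming(s1, s2):
    """Count positions at which two aligned sequences differ."""
    if len(s1) != len(s2):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(s1, s2))

def distance_matrix(seqs):
    """Return (names, nested dict) of pairwise mismatch counts."""
    names = sorted(seqs)
    dist = {n: {m: 0 for m in names} for n in names}
    for a, b in combinations(names, 2):
        d = hamming(seqs[a], seqs[b])
        dist[a][b] = dist[b][a] = d  # the matrix is symmetric
    return names, dist

names, dist = distance_matrix(strains)

# Print the matrix in the same layout as the slide.
print("  " + " ".join(names))
for n in names:
    print(n, " ".join(str(dist[n][m]) for m in names))
```

For example, strains A and C differ at positions 3, 6 and 8, which gives the distance of 3 shown in the matrix.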
Make Tree
• The tree is built from the distance matrix for strains A to D.
[Figure: resulting tree with leaves A, D, B and C]
snpTree
• First online webserver for constructing phylogenetic trees based on whole genome sequencing
snpTree--a web-server to identify and construct SNP trees from whole genome sequence data. Leekitcharoenphon P, Kaas RS, Thomsen MC, Friis C, Rasmussen S, Aarestrup FM. BMC Genomics. 2012;13 Suppl 7:S6.
CSI Phylogeny
• Finds SNPs in the same manner as snpTree
• Stricter filtering of SNPs
• Requires all SNPs to be significant
  – Z-score higher than 1.96 for all SNPs
  – Z = (X - Y) / sqrt(X + Y)
• X is the number of reads with the most common nucleotide at that position, and Y is the number of reads with any other nucleotide.
NDtree
• A different approach: the main question is not whether a SNP should be called, but whether there is solid evidence for which nucleotide to call at each position.
NDtree
• When all reads had been mapped, the significance of the base call at each position was evaluated by counting the number of reads X having the most common nucleotide at that position and the number of reads Y supporting other nucleotides.
• A Z-score was calculated and compared to a threshold:
  Z = (X - Y) / sqrt(X + Y) > 1.96 (or 3.29)
• >90% of reads supporting the same base
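The base-call filter described on this slide can be sketched as follows. This is an illustration only, not NDtree's actual code: it assumes the Z-score has the form Z = (X - Y) / sqrt(X + Y), that uncalled positions are marked with "N", and that read counts per nucleotide are already available for each position. The names `z_score` and `call_base` are hypothetical.

```python
import math

def z_score(x, y):
    """Assumed form of the Z-score: (X - Y) / sqrt(X + Y)."""
    n = x + y
    if n == 0:
        return 0.0  # no reads, no evidence
    return (x - y) / math.sqrt(n)

def call_base(counts, z_threshold=1.96, min_fraction=0.9):
    """Return the called base for one position, or 'N' if evidence is weak.

    counts maps nucleotide -> number of reads supporting it at this position.
    """
    total = sum(counts.values())
    if total == 0:
        return "N"
    base, x = max(counts.items(), key=lambda kv: kv[1])  # most common nucleotide
    y = total - x                                        # reads with any other nucleotide
    if z_score(x, y) > z_threshold and x / total > min_fraction:
        return base
    return "N"

# 19 of 20 reads agree: Z is about 4.02 and 95% support, so the base is called.
print(call_base({"A": 19, "C": 1}))
# 8 of 10 reads agree: Z is about 1.90, below 1.96, so the position is left uncalled.
print(call_base({"A": 8, "G": 2}))
# Only 4 reads, all agreeing: Z = 2.0 passes 1.96 but not the stricter 3.29.
print(call_base({"A": 4}), call_base({"A": 4}, z_threshold=3.29))
```

The last example shows why the stricter threshold discards low-coverage positions even when every read agrees.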
NDtree
• Count nucleotide differences
  – Method 1: Each pair of sequences was compared and the number of nucleotide differences in positions called in all sequences was counted.
    • More accurate (Z = 1.96 is used as threshold)
  – Method 2: Each pair of sequences was compared and the number of nucleotide differences in positions called in both sequences was counted.
    • More robust (Z = 3.29 is used as threshold)
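The difference between the two counting methods can be illustrated with a small sketch. It assumes that positions without a confident base call are marked "N" and that all sequences have the same length; the function name `count_differences` and the `mode` values are made up for this example and are not part of NDtree.

```python
from itertools import combinations

def count_differences(seqs, mode="all"):
    """Count nucleotide differences for every pair of sequences.

    mode="all":      only positions called in ALL sequences count (Method 1)
    mode="pairwise": positions called in BOTH sequences of the pair count (Method 2)
    """
    if mode not in ("all", "pairwise"):
        raise ValueError("mode must be 'all' or 'pairwise'")
    names = sorted(seqs)
    length = len(seqs[names[0]])
    if any(len(s) != length for s in seqs.values()):
        raise ValueError("sequences must have the same length")

    # Positions with a confident call in every sequence (used by Method 1).
    shared = [i for i in range(length)
              if all(s[i] != "N" for s in seqs.values())]

    result = {}
    for a, b in combinations(names, 2):
        if mode == "all":
            positions = shared
        else:
            positions = [i for i in range(length)
                         if seqs[a][i] != "N" and seqs[b][i] != "N"]
        result[(a, b)] = sum(seqs[a][i] != seqs[b][i] for i in positions)
    return result

# Hypothetical base calls; s3 has no confident call at the second position.
calls = {"s1": "ACGTAC", "s2": "ATGTTC", "s3": "ANGTTG"}

print(count_differences(calls, mode="all"))       # s1 vs s2 counts 1 difference
print(count_differences(calls, mode="pairwise"))  # s1 vs s2 counts 2 differences
```

In this example the uncalled position in s3 hides a real difference between s1 and s2 under Method 1, while Method 2 still counts it because both s1 and s2 are called there.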
NDtree
• A matrix with these numbers was given as input to a UPGMA algorithm implemented in the neighbor program.
• Simplest method:
  – First cluster closest strains
    • Merge those strains to one point
  – Distance of cluster to another strain is average of distances for each member in cluster to that strain
  – Then merge second closest
  – Repeat until all strains are clustered
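The clustering steps above can be sketched in a few lines. This is a simplified illustration of UPGMA and not the neighbor program itself: it returns only the tree topology and merge distances, it does not compute branch lengths, and ties are broken by taking the first closest pair found. The function name `upgma` is hypothetical, and the input is the four-strain matrix from the earlier slide.

```python
def upgma(names, dist):
    """Cluster strains by repeatedly merging the closest pair of clusters.

    dist is a nested dict of pairwise distances between the original strains.
    Returns (newick_like_topology, list_of_merges).
    """
    # Each cluster label maps to the original strains it contains.
    clusters = {n: [n] for n in names}

    def average_distance(c1, c2):
        # Average over all member pairs, one member from each cluster.
        pairs = [(a, b) for a in clusters[c1] for b in clusters[c2]]
        return sum(dist[a][b] for a, b in pairs) / len(pairs)

    merges = []
    while len(clusters) > 1:
        labels = list(clusters)
        # Find the closest pair of clusters (first one wins on ties).
        d, a, b = min(
            ((average_distance(a, b), a, b)
             for i, a in enumerate(labels) for b in labels[i + 1:]),
            key=lambda t: t[0],
        )
        members = clusters.pop(a) + clusters.pop(b)
        clusters[f"({a},{b})"] = members  # merged cluster becomes one point
        merges.append((a, b, d))
    return next(iter(clusters)), merges

# Distance matrix for strains A to D from the earlier slide.
names = ["A", "B", "C", "D"]
rows = [[0, 2, 3, 1], [2, 0, 1, 1], [3, 1, 0, 2], [1, 1, 2, 0]]
dist = {a: dict(zip(names, row)) for a, row in zip(names, rows)}

tree, merges = upgma(names, dist)
print(tree)  # ((A,D),(B,C))
for a, b, d in merges:
    print(f"merge {a} + {b} at distance {d}")
```

With this input, A and D are merged first, then B and C, and the two clusters finally join at an average distance of 2.0. Because several pairs are tied at distance 1, a different tie-breaking rule could give a different merge order.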
UPGMA vs. Neighbor Joining
• UPGMA works well when samples have been taken at the same time
• Neighbor joining is better when samples have been taken at different times