dotplots for bioinformatics

Upload: avrilcoghlan

Post on 22-Jun-2015

Page 1: Dotplots for Bioinformatics

Dot plots

Dr Avril Coghlan

Note: this talk contains animations which can only be seen by downloading and using ‘View Slide show’ in PowerPoint

Page 2: Dotplots for Bioinformatics

Dot plots

• How can we compare the human & Drosophila melanogaster Eyeless protein sequences?
  - One method is a dotplot
• A dotplot is a graphical method for assessing similarity
  - Make a matrix (table) with one row for each letter in sequence 1, & one column for each letter in sequence 2
  - Colour in each cell with an identical letter in the 2 sequences
  - Regions of local similarity between the 2 sequences appear as diagonal lines of coloured cells (‘dots’)

Page 3: Dotplots for Bioinformatics

e.g. for sequences ‘RQQEPVRSTC’ (sequence 1) and ‘QQESGPVRST’ (sequence 2):

[Figure: dot plot grid, repeated across the slide's animation frames, with the letters of Sequence 2 (Q Q E S G P V R S T) across the top and the letters of Sequence 1 (R Q Q E P V R S T C) down the side]

Regions of local similarity between the 2 sequences appear as diagonal lines.

Some off-diagonal dots may be due to chance similarities.
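The matrix construction on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the talk: the function names `dot_matrix` and `render` are hypothetical, and a `*` stands in for a coloured cell.

```python
def dot_matrix(seq1, seq2):
    """One row per letter of seq1, one column per letter of seq2.

    A cell is True when the two letters are identical.
    """
    return [[a == b for b in seq2] for a in seq1]


def render(seq1, seq2, matrix):
    """Draw the matrix as text: '*' for a dot, '.' for an empty cell."""
    lines = ["  " + " ".join(seq2)]
    for letter, row in zip(seq1, matrix):
        lines.append(letter + " " + " ".join("*" if cell else "." for cell in row))
    return "\n".join(lines)


seq1 = "RQQEPVRSTC"  # sequence 1 (rows)
seq2 = "QQESGPVRST"  # sequence 2 (columns)
matrix = dot_matrix(seq1, seq2)
print(render(seq1, seq2, matrix))
```

In the printed grid the runs QQE and PVRST show up as two diagonal lines, the remaining dots are isolated chance matches, and the row for C is empty because C does not occur in sequence 2.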

Page 4: Dotplots for Bioinformatics

Problem

• Make a dot-plot for DNA sequences “GCATCGGC” & “CCATCGCCATCG”. Are there regions of similarity?

Page 5: Dotplots for Bioinformatics

Answer

• Make a dot-plot for DNA sequences “GCATCGGC” & “CCATCGCCATCG”. Are there regions of similarity?

CATCG in sequence 1 appears twice in sequence 2

[Figure: dot plot grid with “CCATCGCCATCG” across the top and “GCATCGGC” down the side]
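The answer can also be checked programmatically by looking for unbroken diagonal runs of identical letters. The sketch below is an illustration only: the helper `diagonal_runs` and the `min_len` cut-off are assumptions introduced here, not part of the talk.

```python
def diagonal_runs(seq1, seq2, min_len):
    """Find maximal diagonal runs of identical letters.

    Returns (row, column, length) tuples using 0-based positions.
    """
    runs = []
    for i in range(len(seq1)):
        for j in range(len(seq2)):
            if seq1[i] != seq2[j]:
                continue
            # Skip cells that continue a run started further up the diagonal.
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            length = 0
            while (i + length < len(seq1) and j + length < len(seq2)
                   and seq1[i + length] == seq2[j + length]):
                length += 1
            if length >= min_len:
                runs.append((i, j, length))
    return runs


seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
runs = diagonal_runs(seq1, seq2, min_len=4)
for i, j, length in runs:
    print(f"{seq1[i:i + length]}: row {i}, column {j}, length {length}")
```

With these sequences the script reports two runs of length 5, both spelling CATCG, which matches the two diagonal lines in the answer.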

Page 6: Dotplots for Bioinformatics

Dot plots with thresholds

• If you colour in all cells with an identical letter, some dots may be due to chance similarities
• Therefore, it is common to use a threshold to decide whether to plot a ‘dot’ in a cell
  - A window of a certain size (e.g. window size = 3) is moved along all possible diagonals, one by one
  - A score is calculated for each position of the window on a diagonal: the number of identical letters in the window
  - If the score is equal to or above the threshold (e.g. threshold = score of 2), all the cells in the window are coloured in
  - The window size and threshold for the dot plot are chosen by trial and error
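The window-and-threshold procedure on this slide can be sketched as follows. This is a minimal Python illustration under the rule as stated here (score = number of identical letters in the window, colour every cell of a window that reaches the threshold); the function names are hypothetical.

```python
def window_score(seq1, seq2, i, j, window):
    """Number of identical letters in the diagonal window starting at row i, column j."""
    return sum(seq1[i + k] == seq2[j + k] for k in range(window))


def thresholded_dot_plot(seq1, seq2, window, threshold):
    """Colour all cells of every window whose score is >= threshold."""
    plot = [[False] * len(seq2) for _ in seq1]
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            if window_score(seq1, seq2, i, j, window) >= threshold:
                for k in range(window):
                    plot[i + k][j + k] = True
    return plot


seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
plot = thresholded_dot_plot(seq1, seq2, window=3, threshold=2)
print("  " + " ".join(seq2))
for letter, row in zip(seq1, plot):
    print(letter + " " + " ".join("*" if cell else "." for cell in row))
```

One consequence of colouring the whole window is visible in the top-left cell: G and C are not identical, but the window GCA against CCA scores 2, so that cell is coloured anyway.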

Page 7: Dotplots for Bioinformatics

e.g. for sequences “GCATCGGC” and “CCATCGCCATCG”, using a window size of 3, and a threshold of ≥2:

[Figure: dot plot grid with “CCATCGCCATCG” across the top and “GCATCGGC” down the side, repeated across the slide's animation frames. A marked box (= the sliding window) moves along the diagonals, and each frame is labelled with the window's score: ‘Score = 0, < threshold’, ‘Score = 1, < threshold’, ‘Score = 2, ≥ threshold → colour in’, ‘Score = 3, ≥ threshold → colour in’, and so on.]
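To follow the window along a single diagonal, the scores can be listed position by position. The sketch below is illustrative only: `diagonal_scores` and its `offset` parameter (column minus row) are names introduced here, and positions are 0-based.

```python
def diagonal_scores(seq1, seq2, offset, window):
    """Score every window on the diagonal where column = row + offset."""
    scores = []
    for i in range(len(seq1) - window + 1):
        j = i + offset
        if 0 <= j <= len(seq2) - window:
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            scores.append((i, j, score))
    return scores


seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
threshold = 2
for i, j, score in diagonal_scores(seq1, seq2, offset=0, window=3):
    verdict = ">= threshold, colour in" if score >= threshold else "< threshold"
    print(f"{seq1[i:i + 3]} vs {seq2[j:j + 3]}: score = {score}, {verdict}")
```

On the main diagonal every window reaches the threshold of 2, so the whole diagonal ends up coloured, including cells where the two letters differ.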

Page 8: Dotplots for Bioinformatics

Real data: fruitfly & human Eyeless

• A dot plot of fruitfly & human Eyeless proteins:

[Figure: dot plot, axes labelled Human Eyeless and Fruitfly Eyeless; window-size = 10, threshold = 3]

Do you think we chose a good value for the window-size and threshold?

Page 9: Dotplots for Bioinformatics

Real data: fruitfly & human Eyeless

• Here is a dot plot of fruitfly and human Eyeless proteins, made using windowsize=10, threshold=5:

[Figure: dot plot, axes labelled Human Eyeless and Fruitfly Eyeless; window-size = 10, threshold = 5]

Are there any regions of similarity?
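The effect of raising the threshold can be explored with a small parameter sweep. The Eyeless sequences are not included in this transcript, so the sketch below uses the short DNA sequences from the earlier slides as stand-in data; the function name `coloured_cells` is hypothetical.

```python
def coloured_cells(seq1, seq2, window, threshold):
    """Set of (row, column) cells coloured for a given window size and threshold."""
    cells = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                cells.update((i + k, j + k) for k in range(window))
    return cells


# Stand-in data: the toy DNA sequences, not the real Eyeless proteins.
seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
counts = {t: len(coloured_cells(seq1, seq2, 3, t)) for t in (1, 2, 3)}
for t, n in counts.items():
    print(f"window=3 threshold={t}: {n} of {len(seq1) * len(seq2)} cells coloured")
```

A stricter threshold can only remove coloured cells, never add them, which is why a plot made with a higher threshold looks cleaner than one made with a lower threshold and the same window size.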

Page 10: Dotplots for Bioinformatics

Pros and cons of dot plots

• Advantages
  - A dot plot can be used to identify long regions of strong similarity between two sequences
  - It produces a plot, which is easy to make and to interpret
  - It can be used to compare very short or long sequences (even whole chromosomes, millions of bases)
• Disadvantages
  - It is necessary to find the best window size and threshold by trial and error
  - A dot plot can only be used to compare 2 sequences, not >2 sequences
  - It doesn’t tell you what mutations occurred in the region of similarity (if there is one) since the two sequences shared a common ancestor

Page 11: Dotplots for Bioinformatics

Software for making dotplots

• dotPlot() function in the SeqinR R library
  - Allows you to specify a windowsize and threshold
  - If the score in a window is ≥ the threshold, colours in the 1st cell in the window (not all cells)
• EMBOSS dottup
  - Allows you to specify a windowsize but not a threshold
  - If all cells in a window are identities, it colours in all cells in the window
• EMBOSS dotmatcher
  - Allows you to specify a windowsize and threshold
  - Instead of using the number of identities in a window as the window score, it calculates a more complex score based on the similarities of the bases/amino acids
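The difference between the first two colouring rules can be seen side by side in a small Python sketch. It mimics the rules only as this slide describes them and is not the code of SeqinR or EMBOSS; all function names are hypothetical. dotmatcher is left out because its score depends on a similarity matrix that the slide does not specify.

```python
def window_scores(seq1, seq2, window):
    """Yield (row, column, score) for every diagonal window position."""
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            yield i, j, sum(seq1[i + k] == seq2[j + k] for k in range(window))


def first_cell_rule(seq1, seq2, window, threshold):
    """Rule described for dotPlot(): colour only the 1st cell of a qualifying window."""
    return {(i, j) for i, j, s in window_scores(seq1, seq2, window) if s >= threshold}


def all_identities_rule(seq1, seq2, window):
    """Rule described for dottup: colour every cell of a window made only of identities."""
    cells = set()
    for i, j, s in window_scores(seq1, seq2, window):
        if s == window:
            cells.update((i + k, j + k) for k in range(window))
    return cells


seq1, seq2 = "GCATCGGC", "CCATCGCCATCG"
first = first_cell_rule(seq1, seq2, window=3, threshold=3)
full = all_identities_rule(seq1, seq2, window=3)
print("first-cell rule:", sorted(first))
print("all-cells rule: ", sorted(full))
```

With the same window size, the first-cell rule draws shorter diagonals than the all-cells rule, because the last cells of each matching stretch never start a qualifying window of their own.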

Page 12: Dotplots for Bioinformatics

Problem

• Make a dot-plot for amino acid sequences “RQQEPVRSTC” and “QQESGPVRST”, using a window size of 3, and a threshold of ≥3

Page 13: Dotplots for Bioinformatics

Answer

• Make a dot-plot for sequences “RQQEPVRSTC” and “QQESGPVRST”, using window size: 3, threshold: ≥3

[Figure: dot plot grid with “QQESGPVRST” across the top and “RQQEPVRSTC” down the side]
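One way to check an answer to this problem is to compute the thresholded plot directly. The sketch below assumes the rule from the ‘Dot plots with thresholds’ slide (colour every cell of a window whose score reaches the threshold); the function name is hypothetical and positions are 0-based.

```python
def thresholded_cells(seq1, seq2, window, threshold):
    """Cells coloured when every window scoring >= threshold is filled in."""
    cells = set()
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            score = sum(seq1[i + k] == seq2[j + k] for k in range(window))
            if score >= threshold:
                cells.update((i + k, j + k) for k in range(window))
    return cells


seq1, seq2 = "RQQEPVRSTC", "QQESGPVRST"
coloured = thresholded_cells(seq1, seq2, window=3, threshold=3)
print("  " + " ".join(seq2))
for i, letter in enumerate(seq1):
    row = ("*" if (i, j) in coloured else "." for j in range(len(seq2)))
    print(letter + " " + " ".join(row))
```

Because the threshold equals the window size, only perfectly matching windows are coloured: the grid keeps the QQE diagonal and the PVRST diagonal, and the isolated chance matches disappear.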

Page 14: Dotplots for Bioinformatics

Further reading

• Chapter 3 in Introduction to Computational Genomics, Cristianini & Hahn
• Practical on dotplots in R in the Little Book of R for Bioinformatics: https://a-little-book-of-r-for-bioinformatics.readthedocs.org/en/latest/src/chapter4.html