

Practical sessions

Bioinformatics Workshop Sessions 2021

Carmelina Charalambous

Session 2

Session 2 (advanced): Lancaster virus. This is the genome of a novel virus isolated in a GP surgery in Lancaster by an MSc student in early 2015.

There is also the MEGA manual: http://www.megasoftware.net/MEGA-v1.01.pdf of which Chapters 3, 4 and 7 may be useful. Although it is a software manual rather than a textbook, it explains some of the basic concepts rather well.

It is recommended that you download MEGA for your own personal PCs and laptops. It is available for free at http://www.megasoftware.net. For those of you who prefer not to download things onto your own computers, or who run incompatible architectures, MEGA is part of Applications Jukebox and is therefore available on any LU teaching lab PC. MEGA is required for the coursework.

The learning objectives are:

1. Explain what the word bioinformatics refers to and give some examples of bioinformatics techniques and their application.

2. Understand the concept of format as used in bioinformatics, and why it is so important.

3. Be able to recognise the following commonly used data formats: FASTA, GenBank, GCG, Nexus, Newick, Clustal and describe what they are used for.

4. Understand the meanings of the terms search and alignment as they are used in bioinformatics, and explain their importance.

5. Be able to perform a simple sequence search using BLAST and a simple alignment using ClustalW.

6. Be able to build, and interpret, a phylogenetic tree in MEGA.

7. Be able to derive several important sequence statistics in MEGA.
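To make objectives 2 and 3 concrete, the following is a minimal Python sketch of how the listed formats can be told apart by their opening lines. It is an illustration only, not part of the coursework: the function name guess_format is hypothetical, and the rules are simple rules of thumb (FASTA records open with ">", GenBank with "LOCUS", Nexus with "#NEXUS", Clustal with "CLUSTAL", GCG/MSF files carry an "MSF:" header line, and a Newick tree is a bracketed string ending in ";"). Real files can be messier than these toy cases.

```python
def guess_format(text):
    """Guess a bioinformatics file format from its opening lines.

    Heuristic sketch only: it inspects the first non-blank line
    (and a few following lines for GCG/MSF), not the whole file.
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    if first.startswith(">"):
        return "FASTA"            # header line begins with '>'
    if first.startswith("LOCUS"):
        return "GenBank"          # flat-file record begins with LOCUS
    if first.upper().startswith("#NEXUS"):
        return "Nexus"            # mandatory first token of a Nexus file
    if first.upper().startswith("CLUSTAL"):
        return "Clustal"          # e.g. "CLUSTAL O(...) multiple sequence alignment"
    if first.startswith("!!") or any("MSF:" in ln for ln in lines[:10]):
        return "GCG/MSF"          # GCG multiple sequence format header
    if first.startswith("(") and lines[-1].endswith(";"):
        return "Newick"           # nested brackets terminated by ';'
    return "unknown"


if __name__ == "__main__":
    # Toy inputs invented for illustration; they are not real database records.
    print(guess_format(">toy_sequence\nACGTACGT\n"))
    print(guess_format("((A:0.1,B:0.2):0.05,C:0.3);"))
```

The order of the checks matters little here because the opening tokens do not overlap, but the Newick test is placed last since it is the loosest rule.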

Software downloads:

MEGA

Carry out the following instructions. The steps are all things that you’ll need to know how to do for the coursework assessment. The demonstrators are on hand for any questions.

1. Go to NCBI and search for KY342346.1. This is the genome of a novel virus isolated in a GP surgery in Lancaster by an MSc student in early 2015.

2. Go to the NCBI webpage (http://www.ncbi.nlm.nih.gov) and click “BLAST”.

3. Choose “Nucleotide BLAST”.

4. Paste the accession number in the query box.

5. Type “viruses” in the “Organism” box. If offered a precomputed taxid, take it. The DEFLINE will appear automatically in “Job Title”.

6. Click the “BLAST” button.

7. You will see the BLAST output page.

8. Scroll downwards to the list of hits. Select a dozen or so that are labelled ATCC (American Type Culture Collection).

9. Click “Download” and choose “FASTA complete sequence”, then click “Continue”.

10. Save the file somewhere you can find it. Give it a name you will recognise and make sure it ends in .fasta, e.g. HRV-A.fasta. Make sure you save it as “All Files” and not “Text Document”.

11. Before you leave the output, click on the top hit. This will take you to the alignment of the query sequence with the top hit. What is the percentage divergence of the query from the reference genome?

12. Click on the sequence ID link. This shows you the GenBank format rendition of the human rhinovirus A-22 reference genome. Learning Objective alert! Recognising and understanding GenBank format is one of your learning objectives!

13. Launch MEGA.

14. Click “Align”, then “Edit/build alignment”, then “Retrieve sequences from a file”. Choose *.FASTA and open the file you just saved.

15. Add the Lancaster virus sequence by cutting and pasting it.

16. Click “Alignment”, then “Align by ClustalW”. Learning Objective alert! Understanding what an alignment is and how to do it is one of your learning objectives!

17. While this is running, go to the EBI alignment website (http://www.ebi.ac.uk/Tools/msa/clustalo/). Browse for the file you just saved, choose “DNA” in “Enter or paste” and hit submit.

18. Look at your EBI Clustal O output. Learning Objective alert! Recognising and understanding Clustal format is one of your learning objectives!

19. Click “Download Alignment file”. Save it as something you can recognise.

20. Go to the EMBOSS site: http://emboss.bioinformatics.nl/ . From the program list on the left side, choose seqret.

21. Browse for your Clustal O output file. In “Output Sequence format” choose “Nexus/PAUP”. Click “Run seqret”.

22. Look at your EBI Clustal O output in Nexus format. Learning Objective alert! Recognising and understanding Nexus format is one of your learning objectives!

23. Repeat the above steps, now choosing “GCG/MSF” in “Output Sequence format”. Learning Objective alert! Recognising and understanding GCG format is one of your learning objectives!

24. Repeat again, now choosing “GenBank” in “Output Sequence format”. What is different about this GenBank format compared with the one you saw before?

25. Return to MEGA. The alignment should have completed by now.

26. Click “Data”, then “Save session”. Save it as something you can recognise, ending in *.mas.

27. Click “Export Alignment”, then “MEGA format”. Save it as something you can recognise, ending in *.meg. For “protein coding nucleotide sequence data” click “No” (this is only for alignment of individual coding segments, not whole genomes).

28. Return to the main MEGA window. Click “Data”, then “Open file Session”, and open the *.meg file you just saved.

29. Click “Data”, then “Explore Active Data”.

30. Click the “V” and then “S” buttons to find the proportion of variable and singleton sites. What are the percentages of each? What is the length of the alignment?

31. Click “Models”, then “Find Best DNA/protein model”, then “Compute”.

32. Make a note of the best model.

33. Click “Models”, then “Estimate Transition/Transversion Bias”, enter your best model in the “Model/Method” selection, then “Compute”.

34. Click “Distance”, then “Compute Pairwise distances”, leave “Model/Method” selection as “p-distance”, then “Compute”.

35. In the pop-up result, click “->XL” and choose “Export Type” as “Matrix”, then “Print/Save Matrix”. Save the Excel spreadsheet. What is the p-distance between the Lancaster virus and its nearest relative?

36. Click “Phylogeny”, then “Construct/Test Maximum Likelihood Tree”, enter your best model in the “Model/Method” selection, then “Compute”.

37. When the tree appears, click “File”, then “Export Current Tree (Newick)”. Save this as something you will recognise. Learning Objective alert! Recognising and understanding Newick format is one of your learning objectives! Repeat and save as *.mts.

38. What is the nearest relative of the Lancaster virus sequence on the tree?
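The “V” and “S” counts from step 30 can also be reasoned about by hand. The sketch below is a rough Python illustration of the idea, not MEGA's own code: a variable site is an alignment column with at least two different nucleotides, and a singleton site is a variable column in which at most one nucleotide occurs more than once. The function name site_statistics and the toy alignment are hypothetical, and skipping any column that contains a gap or ambiguous base is an assumption made here for simplicity; MEGA's handling of gaps depends on the options chosen, so its counts may differ.

```python
from collections import Counter


def site_statistics(aligned):
    """Count variable and singleton sites in a list of aligned sequences.

    Assumption for this sketch: columns containing anything other than
    A, C, G or T (gaps, ambiguity codes) are skipped entirely.
    """
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences in an alignment must have the same length")

    variable = 0
    singleton = 0
    for column in zip(*aligned):            # walk the alignment column by column
        bases = [b.upper() for b in column]
        if any(b not in "ACGT" for b in bases):
            continue                        # skip columns with gaps or ambiguity
        counts = Counter(bases)
        if len(counts) < 2:
            continue                        # constant site
        variable += 1
        repeated = sum(1 for n in counts.values() if n > 1)
        if repeated <= 1:
            singleton += 1                  # at most one base occurs multiple times
    return {"length": length, "variable": variable, "singleton": singleton}


if __name__ == "__main__":
    # Toy alignment invented for illustration only.
    toy = ["ACGTAC", "ACGTAA", "ACTTGA", "AC-TGA"]
    stats = site_statistics(toy)
    print(stats)
    print("percent variable:", 100 * stats["variable"] / stats["length"])
```

Dividing each count by the alignment length gives the percentages asked for in step 30.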

As to the question of whether or not the Lancaster virus is a new species or just a variant of rhinovirus 22, please read the paper by McIntyre et al. (2013): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3749525/ Compare your own answer to question 35 above with the figure given in the paper.
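For reference, the p-distance computed in step 35 is simply the proportion of aligned sites at which two sequences differ. The following Python sketch shows the arithmetic under stated assumptions: the function names p_distance and nearest_relative and the toy sequences are hypothetical, and sites where either sequence has a gap or ambiguous base are left out of the comparison (a pairwise-deletion choice made here for simplicity; MEGA's result depends on the gap treatment selected). The threshold to compare against is not reproduced here and should be taken from the paper.

```python
def p_distance(seq_a, seq_b):
    """Proportion of compared sites at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue                      # assumption: ignore gaps and ambiguity codes
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def nearest_relative(query_name, alignment):
    """Return (name, p-distance) of the sequence closest to the query."""
    query = alignment[query_name]
    distances = {
        name: p_distance(query, seq)
        for name, seq in alignment.items()
        if name != query_name
    }
    best = min(distances, key=distances.get)
    return best, distances[best]


if __name__ == "__main__":
    # Toy alignment invented for illustration; not real rhinovirus data.
    toy = {
        "query": "ACGTACGTAC",
        "ref1":  "ACGTACGTAA",
        "ref2":  "ACTTACGGAA",
    }
    print(nearest_relative("query", toy))
```

Multiplying the p-distance by 100 gives a percentage divergence, which is the form most convenient for comparison with a published cut-off.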
