introduction to bioinformatics - shandong...


Introduction to Bioinformatics

Dr. rer. nat. Gong Jing

Cancer Research Center

Medicine School of Shandong University

2012.11.09


Chapter 4 Phylogenetic Tree


Phylogeny

Evidence from morphological, biochemical, and gene sequence data suggests that all organisms on Earth are genetically related, and that the genealogical relationships of living things can be represented by a vast evolutionary tree, the tree of life. The tree of life thus represents the phylogeny of organisms.

A phylogeny is a tree representation of the evolutionary history relating the species we are interested in.


How to Study the Evolutionary History

The most direct evidence comes from fossils. But the fossil record is scattered, incomplete, and unsystematic.


We can use comparative morphology and comparative anatomy to determine the general framework of evolution, but many details remain controversial.


Computational molecular evolution: the phylogenetic tree. Evolution happens at the level of molecules: DNA, RNA, and protein.

Basic assumptions:

1) Nucleic acid sequences and protein sequences carry the information about the evolutionary history of species;

2) Molecular clock: the rate of evolutionary change (the number of amino acid differences) of a given protein is approximately constant over time and across lineages.

=> The more similar two homologous proteins are, the more recently they diverged from their common ancestor.
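Under the molecular clock assumption above, the number of amino acid differences between two homologous proteins serves as a rough proxy for the time since they diverged. A minimal sketch of that count, assuming the two sequences are already aligned and use "-" for gaps (the function name and the example sequences are hypothetical, not from the lecture):

```python
def p_distance(seq_a, seq_b):
    """Fraction of differing positions between two aligned sequences.

    Columns where either sequence has a gap ('-') are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    # Keep only the columns where both sequences have a residue.
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no gap-free columns to compare")
    differences = sum(a != b for a, b in pairs)
    return differences / len(pairs)


# Made-up aligned protein fragments that differ at 2 of 8 positions.
print(p_distance("MKTAYIAK", "MKSAYLAK"))
```

A smaller value means fewer accumulated changes, which under the clock assumption is read as a more recent common ancestor.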


Homologous genes are genes that derive from a common ancestor.

They have three types of relationships:

Orthologs: Orthologs are homologs separated by speciation, the phenomenon during which a common ancestor gives rise to two subgroups that slowly drift away from their common genetic makeup to become distinct species. Orthologs usually have similar functions and structures.

Paralogs: Paralogs are homologs separated by a duplication event, meaning that a gene was duplicated within a genome. One of the duplicates may have kept the original function while the other may have acquired a new function.

Xenologs: Xeno is a Greek word that means "foreigner." Xenologs result from a lateral transfer between two organisms, that is, a direct DNA transfer between two species. This means that one of the species contains a gene that does not have the same history as the genome in which it is inserted. This has been reported between pathogenic bacteria and humans.


Phylogenetic Tree

What is a phylogenetic tree used for?

For a certain protein/gene, determining the closest relatives of the organism that you’re interested in.

Discovering the function of a new protein/gene.

Retracing the origin of a gene.


Concepts:

leaf / outer node

branch / lineage

inner node

root
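To make the vocabulary above concrete, here is a minimal sketch that encodes a small rooted tree as nested Python tuples. The encoding and the species names are illustrative assumptions, not part of the lecture: a string is a leaf (outer node), a tuple is an inner node, and the outermost tuple is the root.

```python
# A leaf is a string; an inner node is a tuple of its children.
# The outermost tuple is the root of the tree.
tree = (("human", "chimp"), ("mouse", "rat"))


def leaves(node):
    """Return the leaf (outer node) names from left to right."""
    if isinstance(node, str):
        return [node]
    return [name for child in node for name in leaves(child)]


def count_inner_nodes(node):
    """Count the inner nodes, including the root."""
    if isinstance(node, str):
        return 0
    return 1 + sum(count_inner_nodes(child) for child in node)


print(leaves(tree))             # the four outer nodes
print(count_inner_nodes(tree))  # root plus two inner nodes
```

Each nesting level corresponds to one branch (lineage) leading from an inner node to one of its children.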


With different kinds of branches, phylogenetic trees have different names. All these trees represent the same evolutionary relationships.

Cladogram: Branch lengths do not mean anything.

Change-based phylogram: Branch lengths indicate numbers of evolutionary changes.

Time-based phylogram: Inner nodes indicate branching time points.


There are many different ways to represent the information found in a phylogenetic tree.


Branches can be rotated at a node without changing the relationships among the outer nodes.
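One way to check this claim is to compare trees after sorting the children of every node, so that any rotation collapses to the same form. The sketch below assumes trees are written as nested tuples (a string is a leaf, a tuple is an inner node); this encoding and the labels A, B, C are hypothetical choices for illustration.

```python
def canonical(node):
    """Return the tree with the children of every node in sorted order.

    Two trees that differ only by rotations at their nodes have the
    same canonical form.
    """
    if isinstance(node, str):
        return node
    children = (canonical(child) for child in node)
    # Sorting by repr gives a stable order for mixed strings and tuples.
    return tuple(sorted(children, key=repr))


original = (("A", "B"), "C")
rotated = ("C", ("B", "A"))    # same tree, branches rotated at both nodes
different = (("A", "C"), "B")  # a genuinely different grouping

print(canonical(original) == canonical(rotated))
print(canonical(original) == canonical(different))
```

The first comparison is true because only the drawing order changed; the second is false because A is grouped with C instead of B.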


Choosing the Right Sequences for the Right Tree

Workflow: choose the right sequences, do the multiple sequence alignment, then build the phylogenetic tree.

Should you do this on the protein or on the DNA sequence?

If the DNA sequences are > 70% identical: build a DNA multiple sequence alignment.

If the DNA sequences are < 70% identical and code for proteins: translate them into proteins and build the protein multiple sequence alignment.

If your sequences are too similar at the protein level, you can thread the DNA sequences back onto the protein alignment using pal2nal: http://www.bork.embl.de/pal2nal/.

In practice, unless your sequences are almost identical, it is easier to keep working at the protein level.
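The 70% rule above can be sketched as a small helper. This is only an illustration under stated assumptions: the two DNA sequences are already aligned without gaps, the function names are hypothetical, and the handling of exactly 70% identity and of non-coding sequences is a choice made here because the slide does not specify it.

```python
def percent_identity(dna_a, dna_b):
    """Percentage of identical positions between two aligned sequences."""
    if len(dna_a) != len(dna_b) or not dna_a:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(dna_a, dna_b))
    return 100.0 * matches / len(dna_a)


def choose_alignment_level(dna_a, dna_b, coding=True, threshold=70.0):
    """Suggest 'DNA' or 'protein' for the multiple sequence alignment."""
    if percent_identity(dna_a, dna_b) > threshold:
        return "DNA"
    # Assumption: non-coding sequences cannot be translated, so stay at DNA.
    return "protein" if coding else "DNA"


# Made-up sequences: the first pair is 90% identical, the second 60%.
print(choose_alignment_level("ATGCATGCAT", "ATGCATGCAA"))
print(choose_alignment_level("ATGCATGCAT", "ATGAAAGGTT"))
```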


Paralogs of a large human gene family: the tree tells the story of this gene family.

Orthologs from different species: the tree looks much like a species tree.


Algorithms of Tree Reconstruction

Maximum Parsimony (MP):

Closely related sequences; accurate; number of sequences <= 12.

Distance (Neighbor Joining, NJ):

Distantly or closely related sequences; not very accurate.

Maximum Likelihood (ML):

Distantly related sequences; very accurate.

Speed (fastest first):

Distance > Maximum Parsimony > Maximum Likelihood
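To show why the distance approach is the fastest of the three, here is a minimal sketch of neighbor joining on a small distance matrix. It is not how ClustalW or any other tool implements the method; the function name, the internal node labels, and the five-taxon distances are made up for illustration.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Build an unrooted tree from pairwise distances by neighbor joining.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    Returns a list of (node, neighbor, branch_length) edges.
    """
    d = dict(dist)
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Total distance from each node to all the other nodes.
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}

        def q(pair):
            i, j = pair
            return (n - 2) * d[frozenset((i, j))] - r[i] - r[j]

        # Join the pair with the smallest Q value.
        i, j = min(combinations(nodes, 2), key=q)
        u = f"node{next_id}"
        next_id += 1
        dij = d[frozenset((i, j))]
        # Branch lengths from the joined pair to the new inner node.
        len_i = 0.5 * dij + (r[i] - r[j]) / (2 * (n - 2))
        len_j = dij - len_i
        edges += [(i, u, len_i), (j, u, len_j)]
        # Distances from the new inner node to every remaining node.
        for k in nodes:
            if k not in (i, j):
                d[frozenset((u, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))] - dij)
        nodes = [k for k in nodes if k not in (i, j)] + [u]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges


# Illustrative (made-up) pairwise distances for five taxa.
names = ["a", "b", "c", "d", "e"]
matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
dist = {frozenset((names[i], names[j])): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)}

for node, neighbor, length in neighbor_joining(names, dist):
    print(f"{node} -- {neighbor}: {length}")
```

The method only ever looks at the distance matrix, never at the alignment columns themselves, which is what makes it fast and also why it is listed as less accurate than the character-based methods.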


Preparing Your Multiple Sequence Alignment

Before using your MSA to build a tree, you must make sure that it is as accurate as possible.

Computing your multiple sequence alignment:

ClustalW: http://www.ebi.ac.uk/Tools/msa/clustalw2/

MUSCLE: http://www.ebi.ac.uk/Tools/msa/muscle/

T-Coffee: http://tcoffee.crg.cat/

Removing bad columns that affect the tree quality:

1. Make sure there are as many gap-free columns as possible.

2. Remove the extremities of your multiple alignment.

3. Remove the gap-rich regions of your alignment.

4. Be sure to keep the most informative blocks.
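The column cleanup described above can also be scripted. The following is a minimal sketch, assuming the alignment is held as a list of equal-length strings with "-" marking gaps; the function name, the `max_gap_fraction` parameter, and the toy alignment are hypothetical.

```python
def drop_gap_rich_columns(alignment, max_gap_fraction=0.0):
    """Keep only the columns whose gap fraction is within the limit.

    alignment: list of equal-length aligned sequences ('-' marks a gap).
    max_gap_fraction: 0.0 keeps only gap-free columns.
    """
    n_seqs = len(alignment)
    keep = [
        col for col in range(len(alignment[0]))
        if sum(seq[col] == "-" for seq in alignment) / n_seqs
        <= max_gap_fraction
    ]
    return ["".join(seq[col] for col in keep) for seq in alignment]


# Toy alignment: columns 1 and 2 contain gaps.
toy = ["MK-LV", "MKALV", "M--LV"]
print(drop_gap_rich_columns(toy))
print(drop_gap_rich_columns(toy, max_gap_fraction=0.4))
```

With the default limit only gap-free columns survive; raising the limit keeps columns with a few gaps, which may preserve more informative blocks at the cost of noisier data.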

1. Make sure there are as many gap-free columns as possible. (Figure: alignment with the columns to remove marked.)

2. Remove the "bad" terminals of your multiple alignment. (Figure: alignment with the columns to remove marked.)

3. Remove the gap-rich regions of your alignment. (Figure: alignment with the columns to remove marked.)

4. Be sure to keep the most informative blocks. (Figure: alignment with the columns to keep marked.)

How to Delete Columns with WORD

While pressing the Alt key on your keyboard, use the mouse to select entire columns in your alignment.

When you've selected everything you want to remove, press the Delete key to remove the selected block.


Computing Your Tree

A guide tree is NOT a phylogenetic tree!


EMBL ClustalW http://www.ebi.ac.uk/Tools/phylogeny/clustalw2_phylogeny


English Courses for Graduate Students


clustalw.aln

sequences.fasta


This tree is much more accurate than a guide tree!


A phylogram is a phylogenetic tree whose branch lengths are proportional to the amount of character change. In a cladogram, the branch lengths do not represent any change.
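The difference between the two displays can be illustrated on the tree text itself. The sketch below assumes the tree is stored as a Newick string (a parenthesized format in which ":" introduces a branch length); the slides do not name the format, and the function name and the example tree are hypothetical.

```python
import re


def to_cladogram(newick):
    """Drop every ':<length>' annotation, keeping only the branching order."""
    return re.sub(r":[0-9.eE+-]+", "", newick)


# Made-up phylogram with branch lengths after each colon.
phylogram = "((A:0.10,B:0.20):0.05,C:0.30);"
print(to_cladogram(phylogram))
```

The branching pattern is untouched; only the information about the amount of change is discarded. This simple pattern assumes that labels do not themselves contain colons.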


Phylogram Tree


Cladogram Tree


Choosing different display options gives different tree representations.


Displaying Your Tree

The easiest way to save your tree is to make a screen capture with the print-screen (PrntScr) key on your keyboard. You can then paste this image into your favorite application (PowerPoint, Paint, etc.).

Paste (Ctrl + V) into Windows Paint.


MyTree.ph


Phylodendron http://iubio.bio.indiana.edu/treeapp/treeprint-form.html


right click MyTree.png


sequences.fasta -> clustalw.aln -> MyTree.ph -> MyTree.png