bio info 5

Post on 11-May-2015


Bioinformatics Lecture #5

Dr. Naeem Ud Din Khattak

Professor, Department of Zoology

Islamia College Peshawar (Chartered University)

Phylogenetic Tree Construction


• The mutation distance : The minimal number of nucleotides that would need to be altered in order for the gene for one Protein to code for the other.

• Example: the sequences ACTGAT and TCTATC, aligned:

  A C T G A T -
  T C T - A T C
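As a rough illustration of counting differences between the two example sequences, the sketch below counts the alignment columns in which two rows differ. This is a simplified stand-in for the mutation distance defined above, which is stated in terms of the nucleotides coding for a protein. The function name `count_differences` is hypothetical, and treating a gap as one difference is an assumption made here for simplicity.

```python
def count_differences(row_a: str, row_b: str) -> int:
    """Count alignment columns where the two rows differ.

    Assumption: a gap ("-") paired with a nucleotide counts as one difference.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(1 for x, y in zip(row_a, row_b) if x != y)


# The two sequences from the slide, compared position by position.
ungapped = count_differences("ACTGAT", "TCTATC")

# The same sequences with the gaps shown on the slide.
gapped = count_differences("ACTGAT-", "TCT-ATC")

print(ungapped, gapped)
```

Introducing the two gaps lowers the number of differing columns, which is why sequences are aligned before distances are counted.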

The construction of the tree


• Assume proteins, A, B and C, and their mutation distances.

• There are two questions:

1. Which pair does one join together first?

2. What are the lengths of edges a, b, and c?

       B    C
  A   24   28
  B        32

Which pair does one join together first?

• It is simply by choosing the pair with the smallest mutation distance.


What are the lengths of legs a, b, and c?


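The three mutation distances give three equations in the unknown leg lengths a, b, and c:

• (i) a + b = 24

• (ii) a + c = 28

• (iii) b + c = 32

• From (i): a = 24 - b. Substituting into (ii): 24 - b + c = 28, so c - b = 4 and c = 4 + b.

• Substituting c into (iii): b + 4 + b = 32, so 2b = 28 and b = 14.

• Substituting b = 14 back into (i) gives a = 24 - 14 = 10, and c = 4 + 14 = 18.

Result: a = 10, b = 14, c = 18.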

• Note that this analysis assumes that there are no multiple substitutions (when a single site undergoes two or more changes, e.g., the ancestral sequence … ATGT … gives … AGGT … and … ACGT …).
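The two questions for the three-protein example (which pair to join first, and how long legs a, b, and c are) can be sketched in a few lines of Python. This is only a minimal sketch: the function names `first_pair` and `leg_lengths` are hypothetical, and the closed-form expressions simply come from adding two of the three equations and subtracting the third.

```python
def first_pair(distances):
    """Return the pair of taxa with the smallest mutation distance."""
    return min(distances, key=distances.get)


def leg_lengths(d_ab, d_ac, d_bc):
    """Solve a+b=d_ab, a+c=d_ac, b+c=d_bc for the leg lengths a, b, c."""
    a = (d_ab + d_ac - d_bc) / 2
    b = (d_ab + d_bc - d_ac) / 2
    c = (d_ac + d_bc - d_ab) / 2
    return a, b, c


# Mutation distances from the slide.
distances = {("A", "B"): 24, ("A", "C"): 28, ("B", "C"): 32}

pair = first_pair(distances)
a, b, c = leg_lengths(
    distances[("A", "B")], distances[("A", "C")], distances[("B", "C")]
)
print(pair)     # ('A', 'B')
print(a, b, c)  # 10.0 14.0 18.0
```

Like the hand calculation, this assumes the distances are additive along the tree, i.e., no multiple substitutions at a single site.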

Based on lectures by C-B Stewart, and by Tal Pupko

Phylogenetic Tree Terminology

• Ancestral node or ROOT of the tree

• Internal nodes or divergence points (represent hypothetical ancestors of the taxa)

• Branches or lineages

• Terminal nodes (A, B, C, D, E): represent the TAXA (genes, populations, species, etc.) used to infer the phylogeny


Phylogenetic trees diagram the evolutionary relationships between the taxa

((A,(B,C)),(D,E)) = The above phylogeny as nested parentheses


• B and C are more closely related to each other than either is to A.

• A, B, and C form a clade that is a sister group to the clade composed of D and E.

• If the tree has a time scale, then D and E are the most closely related.
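To make the nested-parentheses notation concrete, the following sketch parses the string ((A,(B,C)),(D,E)) into nested tuples and lists the clades it contains. It is a minimal illustration that handles only taxon names, parentheses, and commas (no branch lengths); the helper names `parse_newick`, `leaves`, and `clades` are hypothetical.

```python
def parse_newick(text):
    """Parse a string such as '((A,(B,C)),(D,E))' into nested tuples."""
    pos = 0

    def node():
        nonlocal pos
        if pos < len(text) and text[pos] == "(":
            pos += 1  # skip "("
            children = [node()]
            while pos < len(text) and text[pos] == ",":
                pos += 1  # skip ","
                children.append(node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # skip ")"
            return tuple(children)
        start = pos
        while pos < len(text) and text[pos] not in "(),":
            pos += 1
        if start == pos:
            raise ValueError(f"expected a taxon name at position {pos}")
        return text[start:pos]

    tree = node()
    if pos != len(text):
        raise ValueError("unexpected trailing characters")
    return tree


def leaves(tree):
    """Return the set of taxa below a node."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def clades(tree):
    """Return the leaf set of every internal node (each clade)."""
    if isinstance(tree, str):
        return []
    found = [leaves(tree)]
    for child in tree:
        found.extend(clades(child))
    return found


tree = parse_newick("((A,(B,C)),(D,E))")
clade_sets = clades(tree)
for clade in clade_sets:
    print(sorted(clade))
```

The output lists {B, C}, {A, B, C}, and {D, E} as clades, alongside the root clade containing all five taxa; {A, B} is not among them, because no internal node has exactly A and B below it.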


Sequence Comparisons

• Nature acts conservatively, i.e., it does not develop a new kind of biology for every life form but continuously changes and adapts a proven general concept.

• Novel functionalities do not appear because a new gene has suddenly arisen but are developed and modified during evolution.

• Thus, alleles of a gene found in a population arise from a common ancestor gene: they are HOMOLOGOUS.

Homology is not a measure of similarity; rather, it means that sequences have a shared evolutionary history and, therefore, possess a common ancestral sequence (Tatusov et al. 1997).

• Homology is an all-or-none phenomenon.

Orthologs

• Homologous proteins from different species that possess the same function (e.g., corresponding kinases in a signal transduction pathway in humans and mice) are called orthologs.

Paralogs

• Homologous proteins that have different functions in the same species (e.g., two kinases in different signal transduction pathways of humans) are termed paralogs.


• A visual representation of orthologs (and some other commonly confused terms, paralogs and homologs)

Orthologs: "genes that have diverged after a speciation event... [that] tend to have similar function" (Fulton et al. 2006). Thus, orthologs are genes whose encoded proteins fulfill similar roles in different species.

• Homology is not quantifiable.

• The similarity and identity of two sequences, however, are quantifiable.

Identity

• The ratio of the number of identical amino acids or nucleotides to the total number of amino acids or nucleotides.

• Example: 4/20 = 0.2.
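The identity ratio can be computed directly from two aligned sequences of equal length. The sketch below is illustrative only: the function name `identity` and the two 20-residue sequences are hypothetical, chosen so that exactly 4 of the 20 positions match, as in the 4/20 example.

```python
def identity(seq_a: str, seq_b: str) -> float:
    """Fraction of aligned positions that carry identical residues."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if x == y)
    return matches / len(seq_a)


# Hypothetical aligned protein sequences: only the first four residues match.
seq_a = "ACDEFGHIKLMNPQRSTVWY"
seq_b = "ACDEGHIKLMNPQRSTVWYF"

print(identity(seq_a, seq_b))  # 0.2
```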

Similarity

• Unlike identity, similarity is not as simple to calculate. Before similarity can be determined, it must first be defined how similar the building blocks of sequences are to each other.

• This is done with the help of similarity matrices, which specify the probability with which a sequence transforms into another sequence over time.

• This probability depends on the time elapsed and on the mutation rate of the nucleotides.

• For nucleotide sequences the simplest solution is an identity matrix ( Fig. 4.2a).

• For protein sequences, an identity matrix is not sufficient to describe biological and evolutionary processes.

• Amino acids are not exchanged with the same probability as might be conceived theoretically.

• YOU CAN RECALL THE SYNONYMOUS AND NON-SYNONYMOUS MUTATIONS

• For example:

• an exchange of aspartic acid for glutamic acid is frequently observed;

• an exchange of aspartic acid for tryptophan is seen rarely.


• A second reason why the aspartic acid-to-glutamic acid mutation occurs more often is that both amino acids have similar properties.

• In contrast, aspartic acid and tryptophan are chemically different: the hydrophobic tryptophan is frequently found in the center of proteins, whereas the hydrophilic aspartic acid occurs more often at the surface.

• Amino acid substitution matrices, therefore, describe the probability at which amino acids are exchanged in the course of evolution.

• The most commonly used amino acid scoring matrices are the PAM (Point Accepted Mutation; Dayhoff et al. 1978) and BLOSUM (Blocks Substitution Matrix; Henikoff and Henikoff 1992) groups.
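The difference between an identity matrix and an amino acid substitution matrix can be seen by scoring the same aligned columns both ways. The sketch below is a minimal illustration: the function names are hypothetical, the three-residue sequences are made up, and the six scores are entries quoted from the standard BLOSUM62 table for aspartic acid (D), glutamic acid (E), and tryptophan (W), which should be checked against the published matrix before being relied on.

```python
# A small excerpt of BLOSUM62 scores (assumed values; verify before reuse).
BLOSUM62_EXCERPT = {
    ("D", "D"): 6, ("E", "E"): 5, ("W", "W"): 11,
    ("D", "E"): 2, ("D", "W"): -4, ("E", "W"): -3,
}


def identity_score(x, y):
    """Identity matrix: 1 for a match, 0 for any mismatch."""
    return 1 if x == y else 0


def blosum_score(x, y):
    """Look up a symmetric substitution score in the excerpt."""
    if (x, y) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(x, y)]
    return BLOSUM62_EXCERPT[(y, x)]


def alignment_score(row_a, row_b, score):
    """Sum the per-column scores of two gap-free aligned rows."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    return sum(score(x, y) for x, y in zip(row_a, row_b))


reference = "DDD"
conservative = "DED"  # D replaced by the chemically similar E
radical = "DWD"       # D replaced by the chemically different W

for variant in (conservative, radical):
    print(
        variant,
        alignment_score(reference, variant, identity_score),
        alignment_score(reference, variant, blosum_score),
    )
```

Under the identity matrix both variants score 2, so the two substitutions look equally costly; under the substitution scores the D-to-E variant scores 14 and the D-to-W variant scores 8, reflecting how often each exchange is observed.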

Amino acid      3-letter  1-letter  Property
Tryptophan      Trp       W         Hydrophobic
Aspartic acid   Asp       D         Hydrophilic, electrically charged (negative)
Glutamic acid   Glu       E         Hydrophilic, electrically charged (negative)

NUCLEOTIDE AND AMINO ACID SEQUENCES ARE EVOLUTIONARILY DIFFERENT

SO,WE NEED DIFFERENT CRITERIA AND MATRICES TO ANALYZE THEM

• Fig. 4.2a: For nucleotide sequences, the simplest solution is an identity matrix.

• Fig. 4.2b: For amino acid sequences, we need similarity matrices.

Calculation of a global alignment of two similar protein sequences (scores shown in the figure: 65 and 19).


• Using MEGA to Calculate Mutation Distance

Outgroup to root a phylogenetic tree

• The tree of human, chimpanzee, gorilla, and orangutan genes is rooted with a baboon gene because we know from the fossil record that the common ancestor of the four species split away from baboon earlier in geological time.

• Let’s See Members of this Group

[Figure: tree of Chimp, Human, Gorilla, and Orangutan, rooted with Baboon as the outgroup]

[Figure: tree of Chimp, Human, Gorilla, and Orangutan without the outgroup]

[Figure: tree of kiwi, ostrich, swan, ring-necked pheasant, silver pheasant, song sparrow, parrot, and lizard]

[Figure: outgroup-rooted tree of kiwi, ostrich (Struthio camelus), swan, song sparrow, ring-necked pheasant, silver pheasant, and parrot]

The Design of the phylogenetic TREE does not change the evolutionary distance among the various taxa represented.


Types of Trees

• Rooted trees: the root represents the common ancestor.

• Unrooted tree: represents the same phylogeny without the root node.

Is this tree rooted?

Fig. 4.6. Phylogenetic tree of dopamine receptor sequences.

Gene trees are not the same as species trees

Examples of what can be inferred from phylogenetic trees (DNA, protein)

1. Which species are the closest living relatives of modern humans?

2. Did the infamous Florida dentist infect his patients with HIV?

3. What is the relation between HIV and SIV?

1. Which species are the closest living relatives of modern humans?

[Figure, two panels showing trees of humans, chimpanzees, bonobos, gorillas, and orangutans on a time axis in MYA: "Mitochondrial DNA, most nuclear DNA-encoded genes, and DNA/DNA hybridization" and "The pre-molecular view"]

2. Did the Florida Dentist infect his patients with HIV?

Phylogenetic tree of HIV sequences from the DENTIST, his patients, and local HIV-infected people:

[Figure: tree with tips for the DENTIST (two sequences), Patients A to G, and local controls 2, 3, 9, and 35]

• Yes: the HIV sequences from these patients fall within the clade of HIV sequences found in the dentist.

• No: marked twice in the figure, for sequences that fall outside that clade.

From Ou et al. (1992) and Page & Holmes (1998)

3. Relating Human HIV to Simian SIV retroviruses

• Human immunodeficiency virus 1 (HIV-1) is pathogenic.

• SIVs are not pathogenic in their normal hosts.

IMAGE FROM: Medical Art Service, Munich / Wellcome Images.

The structure of HIV

CD4 proteins on surface

Phospholipid membrane

Matrix

Viral RNA

Viral enzymes: - Reverse transcriptase - Integrase - Protease

Capsid

HIV’s replication cycle

1. HIV attaches to CD4 receptors on the T cell.

2. The viral core of enzymes and RNA is injected into the cell.

3. DNA is transcribed from the viral RNA.

4. Double-stranded DNA is produced.

5. The DNA integrates with the host chromosome (viral integrase).

6. Transcription yields viral RNA, from which viral proteins are made.

7. A new virus is assembled.

8. Viral protease cuts up proteins.

9. The new virus leaves the cell.

Retrovirus genomes accumulate mutations relatively quickly

• Reverse transcriptase lacks an efficient proofreading activity, so it makes errors when it carries out RNA-dependent DNA synthesis.

• As a result, the molecular clock runs rapidly in retroviruses.

• Genomes that diverged quite recently display sufficient nucleotide dissimilarity for a phylogenetic analysis to be carried out.

• After less than 100 years of divergence, HIV and SIV genomes contain sufficient data.

• The starting point for this phylogenetic analysis is RNA extracted from virus particles, which is amplified by RT-PCR.

RT-PCR

Reverse transcription polymerase chain reaction (RT-PCR) is a variant of polymerase chain reaction (PCR). It is a laboratory technique commonly used in molecular biology in which an RNA strand is reverse transcribed into its DNA complement (complementary DNA, or cDNA) using the enzyme reverse transcriptase, and the resulting cDNA is amplified using PCR.
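At the sequence level, the reverse transcription step amounts to writing the DNA complement of the RNA strand. The sketch below illustrates only that base-pairing rule, not the laboratory procedure; the function name `reverse_transcribe` and the six-base RNA fragment are hypothetical.

```python
# Base pairing between an RNA template and the cDNA strand copied from it.
RNA_TO_CDNA = {"A": "T", "U": "A", "G": "C", "C": "G"}


def reverse_transcribe(rna: str) -> str:
    """Return the cDNA complementary to an RNA strand, written 5' to 3'."""
    # The cDNA is antiparallel to the template, so the RNA is read in reverse.
    return "".join(RNA_TO_CDNA[base] for base in reversed(rna.upper()))


rna = "AUGGCU"  # hypothetical fragment, written 5' to 3'
cdna = reverse_transcribe(rna)
print(cdna)  # AGCCAT
```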

• This tree has a number of interesting features. First, it shows that different samples of HIV-1 have slightly different sequences, the samples as a whole forming a tight cluster, almost a star-like pattern, that radiates from one end of the unrooted tree.

• This star-like topology implies that the global AIDS epidemic began with a very small number of viruses, perhaps just one, which have spread and diversified since entering the human population.

• The closest relative to HIV-1 among primates is the SIV of chimpanzees, the implication being that this virus jumped across the species barrier between chimps and humans and initiated the AIDS epidemic.

• However, this epidemic did not begin immediately: a relatively long uninterrupted branch links the center of the HIV-1 radiation with the internal node leading to the relevant SIV sequence, suggesting that after transmission to humans, HIV-1 underwent a latent period when it remained restricted to a small part of the global human population, presumably in Africa, before beginning its rapid spread to other parts of the world.

• Other primate SIVs are less closely related to HIV-1, but one, the SIV from sooty mangabey, clusters in the tree with the second human immunodeficiency virus, HIV-2.

• It appears that HIV-2 was transferred to the human population independently of HIV-1, and from a different simian host. HIV-2 is also able to cause AIDS, but has not, as yet, become globally epidemic.
