Post on 16-Oct-2021

Biochemistry

Biostatistics and Bioinformatics

Sequence Alignment – Meaning and Significance

Description of Module

Subject Name: Biochemistry

Paper Name: 13 Biostatistics and Bioinformatics

Module Name/Title: 06 Sequence Alignment – Meaning and Significance

Dr. Vijaya Khader Dr. MC Varadaraj

1. Objectives

This module aims to help students:

1. understand the biochemical meaning and concept of amino acid (protein) and nucleotide (DNA/RNA) sequence alignment, used to delineate residue-to-residue conservation during evolution
2. learn methods for creating sequence alignments
3. measure pairwise sequence similarity for scoring sequence alignments
4. evaluate different sequence alignments for a pair of sequences to infer homology
5. represent homologous sequence alignments for easy interpretation
6. distinguish global vs. local and pairwise vs. multiple sequence alignments

2. Concept Map

[Figure: concept map linking the topics of this module: Meaning of sequence alignment, Interpretation of alignment, Types of sequence alignment, Representing sequence alignment, Creating sequence alignment, Scoring sequence alignment]

3. Sequence alignment - Meaning and significance

Genes and proteins are linear sequences of nucleotides and amino acids, respectively. Genes are passed on to offspring after replication. During replication, however, the gene sequence may undergo mutation. Two types of mutation may occur: site mutations and gene duplication. Offspring survive only if they carry a functional copy of each essential gene. If an essential gene is non-functional, the organism harbouring it is eliminated from the population, and the mutated non-functional copy of the gene is eliminated with it. Consequently, all surviving individuals have at least one functional copy of each essential gene.

Site mutations in genes include three types of change:

1. substitution of one residue with another
2. insertion of a residue
3. deletion of a residue
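The three kinds of site mutation listed above can be pictured as simple string edits. The sketch below is only an illustration: the helper names and the short fragment `THISIS` are chosen here for the example and are not part of the module.

```python
def substitute(seq: str, pos: int, residue: str) -> str:
    """Replace the residue at 0-based index pos with another residue."""
    return seq[:pos] + residue + seq[pos + 1:]


def insert(seq: str, pos: int, residue: str) -> str:
    """Insert a residue before 0-based index pos."""
    return seq[:pos] + residue + seq[pos:]


def delete(seq: str, pos: int) -> str:
    """Delete the residue at 0-based index pos."""
    return seq[:pos] + seq[pos + 1:]


parent = "THISIS"                   # hypothetical fragment used only for illustration
print(substitute(parent, 1, "T"))   # TTISIS
print(insert(parent, 0, "I"))       # ITHISIS
print(delete(parent, 1))            # TISIS
```

An insertion in one sequence and a deletion in the other look the same when two sequences are compared, which is why both are later represented by a gap.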

These mutations occur at random positions. One offspring may therefore inherit a gene (or the protein sequence it encodes) with mutations near the beginning of the sequence, while another inherits a copy with mutations near the end. The two sequences nevertheless remain identical at many positions. This sequence identity reflects the conservation of residues at functionally significant positions, which maintains biochemical function.

To understand site mutations, suppose one organism has the protein sequence ‘THISISAPRRTEINSEQVENCEFEWFALSERESIDVES’. This organism acts as a parent and passes the sequence to one offspring without any mutation. Mutations then take place during replication, giving the changed sequence ‘ITISANNTHERSEQVENCEHAVINGMVTATED’, and this mutated sequence is passed on to another offspring. Since both offspring have a common parent (called the ancestor), their sequences are identical at the positions where no mutation took place. Such sequences in different individuals are called orthologous sequences. Orthologous genes encoding protein, rRNA or tRNA are called orthologues and are present in different individuals or species. When these proteins are purified from the surviving (functional) individuals and their amino acid sequences are determined experimentally, the two sequences have identical residues at functionally relevant positions, and the sequence inherited by each organism from the common ancestor may continue to perform the same or a similar function. An example of such similar sequences in different species is the enzyme hexokinase, found in E. coli, yeast and several other organisms.

A second type of mutation may also occur in an individual. This is gene duplication; the duplicated copies of a gene within the same individual are called paralogous genes, proteins or RNAs (paralogues). Paralogues evolve independently of each other and may continue to perform the same or a similar function. An example is lactate dehydrogenase, a tetrameric enzyme encoded by two genes. One gene is expressed predominantly in heart (giving the H4 tetramer) and the other in muscle (M4); in other tissues the mixed forms H3M, H2M2 and HM3 are also found. These are different forms of the same enzyme within a single organism: they catalyse the same reaction but differ slightly in sequence and consequently in structure, which results in different K_M and V_max values as well as different regulatory properties. Such forms of paralogous enzymes catalysing the same reaction are called isozymes or isoenzymes.

Orthologous and paralogous sequences are collectively known as homologous sequences. The protein sequence carries the information needed to adopt a particular three-dimensional structure. Homologous sequences share many identical residues, so they are likely to have similar structures and, consequently, similar functions. Sequences from different organisms can therefore be compared for similarity to infer homology, and from homology to infer structure and function. The comparison is made by aligning the similar residues of the two sequences. If the structure and function of a protein are known in one offspring or species, that knowledge can be applied to the same protein of unknown structure and function in another. Homologous sequences may nevertheless have altered or different functions, particularly when sequence divergence is substantial, i.e. when few residues are conserved. By tracing the evolution of these sequences through sequence alignment, their structure, function and cellular role can be inferred, and in this way sequence alignment can contribute towards better nutrition, health and environment. Sequence alignment was the first practical application of bioinformatics for predicting function from sequence similarity.

3.1. Creating sequence alignment

To delineate the identical and mutated positions in two homologous sequences, we create a sequence alignment between them. Each sequence is written in its own row, and the residues along each sequence (amino acids in proteins, bases in DNA/RNA) are written in successive columns. Identical residues are then matched from left to right so that conserved positions fall in the same column. For two sequences, the first residue of the first sequence and the first residue of the second sequence are both placed in the first column only if they are identical. If they are not identical, the first residue of the first sequence is compared with the second residue of the second sequence. If these are identical, both are placed in column 2, the first residue of the second sequence stays in column 1 of the second row, and a dash (hyphen) character, ‘-’, is placed in column 1 of the first row. This is called alignment of identical residues in columns. Inserting a dash into a sequence is called introducing (placing) a gap. A gap is introduced in one sequence wherever the residue at the corresponding position in the other sequence is not identical. In this way, gaps allow only identical residues to be aligned in successive columns along the sequences.

To create an alignment of identical residues, take the two sequences mentioned above: the first, ‘THISISAPRRTEINSEQVENCEFEWFALSERESIDVES’, has a length of 38 residues, and the second, ‘ITISANNTHERSEQVENCEHAVINGMVTATED’, has a length of 32 residues. First compare the first residue of each sequence. They are not identical: ‘T’ in the first sequence differs from ‘I’ in the second. Now match the first residue ‘T’ of the first sequence with the next residue of the second sequence. These are identical, as both are ‘T’. Therefore place both ‘T’ residues in the second column of each row. In addition, place a hyphen, ‘-’, at position 1 in the first row and ‘I’ at position 1 in the second row, as shown below.

-T

IT

Aligning a residue with a gap (hyphen) arises whenever a residue has been inserted in one of the two sequences or, equivalently, deleted from the other. To continue the alignment, compare the second residue ‘H’ of the first sequence with the third residue ‘I’ of the second sequence. They are not identical, so move on to the next residue of the first sequence and compare its third residue with the third residue ‘I’ of the second sequence. The third residue ‘I’ of the first sequence is identical to the third residue ‘I’ of the second sequence. Therefore align these residues in column 4 and insert a gap in column 3 of the second row, i.e. before ‘I’, so that the gap is aligned with residue ‘H’ in column 3, as shown below:

-THI

IT-I

Now, to continue alignment of next residue in both the sequences, we find that fourth residue ‘S’ in first sequence is identical to fourth residue ‘S’ in second sequence. Therefore align these two residues at next column in both rows, as shown below:

-THIS

IT-IS

Continue in the same manner with each successive position in both sequences. The completed alignment is shown below:

-THISISAPRR--T-E-INSEQVENCEFEWF-ALSERE-SID---V---E-S

IT-IS--A---NNTHER--SEQVENCE----HA-----V-I-NGMVTATED-
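A hand-built alignment like the one above is easy to get wrong by a column, so it can help to check it mechanically. The following is a minimal sketch, not part of the module: the function name `check_alignment` is hypothetical, and the checks are simply that both rows have the same length, that removing the gaps gives back the two input sequences, and that no column pairs a gap with a gap.

```python
SEQ1 = "THISISAPRRTEINSEQVENCEFEWFALSERESIDVES"
SEQ2 = "ITISANNTHERSEQVENCEHAVINGMVTATED"
ROW1 = "-THISISAPRR--T-E-INSEQVENCEFEWF-ALSERE-SID---V---E-S"
ROW2 = "IT-IS--A---NNTHER--SEQVENCE----HA-----V-I-NGMVTATED-"


def check_alignment(row1, row2, seq1, seq2):
    """Validate an alignment and return (identities, gap_columns, mismatches)."""
    if len(row1) != len(row2):
        raise ValueError("rows differ in length")
    if row1.replace("-", "") != seq1 or row2.replace("-", "") != seq2:
        raise ValueError("removing gaps does not give back the input sequences")
    identities = gaps = mismatches = 0
    for a, b in zip(row1, row2):
        if a == "-" and b == "-":
            raise ValueError("column pairs a gap with a gap")
        if a == "-" or b == "-":
            gaps += 1          # residue aligned with a gap
        elif a == b:
            identities += 1    # conserved position
        else:
            mismatches += 1    # substitution (none expected in this alignment)
    return identities, gaps, mismatches


print(check_alignment(ROW1, ROW2, SEQ1, SEQ2))  # (18, 34, 0)
```

Counted this way, the alignment has 52 columns: 18 identities, 34 columns containing a gap, and no mismatched columns.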

This process of lining up two sequences to achieve maximal levels of identity (and conservation, in the case of amino acid sequences) in successive columns is called sequence alignment. Aligning identical residues allows us to count the number of identical positions, called identities, and so to describe the extent of identity between two sequences. In arriving at the alignment above, we did not allow any mismatched residues in the same column; that is, substitution of one residue by another during evolution was not allowed, and only identical residues were matched in successive columns.

In an actual sequence alignment, however, similar but non-identical (mismatched) residues are also allowed to align, because one residue may replace another during evolution, sometimes with improved function. A non-identical pair of residues at a given position in an alignment represents a substitution during evolution. Non-identical residues should be placed in the same column only for substitutions that are allowed during evolution. An alignment therefore contains three kinds of column: a residue aligned with an identical residue (a conserved position), a residue aligned with a gap (an insertion or deletion), and a residue aligned with a different residue (a substitution). Some substitutions are more likely than others during evolution. As a simple, somewhat arbitrary criterion, we can group amino acids by similar physico-chemical properties, as shown in Table 1. Substitutions within a group can be allowed because they do not disrupt the three-dimensional structure of the mutated protein to a significant extent.

Table 1: Physico-chemical groups of amino acids

Physico-chemical group | Amino acids involved
Small amino acids | P, G, A
Small polar residues | S, T
Large hydrophobic side chains | V, L, I, M
Aromatic side chain | F, Y, W
Basic side chain or acid amide side chain | H, K, R, N, Q
Acidic side chain | D, E
Disulphide bond forming amino acid | C
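The grouping in Table 1 can be encoded as a small lookup so that "allowed substitution" becomes a yes/no test. This is a sketch under the module's own simplifying assumption that two residues are similar when they fall in the same Table 1 group; the names `GROUPS`, `GROUP_OF` and `is_conservative` are hypothetical.

```python
# Table 1, one entry per physico-chemical group (one-letter amino acid codes)
GROUPS = {
    "small": "PGA",
    "small polar": "ST",
    "large hydrophobic": "VLIM",
    "aromatic": "FYW",
    "basic or acid amide": "HKRNQ",
    "acidic": "DE",
    "disulphide bond forming": "C",
}

# Reverse lookup: residue -> name of its group
GROUP_OF = {aa: name for name, members in GROUPS.items() for aa in members}


def is_conservative(a: str, b: str) -> bool:
    """True if a and b are different residues from the same Table 1 group."""
    return a != b and GROUP_OF[a] == GROUP_OF[b]


print(is_conservative("R", "N"))  # True: both basic / acid amide
print(is_conservative("L", "V"))  # True: both large hydrophobic
print(is_conservative("F", "H"))  # False: aromatic vs basic
```

Real alignment programs replace this all-or-nothing grouping with empirically derived substitution scores, which the text defers to the next module.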

The substitution of a residue by such a similar residue at a position in an alignment is called a conservative substitution. A conservative substitution is not expected to affect the function of the evolving protein. Therefore, in addition to aligning identical residues, we can allow alignment of similar residues, here taken (somewhat arbitrarily) to be residues belonging to the same physico-chemical group. Aligning identical and similar residues allows us to count the identical and similar positions, together called similarities, and so to describe the extent of similarity between two sequences. Similarity is a descriptive term which implies that two sequences resemble each other by some criterion (physico-chemical or other, as we will see later); it can be used to infer homology. The alignment that allows matching of physico-chemically similar amino acids (such as ‘R’ with ‘N’ and ‘L’ with ‘V’, which share a group in Table 1) is shown below:

-THISISAPRRT-EINSEQVENCEFEWF-ALSERESID---V---E-S

IT-IS--A-NNTHE-RSEQVENCE----HAV-----I-NGMVTATED-

However, this simplified method of aligning two sequences from left to right is not used in practice, because it does not guarantee an optimal alignment. Instead, optimal sequence alignments are created with the dynamic programming method, which does guarantee an optimal alignment of two sequences. In the next module, 07 - Sequence Alignment - creation process, we will study the dynamic programming method for creating sequence alignments.

3.2. Scoring sequence alignments

To compare different alignments of the same pair of sequences, obtained by different methods (say, the simplified method given above and the dynamic programming method discussed in the next module), we need to assign a score to each alignment. Each column of aligned residues is assigned a score, keeping in mind that some substitutions are allowed at a given position during evolution and others are not. Initially, arbitrary scoring schemes were used, in which every identity, mismatch and gap was assigned a fixed score. The arbitrary scores were highest for identities (conserved positions), lower for mismatches (substitutions) of similar residues and lowest for gaps (insertions/deletions). To see how alignments are compared, let us assign some arbitrary scores. Say +2 is assigned to a column containing identical residues. A lower score, say +1, is assigned to a column containing an allowed substitution, usually between physico-chemically similar amino acids (Table 1). Wherever a gap is aligned with a residue, the column receives no score, i.e. zero. The total score of an alignment is the sum of the scores of its individual columns, i.e. the sum of the scores for identities, similarities and gaps. The first alignment, containing only identical residues and gaps, shown below:

-THISISAPRR--T-E-INSEQVENCEFEW-FALSERE-SID---V---E-S

IT-IS--A---NNTHER--SEQVENCE---H-A-----V-I-NGMVTATED-

has 18 identical residues aligned and 34 columns in which a residue in one sequence is aligned with a gap in the other. Therefore, a sum of 36 for identical residues and 0 for gaps gives a total alignment score of +36. This method of alignment did not allow non-identical residues to be placed in the same column. In practice, however, we can have three types of column score: say, one for identical residues (+2), one for similar residues (+1) and one for a gap (0). For the following alignment, which allows similar residues to be aligned,

-THISISAPRRT-EINSEQVENCEFEWF-ALSERESID---V---E-S

IT-IS--A-NNTHE-RSEQVENCE----HAV-----I-NGMVTATED-

the total score is +40, because there are 18 identical residues (+36), four conservative substitutions by similar residues (+4) and twenty-six gaps (0). This score of +40 is higher than the score of +36 for the first alignment. Therefore, the second alignment method, which allows substitution by similar amino acids, gives a better alignment and is preferred. Assigning scores based on identities and similarities allows an alignment to be evaluated for its overall similarity.

An arbitrary scoring matrix was sometimes appropriate for DNA and RNA comparisons, where transitions and transversions are treated as equally likely and the sequences are only moderately diverged. However, arbitrary scoring systems are not useful for protein alignments, and many alternatives have been suggested. One of the earliest was a scoring matrix based on the minimum number of bases that must be mutated to convert a codon for one amino acid into a codon for a second amino acid. This matrix, known as the minimum mutation distance matrix, has succeeded in identifying more distant relationships among protein sequences. It is an improvement over the unitary matrix because it incorporates knowledge about the process of mutating one amino acid into another. However, its use is limited because it does not take into account codon usage or the selection process during evolution.

Therefore, in practice, different scoring schemes have been developed for the alignment of a residue with a residue and for alignments involving gaps. There are also different schemes for protein and for nucleotide sequences, and for sequences at different evolutionary divergences. In the next module, 07 - Sequence Alignment - creation process, we will study the empirical scoring systems for scoring aligned residues and a residue aligned with a gap.
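The arbitrary scheme used in this section (+2 for an identity, +1 for a substitution within a Table 1 group, 0 for a gap) can be applied column by column. The sketch below is an illustration only; `score_alignment` and `similar` are hypothetical names, and the alignment strings are the two alignments shown in this section. Counted column by column, each alignment contains 18 identities, which matches the 18 identities over 48 columns quoted in section 3.3.

```python
# Table 1 groups, written as strings of one-letter codes
GROUPS = ["PGA", "ST", "VLIM", "FYW", "HKRNQ", "DE", "C"]


def similar(a: str, b: str) -> bool:
    """True if both residues belong to the same Table 1 group."""
    return any(a in g and b in g for g in GROUPS)


def score_alignment(row1, row2, identity=2, similarity=1, gap=0):
    """Sum the column scores of a pairwise alignment under the arbitrary scheme."""
    if len(row1) != len(row2):
        raise ValueError("rows differ in length")
    total = 0
    for a, b in zip(row1, row2):
        if a == "-" or b == "-":
            total += gap
        elif a == b:
            total += identity
        elif similar(a, b):
            total += similarity
        else:
            raise ValueError(f"substitution {a}/{b} is not allowed by Table 1")
    return total


# First alignment: identities and gaps only
FIRST = ("-THISISAPRR--T-E-INSEQVENCEFEW-FALSERE-SID---V---E-S",
         "IT-IS--A---NNTHER--SEQVENCE---H-A-----V-I-NGMVTATED-")
# Second alignment: similar residues may share a column
SECOND = ("-THISISAPRRT-EINSEQVENCEFEWF-ALSERESID---V---E-S",
          "IT-IS--A-NNTHE-RSEQVENCE----HAV-----I-NGMVTATED-")

print(score_alignment(*FIRST))   # 36 = 18 identities x 2
print(score_alignment(*SECOND))  # 40 = 18 identities x 2 + 4 similarities x 1
```

Because gaps cost nothing in this scheme, it cannot penalise an alignment for being mostly gaps; the empirical schemes mentioned above score gaps negatively.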

3.3. Interpretation of pairwise alignment score for inferring homology

The identity between two sequences, i.e. the presence of identical residues at corresponding positions in the alignment, is quantitative: homology can be inferred from the percentage of identical residues over the length of the pairwise alignment. The figure given below is an x–y graph showing three regions. In the green region, the sequence identity over the aligned length is high enough that it is safe to infer homology. In the yellow zone, we need to be careful before inferring homology and should rely on statistical assessment. In the red zone, it is not safe to infer homology at all.

In the present case, with 18 identities (37.5%) over an alignment length of 48 residues, the alignment falls in the twilight zone. We therefore follow a more powerful approach to inferring homology, based on evaluating the statistical significance of the alignment similarity score using the Z-score and the P-value.

P-value: The statistical significance of an alignment score is frequently assessed by calculating a P-value, the probability of obtaining an alignment with a score at least this high by chance, which is then used to infer homology.

The P-value is given by $P = K e^{-\lambda S}$. The parameters $K$ and $\lambda$ can be thought of simply as natural scales for the alignment (search) space size and the scoring system, respectively, and $S$ is the alignment score. The following table gives ranges of P-values from which homology between two sequences can be inferred.

P-value range | Inference
< $10^{-100}$ | The two sequences are identical
$10^{-100}$ to $10^{-50}$ | The two sequences are nearly identical
$10^{-50}$ to $10^{-5}$ | The two sequences have clear homology
$10^{-5}$ to $10^{-1}$ | The two sequences have possible distant homology
> $10^{-1}$ | The two sequences are randomly related; therefore, no homology
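The formula and the table of ranges can be combined into a small helper. This is only a sketch: the function names are hypothetical, the values of K and lambda below are placeholders chosen for illustration (they are not PRSS output and do not belong to any particular scoring system), and the treatment of values falling exactly on a range boundary is an assumption, since the table does not specify it.

```python
import math


def p_value(score: float, k: float, lam: float) -> float:
    """P = K * exp(-lambda * S), the expression given in the text."""
    return k * math.exp(-lam * score)


def inference(p: float) -> str:
    """Map a P-value onto the ranges of the table above."""
    if p < 1e-100:
        return "identical"
    if p < 1e-50:
        return "nearly identical"
    if p < 1e-5:
        return "clear homology"
    if p < 1e-1:
        return "possible distant homology"
    return "randomly related, no homology"


# Placeholder parameters: NOT estimated from any real scoring system
K, LAM = 0.1, 0.3
for s in (10, 50, 200):
    p = p_value(s, K, LAM)
    print(f"S={s}: P={p:.3g} -> {inference(p)}")
```

The point of the example is the shape of the relationship: for fixed K and lambda, P falls exponentially as the score S rises, so a modest increase in score can move an alignment from one row of the table to the next.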

The P-value is given by $K e^{-\lambda S}$. However, sequence alignment programs report only $S$; they do not report $K$ and $\lambda$. These must therefore be calculated, which can be done with the online program PRSS (http://www.ch.embnet.org/software/PRSS_form.html) using a specific empirically derived scoring system. Since we have not yet discussed empirically derived scoring systems, we will take up the calculation of the P-value when we work through a practical sequence alignment example.

3.4. Representations of homologous sequence alignment for easy interpretation

Once we infer that two sequences are homologous, we need to represent the alignment in a way that highlights the conserved positions. Two ways of presenting sequence alignments are shown below.

A. Using symbols between the aligned residues:

THISISAPRRT-EINSEQVENCE

  |:||| ..| | .||||||||

  ITISA-NNTHE-RSEQVENCE

Some sequence alignment programs, such as EMBOSS Needle and EMBOSS Stretcher, mark identities with a vertical bar ‘|’, conservative substitutions with a colon ‘:’ and semi-conservative substitutions with a dot ‘.’, as shown above.

B. Using the conserved sequence of identical residues written underneath the alignment:

THISISAPRRT-EINSEQVENCE

  ITISA-NNTHE-RSEQVENCE

  i-isa---t-e--seqvence
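Both representations can be derived from the two aligned rows. The sketch below is an illustration with hypothetical function names. It assumes the second row starts two columns to the right of the first, which is the offset the symbol line and the conserved-sequence line imply. It marks identities only: the ‘:’ and ‘.’ symbols depend on a substitution scoring matrix and are not reproduced here.

```python
ROW1 = "THISISAPRRT-EINSEQVENCE"
ROW2 = "  ITISA-NNTHE-RSEQVENCE"   # assumed offset of two columns


def identity_line(row1: str, row2: str) -> str:
    """Representation A, identities only: '|' between identical residues."""
    return "".join("|" if a == b and a.isalpha() else " "
                   for a, b in zip(row1, row2))


def consensus_line(row1: str, row2: str) -> str:
    """Representation B: identical residues in lower case, '-' elsewhere."""
    out = []
    for a, b in zip(row1, row2):
        if a == " " or b == " ":
            out.append(" ")        # column not covered by both rows
        elif a == b and a.isalpha():
            out.append(a.lower())  # conserved residue
        else:
            out.append("-")        # substitution or gap
    return "".join(out)


print(ROW1)
print(identity_line(ROW1, ROW2))
print(ROW2)
print(consensus_line(ROW1, ROW2))
```

With these rows, the last line printed is the conserved sequence `i-isa---t-e--seqvence` (indented by two columns), the same string shown in representation B.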

3.5. Types of Sequence alignments

In the alignment shown below,

-THISISAPRRT-EINSEQVENCEFEWF-ALSERESID---V---E-S

IT-IS--A-NNTHE-RSEQVENCE----HAV-----I-NGMVTATED-

the two sequences are forced to align along their complete lengths. This is called global alignment. In this alignment, the residues in the middle region are aligned with only single-residue insertions/deletions, whereas the end regions contain large gaps, i.e. insertions/deletions of more than one amino acid. The complete-length global alignment, with its large gaps, may therefore not be biologically relevant. The middle portion, which involves insertion/deletion of single amino acids only and is shown below, is more likely to reflect biologically relevant evolution.

-------APRRT-EINSEQVENCE------------------------

-------A-NNTHE-RSEQVENCE------------------------

If only these residues with single gaps are considered, the result is called a local alignment, i.e. an alignment of short regions within the complete sequence lengths. In actual protein sequences, such local regions represent domains, motifs or fingerprints. An alignment between a pair of sequences, as above, is known as a pairwise sequence alignment; we have therefore seen both a pairwise global and a pairwise local alignment. Sometimes more than two sequences are aligned in this fashion. The alignment of more than two sequences is called a multiple sequence alignment (MSA). An MSA may be either global or local. We will see the importance of local alignments in sequence similarity search and pattern matching.
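The distinction drawn above between the middle region (single gaps) and the rest of the global alignment (longer gaps) can be checked by listing the runs of gap characters in each row. This is a minimal sketch with a hypothetical helper name; columns are numbered from 1, and columns 8 to 24 are the region kept in the local alignment shown above.

```python
from itertools import groupby

ROW1 = "-THISISAPRRT-EINSEQVENCEFEWF-ALSERESID---V---E-S"
ROW2 = "IT-IS--A-NNTHE-RSEQVENCE----HAV-----I-NGMVTATED-"


def gap_runs(row: str):
    """Return (start_column, length) for every run of '-' in an aligned row."""
    runs, col = [], 1
    for is_gap, group in groupby(row, key=lambda c: c == "-"):
        n = len(list(group))
        if is_gap:
            runs.append((col, n))
        col += n
    return runs


for row in (ROW1, ROW2):
    print(gap_runs(row))
# [(1, 1), (13, 1), (29, 1), (39, 3), (43, 3), (47, 1)]
# [(3, 1), (6, 2), (9, 1), (15, 1), (25, 4), (32, 5), (38, 1), (48, 1)]

# Gap runs that lie inside the local region (columns 8 to 24)
local = [(start, n) for row in (ROW1, ROW2)
         for start, n in gap_runs(row) if 8 <= start <= 24]
print(local)  # [(13, 1), (9, 1), (15, 1)]
```

Every gap inside columns 8 to 24 has length 1, while all runs of two or more gap characters fall outside that region, which is the pattern the text uses to motivate the local alignment.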

4. Summary

In this lecture we learnt:

1. the biochemical meaning and concept of amino acid (protein) and nucleotide (DNA/RNA) sequence alignment to delineate residue-to-residue conservation during evolution
2. a method for creating sequence alignment
3. scoring pairwise sequence alignment
4. to infer homology from sequence alignment
5. to represent homologous sequence alignment for easy interpretation
6. to distinguish global vs. local and pairwise vs. multiple sequence alignments