Chapter 1 Introductory Review
1.1 Introduction
The knowledge of the complete human genome sequence has unfolded the mysteries of
human genome variation, which in turn has allowed a mechanism-based approach to
understanding the relationship of genotype with disease. This understanding is
considered an essential precursor to the development of personalized medicine. With
rapid advances in high-throughput genotyping and next-generation sequencing
technologies, a large amount of genetic variation has been discovered, and it takes
many forms. The simplest type of variant results from a single base mutation that
substitutes one nucleotide for another; this accounts for the most common form of
variation, referred to as single nucleotide polymorphisms (SNPs). Many other forms of
variation result from the insertion or deletion of one or more nucleotides, so-called
insertion/deletion (INDEL) polymorphisms. The most common insertion/deletion events
occur in repetitive sequence elements, consisting of variable-length sequence motifs
that are repeated in tandem in a variable copy number, so-called variable number
tandem repeat polymorphisms (VNTRs). VNTRs can be further divided on the basis of the
size of the tandem repeat unit into microsatellites (or simple sequence repeats, SSRs)
and minisatellites. Microsatellites consist of repeat motifs of one to six bases.
Direct tandem repeat sequences with motifs of 10-30 base pairs are called
minisatellites (Jeffreys et al., 1985). The rarest insertion/deletion events involve
deletions or duplications of regions that can range from a few kilobases to several
megabases. A few other types of repeats have also been observed in genomes. These
include palindromic sequences, inverted repeats, and mirror repeats (Cox and Mirkin,
1997).
This huge quantity of various forms of genetic variation in the human genome led many
to question the origin and maintenance of such a genetic load in the human population.
Kimura (1983) formulated the neutral theory of evolution, which proposes that most
sequence variations have no significant phenotypic consequences and so are not subject
to natural selection; the majority of mutations are therefore likely to be
phenotypically neutral. However, a certain number of as yet undefined alleles can
either cause disease directly (referred to as mutations) or increase the
susceptibility to disease (polymorphisms). Bioinformatics analysis of human sequence
provides an opportunity to identify the most common form of genetic variation, SNPs,
by comparison of two sequences, viz., coding DNAs, expressed sequence tags (ESTs) or
genomic sequences. The discovery of SNPs that affect biological function has become
increasingly important, and the databases available for SNPs are discussed to a large
extent in this chapter.
1.2 Single Nucleotide Polymorphisms (SNPs)
SNPs are single-nucleotide changes in DNA that account for approximately 90% of the
genetic variation among individuals in a population (Collins et al., 1998). A SNP is a
nucleotide change that is prevalent in at least 1% of the population (Figure 1.1).
There are two categories of SNPs, as given below:
1. Linked SNPs: SNPs which do not reside within genes and do not affect protein
   function. These are also referred to as indicative SNPs, which are associated
   with the response to drugs or with the risk of getting a certain disease.
2. Causative SNPs: SNPs which affect protein structure and/or function and
   cause disease. These can be divided into two categories:
   (a) Coding SNPs: SNPs found in the coding regions of genes, which can affect
       protein function. Coding SNPs can again be divided into two types:
       (i) Non-synonymous SNPs (nsSNPs): the changed nucleotide leads to a
           change in amino acid.
       (ii) Synonymous SNPs (sSNPs): the changed nucleotide does not lead to a
            change in amino acid.
   (b) Non-coding SNPs: the nucleotide change is located within the regulatory
       parts of genes and is correlated with changes in the corresponding mRNA
       expression.
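
The distinction between synonymous and non-synonymous coding SNPs can be illustrated with a small sketch. It assumes the standard genetic code and a codon given on the coding strand; the function name classify_coding_snp and its arguments are hypothetical and are not taken from any of the tools reviewed in this chapter.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G base order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"  # codons starting with T
    "LLLLPPPPHHQQRRRR"  # codons starting with C
    "IIIMTTTTNNKKSSRR"  # codons starting with A
    "VVVVAAAADDEEGGGG"  # codons starting with G
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_coding_snp(codon, position, new_base):
    """Classify a single-base change within one codon (hypothetical helper).

    codon    -- reference codon on the coding strand, e.g. "GAG"
    position -- 0-based index of the changed base within the codon
    new_base -- the substituted nucleotide
    """
    codon = codon.upper()
    mutated = codon[:position] + new_base.upper() + codon[position + 1:]
    ref_aa, alt_aa = CODON_TABLE[codon], CODON_TABLE[mutated]
    # Any change in the encoded residue (including a new stop codon, "*")
    # is reported as non-synonymous in this simplified two-class scheme.
    label = "synonymous" if ref_aa == alt_aa else "non-synonymous"
    return label, ref_aa, alt_aa


# GAG (Glu) -> GTG (Val) changes the amino acid; GAA -> GAG (both Glu) does not.
print(classify_coding_snp("GAG", 1, "T"))
print(classify_coding_snp("GAA", 2, "G"))
```

Only the single affected codon is examined here; a real annotation pipeline would also need the reading frame and strand of the transcript.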
Figure 1.1: A single nucleotide polymorphism (SNP), in which a single nucleotide (A, C,
T or G) in the DNA sequence is altered. Here, C is changed to T.
The non-synonymous SNPs (nsSNPs) that lead to amino acid changes in protein products
are likely to affect protein structure and function, depending on the type of amino
acid change as well as the site of the change (Cargill et al., 1999; Stenson et al.,
2003; Thorisson and Stein, 2003; Ng and Henikoff, 2006). Some amino acid changes are
tolerated by proteins with no concomitant phenotypic effect, and the corresponding
nsSNPs are referred to as benign or neutral nsSNPs. Those leading to amino acid
changes that are not tolerated by protein structure and function, and that therefore
lead to disease phenotypes, are referred to as pathogenic or disease mutants (Saunders
and Baker, 2002; Bao and Cui, 2005; Yue and Moult, 2006).
Although comparative genetic analyses of healthy and diseased individuals have led to
the discovery of a number of mis-sense mutations/nsSNPs associated with diseases, the
list may be far from complete, as the uncharacterized mutations/nsSNPs discovered by
the human genome project outnumber the characterized ones. In this post-genomic era,
classification of nsSNPs as disease-causing or neutral has therefore been perceived as
the first step before any further study, such as pharmacogenomics, is attempted, and a
variety of computational methods have been devised for this purpose (Mooney, 2005; Ng
and Henikoff, 2006; Thusberg and Vihinen, 2009). Before discussing these methods, I
give details of the databases hosting information on mutations/SNPs. In addition to
serving as information resources, these databases have also provided datasets for
benchmark studies of the computational methods developed for prediction of pathogenic
mutations.
1.3 Databases
The databases include dbSNP (Sherry et al., 1999), the Human Genome Variation
Database (HGVbase) (Fredman et al., 2004), On-line Mendelian Inheritance in Man
(OMIM) (Hamosh et al., 2005) and the Human Gene Mutation Database (HGMD) (Stenson et
al., 2003), among others. The details of these databases are given below:
1.3.1 The Single Nucleotide Polymorphism Database (dbSNP)
dbSNP is a free, public-domain archive for a broad collection of simple genetic
polymorphisms across different organisms. The database was developed and is maintained
by the National Center for Biotechnology Information (NCBI) in collaboration with the
National Human Genome Research Institute (NHGRI) and is available at
http://www.ncbi.nlm.nih.gov/SNP/. It was created in 1998 (Sherry et al., 1999) to
supplement GenBank, NCBI’s public collection of protein and nucleotide sequences. In
addition to SNPs, dbSNP contains a range of other molecular variation: (1) deletion
and insertion polymorphisms (DIPs/indels) and (2) microsatellite repeat variations or
short tandem repeats (STRs). Each dbSNP entry includes the sequence context of the
polymorphism (i.e., the surrounding sequence), the occurrence frequency of the
polymorphism (by population or individual), and the experimental method(s), protocols,
and conditions used to assay the variation. dbSNP can be searched using the Entrez SNP
tool with queries such as a refSNP ID number, a gene name, an allele or a build
number, and the search returns summarized information for the matching SNP. It has
been reported that this database contains some false-positive entries due to
genotyping and base-calling errors (Reich et al., 2003; Mitchell et al., 2004;
Musumeci et al., 2010).
1.3.2 Human Genome Variation Database (HGVbase)
The Human Genome Variation database, HGVbase, previously known as HGbase
(http://hgvbase.cgb.ki.se/; Fredman et al., 2004), is a highly curated and
non-redundant database of available genomic variation data of all types, but mostly
comprising single nucleotide polymorphisms (SNPs). HGVbase is supported by a European
consortium comprising teams at the Karolinska Institute, Sweden, the European
Bioinformatics Institute, United Kingdom (UK), and the European Molecular Biology
Laboratory, Germany.
This database can also be regarded as a manually curated extension of dbSNP, in which
the HGVbase curators provide a more extensively validated SNP data set by filtering
out SNPs in repeat and low-complexity regions and by identifying SNPs for which a
genotyping assay can successfully be designed. HGVbase includes polymorphisms,
variations with rare or single-occurrence alleles, and disease-related and
disease-causing clinical mutations.
1.3.3 The Human Gene Mutation Database (HGMD)
The Human Gene Mutation Database (HGMD) constitutes a comprehensive core
collection of data on germ-line mutations responsible for human inherited disease
(http://www.hgmd.org/; Stenson et al., 2003). HGMD was first made publicly available
in April 1996 and, following the collaboration between HGMD and BIOBASE GmbH in 2006,
is now available to users commercially. The scope of HGMD is limited to mutations,
which include single base-pair substitutions in coding, regulatory and
splicing-relevant regions, insertions/deletions (indels), duplications and triplet
repeat expansions.
1.3.4 On-line Mendelian Inheritance in Man (OMIM)
OMIM is an on-line database (http://www.omim.org; Hamosh et al., 2000) that
catalogues all known human genes and their associated mutations, based on the
long-running catalogue Mendelian Inheritance in Man (MIM), started in 1967 by Victor A.
McKusick at Johns Hopkins. The database became available on the NCBI web site in 1995.
OMIM is an excellent resource for background information about the biology of genes
and their related diseases.
1.3.5 The UniProt/SwissProt Database
UniProtKB/Swiss-Prot (http://expasy.org/; Bairoch and Apweiler, 1996) is a highly
curated, manually annotated, non-redundant protein sequence database. This database
was created in 1986 by Amos Bairoch at the Swiss Institute of Bioinformatics and is
maintained collaboratively by the Department of Medical Biochemistry of the University
of Geneva and the EMBL Data Library. The objective of UniProtKB/Swiss-Prot is to
provide all known relevant information about a particular protein. The information
about variants is listed as disease/polymorphism for each protein sequence entry. An
additional benefit of UniProt/SwissProt is that it is well integrated with OMIM,
dbSNP and the NCBI database family, so that whenever new variants are added to those
databases they also become available in UniProt/SwissProt.
1.4 Computational analysis of effects of nsSNPs
As mentioned earlier, there are a large number of mis-sense mutations whose
phenotypic effects have not been discovered. Hence, methods to accurately predict the
effect of mis-sense mutations have always been in demand. Several methods have been
developed and have been briefly reviewed (Mooney, 2005; Ng and Henikoff, 2006;
Thusberg and Vihinen, 2009). The basic approach adopted by all these methods involves
the use of sequence information, structural information, or both, for the proteins
harboring the nsSNPs, with the underlying idea that mis-sense mutations that alter
protein structure and function are likely to be pathogenic and those that do not are
likely to be neutral/benign (Figure 1.2). In other words, the phenotypic effect of a
mis-sense mutation is judged by its effect at the protein level. In order to predict
whether a given mis-sense mutation is pathogenic or neutral, various features at the
mutation site are considered, which include evolutionary conservation (Miller et al.,
2001), solvent accessibility and secondary structure (Sunyaev et al., 2000). In
addition, the effect of the mutation on protein stability is also considered by some
studies (Wang and Moult, 2001).
There have been studies that map mis-sense mutations onto their respective proteins
and examine their protein sequence and structural contexts (Sunyaev et al., 2000;
Yue et al., 2006; Burke et al., 2007; Adzhubei et al., 2010). Wang and Moult (2001)
showed that 83% of disease-causing mutations affect protein stability. Using both
structure and sequence information, Sunyaev et al. (2000) showed that 70% of
disease-causing mutations affect structurally and functionally important sites, such
as buried sites, active sites or sites involved in disulphide bonds. Gong and Blundell
(2010) examined the distribution of amino acid variants by mapping them onto 3D
structures, where available, and reported that disease-related variants occur much
more frequently than polymorphic variants in solvent-inaccessible regions and at amino
acid residues involved in hydrogen bond formation.
However, the coverage of prediction methods that use protein structure is only 14%
(Yue and Moult, 2005), compared with 81% for sequence-based methods (Ramensky et al.,
2002). For sequence-based prediction methods, the first step is to select the
homologous sequences, manually or automatically. Since the amino acids occurring in
the alignments form the fundamental basis of sequence-based prediction, the alignments
and the number of sequences used are central to the prediction of pathogenic
mutations. Some prediction methods (e.g., SNPs&GO; Calabrese et al., 2009) also
incorporate annotations or gene ontology to increase the prediction accuracy. In the
following sections I discuss some of the widely used methods (Table 1.1).

Figure 1.2: The basic approach for amino acid substitution prediction using either a
sequence-based or a structure-based method.
[Flow chart. Input: protein sequence or structure, together with the amino acid
substitution. Structure branch: structural features such as crystallographic B-factor,
solvent accessibility, ligand binding sites, 3D structural environment, etc. Sequence
branch: sequence-based features including conservation score, position-specific
evolutionary score derived from the MSA, physicochemical properties and amino acid
substitution matrix. Scoring rules are then applied to give the prediction:
PATHOGENIC or BENIGN.]

Table 1.1: Available amino acid substitution prediction methods

Method        | Algorithm used                             | Conservation analysis                          | Structural features
SIFT          | Scores calculated using Dirichlet mixtures | Sequence homology                              | -
PolyPhen      | Empirical rules                            | Position-Specific Independent Counts (PSIC)    | Predicted features / homology modeling
PANTHER       | Alignment scores                           | PANTHER library, HMMs                          | -
SNAP          | Neural network                             | PSIC profiles, Pfam, PSI-BLAST                 | Predicted features
SNPs3D        | Support vector machine                     | Shannon-entropy based                          | Predicted features
PhD-SNP       | Support vector machine                     | Sequence profiles, sequence environment        | -
MutPred       | Random forest                              | SIFT, Pfam, PSI-BLAST                          | Predicted features
nsSNPAnalyzer | Random forest                              | Normalized probability                         | Homologous structures
PMUT          | Neural network                             | Physicochemical features                       | Predicted features
PAREPRO       | Support vector machine                     | psap score, residue difference                 | -
MAPP          | MSAs                                       | Physicochemical features                       | -
SAAP          | Known PDB                                  | -                                              | Structural analysis
SNPs&GO       | Support vector machine                     | Sequence profiles, ontology                    | -
TopoSNP       | MSAs                                       | Relative entropy, Pfam                         | 3D structural locations
PolyPhen-2    | Bayesian classification                    | PSIC profiles                                  | Predicted features / homology modeling
1.4.1 SIFT
SIFT (Sorting Intolerant From Tolerant) constructs a multiple sequence alignment (MSA)
for the query sequence and creates a score matrix, based on a 13-component Dirichlet
mixture, for each position in the alignment ((http://sift.jcvi.org/) and
(http://sift-dna.org); Ng and Henikoff, 2001). Based on the amino acids appearing at
each position in the MSA, SIFT calculates a score for each amino acid substitution,
which is converted into a normalized probability that the substitution would be
evolutionarily tolerated. Substitutions at a position with normalized probabilities
less than a chosen cutoff value (0.05) are predicted to be pathogenic, and those
greater than or equal to the cutoff value are predicted to be tolerated. SIFT is
available both as an online server and as standalone software that can be downloaded
to a local system and run.
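
The decision rule described above can be written down in a few lines. This is only a sketch of the thresholding step, assuming the normalized probabilities have already been computed; it does not reproduce how SIFT builds the alignment or the Dirichlet-mixture scores, and the function name sift_call is hypothetical.

```python
SIFT_CUTOFF = 0.05  # cutoff value quoted in the text


def sift_call(normalized_probability, cutoff=SIFT_CUTOFF):
    """Apply the cutoff rule to one normalized probability (hypothetical helper)."""
    if not 0.0 <= normalized_probability <= 1.0:
        raise ValueError("a normalized probability must lie in [0, 1]")
    # Below the cutoff: predicted pathogenic; at or above it: predicted tolerated.
    return "pathogenic" if normalized_probability < cutoff else "tolerated"


# Made-up probabilities for three substitutions at one alignment position.
example_scores = {"A": 0.62, "P": 0.01, "G": 0.05}
calls = {residue: sift_call(p) for residue, p in example_scores.items()}
print(calls)
```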
1.4.2 PolyPhen
PolyPhen (Polymorphism Phenotyping), like SIFT, takes an evolutionary approach to
distinguishing pathogenic nsSNPs from functionally neutral ones and is available as an
online server (http://genetics.bwh.harvard.edu/pph; Ramensky et al., 2002). PolyPhen
uses a rule-based cutoff system based on sequence, phylogenetic, and structural
information to classify variants. The sequence-based characterization includes SWALL
database annotation of sequence features (Johnson and Todd, 2001), the SignalP program
to predict signal peptide regions (Nielsen et al., 1997), the Coils2 program for
prediction of coiled-coil regions (Lupas et al., 1991), TMHMM to predict transmembrane
regions (Krogh et al., 2001), and the PHAT (Predicted Hydrophobic And Transmembrane)
matrix substitution score (Ng et al., 2000). The phylogenetic prediction is based on
the position-specific independent counts (PSIC) score (Sunyaev et al., 1999) derived
from multiple sequence alignments (MSAs). For structural characterization, PolyPhen
uses three-dimensional protein structure databases, such as PDB (Protein Data Bank) or
PQS (Protein Quaternary Structure), together with the DSSP (Dictionary of Secondary
Structure in Proteins) software (Kabsch and Sander, 1983), to determine whether a
variant may affect the protein's secondary structure, solvent-accessible surface area
and phi-psi dihedral angles. In addition, PolyPhen calculates the normalized B-factor
(temperature factor), the region of the phi-psi map (Ramachandran map), the change in
residue side chain volume, the normalized accessible surface area and the change in
accessible surface propensity resulting from the amino acid substitution. PolyPhen
also checks whether the amino acid substitution site is in spatial contact with
ligands, with other protein subunits (interchain contacts), or with functional and
binding sites. After characterizing the variant, PolyPhen uses empirically derived
rules to predict it as “probably damaging” to protein function, “possibly damaging”,
“benign” or “unknown”.
1.4.3 PolyPhen-2
PolyPhen-2 is an improved version of PolyPhen that uses a selected combination of 11
sequence- and structure-based features for the characterization of an amino acid
substitution, and is available both as an online server and in batch form
(http://genetics.bwh.harvard.edu/pph2; Adzhubei et al., 2010). The sequence-based
attributes include the PSIC score of the wild-type residue and the difference between
the wild-type and mutant PSIC scores; the MSA properties include the number of
residues observed in the MSA, the sequence identity (SI) with the closest homologue
and the position of the mutation in relation to domain boundaries as defined by Pfam
(Finn et al., 2010). The structure-based attributes are the change in accessible
surface area propensity for buried residues, the crystallographic B-factor and the
solvent accessibility. PolyPhen-2 uses a naïve Bayesian classifier to classify
variants as “probably damaging”, “possibly damaging”, “benign” or “unknown”.
1.4.4 SNAP
SNAP (Screening for Nonacceptable Polymorphisms) is an online neural-network-based
method for predicting the effect of a mis-sense mutation
(http://rostlab.org/services/snap/; Bromberg and Rost, 2007). To predict a score for a
particular variant, the method utilizes: the local sequence environment of a residue;
biochemical properties, including substitution by a charged amino acid at a buried
position, introduction of proline as a structure disruptor in alpha-helices,
replacement of a hydrophilic side chain by a hydrophobic one or vice versa, and
over-packing of a cavity/core through replacement by a larger residue; transition
frequencies of the mutations; evolutionary information encoded from combinations of
weighted amino acid frequencies and position-specific scoring matrix vectors from
PSI-BLAST (Altschul et al., 1997) and profiles generated by PSIC (Sunyaev et al.,
1999); structure-based sequence features, including secondary structure predicted by
PROFsec (Rost and Sander, 1993; Rost, 1996), solvent accessibility predicted by
PROFacc (Rost and Sander, 1994; Rost, 2005) and chain flexibility predicted by
PROFbval (Schlessinger et al., 2006); protein family/domain-related evolutionary
information from Pfam (Finn et al., 2010); and SWISS-PROT annotations (Bairoch and
Apweiler, 2000). The resulting score indicates the predicted effect and also provides
a reliability index (RI).
1.4.5 PANTHER
PANTHER (Protein ANalysis THrough Evolutionary Relationships), an online server,
estimates the likelihood that a mis-sense variant will have a functional impact on the
protein (http://www.pantherdb.org/tools/csnpScoreForm.jsp; Thomas et al., 2003). It
calculates substitution position-specific evolutionary conservation (subPSEC) scores
based on an alignment of evolutionarily related proteins. The alignments are derived
from the PANTHER library of protein families/subfamilies, which is based on Hidden
Markov Models (HMMs). First, the likelihood of a particular amino acid at a particular
position, the aaPSEC score, is calculated; the subPSEC score is then obtained from the
difference between the aaPSEC scores of the two amino acid residues involved in the
substitution. The subPSEC score can range from 0 (neutral) to about -10 (more
pathogenic). The authors suggested -3 as the best cut-off value for the prediction of
pathogenic mutations.
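
As a rough illustration of how a subPSEC-style score behaves, the sketch below assumes that aaPSEC is the log of an amino acid's probability at a position relative to the most probable amino acid there, and that subPSEC is the negative absolute difference of the two aaPSEC values. This is consistent with the 0 to about -10 range quoted above, but the exact formulas are an assumption here, and the probabilities are made up rather than taken from a PANTHER HMM.

```python
import math


def aa_psec(prob, max_prob):
    """Assumed aaPSEC: log-probability relative to the most likely residue."""
    return math.log(prob / max_prob)


def sub_psec(p_wild_type, p_mutant, max_prob):
    """Assumed subPSEC: negative absolute difference of the two aaPSEC scores."""
    return -abs(aa_psec(p_wild_type, max_prob) - aa_psec(p_mutant, max_prob))


def is_pathogenic(score, cutoff=-3.0):
    """Apply the -3 cut-off from the text (treating the boundary as pathogenic
    is an assumption of this sketch)."""
    return score <= cutoff


# Made-up position-specific probabilities: a conserved wild-type residue (0.6,
# also the most probable residue) replaced by a rarely observed one (0.006).
score = sub_psec(0.6, 0.006, max_prob=0.6)
print(round(score, 3), is_pathogenic(score))
```

Because the score depends only on the ratio of the two probabilities, swapping the wild-type and mutant residues gives the same value under these assumptions.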
1.4.6 PhD-SNP
PhD-SNP (Predictor of human Pathogenic Single Nucleotide Polymorphisms), a web-based
support vector machine classifier, uses a combination of single-sequence
(SVM-Sequence) and sequence profile (SVM-Profile) information to classify a mis-sense
variant (http://gpcr.biocomp.unibo.it/~emidio/PhD-SNP/PhD-SNP.htm; Capriotti et al.,
2006). SVM-Sequence classifies the mutation as disease or benign based on the nature
of the specific mutation and its neighboring/local sequence environment. SVM-Profile
is built from the sequence profile information in MSAs and classifies the variant
according to the ratio between the frequencies of the wild-type and mutated residues.
The prediction for a mutation at a particular position is based on a decision-tree
algorithm, in which the user can choose either sequence-based information alone or
sequence and profile-based information together (hybrid method).
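
The profile-derived quantity mentioned above, the ratio between the frequencies of the wild-type and mutated residues, can be sketched for a single alignment column as follows. The pseudocount and the helper name frequency_ratio are assumptions made for illustration (the pseudocount simply avoids division by zero when the mutant residue is never observed); they are not taken from PhD-SNP itself, which feeds such information into an SVM rather than using the ratio directly.

```python
from collections import Counter


def frequency_ratio(column, wild_type, mutant, pseudocount=1.0):
    """Ratio of wild-type to mutant residue counts in one MSA column.

    column -- string of residues observed at the position ("-" marks a gap)
    """
    counts = Counter(residue for residue in column if residue != "-")
    wild_type_count = counts[wild_type] + pseudocount
    mutant_count = counts[mutant] + pseudocount
    return wild_type_count / mutant_count


# Made-up column: leucine dominates, isoleucine appears once, one gap.
column = "LLLLLLLI-"
print(frequency_ratio(column, "L", "I"))  # conservative replacement, seen once
print(frequency_ratio(column, "L", "D"))  # replacement never seen in the column
```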
1.4.7 MutPred
MutPred is a web application developed to classify an amino acid substitution as
disease-associated or neutral (http://mutpred.mutdb.org/; Li et al., 2009). It is a
Random Forest based classifier that integrates protein structure, sequence and
evolutionary information. MutPred utilizes the SIFT method (Ng and Henikoff, 2003)
along with PSI-BLAST, transition frequencies (Bromberg and Rost, 2007) and Pfam
profiles (Finn et al., 2010). Structural attributes include secondary structure and
solvent accessibility prediction by the PHD method (Rost, 1996), coiled-coil structure
prediction by MARCOIL (Delorenzi and Speed, 2002), disorder prediction by DisProt
(Peng et al., 2006), stability prediction by MuPro (Cheng et al., 2006), transmembrane
helix prediction by TMHMM (Krogh et al., 2001) and B-factor prediction (Radivojac et
al., 2004). Functional site predictions include DNA-binding residues (Ahmad et al.,
2004), catalytic residues, methylation (Daily et al., 2005), calmodulin-binding
targets (Radivojac et al., 2006), ubiquitination (Radivojac et al., 2009) and
glycosylation. MutPred reports a probability score indicating whether a mis-sense
variant is pathogenic or neutral, and estimates the likely functional effects using
significance p-values for each particular phenotypic effect.
1.4.8 nsSNPAnalyzer
nsSNPAnalyzer is an online random forest based classifier that uses structural and
evolutionary information for the classification of mis-sense variants
(http://snpanalyzer.utmem.edu; Bao et al., 2005). After a protein sequence is
submitted as a query, nsSNPAnalyzer searches the ASTRAL database (Chandonia et al.,
2004) for homologous protein structures, from which it extracts structural features
including solvent accessibility, environmental polarity and secondary structure. The
evolutionary attributes include the normalized probability of the amino acid
substitution from the MSA as well as the similarity and dissimilarity between the
wild-type and mutated amino acids.
1.4.9 SNPs&GO
SNPs&GO is a web server based on a support vector classifier that predicts
disease-related mutations from the protein sequence
(http://snps-and-go.biocomp.unibo.it/snps-and-go/; Calabrese et al., 2009). It is
based on the type of mutation and sequence environment information, sequence profiles
generated from MSAs and PANTHER predictions (Thomas et al., 2003). A novel feature, a
log-odds score derived from Gene Ontology (GO) terms (Ashburner et al., 2000), is
crucial in increasing the performance of SNPs&GO in predicting pathogenic mutations.
The SNPs&GO output consists of a table listing the number of the mutated position in
the protein sequence, the wild-type residue, the new residue and whether the mutation
is predicted as disease-related or neutral.
1.4.10 SNPs3D
SNPs3D is an SVM-based classifier which assigns molecular functional effects of nsSNPs
based on structure and sequence analysis and is available as an online server
(http://www.snps3d.org/; Yue et al., 2006). The sequence-based features include the
probability of the substitution at a particular position, the Shannon entropy at each
position, and the mean and standard deviation of the entropy over all positions.
Structural attributes include a set of 15 stability factors which are used to assess
the impact of each mutant on protein stability. These fall into the following classes:
electrostatic interactions, viz., reduction of polar-polar, polar-charge and
charge-charge interactions; solvation effects, viz., burying of charged or polar
groups; disulphide bond breakage; reduction in non-polar area buried on folding;
structural rigidity, viz., crystallographic B-factor, Z score and standard deviation;
and steric strain, representing backbone strain and overpacking. The sequence features
contribute to the svm-profile score and the structural features to the svm-structure
score. A positive svm-profile or svm-structure score indicates a variant classified as
neutral, and a negative score indicates a pathogenic case. For variants that act by
affecting protein function rather than stability, the stability model is expected to
return a positive (non-pathogenic) svm-structure score and the profile model a
negative (pathogenic) score.
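
The entropy-based sequence features listed above can be illustrated with a short sketch that computes the Shannon entropy of each alignment column and then the mean and standard deviation over all positions. The toy alignment, the base-2 logarithm, the exclusion of gaps and the use of the population standard deviation are all assumptions of this sketch rather than details taken from SNPs3D.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def shannon_entropy(column):
    """Shannon entropy (in bits) of the residues in one alignment column."""
    residues = [r for r in column if r != "-"]  # gaps are ignored (assumption)
    total = len(residues)
    counts = Counter(residues)
    # Sum of p * log2(1/p) over the residue types observed in the column.
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def entropy_profile(alignment):
    """Per-position entropies plus their mean and standard deviation."""
    entropies = [shannon_entropy(col) for col in zip(*alignment)]
    return entropies, mean(entropies), pstdev(entropies)


# Toy alignment of four sequences and three positions (made-up data):
# position 1 is fully conserved, positions 2 and 3 each show two residue types.
alignment = ["ACD", "ACE", "AGD", "AGE"]
entropies, mean_entropy, sd_entropy = entropy_profile(alignment)
print(entropies, round(mean_entropy, 3), round(sd_entropy, 3))
```

A fully conserved column has zero entropy, so a substitution there is more likely to be flagged than one in a highly variable column.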
1.4.11 PMUT
PMUT is a neural network (NN) based method available on-line as a web server
(http://mmb2.pcb.ub.es:8080/PMut/; Ferrer-Costa et al., 2005). It uses sequence as
well as structural features. The structural parameters include solvent accessibility
and secondary structure predicted by the PHD software (Rost and Sander, 1993),
observed secondary structure assigned by the SSTRUC implementation of the Kabsch and
Sander method (Rost and Sander, 1993), and observed solvent accessibility computed by
NACCESS (Hubbard and Thornton, 1993). The sequence-derived features include
substitution matrices such as BLOSUM62 and PAM40 and physicochemical properties,
including hydrophobicity indices from water/octanol free energy measurements and
volume changes derived from van der Waals volumes as well as volumes of buried
residues. The differences in physicochemical properties between the wild-type and
mutant residues are also incorporated. Secondary structure propensities are obtained
from the standard Chou and Fasman analysis (Chou and Fasman, 1974) as well as from the
analysis of Swindells et al. (1994). After submission to PMUT, the output is shown as
a pathogenicity index ranging from 0 to 1 (indexes >0.5 signal pathological mutations)
and a confidence index ranging from 0 (low) to 9 (high).
1.4.12 MAPP
MAPP (Multivariate Analysis of Protein Polymorphism) is a Java-based web tool that
considers the physicochemical variation in each column of an MSA and, on the basis of
this variation, calculates the deviation of a mutated amino acid from it; it can thus
predict the impact of all possible amino acid substitutions on the function of the
protein (http://mendel.stanford.edu/SidowLab/downloads/MAPP/index.html; Stone and
Sidow, 2005). MAPP considers six physicochemical properties for the evaluation of
mis-sense variants: hydropathy (Kyte and Doolittle, 1982), polarity (Stryer, 1995),
charge (Stryer, 1995), side-chain volume (Zamyatin, 1972), free energy in alpha-helical
conformation (Muñoz and Serrano, 1994) and free energy in beta-sheet conformation
(Muñoz and Serrano, 1994). MAPP calculates the impact score of each possible amino
acid variant from its deviation from the MSA column, computed by taking the difference
of each property from the column mean and dividing by the square root of the variance.
A high impact score identifies a potentially pathogenic variant, whereas a low impact
score indicates a negligible effect, i.e., a neutral variant.
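
The per-property deviation described above (difference from the column mean divided by the square root of the variance) can be sketched for a single property. The example uses the Kyte and Doolittle hydropathy scale; it is a simplified, unweighted illustration on a made-up column and does not reproduce how MAPP weights sequences or combines the six properties into one impact score. The function name property_deviation is hypothetical.

```python
import math
from statistics import mean, pvariance

# Kyte and Doolittle (1982) hydropathy index.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def property_deviation(column, variant, scale=HYDROPATHY):
    """Deviation of a variant residue from one MSA column for one property.

    Returns |x - mean| / sqrt(variance), where the mean and variance are taken
    over the property values of the residues observed in the column.
    """
    values = [scale[residue] for residue in column]
    mu = mean(values)
    variance = pvariance(values, mu)
    difference = abs(scale[variant] - mu)
    if variance == 0:
        # Invariant column: any different property value is an unbounded deviation.
        return 0.0 if difference == 0 else math.inf
    return difference / math.sqrt(variance)


# Made-up hydrophobic column: valine fits well, aspartate deviates strongly.
column = "IIVVLL"
print(round(property_deviation(column, "V"), 2))
print(round(property_deviation(column, "D"), 2))
```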
1.4.13 PAREPRO
PAREPRO (Prediction of amino acid replacement probability) is yet another SVM-based
method, which predicts pathogenic mutations on the basis of evolutionary information
as well as 50 selected properties from the AAindex (Kawashima et al., 1999; Kawashima
and Kanehisa, 2000) (http://www.mobioinfor.cn/parepro/; Tian et al., 2007). The method
is available both online and as a standalone program. First, the position-specific
amino acid probability (psap) score is calculated from the MSA; then residue
differences (RD), mutation position information (MI) and information about the
environment surrounding the mutation position (IE) are derived, and the combination of
these features is used as input to an SVM to build the model. PAREPRO thus appears to
use more specific evolutionary information.
1.4.14 TopoSNP
TopoSNP is an online server for analyzing non-synonymous SNPs (nsSNPs) that can be
mapped onto known 3D structures of proteins
(http://gila.bioengr.uic.edu/snp/toposnp/; Stitziel et al., 2004). This tool provides
an interactive structural visualization of nsSNPs as well as a classification of nsSNP
sites into three categories based on their geometric location in the protein
structure: in a surface pocket or an interior void; on a convex region or a shallow
depressed region; or completely buried inside the protein structure. A conservation
feature, viz., the relative entropy of SNP sites calculated from MSAs obtained from
the Pfam database, is also incorporated into TopoSNP. Thus, by selecting an nsSNP
site, TopoSNP can be used to view its geometric class assignment and relative entropy
score.
1.4.15 SNPeffect
SNPeffect is an online web resource that maps the phenotypic effects of allelic variation
caused by nsSNPs in humans (http://snpeffect.vib.be/; Reumers et al., 2005). SNPeffect
uses different functional properties to predict the effect of nsSNPs on the molecular
phenotype of proteins. These properties fall into three groups: those affecting protein
folding and stability, those affecting functional and binding sites, and those affecting
the cellular processing of a protein. For folding and stability, SNPeffect evaluates the
change in free energy upon mutation as calculated by FoldX (Guerois et al., 2002), and
the change in protein aggregation and amyloidosis as predicted by TANGO
(Fernandez-Escamilla et al., 2004) and AmyScan (Lopez and Serrano, 2004; Lopez et al.,
2002). For active and catalytic sites, SNPeffect uses the Catalytic Site Atlas (CSA)
(Porter et al., 2004), a database that documents and visualizes enzyme active sites and
catalytic residues in enzymes for which three-dimensional structures are available.
PA-SUB (Lu et al., 2004) and PSORT II (Horton and Nakai, 1997) are used in SNPeffect to
predict the cellular localization of proteins carrying nsSNPs.
1.4.16 SAAP
SAAP (Single Amino Acid Polymorphisms) (http://www.bioinf.org.uk/saap/db/; Cavallo and
Martin, 2005) first collects relevant data from dbSNP and HGVbase and then maps the data
onto the translated regions of the gene to determine whether the mutation lies in a part
of the gene that is translated to protein. If the mutation is in the coding part (exon),
it is then checked to see whether it is non-synonymous. Data from Online Mendelian
Inheritance in Man (OMIM) as well as locus-specific mutation databases (LSMDs) are also
incorporated into SAAP. Once a mutation has been mapped to a protein sequence, and if a
PDB structure of the corresponding protein is known, the mutation is mapped onto the
protein structure and a structural analysis is performed. The structural analysis checks
whether the mutation involves hydrogen-bonding residues, introduces steric clashes, lies
at an interface, or is directly involved in binding interactions with a ligand or partner
protein. In this way, SAAP provides a completely automatic and reliable mapping of nsSNPs
onto known protein structures. SAAP is also available for local download.
1.4.17 Align-GVGD
Align-GVGD is a web-based program that combines the MSA and the biophysical
characteristics of amino acids to predict pathogenic mutations
(http://agvgd.iarc.fr/agvgd_input.php; Mathe et al., 2006; Tavtigian et al., 2006). For
each substitution, the software calculates two conservation scores based on the MSA of a
query protein: Grantham variation (GV) and Grantham deviation (GD). GV measures the
degree of biochemical variation among the amino acids found at each position in the MSA,
whereas GD reflects the biochemical distance of the mis-sense variant from the amino
acids observed at that position. Align-GVGD extends the original Grantham difference,
which is based on composition (C), polarity (P) and volume (V), to MSAs, and combines the
GV and GD scores to predict the disease-causing activity of each mis-sense substitution.
Each amino acid can be plotted on a 3D graph having C, P and V as the three axes, with a
different weight applied to each axis. The amino acids observed at a given position in
the MSA then form a cloud of points enclosed within a box whose coordinates are defined
by the minimum and maximum values of C, P and V for the observed amino acids. If the
substituted amino acid lies within the box, then GD = 0; otherwise GD is greater than
zero. Thus, Align-GVGD measures the biochemical difference between the mis-sense variant
and the amino acid variation observed at that position in the MSA.
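The box construction described above can be sketched in a few lines. The sketch assumes
that GV is the weighted length of the box diagonal and that GD is the weighted distance
from the variant to the nearest point of the box, which is one reading of the description
rather than a statement of the Align-GVGD source code. The (C, P, V) coordinates and the
axis weights below are placeholders chosen for illustration and are not Grantham's
published values.

```python
from math import sqrt

# Placeholder (C, P, V) coordinates for four residue types. These round numbers
# are illustrative only and are NOT Grantham's published property values.
PROPERTIES = {
    "A": (0.0, 8.0, 30.0),
    "S": (1.4, 9.0, 32.0),
    "L": (0.0, 5.0, 110.0),
    "W": (0.1, 5.0, 170.0),
}
# Placeholder axis weights; the real method applies a different weight per axis.
WEIGHTS = (1.0, 1.0, 1.0)


def bounding_box(observed):
    """Minimum and maximum C, P, V over the residues observed in a column."""
    points = [PROPERTIES[aa] for aa in observed]
    lower = tuple(min(p[k] for p in points) for k in range(3))
    upper = tuple(max(p[k] for p in points) for k in range(3))
    return lower, upper


def weighted_distance(u, v):
    return sqrt(sum(w * (a - b) ** 2 for w, a, b in zip(WEIGHTS, u, v)))


def grantham_variation(observed):
    """GV sketch: weighted length of the diagonal of the box."""
    lower, upper = bounding_box(observed)
    return weighted_distance(lower, upper)


def grantham_deviation(observed, variant):
    """GD sketch: weighted distance from the variant to the box (0 if inside)."""
    lower, upper = bounding_box(observed)
    point = PROPERTIES[variant]
    nearest = tuple(min(max(x, lo), hi) for x, lo, hi in zip(point, lower, upper))
    return weighted_distance(point, nearest)


column = "AS"  # residues observed at one MSA position
print(grantham_variation(column))       # spread of the observed residues
print(grantham_deviation(column, "A"))  # inside the box, so GD = 0
print(grantham_deviation(column, "L"))  # outside the box, so GD > 0
```

A conserved position (small GV) combined with a distant substitution (large GD) is the
pattern that such a scheme would flag as most likely to be deleterious.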
As already discussed, a mis-sense mutation can be classified as pathogenic
(disease-causing) or neutral (benign) on the basis of the distribution pattern of several
sequence- and structure-based features of the mutation. Classifying mis-sense mutations
as pathogenic or neutral is thus essentially a binary classification problem in a
multi-dimensional feature space, and hence it is not surprising that several well-known
methods adopt machine learning classifiers. In fact, there has been a growing trend among
computational biologists to adopt machine learning classifiers in an attempt to improve
the accuracy of their prediction methods. Excellent reviews and books describing the
underlying theory are available (McCulloch and Pitts, 1943; Haykin, 1994; Michalski and
Tecuci, 1994; Mitchell, 1997; Vapnik, 1998; Strasser and Weber, 1999; Breiman, 2001; Cruz
and Wishart, 2006). However, for the sake of continuity, I describe in this chapter some
essential aspects of machine learning classifiers, with special emphasis on the support
vector machine (SVM) used in the present work.
1.5 What are Machine Learning Classifiers?
Machine learning classifiers form a branch of artificial intelligence and incorporate a
variety of probabilistic, statistical and optimization techniques that allow the computer
first to “train” on past examples and then to use that training to classify new data or
identify new patterns in large, noisy or complex data sets. Several classifiers are
available for solving classification problems:
(a) Support Vector Machine, (b) Random Forest, (c) Neural Network, (d) Decision Tree
using recursive partitioning, (e) Conditional Inference Trees, (f) Naïve Bayesian
Classifier, (g) Bootstrap Aggregating (bagging) and (h) Ensemble of Random Forest &
Bagging.
1.5.1 Support Vector Machine (SVM)
Support Vector Machine (SVM) is a supervised machine-learning method whose mathematical
framework was first developed by Vapnik (1995). It is based on the concept of decision
hyperplanes that define decision boundaries. Its primary task is to classify objects into
two classes, and it is extensively used for classification and regression problems
(Larranaga et al., 2006; Vapnik, 1995; Yang, 2004).
To classify objects, SVM plots the given values as points (known as vectors) in space and
differentiates between members and non-members of a defined class by drawing a
maximum-margin hyperplane between them. The vectors nearest the hyperplane are the
support vectors (Figure 1.3). The objective of SVM modeling is to find the optimal
hyperplane in a multidimensional space that separates clusters of vectors into two class
labels. The point corresponding to a query object is plotted in the same space and,
depending on its position relative to the hyperplane, it is assigned one of the two class
labels. The larger the margin (the distance between the hyperplane and the nearest
training points), the better the generalization of the SVM classifier.
Additionally, SVM can deal with errors in the training dataset by adding a “soft margin”
(Cortes & Vapnik, 1995), which helps to avoid misclassification of unknown datasets. The
soft margin is introduced by allowing some data points to push their way through the
margin of the separating hyperplane without affecting the final result.

Figure 1.3: Simple representation of SVM where the decision hyperplane separates the two
given objects: stars from crosses. The black ones among both crosses and stars represent
support vectors.
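For reference, the soft margin is usually formalized by introducing slack variables. The
block below gives the standard textbook form of the soft-margin problem, not a formula
taken from this chapter. The symbols w (normal vector of the hyperplane), φ (feature map
associated with the kernel) and ξ_i (slack variables) are notation assumed here, while
x_i, y_i, b and the cost parameter C have the same meaning as in equation 1.1.

```latex
\min_{w,\,b,\,\xi}\; \frac{1}{2}\lVert w\rVert^{2} + C\sum_{i}\xi_{i}
\quad\text{subject to}\quad
y_{i}\bigl(w^{T}\phi(x_{i}) + b\bigr) \ge 1 - \xi_{i},
\qquad \xi_{i}\ge 0 .
```

A training point with ξ_i > 0 lies inside the margin or on the wrong side of the
hyperplane, and C sets how heavily such violations are penalized relative to the width of
the margin.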
For a sample vector x, prediction is based on the formula:
$$ f(x) = \operatorname{sign}\Big( \sum_{i} y_i \alpha_i K(x_i, x) + b \Big) \qquad (1.1) $$
where f(x) is the decision function, K(·,·) is the kernel function, $\alpha_i$ is the
weight assigned to the training feature vector $x_i$ and $y_i$ is the corresponding label
(+1: member, −1: non-member). In equation 1.1, b is chosen so that
$y_i \big( \sum_{j} y_j \alpha_j K(x_j, x_i) + b \big) = 1$ for any i with
$0 < \alpha_i < C$, where C is a cost parameter.
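Equation 1.1 can be evaluated directly once the support vectors, their labels and weights,
and the offset b are known. The sketch below does this for a hypothetical two-point model
with a plain dot product as the kernel. All numeric values and function names are made up
for the example and do not come from a model trained in this work.

```python
def dot_kernel(u, v):
    """Plain dot product, used here as the kernel K."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(x, support_vectors, labels, alphas, b, kernel=dot_kernel):
    """Argument of the sign function in equation 1.1."""
    return sum(
        y * alpha * kernel(sv, x)
        for sv, y, alpha in zip(support_vectors, labels, alphas)
    ) + b


def predict(x, support_vectors, labels, alphas, b, kernel=dot_kernel):
    """Class label (+1: member, -1: non-member) assigned to the query vector x."""
    return 1 if decision_value(x, support_vectors, labels, alphas, b, kernel) >= 0 else -1


# Hypothetical model: one support vector per class.
support_vectors = [(1.0, 1.0), (-1.0, -1.0)]
labels = [+1, -1]
alphas = [0.25, 0.25]
b = 0.0

print(predict((2.0, 0.5), support_vectors, labels, alphas, b))
print(predict((-3.0, 1.0), support_vectors, labels, alphas, b))
```

With these values each support vector gives a decision value of exactly +1 or −1 times
its own label, which is the condition used to fix b in equation 1.1.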
Various popular kernel functions are available; they are as follows:
i) Linear kernel:
$$ K(x, y) = x^{T} y \qquad (1.2) $$
ii) Polynomial kernel:
$$ K(x, y) = (x^{T} y)^{d} \qquad (1.3) $$
where d is the degree of the polynomial and K is the kernel function.
iii) Sigmoid kernel:
$$ K(x, y) = \tanh(\kappa\, x^{T} y + \theta) \qquad (1.4) $$
where $\kappa$ and $\theta$ are parameters called gain and threshold, respectively,
with $\kappa > 0$ and $\theta < 0$.
iv) Inhomogeneous polynomial kernel:
$$ K(x, y) = (x^{T} y + c)^{d} \qquad (1.5) $$
where d is the degree of the polynomial and c is a constant.
v) Radial Basis Function (RBF) kernel:
$$ K(x, y) = \exp\left( -\frac{\lVert x - y \rVert^{2}}{2\sigma^{2}} \right) \qquad (1.6) $$
where $\sigma$ is the width parameter and $\sigma > 0$.
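The five kernels listed above can be written as short functions. The sketch below assumes
vectors given as equal-length sequences of numbers. The default parameter values and the
function names are arbitrary choices made for the example, not recommended settings.

```python
from math import exp, tanh


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    """Equation 1.2: plain dot product."""
    return dot(x, y)


def polynomial_kernel(x, y, d=2):
    """Equation 1.3: homogeneous polynomial of degree d."""
    return dot(x, y) ** d


def sigmoid_kernel(x, y, gain=0.1, threshold=-1.0):
    """Equation 1.4: gain > 0 and threshold < 0."""
    return tanh(gain * dot(x, y) + threshold)


def inhomogeneous_polynomial_kernel(x, y, d=2, c=1.0):
    """Equation 1.5: polynomial of degree d with a constant offset c."""
    return (dot(x, y) + c) ** d


def rbf_kernel(x, y, sigma=1.0):
    """Equation 1.6: Gaussian of the squared Euclidean distance, sigma > 0."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return exp(-squared_distance / (2.0 * sigma ** 2))


u, v = (1.0, 2.0), (3.0, 4.0)
for kernel in (linear_kernel, polynomial_kernel, sigmoid_kernel,
               inhomogeneous_polynomial_kernel, rbf_kernel):
    print(kernel.__name__, kernel(u, v))
```

Note that the RBF kernel equals 1 for identical vectors and decays towards 0 as the
vectors move apart, at a rate set by sigma.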
Among the different kernels, the RBF kernel is the most commonly used in biological
problems (Hsu et al., 2009), for the following reasons:
i) The RBF kernel nonlinearly maps data points into a higher-dimensional space and
therefore, unlike the linear kernel, can handle cases where the relationship
between class labels and attributes is non-linear. A Gaussian kernel defined on a
domain of infinite cardinality produces a feature space of infinite dimension,
which makes smooth and simple estimates possible. Further, it has been shown that
the linear kernel is a special case of the RBF kernel (Keerthi and Lin, 2003).
ii) The output units of an RBF model are a simple linear transformation and can be
fully optimized using linear modeling, which allows the data to be trained more
quickly than with other kernels.
iii) The RBF kernel has a gamma parameter, which makes optimization in SVM easier than
with other kernels.
iv) The number of hyper-parameters, which influences the complexity of model
selection, is larger for the polynomial kernel than for the RBF kernel.
1.5.2 SVM Applications
Support vector machines have been successfully applied to a number of biological
problems. Unlike many other machine learning classifiers, which follow a heuristic
approach, SVMs evolved from sound theory to implementation and experiments. The SVM
solution is global and unique, and does not depend on the dimensionality of the input
space. SVMs do not control model complexity by keeping the number of features small and,
most importantly, are less prone to overfitting than other machine learning algorithms.
The effectiveness of SVM in overcoming the shortcomings of other classifiers, together
with its superior generalization performance, makes the method very promising (Vert,
2002; Noble, 2004; Yang, 2004). SVM has been shown to work well for many biological
analyses, including prediction of pathogenic mutations (Krishnan and Westhead, 2003; Bao
and Cui, 2005; Yue et al., 2005; Yue and Moult, 2006; Tian et al., 2007), prediction of
protein subcellular locations (Park and Kanehisa, 2003), classification of proteins and
their functions (Cai et al., 2003), prediction of membrane protein types (Cai et al.,
2004), classification of G-protein-coupled receptors (GPCRs) (Karchin et al., 2002),
classification of nuclear receptors (Bhasin and Raghava, 2004), protein fold recognition
(Ding and Dubchak, 2001; Shamim et al., 2007), prediction of RNA-binding proteins (Han
et al., 2004), prediction of rRNA-, RNA- and DNA-binding proteins from sequence (Cai et
al., 2003), prediction of phosphorylation sites (Kim et al., 2004), prediction of T-cell
epitopes (Bhasin and Raghava, 2004; Zhao et al., 2003), prediction of regulatory
networks (Qian et al., 2003), analysis of microarray gene expression data (Brown et al.,
2000) and protein-protein interaction prediction (Bock and Gough, 2001; Koike and Takagi,
2004; Yellaboina et al., 2007).
1.5.3 SVM Software
Many SVM software packages are available, including SVMlight (Joachims, 1999), SVMstruct
(Joachims, 1999), Sequential Minimal Optimization (SMO) (Platt, 1999) and LIBSVM (Library
for Support Vector Machines) (Chang and Lin, 2001). However, some of these packages are
either quite complicated or not suitable for large problems. SVMlight allows the user to
choose the size n of the working set, ranging from 2 to 100, so that an optimization
sub-problem has to be solved in each iteration step, which makes the task complicated.
Sequential Minimal Optimization (SMO), proposed by Platt (1998), uses a two-loop
heuristic to select the working set and restricts its size n to 2. Therefore, no such
optimization is needed in SMO. However, SMO still has the limitation that it might not
work for very large problems (Keerthi et al., 2001). LIBSVM was proposed to combine the
two approaches: it restricts the working set size to 2, as in SMO, and adopts the
strategy followed in SVMlight. Chang and Lin (2001) reported better performance and
stability of LIBSVM on benchmark datasets as compared to the other two SVM packages.
LIBSVM is available at http://www.csie.ntu.tw/~cjlin/libsvm and can be downloaded to the
local system as the libsvm-2.88 package and run. LIBSVM offers many features and lets
users easily incorporate its functions into their own programs. Different formulations
are implemented in LIBSVM, viz. C-support vector classification (C-SVC), ν-support
vector classification (ν-SVC), one-class SVM, ε-support vector regression (ε-SVR) and
ν-support vector regression (ν-SVR). In addition to the class label, LIBSVM also provides
decision/probability values to the user. LIBSVM has the additional advantages of
providing cross-validation for model selection as well as weighted SVM for unbalanced
data. For these reasons, I have used LIBSVM as one of the software packages in this
thesis.
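LIBSVM reads its training and test data from a sparse text format in which each line
holds a class label followed by index:value pairs with 1-based feature indices. The
sketch below writes feature vectors in that format. The helper names, the output file
name and the feature values are hypothetical and are not taken from the datasets used in
this thesis.

```python
def to_libsvm_line(label, features):
    """Format one example as '<label> <index>:<value> ...' (1-based indices).

    Zero-valued features are omitted, as the sparse format allows.
    """
    parts = [f"{label:+d}"]
    for index, value in enumerate(features, start=1):
        if value != 0:
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


def write_libsvm_file(path, labels, rows):
    """Write all examples to a text file, one example per line."""
    with open(path, "w") as handle:
        for label, features in zip(labels, rows):
            handle.write(to_libsvm_line(label, features) + "\n")


# Hypothetical examples: +1 = pathogenic, -1 = neutral.
labels = [+1, -1]
rows = [[0.5, 0.0, 2.0], [0.0, 0.0, 1.25]]
for label, features in zip(labels, rows):
    print(to_libsvm_line(label, features))
```

A file written this way can then be passed to the svm-train program distributed with
LIBSVM, where the -t option selects the kernel type and the -v option requests n-fold
cross-validation.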
1.6 R package
As a part of the GNU project, the freely available R package is widely used for
statistical computing and graphics. The R package mechanism is highly flexible:
developers can submit packages for specific functions or interests, which are then
readily available to users. R also produces graphs of excellent quality. R is available
at www.r-project.org and can be downloaded to the local system and run. R provides a
number of tools for installing and using packages; for example, on Linux, R CMD INSTALL
installs a user-selected R package. I have used different machine learning classifiers
available as R packages. For SVM prediction of pathogenic mutations, I have also used
the SVM implementation in R provided by the e1071 package.
1.7 The Scope and Aim of the Present Study
The review presented in this chapter highlights the importance of nsSNPs in altering
protein function and their disease-causing status. Several methods are available to
classify human nsSNPs/mutations as benign or pathogenic. These methods use attributes
ranging from sequence-based and evolutionary features to combinations of structural and
evolutionary information, and employ a variety of machine learning techniques including
linear logistic regression, decision trees, random forest, neural networks, neuro-fuzzy
classifiers, Bayesian classifiers, ridge partial least squares and support vector
machines. Despite their availability, it is still considered highly important to develop
new methods with higher prediction accuracy. This thesis presents the details of our
investigations aimed at developing a new method with higher accuracy than the existing
methods. As will be seen in the following chapters, I have identified new discriminatory
features and used them in an SVM-based discrimination of mutations as disease-causing or
benign. The newly developed SVM-based method outperforms the other methods in prediction
accuracy.
1.8 Summary
In this chapter, a brief review of nsSNPs has been presented, including a comprehensive
account of the organization of the databases and the different methods used for the
discrimination of nsSNPs. An overview of the existing methods for predicting the
pathogenic effect of mis-sense mutations was also presented. The chapter also briefly
describes SVM, one of the most extensively used machine learning classifiers, which is
used here for the prediction of pathogenic mutations. The next chapter describes the
investigation of sequence-structure features and their distributions, for use in the SVM
classifier for the prediction of pathogenic mutations.