comparison of variants of blast

COMPARISON OF VARIANTS OF BLAST (Basic Local Alignment Search Tool)

A Thesis

Submitted in partial fulfillment of the requirements for the award of the degree of

Master of Engineering in

Software Engineering

Under the Supervision of Ms. Inderveer Chana

Senior Lecturer Computer Science and Engineering Department

Batch 2003-2005

Submitted By Harpreet Kaur

(8033107)

Computer Science & Engineering Department

Thapar Institute of Engineering & Technology

(Deemed University), Patiala-147004 (India).

May 2005

ABSTRACT

Large quantities of gene sequences from related species of plants, animals, and microorganisms are now available. These sequences show complex patterns of similarity to one another, and many molecular biologists are convinced that understanding sequence evolution is the first step towards understanding evolution itself. This is one of the most fascinating aspects of the study of evolution. The comparison of gene sequences, or biological sequence analysis, is therefore one of the processes used to understand sequence evolution. Just as the ancient Greeks used comparative anatomy to understand the human body and linguists used the Rosetta Stone to decipher Egyptian hieroglyphs, today we can use comparative sequence analysis to understand genomes. A variety of tools is available for sequence analysis, including several DNA sequence alignment tools. Software packages of automated tools have improved the efficiency of much biological research. Fast, economical, flexible, and extensible computing power is increasingly attractive to scientists in many areas of research, including biology.

More generally, the open source movement has greatly benefited biological research. The combination of data availability and free software is revolutionizing this field. BLAST is an efficient tool used for biological sequence searches. Variants of BLAST have been developed to overcome the limitations of the main BLAST tool. I studied the variants of BLAST (BLASTN, BLASTP, BLASTX, TBLASTN, TBLASTX, PSI-BLAST). Each variant has advantages and disadvantages relative to the others. The tools work according to different parameters, and these parameters affect the performance of the algorithm. I analyzed these variants and compared them on the basis of their algorithms, parameters, and performance. The thesis describes the conditions under which each variant is more advantageous and the circumstances in which the different versions should be used, and discusses how they can be improved by eliminating their deficiencies and adding new features.

DECLARATION

I hereby certify that the work which is being presented in the thesis entitled “Comparison of Variants of BLAST (Basic Local Alignment Search Tool)”, in partial fulfillment of the requirements for the award of the degree of Master of Engineering in Software Engineering at the Computer Science and Engineering Department of Thapar Institute of Engineering and Technology (Deemed University), Patiala, is an authentic record of my own work carried out under the supervision of Ms. Inderveer Chana.

The matter presented in this thesis has not been submitted by me for the award of any other degree of this or any other University.

Harpreet Kaur

This is to certify that the above statement made by the candidate is correct and true to

the best of my knowledge.

Ms. Inderveer Chana

Senior Lecturer

Computer Science and Engineering Department

Thapar Institute of Engineering and Technology

PATIALA-147004

Countersigned by

Mr. R. S. Salaria

Head

Computer Science and Engineering Department

Thapar Institute of Engineering and Technology

PATIALA-147004

Dr. D. S. Bawa

Dean of Academic Affairs

Thapar Institute of Engineering and Technology

PATIALA-147004

ACKNOWLEDGEMENT

I wish to express my deep gratitude to Ms. Inderveer Chana, Senior Lecturer, Computer Science and Engineering Department, for her invaluable guidance and support throughout the thesis work.

I am also thankful to Mr. R. S. Salaria, Head, Computer Science and Engineering Department, and Mr. Rajesh Bhatia, P.G. Coordinator, for their excellent guidance and encouragement right from the beginning of this course.

I would also like to thank all the staff members and my fellow students, who were always there in the hour of need and provided all the help and facilities that I required for the completion of the thesis.

I wish to express my indebtedness to my parents, who have been a constant source of love and encouragement.

Finally, I would like to thank God for not letting me down at the time of crisis and showing me the silver lining in the dark clouds.

Harpreet Kaur

TABLE OF CONTENTS

Abstract
Declaration
Acknowledgement
List of Figures
List of Tables
Organization of Thesis

CHAPTER 1 DATA MINING
1.1 DATA MINING
1.2 WHY DATA MINING
1.3 STEPS OF KDD PROCESS
1.4 WHAT KIND OF DATA CAN BE MINED?
1.4.1 Relational Databases
1.4.2 Data Warehouses
1.4.3 Transactional Databases
1.4.4 Multimedia Databases
1.4.5 Spatial Databases
1.4.6 World Wide Web
1.4.7 Advanced DB and Information Repositories
1.5 ARCHITECTURE FOR DATA MINING SYSTEM
1.5.1 Database, Data Warehouse, or Other Information Repository
1.5.2 Database or Data Warehouse Server
1.5.3 Knowledge Base
1.5.4 Data Mining Engine
1.5.5 Pattern Evaluation Module
1.5.6 Graphical User Interface
1.6 DATA MINING APPLICATIONS
1.7 THE SCOPE OF DATA MINING

CHAPTER 2 BIOINFORMATICS
2.1 WHY BIOINFORMATICS
2.2 BIOINFORMATICS
2.3 AIMS OF BIOINFORMATICS
2.4 STEPS OF KDD FOR BIOINFORMATICS
2.5 WHAT KIND OF DATA CAN BE MINED?
2.5.1 DNA
2.5.2 RNA
2.5.3 PROTEIN
2.6 DATA MINING TECHNIQUES IN BIOINFORMATICS
2.6.1 Clustering
2.6.2 Classification
2.6.3 Association
2.7 THE CENTRAL DOGMA
2.7.1 Transcription
2.7.2 The Genetic Code
2.8 NEED OF DATA MINING IN BIOINFORMATICS
2.9 BIOINFORMATICS AND ITS SCOPE
2.10 APPLICATIONS OF BIOINFORMATICS

CHAPTER 3 INTRODUCTION TO BLAST
3.1 INTRODUCTION
3.2 DATABASES AVAILABLE FOR BLAST SEARCH INCLUDE
3.2.1 Protein Sequence Databases
3.2.2 Nucleotide Sequence Databases
3.3 BLAST ALGORITHM
3.4 BLAST PARAMETERS
3.5 FEATURES OF BLAST
3.5.1 Heuristic
3.5.2 Substitution Matrix
3.5.3 Local Alignments
3.5.4 Ungapped Alignments
3.5.5 Explicit Statistical Theory
3.5.6 Rapid
3.5.7 Sequence Input
3.5.8 Results Format
3.5.9 BLAST Output

CHAPTER 4 VARIANTS OF BLAST
4.1 BLAST VARIANTS
4.2 PSI-BLAST
4.3 BLASTN
4.4 BLASTX
4.5 BLASTP
4.5.1 BLASTP PARAMETERS
4.6 TBLASTN
4.7 TBLASTX
4.7.1 Limitations of TBlastX

CHAPTER 5 COMPARISON OF VARIANTS OF BLAST
5.1 INTRODUCTION
5.1.1 Comparison On The Basis Of Parameters
5.2 COMPARISON ON THE BASIS OF ALGORITHM
5.2.1 The Two-Hit Algorithm Isn't Used In BLASTN, Because Word Hits Are Generally Rare With Large Identical Words
5.2.2 Extension in BlastN is different from BlastP and other protein based programs
5.3 COMPARISON ON THE BASIS OF PERFORMANCE
5.3.1 Comparison On The Basis of Varying Expect Values
5.3.2 Comparison On The Basis of Word Size
5.3.3 Comparison on the Basis of Execution Time

CHAPTER 6 CONCLUSION AND FUTURE SCOPE
6.1 CONCLUSION
6.2 FUTURE SCOPE

REFERENCES
LIST OF PUBLICATIONS
GLOSSARY

LIST OF FIGURES

Figure 1.1 The Process of Knowledge Discovery
Figure 1.2 Architecture Of Typical Data Mining
Figure 2.1 The KDD Process For Bioinformatics
Figure 2.2 DNA Molecule
Figure 2.3 Protein Molecule
Figure 3.1 Protein Database
Figure 3.2 Nucleotide Database
Figure 3.3 List of Words From Query Sequence
Figure 3.4 Exact Matches of Words From Word List
Figure 3.5 Maximal Segment Pairs
Figure 3.6 Figure Shows The Word Size Option
Figure 4.1 Blast Variants
Figure 4.2 Blast Variants
Figure 4.3 PSI-Blast
Figure 4.4 PSI-Blast Step1
Figure 4.5 PSI-Blast Step2
Figure 4.6 PSI-Blast Output
Figure 4.7 PSI-Blast Output
Figure 4.8 PSI Blast
Figure 4.9 BlastN
Figure 4.10 Using BlastN for Comparison
Figure 4.11 BlastN Results
Figure 4.12 BlastX
Figure 4.13 Using BlastX for Comparison
Figure 4.14 BlastX Results
Figure 4.15 BlastP
Figure 4.16 Using BlastP for Comparison
Figure 5.1 Conserved Domain Search For Blastn And Blastp
Figure 5.2 Different Word size for BlastN and BlastP
Figure 5.3 Empirically Estimated Probability That An HSP Is Missed By This Method, As a Function of Its Normalized Score
Figure 5.4 Speeds of The One-Hit And Two-Hit Methods
Figure 5.5 Comparison - Varying Expect Values
Figure 5.6 Comparison - Varying Expect Values
Figure 5.7 Varying Expect Values For Blastn
Figure 5.8 Varying Expect Values Blastn
Figure 5.9 Varying Expect Values For Variants
Figure 5.10 Varying Expect Values For Variants
Figure 5.11 Compares The Performance of BLAST Compiled With 32-Bit And 64-Bit Processor

LIST OF TABLES

Table 2.1 The 20 Amino Acids and their official codes
Table 4.1 Programs Available For Blast
Table 5.1 No of hits for varying expect values
Table 5.2 No of hits for varying expect values BlastN
Table 5.3 No of Hits For Varying Word Size
Table 5.4 Varying Execution Time

ORGANIZATION OF THESIS

The thesis entitled “Comparison of Variants of BLAST (Basic Local Alignment Search Tool)” is concerned with the comparison of variants of BLAST. All tools are compared according to defined criteria.

The first chapter briefly introduces data mining technology and the techniques used in data mining. The process of knowledge discovery in databases is also discussed.

The second chapter covers the field of bioinformatics, the need for bioinformatics, and the kinds of data to which bioinformatics is applied.

The third chapter explains the biological tool BLAST, which is used for sequence similarity searching, along with the BLAST algorithm and the features of BLAST.

The fourth chapter explores the variants of BLAST (BlastN, BlastX, BlastP, TBlastN, TBlastX, PSI-Blast), covering the algorithm, parameters, and performance criteria of each tool.

In the fifth chapter, the variants of BLAST are compared on the basis of parameters, algorithms, and performance. Deficiencies in the parameters and possible improvements are also highlighted.

CHAPTER 1 DATA MINING

1.1 DATA MINING

Data mining is the extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases. It is the process of selecting, exploring, and modeling large amounts of data to uncover previously unknown patterns or information for a business advantage [4]. Data mining can be viewed as an analytical process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. Many terms carry a similar or slightly different meaning to data mining, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging. It is a young interdisciplinary field, drawing on areas such as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics.

1.2 WHY DATA MINING

“We are drowning in data, but starving for knowledge!”

Necessity is the mother of invention. Automated data collection tools and mature database technology have led to tremendous amounts of data being stored in databases, data warehouses, and other information repositories. Every day the world creates 52,000 terabytes of data. Only 4% of the data is used for any purpose. The thought arose that something useful could be done with this data, and with it the field of data mining was born. Database technology began with the development of data collection and database creation mechanisms, which led to the development of effective mechanisms for data management, including data storage and retrieval, and query and transaction processing. The large number of database systems offering query and transaction processing eventually and naturally led to the need for data analysis and understanding. Data mining began its development out of this necessity.

1.3 STEPS OF KDD PROCESS

Knowledge discovery is defined as “the non-trivial extraction of implicit, unknown, and potentially useful information from data”. The knowledge discovery process takes the raw results from data mining (the process of extracting trends or patterns from data) and carefully and accurately transforms them into useful and understandable information [6].

The overall process of finding and interpreting patterns from data involves the

repeated application of the following steps:

1. Developing an understanding of
   o the application domain
   o the relevant prior knowledge
   o the goals of the end-user

2. Creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

3. Data cleaning and preprocessing.
   o Removal of noise or outliers.
   o Collecting necessary information to model or account for noise.
   o Strategies for handling missing data fields.
   o Accounting for time sequence information and known changes.

4. Data reduction and projection.
   o Finding useful features to represent the data depending on the goal of the task.
   o Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.

5. Choosing the data mining task.
   o Deciding whether the goal of the KDD process is classification, regression, clustering, etc.

6. Choosing the data mining algorithm(s).
   o Selecting method(s) to be used for searching for patterns in the data.
   o Deciding which models and parameters may be appropriate.
   o Matching a particular data mining method with the overall criteria of the KDD process.

7. Data mining.
   o Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth.

8. Interpreting mined patterns.

9. Consolidating discovered knowledge.

Figure 1.1 The process of Knowledge Discovery [22]

The terms knowledge discovery and data mining are distinct. KDD refers to the

overall process of discovering useful knowledge from data. It involves the evaluation

and possibly interpretation of the patterns to make the decision of what qualifies as

knowledge. It also includes the choice of encoding schemes, preprocessing, sampling,

and projections of the data prior to the data mining step.

Data mining refers to the application of algorithms for extracting patterns from data

without the additional steps of the KDD process.
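As an illustration of how the KDD steps listed in this section chain together, the following is a minimal sketch rather than a prescribed implementation. The record fields (`age`, `spend`), the z-score cut-off, and all function names are hypothetical choices made only for this example.

```python
from statistics import mean, pstdev

def select_target(records, fields):
    """Step 2: keep only the variables on which discovery is performed."""
    return [{f: r.get(f) for f in fields} for r in records]

def clean(records, field, z_max):
    """Step 3: drop records with a missing value, then drop outliers."""
    present = [r for r in records if r[field] is not None]
    values = [r[field] for r in present]
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return present
    return [r for r in present if abs(r[field] - mu) / sigma <= z_max]

def reduce_features(records):
    """Step 4: replace the exact age by a coarser feature (its decade)."""
    return [{"band": (r["age"] // 10) * 10, "spend": r["spend"]} for r in records]

def mine(records):
    """Step 7: a deliberately simple pattern, the mean spend per age band."""
    groups = {}
    for r in records:
        groups.setdefault(r["band"], []).append(r["spend"])
    return {band: mean(values) for band, values in groups.items()}

# Hypothetical raw data: one missing value and one obvious outlier.
raw = [
    {"name": "a", "age": 23, "spend": 40},
    {"name": "b", "age": 27, "spend": 60},
    {"name": "c", "age": 34, "spend": None},
    {"name": "d", "age": 38, "spend": 100},
    {"name": "e", "age": 31, "spend": 120},
    {"name": "f", "age": 25, "spend": 5000},
]

target = select_target(raw, ["age", "spend"])
cleaned = clean(target, "spend", z_max=1.5)
patterns = mine(reduce_features(cleaned))
print(patterns)
```

Interpreting the result (step 8) remains a human task: here the analyst would decide whether the difference in mean spend between age bands qualifies as knowledge.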

1.4 WHAT KIND OF DATA CAN BE MINED?

Data mining is not specific to one type of media or data. Data mining should be

applicable to any kind of information repository. However, algorithms and approaches

may differ when applied to different types of data. Data mining is being put into use

and studied for databases, including relational databases, object-relational databases

and object-oriented databases, data warehouses, transactional databases, unstructured

and semi-structured repositories such as the World Wide Web, advanced databases

such as spatial databases, multimedia databases, time-series databases and textual

databases, and even flat files.

1.4.1 Relational Databases

A relational database consists of a set of tables containing either values of entity

attributes, or values of attributes from entity relationships. Tables have columns and

rows, where columns represent attributes and rows represent tuples. A tuple in a

relational table corresponds to either an object or a relationship between objects and is

identified by a set of attribute values representing a unique key. The most commonly

used query language for relational databases is SQL, which allows retrieval and

manipulation of the data stored in the tables, as well as the calculation of aggregate

functions such as average, sum, min, max and count.
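The aggregate functions named above can be tried out with Python's built-in `sqlite3` module. This is only a small sketch; the `employee` table, its columns, and the sample rows are invented for illustration.

```python
import sqlite3

# An in-memory relational database with a single hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, salary REAL)"
)
conn.executemany(
    "INSERT INTO employee (name, salary) VALUES (?, ?)",
    [("Asha", 30000.0), ("Ravi", 45000.0), ("Meena", 60000.0)],
)

# Columns are attributes, rows are tuples; SQL computes the aggregates.
row = conn.execute(
    "SELECT AVG(salary), SUM(salary), MIN(salary), MAX(salary), COUNT(*) "
    "FROM employee"
).fetchone()
conn.close()

print(row)
```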

1.4.2 Data Warehouses

A data warehouse is a repository of information collected from multiple sources, stored under a unified schema, and usually residing at a single site. Data warehouses are constructed via a process of data cleaning, data transformation, data integration, data loading, and periodic data refreshing.

1.4.3 Transactional Databases

A transactional database consists of a file where each record represents a transaction.

A transaction typically includes a unique transaction identity number and a list of

items making up the transaction.
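A transactional database of this kind can be pictured as a mapping from transaction identifiers to item lists. The sketch below uses made-up identifiers and items, and counts how often each item occurs, which is typically the first pass of many mining methods over such data.

```python
from collections import Counter

# Hypothetical transactions: unique identifier -> items in the transaction.
transactions = {
    "T100": ["bread", "milk"],
    "T200": ["bread", "butter", "milk"],
    "T300": ["butter", "jam"],
}

# Count in how many transactions each item appears.
item_counts = Counter(
    item for items in transactions.values() for item in set(items)
)

print(item_counts.most_common())
```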

1.4.4 Multimedia Databases

Multimedia databases include video, images, audio and text media. They can be

stored on extended object-relational or object-oriented databases, or simply on a file

system. Multimedia is characterized by its high dimensionality, which makes data

mining even more challenging. Data mining from multimedia repositories may require

computer vision, computer graphics, image interpretation, and natural language

processing methodologies.

1.4.5 Spatial Databases

Spatial databases are databases that, in addition to usual data, store geographical

information like maps, and global or regional positioning. Such spatial databases

present new challenges to data mining algorithms.

1.4.6 World Wide Web

The World Wide Web is the most heterogeneous and dynamic repository available. A

very large number of authors and publishers are continuously contributing to its

growth and metamorphosis, and a massive number of users are accessing its resources

daily. Data in the World Wide Web is organized in inter-connected documents. These

documents can be text, audio, video, raw data, and even applications. Conceptually,

the World Wide Web is comprised of three major components: The content of the

Web, which encompasses documents available; the structure of the Web, which

covers the hyperlinks and the relationships between documents; and the usage of the

web, describing how and when the resources are accessed.

1.4.7 Advanced DB and Information Repositories

• Object-oriented databases

Object-oriented databases are based on the object-oriented programming paradigm, where each entity is considered as an object. Each object has associated with it a set of variables, a set of messages, and a set of methods. Objects that share a common set of properties can be grouped into an object class. Each object is an instance of its class. For example, an employee object can contain variables like name, address, and birth date.

• Object-relational databases

The object-relational model extends the basic relational data model by adding the power to handle complex data types, class hierarchies, and object inheritance. These are becoming more popular in industry and applications.

• Spatial databases

Spatial databases include spatially related information. Such databases include geographical databases, VLSI chip design databases, and medical and satellite image databases.

• Temporal databases and time-series databases

A temporal database usually stores relational data that include time-related attributes. A time-series database stores sequences of values that change with time, such as data collected regarding the stock exchange.

• Legacy databases

A legacy database is a group of heterogeneous databases that combines different kinds of data systems, such as relational or object-oriented databases, spreadsheets, or file systems.

1.5 ARCHITECTURE FOR DATA MINING SYSTEM

The architecture of a typical data mining system has the following components [11]:

1.5.1 Database, Data Warehouse, or Other Information Repository

This is one or a set of databases, data warehouses, spreadsheets, or other kinds of information repositories. Data cleaning and data integration techniques may be performed on the data.

1.5.2 Database or Data Warehouse Server

The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request [14].

A data warehouse is a repository for long-term storage of data from multiple sources, organized so as to facilitate management decision making. The data are stored under a unified schema and are typically summarized. Data warehouse systems provide some data analysis capabilities, collectively referred to as OLAP (On-Line Analytical Processing).

1.5.3 Knowledge Base

This is the domain knowledge that is used to guide the search or evaluate the interestingness of resulting patterns. Knowledge such as user beliefs, which can be used to assess a pattern's interestingness based on its unexpectedness, may be included. Other examples of domain knowledge are additional interestingness constraints or thresholds, and metadata.

Figure 1.2 Architecture of a typical data mining system

1.5.4 Data Mining Engine

This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.

1.5.5 Pattern Evaluation Module

This component typically employs interestingness measures and interacts with the data mining modules so as to focus the search towards interesting patterns. It may use interestingness thresholds to filter out discovered patterns.
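A threshold-based filter of this kind might look as follows. The choice of support and confidence as the interestingness measures, the `Rule` type, and the threshold values are assumptions made for this sketch; the text does not prescribe particular measures.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    """A discovered pattern of the form antecedent -> consequent."""
    antecedent: str
    consequent: str
    support: float      # fraction of records containing both sides
    confidence: float   # fraction of antecedent records that also match

def filter_interesting(rules, min_support, min_confidence):
    """Keep only the rules that meet both interestingness thresholds."""
    return [
        r for r in rules
        if r.support >= min_support and r.confidence >= min_confidence
    ]

# Hypothetical output of a data mining engine.
rules = [
    Rule("bread", "milk", 0.40, 0.80),
    Rule("jam", "butter", 0.05, 0.90),   # too rare
    Rule("milk", "bread", 0.40, 0.55),   # too unreliable
]

kept = filter_interesting(rules, min_support=0.10, min_confidence=0.60)
print(kept)
```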

1.5.6 Graphical User Interface

This module communicates between users and the data mining system, allowing users to interact with the system by specifying a data mining query or task, providing information to help focus the search, and performing exploratory data mining based on intermediate data mining results.

1.6 DATA MINING APPLICATIONS

• The Google system uses a mathematical algorithm called PageRank to estimate the relative importance of individual web pages based on link patterns [19].

• Financial institutions have reduced incidents of credit-card fraud through the

application of neural networks, which feature circuits arranged in a brain-like

configuration that can infer patterns from data.

• The medical sector is also taking advantage of data-mining: One application

involves a collaboration between IBM and the Mayo Clinic to detect patterns

in medical records, while another project uses natural-language processing to

map out the "grammar" of amino acid sequences and match them to specific

protein shapes and functions.

• Government organizations such as the Defense Department and the National Security Agency are using AI technology for several efforts related to national security, such as the Echelon telecom monitoring system. The Defense Advanced Research Projects Agency (DARPA) is a leading AI research investor, and the breakthroughs that come out of DARPA-funded projects are more often than not put to civilian rather than military use.

• Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis.

• Investment: Numerous companies use data mining for investment, but most

do not describe their systems. One exception is LBS Capital Management. Its

system uses expert systems, neural nets, and genetic algorithms to manage

portfolios totaling $600 million; since its start in 1993, the system has

outperformed the broad stock market (Hall, Mani, and Barr 1996).

• Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money laundering activity.

• Telecommunications: The telecommunications alarm-sequence analyzer

(TASA) was built in cooperation with a manufacturer of telecommunications

equipment and three telephone networks (Mannila, Toivonen, and Verkamo

1995). The system uses a novel framework for locating frequently occurring

alarm episodes from the alarm stream and presenting them as rules. Large sets

of discovered rules can be explored with flexible information-retrieval tools

supporting interactivity and iteration. In this way, TASA offers pruning,

grouping, and ordering tools to refine the results of a basic brute-force search

for rules.

1.7 THE SCOPE OF DATA MINING

Data mining derives its name from the similarities between searching for valuable

business information in a large database — for example, finding linked products in

gigabytes of store scanner data — and mining a mountain for a vein of valuable ore.

Both processes require either sifting through an immense amount of material, or

intelligently probing it to find exactly where the value resides. Given databases of

sufficient size and quality, data mining technology can generate new business

opportunities by providing these capabilities [21]:

• Automated prediction of trends and behaviors. Data mining automates the

process of finding predictive information in large databases. Questions that

traditionally required extensive hands-on analysis can now be answered

directly from the data — quickly. A typical example of a predictive problem is

targeted marketing. Data mining uses data on past promotional mailings to

identify the targets most likely to maximize return on investment in future

mailings. Other predictive problems include forecasting bankruptcy and other

forms of default, and identifying segments of a population likely to respond

similarly to given events.

• Automated discovery of previously unknown patterns. Data mining tools

sweep through databases and identify previously hidden patterns in one step.

An example of pattern discovery is the analysis of retail sales data to identify

seemingly unrelated products that are often purchased together. Other pattern

discovery problems include detecting fraudulent credit card transactions and

identifying anomalous data that could represent data entry keying errors.
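The pattern-discovery example above (products that are often purchased together) can be illustrated with a minimal sketch. The basket contents, the function name `frequent_pairs` and the `min_support` parameter are assumptions made for illustration; they do not come from any particular data mining product.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Count each unordered pair at most once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical scanner data: each inner list is one customer's purchase.
baskets = [
    ["bread", "milk", "diapers"],
    ["beer", "diapers", "bread"],
    ["beer", "diapers", "milk"],
    ["bread", "milk"],
]
pairs = frequent_pairs(baskets, min_support=2)
for pair, n in sorted(pairs.items()):
    print(pair, n)
```

Counting every pair is only practical for small item sets; real market-basket tools prune the search, but the idea of keeping pairs above a support threshold is the same.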

Data mining techniques can yield the benefits of automation on existing software and

hardware platforms, and can be implemented on new systems as existing platforms

are upgraded and new products developed. When data mining tools are implemented

on high performance parallel processing systems, they can analyze massive databases

in minutes. Faster processing means that users can automatically experiment with

more models to understand complex data. High speed makes it practical for users to

analyze huge quantities of data. Larger databases, in turn, yield improved predictions.

CHAPTER 2 BIOINFORMATICS

2.1 WHY BIOINFORMATICS

The information for the set-up of living organisms is stored in the sequences of

nucleotides in DNA. DNA serves two purposes: to provide the information during the

life cycle of a cell and to pass it on to offspring. The discovery of genes and the

genetic code triggered the hope to be able to read the information stored in our genes,

and today we are able to do so: massive progress in sequencing technology has

delivered entire genomes to the tips of our fingers. The era of genomics and

proteomics has opened up the opportunity to go beyond the analysis of single genes

and proteins, towards understanding the interactions between all components of

genomes and proteomes. From trying to comprehend life by cutting it into smaller and

smaller pieces, we are beginning to unveil it in the way it has been functioning

since its beginning: as a whole.

Computer scientists are important allies for biologists in the struggle to understand

the information in DNA. On the one hand, the massive amount of sequence data requires

new tools (computers and programs) to generate, proof, store, and access these data.

On the other hand, the deciphering of genomes necessitates the development of new

hardware and software that can detect genes, determine relationships between them,

and study their expression, in order to understand the basis of development and disease.

Bioinformatics provides the tools to understand the information in biological data.

2.2 BIOINFORMATICS

Bioinformatics has evolved into a full-fledged multidisciplinary subject that

integrates developments in Information and Computer Technology as applied to

Biotechnology and Biological Sciences. Bioinformatics uses Computer software tools

for database creation, data management, data warehousing, data mining and global

communication networking. Bioinformatics is the recording, annotation, storage,

analysis, and searching/retrieval of nucleic acid sequence (genes and RNAs), protein

sequence and structural information [2]. This includes databases of the sequences and

structural information as well as methods to access, search, visualize and retrieve the

information. Bioinformatics concerns the creation and maintenance of databases of

biological information whereby researchers can both access existing information and

submit new entries. Bioinformatics includes sequence analysis (used by geneticists,

cell biologists and molecular biologists), molecular modeling (used by crystallographers,

cell biologists and biochemists), molecular phylogeny and evolution, ecology and population

studies, and medical informatics. The most pressing tasks in bioinformatics involve the

analysis of sequence information.

Computational Biology is the name given to this process, and it involves the following:

• Finding the genes in the DNA sequences of various organisms

• Developing methods to predict the structure and/or function of newly

discovered proteins and structural RNA sequences.

• Clustering protein sequences into families of related sequences and the

development of protein models.

• Aligning similar proteins and generating phylogenetic trees to examine

evolutionary relationships.

2.3 AIMS OF BIOINFORMATICS

The aims of bioinformatics are basically three-fold. They are

• Organization of data in such a way that it allows researchers to access existing

information and to submit new entries as they are produced. While data creation

is an essential task, the information stored in these databases is useless unless

analyzed. Thus the purpose of bioinformatics extends well beyond mere

volume control.

• Development of tools and resources that help in the analysis of data. For example,

having sequenced a particular protein, it is of interest to compare it with previously

characterized sequences. This requires more than just a straightforward database search. As

such, programs such as FASTA and PSI-BLAST must consider what

constitutes a biologically significant resemblance. Development of such

resources requires extensive knowledge of computational theory, as well as a thorough

understanding of biology.

• Use of these tools to analyze individual systems in detail and to

compare them with a few that are related.

2.4 STEPS OF KDD FOR BIOINFORMATICS

The steps of KDD for bioinformatics involve the same steps as were performed during

the KDD in simple databases. The only difference is the data on which the data

mining is performed. Here the data is biomolecular data instead of simple databases

[2]. It may involve DNA sequences, RNA sequences. KDD for bioinformatics is

shown in figure 2.1.

2.5 WHAT KIND OF DATA CAN BE MINED?

KDD for bioinformatics can be applied to biomolecular data. Biomolecular data consists of the following types:

• DNA (deoxyribonucleic acid)

• RNA (ribonucleic acid)

• Protein sequences (2D and 3D structures)

2.5.1 DNA

In most living organisms (except for viruses), genetic information is stored in the

molecule deoxyribonucleic acid, or DNA. DNA is made and resides in the nucleus of

living cells. DNA gets its name from the sugar molecule contained in its backbone

(deoxyribose); however, it gets its significance from its unique structure. There are

four different nucleotide bases that occur in DNA:

A - adenine

T - thymine

C - cytosine

G - guanine

Figure 2.1 The KDD for Bioinformatics

The versatility of DNA comes from the fact that the molecule is actually double-stranded.

The nucleotide bases of the DNA molecule form complementary pairs: each

nucleotide base forms hydrogen bonds with a base on the opposite strand of DNA.

This bonding is specific: adenine always bonds to thymine (and

vice versa) and guanine always bonds to cytosine (and vice versa). This bonding

occurs across the molecule, leading to a double-stranded system as shown in Figure 2.2.

Figure 2.2 DNA Molecule

The fundamental chemical building block of deoxyribonucleic acid (DNA) is the

nucleotide. A nucleotide consists of three parts:

(1) a nitrogen-containing pyrimidine or purine base,

(2) a deoxyribose sugar, and

(3) a phosphate group that acts as a bridge between adjacent deoxyribose sugars.

The double-stranded DNA molecule has the unique ability that it can make exact

copies of itself, or self-replicate. When more DNA is required by an organism (such

as during reproduction or cell growth) the hydrogen bonds between the nucleotide

bases break and the two single strands of DNA separate. New complementary bases

are brought in by the cell and paired up with each of the two separate strands thus

forming two new identical, double-stranded DNA molecules.
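The base-pairing rule described in this section (A with T, G with C) can be written as a small lookup. This is a minimal sketch; the function names are illustrative. Because sequences are conventionally written in the 5' to 3' direction, the opposite strand read in that same convention is the reverse complement.

```python
# Watson-Crick pairing for DNA: A pairs with T, G pairs with C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand):
    """Return the opposite strand, read in the same 5'-to-3' convention."""
    return complement(strand)[::-1]


print(complement("ATGC"))
print(reverse_complement("ATGC"))
```

Applying `reverse_complement` twice returns the original strand, which mirrors the way each separated strand acts as a template for an identical double-stranded molecule.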

2.5.2 RNA

RNA stands for ribonucleic acid. It is a long molecule, usually single-stranded

except where it folds back on itself. RNA differs chemically from DNA by containing

ribose instead of deoxyribose and uracil (U) instead of thymine (T). So the

important differences between RNA and DNA are that

• RNA differs from DNA by one nucleotide base (U in place of T).

• RNA is usually single-stranded.

The four bases of RNA are:

A - adenine

U - uracil

C - cytosine

G - guanine

Some programs automatically handle the U-instead-of-T conversion and many do not

even distinguish between the two classes of nucleic acids. Do not be surprised if a

database entry displays RNA sequences with a T instead of a U. In fact, RNA

sequences are encoded in the DNA.
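The U-instead-of-T conversion mentioned above amounts to a one-letter substitution. The following sketch shows both directions; the function names are illustrative and do not refer to any specific program.

```python
def dna_to_rna(sequence):
    """Rewrite a DNA-alphabet sequence in the RNA alphabet (T becomes U)."""
    return sequence.upper().replace("T", "U")


def rna_to_dna(sequence):
    """Rewrite an RNA-alphabet sequence in the DNA alphabet (U becomes T)."""
    return sequence.upper().replace("U", "T")


print(dna_to_rna("ATGGCCTAA"))
print(rna_to_dna("AUGGCCUAA"))
```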

Table 2.1: The 20 Amino Acids and Their Official Codes

2.5.3 PROTEIN

A protein is a polymer built from amino acids. The most common representation

used by biologists to describe a protein is its sequence. A protein

sequence is drawn from an alphabet of 20 amino acids, each represented by a letter. These amino

acids along with their codes are shown in Table 2.1.

#    1-Letter Code    3-Letter Code    Name
1    A                Ala              Alanine
2    R                Arg              Arginine
3    N                Asn              Asparagine
4    D                Asp              Aspartic acid
5    C                Cys              Cysteine
6    Q                Gln              Glutamine
7    E                Glu              Glutamic acid
8    G                Gly              Glycine
9    H                His              Histidine
10   I                Ile              Isoleucine
11   L                Leu              Leucine
12   K                Lys              Lysine
13   M                Met              Methionine
14   F                Phe              Phenylalanine
15   P                Pro              Proline
16   S                Ser              Serine
17   T                Thr              Threonine
18   W                Trp              Tryptophan
19   Y                Tyr              Tyrosine
20   V                Val              Valine

Figure 2.3 Protein Molecule

2.6 DATA MINING TECHNIQUES IN BIOINFORMATICS

There are many data mining techniques available which can be applied to

biomolecular data. Clustering, classification and association, which are very useful

for biomolecular data, are discussed below. These techniques are able to discover

previously hidden patterns in biomolecular data [2].

2.6.1 Clustering

The search for protein structure motifs begins with the knowledge that some proteins

with low sequence similarity fold into remarkably similar 3-D conformations. Even

globally different structures may share similar or identical substructures.

Protein motifs can be divided into four categories:

i. Sequence motifs: linear strings of amino acid residues with a topological

ordering.

ii. Sequence-structure motifs.

iii. Structure motifs: 3-D objects that correspond to a protein backbone.

iv. Structure-sequence motifs: structure motifs in which nodes of the graph are

annotated with sequence information.

Predictability

It is the degree to which a motif representing one level or facet of protein structure

or function may be predicted from knowledge of another. For the local structure

motifs designated as secondary structure, predictability is the ability to accurately

predict secondary structure classes from amino acid sequence.

Predictive utility

It is the flip side of the predictability criterion. For example, if one takes the view of

secondary structure as an intermediate-level encoding between primary structure and

tertiary structure, then predictive utility ought to be some measure of the gain in

accuracy in predicting tertiary structure with a particular encoding, as compared with

prediction using other possible encodings. Another, more direct measure might be the

degree to which a particular set of proposed motifs, corresponding to secondary

structure classes, constrains the alpha and gamma angles of the included structure

fragments.

Intelligibility

It refers to the ease with which researchers and practitioners of protein science can

understand a given structure motif and can incorporate its information into their own

work. Many factors affect intelligibility. For example, a discovered structure class that

contains one-third traditional alpha helix, one-third traditional beta sheet and

one-third coil is harder to explain than one that correlates almost perfectly with alpha

helix.

Naturalness

It means the degree to which a motif captures some essential biochemical or

evolutionary properties or some essential class structure in the space of protein

sequence or structure fragments under consideration. Some clustering methods are

infamous for finding ersatz clusters in uniformly distributed data. Other clustering

methods produce results very dependent on their starting point. To avoid such results

it is important to carefully choose appropriate representations and attributes for

classification.

Systematicity

It is the degree to which a motif discovery method is derived from explicitly stated

principles and the degree to which the methods can repeatedly be applied to diverse

data and produce consistent results.

Ease Of Discovery

It refers to the computational complexity and data complexity of the methods required

to discover the motif.

2.6.2 Classification

Pattern discovery is a fundamental operation for finding knowledge. A pattern in a

bio-sequence can help scientists to analyze the properties of a sequence or predict the

function of a new entity. The pattern may also help to classify an unknown sequence

or to assign the sequence to an existing family.

2.6.3 Association

Some qualities or traits in a species do not come alone; they come associated

with some other fundamental differences. So the presence of one particular

characteristic (pattern) in a sequence may indicate the presence of another, and the strength of

that link depends on the confidence of the pattern in that sequence for that particular association.

Types of association

i. Association can be for a pair or set of similarities in the same sequence.

ii. Association can be for a pair or set of similarities in two sequences.

iii. Association can be for a pair or set of similarities in multiple sequences.

2.7 THE CENTRAL DOGMA

The expression of the genetic information stored in DNA involves the translation of a

linear sequence of nucleotides into a co-linear sequence of amino acids in proteins.

The flow is: DNA → mRNA → Protein [2].

2.7.1 Transcription

A segment of DNA is first copied into a complementary strand of RNA. This process

called transcription is catalyzed by the enzyme RNA polymerase. Near most

genes there is a special pattern in the DNA called the promoter, located upstream of the

transcription start site, which informs the RNA polymerase where to begin

transcription. This is achieved with the assistance of transcription factors that

recognize the promoter sequence and bind to it. Although ribonucleic acid (RNA) is a

long chain of nucleotides (as is DNA), it has very different properties. First, RNA is

usually single stranded (denoted ssRNA). Second,

RNA has a ribose sugar, rather than deoxyribose. Third, RNA has the pyrimidine

base uracil (abbreviated U) instead of thymine. Fourth, unlike DNA, which is

located primarily in the nucleus, RNA can also be found in the cellular liquid outside

the nucleus, which is called the cytoplasm.

In eukaryotic organisms, to produce a protein the entire length of the gene, including

both its introns and its exons, is first transcribed into a very large RNA molecule, the

primary transcript. At the end of the gene the transcription stops, and a few dozen

adenine (A) nucleotides are added to the end of the RNA molecule for protection

(the poly-A tail). The 5' cap plays an important part in initiating protein synthesis and in

protecting the growing RNA transcript from degradation. Before this RNA

molecule leaves the nucleus, a complex of RNA processing enzymes removes all the

intron sequences, in a process called splicing, thereby producing a much shorter RNA

molecule. Typical eukaryotic exons have an average length of about 200 bp, while the average

length of introns is around 10,000 bp (these lengths can vary greatly between different

introns and exons). In many cases, the pattern of the splicing can vary depending on

the tissue in which the transcription occurs. For example, an intron that is cut from

mRNAs of a certain gene transcribed in the liver may not be cut from the same

mRNA when transcribed in the brain. This variation is called alternative splicing, and

it contributes to the overall protein diversity in the organism. After this RNA

processing step has been completed, the RNA molecule moves to the cytoplasm as a

messenger RNA molecule (mRNA), in order to undergo translation.

2.7.2 The Genetic Code

The rules by which the nucleotide sequence of a gene is translated into the amino acid

sequence of the corresponding protein, the so-called genetic code, were deciphered in

the early 1960s. The sequence of nucleotides in the mRNA molecule, which acts as an

intermediate, was found to be read in serial order in groups of three. Each triplet of

nucleotides, called a codon, specifies one amino acid (the basic unit of a protein,

analogous to nucleotides in DNA). Since RNA is a linear polymer of four different

nucleotides, there are 4^3 = 64 possible codon triplets. However, only 20 different

amino acids are commonly found in proteins, so most amino acids are specified

by several codons. In addition, 3 codons (of the 64) specify the end of translation, and

are called stop codons. The codon specifying the beginning of translation is AUG, which is

also the codon for the amino acid methionine. The code has been highly conserved

during evolution: with a few minor exceptions, it is the same in organisms as diverse

as bacteria, plants, and humans.
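The codon arithmetic above can be checked with a short sketch that builds the standard genetic code as a lookup table and translates a short mRNA. The compact 64-letter string lists one amino acid per codon with the bases ordered U, C, A, G, and '*' marks a stop codon; the function name `translate` and the example sequence are assumptions made for illustration.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order; '*' is a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# product() varies the last base fastest: UUU, UUC, UUA, UUG, UCU, ...
CODON_TABLE = {
    "".join(codon): amino
    for codon, amino in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to, but not including, the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


print(len(CODON_TABLE))            # number of codons: 4^3
print(translate("GGAUGGCCUAAUU"))  # reads AUG, GCC, then the stop codon UAA
```

The table has 64 entries covering 20 amino acids plus 3 stop codons, which is why most amino acids are specified by more than one codon.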

2.8 NEED OF DATA MINING IN BIOINFORMATICS

Data in biology are very diverse and abundant. They can be catalogued and classified,

but often cannot be easily summarized or abstracted using a formula.

With the increase in biological knowledge, computer-based databases have become

essential for this task. Bioinformatics databases include the following types:

• Sequence databases

• Structural databases

• Motif databases

• Genome databases

• Proteome databases

• RNA expression

• Literature

• Populations

• Mutations

• Organisms

Moreover, the data of even a single microorganism is very large. Rickettsia conorii is

one of the smaller bacteria whose complete genome sequence is known. Its genome is about 1.3

million bp long, and this size is still on the small side for bacteria. Human genome

sequences are several billion bp in length. So with the significant growth of the

amount of biomolecular data, it becomes increasingly important to develop new

techniques for extracting knowledge from the data. Data mining is a fundamental

operation in such a domain.

All data in bioinformatics can be converted into DNA sequences: protein and

RNA sequences can both be expressed as DNA sequences. So data mining needs to be

applied to the DNA sequences, and the results can later be converted for the other

molecular data.

2.9 BIOINFORMATICS AND ITS SCOPE

Bioinformatics has evolved into a full-fledged scientific discipline over the last

decade. The definition of Bioinformatics is not restricted to computational molecular

biology and computational structural biology. It now encompasses fields such as

comparative genomics, structural genomics, transcriptomics, proteomics,

cellunomics and metabolic pathway engineering. Developments in these fields have

direct implications to healthcare, medicine, discovery of next generation drugs,

development of agricultural products, renewable energy, environmental protection etc

[23]. Bioinformatics integrates the advances in the areas of Computer Science,

Information Science and Information Technology to solve complex problems in Life

Sciences. The core data comprises the genomes and proteomes of human and other

organisms, 3-D structures and functions of proteins, microarray data, metabolic

pathways, cell lines and hybridoma, biodiversity etc. The sudden growth in the

quantitative data in biology has rendered data capture, data warehousing and data

mining major issues for biotechnologists and biologists. The availability of enormous

amounts of sequence and other data has resulted in the realization of the inherent biocomplexity

issues, which call for innovative tools for the synthesis of knowledge. Information

Technology, particularly the internet, is utilized to collect, distribute and access ever-

increasing data which are later analyzed with mathematics and statistics-based tools.

Bioinformatics has a key role to play in the cutting edge Research & Development

areas such as functional genomics, proteomics, protein engineering,

pharmacogenomics, discovery of new drugs and vaccines, molecular diagnostic kits,

agro-biotechnology etc. This has attracted attention of several companies and

entrepreneurs. As a result, a large number of Bioinformatics- based start-ups have

been launched and the trend is likely to continue. This has necessitated the availability

of a large number of formally trained individuals in Bioinformatics. A

bioinformatician must acquire or possess expertise in the essential multidisciplinary

fields that comprise the core of this new science. Quality research and education in

Bioinformatics are vital not only to meet the existing challenges but also to set and

accomplish new goals in Life Science.

2.10 APPLICATIONS OF BIOINFORMATICS

Molecular medicine

The human genome will have profound effects on the fields of biomedical research

and clinical medicine. Every disease has a genetic component. This may be inherited

or a result of the body's response to an environmental stress which causes alterations

in the genome (e.g. cancers, heart disease, diabetes). The completion of the human

genome means that we can search for the genes directly associated with different

diseases and begin to understand the molecular basis of these diseases more clearly

[27]. This new knowledge of the molecular mechanisms of disease will enable better

treatments, cures and even preventative tests to be developed.

Personalized medicine

Clinical medicine will become more personalised with the development of the field of

pharmacogenomics. This is the study of how an individual's genetic inheritance

affects the body's response to drugs. At present, some drugs fail to make it to the

market because a small percentage of the clinical patient population shows adverse

effects to a drug due to sequence variants in their DNA.

As a result, potentially life-saving drugs never make it to the marketplace. Today,

doctors have to use trial and error to find the best drug to treat a particular patient as

those with the same clinical symptoms can show a wide range of responses to the

same treatment. In the future, doctors will be able to analyze a patient's genetic profile

and prescribe the best available drug therapy and dosage from the beginning.

Preventative medicine

With the specific details of the genetic mechanisms of diseases being unraveled, the

development of diagnostic tests to measure a person's susceptibility to different

diseases may become a distinct reality. Preventative actions such as a change of

lifestyle or having treatment at the earliest possible stages, when they are more likely

to be successful, could result in huge advances in our struggle to conquer disease.

Gene Therapy

In the not too distant future, the potential for using genes themselves to treat disease

may become a reality. Gene therapy is the approach used to treat, cure or even prevent

disease by changing the expression of a person's genes. Currently, this field is in its

infancy, with clinical trials for many different types of cancer and other diseases

ongoing.

Drug development

At present all drugs on the market target only about 500 proteins. With an improved

understanding of disease mechanisms and using computational tools to identify and

validate new drug targets, more specific medicines that act on the cause, not merely

the symptoms, of the disease can be developed. These highly specific drugs promise

to have fewer side effects than many of today's medicines.

CHAPTER 3 INTRODUCTION TO BLAST

3.1 INTRODUCTION

The discovery of sequence homology to a known protein or family of proteins often

provides the first clues about the function of a newly sequenced gene. As the DNA

and amino acid sequence databases continue to grow in size they become increasingly

useful in the analysis of newly sequenced genes and proteins because of the greater

chance of finding such homologies. There are a number of software tools for

searching sequence databases, but all use some measure of similarity between

sequences to distinguish biologically significant relationships from chance

similarities. Perhaps the best studied measures are those used in conjunction with variations

of the dynamic programming algorithm. These methods assign scores to insertions,

deletions and replacements, and compute an alignment of two sequences that

corresponds to the least costly set of such mutations. Such an alignment may be

thought of as minimizing the evolutionary distance or maximizing the similarity

between the two sequences compared. In either case, the cost of this alignment is a

measure of similarity; the algorithm guarantees it is optimal, based on the given

scores. Because of their computational requirements, dynamic programming

algorithms are impractical for searching large databases without the use of a

supercomputer. Rapid heuristic algorithms that attempt to approximate the above

methods have been developed, allowing large databases to be searched on commonly

available computers. In many heuristic methods the measure of similarity is not

explicitly defined as a minimal cost set of mutations, but instead is implicit in the

algorithm itself. For example, the FASTP program first finds locally similar regions

between two sequences based on identities but not gaps, and then rescores these

regions using a measure of similarity between residues, such as a PAM matrix, which allows

conservative replacements as well as identities to increment the similarity score.

Despite their rather indirect approximation of minimal evolution measures, heuristic

tools such as FASTP have been quite popular and have identified many distant but

biologically significant relationships.
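The dynamic programming approach described above can be sketched as a local alignment score computation in the style of Smith-Waterman. This is a minimal illustration that maximizes similarity rather than minimizing cost; the scores for matches, replacements and gaps (2, -1 and -2) are assumed values chosen for the example, not parameters taken from any particular tool.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b (Smith-Waterman, linear gap cost)."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                                # start a new local alignment
                previous[j - 1] + substitution,   # identity or replacement
                previous[j] + gap,                # gap in b (deletion)
                current[j - 1] + gap,             # gap in a (insertion)
            )
            best = max(best, current[j])
        previous = current
    return best


print(local_alignment_score("GGACGTGG", "TTACGTCC"))
print(local_alignment_score("ACGTACGT", "ACGTCGT"))
```

Every cell of the table is filled, so the work grows with the product of the two sequence lengths. Repeating this against every database entry is what makes the exact method expensive and motivates the heuristic tools.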

BLAST (Basic Local Alignment Search Tool) employs a measure based on

well-defined mutation scores. It directly approximates the results that would be

obtained by a dynamic programming algorithm for optimizing this measure. The

method will detect weak but biologically significant sequence similarities, and is more

than an order of magnitude faster than existing heuristic algorithms.

BLAST Means:

B (Basic) - Despite the adjective "Basic" in its name, it is a sophisticated software

package that has become the single most important piece of software in the field of

bioinformatics.

LA (Local Alignment) - Local alignment is one of the two kinds of alignment (the other

is global alignment); it finds the best subsequence alignment. This kind of alignment is needed

because functional sites (such as catalytic sites) are localized in relatively short regions.

ST (Search Tool) - It has introduced a number of refinements to database searching that

improved overall search speed and put database searching on a firm statistical

foundation. It searches using a threshold value.

BLAST (Basic Local Alignment Search Tool) is a set of similarity search programs

designed to explore all of the available (DNA and protein) sequence databases

regardless of whether the query is protein or DNA. The BLAST programs have been

designed for speed, with a minimal sacrifice of sensitivity to distant sequence

relationships. BLAST uses the concept of a "segment pair" which is a pair of sub-

sequences of the same length that form an ungapped alignment. The algorithm first

looks for short words that are present in both sequences and then extends these at

either end to find the longest segments present in both. The statistical significance of

these High-scoring Segment Pairs is evaluated to determine whether the matches are

random or not. Thus, the scores assigned in a BLAST search have a well-defined

statistical interpretation, making real matches easier to distinguish from random

background.
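The word-and-extend idea described above can be sketched as follows. This is a deliberately simplified illustration: it seeds on identical words and extends only while the residues are identical, whereas BLAST itself extends using substitution scores. The function name `shared_segments` and the example sequences are hypothetical.

```python
def shared_segments(query, subject, w=3):
    """Find maximal identical ungapped segments seeded by shared words of length w.

    Returns (query_start, subject_start, length) tuples, longest first.
    """
    # Index every word of length w in the subject by its start positions.
    index = {}
    for j in range(len(subject) - w + 1):
        index.setdefault(subject[j:j + w], []).append(j)

    segments = set()
    for i in range(len(query) - w + 1):
        for j in index.get(query[i:i + w], []):
            q_start, s_start = i, j
            # Extend to the left while the preceding residues are identical.
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            q_end, s_end = i + w, j + w
            # Extend to the right while the following residues are identical.
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            segments.add((q_start, s_start, q_end - q_start))
    return sorted(segments, key=lambda segment: -segment[2])


print(shared_segments("TTACGTACC", "GGACGTAGG"))
```

Several seed words can extend to the same segment, which is why the results are collected in a set before sorting.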

3.2 DATABASES AVAILABLE FOR BLAST SEARCH

3.2.1 Protein Sequence Databases

We can choose a protein database for blastp or blastx. We can choose a nucleotide database

for blastn, tblastn or tblastx.

• nr: All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF

• month: All new or revised GenBank CDS translations+PDB+SwissProt+PIR+PRF released in the last 30 days

• swissprot: Last major release of the SWISS-PROT protein sequence database (no updates)

• Drosophila genome: Drosophila genome proteins provided by Celera and the Berkeley Drosophila Genome Project (BDGP) (www.fruitfly.org)

• yeast: Yeast (Saccharomyces cerevisiae) genomic CDS translations

• ecoli: Escherichia coli genomic CDS translations

• pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank (www.pdb.org)

• kabat: Kabat's database of sequences of immunological interest (http://immuno.bme.nwu.edu)

• alu: Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences

3.2.2 Nucleotide Sequence Databases

We can choose a nucleotide database for blastn, tblastn or tblastx.

• nr: All GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or phase 0, 1 or 2 HTGS sequences). No longer "non-redundant".

• month: All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days

• Drosophila genome: Drosophila genome provided by Celera and the Berkeley Drosophila Genome Project

• dbest: Database of GenBank+EMBL+DDBJ sequences from EST divisions

• dbsts: Database of GenBank+EMBL+DDBJ sequences from STS divisions

• gss: Genome Survey Sequence; includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences

• yeast: Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences

• E. coli: Escherichia coli genomic nucleotide sequences

• pdb: Sequences derived from the 3-dimensional structures in the Brookhaven Protein Data Bank

BLAST protein databases available through the blastp web interface

Figure 3.1 Protein Databases

• kabat: Kabat's database of sequences of immunological interest

• vector: Vector subset of GenBank(R), NCBI, in ftp://ncbi.nlm.nih.gov/blast/db/

• mito: Database of mitochondrial sequences

• alu: Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available by anonymous FTP from ncbi.nlm.nih.gov (under the /pub/jmc/alu directory).

• Epd: Eukaryotic Promoter Database

BLAST nucleotide databases available through the blastn web interface

Figure 3.2 Nucleotide Databases


3.3 BLAST ALGORITHM

(1) In step 1, BLAST filters low-complexity regions and removes them from the query sequence. Regions of low compositional complexity or short-periodicity repeats can yield extremely large numbers of statistically significant but biologically uninteresting results. The filtering and removal of these regions can be controlled with the -F flag of the stand-alone version of BLAST and with check boxes in the web version. Next, BLAST generates a list of all the short sequences, or words, that make up the query (Figure 3.3). The default word lengths are 3 for amino-acid sequences and 11 for nucleotide sequences, and are adjustable using the -W flag in the stand-alone version. Then, BLAST uses a scoring matrix (BLOSUM62, by default, for amino acids) to determine all high-scoring matching words for each word in the query sequence. No gaps are allowed. The list of matches is reduced by keeping only those that score at or above a given threshold, called the neighborhood word-score threshold. There is a trade-off at this stage between speed and sensitivity: a higher threshold gives greater speed but increases the chance of missing relevant pairs [8].

• For the query, find the list of high-scoring words of length w.

• For a given word length w (usually 3 for proteins) and a given score matrix, create a list of all words (w-mers) that score at least T when compared to w-mers from the query.

[Figure: the query sequence LNKCKTPQGQRLVNQ with the query word PQG and its neighborhood words at threshold T = 13: PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PMG 13, etc. Words such as PQN (12) fall below the threshold. A query sequence of length L yields a maximum of L-w+1 words (typically w = 3 for proteins). For each word from the query sequence, find the list of words that will score at least T when scored using a pair-score matrix (e.g. PAM 250). For typical parameters there are around 50 words per residue of the query.]

Figure 3.3 List of Words From Query Sequence
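The word-list step can be sketched in a few lines of Python. This is a minimal illustration and not the BLAST implementation: the function names are hypothetical, only the three BLOSUM62 rows needed for the word PQG are embedded, and the exhaustive enumeration of all 20^w candidates is chosen for clarity rather than speed.

```python
from itertools import product

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Excerpt of BLOSUM62: only the rows for P, Q and G (enough for the word PQG),
# with columns in AMINO_ACIDS order.
BLOSUM62_ROWS = {
    "P": [-1, -2, -2, -1, -3, -1, -1, -2, -2, -3, -3, -1, -2, -4, 7, -1, -1, -4, -3, -2],
    "Q": [-1, 1, 0, 0, -3, 5, 2, -2, 0, -3, -2, 1, 0, -3, -1, 0, -1, -2, -1, -2],
    "G": [0, -2, 0, -1, -3, -2, -2, 6, -2, -4, -4, -2, -3, -3, -2, 0, -2, -2, -3, -3],
}


def pair_score(a, b):
    """Score residue a (from the query word) against residue b."""
    return BLOSUM62_ROWS[a][AMINO_ACIDS.index(b)]


def query_words(query, w=3):
    """All overlapping words of length w: at most L - w + 1 of them."""
    return [query[i:i + w] for i in range(len(query) - w + 1)]


def neighborhood(word, threshold):
    """Every word of the same length scoring at least `threshold` against `word`."""
    hits = {}
    for candidate in product(AMINO_ACIDS, repeat=len(word)):
        score = sum(pair_score(a, b) for a, b in zip(word, candidate))
        if score >= threshold:
            hits["".join(candidate)] = score
    return hits


words = query_words("LNKCKTPQGQRLVNQ")
neighbors = neighborhood("PQG", threshold=13)
for name, score in sorted(neighbors.items(), key=lambda kv: -kv[1]):
    print(name, score)
```

Raising the threshold shrinks the neighborhood (faster, less sensitive), while lowering it enlarges the list, which is the trade-off described in step 1.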

(2) In the second step, BLAST searches through the target sequence database for

exact matches to the word list generated. Because BLAST has already pre-processed

and indexed the databases for the occurrence of all words in each sequence in the

database, this search is extremely fast. If a match is found, it is used to seed a possible

alignment between the query and the database sequences.

• Compare the word list to the database and identify exact matches.

• Each neighborhood word gives all positions in the database where it is found

(hit list).

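[Figure: the neighborhood words (PQG 18, PEG 15, PRG 14, PKG 14, PNG 13, PDG 13, PMG 13) are looked up in the database sequences; for example, the word PMG is matched exactly at the positions where it occurs.]

Figure 3.4 Exact matches of words from word list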
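The lookup in step 2 can be illustrated with a simple word index. This is a sketch under stated assumptions: the two database sequences and the function names are hypothetical, and a real BLAST database uses a far more compact pre-computed structure.

```python
from collections import defaultdict


def build_index(database, w=3):
    """Map every w-letter word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in database.items():
        for i in range(len(seq) - w + 1):
            index[seq[i:i + w]].append((seq_id, i))
    return index


def find_seeds(word_list, index):
    """Return the hit list: every database position that matches a word exactly."""
    return [(word, seq_id, pos)
            for word in word_list
            for seq_id, pos in index.get(word, [])]


# Hypothetical database sequences, for illustration only.
database = {"seq1": "MKTPMGQLLA", "seq2": "AAPQGWPEGK"}
index = build_index(database)
seeds = find_seeds(["PQG", "PEG", "PMG", "PDG"], index)
print(seeds)
```

Because the index is built once per database, each query word costs only a dictionary lookup, which is why this stage is fast.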

(3) In the third step, the original BLAST method tries to extend the alignment from the matching words in both directions for as long as the score continues to increase. For each word match, the alignment is extended in both directions to find alignments that score higher than the score threshold S. The program tries to extend matching segments (seeds) out in both directions by adding pairs of residues. Residues are added until the incremental score drops below a threshold. The resulting alignment is called a high-scoring segment pair, or HSP. Gapped BLAST uses a lower threshold for generating the list of high-scoring matching words; the algorithm uses short matched regions with no insertions or deletions between them, and within a certain distance of each other, as the starting points for longer ungapped alignments. These joined regions are then extended using the same method as in the original BLAST.

Figure 3.5 Maximal Segment Pairs (MSPs)
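The extension of a seed can be sketched as follows. This is a simplified, ungapped illustration, not the BLAST source: the +1/-3 match/mismatch scores, the drop-off value and all function names are assumptions chosen for the example, and only the drop-off and end-of-sequence stopping rules are implemented.

```python
def match_score(a, b, match=1, mismatch=-3):
    """Toy nucleotide scoring scheme (an assumption for this example)."""
    return match if a == b else mismatch


def extend(query, subject, q_pos, s_pos, x_drop, step):
    """Walk from (q_pos, s_pos) in direction `step` (+1 or -1).

    Returns (best score gain, number of residue pairs kept). The walk stops
    when the running score falls more than x_drop below its maximum, or when
    the end of either sequence is reached.
    """
    score = best = kept = n = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        score += match_score(query[q_pos], subject[s_pos])
        n += 1
        if score > best:
            best, kept = score, n
        elif best - score > x_drop:
            break
        q_pos += step
        s_pos += step
    return best, kept


def extend_seed(query, subject, q_pos, s_pos, w, x_drop=5):
    """Extend a word hit of length w in both directions; return (score, query span)."""
    seed = sum(match_score(query[q_pos + k], subject[s_pos + k]) for k in range(w))
    right, r_len = extend(query, subject, q_pos + w, s_pos + w, x_drop, +1)
    left, l_len = extend(query, subject, q_pos - 1, s_pos - 1, x_drop, -1)
    return seed + left + right, (q_pos - l_len, q_pos + w + r_len)


query = "GGACGTACGTCC"
subject = "TTACGTACGTAA"
score, span = extend_seed(query, subject, q_pos=4, s_pos=4, w=4)
print(score, span, query[span[0]:span[1]])
```

The segment that is kept is trimmed back to the point of maximum score, so mismatches examined before the drop-off triggers do not end up in the reported HSP.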


Next, BLAST determines whether each score found by one of the above methods is

greater in value than a given cutoff score S, determined empirically by examining the

range of scores given by comparing random sequences and then choosing a value that

is significantly greater. The maximal segment pairs, or MSPs, from the entire database

are identified and listed. Finally, BLAST determines the statistical significance of

each score, initially by calculating the probability that two random sequences, one the

length of the query sequence and the other the length of the database (the sum of the

lengths of all of the database sequences) with the same composition (nucleotide or

amino acid) could produce the calculated score.

3.4 BLAST PARAMETERS

There are various parameters that play a vital role in the output produced by BLAST. Proper values of these parameters can improve the speed and sensitivity of BLAST. We have analyzed all the parameters to see which of them can be tuned to improve the results of BLAST [8]. The parameters of BLAST include:

¾ W, word size

Word size is roughly the minimal length of an identical match that an alignment must contain if it is to be found by the algorithm. It controls the number of word hits. The query sequence and every database sequence are split up into every possible "word" of the selected size. The default word size is 11 bp for DNA and 3 aa for proteins (it must be >= 7 for DNA). The task of finding HSPs begins with identifying short words of length W in the query sequence that either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Extension of the word hits in each direction is halted when the cumulative alignment score falls off by the quantity X from its maximum achieved value; when the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or when the end of either sequence is reached.


If we are interested in longer regions of homology we should increase the word size. Increasing the word size also speeds up the search, especially with larger query sequences (>5 kb) and large databases. However, high values of W in conjunction with moderate values of T can lead to immense memory requirements. The probability of a hit decreases as the word size increases [15]. Smaller word sizes increase sensitivity and decrease speed. For protein searches the best word size is four.

¾ T, the threshold parameter.

T is referred to as the neighborhood word score threshold (Altschul et al., 1990). It is the minimum score that a word pair in the segment pair should have. We can adjust the value of T to control the size of the neighborhood and therefore the number of word hits in the search space. A lower value of T increases the chance that a segment pair with a score of at least S will contain a word pair with a score of at least T. Thus, a small value for T increases the number of hits. This in turn increases the execution time of the algorithm, because more words are generated from the query sequence and therefore more hits are found. On the other hand, higher values of T progressively reduce word hits and reduce the search space. The proper value of T therefore depends on the balance between speed and sensitivity. It also depends on the values in the scoring matrix.

Figure 3.6 The word size option

If the value for T is not chosen carefully, though (i.e., if T is set just a little bit too low), a combinatorial explosion in neighborhood words will soon lead to the depletion of all available memory. Even if the neighborhood word list does fit in memory, however, its sheer size may produce an adverse effect on speed, due to the consequent loss of processor cache efficiency.

¾ X, drop off

This value provides a cutoff threshold for the extension algorithm's tree exploration. When the score of a given branch drops below the current best score minus the X dropoff, the exploration of this branch stops. This variable represents the recent alignment history [20]. Specifically, it represents how much the score is allowed to drop off since the last maximum.

A very large value of X does not increase the score and requires more computation. Even so, it is generally a good idea to use a fairly large value, which reduces the risk of premature termination. W, T and the 2-hit setting are better for controlling speed than X.

X depends not only on the substitution scores, but also on the gap initiation and extension costs. We generally need to adjust this parameter in the following two situations:

• If we align sequences that are nearly identical and we want to prevent the extensions from straying into nonidentical sequences, we can set the various X values very low.

• If we try to align very distant sequences and have already adjusted W, T and the scoring matrix to allow additional sensitivity, it makes sense to also increase the various X values.

¾ λ, lambda

λ is a matrix-specific constant required to convert a raw score to a normalized score. A raw score can be a misleading quantity because scaling factors are arbitrary. A normalized score, corresponding to the original lod score, is therefore a more useful measure. Lambda is approximately the inverse of the original scaling factor, but its value may be slightly different due to integer rounding errors. When target frequencies are calculated from multiple alignments, all target frequencies naturally sum to 1.

$\sum_{i}\sum_{j} q_{ij} = 1$ ………(1)


The score of two amino acids is the log-odds ratio of the observed and expected frequencies. The same relationship is presented in Equation (2), but the lod score is replaced by the product of lambda and the raw score.

$\lambda S_{ij} = \ln \left( q_{ij} / (p_i p_j) \right)$ ………(2)

Equation (3) rearranges Equation (2) to solve for the pair-wise frequency.

$q_{ij} = p_i p_j e^{\lambda S_{ij}}$ ………(3)

From Equation (3), we can see that a pair-wise frequency ($q_{ij}$) is implied by the individual amino acid frequencies ($p_i$ and $p_j$) and a normalized score ($\lambda S_{ij}$). The key to solving for lambda is to provide the individual amino acid frequencies ($p_i$ and $p_j$) and find a value for lambda where the sum of the implied target frequencies equals one. The formulation is given in Equation (4).

$\sum_{i}\sum_{j} q_{ij} = \sum_{i}\sum_{j} p_i p_j e^{\lambda S_{ij}} = 1$ ………(4)
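Solving for lambda has no closed form in general, so it is usually found numerically. The sketch below uses plain bisection on a toy nucleotide scheme (uniform base frequencies, +1 for a match and -3 for a mismatch); these values and the function names are assumptions for illustration, and production BLAST code may use a different root-finding method.

```python
import math


def implied_sum(lam, freqs, score):
    """Sum over i, j of p_i * p_j * exp(lam * S_ij): the implied target frequencies."""
    return sum(pi * pj * math.exp(lam * score(a, b))
               for a, pi in freqs.items()
               for b, pj in freqs.items())


def solve_lambda(freqs, score, hi=10.0, tol=1e-9):
    """Find the positive lambda at which the implied target frequencies sum to 1.

    lam = 0 is always a trivial solution, so the search starts just above it.
    With a negative expected score the sum is below 1 for small lam and above
    1 for large lam, which is what bisection needs.
    """
    lo = 1e-6
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if implied_sum(mid, freqs, score) < 1.0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Assumed toy model: uniform base frequencies, match +1, mismatch -3.
freqs = {base: 0.25 for base in "ACGT"}


def toy_score(a, b):
    return 1 if a == b else -3


lam = solve_lambda(freqs, toy_score)
print(f"lambda = {lam:.4f}")
```

The method only works when the expected score of the matrix is negative and at least one score is positive; otherwise no positive solution exists.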

Normally, once lambda is estimated, it is used to calculate the Expect of every HSP in the BLAST report. Unfortunately, the residue frequencies of some proteins deviate widely from the residue frequencies used to construct the original scoring matrix. Recently, some versions of PSI-BLAST and BLASTP have therefore begun to use the query and subject sequence amino acid compositions to calculate a composition-based lambda. These "hit-specific" lambdas have been shown to improve BLAST sensitivity, so this approach may see wider use in the near future. Lambda is also used in calculating the Expect through the equation $E = kmn e^{-\lambda S}$. Here lambda may be thought of as the expected increase in reliability of an alignment associated with a unit increase in alignment score. Reliability in this case is expressed in units of information, such as bits or nats, with one nat being equivalent to 1/ln(2) (roughly 1.44) bits.

¾ k, Adjustment

A small adjustment (k) takes into account the fact that optimal local alignment scores for alignments that start at different places in the two sequences may be highly correlated. For example, a high-scoring alignment starting at residues 1,1 implies a fairly high alignment score for an alignment starting at residues 2,2 as well.

¾ m, length of query

This might seem to be the length of the query that we enter to be matched against the different databases, but in BLAST it is actually the effective length of the query. It may be defined as the actual length minus the expected HSP length, where the expected HSP length is the length of an HSP that has an Expect of 1. The size of the search space is simply the product of the number of letters in the query (m) and the number of letters in the database (n). The relationship between the expected number of alignments (E) and the search space (mn) is linear. If the size of the search space is doubled, the expected number of alignments with a particular score also doubles.

¾ n, length of the database

This might seem to be the length of the database sequence with which the query is to be matched, but it is actually the effective length of the database. It may be defined as the sum of the effective lengths of every sequence within it. The size of the search space is simply the product of the number of letters in the query (m) and the number of letters in the database (n). The relationship between the expected number of alignments (E) and the search space (mn) is linear. If the size of the search space is doubled, the expected number of alignments with a particular score also doubles.

No effective length of the query or database can ever be less than 1/k. Setting an effective length to 1/k basically amounts to ignoring a short sequence for statistical purposes; when both m and n are less than 1/k, BLAST searches are ill-advised.

¾ H, Relative Entropy

The formal name for the average information per symbol is entropy. But what if all symbols aren't equally probable? To compute the entropy, you need to weight the information of each symbol by its probability of occurring. This formulation, known as Shannon's Entropy (named after Claude Shannon), is shown in the following equation.

$H = -\sum_{i} p_i \log_2 p_i$

Entropy (H) is the negative sum over all n symbols of the probability of a symbol ($p_i$) multiplied by the log base 2 of that probability ($\log_2 p_i$). The relative entropy of a scoring matrix (H) conveniently summarizes the general behavior of a scoring matrix. Its formulation is similar to the expected score but is calculated from normalized scores, as shown in the following equation.

$H = \sum_{i}\sum_{j} q_{ij} \lambda S_{ij}$

H is the average number of bits (or nats) per position in an alignment and is always positive.
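Both quantities are straightforward to compute. The sketch below evaluates Shannon entropy in bits and the relative entropy as the sum of q_ij ln(q_ij / (p_i p_j)), which is the target-frequency-weighted normalized score in nats and is never negative. The two-letter alphabet and its frequencies are made-up values used only to exercise the functions.

```python
import math


def shannon_entropy(probs):
    """H = -sum(p * log2(p)) in bits; zero-probability symbols contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def relative_entropy(q, p):
    """Sum of q_ij * ln(q_ij / (p_i * p_j)) in nats.

    q maps a residue pair (i, j) to its target frequency and p maps a residue
    to its background frequency.
    """
    return sum(qij * math.log(qij / (p[i] * p[j]))
               for (i, j), qij in q.items() if qij > 0)


# Four equally likely symbols carry 2 bits each.
h_uniform = shannon_entropy([0.25, 0.25, 0.25, 0.25])

# Hypothetical two-letter alphabet in which identical pairs are favoured.
p = {"a": 0.5, "b": 0.5}
q = {("a", "a"): 0.4, ("b", "b"): 0.4, ("a", "b"): 0.1, ("b", "a"): 0.1}
h_matrix = relative_entropy(q, p)

print(f"Shannon entropy:  {h_uniform:.3f} bits")
print(f"Relative entropy: {h_matrix:.4f} nats = {h_matrix / math.log(2):.4f} bits")
```

A matrix with higher relative entropy packs more information into each aligned position, so shorter alignments can reach significance.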

¾ E, Expect

Expect is the number of alignments expected by chance during a sequence database search and can be represented using the Karlin-Altschul equation.

$E = kmn e^{-\lambda S}$

From the above equation we can see that E is a function of the size of the search space (mn), the normalized score (λS), and a minor constant (k). The relationship between the expected number of alignments and the search space (mn) is linear: if the size of the search space is doubled, the expected number of alignments with a particular score also doubles. The relationship between the expected number of alignments and the score is exponential, which means that small changes in score can lead to large differences in E. An E-value tells you how many alignments with a given score are expected by chance; it is not itself a probability, although for small values it is close to the probability that the associated match is due to randomness. The lower the E value, the more specific/significant the match. Its relation to the P value can be represented as

$E = -\ln(1 - P)$

E is also the statistical significance threshold for reporting matches against database sequences; the default value is 10, such that 10 matches are expected to be found merely by chance, according to the stochastic model of Karlin and Altschul (1990). In the BLAST output report the sequences are listed in order of increasing E (expect) value, that is, from most to least significant. If the E value ascribed to a match is greater than the EXPECT threshold, the match will not be reported. Lower EXPECT thresholds are more stringent, leading to fewer chance matches being reported.
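The linear and exponential dependencies can be checked numerically. In the sketch below the values of k, lambda, m and n are illustrative assumptions, not figures taken from a real BLAST report.

```python
import math


def expect(k, m, n, lam, raw_score):
    """Karlin-Altschul Expect: E = k * m * n * exp(-lam * S)."""
    return k * m * n * math.exp(-lam * raw_score)


# Illustrative values only (assumptions for this example).
K, LAM = 0.13, 0.32
M, N = 250, 50_000_000  # effective query and database lengths

e_base = expect(K, M, N, LAM, raw_score=80)
e_double_space = expect(K, M, 2 * N, LAM, raw_score=80)
e_higher_score = expect(K, M, N, LAM, raw_score=90)

print(f"E at S=80:               {e_base:.3g}")
print(f"E with doubled database: {e_double_space:.3g}")
print(f"E at S=90:               {e_higher_score:.3g}")
```

Doubling the database doubles E, whereas raising the raw score by 10 divides E by exp(10 * lambda), which is why a modest gain in score can move a hit from insignificant to significant.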

¾ S, Score

In the late '60s and early '70s, Margaret Dayhoff pioneered quantitative techniques for measuring amino acid similarity. Using sequences that were available at the time, she constructed multiple alignments of related proteins and compared the frequencies of amino acid substitutions. As expected, there is quite a bit of variation in amino acid substitution frequency, and the patterns are generally what you'd expect from the chemical properties.

Dayhoff represented the similarity between amino acids as a log2 odds ratio, also known as a lod score. To derive the lod score of an amino acid pair, take the log2 of the ratio of the pairing's observed frequency to the pairing's random expected frequency. If the observed and expected frequencies are equal, the lod score is zero. A positive score indicates that a pair of letters is common, while a negative score indicates an unlikely pairing. The general formula for any pair of amino acids is shown in the following equation.

$S_{ij} = \log_2 \left( q_{ij} / (p_i p_j) \right)$

The score of two amino acids i and j is $S_{ij}$, their individual probabilities are $p_i$ and $p_j$, and their frequency of pairing is $q_{ij}$. The relationship between the expected number of alignments and score is exponential. This means that small changes in score can lead to large differences in E.

¾ P-value

A P-value is the probability of finding at least one alignment with a score this high by chance. It is related to the Expect by

$P = 1 - e^{-E}$

For values of less than 0.001, the E-value and P-value are essentially identical. The aggregate pair-wise P-value for a sum score can be approximated using the above equation. Thus, when sum statistics are being employed, BLAST not only uses a different score, it also uses a different formula to convert that score into a probability: the standard Karlin-Altschul equation $E = kmn e^{-\lambda S}$ isn't used to convert a sum score to an Expect. In the limit of infinite E, P approaches 1; and in the limit as E approaches 0, E and P approach equality.

Given the inaccuracy of the statistical methods as they are applied in the BLAST programs, whenever E and P are less than about 0.05, the two values can practically be treated as equal.
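The convergence of E and P for small values is easy to verify. The following sketch converts between the two; the function names are hypothetical, and `math.expm1` and `math.log1p` are used only to keep precision when E or P is tiny.

```python
import math


def p_from_e(e):
    """P = 1 - exp(-E), computed accurately for small E."""
    return -math.expm1(-e)


def e_from_p(p):
    """E = -ln(1 - P), the inverse of p_from_e."""
    return -math.log1p(-p)


for e in (10, 1, 0.05, 0.001):
    print(f"E = {e:<6} P = {p_from_e(e):.6f}")
```

At the default threshold of E = 10 the P-value is already indistinguishable from 1, which is one reason BLAST reports E rather than P.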

¾ Number of sequences in database

The number of sequences in the database also affects the speed and sensitivity of the BLAST algorithm. If the database contains few sequences, BLAST runs faster, because there are fewer word hits and fewer sequences to be compared with the query.


¾ Percent identity

Percent identity is the percent of exact matches between your query sequence and the database sequence. The "Positives" value, which is more relevant to protein alignments, is the percent of exact plus similar (based on properties) amino acid matches.

¾ Number of Alignments

Restricts the database sequences for which high-scoring segment pairs (HSPs) are reported to the number specified; the default limit is 100. If more database sequences than this happen to satisfy the statistical significance threshold for reporting, only the matches ascribed the greatest statistical significance are reported.

¾ Filter

Low-complexity regions, such as proline- or glycine-rich regions or acidic or basic regions, can yield tremendous numbers of spurious matches between sequences that have no other similarity between them. The statistics break down when such decidedly non-random sequences appear; furthermore, search times may be needlessly increased. To avoid spurious matching and make the statistics more robust, low-complexity regions can be filtered from the query sequence. Filtering eliminates statistically significant but biologically uninteresting reports from the BLAST output (e.g., hits against common acidic-, basic- or proline-rich regions), leaving the more biologically interesting regions of the query sequence available for specific matching against database sequences. Filtering is only applied to the query sequence (or its translation products), not to database sequences. Filtering should not be expected to always yield an effect. Furthermore, in some cases, sequences are masked in their entirety, indicating that the statistical significance of any matches reported against the unfiltered query sequence should be suspect.

Sometimes we need to mask the human repeats (LINEs and SINEs). This is especially useful for human sequences that may contain these repeats.

3.5 FEATURES OF BLAST

3.5.1 Heuristic

BLAST is not guaranteed to find the best alignment between your query and the database; it may miss matches. This is because it uses a strategy that is expected to find most matches, but sacrifices complete sensitivity in order to gain speed. In practice, however, few biologically significant matches that can be found with other sequence search programs are missed by BLAST. BLAST searches the database in two phases. First it looks for short subsequences that are likely to produce significant matches, and then it tries to extend these subsequences [8].

3.5.2 Substitution Matrix

A substitution matrix is used during all phases of protein searches (BLASTP,

BLASTX, and TBLASTN). Both phases of the alignment process (scanning &

extension) use a substitution matrix to score matches. This is in contrast to FASTA

that uses a substitution matrix only for the extension phase. Substitution matrices

greatly improve sensitivity. There are two main types of matrices PAM and

BLOSUM; we can select the preferred matrix.

PAM (Percent Accepted Mutation) matrices: predicted matrices, most sensitive for alignments of sequences with evolutionarily related homologs. The greater the number in the matrix name, the greater the expected evolutionary (mutational) distance, i.e. PAM30 would be used for alignments expected to be more closely related in evolution than an alignment performed using the PAM250 matrix.

BLOSUM (Blocks Substitution Matrix): calculated matrices, most sensitive for local

alignment of related sequences, ideal when trying to identify an unknown nucleotide

sequence. BLOSUM62 is the default matrix set in the BLAST search tool.

3.5.3 Local Alignments

BLAST uses LOCAL ALIGNMENTS for matching sequences rather than GLOBAL ALIGNMENTS. BLAST tries to find patches of regional similarity, rather than trying to find the best alignment between your entire query and an entire database sequence.

3.5.4 Ungapped Alignments

Alignments generated with the original BLAST do not contain gaps. BLAST's speed and statistical model depend on this, but in theory it reduces sensitivity. However, BLAST will report multiple local alignments between your query and a database sequence.

3.5.5 Explicit Statistical Theory

BLAST is based on an explicit statistical theory developed by Samuel Karlin and Steven Altschul. The original theory was later extended to cover multiple weak matches between query and database entry: the repetitive nature of many biological sequences (particularly naive translations of DNA/RNA) violates assumptions made in the Karlin & Altschul theory. While the P values provided by BLAST are a good rule-of-thumb for initial identification of promising matches, care should be taken to ensure that matches are not due simply to biased amino acid composition.

The databases are contaminated with numerous artifacts. The intelligent use of filters can reduce problems from these sources. Remember that the statistical theory only covers the likelihood of finding a match by chance under particular assumptions; it does not guarantee biological importance.

3.5.6 Rapid

BLAST is extremely fast. It does not explore the entire search space between two

sequences as it uses the three layers (seeding, extension, and evaluation) of rules to

sequentially refine potential HSPs (high scoring pairs). This minimization of search

space is the key to its speed but at the cost of a loss in sensitivity. You can either run

the program locally or send queries to an E-mail server maintained by NCBI.

3.5.7 Sequence Input

The BLAST web pages accept input sequences in three formats; FASTA sequence

format, NCBI Accession numbers, or GIs. The preferred query sequence format for

the BLAST program is the FASTA format. Advanced BLAST tolerates both spaces

and numbers and is case insensitive.

3.5.8 Results Format

Results are returned in either text format (default) or HTML format (for which you must supply an e-mail address and select the HTML results format option). A Request ID number is given so that the results can be obtained at a later time; if we want the results immediately, we can click on the "Format Results" button.

Formatting items such as the results format option and the number of descriptions and alignments in the results output are needed only for formatting; these items may be specified from the BLAST query form or at the time the results are requested. Most results are held for up to 24 hours; very large result files are deleted after 30 minutes.

3.5.9 BLAST Output

All BLAST programs produce a similar output. This consists of a program introduction, a schematic distribution of alignments of the query sequence to those in the databases, a series of one-line descriptions of the database sequences which have significantly aligned to the query sequence, the actual sequence alignments, and a list of statistics specific to the BLAST search; the search method and version number are displayed at the top of the output. The output consists of:

• A schematic distribution of the ordered alignments of the query sequence to those in the databases. Colored bars are distributed in a way that reflects the region of alignment onto the query sequence. The color legend represents the significance of the alignment scores. Holding the mouse over a given bar will display a description of that specific alignment sequence in the window above; clicking on a specific bar will cause the browser to jump down to that particular alignment.

• Sequence alignments and their corresponding line descriptions, listed in order of lowest to highest E value, where the E (expect) value is the number of matches with that score expected by chance; the lower the E value, the more specific/significant the match.

• Identifiers for the database sequences, which appear in the first column and are hyperlinked to the associated GenBank entry.

• The Score for each alignment. The score (bits) is a sum value calculated for alignments using the scoring matrix; the higher the score value, the better the alignment.

• The percent identity (called "Identities" and given as a percent), which is the percent of exact matches between your query sequence and the database sequence. This value also gives the number of nucleotide bases or amino acid residues that are matched in the database sequence versus the query sequence.

• The Gap value, which is the percent of the alignment sequence that has been gapped in the particular alignment. Alignments are gapped unless specified otherwise by the user at the BLAST search submission page.

• A list of statistics specific to the particular BLAST search, displayed at the bottom of the output. They include the BLAST version number, and the database and matrices used for the search.


CHAPTER 4 VARIANTS OF BLAST

The best way to identify an unknown sequence is to see if that sequence already exists

in a public database. If the database sequence is a well-characterized sequence, then

you may have access to a wealth of biological information. BLAST (Basic Local

Alignment Search Tool) is a set of similarity search programs designed to explore all

of the available sequence databases (DNA or protein) regardless of whether the query

is protein or DNA. These programs have been tailored specifically for the purpose of

sequence similarity identification. Each BLAST program performs a different task.

Different flavors of BLAST are covered in the following sections [7].

Figure 4.1 Blast Variants

4.1 BLAST VARIANTS

Programs available for the BLAST search include:

Program | Query type | Database type | Comparison | Application

BlastN | DNA | DNA | Compares a nucleotide query sequence against a nucleotide sequence database | Find DNA sequences that match the query

BlastP | Protein | Protein | Compares an amino acid query sequence against a protein sequence database | Find protein sequences that match the query

BlastX | DNA | Protein | Compares a nucleotide query sequence translated in all reading frames against a protein sequence database | Find what protein the query sequence codes for

TBlastN | Protein | DNA | Compares a protein query sequence against a nucleotide sequence database translated in all reading frames | Find genes in unknown DNA sequences

TBlastX | DNA | DNA | Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database | Discover gene structure (find the degree of homology between the coding region of the query sequence and known genes in the database)

Table 4.1 Programs Available For The Blast

Types of BLAST Programs:

• blastp compares an amino acid query sequence against a protein sequence

database

• blastn compares a nucleotide query sequence against a nucleotide sequence

database

• blastx compares a nucleotide query sequence translated in all reading frames

against a protein sequence database

• tblastn compares a protein query sequence against a nucleotide sequence

database dynamically translated in all reading frames.


Figure 4.2 Blast Variants

• tblastx compares the six-frame translations of a nucleotide query sequence

against the six-frame translations of a nucleotide sequence database. Note that

tblastx program cannot be used with the nr database on the BLAST Web page.
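The translated searches (blastx, tblastn and tblastx) all rely on translating a nucleotide sequence in six reading frames: three on the given strand and three on its reverse complement. The sketch below shows the idea using the standard genetic code; the function names are hypothetical and the input sequence is a made-up example.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG ("*" = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}


def reverse_complement(dna):
    """Complement each base and reverse the strand."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return the translations for frames +1, +2, +3, -1, -2 and -3."""
    dna = dna.upper()
    rc = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frames("ATGGCCAAGTAA")
for name, protein in frames.items():
    print(name, protein)
```

Because tblastx translates both the query and the database in this way, it performs many more comparisons than the other programs, which helps explain why it is the most expensive of the five.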

4.2 PSI-BLAST

PSI-Blast is the preferred method for searching a protein database with a protein

sequence as the key. If used for only one round, it is identical to BlastP. Its algorithm

is designed to conduct further iterations of the search and to extend the search to

distantly related homologues.

PSI stands for Position Specific Iterated. This search method makes use of a profile,

which is a position-specific accounting of what amino acid residues are found in a

family of aligned homologous proteins. PSI-Blast accepts a protein sequence as input

and first conducts a normal BlastP search to identify homologues in the database. A

profile is constructed from the spectrum of sequences found in the initially identified

homologues. This profile is used as the search key to identify more distant relatives.

The process is then iterated, each time refining the profile based on inclusion of the

new members. Ideally, the process is expected to converge on a unique set of genes.

In practice, the search may at some point begin to include proteins that are related by

chance similarity. The user must use judgement to recognize when proteins of known and unrelated functions begin to appear in the list of finds [19].


PSI-BLAST is an iterative form of blastp in which a profile is created from the amino acid query and the nth set of results (those meeting the PSI expectation threshold) and resubmitted. It is a program based on the BLAST 2.0 algorithm that is designed to detect weak relationships between the query and members of the database that are not necessarily detectable by standard BLAST searches [19]. The added sensitivity of this program over regular BLAST comes from the use of a profile that is constructed (automatically) from a multiple alignment of the highest-scoring hits in the initial BLAST search. The profile is generated by calculating position-specific scores for every position in the alignment. A highly conserved position receives a high score, and weakly conserved positions receive scores near zero. The profile is then used to perform additional BLAST searches (called iterations), and the results of each iteration are used to refine the profile.
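The idea of a position-specific profile can be sketched with a toy log-odds matrix. This is a deliberately simplified illustration: it assumes a uniform background distribution and a fixed pseudocount, whereas PSI-BLAST itself uses sequence weighting and more elaborate pseudocounts. The alignment and function names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def build_pssm(alignment, background=None, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column."""
    if background is None:
        # Assumed uniform background, for illustration only.
        background = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    pssm = []
    for column in zip(*alignment):
        # Gap characters and unknown residues are ignored.
        counts = Counter(c for c in column if c in background)
        total = sum(counts.values()) + pseudocount * len(background)
        pssm.append({aa: math.log2(((counts[aa] + pseudocount) / total) / background[aa])
                     for aa in background})
    return pssm


def score_with_profile(pssm, segment):
    """Score a segment of the same length as the profile, column by column."""
    return sum(column[residue] for column, residue in zip(pssm, segment))


# Hypothetical aligned hits: columns 1 and 3 are conserved, columns 2 and 4 vary.
alignment = ["PQGA", "PEGL", "PRGK", "PKGV"]
pssm = build_pssm(alignment)

print(f"conserved P, column 1: {pssm[0]['P']:.2f}")
print(f"variable A, column 4:  {pssm[3]['A']:.2f}")
print(f"unseen W, column 1:    {pssm[0]['W']:.2f}")
```

A new iteration would search the database with `score_with_profile` in place of a fixed substitution matrix, then rebuild the profile from the enlarged set of hits.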

PSI-BLAST is designed for more sensitive protein-protein similarity searches. PSI-BLAST is the most sensitive BLAST program, making it useful for finding very distantly related proteins. We should use PSI-BLAST when our standard protein-protein BLAST search either failed to find significant hits, or returned hits with descriptions such as "hypothetical protein" or "similar to...".

Figure 4.3 PSI Blast

When we use PSI-BLAST to search a database, it generates Position Specific Scoring Matrices, which can then be built into a database of patterns. Then we can just search one of these databases with a new sequence. One of the difficulties in doing this is curating the database. In a regular sequence database, we just keep throwing in new sequences, whereas with one of these pattern databases, we have to periodically go back and redo the patterns and try to consolidate them and so forth. It takes a lot of effort to keep such a database up to date.

PSI-BLAST analysis is useful both for identifying the distant members of a protein family, whose relationship is not recognizable by straight sequence comparison, and also for deducing the function of hypothetical proteins that are unannotated in the database.

STEPS OF PSI-BLAST ALGORITHM:

STEP 1:

The data to be entered must be in one of the allowed formats for BLAST search. Once

the query sequence is entered, the database to be searched must be selected from the

appropriate pull down menus. Options include a number of different sequence

databases that can be searched using blastp.

Figure 4.4 PSI Blast-Step1

The default database is nr, which is the collection of all unique sequences. It contains

all non-redundant GenBank CDS translations + PDB + SwissProt + PIR + PRF entries.

STEP 2:

The E-value is the statistical significance threshold for reporting matches against

database sequences. The default expect value for the initial BLAST search is 10. This

EXPECT threshold is fairly lenient, allowing all possibly related sequences to be

reported. Here, the initial (BLAST) E value is set to 1.0.

It is appropriate to filter most queries for low complexity sequences because they give

spuriously high scores that reflect compositional bias rather than significant position-

by-position alignment. Thus we have chosen to filter low-complexity regions.

The BLOSUM62 substitution matrix (gap existence cost = 11; per residue gap cost = 1; lambda ratio = 0.85) is used by default in BLAST 2.0. A variety of other matrices are also supported, including PAM30, PAM70, BLOSUM80 and BLOSUM45. Adjustments to the matrix may be in order when a search for very distant relatives of the query is being performed. A BLOSUM matrix assigns a score to each pair of aligned residues that is based on the frequency with which that substitution is known to occur among related proteins.

Figure 4.5 PSI Blast-Step2

Then the word size needs to be set; it is 3 by default. Other advanced options can

specify gap costs, word size, and other parameters not otherwise selectable on the

query form. Here, we have not set any

advanced option.

STEP 3:

Checking the NCBI-gi designation facilitates the process of doing additional searches to investigate the significance of a given alignment, whereas checking the graphical overview option displays the database sequences aligned to the query sequence graphically. The score of each alignment is represented by bars of different colors. Multiple alignments on the same database sequence are connected by a striped line. Mousing over a hit sequence causes the definition and score to be shown in the window at the top.

The default number of descriptions and alignments to be listed is 500. Although it

may seem useful to change the default to something smaller to control the magnitude

of the output, these variables affect the search in two important ways: First, if the total

number of hits in which E is less than the threshold exceeds the number (x) of

descriptions requested, only the top x most significant would be listed; additional

possibly significant alignments would not be shown, though these may embody

important information. Second, the number of sequences used in generating the

multiple alignment and the position specific matrix is specified by the larger of the

two (descriptions, alignments) variables. If at any point in the iterative PSI-BLAST

process, significant sequences are omitted from the profile, all subsequent output will

be affected. By selecting a large number of descriptions (e.g. 250-500) it is possible to

ensure that the E value and not the description limit will be the determining factor in

generating the profile to be used for additional iterations. Reducing the output can

then be accomplished, if desired, by limiting the number of alignments to be reported.

A variety of different alignment formats are available. The choice of which to use is

based on personal preference. Pairwise alignment gives a good view of the quality of

an individual hit. However, a flat query-anchored alignment (with identities) is a

format in which identities shared by numerous sequences can be easily spotted.

There is a second E value, which is the threshold for inclusion in the position-specific matrix used for PSI-BLAST iterations. Here the PSI-BLAST E value is left at the default setting of 0.001. Together, the two E values (this one and the one specified earlier) allow the user to see (and selectively, based on prior knowledge, include) all of the BLAST hits up to E=1, but to automatically include only those hits that pass the relatively rigorous E value threshold of 0.001.

There are some more options to set, which include layout, formatting options on page

with result and autoformat. All these affect the report format but not the results

produced. In the end we click on the search button to initiate the search. In seconds,

the query sequence has been compared to all of the entries in the specified database.

Each comparison is scored and the top scores are listed in rank order.

PSI-BLAST Output

Output of PSI-BLAST is shown both in graphical format and in detailed format. In

detailed format the hits are divided into two categories. Those that are better than the

E value threshold are listed first. Those with E values worse than the threshold, but

nonetheless better than 1 (the value selected on the query page), are listed

further down the page.

Figure 4.6 PSI Blast-Output

PSI-BLAST In summary:

Patterns of conservation such as PSSM (Position Specific Score Matrix) identified

from the alignment of related sequences can aid the recognition of distant similarities.

This power can be further enhanced through iteration of the search procedure.

Position-Specific Iterated BLAST (PSI-BLAST) was developed for this goal and,

furthermore, has advantages in speed, simplicity and automatic operation. The PSI-BLAST

program runs as follows.

Figure 4.7 PSI Blast-Output

(1) A standard BLAST search is performed against a database using a substitution

matrix (e.g. BLOSUM62).

(2) A PSSM (checkpoint) is constructed automatically from a multiple alignment of

the hits of the initial BLAST search or last round iteration of homology searching.

Highly conserved positions receive high scores and weakly conserved positions receive

low scores.

(3) The new PSSM replaces the initial matrix (e.g. BLOSUM62) or last round PSSM

to perform a next “BLAST” search.

(4) Steps 2 and 3 can be repeated and the new found sequences are included to build a

new PSSM.

(5) PSI-BLAST has converged if no new sequences are included.
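The five steps above amount to a loop that stops when a round adds no new sequences. The following is a minimal sketch of that control flow only; the search and profile-building functions are passed in as parameters, and the stand-ins toy_search and toy_profile (with their hard-coded results) are hypothetical placeholders, not real BLAST calls.

```python
def iterate_until_converged(query, search, build_profile, max_rounds=5):
    """Repeat search and profile rebuilding until no new sequences appear.

    Returns (included sequences, rounds run, converged flag).
    """
    included = set()
    profile = None  # None stands for the initial substitution matrix
    for round_no in range(1, max_rounds + 1):
        hits = set(search(query, profile))
        new_hits = hits - included
        if not new_hits:
            return included, round_no, True  # step (5): converged
        included |= new_hits
        profile = build_profile(query, included)  # steps (2) and (3)
    return included, max_rounds, False


# Hypothetical stand-ins: a richer profile "reaches" more distant sequences.
REACH = {0: {"seqA"}, 1: {"seqA", "seqB"}, 2: {"seqA", "seqB", "seqC"}}


def toy_search(query, profile):
    size = 0 if profile is None else len(profile)
    return REACH.get(size, REACH[2])


def toy_profile(query, included):
    return frozenset(included)


result = iterate_until_converged("QUERY", toy_search, toy_profile)
```

With these stand-ins, each of the first three rounds adds one sequence and the fourth round finds nothing new, so the loop reports convergence.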

Figure 4.8 PSI Blast

PSI-Blast

The blastpgp program can do an iterative search in which sequences found in one

round of searching are used to build a score model for the next round of searching. In

this usage, the program is called Position-Specific Iterated BLAST, or PSI-BLAST. As

explained in the accompanying paper, the BLAST algorithm is not tied to a specific

score matrix. Traditionally, it has been implemented using an AxA substitution matrix

where A is the alphabet size. PSI-BLAST instead uses a QxA matrix, where Q is the

length of the query sequence; at each position the cost of a letter depends on the

position with respect to the query and the letter in the subject sequence. The position-specific

matrix for round i+1 is built from a constrained multiple alignment among the query

and the sequences found with sufficiently low e-value in round i. The top part of the

output for each round distinguishes the sequences into: sequences found previously

and used in the score model, and sequences not used in the score model. The output

currently includes lots of diagnostics requested by users at NCBI. To skip quickly

from the output of one round to the next, search for the string "producing", which is

part of the header for each round and likely does not appear elsewhere in the output.

PSI-BLAST "converges" and stops if all sequences found at round i+1 below the e-

value threshold were already in the model at the beginning of the round [21].

There are several blastpgp parameters specifically for PSI-BLAST:

-j is the maximum number of rounds (default 1; i.e., regular BLAST)

-h is the e-value threshold for including sequences in the score matrix model (default

0.001)

-c is the "constant" used in the pseudocount formula specified in the paper (default

10)

The -C and -R flags provide a "checkpointing" facility whereby a score model can be

stored and later reused.

-C stores the query and frequency count ratio matrix in a file

-R restarts from a file stored previously.

When using -R, it is required that the query specified on the command line match exactly the query in the restart file.

Users who also develop their own sequence analysis software may wish to develop their own scoring systems. For this purpose the code in posit.c that writes out the checkpoint can be easily adapted to write out scoring systems derived by other algorithms in such a way that PSI-BLAST can read the files in later.

The checkpoint structure is general in the sense that it can handle any position-

specific matrix that fits in the Karlin-Altschul statistical framework for BLAST

scoring.
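As an illustration of how the round limit and the checkpoint flags fit together, the sketch below only assembles an argument list and does not run anything. It assumes that -i and -d name the query file and the database, as in the legacy NCBI toolkit programs; the helper name blastpgp_command and the file names are hypothetical.

```python
def blastpgp_command(query_file, database, rounds,
                     save_checkpoint=None, restart_checkpoint=None):
    """Build an argument list for a blastpgp run (assembled, not executed)."""
    # -i (query file) and -d (database) are assumed legacy toolkit options;
    # -j is the maximum number of rounds described in the text.
    cmd = ["blastpgp", "-i", query_file, "-d", database, "-j", str(rounds)]
    if save_checkpoint:
        cmd += ["-C", save_checkpoint]      # store the score model
    if restart_checkpoint:
        cmd += ["-R", restart_checkpoint]   # restart from a stored model
    return cmd


# Hypothetical file names: run three rounds and save the resulting model,
# then restart a later search from that saved model.
first_run = blastpgp_command("query.fasta", "nr", 3,
                             save_checkpoint="model.chk")
second_run = blastpgp_command("query.fasta", "nr", 1,
                              restart_checkpoint="model.chk")
```

The second command reuses the same query file as the first, in line with the requirement that the query match the one stored in the restart file.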

4.3 BLASTN

Standard nucleotide BLAST compares a nucleotide query sequence against a

nucleotide sequence database. It is better than MEGABLAST at finding sequences that are similar, but not

identical, to the query. The BLAST nucleotide algorithm finds similar sequences by

generating an indexed table or dictionary of short subsequences called words for both

the query and the database. The program can then rapidly find initial exact matches to

the query words by simply looking up a particular word in the database dictionary.

These initial matches serve as starting points for longer alignments that are generated

in several steps, ending with a final gapped alignment [8].
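The word dictionary lookup described above can be sketched as follows. This is a simplified illustration that indexes a single subject sequence rather than a whole database; the function names index_words and find_seeds and the example sequences are hypothetical.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map every length-w word in the sequence to its start positions."""
    table = defaultdict(list)
    for i in range(len(sequence) - w + 1):
        table[sequence[i:i + w]].append(i)
    return table


def find_seeds(query, subject, w):
    """Return (query position, subject position) pairs of exact word matches."""
    subject_index = index_words(subject, w)
    seeds = []
    for q_pos in range(len(query) - w + 1):
        word = query[q_pos:q_pos + w]
        for s_pos in subject_index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


# With word size 7 the two sequences share two overlapping exact words.
seeds = find_seeds("ACGTACGTTAGC", "TTACGTACGTAA", 7)
```

Each seed found this way would then serve as a starting point for extension into a longer alignment.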

One of the important parameters governing the sensitivity of BLAST searches is the

length of the initial words (word size). The most important reason that blastn is more

sensitive than MEGABLAST is that it uses a shorter default word size. Because of

this, blastn is better than MEGABLAST at finding alignments to related nucleotide

sequences from other organisms since the initial exact match can be shorter.

Figure 4.9 BLASTN

The word size is adjustable in blastn and can be reduced from the default value of 11 to a

minimum of 7 to increase sensitivity. This word size can also be increased to increase

the search speed and limit the number of database hits. Nucleotide-nucleotide

searches are not the recommended way to find homologous protein coding regions in

other organisms. It is better to perform searches at the protein level, either with

translations of the nucleotide sequences or by direct protein-protein BLAST. This is

because of the degeneracy of the genetic code, the greater information available in

amino acid sequence, and the more sophisticated algorithm in protein-protein BLAST.

Figure 4.10 Using Blastn For Comparison

Figure 4.11 Blastn Results

4.4 BLASTX

Sequence similarity between a translated nucleotide sequence and a known biological

protein can provide strong evidence for the presence of a homologous coding region,

and such similarities can often be identified even between distantly related genes. The

computer program BLASTX performed conceptual translation of a nucleotide query

sequence followed by a protein database search in one programmatic step. The

BLAST search algorithm combined with Karlin-Altschul statistics yields a predictable

selectivity that has been parameterized. BLASTX is appropriate for use in moderate

and large scale sequencing projects at the earliest opportunity, when the data are most

prone to containing errors [9].

Most primary sequence data is obtained as nucleic acid, while much of the biological

interest lies in the encoded protein. Inference of likely protein coding regions is often

based on statistical features, such as codon usage and the locations of putative splice

site signals but significant false positive rates are common. In contrast, similarity

between a conceptually translated nucleotide sequence and a known protein sequence

may be highly significant statistically, which suggested a more discriminating

approach to inferring coding potential. BLASTX is used to probe a nucleotide

sequence directly for the presence of protein coding regions by identifying segments

that encode significant similarity to members of a protein sequence database.

The BLASTX program has been successfully employed to identify likely protein

coding sequences in thousands of partial cDNA sequences from human brain tissue.

BLASTX allowed protein-protein comparisons to be considered when only

uncharacterized nucleotide query sequence was available.

Figure 4.12 Blastx

The program conceptually translated query sequences in all six reading frames (three on each strand) and compared each of these full-length translation products with a comprehensive protein sequence database in a single pass. The BLAST algorithm approximates a well defined measure of local sequence similarity based on a matrix of similarity or substitution scores for all possible pairs of residues. The algorithm identifies ungapped, aligned pairs of sequence segments with locally maximum scores which meet or exceed a parameterized cutoff score S. These segments are referred to as “high-scoring segment pairs” (HSPs), and the highest scoring segment pair derivable from any two sequences is their maximal-scoring segment pair, or MSP. A program, BLASTX, based on this rapid, probabilistic algorithm, was used to find statistically significant HSPs between a translated nucleotide query sequence and a target protein sequence database. When an HSP was found, the analysis of Karlin and Altschul was used to estimate the significance of its score. No prior knowledge of the reading frame or direction was assumed by BLASTX; all possible reading frames in both orientations of the query sequence were translated into protein sequence using the standard genetic code. The PAM (point accepted mutation) amino acid substitution model was typically used for scoring similarity between peptide sequences. By default, BLASTX used a PAM120 matrix.
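The six-frame conceptual translation can be sketched in a few lines. The example below is an illustration under simple assumptions: the input contains only the uppercase letters A, C, G and T, the standard genetic code is used, and stop codons are written as "*". The function names translate and six_frames are hypothetical and are not taken from the BLASTX source.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons from the start of dna; '*' marks a stop."""
    return "".join(CODON_TABLE[dna[i:i + 3]]
                   for i in range(0, len(dna) - 2, 3))


def six_frames(dna):
    """Return three forward-strand frames, then three reverse-strand frames."""
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse)
            for offset in range(3)]


frames = six_frames("ATGGCCTAA")
```

For this nine-base example, the first forward frame reads Met-Ala-stop, and the other five frames give the alternative readings that BLASTX would also compare against the protein database.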

The expected number of alignments scoring S or greater in a comparison between two

random sequences of lengths m and n is

$$E = mnK e^{-\lambda S}$$

where $K$ and $\lambda$ are parameters dependent on the amino acid compositions of the sequences. For values less than about 0.1, E is often an acceptable approximation to P, the probability of occurrence of one or more matches scoring S or greater. In a true coding region, one reading frame may have a predicted amino acid composition typical for biologically occurring proteins, while the other reading frames exhibit anomalous compositions. For this reason, BLASTX calculated separate $K$ and $\lambda$ values for each reading frame.
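A short numerical sketch may help to show how E and P relate. It assumes the Karlin-Altschul form E = mnK·exp(−λS) and the usual Poisson relation P = 1 − exp(−E); the values chosen for K, λ, the sequence lengths and the score are illustrative assumptions, not parameters taken from BLASTX.

```python
import math


def expected_hits(m, n, K, lam, S):
    """Expected number of alignments scoring at least S: E = m*n*K*exp(-lam*S)."""
    return m * n * K * math.exp(-lam * S)


def p_value(E):
    """Probability of one or more such alignments, assuming Poisson counts."""
    return 1.0 - math.exp(-E)


# Illustrative values only (assumed, not taken from the thesis).
E = expected_hits(m=300, n=1_000_000, K=0.13, lam=0.32, S=80)
P = p_value(E)

# For small E the two quantities nearly coincide, as stated in the text.
small_E = 0.05
difference = abs(p_value(small_E) - small_E)
```

Doubling the database length n doubles E, which is one reason why the same score is less significant in a larger database.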

Figure 4.13 Using Blastx For Comparison

The BLAST algorithm operates in two successive stages, “neighborhood” word

generation followed by the actual search, with an implicit trade-off in speed versus

sensitivity imparted in the first stage. A list of neighborhood words of length W is

generated from consecutive, overlapping words of length W in the query sequence,

using a specified scoring matrix. The neighborhood list contains all words which

satisfy a threshold scoring parameter, T, when aligned with words in the query

sequence. Raising T decreases the size of the neighborhood and, consequently,

increases the search speed in the algorithm’s second stage, but at the expense of

decreased sensitivity. In BLASTX, the neighborhood word list was built from the

conceptual translations of all six reading frames on both strands of the query sequence

[24].
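The effect of the threshold T on the size of the neighborhood can be seen in a small sketch. It assumes a toy four-letter alphabet and a made-up scoring rule (identical residues score 4, residues in the same group score 2, all others score -3) in place of a real PAM or BLOSUM matrix; the names score and neighborhood are hypothetical.

```python
from itertools import product

# Toy alphabet and scoring rule (assumptions for illustration only).
ALPHABET = "ILKR"
GROUPS = [{"I", "L"}, {"K", "R"}]


def score(a, b):
    """Score one aligned residue pair under the toy rule."""
    if a == b:
        return 4
    if any(a in g and b in g for g in GROUPS):
        return 2
    return -3


def neighborhood(word, T):
    """All words of the same length scoring at least T against the word."""
    result = []
    for candidate in product(ALPHABET, repeat=len(word)):
        total = sum(score(a, b) for a, b in zip(word, candidate))
        if total >= T:
            result.append("".join(candidate))
    return sorted(result)


strict = neighborhood("ILK", 12)    # only the word itself
moderate = neighborhood("ILK", 10)  # one conservative substitution allowed
lenient = neighborhood("ILK", 8)    # two conservative substitutions allowed
```

Raising T from 8 to 12 shrinks the neighborhood from seven words to one, which is the speed versus sensitivity trade-off described above.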

During the second stage of the BLAST algorithm, the neighborhood words from

the first stage are searched for in the database or “target” sequence; the presence of a

neighborhood word match indicates the possible location of an HSP. Individual

neighborhood word matches (or word “hits”) are extended in both directions along the

matrix diagonal until the ends are reached or the cumulative alignment score falls

from its maximum achieved value by a parameterized quantity X.
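The extension rule can be sketched for one direction at a time. The example assumes a simple +1/-1 match/mismatch score in place of a substitution matrix and stops as soon as the running score has fallen X or more below its maximum; the function name extend and the example sequences are hypothetical.

```python
def extend(query, subject, q_pos, s_pos, step, x_drop, match=1, mismatch=-1):
    """Extend along the diagonal from (q_pos, s_pos) in one direction.

    step is +1 for rightward and -1 for leftward extension.
    Returns (best score, number of residue pairs at which it was reached).
    """
    score = best = best_len = length = 0
    while 0 <= q_pos < len(query) and 0 <= s_pos < len(subject):
        score += match if query[q_pos] == subject[s_pos] else mismatch
        length += 1
        if score > best:
            best, best_len = score, length
        elif best - score >= x_drop:
            break  # score fell X below the maximum: stop extending
        q_pos += step
        s_pos += step
    return best, best_len


# Four matching residues followed by mismatches; extension stops after two.
right = extend("ACGTTTTT", "ACGTAAAA", 0, 0, +1, x_drop=2)
# Leftward extension from the last residue of two identical sequences.
left = extend("ACGT", "ACGT", 3, 3, -1, x_drop=2)
```

A full extension would run this once to the right and once to the left of the word hit and combine the two results.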

Figure 4.14 Blastx Results

4.5 BLASTP

The BLASTP program is a search tool for databases of protein sequences that is

widely used by biologists as a first step in investigating new genome sequences.

BLASTP finds high-scoring local alignments without gaps between a query sequence

q and sequence s in the database. The score of an alignment is the sum of the scores of

the individual aligned amino acid pairs that make it up. These individual

scores come from a scoring matrix modeling the rate of evolutionary mutation.

BLASTP is the most widely used program for determining alignments of protein

sequences against databases such as GenBank. BLASTP is a three-step algorithm that

needs to scan the database only for exact matches to a precomputed set of words [14].

The BLASTP algorithm works in three steps:

1. Neighborhood Construction. A set of words of length W, called the neighborhood

N, is computed. Each word scores at least T with some word of equivalent length

in the query sequence Q.

2. Hit Detection. Each subject SB in the database DB is scanned for (exact) matches

to a word in N.

3. Hit Extension. The match, or hit H, is extended into a potentially higher scoring

alignment.

Figure 4.15 Blastp

The first step is to create a neighbourhood for each (short) segment of length W of

the query sequence. The neighbourhood consists of all sequences of W amino acids

that match the query segment with a high-score. An automaton is built to recognize

the union of all neighbourhoods. The second step is to scan the database for exact

matches to any neighbour. These matches are called hits. The third step attempts to

extend a hit into a high-scoring pair of segments with approximate matches to the left

and right of the hit. As each pair of aligned residues is included into the alignment, the

score of the aligned pair is looked-up in a score matrix and added to a running sum.

Extension of a hit continues until the falloff value, X, is reached.

Figure 4.16 Using Blastp For Comparison

4.5.1 BlastP Parameters

1. [DATABASE] Valid database name

Default : nr

2. [EXPECT] The expectation value threshold for reporting matches. If the E value

ascribed to a match is greater than this threshold, the match will not be

reported. Lower thresholds are more stringent, leading to fewer chance matches being

reported.

Default : 10.0

3. [ENTREZ_QUERY] Entrez query to limit Blast search

• Value : Entrez query format

• Default : Empty

4. [FILTER] Sequence filter identifier

• L for Low Complexity

• R for Human Repeats

• m for Mask for Lookup

5. [GAP_OPEN_COSTS] Gap open costs

• Value : integer values

• Default : 5 for nuc-nuc, 11 for proteins, non-affine for megablast

6. [GAP_EXTEND_COSTS] Gap extend costs

• Value : space separated float values

• Default : 2 for nuc-nuc, 1 for proteins, non-affine for megablast

7. [MATRIX_NAME] A key element in evaluating the quality of a pairwise sequence

alignment is the "substitution matrix", which assigns a score for aligning any possible

pair of residues.

• Value : Valid matrix name

• Default : BLOSUM62

4.6 TBLASTN

TBLASTN compares a protein query sequence against a nucleotide sequence database

dynamically translated in all six reading frames (both strands). The "Protein query -

Translated db [tblastn]" search is useful for finding protein homologs in unannotated

nucleotide data. A tblastn search allows you to compare a protein sequence to the six-

frame translations of a nucleotide database. It can be a very productive way of finding

homologous protein coding regions in unannotated nucleotide sequences such as

expressed sequence tags (ESTs) and draft genome records (HTG), located in BLAST

databases est and htgs, respectively. ESTs are short, single-read cDNA sequences.

These comprise the largest pool of sequence data for many organisms and contain

portions of transcripts from many uncharacterized genes. Since ESTs have no

annotated coding sequences, there are no corresponding protein translations in the

BLAST protein databases. Hence a tblastn search is the only way to search for these

potential coding regions at the protein level. The HTG sequences, draft sequences

from various genome projects or large genomic clones, are another large source of

unannotated coding regions [8].

Like all translating searches, the tblastn search is especially suited to working with

error prone data like ESTs and draft genomic sequences from HTG because it

combines BLAST statistics for hits to multiple reading frames and thus is robust to

frame shifts introduced by sequencing error.

4.7 TBLASTX

Tblastx compares the six-frame translations of a nucleotide query sequence against

the six-frame translations of a nucleotide sequence database. The tblastx program

cannot be used with the nr database on the BLAST Web page because it is

computationally intensive. The "Nucleotide query - Translated db [tblastx]" is useful

for identifying novel genes in error prone query sequence. Tblastx takes a nucleotide

query sequence, translates it in all six frames, and compares those translations to the

database sequences dynamically translated in all six frames. This effectively performs

a more sensitive blastp search without doing the manual translation. Tblastx gets

around the potential frame-shift and ambiguities that may prevent certain open

reading frames from being detected. This is very useful in identifying potential

proteins encoded by single pass read ESTs. In addition, it would be a good tool for

identifying novel genes [8].

4.7.1 Limitations Of Tblastx

1. TblastX is computationally intensive.

2. Until recently there were not many completely sequenced genomes.

3. When we get a match, we rarely find a description of what was found.

CHAPTER 5
COMPARISON OF VARIANTS OF BLAST

5.1 INTRODUCTION

BLAST is a successful tool for comparing biological sequences. Nowadays a large amount of biological data is available, but standalone BLAST is not sufficient to handle all types of queries related to sequence similarity, so different variants (BlastX, BlastP, BlastN, TBlastN, TBlastX, PSI-Blast) have been developed. Each variant has limitations and advantages, and each is designed for a different purpose, so the user should know which tool to use in which situation. A comparison between these variants is needed to understand these tools thoroughly [29].

Comparison between the variants of BLAST on the basis of:

• Parameters

• Algorithm

• Performance

5.1.1 Comparison On The Basis Of Parameters

All variants of BLAST run on the same algorithm as the main BLAST program, but some differences between the variants cause their functionality to differ. Most parameters of the main BLAST program are shared by all variants. Still, some parameters are present only in certain variants, and their presence or absence can make one tool more advantageous than another.

5.1.1.1 Conserved Domain Search Is Not Applied To Blastn, It Is Applicable To

Blastp.

Proteins often contain several domains, each with a distinct function (membrane

binding, signal peptide, etc.). As species evolve, the functional parts of important

proteins remain relatively constant over time, and may even be copied and adapted for

use by other proteins. Such domains have evolved as modules that are combined in

various arrangements to produce proteins of unique function. Conserved domains are

structural modules that have been reused frequently during the process of evolution.

NCBI’s new Conserved Domain Search (CD-Search) service can be used to identify conserved domains in a protein sequence.

Figure 5.1 Conserved Domain For BlastN and BlastP

Influence of absence of CDD Search: Conserved Domain Search is applicable only to proteins, because it is based on PSSMs (Position Specific Score Matrices), which are applied only to proteins. By applying PSSMs, specific functional areas within proteins can be searched. The functional domains found can then be used in further research. Because PSSMs are not applied to nucleotides, no search option is available for specific functional areas that may exist in nucleotide sequences.

Conserved Domain Search will not work for nucleotides, as it is based on PSSMs, which do not apply to nucleotides.

5.1.1.2 The Default Word Size Is 11 Characters For Blastn. The Default Word Size Is 3 For BLASTP, Due To Which BLASTP Searches Run Slower Than BLASTN.

Word size (seed) strongly affects database searching. Search speed increases with the word size: decreasing the word size increases sensitivity but slows the search program. The word size for BlastP is very small compared to BlastN. For BlastP the seed is 3 residues long. Because the seed is so small, a large number of hits are found in the database during the second step of the algorithm, so more time is spent on the search. For BlastN the seed is 11 nucleotides long, and exact matches to such a large seed are harder to find. Fewer hits are found and results are displayed in less time than with BlastP, but sensitivity is lower in BlastN.

Figure 5.2 Different Word Sizes For BlastN and BlastP

5.1.1.3 Blastn Is Very Different From Other Protein-Based Algorithms. Blastn Seeds Are Identical Words. T Is Never Used In Blastn.

A word hit is simply two identical sequences. T is the threshold parameter for word scores. T matters where no exact match to the given word is found: it widens the set of words that can seed an alignment. The neighborhood of a given word seed is found. The neighborhood of a word contains the word itself and the other words whose score is at least T when compared with it using the scoring matrix. By adjusting T, it is possible to control the size of the neighborhood and therefore the number of word hits in the search space [30]. But T is not used in BlastN, because BlastN always looks for identical matches, so no neighborhood is needed.

Influence of absence of T: T is not used in BlastN, and this is a big limitation of the BlastN algorithm. If identical seeds are not found in BlastN, there will be no match: when no match to the given seed is found, the search stops there, no extension of the seed is performed, and no alignment is reported.

Improvement: T should be used in BlastN. By using T, more word hits can be found. When other words are aligned with the word seed, a neighborhood of the word is created, and extension is applied to its hits. During extension in both directions, words are included whose score does not fall below the threshold value ‘T’, and a similar sequence is found as long as the score does not fall below the maximum by more than the drop-off score ‘X’. There is therefore no need to stop the search, more sequence matches can be found, and there is less chance of missing alignments.

5.1.1.4 Unlike Nucleotide BLAST, There Is No Comparable MEGABLAST For

Protein Searches.

MegaBlast is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". When a larger word size is used, it is up to 10 times faster than more common sequence similarity programs. MegaBlast is also able to efficiently handle much longer DNA sequences than the blastn program of the traditional BLAST algorithm.

Influence of absence of Mega Blast:

MegaBlast is an improvement on the existing BlastN algorithm, but no such program exists for proteins. No batch queries can be run in protein sequence searching, and longer sequences cannot be searched as efficiently. To improve the speed of protein searches and to handle long sequences, a MegaBlast-like program should be developed for proteins, one that can run large protein sequences and batches of sequences at a time.

5.1.1.5 Genetic Code Option Is Only Used With Blastx, Genetic Code Option Is

Disabled With Tblastn

The genetic code is the relationship between the sequence of the bases in the DNA

and the sequence of amino acids in proteins. Both DNA and proteins are linear

polymers thus it seems logical to suppose that the sequence of bases in DNA codes for

the sequences of amino acids in proteins. However, there are 20 amino acids found in

proteins and only 4 different bases found in DNA so the coding ratio cannot be 1 to 1

nor can it be 2 bases to 1 amino acid, which would only give 16 different

combinations. At least 3 bases in combination as a triplet are required to code for each

amino acid and this would give 4 to power 3 = 64 possible combinations of triplet

bases or codons. We now know that the genetic code is based on these triplet codons.
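The counting argument above can be checked directly by enumerating all base combinations of length one, two and three. This is a small illustrative check; the variable names are hypothetical.

```python
from itertools import product

BASES = "ACGT"
AMINO_ACIDS_TO_ENCODE = 20

# Number of distinct words that 1, 2 or 3 bases can form.
combinations = {length: len(list(product(BASES, repeat=length)))
                for length in (1, 2, 3)}

# Smallest word length that offers at least 20 distinct combinations.
shortest_sufficient = min(length for length, count in combinations.items()
                          if count >= AMINO_ACIDS_TO_ENCODE)
```

Only the triplet length gives enough combinations (64) to cover the 20 amino acids, which is why several codons can map to the same amino acid.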

Different species may use different genetic codes to encode the same amino acid. You have to specify the appropriate genetic code (translation table) for your query sequence based on the organism and source.

BlastX translates the given nucleotide sequence into protein and then compares it with the protein database. The genetic codes are used to translate the nucleotides into protein; without them translation is not possible. Different codes are available for different species. Mostly the standard genetic code is used.

5.2 COMPARISON ON THE BASIS OF ALGORITHM

All variants of BLAST run on the same algorithm as the main BLAST program, but there are some differences in how they work, which cause their performance to differ. The different features of the algorithm make it possible to use different tools for different purposes. Depending on the functionality required, the algorithms can be optimized to improve performance.

5.2.1 The Two-Hit Algorithm Isn't Used In BLASTN, Because Word

Hits Are Generally Rare With Large Identical Words.

The two-hit algorithm is not used in the original version of BLAST. In BLASTN, the statistically significant alignments found by the main BLAST algorithm depend on the threshold value ‘T’ and the drop-off score X. The central idea of the BLAST algorithm is that a statistically significant alignment is likely to contain a high-scoring pair of aligned words. BLAST first scans the database for words (typically of length three for proteins) that score at least T when aligned with some word within the query sequence. Any aligned word pair satisfying this condition is called a hit. The second step of the algorithm checks whether each hit lies within an alignment with a score sufficient to be reported. This is done by extending a hit in both directions until the running alignment’s score has dropped more than X below the maximum score yet attained. This extension step is computationally quite costly; with the T and X parameters necessary to attain reasonable sensitivity to weak alignments, the extension step typically accounts for more than 90% of BLAST’s execution time. It is therefore desirable to reduce the number of extensions performed.

The refined algorithm is based on the observation that an HSP of interest is much longer than a single word pair, and may therefore entail multiple hits on the same diagonal and within a relatively short distance of one another. Specifically, we choose a window length A, and invoke an extension only when two non-overlapping hits are found within distance A of one another on the same diagonal. Any hit that overlaps the most recent one is ignored. The two-hit method will detect an HSP if it contains two non-overlapping length-W words of score at least T. Comparing the relative speeds of the one-hit and two-hit methods with the parameters studied above, the two-hit method generates on average ~3.2 times as many hits, but only ~0.14 times as many hit extensions.
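The two-hit rule described above can be sketched as follows. This is a minimal illustration, assuming hits are given as (query position, database position) pairs; the function name, the sample hits, and the values W = 3 and A = 40 are illustrative choices and are not taken from the BLAST source code.

```python
def two_hit_triggers(hits, word_len, window):
    """Return the hits that would trigger an extension under the two-hit rule.

    hits     -- iterable of (query_pos, subject_pos) word hits
    word_len -- word length W; hits closer than W on a diagonal overlap
    window   -- window length A
    """
    last_hit = {}   # diagonal -> query position of the most recent kept hit
    triggers = []
    for q, s in sorted(hits):
        diagonal = s - q
        previous = last_hit.get(diagonal)
        if previous is not None:
            distance = q - previous
            if distance < word_len:
                continue            # overlaps the most recent hit: ignored
            if distance <= window:
                triggers.append((q, s))   # second hit inside the window
        last_hit[diagonal] = q
    return triggers


# Hypothetical hits: three on diagonal 5 close together, one on another
# diagonal, and one on diagonal 5 that is too far away.
sample_hits = [(0, 5), (1, 6), (4, 9), (10, 2), (60, 65)]
print(two_hit_triggers(sample_hits, word_len=3, window=40))
```

A one-hit method would attempt an extension for every entry in `sample_hits`; under the two-hit rule only a hit that has a non-overlapping partner on the same diagonal within the window is extended.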

Influence of the absence of the two-hit algorithm: The two-hit algorithm is not used for BlastN because the word size for BlastN is large (11 nucleotides). Word hits are identical words. Word hits are rare and difficult to find with a large word size, whereas identical matches of one or two nucleotides are easy to find in a given database.

[Plot: x-axis: Normalized HSP Score; y-axis: Probability of missing an HSP]

Figure 5.3 shows the empirically estimated probability that an HSP is missed by this method, as a function of its normalized score

It is very rare, however, to find exactly the same nucleotide sequence as a seed of 11 bp. Therefore the two-hit algorithm is not used.

Figure 5.4 Speeds of the one-hit and two-hit methods


Improvement: If the two-hit algorithm were applied to BlastN, its sensitivity would increase and more accurate sequence similarity would be obtained. This can be done by decreasing the word size of BlastN: with a large word size it is difficult to find exact matches regularly at two positions, but with a short word size it is easy to find exact matches at more than one position.

5.2.2 Extension in BlastN is different from BlastP and other protein

based programs.

Extension in BlastN differs from extension in BlastP because of the difference between nucleotides and proteins. Different scoring matrices are used to score the neighborhood during extension, and these yield separate drop-off (X) scores for BlastN and BlastP. In BlastN, moreover, the score has to be evaluated over a word of 11 nucleotides, which takes more time to calculate than in BlastP, whose word size is small by comparison.
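The role of the drop-off score X during extension can be sketched as follows. This is a simplified, ungapped, one-directional illustration for nucleotide sequences; the function name and the scores (match = +1, mismatch = -3, X = 5) are assumptions chosen for the example. A protein extension would look up each aligned pair in a substitution matrix instead of using a single match/mismatch score.

```python
def extend_right(query, subject, q_start, s_start,
                 match=1, mismatch=-3, x_drop=5):
    """Extend an ungapped alignment to the right from the given positions.

    Stops when the running score falls more than x_drop below the best
    score seen so far. Returns (best_score, length_of_best_extension).
    """
    score = 0
    best_score = 0
    best_len = 0
    offset = 0
    while q_start + offset < len(query) and s_start + offset < len(subject):
        same = query[q_start + offset] == subject[s_start + offset]
        score += match if same else mismatch
        offset += 1
        if score > best_score:
            best_score, best_len = score, offset
        elif best_score - score > x_drop:
            break   # score has dropped more than X below the maximum
    return best_score, best_len


# Eight matching bases followed by mismatches (hypothetical sequences).
print(extend_right("ACGTACGTTTTT", "ACGTACGTAAAA", 0, 0))
```

A full extension runs the same loop to the left of the hit as well and adds the two best scores to the score of the seed word.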

5.3 COMPARISON ON THE BASIS OF PERFORMANCE

Every tool is efficient under different conditions and for different input queries. The performance of the variants of Blast is measured on the basis of the following criteria:

• Expect Value

• Word Size

• Time

5.3.1 Comparison On The Basis of Varying Expect Values

A BlastN search was performed using the mRNA sequence of PRDX1 against the non-redundant database. To observe the effect of the "expect value" parameter, values of 10, 0.1, and 1e-30 were used, keeping the word size (11) and the filter (low complexity) constant. Table 5.1 shows the results.

A search with expect=10 returned 163 hits, expect=0.1 returned 157 hits, and expect=1e-30 returned only 65 hits. The expect value is a measure of how many times the sequence could hit another by chance. Decreasing this value makes the search more stringent, so fewer results are returned.
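The effect of the expect threshold can be illustrated with a small filter. This sketch uses made-up hit names and E-values purely for illustration (they are not results from the experiment), and `filter_by_expect` is a hypothetical helper, not part of any BLAST interface.

```python
# Hypothetical hits as (name, E-value) pairs; the values are invented.
HITS = [
    ("hit_a", 1e-120),
    ("hit_b", 3e-45),
    ("hit_c", 2e-12),
    ("hit_d", 0.04),
    ("hit_e", 7.5),
]


def filter_by_expect(hits, threshold):
    """Keep only the hits whose E-value does not exceed the threshold."""
    return [(name, e) for name, e in hits if e <= threshold]


for threshold in (10, 0.1, 1e-30):
    kept = filter_by_expect(HITS, threshold)
    print(f"expect={threshold}: {len(kept)} hits")
```

Lowering the threshold can only remove hits, never add them, which is why the hit counts shrink as the expect value decreases.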


Expect value (e)   BlastN   BlastP   BlastX   TBlastN   TBlastX   PSI-Blast
10                 163      100      100      100       101       501
0.1                157      100      100      101       100       501
1e-30              65       80       58       75        98        480

Table 5.1 Number of hits for varying expect values

In the same manner, the protein sequence of PRDX1 was searched against the non-redundant protein database using BlastP, BlastX, TBlastX, TBlastN and PSI-Blast. Again, the expect value was varied while keeping the word size (3) constant. The expect values of 10 and 0.1 both returned about 100 hits (501 hits for PSI-BLAST), meaning that a 100-fold increase in stringency makes no difference. However, when an expect value of 1e-30 was used, only 58 hits were returned. The protein sequences in the database aligned so well with the PRDX1 protein sequence that only very low expect values altered the output.

Figure 5.5 Comparison - Varying Expect Values

[Chart: "Performance on the Basis of Expect Value"; x-axis: Expect Value; y-axis: No. of Hits; series: BlastN, BlastP, BlastX, TBlastN, TBlastX, PSI-Blast]

Figure 5.6 Comparison - Varying Expect Values

Lowering the expect value by a factor of 100 makes little difference to the number of hits in BlastP, BlastX, TBlastN, and TBlastX; the variation appears only when the expect value is reduced by a large factor. As can be seen from the graph, although the same input parameters are given to all the variants, PSI-BLAST and BLASTN return the most hits.

5.3.2 Comparison On The Basis of Word Size

Similar to the above experiment, a BlastN search was performed using PRDX1 mRNA. This time, the expect value was held constant at 10 while the word size was changed (7, 11, 15). The other settings, namely the nr database and the low complexity filter, were kept the same. The following results were observed.

Word Size (w)   BlastN
7               163
11              163
15              139

Table 5.2 Number of hits for varying word size (BlastN)

[Chart: "Performance of BlastN on the Basis of Varying Word Size"; x-axis: Word Size; y-axis: No. of Hits]

Figure 5.7 Varying Word Size for BlastN

The results showed that word sizes of 7 and 11 both returned 163 hits, while a word size of 15 returned only 139 hits. Word size is a measure of how many items, nucleotides in this case, are taken and compared to the database. With a word size of 11, a group of 11 sequential nucleotides is compared with the database. The larger the word size, the more stringent the analysis. That is why a word size of 15 returned fewer results.
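Why a larger word size is more stringent can be seen in a small sketch of exact word matching. The sequences below are invented for illustration, and `words` and `count_seed_hits` are hypothetical helpers; real BlastN seeding uses an indexed lookup table rather than this direct scan.

```python
def words(seq, w):
    """Set of all length-w words (substrings) in a sequence."""
    return {seq[i:i + w] for i in range(len(seq) - w + 1)}


def count_seed_hits(query, subject, w):
    """Count subject positions whose length-w word occurs exactly in the query."""
    query_words = words(query, w)
    return sum(
        1
        for i in range(len(subject) - w + 1)
        if subject[i:i + w] in query_words
    )


# Hypothetical query and database sequence sharing some stretches.
query = "ACGTACGTGACCTGATTGCA"
subject = "TTACGTACGAGACCTGATTGGA"

for w in (4, 7, 11):
    print(f"word size {w}: {count_seed_hits(query, subject, w)} seed hits")
```

Every word of length w + 1 that matches contains a matching word of length w, so the number of seed hits can only stay the same or fall as the word size grows.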

[Chart: "Performance on the Basis of Word Size"; x-axis: Word Size (7, 11, 15); y-axis: No. of Hits (125 to 165); series: BlastN]

Figure 5.8 Varying Word Size for BlastN

Word size can also be varied in BlastP, BlastX, TBlastX, TBlastN and PSI-Blast. In the next comparison, PRDX1 protein was searched against the protein database using a constant expect value (1e-70), database (nr), and filter (low complexity). The word size was varied between 2 and 3.

Word size (w)   BlastP   BlastX   TBlastX   TBlastN   PSI-Blast
2               58       100      100       115       501
3               58       100      57        115       501

Table 5.3 Number of hits for varying word size

[Chart: "Performance on the basis of varying word size"; x-axis: word size; y-axis: no. of hits (0 to 600); series: word size = 2, word size = 3]

Figure 5.9 Varying Word Size for variants

[Chart: "Performance on the Basis of Word Size"; x-axis: Word Size; y-axis: No. of Hits; series: BlastP, BlastX, TBlastX, TBlastN, PSI-Blast]

Figure 5.10 Varying Word Size for variants

Varying the word size between 2 and 3 does not affect the number of hits for BlastP, BlastX, TBlastN, and PSI-Blast. It affects only TBlastX, whose number of hits declines as the word size increases.

5.3.3 Comparison on the Basis of Execution Time

The variants were executed on 32-bit and 64-bit processors, and their execution time in seconds was compared for one and two CPUs, as shown below.

TEST      NUMBER OF CPUs   32-BIT TIME (in seconds)   64-BIT TIME (in seconds)
blastX    1                1516                       1085
blastX    2                751                        550
blastN    1                297                        252
blastN    2                153                        132
tblastX   1                4999                       3545
tblastX   2                2761                       1940

Table 5.4 Varying Execution Time
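The timings in Table 5.4 can be turned into speedup ratios with a short script. The numbers are copied from the table; the dictionary layout and the helper names `bit_speedup` and `cpu_speedup` are illustrative assumptions.

```python
# (program, number of CPUs) -> (32-bit time, 64-bit time), in seconds.
TIMES = {
    ("blastX", 1): (1516, 1085),
    ("blastX", 2): (751, 550),
    ("blastN", 1): (297, 252),
    ("blastN", 2): (153, 132),
    ("tblastX", 1): (4999, 3545),
    ("tblastX", 2): (2761, 1940),
}


def bit_speedup(program, cpus):
    """How many times faster the 64-bit run is than the 32-bit run."""
    t32, t64 = TIMES[(program, cpus)]
    return t32 / t64


def cpu_speedup(program, column):
    """How many times faster two CPUs are than one (column 0 = 32-bit, 1 = 64-bit)."""
    return TIMES[(program, 1)][column] / TIMES[(program, 2)][column]


for program in ("blastX", "blastN", "tblastX"):
    print(
        f"{program}: 64-bit speedup (1 CPU) = {bit_speedup(program, 1):.2f}, "
        f"dual-CPU speedup (32-bit) = {cpu_speedup(program, 0):.2f}"
    )

# Fastest program on a single 32-bit CPU.
fastest = min(("blastX", "blastN", "tblastX"), key=lambda p: TIMES[(p, 1)][0])
print("fastest on one 32-bit CPU:", fastest)
```

Expressing the table as ratios makes it easier to see how much each program gains from the second CPU and from the 64-bit processor.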

From the graph, it is clear that BlastN takes less time to execute than the other variants. TBlastX is the slowest of all, whether it is executed on a 32-bit or a 64-bit processor. The performance of BlastX lies between the two. The observations are represented in the graph below:

[Chart: vertical axis 0 to 6000; series: Single CPU 32-bit, Dual CPU 32-bit, Single CPU 64-bit, Dual CPU 64-bit]

Figure 5.11 Performance of BLAST on 32-bit and 64-bit processors


Summary: The variants of Blast (BlastN, BlastP, BlastX, TBlastN, TBlastX, and PSI-Blast) run with different parameters and different algorithms, and each tool has different performance criteria. Performance differs on the basis of parameters such as word size, expect value, and the databases available. By selecting different values, the efficiency of each tool can be improved. In this chapter, performance was examined on the basis of execution time, varying parameters, and algorithm comparison. From these results, we can decide which tool should be used in which situation.


CHAPTER 6 CONCLUSION AND FUTURE SCOPE

6.1 CONCLUSION

From the plethora of tools available for data mining in bioinformatics, Blast was chosen for its unmatched speed, sensitivity and accuracy. Although the performance of BLAST is the best, variants of BLAST are available to suit different conditions. Several parameters are contextually related to areas other than algorithm design and computer science, so the analysis of parameters was limited to the point of view of a computer engineer. That is why improvements to only some of the parameters are suggested.

Firstly, to improve the speed of BlastP, its word size should be increased, as in BlastN. Word size strongly affects database searching: search time falls as the word size grows. Decreasing the word size increases sensitivity but decreases the speed of the search program. The word size for BlastP is very small compared to BlastN; in BlastP it is 3 residues. For BlastP, a large number of hits are found in the database during the second step of the algorithm because of the small size of the seed, so more time is spent on the search. In BlastN, the seed is 11 nucleotides long, and it is difficult to find many exact matches for such a large seed. Results are therefore displayed in less time than with BlastP and fewer hits are found, but sensitivity decreases in BlastN.

Secondly, an improvement for BlastP, BlastX, TBlastX, and PSI-Blast: for nucleotide BLAST there is MegaBlast, and there should be a comparable MegaBlast for protein searches. MegaBlast is optimized for aligning sequences that differ slightly as a result of sequencing or other similar "errors". MegaBlast efficiently handles much longer DNA sequences than the BlastN program of the traditional BLAST algorithm. When a larger word size is used, it is up to 10 times faster than more common sequence similarity programs.

Lastly, each of the variants has some advantages and disadvantages. Regular exploration and improvement are needed for better efficiency of these tools. Some features are available only in nucleotide-based tools and are absent in protein-based versions. By continuously evaluating the performance and exploring the features of each tool, improvements are being made in this area.

6.2 FUTURE SCOPE

Over the past decade many biological tools have been developed, but improvements are still needed in these tools to improve their speed and accuracy. Research on improving existing tools is ongoing, including:

• Examination of the problems arising from the use of biological tools.

• How the execution of code affects the performance of a tool, and what modifications can be made to the source code.

By modifying the existing parameters and source code, speed will increase and the field of bioinformatics will emerge with a more dynamic scope.

“Measurement and Analysis is the key to Development and Improvement”

So with continuous evaluation of existing versions of biological tools, further improvements will be possible.


LIST OF PUBLICATIONS

1. Ms. Inderveer Chana, Harpreet Kaur, Navjot Kaur, “Issues Of Software Engineering and Knowledge Engineering In Bioinformatics”, in National Conference of Bioinformatics Computing, held at T.I.E.T, Patiala, on 18th March – 19th March.

2. Mrs. Rinkle Aggarwal, Navjot Kaur, Harpreet Kaur, “Algorithmic and Non-Algorithmic Issues In Database Search Of Sequence databases”, in National Conference of Bioinformatics Computing, held at T.I.E.T, Patiala, on 18th March – 19th March.


GLOSSARY

Algorithm: A fixed procedure embodied in a computer program. The Basic Local

Alignment Search Tool or BLAST is a sequence comparison algorithm that NCBI

uses to search sequence databases for optimal local alignments with a query sequence.

FASTA is another type of algorithm used for database similarity searching.

Alignment: The process of lining up two or more sequences to achieve maximal

levels of identity (and conservation, in the case of amino acid sequences) for the

purpose of assessing the degree of similarity and the possibility of homology.

Codon: The sequence of nucleotides, coded in triplets (codons) along the mRNA, that

determines the sequence of amino acids in protein synthesis. A gene's DNA sequence

can be used to predict the mRNA sequence, and the genetic code can in turn be used

to predict the amino acid sequence.

EST (expressed sequence tag): A short strand of DNA that is a part of a cDNA

molecule and can act as identifier of a gene. Used in locating and mapping genes.

Exons: DNA segments of a gene that encode the amino acid sequence of a protein.

Gap: A space introduced into an alignment to compensate for insertions and deletions

in one sequence relative to another. To prevent the accumulation of too many gaps in

an alignment, introduction of a gap causes the deduction of a fixed amount (the gap

score) from the alignment score. Extension of the gap to encompass additional

nucleotides or amino acids is also penalized in the scoring of an alignment.

Global Alignment: The alignment of two nucleic acid or protein sequences over their

entire length.

Homology: Similarity attributed to descent from a common ancestor.


HSP: High-scoring segment pair. Local alignments with no gaps that achieve one of

the top alignment scores in a given search.

Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.

Introns: Noncoding DNA sequences that interrupt the sequences containing

instructions for making a protein (exons). Introns are not represented in messenger

RNA; only the exons are translated into protein. The function of introns is still being.

Local Alignment: The alignment of some portion of two nucleic acid or protein sequences.

Sensitivity: The ability to detect ‘true positives’, i.e. correct matches. The most sensitive search finds all true matches, but might have lots of ‘false positives’, i.e. erroneous matches detected. Sensitivity can be defined as the probability of finding the matches such that the query and the matched database sequences have at least x% similarity.

Similarity: The extent to which nucleotide or protein sequences are related. The

extent of similarity between two sequences can be based on percent sequence identity

and/or conservation. In BLAST similarity refers to a positive matrix score.

Specificity: The ability to reject ‘false positives’. The most specific search will return only true matches, but might have lots of ‘false negatives’, i.e. missed correct matches.
