Upload: amna-jalil, posted 15-Jul-2015
BIOINFORMATICS


Contents

Overview of Bioinformatics

Bioinformatics

Path to Bioinformatics

History of Bioinformatics

Ways of modeling Bioinformatic approaches

Static

Dynamic

Structural Bioinformatics

Major Research Areas

Sequence Analysis

Sequence Alignment

Profile Comparison

Sequence Assembly

Gene Prediction

Protein Structure Prediction

Genome Annotation


Computational Evolutionary Biology

Literature Analysis

Analysis of Gene Expression

Analysis of Regulation

Analysis of Protein Expression

Analysis of Mutations in Cancer

Comparative Genomics

Modeling Biological Systems

High-throughput Image Analysis

Structural Bioinformatic Approaches

Application of Bioinformatics in Various Fields

Molecular medicine

More drug targets

Personalised medicine

Preventative medicine

Gene therapy

Microbial genome applications

Waste cleanup

Climate change

Alternative energy sources

Biotechnology


Antibiotic resistance

Forensic analysis of microbes

The reality of bioweapon creation

Evolutionary studies

Agriculture

Crops

Insect resistance

Improve nutritional quality

Grow drought-resistant crops in poorer soils

Animals

Comparative studies

Biological Databases

Types of Bioinformatics Tools

Homology and Similarity Tools

Protein Function Analysis

Structural Analysis

Sequence Analysis

Bioinformatics Tools

BLAST

FASTA


EMBOSS

Clustalw

RASMOL

CHEMSKETCH

WINCOOT

AUTODOCK

SWISS PDB VIEWER

Application Programs

JAVA in Bioinformatics

Perl in Bioinformatics

XML in Bioinformatics

C in Bioinformatics

C++ in Bioinformatics

Python in Bioinformatics

R in Bioinformatics

MySQL in Bioinformatics

SQL in Bioinformatics

CUDA in Bioinformatics

MATLAB in Bioinformatics

Microsoft Excel in Bioinformatics


Bioinformatics Projects

BioJava

BioPerl

BioXML


BIOINFORMATICS

Bioinformatics is a combination of computer technology and biology, used to manage the immense amount of biological data, primarily genetic data, that has been produced over the years. It entails developing, managing and updating the databases that serve as warehouses of biological data, as well as developing the tools used to analyze and exploit this information. The details gathered are then used for the discovery and development of gene-based drugs.


Overview of Bioinformatics

Introduction

Biology is in the middle of a major paradigm shift driven by computing technology.

Although it is already an informational science in many respects, the field is rapidly becoming much more computational and analytical. Rapid progress in genetics

and biochemistry research combined with the tools provided by modern biotechnology

has generated massive volumes of genetic and protein sequence data.

Bioinformatics has been defined as a means for analysing, comparing, graphically

displaying, modeling, storing, systemising, searching, and ultimately distributing

biological information, which includes sequences, structures, function, and phylogeny.

Thus bioinformatics may be defined as a discipline that generates computational tools,

databases, and methods to support genomic and postgenomic research. It comprises the

study of DNA structure and function, gene and protein expression, protein production,

structure and function, genetic regulatory systems, and clinical applications.

Bioinformatics needs the expertise from Computer Science, Mathematics, Statistics,

Medicine, and Biology.

Knowledge Base in Biology

In the last 10 years or so, numerous innovations have seen the light of day, and the consequence is a new biological research paradigm, one that is information-heavy and computer-driven. As genetic information is compiled into computerized databases whose sizes grow steadily, molecular biologists need effective and efficient computational tools: to store and retrieve the cognate information, such as bibliographic or biological information, from the databases; to analyze the sequence patterns they contain; and to extract the biological knowledge the sequences carry. On the other hand, there is a strong need for mathematical methods and computational techniques for challenging tasks such as predicting the three-dimensional structure of the molecules the sequences represent and constructing evolutionary trees from the sequence data. These tools will also be used to learn basic facts about biology, such as which sequences of DNA code for proteins and which other combinations of DNA are not used for protein synthesis, for a greater understanding of genes and how they influence diseases.

Biology employs a digital language to represent its information, using a four-letter alphabet (A, C, G, T). All the chromosomes in an organism's cells are represented and identified using these letters. The demanding challenge is to determine how this digital language of the chromosomes is converted into the three-dimensional, and sometimes four-dimensional, languages of living, breathing organisms.

Information Technology in Biology

As performing all the above-mentioned tasks manually is nearly impossible, given the massive volumes of biological data and the precision the work demands, it became mandatory to use computers for these purposes. Bioinformatics therefore deals with designing and deploying efficient software tools to accomplish these tasks in a fast and precise manner. Bridging the gap between the real world of biology and the precise, logical nature of computers requires an interdisciplinary perspective.


Software and Hardware Advancements in Biology

The tools of computer science, statistics, and mathematics are very critical for studying

biology as an informational science subject.

Some of the recent advances include improved DNA sequencing methods,

new approaches to identify protein structure, and revolutionary methods to monitor the

expression of many genes in parallel. The design of techniques able to deal with different

sources of incomplete and noisy data has become another crucial goal for the

bioinformatics community. In addition, there is the need to implement computational

solutions based on theoretical frameworks to allow scientists to perform complex

inferences about the phenomena under study.

Genomics in the recent past has triggered the development of high-throughput

instrumentation for DNA sequencing, DNA arrays, genotyping, proteomics, etc. These

instruments have catalyzed a new type of science for biology termed discovery science.


Human Genome Project - An Introduction

The Human Genome Project has encouraged a series of paradigm changes to the view

that biology is an informational science. The draft of the human genome has given us a

genetics parts list of what is necessary for building a human: approximately 35,000

genes, their regulatory regions, a lexicon of motifs that are the building block

components of proteins and genes, and access to the human variability that makes us each different from one another.

Genomes - Discovering Methodology and Study

Discovery science defines all of the elements in a biological system: for example, the sequence of the genome and the identification and quantitation of all of the mRNAs or proteins in a particular cell type (respectively, the genome, the transcriptome, and the proteome).

Discovery science creates databases of information, in contrast to the more classical

hypothesis-driven science that formulates hypotheses and attempts to test them. The high-

throughput tools both provide the means for discovery science and can assay how global

information sets, for example transcriptomes or proteomes, change as systems are

perturbed.

The genomes of the model organisms yeast, worm, fly etc., have demonstrated the

fundamental conservation among all living organisms of the basic informational

pathways. Hence systems can be perturbed in model organisms to gain insight into their

functioning, and these data will provide fundamental insights into human biology. From

the genome, the information pathways and networks can be extracted to begin

understanding the logic of life. Furthermore, different genomes can be compared to

identify similarities and differences in the strategies for the logic of life and these provide

fundamental insights into development, physiology and evolution. The first eukaryotic

genome that has been fully sequenced and annotated is Saccharomyces cerevisiae. This


greatly helps in developing biological and computational tools for genomic and postgenomic

research.

In the era of automated DNA sequencing and revolutionary advances in DNA

sequence analysis, the attention of many researchers is now shifting away from the study

of single genes or small gene clusters to whole genome analyses. Knowing the complete

sequence of a genome is only the first step in understanding how the myriad of

information contained within the genes is transcribed and ultimately translated into

functional proteins. In the post genomic era, functional genomics and proteomic studies

help to obtain an image of the dynamic cell.

Systems Biology

Biology is a highly informational science. There are mainly two types of biological

information.

1. The information of genes or proteins, which are the molecular machines of life.

2. The information of the regulatory networks that coordinate and specify the

expression patterns of the genes and proteins.

All biological information is hierarchical. DNA is transcribed into mRNA, which in turn is translated into protein. Proteins interact with one another, creating informational pathways. These pathways form informational networks, which in turn operate within cells. Cells form networks of cells, and an individual is a collection of cells. A host of individuals forms a population, and a variety of populations becomes an ecology. This hierarchy presents a primary challenge for researchers and scientists: to create tools and mechanisms that capture these different levels of biological information and integrate them to gain insight into their functioning.
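The DNA-to-mRNA-to-protein flow described above can be sketched in a few lines of code. This is a toy illustration only: the codon table is deliberately partial (a real table has 64 entries) and the input sequence is hypothetical.

```python
# Minimal sketch of the central-dogma flow: DNA -> mRNA -> protein.
# CODON_TABLE is deliberately partial; the input sequence is made up.

CODON_TABLE = {
    "AUG": "M",  # methionine (start)
    "GCA": "A",  # alanine
    "UAA": "*",  # stop
}

def transcribe(dna: str) -> str:
    """Transcribe a DNA coding strand into mRNA (replace T with U)."""
    return dna.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Translate mRNA codons into a protein string, stopping at a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        aa = CODON_TABLE.get(mrna[i:i + 3], "?")
        if aa == "*":  # a stop codon terminates translation
            break
        protein.append(aa)
    return "".join(protein)

mrna = transcribe("ATGGCATAA")
print(mrna)             # AUGGCAUAA
print(translate(mrna))  # MA
```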

All of these paradigm shifts lead to the view that the major challenges for biology and medicine in this new century will be the study of complex systems and the approaches necessary for studying these biological complexities. Here is a viable approach:


1. Identify all elements in the system, such as the sequence of its genome, with currently available discovery tools.

2. Use current knowledge of the system to formulate a model predicting its behavior.

3. Perturb the system in a model organism using biological, genetic or environmental perturbations; capture information at all relevant levels, such as DNA, mRNA, protein and protein interactions; and integrate the collected information.

4. Compare theoretical predictions and experimental data, carry out additional perturbations to bring theory and experiment into closer agreement, and integrate the new data into the model.

Iterate steps 3 and 4 until the mathematical model can predict the structure of the system and its systems (emergent) properties given particular perturbations.
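The iterative loop above can be written schematically. The "system" and "model" below are hypothetical toy stand-ins (the real work involves wet-lab perturbations and far richer models), but the control flow mirrors the perturb-measure-predict-update cycle of steps 3 and 4.

```python
# Toy sketch of the iterative systems-biology loop. ToySystem and ToyModel
# are invented stand-ins: the "system" responds to a perturbation x with
# 2*x, and the model refines its (initially wrong) slope from observations.

class ToySystem:
    def measure(self, x):
        return 2.0 * x  # ground-truth response to perturbation x

class ToyModel:
    def __init__(self):
        self.slope = 1.0  # initial (wrong) hypothesis
    def predict(self, x):
        return self.slope * x
    def update(self, x, observed):
        self.slope = observed / x  # integrate new data into the model

def refine(model, system, perturbations, tolerance=1e-6):
    """Iterate steps 3 and 4 until prediction matches experiment."""
    for x in perturbations:            # step 3: perturb the system
        observed = system.measure(x)   # capture information
        predicted = model.predict(x)   # step 4: compare theory with data
        if abs(predicted - observed) < tolerance:
            break                      # theory and experiment agree
        model.update(x, observed)      # integrate data, then iterate
    return model

model = refine(ToyModel(), ToySystem(), perturbations=[1.0, 2.0, 3.0])
print(model.predict(5.0))  # 10.0
```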

Systems Biology - Challenges Ahead

The integration of technology, biology, and computation.

The integration of the various levels of biological information and their modeling.

The proper annotation of biological information and its storage and integration in databases.

The inclusion of other molecules, large and small, in the systems approach.

The integration imperatives of systems biology present many challenges to industry and academia.

Conclusion

With the confluence of biology and computer science, the computer applications of molecular biology are drawing greater attention among life science researchers and scientists these days. As it becomes imperative for biologists to seek the help of information technology professionals to meet the ever-growing computational requirements of a host of exciting and pressing biological problems, the synergy between modern biology and computer science is set to blossom in the days to come. Thus the


research scope for all the mathematical techniques and algorithms, coupled with software programming languages and software development and deployment tools, is set to get a real boost. In addition, information technologies such as databases, middleware, graphical user interface (GUI) design, distributed object computing, storage area networks (SAN), data compression, networking and communication, and remote management are all set to play a very critical role in taking forward the goals for which the bioinformatics field came into existence.


Bioinformatics

Bioinformatics is the application of computer science and information technology to

the field of biology and medicine. Bioinformatics deals with algorithms, databases and

information systems, web technologies, artificial intelligence and soft computing,

information and computation theory, structural biology, software engineering, data

mining, image processing, modeling and simulation, signal processing, discrete

mathematics, control and system theory, circuit theory, and statistics. Bioinformatics

generates new knowledge as well as the computational tools to create that knowledge.

Bioinformatics:

The use and development of mathematical algorithms and computer programs to obtain insight into biological and medical systems.


The National Center for Biotechnology Information

(NCBI 2001) defines bioinformatics as:

"Bioinformatics is the field of science in which biology, computer science, and

information technology merge into a single discipline. There are three important sub-

disciplines within bioinformatics: the development of new algorithms and statistics with

which to assess relationships among members of large data sets; the analysis and

interpretation of various types of data including nucleotide and amino acid sequences,

protein domains, and protein structures; and the development and implementation of tools

that enable efficient access and management of different types of information."

From Webopedia:

The application of computer technology to the management of biological information.

Specifically, it is the science of developing computer databases and algorithms to

facilitate and expedite biological research. Bioinformatics is being used largely in the

field of human genome research by the Human Genome Project that has been

determining the sequence of the entire human genome (about 3 billion base pairs) and is

essential in using genomic information to understand diseases. It is also used largely for

the identification of new molecular targets for drug discovery.


The three terms bioinformatics, computational biology and bioinformation infrastructure are often used interchangeably. These three may be defined as

follows:

Bioinformatics refers to database-like activities, involving persistent sets of

data that are maintained in a consistent state over essentially indefinite periods of

time;

Computational biology encompasses the use of algorithmic tools to

facilitate biological analyses; while

Bioinformation infrastructure comprises the entire collective of

information management systems, analysis tools and communication networks

supporting biology. Thus, the latter may be viewed as a computational scaffold of

the former two.

Path to Bioinformatics

First Learn Biology.

Decide and pick a problem that interests you for experiment.

Find and learn about the Bioinformatics tools.

Learn the Computer Programming Languages.

Experiment on your computer and learn different programming techniques.

The computer has become an essential tool for the biologist, just like the microscope. Eventually bioinformatics will become an integral part of biology.


History of Bioinformatics

Modern bioinformatics can be classified into two broad categories: biological science and computational science. Below is a timeline of historical events for both biology and computer science.

The history of biology before the discovery of genetic inheritance by G. Mendel in 1865 is extremely sketchy and inaccurate. That discovery marks the start of the history of bioinformatics. Gregor Mendel is known as the "Father of Genetics". He experimented on the cross-fertilization of different colors of the same species, carefully recording and analyzing his data. Mendel showed that the inheritance of traits could be more easily explained if it was controlled by factors passed down from generation to generation.

The understanding of genetics has advanced remarkably in the last thirty years. In

1972, Paul Berg made the first recombinant DNA molecule using ligase. In that same

year, Stanley Cohen, Annie Chang and Herbert Boyer produced the first recombinant

DNA organism. In 1973, two important things happened in the field of genomics. The advancement of computing in the 1960s and 1970s resulted in the basic methodology of bioinformatics. However, it was in the 1990s, when the Internet arrived, that the full-fledged bioinformatics field was born.

Here are some of the major events in bioinformatics over the last several decades. Many of the events listed occurred long before the term "bioinformatics" was coined.


Bioinformatics Events

1665

Robert Hooke published Micrographia, described the cellular

structure of cork. He also described microscopic examinations

of fossilized plants and animals, comparing their microscopic

structure to that of the living organisms they resembled. He

argued for an organic origin of fossils, and suggested a

plausible mechanism for their formation.

1683 Antoni van Leeuwenhoek discovered bacteria.

1686

John Ray, in his book "Historia Plantarum", catalogued and described 18,600 kinds of plants. His book gave the first definition of species based upon common descent.

1843 Richard Owen elaborated the distinction of homology and

analogy.

1864 Ernst Haeckel (Häckel) outlined the essential elements of

modern zoological classification.

1865 Gregor Mendel (1822-1884), Austria, established the theory of genetic inheritance.

1902 The chromosome theory of heredity is proposed by Sutton

and Boveri, working independently.

1905 The word "genetics" is coined by William Bateson.

1913 First ever linkage map created by Columbia undergraduate

Alfred Sturtevant (working with T.H. Morgan).

1930

Arne Tiselius, Uppsala University, Sweden, introduced a new technique, electrophoresis, for separating proteins in solution: "The moving-boundary method of studying the electrophoresis of proteins" (published in Nova


Acta Regiae Societatis Scientiarum Upsaliensis, Ser. IV, Vol. 7,

No. 4)

1946 Genetic material can be transferred laterally between

bacterial cells, as shown by Lederberg and Tatum.

1952

Alfred Day Hershey and Martha Chase proved that the DNA

alone carries genetic information. This was proved on the

basis of their bacteriophage research.

1961 Sidney Brenner, François Jacob, Matthew Meselson, identify

messenger RNA.

1962 Pauling's theory of molecular evolution

1965 Margaret Dayhoff's Atlas of Protein Sequences

1970 Needleman-Wunsch algorithm

1977 DNA sequencing and software to analyze it (Staden)

1981 Smith-Waterman algorithm developed

1981 The concept of a sequence motif (Doolittle)

1982 GenBank Release 3 made public

1982 Phage lambda genome sequenced

1983 Sequence database searching algorithm (Wilbur-Lipman)

1985 FASTP/FASTN: fast sequence similarity searching

1988 National Center for Biotechnology Information (NCBI) created

at NIH/NLM

1988 EMBnet network for database distribution


1990 BLAST: fast sequence similarity searching

1991 EST: expressed sequence tag sequencing

1993 Sanger Centre, Hinxton, UK

1994 EMBL European Bioinformatics Institute, Hinxton, UK

1995 First bacterial genomes completely sequenced

1996 Yeast genome completely sequenced

1997 PSI-BLAST

1998 Worm (multicellular) genome completely sequenced

1999 Fly genome completely sequenced

2000

Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL. The large-

scale organization of metabolic networks. Nature 2000 Oct

5;407(6804):651-4, PubMed

2000 The genome for Pseudomonas aeruginosa (6.3 Mbp) is

published.

2000 The A. thaliana genome (100 Mb) is sequenced.

2001 The human genome (3 Giga base pairs) is published.

Description

At the beginning of the "genomic revolution", bioinformatics was applied to the creation and maintenance of databases storing biological information, such as nucleotide and amino acid sequences. Development of this type of database involved not only design issues but also the development of complex interfaces through which researchers could access existing data as well as submit new or revised data.


In order to study how normal cellular activities are altered in different disease states,

the biological data must be combined to form a comprehensive picture of these activities.

Therefore, the field of bioinformatics has evolved such that the most pressing task now

involves the analysis and interpretation of various types of data. This includes nucleotide

and amino acid sequences, protein domains, and protein structures. The actual process of

analyzing and interpreting data is referred to as computational biology. Important sub-

disciplines within bioinformatics and computational biology include:

The development and implementation of tools that enable efficient access to, and

use and management of, various types of information.

The development of new algorithms (mathematical formulas) and statistics with

which to assess relationships among members of large data sets. For example,

methods to locate a gene within a sequence, predict protein structure and/or

function, and cluster protein sequences into families of related sequences.
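One such method, locating a gene within a sequence, can be sketched as a scan of each forward reading frame for an ATG start codon followed by a stop codon. This is a deliberately simplified illustration (real gene finders model splicing, both strands, and statistical signals), and the example sequence is hypothetical.

```python
# Minimal open-reading-frame (ORF) finder: scan the three forward reading
# frames for an ATG ... stop stretch. A toy sketch, not a real gene finder.

STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=2):
    """Return (start, end) index pairs of ATG-to-stop stretches."""
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i                       # remember the start codon
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))  # end is past the stop codon
                start = None
    return orfs

print(find_orfs("CCATGGCATTTTAACC"))  # [(2, 14)]
```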

The primary goal of bioinformatics is to increase the understanding of biological

processes. What sets it apart from other approaches, however, is its focus on developing

and applying computationally intensive techniques to achieve this goal. Examples

include:

Pattern recognition

Data mining

Machine learning algorithms

Visualization

Major research efforts in the field include

Sequence alignment

Gene finding

Genome assembly

Drug design

Drug discovery


Protein structure alignment

Protein structure prediction

Prediction of gene expression

Protein–protein interactions

Genome-wide association studies

Modeling of evolution

Interestingly, the term bioinformatics was coined before the "genomic revolution".

Paulien Hogeweg and Ben Hesper introduced the term in 1978 to refer to "the study of

information processes in biotic systems". This definition placed bioinformatics as a field

parallel to biophysics or biochemistry (biochemistry is the study of chemical processes in

biological systems). However, its primary use since at least the late 1980s has been to

describe the application of computer science and information sciences to the analysis of

biological data, particularly in those areas of genomics involving large-scale DNA

sequencing.

Bioinformatics now entails the creation and advancement of databases, algorithms,

computational and statistical techniques and theory to solve formal and practical

problems arising from the management and analysis of biological data.

Over the past few decades rapid developments in genomic and other molecular

research technologies and developments in information technologies have combined to

produce a tremendous amount of information related to molecular biology.

Bioinformatics is the name given to these mathematical and computing approaches used

to glean understanding of biological processes.

Common activities in bioinformatics include mapping and analyzing DNA and protein

sequences, aligning different DNA and protein sequences to compare them, and creating

and viewing 3-D models of protein structures.


Ways of modeling Bioinformatic approaches

There are two fundamental ways of modelling a biological system (e.g., a living cell), both coming under bioinformatic approaches.

1. Static

2. Dynamic

1. Static

Sequences – Proteins, Nucleic acids and Peptides

Structures – Proteins, Nucleic acids, Ligands (including metabolites and drugs)

and Peptides

Interaction data among the above entities including microarray data and Networks

of proteins, metabolites

2. Dynamic

Systems Biology comes under this category including reaction fluxes and variable

concentrations of metabolites

Multi-Agent Based modelling approaches capturing cellular events such as

signalling, transcription and reaction dynamics
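The "dynamic" view of variable metabolite concentrations can be illustrated with a tiny simulation: a single metabolite A degraded at rate k, so dA/dt = -k * A, integrated with Euler steps. The rate constant, time step, and initial concentration are all arbitrary illustrative values.

```python
# Toy dynamic model: first-order decay of one metabolite, dA/dt = -k * A,
# integrated with explicit Euler steps. Parameter values are made up.

def simulate_decay(a0, k, dt, steps):
    """Return the concentration trajectory of A under first-order decay."""
    trajectory = [a0]
    a = a0
    for _ in range(steps):
        a += dt * (-k * a)  # Euler update for dA/dt = -k * A
        trajectory.append(a)
    return trajectory

traj = simulate_decay(a0=1.0, k=0.5, dt=0.1, steps=3)
print([round(a, 4) for a in traj])  # concentrations decay toward zero
```

Real systems-biology models couple many such equations (one per metabolite, with shared reaction fluxes) and use stiffer, adaptive integrators, but the idea is the same.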

A broad sub-category under bioinformatics is structural bioinformatics.

Structural Bioinformatics

Structural bioinformatics is the branch of bioinformatics which is related to the

analysis and prediction of the three-dimensional structure of biological macromolecules


such as proteins, RNA, and DNA. It deals with generalizations about macromolecular 3D

structure such as comparisons of overall folds and local motifs, principles of molecular

folding, evolution, and binding interactions, and structure/function relationships, working

both from experimentally solved structures and from computational models. The term

structural has the same meaning as in structural biology, and structural bioinformatics can

be seen as a part of computational structural biology.

Major Research Areas

1. Sequence Analysis

In bioinformatics, the term sequence analysis refers to the process of subjecting a

DNA, RNA or peptide sequence to any of a wide range of analytical methods to

understand its features, function, structure, or evolution. Methodologies used include

sequence alignment, searches against biological databases, and others. Since the

development of methods of high-throughput production of gene and protein sequences,

the rate of addition of new sequences to the databases increased exponentially. Such a

collection of sequences does not, by itself, increase the scientist's understanding of the

biology of organisms. However, comparing these new sequences to those with known


functions is a key way of understanding the biology of an organism from which the new

sequence comes. Thus, sequence analysis can be used to assign function to genes and

proteins by the study of the similarities between the compared sequences. Nowadays,

there are many tools and techniques that provide the sequence comparisons (sequence

alignment) and analyze the alignment product to understand its biology.

Sequence analysis in molecular biology includes a very wide range of relevant topics:

1. The comparison of sequences in order to find similarity, and often to infer whether they are related (homologous).

2. Identification of intrinsic features of the sequence such as active sites, post

translational modification sites, gene-structures, reading frames, distributions

of introns and exons and regulatory elements.

3. Identification of sequence differences and variations, such as point mutations and single nucleotide polymorphisms (SNPs), in order to obtain genetic markers.

4. Revealing the evolution and genetic diversity of sequences and organisms.

5. Identification of molecular structure from sequence alone.
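Item 3 above, finding point differences between two sequences, can be sketched very simply once the sequences are aligned to equal length. The sequences below are hypothetical, and a real SNP pipeline works on alignments of many reads, not two bare strings.

```python
# Minimal sketch of listing point differences (candidate SNP positions)
# between two pre-aligned, equal-length sequences. Inputs are made up.

def point_differences(seq_a, seq_b):
    """Return (position, base_a, base_b) wherever the sequences disagree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned and equal length")
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(seq_a, seq_b))
            if a != b]

print(point_differences("ACGTACGT", "ACCTACGA"))  # [(2, 'G', 'C'), (7, 'T', 'A')]
```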

In chemistry, sequence analysis comprises techniques used to determine the sequence of a polymer formed of several monomers. In molecular biology and genetics, the same process is called simply "sequencing".

In marketing, sequence analysis is often used in analytical customer relationship

management applications, such as NPTB models.

History

Since the first sequence of the insulin protein was characterised by Fred Sanger in 1951, biologists have been trying to use this knowledge to understand the function of molecules.


I. Sequence Alignment

In bioinformatics, a sequence alignment is a way of arranging the sequences of DNA,

RNA, or protein to identify regions of similarity that may be a consequence of functional,

structural, or evolutionary relationships between the sequences. Aligned sequences of

nucleotide or amino acid residues are typically represented as rows within a matrix. Gaps

are inserted between the residues so that identical or similar characters are aligned in

successive columns.

A sequence alignment, produced by ClustalW, of two human zinc finger proteins, identified on the left by GenBank accession number.

Key: Single letters: amino acids. Red: small, hydrophobic, aromatic, not Y. Blue: acidic. Magenta: basic. Green: hydroxyl, amine, amide, basic. Gray: others. "*": identical. ":": conserved substitutions (same colour group). ".": semi-conserved substitution (similar shapes).

Sequence alignments are also used for non-biological sequences, such as those present in

natural language or in financial data.

Interpretation

If two sequences in an alignment share a common ancestor, mismatches can be

interpreted as point mutations and gaps as indels (that is, insertion or deletion mutations)

introduced in one or both lineages in the time since they diverged from one another. In

sequence alignments of proteins, the degree of similarity between amino acids occupying

a particular position in the sequence can be interpreted as a rough measure of how


conserved a particular region or sequence motif is among lineages. The absence of

substitutions, or the presence of only very conservative substitutions (that is, the

substitution of amino acids whose side chains have similar biochemical properties) in a

particular region of the sequence, suggests that this region has structural or functional

importance. Although DNA and RNA nucleotide bases are more similar to each other

than are amino acids, the conservation of base pairs can indicate a similar functional or

structural role.

Alignment methods

Very short or very similar sequences can be aligned by hand. However, most

interesting problems require the alignment of lengthy, highly variable or extremely

numerous sequences that cannot be aligned solely by human effort. Instead, human

knowledge is applied in constructing algorithms to produce high-quality sequence

alignments, and occasionally in adjusting the final results to reflect patterns that are

difficult to represent algorithmically (especially in the case of nucleotide sequences).

Computational approaches to sequence alignment generally fall into two categories:

global alignments and local alignments. Calculating a global alignment is a form of

global optimization that "forces" the alignment to span the entire length of all query

sequences. By contrast, local alignments identify regions of similarity within long

sequences that are often widely divergent overall. Local alignments are often preferable,

but can be more difficult to calculate because of the additional challenge of identifying

the regions of similarity. A variety of computational algorithms have been applied to the

sequence alignment problem, including slow but formally correct methods like dynamic

programming, and efficient, heuristic algorithms or probabilistic methods that do not

guarantee optimal matches but are designed for large-scale database searches.


Representations

Alignments are commonly represented both graphically and in text format. In almost

all sequence alignment representations, sequences are written in rows arranged so that

aligned residues appear in successive columns. In text formats, aligned columns

containing identical or similar characters are indicated with a system of conservation

symbols. As in the image above, an asterisk or pipe symbol is used to show identity

between two columns; other less common symbols include a colon for conservative

substitutions and a period for semiconservative substitutions. Many sequence

visualization programs also use color to display information about the properties of the

individual sequence elements; in DNA and RNA sequences, this equates to assigning

each nucleotide its own color. In protein alignments, such as the one in the image above,

color is often used to indicate amino acid properties to aid in judging the conservation of

a given amino acid substitution. For multiple sequences the last row in each column is

often the consensus sequence determined by the alignment; the consensus sequence is

also often represented in graphical format with a sequence logo in which the size of each

nucleotide or amino acid letter corresponds to its degree of conservation.
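The column-wise consensus described above can be sketched in a few lines of Python. This is a toy majority-vote version (the function name and the gap-handling rule are illustrative simplifications; real tools use more nuanced scoring):

```python
from collections import Counter

def consensus(alignment):
    """Majority-vote consensus of a set of aligned, equal-length sequences."""
    out = []
    for column in zip(*alignment):
        counts = Counter(c for c in column if c != "-")  # ignore gap characters
        out.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(out)
```

For example, consensus(["ACGT", "ACGT", "AC-T"]) returns "ACGT", since each column's most frequent non-gap character is taken.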

Sequence alignments can be stored in a wide variety of text-based file formats, many

of which were originally developed in conjunction with a specific alignment program or

implementation. Most web-based tools allow a limited number of input and output

formats, such as the FASTA and GenBank formats, and their output is not easily

editable. Several conversion programs that provide graphical and/or command line

interfaces are available, such as READSEQ and EMBOSS. There are also several

programming packages which provide this conversion functionality, such as BioPerl and

BioRuby.

Global and local alignments

Global alignments, which attempt to align every residue in every sequence, are most

useful when the sequences in the query set are similar and of roughly equal size. (This


does not mean global alignments cannot end in gaps.) A general global alignment

technique is the Needleman–Wunsch algorithm, which is based on dynamic

programming. Local alignments are more useful for dissimilar sequences that are

suspected to contain regions of similarity or similar sequence motifs within their larger

sequence context. The Smith–Waterman algorithm is a general local alignment method

also based on dynamic programming. With sufficiently similar sequences, there is no

difference between local and global alignments.
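As a minimal sketch of the dynamic-programming idea behind the Needleman-Wunsch algorithm, the following Python function computes only the optimal global alignment score; the scoring values are illustrative assumptions, and traceback (recovering the alignment itself) is omitted:

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment score of sequences a and b via dynamic programming."""
    n, m = len(a), len(b)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # aligning a prefix against nothing
        F[i][0] = i * gap              # costs one gap penalty per residue
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align a[i] with b[j]
                          F[i - 1][j] + gap,     # gap in b
                          F[i][j - 1] + gap)     # gap in a
    return F[n][m]
```

For instance, needleman_wunsch("ACGT", "AGT") returns 1: three matches (+3) and one gap (-2), corresponding to the alignment ACGT over A-GT.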

Hybrid methods, known as semiglobal or "glocal" (short for global-local) methods,

attempt to find the best possible alignment that includes the start and end of one or the

other sequence. This can be especially useful when the downstream part of one sequence

overlaps with the upstream part of the other sequence. In this case, neither global nor

local alignment is entirely appropriate: a global alignment would attempt to force the

alignment to extend beyond the region of overlap, while a local alignment might not fully

cover the region of overlap.

Fig: Illustration of global and local alignments demonstrating the 'gappy' quality of global alignments that can occur if sequences are insufficiently similar

Pairwise Alignment

Pairwise sequence alignment methods are used to find the best-matching piecewise

(local) or global alignments of two query sequences. Pairwise alignments can only be

used between two sequences at a time, but they are efficient to calculate and are often

used for methods that do not require extreme precision (such as searching a database for

sequences with high similarity to a query). The three primary methods of producing


pairwise alignments are dot-matrix methods, dynamic programming, and word methods;

however, multiple sequence alignment techniques can also align pairs of sequences.

Although each method has its individual strengths and weaknesses, all three pairwise

methods have difficulty with highly repetitive sequences of low information content,

especially where the number of repetitions differs in the two sequences to be aligned. One

way of quantifying the utility of a given pairwise alignment is the 'maximum unique

match' (MUM), or the longest subsequence that occurs in both query sequences. Longer

MUM sequences typically reflect closer relatedness.
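A true MUM additionally requires the match to be unique in each sequence; as a first approximation, the longest substring shared by two sequences can be found with a small dynamic program over suffix-match lengths (an illustrative sketch, not a MUM implementation):

```python
def longest_common_substring(a, b):
    """Longest contiguous run of characters shared by both sequences."""
    best, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1   # extend the diagonal run
                if cur[j] > best:
                    best, best_end = cur[j], i
        prev = cur
    return a[best_end - best:best_end]
```

Here longest_common_substring("GATTACA", "TTACG") returns "TTAC", the longest run present in both sequences.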

Dot-matrix methods

The dot-matrix approach, which implicitly produces a family of alignments for

individual sequence regions, is qualitative and conceptually simple, though time-

consuming to analyze on a large scale. In the absence of noise, it can be easy to visually

identify certain sequence features—such as insertions, deletions, repeats, or inverted

repeats—from a dot-matrix plot. To construct a dot-matrix plot, the two sequences are

written along the top row and leftmost column of a two-dimensional matrix and a dot is

placed at any point where the characters in the appropriate columns match—this is a

typical recurrence plot. Some implementations vary the size or intensity of the dot

depending on the degree of similarity of the two characters, to accommodate conservative

substitutions. The dot plots of very closely related sequences will appear as a single line

along the matrix's main diagonal.
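The construction just described can be sketched directly in Python. This is a character-by-character toy version (real implementations add sliding windows and similarity thresholds to suppress noise):

```python
def dot_matrix(a, b):
    """Boolean match matrix: rows follow sequence a, columns sequence b."""
    return [[a[i] == b[j] for j in range(len(b))] for i in range(len(a))]

def render(matrix):
    """Draw the plot as text, '*' for a match and '.' otherwise."""
    return "\n".join("".join("*" if cell else "." for cell in row)
                     for row in matrix)
```

For two closely related sequences, the '*' characters line up along the main diagonal; repeats appear as parallel off-diagonal lines.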

Problems with dot plots as an information display technique include: noise, lack of

clarity, non-intuitiveness, difficulty extracting match summary statistics and match

positions on the two sequences. There is also much wasted space where the match data is

inherently duplicated across the diagonal and most of the actual area of the plot is taken

up by either empty space or noise, and, finally, dot-plots are limited to two sequences.

None of these limitations apply to Miropeats alignment diagrams but they have their own

particular flaws.


Dot plots can also be used to assess repetitiveness in a single sequence. A sequence can

be plotted against itself and regions that share significant similarities will appear as lines

off the main diagonal. This effect can occur when a protein consists of multiple similar

structural domains.

Fig: A DNA dot plot of a human zinc finger transcription factor (GenBank ID NM_002383), showing regional self-similarity. The main diagonal represents the sequence's alignment with itself; lines off the main diagonal represent similar or repetitive patterns within the sequence.

This is a typical example of a recurrence plot.

Dynamic programming

The technique of dynamic programming can be applied to produce global alignments

via the Needleman-Wunsch algorithm, and local alignments via the Smith-Waterman

algorithm. In typical usage, protein alignments use a substitution matrix to assign scores

to amino-acid matches or mismatches, and a gap penalty for matching an amino acid in

one sequence to a gap in the other. DNA and RNA alignments may use a scoring matrix,


but in practice often simply assign a positive match score, a negative mismatch score, and

a negative gap penalty. (In standard dynamic programming, the score of each amino acid

position is independent of the identity of its neighbors, and therefore base stacking effects

are not taken into account. However, it is possible to account for such effects by

modifying the algorithm.) A common extension to standard linear gap costs is the use

of two different gap penalties for opening a gap and for extending a gap. Typically the

former is much larger than the latter, e.g. -10 for gap open and -2 for gap extension. Thus,

the number of gaps in an alignment is usually reduced and residues and gaps are kept

together, which typically makes more biological sense. The Gotoh algorithm implements

affine gap costs by using three matrices.
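The three-matrix formulation of affine gap costs can be sketched as follows, returning only the optimal global score. The penalty values mirror the illustrative ones above; the matrix names and the omission of traceback are simplifications of the Gotoh algorithm:

```python
NEG = float("-inf")

def gotoh_global(a, b, match=1, mismatch=-1, gap_open=-10, gap_extend=-2):
    """Optimal global alignment score under affine gap costs."""
    n, m = len(a), len(b)
    # M: a[i] aligned to b[j]; X: gap in b (vertical); Y: gap in a (horizontal)
    M = [[NEG] * (m + 1) for _ in range(n + 1)]
    X = [[NEG] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0
    for i in range(1, n + 1):
        X[i][0] = gap_open + (i - 1) * gap_extend
    for j in range(1, m + 1):
        Y[0][j] = gap_open + (j - 1) * gap_extend
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            M[i][j] = max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1]) + s
            X[i][j] = max(M[i-1][j] + gap_open,    # open a new gap
                          X[i-1][j] + gap_extend)  # extend an existing gap
            Y[i][j] = max(M[i][j-1] + gap_open,
                          Y[i][j-1] + gap_extend)
    return max(M[n][m], X[n][m], Y[n][m])
```

For example, gotoh_global("AAA", "AA") returns -8: two matches at +1 each and a single gapped residue at the opening penalty of -10, rather than two separately opened gaps.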

Dynamic programming can be useful in aligning nucleotide to protein sequences, a

task complicated by the need to take into account frameshift mutations (usually insertions

or deletions). The framesearch method produces a series of global or local pairwise

alignments between a query nucleotide sequence and a search set of protein sequences, or

vice versa. Its ability to evaluate frameshifts offset by an arbitrary number of nucleotides

makes the method useful for sequences containing large numbers of indels, which can be

very difficult to align with more efficient heuristic methods. In practice, the method

requires large amounts of computing power or a system whose architecture is specialized

for dynamic programming. The BLAST and EMBOSS suites provide basic tools for

creating translated alignments (though some of these approaches take advantage of side-

effects of sequence searching capabilities of the tools). More general methods are

available from both commercial sources, such as FrameSearch, distributed as part of the

Accelrys GCG package, and Open Source software such as Genewise.

The dynamic programming method is guaranteed to find an optimal alignment given a

particular scoring function; however, identifying a good scoring function is often an

empirical rather than a theoretical matter. Although dynamic programming is extensible

to more than two sequences, it is prohibitively slow for large numbers of or extremely

long sequences.


Word Methods

Word methods, also known as k-tuple methods, are heuristic methods that are not

guaranteed to find an optimal alignment solution, but are significantly more efficient than

dynamic programming. These methods are especially useful in large-scale database

searches where it is understood that a large proportion of the candidate sequences will

have essentially no significant match with the query sequence. Word methods are best

known for their implementation in the database search tools FASTA and the BLAST

family. Word methods identify a series of short, nonoverlapping subsequences ("words")

in the query sequence that are then matched to candidate database sequences. The relative

positions of the word in the two sequences being compared are subtracted to obtain an

offset; this will indicate a region of alignment if multiple distinct words produce the same

offset. Only if this region is detected do these methods apply more sensitive alignment

criteria; thus, many unnecessary comparisons with sequences of no appreciable similarity

are eliminated.
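The offset idea can be sketched as follows; this is a toy version of the k-tuple indexing used by FASTA and BLAST, and the function name is illustrative:

```python
from collections import defaultdict

def word_hits(query, target, k=3):
    """Count, for each diagonal offset, how many length-k words co-occur."""
    index = defaultdict(list)              # word -> positions in the target
    for j in range(len(target) - k + 1):
        index[target[j:j + k]].append(j)
    offsets = defaultdict(int)             # (target pos - query pos) -> hits
    for i in range(len(query) - k + 1):
        for j in index[query[i:i + k]]:
            offsets[j - i] += 1
    return offsets
```

Many words voting for the same offset (a shared diagonal) flag a candidate region of alignment, which is then refined with a more sensitive method; sequences producing no such diagonal are discarded cheaply.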

In the FASTA method, the user defines a value k to use as the word length with which

to search the database. The method is slower but more sensitive at lower values of k,

which are also preferred for searches involving a very short query sequence. The BLAST

family of search methods provides a number of algorithms optimized for particular types

of queries, such as searching for distantly related sequence matches. BLAST was

developed to provide a faster alternative to FASTA without sacrificing much accuracy;

like FASTA, BLAST uses a word search of length k, but evaluates only the most

significant word matches, rather than every word match as does FASTA. Most BLAST

implementations use a fixed default word length that is optimized for the query and

database type, and that is changed only under special circumstances, such as when

searching with repetitive or very short query sequences. Implementations can be found

via a number of web portals, such as EMBL FASTA and NCBI BLAST.


Multiple sequence alignment

Multiple sequence alignment is an extension of pairwise alignment to incorporate more

than two sequences at a time. Multiple alignment methods try to align all of the

sequences in a given query set. Multiple alignments are often used in identifying

conserved sequence regions across a group of sequences hypothesized to be

evolutionarily related. Such conserved sequence motifs can be used in conjunction with

structural and mechanistic information to locate the catalytic active sites of enzymes.

Alignments are also used to aid in establishing evolutionary relationships by constructing

phylogenetic trees. Multiple sequence alignments are computationally difficult to produce

and most formulations of the problem lead to NP-complete combinatorial optimization

problems. Nevertheless, the utility of these alignments in bioinformatics has led to the

development of a variety of methods suitable for aligning three or more sequences.

Fig: Alignment of 27 avian influenza hemagglutinin protein sequences colored by residue conservation (top) and residue properties (bottom)


Dynamic Programming

The technique of dynamic programming is theoretically applicable to any number of

sequences; however, because it is computationally expensive in both time and memory, it

is rarely used for more than three or four sequences in its most basic form. This method

requires constructing the n-dimensional equivalent of the sequence matrix formed from

two sequences, where n is the number of sequences in the query. Standard dynamic

programming is first used on all pairs of query sequences and then the "alignment space"

is filled in by considering possible matches or gaps at intermediate positions, eventually

constructing an alignment essentially between each two-sequence alignment. Although

this technique is computationally expensive, its guarantee of a global optimum solution is

useful in cases where only a few sequences need to be aligned accurately. One method

for reducing the computational demands of dynamic programming, which relies on the

"sum of pairs" objective function, has been implemented in the MSA software package.

Progressive Methods

Progressive, hierarchical, or tree methods generate a multiple sequence alignment by

first aligning the most similar sequences and then adding successively less related

sequences or groups to the alignment until the entire query set has been incorporated into

the solution. The initial tree describing the sequence relatedness is based on pairwise

comparisons that may include heuristic pairwise alignment methods similar to FASTA.

Progressive alignment results are dependent on the choice of "most related" sequences

and thus can be sensitive to inaccuracies in the initial pairwise alignments. Most

progressive multiple sequence alignment methods additionally weight the sequences in

the query set according to their relatedness, which reduces the likelihood of making a

poor choice of initial sequences and thus improves alignment accuracy.

Many variations of the Clustal progressive implementation are used for multiple

sequence alignment, phylogenetic tree construction, and as input for protein structure


prediction. A slower but more accurate variant of the progressive method is known as T-

Coffee.
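The first step of a progressive method can be illustrated with a toy that scores all pairs and picks the most similar pair to align first. Percent identity over ungapped positions is a crude stand-in for the heuristic pairwise comparisons mentioned above, and the function names are illustrative:

```python
from itertools import combinations

def percent_identity(a, b):
    """Fraction of matching positions; a crude pairwise similarity."""
    return sum(x == y for x, y in zip(a, b)) / min(len(a), len(b))

def most_similar_pair(seqs):
    """The pair of sequences a progressive method would align first."""
    return max(combinations(sorted(seqs), 2),
               key=lambda pair: percent_identity(seqs[pair[0]], seqs[pair[1]]))
```

A real implementation would build a full guide tree from these distances and weight sequences by relatedness, as the text describes.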

Iterative Methods

Iterative methods attempt to improve on the heavy dependence on the accuracy of the

initial pairwise alignments, which is the weak point of the progressive methods. Iterative

methods optimize an objective function based on a selected alignment scoring method by

assigning an initial global alignment and then realigning sequence subsets. The realigned

subsets are then themselves aligned to produce the next iteration's multiple sequence

alignment. Various ways of selecting the sequence subgroups and objective function are

reviewed in the literature.

Motif Finding

Motif finding, also known as profile analysis, constructs global multiple sequence

alignments that attempt to align short conserved sequence motifs among the sequences in

the query set. This is usually done by first constructing a general global multiple

sequence alignment, after which the highly conserved regions are isolated and used to

construct a set of profile matrices. The profile matrix for each conserved region is

arranged like a scoring matrix but its frequency counts for each amino acid or nucleotide

at each position are derived from the conserved region's character distribution rather than

from a more general empirical distribution. The profile matrices are then used to search

other sequences for occurrences of the motif they characterize. In cases where the

original data set contained a small number of sequences, or only highly related

sequences, pseudocounts are added to normalize the character distributions represented in

the motif.
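A profile matrix with pseudocounts, as described above, can be sketched like this; the uniform background frequency and natural-log scoring are simplifying assumptions of this toy version:

```python
import math
from collections import Counter

def profile_matrix(motifs, alphabet="ACGT", pseudocount=1.0):
    """Per-column letter frequencies over the motif instances, smoothed."""
    profile = []
    for column in zip(*motifs):
        counts = Counter(column)
        total = len(motifs) + pseudocount * len(alphabet)
        profile.append({a: (counts[a] + pseudocount) / total
                        for a in alphabet})
    return profile

def log_odds_score(seq, profile, background=0.25):
    """Log-odds of seq under the profile versus a uniform background."""
    return sum(math.log(profile[i][c] / background)
               for i, c in enumerate(seq))
```

Scanning another sequence with log_odds_score and a window of the motif's length then locates candidate occurrences of the motif; sequences matching the profile score well above background.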

Techniques Inspired by Computer Science

A variety of general optimization algorithms commonly used in computer science have

also been applied to the multiple sequence alignment problem. Hidden Markov models


have been used to produce probability scores for a family of possible multiple sequence

alignments for a given query set; although early HMM-based methods produced

underwhelming performance, later applications have found them especially effective in

detecting remotely related sequences because they are less susceptible to noise created by

conservative or semiconservative substitutions. Genetic algorithms and simulated

annealing have also been used in optimizing multiple sequence alignment scores as

judged by a scoring function like the sum-of-pairs method. More complete details and

software packages can be found in the main article multiple sequence alignment.

II. Profile Comparison

In 1987 Michael Gribskov, Andrew McLachlan and David Eisenberg introduced the

method of profile comparison for identifying distant similarities between proteins. Rather

than using a single sequence, profile methods use a multiple sequence alignment to

encode a profile which contains information about the conservation level of each residue.

These profiles can then be used to search collections of sequences to find sequences that

are related. Profiles are also known as Position Specific Scoring Matrices (PSSMs). In

1993 a probabilistic interpretation of profiles was introduced by David Haussler and

colleagues using hidden Markov models. These models have become known as profile-

HMMs.

In recent years methods have been developed that allow the comparison of profiles

directly to each other. These are known as profile-profile comparison methods.

III. Sequence Assembly

In bioinformatics, sequence assembly refers to aligning and merging fragments of a

much longer DNA sequence in order to reconstruct the original sequence. This is needed

as DNA sequencing technology cannot read whole genomes in one go, but rather reads

small pieces of between 20 and 1000 bases, depending on the technology used. Typically


the short fragments, called reads, result from shotgun sequencing genomic DNA, or gene

transcripts (ESTs).

The problem of sequence assembly can be compared to taking many copies of a book,

passing them all through a shredder, and piecing the text of the book back together just

by looking at the shredded pieces. Besides the obvious difficulty of this task, there are

some extra practical issues: the original may have many repeated paragraphs, and some

shreds may be modified during shredding to have typos. Excerpts from another book may

also be added in, and some shreds may be completely unrecognizable.
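A greedy overlap-merge sketch in Python captures the flavour of the task. Real assemblers use overlap graphs or de Bruijn graphs and must cope with sequencing errors and repeats; the function names and the minimum-overlap cutoff here are illustrative:

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a matching a prefix of b."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0

def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(reads)
    while len(reads) > 1:
        best = (0, 0, 1)
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best[0]:
                        best = (n, i, j)
        n, i, j = best
        if n == 0:
            break  # no overlaps left; reads remain as separate contigs
        merged = reads[i] + reads[j][n:]
        reads = [r for k, r in enumerate(reads) if k not in (i, j)] + [merged]
    return reads
```

For example, greedy_assemble(["TTACGT", "CGTACC", "ACCGGA"]) reconstructs the single contig "TTACGTACCGGA" from its three overlapping fragments.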

Genome Assemblers

The first sequence assemblers began to appear in the late 1980s and early 1990s as

variants of simpler sequence alignment programs to piece together vast quantities of

fragments generated by automated sequencing instruments called DNA sequencers. As

the sequenced organisms grew in size and complexity (from small viruses and plasmids

to bacteria and finally eukaryotes), the assembly programs used in these genome projects

needed to increasingly employ more and more sophisticated strategies to handle:

Terabytes of sequencing data which need processing on computing clusters.

Identical and nearly identical sequences (known as repeats) which can, in the

worst case, increase the time and space complexity of algorithms exponentially.

Errors in the fragments from the sequencing instruments, which can confound

assembly.

Faced with the challenge of assembling the first larger eukaryotic genomes, the fruit

fly Drosophila melanogaster, in 2000 and the human genome just a year later, scientists

developed assemblers like Celera Assembler and Arachne able to handle genomes of

100-300 million base pairs. Subsequent to these efforts, several other groups, mostly at

the major genome sequencing centers, built large-scale assemblers, and an open source


effort known as AMOS was launched to bring together all the innovations in genome

assembly technology under the open source framework.

EST Assemblers

Expressed Sequence Tag or EST assembly differs from genome assembly in several

ways. The sequences for EST assembly are the transcribed mRNA of a cell and represent

only a subset of the whole genome. At first glance, the underlying algorithmic problems

differ between genome and EST assembly. For instance, genomes often have large

amounts of repetitive sequences, mainly in the inter-genic parts. Since ESTs represent

gene transcripts, they will not contain these repeats. On the other hand, cells tend to have

a certain number of genes that are constantly expressed in very high amounts

(housekeeping genes), which again leads to the problem of similar sequences present in

high amounts in the data set to be assembled.

Furthermore, genes sometimes overlap in the genome (sense-antisense transcription),

and should ideally still be assembled separately. EST assembly is also complicated by

features like (cis-) alternative splicing, trans-splicing, single-nucleotide polymorphism,

recoding, and post-transcriptional modification.

De-novo vs. Mapping Assembly

In sequence assembly, two different types can be distinguished:

De-novo: assembling short reads to create full-length (sometimes novel)

sequences (see de novo transcriptome assembly)

Mapping: assembling reads against an existing backbone sequence, building a

sequence that is similar but not necessarily identical to the backbone sequence

In terms of complexity and time requirements, de-novo assemblies are orders of

magnitude slower and more memory intensive than mapping assemblies. This is mostly

due to the fact that the assembly algorithm needs to compare every read with every other


read (an operation with a complexity of O(n²), though this can be reduced to O(n log n)).

Referring to the comparison drawn to shredded books in the introduction: while for

mapping assemblies one would have a very similar book as template (perhaps with the

names of the main characters and a few locations changed), the de-novo assemblies are

more demanding, in the sense that one would not know beforehand whether the result would be

a science book, a novel, or a catalogue.

IV. Gene Prediction

Gene prediction or gene finding refers to the process of identifying the regions of

genomic DNA that encode genes. This includes protein-coding genes as well as RNA

genes, but may also include prediction of other functional elements such as regulatory

regions. Gene finding is one of the first and most important steps in understanding the

genome of a species once it has been sequenced. In general the prediction of bacterial

genes is significantly simpler and more accurate than the prediction of genes in

eukaryotic species that usually have complex intron/exon patterns.

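For bacterial sequences, the simplest computational approach is to scan for open reading frames (ORFs), stretches from a start codon to an in-frame stop codon. The sketch below checks only the three forward reading frames and uses an illustrative minimum length; a real gene finder would also scan the reverse complement and weigh codon usage statistics:

```python
START, STOPS = "ATG", {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Scan the three forward reading frames for START..STOP stretches."""
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == START:
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOPS:
                        if (j - i) // 3 >= min_codons:
                            orfs.append(dna[i:j + 3])  # include the stop codon
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs
```

For example, find_orfs("ATGAAATTTTAA") returns the single ORF "ATGAAATTTTAA": a start codon, two coding codons, and a stop.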

In its earliest days, "gene finding" was based on painstaking experimentation on living

cells and organisms. Statistical analysis of the rates of homologous recombination of

several different genes could determine their order on a certain chromosome, and

information from many such experiments could be combined to create a genetic map

specifying the rough location of known genes relative to each other. Today, with

comprehensive genome sequence and powerful computational resources at the disposal of


the research community, gene finding has been redefined as a largely computational

problem.

Determining that a sequence is functional should be distinguished from determining

the function of the gene or its product. The latter still demands in vivo experimentation

through gene knockout and other assays, although frontiers of bioinformatics research are

making it increasingly possible to predict the function of a gene based on its sequence

alone.

V. Protein Structure Prediction

Protein threading, also known as fold recognition, is a method of protein modeling (i.e.

computational protein structure prediction) which is used to model those proteins which

have the same fold as proteins of known structures, but do not have homologous proteins

with known structure. It differs from the homology modeling method of structure

prediction as it (protein threading) is used for proteins which do not have their

homologous protein structures deposited in the Protein Data Bank (PDB), whereas

homology modeling is used for those proteins which do. Threading works by using

statistical knowledge of the relationship between the structures deposited in the PDB and

the sequence of the protein which one wishes to model.

The prediction is made by "threading" (i.e. placing, aligning) each amino acid in the

target sequence to a position in the template structure, and evaluating how well the target

fits the template. After the best-fit template is selected, the structural model of the

sequence is built based on the alignment with the chosen template. Protein threading is

based on two basic observations: that the number of different folds in nature is fairly

small (approximately 1300); and that 90% of the new structures submitted to the PDB in

the past three years have similar structural folds to ones already in the PDB (according to

the CATH release notes).


The 3D structures of molecules are of great importance to their functions in nature.

Since structural prediction of large molecules at an atomic level is a largely intractable

problem, some biologists introduced ways to predict 3D structure at a primary sequence

level. This includes biochemical or statistical analysis of amino acid residues in local

regions and structural inference from homologs (or other potentially related proteins)

with known 3D structures.

There have been a large number of diverse approaches to solve the structure prediction

problem. In order to determine which methods were most effective a structure prediction

competition was founded called CASP (Critical Assessment of Structure Prediction).

Fig: Target protein structure (3dsm, shown in ribbons), with Calpha backbones (in gray) of 354 predicted models for it submitted in the CASP8 structure-prediction experiment.


2. Genome Annotation

In the context of genomics, annotation is the process of marking the genes and other

biological features in a DNA sequence. The first genome annotation software system was

designed in 1995 by Dr. Owen White, who was part of the team at The Institute for

Genomic Research that sequenced and analyzed the first genome of a free-living

organism to be decoded, the bacterium Haemophilus influenzae. Dr. White built a

software system to find the genes (places in the DNA sequence that encode a protein), the

transfer RNA, and other features, and to make initial assignments of function to those

genes. Most current genome annotation systems work similarly, but the programs

available for analysis of genomic DNA are constantly changing and improving.

3. Computational Evolutionary Biology

Evolutionary biology is the study of the origin and descent of species, as well as their

change over time. Informatics has assisted evolutionary biologists in several key ways; it

has enabled researchers to:

Trace the evolution of a large number of organisms by measuring changes in their

DNA, rather than through physical taxonomy or physiological observations alone.

More recently, compare entire genomes, which permits the study of more complex

evolutionary events, such as gene duplication, horizontal gene transfer, and the

prediction of factors important in bacterial speciation.

Build complex computational models of populations to predict the outcome of the

system over time.

Track and share information on an increasingly large number of species and

organisms.

Future work aims to reconstruct the increasingly complex tree of life.


The area of research within computer science that uses genetic algorithms is sometimes

confused with computational evolutionary biology, but the two areas are not necessarily

related.

4. Literature Analysis

The growth in the volume of published literature makes it virtually impossible to read

every paper, resulting in disjointed subfields of research. Literature analysis aims to

employ computational and statistical linguistics to mine this growing library of text

resources. For example:

Abbreviation recognition - identify the long-form and abbreviation of biological

terms.

Named entity recognition - recognizing biological terms such as gene names.

Protein-protein interaction - identify which proteins interact with which proteins

from text.

The area of research draws from statistics and computational linguistics.

5. Analysis of Gene Expression

The expression of many genes can be determined by measuring mRNA levels with

multiple techniques including microarrays, expressed cDNA sequence tag (EST)

sequencing, serial analysis of gene expression (SAGE) tag sequencing, massively parallel

signature sequencing (MPSS), RNA-Seq, also known as "Whole Transcriptome Shotgun

Sequencing" (WTSS), or various applications of multiplexed in-situ hybridization. All of

these techniques are extremely noise-prone and/or subject to bias in the biological

measurement, and a major research area in computational biology involves developing

statistical tools to separate signal from noise in high-throughput gene expression studies.


Such studies are often used to determine the genes implicated in a disorder: one might

compare microarray data from cancerous epithelial cells to data from non-cancerous cells

to determine the transcripts that are up-regulated and down-regulated in a particular

population of cancer cells.

6. Analysis of Regulation

Regulation is the complex orchestration of events starting with an extracellular signal

such as a hormone and leading to an increase or decrease in the activity of one or more

proteins. Bioinformatics techniques have been applied to explore various steps in this

process. For example, promoter analysis involves the identification and study of sequence

motifs in the DNA surrounding the coding region of a gene. These motifs influence the

extent to which that region is transcribed into mRNA. Expression data can be used to

infer gene regulation: one might compare microarray data from a wide variety of states of

an organism to form hypotheses about the genes involved in each state.

In a single-cell organism, one might compare stages of the cell cycle, along with

various stress conditions (heat shock, starvation, etc.). One can then apply clustering

algorithms to that expression data to determine which genes are co-expressed. For

example, the upstream regions (promoters) of co-expressed genes can be searched for

over-represented regulatory elements.
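The clustering step can be sketched with Pearson correlation and a greedy single-linkage merge. The expression profiles and the 0.9 threshold are invented for this example, and the correlation assumes non-constant profiles; real analyses use dedicated clustering libraries:

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two expression profiles
    (assumes neither profile is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def coexpressed_groups(profiles, threshold=0.9):
    """Greedy single-linkage grouping: genes whose profiles correlate
    above `threshold` end up in the same group."""
    groups = {g: {g} for g in profiles}
    for a, b in combinations(profiles, 2):
        if pearson(profiles[a], profiles[b]) >= threshold:
            merged = groups[a] | groups[b]
            for g in merged:
                groups[g] = merged
    return {frozenset(s) for s in groups.values()}
```

The upstream regions of the genes that land in one group would then be searched for over-represented motifs.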

7. Analysis of Protein Expression

Protein microarrays and high throughput (HT) mass spectrometry (MS) can provide a

snapshot of the proteins present in a biological sample. Bioinformatics is very much

involved in making sense of protein microarray and HT MS data; the former approach faces similar problems as microarrays targeted at mRNA, while the latter involves the problem of matching large amounts of mass data against masses predicted from protein sequence databases, as well as the complicated statistical analysis of samples where multiple, but incomplete, peptides from each protein are detected.
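The mass-matching problem can be illustrated with a toy peptide-mass-fingerprinting sketch. The residue masses are standard monoisotopic values, but the protein sequence, observed masses and tolerance below are invented for the example:

```python
import re

# Monoisotopic residue masses in daltons; one water is added per peptide.
RESIDUE_MASS = {
    'G': 57.02146, 'A': 71.03711, 'S': 87.03203, 'P': 97.05276,
    'V': 99.06841, 'T': 101.04768, 'C': 103.00919, 'L': 113.08406,
    'I': 113.08406, 'N': 114.04293, 'D': 115.02694, 'Q': 128.05858,
    'K': 128.09496, 'E': 129.04259, 'M': 131.04049, 'H': 137.05891,
    'F': 147.06841, 'R': 156.10111, 'Y': 163.06333, 'W': 186.07931,
}
WATER = 18.01056

def tryptic_peptides(protein):
    """Cleave after K or R, except before P (trypsin rule of thumb)."""
    return [p for p in re.split(r'(?<=[KR])(?!P)', protein) if p]

def peptide_mass(peptide):
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER

def match_masses(observed, protein, tol=0.5):
    """Match observed peptide masses against those predicted from a
    protein sequence, within `tol` daltons."""
    predicted = {p: peptide_mass(p) for p in tryptic_peptides(protein)}
    return {m: p for m in observed
            for p, pm in predicted.items() if abs(m - pm) <= tol}
```

Real search engines additionally score partial digests, modifications and fragment-ion spectra.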

8. Analysis of Mutations in Cancer

In cancer, the genomes of affected cells are rearranged in complex or even

unpredictable ways. Massive sequencing efforts are used to identify previously unknown

point mutations in a variety of genes in cancer. Bioinformaticians continue to produce

specialized automated systems to manage the sheer volume of sequence data produced,

and they create new algorithms and software to compare the sequencing results to the

growing collection of human genome sequences and germline polymorphisms. New

physical detection technologies are employed, such as oligonucleotide microarrays to

identify chromosomal gains and losses (called comparative genomic hybridization), and

single-nucleotide polymorphism arrays to detect known point mutations. These detection

methods simultaneously measure several hundred thousand sites throughout the genome,

and when used in high-throughput to measure thousands of samples, generate terabytes of

data per experiment. Again the massive amounts and new types of data generate new

opportunities for bioinformaticians. The data is often found to contain considerable

variability, or noise, and thus Hidden Markov model and change-point analysis methods

are being developed to infer real copy number changes.
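A toy single change-point detector for array-CGH-style log2 ratios shows the idea; the scoring rule and data below are simplified inventions, and production methods (circular binary segmentation, HMM-based callers) are far more sophisticated:

```python
from statistics import mean

def best_change_point(log_ratios, min_seg=3):
    """Single change-point detection on copy-number log2 ratios:
    pick the split that maximizes the mean shift between the two
    segments, weighted by segment sizes (a CUSUM-like score)."""
    best_i, best_score = None, 0.0
    n = len(log_ratios)
    for i in range(min_seg, n - min_seg + 1):
        left, right = log_ratios[:i], log_ratios[i:]
        score = abs(mean(left) - mean(right)) * (len(left) * len(right) / n) ** 0.5
        if score > best_score:
            best_i, best_score = i, score
    return best_i, best_score
```

Applied recursively to each segment, this yields a simple binary-segmentation caller for copy-number gains and losses.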

Another type of data that requires novel informatics development is the analysis of

lesions found to be recurrent among many tumors.

9. Comparative Genomics

Comparative genomics is the study of the relationship of genome structure and

function across different biological species or strains. Comparative genomics is an

attempt to take advantage of the information provided by the signatures of selection to

understand the function and evolutionary processes that act on genomes. While it is still a

young field, it holds great promise to yield insights into many aspects of the evolution of

modern species. The sheer amount of information contained in modern genomes (3.2

gigabases in the case of humans) necessitates that the methods of comparative genomics

are automated. Gene finding is an important application of comparative genomics, as is

discovery of new, non-coding functional elements of the genome.

Comparative genomics exploits both similarities and differences in the proteins, RNA,

and regulatory regions of different organisms to infer how selection has acted upon these

elements. Those elements that are responsible for similarities between different species

should be conserved through time (stabilizing selection), while those elements

responsible for differences among species should be divergent (positive selection).

Finally, those elements that are unimportant to the evolutionary success of the organism

will be unconserved (selection is neutral).
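The conserved/divergent distinction is often made operational by scoring conservation column-by-column in a multiple alignment. A minimal sketch, using an invented toy alignment:

```python
from collections import Counter

def column_conservation(alignment):
    """Per-column conservation of a multiple alignment: the fraction of
    sequences sharing the most common residue at each position.
    Values near 1.0 suggest stabilizing selection; low values suggest
    divergence or neutral drift."""
    n_seqs = len(alignment)
    scores = []
    for column in zip(*alignment):
        top_count = Counter(column).most_common(1)[0][1]
        scores.append(top_count / n_seqs)
    return scores
```

Real conservation tracks (such as those in genome browsers) use phylogeny-aware models rather than simple majority counts.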

One of the important goals of the field is the identification of the mechanisms of

eukaryotic genome evolution. It is however often complicated by the multiplicity of

events that have taken place throughout the history of individual lineages, leaving only

distorted and superimposed traces in the genome of each living organism. For this reason

comparative genomics studies of small model organisms (for example the model

Caenorhabditis elegans and closely related Caenorhabditis briggsae) are of great

importance to advance our understanding of general mechanisms of evolution.

Having come a long way from its initial use of finding functional proteins, comparative

genomics is now concentrating on finding regulatory regions and siRNA molecules.

Recently, it has been discovered that distantly related species often share long conserved

stretches of DNA that do not appear to code for any protein (see conserved non-coding

sequence). One such ultra-conserved region, stable from chicken to chimpanzee, has undergone a sudden burst of change in the human lineage, and is found to be active in the developing brain of the human embryo.

Computational approaches to genome comparison have recently become a common

research topic in computer science. A public collection of case studies and

demonstrations is growing, ranging from whole genome comparisons to gene expression

analysis. This has increased the introduction of different ideas, including concepts from

systems and control, information theory, string analysis and data mining. It is anticipated that computational approaches will become and remain a standard topic for research and teaching, and that courses will begin training students to be fluent in both fields.

Fig: The human FOXP2 gene and its evolutionary conservation, shown in a multiple alignment (at the bottom of the figure) in this image from the UCSC Genome Browser. Note that conservation tends to cluster around coding regions (exons).

10. Modeling Biological Systems

Systems biology involves the use of computer simulations of cellular subsystems (such

as the networks of metabolites and enzymes which comprise metabolism, signal

transduction pathways and gene regulatory networks) to both analyze and visualize the

complex connections of these cellular processes. Artificial life or virtual evolution

attempts to understand evolutionary processes via the computer simulation of simple

(artificial) life forms.

Modelling biological systems is a significant task of systems biology and mathematical

biology. Computational systems biology aims to develop and use efficient algorithms,

data structures, visualization and communication tools with the goal of computer

modelling of biological systems.

Overview

It is understood that an unexpected emergent property of a complex system is a result of the interplay of cause and effect among its simpler, integrated parts. Biological

systems manifest many important examples of emergent properties in the complex

interplay of components. Traditional study of biological systems requires reductive

methods in which quantities of data are gathered by category, such as concentration over

time in response to a certain stimulus. Computers are critical to analysis and modelling of

these data. The goal is to create accurate real-time models of a system's response to

environmental and internal stimuli, such as a model of a cancer cell in order to find

weaknesses in its signalling pathways, or modelling of ion channel mutations to see

effects on cardiomyocytes and in turn, the function of a beating heart.

A monograph on this topic summarizes an extensive amount of published research in

this area up to 1987, including subsections in the following areas: computer modelling in

biology and medicine, arterial system models, neuron models, biochemical and

oscillation networks, quantum automata, quantum computers in molecular biology and

genetics, cancer modelling, neural nets, genetic networks, abstract relational biology,

metabolic-replication systems, category theory applications in biology and medicine,

automata theory, cellular automata, tessellation models and complete self-reproduction,

chaotic systems in organisms, relational biology and organismic theories. This published

report also includes 390 references to peer-reviewed articles by a large number of

authors.

Standards

By far the most widely accepted standard format for storing and exchanging models in

the field is the Systems Biology Markup Language (SBML). The SBML.org website

includes a guide to many important software packages used in computational systems

biology. Other markup languages with different emphases include BioPAX and CellML.

Particular Tasks

Cellular model

Creating a cellular model has been a particularly challenging task of systems biology

and mathematical biology. It involves the use of computer simulations of the many

cellular subsystems such as the networks of metabolites and enzymes which comprise

metabolism, signal transduction pathways and gene regulatory networks to both analyze

and visualize the complex connections of these cellular processes.

The complex network of biochemical reaction/transport processes and their spatial

organization make the development of a predictive model of a living cell a grand

challenge for the 21st century.

In 2006, the National Science Foundation (NSF) put forward a grand challenge for

systems biology in the 21st century to build a mathematical model of the whole cell. E-Cell Project aims "to make precise whole cell simulation at the molecular level possible".

CytoSolve, developed by V. A. Shiva Ayyadurai and C. Forbes Dewey, Jr. of the Department of Biological Engineering at the Massachusetts Institute of Technology, provided a

method to model the whole cell by dynamically integrating multiple molecular pathway

models.

A dynamic computer model of intracellular signaling was the basis for Merrimack

Pharmaceuticals to discover the target for their cancer medicine MM-111.

Membrane computing studies computational models inspired by the structure and function of the cell membrane.

Fig: Signal transduction pathways

Protein Folding

Protein structure prediction is the prediction of the three-dimensional structure of a

protein from its amino acid sequence — that is, the prediction of its secondary, tertiary,

and quaternary structure from its primary structure. Structure prediction is fundamentally

different from the inverse problem of protein design. Protein structure prediction is one of

the most important goals pursued by bioinformatics and theoretical chemistry; it is highly

important in medicine (for example, in drug design) and biotechnology (for example, in

the design of novel enzymes). Every two years, the performance of current methods is

assessed in the CASP experiment (Critical Assessment of Techniques for Protein

Structure Prediction).

Human Biological Systems

Brain Model

The Blue Brain Project is an attempt to create a synthetic brain by reverse-engineering

the mammalian brain down to the molecular level. The aim of the project, founded in

May 2005 by the Brain and Mind Institute of the École Polytechnique Fédérale de Lausanne,

Switzerland, is to study the brain's architectural and functional principles. The project is

headed by the Institute's director, Henry Markram. Using a Blue Gene supercomputer

running Michael Hines's NEURON software, the simulation does not consist simply of an

artificial neural network, but involves a partially biologically realistic model of neurons.

It is hoped by its proponents that it will eventually shed light on the nature of

consciousness.

There are a number of sub-projects, including the Cajal Blue Brain, coordinated by the

Supercomputing and Visualization Center of Madrid (CeSViMa), and others run by

universities and independent laboratories in the UK, U.S., and Israel. The Human Brain

Project builds on the work of the Blue Brain Project. It is one of six pilot projects in the

Future and Emerging Technologies research programme of the European Commission, competing for a billion euros in funding.

Model of the Immune System

The last decade has seen the emergence of a growing number of simulations of the

immune system.

Virtual Liver

The Virtual Liver project is a 43 million euro research program funded by the German

Government, made up of seventy research groups distributed across Germany. The goal is

to produce a virtual liver, a dynamic mathematical model that represents human liver

physiology, morphology and function.

11. High-throughput Image Analysis

Computational technologies are used to accelerate or fully automate the processing,

quantification and analysis of large amounts of high-information-content biomedical

imagery. Modern image analysis systems augment an observer's ability to make

measurements from a large or complex set of images, by improving accuracy, objectivity,

or speed. A fully developed analysis system may completely replace the observer.

Although these systems are not unique to biomedical imagery, biomedical imaging is

becoming more important for both diagnostics and research. Some examples are:

High-throughput and high-fidelity quantification and sub-cellular localization

(high-content screening, cytohistopathology, Bioimage informatics)

Morphometrics

Clinical image analysis and visualization

Determining the real-time air-flow patterns in breathing lungs of living animals

Quantifying occlusion size in real-time imagery from the development of and

recovery during arterial injury

Making behavioral observations from extended video recordings of laboratory

animals

Infrared measurements for metabolic activity determination

Inferring clone overlaps in DNA mapping, e.g. the Sulston score

12. Structural Bioinformatic Approaches

Prediction of Protein Structure

Protein structure prediction is another important application of bioinformatics. The

amino acid sequence of a protein, the so-called primary structure, can be easily

determined from the sequence of the gene that codes for it. In the vast majority of cases,

this primary structure uniquely determines a structure in its native environment.

Knowledge of this structure is vital in understanding the function of the protein. For lack

of better terms, structural information is usually classified as one of secondary, tertiary

and quaternary structure. A viable general solution to such predictions remains an open

problem. As of now, most efforts have been directed towards heuristics that work most of

the time.

One of the key ideas in bioinformatics is the notion of homology. In the genomic

branch of bioinformatics, homology is used to predict the function of a gene: if the

sequence of gene A, whose function is known, is homologous to the sequence of gene B,

whose function is unknown, one could infer that B may share A's function. In the

structural branch of bioinformatics, homology is used to determine which parts of a

protein are important in structure formation and interaction with other proteins. In a

technique called homology modeling, this information is used to predict the structure of a

protein once the structure of a homologous protein is known. This currently remains the

only way to predict protein structures reliably.

One example of this is the similar protein homology between hemoglobin in humans

and the hemoglobin in legumes (leghemoglobin). Both serve the same purpose of

transporting oxygen in the organism. Though both of these proteins have completely

different amino acid sequences, their protein structures are virtually identical, which

reflects their near identical purposes.

Other techniques for predicting protein structure include protein threading and de novo

physics-based modeling.

Molecular Interaction

Protein–protein interaction prediction is a field combining bioinformatics and

structural biology in an attempt to identify and catalog physical interactions between

pairs or groups of proteins. Understanding protein–protein interactions is important for

the investigation of intracellular signaling pathways, modelling of protein complex

structures and for gaining insights into various biochemical processes. Experimentally,

physical interactions between pairs of proteins can be inferred from a variety of

experimental techniques, including yeast two-hybrid systems, protein-fragment

complementation assays (PCA), affinity purification/mass spectrometry, protein

microarrays, fluorescence resonance energy transfer (FRET), and Microscale

Thermophoresis (MST). Efforts to experimentally determine the interactome of numerous

species are ongoing, and a number of computational methods for interaction prediction

have been developed in recent years.

Methods

Proteins that interact are more likely to co-evolve; therefore, it is possible to make

inferences about interactions between pairs of proteins based on their phylogenetic

distances. It has also been observed in some cases that pairs of interacting proteins have

fused orthologues in other organisms. In addition, a number of bound protein complexes

have been structurally solved and can be used to identify the residues that mediate the

interaction so that similar motifs can be located in other organisms.

Phylogenetic Profiling

Phylogenetic profiling finds pairs of protein families with similar patterns of presence

or absence across large numbers of species. This method identifies pairs likely to act in

the same biological process, but does not necessarily imply physical interaction.
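A minimal sketch of phylogenetic profiling over binary presence/absence vectors; the gene-family names, profiles and the 0.9 agreement threshold are invented for this example:

```python
from itertools import combinations

def profile_similarity(p1, p2):
    """Fraction of genomes in which two gene families are both present
    or both absent (simple matching of binary phylogenetic profiles)."""
    return sum(a == b for a, b in zip(p1, p2)) / len(p1)

def likely_cofunctional(profiles, threshold=0.9):
    """Pairs of families whose presence/absence patterns agree in at
    least `threshold` of the surveyed genomes."""
    return [(a, b) for a, b in combinations(sorted(profiles), 2)
            if profile_similarity(profiles[a], profiles[b]) >= threshold]
```

Real implementations weight genomes by phylogenetic independence rather than counting each genome equally.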

Prediction of Co-evolved Protein Pairs Based on

Similar Phylogenetic Trees

This method involves using a sequence search tool such as BLAST for finding

homologues of a pair of proteins, then building multiple sequence alignments with

alignment tools such as Clustal. From these multiple sequence alignments, phylogenetic

distance matrices are calculated for each protein in the hypothesized interacting pair. If

the matrices are sufficiently similar (as measured by their Pearson correlation coefficient)

they are deemed likely to interact.
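The distance-matrix comparison (the "mirror tree" idea) reduces to a Pearson correlation over the upper triangles of the two matrices. A sketch with invented distance matrices, assuming rows and columns are ordered by the same species in both:

```python
from math import sqrt

def upper_triangle(matrix):
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def mirror_tree_score(dist_a, dist_b):
    """Pearson correlation between the pairwise phylogenetic distance
    matrices of two protein families. High correlation suggests
    co-evolution, hence possible interaction."""
    x, y = upper_triangle(dist_a), upper_triangle(dist_b)
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A score near 1.0 marks the pair as a co-evolution candidate; published methods also correct for the background similarity shared by all proteins in the two species sets.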

Identification of Homologous Interacting Pairs

This method consists of searching whether the two sequences have homologues which

form a complex in a database of known structures of complexes. The identification of the

domains is done by sequence searches against domain databases such as Pfam using

BLAST. If more than one complex of Pfam domains is identified, then the query

sequences are aligned using a hidden Markov tool called HMMER to the closest

identified homologues, whose structures are known. Then the alignments are analysed to

check whether the contact residues of the known complex are conserved in the alignment.

Identification of Structural Patterns

This method builds a library of known protein–protein interfaces from the PDB, where

the interfaces are defined as pairs of polypeptide fragments separated by less than a threshold slightly larger than the van der Waals radius of the atoms involved. The sequences in the

library are then clustered based on structural alignment and redundant sequences are

eliminated. The residues that have a high (generally >50%) level of frequency for a given

position are considered hotspots. This library is then used to identify potential

interactions between pairs of targets, provided that they have a known structure (i.e.

present in the PDB).
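The hotspot rule described above (a residue occurring at more than 50% frequency at an aligned position) can be sketched directly; the fragment library here is an invented toy:

```python
from collections import Counter

def interface_hotspots(aligned_fragments, cutoff=0.5):
    """Positions in a library of aligned interface fragments where one
    residue occurs in more than `cutoff` of the members; such conserved
    positions are treated as candidate interaction hotspots."""
    n = len(aligned_fragments)
    hotspots = []
    for pos, column in enumerate(zip(*aligned_fragments)):
        residue, count = Counter(column).most_common(1)[0]
        if count / n > cutoff:
            hotspots.append((pos, residue))
    return hotspots
```

In practice the library is built from structural alignments of PDB interfaces rather than raw sequence columns.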

Bayesian Network Modelling

Bayesian methods integrate data from a wide variety of sources, including both

experimental results and prior computational predictions, and use these features to assess

the likelihood that a particular potential protein interaction is a true positive result. These

methods are useful because experimental procedures, particularly the yeast two-hybrid

experiments, are extremely noisy and produce many false positives, while the previously

mentioned computational methods can only provide circumstantial evidence that a

particular pair of proteins might interact.
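Under a naive-Bayes independence assumption, each evidence source contributes a likelihood ratio that multiplies the prior odds of interaction. The prior and the likelihood ratios below are invented for illustration:

```python
def posterior_interaction(prior_odds, likelihood_ratios):
    """Naive-Bayes combination of independent evidence sources for a
    candidate protein pair. Each ratio is
    P(evidence | interacting) / P(evidence | not interacting)."""
    odds = prior_odds
    for lr in likelihood_ratios:
        odds *= lr
    return odds / (1 + odds)  # posterior probability of interaction
```

With prior odds of 1/600 and evidence ratios of 30, 10 and 4 (say, from a two-hybrid hit, co-expression, and a co-evolution score), the posterior probability of interaction comes out to 2/3.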

3D Template-Based Protein Complex Modelling

This method makes use of known protein complex structures to predict as well as

structurally model interactions between query protein sequences. The prediction process

generally starts by employing a sequence based method (e.g. Interolog) to search for

protein complex structures that are homologous to the query sequences. These known

complex structures are then used as templates to structurally model the interaction

between query sequences. This method has the advantage of not only inferring protein

interactions but also suggesting models of how proteins interact structurally, which can

provide some insights into the atomic level mechanism of that interaction. On the other

hand, the ability of this method to make a prediction is limited by the relatively small number of known protein complex structures.

Supervised Learning Problem

The problem of PPI prediction can be framed as a supervised learning problem. In this

paradigm the known protein interactions supervise the estimation of a function that can

predict whether an interaction exists or not between two proteins given data about the

proteins (e.g., expression levels of each gene in different experimental conditions,

location information, phylogenetic profile, etc.).
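This supervised framing can be sketched with a tiny logistic-regression trainer over invented per-pair features (say, a co-expression correlation and a phylogenetic-profile similarity); real pipelines use established machine-learning libraries and far richer feature sets:

```python
from math import exp

def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Tiny logistic-regression trainer: known interacting (1) and
    non-interacting (0) pairs supervise weights over per-pair features."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            z = b + sum(wi * xi for wi, xi in zip(w, x))
            p = 1 / (1 + exp(-z))
            g = p - y                      # gradient of the log-loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def predict(w, b, x):
    """Predicted probability that a new pair interacts."""
    z = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + exp(-z))
```

Training on a handful of labeled pairs then lets the model score unlabeled pairs between 0 and 1.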

Relationship to Docking Methods

The field of protein–protein interaction prediction is closely related to the field of

protein–protein docking, which attempts to use geometric and steric considerations to fit

two proteins of known structure into a bound complex. This is a useful mode of inquiry

in cases where both proteins in the pair have known structures and are known (or at least

strongly suspected) to interact, but since so many proteins do not have experimentally

determined structures, sequence-based interaction prediction methods are especially

useful in conjunction with experimental studies of an organism's interactome.

Application of Bioinformatics in Various Fields

Bioinformatics is the use of IT in biotechnology for data storage, data warehousing and the analysis of DNA sequences. Bioinformatics requires knowledge of many branches, including biology, mathematics, computer science, the laws of physics and chemistry, and of course a sound knowledge of IT to analyze biotech data. Bioinformatics is not limited to computing data; in reality it can be used to solve many biological problems and find out how living things work.

It is the comprehensive application of mathematics (e.g., probability and statistics),

science (e.g., biochemistry), and a core set of problem-solving methods (e.g., computer

algorithms) to the understanding of living systems.

These include the following:

1. Molecular medicine

1.1 More drug targets

1.2 Personalised medicine

1.3 Preventative medicine

1.4 Gene therapy

2. Microbial genome applications

2.1 Waste cleanup

2.2 Climate change

2.3 Alternative energy sources

2.4 Biotechnology

2.5 Antibiotic resistance

2.6 Forensic analysis of microbes

2.7 The reality of bioweapon creation

2.8 Evolutionary studies

3. Agriculture

3.1 Crops

3.2 Insect resistance

3.3 Improve nutritional quality

3.4 Grow crops in poorer soils and that are drought resistant

4. Animals

5. Comparative studies

1. Molecular Medicine

Molecular medicine is a broad field, where physical, chemical, biological and medical

techniques are used to describe molecular structures and mechanisms, identify

fundamental molecular and genetic errors of disease, and to develop molecular

interventions to correct them. The molecular medicine perspective emphasizes cellular

and molecular phenomena and interventions rather than the previous conceptual and

observational focus on patients and their organs.

In November 1949, with the seminal paper, "Sickle Cell Anemia, a Molecular

Disease", in Science magazine, Linus Pauling, Harvey Itano and their collaborators laid

the groundwork for establishing the field of molecular medicine. In 1956, Roger J.

Williams wrote Biochemical Individuality, a prescient book about genetics, prevention

and treatment of disease on a molecular basis, and nutrition which is now variously

referred to as individualized medicine and orthomolecular medicine. Another paper in

Science by Pauling in 1968, introduced and defined this view of molecular medicine that

focuses on natural and nutritional substances used for treatment and prevention.

The human genome will have profound effects on the fields of biomedical research and

clinical medicine. Every disease has a genetic component. This may be inherited (as is

the case with an estimated 3,000-4,000 hereditary diseases, including cystic fibrosis and Huntington's disease) or a result of the body's response to an environmental stress which causes alterations in the genome (e.g. cancers, heart disease, diabetes).

The completion of the human genome means that we can search for the genes directly

associated with different diseases and begin to understand the molecular basis of these

diseases more clearly. This new knowledge of the molecular mechanisms of disease will

enable better treatments, cures and even preventative tests to be developed.

1.1 More Drug Targets

At present all drugs on the market target only about 500 proteins. With an improved

understanding of disease mechanisms and using computational tools to identify and

validate new drug targets, more specific medicines that act on the cause, not merely the

symptoms, of the disease can be developed. These highly specific drugs promise to have

fewer side effects than many of today's medicines.

1.2 Personalised Medicine

Personalized medicine is a medical model that proposes the customization of

healthcare, with all decisions and practices being tailored to the individual patient by use

of genetic or other information. Practical applications outside of long-established considerations like a patient's family history, social circumstances, environment and behaviors are very limited so far, and practically no progress has been made in the last decade.

Clinical medicine will become more personalised with the development of the field of

pharmacogenomics. This is the study of how an individual's genetic inheritance affects the body's response to drugs. At present, some drugs fail to make it to the market because a small percentage of the clinical patient population shows adverse effects to a drug due to sequence variants in their DNA.

As a result, potentially lifesaving drugs never make it to the marketplace. Today,

doctors have to use trial and error to find the best drug to treat a particular patient as those

with the same clinical symptoms can show a wide range of responses to the same

treatment. In the future, doctors will be able to analyse a patient's genetic profile and

prescribe the best available drug therapy and dosage from the beginning.

1.3 Preventative Medicine

Preventive medicine or preventive care consists of measures taken to prevent diseases,

(or injuries) rather than curing them or treating their symptoms. This contrasts in method

with curative and palliative medicine, and in scope with public health methods (which

work at the level of population health rather than individual health).

Preventive medicine strategies are typically described as taking place at the primary,

secondary, tertiary and quaternary prevention levels. In addition, the term primal

prevention has been used to describe all measures taken to ensure fetal well-being and

prevent any long-term health consequences from gestational history and/or disease. The

rationale for such efforts is the evidence demonstrating the link between fetal well-being,

or "primal health," and adult health. Primal prevention strategies typically focus on

providing future parents with: education regarding the consequences of epigenetic

influences on their child, sufficient leave time for both parents, and financial support if

required. This includes parenting in infancy as well.

Simple examples of preventive medicine include hand washing, breastfeeding, and

immunizations. Preventive care may include examinations and screening tests tailored to

an individual's age, health, and family history. For example, a person with a family

history of certain cancers or other diseases would begin screening at an earlier age and/or

more frequently than those with no such family history. On the other side of preventive

medicine, some nonprofit organizations, such as the Northern California Cancer Center,

apply epidemiologic research towards finding ways to prevent diseases.

Prevention levels

The four levels form a two-by-two grid of the patient's perspective (illness felt or not) against the doctor's perspective (disease detectable or not):

Primary prevention: illness absent, disease absent

Secondary prevention: illness absent, disease present

Quaternary prevention: illness present, disease absent

Tertiary prevention: illness present, disease present

Definitions

Primary prevention: methods to avoid the occurrence of disease. Most population-based health promotion efforts are of this type.

Secondary prevention: methods to diagnose and treat existing disease in early stages, before it causes significant morbidity.

Tertiary prevention: methods to reduce the negative impact of established disease by restoring function and reducing disease-related complications.

Quaternary prevention: methods to mitigate or avoid the results of unnecessary or excessive interventions in the health system.

With the specific details of the genetic mechanisms of diseases being unravelled, the development of diagnostic tests to measure a person's susceptibility to different diseases may become a distinct reality. Preventative actions, such as a change of lifestyle or treatment at the earliest possible stage, when it is more likely to be successful, could result in huge advances in our struggle to conquer disease.

1.4 Gene Therapy

Gene therapy is the use of DNA as a pharmaceutical agent to treat disease. It derives its name from the idea that DNA can be used to supplement or alter genes within an individual's cells as a therapy to treat disease. The most common form of gene therapy involves using DNA that encodes a functional, therapeutic gene to replace a mutated gene. Other forms involve directly correcting a mutation, or using DNA that encodes a therapeutic protein drug (rather than a natural human gene) to provide treatment. In gene therapy, DNA that encodes a therapeutic protein is packaged within a "vector", which is used to get the DNA inside cells within the body. Once inside, the DNA is expressed by the cell machinery, resulting in the production of the therapeutic protein, which in turn treats the patient's disease.

Fig: Gene therapy using an Adenovirus vector. A new gene is inserted into an adenovirus. If the treatment is successful, the new gene will make functional protein to treat a disease.

In the not too distant future, the potential for using genes themselves to treat disease may become a reality. Gene therapy is the approach used to treat, cure or even prevent disease by changing the expression of a person's genes. Currently, the field is in its infancy, with clinical trials ongoing for many different types of cancer and other diseases.

1.5 Drug Designing

The drug discovery process is expensive and time consuming, and it may or may not succeed. Retrospective analyses of the pharmaceutical industry during the 1990s estimated that each new drug on the market takes an average of 12-15 years to develop, costing in the region of $1.3 billion. In addition, only about one in nine compounds that enters clinical trials makes it to the market.

In the present era, emerging new diseases have driven much of the advancement in the field of Computer-Aided Drug Design (CADD). Drug design is an innovative, cost- and time-effective branch of bioinformatics; hence, computers play a vital role in it.


The drug discovery process is shaped by four factors: scientific knowledge, available technology, human resources and diseases. The first three provide the strategic assets from which discovery strategies are formed.

Once a new compound has been identified in the laboratory, medicines are developed as follows:

I. PRECLINICAL TESTING:

A pharmaceutical company conducts laboratory and animal studies to show biological

activity of the compound against the target disease, and the compound is evaluated for

safety.

II. INVESTIGATIONAL NEW DRUG APPLICATION (IND):

After completing preclinical testing, a company files an IND with the U.S. Food and Drug Administration (FDA) to begin testing the drug in people. The IND becomes effective if the FDA does not disapprove it within 30 days. The IND presents the results of previous experiments; how, where and by whom the new studies will be conducted; the chemical structure of the compound; how it is thought to work in the body; any toxic effects found in the animal studies; and how the compound is manufactured. All clinical trials must be reviewed and approved by the Institutional Review Board (IRB) where the trials will be conducted.

III. CLINICAL TRIALS:

PHASE-1:

These studies are primarily concerned with assessing the drug candidate's safety. A small number of healthy volunteers are given the compound to test what happens to the drug in the human body: how it is absorbed, metabolized and excreted. About 70% of drug candidates pass this initial phase of testing.

PHASE-2:

The drug candidate is tested for efficacy. Usually this is explored in a randomised trial in which the compound is given to several hundred patients with the condition or disease to be treated. Depending on the condition, the trial can last from several months to several years. The output is an increased understanding of the safety of the compound and clear information about its effectiveness. Only if a drug candidate passes this phase can it truly be considered a drug.


PHASE-3:

A drug is tested in several hundred to several thousand patients. This provides a more

thorough understanding of the drug’s effectiveness, benefits and the range of possible

adverse reactions. These trials typically last several years.

IV. APPROVAL:

Once the FDA approves a New Drug Application (NDA), the new medicine becomes available for physicians to prescribe. A company must continue to submit periodic reports to the FDA, including any cases of adverse reactions and appropriate quality-control records.

Hence, a drug passes through these different trials before at last reaching the market.

2. Microbial Genome Applications

Microorganisms are ubiquitous, that is they are found everywhere. They have been

found surviving and thriving in extremes of heat, cold, radiation, salt, acidity and

pressure. They are present in the environment, our bodies, the air, food and water.

Traditionally, use has been made of a variety of microbial properties in the baking,

brewing and food industries. The arrival of the complete genome sequences and their

potential to provide a greater insight into the microbial world and its capacities could

have broad and far reaching implications for environment, health, energy and industrial

applications. For these reasons, in 1994, the US Department of Energy (DOE) initiated

the MGP (Microbial Genome Project) to sequence genomes of bacteria useful in energy

production, environmental cleanup, industrial processing and toxic waste reduction.


By studying the genetic material of these organisms, scientists can begin to understand

these microbes at a very fundamental level and isolate the genes that give them their

unique abilities to survive under extreme conditions.

2.1 Waste Cleanup

Deinococcus radiodurans is known as the world's toughest bacterium and is the most

radiation resistant organism known. Scientists are interested in this organism because of

its potential usefulness in cleaning up waste sites that contain radiation and toxic

chemicals.

Microbial Genome Program (MGP) scientists are determining the DNA sequence of

the genome of C. crescentus, one of the organisms responsible for sewage treatment.

2.2 Climate Change

Increasing levels of carbon dioxide emission, mainly through the expanding use of

fossil fuels for energy, are thought to contribute to global climate change. Recently, the

DOE (Department of Energy, USA) launched a program to decrease atmospheric carbon

dioxide levels. One method of doing so is to study the genomes of microbes that use

carbon dioxide as their sole carbon source.

2.3 Alternative Energy Sources

Scientists are studying the genome of the microbe Chlorobium tepidum which has an

unusual capacity for generating energy from light.

2.4 Biotechnology

The archaeon Archaeoglobus fulgidus and the bacterium Thermotoga maritima have

potential for practical applications in industry and government-funded environmental


remediation. These microorganisms thrive in water temperatures above the boiling point

and therefore may provide the DOE, the Department of Defence, and private companies

with heat-stable enzymes suitable for use in industrial processes.

Other industrially useful microbes include Corynebacterium glutamicum, which is of high industrial interest because it is used by the chemical industry for the biotechnological production of the amino acid lysine. Lysine, one of the essential amino acids in animal nutrition, is biotechnologically produced and added to feed concentrates as a source of protein, an alternative to soybeans or meat-and-bone meal.

Xanthomonas campestris pv. is grown commercially to produce the exopolysaccharide

xanthan gum, which is used as a viscosifying and stabilising agent in many industries.

Lactococcus lactis is one of the most important microorganisms in the dairy industry. It is a non-pathogenic bacterium that is critical for manufacturing dairy products like buttermilk, yogurt and cheese. Lactococcus lactis is also used to prepare pickled vegetables, beer, wine, some breads, sausages and

other fermented foods. Researchers anticipate that understanding the physiology and

genetic make-up of this bacterium will prove invaluable for food manufacturers as well

as the pharmaceutical industry, which is exploring the capacity of L. lactis to serve as a

vehicle for delivering drugs.

2.5 Antibiotic Resistance

Scientists have been examining the genome of Enterococcus faecalis a leading cause

of bacterial infection among hospital patients. They have discovered a virulence region

made up of a number of antibiotic-resistant genes that may contribute to the bacterium's

transformation from a harmless gut bacterium to a menacing invader. The discovery of the

region, known as a pathogenicity island, could provide useful markers for detecting


pathogenic strains and help to establish controls to prevent the spread of infection in

wards.

2.6 Forensic Analysis of Microbes

Scientists used genomic tools to help distinguish the strain of Bacillus anthracis used in the 2001 terrorist attacks in Florida from closely related anthrax strains.

2.7 The Reality of Bioweapon Creation

Scientists have recently built poliovirus, the virus that causes poliomyelitis, using entirely artificial means. They did this using genomic data available on the Internet and materials from a mail-order chemical supply. The research was financed by the US Department of Defence as

part of a biowarfare response program to prove to the world the reality of bioweapons.

The researchers also hope their work will discourage officials from ever relaxing

programs of immunisation. This project has been met with very mixed feelings.

2.8 Evolutionary Studies

The sequencing of genomes from all three domains of life, eukaryota, bacteria and

archaea means that evolutionary studies can be performed in a quest to determine the tree

of life and the last universal common ancestor.

3. Agriculture

The sequencing of the genomes of plants and animals should have enormous benefits

for the agricultural community. Bioinformatic tools can be used to search for the genes

within these genomes and to elucidate their functions. This specific genetic knowledge

could then be used to produce stronger, more drought, disease and insect resistant crops


and improve the quality of livestock making them healthier, more disease resistant and

more productive.

3.1 Crops

Comparative genetics of the plant genomes has shown that the organisation of their

genes has remained more conserved over evolutionary time than was previously believed.

These findings suggest that information obtained from the model crop systems can be

used to suggest improvements to other food crops. Arabidopsis thaliana (thale cress) and

Oryza sativa (rice) are examples of available complete plant genomes.

3.2 Insect Resistance

Genes from Bacillus thuringiensis that can control a number of serious pests have been

successfully transferred to cotton, maize and potatoes. This new ability of the plants to

resist insect attack means that the amount of insecticides being used can be reduced and

hence the nutritional quality of the crops is increased.

3.3 Improve Nutritional Quality

Scientists have recently succeeded in transferring genes into rice to increase levels of

Vitamin A, iron and other micronutrients. This work could have a profound impact in

reducing occurrences of blindness and anaemia caused by deficiencies in Vitamin A and

iron respectively.

Scientists have inserted a gene from yeast into the tomato, and the result is a plant

whose fruit stays longer on the vine and has an extended shelf life.


3.4 Grow in Poorer Soils and Drought Resistant

Progress has been made in developing cereal varieties that have a greater tolerance for

soil alkalinity, free aluminium and iron toxicities. These varieties will allow agriculture to

succeed in poorer soil areas, thus adding more land to the global production base.

Research is also in progress to produce crop varieties capable of tolerating reduced water

conditions.

4. Animals

Sequencing projects of many farm animals including cows, pigs and sheep are now

well under way in the hope that a better understanding of the biology of these organisms

will have huge impacts for improving the production and health of livestock and

ultimately have benefits for human nutrition.

5. Comparative Studies

Analysing and comparing the genetic material of different species is an important

method for studying the functions of genes, the mechanisms of inherited diseases and

species evolution. Bioinformatics tools can be used to make comparisons between the

numbers, locations and biochemical functions of genes in different organisms.

Organisms that are suitable for use in experimental research are termed model

organisms. They have a number of properties that make them ideal for research purposes, including short life spans, rapid reproduction, ease of handling, low cost, and the ability to be manipulated at the genetic level.

An example of a human model organism is the mouse. Mouse and human are very

closely related (>98%), and for the most part we see a one-to-one correspondence between

genes in the two species. Manipulation of the mouse at the molecular level and genome


comparisons between the two species are revealing detailed information on the

functions of human genes, the evolutionary relationship between the two species and the

molecular mechanisms of many human diseases.

Biological Databases

A collection of biological data stored on a computer, which can be manipulated to appear in varying arrangements and subsets, is regarded as a database. Biological information can be stored in different databases, each with its own website and unique navigation tools.

Nucleotide sequence databases: The nucleotide sequence data submitted by scientists and genome sequencing groups is held in the databases GenBank, EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan). There is good coordination between these three databases, as they are synchronized on a daily basis.

Protein sequence databases: These are usually prepared from

existing literature and/or in consultation with the experts. In fact, these databases

represent the translated DNA databases.

Molecular structure databases: The three-dimensional (3-D) structures of macromolecules are determined by X-ray crystallography and nuclear magnetic resonance (NMR). PDB and SCOP are the primary databases of biological molecules.

Other databases: KEGG database is an important one that provides

information on the current knowledge of molecular biology and cell biology with

special reference to information on metabolic pathways, interacting molecules and

genes.
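Records from the sequence databases above are commonly exchanged as FASTA-formatted text: a ">" header line followed by sequence lines. A minimal sketch of parsing that format (the parser function and the example record are invented for illustration):

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if header is not None:          # flush the previous record
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:                  # flush the last record
        yield header, "".join(chunks)

example = """>seq1 hypothetical example record
ATGGCGTACGT
TTGACC
"""
records = dict(parse_fasta(example))
print(records["seq1 hypothetical example record"])  # ATGGCGTACGTTTGACC
```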


Types of BioInformatics Tools

Bioinformatics tools are software programs for saving, retrieving and analysing biological data and extracting information from it.

Factors that must be taken into consideration when designing these tools are:

The end user (the biologist) may not be a frequent user of computer technology, and thus the tools should be very user-friendly.

These software tools must be made available over the internet given the global

distribution of the scientific research community.

The Bioinformatics Tools may be categorized into following categories:

1. Homology and Similarity Tools

2. Protein Function Analysis


3. Structural Analysis

4. Sequence Analysis

1.Homology and Similarity Tools

The term homology implies a common evolutionary relationship between two traits -

whether they are DNA sequences or bristle patterns on a fly's nose. Homologous

sequences are sequences that are related by divergence from a common ancestor. Thus

the degree of similarity between two sequences can be measured while their homology is

a case of being either true or false. This set of tools can be used to identify similarities

between novel query sequences of unknown structure and function and database

sequences whose structure and function have been elucidated.
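The distinction above (similarity is a measurable quantity, while homology is a yes/no inference) can be illustrated with a small sketch that computes percent identity over a pair of sequences. This assumes the sequences have already been aligned to equal length; the function name is invented:

```python
def percent_identity(a: str, b: str) -> float:
    """Percent identity between two pre-aligned, equal-length sequences.
    Gap positions ('-') never count as matches but do count in the length."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
    return 100.0 * matches / len(a)

# Two aligned sequences differing at the gap column and the last residue.
print(percent_identity("ACGT-ACGT", "ACGTTACGA"))
```

A high percent identity is then used as evidence for, not proof of, homology.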

2. Protein Function Analysis

Function analysis is the identification and mapping of all functional elements (both coding and non-coding) in a genome. This group of programs allows you to compare your

protein sequence to the secondary (or derived) protein databases that contain information

on motifs, signatures and protein domains. Highly significant hits against these different

pattern databases allow you to approximate the biochemical function of your query

protein.
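One simple way such pattern databases are used is to translate a motif into a regular expression and scan the query protein with it. A sketch using the classic N-glycosylation site pattern N-{P}-[ST]-{P} (Asn, then anything but Pro, then Ser or Thr, then anything but Pro); the function name is invented:

```python
import re

# PROSITE-style pattern N-{P}-[ST]-{P} written as a regular expression.
MOTIF = re.compile(r"N[^P][ST][^P]")

def find_motifs(seq):
    """Return (position, matched_text) for each motif hit, 1-based."""
    return [(m.start() + 1, m.group()) for m in MOTIF.finditer(seq)]

# One hit at position 3; the second Asn is followed by Pro, so no hit there.
print(find_motifs("MKNGSAKNPTR"))  # [(3, 'NGSA')]
```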

3. Structural Analysis

These sets of tools allow you to compare structures with the known structure

databases. The function of a protein is more directly a consequence of its structure rather

than its sequence with structural homologs tending to share functions. The determination

of a protein's 2D/3D structure is crucial in the study of its function.
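Structural similarity is commonly reported as the root-mean-square deviation (RMSD) between corresponding atoms of two superimposed structures. A minimal sketch that assumes the coordinates are already superimposed (real tools first compute an optimal superposition, e.g. with the Kabsch algorithm; the function name is invented):

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of
    (x, y, z) atom coordinates, assumed already superimposed."""
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate lists must have equal length")
    sq = sum((ax - bx) ** 2 + (ay - by) ** 2 + (az - bz) ** 2
             for (ax, ay, az), (bx, by, bz) in zip(coords_a, coords_b))
    return math.sqrt(sq / len(coords_a))

a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
b = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0)]
print(rmsd(a, b))  # sqrt(0.5): one atom displaced by 1 unit, one identical
```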


4. Sequence Analysis

This set of tools allows you to carry out further, more detailed analysis on your query

sequence including evolutionary analysis, identification of mutations, hydropathy

regions, CpG islands and compositional biases. The identification of these and other

biological properties are all clues that aid the search to elucidate the specific function of

your sequence.
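For example, CpG islands are often flagged using two quantities: GC content and the observed/expected CpG dinucleotide ratio. A minimal sketch of both (function names are invented; real island finders also apply window-length and threshold rules, commonly GC > 50% and obs/exp > 0.6):

```python
def gc_fraction(seq):
    """Fraction of G and C bases in a DNA sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

def cpg_obs_exp(seq):
    """Observed/expected CpG dinucleotide ratio: CpG * N / (C * G)."""
    seq = seq.upper()
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    if c == 0 or g == 0:
        return 0.0
    return cpg * len(seq) / (c * g)

seq = "CGCGATCGCGTACG"
print(gc_fraction(seq), cpg_obs_exp(seq))
```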

Bioinformatics Tools

1. BLAST

The Basic Local Alignment Search Tool (BLAST) for comparing gene and protein

sequences against others in public databases, now comes in several types including PSI-

BLAST, PHI-BLAST, and BLAST 2 sequences. Specialized BLASTs are also available

for human, microbial, malaria, and other genomes, as well as for vector contamination,

immunoglobulins, and tentative human consensus sequences.

2. FASTA

A database search tool used to compare a nucleotide or peptide sequence to a sequence

database. The program is based on the rapid sequence algorithm described by Lipman

and Pearson. It was the first widely used algorithm for database similarity searching. The

program looks for optimal local alignments by scanning the sequence for small matches

called "words". Initially, the scores of segments in which there are multiple word hits are

calculated ("init1"). Later the scores of several segments may be summed to generate an

"initn" score. An optimized alignment that includes gaps is shown in the output as "opt".

The sensitivity and speed of the search are inversely related and controlled by the "k-tup"

variable which specifies the size of a "word".
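The seeding step described above can be sketched in a few lines: index every k-tup word of the target, then report each position where a query word matches. This illustrates only the word-matching stage, not the init1/initn/opt scoring; the function name is invented:

```python
def word_hits(query, target, ktup=2):
    """Return (query_pos, target_pos) pairs where a k-tup word matches."""
    index = {}
    for i in range(len(target) - ktup + 1):        # index target words
        index.setdefault(target[i:i + ktup], []).append(i)
    hits = []
    for i in range(len(query) - ktup + 1):         # look up query words
        for j in index.get(query[i:i + ktup], []):
            hits.append((i, j))
    return hits

print(word_hits("GATTACA", "TTACAGA", ktup=3))  # [(2, 0), (3, 1), (4, 2)]
```

Hits falling on the same diagonal (constant query_pos minus target_pos, as all three do here) form the segments that FASTA then scores and extends; a larger k-tup gives fewer hits, hence the speed/sensitivity trade-off mentioned above.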


3. EMBOSS

EMBOSS (The European Molecular Biology Open Software Suite) is a new, free open

source software analysis package specially developed for the needs of the molecular

biology user community. Within EMBOSS you will find around 100 programs

(applications) for sequence alignment, database searching with sequence patterns, protein

motif identification and domain analysis, nucleotide sequence pattern analysis, codon

usage analysis for small genomes, and much more.

4. Clustalw

ClustalW is a general purpose multiple sequence alignment program for DNA or

proteins. It produces biologically meaningful multiple sequence alignments of divergent

sequences, calculates the best match for the selected sequences, and lines them up so that

the identities, similarities and differences can be seen.
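Progressive aligners such as ClustalW start from pairwise alignment scores. A minimal sketch of the underlying dynamic-programming idea, here Needleman-Wunsch global alignment scoring with an assumed simple scheme (match +1, mismatch -1, gap -1; the function name is invented):

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty,
    computed row by row to keep memory linear in len(b)."""
    prev = [j * gap for j in range(len(b) + 1)]     # aligning prefixes of b to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]                            # aligning a-prefix to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

print(nw_score("GATTACA", "GCATGCU"))
```

ClustalW uses such pairwise scores to build a guide tree, then aligns sequences progressively along it.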

5. RASMOL

RasMol is a computer program written for molecular graphics visualization intended

and used primarily for the depiction and exploration of biological macromolecule

structures, such as those found in the Protein Data Bank. It was originally developed by

Roger Sayle in the early 90s.

Historically, it was an important tool for molecular biologists since the extremely

optimized program allowed the software to run on (then) modestly powerful personal

computers. Before RasMol, visualization software ran on graphics workstations that, due

to their expense, were less accessible to scholars. RasMol has become an important

educational tool as well as continuing to be an important tool for research in structural

biology.

Protein Databank (PDB) files can be downloaded for visualization from the Research

Collaboratory for Structural Bioinformatics (RCSB) bank. These have been uploaded by


researchers who have characterized the structure of molecules usually by X-ray

crystallography or NMR spectroscopy.

6. CHEMSKETCH

ACD/ChemSketch is an advanced chemical drawing tool and is the accepted interface

for the industry's best NMR and molecular property predictions, nomenclature, and

analytical data handling software. ACD/ChemSketch is also available as freeware, with

functionalities that are highly competitive with other popular commercial software

packages. The freeware contains tools for 2D structure cleaning, 3D optimization and

viewing, InChI generation and conversion, drawing of polymers, organometallics, and

Markush structures—capabilities that are not even included in some of the commercial

packages from other software producers. Also included is an IUPAC systematic naming

capability for molecules with fewer than 50 atoms and 3 rings. The capabilities of

ACD/ChemSketch can be further extended and customized by programming.

The commercial version of ACD/ChemSketch offers additional capabilities above and

beyond the freeware offering. It includes a number of advanced features including a

dictionary of more than 165,000 trivial, common, and trade names with their

corresponding structures. It allows the user to view SDfiles, and search Microsoft Word


or Adobe PDF reports, SDfiles, molfiles, and CambridgeSoft ChemDraw files by

chemical structure, substructure, or structure similarity.

7. WINCOOT

WinCoot is the Microsoft Windows build of Coot, a molecular-graphics application developed by Bernhard Lohkamp and Paul Emsley. It is used for macromolecular model building, completion and validation, particularly of protein models built from X-ray crystallography data.

8. AUTODOCK

AutoDock is a suite of automated docking tools. It is designed to predict how small

molecules, such as substrates or drug candidates, bind to a receptor of known 3D

structure. AutoDock actually consists of two main programs: AutoDock performs the


docking of the ligand to a set of grids describing the target protein, while AutoGrid pre-calculates these grids. In addition to being used for docking, the atomic affinity grids can be visualised. This can help, for example, to guide synthetic organic chemists in designing better binders. The developers have also created a graphical user interface called AutoDockTools, or ADT for short, which among other things helps to set up which bonds will be treated as rotatable in the ligand and to analyze dockings.

AutoDock has applications in:

X-ray crystallography

Structure-based drug design

Lead optimization

Virtual screening (HTS)

Combinatorial library design

Protein-protein docking

Chemical mechanism studies

9. SWISS PDB VIEWER

Swiss-PdbViewer (aka DeepView) has been developed since 1994 by Nicolas Guex.

Swiss-PdbViewer is tightly linked to SWISS-MODEL, an automated homology modeling


server developed within the Swiss Institute of Bioinformatics (SIB) at the Structural

Bioinformatics Group at the Biozentrum in Basel.

Swiss-PdbViewer is an application with a user-friendly interface that allows users to analyze several proteins at the same time. The proteins can be superimposed in order to

deduce structural alignments and compare their active sites or any other relevant parts.

Amino acid mutations, H-bonds, angles and distances between atoms are easy to obtain

thanks to the intuitive graphic and menu interface.

Application Programs

1. JAVA in Bioinformatics

Due to the platform-independent nature of Java, it is emerging as a key player in

and Bioinformatics Solutions' PatternHunter are two examples of the growing adoption of

Java in bioinformatics.


2. Perl in Bioinformatics

Perl is also widely used for processing biological data. One example of a Perl project is the BioPerl project.

3. XML in Bioinformatics

Extensible Markup Language (XML) is a markup language that defines a set of rules

for encoding documents in a format that is both human-readable and machine-readable. It

is defined in the XML 1.0 Specification produced by the W3C, and several other related

specifications, all gratis open standards.

The design goals of XML emphasize simplicity, generality, and usability over the

Internet. It is a textual data format with strong support via Unicode for the languages of

the world. Although the design of XML focuses on documents, it is widely used for the

representation of arbitrary data structures, for example in web services.
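As a sketch of XML as a data-representation format, Python's standard library can parse a small sequence record. The element and attribute names here are invented for illustration, not taken from any real bioinformatics XML schema:

```python
import xml.etree.ElementTree as ET

# A made-up XML sequence record.
doc = """<sequence id="seq1" organism="E. coli">
  <description>hypothetical example record</description>
  <dna>ATGGCGTAC</dna>
</sequence>"""

root = ET.fromstring(doc)
# Attributes are read with .get(), child elements with .find().
print(root.get("organism"), root.find("dna").text)  # E. coli ATGGCGTAC
```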


4. C in Bioinformatics

C is a general-purpose programming language initially developed by Dennis Ritchie

between 1969 and 1973 at Bell Labs. Its design provides constructs that map efficiently

to typical machine instructions, and therefore it found lasting use in applications that had

formerly been coded in assembly language—most notably system software like the Unix

operating system. C is one of the most widely used programming languages of all time,

and there are very few computer architectures for which a C compiler does not exist.

5. C++ in Bioinformatics

C++ (pronounced "cee plus plus") is a statically typed, free-form, multi-paradigm,

compiled, general-purpose programming language. It is regarded as an intermediate-level

language, as it comprises a combination of both high-level and low-level language

features. Developed by Bjarne Stroustrup starting in 1979 at Bell Labs, it adds object

oriented features, such as classes, and other enhancements to the C programming

language. Originally named C with Classes, the language was renamed C++ in 1983, as a

pun involving the increment operator.


6. Python in Bioinformatics

Python is a general-purpose, high-level programming language whose design

philosophy emphasizes code readability. Its syntax is said to be clear and expressive.

Python has a large and comprehensive standard library.

7. R in Bioinformatics

R is an open source programming language and software environment for statistical

computing and graphics. The R language is widely used among statisticians for

developing statistical software and data analysis.


8. MySQL in Bioinformatics

MySQL is the world's most used open source relational database management system

(RDBMS) that runs as a server providing multi-user access to a number of databases.

9. SQL in Bioinformatics

SQL (sometimes referred to as Structured Query Language) is a special-purpose

programming language designed for managing data in relational database management

systems (RDBMS).

Originally based upon relational algebra and tuple relational calculus, its scope

includes data insert, query, update and delete, schema creation and modification, and data

access control.
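The operations listed above can be sketched with Python's built-in sqlite3 module standing in for a MySQL server. The table, columns and example rows are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema creation.
conn.execute("CREATE TABLE genes (name TEXT, organism TEXT, length_bp INTEGER)")

# Data insert, using parameter substitution rather than string formatting.
conn.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [("lacZ", "E. coli", 3075), ("recA", "E. coli", 1062)],
)

# Query: shortest genes first.
rows = conn.execute(
    "SELECT name FROM genes WHERE organism = ? ORDER BY length_bp", ("E. coli",)
).fetchall()
print(rows)  # [('recA',), ('lacZ',)]
```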

10. CUDA in Bioinformatics

Compute Unified Device Architecture (CUDA) is a parallel computing architecture

developed by Nvidia for graphics processing. CUDA is the computing engine in Nvidia

graphics processing units (GPUs) that is accessible to software developers through

variants of industry standard programming languages. Programmers use 'C for CUDA' (C

with Nvidia extensions and certain restrictions), compiled through a PathScale or Open64

C compiler, to code algorithms for execution on the GPU. CUDA architecture shares a

range of computational interfaces with two competitors: the Khronos Group's OpenCL

and Microsoft's DirectCompute. Third party wrappers are also available for Python, Perl,

Fortran, Java, Ruby, Lua, Haskell, MATLAB, IDL, and native support in Mathematica.


CUDA programming in the web browser is freely available for individual non-commercial purposes in NCLab.

11. MATLAB in Bioinformatics

MATLAB (matrix laboratory) is a numerical computing environment and fourth-generation programming language. Developed by MathWorks, MATLAB allows matrix

manipulations, plotting of functions and data, implementation of algorithms, creation of

user interfaces, and interfacing with programs written in other languages, including C,

C++, Java, and Fortran.


12. Microsoft Excel in Bioinformatics

Microsoft Excel is a commercial spreadsheet application written and distributed by

Microsoft for Microsoft Windows and Mac OS X. It features calculation, graphing tools,

pivot tables, and a macro programming language called Visual Basic for Applications. It

has been a very widely applied spreadsheet for these platforms, especially since version 5

in 1993, and it has almost completely replaced Lotus 1-2-3 as the industry standard for

spreadsheets. Excel forms part of Microsoft Office. The current versions are 2010 for

Microsoft Windows and 2011 for Mac OS X.

Bioinformatics Projects

1. BioJava

The BioJava project provides Java tools for processing biological data.

2. BioPerl

The BioPerl project provides many modules for biological data processing.

3. BioXML

A part of the BioPerl project, this is a resource to gather XML documentation, DTDs and

XML-aware tools for biology in one location.