Bioinformatics By: Jesus Caban Boxuan GU

Upload: gloria-lester

Post on 13-Jan-2016


TRANSCRIPT

Page 1: Bioinformatics By: Jesus Caban Boxuan GU. What is Bioinformatics? Bioinformatics has been defined as a means for analyzing, comparing, graphically displaying,

Bioinformatics

By: Jesus Caban, Boxuan GU


What is Bioinformatics?

Bioinformatics has been defined as a means for analyzing, comparing, graphically displaying, modeling, storing, systemizing, searching, and ultimately distributing biological information, which includes sequences, structures, function, and phylogeny.


Research in Bioinformatics

• the study of DNA structure and its functions
• gene and protein expression
• protein production, structure, and functions
• genetic regulatory systems
• clinical applications


Bioinformatics language

Biology employs a digital language for representing its information, using a four-letter alphabet (A, C, G, T).

All the chromosomes in an organism's cells can be represented and identified using this alphabet.

The demanding challenge is to determine how this digital language of the chromosomes is converted into the three-dimensional, and sometimes four-dimensional, languages of living and breathing organisms.


Central Dogma of Molecular Biology

DNA → RNA → Protein

Structures, Processes

Cell, Environment


Protein structure

How many different protein structures are there? How different are they?


Protein structure

What is this? Is the DNA normal?


Computational Biology - Applications and Approaches


The Overview of DNA and RNA

Deoxyribonucleic acid (DNA) is a macromolecular chain of nucleotides that serves as a basic carrier of genetic information and is able to self-replicate. DNA can be represented as a sequence of nucleotide bases.


DNA

DNA sequences are typically thousands to millions of bases long.

DNA usually consists of two strands of complementary nucleotide sequences that are base-paired to each other.

Nuclear DNA in humans forms linear chains, but DNA can also form a circular molecule.


DNA

A hypothetical double-stranded DNA molecule can be represented as

ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA
TGCACCATCTCTGGGACACACTATCTGGTGCCCAT

since A pairs with T and C pairs with G, and vice versa.

Here A = adenine, C = cytosine, G = guanine, T = thymine.
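The pairing rule on this slide can be checked mechanically. The following is a minimal sketch in Python; the function name `complement` is an illustrative choice, not something from the slides. It maps each base to its partner, position by position, and reproduces the second strand of the example from the first.

```python
# Watson-Crick pairing used on the slide: A<->T and C<->G.
PAIRING = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Return the base-paired partner of every nucleotide, position by position."""
    return strand.upper().translate(PAIRING)


top = "ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA"
bottom = complement(top)
print(top)
print(bottom)  # TGCACCATCTCTGGGACACACTATCTGGTGCCCAT
```

Applying `complement` twice returns the original strand, which is the "vice versa" in the pairing rule.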


RNA

RNA is written the same way as the DNA above, except that T is replaced by U, which represents the nucleotide uracil.

Organisms are further classified into two types:
• Eukaryotes: higher-order organisms whose DNA is enclosed in a cell nucleus, e.g. humans.
• Prokaryotes: organisms whose DNA is not enclosed in a nucleus, e.g. bacteria.


GENE

A gene is a contiguous interval of DNA that contains the information needed to code for a protein. Genes form the basic units of heredity.


The Biological problem in CS

Computer scientists were empowered to consider a variety of biologically important problems defined primarily on sequences, or (more in the computer science vernacular) on strings:

• reconstructing long strings of DNA from overlapping string fragments;


Cont.

• determining physical and genetic maps from probe data under various experimental protocols;
• storing, retrieving, and comparing DNA strings;
• comparing two or more strings for similarities;
• searching databases for related strings and substrings;
• defining and exploring different notions of string relationships;


Cont.

• looking for new or ill-defined patterns occurring frequently in DNA

• looking for structural patterns in DNA and protein

• determining secondary (two-dimensional) structure of RNA

• finding conserved, but faint, patterns in many DNA and protein sequences; and more
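The first problem in this list, reconstructing long DNA strings from overlapping fragments, rests on one basic string operation: finding the longest suffix of one fragment that is also a prefix of another. The sketch below illustrates only that step. The names `overlap` and `merge`, the `min_len` threshold, and the two fragments are assumptions made for the example; real assemblers must also cope with sequencing errors and repeats.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that is a prefix of `b`.

    Returns 0 when no overlap of at least `min_len` characters exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge(a: str, b: str, min_len: int = 3):
    """Join two fragments on their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


# Two hypothetical fragments cut from the same stretch of DNA.
left, right = "ACGTGGTAGAG", "TAGAGACCCTG"
print(overlap(left, right))  # 5  (the shared piece is TAGAG)
print(merge(left, right))    # ACGTGGTAGAGACCCTG
```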


Challenges

In molecular biology, there are several hundred specialized databases holding raw DNA, RNA, and amino acid strings, or processed patterns (called motifs) derived from the raw string data.

The currently available algorithms for searching these databases are somewhat slow and error-prone, given that billions of DNA sequences are stored in the DNA and protein databases available worldwide today.


Cont.

Exact matching will remain a problem of interest as the databases grow exponentially in size, and also because it will continue to be a subtask of the more complex searches that will be devised to meet the increasingly advanced requirements of molecular biologists.


DNA contamination

Contamination is often caused by a fragment (substring) of a vector (DNA string) used to incorporate the desired DNA in a host organism, or it comes from the DNA of the host itself.

Contamination can also come from very small amounts of undesired foreign DNA that get physically mixed into the desired DNA and are then amplified by the polymerase chain reaction (PCR) used to make copies of the desired DNA.


Cont.

The DNA contamination problem can be represented as follows:

Given a string S1 (the newly isolated and sequenced string of DNA) and a known string S2 (the combined sources of possible contamination), find all substrings of S2 that occur in S1 and that are longer than some given length l.

These substrings are candidates for unwanted pieces of S2 that have contaminated the desired DNA string.
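As a rough illustration of this formulation, the sketch below reports the maximal substrings shared by S1 and S2 that are longer than l. It uses a simple quadratic dynamic program over longest common suffixes, which is a swapped-in technique chosen for brevity; linear-time approaches based on suffix trees exist for this problem. The function name and the sample strings are made up for the example.

```python
def contamination_candidates(s1: str, s2: str, l: int) -> list:
    """Maximal substrings common to s1 and s2 whose length exceeds l.

    cur[j] holds the length of the longest common suffix of s1[:i] and s2[:j];
    a match is reported only when it cannot be extended to the right.
    Runs in O(len(s1) * len(s2)) time.
    """
    found = set()
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                at_end = i == len(s1) or j == len(s2)
                if (at_end or s1[i] != s2[j]) and cur[j] > l:
                    found.add(s1[i - cur[j]:i])
        prev = cur
    return sorted(found)


# Hypothetical data: s1 is the new sequence, s2 a possible contaminant.
s1 = "GGGACGTACGTTT"
s2 = "CCACGTACGAA"
print(contamination_candidates(s1, s2, 4))  # ['ACGTACG']
```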


Matching and Alignment

• Exact string matching: Knuth-Morris-Pratt and Boyer-Moore
• Exact matching with a set of patterns: Aho-Corasick
• Inexact matching: edit distance and dynamic programming
• Sequence alignment problems
• Multiple alignment problems


How to Solve

Naive method: slide the pattern P (length n) along the text T (length m) and, for each alignment, compare characters from left to right. O(n*(m-n+1)).

Knuth-Morris-Pratt (KMP) algorithm: O(n+m).
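The two approaches can be compared side by side. The following sketch assumes P is the pattern (length n) and T is the text (length m), as the running times on the slide suggest. The function names are illustrative, and the KMP variant shown is the common failure-table formulation.

```python
def naive_search(pattern: str, text: str) -> list:
    """Try every alignment of pattern against text; O(n * (m - n + 1)) comparisons."""
    n, m = len(pattern), len(text)
    return [s for s in range(m - n + 1)
            if all(text[s + i] == pattern[i] for i in range(n))]


def failure_table(pattern: str) -> list:
    """fail[i] = length of the longest proper prefix of pattern[:i+1] that is also its suffix."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(pattern: str, text: str) -> list:
    """Knuth-Morris-Pratt: never re-reads text characters; O(n + m) overall."""
    if not pattern:
        return list(range(len(text) + 1))
    fail = failure_table(pattern)
    hits, k = [], 0
    for i, ch in enumerate(text):
        while k and ch != pattern[k]:
            k = fail[k - 1]          # shift the pattern using what already matched
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            hits.append(i - k + 1)   # start index of this occurrence
            k = fail[k - 1]
    return hits


text = "ACGTGGTAGAGACCCTGTGTGATAGACCACGGGTA"
print(naive_search("TAGA", text))  # [6, 22]
print(kmp_search("TAGA", text))    # [6, 22]
```

Both functions return the same start positions. The difference is that KMP uses the failure table to avoid re-comparing text characters after a mismatch.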


String is not Sequence

A string is not the same as the concept of a (sub)sequence in biology!

(Sub)sequences in the biological literature refer to strings that might be interspersed with other characters, such as gaps.
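The distinction can be made concrete with two small checks: one requires the characters to be contiguous, the other allows other characters in between. This is a sketch with illustrative function names, not terminology taken from the slides.

```python
def is_substring(p: str, t: str) -> bool:
    """True if the characters of p appear in t contiguously."""
    return p in t


def is_subsequence(p: str, t: str) -> bool:
    """True if the characters of p appear in t in order, possibly with other characters between them."""
    remaining = iter(t)
    # `ch in remaining` consumes the iterator up to the first match,
    # so each character of p must be found after the previous one.
    return all(ch in remaining for ch in p)


print(is_substring("ACT", "AGCGT"))    # False: A, C, T are not adjacent
print(is_subsequence("ACT", "AGCGT"))  # True:  A . C . T appear in order
```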


More on Bioinformatics


Bioinformatics facts

• About 3 billion chemical base pairs make up human DNA
• There are about 30,000 genes (an early estimate; later counts put protein-coding genes closer to 20,000)
• There are about 100,000 proteins
• Changes in a single base pair are responsible for many illnesses


Genomics

Genome: the complete set of genetic instructions for making an organism

Genomics: attempts to analyze or compare the entire genetic complement of a species


Genome Project

The U.S. Human Genome Project was a 13-year effort coordinated by the Department of Energy and the National Institutes of Health.

Goals:
• identify genes in human DNA
• determine the sequence of chemical base pairs
• create databases
• develop tools for data analysis

The Feb. 16, 2001 issue of Science contains the first analysis of the working draft human genome sequence.


More on Genomics

Comparative Genomics: the management and analysis of the millions of data points that result from genomics

Functional Genomics: ways of identifying gene functions and associations

Structural Genomics: whole-genome analysis


Modern Molecular Biology

• Signals are received at the cell surface and eventually travel to the nucleus.
• Transcription factors cause the signal to be converted into a change in the expression of a gene.
• The gene products are converted to proteins in the cytoplasm, where they can now effect further changes in the cell.

From: Genes for Geeks. http://www.hpcf.upr.edu/~humberto/presentations

http://www.bioteach.ubc.ca


Proteins

Proteins make up about 15% of the mass of the average person

Proteins make up most of the components of cells

Polypeptides: chains of amino acids linked together

A protein's 3D structure is composed of one or more polypeptides

An amino acid is a small organic molecule; there are about 20 different amino acids


Structural Biology

The function of a protein is largely determined by its structure (3D shape)

The structure of a protein is largely determined by the sequence of its polypeptide components

The first biopolymers to be sequenced were proteins, but now it is much simpler, faster, and cheaper to sequence DNA


Genomics Language

Genomic DNA is a linear sequence of 4 nucleotides (A, C, G, T)

DNA forms the double helix by pairing with its reverse complement (A-T, G-C)

Genomic DNA contains many genes, each of which is formed from one or more exons (stretches of genomic DNA), separated by introns

A gene is copied into complementary RNA in a process called transcription (U substitutes for T)
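The reverse-complement pairing and the T-to-U substitution on this slide can be written out directly. The sketch below uses illustrative function names and a made-up six-base example. It distinguishes the coding strand, whose letters the RNA repeats with U for T, from the template strand, which the RNA is complementary to.

```python
DNA_PAIRING = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Partner strand, read in the opposite direction (A-T, G-C pairing)."""
    return dna.translate(DNA_PAIRING)[::-1]


def transcribe(coding_strand: str) -> str:
    """RNA copy written from the coding strand: same letters, with U in place of T."""
    return coding_strand.replace("T", "U")


def transcribe_from_template(template_strand: str) -> str:
    """RNA complementary to the template strand, with U in place of T."""
    return reverse_complement(template_strand).replace("T", "U")


coding = "ATGGCC"                      # hypothetical stretch of a gene
template = reverse_complement(coding)
print(template)                            # GGCCAT
print(transcribe(coding))                  # AUGGCC
print(transcribe_from_template(template))  # AUGGCC
```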


Protein Sequence Alignments

1. Assign functions to unknown proteins

2. Determine relatedness of organisms

3. Identify structurally and functionally important elements

4. Make predictions about the 3D structure


Sequence alignment problem

Given two sequences over an alphabet Σ, and a cost function that assigns a cost to an alignment, find an alignment of the sequences that minimizes the cost.

Example: given DNA or protein sequences, find the best match between them. The best match is the one with the minimum cost.
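One common way to make the cost function concrete, offered here as an assumption since the slides do not fix one, is to charge a substitution cost c(a_i, b_j) for aligning two characters and a constant cost g for aligning a character with a gap. Writing D(i, j) for the minimum cost of aligning the first i characters of sequence a with the first j characters of sequence b, the costs satisfy:

```latex
\begin{aligned}
D(0,0) &= 0, \qquad D(i,0) = i\,g, \qquad D(0,j) = j\,g, \\
D(i,j) &= \min\bigl\{\, D(i-1,j-1) + c(a_i, b_j),\;\; D(i-1,j) + g,\;\; D(i,j-1) + g \,\bigr\}.
\end{aligned}
```

For sequences of lengths m and n the optimal alignment cost is D(m, n). With c = 0 for a match, c = 1 for a mismatch, and g = 1, this is the edit distance between the two sequences.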


Sequence alignment problem

Usually solved with dynamic programming

Recall your algorithms class: O(m*n) time for sequences of lengths m and n

G A A T T C A G T T A
|   |   | |   |     |
G G A _ T C _ G _ _ A

G A A T T C A G T T A
|   | |   |   |     |
G G A T _ C _ G _ _ A

From: http://www.sbc.su.se/~per/molbioinfo2001/dynprog/adv_dynamic.html
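A minimal sketch of the dynamic-programming solution follows. It assumes unit costs for mismatches and gaps, since the slides do not specify a cost function; the function name `align` and its parameters are illustrative. Several alignments can share the minimum cost, as the two examples on the slide show, so the traceback returns just one of them.

```python
def align(a: str, b: str, mismatch: int = 1, gap: int = 1):
    """Global alignment by dynamic programming in O(len(a) * len(b)) time.

    Returns (minimum cost, aligned a, aligned b), using '_' for gaps.
    """
    m, n = len(a), len(b)
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        cost[i][0] = i * gap
    for j in range(1, n + 1):
        cost[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # align a[i-1] with b[j-1]
                             cost[i - 1][j] + gap,       # a[i-1] against a gap
                             cost[i][j - 1] + gap)       # b[j-1] against a gap

    # Trace back from the bottom-right corner to recover one optimal alignment.
    top, bottom = [], []
    i, j = m, n
    while i > 0 or j > 0:
        sub = mismatch if i > 0 and j > 0 and a[i - 1] != b[j - 1] else 0
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + sub:
            top.append(a[i - 1]); bottom.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap:
            top.append(a[i - 1]); bottom.append("_"); i -= 1
        else:
            top.append("_"); bottom.append(b[j - 1]); j -= 1
    return cost[m][n], "".join(reversed(top)), "".join(reversed(bottom))


best, row_a, row_b = align("GAATTCAGTTA", "GGATCGA")
print(best)   # 5: one mismatch plus four gaps, as in the alignments on the slide
print(row_a)
print(row_b)
```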


The Protein Folding Problem

Given a sequence S and an integer E, is there a fold of S that has energy -E or lower?

This problem has been shown to be NP-complete in 2D (HAMILTONIAN PATH).
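Membership in NP means that a proposed fold can be checked quickly, even though finding one may be hard. The sketch below illustrates that check under an assumption the slides do not state: it uses the widely studied HP lattice model, in which each residue is hydrophobic (H) or polar (P), a fold is a self-avoiding path on the 2D square grid, and every pair of H residues that are grid neighbours but not chain neighbours contributes -1 to the energy. The function name and the example fold are hypothetical.

```python
def fold_energy(sequence: str, path: list) -> int:
    """Energy of a 2D lattice fold in the HP model (assumed model, see lead-in).

    sequence: string over {'H', 'P'}; path: one (x, y) grid point per residue.
    """
    if len(sequence) != len(path):
        raise ValueError("need one lattice point per residue")
    if len(set(path)) != len(path):
        raise ValueError("fold is not self-avoiding")
    for (x1, y1), (x2, y2) in zip(path, path[1:]):
        if abs(x1 - x2) + abs(y1 - y2) != 1:
            raise ValueError("consecutive residues must be grid neighbours")

    position = {point: i for i, point in enumerate(path)}
    contacts = 0
    for i, (x, y) in enumerate(path):
        if sequence[i] != "H":
            continue
        # Look only right and up so that each contact is counted once.
        for neighbour in ((x + 1, y), (x, y + 1)):
            j = position.get(neighbour)
            if j is not None and sequence[j] == "H" and abs(i - j) > 1:
                contacts += 1
    return -contacts


square = [(0, 0), (1, 0), (1, 1), (0, 1)]      # chain bent into a square
straight = [(0, 0), (1, 0), (2, 0), (3, 0)]    # fully extended chain
print(fold_energy("HPPH", square))    # -1: the two H residues touch
print(fold_energy("HPPH", straight))  # 0
```

Checking one fold takes time linear in the chain length. The hard part of the decision problem is the search over exponentially many folds.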


The Inverse Protein Folding (IFP) problem

Given a target structure or conformation of a protein G, find a sequence S of length n that:
• has G as its minimum energy state;
• has the lowest degeneracy (number of other conformations with the same energy) of any possible sequence.


The Heuristic Sequence Design (HSD) Problem

IFP is NP-complete; the best known algorithm must search over all possible conformations of all possible sequences.

HSD problems try to simplify the computation by restricting the problem:
• The Canonical Method: find the sequence with at most λn hydrophobic residues
• The Grand Canonical Method: change the energy function


Bioinformatics Software


GCG

• Genetics Computer Group
• The Wisconsin Package for Sequence Analysis
• Consists of 130+ integrated programs
• Web-based, command-line, and X Window System interfaces


SeqWeb

• Database Searching and Retrieval for GCG
• Comparison
• Protein Analysis
• Mapping
• Pattern Recognition


EMBOSS

EMBOSS is a suite of around 100 bioinformatics programs, including:
• Sequence alignment
• Database search with sequence pattern
• Protein motif identification

Link: http://emboss.org/


Artemis

Artemis is a free genome viewer and annotation tool that allows visualization of sequence features and the results of analyses within the context of the sequence and its six-frame translation.

Link: http://www.sanger.ac.uk/Software/Artemis/


RasMol

RasMol is a free program that displays molecular structure.

http://www.umass.edu/microbio/rasmol/index2.htm


Bio-Knoppix

Bio-Knoppix is a customized distribution of Knoppix enhanced for bioinformatics applications and presentations.

Link: http://bioknoppix.hpcf.upr.edu/


Some useful sites

http://www.bioinformatics.ca/links_directory/
http://www.hgmp.mrc.ac.uk/GenomeWeb/docs-theory.html
http://biotech.icmb.utexas.edu/pages/bioinform/biresources.html
http://scop.berkeley.edu/
http://www.cse.ucsc.edu/~karplus/compbio_pages.html
http://www.peterindia.net/ComputBioArticles.html
http://bioknoppix.hpcf.upr.edu/
http://www.ornl.gov/sci/techresources/Human_Genome
http://bioinformatics.org/