bioinformatics miron

Welcome to BIOINFORMATICS-MiRON

Outline

Workshop chronology (on handouts)

Brief background information

Applications & role

Bioinformatics tools

Practical classes

Problem solving exercises

What's expected of you?

Questions/comments are welcome at all points

Aims

To introduce the concepts and language of bioinformatics.

To provide an understanding of how nucleic acid and protein sequence data is obtained and analysed.

To develop skills in utilising online databases and interpreting data.

To develop an understanding of how bioinformatics can be applied to solve specific problems in biomedical science.

To develop transferable IT and communications skills.

In this workshop...

You will learn about how data is generated and analysed

As well as what the generated data can tell us about the molecular biology of organisms

And various practical applications of this knowledge

What is bioinformatics?

Why bioinformatics?

Over the past decade massive amounts of sequence data have been generated

This has more recently been joined by gene expression data obtained from microarrays and proteomic technologies

This vast amount of data can only be analysed using various specialised computer algorithms

Main Topics (Review)

Genome organisation and analysis

Functional genomics

Advanced techniques in molecular biology

Archives, information retrieval and alignments: nucleic acid sequence databases; genome databases; protein sequence databases; database searching

Dot plots (similarity matrix) and sequence alignments (PSI-BLAST)

Genome expression: Microarray analysis, proteomics, eukaryotic genome expression

What bioinformaticians think they are

What they do

Examples of Bioinformatics

Database interfaces: GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, …

Sequence alignment: BLAST, FASTA

Multiple sequence alignment: Clustal W, MultAlin, DiAlign

Gene finding: Genscan, GenomeScan, GeneMark, GRAIL

Protein domain analysis and identification: Pfam, BLOCKS, ProDom

Pattern identification/characterization: Gibbs Sampler, AlignACE, MEME

Protein folding prediction: PredictProtein, SWISS-MODEL

Five websites that all biologists should know

NCBI (The National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/

EBI (The European Bioinformatics Institute): http://www.ebi.ac.uk/

The Canadian Bioinformatics Resource: http://www.cbr.nrc.ca/

SwissProt/ExPASy (Swiss Bioinformatics Resource): http://expasy.cbr.nrc.ca/sprot/

PDB (The Protein Databank): http://www.rcsb.org/PDB/

Remember while using web server-based tools

You are using someone else’s computer

You are (probably) getting a reduced set of options or capacity

Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally

Human Gene Index Database

HGI is a database of expressed DNA sequences, mostly made of ESTs, which are a type of partial cDNA

EST stands for Expressed Sequence Tag

These short sequences were created using essentially the same method used to make cDNAs

As such they represent the expressed part of a genome and are made from mRNA which is ultimately expressed from GENES

Gene Structure

Similarity Searching

There are a variety of computer programs that are used for making comparisons between DNA sequences.

The most popular is known as BLAST (Basic Local Alignment Search Tool)

BLAST is free at the NCBI website

BLAST is Complex

Similarity searching relies on the concepts of alignment and distance between pairs of sequences.

Distances can only be measured between aligned sequences (match vs. mismatch at each position).

A similarity search is a process of testing the best alignment of a query sequence with every sequence in a database.
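As a rough illustration of these two ideas, the sketch below counts mismatches between two already-aligned sequences and then tests a query against every sequence in a toy database. The function names and the example database are hypothetical, and the search is ungapped and exhaustive, which is far simpler than what BLAST actually does.

```python
def mismatch_distance(a, b):
    """Count positions where two aligned sequences differ (match vs. mismatch)."""
    if len(a) != len(b):
        raise ValueError("distance is only defined for aligned, equal-length sequences")
    return sum(1 for x, y in zip(a, b) if x != y)


def percent_identity(a, b):
    """Percentage of aligned positions that are identical."""
    return 100.0 * (len(a) - mismatch_distance(a, b)) / len(a)


def best_ungapped_hit(query, database):
    """Slide the query along every database sequence (no gaps allowed)
    and return (name, offset, mismatches) for the best placement found."""
    best = None
    for name, subject in database.items():
        for offset in range(len(subject) - len(query) + 1):
            window = subject[offset:offset + len(query)]
            d = mismatch_distance(query, window)
            if best is None or d < best[2]:
                best = (name, offset, d)
    return best


# Toy database; the names and sequences are made up for illustration.
toy_db = {"seqA": "GGGTTTGTCC", "seqB": "AAAAAAAAAA"}
print(best_ungapped_hit("TTTGT", toy_db))
print(percent_identity("TTTGT", "CTTNT"))
```

Testing every offset in every sequence is what makes a naive search slow on a real database.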

INTRO TO BLAST

Basic Local Alignment Search Tool

It is used to compare a query sequence with those contained in nucleotide databases by aligning the query sequence with previously characterised genes, therefore helping in identifying genes.

The emphasis of this tool is to find regions of sequence similarity between two different genes.

These sequence alignments can yield clues about the structure and function of a novel sequence, and about its evolutionary history and homology with other sequences in the database.

Workshop -1 (database search & inference of possible homology)

Please refer to getting started with bioinformatics

BLAST has Automatic Translation

BLASTX makes automatic translation (in all 6 reading frames) of your DNA query sequence to compare with protein databanks

TBLASTN makes automatic translation of an entire DNA database to compare with your protein query sequence

Only make a DNA-DNA search if you are working with a sequence that does not code for protein.
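To make "all 6 reading frames" concrete, here is a minimal sketch of six-frame translation: three frames on the given strand and three on its reverse complement. It assumes the standard genetic code; the function names are hypothetical and this is not BLASTX's own code.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def translate(dna):
    """Translate complete codons from the start of dna; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translations(dna):
    """Return a dict mapping frame labels (+1..+3, -1..-3) to protein strings."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


for label, protein in six_frame_translations("ATGGCCTAA").items():
    print(label, protein)
```

Stop codons appear as `*`, so a frame with a long stretch free of `*` is a candidate coding region.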

A typical sequence ready for submission to BLAST

>THC2465887
GGCTGCGGAGGACCGACCGTCCCCACGCCTGCCGCCCCGCGACCCCGACCGCCAGCATGATCGCCGCGCAGCTCCTGGCC
TATTACTTCACGGAGCTGAAGGATGACCAGGTCAAAAAGATTGACAAGTATCTCTATGCCATGCGGCTCTCCGATGAAAC
TCTCATAGATATCATGACTCGCTTCAGGAAGGAGATGAAGAATGGCCTCTCCCGGGATTTTAATCCAACAGCCACAGTCA
AGATGTTGCCAACATTCGTAAGGTCCATTCCTGATGGCTCTGAAAAGGGAGATTTCATTGCCCTGGATCTTGGTGGGTCT
TCCTTTCGAATTCTGCGGGTGCAAGTGAATCATGAGAAAAACCAGAATGTTCACATGGAGTCCGAGGTTTATGACACCCC
AGAGAACATCGTGCACGGCAGTGGAAGCCAGCTTTTTGATCATGTTGCTGAGTGCCTGGGAGATTTCATGGAGAAAAGGA
AGATCAAGGACAAGAAGTTACCTGTGGGATTCACGTTTTCTTTTCCTTGCCAACAATCCAAAATAGATGAGGCCATCCTG
ATCACCTGGACAAAGCGATTTAAAGCGAGCGGAGTGGAAGGAGCAGATGTGGTCAAACTGCTTAACAAAGCCATCAAAAA
GCGAGGGGACTATGATGCCAACATCGTAGCTGTGGTGAA
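The sequence above is in FASTA format: a header beginning with ">" followed by the sequence itself. Here is a minimal sketch of reading such records before submitting them to a search tool, assuming a line-based layout with the header on its own line. The function name is hypothetical and the parser does no validation of the bases.

```python
def parse_fasta(text):
    """Parse FASTA text into a dict of {identifier: sequence}.

    The identifier is the first word after '>' on the header line;
    sequence lines are joined and upper-cased.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            parts = line[1:].split()
            name = parts[0] if parts else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(chunks) for n, chunks in records.items()}


# Shortened example using the start of the record shown above.
example = ">THC2465887\nGGCTGCGGAG\nGACCGACCGT\n"
for identifier, sequence in parse_fasta(example).items():
    print(identifier, len(sequence), sequence)
```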

BLAST OUTPUT

Query: 3034 TGCATGGTTTGATTTTGACCTGGTC---C---CCC-ACGTGTGAAGTGTAGTGGCATCCA 3086
            |||||| | ||||||  ||||||||   |   ||| ||||||||||| |||||||| |||
Sbjct: 75   TGCATGATCTGATTTCAACCTGGTCGTACGCTCCCCACGTGTGAAGTTTAGTGGCACCCA 134

Query: 3087 TTTCTAATGTATGCATTCATCCAACAGAGTTATTTATTGGCTGGAGATGGAAAATCACAC 3146
            |||| | | | ||||||| ||  ||||||||||||||||||   ||||| ||| |||| |
Sbjct: 135  TTTCCAGTCTCTGCATTCGTCTGACAGAGTTATTTATTGGCCCAAGATGAAAAGTCACGC 194

Query: 3147 CACCTGACAGGCCTTCTGGG-CCTCCAAAGCCCATCCTTGGGGTTCCCCCTCCCTGTGTG 3205
            || | | |||||||| |||| ||||   ||||| |||||||||   |  | |||||||||
Sbjct: 195  CATCCGCCAGGCCTTATGGGGCCTCTGCAGCCCGTCCTTGGGGACACATC-CCCTGTGTG 253

Query: 3206 AAATGTATTATCACCAGCAGACACTGCCGGGCCTCC-C-TCCCGGGGGCACTGCCTGAAG 3263
            ||||||||||||||||||||||||||||||| |||| | |||| |||||| | | |
Sbjct: 254  AAATGTATTATCACCAGCAGACACTGCCGGGACTCCTCCTCCCAGGGGCA-T-CTTAGCT 311

Query: 3264 GCGAG-TGTGGGCATAGCATTAGCTGCTTCCTCCCCTCCTG-GCA-CCCACTGTGGCC-T 3319
            ||    |   | |  ||||    |||||  ||  | ||| | | | |||| |  || | |
Sbjct: 312  GCTTCCTCCCGTCCCAGCACCCACTGCTGTCTGGCGTCCCGAGGATCCCA-TCAGGACGT 370

Query: 3320 GGC-ATCGCATCGTGGTGTGTCAATGCCACAAAATCGTGTGTCCGTGGAACCAGTCCTAG 3378
            | | ||  ||  | |  ||||   | ||    || | || ||| | |  ||   || |
Sbjct: 371  GTCCATGCCACTGAGTCGTGTG--T-CCGTGGAA-C-TG-GTCAGAGCCACT--TCGTGA 422

Query: 3379 CCGCGTGTGACAGTCTTGCATTCTGTTTGTCTCGTGGGGGGAGGTGGACAG-TCCTGCGG 3437
            | |  | || || |||  |  |||  | |  | |  ||          ||  ||||| ||
Sbjct: 423  CAGTCT-TG-CATTCTGTCTGTCT--TGGGGTGGNNGGNAAGNNNNNCCANNTCCTGTGG 478

Query: 3438 -AAAT--GTGTCTTGTCTCCATTTGGA-TAAAA-GGAA-CCAA--CCAACAAACAATGCC 3489
             |||   | | |||| ||||||||||  ||||| |||| ||||  ||||||| || ||||
Sbjct: 479  GAAAAAGGGGCCTTGGCTCCATTTGGGGTAAAAAGGAAACCAAACCCAACAA-CAGTGCC 537

Query: 3490 A-TCACTGG-AATTTCCC-ACCG-CTTT--GTGAGCCGTG-TCGTATGA-CCTAGTAAAC 3541
              ||| ||| |||| ||| |  | ||||  ||||||| || | |||||| |||||  ||
Sbjct: 538  CCTCATTGGGAATTCCCCCATTGGCTTTTTGTGAGCCATGGTTGTATGAACCTAGGTAAA 597

Query: 3542 TTTGT 3546
             || |
Sbjct: 598  CTTNT 602

BLAST line-up of human v canine partial cDNAs for hexokinase 1

Understand the Statistics!

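BLAST produces an E-value for every match

The E-value is the number of matches with a score at least this good that would be expected by chance in a database of this size; for small values it is nearly equal to the P value in a statistical test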

A match is generally considered significant if the E-value < 0.05 (smaller numbers are more significant)

Very low E-values (e-100) are homologs or identical genes

Moderate E-values are related genes

Long regions of moderate similarity are more important than short regions of high identity.
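A small sketch of how the numbers relate, under stated assumptions: the conversion P = 1 - exp(-E) treats the number of chance hits as Poisson-distributed, and the bit-score formula E = m × n × 2^(-S') ignores the effective-length corrections that BLAST applies. The function names are hypothetical.

```python
import math


def p_value_from_e_value(e_value):
    """Probability of at least one chance hit, given the expected count E.

    Assumes a Poisson model: P = 1 - exp(-E). For small E, P is close to E.
    """
    return -math.expm1(-e_value)


def e_value_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value from a bit score: E = m * n * 2**(-S').

    Uses raw query and database lengths (an assumption); real BLAST
    uses corrected "effective" lengths.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


for e in (1e-100, 0.05, 1.0, 10.0):
    print(e, p_value_from_e_value(e))
```

The same score gives a larger E-value in a larger database, which is why E-values from different databases are not directly comparable.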

BLAST is Approximate

BLAST makes similarity searches very quickly because it takes shortcuts: it looks for short, nearly identical “words” (11 bases)

It also makes errors: misses some important similarities; makes many incorrect matches; easily fooled by repeats or skewed composition
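The word shortcut can be sketched as follows. This is a simplified illustration, not BLAST's implementation: it assumes exact word matches only and stops at finding seed positions, whereas BLAST goes on to extend seeds into longer alignments. The function names are hypothetical; the default word size of 11 follows the figure given above.

```python
from collections import defaultdict


def index_words(subject, word_size=11):
    """Map every word of length word_size in the subject to its start positions."""
    index = defaultdict(list)
    for i in range(len(subject) - word_size + 1):
        index[subject[i:i + word_size]].append(i)
    return index


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where a whole word matches exactly."""
    index = index_words(subject, word_size)
    seeds = []
    for q in range(len(query) - word_size + 1):
        for s in index.get(query[q:q + word_size], ()):
            seeds.append((q, s))
    return seeds


# Small word size so the toy sequences are readable.
print(find_seeds("ACGTAC", "TTACGTACTT", word_size=4))
```

Two sequences that are similar overall but never share a full word in a row produce no seeds, which is one way real similarities get missed. A repeated word produces a seed at every copy, which is one way repeats create spurious matches.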

Bad Genome Annotation

Gene finding is at best only 90% accurate.

New sequences are automatically annotated with BLAST scores.

Bad annotations propagate

It's going to take us 10-20 years or more to sort this mess out!

Conclusions

We have only touched small parts of the elephant

Trial and error (intelligently) is often your best tool

Keep up with the main five sites, and you’ll have a pretty good idea of what is happening and available
