bioinformatics miron
Bioinformatics Presentation - Training of MiRON
Welcome to BIOINFORMATICS-MiRON
Outline
Workshop chronology (in the handouts)
Brief background information
Applications & role
Bioinformatics tools
Practical classes
Problem-solving exercises
What’s expected of you?
Questions/comments are welcome at all points
Aims
To introduce the concepts and language of bioinformatics.
To provide an understanding of how nucleic acid and protein sequence data is obtained and analysed.
To develop skills in utilising online databases and interpreting data.
To develop an understanding of how bioinformatics can be applied to solve specific problems in biomedical science.
To develop transferable IT and communications skills.
In this workshop...
You will learn how data is generated and analysed
As well as what the generated data can tell us about the molecular biology of organisms
And various practical applications of this knowledge
What is bioinformatics?
Why bioinformatics?
Over the past decade massive amounts of sequence data have been generated
This has more recently been joined by gene expression data obtained from microarrays and proteomic technologies
This vast amount of data can only be analysed using various specialised computer algorithms
Main Topics (Review)
Genome organisation and analysis
Functional genomics
Advanced techniques in molecular biology
Archives, information retrieval and alignments: nucleic acid sequence databases; genome databases; protein sequence databases; database searching
Dot plots (similarity matrix) and sequence alignments (PSI-BLAST)
Genome expression: microarray analysis, proteomics, eukaryotic genome expression
What bioinformaticians think they are
What they do
Examples of Bioinformatics
Database interfaces: GenBank/EMBL/DDBJ, Medline, SwissProt, PDB, …
Sequence alignment: BLAST, FASTA
Multiple sequence alignment: Clustal W, MultAlin, DiAlign
Gene finding: Genscan, GenomeScan, GeneMark, GRAIL
Protein domain analysis and identification: Pfam, BLOCKS, ProDom
Pattern identification/characterization: Gibbs Sampler, AlignACE, MEME
Protein folding prediction: PredictProtein, SWISS-MODEL
Five websites that all biologists should know
NCBI (The National Center for Biotechnology Information): http://www.ncbi.nlm.nih.gov/
EBI (The European Bioinformatics Institute): http://www.ebi.ac.uk/
The Canadian Bioinformatics Resource: http://www.cbr.nrc.ca/
SwissProt/ExPASy (Swiss Bioinformatics Resource): http://expasy.cbr.nrc.ca/sprot/
PDB (The Protein Databank): http://www.rcsb.org/PDB/
Remember while using web server-based tools
You are using someone else’s computer
You are (probably) getting a reduced set of options or capacity
Servers are great for sporadic or proof-of-principle work, but for intensive work, the software should be obtained and run locally
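As a rough illustration of running a search locally, the sketch below only builds (and does not execute) a command line for the `blastn` program from the NCBI BLAST+ package. The file name `query.fasta`, the database name `my_nt_db`, the output name `hits.tsv` and the E-value cut-off are placeholders chosen for this example, and the database is assumed to have been built beforehand with `makeblastdb`.

```python
import shutil


def build_blastn_command(query_fasta, database, evalue=1e-5, out_path="hits.tsv"):
    """Return the argument list for a local blastn search (tabular output)."""
    return [
        "blastn",
        "-query", query_fasta,    # FASTA file with the query sequence(s)
        "-db", database,          # name of a local nucleotide BLAST database
        "-evalue", str(evalue),   # report only hits below this E-value
        "-outfmt", "6",           # tab-separated table, one hit per line
        "-out", out_path,         # file the results are written to
    ]


# Hypothetical file and database names, used only for illustration.
cmd = build_blastn_command("query.fasta", "my_nt_db")

if shutil.which("blastn") is None:
    print("blastn not found on PATH; the command would be:")
else:
    print("blastn is installed; the command to run is:")
print(" ".join(cmd))
```

The list could be handed to `subprocess.run(cmd, check=True)` once BLAST+ and a database are actually installed.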
Human Gene Index Database
HGI is a database of expressed DNA sequences, made up mostly of ESTs, which are a type of partial cDNA
EST stands for Expressed Sequence Tag
These short sequences were created using essentially the same method used to make cDNAs
As such they represent the expressed part of a genome and are made from mRNA, which is ultimately expressed from genes
Gene Structure
Similarity Searching
There are a variety of computer programs that are used for making comparisons between DNA sequences.
The most popular is known as BLAST (Basic Local Alignment Search Tool)
BLAST is free at the NCBI website
BLAST is Complex
Similarity searching relies on the concepts of alignment and distance between pairs of sequences.
Distances can only be measured between aligned sequences (match vs. mismatch at each position).
A similarity search is a process of testing the best alignment of a query sequence with every sequence in a database.
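To make the idea of distance between aligned sequences concrete, here is a minimal sketch that counts mismatching positions and reports percent identity. The function name is made up for this example, gaps (`-`) are simply counted as mismatches, and real tools use more elaborate scoring schemes. The two strings are the first 20 columns of the human/canine hexokinase alignment shown later in this presentation.

```python
def alignment_distance(aligned_a, aligned_b):
    """Return (number of mismatching columns, percent identity).

    Both inputs must already be aligned, i.e. have the same length.
    A column containing a gap character '-' counts as a mismatch.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    mismatches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x != y or x == "-"
    )
    identity = 100.0 * (len(aligned_a) - mismatches) / len(aligned_a)
    return mismatches, identity


query = "TGCATGGTTTGATTTTGACC"
sbjct = "TGCATGATCTGATTTCAACC"
mismatches, identity = alignment_distance(query, sbjct)
print(mismatches, identity)  # 4 mismatches, 80.0 percent identity
```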
INTRO TO BLAST
Basic Local Alignment Search Tool
It is used to compare a query sequence with those contained in nucleotide databases by aligning the query sequence with previously characterised genes, thereby helping to identify genes.
The emphasis of this tool is to find regions of sequence similarity between two different genes.
These sequence alignments can yield clues about the structure and function of a novel sequence, and about its evolutionary history and homology with other sequences in the database.
Workshop -1 (database search & inference of possible homology)
Please refer to getting started with bioinformatics
BLAST has Automatic Translation
BLASTX makes automatic translation (in all 6 reading frames) of your DNA query sequence to compare with protein databanks
TBLASTN makes automatic translation of an entire DNA database to compare with your protein query sequence
Only make a DNA-DNA search if you are working with a sequence that does not code for protein.
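The six-frame translation that BLASTX applies to a DNA query can be sketched as follows. This is an illustration using the standard genetic code, not BLAST's own implementation; the function names are invented for the example, and codons containing anything other than A, C, G, T are written as `X`.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; '*' marks a stop codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return the translations of frames +1, +2, +3, -1, -2, -3."""
    dna = dna.upper()
    rev = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(dna[offset:])     # forward strand
        frames[-(offset + 1)] = translate(rev[offset:])  # reverse strand
    return frames


frames = six_frames("ATGGCCTAA")
for frame in (1, 2, 3, -1, -2, -3):
    print(frame, frames[frame])
```

For the toy sequence `ATGGCCTAA`, frame +1 reads `MA*` (Met, Ala, stop), while the other five frames give unrelated peptides, which is why all six must be tried when the reading frame is unknown.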
A typical sequence ready for submission to BLAST
>THC2465887
GGCTGCGGAGGACCGACCGTCCCCACGCCTGCCGCCCCGCGACCCCGACCGCCAGCATGATCGCCGCGCAGCTCCTGGCC
TATTACTTCACGGAGCTGAAGGATGACCAGGTCAAAAAGATTGACAAGTATCTCTATGCCATGCGGCTCTCCGATGAAAC
TCTCATAGATATCATGACTCGCTTCAGGAAGGAGATGAAGAATGGCCTCTCCCGGGATTTTAATCCAACAGCCACAGTCA
AGATGTTGCCAACATTCGTAAGGTCCATTCCTGATGGCTCTGAAAAGGGAGATTTCATTGCCCTGGATCTTGGTGGGTCT
TCCTTTCGAATTCTGCGGGTGCAAGTGAATCATGAGAAAAACCAGAATGTTCACATGGAGTCCGAGGTTTATGACACCCC
AGAGAACATCGTGCACGGCAGTGGAAGCCAGCTTTTTGATCATGTTGCTGAGTGCCTGGGAGATTTCATGGAGAAAAGGA
AGATCAAGGACAAGAAGTTACCTGTGGGATTCACGTTTTCTTTTCCTTGCCAACAATCCAAAATAGATGAGGCCATCCTG
ATCACCTGGACAAAGCGATTTAAAGCGAGCGGAGTGGAAGGAGCAGATGTGGTCAAACTGCTTAACAAAGCCATCAAAAA
GCGAGGGGACTATGATGCCAACATCGTAGCTGTGGTGAA
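A sequence in this form is a FASTA record: a header line starting with `>` followed by the sequence itself. The sketch below shows one simple way to read such records and check that only nucleotide letters are present before submitting them. The parser is an assumption made for illustration (real projects often use an established library instead), and only the first two sequence lines of THC2465887 are included to keep the example short.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict of {identifier: sequence}."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].split()[0]  # identifier is the first word of the header
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data found before any '>' header")
        else:
            parts[name].append(line.upper())
    return {key: "".join(chunks) for key, chunks in parts.items()}


# Header plus the first two sequence lines of the record shown above.
example = """>THC2465887
GGCTGCGGAGGACCGACCGTCCCCACGCCTGCCGCCCCGCGACCCCGACCGCCAGCATGATCGCCGCGCAGCTCCTGGCC
TATTACTTCACGGAGCTGAAGGATGACCAGGTCAAAAAGATTGACAAGTATCTCTATGCCATGCGGCTCTCCGATGAAAC
"""

records = parse_fasta(example)
sequence = records["THC2465887"]
is_nucleotide = set(sequence) <= set("ACGTN")
print(len(sequence), is_nucleotide)
```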
BLAST OUTPUT
Query: 3034 TGCATGGTTTGATTTTGACCTGGTC---C---CCC-ACGTGTGAAGTGTAGTGGCATCCA 3086
            |||||| | ||||||  ||||||||   |   ||| ||||||||||| |||||||| |||
Sbjct: 75   TGCATGATCTGATTTCAACCTGGTCGTACGCTCCCCACGTGTGAAGTTTAGTGGCACCCA 134

Query: 3087 TTTCTAATGTATGCATTCATCCAACAGAGTTATTTATTGGCTGGAGATGGAAAATCACAC 3146
            |||| | | | ||||||| ||  ||||||||||||||||||   ||||| ||| |||| |
Sbjct: 135  TTTCCAGTCTCTGCATTCGTCTGACAGAGTTATTTATTGGCCCAAGATGAAAAGTCACGC 194

Query: 3147 CACCTGACAGGCCTTCTGGG-CCTCCAAAGCCCATCCTTGGGGTTCCCCCTCCCTGTGTG 3205
            || | | |||||||| |||| ||||   ||||| |||||||||   |  | |||||||||
Sbjct: 195  CATCCGCCAGGCCTTATGGGGCCTCTGCAGCCCGTCCTTGGGGACACATC-CCCTGTGTG 253

Query: 3206 AAATGTATTATCACCAGCAGACACTGCCGGGCCTCC-C-TCCCGGGGGCACTGCCTGAAG 3263
            ||||||||||||||||||||||||||||||| |||| | |||| |||||| | | |
Sbjct: 254  AAATGTATTATCACCAGCAGACACTGCCGGGACTCCTCCTCCCAGGGGCA-T-CTTAGCT 311

Query: 3264 GCGAG-TGTGGGCATAGCATTAGCTGCTTCCTCCCCTCCTG-GCA-CCCACTGTGGCC-T 3319
            ||    |   | |  ||||    |||||  ||  | ||| | | | |||| |  || | |
Sbjct: 312  GCTTCCTCCCGTCCCAGCACCCACTGCTGTCTGGCGTCCCGAGGATCCCA-TCAGGACGT 370

Query: 3320 GGC-ATCGCATCGTGGTGTGTCAATGCCACAAAATCGTGTGTCCGTGGAACCAGTCCTAG 3378
            | | ||  ||  | |  ||||   | ||    || | || ||| | |  ||   || |
Sbjct: 371  GTCCATGCCACTGAGTCGTGTG--T-CCGTGGAA-C-TG-GTCAGAGCCACT--TCGTGA 422

Query: 3379 CCGCGTGTGACAGTCTTGCATTCTGTTTGTCTCGTGGGGGGAGGTGGACAG-TCCTGCGG 3437
            | |  | || || |||  |  |||  | |  | |  ||          ||  ||||| ||
Sbjct: 423  CAGTCT-TG-CATTCTGTCTGTCT--TGGGGTGGNNGGNAAGNNNNNCCANNTCCTGTGG 478

Query: 3438 -AAAT--GTGTCTTGTCTCCATTTGGA-TAAAA-GGAA-CCAA--CCAACAAACAATGCC 3489
             |||   | | |||| ||||||||||  ||||| |||| ||||  ||||||| || ||||
Sbjct: 479  GAAAAAGGGGCCTTGGCTCCATTTGGGGTAAAAAGGAAACCAAACCCAACAA-CAGTGCC 537

Query: 3490 A-TCACTGG-AATTTCCC-ACCG-CTTT--GTGAGCCGTG-TCGTATGA-CCTAGTAAAC 3541
              ||| ||| |||| ||| |  | ||||  ||||||| || | |||||| |||||  ||
Sbjct: 538  CCTCATTGGGAATTCCCCCATTGGCTTTTTGTGAGCCATGGTTGTATGAACCTAGGTAAA 597

Query: 3542 TTTGT 3546
             || |
Sbjct: 598  CTTNT 602
BLAST alignment of human vs. canine partial cDNAs for hexokinase 1
Understand the Statistics!
BLAST produces an E-value for every match
The E-value is the number of matches with an equal or better score expected by chance in a database of that size; for small values it is nearly the same as the P value in a statistical test
A match is generally considered significant if the E-value < 0.05 (smaller numbers are more significant)
Very low E-values (e.g. e-100) indicate homologs or identical genes
Moderate E-values indicate related genes
Long regions of moderate similarity are more important than short regions of high identity.
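The link between an E-value and a P value can be worked through numerically. The sketch below uses two standard relations from BLAST statistics: the probability of at least one chance hit is P = 1 - exp(-E), and the E-value for a normalised bit score S' is roughly m x n x 2^(-S'), where m and n are the query and database lengths. The function names and the example lengths are made up for this illustration, and BLAST itself uses corrected "effective" lengths, so real reports will differ somewhat.

```python
import math


def p_value_from_e(e_value):
    """Probability of seeing at least one chance hit, given the expected number E."""
    return -math.expm1(-e_value)  # numerically stable form of 1 - exp(-E)


def e_value_from_bit_score(bit_score, query_len, db_len):
    """Approximate E-value for a bit score in a search space of query_len * db_len."""
    return query_len * db_len * 2.0 ** (-bit_score)


for e in (1e-100, 0.05, 1.0, 10.0):
    print(f"E = {e:g}  ->  P = {p_value_from_e(e):.6g}")

# Hypothetical search: 100-base query against a 1,000,000-base database.
print(e_value_from_bit_score(40, 100, 1_000_000))
```

For small values the two numbers are practically identical (E = 0.05 gives P of about 0.049), but they diverge as E grows: an E-value of 10 is perfectly possible, whereas a probability can never exceed 1.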
BLAST is Approximate
BLAST makes similarity searches very quickly because it takes shortcuts.
It looks for short, nearly identical “words” (11 bases)
It also makes errors:
misses some important similarities
makes many incorrect matches
is easily fooled by repeats or skewed composition
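The word-matching shortcut can be sketched in a few lines. This simplified illustration indexes every 11-base word of a subject sequence and looks up the words of the query, using exact matches only; it leaves out the extension and scoring steps that a real BLAST search performs after finding a seed. The function names are invented for the example, and the two sequences are short stretches taken from the hexokinase alignment shown earlier.

```python
from collections import defaultdict


def index_words(sequence, word_size=11):
    """Map every word of length word_size to the positions where it starts."""
    index = defaultdict(list)
    for start in range(len(sequence) - word_size + 1):
        index[sequence[start:start + word_size]].append(start)
    return index


def find_seeds(query, subject, word_size=11):
    """Return (query_pos, subject_pos) pairs where an identical word occurs."""
    index = index_words(subject, word_size)
    seeds = []
    for q_pos in range(len(query) - word_size + 1):
        word = query[q_pos:q_pos + word_size]
        for s_pos in index.get(word, []):
            seeds.append((q_pos, s_pos))
    return seeds


query = "ACGTGTGAAGTGTAGTGG"
subject = "CCCCACGTGTGAAGTTTAGTGG"
seeds = find_seeds(query, subject)
print(seeds)  # [(0, 4)]
```

Only one seed is found even though the two sequences differ at a single base over the shared region: every other 11-base window spans that mismatch. This is the kind of situation in which a word-based search can miss a genuine similarity.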
Bad Genome Annotation
Gene finding is at best only 90% accurate.
New sequences are automatically annotated with BLAST scores.
Bad annotations propagate
It’s going to take us 10-20 years or more to sort this mess out!
Conclusions
We have only touched small parts of the elephant
Trial and error (intelligently) is often your best tool
Keep up with the main five sites, and you’ll have a pretty good idea of what is happening and available