protein sequence databases

Protein sequence databases Petri Törönen Shamelessly copied from material done by Eija Korpelainen This also includes old material from my thesis www.hytti.uku.fi/~toronen/Gradu_verkkoon.zip and from CSC bio-opas http://www.csc.fi/oppaat/bio/ http://www.csc.fi/oppaat/bio/bio-opas.pdf

Upload: ahmed-vang

Post on 03-Jan-2016


Page 1: Protein sequence databases

Protein sequence databases

Petri Törönen

Shamelessly copied from material done by Eija Korpelainen

This also includes old material from my thesis
www.hytti.uku.fi/~toronen/Gradu_verkkoon.zip
and from CSC bio-opas
http://www.csc.fi/oppaat/bio/
http://www.csc.fi/oppaat/bio/bio-opas.pdf

Page 2: Protein sequence databases

Why protein sequences?

• most (laboratory) analysis is done with nucleotide sequences

• therefore the analysis at the nucleotide level is natural

Page 3: Protein sequence databases

But there are drawbacks

- degeneracy of the genetic code (several codons per amino acid) => same protein, different nucleotide sequence!

http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/C/Codons.html

- similarity between different amino acids

Therefore not all of the similarity is visible at the nucleotide level!
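The codon point above can be illustrated with a minimal Python sketch. The two DNA strings are invented for illustration, and the `translate` and `identity` helpers are hypothetical names, not part of any tool mentioned on these slides. The sketch assumes the standard genetic code and upper- or lower-case A, C, G, T input only.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate a DNA string in frame 1; a trailing partial codon is ignored."""
    dna = dna.upper()
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two invented coding sequences that use different synonymous codons.
seq_a = "ATGGCTTCTAAA"
seq_b = "ATGGCCAGCAAG"

protein_a = translate(seq_a)
protein_b = translate(seq_b)
print(protein_a, protein_b)
print(f"nucleotide identity: {identity(seq_a, seq_b):.2f}")
print(f"protein identity:    {identity(protein_a, protein_b):.2f}")
```

Both sequences translate to the same peptide even though several nucleotide positions differ, which is why a comparison at the protein level can reveal similarity that a nucleotide comparison underestimates.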

Page 4: Protein sequence databases

…more…

Protein databases also often include more detailed information.

Protein (not the RNA) is often the actual functional unit that has a biological function.

-note the exceptions like structural RNAs.

Page 5: Protein sequence databases

Protein databases

• SwissProt

• TrEMBL

• PIR-PSD

SwissProt and TrEMBL (Translated EMBL) have been unified into UniProt.

THIS INFO IS IN PART ERRONEOUS! SwissProt is still also available as a separate entity.

Page 6: Protein sequence databases

Differences between databases

• Some include all the available information (more or less reliable information)
  – large coverage, everything is stored in the database
  – low reliability, information has not been confirmed
  – computer annotation => fast updating

• Some cover only the reliable information
  – small coverage
  – information is reliable
  – expert curation => slow updating

• SwissProt – TREMBL – RemTREMBL

Page 7: Protein sequence databases

Why is SwissProt nice?

• Sequences are manually annotated and checked

• No multiple entries for the same sequence

• Annotations include protein function, modifications after translation, active sites etc.

• Linked to many other databases

Page 8: Protein sequence databases

So how do we search for protein sequences in the available databases?

• Search with a protein name

• Search with a protein's function/descriptive words

• Search with a protein/RNA sequence

The next slides handle the first two options…

Page 9: Protein sequence databases

Ways to access Swiss/UniProt

http://au.expasy.org/sprot/

ExPASy server for UniProt. Note that the page includes links to ’full text search’ and to ’advanced search’.

http://www.ebi.uniprot.org/uniprot-srv/uniProtPowerSearch.do

Power Search of the UniProt database

http://srs.csc.fi/

One of the SRS servers available on the WWW

http://srs.ebi.ac.uk

http://srs.embl-heidelberg.de:8000/srs5/

Page 10: Protein sequence databases

SRS

• Sequence Retrieval System

• Allows search from several databases

• not limited to SwissProt!

• AND, OR, BUTNOT type boolean operations can be used in the search (useful with keywords)

=> Works with sequence name and with complex keyword queries.

• Obtained results can be further processed:
  – linking to a new set of databases
  – includes sequence analysis, sequence alignment

Page 11: Protein sequence databases

Select ’start a temporary project’

Page 12: Protein sequence databases

Select database(s). Here I select SwissProt.

Note that other databases can also be searched with SRS!

Available databases vary between the different SRS servers.

Page 13: Protein sequence databases

Enter the query for finding the sequence.

Here I search with the sequence name (csk_mouse).

The search goes through all the text fields (AllText) in the SwissProt files.

These are the available fields that can be searched with the search term

Page 14: Protein sequence databases

obtained result

Available information on the sequence.

More information from here

Page 15: Protein sequence databases

• The obtained result demonstrates the detailed information available from SwissProt

• Note that the stored information includes
  – information on the organism
  – gene name, gene description
  – links to the articles discussing the sequence
  – the comments part, with a detailed description of
      • function
      • tissue localization
  – the features part, with a detailed description of
      • domains
      • various functional components
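The parts listed above correspond to line types in the SwissProt flat-file format, where each line starts with a two-letter code (for example ID, DE, GN, OS, CC, FT). A minimal sketch of grouping an entry by line code follows. The entry text is an invented toy record, not a real database entry, and `group_by_line_code` is a hypothetical helper name.

```python
from collections import defaultdict

# Invented toy entry in the style of a SwissProt flat file (not a real record).
TOY_ENTRY = """\
ID   TOY_MOUSE               Reviewed;         120 AA.
DE   RecName: Full=Toy example kinase;
GN   Name=Toy1;
OS   Mus musculus (Mouse).
CC   -!- FUNCTION: Invented comment text for illustration.
CC   -!- TISSUE SPECIFICITY: Invented comment text for illustration.
FT   DOMAIN          10..90
//
"""


def group_by_line_code(entry_text):
    """Collect the content of each line under its two-letter line code."""
    fields = defaultdict(list)
    for line in entry_text.splitlines():
        if line.startswith("//"):  # end-of-entry marker
            break
        # Two-letter code, three spaces, then the content.
        code, content = line[:2], line[5:].rstrip()
        fields[code].append(content)
    return dict(fields)


fields = group_by_line_code(TOY_ENTRY)
print("Organism:", fields["OS"])
print("Gene:    ", fields["GN"])
print("Comments:", fields["CC"])
print("Features:", fields["FT"])
```

This only splits lines by code. A real parser would also need to handle multi-line values and the sequence block.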

Page 16: Protein sequence databases

SRS Search with boolean operators (AND, OR, BUTNOT)

Queries can be combined with & (= AND), | (= OR), ! (= BUTNOT).

Different rows are also combined (by default) with AND.

The example looks for proteins whose organism name is either mouse OR rat. In addition, the description field must include the words receptor AND kinase BUTNOT tyrosine.
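The logic of this example query can be sketched in Python as a filter over records. This is not SRS code: the records are invented toy data and `matches` is a hypothetical function that only mirrors the boolean structure of the query.

```python
# Invented toy records; only the two fields used by the query are modelled.
records = [
    {"id": "TOY1", "organism": "mouse", "description": "receptor serine/threonine kinase"},
    {"id": "TOY2", "organism": "rat", "description": "receptor tyrosine kinase"},
    {"id": "TOY3", "organism": "human", "description": "receptor kinase"},
    {"id": "TOY4", "organism": "rat", "description": "kinase inhibitor"},
]


def matches(record):
    """Return True if the record satisfies the example query."""
    words = record["description"].replace("/", " ").split()
    # Organism row: mouse | rat
    organism_ok = record["organism"] in ("mouse", "rat")
    # Description row: receptor & kinase ! tyrosine
    description_ok = (
        "receptor" in words and "kinase" in words and "tyrosine" not in words
    )
    # Different rows are combined with AND.
    return organism_ok and description_ok


hits = [r["id"] for r in records if matches(r)]
print(hits)
```

Only the first toy record passes: the second contains the excluded word, the third has the wrong organism, and the fourth lacks one of the required words.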

Page 17: Protein sequence databases

Further linking to other databases

We can link the obtained results with the other databases by going further from this link

Go to the results of the previous search..

Page 18: Protein sequence databases

Selection of sequences that have a known 3D structure

1. The subfolder with protein databases is opened by selecting ’protein function structure and interactions databases’

2. The box next to the PDB database is selected with the mouse

3. Let's select here the filtering of the obtained results to the ones that have a link to a 3D structure

Page 19: Protein sequence databases
Page 20: Protein sequence databases

Summary

• Protein databases show detailed information on protein sequences

• UniProt/SwissProt is the recommended protein database
  - manually curated
  - non-overlapping

• SRS is a method for searching information from selected databases with search terms

• Word of warning: Sometimes SRS does not work as nicely as hoped!

Page 21: Protein sequence databases

Search of the protein databases with sequences

So what can be done if we have a sequence that we know nothing about?

We can look for similar known proteins in the databases.

This can be done directly with protein sequences.

(Database searching is probably covered in more detail later. Sorry for the wrong order!)

Page 22: Protein sequence databases

Nucleotide to amino acids

If you have produced a nucleotide sequence in the laboratory, you might still want to compare it to protein sequences for the reasons given earlier (slide 3). You’ll have two options:

Page 23: Protein sequence databases

1. Use tools (like BLASTX, FastX) that automatically compare the nucleotide sequence to amino acid databases.

These can search for sequence similarities going from one reading frame to another. => Simple: you don’t have to worry about translating the sequence (see below)

BLASTX and FastX are explained in more detail later

2. Translate the sequence using available tools (for example http://www.ebi.ac.uk/emboss/transeq/ )

- required with tools that accept only protein sequences

- remember that you do not know the reading frame!

The correct reading frame can shift from one frame to another within the sequence (because of sequencing errors such as insertion or deletion of nucleotides)!!
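The effect of such a sequencing error can be shown with a small Python sketch. The coding sequence is invented, the inserted base simulates an error, and `translate` is a hypothetical helper using the standard genetic code. It assumes upper-case A, C, G, T input.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna, offset=0):
    """Translate from a 0-based offset (0, 1 or 2); a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3))


original = "ATGGCTTCTAAAGGTGAACTG"  # invented coding sequence

# Simulated sequencing error: one extra base inserted after the second codon.
with_error = original[:6] + "C" + original[6:]

print("original, frame 1:  ", translate(original, 0))
print("with error, frame 1:", translate(with_error, 0))
print("with error, frame 2:", translate(with_error, 1))
```

After the insertion, frame 1 still gives the correct start of the protein but then runs into stop codons, while the rest of the protein reappears in frame 2. A single translated frame is therefore not enough when the read may contain insertions or deletions.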

Page 24: Protein sequence databases

Automatic tools comparing nucl. seq. with protein database

• BLASTX

- looks for the most similar protein sequences for your nucleotide sequence by comparing all possible reading frames.

- Member of the BLAST program family

http://www.ncbi.nlm.nih.gov/BLAST/

Page 25: Protein sequence databases

For nucleotide sequences, BLASTX can be obtained here

If you do a query with a protein sequence, then use this
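The choice between the programs of the BLAST family depends on the type of the query and the type of the database. A minimal lookup sketch follows. The dictionary and the `choose_blast_program` function are hypothetical illustrations, not part of BLAST itself.

```python
# (query type, database type) -> program in the BLAST family.
# tblastx (translated nucleotide query vs. translated nucleotide database)
# is left out because it shares its key with blastn.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): "blastn",
    ("protein", "protein"): "blastp",
    ("nucleotide", "protein"): "blastx",   # query is translated in all six frames
    ("protein", "nucleotide"): "tblastn",  # database is translated in all six frames
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program name for a query/database type combination."""
    return BLAST_PROGRAMS[(query_type, database_type)]


print(choose_blast_program("nucleotide", "protein"))
print(choose_blast_program("protein", "protein"))
```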

Page 26: Protein sequence databases

SEQUENCE:
>embl|AB029485|AB029485 Mus musculus ARIP1 mRNA for activin receptor interacting protein

The protein database (SwissProt) can be selected here

You can find the sequence by searching Google for AB029485

Page 27: Protein sequence databases

The next window is opened here

Page 28: Protein sequence databases

Web page that is shown while waiting for the results.

Page 29: Protein sequence databases

The colour figure shows where in our query sequence the match to the database was found.

The colour indicates how good the score is.

The E value tells how many results with a similar score can be expected by chance.

The alignment can be viewed from this link
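As a rough illustration of the E value, the commonly used relation between a bit score S' and the expected number of chance hits is E = m · n · 2^(−S'), where m is the query length and n the total database length. The sketch below applies that relation. The function name and the example numbers are assumptions for illustration, and real BLAST output uses corrected ("effective") lengths, so reported E values will differ somewhat.

```python
def expected_chance_hits(bit_score, query_length, database_length):
    """E value from a bit score: E = m * n * 2**(-S')."""
    return query_length * database_length * 2.0 ** (-bit_score)


# Hypothetical numbers: a 300-residue query against 100 million database residues.
for bit_score in (30, 40, 50):
    e_value = expected_chance_hits(bit_score, 300, 10**8)
    print(f"bit score {bit_score}: E = {e_value:.3g}")
```

Each additional bit halves the E value, and the E value grows in proportion to the database size. The same score is therefore less significant in a larger database.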

Page 30: Protein sequence databases

The alignment enables manual evaluation of the result

This is the link to the database that we searched, giving the full information on the sequence

Page 31: Protein sequence databases

Changing the nucleotides to amino acids

Transeq requires you to paste the nucleotide sequence, to select the reading frame (1, 2 or 3) and to select forward or reverse direction

http://www.ebi.ac.uk/emboss/transeq/
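What a translation tool does with the frame and direction settings can be sketched in Python as a six-frame translation. This is not Transeq code: the function names are hypothetical, the input sequence is invented, and the frame labels (+1..+3, -1..-3) are one possible convention that may differ from the numbering Transeq uses for reverse frames.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement every base and reverse the string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate from a 0-based offset; a trailing partial codon is ignored."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(offset, len(dna) - 2, 3))


def six_frame_translation(dna):
    """Translate three forward frames and three frames of the reverse complement."""
    dna = dna.upper()
    reverse = reverse_complement(dna)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna, offset)
        frames[f"-{offset + 1}"] = translate(reverse, offset)
    return frames


frames = six_frame_translation("ATGGCTTCTAAAGGTGAACTG")  # invented sequence
for label in sorted(frames):
    print(label, frames[label])
```

Each of the six translations can then be used as a separate protein query when the correct frame is unknown.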

Page 32: Protein sequence databases

An example sequence obtained with randomly typed g, a, c, t: DQLTCQSTVSAGLAWLAGMA

The sequences obtained from the different reading frames can be used to search protein databases...

Page 33: Protein sequence databases

Motif databases

• Motifs are conserved areas in functionally similar proteins

• These are crucial parts for protein function
  – a protein cannot change them without changing the function

• Analysis of sequences with motifs can be more efficient when no close sequence relatives are found
  – recommended when a normal sequence search gives no results

Page 34: Protein sequence databases

What is a motif?

modified from Terri Attwood, 2002

modified from Eija Korpelainen...

Areas with strong conservation between aligned sequences
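The idea of conserved areas in an alignment can be made concrete with a short Python sketch that scores each alignment column. The alignment is invented toy data and `column_conservation` is a hypothetical helper. Real motif databases use more refined measures than this simple majority fraction.

```python
from collections import Counter

# Invented toy alignment (equal-length rows, '-' marks a gap).
alignment = [
    "MKGHCWAL",
    "MRGHCWSL",
    "MKGHCW-L",
    "LKGHCWAI",
]


def column_conservation(rows):
    """Fraction of rows sharing the most common residue in each column.

    Gaps are never counted as the shared residue.
    """
    scores = []
    for column in zip(*rows):
        counts = Counter(residue for residue in column if residue != "-")
        top = counts.most_common(1)[0][1] if counts else 0
        scores.append(top / len(rows))
    return scores


scores = column_conservation(alignment)
# 1-based positions of columns where every sequence has the same residue.
conserved = [i + 1 for i, score in enumerate(scores) if score == 1.0]
print(scores)
print("fully conserved columns:", conserved)
```

A run of adjacent fully conserved columns, such as the block in the middle of this toy alignment, is the kind of area that would be described as a motif.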

Page 35: Protein sequence databases

Motif databases

BLOCKS

http://blocks.fhcrc.org/

PROSITE

http://au.expasy.org/prosite/

...and more...
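PROSITE describes many short motifs as patterns such as N-{P}-[ST]-{P} (the N-glycosylation site pattern), where x means any residue, [..] lists allowed residues and {..} lists forbidden residues. The following sketch converts such a pattern into a Python regular expression and scans a sequence with it. The converter is a simplified assumption that covers only the basic syntax, the function names are hypothetical, and the test sequence is invented.

```python
import re


def prosite_to_regex(pattern):
    """Convert a basic PROSITE-style pattern to a Python regular expression.

    Handles x, [..], {..}, repeats such as x(2) or x(2,4), and the < and >
    anchors at the pattern ends. Anchors inside brackets are not supported.
    """
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")  # {P} -> [^P]
        element = element.replace("(", "{").replace(")", "}")   # x(2,4) -> x{2,4}
        element = element.replace("x", ".")                     # x -> any residue
        element = element.replace("<", "^").replace(">", "$")   # terminal anchors
        parts.append(element)
    return "".join(parts)


def scan(sequence, pattern):
    """Return (1-based start, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


sequence = "MKNGSAANPSLNRTQ"  # invented test sequence
print(prosite_to_regex("N-{P}-[ST]-{P}"))
print(scan(sequence, "N-{P}-[ST]-{P}"))
```

Short patterns like this match by chance quite often, so a hit should be read as a candidate site and not as proof of function.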

Page 36: Protein sequence databases

http://au.expasy.org/tools/

The subgroup ’Pattern and profile searches’ shows the list of protein motif analysis tools

Page 37: Protein sequence databases

INTERPRO

http://www.ebi.ac.uk/InterProScan/

Combines many motif databases in one search

Can take a DNA or protein sequence.

Fragment of the BLASTX test sequence

Page 38: Protein sequence databases

Kinase associated motifs

PDZ domains
Important for protein interactions

WW domains
Important for binding proteins