UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the Swiss Institute of Bioinformatics

Andrea Auchincloss, Tunis, March 19, 2007

Page 1:

UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the Swiss Institute of Bioinformatics

Andrea Auchincloss

Tunis, March 19, 2007

Page 2:

Outline

• The Swiss Institute of Bioinformatics
• What is UniProt?
• UniProt Knowledgebase: Swiss-Prot and TrEMBL
• HPI, post-translational modifications, HAMAP
• UniRef and UniParc
• Databases for protein function and domains: PROSITE, InterPro etc.
• ExPASy; other tools

Page 3:

Swiss Institute of Bioinformatics (SIB)

• Non-profit foundation created in 1998;
• Groups in Geneva, Lausanne and Basel;
• Federation of several groups (some of which existed and collaborated long before the foundation of the institute), about 170 researchers in 2006.

Page 4:

www.isb-sib.ch

Page 5:

SIB missions

• Development of databases and software tools;
• High-quality bioinformatics research program;
• Courses and seminars for the training of bioinformatics research scientists. This includes a master’s degree in proteomics and bioinformatics, several weekly courses and a doctoral school;
• Services to the Swiss Life Sciences community (EMBnet node).

Page 6:

Swiss Institute of Bioinformatics: 20 research and service groups

Page 7:

Proteins are organic compounds made of amino acids arranged in a linear chain and joined by peptide bonds…

Wikipedia

Page 8:

Different ‘views’ of a protein

Proteins are composed of 20 "standard" amino acids, symbolised by a LETTER.

Page 9:

Proteins can also work together to perform a particular function, and they often associate to form complexes.

Page 10:

Proteins are essential parts of all living organisms and participate in every process within cells.

-> enzymes

-> structural or mechanical functions

-> important in cell signaling, immune response, cell adhesion, cell cycle, toxins….

Proteins are a necessary component in our diet, since animals cannot synthesize all the amino acids and must obtain essential amino acids from food.

Page 11:

Protein/Gene number

Organism        Number
Bacteria        182-8,591
S. cerevisiae   6,127
C. elegans      17,947
Drosophila      13,849
A. thaliana     ~25,674
Human           ~21,000

Page 12:

The universe in which protein databases evolve

1953: 1st sequence (bovine insulin)

1986: 4,000 sequences

2006: 3.5 million sequences

Where will it stop?

AMB, SP20

Page 13:

179,000,021,000

1st estimate: ~30 million species (1.5 million named)

2nd estimate:

20 million bacteria/archaea x 4,000 genes

5 million protists x 6,000 genes

3 million insects x 14,000 genes

1 million fungi x 6,000 genes

0.6 million plants x 20,000 genes

0.2 million molluscs, worms, arachnids, etc. x 20,000 genes

0.2 million vertebrates x 21,000 genes

The calculation: 2x10^7 x 4,000 + 5x10^6 x 6,000 + 3x10^6 x 14,000 + 10^6 x 6,000 + 6x10^5 x 20,000 + 2x10^5 x 20,000 + 2x10^5 x 21,000 + 21,000 (you!)

Caveat: this is an estimate of the number of potential sequence entries, but not that of the number of distinct protein entities in the biosphere.

AMB, SP20
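The calculation on this slide can be checked with a few lines of Python. This is only a sketch: the group names and counts are copied from the slide, while the names `GROUPS`, `HUMAN_GENES`, `total_species` and `potential_sequences` are made up for this example. With the terms exactly as listed, the sum comes to 178,200,021,000, slightly below the headline figure of 179,000,021,000. The difference of 800,000,000 would correspond to 25,000 rather than 21,000 genes per vertebrate species, but that explanation is an assumption, not something the slide states.

```python
# Second estimate from the slide: (number of species, genes per species).
GROUPS = {
    "bacteria/archaea": (20_000_000, 4_000),
    "protists": (5_000_000, 6_000),
    "insects": (3_000_000, 14_000),
    "fungi": (1_000_000, 6_000),
    "plants": (600_000, 20_000),
    "molluscs, worms, arachnids, etc.": (200_000, 20_000),
    "vertebrates": (200_000, 21_000),
}
HUMAN_GENES = 21_000  # the final "+ 21,000 (you!)" term


def total_species(groups=GROUPS):
    """Sum of the species counts over all groups."""
    return sum(species for species, _ in groups.values())


def potential_sequences(groups=GROUPS, extra=HUMAN_GENES):
    """Sum of species x genes over all groups, plus the extra term."""
    return sum(species * genes for species, genes in groups.values()) + extra


if __name__ == "__main__":
    print(f"{total_species():,} species")
    print(f"{potential_sequences():,} potential sequence entries")
```

The species counts add up to 30 million, which matches the first estimate on the slide.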

Page 14:

What sequencing is underway right now?

Many eukaryotic & bacterial genomes (varying sizes)

Metagenomics (environmental samples)

~ 6 million sequences submitted/published in December 2006,

~ 17 million sequences being generated at the Venter Institute, 6 million proteins are being submitted from the GOS (Global Ocean Sampling) trip

Page 15:

Protein sequences; what is sequenced?

Currently about 3.5 to 4.0 million ‘known’ protein sequences

More than 99% of these are derived by translation of nucleotide sequences

Less than 1%: direct protein sequencing (Edman, MS/MS…)

-> It is important that users know where the protein sequence comes from… (sequence & gene prediction quality)!

Page 16:

Level of DNA/RNA sequence quality

- DNA/RNA sequencing quality (genome or WGS, cDNA or EST …)

- Gene prediction quality; programs used, is there manual intervention afterwards?

For example: authors can specify the nature of the CDS in the nucleotide databases by using qualifiers: "/evidence=experimental" or "/evidence=not_experimental".

Very rarely done…
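To make the qualifier idea concrete, here is a minimal sketch that reads the `/evidence` qualifier of each CDS feature from a simplified, EMBL-style feature table. The sample text, the gene names and the function name `cds_evidence` are invented for illustration. Only the two qualifier values come from the slide. Real feature tables have continuation lines and many more qualifiers that this sketch ignores.

```python
import re

# Hypothetical, simplified feature table (not a real database entry).
SAMPLE_FEATURES = """\
FT   CDS             1..300
FT                   /gene="abcA"
FT                   /evidence=experimental
FT   CDS             400..900
FT                   /gene="abcB"
FT                   /evidence=not_experimental
FT   CDS             1000..1500
FT                   /gene="abcC"
"""


def cds_evidence(feature_table):
    """Return (gene, evidence) for each CDS; evidence is None when absent."""
    records = []
    current = None
    for line in feature_table.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if not body:
            continue
        if body.startswith("/"):
            # Qualifier line: keep it only if it belongs to a CDS feature.
            match = re.match(r"/(\w+)=(.*)", body)
            if current is not None and match and match.group(1) in current:
                current[match.group(1)] = match.group(2).strip('"')
        elif body.split()[0] == "CDS":
            current = {"gene": None, "evidence": None}
            records.append(current)
        else:
            # Any other feature key ends the current CDS (sketch only).
            current = None
    return [(r["gene"], r["evidence"]) for r in records]


if __name__ == "__main__":
    for gene, evidence in cds_evidence(SAMPLE_FEATURES):
        print(gene, evidence or "no /evidence qualifier")
```

In the sample, the third CDS carries no `/evidence` qualifier at all, which is the "very rarely done" case the slide describes.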

Page 17:

The hectic life of a sequence …

[Diagram] cDNAs, ESTs, genomes, … → public nucleic acid databases (EMBL, GenBank, DDBJ) → public protein sequence databases, …if the submitters provide an annotated Coding Sequence (CDS)

Data not submitted to public databases, delayed or cancelled…

Page 18:

CDS: CoDing Sequence

CDS provided by the submitters

CDS translation provided by EMBL

The first Met !

Page 19:

Complete genome (submitted)

only ~ 1,858 CDS available!

Data not submitted

Page 20:

Issue for the users: the protein database jungle

Page 22:

The hectic life of a sequence …

[Diagram] cDNAs, ESTs, genomes, … → EMBL, GenBank, DDBJ

Data not submitted to public databases, delayed or cancelled…

CoDing Sequences provided by submitters: TrEMBL, GenPept

Manually annotated: Swiss-Prot, RefSeq*

Scientific publications derived sequences: PRF

Other resources shown: EnsEMBL*, IPI, CCDS, UniParc, UniProtKB, PDB, PIR

* Also gene prediction

+ species-specific databases (EcoGene, TubercuList, TIGR…)

Page 24:

Major public protein sequence database ‘sources’

UniProtKB: Swiss-Prot + TrEMBL

NCBI-nr: Swiss-Prot + GenPept + PIR + PDB + PRF + RefSeq

UniProtKB/Swiss-Prot: manually annotated protein sequences (11,000 species)

UniProtKB/TrEMBL: submitted CDS (EMBL) + automated annotation; non redundant with Swiss-Prot (127,000 species)

GenPept: submitted CDS (GenBank); redundant with UniProtKB (about 130,000 species)

PIR: Protein Information Resource; archive since 2003; integrated into UniProtKB

PDB: Protein Databank: 3D data and associated sequences

PRF: journal scan of ‘published’ peptide sequences

RefSeq: Reference Sequence for DNA, RNA, protein + gene prediction (4,000 species)

Integrated resources

‘cross-references’

Separated resources

Page 25:

Other protein sequence databases

CCDS: EBI + NCBI + Wellcome Trust Sanger + UC Santa Cruz (2 species)

Consensus human and mouse sequences between 4 institutions… Combining different approaches – ab initio, by similarity - and taking advantage of the expertise acquired by different institutes, including manual annotation…

EnsEMBL: UniProtKB + RefSeq + gene prediction (31 species)

- Aligns some eukaryotic genomic sequences with all the sequences found in EMBL, UniProtKB/Swiss-Prot, RefSeq and UniProtKB/TrEMBL (→ known genes)

- Also does some gene prediction (→ novel genes)

IPI: UniProtKB + RefSeq + EnsEMBL + (H-InvDB, TAIR, VEGA) (7 species)

provides a guide to the main databases that describe the human, mouse, rat, zebrafish, Arabidopsis, chicken, and cow proteomes.

Page 26:

The UniProt consortium

Protein Information Resource

European Bioinformatics Institute, European Molecular Biology Laboratory

Swiss Institute of Bioinformatics

Page 27:

The UniProt Consortium

UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information

www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).

Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc

and soon UniMES (for Metagenomic and Environmental Sequences)

Page 28:

The Universal Protein Resource components

UniProtKB Release 9.7 consists of:

UniProt Knowledgebase (UniProtKB), produced by SIB and EBI
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries, ~10,000 species
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,600,000 entries, ~100,000 species

UniRef100, UniRef90, UniRef50, produced by PIR
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that share at least 90% identity.
• One UniRef50 entry = sequences that share at least 50% identity.
• Independent of species.
• Allows comprehensible BLAST similarity searches by providing sets of representative sequences.

UniProt Archive (UniParc), produced by EBI; ~8,000,000 entries
• Archived raw protein sequences found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
• Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…

Page 29:

The Universal Protein Resource components

UniProt Knowledgebase (UniProtKB), produced by SIB and EBI
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries, ~11,000 species
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,900,000 entries, ~127,000 species

UniRef100, UniRef90, UniRef50, produced by PIR
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that share at least 90% identity.
• One UniRef50 entry = sequences that share at least 50% identity.
• Independent of species.
• Allows comprehensible BLAST similarity searches by providing sets of representative sequences.

UniProt Archive (UniParc), produced by EBI; ~8,000,000 entries
• Archived raw protein sequences found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
• Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…

Page 31:

The Universal Protein Resource components

UniProt Knowledgebase (UniProtKB), produced by SIB and EBI
• UniProtKB/Swiss-Prot: manually annotated protein sequences; 260,000 entries, ~11,000 species
• UniProtKB/TrEMBL: computer-annotated protein sequences; 3,900,000 entries, ~127,000 species

UniRef100, UniRef90, UniRef50, produced by PIR
• One UniRef100 entry = all identical sequences (including fragments).
• One UniRef90 entry = sequences that share at least 90% identity.
• One UniRef50 entry = sequences that share at least 50% identity.
• Independent of species.
• Allows comprehensible BLAST similarity searches by providing sets of representative sequences.

UniProt Archive (UniParc), produced by EBI; ~8,800,000 entries
• Archived raw protein sequences found in publicly accessible databases: Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
• Use with extreme caution: contains pseudogenes, incorrect CDS predictions, etc…
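The idea behind the UniRef identity levels can be illustrated with a toy greedy clustering. This is a sketch under strong assumptions, not a description of how UniRef is actually built: the sequences are invented, they are assumed to be pre-aligned and of equal length, and identity is simply the fraction of identical positions. The names `identity`, `greedy_clusters` and `TOY_SEQUENCES` are hypothetical.

```python
# Invented, equal-length toy sequences (assumed to be already aligned).
TOY_SEQUENCES = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",  # identical to the first
    "MKTAYIAKQK",  # 90% identical to the first
    "MKTAYLLKQK",  # 70% identical to the first
    "MRSGWLLPEK",  # unrelated
]


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("this sketch only compares equal-length sequences")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def greedy_clusters(sequences, threshold):
    """Assign each sequence to the first representative it matches.

    A sequence joins a cluster when its identity to that cluster's
    representative is at least `threshold`; otherwise it starts a new
    cluster and becomes its representative.
    """
    clusters = []  # list of (representative, members)
    for seq in sequences:
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))
    return clusters


if __name__ == "__main__":
    for threshold in (1.0, 0.9, 0.5):
        n = len(greedy_clusters(TOY_SEQUENCES, threshold))
        print(f"threshold {threshold:.0%}: {n} clusters")
```

Lowering the threshold merges more sequences into each cluster, which is why a search against a 50% set returns fewer, more representative hits than a search against a 100% set.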

Page 32:

UniProt web sites…

http://www.expasy.org/sprot/

http://www.pir.uniprot.org/

http://www.ebi.ac.uk/uniprot/

http://www.uniprot.org/

Soon, a new unified web site, with a very powerful search engine….

Page 33:

http://beta.uniprot.org/

Test it! Logon: guest, Password: amazing

Page 34:

The UniProt groups from SIB, EBI and PIR (Antibes, September 2004)

In Geneva (SIB):
2 Group Leaders
44 Annotators
4 Prosite annotators
22 Programmers and Researchers
5 Administrators, science communicators
3 System Administrators
4 Students
1 GISAID
------------------
85 people

At EBI (Swiss-Prot + EMBL + TrEMBL):
75 people (29 Annotators)

At PIR:
1 Group Leader
13 Protein Science Team
12 Informatics Team
------------------
26 people

Page 35:

UniProtKB has biweekly releases; it is available from about 100 servers, the main sources being ExPASy and www.uniprot.org

Page 36:

UniProtKB

From EMBL (DNA) to TrEMBL (protein)

Page 37:

EMBL → TrEMBL

Automated extraction of the protein sequence (CDS), gene name, taxonomy and references.

Automated annotation (KWs and protein family).

[Entry fields highlighted on the slide: Reference, Gene/protein name, CDS, Taxonomy]

Page 38:

! TrEMBL does not translate DNA sequences, nor does it use gene prediction programs: it only takes the existing CDS proposed by the submitting authors in the EMBL/GenBank/DDBJ entry

In particular, the proposed CDS and derived protein sequences can be experimentally proven or derived from gene prediction programs (this is not obvious from the TrEMBL entry)

TrEMBL does not validate any sequences
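The point that the protein sequence is taken as submitted, not computed from the DNA, can be sketched as follows. The code reads the `/translation` qualifier of each CDS feature from a simplified, EMBL-style feature table and never touches the nucleotide sequence. The sample entry and the function name `cds_translations` are invented for illustration, and the sketch does not reflect the actual TrEMBL production pipeline.

```python
# Hypothetical, simplified feature table (not a real database entry).
SAMPLE_ENTRY = """\
FT   source          1..69
FT                   /organism="Toy organism"
FT   CDS             1..69
FT                   /gene="toy1"
FT                   /translation="MKTAYIAKQRQ
FT                   ISFVKSHFSRQ"
"""


def cds_translations(entry_text):
    """Return the submitter-provided /translation of each CDS feature."""
    proteins = []
    in_cds = False
    collecting = False  # True while a quoted translation spans several lines
    for line in entry_text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if collecting:
            proteins[-1] += body.rstrip('"')
            collecting = not body.endswith('"')
        elif body.startswith("/"):
            prefix = '/translation="'
            if in_cds and body.startswith(prefix):
                value = body[len(prefix):]
                proteins.append(value.rstrip('"'))
                collecting = not value.endswith('"')
        elif body:
            # A new feature key: remember whether it is a CDS.
            in_cds = body.split()[0] == "CDS"
    return proteins


if __name__ == "__main__":
    print(cds_translations(SAMPLE_ENTRY))
```

Nothing in this sketch checks whether the translation is correct, which mirrors the slide's warning that no validation takes place at this stage.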

Page 39:

!!!!

The quality of UniProtKB/TrEMBL data is directly dependent on the information provided by the submitter of the original nucleotide entry.

Page 40:

UniProtKB

From TrEMBL to Swiss-Prot

Page 41:

EMBL (CDS) → TrEMBL → Swiss-Prot

EMBL to TrEMBL: automated extraction of the protein sequence (CDS), gene name and references. Automated annotation.

TrEMBL to Swiss-Prot: manual annotation of the sequence and associated biological information (derived from literature, external experts, databases…)

Annotation of sequence differences (conflicts, variants, splicing…)

Average of 6 independent sequence reports for each human protein

Page 42:

Distinguishing Swiss-Prot and TrEMBL

– A TrEMBL entry is a computer-annotated record derived from a coding sequence (CDS) in the nucleotide sequence databases that is not yet in Swiss-Prot, produced after some redundancy removal and automated annotation.

– A Swiss-Prot entry is a manually annotated record for a given protein.

Page 43:

UniProtKB From TrEMBL to Swiss-Prot

Step 1: Sequence check

Page 44:

UniProtKB/Swiss-Prot

Non-redundant: 1 entry -> 1 gene (1 species)

i) Merge all known protein sequences (CDS and amino acid) derived from the same gene

-> decreases redundancy and improves sequence reliability

ii) Annotation of the sequence differences (including conflicts, polymorphisms, splice variants etc.)

-> annotation of protein diversity
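The two steps above (merge the reports for one gene, then record where they differ) can be sketched with a toy example. It assumes the sequence reports are already aligned and of equal length, and it picks the most frequent residue at each position as the merged sequence. That majority rule is an assumption made for this example only. In Swiss-Prot the displayed sequence and the annotated differences are decided by curators. All names and sequences here are invented.

```python
from collections import Counter

# Three invented, pre-aligned sequence reports for the same gene.
REPORTS = [
    "MKTAYIAKQR",
    "MKTAYIAKQR",
    "MKTSYIAKQK",
]


def merge_reports(aligned_reports):
    """Return (merged_sequence, differences) for equal-length reports.

    `differences` lists (position, merged_residue, other_residues) for every
    column where the reports disagree; positions are 1-based.
    """
    if len({len(seq) for seq in aligned_reports}) != 1:
        raise ValueError("reports must be aligned to the same length")
    merged = []
    differences = []
    for position, column in enumerate(zip(*aligned_reports), start=1):
        counts = Counter(column)
        residue = counts.most_common(1)[0][0]
        merged.append(residue)
        if len(counts) > 1:
            others = sorted(set(column) - {residue})
            differences.append((position, residue, others))
    return "".join(merged), differences


if __name__ == "__main__":
    sequence, differences = merge_reports(REPORTS)
    print(sequence)
    for position, residue, others in differences:
        print(f"position {position}: {residue} (also reported: {', '.join(others)})")
```

One merged record replaces three redundant ones, and the two disagreeing positions are kept as annotations rather than discarded.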

Page 45:

260,000 + 3,800,000 3,600,000

Redundancy…

Redundancy in TrEMBL & redundancy between TrEMBL and Swiss-Prot

In the future: redundancy is going to decrease: "new" genome sequencing → "new" proteins

UniProtKB/Swiss-Prot ~11,000 species

UniProtKB/TrEMBL ~127,000 species

Page 46:

- 13 sequences (complete or partial)
- derived from mRNA (n=6) or genomic DNA (n=7)

Page 47:

All alternatively spliced sequences are available for BLAST searches, protein identification tools and are downloadable…

Human: ~2/3 of the human genes are alternatively spliced

Page 48:

- 6 genomic sequences (complete or partial)
- 1 protein sequence from PIR

Page 49:

Multiple alignment of the available clpB sequences

Page 51:

Within Swiss-Prot?

• A snapshot of the situation (December 2006):
– 28,200 entries with 82,000 sequence conflicts;
– 2,600 entries with corrected frameshifts;
– 15,100 entries with corrected initiation sites;
– 4,300 entries with other sequence ‘problems’.

• At least 43,000 entries (19% of Swiss-Prot) required a minimal amount of annotation effort to obtain the “correct” sequence.

Page 52:

Quality of protein information from genome projects

• Proteins originating from different genome projects:
– Drosophila: what a curated (thanks to FlyBase) genome effort should look like: only 1.8% of the gene models conflict with what we have in UniProtKB/Swiss-Prot;
– Arabidopsis: a genome where lots of work was done to annotate it when it was sequenced, but where nothing has been done since (at least in the public view): 19.5% of the gene models are erroneous;
– Tetraodon nigroviridis: a quick and dirty automatic run through a genome with no manual intervention: >90% of the gene models produce incorrect proteins;
– Bacteria and Archaea have almost no splicing, so prediction is “easier”; however, errors are still made…

Page 53:

• Producing a clean set of sequences is not a trivial task;

• It is not getting easier as more and more types of sequence data are submitted;

• It is important to pursue our efforts in making sure we provide to our users the most correct set of sequences for a given organism.

Page 54:

New ‘Protein existence evidence’ tag

• As most protein sequences are derived from translation of nucleotide sequence and are only predictions, the new PE line indicates whether there is any evidence that proves the existence of a protein;

• The ‘Protein existence evidence’ will have 5 different qualifiers:

1. Evidence at protein level
2. Evidence at transcript level
3. Inferred from homology
4. Predicted
- Unassigned (used mostly in TrEMBL)
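A reader of the flat file could pick up this tag with a few lines of code. The sketch below assumes a line of the form `PE   1: Evidence at protein level;`, which is an assumption about the layout. The slide only names the PE line and its qualifiers. The sample entry and the function name `protein_existence` are made up for illustration.

```python
# Qualifiers 1 to 4 as listed on the slide.
PE_LEVELS = {
    1: "Evidence at protein level",
    2: "Evidence at transcript level",
    3: "Inferred from homology",
    4: "Predicted",
}

# Hypothetical fragment of a flat-file entry (assumed PE line layout).
SAMPLE_ENTRY = """\
ID   TOY_HUMAN
PE   1: Evidence at protein level;
SQ   SEQUENCE
"""


def protein_existence(entry_text):
    """Return (level, label); (None, 'Unassigned') when no level is given."""
    for line in entry_text.splitlines():
        if line.startswith("PE   "):
            value = line[5:].strip().rstrip(";")
            level, _, label = value.partition(":")
            if level.strip().isdigit():
                number = int(level)
                return number, PE_LEVELS.get(number, label.strip())
    return None, "Unassigned"


if __name__ == "__main__":
    print(protein_existence(SAMPLE_ENTRY))
```

A filter built on this function could, for example, keep only entries with level 1 or 2 when a sequence set backed by experimental evidence is needed.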

Page 55:

Righting the wrongs

“Sequences are rarely deposited in a “mature” state; as with all scientific research, DNA and protein annotation is a continual process of learning, revision and corrections.”

“Sequencing error rates: ~1 base in 10,000”

“Making people aware of errors is good and great; making people aware that they’re responsible also for correcting errors is even greater”

C. Hadley, EMBO reports, 4(9), 2003.

Page 56:

UniProtKB

From TrEMBL to Swiss-Prot

Step 2: Annotation: literature, controlled vocabulary

Page 57:

Annotation

• The focal point of the efforts to maintain and develop UniProtKB/Swiss-Prot;

• It is becoming more and more important as it:
– provides a summary of what is known about a protein;
– creates templates for automatic annotation of the many organisms whose genome sequence is/will be available but whose proteins will not be characterized;
– provides well-annotated entries (a corpus) to train literature mining tools (text mining).

Page 58:

(…)

Source of data:
- publications (> 1,700 journals cited)
- also external scientific expertise & other databases

Page 59:

Comments: “structured free text”, 27 defined topics

Manually annotated

Information from papers, specialized databases, computer prediction, external experts, brain storming

Distinction between data obtained experimentally and computerized inferences

Page 60:

UniProtKB

From TrEMBL to Swiss-Prot

Step 3: Sequence analysis (bioinformatics tools)

Page 61:

Annotators could not work without the help of our software developers;

The annotation platform

Page 62:

Anabelle: much more than a domain annotation platform

Page 64:

We manually check the results !

Page 65:

What else is in a UniProtKB/Swiss-Prot entry?

Page 66:

Cross-references; a central hub

• Swiss-Prot was the first database with X-references;
• Explicitly X-referenced to 85 databases:
– DNA (EMBL/GenBank/DDBJ);
– 3D-structure (PDB);
– family and domain (InterPro, HAMAP, PROSITE, Pfam, etc.);
– genomic (OMIM, MGI, FlyBase, SGD, SubtiList, etc.);
– 2D-gel (e.g. SWISS-2DPAGE);
– specialized db (e.g. GlycoSuiteDB, PhosSite, MEROPS);
– literature (PubMed).
• Each UniProtKB/Swiss-Prot entry can be seen as a central hub for the data available about the protein it describes.

Gasteiger E. et al, Curr. Issues Mol. Biol. 3:47-55(2001)

www.expasy.org/cgi-bin/lists?dbxref.txt

Page 67:

UniProtKB/Swiss-Prot explicit links

2D-gel databases: ANU-2DPAGE, Aarhus/Ghent-2DPAGE, COMPLUYEAST-2DPAGE, Cornea-2DPAGE, DOSAC-COBS-2DPAGE, ECO2DBASE, HSC-2DPAGE, OGP, PHCI-2DPAGE, PMMA-2DPAGE, Rat-heart-2DPAGE, REPRODUCTION-2DPAGE, Siena-2DPAGE, SWISS-2DPAGE

Family and domain databases: Gene3D, HAMAP, InterPro, PANTHER, PIRSF, Pfam, PRINTS, ProDom, PROSITE, SMART, TIGRFAMs

Organism-specific databases: AGD, CYGD, DictyBase, EchoBASE, EcoGene, euHCVdb, FlyBase, GeneDB_Spombe, GeneFarm, Gramene, H-InvDB, HGNC, HIV, HPA, LegioList, Leproma, ListiList, MaizeGDB, MGI, MIM, MypuList, PhotoList, RGD, SagaList, SGD, StyGene, SubtiList, TAIR, TubercuList, WormBase, WormPep, ZFIN

Enzyme and pathway databases: BioCyc, Reactome

Miscellaneous: ArrayExpress, dbSNP, DIP, DrugBank, GO, IntAct, LinkHub, RZPD-ProtExp

Protein family/group databases: GermOnline, MEROPS, PeroxiBase, PptaseDB, REBASE, TRANSFAC

Sequence databases: EMBL, PIR, UniGene

3D structure databases: HSSP, PDB, SMR

PTM databases: GlycoSuiteDB, PhosSite

Genome annotation databases: Ensembl, GenomeReviews, KEGG, TIGR

Page 68:

Implicit cross-references on new web server and ExPASy

Implicit X-references to 26 additional databases are added by the ExPASy server on the www (e.g. GeneCards, ModBase, etc.)

These X-refs are not present as hard-coded DR lines in the Swiss-Prot entry as it can be downloaded by ftp, but are added on the fly when someone views an entry on ExPASy. This can be done because enough information is present in the UniProtKB entry to access the related information in another database. Example: all Swiss-Prot/TrEMBL entries are linked to the BLOCKS domain database via the Swiss-Prot/TrEMBL accession number.
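The "on the fly" idea can be illustrated with a minimal Python sketch: a link is computed from the accession number at display time instead of being stored in the entry. The URL templates below are placeholders invented for this illustration (they are not the real ExPASy link rules), and the function name is hypothetical.

```python
# Hypothetical URL templates: placeholders only, NOT the real ExPASy link rules.
IMPLICIT_XREF_TEMPLATES = {
    "BLOCKS": "https://blocks.example.org/entry?ac={ac}",
    "ModBase": "https://modbase.example.org/search?uniprot={ac}",
}


def implicit_links(accession):
    """Build one link per implicit database from the accession number alone."""
    return {
        database: template.format(ac=accession)
        for database, template in IMPLICIT_XREF_TEMPLATES.items()
    }


links = implicit_links("P99999")
for database, url in links.items():
    print(database, url)
```

Nothing is stored in the entry itself: adding a new implicit database in this sketch only means adding one template.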

Page 69:

Keyword definition and usage in Swiss-Prot

Linked to Gene Ontology to further facilitate information retrieval via controlled vocabularies

Page 70:

In a UniProtKB/Swiss-Prot entry, you can expect to find:

• All the names of a given protein (and of its gene);
• Its biological origin with links to the taxonomic databases;
• A selection of references;
• A summary of what is known about the protein: function, alternative products, PTM, tissue expression, disease, 3D-structures, etc.;
• Numerous cross-references;
• Selected keywords;
• A description of important sequence features: domains, PTMs, variations, etc.;
• A (often corrected) protein sequence and the description of various isoforms/variants.

Page 71:

Monitoring entry history: The UniProtKB Sequence/Annotation Version archive

Page 72:

Page 73:

… and many useful links:

Page 74:

And on the new website

other tools are not yet available…

Page 75:

UniProt Knowledgebase

• Swiss-Prot: Manually annotated section

• TrEMBL: Automatically annotated section

Page 76:

Distinguishing Swiss-Prot and TrEMBL

Page 77:

Page 78:

Accession number: use it whenever you cite a UniProt entry (never cite the entry name (ID) alone)
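To make the difference between the two identifiers concrete, here is a small Python sketch that tells an accession number from an entry name. The accession pattern is the regular expression given in the current UniProt help pages (it also accepts the 10-character accessions introduced after this talk); the entry-name pattern is a loose assumption of ours, and the function name is hypothetical.

```python
import re

# Accession pattern from the UniProt help pages (6- or 10-character accessions).
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)
# Entry names look like MNEMONIC_SPECIES; this loose pattern is an assumption.
ENTRY_NAME_RE = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")


def classify_identifier(identifier):
    """Return 'accession', 'entry name' or 'unknown' for a UniProtKB identifier."""
    if ACCESSION_RE.match(identifier):
        return "accession"
    if ENTRY_NAME_RE.match(identifier):
        return "entry name"
    return "unknown"


for identifier in ("P99999", "CYC_HUMAN", "not an id"):
    print(identifier, "->", classify_identifier(identifier))
```

A check like this could warn an author who is about to cite only the entry name, which may change, instead of the stable accession number.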

Page 79:

Non-Redundant Complete Proteome Sets

• Text search UniProtKB keyword “Complete proteome”, combined with an organism name

• Or download precomputed sets (bacteria, archaea, some eukaryotes): ftp://ftp.expasy.org/databases/complete_proteomes/entries

• Or EBI Integr8 http://www.ebi.ac.uk/integr8/

Page 80:

Swiss-Prot annotation priorities

The main annotation programs:

• HAMAP (High quality Automated and Manual Annotation of microbial Proteomes; bacteria, archaea, plastids);
• HPI (Human Proteomics Initiative);
• PPAP (Plant Proteome Annotation Project);
• FPAP (Fungal Proteome Annotation Project);
• Viral proteins;
• Tox-Prot (Toxin Annotation Project);
• ENZYMES (proteins with EC numbers);
• PTMs;
• 3D-structure;
• Protein-protein interactions;
• Quality assurance, includes controlled vocabularies.

Page 81:

Model organisms

• Organisms for which we want to have a more in-depth coverage;

• Completeness, links with specialized databases, specific documents;

• Examples: E. coli, B. subtilis, human, mouse, fruit fly, C. elegans, yeast, S. pombe, A. thaliana.

Page 82:

Human Proteomics Initiative (HPI)

Page 83:

From genome to proteome

Considerable increase in complexity:

~ 21,000 human genes
-> alternative splicing of mRNA (2-5 fold increase) -> ~ 100,000 human transcripts
-> post-translational modifications of proteins (PTMs) (5-10 fold increase) -> ~ 1,000,000 human proteins
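The fold increases quoted on this slide can be checked with a few lines of arithmetic. This is only a sketch in Python with variable names of our own choosing; it multiplies the slide's gene count by the lower and upper fold factors at each step.

```python
# Figures taken from the slide; variable names are ours.
genes = 21_000
splicing_fold = (2, 5)   # alternative splicing: 2-5 fold increase
ptm_fold = (5, 10)       # post-translational modifications: 5-10 fold increase

# Lower and upper bounds after each step.
transcripts = (genes * splicing_fold[0], genes * splicing_fold[1])
proteins = (transcripts[0] * ptm_fold[0], transcripts[1] * ptm_fold[1])

print(f"transcripts: {transcripts[0]:,} to {transcripts[1]:,}")
print(f"proteins:    {proteins[0]:,} to {proteins[1]:,}")
```

The slide's round figures (~100,000 transcripts, ~1,000,000 proteins) sit at the upper end of both ranges.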

Page 84:

In the case of human genes, the Swiss-Prot/TrEMBL redundancy is still very high:

15,803 + 53,100 entries for about 20,000* genes

* human gene number estimation: 21,000-35,000

MS proteomics has verified more than 10% of human gene products, but has not identified significant numbers of unpredicted proteins

What is missing:
• Sequences not submitted to EMBL/GenBank/DDBJ (and PIR)
• Not yet predicted or known genes ("no CDS provided by the submitters" or no DNA sequence)
• Confidential data (Patent application sequences)
• Immunoglobulins, T-cell receptors (-> UniParc)
• …

1000

Page 85:

Page 86:

Post-translational modifications (PTMs)

Page 87:

PTM definition

A post-translational modification or PTM is a modification of a polypeptide chain involving the making or the breaking of covalent bond(s) that occurs during (co-translational class) or after translation.

Page 88:

PTMs influence or even define protein function

• Phosphorylation and possibly GlcNAcylation and S-nitrosylation are a means of transducing extracellular signals to the inside of the cell.
• Methylation has a role in nuclear protein import.
• Lipid addition allows the association of proteins with membranes (e.g. GPI-anchor, myristate, palmitate).
• Intrachain disulfide bonds and N-glycosylation influence protein folding.
• Interchain disulfide bonds bind subunits together.
• Other PTMs are directly involved in protein function, for example the binding of cofactors (e.g. pyridoxal phosphate), or the synthesis of a cofactor by the modification of amino acids present in the protein (e.g. quinones).

Page 89:

PTM variety

[Figure: modification types mapped onto the 20 amino acids (Gly Ala Val Leu Ile Lys Arg His Asp Glu Asn Gln Cys Ser Thr Met Pro Phe Tyr Trp). Legend: in black, cytoplasmic modifications; in dark grey, both cytoplasmic and extracellular modifications, depending on the exact type; in light grey, extracellular modifications.]

Side-chain modifications: acetylation, methylation, acylation, phosphorylation, oxidation, crosslinks, hydroxylation, cofactor binding, sulfation, C-linked sugar, N-linked sugar, O-linked sugar, S-linked sugar

N-terminal modifications: acetylation, methylation, acylation, crosslinks

C-terminal modifications: GPI, amidation, crosslinks, methylation

Each protein can be modified at various sites, which gives a high number of ‘alternative’ peptides.

283 different protein modifications are annotated in UniProtKB/Swiss-Prot…

Page 90:

Large scale experiments (LSE) for PTMs!

• PTM information can now be obtained from results of proteomics large scale experiments (LSE);

• In the past 12 months we have added about 6,000 experimental PTMs using data originating from some of these projects.

AMB, SP20

Page 91:

Proteomic studies have led to the updating of 2767 human Swiss-Prot entries, mainly with PTM information

(UniProt release 10.0, March 2007)

Glycosylation (9%)

Other PTMs (4%)

Phosphorylation (83%)

Subcellular location (4%)

Page 92:

Bacteria and Archaea (HAMAP)

Page 93:

In 2006, ≈130 new bacterial and archaeal genomes (not WGS) were submitted to the DNA databases;

If on "average" 4,000 proteins/genome => about 500,000 proteins!

How to cope?
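The workload estimate is a one-line multiplication; a minimal Python check, with variable names of our own choosing and the slide's rough figures as inputs:

```python
# Rough figures from the slide.
new_genomes_2006 = 130       # new bacterial and archaeal genomes in 2006
proteins_per_genome = 4_000  # the slide's "average"

estimated_new_proteins = new_genomes_2006 * proteins_per_genome
print(f"{estimated_new_proteins:,} new proteins per year")
```

The product is 520,000, which the slide rounds to 500,000.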

Page 94:

HAMAP

High quality Automated and Manual Annotation of microbial Proteomes

Lots of microbial genomes, lots of proteins. What should we do with them in UniProt?

Page 95:

http://www.expasy.org/unirule/MF_00319

Page 96:

Automatic annotation of proteins belonging to specified families (1)

• This program requires the continuous development and adaptation of software tools as well as the development of a database of annotation rules for each family (so far about 1,400).

Page 97:

Allows us to annotate automatically, yet with a very high level of quality, proteins that belong to well defined protein families;

Can be applied both to characterized proteins and to some UPFs (Uncharacterized Protein Families);

The families are based on UniProtKB/Swiss-Prot entries, so we first do all the annotation steps described earlier!

Page 98:

Page 99:

Using HAMAP, we can currently annotate to Swiss-Prot quality level between 10% and 50% of a complete microbial proteome (next step: HAMAP for Fungi…)

www.expasy.ch/sprot/hamap/

Page 100:

Updates

• DNA sequence archives
– EMBL/GenBank/DDBJ is an archive
  • All submitted data goes into the archive
  • Submitters are responsible for the submitted sequences and the accompanying annotation
  • Nobody else can change them (including the curators at EMBL/GenBank/DDBJ)

• Protein sequence databases
– UniProtKB/Swiss-Prot is NOT an archive
  • Swiss-Prot chooses what goes into the database and where to place it
  • Swiss-Prot updates annotation and sequences when necessary

Page 101:

**ZB SYP, 28-NOV-2003; ALB, 16-NOV-2004; MIM, 31-Jan-2006;
**ZB BER, 13-FEB-2006; LYG, 14-JUN-2006; LYG, 21-SEP-2006;
**ZB CHH, 05-DEC-2006;

Page 102:

User updates or annotation requests

Page 103:

Accessing & Searching UniProtKB

Direct access (keyword search)
• New search tool – we’ll use it later
• Sequence Retrieval System (SRS, Europe), will disappear
• Entrez (NCBI, USA) – UniProtKB/Swiss-Prot (not TrEMBL) is integrated in GenPept, but with a changed format, and with some information (e.g. implicit cross-references) missing
• Query tools on ExPASy & UniProt (http://www.expasy.org/sprot/, http://www.uniprot.org)

Indirect access (sequence search)
• Bioinformatics & sequence analysis tools (Blast, Fasta, GCG, Emboss, MS Identification tools…)

Page 104:

Downloading the UniProt Knowledgebase

http://www.expasy.org/sprot/download.html

• Swiss-Prot and TrEMBL form a complete, non-redundant database, the UniProt Knowledgebase
• Can be downloaded from ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase
• In “Swiss-Prot” format, fasta or xml format
• Complemented by sequences of alternative splice isoforms
• “everything” about “all” proteins! (at least all CDS submitted to the public nucleotide sequence databases)

Page 105:

If you want to develop tools to work with your local copy of UniProtKB:

Swissknife – a Perl parser for UniProtKB

• Constantly updated according to latest format changes
• Advantage: you do not need to know how exactly the information is stored in the flat file

• http://swissknife.sourceforge.net/
• ftp://ftp.ebi.ac.uk/pub/software/swissprot/Swissknife/
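To show what such a parser hides from you, here is a deliberately small Python sketch that reads a few line types of the flat file (ID, AC, DR, sequence lines, and the // terminator). It is not Swissknife and does not mirror its API; the function name and the toy record are made up for this illustration, and a real entry has many more line types.

```python
def parse_flat_file(text):
    """Yield one dict per entry from UniProtKB flat-file text (ID, AC, DR, sequence only)."""
    def new_entry():
        return {"id": None, "accessions": [], "xrefs": [], "sequence": ""}

    entry = new_entry()
    for line in text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            yield entry
            entry = new_entry()
            continue
        code, data = line[:2], line[5:].strip()
        if code == "ID":                   # entry name is the first token
            entry["id"] = data.split()[0]
        elif code == "AC":                 # accession numbers, ';'-separated
            entry["accessions"] += [ac.strip() for ac in data.split(";") if ac.strip()]
        elif code == "DR":                 # cross-reference: database; primary id; ...
            database, _, rest = data.partition(";")
            entry["xrefs"].append((database.strip(), rest.split(";")[0].strip()))
        elif code == "  ":                 # sequence lines start with blanks
            entry["sequence"] += data.replace(" ", "")


# Toy record written for this sketch; it is NOT a real UniProtKB entry.
SAMPLE = "\n".join([
    "ID   TOY_HUMAN               Reviewed;          10 AA.",
    "AC   P12345; Q00001;",
    "DR   EMBL; X00000; -; mRNA.",
    "DR   PROSITE; PS00001; ASN_GLYCOSYLATION; 1.",
    "SQ   SEQUENCE   10 AA;  1111 MW;  0000000000000000 CRC64;",
    "     MKTAYIAKQR",
    "//",
])

entries = list(parse_flat_file(SAMPLE))
print(entries[0])
```

For real work, a maintained parser is the safer choice because it follows format changes for you.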

Page 106:

Take home message

• Swiss-Prot is the non-redundant, manually annotated and highly cross-referenced section of the UniProt Knowledgebase

• Be aware of the differences between UniProtKB/TrEMBL and UniProtKB/Swiss-Prot
– Computer vs. Human
– Redundant vs. Non-redundant

• Always cite the Accession number, not the entry name
– The AC is stable
– The entry name might change

We need your feedback and your [email protected]

http://www.expasy.org/sprot/update.html (and from every UniProtKB entry page on our servers)

Page 107:

The UniProt Consortium

UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information

www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).

Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc

and soon UniMES (for Metagenomic and Environmental Sequences)

Page 108:

UniRef100, 90 and 50 clusters

One UniRef100 entry -> all identical sequences from UniProtKB and some sections of UniParc (including fragments, Swiss-Prot splice variants).

One UniRef90 entry -> sequences that are at least 90% identical.

One UniRef50 entry -> sequences that are at least 50% identical.
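The effect of the three identity thresholds can be illustrated with a toy greedy clustering in Python. This is not the UniRef pipeline: the identity measure below is a naive, ungapped position-by-position comparison, the sequences are invented, and all function names are ours.

```python
def identity(a, b):
    """Fraction of identical positions, relative to the longer sequence (toy, ungapped)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def greedy_cluster(sequences, threshold):
    """Put each sequence in the first cluster whose representative is similar enough."""
    clusters = []  # list of (representative, members)
    for seq in sorted(sequences, key=len, reverse=True):
        for representative, members in clusters:
            if identity(representative, seq) >= threshold:
                members.append(seq)
                break
        else:
            clusters.append((seq, [seq]))  # seq starts a new cluster
    return clusters


# Invented toy sequences.
toy = ["MKTAYIAKQR", "MKTAYIAKQK", "MKTAYGGGGG", "WWWWWWWWWW"]
for threshold in (1.0, 0.9, 0.5):
    print(threshold, len(greedy_cluster(toy, threshold)), "clusters")
```

Lowering the threshold merges more sequences per cluster, which is why a UniRef50 set is much smaller than a UniRef100 set built from the same input.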

Page 109:

UniRef100, 90 and 50 clusters

One cluster can contain sequences of several species, clustering is done independently of the organism

Each cluster has a “representative”, “reference” sequence, preferably that of the best-annotated Swiss-Prot entry

UniRef identifiers are of the form UniRef100_P99999, UniRef50_P00414 – not stable, as clusters are recomputed with every biweekly release, and cluster representatives can change!

UniRef is useful for comprehensive BLAST sequence searches by providing sets of representative sequences.

Page 110:

Implicit cross-link UniProtKB to UniRef:

new web view:

Page 111:

The UniProt Consortium

UniProt (Universal Protein Resource): the world's most comprehensive catalogue of protein information

www.uniprot.org, Wu et al. Nucleic Acids Res. 34:D187-191(2006).

Provides 3 databases:
- UniProtKB (Swiss-Prot + TrEMBL)
- UniRef
- UniParc

and soon UniMES (for Metagenomic and Environmental Sequences)

Page 112:

UniParc – the UniProt Archive

• 8.8 million sequences
• Sequences and cross-references (AC numbers)
• A comprehensive collection of the raw protein sequences in public databases (including those not submitted to the DNA databases): Swiss-Prot, TrEMBL, PIR, EMBL, Ensembl, IPI, PDB, RefSeq, FlyBase, WormBase, Patent Offices.
• UniParc can be used to track sequence versions

Use with extreme caution: also contains pseudogenes, incorrect CDS predictions, etc… and is highly redundant!

Page 113:

UniParc tracks a protein sequence and its integration in various databases

http://www.pir.uniprot.org/cgi-bin/textSearch_AR

Patent data

Page 114:

UniParc entry UPI0000033477 part 2

TrEMBL entry was merged into Swiss-Prot

TrEMBL entry probably to be merged into Swiss-Prot

Page 115:

Page 116:

www.expasy.ch/prosite

Page 117:

PROSITE

A database of protein families and domains using two kinds of motif descriptors:

Patterns or regular expressions:
• User friendly (easy to understand and to use)
• Well designed for the detection of biologically meaningful sites such as residues playing a structural or functional role
• Can be used to scan a protein database in reasonable time on any computer

Generalized profiles or weight matrices:
• Well adapted to cover the full length of the protein or domain
• Are able to detect highly divergent families or domains with only a few well conserved positions
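Because PROSITE patterns are essentially regular expressions, they can be translated into the regex syntax of a programming language. The following Python sketch handles the common pattern elements (x, [ABC], {ABC}, repetitions, < and > anchors); it is an illustration written for this transcript, not the ScanProsite implementation, and it ignores rarer syntax such as anchors inside brackets. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE pattern such as 'N-{P}-[ST]-{P}' into a Python regex string."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")   # '<' anchors at the N-terminus
        anchor_end = element.endswith(">")       # '>' anchors at the C-terminus
        element = element.strip("<>")
        # Split off an optional repetition suffix such as (3) or (2,4).
        core, low, high = re.fullmatch(
            r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element
        ).groups()
        if core == "x":                          # any amino acid
            part = "."
        elif core.startswith("{"):               # {ABC}: any residue except A, B, C
            part = "[^" + core[1:-1] + "]"
        else:                                    # single residue or [ABC] class
            part = core
        if low:
            part += "{" + low + ("," + high if high else "") + "}"
        parts.append(("^" if anchor_start else "") + part + ("$" if anchor_end else ""))
    return "".join(parts)


n_glyc = prosite_to_regex("N-{P}-[ST]-{P}")
print(n_glyc)
print(re.search(n_glyc, "AANGSAK"))
```

A pattern translated this way can be run over every sequence in a local copy of the database with an ordinary regex search, which is what makes pattern scanning feasible on any computer.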

Page 118:

Identification of protein domains and families

• There are two non-exclusive approaches for the determination of the function of an uncharacterized protein:
– Comparison with a complete sequence database (BLAST)
– Scanning a database of patterns and profiles

• Most proteins can be grouped into families. Proteins belonging to a particular family share functional attributes and are derived from a common ancestor;

• Some regions in the sequence are more conserved than others during evolution because they are important for the function or the structure of the protein;

• Like fingerprints for police identification, signatures built out of sequence patterns or profiles can be used to formulate hypotheses about the function of uncharacterized proteins.

Page 119:

Definitions of conserved regions

Conserved regions can be classified into 5 different groups:

• Families: proteins that have the same domain arrangement, be it one or many domains.

• Domains: specific combination of secondary structures that assume characteristic three dimensional structures or folds.

• Repeats: structural units always found in two or more copies that assemble into a specific fold. Assemblies of repeats might also be thought of as domains.

• Motifs: short regions with conserved active- or binding-sites that usually adopt a folded conformation only in association with their ligands.

• Sites: functional residues (active sites, disulfide bridges, post-translationally modified residues)

Page 120:

Conserved regions (2)

[Figure: PPID family: 1 CSA_PPIASE domain + 3 TPR repeats. Labels: Cys 181: active site residue; binding cleft (motif).]

Page 121:

http://www.expasy.org/tools/scanprosite/

Page 122:

Page 123:

Functionally and structurally relevant residues in PROSITE motif descriptors

A new concept to extract more information from profiles

Principle:
• Combining the advantages of profiles (high sensitivity) and patterns (position-specific information)
• Tagging of amino acids at precise positions in the profile and checking their presence in the matched sequence
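The "tag and check" principle can be sketched as follows in Python. This is a simplified illustration, not ProRule itself: it assumes an ungapped match so that profile positions map directly onto positions in the matched region, and the function name, tag table and sequences are invented for the example.

```python
def check_tagged_residues(matched_region, tagged_positions):
    """Report, for each tagged position, the residue found and whether it is allowed.

    matched_region: sequence segment matched by the profile (ungapped, toy assumption).
    tagged_positions: {1-based position: (allowed residues, label)}.
    """
    report = {}
    for position, (allowed, label) in tagged_positions.items():
        in_range = 0 < position <= len(matched_region)
        residue = matched_region[position - 1] if in_range else None
        report[label] = (residue, residue is not None and residue in allowed)
    return report


# Invented tags: position 3 must be Cys, position 5 must be His or Lys.
tags = {3: ("C", "active site"), 5: ("HK", "metal binding")}

conserved = check_tagged_residues("GACTHSK", tags)
mutated = check_tagged_residues("GASTHSK", tags)
print(conserved)
print(mutated)
```

In the second call the profile would still match, but the missing cysteine is flagged, so the active-site annotation would not be transferred.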

Page 124:

ProRule

Aim:

• Provide users with biologically meaningful functional and structural information: active sites, post-translational modification sites, binding sites, disulfide bonds, transmembrane regions.

• Help the UniProtKB/Swiss-Prot annotation and provide enhanced homogeneity: domain name and boundaries, keywords and linked GO terms, EC numbers, false negative PROSITE patterns.

Page 125:

Sigrist et al.: Bioinformatics 21:4060-4066(2005)

www.expasy.ch/prosite/prorule.html

Page 126:

Other methods for protein/domain identification

Pfam, TIGRFAMs, SMART, Gene3D, PANTHER, CDD: Hidden Markov Models (HMM), Probabilistic models;

PRINTS: “Unweighted” matrices; protein fingerprints

BLOCKS: Weight matrix derived from ungapped alignments;

PIRSF, SUPERFAMILY: classification system based on evolutionary relationship of whole proteins

ProDom: automatic compilation of homologous domains based on recursive PSI-BLAST searches.

Page 127:

The InterPro project
www.ebi.ac.uk/interpro

Integrated Documentation Resource of Protein Families, Domains and Functional Sites

Page 128:

The InterPro project
www.ebi.ac.uk/interpro

• Unification of PROSITE, PRINTS, Pfam and ProDom into an integrated resource of protein families, domains and functional sites in 2000;
• Joint effort in creating a unified yet methodologically diverse system for protein family/domain identification;
• Single set of “documents” linked to the various methods;
• Distributed with tools by anonymous FTP and through www servers;
• Used to enhance the functional annotation of UniProtKB (Swiss-Prot and TrEMBL);
• Has progressively incorporated other databases.

Page 129:

Current status of InterPro

Release 14.1 (February 2007) was built from Pfam, PRINTS, PROSITE, ProDom, SMART, TIGRFAMs, PIRSF, SCOP-based SUPERFAMILY, Gene3D and PANTHER, and the current UniProt/Swiss-Prot + TrEMBL data (for details see http://www.ebi.ac.uk/interpro/release_notes.html).

InterPro release 14.1 contains 13,953 entries, representing 3,911 domains, 9,610 families, 232 repeats, 34 active sites, 20 binding sites and 19 post-translational modification sites. Overall, there are 15,880,845 InterPro hits from 3,100,874 UniProtKB protein sequences.

92.4% of Swiss-Prot and 76.4% of TrEMBL protein sequences have one or more InterPro hits.

Page 130:

http://www.ebi.ac.uk/interpro/

Page 131:

http://www.ebi.ac.uk/interpro/IEntry?ac=IPR001304

Page 132: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

InterPro: Graphical domain representation

Page 134: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

http://www.ebi.ac.uk/integr8/ProteomeAnalysisAction.do?orgProteomeId=18

Page 135: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

The ExPASy www server

• First molecular biology server on the Web (August 1993); ~500 million accesses since;

• Dedicated to proteomics:
  – Databases: UniProtKB, PROSITE, Swiss-2DPAGE, etc.;
  – Many 2D/MS protein identification/characterization and sequence analysis tools;

• Mirror sites in Australia, Brazil, Canada, China and Korea: http://{au|br|ca|cn|kr|www}.expasy.org
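The `{au|br|ca|cn|kr|www}` notation is shorthand for six separate host names. As a minimal sketch, the pattern can be expanded into explicit URLs as shown below. The helper name `expand_alternatives` is an assumption introduced here, and the list reflects the mirrors as given on the 2007 slide; it is not a statement that these hosts still resolve.

```python
import re


def expand_alternatives(pattern: str) -> list:
    """Expand the first {a|b|c} group in a pattern into one string per option."""
    match = re.search(r"\{([^{}]*)\}", pattern)
    if match is None:
        return [pattern]  # nothing to expand
    head = pattern[:match.start()]
    tail = pattern[match.end():]
    return [head + option + tail for option in match.group(1).split("|")]


# Pattern exactly as written on the slide.
MIRROR_PATTERN = "http://{au|br|ca|cn|kr|www}.expasy.org"

mirrors = expand_alternatives(MIRROR_PATTERN)
for url in mirrors:
    print(url)
```

A script could walk such a list to pick the nearest mirror, falling back to the main `www` host.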

Page 136: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

Page 137: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

ExPASy software tools

• Tools for the display and management of databases (NiceProt, Swiss-Shop sequence alerting system, etc.);

• Tools for sequence analysis (ScanProsite, ProtParam, ProtScale, RandSeq, Translate, etc.);

• Proteomics tools (AACompIdent, FindMod, FindPept, Aldente, PeptideMass, TagIdent, etc.);

• 3D-structure analysis and display tools (Swiss-Model, Swiss-PDBviewer)
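To give a feel for the kind of calculation behind sequence-analysis and proteomics tools such as PeptideCutter or ProtParam, here is a minimal Python sketch of an in-silico tryptic digest and an amino acid composition count. It is not the tools' actual implementation: the cleavage rule is the commonly used simplification (cut after K or R unless the next residue is P), and the function names and the example sequence are assumptions introduced here.

```python
from collections import Counter


def tryptic_digest(sequence: str) -> list:
    """Split a protein sequence after K or R, except when followed by P.

    Simplified trypsin rule for illustration; real tools handle
    more exceptions and allow missed cleavages.
    """
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])  # C-terminal peptide
    return peptides


def composition(sequence: str) -> dict:
    """Count how often each amino acid (one-letter code) occurs."""
    return dict(Counter(sequence))


# Hypothetical example sequence, not taken from any database entry.
example = "MKWVTFRPLLKAA"
peptides = tryptic_digest(example)
counts = composition(example)

print(peptides)
print(counts)
```

The peptides produced this way are what a mass-based identification tool would compare against measured masses; the composition count is the starting point for composition-based identification.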

Page 138: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

Identification: Aldente, TagIdent, AACompIdent, MultiIdent

Characterization: FindMod, GlycoMod, FindPept

Analysis: PeptideMass, GlycanMass, BioGraph, PeptideCutter, ProtScale, ProtParam

- Use annotation in Swiss-Prot and TrEMBL (preprocessing, PTMs, etc.)
- Hyperlinks between tools and databases

http://www.expasy.org/tools/

Page 139: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

http://www.expasy.org/links.html

Page 140: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

Finding out about recent developments:

UniProtKB/Swiss-Prot recent format changes: http://www.expasy.org/sprot/relnotes/sp_news.html

UniProtKB/Swiss-Prot planned format changes: http://www.expasy.org/sprot/relnotes/sp_soon.html

Subscribe to the electronic Swiss-Flash bulletins: http://www.expasy.org/swiss-flash/

What’s new on ExPASy: http://www.expasy.org/history.html

Page 141: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

References (1)

UniProtKB/Swiss-Prot: http://www.expasy.org/sprot/sprot-ref.html

Wu C. et al. The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res. 34:D187-191(2006).

Boeckmann B. et al. Protein variety and functional diversity: Swiss-Prot annotation in its biological context. Comptes Rendus Biologies 328:882-99(2005).

Bairoch A. Swiss-Prot: Juggling between evolution and stability. Brief. Bioinform. 5:39-55(2004).

Farriol-Mathis N. et al. Annotation of post-translational modifications in the Swiss-Prot knowledgebase. Proteomics 4:1537-1550(2004).

Gasteiger E. et al. Swiss-Prot: Connecting biological knowledge via a protein database. Curr. Issues Mol. Biol. 3:47-55(2001).

Page 142: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

References (2)

PROSITE:

Hulo N. et al. The PROSITE database. Nucleic Acids Res. 34:D227-D230(2006).

Sigrist C.J.A. et al. PROSITE: a documented database using patterns and profiles as motif descriptors. Brief. Bioinform. 3:265-274(2002).

Gattiker A. et al. ScanProsite: a reference implementation of a PROSITE scanning tool. Applied Bioinformatics 1:107-108(2002).

Sigrist C.J.A. et al. ProRule: a new database containing functional and structural information on PROSITE profiles. Bioinformatics 21:4060-6(2005).

ExPASy:

Gasteiger E. et al. ExPASy: the proteomics server for in-depth protein knowledge and analysis. Nucleic Acids Res. 31:3784-3788(2003).

Page 144: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

Take home message

• We need your [email protected]

Or via the website

Page 145: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

Before the introduction to Swiss-Prot/ExPASy…

After the introduction to Swiss-Prot/ExPASy…

Page 146: Tunis, March 2007 A. Auchincloss UniProtKB and ExPASy UniProtKB/Swiss-Prot and ExPASy: Protein sequence databases and proteomics tools developed at the

Some practical exercises:

http://education.expasy.org/cours/Tunis/

1. Finding databases
2. Comparing protein databases
3. Comparing BLAST programs
4. BLAST output
5. Bacterial start sites
6. UniRef
7. Different views of UniProtKB
8. Environmental sequences
9. Inter-database links & PROSITE
10. InterPro
11. Using UniProtKB/Swiss-Prot to create datasets