Anastasia Nikolskaya, Lai-Su Yeh
Protein Information Resource, Georgetown University Medical Center, Washington, DC
PIR: a comprehensive resource for functional analysis of protein sequences and families
2
PIR Web Site
New web site, soon to become public
http://pir.georgetown.edu currently serves the old version
PIR and UniProt web sites interlinked and cross-navigable
PIR-specific features
Text Search Sequence Search Classification Database Search
3
iProClass Protein Knowledgebase
• Integration of protein family, function, structure
• Rich links (executive summary + hypertext links) to > 90 databases
• Value-added reports for 1.96 million UniProtKB protein entries
Databases integrated in iProClass, by category (figure):
• Disease/Variation: OMIM, HapMap, …
• Ontology: GO
• Protein Sequence: UniProt, UniRef, UniParc, RefSeq, GenPept, …
• Gene/Genome: GenBank/EMBL/DDBJ, LocusLink, UniGene, MGI, TIGR, …
• Gene Expression: GEO, GXD, ArrayExpress, CleanEx, SOURCE, …
• Structure: PDB, SCOP, CATH, PDBSum, MMDB, …
• Family: PIRSF, InterPro, Pfam, Prosite, COG, …
• Interaction: DIP, BIND, …
• Taxonomy: NCBI Taxon, NEWT
• Protein Expression: Swiss-2DPAGE, PMG, …
• Literature: PubMed
• Function/Pathway: EC-IUBMB, KEGG, BioCarta, EcoCyc, WIT, …
• Modification: RESID, PhosphoBase, …
iProClass
Integrated Protein Knowledgebase
http://pir.georgetown.edu/iproclass
4
Example
Want to find information on chorismate mutases. Specifically:
Start with Bacillus subtilis P19080 = CHMU_BACSU
Relatedness to other chorismate mutases:
- Homology
- Domain architecture
- Is it related to E. coli P07022, a well-studied bifunctional enzyme (P-protein, chorismate mutase/prephenate dehydratase)?
5
iProClass Sequence Report
6
What can we find about “chorismate mutase”?
Protein Analysis: I. Text Search iProClass
7
Text Search Results (I)
UniProt ID
8
Text Search Results (II)
Display options: add or remove columns
9
Text Search Results (III)
Find chorismate mutase(s) from B. subtilis
10
Determining Protein Homology
Is B. subtilis CM P19080 homologous to E. coli P-protein P07022? To B. subtilis AroA(G) P39912?
Which domains, if any, in multidomain chorismate mutases does it correspond to?
What kinds of domain architecture exist in chorismate mutases?
11
Retrieve Proteins by UID in Batch Mode
ID mapping option: can use various non-UniProt IDs
Batch Retrieval
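Preparing a batch for retrieval can be sketched as follows. This is a minimal illustration, not the PIR implementation: the function name `split_batch` is hypothetical, and the regular expression is the accession-number pattern published in the UniProt documentation. Identifiers that do not look like UniProt accessions (for example the entry name CHMU_BACSU) are set aside for the ID mapping option.

```python
import re

# UniProtKB accession pattern, as given in the UniProt documentation.
UNIPROT_AC = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)


def split_batch(ids):
    """Separate UniProt accessions from identifiers that need ID mapping.

    Hypothetical helper for illustration only.
    """
    accessions, needs_mapping = [], []
    for raw in ids:
        token = raw.strip()
        if not token:
            continue  # skip blank lines in a pasted list
        if UNIPROT_AC.match(token.upper()):
            accessions.append(token)
        else:
            needs_mapping.append(token)
    return accessions, needs_mapping


# Identifiers taken from the chorismate mutase example in this talk.
batch = ["P19080", "P07022", "P39912", "CHMU_BACSU"]
accessions, needs_mapping = split_batch(batch)
print("retrieve directly:", accessions)
print("map first:", needs_mapping)
```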
12
Determining Protein Homology: Sequence Search
BLAST FASTA SSearch
13
BLAST Search Results
BLAST query with UniProt sequence P19080 hits PIRSF005965 family members as best hits
14
Pre-compiled Related Sequences: saves time
15
BLAST/SSEARCH Results
SSEARCH Alignment
BLAST Alignment
16
Determining Protein Homology: Peptide Search
17
Peptide Search Results
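The idea behind a peptide search can be sketched as an exact substring scan. This is an assumption about the matching semantics, not a description of the PIR tool; the function `peptide_search` and the toy sequences below are invented for illustration and are not real database entries.

```python
def peptide_search(peptide, sequences):
    """Return {entry_id: [1-based start positions]} for exact peptide matches.

    Illustrative only: assumes exact, case-insensitive matching.
    """
    peptide = peptide.upper()
    hits = {}
    for entry_id, seq in sequences.items():
        seq = seq.upper()
        positions = []
        start = seq.find(peptide)
        while start != -1:
            positions.append(start + 1)  # report 1-based coordinates
            start = seq.find(peptide, start + 1)  # allow overlapping matches
        if positions:
            hits[entry_id] = positions
    return hits


# Made-up sequences, not real proteins.
toy_sequences = {"toy1": "MKLVAGGKLV", "toy2": "MSTAAA"}
print(peptide_search("KLV", toy_sequences))
```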
18
Family Classification System: One-Stop Platform for Protein Analysis
• Protein families reflect evolutionary relationships
• Function often follows along the family lines
• Therefore, matching a protein sequence to a protein family provides information about the protein (this needs a highly curated and annotated family)
• Faster and often more accurate than searching against a protein database
• Protein classification facilitates sequence and functional analysis of proteins and is used for accurate automatic annotation (PIRSF is used for UniProt annotation)
19
PIRSF Classification System
PIRSF: reflects evolutionary relationships of full-length proteins
Definitions:
• Basic unit = Homeomorphic Family
• Homologous: inferred by sequence similarity
• Homeomorphic: full-length sequence similarity and common domain architecture
• Hierarchy: flexible number of levels with varying degrees of sequence conservation
• Network structure: multiple domain parents
Advantages:
• Annotation of both generic biochemical and specific biological functions
• Accurate propagation of annotation and development of standardized protein nomenclature and ontology
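The homeomorphic criterion (full-length similarity plus common domain architecture) can be sketched as a simple predicate. Everything here is an assumption made for illustration: the `Protein` record, the `is_homeomorphic` function, the 0.8 coverage threshold, the placeholder domain label `PDT-domain`, and the sequence lengths are not taken from PIRSF.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Protein:
    accession: str
    length: int
    domains: tuple  # ordered domain identifiers, N- to C-terminus


def is_homeomorphic(a, b, aligned_len, min_coverage=0.8):
    """True if two proteins share domain architecture and align over most of their length.

    min_coverage is an arbitrary illustrative threshold, measured against the longer protein.
    """
    same_architecture = a.domains == b.domains
    coverage = aligned_len / max(a.length, b.length)
    return same_architecture and coverage >= min_coverage


# Hypothetical proteins with invented lengths.
mono_a = Protein("mono_a", 100, ("PF01817",))
mono_b = Protein("mono_b", 95, ("PF01817",))
bifunctional = Protein("bifunctional", 380, ("PF01817", "PDT-domain"))

print(is_homeomorphic(mono_a, mono_b, aligned_len=90))
print(is_homeomorphic(mono_a, bifunctional, aligned_len=90))
```

Two proteins that share one domain but differ in overall architecture are homologous over that domain without being homeomorphic, which is why the bifunctional example fails the test.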
20
PIRSF hierarchy levels and example nodes (figure):
Domain Superfamily
• One common Pfam domain
• PF02153: Prephenate dehydrogenase (PDH)
• PF01817: Chorismate mutase (CM)
• PF00219: Insulin-like growth factor binding protein (IGFBP)
• PF02735: Ku70/Ku80 beta-barrel domain
PIRSF Superfamily
• 0 or more levels
• One or more common domains
• PIRSF800001: Ku70/80 autoantigen
PIRSF Homeomorphic Family
• Exactly one level
• Full-length sequence similarity and common domain architecture
• With PF02153: PIRSF001499: Bifunctional CM/PDH (T-protein); PIRSF006786: PDH, feedback inhibition-insensitive; PIRSF005547: PDH, feedback inhibition-sensitive
• With PF01817: PIRSF017318: CM of AroQ class, eukaryotic type; PIRSF001501: CM of AroQ class, prokaryotic type; PIRSF026640: Periplasmic CM; PIRSF001500: Bifunctional CM/PDT (P-protein); PIRSF001499: Bifunctional CM/PDH (T-protein)
• PIRSF006493: Ku, prokaryotic type
• PIRSF003033: Ku70 autoantigen
• PIRSF016570: Ku80 autoantigen
• PIRSF001969: IGFBP
• PIRSF018239: IGFBP-related protein, MAC25 type
PIRSF Homeomorphic Subfamily
• 0 or more levels
• Functional specialization
• PIRSF500001: IGFBP-1 … PIRSF500006: IGFBP-6
PIRSF Classification System
A protein may be assigned to only one homeomorphic family, which may have zero or more child nodes and zero or more parent nodes. Each homeomorphic family may have as many domain superfamily parents as its members have domains.
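These rules can be sketched as a small data structure. The class `PirsfNetwork` and its methods are hypothetical names chosen for illustration, not part of any PIR software; the node identifiers come from the hierarchy figure, and the assignment of P07022 to the P-protein family is used only as an example.

```python
class PirsfNetwork:
    """Toy model of the PIRSF network: nodes with parent links plus protein membership."""

    def __init__(self):
        self.parents = {}    # node id -> set of parent node ids
        self.family_of = {}  # protein accession -> its single homeomorphic family

    def add_node(self, node_id, parents=()):
        """Register a node; a node may have zero or more parents."""
        self.parents.setdefault(node_id, set()).update(parents)
        for parent in parents:
            self.parents.setdefault(parent, set())

    def assign(self, accession, family_id):
        """Place a protein in a homeomorphic family, refusing a second family."""
        if family_id not in self.parents:
            raise KeyError(f"unknown family {family_id}")
        current = self.family_of.get(accession, family_id)
        if current != family_id:
            raise ValueError(f"{accession} already belongs to {current}")
        self.family_of[accession] = family_id

    def ancestors(self, node_id):
        """Collect every node reachable by following parent links upward."""
        seen, stack = set(), [node_id]
        while stack:
            for parent in self.parents.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


net = PirsfNetwork()
# T-protein family: two domain superfamily parents, as listed in the figure.
net.add_node("PIRSF001499", parents=["PF02153", "PF01817"])
# P-protein family: only the CM domain parent appears in the figure.
net.add_node("PIRSF001500", parents=["PF01817"])
net.assign("P07022", "PIRSF001500")
print(sorted(net.ancestors("PIRSF001499")))
```

Because a family can have several parents, the structure is a directed acyclic graph rather than a tree; the membership table enforces the one-family-per-protein rule separately.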
21
PIRSF classification and curation workflow (figure, steps numbered 1 to 8)
• Stages: Unclassified UniProtKB Proteins; Uncurated Homeomorphic Clusters and Orphans; Preliminary Homeomorphic Families; Final Families, Subfamilies, Superfamilies
• Two kinds of procedure: Automatic Procedure and Computer-assisted Manual Curation
• Operations: Automatic Clustering; Map Domains on Clusters; Merge/Split Clusters; Add/Remove Members; Name, Refs, Abstract, Domain Arch.; Hierarchies (Superfamilies/Subfamilies); Build and Test HMMs; Protein Name Rules/Site Rules; Automatic Placement of New Proteins; Unassigned Proteins
22
PIRSF classification and curation workflow (same diagram as on slide 21)
23
Tool: Curator’s Decision Maker
24
Classification Tool: BlastClust
Curator-guided clustering
• Single-linkage clustering using BLAST
• Retrieve all proteins sharing a common domain
• Iterative BlastClust (fixed length coverage)
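Single-linkage clustering can be sketched with a union-find structure. This is not the BlastClust program itself: the function `single_linkage` is a hypothetical illustration, and the input pairs stand in for BLAST hits that are assumed to have already passed the chosen similarity and length-coverage thresholds.

```python
def single_linkage(ids, hits):
    """Group ids into clusters where any qualifying hit links two members.

    ids:  iterable of sequence identifiers
    hits: iterable of (id_a, id_b) pairs that passed the thresholds
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in parent:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


# Made-up identifiers: a-b and b-c are linked, so a and c share a cluster
# even though they have no direct hit to each other.
example = single_linkage(
    ["a", "b", "c", "d", "e"],
    [("a", "b"), ("b", "c"), ("d", "e")],
)
print(example)
```

The chaining shown in the example is the defining property of single linkage, and it is the reason a curator needs to review the clusters and merge or split them where necessary.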
25
Family Analysis of Homologous Proteins
1. Fully Curated Protein Family
• Especially important when the protein of interest is underannotated or misannotated (happens often!)
• Evidence types: Characterized (validated), Predicted (by computational methods), or Uncharacterized
2. Preliminary or Uncurated Family
• Have to do some analysis OR contact PIR and ask to prioritize this family
3. No Family Classification
• Have to do some analysis OR contact PIR and ask to prioritize this family
• iProClass search: PIRSF is blank
26
Underannotated Proteins
Search iProClass with PIRSF005965
Providing more information
27
PIRSF SCAN (sequence search)
UniProt sequence Q8Y5X7 is automatically classified as chorismate mutase of the AroH class (PIRSF005965)
Returns only matches to fully curated PIRSFs
28
Taxonomic distribution of PIRSF can be used to infer evolutionary history of the proteins in the PIRSF
PIRSF Family Report: Curated Protein Family Information
Phylogenetic tree and alignment view allows further sequence analysis
29
PIRSF Family Report (II)
Integrated value added information from other databases
Mapping to other protein classification databases
30
Chorismate Mutase Results from iProClass Analysis
• CM from B. subtilis P19080 does not retrieve B. subtilis AroA(G) or E. coli P-protein (or related proteins) in a BLAST search
• Contains a different Pfam domain
• Identical conserved motifs are not found
• Conclusion: NOT homologous
Use PIRSF family database for the same analysis:
• PIRSF reports: abstracts contain most of this info
• PIRSF domain architecture (curated or uncurated): Pfam and newly defined domains
• Structure information (PDB links)
• Hierarchy in DAG (under development)
31
PIRSF Text Search
New domain
AroA(G)
32
Chorismate Mutase
Convergent Evolution – EC 5.4.99.5 (Non-Orthologous Gene Displacement)
Two Distinct Sequence/Structure Types
• AroQ Class: SCOP (all α), core: 6 helices, bundle
• AroH Class: SCOP (α+β), core: beta-alpha-beta-alpha-beta(2)
Two Pfam Domains: PF01817, PF07736 (new Pfam domain)
Structure figures: AroQ, AroH
33
Developing DAG Viewer
Before: all chorismate mutase proteins and families hit PF01817, including PIRSF005965 (not homologous to the rest)
Subfamily
The network structure (DAG) of the PIRSF family classification system reflects the PIRSF family hierarchy, which is based on evolutionary relationships
34
DAG Viewer (II)
After: Pfam created a new domain, PF07736, which is found in PIRSF005965 members
“Orphans”: no family classification
35
PIR Team
Dr. Cathy Wu, Director
Protein Classification team: Dr. Winona Barker; Dr. Lai-Su Yeh; Dr. Anastasia Nikolskaya; Dr. Darren Natale; Dr. Zhang-Zhi Hu; Dr. Raja Mazumder; Dr. CR Vinayaka; Dr. Sona Vasudevan; Dr. Cecilia Arighi
Informatics team: Dr. Hongzhan Huang; Dr. Peter McGarvey; Baris Suzek, M.S.; Sehee Chung, M.S.; Dr. Leslie Arminski; Dr. Hsing-Kuo Hua; Yongxing Chen, M.S.; Jian Zhang, M.S.; Dr. Xin Yuan
Students: Christina Fang; Vincent Hermoso; Natalia Petrova
UniProt is supported by the National Institutes of Health, grant # 1 U01 HG02712-01