bourne rdap11 data publication repositories
DESCRIPTION
Phil Bourne, Protein Data Bank; Data Publication Repositories; RDAP11 Summit. The 2nd Research Data Access and Preservation (RDAP) Summit, an ASIS&T Summit, March 31-April 1, 2011, Denver, CO. In cooperation with the Coalition for Networked Information. http://asist.org/Conferences/RDAP11/index.html
TRANSCRIPT
A Few RDAP Thoughts Based on Experience with The RCSB Protein Data Bank
www.rcsb.org
Philip E. Bourne
UCSD
3/31/11 RDAP Summit 2011
http://www.slideshare.net/pebourne/rdap-033111
Disclaimer
I am not an expert in institutional repositories
I happen to have helped develop and oversee a resource that I use for my own research
What is the Protein Data Bank (PDB)?
“Stored collective”
“Consistent with scholarly practice”
Clifford Lynch
What is the Protein Data Bank (PDB)?
• The single community-owned worldwide repository containing structures of publicly accessible biological macromolecules
• A resource used by ~ 200,000 individuals per month
• A resource distributing worldwide the equivalent of ¼ of the Library of Congress each month
A bicoastal resource 1TB
[Figure: PDB Total Contents by Year. x-axis: Year; y-axis: Number of released entries]
Why We Think We Are Successful?
The number of visits and page views is growing faster than the number of unique visitors
Metric of Success - A Research Tool for Influenza
[Figure: Structure Summary page activity for H1N1 Influenza related structures, Jan. 2008 to Jul. 2010. Panels: 1RUZ: 1918 H1 Hemagglutinin; 3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir]
* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm
Looking Back Over the Past 12 Years – In General
• Everything was harder and took longer than we thought
• There are a lot of politics associated with data
• Emphasis has shifted from archive to + analytical tool to + educational tool
• Consequently outreach is our most important yet least understood activity today
• Staff needed to change accordingly
• Policy has changed as well – some support for non-generic tools
• Prorated, our budget has decreased
Looking Back Over the Past 12 Years – Infrastructure
• It took about 5 years to achieve and subsequently sustain 99.99% uptime
• We have gone through 3 distinct architectural changes
– Object model / Perl CGI
– Object-relational model / Enterprise Java
– Redesign, same model / widget-based UI
Bluhm et al. 2011 Quality Assurance doi: 10.1093/database/bar003
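To put the 99.99% uptime figure on this slide in concrete terms, the short sketch below converts an availability fraction into a downtime budget. It is plain arithmetic, not anything taken from the RCSB PDB's own monitoring; the function name and the choice of an average 365.25-day year are assumptions made for illustration.

```python
# Convert an availability target into an allowed-downtime budget.
# Assumption: an "average" year of 365.25 days is used for the yearly figure.
MINUTES_PER_YEAR = 365.25 * 24 * 60


def allowed_downtime_minutes(availability: float,
                             period_minutes: float = MINUTES_PER_YEAR) -> float:
    """Return the downtime budget in minutes for an availability fraction."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    per_year = allowed_downtime_minutes(0.9999)
    print(f"99.99% uptime allows about {per_year:.1f} minutes of downtime per year")
    print(f"or about {per_year / 12:.1f} minutes per month")
```

Under these assumptions the budget comes to a little under an hour of total downtime per year, which gives a sense of why reaching and sustaining it took several years.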
Looking Back Over the Past 12 Years – Data & Data Management
• About 25% of our budget has been spent on data remediation
• Support yearly snapshots and versioning
• Our ontology/data model has been a critical component of our workflow and data accuracy
• The same model is too complex to facilitate wide adoption by others that use our data
• Our data are such that we can retain redundant copies
• Data objects are discrete and we assign DOIs
• Constantly striving to have the user distinguish raw from derived data
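Because each data object is discrete, a DOI can be derived mechanically from an entry's identifier. The sketch below illustrates that idea. The `10.2210/pdb<id>/pdb` pattern is an assumption not stated on the slide, and the function names and the four-character ID check are hypothetical helpers for illustration only.

```python
import re

# Assumption: entry DOIs follow the pattern 10.2210/pdb<id>/pdb.
# The slide only says that DOIs are assigned; it does not give the format.
DOI_TEMPLATE = "10.2210/pdb{pdb_id}/pdb"

# Classic four-character PDB ID: a digit followed by three alphanumerics.
PDB_ID_PATTERN = re.compile(r"[0-9][A-Za-z0-9]{3}")


def pdb_doi(pdb_id: str) -> str:
    """Build a DOI string for a PDB entry under the assumed pattern."""
    pdb_id = pdb_id.strip()
    if not PDB_ID_PATTERN.fullmatch(pdb_id):
        raise ValueError(f"not a four-character PDB ID: {pdb_id!r}")
    return DOI_TEMPLATE.format(pdb_id=pdb_id.lower())


def doi_resolver_url(doi: str) -> str:
    """Return the doi.org resolver URL for a DOI string."""
    return "https://doi.org/" + doi


if __name__ == "__main__":
    # 1RUZ and 3B7E are entry IDs mentioned elsewhere in this talk.
    for entry in ("1RUZ", "3B7E"):
        print(entry, "->", doi_resolver_url(pdb_doi(entry)))
```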
Trends Today
• Constant demand for better performance
• Use of Web services (SOAP and now RESTful) is increasing
• The uptake on the use of widgets has been slower than I hoped
• Users are hankering after additional annotations of the data – working on database-literature integration
• Mobile use is increasing
• Web 2.0 services are in demand
Website Performance Improvements
Back End
– Back-end tuning and use of multilevel caching in the areas of searches, query results, explorer pages and hierarchical views
– Better performance and a more robust and scalable system
Front End
– Cleaner JavaScript and CSS
– Inline Image Data
– Compressed Content (Gzip + Base 64)
– Result: 25% - 40% increase in render performance
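The two front-end techniques named above, inline image data and compressed content, can be illustrated with a few lines of standard-library Python. This is a generic sketch of the techniques, not the RCSB PDB implementation. The helper names and the stand-in image bytes are hypothetical, and the sizes it prints say nothing about the 25% - 40% figure quoted on the slide.

```python
import base64
import gzip


def to_data_uri(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode image bytes as a Base64 data URI that can be inlined in HTML or CSS."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def gzip_text(text: str) -> bytes:
    """Gzip-compress a text payload, as a web server would before sending it."""
    return gzip.compress(text.encode("utf-8"))


if __name__ == "__main__":
    # Stand-in bytes only; a real page would read an actual image file.
    fake_icon = bytes(range(16))
    uri = to_data_uri(fake_icon)

    # Inlining avoids one HTTP request per image; gzip then shrinks the
    # repetitive Base64 and markup text before it goes over the wire.
    html = f'<img src="{uri}" alt="icon">\n' * 50
    compressed = gzip_text(html)
    print(len(html.encode("utf-8")), "bytes before gzip")
    print(len(compressed), "bytes after gzip")
```

Inlining trades extra bytes (Base64 inflates binary data by roughly a third) for fewer round trips, which is why it is usually paired with compression.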
Literature Integration – The Dream
1. User clicks on content
2. Metadata and web services to data provide an interactive view that can be annotated
3. Selecting features provides a data/knowledge mashup
4. Analysis leads to new content I can share
The Knowledge and Data Cycle
0. Full text of PLoS papers stored in a database
1. A link brings up figures from the paper
2. Clicking the paper figure retrieves data from the PDB which is analyzed
3. A composite view of journal and database content results
4. The composite view has links to pertinent blocks of literature text and back to the PDB
PLoS Comp. Biol. 2005 1(3) e34
www.rcsb.org/pdb/explore/literature.do?structureId=1TIM
Example of Interoperability: The Database View
BMC Bioinformatics 2010 11:220
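The literature view address shown with this slide takes the structure ID as a query parameter, so the link for any entry can be built programmatically. The sketch below does only that. The path and the `structureId` parameter are copied from the slide, while the `http://` scheme, the function name and the upper-casing of the ID are assumptions, and the URL may no longer resolve.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Path as shown on the slide; the http:// scheme is an assumption.
LITERATURE_VIEW = "http://www.rcsb.org/pdb/explore/literature.do"


def literature_view_url(structure_id: str) -> str:
    """Build the literature-view URL for a PDB structure ID."""
    query = urlencode({"structureId": structure_id.strip().upper()})
    return f"{LITERATURE_VIEW}?{query}"


def structure_id_from_url(url: str) -> str:
    """Recover the structure ID from a literature-view URL."""
    return parse_qs(urlsplit(url).query)["structureId"][0]


if __name__ == "__main__":
    url = literature_view_url("1tim")
    print(url)
    print(structure_id_from_url(url))
```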
Example of Interoperability – The Literature View
From Anita de Waard, Elsevier
Worldwide Protein Data Bank
www.wwpdb.org
Semantic Tagging & Widgets are a Powerful Tool to Integrate Data and Knowledge of that Data, But as Yet Not Used Much
Will Widgets and Semantic Tagging Change Computational Biology? PLoS Comp. Biol. 6(2) e1000673
Semantic Tagging of Database Content in The Literature or Elsewhere
http://www.rcsb.org/pdb/static.do?p=widgets/widgetShowcase.jsp
PLoS Comp. Biol. 6(2) e1000673
Semantic Tagging
PDBMobile
Objective: PDB Data Access On-The-Go
• Fast, low bandwidth data access
• First version supports iPhone OS
• Future versions will support Android, Blackberry OS6 and others.
• HTML 5-based web application
• Client-side database stores data for offline-access
• Tight integration with MyPDB
PDBMobile
Tight Integration with MyPDB
• Access to saved queries
• Add/delete queries
• Flag interesting entries
• Add personal structure annotations
Future
New views on the data for subclasses of user
New data deposition system – increase speed and accuracy while reducing costs
New types of analysis
Acknowledgements
Funding Agencies: NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS, NIDDK