bourne rdap11 data publication repositories

A Few RDAP Thoughts Based on Experience with

The RCSB Protein Data Bank

www.rcsb.org

Philip E. Bourne, UCSD

[email protected]

3/31/11 RDAP Summit 2011

http://www.slideshare.net/pebourne/rdap-033111

http://www.rcsb.org/

Disclaimer

I am not an expert in institutional repositories

I happen to have helped develop and oversee a resource that I use for my own research


What is the Protein Data Bank (PDB)?

“Stored collective”

“Consistent with scholarly practice”

Clifford Lynch


What is the Protein Data Bank (PDB)?

The single community-owned worldwide repository containing publicly accessible structures of biological macromolecules

A resource used by ~ 200,000 individuals per month

A resource distributing worldwide the equivalent of ¼ of the Library of Congress each month

A bicoastal resource 1TB


[Figure: PDB Total Contents by Year. Axes: Number of released entries, Year]


Why Do We Think We Are Successful?

The numbers of visits and page views are growing faster than the number of unique visitors

Metric of Success - A Research Tool for Influenza

[Figure: Structure Summary page activity for H1N1 influenza-related structures; time axis from Jan. 2008 to Jul. 2010]

1RUZ: 1918 H1 Hemagglutinin

3B7E: Neuraminidase of A/Brevig Mission/1/1918 H1N1 strain in complex with zanamivir

* http://www.cdc.gov/h1n1flu/estimates/April_March_13.htm

Looking Back Over the Past 12 Years – In General

Everything was harder and took longer than we thought

There are a lot of politics associated with data

Emphasis has shifted from archive, to + analytical tool, to + educational tool

Consequently, outreach is our most important yet least understood activity today

Staff needed to change accordingly

Policy has changed as well – some support for non-generic tools

Prorated, our budget has decreased

Looking Back Over the Past 12 Years – Infrastructure

It took about 5 years to achieve and subsequently sustain 99.99% uptime

We have gone through 3 distinct architectural changes

– Object model / Perl CGI

– Object-relational model / Enterprise Java

– Redesign: same model, widget-based UI
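For a sense of scale, the 99.99% uptime figure above can be converted into the downtime it permits. The sketch below does that arithmetic; the function name and the 365-day year are assumptions made for illustration and are not from the talk.

```python
def allowed_downtime_minutes(availability_percent: float, days: float = 365.0) -> float:
    """Return the minutes of downtime permitted over `days` at the given availability."""
    total_minutes = days * 24 * 60
    return total_minutes * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    minutes = allowed_downtime_minutes(99.99)
    # 525,600 minutes in a 365-day year, of which 0.01% may be lost
    print(f"99.99% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions the budget works out to roughly 52.6 minutes per year, which may help explain why reaching and sustaining that level took several years.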


Bluhm et al. 2011 Quality Assurance doi: 10.1093/database/bar003

Looking Back Over the Past 12 Years – Data & Data Management

About 25% of our budget has been spent on data remediation

We support yearly snapshots and versioning

Our ontology/data model has been a critical component of our workflow and data accuracy

The same model is too complex to facilitate wide adoption by others who use our data

Our data are such that we can retain redundant copies

Data objects are discrete and we assign DOIs

We are constantly striving to have the user distinguish raw from derived data


Trends Today

Constant demand for better performance

Use of Web services (SOAP and now RESTful) is increasing

The uptake on the use of widgets has been slower than I hoped

Users are hankering after additional annotations of the data – working on database-literature integration

Mobile use is increasing

Web 2.0 services are in demand


Website Performance Improvements

Back End

– Back-end tuning and use of multilevel caching in the areas of searches, query results, explorer pages and hierarchical views

– Better performance and a more robust and scalable system
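The slides do not describe how the multilevel caching was built. As a minimal sketch of the general idea only, the example below places a small in-process LRU cache (level 1) in front of a larger store (level 2), and runs the expensive query only when both miss. The class name, the `fake_search` loader and the capacity are hypothetical.

```python
from collections import OrderedDict
from typing import Callable, Dict


class TwoLevelCache:
    """Small in-process LRU (level 1) in front of a larger store (level 2)."""

    def __init__(self, loader: Callable[[str], str], l1_capacity: int = 2) -> None:
        self._loader = loader               # expensive call, e.g. a database search
        self._l1: OrderedDict = OrderedDict()
        self._l1_capacity = l1_capacity
        self._l2: Dict[str, str] = {}       # stand-in for a shared or on-disk cache
        self.loader_calls = 0

    def get(self, key: str) -> str:
        if key in self._l1:                 # level-1 hit: refresh recency
            self._l1.move_to_end(key)
            return self._l1[key]
        if key in self._l2:                 # level-2 hit: promote to level 1
            value = self._l2[key]
        else:                               # miss everywhere: run the expensive query
            self.loader_calls += 1
            value = self._loader(key)
            self._l2[key] = value
        self._remember(key, value)
        return value

    def _remember(self, key: str, value: str) -> None:
        self._l1[key] = value
        self._l1.move_to_end(key)
        if len(self._l1) > self._l1_capacity:
            self._l1.popitem(last=False)    # evict the least recently used entry


def fake_search(query: str) -> str:
    """Hypothetical slow back-end query used only for this demonstration."""
    return f"results for {query}"


if __name__ == "__main__":
    cache = TwoLevelCache(fake_search, l1_capacity=2)
    for q in ["hemagglutinin", "neuraminidase", "kinase", "hemagglutinin"]:
        cache.get(q)
    # The repeated query is evicted from level 1 but still served from level 2.
    print("expensive queries run:", cache.loader_calls)
```

In this run four lookups trigger only three expensive queries, because the repeated search is answered by the second level even after it has been evicted from the first.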

Front End

– Cleaner JavaScript and CSS

– Inline Image Data

– Compressed Content (Gzip + Base 64)

– Result: 25% - 40% increase in render performance
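The two front-end techniques named above can be illustrated with the Python standard library. This is a sketch under assumptions, not the RCSB implementation: images are embedded as Base64 data URIs so they need no separate request, and the page body is gzip-compressed before it is sent. The function names and the placeholder image bytes are hypothetical.

```python
import base64
import gzip


def inline_image_tag(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Embed image bytes in an <img> tag as a Base64 data URI."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f'<img src="data:{mime_type};base64,{encoded}">'


def compress_page(html: str) -> bytes:
    """Gzip the page body, as a server might before sending it with Content-Encoding: gzip."""
    return gzip.compress(html.encode("utf-8"))


if __name__ == "__main__":
    # Placeholder bytes, not a real image; a real page would read icon files instead.
    placeholder_image = b"\x89PNG placeholder bytes, not a real image"
    page = "<html><body>" + inline_image_tag(placeholder_image) * 20 + "</body></html>"
    compressed = compress_page(page)
    print(len(page.encode("utf-8")), "bytes before,", len(compressed), "bytes after gzip")
```

Base64 inflates each image by about a third, so inlining trades a somewhat larger page for fewer round trips; gzip then recovers part of that overhead.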

Literature Integration – The Dream

0. Full text of PLoS papers stored in a database

1. A link brings up figures from the paper

2. Clicking the paper figure retrieves data from the PDB, which is analyzed

3. A composite view of journal and database content results

4. The composite view has links to pertinent blocks of literature text and back to the PDB

1. User clicks on content

2. Metadata and webservices to data provide an interactive view that can be annotated

3. Selecting features provides a data/knowledge mashup

4. Analysis leads to new content I can share

[Figure: The Knowledge and Data Cycle, with steps 1 to 4 marked]

PLoS Comp. Biol. 2005 1(3) e34

www.rcsb.org/pdb/explore/literature.do?structureId=1TIM

Example of Interoperability: The Database View

BMC Bioinformatics 2010 11:220
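The literature view URL shown above for 1TIM follows a simple query-string pattern. The sketch below builds the same URL for other structure identifiers. It assumes the `http://` scheme and the four-character identifier format, and it does not check whether the 2011 address still resolves; the function name is hypothetical.

```python
from urllib.parse import urlencode

# Host, path and parameter name are taken from the URL on the slide; the scheme is assumed.
LITERATURE_VIEW = "http://www.rcsb.org/pdb/explore/literature.do"


def literature_view_url(structure_id: str) -> str:
    """Build the literature-view URL for a four-character PDB identifier."""
    pdb_id = structure_id.strip().upper()
    if len(pdb_id) != 4 or not pdb_id.isalnum():
        raise ValueError(f"not a four-character PDB identifier: {structure_id!r}")
    return f"{LITERATURE_VIEW}?{urlencode({'structureId': pdb_id})}"


if __name__ == "__main__":
    # 1TIM is the example on the slide; 1RUZ and 3B7E are the influenza entries cited earlier.
    for pdb_id in ["1tim", "1RUZ", "3B7E"]:
        print(literature_view_url(pdb_id))
```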

Example of Interoperability – The Literature View

From Anita de Waard, Elsevier

Worldwide Protein Data Bank

www.wwpdb.org

Semantic Tagging & Widgets are a Powerful Tool to Integrate Data and Knowledge of that Data, But as Yet Not Used Much

Will Widgets and Semantic Tagging Change Computational Biology? PLoS Comp. Biol. 6(2) e1000673

Semantic Tagging of Database Content in The Literature or Elsewhere

http://www.rcsb.org/pdb/static.do?p=widgets/widgetShowcase.jsp

PLoS Comp. Biol. 6(2) e1000673

PDBMobile

• Fast, low-bandwidth data access

• First version supports iPhone OS

• Future versions will support Android, Blackberry OS6 and others

• HTML 5-based web application

• Client-side database stores data for offline access

• Tight integration with MyPDB

Objective: PDB Data Access On-The-Go

PDBMobile

• Access to saved queries

• Add/delete queries

• Flag interesting entries

• Add personal structure annotations

Tight Integration with MyPDB
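The slides list what the client-side store supports (saved queries, flagged entries, personal annotations) but not its schema. The sketch below models those operations with Python's `sqlite3` as a stand-in for the HTML 5 client-side database; every table, column and function name here is hypothetical.

```python
import sqlite3
from typing import List, Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create the local store; table and column names are hypothetical."""
    conn = sqlite3.connect(path)
    conn.executescript(
        """
        CREATE TABLE IF NOT EXISTS saved_query (
            id INTEGER PRIMARY KEY,
            query_text TEXT NOT NULL UNIQUE
        );
        CREATE TABLE IF NOT EXISTS entry_note (
            pdb_id TEXT PRIMARY KEY,
            flagged INTEGER NOT NULL DEFAULT 0,
            annotation TEXT
        );
        """
    )
    return conn


def add_query(conn: sqlite3.Connection, query_text: str) -> None:
    conn.execute("INSERT OR IGNORE INTO saved_query (query_text) VALUES (?)", (query_text,))


def delete_query(conn: sqlite3.Connection, query_text: str) -> None:
    conn.execute("DELETE FROM saved_query WHERE query_text = ?", (query_text,))


def flag_entry(conn: sqlite3.Connection, pdb_id: str, annotation: Optional[str] = None) -> None:
    """Flag an entry as interesting and store an optional personal annotation."""
    conn.execute(
        "INSERT OR REPLACE INTO entry_note (pdb_id, flagged, annotation) VALUES (?, 1, ?)",
        (pdb_id, annotation),
    )


def saved_queries(conn: sqlite3.Connection) -> List[str]:
    return [row[0] for row in conn.execute("SELECT query_text FROM saved_query ORDER BY id")]


if __name__ == "__main__":
    store = open_store()
    add_query(store, "hemagglutinin")
    add_query(store, "neuraminidase")
    delete_query(store, "hemagglutinin")
    flag_entry(store, "1RUZ", "1918 H1 hemagglutinin")
    print(saved_queries(store))
    print(store.execute("SELECT pdb_id, flagged, annotation FROM entry_note").fetchall())
```

Keeping this state on the device is what allows offline access; synchronizing it with MyPDB would be a separate step that this sketch does not cover.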

Future

New views on the data for subclasses of user

New data deposition system – increase speed and accuracy while reducing costs

New types of analysis

Acknowledgements

Funding Agencies: NSF, NIGMS, DOE, NLM, NCI, NCRR, NIBIB, NINDS, NIDDK

