ndsa 2013-abrams-integrating-repositories-for-data-sharing
DESCRIPTION
The thorough integration of information technology and resources into scientific workflows has nurtured a new paradigm of data-intensive science. However, far too much research activity still takes place in silos, to the detriment of open scientific inquiry and advancement. Data-intensive science would be facilitated by more universal adoption of good data management practices ensuring the ongoing viability and usability of all legitimate research outputs, including data, and the encouragement of data publication and sharing for reuse. The centerpiece of such data sharing is the digital repository, acting as the foundation for external value-added services supporting and promoting effective data acquisition, publication, discovery, and dissemination. Since a general-purpose curation repository will not be able to offer the same level of specialized user experience provided by disciplinary tools and portals, a layered model built on a stable repository core is an appropriate division of labor, taking best advantage of the relative strengths of the concerned systems. The Merritt repository, operated by the University of California Curation Center (UC3) at the California Digital Library (CDL), functions as a curation core for several data sharing initiatives, including the eScholarship open access publishing platform, the DataONE network, and the Open Context archaeological portal. This presentation will highlight two recent examples of external integration for purposes of research data sharing: DataShare, an open portal for biomedical data at UC, San Francisco; and Research Hub, an Alfresco-based content management system at UC, Berkeley. They both significantly extend Merritt’s coverage of the full research data lifecycle and workflows, both upstream, with augmented capabilities for data description, packaging, and deposit; and downstream, with enhanced domain-specific discovery. These efforts showcase the catalyzing effect that coupled integration of curation repositories and well-known public disciplinary search environments can have on research data sharing and scientific advancement.
TRANSCRIPT
NDSA Digital Preservation 2013
Integrating Repositories for Research Data Sharing
Stephen Abrams, California Digital Library
Angela Rizk-Jackson and Julia Kochi, University of California, San Francisco
Noah Wittman, University of California, Berkeley
Why is data curation important?
Accelerating scientific progress
Enabling appropriate scrutiny and verification of results
Promoting integrity and debate
Facilitating new collaborations
Avoiding needless duplication of effort
Increasingly, complying with institutional policies, publication requirements, and funder mandates
Cf. Whyte and Tedds (2011), “Making the case for research data management,” DCC briefing paper, www.dcc.ac.uk/resources/briefing-papers/making-case-rdm
Merritt
Curation repository available to the UC community and external partners
Preservation and access
Content agnostic, model free
Highly decentralized micro-services architecture
Cf. Abrams, Cruse, Kunze, and Minor (2011), “Curation micro-services: A pipeline metaphor for repositories,” Journal of Digital Information 12(2), journals.tdl.org/jodi/article/view/1605
26 curatorial units
271 collections
325,000 objects
450,000 versions
4,500,000 files
13 TB
www.cdlib.org/uc3/merritt
merritt.cdlib.org
Merritt
[Architecture diagram. Components shown: UI/API, LDAP, load balancers, ingest, inventory, storage broker, storage nodes (SAN, SDSC cloud, ONEShare UNM storage node), fixity, message queue, RDBMS, No-SQL, and user agent. External services shown: EZID, DataCite, IDF, DataONE member node, DataONE coordinating node, Web of Knowledge, and Primo.]
(Some) issues to address
Scale
  Individual objects ranging from 0 to 47,000 files
  Individual files ranging from 0 to 14 GB
Maintaining control
  Concern over potential loss of control over dissemination and use of data
User experience
  Switch from organizational to individual interaction
www.flickr.com/photos/vixon/116447718
www.flickr.com/photos/traftery/4319529821
www.flickr.com/photos/32195273@N05/51076852642
Augment repository function by composition (when possible) and addition (when necessary)
Loosely coupled integration with external community-supported systems and services
Scale
Avoiding client timeout
  ≤ 2 GB: File-based stream-based AIP-to-DIP processing
  > 2 GB: Asynchronous delivery
    Email notification with personalized, time-limited URL
Streamlined storage provisioning
  SDSC cloud
  cloud.sdsc.edu
www.kevatron.co.uk/converting-8-24-bit-samples-in-coreaudio-on-ios
www.flickr.com/photos/paulbhartzog/680749585
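The size-based delivery rule on this slide can be illustrated with a short sketch. It is not Merritt's implementation: the function names, the HMAC signing scheme, the signing key, and the reading of 2 GB as 2 × 1024³ bytes are all assumptions made for illustration. It shows one way to pick streamed versus asynchronous delivery and to issue a personalized, time-limited URL.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: "2 GB" is read as 2 * 1024**3 bytes.
SYNC_LIMIT_BYTES = 2 * 1024**3
# Hypothetical signing key; a real service would load this from secure config.
SECRET_KEY = b"hypothetical-signing-key"


def choose_delivery(size_bytes):
    """Return 'stream' for objects at or under the limit, else 'async'."""
    return "stream" if size_bytes <= SYNC_LIMIT_BYTES else "async"


def _signature(object_id, user, expires):
    message = f"{object_id}|{user}|{expires}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def make_timed_url(base_url, object_id, user, ttl_seconds, now=None):
    """Build a URL tied to one user and one object that expires after ttl_seconds."""
    now = int(time.time()) if now is None else int(now)
    expires = now + ttl_seconds
    query = urlencode({
        "object": object_id,
        "user": user,
        "expires": expires,
        "sig": _signature(object_id, user, expires),
    })
    return f"{base_url}?{query}"


def verify_timed_url(url, now=None):
    """Accept the URL only if it is unexpired and its signature matches."""
    now = int(time.time()) if now is None else int(now)
    params = {k: v[0] for k, v in parse_qs(urlparse(url).query).items()}
    try:
        expires = int(params["expires"])
        expected = _signature(params["object"], params["user"], expires)
        supplied = params["sig"]
    except (KeyError, ValueError):
        return False
    return now <= expires and hmac.compare_digest(expected, supplied)
```

In this sketch, a request for a large object would call `choose_delivery`, and on `"async"` the service would assemble the package in the background and email the result of `make_timed_url` to the requester.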
Control
Data use agreements (DUAs)
Explicit assertion of license requirements and terms of use
Curatorial and consumer notification of acceptance
Cf. Brazhnik and Jones (2007), “Anatomy of data integration,” Journal of Biomedical Informatics 40(3): 252-69, doi:10.1016/j.jbi.2006.09.001
From: [email protected]
Subject: Merritt DUA acceptance
Name: Stephen Abrams
Affiliation: California Digital Library
Collection: UCSF DataShare
Object: Frontotemporal Lobar Degeneration (FTLD)
Date: 2013-05-31 09:50:34 PDT
Terms of use: As part of this agreement, Consumer submits to the following statements:
(1) I will receive access to de-identified data and will not attempt to establish the identity of any of the study subjects.
(2) I will share these data only with my immediate co-workers, and I will not transfer these data to other research groups. I understand that these data are available to other research groups through the process by which I obtain them.
(3) I will require anyone in my group who utilizes these data, or anyone with whom I share these data to comply with this data use agreement
...
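A notification like the one shown above could be assembled roughly as follows. This is a minimal sketch using Python's standard `email.message` module. The `DuaAcceptance` record, the `build_notification` helper, and all example addresses are hypothetical. Only the field labels and the subject line are taken from the slide.

```python
from dataclasses import dataclass, field
from email.message import EmailMessage


@dataclass
class DuaAcceptance:
    """Hypothetical record of one consumer accepting a data use agreement."""
    name: str
    affiliation: str
    collection: str
    object_title: str
    accepted_at: str  # timestamp text exactly as it should appear in the email
    terms: list = field(default_factory=list)


def build_notification(acceptance, sender, curator, consumer):
    """Build one message addressed to both the curator and the consumer."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = f"{curator}, {consumer}"
    msg["Subject"] = "Merritt DUA acceptance"
    lines = [
        f"Name: {acceptance.name}",
        f"Affiliation: {acceptance.affiliation}",
        f"Collection: {acceptance.collection}",
        f"Object: {acceptance.object_title}",
        f"Date: {acceptance.accepted_at}",
        "Terms of use:",
    ]
    # Number the accepted statements the same way the sample email does.
    lines += [f"({i}) {term}" for i, term in enumerate(acceptance.terms, start=1)]
    msg.set_content("\n".join(lines))
    return msg
```

Sending the message (for example with `smtplib`) is left out, since the slide does not describe the delivery mechanism.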
User experience
Due to its open eligibility policy, Merritt will always provide a more generic UX than special-purpose or disciplinary systems
Shifting user roles, shifting expectations
  Institutional → individual researcher
  Behavioral expectations set by the commercial/mobile web
Integration with extant services that better provide the desired UX
  DataShare
  Research Hub
DataShare
“The goal of the DataShare project is to catalyze widespread sharing of scientific research data”
datashare.ucsf.edu
UCSF Clinical and Translational Science Institute ctsi.ucsf.edu
UCSF Library www.library.ucsf.edu
UCSF Center for Imaging of Neurodegenerative Disease www.radiology.ucsf.edu/cind
Architecture
  DataShare submission client (Ruby/Rails)
  Merritt curation repository
  DataShare discovery portal (XTF/Java)
DataShare
Prepare: Best practice advice
Describe: Schema-directed metadata editor (DataCite schema, schema.datacite.org)
Upload: File browse or drag-n-drop
Curate: Manage datasets
Discover: Faceted search and browse
Share: DataONE, DataCite (soon), Primo, Web of Knowledge, SEO
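The Describe step relies on a metadata editor driven by the DataCite schema. As a minimal sketch of what schema-directed checking might involve, the helper below flags missing values for the five properties that DataCite schema versions of that period treated as mandatory (identifier, creator, title, publisher, publication year). Later schema versions add further requirements. The dictionary keys and the `missing_required` helper are illustrative assumptions, not the DataShare editor's actual data model.

```python
# Illustrative key names for the five mandatory DataCite properties.
REQUIRED_FIELDS = ("identifier", "creators", "titles", "publisher", "publication_year")


def missing_required(record):
    """Return the required field names that are absent or empty in the record."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = record.get(name)
        if value is None or value == "" or value == []:
            missing.append(name)
    return missing


# Hypothetical record with a placeholder identifier and no publisher yet.
draft = {
    "identifier": "doi:10.99999/EXAMPLE",
    "creators": ["Example, Researcher"],
    "titles": ["Example dataset"],
    "publisher": "",
    "publication_year": 2013,
}
print(missing_required(draft))
```

An editor built this way could block submission until the list comes back empty, which keeps incomplete descriptions out of the repository.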
Merritt + DataShare
[Architecture diagram: the Merritt components and external services from the earlier Merritt slide, extended with DataShare upload, a collection Atom feed, XTF (xtf.cdlib.org), Lucene, and the DataShare portal.]
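The diagram places a collection Atom feed between Merritt and the XTF/Lucene-based portal, but the slides do not say how the feed is consumed. As a minimal sketch, assuming the portal periodically harvests entries changed since its last run, the code below parses an Atom feed with the standard library and selects recently updated entries. The sample feed, its identifiers, and its URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

ATOM = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed content; a real harvester would fetch this over HTTP.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Hypothetical collection feed</title>
  <id>urn:example:collection</id>
  <updated>2013-05-31T16:50:34Z</updated>
  <entry>
    <id>urn:example:object-1</id>
    <title>Example dataset one</title>
    <updated>2013-05-30T12:00:00Z</updated>
    <link rel="alternate" href="https://repository.example.org/object-1"/>
  </entry>
  <entry>
    <id>urn:example:object-2</id>
    <title>Example dataset two</title>
    <updated>2013-05-31T16:50:34Z</updated>
    <link rel="alternate" href="https://repository.example.org/object-2"/>
  </entry>
</feed>"""


def parse_time(text):
    """Parse the UTC timestamp format used in the sample feed."""
    return datetime.strptime(text, "%Y-%m-%dT%H:%M:%SZ")


def entries_updated_since(feed_xml, since):
    """Return id, title, and URL for entries updated after `since`."""
    root = ET.fromstring(feed_xml)
    results = []
    for entry in root.findall("atom:entry", ATOM):
        updated = parse_time(entry.findtext("atom:updated", namespaces=ATOM))
        if updated <= since:
            continue
        link = entry.find("atom:link[@rel='alternate']", ATOM)
        results.append({
            "id": entry.findtext("atom:id", namespaces=ATOM),
            "title": entry.findtext("atom:title", namespaces=ATOM),
            "url": link.get("href") if link is not None else None,
        })
    return results
```

Each returned record would then be handed to the indexer. Keeping the harvest incremental means the portal only re-indexes objects that changed.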
Research Hub
“Research Hub provides powerful tools for content management and collaboration”
hub.berkeley.edu
Alfresco CMS www.alfresco.com
770 projects, 3,900 users
Personal file management
Project collaboration
Departmental resource pooling
Research data management
Desktop sync, mobile app, Adobe Creative Suite
UC Berkeley Information Services and Technology ist.berkeley.edu
Research Hub
Prepare: Acquire and arrange
Describe: Schema-directed metadata editors
Upload: Direct action; policy-based workflow rules; drag-and-drop
Curate: Manage datasets
Discover: Search / browse
Share: Curatorial invitation
Merritt + DataShare + Research Hub
[Architecture diagram: the Merritt components and external services from the earlier Merritt slide, extended with DataShare upload, a collection Atom feed, XTF (xtf.cdlib.org), Lucene, the DataShare portal, and Research Hub.]
Future integrations
UCTrust/InCommon federation Incommon.org
Open Context archaeological portal opencontext.org
Nuxeo www.nuxeo.com
  UC system-wide DAMS
Islandora islandora.ca
  Fedora Merritt
DPN www.dpn.org
Sharing research through repositories
Conform to institutional policy, publication requirements, and funder mandates
Pro-active curation of valuable research outputs
Stable citation and access
High visibility publication and discovery
Use metrics
Repository layering as an appropriate division of labor
Exploiting existing capabilities already in local use
For more information
Merritt
www.cdlib.org/uc3/
[email protected]
Stephen Abrams, Patricia Cruse, Shirin Faenza, Scott Fisher, Erik Hetzner, Joshua Hubbard, Greg Janée, John Kunze, Rosalie Lack, David Loy, Mark Reyes, Joan Starr, Carly Strasser, Marisa Strong, Bhavitavya Vedula, Kenneth Weiss, Perry Willet
DataShare
datashare.ucsf.edu
Geoffrey Boushey, Anirvan Chatterjee, Maninder Kahlon, Julia Kochi, Angela Rizk-Jackson, Michael Weiner
Research Hub
hub.berkeley.edu
Ian Crew, Michael McCarthy (Tribloom), Patrick McGrath, Noah Wittman
www.slideshare.net/UC3/ndsa-2013abramsintegratingrepositoriesfordatasharing