Status report from the TWG/CCIT to the CEWG 2009-08-12 Dave Vieglais and Ryan Scherle


Page 1: Status report from the TWG/CCIT to the CEWG

Status report from the TWG/CCIT to the CEWG

2009-08-12

Dave Vieglais and Ryan Scherle

Page 2: Status report from the TWG/CCIT to the CEWG

TWG Overview

• Activity

• Two Meetings

• Weekly (or so) telecon

• Time contributions by Duane and Mark

• Significant outcomes thus far

• Project infrastructure (plone sites, svn)

• Use cases, interactions, requirements

• Discussions, especially Identifiers and Identity

• Student projects

Page 3: Status report from the TWG/CCIT to the CEWG

Architecture

• Process (Meta-architecture > Conceptual A. > Logical A.)

• Use Cases

• Functional requirements

• Interfaces and interactions

• Prototyping

• Core pieces

• Fluff

• Iterative process (somewhat)

• Identify and resolve issues at all stages

• Limited resources - so important to get design right

Page 4: Status report from the TWG/CCIT to the CEWG

Use Cases

• Identified major categories and obvious use cases early

• Subsequently expanded to 34 or so

• Diagrams developed to illustrate interactions for each use case

• Capture desired system functional requirements

• APIs identified, getting more stable

[Diagram: API categories: Network Preservation; Federated Identity; Object Management; Discovery and Use. APIs shown: Heartbeat, GUID, Replication, Identity Provider, Authentication, Create, Read, Query, Authorization, Data use policy, Update, Delete, Logging, Notification, Health, Capacity, Validation, Migration, Workflow, Ontology, Provenance]

Page 5: Status report from the TWG/CCIT to the CEWG
Page 6: Status report from the TWG/CCIT to the CEWG
Page 7: Status report from the TWG/CCIT to the CEWG

Use Case Issues

• UC2 "Get list of GUIDs from metadata search"

• Can queries be done at MN with equivalent results?

• Where is result filtering based on access privileges performed?

• Authentication issue: if a search spans many nodes, where is identity resolved?

• UC3 "Registration of a new member node"

• Should new nodes be registered with specified trust levels?

Page 8: Status report from the TWG/CCIT to the CEWG

Use Case Issues (2)

• UC4,5 "Create/Update/Delete metadata record in Member Node."

• What is the policy on archival copies of data and metadata? (Can data packages be deleted? Published packages modified?)

• UC12 "User Authentication - Person via client software authenticates against Identity Provider to establish session token."

• Where is identity stored? MN? CN? Combination of all?

Page 9: Status report from the TWG/CCIT to the CEWG

Use Case Issues (3)

• UC24 "Transactions - CNs and MNs should support transaction sets where operations all complete successfully or get rolled back (e.g., upload both data and metadata records)."

• Do transactions span multiple MNs, CNs?

• UC27 "CN should support forward migration of metadata documents from one version to another within a standard and to other standards."

• 20+ metadata standards

• How to handle lossy conversions?
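
The all-or-nothing behaviour described in UC24 can be read as a list of operations, each paired with an undo step. The following is a minimal sketch under that assumption; `TransactionSet`, `add`, and `run` are hypothetical names rather than DataONE APIs, and the sketch covers a single node only, so it does not answer the multi-node question.

```python
class TransactionSet:
    """Runs operations in order; undoes the completed ones if any step fails."""

    def __init__(self):
        self._steps = []  # list of (do, undo) callables

    def add(self, do, undo):
        self._steps.append((do, undo))

    def run(self):
        done = []
        try:
            for do, undo in self._steps:
                do()
                done.append(undo)
        except Exception:
            # Roll back in reverse order, then report failure.
            for undo in reversed(done):
                undo()
            return False
        return True


store = {}  # stand-in for a Member Node's object store (assumption)


def fail_metadata_upload():
    raise IOError("metadata upload failed")


tx = TransactionSet()
tx.add(lambda: store.update(data="bytes"), lambda: store.pop("data", None))
tx.add(fail_metadata_upload, lambda: store.pop("metadata", None))
committed = tx.run()  # the data upload is undone because the metadata upload failed
```

Here the data record is written first and then removed again when the metadata step raises, so the store ends up unchanged.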

Page 10: Status report from the TWG/CCIT to the CEWG

Use Case Issues (4)

• UC28 "Relationships/Versioning - Derived products should be linked to source objects so that notifications can be made to users of derived products when source products change."

• Who asserts these relationships? How are relationships managed?

• UC31 "Manage Access Policies - Client can specify access restrictions for their data and metadata objects. Also supports release time embargoes."

• Group management has an important, perhaps unusual temporal component.
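
The temporal component of UC31 can be illustrated with an access check that takes the current time as an input. This is a sketch only: `AccessPolicy` and `can_read` are hypothetical names, and two rules are assumptions not stated in the use case, namely that the owner can always read and that an embargo hides the object from everyone else until the release time.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional, Set


@dataclass
class AccessPolicy:
    owner: str
    readers: Set[str] = field(default_factory=set)
    embargo_until: Optional[datetime] = None  # None means no embargo


def can_read(policy, user, now):
    """Decide read access at a given moment (rules are assumptions, see lead-in)."""
    if user == policy.owner:
        return True
    if policy.embargo_until is not None and now < policy.embargo_until:
        return False
    return user in policy.readers


policy = AccessPolicy(owner="alice", readers={"bob"},
                      embargo_until=datetime(2010, 1, 1))
```

The same user and the same policy can give different answers on different dates, which is why the check cannot be cached without regard to time.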

Page 11: Status report from the TWG/CCIT to the CEWG

Coordinating Node Requirements

• CNs play a central role in the infrastructure ∴ critical to identify functional and non-functional requirements early

• Non-exhaustive list of 21 requirements (so far)

• e.g.:

• “Coordinating Node services should be designed to be independently scalable.”

• “Data packages are not discoverable through any public interface until all Coordinating Nodes have confirmed that they have a copy of the corresponding metadata document.”

• “Metadata searches should return in a maximum of “xxx” seconds.”
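
The discoverability requirement quoted above amounts to a set comparison: a package is exposed only once the set of confirming Coordinating Nodes covers all of them. A minimal sketch, with `is_discoverable` and the node names as hypothetical placeholders:

```python
def is_discoverable(confirmed_by, coordinating_nodes):
    """True only when every Coordinating Node has confirmed its metadata copy."""
    return set(coordinating_nodes) <= set(confirmed_by)


cns = ["cn1", "cn2", "cn3"]  # hypothetical node names
partly_replicated = is_discoverable({"cn1", "cn2"}, cns)
fully_replicated = is_discoverable({"cn1", "cn2", "cn3"}, cns)
```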

Page 12: Status report from the TWG/CCIT to the CEWG

General conclusions

The member nodes come with a diverse set of technologies and practices. The coordinating nodes will need to be very permissive while providing quality services.

History/versioning:

• Keep all versions of metadata, so we can see where it came from (and metadata doesn't take much storage)

• The original data package should always be stored. Transformed versions may be needed for some operations of the coordinating nodes.

• It may be too much of a burden to store all versions of a data file.
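
Keeping all versions of metadata suggests an append-only store in which an update never overwrites an earlier document. The sketch below assumes in-memory storage and 1-based version numbers; `MetadataHistory`, `put`, and `get` are hypothetical names.

```python
class MetadataHistory:
    """Append-only store: every version of a metadata document is retained."""

    def __init__(self):
        self._versions = {}  # identifier -> list of documents, oldest first

    def put(self, identifier, document):
        versions = self._versions.setdefault(identifier, [])
        versions.append(document)
        return len(versions)  # 1-based version number of the new document

    def get(self, identifier, version=None):
        versions = self._versions[identifier]
        # Latest version by default, otherwise the requested one.
        return versions[-1] if version is None else versions[version - 1]


history = MetadataHistory()
first = history.put("example:123", "<metadata>draft</metadata>")
second = history.put("example:123", "<metadata>revised</metadata>")
```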

Page 13: Status report from the TWG/CCIT to the CEWG

Identity, Authentication, Authorization

• MN & CN security services necessary to

• preserve and verify integrity of data packages (in D1)

• prevent malicious intent or inappropriate access

• Six identity / security models in industry:

• Centralized (LDAP)

• Distributed directories (LDAP + referrals)

• Distributed management and replication (LDAP + replication)

• Grid Security Infrastructure proxy certificates

• OpenID

• Shibboleth + InCommon

Page 14: Status report from the TWG/CCIT to the CEWG

Identities

Types of users:

• non-authenticated user

• registered user (at member node)

• registered user (DataONE central)

• group member

• site manager (for harvests, system operations, etc.)

• change request approval workflow

• owner of intellectual property rights

Privileges:

• access/modify both data and metadata

• Member Node Write

• create/execute system functions

• access logged information

Page 15: Status report from the TWG/CCIT to the CEWG

Metadata Standards

• 20 or so relevant standards

DC, DwC, EML, CSDGM, GCMD-DIF, ISO 19137:2007, NeXML, WaterML, Genbank-FFF, ISO 19115, GML, CDF, DDI, GEML, ESML, CSR, ESG, ECHO, ...

• Conversion between standards is a lossy process

• Issues of compatibility in metadata storage across MNs

• Original metadata will be stored unchanged

• Need to define metadata standard that will be used to support search and discovery operations (CN)
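
Why conversion is lossy, and why the original must be kept, can be shown with a crosswalk into a small discovery record that reports which source fields it could not map. This is an illustrative sketch: the field names and the `CROSSWALK` table are made up for the example and are not taken from any of the standards listed above.

```python
# Hypothetical mapping from source field names to a small discovery schema.
CROSSWALK = {
    "title": "title",
    "creator": "author",
    "coverage": "spatial_extent",
}


def to_discovery_record(source):
    """Return (discovery record, sorted list of source fields that were dropped)."""
    record, dropped = {}, []
    for name, value in source.items():
        target = CROSSWALK.get(name)
        if target is None:
            dropped.append(name)  # no equivalent field: this information is lost
        else:
            record[target] = value
    return record, sorted(dropped)


source = {
    "title": "Stream temperature",
    "creator": "A. Researcher",
    "samplingProtocol": "hourly logger",
}
record, dropped = to_discovery_record(source)
```

The dropped list makes the loss explicit, so a Coordinating Node could index the discovery record while still pointing back to the unchanged original.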

Page 16: Status report from the TWG/CCIT to the CEWG

Search Terms

Page 17: Status report from the TWG/CCIT to the CEWG

Identifiers

• Fundamental component of entire architecture

• Many schemes (handle, LSID, PURL, ...), each with advantages and faults

• Not practical for DataONE to dictate single identifier scheme across all Member Nodes

• Feasible to require that identifiers are unique across all participating MNs

• However, not feasible to assume that all MNs will support all identifier schemes

• Key question: Must an identifier always resolve to the same sequence of bits? Or should it be more abstract?
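
Requiring uniqueness without dictating a scheme can be sketched as a registry that treats identifiers as opaque strings and rejects duplicates across Member Nodes. `IdentifierRegistry` and its methods are hypothetical names, and the sketch does not take a position on the key question of whether an identifier must always resolve to the same bits.

```python
class IdentifierRegistry:
    """Treats identifiers as opaque strings that must be unique across all nodes."""

    def __init__(self):
        self._owner = {}      # identifier -> member node that registered it
        self._locations = {}  # identifier -> nodes holding a copy

    def register(self, identifier, member_node):
        if identifier in self._owner:
            raise ValueError(
                f"{identifier!r} already registered by {self._owner[identifier]}"
            )
        self._owner[identifier] = member_node
        self._locations[identifier] = [member_node]

    def add_replica(self, identifier, member_node):
        self._locations[identifier].append(member_node)

    def resolve(self, identifier):
        return list(self._locations[identifier])


registry = IdentifierRegistry()
registry.register("example:123", "mn-a")   # scheme prefix is not interpreted
registry.add_replica("example:123", "mn-b")
locations = registry.resolve("example:123")
```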

Page 18: Status report from the TWG/CCIT to the CEWG

Prototypes

By November 2009 meeting (hmm...):

• Member Node contributes metadata to Coordinating Node using GUID

• CN initiates replication of data object from MN to MN

• Logging for instrumentation and usage

• Update data object (revision) by Member Node

Other targets, in order of importance:

• Replication of metadata and system information between CNs

• Failover and load balancing between CNs

• Formalize all service API specs using a language-agnostic IDL

• Comparison and evaluation of existing systems/standards/protocols used by prototype implementations

• Authentication and authorization using LDAP (initial impl.)

• Search portal user interface using Coordinating Node metadata content

• Heartbeat/state of health services

• Registry services using, perhaps, a simple list as an initial method

• Stress and load testing

Page 19: Status report from the TWG/CCIT to the CEWG

Current activities

• Wrapping up this year’s student internships.

• Addressing the general questions arising out of the use case diagrams (some of these questions will be discussed at the coordination meeting)

• Developing a report on identifier usage.

• Creating APIs to be used in prototypes.

Page 20: Status report from the TWG/CCIT to the CEWG

Hurdles

• Resources & Contributors

• Identity, authentication, authorization

• Identifiers

• Rules for data handling and archive (what is data?)

• Metadata extraction

• CN replication

Page 21: Status report from the TWG/CCIT to the CEWG

Feedback from CEWG

What is the vision for access management to DataONE, and how much of that will be left up to member nodes?

• Answer: Data providers must "establish trust" to publish/modify content.

• What does "establish trust" entail? Is there a technical component?

• Who are “data providers”? The member nodes or the end users?

Page 22: Status report from the TWG/CCIT to the CEWG

Open Questions

• What policies should we have for managing DataONE documents?

• What properties should we enforce regarding identifiers?

• What are the minimum requirements for a member node to join the DataONE community? Or, how accommodating should we be?

• Can we identify some member nodes that will implement all best practices and serve as models for the other member nodes?

• How much data should we expect to handle? It is unclear what the uptake curve will be, but this has major implications for our architectural planning.

• Do we want/need a registry of name spaces for identifiers?

• Is it reasonable to store replicas using the ID scheme of the secondary member node, as long as the coordinating nodes are capable of resolving the original identifier to the correct location?

• What types of access control should be allowed?

• What time constraints are we under?

Page 23: Status report from the TWG/CCIT to the CEWG

Open Questions (2)

• Can the CEWG produce some science-oriented use cases that augment our current technical cases?

• Will member nodes be willing to use central DataONE services and/or create adapters that allow their services to communicate with DataONE?

• Are there technologies that are widely used across the member node community? If so, these would be promising targets, as we could create a small number of adapters that could be used for a large number of member nodes.

• What are the high-value member nodes, for which we must provide custom adapters?