
Data Sharing (DaSh) Programme

Tinman – 31st October 2011


SECTION A – CONTEXT AND PURPOSE

The RDSI project was established through the SuperScience investment from the Education

Infrastructure Fund (EIF) in the 2009 Federal Budget, and is managed through the Department of

Innovation, Industry, Science and Research (DIISR). The detailed objectives, expected outcomes and

process to achieve these are described in the RDSI Project Plan, available from the RDSI website (footnote 1).

Quoting:

The expected benefits of RDSI are to:

• improve the availability of quality research data for sharing and re-use and, as a result, expand the

scale and scope of problems that Australian researchers may seek to address;

• improve research efficiency; and

• reduce institutional data storage costs and enable more extensive collaboration.

The infrastructure may also assist institutions to:

• sustain a quality of research in the digital age that includes the reproducibility of results;

• meet the storage requirements of key research activities undertaken at that institution; and

• comply with the research data provisions of Universities Australia’s Australian Code for the

Responsible Conduct of Research.

The RDSI project is delivered through four key programmes which are jointly coordinated, depend on

each other, but are delivered through different and complementary approaches.

• The Node Development (NoDe) programme will establish a small number of physical sites around

Australia to provide baseline storage and access services to the research sector.

• The Data Sharing (DaSh) programme will develop the technical architecture for inter-node and

node-user data movement, access management and sharing functionality for the sector.

• The Research Data Services (ReDS) programme will support the development of larger collections

of value, their infrastructure requirement at nodes and their association with collaboration and

analysis facilities.

• The Vendor Panel (VePa) programme provides the public research sector with a set of preferred

commercial suppliers for the delivery of storage infrastructure and services, leveraging the

economies of scale of both the sector and the RDSI investment.

The intention of the RDSI project is to foster the development of an enduring and sustainable

infrastructure, on a cost-effective basis, well beyond the lifetime of the project itself.

This document addresses the DaSh programme, providing a broad outline of the programme itself, its

requirements, expectations and deliverables. As a “tinman” model it is not intended to be a final

position, but to provide suggestions on points of further discussion and encourage feedback. It

follows the earlier strawman workshop. It will form the basis for the final model of the DaSh

programme. There will be sector-wide consultation on this tinman model.

1 http://rdsi.uq.edu.au/


SECTION B – SUMMARY OF THE DaSh PROGRAMME

1. Goals of the DaSh programme

The DaSh Programme will build capability to support the sharing and re-use of research data and, as a

result, is aimed at expanding the scale and scope of problems that Australian researchers may seek to

address. In order to identify what high performance data sharing and data movement services are

needed by the sector, consultations with relevant research sector stakeholders, combined with an

evaluation of existing services, will be undertaken during implementation of the Project.

2. DaSh programme Themes

The DaSh Programme will consist of ten themes as follows:

DaShNet – The network connecting nodes to users and to each other

Federated Authorisation and Service registration – Upgrading the AAF for authorisation

ReDS Application Processing – Automation of application workflows for the ReDS programme

RDSI DaShBoard – A system to automatically collect and publish Node and Collections metrics

RDSI Data Fabric – Providing a common access to collections and working storage for researchers

RDSI File Systems – Establishing File System(s) across nodes with a consistent namespace

RDSI Data Mover – Providing fast data movement between, into and out of nodes

RDSI StoreGate – A gateway to external public storage

RDSI DaShLab – An environment to support testing of implementation and changes of RDSI elements

RDSI Portal – AAF Integrated access to RDSI elements/services with appropriate entitlement

3. Consultation and governance within the DaSh programme

The DaSh Programme will establish a Technical Advisory Committee (TAC) to provide advice to the

project on elements of the technical architecture. The TAC will consist of staff from the RDSI project,

including the Project Director, Project Manager and DaSh Technical Architect together with a

representative from each confirmed Node.

A Technical Reference Group (TRG) will also be established to provide early comments on DaSh

designs and proposals. Membership of the TRG will be open.

4. Development Principles for the DaSh programme

Where possible, the project will seek to acquire or re-use software before considering development. If

development is required, the RDSI project team will call for expressions of interest from Nodes to

undertake the development. Where practical, there will be calls for expressions of interest from Nodes

to host RDSI services, if hosting is required.


SECTION C – DISCUSSION OF THEMES IN THE DaSh PROGRAMME

This section will discuss the proposed themes in the DaSh programme together with considerations of

implementation.

DaShNet – The network connecting nodes to users and to each other

Proposition

Interconnecting RDSI nodes to each other with the highest available bandwidth will improve the

availability of services across RDSI by supporting replication between nodes. Replication will allow the

delivery of higher availability services than could be provided by individual data centres, which are

often at Uptime Institute Tier 2 status. Resilience through widely distributed replication is more cost

effective than upgrading data centres. These interconnections will also support rapid data movement

between nodes. As the use of RDSI nodes increases, there is a potential for congestion at the access

point for the node. This can be alleviated by providing high bandwidth dedicated access for each

node.
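
The availability argument in this proposition can be illustrated with a short calculation. The sketch below assumes that outages at different nodes are independent and that a replicated service is available whenever at least one replica is reachable. Neither assumption is stated in this document, and correlated failures (for example a shared network fault) would reduce the benefit. The single-site availability used in the example is a hypothetical input, not a measured figure for any data centre.

```python
def replicated_availability(site_availability: float, replicas: int) -> float:
    """Availability of a service replicated across independent sites."""
    if not 0.0 <= site_availability <= 1.0:
        raise ValueError("site_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only when every replica is down at the same time.
    return 1.0 - (1.0 - site_availability) ** replicas


if __name__ == "__main__":
    # 0.997 is a hypothetical single-site availability used for illustration.
    for n in (1, 2, 3):
        print(n, f"{replicated_availability(0.997, n):.9f}")
```

Under these assumptions each additional replica multiplies the expected unavailability by the single-site unavailability, which is the basis for the claim that replication can be more cost effective than upgrading an individual data centre.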

Discussion

The goals of DaShNet will be to establish:

(i) A set of interconnections between primary nodes using the fastest available wavelengths

across the AARNet backbone

(ii) A network access connection for each primary node to support dedicated access to the

node. This will also use the fastest available wavelengths across the AARNet backbone

(iii) Appropriate network connections to each additional node

It is anticipated that the majority of funding for this theme will be for network equipment and

wavelength implementation costs. The initial expectation is that there will be a single source of

network equipment for this theme.

Implementation Considerations

DaShNet will be a project proposed by RDSI to the National Research Network (NRN) project which

will look to use the upgraded AARNet backbone. There will, therefore, be early discussion between

RDSI, NRN and AARNet.

Federated Authorisation and Service registration – Upgrading the AAF for authorisation

Proposition

RDSI will require that users must be able to use the Australian Access Federation (AAF) for

authentication unless there is an agreed exception. However, the mechanisms for granting

authorisation to use a resource, such as a collection, would have to be implemented independently by

resource managers unless there is a federated approach to authorisation. Implementation of an

“Entitlements Service” to support such an approach will benefit users and managers of RDSI

infrastructure and collections by eliminating duplication and providing a consistent approach to

authorisation. The AAF is the logical home for such an entitlements service.


Discussion

An entitlement service would be a logical extension to the AAF’s existing authorisation service and

would have wide benefits in providing a consistent approach for a number of eResearch projects,

including RDSI and NeCTAR. An entitlements service could be either developed specifically as a direct

enhancement of the AAF or one of a small number of commercially available entitlement systems

could be licensed and integrated with the AAF. This will be a crucial service for other developments in

RDSI and in other projects; an early delivery of at least the interface specifications will therefore be

essential.

A directory holding authorisation, registration and other service information will be required to

support RDSI services and the design of the directory will depend on the choice of solution for the

entitlement service. A part of such a directory might be implemented in an RDSI portal.


Implementation Considerations

Early discussion will be undertaken between RDSI, AAF and NeCTAR to determine the most

appropriate design.

ReDS Application Processing – Automation of application workflows for the ReDS programme

Proposition

There will be a range of applications for allocation of space under the ReDS programme which vary in

complexity and size. There will be advantages in both timeliness and workload if the process is

automated to the greatest possible extent. This automation will also provide additional benefits by

providing a point for automatically capturing and storing the parameters of a collection for use by the

RDSI measurement and monitoring processes.

Discussion

The design of this application will be determined by developments in the ReDS programme which will

establish agreed levels of delegation and automation and by the requirements of the RDSI DaShBoard

which will determine the data to be captured for monitoring and measurement. The application may

involve either bespoke development or licensing a commercial product for modification and

integration. It will need to integrate with the RDSI portal and will influence the definition of metrics

about collections to be used by the RDSI DaShBoard.


Implementation Considerations

After an initial market survey of available commercial offerings, the RDSI project team will develop a

specification for ReDS Application Processing, taking into account the requirements of other DaSh

themes. Expressions of interest will then be sought from confirmed RDSI nodes to develop, integrate

and host the application as appropriate. It is anticipated that ReDS Application Processing will be

integrated with the RDSI Portal and that potential users will access it through the portal.


RDSI DaShBoard – A system to automatically collect and publish Node and Collections metrics

Proposition

The project plan has described the process of establishing trust in the collection of RDSI nodes by

openly and transparently publishing performance against agreed service levels and metrics for both

nodes and collections. The RDSI DaShBoard will automate the process of collecting and publishing this

data to support the production of timely information with low levels of manual intervention.

Discussion

Metrics for the monitoring of nodes will be jointly developed with the Node Development programme

and metrics for monitoring collections will be jointly developed with the ReDS programme. A common

protocol for the transmission of monitoring data to the RDSI DaShBoard must also be developed. It is

anticipated that the DaShBoard would be a part of the RDSI Portal and would collect and display

information from all RDSI Nodes automatically and on a regular basis. This information would relate

to both nodes and collections. The DaShBoard could be developed specifically for RDSI or commercial

software could be licensed and integrated with the RDSI Portal.


Implementation Considerations

After an initial market survey of available commercial offerings, a specification for the RDSI

DaShBoard will be developed and expressions of interest would then be sought from confirmed RDSI

nodes to develop, integrate and host the application as appropriate.

RDSI Data Fabric – Providing a common access to collections and working storage for researchers

Proposition

As described in the RDSI Project Plan, one of the objectives for the DaSh programme is to provide a

consistent interface for researchers to collections and it is understood that this may be only one of a

number of interfaces depending on the nature and uses of the collection. At the same time, it is

helpful for researchers to also have access to some easily accessible storage, through the same

collaborative interface, to support their access to, use and development of, collections. This storage

must support easy collaboration between researchers. The RDSI Data Fabric will be the means of

achieving these objectives.

Discussion

The ARCS Data Fabric successfully provides a consistent interface to collaborative storage using iRODS

middleware, and iRODS has significant functionality to also support a consistent interface to

distributed collections at RDSI nodes and elsewhere. The ARCS Data Fabric implements its own

arrangements for entitlements and uses other ARCS functionality to establish service registration

through an LDAP directory which also supports the additional identity credentials needed for

WebDAV access to the Data Fabric.

Whilst there is a goal of migrating functionality from the ARCS Data Fabric to the RDSI Data Fabric, this

does not necessarily imply that the technology solution will be the same and there may well be

benefits in ensuring that an RDSI Data Fabric integrates with and uses the entitlements service


described earlier. Furthermore a number of commercial solutions have emerged over the last year

with some having tight coupling with an entitlements service.

After an initial investigation, and potentially testing, of existing open source solutions and

commercially available products, the RDSI Project team will work with iVEC, who are the existing

development and support group for the ARCS Data Fabric, to develop an appropriate specification

which will then be the subject of consultation with the sector. In the event that a commercially

available product is chosen, joint work will be undertaken with the Vendor Panel (VePa) programme

in relation to establishing a panel for procurement.


Implementation Considerations

After the development of an initial specification, the RDSI Data Fabric would be established by a call

for expressions of interest from confirmed RDSI nodes, for one node to undertake any development

or integration and three nodes to host it. The developing node might also be one of the three hosting

nodes.

RDSI File Systems – Establishing File System(s) across nodes with a consistent namespace

Proposition

For some applications, a distributed file system providing a consistent namespace within and between

nodes may be required to provide increased levels of durability. In addition, a file system with

enhanced levels of security may be necessary if nodes are to host data collections with higher levels

of confidentiality.

Discussion

The RDSI File Systems theme will work with the RDSI Research Data Managers, the ReDS Programme

Manager and other stakeholders to develop appropriate use cases, whilst the DaSh Technical

Architect will identify, and where feasible, test different options. These may include open source file

systems or commercially available file systems that could be licensed by the sector. In the latter case,

the VePa programme will be leveraged to establish an appropriate panel of vendors. The use cases

and technical options will be discussed with confirmed nodes and other interested stakeholders

before developing a requirements specification.


Implementation Considerations

Once a requirements specification has been developed, the RDSI project team will discuss

implementation options with confirmed nodes.

RDSI Data Mover – Providing fast data movement between, into and out of nodes

Proposition

Researcher-accessible tools to efficiently move data between nodes, into nodes and out of nodes will

be of benefit to users of RDSI services.


Discussion

As the size of data sets scales up to hundreds of terabytes and potentially petabytes, existing tools to

ingest data into nodes, extract data from nodes or move it between nodes are severely challenged. In

particular, for larger data movements, a third party transfer service is needed so that a researcher can

submit a transfer and then continue to use their own computing resources whilst waiting for

notification of completion. For efficiency, the process needs to be user driven and substantially

automated.
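
The submit-then-notify pattern described in this discussion can be sketched as follows. This is a minimal illustration and not a proposed design: the `TransferService` class and its methods are hypothetical names, a local file copy stands in for real inter-node data movement, and the notification is a simple callback rather than an email or portal message.

```python
import shutil
import tempfile
from concurrent.futures import Future, ThreadPoolExecutor
from pathlib import Path
from typing import Callable


class TransferService:
    """Hypothetical stand-in for a third party transfer service."""

    def __init__(self, workers: int = 2) -> None:
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def submit(self, source: Path, destination: Path,
               notify: Callable[[str], None]) -> Future:
        """Queue a transfer and return immediately with a handle to it."""
        return self._pool.submit(self._run, source, destination, notify)

    @staticmethod
    def _run(source: Path, destination: Path,
             notify: Callable[[str], None]) -> int:
        # A local file copy stands in for the real data movement.
        shutil.copyfile(source, destination)
        size = destination.stat().st_size
        # Notify the submitter once the transfer has finished.
        notify(f"transfer complete: {source.name} ({size} bytes)")
        return size

    def shutdown(self) -> None:
        self._pool.shutdown(wait=True)


def demo() -> list:
    """Submit one transfer and collect the completion notification."""
    messages = []
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "dataset.bin"
        dst = Path(tmp) / "dataset.copy"
        src.write_bytes(b"x" * 1024)
        service = TransferService()
        handle = service.submit(src, dst, messages.append)
        # The researcher's own machine is free to do other work here.
        handle.result()  # block only when the outcome is actually needed
        service.shutdown()
    return messages
```

The point of the sketch is that `submit` returns at once, so the researcher's own computing resources are not tied up for the duration of the transfer. A real service would also need to handle authentication, retries and partial transfers, none of which is shown here.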


Implementation Considerations

An investigation, and potentially testing, of existing open source solutions and the small number of

commercially available products will be undertaken and the results published. After consultation with

nodes and other interested parties, a specification for development, or for licensing and integration

of a commercially available product, will be developed.

RDSI StoreGate – A gateway to external public storage

Proposition

Researchers will benefit from streamlined access to one or more external public storage clouds both

as a means of storing appropriate research data in the cloud and for accessing relevant services in

public storage clouds. One potential use could be for additional copies of data that are stored at RDSI

nodes; however there are a number of use cases. A particular benefit in using external public cloud

storage is that it is an “on demand” service which can meet short term needs with a fast provisioning

time (often in minutes), little upper limit on capacity and an ability to pay only for the time that the

storage is actually needed. Potential users of external public storage will still need to pay for such

storage; the proposition is that it will be faster and cheaper to access and that it may reduce

proliferation in the use of small pools of external storage each with their own identity credentials.

Discussion

The RDSI project, owing to the nature of its funding, cannot fund public cloud storage. However, to

facilitate use of public storage for research purposes it can develop a gateway for connection to a

number of external public storage cloud providers. Use of such external storage encounters three

principal difficulties: performance, proliferation and cost. Performance issues arise from accessing

external storage over the public internet rather than taking advantage of the dedicated high

bandwidth available across the Australian Research and Education Network (AREN). RDSI StoreGate

would seek to address this issue by attempting to facilitate peering of a number of external public

storage providers with the AREN.

There is anecdotal evidence of significant existing use of external public storage providers for research

data. An example would be the use of Dropbox which stores its data in Amazon’s storage service. The

proliferation in the use of individual Dropbox accounts which do not integrate with other services in

the sector, such as the AAF, forms a barrier to collaboration. RDSI StoreGate will investigate options

to improve integration.

The cost of using external public storage often breaks down into three components: a network traffic

charge; a cost for moving data into and out of the external storage; and a cost of the storage itself.

The first of these could be eliminated by peering a number of storage providers with the AREN as


described earlier. The second of these is a function of location and the content delivery networks

used by external storage providers. It may also be improved by peering but it would greatly benefit

from the ability to access Australian based providers. Both this and the third element of cost (the

storage itself) are susceptible to price reduction through demand aggregation. By working with the

RDSI Vendor Panel (VePa) programme to create a panel of external storage providers, RDSI StoreGate

is intended to reduce costs through the aggregation of demand for external public storage.
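
The three cost components discussed above, and the effect of peering and demand aggregation on them, can be expressed as a simple model. The sketch below is illustrative only: the `Pricing` structure, the single discount rate applied to both the data movement and storage charges, and all prices used with it are hypothetical and do not reflect any provider's tariff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-unit prices in arbitrary currency units (all hypothetical)."""
    network_per_tb: float        # network traffic charge
    transfer_per_tb: float       # moving data into and out of the storage
    storage_per_tb_month: float  # the storage itself


def total_cost(p: Pricing, moved_tb: float, stored_tb: float,
               months: float) -> float:
    """Sum the three cost components for a given usage pattern."""
    network = p.network_per_tb * moved_tb
    transfer = p.transfer_per_tb * moved_tb
    storage = p.storage_per_tb_month * stored_tb * months
    return network + transfer + storage


def with_peering_and_aggregation(p: Pricing, discount: float) -> Pricing:
    """Peering removes the network charge; aggregation discounts the rest."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return Pricing(
        network_per_tb=0.0,
        transfer_per_tb=p.transfer_per_tb * (1.0 - discount),
        storage_per_tb_month=p.storage_per_tb_month * (1.0 - discount),
    )
```

Comparing `total_cost` before and after `with_peering_and_aggregation` for a planned usage pattern gives a rough indication of where savings would come from, subject to the assumptions above.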


Implementation Considerations

Internet2 recently announced its Net+ services, which include some form of aggregated access to

the external storage providers Box.net and HP in the United States. The RDSI project team will review

available providers in Australia and work closely with AARNet, Internet2 and others in developing a

specification for RDSI StoreGate and will also work closely with the VePa programme to construct a

panel of external public storage providers. Implementation options will be developed after these

stages.

RDSI DaShLab – An environment to support testing of implementation and changes of RDSI elements

Proposition

The DaSh Technical Architect, together with Technical Architects from each of the nodes, will benefit

from the ability to test implementations of, and changes to, the RDSI Technical Architecture.

Discussion

The RDSI Nodes and the network between them present a unique environment which cannot easily

be replicated by any individual node or institution. Successful implementation of infrastructure and

applications will be dependent on an ability to undertake meaningful testing. By establishing a test

environment or testbed which spans a number of nodes, it will be possible to support the testing of

infrastructure and applications in a realistic environment. DaShLab will be the test environment

spanning the Nodes.


Implementation Considerations

After the development of an initial specification, DaShLab would be established by a call for

expressions of interest from confirmed RDSI nodes, with a target minimum of 2 nodes and no

maximum number. The DaSh programme would fund infrastructure at the nodes to facilitate the

development of DaShLab.

RDSI Portal – AAF Integrated access to RDSI elements/services with appropriate entitlement

Proposition

An RDSI Portal will be an effective means of integrating access to all RDSI services including those

described within other DaSh programme themes. It may also be effective in acting as an integration

point with other eResearch project services.

Discussion


The design of the RDSI Portal will be strongly dependent on developments within the other RDSI

themes with which it must integrate. It must clearly be integrated with the AAF and with the

Entitlements Service described earlier. Depending on the design of the Entitlements Service, it may be

necessary for the RDSI Portal to hold directory information about service or resource registration.

The portal may involve either bespoke development or licensing a commercial product for

modification and integration.


Implementation Considerations

After an initial market survey of available commercial offerings, the RDSI project team will develop a

specification for the RDSI Portal, taking into account the requirements of other DaSh themes.

Expressions of interest will then be sought from confirmed RDSI nodes to develop, integrate and host

the application as appropriate.

SECTION D – IN-DEPTH DISCUSSION OF DaSh PROGRAMME ELEMENTS

This section explores underlying components of the DaSh programme in depth. It is presented to

underpin, extend and enhance the discussion on DaSh programme themes, which have been

described earlier in summary form. The topics discussed in this section are generally applicable to

more than one theme and it is not, therefore, intended that there should be a one to one

correspondence between these topics and the themes.

1. Identity, Authentication and Authorization within RDSI

The Australian Higher Education and Research sectors, like those of many other countries, have a SAML v2 based

trust federation called the Australian Access Federation (AAF). This technology allows university staff,

students and researchers to access applications using the credentials issued to them by their

institutions. By later proving possession of and control over these credentials during some act of

authentication at the institution's Identity Provider (IdP), the binding between the end-user and its

digital identity is also proven at some level of assurance.

At a simpler level, institutions manufacture the digital identity of staff, students and researchers

within their institution, based on information within their systems-of-record like HR and SIS systems,

using some form of identity and access management process tailored to that institution. The end-

user's digital identity is composed of all the relevant attributes that may potentially be used to

provide access to a resource. The AAF provides a mechanism, based on the SAML v2 specification, to

assert some components of an end-user's digital identity and transport them securely to a service

provider (SP) so that the resource owner can make an informed authorization decision to allow an

end-user access to that resource. No matter what resource an end-user wishes to access, the decision is

always based on the end-user's digital identity. Effectively, an end-user's digital identity is a constant

across all the SPs in the federation.

An end-user's digital identity can in some cases be supplemented by other Identity Providers outside

the province of the end-user's institution. While this allows for a more expressive attribute economy

for authorization it does create some policy and technical issues. Ideally an institution's IdP should

only assert attributes within its province and identity process. Asserting attributes outside one's


province diminishes the level of assurance of those attributes. Secondly there is an issue of scale at

the institutions themselves. As an example consider an attribute whose presence in a SAML assertion

informs a SP that a group of researchers from several institutions can access a resource. Using only

institutional IdPs to achieve this, an attribute must be present in the digital identity of every member

of this group. Coordination at this level across a potentially large number of institutions and people is

somewhat erratic. However, if this attribute were asserted by a single non-institutional IdP, the scaling

problem would be minimized.

The AAF is in the process of creating a National Entitlement Service which will allow principal

investigators and people in similar roles to create entitlements linked to end-users which can be

asserted to a SP in addition to the institution's assertion and used by the SP for fine-tuned

authorization. RDSI will leverage this service in many of its web-browser-based applications. Node

operators should also follow RDSI's lead and where appropriate use the AAF to authenticate to Node-

based service providers.
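
How a service provider might combine an institutional assertion with an entitlement asserted by a separate service can be sketched as below. This is an assumption-laden illustration rather than the AAF's design: attributes are modelled as plain dictionaries instead of parsed SAML assertions, the use of the eduPerson attribute name `eduPersonEntitlement` for this purpose is assumed, and the entitlement value shown is a made-up example URN.

```python
from typing import Dict, Iterable, Set

# Attribute name -> set of asserted values (a simplification of SAML attributes).
Attributes = Dict[str, Set[str]]


def merge_assertions(assertions: Iterable[Attributes]) -> Attributes:
    """Combine attributes asserted by several sources about one end-user."""
    merged: Attributes = {}
    for assertion in assertions:
        for name, values in assertion.items():
            merged.setdefault(name, set()).update(values)
    return merged


def is_authorised(attributes: Attributes, required_entitlement: str) -> bool:
    """Grant access only if the required entitlement has been asserted."""
    return required_entitlement in attributes.get("eduPersonEntitlement", set())


if __name__ == "__main__":
    # Asserted by the end-user's own institution.
    institutional = {"eduPersonAffiliation": {"staff"}}
    # Asserted by a single non-institutional source (hypothetical value).
    entitlement = {"eduPersonEntitlement": {"urn:example:rdsi:collection:demo:read"}}
    identity = merge_assertions([institutional, entitlement])
    print(is_authorised(identity, "urn:example:rdsi:collection:demo:read"))
```

Because the entitlement comes from one source, a cross-institutional group can be granted access without each member's institution having to add the same attribute to its own records.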

In fact it is one of the prime principles of the RDSI project to directly attempt to use AAF's federated

identity to access both web-browser-based applications and non-web-browser-based applications.

However, the typical SAML v2 authentication profiles used in the AAF do not work well for

applications that are not based on the web browser metaphor, which entails the use of HTTP Cookies

and Redirects. Examples of some of these applications in the RDSI's circle of interest are:

• WebDAV

• i-commands for iRODS

• XMPP/Jabber. (XMPP is one of the potential protocols for the management and control of

cloud resources.)

• SSH.

• Mounting and accessing file systems.

• Accessing databases.

• MyProxy (which is a service for issuing X.509 certificates for Grid computing).

Luckily there are initiatives already in play to develop the ability to use a federated credential to

access these applications and services. For example the iPlant Collaborative

<http://www.iplantcollaborative.org> is using work based on Project Moonshot to use federated

access to authenticate and use iRODS i-commands on the command line. (The Project Moonshot work

connects GSS with RADIUS and EAP to achieve this). There is also work to use the SAML v2 Enhanced

Client or Proxy profile to achieve a similar result. RDSI and the AAF will work together to advance

these innovative authentication and authorization initiatives for use within RDSI and its sister

projects. Unfortunately these emerging technologies are still a bit rough around the edges and may not be

available for production use within RDSI. Until these technologies become more mature one will need

to use contemporary solutions to some of these access issues.

Additionally the movement of data has been a significant component of Grid computing for some

time and many applications have been developed to provide these services using the Globus Toolkit.

These Grid services typically use GSI (Grid Security Infrastructure) to provide authentication and

authorization using X.509 certificates. While the Globus tools are, in some people’s eyes, overly


complicated, it would be a mistake to ignore an existing production infrastructure that does do the

job. For this reason GSI will be a significant component of authentication and authorization in RDSI

Nodes.

2. Identity, Authentication and Authorization within Data Storage Systems

The underlying concept in the previous section, that an end-user's digital identity is constant across all

service providers, is a powerful one. But how does one provide an analogous concept of an end-user's

digital identity being constant across all data storage systems, both within RDSI Nodes, RDSI's sister

projects and other programs?

One way of achieving this goal is to synchronize all participating data storage systems to use a

common identity layer. One such layer could be implemented using an LDAP Directory service,

common across all participating data storage systems. This in concert with the Pluggable

Authentication Modules (PAM) mechanism, which is standard in almost all Unix-like systems, can

provide such an identity layer. We will also concentrate on the Portable Operating System Interface

for Unix (POSIX) series of standards as again this covers most UNIX systems as well as Microsoft

Windows systems if the Microsoft Windows Services for UNIX (SFU) component is installed. It should

be noted that POSIX/Unix semantics are different to the Microsoft Windows/NFS semantics but with

SFU installed the core identity layer should be consistent over both UNIX and Windows.

POSIX systems link a user to a numeric ID, called the UID, and link a collection of users to a group

which is identified by a numeric ID called the GID. In this POSIX representation the user names and

group names are only there as crutches for the “wetware” that uses these systems. It is the numeric

values of the UID and GID that matter in file system operations. Access to files or directories is

based on the UID, the GID, the permissions of the file (which are stored in the inode of the file or

directory) and the credentials used to prove the identity of the user. An LDAP Directory or Active Directory

server can store these mappings of users to UIDs and collections of users to GIDs, as well as the credentials

used to authenticate the end-user. Other useful information can also be stored using LDAP

schemas like RFC 2307.
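
The way numeric IDs and permission bits drive an access decision can be sketched as follows. This is a deliberately simplified model of the POSIX read check: it ignores the superuser, access control lists and directory traversal, and the function name `may_read` is introduced here for illustration only.

```python
import stat
from typing import Iterable


def may_read(uid: int, gids: Iterable[int],
             file_uid: int, file_gid: int, mode: int) -> bool:
    """Simplified POSIX read check using only numeric IDs and mode bits."""
    if uid == file_uid:
        # The owner class applies, even if group or other bits are more permissive.
        return bool(mode & stat.S_IRUSR)
    if file_gid in set(gids):
        # The caller belongs to the file's group, so the group class applies.
        return bool(mode & stat.S_IRGRP)
    # Everyone else falls into the "other" class.
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    # A file owned by UID 12345 and GID 67890 with mode rw-r-----.
    print(may_read(12345, [67890], 12345, 67890, 0o640))  # owner
    print(may_read(222, [67890], 12345, 67890, 0o640))    # group member
    print(may_read(222, [1], 12345, 67890, 0o640))        # unrelated user
```

Note that user and group names never appear in the check, which is why two file systems that assign different numbers to the same person cannot share data without some form of mapping.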

This Data Storage Identity Layer (DSIL) will provide a consistent user/UID and group/GID namespace

which can be plugged in to both remote and local file systems using the likes of PAM and the DSIL

LDAP server, etc so that remote and local file systems share the same semantics of a particular UID or

GID without any remapping. Administrators of the local or remote file systems that use these

mappings can provision user accounts as long as they do not degrade the semantics of the mappings.

For instance if the user Bob has a UID/GID of 12345/67890 as defined in the DSIL LDAP directory, any

account provisioned for Bob on any participating file system must have a username of Bob, a UID of

numeric value 12345 and a default GID of numeric value 67890. Additional attributes related to

provisioning of accounts like the home directory, the GECOS field, the preferred shell, etc are in the

domain of the local or remote administrators. As OpenLDAP is likely to be chosen as the DSIL LDAP

service a local administrator may host a local leaf-node LDAP replica using the OpenLDAP Translucent

Proxy and rewrite the non-mandatory attributed on the fly.
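
As a hedged illustration of what "not degrading the semantics" could mean in practice, the following Python sketch compares a locally provisioned account with the directory record for Bob. The Account class and the mapping_violations function are hypothetical names introduced for this example; only the username, UID and default GID are treated as mandatory, while the home directory and shell are left to the local administrator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Account:
    username: str
    uid: int
    gid: int
    home: str = ""   # locally controlled, not checked
    shell: str = ""  # locally controlled, not checked


def mapping_violations(directory_entry, local_account):
    """List the mandatory attributes on which a local account disagrees with the directory."""
    problems = []
    for field in ("username", "uid", "gid"):
        expected = getattr(directory_entry, field)
        actual = getattr(local_account, field)
        if expected != actual:
            problems.append(f"{field}: expected {expected!r}, found {actual!r}")
    return problems


if __name__ == "__main__":
    dsil_bob = Account("Bob", 12345, 67890)
    good = Account("Bob", 12345, 67890, home="/export/home/bob", shell="/bin/zsh")
    bad = Account("Bob", 501, 67890)
    print(mapping_violations(dsil_bob, good))
    print(mapping_violations(dsil_bob, bad))
```

An account that differs only in its home directory or shell passes the check, which matches the division of responsibility described above.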

An interface within the RDSI portal will also be provided for those who do not want to use the DSIL service

and instead provide their own UID/GID mappings. This interface will allow such users to design their

own mapping options which they will use when mounting a remote file system onto their local file

system. In both these cases it is typically the root user that mounts these remote file systems. There

must be a certain level of trust in this act amongst all parties.

It should be noted that without this intelligent design provided by DSIL, the sharing of data through a

remote file system is made more difficult as individual UIDs and GIDs may have to be remapped from

a remote file system's UIDs/GIDs to the local file system's UIDs/GIDs so as to receive the full benefit of

the remote file system. Taking this into account, and the fact that there are many such file systems in the

Australian Higher Education and Research sectors, using UID/GID remapping is an unscalable and

piecemeal solution.

Additionally there are issues concerning the confidentiality and integrity of the data exported from or

imported to remote file systems. File systems like NFSv4.1 provide GSS-API mechanisms to ensure

the confidentiality and integrity of the data without the likes of TLS, GRE or SSH tunnelling; however

many file systems do not. More on this topic will be discussed in later sections.

The central component of DSIL is a well replicated LDAP service using various typical LDAP schemas

including the likes of RFC 2307 and 2377. The directory needs to be populated with identity

information from both end-users' IdPs and various Attribute Authorities that may have additional

identity information outside the scope of their institution. A prototypical workflow of a person

named Bob wishing to register with DSIL is as follows:

(i) Bob uses his credentials issued by his institution to access the RDSI portal using the AAF

infrastructure for the first time. Bob's email address, surname, given name and other

appropriate attributes are asserted in the SAML payload to the RDSI DSIL Registration portal.

As this is Bob's first visit to the portal, it requires Bob to nominate two unique usernames.

The first is an 8 character username based on the original POSIX standard. This will provide a

compatibility level over all POSIX based systems where required. The second username will be

a long username, potentially up to 256 characters.

Both of these accounts will be linked. While the DNs of the two LDAP entries are different, the

important attributes like the uidNumber, gidNumber, etc. will be provisioned uniquely and

replicated to both accounts.

(It should also be noted that most modern systems use unsigned 32-bit integers to store UIDs

and GIDs. This potentially provides the DSIL directory service a maximum of approximately 4 billion accounts

and also 4 billion groups. To ensure that system accounts don't collide with DSIL, end user

accounts will start with UIDs and GIDs of 1,000,000.)

(ii) Bob must also provide a new password for this account. Passwords will be stored in a

Kerberos v5 KDC (Key Distribution Centre) which will impose strong passwords and strong

hashes (like AES) as defined by RDSI policy. Kerberos v5 pre-authentication will also be

enabled to reduce the risk of compromised hashes. This password will also be subject to a password

aging regime as per RDSI policy. Nearing the end of the aging period Bob will receive a series of emails

prompting him to access the RDSI portal password management system to restart the aging

process. Ignoring these emails will trigger the archiving of Bob's account.

(iii) Bob at this time (or any other time) can upload and manage his SSH public keys. A patched

version of SSH, namely OpenSSH-LPK, provides an easy way of centralizing strong user

authentication by using an LDAP server for retrieving public keys instead of

~/.ssh/authorized_keys. This allows the de-provisioning of a user's SSH access at one point.

(iv) Bob at a previous time has accessed the AAF National Entitlement Service where he has

defined and managed a set of Australian and New Zealand Standard Research Classification

codes which represent the research disciplines that Bob is interested in. This information,

coupled with similar codes related to research data, will allow RDSI to track in a broad sense

how researchers use research collections and allow RDSI to tune the use of RDSI and Nodes

over the project. As part of the SAML workflow the AAF National Entitlement Service

Attribute Authority will be queried to add these ANZSRC codes to the SAML assertion.

(v) Bob at this time (or any other time) can request to be a group coordinator. A new unique GID

will be provided to Bob and he will be able to invite other users to register with the DSIL

services (if needed) and join his group. At this time the group only exists as an entry in the

DSIL directory. To provision this group for access control, a local or remote administrator must

change the GID of the file or directory to the new GID.

Bob can also define other group coordinators who will have the same rights as Bob within

that group.

(vi) Bob can only manage the password of his DSIL account after a successful federated

authentication to the RDSI portal password management system. System Administrators should

be very reluctant to change Bob's password. This will ensure that DSIL credentials

maintain an appropriate level of authentication assurance.

Through end-users registering with the DSIL service, RDSI will organically grow a database of identity

information that will span both web-browser-based applications and data storage systems.
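
The username and numeric ID provisioning in step (i) of the workflow above can be sketched in Python as follows. This is only an illustrative sketch: the Registry class, the permitted character set for short usernames and the use of independent UID and GID counters are assumptions, not part of the DSIL design. The starting value of 1,000,000 and the 8 and 256 character limits are taken from the text.

```python
import itertools
import re

FIRST_DSIL_ID = 1_000_000  # starting UID/GID stated in the workflow
SHORT_NAME = re.compile(r"[a-z_][a-z0-9_-]{0,7}")  # assumed character set, at most 8 characters
LONG_NAME_LIMIT = 256


class Registry:
    """In-memory stand-in for the DSIL directory (hypothetical)."""

    def __init__(self):
        self._next_uid = itertools.count(FIRST_DSIL_ID)
        self._next_gid = itertools.count(FIRST_DSIL_ID)
        self._by_name = {}

    def register(self, short_name, long_name):
        """Link a short and a long username to one uidNumber/gidNumber pair."""
        if not SHORT_NAME.fullmatch(short_name):
            raise ValueError("short username must be 1 to 8 permitted characters")
        if not 0 < len(long_name) <= LONG_NAME_LIMIT:
            raise ValueError("long username must be 1 to 256 characters")
        if short_name == long_name:
            raise ValueError("the two usernames must differ")
        if short_name in self._by_name or long_name in self._by_name:
            raise ValueError("username already taken")
        record = {"uidNumber": next(self._next_uid), "gidNumber": next(self._next_gid)}
        # Both names resolve to the same numeric identity.
        self._by_name[short_name] = record
        self._by_name[long_name] = record
        return record

    def lookup(self, name):
        return self._by_name[name]


if __name__ == "__main__":
    registry = Registry()
    print(registry.register("bobuser", "bob.user.research"))
    print(registry.lookup("bob.user.research"))
```

Validation happens before any number is drawn from the counters, so a rejected registration does not consume a UID or GID.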

3. RDSI Portal

The RDSI portal is one of the major web applications within the RDSI ecosystem. It will provide several

federated services which will be of use to the end users of RDSI and potentially other sister projects.

These services will be detailed below.

DSIL Registration Portal

The purpose of the DSIL Registration portal is to extend the digital identity of an end-user into the

realm of data storage by adding attributes that describe the end-user in a file system. Once an end-

user has authenticated to the portal an LDAP entry is created in the DSIL directory. Such a directory

entry might look like:

dn: uid=bobuser,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au

objectclass: top

objectclass: person

objectclass: organizationalPerson

objectclass: inetOrgPerson

objectclass: posixAccount

objectclass: ldapPublicKey

description: Bob User's Account

userPassword: {KERBEROS}[email protected]

cn: Bob User

sn: User

givenname: Bob

mail: [email protected]

manager: uid=bobsboss,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au

uid: bobuser

uidNumber: 1234500

gidNumber: 6789000

homeDirectory: /home/bobuser

sshPublicKey: ssh-dss AAAAB3...

sshPublicKey: ssh-dss AAAAM5...

At this stage the entry represents only a potential account. A system administrator of a file system that

participates in the use of the DSIL service must provision the account as they see fit, as long as they

do not degrade the semantics of the user/UID and group/GID mappings.
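
To make the provisioning step more concrete, the following Python sketch turns a subset of the example entry into a passwd(5)-style line. It is a simplified illustration only: the parser handles plain "name: value" lines and not the full LDIF grammar (no base64 values or folded lines), the function names are hypothetical, and the default shell is an assumption standing in for whatever the local administrator chooses.

```python
from collections import defaultdict

# A subset of the example entry above; password and mail attributes are omitted.
ENTRY = """\
dn: uid=bobuser,ou=users,dc=dsil,dc=rdsi,dc=edu,dc=au
objectclass: posixAccount
cn: Bob User
uid: bobuser
uidNumber: 1234500
gidNumber: 6789000
homeDirectory: /home/bobuser
"""


def parse_entry(text):
    """Collect attribute values by lower-cased attribute name."""
    attrs = defaultdict(list)
    for line in text.splitlines():
        if not line.strip():
            continue
        name, _, value = line.partition(":")
        attrs[name.strip().lower()].append(value.strip())
    return dict(attrs)


def passwd_line(attrs, shell="/bin/sh"):
    """Build a passwd(5) line; the shell is chosen locally (assumed default)."""
    return ":".join([
        attrs["uid"][0],
        "x",  # password placeholder: credentials live elsewhere
        attrs["uidnumber"][0],
        attrs["gidnumber"][0],
        attrs["cn"][0],
        attrs["homedirectory"][0],
        shell,
    ])


if __name__ == "__main__":
    print(passwd_line(parse_entry(ENTRY)))
```

The username, UID and GID fields come straight from the directory, while the shell argument shows where a local administrator keeps control.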

Account Management Portal

End-users will need to manage their own DSIL credentials in-line with the RDSI password policy. This

portal will ensure that DSIL credentials are sufficiently strong over the period of the password aging

process. Previous passwords will not be appropriate to use for new passwords and will be rejected.

The DSIL service will initially attempt to achieve a level of authentication assurance similar to the NIST

800-63/Liberty Identity Assurance Framework standard at level 2. Levels of identity assurance are

asserted by the end-user's IdP and transported to the DSIL service in the payload of a SAML assertion

where it will be encoded within the eduPersonAssurance attribute.

As there will be password aging associated with their DSIL credentials end-users might lose access to

the DSIL service because they have ignored the series of emails of impending doom of their account.

Moreover there are situations where end-users just fall off the map. End-users should nominate an

email address of a colleague or similar trusted person into this portal so that if a person cannot

respond to an email of doom another may respond so as to fend off the account archiving process.

Group Management Portal

Group management is a crucial component of any research activity. Once you have proven your

identity to a service provider or relying party some authorisation process kicks in to determine if you

have the rights to access it. This is true from the large scale of the Large Hadron Collider to the much

smaller scale of a simple file system containing protected data. Authorization in a file system is

typically achieved using groups. If a user is in the right group and the permissions of a file or directory

allow access to that group, the user can access the file or directory.
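
A minimal Python sketch of this group-based decision is given below. It follows the classic POSIX rule that exactly one permission class applies (owner, then group, then other) and deliberately ignores the superuser, ACLs and supplementary mechanisms. The function name may_access and the example IDs are hypothetical.

```python
import stat

# Permission bits for each request type, in (owner, group, other) order.
_BITS = {
    "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
    "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
    "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
}


def may_access(uid, gids, owner_uid, owner_gid, mode, want):
    """Decide access for a non-root user from numeric IDs and mode bits only."""
    owner_bit, group_bit, other_bit = _BITS[want]
    if uid == owner_uid:
        return bool(mode & owner_bit)   # owner class applies, nothing else is consulted
    if owner_gid in gids:
        return bool(mode & group_bit)   # group class applies
    return bool(mode & other_bit)       # everyone else


if __name__ == "__main__":
    # A file owned by UID 1234500 and GID 1000000 with mode rw-r-----.
    print(may_access(1234501, {1000000}, 1234500, 1000000, 0o640, "r"))
    print(may_access(1234502, {42}, 1234500, 1000000, 0o640, "r"))
```

Note that no usernames or group names appear anywhere in the decision, which is why the numeric mappings must agree across every participating file system.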

As described above, a group coordinator defines the need of a particular group to provide access

control to a resource. These groups and their members must be added to the DSIL directory and await the

provisioning of this group to a resource (i.e. file or directory) by a local or remote administrator.

However there are many additional ways to provide authorization information so as to create a more

vivid palette of mechanisms than just file system groups, especially as the DSIL directory will

contain both local and institutionally sourced attributes. One means to manage all this

authorization data within the DSIL directory may be to provide an instance of Internet2's Grouper

Groups Management Toolkit v2.0. A directory entry for a group might look like:

dn: cn=bobsgroup,ou=groups,dc=dsil,dc=rdsi,dc=edu,dc=au

objectclass: top

objectclass: posixGroup

description: Bobs Group

cn: bobsgroup

gidNumber: 1000000

memberUid: bobuser

memberUid: eveuser

RDSI Attribute Authority

A (SAML v2) Attribute Authority is an effective way of having your authorization cake and eating it as

well. When an institutional IdP asserts a set of attributes to an SP, it should only assert information

that is within the scope of the institution. However there may be other sources of authorization

information pertaining to the authenticated end-user of the institutional IdP which may provide extra

information to an SP. Mashing these sets of authorization data together provides a richer palette of

authorization possibilities.

An Attribute Authority (AA) provides a secondary source of attributes to be asserted with the payload

of the institutional IdP. An AA is somewhat like a lobotomized IdP and is usually backed by an LDAP

server; in this case the DSIL directory. The management of the AA attributes can be provided by

applications like the Internet2 Grouper Groups Management Toolkit v2.0, as described above, and will

allow delegated individuals to access and manage various authorization data. When programs

like Project Moonshot reach a certain level of production quality the RDSI AA will be ready to

provide direct authentication and authorization using institutional credentials.

ReDs Portal

The ReDs portal, a component of the RDSI portal, allows collection owners and data curators to

submit their data for merit-based ReDs funding so as to offset the cost of storing their data at the

various RDSI nodes. Using the collection owner's or data curator's federated identity, the portal will

allow them to upload sufficient information so that the RDSI Resource Allocation Panel can assess the

merit of the submission. The information required for the submission is detailed in the ReDs program.

Collection owners and data curators will be able to track the progression of their ReDs bid through the

portal. All formal communications between ReDs bidders and the RDSI Resource Allocation

Panel will also be tracked.

Monitoring and Analytics

As in any business it is important to maintain a constant vigil of the metrics that describe the health of

the business so as to maximize its profits. In a similar way the RDSI project also needs to keep a close

eye on the metrics that describe its health. The RDSI ecosystem consists of many entities such as

potential and successful Node bidders, collection owners and data custodians, potential and

successful ReDs bidders and of course the end-user researchers as well. All these entities need to

have sufficient information so as to make their component of RDSI a success and therefore the whole

project a success.

RDSI will ensure that these metrics are monitored and provided as openly as possible to all.

My Node Portal

As stated above the RDSI ReDs program provides funding for the storage of significant data

collections. Once a collection owner or data curator has been successful in their ReDs bid they have to

store the collection in one of the RDSI Nodes. The choice of which node is of course up to the successful

bidder.

A conscientious bidder would need to take in many facts concerning the way a particular node

functions as a business or how their collections would suit a node that specializes around a set of

disciplines. This information is typically somewhat elusive. To aid successful ReDs

bidders in making an informed choice, RDSI will ensure that sufficient information is available to them.

RDSI Nodes must supply up-to-date detailed information and metrics concerning their operations.

This information will be displayed on the My Node portal, a component of the RDSI portal.

All RDSI nodes will be required to regularly collect various selections of information concerning all the

facets of a node's operations. This data will be transferred to the My Node portal and displayed in an

intuitive manner.

4. RDSI Analytics

It is of considerable importance for RDSI itself to monitor how researchers of various disciplines

interact with data sets produced by various disciplines. While collecting this information at the level of an individual

researcher would be intrusive and a risk to researchers’ privacy, at a discipline level it can

provide information that will enhance the success of the RDSI project.

Relating the Australian and New Zealand Standard Research Classification (ANZSRC) codes of

researchers to the same codes associated with the data sets as metadata should provide de-identified

data that will help the RDSI project to measure its success.
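
One way such discipline-level figures might be derived is sketched below in Python. The function name, the shape of the access log and the code values are all illustrative assumptions; the sketch simply shows that once each access is reduced to a pair of codes, the individual researcher no longer appears in the result.

```python
from collections import Counter


def discipline_usage(events, researcher_codes, dataset_codes):
    """Count accesses per (researcher discipline, data set discipline) pair.

    events: iterable of (user, dataset) access records
    researcher_codes: mapping of user -> list of classification codes
    dataset_codes: mapping of dataset -> list of classification codes
    """
    counts = Counter()
    for user, dataset in events:
        for researcher_code in researcher_codes.get(user, ()):
            for dataset_code in dataset_codes.get(dataset, ()):
                counts[(researcher_code, dataset_code)] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical codes and records, for illustration only.
    researcher_codes = {"bobuser": ["0201"], "eveuser": ["0601"]}
    dataset_codes = {"sky-survey": ["0201"], "genome-set": ["0601", "0604"]}
    events = [
        ("bobuser", "sky-survey"),
        ("bobuser", "genome-set"),
        ("eveuser", "genome-set"),
    ]
    for pair, count in sorted(discipline_usage(events, researcher_codes, dataset_codes).items()):
        print(pair, count)
```

Very small counts could still identify an individual in a small discipline, so a real report would probably need a minimum threshold before publishing a pair.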

Knowledge Management

The sharing of knowledge is an important process in research. Without this sharing the efficiency of

research endeavours would be much curtailed and researchers would spend significant time re-

inventing the wheel. The RDSI portal will provide a wiki so as to allow researchers to share the tricks

of their trades; data wise. The wiki will also allow researchers to document how they create, use and

store their data. This may well produce productive synergies between various researchers and even

disciplines themselves.

It is also important for RDSI and Node operators to have a good understanding of the data practices of

researchers and disciplines so as to meet their needs.

5. DaShNet

Moving data from an RDSI Node to a researcher, or ingesting a researcher's new data collection

into an RDSI Node, will be one of the “meat and potatoes” daily operations within the RDSI project.

However these daily operations are fraught with consequences, especially if the volume of data to be

transferred is large. If there is insufficient network bandwidth and/or high network latency between

the researcher and the data they are trying to access, the efficiency of the research process will

deteriorate. Researchers usually have many activities “on the go” and they will typically move on to

another activity while waiting for a long data transfer to finish. Getting back to the original activity

may take some time, or in some cases may never happen.

As an example, consider this: most Australian universities have a connection of at least 1Gbps

and at most 10Gbps. Transferring 1TB of data will take either slightly

over 2 hours at 1Gbps or 13 minutes at 10Gbps. In this scenario a researcher will probably just go out

for a cup of coffee rather than move on to another activity. However if 100TB of data were transferred,

it would take either slightly over 222 hours at 1Gbps or 22 hours at 10Gbps, and the researcher would

definitely move on to a new activity.

The solution for this issue is twofold. Firstly the network bandwidth between the researcher and a

RDSI Node must be maximized considering the network topology both inside the researcher's

institution and the AREN (Australian Research and Education Network) backbone. Similarly the

network latency must likewise be minimized. In a coordinated move the AREN is currently moving

its backbone bandwidth to 100Gbps and the NRN (National Research Network) is providing

40Gbps network links from the AREN backbone to an RDSI Node's border router.

Reconsidering the previous 100TB data transfer at a bandwidth of 40Gbps, and assuming that an

institution will eventually upgrade their border routers to at least 40Gbps, it would take approximately

5.5 hours to transfer the 100TB of data rather than the 22 hours at 10Gbps.
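
The transfer times quoted in this section can be reproduced with a short Python sketch. It assumes decimal units (1TB = 10^12 bytes, 1Gbps = 10^9 bits per second), a fully utilised link and no protocol overhead, so the results are best-case figures. The function name transfer_hours is introduced here for illustration.

```python
def transfer_hours(terabytes, gigabits_per_second):
    """Best-case transfer time in hours, ignoring protocol overhead and latency."""
    bits = terabytes * 1e12 * 8
    seconds = bits / (gigabits_per_second * 1e9)
    return seconds / 3600


if __name__ == "__main__":
    for size_tb in (1, 100):
        for rate_gbps in (1, 10, 40):
            hours = transfer_hours(size_tb, rate_gbps)
            print(f"{size_tb:>4} TB at {rate_gbps:>2} Gbps: {hours:8.2f} hours")
```

Real transfers will be slower than these figures, which is part of the motivation for the efficient data movement protocols discussed later.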

Secondly highly efficient data movement protocols must be employed. This topic will be discussed in a

later section.

6. National File System

One of the initiatives of the Australian Research and Collaboration Services (ARCS) was the ARCS Data

Fabric which provided 25GB of free storage to all researchers. Unfortunately ARCS funding finished 1st

July 2011 leaving this service in financial doubt. RDSI therefore has to step forward to continue this

service. The RDSI project will provide a National File System that offers researchers in

the Australian Higher Education and Research sector 25GB of free storage.

The deployment of this file system will be in much the same image as the ARCS Data Fabric so as to

provide a similar interface to previous and current users. It will run the iRODS v3 software using the

OS authentication feature. This will allow the DSIL LDAP directory to provide the same username/UID

and group/GID semantics within iRODS as without.

7. Data as a Service

The RDSI project is a prime example of DaaS, Data as a Service. As defined by Wikipedia:

DaaS is based on the concept that the product, data in this case, can be provided on

demand to the user regardless of geographic or organizational separation of provider

and consumer.

Data as a Service brings the notion that data quality can happen in a centralized

place, cleansing and enriching data and offering it to different systems, applications

or users, irrespective of where they were in the organization or on the network. As

such, Data as Service solutions provide the following advantages:

• Agility – Customers can move quickly due to the simplicity of the data access and the

fact that they don’t need extensive knowledge of the underlying data. If customers

require a slightly different data structure or has location specific requirements, the

implementation is easy because the changes are minimal.

• Cost-effectiveness – Providers can build the base with the data experts and

outsource the presentation layer, which makes for very cost effective user interfaces

and makes change requests at the presentation layer much more feasible.

• Data quality – Access to the data is controlled through the data services, which

tends to improve data quality because there is a single point for updates. Once

those services are tested thoroughly, they only need to be regression tested if they

remain unchanged for the next deployment.

In RDSI's case the data itself is generated by researchers doing the normal things that researchers do;

i.e. compiling discipline based data sets and publishing their findings. Such data sets, as prescribed by

the rigours of the RDSI ReDs program will be uploaded to the central repositories within the collection

of RDSI Nodes. Easy discovery and access to the data contained within the RDSI Nodes is an

imperative.

Data Discovery and Metadata

As the RDSI Nodes will be brimming with useful data sets and collections it will be very important for a

researcher to be able to easily find a particular data set. However without sufficient metadata

describing the data set it will be next to impossible for a researcher to discover the existence of the

data let alone where it is located. Without accurate and sufficient metadata the RDSI

infrastructure is pointless.

In Medieval times parish priests were entrusted with the care of souls. These priests were titled

curates. In present times data custodians are entrusted with the care of metadata. Data collections

and data sets must have data custodians too so that they can be curated, cared for and discoverable

throughout their life cycle.

It is assumed that ANDS (Australian National Data Service) will provide its expertise with respect to

curation matters. For more details on this subject please read the ANDS Guide The Data Curation

Continuum.

Data Movement

One of the prime purposes of an RDSI Node is to be able to move data to or from a Node to where it

can be consumed by a researcher so as to provide some form of new scientific result. This data

movement can be achieved in an extraordinarily large number of ways. However the data

movement mechanism that is chosen is usually the prescribed data movement protocol of a particular

discipline or the preferred data movement mechanism of the researcher or his/her research group.

In a sector as robust as the Australian Higher Education and Research sector this still leaves a

potentially large number of data movement mechanisms in use. It would be economically infeasible

for every RDSI Node to provide an interface for every data movement mechanism used in the sector.

At some stage a Node must choose what interfaces it will support.

So how can RDSI help Nodes in the choice of data movement mechanisms? An obvious answer is that

RDSI, through the DaSh Technical Architecture, will compel all Nodes to implement a certain set of data

movement mechanisms. These mechanisms will be decided through a community input process.

Nodes can of course implement other data movement mechanisms as well, and this choice will

obviously be one of the many differentiators of the Node from other Nodes, either attracting or

repelling successful ReDs bidders.

Of these compelled interfaces a number will be considered to be of a commodity type. That is, to the end-

users these interfaces will be well known and common in their use. For the system administrators of

Nodes these interfaces will also be well known, and their installation and support should be a

well known quantity. Some potential examples of these are

GridFTP, NFS v4, CIFS, webDAV and iRODS.

A number of these compelled data movement mechanisms may also be of a specialist type where the

burden of installation and support is higher than the commodity type and end-users may not have

been commonly exposed to them.

The list of compelled data movement mechanisms will provide an initial level playing field for all

Nodes. It will also provide a level playing field for all end-users of RDSI Node repositories.

In the next sections we will discuss some of the data movement interfaces that may have a part to

play within the RDSI project. As a gross simplification these interfaces will be categorized as:

• File Transfers (in which the provision of the service is mostly stateless).

• File Systems (in which the provision of the service is mostly stateful).

• Data Middleware (in which there are other applications between the data and the end-

user).

Which interfaces will be compelled will be teased out using community advice. An initial list may

look like this:

File Transfers      File Systems        Data Middleware

GridFTP             NFS v4.1 (pNFS)     iRODS

                    Clustered NFS

Rsync over SSH      pCIFS               Globus Online

                    CIFS

Amazon S3           webDAV              Reliable File Transfer (RFT)

HTTP                                    Storage Resource Manager (SRM)

As in all movement of data from one place to another there is always a risk that either the

confidentiality and/or the integrity of the data may be compromised in transit. Some data movement

mechanisms provide a layer of encryption to minimize these risks. Others use digital signatures or

checksums to detect that the data has been tampered with. However there are other data movement

mechanisms that do not provide any security of the data in transit.

The use of such insecure data movement mechanisms within the RDSI project will only be tolerated

when these data movement mechanisms are tunnelled through a layer that supplies

confidentiality and integrity of the data. Such layers are provided by protocols like TLS, GRE or SSH

tunnelling. In these cases it is the responsibility of the Node to provide this end-to-end layer from an

RDSI Node to the end-user however they wish.
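
The checksum approach to detecting corruption in transit, mentioned above, can be sketched in Python with the standard hashlib module. The function names and the choice of SHA-256 are assumptions made for illustration. A bare checksum detects accidental damage; it only guards against deliberate tampering if the expected value reaches the recipient over a channel that is itself trusted.

```python
import hashlib
import hmac


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in fixed-size chunks so large collections need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def arrived_intact(path, expected_hex):
    """Compare the received file against the checksum published by the sender."""
    return hmac.compare_digest(sha256_of(path), expected_hex.lower())
```

A receiving site would call arrived_intact on each file after transfer and request a retransmission when it returns False.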

File Transfers

The original File Transfer Protocol (FTP) specification was published as RFC 114 in 1971, even before

TCP and IP existed. Since then file transfers have been the heavy lifters in the data

movement area. Simply put, file transfers move a complete file or a piece of a file from one place to

another, ideally as fast as possible. Examples of file transfer mechanisms of interest to RDSI are:

• GridFTP is a protocol for network transfers using grid frameworks. GridFTP is part of the

Globus toolkit and was designed for efficient and secure transfer of large amounts of data.

GridFTP uses extensions to the FTP protocol to add enhancements such as parallel transfers

and automatic restart of transfer after interruption.

o ARCS GridFTP service

• Rsync

• HTTP

• Amazon S3 is an online storage web service offered by Amazon Web Services. Amazon S3

provides storage through web services interfaces (REST, SOAP, and BitTorrent).

• Tsunami UDP is a fast user-space file transfer protocol that uses TCP control and UDP data for

transfer over very high speed long distance networks (≥ 1 Gbps and even 10 GE), designed to

provide more throughput than possible with TCP over the same networks.

• Aspera’s fasp™ transport technology is an emerging standard for the high-speed movement of

large files or large collections of files over wide area networks.

• Bitspeed Velocity is a software application that accelerates file transfers. It maximizes existing

WAN bandwidth to up to 100% utilization.

8. SRM based File Transfers

In the simplest situation a file transfer mechanism assumes there is only one protocol

supported at both ends of the transfer. However in real life either end of the transfer may support

multiple file transfer mechanisms and there may not be an exact overlap of these mechanisms. In such

cases the separate end points must negotiate a common mechanism before a file transfer can be

initiated. Storage Resource Management (SRM) is a Grid middleware application that helps

provide this negotiation layer as well as other useful features such as coordinating storage allocation,

dynamic space reservation and automatic garbage collection that prevents clogging of storage

systems.
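
The idea of negotiating a common mechanism can be shown with a small Python sketch. This is not the SRM interface itself; the function name negotiate and the protocol labels are hypothetical, and the rule used (first client preference that the server also supports) is just one plausible policy.

```python
def negotiate(client_preferences, server_supported):
    """Return the first protocol in the client's preference order that the server supports.

    Returns None when the two ends have no mechanism in common.
    """
    supported = set(server_supported)
    for protocol in client_preferences:
        if protocol in supported:
            return protocol
    return None


if __name__ == "__main__":
    client = ["gridftp", "https", "rsync+ssh"]
    node = ["https", "webdav", "rsync+ssh"]
    print(negotiate(client, node))
    print(negotiate(["gridftp"], ["webdav"]))
```

When no overlap exists the caller has to fall back to another end point or report the failure, which is one reason a compelled common set of mechanisms is attractive.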

File Systems

A distributed file system or network file system is any file system that allows access to files from

multiple hosts sharing via a computer network. This makes it possible for multiple users on multiple

machines to share files and storage resources. The client nodes do not have direct access to the

underlying block storage but interact over the network using a protocol. This makes it possible to

restrict access to the file system depending on access lists or capabilities on both the servers and the

clients, depending on how the protocol is designed. Ideally these file systems should be able to move

data as fast as possible so as to maximize researcher productivity. Examples of network file systems

of interest to RDSI are:

• NFS v4.1/pNFS. NFSv4.1 adds the Parallel NFS (pNFS) capability, which enables data access

parallelism. The NFSv4.1 protocol defines a method of separating the file system meta-data

from the location of the file data; it goes beyond the simple name/data separation by striping

the data amongst a set of data servers. Whether an implementation of NFS v4.1/pNFS

provides sufficient aspects of the standard to provide strong authentication, data integrity and

data privacy is the concern of a Node operator given the RDSI stance on the importance of this

matter.

• NFS v4. The NFS v4 protocol specification RFC 3010 provides both strong authentication using

GSSAPI as well as strong integrity and privacy using LIPKEY and SPKM-3. Whether an

implementation of NFS v4 provides sufficient aspects of the standard to provide strong

authentication, data integrity and data privacy is the concern of a Node operator given the

RDSI stance on the importance of this matter.

• SMB/CIFS/pCIFS. The Common Internet File System (CIFS), also known as Server Message

Block (SMB), is a network protocol whose most common use is sharing files on a Local Area

Network. While CIFS can use strong authentication protocols like Kerberos, it has little native

support in the areas of data integrity or privacy. To combat this deficiency one can tunnel

CIFS/SMB file systems over protocols like SSH, TLS or GRE.

CTDB is a cluster implementation of the TDB database used by Samba and other projects to

store temporary data and is the core component that provides pCIFS ("parallel CIFS") with

Samba3/4.

• webDAV (RFC 4918) is a set of methods based on the Hypertext Transfer Protocol that

facilitates collaboration between users in editing and managing documents and files stored on

web servers. The WebDAV protocol makes the Web a readable and writable medium. It

provides a framework for users to create, change and move documents on a server. The most

important features of the WebDAV protocol include:

• Locking ("overwrite prevention")

• Properties (creation, removal, and querying of information about author, modified date

et cetera);

• Namespace management (ability to copy and move Web pages within a server's

namespace)

• Collections (creation, removal, and listing of resources)

The webDAV specification does not natively support data integrity or privacy; however,

webDAV is typically tunnelled through TLS to provide these services.
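
The striping of file data amongst a set of data servers, described for NFS v4.1/pNFS above, can be illustrated with a deliberately simplified Python sketch. It assumes plain round-robin placement with a fixed stripe unit, which is only one possible layout and not a statement of how any particular pNFS implementation behaves. The function name locate is hypothetical.

```python
def locate(offset, stripe_unit, server_count):
    """Map a byte offset in a file to (data server index, offset within the stripe unit).

    Assumes simple round-robin striping: stripe unit 0 goes to server 0,
    stripe unit 1 to server 1, and so on, wrapping around.
    """
    if offset < 0 or stripe_unit <= 0 or server_count <= 0:
        raise ValueError("offset must be >= 0; stripe_unit and server_count must be > 0")
    stripe_index = offset // stripe_unit
    return stripe_index % server_count, offset % stripe_unit


if __name__ == "__main__":
    unit = 64 * 1024
    for offset in (0, unit, 3 * unit + 100, 4 * unit):
        print(offset, locate(offset, unit, server_count=4))
```

Because consecutive stripe units land on different servers, a client reading a large file can fetch several units in parallel, which is where the throughput gain comes from.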

Data Middleware

• iRODS. The Integrated Rule-Oriented Data System is open source software that helps people

manage large collections of digital data distributed across multiple sites running diverse

infrastructure.

• OPeNDAP. OPeNDAP, an acronym for "Open-source Project for a Network Data Access Protocol", is a

data transport architecture and protocol widely used by earth scientists. The protocol is based

on HTTP and the current specification is OPeNDAP 2.0 draft. OPeNDAP includes standards for

encapsulating structured data, annotating the data with attributes and adding semantics that

describe the data. The protocol is maintained by OPeNDAP.org, a publicly-funded non-profit

organization that also provides free reference implementations of OPeNDAP servers and

clients.

• Globus Online. Globus Online is a fast, reliable file transfer service that makes it easy for any

user to move any data anywhere. Recommended by HPC centres and user communities of all

kinds, Globus Online automates the time-consuming and error-prone activity of managing file

transfers, so users can stay focused on what’s most important: their research.

• Globus Reliable File Transfer (RFT) Service. RFT is a Web Services Resource Framework (WSRF)

compliant web service that provides “job scheduler”-like functionality for data movement. You

simply provide a list of source and destination URLs (including directories or file globs) and

then the service writes your job description into a database and then moves the files on your

behalf. Once the service has taken your job request, interactions with it are similar to any job

scheduler.

• Globus Replica Location Service (RLS). The RLS service is one component of data management

services for Grid environments. RLS is a tool that provides the ability to keep track of one or more

copies, or replicas, of files in a Grid environment. This tool, which is included in the Globus

Toolkit, is especially helpful for users or applications that need to find where existing files are

located in the Grid.

• Globus Data Replication Service (DRS). The function of the DRS is to ensure that a specified set

of files exists on a storage site. The DRS begins by querying RLS to discover where the desired

files exist in the Grid. After the files are located, the DRS creates a transfer request that is

executed by RFT. After the transfers are completed, DRS registers the new replicas with RLS.

• WAN Data Cache. Researchers are naturally distributed across cities and across the country. In most

cases researchers are located at universities, where they have access to network bandwidth that

is both sufficiently large and sufficiently close at hand. Access speeds to data within the RDSI Nodes

will thus be sufficient thanks to the AREN, NRN and the DaShNet initiative. However, there will

always be researchers who are less well endowed with network bandwidth. These “spatially

disenfranchised” researchers still need to perform their science and access data within the RDSI

Nodes. WAN Data Caches can help these researchers to achieve

significantly more effective access and bandwidth to data within an RDSI Node than they

currently have. However, this access and bandwidth will always be less than that of “spatially

enfranchised” researchers.
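
The three DRS steps described in the list above (query RLS, transfer through RFT, register the new replicas with RLS) can be sketched as follows. This is an illustration under stated assumptions, not the Globus API: ReplicaCatalog, transfer and replicate are hypothetical in-memory stand-ins, and the file and site names are invented.

```python
class ReplicaCatalog:
    """In-memory stand-in for RLS: maps a logical file name to the sites holding a copy."""

    def __init__(self):
        self._locations = {}

    def register(self, logical_name, site):
        self._locations.setdefault(logical_name, set()).add(site)

    def lookup(self, logical_name):
        return set(self._locations.get(logical_name, set()))


def transfer(jobs):
    """Stand-in for RFT: accepts (name, source, destination) jobs and returns those completed.

    A real transfer service would queue the jobs and move bytes; here every job succeeds.
    """
    return list(jobs)


def replicate(catalog, logical_names, destination):
    """Ensure each named file has a replica at `destination`, following the three DRS steps."""
    jobs = []
    for name in logical_names:
        sites = catalog.lookup(name)  # step 1: discover where replicas already exist
        if not sites:
            raise LookupError(f"no known replica of {name}")
        if destination not in sites:
            jobs.append((name, sorted(sites)[0], destination))
    completed = transfer(jobs)  # step 2: hand the transfer request to the transfer service
    for name, _source, site in completed:
        catalog.register(name, site)  # step 3: record the new replicas
    return completed


# Invented file and site names, for illustration only.
catalog = ReplicaCatalog()
catalog.register("survey.dat", "node-a")
catalog.register("model.nc", "node-b")
catalog.register("model.nc", "node-c")
done = replicate(catalog, ["survey.dat", "model.nc"], "node-c")
```

Only survey.dat is transferred in this run, because model.nc already has a replica at the destination; skipping files that are already present is the design choice that makes the operation safe to repeat.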

Structured Data

The labels "structured data" and "unstructured data" are often used ambiguously by different interest

groups; and often used lazily to cover multiple distinct aspects of the issue. In reality, there are at

least three orthogonal aspects to structure2:

• The structure of the data itself.

• The structure of the container that hosts the data.

• The structure of the access method used to access the data.

These three dimensions are largely independent and one does not need to imply another. For

example, it is absolutely feasible and reasonable to store unstructured data in a structured database

container and access it by unstructured search mechanisms.
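
The example above can be made concrete with a short sketch: free-text notes (unstructured data) are stored in a SQLite table (a structured container) and retrieved by a substring search (an unstructured access method). The table name, the sample notes and the search helper are assumptions made for illustration.

```python
import sqlite3

# Structured container: a relational table held in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, body TEXT NOT NULL)")

# Unstructured data: free text with no internal schema (invented sample notes).
conn.executemany(
    "INSERT INTO notes (body) VALUES (?)",
    [
        ("Field log: water samples collected near the river mouth.",),
        ("Interview transcript, participant 7, discusses soil salinity.",),
        ("Instrument recalibrated after the salinity probe drifted.",),
    ],
)


def search(connection, term):
    """Unstructured access: return ids of notes whose text contains `term`.

    SQLite's LIKE is case-insensitive for ASCII; '%' and '_' in `term` act as wildcards.
    """
    rows = connection.execute(
        "SELECT id FROM notes WHERE body LIKE ? ORDER BY id", (f"%{term}%",)
    )
    return [row[0] for row in rows]


matches = search(conn, "salinity")
```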

In many cases researchers will have their data described and constrained by some of the aspects

detailed above. To support these activities Nodes would require more infrastructure than “just plain

storage”. However, Node operators should see this as an opportunity. In fact, a collection of Nodes

might collaborate to provide, for example, a massive distributed query engine based on the

concepts of NoSQL and Map/Reduce. Such a service could be quite enticing to a significant portion of

the Australian Higher Education and Research sectors.
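
To illustrate the Map/Reduce idea mentioned above, the sketch below computes a total per discipline across several Nodes: each Node summarises its own records (the map step), and the partial results are then merged (the reduce step). The node names, record layout and figures are invented for illustration; a real distributed query engine would run the map step at each Node rather than in one process.

```python
from collections import Counter
from functools import reduce

# Hypothetical records held at three Nodes; each record is (discipline, size_in_gb).
NODE_RECORDS = {
    "node-a": [("astronomy", 120), ("genomics", 300)],
    "node-b": [("genomics", 450), ("climate", 80)],
    "node-c": [("astronomy", 60), ("climate", 40), ("genomics", 10)],
}


def map_node(records):
    """Map step, run where the data lives: total size per discipline on one Node."""
    totals = Counter()
    for discipline, size_gb in records:
        totals[discipline] += size_gb
    return totals


def reduce_totals(left, right):
    """Reduce step: merge two partial results into one."""
    return left + right


partials = [map_node(records) for records in NODE_RECORDS.values()]
totals = reduce(reduce_totals, partials, Counter())
```

Because only the small per-Node summaries cross the network, the bulk data stays where it is stored, which is the main attraction of this style of query for large collections.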

9. Data Integrity

Data integrity within the RDSI project is of utmost importance. If an RDSI Node cannot provide data to an

end-user in the same state in which it was ingested, then researchers may not be able to trust the

data from that Node. Moreover, the stain of a loss of data integrity at one Node may affect the

trustworthiness of other Nodes in the eyes of data custodians and end users as well. While it is

impossible to reduce the risk to data integrity to zero, it is possible to manage this risk.
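
One common way to check that data is returned "in the same state in which it was ingested" is to record a cryptographic digest at ingest and compare it on every read. The sketch below assumes SHA-256 and a hypothetical IngestLedger class; the document does not prescribe either.

```python
import hashlib


def sha256_digest(data):
    """Return the SHA-256 digest of `data` as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


class IngestLedger:
    """Records a digest for each object at ingest so that later reads can be verified."""

    def __init__(self):
        self._digests = {}

    def ingest(self, name, data):
        self._digests[name] = sha256_digest(data)

    def verify(self, name, data):
        """Return True if `data` is byte-for-byte what was ingested under `name`."""
        return self._digests[name] == sha256_digest(data)


# Invented object name and content, for illustration only.
ledger = IngestLedger()
original = b"station,temperature\nA1,21.4\n"
ledger.ingest("observations.csv", original)

# Simulate silent corruption by flipping a single bit in the first byte.
corrupted = bytes([original[0] ^ 0x01]) + original[1:]

intact_ok = ledger.verify("observations.csv", original)
corrupted_ok = ledger.verify("observations.csv", corrupted)
```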

Node operators bear the brunt of this risk and they must ensure that proactive and reactive measures

are taken. As a proactive measure, storage systems should be able to detect events such as bit rot and

silent corruption and attempt to heal them without human input. Such events should also be

monitored within the My Node portal even if the system successfully healed the loss of data integrity.

As a secondary proactive measure, a service similar to fsprobe, the CERN probabilistic data integrity

checker, should perform a regular check of file systems by writing various combinations of bit

patterns and then reading them back. This can be used to identify file system, operating system and

hardware problems.
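
The write-then-read-back idea can be sketched in a few lines. This is not fsprobe itself, only a simplified illustration of the technique; the bit patterns, block size and function name are assumptions. Note that a read issued immediately after a write may be served from the operating system's cache, so a production checker would need to take steps to reach the underlying storage.

```python
import os
import tempfile

# All zeros, all ones, and the two alternating-bit patterns.
PATTERNS = (b"\x00", b"\xff", b"\xaa", b"\x55")


def probe_directory(directory, block_size=4096, blocks=16):
    """Write each bit pattern to a scratch file in `directory` and read it back.

    Returns a list of (pattern, block_index) pairs for blocks that came back altered.
    """
    mismatches = []
    for pattern in PATTERNS:
        expected = pattern * block_size
        fd, path = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, "wb") as handle:
                for _ in range(blocks):
                    handle.write(expected)
                handle.flush()
                os.fsync(handle.fileno())  # ask the OS to push the data to storage
            with open(path, "rb") as handle:
                for index in range(blocks):
                    if handle.read(block_size) != expected:
                        mismatches.append((pattern, index))
        finally:
            os.remove(path)  # always clean up the scratch file
    return mismatches


# Probe a temporary directory; a Node would point this at the file system under test.
with tempfile.TemporaryDirectory() as scratch:
    faults = probe_directory(scratch)
```

An empty result means every block was read back unchanged; any entries would be the kind of event a Node operator should investigate and report.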

As a reactive measure when a storage system does detect a fault, the cause must be investigated

promptly and mitigation strategies designed and put in place. RDSI must be informed of these faults

and mitigation strategies. Node operators should share this information with each other so that a

body of knowledge about these storage anomalies can help minimize future anomalies across all

RDSI infrastructure.

2 Duncan Pauly, founder and chief technology officer of Coppereye

10. Manifestation of Trust within RDSI program

All RDSI infrastructure must manifest a significant level of trustworthiness so that

researchers, data custodians and other users will feel secure in its use.

In an infrastructure like a PKI or a SAML-based federation, trust is usually rooted at a

single point: either a root CA (in the case of a PKI) or a self-signed

certificate used to digitally sign an aggregation of SAML

metadata (in the case of a SAML federation). For both a PKI and a SAML federation there are also open and well-published practice

statements that allow end-users and relying parties to understand the risks of using the PKI or SAML

federation.

In the RDSI infrastructure there are a number of trust centres that manifest this aggregated trust.

Some are manifested by RDSI itself as a governance and policies layer, some are manifested by the

RDSI Nodes and their work practices. Some are manifested in the appropriate sanctioned use of the

DSIL LDAP directory. There are also trust manifestation centres that at first glance have little real

connection to either RDSI or Nodes. For example, when a remote RDSI file system is mounted on a

local file system, it is the work practices of the local system administrators that generate the

trustworthiness of that act.

For this reason the manifestation of trust for all aspects of the RDSI project is somewhat more

complicated than the simple case of a PKI or SAML federation. This increases the risk to end-users and

relying parties, as they may not be able to fully understand the risks of using the RDSI

infrastructure. The RDSI governance layer must manage the perception of risk well in order to sustain

a significant level of trustworthiness, so that researchers, data custodians and other users will feel

secure in its use.
