Data Intensive Research Initiative for South Africa
A. Vahed
Research Data Management Workshop
Pretoria, 11 August 2015
• Background
• Objectives
• RDM tasks
• Activities & Outputs
• Organisational Structure & Implementation
Outline
11 August 2015 © CSIR, 2015 2
NICIS
• National data integrative enabler supporting
– MTSF
– RDP
– SARIR,…
• Overarching coordination & national strategy
– National (Tier 1)
– Regional (Tier 2)
NICIS
[Diagram: NICIS architecture; labels as extracted]
Academia
Science councils
…
RIs
Core Services
Networked resources
Computing Services (CHPC +)
Networking Services (SANReN)
Data Services (DIRISA)
Materials & Manuf.
Energy
Earth & Environment
Phy Sci & Eng.
Humans & Society
Health, Bio & Food
Data intensive research environments (SA_Grid … Cloud)
Skills & expertise
Physical, Service, Support, Application
e-Research, e-Science
Integrated distributed cyber platform for
– Data Management
– Data Intensive Research
DIRISA
[Diagram: architecture labels as extracted]
Academia
Science councils
…
RIs
Core Services
Networked resources
Computing Services (CHPC +)
Networking Services (SANReN)
Data Services (DIRISA)
Materials & Manuf.
Energy
Earth & Environment
Phy Sci & Eng.
Humans & Society
Health, Bio & Food
Data intensive research environments (SA_Grid … Cloud)
Skills & expertise
Physical, Service, Support, Application
Other views
Industrial Sector Awareness
Capability
D. Tildesley: Vision of integrated e-infrastructure ecosystem
Connections Computing & Data Skills
Sector Domain Knowledge
Computing Data
Networks
Security
Software
Hardware
Value
Rich world of discipline, cross- and multi-disciplinary data analytics (enrichment, annotation, …)
Harmonised world of data management (preservation, workflows, …)
Federated world of data generators and sensory observations (models and measurements)
Services & Environments
Data stewardship
Data Acquisition
Standards & Policies
Literature Sharing
Skills and expertise
Advocacy & Outreach
• Extreme Data
– Global, massive, well-typed, homogeneous volumes
– LHC & SKA
• Research Big Data
– Large, mixed-typed volumes
– Imagery, text, audio, etc.
• Business Big Data
– Lots of usually closed transactional, serialised data
– Social data (Facebook, Twitter, Google, etc.)
• Long Tail Data
– Lots of (poorly managed) relatively small data sets
Data landscape
The underlying attributes are very different!
Data class characteristics
Class | Ownership | Big Data Vs | Technology | Skills | Research Env
Extreme | International | Vol, Vel, Open | Exascale Comp | Maths / Stats / Astro, Visual | Distributed teams
Big Data – Business | Businesses | Vol, Vel, Var, Closed | Clusters, SAS, Cloud, Hadoop | Data Engineers | Team
Big Data – Research | National, Institutional | Vol, Vel, Var, Ver, “Open” access | HPC, Clusters, Grid, Cloud, data transfer | Data Scientists, Domain Researchers, Comp Scientists, Maths, Model | VRE, multi-disc, RIs
Long Tail | Department, Individual | Var, Ver | Grid, cloud | Stats, Comp Science | Individuals, PhD, PD, RIs
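The "Big Data Vs" column of the table above can be read as a small lookup structure. The sketch below is only an illustration: the function names are hypothetical, the class keys use a plain hyphen, and the expansions of Vol, Vel, Var and Ver are assumed (the slide gives only the abbreviations).

```python
# Vs per data class, transcribed from the "Big Data Vs" column of the table.
# Assumption: the abbreviations expand as below; the slide does not spell them out.
V_NAMES = {"Vol": "Volume", "Vel": "Velocity", "Var": "Variety", "Ver": "Veracity"}

DATA_CLASS_VS = {
    "Extreme": {"Vol", "Vel"},
    "Big Data - Business": {"Vol", "Vel", "Var"},
    "Big Data - Research": {"Vol", "Vel", "Var", "Ver"},
    "Long Tail": {"Var", "Ver"},
}


def classes_with(v):
    """Return the data classes (sorted by name) that list the given V."""
    return sorted(name for name, vs in DATA_CLASS_VS.items() if v in vs)


def shared_vs(a, b):
    """Return the Vs that two data classes have in common (sorted)."""
    return sorted(DATA_CLASS_VS[a] & DATA_CLASS_VS[b])
```

Under this encoding, Extreme and Long Tail data share none of the Vs, which is one way to see how different the underlying attributes are.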
1. Robust infrastructure and services
– Federated Tier 1 & Tier 2 repositories
– Virtual research environments
2. Ensure good data stewardship
– Policies, protocols & standards
– Internationally benchmarked
3. Develop capacity & expertise
– Data intensive research
– Programmes with HEIs
4. Advocacy & outreach
– Data stewardship and data sharing
– Stakeholder engagement – forums (SADA)
5. Coordination & strategy
– National data intensive research activities
– Inform on and guide aligned & consolidated strategic agenda
How
Data Stewardship
Infrastructure & Services
Policies & Standards
Capacity & Expertise
Advocacy & Outreach
Coordination & Strategy
DIRISA Plan
Year 1 (2014/15): Survey & assess
• Institutional arrangements
• Infrastructure, policies & capacity
• Proposals
• Early adopters
Year 2 (2015/16): Develop & build
• Infrastructure & services
• Big projects
• Policies & processes
• Capacity building
Year 3 (2016/17): Grand research
• Federated network
• Open data & publishing
• Business partnerships
• End-to-end data management
Year 5+ (> 2017): Globally competitive
• Beyond eResearch
• Long running & real-time science
• Fused & streamed data
Year 1
Action/Task:
- Institutional arrangements
- Set up forums & events
- Engage & consult
- Survey, assess “As-Is” situation
- Prioritise areas & needs
- Coordinate new & ongoing projects
Outputs:
- Tier 1 & core services
- Data stewardship policies & framework (RDA, etc.)
- University data science programmes
- Solicited proposals in data stewardship
- Data intensive research strategy coordinated with funders, strategies and key initiatives
Role: national capstone orchestration
• NOT a data custodian or data owner BUT supports data stewardship in federated context
• NOT a research funder BUT promotes & supports research (with caveats)
• Coordinate and guide, NOT prescribe, data intensive research and strategy
• Promote & support, NOT require, data contribution and adoption of Open Data and Open Science
• Coordinate, NOT prescribe, data research capacity development
What DIRISA is not
• Stakeholder engagement
– Workshops in CTN & PTA
– Survey of current data intensive research activities and needs
• Data stewardship & data intensive research strategic frameworks
• Architecture for federated data infrastructure
• DIRISA website
• NRF Open Access statement
So far...
Stakeholder engagement for coordinated data management and data intensive research
• Avoid or minimise duplication of efforts and resources
• Promote sound data stewardship practices
• Promote cross-disciplinary research collaboration
• Share or federate resources where feasible
• Consolidate training interventions
• Promote and advance projects that address priority issues
• Promote data intensive research activity
South African Data Alliance
• Deploy data services (Phase 1): Dropbox-like interface: upload/deposit, DOI registry, search & browse (Phase 2: VRE)
• RDM plan template
• RDM policy
• Research and capacity development with DST & NRF
• Call for Tier 2 data nodes
Now:
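The Phase 1 data services named on this slide (upload/deposit, DOI registry, search & browse) can be pictured with a toy in-memory model. This is a minimal sketch under stated assumptions, not DIRISA's implementation: the class and method names are hypothetical, and `10.5555` is used only as a placeholder DOI prefix.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Deposit:
    doi: str
    title: str
    creator: str
    keywords: tuple


class DataRepository:
    """Toy in-memory stand-in for a deposit / DOI registry / search service."""

    def __init__(self, doi_prefix="10.5555"):  # placeholder prefix (assumption)
        self._prefix = doi_prefix
        self._seq = count(1)
        self._registry = {}  # DOI string -> Deposit

    def deposit(self, title, creator, keywords=()):
        """Store a record and register a sequential, DOI-shaped identifier."""
        doi = f"{self._prefix}/demo.{next(self._seq):06d}"
        record = Deposit(doi, title, creator, tuple(k.lower() for k in keywords))
        self._registry[doi] = record
        return record

    def resolve(self, doi):
        """Look up a record by identifier; returns None if it is unknown."""
        return self._registry.get(doi)

    def search(self, term):
        """Case-insensitive match on title substring or exact keyword."""
        term = term.lower()
        return [r for r in self._registry.values()
                if term in r.title.lower() or term in r.keywords]

    def browse(self):
        """List all records ordered by identifier."""
        return sorted(self._registry.values(), key=lambda r: r.doi)
```

A real service would mint identifiers through a DOI registration agency and persist deposits; the sketch only shows how the three functions relate to one registry.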
• Research data lifecycle
– Observation / Generation
– …
– Preservation / Expunction
• Quality, (Re)usability
• Intellectual Property, Copyright, Licensing, Policy
• Identity & Stacking
• Ethics & Privacy
– Re-identification
– Discriminative profiling
– Who watches the watchers?
• Access, Trust & Security
– Laws have borders; data does not
• Infrastructure (institutional & technical)
• Persistence & Provenance
• Data sharing mind-set (What’s in it for me?)
Issues
Collection & formats
Organising & storing
Long term preservation
Ethics & IP
Sharing & re-use
Metadata
Mindset
• “You want me to give you my data?”
• “You want me to share my (hard-collected) data?”
• “But we’ve tried this before”
• “We’re ok, we don’t need help”
• “Why should we get involved?”
• “This won’t work”
• “Show us the money”
• CODATA Data Citation Task Group
• Board on Research Data and Information (BRDI)
• International Council for Scientific and Technical Information (ICSTI)
• DataCite
• The Dataverse Network
• National Information Standards Organization (NISO)
• Creative Commons and Science Commons
• CENDI – U.S. interagency group focused on scientific and technical information issues and coordination of activities
• Global Biodiversity Information Facility (GBIF)
• World Data System (WDS)
• STM Association (“Out of Cite, Out of Mind” publication)
• Digital Curation Centre, UK
• Research Data Alliance
• DataFirst (UCT)
• South African Data Archive (SADA)
Data Citation Initiatives
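As an illustration of what these initiatives aim at, the sketch below renders a dataset citation from a few metadata fields. It assumes a creator / year / title / publisher / identifier pattern of the kind commonly used for dataset citations. The record type, the function name and the exact punctuation are hypothetical choices, not a format prescribed by any of the groups listed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    creators: tuple   # e.g. ("Doe, J.", "Roe, R.")
    year: int
    title: str
    publisher: str
    doi: str          # bare DOI, without the resolver URL


def format_citation(rec):
    """Render 'Creators (Year): Title. Publisher. https://doi.org/DOI'."""
    creators = "; ".join(rec.creators)
    return (f"{creators} ({rec.year}): {rec.title}. "
            f"{rec.publisher}. https://doi.org/{rec.doi}")
```

Keeping the persistent identifier in the citation lets a reader resolve the exact dataset that was used.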
Capacity Development
Core & conversion modules
Specialisation
• Data Science (Analytics, Visualisation, …)
• Data Engineering: Technologies (Hadoop, MapReduce, …)
• Data Stewardship: Management (Policy, Standards, …)
Innovation & Discovery
Support & Enablement
Infrastructure
Capacity Development
Research Communities
CHPC DIRISA
Services & Standards
SANReN
Advocacy & Outreach
Science strategies
Collaboration