data workflow management, data preservation and stewardship
DESCRIPTION
Data Workflow Management, Data Preservation and Stewardship. Peter Fox Data Science – ITEC/CSCI/ERTH-4350/6350 Week 10, November 5, 2013. Contents. Scientific Data Workflows Data Stewardship Summary Next class(es) Projects. Scientific Data Workflow. What it is Why you would use it - PowerPoint PPT PresentationTRANSCRIPT
![Page 1: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/1.jpg)
1
Peter Fox
Data Science – ITEC/CSCI/ERTH-4350/6350
Week 10, November 5, 2013
Data Workflow Management, Data Preservation and
Stewardship
![Page 2: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/2.jpg)
Contents• Scientific Data Workflows
• Data Stewardship
• Summary
• Next class(es)
• Projects
2
![Page 3: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/3.jpg)
Scientific Data Workflow• What it is
• Why you would use it
• Some more detail in the context of Kepler– www.kepler-project.org
• Some pointer to other workflow systems
3
![Page 4: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/4.jpg)
4
What is a workflow?• General definition: series of tasks performed
to produce a final outcome
• Scientific workflow – “data analysis pipeline”– Automate tedious jobs that scientists traditionally
performed by hand for each dataset– Process large volumes of data faster than
scientists could do by hand
![Page 5: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/5.jpg)
5
Background: Business Workflows
• Example: planning a trip• Need to perform a series of tasks: book a
flight, reserve a hotel room, arrange for a rental car, etc.
• Each task may depend on outcome of previous task– Days you reserve the hotel depend on days of the
flight– If hotel has shuttle service, may not need to rent a
car
• E.g. tripit.com?
![Page 6: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/6.jpg)
6
What about scientific workflows?
• Perform a set of transformations/ operations on a scientific dataset
• Examples– Generating images from raw data– Identifying areas of interest in a large dataset– Classifying set of objects– Querying a web service for more information
on a set of objects– Many others…
![Page 7: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/7.jpg)
7
More on Scientific Workflows
• Formal models of the flow of data among processing components
• May be simple and linear or more complex
• Can process many data types:– Archived data– Streaming sensor data– Images (e.g., medical or satellite)– Simulation output– Observational data
![Page 8: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/8.jpg)
8
Challenges • Questions:
– What are some challenges for scientists implementing scientific workflows?
– What are some challenges to executing these workflows?
– What are limitations of writing a program?
![Page 9: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/9.jpg)
9
Challenges• Mastering a programming language
• Visualizing workflow
• Sharing/exchanging workflow
• Formatting issues
• Locating datasets, services, or functions
![Page 10: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/10.jpg)
10
Kepler Scientific Workflow Management System
• Graphical interface for developing and executing scientific workflows
• Scientists can create workflows by dragging and dropping
• Automates low-level data processing tasks
• Provides access to data repositories, compute resources, workflow libraries
![Page 11: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/11.jpg)
11
Benefits of Scientific Workflows
• Documentation of aspects of analysis
• Visual communication of analytical steps
• Ease of testing/debugging
• Reproducibility
• Reuse of part or all of workflow in a different project
![Page 12: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/12.jpg)
12
Additional Benefits
• Integration of multiple computing environments
• Automated access to distributed resources via web services and Grid technologies
• System functionality to assist with integration of heterogeneous components
![Page 13: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/13.jpg)
Why not just use a script?• Script does not specify low-level task
scheduling and communication
• May be platform-dependent
• Can’t be easily reused
• May not have sufficient documentation to be adapted for another purpose
13
![Page 14: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/14.jpg)
Why is a GUI useful?• No need to learn a programming language
• Visual representation of what workflow does
• Allows you to monitor workflow execution
• Enables user interaction
• Facilitates sharing of workflows
14
![Page 15: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/15.jpg)
The Kepler Project• Goals
– Produce an open-source scientific workflow system• enable scientists to design scientific workflows and execute them
– Support scientists in a variety of disciplines• e.g., biology, ecology, astronomy
– Important features• access to scientific data• flexible means for executing complex analyses• enable use of Grid-based approaches to distributed computation• semantic models of scientific tasks• effective UI for workflow design
![Page 16: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/16.jpg)
Distributed execution• Opportunities for parallel execution
– Fine-grained parallelism– Coarse-grained parallelism
• Few or no cycles
• Limited dependencies among components
• ‘Trivially parallel’
• Many science problems fit this mold– parameter sweep, iteration of stochastic models
• Current ‘plumbing’ approaches to distributed execution– workflow acts as a controller
• stages data resources
• writes job description files
• controls execution of jobs on nodes
– requires expert understanding of the Grid system
• Scientists need to focus on just the computations– try to avoid plumbing as much as possible
![Page 17: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/17.jpg)
– Higher-order component for executing a model on one or more remote nodes
– Master and slave controllers handle setup and communication among nodes, and establish data channels
– Extremely easy for scientist to utilize• requires no knowledge of grid computing systems
Distributed Kepler
OUT
IN
Master Slave
Controller Controller
![Page 18: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/18.jpg)
Token
{1,5,2}
• Need for integrated management of external data– EarthGrid access is partial, need refactoring– Include other data sources, such as JDBC, OpeNDAP, etc.– Data needs to be a first class object in Kepler, not just represented
as an actor– Need support for data versioning to support provenance
• e.g., Need to pass data by reference– workflows contain large data tokens (100’s of megabytes)– intelligent handling of unique identifiers (e.g., LSID)
Token
ref-276
{1,5,2}
Data Management
A B
![Page 19: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/19.jpg)
Managing Data Heterogeneity• Data comes from
heterogeneous sources– Real-world observations– Spatial-temporal contexts– Collection/measurement
protocols and procedures– Many representations for the
same information (count, area, density)
– Data, Syntax, Schema, Semantic heterogeneity
• Discovery and “synthesis” (integration) performed manually– Discovery often based on intuitive notion of “what is out there”– Synthesis of data is very time consuming, and limits use
![Page 20: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/20.jpg)
Scientific workflow systems support data analysis
KEPLER
![Page 21: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/21.jpg)
Composite Component(Sub-workflow)
Loops often used in SWFs; e.g., in genomics and bioinformatics (collections of data, nested data, statistical regressions, ...)
A simple Kepler workflow
(T. McPhillips)
![Page 22: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/22.jpg)
Workflow runs PhylipPars iteratively to discover all of the most parsimonious trees.
UniqueTrees discards redundant trees in each collection.
Lists Nexus filesto process (project) Reads text files Parses Nexus format
Draws phylogenetic trees
PhylipPars infers treesfrom discrete, multi-statecharacters.
A simple Kepler workflow
(T. McPhillips)
![Page 23: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/23.jpg)
An example workflow run, executed as a Dataflow Process Network
A simple Kepler workflow
![Page 24: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/24.jpg)
Smart Mediation Services (SMS) motivation
• Scientific Workflow Life-cycle – Resource Discovery
• discover relevant datasets
• discover relevant actors or workflow templates
– Workflow Design and Configuration• data actor (data binding)
• data data (data integration / merging / interlinking)
• actor actor (actor / workflow composition)
• Challenge: do all this in the presence of …– 100’s of workflows and templates– 1000’s of actors (e.g. actors for web services, data analytics, …)– 10,000’s of datasets– 1,000,000’s of data items– … highly complex, heterogeneous data
– price to pay for these resources: $$$ (lots) – scientist’s time wasted: priceless!
![Page 25: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/25.jpg)
Workflow validation in Kepler
• Navigate errors and warnings within the workflow
– Search for and insert “adapters” to fix (structural and semantic) errors …
• Statically perform semantic and structural type checking
![Page 26: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/26.jpg)
+Provenance Framework • Provenance
– Track origin and derivation information about scientific workflows, their runs and derived information (datasets, metadata…)
• Types of Provenance Information:– Data provenance
• Intermediate and end results including files and db references– Process (=workflow instance) provenance
• Keep the wf definition with data and parameters used in the run– Error and execution logs
– Workflow design provenance
![Page 27: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/27.jpg)
Kepler Provenance Recording Utility• Parametric and customizable
– Different report formats– Variable levels of detail
• Verbose-all, verbose-some, medium, on error
– Multiple cache destinations
• Saves information on– User name, Date, Run, etc…
![Page 28: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/28.jpg)
Some other workflow systems• SCIRun• Sciflo• Triana• Taverna• Pegasus• Some commercial tools:
– Windows Workflow Foundation– Mac OS X Automator
• http://www.isi.edu/~gil/AAAI08TutorialSlides/5-Survey.pdf
• http://www.isi.edu/~gil/AAAI08TutorialSlides/ • See reading for this week
28
![Page 29: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/29.jpg)
Data Stewardship• Putting a number of data life cycle,
management aspects together
• Keep the ideas in mind as you complete your assignments
• Why it is important
• Some examples
29
![Page 30: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/30.jpg)
Why it is important• 1976 NASA Viking mission to Mars (A. Hesseldahl, Saving
Dying Data, Sep. 12, 2002, Forbes. [Online]. Available: http://www.forbes.com2002/09/12/0912data_print.html )
• 1986 BBC Digital Domesday (A. Jesdanun, “Digital memory threatened as file formats evolve,” Houston Chronicle, Jan. 16, 2003. [Online]. Available: http://www.chron.com/cs/CDA/story.hts/tech/1739675 )
• R. Duerr, M. A. Parsons, R. Weaver, and J. Beitler, “The international polar year: Making data available for the long-term,” in Proc. Fall AGU Conf., San Francisco, CA, Dec. 2004. [Online]. Available: ftp://sidads.colorado.edu/pub/ppp/conf_ppp/Duerr/The_International_Polar_Year:_Making_Data_and_Information_Available_for_the_Long_Term.ppt 30
![Page 31: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/31.jpg)
At the heart of it
• Inability to read the underlying sources, e.g. the data formats, metadata formats, knowledge formats, etc.
• Inability to know the inter-relations, assumptions and missing information
• We’ll look at a (data) use case for this shortly
• But first we will look at what, how and who in terms of the full life cycle 31
![Page 32: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/32.jpg)
What to collect?• Documentation
– Metadata– Provenance
• Ancillary Information
• Knowledge
32
![Page 33: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/33.jpg)
Who does this?• Roles:
– Data creator– Data analyst– Data manager– Data curator
33
![Page 34: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/34.jpg)
How it is done• Opening and examining Archive Information
Packages
• Reviewing data management plans and documentation
• Talking (!) to the people:– Data creator– Data analyst– Data manager– Data curator
• Sometimes, reading the data and code 34
![Page 35: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/35.jpg)
Data-Information-Knowledge Ecosystem
35
Data Information Knowledge
Producers Consumers
Context
PresentationOrganization
IntegrationConversation
CreationGathering
Experience
![Page 36: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/36.jpg)
Acquisition
• Learn / read what you can about the developer of the means of acquisition– Documents may not be easy to find
– Remember bias!!!
• Document things as you go
• Have a checklist (see the Data Management list) and review it often
36
![Page 37: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/37.jpg)
20080602 Fox VSTO et al.
37
![Page 38: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/38.jpg)
Curation (partial)• Consider the organization and presentation of
the data
• Document what has been (and has not been) done
• Consider and address the provenance of the data to date, you are now THE next person
• Be as technology-neutral as possible
• Look to add information and metainformation
38
![Page 39: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/39.jpg)
Preservation• Usually refers to the full life cycle
• Archiving is a component
• Stewardship is one act of preservation
• Intent is that ‘you can open it any time in the future’ and that ‘it will be there’
• This involves steps that may not be conventionally thought of
• Think 10, 20, 50, 200 years…. looking historically gives some guide to future considerations 39
![Page 40: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/40.jpg)
Some examples and experience• NASA, NOAA• http://wiki.esipfed.org/index.php/
Preservation_and_Stewardship • Library community• Note:
– Mostly in relation to publications, books, etc but some for data
– Note that knowledge is in publications but the structure form is meant for humans not computers, despite advances in text analysis
– Very little for the type of knowledge we are considering: in machine accessible form
40
![Page 41: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/41.jpg)
Back in the day... NASA
SEEDS Working Group on Data Lifecycle• Second Workshop Report
o http://esdswg.gsfc.nasa.gov/pdf/W2_lcbo_bothwell.pdf o Many LTA recommendations
• Earth Sciences Data Lifecycle Reporto http://esdswg.gsfc.nasa.gov/pdf/lta_prelim_rprt2.pdfo Many lessons learned from USGS experience, plus some
recommendations• SEEDS Final Report (2003) - Section 4
o http://esdswg.gsfc.nasa.gov/pdf/FinRec.pdfo Final recommendations vis a vis data lifecycle
MODIS Pilot Project• GES DISC, MODAPS, NOAA/CLASS, ESDIS effort• Transferred some MODIS Level 0 data to CLASS
![Page 42: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/42.jpg)
Mostly Technical Issues
• Data Preservationo Bit-level integrityo Data readability
• Documentation• Metadata• Semantics• Persistent Identifiers• Virtual Data Products• Lineage Persistence• Required ancillary data• Applicable standards
![Page 43: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/43.jpg)
Mostly Non-Technical Issues
• Policy (constrained by money…)• Front end of the lifecycle
o Long-term planning, data formats, documentation...• Governance and policy• Legal requirements• Archive to archive transitions
• Money (intertwined with policy)• Cost-benefit trades• Long-term needs of NASA Science Programs • User input
o Identifying likely users• Levels of service• Funding source and mechanism
![Page 44: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/44.jpg)
HDF4 Format "Maps"for Long Term Readability
C. Lynnes, GES DISCR. Duerr and J. Crider, NSIDC
M. Yang and P. Cao, The HDF Group
Use case: a real live one; deals mostlywith structure and (some) content
HDF=Hierarchical Data FormatNSIDC=National Snow and Ice Data CenterGES=Goddard Earth ScienceDISC=Data and Information Service Center
![Page 45: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/45.jpg)
In the year 2025...
A user of HDF-4 data will run into the following likely hurdles:• The HDF-4 API and utilities are no longer supported...
o ...now that we are at HDF-7• The archived API binary does not work on today's OS's
o ...like Android 9.1 • The source does not compile on the current OS
o ...or is it the compiler version, gcc v. 12.x?• The HDF spec is too complex to write a simple read
program...o ...without re-creating much of the API
What to do?
![Page 46: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/46.jpg)
HDF Mapping Files
Concept: create text-based "maps" of the HDF-4 file layouts while we still have a viable HDF-4 API (i.e., now)• XML• Stored separately from, but close to the data files• Includes
o internal metadatao variable info o chunk-level info
byte offsets and length linked blocks compression information
Task funded by ESDIS project• The HDF Group, NSIDC and GES DISC
![Page 47: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/47.jpg)
Map sample (extract)
<hdf4:SDS objName="TotalCounts_A" objPath="/ascending/Data Fields" objID="xid-DFTAG_NDG-5"> <hdf4:Attribute name="_FillValue" ntDesc="16-bit signed integer"> 0 0 </hdf4:Attribute> <hdf4:Datatype dtypeClass="INT" dtypeSize="2" byteOrder="BE" /> <hdf4:Dataspace ndims="2"> 180 360 </hdf4:Dataspace> <hdf4:Datablock nblocks="1"> <hdf4:Block offset="27266625" nbytes="20582" compression="coder_type=DEFLATE" /> </hdf4:Datablock> </hdf4:SDS>
![Page 48: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/48.jpg)
Status and Future
Status • Map creation utility (part of HDF)• Prototype read programs
o Co Perl
• Paper in TGRS special issue• Inventory of HDF-4 data products within EOSDIS
Possible Future Steps• Revise XML schema• Revise map utility and add to HDF baseline• Implement map creation and storage operationally
o e.g., add to ECS or S4PA metadata files
![Page 49: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/49.jpg)
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
NASA/ MODIS Contextual Info
![Page 50: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/50.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Instrument/sensor characteristics
50
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 51: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/51.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Processing Algorithms & Scientific Basis
51
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 52: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/52.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Ancillary Data
52
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 53: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/53.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Processing History including Source Code
53
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 54: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/54.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Quality Assessment Information
54
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 55: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/55.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Validation Information
55
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 56: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/56.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Other Factors that can Influence the Record
56
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 57: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/57.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Bibliography
57
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 58: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/58.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Contextual Information:
• Instrument/sensor characteristics including pre-flight or pre-operational performance measurements (e.g., spectral response, noise characteristics, etc.)
• Instrument/sensor calibration data and method• Processing algorithms and their scientific basis,
including complete description of any sampling or mapping algorithm used in creation of the product (e.g., contained in peer-reviewed papers, in some cases supplemented by thematic information introducing the data set or derived product)
• Complete information on any ancillary data or other data sets used in generation or calibration of the data set or derived product
58
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 59: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/59.jpg)
Presented by R. Duerr at the Summer Institute on Data Curation, June 2-5, 2008Graduate School of Library and Information Science, University of Illinois at Urbana-Champaign
Contextual Information (continued):
• Processing history including versions of processing source code corresponding to versions of the data set or derived product held in the archive
• Quality assessment information• Validation record, including identification of validation data sets• Data structure and format, with definition of all parameters and
fields• In the case of earth based data, station location and any
changes in location, instrumentation, controlling agency, surrounding land use and other factors which could influence the long-term record
• A bibliography of pertinent Technical Notes and articles, including refereed publications reporting on research using the data set
• Information received back from users of the data set or product
59
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
![Page 60: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/60.jpg)
However…• Even groups like NASA do not have a
governance model for this work
• Governance: is the activity of governing. It relates to decisions that define expectations, grant power, or verify performance. It consists either of a separate process or of a specific part of management or leadership processes. Sometimes people set up a government to administer these processes and systems. (wikipedia)
60
![Page 61: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/61.jpg)
Who cares…• Stakeholders:
– NASA for integrity of their data holdings (is it their responsibility?)
– Public for value for and return on investment– Scientists for future use (intended and un-
intended)– Historians
61
![Page 62: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/62.jpg)
Library community• OAIS – Open Archival Information System,
http://en.wikipedia.org/wiki/Open_Archival_Information_System
• OAI (PMH and ORE) – Open Archives Initiative (Protocol for Metadata Harvesting and Object Reuse and Exchange), http://www.openarchives.org/
• Do some reading on your own for this
62
![Page 63: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/63.jpg)
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
Metadata Standards - PREMIS
• Provide a core preservation metadata set with broad applicability across the digital preservation community
• Developed by an OCLC and RLG sponsored international working group– Representatives from libraries, museums,
archives, government, and the private sector.
• Based on the OAIS reference model
• Preservation Metadata Interchange Std.
![Page 64: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/64.jpg)
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
Metadata Standards - PREMIS
• Maintained by the Library of Congress• Editorial board with international membership• User community consulted on changes
through the PREMIS Implementers Group • Version 1 was released in June 2005• Version 2 was released ~ 2012
![Page 65: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/65.jpg)
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
Rights
Events
Agents
“a coherent set of contentthat is reasonably
described as a unit”For example, a web site, data set or collection of data sets
“a coherent set of contentthat is reasonably
described as a unit”For example, a web site, data set or collection of data sets
“a discrete unit of information in digital form”
For example, a data file
“a discrete unit of information in digital form”
For example, a data file“assertions of one or more
rights or permissionspertaining to an object
or an agent”e.g., copywrite notice, legalstatute, deposit agreement
“assertions of one or more rights or permissions
pertaining to an objector an agent”
e.g., copywrite notice, legalstatute, deposit agreement
“an action that involves atleast one object or agentknown to the preservation
repository”e.g., created, archived,
migrated
“an action that involves atleast one object or agentknown to the preservation
repository”e.g., created, archived,
migrated
“a person, organization, orsoftware program associatedwith preservation events in
the life of an object”e.g., Dr. Spock donated it
“a person, organization, orsoftware program associatedwith preservation events in
the life of an object”e.g., Dr. Spock donated it
PREMIS - Entity-Relationship Diagram
IntellectualEntities
Objects
![Page 66: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/66.jpg)
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
PREMIS - Types of Objects
• Representation - “the set of files needed for a complete and reasonable rendition of an Intellectual Entity”
• File • Bitstream - “contiguous or non-contiguous
data within a file that has meaningful common properties for preservation purposes”
![Page 67: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/67.jpg)
7th Joint ESDSWG meeting, October 22, Philadelphia, PAData Lifecycle Workshop sponsored by the Technology Infusion Working Group
Information from users• Data Errors found
• Quality updates
• Things that need further explanation
• Metadata updates/additions?
• Community contributed metadata????
![Page 68: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/68.jpg)
Back to why you need to…• E-science uses data and it needs to be
around when what you create goes into service and you go on to something else
• That’s why someone on the team must address life-cycle (data, information and knowledge) and work with other team members to implement organizational, social and technical solutions to the requirements
68
![Page 69: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/69.jpg)
(Digital) Object Identifiers• Object is used here so as not to pre-empt an
implementation, e.g. resource, sample, data, catalog– DOI = http://www.doi.org/, e.g. 10.1007/s12145-
008-0001-8 – visit crossref.org and see where this leads you.
– URI, http://en.wikipedia.org/wiki/Uniform_Resource_Identifier e.g. http://www.springerlink.com/content/0322621781338n85/fulltext.pdf
– XRI (from OAIS), http://www.oasis-open.org/committees/xri
69
![Page 70: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/70.jpg)
Versioning• Is a key enabler of good
preservation
• Is a tricky trap for those that do not conform to written guidelines for versioning
• http://en.wikipedia.org/wiki/Revision_control
70
![Page 71: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/71.jpg)
Summary• The progression toward more formal
encoding of science workflow, and in our context data-science workflow (dataflow) is substantially improving data management
• Awareness of preservation and stewardship for valuable data and information resources is receiving renewed attention in the digital age
• Workflows are a potential solution to the data stewardship challenge
• Which brings us to the final assignment71
![Page 72: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/72.jpg)
Final assignment• See web (10% of grade).
72
![Page 73: Data Workflow Management, Data Preservation and Stewardship](https://reader034.vdocuments.mx/reader034/viewer/2022052510/56813a20550346895da1fa02/html5/thumbnails/73.jpg)
What is next• Next week – Academic basis for Data
Science, Data Models, Schema, Markup Languages and Data as Service Paradigms,
• Week of Nov. 19 = possible guest lecture or ?
• Week after - Webs of Data and Data on the Web, the Deep Web, Data Discovery, Data Integration
• Reading for this week – see web site
• Last class is week 13, Dec. 3 – project presentations
73