A Brief Provenance Tour … via DataONE
Bertram Ludäscher & Dave Vieglais
DataONE Users Group meeting
2017-07-24, Bloomington, IN, Indiana Memorial Union
A Tour of Provenance: Overview
§ All science is physics or stamp collecting provenance
§ What is provenance ...
§ ... by whom, for whom, for what
§ ... how-to (in DataONE)?
§ Prospective provenance (≈ scientific workflows)
§ Retrospective provenance (≈ runtime events, traces)
§ Hybrid provenance
§ Provenance in DataONE
§ A closer look at provenance
The Price is Right! Right?
• One of these has been sold for nearly $180 million.
• The other could be worth as much or more.
• Which is which?
• What is the difference?
https://en.wikipedia.org/wiki/Les_Femmes_d%27Alger#.22Version_O.22
https://en.wikipedia.org/wiki/La_Bella_Principessa
Provenance@DUG-2017
Provenance defined…
• Oxford English Dictionary
– The place of origin or earliest known history of something:
  • an orange rug of Iranian provenance
– The beginning of something’s existence; its origin:
  • they try to understand the whole universe, its provenance and fate
– A record of ownership of a work of art or an antique, used as a guide to authenticity or quality:
  • the manuscript has a distinguished provenance
• What is the provenance (origin) of “provenance”?
Computational Provenance defined…
• Provenance of what? By whom? For what?
• Origin and processing history of digital artifact …
– usually: data (products), figures, ...
– sometimes: workflow (and script) evolution …
– … by researchers, (computational-, data-) scientists
– ... for data (re-)use
– … for transparency, reproducibility
– ... for others? ... self?
• Different sub-communities study provenance:
– Provenance in (scientific) workflows …
– Provenance in databases …
– Wait, there is more:
  • ... programming languages, systems/security, …
  • … information science, archival science, diplomatics
– Last not least: provenance in the science sciences!
  • a.k.a. natural sciences
Provenance in the Natural Sciences ..
• Can you “see provenance” in this image?
• Grand Canyon’s rock layers are a record of the early geologic history of North America. The ancestral Puebloan granaries at Nankoweap Creek tell archaeologists about more recent human history. (By Drenaline, licensed under CC BY-SA 3.0)
• Not shown: computational archaeologists reconstructing past climate from multiple tree-ring databases => computational provenance is key for transparency & reproducibility
… in Biology & Natural History
Provenance = Understanding what happened …
Zrzavý, Jan, David Storch, and Stanislav Mihulka. Evolution: Ein Lese-Lehrbuch. Springer-Verlag, 2009.
Author: Jkwchui (Based on drawing by Truth-seeker2004)
Natura non facit saltus ("nature does not make jumps")
Provenance-in-Science Palooza
• What are those?
• Cosmology
• Geology, Stratigraphy
• Phylogeny
– the Tree of Life
• Genealogy
– your family: literally
• Academic Pedigree
– “Doktorvater” (Doktormutter)
• Etymology
• Chain of custody
– of art(ifacts)
• From provenance to explanations and understanding
Using Provenance for Transparency, Reproducibility
• What input data went into this study?
• What methods were used?
• … with what parameter settings, calibrations, …?
• Can we trust the data and methods?
§ Provenance (lineage): track origin and processing history of data
§ => query provenance to understand, exploit (data, code) dependencies
§ => attribution, credit, data quality assessment, trust via audit trails, provenance
§ => discovery of data, methodologies, experiments
https://en.wikipedia.org/wiki/Hockey_stick_graph
Climate Change: Whodunnit?
Tracking sources (data, code) … the hard way
Provenance today: Importance: ✓ How-to: ??
=> projects, groups conduct R&D on provenance methods, tools, …
In particular:
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
From Provenance to Reproducible Science …
Capturing provenance is crucial for transparency, interpretation, debugging, …
=> repeatable experiments
=> reproducible science
=> need workflow-system agnostic model
... via scientific workflows (… and scripts)
Tour Stop: Scientific Workflows: ASAP
• Automation
– wfs to automate computational aspects of science
• Scaling (exploit and optimize machine cycles)
– wfs should make use of parallel compute resources
– wfs should be able to handle large data
• Abstraction, Evolution, Reuse (human cycles)
– wfs should be easy to (re-)use, evolve, share
• Provenance
– wfs should capture processing history, data lineage
=> traceable data- and wf-evolution
=> Reproducible Science
[Screenshots: Trident Workbench, VisTrails]
Once upon a time …
Executable WATERS Workflow in Kepler
Data Curation Workflows (Filtered-Push … Kepler … Kurator projects)
Runtime Provenance (a.k.a. traces, logs, retrospective provenance, “Trace-land”)
Workflow Modeling & Design (a.k.a. prospective provenance, “Workflow-land”)
Workflows ⇔ Provenance: an important link!
ProvONE: PROV for scientific workflows (Transfer station to any of several other “standard extensions”)
• “Trace-Land” (retrospective provenance)
• “Data-Land”
• “Workflow-Land” (prospective prov.)
Also: OPM-W (G&G et al), others …
Yang Cao1, Christopher Jones2, Víctor Cuevas-Vicenttín3, Matthew B. Jones2, Bertram Ludäscher1, Timothy McPhillips1, Paolo Missier4, Christopher Schwalm5, Peter Slaughter2, Dave Vieglais6, Lauren Walker2, Yaxing Wei7
1 University of Illinois, Urbana-Champaign; 2 National Center for Ecological Analysis and Synthesis, UCSB; 3 Universidad Popular Autónoma del Estado de Puebla, Mexico; 4 School of Computing Science, Newcastle University, UK; 5 Woods Hole Research Center, Falmouth, MA; 6 University of Kansas, Lawrence; 7 Environmental Sciences Division, Oak Ridge National Lab, TN
Provenance Standards vs Tools
• Do we need more standards to sort this out?
• How should we think about provenance?
– ... in workflows, scripts, databases?
• What can we do with provenance?
– ... in workflows and databases?
• Tools to create, share, use provenance
– … not just for “provenance for others”
– ... need more “provenance for self”
• => creating, using (querying!) provenance
• … in DataONE!
• Modeling scripts as workflows & linking provenance => YesWorkflow toolkit (later)
DataONE Summer 2017 Internship
• YesWorkflow (YW) model linked to ProvONE (W3C PROV extension)
• This summer: YW model in RDF; YW model queries in SPARQL: DataONE summer interns 2017 (Linh Hoang, Hui Lyu, UIUC)
• https://github.com/idaks/DataONE-Prov-Summer-2017
Provenance in DataONE
§ Phase II Goal: Facilitate reproducible science
§ Track data derivation history
§ Track data inputs and outputs of analyses
§ Track analysis and model executions
§ Preserve and document software workflows
§ Link all of these to publications
Provenance at a Glance
§ ProvONE extensions to W3C PROV
§ New first-class objects in DataONE
  § Figures, Software
§ Extended Data Package with Provenance
§ New Tools
  § Provenance indexing
  § Web search and browse interface for provenance
  § Matlab tool for generating provenance
  § R tool for generating provenance
  § YesWorkflow tool for modeling provenance from scripts
ProvONE extends PROV for science
• “Trace-Land” (retrospective provenance)
• “Data-Land”
• “Workflow-Land” (prospective prov.)
Provenance in DataONE: Storing Metadata
§ Dataset model in DataONE
§ Where provenance information is stored
Modeling Provenance Relationships
§ prov:used
§ prov:generated
§ prov:derivedFrom
[Figure: OAI-ORE with ProvONE trace. Data Package 1 (metadata, science data, figures, software) with the relationships cito:documents, prov:used, prov:generated, prov:derivedFrom; science data, data granule 1 (doi:10.5063/F1Z899CZ).]
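The relationships on this slide can be read as labeled edges between package members. The following is a minimal sketch of that idea, assuming plain Python tuples instead of an RDF library; the member names (script.R, input.csv, figure.png, metadata.xml) are hypothetical, and the property names follow the slide's shorthand rather than any particular serialization.

```python
from collections import defaultdict

# Hypothetical package members; none of these identifiers come from DataONE.
TRIPLES = [
    ("script.R", "prov:used", "input.csv"),
    ("script.R", "prov:generated", "figure.png"),
    ("figure.png", "prov:derivedFrom", "input.csv"),
    ("metadata.xml", "cito:documents", "input.csv"),
    ("metadata.xml", "cito:documents", "figure.png"),
]


def index_by_predicate(triples):
    """Group (subject, object) pairs under each predicate."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[predicate].append((subject, obj))
    return index


def sources_of(product, triples):
    """Return the objects that `product` is directly derived from."""
    return [o for s, p, o in triples if s == product and p == "prov:derivedFrom"]


def programs_that_generated(product, triples):
    """Return the subjects that list `product` as something they generated."""
    return [s for s, p, o in triples if p == "prov:generated" and o == product]
```

Asking `sources_of("figure.png", TRIPLES)` walks one derivation step back; a real package would store the same statements in its resource map.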
Provenance … of Data
bit.ly/DWS_01_04
Provenance Display: DataONE Search UI
https://search.dataone.org/#view/urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171
Provenance in DataONE: Creating Provenance Metadata
§ How to generate provenance
  § Retrospective
  § Prospective
Investigator Tools
Recordr R Library
Matlab DataONE Toolbox
YesWorkflow Tool (YW)
Example: R programming
# Generate map of locations by type
library(recordr)
recordr <- new("Recordr")
pkg <- record(recordr, "./hcdbSites.R", "loc-by-type-png")
YesWorkflow: Modeling Scripts as Workflows
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
# ... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
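Because the annotations live in ordinary comments, a workflow model can be recovered by scanning the script text. The sketch below is a hypothetical, minimal parser written for illustration (it is not the YesWorkflow tool and handles only the four tags shown on the slide); `parse_block` and the dictionary layout are assumptions.

```python
import re

SCRIPT = """\
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
# @end CreateGulfOfAlaskaMaps
"""

# One annotation per comment line: "# @tag name [@as alias]".
ANNOTATION = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)(?:\s+@as\s+(\S+))?")


def parse_block(text):
    """Collect the block name and its aliased inputs/outputs from comments."""
    block = {"name": None, "in": {}, "out": {}}
    for line in text.splitlines():
        match = ANNOTATION.search(line)
        if not match:
            continue
        tag, name, alias = match.groups()
        if tag == "begin":
            block["name"] = name
        elif tag in ("in", "out"):
            # Fall back to the variable name when no @as alias is given.
            block[tag][name] = alias or name
    return block


model = parse_block(SCRIPT)
```

The resulting `model` holds the block name plus a mapping from script variables (`hcdb`, `map`, …) to the data names a reader would see in the workflow graph.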
Transitive Credit
When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack
• Missier, Paolo. 2016. "Data trajectories: tracking reuse of published data for transitive credit attribution." International Journal of Digital Curation 11, no. 1, 1-16.
• Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
• ...
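Following credit "down the attribution stack" amounts to walking derivation links from the cited product back to its sources. This is a rough sketch under stated assumptions: the product records and creator names are invented for illustration, and the walk only lists who contributed; it does not compute the weighted credit shares discussed in the cited papers.

```python
# Hypothetical records: each product lists its creator and what it was derived from.
PRODUCTS = {
    "paper": {"creator": "Carol", "derived_from": ["figure"]},
    "figure": {"creator": "Bob", "derived_from": ["clean_data", "plot_script"]},
    "clean_data": {"creator": "Bob", "derived_from": ["raw_data"]},
    "plot_script": {"creator": "Bob", "derived_from": []},
    "raw_data": {"creator": "Alice", "derived_from": []},
}


def attribution_stack(product, products):
    """Walk derivation links depth-first and return (product, creator) pairs."""
    seen = set()
    stack = []

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        stack.append((name, products[name]["creator"]))
        for source in products[name]["derived_from"]:
            visit(source)

    visit(product)
    return stack


def creators_to_credit(product, products):
    """Unique creators in first-seen order along the attribution stack."""
    ordered = []
    for _, creator in attribution_stack(product, products):
        if creator not in ordered:
            ordered.append(creator)
    return ordered
```

Citing `"paper"` in this toy graph reaches the raw data's creator even though the paper never references that dataset directly.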
Provenance in DataONE search ..
search.dataone.org
Provenance in Action: Benefits & Impact
A DataONE search (here: “grass”) yields different packages with provenance
DataONE: Support for Provenance
• Yaxing’s script with inputs & output products
• Christopher’s YesWorkflow model
• Christopher using Yaxing’s outputs as inputs for his script
• Christopher’s results can be traced back all the way to Yaxing’s input
Exploring Provenance in DataONE
• Let’s go there => Mark Carls. 2017. Analysis of hydrocarbons following the Exxon Valdez oil spill, Gulf of Alaska, 1989 - 2014. Gulf of Alaska Data Portal. urn:uuid:3249ada0-afe3-4dd6-875e-0f7928a4c171.
From Workflows & Provenance to Provenance for Script-based Workflows …
• What workflow tools are (most) scientists using?
– Workflow systems
– … vs scripts (Python, R, MATLAB, ...)
• What provenance tools are there?
– Workflow system support
– Tools for “workflow” scripts!?
SKOPE: Synthesized Knowledge Of Past Environments
Bocinsky, Kohler et al. study rain-fed maize of Anasazi
– Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify “best” tree-ring chronologies for climate reconstruction.
K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618
… implemented as an R Script …
Provenance Support for Reproducible Science
Example: Paleoclimate Reconstruction
Science paper (OA) uses:
• open source code:
– R, PaleoCAR, …
• Is that all we need?
• What was the “workflow”?
• Is there prospective and/or retrospective provenance?
[Figure: YesWorkflow graph of the paleoclimate reconstruction script. Node labels: GetModernClimate, PRISM_annual_growing_season_precipitation, SubsetAllData, dendro_series_for_calibration, dendro_series_for_reconstruction, CAR_Analysis_unique, cellwise_unique_selected_linear_models, CAR_Analysis_union, cellwise_union_selected_linear_models, CAR_Reconstruction_union, raster_brick_spatial_reconstruction, raster_brick_spatial_reconstruction_errors, CAR_Reconstruction_union_output, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif, ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif, master_data_directory, prism_directory, tree_ring_data, calibration_years, retrodiction_years.]
YesWorkflow: Yes, scripts are workflows, too!
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
YesWorkflow.org
• YesWorkflow (YW)
– Started as a grass-roots effort (Kurator, SKOPE, ..)
– … meeting the scientists/users where they R!
  • R, Matlab, (i)Python, Jupyter, …
– Scripts + simple user annotations
  • => Reveal the workflow model/abstraction … that underlies the (script) implementation
• => YW can give us more of ASAP!
– First YW: ASAP (Abstraction) ...
– Then YW-recon: ASAP (reconstructing runtime Provenance)
YW (prospective) and YW-Recon (retrospective) Provenance
• 1. YW: Annotate Script => YW Model
– Annotate @BEGIN .. @END, @IN, @OUT
– Visualize, share, be happy :-)
• 2. Run script
– Files are read and written
– Folder- & File names have metadata
• 3. YW-Recon
– Use @URI tags that link YW Model ⇔ Persisted Data
– Run URI-template queries
  • cf. “ls -R” & RegEx matching
• 4. YW-Query
– Answer the user’s provenance queries
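Step 3 treats file and folder names as metadata: a URI template is matched against the files a run left behind, much like a recursive listing followed by regular-expression matching. The sketch below illustrates that matching step only. It is not the YW-Recon implementation; `template_to_regex`, `match_template`, and the choice that a variable never contains `/` or `_` are assumptions made for this example, while the two templates are the ones shown on the YW slides.

```python
import re

RAW_IMAGE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"
CORRECTED_IMAGE = "data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img"


def template_to_regex(template):
    """Compile a URI template into a regex with one named group per variable."""
    seen = set()
    pieces = []
    # re.split with a capture group alternates literal text and variable names.
    for i, part in enumerate(re.split(r"\{(\w+)\}", template)):
        if i % 2 == 0:
            pieces.append(re.escape(part))
        elif part in seen:
            # A repeated variable must match the same text as its first use.
            pieces.append(f"(?P={part})")
        else:
            seen.add(part)
            # Assumption: variable values contain no "/" or "_".
            pieces.append(f"(?P<{part}>[^/_]+)")
    return re.compile("".join(pieces) + r"\Z")


def match_template(template, path):
    """Return the variable bindings for `path`, or None if it does not fit."""
    match = template_to_regex(template).match(path)
    return match.groupdict() if match else None
```

Calling `match_template(RAW_IMAGE, "run/raw/q55/DRT240/e10000/image_001.raw")` binds `cassette_id`, `sample_id`, `energy`, and `frame_number` from the path alone, which is the information later queries rely on.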
YW annotations: Model your Workflow!
YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!
• YW annotations in the script (R, Python, Matlab) are used to recreate the workflow view from the script …
[Figure: YW workflow graph recreated from the annotated script. Steps: initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data and parameters: cassette_id, sample_score_cutoff, sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv), calibration_image (file:calibration.img), run_log (file:run/run_log.txt), sample_name, sample_quality, rejected_sample, accepted_sample, num_images, energies, rejection_log (file:/run/rejected_samples.txt), sample_id, energy, frame_number, raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw), corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), total_intensity, pixel_count, corrected_image_path, collection_log (file:run/collected_images.csv).]
Paleoclimate Reconstruction (openSKOPE.org)
• … explained using YesWorkflow!
[Figure: YesWorkflow graph of the paleoclimate reconstruction script, with nodes including GetModernClimate, SubsetAllData, CAR_Analysis_unique, CAR_Analysis_union, CAR_Reconstruction_union, CAR_Reconstruction_union_output, and the files ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif and ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif.]
Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."
YW: Get 3 views for the price of 1
[Figure: three views of the workflow "main". Steps: fetch_mask, load_data, standardize_with_mask, simple_diagnose. Data: input_mask_file, land_water_mask, input_data_file, NEE_data, standardized_NEE_data, result_NEE_pdf.]
• Process-centric view - processes are the focus in workflows and in computational narratives
• Data-centric view - data are the focus of dataflow & provenance info and in data narratives
• Combined view (YW default) - dataflow-oriented workflow & provenance story
=> Towards Automating Data Narratives (Gil & Garijo)
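The process-centric and data-centric views can both be derived from the combined view, which is why one set of annotations yields all three. A minimal sketch of that derivation follows; the step and data names come from the figure, but the wiring between them is an assumption read off the node labels, and `process_view` and `data_view` are hypothetical helper names.

```python
# Combined view: each step with its inputs and outputs. The wiring is an
# assumption read off the node labels on the slide.
STEPS = {
    "fetch_mask": (["input_mask_file"], ["land_water_mask"]),
    "load_data": (["input_data_file"], ["NEE_data"]),
    "standardize_with_mask": (["land_water_mask", "NEE_data"], ["standardized_NEE_data"]),
    "simple_diagnose": (["standardized_NEE_data"], ["result_NEE_pdf"]),
}


def process_view(steps):
    """Edges step -> step, labeled with the data item passed between them."""
    producer = {}
    for step, (_, outputs) in steps.items():
        for data in outputs:
            producer[data] = step
    edges = []
    for step, (inputs, _) in steps.items():
        for data in inputs:
            if data in producer:
                edges.append((producer[data], data, step))
    return edges


def data_view(steps):
    """Edges data -> data, labeled with the step that connects them."""
    edges = []
    for step, (inputs, outputs) in steps.items():
        for source in inputs:
            for target in outputs:
                edges.append((source, step, target))
    return edges
```

In the process view, workflow inputs such as `input_mask_file` disappear because no step produces them; in the data view, steps survive only as edge labels.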
Multi-Scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP)
[Figure: YW workflow graph. Node labels: fetch_drought_variable, drought_variable_1, fetch_effect_variable, effect_variable_1, convert_effect_variable_units, effect_variable_2, create_land_water_mask, land_water_mask, init_data_variables, predrought_effect_variable_1, drought_value_variable_1, recovery_time_variable_1, drought_number_variable_1, define_droughts, sigma_dv_event, month_dv_length, detrend_deseasonalize_effect_variable, effect_variable_3, calculate_data_variables, recovery_time_variable_2, drought_value_variable_2, predrought_effect_variable_2, drought_number_variable_2, export_recovery_time_figure, output_recovery_time_figure, export_drought_value_variable_figure, output_drought_value_variable_figure, export_predrought_effect_variable_figure, output_predrought_effect_variable_figure, export_drought_number_variable_figure, output_drought_number_figure, input_drough_variable, input_effect_variable.]
Christopher Schwalm, Yaxing Wei
✔ Provenance capture (Matlab, R, Python, … scientific workflow systems)
✔ Uploading, sharing, linking provenance through various provenance tools
✗ Tools for scientists to exploit (≠ capture, share, link) provenance for their own day-to-day work.
=> Prime the provenance pump and increase provenance generation
=> Scientists accelerate their work via new, active uses of provenance.
But … how to prime the provenance pump?? Must support “Provenance for Self”!
Provenance for Self?!
Provenance for Others
YW-RECON: Prospective & Retrospective Provenance … (almost) for free!
run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt
[Figure: YW workflow graph with steps initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity. Data nodes carrying URI templates:]
– sample_spreadsheet: file:cassette_{cassette_id}_spreadsheet.csv
– calibration_image: file:calibration.img
– run_log: file:run/run_log.txt
– rejection_log: file:/run/rejected_samples.txt
– raw_image: file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw
– corrected_image: file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img
– collection_log: file:run/collected_images.csv
• URI-templates link conceptual entities to runtime provenance “left behind” by the script author …
• … facilitating provenance reconstruction
YW-Recon provenance queries (each query slide repeats the workflow graph and the run/ listing):
Q1: What samples did the script run collect images from?
Q2: What energies were used for image collection from sample DRT322?
Q3: Where is the raw image of the corrected image DRT322_11000eV_030.img?
Q5: What cassette-id had the sample leading to DRT240_10000eV_001.img?
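Queries of this kind can be answered from file names alone once the paths are matched against the raw-image and corrected-image patterns. The sketch below is illustrative only and is not the YW-Query implementation: the regular expressions are hand-written counterparts of the two URI templates, the paths are a small subset copied from the run/ listing (with data/ placed under run/ as the listing shows), and all function names are hypothetical.

```python
import re

# A few paths taken from the run/ listing (not the full set of images).
PATHS = [
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/data/DRT240/DRT240_10000eV_001.img",
    "run/data/DRT322/DRT322_11000eV_030.img",
]

RAW = re.compile(
    r"run/raw/(?P<cassette_id>[^/]+)/(?P<sample_id>[^/]+)"
    r"/e(?P<energy>\d+)/image_(?P<frame_number>\d+)\.raw\Z"
)
CORRECTED = re.compile(
    r"run/data/(?P<sample_id>[^/]+)/(?P=sample_id)"
    r"_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img\Z"
)


def raw_records(paths):
    """Variable bindings (plus the path) for every raw image file."""
    records = []
    for path in paths:
        match = RAW.match(path)
        if match:
            records.append(dict(match.groupdict(), path=path))
    return records


def samples(paths):
    """Q1: samples for which raw images exist."""
    return sorted({r["sample_id"] for r in raw_records(paths)})


def energies(paths, sample_id):
    """Q2: energies used for one sample."""
    return sorted({int(r["energy"]) for r in raw_records(paths) if r["sample_id"] == sample_id})


def raw_image_of(paths, corrected_path):
    """Q3: raw image(s) sharing sample, energy, and frame with a corrected image."""
    match = CORRECTED.match(corrected_path)
    if not match:
        return []
    key = match.groupdict()
    return [r["path"] for r in raw_records(paths)
            if all(r[name] == value for name, value in key.items())]


def cassette_of(paths, corrected_path):
    """Q5: cassette id(s) behind a corrected image, via its raw image."""
    raws = raw_image_of(paths, corrected_path)
    return sorted({r["cassette_id"] for r in raw_records(paths) if r["path"] in raws})
```

Each function answers one question by joining on the variables that the two patterns share, which mirrors how the workflow graph links raw_image to corrected_image.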
João F. Pimentel, Saumen Dey, Timothy McPhillips, Khalid Belhajjame, David Koop, Leonardo Murta, Vanessa Braganholo, Bertram Ludäscher
Yin & Yang: Demonstrating complementary provenance from noWorkflow & YesWorkflow
[Figure: noWorkflow trace graph of one run of simulate_data_collection. Nodes are labeled with script line numbers and names, e.g. 50 spreadsheet_rows, 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 118 average_intensity, 121 writer.writerow, repeated once per collected image. File nodes: cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, and raw/corrected image pairs such as run/raw/q55/DRT240/e10000/image_001.raw and run/data/DRT240/DRT240_10000eV_001.img, for DRT240 at e10000, e11000, e12000 and DRT322 at e10000, e11000.]
noWorkflow: not only Workflow!
• Scripts have provenance, too!
• Transparently capture some/all provenance from Python script runs.
• Use filter queries to “zoom” into relevant parts ..
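To give a feel for transparent capture, the sketch below records function activations during a run without changing the functions themselves, then filters the trace to "zoom" into one function. It uses the standard-library hook `sys.setprofile` and is only an illustration of the idea: it is not how noWorkflow is implemented, and `capture_calls`, `zoom`, and the two toy functions are hypothetical.

```python
import sys


def capture_calls(func, *args, **kwargs):
    """Run func and record (caller_line, function_name) for each Python call."""
    trace = []

    def profiler(frame, event, arg):
        # Only Python-level calls are recorded; C calls and returns are ignored.
        if event == "call":
            caller = frame.f_back
            trace.append((caller.f_lineno if caller else None, frame.f_code.co_name))

    sys.setprofile(profiler)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.setprofile(None)
    return result, trace


def zoom(trace, name):
    """Filter query: keep only activations of the named function."""
    return [entry for entry in trace if entry[1] == name]


# Toy stand-ins for two steps of a data collection script.
def transform_image(raw):
    return raw * 2


def collect_data_set(frames):
    total = 0
    for frame in frames:
        total += transform_image(frame)
    return total


result, trace = capture_calls(collect_data_set, [1, 2, 3])
names = [name for _, name in trace]
```

Each trace entry pairs the calling line with the activated function, similar in spirit to the "92 collect_next_image" style labels in the trace graph.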
[Figure: two YesWorkflow graphs of simulate_data_collection. The full graph shows the steps initialize_run, load_screening_results, calculate_strategy, log_rejected_sample, collect_data_set, transform_images, log_average_image_intensity with their data (run_log, sample_name, sample_quality, accepted_sample, rejected_sample, num_images, energies, rejection_log, sample_id, energy, frame_number, raw_image, corrected_image, total_intensity, pixel_count, collection_log) and inputs (sample_spreadsheet, calibration_image, sample_score_cutoff, data_redundancy, cassette_id). The second graph keeps only load_screening_results, calculate_strategy, collect_data_set, transform_images and the data between them, ending at corrected_image.]
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format 93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw') 93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
50 spreadsheet_rows
50 spreadsheet_rows(sample_spreadsheet_file)
51 str.format
51 write
50 sample_name
50 sample_quality
61 calculate_strategy
61 rejected_sample
61 energies
61 accepted_sample
61 num_images
90 str.format
90 write
91 sample_id
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open119 collection_log_file 120 module.writer
120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file
120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format
106 transform_image 106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open 119 collection_log_file 120 module.writer 120 collection_log
121 writer.writerow
92 collect_next_image
92 collect_next_image(casset ... _{frame_number:03d}.raw')
93 str.format
93 write
92 energy
92 frame_number
92 intensity
92 raw_image_file
106 str.format 106 transform_image
106 corrected_image_file
106 total_intensity
106 pixel_count
107 str.format
107 write
118 average_intensity
119 open
119 collection_log_file 120 module.writer120 collection_log
121 writer.writerow
92 collect_next_image
50 spreadsheet_rows
128 return
run/run_log.txt
run/rejected_samples.txt
run/raw/q55/DRT240/e10000/image_001.raw
run/data/DRT240/DRT240_10000eV_001.img
run/collected_images.csv
run/raw/q55/DRT240/e10000/image_002.raw
run/data/DRT240/DRT240_10000eV_002.img
run/raw/q55/DRT240/e11000/image_001.raw
run/data/DRT240/DRT240_11000eV_001.img
run/raw/q55/DRT240/e11000/image_002.raw
run/data/DRT240/DRT240_11000eV_002.img
run/raw/q55/DRT240/e12000/image_001.raw
run/data/DRT240/DRT240_12000eV_001.img
run/raw/q55/DRT240/e12000/image_002.raw
run/data/DRT240/DRT240_12000eV_002.img
run/raw/q55/DRT322/e10000/image_001.raw
run/data/DRT322/DRT322_10000eV_001.img
run/raw/q55/DRT322/e10000/image_002.raw
run/data/DRT322/DRT322_10000eV_002.img
run/raw/q55/DRT322/e11000/image_001.raw
run/data/DRT322/DRT322_11000eV_001.img
run/raw/q55/DRT322/e11000/image_002.raw
run/data/DRT322/DRT322_11000eV_002.img
[Figure: noWorkflow trace of simulate_data_collection after a lineage query, one "line number, name = value" entry per node, ending at the corrected image file. The same lineage query is applied to the YesWorkflow graph.]
simulate_data_collection
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
run/data/DRT240/DRT240_11000eV_002.img
YesWorkflow: Conceptual workflow model
noWorkflow: Python trace model
But how do we bridge this gap???
Would like to use YW model to query NW data!
Habemus Pons! We've got the Bridge!
The bridge is the journey.. (The journey is the destination)
Lineage of image file in terms of YW model, with details from NW provenance
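One way to picture the bridge, as a sketch only: run the lineage query on the YW model, then attach the values that NW recorded for the matching script variables. The block and data names and the values come from the slides; the wiring between blocks, the YW-to-NW name mapping, and the helper functions are assumptions made for illustration and are not the actual YW-NW bridge implementation.

```python
# YW model: block -> (inputs, outputs). The exact wiring is assumed.
YW_BLOCKS = {
    "load_screening_results": (("cassette_id", "sample_spreadsheet"),
                               ("sample_name", "sample_quality")),
    "calculate_strategy": (("sample_name", "sample_quality",
                            "sample_score_cutoff", "data_redundancy"),
                           ("accepted_sample", "rejected_sample",
                            "num_images", "energies")),
    "log_rejected_sample": (("rejected_sample",), ("rejection_log",)),
    "collect_data_set": (("cassette_id", "accepted_sample", "num_images", "energies"),
                         ("sample_id", "energy", "frame_number", "raw_image")),
    "transform_images": (("sample_id", "energy", "frame_number",
                          "raw_image", "calibration_image"),
                         ("corrected_image",)),
}

# NW trace: script variable -> recorded value (values as shown in the trace).
NW_VALUES = {
    "cassette_id": "q55", "sample_score_cutoff": 12.0, "data_redundancy": 0.0,
    "calibration_image_file": "calibration.img",
    "sample_spreadsheet_file": "cassette_q55_spreadsheet.csv",
    "sample_name": "DRT240", "sample_quality": 45,
    "accepted_sample": "DRT240", "num_images": 2,
    "energies": [10000, 11000, 12000], "sample_id": "DRT240",
    "energy": 11000, "frame_number": 2,
    "raw_image_file": "run/raw/q55/DRT240/e11000/image_002.raw",
    "corrected_image_file": "run/data/DRT240/DRT240_11000eV_002.img",
}

# Assumed mapping for YW data names that differ from the script variable names.
YW_TO_NW = {
    "sample_spreadsheet": "sample_spreadsheet_file",
    "calibration_image": "calibration_image_file",
    "raw_image": "raw_image_file",
    "corrected_image": "corrected_image_file",
}


def upstream(target, blocks):
    """All data names that target transitively depends on (target included)."""
    producers = {out: ins for ins, outs in blocks.values() for out in outs}
    seen, stack = set(), [target]
    while stack:
        name = stack.pop()
        if name not in seen:
            seen.add(name)
            stack.extend(producers.get(name, ()))
    return seen


def lineage(target, blocks, values, name_map):
    """YW-level lineage of target, annotated with NW-recorded values."""
    return {name: values.get(name_map.get(name, name))
            for name in sorted(upstream(target, blocks))}


image_lineage = lineage("corrected_image", YW_BLOCKS, NW_VALUES, YW_TO_NW)
```

The query is phrased in workflow terms (corrected_image), while the answer carries run-level detail (energy 11000, frame 2, the concrete file paths).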
Demo Time
YW-IDCC'17 Demo Use Cases
Domain | Use case | Programming language | Provenance methods
Climate science | C3C4 | MATLAB | YW + MATLAB RunManager
Astrophysics | LIGO | Python | YW + NW (code-level)
Protein crystal samples | Simulate data collection | Python | YW + NW (code-level)
Biodiversity data curation | kurator-SPNHC | Python | YW-recon + YW-logging
Social network analysis | Twitter | Python | YW + NW (file-level)
Oceanography | OHIBC Howe Sound (multi-run multi-script) | R | YW + R RunManager
LIGO example: What strain_L1_whitenbp depends on…
Overall workflow
[Figure: YesWorkflow graph GRAVITATIONAL_WAVE_DETECTION. Each block is listed with its description, followed by the data nodes drawn next to it in the graph.]
LOAD_DATA: Load hdf5 data.
  strain_H1, strain_L1, strain_16, strain_4
AMPLITUDE_SPECTRAL_DENSITY: Amplitude spectral density.
  ASDs (file:GW150914_ASDs.png), PSD_H1, PSD_L1
WHITENING: suppress low frequencies noise.
  strain_H1_whiten, strain_L1_whiten
BANDPASSING: remove high frequency noise.
  strain_H1_whitenbp, strain_L1_whitenbp
STRAIN_WAVEFORM_FOR_WHITENED_DATA: plot whitened data.
  WHITENED_strain_data (file:GW150914_strain_whitened.png)
SPECTROGRAMS_FOR_STRAIN_DATA: plot spectrogram for strain data.
  spectrogram (file:GW150914_{detector}_spectrogram.png)
SPECTROGRAMS_FOR_WHITEND_DATA: plot spectrogram for whitened data.
  spectrogram_whitened (file:GW150914_{detector}_spectrogram_whitened.png)
FILTER_COEFS: Filter signal in time domain (bandpassing).
  COEFFICIENTS
FILTER_DATA: filter data.
  filtered_white_noise_data (file:GW150914_filter.png), strain_H1_filt, strain_L1_filt
STRAIN_WAVEFORM_FOR_FILTERED_DATA: plot the filtered data.
  H1_strain_filtered (file:GW150914_H1_strain_filtered.png)
  H1_strain_unfiltered (file:GW150914_H1_strain_unfiltered.png)
WAVE_FILE_GENERATOR_FOR_WHITENED_DATA: Make sound files for whitened data.
  whitened_bandpass_wavefile (file:GW150914_{detector}_whitenbp.wav)
SHIFT_FREQUENCY_BANDPASSED: shift frequency of bandpassed signal.
  strain_H1_shifted, strain_L1_shifted
WAVE_FILE_GENERATOR_FOR_SHIFTED_DATA: Make sound files for shifted data.
  shifted_wavefile (file:GW150914_{detector}_shifted.wav)
DOWNSAMPLING: Downsampling from 16384 Hz to 4096 Hz.
  H1_ASD_SamplingRate (file:GW150914_H1_ASD_{SamplingRate}.png)
FN_Detector (file:{Detector}_LOSC_4_V1-1126259446-32.hdf5)
FN_Sampling_rate (file:H-H1_LOSC_{DownSampling}_V1-1126259446-32.hdf5)
fs
Upstream of strain_L1_whitenbp (prospective)
upstream(strain_L1_whitenbp) [prospective]
  WHITENING: strain_H1_whiten, strain_L1_whiten
  AMPLITUDE_SPECTRAL_DENSITY: PSD_H1, PSD_L1
  LOAD_DATA: strain_H1, strain_L1
  BANDPASSING: strain_L1_whitenbp
  FN_Detector: file:{Detector}_LOSC_4_V1-...
  FN_Sampling_rate: file:H-H1_LOSC_{Rate}_V1-...
  fs
upstream(strain_L1_whitenbp) [URI-recon]
  WHITENING: strain_H1_whiten, strain_L1_whiten
  AMPLITUDE_SPECTRAL_DENSITY: PSD_H1, PSD_L1
  LOAD_DATA: strain_H1, strain_L1
  BANDPASSING: strain_L1_whitenbp
  FN_Detector: L-L1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_4_V1-1126259446-32.hdf5
  FN_Sampling_rate: H-H1_LOSC_4_V1-1126259446-32.hdf5, H-H1_LOSC_16_V1-1126259446-32.hdf5
  fs
upstream(strain_L1_whitenbp) [NW-recon]
  WHITENING: strain_L1_whiten = array([8.494, -1.672, ..., 72.156])
  AMPLITUDE_SPECTRAL_DENSITY: PSD_L1, psd_L1 = scipy.interpolate.interpolate.interp1d object at 0x113969418
  LOAD_DATA: strain_L1 = array([-1.779e-18, -1.765e-18, ..., -1.719e-18])
  BANDPASSING: strain_L1_whitenbp = array([8.184, 19.935, ..., -0.684])
  FN_Detector: fn_d = L-L1_LOSC_4_V1-1126259446-32.hdf5
  fs: fs = 4096
Upstream of strain_L1_whitenbp (hybrid YW-NW at the code-level)
Upstream of strain_L1_whitenbp (hybrid YW-NW at the file-level)
• 3 inputs spread across 5 (= 2x2+1) files
• Does intermediate data strain_L1_whitenbp depend on all 5 inputs?
• Intermediate data strain_L1_whitenbp depends on only 2 out of 5 inputs!
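The URI-recon step above can be sketched as template matching: each {variable} in a file: template becomes a named regex group, and every file name that matches yields a variable binding. The templates and file names are the ones on the slide; the functions template_to_regex and bindings are hypothetical and do not reproduce YesWorkflow's own reconstruction code. The sketch assumes each variable occurs once per template.

```python
import re


def template_to_regex(template):
    """Compile 'prefix_{var}_suffix' into a regex with one named group per {var}."""
    parts = re.split(r"\{(\w+)\}", template)  # literal, var, literal, var, ...
    pattern = ""
    for index, part in enumerate(parts):
        if index % 2 == 0:
            pattern += re.escape(part)          # literal text must match exactly
        else:
            pattern += f"(?P<{part}>[^/]+)"     # variable: anything within one path segment
    return re.compile(pattern)


def bindings(template, filenames):
    """Return (filename, {variable: value}) for every file matching the template."""
    regex = template_to_regex(template)
    matches = []
    for name in filenames:
        match = regex.fullmatch(name)
        if match:
            matches.append((name, match.groupdict()))
    return matches


FILES = [
    "L-L1_LOSC_4_V1-1126259446-32.hdf5",
    "H-H1_LOSC_4_V1-1126259446-32.hdf5",
    "H-H1_LOSC_16_V1-1126259446-32.hdf5",
]

fn_detector = bindings("{Detector}_LOSC_4_V1-1126259446-32.hdf5", FILES)
fn_sampling_rate = bindings("H-H1_LOSC_{DownSampling}_V1-1126259446-32.hdf5", FILES)
```

Here fn_detector binds Detector to L-L1 and H-H1, and fn_sampling_rate binds DownSampling to 4 and 16, matching the two files listed for each input in the URI-recon view.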
Finer-grained Provenance: User Log Files!
Conclusions 1: DataONE Provenance at a Glance
§ ProvONE extensions to W3C PROV
§ New first-class objects in DataONE
  § Figures, Software
§ Extended DataPackage with Provenance
§ New Tools
  § Provenance indexing
  § Web search and browse interface for provenance
  § Matlab tool for generating provenance
  § R tool for generating provenance
  § YesWorkflow tool for modeling provenance from scripts
Conclusions 2: Towards Provenance-for-Self
• Provenance …
– … is key to transparency, reproducibility, comprehensibility
– … comes in many (hybrid) forms (workflow graphs, log files, trace events, ...)
– … is metadata (=> "a love note to the future")
– … should be actionable today (feed both, your IGM & RDM)
• Provenance-for-Self …
– … asks: how does provenance help me get my work done today?
– … is what provenance technologists and tool builders should do more of!
Inside the mind of a master procrastinator (TED Talk by Tim Urban)
Conclusions 3 (other projects)
• SKOPE: system and tools to discover, access, analyze, visualize paleoenvironmental data
– unprecedented ability to explore provenance (detailed, comprehensible record of computational derivation of results)
– for researchers, tinkerers, and modelers
– → SKOPE Poster by Andrew Brown
• Whole Tale:
– leverage & contribute to existing CI to support the whole tale ("living paper"), from workflow run to scholarly publication
– integrate tools & CI (DataONE, EC!?, Globus, iRODS, NDS, ...) to simplify use and promote best practices
– driven by science WGs (archaeology/SKOPE, materials science, astro, bio..)
Time allowing…
… thoughts on reproducibility and provenance ...
PRIMAD (what have you "primed"?)
Dagstuhl Seminar #16041 Report
Outputs = Exec(M, I, P, D) | RO, A
- M = parsimony/bootstrap/..
- I = package XYZ
- P = MacOS ..
- D = (Params, Files)
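A small sketch of the "what have you primed?" question: describe two runs by their PRIMAD components (Platform, Research objective, Implementation, Method, Actor, Data) and report which ones changed. The class Experiment, the function primed, and all field values other than the slide's own examples (parsimony, bootstrap, package XYZ, MacOS) are hypothetical.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Experiment:
    """One run, described by its PRIMAD components."""
    platform: str
    research_objective: str
    implementation: str
    method: str
    actor: str
    data: tuple  # (params, files)


def primed(original, rerun):
    """Names of the PRIMAD components that differ between the two runs."""
    return [f.name for f in fields(original)
            if getattr(original, f.name) != getattr(rerun, f.name)]


original = Experiment(
    platform="MacOS",
    research_objective="infer a tree",        # hypothetical
    implementation="package XYZ",
    method="parsimony",
    actor="original author",                  # hypothetical
    data=(("seed=1",), ("alignment.fasta",)),  # hypothetical params and files
)

# Same study, rerun by the same actor on another platform with another method.
rerun = Experiment(
    platform="Linux",                          # hypothetical
    research_objective=original.research_objective,
    implementation=original.implementation,
    method="bootstrap",
    actor=original.actor,
    data=original.data,
)

changed = primed(original, rerun)
```

In this example changed lists platform and method, so the rerun tests portability and method robustness rather than being an exact repetition.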
Reproducibility Crisis (reprised)
• Successful reproducibility study:
  • increases trust in prior study :-)
  • … but no surprises :-(
• Failed reproducibility study:
  • decreases trust in (or falsifies) prior study :-(
  • … but surprising failure yields new info/knowledge :-)
• Learning from failures!
– Not really a new, revolutionary idea..
– What is a positive vs negative result anyways?
– ... fail early, fail often ...