Querying Provenance Information: Basic Notions and an Example from Paleoclimate Reconstruction (SKOPE Project)

Bertram Ludäscher (contact for Q&A on SKOPE & YesWorkflow), Victoria Stodden (presenting), Kyle Bocinsky, Keith Kintigh, Tim Kohler, Timothy McPhillips, Johnathan Rush

Upload: bertram-ludaescher

Post on 09-Jan-2017



SKOPE Project: Synthesized Knowledge Of Past Environments

Example: Bocinsky, Kohler et al. study rain-fed maize of Anasazi – Four Corners; AD 600–1500. Climate change influenced Mesa Verde Migrations; late 13th century AD. Uses a network of tree-ring chronologies to reconstruct a spatio-temporal climate field at a fairly high resolution (~800 m) from AD 1–2000. Algorithm estimates joint information in tree-rings and a climate signal to identify "best" tree-ring chronologies for climate reconstruction.

… implemented as an R Script …

K. Bocinsky, T. Kohler, A 2000-year reconstruction of the rain-fed maize agricultural niche in the US Southwest. Nature Communications. doi:10.1038/ncomms6618

Data & Code available! .. but is it enough? How to think about; model; query the underlying workflow and data provenance?

[Figure: YesWorkflow graph of the paleoclimate reconstruction script. Node labels, in the order extracted: GetModernClimate; PRISM_annual_growing_season_precipitation; SubsetAllData; dendro_series_for_calibration; dendro_series_for_reconstruction; CAR_Analysis_unique; cellwise_unique_selected_linear_models; CAR_Analysis_union; cellwise_union_selected_linear_models; CAR_Reconstruction_union; raster_brick_spatial_reconstruction; raster_brick_spatial_reconstruction_errors; CAR_Reconstruction_union_output; ZuniCibola_PRISM_grow_prcp_ols_loocv_union_recons.tif; ZuniCibola_PRISM_grow_prcp_ols_loocv_union_errors.tif. Input labels: master_data_directory; prism_directory; tree_ring_data; calibration_years; retrodiction_years.]

YesWorkflow: Yes, scripts (often) are workflows, too!

• R, MATLAB, Python, … scripts "hide" valuable dataflow, unless revealed using a workflow model.

• YesWorkflow tool can be used to model, query underlying workflow and provenance info.

YesWorkflow: Prospective & Retrospective Provenance … (almost) for free!

• Simple YW annotations (@begin, @end, @in, @out, …) in the script (R, Python, MATLAB) are used to recreate the workflow view from the script …

YW!
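As a rough illustration of the annotation idea, the sketch below embeds YW-style comment tags in a small script (held here as a string) and recovers each block with its inputs and outputs. The script text, the wiring of inputs to steps, and the helper `extract_blocks` are illustrative assumptions; only the step and data names are borrowed from the slides. This is not the YesWorkflow tool itself, which has its own parser and supports more tags.

```python
import re

# Hypothetical annotated script. The tags are ordinary comments, so the
# annotated script would still run unchanged. The in/out wiring shown
# here is assumed for illustration, not taken from the real script.
ANNOTATED_SCRIPT = '''
# @begin SubsetAllData
# @in tree_ring_data
# @in calibration_years
# @out dendro_series_for_calibration
# @end SubsetAllData

# @begin CAR_Analysis_union
# @in dendro_series_for_calibration
# @in PRISM_annual_growing_season_precipitation
# @out cellwise_union_selected_linear_models
# @end CAR_Analysis_union
'''

# Matches a comment tag followed by a single name.
TAG = re.compile(r"#\s*@(begin|end|in|out)\s+(\S+)", re.IGNORECASE)


def extract_blocks(source):
    """Return {block_name: {"in": [...], "out": [...]}} from YW-style tags."""
    blocks = {}
    current = None
    for line in source.splitlines():
        match = TAG.search(line)
        if not match:
            continue
        tag, name = match.group(1).lower(), match.group(2)
        if tag == "begin":
            current = name
            blocks[current] = {"in": [], "out": []}
        elif tag == "end":
            current = None
        elif current is not None:
            blocks[current][tag].append(name)
    return blocks


blocks = extract_blocks(ANNOTATED_SCRIPT)
for name, ports in blocks.items():
    print(name, "reads", ports["in"], "writes", ports["out"])
```

Because the tags live in comments, the same approach carries over to R or MATLAB by changing only the comment marker in the pattern.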


Paleoclimate Reconstruction (OpenSKOPE.org)
• … explained using YesWorkflow!

Kyle B., (computational) archaeologist: "It took me about 20 minutes to comment. Less than an hour to learn and YW-annotate, all-told."

YW annotations: Model your Workflow!

YW (prospective) and YW-Recon (retrospective) Provenance

• 1. YW: Annotate Script => YW Model
  – Annotate @BEGIN..@END, @IN, @OUT
  – Visualize, share, be happy :-)

• 2. Run script
  – Files are read and written
  – Folder- & Filenames have metadata

• 3. YW-Recon
  – Use @URI tags that link YW Model <=> Persisted Data
  – Run URI-template queries
    • cf. "ls -R" & RegEx matching

• 4. YW-Query
  – Answer the user's provenance queries
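The "ls -R & RegEx matching" comparison in step 3 can be sketched as follows: a URI template is compiled into a regular expression with one named group per template variable, and every path in a recursive listing is matched against it. The helper names `template_to_regex` and `match_template` and the short listing are assumptions made for illustration; YW-Recon's actual implementation may work differently.

```python
import re


def template_to_regex(template):
    """Compile a URI template such as 'a/{x}/b_{y}.raw' into a regex.

    Each {name} becomes a named group that matches within one path
    segment; a repeated {name} must match the same text again.
    """
    seen = set()
    pattern = ""
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            name = part[1:-1]
            if name in seen:
                pattern += "(?P=%s)" % name  # backreference to first use
            else:
                seen.add(name)
                pattern += "(?P<%s>[^/]+?)" % name
        else:
            pattern += re.escape(part)
    return re.compile("^" + pattern + "$")


def match_template(template, paths):
    """Return the variable bindings for every path that fits the template."""
    regex = template_to_regex(template)
    return [m.groupdict() for m in map(regex.match, paths) if m]


RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"

# A few paths, as a recursive directory listing might report them.
listing = [
    "run/run_log.txt",
    "run/raw/q55/DRT240/e10000/image_001.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/collected_images.csv",
]

bindings = match_template(RAW_TEMPLATE, listing)
for row in bindings:
    print(row)
```

Each resulting dictionary is one row of reconstructed retrospective provenance: the metadata that the script author "left behind" in folder and file names.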


Blurring the line between generic provenance questions and science questions ...

• What version of GDAL (Geospatial Data Abstraction Library) was used?
• What were the files and parameters used as inputs to the scripts used?
• What geographic regions and years are covered by the PaleoCAR input?
• Are any regions in the processed data not covered by the input data?
• For each value displayed in the graphs or downloaded from the web app:
  – Is it the exact value output by PaleoCAR for the 30" x 30" region containing the marker?
  – Or is it a value interpolated from multiple values in the PaleoCAR output?
  – If interpolated, what are the values and corresponding coordinates for the points used in the interpolation?
  – What formula or curve-fitting algorithm was used for performing the interpolation?
• What are the estimated errors in the input data to data processing (and inputs to PaleoCAR) that result in these estimated errors?
• …

Blurring the line between generic provenance questions and science questions ...

• …
• Interesting, provenance-based question for a reconstruction technique like PaleoCAR:
  – What tree-ring chronologies/species were selected for a particular reconstruction (say, summer temperature)?
• Such information can reveal local climate patterns or long-range climate teleconnections.
• => if a researcher mistrusts a particular tree-ring chronology, they might be interested in what (geographic and temporal) portions of a reconstruction are influenced by the suspect chronology (if any).
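The "suspect chronology" question can be pictured as a simple lookup over retrospective provenance. The sketch below assumes that, for each grid cell and time window, the set of chronologies used by the selected model has been recorded. Every identifier and value in it (`cell_001`, `chron_A`, the year ranges, the function `influenced_by`) is invented for illustration and does not come from PaleoCAR or its outputs.

```python
from collections import defaultdict

# Hypothetical provenance records: (grid cell, (first year, last year))
# mapped to the tree-ring chronologies used by the model selected there.
selected_chronologies = {
    ("cell_001", (1, 500)): {"chron_A", "chron_B"},
    ("cell_001", (501, 2000)): {"chron_B"},
    ("cell_002", (1, 2000)): {"chron_C"},
    ("cell_003", (1000, 2000)): {"chron_A", "chron_C"},
}


def influenced_by(suspect, provenance):
    """Map each cell to the time windows whose model used the suspect."""
    hits = defaultdict(list)
    for (cell, window), chronologies in provenance.items():
        if suspect in chronologies:
            hits[cell].append(window)
    return dict(hits)


influenced = influenced_by("chron_A", selected_chronologies)
for cell, windows in influenced.items():
    print(cell, windows)
```

The result names the geographic (cell) and temporal (window) portions of the reconstruction that a mistrusted chronology touches; an empty result would mean it influenced nothing.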

Executive Summary

• Research papers explain findings, methods; increasingly link to data & code (& exec environment => Whole Tale)
• Prospective provenance (= workflow definition) and retrospective provenance (e.g. data lineage) for script-based computational studies (R, Python, MATLAB, …) can be combined to support powerful hybrid provenance queries.
  – Provenance isn't just metadata for others: "provenance-for-self" queries can be used by researchers during the studies.
• YesWorkflow tool can be used to model prospective provenance, combine with and query retrospective provenance
• SKOPE project provides rich use cases for "deep" (science-oriented) provenance queries.

Provenance Support for Reproducible Science
Use Case: Paleoclimate Reconstruction

Science paper (OA) uses:
• open source PaleoCAR model in R
• But what was the "workflow"?
• Is there prospective and/or retrospective provenance?
• => YesWorkflow tool can help!

SUPPLEMENTARY MATERIAL

run/
├── raw
│   └── q55
│       ├── DRT240
│       │   ├── e10000
│       │   │   ├── image_001.raw
│       │   │   ...
│       │   │   └── image_037.raw
│       │   └── e11000
│       │       ├── image_001.raw
│       │       ...
│       │       └── image_037.raw
│       └── DRT322
│           ├── e10000
│           │   ├── image_001.raw
│           │   ...
│           │   └── image_030.raw
│           └── e11000
│               ├── image_001.raw
│               ...
│               └── image_030.raw
├── data
│   ├── DRT240
│   │   ├── DRT240_10000eV_001.img
│   │   ...
│   │   └── DRT240_11000eV_037.img
│   └── DRT322
│       ├── DRT322_10000eV_001.img
│       ...
│       └── DRT322_11000eV_030.img
├── collected_images.csv
├── rejected_samples.txt
└── run_log.txt

YW-RECON: Prospective & Retrospective Provenance … (almost) for free!

[Figure: YesWorkflow graph of the data collection script, with URI templates attached to file-backed data. Labels as extracted, grouped by step:]
Inputs: cassette_id; sample_score_cutoff; sample_spreadsheet (file:cassette_{cassette_id}_spreadsheet.csv); calibration_image (file:calibration.img)
initialize_run: run_log (file:run/run_log.txt)
load_screening_results: sample_name, sample_quality
calculate_strategy: rejected_sample, accepted_sample, num_images, energies
log_rejected_sample: rejection_log (file:/run/rejected_samples.txt)
collect_data_set: sample_id, energy, frame_number, raw_image (file:run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw)
transform_images: corrected_image (file:data/{sample_id}/{sample_id}_{energy}eV_{frame_number}.img), total_intensity, pixel_count, corrected_image_path
log_average_image_intensity: collection_log (file:run/collected_images.csv)

• URI-templates link conceptual entities to runtime provenance "left behind" by the script author …
• … facilitating provenance reconstruction

Q: Where is the raw image of the corrected image DRT322_11000ev_030.img?

[Figure: the YesWorkflow graph with URI templates and the run/ directory listing from the previous slide, repeated.]
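One way to picture how this question could be answered from file names alone: read the template variables out of the corrected image's name, substitute them into the raw-image template, and look for matching paths in the directory listing. The regular expression `CORRECTED`, the function `raw_images_for`, and the abbreviated listing are assumptions for this sketch; `cassette_id` does not occur in the corrected image's name, so it is left as a wildcard.

```python
import fnmatch
import re

# Pattern for names like DRT322_11000eV_030.img (the slide writes "ev",
# so the match ignores case).
CORRECTED = re.compile(
    r"(?P<sample_id>[^/_]+)_(?P<energy>\d+)eV_(?P<frame_number>\d+)\.img$",
    re.IGNORECASE,
)

RAW_TEMPLATE = "run/raw/{cassette_id}/{sample_id}/e{energy}/image_{frame_number}.raw"


def raw_images_for(corrected_name, listing):
    """Return listed raw image paths that share the corrected image's variables."""
    match = CORRECTED.search(corrected_name)
    if match is None:
        return []
    # cassette_id is unknown here, so it becomes a shell-style wildcard.
    pattern = RAW_TEMPLATE.format(cassette_id="*", **match.groupdict())
    return fnmatch.filter(listing, pattern)


# Abbreviated listing of the run/ directory.
listing = [
    "run/raw/q55/DRT240/e11000/image_037.raw",
    "run/raw/q55/DRT322/e10000/image_030.raw",
    "run/raw/q55/DRT322/e11000/image_030.raw",
    "run/data/DRT322/DRT322_11000eV_030.img",
]

answer = raw_images_for("DRT322_11000ev_030.img", listing)
print(answer)
```

The shared variables `sample_id`, `energy` and `frame_number` act as the join key between the two templates, which is what lets the conceptual link raw_image -> corrected_image be resolved to concrete files.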

Get 3 views for the price of 1!

[Figure: three YesWorkflow renderings of the same script (main): process view, data view, combined view. Step labels: fetch_mask, load_data, standardize_with_mask, simple_diagnose. Data labels: input_mask_file, input_data_file, land_water_mask, NEE_data, standardized_NEE_data, result_NEE_pdf.]
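A minimal sketch of how the three views can come from one model: given each step's inputs and outputs, the combined view keeps both steps and data as nodes, the process view keeps only steps (edges labelled by the data passed), and the data view keeps only data (edges labelled by the step that derives one item from another). The in/out wiring in `STEPS` is my reading of the figure labels and should be treated as an assumption; the function names are illustrative.

```python
# Assumed wiring of the example script, read off the figure labels.
STEPS = {
    "fetch_mask": {"in": ["input_mask_file"], "out": ["land_water_mask"]},
    "load_data": {"in": ["input_data_file"], "out": ["NEE_data"]},
    "standardize_with_mask": {
        "in": ["land_water_mask", "NEE_data"],
        "out": ["standardized_NEE_data"],
    },
    "simple_diagnose": {"in": ["standardized_NEE_data"], "out": ["result_NEE_pdf"]},
}


def combined_view(steps):
    """Edges data -> step and step -> data (both node kinds shown)."""
    edges = []
    for step, ports in steps.items():
        edges += [(data, step) for data in ports["in"]]
        edges += [(step, data) for data in ports["out"]]
    return edges


def process_view(steps):
    """Edges (producer step, data passed, consumer step)."""
    edges = []
    for producer, p_ports in steps.items():
        for consumer, c_ports in steps.items():
            for data in p_ports["out"]:
                if data in c_ports["in"]:
                    edges.append((producer, data, consumer))
    return edges


def data_view(steps):
    """Edges (input data, deriving step, output data)."""
    return [
        (src, step, dst)
        for step, ports in steps.items()
        for src in ports["in"]
        for dst in ports["out"]
    ]


for edge in process_view(STEPS):
    print("%s --[%s]--> %s" % edge)
```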

Provenance in Action: DataONE Project

A DataONE search (here: "grass") yields different packages with provenance

DataONE: Support for Provenance
• Yaxing's script with inputs & output products
• Christopher's YesWorkflow model
• Christopher using Yaxing's outputs as inputs for his script
• Christopher's results can be traced back all the way to Yaxing's input

Multi-Scale Synthesis and Terrestrial Model Intercomparison Project (MsTMIP)

Christopher Schwalm, Yaxing Wei

[Figure: YesWorkflow graph of a drought analysis script. Labels as extracted, grouped by step:]
Inputs: input_drough_variable; input_effect_variable
fetch_drought_variable: drought_variable_1
fetch_effect_variable: effect_variable_1
convert_effect_variable_units: effect_variable_2
create_land_water_mask: land_water_mask
init_data_variables: predrought_effect_variable_1, drought_value_variable_1, recovery_time_variable_1, drought_number_variable_1
define_droughts: sigma_dv_event, month_dv_length
detrend_deseasonalize_effect_variable: effect_variable_3
calculate_data_variables: recovery_time_variable_2, drought_value_variable_2, predrought_effect_variable_2, drought_number_variable_2
export_recovery_time_figure: output_recovery_time_figure
export_drought_value_variable_figure: output_drought_value_variable_figure
export_predrought_effect_variable_figure: output_predrought_effect_variable_figure
export_drought_number_variable_figure: output_drought_number_figure

[Figure: noWorkflow trace graph of one run of simulate_data_collection. Nodes are Python function calls and variable assignments, labelled with script line numbers (for example 50 spreadsheet_rows, 61 calculate_strategy, 92 collect_next_image, 106 transform_image, 121 writer.writerow), plus the files read or written: cassette_q55_spreadsheet.csv, calibration.img, run/run_log.txt, run/rejected_samples.txt, run/collected_images.csv, and pairs such as run/raw/q55/DRT240/e10000/image_001.raw and run/data/DRT240/DRT240_10000eV_001.img for samples DRT240 and DRT322.]

noWorkflow: not only Workflow!

• Scripts have provenance, too!

• Transparently capture some/all provenance from Python script runs.

• Use filter queries to "zoom" into relevant parts ..

noWorkflow lineage of an image file

Provenance information about Python function calls, variable assignments, etc.

[Figure: filtered noWorkflow lineage graph for run/data/DRT240/DRT240_11000eV_002.img. Node labels:]
simulate_data_collection
230 parser = <optparse.OptionParser object at 0x7fcb6e16e3c8>
251 parse_args = (<Values at 0x7fcb6cbe15c ... cutoff': 12.0}>, ['q55'])
251 args = ['q55']
251 options = <Values at 0x7fcb6cbe15c0 ... ple_score_cutoff': 12.0}>
24 cassette_id = 'q55'
24 sample_score_cutoff = 12.0
24 data_redundancy = 0.0
24 calibration_image_file = 'calibration.img'
49 str.format
49 sample_spreadsheet_file = 'cassette_q55_spreadsheet.csv'
50 spreadsheet_rows(sample_spreadsheet_file)
50 sample_name = 'DRT240'
50 sample_quality = 45
61 calculate_strategy = ('DRT240', None, 2, [10000, 11000, 12000])
61 accepted_sample = 'DRT240'
61 num_images = 2
61 energies = [10000, 11000, 12000]
91 sample_id = 'DRT240'
92 collect_next_image(casset ... _{frame_number:03d}.raw')
92 energy = 11000
92 frame_number = 2
92 raw_image_file = 'run/raw/q55/DRT240/e11000/image_002.raw'
106 str.format
106 transform_image = (980, 10, 'run/data/DRT240/DRT240_11000eV_002.img')
calibration.img
run/data/DRT240/DRT240_11000eV_002.img

$ now dataflow -f "run/data/DRT240/DRT240_11000eV_002.img"

$(NW_FILTERED_LINEAGE_GRAPH).gv: $(NW_FACTS)
    now helper df_style.py
    now dataflow -v 55 -f $(RETROSPECTIVE_LINEAGE_VALUE) -m simulation | python df_style.py -d BT -e > $(NW_FILTERED_LINEAGE_GRAPH).gv

.. auto-"make" this!

[Figure: two lineage queries over the same script. YesWorkflow: the full graph of simulate_data_collection and a lineage subgraph ending in corrected_image (steps load_screening_results, calculate_strategy, collect_data_set, transform_images). noWorkflow: the full Python trace and the filtered lineage of run/data/DRT240/DRT240_11000eV_002.img, as on the preceding slides.]

YesWorkflow: Conceptual workflow model

noWorkflow: Python trace model

Need to bridge this gap with a shared model

Would like to use YW model to query NW data!

Habemus Pons! We've got the Bridge!

Lineage of image file in terms of YW model, with details from NW provenance

C3-C4 Prospective Provenance

[Figure: YesWorkflow graph of C3_C4_map_present_NA. Labels as extracted, grouped by step:]
Inputs: SYNMAP_land_cover_map_data (inputs/land_cover/SYNMAP_NA_QD.nc); mean_airtemp (file:inputs/narr_air.2m_monthly/air.2m_monthly_{start_year}_{end_year}_mean.{month}.nc); mean_precip (file:inputs/narr_apcp_rescaled_monthly/apcp_monthly_{start_year}_{end_year}_mean.{month}.nc)
fetch_SYNMAP_land_cover_map_variable: lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable
fetch_monthly_mean_air_temperature_data: Tair_Matrix
fetch_monthly_mean_precipitation_data: Rain_Matrix
initialize_Grass_Matrix: Grass_variable
examine_pixels_for_grass: C3_Data, C4_Data
generate_netcdf_file_for_C3_fraction: C3_fraction_data (file:outputs/SYNMAP_PRESENTVEG_C3Grass_RelaFrac_NA_v2.0.nc)
generate_netcdf_file_for_C4_fraction: C4_fraction_data (file:outputs/SYNMAP_PRESENTVEG_C4Grass_RelaFrac_NA_v2.0.nc)
generate_netcdf_file_for_Grass_fraction: Grass_fraction_data (file:outputs/SYNMAP_PRESENTVEG_Grass_Fraction_NA_v2.0.nc)

Upstream Lineage of C3_fraction_data

[Figure: subgraph of C3_C4_map_present_NA upstream of C3_fraction_data. Steps: fetch_SYNMAP_land_cover_map_variable, fetch_monthly_mean_precipitation_data, fetch_monthly_mean_air_temperature_data, examine_pixels_for_grass, generate_netcdf_file_for_C3_fraction. Data: SYNMAP_land_cover_map_data, mean_airtemp, mean_precip, lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable, Rain_Matrix, Tair_Matrix, C3_Data, C3_fraction_data.]
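An upstream lineage query of this kind amounts to a backward traversal of the workflow graph. The sketch below uses the step and data names from the C3-C4 slides; the exact in/out wiring in `STEPS` is an assumption, chosen so that the traversal reproduces the node sets shown in the upstream figures (for instance, `initialize_Grass_Matrix` is given no inputs because none appear upstream of it). The function `upstream` is illustrative and is not YesWorkflow's query interface.

```python
# Lat/lon variables produced by the land cover fetch step.
GRID = ["lon_variable", "lat_variable", "lon_bnds_variable", "lat_bnds_variable"]

# Assumed wiring of C3_C4_map_present_NA (see the note above).
STEPS = {
    "fetch_SYNMAP_land_cover_map_variable": {
        "in": ["SYNMAP_land_cover_map_data"], "out": GRID},
    "fetch_monthly_mean_air_temperature_data": {
        "in": ["mean_airtemp"], "out": ["Tair_Matrix"]},
    "fetch_monthly_mean_precipitation_data": {
        "in": ["mean_precip"], "out": ["Rain_Matrix"]},
    "initialize_Grass_Matrix": {"in": [], "out": ["Grass_variable"]},
    "examine_pixels_for_grass": {
        "in": ["Tair_Matrix", "Rain_Matrix"], "out": ["C3_Data", "C4_Data"]},
    "generate_netcdf_file_for_C3_fraction": {
        "in": ["C3_Data"] + GRID, "out": ["C3_fraction_data"]},
    "generate_netcdf_file_for_C4_fraction": {
        "in": ["C4_Data"] + GRID, "out": ["C4_fraction_data"]},
    "generate_netcdf_file_for_Grass_fraction": {
        "in": ["Grass_variable"] + GRID, "out": ["Grass_fraction_data"]},
}


def upstream(target, steps):
    """Return every step and data item that the target depends on."""
    producers = {d: s for s, ports in steps.items() for d in ports["out"]}
    seen = set()
    stack = [target]
    while stack:
        node = stack.pop()
        if node in steps:            # a step depends on its inputs
            parents = steps[node]["in"]
        elif node in producers:      # a data item depends on its producer
            parents = [producers[node]]
        else:                        # a workflow input has no parents
            parents = []
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


for name in sorted(upstream("Grass_fraction_data", STEPS)):
    print(name)
```

Running the same function on `C3_fraction_data` returns the larger subgraph that includes the temperature and precipitation inputs but not `initialize_Grass_Matrix`.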

Upstream of Grass_fraction_data!

[Figure: subgraph of C3_C4_map_present_NA upstream of Grass_fraction_data. Steps: initialize_Grass_Matrix, fetch_SYNMAP_land_cover_map_variable, generate_netcdf_file_for_Grass_fraction. Data: SYNMAP_land_cover_map_data, Grass_variable, lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable, Grass_fraction_data.]

Hybrid Provenance Graph

[Figure: the C3_C4_map_present_NA graph with file-backed data nodes resolved to the concrete files found for the run. Steps and intermediate data are as in the prospective graph; resolved data nodes:]

[data7] SYNMAP_land_cover_map_data
inputs/land_cover/SYNMAP_NA_QD.nc

[data12] mean_airtemp
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.4.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.8.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.12.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.5.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.9.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.2.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.1.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.6.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.10.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.3.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.7.nc
inputs/narr_air.2m_monthly/air.2m_monthly_2000_2010_mean.11.nc

[data14] mean_precip
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.10.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.3.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.7.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.11.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.4.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.8.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.1.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.12.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.5.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.9.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.2.nc
inputs/narr_apcp_rescaled_monthly/apcp_monthly_2000_2010_mean.6.nc

[data19] C3_fraction_data
outputs/SYNMAP_PRESENTVEG_C3Grass_RelaFrac_NA_v2.0.nc

[data20] C4_fraction_data
outputs/SYNMAP_PRESENTVEG_C4Grass_RelaFrac_NA_v2.0.nc

[data21] Grass_fraction_data
outputs/SYNMAP_PRESENTVEG_Grass_Fraction_NA_v2.0.nc

Hybrid Provenance Graph: upstream of Grass_fraction_data file!

[Figure: hybrid subgraph of C3_C4_map_present_NA upstream of Grass_fraction_data. Steps: initialize_Grass_Matrix, fetch_SYNMAP_land_cover_map_variable, generate_netcdf_file_for_Grass_fraction. Intermediate data: Grass_variable, lon_variable, lat_variable, lon_bnds_variable, lat_bnds_variable. Resolved data nodes:]

[data7] SYNMAP_land_cover_map_data
inputs/land_cover/SYNMAP_NA_QD.nc

[data21] Grass_fraction_data
outputs/SYNMAP_PRESENTVEG_Grass_Fraction_NA_v2.0.nc

Demo Time