1
Tracking Metadata and Lineage of the Data Processing Chain
for Mapping Snow Cover Properties with the NASA MODIS
James Frew1, Thomas H. Painter2, Peter Slaughter1, Jeff Dozier1
1Donald Bren School of Environmental Science and Management, University of California, Santa Barbara
2National Snow and Ice Data Center, University of Colorado, Boulder
2
Outline
Motivation
– Snow mapping product
– Implications for hydrologic modeling
Lineage Capture
– Wrapping: the ESSW experience
– Instrumenting, overriding, monitoring: the (ongoing) ES3 experience
3
MODIS image – Sierra Nevada
EOS Terra MODIS
07 March 2004
MOD09 Surface Reflectance
0.555 0.645 0.858
4
Snow-covered area and grain size
5
Hindu Kush
2003 DOY 070
6
Colorado Rockies CLPX
13 March 2002
7
Model structure: MODIS snow-area / albedo
[Diagram labels: basin mask; processing lineage; watershed info; MODIS cloud mask (48 bits); MODIS 7 land bands (112 bits); MODIS quality flags; topography; MODIS snow cover and grain size; MODIS view angles; solar zenith, azimuth; snow fraction; albedo RMS error; veg fraction; soil fraction; shade fraction; open water fraction; quality flag]
8
Lineage Capture, Take 1
The ESSW experience
9
Using Existing Science Applications
No “standard” Earth science computing environment
– commercial packages (ArcInfo, MATLAB, …)
– public packages/models (MM5, MODTRAN, …)
– locally-developed codes
– arbitrary combinations of these
Example: SST from AVHRR
– commercial, standalone programs
– parameters highly customized for UCSB
How do we get these programs to communicate and cooperate with ESSW, without rewriting them?
[Diagram labels: Navigate (Manual/Automatic); Receive; Ingest and Calibrate; Rectify; Sea Surface Temp (SST); SST Maps]
10
Lineage: Current Best Practice
11
Earth System Science Workbench (ESSW)
Producer and consumer issues can both be addressed by a laboratory metaphor
Experiment
– Network of models …
– ingesting / synthesizing data …
– generating products
Laboratory
– Experiment execution environment
  – Computing + storage = accessibility + scalability
Lab Notebook
– Persistent storage that can be queried
– Keeps track of all experiments
  – Documentation + lineage = accountability
12
Wrap Your App: Scripts Talk to ESSW
No changes, just additions
Wrapper scripts
– Make program (groups) look like ESSW experiments
– Use Perl API
Lab Notebook daemon
– Accepts API commands
– Creates XML documents
– Sends to database
ESSW database
– XML metadata & DTDs
– Tabular metadata
  – XML search terms
  – Lineage links
[Diagram labels: Navigate (Manual/Automatic); Receive; Ingest and Calibrate; Rectify; Sea Surface Temp (SST); SST Maps; ESSW Database; Perl API; Lab Notebook daemon; XML + SQL; MySQL; JDBC; Java; Perl]
13
ESSW Metadata management
Lab Notebook daemon verifies XML metadata document
Experiment step metadata stored for product lineage tracking
Complete metadata document stored in custom database table
– XML DTD ← 1:1 → database table
– (n+1)th column is document itself
Some metadata values extracted into database tables
– DTD contains column names and types for some elements
– Always save all the XML, even if we don’t know how to “columnize” all of it
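The storage scheme on this slide (some values pulled into columns, with the whole document kept in an extra column) can be sketched as follows. This is a minimal illustration in Python with SQLite, not the ESSW implementation, which used Perl, Java, and MySQL. The column map, the sample document, and its values are hypothetical, and the sketch only checks that the XML is well-formed rather than validating it against a DTD.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical column map standing in for what the DTD would declare:
# element path -> (column name, SQL type).
COLUMNS = {
    "scene_id/satellite": ("satellite", "TEXT"),
    "scene_id/pass_date": ("pass_date", "TEXT"),
}


def create_table(db, table):
    """Create n extracted columns plus an (n+1)th column for the document."""
    cols = ", ".join(f"{name} {sqltype}" for name, sqltype in COLUMNS.values())
    db.execute(f"CREATE TABLE {table} ({cols}, document TEXT)")


def store(db, table, xml_text):
    """Extract the known values, then save them alongside the full XML."""
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    values = [root.findtext(path) for path in COLUMNS]
    placeholders = ", ".join("?" for _ in range(len(values) + 1))
    db.execute(f"INSERT INTO {table} VALUES ({placeholders})", values + [xml_text])


db = sqlite3.connect(":memory:")
create_table(db, "avhrr_sst")

# Made-up sample document; <note> has no column but is still preserved.
doc = (
    "<avhrr_sst><scene_id><satellite>NOAA-14</satellite>"
    "<pass_date>2000-01-01</pass_date></scene_id>"
    "<note>not columnized</note></avhrr_sst>"
)
store(db, "avhrr_sst", doc)
row = db.execute("SELECT satellite, document FROM avhrr_sst").fetchone()
```

Elements without a column, such as `<note>` here, remain retrievable from the `document` column, which is the point of always saving all the XML.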
14
Wrapper Example: Input Dataset
[Diagram: avhrr_l1b (AVHRR Level 1B product) → avhrr_sstModel (multi-channel sea surface temperature algorithm) → avhrr_sst (sea surface temperature, SST)]
# SST experiment wrapper
# $L1B is the input Level 1B AVHRR image file
# $SST is the output SST image file

# run legacy command "nitpix": creates SST image from L1B image
$base_temp = 5.0;
$temp_step = 0.1;
...
system("nitpix base_temp=$base_temp temp_step=$temp_step ... $L1B $SST");

# start recording ESSW metadata
beginXMLBld($ENV{USER}, "PRODUCTION");

# get metadata for input file
$L1B_ID = findSciObjFromFile($L1B);
15
Wrapper Example: Output Dataset
# create metadata for SST image
$SST_ID = createMetadata("avhrr_sst");
addValue($SST_ID, "avhrr_sst.scene_id.satellite", $satellite);
addValue($SST_ID, "avhrr_sst.scene_id.pass_date", $pass_date);
...
saveToDB($SST_ID, "avhrr_sst");
closeMetadata($SST_ID);
saveDigest($SST, $SST_ID);
16
Wrapper Example: Process
# create metadata for SST experiment
$exp = createExperimentMetadata("avhrr_sstModel");
$exp_step = createExpStepMetadata($exp, "avhrr_sstExpStp");
addValue($exp_step, "avhrr_sstExpStp.base_temp", $base_temp);
addValue($exp_step, "avhrr_sstExpStp.temp_step", $temp_step);
...
saveToDB($exp_step, "avhrr_sstExpStp");
closeMetadata($exp_step);

# connect input and output images to experiment
registerExperimentInputs($exp, $L1B_ID);
registerExperimentOutputs($exp, $SST_ID);

# finish recording ESSW metadata
endXMLBld();
17
Wrapper Example: Lineage Links
# connect input and output images to experiment
registerExperimentInputs($exp, $L1B_ID);
registerExperimentOutputs($exp, $SST_ID);

# finish recording ESSW metadata
endXMLBld();
18
Process graph reconstructed from ESSW database
19
ESSW Lessons
Providers are customers
– ESIPs aren’t much good unless scientists are happy to put information in them
A light touch is the right touch
– Wrapping is easier for scientists and their programmers to deal with than complete re-engineering
Scientists do write scripts, but not necessarily Perl
– Scripting (gluing stuff together) comes naturally to scientists
Scientists don’t write DTDs
Nobody calls metadata APIs
ESSW was automatic, but not automatic enough…
20
Lineage Capture, Take 2
The ES3 experience
21
ES3 : Earth System Science Server
[Diagram: ES3 components
– storage bricks (shown twice): cheap server + RAID 5 controller, a mirror cheap server + RAID 5 controller, and a Back Up Brick (BUB); paths labeled write, read, and read (backup)
– ESSW++ data lineage tracking; BUB data storage; ROCKS processing clusters
– Alexandria Digital Library; Microsoft TerraServer; MODster; OpenDAP
– MODIS; Corona; AVHRR
– Watershed-scale snow product; Global-scale snow product]
22
From ESSW to ES3: Summary
Perl wrappers → “Probulators”
Perl API → web services + XML messages
MySQL → XML database(s)
23
From Wrappers to Probulators
Wrappers: Active Lineage
+ Complete control over what gets recorded
+ Single language/API for all wrapped events
+ Not tied to execution
  – You can even lie about what happened
− Must explicitly script everything
− Scripts can drift from reality
  – You can even lie about what happened
24
From Wrappers to Probulators
Probulators: Passive Lineage
+ Record what actually happened
  – Not just what you think happened
  – Not what didn’t happen
+ Automatic: don’t have to write new scripts for everything
− Different flavors for different environments
  – Can’t just do everything in Perl…
25
Probulator patterns
Instrumentation
– Insert lineage capture instructions directly into science codes
  – e.g. “I just created file ‘foo’”
– Typical implementation: preprocessor/precompiler
Overriding
– Replace standard routines/libraries with lineage-capturing versions
  – e.g. open(…) → snoopy_open(…)
– Typical implementation: modify execution environment
  – environment variables
  – configuration files
Passive monitoring
– Trace program execution
  – e.g. “called open() with args foo, bar, …”
– Typical implementation: strace’d shell
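The overriding pattern can be illustrated with a small Python sketch that temporarily swaps the built-in `open` for a lineage-capturing version. This is an analogue for illustration only: ES3 is described as overriding routines through the execution environment (environment variables, configuration files), whereas this sketch patches the routine in-process. The name `snoopy_open` is borrowed from the slide; the event format and `science_code` are hypothetical.

```python
import builtins
import contextlib
import os
import tempfile


@contextlib.contextmanager
def snoopy_open(log):
    """Temporarily replace builtins.open with a version that records each call."""
    real_open = builtins.open

    def logging_open(file, mode="r", *args, **kwargs):
        handle = real_open(file, mode, *args, **kwargs)
        # Record only after the open succeeded, so the log reflects what happened.
        # Simplification: any mode that can write ("w", "a", "x", "+") counts as write.
        access = "write" if any(c in mode for c in "wax+") else "read"
        log.append({"routine": "open", "file": str(file), "access": access})
        return handle

    builtins.open = logging_open
    try:
        yield log
    finally:
        builtins.open = real_open  # always restore the standard routine


def science_code(path):
    """Unmodified 'science code': it just calls open() as usual."""
    with open(path, "w") as f:
        f.write("42\n")
    with open(path) as f:
        return f.read()


events = []
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "foo")
    with snoopy_open(events):
        result = science_code(target)
```

The science code is unchanged; only the environment it runs in differs, which is what separates overriding from instrumentation.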
26
ES3 Lineage Architecture
probulator1
probulatorn
logger transmitter ES3 core
logfiles
27
Probulating IDL: Instrumenting the code
;edit
pro modscag_cleanse,prefix=prefix,ns=ns,nl=nl
HELP, NAMES="*", OUTPUT=ES3_ENVIROMENT & ES3_LOG, $
  ENTER="modscag_cleanse", ENVIROMENT=ES3_ENVIROMENT

; clean up {under,over}flow of MODSCAG run
;
; Input: prefix = prefix for all of the MODSCAG output filenames
;        ns = number of samples
;        nl = number of lines
; Output: rewrite of the MODSCAG files
;
; t.h.painter / 1.19.2005

; open snow file
ES3_openr,1,string(prefix,'snow.pic')
snow=fltarr(ns,nl)
readu,1,snow

[ blah blah blah ]

HELP, NAMES="*", OUTPUT=ES3_ENVIROMENT & ES3_LOG, LEAVE="modscag_cleanse", $
  ENVIROMENT=ES3_ENVIROMENT
END ; modscag_cleanse
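The enter/leave bracketing inserted into the IDL routine above has a rough Python analogue in a decorator that logs the region name and its arguments on entry and logs again on exit. This is a sketch under stated assumptions, not the ES3 mechanism: the slides describe a preprocessor inserting the calls, while a decorator is applied by hand. The `es3_region` name, the `LOG` list, and the placeholder routine body are all hypothetical.

```python
import functools
import inspect

LOG = []  # stand-in for the ES3 logger; a real probulator would write a logfile


def es3_region(func):
    """Log an 'enter' event with the call's arguments, then a 'leave' event."""
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        bound.apply_defaults()
        LOG.append({"event": "enter", "region": func.__name__,
                    "environment": dict(bound.arguments)})
        try:
            return func(*args, **kwargs)
        finally:
            # Runs on normal return and on exceptions, so every enter has a leave.
            LOG.append({"event": "leave", "region": func.__name__})

    return wrapper


@es3_region
def modscag_cleanse(prefix="", ns=0, nl=0):
    """Placeholder body; the real routine rewrites MODSCAG output files."""
    return ns * nl


cells = modscag_cleanse(prefix="run1_", ns=2, nl=2)
```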
28
Probulating IDL: Results
<init time="20050522T234606Z" pid="31002" stime="20050522T234604Z" pstime="20050522T234256Z" ppid="30920" language="idl" user="haavar" hostname="spitting-duck.bren.ucsb.edu">
  <enviroment>
    <variable name="!PATH" value="/home/haavar/probulator//idl:/home/rsi/idl_6.1/lib/hook:
    […]
  </enviroment>
  <mount-points>
    <mount share="dab15:/ed15/rsi" type="nfs">/home/rsi</mount>
  </mount-points>
</init>
<enter region="modscag_cleanse">
  <enviroment>
    <variable type="INT" name="NL" value="2"/>
    <variable type="INT" name="NS" value="2"/>
    […]
  </enviroment>
</enter>
<exec time="20050522T234610Z" routine="OPENR">
  <io>
    <file read="true">/home/haavar/painter/data/tillsnow.pic</file>
  </io>
</exec>
[…]
29
Probulating bash: Passive Monitoring
cat /etc/passwd | grep haavar | sed -n 's/\(.*:\)\{2\}\([0-9]\+\).*/\2/p'
25232 1138336174.480079 open("/etc/ld.so.cache", O_RDONLY) = 3
25232 1138336174.480215 open("/lib/libm.so.6", O_RDONLY) = 3
[…]
25234 1138336178.887267 dup2(3, 255) = 255
25234 1138336178.887912 pipe([3, 4]) = 0
25234 1138336178.888257 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25235
25235 1138336178.889366 dup2(4, 1) = 1
25235 1138336178.889975 pipe([3, 4]) = 0
25235 1138336178.890326 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25236
25235 1138336178.891260 pipe([4, 5]) = 0
25235 1138336178.891756 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25237
25235 1138336178.892753 clone(child_stack=0, […], child_tidptr=0xb7f2e708) = 25238
25238 1138336178.894266 dup2(4, 0) = 0
25236 1138336178.894726 dup2(4, 1) = 1
25237 1138336178.894763 dup2(3, 0) = 0
25237 1138336178.895581 dup2(5, 1) = 1
[…]
25238 1138336178.897006 execve("/bin/sed", ["sed", "-n", "s/\\(.*:\\)\\{2\\}\\([0-9]\\+\\).*/\\2/p"], ["HOSTNAME=rubber-duck.bren.ucsb.edu", "TERM=xterm-color", […]
25236 1138336178.900117 execve("/bin/cat", ["cat", "/etc/passwd"], […]
25237 1138336178.903342 execve("/bin/grep", ["grep", "haavar"], […]
30
Probulating bash: Results
[… <init> same as IDL …]
<exec time="20060027T042938.900117Z" routine="/bin/cat" pid="25236" ppid="25235">
  <arguments>
    <argument>/etc/passwd</argument>
  </arguments>
  <io>
    <pipe read="true" id="std-in"/>
    <pipe write="true" id="3"/>
    <pipe write="true" id="std-err"/>
    <file read="true">/etc/ld.so.cache</file>
    […]
    <file read="true">/etc/passwd</file>
  </io>
</exec>
<exec time="20060027T042938.903342Z" routine="/bin/grep" pid="25237" ppid="25235">
  <arguments>
    <argument>haavar</argument>
  </arguments>
  <io>
    <pipe read="true" id="3"/>
    <pipe write="true" id="4"/>
    […]
  </io>
</exec>
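Turning trace lines into exec records like those above can be sketched in Python as a small parser for `execve` lines. This is an illustration, not the ES3 bash probulator: it assumes trace lines shaped like the ones on the preceding slide (pid, epoch timestamp, call), handles only `execve`, leaves backslash escapes in arguments undecoded, and uses a hypothetical record layout and timestamp format.

```python
import re
from datetime import datetime, timezone

# pid, epoch timestamp, then: execve("path", [
HEAD = re.compile(r'^(\d+)\s+(\d+\.\d+)\s+execve\("([^"]+)",\s*\[')
# One double-quoted argument (escapes allowed), followed by "," or the closing "]".
ARG = re.compile(r'\s*"((?:[^"\\]|\\.)*)"\s*([,\]])')


def parse_execve(line):
    """Return an exec record for an execve trace line, or None for other lines."""
    head = HEAD.match(line)
    if head is None:
        return None
    pid, stamp, path = head.groups()
    argv, pos = [], head.end()
    while True:
        arg = ARG.match(line, pos)
        if arg is None:
            break  # truncated or unexpected input: keep what was parsed so far
        argv.append(arg.group(1))
        pos = arg.end()
        if arg.group(2) == "]":
            break
    when = datetime.fromtimestamp(float(stamp), tz=timezone.utc)
    return {
        "pid": int(pid),
        "routine": path,
        "arguments": argv[1:],  # drop argv[0], the program's own name
        "time": when.strftime("%Y%m%dT%H%M%SZ"),
    }


trace = [
    '25237 1138336178.895581 dup2(5, 1) = 1',
    '25236 1138336178.900117 execve("/bin/cat", ["cat", "/etc/passwd"], [...]',
    '25237 1138336178.903342 execve("/bin/grep", ["grep", "haavar"], [...]',
]
records = [r for r in map(parse_execve, trace) if r is not None]
```

Arguments are scanned one quoted string at a time, not by searching for the closing bracket, because an argument such as the `sed` expression can itself contain `]`.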
31
Now What?
Probulator reports not universally unique
Q: How do we hook separate reports together?
A: Logger assigns UUIDs to
– Data streams
– Processes
– Jobs (workflows)
Lineage not explicit
Q: How do we publish lineage?
A: ES3 Core builds serialized graph
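The two answers above (UUIDs to link reports, then a graph built from them) can be sketched in Python. This is a minimal illustration under stated assumptions, not the ES3 logger or core: the report layout, file names, and process names are hypothetical, processes and data are identified by name alone, and the graph is held in memory rather than serialized.

```python
import uuid
from collections import defaultdict


class Logger:
    """Assigns one UUID per distinct data stream or process name it sees."""

    def __init__(self):
        self.ids = {}

    def uuid_for(self, kind, name):
        key = (kind, name)
        if key not in self.ids:
            self.ids[key] = str(uuid.uuid4())
        return self.ids[key]


def build_graph(logger, reports):
    """Return edges as parent -> children: input data -> process -> output data."""
    edges = defaultdict(set)
    for report in reports:
        proc = logger.uuid_for("process", report["process"])
        for path in report["reads"]:
            edges[logger.uuid_for("data", path)].add(proc)
        for path in report["writes"]:
            edges[proc].add(logger.uuid_for("data", path))
    return edges


def descendants(edges, start):
    """Everything derived, directly or indirectly, from the starting node."""
    seen, stack = set(), [start]
    while stack:
        for nxt in edges.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


# Two separate (hypothetical) probulator reports that share the file "sst.img".
reports = [
    {"process": "make_sst", "reads": ["avhrr.l1b"], "writes": ["sst.img"]},
    {"process": "make_map", "reads": ["sst.img"], "writes": ["sst_map.png"]},
]
logger = Logger()
graph = build_graph(logger, reports)
derived = descendants(graph, logger.uuid_for("data", "avhrr.l1b"))
```

Because both reports resolve `sst.img` to the same UUID, the map product shows up as a descendant of the Level 1B input even though no single report mentions both.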
32
Thanks to:
Current: Mike Colee, Stephane Maritorena, Dominic Metzger, Karl Rittger, Dave Siegel
Former: Anurag Acharya, Rajendra Bose, Scott Denning, Debbie Donahue, Jim Duff, Calin Duma, Erik Fields, Jim Gray, Steve Miley, Jordan Morris, Mark Pelletier, Pete Peterson, Walter Rosenthal, Klaus Schauser, Håvar Valeur
33
To Probulate Further…
http://www.snow.ucsb.edu : Publications
Bose, R. and Frew, J., 2005. Lineage retrieval for scientific data processing: a survey. ACM Computing Surveys, vol. 37, no. 1, pp. 1-28. doi:10.1145/1057977.1057978
Dozier, J., and Painter, T.H., 2004. Multispectral and hyperspectral remote sensing of alpine snow properties. Annual Review of Earth and Planetary Sciences, vol. 32, pp. 465-494. doi:10.1146/annurev.earth.32.101802.120404
Molotch, N.P., Painter, T.H., Bales, R.C., and Dozier, J., 2004. Incorporating remotely sensed snow albedo into spatially distributed snowmelt modeling. Geophysical Research Letters, vol. 31, L03501. doi:10.1029/2003GL019063
Frew, J. and Bose, R., 2001. Earth System Science Workbench: a data management infrastructure for Earth science products. In: Kerschberg, L. and Kafatos, M. (eds.) 2001. Proceedings, 13th International Conference on Scientific and Statistical Database Management (SSDBM 2001), pp. 180-189. doi:10.1109/SSDM.2001.938550