TRANSCRIPT
Using the Grid for Astronomical Data
Roy Williams, Caltech
Palomar-Quest Survey (Caltech, NCSA, Yale)
[Diagram: P48 Telescope, Caltech, Yale, NCSA, TeraGrid; 50 GB/night, 5 TB]
• Transient pipeline: computing reservation at sunrise for immediate follow-up of transients (ALERT)
• Synoptic survey: massive resampling (Atlasmaker) for ultra-faint detection
• NCSA, Caltech, and Yale run different pipelines on the same data
Wide-area Mosaicking (Hyperatlas): an NVO-TeraGrid project (Caltech)
• High quality: flux-preserving, spatially accurate
• Stackable: Hyperatlas
• Edge-free: pyramid weighting
Mining AND Outreach
DPOSS 15º
Griffith Observatory "Big Picture"
Synoptic Image Stack
PQ Pipeline
Computing: observation night, 28 columns x 4 filters, up to 70 GB
[Pipeline diagram labels: real-time, next day, cleaned frames, Hyperatlas pages, coadd, VOEventNet, quasars @ z>4]
Mosaicking service
[Architecture diagram labels: Portal, NVO Registry, Logical SIAP, Physical SIAP, Computing, Security, Request, Sandbox, http]
Transient from PQ (from catalog pipeline)
[Diagram labels: Event Synthesis Engine, PQ Event Factory, PQ next-day pipelines, Palomar-Quest catalog, known variables, known asteroids, SDSS, 2MASS, baseline sky, remote archives, eStar, VOEventNet; follow-up telescopes: Pairitel, Palomar 60", Raptor]
VOEventNet: a Rapid-Response Telescope Grid
[Diagram labels: GRB satellites, VOEvent database]
Correlation of mass distribution (SDSS) with CMB (ISW effect): statistical significance through an ensemble of simulated universes
Connolly and Scranton, U. Pittsburgh
ISW Effect
Analysis of data from AMANDA (Antarctic Muon and Neutrino Detector Array)
Barwick and Silvestri, UC Irvine
AMANDA analysis
Quasar Science: an NVO-TeraGrid project (Penn State, CMU, Caltech)
• 60,000 quasar spectra from the Sloan Digital Sky Survey
• Each is 1 CPU-hour: submit to the grid queue
• Fits a complex model (173 parameters)
 – derive black hole mass from line widths
[Diagram labels: clusters, globusrun, manager, NVO data services]
N-point galaxy correlation: an NVO-TeraGrid project (Pitt, CMU)
Finding the triple correlation in the 3D SDSS galaxy catalog (RA/Dec/z)
Lots of large parallel jobs
kd-tree algorithms
TeraGrid
TeraGrid Wide Area Network
TeraGrid Components
• Compute hardware
 – Intel/Linux clusters, Alpha SMP clusters, POWER4 cluster, …
• Large-scale storage systems
 – hundreds of terabytes for secondary storage
• Very high-speed network backbone
 – bandwidth for rich interaction and tight coupling
• Grid middleware
 – Globus, data management, …
• Next-generation applications
Overview of Distributed TeraGrid Resources
[Diagram: site resources and external networks at NCSA/PACI (10.3 TF, 240 TB), SDSC (4.1 TF, 225 TB), Caltech, and Argonne; HPSS and UniTree archival storage]
Cluster Supercomputer
[Diagram labels: 100s of nodes, purged /scratch, parallel file system, /home (backed up), login node, job submission and queueing (Condor, PBS, …), user, metadata node, parallel I/O, VO service]
TeraGrid Allocations Policies
• Any US researcher can request an allocation
 – Policies/procedures posted at http://www.paci.org/Allocations.html
 – Online proposal submission: https://pops-submit.paci.org/
• NVO has an account on TeraGrid (just ask RW)
Data storage
Logical and Physical names
• Logical name: application context
 – e.g. frame_20050828.012.fits
• Physical name: storage context
 – e.g. /home/roy/data/frame_20050828.012.fits
 – e.g. file:///envoy4/raid3/frames/20050825/012.fits
 – e.g. http://nvo.caltech.edu/vostore/6ab7c828fe73.fits.gz
Logical and Physical Names
• Allows:
 – replication of data
 – movement/optimization of storage
 – transition to a database (lname -> key)
 – heterogeneous/extensible storage hardware (/envoy2/raid2, /pvfs/nvo/, etc.)
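As a rough sketch of the idea (not the actual NVO implementation; the replica table below simply reuses the example names from the previous slide), the lname-to-pname mapping can start life as a simple dictionary and later move behind a database:

# hypothetical lname -> list of physical replicas
replicas = {
    "frame_20050828.012.fits": [
        "file:///envoy4/raid3/frames/20050825/012.fits",
        "http://nvo.caltech.edu/vostore/6ab7c828fe73.fits.gz",
    ],
}

def resolve(lname):
    # return all physical copies of a logical name; the application never
    # needs to know where the storage actually lives
    if lname not in replicas:
        raise KeyError("no physical copy registered for " + lname)
    return replicas[lname]

print resolve("frame_20050828.012.fits")[0]

Because callers only ever hold logical names, the replica list can be reorganized, replicated, or moved into a database without touching application code.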
Physical Name
• Suggested URI form: protocol://identifier
 – if you know the protocol, you can interpret the identifier
• Examples: file://, ftp://, srb://, uberftp://
• Transition to services: http://server/MadeToOrder?frame=012&a=2&b=3
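A minimal sketch of interpreting that URI form in Python 2 (the per-scheme handling is only illustrative; srb:// and uberftp:// would need their own clients):

import urllib
import urlparse

def open_physical(pname):
    # split protocol://identifier; the protocol tells us how to read the identifier
    scheme, netloc, path = urlparse.urlparse(pname)[:3]
    if scheme == "file":
        return open(path)                 # local filesystem path
    elif scheme in ("http", "ftp"):
        return urllib.urlopen(pname)      # simple remote fetch
    else:
        raise ValueError("no handler here for scheme: " + scheme)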
Typical types of HPC storage needs

Type          Typical size         Use                  Requirements (aggregate BW, latency tolerance)
1             1-10 TB              Home filesystem      A lot of small files, high metadata rates, interactive use.
2 (optional)  100s of GB per CPU   Local scratch space  High-bandwidth data cache.
3             10-100 TB            Global filesystem    High aggregate bandwidth; concurrent access to data; moderate latency tolerated.
4             100 TB - PB          Archival storage     Large storage pools with low cost; used for long-term storage of results.
Disk Farms (datawulf)
• Large files striped over disks
• Management node for file creation, access, ls, etc.
• Homogeneous disk farm (= parallel file system)
[Diagram labels: parallel file system, metadata node, parallel I/O]
Parallel File System
• Large files are striped
 – very fast parallel access
• Medium files are distributed
 – stripes do not all start in the same place
• Small files choke the PFS manager
 – either containerize
 – or use blobs in a database
 – not a file system anymore: a pool of 10^8 blobs with lnames
Containerizing
• Shared metadata
• Easier for bulk movement
[Diagram labels: container, file in container]
Extraction from Container
• tar container
 – slow extraction (reads the whole container)
• zip container
 – indexed for fast partial extraction
 – 2 GB limit on container size
 – used for the fast-access 2MASS image service at Caltech
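A minimal sketch of the zip route with Python's zipfile module (the container and member names are made up): the zip central directory acts as an index, so one file can be pulled out without reading the whole container.

import zipfile

container = zipfile.ZipFile("tile_0123.zip", "r")     # hypothetical container name
data = container.read("j0123_045.fits")               # hypothetical member name
out = open("j0123_045.fits", "wb")
out.write(data)
out.close()
container.close()

A tar container has no such index: extracting one member means scanning the whole archive, which is why tar is listed above as the slow option.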
Storage Resource Broker (SRB)
• Single logical namespace while accessing distributed archival storage resources
• Effectively infinite storage (first to 1 TB wins a t-shirt)
• Data replication
• Parallel transfers
• Interfaces: command line, API, web/portal
Storage Resource Broker (SRB): Virtual Resources, Replication
[Diagram: SRB client (command line or API) on a workstation; virtual resources hpss-sdsc, sfs-tape-sdsc, hpss-caltech, … at NCSA and SDSC]
Running jobs
3 Ways to Submit a Job
1. Directly to the PBS batch scheduler
 – simple; scripts are portable among PBS TeraGrid clusters
2. Globus common batch script syntax
 – scripts are portable among other grids using Globus
3. Condor-G
 – nice interface atop Globus; monitoring of all jobs submitted via Condor-G
 – higher-level tools like DAGMan
PBS Batch Submission
• Single executables to be run on a single remote machine
 – log in to a head node, submit to the queue
• Direct, interactive execution
 – mpirun -np 16 ./a.out
• Through a batch job manager
 – qsub my_script
 – where my_script describes executable location, runtime duration, redirection of stdout/err, mpirun specification, …
• Example session:
 – ssh tg-login.[caltech|ncsa|sdsc|uc].teragrid.org
 – qsub flatten.sh -v "FILE=f544"
 – qstat or showq
 – ls *.dat
 – pbs.out, pbs.err files
Remote submission
• Through Globus
 – globusrun -r [some-teragrid-head-node].teragrid.org/jobmanager -f my_rsl_script
 – where my_rsl_script describes the same details as in qsub my_script
• Through Condor-G
 – condor_submit my_condor_script
 – where my_condor_script describes the same details as the Globus my_rsl_script
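A sketch of driving the globusrun route from a script (the executable path, arguments, and head node are placeholders, and the RSL shown uses only a minimal subset of attributes; the Condor-G route would use condor_submit in the same way):

import os

# write a minimal RSL description of the job (placeholder executable and arguments)
rsl = open("my_rsl_script", "w")
rsl.write('&(executable="/home/roy/dposs-flat/flat/flat")\n')
rsl.write(' (arguments="-infile" "/pvfs/mydata/source/f544.fits")\n')
rsl.write(' (count=1)\n')
rsl.close()

# hand it to the remote jobmanager, as on the slide
os.system("globusrun -r tg-login.ncsa.teragrid.org/jobmanager -f my_rsl_script")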
globus-job-submit
• For running batch/offline jobs
 – globus-job-submit: submit a job
  • same interface as globus-job-run
  • returns immediately
 – globus-job-status: check job status
 – globus-job-cancel: cancel a job
 – globus-job-get-output: get job stdout/err
 – globus-job-clean: clean up after a job
Condor-G
A Grid-enabled version of Condor that provides robust job management for Globus clients.
 – Robust replacement for globusrun
 – Provides extensive fault tolerance
 – Can provide scheduling across multiple Globus sites
 – Brings Condor's job management features to Globus jobs
Condor DAGMan
• Manages workflow interdependencies
• Each task is a Condor description file
• A DAG file controls the order in which the Condor files are run
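A sketch of what that looks like in practice (the submit-file names are hypothetical; flatten.sub and coadd.sub would be ordinary Condor description files): the DAG file below tells DAGMan to run the coadd only after the flattening job has finished.

# generate a two-node DAG: "flatten" must finish before "coadd" starts
dag = open("pipeline.dag", "w")
dag.write("JOB flatten flatten.sub\n")
dag.write("JOB coadd   coadd.sub\n")
dag.write("PARENT flatten CHILD coadd\n")
dag.close()

# submit the whole workflow with:  condor_submit_dag pipeline.dag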
Data-intensive computing with NVO services
Two Key Ideas for Fault-Tolerance
• Transactions
 – no partial completion: either all or nothing
 – e.g. copy to a tmp filename, then mv to the correct file name
• Idempotent
 – "acting as if done only once, even if used multiple times"
 – can run the script repeatedly until finished
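A minimal sketch of both ideas together (the expected size is the one used by the DPOSS driver below, and the copy stands in for the real processing step):

import os, shutil

EXPECTED_SIZE = 1109404800          # size of a correctly made target file

def make_target(src, dst):
    # idempotent: if a complete target already exists, do nothing
    if os.path.exists(dst) and os.path.getsize(dst) == EXPECTED_SIZE:
        return
    # transactional: write under a temporary name, then rename;
    # a crash part-way through leaves no half-written target behind
    tmp = dst + ".tmp"
    shutil.copy(src, tmp)           # placeholder for the real computation
    os.rename(tmp, dst)

Because the script checks for a finished target before doing any work, it can be rerun after any failure and will only redo what is missing.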
DPOSS flattening
2650 x 1.1 Gbyte files
Cropping borders
Quadratic fit and subtract
Virtual data
Source Target
Driving the Queues
import os

# inputDirectory and outputDirectory are set earlier in the full script
def filetime(path):
    # modification time of a file
    return os.path.getmtime(path)

for f in os.listdir(inputDirectory):
    # if the target file exists, with the right size and age, then we keep it
    ofile = outputDirectory + "/" + f
    ifile = inputDirectory + "/" + f
    if os.path.exists(ofile):
        osize = os.path.getsize(ofile)
        if osize != 1109404800:
            print " -- wrong target size, remaking", osize
        else:
            time_tgt = filetime(ofile)
            time_src = filetime(ifile)
            if time_tgt < time_src:
                print " -- target older than source, remaking"
            else:
                print " -- already have target file"
                continue
    cmd = "qsub flat.sh -v \"FILE=" + f + "\""
    print " -- submitting batch job: ", cmd
    os.system(cmd)
Here is the driver that makes and submits jobs
PBS script
#!/bin/sh
#PBS -N dposs
#PBS -V
#PBS -l nodes=1
#PBS -l walltime=1:00:00
cd /home/roy/dposs-flat/flat
./flat \
-infile /pvfs/mydata/source/${FILE}.fits \
-outfile /pvfs/mydata/target/${FILE}.fits \
-chop 0 0 1500 23552 \
-chop 0 0 23552 1500 \
-chop 0 22052 23552 23552 \
-chop 22052 0 23552 23552 \
-chop 18052 0 23552 4000
A PBS script. It can be run with: qsub script.sh -v "FILE=f345"
Hyperatlas
Standard naming for atlases and pages: e.g. TM-5-SIN-20, page 1589
Standard scales: scale s means 2^(20-s) arcseconds per pixel (so scale 20 is 1 arcsecond per pixel)
Standard projections: SIN, TAN
Standard layouts: TM-5, HV-4
Hyperatlas is a Service
All Pages: <baseURL>/getChart?atlas=TM-5-SIN-20 (and no other arguments)
0    2.77777778E-4 'RA---SIN' 'DEC--SIN'   0.0 -90.0
1    2.77777778E-4 'RA---SIN' 'DEC--SIN'   0.0 -85.0
2    2.77777778E-4 'RA---SIN' 'DEC--SIN'  36.0 -85.0
...
1731 2.77777778E-4 'RA---SIN' 'DEC--SIN' 288.0  85.0
1732 2.77777778E-4 'RA---SIN' 'DEC--SIN' 324.0  85.0
1733 2.77777778E-4 'RA---SIN' 'DEC--SIN'   0.0  90.0

Sky to Page: page=1603&RA=182&Dec=62 --> page, scale, ctype, RA, Dec, x, y
1603 2.777777777777778E-4 'RA---TAN' 'DEC--TAN' 175.3 60.0 -11180.1 7773.7

Best Page: RA=182&Dec=62 --> page, scale, ctype, RA, Dec, x, y
1604 2.77777778E-4 'RA---SIN' 'DEC--SIN' 184.61538 60.0 4422.4 7292.1

Page WCS: page=1604 --> page, scale, ctype, RA, Dec
1604 2.77777778E-4 'RA---SIN' 'DEC--SIN' 184.61538 60.0

Replicated implementations: baseURL = http://nvo.caltech.edu:8080/hyperatlas (services to try)
Hyperatlas Service
Page to Sky: page=1603&x=200&y=500 --> RA, Dec, nx, ny, nz
184.5 60.1 -0.496 -0.039 0.867

Relevant pages from sky region: tilesize=4096&ramin=200.0&ramax=202.0&decmin=11.0&decmax=12.0 --> page, tile indices
1015 -1 1
1015 -1 2
1015 -2 1
1015 -2 2
1015  0 1
1015  0 2

Implementation: baseURL = http://nvo.caltech.edu:8080/hyperatlas (services to try)
page 1015, reference point RA=200, Dec=10
GET services from Python
import urllib
hyperatlasURL = self.hyperatlasServer + "/getChart?atlas=" + atlas \
+ "&RA=" + str(center1) + "&Dec=" + str(center2)
stream = urllib.urlopen(hyperatlasURL)
# result is a tab-separated line, so use split() to tokenize
tokens = stream.readline().split('\t')
print "Using page ", tokens[0], " of atlas ", atlas
self.scale = float(tokens[1])
self.CTYPE1 = tokens[2]
self.CTYPE2 = tokens[3]
rval1 = float(tokens[4])
rval2 = float(tokens[5])
This code uses a service to find the best hyperatlas page for a given sky location
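A possible continuation of that excerpt (not from the original talk): the returned scale, CTYPE, and reference values map directly onto FITS WCS keywords for the output page. The CDELT1 sign convention and the dictionary itself are assumptions for illustration; CRPIX would come from the page geometry.

# build FITS WCS keywords for the chosen hyperatlas page (illustrative only)
wcs = {
    "CTYPE1": self.CTYPE1,      # e.g. 'RA---SIN'
    "CTYPE2": self.CTYPE2,      # e.g. 'DEC--SIN'
    "CRVAL1": rval1,            # RA of the page reference point (degrees)
    "CRVAL2": rval2,            # Dec of the page reference point (degrees)
    "CDELT1": -self.scale,      # degrees per pixel; RA increases to the left
    "CDELT2": self.scale,
}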
VOTable parser in Python
import urllib
import xml.dom.minidom
stream = urllib.urlopen(SIAP_URL)
doc = xml.dom.minidom.parse(stream)
# Make a dictionary mapping each column's UCD to its index in the table
col_ucd_dict = {}
col_counter = 0
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_FIELD in XML_TABLE.getElementsByTagName("FIELD"):
        col_ucd = XML_FIELD.getAttribute("ucd")
        col_ucd_dict[col_ucd] = col_counter
        col_counter = col_counter + 1

urlColumn    = col_ucd_dict["VOX:Image_AccessReference"]
formatColumn = col_ucd_dict["VOX:Image_Format"]
raColumn     = col_ucd_dict["POS_EQ_RA_MAIN"]
deColumn     = col_ucd_dict["POS_EQ_DEC_MAIN"]
From a SIAP URL, we get the XML, and extract the columns that have the image references, image format, and image RA/Dec
VOTable parser in Python
import xml.dom.minidom
table = []
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_DATA in XML_TABLE.getElementsByTagName("DATA"):
        for XML_TABLEDATA in XML_DATA.getElementsByTagName("TABLEDATA"):
            for XML_TR in XML_TABLEDATA.getElementsByTagName("TR"):
                row = []
                for XML_TD in XML_TR.getElementsByTagName("TD"):
                    data = ""
                    for child in XML_TD.childNodes:
                        data += child.data
                    row.append(data)
                table.append(row)
Table is a list of rows, and each row is a list of table cells
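Putting the two fragments together, a sketch of fetching the images the SIAP response points to (the local file naming is arbitrary, and only FITS rows are kept):

import urllib

count = 0
for row in table:
    # keep only FITS images, using the column indices found from the UCDs
    if row[formatColumn] == "image/fits":
        localname = "siap_image_%03d.fits" % count
        print "fetching", row[urlColumn], "->", localname
        urllib.urlretrieve(row[urlColumn], localname)
        count = count + 1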
Science Gateways
Grid Impediments
• Learn Globus
• Learn MPI
• Learn PBS
• Port code to Itanium
• Get certificate
• Get logged in
• Wait 3 months for account
• Write proposal
... and now do some science
A better way: Graduated Security for Science Gateways
• Web form (anonymous) --> some science ...
• Register (logging and reporting) --> more science ...
• Authenticate with X.509 (browser or command line) --> big-iron computing ...
• Write proposal (own account) --> power user
2MASS Mosaicking Portal: an NVO-TeraGrid project (Caltech IPAC)
Three Types of Science Gateways
• Web-based portals
 – user interacts with a community-deployed web interface
 – runs community-deployed codes
 – service requests forwarded to grid resources
• Scripted service calls
 – user writes code to submit and monitor jobs
• Grid-enabled applications
 – application programs on users' machines (e.g. IRAF)
 – also run programs on grid resources
Secure Web Services for TeraGrid Access