TRANSCRIPT
Using the Grid for Astronomical Data
Roy Williams, Caltech
Palomar-Quest Survey (Caltech, NCSA, Yale)
[Diagram: P48 Telescope, Caltech, Yale, NCSA, TeraGrid; 50 GB/night, 5 TB]
• Transient pipeline: computing reservation at sunrise for immediate follow-up of transients (ALERT)
• Synoptic survey: massive resampling (Atlasmaker) for ultra-faint detection
• NCSA, Caltech, and Yale run different pipelines on the same data
Wide-area Mosaicking (Hyperatlas): an NVO-TeraGrid project (Caltech)
• High quality: flux-preserving, spatially accurate
• Stackable: Hyperatlas
• Edge-free: pyramid weighting
Mining AND Outreach
DPOSS 15º
Griffith Observatory "Big Picture"
Synoptic Image Stack
PQ Pipeline
Computing: observation night, 28 columns x 4 filters, up to 70 GB
[Pipeline diagram labels: real-time, next day, cleaned frames, Hyperatlas pages, coadd, VOEventNet, quasars @ z>4]
Mosaicking service
[Architecture diagram labels: Portal, NVO Registry, Logical SIAP, Physical SIAP, Computing, Security, Request, Sandbox, http]
Transient from PQ (from catalog pipeline)
[Diagram labels: Event Synthesis Engine, PQ Event Factory, PQ next-day pipelines, Palomar-Quest catalog, known variables, known asteroids, SDSS, 2MASS, baseline sky, remote archives, eStar, VOEventNet; follow-up telescopes: Pairitel, Palomar 60", Raptor]
VOEventNet: a Rapid-Response Telescope Grid
[Diagram labels: GRB satellites, VOEvent database]
Correlation of mass distribution (SDSS) with CMB (ISW effect): statistical significance through an ensemble of simulated universes
Connolly and Scranton, U. Pittsburgh
ISW Effect
Analysis of data from AMANDA (Antarctic Muon and Neutrino Detector Array)
Barwick and Silvestri, UC Irvine
AMANDA analysis
Quasar Science: an NVO-TeraGrid project (Penn State, CMU, Caltech)
• 60,000 quasar spectra from the Sloan Digital Sky Survey
• Each is 1 CPU-hour: submit to the grid queue
• Fits a complex model (173 parameters)
 – derive black hole mass from line widths
[Diagram labels: clusters, globusrun, manager, NVO data services]
N-point galaxy correlation: an NVO-TeraGrid project (Pitt, CMU)
Finding the triple correlation in the 3D SDSS galaxy catalog (RA/Dec/z)
Lots of large parallel jobs
kd-tree algorithms
TeraGrid
TeraGrid Wide Area Network
TeraGrid Components
• Compute hardware
 – Intel/Linux clusters, Alpha SMP clusters, POWER4 cluster, …
• Large-scale storage systems
 – hundreds of terabytes for secondary storage
• Very high-speed network backbone
 – bandwidth for rich interaction and tight coupling
• Grid middleware
 – Globus, data management, …
• Next-generation applications
Overview of Distributed TeraGrid Resources
[Diagram: site resources and external networks at NCSA/PACI (10.3 TF, 240 TB), SDSC (4.1 TF, 225 TB), Caltech, and Argonne; HPSS and UniTree archival storage]
Cluster Supercomputer
[Diagram labels: 100s of nodes, purged /scratch, parallel file system, /home (backed up), login node, job submission and queueing (Condor, PBS, …), user, metadata node, parallel I/O, VO service]
TeraGrid Allocations Policies
• Any US researcher can request an allocation
 – Policies/procedures posted at http://www.paci.org/Allocations.html
 – Online proposal submission: https://pops-submit.paci.org/
• NVO has an account on TeraGrid (just ask RW)
Data storage
Logical and Physical names
• Logical name: application context
 – e.g. frame_20050828.012.fits
• Physical name: storage context
 – e.g. /home/roy/data/frame_20050828.012.fits
 – e.g. file:///envoy4/raid3/frames/20050825/012.fits
 – e.g. http://nvo.caltech.edu/vostore/6ab7c828fe73.fits.gz
Logical and Physical Names
• Allows:
 – replication of data
 – movement/optimization of storage
 – transition to a database (lname -> key)
 – heterogeneous/extensible storage hardware (/envoy2/raid2, /pvfs/nvo/, etc.)
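As a rough sketch of the idea (not the actual NVO implementation; the replica table below simply reuses the example names from the previous slide), the lname-to-pname mapping can start life as a simple dictionary and later move behind a database:

# hypothetical lname -> list of physical replicas
replicas = {
    "frame_20050828.012.fits": [
        "file:///envoy4/raid3/frames/20050825/012.fits",
        "http://nvo.caltech.edu/vostore/6ab7c828fe73.fits.gz",
    ],
}

def resolve(lname):
    # return all physical copies of a logical name; the application never
    # needs to know where the storage actually lives
    if lname not in replicas:
        raise KeyError("no physical copy registered for " + lname)
    return replicas[lname]

print resolve("frame_20050828.012.fits")[0]

Because callers only ever hold logical names, the replica list can be reorganized, replicated, or moved into a database without touching application code.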
Physical Name
• Suggested URI form: protocol://identifier
 – if you know the protocol, you can interpret the identifier
• Examples: file://, ftp://, srb://, uberftp://
• Transition to services: http://server/MadeToOrder?frame=012&a=2&b=3
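A minimal sketch of interpreting that URI form in Python 2 (the per-scheme handling is only illustrative; srb:// and uberftp:// would need their own clients):

import urllib
import urlparse

def open_physical(pname):
    # split protocol://identifier; the protocol tells us how to read the identifier
    scheme, netloc, path = urlparse.urlparse(pname)[:3]
    if scheme == "file":
        return open(path)                 # local filesystem path
    elif scheme in ("http", "ftp"):
        return urllib.urlopen(pname)      # simple remote fetch
    else:
        raise ValueError("no handler here for scheme: " + scheme)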
Typical types of HPC storage needs

Type          Typical size         Use                  Requirements (aggregate BW, latency tolerance)
1             1-10 TB              Home filesystem      A lot of small files, high metadata rates, interactive use.
2 (optional)  100s of GB per CPU   Local scratch space  High-bandwidth data cache.
3             10-100 TB            Global filesystem    High aggregate bandwidth; concurrent access to data; moderate latency tolerated.
4             100 TB - PB          Archival storage     Large storage pools with low cost; used for long-term storage of results.
Disk Farms (datawulf)
• Large files striped over disks
• Management node for file creation, access, ls, etc.
• Homogeneous disk farm (= parallel file system)
[Diagram labels: parallel file system, metadata node, parallel I/O]
Parallel File System
• Large files are striped
 – very fast parallel access
• Medium files are distributed
 – stripes do not all start in the same place
• Small files choke the PFS manager
 – either containerize
 – or use blobs in a database
 – not a file system anymore: a pool of 10^8 blobs with lnames
Containerizing
• Shared metadata
• Easier for bulk movement
[Diagram labels: container, file in container]
Extraction from Container
• tar container
 – slow extraction (reads the whole container)
• zip container
 – indexed for fast partial extraction
 – 2 GB limit on container size
 – used for the fast-access 2MASS image service at Caltech
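A minimal sketch of the zip route with Python's zipfile module (the container and member names are made up): the zip central directory acts as an index, so one file can be pulled out without reading the whole container.

import zipfile

container = zipfile.ZipFile("tile_0123.zip", "r")     # hypothetical container name
data = container.read("j0123_045.fits")               # hypothetical member name
out = open("j0123_045.fits", "wb")
out.write(data)
out.close()
container.close()

A tar container has no such index: extracting one member means scanning the whole archive, which is why tar is listed above as the slow option.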
Storage Resource Broker (SRB)
• Single logical namespace while accessing distributed archival storage resources
• Effectively infinite storage (first to 1 TB wins a t-shirt)
• Data replication
• Parallel transfers
• Interfaces: command line, API, web/portal
Storage Resource Broker (SRB): Virtual Resources, Replication
[Diagram: SRB client (command line or API) on a workstation; virtual resources hpss-sdsc, sfs-tape-sdsc, hpss-caltech, … at NCSA and SDSC]
Running jobs
3 Ways to Submit a Job
1. Directly to the PBS batch scheduler
 – simple; scripts are portable among PBS TeraGrid clusters
2. Globus common batch script syntax
 – scripts are portable among other grids using Globus
3. Condor-G
 – nice interface atop Globus; monitoring of all jobs submitted via Condor-G
 – higher-level tools like DAGMan
PBS Batch Submission
• Single executables to be run on a single remote machine
 – log in to a head node, submit to the queue
• Direct, interactive execution
 – mpirun -np 16 ./a.out
• Through a batch job manager
 – qsub my_script
 – where my_script describes executable location, runtime duration, redirection of stdout/err, mpirun specification, …
• Example session:
 – ssh tg-login.[caltech|ncsa|sdsc|uc].teragrid.org
 – qsub flatten.sh -v "FILE=f544"
 – qstat or showq
 – ls *.dat
 – pbs.out, pbs.err files
Remote submission
• Through Globus
 – globusrun -r [some-teragrid-head-node].teragrid.org/jobmanager -f my_rsl_script
 – where my_rsl_script describes the same details as in qsub my_script
• Through Condor-G
 – condor_submit my_condor_script
 – where my_condor_script describes the same details as the Globus my_rsl_script
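A sketch of driving the globusrun route from a script (the executable path, arguments, and head node are placeholders, and the RSL shown uses only a minimal subset of attributes; the Condor-G route would use condor_submit in the same way):

import os

# write a minimal RSL description of the job (placeholder executable and arguments)
rsl = open("my_rsl_script", "w")
rsl.write('&(executable="/home/roy/dposs-flat/flat/flat")\n')
rsl.write(' (arguments="-infile" "/pvfs/mydata/source/f544.fits")\n')
rsl.write(' (count=1)\n')
rsl.close()

# hand it to the remote jobmanager, as on the slide
os.system("globusrun -r tg-login.ncsa.teragrid.org/jobmanager -f my_rsl_script")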
globus-job-submit
• For running batch/offline jobs
 – globus-job-submit: submit a job
  • same interface as globus-job-run
  • returns immediately
 – globus-job-status: check job status
 – globus-job-cancel: cancel a job
 – globus-job-get-output: get job stdout/err
 – globus-job-clean: clean up after a job
Condor-G
A Grid-enabled version of Condor that provides robust job management for Globus clients.
 – Robust replacement for globusrun
 – Provides extensive fault tolerance
 – Can provide scheduling across multiple Globus sites
 – Brings Condor's job management features to Globus jobs
Condor DAGMan
• Manages workflow interdependencies
• Each task is a Condor description file
• A DAG file controls the order in which the Condor files are run
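A sketch of what that looks like in practice (the submit-file names are hypothetical; flatten.sub and coadd.sub would be ordinary Condor description files): the DAG file below tells DAGMan to run the coadd only after the flattening job has finished.

# generate a two-node DAG: "flatten" must finish before "coadd" starts
dag = open("pipeline.dag", "w")
dag.write("JOB flatten flatten.sub\n")
dag.write("JOB coadd   coadd.sub\n")
dag.write("PARENT flatten CHILD coadd\n")
dag.close()

# submit the whole workflow with:  condor_submit_dag pipeline.dag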
Data-intensive computing with NVO services
Two Key Ideas for Fault-Tolerance
• Transactions
 – no partial completion: either all or nothing
 – e.g. copy to a tmp filename, then mv to the correct file name
• Idempotent
 – "acting as if done only once, even if used multiple times"
 – can run the script repeatedly until finished
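A minimal sketch of both ideas together (the expected size is the one used by the DPOSS driver below, and the copy stands in for the real processing step):

import os, shutil

EXPECTED_SIZE = 1109404800          # size of a correctly made target file

def make_target(src, dst):
    # idempotent: if a complete target already exists, do nothing
    if os.path.exists(dst) and os.path.getsize(dst) == EXPECTED_SIZE:
        return
    # transactional: write under a temporary name, then rename;
    # a crash part-way through leaves no half-written target behind
    tmp = dst + ".tmp"
    shutil.copy(src, tmp)           # placeholder for the real computation
    os.rename(tmp, dst)

Because the script checks for a finished target before doing any work, it can be rerun after any failure and will only redo what is missing.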
DPOSS flattening
2650 x 1.1 Gbyte files
Cropping borders
Quadratic fit and subtract
Virtual data
Source Target
Driving the Queues
import os

# inputDirectory and outputDirectory are set earlier in the full script
def filetime(path):
    # modification time of a file
    return os.path.getmtime(path)

for f in os.listdir(inputDirectory):
    # if the target file exists, with the right size and age, then we keep it
    ofile = outputDirectory + "/" + f
    ifile = inputDirectory + "/" + f
    if os.path.exists(ofile):
        osize = os.path.getsize(ofile)
        if osize != 1109404800:
            print " -- wrong target size, remaking", osize
        else:
            time_tgt = filetime(ofile)
            time_src = filetime(ifile)
            if time_tgt < time_src:
                print " -- target older than source, remaking"
            else:
                print " -- already have target file"
                continue
    cmd = "qsub flat.sh -v \"FILE=" + f + "\""
    print " -- submitting batch job: ", cmd
    os.system(cmd)
Here is the driver that makes and submits jobs
PBS script
#!/bin/sh
#PBS -N dposs
#PBS -V
#PBS -l nodes=1
#PBS -l walltime=1:00:00
cd /home/roy/dposs-flat/flat
./flat \
-infile /pvfs/mydata/source/${FILE}.fits \
-outfile /pvfs/mydata/target/${FILE}.fits \
-chop 0 0 1500 23552 \
-chop 0 0 23552 1500 \
-chop 0 22052 23552 23552 \
-chop 22052 0 23552 23552 \
-chop 18052 0 23552 4000
A PBS script. It can be run with: qsub script.sh -v "FILE=f345"
Hyperatlas
Standard naming for atlases and pages: e.g. TM-5-SIN-20, page 1589
Standard scales: scale s means 2^(20-s) arcseconds per pixel (so scale 20 is 1 arcsecond per pixel)
Standard projections: SIN, TAN
Standard layouts: TM-5, HV-4
Hyperatlas is a Service
All Pages: <baseURL>/getChart?atlas=TM-5-SIN-20 (and no other arguments)
0    2.77777778E-4 'RA---SIN' 'DEC--SIN'   0.0 -90.0
1    2.77777778E-4 'RA---SIN' 'DEC--SIN'   0.0 -85.0
2    2.77777778E-4 'RA---SIN' 'DEC--SIN'  36.0 -85.0
...
1731 2.77777778E-4 'RA---SIN' 'DEC--SIN' 288.0  85.0
1732 2.77777778E-4 'RA---SIN' 'DEC--SIN' 324.0  85.0
1733 2.77777778E-4 'RA---SIN' 'DEC--SIN'   0.0  90.0

Sky to Page: page=1603&RA=182&Dec=62 --> page, scale, ctype, RA, Dec, x, y
1603 2.777777777777778E-4 'RA---TAN' 'DEC--TAN' 175.3 60.0 -11180.1 7773.7

Best Page: RA=182&Dec=62 --> page, scale, ctype, RA, Dec, x, y
1604 2.77777778E-4 'RA---SIN' 'DEC--SIN' 184.61538 60.0 4422.4 7292.1

Page WCS: page=1604 --> page, scale, ctype, RA, Dec
1604 2.77777778E-4 'RA---SIN' 'DEC--SIN' 184.61538 60.0

Replicated implementations: baseURL = http://nvo.caltech.edu:8080/hyperatlas (services to try)
Hyperatlas Service
Page to Sky: page=1603&x=200&y=500 --> RA, Dec, nx, ny, nz
184.5 60.1 -0.496 -0.039 0.867

Relevant pages from sky region: tilesize=4096&ramin=200.0&ramax=202.0&decmin=11.0&decmax=12.0 --> page, tile indices
1015 -1 1
1015 -1 2
1015 -2 1
1015 -2 2
1015  0 1
1015  0 2

Implementation: baseURL = http://nvo.caltech.edu:8080/hyperatlas (services to try)
page 1015, reference point RA=200, Dec=10
GET services from Python
import urllib
hyperatlasURL = self.hyperatlasServer + "/getChart?atlas=" + atlas \
+ "&RA=" + str(center1) + "&Dec=" + str(center2)
stream = urllib.urlopen(hyperatlasURL)
# result is a tab-separated line, so use split() to tokenize
tokens = stream.readline().split('\t')
print "Using page ", tokens[0], " of atlas ", atlas
self.scale = float(tokens[1])
self.CTYPE1 = tokens[2]
self.CTYPE2 = tokens[3]
rval1 = float(tokens[4])
rval2 = float(tokens[5])
This code uses a service to find the best hyperatlas page for a given sky location
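A possible continuation of that excerpt (not from the original talk): the returned scale, CTYPE, and reference values map directly onto FITS WCS keywords for the output page. The CDELT1 sign convention and the dictionary itself are assumptions for illustration; CRPIX would come from the page geometry.

# build FITS WCS keywords for the chosen hyperatlas page (illustrative only)
wcs = {
    "CTYPE1": self.CTYPE1,      # e.g. 'RA---SIN'
    "CTYPE2": self.CTYPE2,      # e.g. 'DEC--SIN'
    "CRVAL1": rval1,            # RA of the page reference point (degrees)
    "CRVAL2": rval2,            # Dec of the page reference point (degrees)
    "CDELT1": -self.scale,      # degrees per pixel; RA increases to the left
    "CDELT2": self.scale,
}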
VOTable parser in Python
import urllib
import xml.dom.minidom
stream = urllib.urlopen(SIAP_URL)
doc = xml.dom.minidom.parse(stream)
# Make a dictionary mapping each column's UCD to its index in the table
col_ucd_dict = {}
col_counter = 0
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_FIELD in XML_TABLE.getElementsByTagName("FIELD"):
        col_ucd = XML_FIELD.getAttribute("ucd")
        col_ucd_dict[col_ucd] = col_counter
        col_counter = col_counter + 1

urlColumn    = col_ucd_dict["VOX:Image_AccessReference"]
formatColumn = col_ucd_dict["VOX:Image_Format"]
raColumn     = col_ucd_dict["POS_EQ_RA_MAIN"]
deColumn     = col_ucd_dict["POS_EQ_DEC_MAIN"]
From a SIAP URL, we get the XML, and extract the columns that have the image references, image format, and image RA/Dec
VOTable parser in Python
import xml.dom.minidom
table = []
for XML_TABLE in doc.getElementsByTagName("TABLE"):
    for XML_DATA in XML_TABLE.getElementsByTagName("DATA"):
        for XML_TABLEDATA in XML_DATA.getElementsByTagName("TABLEDATA"):
            for XML_TR in XML_TABLEDATA.getElementsByTagName("TR"):
                row = []
                for XML_TD in XML_TR.getElementsByTagName("TD"):
                    data = ""
                    for child in XML_TD.childNodes:
                        data += child.data
                    row.append(data)
                table.append(row)
Table is a list of rows, and each row is a list of table cells
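Putting the two fragments together, a sketch of fetching the images the SIAP response points to (the local file naming is arbitrary, and only FITS rows are kept):

import urllib

count = 0
for row in table:
    # keep only FITS images, using the column indices found from the UCDs
    if row[formatColumn] == "image/fits":
        localname = "siap_image_%03d.fits" % count
        print "fetching", row[urlColumn], "->", localname
        urllib.urlretrieve(row[urlColumn], localname)
        count = count + 1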
Science Gateways
Grid Impediments
• Learn Globus
• Learn MPI
• Learn PBS
• Port code to Itanium
• Get certificate
• Get logged in
• Wait 3 months for account
• Write proposal
... and now do some science
A better way: Graduated Security for Science Gateways
• Web form (anonymous) --> some science ...
• Register (logging and reporting) --> more science ...
• Authenticate with X.509 (browser or command line) --> big-iron computing ...
• Write proposal (own account) --> power user
2MASS Mosaicking Portal: an NVO-TeraGrid project (Caltech IPAC)
Three Types of Science Gateways
• Web-based portals
 – user interacts with a community-deployed web interface
 – runs community-deployed codes
 – service requests forwarded to grid resources
• Scripted service calls
 – user writes code to submit and monitor jobs
• Grid-enabled applications
 – application programs on users' machines (e.g. IRAF)
 – also run programs on grid resources
Secure Web Services for TeraGrid Access