Data Acquisition & Distribution; and Archiving Plans Bill Boroski SDSS-II First Year Review National Science Foundation August 7, 2006




Data Acquisition & Distribution; and Archiving Plans

Bill Boroski

SDSS-II First Year Review

National Science Foundation

August 7, 2006


Data Acquisition & Processing Operations

[Diagram: sites shown are Apache Point Observatory, Fermilab, U. of Washington, and Princeton. Labeled flows between them: data transfer via the web (two links), plug plate designs, and plug plates.]


Data Acquisition / Observatory Operations

• Observing operations at APO
– Performance metrics remain solid
    Ave. imaging efficiency = 87% (Aug-Jan); 72% (Mar-Jun)
    Ave. spectro efficiency = 67%
    Ave. system uptime = 96%
– Data transfers via 15 Mbps microwave link
    • Only two disruptions in the first 9 months of operation
    • Recovery was relatively quick; no data lost
– SN compute cluster for on-the-mountain reductions
– Fire protection upgrades
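
For scale, the 15 Mbps figure quoted above can be turned into an upper bound on how much data the link could carry per day. This is only a back-of-the-envelope sketch: it assumes decimal units (1 Mbps = 10^6 bits/s, 1 GB = 10^9 bytes) and ignores protocol overhead and downtime, and the function name is illustrative rather than part of any SDSS tool.

```python
def max_daily_volume_gb(link_mbps: float, hours: float = 24.0) -> float:
    """Upper bound on the data volume (in GB) a link can move in `hours`.

    Assumes the nominal rate is sustained and ignores protocol overhead.
    """
    bits = link_mbps * 1e6 * hours * 3600  # bits transferred in the window
    return bits / 8 / 1e9                  # bits -> bytes -> GB


print(max_daily_volume_gb(15))  # 162.0
```

Real throughput would be lower than this ceiling once overhead and the observing schedule are taken into account.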


Data Processing Operations

• Data Processing Operations at Fermilab
– Production processing of all Legacy, SEGUE and Supernova Survey data; target selection and plate design
– Operations continue to run smoothly and efficiently
– Implemented automated method for pulling data from APO and verifying data integrity
– Implemented scripts to “crawl” entire spinning data set to detect file corruption, missing files, etc.

• Data Processing Operations at Princeton
– DP hardware cluster installed and commissioned
– Imaging and spectro data are being reduced
– Imaging reductions being used to support Photo pipeline and photometric calibration work
– Spectro reductions being used in Spectro v5 pipeline development
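
The slides do not say how the crawl scripts detect corruption or missing files. One common approach, shown here purely as an illustrative sketch, is to compare each file against a manifest of previously recorded checksums. The manifest format, the file names, and the choice of SHA-256 are all assumptions, not details of the Fermilab implementation.

```python
import hashlib
import os
import tempfile


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def crawl(root, manifest):
    """Check files under `root` against `manifest` (relative path -> digest).

    Returns (missing, corrupted) as sorted lists of relative paths.
    """
    missing, corrupted = [], []
    for rel_path, expected in manifest.items():
        full = os.path.join(root, rel_path)
        if not os.path.isfile(full):
            missing.append(rel_path)
        elif sha256_of(full) != expected:
            corrupted.append(rel_path)
    return sorted(missing), sorted(corrupted)


def demo():
    """Build a tiny data set, damage it, and run the crawl (hypothetical files)."""
    with tempfile.TemporaryDirectory() as root:
        for name, data in [("a.fit", b"good"), ("b.fit", b"also good")]:
            with open(os.path.join(root, name), "wb") as f:
                f.write(data)
        manifest = {n: sha256_of(os.path.join(root, n)) for n in ("a.fit", "b.fit")}
        manifest["c.fit"] = "0" * 64          # listed in the manifest, never written
        with open(os.path.join(root, "b.fit"), "wb") as f:
            f.write(b"bit rot")               # simulate silent corruption
        return crawl(root, manifest)


print(demo())  # (['c.fit'], ['b.fit'])
```

Reading in chunks keeps memory use flat for large image files, which matters when the crawl covers a multi-terabyte archive.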


Supernova Survey Data Release 1

• First public release of SN data occurred in January 2006
– Contains data from 72 imaging runs obtained during the fall 2005 observing campaign (Sep 1 through Nov 30, 2005)
– Contains an additional 16 imaging runs obtained during the fall 2004 season
– Data taken on a 2.5 degree wide region along the celestial equator, from -60 < RA < 60 degrees (SDSS Stripe 82)

• URL (linked off the SDSS home page): http://www.sdss.org/drsn1/DRSN1_data_release.html

• Web page provides access to corrected frames and uncalibrated catalogs via pared-down DAS script
– Also provides list of available runs with information regarding data quality

• Future data releases will occur incrementally during each fall SN observing season
– Releases will occur roughly 4-6 weeks after data collection
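
The sky coverage implied by the Stripe 82 description above can be estimated with a short calculation. This sketch assumes the 2.5 degree strip is centered on the celestial equator (Dec from -1.25 to +1.25 degrees) and that the RA limits are in degrees; the function name is illustrative.

```python
import math


def strip_area_sq_deg(ra_min, ra_max, dec_min, dec_max):
    """Area of an RA/Dec rectangle on the sphere, in square degrees.

    All four bounds are in degrees. The solid angle is
    dRA * (sin(dec_max) - sin(dec_min)), converted from steradians.
    """
    d_ra = math.radians(ra_max - ra_min)
    d_sin = math.sin(math.radians(dec_max)) - math.sin(math.radians(dec_min))
    return d_ra * d_sin * (180 / math.pi) ** 2


area = strip_area_sq_deg(-60, 60, -1.25, 1.25)
print(round(area, 1))  # roughly 300 square degrees
```

Because the strip hugs the equator, the result is very close to the flat-sky product of 120 degrees by 2.5 degrees.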


Data Release 5

• Contains survey-quality Legacy and SEGUE data collected through June 30, 2005.

Imaging

  Footprint Area                  8,000 sq. deg.
  Imaging Catalog                 215 million objects
  Data Volume
    Images                        9.0 TB
    Catalogs (DAS, FITS format)   1.8 TB
    Catalogs (CAS, SQL DB)        3.6 TB

Spectroscopy

  Spectro Area                    5,740 sq. deg.
  Total # of spectra              1,048,960
    Galaxies                      674,749
    Quasars (redshift < 2.3)      79,934
    Quasars (redshift > 2.3)      11,217
    Stars                         154,925
    M stars and later             60,808
    Sky spectra                   55,555
    Unknown                       12,312

• Also contains data from 361 “extra” and “special” plates
– Repeat observations, observations of spectro targets selected by collaboration members for specialized science programs.

• The “final” release of the SDSS Survey. All data now in public domain.
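
As a quick consistency check on the spectroscopy counts listed above, the sketch below sums the per-class numbers exactly as given and compares the result with the stated total. The listed classes add up to 540 more than the stated total of 1,048,960. The slides do not explain the difference; a slip in one of the counts is one possibility, but that is speculation.

```python
# Per-class spectra counts, copied as listed on the slide.
counts = {
    "Galaxies": 674_749,
    "Quasars (redshift < 2.3)": 79_934,
    "Quasars (redshift > 2.3)": 11_217,
    "Stars": 154_925,
    "M stars and later": 60_808,
    "Sky spectra": 55_555,
    "Unknown": 12_312,
}
stated_total = 1_048_960

listed_sum = sum(counts.values())
print(listed_sum, listed_sum - stated_total)  # 1049500 540

# Share of the stated total contributed by each class, in percent.
for name, n in counts.items():
    print(f"{name:28s} {100 * n / stated_total:5.1f}%")
```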


Data Release 5 (cont’d)

• October 21, 2005: Initial DR5 CAS and DAS released to collaboration

• May 4, 2006: DAS and enhanced CAS released to collaboration
    • Photometric redshifts of galaxies
        – Computed by two different groups within the collaboration
    • Detailed coverage masks
        – Allow LSS researchers to easily calculate power spectrum and related quantities
        – Allow computation of the area covered by statistical samples, such as the quasar sample in DR5
    • QSO catalog tables
        – Master list of everything that “smells” like a quasar, with vital signs gathered from different data sources (i.e., FIRST, ROSAT, Stetson, USNO, and USNO-B catalogs)
    • runQA for target photometry (also bug fix)
    • Regenerated JPEGs to fix clipping error in some images

• June 28, 2006: Enhanced version made available to the public
    • Usage varied between 80,000 and 100,000 SQL queries per day in the week immediately following the public release


SkyServer Data Usage Statistics

[Chart: Number of Web Hits per Month (EDR, DR1, DR2, DR3, DR4, DR5 Combined). Horizontal axis: Time, 2002 to 2006. Vertical axis: Number of Web Hits per Month, 0 to 14,000,000.]


SkyServer Data Usage Statistics

[Chart: Number of SQL Queries Executed per Month (EDR, DR1, DR2, DR3, DR4, DR5 Combined). Horizontal axis: Time, 2002 to 2006. Vertical axis: Number of SQL Queries Executed per Month, 0 to 3,000,000.]


DAS Data Usage Statistics

[Chart: Monthly Data Transfer through the DAS rsync Server. Horizontal axis: Time, 2004 to 2006. Vertical axis: Data Volume Transferred (TB), 0.0 to 7.0. Release dates marked on the chart: DR2, 15-Mar-04; DR3, 27-Sep-04; DR4, 29-Jun-05; DR5, 28-Jun-06.]


DAS Data Usage Statistics

[Chart: Monthly Data Transfers through the DAS Web Interface. Horizontal axis: Time, 2004 to 2006. Vertical axis: Data Volume Transferred (TB), 0.0 to 6.0.]


Data Usage by Release

[Chart: SQL Queries Executed, by Data Release. Horizontal axis: Time, 2003 to 2006. Vertical axis: Number of SQL queries executed per month, 0 to 3,000,000. Series: DR1, DR2, DR3, DR4, DR5.]


Looking ahead to Data Release 6

• Public release scheduled for July 1, 2007
– Legacy and SEGUE data collected through 30-Jun-2006

• DR6 will incorporate modest data model changes, such as:
– Add table containing SEGUE stellar parameters
– Add primary and secondary target fields to photoObjAll and specObjAll tables to accommodate SEGUE and “other” plates
– Extract photometry for targets into new targPhotoObjAll table in BESTDR6, which allows us to eliminate TARG DB from some servers
– Develop sqlFits2CSV code to read in Pan-STARRS object tables and build CSV files for DB loading

• Will load and deploy using MS SQL Server 2005

• Code modification and testing currently underway

• Data loading will commence in 4-6 weeks, with the collaboration release occurring in late fall.


Runs Database

• CAS-style database that is being loaded with all imaging runs obtained over the course of the survey, regardless of data quality

• Updated version released to the collaboration on July 27, 2006
– Contains 370 imaging runs

• Incremental releases are being made every 6-8 weeks
– Each increment has contained on the order of 40-60 new runs

• Collaboration DAS access is also available for these runs
– FITS format corrected frames
– Uncalibrated (fpObjc*, asTrans*) and calibrated (tsObj*, tsField*) data
– Collaboration access via rsync and wget

• We intend to make the RunsDB available to the public at the time of the final SDSS-II data release, when all data will be in the public domain.


Interim Archive Stewardship

• A formal plan for the long-term stewardship of the SDSS data set is still being formulated

• The possibility of an interim arrangement has been discussed with senior Fermilab management
– Lab management has expressed a willingness to serve as interim host until a more permanent steward is identified; timescale TBD
– Initial level of stewardship would include:
    • Maintaining data integrity
        – Retention of multiple copies; automated scanning to detect data corruption
    • Maintaining existing systems (SkyServer, CAS, DAS, sdss.org, etc.)
        – Replace failed hardware, apply security patches, deploy OS upgrades, etc.
        – Helpdesk support?? (level to be determined)
– Estimated resource requirements
    • One FTE (HW/DBA support)
    • $100K/yr for equipment, materials and supplies
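
Keeping multiple copies helps with corruption detection only if the copies are compared with one another. The sketch below shows one possible scheme, a majority vote over content digests, which is an assumption and not something described in the slides. The storage location names are hypothetical.

```python
import hashlib
from collections import Counter


def find_bad_copies(copies):
    """Return the locations whose content disagrees with the majority.

    `copies` maps a storage location name to the bytes of one file.
    If no digest holds a strict majority, every location is returned
    so that a person can review the file.
    """
    if not copies:
        return []
    digests = {loc: hashlib.sha256(data).hexdigest() for loc, data in copies.items()}
    top_digest, top_count = Counter(digests.values()).most_common(1)[0]
    if top_count * 2 <= len(copies):
        return sorted(copies)  # no strict majority: flag everything
    return sorted(loc for loc, d in digests.items() if d != top_digest)


copies = {"disk_a": b"frame", "disk_b": b"frame", "tape": b"fr4me"}
print(find_bad_copies(copies))  # ['tape']
```

With only two copies a mismatch cannot be attributed to either side, which is one argument for retaining at least three or for storing a checksum alongside each file.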


Long-term Stewardship

• Curation
– Mandatory
    • Format conversion (e.g., OS upgrades, etc.)
    • Platform migration (Database, OS, etc.)
– Value-added
    • Errata
    • Bug fixes
    • Annotations
        – First tier: team
        – Second tier: archiving centers
        – Third tier: community
    • Virtual Observatory (VO) compliance


Long-Term Stewardship

Preliminary Thoughts on Resource Requirements

• Operations Support
– System maintenance
    • Hardware-level support
    • Application-level DBA support
    • System performance monitoring
– Help desk
– Logging
    • Usage statistics

• Estimated Resource Requirements
– Number of boxes: 10-20
– Level of effort:
    • 1 FTE for helpdesk support
    • 1 FTE for hardware/software support
    • 0.5 FTE for technical support (scientists, etc.)
– Budget for ongoing maintenance and support costs
    • Disk replacements, software upgrades, server upgrades, etc.
    • ~$100K, dropping off in future years