
From Astrophysics to Sensor Networks: Facing the Data Explosion

Alex Szalay, The Johns Hopkins University

Jim Gray, Microsoft Research

Living in an Exponential World

• Astronomers have a few hundred TB now
– 1 pixel (byte) / sq arc second ~ 4 TB
– Multi-spectral, temporal, … → 1 PB

• They mine it looking for:
– new (kinds of) objects, or more of the interesting ones (quasars)
– density variations in 400-D space
– correlations in 400-D space

• Data doubles every year

[Chart: total survey data volume, 1970–2000, log scale 0.1–1000; CCD data overtakes glass photographic plates]

The Challenges

Data Collection → Discovery and Analysis → Publishing

• Exponential data growth: distributed collections, soon petabytes

• New analysis paradigm: data federations, move the analysis to the data

• New publishing paradigm: scientists are publishers and curators

Publishing Data

• Exponential growth:
– Projects last at least 3-5 years
– Data sent upwards only at the end of the project
– Data will never be centralized

• More responsibility on projects
– Becoming publishers and curators

• Data will reside with projects
– Analyses must be close to the data

Roles        Traditional   Emerging
Authors      Scientists    Collaborations
Publishers   Journals      Project www site
Curators     Libraries     Bigger Archives
Consumers    Scientists    Scientists

Accessing Data

• If there is too much data to move around, take the analysis to the data!

• Do all data manipulations in the database
– Build custom procedures and functions in the database

• Automatic parallelism guaranteed

• Easy to build in custom functionality
– Databases & procedures are being unified
– Examples: temporal and spatial indexing, pixel processing

• Easy to reorganize the data
– Multiple views, each optimal for certain analyses
– Building hierarchical summaries is trivial

• Scalable to petabyte datasets: active databases!
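
As a concrete sketch of "custom functions in the database": a minimal T-SQL cone search that joins an HTM-backed neighbor search to the photometric table, so only the small result set ever leaves the server. dbo.fGetNearbyObjEq and the PhotoObj columns follow public SkyServer naming, but the exact signature and the function body here are assumptions, not a verbatim excerpt.

```sql
-- Hypothetical inline table-valued function: cone search returning colors.
-- The heavy lifting (HTM index lookup) stays inside the database engine.
CREATE FUNCTION dbo.fConeColors (@ra float, @dec float, @radArcMin float)
RETURNS TABLE
AS RETURN
(
    SELECT p.objID, p.ra, p.dec,
           p.g - p.r AS gr,   -- g-r color
           p.r - p.i AS ri    -- r-i color
    FROM dbo.fGetNearbyObjEq(@ra, @dec, @radArcMin) AS n  -- assumed HTM helper
    JOIN PhotoObj AS p ON p.objID = n.objID
);
GO
-- Usage: SELECT * FROM dbo.fConeColors(185.0, 15.5, 3.0);
```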

Why Is Astronomy Special?

• Especially attractive to the wide public
• Community is not very large
• It has no commercial value
– No privacy concerns; results freely shared with others
– Great for experimenting with algorithms

• It is real and well documented
– High-dimensional (with confidence intervals)
– Spatial, temporal

• Diverse and distributed
– Many different instruments from many different places and many different times

• The questions are interesting
• There is a lot of it (soon petabytes)

The Virtual Observatory

• Premise: most data is (or could be) online
• The Internet is the world's best telescope:
– It has data on every part of the sky
– In every measured spectral band: optical, X-ray, radio, …
– As deep as the best instruments (of 2 years ago)
– It is up when you are up
– The "seeing" is always great
– It's a smart telescope: it links objects and data to the literature on them

• Software became the capital expense
– Share, standardize, reuse, …

Similarities to Particle Physics

Particle Physics         Optical Astronomy
Van de Graaff            2.5 m telescopes
Cyclotrons               4 m telescopes
National Labs            8 m class telescopes
International (CERN)     Surveys / Time Domain
SSC vs LHC               30-100 m telescopes

Similar trends with a 20-year delay: fewer and ever bigger projects, an increasing fraction of the cost in software, more conservative engineering.

Can the exponential continue, or will it turn linear or logistic? What can we learn from high-energy physics?

Sloan Digital Sky Survey

Goal: create the most detailed map of the Northern sky ("The Cosmic Genome Project")

Two surveys in one:
• Photometric survey in 5 bands
• Spectroscopic redshift survey

Automated data reduction: 150 man-years of development

High data volume: 40 TB of raw data, 5 TB processed catalogs; data is public

Participants: The University of Chicago, Princeton University, The Johns Hopkins University, The University of Washington, New Mexico State University, Fermi National Accelerator Laboratory, US Naval Observatory, The Japanese Participation Group, The Institute for Advanced Study, Max Planck Inst., Heidelberg

Funding: Sloan Foundation, NSF, DOE, NASA


The Imaging Survey

Drift scan of 10,000 square degrees: 24k x 1M pixel "panoramic" images in 5 colors, through broad-band filters (u, g, r, i, z)

2.5 Terapixels of images

The Spectroscopic Survey

Expanding universe: redshift = distance

SDSS Redshift Survey:
• 1 million galaxies
• 100,000 quasars
• 100,000 stars

Two high-throughput spectrographs:
• spectral range 3900-9200 Å
• 640 spectra simultaneously
• R = 2000 resolution (1.3 Å)

Features:
• automated reduction of spectra
• very high sampling density and completeness

Precision Cosmology

Power spectrum of fluctuations, measured to few-percent accuracy!

Main challenge: with so much data, the dominant errors are systematic, not statistical!

Using large simulations to understand the significance of the detection.

The SkyServer Portal

• Sloan Digital Sky Survey: pixels + objects
• About 500 attributes per "object", 400M objects
• Currently 2.4 TB, fully public
• Prototype eScience lab (800 users)
– Moving the analysis to the data
– Fast searches: color, spatial

• Visual tools
– Join pixels with objects

• Prototype in data publishing
– 200 million web hits in 5 years
– 930,000 distinct users

http://skyserver.sdss.org/

SkyServer Traffic

[Chart: SkyServer monthly traffic, 2001/7 through 2004/7, log scale 10^4 to 10^7; two series: web hits/month and SQL queries/month]

Database Challenges

• Loading the data
• Organizing the data
• Accessing the data
• Delivering the data
• Analyzing the data (spatial, …)

Data Organization

• Jim Gray: 20 queries, now 35 queries (one sketched below)
– Represent the most typical access patterns
– Establish a clean set of requirements that both sides can understand
– Also used as a regression test during development
• Dual star schema centered on photometric objects
• Rich metadata information stored with the schema
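
In that spirit, a hedged sketch of one "20 queries"-style request against the star schema (PhotoObj and the u, g, r, i, z magnitudes are real SkyServer names; the numeric cuts are invented for illustration):

```sql
-- "Find quasar candidates": a typical access pattern the 20 queries
-- were designed to capture -- a scan with color-index predicates.
SELECT objID, ra, dec, u, g, r
FROM PhotoObj
WHERE u - g < 0.4        -- UV excess; illustrative cut only
  AND g - r < 0.7        -- illustrative cut only
  AND r < 19.0;          -- bright enough to follow up
```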

DB Loading

• Wrote an automated, table-driven workflow system for loading
– Two-phase parallel load
– Over 16K lines of SQL code, mostly data validation (see the sketch below)

• Loading process was extremely painful
– Lack of systems engineering for the pipelines
– Lots of foreign key mismatches
– Fixing corrupted files (RAID5 disk errors)
– Most of the time spent on scrubbing data

• Once the data is clean, everything loads in 1 week
• Reorganization of the data takes about 1 week
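
A minimal sketch of the kind of validation those 16K lines are full of; SpecObj.bestObjID and PhotoObj.objID follow the public SkyServer schema, though this particular check is illustrative:

```sql
-- Scrubbing pass: spectra whose photometric counterpart is missing,
-- i.e. foreign-key mismatches to resolve before enforcing constraints.
SELECT s.specObjID, s.bestObjID
FROM SpecObj AS s
LEFT JOIN PhotoObj AS p ON p.objID = s.bestObjID
WHERE p.objID IS NULL;
```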

Data Delivery

• Small requests (<100 MB)
– Anonymous; put the data on the stream

• Medium requests (<1 GB)
– Queues with resource limits

• Large requests (>1 GB)
– Save data in a scratch area and use asynchronous delivery
– Only practical for large/long queries

• Iterative requests / workbench
– Save data in temp tables in user space
– Let the user manipulate them via a web browser

MyDB: Workbench

• Need to register 'power users', with their own DB
• Query output goes to 'MyDB' (see the sketch below)
• Can be joined with the source database
• Results are materialized from MyDB upon request
• Users can do:
– Insert, Drop, Create, Select Into, Functions, Procedures
– Publish their tables to a group area

• Data delivery via CASJobs (a C# web service)

=> Sending the analysis to the data!
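
The canonical CasJobs pattern looks roughly like the sketch below; SELECT … INTO mydb.<table> is the documented CasJobs convention, while the table name and the cut are invented:

```sql
-- Runs server-side in CasJobs: the result set lands in the user's own
-- MyDB database instead of travelling over the network.
SELECT TOP 1000 p.objID, p.ra, p.dec, p.g - p.r AS gr
INTO mydb.RedGalaxies            -- hypothetical MyDB table
FROM PhotoObj AS p
WHERE p.g - p.r > 1.0;           -- illustrative color cut

-- Later queries can join it back against the source database:
--   SELECT ... FROM mydb.RedGalaxies AS r
--   JOIN SpecObj AS s ON s.bestObjID = r.objID;
```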

Public Data Release: Versions

• June 2001: EDR (Early Data Release)

• July 2003: DR1
– Contains 30% of the final data
– 150 million photo objects

• Now at DR5, with 2.4 TB
• 3 versions of the data
– Target, Best, Runs
• Total catalog volume 5 TB
• Published releases served 'forever'
– EDR, DR1, DR2, …, soon DR5
– Personal (1 GB), Professional (100 GB) subsets
– Next: include email archives, annotations

• O(N^2) storage, only possible because of Moore's Law!

[Diagram: EDR / DR1 DR1 / DR2 DR2 DR2 / DR3 DR3 DR3 DR3 / …; every release stays online alongside each newer one, so the total served volume grows as O(N^2)]

Mirroring

• Master copies kept at 2 sites
– Fermilab, JHU: 3 copies of everything running in parallel
– Necessary due to cheap hardware

• Multiple mirrors for DR2, DR3
– Microsoft has multiple copies
– Japan, China, India, Germany, UK, Australia, Hungary, Brazil

• Replication
– First via Sneakernet: 3 boxes with 1 TB of disks
– Now distributed from StarTap at UIC, using the UDT protocol (Grossman et al.)
– Transferred 1 TB to Tokyo in about 2.2 hours

Web Interface

• Provide easy, visual access to exciting new data
– Access based on web services

• Illustrate that advanced content does not mean a cumbersome interface

• Understand new ways of publishing scientific data

• Target audience
– Advanced high-school students, amateur astronomers, the wide public, professional astronomers

• Multilingual capabilities built in from the start
– Heavy use of stylesheets, language branches

• First version built by AS+JG, later redesigned by Curtis Wong, Microsoft

• Many other contributors (Ani Thakar + …)
• Meticulous logging and log harvesting of mirrors

Tutorials and Guides

• Developed by J. Raddick and postdocs
– How to use Excel
– How to use a database (guide to SQL)
– Expert advice on SQL

• Automated on-line documentation
– Ani Thakar, Roy Gal
– Database information, glossary, algorithms
– Searchable help
– All stored in the DB, and generated on the fly
– DB schema and metadata (documentation) generated from a single source, parsed different ways

• Templated for others; widely used in astronomy
– Now also for sensor networks

Visual Tools

• Goal:
– Connect pixel space to objects without typing queries
– Browser interface, using a common paradigm (MapQuest)

• Challenge:
– Images: 200K x 2K x 1.5K resolution x 5 colors = 3 Terapix
– 300M objects with complex properties
– 20K geometric boundaries and about 6M 'masks'
– Need a large dynamic range of scales (2^13)

• Assembled from a few building blocks:
– Image Cutout SOAP web service
– SQL query service + database
– Images + overlays built on the server side -> simple client
– Spectrum Service, now also Footprint Service, etc.

Spectrum Services

Search, plot, and retrieve SDSS, 2dF, and other spectra. The Spectrum Services web site is dedicated to spectrum-related VO services. On this site you will find tools and tutorials on how to access close to 500,000 spectra from the Sloan Digital Sky Survey (SDSS DR1) and the 2 degree Field redshift survey (2dFGRS). The services are open to everyone to publish their own spectra in the same framework. Reading the tutorials on XML Web Services, you can learn how to integrate the 45 GB spectrum and passband database with your programs with a few lines of code.

Spatial Information for Users

• What resources are in this part of the sky?
• What is the common area of these surveys?
• Is this point in the survey?
• Give me all objects in this region
• Give me all "good" objects (not in bad areas)
• Cross-match these two catalogs
• Give me the cumulative counts over areas
• Compute fast spherical transforms of densities
• Interpolate sparsely sampled functions (extinction maps, dust temperature, …)

Geometries

• SDSS has lots of complex boundaries
– 20,000 regions
– 6M masks (bright stars, satellites, etc.), represented as spherical polygons

• A GIS-like library built in C++ and SQL
• Now converted to C# for a direct plugin into SQL Server 2005 (17 times faster than C++)
• Precompute arcs and store them in the database for rendering
• Functions for point in polygon, intersecting polygons, polygons covering points, all points in polygon (example below)
• Fast searches using spherical quadtrees (HTM)
• Visualization integrated with web services
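
As a usage sketch: SkyServer exposes this library through table-valued functions such as dbo.fGetNearbyObjEq; the call below is the documented style of use, though the exact output columns should be treated as an assumption:

```sql
-- Spatial search served by the HTM index rather than a geometry scan:
-- all objects within 2 arcmin of a position.
SELECT n.objID, n.distance       -- distance in arcmin (assumed column name)
FROM dbo.fGetNearbyObjEq(185.0, 15.5, 2.0) AS n;
```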

Things Can Get Complex

[Diagram: two overlapping survey footprints A and B, with an ε-wide buffer around the boundary of B]

Green area: A ∩ (B - ε) should find a B if it contains an A and is not masked.
Yellow area: A ∩ (B ± ε) is an edge case; it may find a B if it contains an A.

Indexing Using Quadtrees

• Cover the sky with hierarchical pixels
• COBE: start with a cube
• Hierarchical Triangular Mesh (HTM) uses trixels (Samet, Fekete)

• Start with an octahedron, and split each triangle into 4 children, down to 20 levels deep

• Smallest triangles are 0.3"
• Each trixel has a unique htmID

[Diagram: recursive trixel subdivision; trixel 2 splits into 2,0 2,1 2,2 2,3, and trixel 2,3 splits into 2,3,0 2,3,1 2,3,2 2,3,3]
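
The payoff of the htmID scheme is that containment becomes a plain B-tree range scan: each level appends 2 bits, so all depth-20 descendants of a trixel occupy one contiguous htmID interval. A hedged sketch (the trixel ID value is hypothetical; PhotoObj does carry an htmID column in SkyServer):

```sql
-- All level-20 htmIDs under a given trixel form the contiguous range
-- [id * 4^k, (id+1) * 4^k - 1], where k = 20 - trixel depth.
DECLARE @trixel bigint = 11;     -- hypothetical htmID of a shallow trixel
DECLARE @k int = 18;             -- levels remaining down to depth 20
DECLARE @lo bigint =  @trixel      * POWER(CAST(4 AS bigint), @k);
DECLARE @hi bigint = (@trixel + 1) * POWER(CAST(4 AS bigint), @k) - 1;

SELECT objID
FROM PhotoObj
WHERE htmID BETWEEN @lo AND @hi; -- index range scan, no per-object geometry
```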

Comparing Massive Point Datasets

• Want to "cross-match" 10^9 objects with another set of 10^9 objects

• Using the HTM code, that is a LOT of calls (~10^7 seconds, close to a year)

• Want to do it in an hour
• Added 10 databases running in parallel and a bulk algorithm based on 'zones' (sketched below)
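
The zone idea (published by Gray and Szalay as the zone algorithm) replaces per-object HTM calls with one big sort-merge join: bucket objects into declination zones as tall as the match radius, so only a zone and its two neighbors can hold matches. A minimal sketch; ZoneA/ZoneB and their columns are assumptions (SkyServer does store cx, cy, cz unit vectors):

```sql
-- Zone-based cross-match, match radius @r degrees.
DECLARE @r float = 1.0 / 3600.0;   -- 1 arcsec, illustrative
-- Hypothetical precomputed tables ZoneA/ZoneB:
--   (zoneID, ra, dec, cx, cy, cz, objID), clustered on (zoneID, ra),
--   with zoneID = FLOOR((dec + 90.0) / @r).
SELECT a.objID AS idA, b.objID AS idB
FROM ZoneA AS a
JOIN ZoneB AS b
  ON  b.zoneID BETWEEN a.zoneID - 1 AND a.zoneID + 1        -- own + neighbor zones
  AND b.ra BETWEEN a.ra - @r / COS(RADIANS(a.dec))
               AND a.ra + @r / COS(RADIANS(a.dec))          -- coarse RA window
WHERE a.cx*b.cx + a.cy*b.cy + a.cz*b.cz > COS(RADIANS(@r)); -- exact distance test
-- (RA wrap-around at 0/360 and the poles need extra care in the real code.)
```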

Trends

CMB Surveys (pixels)
• 1990 COBE: 1,000
• 2000 Boomerang: 10,000
• 2002 CBI: 50,000
• 2003 WMAP: 1 million
• 2008 Planck: 10 million

Galaxy Redshift Surveys (objects)
• 1986 CfA: 3,500
• 1996 LCRS: 23,000
• 2003 2dF: 250,000
• 2005 SDSS: 750,000

Angular Galaxy Surveys (objects)
• 1970 Lick: 1M
• 1990 APM: 2M
• 2005 SDSS: 200M
• 2008 VISTA: 1,000M
• 2012 LSST: 3,000M

Time Domain
• QUEST
• SDSS extension survey
• Dark Energy Camera
• Pan-STARRS
• SNAP…
• LSST…

Petabytes/year by the end of the decade…

Analysis of Large Data Sets

• Data reside in multiple databases
• Applications access data dynamically
• Typical data set sizes in the millions to billions
• Typical dimensionality 3-10
• Attributes are correlated in a non-linear fashion
• One cannot assume a global metric
• Data have non-negligible errors
• Current examples:
– 2-3 point angular correlations of about 10 million galaxies
– 3D power spectrum of about 1 million galaxy distances
– Image shear due to gravitational lensing in 2.5 Terapixels
– Compare 1 million galaxy spectra to 100M model spectra

Next-Generation Data Analysis

• Looking for:
– Needles in haystacks: the Higgs particle
– Haystacks: dark matter, dark energy

• Needles are easier than haystacks
• 'Optimal' statistics have poor scaling
– Correlation functions are N^2, likelihood techniques N^3
– For large data sets the main errors are not statistical

• As data and computers grow with Moore's Law, we can only keep up with N log N (see the sketch below)
• A way out?
– Discard the notion of optimal (data is fuzzy, answers are approximate)
– Don't assume infinite computational resources or memory

• Requires a combination of statistics & computer science
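
A one-line reconstruction of the scaling argument (my gloss, not from the slides): if data N and compute C double on the same Moore's-Law clock, an O(N^a) analysis stays feasible only for a ≤ 1, up to log factors.

```latex
% N(t) ~ 2^{t/\tau} and C(t) ~ 2^{t/\tau} (same doubling time \tau), so
\[
  \frac{N(t)^{a}}{C(t)} = 2^{(a-1)t/\tau} \longrightarrow
  \begin{cases}
    \text{bounded}, & a \le 1 \quad (\text{e.g. } N\log N),\\
    \infty,         & a > 1  \quad (\text{e.g. } N^2,\ N^3).
  \end{cases}
\]
```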

Cosmological Simulations

• Galaxy formation simulations
– Collaboration with V. Springel, S. White, and G. Lemson
– Loading the 'Millennium' galaxies/halos into a database; building a Peano-Hilbert-curve spatial index
– 20 queries, focused on the merger history of halos
– Semi-analytic star formation: linkage with population synthesis models

• Statistical studies (with Jie Wang)
– 50 separate realizations of the large-scale structure in Millennium
– Analyzing non-Gaussian signatures on large scales (PDF, 4th moments)
– Currently loading these into the DB for analysis of nonlinearities on large scales

• Next: build fast ray-tracing algorithms along the light cone inside the DB

Simulation of Turbulence

• Collaboration with C. Meneveau, S. Chen, G. Eyink, R. Burns, and E. Vishniac
– Take a 1024^3 simulation and store 1024 timesteps in a database
– Use a hierarchical Peano-Hilbert curve index within each timestep; atoms are 64^3 cells (see the sketch below)
– Database is tens of TB => build a cluster of cheap servers
– Currently 100 TB in place; the loader works
– Implementing low-level analysis tools inside the DB
– Interpolation schemes done; FFTW ported inside Yukon
– Cheap disks => redundant data storage in dual space
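
A hedged sketch of what the space-filling-curve key buys (the Atoms schema is invented; the real layout may differ): spatially nearby 64^3 blocks get nearby keys, so a spatial subcube turns into a few clustered-index range scans.

```sql
-- Hypothetical layout: one row per 64^3 "atom" per timestep.
--   CREATE TABLE Atoms (
--       timestep int    NOT NULL,
--       sfcKey   bigint NOT NULL,   -- Peano-Hilbert key of the atom
--       field    varbinary(max),    -- packed 64^3 velocity block
--       PRIMARY KEY (timestep, sfcKey)
--   );
DECLARE @lo bigint = 123456, @hi bigint = 123519;  -- ranges from curve library
SELECT sfcKey, field
FROM Atoms
WHERE timestep = 500
  AND sfcKey BETWEEN @lo AND @hi;  -- one of a handful of contiguous ranges
```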

Wireless Sensor Networks

• Collaboration with K. Szlavecz, A. Terzis, J. Gray, S. Ozer
– Building a 200-node network to measure soil moisture for environmental monitoring
– Expect 200 million measurements/yr
– Deriving from the SkyServer template, we were able to build an end-to-end system in less than two weeks
– Built an OLAP datacube: conditional sums along multiple dimensional axes (see the sketch below)
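
In SQL Server 2005 terms the datacube can be as simple as the sketch below; the SoilMoisture table and its columns are invented for illustration, but WITH CUBE is the real 2005-era syntax:

```sql
-- Conditional sums along multiple axes: WITH CUBE emits subtotals for
-- every combination of the grouping columns (node, month), so the web
-- UI can slice without re-aggregating 200M raw measurements.
SELECT nodeID,
       DATEPART(month, obsTime) AS obsMonth,
       AVG(moisture) AS avgMoisture,
       COUNT(*)      AS nObs
FROM SoilMoisture                 -- hypothetical measurements table
GROUP BY nodeID, DATEPART(month, obsTime)
WITH CUBE;
```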

Summary

• Data growing exponentially
• Analyzing so much data requires a new model
• More data coming: petabytes/year by 2010
– Need scalable solutions
– Move the analysis to the data
– Spatial and temporal features are essential
– 20 queries!

• Data explosion is coming from inexpensive sensors
• Same thing happening in all sciences
– High-energy physics, genomics, cancer research, medical imaging, oceanography, remote sensing, …

• eScience: an emerging new branch of science