What to do with Scientific Data?

Michael Stonebraker

TRANSCRIPT

Page 1: What to do with Scientific Data?


What to do with Scientific Data?

Michael Stonebraker

Page 2: What to do with Scientific Data?


Outline

• Science data -- what it looks like

• Software options: file system, RDBMS, SciDB

Page 3: What to do with Scientific Data?


O(100) petabytes

Page 4: What to do with Scientific Data?


Page 5: What to do with Scientific Data?


LSST Data

• Raw imagery: 2D arrays of telescope readings

• “Cooked” into observations: image intensity algorithm (data clustering), spatial data

• Further cooked into “trajectories”: similarity query, constrained by maximum distance
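The “cooking” into trajectories can be illustrated with a toy Python sketch. This is not the LSST pipeline: the greedy nearest-neighbor linking, the `max_dist` parameter, and the function name are illustrative assumptions.

```python
import math

def link_trajectories(epochs, max_dist):
    """Greedily link observations across consecutive epochs into
    trajectories, joining a point to the previous one only when it
    lies within max_dist (the maximum-distance constraint)."""
    # epochs: list of lists of (x, y) observations, one list per epoch
    trajectories = [[p] for p in epochs[0]]
    for points in epochs[1:]:
        for traj in trajectories:
            last = traj[-1]
            # nearest observation in this epoch, if it is close enough
            best = min(points, key=lambda p: math.dist(last, p), default=None)
            if best is not None and math.dist(last, best) <= max_dist:
                traj.append(best)
    return trajectories
```

A real system would also handle births and deaths of trajectories and ambiguous matches; the point here is only that the similarity query is distance-constrained.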

Page 6: What to do with Scientific Data?


Example LSST Queries

• Re-cook raw imagery with my algorithm

• Find all observations in a spatial region

• Find all trajectories that intersect a cylinder in time

Page 7: What to do with Scientific Data?


Chemical Plant Data

• Plant is a directed graph of plumbing

• Sensors at various places: 1/sec observations

• Directed graph of time series

• Plant runs “near the edge” to optimize output

• And fails every once in a while -- down for a week

Page 8: What to do with Scientific Data?


Chemical Plant Data

• Record all data: {(time, sensor-1, …, sensor-5000)}

• Look for interesting events, i.e. sensor values out of whack

• Cluster events near each other in 5000-dimensional space

• Goal is to identify “near-failure modes”
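A minimal Python sketch of the two steps above: flag out-of-whack readings, then cluster nearby events. The threshold test and the greedy single-linkage scheme are hypothetical stand-ins, not the plant's actual analysis.

```python
def find_events(readings, mean, std, k=3.0):
    """Flag time steps where any sensor deviates more than k standard
    deviations from its mean -- the 'out of whack' test (assumed)."""
    events = []
    for t, row in enumerate(readings):
        if any(abs(v - m) > k * s for v, m, s in zip(row, mean, std)):
            events.append((t, row))
    return events

def cluster_events(events, eps):
    """Greedy single-linkage clustering: an event joins a cluster if it is
    within eps (Euclidean) of any member; otherwise it starts a new one."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    clusters = []
    for t, row in events:
        for c in clusters:
            if any(dist(row, r) <= eps for _, r in c):
                c.append((t, row))
                break
        else:
            clusters.append([(t, row)])
    return clusters
```

With 5000 sensors per reading the same code applies unchanged; only the vectors get longer.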

Page 9: What to do with Scientific Data?


Snow Cover in the Sierras

Page 10: What to do with Scientific Data?


Earth Science Data

• Raw imagery: arrays

• “Cooked” into composite images based on multiple passes: cloud-free cells

• “Baked” into snow cover, …
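One hedged reading of the “cooked into composite images” step, as a Python sketch. First-clear-pass selection is an assumption; a real pipeline might average clear passes or rank them by quality.

```python
def cloud_free_composite(passes, cloud_masks, fill=None):
    """Per-cell composite over multiple passes: each output cell takes the
    value from the first pass in which that cell is not cloudy, else `fill`."""
    rows, cols = len(passes[0]), len(passes[0][0])
    out = [[fill] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            for img, mask in zip(passes, cloud_masks):
                if not mask[i][j]:          # cell is cloud-free in this pass
                    out[i][j] = img[i][j]
                    break
    return out
```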

Page 11: What to do with Scientific Data?


General Model

sensors → cooking algorithm(s) (pipeline) → derived data

Page 12: What to do with Scientific Data?


Traditional Wisdom (1)

• Cooking pipeline in hard-coded software or custom hardware

• All data in some file system

Page 13: What to do with Scientific Data?


Problems

• Can’t find anything

• Metadata often not recorded: sensor parameters, cooking parameters

• Can’t easily share anything

• Can’t easily recook anything

• Everything is custom code

Page 14: What to do with Scientific Data?


Traditional Wisdom (2)

• Cooking pipeline outside DBMS

• Derived data loaded into DBMS for subsequent querying

Page 15: What to do with Scientific Data?


Problems with This Approach

• Most of the issues remain

• Better, but stay tuned

Page 16: What to do with Scientific Data?


My Preference

• Load the raw data into a DBMS

• Cooking pipeline is a collection of user-defined functions (DBMS extensions)

• Activated by triggers or a workflow management system

• ALL data captured in a common system!!!!

Page 17: What to do with Scientific Data?


What DBMS to Use?

• RDBMS (e.g. Oracle): pretty hopeless on raw data

Simulating arrays on top of tables is likely to cost a factor of 10–100

Not pretty on time-series data, e.g. find a sensor reading whose average value over the last 3 days is within 1% of the average value of the adjoining 5 sensors
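For contrast, that query is a few lines over an in-memory array. A plain-Python sketch (the function name and the neighbor-window convention are mine, not from the talk):

```python
def matching_sensors(readings, window, neighbors=5, tol=0.01):
    """For each sensor (column), compare its average over the last `window`
    readings with the average of up to `neighbors` adjacent sensors over the
    same window; report sensors within `tol` (1% by default)."""
    recent = readings[-window:]
    n = len(recent[0])
    col_avg = [sum(row[j] for row in recent) / window for j in range(n)]
    hits = []
    for j in range(n):
        lo, hi = max(0, j - neighbors // 2), min(n, j + neighbors // 2 + 1)
        adj = [col_avg[k] for k in range(lo, hi) if k != j]
        neighborhood = sum(adj) / len(adj)
        if abs(col_avg[j] - neighborhood) <= tol * abs(neighborhood):
            hits.append(j)
    return hits
```

The SQL formulation of the same thing needs self-joins and window tricks; that is the slide's point.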

Page 18: What to do with Scientific Data?


RDBMS Summary

• Wrong data model: arrays, not tables

• Wrong operations: regrid, not join

• Missing features: versions, no-overwrite, provenance, support for uncertain data, …
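Regrid, the operation contrasted with join above, reduces to block aggregation on arrays. A minimal Python sketch; block averaging by an integer factor is just one assumed flavor of regrid:

```python
def regrid(a, factor):
    """Downsample a 2D array by averaging non-overlapping factor x factor
    blocks -- the kind of operation an array DBMS makes native."""
    rows, cols = len(a) // factor, len(a[0]) // factor
    out = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            block = [a[i * factor + di][j * factor + dj]
                     for di in range(factor) for dj in range(factor)]
            out[i][j] = sum(block) / len(block)
    return out
```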

Page 19: What to do with Scientific Data?


But your mileage may vary

• SQL Server working well for the Sloan SkyServer database

• See the paper in CIDR 2009 by Jose Blakeley

Page 20: What to do with Scientific Data?


How to Do Analytics (e.g. clustering)

• Suck out the data

• Convert to array format

• Pass to Matlab, SAS, R

• Compute

• Return answer to DBMS

Page 21: What to do with Scientific Data?


Bad News

• Painful

• Slow

• Many analysis platforms are not parallel

• Many are main-memory only

Page 22: What to do with Scientific Data?


My Proposal - SciDB

• Build a commercial-quality array DBMS from the ground up, with integrated analytics

• Open source!!!!

Page 23: What to do with Scientific Data?


Data Model

• Nested multi-dimensional arrays: cells can be tuples or other arrays

• Time is an automatically supported extra dimension

• Ragged arrays allow each row/column to have a different dimensionality

• Support for multiple flavors of ‘null’: array cells can be ‘EMPTY’, with user-definable, context-sensitive treatment
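The EMPTY-versus-null distinction can be sketched in a few lines of Python. This is a toy model, not SciDB's implementation: an EMPTY cell is never stored at all, while a stored cell may still hold a null value.

```python
EMPTY = object()    # sentinel: the cell does not exist, distinct from null

class SparseArray:
    """Toy array whose cells are either EMPTY (absent from storage) or hold
    a value -- possibly None, standing in for one flavor of null."""
    def __init__(self, shape):
        self.shape = shape
        self.cells = {}                 # only non-EMPTY cells are stored
    def __setitem__(self, idx, value):
        self.cells[idx] = value
    def __getitem__(self, idx):
        return self.cells.get(idx, EMPTY)
```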

Page 24: What to do with Scientific Data?


Array-oriented storage manager

• Optimized for both dense and sparse array data: different data storage, compression, and access

• Arrays are “chunked” (in multiple dimensions)

• Chunks are partitioned across a collection of nodes

• Chunks have ‘overlap’ to support neighborhood operations

• Replication provides efficiency and back-up

• No overwrite
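Chunking with overlap can be sketched in one dimension (a toy; SciDB chunks in multiple dimensions). Each chunk carries a halo of neighboring cells so that neighborhood operations, e.g. a moving average, never need data held by another node.

```python
def chunk_1d(a, chunk, overlap):
    """Split a 1D array into fixed-size chunks, each extended by `overlap`
    cells on both sides (clipped at the array boundary)."""
    chunks = []
    for start in range(0, len(a), chunk):
        lo = max(0, start - overlap)
        hi = min(len(a), start + chunk + overlap)
        chunks.append(a[lo:hi])
    return chunks
```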

Page 25: What to do with Scientific Data?


[Architecture diagram: Application Layer (Application; Language-Specific UI in Java, C++, whatever…; doesn’t require JDBC or ODBC; AQL, an extension of SQL, which also supports UDFs) over Server Layer (Query Interface and Parser; Plan Generator; Runtime Supervisor) over Storage Layer (a Local Executor and Storage Manager on each of Node 1, Node 2, Node 3)]

Architecture

• Shared-nothing cluster: 10’s–1000’s of nodes, commodity hardware, TCP/IP between nodes, linear scale-up

• Each node has a processor and storage

• Queries refer to arrays as if not distributed

• Query planner optimizes queries for efficient data access & processing

• Query plan runs on each node’s local executor & storage manager

• Runtime supervisor coordinates execution

Page 26: What to do with Scientific Data?


SciDB DDL

CREATE ARRAY Test_Array
    < A: integer NULLS, B: double, C: USER_DEFINED_TYPE >
    [ I=0:99999,1000,10, J=0:99999,1000,10 ]
PARTITION OVER ( Node1, Node2, Node3 )
USING block_cyclic();

chunk size: 1000; overlap: 10; attribute names: A, B, C; index names: I, J

Page 27: What to do with Scientific Data?


Array Query Language (AQL)

• Array data management: e.g. filter, aggregate, join, etc.

• Statistical & linear algebra operations: multiply, QR factorization, etc.; parallel and disk-oriented

• User-defined operators (Postgres-style)

• Interface to external stat packages (e.g. R)

Page 28: What to do with Scientific Data?


Array Query Language (AQL)

SELECT Geo-Mean( T.B )
FROM Test_Array T
WHERE T.I BETWEEN :C1 AND :C2
  AND T.J BETWEEN :C3 AND :C4
  AND T.A = 10
GROUP BY T.I;

Geo-Mean is a user-defined aggregate on attribute B in array T; the BETWEEN predicates subsample, T.A = 10 filters, and GROUP BY groups.

So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL.
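The AQL query above can be mimicked in plain Python to make the semantics concrete. The cell layout and function name here are illustrative assumptions, not SciDB machinery:

```python
import math

def geo_mean_query(cells, c1, c2, c3, c4):
    """cells: list of (I, J, A, B) tuples. Filter on the I/J subsample and
    A = 10, group by I, and compute the geometric mean of B per group."""
    groups = {}
    for i, j, a, b in cells:
        if c1 <= i <= c2 and c3 <= j <= c4 and a == 10:
            groups.setdefault(i, []).append(b)
    # geometric mean via logs: exp(mean(log b))
    return {i: math.exp(sum(math.log(b) for b in bs) / len(bs))
            for i, bs in groups.items()}
```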

Page 29: What to do with Scientific Data?


Matrix Multiply

• Algorithm sensitive to distribution (range vs. block-cyclic)

• For range partitioning, the smaller of the two arrays is replicated at all nodes (scatter-gather)

• Each node multiplies its “core” of the bigger array with the replicated smaller one

• Produces a distributed answer

CREATE ARRAY TS_Data < A1:int32, B1:double >
    [ I=0:99999,1000,0, J=0:3999,100,0 ]

multiply ( project (TS_Data, B1),
           transpose (project (TS_Data, B1)) )

10K x 4K; project on one attribute
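The replicated-operand strategy above can be sketched as a single-process simulation: the rows of the bigger array are range-partitioned across workers, the smaller array is replicated to each, and the per-worker products concatenate into the distributed answer. Function names are mine.

```python
def replicated_multiply(big, small, nodes):
    """Toy scatter-gather matrix multiply: partition `big` by row ranges
    across `nodes` workers, replicate `small` to all of them, multiply each
    slab locally, and gather the slabs into the final product."""
    def local_multiply(slab):
        # ordinary dense multiply of one slab by the replicated operand
        return [[sum(r[k] * small[k][c] for k in range(len(small)))
                 for c in range(len(small[0]))] for r in slab]
    per_node = (len(big) + nodes - 1) // nodes          # rows per worker
    slabs = [big[i:i + per_node] for i in range(0, len(big), per_node)]
    result = []
    for slab in slabs:                                   # one "node" each
        result.extend(local_multiply(slab))
    return result
```

In a real shared-nothing cluster the slabs never move; only the small operand is shipped, which is why the smaller array is the one replicated.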

Page 30: What to do with Scientific Data?


Other Features Science Guys Want (these could be in an RDBMS, but aren’t)

• Uncertainty: data has error bars, which must be carried along in the computation (interval arithmetic)
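Interval arithmetic, the mechanism named above, in a minimal Python sketch: a value carries its error bars as [lo, hi], and arithmetic propagates them. Only addition and multiplication are shown; a real uncertain-data type would cover the full operator set.

```python
class Interval:
    """Minimal interval-arithmetic value carrying [lo, hi] error bars."""
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi
    def __add__(self, other):
        # sum of intervals: endpoints add
        return Interval(self.lo + other.lo, self.hi + other.hi)
    def __mul__(self, other):
        # product bounds come from the four endpoint products
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))
```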

Page 31: What to do with Scientific Data?


Other Features

• Time travel: don’t fix errors by overwrite, i.e. keep all the data in the extra time dimension

• Named versions: recalibration usually handled this way
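Time travel via no-overwrite can be sketched as an append-only cell store (a toy model; the class and method names are mine): a write appends a (time, value) pair instead of replacing, so any earlier state can be read back.

```python
class TimeTravelArray:
    """No-overwrite cell store: writes append along the time dimension."""
    def __init__(self):
        self.history = {}           # cell index -> [(time, value), ...]
    def write(self, idx, time, value):
        self.history.setdefault(idx, []).append((time, value))
    def read(self, idx, as_of):
        """Value of cell idx as of the given time (latest write <= as_of)."""
        versions = [(t, v) for t, v in self.history.get(idx, []) if t <= as_of]
        return max(versions)[1] if versions else None
```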

Page 32: What to do with Scientific Data?


Other Features

• Provenance (lineage)

What calibration generated the data

What was the “cooking” algorithm

In general: repeatability of derived data

Page 33: What to do with Scientific Data?


SciDB Status (www.scidb.org)

• Global development team (17 people), including many volunteers

• Support/distribution by a VC-backed company, Paradigm4, in Waltham

• 0.75 on website now: have at it!!!

• 1.0 release will be on website within a month: very functional, but a little pokey

• Red Queen will be out in October: blazing performance, probably 5X the current system

Page 34: What to do with Scientific Data?


Performance of SciDB 1.0

• Performance on stat operations: comparable to R, and degrades in a similar way when data does not fit in main memory; but multi-node parallelism is provided, so scalability is much better

• Performance on SQL: comparable to or better than Postgres, especially on multi-attribute restrictions; and multi-node parallelism is provided

• And we can do stuff they can’t….