What to do with Scientific Data?
Michael Stonebraker
Outline
• Science data -- what it looks like
• Software options
  File system
  RDBMS
  SciDB
O(100) petabytes
LSST Data
• Raw imagery
  2D arrays of telescope readings
• “Cooked” into observations
  Image intensity algorithm (data clustering)
  Spatial data
• Further cooked into “trajectories”
  Similarity query
  Constrained by maximum distance
Example LSST Queries
• Re-cook raw imagery with my algorithm
• Find all observations in a spatial region
• Find all trajectories that intersect a cylinder in time
Chemical Plant Data
• Plant is a directed graph of plumbing
• Sensors at various places
  1/sec observations
• Directed graph of time series
• Plant runs “near the edge” to optimize output
• And fails every once in a while -- down for a week
Chemical Plant Data
• Record all data: {(time, sensor-1, … , sensor-5000)}
• Look for interesting events, i.e. sensor values out of whack
• Cluster events near each other in 5000-dimensional space
• Goal is to identify “near-failure modes” (see the sketch below)
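A minimal Python sketch of this event-then-cluster idea, not from the talk: numpy and scikit-learn are assumed, and the 4-sigma event rule and k-means are stand-ins for whatever the plant analysts actually use; all names and shapes are illustrative.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
readings = rng.normal(100.0, 5.0, size=(3_600, 5_000))   # one hour of 1/sec data

# An "interesting event": any reading more than 4 sigma from its sensor's mean.
deviant = np.abs(readings - readings.mean(axis=0)) > 4 * readings.std(axis=0)
events = readings[deviant.any(axis=1)]           # snapshots containing an event

# Cluster event snapshots in the full 5000-dimensional sensor space;
# tight clusters are candidate "near-failure modes".
modes = KMeans(n_clusters=5, n_init=10).fit(events)
print(modes.cluster_centers_.shape)              # (5, 5000)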
Snow Cover in the Sierras
Earth Science Data
• Raw imagery – arrays
• “Cooked” into composite images based on multiple passes
  Cloud-free cells
• “Baked” into snow cover, …
General Model

sensors → cooking algorithm(s) (pipeline) → derived data
Traditional Wisdom (1)
• Cooking pipeline in hard-coded software or custom hardware
• All data in some file system
Problems
• Can’t find anything
• Metadata often not recorded
  Sensor parameters
  Cooking parameters
• Can’t easily share anything
• Can’t easily recook anything
• Everything is custom code
Traditional Wisdom (2)
• Cooking pipeline outside DBMS
• Derived data loaded into DBMS for subsequent querying
Problems with This Approach
• Most of the issues remain
• Better, but stay tuned
My Preference
• Load the raw data into a DBMS
• Cooking pipeline is a collection of user-defined functions (DBMS extensions)
• Activated by triggers or a workflow management system
• ALL data captured in a common system!!!!
What DBMS to Use?
• RDBMS (e.g. Oracle)
  Pretty hopeless on raw data
  Simulating arrays on top of tables likely to cost a factor of 10-100
  Not pretty on time series data
  Find me a sensor reading whose average value over the last 3 days is within 1% of the average value of the adjoining 5 sensors (see the array sketch below)
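For contrast, a minimal numpy sketch, not from the talk, of how naturally that query maps onto a sensor × time array; the random data, the neighbor window standing in for “adjoining 5 sensors,” and all names are illustrative.

import numpy as np

rng = np.random.default_rng(1)
readings = rng.normal(size=(100, 3 * 1_440))     # 100 sensors, 3 days of minutes

avg3d = readings.mean(axis=1)                    # per-sensor 3-day average

# Compare each sensor to the mean of a small window of nearby sensors
# (five neighbors for interior sensors; clipped at the edges).
matches = []
for i in range(len(avg3d)):
    lo, hi = max(0, i - 3), min(len(avg3d), i + 3)
    neighbors = np.delete(avg3d[lo:hi], i - lo)  # drop sensor i itself
    if abs(avg3d[i] - neighbors.mean()) <= 0.01 * abs(neighbors.mean()):
        matches.append(i)
print(matches)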
RDBMS Summary
• Wrong data model
  Arrays, not tables
• Wrong operations
  Regrid, not join
• Missing features
  Versions, no-overwrite, provenance, support for uncertain data, …
But your mileage may vary
• SQL Server working well for Sloan Skyserver database
• See paper in CIDR 2009 by Jose Blakeley
How to Do Analytics (e.g. clustering)
• Suck out the data
• Convert to array format
• Pass to Matlab, SAS, R
• Compute
• Return answer to DBMS (the round trip is sketched below)
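A minimal Python sketch of that round trip, not from the talk, with sqlite3 and numpy standing in for the DBMS and the stat package; the table, the shapes, and the column-mean “analysis” are all illustrative.

import sqlite3
import numpy as np

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE readings (i INTEGER, j INTEGER, v REAL)")
db.executemany("INSERT INTO readings VALUES (?, ?, ?)",
               [(i, j, float(i * 10 + j)) for i in range(10) for j in range(10)])

# 1. Suck out the data and 2. convert to array format (one full copy each).
rows = db.execute("SELECT v FROM readings ORDER BY i, j").fetchall()
arr = np.array([v for (v,) in rows]).reshape(10, 10)

# 3. Pass to the analytics package and compute (column means stand in
# for the real analysis).
result = arr.mean(axis=0)

# 4. Return the answer to the DBMS (another copy, another table).
db.execute("CREATE TABLE col_means (j INTEGER, mean REAL)")
db.executemany("INSERT INTO col_means VALUES (?, ?)",
               list(enumerate(map(float, result))))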
Bad News
• Painful
• Slow
• Many analysis platforms are not parallel
• Many are main-memory only
My Proposal - SciDB
• Build a commercial-quality array DBMS from the ground up, with integrated analytics
• Open source!!!!
Data Model
• Nested multi-dimensional arrays
  Cells can be tuples or other arrays
• Time is an automatically supported extra dimension
• Ragged arrays allow each row/column to have a different dimensionality
• Support for multiple flavors of ‘null’
  Array cells can be ‘EMPTY’
  User-definable, context-sensitive treatment
Array-oriented storage manager
• Optimized for both dense and sparse array data
  Different data storage, compression, and access
• Arrays are “chunked” (in multiple dimensions)
• Chunks are partitioned across a collection of nodes
• Chunks have ‘overlap’ to support neighborhood operations (see the sketch below)
• Replication provides efficiency and back-up
• No overwrite
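A minimal Python sketch of chunking with overlap, not from the talk; the chunk size and overlap echo the DDL two slides down, and everything else (numpy, the function name, the array) is illustrative.

import numpy as np

def chunks_with_overlap(a, chunk=1000, overlap=10):
    # Yield each chunk widened by `overlap` cells on every side, so a
    # neighborhood operation near a chunk edge never needs a second chunk.
    for i in range(0, a.shape[0], chunk):
        for j in range(0, a.shape[1], chunk):
            yield a[max(0, i - overlap):min(a.shape[0], i + chunk + overlap),
                    max(0, j - overlap):min(a.shape[1], j + chunk + overlap)]

a = np.zeros((4000, 4000))
print(sum(1 for _ in chunks_with_overlap(a)))   # 16 chunks, clipped at edges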
[Architecture diagram: an Application Layer (application, language-specific UI) sits on a Server Layer (query interface and parser, plan generator, runtime supervisor), which sits on a Storage Layer of Nodes 1-3, each running a local executor and storage manager. Callouts: doesn’t require JDBC/ODBC; AQL, an extension of SQL, also supports UDFs; clients in Java, C++, whatever…]
Architecture
• Shared-nothing cluster
  10’s–1000’s of nodes
  Commodity hardware
  TCP/IP between nodes
  Linear scale-up
• Each node has a processor and storage
• Queries refer to arrays as if not distributed
• Query planner optimizes queries for efficient data access & processing
• Query plan runs on a node’s local executor & storage manager
• Runtime supervisor coordinates execution
SciDB DDL
CREATE ARRAY Test_Array
  < A: integer NULLS, B: double, C: USER_DEFINED_TYPE >
  [ I=0:99999,1000,10, J=0:99999,1000,10 ]
PARTITION OVER ( Node1, Node2, Node3 )
USING block_cyclic();

Attribute names: A, B, C. Index names: I, J. Chunk size: 1000. Overlap: 10.
Array Query Language (AQL)
• Array data management, e.g. filter, aggregate, join, etc.
• Statistical & linear algebra operations
  Multiply, QR factorization, etc.
  Parallel, disk-oriented
• User-defined operators (Postgres-style)
• Interface to external stat packages (e.g. R)
Array Query Language (AQL)
SELECT Geo-Mean ( T.B )
FROM Test_Array T
WHERE
T.I BETWEEN :C1 AND :C2
AND T.J BETWEEN :C3 AND :C4
AND T.A = 10
GROUP BY T.I;
Geo-Mean is a user-defined aggregate on attribute B in array T; the BETWEEN predicates subsample, T.A = 10 filters, and GROUP BY T.I groups.
So far as SELECT / FROM / WHERE / GROUP BY queries are concerned, there is little logical difference between AQL and SQL
Matrix Multiply
• Algorithm sensitive to distribution (range, block-cyclic)
• For range partitioning, the smaller of the two arrays is replicated at all nodes
  Scatter-gather
• Each node does its “core” of the bigger array with the replicated smaller one
• Produces a distributed answer (see the sketch after the example)
CREATE ARRAY TS_Data < A1:int32, B1:double >
  [ I=0:99999,1000,0, J=0:3999,100,0 ]

multiply (project (TS_Data, B1),
          transpose (project (TS_Data, B1)))

Slide callouts: 10K x 4K; project on one attribute.
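A minimal numpy sketch of the replicated-multiply idea above, not from the talk; three list entries stand in for nodes, and all shapes and names are illustrative.

import numpy as np

big = np.arange(12.0).reshape(6, 2)      # the range-partitioned operand
small = np.arange(6.0).reshape(2, 3)     # the operand replicated to every node

nodes = np.array_split(big, 3, axis=0)   # one row-range "core" per node

# Each node multiplies its own core by the replicated small array; the
# partial results together form a distributed answer, partitioned like big.
partials = [core @ small for core in nodes]
assert np.allclose(np.vstack(partials), big @ small)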
Other Features Science Guys Want (these could be in an RDBMS, but aren’t)
• Uncertainty
  Data has error bars
  Which must be carried along in the computation
  Interval arithmetic (see the sketch below)
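A minimal Python sketch of interval arithmetic, not from the talk: each value carries its error bars through every operation. A real implementation would also handle division, outward rounding, correlated operands, and so on; all names here are illustrative.

from dataclasses import dataclass

@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __mul__(self, other):
        p = (self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi)
        return Interval(min(p), max(p))

flux = Interval(9.8, 10.2)   # a reading of 10 with 2% error bars
gain = Interval(1.9, 2.1)
print(flux * gain)           # Interval(lo=18.62, hi=21.42) -- wider bars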
Other Features
• Time travel
  Don’t fix errors by overwrite, i.e. keep all the data in the extra time dimension (sketched below)
• Named versions
  Recalibration usually handled this way
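A minimal Python sketch of no-overwrite “time travel,” not from the talk; SciDB’s actual mechanism is its extra time dimension, and this append-plus-as-of-read class is only an illustration (writes are assumed to arrive in time order).

import bisect

class TimeTravelCell:
    # Append-only history of (time, value); nothing is ever overwritten.
    def __init__(self):
        self._times, self._values = [], []

    def write(self, t, value):          # assumes t is monotonically increasing
        self._times.append(t)
        self._values.append(value)

    def read_as_of(self, t):            # latest value written at or before t
        i = bisect.bisect_right(self._times, t)
        return self._values[i - 1] if i else None

cell = TimeTravelCell()
cell.write(1, 42.0)
cell.write(5, 42.5)                     # a "fix" that keeps the old value
assert cell.read_as_of(3) == 42.0       # the pre-fix value is still visible
assert cell.read_as_of(9) == 42.5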
Other Features
• Provenance (lineage)
  What calibration generated the data
  What was the “cooking” algorithm
  In general, repeatability of derived data
SciDB Status (www.scidb.org)
• Global development team (17 people), including many volunteers
• Support/distribution by a VC-backed company, Paradigm4, in Waltham
• 0.75 on website now
  Have at it!!!
• 1.0 release will be on website within a month
  Very functional, but a little pokey
• “Red Queen” will be out in October
  Blazing performance, probably 5X the current system
Performance of SciDB 1.0
• Performance on stat operations
  Comparable to R; degrades in a similar way when data does not fit in main memory
  But multi-node parallelism provided -- much better scalability
• Performance on SQL
  Comparable to or better than Postgres, especially on multi-attribute restrictions
  And multi-node parallelism provided
• And we can do stuff they can’t….