
Scientific requirements and an implementation

An Array Library for MS SQL Server

EDBT/ICDT 2011, Uppsala, Sweden, Workshop on Array Databases

László Dobos (1, 2)

Alex Szalay (2), José Blakeley (3), Tamás Budavári (2), István Csabai (1, 2), Dragan Tomic (3), Milos Milovanovic (3), Marko Tintor (3), Andrija Jovanovic (3)

(1) Eötvös University, Budapest, Hungary, Department of Physics of Complex Systems

(2) The Johns Hopkins University, Baltimore, USA

(3) Microsoft


Outline

• A short review of our earlier projects

• Scientific motivations and some use cases

• Requirements for a simple array type

• Pros and cons of using RDBMS systems

• An implementation using MSSQL

• Possible upgrades to MSSQL

• Conclusions and Summary



Astronomical DBs in the exponential era

• Systematic surveying of the sky

• Repeated observations

– Fainter objects

– Time domain data

• Multiple wavelengths

– Different data sets

– Match by coordinates

• New software tools are needed

[Figure: camera pixels (Mpix), detected celestial objects (M) and DB size (GB) on a logarithmic axis from 1E+00 to 1E+07, for SDSS (2000), PanSTARRS (2010) and LSST (2020)]


Earlier projects: SkyServer

• MSSQL database

• Astronomical data from the Sloan Digital Sky Survey (SDSS)

• Brightness measurements for 580 million objects: stars, galaxies, etc. in five color filters

• 4.5 TB row data to date (+ images in blobs)

• Introduced SQL to the astronomer community

– Minimal GUI

– Main interface is pure SQL

– Lots of user-defined functions

– Users use ‘MyDB’ to store results

– Spatial search capabilities


http://skyserver.sdss.org

Alex Szalay, Jim Gray et al.


Our earlier projects: Spectrum Services

• Sloan Digital Sky Survey spectroscopic observations

• 1 million spectra

• 1 TB range

• Spectra are high dimensional vectors in blobs (D > 3000)

• Rich web based GUI + Web services

• User built pipelines for reprocessing measurements to fit science needs

• Search based on spectral features


http://voservices.net/spectrum

Dobos et al.


Astronomy

• Living data

– Recalibration, transformation on a daily basis

– Large bulk-inserts on a weekly basis, merging

• Observations of individual objects

– Point-like data (D = 5 … 10)

– On the order of billions of rows

– Automatic classification, cluster finding algorithms

• Spatial search

– High precision spherical geometry (milli-arcsecond resolution)

– Tree based search in many dimensions

• 2-point and n-point correlation functions

– On the sphere and in many dimensions
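
The basic building block of a 2-point correlation function is a histogram of pair separations. The sketch below is a brute-force illustration in plain Python, assuming Euclidean distances and a handful of made-up points; the function name and the bin edges are hypothetical and not part of the library described in this talk. A real estimator would additionally compare these counts against pair counts from a random catalogue and would use a tree-based search instead of the O(N^2) loop.

```python
import math
from itertools import combinations


def pair_count_histogram(points, bin_edges):
    """Count point pairs per separation bin (brute force, O(N^2)).

    points    : list of equal-length coordinate tuples
    bin_edges : increasing list of edges; bin i covers [edge[i], edge[i+1])
    """
    counts = [0] * (len(bin_edges) - 1)
    for p, q in combinations(points, 2):
        r = math.dist(p, q)  # Euclidean separation of the pair
        for i in range(len(counts)):
            if bin_edges[i] <= r < bin_edges[i + 1]:
                counts[i] += 1
                break
    return counts


# Made-up example data: four points in the plane, three separation bins.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (3.0, 4.0)]
edges = [0.0, 1.5, 4.5, 6.0]
print(pair_count_histogram(points, edges))  # prints [3, 2, 1]
```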



Astronomy and astrophysics

• Spectroscopy: o(100 GB) range

– High dimensional vectors [D = o(1000)]

– Complete reruns of data processing on a daily basis

– Resampling, PCA, function fitting, etc.

• Cosmological simulations: o(100 TB)

– Position and velocity for each particle

  • No grid, high number of particles [Np = o(10^6 to 10^7)]

  • Data dumps at timesteps [Nt = o(10^2)]

  • Multiple runs of the simulation [Nr = 500]

– Individual points are impossible to store

– Organize points in octree, store chunks of data in arrays

– Identify clusters of particles: friends-of-friends algorithm

– Compute density over regular grid, FFT, correlation functions

Image from: Springel, V. et al. 2005, Nature
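
The friends-of-friends step mentioned above links any two particles closer than a chosen linking length and reports the connected groups. The following is a minimal brute-force sketch using a union-find structure, assuming Euclidean distances; the function name, the linking length and the particle positions are made up for illustration, and a production version would search neighbours through the octree rather than testing every pair.

```python
import math


def friends_of_friends(points, linking_length):
    """Return groups of point indices connected by links <= linking_length."""
    parent = list(range(len(points)))

    def find(i):
        # Follow parent links to the group root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Link every pair of "friends" (brute force, O(N^2)).
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if math.dist(points[i], points[j]) <= linking_length:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    # Collect the members of each group by their root.
    groups = {}
    for i in range(len(points)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up particle positions: a chain of three, a close pair, and a loner.
particles = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0),
             (5.0, 5.0, 5.0), (5.2, 5.0, 5.0), (9.0, 0.0, 0.0)]
print(friends_of_friends(particles, linking_length=0.6))
# prints [[0, 1, 2], [3, 4], [5]]
```

Particles 0 and 2 are farther apart than the linking length, but they end up in the same group because both are friends of particle 1.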


Turbulent flow simulations

• Distribution of data over a regular grid [size o(100 TB)]

• 1024^3 grid, 2000 time steps, velocity, pressure [D = 4]

• Idea:

• Do not download full data for analysis

• Put virtual sensors into the flow (list of positions, time)

• Output: velocity field at the given positions

• Store grid in chunks, with buffer zones

• Organize chunks along space-filling curve

• Interpolation using various kernels (hence the buffers)

• Visualization…

Image from: E. Perlman, R. Burns, Y. Li, and C. Meneveau. Exploration of Supercomputing SC07, ACM, IEEE, 2007.
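
One way to picture the chunked layout is sketched below. The slide does not say which space-filling curve is used, so the Z-order (Morton) key here is an assumption, as are the chunk size of 64 cells and the buffer width of 4 cells. The second helper computes the cell range a chunk would store along one axis once the buffer zone is added; it clips at the grid boundary for simplicity, whereas a periodic simulation box would wrap around instead.

```python
def morton3(x, y, z, bits=10):
    """Interleave the bits of three chunk coordinates into a Z-order key."""
    key = 0
    for b in range(bits):
        key |= ((x >> b) & 1) << (3 * b)
        key |= ((y >> b) & 1) << (3 * b + 1)
        key |= ((z >> b) & 1) << (3 * b + 2)
    return key


def chunk_extent(index, chunk, buffer, grid):
    """Cell range [start, stop) of one chunk along one axis, buffer included.

    Clipped to the grid; a periodic box would wrap instead (not modelled).
    """
    start = max(index * chunk - buffer, 0)
    stop = min((index + 1) * chunk + buffer, grid)
    return start, stop


# Order the eight chunks of a 2 x 2 x 2 block along the curve.
chunks = sorted(
    ((x, y, z) for z in range(2) for y in range(2) for x in range(2)),
    key=lambda c: morton3(*c),
)
print(chunks[:4])  # prints [(0, 0, 0), (1, 0, 0), (0, 1, 0), (1, 1, 0)]

# Hypothetical sizes: 1024 cells per axis, 64-cell chunks, 4-cell buffers.
print(chunk_extent(1, chunk=64, buffer=4, grid=1024))  # prints (60, 132)
```

Because each stored chunk carries its own buffer cells, an interpolation kernel centred near a chunk edge can be evaluated from that chunk alone, without reading its neighbours.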



Requirements for a SQL array type

• Simple extension to SQL

• Multiple dimensions, up to at least D = 6

• Main numerical data-types, complex numbers

• Interface to math libraries

– LAPACK, FFTW, Matlab…

• High-performance subsetting functions

• Easy aggregation over arrays

• Efficient support for both

– small arrays (in-page) and

– large arrays (blob)



Extending Microsoft SQL Server

• Pros:

– Effective CPU, memory, disk and network management out of the box

– Well supported by Microsoft

– Extensibility via user-defined functions and types

– .NET interface

• Drawbacks:

– No native array support in SQL CLR

– SQL CLR still lacks some of the features needed for this

– No access to original codebase


Our implementation

• Support for in-page and out-of-page arrays

• Implemented in .NET SQL CLR

– C++/CLI for two reasons:

  • Benefit from template classes (not available in C#) to implement support for all base data types

  • Test the possibilities

• Store data as pure binary with a short header

• No user-defined data types

– Lots of user-defined functions instead

• Byte order to match common math libraries
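
To make "pure binary with a short header" concrete, here is a small sketch of what such a format could look like. The slides do not specify the header, so everything below is a hypothetical layout chosen for illustration: a one-byte type code, a one-byte rank, one 32-bit length per dimension, then the items as little-endian 64-bit floats. It is not the library's actual on-disk format.

```python
import struct

FLOAT64 = 1  # hypothetical type code for 64-bit floats


def pack_array(shape, items):
    """Serialize a float array as header + raw items (hypothetical layout)."""
    count = 1
    for n in shape:
        count *= n
    if count != len(items):
        raise ValueError("shape does not match item count")
    header = struct.pack("<BB", FLOAT64, len(shape))
    header += struct.pack(f"<{len(shape)}i", *shape)
    return header + struct.pack(f"<{count}d", *items)


def unpack_array(blob):
    """Read the header back, then the items it describes."""
    type_code, rank = struct.unpack_from("<BB", blob, 0)
    if type_code != FLOAT64:
        raise ValueError("unsupported type code")
    shape = struct.unpack_from(f"<{rank}i", blob, 2)
    count = 1
    for n in shape:
        count *= n
    items = struct.unpack_from(f"<{count}d", blob, 2 + 4 * rank)
    return list(shape), list(items)


blob = pack_array([5], [1.0, 2.0, 3.0, 4.0, 5.0])
print(len(blob))           # prints 46 (2 + 4 header bytes, 5 * 8 data bytes)
print(unpack_array(blob))  # prints ([5], [1.0, 2.0, 3.0, 4.0, 5.0])
```

Keeping the payload as a plain run of fixed-size items means a math library can read it in place once the header is skipped, which is presumably the motivation for matching the libraries' byte order.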



Our implementation

• No SQL language extensions (yet)

• Sample code (kind of ugly), e.g.

– Allocate vector with 5 items, get first item

    DECLARE @a VARBINARY(100) =
        FloatArray.Vector_5(1.0, 2.0, 3.0, 4.0, 5.0)

    SELECT FloatArray.Item_1(@a, 3)

– Get subarray from another array @b

    SET @b = FloatArrayMax.Subarray(@b,
        IntArray.Vector_3(1, 4, 6),
        IntArray.Vector_3(5, 5, 5), 0)
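
The subarray call above takes two index vectors. Reading them as an offset and a size is an assumption, and the meaning of the trailing 0 argument is not stated on the slide, so it is not modelled here. Under that reading, the operation can be sketched in plain Python on a flat buffer stored in column-major order (first index varies fastest, as in Fortran-style math libraries). The function name and the zero-based offsets are choices made for this illustration only.

```python
def subarray(data, shape, offset, size):
    """Copy a block out of a flat, column-major array.

    data   : flat list of items, first index varying fastest
    shape  : length of each dimension of the source array
    offset : zero-based start of the block in each dimension
    size   : length of the block in each dimension
    """
    rank = len(shape)
    for d in range(rank):
        if offset[d] < 0 or offset[d] + size[d] > shape[d]:
            raise IndexError("subarray exceeds source bounds")

    # Stride of dimension d = number of items skipped per step along d.
    strides = [1] * rank
    for d in range(1, rank):
        strides[d] = strides[d - 1] * shape[d - 1]

    total = 1
    for n in size:
        total *= n

    out = []
    index = [0] * rank  # position inside the block
    for _ in range(total):
        flat = sum((offset[d] + index[d]) * strides[d] for d in range(rank))
        out.append(data[flat])
        # Advance the block position, first dimension fastest.
        for d in range(rank):
            index[d] += 1
            if index[d] < size[d]:
                break
            index[d] = 0
    return out


# A 4 x 3 source array whose item (i, j) equals i + 4 * j.
source = list(range(12))
print(subarray(source, [4, 3], offset=[1, 1], size=[2, 2]))
# prints [5, 6, 9, 10]
```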


Our implementation: Performance

• CPU limited performance with small arrays

– Very fast I/O subsystem was used

• Performance mostly limited by the current capabilities of SQL CLR

– Relatively expensive function calls

– No cross-call caching of data

– Data copy from native to managed memory required

• No numbers on out-of-page arrays yet



Possible upgrades to SQL CLR

• We used scalar functions because of UDT limitations

• To make UDTs suitable one needs

– Partial serialization/deserialization of UDTs

– Cross-row UDT instance caching

– Support for ordered aggregation of UDTs



Conclusions and summary

• SQL is a must for astronomers these days

• Building a powerful array library in .NET SQL CLR is already feasible, though tricky

• Upgrades to user-defined types would be great

• SQL language extensions required

– At least in the form of a pre-parser
