The MADlib Analytics Library or MAD Skills, the SQL

Caleb Welton, Florian Schoppmann, Christopher Ré


MADlib

Scalable Machine Learning for Big Data


Traditional analytics pipeline

[Figure: pipeline with the stages DB Extract, Data Prep, and DB Import, and the files sample.csv, spec.docx, and scores.csv; annotated "Time-to-Insights"]


The MAD approach

[Figure: Enterprise Data held in several RDBMS instances, with the stages Data Prep, Model, and Score; annotated "Time-to-Insights" and "Reduced Data Movement"]

Billions of rows in minutes


MADlib in Action

Hospital Admittance Case Study


MADlib in Action

Step 1:

•  Identify high-risk patients

Goal:

•  High-risk patients will be eligible for early admittance and be administered preemptive antibiotics


MADlib in Action

Step 2:

•  Build cost model for treatment

[Figure: plot of y against x]

Goal:

•  Predict expected cost of treatment
•  With and without early admittance


MADlib in Action

Step 3:

•  Optimize early admittance based on risk and cost model

Goal:

•  Overall hospital costs will be minimized and patients will receive better care.


MADlib cycle of success

[Figure: cycle diagram with the labels Identify Problem, Use Math, Insights, and Value!]

"Why didn't I think of that before?"


The MADlib Vision

•  Academic and industry contributions
•  Think of "CRAN for databases"
   –  Repository of open-source ML algorithms
   –  This time with data parallelism in mind
•  Open-Source Framework

Eigen, BSD License


Simple Example: Ordinary Least Squares

# SELECT (linregr(y, x)).* FROM data;
-[ RECORD 1 ]+------------------------
coef         | {1.7307,2.2428}
r2           | 0.9475
std_err      | {0.3258,0.0533}
t_stats      | {5.3127,42.0640}
p_values     | {6.7681e-07,4.4409e-16}
condition_no | 169.5093

# SELECT y, x[1] AS x1, x[2] AS x2 FROM data;
   y   |  x1  | x2
-------+------+-----
 10.14 |    0 | 0.3
 11.93 | 0.69 | 0.6
 13.57 |  1.1 | 0.9
 14.17 | 1.39 | 1.2
 15.25 | 1.61 | 1.5
 16.15 | 1.79 | 1.8

[Figure: the table columns labeled X and y, next to a plot of y against x]


Linear Algebra in the Database

$$\hat{\beta} = (X^T X)^{-1} X^T y, \qquad X^T X = \sum_{i=1}^{n} x_i x_i^T, \qquad X^T y = \sum_{i=1}^{n} x_i y_i$$


Basic Building Block: User-Defined Aggregates

Input rows:

  x               y
  (1,0,3,…,5)     3
  (-2,4,5,…,2)    2
  …               …

Aggregation phase 1 on each node (map):

1.  Initialize: (A,b) = (0,0)
2.  Transition for all rows: (A,b) = (A,b) + (x ⋅ x^T, x ⋅ y)
3.  Send (A,b)

Aggregation phase 2 on master node (reduce):

1.  Merge: (A,b) = (A,b) + (A,b)
2.  Finalize: β̂ = solve(A,b) = A^−1 ⋅ b
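
The two aggregation phases can be sketched in plain Python. This is an illustration only, not MADlib code: the function names `initialize`, `transition`, `merge`, `finalize`, and `ols` are hypothetical, the segments are processed one after another instead of in parallel, and the solve step is a small Gaussian elimination standing in for a library routine.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]
Vector = List[float]
State = Tuple[Matrix, Vector]  # the pair (A, b) from the slide


def initialize(k: int) -> State:
    """(A, b) = (0, 0) for k independent variables."""
    return [[0.0] * k for _ in range(k)], [0.0] * k


def transition(state: State, x: Sequence[float], y: float) -> State:
    """(A, b) = (A, b) + (x x^T, x y) for a single row."""
    A, b = state
    for i, xi in enumerate(x):
        b[i] += xi * y
        for j, xj in enumerate(x):
            A[i][j] += xi * xj
    return A, b


def merge(left: State, right: State) -> State:
    """Add two partial states element-wise."""
    A = [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(left[0], right[0])]
    b = [p + q for p, q in zip(left[1], right[1])]
    return A, b


def finalize(state: State) -> Vector:
    """Solve A * beta = b by Gaussian elimination with partial pivoting.

    Assumes A is non-singular; a singular A raises ZeroDivisionError.
    """
    A, b = state
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    beta = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][c] * beta[c] for c in range(r + 1, n))
        beta[r] = (M[r][n] - s) / M[r][r]
    return beta


def ols(segments: Sequence[Sequence[Tuple[Sequence[float], float]]]) -> Vector:
    """Phase 1 per segment, then merge and finalize (phase 2).

    Assumes the first segment holds at least one row.
    """
    k = len(segments[0][0][0])
    total = initialize(k)
    for rows in segments:
        partial = initialize(k)
        for x, y in rows:
            partial = transition(partial, x, y)
        total = merge(total, partial)
    return finalize(total)
```

Because the state (A, b) is a sum over rows, the result does not depend on how the rows are split across segments, which is what makes the merge step safe.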


Problem solved?

No, not yet.


ML Algorithms Based on SQL?

•  Four Representative Challenges

1.  Lack of portable multi-pass iterations
2.  Roots in first-order logic
3.  Lack of language support for linear algebra
4.  Extensible SQL limited to small working sets

Need:

•  Abstraction Layers
•  A few compromises for user interface


1. Lack of portable multi-pass iterations

•  WITH RECURSIVE not reliable basis for portability
•  User-defined driver functions in Python
   –  Outer loops not performance-critical
•  Compromise: Different user interface

Driver control flow:

1.  CREATE TEMP TABLE temp
2.  INSERT INTO temp SELECT step(...) FROM ...
3.  SELECT converged(...) FROM temp, ...
    –  false: go back to 2
    –  true: continue with 4
4.  SELECT result(...) FROM temp
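
A driver loop of this shape can be sketched with Python's built-in `sqlite3` module standing in for PostgreSQL or Greenplum. Everything here is an assumption made for illustration: the table names `data` and `iter_state`, the function names `fit_mean` and `demo`, and the toy "step", which is an ordinary SQL expression (one gradient step toward the mean of `x`) instead of a MADlib aggregate.

```python
import sqlite3


def fit_mean(conn: sqlite3.Connection, tol: float = 1e-9, max_iter: int = 100) -> float:
    """Iterate a SQL 'step' until a SQL 'converged' check returns true."""
    cur = conn.cursor()
    # The state of every iteration is kept in a temporary table.
    cur.execute("CREATE TEMP TABLE iter_state (iteration INTEGER, state REAL)")
    cur.execute("INSERT INTO iter_state VALUES (0, 0.0)")
    for it in range(1, max_iter + 1):
        # Step: one pass over the data, reading the previous state.
        cur.execute(
            "INSERT INTO iter_state "
            "SELECT ?, s.state - 0.5 * AVG(s.state - d.x) "
            "FROM iter_state AS s, data AS d "
            "WHERE s.iteration = ? GROUP BY s.state",
            (it, it - 1),
        )
        # Converged: compare the two most recent states inside the database.
        cur.execute(
            "SELECT ABS(c.state - p.state) < ? "
            "FROM iter_state AS c, iter_state AS p "
            "WHERE c.iteration = ? AND p.iteration = ?",
            (tol, it, it - 1),
        )
        if cur.fetchone()[0]:
            break
    # Result: read the final state.
    cur.execute(
        "SELECT state FROM iter_state "
        "WHERE iteration = (SELECT MAX(iteration) FROM iter_state)"
    )
    return cur.fetchone()[0]


def demo() -> float:
    """Build a small in-memory table and run the driver on it."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE data (x REAL)")
    conn.executemany("INSERT INTO data VALUES (?)", [(v,) for v in (1, 2, 3, 4, 5)])
    return fit_mean(conn)
```

Only scalars cross the boundary between Python and the database in each iteration, which is consistent with the point that the outer loop is not performance-critical.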


2. Roots in first-order logic

•  Queries need to be cognizant of database objects
•  Emulate higher-order logic by:
   –  dynamic execution of templated SQL
   –  abstraction-layer support
•  Example: Distance or kernel functions
•  On PostgreSQL, use of type REGPROC

FunctionHandle dist
    = args[0].getAs<FunctionHandle>();
return dist(x, y);
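
The two emulation techniques can be illustrated in Python, where functions are ordinary values. This is a sketch under stated assumptions, not MADlib's interface: `DISTANCES`, `closest`, and `render_query`, as well as the tables `points` and `centroids` in the template, are hypothetical names.

```python
import math
from typing import Callable, Dict, Sequence

Distance = Callable[[Sequence[float], Sequence[float]], float]


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x: Sequence[float], y: Sequence[float]) -> float:
    return sum(abs(a - b) for a, b in zip(x, y))


# Registry of known distance functions, looked up by name.
DISTANCES: Dict[str, Distance] = {"euclidean": euclidean, "manhattan": manhattan}


def closest(point: Sequence[float],
            centroids: Sequence[Sequence[float]],
            dist: Distance) -> int:
    """Index of the closest centroid; the distance is passed in as a function."""
    return min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))


def render_query(dist_name: str) -> str:
    """Templated SQL: the function name is substituted into the query text.

    Only names from the registry are accepted, so arbitrary text cannot be
    spliced into the statement.
    """
    if dist_name not in DISTANCES:
        raise ValueError("unknown distance function: " + dist_name)
    template = "SELECT {dist}(p.coords, c.coords) FROM points AS p, centroids AS c"
    return template.format(dist=dist_name)
```

Passing `DISTANCES["manhattan"]` instead of `DISTANCES["euclidean"]` changes the behavior of `closest` without changing its code, which is the effect a function handle gives the C++ layer.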


3. Lack of language support for linear algebra

•  C++ Abstraction Layer uses Eigen
•  (Dense) Vectors and matrices: DOUBLE PRECISION[]
•  Example:

AnyType
solve::run(AnyType& args) {
    MappedMatrix A = args[0].getAs<MappedMatrix>();
    MappedColumnVector b = args[1].getAs<MappedColumnVector>();

    MutableMappedColumnVector x = allocateArray<double>(A.cols());
    x = A.colPivHouseholderQr().solve(b);
    return x;
}

Performance:

•  No unnecessary copying
•  No internal type conversion


4. Extensible SQL limited to small working sets

•  Tables only portable option for large states
•  Access from UDAs slow or impossible
•  Example: k-means benefits from explicit point-to-centroid assignments
   –  Problematic:
      UPDATE points SET centroid_id = closest(state, coords)
   –  Requires own pass
   –  Not allowed in subqueries
   –  PostgreSQL legacy


MADlib Architecture

[Figure: layered architecture diagram with these components]

•  User Interface
•  "Driver" Functions (outer loops of iterative algorithms, optimizer invocations)
•  High-level Abstraction Layer (iteration controller, convex optimizers, ...)
•  RDBMS Built-in Functions
•  Row-level Functions (inner loops of streaming algorithms, convex optimization callbacks, ...)
•  Low-level Abstraction Layer (matrix operations, C++ to RDBMS type bridge, …)
•  RDBMS Query Processing (Greenplum, PostgreSQL, …)

Implementation languages shown alongside: C++; Python; Python with templated SQL; SQL, generated from specification


Anatomy of an iterative MADlib module

interState = Start(args)
Repeat
    In parallel for each segment:
        intraState = Initialize(interState)
        For each row
            intraState = Transit(intraState, row)
    For each intraState:
        intraState = Merge(oldIntraState, intraState)
    interState = Finalize(intraState)
Until Converged(interState)
Return End(interState)

Annotations on the slide: User-defined Aggregate, User-defined Function (twice), Python Driver Function
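
The loop structure of such a module can be made concrete with a one-dimensional k-means written in plain Python. This is a hypothetical sketch, not MADlib code: segments are visited sequentially instead of in parallel, and `transit` receives the current centroids as an explicit argument, a deviation from the `Transit(intraState, row)` signature made to keep the functions free of hidden state.

```python
from typing import List, Sequence, Tuple

Centroids = List[float]             # interState: the current centroids
Partial = List[Tuple[float, int]]   # intraState: per-centroid (sum, count)


def initialize(inter: Centroids) -> Partial:
    return [(0.0, 0) for _ in inter]


def transit(intra: Partial, inter: Centroids, row: float) -> Partial:
    """Add one row to the running sum of its closest centroid."""
    j = min(range(len(inter)), key=lambda i: abs(row - inter[i]))
    s, n = intra[j]
    intra[j] = (s + row, n + 1)
    return intra


def merge(a: Partial, b: Partial) -> Partial:
    return [(sa + sb, na + nb) for (sa, na), (sb, nb) in zip(a, b)]


def finalize(intra: Partial, inter: Centroids) -> Centroids:
    """New centroid = mean of assigned rows; an empty cluster keeps its old value."""
    return [s / n if n else old for (s, n), old in zip(intra, inter)]


def converged(old: Centroids, new: Centroids, tol: float = 1e-9) -> bool:
    return all(abs(p - q) <= tol for p, q in zip(old, new))


def kmeans_1d(segments: Sequence[Sequence[float]],
              start: Centroids,
              max_iter: int = 100) -> Centroids:
    inter = list(start)                      # interState = Start(args)
    for _ in range(max_iter):                # Repeat
        merged = initialize(inter)
        for rows in segments:                # for each segment (sequential here)
            intra = initialize(inter)
            for row in rows:
                intra = transit(intra, inter, row)
            merged = merge(merged, intra)
        new_inter = finalize(merged, inter)
        done = converged(inter, new_inter)
        inter = new_inter
        if done:                             # Until Converged(interState)
            break
    return inter                             # Return End(interState)
```

For example, `kmeans_1d([[1.0, 2.0, 10.0], [11.0, 3.0, 12.0]], [0.0, 5.0])` settles on centroids near 2 and 11 after three passes over the data.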


Performance Trends

•  Disk I/O is not always the bottleneck
•  Performance tuning is essential
•  Overhead for single query very low (fraction of a second)
•  Greenplum achieves nearly perfect speedup

[Figure: OLS on 10 million rows (in seconds), axis range 0 to 40; # segments: 6, 12, 18, 24; # variables: 20, 40, 80, 160]


Current Modules

Data Modeling

Supervised Learning

•  Naive Bayes Classification
•  Linear Regression
•  Logistic Regression
•  Decision Tree
•  Random Forest
•  Support Vector Machines

Unsupervised Learning

•  Association Rules
•  k-Means Clustering
•  SVD Matrix Factorization
•  Parallel Latent Dirichlet Allocation

Descriptive Statistics

Sketch-based Estimators

•  CountMin (Cormode-Muthukrishnan)
•  FM (Flajolet-Martin)
•  MFV (Most Frequent Values)

Profile

Quantile

Support

Array Operations

Conjugate Gradient

Sparse Vectors

Probability Functions

Feature Extraction

Inferential Statistics

Hypothesis tests


My MADlib Experience: A Testimonial.

Christopher Ré, Wisconsin


Towards a Unified Architecture for in-RDBMS Analytics

Xixuan Feng, Arun Kumar, Benjamin Recht, Christopher Ré

Department of Computer Sciences, University of Wisconsin-Madison

{xfeng, arun, brecht, chrisre}@cs.wisc.edu

ABSTRACT

The increasing use of statistical data analysis in enterprise applications has created an arms race among database vendors to offer ever more sophisticated in-database analytics.

… late 1990s and early 2000s, this brought a wave of data mining toolkits into the RDBMS. Several major vendors are again making an effort toward sophisticated in-database analytics with both open source efforts, e.g., the MADlib platform from Greenplum [18], and several projects at majo…


Refining Ideas and Code

QA from GP helped to transition from paper to deployed code.

Conversations with GP (and Oracle) led us to better position our SIGMOD 12 paper


MADlib is Open Source

Learning & Inference run on (GP or Postgres) + MADlib

Critical: it's free, open, and we can modify it

hazy.cs.wisc.edu & www.youtube.com/HazyResearch

Enhance Wikipedia with extracted facts from the Web (50+ TB of data)


Testimonial Summary

MADlib is open to contributions and open source


Questions?

http://madlib.net

Caleb Welton

[email protected]

Florian Schoppmann

[email protected]

Christopher Ré

[email protected]