the madlib analytics library - florian schoppmann · – dynamic execuoon of templated sql –...
TRANSCRIPT
![Page 1: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/1.jpg)
The MADlib Analytics Library or MAD Skills, the SQL
CalebWelton FlorianSchoppmann ChristopherRé
1
![Page 2: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/2.jpg)
MADlib
Scalable Machine Learning for BigData
2
![Page 3: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/3.jpg)
Traditional analytics pipeline
sample.csv
Time-to-Insights
DataPrep DBExtract DBImportspec.docx scores.csv
3
![Page 4: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/4.jpg)
The MAD approach
EnterpriseData
RDBMS RDBMSRDBMS RDBMS
Time-to-Insights
DataPrep Model Score
ReducedData
Movement
Billionsofrows
inminutes
4
![Page 5: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/5.jpg)
MADlib in Action
HospitalAdmiLanceCaseStudy
5
![Page 6: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/6.jpg)
MADlib in Action
Step1:
• IdenOfyhighriskpaOents
Goal:
• HighriskpaOentswillbeeligibleforearlyadmiLance
andbeadministeredpreempOveanObioOcs
6
![Page 7: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/7.jpg)
MADlib in Action
Step2:
• Buildcostmodelfortreatment
y
xGoal:Goal:
• Predictexpectedcostoftreatment
• WithandwithoutearlyadmiLance.
7
![Page 8: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/8.jpg)
MADlib in Action
Step3:
• OpOmizeearlyadmiLancebasedonriskandcostmodel
Goal:
• OverallhospitalcostswillbeminimizedandpaOents
willreceivebeLercare.
8
![Page 9: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/9.jpg)
MADlib cycle of success
Value!
IdenOfy
Problem
UseMath
Insights
Whydidn’tI
thinkofthat
before?
9
![Page 10: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/10.jpg)
TheMADlibVision
• AcademicandindustrycontribuOons
• Thinkof“CRANfordatabases”
– Repositoryofopen-sourceMLalgorithms
– ThisOmewithdataparallelisminmind
• Open-SourceFramework
EigenBSDLicense10
![Page 11: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/11.jpg)
SimpleExample:
OrdinaryLeastSquares
# SELECT (linregr(y, x)).* FROM data; !
-[ RECORD 1 ]+------------------------ !
coef | {1.7307,2.2428} !
r2 | 0.9475 !
std_err | {0.3258,0.0533} !
t_stats | {5.3127,42.0640} !
p_values | {6.7681e-07,4.4409e-16} !
condition_no | 169.5093!
# SELECT y, x[1] AS x1, x[2] AS x2 FROM data! y | x1 | x2 !-------+------+----- ! 10.14 | 0 | 0.3 ! 11.93 | 0.69 | 0.6 ! 13.57 | 1.1 | 0.9 ! 14.17 | 1.39 | 1.2 ! 15.25 | 1.61 | 1.5 ! 16.15 | 1.79 | 1.8 !
X y
y
x
11
![Page 12: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/12.jpg)
LinearAlgebraintheDatabase
XT
X
XT
y
-1
XTX XTy
( )
︸ ︷︷ ︸ ︸ ︷︷ ︸
β̂ = (X TX )−1X T
y
xixi
T
i=1
n
∑ xiyii=1
n
∑12
![Page 13: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/13.jpg)
BasicBuildingBlock:
User-DefinedAggregates
AggregaOonphase1oneachnode:
1. IniOalize:
2. TransiOonforallrows:
3. Send(A,b)
x y
(1,0,3,…,5) 3
(-2,4,5,…,2) 2
… …
(A,b) = (0,0)
(A,b) = (A,b)+ (x ⋅ xT ,x ⋅ y)
map
reduce
(A,b)
…AggregaOonphase2onmasternode:
1. Merge:
2. Finalize: β̂ = solve(A,b) = A−1⋅b
(A,b) = (A,b)+ (A,b)
13
![Page 14: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/14.jpg)
Problemsolved?
No–notyet.
14
![Page 15: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/15.jpg)
MLAlgorithmsBasedonSQL?
• FourRepresentaOveChallenges
1. LackofportablemulO-passiteraOons
2. Rootsinfirst-orderlogic
3. Lackoflanguagesupportforlinearalgebra
4. ExtensibleSQLlimitedtosmallworkingsets
Need:
• AbstracOonLayers
• Afewcompromisesforuserinterface
15
![Page 16: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/16.jpg)
1.LackofportablemulO-pass
iteraOons
• WITH RECURSIVEnotreliablebasisfor
portability
• User-defineddriver
funcOonsinPython
– Outerloopsnot
performance-criOcal
• Compromise:
Differentuserinterface
CREATE TEMP TABLE temp !
INSERT INTO temp SELECT step(...) FROM ... !
SELECT converged(...) FROM temp, ... !
SELECT result(...) !FROM temp!
false
true
16
![Page 17: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/17.jpg)
2.Rootsinfirst-orderlogic
• Queriesneedbecognizantofdatabaseobjects
• Emulatehigher-orderlogicby:– dynamicexecuOonoftemplatedSQL
– abstracOon-layersupport
• Example:DistanceorkernelfuncOons
• OnPostgreSQL,useoftypeREGPROC
FunctionHandle dist ! = args[0].getAs<FunctionHandle>(); !return dist(x, y);
17
![Page 18: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/18.jpg)
3.Lackoflanguagesupportfor
linearalgebra
• C++AbstracOonLayerusesEigen
• (Dense)Vectorsandmatrices:DOUBLE PRECISION[]!
• Example:
AnyType!solve::run(AnyType& args) { ! MappedMatrix A = args[0].getAs<MappedMatrix>(); ! MappedColumnVector b = args[1].getAs<MappedColumnVector>(); ! ! MutableMappedColumnVector x = allocateArray<double>(A.cols()); ! x = A.colPivHouseholderQr().solve(b); ! return x; !} ! Performance:
• Nounnecessarycopying
• Nointernaltypeconversion
18
![Page 19: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/19.jpg)
4.ExtensibleSQLlimitedto
smallworkingsets
• TablesonlyportableopOonforlargestates
• AccessfromUDAssloworimpossible
• Example:k-meansbenefitsfromexplicitpoint-to-centroidassignments– ProblemaOc:
UPDATE points SET centroid_id = closest(state, coords)
– Requiresownpass
– Notallowedinsubqueries
– PostgreSQLlegacy
19
![Page 20: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/20.jpg)
MADlibArchitecture
RDBMSQueryProcessing
(Greenplum,PostgreSQL,…)
Low-levelAbstracBonLayer
(matrixoperaOons,C++toRDBMStype
bridge,…)
RDBMS
Built-in
FuncBons
UserInterface
High-levelAbstracBonLayer
(iteraOoncontroller,convexopOmizers,...)
Row-levelFuncBons
(innerloopsofstreamingalgorithms,
convexopOmizaOoncallbacks,...)
“Driver”FuncBons
(outerloopsofiteraOvealgorithms,opOmizer
invocaOons)
C++
Python
Pythonwith
templatedSQL
SQL,generated
fromspecificaOon
20
![Page 21: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/21.jpg)
AnatomyofaniteraOveMADlibmodule
interState=Start(args)
Repeat
Inparallelforeachsegment:
intraState=IniOalize(interState)
Foreachrow
intraState=Transit(intraState,row)
ForeachintraState:
intraState=Merge(oldIntraState,intraState)
interState=Finalize(intraState)
UnBlConverged(interState)
ReturnEnd(interState)
User-defined
Aggregate
User-definedFuncOon
PythonDriverFuncOon
User-definedFuncOon
21
![Page 22: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/22.jpg)
PerformanceTrends
• DiskI/Oisnotalways
theboLleneck• Performancetuningis
essenOal
• Overheadforsingle
queryverylow(fracOon
ofasecond)
• Greenplumachieves
nearlyperfectspeedup0
5
10
15
20
25
30
35
40
6 12 18 24
20 40 80 160
OLSon10millionrows(inseconds)
#segments
#variables:
22
![Page 23: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/23.jpg)
CurrentModules
DataModeling
SupervisedLearning
• NaiveBayesClassificaOon
• LinearRegression
• LogisOcRegression
• DecisionTree
• RandomForest
• SupportVectorMachines
UnsupervisedLearning
• AssociaOonRules
• k-MeansClustering
• SVDMatrixFactorizaOon
• ParallelLatentDirichletAllocaOon
DescripOveStaOsOcs
Sketch-basedEsOmators
• CountMin(Cormode-Muthukrishnan)
• FM(Flajolet-MarOn)
• MFV(MostFrequentValues)
Profile
QuanOle
Support
ArrayOperaOons
ConjugateGradient
SparseVectors
ProbabilityFuncOons
FeatureExtracOon
InferenOalStaOsOcs
Hypothesistests23
![Page 24: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/24.jpg)
MyMADlibExperience:
ATesBmonial.
ChristopherRé,Wisconsin 24
![Page 25: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/25.jpg)
Towards a Unified Architecture for in-RDBMS Analytics
Xixuan Feng Arun Kumar Benjamin Recht Christopher Ré
Department of Computer SciencesUniversity of Wisconsin-Madison
{xfeng, arun, brecht, chrisre}@cs.wisc.edu
ABSTRACT
The increasing use of statistical data analysis in enterpriseapplications has created an arms race among database ven-dors to offer ever more sophisticated in-database analytics.
late 1990s and early 2000s, this brought a wave of data mining toolkits into the RDBMS. Several major vendors aragain making an effort toward sophisticated in-database an-alytics with both open source efforts, e.g., the MADlib plat-form from Greenplum [18], and several projects at majo
Towards a Unified Architecture for in-RDBMS Analytics
Xixuan Feng Arun Kumar Benjamin Recht Christopher Ré
Department of Computer SciencesUniversity of Wisconsin-Madison
{xfeng, arun, brecht, chrisre}@cs.wisc.edu
ABSTRACT
The increasing use of statistical data analysis in enterpriseapplications has created an arms race among database ven-dors to offer ever more sophisticated in-database analytics.
late 1990s and early 2000s, this brought a wave of data mining toolkits into the RDBMS. Several major vendors aragain making an effort toward sophisticated in-database an-alytics with both open source efforts, e.g., the MADlib plat-form from Greenplum [18], and several projects at majo
RefiningIdeasandCode
QAfromGPhelptotransiOon
frompapertodeployedcode.
ConversaOonswithGP(andOracle)leadusto
beLerposiOonourSIGMOD12paper
Towards a Unified Architecture for in-RDBMS Analytics
Xixuan Feng Arun Kumar Benjamin Recht Christopher Ré
Department of Computer SciencesUniversity of Wisconsin-Madison
{xfeng, arun, brecht, chrisre}@cs.wisc.edu
ABSTRACT
The increasing use of statistical data analysis in enterpriseapplications has created an arms race among database ven-dors to offer ever more sophisticated in-database analytics.
late 1990s and early 2000s, this brought a wave of data mining toolkits into the RDBMS. Several major vendors aragain making an effort toward sophisticated in-database an-alytics with both open source efforts, e.g., the MADlib plat-form from Greenplum [18], and several projects at majo
25
![Page 26: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/26.jpg)
MADlibisOpenSource
Learning&Inferencerunon(GPorPostgres)+MADLib
Cri5cal:it’sfree,open,andwecanmodifyit
hazy.cs.wisc.edu&www.youtube.com/HazyResearch
EnhanceWikipedia
withextractedfacts
fromtheWeb
(50+TBofdata)
26
![Page 27: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/27.jpg)
TesOmonialSummary
MADlibisopentocontribuOonsandopensource
27
![Page 28: The MADlib Analytics Library - Florian Schoppmann · – dynamic execuOon of templated SQL – abstracOon-layer support • Example: Distance or kernel funcOons • On PostgreSQL,](https://reader031.vdocuments.mx/reader031/viewer/2022011823/5ed21ab4343c9467566d5617/html5/thumbnails/28.jpg)
Questions?
hLp://madlib.net
CalebWelton
FlorianSchoppmann
ChristopherRé