in-database predictive analytics
Post on 18-Jan-2015
In-Database Predictive Analytics
John A. De Goes (@jdegoes, john@precog.com)
Agenda
• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary
Introduction
In-Database Predictive Analytics
In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.
Traditional Predictive Analytics
[Diagram: database, R, SAS]
Data Bottleneck: Painful, Slow
[Diagram: database, R, SAS]
What’s the answer?
“MapReduce”
Move the Code, not the Data!
Advanced Analytics
Let’s Do K-Means in SQL!
Abusing SQL
General Approach in RDBMS
[Diagram: Driver and Database, connected by arrows labeled SQL and Feedback]
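The driver pattern on this slide can be sketched as a small loop: the driver sends one SQL step to the database, reads back a single feedback number, and decides whether to continue. This is only a toy illustration, not the deck's actual driver. It assumes SQLite through Python's standard `sqlite3` module, and `run_driver`, the stand-in UPDATE statement, and the tolerance are all hypothetical.

```python
import sqlite3

def run_driver(conn, step_sql, feedback_sql, tolerance=1e-3, max_iterations=50):
    """Issue step_sql repeatedly; stop once the feedback value stops changing."""
    previous = None
    for iteration in range(1, max_iterations + 1):
        conn.execute(step_sql)                              # push the work into the database
        current = conn.execute(feedback_sql).fetchone()[0]  # pull back one number
        if previous is not None and abs(previous - current) < tolerance:
            return iteration, current
        previous = current
    return max_iterations, previous

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE model (iteration INTEGER, avg_q REAL)")
conn.execute("INSERT INTO model VALUES (0, 8.0)")

# Hypothetical stand-in for one k-means pass: it just shrinks avg_q towards 2.
iterations, quality = run_driver(
    conn,
    step_sql="UPDATE model SET iteration = iteration + 1, avg_q = avg_q / 2 + 1",
    feedback_sql="SELECT avg_q FROM model",
)
```

Only the feedback value crosses the database boundary on each pass; the data itself stays where it is.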
Our Initial Model
Table: model
d            number of dimensions
k            number of clusters
n            number of points
iteration    number of iterations
avg_q        variance
Our Initial Data Set
Table: Y (n rows)
Y1 Y2 Y3 ...
Projection & Numbering
Y (Y1 Y2 Y3 ...) -> YH (i Y1 ... Yd), with i = 1, 2, 3, 4, ..., n
INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;
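The numbering trick is a running sum of ones: each row's i is the count of rows seen so far. A minimal Python sketch of the same idea, outside any database, is shown here. The function name `number_rows` and the four sample points are assumptions made for illustration.

```python
from itertools import accumulate

def number_rows(y_rows):
    """Mimic sum(1) over(rows unbounded preceding): a running sum of ones."""
    running = accumulate(1 for _ in y_rows)
    return [(i, *row) for i, row in zip(running, y_rows)]

# Hypothetical data set Y with n = 4 points and d = 2 dimensions.
Y = [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)]
YH = number_rows(Y)
```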
Flattening
YH (i Y1 ... Yd), n rows -> YV (i l val), n x d rows
INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;
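Because the number of INSERT statements depends on d, the driver presumably has to generate them. The sketch below shows one way that could look. It assumes SQLite via Python's `sqlite3` module and a two-dimensional toy table, and `flatten_points` is a hypothetical helper rather than anything from the deck.

```python
import sqlite3

def flatten_points(conn, d):
    """Emit one INSERT per dimension l, turning YH(i, Y1..Yd) into YV(i, l, val)."""
    conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
    for l in range(1, d + 1):
        # Column names cannot be bound as parameters, so the driver builds the SQL text.
        conn.execute(f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany(
    "INSERT INTO YH VALUES (?, ?, ?)",
    [(1, 1.0, 1.0), (2, 1.0, 2.0), (3, 8.0, 8.0), (4, 9.0, 8.0)],
)
flatten_points(conn, d=2)
rows = conn.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()
```

With n = 4 and d = 2, YV ends up with n x d = 8 rows, one per (point, dimension) pair.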
Initializing k Cluster Centers
YH (i Y1 ... Yd), n rows -> CH (j Y1 ... Yd), k rows
INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
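SAMPLE is not available in every database. As a rough sketch, the same initialization can be imitated in SQLite by picking one random row per center with ORDER BY random() LIMIT 1. That substitution, the helper `init_centers`, and the toy data are all assumptions for illustration.

```python
import sqlite3

def init_centers(conn, k):
    """Fill CH with k centers, each drawn as one random row of YH."""
    conn.execute("CREATE TABLE CH (j INTEGER, Y1 REAL, Y2 REAL)")
    for j in range(1, k + 1):
        # One independent draw per center; the same point may be picked twice.
        conn.execute(
            "INSERT INTO CH SELECT ?, Y1, Y2 FROM YH ORDER BY random() LIMIT 1",
            (j,),
        )

points = [(1, 1.0, 1.0), (2, 1.0, 2.0), (3, 8.0, 8.0), (4, 9.0, 8.0)]
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany("INSERT INTO YH VALUES (?, ?, ?)", points)
init_centers(conn, k=2)
centers = conn.execute("SELECT j, Y1, Y2 FROM CH ORDER BY j").fetchall()
```

Since each draw is independent, two centers can coincide; a more careful driver might re-draw in that case.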
Flattening
CH (j Y1 ... Yd), k rows -> C (l j val), d x k rows
INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;
Computing Distances to Clusters
INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;
YD (i j dist), n x k rows
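To see the distance join at work, here is a small runnable sketch on four hypothetical points and two hypothetical centers. It assumes SQLite via Python's `sqlite3` module; since SQLite has no `**` operator, the square is written as a product.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);   -- flattened points
    CREATE TABLE C  (l INTEGER, j INTEGER, val REAL);   -- flattened centers
    CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL);  -- squared distances
""")
points = {1: (1.0, 1.0), 2: (1.0, 2.0), 3: (8.0, 8.0), 4: (9.0, 8.0)}
centers = {1: (1.0, 1.0), 2: (9.0, 8.0)}
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(i, l, v) for i, p in points.items() for l, v in enumerate(p, 1)],
)
conn.executemany(
    "INSERT INTO C VALUES (?, ?, ?)",
    [(l, j, v) for j, c in centers.items() for l, v in enumerate(c, 1)],
)

# Join on the dimension index l, then sum squared differences per (point, center).
conn.execute("""
    INSERT INTO YD
    SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
    FROM YV, C
    WHERE YV.l = C.l
    GROUP BY YV.i, C.j
""")
distances = conn.execute("SELECT i, j, dist FROM YD ORDER BY i, j").fetchall()
```

The result holds squared Euclidean distances; no square root is taken, which does not change which center is nearest.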
Computing Nearest Neighbors
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i and YD.dist = YMIND.mindist;
YNN (i j), n rows: the nearest cluster j for each point i
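The nearest-cluster query is a join of YD against its own per-point minimum. A minimal sketch on hand-entered distances follows, assuming SQLite via Python's `sqlite3` module; the distance values are hypothetical toy data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL)")
conn.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
conn.executemany("INSERT INTO YD VALUES (?, ?, ?)", [
    (1, 1, 0.0), (1, 2, 113.0),
    (2, 1, 1.0), (2, 2, 100.0),
    (3, 1, 98.0), (3, 2, 1.0),
    (4, 1, 113.0), (4, 2, 0.0),
])

# Keep, for every point i, the row of YD whose distance equals that point's minimum.
conn.execute("""
    INSERT INTO YNN
    SELECT YD.i, YD.j
    FROM YD, (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
    WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
""")
nearest = conn.execute("SELECT i, j FROM YNN ORDER BY i").fetchall()
```

If a point is exactly equidistant from two centers, this join returns both rows, so YNN can exceed n rows unless ties are broken separately.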
Count Points Per Cluster
INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;
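A runnable approximation of the weight computation is sketched below, assuming SQLite via Python's `sqlite3` module. Two details are assumptions of this sketch, not of the slide: n is read with a scalar subquery (SQLite does not accept a bare `model.n` in an UPDATE), and the count is multiplied by 1.0 so the division is never done on integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (d INTEGER, k INTEGER, n INTEGER, iteration INTEGER, avg_q REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE W (j INTEGER, w REAL);
    INSERT INTO model VALUES (2, 2, 4, 0, NULL);
    INSERT INTO YNN VALUES (1, 1), (2, 1), (3, 2), (4, 2);
""")

# Count the points assigned to each cluster, then turn counts into fractions of n.
conn.execute("INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j")
conn.execute("UPDATE W SET w = w * 1.0 / (SELECT n FROM model)")
weights = conn.execute("SELECT j, w FROM W ORDER BY j").fetchall()
```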
Compute New Centroids
INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
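The centroid update averages each dimension over the points assigned to each cluster. Here is a small sketch on toy data, assuming SQLite via Python's `sqlite3` module. The table C starts out empty here; a real driver would presumably clear the previous centers before inserting the new ones.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE C (l INTEGER, j INTEGER, val REAL);
    INSERT INTO YV VALUES (1, 1, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 2.0),
                          (3, 1, 8.0), (3, 2, 8.0), (4, 1, 9.0), (4, 2, 8.0);
    INSERT INTO YNN VALUES (1, 1), (2, 1), (3, 2), (4, 2);
""")

# Average every dimension l over the points currently assigned to cluster j.
conn.execute("""
    INSERT INTO C
    SELECT YV.l, YNN.j, avg(YV.val)
    FROM YV, YNN
    WHERE YV.i = YNN.i
    GROUP BY YV.l, YNN.j
""")
centroids = conn.execute("SELECT j, l, val FROM C ORDER BY j, l").fetchall()
```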
Compute Variances
INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
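The variance query is a three-way join: points, their assignments, and the centers they are assigned to. A minimal sketch on toy data follows, assuming SQLite via Python's `sqlite3` module, with the square again written as a product because SQLite lacks `**`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE C (l INTEGER, j INTEGER, val REAL);
    CREATE TABLE R (l INTEGER, j INTEGER, val REAL);
    INSERT INTO YV VALUES (1, 1, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 2.0),
                          (3, 1, 8.0), (3, 2, 8.0), (4, 1, 9.0), (4, 2, 8.0);
    INSERT INTO YNN VALUES (1, 1), (2, 1), (3, 2), (4, 2);
    INSERT INTO C VALUES (1, 1, 1.0), (2, 1, 1.5), (1, 2, 8.5), (2, 2, 8.0);
""")

# Mean squared deviation from the assigned center, per dimension l and cluster j.
conn.execute("""
    INSERT INTO R
    SELECT C.l, C.j, avg((YV.val - C.val) * (YV.val - C.val))
    FROM C, YV, YNN
    WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
    GROUP BY C.l, C.j
""")
variances = conn.execute("SELECT j, l, val FROM R ORDER BY j, l").fetchall()
```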
Update Model
INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
Let’s not do that again!
Why are predictive analytics so hard to express in SQL?
Painful by Design
#1: No Arrays
Sets: rows
Tuples: columns
Arrays
#2: Relational Algebra Sucks
Projection
Selection
Rename
Natural join: R ⋈ S
Theta join: R ⋈θ S
Semijoin: R ⋉ S, R ⋊ S
Antijoin: R ▷ S
Division: R ÷ S
Left outer join: R ⟕ S
Right outer join: R ⟖ S
Full outer join: R ⟗ S
Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)
Iteration, Recursion, Multiple Dimensions
There’s GOT to be a better way!
Database Extensions
C Extension
UDF: User-Defined Function
UDA: User-Defined Aggregate
Map, Reduce
map(a), op2(a, b)
init(a), accum(a, b), merge(a, b), final(a)
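The four aggregate callbacks named on this slide can be illustrated outside any database. The sketch below computes an average with init, accum, merge, and final over several partitions, which is roughly how a parallel database would evaluate a user-defined aggregate. The exact signatures are assumptions (for instance, `init` takes no argument here), and `run_uda` is a hypothetical stand-in for the database engine.

```python
from functools import reduce

def init():
    """Empty aggregate state: (running sum, row count)."""
    return (0.0, 0)

def accum(state, value):
    """Fold one input row into the state."""
    total, count = state
    return (total + value, count + 1)

def merge(left, right):
    """Combine two partial states computed on different partitions."""
    return (left[0] + right[0], left[1] + right[1])

def final(state):
    """Turn the finished state into the aggregate's result."""
    total, count = state
    return total / count if count else None

def run_uda(partitions):
    """Hypothetical engine: accumulate per partition, merge the partials, finalize."""
    partials = [reduce(accum, part, init()) for part in partitions]
    return final(reduce(merge, partials, init()))

average = run_uda([[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]])
```

Keeping the state as a (sum, count) pair is what makes merge possible: averages of partitions cannot be combined directly, but sums and counts can.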
MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.
MADlib
1. Download the binary
Mac OS X: http://www.madlib.net/files/madlib-0.6-Darwin.dmg
Linux: http://www.madlib.net/files/madlib-0.6-Linux.rpm
2. Start the Installation
Mac OS X: Double-click on installer
Linux: yum install $MADLIB_PACKAGE --nogpgcheck
3. Verify Locatability
Greenplum: source /path/to/greenplum/greenplum_path.sh
PostgreSQL: Make sure psql is in PATH
4. Register MADlib
Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install
PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install
5. Test Installation
Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check
PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check
Clustering in MADlib
SELECT * FROM kmeans_random('rel_source', 'expr_point', k, ['fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned]);
Ahhhhhh......
Our Way or the Highway
Composability
RDBMS Isn’t the Only Game in Town!
Other Approaches
1. Embrace Coding
• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce
• BDAS Ecosystem
  • Spark
2. Reject RDBMS
• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations
2. Reject RDBMS
• Rasdaman / RASQL
  • Arrays but not analytics
Community Editions: http://www.rasdaman.org
2. Reject RDBMS
• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics
Community Editions: http://www.monetdb.org
2. Reject RDBMS
• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability
Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/
2. Reject RDBMS
• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature
Community Editions:
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)
Summary
• Increase performance, reduce friction by doing more inside the database
• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases
• Ultimately what’s needed is tech designed for advanced analytics
References
• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)