In-Database Predictive Analytics
![Page 2: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/2.jpg)
Agenda
• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary
![Page 3: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/3.jpg)
Introduction
In-Database Predictive Analytics
In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.
![Page 4: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/4.jpg)
Traditional Predictive Analytics
Introduction
[Diagram: database, R, SAS]
![Page 5: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/5.jpg)
Data Bottleneck: Painful, Slow
Introduction
[Diagram: database, R, SAS]
![Page 6: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/6.jpg)
What’s the answer?
Introduction
![Page 7: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/7.jpg)
“MapReduce”
Move the Code, not the Data!
Advanced Analytics
Introduction
![Page 8: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/8.jpg)
Let’s Do K-Means in SQL!
Abusing SQL
![Page 9: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/9.jpg)
General Approach in RDBMS
[Diagram: a driver sends SQL to the database and receives feedback]
Abusing SQL
![Page 10: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/10.jpg)
Our Initial Model
Table model, one column per parameter:
• d: number of dimensions
• k: number of clusters
• n: number of points
• iteration: number of iterations
• avg_q: variance
Abusing SQL
![Page 11: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/11.jpg)
Our Initial Data Set
Table Y: columns Y1, Y2, Y3, ...; n rows
Abusing SQL
![Page 12: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/12.jpg)
Projection & Numbering
Table Y (Y1, Y2, Y3, ...) is copied into table YH (i, Y1, ..., Yd), where i = 1, 2, ..., n numbers the rows.
INSERT INTO YH
SELECT sum(1) OVER (ROWS UNBOUNDED PRECEDING) AS i, Y1, Y2, ..., Yd
FROM Y;
Abusing SQL
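The numbering step can be tried locally. This is a minimal sketch, assuming SQLite 3.25 or later (for window functions) driven through Python's standard sqlite3 module instead of the dialect used on the slide. The four two-dimensional points are made up for illustration, and ROW_NUMBER() stands in for the running sum(1).

```python
import sqlite3

# Hypothetical toy data: n = 4 points with d = 2 dimensions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Y (Y1 REAL, Y2 REAL)")
conn.executemany(
    "INSERT INTO Y VALUES (?, ?)",
    [(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)],
)

# YH is Y plus a point index i. ROW_NUMBER() replaces the slide's
# sum(1) OVER (ROWS UNBOUNDED PRECEDING); both count rows cumulatively.
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.execute(
    """
    INSERT INTO YH
    SELECT ROW_NUMBER() OVER (ORDER BY rowid) AS i, Y1, Y2
    FROM Y
    """
)

yh_rows = conn.execute("SELECT i, Y1, Y2 FROM YH ORDER BY i").fetchall()
print(yh_rows)
```

Ordering by rowid simply fixes the numbering to insertion order; any deterministic order would do, since the index only needs to be unique per point.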
![Page 13: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/13.jpg)
Flattening
Table YH (i, Y1, ..., Yd) with n rows is flattened into table YV (i, l, val) with n x d rows: one row per point i and dimension l.
INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;
Abusing SQL
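Because SQL cannot loop over column names, the driver has to generate one INSERT per dimension. A minimal sketch, assuming SQLite via Python's sqlite3 module and a made-up YH table with d = 2:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany(
    "INSERT INTO YH VALUES (?, ?, ?)",
    [(1, 1.0, 1.0), (2, 1.0, 2.0), (3, 8.0, 8.0), (4, 9.0, 8.0)],
)
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")

d = 2  # number of dimensions in the toy data

# One generated statement per dimension l. Column names cannot be bound
# as parameters, so the driver builds the SQL text itself.
for l in range(1, d + 1):
    conn.execute(f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH")

yv_rows = conn.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()
print(yv_rows)
```

The vertical layout is what lets later steps treat the dimension as data (a join key) rather than as schema.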
![Page 14: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/14.jpg)
Initializing k Cluster Centers
k rows are sampled from table YH (i, Y1, ..., Yd) into table CH (j, Y1, ..., Yd), with j = 1, 2, ..., k.
INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
Abusing SQL
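SAMPLE is not available in every database. A minimal sketch of the same idea, assuming SQLite via Python's sqlite3 module, where ORDER BY random() LIMIT 1 is swapped in for SAMPLE 1; the data and k = 2 are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
yh = [(1, 1.0, 1.0), (2, 1.0, 2.0), (3, 8.0, 8.0), (4, 9.0, 8.0)]
conn.executemany("INSERT INTO YH VALUES (?, ?, ?)", yh)
conn.execute("CREATE TABLE CH (j INTEGER, Y1 REAL, Y2 REAL)")

k = 2  # number of clusters

# One statement per cluster j, each picking a single random point.
# As on the slide, nothing prevents the same point being drawn twice.
for j in range(1, k + 1):
    conn.execute(
        "INSERT INTO CH SELECT ?, Y1, Y2 FROM YH ORDER BY random() LIMIT 1",
        (j,),
    )

ch_rows = conn.execute("SELECT j, Y1, Y2 FROM CH ORDER BY j").fetchall()
print(ch_rows)
```

The output differs from run to run because the seeds are random.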
![Page 15: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/15.jpg)
Flattening
Table CH (j, Y1, ..., Yd) with k rows is flattened into table C (l, j, val) with d x k rows: one row per dimension l and cluster j.
INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;
Abusing SQL
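The d x k statements can again be generated by the driver. A minimal sketch, assuming SQLite via Python's sqlite3 module and two made-up cluster centers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CH (j INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany(
    "INSERT INTO CH VALUES (?, ?, ?)",
    [(1, 1.0, 1.0), (2, 9.0, 8.0)],
)
conn.execute("CREATE TABLE C (l INTEGER, j INTEGER, val REAL)")

d, k = 2, 2  # dimensions and clusters in the toy data

# One statement per (dimension, cluster) pair, mirroring the slide.
for l in range(1, d + 1):
    for j in range(1, k + 1):
        conn.execute(
            f"INSERT INTO C SELECT {l}, j, Y{l} FROM CH WHERE j = ?",
            (j,),
        )

c_rows = conn.execute("SELECT l, j, val FROM C ORDER BY l, j").fetchall()
print(c_rows)
```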
![Page 16: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/16.jpg)
Computing Distances to Clusters
INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;
Table YD (i, j, dist) has n x k rows: one distance per point i and cluster j.
Abusing SQL
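With both tables flattened, the squared Euclidean distance becomes a join on the dimension index plus a GROUP BY. A minimal sketch, assuming SQLite via Python's sqlite3 module; SQLite has no ** operator, so the square is written as a product, and the data is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
conn.execute("CREATE TABLE C (l INTEGER, j INTEGER, val REAL)")
conn.execute("CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL)")

# Four 2-D points (1,1), (1,2), (8,8), (9,8) in vertical layout.
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(1, 1, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 2.0),
     (3, 1, 8.0), (3, 2, 8.0), (4, 1, 9.0), (4, 2, 8.0)],
)
# Two centers: cluster 1 at (1,1), cluster 2 at (9,8).
conn.executemany(
    "INSERT INTO C VALUES (?, ?, ?)",
    [(1, 1, 1.0), (2, 1, 1.0), (1, 2, 9.0), (2, 2, 8.0)],
)

# Join on dimension l, then sum squared differences per (point, cluster).
conn.execute(
    """
    INSERT INTO YD
    SELECT i, j, SUM((YV.val - C.val) * (YV.val - C.val))
    FROM YV, C
    WHERE YV.l = C.l
    GROUP BY i, j
    """
)

yd_rows = conn.execute("SELECT i, j, dist FROM YD ORDER BY i, j").fetchall()
print(yd_rows)
```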
![Page 17: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/17.jpg)
Computing Nearest Neighbors
INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD,
  (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist;
Table YNN (i, j) has n rows: the nearest cluster j for each point i.
Abusing SQL
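The nearest-cluster lookup is a self-join of YD against its per-point minimum. A minimal sketch, assuming SQLite via Python's sqlite3 module and made-up distances for four points and two clusters:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL)")
conn.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
conn.executemany(
    "INSERT INTO YD VALUES (?, ?, ?)",
    [(1, 1, 0.0), (1, 2, 113.0), (2, 1, 1.0), (2, 2, 100.0),
     (3, 1, 98.0), (3, 2, 1.0), (4, 1, 113.0), (4, 2, 0.0)],
)

# YMIND holds the smallest distance per point; joining back to YD
# recovers which cluster j achieved it.
conn.execute(
    """
    INSERT INTO YNN
    SELECT YD.i, YD.j
    FROM YD,
         (SELECT i, MIN(dist) AS mindist FROM YD GROUP BY i) YMIND
    WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
    """
)

ynn_rows = conn.execute("SELECT i, j FROM YNN ORDER BY i").fetchall()
print(ynn_rows)
```

If a point is exactly equidistant from two clusters, this join returns both rows for it, so a real driver would need a tie-breaking rule.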
![Page 18: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/18.jpg)
Count Points Per Cluster
INSERT INTO W
SELECT j, count(*)
FROM YNN
GROUP BY j;
UPDATE W SET w = w/model.n;
Abusing SQL
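The weights are cluster sizes divided by the number of points. A minimal sketch, assuming SQLite via Python's sqlite3 module; SQLite cannot reference model.n directly in an UPDATE, so a scalar subquery is used, and the assignments are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model (d INTEGER, k INTEGER, n INTEGER, "
    "iteration INTEGER, avg_q REAL)"
)
conn.execute("INSERT INTO model VALUES (2, 2, 4, 0, NULL)")
conn.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
conn.executemany("INSERT INTO YNN VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 1), (4, 2)])
conn.execute("CREATE TABLE W (j INTEGER, w REAL)")

# Count the points assigned to each cluster...
conn.execute("INSERT INTO W SELECT j, COUNT(*) FROM YNN GROUP BY j")
# ...then normalise by n. Multiplying by 1.0 guards against integer division.
conn.execute("UPDATE W SET w = w * 1.0 / (SELECT n FROM model)")

w_rows = conn.execute("SELECT j, w FROM W ORDER BY j").fetchall()
print(w_rows)
```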
![Page 19: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/19.jpg)
Compute New Centroids
INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
Abusing SQL
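Each new centroid coordinate is the average of one dimension over the points assigned to one cluster. A minimal sketch, assuming SQLite via Python's sqlite3 module and made-up data; clearing the old centers first is an assumption about what the driver would do between iterations:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
conn.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
conn.execute("CREATE TABLE C (l INTEGER, j INTEGER, val REAL)")
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(1, 1, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 2.0),
     (3, 1, 8.0), (3, 2, 8.0), (4, 1, 9.0), (4, 2, 8.0)],
)
conn.executemany("INSERT INTO YNN VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2), (4, 2)])

# Assumed driver step: drop the previous iteration's centers.
conn.execute("DELETE FROM C")

# Average each dimension l over the points assigned to cluster j.
conn.execute(
    """
    INSERT INTO C
    SELECT l, j, AVG(YV.val)
    FROM YV, YNN
    WHERE YV.i = YNN.i
    GROUP BY l, j
    """
)

c_rows = conn.execute("SELECT l, j, val FROM C ORDER BY l, j").fetchall()
print(c_rows)
```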
![Page 20: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/20.jpg)
Compute Variances
INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;
Abusing SQL
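The variance query is a three-way join: points, their assignments, and the matching centroid coordinate. A minimal sketch, assuming SQLite via Python's sqlite3 module, with the square written as a product and made-up data:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")
conn.execute("CREATE TABLE YNN (i INTEGER, j INTEGER)")
conn.execute("CREATE TABLE C (l INTEGER, j INTEGER, val REAL)")
conn.execute("CREATE TABLE R (l INTEGER, j INTEGER, val REAL)")
conn.executemany(
    "INSERT INTO YV VALUES (?, ?, ?)",
    [(1, 1, 1.0), (1, 2, 1.0), (2, 1, 1.0), (2, 2, 2.0),
     (3, 1, 8.0), (3, 2, 8.0), (4, 1, 9.0), (4, 2, 8.0)],
)
conn.executemany("INSERT INTO YNN VALUES (?, ?)",
                 [(1, 1), (2, 1), (3, 2), (4, 2)])
# Centroids: cluster 1 at (1, 1.5), cluster 2 at (8.5, 8).
conn.executemany(
    "INSERT INTO C VALUES (?, ?, ?)",
    [(1, 1, 1.0), (1, 2, 8.5), (2, 1, 1.5), (2, 2, 8.0)],
)

# Mean squared deviation from the centroid, per dimension and cluster.
conn.execute(
    """
    INSERT INTO R
    SELECT C.l, C.j, AVG((YV.val - C.val) * (YV.val - C.val))
    FROM C, YV, YNN
    WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
    GROUP BY C.l, C.j
    """
)

r_rows = conn.execute("SELECT l, j, val FROM R ORDER BY l, j").fetchall()
print(r_rows)
```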
![Page 21: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/21.jpg)
Update Model
INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;
Abusing SQL
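The query on this slide repeats the variance computation, so the statement that actually touches the model table is not shown. One plausible update, offered purely as an assumption and not as the presenter's query, advances the iteration counter and stores a weight-averaged variance in avg_q. A minimal sketch in SQLite via Python's sqlite3 module, with made-up W and R contents:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE model (d INTEGER, k INTEGER, n INTEGER, "
    "iteration INTEGER, avg_q REAL)"
)
conn.execute("INSERT INTO model VALUES (2, 2, 4, 0, NULL)")
conn.execute("CREATE TABLE W (j INTEGER, w REAL)")
conn.execute("CREATE TABLE R (l INTEGER, j INTEGER, val REAL)")
conn.executemany("INSERT INTO W VALUES (?, ?)", [(1, 0.5), (2, 0.5)])
conn.executemany(
    "INSERT INTO R VALUES (?, ?, ?)",
    [(1, 1, 0.0), (1, 2, 0.25), (2, 1, 0.25), (2, 2, 0.0)],
)

# Hypothetical model update: bump the iteration counter and record the
# variances weighted by cluster size, summed over dimensions and clusters.
conn.execute(
    """
    UPDATE model
    SET iteration = iteration + 1,
        avg_q = (SELECT SUM(W.w * R.val) FROM R, W WHERE R.j = W.j)
    """
)

model_row = conn.execute(
    "SELECT d, k, n, iteration, avg_q FROM model"
).fetchone()
print(model_row)
```

A driver could compare avg_q between iterations and stop once the change falls below a threshold of its choosing.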
![Page 22: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/22.jpg)
Let’s not do that again!
Abusing SQL
![Page 23: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/23.jpg)
Why are predictive analytics so hard to express in SQL?
Painful by Design
![Page 24: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/24.jpg)
#1: No Arrays
Sets (rows)
Tuples (columns)
Arrays
Painful by Design
![Page 25: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/25.jpg)
#2: Relational Algebra Sucks
• Projection
• Selection
• Rename
• Natural Join
• Theta Join
• Semijoin
• Antijoin
• Division: R ÷ S
• Left outer join: R ⟕ S
• Right outer join: R ⟖ S
• Full outer join: R ⟗ S
• Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)
Iteration, Recursion, Multiple Dimensions
Painful by Design
![Page 26: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/26.jpg)
There’s GOT to be a better way!
Database Extensions
![Page 27: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/27.jpg)
C Extension
Database Extensions
![Page 28: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/28.jpg)
UDF: User-Defined Function
• Map: map(a)
UDA: User-Defined Aggregate
• Reduce: op2(a, b)
• init(a), accum(a, b), merge(a, b), final(a)
Database Extensions
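The four-part aggregate contract is easiest to see outside any particular database. A minimal sketch in plain Python, assuming a mean aggregate; the function names follow the slide, but the signatures and the MeanState type are hypothetical and do not correspond to any specific database's extension API:

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class MeanState:
    """Running state for a mean: a sum and a row count."""
    total: float = 0.0
    count: int = 0


def init() -> MeanState:
    # Empty state, created once per partition.
    return MeanState()


def accum(state: MeanState, value: float) -> MeanState:
    # Fold one input row into the state.
    return MeanState(state.total + value, state.count + 1)


def merge(a: MeanState, b: MeanState) -> MeanState:
    # Combine partial states computed on different partitions.
    return MeanState(a.total + b.total, a.count + b.count)


def final(state: MeanState):
    # Turn the state into the aggregate's result.
    return state.total / state.count if state.count else None


# Three made-up partitions, as if the rows lived on three segments.
partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
partials = [reduce(accum, part, init()) for part in partitions]
mean = final(reduce(merge, partials, init()))
print(mean)
```

Because merge is associative, the partial states can be combined in any order, which is what lets a parallel database run accum on every node independently.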
![Page 29: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/29.jpg)
MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.
MADlib
![Page 30: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/30.jpg)
1. Download the binary
Mac OS X
http://www.madlib.net/files/madlib-0.6-Darwin.dmg
Linux
http://www.madlib.net/files/madlib-0.6-Linux.rpm
MADlib
![Page 31: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/31.jpg)
2. Start the Installation
Mac OS X
Double-click on installer
Linux
yum install $MADLIB_PACKAGE --nogpgcheck
MADlib
![Page 32: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/32.jpg)
3. Verify Locatability
Greenplum
source /path/to/greenplum/greenplum_path.sh
PostgreSQL
Make sure psql is in PATH
MADlib
![Page 33: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/33.jpg)
4. Register MADlib
Greenplum
/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install
PostgreSQL
/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install
MADlib
![Page 34: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/34.jpg)
5. Test Installation
Greenplum
/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check
PostgreSQL
/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check
MADlib
![Page 35: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/35.jpg)
Clustering in MADlib
SELECT * FROM kmeans_random(
  'rel_source', 'expr_point', k,
  [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);
MADlib
![Page 36: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/36.jpg)
Ahhhhhh......
MADlib
![Page 37: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/37.jpg)
Our Way or the Highway
Composability
MADlib
![Page 38: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/38.jpg)
RDBMS Isn’t the Only Game in Town!
Other Approaches
![Page 39: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/39.jpg)
1. Embrace Coding
• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce
• BDAS Ecosystem
  • Spark
Other Approaches
![Page 40: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/40.jpg)
2. Reject RDBMS
• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations
Other Approaches
![Page 41: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/41.jpg)
2. Reject RDBMS
• Rasdaman / RASQL
  • Arrays but not analytics
Community Editions: http://www.rasdaman.org
Other Approaches
![Page 42: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/42.jpg)
2. Reject RDBMS
• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics
Community Editions: http://www.monetdb.org
Other Approaches
![Page 43: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/43.jpg)
2. Reject RDBMS
• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability
Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/
Other Approaches
![Page 44: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/44.jpg)
2. Reject RDBMS
• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature
Community Editions:
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)
Other Approaches
![Page 45: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/45.jpg)
Summary
• Increase performance, reduce friction by doing more inside the database
• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases
• Ultimately what’s needed is tech designed for advanced analytics
![Page 47: In-Database Predictive Analytics](https://reader034.vdocuments.mx/reader034/viewer/2022042606/54bb55754a79597c0b8b46e9/html5/thumbnails/47.jpg)
References
• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)