In-Database Predictive Analytics

John A. De Goes

@jdegoes, john@precog.com

Agenda

• Introduction
• Abusing SQL
• Painful by Design
• Database Extensions
• MADlib
• Other Approaches
• Summary

Introduction

In-Database Predictive Analytics

In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.

Traditional Predictive Analytics

database

Data Bottleneck: Painful, Slow

database

What’s the answer?

“MapReduce”

Move the Code, not the Data!

Advanced Analytics

Let’s Do K-Means in SQL!

Abusing SQL

General Approach in RDBMS

[Diagram: a driver sends statements to the database and uses the results as feedback]
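The driver-and-feedback loop can be sketched in Python. This is only an illustration under stated assumptions: `run_iteration` is a hypothetical callback standing in for "run one round of SQL and read back a quality metric", and the stopping rule (metric change below a tolerance) is assumed rather than taken from the slides.

```python
from typing import Callable

def drive(run_iteration: Callable[[], float],
          max_iterations: int = 20,
          tolerance: float = 1e-6) -> int:
    """Repeat one round of work until the reported metric stops changing.

    run_iteration is a hypothetical stand-in for sending one iteration's
    SQL to the database and reading back a quality metric.
    Returns the number of iterations actually run.
    """
    previous = float("inf")
    for iteration in range(1, max_iterations + 1):
        current = run_iteration()                 # the database does the work
        if abs(previous - current) <= tolerance:  # feedback: has it converged?
            return iteration
        previous = current
    return max_iterations

# Toy stand-in for the database: the metric halves, then settles at 1.0.
_metrics = iter([8.0, 4.0, 2.0, 1.0, 1.0, 1.0])
iterations_run = drive(lambda: next(_metrics))
```

The point of the design is that only a single number travels back to the driver on each round; the data itself stays in the database.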

Our Initial Model

Model table columns:

• d: number of dimensions
• k: number of clusters
• n: number of points
• iteration: number of iterations
• avg_q: variance

Our Initial Data Set

Y (Y1, Y2, Y3, ...): n rows

Projection & Numbering

Y (Y1, Y2, Y3, ...) → YH (i, Y1, ..., Yd)

INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i, Y1, Y2, ..., Yd
FROM Y;
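For comparison, the same numbering step is one line in a general-purpose language. A minimal Python sketch, assuming a hypothetical three-point, two-dimensional sample for Y:

```python
# Hypothetical sample of table Y: n = 3 points, d = 2 dimensions.
Y = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0)]

# YH: prepend a 1-based point id i to every row, which is what the
# running sum(1) window produces in the SQL version.
YH = [(i, *row) for i, row in enumerate(Y, start=1)]
```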

Flattening

YH (i, Y1, ..., Yd) → YV (i, l, val), n x d rows

INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;
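Flattening turns each wide row into d narrow rows, one per dimension. A small Python sketch of the same reshaping, using hypothetical sample data (the SQL needs one INSERT per dimension; here a nested loop covers all of them):

```python
# Hypothetical YH rows: (i, Y1, Y2) with n = 3 points and d = 2 dimensions.
YH = [(1, 1.0, 2.0), (2, 1.5, 1.8), (3, 8.0, 8.0)]

# YV: one (i, l, val) row per point i and dimension l, so n x d rows in total.
YV = [
    (i, l, val)
    for (i, *values) in YH
    for l, val in enumerate(values, start=1)
]
```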

Initializing k Cluster Centers

YH (i, Y1, ..., Yd) → CH (j, Y1, ..., Yd)

INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
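The initialization amounts to picking k random points as starting centers. A minimal Python sketch, assuming hypothetical sample data and a fixed seed purely so the example is repeatable:

```python
import random

# Hypothetical YH rows: (i, Y1, Y2).
YH = [(1, 1.0, 2.0), (2, 1.5, 1.8), (3, 8.0, 8.0), (4, 9.0, 9.5)]
k = 2
rng = random.Random(0)  # fixed seed only so the sketch is repeatable

# CH: one randomly chosen point per cluster id j = 1..k.
# Like k independent single-row samples, this may pick the same point twice.
CH = [(j, *rng.choice(YH)[1:]) for j in range(1, k + 1)]
```

Sampling without replacement (for example with `random.sample`) would avoid duplicate starting centers; the sketch keeps independent draws to stay close to the one-statement-per-cluster shape of the SQL.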

Flattening

CH (j, Y1, ..., Yd) → C (l, j, val), d x k rows

INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;

Computing Distances to Clusters

INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;

YD (i, j, dist), n x k rows
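The join on the dimension index l pairs each coordinate of a point with the matching coordinate of every center, and the grouped sum yields a squared Euclidean distance. A Python sketch of that join-and-aggregate, with hypothetical sample rows:

```python
from collections import defaultdict

# Hypothetical flattened points (i, l, val) and centers (l, j, val).
YV = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 8.0), (2, 2, 8.0)]
C = [(1, 1, 1.0), (2, 1, 1.0), (1, 2, 9.0), (2, 2, 8.0)]

# YD[(i, j)]: squared distance from point i to center j.
YD = defaultdict(float)
for i, l_point, y_val in YV:
    for l_center, j, c_val in C:
        if l_point == l_center:             # WHERE YV.l = C.l
            YD[(i, j)] += (y_val - c_val) ** 2  # sum(...) GROUP BY i, j
```

No square root is taken, which is fine for finding the nearest center because squaring preserves the ordering of non-negative distances.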

Computing Nearest Neighbors

INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD,
     (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist;

YNN: nearest clusters, n rows
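The nearest-cluster step is a per-point minimum followed by a join back to recover which cluster achieved it. A Python sketch with hypothetical distances:

```python
# Hypothetical squared distances: (i, j) -> dist.
YD = {(1, 1): 1.0, (1, 2): 100.0, (2, 1): 98.0, (2, 2): 1.0}

# First pass: the minimum distance per point (the YMIND subquery).
mindist = {}
for (i, j), dist in YD.items():
    mindist[i] = min(dist, mindist.get(i, float("inf")))

# Second pass: keep the (i, j) pairs whose distance equals that minimum.
YNN = [(i, j) for (i, j), dist in YD.items() if dist == mindist[i]]
```

As with the join in SQL, a point that is exactly equidistant from two centers would appear twice in the result, so a tie-breaking rule is needed if exactly n rows are required.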

Count Points Per Cluster

INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;
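The weights are the fraction of points assigned to each cluster. A minimal Python sketch with hypothetical assignments:

```python
from collections import Counter

# Hypothetical nearest-cluster assignments: (i, j).
YNN = [(1, 1), (2, 2), (3, 2), (4, 2)]
n = len(YNN)

# Count points per cluster, then normalize by n to get weights.
counts = Counter(j for _, j in YNN)
W = {j: count / n for j, count in counts.items()}
```

The sketch uses floating-point division. In many SQL dialects, dividing one integer column by another truncates, so the column type of w matters in the SQL version.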

Compute New Centroids

INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
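Each new centroid coordinate is the mean of that coordinate over the points assigned to the cluster. A Python sketch of the grouped average, with hypothetical sample rows:

```python
from collections import defaultdict

# Hypothetical flattened points (i, l, val) and assignments i -> j.
YV = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 3.0), (2, 2, 4.0),
      (3, 1, 8.0), (3, 2, 8.0)]
YNN = {1: 1, 2: 1, 3: 2}

sums = defaultdict(float)
counts = defaultdict(int)
for i, l, val in YV:
    j = YNN[i]                  # join on YV.i = YNN.i
    sums[(l, j)] += val
    counts[(l, j)] += 1

# C[(l, j)]: coordinate l of the centroid of cluster j.
C = {key: sums[key] / counts[key] for key in sums}
```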

Compute Variances

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;
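The variance step averages the squared deviation from the centroid, per cluster and per dimension. A Python sketch under the assumption of hypothetical sample rows and already-computed centroids:

```python
from collections import defaultdict

# Hypothetical flattened points (i, l, val), assignments i -> j,
# and centroids (l, j) -> val.
YV = [(1, 1, 1.0), (1, 2, 2.0), (2, 1, 3.0), (2, 2, 4.0),
      (3, 1, 8.0), (3, 2, 8.0)]
YNN = {1: 1, 2: 1, 3: 2}
C = {(1, 1): 2.0, (2, 1): 3.0, (1, 2): 8.0, (2, 2): 8.0}

squared = defaultdict(list)
for i, l, val in YV:
    j = YNN[i]
    squared[(l, j)].append((val - C[(l, j)]) ** 2)

# R[(l, j)]: variance of dimension l within cluster j.
R = {key: sum(values) / len(values) for key, values in squared.items()}
```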

Update Model

INSERT INTO R
SELECT C.l, C.j, avg((YV.val - C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
GROUP BY C.l, C.j;

Let’s not do that again!

Why are predictive analytics so hard to express in SQL?

Painful by Design

#1: No Arrays

• Sets: rows
• Tuples: columns
• Arrays: no equivalent

#2: Relational Algebra Sucks

• Projection
• Selection
• Rename
• Natural Join
• Theta Join
• Semijoin
• Antijoin
• Division
• Left outer join: R ⟕ S
• Right outer join: R ⟖ S
• Full outer join: R ⟗ S
• Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)

• Iteration
• Recursion
• Multiple Dimensions

There’s GOT to be a better way!

Database Extensions

C Extension

• UDF (User-Defined Function): Map, map(a)
• UDA (User-Defined Aggregate): Reduce, op2(a, b)
  • init(a), accum(a, b), merge(a, b), final(a)
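The four aggregate callbacks can be illustrated with a distributed average. This is a plain Python sketch of the pattern, not the extension API of any particular database; the signatures are simplified (for example, `init` takes no argument here) and the partitions are hypothetical.

```python
def init():
    """Starting state for one partition: (running total, row count)."""
    return (0.0, 0)

def accum(state, value):
    """Fold one input row into a partition's state."""
    total, count = state
    return (total + value, count + 1)

def merge(left, right):
    """Combine the states of two partitions."""
    return (left[0] + right[0], left[1] + right[1])

def final(state):
    """Turn the combined state into the aggregate's result."""
    total, count = state
    return total / count if count else None

# Hypothetical data split across two segments of a parallel database.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0]]

partials = []
for rows in partitions:
    state = init()
    for value in rows:
        state = accum(state, value)
    partials.append(state)

combined = init()
for partial in partials:
    combined = merge(combined, partial)

average = final(combined)
```

Carrying a (total, count) pair rather than a running mean is what makes `merge` trivial: partial states from different segments can be added in any order.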

MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.

MADlib

1. Download the binary

Mac OS X

http://www.madlib.net/files/madlib-0.6-Darwin.dmg

Linux

http://www.madlib.net/files/madlib-0.6-Linux.rpm

2. Start the Installation

Mac OS X

Double-click on installer

Linux

yum install $MADLIB_PACKAGE --nogpgcheck

3. Verify Locatability

Greenplum

source /path/to/greenplum/greenplum_path.sh

PostgreSQL

Make sure psql is in PATH

4. Register MADlib

Greenplum

/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install

PostgreSQL

/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install

5. Test Installation

Greenplum

/usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check

PostgreSQL

/usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check

Clustering in MADlib

SELECT * FROM kmeans_random(
  'rel_source', 'expr_point', k,
  [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);

Ahhhhhh......

Our Way or the Highway

Composability

RDBMS Isn’t the Only Game in Town!

Other Approaches

1. Embrace Coding

• Hadoop Ecosystem
  • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce
• BDAS Ecosystem
  • Spark

2. Reject RDBMS

• Datalog + variants
  • In theory, ideal for many kinds of predictive analytics
  • Suffers from a lack of distributed, feature-complete implementations

2. Reject RDBMS

• Rasdaman / RASQL
  • Arrays but not analytics

Community Editions: http://www.rasdaman.org

2. Reject RDBMS

• MonetDB / SciQL
  • Array extension of SQL
  • Poor analytics

Community Editions: http://www.monetdb.org

2. Reject RDBMS

• SciDB / AFL (AQL)
  • Excellent analytics
  • Limited composability

Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/

2. Reject RDBMS

• Precog / Quirrel (simple “R for big data”)
  • Multidimensional, arrays + functions
  • Still immature

Community Editions:

http://www.precog.com/editions/precog-for-mongodb (MongoDB)

http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)

Summary

• Increase performance, reduce friction by doing more inside the database
• Not a panacea
  • Hard to do in SQL
  • Hard to do in C (but you may not have to: MADlib)
  • Pre-canned & brittle in most databases
• Ultimately what’s needed is tech designed for advanced analytics

Q&A

John A. De Goes

@jdegoes, john@precog.com

References

• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)
