Transcript
Page 1: In-Database Predictive Analytics

In-Database Predictive Analytics

John A. De Goes, @jdegoes, [email protected]

Page 2: In-Database Predictive Analytics

• Introduction

• Abusing SQL

• Painful by Design

• Database Extensions

• MADlib

• Other Approaches

• Summary

Agenda

Page 3: In-Database Predictive Analytics

Introduction

In-Database Predictive Analytics

In-database predictive analytics refers to the process of performing advanced predictive analytics directly inside the database.

Page 4: In-Database Predictive Analytics

Traditional Predictive Analytics

Introduction

[Diagram labels: database, R, SAS]

Page 5: In-Database Predictive Analytics

Data Bottleneck: Painful, Slow

Introduction

[Diagram labels: database, R, SAS]

Page 6: In-Database Predictive Analytics

What’s the answer?

Introduction

Page 7: In-Database Predictive Analytics

“MapReduce”

Move the Code, not the Data!

Advanced Analytics

Introduction

Page 8: In-Database Predictive Analytics

Let’s Do K-Means in SQL!

Abusing SQL

Page 9: In-Database Predictive Analytics

General Approach in RDBMS

[Diagram: a driver program sends SQL to the database; the database returns feedback to the driver.]
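The driver-and-feedback loop can be sketched as below. This is only an illustration: the function name run_until_converged, the state table, and the Newton-style update are all assumptions chosen to keep the example tiny, and SQLite stands in for the RDBMS. The point is that the database executes each step, while the loop and the stopping test live in the driver.

```python
import sqlite3

def run_until_converged(conn, step_sql, feedback_sql, tol=1e-9, max_iter=50):
    """Hypothetical driver: run step_sql, read one number back with
    feedback_sql, and stop once that number drops below tol."""
    for iteration in range(1, max_iter + 1):
        conn.execute(step_sql)                                   # SQL goes to the database
        (error,) = conn.execute(feedback_sql).fetchone()         # feedback comes back
        if error < tol:
            return iteration, error
    return max_iter, error

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE state (x REAL)")
conn.execute("INSERT INTO state VALUES (1.0)")

# Toy iterative computation (square root of 2), one SQL statement per step.
iterations, error = run_until_converged(
    conn,
    step_sql="UPDATE state SET x = (x + 2.0 / x) / 2.0",
    feedback_sql="SELECT abs(x * x - 2.0) FROM state",
)
(x,) = conn.execute("SELECT x FROM state").fetchone()
```

In the k-means walkthrough the step would be the sequence of INSERT statements and the feedback would be a value such as avg_q read from the model table.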

Abusing SQL

Page 10: In-Database Predictive Analytics

Our Initial Model

Table model:

d: number of dimensions
k: number of clusters
n: number of points
iteration: number of iterations
avg_q: variance

Abusing SQL

Page 11: In-Database Predictive Analytics

Our Initial Data Set

Y (Y1, Y2, Y3, ...): n rows

Abusing SQL

Page 12: In-Database Predictive Analytics

Projection & Numbering

Y (Y1, Y2, Y3, ...) → YH (i, Y1, ..., Yd), with i = 1, 2, 3, 4, ..., n

INSERT INTO YH
SELECT sum(1) over(rows unbounded preceding) AS i,
       Y1, Y2, ..., Yd
FROM Y;
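A minimal sketch of the numbering step on toy data, assuming a two-dimensional Y and a SQLite build with window-function support (3.25 or later) standing in for the original database. The running sum of 1 assigns each row its index i; without an ORDER BY, which row gets which number is up to the engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Y  (Y1 REAL, Y2 REAL);
    CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL);
    INSERT INTO Y VALUES (1.0, 2.0), (3.0, 4.0), (5.0, 6.0);
""")

# Running count over all preceding rows yields 1, 2, ..., n.
conn.execute("""
    INSERT INTO YH
    SELECT sum(1) OVER (ROWS UNBOUNDED PRECEDING) AS i, Y1, Y2
    FROM Y
""")
yh = conn.execute("SELECT i, Y1, Y2 FROM YH ORDER BY i").fetchall()
```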

Abusing SQL

Page 13: In-Database Predictive Analytics

Flattening

YH (i, Y1, ..., Yd), n rows → YV (i, l, val), n x d rows

INSERT INTO YV SELECT i, 1, Y1 FROM YH;
...
INSERT INTO YV SELECT i, d, Yd FROM YH;
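A small sketch of the flattening step, assuming d = 2 and three toy points, with SQLite standing in for the database. Because the number of INSERT statements depends on d, the driver generates one statement per dimension.

```python
import sqlite3

d = 2  # number of dimensions (toy value)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany("INSERT INTO YH VALUES (?, ?, ?)",
                 [(1, 1.0, 2.0), (2, 3.0, 4.0), (3, 5.0, 6.0)])
conn.execute("CREATE TABLE YV (i INTEGER, l INTEGER, val REAL)")

# One INSERT per dimension: column Yl becomes rows (i, l, val).
for l in range(1, d + 1):
    conn.execute(f"INSERT INTO YV SELECT i, {l}, Y{l} FROM YH")

yv = conn.execute("SELECT i, l, val FROM YV ORDER BY i, l").fetchall()
```

The result has n x d = 6 rows, one per point coordinate.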

Abusing SQL

Page 14: In-Database Predictive Analytics

Initializing k Cluster Centers

YH (i, Y1, ..., Yd), n rows → CH (j, Y1, ..., Yd), k rows

INSERT INTO CH SELECT 1, Y1, ..., Yd FROM YH SAMPLE 1;
...
INSERT INTO CH SELECT k, Y1, ..., Yd FROM YH SAMPLE 1;
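A sketch of the initialization step under stated assumptions: SAMPLE 1 is a Teradata-style clause, so this version substitutes ORDER BY random() LIMIT 1 in SQLite, with k = 2 and four toy points.

```python
import sqlite3

k = 2  # number of clusters (toy value)
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YH (i INTEGER, Y1 REAL, Y2 REAL)")
conn.executemany("INSERT INTO YH VALUES (?, ?, ?)",
                 [(1, 0.0, 0.0), (2, 0.0, 2.0), (3, 10.0, 10.0), (4, 10.0, 12.0)])
conn.execute("CREATE TABLE CH (j INTEGER, Y1 REAL, Y2 REAL)")

# One statement per cluster: pick a random point as the initial center j.
for j in range(1, k + 1):
    conn.execute(
        "INSERT INTO CH SELECT ?, Y1, Y2 FROM YH ORDER BY random() LIMIT 1", (j,))

ch = conn.execute("SELECT j, Y1, Y2 FROM CH ORDER BY j").fetchall()
points = conn.execute("SELECT Y1, Y2 FROM YH").fetchall()
```

Each sample is drawn independently, so two centers can land on the same point; a real implementation would probably want to guard against that.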

Abusing SQL

Page 15: In-Database Predictive Analytics

Flattening

CH (j, Y1, ..., Yd), k rows → C (l, j, val), d x k rows

INSERT INTO C SELECT 1, 1, Y1 FROM CH WHERE j = 1;
...
INSERT INTO C SELECT d, k, Yd FROM CH WHERE j = k;

Abusing SQL

Page 16: In-Database Predictive Analytics

Computing Distances to Clusters

INSERT INTO YD
SELECT i, j, sum((YV.val - C.val)**2)
FROM YV, C
WHERE YV.l = C.l
GROUP BY i, j;

YD (i, j, dist): n x k rows
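A sketch of the distance step on toy data, assuming two points, (0, 0) and (3, 4), and two centroids, (0, 0) and (3, 0). SQLite has no ** operator, so the square is written as a product. The join on the dimension index l pairs each point coordinate with the matching centroid coordinate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV (i INTEGER, l INTEGER, val REAL);  -- points, one row per coordinate
    CREATE TABLE C  (l INTEGER, j INTEGER, val REAL);  -- centroids, one row per coordinate
    CREATE TABLE YD (i INTEGER, j INTEGER, dist REAL);
    INSERT INTO YV VALUES (1,1,0), (1,2,0), (2,1,3), (2,2,4);
    INSERT INTO C  VALUES (1,1,0), (2,1,0), (1,2,3), (2,2,0);
""")

# Sum of squared coordinate differences for every (point, centroid) pair.
conn.execute("""
    INSERT INTO YD
    SELECT YV.i, C.j, sum((YV.val - C.val) * (YV.val - C.val))
    FROM YV, C
    WHERE YV.l = C.l
    GROUP BY YV.i, C.j
""")
yd = conn.execute("SELECT i, j, dist FROM YD ORDER BY i, j").fetchall()
```

Note that dist holds the squared Euclidean distance; no square root is needed just to find the nearest centroid.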

Abusing SQL

Page 17: In-Database Predictive Analytics

Computing Nearest Neighbors

INSERT INTO YNN
SELECT YD.i, YD.j
FROM YD,
     (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist;

YNN (i, j): nearest clusters, n rows
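A sketch of the nearest-cluster step, assuming a toy YD table with two points and two clusters, run in SQLite. The subquery finds each point's minimum distance, and the join keeps the cluster that achieves it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YD  (i INTEGER, j INTEGER, dist REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    INSERT INTO YD VALUES (1,1,0), (1,2,9), (2,1,25), (2,2,16);
""")

# Keep, for each point i, the cluster j whose distance equals the minimum.
conn.execute("""
    INSERT INTO YNN
    SELECT YD.i, YD.j
    FROM YD,
         (SELECT i, min(dist) AS mindist FROM YD GROUP BY i) YMIND
    WHERE YD.i = YMIND.i AND YD.dist = YMIND.mindist
""")
ynn = conn.execute("SELECT i, j FROM YNN ORDER BY i").fetchall()
```

If a point is exactly equidistant from two centroids, this query returns both rows for it, so YNN can exceed n rows in that case.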

Abusing SQL

Page 18: In-Database Predictive Analytics

Count Points Per Cluster

INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j;
UPDATE W SET w = w/model.n;
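A sketch of the cluster-weight step with four toy points, assuming SQLite. Two details here are assumptions rather than part of the slide: the reference to model.n is rewritten as a scalar subquery, and w is declared REAL so that the division does not truncate to zero as integer division would.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE model (d INTEGER, k INTEGER, n INTEGER, iteration INTEGER, avg_q REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE W (j INTEGER, w REAL);
    INSERT INTO model VALUES (2, 2, 4, 0, NULL);
    INSERT INTO YNN VALUES (1,1), (2,1), (3,1), (4,2);
""")

# Count the points assigned to each cluster, then normalize by n.
conn.execute("INSERT INTO W SELECT j, count(*) FROM YNN GROUP BY j")
conn.execute("UPDATE W SET w = w / (SELECT n FROM model)")
w = dict(conn.execute("SELECT j, w FROM W").fetchall())
```

After normalization the weights are the fraction of points in each cluster and sum to 1.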

Abusing SQL

Page 19: In-Database Predictive Analytics

Compute New Centroids

INSERT INTO C
SELECT l, j, avg(YV.val)
FROM YV, YNN
WHERE YV.i = YNN.i
GROUP BY l, j;
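A sketch of the centroid update on toy data, assuming three points where the first two belong to cluster 1 and the third to cluster 2. The DELETE before the INSERT is an assumption: the old centroids presumably have to be cleared somewhere before new ones are written.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
    INSERT INTO YV VALUES (1,1,0), (1,2,0), (2,1,2), (2,2,4), (3,1,10), (3,2,10);
    INSERT INTO YNN VALUES (1,1), (2,1), (3,2);
""")

conn.execute("DELETE FROM C")  # assumption: previous centroids are discarded first
# New centroid coordinate = average of that coordinate over the cluster's points.
conn.execute("""
    INSERT INTO C
    SELECT YV.l, YNN.j, avg(YV.val)
    FROM YV, YNN
    WHERE YV.i = YNN.i
    GROUP BY YV.l, YNN.j
""")
c = conn.execute("SELECT l, j, val FROM C ORDER BY j, l").fetchall()
```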

Abusing SQL

Page 20: In-Database Predictive Analytics

Compute Variances

INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;
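A sketch of the variance step, assuming the same kind of toy data (points (0, 0) and (2, 4) in cluster 1 with centroid (1, 2), and point (10, 10) alone in cluster 2). The layout of R as (l, j, val) is an assumption inferred from the SELECT list, and the square is written as a product because SQLite has no ** operator.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE YV  (i INTEGER, l INTEGER, val REAL);
    CREATE TABLE YNN (i INTEGER, j INTEGER);
    CREATE TABLE C   (l INTEGER, j INTEGER, val REAL);
    CREATE TABLE R   (l INTEGER, j INTEGER, val REAL);  -- assumed layout
    INSERT INTO YV VALUES (1,1,0), (1,2,0), (2,1,2), (2,2,4), (3,1,10), (3,2,10);
    INSERT INTO YNN VALUES (1,1), (2,1), (3,2);
    INSERT INTO C VALUES (1,1,1), (2,1,2), (1,2,10), (2,2,10);
""")

# Per-dimension, per-cluster mean squared deviation from the centroid.
conn.execute("""
    INSERT INTO R
    SELECT C.l, C.j, avg((YV.val - C.val) * (YV.val - C.val))
    FROM C, YV, YNN
    WHERE YV.i = YNN.i AND YV.l = C.l AND YNN.j = C.j
    GROUP BY C.l, C.j
""")
r = conn.execute("SELECT l, j, val FROM R ORDER BY j, l").fetchall()
```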

Abusing SQL

Page 21: In-Database Predictive Analytics

Update Model

INSERT INTO R
SELECT C.l, C.j, avg((YV.val-C.val)**2)
FROM C, YV, YNN
WHERE YV.i = YNN.i and YV.l = C.l and YNN.j = C.j
GROUP BY C.l, C.j;

Abusing SQL

Page 22: In-Database Predictive Analytics

Let’s not do that again!

Abusing SQL

Page 23: In-Database Predictive Analytics

Why are predictive analytics so hard to express in SQL?

Painful by Design

Page 24: In-Database Predictive Analytics

#1: No Arrays

Sets: rows

Tuples: columns

Arrays

Painful by Design

Page 25: In-Database Predictive Analytics

#2: Relational Algebra Sucks

Projection (π), Selection (σ), Rename (ρ)
Natural join: R ⋈ S
Theta join: R ⋈θ S
Semijoin: R ⋉ S
Antijoin: R ▷ S
Division: R ÷ S
Left outer join: R ⟕ S
Right outer join: R ⟖ S
Full outer join: R ⟗ S
Aggregation: G1, G2, ..., Gm g f1(A1'), f2(A2'), ..., fk(Ak') (r)

Painful by Design

Iteration, Recursion, Multiple Dimensions

Page 26: In-Database Predictive Analytics

There’s GOT to be a better way!

Database Extensions

Page 27: In-Database Predictive Analytics

C Extension

Database Extensions

Page 28: In-Database Predictive Analytics

UDF: User-Defined Function
UDA: User-Defined Aggregate

Map: map(a)
Reduce: op2(a, b)

init(a), accum(a, b), merge(a, b), final(a)
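The four aggregate callbacks can be illustrated with a plain-Python sketch of a mean aggregate. This is a loose adaptation, not any database's actual extension API: the state layout (total, count), the zero-argument init, and the segments list that mimics data spread across nodes are all assumptions.

```python
from functools import reduce

def init():
    """Starting state: (running total, row count)."""
    return (0.0, 0)

def accum(state, value):
    """Fold one row into a partial state."""
    total, count = state
    return (total + value, count + 1)

def merge(left, right):
    """Combine partial states computed on different segments."""
    return (left[0] + right[0], left[1] + right[1])

def final(state):
    """Turn the merged state into the aggregate's result."""
    total, count = state
    return total / count if count else None

# Hypothetical data partitioned across three segments.
segments = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
partials = [reduce(accum, rows, init()) for rows in segments]
mean = final(reduce(merge, partials, init()))
```

Because merge is associative, the partial states can be combined in any order, which is what lets an aggregate of this shape run in parallel like a reduce step.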

Database Extensions

Page 29: In-Database Predictive Analytics

MADlib is an open-source library for scalable in-database analytics. It is implemented using database extensions written in C, and is available for PostgreSQL and Greenplum.

MADlib

Page 30: In-Database Predictive Analytics

1. Download the binary

Mac OS X: http://www.madlib.net/files/madlib-0.6-Darwin.dmg

Linux: http://www.madlib.net/files/madlib-0.6-Linux.rpm

MADlib

Page 31: In-Database Predictive Analytics

2. Start the Installation

Mac OS X: Double-click on installer

Linux: yum install $MADLIB_PACKAGE --nogpgcheck

MADlib

Page 32: In-Database Predictive Analytics

3. Verify Locatability

Greenplum: source /path/to/greenplum/greenplum_path.sh

PostgreSQL: Make sure psql is in PATH

MADlib

Page 33: In-Database Predictive Analytics

4. Register MADlib

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install

MADlib

Page 34: In-Database Predictive Analytics

5. Test Installation

Greenplum: /usr/local/madlib/bin/madpack -p greenplum -c $USER@$HOST/$DATABASE install-check

PostgreSQL: /usr/local/madlib/bin/madpack -p postgres -c $USER@$HOST/$DATABASE install-check

MADlib

Page 35: In-Database Predictive Analytics

Clustering in MADlib

SELECT * FROM kmeans_random(
    'rel_source', 'expr_point', k,
    [ 'fn_dist', 'agg_centroid', max_num_iterations, min_frac_reassigned ]
);

MADlib

Page 36: In-Database Predictive Analytics

Ahhhhhh......

MADlib

Page 37: In-Database Predictive Analytics

Our Way or the Highway

Composability

MADlib

Page 38: In-Database Predictive Analytics

RDBMS Isn’t the Only Game in Town!

Other Approaches

Page 39: In-Database Predictive Analytics

1. Embrace Coding

• Hadoop Ecosystem
    • Mahout, Cascading/Scalding, Crunch/Scrunch, Pangool, Cascalog, and, of course, MapReduce

• BDAS Ecosystem
    • Spark

Other Approaches

Page 40: In-Database Predictive Analytics

2. Reject RDBMS

• Datalog + variants
    • In theory, ideal for many kinds of predictive analytics
    • Suffers from a lack of distributed, feature-complete implementations

Other Approaches

Page 41: In-Database Predictive Analytics

2. Reject RDBMS

• Rasdaman / RASQL
    • Arrays but not analytics

Community Editions: http://www.rasdaman.org

Other Approaches

Page 42: In-Database Predictive Analytics

2. Reject RDBMS

• MonetDB / SciQL
    • Array extension of SQL
    • Poor analytics

Community Editions: http://www.monetdb.org

Other Approaches

Page 43: In-Database Predictive Analytics

2. Reject RDBMS

• SciDB / AFL (AQL)
    • Excellent analytics
    • Limited composability

Community Editions: http://www.scidb.org/forum/viewtopic.php?f=16&t=364/

Other Approaches

Page 44: In-Database Predictive Analytics

2. Reject RDBMS

• Precog / Quirrel (simple “R for big data”)
    • Multidimensional, arrays + functions
    • Still immature

Community Editions:
http://www.precog.com/editions/precog-for-mongodb (MongoDB)
http://www.precog.com/editions/precog-for-postgresql (PostgreSQL)

Other Approaches

Page 45: In-Database Predictive Analytics

Summary

• Increase performance, reduce friction by doing more inside the database

• Not a panacea
    • Hard to do in SQL
    • Hard to do in C (but you may not have to: MADlib)
    • Pre-canned & brittle in most databases

• Ultimately what’s needed is tech designed for advanced analytics

Page 47: In-Database Predictive Analytics

References

• Programming the K-means Clustering Algorithm in SQL (Teradata, NCR)

