
Page 1: Database Systems Research on Data Mining

Database Systems Research on Data Mining

Carlos Ordonez, University of Houston, USA

Javier García-García, UNAM, Mexico

Reference: Ordonez, C., García-García, J., Database Systems Research on Data Mining, Proc. ACM SIGMOD 2010, p.000-999 (tutorial).

Page 2: Database Systems Research on Data Mining

Global Outline

1. Data mining models and algorithms (JG, 15 min)
 1.1 Preprocess to get the data set
 1.2 Data set
 1.3 Data mining models
 1.4 Data mining algorithms

2. Processing alternatives (JG, 35 min)
 2.1 Inside DBMS: SQL
 2.2 Outside DBMS: MapReduce
 2.3 Example

3. Storage and optimizations (CO, 35 min)
 3.1 Layouts: horizontal and vertical
 3.2 Optimizations: algorithmic and systems

2/60

Page 3: Database Systems Research on Data Mining

1. Data Mining Models & Algorithms

• Data set preparation
• Data set
• Data mining models and patterns
• Algorithms

3/60

Page 4: Database Systems Research on Data Mining

1.1 Data set preparation [CDHHL1999,KDD] [GCBLRVPP1997,JDMKD] [O2004,DMKD]

• In practice, 80% of project time
 – significant SQL code writing; some tools help query writing
 – iterative: between modeling and data set prep
• Little attention in the research literature
 – query optimization mostly in OLAP context
 – new operators: PIVOT, horizontal aggregations
 – research issue: can algorithms directly analyze 3NF tables?

4/60

Page 5: Database Systems Research on Data Mining

Data set preparation [GO2010,DKE] [OG2008,DSS]

• Overall goal: getting data set for analysis

• Database processing generally required

– normalized (many 3NF) databases cannot be directly analyzed

– joins, aggregations and pivoting (transposing)

• Data cleaning

– remove outliers

– null replacement

– repair referential integrity

• Data transformation: categorical columns; rescaling; coding (see the sketch below)
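Not on the original slide; a minimal SQL sketch of these cleaning/transformation steps, assuming the horizontal table X(i, X1, X2, X3, G) used later in the tutorial (the 0.0 default and the [0,100] range are made-up illustration values):

-- null replacement and a simple range-based outlier filter, folded into the prep query
SELECT i,
       COALESCE(X1, 0.0) AS X1,   -- null replacement (a column mean is another common choice)
       COALESCE(X2, 0.0) AS X2,
       X3,
       G
FROM   X
WHERE  X3 BETWEEN 0 AND 100;      -- hypothetical valid range used to drop outliers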

5/60

Page 6: Database Systems Research on Data Mining

Typical queries to create the data set: joins and aggregations
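The query on the original slide did not survive extraction; a hedged sketch of the kind of join + aggregation (denormalization) query meant here, assuming two hypothetical 3NF tables customer(cid, city) and txn(cid, amount, web):

-- one row per analysis record (customer), built by joining and aggregating 3NF tables
SELECT C.cid AS i,
       SUM(T.amount)                              AS X1,  -- total spend
       COUNT(*)                                   AS X2,  -- number of transactions
       SUM(CASE WHEN T.web = 1
                THEN T.amount ELSE 0 END)         AS X3   -- pivot-style derived dimension
FROM   customer C
JOIN   txn T ON T.cid = C.cid
GROUP BY C.cid;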

6/60

Page 7: Database Systems Research on Data Mining

1.2 Data set

• Data set X with n records
• Each record has attributes: numeric, discrete, or both (mixed)
• Focus of the tutorial: d dimensions
• Generally, d ≪ n
• High d makes the problem mathematically more difficult
• Extra column G/Y

7/60

Page 8: Database Systems Research on Data Mining

Example of data set: horizontal layout

n=5, d=3 and G/Y

i X1   X2   X3  G
1 1.7   8.2 4.3 1
2 3.4  10.5 1.0 0
3 9.3  12.2 2.5 0
4 5.7   7.3 8.8 0
5 2.5  13.3 3.2 1

8/60

Page 9: Database Systems Research on Data Mining

1.3 Data Mining Models [STA1998,SIGMOD]

• Models, coming mostly from statistics and machine learning
 – based on matrix computations, probability and calculus
 – time dimension not considered
• Patterns, mostly combinatorial
 – association rules, cubes, sequences and graphs
 – important, but not considered in this tutorial
 – quite different algorithms from models
 – no strong statistical foundation

9/60

Page 10: Database Systems Research on Data Mining

Common data mining models [DLR1977,RSS]

• Unsupervised
 – math: simpler
 – tasks: clustering, dimensionality reduction
 – models: KM, EM, PCA/SVD, FA
 – statistical tests overlap both
• Supervised
 – math: more tuning and validation than unsupervised
 – tasks: classification, regression
 – models: decision trees, Naïve Bayes, Bayesian classifiers, linear/logistic regression, SVM, neural nets

10/60

Page 11: Database Systems Research on Data Mining

Data mining models: characteristics

• Multidimensional
 – tens, hundreds of dimensions
 – feature selection and dimensionality reduction
• Represented & computed with matrices & vectors
 – data set: set of vectors or set of records; all numeric or mixed attributes
 – model: numeric = matrices, discrete = histograms
 – intermediate computations: matrices and histograms

11/60

Page 12: Database Systems Research on Data Mining

Why is it hard?

12/60

Page 13: Database Systems Research on Data Mining

Data mining: major tasks

• Model computation

– focus of most research

– generally requires matrix computations

– complex and slow algorithms (iterative)

– large n makes it slower

• Scoring data set (SQL sketch below)

– assumes model exists

– useful for tuning, testing and model exchange

– fast: generally requires only one pass over X

– research issue: not studied enough in literature
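A minimal sketch of why scoring needs only one pass over X, assuming a linear regression model has already been computed; the coefficients 2.1, 0.5, -1.3, 0.8 are made-up and X is the example data set used throughout:

-- scoring = evaluating the model as an expression over one scan of X
SELECT i,
       2.1 + 0.5*X1 - 1.3*X2 + 0.8*X3 AS yhat
FROM   X;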

13/60

Page 14: Database Systems Research on Data Mining

1.4 Data Mining Algorithms: input and output

• Input: data set X with n records, d dimensions
• Output: model, quality
• Parameters (representative; they vary a lot):
 – k (clusters, principal components, discrete states)
 – epsilon for stopping (accuracy, convergence, local optima)
 – feature/variable selection (algorithm dependent; step-wise or, more recently, Bayesian statistics)

14/60

Page 15: Database Systems Research on Data Mining

Data Mining Algorithms [ZRL1996,SIGMOD]

• Behavior with respect to data set X:
 – one pass, few passes
 – multiple passes, convergence: a bigger issue (most algorithms)
• Time complexity:
• Research issues:
 – preserve time complexity in SQL/MapReduce
 – incremental learning

15/60

Page 16: Database Systems Research on Data Mining

2. Processing alternatives

2.1 Inside DBMS (SQL)

2.2 Outside DBMS (MapReduce)

2.3 Example

16/60

Page 17: Database Systems Research on Data Mining

2.1 Inside DBMS

• Assumption:
 – data records are in the DBMS; exporting is slow
 – row-based storage (not column-based)
• Programming alternatives:
 – SQL and UDFs: SQL code generation (JDBC), precompiled UDFs; extra: SP, embedded SQL, cursors
 – internal C code (direct access to file system and memory)
• DBMS advantages:
 – important: storage, queries, security
 – maybe: recovery, concurrency control, integrity, transactions

17/60

Page 18: Database Systems Research on Data Mining

Inside DBMS: SQL code (CREATE + SELECT); consider the layout
[CDDHW2009,VLDB]

• CREATE TABLE (see the DDL sketch below)
 – row storage: clustered (to group rows of pivoted tables), block size (for large tables)
 – index: primary (generally on the PK, critical for joins), secondary (may help joins & searches)
• SELECT
 – basic mechanism to write queries; standard across DBMSs; arbitrarily complex queries, including arithmetic expressions
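A hedged DDL sketch of the storage choices above; exact clauses vary by DBMS (a Teradata-flavored PRIMARY INDEX is shown, with a generic secondary index), and the column list follows the example data set:

-- horizontal data set, clustered/distributed by point id i
CREATE TABLE X (
  i  INTEGER NOT NULL,
  X1 FLOAT,
  X2 FLOAT,
  X3 FLOAT,
  g  INTEGER
) PRIMARY INDEX (i);        -- Teradata-style; other DBMSs use a clustered index instead

CREATE INDEX X_g ON X (g);  -- optional secondary index to help searches; syntax differs slightly per DBMS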

18/60

Vertical layout: A(i,j,v), B(i,j,v)

A*B:
SELECT A.i, B.j, sum(A.v * B.v)
FROM A JOIN B ON A.j = B.i
GROUP BY A.i, B.j

Page 19: Database Systems Research on Data Mining

Inside DBMS: User-Defined Functions (UDFs)

• Classification:
 – scalar UDF
 – aggregate UDF
 – table UDF
• Programming:
 – called in a SELECT statement
 – C code or a similar language
 – API provided by the DBMS, in C/C++
 – data type mapping

19/60

Page 20: Database Systems Research on Data Mining

Inside DBMS: UDF pros and cons

• Advantages:

– arrays and flow control

– flexibility in code writing and no side effects

– No need to modify DBMS internal code

– In general, simple data types

• Limitations:

– OS and DBMS architecture dependent, not portable

– No I/O capability, no side effects

– Null handling and fixed memory allocation

– Memory leaks with arrays (matrices): fenced/protected mode

20/60

Page 21: Database Systems Research on Data Mining

Inside DBMS: Scalar UDF
[DNPT2006,SAC]

• Memory allocation on the stack
• Returns one value of a simple data type
• Basic SQL data types (e.g., int, float, char)
• May support UDTs
• Call & return value with every row
• Useful for vector operations (see the sketch below)
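Not on the original slide; a sketch of the per-row vector operation a scalar UDF is useful for, where sqdist3 is a hypothetical scalar UDF returning the squared Euclidean distance between the row and a fixed 3-dimensional centroid:

-- scalar UDF called once per row; returns one float per call
SELECT i, sqdist3(X1, X2, X3, 1.7, 8.2, 4.3) AS d2
FROM   X;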

21/60

Page 22: Database Systems Research on Data Mining

Inside DBMS: Aggregate UDF
[JM1998,SIGMOD]

• Table scan
• Memory allocation on the heap
• GROUP BY extends their power
• Also require handling nulls
• Advantage: parallel & multithreaded processing
• Drawback: returns a single value, not a table
• DBMSs: SQL Server, PostgreSQL, Teradata, Oracle, DB2, among others
• Useful for model computations

22/60

Page 23: Database Systems Research on Data Mining

Inside DBMS: Aggregate User-Defined Function

1. Initialization: allocates variable storage

2. Accumulate: processes every record, aggregates some value (vector). Bottleneck.

3. Merge: consolidates partial results from multiple threads

4. Terminate: final processing, return result

23/60

Page 24: Database Systems Research on Data Mining

Inside DBMS: Table UDF
[BRKPHK2008,SIGMOD]

• Main difference with aggregate UDF: returns a table (instead of a single value)
• Also, it can take several input values
• Called in the FROM clause of a SELECT (see the sketch below)
• Stream: no parallel processing, external file
• Computation power same as aggregate UDF
• Suitable for complex math operations and algorithms
• Since the result is a table it can be joined
• DBMSs: SQL Server, DB2, Oracle, PostgreSQL
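A hedged sketch of calling a table UDF in the FROM clause and joining its result back to X; nb_predict is a hypothetical table UDF returning (i, g), and the TABLE(...CURSOR(...)) form is Oracle-flavored (other DBMSs use CROSS APPLY or similar):

-- a table UDF result behaves like a table: it can be joined
SELECT P.i, P.g, X.X1, X.X2, X.X3
FROM   TABLE(nb_predict(CURSOR(SELECT * FROM X))) P
JOIN   X ON X.i = P.i;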

24/60

Page 25: Database Systems Research on Data Mining

Inside DBMS: Internal C code
[LTWZ2005,SIGMOD] [MYC2005,VLDB] [SD2001,CIKM]

• Advantages:
 – access to the file system (table record blocks)
 – physical operators (scan, join, sort, search)
 – main memory, data structures, libraries
 – hardware optimizations: multithreading, multicore, RAM caching, L1/L2 caching
• Disadvantages:
 – requires careful integration with the rest of the system
 – not available to end users and practitioners
 – may require exposing functionality with a DM language or SQL

25/60

Page 26: Database Systems Research on Data Mining

Inside DBMS: Physical Operators
[DG1992,CACM] [SMAHHH2007,VLDB] [WH2009,SIGMOD]

• Serial DBMS (one CPU, maybe RAID):
 – table scan
 – join: hash join, sort-merge join, nested loop
 – external merge sort
• Parallel DBMS (shared-nothing):
 – even row distribution, hashing
 – parallel table scan
 – parallel joins: large/large (sort-merge, hash); large/short (replicate short)
 – distributed sort

26/60

Page 27: Database Systems Research on Data Mining

2.2 Outside DBMS

• Alternatives:
 – MapReduce
 – packages, libraries, Java/C++

• Issue: I/O bottleneck

27/60

Page 28: Database Systems Research on Data Mining

Outside DBMS: MapReduce

[DG2008,CACM]

• Parallel processing; simple; shared-nothing

• Functions are programmed in a high-level programming language (e.g. Java, Python); flexible.

• <key,value> pairs processed in two phases:

– map(): computation is distributed and evaluated in parallel; independent mappers

– reduce(): partial results are combined/summarized

• Can be categorized as inside/outside DBMS, depending on level of integration with DBMS

• DBMS integration: Greenplum, Aster Data, Teradata...

28/60

Page 29: Database Systems Research on Data Mining

Outside DBMS: MapReduce files and processing

• File types:
 – text files: common storage (e.g., CSV files)
 – SequenceFiles: efficient processing
 – custom InputFormat (rarely used)
• Processing:
 – points are sorted by "key" before sending to reducers
 – small files should be merged
 – partial results are stored in the file system
 – intermediate files should be managed as SequenceFiles for efficiency

29/60

Page 30: Database Systems Research on Data Mining

Outside DBMS: Packages, libraries, Java/C++
[ZHY2009,CIDR] [ZZY2010,ICDE]

• Statistical and data mining packages:
 – exported flat files; proprietary file formats
 – memory-based (processing data records, models, internal data structures)
• Programming languages:
 – arrays
 – flexibility of control statements
• Limitation: large number of records
• Packages: R, SAS, SPSS, KXEN, Matlab, WEKA

30/60

Page 31: Database Systems Research on Data Mining

2.3 Naïve Bayes example: horizontal layout

• NB
 – one pass
 – Gaussian, sufficient statistics N, L, Q (defined below)
• Example in:
 – SQL
 – UDF
 – MapReduce
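The sufficient statistics N, L, Q used by all three implementations can be written per class g as follows (Q is kept per dimension, i.e., diagonal, for NB; these match the means and variances computed in the Terminate()/reduce() code below):

N_g = \sum_{i:\,g_i = g} 1, \qquad
L_g = \sum_{i:\,g_i = g} x_i, \qquad
Q_g = \sum_{i:\,g_i = g} x_i \odot x_i

\mu_g = \frac{L_g}{N_g}, \qquad
\sigma^2_g = \frac{Q_g}{N_g} - \mu_g \odot \mu_g, \qquad
\pi_g = \frac{N_g}{n}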

31/60

Data structures:

public double N;
public double[] L;
public double[] Q;

Page 32: Database Systems Research on Data Mining

Naïve Bayes: SQL (2 passes; N, L, Q; triangular Q)

/* Inserting N, L */
SELECT g
  ,sum(1.0) AS Ng   /* N */
  ,sum(X1)  AS L_X1 /* L */
  ,sum(X2)  AS L_X2
  ,sum(X3)  AS L_X3
FROM X
GROUP BY g;

/* Inserting into Q */
SELECT g
  ,sum(power(X1,2)) AS Q_X1 /* Q */
  ,sum(power(X2,2)) AS Q_X2
  ,sum(power(X3,2)) AS Q_X3
FROM X
GROUP BY g;

32/60

/* Lower triangular, for PCA and LR */
SELECT
   sum(X1*X1), null,       ..., null
  ,sum(X2*X1), sum(X2*X2), ..., null
  ...
  ,sum(Xd*X1), sum(Xd*X2), ..., sum(Xd*Xd)
FROM X;

Page 33: Database Systems Research on Data Mining

Naïve Bayes: Aggregate UDF (1 pass)

33/60

public void Init() {
  nbnlq = new NBNLQ();
  int h;
  nbnlq.N = 0;
  for (h = 1; h <= nbnlq.d; h++) {
    nbnlq.L[h] = 0;
    nbnlq.Q[h] = 0;
  }
}

public void Merge(udf_nb_train_d3 thread) {
  int h;
  nbnlq.d = thread.nbnlq.d;
  nbnlq.N += thread.nbnlq.N;
  for (h = 1; h <= nbnlq.d; h++) {
    nbnlq.L[h] += thread.nbnlq.L[h];
    nbnlq.Q[h] += thread.nbnlq.Q[h];
  }
}

public void Accumulate(Xd3 X) {
  int h;
  if (!X.IsNull) {
    nbnlq.d = X.getD();
    nbnlq.N += 1.0;
    for (h = 1; h <= nbnlq.d; h++) {  // update L, Q
      nbnlq.L[h] += X.getColumn(h);
      nbnlq.Q[h] += X.getColumn(h) * X.getColumn(h);
    }
  }
}

public SqlString Terminate() {
  int h;                                         // h and result were not declared on the slide
  StringBuilder result = new StringBuilder();
  for (h = 1; h <= nbnlq.d; h++) {               // means: C[h] = L[h]/N
    result.Append("C" + h + "=");
    result.Append(nbnlq.L[h] / nbnlq.N);
    result.Append(",");
  }
  for (h = 1; h <= nbnlq.d; h++) {               // variances: R[h] = Q[h]/N - (L[h]/N)^2
    result.Append("R" + h + "=");
    result.Append(nbnlq.Q[h] / nbnlq.N - Math.Pow(nbnlq.L[h] / nbnlq.N, 2));
    result.Append(",");
  }
  return new SqlString(result.ToString());       // missing return added
}
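Not shown on the slides; a hedged sketch of how an aggregate UDF like the one above might be invoked (SQL Server CLR style is assumed, and passing the three columns separately instead of the Xd3 UDT parameter is a simplification):

-- one table scan; the DBMS runs Init/Accumulate per thread, then Merge and Terminate
SELECT g, dbo.udf_nb_train_d3(X1, X2, X3) AS nb_stats
FROM   X
GROUP BY g;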

Page 34: Database Systems Research on Data Mining

Naïve Bayes MapReduce (text file, unoptimized)

public void map() {
  _key.set("n");
  context.write(_key, new Text("1"));
  splitStr = lineWithoutTerminator.split(",");
  int d = splitStr.length - 1;
  for (h = 1; h <= d; h++) {
    _key.set("g" + splitStr[splitStr.length - 1] + "_" + "h" + (h));
    _val.set(splitStr[h - 1].toString());
    context.write(_key, _val);
  }
}

public void reduce() {
  if (key.toString().contains("h")) {
    n = 0; L = 0; Q = 0;
    for (Text val : values) {
      attr = Double.parseDouble(val.toString());
      n += 1;
      L += attr;
      Q += attr * attr;
    }
    mean = L / n;
    var = Q / n - Math.pow(mean, 2);
    each_row = "N=" + n + ";L=" + L + ";Q=" + Q + ";mean=" + mean + ";var=" + var;
    _val.set(new Text(each_row));
    context.write(new Text(key.toString()), _val);
  } else {
    n = 0;
    for (Text val : values) {
      attr = Double.parseDouble(val.toString());
      n += attr;
    }
    _val.set(new Text(Double.toString(n)));
    context.write(new Text(key.toString()), _val);
  }
}

34/60

Page 35: Database Systems Research on Data Mining

3. Storage and Optimizations

• Storage layouts:
 – horizontal
 – vertical
• Optimizations:
 – algorithmic: general
 – systems-oriented: SQL and MapReduce

35/60

Page 36: Database Systems Research on Data Mining

Horizontal layout: DBMS data set

X(i,X1,…,Xd,G/Y)

• Most common format in DM
• Joins / aggregations / arithmetic expressions, pivot

i X1   X2   X3  G
1 1.7   8.2 4.3 1
2 3.4  10.5 1.0 0
3 9.3  12.2 2.5 0
4 5.7   7.3 8.8 0
5 2.5  13.3 3.2 1

36/60

Page 37: Database Systems Research on Data Mining

Horizontal layout: DBMS

[DM2006,SIGMOD]

• Physical operator (most common):
 – table scan (default in a SQL query or UDF)
• External sort or hash table:
 – SQL GROUP BY query
 – UDF GROUP BY
• Join algorithm:
 – SQL queries
 – not required in a UDF

37/60

Page 38: Database Systems Research on Data Mining

Horizontal layout: DBMS

• Size of table: n rows
• Limited DDL control in SQL
 – limited number of columns
 – requires assigning a point id i (primary index)

• Clustered storage by block on i allows processing several rows at the same time

38/60

Page 39: Database Systems Research on Data Mining

Horizontal layout, DBMS: X1+X2+X3

• No arrays
• Access to dimensions through SQL generation (Java/C++)

Example (C++):
double X[4], SUM = 0.0;
int i = 1, h = 0, d = 3, g;
while (fscanf(fp, "%lf%lf%lf%d", &X[1], &X[2], &X[3], &g) != EOF) {  /* &g added: the format string also reads the G column */
  SUM = 0.0;
  for (h = 1; h <= d; h++) {
    SUM += X[h];
  }
  printf("%d\t%g\r\n", i, SUM);
  i++;
}

Example (Java, SQL generation):
String query = "SELECT i";
String sum = "";
for (int h = 1; h <= d; h++) {
  query += ", X" + h;
  sum += "X" + h + "+";
}
query += ", " + sum.substring(0, sum.length() - 1) + " FROM X";

SELECT i, x1, x2, x3, X1+X2+X3 FROM X

i X1   X2   X3  X1+X2+X3
1 1.7   8.2 4.3 14.2
2 3.4  10.5 1.0 14.9
3 9.3  12.2 2.5 24.0
4 5.7   7.3 8.8 21.8
5 2.5  13.3 3.2 19.0

39/60


Page 40: Database Systems Research on Data Mining

Vertical layout: DBMS, X(i,h,v,g)

• When d exceeds the DBMS limits; d > n
• Index by point i and dimension h
• Clustered row storage by i (correctness in UDF, efficiency in SQL query)
• Queries require joins & aggregations
• Columns as subscripts
• Size <= dn rows (sparse)
• Two tables with n rows and the same PK can be joined in time O(n) using a hash join

40/60

i h  v    G
1 1  1.7  1
1 2  8.2  1
1 3  4.3  1
2 1  3.4  0
2 2 10.5  0
2 3  1.0  0
3 1  9.3  0
3 2 12.2  0
3 3  2.5  0
4 1  5.7  0
4 2  7.3  0
4 3  8.8  0
5 1  2.5  1
5 2 13.3  1
5 3  3.2  1

Page 41: Database Systems Research on Data Mining

Vertical layout, DBMS: X1+X2+X3

• Requires using UDFs
• SQL statements using (indexed) joins are required

C++:
double X, SUM = 0.0;
int h, old_i = 0, i;
while (fscanf(fp, "%d%d%lf", &i, &h, &X) != EOF) {
  if (old_i != i) {
    if (old_i != 0)
      printf("%d\t%g\r\n", old_i, SUM);
    old_i = i;
    SUM = X;
  } else {
    SUM += X;
  }
}
printf("%d\t%g\r\n", i, SUM);

Java (SQL generation):
String query = "SELECT i, SUM(v) FROM X";
query += " GROUP BY i";

SELECT i, SUM(v) FROM X GROUP BY i

i X1+X2+X3
1 14.2
2 14.9
3 24.0
4 21.8
5 19.0

41/60


Page 42: Database Systems Research on Data Mining

Horizontal layout: MapReduce

• Line number represents the point ID (implicit)
• No indexes in general
• Flat file

42/60

X.csv

x1,x2,x3,G
1.7,8.2,4.3,1
3.4,10.5,1.0,0
9.3,12.2,2.5,0
5.7,7.3,8.8,0
2.5,13.3,3.2,1

Page 43: Database Systems Research on Data Mining

Horizontal vs. vertical layout

• High d: horizontal is limited (max. columns); vertical has no problem with high d.
• Default: horizontal is the default layout for most algorithms; vertical requires a clustered index.
• SQL: horizontal uses arithmetic expressions and UDFs; vertical uses aggregations, joins, UDFs.
• Interpretation: horizontal is easy to interpret; vertical is difficult to interpret.
• Matrices: horizontal suits dense matrices; vertical suits sparse matrices.
• UDF: horizontal processes complete records; vertical must detect point boundaries.
• Size: horizontal has n rows, d columns; vertical has dn rows and few (3 or 4) columns.
• I/O: horizontal is fast (n I/Os); vertical is slower (dn I/Os; n I/Os if clustered).

43/60

Page 44: Database Systems Research on Data Mining

3.2 Optimizations: algorithmic & systems

• Algorithmic
 – 90% of research, many efficient algorithms
 – accelerate/reduce computations or convergence
 – database systems focus: reduce I/O
 – approximate solutions
• Systems (SQL, MapReduce)
 – platform: parallel DBMS server vs. cluster of computers
 – programming: SQL/C++ versus Java

44/60

Page 45: Database Systems Research on Data Mining

Algorithmic [ZRL1996,SIGMOD]

• Implementation: data set available as flat file, binary file required for random access

• May require data structures working in main memory and disk

• Programming not in SQL: C/C++ are preferred languages, although Java becoming common

• MapReduce is becoming popular• Assumption d<<n: n has received more attention• Issue: d>n produces numerical issues and large

covariance/correlation matrix (larger than X)

45/60

Page 46: Database Systems Research on Data Mining

Algorithmic optimizations [STA1998,SIGMOD] [ZRL1996,SIGMOD] [O2007,SIGMOD]

• Exact model computation:
 – summaries: sufficient statistics (Gaussian pdf), histograms, discretization
 – accelerate convergence, reduce iterations
 – faster matrix operations: * +
• Approximate model computation:
 – sampling: efficient in time O(s)
 – incremental:
   • math: escape local optima (EM), reseed
   • database systems: favor table scans

46/60

Page 47: Database Systems Research on Data Mining

Systems optimizations: DBMS
[O2006,TKDE], [ORD2010,TKDE]

• SQL query optimization
 – mathematical equations as queries
 – Turing-complete: SQL code generation and a programming language
• UDFs as optimization
 – substitute key mathematical operations
 – push processing into RAM

47/60

Page 48: Database Systems Research on Data Mining

Systems optimizations: DBMS SQL query
[O2004,DMKD]

• Denormalization
• Issue: query rewriting (the optimizer falls short)
• Index depends on layout
• Horizontal layout:
 – indexed by i
 – d may be an issue, thus vertical partitioning
• Vertical layout:
 – storage: clustered by point
 – indexing by subscript
 – use a specific join algorithm

48/60

Page 49: Database Systems Research on Data Mining

Systems optimizations: DBMS SQL query
[O2006,TKDE] [OP2010,TKDE] [OP2010,DKE] [MC2002,ICDM]

• Join:
 – denormalized storage: model, intermediate tables
 – favor hash joins over sort-merge: both tables with primary index on i
 – secondary indexing for the join: sort-merge join
• Aggregation (compression):
 – push group-by before the join (sketched below): watch out for nulls and high-cardinality columns like point i
• Synchronized table scans: several SELECTs on the same table; examples: unpivoting; 2+ models
• Sampling: O(s), random access, truly random; error
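A minimal sketch of pushing a group-by below a join, assuming the vertical table X(i,h,v) and a hypothetical small per-dimension weight table M(h,w); pre-aggregating X first means only d rows reach the join instead of dn (this works because the grouping key h has low cardinality, unlike point i):

-- join first, then aggregate
SELECT X.h, SUM(X.v * M.w) AS s
FROM   X JOIN M ON X.h = M.h
GROUP BY X.h;

-- equivalent, with the group-by pushed before the join
SELECT XA.h, XA.s * M.w AS s
FROM   (SELECT h, SUM(v) AS s FROM X GROUP BY h) XA
JOIN   M ON XA.h = M.h;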

49/60

Page 50: Database Systems Research on Data Mining

Naïve Bayes: SQL (optimized)

/* Inserting into NLQ */
INSERT INTO NLQ
SELECT g
  ,sum(1.0) AS Ng   /* N */
  ,sum(X1)  AS L_X1 /* L */
  ,sum(X2)  AS L_X2
  ,sum(X3)  AS L_X3
  ,sum(power(X1,2)) AS Q_X1 /* Q */
  ,sum(power(X2,2)) AS Q_X2
  ,sum(power(X3,2)) AS Q_X3
FROM X
GROUP BY g;

/* Inserting into NB */
INSERT INTO NB
SELECT g
  ,Ng/T.Nglobal               /* pi */
  ,L_X1/Ng                    /* C */
  ,L_X2/Ng
  ,L_X3/Ng
  ,Q_X1/Ng - power(L_X1/Ng,2) /* R */
  ,Q_X2/Ng - power(L_X2/Ng,2)
  ,Q_X3/Ng - power(L_X3/Ng,2)
FROM NLQ, (SELECT SUM(Ng) AS Nglobal FROM NLQ) T;
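Not on the original slide; a hedged sketch of scoring X with the NB table built above, assuming its columns are named (g, prior, C_X1..C_X3, R_X1..R_X3) and that the DBMS provides ln() (e.g., LOG() in SQL Server); the class with the largest score per point i would then be picked:

-- per-class Gaussian Naïve Bayes log-score, one scan of X
SELECT X.i, NB.g,
       ln(NB.prior)
       - 0.5*( ln(NB.R_X1) + ln(NB.R_X2) + ln(NB.R_X3) )
       - 0.5*( power(X.X1 - NB.C_X1, 2)/NB.R_X1
             + power(X.X2 - NB.C_X2, 2)/NB.R_X2
             + power(X.X3 - NB.C_X3, 2)/NB.R_X3 ) AS score
FROM   X CROSS JOIN NB;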

50/60

Page 51: Database Systems Research on Data Mining

Systems optimizations: DBMS UDF
[HLS2005,TODS] [O2007,TKDE]

• UDFs can substitute SQL code
 – UDFs can express complex math computations
 – scalar UDFs: vector operations

• Aggregate UDFs: compute data set summaries in parallel

• Table UDFs: stream model; external temporary file

51/60

Page 52: Database Systems Research on Data Mining

Naïve Bayes: Aggregate UDF (optimized, 1 pass, same as before)

public void Init() {
  nbnlq = new NBNLQ();
  int h;
  nbnlq.N = 0;
  for (h = 1; h <= nbnlq.d; h++) {
    nbnlq.L[h] = 0;
    nbnlq.Q[h] = 0;
  }
}

public void Merge(udf_nb_train_d3 thread) {
  int h;
  nbnlq.d = thread.nbnlq.d;
  nbnlq.N += thread.nbnlq.N;
  for (h = 1; h <= nbnlq.d; h++) {
    nbnlq.L[h] += thread.nbnlq.L[h];
    nbnlq.Q[h] += thread.nbnlq.Q[h];
  }
}

public void Accumulate(Xd3 X) {
  int h;
  if (!X.IsNull) {
    nbnlq.d = X.getD();
    nbnlq.N += 1.0;
    for (h = 1; h <= nbnlq.d; h++) {  // update L, Q
      nbnlq.L[h] += X.getColumn(h);
      nbnlq.Q[h] += X.getColumn(h) * X.getColumn(h);
    }
  }
}

public SqlString Terminate() {
  int h;                                         // h and result were not declared on the slide
  StringBuilder result = new StringBuilder();
  for (h = 1; h <= nbnlq.d; h++) {               // means: C[h] = L[h]/N
    result.Append("C" + h + "=");
    result.Append(nbnlq.L[h] / nbnlq.N);
    result.Append(",");
  }
  for (h = 1; h <= nbnlq.d; h++) {               // variances: R[h] = Q[h]/N - (L[h]/N)^2
    result.Append("R" + h + "=");
    result.Append(nbnlq.Q[h] / nbnlq.N - Math.Pow(nbnlq.L[h] / nbnlq.N, 2));
    result.Append(",");
  }
  return new SqlString(result.ToString());       // missing return added
}

52/60

Page 53: Database Systems Research on Data Mining

MapReduce [ABASR2009,VLDB] [CDDHW2009,VLDB] [SADMPPR2010,CACM]

• Data set
 – keys as input, partition the data set
 – text versus sequence file
 – loading into the file system may be required
• Parallel processing
 – high-cardinality keys: i
 – handle skewed distributions
 – reduce row redistribution in map()

• Main memory processing

53/60

Page 54: Database Systems Research on Data Mining

MapReduce: processing
[DG2008,CACM] [FPC2009,PVLDB] [PHBB2009,PVLDB]

• Modify block size
• Disable block replication
• Delay reduce()
• Tune M and R (memory allocation and number)
• Several M use the same R
• Avoid full table scans by using subfiles (requires a naming convention)
• combine() in map() to shrink intermediate files
• SequenceFiles as input with custom data types

54/60

Page 55: Database Systems Research on Data Mining

MapReduce: issues

• Loading and converting to binary may be necessary
• Input key generally OK if high cardinality
• Skewed map key distribution
• Key redistribution (a lot of message passing)

55/60

Page 56: Database Systems Research on Data Mining

MapReduce Optimized

public static class NBHMapper() {
  context.write(key, val);
}

public static class NBHCombiner() {
  for (DoubleArrayWritable val : values) {
    n++;
    x = (DoubleWritable[]) val.toArray();
    for (int h = 1; h <= d; h++) {
      attr = x[h - 1].get();
      L[h] += attr;
      Q[h] += attr * attr;
    }
  }
  _val_array[1].set(n);
  for (int h = 1; h <= d; h++) {
    _val_array[1 + h].set(L[h]);
  }
  for (int h = 1; h <= d; h++) {
    _val_array[1 + d + h].set(Q[h]);
  }
}

public static class NBHReducer() {
  for (DoubleArrayWritable val : values) {
    x = (DoubleWritable[]) val.toArray();
    n += x[1].get();
    for (int h = 1; h <= d; h++) { L[h] += x[1 + h].get(); }
    for (int h = 1; h <= d; h++) { Q[h] += x[1 + d + h].get(); }
  }
  each_row = "N=" + n;
  each_row += ";C=";
  for (int h = 1; h <= d; h++) { each_row += L[h] / n + ","; }
  each_row += ";R=";
  for (int h = 1; h <= d; h++) { each_row += Q[h] / n - Math.pow((L[h] / n), 2) + ","; }
}

56/60

Page 57: Database Systems Research on Data Mining

SQL vs MapReduce: processing & I/O bottleneck (bulk load)

[PPRADMS2009,SIGMOD] [O2010,TKDE]

n x 1M | SQL: Import Build Total | MR*: Import Build Total
 1     |       18     4    22    |       48    38    86
 2     |       41     4    45    |       94    59   153
 4     |       81     9    90    |      185    91   276
 8     |      147    18   165    |      367   153   520
16     |      331    41   372    |      730   285  1015

*MR times include conversion into a SequenceFile.

Import and Model Computation Times for SQL and MR (times in secs).

57/60

Page 58: Database Systems Research on Data Mining

Systems optimizations: SQL vs MR (optimized versions, run on the same hardware; rank 1 = best, 3 = worst)

Task SQL UDF MR

Speed: compute model 1 2 3

Speed: score data set 1 3 2

Programming flexibility 3 2 1

Process non-tabular data 3 2 1

Loading speed 1 1 2

Ability to add optimizations 2 1 3

Manipulating data key distribution 1 2 3

Immediate processing (push=SQL, pull=MR) 2 1 3

58/60

Page 59: Database Systems Research on Data Mining

Research issues: both SQL and MapReduce
[BFR1998,KDD] [CFB1999,ICDE] [SADMPPR2010,CACM]

• Fast data mining algorithms solved? Yes, but not when data sets are stored in a DBMS
• SQL and MR have many similarities: shared-nothing
• Fast load/unload interfaces between both systems; tighter integration
• General tradeoffs in speed and programming: horizontal vs. vertical layout
• Incremental algorithms
 – one pass (streams) versus parallel processing
 – reduce passes/iterations

59/60

Page 60: Database Systems Research on Data Mining

Research Issues on Each [ABASR2009,VLDB] [CDDHW2009,VLDB] [CKLRSS2009,VLDB]

• DBMS:

– C++/Java libraries generating SQL code, pushing processing: Oracle, Teradata, SAS, KXEN

– Internal C code: commercial DBMSs, open-source?

– Study aggregate UDFs for complex models; extend Table UDF support: I/O bottleneck, streams

– Extend SQL with more DM primitives and constructs or forget extending SQL for DM?

– Specialized DBMS, middleware: SciDB, RIOT

• MapReduce:

– SQL+MapReduce: Greenplum, Aster, Teradata

– MapReduce only: Mahout

– MapReduce for query processing and data mining: especially joins, aggregations OK

60/60

Page 61: Database Systems Research on Data Mining

Thank you… Q&A

• Special thanks:
 – Carlos Garcia-Alvarado
• DBMS Group at UH:
 – Sasi K. Pitchaimalai
 – Mario Navas
 – Zhibo Chen

Page 62: Database Systems Research on Data Mining

References

• [ABASR2009,VLDB] A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Silberschatz, and A. Rasin. HadoopDB: an architectural hybrid of MapReduce and DBMS technologies for analytical workloads. Proc. VLDB Endow., pages 922-933, 2009.

• [BFR1998,KDD] P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, pages 9-15, 1998.

• [BRKPHK2008,SIGMOD] J.A Blakeley, V. Rao, I. Kunen, A. Prout, M. Henaire, and C. Kleinerman. .NET database programmability and extensibility in microsoft SQL server. In ACM SIGMOD, pages 1087-1098. 2008.

• [CFB1999,ICDE] S. Chaudhuri, U. Fayyad, and J. Bernhardt. Scalable classification over SQL databases. ICDE, 00:470, 1999.

• [CDHHL1999,KDD] J. Clear, D. Dunn, B. Harvey, M.L. Heytens, and P. Lohman. Non-stop SQL/MX primitives for knowledge discovery. In ACM KDD Conference, pages 425-429, 1999.

• [CDDHW2009,VLDB] J. Cohen, B. Dolan, M. Dunlap, J. Hellerstein, and C. Welton. MAD skills: New analysis practices for big data. In VLDB Conference, pages 1481-1492, 2009.

• [CKLRSS2009,VLDB] A demonstration of SciDB: a science-oriented DBMS. In VLDB Conference, pages 1534-1537,2009.

• [DG2008,CACM] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113,2008.

• [DLR1977,RSS] A.P. Dempster, N.M. Laird, and D. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of The Royal Statistical Society, 39(1):1-38, 1977.

• [DM2006,SIGMOD] A. Deshpande and S. Madden. MauveDB: supporting model-based user views in database systems. In SIGMOD Conference, pages 73-84, 2006.

Page 63: Database Systems Research on Data Mining

References (continued)

• [DG1992,CACM] D. DeWitt, J. Gray. Parallel database systems: the future of high performance database systems. Communications of the ACM, 35(6):85-98, 1992.

• [DNPT2006,SAC] A. Dorneich, R. Natarajan, E.P.D. Pednault, and F. Tipu. Embedded predictive modeling in a parallel relational database. In SAC, pages 569-574, 2006.

• [FPC2009,PVLDB] E. Friedman, P. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. PVLDB, 2(2):1402-1413, 2009.

• [GO2010,DKE] J. García-García, C. Ordonez: Extended aggregations for databases with referential integrity issues. Data Knowl. Eng. 69(1): 73-95 (2010).

• [GCBLRVPP1997,JDMKD] J. Gray and S. Chaudhuri and A. Bosworth and A. Layman and D. Reichart and M. Venkatrao and F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery., 1(1):29-53,1997.

• [HLS2005,TODS] Z. He, B. S. Lee, and R. Snapp. Self-tuning cost modeling of user-defined functions in an object-relational DBMS. ACM Trans. Database Syst., 30(3):812-853, 2005.

• [JM1998,SIGMOD] M. Jaedicke and B. Mitschang. On parallel processing of aggregate and scalar functions in object-relational DBMS. In ACM SIGMOD Conference, pages 379-389, 1998.

• [LTWZ2005,SIGMOD] C. Luo, H. Thakkar, H. Wang, and C. Zaniolo. A native extension of SQL for mining data streams. In ACM SIGMOD, pages 873-875, New York, NY, USA, 2005.

• [MC2002,ICDM] B.L. Milenova and M.M. Campos. O-cluster: Scalable clustering of large high dimensional data sets. In Proc. IEEE ICDM Conference, page 290, Washington, DC, USA, 2002.

• [MYC2005,VLDB] B.L. Milenova, J. Yarmus, and M.M. Campos. SVM in Oracle database 10g: Removing the barriers to widespread adoption of support vector machines. In VLDB Conference, 2005.

• [NCFB2001,ICDE] A. Netz, S. Chaudhuri, U. Fayyad, and J. Berhardt. Integrating data mining with SQL databases: OLE DB for data mining. In IEEE ICDE Conference, pages 379-387, 2001.

Page 64: Database Systems Research on Data Mining

References (continued)

• [O2004,DMKD] C. Ordonez. Horizontal aggregations for building tabular data sets. In ACM SIGMOD Data Mining & Knowledge Discovery Workshop (DMKD), pages 35-42, 2004.

• [O2006,TKDE] C. Ordonez. Integrating K-means clustering with a relational DBMS using SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 18(2):188-201, 2006.

• [O2007,SIGMOD] C. Ordonez. Building Statistical Models and Scoring with UDFs. In SIGMOD Conference, pages 1005-1016, 2007.

• [O2010,TKDE] C. Ordonez. Statistical Model Computation with UDFs. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2010

• [OP2010,TKDE] C. Ordonez, S.K. Pitchaimalai. Bayesian Classifiers Programmed in SQL. IEEE Transactions on Knowledge and Data Engineering (TKDE), 22(1):139-144, 2010.

• [OP2010,DKE] C. Ordonez, S.K. Pitchaimalai. Fast UDFs to Compute Sufficient Statistics on Large Data Sets exploiting Caching and Sampling, Data and Knowledge Engineering Journal (DKE), 2010.

• [OG2008,DSS] C. Ordonez, J. García-García: Referential integrity quality metrics. Decision Support Systems 44 (2): 495-508 (2008)

• [O2003,JLINUX] M. Owens. Embedding an SQL database with SQLite. Linux J., 2003(110):2, 2003.

• [PHBB2009,PVLDB] B. Panda, J. Herbach, S. Basu, and R.J. Bayardo. PLANET: Massively parallel learning of tree ensembles with MapReduce. PVLDB, 2(2):1426-1437, 2009.

• [PPRADMS2009,SIGMOD] A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D.J. DeWitt, S. Madden, and Stonebraker, M. A comparison of approaches to large-scale data analysis. In SIGMOD Conference, pages 165-178, 2009.

• [STA1998,SIGMOD] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: alternatives and implications. In ACM SIGMOD, pages 343-354, 1998.

• [SD2001,CIKM] K. Sattler and O. Dunemann. SQL database primitives for decision tree classifiers. In ACM CIKM Conference, pages 379-386, 2001.

Page 65: Database Systems Research on Data Mining

References (continued)

• [SMAHHH2007,VLDB] M. Stonebraker, S. Madden, D. J. Abadi, S. Harizopoulos, N. Hachem, and P. Helland. The end of an architectural era: (it's time for a complete rewrite). In VLDB, pages 1150-1160, 2007.

• [WH2009,SIGMOD] F. M. Waas and J. M. Hellerstein. Parallelizing extensible query optimizers. In SIGMOD Conference, pages 871-878, 2009.

• [ZHY2009,CIDR] Y. Zhang, H. Herodotou, and J. Yang. Riot: I/O-efficient numerical computing without SQL. In CIDR, 2009.

• [ZZY2010,ICDE] Y. Zhang, W. Zhang, J. Yang. I/O-Efficient Statistical Computing with RIOT. In ICDE, 2010.

• [ZRL1996,SIGMOD] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, pages 103-114, 1996.