
MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR DISTRIBUTED DATA MINING

Masum Serazi, Amal Perera, Qiang Ding, Vasiliy Malakhov, William Perrizo

North Dakota State University, Computer Science Department

Outline
• Introduction to Distributed Data Mining
  • Demands
  • Existing Projects
  • Architecture
• Importance of a Layered Architecture
• A Prototype System
  • System Architecture (layered)
  • Server
  • Communication
  • Client & GUI
  • System Characteristics
• Conclusion

Demands of distributed data mining
• Large dataset size
• Diversity of data
• Geographic distribution of users and resources
• Computationally intensive result generation

Large-scale distributed data mining projects
• Kensington project
  • Mines enterprise data distributed across the Internet.
• Papyrus project
  • Based on mobile agents implemented in Java.
• PaDDMAS
  • A component-based tool set that integrates pre-developed or custom packages.
• JAM
  • An agent-based distributed system developed to mine data stored at different sites.
• BODHI
  • Collective data mining, with emphasis on learning from vertically partitioned data.

Architecture
• Client-Server
  • Advantage: able to use high-performance computing on the server side to do the data mining.
• Agent based
• Hybrid

Importance of a Layered Architecture
• A layered framework helps to manage complexity.
• Provides the flexibility to add, remove, or modify layers and the components of a layer.
• Allows better tracking of progress on large, complex projects.
• Human input is required to tune the data and the algorithms to suit the need (the mix of greyware versus software can change over time).

System Architecture
• DataMIME™ was developed as a proof of concept.
• Based on the patent-pending "P-tree technology".
• Efficient and scalable system.
• Flexible plug-ins.

Conceptual view of the system
[Figure labels: Client Side; Server Side (Master Server, one of the Slave Servers); Internet; Capture dataset to DataMIME™; Integrate data (synchronize to existing); Mine on DataMIME™; System performance analysis.]

Server Architecture
• Data capture and integration layer (DCI/DII)
• Data mining interface (DMI)
• Distributed Ptree Management Interface (DPMI)
  • Uniform data structure
• Data mining algorithms (DMA)
• Client-server communication
• Client interface

[Figure: server layer diagram. Labels: DCI/DII Layer (room for new feeder); DMA Layer (already plugged algorithm, plugs for new algorithms); DMI Layer; DPMI: Distributed Ptree Management Interface; Distributed P-tree database.]

The Distributed P-tree Database
• The DPD stores all data in vertical format, as Predicate-trees (P-trees) based on the patent-pending P-tree technology, as opposed to the ubiquitous horizontal (record-based) data structure used in DBMSs.
• P-trees can be 0-dimensional, 1-dimensional, 2-dimensional, etc.
• The next slide shows the detailed construction of 1-D P-trees from a generic horizontal table of data.


Construction of 1-D P-trees
• Current practice: structure data into horizontal records and process them vertically (scans).
• Predicate-tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic P-tree. Horizontally AND the basic P-trees.

Horizontally structured records, scanned vertically. Each attribute is coded in three bits, and each bit column is one bit slice (R11, R12, R13 for A1, through R41, R42, R43 for A4):

R(A1 A2 A3 A4)   R[A1] R[A2] R[A3] R[A4]
 2  7  6  1       010   111   110   001
 3  7  6  0       011   111   110   000
 2  6  5  1       010   110   101   001
 2  7  5  7       010   111   101   111
 5  2  1  4       101   010   001   100
 2  2  1  5       010   010   001   101
 7  0  1  4       111   000   001   100
 7  0  1  4       111   000   001   100

For example, the compression of R11 = 0 0 0 0 1 0 1 1 into P11 goes as follows. The top-down construction of the 1-dimensional P-tree representation of R11, denoted P11, records the truth of the universal predicate "pure 1" in a tree, recursively on halves, until purity is achieved:

1. Whole is pure1? false = 0
2. Left half pure1? false = 0 (but it is pure, pure0, so this branch ends)
3. Right half pure1? false = 0
4. Left half of right half pure1? false = 0
5. Right half of right half pure1? true = 1 (and it is pure, so this branch ends)
6. Left half of left half of right half pure1? true = 1
7. Right half of left half of right half pure1? false = 0

P11, level by level:
0
0 0
0 1
1 0

[Figure: the twelve basic P-trees P11, P12, P13, P21, P22, P23, P31, P32, P33, P41, P42, P43.]

To count occurrences of the tuple (7, 0, 1, 4), use the pure-1 pattern 111 000 001 100, where P' denotes the complement of P:

P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43

The resulting tree is:
0      (2^3 level)
0 0    (2^2 level)
0 1    (2^1 level)

The single 1 at the 2^1 level gives a count of 1 × 2^1 = 2.
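A minimal Python sketch of the construction described on this slide, assuming a nested-list encoding that is not part of the original slides: a pure-1 half is stored as 1, a pure-0 half as 0, and a mixed half as a [left, right] pair (the slides instead record 0 for every half that is not pure-1). The names ROWS, bit_slice, build_ptree and root_count are illustrative, and the rows are the binary records from the example table.

```python
from typing import List, Union

# Binary rows of the example table R(A1, A2, A3, A4), three bits per attribute.
ROWS = [
    "010111110001",
    "011111110000",
    "010110101001",
    "010111101111",
    "101010001100",
    "010010001101",
    "111000001100",
    "111000001100",
]

# A node is 1 (pure-1), 0 (pure-0), or a [left, right] list for a mixed half.
Node = Union[int, list]


def bit_slice(rows: List[str], position: int) -> List[int]:
    """Vertically project one bit position (0-based) across all rows."""
    return [int(row[position]) for row in rows]


def build_ptree(bits: List[int]) -> Node:
    """Recursively halve the bit slice until every piece is pure."""
    if all(b == 1 for b in bits):
        return 1
    if all(b == 0 for b in bits):
        return 0
    mid = len(bits) // 2
    return [build_ptree(bits[:mid]), build_ptree(bits[mid:])]


def root_count(node: Node, size: int) -> int:
    """Count the 1 bits represented by a (sub)tree covering `size` positions."""
    if isinstance(node, int):
        return node * size
    left, right = node
    return root_count(left, size // 2) + root_count(right, size // 2)


R11 = bit_slice(ROWS, 0)   # first bit of attribute A1
P11 = build_ptree(R11)

print(R11)                  # [0, 0, 0, 0, 1, 0, 1, 1]
print(P11)                  # [0, [[1, 0], 1]]
print(root_count(P11, 8))   # 3
```

The pure-0 left half collapses to a single 0, which is where the compression comes from; the count of 1 bits is recovered from the tree without expanding it.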

2-D P-tree Data Structure

1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1

Quadrant tree (1 = pure-1, 0 = pure-0, m = mixed), level by level:
m
1 m m 1
m 0 1 m   1 1 m 1
1 1 1 0   0 0 1 0   1 1 0 1

• Provides an efficient format for ANDing, ORing and complementing.
• Lossless, compressed, count-computation-ready representations.
• Peano or Z-ordering
• Pure (Pure-1/Pure-0/Mixed) quadrants
• Root Count (count of 1s in the tree)
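The quadrant decomposition can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DataMIME™ implementation: quadrants are visited in the order upper-left, upper-right, lower-left, lower-right (a Z-ordering), a pure quadrant is stored as 1 or 0, a mixed quadrant as a four-element list, and the names GRID, build_quadtree and root_count are hypothetical.

```python
from typing import List, Union

# The 8 x 8 bit grid from the slide.
GRID = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]

# A node is 1 (pure-1), 0 (pure-0), or a list of four child quadrants (mixed).
Node = Union[int, list]


def build_quadtree(grid: List[List[int]], top: int, left: int, size: int) -> Node:
    """Split a square region into quadrants until each quadrant is pure."""
    cells = [grid[r][c]
             for r in range(top, top + size)
             for c in range(left, left + size)]
    if all(cells):
        return 1
    if not any(cells):
        return 0
    half = size // 2
    return [
        build_quadtree(grid, top, left, half),                # upper-left
        build_quadtree(grid, top, left + half, half),         # upper-right
        build_quadtree(grid, top + half, left, half),         # lower-left
        build_quadtree(grid, top + half, left + half, half),  # lower-right
    ]


def root_count(node: Node, size: int) -> int:
    """Count the 1 bits under a node that covers a size x size region."""
    if isinstance(node, int):
        return node * size * size
    return sum(root_count(child, size // 2) for child in node)


tree = build_quadtree(GRID, 0, 0, len(GRID))
print(tree)
print(root_count(tree, len(GRID)))  # 55
```

The nested lists mirror the m / 1 / 0 labels of the slide: every inner list is an "m" node, and the root count is obtained from the tree alone.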

DCI/DII (Data Capture and Data Integration Interface) layer
• Allows the user to capture data and integrate it into the required format (P-tree format).
• The main component of this layer is the feeder.
• An individual feeder processes one particular format of incoming data.
• Users can write their own feeders and plug them into this architecture easily.

[Figure: DCI/DII Layer with an already plugged feeder and room for a new feeder.]
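One way to picture the feeder plug-in idea is a registry keyed by input format. The sketch below is hypothetical: the slides do not describe the feeder API, so register_feeder, FEEDERS, capture and the two sample feeders are all assumed names, and the output is a plain list of bit slices rather than compressed P-trees.

```python
from typing import Callable, Dict, List

# A feeder turns raw text in one input format into rows of small integers.
Feeder = Callable[[str], List[List[int]]]

FEEDERS: Dict[str, Feeder] = {}


def register_feeder(fmt: str) -> Callable[[Feeder], Feeder]:
    """Decorator that plugs a feeder into the registry under a format name."""
    def decorator(func: Feeder) -> Feeder:
        FEEDERS[fmt] = func
        return func
    return decorator


@register_feeder("csv")
def csv_feeder(raw: str) -> List[List[int]]:
    return [[int(v) for v in line.split(",")]
            for line in raw.strip().splitlines()]


@register_feeder("whitespace")
def whitespace_feeder(raw: str) -> List[List[int]]:
    return [[int(v) for v in line.split()]
            for line in raw.strip().splitlines()]


def capture(fmt: str, raw: str, bits: int = 3) -> List[List[int]]:
    """Parse with the matching feeder, then emit one vertical bit slice per
    bit position of each attribute (most significant bit first)."""
    records = FEEDERS[fmt](raw)
    slices = []
    for col in range(len(records[0])):
        for b in range(bits - 1, -1, -1):
            slices.append([(rec[col] >> b) & 1 for rec in records])
    return slices


print(capture("csv", "2,7\n3,7\n"))
# [[0, 0], [1, 1], [0, 1], [1, 1], [1, 1], [1, 1]]
```

Supporting a new input format then means adding one decorated function; the capture path itself does not change.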

DMI (Data Mining Interface) layer
• DMI does counting, the most important operation for data mining, provided by P-trees, including:
  • basic P-trees
  • value P-trees
  • tuple P-trees
  • interval P-trees
  • cube P-trees
• DMI also provides the P-tree algebra, which has four operations:
  • AND
  • OR
  • NOT (complement)
  • XOR
  to implement the pointwise logical operations on P-trees for the Data Mining Algorithms (DMA) layer.
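The counting and algebra operations can be illustrated on uncompressed bit slices. The sketch below is a simplification, assuming plain Python lists in place of compressed P-trees; the function names (p_and, p_or, p_not, p_xor, tuple_ptree, root_count) are illustrative and not the DMI API. The data are the binary records of the example table used for the 1-D P-tree construction.

```python
from typing import List

# Binary rows of the example table R(A1, A2, A3, A4), three bits per attribute.
ROWS = [
    "010111110001",
    "011111110000",
    "010110101001",
    "010111101111",
    "101010001100",
    "010010001101",
    "111000001100",
    "111000001100",
]

# One vertical bit slice per bit position: R11, R12, R13, R21, ..., R43.
SLICES = [[int(row[i]) for row in ROWS] for i in range(12)]


def p_and(a: List[int], b: List[int]) -> List[int]:
    return [x & y for x, y in zip(a, b)]


def p_or(a: List[int], b: List[int]) -> List[int]:
    return [x | y for x, y in zip(a, b)]


def p_xor(a: List[int], b: List[int]) -> List[int]:
    return [x ^ y for x, y in zip(a, b)]


def p_not(a: List[int]) -> List[int]:
    return [1 - x for x in a]


def root_count(a: List[int]) -> int:
    """Number of 1 bits, i.e. the number of matching records."""
    return sum(a)


def tuple_ptree(slices: List[List[int]], pattern: str) -> List[int]:
    """AND each slice (where the pattern bit is 1) or its complement
    (where the pattern bit is 0) to mark records matching the pattern."""
    result = [1] * len(slices[0])
    for bits, wanted in zip(slices, pattern):
        result = p_and(result, bits if wanted == "1" else p_not(bits))
    return result


# Tuple P-tree for (7, 0, 1, 4) = 111 000 001 100.
print(root_count(tuple_ptree(SLICES, "111000001100")))  # 2

# Value P-tree for A1 = 2 (010), using only the three A1 slices.
print(root_count(tuple_ptree(SLICES[:3], "010")))       # 4
```

The same AND-of-slices-and-complements pattern covers both value P-trees (one attribute) and tuple P-trees (all attributes); only the set of slices and the bit pattern differ.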

Distributed Ptree Management Interface (DPMI) Layer
• The DPMI layer provides access, location, and concurrency transparency by hiding the fact that:
  • data representation may differ
  • resources may be located in different places
  • resources may be shared by several competing users.
• By resource we mean data and its converted form, the P-tree.

DMA (Data Mining Algorithms) layer
• This layer is a collection of data mining tools (algorithms).
• Upon receiving a request from the client side, an algorithm is started for mining.
• This layer depends on the DMI for accessing meta-information and the counts required by algorithms such as:
  • P-tree based K Nearest Neighbor (PKNN)
  • Podium Incremental Neighbor Evaluator (PINE)
  • P-BAYESIAN
  • etc.
• The architecture has the flexibility to plug in any new algorithm on this layer.
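To show what "depending on the DMI for counts" can look like, here is a small count-driven naive Bayes classifier in Python. It is a generic textbook technique, not the authors' P-BAYESIAN algorithm, and it applies no smoothing. The count function stands in for the DMI; make_counter, naive_bayes and the sample records are hypothetical.

```python
from typing import Callable, Dict, List

Condition = Dict[str, int]             # attribute name -> required value
CountFn = Callable[[Condition], int]   # stand-in for a DMI counting call


def make_counter(records: List[Dict[str, int]]) -> CountFn:
    """Build a counting function over in-memory records (DMI stand-in)."""
    def count(conditions: Condition) -> int:
        return sum(
            all(rec[attr] == value for attr, value in conditions.items())
            for rec in records
        )
    return count


def naive_bayes(count: CountFn, class_attr: str,
                classes: List[int], sample: Condition) -> int:
    """Pick the class maximizing P(class) * product of P(attr=value | class),
    using only counts; the algorithm never touches the records themselves."""
    total = count({})
    best_class, best_score = classes[0], -1.0
    for c in classes:
        class_count = count({class_attr: c})
        if class_count == 0:
            continue
        score = class_count / total
        for attr, value in sample.items():
            score *= count({class_attr: c, attr: value}) / class_count
        if score > best_score:
            best_class, best_score = c, score
    return best_class


records = [
    {"a": 1, "label": 0},
    {"a": 1, "label": 0},
    {"a": 0, "label": 1},
    {"a": 0, "label": 1},
    {"a": 1, "label": 1},
]
count = make_counter(records)
print(naive_bayes(count, "label", [0, 1], {"a": 1}))  # 0
print(naive_bayes(count, "label", [0, 1], {"a": 0}))  # 1
```

Because the algorithm only asks for counts, the counting function could be backed by P-tree root counts without changing the classifier.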

Communication
• The communication between different layers is designed to minimize the data flow over the network.
• In the DCI and the DMA communication protocols, a client creates a connection, sends a request, receives a response, and closes the connection. A client sends only one request in a single-threaded connection. The response to a request is a line with a message indicating the outcome of the request.
• A DMA protocol request has a similar structure: a header and an optional set of binary files with checksums. The header in the DMA protocol is a set of key/value pairs (properties). The response to a DMA protocol request also contains key/value pairs.
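The request structure (key/value header plus optional binary files with checksums) can be sketched as follows. The wire format here is entirely assumed, since the slides do not specify one: "key=value" header lines, one "file=name;size;sha256" line per attachment, a blank line, then the raw file bytes. The property names in the usage example are likewise made up.

```python
import hashlib
from typing import Dict, List, Tuple

FileList = List[Tuple[str, bytes]]


def encode_request(properties: Dict[str, str], files: FileList) -> bytes:
    """Serialize properties and attached files into one message
    (hypothetical format; file names must not contain ';')."""
    lines = [f"{key}={value}" for key, value in properties.items()]
    for name, payload in files:
        digest = hashlib.sha256(payload).hexdigest()
        lines.append(f"file={name};{len(payload)};{digest}")
    header = "\n".join(lines).encode("utf-8") + b"\n\n"
    return header + b"".join(payload for _, payload in files)


def decode_request(message: bytes) -> Tuple[Dict[str, str], FileList]:
    """Parse a message and verify each file against its checksum."""
    header, _, body = message.partition(b"\n\n")
    properties: Dict[str, str] = {}
    files: FileList = []
    offset = 0
    for line in header.decode("utf-8").split("\n"):
        if not line:
            continue
        key, _, value = line.partition("=")
        if key == "file":
            name, size, digest = value.split(";")
            payload = body[offset:offset + int(size)]
            offset += int(size)
            if hashlib.sha256(payload).hexdigest() != digest:
                raise ValueError(f"checksum mismatch for {name}")
            files.append((name, payload))
        else:
            properties[key] = value
    return properties, files


message = encode_request(
    {"command": "mine", "algorithm": "PKNN"},
    [("sample.bin", b"\x00\x01\n\n\x02")],
)
properties, files = decode_request(message)
print(properties)  # {'command': 'mine', 'algorithm': 'PKNN'}
print(files)       # [('sample.bin', b'\x00\x01\n\n\x02')]
```

Sending the size and checksum in the header lets the receiver reject a corrupted transfer before any mining work starts, which fits the one-request-per-connection pattern described above.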

Client Structure
The two main functionalities are:
• Capture: sends datasets along with their meta-information (description of the data) to the DCI/DII layer of the server for capturing.
[Figure: client-side DCI. Labels: Data, Meta-data generator, MetaData, DCI.]
• Mining: sends requests to the DMA layer to apply data mining applications to previously captured datasets, and presents the results.
[Figure: client-side DMA. Labels: Unclassified data, DMA, Prediction Model, Visualization Tool.]

Client and GUI
• Data Capturing
• Data Mining

On the client side, DataMIME™ has a graphical user interface (GUI) for visual interaction with the user (http://midas.cs.ndsu.nodak.edu/~datasurg/datamime).

System Characteristics
• Ability to handle formatted record-based, relational-like data with numerical and/or categorical attributes. The data can be in text format, relational format, or TIFF image format.
• Easy conversion from any other machine-readable format can be provided through customized data feeders.
• Users can do any data analysis and mining on data sets in the system, or on any new data they capture or integrate into the system.
• Capable of handling large quantities of data and mining them in scalable time.
• Clients of the system can run on UNIX and Microsoft Windows platforms, with the server designed to be a UNIX-based system.

System Characteristics (cont.)
• Supports major RDBMS platforms.
• The server engine can be run on a single machine or distributed across multiple computers for better scalability and efficiency.
• The system has an open architecture that provides a high degree of software extensibility and integration capabilities.
• The system provides a high level of asynchronous background operation, performing most data-intensive operations in the background or offline and allowing users to continue their work.
• The system minimizes the flow of data across the network.

Conclusion
• We have shown the importance of having a layered architecture for a distributed data mining system.
• Key elements were identified in deciding on the different layers.
• We were able to identify a unique, efficient vertical data structure at the lowest layer that can take advantage of the latest hardware.
• To facilitate data distribution, a management layer is also recognized.
• Two other layers are defined: the data capture layer and the data mining layer.
• A prototype system was developed as a proof of concept to show the feasibility of the approach.
