MULTI-LAYERED SOFTWARE SYSTEM FRAMEWORK FOR
DISTRIBUTED DATA MINING
Masum Serazi, Amal Perera, Qiang Ding, Vasiliy Malakhov,
William Perrizo
North Dakota State University
Computer Science Department
Outline
Introduction to Distributed Data Mining
• Demands
• Existing Projects
• Architecture
Importance of a Layered Architecture
A Prototype System
• System Architecture (layered)
• Server
• Communication
• Client & GUI
• System Characteristics
Conclusion
Demands of distributed data mining
Large dataset size
Diversity of data
Geographic distribution of users and resources
Computationally intensive result generation
Large scale distributed data mining projects
Kensington project
• Mining enterprise data distributed across the internet.
Papyrus project
• Based on mobile agents implemented using Java.
PaDDMAS
• A component-based tool set that integrates pre-developed or custom packages.
JAM
• Agent-based distributed system developed to mine data stored at different sites.
BODHI
• Collective data mining with stress on learning from vertically partitioned data.
Architecture
Client-Server
• Advantage:
  • Able to use high performance computing on the server side to do the data mining.
Agent based
Hybrid
Importance of a Layered Architecture
A layered framework helps to manage complexity.
Provides the flexibility to add, remove, or modify layers and the components of a layer.
Allows better tracking of progress on large, complex projects.
Human input is required to tune the data and the algorithms to suit the need (the mix of greyware versus software can be changed over time).
System Architecture
DataMIME™ developed as proof-of-concept.
Based on patent pending “P-tree technology”.
Efficient and scalable system.
Flexible plug-ins.
[Figure: Conceptual view of the system. Client side and server side connected over the Internet. Labels: Capture dataset to DataMIME™; Integrate data (synchronize to existing); Mine on DataMIME™; System performance analysis; Master Server; One of the Slave Servers.]
Server Architecture
Data capture and integration layer (DCI/DII)
Data mining interface (DMI)
Distributed Ptree Management Interface (DPMI)
• Uniform data structure
Data mining algorithms (DMA)
Client-server communication
Client interface
[Figure: Server layers. DCI/DII Layer (room for new feeder); DMA Layer (already plugged algorithm, plugs for new algorithms); DMI Layer; DPMI: Distributed Ptree Management Interface; Distributed Ptree database.]
The Distributed P-tree Database (DPD)
The DPD collects all data in vertical format (as opposed to the ubiquitous horizontal, record-based data structure used in DBMSs), as Predicate-trees (P-trees) based on the patent pending P-tree technology.
P-trees can be 0-dimensional, 1-dimensional, 2-dimensional, etc.
The next slide shows the detailed construction of 1-D P-trees from a generic horizontal table of data.
Current practice: structure data into horizontal records, then process vertically (scans).
Predicate tree technology: vertically project each attribute, then vertically project each bit position of each attribute, then compress each bit slice into a basic Ptree.
Horizontally structured records R(A1 A2 A3 A4), scanned vertically, with each attribute value written in 3 bits:
A1 A2 A3 A4  =  R[A1] R[A2] R[A3] R[A4]
2  7  6  1   =  010   111   110   001
3  7  6  0   =  011   111   110   000
2  6  5  1   =  010   110   101   001
2  7  5  7   =  010   111   101   111
5  2  1  4   =  101   010   001   100
2  2  1  5   =  010   010   001   101
7  0  1  4   =  111   000   001   100
7  0  1  4   =  111   000   001   100
The bit slices R11 R12 R13 R21 R22 R23 R31 R32 R33 R41 R42 R43 are the twelve columns of this bit table. For example, R11 = 0 0 0 0 1 0 1 1.
Top-down construction of the 1-dimensional Ptree representation of R11, denoted P11: record the truth of the universal predicate “pure 1” in a tree, recursively on halves, until purity is achieved. E.g., compression of R11 into P11 goes as follows:
1. Whole is pure1? false: 0
2. Left half pure1? false: 0. But it is pure (pure0), so this branch ends.
3. Right half pure1? false: 0
4. Left half of right half pure1? false: 0
5. Right half of right half pure1? true: 1. And it is pure, so this branch ends.
6. Left half of left half of right half pure1? true: 1
7. Right half of left half of right half pure1? false: 0
Horizontally AND basic Ptrees: to count occurrences of the tuple (7, 0, 1, 4), use the bit pattern 111 000 001 100, complementing (') each Ptree whose pattern bit is 0:
P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43
The resulting tree has 0 at the 2³ level, 0 0 at the 2² level, and 0 1 at the 2¹ level, giving a count of 2.
[Figure: tree diagrams of the basic Ptrees P11 through P43 and of their AND.]
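The construction and the AND-based count can be sketched in Python. This is a minimal illustration, not the DataMIME™ implementation: the names Node, build, complement, ptree_and, root_count and count_tuple are hypothetical, and the rows are the decimal values of the 3-bit binary table. Each node stores the answer to "pure1?" and has children only when its segment is mixed.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple

# Example table: decimal values of the 3-bit binary columns (A1, A2, A3, A4).
ROWS = [(2, 7, 6, 1), (3, 7, 6, 0), (2, 6, 5, 1), (2, 7, 5, 7),
        (5, 2, 1, 4), (2, 2, 1, 5), (7, 0, 1, 4), (7, 0, 1, 4)]
BITS = 3  # bits per attribute (assumption matching the example)


@dataclass
class Node:
    pure1: int                      # 1 if the whole segment is 1s, else 0
    size: int                       # number of rows the node covers
    left: Optional["Node"] = None   # children exist only for mixed segments
    right: Optional["Node"] = None


def bit_slice(rows: Sequence[Tuple[int, ...]], attr: int, bit: int) -> List[int]:
    """Vertical projection of bit `bit` (1 = most significant) of attribute `attr`."""
    return [(row[attr - 1] >> (BITS - bit)) & 1 for row in rows]


def build(bits: List[int]) -> Node:
    """Top-down construction: test 'pure1', then split into halves until pure."""
    if all(bits):
        return Node(1, len(bits))
    if not any(bits):
        return Node(0, len(bits))   # pure0: the branch ends here
    mid = len(bits) // 2
    return Node(0, len(bits), build(bits[:mid]), build(bits[mid:]))


def complement(n: Node) -> Node:
    """Flip pure leaves; a mixed node stays mixed."""
    if n.left is None:
        return Node(1 - n.pure1, n.size)
    return Node(0, n.size, complement(n.left), complement(n.right))


def ptree_and(a: Node, b: Node) -> Node:
    if a.left is None:              # a is pure: pure1 yields b, pure0 yields a
        return b if a.pure1 else a
    if b.left is None:
        return a if b.pure1 else b
    left, right = ptree_and(a.left, b.left), ptree_and(a.right, b.right)
    if left.left is None and right.left is None and left.pure1 == right.pure1 == 0:
        return Node(0, a.size)      # both halves pure0: collapse into one leaf
    return Node(0, a.size, left, right)


def root_count(n: Node) -> int:
    """Number of 1 bits represented by the tree."""
    if n.left is None:
        return n.pure1 * n.size
    return root_count(n.left) + root_count(n.right)


def count_tuple(rows: Sequence[Tuple[int, ...]], values: Tuple[int, ...]) -> int:
    """AND the basic Ptrees, complementing those whose pattern bit is 0."""
    result = None
    for attr, value in enumerate(values, start=1):
        for bit in range(1, BITS + 1):
            tree = build(bit_slice(rows, attr, bit))
            if not (value >> (BITS - bit)) & 1:
                tree = complement(tree)
            result = tree if result is None else ptree_and(result, tree)
    return root_count(result)


p11 = build(bit_slice(ROWS, 1, 1))
print(root_count(p11))                   # 1 bits in R11
print(count_tuple(ROWS, (7, 0, 1, 4)))   # occurrences of the tuple (7, 0, 1, 4)
```

The count is read from the compressed trees alone; the horizontal records are never rescanned once the bit slices have been built.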
2-D P-tree Data Structure
Provides an efficient format for ANDing, ORing and Complementing.
Lossless, compressed, count-computation-ready representations.
Peano or Z-ordering
Pure (Pure-1/Pure-0/Mixed) quadrants
Root Count (count of 1s in the tree)
Example 8 × 8 bit image:
1 1 1 1 1 1 0 0
1 1 1 1 1 0 0 0
1 1 1 1 1 1 0 0
1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1
0 1 1 1 1 1 1 1
Its P-tree, level by level (1 = pure-1 quadrant, 0 = pure-0 quadrant, m = mixed quadrant):
m
1 m m 1
m 0 1 m   1 1 m 1
1 1 1 0   0 0 1 0   1 1 0 1
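The quadrant decomposition can be sketched in Python as follows. The function names are hypothetical and the nested-list representation is an assumption made for illustration. Quadrants are visited in Peano (Z) order: upper-left, upper-right, lower-left, lower-right.

```python
IMAGE = [
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 1, 1, 1, 1, 1, 1],
]


def build_2d(img, r0=0, c0=0, size=None):
    """Return 1 (pure-1), 0 (pure-0), or a list of four subtrees (mixed quadrant)."""
    if size is None:
        size = len(img)             # assumes a square image, side a power of two
    cells = [img[r][c] for r in range(r0, r0 + size) for c in range(c0, c0 + size)]
    if all(cells):
        return 1
    if not any(cells):
        return 0
    h = size // 2
    return [build_2d(img, r0, c0, h), build_2d(img, r0, c0 + h, h),
            build_2d(img, r0 + h, c0, h), build_2d(img, r0 + h, c0 + h, h)]


def root_count_2d(tree, size):
    """Count of 1s, computed from the tree without touching the raw image."""
    if isinstance(tree, list):
        return sum(root_count_2d(child, size // 2) for child in tree)
    return tree * size * size


def levels(tree):
    """Level-by-level symbols: 1, 0, or m for a mixed quadrant."""
    out, frontier = [], [tree]
    while frontier:
        out.append(" ".join("m" if isinstance(t, list) else str(t) for t in frontier))
        frontier = [child for t in frontier if isinstance(t, list) for child in t]
    return out


tree = build_2d(IMAGE)
for line in levels(tree):
    print(line)
print(root_count_2d(tree, len(IMAGE)))
```

Because pure quadrants stop the recursion, large uniform regions cost a single node, which is where the compression comes from.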
DCI/DII (Data Capture and Data Integration Interface) layer
Allows the user to capture data and integrate it into the required format (P-tree format).
The main component of this layer is the feeder.
An individual feeder processes one particular format of incoming data.
Users can write their own feeder and plug it into this architecture easily.
[Figure: DCI/DII Layer, with an already plugged feeder and room for a new feeder.]
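One way such a feeder plug-in point could look is sketched below in Python. Everything here is an assumption made for illustration: the slides do not describe the feeder API, so the registry FEEDERS, the decorator register_feeder, the two sample feeders, and capture are hypothetical names. Each feeder parses one input format into integer rows, and the layer then converts the rows into vertical bit slices.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[int, ...]
Feeder = Callable[[str], List[Row]]

FEEDERS: Dict[str, Feeder] = {}   # hypothetical registry: format name -> feeder


def register_feeder(fmt: str) -> Callable[[Feeder], Feeder]:
    """Plug a feeder for one input format into the registry."""
    def plug(feeder: Feeder) -> Feeder:
        FEEDERS[fmt] = feeder
        return feeder
    return plug


@register_feeder("csv")
def csv_feeder(payload: str) -> List[Row]:
    """Rows such as '2,7,6,1'."""
    return [tuple(int(v) for v in line.split(","))
            for line in payload.strip().splitlines()]


@register_feeder("binary-text")
def binary_text_feeder(payload: str) -> List[Row]:
    """Rows such as '010 111 110 001'."""
    return [tuple(int(v, 2) for v in line.split())
            for line in payload.strip().splitlines()]


def capture(fmt: str, payload: str, bits: int) -> Dict[Tuple[int, int], List[int]]:
    """Parse with the plugged feeder, then project into vertical bit slices.

    Keys are (attribute, bit position), both 1-based, bit 1 most significant.
    """
    rows = FEEDERS[fmt](payload)
    return {(a + 1, b + 1): [(row[a] >> (bits - 1 - b)) & 1 for row in rows]
            for a in range(len(rows[0])) for b in range(bits)}


slices = capture("csv", "2,7,6,1\n3,7,6,0", bits=3)
print(slices[(1, 3)])   # least significant bit of A1 for both rows
```

Supporting a new format then means registering one more function; the conversion to the vertical format stays untouched.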
DMI (Data Mining Interface) layer
DMI does counting, the most important operation for data mining, using the P-trees it provides, including:
• basic P-trees
• value P-trees
• tuple P-trees
• interval P-trees
• cube P-trees
DMI also provides the P-tree algebra, which has four operations:
• AND
• OR
• NOT (complement)
• XOR
These implement the pointwise logical operations on P-trees for the Data Mining Algorithms (DMA) layer.
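The pointwise meaning of the four operations can be illustrated on uncompressed bit slices. This Python sketch uses hypothetical function names and plain lists rather than compressed trees. It also assumes that a value P-tree is formed by ANDing the basic P-trees of one attribute, complemented where the value's bit is 0, in the same way as the tuple pattern in the construction example.

```python
from typing import List

Bits = List[int]


def p_and(a: Bits, b: Bits) -> Bits:
    return [x & y for x, y in zip(a, b)]


def p_or(a: Bits, b: Bits) -> Bits:
    return [x | y for x, y in zip(a, b)]


def p_xor(a: Bits, b: Bits) -> Bits:
    return [x ^ y for x, y in zip(a, b)]


def p_not(a: Bits) -> Bits:
    return [1 - x for x in a]


def root_count(a: Bits) -> int:
    """Count of 1 bits, the quantity the mining algorithms ask for."""
    return sum(a)


# Bit slices of attribute A1 from the 8-row example (values 2, 3, 2, 2, 5, 2, 7, 7).
R11 = [0, 0, 0, 0, 1, 0, 1, 1]
R12 = [1, 1, 1, 1, 0, 1, 1, 1]
R13 = [0, 1, 0, 0, 1, 0, 1, 1]

# Value P-tree for A1 = 7 (bits 111) and for A1 = 2 (bits 010).
a1_is_7 = p_and(p_and(R11, R12), R13)
a1_is_2 = p_and(p_and(p_not(R11), R12), p_not(R13))

print(root_count(a1_is_7))   # rows with A1 = 7
print(root_count(a1_is_2))   # rows with A1 = 2
```

On compressed P-trees the same operations can skip whole pure subtrees, but the results agree with this pointwise definition.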
Distributed Ptree Management Interface (DPMI) Layer
The DPMI layer provides:
• access transparency
• location transparency
• concurrency transparency
by hiding the fact that:
• data representation may differ
• resources may be located in different places
• resources may be shared by several competing users.
By resource we mean data and its converted form, the Ptree.
DMA (Data Mining Algorithms) layer
This layer is a collection of data mining tools (algorithms).
Upon receiving a request from the client side, an algorithm is started for mining.
This layer depends on the DMI for accessing the meta-information and counts needed in:
• Ptree-based K Nearest Neighbor (PKNN)
• Podium Incremental Neighbor Evaluator (PINE)
• P-BAYESIAN
• Etc.
The architecture has the flexibility to plug in any new algorithm on this layer.
Communication
The communication between different layers is designed to minimize the data flow over the network.
In the DCI and the DMA communication protocols, a client creates a connection, sends a request, receives a response, and closes the connection. A client sends only one request per single-threaded connection. The response to a request is a line with a message indicating the outcome of the request.
A DMA protocol request has a similar structure: a header and an optional set of binary files with checksums. The header in the DMA protocol is a set of key/value pairs (properties). The response to a DMA protocol request also contains key/value pairs.
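The request structure (key/value header plus optional binary files with checksums) can be sketched in Python. The slides do not give the wire format, so everything concrete here is an assumption: the "key=value" header lines, the blank-line separator, the per-file "name length checksum" line, the use of CRC32 as the checksum, and the function and property names.

```python
import zlib
from typing import Dict, List, Tuple

File = Tuple[str, bytes]   # (file name without spaces, raw content)


def encode_request(props: Dict[str, str], files: List[File]) -> bytes:
    """Header of key=value lines, a blank line, then each file with its checksum."""
    head = "\n".join(f"{k}={v}" for k, v in props.items()).encode("utf-8")
    body = b""
    for name, data in files:
        # Assumed per-file line: name, byte length, CRC32 checksum.
        body += f"{name} {len(data)} {zlib.crc32(data)}\n".encode("utf-8") + data
    return head + b"\n\n" + body


def decode_request(raw: bytes) -> Tuple[Dict[str, str], List[File]]:
    """Parse the header and verify every file against its checksum."""
    head, _, rest = raw.partition(b"\n\n")
    props = dict(line.split("=", 1) for line in head.decode("utf-8").splitlines())
    files: List[File] = []
    while rest:
        meta, _, rest = rest.partition(b"\n")
        name, length, crc = meta.decode("utf-8").split(" ")
        data, rest = rest[:int(length)], rest[int(length):]
        if zlib.crc32(data) != int(crc):
            raise ValueError(f"checksum mismatch for {name}")
        files.append((name, data))
    return props, files


# Hypothetical request: property names and file name are illustrative only.
raw = encode_request({"algorithm": "PKNN", "dataset": "example"},
                     [("unclassified.bin", b"\x00\x01\x02\n\n\x03")])
print(decode_request(raw))
```

Sending the whole request in one message matches the one-request-per-connection pattern: the client connects, writes these bytes, reads the response, and closes the connection.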
Client Structure
The two main functionalities are:
• Capture: sends datasets along with their meta-information (description of the data) to the DII/DCI layer of the server for capturing.
• Mining: sends requests to the DMA layer to apply data mining applications to previously captured datasets, and presents the results.
[Figure: Client Side DCI (Data, Meta-data generator; MetaData and Data sent to the DCI) and Client side DMA (Unclassified data, Prediction Model, Visualization Tool; exchanged with the DMA).]
Client and GUI
Data Capturing
Data Mining
On the client side, DataMIME™ has a graphical user interface (GUI) for interacting visually with the user (http://midas.cs.ndsu.nodak.edu/~datasurg/datamime).
System Characteristics
Ability to handle formatted record-based, relational-like data with numerical and/or categorical attributes. The data could be in text format, relational format, or TIFF image format.
Easy conversion from any other machine-readable format can be provided through customized data feeders.
Users can do any data analysis and mining on data sets in the system, or on any new data they capture or integrate into the system.
Capable of handling large quantities of data and mining them in scalable time.
Clients of the system can run on UNIX and Microsoft Windows platforms, with the server designed to be a UNIX-based system.
System Characteristics (cont.)
Supports major RDBMS platforms.
The server engine can be run on a single machine or distributed across multiple computers for better scalability and efficiency.
The system has an open architecture that provides a high degree of software extensibility and integration capabilities.
The system provides a high level of asynchronous background operation, performing most data-intensive operations in the background or offline and allowing users to continue their work.
The system minimizes the flow of data across the network.
Conclusion
We have shown the importance of having a layered architecture for a distributed data mining system.
Key elements were identified in deciding on the different layers.
We identified a unique, efficient vertical data structure at the lowest layer that can take advantage of the latest hardware.
To facilitate data distribution, a management layer is also recognized.
Two other layers are defined: the data capture layer and the data mining layer.
A prototype system was developed as a proof-of-concept to show the feasibility of the approach.