
Page 1: dm_impl

DATA MINING TECHNOLOGIES AND IMPLEMENTATIONS

• The basic data mining algorithms introduced may be enhanced in a number of ways.

• Basic data mining algorithms have traditionally assumed data is memory resident, as for example in the case of the Apriori algorithm for association rule mining or basic clustering algorithms.

• Hence, techniques to ensure efficient data mining algorithms with large data sets have been developed.

• Another problem is that once a data mining model has been developed, there has traditionally been no mechanism for that model to be reused programmatically by other applications on other data sets.

• Hence, standards for data mining model exchange have been developed.

Page 2: dm_impl

• This trend has been accelerated as interoperability issues become of increasing importance to enable the deployment of cloud computing data mining applications.

• Finally, even for data which is held in a database or data warehouse, data mining has traditionally been performed by dumping data from the database to an external file, which is then transformed and mined.

• This results in a series of files for each data mining application, with the resulting problems of data redundancy, inconsistency and data dependence, which database technology was designed to overcome.

• Hence, techniques and standards for tighter integration of database and data mining technology have been developed.

Page 3: dm_impl


DATA MINING OF LARGE DATA SETS

• Algorithms for classification, clustering, and association rule mining are considered.

CLASSIFYING LARGE DATA SETS: SUPPORT VECTOR MACHINES

• To reduce the computational cost of solving the SVM optimization problem with large training sets, chunking is used. This partitions the training set into “chunks”, each of which fits into memory. The support vector parameters are computed iteratively for each chunk. However, multiple passes over the data are required to obtain an optimal solution (see the sketch after this list).

• Another approach is to use squashing. In this, the SVM is trained over clusters derived from the original training set, with the clusters reflecting the distribution of the original training records.

• A further approach reformulates the optimization problem to allow solution by efficient iterative algorithms.
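As an illustration of the chunking idea, here is a minimal, hypothetical Python sketch (using scikit-learn's SVC; the chunk size, kernel and number of passes are invented for illustration, and each chunk is assumed to contain examples of every class):

import numpy as np
from sklearn.svm import SVC

def chunked_svm(X, y, chunk_size=1000, n_passes=3):
    """Train an SVM chunk by chunk, carrying support vectors forward."""
    model = SVC(kernel="rbf")
    sv_X = np.empty((0, X.shape[1]))
    sv_y = np.empty(0)
    for _ in range(n_passes):  # multiple passes over the data are needed
        for start in range(0, len(X), chunk_size):
            # Each subproblem = current chunk + support vectors found so far,
            # so every subproblem still fits in memory.
            chunk_X = np.vstack([sv_X, X[start:start + chunk_size]])
            chunk_y = np.concatenate([sv_y, y[start:start + chunk_size]])
            model.fit(chunk_X, chunk_y)
            sv_X = model.support_vectors_   # keep only the support vectors
            sv_y = chunk_y[model.support_]
    return model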

Page 4: dm_impl


CLUSTERING LARGE DATA SETS: K-MEANS

• Unless there is sufficient main memory to hold the data being clustered, the data scan at each iteration of the K-means algorithm will be very costly.

• An approach for large databases would:

• Perform at most one scan of the database.

• Work with limited memory.

• Approaches include the following:

• Identify three kinds of data objects:

• those which are discardable because membership of a cluster has been established;

• those which are compressible, which, while not discardable, belong to a well-defined subcluster which can be characterized in a compact structure;

• those which are neither discardable nor compressible, which must be retained in main memory.

• Alternatively, first group data objects into microclusters and then perform k-means clustering on those microclusters (a minimal sketch of this two-phase approach follows).
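The following hypothetical Python sketch illustrates the two-phase microcluster idea (the package choice and all parameters are invented for illustration; MiniBatchKMeans stands in for a streaming microclustering step):

import numpy as np
from sklearn.cluster import KMeans, MiniBatchKMeans

def microcluster_kmeans(X, k=5, n_micro=200):
    # Phase 1: a cheap pass producing many small microclusters.
    micro = MiniBatchKMeans(n_clusters=n_micro, n_init=3).fit(X)
    centers = micro.cluster_centers_
    weights = np.bincount(micro.labels_, minlength=n_micro)
    # Phase 2: ordinary k-means over the much smaller set of microcluster
    # centroids, weighted by the number of points each one summarizes.
    mask = weights > 0
    return KMeans(n_clusters=k, n_init=10).fit(centers[mask],
                                               sample_weight=weights[mask])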

Page 5: dm_impl

• An approach developed at Microsoft uses such techniques as follows (a simplified sketch appears after the list):

1. Read a sample subset of data from the database.

2. Cluster that data with the existing model as usual to produce an updated model.

3. On the basis of the updated model, decide for each data item from the sample whether it needs to be:

• retained in memory;

• discarded, with summary information being updated;

• retained in a compressed form as summary information.

4. Repeat from 1 until a termination condition is met.
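A simplified, hypothetical Python sketch of this loop follows. For brevity it keeps only per-cluster sufficient statistics (count and sum) for discarded points and omits the compressed-set handling; all names and thresholds are invented:

import numpy as np

def scalable_kmeans(read_sample, k, dim, n_rounds=10, discard_radius=1.0):
    """read_sample() is assumed to return a (n, dim) array of new rows."""
    rng = np.random.default_rng(0)
    counts = np.zeros(k)                    # summary statistics for the
    sums = np.zeros((k, dim))               # points discarded so far
    retained = np.empty((0, dim))
    centroids = None
    for _ in range(n_rounds):
        sample = read_sample()              # 1. read a sample
        if centroids is None:
            centroids = sample[rng.choice(len(sample), k,
                                          replace=False)].astype(float)
        data = np.vstack([retained, sample])
        for _ in range(5):                  # 2. k-means over retained + sample,
            labels = np.linalg.norm(        #    weighted by discarded summaries
                data[:, None] - centroids, axis=2).argmin(axis=1)
            for j in range(k):
                members = data[labels == j]
                if counts[j] + len(members) > 0:
                    centroids[j] = (sums[j] + members.sum(axis=0)) / \
                                   (counts[j] + len(members))
        # 3. points close to their centroid are discarded into the summaries
        dist = np.linalg.norm(data - centroids[labels], axis=1)
        discard = dist < discard_radius
        for j in range(k):
            m = discard & (labels == j)
            counts[j] += m.sum()
            sums[j] += data[m].sum(axis=0)
        retained = data[~discard]           # 4. repeat with the remainder
    return centroids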

Page 6: dm_impl


ASSOCIATION MINING OF LARGE DATA SETS: APRIORI

• With one database scan for each itemset size tested, the cost of scans would be prohibitive for the Apriori algorithm unless the database is resident in memory.

• Approaches to enhance the efficiency of Apriori include the following.

• While generating 1-itemsets for each transaction, generate 2-itemsets at the same time, hashing the 2-itemsets to a hash table structure. All buckets whose final count of itemsets is less than the minimum support threshold can be ignored subsequently, since any itemset therein cannot itself have the minimum required support (a sketch follows).
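A minimal, hypothetical Python sketch of this hash-based pruning pass (the bucket count and threshold are invented for illustration):

from collections import Counter
from itertools import combinations

def hash_pass(transactions, n_buckets=1024, min_support=2):
    item_counts = Counter()
    buckets = [0] * n_buckets
    for t in transactions:                       # one scan of the data
        item_counts.update(t)                    # count 1-itemsets...
        for pair in combinations(sorted(t), 2):  # ...and hash 2-itemsets
            buckets[hash(pair) % n_buckets] += 1
    frequent_items = {i for i, c in item_counts.items() if c >= min_support}
    def may_be_frequent(pair):
        # A 2-itemset can only be frequent if both items are frequent AND
        # its hash bucket's total count reaches the support threshold.
        return (set(pair) <= frequent_items and
                buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_support)
    return frequent_items, may_be_frequent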

Page 7: dm_impl


• Overlap the testing of k-itemsets and (k+1)-itemsets by counting (k+1)-itemsets in parallel with counting k-itemsets. Unlike in conventional Apriori, in which candidate (k+1)-itemsets are only generated after the k-itemset database scan, in this approach the database scan is divided into blocks, and candidate (k+1)-itemsets can begin to be generated and counted at the start of a block, part-way through the k-itemset scan.

• Only two database scans are needed if a partitioning approach is adopted, under which transactions are divided into n partitions, each of which can be held in memory. In the first scan, frequent itemsets for each partition are generated. These are combined to create a candidate frequent itemset list for the database as a whole. In the second scan, the actual support for members of the candidate frequent itemset list is checked (a sketch follows).
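A hypothetical Python sketch of the two-scan partitioning approach; local_frequent_itemsets() is an assumed helper that mines one in-memory partition (for example, by running Apriori locally) and returns frozensets of items:

from collections import Counter

def partition_apriori(partitions, min_support_ratio, total_n):
    # Scan 1: union of locally frequent itemsets over all partitions.
    # Any globally frequent itemset must be frequent in at least one partition.
    candidates = set()
    for part in partitions:
        local_min = min_support_ratio * len(part)
        candidates |= local_frequent_itemsets(part, local_min)  # assumed helper
    # Scan 2: count the actual global support of every candidate.
    counts = Counter()
    for part in partitions:
        for t in part:
            for c in candidates:
                if c <= frozenset(t):
                    counts[c] += 1
    return {c for c in candidates if counts[c] >= min_support_ratio * total_n}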

Page 8: dm_impl


• Pick a random sample of transactions which will fit in memory and search for frequent itemsets in that sample. This may result in some global frequent itemsets being missed. The chance of this happening can be lessened by adopting a lower minimum support threshold for the sample, with the remaining database then being used to check the actual support of the candidate itemsets. A second database scan may be needed to ensure no frequent itemsets have been missed (a sketch follows).
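A hypothetical Python sketch of the sampling approach, reusing the assumed local_frequent_itemsets() helper from above; the threshold-lowering factor is invented:

import random
from collections import Counter

def sample_apriori(transactions, sample_size, min_count, slack=0.8):
    sample = random.sample(transactions, sample_size)
    # Mine the sample at a lowered threshold to reduce the chance of
    # missing globally frequent itemsets.
    scaled = min_count * sample_size / len(transactions)
    candidates = local_frequent_itemsets(sample, slack * scaled)  # assumed helper
    counts = Counter()
    for t in transactions:              # verify support on the full database
        for c in candidates:
            if c <= frozenset(t):
                counts[c] += 1
    return {c for c in candidates if counts[c] >= min_count}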

Page 9: dm_impl


DATA MINING STANDARDS

• Data mining standards and related standards for data grids, web services and the semantic web enable the easier deployment of data mining applications across platforms.

• Standards cover:

• The overall KDD process.

• Metadata interchange with data warehousing applications.

• The representation of data cleaning, data reduction and transformation processes.

• The representation of data mining models.

• APIs for performing data mining processes from other languages, including SQL and Java.

Page 10: dm_impl


CRISP-DM

• CRISP-DM (CRoss Industry Standard Process for Data Mining) specifies a process model covering the following six phases of the KDD process:

• Business Understanding

• Data Understanding

• Data Preparation

• Modeling

• Evaluation

• Deployment

• www.the-modeling-agency.com/crisp-dm.pdf

Page 11: dm_impl

PREDICTIVE MODEL MARKUP LANGUAGE (PMML)

• PMML is an XML-based standard developed by the Data Mining Group (www.dmg.org), which is a consortium of data mining product vendors.

• PMML represents data mining models as well as operations for cleaning and transforming data prior to modeling.

• The aim is to enable an application to produce a data mining model in a form – PMML XML – which another data mining application can read and apply.

• Below is shown the PMML representation of an example association rules model for the following transaction data.

Page 12: dm_impl

[The example transaction data and the PMML representation of the association rules model, shown as images in the original slides, are not reproduced here.]
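In place of the missing example, the following is a minimal, hypothetical PMML association rules model illustrating the general shape of such a document (the items, counts and rule statistics are invented, and the PMML version is an assumption):

<?xml version="1.0" encoding="UTF-8"?>
<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Illustrative association rules model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="transaction" optype="categorical" dataType="string"/>
    <DataField name="item" optype="categorical" dataType="string"/>
  </DataDictionary>
  <AssociationModel functionName="associationRules"
                    numberOfTransactions="4" numberOfItems="2"
                    minimumSupport="0.5" minimumConfidence="0.6"
                    numberOfItemsets="3" numberOfRules="1">
    <MiningSchema>
      <MiningField name="transaction" usageType="group"/>
      <MiningField name="item" usageType="active"/>
    </MiningSchema>
    <Item id="1" value="beer"/>
    <Item id="2" value="crisps"/>
    <Itemset id="1" support="0.75" numberOfItems="1">
      <ItemRef itemRef="1"/>
    </Itemset>
    <Itemset id="2" support="0.5" numberOfItems="1">
      <ItemRef itemRef="2"/>
    </Itemset>
    <Itemset id="3" support="0.5" numberOfItems="2">
      <ItemRef itemRef="1"/>
      <ItemRef itemRef="2"/>
    </Itemset>
    <!-- {beer} => {crisps}: support 0.5, confidence 0.5/0.75 = 0.67 -->
    <AssociationRule support="0.5" confidence="0.67"
                     antecedent="1" consequent="2"/>
  </AssociationModel>
</PMML>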

• The association model XML schema specification follows.

Page 15: dm_impl

[The association model XML schema specification, shown as images in the original slides, is not reproduced here.]

Page 16: dm_impl

• The components of a PMML document consist of (the first two and the last being used in the example model above):

• Data dictionary. Defines the model's input attributes with type and value range.

• Mining schema. Defines the attributes and roles specific to a particular model.

• Transformation dictionary. Defines the following mappings: normalization (continuous or discrete values to numbers), discretization (continuous to discrete values), value mapping (discrete to discrete values), aggregation (grouping values as in SQL).

• Model statistics. Statistics about individual attributes.

• Models. Includes regression models, cluster models, association rules, neural networks, Bayesian models, sequence models.

• PMML is used within the standards CWM, SQL/MM Part 6 Data Mining, JDM, and MS Analysis Services (OLE DB for Data Mining), providing a degree of compatibility between them all.

Page 17: dm_impl

COMMON WAREHOUSE METAMODEL (CWM)

• CWM supports the interchange of warehouse and business intelligence metadata between warehouse tools, warehouse platforms and warehouse metadata repositories in distributed heterogeneous environments.

• http://www.omg.org/technology/documents/modeling_spec_catalog.htm

SQL/MM DATA MINING

• The SQL Multimedia and Application Packages standard (SQL/MM) Part 6 specifies an SQL interface to data mining applications and services through SQL:1999 user-defined types, as follows.

• User-defined types for four data mining functions: association rules, clustering, classification and regression.

Page 18: dm_impl

• Routines to manipulate these user-defined types to allow:

• Setting parameters for mining activities.

• Training of mining models, in which a particular mining technique is chosen, parameters for that technique are set, and the mining model is built with training data sets.

• Testing of mining models, applicable only to regression and classification models, in which the trained model is evaluated by comparing its predictions with results for known data.

• Application of mining models, in which the model is applied to new data to cluster, predict or classify as appropriate. This phase is not applicable to rule models, in which rules are determined during the training phase.

• User-defined types for data structures common across these data mining models.

• Functions to capture metadata for data mining input.

Page 19: dm_impl

• For example, for the association rule model type DM_RuleModel the following methods are supported:

• DM_impRuleModel(CHARACTER LARGE OBJECT(DM_MaxContentLength))

• Imports a rule model expressed as PMML.

• Returns a DM_RuleModel.

• DM_expRuleModel()

• Exports the rule model as PMML.

• DM_getNORules()

• Returns the number of rules.

• DM_getRuleTask()

• Returns the data mining task value, data mining settings etc.
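A hypothetical SQL fragment suggesting how these methods might be used (the table, column and host variable names are invented, and the exact method invocation syntax varies between implementations):

-- Store an imported PMML association rules model in a table.
CREATE TABLE rule_models (id INTEGER, model DM_RuleModel);

INSERT INTO rule_models
VALUES (1, DM_impRuleModel(:pmml_document));

-- Ask the stored model how many rules it contains.
SELECT rm.model.DM_getNORules()
FROM rule_models rm
WHERE rm.id = 1;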

Page 20: dm_impl

JAVA DATA MINING (JDM)

• Java Data Mining (http://www.jcp.org/en/jsr/detail?id=73) is a Java API developed under the Java Community Process supporting common data mining operations, as well as the metadata supporting mining activities.

• JDM 1.0 supports the following mining functions: classification, regression, attribute importance (ranking), clustering and association rules.

• JDM 1.0 supports the following tasks: model building, testing, application and model import/export.

• JDM does not support tasks such as data transformation, visualization and mining unstructured data.

• JDM has been designed so that its metadata maps closely to PMML, to provide support for the generation of XML for mining models. Likewise, its metadata maps closely to CWM, to support the generation of XML for mining tasks.

• The JDM API maps closely to SQL/MM Data Mining, to support an implementation of JDM on top of SQL/MM.

Page 21: dm_impl

OLE DB FOR DATA MINING & DMX – SQL SERVER ANALYSIS SERVICES

• OLE DB for Data Mining, developed by Microsoft and incorporated in SQL Server Analysis Services, specifies a structure for holding information defining a mining model, and a language for creating and working with these mining models.

• The approach has been to adopt an SQL-like framework for creating, training and using a mining model – a mining model is treated as though it were a special kind of table. The DMX language, which is SQL-like, is used to create and work with models.

Page 22: dm_impl

CREATE MINING MODEL [AGE PREDICTION] (
    [CUSTOMER ID] LONG KEY,
    [GENDER] TEXT DISCRETE,
    [AGE] DOUBLE DISCRETIZED() PREDICT,
    [ITEM PURCHASES] TABLE (
        [ITEM NAME] TEXT KEY,
        [ITEM QUANTITY] DOUBLE NORMAL CONTINUOUS,
        [ITEM TYPE] TEXT RELATED TO [ITEM NAME]
    )
) USING [MS DECISION TREE]

• The column to be predicted, AGE, is identified, together with the keyword DISCRETIZED() indicating that a discretization into ranges of values is to take place.

• ITEM QUANTITY is identified as having a normal distribution, which may be exploited by some mining algorithms.

Page 23: dm_impl

• ITEM TYPE is identified as being related to ITEM NAME. This reflects a 1-many constraint: each item has one type.

• It can be seen from the column specification that a nested table representation is used, with ITEM PURCHASES itself being a table nested within AGE PREDICTION. A conventional table representation would result in duplicate data in a single non-normalized table, or data spread across multiple normalized tables.

• The USING clause specifies the algorithm that will be used to construct the model.

• Having created a model, it may be populated with a caseset of training data using an INSERT statement.

• Predictions are obtained by executing a prediction join to match the trained model with the caseset to be mined. This process can be thought of as matching each case in the data to be mined with every possible case in the trained model, to find a predicted value for each case which matches a case in the model (an illustrative fragment follows).
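A hypothetical DMX fragment illustrating such a prediction join against the model above (the data source name and input query are invented):

SELECT t.[CUSTOMER ID], [AGE PREDICTION].[AGE]
FROM [AGE PREDICTION]
PREDICTION JOIN
    OPENQUERY([My Data Source],
              'SELECT [CUSTOMER ID], [GENDER] FROM NewCustomers') AS t
ON [AGE PREDICTION].[GENDER] = t.[GENDER]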

Page 24: dm_impl


• SQL Server Analysis Services supports data mining algorithms for use with:

• conventional relational tables

• OLAP cube data

• Mining techniques supported include:

• classification - decision trees

• clustering - k-means

• association rule mining

• Predictive Model Markup Language (PMML) is supported.

• SQL Server Analysis Services Data Mining Tutorials

Page 25: dm_impl

DATA MINING PRODUCTS – OPEN SOURCE

• A number of open-source packages and tools support data mining capabilities, including R, Weka, RapidMiner and Mahout.

• R is both a language for statistical computing and visualisation of results, and a wider environment consisting of packages and other tools for the development of statistical applications.

• Data mining functionality is supported through a number of packages, including classification with decision trees using the rpart package, clustering with k-means using the kmeans function, and association rule mining with Apriori using the arules package (a short example follows).
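A minimal, hypothetical R session illustrating those three facilities (the data sets used are illustrative stand-ins):

library(rpart)
library(arules)

# Classification: a decision tree over the built-in iris data.
tree <- rpart(Species ~ ., data = iris)

# Clustering: k-means (the kmeans function in the base stats package).
clusters <- kmeans(iris[, 1:4], centers = 3)

# Association rule mining: Apriori over an example transactions data set
# shipped with the arules package.
data(Groceries)
rules <- apriori(Groceries, parameter = list(support = 0.01, confidence = 0.5))
inspect(head(rules, 3))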

• http://www.r-project.org

• Weka is a collection of data mining algorithms written in Java, including those for classification, clustering and association rule mining, as well as for visualisation.

• http://www.cs.waikato.ac.nz/ml/weka/

Page 26: dm_impl

• RapidMiner consists of both tools for developing standalone data mining applications and an environment for the use of RapidMiner functions from other programming languages.

• Weka and R algorithms may be integrated within RapidMiner.

• An XML-based interchange format is used to enable interchange of data between data mining algorithms.

• http://www.rapidminer.com

• Mahout is an Apache project to develop data mining algorithms for the Hadoop platform.

• Core MapReduce algorithms for clustering, classification… are provided, but the project also incorporates algorithms designed to run on single-node architectures and non-Hadoop cluster architectures.

• http://mahout.apache.org

Page 27: dm_impl

DATA MINING PRODUCTS – ORACLE

• Oracle supports data mining algorithms for use with conventional relational tables.

• Mining techniques supported include:

• classification - decision trees, support vector machines...

• clustering - k-means...

• association rule mining – Apriori

• Predictive Model Markup Language (PMML) support is included

• In addition to SQL and PL/SQL interfaces, up to Oracle 11 a Java API was supported to allow applications to be developed which mine data. This was Oracle's implementation of JDM 1.0, introduced above.

Page 28: dm_impl


• From Oracle 12, the Java API is no longer supported. Instead, support for R has been introduced with the Oracle R Enterprise component.

• Oracle R Enterprise allows R to be used to perform analysis on Oracle database tables.

• A collection of packages supports mapping of R data types to Oracle database objects and the transparent rewriting of R expressions to SQL expressions on those corresponding objects.

• A related product is Oracle R Connector for Hadoop. This is an R package which provides an interface between a local R environment and file system and Hadoop, enabling R functions to be executed on data held in memory, on the local file system, and in HDFS.

Page 29: dm_impl


DATA MINING PRODUCTS – SPSS MODELER

• SPSS Modeler (formerly Clementine and PASW Modeler) is a data mining tool from IBM. Using it you can:

• Obtain data from a variety of sources

• Select and transform data

• Visualise the data using a variety of plots and graphs

• Model the data with data mining methods including

• classification - decision trees, support vector machines...

• clustering - k-means...

• association rule mining – Apriori

• Output the results in a variety of forms.

• CRISP-DM and PMML support is included.

• SPSS Modeler also supports integration with the data mining tools available from database vendors including Oracle Data Miner, IBM DB2 InfoSphere Warehouse, and Microsoft Analysis Services.

Page 30: dm_impl


• SPSS Modeler has a data stream approach enabling data to be processed by a series of nodes performing operations on the data.

Page 31: dm_impl

• In this data stream approach, data flows from:

• data source nodes representing, for example, data files or database tables, via

• operation nodes representing, for example, selection, sampling or aggregation operations, and

Page 32: dm_impl

• modelling nodes representing data mining methods, for example classification, clustering and association rule mining methods, to

• graph nodes presenting the results in a variety of formats (for example, charts or plots), output nodes for further analysis, or export nodes producing data files for import by external applications.

• SPSS Modeler supports a language, CLEM, for specifying the operations for analyzing and manipulating the data within the nodes of a stream.

Page 33: dm_impl


• Two tutorials are referenced below.

• The first tutorial is based on data recording the response of patients with the same illness to various drugs used to treat that illness.

• A stream is built which is used to analyse and visualise the data to identify the relationship between the drugs administered and the levels of sodium (Na) and potassium (K) measured in patients.

• The second tutorial is based on market basket data for supermarket transactions.

• A stream is built to enable association rule mining (based on Apriori) to identify links between items purchased.

• A rule induction method C5.0, often used in the construction of decision trees, is then applied to profile the purchasers of the product groups identified by the rule mining.

Page 34: dm_impl

READING

• P. Bradley et al., Scaling Mining Algorithms to Large Databases, CACM, 45(8), 38-43, 2002.

• Y. Chen et al., Practical Lessons of Data Mining at Yahoo!, Proc. CIKM'09, 1047-1055, 2009.

• J. Lin & D. Ryaboy, Scaling Big Data Mining Infrastructure: The Twitter Experience, ACM SIGKDD Explorations, 14(2), 6-19, 2012.

• R. Sumbaly et al., The “Big Data” Ecosystem at LinkedIn, Proc. SIGMOD'13, 1125-1134, 2013.

• SPSS Modeler Tutorial 1

• SPSS Modeler Tutorial 2

FOR REFERENCE

• IBM SPSS Modeler 15 User's Guide