Page 1: Data Mining Techniques

Data Mining with JDM API

Regina Wang

Page 2: Data Mining Techniques

Data Mining

Knowledge Discovery in Databases (KDD)

Searching large volumes of data for patterns.

The nontrivial extraction of implicit, previously unknown, and potentially useful information from data.

The science of extracting useful information from large data sets or databases.

Uses computational techniques from statistics, machine learning, and pattern recognition.

Page 3: Data Mining Techniques

Descriptive Statistics

Collect data

Classify data

Summarize data

Present data

Make inferences to draw conclusions

--Point and interval estimation

--Hypothesis testing

--Prediction
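As a rough illustration of summarizing data and of point and interval estimation, the sketch below computes a sample mean, a sample standard deviation, and an approximate 95% interval for the mean. The class name `StatsSketch`, its method names, and the use of the normal quantile 1.96 (rather than a t quantile) are assumptions made for this example; they are not part of the slides or of JDM.

```java
import java.util.Arrays;

/** Minimal descriptive-statistics sketch; all names here are illustrative. */
class StatsSketch {

    /** Point estimate of the population mean. */
    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElse(Double.NaN);
    }

    /** Sample standard deviation (divides by n - 1). */
    static double sampleStdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    /** Approximate 95% interval for the mean, assuming a normal quantile of 1.96. */
    static double[] meanInterval95(double[] xs) {
        double m = mean(xs);
        double halfWidth = 1.96 * sampleStdDev(xs) / Math.sqrt(xs.length);
        return new double[] { m - halfWidth, m + halfWidth };
    }

    public static void main(String[] args) {
        double[] sample = { 2, 4, 4, 4, 5, 5, 7, 9 };
        System.out.println("mean   = " + mean(sample));
        System.out.println("stddev = " + sampleStdDev(sample));
        System.out.println("95% interval = " + Arrays.toString(meanInterval95(sample)));
    }
}
```

For small samples a t quantile would usually be more appropriate than 1.96; the constant is used here only to keep the sketch short.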

Page 4: Data Mining Techniques

Machine Learning

Concerned with the development of techniques which allow computers to "learn".

Concerned with the algorithmic complexity of computational implementations.

Many inference problems turn out to be NP-hard or harder.

Page 5: Data Mining Techniques

Common Machine Learning Algorithms

Supervised learning: prior knowledge

Unsupervised learning: statistical regularity of the patterns

Semi-supervised learning

Reinforcement learning

Transduction

Learning to learn

Page 6: Data Mining Techniques

Pattern Recognition

The act of taking in raw data and taking an action based on the category of the data.

Aims to classify data patterns based on prior knowledge or on statistical information.

Based on availability of a training set: supervised and unsupervised learning

Two approaches: statistical (decision theory) and syntactic (structural).

Page 7: Data Mining Techniques

Supervised Techniques

Classification:

--k-Nearest Neighbors

--Naïve Bayes

--Classification Trees

--Discriminant Analysis

--Logistic Regression

--Neural Nets
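To make the first technique in the list concrete, here is a minimal k-nearest-neighbors classifier: it labels a query point with the majority label among its k closest training points. The class `KnnSketch`, its method names, and the choice of squared Euclidean distance are assumptions for illustration; this is a sketch, not code from JDM or from the slides.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Minimal k-nearest-neighbors sketch; all names here are illustrative. */
class KnnSketch {

    /** Squared Euclidean distance between two points of equal dimension. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the majority label among the k training points closest to the query. */
    static String classify(double[][] points, String[] labels, double[] query, int k) {
        // Sort training indices by distance to the query, nearest first.
        Integer[] order = new Integer[points.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, Comparator.comparingDouble(i -> squaredDistance(points[i], query)));

        // Count votes among the k nearest; on a tie, the label that reached
        // the winning count first (walking nearest to farthest) is kept.
        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        for (int j = 0; j < Math.min(k, order.length); j++) {
            String label = labels[order[j]];
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = label;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[][] points = { {1, 1}, {1, 2}, {2, 1}, {8, 8}, {9, 8}, {8, 9} };
        String[] labels = { "a", "a", "a", "b", "b", "b" };
        System.out.println(classify(points, labels, new double[] {1.5, 1.5}, 3));
    }
}
```

Sorting every training point is O(n log n) per query, which is fine for a demonstration; larger data sets typically use a spatial index instead.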

Page 8: Data Mining Techniques

Supervised Techniques

Prediction (Estimation):

--Regression

--Regression Trees

--k-Nearest Neighbors
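As one example of prediction (estimation), the sketch below fits a straight line to paired observations by ordinary least squares and uses it to estimate a new value. The class `LinearRegressionSketch` and its methods are hypothetical names chosen for this example; only the single-predictor case is covered.

```java
/** Minimal one-variable least-squares regression sketch; all names are illustrative. */
class LinearRegressionSketch {

    /** Returns {intercept, slope} minimizing squared error for y = intercept + slope * x. */
    static double[] fit(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        // slope = sum((x - meanX)(y - meanY)) / sum((x - meanX)^2)
        double sxy = 0.0;
        double sxx = 0.0;
        for (int i = 0; i < n; i++) {
            sxy += (x[i] - meanX) * (y[i] - meanY);
            sxx += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = sxy / sxx;
        return new double[] { meanY - slope * meanX, slope };
    }

    /** Applies a fitted {intercept, slope} model to a new x value. */
    static double predict(double[] model, double x) {
        return model[0] + model[1] * x;
    }

    public static void main(String[] args) {
        double[] model = fit(new double[] {1, 2, 3, 4}, new double[] {3, 5, 7, 9});
        System.out.println("intercept = " + model[0] + ", slope = " + model[1]);
        System.out.println("prediction at x = 5: " + predict(model, 5));
    }
}
```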

Page 9: Data Mining Techniques

Unsupervised Techniques

Cluster Analysis

Principal Components

Association Rules

Collaborative Filtering
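Association rules are usually judged by two measures: support (how often an itemset occurs) and confidence (how often the right-hand side occurs when the left-hand side does). The sketch below computes both over a small list of transactions. The class `AssociationSketch`, its method names, and the sample items are made up for illustration, and the code assumes Java 9 or later for `List.of` and `Set.of`.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Minimal support/confidence sketch for association rules; all names are illustrative. */
class AssociationSketch {

    /** Fraction of transactions that contain every item in the itemset. */
    static double support(List<Set<String>> transactions, Set<String> itemset) {
        long hits = transactions.stream().filter(t -> t.containsAll(itemset)).count();
        return (double) hits / transactions.size();
    }

    /** confidence(lhs => rhs) = support(lhs and rhs together) / support(lhs). */
    static double confidence(List<Set<String>> transactions, Set<String> lhs, Set<String> rhs) {
        Set<String> both = new HashSet<>(lhs);
        both.addAll(rhs);
        return support(transactions, both) / support(transactions, lhs);
    }

    public static void main(String[] args) {
        List<Set<String>> transactions = List.of(
                Set.of("bread", "milk"),
                Set.of("bread", "butter"),
                Set.of("bread", "milk", "butter"),
                Set.of("milk"));
        System.out.println("support(bread, milk) = "
                + support(transactions, Set.of("bread", "milk")));
        System.out.println("confidence(bread => milk) = "
                + confidence(transactions, Set.of("bread"), Set.of("milk")));
    }
}
```

A full rule miner such as Apriori would enumerate candidate itemsets and keep those above a support threshold; this sketch only scores rules that are supplied to it.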

Page 10: Data Mining Techniques

Java Data Mining API (JDM)

Data-mining tools were traditionally provided in products with vendor-specific interfaces.

The Java Data Mining API (JDM) defines a common Java API to interact with data-mining systems.

Developed by the Java Community Process (JCP) data mining expert group.

Page 11: Data Mining Techniques

JDM Current Versions

JDM 1.0 (JSR 73): final specification in August 2004

http://www.jcp.org/en/jsr/detail?id=73

JDM 2.0 (JSR 247): Early Review

http://www.jcp.org/en/jsr/detail?id=247

JDM targets the Java™ 2 Platform, Enterprise Edition (J2EE™) and Standard Edition (J2SE™).

Page 12: Data Mining Techniques

Data Mining System

A typical data-mining system consists of

--a data-mining engine

--a repository that persists the data-mining artifacts, such as the models, created in the process.

The actual data is obtained via a database connection, or via a file-system API.

Page 13: Data Mining Techniques

JDM Architectural Components

Application programming interface (API)

Data mining engine (DME), or data mining server (DMS): provides the infrastructure that offers a set of data mining services to its API clients.

Mining object repository (MOR): the DME uses a mining object repository to persist data mining objects.

Page 14: Data Mining Techniques

Key JDM API benefit: it abstracts the physical components, tasks, and algorithms into Java classes.

Figure 1. Components of a data-mining system

Page 15: Data Mining Techniques

Building a data-mining model

1. Decide what you want to learn.

2. Select and prepare your data.

3. Choose mining tasks and configure the mining algorithms.

4. Build your data-mining model.

5. Test and refine the models.

6. Report findings or predict future outcomes.

Page 16: Data Mining Techniques

Data Mining Process

Figure 2. Data mining steps.

Page 17: Data Mining Techniques

Usage of JDM API

Use JDM to explore the mining object repository (MOR) and find out which models and model-building parameters work best.

Follow a few simple steps that map the process to JDM interactions.

Build a Java data mining GUI application.

Page 18: Data Mining Techniques

Figure 3. Top level packages.

Figure 4. Top level interfaces.

Page 19: Data Mining Techniques

Figure 4. Top level interfaces.

Page 20: Data Mining Techniques

Using the JDM API

1. Identify the data you wish to use to build your model (your build data) with a URL that points to that data.

2. Specify the type of model you want to build, such as clustering, classification, or association rules, and the parameters to the build process. Such parameters are termed build settings in JDM. These tasks are represented by API classes.

3. Create a logical representation of your data to select certain attributes of the physical data, and then map those attributes to logical values.

Page 21: Data Mining Techniques

Using the JDM API (continued)

4. Specify the parameters to your data-mining algorithms.

5. Create a build task, and apply to that task the physical data references and the build settings.

6. Finally, you execute the task. The outcome of that execution is your data model. That model will have a signature, a kind of interface that describes the possible input attributes for later applying the model to additional data.
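The steps above rely on one recurring idea: data references, build settings, and the resulting model are saved in the repository under names, and a build task refers to them by those names. The sketch below imitates that pattern with a toy in-memory repository so it can run without a JDM implementation. `ToyRepository` and `ToyBuildTask` are hypothetical classes invented for this example and are not part of `javax.datamining`; reading the boolean argument of `saveObject` as "replace an existing object" is likewise an assumption.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

/** Toy stand-in for a mining object repository; NOT the javax.datamining API. */
class ToyRepository {
    private final Map<String, Object> objects = new HashMap<>();

    /** Stores an object under a name; assumed to refuse overwriting unless replace is true. */
    void saveObject(String name, Object value, boolean replace) {
        if (!replace && objects.containsKey(name)) {
            throw new IllegalStateException("object already exists: " + name);
        }
        objects.put(name, value);
    }

    /** Looks up a previously saved object by name. */
    Object retrieveObject(String name) {
        Object value = objects.get(name);
        if (value == null) {
            throw new IllegalArgumentException("no such object: " + name);
        }
        return value;
    }
}

/** Hypothetical build task that refers to its inputs and its output by repository name. */
class ToyBuildTask {
    private final String dataName;
    private final String settingsName;
    private final String modelName;

    ToyBuildTask(String dataName, String settingsName, String modelName) {
        this.dataName = dataName;
        this.settingsName = settingsName;
        this.modelName = modelName;
    }

    /** Resolves the named inputs, runs the algorithm, and saves the model under its name. */
    void execute(ToyRepository repo, BiFunction<Object, Object, Object> algorithm) {
        Object model = algorithm.apply(
                repo.retrieveObject(dataName), repo.retrieveObject(settingsName));
        repo.saveObject(modelName, model, true);
    }
}
```

Referring to objects by name lets the task itself be saved and executed later, which matches the step order in the list: save data and settings first, then build and run the task.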

Page 22: Data Mining Techniques

Using data model and results

Once you've created a model, you can test that model, and then even apply the model to additional data. Building, testing, and applying the model to additional data is an iterative process that, ideally, yields increasingly accurate models.

Those models can then be saved in the MOR, and used to either explain data, or to predict the outcome of new data in relation to your data-mining objective.

Page 23: Data Mining Techniques

JDM Data Connection

A JDM connection is represented by the engine variable, which is of type javax.datamining.resource.Connection. JDM connections are very similar to JDBC connections, with one connection per thread.

    PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
        engine.getFactory("javax.datamining.data.PhysicalDataSet");

Page 24: Data Mining Techniques

JDM Data Connection (continued)

Build data is referenced via a PhysicalDataSet object, which, in turn, loads the data from a file or a database table, referenced with a URL.

    PhysicalDataSet dataSet = pdsFactory.create(
        "file:///export/data/textFileData.data", true);

Page 25: Data Mining Techniques

Code Example: Building a clustering model

    // Create the physical representation of the data
    PhysicalDataSetFactory pdsFactory = (PhysicalDataSetFactory)
        dmeConn.getFactory("javax.datamining.data.PhysicalDataSet");
    PhysicalDataSet buildData = pdsFactory.create(uri, true);
    dmeConn.saveObject("myBuildData", buildData, false);

    // Create the logical representation of the data from physical data
    LogicalDataFactory ldFactory = (LogicalDataFactory)
        dmeConn.getFactory("javax.datamining.data.LogicalData");
    LogicalData ld = ldFactory.create(buildData);
    dmeConn.saveObject("myLogicalData", ld, false);

    // Create the settings to build a clustering model
    ClusteringSettingsFactory csFactory = (ClusteringSettingsFactory)
        dmeConn.getFactory("javax.datamining.clustering.ClusteringSettings");
    ClusteringSettings clusteringSettings = csFactory.create();
    clusteringSettings.setLogicalDataName("myLogicalData");
    clusteringSettings.setMaxNumberOfClusters(20);

Page 26: Data Mining Techniques

Code Example: Building a clustering model (cont'd)

    clusteringSettings.setMinClusterCaseCount(5);
    dmeConn.saveObject("myClusteringBS", clusteringSettings, false);

    // Create a task to build a clustering model with data and settings
    BuildTaskFactory btFactory = (BuildTaskFactory)
        dmeConn.getFactory("javax.datamining.task.BuildTask");
    BuildTask task = btFactory.create("myBuildData", "myClusteringBS",
        "myClusteringModel");
    dmeConn.saveObject("myClusteringTask", task, false);

    // Execute the task and check the status
    ExecutionHandle handle = dmeConn.execute("myClusteringTask");
    handle.waitForCompletion(Integer.MAX_VALUE); // wait until done
    ExecutionStatus status = handle.getLatestStatus();
    if (ExecutionState.success.equals(status.getState())) {
        // task completed successfully...
    }

Page 27: Data Mining Techniques

References

Java Data Mining Specification

http://www.jcp.org/en/jsr/detail?id=73

Mine Your Own Data with the JDM API, Frank Sommers, July 7, 2005

http://www.artima.com/lejava/articles/data_mining.html

http://www.stanford.edu/class/cs345a/#handouts