
Page 1: Data mining

SUBMITTED BY:

ANKITA AGARWAL

DEEPIKA RAIPURIA

C-11

MODY INSTITUTE OF TECH. AND SCIENCE

PROJECT GUIDE:

MR. JARNAIL SINGH

FACULTY, FET

MODY INSTITUTE OF TECH. AND SCIENCE

Page 2: Data mining

ACKNOWLEDGEMENT

We wish to express our deepest gratitude to Mr. Jarnail Singh for his utmost concern, guidance and encouragement during our major project. We would like to thank him for his pleasant and insightful interaction with us. We are extremely grateful for his constant guidance and encouragement, and for his help in gaining access to the resources essential for the successful completion of this assignment. We would also like to thank him for sharing his valuable knowledge and resources with us and for showing the utmost co-operation and understanding.

ANKITA AGARWAL

DEEPIKA RAIPURIA

MITS, LAXMANGARH

Page 3: Data mining

EXECUTIVE SUMMARY

Data mining is the process of posing various queries and extracting useful information, patterns and trends, often previously unknown, from large quantities of data possibly stored in databases. The goals of data mining include detecting abnormal patterns and predicting the future based on past experiences and current trends. Some of the data mining techniques include those based on rough sets, inductive logic programming, machine learning and neural networks.

The main data mining problems include:

Classification: finding rules to partition data into groups.

Association: finding rules to make associations between data.

Sequencing: finding rules to order data.

Data mining is also referred to as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology and data dredging.

Another popular term for it is knowledge discovery in databases (KDD). It consists of an iterative sequence of the following steps:

Data integration

Data selection

Data transformation

Data mining

Pattern evaluation

Knowledge presentation

Data can be stored in many types of databases. One database architecture is the data warehouse, a repository of multiple, heterogeneous data sources. It involves data cleansing, data integration, and on-line analytical processing (OLAP), in addition to user interfaces, optimized query processing, and transaction management; an efficient method for transaction management is on-line transaction processing (OLTP).

Page 4: Data mining

BRIEF CONTENTS

S. No.  DESCRIPTION                                        PAGE NO.
1.      DATA MINING - Knowledge discovery in databases     1
2.      DATA WAREHOUSE - OLAP technology for data mining   8
3.      MINING ASSOCIATION RULES in large databases        15
4.      CLASSIFICATION AND PREDICTION                      18
5.      REFERENCES                                         20

Page 5: Data mining

1.0 DATA MINING - Knowledge discovery in databases

1.1 Motivation: Why data mining?

“Necessity Is the Mother of Invention”

1.1.1 Data explosion problem: The major reason that data mining has attracted a great deal of attention in the information industry in recent years is the wide availability of huge amounts of data and the imminent need for turning such data into useful information and knowledge. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis to engineering design and science exploration. Automated data collection tools and mature database technology have led to tremendous amounts of data accumulated in, and yet to be analyzed in, databases, data warehouses, and other information repositories. We are drowning in data, but starving for knowledge! Data mining can be viewed as a result of the natural evolution of information technology.

1.2 Evolution of Database Technology

1960s: Data collection, database creation, IMS and network DBMS

1970s: Relational data model, relational DBMS implementation

1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.), application-oriented DBMS (spatial, scientific, engineering, etc.)

1990s: Data mining, data warehousing, multimedia databases, and Web databases

2000s: Stream data management and mining, data mining with a variety of applications, Web technology and global information systems

1.2.1 Solution: Data warehousing and data mining

Data warehousing and on-line analytical processing (OLAP): a data warehouse is a repository of multiple heterogeneous data sources, organized under a unified schema at a single site in order to facilitate management decision making. It comprises data cleansing,

Page 6: Data mining

data integration and OLAP, which are analysis techniques with functionalities such as summarization, consolidation, and aggregation, as well as the ability to view information from different angles.

Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases.

1.3 What is data mining?

Data mining refers to extracting or “mining” knowledge from large amounts of data. The

term is actually a misnomer. Mining of gold from rocks or sand is referred to as gold

mining rather than rock or sand mining. Thus data mining should have been appropriately

named “knowledge mining from data,” which is unfortunately somewhat long.

Nevertheless, mining is a vivid term characterizing the process that finds a small set of

precious nuggets from a great deal of raw material. Thus such a misnomer that carries

both “data” and “mining” became a popular choice. There are many other terms carrying

a similar or slightly different meaning to data mining such as knowledge mining from

databases, knowledge extraction, data or pattern analysis, data archeology and data

dredging. Many people treat data mining as a synonym for another popularly used term,

“Knowledge discovery in databases” or KDD. It consists of an iterative sequence of

the following steps:

Fig. 1.1: Data mining as the core of the knowledge discovery process (databases are cleaned and integrated into a data warehouse, task-relevant data are selected, data mining is applied, and the resulting patterns are evaluated).

Page 7: Data mining

The main steps are:

1. Data cleaning: remove noise and inconsistent data.

2. Data integration: combine data from multiple sources.

3. Data selection: retrieve the data relevant to the analysis task from the database.

4. Data reduction and transformation: transform or consolidate the data into forms appropriate for mining, for example by performing summary or aggregation operations, and find useful features through dimensionality/variable reduction and invariant representation (steps 1 to 4 are sketched in code after this list).

5. Data mining: an essential process in which intelligent methods and mining algorithms are applied in order to extract data patterns.

6. Pattern evaluation: identify the truly interesting patterns representing knowledge, based on some interestingness measures.

7. Knowledge presentation: use visualization and knowledge representation techniques to present the mined knowledge to the user.
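The preprocessing steps above can be illustrated with a minimal sketch. The following sketch assumes the pandas and NumPy libraries are available; the tables, column names and values are invented solely to mirror steps 1 to 4 and are not part of the original report.

# Illustrative sketch of the pre-mining steps (integration, cleaning,
# selection, transformation). All data below is made up for illustration.
import numpy as np
import pandas as pd

# Two hypothetical sources to be integrated on a shared key.
sales = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 3],
    "amount": [120.0, np.nan, 80.0, 80.0, 200.0],  # one missing value, one duplicate row
})
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "region": ["east", "west", "east"],
})

# Data integration: combine the two sources.
data = sales.merge(customers, on="customer_id", how="inner")

# Data cleaning: drop exact duplicates and fill the missing amount.
data = data.drop_duplicates()
data["amount"] = data["amount"].fillna(data["amount"].median())

# Data selection: keep only the task-relevant attributes.
task_data = data[["region", "amount"]]

# Data transformation: aggregate into a summary form suitable for mining.
print(task_data.groupby("region")["amount"].agg(["count", "sum", "mean"]))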

1.4 Architecture: Typical Data Mining System

The architecture of a typical data mining system has the following major components, as shown in Fig. 1.2.

Fig. 1.2: Architecture of a typical data mining system (databases and a data warehouse, reached through data cleaning, data integration and filtering by a database or data warehouse server, on top of which sit the data mining engine, the pattern evaluation module, the knowledge base, and the graphical user interface).

Page 8: Data mining

1.4.1 Database, Data warehouse, or other information repository:

This is one or a set of databases, data warehouses, spreadsheets or other kinds of

information repositories. Data cleaning and data integration are performed on the data.

1.4.2 Database or data warehouse server:

The database or data warehouse server is responsible for fetching the relevant data, based on the user's data mining request.

1.4.3 Knowledge base:

This is the domain knowledge that is used to guide the search, or evaluate the

interestingness of resulting patterns. Such knowledge can include concept hierarchies,

used to organize attributes or attribute values into different levels of abstraction.

1.4.4 Data mining engine:

This is essential to the data mining system and ideally consists of a set of functional modules for tasks such as characterization, association, classification, cluster analysis, and evolution and deviation analysis.

1.4.5 Pattern evaluation module:

This component typically employs interestingness measures and interacts with the data

mining modules so as to focus the search towards interesting patterns.

1.4.6 Graphical User Interface:

This module communicates between users and the data mining system, allowing the user

to interact with the system by specifying a data mining query or task, providing

information to help focus the search and performing exploratory data mining based on the

intermediate data mining results.

1.5 Data mining-on what kind of data?

1.5.1 Relational databases:

A relational database is a collection of tables, each of which is assigned a unique

name. Each table consists of a set of attributes and usually stores a large set of tuples.

Each tuple in a relational table represents an object identified by a unique key and

described by a set of attribute values.

1.5.2 Data warehouses: A data warehouse is a repository of information collected from multiple sources,

stored under a unified schema, and which usually resides at a single site. Data

warehouses are constructed via a process of data cleaning, data transformation,

data integration, data loading and periodic data refreshing.

Page 9: Data mining

1.5.3 Transactional databases:

A transactional database consists of a file where each record represents a transaction. A transaction typically includes a unique transaction identity number (trans_ID) and a list of the items making up the transaction.

1.5.4 Advanced database systems and advanced database applications:

With the advances in database technology, various kinds of advanced database systems have emerged and are undergoing development to address the requirements of new database applications. The new database applications include:

Object-oriented databases

Object-relational databases

Spatial databases

Temporal databases and time-series databases

Text databases and multimedia databases

Heterogeneous databases and legacy databases

World wide web

1.6 Data Mining Functionalities:

Data mining functionalities are used to specify the kind of patterns to be found in data

mining tasks. They can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.

Predictive mining tasks perform inference on the current data in order to make

predictions.

1.6.1 Concept /Class Description: Characterization and Discrimination

Data can be associated with classes or concepts. It can be useful to describe individual

classes or concepts in summarized, concise, and yet precise terms. Such descriptions of a

class or a concept are called class/concept descriptions. These descriptions can be

derived via:

Data Characterization: It is summarization of the general characteristics or features of a

target class of data. The data corresponding to the user-specified class are typically

collected by a database query.

Data Discrimination: It is a comparison of the general features of target class data

objects with the general features of objects from one or a set of contrasting classes. The

target and contrasting classes can be specified by the user, and the corresponding data

objects retrieved through database queries.

Page 10: Data mining

1.6.2 Association analysis:

Association analysis is the discovery of association rules showing attribute value

conditions that occur frequently together in a given set of data. Association analysis is

widely used for market basket or transaction data analysis.

More formally, association rules are of the form X => Y, that is, "A1 ^ … ^ Am => B1 ^ … ^ Bn", where Ai (for i in {1, …, m}) and Bj (for j in {1, …, n}) are attribute-value pairs. The association rule X => Y is interpreted as "database tuples that satisfy the conditions in X are also likely to satisfy the conditions in Y."

1.6.3 Classification and Prediction:

Classification is the process of finding a set of models that describe and distinguish data classes or concepts, for the purpose of being able to use the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known). Classification can be used for predicting the class label of data objects. Prediction refers to both data value prediction and class label prediction. It also encompasses the identification of distribution trends based on the available data.

1.6.4 Cluster Analysis:

Unlike classification and prediction, clustering analyses data objects without consulting a

known class label. It can be used to generate such labels. The objects are clustered or

grouped based on the principle of maximizing the intraclass similarity and minimizing

the interclass similarity. That is, clusters of objects are formed so that objects within a

cluster have high similarity in comparison to one another, but are very dissimilar to

objects in other clusters. Each cluster that is formed can be viewed as a class of objects,

from which rules can be derived.
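As a concrete illustration of grouping unlabeled objects by similarity, the short sketch below runs k-means on two synthetic groups of points. It assumes NumPy and scikit-learn are available and uses made-up data; k-means is just one of many possible clustering methods, not a technique prescribed by this report.

# A minimal clustering sketch: group unlabeled 2-D points so that points
# within a cluster are similar and different clusters are well separated.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two synthetic groups of points; no class labels are supplied.
points = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
                    rng.normal(5.0, 0.5, (50, 2))])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
print(kmeans.labels_[:10])       # cluster assignment of the first 10 points
print(kmeans.cluster_centers_)   # the two discovered cluster centres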

1.6.5 Outlier Analysis:

A database may contain data objects that do not comply with the general behavior or the

model of the data. These data objects are outliers. Most data mining methods discard

outliers as noise or exceptions. However, in some applications such as fraud detection,

the rare events can be more interesting than the more regularly occurring ones. The

analysis of outlier data is referred to as outlier mining.

Page 11: Data mining

1.7 Are all of the patterns interesting?

A data mining system has the potential to generate thousands or even millions of patterns, or rules.

The pattern is interesting if

It is easily understood by humans.

Valid on new or test data with some degree of certainty.

Potentially useful.

Novel.

Several objective measures of pattern interestingness exist. These are based on the

structure of discovered patterns and the statistics underlying them. An objective

measure for association rules of the form X=>Y is rule support, representing the

percentage of transactions from a transaction database that the given rule satisfies.

This is taken to be the probability P(X ∪ Y), where X ∪ Y indicates that a transaction contains both X and Y, that is, the union of itemsets X and Y.

Another objective measure for association rules is confidence, which assesses the degree of certainty of the detected association. This is taken to be the conditional probability P(Y | X), that is, the probability that a transaction containing X also contains Y.

Support and confidence are defined as:

Support(X => Y) = P(X ∪ Y)

Confidence(X => Y) = P(Y | X)
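Both measures can be computed directly from a transaction database. The sketch below is a small, self-contained Python illustration over a toy set of market-basket transactions invented for this example.

# Sketch: support(X => Y) = P(X U Y) and confidence(X => Y) = P(Y | X)
# computed over a toy transaction database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(X, Y, db):
    # Fraction of transactions that contain every item of X and of Y.
    both = X | Y
    return sum(1 for t in db if both <= t) / len(db)

def confidence(X, Y, db):
    # Of the transactions containing X, the fraction that also contain Y.
    containing_x = [t for t in db if X <= t]
    return sum(1 for t in containing_x if Y <= t) / len(containing_x)

print(support({"bread"}, {"milk"}, transactions))     # 0.5 (2 of 4 transactions)
print(confidence({"bread"}, {"milk"}, transactions))  # 0.666... (2 of the 3 containing bread)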

Page 12: Data mining

2.0 DATA WAREHOUSE - OLAP technology for data mining

Data warehousing provides architectures and tools for business executives to systematically organize, understand, and use their data to make strategic decisions. A large number of organizations have found that data warehouse systems are valuable tools in today's competitive, fast-evolving world. In the last several years, many firms have spent millions of dollars in building enterprise-wide data warehouses.

A data warehouse is a subject-oriented, integrated, time-variant and nonvolatile collection of data in support of management's decision-making process.

The four keywords, subject-oriented, integrated, time-variant and nonvolatile, distinguish data warehouses from other data repository systems, such as relational database systems, transaction processing systems, and file systems.

Subject-Oriented: A data warehouse is organized around major subjects, such as

customer, supplier, product and sales. Rather than concentrating on the day to day

operations and transaction processing of an organization, a data warehouse

focuses on the modeling and analysis of data for decision makers.

Integrated: A data warehouse is usually constructed by integrating multiple

heterogeneous sources, such as relational databases, flat files, and on-line

transaction records. Data cleaning and data integration techniques are applied to

ensure consistency in naming conventions, encoding structures, attribute measures, and so on.

Time Variant: Data are stored to provide information from a historical

perspective. Every key structure in the data warehouse contains, either implicitly

or explicitly, an element of time.

Nonvolatile: A data warehouse is always a physically separate store of data

transformed from the application data found in the operational environment. Due

to this separation, a data warehouse does not require transaction processing,

recovery, and concurrency control mechanisms. It usually requires only two operations in data accessing: initial loading of data and access of data.

2.1 Differences between Operational Database Systems and Data

Warehouses:

The major task of on-line operational database systems is to perform on-line transaction

and query processing. These systems are called on-line transaction processing (OLTP)

systems. They cover most of the day-to-day operations of an organization, such as

purchasing, inventory, manufacturing, banking, payroll, registration, and accounting.

Page 13: Data mining

Data warehouse systems, on the other hand, serve users or knowledge workers in the role

of data analysis and decision making. Such systems can organize and present data in

various formats in order to accommodate the diverse needs of the different users. These

systems are known as on-line analytical processing (OLAP) systems. The major distinguishing features between OLTP and OLAP are summarized as follows:

Feature          OLTP                               OLAP
Characteristic   operational processing             informational processing
Orientation      transaction                        analysis
User             clerk, DBA, database professional  knowledge worker
Function         day-to-day operations              long-term informational requirements
DB design        ER-based, application-oriented     star/snowflake, subject-oriented
Data             current; guaranteed up-to-date     historical; accuracy maintained over time
Summarization    primitive, highly detailed         summarized, consolidated
View             detailed, flat relational          summarized, multidimensional
Unit of work     short, simple transaction          complex query
Access           read/write                         mostly read
Focus            data in                            information out
No. of users     thousands                          hundreds

2.2 OLAP Systems versus Statistical databases

Many of the characteristics of OLAP systems, such as the use of a multidimensional data

model and concept hierarchies, the association of measures with dimensions, and the

notions of roll up and drill down, also exist in earlier work on statistical databases(SDBs).

A statistical database is a database system that is designed to support statistical

applications. Similarities between the two types of systems are rarely discussed, mainly due to differences in terminology and application domains.

OLAP and SDB systems, however, have distinguishing differences. While SDBs tend to

focus on socio-economic applications, OLAP has been targeted for business applications.

Privacy issues regarding concept hierarchies are a major concern for SDBs. For example,

given summarized socio-economic data, it is controversial to allow users to view the

corresponding low level data. Finally, unlike SDBs, OLAP systems are designed for

handling huge amounts of data efficiently.

Page 14: Data mining

2.3 Three Tier data warehouse architecture

Data warehouses often adopt a three-tier architecture:

1. The bottom tier is a warehouse database server that is almost always a relational

database system. Data from operational databases and external sources are

extracted using application program interfaces known as gateways. A gateway is supported by the underlying DBMS and allows client programs to generate SQL code to be executed at a server.

2. The middle tier is an OLAP server that is typically implemented using either (1) a

relational OLAP(ROLAP) model, that is, an extended relational DBMS that

maps operations on multidimensional data to standard relational operations; or (2)

a multidimensional OLAP (MOLAP) model, that is, a special purpose server

that directly implements multidimensional data and operations.

3. The top tier is a client, which contains query and reporting tools, analysis tools,

and /or data mining tools.

Fig. 2.1: Multi-tiered architecture of a data warehouse (operational databases and other sources are extracted, transformed, loaded and refreshed through a monitor and integrator into the data warehouse and data marts, accompanied by metadata; an OLAP server over this data storage serves front-end analysis, query, reporting, and data mining tools).

Page 15: Data mining

From the architectural point of view, there are three data warehouse models:

Enterprise warehouse: an enterprise warehouse collects all of the information

about subjects spanning the entire organization. It provides corporate-wide data

integration, usually from one or more operational systems or external information

providers, and is cross functional in scope.

Data mart: A data mart contains a subset of corporate wide data that is of value

to a specific group of users. The scope is confined to specific selected subjects.

They are usually implemented on low-cost departmental servers that are UNIX- or Windows/NT-based.

Virtual warehouse: A virtual warehouse is a set of views over operational

databases. For efficient query processing, only some of the possible summary

views may be materialized. A virtual warehouse is easy to build but requires excess capacity on operational database servers.

Page 16: Data mining


2.4 Metadata Repository

Metadata are data about data. When used in a data warehouse, metadata are the data that define warehouse objects. They are created for the data names and definitions of the given warehouse. Additional metadata are created and captured for time-stamping any extracted data, the source of the extracted data, and missing fields that have been added by data cleaning or integration processes. A metadata repository should contain the following:

A description of the structure of the data warehouse, which includes the

warehouse schema, view, dimensions, hierarchies, and derived data definitions,

as well as data mart locations and contents.

Operational metadata, which include data lineage, currency of data, and

monitoring information.

The algorithms used for summarization, which include measure and dimension

definition algorithms, data on granularity, partitions, subject areas, aggregation,

summarization, and predefined queries and reports.

Fig. 2.2: Data warehouse development - a high-level corporate data model is defined and then refined into data marts, distributed data marts, a multi-tier data warehouse, and an enterprise data warehouse.

Page 17: Data mining

The mapping from operational environment to the data warehouse, which

includes, source databases and their content, gateway descriptions, data

partitions, data extraction, cleaning, transformation rules and defaults, data

refresh and purging rules, and security.

Data related to system performance, which include indices and profiles that improve data access and retrieval performance, in addition to rules for the timing and scheduling of refresh, update, and replication cycles.

Business metadata, which include business terms and definitions, data ownership

information, and charging policies.

2.5 From data warehousing to data mining

Data warehouse usage:

There are three kinds of data warehouse applications:

Information processing: It supports querying, basic statistical analysis, and

reporting using cross tabs, tables, charts, or graphs. A current trend in data

warehouse information processing is to construct low-cost, web-based access tools that are integrated with web browsers.

Analytical processing: It supports basic OLAP operations, including slice and dice, drill-down, roll-up, and pivoting. It generally operates on historical data in both summarized and detailed forms. The major strength of on-line analytical processing over information processing is the multidimensional data analysis of data warehouse data (a small sketch of these operations follows this list).

Data mining: It supports knowledge discovery by finding hidden patterns and associations, constructing analytical models, performing classification and prediction, and presenting the mining results using visualization tools.
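To make the OLAP operations mentioned above concrete, the sketch below builds a tiny, invented sales table with pandas (assumed to be available) and shows a detailed cross-tabulation, a roll-up along the time hierarchy, and a slice on one year. It only illustrates the operations; it is not the interface of any particular OLAP server.

# Toy illustration of OLAP-style operations (detailed view, roll-up, slice)
# on an invented sales table, using pandas pivot tables.
import pandas as pd

sales = pd.DataFrame({
    "year":    [2003, 2003, 2003, 2004, 2004, 2004],
    "quarter": ["Q1", "Q1", "Q2", "Q1", "Q2", "Q2"],
    "region":  ["east", "west", "east", "west", "east", "west"],
    "amount":  [100, 150, 120, 90, 200, 170],
})

# Detailed, multidimensional view: amount by (year, quarter) x region.
print(sales.pivot_table(values="amount", index=["year", "quarter"],
                        columns="region", aggfunc="sum"))

# Roll-up: climb the time hierarchy from quarter to year.
print(sales.pivot_table(values="amount", index="year",
                        columns="region", aggfunc="sum"))

# Slice: fix one dimension (year = 2004) and inspect the resulting sub-cube.
print(sales[sales["year"] == 2004]
      .pivot_table(values="amount", index="quarter",
                   columns="region", aggfunc="sum"))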

2.6 From On-line Analytical Processing to On-line Analytical Mining

Among many different paradigms and architectures of data mining systems, on-line

analytical mining(OLAM), which integrates on-line analytical processing(OLAP) with

data mining and mining knowledge in multidimensional databases, is particularly

important for the following reasons:

High quality of data in data warehouses

Available information processing infrastructure surrounding data warehouses.

OLAP based exploratory data analysis

On-line selection of data mining functions.

2.6.1 Architecture for On-line Analytical Mining

Page 18: Data mining


An OLAM server performs analytical mining in data cubes in a similar manner as an OLAP server performs on-line analytical processing. An integrated OLAM and OLAP architecture is shown in Fig. 2.3, where the OLAM and OLAP servers both accept user on-line queries via a graphical user interface API and work with the data cube in the data analysis via a cube API. A metadata directory is used to guide the access of the data cube.

The data cube can be constructed by accessing and/or integrating multiple databases via an MDDB API and/or by filtering a data warehouse via a database API that may support OLE DB or ODBC connections.

Fig. 2.3: An OLAM architecture - Layer 1 is the data repository (databases and a data warehouse, reached through data cleaning, integration and filtering via a database API); Layer 2 is the MDDB with its metadata, reached via a data cube API; Layer 3 holds the OLAM and OLAP engines; Layer 4 is the user interface (a GUI API through which mining queries are issued and mining results returned).

Page 19: Data mining

3.0 Mining Association Rules in Large Databases

Association rule mining searches for interesting relationships among items in a given data

set. The basic concepts of mining association are:

Let J = {i1, i2, …, im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T is a subset of J. Each transaction is associated with an identifier, called TID. Let A be a set of items. A transaction T is said to contain A if and only if A is a subset of T. An association rule is an implication of the form A => B, where A is a subset of J, B is a subset of J, and A ∩ B = ∅. The rule A => B holds in the transaction set D with support S, where S is the percentage of transactions in D that contain A ∪ B. This is taken to be the probability P(A ∪ B). The rule A => B has confidence C in the transaction set D if C is the percentage of transactions in D containing A that also contain B. This is taken to be the conditional probability P(B | A). That is,

Support(A => B) = P(A ∪ B)

Confidence(A => B) = P(B | A)

Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong. By convention, we write support and confidence values so that they lie between 0% and 100%, rather than 0.0 to 1.0.

A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset. This is also known, simply, as the frequency, support count, or count of the itemset. An itemset satisfies minimum support if the occurrence frequency of the itemset is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is therefore referred to as the minimum support count. If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.

Association rule mining is a two-step process:

Find all frequent itemsets.

Generate strong association rules from the frequent itemsets.

Page 20: Data mining

3.1 The Apriori Algorithm: Finding Frequent Itemsets Using

Candidate Generation

Apriori is an influential algorithm for mining frequent itemsets for Boolean association rules. The name of the algorithm is based on the fact that the algorithm uses prior knowledge of frequent itemset properties. Apriori employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets is found. This set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.

To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space. The Apriori property states that all nonempty subsets of a frequent itemset must also be frequent. This property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold, min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset (i.e., I ∪ A) cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, that is, P(I ∪ A) < min_sup.

This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well. It is called anti-monotone because the property is monotonic in the context of failing a test.

Let us look at how Lk-1 is used to find Lk. A two-step process is followed, consisting of join and prune actions.

1. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1. The notation li[j] refers to the jth item in li. By convention, Apriori assumes that items within a transaction or itemset are sorted in lexicographic order. The join of Lk-1 with Lk-1 is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) and (l1[2] = l2[2]) and ... and (l1[k-2] = l2[k-2]) and (l1[k-1] < l2[k-1]). The condition l1[k-1] < l2[k-1] simply ensures that no duplicates are generated. The resulting itemset formed by joining l1 and l2 is l1[1] l1[2] ... l1[k-1] l2[k-1].

2. The prune step: Ck is a superset of Lk; that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. A scan of the database to determine the count of each candidate in Ck would result in the determination of Lk. Ck, however,

Page 21: Data mining

can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset. Hence, if any (k-1)-subset of a candidate k-itemset is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. This subset testing can be done quickly by maintaining a hash tree of all frequent itemsets (a compact code sketch of the level-wise search follows).
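The sketch below condenses this level-wise search into self-contained Python. The transactions and the minimum support count are invented for illustration, and the hash-tree optimization is omitted in favour of a plain database scan at each level.

# Sketch of Apriori: find L1, then repeatedly join Lk-1 with itself (join step)
# and discard candidates that have an infrequent (k-1)-subset (prune step).
from itertools import combinations

def apriori(transactions, min_sup_count):
    # Returns every frequent itemset (as a frozenset) with its support count.
    def count_frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup_count}

    # L1: frequent 1-itemsets.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = count_frequent(items)
    result = dict(frequent)
    k = 2
    while frequent:
        prev = list(frequent)
        # Join step: form candidate k-itemsets from frequent (k-1)-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: drop candidates with an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = count_frequent(candidates)
        result.update(frequent)
        k += 1
    return result

db = [frozenset(t) for t in ({"a", "b", "c"}, {"a", "b"}, {"a", "c"},
                             {"b", "c"}, {"a", "b", "c"})]
print(apriori(db, min_sup_count=3))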

4.0 Classification and Prediction

4.1 What Is Classification?

Data classification is a two-step process. In the first step, a model is built describing a

predetermined set of data classes or concepts. The model is constructed by analyzing

database tuples described by attributes. Each tuple is assumed to belong to a predefined

class, as determined by one of the attributes, called the class label attribute. In the context

of classification, data tuples are also referred to as samples, examples, or objects. The

data tuples analyzed to build the model collectively form the training data set. The individual tuples making up the training set are referred to as training samples and are randomly selected from the sample population. Since the class label of each training sample is provided, this step is also known as supervised learning.

In the second step, the model is used for classification. First, the predictive accuracy of the model is estimated. The hold-out method is a simple technique that uses a test set of class-labeled samples. These samples are randomly selected and are independent of the training samples. The accuracy of a model on a given test set is the percentage of test-set samples that are correctly classified by the model. For each test sample, the known class label is compared with the learned model's class prediction for that sample. If the accuracy of the model were estimated based on the training data set, this estimate could be optimistic, since the learned model tends to overfit the data. Therefore, a test set is used.
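A minimal sketch of the hold-out estimate is given below. It assumes scikit-learn is available and uses its bundled iris sample data with a k-nearest-neighbour classifier purely for illustration; the report itself does not prescribe a particular classifier or data set.

# Hold-out method: train on one random part of the labeled samples and
# estimate accuracy on an independent, held-out test set.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)   # test samples independent of training

model = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
print("test-set accuracy:", accuracy_score(y_test, model.predict(X_test)))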

Page 22: Data mining

4.2 What is prediction?

Prediction can be viewed as the construction and use of a model to assess the class of an unlabeled sample, or to assess the value or value range of an attribute that a given sample is likely to have.

4.3 Classification by decision tree induction

A decision tree is a flow-chart-like tree structure, where each internal node denotes a test

on an attribute, each branch represents an outcome of the test, and leaf nodes represent

classes or class distributions. The topmost node in the tree is the root node.

In order to classify an unknown sample, the attribute-values of the sample are tested

against the decision tree. A path is traced from the root to a leaf node that holds the class

prediction for that sample. Decision trees can easily be converted to classification rules.

We describe a basic algorithm for learning decision trees.

Algorithm: Generate_decision_tree. Generate a decision tree from the given training

data.

Input: the training samples, samples, represented by discrete-valued attributes; the set of candidate attributes, attribute-list.

Output: a decision tree.

Method:

(1) create a node N;

(2) if samples are all of the same class, C, then

(3) return N as a leaf node labeled with the class C;

(4) if attribute-list is empty then

(5) return N as a leaf node labeled with the most common class in samples; // majority voting

(6) select test-attribute, the attribute among attribute-list with the highest information gain;

(7) label node N with test-attribute;

(8) for each known value ai of test-attribute // partition the samples

(9) grow a branch from node N for the condition test-attribute = ai;

(10) let si be the set of samples in samples for which test-attribute = ai;

(11) if si is empty then

Page 23: Data mining

(12) attach a leaf labeled with the most common class in samples;

(13) else attach the node returned by Generate_decision_tree(si, attribute-list minus test-attribute);
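As an illustrative counterpart to the pseudocode above, the sketch below grows a small decision tree with scikit-learn (assumed to be available) on its bundled iris sample data and prints the tree as readable rules, echoing the point that a decision tree converts easily into classification rules. The entropy criterion roughly corresponds to the information-gain selection in step (6); the data set and parameters are illustrative choices, not part of the report.

# Grow a decision tree and print it as if-then classification rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
tree = DecisionTreeClassifier(criterion="entropy",  # information-gain-style split selection
                              max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Each root-to-leaf path is one classification rule.
print(export_text(tree, feature_names=list(iris.feature_names)))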

REFERENCES

DATA MINING: CONCEPTS AND TECHNIQUES

- JIAWEI HAN

- MICHELINE KAMBER

PUBLISHERS: Morgan Kaufmann Publishers