
Chapter 1

Data Mining: A First View

Chapter Objectives

• Define data mining and understand how data mining can be used to solve problems.

• Understand that computers are best at learning concept definitions.

• Know when data mining should be considered as a possible problem-solving strategy.

• Understand that expert systems and data mining use different means to accomplish similar goals.

• Understand that supervised learning builds models by forming concept definitions from data containing predefined classes.

• Understand that unsupervised clustering builds models from data without the aid of predefined classes.

• Recognize that classification models are able to classify new data of unknown origin.

• Realize that data mining has been successfully applied to solve problems in several domains.


This chapter offers an introduction to the fascinating world of data mining and knowledge discovery. You will learn about the basics of data mining and how data mining has been applied to solve real-world problems. In Section 1.1 we provide a definition of data mining. Section 1.2 discusses how computers are best at learning concept definitions and how concept definitions can be transformed into useful patterns. Section 1.3 offers guidance in understanding the types of problems that may be appropriate for data mining. In Section 1.4 we introduce expert systems and explain how expert systems build models by extracting problem-solving knowledge from one or more human experts. In Section 1.5 we offer a simple process model for data mining. Section 1.6 explores why using a simple table search may not be an appropriate data mining strategy. In Section 1.7 we detail several data mining applications. We conclude this chapter, as well as all chapters in this book, with a short summary, key term definitions, and a set of exercises. Let's get started!

1.1 Data Mining: A Definition

We define data mining as the process of employing one or more computer learning techniques to automatically analyze and extract knowledge from data contained within a database. The purpose of a data mining session is to identify trends and patterns in data. Ray Kurzweil, "the father of voice-recognition software" and designer of the Kurzweil keyboard, recently stated that 98% of all human learning is "pattern recognition."

The knowledge gained from a data mining session is given as a model or generalization of the data. Several data mining techniques exist. However, all data mining methods use induction-based learning. Induction-based learning is the process of forming general concept definitions by observing specific examples of concepts to be learned. Here are three examples of knowledge gained through the process of induction-based learning:

• Did you ever wonder why so many televised golf tournaments are sponsored by online brokerage firms such as Charles Schwab and TD Waterhouse? A primary reason is that over 70% of all online investors are males over the age of 40 who play golf. In addition, 60% of all stock investors are golfers.

• Does it make sense for a music company to advertise rap music in magazines for senior citizens? It does when you learn that senior citizens often purchase rap music for their teenage grandchildren.

• Did you know that credit card companies can often suspect a stolen credit card, even if the card holder is unaware of the theft? Many credit card companies store a generalized model of your credit card purchasing habits. The model alerts the credit card company of a possible stolen card as soon as someone attempts a transaction that does not fit your general purchasing profile.

These three examples make it easy for us to see why data mining is fast becoming a preferred technique for extracting useful knowledge from data. Later in this chapter and throughout the text you will see additional examples of how data mining has been applied to solve real-world problems.

Knowledge Discovery in Databases (KDD) is a term frequently used interchangeably with data mining. Technically, KDD is the application of the scientific method to data mining. In addition to performing data mining, a typical KDD process model includes a methodology for extracting and preparing data as well as making decisions about actions to be taken once data mining has taken place. When a particular application involves the analysis of large volumes of data stored at several locations, data extraction and preparation become the most time-consuming parts of the discovery process. As data mining has become a popular name for the broader term, we do not concern ourselves with clearly discriminating between data mining and KDD. However, we do recognize the distinction and have devoted Chapter 5 to detailing the steps of two popular KDD process models.

1.2 What Can Computers Learn?

As the definition implies, data mining is about learning. Learning is a complex process. Four levels of learning can be differentiated (Merrill and Tennyson, 1977):

• Facts. A fact is a simple statement of truth.

• Concepts. A concept is a set of objects, symbols, or events grouped together because they share certain characteristics.

• Procedures. A procedure is a step-by-step course of action to achieve a goal. We use procedures in our everyday functioning as well as in the solution of difficult problems.

• Principles. Principles represent the highest level of learning. Principles are general truths or laws that are basic to other truths.

Computers are good at learning concepts. Concepts are the output of a data mining session. The data mining tool dictates the form of learned concepts. Common concept structures include trees, rules, networks, and mathematical equations. Tree structures and production rules are easy for humans to interpret and understand. Networks and mathematical equations are black-box concept structures in that the knowledge they contain is not easily understood. We will examine these and other data mining structures throughout the text. First, we take a look at three common concept views.

Three Concept Views

Concepts can be viewed from different perspectives. An understanding of each view will help you categorize the data mining techniques discussed in this text. Let's take a moment to define and illustrate each view.

The classical view attests that all concepts have definite defining properties. These properties determine if an individual item is an example of a particular concept. The classical view definition of a concept is crisp and leaves no room for misinterpretation. This view supports all examples of a particular concept as being equally representative of the concept. Here is a rule that employs a classical view definition of a good credit risk for an unsecured loan:

IF Annual Income >= 30,000
   & Years at Current Position >= 5
   & Owns Home = True
THEN Good Credit Risk = True

The classical view states that all three rule conditions must be met for the applicant to be considered a good credit risk.
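The crisp rule above can be sketched as a simple boolean test. This is only an illustrative sketch: the thresholds come from the rule in the text, while the function name and the Python rendering are our own.

```python
def is_good_credit_risk(annual_income, years_at_position, owns_home):
    """Classical view: membership requires every defining property to hold."""
    return (annual_income >= 30000
            and years_at_position >= 5
            and owns_home)

# Meeting all three conditions makes the applicant a good credit risk;
# failing any single condition excludes the applicant entirely.
print(is_good_credit_risk(35000, 6, True))   # True
print(is_good_credit_risk(35000, 4, True))   # False
```

Note how the classification is all-or-nothing: there is no notion of an applicant being a "partial" member of the concept.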

The probabilistic and exemplar views are similar in that neither requires concept representations to have defining properties. The probabilistic view holds that concepts are represented by properties that are probable of concept members. The assumption is that people store and recall concepts as generalizations created from individual exemplar (instance) observations. A probabilistic view definition of a good credit risk might look like this:

• The mean annual income for individuals who consistently make loan payments on time is $30,000.

• Most individuals who are good credit risks have been working for the same company for at least five years.

• The majority of good credit risks own their own home.

This definition offers general guidelines about the characteristics representative of a good credit risk. Unlike the classical view definition, this definition cannot be directly applied to achieve an answer about whether a specific person should be given an unsecured loan. However, the definition can be used to help with the decision-making process. The probabilistic view may also associate a probability of membership


with a specific classification. For example, a homeowner with an annual income of $27,000 employed at the same position for four years might be classified as a good credit risk with a probability of 0.85.
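One rough way to sketch the probabilistic view is to score an instance by the fraction of probable properties it satisfies. This scoring scheme is a simplistic stand-in of our own devising; a real probabilistic model (such as one producing the 0.85 figure above) would estimate membership probabilities from observed data.

```python
def membership_probability(instance, probable_properties):
    """Fraction of the concept's probable properties the instance satisfies.
    A real model would estimate this probability from observed data."""
    satisfied = sum(1 for holds in probable_properties if holds(instance))
    return satisfied / len(probable_properties)

# Probable properties of a good credit risk, paraphrased from the text.
properties = [
    lambda x: x["income"] >= 30000,         # near the mean income of on-time payers
    lambda x: x["years_at_position"] >= 5,  # long tenure with the same company
    lambda x: x["owns_home"],               # most good risks own their home
]

applicant = {"income": 27000, "years_at_position": 4, "owns_home": True}
print(membership_probability(applicant, properties))  # 1 of 3 properties hold
```

Unlike the classical rule, a low score here does not flatly reject the applicant; it simply feeds into the decision-making process.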

The exemplar view states that a given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept. The view attests that people store and recall likely concept exemplars that are then used to classify new instances. Consider the loan applicant described in the previous paragraph. The applicant would be classified as a good credit risk if the applicant were similar enough to one or more of the stored instances representing good credit risk candidates. Here is a possible list of exemplars considered to be good credit risks:

• Exemplar #1:
  Annual Income = 32,000
  Number of Years at Current Position = 6
  Homeowner

• Exemplar #2:
  Annual Income = 52,000
  Number of Years at Current Position = 16
  Renter

• Exemplar #3:
  Annual Income = 28,000
  Number of Years at Current Position = 12
  Homeowner

As with the probabilistic view, the exemplar view can associate a probability of concept membership with each classification.
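The exemplar view can be sketched as a comparison against stored instances. The exemplars below come from the list in the text, but the similarity metric (an average of crude attribute closeness scores) and the 0.8 threshold are arbitrary choices made for illustration, not part of any particular exemplar model.

```python
# Stored exemplars of good credit risks, taken from the list in the text.
exemplars = [
    {"income": 32000, "years": 6,  "homeowner": True},
    {"income": 52000, "years": 16, "homeowner": False},
    {"income": 28000, "years": 12, "homeowner": True},
]

def similarity(a, b):
    """Crude similarity: closeness of the numeric fields plus homeowner match.
    A real exemplar model would use a carefully chosen metric."""
    income_sim = 1 - min(abs(a["income"] - b["income"]) / 50000, 1)
    years_sim = 1 - min(abs(a["years"] - b["years"]) / 20, 1)
    home_sim = 1.0 if a["homeowner"] == b["homeowner"] else 0.0
    return (income_sim + years_sim + home_sim) / 3

def is_similar_to_exemplar(instance, threshold=0.8):
    """Exemplar view: a member if close enough to any stored exemplar."""
    return max(similarity(instance, e) for e in exemplars) >= threshold

# The applicant from the previous paragraph is close to Exemplar #1.
applicant = {"income": 27000, "years": 4, "homeowner": True}
print(is_similar_to_exemplar(applicant))  # True
```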

As we have seen, concepts can be studied from at least three points of view. In addition, concept definitions can be formed in several ways. Supervised learning is probably the best understood concept learning method and the most widely used technique for data mining. We introduce supervised learning in the next section.

Supervised Learning

When we are young, we use induction to form basic concept definitions. We see instances of concepts representing animals, plants, building structures, and the like. We hear the labels given to individual instances and choose what we believe to be the defining concept features (attributes) and form our own classification models. Later, we use the models we have developed to help us identify objects of similar structure.


The name for this type of learning is induction-based supervised concept learning or just supervised learning.

The purpose of supervised learning is two-fold. First, we use supervised learning to build classification models from sets of data containing examples and nonexamples of the concepts to be learned. Each example or nonexample is known as an instance of data. Second, once a classification model has been constructed, the model is used to determine the classification of newly presented instances of unknown origin. It is worth noting that, although model creation is inductive, applying the model to classify new instances of unknown origin is a deductive process.

To more clearly illustrate the idea of supervised learning, consider the hypothetical dataset shown in Table 1.1. The dataset is very small and is relevant for illustrative purposes only. The table data is displayed in attribute-value format where the first row shows names for the attributes whose values are contained in the table. The attributes sore throat, fever, swollen glands, congestion, and headache are possible symptoms experienced by individuals who have a particular affliction (a strep throat, a cold, or an allergy). These attributes are known as input attributes and are used to create a model to represent the data. Diagnosis is the attribute whose value we wish to predict. Diagnosis is known as the class or output attribute.

Starting with the second row of the table, each remaining row is an instance of data. An individual row shows the symptoms and affliction of a single patient. For example, the patient with ID = 1 has a sore throat, fever, swollen glands, congestion, and a headache. The patient has been diagnosed as having strep throat.

Table 1.1 • Hypothetical Training Data for Disease Diagnosis

Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
 1            Yes           Yes     Yes              Yes          Yes        Strep throat
 2            No            No      No               Yes          Yes        Allergy
 3            Yes           Yes     No               Yes          No         Cold
 4            Yes           No      Yes              No           No         Strep throat
 5            No            Yes     No               Yes          No         Cold
 6            No            No      No               Yes          No         Allergy
 7            No            No      Yes              No           No         Strep throat
 8            Yes           No      No               Yes          Yes        Allergy
 9            No            Yes     No               Yes          Yes        Cold
10            Yes           Yes     No               Yes          Yes        Cold


Suppose we wish to develop a generalized model to represent the data shown in Table 1.1. Even though this dataset is small, it would be difficult for us to develop a general representation unless we knew something about the relative importance of the individual attributes and possible relationships among the attributes. Fortunately, an appropriate supervised learning algorithm can do the work for us.

Supervised Learning: A Decision Tree Example

We presented the data in Table 1.1 to C4.5 (Quinlan, 1993), a supervised learning program that generalizes a set of input instances by building a decision tree. A decision tree is a simple structure where nonterminal nodes represent tests on one or more attributes and terminal nodes reflect decision outcomes. Decision trees have several advantages in that they are easy for us to understand, can be transformed into rules, and have been shown to work well experimentally. A supervised algorithm for creating a decision tree will be detailed in Chapter 3.

Figure 1.1 shows the decision tree created from the data in Table 1.1. The decision tree generalizes the table data. Specifically,

• If a patient has swollen glands, the diagnosis is strep throat.

• If a patient does not have swollen glands and has a fever, the diagnosis is a cold.

• If a patient does not have swollen glands and does not have a fever, the diagnosis is an allergy.

Figure 1.1 • A decision tree for the data in Table 1.1. [The root node tests Swollen Glands: the Yes branch leads to Diagnosis = Strep Throat; the No branch leads to a test on Fever, where Yes gives Diagnosis = Cold and No gives Diagnosis = Allergy.]

The decision tree tells us that we can accurately diagnose a patient in this dataset by concerning ourselves only with whether the patient has swollen glands and a fever. The attributes sore throat, congestion, and headache do not play a role in determining a diagnosis. As we can see, the decision tree has generalized the data and provided us with a summary of those attributes and attribute relationships important for an accurate diagnosis.
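Because only two attributes matter, the tree's generalization can be written directly as nested tests. This is a hand translation of Figure 1.1, not the output of C4.5 itself.

```python
def diagnose(swollen_glands, fever):
    """Mirror of the Figure 1.1 decision tree: only two attributes matter."""
    if swollen_glands:
        return "Strep throat"
    if fever:
        return "Cold"
    return "Allergy"

# Patient 1 from Table 1.1 has swollen glands, so the fever test is never reached.
print(diagnose(swollen_glands=True, fever=True))   # Strep throat
print(diagnose(swollen_glands=False, fever=False)) # Allergy
```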

The instances used to create the decision tree model are known as training data. At this point, the training instances are the only instances known to be correctly classified by the model. However, our model is useful to the extent that it can correctly classify new instances whose classification is not known. To determine how well the model is able to be of general use we test the accuracy of the model using a test set. The instances of the test set have a known classification. Therefore we can compare the test set instance classifications determined by the model with the correct classification values. Test set classification correctness gives us some indication about the future performance of the model.

Let's use the decision tree to classify the first two instances shown in Table 1.2.

• Since the patient with ID = 11 has a value of Yes for swollen glands, we follow the right link from the root node of the decision tree. The right link leads to a terminal node, indicating the patient has strep throat.

• The patient with ID = 12 has a value of No for swollen glands. We follow the left link and check the value of the attribute fever. Since fever equals Yes, we diagnose the patient with a cold.

Table 1.2 • Data Instances with an Unknown Classification

Patient ID#   Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
11            No            No      Yes              Yes          Yes        ?
12            Yes           Yes     No               No           Yes        ?
13            No            No      No               No           Yes        ?


We can translate any decision tree into a set of production rules. Production rules are rules of the form:

IF antecedent conditions
THEN consequent conditions

The antecedent conditions detail values or value ranges for one or more input attributes. The consequent conditions specify the values or value ranges for the output attributes. The technique for mapping a decision tree to a set of production rules is simple. A rule is created by starting at the root node and following one path of the tree to a leaf node. The antecedent of a rule is given by the attribute value combinations seen along the path. The consequent of the corresponding rule is the value at the leaf node. Here are the three production rules for the decision tree shown in Figure 1.1:

1. IF Swollen Glands = Yes
   THEN Diagnosis = Strep Throat

2. IF Swollen Glands = No & Fever = Yes
   THEN Diagnosis = Cold

3. IF Swollen Glands = No & Fever = No
   THEN Diagnosis = Allergy

Let's use the production rules to classify the table instance with patient ID = 13. Because swollen glands equals No, we pass over the first rule. Likewise, because fever equals No, the second rule does not apply. Finally, both antecedent conditions for the third rule are satisfied. Therefore we are able to apply the third rule and diagnose the patient as having an allergy.

Unsupervised Clustering

Unlike supervised learning, unsupervised clustering builds models from data without predefined classes. Data instances are grouped together based on a similarity scheme defined by the clustering system. With the help of one or several evaluation techniques, it is up to us to decide the meaning of the formed clusters.
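A full clustering system is beyond this chapter, but the core idea of grouping instances by similarity with no class labels can be sketched with a toy one-dimensional scheme of our own. The max_gap threshold here is an arbitrary illustrative choice, not part of any system discussed in the text.

```python
def cluster_by_threshold(values, max_gap):
    """Group sorted numeric values: start a new cluster whenever the gap
    to the previous value exceeds max_gap. No class labels are used;
    interpreting the resulting clusters is left to the analyst."""
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= max_gap:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return clusters

# Customer ages with no predefined classes: the scheme alone forms the groups.
print(cluster_by_threshold([23, 25, 24, 61, 59, 40], max_gap=10))
# [[23, 24, 25], [40], [59, 61]]
```

Notice that the output carries no labels such as "young" or "retired"; attaching meaning to each cluster is our job after the fact.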

To further distinguish between supervised learning and unsupervised clustering, consider the hypothetical data in Table 1.3. The table provides a sampling of information about five customers maintaining a brokerage account with Acme Investors Incorporated. The attributes customer ID, sex, age, favorite recreation, and income are self-explanatory. Account type indicates whether the account is held by a single person
