knowledge discovery and data mining(kdd)
TRANSCRIPT
-
7/23/2019 knowledge discovery and data mining(kdd)
1/52
Knowledge Discovery
and Data Mining (KDD)
-
Knowledge Discovery & Data Mining
process of extracting previously unknown, valid, and actionable (understandable) information from large databases
Data mining is a step in the KDD process of applying data analysis and discovery algorithms
Machine learning, pattern recognition, statistics, databases, data visualization.
Traditional techniques may be inadequate
  - large data
-
Why Mine Data?
Huge amounts of data being collected and warehoused
  - Walmart records 20 million per day
  - health care transactions: multi-gigabyte databases
  - Mobil Oil: geological data of over 100 terabytes
Affordable computing
Competitive pressure
  - gain an edge by providing improved, customized services
  - information as a product in its own right
-
Knowledge Discovery Process
  - Data mining: the core of knowledge discovery process.
[Figure: KDD process flow. Labels: Data Cleaning; Data Integration; Databases; Preprocessed Data; Task-relevant Data; Data transformations; Selection; Data Mining; Knowledge Interpretation]
-
Knowledge Discovery Process
Goal
  - understanding the application domain, and goals of KDD effort
Data selection, acquisition, integration
Data cleaning
  - noise, missing data, outliers, etc.
Exploratory data analysis
  - dimensionality reduction, transformations
  - selection of appropriate model for analysis, hypotheses to test
Data mining
  - selecting appropriate method that matches set goals (classification, regression, clustering, etc.)
  - selecting algorithm
Testing and verification
Interpretation
Consolidation and use
-
[Figure: chart titled "Effort for each data-mining process step". Vertical axis: 0 to 100. Steps: Business Objective Determination; Data Preparation; Data Mining; Analysis of Results and Knowledge Assimilation]
-
Issues and challenges
large data
  - number of variables (features), number of cases (examples)
  - multi-gigabyte, terabyte databases
  - efficient algorithms, parallel processing
high dimensionality
  - large number of features: exponential increase in search space
  - potential for spurious patterns
  - dimensionality reduction
Overfitting
  - models noise in training data, rather than just the general patterns
Changing data, missing and noisy data
Use of domain knowledge
  - utilizing knowledge on complex data relationships, known facts
Understandability of patterns
-
Data Mining
Prediction Methods
  - using some variables to predict unknown or future values of other variables
Descriptive Methods
  - finding human-interpretable patterns describing the data
-
Data Mining Tasks
Classification
Clustering
Association
-
Classification
Data defined in terms of attributes, one of which is the class
Find a model for class attribute as a function of the values of other (predictor) attributes, such that previously unseen records can be assigned a class as accurately as possible.
Training data: used to build the model
Test data: used to validate the model (determine accuracy of the model)
Given data is usually divided into training and test sets.
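The training/test division described above can be sketched as follows. This is a minimal illustration, not the method used in the slides: the toy records, the 25% hold-out fraction, and the nearest-neighbour rule standing in for "a model" are all assumptions made for the example.

```python
import random

def split_train_test(records, test_fraction=0.25, seed=0):
    """Shuffle the labelled records and hold out a fraction as the test set."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(train, features):
    """Assign the class of the closest training record (1-nearest neighbour)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = min(train, key=lambda rec: sq_dist(rec[0], features))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test records whose predicted class matches the true class."""
    correct = sum(predict_1nn(train, feats) == label for feats, label in test)
    return correct / len(test)

# Hypothetical toy data: (predictor attributes, class attribute)
records = [
    ((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.1, 0.9), "A"), ((0.9, 1.1), "A"),
    ((5.0, 5.2), "B"), ((5.1, 4.9), "B"), ((4.8, 5.0), "B"), ((5.2, 5.1), "B"),
]
train, test = split_train_test(records)
print(len(train), len(test), accuracy(train, test))  # 6 2 1.0
```

The model is built only from `train`; `test` is touched only to measure accuracy, which is the point of the split.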
-
Classification: Example
-
Classification: Direct Marketing
Goal
-
Classification: Fraud detection
Goal: Predict fraudulent cases in credit card transactions.
Data
  - Use credit card transactions and information on its account holder as input variables
  - label past transactions as fraud or fair.
Learn a model for the class of transactions
Use the model to detect fraud by observing credit card transactions on a given account.
-
Clustering
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - data points in one cluster are more similar to one another
  - data points in separate clusters are less similar to one another.
Similarity measures
  - Euclidean distance if attributes are continuous
  - Problem-specific measures
-
Clustering: Market Segmentation
Goal: subdivide a market into distinct subsets of customers where any subset may conceivably be selected as a market target to be reached with a distinct marketing mix.
Approach
  - collect different attributes on customers based on geographical and lifestyle-related information
  - identify clusters of similar customers
  - measure the clustering quality by observing buying patterns of customers in same cluster vs. those from different clusters.
-
Association Rule Discovery
Given a set of records, each of which contains some number of items from a given collection
  - produce dependency rules which will predict occurrence of an item based on occurrences of other items
-
Association Rules: Application
Marketing and Sales Promotion
Consider discovered rule
{Bagels, ...} --> {Potato Chips}
  - Potato Chips as consequent: can be used to determine what may be done to boost sales
  - Bagels as an antecedent: can be used to see which products may be affected if bagels are discontinued
  - Can be used to see which products should be sold with Bagels to promote sale of Potato Chips
-
Association Rules: Application
Supermarket shelf management
Goal: to identify items which are bought together (by sufficiently many customers)
Approach: process point-of-sale data (collected with barcode scanners) to find dependencies among items.
Example
  - If a customer buys Diapers and Milk, then he is very likely to buy Beer
  - so stack six-packs next to diapers?
-
Visualization
complement to other DM techniques like Segmentation, etc.
-
Data Mining in CRM: Customer Life Cycle
Customer Life Cycle
  - The stages in the relationship between a customer and a business
Key stages in the customer lifecycle
  - Prospects: people who are not yet customers but are in the target market
  - Responders: prospects who show an interest in a product or service
  - Active Customers: people who are currently using the product or service
  - Former Customers: may be "bad" customers who did not pay their bills or who incurred high costs
It's important to know life cycle events (e.g. retirement)
-
Data Mining in CRM: Customer Life Cycle
What marketers want: Increasing customer revenue and customer profitability
  - Up-sell
  - Cross-sell
  - Keeping the customers for a longer period of time
Solution: Applying data mining
-
Data Mining in CRM
DM helps to
  - Determine the behavior surrounding a particular lifecycle event
  - Find other people in similar life stages and determine which customers are following similar behavior patterns
-
Data Mining in CRM (cont.)
[Figure: diagram with labels Data Warehouse; Data Mining; Campaign Management; Customer Profile; Customer Life Cycle Info.]
-
Data Mining Techniques
Data Mining Techniques
Descriptive
  - Clustering
  - Association
Predictive
  - Classification
-
Predictive Data Mining
[Figure: labelled faces. Honest: Tridas, Vickie, Mike. Crooked: Barney, Waldo, Wally.]
-
Prediction
Tridas Vickie Mike
Honest = has round eyes and a smile
-
Decision Trees
Data
height  hair   eyes   class
short   blond  blue   A
tall    blond  brown  B
tall    red    blue   A
short   dark   blue   B
tall    dark   blue   B
tall    blond  blue   A
tall    dark   brown  B
short   blond  brown  B
-
Decision Trees (cont.)
hair
  dark
    short, blue = B
    tall, blue = B
    tall, brown = B
  red
    tall, blue = A
  blond
    short, blue = A
    tall, brown = B
    tall, blue = A
    short, brown = B
Completely classifies dark-haired and red-haired people
Does not completely classify blonde-haired people.
More work is required
-
Decision Trees (cont.)
hair
  dark
    short, blue = B
    tall, blue = B
    tall, brown = B
  red
    tall, blue = A
  blond
    short, blue = A
    tall, brown = B
    tall, blue = A
    short, brown = B
    eye
      blue
        short = A
        tall = A
      brown
        tall = B
        short = B
Decision tree is complete because
1. All 8 cases appear at nodes
2. At each node, all cases are in the same class (A or B)
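The two completeness conditions can be checked mechanically. The sketch below re-enters the eight cases from the data slide and tests whether each node is "pure" (all cases in one class); the helper names `split` and `is_pure` are illustrative choices, not part of the original deck.

```python
from collections import defaultdict

# The eight cases from the data slide: (height, hair, eyes, class)
CASES = [
    ("short", "blond", "blue", "A"),
    ("tall", "blond", "brown", "B"),
    ("tall", "red", "blue", "A"),
    ("short", "dark", "blue", "B"),
    ("tall", "dark", "blue", "B"),
    ("tall", "blond", "blue", "A"),
    ("tall", "dark", "brown", "B"),
    ("short", "blond", "brown", "B"),
]
HAIR, EYES, CLASS = 1, 2, 3  # column positions within a case

def split(cases, attribute):
    """Partition the cases by the value of one attribute."""
    groups = defaultdict(list)
    for case in cases:
        groups[case[attribute]].append(case)
    return dict(groups)

def is_pure(cases):
    """True when every case at the node belongs to the same class."""
    return len({case[CLASS] for case in cases}) == 1

# First split on hair: dark and red are pure, blond is still mixed.
by_hair = split(CASES, HAIR)
for value, group in by_hair.items():
    print(value, "pure" if is_pure(group) else "mixed")

# Second split, blond cases only, on eye colour: both branches are pure.
by_eyes = split(by_hair["blond"], EYES)
for value, group in by_eyes.items():
    print("blond +", value, "pure" if is_pure(group) else "mixed")
```

Splitting on height first would not separate the classes as cleanly, which is why the slides start from hair.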
-
Decision Trees: Learned Predictive Rules
hair
  dark: B
  red: A
  blond: eyes
    blue: A
    brown: B
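Read as code, the learned rules amount to a short chain of conditions. The function below is a sketch of that reading, checked against the eight training cases from the data slide (height is omitted because the tree never tests it); the function name `classify` is an assumption for illustration.

```python
def classify(hair, eyes):
    """Apply the learned rules: test hair first, then eyes for blond hair only."""
    if hair == "dark":
        return "B"
    if hair == "red":
        return "A"
    if hair == "blond":
        return "A" if eyes == "blue" else "B"
    raise ValueError(f"no rule for hair colour {hair!r}")

# (hair, eyes, class) for the eight cases on the data slide
TRAINING_CASES = [
    ("blond", "blue", "A"), ("blond", "brown", "B"), ("red", "blue", "A"),
    ("dark", "blue", "B"), ("dark", "blue", "B"), ("blond", "blue", "A"),
    ("dark", "brown", "B"), ("blond", "brown", "B"),
]

# The rules reproduce the class of every training case.
print(all(classify(hair, eyes) == cls for hair, eyes, cls in TRAINING_CASES))  # True
```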
-
Decision Trees: Another Example
[Figure: decision tree. Node labels as far as recoverable: Total list; 50% member; 0-1 child; 2-3 child; 20% member; 4 children; one further label is illegible]
-
Rule Induction
Try to find rules of the form
IF <left-hand-side> THEN <right-hand-side>
  - This is the reverse of a rule-based agent, where the rules are given and the agent must act. Here the actions are given and we have to discover the rules!
Prevalence = probability that LHS and
-
Association Rules from Market Basket Analysis
{Dairy-Milk-Refrigerated} --> {Soft Drinks Carbonated}
  - prevalence = 4.99%, predictability = 22.89%
{Dry Dinners - Pasta} --> {Soup-Canned}
  - prevalence = 0.94%, predictability = 28.14%
{Dry Dinners - Pasta} --> {Cereal - Ready to Eat}
  - prevalence = 1.36%, predictability = 41.02%
{Cheese Slices} --> {Cereal - Ready to Eat}
  - prevalence = 1.16%, predictability = 38.01%
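To make the two percentages concrete, here is a small sketch of how such figures could be computed from raw baskets. It assumes that prevalence means the fraction of all baskets containing both sides of the rule (often called support) and that predictability means the fraction of LHS baskets that also contain the RHS (often called confidence); the deck's own definitions are cut off in the transcript, so treat these as assumptions. The baskets are invented toy data.

```python
def rule_stats(baskets, lhs, rhs):
    """Return (prevalence, predictability) for the rule lhs --> rhs.

    Assumed definitions:
      prevalence     = baskets containing LHS and RHS / all baskets
      predictability = baskets containing LHS and RHS / baskets containing LHS
    """
    lhs, rhs = set(lhs), set(rhs)
    n_lhs = sum(1 for basket in baskets if lhs <= basket)
    n_both = sum(1 for basket in baskets if (lhs | rhs) <= basket)
    prevalence = n_both / len(baskets)
    predictability = n_both / n_lhs if n_lhs else 0.0
    return prevalence, predictability

# Hypothetical baskets, each a set of purchased items
baskets = [
    {"milk", "soda", "bread"},
    {"milk", "soda"},
    {"milk", "bread"},
    {"pasta", "soup"},
    {"milk"},
]
prev, pred = rule_stats(baskets, {"milk"}, {"soda"})
print(f"prevalence = {prev:.2%}, predictability = {pred:.2%}")
# prevalence = 40.00%, predictability = 50.00%
```

A rule can have low prevalence but high predictability, as in the cereal rules on the slide: few baskets contain the pair, yet the LHS is a fairly strong signal when it does occur.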
-
Use of Rule Associations
Coupons, discounts
  - Don't give discounts on 2 items that are frequently bought together. Use the discount on 1 to "pull" the other
Product placement
  - Offer correlated products to the customer at the same time.
  - [...] VCR purchasers 2-3 months after VCR purchase
Discovery of patterns
  - People who bought X, Y and Z (but not any pair) bought W over half the time
-
Finding Rule Associations Algorithm
Example: grocery shopping
For each item, count of occurrences (say out of 100,000)
  apples 1891, caviar 3, ice cream 1088, ...
Drop the ones that are below a minimum support level
  apples 1891, ice cream 1088, pet food 2451, ...
Make a table of each item against each other item:
             apples  ice cream  pet food
  apples     1891    785        24
  ice cream  -----   1088       322
  pet food   -----   -----      2451
Discard cells below support threshold. Now make a cube for triples, etc. Add 1 dimension for each product on LHS.
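The first two passes of this procedure (count single items, drop the infrequent ones, then count pairs among the survivors) can be sketched as below. The function name, the toy baskets, and the support threshold of 2 are assumptions for illustration; the counts on the slide come from a much larger data set.

```python
from collections import Counter
from itertools import combinations

def frequent_items_and_pairs(baskets, min_support):
    """Count items, keep those meeting min_support, then count pairs of kept items."""
    # Pass 1: number of baskets in which each item occurs.
    item_counts = Counter(item for basket in baskets for item in set(basket))
    frequent = {item for item, n in item_counts.items() if n >= min_support}

    # Pass 2: the item-against-item table, restricted to frequent items.
    pair_counts = Counter()
    for basket in baskets:
        kept = sorted(set(basket) & frequent)
        pair_counts.update(combinations(kept, 2))

    # Discard cells below the support threshold.
    frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
    return {item: item_counts[item] for item in frequent}, frequent_pairs

# Hypothetical baskets
baskets = [
    ["apples", "ice cream"],
    ["apples", "ice cream", "pet food"],
    ["apples", "pet food"],
    ["ice cream", "caviar"],
    ["apples", "ice cream"],
]
items, pairs = frequent_items_and_pairs(baskets, min_support=2)
print(items)
print(pairs)
```

Here caviar is dropped after the first pass, so no pair involving caviar is ever counted; extending to triples would repeat the same idea over the surviving pairs.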
-
Clustering
The art of finding groups in data
Objective: gather items from a database into sets according to (unknown) common characteristics
Much more difficult than classification since the classes are not known in advance (no training)
Technique: unsupervised learning
-
The K-Means Clustering Method
[Figure: scatter-plot panels, both axes 0 to 10, showing successive iterations for K=2]
K=2
Arbitrarily choose K objects as initial cluster centers
Assign each of the objects to the most similar center
Update the cluster means
Reassign
Update the cluster means
Reassign
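The assign/update loop listed above can be sketched in a few lines. This is a minimal version under stated assumptions: similarity is Euclidean distance, the initial centers are passed in by the caller (here simply the first two points, standing in for an arbitrary choice), and the loop stops when no object changes cluster. The sample points are invented.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Basic k-means; `centers` holds the K initial cluster centers."""
    centers = [tuple(c) for c in centers]
    assignment = None
    for _ in range(max_iter):
        # Assign each object to the most similar (nearest) center.
        new_assignment = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # no object was reassigned, so the clustering is stable
        assignment = new_assignment
        # Update each cluster mean; an empty cluster keeps its old center.
        for k in range(len(centers)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:
                centers[k] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Hypothetical 2-D points forming two loose groups, K=2
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centers, labels = kmeans(points, centers=points[:2])
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)
```

Both initial centers sit in the same group here, so the first assignment is poor; one update and one reassignment are enough to recover the two groups, mirroring the "update, reassign" sequence on the slide.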
-
Opinion Analysis
Word-of-mouth on the Web
The Web has dramatically changed the way that consumers express their opinions.
One can post reviews of products at merchant sites, Web forums, discussion groups, blogs
Techniques are being developed to exploit these sources.
Benefits of
-
Feature Based Analysis & Summarization
Extracting product features (called Opinion Features) that have been commented on by customers.
Identifying opinion sentences in each review and deciding whether each opinion sentence is positive or negative.
Summarizing and comparing results.
-
Sentiment Analysis and opinion mining
-
Introduction
Two main types of textual information.
  - Facts and Opinions
Note: factual statements can imply opinions too.
Most current text information processing methods (e.g., web search, text mining) work with factual information.
Sentiment analysis or opinion mining
  - computational study of opinions, sentiments and emotions expressed in text.
Why opinion mining now? Mainly because of the Web; huge volumes of opinionated text.
-
Introduction - user-generated media
Importance of opinions
  - Opinions are important because whenever we need to make a decision, we want to hear others' opinions.
  - In the past,
    - Individuals: opinions from friends and family
    - Businesses: surveys, focus groups, consultants ...
Word-of-mouth on the Web
  - User-generated media: One can express opinions on anything in reviews, forums, discussion groups, blogs ...
  - Opinions of global scale: No longer limited to
    - Individuals: one's circle of friends
    - Businesses: Small scale surveys, tiny focus groups, etc.
-
A Fascinating Problem!
Intellectually challenging & major applications.
  - A popular research topic in recent years in NLP and Web data mining.
  - 20-60 companies in US alone
It touches every aspect of NLP and yet is restricted and confined.
  - Little research in NLP/Linguistics in the past.
Potentially a major technology from NLP.
  - But "not yet" and not easy!
  - Data sourcing and data integration are hard too!
-
An Example Review
"I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, and wanted me to return it to the shop. ..."
What do we see?
  - Opinions, targets of opinions, and opinion holders
-
Target Object (Liu, Web Data Mining book, 2007)
Definition (object): An object o is a product, person, event, organization, or topic. o is represented as
  - a hierarchy of components, sub-components, and so on.
  - Each node represents a component and is associated with a set of attributes of the component.
An opinion can be expressed on any node or attribute of the node.
To simplify our discussion, we use the term features to represent both components and attributes.
-
What is an Opinion? (Liu, a Ch. in NLP handbook)
An opinion is a quintuple (o_j, f_jk, so_ijkl, h_i, t_l), where
  - o_j is a target object.
  - f_jk is a feature of the object o_j.
  - so_ijkl is the sentiment value of the opinion of the opinion holder h_i on feature f_jk of object o_j at time t_l. so_ijkl is +ve, -ve, or neu, or a more granular rating.
  - h_i is an opinion holder.
  - t_l is the time when the opinion is expressed.
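As a data structure, the quintuple maps naturally onto a small record type. The sketch below is one possible encoding, assuming the three coarse sentiment values only (the slide also allows a more granular rating); the class name, field names, and the example values, including the placeholder time, are illustrative.

```python
from dataclasses import dataclass

ALLOWED_SENTIMENTS = {"+ve", "-ve", "neu"}

@dataclass(frozen=True)
class Opinion:
    """One opinion quintuple (o_j, f_jk, so_ijkl, h_i, t_l)."""
    target: str     # o_j: the target object
    feature: str    # f_jk: a feature of the object
    sentiment: str  # so_ijkl: "+ve", "-ve" or "neu" in this sketch
    holder: str     # h_i: the opinion holder
    time: str       # t_l: when the opinion was expressed

    def __post_init__(self):
        if self.sentiment not in ALLOWED_SENTIMENTS:
            raise ValueError(f"unknown sentiment value: {self.sentiment!r}")

# Hypothetical example; the time value is a placeholder.
op = Opinion(target="iPhone", feature="touch screen",
             sentiment="+ve", holder="reviewer", time="t1")
print(op)
```

Once text has been reduced to records like this, ordinary grouping and counting tools can be applied to it.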
-
Objective - structure the unstructured
Objective: Given an opinionated document,
  - Discover all quintuples (o_j, f_jk, so_ijkl, h_i, t_l), i.e., mine the five corresponding pieces of information in each quintuple, and
  - Or, solve some simpler problems
With the quintuples,
  - Unstructured Text --> Structured Data
    - Traditional data and visualization tools can be used to slice, dice and visualize the results in all kinds of ways
    - Enable qualitative and quantitative analysis.
-
Sentiment Classification: doc-level (Pang and Lee et al 2002 and Turney 2002)
Classify a document (e.g., a review) based on the overall sentiment expressed by opinion holder
  - Classes: Positive, or negative (and neutral)
In the model, (o_j, f_jk, so_ijkl, h_i, t_l), it assumes
  - Each document focuses on a single object and contains opinions from a single opinion holder.
  - It considers opinion on the object, o_j (or o_j = f_jk)
-
Subjectivity Analysis (Wiebe et al 2004)
Sentence-level sentiment analysis has two tasks
  - Subjectivity classification: Subjective or objective.
    - Objective: e.g., I bought an iPhone a few days ago.
    - Subjective: e.g., It is such a nice phone.
  - Sentiment classification: For subjective sentences or clauses, classify positive or negative.
    - Positive: It is such a nice phone.
However. (Liu, Chapter in NLP handbook)
  - subjective sentences ≠ +ve or -ve opinions
    - E.g., I think he came yesterday.
  - Objective sentence ≠ no opinion
    - Imply -ve opinion: My phone broke in the second day.
-
Feature-Based Sentiment Analysis
Sentiment classification at both document and sentence (or clause) levels is not sufficient,
  - they do not tell what people like and/or dislike
  - A positive opinion on an object does not mean that the opinion holder likes everything.
  - A negative opinion on an object does not mean ...
Objective: Discovering all quintuples (o_j, f_jk, so_ijkl, h_i, t_l)
With all quintuples, all kinds of analyses become possible.
-
Feature-Based Opinion Summary (Hu & Liu, KDD-2004)
"I bought an iPhone a few days ago. It was such a nice phone. The touch screen was really cool. The voice quality was clear too. Although the battery life was not long, that is ok for me. However, my mother was mad with me as I did not tell her before I bought the phone. She also thought the phone was too expensive, and wanted me to return it to the shop. ..."
...
Feature Based Summary:
Feature1: Touch screen
Positive: 212
  - The touch screen was really cool.
  - The touch screen was so easy to use and can do amazing things.
  ...
Negative: 6
  - The screen is easily scratched.
  - I have a lot of difficulty in removing finger marks from the touch screen.
  ...
Feature2: battery life
...
Note: We omit opinion holders
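A summary of this shape is essentially a group-by over (feature, polarity). The sketch below shows that grouping, assuming the feature and polarity of each sentence have already been determined by earlier steps; here they are tagged by hand, and the function name `summarize` is illustrative. The sentences are taken from the review and summary shown above.

```python
from collections import defaultdict

def summarize(opinions):
    """Group opinion sentences by feature, then by polarity."""
    summary = defaultdict(lambda: {"Positive": [], "Negative": []})
    for feature, polarity, sentence in opinions:
        summary[feature][polarity].append(sentence)
    return dict(summary)

# (feature, polarity, sentence); the tags are hand-assigned for illustration
opinions = [
    ("touch screen", "Positive", "The touch screen was really cool."),
    ("touch screen", "Negative", "The screen is easily scratched."),
    ("voice quality", "Positive", "The voice quality was clear too."),
    ("battery life", "Negative", "The battery life was not long."),
]

summary = summarize(opinions)
for feature, by_polarity in summary.items():
    print(f"{feature}: Positive {len(by_polarity['Positive'])}, "
          f"Negative {len(by_polarity['Negative'])}")
```

As in the slide, opinion holders are omitted; keeping them would only require adding one more field to each tuple.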
-
Visual Comparison (Liu et al. WWW-2005)
[Figure: two charts of positive (+) and negative (-) opinions per feature. Features: Voice, Screen, Size, Weight, Battery. Panels: "Summary of reviews of Cell Phone 1"; "Comparison of reviews of Cell Phone 1 and Cell Phone 2"]