8/20/2019 An 15 DM Caracterizacion
1/38
Data Mining: Characterization
-
Concept Description: Characterization and Comparison
What is concept description?
Data generalization and summarization-based characterization
Analytical characterization: Analysis of attribute relevance
Mining class comparisons: Discriminating between different classes
Mining descriptive statistical measures in large databases
Summary
-
What is Concept Description?
Descriptive vs. predictive data mining
  Descriptive mining: describes concepts or task-relevant data sets in concise, summarative, informative, discriminative forms
  Predictive mining: based on data and analysis, constructs models for the database, and predicts the trend and properties of unknown data
Concept description:
  Characterization: provides a concise and succinct summarization of the given collection of data
  Comparison: provides descriptions comparing two or more collections of data
-
Data Generalization and Summarization-based Characterization
Data generalization
  A process which abstracts a large set of task-relevant data in a database from low conceptual levels to higher ones.
  [Figure: conceptual levels 1 to 5]
-
Attribute-Oriented Induction
Proposed in 1989 (KDD '89 workshop)
Not confined to categorical data nor particular measures.
How is it done?
  Collect the task-relevant data (initial relation) using a relational database query
  Perform generalization by attribute removal or attribute generalization.
  Apply aggregation by merging identical, generalized tuples and accumulating their respective counts.
  Interactive presentation with users.
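The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the three rows are borrowed from the Big-University example in this deck, while `MAJOR_HIERARCHY`, the GPA cut-offs in `generalize_gpa`, and the function names are hypothetical and not part of the original slides.

```python
from collections import Counter

# Hypothetical concept hierarchy: raw major -> higher-level concept.
MAJOR_HIERARCHY = {"CS": "Science", "Physics": "Science", "EE": "Engineering"}


def generalize_gpa(gpa):
    """Map a numeric GPA to a coarse grade band (cut-offs are assumed)."""
    if gpa >= 3.75:
        return "Excellent"
    if gpa >= 3.5:
        return "Very_good"
    return "Good"


def attribute_oriented_induction(rows):
    """Remove 'name', generalize 'major' and 'gpa', merge identical tuples."""
    counts = Counter()
    for row in rows:
        generalized = (
            row["gender"],                               # retained as-is
            MAJOR_HIERARCHY.get(row["major"], "Other"),  # attribute generalization
            generalize_gpa(row["gpa"]),                  # attribute generalization
        )
        counts[generalized] += 1                         # accumulate counts
    return counts


initial_relation = [
    {"name": "Jim Woodman", "gender": "M", "major": "CS", "gpa": 3.67},
    {"name": "Scott Lachance", "gender": "M", "major": "CS", "gpa": 3.70},
    {"name": "Laura Lee", "gender": "F", "major": "Physics", "gpa": 3.83},
]

generalized_relation = attribute_oriented_induction(initial_relation)
for generalized_tuple, count in generalized_relation.items():
    print(generalized_tuple, count)
```

The two male CS students collapse into a single generalized tuple with count 2, which is the aggregation step described above.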
-
Basic Principles of Attribute-Oriented Induction
Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher level concepts are expressed in terms of other attributes.
Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
Attribute-threshold control: typical 2-8, specified/default.
Generalized relation threshold control: control the final relation/rule size.
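The removal and generalization rules can be read as a small decision procedure. The sketch below is one possible reading, not the authors' implementation: the function name `plan_attribute`, its parameters, and the returned labels are all hypothetical.

```python
def plan_attribute(values, threshold, has_generalization_operator,
                   expressed_by_other_attributes=False):
    """Decide what to do with one attribute of the initial relation.

    values: the attribute's values in the task-relevant data
    threshold: attribute threshold (the slides mention 2-8 as typical)
    """
    distinct = len(set(values))
    if distinct <= threshold:
        return "retain"        # few distinct values: keep as-is
    if not has_generalization_operator or expressed_by_other_attributes:
        return "remove"        # attribute-removal rule
    return "generalize"        # attribute-generalization rule


print(plan_attribute(["Jim", "Scott", "Laura", "Ann"], 2, False))  # remove
print(plan_attribute(["M", "F", "M"], 2, False))                   # retain
print(plan_attribute(["CS", "Physics", "EE"], 2, True))            # generalize
```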
-
Example
Describe general characteristics of graduate students in the Big-University database
use Big_University_DB
mine characteristics as "Science_Students"
in relevance to name, gender, ma...
statement: Select name, gender, ma...
-
Class Characterization: An Example
Initial Relation
Name | Gender | Major | Birth-Place | Birth_date | Residence | Phone # | GPA
Jim Woodman | M | CS | Vancouver, BC, Canada | 8-12-76 | 3511 Main St., Richmond | 687-4598 | 3.67
Scott Lachance | M | CS | Montreal, Que, Canada | 28-7-75 | 345 1st Ave., Richmond | 253-9106 | 3.70
Laura Lee | F | Physics | Seattle, WA, USA | 25-8-70 | 125 Austin Ave., Burnaby | 420-5232 | 3.83
... | ... | ... | ... | ... | ... | ... | ...
Removed | Retained | Sci, Eng, Bus | Country | Age range | City | Removed | Excl, VG, ...
Prime Generalized Relation
Gender | Major | Birth_region | Age_range | Residence | GPA | Count
M | Science | Canada | 20-25 | Richmond | Very-good | 16
F | Science | Foreign | 25-30 | Burnaby | Excellent | 22
... | ... | ... | ... | ... | ... | ...
Crosstab: Gender by Birth_Region
Gender | Canada | Foreign | Total
M | 16 | 14 | 30
F | 10 | 22 | 32
Total | 26 | 36 | 62
-
Attribute Relevance Analysis
Why?
  Which dimensions should be included?
  How high level of generalization?
  Automatic vs. interactive
  Reduce # attributes; easy to understand patterns
What?
  statistical method for preprocessing data
    filter out irrelevant or weakly relevant attributes
    retain or rank the relevant attributes
  relevance related to dimensions and levels
  analytical characterization, analytical comparison
-
Attribute Relevance Analysis (cont'd)
How?
  Data Collection
  Analytical Generalization
    Use information gain analysis (e.g., entropy or other measures) to identify highly relevant dimensions and levels.
  Relevance Analysis
    Sort and select the most relevant dimensions and levels.
  Attribute-oriented Induction for class description
    On selected dimension/level
  OLAP operations (e.g. drilling, slicing) on relevance rules
-
Relevance Measures
Quantitative relevance measure determines the classifying power of an attribute within a set of data.
Methods
  information gain (ID3)
  gain ratio (C4.5)
  χ2 contingency table statistics
  uncertainty coefficient
-
Information-Theoretic Approach
Decision tree
  each internal node tests an attribute
  each branch corresponds to attribute value
  each leaf node assigns a classification
ID3 algorithm
  build decision tree based on training ob...
-
Top-Down Induction of Decision Tree
Attributes = {Outlook, Temperature, Humidity, Wind}
PlayTennis = {yes, no}
[Figure: decision tree]
Outlook
  sunny: Humidity
    high: no
    normal: yes
  overcast: yes
  rain: Wind
    strong: no
    weak: yes
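As an illustration of how such a tree classifies an example, the PlayTennis tree from this slide can be encoded as nested dictionaries and walked from the root to a leaf. The dictionary layout and the `classify` helper are assumptions made for this sketch; building the tree from training data (the ID3 step) is not shown.

```python
# Internal nodes test an attribute; branches are attribute values;
# leaves (plain strings) assign the PlayTennis class.
PLAY_TENNIS_TREE = {
    "attribute": "Outlook",
    "branches": {
        "sunny": {
            "attribute": "Humidity",
            "branches": {"high": "no", "normal": "yes"},
        },
        "overcast": "yes",
        "rain": {
            "attribute": "Wind",
            "branches": {"strong": "no", "weak": "yes"},
        },
    },
}


def classify(tree, example):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = example[node["attribute"]]
        node = node["branches"][value]
    return node


day = {"Outlook": "sunny", "Temperature": "hot", "Humidity": "high", "Wind": "weak"}
print(classify(PLAY_TENNIS_TREE, day))
```

Temperature never appears in the tree, so it plays no part in the decision even though it is one of the four attributes.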
-
Example: Analytical Characterization
Task
  Mine general characteristics describing graduate students using analytical characterization
Given
  attributes name, gender, major, birth_place, birth_date, phone#, and gpa
  Gen(ai) = concept hierarchies on ai
  Ui = attribute analytical thresholds for ai
  Ti = attribute generalization thresholds for ai
  R = attribute relevance threshold
-
Example: Analytical Characterization (cont'd)
1. Data collection
  target class: graduate student
  contrasting class: undergraduate student
2. Analytical generalization using Ui
  attribute removal
    remove name and phone#
  attribute generalization
    generalize major, birth_place, birth_date and gpa
    accumulate counts
  candidate relation: gender, major, birth_country, age_range and gpa
-
Example: Analytical Characterization (2)
gender | major | birth_country | age_range | gpa | count
M | Science | Canada | 20-25 | Very_good | 16
F | Science | Foreign | 25-30 | Excellent | 22
M | Engineering | Foreign | 25-30 | Excellent | 18
F | Science | Foreign | 25-30 | Excellent | 25
M | Science | Canada | 20-25 | Excellent | 21
F | Engineering | Canada | 20-25 | Excellent | 18
Candidate relation for Target class: Graduate students (Σ = 120)
gender | major | birth_country | age_range | gpa | count
M | Science | Foreign | <20 | Very_good | 18
F | Business | Canada | <20 | Fair | 20
M | Business | Canada | <20 | Fair | 22
F | Science | Canada | 20-25 | Fair | 24
M | Engineering | Foreign | 20-25 | Very_good | 22
F | Engineering | Canada | <20 | Excellent | 24
Candidate relation for Contrasting class: Undergraduate students (Σ = 130)
-
Example: Analytical Characterization (3)
3. Relevance analysis
  Calculate expected info required to classify an arbitrary tuple
    $I(s_1, s_2) = I(120, 130) = -\frac{120}{250}\log_2\frac{120}{250} - \frac{130}{250}\log_2\frac{130}{250} = 0.9988$
  Calculate entropy of each attribute: e.g. major
    For major = "Science": $s_{11} = 84$, $s_{21} = 42$, $I(s_{11}, s_{21})$ ...
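The expected-information formula is easy to check numerically. The helper below is a minimal sketch (the name `expected_info` is an assumption, not from the slides); it evaluates I for the two class totals and for the major = "Science" partition using the counts given on this slide.

```python
from math import log2


def expected_info(*class_counts):
    """I(s1, ..., sm) = -sum(p_i * log2(p_i)), skipping empty classes."""
    total = sum(class_counts)
    return -sum(c / total * log2(c / total) for c in class_counts if c > 0)


# 120 graduate vs. 130 undergraduate students.
print(round(expected_info(120, 130), 4))

# major = "Science": 84 graduates, 42 undergraduates.
print(round(expected_info(84, 42), 4))
```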
-
Example: Analytical Characterization (4)
  Calculate expected info required to classify a given sample if S is partitioned according to the attribute
    $E(\text{major}) = \frac{126}{250} I(s_{11}, s_{21}) + \frac{82}{250} I(s_{12}, s_{22}) + \frac{42}{250} I(s_{13}, s_{23}) = 0.7873$
  Calculate information gain for each attribute
    $\text{Gain}(\text{major}) = I(s_1, s_2) - E(\text{major}) = 0.2115$
  Information gain for all attributes
    Gain(gender) = 0.0003
    Gain(birth_country) = 0.0407
    Gain(major) = 0.2115
    Gain(gpa) = 0.4490
    Gain(age_range) = 0.5971
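The gains listed above can be recomputed from the two candidate relations of this example. The following is a sketch, not the original authors' code: the tuple layout, `ATTRIBUTES`, and the function names are assumptions. Results agree with the slide to about three decimal places; small differences come from rounding of intermediate values on the slide.

```python
from collections import defaultdict
from math import log2

ATTRIBUTES = ("gender", "major", "birth_country", "age_range", "gpa")

# Rows: (gender, major, birth_country, age_range, gpa, count)
GRADUATE = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]
UNDERGRADUATE = [
    ("M", "Science", "Foreign", "<20", "Very_good", 18),
    ("F", "Business", "Canada", "<20", "Fair", 20),
    ("M", "Business", "Canada", "<20", "Fair", 22),
    ("F", "Science", "Canada", "20-25", "Fair", 24),
    ("M", "Engineering", "Foreign", "20-25", "Very_good", 22),
    ("F", "Engineering", "Canada", "<20", "Excellent", 24),
]


def expected_info(counts):
    """I(s1, ..., sm) over a list of class counts, skipping empty classes."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(attribute_index):
    """Gain(A) = I(s1, s2) - E(A) for the attribute at the given column."""
    # For each attribute value: [graduate count, undergraduate count].
    per_value = defaultdict(lambda: [0, 0])
    for class_index, relation in enumerate((GRADUATE, UNDERGRADUATE)):
        for row in relation:
            per_value[row[attribute_index]][class_index] += row[-1]
    class_totals = [sum(r[-1] for r in GRADUATE), sum(r[-1] for r in UNDERGRADUATE)]
    total = sum(class_totals)
    entropy = sum(sum(c) / total * expected_info(c) for c in per_value.values())
    return expected_info(class_totals) - entropy


gains = {name: information_gain(i) for i, name in enumerate(ATTRIBUTES)}
for name, gain in sorted(gains.items(), key=lambda item: item[1]):
    print(f"Gain({name}) = {gain:.4f}")
```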
-
Example: Analytical Characterization (5)
4. Initial working relation (W0) derivation
  R = 0.1
  remove irrelevant/weakly relevant attributes from candidate relation => drop gender, birth_country
  remove contrasting class candidate relation
5. Perform attribute-oriented induction on W0 using Ti
major | age_range | gpa | count
Science | 20-25 | Very_good | 16
Science | 25-30 | Excellent | 47
Science | 20-25 | Excellent | 21
Engineering | 20-25 | Excellent | 18
Engineering | 25-30 | Excellent | 18
Initial target class working relation W0: Graduate students
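The derivation of W0 amounts to a projection followed by a merge. The sketch below assumes the graduate candidate relation and the gain values shown earlier in this example; the variable names are hypothetical. Attributes whose gain falls below R = 0.1 are dropped, and tuples that become identical have their counts added.

```python
from collections import Counter

ATTRIBUTES = ("gender", "major", "birth_country", "age_range", "gpa")

# Information gains as reported in this example.
GAINS = {"gender": 0.0003, "birth_country": 0.0407, "major": 0.2115,
         "gpa": 0.4490, "age_range": 0.5971}
R = 0.1  # attribute relevance threshold

# Target-class candidate relation: (gender, major, birth_country, age_range, gpa, count)
GRADUATE = [
    ("M", "Science", "Canada", "20-25", "Very_good", 16),
    ("F", "Science", "Foreign", "25-30", "Excellent", 22),
    ("M", "Engineering", "Foreign", "25-30", "Excellent", 18),
    ("F", "Science", "Foreign", "25-30", "Excellent", 25),
    ("M", "Science", "Canada", "20-25", "Excellent", 21),
    ("F", "Engineering", "Canada", "20-25", "Excellent", 18),
]

# Keep only the columns whose gain reaches the threshold.
kept = [i for i, name in enumerate(ATTRIBUTES) if GAINS[name] >= R]

# Project each tuple onto the kept columns and merge identical tuples.
w0 = Counter()
for row in GRADUATE:
    w0[tuple(row[i] for i in kept)] += row[-1]

print([ATTRIBUTES[i] for i in kept])
for generalized_tuple, count in w0.items():
    print(generalized_tuple, count)
```

The two female Science rows (22 and 25) merge once gender and birth_country are gone, which gives the count of 47 in the working relation.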
-
Mining Class Comparisons
Comparison: Comparing two or more classes.
Method:
  Partition the set of relevant data into the target class and the contrasting class(es)
  Generalize both classes to the same high level concepts
  Compare tuples with the same high level descriptions
  Present for every tuple its description and two measures:
    support - distribution within single class
    comparison - distribution between classes
  Highlight the tuples with strong discriminant features
Relevance Analysis:
  Find attributes (features) which best distinguish different classes.
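One way to read the two measures is sketched below. It assumes that support is a tuple's share of its own class and that comparison is the target class's share of that tuple across both classes; this interpretation, the function name `class_comparison`, and all counts are hypothetical and only serve to make the idea concrete.

```python
def class_comparison(target, contrasting):
    """Return {description: (support, comparison)} for the target class.

    support    = count in target class / total count of the target class
    comparison = count in target class / count of the tuple in both classes
    """
    target_total = sum(target.values())
    measures = {}
    for description, count in target.items():
        support = count / target_total
        comparison = count / (count + contrasting.get(description, 0))
        measures[description] = (support, comparison)
    return measures


# Hypothetical generalized tuples: (birth_country, age_range, gpa) -> count
target_counts = {
    ("Canada", "25-30", "Good"): 40,
    ("Foreign", "25-30", "Excellent"): 160,
}
contrasting_counts = {
    ("Canada", "25-30", "Good"): 120,
    ("Canada", "15-20", "Fair"): 280,
}

measures = class_comparison(target_counts, contrasting_counts)
for description, (support, comparison) in measures.items():
    print(description, f"support={support:.2f}", f"comparison={comparison:.2f}")
```

A tuple with a comparison value close to 1 occurs almost only in the target class, so it is a candidate for a strong discriminant feature.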
-
Example: Analytical Comparison
Task
  Compare graduate and undergraduate students using discriminant rule.
  DMQL query
    use Big_University_DB
    mine comparison as "grad_vs_undergrad_students"
    in relevance to name, gender, major, birth_place, birth_date, residence, phone, gpa
    for "graduate_students"
    where status in "graduate"
    versus "undergraduate_students"
    where status in "undergraduate"
    analyze count%
    from student
-
Example: Analytical Comparison (2)
Given
  attributes name, gender, major, birth_place, birth_date, residence, phone, and gpa
  Gen(ai) = concept hierarchies on attributes ai
  Ui = attribute analytical thresholds for attributes ai
  Ti = attribute generalization thresholds for attributes ai
  R = attribute relevance threshold
-
Example: Analytical Comparison (3)
1. Data collection
  target and contrasting classes
2. Attribute relevance analysis
  remove attributes name, gender, major, phone#
3. Synchronous generalization
  controlled by user-specified dimension thresholds
  prime target and contrasting class(es) relations/cuboids
-
Example: Analytical Comparison (4)
Birth_country | Age_range | Gpa | Count%
Canada | 20-25 | Good | 5.53%
Canada | 25-30 | Good | 2.32%
Canada | Over_30 | Very_good | 5.86%
... | ... | ... | ...
Other | Over_30 | Excellent | 4.68%
Prime generalized relation for the target class: Graduate students
Birth_country | Age_range | Gpa | Count%
Canada | 15-20 | Fair | 5.53%
Canada | 15-20 | Good | 4.53%
... | ... | ... | ...
Canada | 25-30 | Good | 5.02%
... | ... | ... | ...
Other | Over_30 | Excellent | 0.68%
Prime generalized relation for the contrasting class: Undergraduate students
-
Example: Analytical Comparison (5)
4. Drill down, roll up and other OLAP operations on target and contrasting classes to ad...
-
Mining Data Dispersion Characteristics
Motivation
  To better understand the data: central tendency, variation and spread
Data dispersion characteristics
  median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
  Data dispersion: analyzed with multiple granularities of precision
  Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
  Folding measures into numerical dimensions
  Boxplot or quantile analysis on the transformed cube
-
Measuring the Central Tendency
Mean
  $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  Weighted arithmetic mean
  $\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
Median: A holistic measure
  Middle value if odd number of values, or average of the middle two values otherwise
  estimated by interpolation
  $\text{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\text{median}}}\right) c$
Mode
  Value that occurs most frequently in the data
  Unimodal, bimodal, trimodal
  Empirical formula:
  $\text{mean} - \text{mode} = 3 \times (\text{mean} - \text{median})$
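The measures on this slide map directly onto Python's standard `statistics` module. The sketch below uses a small hypothetical data set and hypothetical weights; `weighted_mean` is a helper written for this illustration.

```python
from statistics import mean, median, multimode

values = [2, 4, 4, 5, 7, 8]    # hypothetical observations
weights = [1, 1, 2, 2, 1, 1]   # hypothetical weights, one per observation


def weighted_mean(xs, ws):
    """Weighted arithmetic mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


print("mean:", mean(values))
print("weighted mean:", weighted_mean(values, weights))
print("median:", median(values))      # even count: average of the middle two
print("mode(s):", multimode(values))  # one entry means the data is unimodal
```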
-
Measuring the Dispersion of Data
Quartiles, outliers and boxplots
  Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  Inter-quartile range: IQR = Q3 - Q1
  Five number summary: min, Q1, M, Q3, max
  Boxplot: ends of the box are the quartiles, median is marked, whiskers, and plot outlier individually
  Outlier: usually, a value more than 1.5 x IQR above Q3 or below Q1
Variance and standard deviation
  Variance s^2: (algebraic, scalable computation)
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]$
  Standard deviation s is the square root of variance s^2
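A short sketch of these dispersion measures follows, assuming a hypothetical data set and the "inclusive" quartile method of `statistics.quantiles` (other quartile conventions give slightly different Q1 and Q3). The second form of the variance is computed from two running sums only, which is why it can be maintained incrementally over a large database.

```python
from statistics import quantiles, variance

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]  # hypothetical values with one extreme

# Quartiles and five-number summary.
q1, med, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1
five_number_summary = (min(data), q1, med, q3, max(data))

# Values more than 1.5 x IQR below Q1 or above Q3 are flagged as outliers.
outliers = [x for x in data if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr]

# Sample variance from the sums of x and x^2 (the algebraic form).
n = len(data)
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)
algebraic_variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
standard_deviation = algebraic_variance ** 0.5

print("five-number summary:", five_number_summary)
print("IQR:", iqr, "outliers:", outliers)
print("variance:", algebraic_variance, "std dev:", standard_deviation)
```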
-
Boxplot Analysis
Five-number summary of a distribution:
  Minimum, Q1, M, Q3, Maximum
Boxplot
  Data is represented with a box
  The ends of the box are at the first and third quartiles, i.e., the height of the box is IQR
  The median is marked by a line within the box
  Whiskers: two lines outside the box extend to Minimum and Maximum
-
A Boxplot
[Figure: a boxplot]
-
Summary
Concept description: characterization and discrimination
OLAP-based vs. attribute-oriented induction
Efficient implementation of AOI
Analytical characterization and comparison
Mining descriptive statistical measures in large databases
Discussion
  Incremental and parallel mining of description
  Descriptive mining of complex types of data