data preprocessing_ data cleaning
TRANSCRIPT
-
8/9/2019 Data Preprocessing_ Data Cleaning
1/29
January 20, 2015 Data Mining: Concepts and Techniques 1
Data Preprocessing
-
Data Preprocessing
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
-
Why Data Preprocessing?
Data in the real world is dirty
incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  e.g., occupation=" "
noisy: containing errors or outliers
  e.g., Salary="-10"
inconsistent: containing discrepancies in codes or names
  e.g., Age="42" Birthday="03/07/1997"
  e.g., Was rating "1,2,3", now rating "A, B, C"
  e.g., discrepancy between duplicate records
-
Why Is Data Dirty?
Incomplete data may come from
  "Not applicable" data value when collected
  Different considerations between the time when the data was collected and when it is analyzed
  Human/hardware/software problems
Noisy data (incorrect values) may come from
  Faulty data collection instruments
  Human or computer error at data entry
-
Why Is Data Preprocessing Important?
No quality data, no quality mining results!
  Quality decisions must be based on quality data
    e.g., duplicate or missing data may cause incorrect or even misleading statistics.
  Data warehouse needs consistent integration of quality data
Data extraction, cleaning, and transformation make up the majority of the work of building a data warehouse
-
Multi-Dimensional Measure of Data Quality
Measures for data quality: a multidimensional view
  Accuracy: correct or wrong, accurate or not
  Completeness: not recorded, unavailable, ...
  Consistency: some modified but some not, dangling, ...
  Timeliness: timely update?
  Believability: how much can the data be trusted to be correct?
  Interpretability: how easily can the data be understood?
-
Major Tasks in Data Preprocessing
Data cleaning
  Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
Data integration
  Integration of multiple databases, data cubes, or files
Data reduction
  Dimensionality reduction
  Numerosity reduction
  Data compression
Data transformation and data discretization
  Normalization
  Concept hierarchy generation
-
Forms of Data Preprocessing
-
Data Cleaning
Importance
  "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  "Data cleaning is the number one problem in data warehousing" (DCI survey)
Data cleaning tasks
  Fill in missing values
  Identify outliers and smooth out noisy data
  Correct inconsistent data
  Resolve redundancy caused by data integration
-
Incomplete (Missing) Data
Data is not always available
-
How to Handle Missing Data?
Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with
  a global constant: e.g., "unknown", a new class?!
  the attribute mean
  the attribute mean for all samples belonging to the same class: smarter
  the most probable value: inference-based, such as a Bayesian formula or decision tree
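The first three automatic fill-in strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (dicts with hypothetical `income` and `cls` fields), the function names, and the use of `None` to mark a missing value are all invented for the example and do not come from the slides.

```python
from collections import defaultdict
from statistics import mean


def fill_with_constant(rows, attr, constant="unknown"):
    """Replace missing (None) values of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_with_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    attr_mean = mean(observed)
    return [{**r, attr: attr_mean if r[attr] is None else r[attr]} for r in rows]


def fill_with_class_mean(rows, attr, class_attr):
    """Replace missing values with the mean of attr within the same class.

    Assumes every class has at least one observed value for attr.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[class_attr]].append(r[attr])
    class_means = {c: mean(vals) for c, vals in by_class.items()}
    return [
        {**r, attr: class_means[r[class_attr]] if r[attr] is None else r[attr]}
        for r in rows
    ]


# Hypothetical sample data: two classes, one missing income in each.
rows = [
    {"income": 30, "cls": "low"},
    {"income": None, "cls": "low"},
    {"income": 50, "cls": "low"},
    {"income": 90, "cls": "high"},
    {"income": None, "cls": "high"},
    {"income": 110, "cls": "high"},
]

overall = fill_with_mean(rows, "income")
per_class = fill_with_class_mean(rows, "income", "cls")
```

On this sample the overall mean fills both gaps with the same value, while the class-conditional mean fills each gap with a value typical of its own class, which is why the slide calls it "smarter".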
-
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
  faulty data collection instruments
  data entry problems
  data transmission problems
  technology limitation
  inconsistency in naming convention
-
How to Handle Noisy Data?
Binning
  first sort data and partition into (equal-frequency) bins
  then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
-
Simple Discretization Methods: Binning
-
Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
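The price example above can be reproduced with a short Python sketch. The function names are hypothetical, and two details are assumptions rather than something the slide states: bin means are rounded to the nearest integer (the slide shows whole numbers), and a value equally close to both boundaries goes to the lower one.

```python
def equal_frequency_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    data = sorted(values)
    return [data[i:i + depth] for i in range(0, len(data), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin by the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value by the closer of the bin's minimum and maximum.

    Ties go to the lower boundary (an assumption made for this sketch).
    """
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are already sorted
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, depth=4)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```

With `depth=4` the three bins and both smoothed versions match the values listed on the slide.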
-
How to Handle Noisy Data?
Regression
  smooth by fitting the data into regression functions
-
Regression
[Figure: plot of y against x showing the regression line y = x + 1, with X1 and Y1 labeled.]
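Smoothing by regression can be sketched as fitting a straight line by ordinary least squares and replacing each observed y with the fitted value. The slides do not say which regression function is meant, so the simple linear model, the function names, and the sample points below are assumptions chosen to echo the line y = x + 1 in the figure.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each observed y by the value predicted by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy observations scattered around y = x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1.2, 1.8, 3.1, 3.9, 5.0]
slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
```

The fitted slope and intercept land close to 1 for this sample, and the smoothed values lie exactly on the fitted line.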
-
How to Handle Noisy Data?
Clustering
  detect and remove outliers
-
Cluster Analysis
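One way to picture outlier detection by clustering is the one-dimensional sketch below. It is a deliberately simple stand-in, not the method on the slide, which names no algorithm: sorted values are grouped wherever neighbours lie within a gap threshold, and values in very small groups are reported as outliers. The function names, the thresholds, and the salary figures are all hypothetical.

```python
def cluster_1d(values, max_gap):
    """Group sorted values into clusters; a new cluster starts whenever the
    gap to the previous value exceeds max_gap."""
    data = sorted(values)
    if not data:
        return []
    clusters = [[data[0]]]
    for v in data[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters


def find_outliers(values, max_gap, min_size):
    """Return the values that fall in clusters smaller than min_size."""
    return [
        v
        for cluster in cluster_1d(values, max_gap)
        if len(cluster) < min_size
        for v in cluster
    ]


# Hypothetical salaries in thousands: two dense groups and one stray value.
salaries = [30, 31, 33, 34, 36, 58, 60, 61, 63, 250]
clusters = cluster_1d(salaries, max_gap=5)
outliers = find_outliers(salaries, max_gap=5, min_size=2)
```

Both thresholds are tuning choices; in practice the flagged values would be candidates for review rather than automatic removal.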
-
How to Handle Noisy Data?
Combined computer and human inspection
  detect suspicious values and check by human (e.g., deal with possible outliers)
-
Problems
3.3 Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
i. Use smoothing by bin means and boundaries to smooth the data, using a bin depth of 3. Illustrate your steps.
ii. How might you determine the outliers?
-
Data Cleaning as a Process
Data discrepancy detection
  Use metadata (e.g., domain, range, dependency, distribution)
  Check field overloading
  Check uniqueness rule, consecutive rule and null rule
  Use commercial tools
    Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
    Data auditing: by analyzing data to discover rules and relationships to detect violators (e.g., correlation and clustering to find outliers)
Data migration and integration
  Data migration tools: allow transformations to be specified
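Part of discrepancy detection can be sketched as a small rule checker that uses metadata. The sketch below covers only range checks, the uniqueness rule, and a simplified null rule (any value recorded with a null marker is reported). The function name, the record fields, the null markers, and the permitted ranges are assumptions made for illustration.

```python
def check_rules(rows, ranges, unique_attrs, null_markers=(None, "", "?")):
    """Return a list of (row_index, attribute, problem) tuples."""
    problems = []
    seen = {attr: set() for attr in unique_attrs}
    for i, row in enumerate(rows):
        for attr, value in row.items():
            # Null rule (simplified): report values recorded with a null marker.
            if value in null_markers:
                problems.append((i, attr, "null"))
                continue
            # Range check taken from the metadata for this attribute.
            if attr in ranges:
                low, high = ranges[attr]
                if not low <= value <= high:
                    problems.append((i, attr, "out of range"))
            # Uniqueness rule: the value must not have appeared before.
            if attr in seen:
                if value in seen[attr]:
                    problems.append((i, attr, "duplicate"))
                seen[attr].add(value)
    return problems


# Hypothetical records and metadata.
rows = [
    {"id": 1, "age": 42, "salary": 50000},
    {"id": 2, "age": None, "salary": -10},
    {"id": 2, "age": 230, "salary": 61000},
]
ranges = {"age": (0, 120), "salary": (0, 10**7)}
problems = check_rules(rows, ranges, unique_attrs=["id"])
```

The checker only reports violations; deciding how to correct each one is left to scrubbing rules or to a human reviewer.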