Data Preprocessing: Data Cleaning

Upload: tiersarge

Post on 02-Jun-2018


  • 8/9/2019 Data Preprocessing_ Data Cleaning

    1/29

    January 20, 2015 Data Mining: Concepts and Techniques 1

    Data Preprocessing

    Why preprocess the data?
    Data cleaning
    Data integration and transformation
    Data reduction
    Discretization and concept hierarchy generation
    Summary


    Why Data Preprocessing?

    Data in the real world is dirty:
        incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
            e.g., occupation = ""
        noisy: containing errors or outliers
            e.g., Salary = "-10"
        inconsistent: containing discrepancies in codes or names
            e.g., Age = "42", Birthday = "03/07/1997"
            e.g., was rating "1, 2, 3", now rating "A, B, C"
            e.g., discrepancy between duplicate records

    Why Is Data Dirty?

    Incomplete data may come from
        "Not applicable" data value when collected
        Different considerations between the time when the data was collected and when it is analyzed
        Human/hardware/software problems
    Noisy data (incorrect values) may come from
        Faulty data collection instruments
        Human or computer error at data entry

    Why Is Data Preprocessing Important?

    No quality data, no quality mining results!
        Quality decisions must be based on quality data
            e.g., duplicate or missing data may cause incorrect or even misleading statistics
        Data warehouse needs consistent integration of quality data
    Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse

    Multi-Dimensional Measure of Data Quality

    Measures for data quality: a multidimensional view
        Accuracy: correct or wrong, accurate or not
        Completeness: not recorded, unavailable, ...
        Consistency: some modified but some not, dangling, ...
        Timeliness: timely update?
        Believability: how far can the data be trusted to be correct?
        Interpretability: how easily can the data be understood?

    Major Tasks in Data Preprocessing

    Data cleaning
        Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
    Data integration
        Integration of multiple databases, data cubes, or files
    Data reduction
        Dimensionality reduction
        Numerosity reduction
        Data compression
    Data transformation and data discretization
        Normalization
        Concept hierarchy generation

    Forms of Data Preprocessing


    Data Cleaning

    Importance
        "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
        "Data cleaning is the number one problem in data warehousing" (DCI survey)
    Data cleaning tasks
        Fill in missing values
        Identify outliers and smooth out noisy data
        Correct inconsistent data
        Resolve redundancy caused by data integration

    Incomplete (Missing) Data

    Data is not always available


    How to Handle Missing Data?

    Ignore the tuple: usually done when the class label is missing (when doing classification); not effective when the % of missing values per attribute varies considerably
    Fill in the missing value manually: tedious and possibly infeasible
    Fill it in automatically with
        a global constant: e.g., "unknown" (a new class?!)
        the attribute mean
        the attribute mean for all samples belonging to the same class: smarter
        the most probable value: inference-based, such as a Bayesian formula or a decision tree
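The automatic fill strategies listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records, the attribute names `income` and `risk`, and the helper function names are made up for the example, and missing values are assumed to be stored as `None`.

```python
from collections import defaultdict
from statistics import mean


def fill_constant(rows, attr, constant="unknown"):
    """Replace missing (None) values of attr with a global constant."""
    return [{**r, attr: constant if r[attr] is None else r[attr]} for r in rows]


def fill_mean(rows, attr):
    """Replace missing values of attr with the mean of the observed values."""
    observed = [r[attr] for r in rows if r[attr] is not None]
    m = mean(observed)
    return [{**r, attr: m if r[attr] is None else r[attr]} for r in rows]


def fill_class_mean(rows, attr, label):
    """Replace missing values of attr with the mean over rows of the same class.

    Assumes every class has at least one observed value of attr.
    """
    by_class = defaultdict(list)
    for r in rows:
        if r[attr] is not None:
            by_class[r[label]].append(r[attr])
    means = {c: mean(vals) for c, vals in by_class.items()}
    return [{**r, attr: means[r[label]] if r[attr] is None else r[attr]} for r in rows]


# Hypothetical records: two classes, one missing income in each.
rows = [
    {"income": 30, "risk": "low"},
    {"income": None, "risk": "low"},
    {"income": 50, "risk": "low"},
    {"income": 90, "risk": "high"},
    {"income": None, "risk": "high"},
    {"income": 110, "risk": "high"},
]

by_constant = fill_constant(rows, "income")
by_mean = fill_mean(rows, "income")
by_class_mean = fill_class_mean(rows, "income", "risk")
```

In this toy data the overall mean (70) sits between the two classes, while the class-conditional means (40 and 100) stay close to each group's observed values, which is why the slide calls that variant "smarter".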

    Noisy Data

    Noise: random error or variance in a measured variable
    Incorrect attribute values may be due to
        faulty data collection instruments
        data entry problems
        data transmission problems
        technology limitation
        inconsistency in naming convention

    How to Handle Noisy Data?

    Binning
        first sort data and partition into (equal-frequency) bins
        then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

    Simple Discretization Methods: Binning


    Binning Methods for Data Smoothing

    Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
    * Partition into equal-frequency (equi-depth) bins:
        - Bin 1: 4, 8, 9, 15
        - Bin 2: 21, 21, 24, 25
        - Bin 3: 26, 28, 29, 34
    * Smoothing by bin means:
        - Bin 1: 9, 9, 9, 9
        - Bin 2: 23, 23, 23, 23
        - Bin 3: 29, 29, 29, 29
    * Smoothing by bin boundaries:
        - Bin 1: 4, 4, 4, 15
        - Bin 2: 21, 21, 25, 25
        - Bin 3: 26, 26, 26, 34
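The price example above can be reproduced with a short Python sketch. The function names are made up for illustration. Two details are assumptions rather than something the slide states: bin means are rounded to the nearest integer (which matches the slide's 23 and 29), and a value equally distant from both boundaries goes to the lower one.

```python
def equal_frequency_bins(values, depth):
    """Sort the values and split them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]


def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean, rounded to an integer."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]


def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        # Ties go to the lower boundary (an arbitrary choice for this sketch).
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed


prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equal_frequency_bins(prices, depth=4)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)

print(bins)
print(by_means)
print(by_boundaries)
```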

    How to Handle Noisy Data?

    Regression
        smooth by fitting the data into regression functions

    Regression

    [Figure: the line y = x + 1 on axes x and y, with X1 and Y1 marked.]
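Smoothing by regression can be sketched as fitting a straight line by ordinary least squares and replacing each observed y with the value on the line. The data points below are made up to lie roughly along y = x + 1, and the function names are hypothetical. A simple linear fit is only one possible choice of regression function.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def smooth_by_regression(xs, ys):
    """Replace each y with the fitted value on the regression line."""
    slope, intercept = fit_line(xs, ys)
    return [slope * x + intercept for x in xs]


# Hypothetical noisy measurements scattered around y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

slope, intercept = fit_line(xs, ys)
smoothed = smooth_by_regression(xs, ys)
print(slope, intercept)
print(smoothed)
```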

    How to Handle Noisy Data?

    Clustering
        detect and remove outliers

    Cluster Analysis

    How to Handle Noisy Data?

    Binning
        first sort data and partition into (equal-frequency) bins
        then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.
    Regression
        smooth by fitting the data into regression functions
    Clustering
        detect and remove outliers
    Combined computer and human inspection
        detect suspicious values and check by human (e.g., deal with possible outliers)
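As a rough illustration of the clustering item, the sketch below groups one-dimensional values into clusters wherever neighbouring values are close, then reports values in very small clusters as candidates for human inspection. This gap-based grouping is a deliberately simple stand-in for a real clustering algorithm. The thresholds `max_gap` and `min_size`, the function names, and the added value 120 are all assumptions made for the example.

```python
def gap_clusters(values, max_gap):
    """Group sorted 1-D values; a gap larger than max_gap starts a new cluster."""
    ordered = sorted(values)
    if not ordered:
        return []
    clusters = [[ordered[0]]]
    for v in ordered[1:]:
        if v - clusters[-1][-1] > max_gap:
            clusters.append([v])
        else:
            clusters[-1].append(v)
    return clusters


def flag_outliers(values, max_gap, min_size):
    """Return values that fall in clusters with fewer than min_size members."""
    return [v
            for cluster in gap_clusters(values, max_gap)
            if len(cluster) < min_size
            for v in cluster]


# The price data from the binning example plus one hypothetical extreme value.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34, 120]
suspicious = flag_outliers(prices, max_gap=10, min_size=2)
print(suspicious)
```

The flagged values are only suspects. In the spirit of the combined computer and human inspection item, a person would decide whether each one is an error or a legitimate extreme.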

    Problems

    3.3 Suppose that the data for analysis includes the attribute age. The age values for the data tuples are (in increasing order) 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.

    i. Use smoothing by bin means and bin boundaries to smooth the data, using a bin depth of 3. Illustrate your steps.

    ii. How might you determine the outliers?

    Data Cleaning as a Process

    Data discrepancy detection
        Use metadata (e.g., domain, range, dependency, distribution)
        Check field overloading
        Check uniqueness rule, consecutive rule and null rule
        Use commercial tools
            Data scrubbing: use simple domain knowledge (e.g., postal code, spell-check) to detect errors and make corrections
            Data auditing: analyze data to discover rules and relationships and to detect violators (e.g., correlation and clustering to find outliers)
    Data migration and integration
        Data migration tools: allow transformations to be specified
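The uniqueness, consecutive, and null rules named above can each be checked with a small function. The sketch below follows one common reading of these rules: a unique rule forbids repeated values, a consecutive rule forbids gaps between the lowest and highest value, and a null rule lists the markers that stand for "no value". The function names, the set of null markers, and the sample data are assumptions for illustration only.

```python
# Hypothetical markers that a null rule might declare as "no value".
NULL_MARKERS = frozenset({None, "", "?", "N/A"})


def duplicate_values(values):
    """Uniqueness rule: return each value that occurs more than once."""
    seen, dupes = set(), []
    for v in values:
        if v in seen and v not in dupes:
            dupes.append(v)
        seen.add(v)
    return dupes


def missing_in_sequence(values):
    """Consecutive rule: return integers absent between the min and max."""
    present = set(values)
    if not present:
        return []
    return [n for n in range(min(present), max(present) + 1) if n not in present]


def null_positions(values, markers=NULL_MARKERS):
    """Null rule: return the positions holding a declared null marker."""
    return [i for i, v in enumerate(values) if v in markers]


# Made-up sample columns.
check_numbers = [101, 102, 104, 104, 107]
postal_codes = ["60614", "", "?", "10001", None]

print(duplicate_values(check_numbers))
print(missing_in_sequence(check_numbers))
print(null_positions(postal_codes))
```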