2summer 2012 - data preprocessing
TRANSCRIPT
-
7/30/2019 2Summer 2012 - Data Preprocessing
University of Science, Faculty of Information Technology
LECTURE NOTES: DATA MINING & APPLICATIONS
Lecturer: L Ngc Thnh, M.Sc.
Email: [email protected]
Summer 2012
DATA PREPARATION
-
Contents
Why prepare data?
Data cleaning
Data selection
Data reduction
Data transformation
-
Data
Attribute-value data
Data types
  Numeric, non-numeric (categorical)
  Static, dynamic (temporal)
Other forms of data
  Distributed data
  Text data
  Web data, metadata
  Images, audio/video
  ...
-
What is bad data?
Role-play exercise
Context: data collection
-
Data quality
Incomplete: missing attribute values, missing attributes of interest, or containing only aggregated data
  e.g., age, weight = ""
Noisy: containing errors or outliers
  e.g., Salary = -100 000
Inconsistent: discrepancies in codes or in names
  e.g., Age = 42, Date of birth = 03/07/1997; US = USA?
-
Consequences of data quality
Sound decisions must be based on accurate data
  e.g., duplicated or missing data can lead to inaccurate, even misleading, statistics.
A data warehouse needs consistent integration of quality data
Poor-quality data leads to poor mining results
-
Solutions? (1/2)
-
Solutions? (2/2)
Data cleaning
o Fill in missing values, smooth noisy data, identify and remove outliers, and resolve inconsistent data.
Data selection / integration
o Combine and integrate data from multiple databases and files.
Data transformation
o Normalization and aggregation.
Data reduction
o Reduce the size of the data while preserving the analysis results.
-
Data cleaning
Data cleaning is the most important problem
Data cleaning is the process of:
  Filling in missing values
  Identifying and removing outliers and noisy data
  Resolving inconsistent data
-
Filling in missing values?
Brainstorm
What solutions can you think of?
Advantages and disadvantages?
-
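Filling in missing values (1/2)
Ignore the records that have missing values:
  Usually done when the class label is missing (in classification)
  Easy, but not effective, especially when the proportion of missing values for an attribute is high.
Fill in the missing values manually: tedious and infeasible
Fill in the missing values automatically:
  Replace with a global constant, e.g., "unknown". This may form a new class in the data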
-
Filling in missing values (2/2)
Fill in the missing values automatically:
  Replace with the mean value of the attribute
  Replace with the mean value of the attribute within the same class
  Replace with the most probable value: inferred with a Bayesian formula, a decision tree, or the EM (Expectation Maximization) algorithm
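The first two automatic strategies can be sketched in a few lines. This is only an illustration, assuming missing values are represented as None; the function names and the toy ages/labels are hypothetical and do not come from the slides.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    overall = mean(observed)
    return [overall if v is None else v for v in values]

def fill_with_class_mean(values, labels):
    """Replace None entries with the mean of the observed values of the same class."""
    by_class = {}
    for v, c in zip(values, labels):
        if v is not None:
            by_class.setdefault(c, []).append(v)
    class_means = {c: mean(vs) for c, vs in by_class.items()}
    # Assumes every class has at least one observed value.
    return [class_means[c] if v is None else v for v, c in zip(values, labels)]

# Hypothetical toy data: ages with two missing entries and a class label per record.
ages = [20, None, 30, 40, None, 60]
labels = ["a", "a", "a", "b", "b", "b"]
print(fill_with_mean(ages))
print(fill_with_class_mean(ages, labels))
```

The class-wise variant usually stays closer to the true value when the classes differ, at the cost of needing a label for every record.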
-
Noise removal?
Basic methods for removing noise:
Binning:
  Sort the data and partition it into equal-depth bins
  Smooth by bin means, bin medians, bin boundaries, ...
Clustering:
  Detect and remove outliers
Regression:
  Fit the data to a regression function
-
Noise removal: binning (1/4)
Binning
Equal-width (distance) partitioning:
  Divide the value range into N intervals of equal size
  Width of each interval = (largest value - smallest value) / N
Equal-depth (frequency) partitioning:
  Divide the value range into N intervals, each containing approximately the same number of samples
-
[Figure; surviving label: "left boundary"]
-
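Noise removal: binning (3/4)
But equal-width binning does not work well for skewed data
[Figure: histogram of salaries in a company, with equal-width bins running from [0 - 200,000) to [1,800,000 - 2,000,000]; vertical axis: count]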
-
Noise removal: binning (4/4)
Equal-depth binning:
Divide the value range into N intervals, each containing approximately the same number of samples
Temperature values with N = 4: 64 65 68 69 70 71 72 72 75 75 80 81 83 85
Depth = 4, except for the last bin
Bins:  [64 .. 69]  [70 .. 72]  [73 .. 81]  [83 .. 85]
Count: 4  4  4  2
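Both partitioning schemes can be tried on the temperature values from this slide. The sketch below is a minimal illustration; the function names are hypothetical, and the equal-width version assumes the values are not all identical (otherwise the width would be zero).

```python
def equal_width_bins(values, n):
    """Split values into n intervals of width (max - min) / n."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n
    bins = [[] for _ in range(n)]
    for v in sorted(values):
        # The largest value would land in bin n, so clamp it into the last bin.
        idx = min(int((v - lo) / width), n - 1)
        bins[idx].append(v)
    return bins

def equal_depth_bins(values, depth):
    """Split the sorted values into consecutive bins of `depth` samples; the last bin may be smaller."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
print(equal_depth_bins(temps, 4))  # bins of 4, 4, 4 and 2 values, as on the slide
print(equal_width_bins(temps, 4))  # interval width is (85 - 64) / 4 = 5.25
```

On this data the equal-width split gives bins with 4, 4, 2 and 4 values, while the equal-depth split keeps every bin but the last at exactly 4 values.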
-
Noise removal with bins
Sorted price data ($): 4, 8, 15, 21, 21, 24, 25, 28, 34
Partition into equal-depth bins with N = 3
  Bin 1: 4, 8, 15
  Bin 2: 21, 21, 24
  Bin 3: 25, 28, 34
What do we do with the bins?
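One possible answer is to replace every value by a representative of its bin. The sketch below applies smoothing by bin means, bin medians and bin boundaries to the three bins above; the function names are hypothetical, and sending ties to the lower boundary is an assumption of this sketch.

```python
from statistics import mean, median

def smooth_by_means(bins):
    """Replace every value with the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value with the median of its bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    result = []
    for b in bins:
        lo, hi = min(b), max(b)
        # Ties go to the lower boundary (an assumption of this sketch).
        result.append([lo if v - lo <= hi - v else hi for v in b])
    return result

bins = [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9, 9, 9], [22, 22, 22], [29, 29, 29]]
print(smooth_by_medians(bins))     # [[8, 8, 8], [21, 21, 21], [28, 28, 28]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```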
-
Exercise: noise removal with bins
Given the price data ($): 15, 17, 19, 25, 29, 31, 33, 41, 42, 45, 45, 47, 52, 52, 64
Apply equal-width and equal-depth binning with 4 bins:
  Compute the bin values using smoothing by bin medians.
  Compute the bin values using smoothing by bin boundaries.
  Compute the bin values using smoothing by bin means.
Comment on the results obtained.
-
Noise removal: clustering
-
Noise removal: regression
[Figure: regression line y = x + 1 plotted against x, with marked values X1 and Y1]
-
Resolving inconsistencies
Read further in the references to answer the question: how can inconsistent data be handled?
Give an example for each method of resolving inconsistencies.
-
Data selection
Select and gather data from multiple sources into a single database
What problems arise when selecting and integrating data?
-
Data selection process (1/4)
Process:
  Select only the data needed for the data mining process.
  Match the data schemas
  Remove redundant and duplicate data
  Detect and resolve inconsistencies in the data
-
Data selection process (2/4)
Schema matching
  The entity identification problem
  How can entities from multiple data sources be matched to one another?
  US = USA; customer_id = cust_number
  Use metadata
-
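Data selection process (3/4)
Removing redundant and duplicate data
  An attribute is redundant if it can be derived from other attributes
  The same attribute may have different names in different databases
  Some data records are repeated
Use correlation analysis
  r = 0: X and Y are uncorrelated
  r > 0: positive correlation; X and Y increase together
  r < 0: negative correlation; one increases as the other decreases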
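The slide does not say which coefficient r is; the sketch below assumes the usual Pearson correlation coefficient. The function name and the toy weight columns are hypothetical. A weight recorded in both kg and pounds is a typical redundant pair, so its r comes out at (almost exactly) 1.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    # Assumes neither attribute is constant (otherwise the denominator is zero).
    return cov / (spread_x * spread_y)

# Hypothetical columns: the same weight stored in kg and in pounds.
weight_kg = [50, 60, 70, 80]
weight_lb = [w * 2.20462 for w in weight_kg]
print(pearson_r(weight_kg, weight_lb))     # very close to 1: one of the two is redundant
print(pearson_r([1, 2, 3], [3, 2, 1]))     # very close to -1: negative correlation
```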
-
Data selection process (4/4)
Resolving inconsistencies in the data
  Example: weight measured in kg or in pounds
  Define a standard and map the values to it based on metadata
-
Data reduction
The data may be too large for some data mining applications: processing is time-consuming.
Data reduction is the process of shrinking the data (in size) so that the same (or nearly the same) analysis results are still obtained.
-
Reduction methods
Methods:
  Aggregation
  Dimensionality reduction
  Data compression
  Numerosity reduction
  Discretization and concept hierarchies
-
Reduction: Aggregation (1/3)
Aggregation
  Combine two or more attributes (objects) into a single attribute (object)
  e.g., cities aggregated into regions, areas, countries, ...
Aggregating low-level data into high-level data:
  Reduces the size of the dataset: fewer attributes
  Makes the patterns more interesting
-
Reduction: Aggregation (2/3)
-
Reduction: Aggregation (3/3)
-
Reduction: Dimensionality reduction (1/6)
Dimensionality reduction
  Feature selection (a subset of the attributes)
  Select m of the n attributes, $m \le n$
  Remove irrelevant and redundant attributes
How can irrelevant attributes be identified?
  Statistics
  Information gain
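Information gain can be computed directly from class counts. The sketch below assumes the standard entropy-based definition (the entropy of the class labels minus the weighted entropy after splitting on the attribute); the function names and the toy columns are hypothetical.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(attribute_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the attribute."""
    n = len(labels)
    groups = {}
    for value, label in zip(attribute_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: one attribute predicts the class, the other is irrelevant.
labels = ["yes", "yes", "no", "no"]
relevant = ["x", "x", "y", "y"]
irrelevant = ["x", "y", "x", "y"]
print(information_gain(relevant, labels))    # 1 bit: the split separates the classes
print(information_gain(irrelevant, labels))  # 0 bits: the split tells nothing
```

An attribute whose gain is close to zero is a candidate for removal.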
-
Reduction: Dimensionality reduction (2/6)
How can the dimensionality be reduced?
Exhaustive search
  There are $2^d$ attribute subsets of d attributes
  The computational complexity is too high
Heuristic methods
  Stepwise forward selection
  Stepwise backward elimination
  A combination of both
  Decision tree induction
-
Reduction: Dimensionality reduction (3/6)
Heuristic method: stepwise forward selection
  First: select the single best attribute
  Then select the best of the remaining attributes, ...
Example: initial attribute set {A1,A2,A3,A4,A5,A6}
Initial reduced set = {}
B1= {A1}
B2= {A1,A4}
B3= {A1,A4,A6}
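The greedy loop can be sketched as follows. The slides do not say how "best" is measured, so the `score` argument is a placeholder for any subset-quality measure, and the per-attribute weights are invented purely so that the run reproduces the sequence B1, B2, B3 above.

```python
def forward_selection(attributes, score, k):
    """Greedily add the attribute that maximises score(subset) until k are selected."""
    selected = []
    remaining = list(attributes)
    history = []
    while remaining and len(selected) < k:
        best = max(remaining, key=lambda a: score(selected + [a]))
        selected.append(best)
        remaining.remove(best)
        history.append(list(selected))
    return history

# Hypothetical weights, chosen only to illustrate the example on the slide.
weights = {"A1": 0.9, "A2": 0.1, "A3": 0.2, "A4": 0.7, "A5": 0.3, "A6": 0.5}

def toy_score(subset):
    """Stand-in for a real quality measure such as accuracy or information gain."""
    return sum(weights[a] for a in subset)

for step in forward_selection(weights.keys(), toy_score, k=3):
    print(step)
```

With a real measure the score of a subset is generally not a sum of per-attribute scores, which is why the function re-evaluates each candidate subset instead of ranking the attributes once.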
-
Reduction: Dimensionality reduction (4/6)
Heuristic method: stepwise backward elimination
  First: remove the single worst attribute
  Then remove the worst of the remaining attributes, ...
Example: initial attribute set {A1,A2,A3,A4,A5,A6}
Initial reduced set = {A1,A2,A3,A4,A5,A6}
B1= {A1,A3,A4,A5,A6}
B2= {A1,A4,A5,A6}
B3= {A1,A4,A6}
-
Reduction: Dimensionality reduction (5/6)
Heuristic method: combination
  First: select the single best attribute and remove the single worst attribute
  Then select the best and remove the worst of the remaining attributes, ...
Example: initial attribute set {A1,A2,A3,A4,A5,A6}
Initial reduced set = {A1,A2,A3,A4,A5,A6}
B1= {A1,A3,A4,A5,A6}
B2= {A1,A4,A5,A6}
B3= {A1,A4,A6}
-
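Reduction: Dimensionality reduction (6/6)
Heuristic method: decision tree induction
  First: build a decision tree
  Remove the attributes that do not appear in the tree
Example: initial attribute set {A1,A2,A3,A4,A5,A6}
  Tree: A4? at the root, A1? and A6? below it, each leading to the leaves Class 1 and Class 2
  Reduced set = {A1, A4, A6}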
-
Reduction: Compression
Data compression: encode or transform the data
Lossless compression
  The data can be reconstructed
Lossy compression
  The data cannot be fully reconstructed
  Uses wavelet transforms, principal component analysis (PCA), ...
-
Reduction: Numerosity reduction
Numerosity reduction: choose an alternative, smaller representation of the data
Some methods:
Parametric methods:
  Use a mathematical model and store only its parameters
  Regression and log-linear models
Non-parametric methods:
  Do not use a mathematical model; store a reduced representation instead
  Histograms, clustering, sampling
-
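Reduction: Numerosity reduction
Linear regression: $Y = \alpha + \beta X$ (store only $\alpha$, $\beta$)
Multiple regression: $Y = b_0 + b_1 X_1 + b_2 X_2$
Log-linear model:
  Probability: $p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}$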
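For the linear regression case, the two stored parameters can be estimated by ordinary least squares. The sketch below is one common way to do it, not necessarily the one used in the course; the function name and the toy points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (alpha, beta) for the model Y = alpha + beta * X."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx  # assumes the x values are not all identical
    alpha = mean_y - beta * mean_x
    return alpha, beta

# Hypothetical points lying exactly on Y = 1 + 2X.
xs = [0, 1, 2, 3]
ys = [1, 3, 5, 7]
alpha, beta = fit_line(xs, ys)
print(alpha, beta)  # 1.0 2.0

# Only alpha and beta are stored; approximate values are regenerated on demand.
reconstructed = [alpha + beta * x for x in xs]
print(reconstructed)
```

However many records the data has, only two numbers are kept, at the cost of losing whatever the line does not capture.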
-
Reduction: Numerosity reduction
Histograms
  A popular method for data reduction
  Partition the data into bins; the height of each bar is the number of objects in that bin.
  Store only the mean value of each bin.
  The shape of the histogram depends on the number of bins
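A minimal sketch of histogram-based reduction, assuming equal-width bins that start at multiples of the width: each bin is summarised by its range, its count and its mean. The function name is hypothetical; the values are the price data used earlier in the deck.

```python
from statistics import mean

def histogram_summary(values, width):
    """Return (lower, upper, count, mean) for each non-empty equal-width bin."""
    buckets = {}
    for v in values:
        lower = (v // width) * width  # bins start at multiples of `width`
        buckets.setdefault(lower, []).append(v)
    return [(lo, lo + width, len(vs), mean(vs)) for lo, vs in sorted(buckets.items())]

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
for lower, upper, count, avg in histogram_summary(prices, 10):
    print(f"[{lower}, {upper}): count={count}, mean={avg}")
```

Nine values are reduced to four bins here; with a different width the shape of the histogram changes.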
-
Reduction: Numerosity reduction
Clustering
  Partition the data into clusters and store the cluster representations.
  Very effective when the data forms clusters, but not when the data is scattered
  There are many clustering algorithms.
-
Reduction: Numerosity reduction
Sampling
  Use a much smaller random sample in place of the large dataset.
  Simple random sampling without replacement (SRSWOR)
  Simple random sampling with replacement (SRSWR)
  Cluster / stratified sampling
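The three sampling schemes can be sketched with Python's standard `random` module. The function names, the seed and the toy records are hypothetical; the stratified version assumes a fraction between 0 and 1 and keeps at least one record per stratum.

```python
import random

def srswor(data, n, rng):
    """Simple random sample without replacement: no record is drawn twice."""
    return rng.sample(data, n)

def srswr(data, n, rng):
    """Simple random sample with replacement: a record may be drawn again."""
    return [rng.choice(data) for _ in range(n)]

def stratified_sample(data, key, fraction, rng):
    """Draw the same fraction from every stratum (at least one record each)."""
    strata = {}
    for item in data:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

rng = random.Random(0)  # fixed seed so that the run is repeatable
records = [(i, "young") for i in range(8)] + [(i, "old") for i in range(8, 12)]
print(srswor(records, 3, rng))
print(srswr(records, 3, rng))
print(stratified_sample(records, key=lambda r: r[1], fraction=0.25, rng=rng))
```

Stratified sampling keeps small groups represented, which a plain random sample of the same size may miss.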
-
Reduction: Numerosity reduction
[Figure; surviving label: Raw Data]
-
Reduction: Numerosity reduction
[Figure; surviving labels: Raw Data, Cluster/Stratified Sample]
-
Reduction: Discretization and concept hierarchies
Discretization:
  Transform the value domain of a (continuous) attribute by dividing it into intervals.
  Store the interval labels in place of the actual values
  Intended for continuous numeric data.
  Methods: binning, histogram analysis, clustering, entropy-based discretization, natural partitioning.
-
Reduction: Discretization and concept hierarchies
Concept hierarchies:
  Collect low-level concepts and replace them with higher-level concepts.
  Intended for non-numeric (categorical) data: build a hierarchy.
-
Reduction: Discretization and concept hierarchies
Examples:
  Convert logical values to 1, 0
  Convert date values to numbers
  Convert columns with large numeric values to values in a smaller range, for example by dividing them by some factor
  Group values that share the same meaning, e.g.:
    Activity before the August Revolution (CMT8) is group 1; from 01/08/45 to 31/06/54 is group 2; from 01/07/54 to 30/4/75 is group 3, ...
  Replace age values with young, middle-aged, old
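A few of these conversions can be sketched as follows. The function names and the age cut-offs (35 and 60) are assumptions made for illustration only; the slides do not define them, and dates are mapped to numbers with Python's proleptic Gregorian ordinal as one possible choice.

```python
from datetime import date

def bool_to_int(flag):
    """Convert a logical value to 1 or 0."""
    return 1 if flag else 0

def date_to_number(d):
    """Convert a date to a day count (proleptic Gregorian ordinal)."""
    return d.toordinal()

def age_group(age, young_max=35, middle_max=60):
    """Replace an age with a label; the cut-offs are hypothetical."""
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle-aged"
    return "old"

print(bool_to_int(True), bool_to_int(False))  # 1 0
print(date_to_number(date(1975, 4, 30)) - date_to_number(date(1954, 7, 1)))  # days between the two dates
print([age_group(a) for a in (20, 45, 70)])   # ['young', 'middle-aged', 'old']
```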
-
Data transformation
Data transformation: convert the data into a form that is suitable and convenient for data mining algorithms
Data transformation processes:
  Smoothing
  Aggregation
  Generalization
  Normalization
  Attribute construction
-
Data transformation processes
Smoothing: removing noise from the data.
Aggregation: summarizing or aggregating the data.
Generalization: replacing low-level concepts with high-level concepts.
Normalization: scaling attribute data into a small range of values, such as 0 to 1.
Attribute construction: new attributes are constructed and added to the given set of attributes
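Scaling into the range 0 to 1 is commonly done with min-max normalization. The slides do not name a specific method, so the sketch below is one possible choice; the function name is hypothetical, and a constant attribute is mapped to the lower bound to avoid dividing by zero.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min maps to new_min and max maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no spread; map everything to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]

print(min_max_normalize([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(min_max_normalize([4, 8, 15, 21, 21, 24, 25, 28, 34]))
```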
-
Summary
Data is often incomplete, noisy, inconsistent, and high-dimensional.
Good data is the key to building valuable and reliable models.
Data preparation consists of the following processes:
  Cleaning
  Selection
  Reduction
  Transformation
-
End-of-lesson questions 1
Why is data preparation an essential and time-consuming task?
What are the ways of handling missing values in database records?
Suppose a database has an attribute Age with the following values in its records (in ascending order): 13, 15, 16, 16, 19, 20, 20, 21, 22, 22, 25, 25, 25, 25, 30, 33, 33, 35, 35, 35, 35, 36, 40, 45, 46, 52, 70.
  Smooth the data above using bin means. Comment on the effectiveness of this technique for this data.
  Which other techniques could be applied to smooth the data?
  Using the data above, draw an equal-width histogram with width = 10
-
End-of-lesson questions 2
Why is data selection/integration needed? Describe the data selection process.
Why is data reduction needed? Can the data reduction process lose information? If so, describe how to remedy this.
Study the data transformation processes. Give an example for each kind of transformation.
-
References
E. Rahm, H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, Vol. 23, No. 4, 2000.
J. Han, M. Kamber. Data Mining: Concepts and Techniques, Chapter 2.
-
Q & A