![Page 1: CompSci 590.01 Spring 2017 Statistical Distortion: …...Consequences of Data Cleaning Data Cleaning & Integration CompSci 590.01 Spring 2017 Junyang Gao, Amir Rahimzadeh Ilkhechi,](https://reader033.vdocuments.mx/reader033/viewer/2022042211/5eb1d7625ac8db63c870e9a3/html5/thumbnails/1.jpg)
Statistical Distortion: Consequences of Data Cleaning
Data Cleaning & Integration
CompSci 590.01, Spring 2017
Junyang Gao, Amir Rahimzadeh Ilkhechi, Yuhao Wen
Some content is based on Tamraparni Dasu’s DSAA Tutorial, 2016
Tamraparni Dasu, Ji Meng Loh. “Statistical Distortion: Consequences of Data Cleaning.” VLDB, 2012
Biggest take-away points?
(For us:)
● Cleaner data are not necessarily more useful or usable data
● In practice, a simple cleaning strategy may outperform a more sophisticated method whose assumptions do not fit the data
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Data Cleaning or Data Mangling?
● Changed the shape:
a. Most frequent values (Mode)
b. Least frequent values (Anomalies)
● Moved good values
● Turned good values into glitches
How to measure data cleaning strategies?
● Three-dimensional data quality metric:
a. Statistical Distortion
b. Glitch Improvement
c. Cost
Experimental Framework
1. Glitch Index
● Weighted Sum:
● The lower the glitch index, the “cleaner” the data set
Name City State
0 0 0
0 0 0
0 0 1
0 0 1
0 0 1
0 0 0
0 0 0
0 0 0
0 0 0
Glitch Vector
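A minimal sketch of the weighted-sum idea, using the glitch vectors from the table above. The function name `glitch_index`, the per-column weights, and the choice to return an unnormalized sum are illustrative assumptions, not details taken from the paper.

```python
from typing import Sequence


def glitch_index(glitch_rows: Sequence[Sequence[int]],
                 weights: Sequence[float]) -> float:
    """Weighted sum of 0/1 glitch indicators over all records."""
    total = 0.0
    for row in glitch_rows:
        if len(row) != len(weights):
            raise ValueError("each row needs one indicator per weight")
        total += sum(w * g for w, g in zip(weights, row))
    return total


# Glitch vectors from the table above; columns are Name, City, State.
rows = [
    [0, 0, 0], [0, 0, 0], [0, 0, 1],
    [0, 0, 1], [0, 0, 1], [0, 0, 0],
    [0, 0, 0], [0, 0, 0], [0, 0, 0],
]
weights = [0.25, 0.25, 0.5]  # hypothetical per-column weights

print(glitch_index(rows, weights))  # 1.5
```

Three records carry a glitch in the State column, so with a State weight of 0.5 the index is 1.5. A fully clean data set would score 0.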
Experimental Framework
2. Statistical Distortion
Distance between two distributions
1. Kullback–Leibler divergence (“distance”)
● P, Q are two probability distributions over the same event space
● $D_{KL}(P \| Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$
● $D_{KL}(P \| Q) = \underbrace{-\sum_x P(x) \log Q(x)}_{\text{cross-entropy of } P, Q} - \underbrace{\left(-\sum_x P(x) \log P(x)\right)}_{\text{entropy of } P} = H(P, Q) - H(P)$
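The relation between the Kullback–Leibler divergence, the entropy of P, and the cross-entropy of P and Q can be checked numerically. The sketch below assumes discrete distributions given as lists of probabilities; the function names are illustrative.

```python
import math
from typing import Sequence


def entropy(p: Sequence[float]) -> float:
    """H(P) in nats; zero-probability events contribute nothing."""
    return -sum(px * math.log(px) for px in p if px > 0)


def cross_entropy(p: Sequence[float], q: Sequence[float]) -> float:
    """H(P, Q) in nats; infinite if Q gives zero mass where P does not."""
    total = 0.0
    for px, qx in zip(p, q):
        if px == 0:
            continue
        if qx == 0:
            return math.inf
        total -= px * math.log(qx)
    return total


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(P || Q) computed as cross-entropy minus entropy."""
    if len(p) != len(q):
        raise ValueError("P and Q must share the same event space")
    return cross_entropy(p, q) - entropy(p)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(kl_divergence(p, q), kl_divergence(q, p))  # the two values differ
```

The two printed values differ, which is why the divergence is only a “distance” in quotes: it is not symmetric in P and Q.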
2. Jensen–Shannon divergence
● Symmetrized and smoothed version of the Kullback–Leibler divergence
● $JSD(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$, where $M = \frac{1}{2}(P + Q)$
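A short sketch of the Jensen–Shannon divergence for discrete distributions, using its usual definition as the average Kullback–Leibler divergence of P and Q from their mixture M = (P + Q) / 2. The function names are illustrative.

```python
import math
from typing import Sequence


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    # q is the mixture M here, so q[i] > 0 whenever p[i] > 0.
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in nats; symmetric and at most ln 2."""
    if len(p) != len(q):
        raise ValueError("P and Q must share the same event space")
    m = [(px + qx) / 2 for px, qx in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


p = [0.5, 0.5, 0.0]
q = [0.0, 0.5, 0.5]
print(js_divergence(p, q), js_divergence(q, p))  # equal by symmetry
```

Because the mixture M is positive wherever P or Q is, the result stays finite even when the two distributions have different supports.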
3. Earth Mover’s distance
● Minimum cost of converting P to Q, posed as a transportation problem
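For intuition, the special case of one-dimensional histograms has a closed form: with equally spaced bins, unit distance between adjacent bins, and equal total mass, the Earth Mover’s distance is the sum of absolute differences between the two running totals. The sketch below covers only this special case (an assumption made for brevity); the general case needs the transportation-problem formulation.

```python
from typing import Sequence


def emd_1d(p: Sequence[float], q: Sequence[float]) -> float:
    """Earth Mover's distance between two 1-D histograms on the same bins.

    Assumes unit spacing between adjacent bins and equal total mass.
    """
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    if abs(sum(p) - sum(q)) > 1e-9:
        raise ValueError("histograms must have equal total mass")
    carried = 0.0  # surplus mass that still has to move to the right
    cost = 0.0
    for px, qx in zip(p, q):
        carried += px - qx
        cost += abs(carried)  # moving the surplus one bin costs |surplus|
    return cost


print(emd_1d([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))  # 2.0
print(emd_1d([0.5, 0.5, 0.0], [0.0, 0.5, 0.5]))  # 1.0
```

In the first example all of the mass moves two bins, so the cost is 2. Unlike the Kullback–Leibler divergence, the result depends on how far values moved, not only on whether the bins differ.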
Experimental Framework
3. Cost
● Highly context-dependent
● In this paper: glitch-based (percentage of glitches removed)
Experimental Framework
The “best” strategy depends on the user’s tolerance for:
● Statistical Distortion
● Glitch Improvement
● Cost
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Applicable to:
● Structured,
● Hierarchical,
● Spatio-temporal,
● Unstructured data
A hierarchical network example
Example node labels, one per level of the hierarchy: N_1, N_13, N_132
A hierarchical network example (cont.)
● Each node measures v variables (time series)
● For node N_ijk: the data collected at time t
● F_t: the history up to time t-1
● The window of history covers the ω time steps from t-ω up to time t-1
Glitch Types:
● Multitype
● Co-occurring
● Stand-alone
Glitch Detection
● Missing values: the glitch detector is a function of X^t
● Inconsistent values: the glitch detector is a function of X^t
● Outlier values: the glitch detector is also a function of other parameters
Glitch Detection (cont.)
Glitch Matrix:
Glitch Index*
*1 × p is a typo in the paper
Statistical Distortion Measure (EMD)*
*Slides adapted from Pete Barnum’s presentation
Linear programming approach for EMD (as an instance of the transportation problem):
● Constraints 1 to 4
● Final result for EMD
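For reference, the standard transportation-problem formulation of EMD (as introduced by Rubner, Tomasi, and Guibas) is sketched below. The notation is an assumption: $w_{p_i}$ and $w_{q_j}$ are the bin weights of P and Q, $d_{ij}$ is the ground distance between bin i of P and bin j of Q, and $f_{ij}$ is the flow to be found. It is also an assumption that the four constraints on the slides correspond to the four listed here, in this order.

```latex
\begin{align*}
\min_{f_{ij}} \quad & \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij} \\
\text{subject to} \quad
& f_{ij} \ge 0, && 1 \le i \le m,\ 1 \le j \le n \\
& \sum_{j=1}^{n} f_{ij} \le w_{p_i}, && 1 \le i \le m \\
& \sum_{i=1}^{m} f_{ij} \le w_{q_j}, && 1 \le j \le n \\
& \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}
  = \min\!\left( \sum_{i=1}^{m} w_{p_i},\ \sum_{j=1}^{n} w_{q_j} \right) \\[6pt]
\mathrm{EMD}(P, Q) \;=\; & \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}\, d_{ij}}
                              {\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}
\end{align*}
```

The first constraint allows mass to move only from P to Q, the second and third cap what each bin can send or receive, and the fourth forces as much mass as possible to move. Dividing by the total flow normalizes the optimal cost.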
Outline
● Introduction & Experimental Framework
● Methodology & Formulation
● Experiments & Analysis
Experiment
Dataset:
● 20,000 time series, each of length at most 170, with 3 variables
Glitches:
● Inconsistencies (violations of these constraints)
○ A1 >= 0
○ 0 <= A3 <= 1
○ If A3 is missing, A1 should not be populated
● Outliers
● Missing values
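The missing-value and inconsistency rules listed above can be sketched as a small detector that returns 0/1 flags per attribute. This is an illustration only: the record layout, the use of `None` for missing values, and the name `A2` for the third variable are assumptions, and outlier detection is left out because the slides do not give its thresholds.

```python
from typing import Dict, Optional

Record = Dict[str, Optional[float]]  # None marks a missing value

ATTRIBUTES = ("A1", "A2", "A3")  # "A2" is an assumed name


def detect_glitches(rec: Record) -> Dict[str, Dict[str, int]]:
    """Return 0/1 flags for missing and inconsistent values per attribute."""
    flags = {
        name: {"missing": int(rec.get(name) is None), "inconsistent": 0}
        for name in ATTRIBUTES
    }
    a1, a3 = rec.get("A1"), rec.get("A3")
    if a1 is not None and a1 < 0:  # rule: A1 >= 0
        flags["A1"]["inconsistent"] = 1
    if a3 is not None and not (0 <= a3 <= 1):  # rule: 0 <= A3 <= 1
        flags["A3"]["inconsistent"] = 1
    if a3 is None and a1 is not None:  # rule: A1 unpopulated if A3 missing
        flags["A1"]["inconsistent"] = 1
    return flags


print(detect_glitches({"A1": 5.0, "A2": 1.0, "A3": None}))
```

In the example, A3 is flagged as missing and A1 as inconsistent, so a single record can carry co-occurring glitches of different types.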
All graphs in the following slides are from T. Dasu and J. Loh, “Statistical Distortion: Consequences of Data Cleaning,” VLDB 2012.
Experiment
Sampling:
● From D & DI with replacement, 50 pairs in total
● Sample size: 100, 500 (no significant impact)
Factors considered:
● Attribute transformations
● Strategies
● Cost
● Sample size
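The sampling step can be sketched as follows. The function name, the fixed seed, and the choice to draw the two samples of each pair independently are assumptions for illustration; the slides only state that sampling is done with replacement and that 50 pairs are produced.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def sample_pairs(d: Sequence[T], di: Sequence[T], n_pairs: int = 50,
                 sample_size: int = 100,
                 seed: int = 0) -> List[Tuple[List[T], List[T]]]:
    """Draw n_pairs pairs of samples, each taken with replacement."""
    rng = random.Random(seed)  # seeded for reproducibility
    pairs = []
    for _ in range(n_pairs):
        sample_d = rng.choices(d, k=sample_size)    # with replacement
        sample_di = rng.choices(di, k=sample_size)  # with replacement
        pairs.append((sample_d, sample_di))
    return pairs


# Toy stand-ins for the two collections of time series (ids only).
d_ids = [f"d{i}" for i in range(1000)]
di_ids = [f"i{i}" for i in range(1000)]
pairs = sample_pairs(d_ids, di_ids, n_pairs=50, sample_size=100)
print(len(pairs), len(pairs[0][0]), len(pairs[0][1]))  # 50 100 100
```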
Analysis
● Cleaning Strategies
● Studying Cost
● Data Transformation and Cleaning
● Strategies and Attribute Distributions
● Strategies Evaluation
● Cleaning Cost
Analysis - Cleaning Strategies
Applied 5 strategies to each of the 100 test pairs of data streams.

| Strategy | Missing values | Inconsistent values | Outliers |
|---|---|---|---|
| 1 | Impute using SAS PROC MI | Impute using SAS PROC MI | Winsorization on a per-attribute basis |
| 2 | Impute using SAS PROC MI | Impute using SAS PROC MI | Ignore |
| 3 | Ignore | Ignore | Winsorization on a per-attribute basis |
| 4 | Replace with attribute mean from ideal dataset | Replace with attribute mean from ideal dataset | Ignore |
| 5 | Replace with attribute mean from ideal dataset | Replace with attribute mean from ideal dataset | Winsorization (outliers only) |
| Weight | 0.25 | 0.25 | 0.5 |
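Two of the building blocks in the table above, Winsorization and mean replacement, can be sketched in a few lines. The quantile cut-offs and the simple floor-index quantile rule are assumptions for illustration; the slides do not state which limits were used, and SAS PROC MI imputation is not reproduced here.

```python
from typing import List, Optional, Sequence


def winsorize(values: Sequence[float], lower_q: float = 0.05,
              upper_q: float = 0.95) -> List[float]:
    """Clamp values outside the empirical quantiles to those quantiles."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int(lower_q * (n - 1))]   # floor-index quantile rule
    high = ordered[int(upper_q * (n - 1))]
    return [min(max(v, low), high) for v in values]


def replace_with_mean(values: Sequence[Optional[float]],
                      ideal_mean: float) -> List[float]:
    """Replace missing values (None) with a mean taken from ideal data."""
    return [ideal_mean if v is None else v for v in values]


print(winsorize([1, 2, 3, 4, 1000], lower_q=0.0, upper_q=0.75))
# [1, 2, 3, 4, 4]
print(replace_with_mean([1.0, None, 3.0], ideal_mean=2.0))
# [1.0, 2.0, 3.0]
```

Winsorization pulls extreme values in to the cut-off instead of dropping them, so it changes the tails of the distribution while keeping the record count.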
Analysis - Studying Cost
● Cost ~ proportion of the glitches cleaned
● Process:
○ Compute a normalized glitch score for each time series
○ Rank them
○ Clean the top x% (x = 0: nothing cleaned; x = 100: everything cleaned)
Analysis - Data Transformation and Cleaning
Strategy: 1
● Gray: imputed missing values
● X = Y: untouched data
● Black dots: winsorized values
Analysis - Strategies and Attribute Distributions
Analysis - Strategies Evaluation
Figure 6
● Single cleaning method
○ SAS PROC MI
○ Mean
● Winsorization only
● Using two methods
○ Impute + Winsorize
○ Mean + Winsorize
Analysis - Cleaning Cost
Strategy: Imputation + Winsorization