![Page 1: Chapter 3 – Data Visualization - SNU Data Mining Centerdm.snu.ac.kr/static/docs/dm2015/Chap3_Visualization.pdf · 2015-09-02 · 95% of tracts do not border Charles River. Excel](https://reader033.vdocuments.mx/reader033/viewer/2022042809/5f941191df1c783ca71de882/html5/thumbnails/1.jpg)
Chapter 3 – Data Visualization
© Galit Shmueli and Peter Bruce 2010
Data Mining for Business Intelligence, Shmueli, Patel & Bruce
Graphs for Data Exploration
Basic Plots: Line Graphs, Bar Charts, Scatterplots
Distribution Plots: Boxplots, Histograms
Line Graph for Time Series
Bar Chart for Categorical Variable
95% of tracts do not border Charles River
Excel can confuse: y-axis is actually “% of records that have a value for CATMEDV” (i.e., “% of all records”)
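The y-axis of such a bar chart is a share of all records per category. A minimal sketch of that calculation, assuming a made-up 0/1 column standing in for CHAS (the helper name `percent_of_records` is hypothetical):

```python
from collections import Counter


def percent_of_records(values):
    """Return {category: percent of all records} for a categorical variable."""
    counts = Counter(values)
    total = len(values)
    return {category: 100.0 * n / total for category, n in counts.items()}


# Made-up data: 1 = tract borders the Charles River, 0 = it does not.
chas = [0] * 19 + [1]
shares = percent_of_records(chas)
print(shares)
```

Each bar height is then the value stored for its category, so the bars always sum to 100%.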
Scatterplot
Displays relationship between two numerical variables
Distribution Plots
Display “how many” of each value occur in a data set
Or, for continuous data or data with many possible values, “how many” values are in each of a series of ranges or “bins”
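Counting values per bin can be sketched as follows. This is only an illustration with equal-width bins; the function name `bin_counts`, the bin edges, and the sample values are assumptions, not taken from the slides.

```python
def bin_counts(values, low, high, n_bins):
    """Count how many values fall in each of n_bins equal-width bins on [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        if low <= v <= high:
            # A value equal to the top edge is placed in the last bin.
            index = min(int((v - low) / width), n_bins - 1)
            counts[index] += 1
    return counts


# Made-up median house values (in $1000s), not the real Boston Housing data.
medv = [5.0, 12.5, 17.8, 21.4, 22.0, 23.1, 24.9, 31.0, 36.2, 50.0]
counts = bin_counts(medv, 0.0, 50.0, 5)
print(counts)
```

A histogram draws one bar per bin with height equal to the corresponding count.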
Histograms
Boston Housing example: histogram shows the distribution of the outcome variable (median house value)
Boxplots
Boston Housing example: displays the distribution of the outcome variable (MEDV) for neighborhoods on the Charles River (1) and not on the Charles River (0)
Side-by-side boxplots are useful for comparing subgroups
Box Plot
Top outliers are defined as those above Q3 + 1.5(Q3 - Q1).
“max” = maximum of non-outliers
Analogous definitions for bottom outliers and for “min”
Details may differ across software
Figure labels: outliers, “max”, Quartile 3, mean, median, Quartile 1, “min”
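The outlier rule can be sketched in a few lines. This is one possible reading, assuming the "inclusive" quartile method from Python's `statistics` module; as noted, other software may compute quartiles differently. The function name `boxplot_summary` and the sample data are made up.

```python
import statistics


def boxplot_summary(values):
    """Compute the quantities a boxplot displays, using the 1.5 x IQR rule."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr  # values above this are top outliers
    lower_fence = q1 - 1.5 * iqr  # values below this are bottom outliers
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "mean": statistics.mean(values),
        "max": max(inside),  # "max" = maximum of non-outliers
        "min": min(inside),  # "min" = minimum of non-outliers
        "outliers": sorted(v for v in values if v < lower_fence or v > upper_fence),
    }


summary = boxplot_summary([1, 2, 3, 4, 5, 6, 7, 8, 30])
print(summary)
```

With this sample, the value 30 lies above the upper fence, so the top whisker stops at the largest remaining value rather than at 30.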
Heat Maps
Color conveys information
In data mining, used to visualize correlations and missing data
Heatmap to highlight correlations (Boston Housing)
            CRIM      ZN   INDUS    CHAS     NOX      RM     AGE     DIS     RAD     TAX PTRATIO       B   LSTAT    MEDV
CRIM        1.00
ZN         -0.20    1.00
INDUS       0.41   -0.53    1.00
CHAS       -0.06   -0.04    0.06    1.00
NOX         0.42   -0.52    0.76    0.09    1.00
RM         -0.22    0.31   -0.39    0.09   -0.30    1.00
AGE         0.35   -0.57    0.64    0.09    0.73   -0.24    1.00
DIS        -0.38    0.66   -0.71   -0.10   -0.77    0.21   -0.75    1.00
RAD         0.63   -0.31    0.60   -0.01    0.61   -0.21    0.46   -0.49    1.00
TAX         0.58   -0.31    0.72   -0.04    0.67   -0.29    0.51   -0.53    0.91    1.00
PTRATIO     0.29   -0.39    0.38   -0.12    0.19   -0.36    0.26   -0.23    0.46    0.46    1.00
B          -0.39    0.18   -0.36    0.05   -0.38    0.13   -0.27    0.29   -0.44   -0.44   -0.18    1.00
LSTAT       0.46   -0.41    0.60   -0.05    0.59   -0.61    0.60   -0.50    0.49    0.54    0.37   -0.37    1.00
MEDV       -0.39    0.36   -0.48    0.18   -0.43    0.70   -0.38    0.25   -0.38   -0.47   -0.51    0.33   -0.74    1.00
In Excel (using conditional formatting)
In Spotfire
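A correlation heatmap colors each cell by the size of a pairwise correlation. The sketch below computes a Pearson correlation by hand and maps it to a coarse intensity label. The thresholds in `shade` are arbitrary assumptions, not the Excel or Spotfire color scales, and the two short series are made up.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def shade(r):
    """Map a correlation to an intensity label (illustrative thresholds only)."""
    strength = abs(r)
    if strength >= 0.7:
        return "strong"
    if strength >= 0.4:
        return "moderate"
    return "weak"


r = pearson([1, 2, 3, 4, 5], [2, 1, 4, 3, 5])  # made-up data
print(round(r, 2), shade(r))

# Applied to values from the table: RAD/TAX (0.91) and CRIM/CHAS (-0.06).
print(shade(0.91), shade(-0.06))
```

Using the absolute value means strong negative pairs such as LSTAT/MEDV (-0.74) stand out as much as strong positive ones.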
Multidimensional Visualization
Scatterplot with color added (Boston Housing)
NOX vs. LSTAT
Red = low median value
Blue = high median value
Matrix Plot
(Figure: matrix of scatterplots; panels labeled CRIM, ZN, INDUS)
Shows scatterplots for variable pairs
Example: scatterplots for 3 Boston Housing variables
Rescaling to log scale (on right) “uncrowds” the data
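A small numeric sketch of why the log scale helps, assuming made-up values that span several orders of magnitude (the helper `axis_positions` is hypothetical):

```python
import math


def axis_positions(values):
    """Place each value on a 0-1 axis spanning the min and max of `values`."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


# Made-up positive values; a log scale requires values greater than zero.
data = [0.01, 0.1, 1.0, 10.0, 100.0]
linear = axis_positions(data)
logged = axis_positions([math.log10(v) for v in data])

print([round(p, 3) for p in linear])
print([round(p, 3) for p in logged])
```

On the linear axis four of the five points are squeezed into the lowest tenth of the range, while on the log axis they are evenly spaced.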
Aggregation
Amtrak Ridership – Monthly Data
Aggregation – Monthly Average
Aggregation – Yearly Average
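Both aggregations are group-then-average operations on the monthly series. A minimal sketch, assuming "monthly average" means the average for each calendar month across years; the ridership numbers and the helper `average_by` are made up for illustration.

```python
from collections import defaultdict


def average_by(records, key_index):
    """Average the ridership (field 2) of records grouped by the field at key_index."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key_index]].append(record[2])
    return {key: sum(vals) / len(vals) for key, vals in sorted(groups.items())}


# Made-up (year, month, ridership) records, not the actual Amtrak series.
records = [
    (1991, 1, 1700), (1991, 2, 1600),
    (1992, 1, 1650), (1992, 2, 1550),
]

yearly = average_by(records, 0)   # one value per year: shows the long-term trend
monthly = average_by(records, 1)  # one value per calendar month: shows seasonality
print(yearly)
print(monthly)
```

Averaging by year smooths out the seasonal swings, while averaging by calendar month isolates them.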
Scatter Plot with Labels (Utilities)
Scaling: Smaller markers, jittering, color contrast (Universal Bank; red = accept loan)
Jittering
Moving markers by a small random amount
Uncrowds the data by allowing more markers to be seen
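Jittering can be sketched as adding uniform random noise to each coordinate before plotting. The function name `jitter`, the noise range, and the sample values are assumptions; plotting tools may use a different noise distribution.

```python
import random


def jitter(values, amount, seed=None):
    """Shift each value by a random offset drawn uniformly from [-amount, amount]."""
    rng = random.Random(seed)  # a fixed seed makes the jitter reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Made-up integer-valued variable with many overlapping points.
original = [1, 1, 1, 2, 2, 3]
moved = jitter(original, 0.2, seed=42)
print(moved)
```

The offsets should stay small relative to the spacing between distinct values so that the points separate visually without changing the apparent pattern.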
Without jittering (for comparison)
Parallel Coordinate Plot (Boston Housing)
Filter settings: CAT. MEDV: (1)
Panels: CATMEDV = 1, CATMEDV = 0
Linked plots (same record is highlighted in each plot)
Network Graph – eBay Auctions (sellers on left, buyers on right)
Circle size = # of transactions for the node
Line width = # of auctions for the buyer-seller pair
Arrows point from buyer to seller
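The three encodings above can be derived from a plain list of auctions. A minimal sketch, assuming each auction is recorded as a made-up `(buyer, seller)` pair; the variable names are hypothetical.

```python
from collections import Counter

# Made-up auctions as (buyer, seller) pairs; direction is buyer -> seller.
auctions = [("b1", "s1"), ("b1", "s1"), ("b2", "s1"), ("b2", "s2")]

# Line width: number of auctions for each buyer-seller pair.
edge_width = Counter(auctions)

# Circle size: number of transactions each node (buyer or seller) took part in.
node_size = Counter()
for buyer, seller in auctions:
    node_size[buyer] += 1
    node_size[seller] += 1

print(dict(edge_width))
print(dict(node_size))
```

A graph-drawing tool would then take `edge_width` and `node_size` as the visual attributes for edges and nodes.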
Treemap – eBay Auctions (hierarchical eBay data: Category > Sub-category > Brand)
Rectangle size = average closing price (=item value)
Color = % sellers with negative feedback (darker=more)
Map Chart (Comparing countries’ well-being with GDP)
Darker = higher value