Chapter 10 Exploratory Data Analysis (erho.weebly.com/uploads/2/7/8/4/27841631/chapter_10.pdf)


Post on 21-Jul-2020


Chapter 10

Exploratory Data Analysis

Chapter 10. Introduction to EDA

Definition of Exploratory Data Analysis (page 410)

Definition 12.1.

Exploratory data analysis (EDA) is a subfield of applied statistics that is concerned with the investigation of the collected or transformed data to reveal patterns, peculiarities and relationships using visual displays, resistant statistics and a thorough examination of the residuals.

• EDA is a preliminary step in data analysis. It can be used to determine if the planned method for analysis is appropriate for the collected data.

• Four major themes that describe the methods used in EDA: revelation, resistance, reexpression, and residuals.


REVELATION (page 410-411)

EDA reveals the essential features of the dataset, usually through simple graphical displays such as the stem-and-leaf display and the boxplot.

These graphs can give us a general idea about the distribution, such as its center and other quantiles, spread, symmetry, and kurtosis.

Graphs can help detect sources of problems in analysis such as the presence of outliers and multimodality.

Graphs can also help reveal patterns and possible relationships among the different variables in the study.


RESISTANCE (page 411-412)

Definition 12.2

A statistic is said to be resistant if its value is not adversely affected (i) when we replace some of the values in a dataset with totally different values; or, (ii) when there are minor changes in all of the data values possibly due to rounding.

* The mean and the variance are not resistant statistics, which is why they are seldom used in EDA.


Example 12.1 (page 412)

ORIGINAL DATASET: Mean and Median = 74.

Obs. No. (i)   Xi      Obs. No. (i)   Xi      Obs. No. (i)   Xi
     1         55           6         70          11         82
     2         58           7         73          12         82
     3         60           8         74          13         83
     4         68           9         77          14         88
     5         70          10         81          15         89

Let us examine the effect on the sample mean and median if we replace one value in the dataset with an outlying value such as 1,000.

Obs. No. (i)   Modified Mean   Modified Md      Obs. No. (i)   Modified Mean   Modified Md      Obs. No. (i)   Modified Mean   Modified Md
     1         137             77                    6         136             77                   11         135.2           74
     2         136.8           77                    7         135.8           77                   12         135.2           74
     3         136 2/3         77                    8         135 11/15       77                   13         135 2/15        74
     4         136 2/15        77                    9         135 8/15        74                   14         134.8           74
     5         136             77                   10         135 4/15        74                   15         134 11/15       74
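The figures in Example 12.1 can be checked with a short Python sketch. The function name `replace_with_outlier` is ours, not the textbook's, and exact fractions are used so that means such as 136 2/3 are not rounded.

```python
from fractions import Fraction
from statistics import median

# Original dataset from Example 12.1 (already in ascending order).
data = [55, 58, 60, 68, 70, 70, 73, 74, 77, 81, 82, 82, 83, 88, 89]


def replace_with_outlier(values, index, outlier=1000):
    """Return (mean, median) after replacing values[index] with an outlier."""
    modified = list(values)
    modified[index] = outlier
    mean = Fraction(sum(modified), len(modified))  # exact, e.g. 2050/15 = 410/3
    return mean, median(modified)


for i in range(len(data)):
    mean, md = replace_with_outlier(data, i)
    print(f"obs {i + 1:2d}: mean = {float(mean):7.2f}, median = {md}")
```

Whichever observation is replaced, the mean jumps from 74 to roughly 135 to 137, while the median moves by at most 3 units. This is the sense in which the median is resistant and the mean is not.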


Definition of Stem-and-Leaf Display (page 416)

Definition 12.4.

The stem-and-leaf display (SALD) is a histogram-like display of the data where the digits of the data values replace the bars in representing the frequencies. Example:

Stem | Leaf (unit = 0.1)
   2 | 2
   3 | 5
   4 | 5 6 7
   5 | 2 5 5 8
   6 | 1 3 5 6 8 9
   7 | 1 2 5
   8 | 4
   9 | 1

Note: We can retrieve a data value from the display by joining the digits in the stem and the leaf and then multiplying the resulting number by the specified unit. For example, the smallest data value in the SALD above is 22 × 0.1 = 2.2. In the third row, the observations are 4.5, 4.6 and 4.7.
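The retrieval rule in the note can be sketched in Python. The helper `decode_row` is a hypothetical name introduced here. It takes the stem, the leaves and the unit as strings and uses `Decimal` so that 46 × 0.1 comes out as exactly 4.6.

```python
from decimal import Decimal


def decode_row(stem, leaves, unit):
    """Join stem and leaf digits, then multiply by the unit."""
    return [Decimal(f"{stem}{leaf}") * Decimal(unit) for leaf in leaves]


# Smallest value in the display: stem 2, leaf 2, unit 0.1.
print(decode_row("2", ["2"], "0.1"))
# Third row of the display: stem 4, leaves 5, 6 and 7.
print(decode_row("4", ["5", "6", "7"], "0.1"))
```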


Steps in Constructing the SALD (page 416)

Step 1. Choose the common division point of each observation where we will split each data value into its stem and leaf components.

Example 12.3: Smallest value is 102.4. Largest value is 1394.9.

Choices:

Location                           Example (for Abra: 235.9)   Values of Stem
between ones and tenths place      235 | 9                     102 to 1394
between tens and ones place        23 | 59                     10 to 139
between hundreds and tens          2 | 359                     1 to 13
between thousands and hundreds     0 | 2359                    0 to 1
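The four choices of division point can be reproduced with a small Python sketch. The helper `split_value` is a hypothetical name. It assumes each value is written as a string with exactly one decimal place, and `leaf_digits` counts how many trailing digits go into the leaf.

```python
def split_value(value, leaf_digits):
    """Split a one-decimal value such as '235.9' into (stem, leaf) strings."""
    # Drop the decimal point and pad on the left so the stem is never empty.
    digits = value.replace(".", "").zfill(leaf_digits + 1)
    stem = str(int(digits[:-leaf_digits]))
    leaf = digits[-leaf_digits:]
    return stem, leaf


for leaf_digits in (1, 2, 3, 4):
    stem, leaf = split_value("235.9", leaf_digits)
    print(f"{stem} | {leaf}")
```

Applying the same function to the smallest and largest values (102.4 and 1394.9) gives the range of stems for each choice, which is what determines how many rows the display will have.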


Steps in Constructing the SALD (pages 416-417)

Step 2. In a vertical column, list the smallest stem value up to the largest stem value, using increments of 1 unit.

Step 3. Draw a vertical line to the right of each stem value. Example:

Stem
 1 |
 2 |
 3 |
 4 |
 5 |
 6 |
 7 |
 8 |
 9 |
10 |
11 |
12 |
13 |


Steps in Constructing the SALD (page 417)

Step 4. Record the leaf portion of the first observation in the row corresponding to its stem value. Do the same for all of the observations.

Step 5. Sort the leaves within each stem row from lowest to highest. Maintain uniform spacing in between the leaves for each one of the rows. By doing so, the stem with the most number of leaves (observations) will have the longest line; that is, it will appear to have the longest bar.

 1 | 024 034 187 192 355 413 623 670 799 870 875 939 998
 2 | 024 084 219 253 322 359 459 535 548 626 708 734 747 767 790 840 858 911
 3 | 007 159 183 212 226 258 284 292 320 337 400 420 463 465 503 623 638
 4 | 055 057 366 388 478 480 820 923
 5 | 046 058 173 235 256 736 786
 6 | 645 764 984 998
 7 | 014 406 468
 8 | 088 138 669 928
 9 | 004 072
10 | 317
11 | 682
12 |
13 | 949


Step 6. Indicate the unit of the leaves to allow the recreation of the actual data values from the display. For example,

Unit = 0.1   35 | 6 represents 356 × 0.1 = 35.6
Unit = 1     35 | 6 represents 356 × 1 = 356
Unit = 10    35 | 6 represents 356 × 10 = 3,560

(Unit = 0.1 million pesos)
 1 | 024 034 187 192 355 413 623 670 799 870 875 939 998
 2 | 024 084 219 253 322 359 459 535 548 626 708 734 747 767 790 840 858 911
 3 | 007 159 183 212 226 258 284 292 320 337 400 420 463 465 503 623 638
 4 | 055 057 366 388 478 480 820 923
 5 | 046 058 173 235 256 736 786
 6 | 645 764 984 998
 7 | 014 406 468
 8 | 088 138 669 928
 9 | 004 072
10 | 317
11 | 682
12 |
13 | 949
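Steps 1 to 6 can be sketched in Python as follows. This is a minimal illustration, not the textbook's procedure verbatim: the function names `build_sald` and `format_sald` are ours, values are assumed to be strings with exactly one decimal place, and the sample is just a handful of values read off the display above.

```python
def build_sald(values, leaf_digits):
    """Steps 1 to 5: split each value into stem and leaf, then sort the leaves."""
    rows = {}
    for value in values:
        # Step 1: the last `leaf_digits` digits form the leaf, the rest the stem.
        digits = value.replace(".", "").zfill(leaf_digits + 1)
        stem = int(digits[:-leaf_digits])
        # Step 4: record the leaf in the row of its stem.
        rows.setdefault(stem, []).append(digits[-leaf_digits:])
    # Step 2: list every stem from smallest to largest in increments of 1,
    # so stems without observations still get an (empty) row.
    # Step 5: sort the leaves within each row.
    return {s: sorted(rows.get(s, [])) for s in range(min(rows), max(rows) + 1)}


def format_sald(rows, unit_label):
    """Steps 3 and 6: draw the vertical line and state the unit."""
    width = len(str(max(rows)))
    lines = [f"(Unit = {unit_label})"]
    for stem, leaves in rows.items():
        lines.append(f"{stem:>{width}} | {' '.join(leaves)}".rstrip())
    return "\n".join(lines)


sample = ["235.9", "102.4", "1394.9", "103.4", "202.4", "1168.2", "1031.7"]
print(format_sald(build_sald(sample, leaf_digits=3), "0.1 million pesos"))
```

Because every leaf has the same number of digits, sorting the leaf strings is equivalent to sorting them numerically, and equal-width leaves keep the row lengths proportional to the frequencies.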


Split Stem-and-Leaf Display (page 421)

If the number of leaves in some of the rows is too large, we may split each stem into two groups. In the first group, we include all leaves with leading digits from 0 to 4. In the second group, we include all leaves with leading digits from 5 to 9. We mark the stem of the first group with “ * ” and the stem of the second group with “ . ”.

If the number of leaves is still too large, we can divide each stem into five groups. We mark the stem of the first group with “ * ” and include all leaves with leading digits from 0 to 1. The second group is marked “t” and includes leaves with leading digits from 2 to 3, the third group is marked “f” and includes leaves with leading digits from 4 to 5, and the fourth group is marked “s” and includes leaves with leading digits from 6 to 7. The last group is marked “ . ” and includes leaves with leading digits from 8 to 9.
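The marking rule can be written as a small lookup. The sketch below is only an illustration; `split_mark` and `split_stem` are hypothetical names, and a leaf is passed as a digit string such as "85".

```python
FIVE_WAY_MARKS = "*tfs."  # leading digits 0-1, 2-3, 4-5, 6-7, 8-9


def split_mark(leading_digit, groups=2):
    """Return the mark for a leaf's leading digit under a 2- or 5-way split."""
    if groups == 2:
        return "*" if leading_digit <= 4 else "."
    if groups == 5:
        return FIVE_WAY_MARKS[leading_digit // 2]
    raise ValueError("groups must be 2 or 5")


def split_stem(stem, leaf, groups=2):
    """Label the row that a (stem, leaf) pair belongs to."""
    return f"{stem} {split_mark(int(leaf[0]), groups)}"


print(split_stem(1, "85", groups=5))  # leading digit 8 falls in the last group
print(split_stem(2, "47", groups=5))  # leading digit 4 falls in the "f" group
print(split_stem(2, "47", groups=2))  # leading digit 4 falls in the first half
```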


Example

Below are the starting salaries of a sample of 100 computer science majors who earned their baccalaureate degrees during a recent year:

Starting Salaries (P000)
24.2 29.9 23.4 23.0 25.5 22.0 33.9 20.4 26.6 24.0
28.9 22.5 18.7 32.6 26.1 26.2 26.7 20.4 22.2 24.7
18.6 18.5 19.6 24.4 24.8 27.8 27.6 27.2 20.8 22.1
19.7 25.3 28.2 34.2 32.5 30.8 26.8 20.6 21.2 20.7
25.2 25.7 32.2 28.8 24.7 18.7 20.5 25.5 19.1 25.5
22.1 27.5 25.8 25.2 25.6 25.2 25.2 27.9 18.9 37.2
29.9 23.2 19.8 20.8 29.5 27.6 21.2 38.7 21.3 24.8
32.3 20.1 26.8 25.4 26.3 21.2 19.5 22.8 21.7 25.6
32.3 28.1 27.5 25.3 19.3 27.4 26.4 20.9 34.5 25.9
31.4 27.4 27.3 20.6 31.8 25.8 25.2 21.9 26.8 26.5

Values range from 18.5 to 38.7.


Example of Split SALD

Stem  Leaf (unit = 0.1 thousand pesos)
1 . | 85 86 87 87 89 91 93 95 96 97 98
2 * | 01 04 04 05 06 06 07 08 08 09 12 12 12 13 17 19
2 t | 20 21 21 22 25 28 30 32 34
2 f | 40 42 44 47 47 48 48 52 52 52 52 52 53 53 54 55 55 55 56 57 58 58 59
2 s | 61 62 63 64 65 66 67 68 68 68 72 73 74 74 75 76 76 78 79
2 . | 81 82 88 89 95 99 99
3 * | 08 14 18
3 t | 22 23 23 25 26 39
3 f | 42 45
3 s | 72
3 . | 87

Note: If there are outlying values, they can be reported inside parentheses on a special row: the first row, labelled “low”, if the value is extremely low, or the last row, labelled “hi”, if the value is extremely large. For example, if the starting salaries of two graduates are as large as 120.3 and 150.4, then we add the following row at the bottom of the SALD:

hi | (120.3, 150.4)

Definition of Depth (page 419)

Definition 12.5

If we determine the two ranks of a data value by recording its position from each end of the array, then its depth is the smaller between these two ranks.


Example 12.4 (page 419)


12 8 15 5 25 30 22 24 10 28 18

The array and the corresponding ranks and depths of each observation in the array are as follows:

Array     5   8  10  12  15  18  22  24  25  28  30
Rank A    1   2   3   4   5   6   7   8   9  10  11   (from lowest to highest)
Rank B   11  10   9   8   7   6   5   4   3   2   1   (from highest to lowest)
Depth     1   2   3   4   5   6   5   4   3   2   1   (the smaller between Rank A and Rank B)

Q1 = 10, Md = 18, Q3 = 25

We will observe that the depths of the 1st and 3rd quartiles are both equal to (n+1)/4=(11+1)/4=3; and, the depth of the median is (n+1)/2=6.
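Definition 12.5 translates directly into code. The sketch below is our own illustration (the function name `depths` is hypothetical) and reproduces the Depth row of Example 12.4.

```python
def depths(values):
    """Return (value, depth) pairs for the sorted array of `values`."""
    array = sorted(values)
    n = len(array)
    # Rank A counts from the lowest value, Rank B from the highest;
    # the depth is the smaller of the two.
    return [(x, min(rank_a, n + 1 - rank_a))
            for rank_a, x in enumerate(array, start=1)]


example = [12, 8, 15, 5, 25, 30, 22, 24, 10, 28, 18]
for value, depth in depths(example):
    print(f"{value:2d}  depth {depth}")
```

The values at depth 3 are 10 and 25, and the single value at depth 6 is 18, which agrees with Q1, Q3 and Md above.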

Five-Number Summary (pages 425-426)

Definition 12.6

A letter value is a statistic whose value depends on its defined depth, which we tag using a particular letter.

The median is a letter value whose depth is (n+1)/2 and its tag is M.

Definition 12.7

The extremes are the two data values in the array with depths equal to 1.

Definition 12.8

The fourths or the hinges are the two data values in the array with the following depth:

depth of fourth = (depth of median + 1)/2, when n is odd
depth of fourth = (depth of median + 0.5)/2, when n is even

We use the letter F as our tag for the fourth.

Definition 12.9

The five-number summary is a collection of letter values consisting of the median, the fourths, and the extremes.


Note on the Fourth (page 426)

The fourth can be viewed as the two observations that are halfway between the median and the corresponding extremes.

The depth of the fourth is either a whole number or has a remainder of ½ since the numerator is always a whole number and the denominator is 2.

Interpolation is needed only when the depth has a remainder of ½. In this case, we just get the midpoint of the two values adjacent to the fourth.

Example: n=6, depth of fourth = (depth of median + 0.5)/2 = ((6+1)/2 + 0.5)/2 = 2

The fourths are the 2nd and the 2nd-to-the-last order statistics.

Position:  1  2  3  4  5  6
Lower fourth: position 2. Median: midway between positions 3 and 4. Upper fourth: position 5.

Example: n=7, depth of fourth = (depth of median + 1)/2 = ((7+1)/2 + 1)/2 = 2.5

The fourths are interpolated values. On each end of the array, the fourth is computed as the average of the two observations with depths equal to 2 and 3.

Position:  1  2  3  4  5  6  7
Lower fourth: midway between positions 2 and 3. Median: position 4. Upper fourth: midway between positions 5 and 6.
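The depth rule and the interpolation step can be sketched in Python as follows. The function names `fourth_depth` and `fourths` are ours, and the two small arrays are hypothetical data chosen only to mirror the n=6 and n=7 cases.

```python
import math


def fourth_depth(n):
    """Depth of the fourth for a sample of size n."""
    median_depth = (n + 1) / 2
    if n % 2 == 1:
        return (median_depth + 1) / 2    # n is odd
    return (median_depth + 0.5) / 2      # n is even


def fourths(values):
    """Return (lower fourth, upper fourth), interpolating when needed."""
    array = sorted(values)
    depth = fourth_depth(len(array))
    # When the depth ends in 1/2, average the two adjacent observations;
    # when it is a whole number, lo == hi and the average is the value itself.
    lo, hi = math.floor(depth), math.ceil(depth)
    lower = (array[lo - 1] + array[hi - 1]) / 2
    upper = (array[-lo] + array[-hi]) / 2
    return lower, upper


print(fourth_depth(6), fourths([10, 20, 30, 40, 50, 60]))      # hypothetical data
print(fourth_depth(7), fourths([10, 20, 30, 40, 50, 60, 70]))  # hypothetical data
```

Counting `array[-lo]` and `array[-hi]` from the end of the sorted list is what makes the upper fourth sit at the same depth as the lower fourth.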


Definition of Box-and-Whisker Plot (page 430)

Definition 12.11

The box-and-whisker plot, or boxplot, is a simple graphical display of the data used to display the five-number summary.

Note: The boxplot displays the following features of the data: (i) location, (ii) spread, (iii) symmetry, (iv) extremes, and (v) outliers.


Steps in Constructing the Boxplot (pages 430-431)

[Figure: boxplot drawn above a horizontal axis running from 0 to 30]

Step 1: Construct a rectangle with one end at the lower fourth (FL) and the other end at the upper fourth (FU).

Step 2: Put a line across the interior of the rectangle at the median.

Example: 1 10 14 15 18 20 21 22 22 22 23 24 24 25 28

Depth of median = (15+1)/2 = 8, so Med = 22

Depth of fourth = (depth of median + 1)/2 = 9/2 = 4.5

FL = (15+18)/2 = 16.5
FU = (24+23)/2 = 23.5


Steps in Constructing the Boxplot (cont’d)

Step 3: Compute the fourth-spread (dF), lower fence, and upper fence as follows:

dF = FU – FL
lower fence = FL – 1.5 dF
upper fence = FU + 1.5 dF

The lower and upper fences are outlier cutoffs. We consider all data points smaller than the lower fence or larger than the upper fence as outliers.

Example:
dF = 23.5 – 16.5 = 7
Lower fence = 16.5 – (1.5)(7) = 6
Upper fence = 23.5 + (1.5)(7) = 34
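Step 3 is a three-line computation. The sketch below uses a hypothetical helper named `fences`, with the multiplier left as a parameter so that the same function also gives the 3 dF cutoffs.

```python
def fences(fl, fu, multiplier=1.5):
    """Return (fourth-spread, lower fence, upper fence)."""
    d_f = fu - fl
    return d_f, fl - multiplier * d_f, fu + multiplier * d_f


# Fourths from the worked example: FL = 16.5, FU = 23.5.
d_f, lower_fence, upper_fence = fences(16.5, 23.5)
print(d_f, lower_fence, upper_fence)
```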


[Figure: boxplot of the example data above a horizontal axis running from 0 to 30, with the outlier marked by an x]

Step 4: Excluding outliers, identify the two data values that are closest to the lower fence and upper fence, respectively. Draw a line, starting from these values up to each side of the rectangle. We sometimes refer to these lines as the whiskers.

Step 5: Plot each outlier at its corresponding value, using an x-mark or any other distinctive mark. We consider an outlying observation that is less than FL – 3dF or greater than FU +3dF as an extreme outlier. We sometimes distinguish extreme outliers from other outliers by placing a circle at their actual location, instead of an “x.”

Example: 1 10 14 15 18 20 21 22 22 22 23 24 24 25 28

Lower fence = 6, FL = 16.5, Med = 22, FU = 23.5, Upper fence = 34
Outlier: 1
Closest data point to lower fence that is not an outlier: 10
Closest data point to upper fence that is not an outlier: 28
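Steps 4 and 5 can be sketched in Python as below. The function name `whiskers_and_outliers` and the returned dictionary keys are our own choices, and the fourths are passed in rather than recomputed to keep the example short.

```python
def whiskers_and_outliers(values, fl, fu):
    """Find the whisker ends and separate ordinary from extreme outliers."""
    d_f = fu - fl
    lower_fence, upper_fence = fl - 1.5 * d_f, fu + 1.5 * d_f
    lower_outer, upper_outer = fl - 3 * d_f, fu + 3 * d_f

    # Step 4: whiskers end at the most extreme values that are not outliers.
    inside = [x for x in values if lower_fence <= x <= upper_fence]
    # Step 5: anything beyond a fence is an outlier; beyond 3 dF it is extreme.
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    extreme = [x for x in outliers if x < lower_outer or x > upper_outer]
    ordinary = [x for x in outliers if lower_outer <= x <= upper_outer]
    return {
        "whisker_ends": (min(inside), max(inside)),
        "outliers": ordinary,
        "extreme_outliers": extreme,
    }


data = [1, 10, 14, 15, 18, 20, 21, 22, 22, 22, 23, 24, 24, 25, 28]
print(whiskers_and_outliers(data, fl=16.5, fu=23.5))
```

For the example data the value 1 is below the lower fence of 6 but above FL – 3dF = –4.5, so it is plotted as an ordinary outlier and not as an extreme one.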


Remarks (page 431)

The height of the rectangle is usually arbitrary and has no specific meaning. If several boxplots appear together, however, the height is sometimes made proportional to the different sample sizes. This is rarely done, however, because an accurate representation is very difficult to achieve.

Different statistical software packages present varying versions of the boxplot. For example, instead of plotting the sides of the rectangle at the lower fourth and upper fourth, some plot them at related summary measures, the 1st and 3rd quartiles respectively, and compute the fences as follows:

Lower fence = Q1 – 1.5 IQR
Upper fence = Q3 + 1.5 IQR
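As one illustration of a quartile-based variant, the sketch below uses `statistics.quantiles` from the Python standard library. Its default "exclusive" method locates the quartiles at positions (n+1)/4 and 3(n+1)/4, which happens to match the quartile depths used in Example 12.4. Which rule any particular statistical package applies is not stated in the source, so treat this as an assumption. The function name `quartile_fences` is ours.

```python
from statistics import quantiles


def quartile_fences(values):
    """Return (Q1, Q3, lower fence, upper fence) using quartiles and the IQR."""
    q1, _, q3 = quantiles(values, n=4)  # default method is "exclusive"
    iqr = q3 - q1
    return q1, q3, q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Data from Example 12.4.
print(quartile_fences([12, 8, 15, 5, 25, 30, 22, 24, 10, 28, 18]))
```

For the same data the fourths sit at depth 3.5 rather than 3, so a fourth-based boxplot and a quartile-based boxplot can have slightly different boxes and fences.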


Interpreting the Boxplot (page 433)

1. The line inside the rectangle shows the location of the median, our measure of central tendency.

2. The sides of the rectangle, which are plotted either at the fourths or the quartiles, indicate where the middle 50% of the observations lie.

3. The length of the rectangle represents the magnitude of either the fourth-spread or the inter-quartile range, our measure of dispersion.

4. The relative position of the line inside the rectangle to its sides gives us an idea of the degree and direction of skewness, because it shows the respective distances of the median to the lower and upper fourths. A line in the middle of the rectangle indicates that the distribution is symmetric; a line that is closer to the lower fourth (or 1st quartile) indicates that the distribution is skewed to the right; and a line that is closer to the upper fourth (or 3rd quartile) indicates that the distribution is skewed to the left.

5. If there are no outliers then the ends of the whiskers indicate the respective values of both extremes; but, if there are outliers then the farthest outlier is our extreme.

6. The outliers are clearly identified by the distinctive marks used to plot them.
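The reading in item 4 can be expressed as a simple comparison of two distances. This is only a rough sketch of that one rule, with a hypothetical function name; it ignores the whiskers and outliers, which also carry information about skewness.

```python
def skew_direction(fl, median, fu):
    """Compare the median's distance to each side of the box."""
    lower_gap = median - fl
    upper_gap = fu - median
    if lower_gap == upper_gap:
        return "symmetric"
    if lower_gap < upper_gap:
        return "skewed to the right"  # median closer to the lower fourth
    return "skewed to the left"       # median closer to the upper fourth


# Values from the boxplot example: FL = 16.5, Med = 22, FU = 23.5.
print(skew_direction(16.5, 22, 23.5))
```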

[Figure: Interpreting the Boxplot. Three boxplots illustrating a symmetric distribution, a negatively-skewed distribution, and a positively-skewed distribution]


Comparing Distributions using the Boxplot

[Figure: Total Financial Resources Generated by Major Geographic Region. Boxplots for Mindanao, Visayas, and Luzon; vertical axis in million pesos, from 0 to 1400]

Assignment

Use the data on page 434, Exercise 3, on the illiteracy rate among the male and female populations, 15 years of age and over, in Asia in 2001.

1. Construct a split stem-and-leaf display of the illiteracy rate among the male population. Let the common division point be between the tens and ones digits. Split each stem into two lines.

2. Compute the median, lower fourth, upper fourth, fourth-spread, lower fence, and upper fence of the illiteracy rate of:

i. Male population
ii. Female population

3. Use the values computed in no. 2 to draw the boxplot of the illiteracy rate among the male population. On the same plotting area, draw the boxplot of the illiteracy rate among the female population.

Chapter 12. Exploratory Data Analysis