stt 315 ashwini maurya this lecture is based on chapter 2 of the textbook. 1 acknowledgement: author...

Upload: melina-shelton

Post on 26-Dec-2015


TRANSCRIPT

  • Slide 1
  • STT 315, Ashwini Maurya. This lecture is based on Chapter 2 of the textbook. Acknowledgement: The author is thankful to Dr. Ashok Sinha, Dr. Jennifer Kaplan and Dr. Parthanil Roy for allowing him to use/edit some of their slides.
  • Slide 2
  • Topic of this chapter: These materials can be read from Chapters 2.1-2.5 of the textbook. We shall first cover some descriptive statistics of qualitative variables (Ch. 2.1). Later we shall study descriptive statistics of quantitative variables (Ch. 2.2-2.5). In descriptive statistics we summarize data through graphs and tables.
  • Slide 3
  • How to display qualitative data? Frequency tables; bar graph (or bar chart); pie chart (or pie diagram); Pareto chart (or Pareto diagram).
  • Slide 4
  • Qualitative variables: A qualitative (or categorical) variable usually cannot be measured on a numerical scale; it simply records a quality. Each category of a qualitative variable is also called a class or level. For instance, the qualitative variable GENDER has two classes, namely Male and Female. If we count the number of observations belonging to each class, this count is called the class frequency, or simply the frequency. The relative frequency of a class is obtained by dividing the class frequency by the total number of observations.
  • Slide 5
  • Frequency Tables: These are tables in which the classes (categories) are written in the leftmost column and the corresponding counts are written in the second column. The count is also known as the frequency. Sometimes proportions (or percentages) are written instead of, or in addition to, the actual counts. A proportion is also called a relative frequency.
  • Slide 6
  • Frequency Table: An Example. Frequency table of the number of golf balls sold on different days of a week:
      Day       | # of Golf Balls Sold (Frequency) | % of Golf Balls Sold
      Monday    | 17                               | 19.54
      Tuesday   | 13                               | 14.94
      Wednesday | 15                               | 17.24
      Thursday  | 20                               | 22.99
      Friday    | 22                               | 25.29
      Total     | 87                               | 100
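The percentage column of the table can be reproduced from the counts. The following is a minimal sketch, not part of the original slides; the names `golf_balls_sold` and `relative_frequencies` are made up for illustration.

```python
# Daily counts copied from the golf-ball frequency table.
golf_balls_sold = {
    "Monday": 17,
    "Tuesday": 13,
    "Wednesday": 15,
    "Thursday": 20,
    "Friday": 22,
}


def relative_frequencies(counts):
    """Divide each class frequency by the total number of observations."""
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


rel_freq = relative_frequencies(golf_balls_sold)

# Print one row per class: label, frequency, percentage.
for day, count in golf_balls_sold.items():
    print(f"{day:<10} {count:>3} {100 * rel_freq[day]:6.2f}%")
```

The relative frequencies add up to 1 (that is, 100%), which is a quick sanity check on any frequency table.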
  • Slide 7
  • Bar Charts: A bar chart (or bar graph) is a chart with rectangular bars whose lengths are proportional to the values they represent. The bars can be plotted vertically (more common) or horizontally (less common). Percentages or relative proportions can also be plotted instead of the actual values.
  • Slide 8
  • Bar Chart: Golf Balls Sold
  • Slide 9
  • Bar Chart: % Golf Balls Sold
  • Slide 10
  • Pie Chart: A pie chart (or circle graph) is a circular chart divided into sectors, illustrating proportion. The arc length of each sector (and consequently its central angle and area) is proportional to the quantity it represents. The angles are computed from the fact that 100% corresponds to 360 degrees.
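As a small worked illustration of the "100% is 360 degrees" rule, the sketch below converts counts into central angles. It reuses the golf-ball counts from the frequency-table example; the function name `sector_angles` is hypothetical.

```python
def sector_angles(counts):
    """Return the central angle (in degrees) of each pie-chart sector."""
    total = sum(counts.values())
    # A class holding a fraction p of the data gets p * 360 degrees.
    return {label: 360 * count / total for label, count in counts.items()}


angles = sector_angles(
    {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}
)

for day, angle in angles.items():
    print(f"{day:<10} {angle:6.2f} degrees")
```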
  • Slide 11
  • Pie Chart: Golf Balls Sold
  • Slide 12
  • Pie Chart: An Example. Pie chart of native English speakers.
  • Slide 13
  • Bar Chart vs. Pie Chart: A bar chart is used more often to represent actual values, while a pie chart is used to represent relative proportions (in %). When comparison of relative proportions is important, a pie chart is more appropriate. When the absolute counts or values are more important, a bar chart should be used.
  • Slide 14
  • Major points so far: The first step in organizing data is to draw a picture. Appropriate pictures for categorical data: pie chart, bar chart.
  • Slide 15
  • Pareto diagram: A Pareto diagram is a particular type of bar diagram in which the classes are arranged on the horizontal axis in decreasing order of frequency. That is, the leftmost class has the highest frequency bar, followed by the class with the next highest frequency bar, and so on.
  • Slide 16
  • The following Pareto diagram represents the incarceration rate (per 100,000 people) of various countries.
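The ordering that defines a Pareto diagram is just a sort by decreasing frequency. Here is a minimal sketch using the golf-ball counts from the earlier frequency table (the incarceration data are not in the transcript); `pareto_order` is an illustrative name.

```python
def pareto_order(counts):
    """Return (class, frequency) pairs sorted from highest to lowest frequency."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


ordered = pareto_order(
    {"Monday": 17, "Tuesday": 13, "Wednesday": 15, "Thursday": 20, "Friday": 22}
)

# Left-to-right order of the bars in the Pareto diagram.
for day, count in ordered:
    print(f"{day:<10} {'#' * count} ({count})")
```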
  • Slide 17
  • Displaying Quantitative Data: histograms, stem-and-leaf displays, dotplots.
  • Slide 18
  • Histograms: A histogram is a graphical representation that gives a visual impression of the distribution of quantitative data. It consists of adjacent rectangles erected over intervals (also known as bins or classes). The lengths of the intervals may differ, and an interval may contain a single value. The heights are equal to the number (frequency) of observations in the corresponding bins. Sometimes percentages (or relative frequencies) are represented by the heights instead.
  • Slide 19
  • Histogram: An Example. The heights of 31 Black Cherry trees.
  • Slide 20
  • A Few Questions. How to choose the bin size? Let the computer decide it for you. What happens to observations on the boundary of two bins? Put them in the higher bin. Don't we lose information? Yes, we do.
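The boundary rule ("put them in the higher bin") amounts to using bins that include their left endpoint and exclude their right endpoint. The sketch below assumes equal-width bins and uses made-up data; `bin_counts` and the sample values are hypothetical, not from the slides.

```python
def bin_counts(data, start, width, n_bins):
    """Count observations per bin, with bins [start + k*width, start + (k+1)*width).

    A value exactly on a boundary falls into the higher of the two bins.
    """
    counts = [0] * n_bins
    for x in data:
        index = int((x - start) // width)
        if 0 <= index < n_bins:  # ignore values outside the covered range
            counts[index] += 1
    return counts


# Hypothetical data; 65, 70, 75 and 80 sit exactly on bin boundaries.
sample = [60, 63, 65, 70, 70, 74, 75, 80]
counts = bin_counts(sample, start=60, width=5, n_bins=5)
print(counts)
```

Only the counts survive the binning, which is the loss of information the slide mentions: the individual values cannot be recovered from the histogram.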
  • Slide 21
  • Stem-and-Leaf Display: Another device for presenting quantitative data in a graphical format. It assists in visualizing the shape of the distribution of the observations. Unlike histograms, stem-and-leaf displays retain the original data. The display contains two columns separated by a vertical line: the left column contains the stems and the right column contains the leaves. Suppose we have the following data on weights (in lb) of 17 school-kids: 88, 47, 68, 76, 46, 106, 49, 63, 72, 64, 84, 66, 68, 75, 72, 81, 44.
  • Slide 22
  • How do they work? Sorted data: 44, 46, 47, 49, 63, 64, 66, 68, 68, 72, 72, 75, 76, 81, 84, 88, 106.
      Stem | Leaf
         4 | 4 6 7 9
         5 |
         6 | 3 4 6 8 8
         7 | 2 2 5 6
         8 | 1 4 8
         9 |
        10 | 6
      Key: 6|3 = 63. Leaf unit: 1.0. Stem unit: 10.0.
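The display can be built mechanically: sort the data, split each value into a stem (tens) and a leaf (units), and list the leaves per stem. This is a rough sketch for whole-number data with leaf unit 1; `stem_and_leaf` is an illustrative name.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Map each stem (value // 10) to its sorted list of leaves (value % 10)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    # Keep empty stems so gaps in the data remain visible.
    return {s: stems.get(s, []) for s in range(min(stems), max(stems) + 1)}


weights = [88, 47, 68, 76, 46, 106, 49, 63, 72, 64, 84, 66, 68, 75, 72, 81, 44]
display = stem_and_leaf(weights)

for stem, leaves in display.items():
    print(f"{stem:>2} | {' '.join(str(leaf) for leaf in leaves)}")
```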
  • Slide 23
  • Dotplots: A dotplot is a statistical chart consisting of a group of data points plotted on a simple scale. Dotplots can be drawn both horizontally and vertically.
  • Slide 24
  • Summary: We have learnt three methods of displaying quantitative data: histogram, stem-and-leaf display and dotplot. When the data-size is small, the stem-and-leaf display and dotplot are more useful. When the data-size is large, the histogram is more useful.
  • Slide 25
  • Distribution of the Data-points. Three important features: shape of the distribution, center of the distribution, spread of the distribution.
  • Slide 26
  • Shape of a Distribution: Modes. The peaks of a histogram are called modes. A distribution is unimodal if it has one mode, bimodal if it has two modes, and multimodal if it has three or more modes.
  • Slide 27
  • Unimodal, Bimodal or Multimodal? Unimodal; Bimodal; Multimodal.
  • Slide 28
  • Uniform Histogram: A histogram that doesn't appear to have any mode. All the bars are approximately the same height.
  • Slide 29
  • Shape of a Distribution: Symmetry. If the histogram can be folded along a vertical line through the middle and have the edges match pretty closely, then the distribution is symmetric. Otherwise, it is skewed.
  • Slide 30
  • Skewed to the left or right? Skewed to the left; skewed to the right.
  • Slide 31
  • Shape of a Distribution: Outliers. Outliers are the data-points that stand away from the body of the histogram. They are too high or too low compared to most of the observations.
  • Slide 32
  • The following distribution is: A. Unimodal and skewed to the left. B. Bimodal and skewed to the right. C. Bimodal and symmetric. D. Multimodal and symmetric. E. Unimodal and skewed to the right.
  • Slide 33
  • Does this distribution have an outlier? (a) Yes, it does. (b) No, it doesn't.
  • Slide 34
  • The following distribution is: A. Unimodal and skewed to the left. B. Bimodal and skewed to the right. C. Bimodal and symmetric. D. Multimodal and symmetric. E. Unimodal and skewed to the right.
  • Slide 35
  • Numerical measures for quantitative data
  • Slide 36
  • Center of a Distribution. Median: the middlemost observation when the data are sorted in increasing order. The median can always be used as the center of a distribution. Mean: the average of all data-points. The mean can be used as the center of a distribution when the distribution is symmetric.
  • Slide 37
  • What is Median? The median is the middlemost observation when the data are sorted in increasing order. Data: 23, 33, 12, 39, 27. Sorted data: 12, 23, 27, 33, 39. Median: 27.
  • Slide 38
  • What if there is an even number of observations? Take the average of the two middlemost observations in that case. Data: 23, 33, 12, 39, 27, 10. Sorted data: 10, 12, 23, 27, 33, 39. Median = (23 + 27)/2 = 25.
  • Slide 39
  • What is the general rule? Suppose there are n observations. Sort them in increasing order. If n is odd, then the median is the observation in the ((n+1)/2)-th position. If n is even, then the median is the average of the observations in the (n/2)-th and (n/2 + 1)-th positions.
  • Slide 40
  • When n is odd. Data: 23, 33, 12, 39, 27; n = 5 (odd). Sorted data: 12, 23, 27, 33, 39. Median = observation in the ((5+1)/2)-th position = observation in the 3rd position = 27.
  • Slide 41
  • When n is even. Data: 23, 33, 12, 39, 27, 10; n = 6 (even). Sorted data: 10, 12, 23, 27, 33, 39. Median = average of the observations in the (6/2)-th and (6/2 + 1)-th positions = average of the observations in the 3rd and 4th positions = (23 + 27)/2 = 25.
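The general rule translates directly into code. The sketch below is one possible rendering; note that the slides count positions from 1, while Python lists are indexed from 0, so the ((n+1)/2)-th position is index `n // 2` when n is odd.

```python
def median(values):
    """Median by the odd/even rule: middle value, or average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # n odd: the ((n+1)/2)-th observation (1-based) is index n // 2 (0-based).
        return ordered[mid]
    # n even: average the (n/2)-th and (n/2 + 1)-th observations (1-based).
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([23, 33, 12, 39, 27]))      # odd n
print(median([23, 33, 12, 39, 27, 10]))  # even n
```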
  • Slide 42
  • What is mean? The mean is the average of all the observations (i.e., add up all the values and divide by the number of values). If an observation repeats, we add it as many times as it repeats when we calculate the average. The mean can be used as the center of a distribution when the distribution is symmetric. Data: 10, 13, 18, 22, 29. Mean = (10 + 13 + 18 + 22 + 29)/5 = 18.40.
  • Slide 43
  • Mean vs. Median. Without the outlier: data 10, 13, 18, 22, 29; mean = 18.40, median = 18. With the outlier: data 10, 13, 18, 22, 29, 68; mean = 26.67, median = 20. Conclusion: The mean is more outlier-sensitive than the median.
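The comparison can be checked with Python's standard `statistics` module. This is only a quick verification sketch of the slide's numbers; the variable names are made up.

```python
from statistics import mean, median

without_outlier = [10, 13, 18, 22, 29]
with_outlier = without_outlier + [68]  # 68 is the outlier from the slide

mean_without, median_without = mean(without_outlier), median(without_outlier)
mean_with, median_with = mean(with_outlier), median(with_outlier)

# The single outlier moves the mean by far more than it moves the median.
print(f"mean:   {mean_without:.2f} -> {mean_with:.2f}")
print(f"median: {median_without:.2f} -> {median_with:.2f}")
```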
  • Slide 44
  • Mean vs. Median. The mean is more outlier-sensitive than the median. For a symmetric distribution, mean = median. Thus the mean is more useful as the center of a distribution when the distribution is symmetric, but the median can always be used as the center of a distribution. For a right-skewed distribution, typically mean > median. For a left-skewed distribution, typically mean < median. Learn to use the TI 83/84 Plus to compute the mean and median.
  • Slide 45
  • TI 83/84 Plus commands. To enter the data: press [STAT]; under EDIT select 1: Edit and press ENTER; columns with names L1, L2, etc. will appear; type the data values under a column, following each entry with ENTER. To clear data: pressing CLEAR will clear a particular entry. To clear all data from all columns, press [2nd] and [+], and then choose 4: ClrAllLists.
  • Slide 46
  • TI 83/84 Plus commands
  • Slide 47
  • Effect of Linear Transformation. Suppose every observation is multiplied by a fixed constant. Then the median of the transformed observations is the median of the original observations times that same constant, and the mean of the transformed observations is the mean of the original observations times that same constant. Data: 10, 13, 18, 22, 29. Mean = 18.40. Median = 18. Suppose transformed data = (-3) × original data. So transformed data: -30, -39, -54, -66, -87. Mean = (-3) × 18.40 = -55.20. Median = (-3) × 18 = -54.
  • Slide 48
  • Effect of Linear Transformation. Suppose a fixed constant is added to (or subtracted from) each observation. Then the median of the transformed observations is the median of the original observations plus (or minus) that same constant, and the mean of the transformed observations is the mean of the original observations plus (or minus) that same constant. Data: 10, 13, 18, 22, 29. Mean = 18.40. Median = 18. Suppose transformed data = original data + 2.5. Hence transformed data: 12.5, 15.5, 20.5, 24.5, 31.5. Mean = 18.40 + 2.5 = 20.90. Median = 18 + 2.5 = 20.50.
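Both rules can be verified numerically on the slide's data. The helper `linear_transform` below is a hypothetical name for "multiply by a constant, then add a constant"; comparisons use `math.isclose` because of floating-point rounding.

```python
import math
from statistics import mean, median


def linear_transform(values, scale=1.0, shift=0.0):
    """Apply x -> scale * x + shift to every observation."""
    return [scale * x + shift for x in values]


data = [10, 13, 18, 22, 29]
scaled = linear_transform(data, scale=-3)    # -30, -39, -54, -66, -87
shifted = linear_transform(data, shift=2.5)  # 12.5, 15.5, 20.5, 24.5, 31.5

# Multiplying by a constant multiplies the mean and median by that constant.
print(mean(scaled), median(scaled))
# Adding a constant adds that constant to the mean and median.
print(mean(shifted), median(shifted))
```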
  • Slide 49
  • Spread of a Distribution. Are the values concentrated around the center of the distribution, or are they spread out? Measures: range, interquartile range, variance, standard deviation. Note: Variance and standard deviation are more appropriate when the distribution is symmetric.
  • Slide 50
  • Range. The range of the data is defined as the difference between the maximum and the minimum values. Data: 23, 21, 67, 44, 51, 12, 35. Range = maximum - minimum = 67 - 12 = 55. Disadvantage: A single extreme value can make it very large, giving a value that does not really represent the data overall. On the other hand, it is not affected at all if some observation in the middle changes.
  • Slide 51
  • Interquartile Range (IQR). What is the IQR? IQR = third quartile (Q3) - first quartile (Q1). What are quartiles? Recall: The median divides the data into 2 equal halves. The first quartile, the median and the third quartile divide the data into 4 roughly equal parts.
  • Slide 52
  • Quartiles. The first quartile (Q1, lower quartile) is the value which is larger than 25% of the observations but smaller than 75% of the observations. The second quartile (Q2) is the median, which is larger than 50% of the observations but smaller than 50% of the observations. The third quartile (Q3, upper quartile) is the value which is larger than 75% of the observations but smaller than 25% of the observations. Necessarily, Q1 ≤ Q2 (= median) ≤ Q3. How to compute the quartiles? We shall use the TI 83/84 Plus.
  • Slide 53
  • IQR vs. Range. The IQR is a better summary of the spread of a distribution than the range because it has some information about the entire data, whereas the range only has information on the extreme values of the data. The IQR is less outlier-sensitive than the range.
  • Slide 54
  • Outlier-sensitivity. Without the outlier: data 10, 13, 17, 21, 28, 32; IQR = 15, range = 22. With the outlier: data 10, 13, 17, 21, 28, 32, 59; IQR = 19, range = 49. Conclusion: The IQR is less outlier-sensitive than the range.
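The slides compute quartiles on the calculator and do not state the rule. The sketch below assumes the "median of each half" convention (Q1 is the median of the lower half, Q3 the median of the upper half, with the overall median left out when n is odd). That assumption reproduces the IQR values on the slide; other software may use different quartile rules and give slightly different numbers.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention (an assumption)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]            # values below the middle
    upper = ordered[half + (n % 2):]  # values above the middle (skip median if n is odd)
    return median(lower), median(ordered), median(upper)


def iqr(values):
    q1, _, q3 = quartiles(values)
    return q3 - q1


def data_range(values):
    return max(values) - min(values)


without_outlier = [10, 13, 17, 21, 28, 32]
with_outlier = without_outlier + [59]

print(iqr(without_outlier), data_range(without_outlier))
print(iqr(with_outlier), data_range(with_outlier))
```

Adding the outlier 59 moves the range from 22 to 49 but the IQR only from 15 to 19, which is the sense in which the IQR is less outlier-sensitive.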
  • Slide 55
  • Variance and Standard Deviation. The sample variance ($s^2$) is defined as follows: subtract the mean from each value, square each difference, add up the squares, and divide by one fewer than the sample size, i.e. $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$. The sample standard deviation ($s$) is the positive square root of the sample variance, i.e. $s = \sqrt{s^2}$.
  • Slide 56
  • Variance and Standard Deviation. The larger the variance (and standard deviation), the more dispersed the observations are around the mean. The unit of variance is the square of the unit of the original data, whereas the standard deviation has the same unit as the original data. Both variance and standard deviation are more appropriate for symmetric distributions.
  • Slide 57
  • Standard Deviation: An Example. Data: 3, 12, 8, 9, 3 (n = 5 in this case). Mean = (3 + 12 + 8 + 9 + 3)/5 = 35/5 = 7.
      Data  | Deviation from mean | Squared deviation
      3     | 3 - 7 = -4          | (-4) x (-4) = 16
      12    | 12 - 7 = 5          | 5 x 5 = 25
      8     | 8 - 7 = 1           | 1 x 1 = 1
      9     | 9 - 7 = 2           | 2 x 2 = 4
      3     | 3 - 7 = -4          | (-4) x (-4) = 16
      Total |                     | 62
    Now divide by n - 1 = 4: s^2 = 62/4 = 15.50, and s = sqrt(15.50) ≈ 3.94. Answer: The standard deviation in this example is 3.94 and the variance is 15.50.
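The four steps in the definition (subtract the mean, square, add up, divide by n - 1) can be written out one per line. This is a minimal sketch on the example data; the function names are illustrative.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by (n - 1)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    return sum(squared_deviations) / (n - 1)


def sample_std_dev(values):
    """Positive square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [3, 12, 8, 9, 3]
print(f"variance = {sample_variance(data):.2f}")
print(f"std dev  = {sample_std_dev(data):.2f}")
```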
  • Slide 58
  • Effect of Linear Transformation. Suppose every observation is multiplied by a fixed constant. Then the range/IQR/standard deviation of the transformed observations is the range/IQR/standard deviation of the original observations times the absolute value of that same constant, and the variance of the transformed observations is the variance of the original observations times the square of that same constant. Temperature data (in °F): 10, 13, 18, 22, 29. Range = 19 °F, IQR = 14 °F, s ≈ 7.5 °F, s^2 ≈ 56.25 °F^2. Suppose transformed data = (-3) × original data. So transformed data (in °F): -30, -39, -54, -66, -87. Range = |-3| × 19 = 57 °F, IQR = |-3| × 14 = 42 °F, s = |-3| × 7.5 = 22.50 °F, s^2 = (-3)^2 × 56.25 = 506.25 °F^2.
  • Slide 59
  • Effect of Linear Transformation. Suppose a fixed constant is added to (or subtracted from) each observation. Then the range/IQR/standard deviation/variance of the transformed observations remains the same as that of the original observations. Temperature data (in °F): 10, 13, 18, 22, 29. Range = 19 °F, IQR = 14 °F, s ≈ 7.5 °F, s^2 ≈ 56.25 °F^2. Suppose transformed data = original data + 2.5. Hence transformed data (in °F): 12.5, 15.5, 20.5, 24.5, 31.5. Range = 19 °F, IQR = 14 °F, s ≈ 7.5 °F, s^2 ≈ 56.25 °F^2.
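A quick numerical check of these rules, using the standard `statistics` module. One caveat worth labeling: the slide rounds s to 7.5 before squaring, so the unrounded variance printed here differs slightly from 56.25.

```python
import math
from statistics import stdev, variance

data = [10, 13, 18, 22, 29]
scaled = [-3 * x for x in data]    # every observation multiplied by -3
shifted = [x + 2.5 for x in data]  # 2.5 added to every observation


def data_range(values):
    return max(values) - min(values)


# Scaling: range and standard deviation scale by |-3|, variance by (-3)^2.
print(data_range(scaled), stdev(scaled), variance(scaled))
# Shifting: range, standard deviation and variance are unchanged.
print(data_range(shifted), stdev(shifted), variance(shifted))
```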
  • Slide 60
  • Empirical rule & Chebyshev's rule
  • Slide 61
  • Empirical rule. For an approximately symmetric unimodal (bell-shaped/mound-shaped) distribution: approximately 68% of observations fall within 1 standard deviation of the mean; approximately 95% of observations fall within 2 standard deviations of the mean; approximately 99.7% of observations fall within 3 standard deviations of the mean.
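The three percentages are the values one gets for an exactly normal distribution. The slides only require a bell-shaped distribution, so treating it as exactly normal is an assumption of this sketch. It uses `statistics.NormalDist`, available in Python 3.8 and later.

```python
from statistics import NormalDist


def fraction_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    standard_normal = NormalDist(mu=0, sigma=1)
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {100 * fraction_within(k):.1f}%")
```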
  • Slide 62
  • Empirical rule
  • Slide 63
  • Empirical rule
  • Slide 64
  • Chebyshev's rule
  • Slide 65
  • Box plot
  • Slide 66
  • Box Plot. A box plot is a graphical representation of the following 5-number summary: 1. Minimum value, 2. Lower quartile, 3. Median (the middle value), 4. Upper quartile, 5. Maximum value. NOTE: Data must be ordered from lowest value to highest value before finding the 5-number summary.
  • Slide 67
  • Box Plots. They are a representation of the five-number summary (minimum, lower quartile, median, upper quartile, maximum). Half the data are in the box. One-quarter of the data are in each whisker. If one part of the plot is long, the data are skewed. Box plots are very useful for comparing distributions. This box plot indicates the data are skewed to the left.
  • Slide 68
  • Box Plot. A box plot is a pictorial representation of the 5-number summary.
  • Slide 69
  • Outliers. Any observation farther than 1.5 times the IQR from the closest boundary of the box is an outlier. If it is farther than 3 times the IQR, it is an extreme outlier; otherwise it is a mild outlier. One can also indicate the outliers in a box plot by drawing the whiskers only up to 1.5 times the IQR on both sides, and marking outliers with stars or crosses (or other symbols).
  • Slide 70
  • An example. Suppose min = 2, Q1 = 18, median = 20, Q3 = 22, max = 35. Which of the following observations are outliers? A. 10 B. 15 C. 25 D. 30
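The 1.5 × IQR and 3 × IQR rules can be sketched as a small classifier. The function name `classify` and its return labels are made up for illustration; the thresholds are the ones stated on the outlier slide.

```python
def classify(x, q1, q3):
    """Label x as 'extreme outlier', 'mild outlier' or 'not an outlier'.

    Distances are measured from the closest boundary of the box (Q1 or Q3).
    """
    iqr = q3 - q1
    if x < q1 - 3 * iqr or x > q3 + 3 * iqr:
        return "extreme outlier"
    if x < q1 - 1.5 * iqr or x > q3 + 1.5 * iqr:
        return "mild outlier"
    return "not an outlier"


# Quartiles from the example: Q1 = 18, Q3 = 22, so IQR = 4.
# The 1.5 x IQR fences are 18 - 6 = 12 and 22 + 6 = 28.
for candidate in (10, 15, 25, 30):
    print(candidate, classify(candidate, q1=18, q3=22))
```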
  • Slide 71
  • Histogram vs. Box plot. Both the histogram and the box plot capture the symmetry or skewness of distributions. A box plot cannot indicate the modality of the data. A box plot is much better for finding outliers. The shape of a histogram depends to some extent on the choice of bins.
  • Slide 72
  • Comparing Distributions. We can compare the distributions of various data-sets using box plots (or the 5-number summary) or histograms. We shall first compare distributions using box plots.
  • Slide 73
  • Which type of car has the largest median time to accelerate? A. upscale B. sports C. small D. large E. family
  • Slide 74
  • Which type of car has the smallest median time value? A. upscale B. sports C. small D. large E. luxury
  • Slide 75
  • Which type of car always takes less than 3.6 seconds to accelerate? A. upscale B. sports C. small D. large E. luxury
  • Slide 76
  • Which type of car has the smallest IQR for time to accelerate? A. upscale B. sports C. small D. large E. luxury
  • Slide 77
  • What is the shape of the distribution of acceleration times for luxury cars? A. Left skewed B. Right skewed C. Roughly symmetric D. Cannot be determined from the information given
  • Slide 78
  • What percent of luxury cars accelerate to 30 mph in less than 3.5 seconds? A. Roughly 25% B. Exactly 37.5% C. Roughly 50% D. Roughly 75% E. Cannot be determined from the information given
  • Slide 79
  • What percent of family cars accelerate to 30 mph in less than 3.5 seconds? A. Less than 25% B. More than 50% C. Less than 50% D. Exactly 75% E. None of the above
  • Slide 80
  • Comparing Distributions: Use of Histograms
  • Slide 81
  • Which data have more variability? A. Graph A B. Graph B C. Both have the same variability
  • Slide 82
  • Which data have more variability? A. Graph A B. Graph B C. Both have the same variability
  • Slide 83
  • Which data have a higher median? A. Graph A B. Graph B C. Both have the same median
  • Slide 84
  • Which data have more variability? A. Graph A B. Graph B C. Roughly, both have the same variability
  • Slide 85
  • z-score
  • Slide 86
  • 86 How to compare apples with oranges? A college admissions committee is looking at the files of two candidates, one with a total SAT score of 1500 and another with an ACT score of 22. Which candidate scored better? How do we compare things when they are measured on different scales? We need to standardize the values.
  • Slide 87
  • How to standardize? Subtract the mean from the value and then divide this difference by the standard deviation: z = (value - mean) / (standard deviation). The standardized value is the z-score. z-scores are free of units.
  • Slide 88
  • z-scores: An Example. Data: 4, 3, 10, 12, 8, 9, 3 (n = 7 in this case). Mean = (4 + 3 + 10 + 12 + 8 + 9 + 3)/7 = 49/7 = 7. Standard deviation = 3.65.
      Original value | z-score
      4              | (4 - 7)/3.65 = -0.82
      3              | (3 - 7)/3.65 = -1.10
      10             | (10 - 7)/3.65 = 0.82
      12             | (12 - 7)/3.65 = 1.37
      8              | (8 - 7)/3.65 = 0.27
      9              | (9 - 7)/3.65 = 0.55
      3              | (3 - 7)/3.65 = -1.10
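The table can be reproduced in a few lines. This sketch uses the sample standard deviation (dividing by n - 1), which matches the value 3.65 on the slide; `z_scores` is an illustrative name.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize each value: subtract the mean, divide by the sample standard deviation."""
    center = mean(values)
    spread = stdev(values)
    return [(x - center) / spread for x in values]


data = [4, 3, 10, 12, 8, 9, 3]
scores = z_scores(data)

for value, z in zip(data, scores):
    print(f"{value:>3} {z:6.2f}")
```

After standardization the scores have mean 0 and standard deviation 1, whatever the units of the original data.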
  • Slide 89
  • 89 Interpretation of z-scores The z-scores measure the distance of the data values from the mean in the standard deviation scale. A z-score of 1 means that data value is 1 standard deviation above the mean. A z-score of -1.2 means that data value is 1.2 standard deviations below the mean. Regardless of the direction, the further a data value is from the mean, the more unusual it is. A z-score of -1.3 is more unusual than a z-score of 1.2.
  • Slide 90
  • 90 How to use z-scores? A college admissions committee is looking at the files of two candidates, one with a total SAT score of 1500 and another with an ACT score of 22. Which candidate scored better? SAT score mean = 1600, std dev = 500. ACT score mean = 23, std dev = 6. SAT score 1500 has z-score = (1500-1600)/500 = -0.2. ACT score 22 has z-score = (22-23)/6 = -0.17. ACT score 22 is better than SAT score 1500.
  • Slide 91
  • Which is more unusual? Heights of adult men have a mean of 69.0 in. and a standard deviation of 2.8 in. Heights of adult women have a mean of 63.6 in. and a standard deviation of 2.5 in. A. A 58 in. tall woman: z-score = (58 - 63.6)/2.5 = -2.24. B. A 64 in. tall man: z-score = (64 - 69)/2.8 = -1.79. C. They are the same.
  • Slide 92
  • 92 Using z-scores to solve problems An example using height data and U.S. Marine and Army height requirements Question: Are the height restrictions set up by the U.S. Army and U.S. Marine more restrictive for men or women or are they roughly the same?
  • Slide 93
  • Data from a National Health Survey: heights of adult men have a mean of 69.0 in. and a standard deviation of 2.8 in.; heights of adult women have a mean of 63.6 in. and a standard deviation of 2.5 in. Height restrictions:
                        | Men minimum | Women minimum
      U.S. Army         | 60 in       | 58 in
      U.S. Marine Corps | 64 in       | 58 in
  • Slide 94
  • Heights of adult men have a mean of 69.0 in. and a standard deviation of 2.8 in. Heights of adult women have a mean of 63.6 in. and a standard deviation of 2.5 in.
                  | Men minimum                                | Women minimum
      U.S. Army   | 60 in, z-score = -3.21 (less restrictive) | 58 in, z-score = -2.24 (more restrictive)
      U.S. Marine | 64 in, z-score = -1.79 (more restrictive) | 58 in, z-score = -2.24 (less restrictive)
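The z-scores in the table come from applying the standardization formula to each minimum height, using the survey mean and standard deviation for the corresponding group. A minimal sketch; the dictionary layout and names are made up for illustration.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev


# (mean, standard deviation) of adult heights in inches, from the survey slide.
height_stats = {"men": (69.0, 2.8), "women": (63.6, 2.5)}

# Minimum height requirements in inches.
minimums = {
    ("U.S. Army", "men"): 60,
    ("U.S. Army", "women"): 58,
    ("U.S. Marine", "men"): 64,
    ("U.S. Marine", "women"): 58,
}

requirement_z = {
    key: z_score(height, *height_stats[key[1]]) for key, height in minimums.items()
}

for (service, group), z in requirement_z.items():
    print(f"{service:<12} {group:<6} z = {z:.2f}")
```

Within each service, the requirement with the z-score closer to zero excludes a larger share of that group, so it is the more restrictive one.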
  • Slide 95
  • 95 Effect of Standardization Standardization into z-scores does not change the shape of the histogram. Standardization into z-scores changes the center of the distribution by making the mean 0. Standardization into z-scores changes the spread of the distribution by making the standard deviation 1.
  • Slide 96
  • 96 Z-score and Empirical Rule When data are bell shaped, the z-scores of the data values follow the empirical rule.
  • Slide 97
  • Outlier detection with z-score. The Empirical Rule tells us that if the data have a mound-shaped distribution, then almost all the data-points are within ±3 standard deviations of the mean. So an observation whose z-score has an absolute value larger than 3 can be considered an outlier.
  • Slide 98
  • 2004 Olympics Women's Heptathlon. Austra Skujyte (Lithuania): Shot Put = 16.40 m, Long Jump = 6.30 m. Carolina Kluft (Sweden): Shot Put = 14.77 m, Long Jump = 6.78 m.
                             | Shot Put | Long Jump
      Mean (all contestants) | 13.29 m  | 6.16 m
      Std. Dev.              | 1.24 m   | 0.23 m
      n                      | 28       | 26
  • Slide 99
  • Which performance was better? A. Skujyte's shot put: z-score = 2.51. B. Kluft's long jump: z-score = 2.70. C. Both were the same.
  • Slide 100
  • Based on shot put and long jump, whose performance was better? A. Skujyte's: z-scores are shot put = 2.51, long jump = 0.61; total z-score = 2.51 + 0.61 = 3.12. B. Kluft's: z-scores are shot put = 1.19, long jump = 2.70; total z-score = 1.19 + 2.70 = 3.89. C. Both were the same.
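The z-scores quoted in the answer choices can be recomputed from the event means and standard deviations. This is a verification sketch only; the data layout and names are illustrative.

```python
# (mean, standard deviation) in metres over all contestants, from the heptathlon table.
event_stats = {"shot put": (13.29, 1.24), "long jump": (6.16, 0.23)}

# Each athlete's marks in metres.
marks = {
    "Skujyte": {"shot put": 16.40, "long jump": 6.30},
    "Kluft": {"shot put": 14.77, "long jump": 6.78},
}


def event_z_scores(results):
    """Standardize each mark against the mean and standard deviation of its event."""
    return {
        event: (mark - event_stats[event][0]) / event_stats[event][1]
        for event, mark in results.items()
    }


totals = {}
for athlete, results in marks.items():
    scores = event_z_scores(results)
    totals[athlete] = sum(scores.values())
    print(athlete, {e: round(z, 2) for e, z in scores.items()}, round(totals[athlete], 2))
```

Adding z-scores across events is meaningful here because standardization puts metres of shot put and metres of long jump on the same unit-free scale.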
  • Slide 101
  • Scatterplot
  • Slide 102
  • Example: Height and Weight. How is the weight of an individual related to his/her height? Typically, one can expect a taller person to be heavier. Is this supported by the data? If yes, how do we determine this association?
  • Slide 103
  • What is a scatterplot? A scatterplot is a diagram used to display the values of two quantitative variables from a data-set. The data are displayed as a collection of points, with the value of one variable determining the position on the horizontal axis and the value of the other variable determining the position on the vertical axis.
  • Slide 104
  • Example 1: Scatterplot of height and weight
  • Slide 105
  • Example 2: Scatterplot of hours watching TV and test scores
  • Slide 106
  • Looking at Scatterplots. We look at the following features of a scatterplot: direction (positive or negative), form (linear, curved), strength (of the relationship), unusual features. When we describe histograms we mention: shape, center, spread, outliers.
  • Slide 107
  • Asking Questions on a Scatterplot. Direction: Are test scores higher or lower when the TV watching is longer (positive or negative association)? Form: Does the cloud of points seem to show a linear pattern, a curved pattern, or no pattern at all? Strength: If there is a pattern, how strong does the relationship look? Unusual features: Are there any (2 or more groups, or outliers)?
  • Slide 108
  • Positive and Negative Associations. Positive association means that, for most of the data-points, a higher value of one variable corresponds to a higher value of the other variable and a lower value of one variable corresponds to a lower value of the other variable. Negative association means that, for most of the data-points, a higher value of one variable corresponds to a lower value of the other variable and vice-versa.
  • Slide 109
  • This association is: A. positive B. negative.
  • Slide 110
  • This association is: A. positive B. negative.
  • Slide 111
  • Linear Scatterplot. Unless we see a curve, we shall call the scatterplot linear.
  • Slide 112
  • Curved Scatterplot. When the plot shows a clear curved pattern, we shall call it a curved scatterplot.
  • Slide 113
  • Which one has the stronger linear association? A. left one, B. right one. The right one, because in the right graph the points are closer to a straight line.
  • Slide 114
  • Which one has the stronger linear association? A. left one, B. right one. Hard to say.
  • Slide 115
  • Unusual Feature: Presence of Outlier. This scatterplot clearly has an outlier.
  • Slide 116
  • Unusual Feature: Two Subgroups. This scatterplot clearly has two subgroups.
  • Slide 117
  • Time series plot (Time plot)
  • Slide 118
  • Time plot. A time series is a collection of observations made sequentially through time. In a time plot (or time series plot) the time series data are plotted (on the vertical axis) against time (on the horizontal axis), and the points are connected with straight lines. From a time series plot one can follow the movement of the observed values over time and find patterns such as: trend, seasonality, business cycle (for business data), unusual features.
  • Slide 119
  • Example: US population
  • Slide 120
  • Example: US accidental deaths
  • Slide 121
  • Example: Australian red wine sales