161.120 Introductory Statistics Week 2 Lecture slides


Post on 09-Feb-2016



  • 161.120 Introductory Statistics Week 2 Lecture slides

    Graphical Displays of Univariate Data: Dot Plots & Stem-and-leaf Plots and Histograms. Text sections 2.4 and 2.5; CAST sections 2.2 and 2.4

    Describing Centre and Spread. Text sections 2.6 and 2.7; CAST sections 2.5 and 2.6

    Transformation & Discrete Data. CAST sections 2.7 and 2.8

  • Dot Plots

    A graphical display of a batch of numbers. Each value is shown as a dot against a numerical axis.

    Problem of overlapping dots. Two remedies:

    Jittering dots (used in CAST): randomly move the dots perpendicularly to the axis in order to separate them somewhat.

    Stacking dots: group values into classes, then vertically stack the dots in each class. The heights of the stacks show the density for each class. The loss of detailed information in a stacked dot plot is rarely important.

  • Stem and leaf Plots

    Basically a stacked dot plot using digits instead of dots, with a slightly different layout.

    The 'axis' is drawn vertically

    A value is printed on the axis for each stack, giving the most significant digits that are common for all values on that stack. This is called the stem for the stack.

    The digits representing the values are called the leaf digits and are drawn in a row to the right of the stems

  • Decimal points are not shown in the stems or the leaves.

    The stem '12' and leaf '3' could represent 12300 or 1230 or 123 or 12.3 or 1.23 or 0.123, etc., so you need to provide a key or state the units of the stem.

    The distribution of values is shown by the canopy of leaves.

    Sometimes the distribution is not shown well. You can then change the value of the leaves or split the stems.
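
The construction described above can be sketched in Python. This is a minimal illustration rather than the method CAST or the textbook uses: the function name `stem_and_leaf` and the `leaf_unit` parameter are assumptions introduced here, values are assumed to be non-negative, and any digit after the leaf is truncated. The data are the Cambridge crew weights (in pounds) from Example 2.7 in these slides.

```python
from collections import defaultdict


def stem_and_leaf(values, leaf_unit=1):
    """Return the rows of a stem-and-leaf plot as strings.

    Each value is truncated to a whole number of leaf units. The last
    digit becomes the leaf and the remaining digits form the stem.
    Assumes non-negative values.
    """
    rows = defaultdict(list)
    for v in sorted(values):
        scaled = int(v // leaf_unit)          # truncate, do not round
        stem, leaf = divmod(scaled, 10)
        rows[stem].append(str(leaf))
    lo, hi = min(rows), max(rows)
    # Show every stem in the range, including empty ones, so gaps stay visible.
    return [f"{stem:>3} | {''.join(rows.get(stem, []))}"
            for stem in range(lo, hi + 1)]


cambridge = [188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0]

# Key: stem = hundreds of pounds, leaf = tens of pounds.
rows = stem_and_leaf(cambridge, leaf_unit=10)
for row in rows:
    print(row)
#   1 | 0788889
#   2 | 01
```

With a leaf unit of 10 the canopy is too coarse to be informative, which is the situation where the slides suggest changing the value of the leaves or splitting the stems.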

  • Example 2.8 Big Music Collection

    About how many CDs do you own?

    Stem is 100s and leaf unit is 10s. Final digit is truncated. Numbers ranged from 0 to about 450, with 450 being a clear outlier and most values ranging from 0 to 99. The shape is skewed right.

  • Outliers and How to Handle Them

    Outlier: a data point that is not consistent with the bulk of the data.

    Look for them via graphs.

    Can have big influence on conclusions.

    Can cause complications in some statistical analyses.

    Cannot discard without justification.

  • Possible Reasons for Outliers and Reasonable Actions

    Mistake made while taking measurement or entering it into computer. If verified, should be discarded/corrected.

    Individual in question belongs to a different group than bulk of individuals measured. Values may be discarded if summary is desired and reported for the majority group only.

    Outlier is legitimate data value and represents natural variability for the group and variable(s) measured. Values may not be discarded; they provide important information about location and spread.

  • Example 2.7 Tiny Boatsmen

    Weights (in pounds) of 18 men on crew teams:

    Cambridge: 188.5, 183.0, 194.5, 185.0, 214.0, 203.5, 186.0, 178.5, 109.0

    Oxford: 186.0, 184.5, 204.0, 184.5, 195.5, 202.5, 174.0, 183.0, 109.5

    Note: the last weight in each list is unusually small. They are the coxswains for their teams, while the others are rowers.

  • Clusters

    If a dot plot or stem and leaf plot separates into two or more groups of values (clusters), this suggests that the 'individuals' from which the data were recorded may similarly be split into two or more groups.

    Clusters may correspond, for example, to males and females or to different varieties of plants.

    Detecting the cause of differences between the groups may lead to valuable insights into the data.

  • Histograms

    Directly display the 'canopy' shape, without separately displaying the individual values. They are particularly useful displays for large data sets.

    Area equals relative frequency: each value must contribute the same area to the histogram.

    Equal width classes: the height of the rectangles equals the frequency of the class, and the vertical axis is labeled frequency.

    Mixed class widths: the vertical axis is labeled density.
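
A short Python sketch of the "area equals relative frequency" idea for mixed class widths follows. The function name `histogram_density`, the sample values and the class boundaries are all hypothetical, chosen only to illustrate that dividing each class's relative frequency by its width makes the rectangle areas sum to 1.

```python
def histogram_density(values, boundaries):
    """Return (count, density) for each class [left, right).

    density = relative frequency / class width, so that
    rectangle area (density * width) equals relative frequency.
    """
    n = len(values)
    result = []
    for left, right in zip(boundaries, boundaries[1:]):
        count = sum(left <= v < right for v in values)
        density = count / n / (right - left)
        result.append((count, density))
    return result


# Hypothetical data and deliberately unequal class widths (2, 2, 6, 10).
values = [1, 2, 2, 3, 3, 3, 4, 6, 8, 15]
boundaries = [0, 2, 4, 10, 20]

classes = histogram_density(values, boundaries)
widths = [right - left for left, right in zip(boundaries, boundaries[1:])]
total_area = sum(density * width for (_, density), width in zip(classes, widths))
print(classes)
print(total_area)  # approximately 1: the whole histogram has area 1
```

Plotting raw counts against these unequal classes would exaggerate the wide classes, which is why the vertical axis switches from frequency to density.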

  • Choice of histogram classes

    Histogram classes should be chosen to give an outline that is as smooth as possible.

    Too narrow leads to a jagged histogram.

    Too wide leads to a 'blocky' histogram and detail is lost.

    Adjusting the class width and the starting position for the first class can give a surprising amount of variability in histogram shape for small data sets. As a result, you must be wary of over-interpreting features such as clusters or skewness in such histograms.

  • Interpreting Histograms, Stemplots, and Dotplots

    Values are centered around 20 cm.

    Two possible low outliers.

    Apart from outliers, spans range from about 16 to 23 cm.

  • Five-Number Summaries

    Find extremes (high, low), the median, and the quartiles (medians of lower and upper halves of the values).

    Quick overview of the data values.

    Information about the center, spread, and shape of data.

  • Notation and Finding the Quartiles

    Split the ordered values into the half that is below the median and the half that is above the median.

    Q1 = lower quartile = median of data values that are below the median

    Q3 = upper quartile = median of data values that are above the median

  • Example 2.10 Fastest Speeds (cont)

    Ordered Data (in rows of 10 values) for the 87 males:

    55 60 80 80 80 80 85 85 85 85
    90 90 90 90 90 92 94 95 95 95
    95 95 95 100 100 100 100 100 100 100
    100 100 101 102 105 105 105 105 105 105
    105 105 109 110 110 110 110 110 110 110
    110 110 110 110 110 112 115 115 115 115
    115 115 120 120 120 120 120 120 120 120
    120 120 124 125 125 125 125 125 125 130
    130 140 140 140 140 145 150

    Median = value in position (87+1)/2 = 44 in the list = 110 mph

    Q1 = median of the 43 values below the median = value in position (43+1)/2 = 22 from the start of the list = 95 mph

    Q3 = median of the 43 values above the median = value in position (43+1)/2 = 22 from the end of the list = 120 mph
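
The quartile rule used in this example can be checked with a small Python sketch. The helper names `median` and `five_number_summary` are introduced here for illustration. The code follows the "median of each half" definition from the slides, which is only one of several quartile conventions, so statistical software may give slightly different quartiles for other data sets.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2                 # the middle value is excluded when n is odd
    lower = ordered[:half]        # values below the median
    upper = ordered[n - half:]    # values above the median
    return (ordered[0], median(lower), median(ordered),
            median(upper), ordered[-1])


# The 87 fastest speeds (mph) from Example 2.10, written as value * count.
speeds = ([55, 60] + [80] * 4 + [85] * 4 + [90] * 5 + [92, 94] + [95] * 6
          + [100] * 9 + [101, 102] + [105] * 8 + [109] + [110] * 12 + [112]
          + [115] * 6 + [120] * 10 + [124] + [125] * 6 + [130] * 2
          + [140] * 4 + [145, 150])

print(len(speeds))                  # 87
print(five_number_summary(speeds))  # (55, 95, 110, 120, 150)
```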

  • Percentiles

    The kth percentile is a number that has k% of the data values at or below it and (100 - k)% of the data values at or above it.

    Lower quartile = 25th percentile

    Median = 50th percentile

    Upper quartile = 75th percentile

  • Median, quartiles and area

    The data set is split into quarters by the median and quartiles. Histogram area is proportional to relative frequency, so the median and quartiles split the histogram into four equal areas.

  • Basic Box plot

  • What does a box plot tell you about the distribution?

    Centre: The vertical line inside the box (the median) gives an indication of the centre of the distribution.

    Spread: The width of the box (the interquartile range) gives an indication of the spread of values in the distribution. IQR = UQ - LQ

    Shape: High density corresponds to adjacent box plot values being close together. In particular, if the extreme and quartile on one side are closer to the median than the extreme and quartile on the other side, this shows that the distribution is skew.

  • Box plot: Clusters & Outliers

    Clusters: Box plots cannot show clusters in a data set. Before using a box plot, check that clusters do not exist by using a dot plot, stem and leaf plot or histogram.

    Outliers: The basic box plot does not clearly show an outlier. Any values more than 1.5 times the IQR from the box are considered to be outliers and are displayed with a separate cross. The 'whiskers' that are drawn to the sides of the central box extend only as far as the most extreme values that are not classified as outliers.
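
The 1.5 × IQR rule can be sketched in Python as follows. The function names `box_plot_fences` and `split_outliers` are hypothetical. The quartiles (95 and 120 mph) are those of Example 2.10, and the short list of values is just a selection from that data set, not the full 87 speeds.

```python
def box_plot_fences(q1, q3, k=1.5):
    """Limits beyond which a value is classed as an outlier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def split_outliers(values, q1, q3):
    """Return (outliers, whisker_ends) under the 1.5 * IQR rule."""
    low, high = box_plot_fences(q1, q3)
    outliers = [v for v in values if v < low or v > high]
    inside = [v for v in values if low <= v <= high]
    # Whiskers reach the most extreme values that are not outliers.
    return outliers, (min(inside), max(inside))


# A selection of speeds (mph) from Example 2.10, with Q1 = 95 and Q3 = 120.
selected = [55, 60, 95, 110, 120, 150]

print(box_plot_fences(95, 120))           # (57.5, 157.5)
print(split_outliers(selected, 95, 120))  # ([55], (60, 150))
```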

  • Example 2.10 Fastest Speeds Ever Driven

    Five-Number Summary for 87 males

    Median = 110 mph measures the center of the data

    Two extremes describe spread over 100% of data: Range = 150 - 55 = 95 mph

    Two quartiles describe spread over middle 50% of data: Interquartile Range = 120 - 95 = 25 mph

  • Comparing two or more groups

    Box plots are particularly useful for comparing different groups of values.

    Figure: Rice yields in 1996

  • Picturing Location and Spread with Boxplots

    Boxplots for right handspans of males and females.

    Box covers the middle 50% of the data.

    Line within box marks the median value.

    Possible outliers are marked with an asterisk.

    Apart from outliers, lines extending from the box reach to the min and max values.

  • 2.5 Pictures for Quantitative Data

    Histograms: similar to bar graphs, used for any number of data values.

    Stem-and-leaf plots and dotplots: present all individual values, useful for small to moderate sized data sets.

    Boxplot or box-and-whisker plot: useful summary for comparing two or more groups.

  • 2.6 Numerical Summaries of Quantitative Data

    Notation for Raw Data:

    n = number of individuals in a data set

    x1, x2, x3, ..., xn represent individual raw data values

    Example: A data set consists of handspan values in centimeters for six females; the values are 21, 19, 20, 20, 22, and 19. Then n = 6 and x1 = 21, x2 = 19, x3 = 20, x4 = 20, x5 = 22, and x6 = 19.

  • Describing the Location of a Data Set

    Mean: the numerical average

    Median: the middle value (if n odd) or the average of the middle two values (n even)

    Symmetric: mean = median

    Skewed Left: mean < median

    Skewed Right: mean > median

  • Determining the Mean and Median

    The Mean: $\bar{x} = \frac{\sum x_i}{n}$, where $\sum$ means "add together all the values".

    The Median:

    If n is odd: M = middle of ordered values. Count (n + 1)/2 down from top of ordered list.

    If n is even: M = average of middle two ordered values. Average the values that are (n/2) and (n/2) + 1 down from top of ordered list.

  • Example 2.9 Will Normal Rainfall Get Rid of Those Odors?

    Data: Average rainfall (inches) for Davis, California for 47 years

    Mean = 18.69 inches

    Median = 16.72 inches

    In 1997-98, a company with an odor problem blamed it on excessive rain. That year rainfall was 29.69 inches. More rain occurred in 4 other years.

  • The Influence of Outliers on the Mean and Median

    Larger influence on mean than median. High outliers will increase the mean. Low outliers will decrease the mean.

    If ages at death are: 70, 72, 74, 76, and 78 then mean = median = 74 years.

    If ages at death are: 35, 72, 74, 76, and 78 then median = 74 but mean = 67 years.
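
The two age lists above can be checked directly with Python's standard `statistics` module. This is only a quick verification sketch; the variable names are chosen here for illustration.

```python
from statistics import mean, median

ages = [70, 72, 74, 76, 78]
ages_with_low_outlier = [35, 72, 74, 76, 78]

# Without the outlier, mean and median agree at 74.
print(mean(ages), median(ages))

# Replacing 70 by 35 pulls the mean down to 67; the median stays at 74.
print(mean(ages_with_low_outlier), median(ages_with_low_outlier))
```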

  • 2.7 Bell-Shaped Distributions of Numbers

    Many measurements follow a predictable pattern:

    Most individuals are clumped around the center.

    The greater the distance a value is from the center, the fewer individuals have that value.

    Variables that follow such a pattern are said to be bell-shaped. A special case is called a normal distribution or normal curve.

  • Example 2.11 Bell-Shaped British Women's Heights

    Data: representative sample of 199 married British couples. A histogram of the wives' heights is shown with a normal curve superimposed. The mean height = 1602 millimeters.

  • Describing Spread with Standard Deviation

    Standard deviation measures variability by summarizing how far individual data values are from the mean.

    Think of the standard deviation as roughly the average distance values fall from the mean.

  • Describing Spread with Standard Deviation

    Both sets have the same mean of 100.

    Set 1: all values are equal to the mean so there is no variability at all.

    Set 2: one value equals the mean and the other four values are 10 points away from the mean, so the average distance away from the mean is about 10.

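  • Calculating the Standard Deviation

    Formula for the (sample) standard deviation: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$

    The value of $s^2$ is called the (sample) variance. An equivalent formula, easier to compute, is: $s = \sqrt{\frac{\sum x_i^2 - n\bar{x}^2}{n - 1}}$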

  • Calculating the Standard Deviation

    Step 1: Calculate $\bar{x}$, the sample mean.

    Step 2: For each observation, calculate the difference between the data value and the mean.

    Step 3: Square each difference in step 2.

    Step 4: Sum the squared differences in step 3, and then divide this sum by n - 1.

    Step 5: Take the square root of the value in step 4.
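
The five steps translate line by line into Python. The sketch below uses the six handspan values from the notation example earlier in these slides. The function name `sample_standard_deviation` is introduced here for illustration, and the standard library's `statistics.stdev` would give the same result.

```python
import math


def sample_standard_deviation(values):
    """Sample standard deviation, following the five steps literally."""
    n = len(values)
    mean = sum(values) / n                      # Step 1: sample mean
    differences = [x - mean for x in values]    # Step 2: value minus mean
    squared = [d ** 2 for d in differences]     # Step 3: square each difference
    variance = sum(squared) / (n - 1)           # Step 4: sum, divide by n - 1
    return math.sqrt(variance)                  # Step 5: square root


handspans = [21, 19, 20, 20, 22, 19]            # centimeters
s = sample_standard_deviation(handspans)
print(round(s, 3))  # 1.169
```

Step 4 on its own gives the sample variance, about 1.367 square centimeters for these six values.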

  • Interpreting the Standard Deviation for Bell-Shaped Curves: The Empirical Rule

    For any bell-shaped curve, approximately

    68% of the values fall within 1 standard deviation of the mean in either direction

    95% of the values fall within 2 standard deviations of the mean in either direction

    99.7% of the values fall within 3 standard deviations of the mean in either direction

  • The Empirical Rule, the Standard Deviation, and the Range

    Empirical Rule => the range from the minimum to the maximum data values equals about 4 to 6 standard deviations for data with an approximate bell shape. You can get a rough idea of the value of the standard deviation by dividing the range by 6.

  • Example 2.11 Women's Heights (cont)

    Mean height for the 199 British women is 1602 mm and standard deviation is 62.4 mm.

    68% of the 199 heights would fall in the range 1602 ± 62.4, or 1539.6 to 1664.4 mm

    95% of the heights would fall in the interval 1602 ± 2(62.4), or 1477.2 to 1726.8 mm

    99.7% of the heights would fall in the interval 1602 ± 3(62.4), or 1414.8 to 1789.2 mm
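
The three intervals in this example are simply mean ± k standard deviations for k = 1, 2, 3. A minimal Python sketch, using a hypothetical helper named `empirical_rule_intervals`, reproduces them from the stated mean and standard deviation.

```python
def empirical_rule_intervals(mean, sd):
    """Map k -> (approximate % covered, lower limit, upper limit)."""
    coverage = {1: 68.0, 2: 95.0, 3: 99.7}   # Empirical Rule percentages
    return {k: (pct, mean - k * sd, mean + k * sd)
            for k, pct in coverage.items()}


intervals = empirical_rule_intervals(mean=1602, sd=62.4)
for k, (pct, low, high) in intervals.items():
    print(f"{pct}% within {k} sd: {low:.1f} to {high:.1f} mm")
```

The percentages are approximations that apply only to roughly bell-shaped data; the function does not check the shape of the data.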

  • Example 2.11 Women's Heights (cont)

    Summary of the actual results:

    Note: The minimum height = 1410 mm and the maximum height = 1760 mm, for a range of 1760 - 1410 = 350 mm. So an estimate of the standard deviation is: range / 6 = 350 / 6 ≈ 58.3 mm.

  • Standardized z-Scores

    Standardized score or z-score: $z = \frac{\text{observed value} - \text{mean}}{\text{standard deviation}}$

    Example: Mean resting pulse rate for adult men is 70 beats per minute (bpm), standard deviation is 8 bpm. The standardized score for a resting pulse rate of 80: $z = \frac{80 - 70}{8} = 1.25$

    A pulse rate of 80 is 1.25 standard deviations above the mean pulse rate for adult men.
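
The pulse-rate calculation can be written as a one-line Python function. The name `z_score` is chosen here for illustration.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above the mean."""
    return (value - mean) / sd


print(z_score(80, mean=70, sd=8))   # 1.25
print(z_score(62, mean=70, sd=8))   # -1.0, one standard deviation below
```

A negative z-score indicates a value below the mean, as in the second call with a pulse rate of 62 bpm (a value picked here purely as an illustration).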

  • The Empirical Rule Restated

    For bell-shaped data:

    About 68% of the values have z-scores between -1 and +1.

    About 95% of the values have z-scores between -2 and +2.

    About 99.7% of the values have z-scores between -3 and +3.

  • Transformations

    Sometimes it is convenient to express numbers on a different scale.

    Americans easily recognise that 90° Fahrenheit is a hot day. We understand temperatures better on the Celsius scale.

    No gain or loss of information (usually). Graphical and numerical summaries are affected.

    Transformations can help us understand a data set

  • Linear transformations

    new value = a + b × old value

    Imperial to metric measurements: grams = 28.3495 × ounces

    Temperature: Fahrenheit = 32 + 1.8 × Celsius

    Relative positions of the points do not change so we neither gain nor lose information.

  • Linear transformations

    Affect the centre and spread of the data

    Shape remains unchanged

    Graphical displays: only the numbers labeling the axis change

    Do not help you to understand the distribution of values in the data
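
A small Python sketch illustrates these points for the Celsius to Fahrenheit conversion. The temperatures are hypothetical and the helper name `linear_transform` is introduced here. The mean follows the same rule as the data (a + b × mean), the standard deviation is multiplied by b (for positive b), and the z-scores, which capture relative position, do not change.

```python
from statistics import mean, stdev


def linear_transform(values, a, b):
    """Apply new value = a + b * old value to every value."""
    return [a + b * x for x in values]


celsius = [12.0, 15.0, 19.0, 22.0, 30.0]            # hypothetical temperatures
fahrenheit = linear_transform(celsius, a=32, b=1.8)

# Centre and spread change in a predictable way.
print(mean(celsius), mean(fahrenheit))
print(stdev(celsius), stdev(fahrenheit))

# Relative positions (z-scores) are the same on both scales.
z_c = [(x - mean(celsius)) / stdev(celsius) for x in celsius]
z_f = [(x - mean(fahrenheit)) / stdev(fahrenheit) for x in fahrenheit]
print(all(abs(c - f) < 1e-9 for c, f in zip(z_c, z_f)))
```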

  • Nonlinear transformations

    Examples:

    The wavelength of radiation (in metres) may alternatively be recorded as a frequency (in cycles per second), a reciprocal relationship.

    A medical researcher might record the mean time between seizures for acute epileptic patients, or the rate of seizures per year, another reciprocal relationship.

    The Richter scale transforms the measured intensity of earthquakes to a logarithmic scale.

    Nonlinear transformations:

    Change the relative distances between data values

    Change the shape of a distribution

  • Logarithmic transformations

    The most commonly used nonlinear transformation replaces each value by its logarithm: new value = log10(old value)

    Base-10 logarithms are easier to interpret and are used in CAST, but natural logarithms (base e) have a similar effect.

    Logarithms can be found only for positive numbers.

    log10(1) = 0, log10(10) = 1, log10(100) = 2, log10(1000) = 3, log10(0.1) = -1, log10(0.01) = -2, etc.

    Spreads out low values in a distribution and compresses high values.

    Useful for skew data with a long tail towards the high values. It will spread out a dense cluster of low values and may reveal clustering or outliers that would not be visible in graphical displays of the original data.
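
The effect on spacing can be seen in a short Python sketch. The data values are hypothetical and the function name `log10_transform` is introduced here. The check for non-positive values reflects the restriction stated above.

```python
import math


def log10_transform(values):
    """Replace each value by its base-10 logarithm."""
    if any(v <= 0 for v in values):
        raise ValueError("logarithms can be found only for positive numbers")
    return [math.log10(v) for v in values]


# Hypothetical right-skewed data: a cluster of low values and a long upper tail.
skewed = [1, 2, 4, 10, 100, 1000]
logged = log10_transform(skewed)
print([round(v, 2) for v in logged])

# On the original scale the gap 100 -> 1000 is 100 times the gap 1 -> 10;
# on the log scale both gaps have the same size.
low_gap = logged[3] - logged[0]
high_gap = logged[5] - logged[4]
print(round(low_gap, 6), round(high_gap, 6))
```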

  • A family of nonlinear transformations

    Power transformations raise each value in the data set to a power p.

    p < 1 increases the spread in the lower tail of the data values and decreases the spread in the upper tail.

    p > 1 expands the upper tail of the data values and compresses the lower tail. (Rarely helpful)
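
A minimal Python sketch of a power transformation follows, using hypothetical positive values and a helper named `power_transform` introduced here. Comparing the gaps between neighbouring values shows how p = 0.5 shrinks the upper gaps relative to the lower ones, while p = 2 does the opposite.

```python
def power_transform(values, p):
    """Raise each (positive) value to the power p."""
    return [v ** p for v in values]


def gaps(values):
    """Differences between neighbouring values."""
    return [b - a for a, b in zip(values, values[1:])]


data = [1, 4, 9, 16]                          # hypothetical positive values
print(gaps(data))                             # [3, 5, 7]: gaps widen upwards
print(gaps(power_transform(data, 0.5)))       # gaps become roughly equal
print(gaps(power_transform(data, 2)))         # [15, 65, 175]: upper tail stretched
```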

  • Discrete Data Displays

    Large counts: The distribution of values can be summarised with the same methods as continuous data.

    Moderate counts: Most of the earlier displays can still be used, but:

    Stacked dot plots are better than jittered dot plots. No information is lost by stacking since there can be a column of crosses for each distinct value.

    Histogram class boundaries should end in '.5' to ensure that data values do not occur on the boundary of two classes.

    Since the median, quartiles and extremes are always whole numbers (or occasionally half-way between two whole numbers), box plots do not give a very effective comparison of groups.

    Small counts: A bar chart is a better representation of the data than a histogram.