annotated 3 ch3 data description f2014

Upload: bob-hope

Post on 01-Jun-2018


  • 8/9/2019 Annotated 3 Ch3 Data Description F2014

    Stat 305, Fall 2014 Name

    Chapter 3: Data Description

    Theoretical vs. Empirical Distribution

    Theoretical Distribution: The expected pattern to be followed by a variable.

    Example: Roll one six-sided die 60 times and record the number on the side that lands face up.

    Variable of interest: # on face-up side

    Expected Pattern: ten 1s, ten 2s, . . . , ten 6s

    Empirical Distribution: The pattern followed by the observed data (the actual pattern)

    Example: Roll one six-sided die 60 times and record the number on the side that lands face up.

    Variable of interest: # on face-up side

    Observed data pattern: six 1s, eight 2s, twelve 3s, ten 4s, fourteen 5s, ten 6s.
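    The two patterns above can be put side by side with a few lines of code. This is only an illustrative sketch: the variable names are made up for this example, and the expected counts assume a fair die.

```python
n_rolls = 60
faces = range(1, 7)

# Theoretical distribution: a fair die shows each face 1/6 of the time.
expected = {face: n_rolls / 6 for face in faces}

# Empirical distribution: the observed counts from the example above.
observed = {1: 6, 2: 8, 3: 12, 4: 10, 5: 14, 6: 10}

for face in faces:
    diff = observed[face] - expected[face]
    print(f"face {face}: expected {expected[face]:.0f}, "
          f"observed {observed[face]}, difference {diff:+.0f}")
```

    The observed counts still add up to 60 rolls, but they scatter around the expected ten per face.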

    Descriptive Statistics for Quantitative Data

    Recall: Quantitative data are numerical characteristics associated with items in a sample.

    Goal: Describe important distributional characteristics.

    We will focus on quantitative data in this course.

    Dot Diagram

    1. Order data (smallest to largest)

    2. Label x-axis with range of data values and y-axis with count of values at each distinct point.

    3. Add one dot to the plot for each data value, stacking duplicate values vertically.

    Example 1

    The government requires manufacturers to monitor the amount of radiation emitted through the closed door of a microwave. The following are radiation amounts emitted by 24 microwaves measured by one manufacturer.

    .01 .08 .05 .11 .02 .12 .08 .03 .10 .07

    .10 .05 .10 .20 .01 .09 .05 .09 .02 .10

    .18 .20 .30 .15

    Stem-and-Leaf Plots

    1. Order data values

    2. Select one or more leading digits for the stem values; last digit becomes the leaf

    3. List possible stem values in a vertical column

    4. Draw a vertical line to the right of the stem

    5. Add leaf values, in order, on the other side of the vertical line

    Example 1 (part 2)

    It is easiest to order your data before any analysis.

    .01 .01 .02 .02 .03 .05 .05 .05 .07 .08

    .08 .09 .09 .10 .10 .10 .10 .11 .12 .15

    .18 .20 .20 .30

    Key: The decimal point is 1 digit(s) to the left of the |
    -OR-
    Key: 0|1 = .01

    0 | 1122355578899
    1 | 00001258
    2 | 00
    3 | 0
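    The five steps for building a stem-and-leaf plot can be sketched in code. This is a minimal illustration, not part of the original notes: the function name `stem_and_leaf` is hypothetical, and the data are stored in hundredths (0.01 becomes 1) so the stem and leaf can be split with exact integer arithmetic.

```python
from collections import defaultdict

# Radiation data from Example 1, in hundredths (0.01 -> 1).
data = [1, 8, 5, 11, 2, 12, 8, 3, 10, 7,
        10, 5, 10, 20, 1, 9, 5, 9, 2, 10,
        18, 20, 30, 15]

def stem_and_leaf(values):
    """Return {stem: leaves}: tens digit is the stem, units digit the leaf."""
    plot = defaultdict(str)
    for v in sorted(values):          # step 1: order the data
        stem, leaf = divmod(v, 10)    # step 2: split into stem and leaf
        plot[stem] += str(leaf)       # step 5: add leaves in order
    # step 3: list every possible stem, including empty ones
    return {s: plot.get(s, "") for s in range(min(plot), max(plot) + 1)}

rows = stem_and_leaf(data)
for stem, leaves in rows.items():
    print(f"{stem} | {leaves}")       # step 4: vertical line after the stem
```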

    Split stem-and-leaf plots

    Have two leaf positions per stem, one for leaves 0-4 and one for leaves 5-9

    Helps give a better indication of the distribution of the data when too many observations share a stem

    Key: The decimal point is 1 digit(s) to the left of the |
    -OR-
    Key: 0|1 = .01

    0 | 11223
    0 | 55578899
    1 | 000012
    1 | 58
    2 | 00
    2 |
    3 | 0

    Back-to-Back Stem-and-Leaf plots

    Used to compare two data sets

    One data set as before (on the right side of the stem, leaves read left to right)

    Second data set on the left side of the stem, leaves read right to left

    More stem-and-leaf plots (Examples)

    (Note: These data sets have been ordered for you.)

    1. 28.2 29.4 30.1 30.9 31.4 32.0 32.2 32.5 32.5 32.6 33.3 34.2 34.4 34.9 36.6

    2. 58.65 58.97 59.72 60.15 62.87

    3. Data set 1: 3.5 4.2 4.6 4.6 5.0 5.1 6.4 6.8

    Data set 2: 5.2 5.5 5.7 5.7 5.8 5.9 6.2 7.2

    Frequency Tables

    Use intervals of equal length

    Number of intervals varies, a matter of judgment

    Every endpoint of the intervals is in exactly one interval (i.e., no overlapping)

    Example 1 (part 3)

    Interval    0.01-0.05  0.06-0.10  0.11-0.15  0.16-0.20  0.21-0.25  0.26-0.30
    Frequency   8          9          3          3          0          1

    Relative Frequency Tables

    Start with a frequency table

    Divide every cell by the total number of observations

    Cumulative Relative Frequency Table: Add up relative frequencies as you go.

    Example 1:

    Interval                        0.01-0.05  0.06-0.10  0.11-0.15  0.16-0.20  0.21-0.25  0.26-0.30  Sum
    Frequency                       8          9          3          3          0          1          24
    Relative Frequency              .333       .375       .125       .125       0          .042       1.00
    Cumulative Relative Frequency   .333       .708       .833       .958       .958       1.00
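    The three rows of the table can be reproduced from the raw data. The sketch below is illustrative only: the variable names are made up, and the data are again stored in hundredths so the interval endpoints compare exactly.

```python
from itertools import accumulate

# Radiation data from Example 1, in hundredths (0.01 -> 1).
data = [1, 8, 5, 11, 2, 12, 8, 3, 10, 7,
        10, 5, 10, 20, 1, 9, 5, 9, 2, 10,
        18, 20, 30, 15]

# Equal-length, non-overlapping intervals: 0.01-0.05, 0.06-0.10, ...
intervals = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30)]

n = len(data)
freq = [sum(lo <= v <= hi for v in data) for lo, hi in intervals]
rel_freq = [f / n for f in freq]          # divide every cell by n
cum_rel_freq = list(accumulate(rel_freq)) # add up as you go

for (lo, hi), f, r, c in zip(intervals, freq, rel_freq, cum_rel_freq):
    print(f"{lo / 100:.2f}-{hi / 100:.2f}  {f:2d}  {r:.3f}  {c:.3f}")
```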

    Another Frequency Table

    (Note: These data sets have been ordered for you.)

    1. 28.2 29.4 30.1 30.9 31.4 32.0 32.2 32.5 32.5 32.6 33.3 34.2 34.4 34.9 36.6

    Histogram

    A plot of frequency or relative frequency

    How to make a histogram:

    Use intervals of equal length

    Show entire vertical axis beginning at zero and avoid breaking either axis

    Keep a uniform scale across a given axis

    Center bars of appropriate heights at the midpoints of the intervals

    Rule of Thumb: # of intervals ≈ √(# of observations)

    Example 1

    Common Distributional Shapes

    Bell-shaped (Symmetric Unimodal)

    Uniform

    Right-Skewed

    Left-Skewed

    Bimodal

    Truncated (J-shaped)

    Quantiles

    Definition: For any number $0 \le p \le 1$, the p quantile is the number, denoted $Q(p)$, such that p is the proportion of the distribution that lies to the left of (below) $Q(p)$, and $1 - p$ is the proportion of the distribution that lies to the right of $Q(p)$.

    For an ordered data set $x_1 \le x_2 \le \cdots \le x_n$ and $i = 1, 2, \ldots, n$, the $p = \frac{i - .5}{n}$ quantile of the data set is the ith smallest data point, $x_i$. That is:

    $Q(p) = Q\left(\frac{i - .5}{n}\right) = x_i$

    Example 2

    Annual incomes (in thousands of dollars) for 8 families (in a common geographical location) are given below:

    23, 31, 43, 47, 51, 58, 67, 103

    Which quantiles are exactly observations from this data set?

    Quantiles Not Observed in the Data Set

    General procedure for finding the p quantile of an empirical distribution

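    1. Order the data values: $x_{(1)} \le x_{(2)} \le \cdots \le x_{(n)}$

    2. Set $i = np + 0.5$

    3. If $i \in \{1, 2, \ldots, n\}$ then $Q(p) = x_{(i)}$; otherwise,

    $Q(p) = (\lceil i \rceil - i)\, x_{(\lfloor i \rfloor)} + (i - \lfloor i \rfloor)\, x_{(\lceil i \rceil)}$

    Notation:

    Ceiling = next largest integer (round up), e.g. $\lceil 4.3 \rceil = 5$

    Floor = previous smallest integer (round down), e.g. $\lfloor 4.3 \rfloor = 4$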
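    The general procedure can be written as a short function. This is a sketch under stated assumptions rather than a reference implementation: the name `quantile` is hypothetical, the rounding of i is added only to guard against floating-point noise, and raising an error when i falls outside 1..n is a choice made here because the procedure above does not cover that case.

```python
import math

def quantile(data, p):
    """Q(p) using i = n*p + 0.5 and interpolation between neighbors."""
    xs = sorted(data)                    # step 1: order the data
    n = len(xs)
    i = round(n * p + 0.5, 9)            # step 2 (rounded to avoid float noise)
    if i < 1 or i > n:
        # Assumption: the notes do not define Q(p) outside this range.
        raise ValueError("p is outside the range covered by the procedure")
    lo, hi = math.floor(i), math.ceil(i)
    if lo == hi:                         # step 3: i is an integer
        return xs[lo - 1]                # lists are 0-indexed, hence the -1
    # Otherwise weight the two neighboring order statistics.
    return (hi - i) * xs[lo - 1] + (i - lo) * xs[hi - 1]

incomes = [23, 31, 43, 47, 51, 58, 67, 103]
for p in (0.25, 0.90, 0.75):
    print(f"Q({p}) = {quantile(incomes, p)}")
```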

    Example 2 (part 2)

    23, 31, 43, 47, 51, 58, 67, 103

    1. What data value corresponds to the .25 quantile?

    2. The .90 quantile?

    3. The .75 quantile?

    Quartile

    Special quantiles:

    Q(.25): Q1, 1st quartile, lower quartile

    Q(.5): Q2, 2nd quartile, median

    Q(.75): Q3, 3rd quartile, upper quartile

    Special values associated with quartiles:

    Inter-quartile range (IQR): Q3 - Q1

    Upper fence: Q3 + 1.5 × IQR

    Lower fence: Q1 - 1.5 × IQR

    Boxplot

    Steps for making a boxplot: (with ordered data)

    0. Draw your scale.

    1. Draw vertical lines at Q1, Q2, Q3 and connect them with a box.

    2. Compute IQR = Q3 - Q1 and the upper and lower fences: Upper Fence = Q3 + 1.5 × IQR, Lower Fence = Q1 - 1.5 × IQR

    3. Draw asterisks (or dots) for any data values less than the lower fence and any values greater than the upper fence; these we will define as outliers.

    4. Draw a line from the sides of the box to the smallest value greater than the lower fence and the largest value smaller than the upper fence.
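    The numbers needed for these steps can be computed before drawing anything. The sketch below is illustrative: `quantile` and `boxplot_summary` are hypothetical helper names, the quantile rule is the i = np + 0.5 procedure from earlier in this chapter, and the code assumes at least two data values. Values exactly equal to a fence are treated as non-outliers here, which is a choice made for this example.

```python
import math

def quantile(data, p):
    """Q(p) with i = n*p + 0.5 and interpolation between neighbors."""
    xs = sorted(data)
    i = round(len(xs) * p + 0.5, 9)   # rounding guards against float noise
    lo, hi = math.floor(i), math.ceil(i)
    if lo == hi:
        return xs[lo - 1]
    return (hi - i) * xs[lo - 1] + (i - lo) * xs[hi - 1]

def boxplot_summary(data):
    q1, q2, q3 = (quantile(data, p) for p in (0.25, 0.5, 0.75))
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = sorted(x for x in data if x < lower_fence or x > upper_fence)
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    return {
        "Q1": q1, "Q2": q2, "Q3": q3, "IQR": iqr,
        "lower_fence": lower_fence, "upper_fence": upper_fence,
        "outliers": outliers,                     # drawn as asterisks
        "whiskers": (min(inside), max(inside)),   # ends of the two lines
    }

summary = boxplot_summary([23, 31, 43, 47, 51, 58, 67, 103])
for name, value in summary.items():
    print(f"{name}: {value}")
```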

    Example 2 (part 3)

    Make a boxplot:

    23, 31, 43, 47, 51, 58, 67, 103

    1. First we need quartiles.

    (a) (From above) Q1=

    (b) (From above) Q3=

    (c) Q2 =

    2. Next calculate IQR and fences.

    (a) IQR =

    (b) Upper Fence =

    (c) Lower Fence =

    3. Finally draw the boxplot.

    Example 3

    Ten batteries were tested to determine how long they would last (hrs) under normal conditions. Below are the ten values that were obtained:

    100, 120, 80, 90, 95, 115, 120, 110, 105, 95

    1. Calculate Q(.35)

    2. Calculate Q(.42)

    3. Calculate Q(.90)

    4. Draw a boxplot based on the 10 values above.

    Side-by-Side Boxplots

    Side-by-side boxplots can be used to compare two data sets. Make sure they are drawn on the same scale to make a comparison possible.

    Quantile-Quantile Plots (Q-Q Plots)

    Used to make comparisons of the shapes of distributions for two data sets.

    If two data sets are generated by distributions of the same shape, then the quantiles of one data set should be linearly related to the quantiles of the second data set.

    Plot the ordered pairs $\left(Q_1\left(\frac{i - .5}{n}\right), Q_2\left(\frac{i - .5}{n}\right)\right)$ for $i = 1, \ldots, n$.

    Points in a straight line indicate that the two distributions have the same shape.

    If $n_1 \ne n_2$, then use the smaller of the two as n.

    Example 4a (n1 = n2)

    Data Set 1: 1, 2, 3, 4, 5

    Data Set 2: 6, 7, 8, 9, 10

    Example 4b (n1 ≠ n2)

    Data Set 1: 1, 2, 3, 4, 5

    Data Set 2: 6, 7, 8, 9, 10, 11

    Example 4c

    Data Set 1: 1, 5, 7, 8, 9, 10

    Data Set 2: -10, -9, -8, -7, -5, -1
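    The points of a Q-Q plot can be computed directly from the quantile procedure. The sketch below is only an illustration: `quantile` and `qq_pairs` are hypothetical names, and the quantile rule is the i = np + 0.5 procedure from earlier in this chapter. When the sample sizes differ, the smaller one sets the plotting positions and the larger data set is interpolated.

```python
import math

def quantile(data, p):
    """Q(p) with i = n*p + 0.5 and interpolation between neighbors."""
    xs = sorted(data)
    i = round(len(xs) * p + 0.5, 9)   # rounding guards against float noise
    lo, hi = math.floor(i), math.ceil(i)
    if lo == hi:
        return xs[lo - 1]
    return (hi - i) * xs[lo - 1] + (i - lo) * xs[hi - 1]

def qq_pairs(set1, set2):
    """Pairs (Q1(p), Q2(p)) at p = (i - .5)/n, with n the smaller sample size."""
    n = min(len(set1), len(set2))
    probs = [(i - 0.5) / n for i in range(1, n + 1)]
    return [(quantile(set1, p), quantile(set2, p)) for p in probs]

pairs_4a = qq_pairs([1, 2, 3, 4, 5], [6, 7, 8, 9, 10])
pairs_4b = qq_pairs([1, 2, 3, 4, 5], [6, 7, 8, 9, 10, 11])
print(pairs_4a)
print(pairs_4b)
```

    Plotting either list of pairs as (x, y) points gives the Q-Q plot for that example.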

    Normal Probability Plot

    A type of Q-Q plot that allows us to determine if the distribution of our data is bell-shaped (the shape of the theoretical normal distribution)

    Rather than plot 2 data sets against one another, plot 1 data set against quantiles from a known normal distribution

    A straight line indicates our data is normal/bell-shaped

    An S-shape indicates our data is skewed.

    We will talk more about Normal Probability Plots in Chapter 5.

    Standard Numerical Measures

    For univariate quantitative data:

    Measures of Location/Center give an indication of where most of the data is located.

    Measures of Variability/Spread give an indication of how spread out the data is.

    Median (Location)

    Same as Q(0.5); i.e., gives the center value of the data set.

    Not affected by a few extreme or outlying observations.

    Example:

    2, 3, 5, 8, 12     Q(.5) = 5

    2, 3, 5, 8, 100    Q(.5) = 5

    Mean (Location)

    For $x_1, x_2, \ldots, x_n$ the mean is given by

    $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$

    Also called first moment or center of mass

    Strongly affected by a few extreme or outlying observations.

    Example:

    2, 3, 5, 8, 12     $\bar{x} = 6$

    2, 3, 5, 8, 100    $\bar{x} = 23.6$

    Mode (Location)

    The most frequently occurring data point

    Can also be used for qualitative data

    Can have multiple modes

    Not affected by outliers (so to speak)

    Example:

    2, 3, 5, 5, 5, 8, 8, 12 mode = 5

    2, 3, 5, 5, 5, 8, 8, 8, 100 modes = 5,8
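    The three measures of location can be checked with Python's standard `statistics` module. This is a small sketch using the example data from this section; `multimode` requires Python 3.8 or later, and the variable names are made up for illustration.

```python
import statistics

clean = [2, 3, 5, 8, 12]
with_outlier = [2, 3, 5, 8, 100]

# The median ignores the extreme value; the mean is pulled toward it.
for data in (clean, with_outlier):
    print(data,
          "median:", statistics.median(data),
          "mean:", statistics.mean(data))

# multimode returns every value tied for the highest count.
one_mode = statistics.multimode([2, 3, 5, 5, 5, 8, 8, 12])
two_modes = statistics.multimode([2, 3, 5, 5, 5, 8, 8, 8, 100])
print("modes:", one_mode, two_modes)
```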

    IQR/ Range (Variability)

    IQR= Q(.75) - Q(.25)

    Measures the spread of the middle half of the data. Not sensitive to extreme values.

    Range = Largest value - Smallest value

    If data is ordered $x_1 \le x_2 \le \cdots \le x_n$, then $R = x_n - x_1$.

    Highly sensitive to extreme values.

    Variance/Standard Deviation (Variability)

    Sample variance is given by the formula:

    $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$

    -or-

    $s^2 = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$

    Gives a measure of how much the data is spread from the sample mean. Larger values of $s^2$ indicate more spread.

    Average (squared) distance from the mean.

    Sample standard deviation: $s = \sqrt{s^2}$

    Sensitive to extreme outliers.

    Example 5

    Calculate the mean and standard deviation for the data below.

    4, 8, 2, 14, 7, 12

    Recalculate the standard deviation using summary statistics: $\sum_{i=1}^{n} x_i = 47$ and $\sum_{i=1}^{n} x_i^2 = 473$
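    Both routes to the sample variance, the definition and the shortcut based on the two sums, can be compared numerically. The sketch below uses the Example 5 data; the variable names are illustrative, and the standard library's `statistics.stdev` is included only as a cross-check.

```python
import math
import statistics

data = [4, 8, 2, 14, 7, 12]
n = len(data)

# Summary statistics used by the shortcut formula.
sum_x = sum(data)
sum_x2 = sum(x * x for x in data)

mean = sum_x / n

# Definition: squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in data) / (n - 1)

# Shortcut: only the two sums and n are needed.
s2_shortcut = (sum_x2 - sum_x ** 2 / n) / (n - 1)

s = math.sqrt(s2_shortcut)
print(f"mean = {mean:.4f}, s^2 = {s2_shortcut:.4f}, s = {s:.4f}")
print("library check:", statistics.stdev(data))
```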

    Which One Should I Use?

    When describing a distribution, generally use a measure of location and a measure of spread.

    Mean and standard deviation for symmetric quantitative data.

    Median and IQR for skewed quantitative data.

    Mode for categorical data.

    Statistics vs. Parameters

    Statistic: a numerical summary of sample data. (We will focus on this for now.)

    Sample mean, $\bar{x}$

    Sample variance, $s^2$

    Parameter: a numerical summary of population data. (More on this in Chapter 5.)

    Population mean, $\mu$

    Population variance, $\sigma^2$

    Descriptive Statistics for Qualitative Data

    Recall: qualitative data we generally aggregate into counts

    Generally it is helpful to calculate rates on a per-item basis (proportions)

    $p = \frac{\text{total \# items of interest}}{\text{total \# of items}}$

    $\hat{p} = \frac{\text{\# items of interest in sample}}{\text{\# of items in sample}}$

    Graphical tools

    Bar chart: like a histogram without intervals

    Pie chart

    Dot diagram

    See 3.4 for more details
