STAT 231 MIDTERM 1 Fall 2010

Upload: niles

Post on 22-Feb-2016



TRANSCRIPT

Page 1: STAT 231 MIDTERM  1 Fall 2010

STAT 231 MIDTERM 1 Fall 2010

Page 2: STAT 231 MIDTERM  1 Fall 2010

Introduction

• Jeffrey Baer
• 3B Actuarial Science
• Work terms at Manulife and Towers Watson
• Waterloo SOS President, May 2009 – Aug 2010

Page 3: STAT 231 MIDTERM  1 Fall 2010

Agenda

• 8:05 – 8:15 Data Types and Transformations
• 8:15 – 8:35 PPDAC
• 8:35 – 9:10 Data Summaries
• 9:10 – 9:15 Bivariate Risk Measures
• 9:15 – 9:40 Probability Models
• 9:40 – 10:00 Likelihood Functions and MLEs

Page 4: STAT 231 MIDTERM  1 Fall 2010

What is Statistics?


Statistics is the science of design and collection of data used to draw conclusions about a larger population.

Page 5: STAT 231 MIDTERM  1 Fall 2010

Data Types

• Discrete: countable (whole numbers), finite
  – e.g. Number of students in Stat 231 born in 1991
• Continuous: measured data using the real number line
  – e.g. Age of Stat 231 students
• Categorical: non-numerical, pre-determined categories
  – e.g. Months of birth of Stat 231 students
• Binary: categorical data with two categories
  – e.g. Born in 1991?

Page 6: STAT 231 MIDTERM  1 Fall 2010

Data Types continued

• Ordinal: data that has an underlying order
  – e.g. Final Stat 230 grades of students in Stat 231
• Grouped/Frequency: numerical, # of occurrences in a category
  – e.g. Number of Pure Math/Act Sci/Stats students in Stat 231
• A Dataset is a collection of data
  – Can include several different data types

Page 7: STAT 231 MIDTERM  1 Fall 2010

Transformations

• Transforming data from one form to another using a transformation function can simplify data and/or solve comparison issues

• Transformation types:
  – Monotone increasing: preserves ranking, i.e. ranks of {x1,x2,...,xn} = ranks of {F(x1),F(x2),...,F(xn)}
    • Monotone decreasing reverses rankings
  – Affine: linear transformation (y = Ax + B)
  – Coding: categorical data to numerical data
  – Ranking: ordering data from smallest to largest

Page 8: STAT 231 MIDTERM  1 Fall 2010

Example 1

If the temperature at which a certain compound melts is a random variable with mean 120°C and standard deviation 2°C, what are the mean and standard deviation measured in °F? (Hint: °F = 1.8°C + 32.)
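One way to check Example 1 is to apply the affine rules directly: for y = Ax + B the mean becomes A·mean + B and the standard deviation becomes |A|·sd. The sketch below is only an illustration; the function name `affine_mean_sd` is made up for this example.

```python
def affine_mean_sd(mean, sd, a, b):
    """Mean and standard deviation of y = a*x + b, given those of x."""
    # The shift b moves the mean only; the scale a stretches both.
    return a * mean + b, abs(a) * sd


# Celsius to Fahrenheit: F = 1.8*C + 32
mean_f, sd_f = affine_mean_sd(mean=120.0, sd=2.0, a=1.8, b=32.0)
print(f"mean = {mean_f:.1f} F, sd = {sd_f:.1f} F")
```

The additive constant 32 does not affect the standard deviation, which is why only the factor 1.8 appears there.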

Page 9: STAT 231 MIDTERM  1 Fall 2010

PPDAC

Page 10: STAT 231 MIDTERM  1 Fall 2010

Problem

• “A clear statement of what we are trying to achieve”

• Key Terms:
  – Unit: individual in the population
  – Variate: characteristic of a unit
  – Attribute: characteristic of the population

• The problem is defined in terms of attributes of the population

Page 11: STAT 231 MIDTERM  1 Fall 2010

Aspect
• Aspects (type of problem)
  – Descriptive (exploring a target population attribute)
    • What is the average age of death for smokers in Canada?
    • What are the average marks for STAT 230 and STAT 231?
  – Causative (linking explanatory and response variates)
    • Does smoking lead to lung cancer?
    • Does a high mark in STAT 230 indicate the individual will get a high mark in STAT 231?
  – Predictive (predicting value of response variate)
    • Given that a male, age 30, smokes, what is the predicted age of mortality?
    • If I know an individual’s mark in STAT 230, can I predict his mark in STAT 231?

Page 12: STAT 231 MIDTERM  1 Fall 2010

Population

• Target Pop. (units we want to investigate)
  – University Students
• Study Pop. (units which could have been selected)
  – Laurier Students
• Sample (units actually selected)
  – Laurier Students selected for the study
• Subsets
  – Sample is a subset of study population
  – Study population not necessarily a subset of target population

Page 13: STAT 231 MIDTERM  1 Fall 2010

Error and Plan

• Study Error (Study vs. Target)
  – Possible consequence: making the wrong conclusion about our target population
• Sample Error (Sample vs. Study)
  – Is present because we use a subset to make a conclusion on a larger population
  – Can only be reduced, but never eliminated
• Plan: how we execute the study
  – Experimental vs. Observational plans

Page 14: STAT 231 MIDTERM  1 Fall 2010

Example 2

PROBLEM: An auto manufacturer wants to know the average distance cars registered in Ontario go between oil changes.

PLAN: Canadian Tire is asked to collect data on the distance driven since the last oil change for all cars registered in Ontario whose oil they change during the last week in February. If the odometer reading at the last oil change is not available, a car will not be included in the sample.

Page 15: STAT 231 MIDTERM  1 Fall 2010

Data

• After we’ve collected data, it’s important to summarize it in a form that is clear and concise

• Potential Issues:
  – Outliers: extreme observations
  – Bias: systematic error from improper data collection
  – Missing observations: suspicious -> omitted

Page 16: STAT 231 MIDTERM  1 Fall 2010

Our Collected Data

Observed Data: Ages of 12 individuals randomly selected from a room.

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 38 }

Sample Size: n = 12

Page 17: STAT 231 MIDTERM  1 Fall 2010

Averages

Measures of Averages
• Mean
  – Arithmetic: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
  – Geometric: $\bar{x}_g = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$
• Median
  – Q2, 50% of the data lies above, 50% lies below
• Mode
  – The most frequently occurring data point(s)

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
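As a quick check of these measures on the dataset shown on this slide, here is a minimal sketch using Python's standard `statistics` module (Python 3.8 or later is assumed for `geometric_mean` and `multimode`).

```python
from statistics import mean, geometric_mean, median, multimode

ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

arithmetic = mean(ages)            # sum of values divided by n
geometric = geometric_mean(ages)   # n-th root of the product of values
middle = median(ages)              # average of the 6th and 7th sorted values
modes = multimode(ages)            # every value tied for most frequent

print(arithmetic, geometric, middle, modes)
```

For positive data the geometric mean never exceeds the arithmetic mean, and here the small value 4 pulls it down noticeably.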

Page 18: STAT 231 MIDTERM  1 Fall 2010

Pie Chart

Pie Charts

• Frequency: # of occurrences

• Relative Frequency: proportion of occurrences

Page 19: STAT 231 MIDTERM  1 Fall 2010

Histogram

Histograms
• Frequency Histogram
  – Height (area) of each bar is the # of occurrences within each interval
• Relative Frequency Histogram
  – Height (area) of each bar is the proportion of occurrences within each interval
• Determining an interval size
  – (Max – Min)/desired # of intervals
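The interval-size rule can be sketched as follows for the ages dataset used in these slides. The choice of 4 intervals is an assumption made purely for illustration, as is the convention of placing the maximum in the last interval.

```python
data = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]
num_intervals = 4  # assumed; the slides do not fix a number

low, high = min(data), max(data)
width = (high - low) / num_intervals  # (Max - Min) / desired # of intervals

# Count how many observations fall in each interval [low + k*width, low + (k+1)*width).
counts = [0] * num_intervals
for x in data:
    k = min(int((x - low) / width), num_intervals - 1)  # the maximum goes in the last bin
    counts[k] += 1

relative = [c / len(data) for c in counts]  # heights for a relative frequency histogram
print(width, counts, relative)
```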

Page 20: STAT 231 MIDTERM  1 Fall 2010

Histogram

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }

Frequency Histogram

Page 21: STAT 231 MIDTERM  1 Fall 2010

Example 3

Estimate the number of electronic components in the sample which took at least 8 hours to fail, if there was a total of 300 items in the sample.

Relative Frequency Histogram

Page 22: STAT 231 MIDTERM  1 Fall 2010

CDF

Cumulative Frequency Plot
• X-axis: data points
• Y-axis: sum of all relative frequencies for data points up to x
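A minimal sketch of the quantity plotted on the Y-axis, using the ages dataset from earlier slides. The helper name `ecdf` is hypothetical.

```python
def ecdf(data, x):
    """Sum of relative frequencies of all data points less than or equal to x."""
    return sum(1 for value in data if value <= x) / len(data)


ages = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]
for x in (3, 16, 20, 40):
    print(x, ecdf(ages, x))
```

The function is a step function that starts at 0 below the smallest observation and reaches 1 at the largest.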

Page 23: STAT 231 MIDTERM  1 Fall 2010

Lorenz Curves

Lorenz Curves
• CDF plot used to illustrate income inequality
  – Shows percentage (y%) of total income held by poorest x% of households
  – 45-degree line: line of perfect equality (LPE)
  – Gini coefficient: (area between Lorenz curve and LPE) / (area between LPE and LPI, the line of perfect inequality)

[Figure: Income Distribution in Canada. Lorenz curve plotted with the LPE and LPI; x-axis: Percentage of Households, y-axis: Percentage of Income, both from 0% to 100%]
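To make the Lorenz curve and Gini coefficient concrete, here is a rough sketch. It assumes the usual definition in which the area between the Lorenz curve and the LPE is divided by the total area under the LPE (which is 1/2), and it approximates the area under the curve with trapezoids. The income lists and function names are hypothetical.

```python
def lorenz_points(incomes):
    """(share of households, share of total income) pairs, poorest first."""
    ordered = sorted(incomes)
    total = sum(ordered)
    points = [(0.0, 0.0)]
    running = 0.0
    for i, income in enumerate(ordered, start=1):
        running += income
        points.append((i / len(ordered), running / total))
    return points


def gini(incomes):
    """Area between the LPE and the Lorenz curve, divided by the area under the LPE (1/2)."""
    pts = lorenz_points(incomes)
    area_under_curve = sum(
        (x1 - x0) * (y0 + y1) / 2 for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )
    return 1 - 2 * area_under_curve


print(gini([5, 5, 5, 5]))   # everyone earns the same
print(gini([0, 0, 0, 1]))   # one household holds all the income
```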

Page 24: STAT 231 MIDTERM  1 Fall 2010

Tipping Points

Model of Tipping Points
• How many people will do something, given how many other people are expected to do it
• Can be illustrated using a modified Lorenz curve
  – Equilibria: points intersecting the 45° line
  – Stable Equilibria: points at which small deviations from equilibrium result in a return to equilibrium, regardless of the direction of deviation
  – Unstable Equilibria: tipping points at which small deviations from equilibrium do not result in a return to equilibrium

Page 25: STAT 231 MIDTERM  1 Fall 2010

Example 4 (from Asst. 1)

100 students are in a class. Let N be the actual number of students clapping and NE the number of students expected to clap. The relationship between N and NE is given as follows:

N = 0.5NE        if NE <= 20
N = 2NE – 30     if 20 < NE <= 50
N = 0.5NE + 45   if 50 < NE <= 90
N = 90           if NE >= 90

Illustrate this graphically. Equilibria? Stable Equilibria? Tipping Points?

[Figure: Model of Tipping Points. N and the LPE plotted against NE; x-axis: NE, y-axis: N, both from 0 to 120]
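One way to explore Example 4 numerically is to scan whole-number values of NE, keep those where N equals NE, and test whether neighbouring expectations are pushed back toward the crossing. This is only a sketch: the function names and the integer grid are assumptions, and it is not presented as the assignment's official solution.

```python
def clapping(ne):
    """Actual number clapping, N, as a function of the expected number, NE."""
    if ne <= 20:
        return 0.5 * ne
    if ne <= 50:
        return 2 * ne - 30
    if ne <= 90:
        return 0.5 * ne + 45
    return 90.0


def classify(ne, lowest=0, highest=100):
    """Stable if a small deviation on either side is pushed back toward ne."""
    pushed_up_from_below = ne == lowest or clapping(ne - 1) > ne - 1
    pushed_down_from_above = ne == highest or clapping(ne + 1) < ne + 1
    if pushed_up_from_below and pushed_down_from_above:
        return "stable"
    return "unstable (tipping point)"


# Equilibria are the points where the curve meets the 45-degree line, N = NE.
equilibria = {ne: classify(ne) for ne in range(0, 101) if clapping(ne) == ne}
print(equilibria)
```

The crossing on the steep middle segment (slope 2) is the one where small deviations grow, which matches the definition of a tipping point given on the previous slide.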

Page 26: STAT 231 MIDTERM  1 Fall 2010

Variability and Spread

• Sample Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
• Population Variance: $\sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$
• Percentile
  – The p-th percentile is the data point located at position number (p/100)*(n + 1)
  – Use linear interpolation if necessary
• Interquartile Range (IQR) = Q3 (75th percentile) – Q1 (25th percentile)

{ 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }

Page 27: STAT 231 MIDTERM  1 Fall 2010

How to find Percentiles
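The position rule (p/100)*(n + 1) with linear interpolation can be sketched like this for the ages dataset. The `percentile` helper is hypothetical, and clamping positions that fall outside 1..n is an added assumption. The variance lines use the two divisors, n − 1 and n.

```python
def percentile(sorted_data, p):
    """p-th percentile at 1-based position (p/100)*(n+1), linearly interpolated."""
    n = len(sorted_data)
    position = (p / 100) * (n + 1)
    position = min(max(position, 1), n)   # assumption: clamp to the data range
    lower = int(position)                 # whole part of the position
    fraction = position - lower           # how far toward the next data point
    if lower >= n:
        return float(sorted_data[-1])
    below, above = sorted_data[lower - 1], sorted_data[lower]
    return below + fraction * (above - below)


ages = sorted([4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40])
n = len(ages)
q1, q2, q3 = (percentile(ages, p) for p in (25, 50, 75))
iqr = q3 - q1

x_bar = sum(ages) / n
sum_sq = sum((x - x_bar) ** 2 for x in ages)
sample_var = sum_sq / (n - 1)       # divisor n - 1
population_var = sum_sq / n         # divisor n

print(q1, q2, q3, iqr, sample_var, population_var)
```

For example, Q3 sits at position 0.75 × 13 = 9.75, three quarters of the way from the 9th value (24) to the 10th value (25).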

Page 28: STAT 231 MIDTERM  1 Fall 2010

Box and Whisker Plot

Box and Whisker Plot Steps:
• Calculate Q1, Q2 (median), Q3, and IQR
• Draw a horizontal line representing the scale of measurement, and a box surrounding Q1 and Q3, with a line drawn for Q2
• Calculate outlier boundaries (dotted lines):
  – lower fence = Q1 – 1.5*IQR, upper fence = Q3 + 1.5*IQR
  – Mark any outliers with a * or o on the graph
• Draw whiskers connecting the box to the largest and smallest measurements that are not outliers (upper/lower adjacent values)

Page 29: STAT 231 MIDTERM  1 Fall 2010

Example 5

Draw a Box and Whisker Plot for the dataset { 4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40 }
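The numbers needed for this plot can be computed with a short sketch. It relies on `statistics.quantiles` (Python 3.8+), whose default "exclusive" method places quartiles at positions based on n + 1, in line with the percentile rule given earlier; the drawing itself is left out.

```python
from statistics import quantiles

data = [4, 15, 16, 16, 18, 19, 20, 22, 24, 25, 28, 40]

q1, q2, q3 = quantiles(data, n=4)   # quartiles with linear interpolation
iqr = q3 - q1

lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

outliers = [x for x in data if x < lower_fence or x > upper_fence]
inside = [x for x in data if lower_fence <= x <= upper_fence]
whiskers = (min(inside), max(inside))   # lower and upper adjacent values

print(q1, q2, q3, iqr, lower_fence, upper_fence, outliers, whiskers)
```

Under these assumptions only 40 lies beyond a fence, so the upper whisker stops at 28 while the lower whisker reaches 4.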

Page 30: STAT 231 MIDTERM  1 Fall 2010

QQ Plot

QQ Plot
• Theoretical Quantiles
  – Quartiles, percentiles, etc. of known distribution
  – 95th theoretical quantile: the value α such that P(X ≤ α) = 0.95
• Sample Quantiles
• 2 uses of QQ plots
  – Sample vs. Theoretical Quantile (45° line = good fit)
  – Sample vs. Sample Quantile (straight line = similar distribution)

Page 31: STAT 231 MIDTERM  1 Fall 2010

Measures of Association

• Relative Risk (of event A provided event B occurs or does not occur): RR = P(A | B) / P(A | not B)
  – > 1 : positive association between A and B
  – Association does not imply causation!

Page 32: STAT 231 MIDTERM  1 Fall 2010

Example 6

Given the following frequency table for individuals grouped according to whether they smoke or not and their education level:

Smoker   High School   University   PhD
No       8             33           42
Yes      51            70           26

Calculate the relative risk of smoking if a person has a PhD education.
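A possible way to work Example 6 is to compare the proportion of smokers among PhD holders with the proportion among everyone else. The sketch assumes relative risk means P(smoker | PhD) / P(smoker | not PhD); the dictionary layout is just one convenient encoding of the table.

```python
# Frequency table from the example: counts[smoker][education]
counts = {
    "No":  {"High School": 8,  "University": 33, "PhD": 42},
    "Yes": {"High School": 51, "University": 70, "PhD": 26},
}

phd_total = counts["Yes"]["PhD"] + counts["No"]["PhD"]
other_smokers = counts["Yes"]["High School"] + counts["Yes"]["University"]
other_total = other_smokers + counts["No"]["High School"] + counts["No"]["University"]

p_smoke_given_phd = counts["Yes"]["PhD"] / phd_total     # 26 / 68
p_smoke_given_not_phd = other_smokers / other_total      # 121 / 162

relative_risk = p_smoke_given_phd / p_smoke_given_not_phd
print(round(relative_risk, 3))
```

A value below 1 would indicate a negative association between smoking and holding a PhD in this table, which by itself says nothing about causation.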

Page 33: STAT 231 MIDTERM  1 Fall 2010

Measures of Association

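Correlation Coefficient
• $\rho = \frac{\mathrm{Cov}(X, Y)}{\sigma_x \sigma_y}$ or $\frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2 \sum_{i=1}^{n}(y_i - \bar{y})^2}}$ or $\frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}}{\sqrt{\left(\sum_{i=1}^{n} x_i^2 - n\bar{x}^2\right)\left(\sum_{i=1}^{n} y_i^2 - n\bar{y}^2\right)}}$
• Measures linear relationship between two random variables
  – ρ > 0 : positive correlation; vice-versa
  – |ρ| = 1: X and Y are linearly related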
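The centred form and the computational shortcut of the sample correlation should give the same number. The sketch below checks this on a small made-up dataset; the data values and function names are hypothetical.

```python
from math import sqrt


def corr_centered(xs, ys):
    """Sum of products of deviations over the root of the product of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def corr_shortcut(xs, ys):
    """Same quantity using raw sums: sum(x*y) - n*x_bar*y_bar, and so on."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    syy = sum(y * y for y in ys) - n * y_bar ** 2
    return sxy / sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]   # hypothetical data
ys = [2, 4, 5, 4, 5]
print(corr_centered(xs, ys), corr_shortcut(xs, ys))
```

Data lying exactly on an increasing straight line, such as y = 2x + 1, gives a correlation of 1.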

Page 34: STAT 231 MIDTERM  1 Fall 2010

Example 7

• (47, 41) is called an influential outlier

Page 35: STAT 231 MIDTERM  1 Fall 2010

Time Series

Time Series Graphs
• The explanatory variate is time
• The response variate is the measured variable of interest at time t
• Neighbouring points are joined by straight lines rather than left as a simple scatter plot
• Time series graphs can be used to look at trends, seasonal patterns, etc.

Page 36: STAT 231 MIDTERM  1 Fall 2010

Statistical Science

• Statistics is the science of design and collection of data used to draw conclusions about a larger population.

• When we collect this data, we’re always going to have uncertainty

• We fit our data to known probability models to quantify these uncertainties

Page 37: STAT 231 MIDTERM  1 Fall 2010

Terminology

• Descriptive Statistics (Chapter 1)
  – Tools and techniques used to describe certain attributes of a population
  – Graphs, charts, numerical summaries
• Statistical Inference (Rest of Course)
  – A problem solving method using data to draw general conclusions on a population

Page 38: STAT 231 MIDTERM  1 Fall 2010

Statistical Inference

• Estimation Problems
  – After collection of data, we fit the data to probability models
  – Using the collected data, form estimates for the parameters of the models
• Hypothesis Testing
  – Accepting or rejecting a statement about the target population

Page 39: STAT 231 MIDTERM  1 Fall 2010

Probability Models

• Random Variables
  – Represent what we’re going to measure in our experiment
• Realizations
  – Represent the actual data we’ve collected from our experiment

Page 40: STAT 231 MIDTERM  1 Fall 2010

Probability Functions

• CDF: $F(x) = \sum_{y \le x} f(y)$ (discrete) or $\int_{-\infty}^{x} f(y)\,dy$ (cts.)
• $E[g(X)] = \sum_{x} g(x) f(x)$ (discrete) or $\int_{-\infty}^{\infty} g(x) f(x)\,dx$ (cts.)
• Var(X) = E(X^2) – [E(X)]^2
• E(aX + b) = aE(X) + b
• Var(aX + b) = a^2 Var(X)
• $P(a \le Y \le b) = \sum_{y=a}^{b} P(Y = y)$ (discrete) or $\int_{a}^{b} f(y)\,dy$ (cts.)

Page 41: STAT 231 MIDTERM  1 Fall 2010

Example 8

A random variable X has a continuous probability model with cumulative distribution function (cdf)

$F(x) = P(X \le x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x)$

Give an expression for the expected value of $Y = \sin(X)$.

Do not evaluate any sums or integrals.
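A possible route to the requested expression, assuming the cdf is $F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan(x)$ (the factor $1/\pi$ is what makes $F$ run from 0 to 1): differentiate the cdf to get the density, then apply the rule for $E[g(X)]$ with $g(x) = \sin(x)$.

```latex
\begin{align*}
f(x) &= \frac{d}{dx}F(x) = \frac{1}{\pi\,(1 + x^{2})}, \qquad -\infty < x < \infty \\
E[Y] = E[\sin(X)] &= \int_{-\infty}^{\infty} \sin(x)\,\frac{1}{\pi\,(1 + x^{2})}\,dx
\end{align*}
```

As the question asks, the integral is left unevaluated.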

Page 42: STAT 231 MIDTERM  1 Fall 2010

Probability Models

• Binomial (binary data)
  – Fixed number of trials (n) and fixed probability (π) of success on each (Bernoulli) trial
  – $P(X = x; n, \pi) = \binom{n}{x}\pi^{x}(1 - \pi)^{n - x}$ ; x = 0,1,…,n
• Poisson (discrete data)
  – Events occur at a constant rate (λ)
  – $P(X = x; \lambda) = \frac{e^{-\lambda}\lambda^{x}}{x!}$ ; x = 0,1,2,…
• Exponential (continuous data)
  – Waiting time between events occurring at rate λ
  – $f(x; \lambda) = \lambda e^{-\lambda x}$ ; x > 0
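The three models can be written as small functions, which makes it easy to try out values. This is a sketch with hypothetical function names; the parameter `p` plays the role of π on the slide, and the exponential uses the rate form shown above.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """P(X = x) for n Bernoulli trials, each with success probability p."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """P(X = x) when events occur at constant rate lam."""
    return exp(-lam) * lam ** x / factorial(x)


def exponential_pdf(x, lam):
    """Density of the waiting time between events occurring at rate lam."""
    return lam * exp(-lam * x) if x > 0 else 0.0


print(binomial_pmf(3, 10, 0.5))
print(poisson_pmf(2, 1.5))
print(exponential_pdf(0.5, 2.0))
```

A useful sanity check is that the binomial probabilities over x = 0,…,n add up to 1.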

Page 43: STAT 231 MIDTERM  1 Fall 2010

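Gaussian Distribution and CLT

Gaussian Distribution
• $f(x; \mu, \sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^{2}}$
• If Y ~ G(μ,σ), then $Z = \frac{Y - \mu}{\sigma}$ ~ G(0,1)
• If Y1,Y2,…,Yn are independent G(μ1,σ1), G(μ2,σ2), … , G(μn,σn):
  – $\sum_{i=1}^{n} b_i Y_i \sim G\left(\sum_{i=1}^{n} b_i \mu_i,\ \sqrt{\sum_{i=1}^{n} b_i^{2}\sigma_i^{2}}\right)$

Central Limit Theorem (CLT)
• For any iid RVs W1,W2,…,Wn with mean μ and s.d. σ:
  – If $\bar{W} = \frac{1}{n}\sum_{i=1}^{n} W_i$, then $E(\bar{W}) = \mu$ and $SD(\bar{W}) = \sigma/\sqrt{n}$
  – $\lim_{n \to \infty} \frac{\bar{W} - \mu}{\sigma/\sqrt{n}} \sim G(0,1)$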

Page 44: STAT 231 MIDTERM  1 Fall 2010

Example 9

We are given that non-diabetics have glucose levels represented by a random variable which follows a G(5.31, 0.58) distribution. Diabetics have glucose levels represented by a random variable which follows a G(11.74, 3.5) distribution. When taking a test, if the person’s glucose level measures higher than 6.5, they will be diagnosed as diabetic.

• If a person is diabetic, what is the probability that he/she is diagnosed correctly?

• What is the probability that a non-diabetic is diagnosed as diabetic?
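Both questions in Example 9 reduce to a Gaussian tail probability above the cutoff 6.5. The sketch below uses `statistics.NormalDist` from the standard library (Python 3.8+) and assumes the second argument of G(μ, σ) is the standard deviation, as in the notation used on the Gaussian Distribution slide.

```python
from statistics import NormalDist

CUTOFF = 6.5  # glucose level above which a person is diagnosed as diabetic

diabetic = NormalDist(mu=11.74, sigma=3.5)
non_diabetic = NormalDist(mu=5.31, sigma=0.58)

# P(Y > cutoff) = 1 - F(cutoff)
p_correct_diagnosis = 1 - diabetic.cdf(CUTOFF)
p_false_positive = 1 - non_diabetic.cdf(CUTOFF)

# The same answers via standardising: Z = (Y - mu) / sigma ~ G(0, 1)
z_diabetic = (CUTOFF - 11.74) / 3.5
p_check = 1 - NormalDist().cdf(z_diabetic)

print(round(p_correct_diagnosis, 4), round(p_false_positive, 4))
```

Standardising by hand and reading a G(0,1) table would give the same values up to table rounding.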

Page 45: STAT 231 MIDTERM  1 Fall 2010

Response Model

• Problem: what is μ, the average of the attribute of interest in the target population

• We will use our collected data to estimate μ
• Let Y be a random variable that represents the measured response variate
• Y = μ + R, where R ~ G(0, σ)
  – Y ~ G(μ, σ)
  – μ is systematic (no risk), while R is random (variable)

Page 46: STAT 231 MIDTERM  1 Fall 2010

Maximum Likelihood Estimation

• Binomial: $\hat{\pi} = \frac{x}{n}$ ; x = # of successes
• Response: $\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i$ ; yi is the ith realization
• Maximum Likelihood Estimation
  – A procedure used to determine a parameter estimate given any model

Page 47: STAT 231 MIDTERM  1 Fall 2010

Maximum Likelihood Estimation

• First, we assume our data collected will follow a distribution

• Before we collect the sample: random variables
  – {Y1, Y2, …, Yn}
• After we collect the sample: realizations
  – {y1, y2, …, yn}

• We know the distribution of Yi (with unknown parameters), hence we know the PDF/PMF

Page 48: STAT 231 MIDTERM  1 Fall 2010

Likelihood Function

• The Likelihood Function:
  – Continuous: $L(\theta) = \prod_{i=1}^{n} f(y_i; \theta),\ \theta \in \Omega$
  – Discrete: $L(\theta) = P(Y = y; \theta),\ \theta \in \Omega$
• Likelihood: the probability of observing the dataset you have
  – We want to choose an estimate of the parameter θ that gives the largest such probability
  – Ω is the parameter space, the set of possible values for θ
  – Relative Likelihood: $R(\theta) = \frac{L(\theta)}{L(\hat{\theta})}$

Page 49: STAT 231 MIDTERM  1 Fall 2010

MLE Process

• Step One: Define the likelihood function L(θ)
• Step Two: Define the log likelihood function ln[L(θ)]
• Step Three: Take the derivative with respect to θ
• Step Four: Set the derivative equal to zero and solve for θ to arrive at the maximum likelihood estimate $\hat{\theta}$
• Step Five: Plug in data values (if given) to arrive at a numerical maximum likelihood estimate
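As an illustration of Steps One to Four, here is how they might play out for the binomial model with x successes in n trials. The notation $\ell(\pi)$ for the log likelihood is introduced here for convenience; the result matches the binomial estimate quoted on the Maximum Likelihood Estimation slide.

```latex
\begin{align*}
\text{Step One:}\quad   L(\pi) &= \binom{n}{x}\,\pi^{x}(1 - \pi)^{n - x}, \qquad 0 < \pi < 1 \\
\text{Step Two:}\quad   \ell(\pi) &= \ln\binom{n}{x} + x\ln\pi + (n - x)\ln(1 - \pi) \\
\text{Step Three:}\quad \frac{d\ell}{d\pi} &= \frac{x}{\pi} - \frac{n - x}{1 - \pi} \\
\text{Step Four:}\quad  0 &= \frac{x}{\hat{\pi}} - \frac{n - x}{1 - \hat{\pi}}
  \;\Longrightarrow\; \hat{\pi} = \frac{x}{n}
\end{align*}
```

Step Five is then a matter of substituting the observed x and n.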

Page 50: STAT 231 MIDTERM  1 Fall 2010

Examples 10/11

Discrete: What is the MLE of a geometric distribution with pmf $P(X = x; \theta) = \theta(1 - \theta)^{x - 1}$, x ≥ 1?

Continuous: Given Y ~ Exp(θ), with realizations y1,y2,…,yn, find the maximum likelihood estimate of θ. What is the MLE for the realizations {3, 2, 1, 4}?
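For the continuous example, a numerical check can sit alongside the calculus. The sketch below assumes the rate form f(y; θ) = θe^(−θy), matching the exponential density on the Probability Models slide; under that assumption the five steps give θ̂ = n / Σyi. If Exp(θ) is instead meant to have mean θ, the estimate would be the sample mean, 2.5. The grid search is only a sanity check, not part of the method.

```python
from math import log

ys = [3, 2, 1, 4]   # realizations from the example


def log_likelihood(theta, data):
    """ln L(theta) = n*ln(theta) - theta*sum(y) for the rate-form exponential."""
    return len(data) * log(theta) - theta * sum(data)


# Closed form from setting the derivative n/theta - sum(y) equal to zero.
theta_hat = len(ys) / sum(ys)

# Sanity check: the best point on a coarse grid should agree with theta_hat.
grid = [k / 1000 for k in range(1, 2001)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, ys))

print(theta_hat, theta_grid)
```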