
TR 555 Statistics “Refresher” Lecture 1: Probability Concepts

References:
– Penn State University, Dept. of Statistics, Statistical Education Resource Kit: a collection of resources used by faculty in Penn State's Department of Statistics in teaching introductory statistics courses. Page maintained by Laura J. Simon, Sept. 2003
– Statistics: Making Sense of Data (MIT), William Stout, John Marden and Kenneth Travers, http://www.introductorystatistics.com/, Sept. 2003
– Tom Maze, stat course prepared for KDOT, 2003

Outline

– Overview of statistics
– Types of data
– Describing data numerically and graphically
– Probability and random variables

Probability and Statistics

Probability is the likelihood of an event occurring relative to all other events
– Example: If a coin is flipped, what is the probability of getting heads? 0.5
– Given that the last flip was heads, what is the probability that the next will be heads? 0.5

Statistics is the measurement and modeling of random variables
– Example: If our state averages 200 fatal crashes per year, what is the probability of having exactly one fatal crash today?
– Poisson distribution: $\lambda$ = average per time period = 200/365 = 0.55 crashes per day
– $P(X = x) = \frac{(\lambda t)^x}{x!} e^{-\lambda t}$, so $P(X = 1) = \frac{(0.55 \cdot 1)^1}{1!} e^{-0.55(1)} = 0.32$
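The Poisson calculation above can be checked with a few lines of Python. This is only a sketch: the function name `poisson_pmf` and its parameters are illustrative, and the rate is the rounded 0.55 crashes per day used on the slide.

```python
import math

def poisson_pmf(x, rate, t=1.0):
    """P(X = x) for a Poisson count with mean rate * t."""
    mean = rate * t
    return mean ** x / math.factorial(x) * math.exp(-mean)

rate_per_day = 0.55  # rounded from 200 / 365, as on the slide
p_one_crash = poisson_pmf(1, rate_per_day)
print(round(p_one_crash, 2))  # 0.32
```

Changing `t` gives the probability for a longer window, for example `poisson_pmf(1, rate_per_day, t=7)` for exactly one fatal crash in a week.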

Data Collection

Designing experiments
– Does aspirin help reduce the risk of heart attacks?

Observational studies
– Polls: Clinton’s approval rating

Variable Types

Deterministic
– Assume away variation and randomness
– Known with certainty
– One-to-one mapping of independent variable to dependent variable

Relationship

Variable Types Continued

Random or Stochastic
– Recognized uncertainty of an event
– One-to-one distribution mapping of independent variable to dependent variable

[Figure: distribution of the dependent variable. It could take any of a range of values: the central value is most likely, values on either side are less likely.]

Population

The set of data (numerical or otherwise) corresponding to the entire collection of units about which information is sought

Sample

A subset of the population data that are actually collected in the course of a study.

WHO CARES?

In most studies, it is difficult to obtain information from the entire population. We rely on samples to make estimates or inferences related to the population.

Organization and Description of Data

– Qualitative vs. Quantitative Data
– Discrete vs. Continuous Data
– Graphical Displays
– Measures of Center
– Measures of Variation

Qualitative (Categorical) Data

The raw (unsummarized) data are merely labels or categories

Quantitative (Numerical) Data

The raw (unsummarized) data are numerical

Qualitative Data Examples

– Class Standing (Fr, So, Ju, Sr)
– Section # (1, 2, 3, 4, 5, 6)
– Automobile Make (Ford, Chevrolet, Nissan)
– Questionnaire response (disagree, neutral, agree)

Quantitative Data Examples (measures)

– Voltage
– Height
– Weight
– SAT Score
– Number of students arriving late for class
– Time to complete a task

Discrete Data

Only certain values are possible (there are gaps between the possible values)

Continuous Data

Theoretically, any value within an interval is possible with a fine enough measuring device

Discrete Data Examples

– Number of students late for class
– Number of crimes reported to SC police
– Number of times the word “number” is used

(generally, discrete data are counts)

Discrete Variable Model: Poisson Distribution

$P(X = x) = \frac{(0.55t)^x}{x!} e^{-0.55t}$

[Figure: probability of the number of fatal crashes in one day; horizontal axis: # of fatal crashes.]

Continuous Data Examples

– Voltage
– Height
– Weight
– Time to complete a homework assignment

Continuous Variable Model: Exponential Distribution

Fatality probability density function: density of the time till the first fatal accident at time t = $\lambda e^{-\lambda t}$

[Figure: probability density of the time till the first fatal accident.]

Continuous Probability Function

Cumulative probability of time till first fatal by time t = $1 - e^{-\lambda t}$

[Figure: cumulative probability of the time till the first fatal accident.]
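As a rough illustration of the cumulative exponential model, the sketch below evaluates the probability that the first fatal crash has occurred within t days. It assumes the same rounded rate of 0.55 fatal crashes per day as the Poisson example; the function name is hypothetical.

```python
import math

RATE_PER_DAY = 0.55  # assumed: same rounded rate as the Poisson example

def prob_first_fatal_within(t_days, rate=RATE_PER_DAY):
    """Cumulative probability 1 - exp(-rate * t) of a first fatal by time t."""
    return 1.0 - math.exp(-rate * t_days)

for t in (1, 2, 4):
    print(t, round(prob_first_fatal_within(t), 3))
```

The value for one day equals one minus the Poisson probability of zero crashes in that day, which ties the discrete and continuous models together.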

Nominal Data

A type of categorical data in which objects fall into unordered categories, for example:
– Hair color: blonde, brown, red, black, etc.
– Race: Caucasian, African-American, Asian, etc.
– Smoking status: smoker, non-smoker

Ordinal Data

A type of categorical data in which order is important. For example:
– Class: freshman, sophomore, junior, senior, super senior
– Degree of illness: none, mild, moderate, severe, …, going, going, gone
– Opinion of students about riots: ticked off, neutral, happy

Binary Data

A type of categorical data in which there are only two categories.

Binary data can be either nominal or ordinal, for example:
– Smoking status: smoker, non-smoker
– Attendance: present, absent
– Class: lower classman, upper classman

Interval and Ratio Data

Interval
– Interval is important, but no meaningful zero
– e.g., temperature in Fahrenheit

Ratio
– Has a meaningful zero value
– e.g., temperature in Kelvin, crash rate

Who Cares?

The type(s) of data collected in a study determine the type of statistical analysis used.

Proportions

Categorical data are commonly summarized using “percentages” (or “proportions”).
– 11% of students have a tattoo
– 2%, 33%, 39%, and 26% of the students in class are, respectively, freshmen, sophomores, juniors, and seniors

Averages

Measurement data are typically summarized using “averages” (or “means”).
– Average number of siblings Fall 1998 Stat 250 students have is 1.9.
– Average weight of male Fall 1998 Stat 250 students is 173 pounds.
– Average weight of female Fall 1998 Stat 250 students is 138 pounds.

Descriptive statistics

Describing data with numbers: measures of location

Mean

– Another name for average.
– If describing a population, denoted as $\mu$, the Greek letter “mu”.
– If describing a sample, denoted as $\bar{x}$, called “x-bar”.
– Appropriate for describing measurement data.
– Seriously affected by unusual values called “outliers”.

Calculating Sample Mean

Formula: $\bar{x} = \frac{\sum x_i}{n}$

That is, add up all of the data points and divide by the number of data points.

Data (# of classes skipped): 2 8 3 4 1

Sample Mean = (2+8+3+4+1)/5 = 3.6

Do not round! Mean need not be a whole number.

Population Mean

The mean of a random variable X is called the population mean and is denoted $\mu$.

It is also called the expected value of X or the expectation of X and is denoted E(X).

$\mu = E(X) = \sum x_i f(x_i)$

Median

– Another name for 50th percentile.
– Appropriate for describing measurement data.
– “Robust to outliers,” that is, not affected much by unusual values.

Calculating Sample Median

Order data from smallest to largest.

If odd number of data points, the median is the middle value.

Data (# of classes skipped): 2 8 3 4 1

Ordered Data: 1 2 3 4 8

Median = 3

Calculating Sample Median

Order data from smallest to largest.

If even number of data points, the median is the average of the two middle values.

Data (# of classes skipped): 2 8 3 4 1 8

Ordered Data: 1 2 3 4 8 8

Median = (3+4)/2 = 3.5

Mode

– The value that occurs most frequently.
– One data set can have many modes.
– Appropriate for all types of data, but most useful for categorical data or discrete data with only a few possible values.
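The three measures of location can be reproduced with Python's standard `statistics` module. The sketch below reuses the classes-skipped data from the slides; the variable names are illustrative only.

```python
import statistics

skipped = [2, 8, 3, 4, 1]          # odd number of data points
skipped_even = [2, 8, 3, 4, 1, 8]  # even number of data points

print(statistics.mean(skipped))            # 3.6
print(statistics.median(skipped))          # 3
print(statistics.median(skipped_even))     # 3.5
print(statistics.multimode(skipped_even))  # [8]
```

`multimode` returns a list because, as noted above, one data set can have more than one mode.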

Most appropriate measure of location

Depends on whether or not data are “symmetric” or “skewed”.

Depends on whether or not data have one (“unimodal”) or more (“multimodal”) modes.

Symmetric and Unimodal

Symmetric and Bimodal

Skewed Right

[Figure: histogram of the number of music CDs of Spring 1998 Stat 250 students.]

Skewed Left

Choosing Appropriate Measure of Location

If data are symmetric, the mean, median, and mode will be approximately the same.

If data are multimodal, report the mean, median and/or mode for each subgroup.

If data are skewed, report the median.

Descriptive statistics

Describing data with numbers: measures of variability

Range

– The difference between largest and smallest data point.
– Highly affected by outliers.
– Best for symmetric data with no outliers.

[Figure: GPAs of Spring 1998 Stat 250 students, on a scale from 2.0 to 4.0.]

Interquartile range

– The difference between the “third quartile” (75th percentile) and the “first quartile” (25th percentile). So, the “middle-half” of the values.
– IQR = Q3 - Q1
– Robust to outliers or extreme observations.
– Works well for skewed data.

[Figure: GPAs of Spring 1998 Stat 250 students, on a scale from 2.0 to 4.0.]

Variance

$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$

1. Find difference between each data point and mean.

2. Square the differences, and add them up.

3. Divide by one less than the number of data points.
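The three steps can be sketched directly in Python. The data are the classes-skipped values reused from the sample mean example; this is an illustration, not part of the original slides.

```python
import math

data = [2, 8, 3, 4, 1]  # classes-skipped data from the sample mean example

mean = sum(data) / len(data)
deviations = [x - mean for x in data]        # step 1: differences from the mean
sum_sq = sum(d ** 2 for d in deviations)     # step 2: square and add up
sample_variance = sum_sq / (len(data) - 1)   # step 3: divide by n - 1
sample_sd = math.sqrt(sample_variance)       # back to the original units

print(round(sample_variance, 2), round(sample_sd, 2))  # 7.3 2.7
```

The square root in the last step gives the sample standard deviation, which is expressed in the original units rather than squared units.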

Variance

If measuring variance of population, denoted by $\sigma^2$ (“sigma-squared”).

If measuring variance of sample, denoted by $s^2$ (“s-squared”).

Measures average squared deviation of data points from their mean.

Highly affected by outliers. Best for symmetric data.

Problem is units are squared.

Population Variance

The variance of a random variable X is called the population variance and is denoted $\sigma^2$.

$\sigma^2 = \sum (x_i - \mu)^2 f(x_i)$

Standard deviation

– Sample standard deviation is square root of sample variance, and so is denoted by s.
– Units are the original units.
– Measures average deviation of data points from their mean.
– Also, highly affected by outliers.

Population Standard Deviation

The population standard deviation is the square root of the population variance and is denoted $\sigma$.

$\sigma = \sqrt{\sum (x_i - \mu)^2 f(x_i)}$

What is the variance or standard deviation?

Variance or standard deviation

Sex       N     Mean  Median  TrMean  StDev  SE Mean
female  126    91.23   90.00   90.83  11.32     1.01
male    100   106.79  110.00  105.62  17.39     1.74

Sex     Minimum  Maximum     Q1      Q3
female    65.00   120.00  85.00   98.25
male      75.00   162.00  95.00  118.75

Females: s = 11.32 mph and $s^2 = 11.32^2 = 128.1$ mph$^2$

Males: s = 17.39 mph and $s^2 = 17.39^2 = 302.4$ mph$^2$

Coefficient of Variation (COV) – not covariance!

Ratio of sample standard deviation to sample mean multiplied by 100.

Measures relative variability, that is, variability relative to the magnitude of the data.

Unitless, so good for comparing variation between two groups.

Coefficient of variation (MPH)

Sex       N     Mean  Median  TrMean  StDev  SE Mean
female  126    91.23   90.00   90.83  11.32     1.01
male    100   106.79  110.00  105.62  17.39     1.74

Sex     Minimum  Maximum     Q1      Q3
female    65.00   120.00  85.00   98.25
male      75.00   162.00  95.00  118.75

Females: CV = (11.32/91.23) x 100 = 12.4

Males: CV = (17.39/106.79) x 100 = 16.3
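The coefficient of variation is simple enough to compute by hand, but a small helper makes the comparison explicit. The function name below is illustrative; the inputs are the standard deviations and means from the table.

```python
def coefficient_of_variation(sd, mean):
    """Sample standard deviation relative to the sample mean, times 100."""
    return sd / mean * 100

cv_female = coefficient_of_variation(11.32, 91.23)
cv_male = coefficient_of_variation(17.39, 106.79)
print(round(cv_female, 1))  # 12.4
print(round(cv_male, 1))    # 16.3
```

Because the result is unitless, the two values can be compared directly even though the groups have different mean speeds.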

Choosing Appropriate Measure of Variability

If data are symmetric, with no serious outliers, use range and standard deviation.

If data are skewed, and/or have serious outliers, use IQR.

If comparing variation across two data sets, use coefficient of variation.

Descriptive Statistics

Summarizing data using graphs

Which graph to use?

– Depends on type of data
– Depends on what you want to illustrate
– Depends on available statistical software

Bar Chart

– Summarizes categorical data.
– Horizontal axis represents categories, while vertical axis represents either counts (“frequencies”) or percentages (“relative frequencies”).
– Used to illustrate the differences in percentages (or counts) between categories.

[Figure: bar chart of birth order (Middle, Oldest, Only, Youngest) of Spring 1998 Stat 250 students, n = 92 students.]

Histogram

– Divide measurement up into equal-sized categories.
– Determine number (or percentage) of measurements falling into each category.
– Draw a bar for each category so bars’ heights represent number (or percent) falling into the categories.
– Label and title appropriately.

[Figure: histogram of age (in years, 18 to 27) of Spring 1998 Stat 250 students, n = 92 students.]

Use common sense in determining number of categories to use.

(Trial-and-error works fine, too.)

Number of ranges (see Tufte)

[Figure: histogram of age (in years) of Spring 1998 Stat 250 students with axis ticks at 18, 23, and 28, n = 92 students.]

Dot Plot

Summarizes measurement data.

Horizontal axis represents measurement scale.

Plot one dot for each data point.

[Figure: dot plots of fastest ever driving speed (70 to 160) for 126 women and 100 men; 226 Stat 100 students, Fall '98.]

Stem-and-Leaf Plot

Summarizes measurement data.

Each data point is broken down into a “stem” and a “leaf.”

First, “stems” are aligned in a column.

Then, “leaves” are attached to the stems.

Boxplot

– Smallest observation = 3.20
– Q1 = 43.645
– Q2 (median) = 60.345
– Q3 = 84.96
– Largest observation = 124.27

[Figure: box plot drawn on a scale from 0 to 130.]

Box Plot

“Whiskers” are drawn to the most extreme data points that are not more than 1.5 times the length of the box beyond either quartile.

– Whiskers are useful for identifying outliers.

“Outliers,” or extreme observations, are denoted by asterisks.

– Generally, data points falling beyond the whiskers are considered outliers.
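The 1.5 × box-length rule can be sketched numerically. The example below uses the quartiles and extremes from the Boxplot slide; the names `lower_fence` and `upper_fence` are introduced here for illustration and are not terms from the slides.

```python
q1, q3 = 43.645, 84.96
smallest, largest = 3.20, 124.27

iqr = q3 - q1                 # length of the box
lower_fence = q1 - 1.5 * iqr  # whiskers cannot extend below this
upper_fence = q3 + 1.5 * iqr  # or above this

has_low_outliers = smallest < lower_fence
has_high_outliers = largest > upper_fence
print(round(iqr, 3), has_low_outliers, has_high_outliers)
```

For these values both extremes fall inside the fences, so the whiskers run all the way to the smallest and largest observations and no points are flagged as outliers.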

Useful for comparing two distributions

[Figure: box plots of amount of sleep in past 24 hours of Spring 1998 Stat 250 students.]

Using Box Plots to Compare

[Figure: side-by-side box plots of fastest ever driving speed by gender (female, male); 226 Stat 100 students, Fall 1998.]

Scatter Plots

Summarizes the relationship between two measurement variables.

Horizontal axis represents one variable and vertical axis represents second variable.

Plot one point for each pair of measurements.

[Figure: scatter plot of foot sizes of Spring 1998 Stat 250 students, n = 88 students; horizontal axis: left foot (in cm), 22 to 31.]

No relationship

[Figure: scatter plot of lengths of left forearms and head circumferences (in cm) of Spring 1998 Stat 250 students, n = 89 students.]

Closing comments

– Many possible types of graphs.
– Use common sense in reading graphs.
– When creating graphs, don’t summarize your data too much or too little.
– When creating graphs, label everything for others. Remember you are trying to communicate something to others!

Probability

You’ll probably like it!

Before we begin …

What is the probability that 2 or more people share the same birthday if …
– 5 people are in the sample?
– 23 people?
– 50 people?
– This class?

Probability Properties

The probability of an event “A” (the proportion of times the event is expected to occur in repeated experiments), is denoted P(A).

All probabilities are between 0 and 1 (i.e., $0 \le P(A) \le 1$).

The sum of the probabilities of all possible outcomes must be 1.

Probability Basics

Given that a crash has occurred, what is the probability that it is a fatal crash?
– Possible events: fatal, injury, and property damage only (PDO)

Crash type       Count       Probability
Fatal               37,000   P(F) = 0.58%
Injury           2,026,000   P(I) = 32.16%
PDO              4,226,000   P(D) = 67.08%
Total crashes    6,300,000

Complement

The complement of an event A, denoted by $\bar{A}$, is the set of outcomes that are not in A

$\bar{A}$ means A does not occur

P($\bar{A}$) = 1 - P(A)

Some texts use $A^C$ to denote the complement of A

Union

The union of two events A and B, denoted by A U B, is the set of outcomes that are in A, or B, or both

If A U B occurs, then either A or B or both occur

Intersection

The intersection of two events A and B, denoted by AB, is the set of outcomes that

are in both A and B.

If AB occurs, then both A and B occur

Combinations of Events

[Figure: Venn diagrams within all fatal crashes (37,795) of single vehicle crashes (21,052) and speed related crashes (13,357), illustrating the union of fatal speed related and run-off the road crashes and the intersection of fatal and run-off the road crashes.]

Addition Law

P(A U B) = P(A) + P(B) - P(AB)

(The probability of the union of A and B is the probability of A plus the probability of B minus the probability of the intersection of A and B)
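A quick numerical check of the addition law, sketched with the fatal crash counts that appear in the conditional probability example of this lecture (37,795 fatal crashes, 21,052 single vehicle, 13,357 speed related, 8,600 both). The variable names are illustrative.

```python
total = 37_795           # all fatal crashes
single_vehicle = 21_052
speed_related = 13_357
both = 8_600             # single vehicle and speed related

p_sv = single_vehicle / total
p_sp = speed_related / total
p_both = both / total

# Addition law: subtract the intersection so it is not counted twice.
p_union = p_sv + p_sp - p_both
print(round(p_union, 4))
```

Without the subtraction, the 8,600 crashes in the overlap would be counted once as single vehicle and again as speed related.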

Mutually Exclusive Events

Two events are mutually exclusive if their intersection is empty.

Two events, A and B, are mutually exclusive if and only if P(AB) = 0

P(A U B) = P(A) + P(B)

Conditional Probability

The probability of event A occurring, given that event B has occurred, is called the conditional probability of event A given event B, denoted P(A|B)

Multiplication Rule

General form: P(A|B) = P(A,B)/P(B)
– e.g., what is the probability of a single vehicle accident given that it was speed related?

Conditional Probability Example

– Total fatal crashes: 37,795
– Total speed related crashes: 13,357
– Total single vehicle crashes: 21,052
– Total single vehicle, speed related crashes: 8,600

If the crash was speed related, what is the probability that it was a single vehicle crash?
– P(sv|sp) = 8,600/13,357 = 64.38%

If the crash was speed related, what is the probability that it was not a single vehicle crash?
– P(not sv|sp) = 1 – 0.6438 = 35.62%

[Figure: Venn diagram of all fatal crashes (37,795): single vehicle crashes (21,052), speed related crashes (13,357), and their overlap SR+SV (8,600).]

Conditional Probability Example (Cont)

Probability that a fatal crash was speed related = P(sp)
– 13,357/37,795 = 35.34%

Probability that a fatal crash was a single vehicle crash = P(sv)
– 21,052/37,795 = 55.70%

Probability that a fatal crash is both speeding related and a single vehicle crash = P(sv,sp)
– 8,600/37,795 = 22.75%


Bayes’ Theorem

P(A|B)P(B) = P(B|A)P(A)

P(B|A) = P(A|B)P(B)/P(A)

– P(sv) = 55.70%
– P(sp) = 35.34%
– P(sv|sp) = 64.38%
– P(sp|sv) = ?

P(sp|sv) = (0.6438 × 0.3534)/0.5570 = 0.4085
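The Bayes' theorem calculation can be cross-checked against the raw counts. The sketch below uses the rounded probabilities from the slide and, as an independent check, the direct ratio of 8,600 joint crashes to 21,052 single vehicle crashes; variable names are illustrative.

```python
p_sv = 0.5570           # P(single vehicle)
p_sp = 0.3534           # P(speed related)
p_sv_given_sp = 0.6438  # P(single vehicle | speed related)

# Bayes' theorem: P(sp | sv) = P(sv | sp) * P(sp) / P(sv)
p_sp_given_sv = p_sv_given_sp * p_sp / p_sv

# Direct check from counts: joint crashes / single vehicle crashes
direct = 8_600 / 21_052

print(round(p_sp_given_sv, 4), round(direct, 4))
```

Both routes give a value near 0.41; the small difference between them comes only from rounding the input probabilities.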


Bayes’ Theorem Problem

Given
– There were 11,696 off-road fixed object fatal crashes involving a single vehicle
– There were 13,357 fatal crashes involving a speeding vehicle
– There were 8,600 fatal crashes involving speeding and single vehicles
– There were 5,400 fatal crashes involving single vehicles, speeding, and off-road fixed object crashes
– The total number of fatal crashes is 37,795
– Given that a crash is speeding related, what is the probability that it will be an off-road single vehicle crash?

Bayes’ Problem Answer

What we need to know: P(or,sv|sp)

What we know:
– P(or,sv) = 30.95%
– P(sp) = 35.34%
– P(sv) = 55.70%
– P(sv,sp) = 22.75%
– P(sp,or,sv) = 14.29%
– P(or,sv|sv) = 55.56%

Answer Continued

Multiplication Rule
– P(sp|or,sv)P(or,sv) = P(sp,or,sv)
– P(sp|or,sv) = P(sp,or,sv)/P(or,sv)
– 46.17% = 0.1429/0.3095

Bayes’ Theorem
– P(or,sv|sp) = (P(sp|or,sv) × P(or,sv))/P(sp)
– 40.43% = (0.4617 × 0.3095)/0.3534
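The two steps of the answer can be reproduced from the crash counts given in the problem statement. This is a sketch with illustrative variable names; it works from unrounded ratios, so the last digit can differ slightly from the slide's percentages.

```python
total = 37_795
off_road_sv = 11_696           # off-road fixed object, single vehicle
speeding = 13_357
speeding_off_road_sv = 5_400   # speeding, single vehicle, off-road fixed object

p_or_sv = off_road_sv / total
p_sp = speeding / total
p_sp_and_or_sv = speeding_off_road_sv / total

# Multiplication rule: P(sp | or,sv) = P(sp,or,sv) / P(or,sv)
p_sp_given_or_sv = p_sp_and_or_sv / p_or_sv

# Bayes' theorem: P(or,sv | sp) = P(sp | or,sv) * P(or,sv) / P(sp)
p_or_sv_given_sp = p_sp_given_or_sv * p_or_sv / p_sp

print(round(p_sp_given_or_sv, 4), round(p_or_sv_given_sp, 4))
```

The final value reduces to 5,400/13,357, which is the same conditional probability computed directly from the counts.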

Independence

Two events A and B are independent if

P(A|B) = P(A)

P(B|A) = P(B)

P(AB) = P(A)P(B)

Probability Concepts

– Randomness
– Independence

Thought Question 1

What does it mean to say that a deck of cards is “randomly” shuffled?
– Every ordering of the cards is equally likely
– There are 8 followed by 67 zeros possible orderings of a 52 card deck
– Every card has the same probability to end up in any specified location

The question continued

A 52 card deck is randomly shuffled.

How often will the tenth card down from the top be a Club?
– 1/4 of the time
– Every card has the same chance to end up 10th.
– There are 13 clubs and 13/52 = 1/4

Law of Large Numbers

Relative frequency of an event gets closer to true probability as number of trials gets larger
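A small simulation illustrates the law of large numbers for a fair coin. This is a sketch only: the function name, the fixed seed, and the choice of sample sizes are all assumptions made for the illustration.

```python
import random

def relative_frequency_of_heads(n_flips, seed=0):
    """Simulate n_flips fair coin flips and return the fraction of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_flips))
    return heads / n_flips

for n in (10, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```

With only 10 flips the relative frequency can be far from 0.5; with 100,000 flips it is typically within a fraction of a percentage point.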

Probability values

– Probabilities are between 0 and 1
– Total probabilities of all possible outcomes = 1
– Probability = 1 means an event always happens
– Probability = 0 means an event never happens

Does a prior event matter?

A fair coin is flipped four times. The first three flips are heads.

What’s the probability that the fourth flip is heads?
– 1/2, assuming flips are independent
– Results of first three flips don’t matter

Independence

The chance that B happens is not affected by whether A had happened.

Does prior event matter?

Ten cards are drawn without replacement from a 52 card deck.

2 Aces are among these 10 cards.

What’s the probability the next (eleventh) card is an Ace? 2/42 = 1/21

After ten draws, 42 cards remain, 2 of them are Aces

Dependence

The chance that B happens is affected by whether A has happened.

Sequence of Events

You guess at five True/False questions. What’s the probability you get them all right?

Five right in five guesses
– For each question, Pr(correct) = 1/2
– Multiply probabilities

(1/2) x (1/2) x (1/2) x (1/2) x (1/2) = 1/32 = 0.031

Card Example

Two cards are taken from normal 52 card deck.

What’s the probability that both are Hearts?

Note - there’s dependence between the two cards

Answer = (13/52) x (12/51) = 1/17 = 0.059
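Exact fractions make the contrast between dependent and independent multiplication easy to see. The sketch below uses Python's `fractions` module for the two-hearts example and the five True/False guesses; variable names are illustrative.

```python
from fractions import Fraction

# Dependent events: the second probability is conditional on the first draw.
p_first_heart = Fraction(13, 52)
p_second_heart_given_first = Fraction(12, 51)
p_both_hearts = p_first_heart * p_second_heart_given_first
print(p_both_hearts, round(float(p_both_hearts), 3))  # 1/17 0.059

# Independent events: the same probability is multiplied each time.
p_five_correct = Fraction(1, 2) ** 5
print(p_five_correct)  # 1/32
```

In the card case the second factor changes because one Heart has already left the deck; in the guessing case every factor stays at 1/2.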

The Birthday Problem

What is the probability that at least two people in this class share the same birthday?

Assumptions

– Only 365 days each year.
– Birthdays are evenly distributed throughout the year, so that each day of the year has an equal chance of being someone’s birthday.

Take group of 5 people….

Let A = event no one in group shares same birthday.

Then $A^C$ = event at least 2 people share same birthday.

P(A) = 365/365 × 364/365 × 363/365 × 362/365 × 361/365 = 0.973

P($A^C$) = 1 - 0.973 = 0.027

That is, about a 3% chance that in a group of 5 people at least two people share the same birthday.

Take group of 23 people….

P(A) = 365/365 × 364/365 × … × 343/365 = 0.493

P($A^C$) = 1 - 0.493 = 0.507

That is, about a 50% chance that in a group of 23 people at least two people share the same birthday.

Take group of 50 people….

P(A) = 365/365 × 364/365 × … × 316/365 = 0.03

P($A^C$) = 1 - 0.03 = 0.97

That is, “virtually certain” that in a group of 50 people at least two people share the same birthday.
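The same product can be computed for any group size. The sketch below follows the stated assumptions (365 equally likely birthdays); the function name is hypothetical.

```python
def prob_shared_birthday(n_people, days=365):
    """P(at least two of n_people share a birthday), assuming equally likely days."""
    p_all_different = 1.0
    for k in range(n_people):
        p_all_different *= (days - k) / days  # k birthdays are already taken
    return 1.0 - p_all_different

for n in (5, 23, 50):
    print(n, round(prob_shared_birthday(n), 3))
```

Calling `prob_shared_birthday` with the size of the class answers the last question on the opening slide.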

Two-way Tables

And various probabilities...

Two-way table of counts

Rows: gender   Columns: pierced ears

         N      Y    All
M       71     19     90
F        4     84     88
All     75    103    178

Cell contents: count

Joint probabilities

Rows: gender   Columns: pierced ears

           N       Y     All
M         71      19      90
       39.89   10.67   50.56
F          4      84      88
        2.25   47.19   49.44
All       75     103     178
       42.13   57.87  100.00

Cell contents: count, % of table

Row conditional probabilities

Rows: gender   Columns: pierced ears

           N       Y     All
M         71      19      90
       78.89   21.11  100.00
F          4      84      88
        4.55   95.45  100.00
All       75     103     178
       42.13   57.87  100.00

Cell contents: count, % of row

Column conditional probabilities

Rows: gender   Columns: pierced ears

           N       Y     All
M         71      19      90
       94.67   18.45   50.56
F          4      84      88
        5.33   81.55   49.44
All       75     103     178
      100.00  100.00  100.00

Cell contents: count, % of column
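The three kinds of percentages differ only in the denominator. The sketch below recomputes them from the table of counts; the dictionary layout and names are illustrative choices, not part of the original output.

```python
counts = {"M": {"N": 71, "Y": 19}, "F": {"N": 4, "Y": 84}}
columns = ("N", "Y")

total = sum(counts[g][c] for g in counts for c in columns)
row_totals = {g: sum(counts[g].values()) for g in counts}
col_totals = {c: sum(counts[g][c] for g in counts) for c in columns}

# Joint: divide by the table total. Conditional: divide by a row or column total.
joint = {(g, c): 100 * counts[g][c] / total for g in counts for c in columns}
row_cond = {(g, c): 100 * counts[g][c] / row_totals[g] for g in counts for c in columns}
col_cond = {(g, c): 100 * counts[g][c] / col_totals[c] for g in counts for c in columns}

cell = ("F", "Y")
print(round(joint[cell], 2), round(row_cond[cell], 2), round(col_cond[cell], 2))
```

For the cell of females with pierced ears, the joint percentage answers "what share of all students", the row percentage "what share of females", and the column percentage "what share of students with pierced ears".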

Expected Value

Coincidences

Roulette Color Bet

– 18 black, 18 red, and 2 green numbers
– Bet on one of black or red
– If correct, win $1
– If wrong, lose $1

Is the bet fair?

– Fair game: expected value is 0
– Expected value = sum of (outcome x prob)
– Exp. Val. = (+1)(18/38) + (-1)(20/38) = -2/38
– Not fair since expected value is not 0.
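The expected value formula can be sketched as a sum over (payoff, probability) pairs. The color bet uses the numbers above; the number bet assumes the standard 35-to-1 payout on a single number, which is not stated in the slides.

```python
from fractions import Fraction

def expected_value(outcomes):
    """Sum of payoff * probability over all outcomes."""
    return sum(payoff * prob for payoff, prob in outcomes)

color_bet = [(+1, Fraction(18, 38)), (-1, Fraction(20, 38))]
# Assumed payout: a winning single-number bet pays 35 to 1.
number_bet = [(+35, Fraction(1, 38)), (-1, Fraction(37, 38))]

ev_color = expected_value(color_bet)
ev_number = expected_value(number_bet)
print(ev_color, ev_number)  # -1/19 -1/19
```

Both bets lose about 5.3 cents per dollar wagered in the long run, since -2/38 reduces to -1/19.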

Color Bet versus Number bet

– Both have same expected value
– How are the bets the same? Long run result is same
– How are they different? Short run results can be quite different

Prob of Five Straight Losses

– Color Bet = $(20/38)^5$ = 0.04, 4%
– Number Bet = $(37/38)^5$ = 0.88, 88%

A Spectacular Coincidence ?

Many states draw four digit lottery numbers

Several years ago Mass. and N.H. both drew the same number on the same night

Associated Press wrote that this was a spectacular 1 in 100 million coincidence

Was Associated Press Right ?

Only if number picked is specified in advance of the draws.

Chance both pick the same pre-specified number, for example 2963, is (1/10,000) (1/10,000)

This is 1 in 100 million.

But the match could have been on any of 10,000 possibilities

The correct analysis

– First state could have picked any number
– Chance the second state matches is 1/10,000
– Answer for two specific states is 1/10,000
– But there were 15 states doing this almost every night.

The prob that the 15 states all differ

– First state can be any number
– Prob second state differs = 9,999/10,000
– Prob third state is unique = 9,998/10,000
– And so on, for 15 states
– Multiply these probabilities to get the probability that all 15 differ
– Answer is about 0.99 that all picked different numbers

Prob at least two states are same

– Opposite from all different
– Prob at least two the same = 1 - Prob(all differ)
– 1 - 0.99 = 0.01
– About 1 in 100; a far cry from 1 in 100 million
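The lottery calculation has the same structure as the birthday problem, with 10,000 possible numbers in place of 365 days and 15 states in place of people. The sketch below makes that explicit; the function name is illustrative.

```python
def prob_all_differ(n_states=15, n_numbers=10_000):
    """P(all states draw different four-digit numbers), assuming uniform draws."""
    p = 1.0
    for k in range(n_states):
        p *= (n_numbers - k) / n_numbers  # k numbers are already taken
    return p

p_differ = prob_all_differ()
p_some_match = 1 - p_differ
print(round(p_differ, 4), round(p_some_match, 4))
```

The chance of at least one match on a given night comes out close to 1 in 100, and repeating this on many nights makes a match somewhere very likely.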
