

Page 1: Stat Camp for the What is Statistics? MBA Program · Mean < Median Mean Median 40 Percentiles Think about your numerical data values lying on a line: pthper-centile The p-percentile

3/21/2018

1

1

Stat Camp for the MBA Program

Daniel Solow

Lecture 1: Exploratory Data Analysis

2

What is Statistics?

Statistics is the art and science of collecting, analyzing, presenting and interpreting data, which are information you have or can obtain. Business Statistics helps managers make more informed decisions.

• Descriptive Statistics: Describes properties of large data sets with a few summary numbers or graphs.

• Inferential Statistics: Helps you make decisions when you can obtain only a portion of the desired data.

3

Where Is Statistics Needed?

• Market survey/research
  – A market survey says your market share is 19% with margin of error of 3%. What does this mean?

• Manpower planning
  – A bank wants to know how many tellers it should have during the busiest time on a given day.

• Quality control
  – A machine is set to produce parts with a length of 2 inches. A part just produced has a length of 2.1 inches. Should you stop the production and reset the machine?

4

Where Is Statistics Needed?

• Forecasting
  – How much in sales can I expect next quarter?

• Premiums and Warranties
  – What should the insurance premium be for a particular class of customers?
  – You have just introduced a new automobile tire in the market. How many miles of warranty should you offer on this product?

• Fun and Games
  – I bet that “this class has at least two persons with the same birthday (day and month)”. Should you take this bet?


5

Inferential Statistics

Example 1: Suppose you want to know the average length of iron bars produced by your machine.

In such situations, there are a large number of items you are interested in, which is called the population. Every item in the population has a number of interest. You want to know the value of one number associated with the whole population, called the parameter.

– Population: All iron bars produced on that machine.
– Number of interest for each item: Length of the bar.
– Parameter: µ = average length of all iron bars.

6

Inferential Statistics

• Example 2: You want to know your “market share” (the fraction of customers that purchase your product).
  – Population: All people that buy this product.
  – Number Associated with Each Item in the Population: 1, if that person buys your product; 0, if that person does not buy your product.
  – Parameter: the fraction of the population that buys your product.

7

Inferential Statistics

• In general, you can never know the value of the parameter of a population (why?).
  – Because there are too many items in the population.

• In such cases, you should compute your best estimate (statistic) from a “manageable” subset of data (sample) collected randomly from the population.

[Diagram: a random sample is drawn from the population, whose parameter is unknown; the statistic computed from the sample is the best estimate of the parameter.]

8

Inferential Statistics

Example 1 (Iron Bars):
– Collect a sample of n iron bars (iron bar i has a length xi).
– Compute the following statistic (sample mean): $\bar{x} = (x_1 + x_2 + \dots + x_n)/n$

Example 2 (Market Share):
– Collect a sample of n people from the population of people that buy the product (each person i has a value xi of 1 or 0).
– Compute the following statistic (sample proportion): y / n, where y = number in the sample who buy your product.
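The two statistics can be sketched in a few lines of Python. The sample values below are hypothetical, invented only for illustration, and the function names are not from the slides.

```python
# Hypothetical sample data, invented purely for illustration.
bar_lengths = [2.0, 2.1, 1.9, 2.05, 1.95]       # lengths x_i of n = 5 iron bars
buys_product = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]   # x_i = 1 if person i buys, else 0


def sample_mean(values):
    """Sum of the sampled values divided by the sample size n."""
    return sum(values) / len(values)


def sample_proportion(indicators):
    """y / n, where y is the number of 1s in the sample."""
    y = sum(indicators)
    return y / len(indicators)


mean_length = sample_mean(bar_lengths)
share_estimate = sample_proportion(buys_product)
print(mean_length, share_estimate)
```

Because each person is coded as 1 or 0, the sample proportion is simply the sample mean of the indicator values.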


9

Data

• Data are information that is collected, summarized and analyzed for presentation and interpretation.
• Cross-Sectional: Data collected at the same point in time.
• Time Series: Data collected over several time periods.
• Example: The Data Files web site on the first page of these notes has the file shadow02.xls with data on certain stocks.

10

[Screenshot of the shadow-stocks data in shadow02.xls, with qualitative and quantitative variables labeled.]

Exchange Classes: OTC, AMEX, NYSE

Mkt Cap Classes: 0-50, 50-100, 100-150, 150-200, 200-250

11

Data Sets

As shown on the previous slide:

• Elements: Entities on which data are collected (the 25 different companies in the shadow-stocks example).

• Variable: A characteristic of the elements you are interested in and whose value varies (Exchange, Ticker Symbol, and so on).

• Class: A group consisting of one or more values for a variable.

12

Types of Statistical Data

• Qualitative (non-numeric)
  – Nominal – values cannot be compared in terms of order (color, stock exchange, and so on)
  – Ordinal – values can be compared in terms of order (rank, quality level, satisfaction)

• Quantitative (numeric)
  – Interval – difference between values is meaningful (birth year, customer arrival time)
  – Ratio – ratio of two values is meaningful (income, age, height, inventory level)


13

Example: MBA SURVEY

Identify the Data Type

• What is your height in inches? Answer: RATIO

• What is your gender? Answer: NOMINAL

• Attitude toward this Course on 1 to 6 scale (1 = seriously worried (strongly dreading this), 6 = enthused & confident (eager to start)). Answer: ORDINAL

• Do you smoke? Answer: NOMINAL

• WWW purchases (in $) over past year. Answer: RATIO

14

Descriptive Statistics

• Descriptive statistics is the art of summarizing a data set using either:
  – Graphical Methods (Charts)
  – Numerical Methods
• All done with computer software packages.
• Used all the time in annual reports, news articles, research studies.
• Different for qualitative and quantitative data.

15

Summarizing Qualitative Data

• File SoftDrink.xls
• Variable: Soft Drink

Frequency Distribution: A table listing the number of elements in each class.

Value          Frequency
Coke Classic   19
Diet Coke      8
Dr. Pepper     5
Pepsi-Cola     13
Sprite         5
Total          50
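As a rough sketch of how such a table is tallied outside Excel or SPSS, the Python below rebuilds the 50 responses from the counts on the slide (the raw SoftDrink.xls file is not reproduced here) and then counts them again. The variable names are illustrative only.

```python
from collections import Counter

# Counts taken from the slide; the raw responses are rebuilt from them
# because the original data file is not available in these notes.
counts_from_slide = {
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi-Cola": 13,
    "Sprite": 5,
}
responses = [drink for drink, k in counts_from_slide.items() for _ in range(k)]

# Frequency distribution: number of elements in each class.
frequency = Counter(responses)
n = len(responses)

# Relative frequency: fraction of elements in each class.
relative_frequency = {drink: count / n for drink, count in frequency.items()}

for drink in sorted(frequency):
    print(f"{drink:<14}{frequency[drink]:>4}{relative_frequency[drink]:>8.2f}")
```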

16

Using SPSS for Frequency Table

(See the files UsingSPSS_Intro.ppt and UsingSPSS_DescriptiveStats.ppt)

To Open an EXCEL file:

• Click on File/Open/Data.

• Under Files of Type use .xls files.


17

SPSS Output

The relative frequency table shows the proportion (or fraction) of elements in each class. You can display both the frequency and relative frequency tables in a graphical form for easy visualization.

18

Using SPSS for a Bar Graph

Bar Graph: A graph with the classes on the x-axis and the frequencies (or percentages) on the y-axis.

• Click on Graphs/Legacy Dialogs/Bar.

• Click on Simple then Define.

• Drag the variable to the Category axis and click either N of Cases or % of Cases.

19

SPSS Output

20

Using SPSS for a Pie Chart

Pie Chart: A circle having one “slice” for each class, with the size of each slice proportional to the relative frequency of that value.

• Click on Graphs/Legacy Dialogs/Pie.

• Click Define.

• Move the variable into the Slice By box and click % of Cases.

• Click OK.


21

SPSS Output

22

Summarizing Quantitative Data

• With quantitative data, the classes have to be determined by the statistician. Given the minimum and maximum data values:
  – Determine the number of non-overlapping classes (usually 5 – 20).
    • Too few classes: variation does not show.
    • Too many classes: too much detail.
  – The class widths and class limits are then determined from the number of classes.

[Diagram: the range from min to max divided into adjacent classes of equal width, each with a lower limit and an upper limit.]

23

Graphical Methods for Summarizing Quantitative Data

• Tabular Summaries
  – Frequency Distributions
    • Number of items in each class
    • Relative Frequency (percentage of items in each class)
    • Cumulative (everything up to a certain value)

• Graphical Summaries
  – Histograms (like a bar chart)

24

Example: Audit Times

• File audit.xls
• Here, try 5 classes, with min = 12 and max = 33, so

Class Width = (max – min) / classes = (33 – 12) / 5 = 4.2 ≈ 5 (round up)

Class Limits show the smallest and largest values in each class:

10-14, 15-19, 20-24, 25-29, 30-34
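The class-width arithmetic above can be sketched in Python as follows. The starting lower limit of 10 is an assumption read off the class limits on the slide (a round number at or below the minimum); the slides do not state a rule for choosing it.

```python
import math

minimum, maximum, num_classes = 12, 33, 5   # values from the slide

raw_width = (maximum - minimum) / num_classes   # 4.2
class_width = math.ceil(raw_width)              # round up to a whole number

# Assumption: the first lower limit is a convenient round number at or
# below the minimum, as the slide's classes suggest.
first_lower_limit = 10

class_limits = [
    (first_lower_limit + k * class_width,
     first_lower_limit + (k + 1) * class_width - 1)
    for k in range(num_classes)
]
print(class_width, class_limits)
```

Rounding the width up guarantees that five classes cover the whole range from the minimum to the maximum.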


25

Frequency Table

• The frequency table is constructed by counting how many data items fall within each class (relative frequency table for percentages).

Audit time (days)   Frequency   Rel. Frequency (%)
10-14               4           20%
15-19               8           40%
20-24               5           25%
25-29               2           10%
30-34               1           5%

26

Histogram

• A histogram is a plot of a frequency distribution.
  – Classes on the x-axis.
  – Frequencies or relative frequencies on the y-axis.

• Similar to a bar graph, only now the bars are not separated.

• In SPSS: Choose Graphs/Legacy Dialogs/Histogram, move the variable to the Variable box, and then customize the plot.

• In EXCEL: First create a column of “bins” (upper class limits), then choose Tools/Data Analysis/Histogram.

27

Histogram of Audit Times

28

EXCEL Histogram of Audit Times


29

Numerical Summaries of Data

• Location, Average, Central Tendency
  – Mean
  – Median, Percentiles, Quartiles
  – Mode

• Variation (how spread out the numbers are)
  – Range
  – Variance, Standard Deviation

• Shape
  – Skewness

30

MEAN

• MEAN = Arithmetic Average

31

Example: Invention Development Time (Develop.xls)

Invention                  Development Time
Automatic Transmission     16
Ballpoint Pen              7
Filter Cigarettes          2
Frozen Foods               15
Helicopter                 37
Instant Coffee             22
Minute Rice                18
Nylon                      12
Photography                56
Radar                      35
Radio                      24
Roll-On Deodorant          7
Telegraph                  18
Television                 63
Transistor                 16
Video Cassette Recorder    6
Xerox Copying              15
Zipper                     30

In Excel: AVERAGE(range)

An invention on average takes 22.167 years to develop.

32

MEDIAN (splits data in half)

• MEDIAN = middle value when data values are sorted from low to high.
  – At least 50% of values are at or below the median and at least 50% are at or above the median.
  – If sample size (n) is even, the median is the mean of the two middle values.

• What is the median development time?


33

Example: Invention Development Time

Median = (16+18)/2 = 17

In Excel: MEDIAN(range)
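As a cross-check outside Excel, here is a minimal Python sketch that recomputes the mean and median of the 18 development times listed in the invention table. It uses only the standard `statistics` module; the variable names are illustrative.

```python
import statistics

# Development times (years) for the 18 inventions, in table order.
development_years = [16, 7, 2, 15, 37, 22, 18, 12, 56,
                     35, 24, 7, 18, 63, 16, 6, 15, 30]

mean_years = statistics.mean(development_years)      # like AVERAGE(range)
median_years = statistics.median(development_years)  # like MEDIAN(range)

# With n = 18 (even), the median averages the 9th and 10th sorted values.
sorted_years = sorted(development_years)
middle_pair = (sorted_years[8], sorted_years[9])

print(round(mean_years, 3), median_years, middle_pair)
```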

34

Mean vs. Median

• The mean is the most commonly used measure of location.

• However the mean is affected by extremely large or small values.

• In those cases the median may be a more reliable measure of location.

35

Example: Salaries

• Mean = 65,400

• Median = 32,000

Employee Salary

John 30,000

Doe 32,000

Smith 32,000

Perry 33,000

Sweeney 200,000
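The effect of one extreme value can be checked with a short Python sketch using the five salaries above. Dropping the largest salary is done here only to illustrate the point; it is not a step from the slides.

```python
import statistics

salaries = {"John": 30_000, "Doe": 32_000, "Smith": 32_000,
            "Perry": 33_000, "Sweeney": 200_000}

values = list(salaries.values())
mean_all = statistics.mean(values)
median_all = statistics.median(values)

# Illustration only: remove the extreme salary and recompute.
without_extreme = [v for name, v in salaries.items() if name != "Sweeney"]
mean_trimmed = statistics.mean(without_extreme)
median_trimmed = statistics.median(without_extreme)

print(mean_all, median_all)          # mean is pulled up by the 200,000 salary
print(mean_trimmed, median_trimmed)  # median does not move
```

The mean drops by more than half once the extreme value is removed, while the median stays at 32,000.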

36

Example: Invention Development Time

Mean = 22.167

Median = 17


37

SYMMETRIC DATA

Mean = Median

[Figure: symmetric distribution; the mean and median coincide, with 50% of the data on each side.]

38

RIGHT SKEWED DATA

Long Right Hand Tail: Mean > Median

[Figure: right-skewed distribution with the median to the left of the mean.]

39

LEFT SKEWED DATA

Long Left Hand Tail: Mean < Median

[Figure: left-skewed distribution with the mean to the left of the median.]

40

Percentiles

Think about your numerical data values lying on a line.

[Figure: number line with the pth percentile marked; at least p% of the values are ≤ it and at least (100 – p)% are ≥ it.]

The pth percentile is a number such that:

• About p% of your data values are ≤ that number and

• About (100 – p)% of your data values are ≥ that number.

Example: The 90th percentile on the GMAT is a score so that about 90% of people’s GMAT scores are ≤ that number and about 10% are ≥ that number.


41

Quartiles

• Q1 = First quartile = 25th percentile = a value so that about 25% of the elements are ≤ that value and about 75% are ≥ that value.

• Q2 = Second quartile = 50th percentile = a value so that about 50% of the elements are ≤ that value and about 50% are ≥ that value = the median.

• Q3 = Third quartile = 75th percentile = a value so that about 75% of the elements are ≤ that value and about 25% are ≥ that value.
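A hedged sketch of computing quartiles in Python, using the invention development times from earlier in these notes. Software packages use slightly different interpolation rules for percentiles, so the `method="inclusive"` choice below is an assumption and Q1 and Q3 may not match Excel or SPSS exactly; Q2 should still equal the median.

```python
import statistics

development_years = [16, 7, 2, 15, 37, 22, 18, 12, 56,
                     35, 24, 7, 18, 63, 16, 6, 15, 30]

# n=4 asks for the three cut points that split the data into quarters.
# The "inclusive" interpolation rule is one of several conventions.
q1, q2, q3 = statistics.quantiles(development_years, n=4, method="inclusive")

print(q1, q2, q3)
print(q2 == statistics.median(development_years))
```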

42

Percentiles in EXCEL: (file salary.xls)

43

Percentiles in SPSS (File salary.xls)

Choose Analyze; Descriptive Statistics; Frequencies; then move the desired variable to the Variable(s) box; then click on Statistics; then click Percentile(s) and type your desired percentiles and Add; then click Continue and OK.

44

MODE

• The mode of a variable is the value or category that occurs most often in the batch of data.

• A data set can have more than one mode (bimodal, trimodal).
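A small Python sketch of finding every mode, using the invention development times from the earlier table. `statistics.multimode` (Python 3.8 or later) returns all values tied for most frequent, whereas `statistics.mode` returns a single one.

```python
import statistics

development_years = [16, 7, 2, 15, 37, 22, 18, 12, 56,
                     35, 24, 7, 18, 63, 16, 6, 15, 30]

# All values that occur most often (there can be more than one).
all_modes = sorted(statistics.multimode(development_years))

# A single-mode function reports only one of the tied values.
one_mode = statistics.mode(development_years)

print(all_modes, one_mode)
```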


45

Example: Invention Development Time

Modes: 7, 15, 16, 18

In Excel: MODE(range), which returns only one of these values.

46

Do It Yourself Example: Blood Problem

Suppose that the number of pints per day of whole blood used in transfusions at a hospital over the previous 11 days is:

25, 18, 61, 12, 18, 15, 20, 25, 17, 19, 28.

Use the file blood.xls and Excel to:

• Find and interpret the mean, median and mode(s).

47

Is the Mean Enough?

• In the Blood Problem, an average of 23.45 pints of blood are used on a day.
• Question: Does this mean you should have exactly 23.45 pints of blood available?
• Answer: No. Why not? Because the amount of blood you need varies, that is, there is variation in the blood data.
• Question: How much variation is there?
• Answer: What is needed is a numerical value to represent how much variation there is in the data.
• Example: Range = Largest Value – Smallest Value

48

Variance

• Variance is a number ≥ 0 that measures how close the data values are to the mean µ.

[Figure: two data sets plotted around their mean µ; when the values cluster tightly around µ the variance is small, and when they are more spread out the variance is larger.]

• Variance is generally a relative measure.
• More reliable measure of variation than the range.
• Uses all the data.
• There are two different formulas, depending on whether you are computing the population variance or sample variance (see the handout formulas.pdf).

• Consider the following example for managing the amount of blood at a hospital (file blood.xls).


49

Example: Blood Problem (blood.xls)

50

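Population Variance

• µ = population mean
• xi = value of the ith item
• (xi – µ) = deviation of ith item from µ
• (xi – µ)² = square deviation of ith item
• Variance = average of the square deviations: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$, where N is the number of items in the population

• In Excel: VAR.P(range)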

51

Sample Variance

• x̄ = sample mean

• xi = value of the ith item

• (xi – x̄) = deviation of ith item from x̄

• (xi – x̄)² = square deviation

• Sample Variance = $S^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

• In Excel: VAR.S(range)

52

Standard Deviation

• Square root of the variance.
• Expressed in the same units as the data.
• More intuitive measure of variability.
• Blood Problem
  – Sample Variance = S² = 177.07
  – Sample Standard Deviation = S = √177.07 = 13.31

• In Excel: Sample Std. Dev. = STDEV.S(range); Pop. Std. Dev. = STDEV.P(range)

• Under circumstances you will learn soon, the std. dev. has a useful interpretation.
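The Blood Problem figures can be reproduced with Python's standard `statistics` module, as a sketch alongside the Excel functions named above. `variance` and `stdev` divide by n – 1 (sample formulas), while `pvariance` and `pstdev` divide by n (population formulas).

```python
import statistics

# Pints of blood used per day over the previous 11 days (from the slides).
pints = [25, 18, 61, 12, 18, 15, 20, 25, 17, 19, 28]

sample_variance = statistics.variance(pints)       # like VAR.S(range)
sample_std = statistics.stdev(pints)               # like STDEV.S(range)
population_variance = statistics.pvariance(pints)  # like VAR.P(range)
population_std = statistics.pstdev(pints)          # like STDEV.P(range)

print(round(sample_variance, 2), round(sample_std, 2))
print(round(population_variance, 2), round(population_std, 2))
```

The sample variance is always a little larger than the population variance computed on the same numbers, because it divides the same sum of square deviations by n – 1 instead of n.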


53

Using EXCEL and SPSS to Compute Descriptive Statistics

• Both EXCEL and SPSS can automatically compute all of the descriptive statistics.

• In EXCEL:
  – Tools/Data Analysis/Descriptive Statistics

• In SPSS:
  – Analyze/Descriptive Statistics/Frequencies
  – Click on the “Statistics” box and select all of the descriptive statistics you want (including the percentiles).

• EXCEL and SPSS are now illustrated on the data in the file salary.xls.

54

Descriptive Statistics in Excel

To compute descriptive statistics in EXCEL, in the Data tab, use the Data-Analysis add-in and choose Descriptive Statistics:

55

EXCEL Salary Example

56

Descriptive Statistics in SPSS

To compute descriptive statistics in SPSS, use Analyze/Descriptive Statistics/Frequencies and then, on the bottom of the screen, click on Statistics and choose the statistics you want reported:


57

SPSS Salary Example

58

Relationship Between Two Variables

• So far you have seen ways to analyze information about a single variable.

• One is often interested in the relationship between two or more variables.

• Examples of relationships
  – Advertising expenditures and sales.
  – Company profits and stock price.
  – Home size and sales price.

59

Example: Stereo Store

• File stereo.xls

• Is there any relationship between the number of commercials and the sales levels?

60

Scatter Diagrams in Excel

• In Excel, select the two columns of data; click on the Insert tab; then on the Scatter icon; then on the top left diagram.

• Number of commercials on the x-axis.
• Sales levels on the y-axis.


61

Scatter Diagrams in SPSS

• Plot of two variables on the same graph.
• In SPSS, choose Graphs/Legacy Dialogs/Scatter, then choose Simple and click on Define.
• Number of commercials on the x-axis.
• Sales levels on the y-axis.

62

Covariance and Correlation

• The sample and population covariance of two variables X and Y are numbers whose sign has the following meaning:
  – COV(X,Y) > 0 means that the two variables tend to move in the same direction: if one increases (decreases), then the other increases (decreases).
  – COV(X,Y) < 0 means that the two variables tend to move in opposite directions: if one increases (decreases), then the other decreases (increases).

• The value of the covariance is hard to interpret, so the covariance is converted to a number between −1 and +1 called the correlation of X and Y that indicates how strongly X and Y are correlated.
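A minimal Python sketch of these ideas follows. The commercials and sales figures are hypothetical (the contents of stereo.xls are not reproduced in these notes), and the helper functions use the standard textbook formulas, which are an assumption about what the handout contains: the sum of cross-deviations divided by n – 1 for a sample or by n for a population, and the covariance divided by the product of the two standard deviations for the correlation.

```python
import math

# Hypothetical data, invented for illustration only.
commercials = [1, 2, 3, 4, 5]
sales = [40, 45, 50, 52, 60]


def covariance(xs, ys, sample=True):
    """Average cross-deviation; divides by n - 1 (sample) or n (population)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1) if sample else cross / n


def correlation(xs, ys, sample=True):
    """Covariance rescaled by both standard deviations; lies in [-1, +1]."""
    sd_x = math.sqrt(covariance(xs, xs, sample))
    sd_y = math.sqrt(covariance(ys, ys, sample))
    return covariance(xs, ys, sample) / (sd_x * sd_y)


sample_cov = covariance(commercials, sales)
population_cov = covariance(commercials, sales, sample=False)
r_sample = correlation(commercials, sales)
r_population = correlation(commercials, sales, sample=False)

print(sample_cov, population_cov, r_sample, r_population)
```

The positive covariance says the two variables tend to move together. The sample and population correlations come out equal because the n – 1 or n divisor cancels between the numerator and the denominator.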

63

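Covariance and Correlation

• For two variables X and Y for which you have n pairs of data in the form (x1, y1), …, (xn, yn), the covariance and correlation are computed by:

Population COV(X, Y): $\sigma_{XY} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_X)(y_i - \mu_Y)$

Sample COV(X, Y): $s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})$

Population COR(X, Y): $\rho_{XY} = \sigma_{XY} / (\sigma_X \sigma_Y)$

Sample COR(X, Y): $r_{XY} = s_{XY} / (s_X s_Y)$

Note: COVARIANCE.P and COVARIANCE.S in Excel compute the population and sample covariance. CORREL computes the sample correlation, which equals the population correlation.

64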

Cov. and Correlation in EXCEL


65

Cov. and Correlation in SPSS

• In SPSS, choose Analyze/Correlate/Bivariate.
• On the next menu, click on Options.
• Select Cross-Product Deviations and Covariances.
• Click Continue and, on the previous menu, OK.


66

Stat Camp for the MBA Program

Daniel Solow

Lecture 2: Probability

67

Motivation

• You often need to make decisions under uncertainty, that is, facing an unknown future.
• Examples:
  – How many computers should I produce this month?
  – What premium should I charge a class of customers for a particular type of insurance policy?
• The answers to such questions require knowledge of probability, that is, the study of the likelihood of certain events occurring.

68

Probability

• Probability is a number that measures the likelihood that an event will happen.
• Useful as an indicator of the uncertainty associated with an event.
• Scale from 0 to 1.
  – Probability = 0: the event will certainly not happen.
  – Probability = 1: the event will certainly happen.
  – Probability = 0.5: the event is equally likely to happen or not happen.

69

Experiments and Outcomes

• Experiment: A situation in which an action could be repeated many times, each resulting in one of many possible outcomes or sample points. Exactly one of these outcomes will occur, but it is not known which. For example:

Experiment            Outcomes
Toss a coin           Head, Tail
Roll a die            1, 2, 3, 4, 5, 6
Sales Call            Sale, No sale
Dow Jones tomorrow    All positive numbers


70

Assigning Probabilities

• Assume an experiment has n possible outcomes E1, E2, …, En.
• The probability assigned to each outcome must be a number between 0 and 1:

0 ≤ P(Ei) ≤ 1

• The sum of the probabilities of all outcomes must equal 1:

P(E1) + … + P(En) = 1

71

Assigning Probabilities

• Depending on the situation, you can obtain the probabilities of the outcomes of an experiment from:
  – Assumption that all outcomes are equally likely (classical method)
  – Experience from past data (relative frequency method)
  – Experience, intuition or personal judgment (subjective method)

72

Classical Method

• In many situations it is reasonable to assume that all n outcomes of an experiment are equally likely to occur.

• Then each outcome has probability equal to 1/n (why?).

• Examples:
  – Toss a coin: P(H) = P(T) = 1/2.
  – Roll a die: P(1) = P(2) = … = P(6) = 1/6.

73

Relative Frequency Method

• In some experiments, there are past data available, from which you can estimate the proportion of time each outcome has occurred if the experiment is repeated a large number of times.

• This proportion is used as an estimate of the probability of the outcome.


74

Example

• When asking about a person’s attitude on a new law, the outcome could be: disagree (D), neutral (N), agree (A), or uninformed (U).

• One way to assign probabilities to these four outcomes is to use the results of a survey, such as the following:

Attitude          Number of Responses   Relative Frequency
Disagree (D)      52                    52/127 = 0.41
Neutral (N)       12                    12/127 = 0.09
Agree (A)         37                    37/127 = 0.29
Uninformed (U)    26                    26/127 = 0.21
Total             127                   1

• P(D) = 0.41, P(N) = 0.09, P(A) = 0.29, P(U) = 0.21

Note: 26/127 is about 0.205; it is shown as 0.21 so that the rounded relative frequencies sum to 1.
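The relative frequency method can be sketched in Python using the survey counts above. The dictionary keys are the outcome letters from the slide; nothing else is assumed.

```python
# Survey counts from the slide.
responses = {"D": 52, "N": 12, "A": 37, "U": 26}

total = sum(responses.values())

# Relative frequency of each outcome, used as its estimated probability.
probabilities = {outcome: count / total for outcome, count in responses.items()}

for outcome, p in probabilities.items():
    print(outcome, round(p, 3))

# The two requirements for assigned probabilities:
each_between_0_and_1 = all(0 <= p <= 1 for p in probabilities.values())
sums_to_one = abs(sum(probabilities.values()) - 1) < 1e-9
print(each_between_0_and_1, sums_to_one)
```

The unrounded relative frequencies always sum to 1; values rounded to two decimals may need a small adjustment to do so.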

75

Subjective Method

• Appropriate when
  – Equally-likely assumption is not appropriate.
  – No data are available.

• One can use other available information, experience, intuition, judgment.

• In this case, the probability of each outcome expresses the person’s subjective degree of belief about the likelihood of the outcome.

76

Example

• What is the probability that I will get that job offer?
  – My personal belief of the chance is 50%: P(offer) = 0.5, P(no offer) = 0.5.
  – The career office believes that the chance is 70%: P(offer) = 0.7, P(no offer) = 0.3.

77

Events

• The outcomes are the simplest elements associated with an experiment.
• However, you are often interested in finding the probabilities of more complicated events related to this experiment.

• In probability language, events are collections of outcomes.


78

Example 1: Rolling a Die

• Experiment: Roll a die
• Outcomes: 1, 2, …, 6
• Events:
  – A = the outcome is less than 3 = {1, 2}
  – B = the outcome is even = {2, 4, 6}
  – C = the outcome is not less than 4 = {4, 5, 6}

79

Example 2: Sporting Event

• Experiment: One of the following eight cities will be chosen to host a sports contest:

New York (N), London (L), Paris (P), Tokyo (T), Beijing (B), Sydney (S), Madrid (M), Chicago (C)

• Events:
  – A = contest is in Asia = {T, B}
  – B = contest is in an English-speaking city = {N, L, S, C}
  – C = contest is not in Europe = {N, T, B, S, C}

80

Example 3: Playing Cards

• A deck of cards consists of 52 cards, arranged in four suits: spades (S), clubs (C), diamonds (D), and hearts (H).

• Spades and clubs are black, diamonds and hearts are red.

• Each suit has 13 cards arranged in order: 1 (ace), 2, 3, …, 10, J(ack), Q(ueen), K(ing). The J, Q and K are the face cards.

• A card is denoted by the card number and the suit: 1C = ace of clubs, JH = jack of hearts, and so on.

81

Example 3 (cont)

• Experiment: Draw a card at random
• 52 Outcomes: 1S, …, KS, 1C, …, KC, 1D, …, KD, 1H, …, KH
• Events:
  – A = draw a king = {KS, KC, KD, KH}
  – B = draw a red two = {2H, 2D}
  – C = draw a club face card = {JC, QC, KC}


82

Probabilities of Events

• The probability of an event can be computed by adding up the probabilities of all outcomes included in that event.

83

Example 1: Rolling a Die

• P(1) = P(2) = … = P(6) = 1/6.

– A = the outcome is less than 3 = {1, 2}
  P(A) = P(1) + P(2) = 1/6 + 1/6 = 1/3.

– B = the outcome is even = {2, 4, 6}
  P(B) = P(2) + P(4) + P(6) = 1/2.

– C = the outcome is not less than 4 = {4, 5, 6}
  P(C) = P(4) + P(5) + P(6) = 1/2.
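The rule of adding up outcome probabilities can be sketched in Python for the die example. Exact fractions are used so the results match the slide; the helper name `event_probability` is illustrative.

```python
from fractions import Fraction

# Classical method: each of the six faces has probability 1/6.
outcome_probability = {face: Fraction(1, 6) for face in range(1, 7)}


def event_probability(event):
    """Add the probabilities of all outcomes contained in the event."""
    return sum(outcome_probability[outcome] for outcome in event)


A = {1, 2}        # outcome is less than 3
B = {2, 4, 6}     # outcome is even
C = {4, 5, 6}     # outcome is not less than 4

print(event_probability(A), event_probability(B), event_probability(C))
```

The same function works when outcomes are not equally likely, as long as the dictionary holds the probability assigned to each outcome.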

84

Example 2: Sporting Event

• Assume:
  – P(N) = 0.2, P(L) = 0.1, P(P) = 0.1, P(T) = 0.1, P(B) = 0.05, P(S) = 0.15, P(M) = 0.05, P(C) = 0.25.

• Then
  – A = contest is in Asia = {T, B}
    P(A) = P(T) + P(B) = 0.1 + 0.05 = 0.15.

  – B = contest in an English-speaking city = {N, L, S, C}
    P(B) = 0.2 + 0.1 + 0.15 + 0.25 = 0.70.

  – C = contest not in Europe = {N, T, B, S, C}
    P(C) = 0.2 + 0.1 + 0.05 + 0.15 + 0.25 = 0.75.

85

Example: Playing Cards

• P(any outcome) = 1/52.

– A = draw a king = {KS, KC, KD, KH}
  P(A) = 4/52 = 1/13 = 0.077.

– B = draw a red two = {2H, 2D}
  P(B) = 2/52 = 1/26 = 0.038.

– C = draw a club face card = {JC, QC, KC}
  P(C) = 3/52 = 0.058.


86

Complement of an Event

• The complement of an event A is the event that A does not happen and thus contains all outcomes that are not contained in A.

• The complement of A is written as Ac.
• If A happens, Ac does not happen and vice versa.
• Complement Law: P(A) = 1 – P(Ac).
• Note: If the event A you are interested in has many outcomes but Ac does not, then, to compute P(A), it is easier to find P(Ac) and then use P(A) = 1 – P(Ac).

87

Examples

• Example 1: Suppose A is the event that the weekly sales exceed $2,000 and P(A) = 0.75.
• Then Ac is the event that the weekly sales do not exceed $2,000 and P(Ac) = 0.25.
• Example 2: When you pick a card at random, what is the probability that you do not pick an ace?

• Answer: Let A = not pick an ace, so Ac = pick an ace = {AC, AD, AH, AS} and so P(A) = 1 – P(Ac) = 1 – (4/52) = 48/52.
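As a sketch, the complement law can be verified by listing the deck in Python and counting. The card labels (rank followed by suit letter, with "A" for the ace) are a convenience chosen for this example.

```python
from fractions import Fraction

ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["S", "C", "D", "H"]
deck = {rank + suit for rank in ranks for suit in suits}   # 52 outcomes

aces = {card for card in deck if card.startswith("A")}     # the event Ac
not_aces = deck - aces                                     # the event A

# Complement law: P(A) = 1 - P(Ac).
p_ace = Fraction(len(aces), len(deck))
p_not_ace_by_law = 1 - p_ace

# Direct count of the larger event, for comparison.
p_not_ace_direct = Fraction(len(not_aces), len(deck))

print(p_not_ace_by_law, p_not_ace_direct)
```

Counting the 4 aces is quicker than counting the 48 other cards, which is why the complement law is the easier route here.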

88

Intersection of Two Events

• If A and B are two events, you are often interested in the probability that both A and B occur simultaneously.

• The event A and B is called the intersection of A and B, written A ∩ B, and consists of outcomes that are in both A and B simultaneously.

When you see the word AND think of ∩.

[Venn diagram: two overlapping circles A and B; the overlap is A ∩ B.]

89

Intersection of Two Events

• Example: What is the probability of drawing a red king?
  – A = draw a king = {KS, KC, KH, KD}
  – B = draw a red card = {1H, 2H, 3H, 4H, 5H, 6H, 7H, 8H, 9H, 10H, JH, QH, KH, 1D, 2D, 3D, 4D, 5D, 6D, 7D, 8D, 9D, 10D, JD, QD, KD}.
  – A ∩ B = draw a king and red card = {KD, KH}.

P(drawing a red king) = P(drawing a red card and a king) = P(A ∩ B) = P({KD, KH}) = 2/52 = 1/26.


90

Mutually Exclusive Events

• If A ∩ B contains no outcomes, then the events A and B are called mutually exclusive.

• If A and B are mutually exclusive it is not possible that both A and B will happen.

• If A and B are mutually exclusive, then

P(A ∩ B) = 0.

91

Example

• A = draw a king = {KS, KC, KH, KD}
• B = draw a queen = {QS, QC, QH, QD}
• A ∩ B contains no outcomes.
• If you draw a single card, it is not possible that the card will be both a king and a queen.

92

Union of Two Events

• If A and B are two events, you are often interested in the probability that either A or B (or both) occur.

• In math terms, this is called the union of A and B and is denoted by A ∪ B.

Example: Roll a die.
– A = the outcome is less than 3 = {1, 2}.
– B = the outcome is even = {2, 4, 6}.
– A ∪ B = the outcome is less than 3 or even (or both) = {1, 2, 4, 6}.

93

The Addition Law

• The event A ∪ B consists of all the outcomes that belong to A or to B (or to both).
• When you see the word OR think of ∪.

[Venn diagram: A ∪ B is the whole region covered by the circles A and B; A ∩ B is their overlap.]

The probability of A ∪ B can be computed by adding the probabilities of the individual outcomes in A ∪ B, OR with the following addition law (whichever is easier):

P(A ∪ B) = P(A) + P(B) – P(A ∩ B)


94

Union of Mutually Exclusive Events

• If A and B are mutually exclusive, you know that P(A ∩ B) = 0.
• Therefore, for mutually exclusive events,
P(A ∪ B) = P(A) + P(B) – P(A ∩ B) = P(A) + P(B)

95

Example
• Roll a die.
– A = the outcome is less than 3 = {1, 2}
– B = the outcome is even = {2, 4, 6}
– A ∩ B = {2}
– A ∪ B = {1, 2, 4, 6}
• Probabilities
– P(A) = 1/3
– P(B) = 1/2
– P(A ∩ B) = 1/6
– P(A ∪ B) = 4/6 = 2/3.
• From the addition law:
– P(A ∪ B) = P(A) + P(B) – P(A ∩ B) = 1/3 + 1/2 – 1/6 = 2/3,
which agrees with the direct calculation.

96

Example

• You draw a single card from a deck.
• You win if you draw a king or a red card.
• What is the probability you win?

97

Answer
• A = draw a king
• B = draw a red card
• A ∩ B = red king
• A ∪ B = king or red card

P(A) = 4/52
P(B) = 26/52
P(A ∩ B) = 2/52

P(A ∪ B) = P(A) + P(B) – P(A ∩ B) = 4/52 + 26/52 – 2/52 = 28/52 = 0.538
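The king-or-red-card answer can be cross-checked by listing the deck and counting outcomes. The sketch below is only an illustration: the card labels and the helper name `prob` are assumptions made here, not part of the slides. It compares direct counting with the addition law.

```python
from fractions import Fraction
from itertools import product

# Build a standard 52-card deck as strings such as "KH" (king of hearts).
ranks = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
suits = ["S", "C", "H", "D"]
deck = {rank + suit for rank, suit in product(ranks, suits)}

def prob(event):
    """Probability of an event (a set of cards) when all cards are equally likely."""
    return Fraction(len(event), len(deck))

kings = {card for card in deck if card.startswith("K")}   # event A
reds = {card for card in deck if card[-1] in "HD"}        # event B

# Direct calculation: count the outcomes in the union.
p_direct = prob(kings | reds)

# Addition law: P(A or B) = P(A) + P(B) - P(A and B).
p_law = prob(kings) + prob(reds) - prob(kings & reds)

print(p_direct, p_law, float(p_law))
```

Both routes give the same fraction, which is the point of the addition law: subtracting P(A ∩ B) removes the two red kings that would otherwise be counted twice.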


98

Conditional Probability
• Sometimes the probability of an event changes when you get information about another related event.
• Example: You roll a die and don’t see the outcome.
– What is the probability of the outcome being a 2? 1/6
– If I tell you that the outcome was odd, what is the probability of a 2 now? 0
– If I tell you that the outcome was even, what is the probability of a 2 now? 1/3
• This is the conditional probability of an event given that another event happened.

99

Example: Police Force
• A police force consists of 960 men and 240 women officers. Last year 288 men and 36 women were promoted. Women officers complained of discrimination. The administration said that this was due to the low number of women officers in the force.
• Do you think the discrimination complaint is justified?
• Approach: Compare P(promotion given a man) and P(promotion given a woman).

Probability Model
Experiment: Select an officer at random.
Outcomes: Any one of 1200 officers, all equally likely.

100

Events and Their Probabilities
• Events:
– M = man officer, P(M) = 960/1200 = 0.8
– W = woman officer, P(W) = 240/1200 = 0.2
– A = promoted officer, P(A) = 324/1200 = 0.27
– Ac = non-promoted officer, P(Ac) = 0.73
– M ∩ A = male promoted officer, P(M ∩ A) = 288/1200 = 0.24

               Men   Women   Totals
Promoted       288      36      324
Not promoted   672     204      876
Totals         960     240     1200

101

Conditional Probability

• Select an officer at random.
• P(selected officer is promoted) = P(A) = 324/1200 = 0.27

               Men   Women   Totals
Promoted       288      36      324
Not promoted   672     204      876
Totals         960     240     1200

Suppose the selected officer is a man.

P(A | M) = 288/960 = 0.3

Suppose the selected officer is a woman.

P(A | W) = 36/240 = 0.15
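The same conditional probabilities can be computed from the promotion table with a few lines of code. This is a minimal sketch; the dictionary layout and the helper names `prob` and `conditional` are illustrative assumptions rather than anything defined in the lecture.

```python
from fractions import Fraction

# Promotion counts from the police force table.
counts = {
    ("man", "promoted"): 288,
    ("man", "not promoted"): 672,
    ("woman", "promoted"): 36,
    ("woman", "not promoted"): 204,
}
total = sum(counts.values())  # 1200 officers, all equally likely

def prob(event):
    """P(event), where event is a test applied to each (sex, status) cell."""
    return Fraction(sum(n for cell, n in counts.items() if event(cell)), total)

def conditional(event, given):
    """P(event | given) = P(event and given) / P(given)."""
    return prob(lambda cell: event(cell) and given(cell)) / prob(given)

def is_man(cell):
    return cell[0] == "man"

def is_woman(cell):
    return cell[0] == "woman"

def is_promoted(cell):
    return cell[1] == "promoted"

p_promoted = prob(is_promoted)
p_promoted_given_man = conditional(is_promoted, is_man)
p_promoted_given_woman = conditional(is_promoted, is_woman)

print(float(p_promoted), float(p_promoted_given_man), float(p_promoted_given_woman))
```

Conditioning on "man" simply restricts attention to the Men column of the table, which is why the result equals 288/960.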


102

Conclusion

• We have found that
– P(A|M) = 0.3: 30% of men officers were promoted.
– P(A|W) = 0.15: 15% of women officers were promoted.
• Is the complaint justified in your opinion? YES!

103

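Computing Conditional Prob.
• The general formula (the definition of conditional probability, on which Bayes’ Theorem is built) for the conditional probability of an event A given that event B has occurred is:

P(A | B) = P(A ∩ B) / P(B)

• Police Example:

P(A | M) = P(A ∩ M) / P(M) = 0.24 / 0.8 = 0.3

This is consistent with our original calculation.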

104

Independent Events

• Two events A and B are independent if the probability of A does not change with information about B, that is, if
P(A|B) = P(A)
and vice versa
P(B|A) = P(B).

• If this relation is not true, then the two events are dependent.

105

Example: Police Force

• You have found that
– P(A|M) = 0.3
– P(A) = 0.27
• Therefore, P(A|M) ≠ P(A), and the events A and M are dependent.

• This means that the probability of an officer being promoted is influenced by whether this officer is a man.

• This justifies the discrimination claim.


106

Multiplication Law

• Assume A and B are independent. Then
P(A|B) = P(A).
• However, you also know that
P(A|B) = P(A ∩ B) / P(B).
• Then you find that
P(A ∩ B) / P(B) = P(A|B) = P(A),
or
P(A ∩ B) = P(A) P(B).

107

Independence
• It is often reasonable to assume that two events are independent, because of their nature.
• For example, suppose you roll two dice.
– Does knowing what happened on the first roll tell you anything about what will happen on the second roll? No!
• Because there is no relationship between the first and the second roll, you may assume that A and B are independent, where A is a particular outcome on the first roll and B is a particular outcome on the second roll.
• Then the multiplication law implies

P(A ∩ B) = P(A) P(B) = (1/6) (1/6) = 1/36.

108

Summary Example
• Bob and Jon live together and each has a car that works, respectively, 60% and 90% of the time.
• A potential employer has said she will hire them if they have one working car at least 95% of the time. Will they get the job?
• State what you are looking for as a prob. question.
Is P(that at least one car is working) ≥ 0.95?
• Use probability theory to answer the question.

Events:
A = Bob’s car works, P(A) = 0.6
B = Jon’s car works, P(B) = 0.9

P(at least one car is working) = P(A OR B) = P(A ∪ B)
= P(A) + P(B) – P(A ∩ B) (Addition Law)
= 0.6 + 0.9 – 0.6 (0.9) (A & B are ind.)
= 0.96.

Since 0.96 ≥ 0.95, they get the job.
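The Bob-and-Jon calculation combines the addition law with the multiplication law for independent events. A short sketch of it follows; the variable names are illustrative, and the complement-based cross-check is an extra step added here that does not appear on the slide.

```python
# Probabilities that each car works (from the example).
p_bob = 0.6   # P(A)
p_jon = 0.9   # P(B)
required = 0.95

# Multiplication law, assuming the two cars fail independently.
p_both = p_bob * p_jon

# Addition law: P(A or B) = P(A) + P(B) - P(A and B).
p_at_least_one = p_bob + p_jon - p_both

# Cross-check (not on the slide): at least one works unless both fail.
p_via_complement = 1 - (1 - p_bob) * (1 - p_jon)

gets_job = p_at_least_one >= required
print(round(p_at_least_one, 2), round(p_via_complement, 2), gets_job)
```

The complement route is often the easier one when there are more than two events, because "at least one" has only a single opposite case: none.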

109

Random Variables
• A random variable (rv) is a quantity of interest:
– Whose value is uncertain (by “uncertain” is meant that there are many (at least two) possible values and you do not know which value will occur).
– Whose value you cannot control.
• Example 1: Y = the outcome of flipping a coin.
• Random variables are used to help make decisions in a problem involving uncertainty.
• Example 2: Roll a die once. If the outcome is 1 or 2 you lose $5, if 3 you lose $1, if 4 you win $2, if 5 or 6 you win $4. Do you want to play this game?
• To decide, you need to identify appropriate rvs.


110

Identifying Random Variables and Their Possible Values

• The first two steps involved in working with a random variable are:

• Step 1: Identify the random variable.
– Use a symbol and write the meaning of the variable, including units.
Example: Let X = $ earned in the dice game.
• Step 2: List all possible values the r.v. can have.
Example: X = −5, −1, 2, or 4.

111

Types of Random Variables
• Discrete Random Variable: A r.v. whose possible values you can “count” (either a finite set or an infinite set of countable numbers, for example, 0, 1, 2, …).
Example Roll-and-Earn: Poss. Val. of X = −5, −1, 2, 4
Can you count these possible values? Yes!
• Continuous Random Variable: A r.v. whose value is any number (including decimals and fractions) in an interval or a collection of intervals (infinite uncountable number of values).
Example: X = liters of water I drink today. Poss. Val. = [0, 5]
Can you count these possible values? No!

112

Examples of Random Variables
• Number of heads in 50 tosses of a coin.
– All possible values: 0, 1, …, 50 (discrete)
• Number of customers who enter a store in a day.
– All possible values: 0, 1, 2, … (discrete)
• Number of cm of rain next month.
– All possible values: [0, 30]. (continuous)
• Time in minutes between two customers arriving at a bank.
– All possible values: [0, ∞). (continuous)

113

Examples of Random Variables
• Number of defective products in a shipment of 100.
– All possible values: 0, 1, 2, …, 100. (discrete)
• Quantity x of liquid inside a 12 oz can.
– All possible values: [0, 12]. (continuous)
• Percentage x of a project completed by the deadline.
– All possible values: [0, 100]. (continuous)
• $ sales in a retail store tomorrow.
– All possible values: $0.00 - $10000.00 (discrete)


114

Which of the Following are RVs?
• The temperature at noon yesterday. (No)
• The temperature at noon tomorrow. (Yes)
• The age of a person chosen at random in this class. (It depends on timing.)
– If you have already selected the person: (No: there is only one value for that person’s age.)
– If you have not yet selected the person: (Yes)
All possible values: The (finite) list of ages of everyone in this class. (discrete)

Time is critical!

115

Which of the Following are RVs?
• The average of a population. (No, µ only has one val.)
• The average of a sample of size 2. (Depends on time.)
– If you have already selected the sample: (No, there is only one value for that average.)
– If you have not yet selected the sample: (Yes)

Groups of size 2: G1, G2, G3, …
Average for the group: A1, A2, A3, …

All possible values of the sample average: The (finite) list of averages of every group of size 2 in the population. (discrete, but …)

Warning: When a discrete rv has too many possible values, it might not be practical to work with that rv.

116

RVs, Populations, and Sampling

For any population, you can create the following two discrete random variables:

Y = the value of an item that will be chosen randomly from the population.

Ȳ = the average of a sample of size n before taking the sample.

Note: The average of the pop. is not a RV.

For any RV, you can create a sample of size n by observing and recording the value of the RV n separate times.

Question: What can you do when a quantity of interest, such as the average of a population, is unknown?

117

When Something is Unknown

Ideal: Determine the value, however…
If doing so requires too much time, effort, money, then…

Next Best: Estimate the value, for example, by…
Building a model! For example:
Model 1: Take a sample of size n (the model) and use the average from the sample as your best estimate of µ.
Model 2: Think of µ as a discrete random variable (the model) with possible values: 20, 21, …, 30
Model 3: Think of µ as a continuous random variable (the model) with possible values: [20, 30]


118

Identifying Random Variables
• INSURANCE PREMIUMS
What should the insurance premium be for a particular class of customers?
• Question: Is the annual premium a r.v.?
Answer: No, because you can control its value.
• Let C = the $/year to be claimed by this type of customer, with possible values: 0 – 100,000 (discrete)

119

Examples of Random Variables
• WARRANTIES
GoodTire has just introduced a new tire in the car market. How many miles of warranty should the company offer?
• Qn: Is the warranty mileage a r.v.?
• Ans: No, because you can control its value.
• M = the number of miles such a tire is expected to last, with possible values: [0, ∞). (continuous)

120

Examples of Random Variables
• PERSONNEL PLANNING
How many bank tellers should be working during the busiest time of the day?
• A = the total number of customers that arrive during that period, with possible values: 0, 1, 2, … (discrete)
• W = the number of minutes it takes a teller to serve a customer, with possible values: [0, 30]. (continuous)

121

Probability Distribution
• To “work” with a random variable, you must know that variable’s probability distribution, which describes the probabilities of all the possible values of the random variable occurring.

• Note: Probability distributions are different for discrete RVs and continuous RVs.


122

Discrete Distributions
• For a discrete random variable X, the probability distribution is described by a probability density function that consists of:
– The list of the possible values and, for each one, the probability of that value occurring.
• Notationally, if t is a possible value for the rv X, then the density function is written as follows:

f(t) = P(X = t) = the probability that the random variable takes the value t.

123

Example 1
• Toss a coin once.
• X = number of heads.
• Possible values of X:

t                 0     1
P(X = t) = f(t)   0.5   0.5

• A valid probability density function for a discrete random variable must satisfy the following two properties:
– 0 ≤ f(t) ≤ 1, for each value of t.
– Σ f(t) = 1, where the sum is over all possible values t.

124

Example 2
Toss a coin twice, and let X = number of heads.

Values of X and Probability Density Function:

X = 0: f(0) = P(X = 0) = P(no heads) = P(T and T) = P(T) * P(T) = 0.5 (0.5) = 0.25
X = 1: f(1) = P(X = 1) = P(1 head) = P(HT or TH) = P(HT) + P(TH) = 0.25 + 0.25 = 0.5
X = 2: f(2) = 1 – f(0) – f(1) = 1 – 0.25 – 0.5 = 0.25

125

Example 2 (cont’d)

• The probability density function can be shown in a table or graph:

[Graph: f(t) plotted against t, with height 0.25 at t = 0, 0.5 at t = 1, and 0.25 at t = 2]


126

Example 3: Roll and Earn

• Find the probability density function of X, the amount earned in the following game:
Roll a die once. If the outcome is 1 or 2 you lose $5, if 3 you lose $1, if 4 you win $2, if 5 or 6 you win $4.

Die outcome:             1 or 2   3     4     5 or 6
Possible values for X:   −5       −1    2     4
Probabilities:           2/6      1/6   1/6   2/6

127

Expected Value
• In addition to the density function, you also want to find the expected value (mean) of the discrete random variable, which is a measure of the central location of the value of the random variable.

• Computed as the sum of the products of the possible values t and the corresponding probabilities:

µX = E[X] = Σ t f(t)

• Note: The mean of a r.v. is different from the mean of a population and the mean of a sample.

128

Why is the Expected Value Useful?

• Let X be your random variable.

• Law of Large Numbers: If you observe the value of the random variable X a large number of times, the average of the observed values will be very close to the expected value of the random variable X.

129

Example
• Toss a coin twice and let X = number of heads.
• The probability density function is:

t      0      1     2
f(t)   0.25   0.5   0.25

• The expected value is: E(X) = (0)(0.25) + (1)(0.5) + (2)(0.25) = 1

• If the two tosses are repeated many times and the number of heads recorded each time, the average number of heads per pair of tosses will be close to 1.
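The Law of Large Numbers claim for the two-toss example can be tried out with a small simulation. The sketch below is illustrative only: the function name `average_heads`, the number of repetitions, and the fixed seed are choices made here so that the run is repeatable.

```python
import random

def average_heads(num_repeats, seed=0):
    """Repeat 'toss a fair coin twice' num_repeats times and
    return the average number of heads per pair of tosses."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_heads = 0
    for _ in range(num_repeats):
        first = rng.random() < 0.5    # True counts as one head
        second = rng.random() < 0.5
        total_heads += first + second
    return total_heads / num_repeats

# The average should settle near E(X) = 1 as the number of repeats grows.
for n in (10, 1_000, 100_000):
    print(n, average_heads(n))
```

With only a handful of repetitions the average can be noticeably off; with many repetitions it should sit close to the expected value of 1.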


130

Computing Exp. Value in Excel
• Roll and Earn

This means that, if you play the game many times and record the results, on average, you will lose $0.16 each time.

131

Example: Car Dealership

• Now that the new car models are available, a dealership has lowered the prices on last year’s models in order to clear its inventory. With prices slashed, a salesperson estimates the following probability distribution of X, the number of cars that person will sell next week.

t      0      1      2      3      4
f(t)   0.05   0.15   0.35   0.25   0.2

• Find the expected value of X. What does it mean?

132

Answer

• E(X) = (0)(0.05) + (1)(0.15) + (2)(0.35) + (3)(0.25) + (4)(0.2) = 2.40

• If prices stay the same over several weeks, the average number of cars sold per week will be close to 2.40.

133

Variance and Standard Deviation of Random Variables
• You also want to find the variance of the r.v., which is a measure of how close the possible values are to the expected value, µ.

• The standard deviation is the square root of the variance.

• Note: Variance and standard deviation for a r.v. are different from those for a population and sample.

The variance of a discrete random variable X is:

σ² = VAR(X) = Σ (t – µ)² f(t)

[Figure: when the possible values cluster near µ, the var. is small; when they are spread farther from µ, the var. is larger]


134

Example
• Toss a coin twice and let X = number of heads.
• Recall that the expected value is µ = 1.

• Variance: Var(X) = σ² = (0 – 1)²(0.25) + (1 – 1)²(0.5) + (2 – 1)²(0.25) = 0.50

• Standard Deviation: σ = √0.50 = 0.707

135

Example
• For the roll-and-earn game, find the variance and standard deviation.
• Recall that E(X) = µ = −0.16.

• Variance: Var(X) = σ² = 14.35
• Standard deviation: σ = √14.35 = 3.79
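The expected value and variance formulas can be applied to the roll-and-earn game directly. In the sketch below, the helper `mean_and_variance` and the dictionary layout are illustrative assumptions. With exact fractions the mean is −1/6 (about −0.167) and the variance is about 14.47; the slide figures of −0.16 and 14.35 appear to be consistent with probabilities rounded to two decimals (0.33 and 0.17), which is an inference made here and not something the slides state.

```python
from fractions import Fraction
from math import sqrt

def mean_and_variance(density):
    """density maps each possible value t to f(t) = P(X = t).
    Returns (E[X], VAR(X)) using the sums of t*f(t) and (t - mean)^2 * f(t)."""
    mean = sum(t * p for t, p in density.items())
    variance = sum((t - mean) ** 2 * p for t, p in density.items())
    return mean, variance

# Roll-and-earn with exact probabilities.
exact = {-5: Fraction(2, 6), -1: Fraction(1, 6), 2: Fraction(1, 6), 4: Fraction(2, 6)}
exact_mean, exact_var = mean_and_variance(exact)

# Same game with probabilities rounded to two decimals (assumed rounding).
rounded = {-5: 0.33, -1: 0.17, 2: 0.17, 4: 0.33}
rounded_mean, rounded_var = mean_and_variance(rounded)

# Two coin tosses, X = number of heads.
coin_mean, coin_var = mean_and_variance({0: 0.25, 1: 0.5, 2: 0.25})

print(float(exact_mean), float(exact_var), sqrt(exact_var))
print(rounded_mean, rounded_var, sqrt(rounded_var))
print(coin_mean, coin_var, sqrt(coin_var))
```

Either way the conclusion is the same: the game loses a little on average, and the standard deviation of roughly $3.80 is large compared with that average loss.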


136

Stat Camp for the MBA Program

Daniel Solow

Lecture 3
Random Variables and Distributions

137

Probability Distribution
• Recall that a random variable is a quantity of interest whose value is uncertain and that you cannot control. To use a r.v. to help solve a decision problem:
• Step 1: Identify the random variable, say X.
• Step 2: List all possible values for X.
• Step 3: Determine if X is discrete or continuous.
• Step 4: Identify the density function of X.
• Step 5: Find E[ X ] = the expected value of X.
• Step 6: Find VAR[ X ] = the variance of X (or the STDEV[ X ]).

138

Example: Debbon Air Seat-Release Problem

• Debbon Air needs to make a decision about Flight 206 to Miami, which is fully booked except that…

• 3 seats are reserved for last-minute customers (who pay $475 per seat), but the airline does not know if anyone will buy those seats.

• If they release them now, they know they will be able to sell them all for $250 each.

• Debbon Air counts a $150 loss of goodwill for every last-minute customer turned away.

139

Debbon Air “Seat Release”

• Question: How many seats, if any, should Debbon Air release?

• Question: On what basis, that is, on what criterion, are you going to make the final decision?

• Answer: Based on profits.
• Approach: Find the expected profit when you release 0 seats, 1 seat, 2 seats, and 3 seats, and then choose the alternative that has the max. expected profit.


140

Identifying Random Variables
• Question: Can you identify any r.v.s that can help you make this decision?
• Let the r.v. X = # of arriving last-minute customers
• Possible values for X: 0, 1, 2, 3

• Probability distribution for X:

t      0      1     2      3
f(t)   0.45   0.3   0.15   0.1

141

Identifying Random Variables

• Another random variable of interest is:
R = net revenue (revenue minus loss of goodwill)
• However, this revenue depends on the number of seats released, so define
Ri = net revenue when i seats are released (i = 0, 1, 2, 3).

142

“Debbon Air” Seat Release
• What are the possible values for R3, that is, what are the possible revenues when all 3 seats are released?
• The answer depends on how many last-minute customers (X) arrive, so:
If X = 0: R3 = 3(250) = $750
If X = 1: R3 = 3(250) – 150 = $600
If X = 2: R3 = 3(250) – 2(150) = $450
If X = 3: R3 = 3(250) – 3(150) = $300

143

Debbon Air “Seat Release”

• Expected Value of R3 (revenue when 3 seats are released):

• E(R3) = 750 (0.45) + 600 (0.3) + 450 (0.15) + 300 (0.1) = 615.
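The calculation of E(R3) can be repeated for every number of released seats. The sketch below assumes a particular reading of the problem statement: each seat that is held back and taken by a last-minute customer earns $475, each released seat earns $250, and each last-minute customer who finds no seat costs $150 in goodwill. The function and constant names are illustrative, and the slides only spell out the case of 3 released seats.

```python
SEATS = 3
LAST_MINUTE_FARE = 475   # paid by a last-minute customer who gets a seat
RELEASE_FARE = 250       # paid for each seat released now
GOODWILL_LOSS = 150      # cost of each last-minute customer turned away

# Probability distribution of X = number of arriving last-minute customers.
demand = {0: 0.45, 1: 0.3, 2: 0.15, 3: 0.1}

def net_revenue(released, arrivals):
    """Net revenue R_i when `released` seats are released and `arrivals` customers show up."""
    held = SEATS - released
    served = min(arrivals, held)
    turned_away = arrivals - served
    return (released * RELEASE_FARE
            + served * LAST_MINUTE_FARE
            - turned_away * GOODWILL_LOSS)

def expected_net_revenue(released):
    """E(R_i) = sum over x of R_i(x) * P(X = x)."""
    return sum(p * net_revenue(released, x) for x, p in demand.items())

expected = {i: expected_net_revenue(i) for i in range(SEATS + 1)}
best = max(expected, key=expected.get)

for i, value in expected.items():
    print(i, round(value, 2))
print("best number of seats to release:", best)
```

Under these assumptions the 3-seat case reproduces the 615 computed above, and comparing all four alternatives identifies the one with the maximum expected net revenue.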


144

“Debbon Air” Seat Release

• How many seats should be released to maximize expected net revenue?

Two seats should be released.

145

A Binomial Experiment
• In many applications, you will perform a binomial experiment, in which a trial is repeated a number of times and, each time, there are two possible outcomes:
– A “success” or
– A “failure”.
• Example: At a tire factory, you will examine a number of tires and, for each one, determine if
– There is a defect (“success”) or
– There is no defect (“failure”).

Note: A “success” does not have to be a “good” outcome.

What you are then interested in is the probability of having k successes out of, say, n trials.

Qn: What is the prob. that there are 0 defects (“successes”) in, say, a sample of n = 20 tires?

146

Binomial Experiment
In a binomial experiment, you must identify:
• What constitutes a trial, a “success” and a “failure.”
• p = P(success) = the probability of a success occurring in each trial (so, P(failure) = 1 – p).
• n = the number of independent trials (repetitions).
Then define the following r.v.:

X = number of successes out of n trials. (discrete)

Possible Values: 0, 1, 2, …, n.

Density Function = Probabilities: get from EXCEL

147

Example 1
• Toss a fair coin 100 times.
• A “trial” is a flip and a “success” = a heads.
• p = probability of success = P(head) = 0.5.
• n = 100 independent coin tosses.
• X = number of heads follows the binomial distribution with n = 100 and p = 0.5.
Then, E(X) = np = 100 (0.5) = 50
SD(X) = √(np(1 – p)) = √((100)(0.5)(0.5)) = √25 = 5.


148

Example 2
• A student takes a multiple choice test with 25 questions. Each question has 4 choices. Assume the student does not know the answer to any question and just guesses.
• A “trial” is a question and a “success” is a correct answer.
• p = prob. of success = P(correct answer) = 0.25.
• n = 25 independent questions.
• X = number of correct answers follows the Binomial distribution with n = 25 and p = 0.25.
• Then, E(X) = (25) (0.25) = 6.25.
SD(X) = √((25) (0.25) (0.75)) = √4.6875 = 2.17.

149

Binomial Random Variables
• If n = 1, then the possible values for X are:

X = 1 if the outcome is a success
X = 0 if the outcome is a failure

Here, X is called a binomial random variable.
• The density function is: f(0) = 1 – p, f(1) = p.
• The expected value of X is E(X) = np = p.
• The standard deviation of X is SD(X) = √(p(1 – p)).

150

Binomial Probabilities in Excel
• The Excel BINOMDIST function provides two kinds of binomial probabilities.
• Suppose that the random variable X is Binomial with parameters (n, p).
• For k successes, where k is between 0 and n:

P(X = k) = BINOMDIST(k, n, p, FALSE)

P(X ≤ k) = BINOMDIST(k, n, p, TRUE)
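For readers without Excel, the two kinds of binomial probabilities can be sketched in plain Python. The function names `binom_pmf` and `binom_cdf` are illustrative choices made here; they are meant to mirror BINOMDIST with FALSE and TRUE as the last argument, using the standard binomial formula.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p); plays the role of BINOMDIST(k, n, p, FALSE)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p); plays the role of BINOMDIST(k, n, p, TRUE)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

# Two tosses of a fair coin: P(exactly 1 head) and P(at most 1 head).
print(binom_pmf(1, 2, 0.5), binom_cdf(1, 2, 0.5))
```

The cumulative version is just the sum of the individual probabilities from 0 up to k, which is why the complement law pairs so naturally with it for "at least" questions.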

151

Example 1: Bad Seafood• Consumer Reports (Feb. 1992) found widespread

contamination of seafood in supermarkets in NYC and Chicago.

• 40% of the swordfish pieces for sale had a level of mercury above the maximum allowed by the Food & Drug Administration (FDA).

• Suppose a random sample of 12 swordfish pieces is selected.

• What is the probability that exactly five of the pieces have mercury levels above the FDA maximum?

• What is the probability that at least 10 pieces are contaminated?


152

Answer
• Find P(exactly five of the pieces have mercury levels above the FDA maximum).
• “Trial” is choosing a piece of fish and a “success” = the piece is contaminated.
• p = prob. of success = P(contamination) = 0.4.
• n = 12 independent (why?) pieces.
• X = number of contaminated pieces follows the Binomial distribution with n = 12 and p = 0.4.
• From Excel:

P(X = 5) = BINOMDIST(5, 12, 0.4, FALSE) = 0.227

153

Answer (continued)
• Find P(at least ten pieces are contaminated).
• P(X ≥ 10) = P(X = 10 or 11 or 12) = P(X = 10) + P(X = 11) + P(X = 12)
• From Excel, you have
– P(X = 10) = BINOMDIST(10, 12, 0.4, FALSE) = 0.0025
– P(X = 11) = BINOMDIST(11, 12, 0.4, FALSE) = 0.0003
– P(X = 12) = BINOMDIST(12, 12, 0.4, FALSE) = 0 (to four decimal places)
• P(X ≥ 10) = 0.0025 + 0.0003 + 0 = 0.0028

154

Another Way to Do This

• From the Complement Law, you know that
P(X ≥ 10) = 1 – P(X < 10) = 1 – P(X ≤ 9)
• From Excel, you have that
P(X ≤ 9) = BINOMDIST(9, 12, 0.4, TRUE) = 0.9972.
• Then P(X ≥ 10) = 1 – 0.9972 = 0.0028.
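Both routes to the seafood answer can be reproduced outside Excel. The following sketch recomputes the probabilities with the standard binomial formula; the helper names are illustrative, not from the lecture.

```python
from math import comb

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(binom_pmf(j, n, p) for j in range(k + 1))

n, p = 12, 0.4  # 12 swordfish pieces, 40% contaminated

p_exactly_5 = binom_pmf(5, n, p)

# Route 1: add the individual probabilities for 10, 11, and 12.
p_at_least_10_direct = sum(binom_pmf(k, n, p) for k in (10, 11, 12))

# Route 2: complement law, 1 - P(X <= 9).
p_at_least_10_complement = 1 - binom_cdf(9, n, p)

print(round(p_exactly_5, 3))
print(round(p_at_least_10_direct, 4), round(p_at_least_10_complement, 4))
```

The two routes agree, and the complement route needs only one cumulative probability no matter how many values sit in the upper tail.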

155

Example 2: Murder Trial• As the lawyer for a client accused of murder, you are

looking for ways to establish “reasonable doubt”. The prosecutor's case is based on the forensic evidence that a blood sample from the crime scene matches the DNA of your client. It is known that 2% of the time DNA tests are in error.

• Suppose your client is guilty. If six laboratories in the country are asked to perform a DNA test, what is the probability that at least one of them will make a mistake and conclude that your client is innocent?


156

Answer
• A “trial” is sending the DNA to a lab and a success = lab makes error (finds no match).
• p = prob. of success = P(error) = 0.02.
• n = 6 independent (why?) lab tests.
• X = number of labs that make an error follows the Binomial with n = 6 and p = 0.02.
• P(X ≥ 1) = 1 – P(X < 1) = 1 – P(X = 0)
• From Excel, P(X = 0) = 0.8858.
• P(X ≥ 1) = 1 – 0.8858 = 0.1142
• So there is an 11.42% probability that at least one lab will find no DNA match.
• Question: How many labs would you need to raise this probability to 25%?

157
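One way to explore the closing question is to increase the number of labs until the probability of at least one error reaches the target. The sketch below assumes, as in the answer above, that labs err independently with probability 0.02; the function names are illustrative.

```python
def prob_at_least_one_error(num_labs, error_rate=0.02):
    """P(X >= 1) = 1 - P(X = 0) = 1 - (1 - error_rate)^num_labs."""
    return 1 - (1 - error_rate) ** num_labs

def labs_needed(target, error_rate=0.02):
    """Smallest number of labs for which P(at least one error) >= target."""
    num_labs = 1
    while prob_at_least_one_error(num_labs, error_rate) < target:
        num_labs += 1
    return num_labs

print(round(prob_at_least_one_error(6), 4))   # the six-lab case from the answer
print(labs_needed(0.25))
```

The search works because adding a lab can only increase the chance that at least one of them errs, so the first count that reaches the target is the smallest one.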

Example 3:Multiple-Choice Quiz

• A multiple-choice quiz has 15 questions. Each question has five possible answers, of which only one is correct.

• What is the expected number of correct answers by sheer guesswork?

• What is the standard deviation of the correct answers by sheer guesswork?

• What is the probability that sheer guesswork will yield at least seven correct answers?

158

Answer

• A “trial” is answering a question and a success = a correct answer.
• p = Prob. of success = P(correct answer) = 1/5 = 0.2.
• n = 15 independent (why?) answers.
• X = number of correct answers follows a Binomial with n = 15 and p = 0.2.

159

Answer (continued)

• E(X) = np = 15 (0.2) = 3
• SD(X) = √(n p (1 – p)) = √(15 (0.2) (0.8)) = √2.4 = 1.55
• P(X ≥ 7) = 1 – P(X < 7) = 1 – P(X ≤ 6)
• P(X ≤ 6) = BINOMDIST(6, 15, 0.2, TRUE) = 0.9819
• P(X ≥ 7) = 1 – 0.9819 = 0.0181 (1.81%)


160

Covariance
• When working with two random variables, say X and Y, you are sometimes interested in the degree to which the values of X and Y are correlated, that is, as X increases, to what degree is it likely that Y increases (or decreases)?

• When X and Y are discrete RVs with n possible pairs of values, say (x1, y1), …, (xn, yn), and corresponding probabilities p1, …, pn, then the covariance of X and Y, written COV(X, Y) or σXY, is given by the following formula:

COV(X, Y) = σXY = Σ pi (xi – µX)(yi – µY), where the sum is over i = 1, …, n.

161

Covariance and Correlation
• COV(X,Y) > 0 means that the two variables tend to move in the same direction: if one increases (decreases), then the other increases (decreases).
• COV(X,Y) < 0 means that the two variables tend to move in opposite directions: if one increases (decreases), then the other decreases (increases).
• The value of the covariance is hard to interpret, so the covariance is converted to the following number between −1 and +1 called the correlation of X and Y, written COR(X, Y) or ρXY (that indicates how strongly X and Y are correlated):

COR(X, Y) = ρXY = COV(X, Y) / (σX σY)

Note: Cov and correlation of RVs are different from cov. and correlation of samples and populations.

162

Example: Stocks and Bonds
• Example: Suppose you are considering investing in both a stock and a bond fund, and define the following RVs:
S = annual rate of return on the stock fund
B = annual rate of return on the bond fund

Possible Values (depend on the state of the economy):

Economy     Stock Fund   Bond Fund   Prob.
Recession   −7%          17%         1/3
Normal      12%          7%          1/3
Boom        28%          −3%         1/3

E[S] = 1/3(−0.07) + 1/3(0.12) + 1/3(0.28) = 0.11

E[B] = 1/3(0.17) + 1/3(0.07) + 1/3(−0.03) = 0.07

163

See File Cov_and_Cor.xls
Using E[S] = 0.11 and E[B] = 0.07, you can now compute COV(S, B) using the formula as follows:

COV(S, B) = 1/3(−0.07 − 0.11)(0.17 − 0.07)
+ 1/3(0.12 − 0.11)(0.07 − 0.07)
+ 1/3(0.28 − 0.11)(−0.03 − 0.07)
= −0.0117

Using σ[S] = 0.143 and σ[B] = 0.082, you can now compute COR(S, B) using the formula as follows:

COR(S, B) = COV(S, B) / (σ[S] σ[B]) = −0.0117 / (0.143 (0.082)) ≈ −1

So S and B are almost perfectly negatively correlated: when S returns are up, B returns are down and vice versa.
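The stock-and-bond numbers can be recomputed from the scenario table. This is a sketch under the same three equally likely scenarios; the list layout and variable names are illustrative and are not taken from the Cov_and_Cor.xls file, which is not reproduced here.

```python
from math import sqrt

# (probability, stock return, bond return) for each state of the economy.
scenarios = [
    (1 / 3, -0.07, 0.17),   # recession
    (1 / 3, 0.12, 0.07),    # normal
    (1 / 3, 0.28, -0.03),   # boom
]

# Expected values E[S] and E[B].
mean_s = sum(p * s for p, s, b in scenarios)
mean_b = sum(p * b for p, s, b in scenarios)

# Covariance: sum of p_i * (s_i - E[S]) * (b_i - E[B]).
cov_sb = sum(p * (s - mean_s) * (b - mean_b) for p, s, b in scenarios)

# Standard deviations of S and B.
sd_s = sqrt(sum(p * (s - mean_s) ** 2 for p, s, b in scenarios))
sd_b = sqrt(sum(p * (b - mean_b) ** 2 for p, s, b in scenarios))

# Correlation: covariance divided by the product of the standard deviations.
cor_sb = cov_sb / (sd_s * sd_b)

print(round(mean_s, 2), round(mean_b, 2))
print(round(cov_sb, 4), round(sd_s, 3), round(sd_b, 3), round(cor_sb, 3))
```

The correlation comes out very close to, but not exactly, −1, because the three (S, B) points lie almost but not quite on a straight line.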


164

Continuous Random Variables
• A continuous random variable assumes any value, including decimals and fractions, in intervals on the real line.
• Example: X = the number of minutes a customer waits in line. Possible values: [0, ∞).
– Question: If all values are equally likely, what is P(X = 5.3789)?
– Answer: Prob(X = 5.3789) = 1/∞ = 0 because there are an infinite number of possible values for X.
• Conclusion: For a continuous rv, it is not meaningful to specify the likelihood that the variable is equal to one specific value.

165

Continuous vs. Discrete RVs
• Question: Is there any difference between P(X < a) and P(X ≤ a)?
• Answer: For a continuous rv, the answer is “no” because:

P(X ≤ a) = P(X < a or X = a)
= P(X < a) + P(X = a)
= P(X < a) + 0
= P(X < a)

• Note: The foregoing step that P(X = a) = 0 is not true for a finite discrete rv, and this is one major difference between a finite discrete r.v. and a continuous rv.

166

Using Density Functions
• Solution: Use a probability density function to describe the likelihood that X is in a given interval.

[Figure: graph of a probability density function f(x) over all possible values of X. The total area under the graph is 1, and the area under the graph between a and b is P(a < X < b).]

Possible values where the density function is higher are more likely to occur than where the density function is lower.

To find probabilities, we find areas under the density function.

167

Example: Finding Probabilities

Suppose that the graph of some probability density function f(x) is symmetric around 5.

What is P(X < 5)? 0.5
What is P(X > 5)? 0.5

If P(X < 4) = 0.3, find:
P(X > 6) = 0.3
P(X < 4 or X > 6) = 0.3 + 0.3 = 0.6
P(4 < X < 6) = 1 – 0.6 = 0.4
P(4 < X < 5) = 0.4/2 = 0.2

[Figure: density function symmetric around 5, with 4, 5, and 6 marked on the x-axis]


168

Example: Uniform Distribution

Fact: In the real world, you can never find the density function of a continuous rv, so what can you do?
Ans. Use a density function that mathematicians have created.
Example: Consider a rv X with possible values between 0 and 1. All values are equally likely.

Uniform Distribution: X ~ U[0,1]

[Figure: density f(x) = 1 for x between 0 and 1]

Find P(0.2 < X < 0.5): 0.3 · 1 = 0.3
Find P(X > 0.6): 0.4 · 1 = 0.4
E[X] = 0.5

[Figure: uniform density over an interval of length 2; the height (marked “?”) is 1/2, so that the total area is 1]

169

The Normal Distribution
• Fact: In the real world, you can never find the density function of a continuous rv, so what can you do?
• Answer: Borrow an existing density function that mathematicians have created (like the uniform dist.).
• The Normal distribution is one such density function with many desirable properties.
• The Normal distribution applies to a continuous rv whose possible values can be any real number from –∞ to +∞.
• To write the density function, you must know the:
· Mean µ
· Standard Deviation σ
Usually estimated by computing the average and standard dev. from a sample of values for the rv.

170

The Normal Distribution
The density function f(x) = (1 / (σ √(2π))) e^(–(x – µ)² / (2σ²))

[Figure: graph of f(x) against x, “The Bell Curve”]

This density function is:
• Centered at the mean µ.
• Symmetric about µ.
• “Bell shaped.”
• The std. dev. σ controls the “thickness”: smaller values of σ make the bell part thinner and taller.

Use the Normal when most values of your rv are close to the mean and then become less likely farther from the mean.

171

The Normal Distribution: Effect of the SD

[Figure: Normal curves with different standard deviations.]


172

The Normal Distribution

[Figure: Normal density f(x) plotted against x. The area to the left of µ is 0.5, and the shaded area between a and b equals P(a ≤ Y ≤ b).]

Note: Excel is used to find areas under the Normal density function.

173

Excel Function NORMDIST
• The Excel function NORMDIST is used to find the area under the normal curve to the left of a given value z, that is, if X ~ N(µ, σ), then

P(X ≤ z) = NORMDIST(z, µ, σ, TRUE).
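For readers working outside Excel, the same left-tail area can be sketched in Python. This is only an illustration: the helper name `normdist` is hypothetical, the numbers N(100, 15) are made up for the example, and it assumes Python 3.8+ where `statistics.NormalDist` is available.

```python
from statistics import NormalDist

def normdist(x, mean, sd):
    """Area under the N(mean, sd) density to the left of x.

    Plays the role of Excel's NORMDIST(x, mean, sd, TRUE) in this sketch.
    """
    return NormalDist(mu=mean, sigma=sd).cdf(x)

# Illustrative values only (not from the slides): X ~ N(100, 15).
area_left_of_mean = normdist(100, 100, 15)   # half the area lies left of the mean
area_left_of_115 = normdist(115, 100, 15)    # one standard deviation above the mean
print(area_left_of_mean, round(area_left_of_115, 3))
```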

174

Practice with Excel

Example: If X ~ N(20, 2), find P(X ≤ 23).

P(X ≤ 23) = NORMDIST(23, 20, 2, TRUE) = 0.933

[Figure: Normal curve centered at 20 with the area to the left of z = 23 shaded.]

175

Practice with Excel

Question: What do you do if the area you are interested in is not “all the way to the left”?

Answer: Find a way to compute your area using “area all the way to the left”.

For X ~ N(20, 2):

P(X > 23) = 1 – P(X ≤ 23) = 1 – NORMDIST(23, 20, 2, TRUE) = 0.067

[Figure: Normal curve centered at 20 with the area to the right of z = 23 shaded.]


176

Practice with Excel

If X ~ N(100, 20), find P(80 ≤ X ≤ 120).

Note: This is really the probability that the value of X is within 1 standard deviation of the mean.

177

Answer

If X ~ N(100, 20), then

P(80 ≤ X ≤ 120) = P(X ≤ 120) – P(X ≤ 80)
= NORMDIST(120, 100, 20, TRUE) – NORMDIST(80, 100, 20, TRUE)
= 0.841 – 0.159 = 0.682

Try 2 and 3 std. deviations from the mean and you will discover the empirical rule.

[Figure: Normal curve centered at 100 with the area between 80 and 120 shaded.]
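The suggestion to try 2 and 3 standard deviations can be carried out with a short Python sketch. It uses `statistics.NormalDist` as a stand-in for NORMDIST; the function name `prob_within` is an assumption made for this illustration.

```python
from statistics import NormalDist

# X ~ N(100, 20), as in the practice problem.
mean, sd = 100, 20
X = NormalDist(mu=mean, sigma=sd)

def prob_within(k):
    """P(mean - k*sd <= X <= mean + k*sd), as a difference of two left-tail areas."""
    return X.cdf(mean + k * sd) - X.cdf(mean - k * sd)

for k in (1, 2, 3):
    print(f"within {k} SD: {prob_within(k):.3f}")
```

The three probabilities come out close to 68%, 95% and nearly 100%, whatever mean and standard deviation are used.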

178

The Normal Distribution and the Empirical Rule

179

Interpreting the Std. Dev.

When your data are bell-shaped (according to a histogram), you can interpret the pop. / sample standard deviation as follows: σ is a number so that approximately 68% of your data are within one standard deviation of the mean µ.

Example: µ ± σ = 72 ± 2

The Empirical Rule

Approx. 68% of the values are in µ ± 1σ
Approx. 95% of the values are in µ ± 2σ
Approx. 100% of the values are in µ ± 3σ


180

A Historical Note

Question: Before Excel and NORMDIST, how did one find areas under the Normal density function?

Answer: Using a table in which you could look up the area, but…

Question: It is impossible to create a separate table for every combination of µ and σ, so what can you do?

Answer: Create a single table for a rv Z ~ N(0, 1), which is called a standard normal rv.

Fact: Any probability question about a rv X ~ N(µ, σ) can be stated as an equivalent question about a standard normal rv, as you will now see.

181

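The Standard Normal

Example: If X ~ N(µ, σ) then P(X ≤ s) = NORMDIST(s, µ, σ, TRUE).

But

$$P(X \le s) = P\left(\frac{X-\mu}{\sigma} \le \frac{s-\mu}{\sigma}\right) = P(Z \le t), \quad \text{where } Z = \frac{X-\mu}{\sigma} \sim N(0, 1) \text{ and } t = \frac{s-\mu}{\sigma}.$$

Thus, P(X ≤ s) = P(Z ≤ t) = NORMDIST(t, 0, 1, TRUE).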

182

The Standard Normal Distribution

[Figure: a Normal density with mean µ and SD σ shown together with the Standard Normal with mean 0 and SD 1; the standard normal axis is marked at –2, –1, 0, +1, +2.]

183

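The Standard Normal

Example: If X ~ N(20, 2), find P(X ≤ 23).

Answer 1: P(X ≤ 23) = NORMDIST(23, 20, 2, TRUE) = 0.933

Answer 2: P(X ≤ 23) = P((X – 20)/2 ≤ (23 – 20)/2) = P(Z ≤ 1.5) = NORMDIST(1.5, 0, 1, TRUE) = 0.933

[Figure: standard normal curve centered at 0 with the area to the left of z = 1.5 shaded.]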
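The equivalence of the two answers can be checked numerically. The sketch below is an illustration in Python rather than Excel; the variable names are chosen for this example only.

```python
from statistics import NormalDist

mu, sigma, s = 20, 2, 23

# Answer 1: work directly with X ~ N(20, 2).
direct = NormalDist(mu=mu, sigma=sigma).cdf(s)

# Answer 2: standardize first, then use the standard normal Z ~ N(0, 1).
t = (s - mu) / sigma
standardized = NormalDist(mu=0, sigma=1).cdf(t)

print(t, round(direct, 3), round(standardized, 3))
```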


184

Excel Function NORMINV
• For solving some problems, you know the probability, p, and want to find the value of z so that the area to the left of z is p. That is, for X ~ N(µ, σ), find z so that

P(X ≤ z) = p

Answer: z = NORMINV(p, µ, σ)

[Figure: Normal curve with area p shaded to the left of the unknown value z.]
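NORMINV undoes NORMDIST: feeding a left-tail area back in returns the original cutoff. A minimal Python sketch of that round trip follows; the helper name `norminv` is hypothetical and `statistics.NormalDist.inv_cdf` stands in for the Excel function.

```python
from statistics import NormalDist

def norminv(p, mean, sd):
    """Value z with P(X <= z) = p for X ~ N(mean, sd); mirrors NORMINV(p, mean, sd)."""
    return NormalDist(mu=mean, sigma=sd).inv_cdf(p)

# Round trip with X ~ N(20, 2): area left of 23, then back to the cutoff.
p = NormalDist(mu=20, sigma=2).cdf(23)
z = norminv(p, 20, 2)
print(round(p, 3), round(z, 6))
```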

185

Practice with NORMINV

[Figure: Normal curve centered at 0 with area 0.95 and an unknown value k marked on the axis.]

186

Example

If X ~ N(50, 8), find j so that P(50 – j ≤ X ≤ 50 + j) = 0.95

[Figure: Normal curve centered at 50 with area 0.95 between 50 – j and 50 + j and area 0.025 in each tail.]
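One way to approach this example, sketched in Python under the assumption that the picture shows 0.025 in each tail: the area to the left of 50 + j is then 0.95 + 0.025 = 0.975, so 50 + j is the 0.975 cutoff. The slide does not state the answer; the value computed here is this sketch's own result.

```python
from statistics import NormalDist

X = NormalDist(mu=50, sigma=8)

# 0.025 in each tail means the area to the left of 50 + j is 0.975.
upper = X.inv_cdf(0.975)
j = upper - 50

# Check: the central area between 50 - j and 50 + j.
central_area = X.cdf(50 + j) - X.cdf(50 - j)
print(round(j, 2), round(central_area, 3))
```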


187

Stat Camp for the MBA Program

Daniel Solow

Lecture 4
The Normal Distribution and the Central Limit Theorem

188

Example 1: Dear Abby

You wrote that a woman is pregnant for 266 days. Who said so? I carried my baby for ten months and five days, and there is no doubt about it because I know the exact date my baby was conceived. My husband is in the Navy and it couldn’t possibly have been any other time because I saw him only once for an hour, and I didn’t see him again until the day before the baby was born.

I don’t drink or run around, and there is no way this baby isn’t his, so please print a retraction about the 266-day carrying time because otherwise I am in a lot of trouble.

San Diego Reader

189

Dear Abby

Step 1: Identify an appropriate random variable.

Y = number of days of pregnancy
What are the possible values for Y? About 230 – 290?
What is the density function for Y? ???

[Figure: probability density plotted against days, with the axis marked … 255, 260, 265, 270, 275 …]

Idea: Approximate the density of Y with a normal!

(cont.)

190

Dear Abby
• Question: If you are going to use a normal approximation, what information do you need?
• Answer: The mean and standard deviation.
• Fact: According to the data from generations of births, pregnancies have a (sample) mean of 266 and (sample) standard deviation of 16 days, so Y ~ N(µ = 266, σ = 16).
• Question: What are the possible values for Y? –∞ to +∞
• Question: How can the number of days of pregnancy be < 230?
• Answer: Using the normal distribution, you have that P(Y < 230) = NORMDIST(230, 266, 16, TRUE) ≈ 0.01.
• Thus, when using the normal approximation, there is only about a 1% chance that a pregnancy lasts less than 230 days.

Models are NOT the real world but hopefully good approximations!


191

Dear Abby

• Step 2: State what you are looking for as a probability question in terms of the rv.

You want to find P(Y ≥ 10 mo. and 5 days) = P(Y ≥ 310).

• Step 3: Use the probability distribution of the rv to answer the probability question.

P(Y ≥ 310) = 1 – P(Y < 310)
= 1 – NORMDIST(310, 266, 16, TRUE)
= 0.00298

Was she telling the truth? Possibly, but highly unlikely.
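The Dear Abby calculation can be reproduced outside Excel. The following Python sketch uses `statistics.NormalDist` in place of NORMDIST; the variable names are illustrative.

```python
from statistics import NormalDist

# Model from the slides: Y ~ N(266, 16) days of pregnancy.
Y = NormalDist(mu=266, sigma=16)

# Right-tail area: 1 minus the area to the left of 310 days.
p_at_least_310 = 1 - Y.cdf(310)

# Left-tail check of the model: pregnancies shorter than 230 days.
p_under_230 = Y.cdf(230)

print(f"P(Y >= 310) = {p_at_least_310:.5f}")
print(f"P(Y < 230)  = {p_under_230:.4f}")
```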

192

Example 2: Problem of GoodTire

GoodTire has a new tire for which, in order to be competitive, they want to offer a warranty of 30,000 miles. Before doing so, the company wants to know what fraction of tires they can expect to be returned under the warranty.

193

The Problem of GoodTire

Step 1: Identify an appropriate random variable.

• For GoodTire, let X = number of miles such a tire will last.

What are the possible values for X? 0 – 90000?
What is the density function for X? It is unknown, so estimate it using a model, as follows:

From statistical analysis of a random sample, GoodTire believes the mileage follows approximately a normal distribution with a mean of 40,000 miles and a standard deviation of 10,000 miles, so assume that

X ~ N(µ = 40000, σ = 10000) with possible values: –∞ to +∞

(cont.)

194

The Problem of GoodTire

Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.

• GoodTire wants to know the

Fraction of tires returned = Likelihood a tire fails = P{X ≤ 30000} = ?


195

The Problem of GoodTire

Step 3: Use the probability distribution of the random variable to answer the probability question.

• For GoodTire, with X ~ N(40000, 10000), you have

P{X ≤ 30000} = NORMDIST(30000, 40000, 10000, TRUE) = 0.1587

[Figure: Normal curve centered at 40000 with the area to the left of 30000 shaded.]

196

The Problem of GoodTire

Question: The CEO finds that a 16% return rate is too high. What warranty mileage s should they offer to get a 5% return rate?

Step 2: Probability Question: What should s be so that P{X ≤ s} = 0.05?

[Figure: Normal curve centered at 40000 with area 0.05 to the left of s = ?]

Step 3: s = NORMINV(0.05, 40000, 10000) = 23551.47

Fact: While you cannot control the value of a rv, you can control the likelihood of certain events occurring with that rv.
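Both GoodTire questions (the return rate at 30,000 miles, and the mileage giving a 5% return rate) can be sketched in Python. `NormalDist.cdf` and `NormalDist.inv_cdf` stand in for NORMDIST and NORMINV; the variable names are illustrative, and the last decimal may differ slightly from the Excel figure.

```python
from statistics import NormalDist

# Model from the slides: tire life X ~ N(40000, 10000) miles.
X = NormalDist(mu=40000, sigma=10000)

# Fraction of tires expected back under a 30,000-mile warranty.
return_rate = X.cdf(30000)

# Warranty mileage s that gives a 5% return rate: P(X <= s) = 0.05.
warranty_miles = X.inv_cdf(0.05)

print(f"return rate at 30,000 miles: {return_rate:.4f}")
print(f"mileage for a 5% return rate: {warranty_miles:.2f}")
```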

197

Example 3: Marketing Projections

• From historical data over a number of years, a firm knows that its annual sales average $25 million. For planning purposes, the CEO wants to know the likelihood that sales next year will:
  – Exceed $30 million.
  – Be within $1.5 million of the average.

• The CEO is willing to issue bonuses if sales are “sufficiently” high. What level should be set so that bonuses are given at most 20% of the time?

198

Marketing Projections

Step 1: Identify an appropriate random variable.

• Let Y = next year’s sales in $ millions.

What are the possible values for Y? 0 – 50?
What is the density function for Y? ???

From statistical analysis over a number of years, they believe that annual sales follow approximately a normal distribution with a mean of $25 mil. and a standard deviation of $3 mil., so assume that

Y ~ N(µ = 25, σ = 3)


199

Marketing Projections

Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.

• You want to know:
  • P(sales exceed $30 mil.) = P(Y ≥ 30).
  • P(sales are within $1.5 mil. of $25 mil.) = P(23.5 ≤ Y ≤ 26.5).
• What should be the value of sales (s) so that P(giving a bonus) = 0.20? That is, P(Y ≥ s) = 0.20?

200

Marketing Projections

Step 3: Use the probability distribution of the random variable to answer the probability question.

• From Excel, using µ = 25 and σ = 3:

• P(Y ≥ 30) = 1 – NORMDIST(30, 25, 3, TRUE) = 0.048.

• P(23.5 ≤ Y ≤ 26.5) = NORMDIST(26.5, 25, 3, TRUE) – NORMDIST(23.5, 25, 3, TRUE) = 0.383.

• P(Y ≥ s) = 0.20 means P(Y ≤ s) = 0.80, so s = NORMINV(0.8, 25, 3) = 27.525.
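The three marketing answers use three different patterns: a right tail, a middle interval, and an inverse lookup. A Python sketch of all three follows, with `statistics.NormalDist` standing in for the Excel functions and variable names chosen for illustration.

```python
from statistics import NormalDist

# Model from the slides: next year's sales Y ~ N(25, 3), in $ millions.
Y = NormalDist(mu=25, sigma=3)

# Right tail: sales exceed $30 million.
p_exceed_30 = 1 - Y.cdf(30)

# Middle interval: sales within $1.5 million of the mean.
p_within = Y.cdf(26.5) - Y.cdf(23.5)

# Bonus level s with P(Y >= s) = 0.20, i.e. area 0.80 to the left of s.
bonus_level = Y.inv_cdf(0.80)

print(round(p_exceed_30, 3), round(p_within, 3), round(bonus_level, 3))
```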

201

Example 4: DUI Test
• In many states, a driver is legally drunk if the blood alcohol concentration, as determined by a breath analyzer, is 0.10% or higher.
• Suppose that a driver has a true blood alcohol concentration of 0.095%. With the breath analyzer test, what is the probability that the person will be (incorrectly) booked on a DUI charge?

Step 1: Identify an appropriate random variable.
Let Y = the measurement of the analyzer as a %.
Question: What are the possible values for Y? 0 – 0.3?

(cont.)

202

DUI Test

Step 1 (continued).

Question: What is the density function for Y?

Answer: We do not know, but data indicate that Y follows approximately a normal distribution with mean equal to the person’s true alcohol level and standard deviation equal to 0.004%, so…

Y ~ N(µ, σ = 0.004), where µ = the person’s true blood alcohol level (%)


203

DUI Test

Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.

• You want to know the probability that a person with µ = 0.095 will be (incorrectly) booked on a DUI charge:

P(being booked on a DUI) = P(Y ≥ 0.10)

204

DUI Test

Step 3: Use the probability distribution of the random variable to answer the probability question.

• From Excel (using µ = 0.095 and σ = 0.004):

P(Y ≥ 0.10) = 1 – NORMDIST(0.10, 0.095, 0.004, TRUE) = 0.1056.

• There is about an 11% chance that such a person will be incorrectly charged with a DUI.
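As a cross-check of the DUI calculation, here is a minimal Python sketch of the same right-tail probability. The variable names are illustrative and `statistics.NormalDist` is used in place of NORMDIST.

```python
from statistics import NormalDist

# Analyzer reading Y ~ N(true level, 0.004); true level is 0.095%.
Y = NormalDist(mu=0.095, sigma=0.004)

# Probability the reading reaches the legal limit of 0.10%.
p_booked = 1 - Y.cdf(0.10)
print(f"P(Y >= 0.10) = {p_booked:.4f}")
```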

205

An Insurance Problem

GoodHands is considering insuring employees of GoodTire. What annual premium should the company charge to be sure that there is a likelihood of no more than 1% of losing money on each customer?

This is an example of decision making under uncertainty: you have to make a decision today (how much should the annual premium be?) facing an uncertain future.

Question: Why is the future uncertain?

206

Solving the Insurance Problem

Step 1: Identify an appropriate random variable.
• Let X = the $ claimed by a customer in one year.
• What are the possible values for X? [0, 100000 (?)]
• Is X continuous or discrete? Discrete.
• What is the density function for X? It is unknown, so borrow one.

From statistical analysis of data, the annual claim for these people follows approximately a normal distribution with a mean of $2500 and a standard deviation of $1000, so:

X ~ N(µ = 2500, σ = 1000)

• Note: It can be OK to approximate a discrete rv with a continuous distribution.


207

An Insurance Problem

Step 2: State what you are looking for in terms of a probability question pertaining to the rv.

• For GoodHands, what should the premium s be so that the likelihood of losing money is no more than 1%?

Question: When do you lose money on a customer? When X > s.

Probability Question: What should the premium s be so that P(X > s) = 0.01?

[Figure: density of X ~ N(2500, 1000) centered at 2500, with the area to the right of s shaded.]

208

An Insurance Problem

Step 3: Use the probability distribution of the random variable to answer the probability question.

With X ~ N(2500, 1000):

s = NORMINV(0.99, 2500, 1000)
= $4826.3478 (solution to the model!)
= $4826.35 (solution for the real world!)

Fact: While you cannot control the value of a rv (such as the claim of a person), you can control the likelihood of certain events occurring with that rv (such as the likelihood of such a claim exceeding the premium).

209

The Insurance Problem (cont.)

Question: GoodHands will insure all 100 employees of GoodTire. What premium should GoodHands charge per employee so that the likelihood of losing money on the average of all these claims is 1%?

Step 1: Identify appropriate random variables.

• For this problem, you now have the following rvs:

$X_i$ = the $ / annual claim of customer i ~ N(µ = 2500, σ = 1000) (i = 1,…,100)

Prob. Question: What should be the premium, s, so that, instead of P(X > s) = 0.01, you have $P(\bar{X} > s) = 0.01$? Here $\bar{X} = (X_1 + \cdots + X_{100})/100$ is a new random var.

Fact: To answer this prob. question about $\bar{X}$ you need to know the density function of $\bar{X}$. ???

Idea: When the rv you are interested in is the AVERAGE of other rvs, try…

210

The Central Limit Theorem

The Central Limit Theorem provides an approximate density function when the rv you are interested in is the average of n other rvs, say, X1, X2, …, Xn, that are:

(1) Independent (knowing the value of one rv tells you nothing about the values of the other rvs).

(2) Identically distributed (have the same density function with mean µ and standard deviation σ),

then, for “large” n,

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \quad \text{(approx.)}$$


211

The Insurance Problem (cont.)

For the insurance problem, you have

Xi = annual $ claimed by person i (i = 1, …, 100)

(1) Are X1, X2, …, X100 independent random variables?

Yes, because the amount claimed by one person has no effect on the amount claimed by another person.

(2) Are X1, X2, …, X100 identically distributed?

Yes, because each Xi ~ N(µ = 2500, σ = 1000).

Therefore, by the CLT, $\bar{X}$ is approximately Normal with mean 2500 and standard deviation $1000/\sqrt{100} = 100$.

212

An Insurance Problem

Step 2: State what you are looking for in terms of a probability question pertaining to the random variable.

• For GoodHands, what should the premium s be so that the probability that the average of the 100 claims exceeds s is 0.01?

Probability Question: What should s be so that $P(\bar{X} > s) = 0.01$, where $\bar{X}$ ~ N(2500, 100)?

[Figure: Normal curve N(2500, 100) centered at 2500.]

213

An Insurance Problem (cont.)

Probability Question: What should the premium s be so that $P(\bar{X} > s) = 0.01$?

Step 3: Use the probability distribution of the random variable to answer the probability question.

s = NORMINV(0.99, 2500, 100)
= $2732.63
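The effect of averaging over 100 customers can be seen by computing both premiums side by side. This Python sketch assumes the CLT result that the average claim has standard deviation σ/√n; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 2500, 1000, 100

# One customer: claim X ~ N(2500, 1000); want P(X > s) = 0.01.
premium_single = NormalDist(mu=mu, sigma=sigma).inv_cdf(0.99)

# Average of n customers: approx. N(2500, 1000 / sqrt(100)) by the CLT.
sd_of_average = sigma / sqrt(n)
premium_group = NormalDist(mu=mu, sigma=sd_of_average).inv_cdf(0.99)

print(f"premium, single customer:      {premium_single:.2f}")
print(f"premium, average of {n} claims: {premium_group:.2f}")
```

Pooling 100 independent customers shrinks the spread of the average claim by a factor of 10, which is why the per-employee premium falls so sharply.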

214

Another Example of the CLT
• In modeling the performance of a team with 5 people, consider the following five rvs:

Pi = performance contribution of person i (i = 1,…,5)

Possible values: [0, 1] (continuous)
Density function: U[0,1]
E[Pi] = µ = 0.5, STDEV[Pi] = σ = $\sqrt{1/12}$ ≈ 0.29

However, what is of interest is the team performance, so let…


215

Another Example of the CLT

T = performance of the whole team = (P1 + P2 + P3 + P4 + P5) / 5

Possible values: [0, 1] (continuous)
Density function: ???

You cannot find the true density function, so borrow one. Because the rv T is the average of other rvs, think of using the Central Limit Theorem to approximate the density function of T.

216

The Team Problem

For the team problem, you have

Pi = performance of person i (i = 1, 2, 3, 4, 5)

(1) Are P1, P2, P3, P4, P5 independent random variables?

Yes, assuming that the performance of a person says nothing about the performance of another person.

(2) Are P1, P2, P3, P4, P5 identically distributed?

Yes, because each Pi ~ U[0, 1] with mean µ = 0.5 and std. dev. σ = 0.29.

Therefore, by the CLT, T is approximately Normal with mean 0.5 and std. dev. $0.29/\sqrt{5} \approx 0.13$.

217

The Team Problem

Question: What is the probability that the team performance is at least 0.75?

With T ~ N(0.5, 0.13):

P(T ≥ 0.75) = 1 – NORMDIST(0.75, 0.5, 0.13, TRUE) = 0.027

[Figure: Normal curve centered at 0.5 with the area to the right of 0.75 shaded.]
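The team calculation can be traced step by step in Python. This sketch relies on the standard fact that a U[0, 1] rv has standard deviation √(1/12); the variable names are illustrative, and with only n = 5 the normal approximation is rough.

```python
from math import sqrt
from statistics import NormalDist

n = 5
mu = 0.5                 # mean of U[0, 1]
sigma = sqrt(1 / 12)     # std. dev. of U[0, 1], about 0.29

# CLT: the team average T is approx. N(mu, sigma / sqrt(n)).
sd_team = sigma / sqrt(n)

# Using the rounded value 0.13, as on the slide.
p_rounded = 1 - NormalDist(mu=mu, sigma=0.13).cdf(0.75)

# Using the unrounded standard deviation.
p_unrounded = 1 - NormalDist(mu=mu, sigma=sd_team).cdf(0.75)

print(round(sigma, 2), round(sd_team, 2), round(p_rounded, 3), round(p_unrounded, 3))
```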

218

Working with a Population

Consider a population of N items in which item i has a number, Xi, associated with it and let
X = the value of an item to be selected randomly from the population.

Possible values of X: X1, X2, …, XN (discrete)

Density function: 1/N, 1/N, …, 1/N

E[X] = (X1 + X2 + … + XN) / N = the population average µ.

STDEV[X] = the population standard deviation σ.

Note: From here on, a sample of size n from a population should always be thought of as a random variable.


219

The Average of a Sample

Now suppose you are going to record the numbers X1, X2,…, Xn taken from a sample of size n from a population and then compute:

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

All possible values: the (finite) list of averages of every group of size n in the population (discrete).

Groups of size n: G1, G2, G3, …
$\bar{X}$ for the group: A1, A2, A3, …

Density function: All equally likely.

E[$\bar{X}$] = the population average µ.

STDEV[$\bar{X}$] = the population standard deviation σ divided by $\sqrt{n}$.

Fact: We cannot use the density function because we cannot list all of the possible values, so…

The Average of a Sample

220

Solution: Because $\bar{X}$ is the average of rvs, think of using the CLT which, if applicable, results in the following density function for $\bar{X}$:

The rvs X1, X2,…, Xn are iid from the same population with mean = µ and std. dev. = σ, so

$$\bar{X} \sim N\left(\mu, \frac{\sigma}{\sqrt{n}}\right) \quad \text{(approx.)}$$

Possible Values: (–∞, +∞)

Now you can use the Normal Distribution to answer your probability question about $\bar{X}$.

221

How Large is Large Enough?
• For symmetric but outlier-prone data, n = 15 samples should be enough to use the normal approximation.
• For mild skewness, n = 30 should generally be sufficient to make the normal approximation appropriate.
• For severe skewness, n should be at least 100 to use the normal approximation.
• Generally speaking, the larger n is, the better the normal approximation is.

222

A Final Example of the CLT
• Historical data collected at a paper mill show that 40% of sheet breaks are due to water drops, resulting from the condensation of steam.
• Suppose that the causes of the next 100 sheet breaks are monitored and that the sheet breaks are independent of one another.
• Find the expected value and the standard deviation of the number of sheet breaks that will be caused by water drops.
• What is the probability that at least 35 of the breaks will be due to water drops?


223

Exact Answer

• Success = break due to water drops
• P(success) = p = 0.4
• X = number of breaks due to water drops
• X is Binomial with n = 100 and p = 0.4
• E(X) = np = (100)(0.4) = 40
• SD(X) = $\sqrt{n p (1 - p)} = \sqrt{(100)(0.4)(0.6)} = \sqrt{24} = 4.9$
• From Excel: P(X ≥ 35) = 1 – P(X < 35) = 1 – P(X ≤ 34)
  = 1 – BINOMDIST(34, 100, 0.4, TRUE)
  = 0.8697

224

Normal Approx. to Binomial

For this problem, let p = P(success) = 0.4, and let Xi = 1 if break i is due to water drops and Xi = 0 otherwise (i = 1, …, 100).

In this problem, you are interested in the rv
X = number of successes in 100 trials = X1 + X2 + … + X100

To find P(X ≥ 35) = P(X / 100 ≥ 35 / 100), you need to know the probability distribution of

X / 100 = (X1 + X2 + … + X100) / 100,

which, by the CLT, is approximately normal, so…

225

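Normal Approx. to Binomial

Each Xi ~ Binomial(1, p = 0.4), so

E[Xi] = µ = p = 0.4 and STDEV[Xi] = σ = $\sqrt{p(1-p)} = \sqrt{0.24} \approx 0.49$

Assuming that
• The Xi are pairwise independent and
• n = 100 is large enough (np > 5 and n(1 – p) > 5),

then by the CLT, the random variable

X / 100 ~ N(µ, σ/$\sqrt{n}$) = N(0.4, 0.049) (approx.)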

226

Normal Approx. to Binomial

Then, for X = X1 + … + X100:

P(X ≥ 35) = P(X / 100 ≥ 35 / 100)
= 1 – NORMDIST(0.35, 0.4, 0.049, TRUE)
= 0.85.

(The exact answer was 0.87.)
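The exact binomial probability and its normal approximation can be compared directly. The Python sketch below sums the binomial probabilities with `math.comb` in place of BINOMDIST; the helper name `binom_cdf` is hypothetical, and no continuity correction is applied, matching the slides.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p = 100, 0.4

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), summed term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

# Exact: P(X >= 35) = 1 - P(X <= 34).
exact = 1 - binom_cdf(34, n, p)

# CLT: the proportion X/n is approx. N(p, sqrt(p(1-p)/n)).
sd_proportion = sqrt(p * (1 - p) / n)
approx = 1 - NormalDist(mu=p, sigma=sd_proportion).cdf(35 / n)

print(f"exact = {exact:.4f}, normal approx = {approx:.4f}")
```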


227

Review of Basic Math

A function y = f(x) describes a relationship between the two quantitative variables x and y.

You can represent a function visually as follows:

• y = f(x) = –x + 2 (a linear relationship)

• y = f(x) = $x^2 - 2x + 1$ (a nonlinear relationship)

[Figure: graphs of the two functions, y against x.]

228

Review of Functions

You can also think of a function f as transforming an input x into an output y, as follows:

x → f → f(x) = y

Note: A function f can have many input values, instead of just one.

229

Review of Linear Equations

A linear equation, y = mx + b, provides a relationship between the two variables, x and y, in which:

• b = the y-intercept = the value of y when x = 0.
• m = the slope of the line = the change in y per unit of increase in x.
  • m > 0: as x increases, y increases.
  • m = 0: as x increases, y remains the same.
  • m < 0: as x increases, y decreases.

[Figure: the line y = mx + b with intercept b, rising by m as x increases to x + 1; a second graph shows lines with m > 0, m = 0 and m < 0.]

230

An Example of a Line

If
y = the thousands of bushels of wheat
x = the number of inches of rain

then, for the line y = 80x + 71,

• b = 71 means that there are 71,000 bushels of wheat when there is no rain.

• m = 80 means that each extra inch of rain results in 80,000 more bushels of wheat.


231

A Different Equation for a Line

Sometimes a line is written in the form: a1x1 + a2x2 = c

Assuming that a2 ≠ 0, you can solve for x2:

x2 = – (a1 / a2) x1 + (c / a2),

which has the form y = m x + b.

232

Graphing a Line

To draw the graph of the line a1x1 + a2x2 = b:
• Find two different points on the line (usually by setting x1 = 0 and finding x2 and then setting x2 = 0 and finding x1).
• Plot these two points on a graph.
• Draw the straight line through those two points.

233

Example of Graphing a Line

The line: 2x1 + x2 = 230

When x1 = 0, x2 = 230

When x2 = 0, x1 = 115

Note: Any point on the line gives a value for x1 and a value for x2 that satisfies 2x1 + x2 = 230.

[Figure: the line 2x1 + x2 = 230 drawn through (0, 230) and (115, 0), with both the x1 and x2 axes marked at 100, 200, 300.]

234

Solving Two Linear Equations
• Objective: Solve the following two equations for x1 and x2:
  2x1 + x2 = 230 (a)
  x1 + 2x2 = 250 (b)

• Solution Procedure:
  – Solve (a) for x2: x2 = 230 – 2x1 (c)
  – Substitute x2 = 230 – 2x1 in (b): x1 + 2(230 – 2x1) = –3x1 + 460 = 250 (d)
  – Solve (d) for x1: x1 = 70
  – Substitute x1 = 70 in (c): x2 = 230 – 2x1 = 90.


235

Another Approach

• Objective: Solve the following for x1 and x2:
  (a) 2x1 + x2 = 230
  (b) x1 + 2x2 = 250

• Alternative Procedure:
  – Multiply (a) through by 2: (c) 4x1 + 2x2 = 460
  – Subtract (b) from (c): (d) 3x1 = 210
  – Solve (d) for x1: x1 = 70
  – Substitute x1 = 70 in (a) and solve for x2: x2 = 230 – 2x1 = 90

• Note: There are computer packages for solving n linear equations in n unknowns.
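As a small illustration of what such a computer package does for the 2-by-2 case, here is a Python sketch using Cramer's rule. This is a different technique from the substitution and elimination steps on the slides, and the function name `solve_2x2` is hypothetical.

```python
def solve_2x2(a1, b1, c1, a2, b2, c2):
    """Solve a1*x1 + b1*x2 = c1 and a2*x1 + b2*x2 = c2 by Cramer's rule."""
    det = a1 * b2 - a2 * b1
    if det == 0:
        raise ValueError("no unique solution: the lines are parallel or identical")
    x1 = (c1 * b2 - c2 * b1) / det
    x2 = (a1 * c2 - a2 * c1) / det
    return x1, x2

# (a) 2*x1 + x2 = 230 and (b) x1 + 2*x2 = 250
x1, x2 = solve_2x2(2, 1, 230, 1, 2, 250)
print(x1, x2)
```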

236

Exponentials
• An exponent is the power to which a number (called the base) is raised.
• Example: $2^5$ (base = 2; exponent = 5)
• Question: How much will $1000 be worth after 5 years at 6% compound interest?

Answer: Total = f(P, r, n) = $P(1 + r)^n$
= $1000 (1 + 0.06)^5$ = 1338.23

237

Properties of Exponents
• Laws of Exponents:
  – $x^{a+b} = x^{b+a} = x^a x^b$ (example: $2^{3+2} = 2^3 \cdot 2^2$)
  – $(x^a)^b = (x^b)^a = x^{ab}$ (example: $(2^3)^2 = 2^6$)
  – $x^{-a} = 1 / x^a$ (example: $2^{-3} = 1 / 2^3 = 1 / 8$)
  – $x^0 = 1$

• Exponential Functions Increase and Decrease Rapidly:

238

Scientific Notation
• Scientific Notation: $a \times 10^b$ (also written as a E ±b) means move the decimal point of a:
  – b positions to the right, if b > 0.
  – |b| positions to the left, if b < 0.

• Example: $4.000 \times 10^3$ = 4.000 E+3 = 4000.0
• Example: $4 \times 10^{-3}$ = 4 E–3 = 0.004


239

Logarithms
• The log base b of x [written $\log_b(x)$] is the power to which you must raise b to get x.
• Examples: $\log_{10}(100) = 2$, $\log_2(32) = 5$
• Logs are only defined for positive numbers.
• If the base is omitted, the default is 10.
• The base e = 2.718… is used in some financial applications (such as continuous compounding), in which case, $\log_e(x)$ is written as ln(x) (the “natural log” of x).

240

Laws of Logarithms
• Logs convert products to sums, that is, $\log_b(xy) = \log_b(x) + \log_b(y)$.
  – Ex: $\log_2(64) = \log_2(4 \cdot 16) = \log_2(4) + \log_2(16) = 2 + 4 = 6$

• $\log_b(x / y) = \log_b(x) - \log_b(y)$
  – Ex: $\log_{10}(1000 / 100) = \log_{10}(1000) - \log_{10}(100) = 3 - 2 = 1$

• Logs bring down exponents, that is, $\log_b(x^y) = y \log_b(x)$.
  – Example: $\log_2(4^5) = 5 \log_2(4) = 5(2) = 10$

• Logs undo exponentiation, that is, $\log_b(b^y) = y \log_b(b) = y$.
  – Example: $\log_2(2^5) = 5$

• $\log_a(x) = k \log_b(x)$, where $k = \log_a(b)$
  – Example: $\log_2(x) = 3.322 \log_{10}(x)$

241

Problem Solving with Logs
• Question: How many years will it take to double an investment at i % interest compounded annually?

• Answer: Let
  P = the initial investment
  r = interest rate as a fraction = i / 100
  n = the number of years of compounding

Then, after n years, you will have $P(1 + r)^n$.

242

Problem Solving with Logs
• Answer (continued):

Thus, you want to find n so that $P(1 + r)^n = 2P$, that is,

$(1 + r)^n = 2$ (a)

To solve (a) for n, take the log of both sides to bring the exponent n down:

$\log[(1 + r)^n] = \log(2)$
$n \log(1 + r) = \log(2)$
$n = \log(2) / \log(1 + r)$

• Example: At 6% (r = 0.06), it will take n = log(2) / log(1.06) = 0.3010 / 0.0253 = 11.9 years.

Qn: Log base what?
Ans: Log base 10 (but any base will work).
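The doubling-time formula, and the remark that any base works, can be checked with a few lines of Python. The function name `years_to_double` is an assumption made for this sketch.

```python
from math import log, log10

def years_to_double(r):
    """Years n with (1 + r)**n = 2, i.e. n = log(2) / log(1 + r)."""
    return log10(2) / log10(1 + r)

n_base10 = years_to_double(0.06)

# Same calculation with natural logs: the base cancels in the ratio.
n_natural = log(2) / log(1.06)

# Check: growing $1 for that many years should give $2.
growth = 1.06 ** n_base10

print(round(n_base10, 1), round(n_natural, 1), round(growth, 6))
```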