Statistics & Data Analysis
Course Number B01.1305
Course Section 60
Meeting Time Monday 6-9:30 pm
CLASS #1
Professor S. D. Balkin -- May 20, 2002 2
Class #1 Outline
Introduction to the instructor
Introduction to the class
• Review of syllabus
• Introduction to statistics
• Class Goals
Types of data
Graphical and numerical methods for univariate series
Minitab Tutorial
Professor Balkin’s Info
Ph.D. in Business Administration, Penn State
Masters in Statistics, Penn State
Mathematics/Economics and Music, Lafayette College
Employment
• Pfizer Inc.
– Management Science Group; Sept. 2001 – current
• Ernst & Young
– Quantitative Economics and Statistics Group; June 1999 – August 2001
What is Statistics?
STATISTICS: A body of principles and methods for extracting useful information from data, for assessing the reliability of that information, for measuring and managing risk, and for making decisions in the face of uncertainty.
POPULATION: set of measurements corresponding to the entire collection of units
SAMPLE: set of measurements that are collected from a population
OBJECTIVES:
• To make inferences about a population from a sample, including the extent of uncertainty
• To design the data collection process to facilitate drawing valid inferences
Reasons for Sampling
Typically due to the prohibitive cost of contacting millions of people or performing costly experiments
• Election polls query about 2,000 voters to make inferences regarding how all voters cast their ballots
Sometimes the sampling process is destructive
• Sampling wine quality
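A small simulation can make the polling example concrete. The sketch below assumes a hypothetical population in which 52% of voters favor a candidate (a made-up figure, not from the slides) and polls 2,000 of them. The names `true_support` and `poll` are illustrative only, and the margin-of-error formula is the usual large-sample approximation for a proportion.

```python
import math
import random

random.seed(1)  # fixed seed so the sketch is repeatable

true_support = 0.52   # hypothetical population proportion (assumption)
sample_size = 2000    # about the size of the election polls mentioned above


def poll(n, p):
    """Simulate asking n randomly chosen voters; return the share who say yes."""
    yes = sum(1 for _ in range(n) if random.random() < p)
    return yes / n


estimate = poll(sample_size, true_support)

# Approximate standard error of a sample proportion
std_error = math.sqrt(estimate * (1 - estimate) / sample_size)

print(f"Sample estimate: {estimate:.3f}")
print(f"Approximate 95% margin of error: +/- {2 * std_error:.3f}")
```

Even though only 2,000 voters are asked, the estimate typically lands within a few percentage points of the population value, which is why sampling is usually preferred to contacting everyone.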
Statistics in Everyday Life
Monthly Unemployment Rates (BLS)
Consumer Price Index
Presidential Approval Rating
Quality and Productivity Improvement
Scientific Inquiry
• Training effectiveness
• Advertising impact
Interesting Statistical Perspectives
“Statistical thinking will one day be as necessary for efficient citizenship as the ability to read and write”.
– (H. G. Wells)
“There are three kinds of lies -- Lies, damn lies, and statistics”.
– (Benjamin Disraeli)
“You’ve got to know when to hold ‘em, know when to fold ‘em.”
– (Kenny Rogers, in The Gambler)
“The average U. S. household has 2.75 people in it.”
– (U. S. Census Bureau, 1980)
“4 out of 5 dentists surveyed recommended Trident Sugarless Gum for their patients who chew gum.”
– (Advertisement for Trident)
Semester Overview
Understanding data
• Intro to descriptive statistics, interpreting data, and graphical methods
Dealing with and quantifying uncertainty
• Random variables and probability
Using samples to make generalizations about populations
• Assessing whether a change in data is beyond random variation
Modeling relationships and predicting
• Using sample data to create models that give predictions for all values of a population
Goals for this Class
To gain an understanding of descriptive statistics, probability, statistical inference, and regression analysis so that it may be applied to your job
To be able to identify when statistical procedures are required to facilitate your business decision making
To be able to identify both good and poor use of statistics in business
Goals for Me
To teach you statistics and data analysis effectively
To improve my effectiveness as an instructor
My Promise To You
I will not teach you anything in this class that is not regularly used in business and industry
If you ask, “Where is this used?” I will have a real example for you
Types of Data
Data
Qualitative / Categorical: qualitative trait only classifiable into categories
• Cable Appointment (Made, Missed)
• Employment Status (employed, unemployed)
• Bond Ratings (1, 2, 3, or 4 stars)
• Service Quality (poor, good, excellent)
Quantitative / Continuous: characteristic measurement on a numerical scale
• Cable Appointment Waiting Time (hours)
• Employment Tenure (months)
• Bond Return (percentage)
• Cost (dollars)
Example: Data Types
Business Horizons (1993) conducted a comprehensive survey of 800 CEOs who run the country's largest global corporations. Some of the variables measured are given below. Classify them as quantitative or qualitative.
• State of birth
• Age
• Educational Level
• Tenure with Firm
• Total Compensation
• Area of Expertise
• Gender
How Much Data
Variables
Univariate Data: data sets with just one piece of information
• GMAT scores for students in this class
• Incomes in a zip code
• Returns for a stock over this past year
• Respondent ages from market research
– What is a typical value?
– How do the values vary?
Bivariate Data: data sets with two pieces of information
• GMAT scores and college GPA
• Incomes and age in a zip code
• Returns and volume for a stock
• Market research respondent age and purchase intent
– Is there a relationship?
– How strong is the relationship?
– Is there a predictive relationship?
Multivariate Data: data sets with three or more pieces of information
• GMAT Scores, Salary, Gender, Job Tenure, Job Category, House Ownership, etc.
– Are there relationships?
– How strong are the relationships?
– Do predictive relationships exist?
CHAPTER 2
Summarizing Data about One Variable
Introduction
An unorganized mass of numbers is difficult to interpret
The first task in understanding data is summarizing it
• Graphically
• Numerically
Chapter Goals
Distinguish between qualitative and quantitative variables
Learn graphic representations of univariate data
Learn numerical representations of univariate data
Investigate data acquired over time
Distribution of Values
A distribution is essentially how many times each possible data value occurs in a set of data.
Methods for displaying distributions
• Qualitative data
– Frequency table
– Bar charts
• Quantitative data
– Histograms
– Stem-Leaf diagrams
– Boxplots
Example: Qualitative Data
Background: A question on a market research survey asked 17 respondents the size of their households
Data: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
Frequency Table
Household Size   Number of Households
1                3
2                5
3                6
4                2
5                0
6                1
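The frequency table above can be rebuilt directly from the 17 responses. This is a minimal Python sketch using the standard library (the course itself uses Minitab); the variable names are illustrative.

```python
from collections import Counter

household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]

counts = Counter(household_sizes)

# Include sizes with zero households (size 5 here) so the table has no gaps
table = {size: counts.get(size, 0)
         for size in range(min(household_sizes), max(household_sizes) + 1)}

print("Household Size  Number of Households")
for size, freq in table.items():
    print(f"{size:>14}  {freq:>20}")
```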
Example: Qualitative Data (cont.)
Bar chart: plot of how frequently each category occurs in the data set
[Figure: bar chart "Number of Households"; x-axis: Household Size (1 to 6); y-axis: Frequency (0 to 7)]
Example: Quantitative Data
Background: Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data are the annual salary of the chief executive officer for the first 60 ranked firms.
Data (in thousands):
145 621 262 208 362 424 339 736 291 58 498 643 390 332 750
368 659 234 396 300 343 536 543 217 298 1103 406 254 862 204
206 250 21 298 350 800 726 370 536 291 808 543 149 350 242
198 213 296 317 482 155 802 200 282 573 388 250 396 572
Example: Quantitative Data (cont.)
Histograms are constructed in the same way as bar charts except:
• User must create classes to count frequencies
• Bars are adjacent instead of separated with space
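Creating classes and counting frequencies can be sketched in a few lines of Python. The class width of 200 is an assumption chosen to match the 0 to 1200 axis of the salary histogram, and each class is taken as including its lower bound but not its upper bound; a statistics package may place boundary values differently. Note that the slide lists 59 salary values.

```python
from collections import Counter

# CEO salaries in thousands of dollars, as listed on the slide (59 values)
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58, 498, 643, 390, 332, 750,
    368, 659, 234, 396, 300, 343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291, 808, 543, 149, 350, 242,
    198, 213, 296, 317, 482, 155, 802, 200, 282, 573, 388, 250, 396, 572,
]

class_width = 200  # assumed class width

# Each salary falls in the class [lower, lower + class_width)
class_counts = Counter((s // class_width) * class_width for s in salaries)

# Text histogram: one '#' per firm in the class
for lower in range(0, 1200, class_width):
    count = class_counts.get(lower, 0)
    print(f"{lower:>4}-{lower + class_width - 1:<4} {'#' * count} ({count})")
```

Changing `class_width` changes the picture, which is why the choice of classes is left to the user.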
Example: Quantitative Data (cont.)
[Figure: CEO Salary Histogram; x-axis: Salary (in thousands), 0 to 1200; y-axis: Frequency, 0 to 30]
Example: Quantitative Data (cont.)
Questions:
• What is the typical value of CEO salary?
• How much variability is there around this value?
• What is the general shape of the data?
Histogram characteristics:
• Central tendency
• Variability
• Skewness
• Modality
• Outliers
Skewness
[Figure: three histograms (x-axis: Data; y-axis: Frequency): Symmetric Distribution, Right Skewed Distribution, Left Skewed Distribution]
Modality
[Figure: two histograms (x-axis: Data; y-axis: Frequency): Unimodal Distribution, Bimodal Distribution]
Outliers
[Figure: histogram "Distribution with Outlier"; x-axis: Data; y-axis: Frequency]
Example: Stem-Leaf Diagram
Background: Telecom company wants to analyze the time to complete new service orders measured in hours
Data: 42 21 46 69 87 29 34 59 81 97 64 60 87 81 69 77 75 47
73 82 91 74 70 65 86 87 67 69 49 57 55 68 74 66 81 90 75 82 37 94
Diagram:
2 | 19
3 | 47
4 | 2679
5 | 579
6 | 045678999
7 | 0344557
8 | 111226777
9 | 0147
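The diagram can be reproduced from the raw data with a short Python sketch. The function name `stem_and_leaf` is illustrative; it uses the tens digit as the stem and the units digit as the leaf, and it skips any stem that has no values.

```python
from collections import defaultdict

# Time to complete new service orders, in hours (40 values from the slide)
times = [42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
         73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
         75, 82, 37, 94]


def stem_and_leaf(values):
    """Return display lines using the tens digit as stem and units digit as leaf."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    return [f"{stem} | {''.join(str(leaf) for leaf in leaves[stem])}"
            for stem in sorted(leaves)]


lines = stem_and_leaf(times)
print("\n".join(lines))
```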
Measures of Central Tendency
Mode: Value or category that occurs most frequently
Median: Middle value when the data are sorted
Mean: Sum of measurements divided by the number of measurements
Example: Mode
Background: A question on a market research survey asked 17 respondents the size of their households
Data: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
Frequency Table
Household Size   Number of Households
1                3
2                5
3                6   (Mode)
4                2
5                0
6                1
Example: Median
Background: A question on a market research survey asked 17 respondents the size of their households
Data: 1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,6
Since there are n = 17 observations, the median is the (n+1)/2 = 9th observation
Observation     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17
Household Size  1  1  1  2  2  2  2  2  3  3  3  3  3  3  4  4  6
Median = 3 (the 9th observation)
Example: Mean
Background: Cable company wants to know how long an installer spends at each stop. One employee performed five installations in one day and recorded how many minutes she was at each location.
Data: 45, 23, 36, 29, 52
Mean = (45+23+36+29+52) / 5 = 37 minutes
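The three measures of central tendency from these examples can be checked with Python's standard `statistics` module. This is only a sketch for verification, not the course's Minitab workflow.

```python
import statistics

household_sizes = [1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 4, 4, 6]
install_minutes = [45, 23, 36, 29, 52]

mode_size = statistics.mode(household_sizes)      # most frequent value
median_size = statistics.median(household_sizes)  # middle value after sorting
mean_minutes = statistics.mean(install_minutes)   # sum divided by count

print(f"Mode household size:   {mode_size}")
print(f"Median household size: {median_size}")
print(f"Mean installation time: {mean_minutes} minutes")
```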
Example: Back to the CEO’s Salaries
[Figure: CEO Salary Histogram; x-axis: Salary (in thousands), 0 to 1200; y-axis: Frequency, 0 to 30]
Mean = 404.1695
Median = 350
WHY THE DIFFERENCE?
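One likely answer is right skew: a handful of very large salaries (such as 1103) pull the mean upward, while the median depends only on the middle of the sorted data. The Python sketch below recomputes both figures from the 59 listed salaries and counts how many firms sit above the mean.

```python
import statistics

# CEO salaries in thousands of dollars, as listed on the earlier slide (59 values)
salaries = [
    145, 621, 262, 208, 362, 424, 339, 736, 291, 58, 498, 643, 390, 332, 750,
    368, 659, 234, 396, 300, 343, 536, 543, 217, 298, 1103, 406, 254, 862, 204,
    206, 250, 21, 298, 350, 800, 726, 370, 536, 291, 808, 543, 149, 350, 242,
    198, 213, 296, 317, 482, 155, 802, 200, 282, 573, 388, 250, 396, 572,
]

mean_salary = statistics.mean(salaries)
median_salary = statistics.median(salaries)

# In a right-skewed data set, fewer than half of the values exceed the mean
above_mean = sum(1 for s in salaries if s > mean_salary)

print(f"Mean   = {mean_salary:.4f}")
print(f"Median = {median_salary}")
print(f"{above_mean} of {len(salaries)} salaries are above the mean")
```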
Measures of Variation
Variability is a primary reason for using statistics
If there were no variability, we would not need statistics
Examples:
• Worker productivity
• Stock market
• Promotional expenditures
Measures
• Standard deviation: variation around the mean
• Range: distance between smallest and largest observations
Standard Deviation
Standard Deviation: summarizes how far away from the mean the data values typically are.
Calculation
• Find the deviations by subtracting the mean from each data value
• Square these deviations, add them up, and divide by n-1
• Take the square root of this number
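The three calculation steps translate directly into code. This Python sketch applies them to the five installation times from the earlier mean example; the variable names are illustrative.

```python
import math

data = [45, 23, 36, 29, 52]  # installation times in minutes

n = len(data)
mean = sum(data) / n

# Step 1: deviations from the mean
deviations = [x - mean for x in data]

# Step 2: square the deviations, add them up, and divide by n - 1
variance = sum(d ** 2 for d in deviations) / (n - 1)

# Step 3: take the square root
std_dev = math.sqrt(variance)

print(f"Deviations: {deviations}")
print(f"Standard deviation: {std_dev:.2f} minutes")
```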
Example: Standard Deviation
Background: Your firm spends $19 Million per year on advertising, and management is wondering if that figure is appropriate. Other firms in your industry have a mean advertising expenditure of $22.3 Million per year.
Example: Standard Deviation (cont.)
Ad Spend ($ millions)   Deviation   Squared Deviation
 8                      -14.29      204.32
19                       -3.29       10.85
22                       -0.29        0.09
20                       -2.29        5.26
27                        4.71       22.15
37                       14.71      216.26
38                       15.71      246.67
23                        0.71        0.50
23                        0.71        0.50
12                      -10.29      105.97
11                      -11.29      127.56
32                        9.71       94.20
20                       -2.29        5.26
18                       -4.29       18.44
23                        0.71        0.50
35                       12.71      161.44
11                      -11.29      127.56
Mean = 22.29
St Dev = 9.18
[Figure: Industry Advertising Histogram; x-axis: Millions of Dollars, 5 to 40; y-axis: Frequency, 0 to 4]
Example: Standard Deviation (cont.)
Difference from peer group average is $3.3 Million
This difference is smaller than the industry standard deviation of $9.18 Million
Conclusion: Your advertising budget, while slightly below the industry average, is typical compared with your industry peers
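The comparison can be reproduced from the 17 industry figures in the table. In this Python sketch, `distance_in_sd` is an illustrative name (not from the slides) for the firm's distance from the industry mean measured in standard deviations.

```python
import statistics

# Industry advertising expenditures in millions of dollars (from the table)
industry_ad_spend = [8, 19, 22, 20, 27, 37, 38, 23, 23, 12, 11, 32, 20, 18, 23, 35, 11]
our_spend = 19  # your firm's annual advertising budget, in millions

mean_spend = statistics.mean(industry_ad_spend)
std_dev = statistics.stdev(industry_ad_spend)  # sample st dev, divides by n - 1

# How many standard deviations the firm is from the industry mean
distance_in_sd = (our_spend - mean_spend) / std_dev

print(f"Mean = {mean_spend:.2f}, St Dev = {std_dev:.2f}")
print(f"Your budget is {distance_in_sd:.2f} standard deviations from the mean")
```

A distance well inside one standard deviation supports the conclusion that the budget is typical for the industry.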
Empirical Rule
If the histogram for a given sample is unimodal and symmetric (mound-shaped), then the following rule-of-thumb may be applied:
Let $\bar{x}$ represent the sample mean and $s$ the sample standard deviation. Then
• $\bar{x} \pm 1s$ contains approximately 68% of the measurements;
• $\bar{x} \pm 2s$ contains approximately 95% of the measurements;
• $\bar{x} \pm 3s$ contains approximately all of the measurements.
Example: Stock Market Volatility
Description: Stock market returns are supposed to be unpredictable. Let’s see if the empirical rule holds true
Data: S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002
Mean = 0.0002
St. Dev. = 0.0128
72.8% (95.3%) of the returns fall between the sample mean plus and minus one (two) st. dev.
[Figure: S&P-500 Daily Returns Histogram; x-axis: Daily Return, -0.06 to 0.06; y-axis: Frequency, 0 to 350]
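The same check can be run on any data set. The S&P-500 returns themselves are not reproduced in these slides, so the Python sketch below uses simulated stand-in data: normally distributed values with the slide's mean and standard deviation. That is an assumption for illustration only, and `share_within` is a hypothetical helper name.

```python
import random
import statistics


def share_within(values, k):
    """Fraction of values within k sample standard deviations of the sample mean."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    return sum(1 for v in values if abs(v - mean) <= k * s) / len(values)


random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated stand-in data (assumption): normal with mean 0.0002 and st. dev. 0.0128
simulated_returns = [random.gauss(0.0002, 0.0128) for _ in range(10_000)]

for k in (1, 2, 3):
    print(f"Within {k} st. dev.: {share_within(simulated_returns, k):.1%}")
```

Simulated normal data should come out close to 68% and 95%. The actual returns put 72.8% within one standard deviation, somewhat more than the rule of thumb suggests.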
Inter-Quartile Range
Inter-Quartile Range (IQR) provides an alternative approach to measuring variability
Computation:
• Sort the data and find the median
• Divide the data into top and bottom halves
• Find the median of both halves. These are the 25th and 75th percentiles
• IQR = 75th percentile – 25th percentile
Outlier Measure – Any value outside the inner fences is an outlier candidate
• Lower inner fence = 25th percentile – 1.5 IQR
• Upper inner fence = 75th percentile + 1.5 IQR
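The computation and the outlier fences can be sketched in Python using the service-order times from the stem-leaf example. The function name `iqr_fences` is illustrative. One assumption is flagged in the code: with an odd number of observations, the middle value is left out of both halves, whereas statistical packages may use slightly different quartile conventions.

```python
import statistics


def iqr_fences(values):
    """Quartiles as medians of the lower and upper halves, plus inner fences."""
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    # Assumption: with an odd count, the middle value is left out of both halves
    upper_half = data[-half:]
    q1 = statistics.median(lower_half)
    q3 = statistics.median(upper_half)
    iqr = q3 - q1
    return q1, q3, iqr, q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Time to complete new service orders, in hours (from the stem-leaf example)
times = [42, 21, 46, 69, 87, 29, 34, 59, 81, 97, 64, 60, 87, 81, 69, 77, 75, 47,
         73, 82, 91, 74, 70, 65, 86, 87, 67, 69, 49, 57, 55, 68, 74, 66, 81, 90,
         75, 82, 37, 94]

q1, q3, iqr, lower_fence, upper_fence = iqr_fences(times)
outlier_candidates = [t for t in times if t < lower_fence or t > upper_fence]

print(f"25th percentile = {q1}, 75th percentile = {q3}, IQR = {iqr}")
print(f"Inner fences: {lower_fence} to {upper_fence}")
print(f"Outlier candidates: {outlier_candidates}")
```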
Box-Plot – S&P-500 Example
Data: S&P-500 Daily returns; Jan 01, 1998 – May 17, 2002
[Figure: S&P-500 Daily Returns Boxplot; y-axis: Daily Return, -0.06 to 0.04; labeled features: Median, 75th percentile, 25th percentile, Upper inner fence, Lower inner fence, Outliers]
Minitab Tutorial
Why Use Minitab???
The goal of the course is to learn statistical concepts
• Most statistical analyses are performed using computers
• Each company may use a different statistical package
YES…Minitab is used in business!
• Typically in quality control and design of experiments
EXCEL has very limited statistical functionality and is considerably more difficult to use than Minitab
There are many stat packages (SAS, SPSS, Systat, Splus, R, Statistica, Mathematica, etc.)
• Minitab is the easiest program to use right away
• Excellent Help facilities
• Statistical glossary built-in
Minitab Tutorial – Case Study 1
A hotel kept records over time of the reasons why guests requested room changes. The frequencies were as follows:
– Room not clean 2
– Plumbing not working 1
– Wrong type of bed 13
– Noisy location 4
– Wanted nonsmoking 18
– Didn’t like view 1
– Not properly equipped 8
– Other 6
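The tutorial itself is carried out in Minitab. As a rough stand-in, the Python sketch below ranks the reasons from most to least frequent and shows each one's share of all requests, which is one common way to summarize qualitative data like this.

```python
room_change_reasons = {
    "Room not clean": 2,
    "Plumbing not working": 1,
    "Wrong type of bed": 13,
    "Noisy location": 4,
    "Wanted nonsmoking": 18,
    "Didn't like view": 1,
    "Not properly equipped": 8,
    "Other": 6,
}

total = sum(room_change_reasons.values())

# Sort reasons from most to least frequent
ranked = sorted(room_change_reasons.items(), key=lambda item: item[1], reverse=True)

for reason, count in ranked:
    print(f"{reason:<22} {count:>3} {count / total:6.1%}")
```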
Minitab Tutorial – Case Study 2
Exercise 2.8 in book
• Produce graphics
• Produce descriptive statistics
Minitab Tutorial – Case Study 3
Diversification???
Data: S&P-500 and IBM daily returns from Jan 01, 1998 through May 17, 2002
Next Time
Probability and Probability Distributions