Week 11: Basic Descriptive Quantitative Data Analysis
Tables, Graphs, & Summary Statistics
Objectives
Learn about basic descriptive quantitative analysis
How to perform these tasks in Excel
Starting point for 502B
Excel knowledge and quantitative skills are highly desired by employers
EC stream
Introduction
Without data, it is anyone’s opinion
Why use tables, graphs, summary stats?
“At their best, tables, graphs, and statistics are instruments for reasoning about complex quantitative information.”
Why learn how to design them appropriately?
“At their worst, tables, graphs and summary statistics are instruments of evil used for deceiving a naive viewer.”
Does your mindset match my dataset! http://www.ted.com/talks/hans_rosling_at_state.html
Quantitative Research Process
Presenting the Data
Frequency Distribution
A convenient way of summarizing a lot of tabular data
What is a Frequency Distribution?
A frequency distribution is a list or a table …
containing class groupings (categories or ranges within which the data fall) ...
and the corresponding frequencies with which data fall within each class or category
For nominal/ordinal data
Table 1
Univariate Frequencies of Percentage of Sales Reported to Tax Authorities

% of Sales Reported   Frequency   Percent (%)
100%                       3307         40.56
90-99%                     1096         13.44
80-89%                      916         11.24
70-79%                      703          8.62
60-69%                      501          6.14
50-59%                      694          8.51
<50%                        936         11.48
Total                      8153        100
Source: 1999 World Bank World Business Environment Survey (WBES), excludes missing observations
http://www.enterprisesurveys.org/
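The Percent column in Table 1 is each frequency divided by the total. A minimal Python sketch of that calculation, using the frequencies from the table (the variable names are illustrative, and the course itself does this in Excel):

```python
# Frequencies copied from Table 1 (1999 WBES, missing observations excluded).
frequencies = {
    "100%": 3307,
    "90-99%": 1096,
    "80-89%": 916,
    "70-79%": 703,
    "60-69%": 501,
    "50-59%": 694,
    "<50%": 936,
}

# The total is the sum of all class frequencies.
total = sum(frequencies.values())

# Relative frequency of each class, expressed as a percentage.
percents = {
    category: round(100 * count / total, 2)
    for category, count in frequencies.items()
}

for category, count in frequencies.items():
    print(f"{category:>7} {count:>6} {percents[category]:>6.2f}")
print(f"{'Total':>7} {total:>6}")
```

The printed percentages should agree with the Percent column above.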
Contingency/Pivot/Cross Table
May also want to produce a table with more categories
Cross table, contingency table, or pivot table
Suitable if you have two nominal/ordinal variables
Simple extension to a univariate table
Considers relationship between two variables
Row variable (dependent)
Column variable (independent)
Table 2
Percentage of Sales Reported to Tax Authorities by Region

% of Sales Reported   Africa   Transition    Asia     Latin    OECD   Former Soviet    Total
                                   Europe           America               Countries
100%                     490          554     416       794     446             607    3,307
90-99%                   266          196     142       119     145             228    1,096
80-89%                   158          152     117       192      73             224      916
70-79%                   162          117     103       153      43             125      703
60-69%                   140           69      70       115      22              85      501
50-59%                   140          105     141       118      16             174      694
<50%                     100          106     283       296      25             126      936
Total                  1,456        1,299   1,272     1,787     770           1,569    8,153
Source: 1999 World Bank World Business Environment Survey (WBES)
* Excludes missing observations
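Because the regions have different sample sizes, the counts in Table 2 are easier to compare as column percentages (each cell divided by its region's total). A minimal Python sketch using the counts from the table; the variable names are illustrative:

```python
regions = [
    "Africa", "Transition Europe", "Asia",
    "Latin America", "OECD", "Former Soviet Countries",
]

# Rows of Table 2: counts per region, in the order of `regions`.
counts = {
    "100%":   [490, 554, 416, 794, 446, 607],
    "90-99%": [266, 196, 142, 119, 145, 228],
    "80-89%": [158, 152, 117, 192, 73, 224],
    "70-79%": [162, 117, 103, 153, 43, 125],
    "60-69%": [140, 69, 70, 115, 22, 85],
    "50-59%": [140, 105, 141, 118, 16, 174],
    "<50%":   [100, 106, 283, 296, 25, 126],
}

# Column totals: sum each region's counts down the rows.
column_totals = [sum(column) for column in zip(*counts.values())]

# Column percentages: share of each category within a region, rounded to whole percent.
column_percents = {
    category: [round(100 * c / t) for c, t in zip(row, column_totals)]
    for category, row in counts.items()
}

for region, total in zip(regions, column_totals):
    print(f"{region}: n = {total}")
print(column_percents["100%"])
```

Column percentages answer "within each region, what share of firms report all of their sales?", which the raw counts alone cannot show.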
Features of a Table
Title that accurately summarizes the data
Simple, indicates major variables, and time frame (if applicable)
Source: data set or origin of table
Explanatory footnotes
Easy to read & separated from text
Properly formatted for style (see APA Rules)
Necessary to advance analysis
See Module 7 for APA Table Checklist
Reproduced from APA manual
Presenting the Data
Bar Graph
Often used to describe categorical (ordinal/nominal) data
Draws attention to the frequency of each category
Bar Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
Relative Frequency Polygon
Pie Graph
Emphasizes the proportion of each category
Something that may be good for our tax evasion data
Circle represents the total
Segments represent the shares of the total
Segment size is proportional to frequency
Pie Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
Charts in Excel I
Bar Graph
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
Segmented Bar Chart
Figure 1
Percentage of sales reported to tax authority
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
Pie Graph
[Figure: six pie charts, one per region, showing the percentage of sales reported to tax authorities. Segment shares by region:]

% of Sales Reported   Africa and   Transition    Asia     Latin    OECD   Former Soviet
                     Middle East       Europe           America                   Union
100%                         34%          43%     33%       44%     58%             39%
90-99%                       18%          15%     11%        7%     19%             15%
80-89%                       11%          12%      9%       11%      9%             14%
70-79%                       11%           9%      8%        9%      6%              8%
60-69%                       10%           5%      6%        6%      3%              5%
50-59%                       10%           8%     11%        7%      2%             11%
<50%                          7%           8%     22%       17%      3%              8%
Figure 2
Percentage of sales reported to tax authority by region
Source: 1999 World Bank World Business Environment Survey (WBES)
Note. Excludes missing observations. n = 8314
Vertical Bar Chart
Charts in Excel II
Time Series Graph
Time series are often used in social sciences
Data collected at various time periods: daily, weekly, monthly, quarterly, annually, etc.
Examples include GDP, unemployment, university tuition
Plot series of interest over time
Let’s look at a graph of the unemployment rate by gender and age
Line Graph
Histogram
Used for continuous data
Frequency distribution for continuous data
Summary graph showing the count of data points falling in various ranges
Rough approximation of the distribution of the data
A histogram is a way to summarize data
The distribution condenses the raw data into a more useful form...
and allows for a quick visual interpretation of the data
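The counting step behind a histogram can be sketched in a few lines of Python. The ages below are made-up illustrative values, not course data, and the function name `histogram_counts` is hypothetical:

```python
from collections import Counter

def histogram_counts(values, bin_width, start=0):
    """Count how many values fall in each bin [lower, lower + bin_width)."""
    counts = Counter()
    for v in values:
        # Lower edge of the bin that contains v.
        lower = start + ((v - start) // bin_width) * bin_width
        counts[lower] += 1
    return dict(sorted(counts.items()))

# Hypothetical continuous variable (ages of ten respondents).
ages = [21, 23, 25, 31, 34, 35, 38, 42, 47, 58]

hist = histogram_counts(ages, bin_width=10, start=20)
for lower, count in hist.items():
    # One asterisk per observation gives a rough text histogram.
    print(f"{lower}-{lower + 9}: {'*' * count}")
```

Changing `bin_width` changes how coarse the picture is, which is why a histogram is only a rough approximation of the distribution.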
Histogram
Scatter Graphs
Graphs the relationship between two continuous variables
Scatter Graph
Principles of Graphical Excellence
Well-designed presentation of interesting data
Substance & design
Simplicity of design, complexity of data
Proportion and balance
Clear, precise, efficient
Know what you are trying to show (have a story) and make sure your graph shows it
Well formatted, professional
Choose a format that reflects your data and the story
Informative and legible axes
Fully labelled & legible
Gets across main point(s) in the shortest time with the least ink in the smallest space
Adds information not otherwise available to the reader
But supplemented with text describing the figure
Tells the truth about the data
Limits complexity and confusion
Avoid chartjunk
Examples of Chartjunk
[Figure: sample quarterly charts (1st Qtr to 4th Qtr) with series West, North, Northeast, Southwest, Mexico, Europe, Japan, East, South, International]
[Figure: quarterly chart annotated with chartjunk problems:]
Gridlines!
Vibration
Pointless fake 3-D effects
Filled “floor”
Clip art
In or out?
Filled “walls”
Borders and fills galore
Unintentional heavy or double lines
Filled labels
Serif font with thin & thick lines
Displaying Data: “Mistakes”
Graphs are also instruments of evil used for deceiving a naive viewer.
Non-zero origin
Omitting data that refutes your “evidence”
Limiting scope of data
What is Wrong with this Graph?
[Figure: Provincial Personal Income Taxes, single individual with $45,000 in income claiming basic personal tax credits]
The Real Story
Exaggerates a change in data
Source: Statistics Canada, CANSIM II, V31215364
Dr. Kendall
Worst Recession Since the Depression (?)
Presenting the Data
Describing Data Numerically
Central Tendency: Simple Arithmetic Mean, Median, Mode
Variation: Range, Variance, Standard Deviation
Association: Covariance, Correlation
Shape of the Distribution
Mode
A measure of central tendency
Value that occurs most often
Not affected by extreme values
Used for either numerical or categorical data
There may be no mode or several modes
What are the modes for the displayed data?
[Dot plot on a 0-14 axis] Mode = 9
[Dot plot on a 0-6 axis] No mode
Mode
There may be several modes
[Dot plot on a 0-14 axis] Mode = 5 & 9
Mode
Caution: Mode may not be representative of the data {0.1, 0.1, 5000, 4900, 4500, 5200,…}
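A minimal Python sketch of finding the mode(s), including the "no mode" and "several modes" cases. The helper `modes` is hypothetical, and the first sample uses only the six values listed in the caution above (the slide's list continues); the other two lists are made up for illustration:

```python
from collections import Counter
from statistics import median

def modes(values):
    """Return the most frequent value(s); an empty list means no mode."""
    counts = Counter(values)
    highest = max(counts.values())
    # If every distinct value occurs equally often, no value stands out.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)

# Only the six values shown on the slide.
sample = [0.1, 0.1, 5000, 4900, 4500, 5200]
print(modes(sample))    # the mode is 0.1 ...
print(median(sample))   # ... but the median is 4700.0

print(modes([3, 5, 5, 7, 9, 9]))       # two modes: 5 and 9
print(modes([0, 1, 2, 3, 4, 5, 6]))    # every value occurs once: no mode
```

In the first sample the mode (0.1) sits far from where most of the data lie, which is the caution the slide raises.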
Median
In an ordered list, the median is the “middle” number (50% above, 50% below)
[Two dot plots on a 0-10 axis]
Mean
The “balancing point” (centre of gravity) of the data
E.g. the data “balances” at 5
[Figure: dot plot on a 1-9 axis with deviations of -2, -1, and +3 from the mean, which sum to zero]
Arithmetic Mean
The arithmetic mean (mean) is the most common measure of central tendency
Calculated by summing the observed values and dividing by the number of observations
For a sample of size n:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where the x_i are the observed values and n is the # of observations
Arithmetic Mean
The most common measure of central tendency
Mean = sum of values divided by the number of values
Affected by extreme values (outliers)
What is the mean for these examples?
[Dot plot on a 0-10 axis] Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
[Dot plot on a 0-10 axis] Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10) / 5 = 20 / 5 = 4
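The two examples above can be checked in Python with the standard `statistics` module. The sketch also computes the median of each list to show how differently the two measures react to the outlier; the variable names are illustrative:

```python
from statistics import mean, median

without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]

# The mean uses every value, so one extreme value pulls it upward.
print(mean(without_outlier), mean(with_outlier))      # 3 and 4

# The median only depends on the middle of the ordered list.
print(median(without_outlier), median(with_outlier))  # 3 and 3
```

Replacing 5 with 10 raises the mean from 3 to 4 while the median stays at 3.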
Measures of Central Tendency
Overview of central tendency
Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median: midpoint of ranked values (50% below, 50% above)
Mode: most frequently observed value
The “Shape of a Distribution”
Use information on mean, median, and mode to “visualize” the data
A data distribution is said to be symmetric if its shape is the same on both sides of the median
Symmetry implies that median = arithmetic mean
If a distribution is unimodal and symmetric, then median = mean = mode
[Figure: three histograms of # of obs. by value (1-7), each with the median splitting the data 50% / 50%]
Symmetric & unimodal: Median = Mean = Mode
Symmetric & bimodal: Median = Mean ≠ Mode
Symmetric & no mode: Median = Mean (uniform distribution)
The “Shape of a Distribution”
An asymmetric distribution is said to be skewed:
1. Negatively if Mean < Median < Mode
2. Positively if Mean > Median > Mode
Hence, by comparing our measures of central tendency, we can start to visualize the shape and characteristics of the data
[Figure: histogram of values 1-8] Mode = 2, Median = 3, Mean = 3.2
Mode < Median < Mean: positively skewed distribution
Example: Positively skewed variable
The Distribution of After-Tax Income shows the distribution of income across all Canadian households
The mode income is the most common income and was in the range from $15,000 to $19,999.
The median income is the level of income that separates the population into two groups of equal size and was $39,700.
The mean income is the average income and was $48,400.
A distribution in which the mean exceeds the median and the median exceeds the mode is positively skewed, which means it has a long tail of high values.
The distribution of income in Canada is positively skewed.
The median is more likely to be reported than the mean, since the long tail distorts the average
Example: Positively skewed variable
Volunteer hours
Charitable contributions
# of cigarette packs smoked (excluding 0)
Collective bargaining agreement duration (in years)
# of beers consumed on a Saturday night
Duration of low income (in years)
Number of children
The “Shape of a Distribution”
66
0
2
4
6
8
10
12
0 1 2 3 4 5 6 7
MODE=6MEDIAN=5
50% 50%
MEAN=4.7
Mean< MEDIAN < Mode = NEGATIVELY SKEWED DISTRIBUTION
Examples of negatively skewed variables
University grades
Age
Years in school
Etc.
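The comparison rule for skewness (mean vs. median vs. mode) can be written as a small Python function. This is a rough sketch of the rule of thumb from the slides, not a formal skewness statistic; the function name `skew_direction` is hypothetical, and the inputs are the values shown in the two skewed histograms:

```python
def skew_direction(mean, median, mode):
    """Classify a distribution's shape by comparing its central tendency measures."""
    if mean > median > mode:
        return "positively skewed"
    if mean < median < mode:
        return "negatively skewed"
    if mean == median == mode:
        return "symmetric and unimodal"
    # Any other ordering is not covered by the rule of thumb.
    return "unclear"

print(skew_direction(mean=3.2, median=3, mode=2))   # positively skewed
print(skew_direction(mean=4.7, median=5, mode=6))   # negatively skewed
```

The rule only looks at three summary numbers, so it should be confirmed with a histogram of the data.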
Same center, different variation
Measures of Dispersion/Variability
Variation: Range, Variance, Standard Deviation
Measures of variation give information on the spread or variability of the data values.
Range
Simplest measure of variation
Difference between the largest and the smallest observations:
Range = X_largest - X_smallest
Example: [dot plot on a 0-14 axis] Range = 14 - 1 = 13
The Range
• Problem
• Ignores all but two data points
• These values may be “outliers” (i.e. not representative)
Disadvantages of the Range
Ignores the way in which data are distributed
[Two dot plots on a 7-12 axis with different distributions] Both have Range = 12 - 7 = 5
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119
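The outlier example above can be reproduced in Python. The two lists are the ones shown on the slide; `value_range` is an illustrative helper name:

```python
def value_range(values):
    """Range = largest observation minus smallest observation."""
    return max(values) - min(values)

# Shared part of both data sets: eleven 1s, eight 2s, four 3s, and one 4.
common = [1] * 11 + [2] * 8 + [3] * 4 + [4]

data_without_outlier = common + [5]
data_with_outlier = common + [120]

print(value_range(data_without_outlier))  # 4
print(value_range(data_with_outlier))     # 119
```

Changing a single observation out of 25 moves the range from 4 to 119, even though the other 24 values are identical.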
The Variance
• A single summary measure of dispersion would be more helpful
• Takes account of all data values
1. Variance
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2. Standard Deviation
$$s = \sqrt{s^2} = \sqrt{\text{variance}}$$
Measuring variation
Small standard deviation
Large standard deviation
Comparing Standard Deviations
[Three dot plots on an 11-21 axis]
Data A: Mean = 15.5, s = 3.338
Data B: Mean = 15.5, s = 0.926
Data C: Mean = 15.5, s = 4.570
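A minimal Python sketch of the sample variance and standard deviation (dividing by n - 1). The underlying observations for Data A, B, and C are not recoverable from the slides, so the three lists below are assumed illustrative values chosen to have a mean of 15.5 and standard deviations close to the reported ones (the assumed Data C gives about 4.567 rather than exactly 4.570):

```python
from math import sqrt

def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)

def sample_sd(values):
    """Standard deviation is the square root of the variance."""
    return sqrt(sample_variance(values))

# Assumed values (not from the slides): same mean, different spread.
data_a = [11, 12, 13, 16, 16, 17, 18, 21]
data_b = [14, 15, 15, 15, 16, 16, 16, 17]
data_c = [11, 11, 11, 12, 19, 20, 20, 20]

for name, data in (("A", data_a), ("B", data_b), ("C", data_c)):
    mean = sum(data) / len(data)
    print(f"Data {name}: mean = {mean}, s = {sample_sd(data):.3f}")
```

All three lists share the same centre, so only the standard deviation distinguishes the tightly clustered data (B) from the widely spread data (C).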
The Sample Covariance
The covariance measures the strength of the linear relationship between two variables
The sample covariance:
$$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
Only concerned with the strength of the relationship
No causal effect is implied
Interpreting Covariance
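Covariance between two variables:
Cov(x,y) > 0: x and y tend to move in the same direction
Cov(x,y) < 0: x and y tend to move in opposite directions
Cov(x,y) = 0: x and y have no linear relationship (independent variables have zero covariance, but zero covariance alone does not establish independence)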
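A minimal Python sketch of the sample covariance and the sign of the result. All data below are made-up illustrative values, and `sample_covariance` is a hypothetical helper name:

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least two pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

x = [1, 2, 3, 4, 5]
rising = [2, 4, 6, 8, 10]     # moves with x
falling = [10, 8, 6, 4, 2]    # moves against x

print(sample_covariance(x, rising))    # positive
print(sample_covariance(x, falling))   # negative

# y is completely determined by x (y = x squared), yet the covariance is zero,
# because the relationship is not linear.
centred = [-2, -1, 0, 1, 2]
squared = [v ** 2 for v in centred]
print(sample_covariance(centred, squared))
```

The last case is a reminder that a covariance of zero only rules out a linear relationship.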
Coefficient of Correlation
Measures the relative strength of the linear relationship between two variables
Sample correlation coefficient:
$$r = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$
Features of Correlation Coefficient, r
Unit free
Ranges between -1 and 1
The closer to -1, the stronger the negative linear relationship
The closer to 1, the stronger the positive linear relationship
The closer to 0, the weaker the linear relationship
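A minimal Python sketch of the sample correlation coefficient as covariance divided by the product of the standard deviations. The data are made-up illustrative values and the helper names are hypothetical; the second call rescales y to show that r is unit free:

```python
from math import sqrt

def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

def sample_sd(values):
    """Sample standard deviation; the covariance of a variable with itself is its variance."""
    return sqrt(sample_covariance(values, values))

def correlation(xs, ys):
    """r = Cov(x, y) / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (sample_sd(xs) * sample_sd(ys))

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = correlation(x, y)
r_rescaled = correlation(x, [100 * v for v in y])  # same data, different units

print(round(r, 3), round(r_rescaled, 3))
print(round(correlation(x, [10, 8, 6, 4, 2]), 3))  # perfectly negative linear relationship
```

Unlike the covariance, r does not change when a variable is measured in different units, which makes it comparable across data sets.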
Interpreting the Correlation Coefficient, r
Scatter Plots of Data with Various Correlation Coefficients
[Figure: six scatter plots of Y against X]
r = -1, Cov < 0
r = -.6, Cov < 0
r = 0, Cov = 0
r = +.3, Cov > 0
r = +1, Cov > 0
r = 0, Cov = 0
502B
Fun with Graphs
Does your mindset match my dataset! http://www.ted.com/talks/hans_rosling_at_state.html
Looking ahead
SRs to client (cc) and Turnitin on Wednesday by noon
No class next week
Work on 598 critiques
598 critiques due in class & Turnitin Nov. 30
Comments on your SRs will be ready Nov. 30
Final SRs (if required) due Dec. 8 @ 11:55 PM PST
Note carefully the requirements
Moodle site will be inaccessible sometime in December
Final grades reported via usource once approved by the Director