Lecture 1: Introduction
Math Boot Camp
Will Terry
Department of Political Science, University of Oregon
September 16, 2013
Objectives of Math Camp
Have a good time learning about the wonders of math(s)!
Get ready for PS545-546….
Objectives of PS545-546
• The objectives of our sequence are twofold:
(1.) to improve your ability to read mainstream quantitative research, and
(2.) to provide a broad overview of the main tools of quantitative analysis.
• We will focus on the linear regression model.
• You will become familiar with Stata.
Statistical software
• This course will focus on practical computing skills that you might find useful in your future research.
– There are reasons to spend some time with R to appreciate the capabilities of statistical computing.
– Given the limited time, we will focus on developing Stata skills as much as possible.
• We will master the basic components of statistical computing.
– Data management
– Estimating regression models
– Graphing
The standard political science stats education
I. Basic probability theory- random variables- PDFs-CDFs
II. Statistical inference theory- confidence intervals, hypothesis testing, p-values, etc.
III. Linear regression analysis - the workhorse model of the social sciences
IV. Binary Outcome Models & Other Extensions of the Basic Linear Model
V. Time Series Cross Sectional Models
First, some key terms…
Causality: Phenomenon Y (e.g., income) is affected by factor X (e.g., gender).
Statistical inference: Drawing conclusions about the world based on characteristics of sample data. Typically we are interested in understanding “population parameters.”
Independent variable (syn. “regressor”, RHS var): The variable that is exogenously manipulated or changed.
Dependent variable (syn. “regressand”, LHS var): Its value “depends” on the value taken by the independent variables.
Random variables and hypothesis testing
Random Variable (RV): A variable whose values are determined by chance.
Probability Density Function (PDF): Describes how an RV is “distributed”, i.e., how likely it is that the RV takes any particular value.
Parameter: Characteristic or measure that describes a population.
Statistic (not to be confused with Statistics): Characteristic or measure obtained from a sample.
Common ways to distinguish variables
Qualitative Variables: Variables that take non-numerical values (e.g., eye color; gun ownership).
Quantitative Variables: Variables that take numerical values (e.g., number of credit cards in one’s wallet; time elapsed since the Compromise of 1877).
Discrete Variables: Variables which assume a finite or countable number of possible values. Usually obtained by counting (e.g., the number of credit cards in one’s wallet).
Continuous Variables: Variables which can assume any value in an interval, so the set of possible values is uncountably infinite. Usually obtained by measurement (e.g., time elapsed since the Compromise of 1877).
Hypothesis testing terminology
Population: All subjects possessing a common characteristic that is being studied.
Sample: A subgroup or subset of the population.
Statistics: Collection of methods for planning experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, interpreting, and drawing conclusions.
Hypothesis testing
Research design
• Research design is the means by which we attempt to uncover causal relationships between variables using data that we collect.
• In the jargon of the trade, the objective is to “identify” the effect of a “treatment.”
• Conceptually, one wants to make a comparison between two identical subjects—one who received the treatment, and one who did not.
• A pure experiment is the gold standard. Unfortunately, this ideal is generally infeasible in the social sciences.
Language of research design
Treatment group: The group that receives the treatment.
Control group: The group that does not receive the treatment.
Experimental data: Data derived from a process whereby the researcher determines the receipt of the treatment.
Non-experimental data (syn. “observational data”): Data in which the administration of the treatment is determined by factors beyond the researcher’s control.
The standard political science stats education
I. Basic probability theory- random variables- PDFs-CDFs
II. Statistical inference theory- confidence intervals, hypothesis testing, p-values, etc.
III. Linear regression analysis - the workhorse model of the social sciences
IV. Binary Outcome Models & Other Extensions of the Basic Linear Model
V. Time Series Cross Sectional Models
Linear regression analysis
A. Univariate regression model
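$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$ (There is one IV)
B. Multivariate regression model
$y_i = \beta_0 + \beta_1 x_i + \beta_2 z_i + \varepsilon_i$ (There are two IVs)
$y_i = \beta_0 + \beta_1 x_{1i} + \dots + \beta_N x_{Ni} + \varepsilon_i$ (There are N IVs)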
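To make the one-IV model concrete, here is a minimal sketch of how the coefficients could be estimated. It assumes ordinary least squares as the estimator (the slides have not named one yet) and uses made-up data; the variable names `beta0_hat` and `beta1_hat` are illustrative only.

```python
# Hypothetical data: x is the independent variable, y the dependent variable.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Closed-form least-squares estimates for y_i = beta0 + beta1 * x_i + e_i
# (assumption: OLS is the estimator being illustrated).
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Residuals are the estimated counterparts of the error term e_i.
residuals = [yi - (beta0_hat + beta1_hat * xi) for xi, yi in zip(x, y)]

print(f"beta0_hat = {beta0_hat:.2f}, beta1_hat = {beta1_hat:.2f}")
```

With an intercept in the model, the least-squares residuals sum to zero, which is a quick sanity check on any hand computation.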
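IV. Binary dependent variable models
Used when the dependent variable takes one of two possible values:
$$\text{Democrat}_i = \begin{cases} 1 & \text{if citizen } i \text{ is a Democrat} \\ 0 & \text{if citizen } i \text{ is not a Democrat} \end{cases}$$
$\text{Democrat}_i = f(\text{gender}_i, \text{income}_i, \text{race}_i, \text{age}_i)$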
V. Time series cross sectional models
State      Year   GDP per capita   Ave. Education
Alabama    1970   $5,000           10.3 years
Alabama    1980   $9,500           11.2 years
Alabama    1990   $11,200          12.4 years
Illinois   1970   $7,000           9.3 years
Illinois   1980   $12,500          10.2 years
Illinois   1990   $17,200          13.7 years
New York   1970   $6,000           8.4 years
New York   1980   $11,500          10.1 years
New York   1990   $18,00           14.5 years
When the researcher observes the objects of analysis at multiple points in time.
(These data have both time series and cross section features.)
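As a rough illustration of what "both time series and cross section features" means in practice, the sketch below stores a few of the state-year rows from the table in "long" format (one record per state-year). The field names `gdp_pc` and `educ` are made up for this example, and only the Alabama and Illinois rows are used to keep it short.

```python
from collections import defaultdict

# One record per state-year observation (values copied from the table;
# field names are illustrative assumptions).
panel = [
    {"state": "Alabama", "year": 1970, "gdp_pc": 5000, "educ": 10.3},
    {"state": "Alabama", "year": 1980, "gdp_pc": 9500, "educ": 11.2},
    {"state": "Alabama", "year": 1990, "gdp_pc": 11200, "educ": 12.4},
    {"state": "Illinois", "year": 1970, "gdp_pc": 7000, "educ": 9.3},
    {"state": "Illinois", "year": 1980, "gdp_pc": 12500, "educ": 10.2},
    {"state": "Illinois", "year": 1990, "gdp_pc": 17200, "educ": 13.7},
]

# Index the rows by state, then by year.
by_state = defaultdict(dict)
for row in panel:
    by_state[row["state"]][row["year"]] = row

# Time-series view: change over time within each state.
gdp_change = {
    state: obs[1990]["gdp_pc"] - obs[1970]["gdp_pc"]
    for state, obs in by_state.items()
}

# Cross-section view: compare states within a single year.
gdp_1980 = {row["state"]: row["gdp_pc"] for row in panel if row["year"] == 1980}

print(gdp_change)
print(gdp_1980)
```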
What we won’t cover in PS545-6 but might be useful in your dissertation, future research, etc.
A. Maximum likelihood estimation (MLE) and other procedures
B. Model selection
C. Simultaneous equations/IV estimation
D. Matching
E. Non-parametric models
F. Case study selection for qualitative research
And much, much more!
Causality and research design
• Causality is often difficult to determine (wait for the next slide), which is why research design is important.
• An experiment is the gold standard.
• If a treated subject and a control subject are the same in every respect (as they are in a perfect experiment), we can logically attribute any difference in the observed outcome to receipt of the treatment.
• In the social sciences, we generally can’t run experiments so we use statistical techniques to make the treatment and control group as alike as we can.
Common difficulties in determining causality
One variable causes another, but how do you know which is causal?
Douglas firs <-- ? --> Rainfall
Two variables cause each other.
Expected closeness of race <--> Candidate expenditures
Common difficulties in determining causality
An omitted third variable causes both. (One reason correlation ≠ causation.)
Old age --> Bad Driving
Old age --> Gray Hair
If one were to look only at the relationship between Bad Driving and Gray Hair, one might be led to the erroneous conclusion that Gray Hair causes people to drive badly (or that Bad Driving causes one to have Gray Hair).
How could one test these competing hypotheses?
Recall the relationship between ice cream consumption and the NY homicide rate…
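One way to think about testing the gray hair hypotheses is to compare drivers with and without gray hair within the same age group. The simulation below is a sketch under invented assumptions: the probabilities are made up, and by construction gray hair has no direct effect on driving. It shows an overall gap in bad-driving rates that mostly disappears once age is held fixed.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible


def simulate_person():
    """Old age raises the chance of both gray hair and bad driving.
    Gray hair has no direct effect on driving in this simulation."""
    old = random.random() < 0.5
    gray = random.random() < (0.8 if old else 0.1)
    bad = random.random() < (0.6 if old else 0.2)
    return old, gray, bad


people = [simulate_person() for _ in range(20_000)]


def bad_driving_rate(group):
    """Share of bad drivers in a list of (old, gray, bad) tuples."""
    return sum(bad for _, _, bad in group) / len(group)


def gray_gap(group):
    """Bad-driving rate of gray-haired people minus that of everyone else."""
    gray = [p for p in group if p[1]]
    not_gray = [p for p in group if not p[1]]
    return bad_driving_rate(gray) - bad_driving_rate(not_gray)


gap_overall = gray_gap(people)                              # ignores age
gap_among_old = gray_gap([p for p in people if p[0]])       # old people only
gap_among_young = gray_gap([p for p in people if not p[0]]) # young people only

print(f"overall gap:     {gap_overall:.3f}")
print(f"gap among old:   {gap_among_old:.3f}")
print(f"gap among young: {gap_among_young:.3f}")
```

The overall gap is large and positive, while the within-age gaps are close to zero, which is the pattern one would expect if old age is the omitted third variable.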
A research design schematic
R denotes randomized assignment.
N denotes non-randomized assignment.
X denotes receipt of the treatment.
O denotes that the subject is tested.
Some basic mathematical tools
We will review some basic mathematical tools:
- Functions
- Summation operators
- Differential Calculus
Functions
A function is a rule that assigns exactly one value to each input of a specified type.
A function expresses the intuitive idea that one quantity (the argument of the function, also known as the input) completely determines another quantity (the value, or the output).
Summation operators
Summation operators are a useful way to represent the sum of a large set of numbers:
$$\sum_{i=1}^{N} x_i = x_1 + x_2 + \dots + x_{N-1} + x_N$$
The index i indicates which numbers in the set are to be included in the sum.
The product operator works in a similar fashion.
$$\prod_{i=1}^{N} x_i = x_1 \times x_2 \times \dots \times x_N$$
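The sum and product operators map directly onto a loop over the index i. The sketch below uses a small made-up data set (not the one from the exercises) and shows how restricting the index changes which terms are included.

```python
import math

x = [2, 3, 4, 5]  # hypothetical data: x_1, x_2, x_3, x_4
N = len(x)

# Python lists start at position 0, so x_i is stored at x[i - 1].
total = sum(x[i - 1] for i in range(1, N + 1))          # sum over i = 1..N
product = math.prod(x[i - 1] for i in range(1, N + 1))  # product over i = 1..N

# Restricting the index changes which terms enter the sum.
partial = sum(x[i - 1] for i in range(2, 4))                    # i = 2..3
odd_only = sum(x[i - 1] for i in range(1, N + 1) if i % 2 == 1) # odd i only

print(total, product, partial, odd_only)  # 14 120 7 6
```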
Summation operators
Suppose your data were $\{x_1, x_2, x_3, x_4, x_5, x_6, x_7\} = \{-100, -10, -1, 0, 1, 10, 100\}$.
Compute the following:
$\sum_{i=1}^{7} x_i$
$\sum_{i=1}^{3} x_i$
$\sum_{i=3}^{5} x_i$
$\sum_{i=1}^{7} 8(x_i)$
$\sum_{i \neq 3} \frac{x_i}{4}$
$\sum_{i \text{ is an odd number}} x_i$
$\prod_{i=1}^{7} x_i$
$\prod_{i \geq 6} x_i$
Sample mean and sample variance
Every population has a mean (μ) and a variance (σ²); note this implies it has a standard deviation (σ) as well.
The population mean tells you where the population is “centered.” There’s a sense in which the mean is the middle of the data.
The population variance (or standard deviation) measures how far “spread out” individuals in the population are. (The variance and standard deviation are always non-negative.)
The sample mean and sample variance are two fundamental statistics. They estimate the parameters of the population the data were drawn from.
$$\hat{\mu} = \frac{1}{N} \sum_{i=1}^{N} x_i$$
$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{\mu})^2$$
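The two formulas can be computed in a few lines. This is a sketch with made-up data; it divides by N, as on the slide, rather than by N − 1.

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical sample
N = len(x)

# Sample mean: (1/N) * sum of x_i
mu_hat = sum(x) / N

# Sample variance with the 1/N divisor used on the slide
sigma2_hat = sum((xi - mu_hat) ** 2 for xi in x) / N

# Standard deviation is the square root of the variance
sigma_hat = sigma2_hat ** 0.5

print(mu_hat, sigma2_hat, sigma_hat)  # 5.0 4.0 2.0
```

In Python's standard library, `statistics.pvariance` uses the same 1/N divisor, whereas `statistics.variance` divides by N − 1, so the two can differ noticeably in small samples.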
Derivatives
Loosely speaking, a derivative can be thought of as how much one quantity is changing in response to changes in some other quantity.
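As a rough numerical illustration of this idea, the sketch below nudges the input of a function slightly and measures how much the output changes per unit of input. The function f(x) = x² and the helper name `numerical_derivative` are chosen only for this example.

```python
def f(x):
    """Example function: f(x) = x squared."""
    return x ** 2


def numerical_derivative(func, x, h=1e-6):
    """Approximate the derivative of func at x with a central difference:
    the change in output divided by the change in input over a small step."""
    return (func(x + h) - func(x - h)) / (2 * h)


slope_at_3 = numerical_derivative(f, 3.0)
print(round(slope_at_3, 4))  # close to 6, since the derivative of x^2 is 2x
```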
Integrals
A definite integral of a function can be represented as the signed area of the region bounded by its graph.
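The "signed area" idea can be approximated by adding up the areas of many thin rectangles. The sketch below uses a midpoint Riemann sum; the helper name `riemann_sum` and the two example functions are assumptions made for illustration.

```python
def riemann_sum(func, a, b, n=10_000):
    """Approximate the definite integral of func over [a, b] by summing
    the areas of n thin rectangles, each evaluated at its midpoint."""
    width = (b - a) / n
    return sum(func(a + (k + 0.5) * width) for k in range(n)) * width


# Area under f(x) = x^2 between 0 and 1 (the exact value is 1/3).
area_square = riemann_sum(lambda x: x ** 2, 0.0, 1.0)

# "Signed" area: for f(x) = x on [-1, 1], the region below the axis
# cancels the region above it, so the integral is zero.
area_signed = riemann_sum(lambda x: x, -1.0, 1.0)

print(round(area_square, 6), round(area_signed, 6))
```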
Math Camp game plan: Time to get down to business…
In the remainder of this lecture we will discuss some elementary results in a branch of mathematics called Real Analysis—i.e., the branch of math that studies real numbers.
Q: Why do we care about Real Analysis?
A: Because it provides the logical structure that undergirds the math we use as social scientists.
The next few slides follow a text that is slightly more advanced than we need, but let’s follow along to develop a few ideas about the real number line…
The set of real numbers: Special symbols
The real number line
The set of real numbers: Properties
Inequalities
Roots
A cheat sheet of handy rules re real numbers
(see the Math Camp website for the complete sheet)
Quadratic equations
Absolute value
Achilles and the tortoise
Bounds
Intervals
Next lecture…
Functions and graphs
- Functions
- Graphs
- Functional forms