Applied Statistics and Econometrics: Lecture 1

Saul Lach

September 2017

Saul Lach () Applied Statistics and Econometrics September 2017 1 / 55

Outline of Lecture 1

1. What is Econometrics? What do we use it for? (SW 1.1-1.2)
2. Software and types of data (SW 1.3)
3. Example of statistical analysis and review of basic statistics
4. Representation of data: numerical and graphical
5. Bivariate data analysis (parts of SW 2.3)


We seek quantitative answers

Economics suggests important relationships between variables:

1. What is the quantitative effect of reducing class size on student achievement?
2. How does another year of education change earnings?
3. What is the price elasticity of cigarettes?
4. What is the effect on output growth of a 1 percentage point increase in interest rates by the ECB?
5. What is the effect on housing prices of environmental improvements?
6. Are CEOs rewarded according to performance? How?

But Economics virtually never suggests quantitative magnitudes of causal effects between variables.


What is Econometrics?

Econometrics addresses this last point.

Econometrics is the science and art of using economic and statistical theory to analyze economic data and provide quantitative answers.

Econometrics consists of a set of mathematical/statistical tools to perform such analysis.

The main tool is called “regression analysis” and this will be the focus of the course.

Econometric methods are used in many areas of Economics: finance, micro and macro, labor, industrial organization, history.

Econometric methods are used not only in Economics but, increasingly, in other social sciences such as Sociology (criminology) and Political Science (voting patterns).


This course is about using data to measure causal effects (SW 1.2)

Causality is defined by the outcome of a randomized controlled experiment.

This experiment ensures that changes in an outcome are associated with changes in a variable subject to controlled changes and with nothing else.

Ideally, we would like to use an experiment to generate data and use them to estimate causal effects.

What would be an experiment to estimate the effect of class size on standardized test scores?

But almost always we only have observational (nonexperimental) data:

returns to education
cigarette prices
monetary policy

Most of the course deals with difficulties arising from using observational data to estimate causal effects:

correlation between two variables does not imply causation between them.


What do we use Econometrics for?

We use Econometrics to:

1. test economic theories (does a minimum wage decrease employment?)
2. estimate economic relationships (demand function price elasticity)
3. forecast economic variables (firm’s sales, economy’s inflation rate)
4. make policy recommendations to businesses or government (increase cigarette taxes?)


In this course you will:

Learn methods for estimating causal effects using observational data;

Focus on applications - theory is used only as needed to understand the “why”s of the methods;

Learn to evaluate the regression analysis of others - this means you will be able to read/understand empirical economics papers in other economics courses;

Get some hands-on experience with regression analysis in problem sets.


Data and statistical software (SW 1.3)

We will use a lot of data.

Data come in files (Excel files, etc.)

Need statistical software to analyze data (e.g.: Excel, Stata, SPSS, SAS)

We will use “Stata”, which is available at the computer labs. Students can also purchase it at a very attractive price.


Types of data

Cross section: data on different entities or units (individuals, firms, countries, etc.) collected at a single point in time.

Time series: data on a single entity or unit (individual, firm, country, etc.) collected over multiple time periods.

Panel (longitudinal) data: data on different entities or units collected over multiple time periods.


Cross sectional data: CEO salaries

Salaries of CEOs in US firms

Name: CEOSAL1 (google it!)

Number of obs: 209

Description of variables:

1. salary: 1990 salary, thousands $
2. pcsalary: % change salary, 89-90
3. sales: 1990 firm sales, millions $
4. roe: return on equity, 88-90 avg
5. pcroe: % change roe, 88-90
6. ros: return on firm’s stock, 88-90
7. indus: =1 if industrial firm
8. finance: =1 if financial firm
9. consprod: =1 if consumer product firm
10. utility: =1 if transportation or utilities
11. lsalary: natural log of salary
12. lsales: natural log of sales


Cross sectional data: CEO salaries

. l salary pcsalary sales roe pcroe ros indus finance consprod in 1/15

      salary  pcsalary    sales   roe  pcroe  ros  indus  finance  consprod
  1.    1095        20    27595  14.1  106.4  191      1        0         0
  2.    1001        32     9958  10.9  -30.6   13      1        0         0
  3.    1122         9   6125.9  23.5  -16.3   14      1        0         0
  4.     578        -9    16246   5.9  -25.7  -21      1        0         0
  5.    1368         7  21783.2  13.8     -3   56      1        0         0
  6.    1145         5   6021.4    20      1   55      1        0         0
  7.    1078        10   2266.7  16.4   -5.9   62      1        0         0
  8.    1094         7   2966.8  16.3   -1.6   44      1        0         0
  9.    1237        16   4570.2  10.5  -70.2   37      1        0         0
 10.     833         5     2830  26.3  -23.9   37      1        0         0
 11.     567         7    596.8  25.9   39.5  109      1        0         0
 12.     933        -3    19773  26.8  -26.8  -10      1        0         0
 13.    1339        -9    40047  14.8   12.1   41      1        0         0
 14.     937         9   2513.8  22.3    9.8   44      1        0         0
 15.    2011        49   1580.6  56.3   62.2   63      1        0         0


Cross sectional data: California schools


Contains data from http://fmwww.bc.edu/ec-p/data/stockwatson/caschool.dta
  obs:    420
 vars:     18                          29 Mar 2002 07:13
 size: 60,060 (99.9% of memory free)
-------------------------------------------------------------------------------
                  storage  display    value
variable name       type   format     label      variable label
-------------------------------------------------------------------------------
observation_n~r     float  %9.0g
dist_cod            float  %9.0g
county              str18  %18s
district            str53  %53s
gr_span             str8   %8s
enrl_tot            float  %9.0g
teachers            float  %9.0g
calw_pct            float  %9.0g
meal_pct            float  %9.0g
computer            float  %9.0g
testscr             float  %9.0g
comp_stu            float  %9.0g
expn_stu            float  %9.0g
str                 float  %9.0g
avginc              float  %9.0g
el_pct              float  %9.0g
read_scr            float  %9.0g
math_scr            float  %9.0g
-------------------------------------------------------------------------------


Cross sectional data: California schools


THE CALIFORNIA TEST SCORE DATA SET

The California Standardized Testing and Reporting (STAR) dataset contains data on test performance, school characteristics and student demographic backgrounds. The data used here are from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999. Test scores are the average of the reading and math scores on the Stanford 9 standardized test administered to 5th grade students.

School characteristics (averaged across the district) include enrollment, number of teachers (measured as “full-time-equivalents”), number of computers per classroom, and expenditures per student. The student-teacher ratio used here is the number of students in the district, divided by the number of full-time equivalent teachers.

Demographic variables for the students also are averaged across the district. The demographic variables include the percentage of students in the public assistance program CalWorks (formerly AFDC), the percentage of students that qualify for a reduced price lunch, and the percentage of students that are English Learners (that is, students for whom English is a second language).

All of these data were obtained from the California Department of Education (www.cde.ca.gov).


Cross sectional data: California schools

district                         gr_span  testscr  enrl_tot  teachers       str
Sunol Glen Unified               KK-08      690.8       195      10.9  17.88991
Manzanita Elementary             KK-08      661.2       240     11.15  21.52466
Thermalito Union Elementary      KK-08      643.6      1550      82.9  18.69723
Golden Feather Union Elementary  KK-08      647.7       243        14  17.35714
Palermo Union Elementary         KK-08     640.85      1335      71.5  18.67133
Burrel Union Elementary          KK-08     605.55       137       6.4  21.40625
Holt Union Elementary            KK-08     606.75       195        10      19.5
Vineland Elementary              KK-08        609       888      42.5  20.89412
Orange Center Elementary         KK-08      612.5       379        19  19.94737
Del Paso Heights Elementary      KK-06     612.65      2247       108  20.80556


Time series: Dow Jones index


Time series graphical representation


Panel data: smoking

Cigarette consumption and prices. 48 US states over time, 1985-1995

Name: Cigarettes

Number of obs: 528 (48 × 11)

1. state: State
2. year: Year
3. cpi: Consumer price index
4. population: State population
5. packs: Number of packs per capita
6. income: State personal income (total, nominal)
7. tax: Average state, federal and average local excise taxes for fiscal year
8. price: Average price during fiscal year, including sales tax
9. taxs: Average excise taxes for fiscal year, including sales tax


Panel data: smoking

Observation
number       state  year  packs  price  taxs
  1          AL     1985  116.5  1.022  0.333
  2          AR     1985  128.5  1.015  0.370
  3          AZ     1985  104.5  1.086  0.362
  ⁞
(now start observations for 1986)
 49          AL     1986  117.2  1.080  0.334
 50          AR     1986  127.7  1.091  0.370
 51          AZ     1986  103.3  1.162  0.365
  ⁞
526          WI     1995   92.5  2.014  0.716
527          WV     1995  115.6  1.665  0.504
528          WY     1995  112.2  1.585  0.360
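The listing stacks all 48 states for 1985, then all 48 for 1986, and so on, which implies a simple rule for locating an observation. A minimal sketch in Python (not the course's Stata); the function name `obs_number` and the notion of a state's position within a year are illustrative assumptions, not part of the dataset:

```python
N_STATES = 48       # states in the panel (from the slide)
FIRST_YEAR = 1985   # first year in the panel
N_YEARS = 11        # 1985 through 1995 inclusive


def obs_number(state_position: int, year: int) -> int:
    """Row number of a (state, year) pair when rows are sorted by year, then state.

    state_position is the state's 1-based rank within a year (hypothetical input:
    AL is 1, AZ is 3 and WY is 48 in the listing).
    """
    if not 1 <= state_position <= N_STATES:
        raise ValueError("state_position must be between 1 and 48")
    if not FIRST_YEAR <= year < FIRST_YEAR + N_YEARS:
        raise ValueError("year must be between 1985 and 1995")
    return (year - FIRST_YEAR) * N_STATES + state_position


total_obs = N_STATES * N_YEARS  # 48 x 11
print(total_obs, obs_number(1, 1986), obs_number(48, 1995))
```

Under these assumptions the rule reproduces the observation numbers in the listing: AL in 1986 is row 49 and WY in 1995 is row 528.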


Example of policy-driven statistical analysis

Empirical problem: Class size and educational output

Policy question: What is the effect on test scores (or some other outcome measure) of reducing class size by one student per class? By 8 students per class?

We must use data to find out (is there any way to answer this without data?)

Use California Test Score Data Set (CAS data)

All K-6 and K-8 California school districts (n = 420)

Variables (obtained from data file shown earlier). Focus on:

5th grade test scores (Stanford-9 achievement test, combined math and reading), district average.

Student-teacher ratio (STR) = number of students in the district divided by number of full-time equivalent teachers.


First look at the CAS data


Slight discrepancy between the text and Stata file

. tabstat str testscr, stat(mean sd p10 p25 p50 p75 p90) columns(statistics) format(%6.0g)

variable    mean     sd    p10    p25    p50    p75    p90
str        19.64  1.892  17.35  18.58  19.72  20.87  21.88
testscr    654.2  19.05  630.4    640  654.4  666.7  679.1


Reminder: sample statistics

A random variable $Y$ (e.g., test score).

A sample of size $n$ for $Y$ is denoted by

$$\{Y_1, Y_2, \ldots, Y_n\}$$

$Y_1$ is the first observation, $Y_2$ is the second observation, etc.

For cross section data the typical observation is denoted by $Y_i$.

For time series data the typical observation is usually denoted by $Y_t$.


Data summary

Given a sample of size n on some variable Y we want to summarize the data.

Central tendency: mean and median of a variable (e.g., STR, test score) in the sample.

Dispersion: variance and standard deviation of a variable (e.g., STR, test score) in the sample.

Position: quartiles, deciles, percentiles of a variable (e.g., STR, test score) in the sample.


Central tendency: sample mean

The leading measure of central tendency is the sample mean, which is the arithmetic average of the data.

For a sample of size $n$, the sample mean is

$$\bar{Y} = (Y_1 + Y_2 + \ldots + Y_n)/n$$

Often, this formula is abbreviated using the summation operator ∑:

$$\bar{Y} = \sum_{i=1}^{n} Y_i / n$$
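As an illustration of the formula, the sketch below averages the ten district test scores listed in the California schools table earlier in the lecture. It is written in Python rather than the course's Stata, and the function name `sample_mean` is an illustrative choice:

```python
# District-average test scores (testscr) for the ten districts listed earlier
scores = [690.8, 661.2, 643.6, 647.7, 640.85, 605.55, 606.75, 609.0, 612.5, 612.65]


def sample_mean(values):
    """Arithmetic average: the sum of the observations divided by the sample size n."""
    n = len(values)
    if n == 0:
        raise ValueError("the sample mean is undefined for an empty sample")
    return sum(values) / n


mean_score = sample_mean(scores)
print(round(mean_score, 2))  # 633.06
```

These ten districts are only the first rows of the file, so their mean differs from the full-sample mean of 420 districts.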


Central tendency: sample median

The other leading indicator of central tendency is the sample median: the value that divides the ordered observations into two halves, the median being the midpoint.

The median is relatively easy to calculate:

Odd number of data values (n is odd)

1. Arrange data in order from smallest to largest
2. Find the data value in the exact middle

Even number of data values (n is even)

1. Arrange data in order from smallest to largest
2. Find the mean of the two middle numbers
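The two cases can be sketched in a few lines of Python (not the course's Stata). The data are the ten district test scores listed in the California schools table earlier; dropping the last one gives an odd-sized sample. The function name `sample_median` is an illustrative choice:

```python
def sample_median(values):
    """Median: middle value if n is odd, mean of the two middle values if n is even."""
    ordered = sorted(values)  # step 1: arrange from smallest to largest
    n = len(ordered)
    if n == 0:
        raise ValueError("the median is undefined for an empty sample")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                        # exact middle value
    return (ordered[mid - 1] + ordered[mid]) / 2   # mean of the two middle values


scores = [690.8, 661.2, 643.6, 647.7, 640.85, 605.55, 606.75, 609.0, 612.5, 612.65]

median_even = sample_median(scores)      # n = 10: mean of 612.65 and 640.85
median_odd = sample_median(scores[:9])   # n = 9: the 5th smallest value
print(round(median_even, 2), median_odd)
```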


Dispersion: sample variance and standard deviation

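The sample variance is the average of the squared deviations of the data from the mean (dividing by $n-1$ rather than $n$):

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2$$

The sample standard deviation is the square root of the sample variance:

$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(Y_i - \bar{Y})^2}$$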
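A minimal Python sketch of these two formulas (not the course's Stata), applied to the ten district test scores listed in the California schools table earlier. The function names are illustrative choices:

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("the sample variance needs at least two observations")
    mean = sum(values) / n
    return sum((y - mean) ** 2 for y in values) / (n - 1)


def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


scores = [690.8, 661.2, 643.6, 647.7, 640.85, 605.55, 606.75, 609.0, 612.5, 612.65]

var_scores = sample_variance(scores)
sd_scores = sample_sd(scores)
print(round(var_scores, 2), round(sd_scores, 2))
```

The standard deviation is in the same units as the data (test score points), while the variance is in squared units.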


Test score-STR data


Position: sample quartiles and deciles

The sample quartiles divide the data into 4 parts:

the lower quartile (Q1) is the point where one-quarter of the ordered sample lies below and three-quarters of the ordered sample lies above;
the middle quartile (Q2) is the sample median;
the upper quartile (Q3) is the point where three-quarters of the ordered sample lies below and one-quarter of the ordered sample lies above.

Even more detailed divisions of the sample are possible.

Deciles split the ordered sample into tenths and are used, for example, to summarize the distribution of individual income.

Percentiles split the ordered sample into hundredths. The pth percentile is the value for which p percent of the observed values are equal to or less than that value.
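One way to turn the percentile definition into a computation is the nearest-rank rule: take the smallest sample value such that at least p percent of the observations are less than or equal to it. The Python sketch below (not the course's Stata) assumes that rule; statistical packages use several conventions, some of which average or interpolate between neighbouring values, so their output can differ slightly, and for an even sample size this rule's 50th percentile need not equal the median computed as the mean of the two middle values. The function name `percentile` is an illustrative choice:

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("percentiles are undefined for an empty sample")
    if not 0 < p <= 100:
        raise ValueError("p must satisfy 0 < p <= 100")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based position in the ordered sample
    return ordered[rank - 1]


# Ten district test scores listed in the California schools table earlier
scores = [690.8, 661.2, 643.6, 647.7, 640.85, 605.55, 606.75, 609.0, 612.5, 612.65]

q1 = percentile(scores, 25)   # lower quartile under the nearest-rank rule
q3 = percentile(scores, 75)   # upper quartile under the nearest-rank rule
print(q1, q3)
```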


Test score-STR data


Back to statistical analysis: do districts with smaller classes have higher test scores?

Table describes data but doesn’t tell us anything about the relationship between test scores and the STR.

Scatterplot of test score vs. student-teacher ratio (STR) is a first step towards showing relationship between test scores and class size (as measured by STR).


Scatterplot of test score vs. str

twoway (scatter testscr str, sort)

[Figure: scatterplot of testscr (vertical axis, 600 to 700) against str (horizontal axis, 15 to 25)]


Statistical analysis

Scatterplot is suggestive of a negative relationship... but this is not enough.

We need to get some numerical evidence on whether districts with low STRs have higher test scores.

Divide all districts into two groups: low (< 20) STRs and high (≥ 20) STRs.

Compute and compare mean scores by group of districts:

              Test score
STR        n    mean     sd
Small    238  657.35  19.36
Large    182  649.98  17.85
All      420  654.16  19.05

Is there a difference in test scores between low-STR and high-STR districts?
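The three rows of the table are linked: the overall mean is the average of the two group means weighted by group size. A short Python check of that arithmetic (not the course's Stata), using only the numbers in the table; the variable names are illustrative:

```python
# (n, mean test score) for each group of districts, taken from the table
groups = {"Small": (238, 657.35), "Large": (182, 649.98)}

n_all = sum(n for n, _ in groups.values())
# Size-weighted average of the group means
mean_all = sum(n * mean for n, mean in groups.values()) / n_all

print(n_all, round(mean_all, 2))  # 420 654.16
```

The same shortcut does not apply to the standard deviation, which also depends on how far apart the group means are.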


Statistical analysis

1. Estimation: compute and compare average test scores in districts with low STRs ($\bar{Y}_{small}$) to those with high STRs ($\bar{Y}_{large}$):

$$\Delta = \bar{Y}_{small} - \bar{Y}_{large}$$

2. Hypothesis testing: test the “null” hypothesis that the mean test scores in the two types of districts are the same, against the “alternative” hypothesis that they differ:

$$H_0: \Delta = 0 \quad \text{vs} \quad H_1: \Delta \neq 0$$

3. Confidence interval: present an interval estimate for the difference in the mean test scores $\Delta$.


Estimation

$$\bar{Y}_{small} - \bar{Y}_{large} = \frac{1}{n_{small}}\sum_{i \in small} Y_i - \frac{1}{n_{large}}\sum_{i \in large} Y_i = 657.35 - 649.98 = 7.37$$

Is this a large difference in a real-world sense?

Standard deviation across districts = 19.05

Is this a big enough difference to be important for school reform discussions,for parents, or for a school committee?
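One rough way to judge the size of the difference is to express it in units of the standard deviation of test scores across districts. A Python sketch of that arithmetic (not the course's Stata), using only the numbers on the slide; the variable names are illustrative:

```python
mean_small = 657.35   # mean test score, districts with STR < 20
mean_large = 649.98   # mean test score, districts with STR >= 20
sd_all = 19.05        # standard deviation of test scores across all districts

diff = mean_small - mean_large
diff_in_sd = diff / sd_all   # difference measured in standard deviations

print(round(diff, 2), round(diff_in_sd, 2))  # 7.37 0.39
```

Whether a gap of roughly 0.39 standard deviations matters for policy is a judgment the arithmetic alone cannot settle.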


Hypothesis testing (SW 3.4)

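Difference-in-means test. Compute the famous “t-statistic”:

$$t = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{\sqrt{\frac{s^2_{small}}{n_{small}} + \frac{s^2_{large}}{n_{large}}}} = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{SE(\bar{Y}_{small} - \bar{Y}_{large})}$$

where $SE(\bar{Y}_{small} - \bar{Y}_{large})$ is the standard error of $\bar{Y}_{small} - \bar{Y}_{large}$ and each sample variance is computed around its own group mean:

$$s^2_{small} = \frac{1}{n_{small} - 1}\sum_{i \in small}(Y_i - \bar{Y}_{small})^2, \qquad s^2_{large} = \frac{1}{n_{large} - 1}\sum_{i \in large}(Y_i - \bar{Y}_{large})^2.$$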


t-statistic for difference-in-means (SW 3.4)

              Test score
STR        n    mean     sd
Small    238  657.35  19.36
Large    182  649.98  17.85
All      420  654.16  19.05

$$t = \frac{\bar{Y}_{small} - \bar{Y}_{large}}{SE(\bar{Y}_{small} - \bar{Y}_{large})} = \frac{657.35 - 649.98}{\sqrt{\frac{19.36^2}{238} + \frac{17.85^2}{182}}} = \frac{7.37}{1.824} = 4.04$$

Recall that when $|t| > 1.96$ we reject the null hypothesis (at the 5% significance level) that the two means are the same.
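The calculation can be reproduced from the summary statistics in the table. A minimal Python sketch (not the course's Stata); the function name `diff_in_means_t` is an illustrative choice, and the 1.96 cutoff is the large-sample 5% critical value used on the slide:

```python
import math


def diff_in_means_t(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Return (t, SE) for the difference in means of two independent groups."""
    se = math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)  # standard error of the difference
    return (mean_a - mean_b) / se, se


# Small-STR districts versus large-STR districts (values from the table)
t_stat, se = diff_in_means_t(657.35, 19.36, 238, 649.98, 17.85, 182)
reject_at_5pct = abs(t_stat) > 1.96

print(round(se, 3), round(t_stat, 2), reject_at_5pct)  # 1.824 4.04 True
```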


Confidence interval

A 95% confidence interval for the difference between the means is

$$(\bar{Y}_{small} - \bar{Y}_{large}) \pm 1.96 \times SE(\bar{Y}_{small} - \bar{Y}_{large}) = 7.37 \pm 1.96 \times 1.824 = (3.79, 10.95)$$

Two equivalent statements:

1. The 95% confidence interval for $\bar{Y}_{small} - \bar{Y}_{large}$ doesn’t include 0;
2. The hypothesis that $\bar{Y}_{small} - \bar{Y}_{large} = 0$ is rejected at the 5% level.
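The interval and the equivalence between the two statements can be checked from the same summary statistics (group means 657.35 and 649.98, standard deviations 19.36 and 17.85, group sizes 238 and 182). A Python sketch (not the course's Stata); it uses the unrounded standard error, so the endpoints can differ from the slide's in the last digit. Variable names are illustrative:

```python
import math

diff = 657.35 - 649.98                                  # difference in sample means
se = math.sqrt(19.36 ** 2 / 238 + 17.85 ** 2 / 182)     # standard error of the difference
z = 1.96                                                # large-sample 95% critical value

lower, upper = diff - z * se, diff + z * se
contains_zero = lower <= 0 <= upper
rejects_null = abs(diff / se) > z

# The interval excludes 0 exactly when the t-test rejects at the 5% level
print(round(lower, 2), round(upper, 2), contains_zero, rejects_null)
```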


What comes next. . .

The mechanics of estimation, hypothesis testing, and confidence intervals should be familiar.

These concepts extend directly to regression analysis and its variants.

We’ll study how to cast the previous example within a regression framework... and more.

Before turning to regression, however, Lectures 2 and 3 review some of the underlying theory of estimation, hypothesis testing, and confidence intervals.

We end Lecture 1 with additional examples of data summaries/representation.


Numerical representation of data

CEO salary-sales dataset

Mean, standard deviation, median, quartiles, ... min, max, etc.

. tabstat salary sales roe, stats(mean sd min p25 p50 p75 max) columns (statistics)

variable      mean        sd    min     p25     p50   p75      max
salary     1281.12  1372.345    223     736    1039  1407    14822
sales     6923.793  10633.27  175.2  2210.3  3705.2  7177  97649.9
roe       17.18421  8.518509     .5    12.4    15.5    20     56.3

Important to know units of measurement (do not appear in this table).


Graphical representation of data

1. Cross sectional data: histogram, boxplot, pie-chart
2. Time-series data: line chart


Histogram of test scores

[Figure: histogram of testscr; vertical axis Fraction (0 to .15), horizontal axis testscr (600 to 700)]


Histogram of CEO salaries

[Figure: histogram of CEO salaries; vertical axis Fraction (0 to .8), horizontal axis 1990 salary, thousands $ (0 to 15000)]


Boxplots (box and whiskers) of CEO salaries

[Figure: boxplot of CEO salaries; axis 1990 salary, thousands $ (0 to 15,000)]

Lines are 25%, 50%, 75% quartiles and min and max values


Time series of DJIA


Bivariate data analysis (parts of SW 2.3)

Bivariate data analysis considers the relationship between two variables, such as education and income, or price and house size, or test score and STR.

Data summary tools:

1. Graphical: scatterplot (e.g., test score against STR) and other graphs
2. Numerical: covariance/correlation . . . regression analysis (later on).


Scatterplot

Data on 2 variables Y and X (5 data points)

    X      Y
-0.51   0.40
 0.70  -0.40
-1.94  -1.10
 4.30   0.41
-3.38   0.17


Scatterplot of CEO salary vs roe

[Figure: scatterplot of 1990 salary, thousands $ (vertical axis, 0 to 15000) against return on equity, 88-90 avg (horizontal axis, 0 to 60)]


Sample covariance (parts of SW 2.3)

(Sample) Covariance is a measure of the linear association between two variables, say $X$ and $Y$, in the sample:

$$s_{XY} = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})$$

$s_{XY} > 0$: X and Y tend to move together in the same direction
$s_{XY} < 0$: X and Y tend to move in opposite directions
$s_{XY} = 0$: no linear association
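As an illustration, the sketch below applies the formula to the five (X, Y) data points from the scatterplot slide earlier, and then rescales X to show how the covariance changes with the units of measurement. It is written in Python rather than the course's Stata; the function name and the factor of 100 are illustrative assumptions:

```python
def sample_covariance(xs, ys):
    """Sum of cross-products of deviations from the means, divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    if n < 2:
        raise ValueError("the sample covariance needs at least two observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# The five data points from the scatterplot slide
x = [-0.51, 0.70, -1.94, 4.30, -3.38]
y = [0.40, -0.40, -1.10, 0.41, 0.17]

s_xy = sample_covariance(x, y)
# Hypothetical change of units: measuring X in units 100 times smaller
s_xy_rescaled = sample_covariance([100 * v for v in x], y)

print(round(s_xy, 3), round(s_xy_rescaled, 1))
```

Multiplying X by 100 multiplies the covariance by 100 even though the strength of the association is unchanged, which is why the covariance alone is hard to interpret.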


Correlation coefficient

The covariance between test score and STR is negative:

$$s_{testscr,str} = -8.15932$$

Use Stata to verify this result (corr testscr str,cov).

Magnitude of covariance depends on units in which X and Y are measured.


Sample correlation coefficient

Correlation coefficient (Pearson) defined by

$$(\rho_{XY} =)\; r_{XY} = \frac{s_{XY}}{s_X s_Y}$$

where $s_X$ and $s_Y$ are sample standard deviations of $X$ and $Y$.

Does not depend on units of measurement. Verify (in Stata: corr testscr str) that

$$r_{testscr,str} = \frac{-8.15932}{19.05 \times 1.89} = -0.226$$

Symmetric measure ($r_{XY} = r_{YX}$). No direction of causality implied.

Always: $-1 \leq r_{XY} \leq 1$

1. $r_{XY} = 1$ means perfect positive linear association
2. $r_{XY} = -1$ means perfect negative linear association
3. $r_{XY} = 0$ means no linear association

$r_{XY}$ measures only linear association between $X$ and $Y$.
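The test score and STR calculation, and the claim that the correlation does not depend on units, can be checked with a few lines of Python (not the course's Stata). The inputs are the rounded covariance and standard deviations shown on the slides, so the ratio comes out near −0.2266, which agrees with the reported −0.226 only up to rounding of the inputs. The function name and the rescaling factor of 10 are illustrative assumptions:

```python
def correlation_from_moments(s_xy, s_x, s_y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    if s_x <= 0 or s_y <= 0:
        raise ValueError("standard deviations must be positive")
    return s_xy / (s_x * s_y)


s_xy = -8.15932   # sample covariance of testscr and str (from the slide)
s_test = 19.05    # sample standard deviation of testscr
s_str = 1.89      # sample standard deviation of str (rounded)

r = correlation_from_moments(s_xy, s_str, s_test)

# Hypothetical change of units: multiplying str by 10 scales both its
# covariance with testscr and its standard deviation by 10.
r_rescaled = correlation_from_moments(10 * s_xy, 10 * s_str, s_test)

print(round(r, 4), round(r_rescaled, 4))
```

The two printed values coincide because the factor of 10 cancels between numerator and denominator.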


Scatter plots and correlation coefficients