
Getting Started with Large Scale Datasets

Dr. Joni M. Lakin

Dr. Margaret Ross

Dr. Yi Han

Presentation files are available: http://www.auburn.edu/~jml0035/

(Under “Conference materials and resources” at the bottom of the page)


Opening questions
• How many of you primarily use SPSS for data analysis?
• How many are comfortable with using syntax (in SPSS or other programs)?
• How many already have plans to use a specific dataset?
• How many are just curious about what’s available?

What Data is Available?
Dr. Yi Han

U.S. National Datasets
• NCES
• Restricted use licenses: http://nces.ed.gov/nationsreportcard/researchcenter/license.aspx

International Datasets
• PISA
• PIAAC

Accessing Data and Getting Started
Dr. Margaret Ross

See PDFs

Key Issues in Working with Large Datasets
Dr. Joni Lakin

Key issues

1. Statistical weighting in SPSS

2. Practical significance and large samples

3. Matrix sampling

4. Plausible values

SPSS skills that make working with large datasets easier:

5. Keeping and managing syntax

6. Merging datasets

7. Checking for duplicate cases

8. Missing data imputation

1. Statistical weighting in SPSS
• Weights allow us to better approximate the full population
• If African American students are 18% of the population but 9% of my sample, I could weight each African American student 2.0 (so each observation counts twice in analyses) to get results that better reflect population-level effects.
• Types of weights
  • Scale weights = multiply observations to create a weighted sample the same size as the population
  • Proportional weights = may be below 1, so the weighted sample size stays the same as the actual sample size
• Note
  • When reporting results, you can report weighted sample sizes, but you should also report unweighted sample sizes
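The arithmetic behind the two weight types can be sketched outside SPSS. The following Python sketch is only an illustration: the 18% and 9% shares come from the example above, while the sample of 1,000, the "Other" group, and the population size of 100,000 are assumptions made up for the demonstration.

```python
# Shares from the slide's example; the group sizes below are hypothetical.
population_share = {"African American": 0.18, "Other": 0.82}
sample_counts = {"African American": 90, "Other": 910}  # assumed sample of 1,000


def proportional_weights(population_share, sample_counts):
    """Weight = population share / sample share; weighted n equals the sample n."""
    n = sum(sample_counts.values())
    return {g: population_share[g] / (sample_counts[g] / n) for g in sample_counts}


def scale_weights(prop_weights, population_size, sample_size):
    """Rescale proportional weights so the weighted n equals the population size."""
    factor = population_size / sample_size
    return {g: w * factor for g, w in prop_weights.items()}


prop = proportional_weights(population_share, sample_counts)
scaled = scale_weights(prop, population_size=100_000, sample_size=1_000)

weighted_n_prop = sum(prop[g] * sample_counts[g] for g in sample_counts)
weighted_n_scaled = sum(scaled[g] * sample_counts[g] for g in sample_counts)

print(prop)               # African American weight is about 2.0, Other is below 1
print(weighted_n_prop)    # about 1,000: same as the unweighted sample
print(weighted_n_scaled)  # about 100,000: same as the assumed population
```

In the large datasets themselves the weights are supplied, so this calculation is only meant to show what the supplied values represent.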

Using weights

These “weight” values are already in large datasets

ELS:2002 Race, UNWEIGHTED

                                 Freq.       %
Amer. Indian/Alaska Native         130     0.8
Asian, Hawaii/Pac. Islander       1460     9.0
Black or African American         2020    12.5
Hispanic, no race specified        996     6.1
Hispanic, race specified          1221     7.5
More than one race                 735     4.5
White, non-Hispanic               8682    53.6
Total                            16197   100.0

ELS:2002 Race, WEIGHTED

                                 Freq.       %
Amer. Indian/Alaska Native       32781     1.0
Asian, Hawaii/Pac. Islander     142518     4.2
Black or African American       491321    14.4
Hispanic, no race specified     243607     7.1
Hispanic, race specified        298648     8.8
More than one race              147896     4.3
White, non-Hispanic            2054103    60.2
Total                          3410873   100.0

[Pie charts: ELS:2002 race distribution, unweighted and weighted]

2. Practical significance and large datasets

• Because of the large sample size, many negligible effects (and nearly all correlations) will be statistically significant

• Must consider effect sizes and practical significance

ELS:2002 variables: Independent Samples Test

                                   t      df    Sig.
Math test score                 8.71    8593   <.001
Reading test score             -4.14    8593   <.001
Mathematics self-efficacy      14.65    8593   <.001
English self-efficacy scale    -2.19    8593    .029

Wow!! All significant!!

Practical significance and large datasets

• Actually negligible differences for reading and small differences for math

ELS:2002 variables: Independent Samples Test

                                   t      df    Sig.   Cohen’s d
Math test score                 8.71    8593   <.001      0.19
Reading test score             -4.14    8593   <.001     -0.09
Mathematics self-efficacy      14.65    8593   <.001      0.32
English self-efficacy scale    -2.19    8593    .029     -0.05
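The slides do not say how Cohen’s d was computed. One common approximation for an independent-samples t-test, assuming roughly equal group sizes, is d ≈ 2t / √df. The Python sketch below applies that assumed formula to the t values in the table; under that assumption it lands on the same rounded effect sizes.

```python
import math


def cohens_d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t, assuming equal group sizes."""
    return 2 * t / math.sqrt(df)


# t statistics copied from the table above; df = 8593 for every test.
t_values = {
    "Math test score": 8.71,
    "Reading test score": -4.14,
    "Mathematics self-efficacy": 14.65,
    "English self-efficacy scale": -2.19,
}

effect_sizes = {name: cohens_d_from_t(t, 8593) for name, t in t_values.items()}

for name, d in effect_sizes.items():
    print(f"{name}: d = {d:.2f}")
```

The point of the conversion is that a very large t can still correspond to a small d when df is in the thousands.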

3. Matrix sampling (be aware of…)
• Used in large-scale assessments when
  • Large domain being sampled (e.g., world history)
  • Need to cover many topics in limited time
  • Individual estimates of the constructs are less important than aggregate estimates (state level achievement)
• Usually requires IRT (item response theory) scoring methods to allow for comparable scores across examinees completing different items

Table from von Davier et al., http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf
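To make the idea of matrix sampling concrete, here is a minimal Python sketch of a rotating booklet design. The numbers (7 item blocks, 3 blocks per booklet) and the function names are hypothetical; operational assessments use more carefully balanced designs than this.

```python
from collections import Counter


def build_booklets(n_blocks, blocks_per_booklet):
    """Rotating design: booklet b holds blocks b, b+1, ... wrapping around."""
    return [
        [(b + k) % n_blocks for k in range(blocks_per_booklet)]
        for b in range(n_blocks)
    ]


def assign_booklets(student_ids, booklets):
    """Hand out booklets in rotation so each booklet is used about equally often."""
    return {sid: booklets[i % len(booklets)] for i, sid in enumerate(student_ids)}


booklets = build_booklets(n_blocks=7, blocks_per_booklet=3)
assignment = assign_booklets(range(1, 15), booklets)

# Every block appears in the same number of booklets, but no student sees all blocks.
block_counts = Counter(block for booklet in booklets for block in booklet)
print(booklets)
print(block_counts)
```

Because each student answers only some blocks, their scores are not directly comparable item for item, which is why IRT scoring is typically needed.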

4. Plausible values
• Can result from matrix sampling (with IRT models), bootstrapping, and missing data imputation
  • In matrix sampling, individual estimates of skills are less reliable and plausible values better capture this error variance compared to single scores
• Results in multiple estimates of the student’s true score on the construct (will appear as multiple variables)
• Poor practice = averaging plausible values before analysis
  • Produces biased estimates (von Davier et al., see notes)
• Better practice = using methods that analyze the different estimates together and produce standard error bars
  • Refer to von Davier et al. link in notes
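One widely used way to analyze the estimates together is to run the analysis once per plausible value and then pool the results with the standard multiple-imputation combining rules (often called Rubin’s rules). The Python sketch below shows only the pooling step. The five estimates and their sampling variances are made-up numbers, and obtaining correct sampling variances in real survey data (for example with replicate weights) is not shown.

```python
import statistics


def combine_plausible_values(estimates, sampling_variances):
    """Pool one estimate per plausible value into a single estimate and standard error."""
    m = len(estimates)
    pooled = statistics.mean(estimates)
    within = statistics.mean(sampling_variances)   # average sampling variance
    between = statistics.variance(estimates)       # variance across plausible values
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance ** 0.5


# Hypothetical group means estimated separately from five plausible values.
estimates = [500.0, 502.0, 498.0, 501.0, 499.0]
sampling_variances = [4.0, 4.0, 4.0, 4.0, 4.0]

pooled_mean, pooled_se = combine_plausible_values(estimates, sampling_variances)
print(pooled_mean, round(pooled_se, 3))
```

The between-value term is what is lost when plausible values are averaged first: the pooled standard error here is larger than the standard error from any single plausible value.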

5. Keeping and managing syntax
• From any dialog window, you can select “Paste” to save the command as syntax
• Makes sure analyses start with the same data selections: sample weights, split files, selecting relevant cases
• Good for keeping a record of computed and recoded variables

6. Merging datasets
• Add cases = add more participants’ data
• Add variables = add variables for same participants from another dataset

Merging datasets: Adding variables
• Have to exclude duplicate variables from one dataset
  • Check that values are really identical (if not, change variable name)
• Use Key Variables to match cases
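The logic of an “add variables” merge can be sketched in plain Python. This is an illustration of the matching and checking steps, not SPSS’s own procedure; the variable names (`STU_ID`, `math`, `efficacy`) and the data are hypothetical.

```python
def add_variables(left, right, key):
    """Match cases on a key variable and add the right dataset's variables to the left."""
    right_by_key = {}
    for row in right:
        if row[key] in right_by_key:
            raise ValueError(f"duplicate key in right dataset: {row[key]}")
        right_by_key[row[key]] = row

    merged = []
    for row in left:
        combined = dict(row)
        for name, value in right_by_key.get(row[key], {}).items():
            # A variable present in both datasets must hold identical values.
            if name in combined and combined[name] != value:
                raise ValueError(f"variable {name!r} differs for case {row[key]}")
            combined.setdefault(name, value)
        merged.append(combined)
    return merged


# Hypothetical datasets sharing the key variable STU_ID, stored in different orders.
scores = [{"STU_ID": 1, "math": 50}, {"STU_ID": 2, "math": 60}]
survey = [{"STU_ID": 2, "efficacy": 3.5}, {"STU_ID": 1, "efficacy": 2.0}]

merged = add_variables(scores, survey, key="STU_ID")
print(merged)
```

Matching on the key rather than on row position is what protects against the two files being sorted differently.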

7. Checking for duplicate cases

Duplicate cases output
• Will appear as a new variable “PrimaryLast”
• Will need to decide how to handle on a case-by-case basis
• Merging datasets incorrectly can result in duplicates
  • If variables are identical, delete one
  • If variables are different, check that identification variables are correct
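A flag like “PrimaryLast” can be reproduced in a few lines of Python to see what it encodes. The sketch assumes the flag is 1 for the last case in each group of matching IDs and 0 for earlier duplicates; the ID variable name and data are hypothetical.

```python
def flag_primary_last(cases, key):
    """Add PrimaryLast: 1 for the last case with a given key, 0 for earlier duplicates."""
    last_index = {}
    for i, case in enumerate(cases):
        last_index[case[key]] = i  # later cases overwrite earlier ones
    return [
        {**case, "PrimaryLast": int(last_index[case[key]] == i)}
        for i, case in enumerate(cases)
    ]


# Hypothetical file where student 2 appears twice with different values.
cases = [
    {"STU_ID": 1, "math": 50},
    {"STU_ID": 2, "math": 60},
    {"STU_ID": 2, "math": 61},
]

flagged = flag_primary_last(cases, key="STU_ID")
duplicates = [case for case in flagged if case["PrimaryLast"] == 0]
print(duplicates)
```

Listing the flagged cases rather than deleting them automatically leaves room for the case-by-case decision described above.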

8. Missing data
• Methods that bias results:
  • Mean substitution, listwise or pairwise deletion
• Methods that can provide less biased estimates
  • Single imputation regression (better than above, but restricts variability)
  • Expectation-maximization (EM): best of the SPSS options, works well when data are missing at random
• Analyze → Missing Value Analysis
• Be sure to read up on “missing completely at random”, “missing at random”, and “missing not at random”
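One reason mean substitution distorts results is that it shrinks variability. The small Python sketch below uses made-up scores to show the effect: the mean is unchanged, but the variance drops once the missing values are filled with the mean.

```python
import statistics

# Hypothetical scores: four observed values and two missing values.
observed = [2.0, 4.0, 6.0, 8.0]
n_missing = 2

observed_mean = statistics.mean(observed)
filled = observed + [observed_mean] * n_missing  # mean substitution

variance_observed = statistics.variance(observed)
variance_filled = statistics.variance(filled)

print(observed_mean, statistics.mean(filled))  # the mean does not change
print(variance_observed, variance_filled)      # the variance shrinks
```

Smaller variances feed into standard errors and correlations, so analyses on mean-substituted data can look more precise than they are.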

Other Resources
Dr. Lakin

AERA Research Grants and Dissertation Grants
“The program seeks to stimulate research on U.S. education issues using data from the large-scale, national and international data sets supported by the National Center for Education Statistics (NCES), NSF, and other federal agencies, and to increase the number of education researchers using these data sets.”
• Suggestions based on personal observations and the RFP:
  • Must use a strong quasi-experimental design (Schneider et al., Estimating Causal Effects: Using Experimental and Observational Designs)
    • Regression discontinuity, propensity score matching, etc.
  • Bringing in new quantitative approaches from other fields is also very appealing (economics, epidemiology, etc.)
  • Check past grants to see which datasets are “neglected” (more recent datasets are better)
  • Ideas that use multiple datasets in a meaningful way are more successful
  • Analyses of international datasets have been more successful recently

Other opportunities
• IES Research Grants do fund secondary data analyses with Exploration grant goals (any subject area): http://ies.ed.gov/funding/
• IES data training workshops: http://ies.ed.gov/whatsnew/conferences/?cid=2
• AERA annual meeting usually has data training events:
  • PDC02: Analyzing NAEP Assessment Data with Plausible Values…
  • PDC13: Advanced Analysis using Adult International Large Scale Assessment Databases
  • PDC16: Using NAEP Data on the Web for Educational Policy Research
  • Several on quantitative methods (including propensity scores)
• AERA Institute on Statistical Analysis for Education Policy (summer): http://www.aera.net/ProfessionalOpportunitiesFunding/FundingOpportunities/InstituteonStatisticalAnalysis/tabid/10906/Default.aspx
• IES/NCES hosts STATS-DC conferences and summer institutes to train researchers in using specific datasets

Q&A

Presentation files are available from http://www.auburn.edu/~jml0035/

(Under “Conference materials and resources”)


