Getting Started with Large Scale Datasets
Dr. Joni M. Lakin
Dr. Margaret Ross
Dr. Yi Han
Presentation Files Are Available: http://www.auburn.edu/~jml0035/
(Under “Conference materials and resources” at the bottom of the page)
Opening questions
• How many of you primarily use SPSS for data analysis?
• How many are comfortable with using syntax (in SPSS or other programs)?
• How many already have plans to use a specific dataset?
• How many are just curious about what’s available?
What Data is Available?
Dr. Yi Han
U.S. National Datasets
• NCES
• Restricted use licenses: http://nces.ed.gov/nationsreportcard/researchcenter/license.aspx
International Datasets
• PISA
• PIAAC
Accessing Data and Getting Started
Dr. Margaret Ross
See PDFs
Key Issues in Working with Large Datasets
Dr. Joni Lakin
Key issues
1. Statistical weighting in SPSS
2. Practical significance and large samples
3. Matrix sampling
4. Plausible values
SPSS skills that make working with large datasets easier:
5. Keeping and managing syntax
6. Merging datasets
7. Checking for duplicate cases
8. Missing data imputation
1. Statistical weighting in SPSS
• Weights allow us to better approximate the full population
  • If African American students are 18% of the population but 9% of my sample, I could weight each African American student 2.0 (so each observation counts twice in analyses) to get results that better reflect population-level effects.
• Types of weights
  • Scale weights = multiply observations to create a weighted sample the same size as the population
  • Proportional weights = may be below 1 to keep the weighted sample size the same as the actual sample
• Note
  • When reporting results, you can report the weighted sample size, but you should also report the unweighted sample size
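The arithmetic behind the 18% / 9% example can be sketched in Python (the talk itself works in SPSS). This is only an illustration: the function names, the 100-case sample, and the population size of 10,000 are assumptions made up for the example, not values from any dataset.

```python
def group_weights(pop_share, sample_share):
    """Weight for each group = population share / sample share."""
    return {g: pop_share[g] / sample_share[g] for g in pop_share}

def proportional_weights(case_weights):
    """Rescale weights so they sum to the actual number of cases."""
    n = len(case_weights)
    total = sum(case_weights)
    return [w * n / total for w in case_weights]

# Example from the slide: a group that is 18% of the population
# but 9% of the sample gets a weight of 2.0.
pop_share = {"AA": 0.18, "other": 0.82}
sample_share = {"AA": 0.09, "other": 0.91}
weights = group_weights(pop_share, sample_share)

# Hypothetical sample of 100 cases: 9 African American students, 91 others.
case_weights = [weights["AA"]] * 9 + [weights["other"]] * 91

# Hypothetical scale weights: the same cases weighted up to an assumed
# population of 10,000, so the weights sum to the population size.
scale = [w * 10_000 / len(case_weights) for w in case_weights]

# Proportional weights bring the weighted N back to the real sample size.
rescaled = proportional_weights(scale)
```

Here `sum(scale)` equals the assumed population size and `sum(rescaled)` equals the 100 cases actually observed, which is the difference between the two types of weights.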
Using weights
These “weight” values are already in large datasets
ELS:2002 Race, UNWEIGHTED
Category                        Freq.       %
Amer. Indian/Alaska Native        130      .8
Asian, Hawaii/Pac. Islander      1460     9.0
Black or African American        2020    12.5
Hispanic, no race specified       996     6.1
Hispanic, race specified         1221     7.5
More than one race                735     4.5
White, non-Hispanic              8682    53.6
Total                           16197   100.0

ELS:2002 Race, WEIGHTED
Category                        Freq.       %
Amer. Indian/Alaska Native      32781     1.0
Asian, Hawaii/Pac. Islander    142518     4.2
Black or African American      491321    14.4
Hispanic, no race specified    243607     7.1
Hispanic, race specified       298648     8.8
More than one race             147896     4.3
White, non-Hispanic           2054103    60.2
Total                         3410873   100.0

[Pie chart, unweighted: Amer. Indian/Alaska Native 1%; Asian, Hawaii/Pac. Islander 10%; Black or African American 13%; Hispanic, no race specified 7%; Hispanic, race specified 8%; More than one race 5%; White, non-Hispanic 57%]
[Pie chart, weighted: Amer. Indian/Alaska Native 1%; Asian, Hawaii/Pac. Islander 4%; Black or African American 14%; Hispanic, no race specified 7%; Hispanic, race specified 9%; More than one race 4%; White, non-Hispanic 60%]
2. Practical significance and large datasets
• Because of the large sample size, many negligible effects (and virtually all correlations) will be statistically significant
• Must consider effect sizes and practical significance
ELS:2002 variables: Independent Samples Test
Variable                         t       df     Sig.
Math test score                  8.71    8593   <.001
Reading test score              -4.14    8593   <.001
Mathematics self-efficacy       14.65    8593   <.001
English self-efficacy scale     -2.19    8593   .029
Wow!! All significant!!
Practical significance and large datasets
• Actually negligible differences for reading and small differences for math
ELS:2002 variables: Independent Samples Test
Variable                         t       df     Sig.    Cohen’s d
Math test score                  8.71    8593   <.001    0.19
Reading test score              -4.14    8593   <.001   -0.09
Mathematics self-efficacy       14.65    8593   <.001    0.32
English self-efficacy scale     -2.19    8593   .029    -0.05
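The slides do not say how the Cohen’s d values were computed. As a rough sketch, a common shortcut for an independent-samples t test with two groups of about equal size is d ≈ 2t / √df. The equal-group-size assumption is ours, not the presenters’, but the shortcut does reproduce the values in the table from the reported t and df:

```python
import math

def cohens_d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t statistic,
    assuming two groups of roughly equal size: d = 2t / sqrt(df)."""
    return 2 * t / math.sqrt(df)

# t statistics reported in the table (df = 8593 for every test).
t_values = {
    "Math test score": 8.71,
    "Reading test score": -4.14,
    "Mathematics self-efficacy": 14.65,
    "English self-efficacy scale": -2.19,
}
df = 8593

effect_sizes = {
    name: round(cohens_d_from_t(t, df), 2) for name, t in t_values.items()
}
for name, d in effect_sizes.items():
    print(f"{name}: d = {d}")
```

With more than 8,000 degrees of freedom, even a t of 8.71 corresponds to a difference of about a fifth of a standard deviation, which is why the p-value alone says little about practical importance.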
3. Matrix sampling (be aware of…)
• Used in large-scale assessments when
  • A large domain is being sampled (e.g., world history)
  • Many topics need to be covered in limited time
  • Individual estimates of the constructs are less important than aggregate estimates (e.g., state-level achievement)
• Usually requires IRT (item response theory) scoring methods to allow for comparable scores across examinees completing different items
Table from von Davier et al., http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf
4. Plausible values
• Can result from matrix sampling (with IRT models), bootstrapping, and missing data imputation
  • In matrix sampling, individual estimates of skills are less reliable, and plausible values capture this error variance better than single scores do
• Results in multiple estimates of the student’s true score on the construct (will appear as multiple variables)
• Poor practice = averaging plausible values before analysis
  • Produces biased estimates (von Davier et al., see notes)
• Better practice = using methods that analyze the different estimates together and produce standard error bars
  • Refer to von Davier et al. link in notes
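One widely used way to "analyze the estimates together" is to run the same analysis once per plausible value and then pool the results with the combining rules from multiple imputation (Rubin’s rules). The sketch below assumes that approach; the slides do not name a specific method, and the five estimates and sampling variances are made-up numbers. In real NCES or international data the sampling variance for each run would normally come from the dataset’s replicate weights, which is not shown here.

```python
from statistics import mean, variance

def combine_plausible_values(estimates, sampling_variances):
    """Pool one statistic estimated separately on each plausible value.

    estimates          -- the statistic from each plausible value
    sampling_variances -- the squared standard error from each run
    Returns (pooled estimate, pooled standard error).
    """
    m = len(estimates)
    pooled = mean(estimates)            # average of the M results
    within = mean(sampling_variances)   # average sampling variance
    between = variance(estimates)       # variance across plausible values
    total = within + (1 + 1 / m) * between
    return pooled, total ** 0.5

# Hypothetical group mean estimated once per plausible value (M = 5).
estimates = [500.0, 502.0, 498.0, 501.0, 499.0]
sampling_variances = [4.0, 4.0, 4.0, 4.0, 4.0]

pooled, se = combine_plausible_values(estimates, sampling_variances)
print(pooled, se)
```

Note that the averaging happens on the analysis results, not on the plausible values themselves; the between-value variance term is what is lost when the values are averaged first.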
5. Keeping and managing syntax
• From any command window, you can select “Paste”
• Makes sure analyses start with the same data selections: sample weights, split files, selecting relevant cases
• Good for keeping a record of computed and recoded variables
6. Merging datasets
• Add cases = add more participants’ data
• Add variables = add variables for the same participants from another dataset
Merging datasets: Adding variables
• Have to exclude duplicate variables from one dataset
• Check that values are really identical (if not, change variable name)
• Use Key Variables to match cases
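The logic of an "add variables" merge can be sketched outside SPSS. This Python version is a simplified stand-in, not the SPSS procedure: the function name, the `STU_ID` key, and the two tiny datasets are hypothetical. It matches cases on a key variable and refuses to merge when a variable that appears in both files holds different values for the same case.

```python
def add_variables(left, right, key):
    """Match cases on a key variable and add right's variables to left.

    Raises ValueError if a variable present in both datasets has
    conflicting values for the same case.
    """
    right_by_key = {row[key]: row for row in right}
    merged = []
    for row in left:
        match = right_by_key.get(row[key], {})
        for name, value in match.items():
            if name in row and row[name] != value:
                raise ValueError(f"{name} differs for {key}={row[key]}")
        merged.append({**row, **match})
    return merged

# Hypothetical files for the same students, stored in different orders.
students = [{"STU_ID": 1, "math": 52.1}, {"STU_ID": 2, "math": 47.3}]
survey = [{"STU_ID": 2, "efficacy": 0.4}, {"STU_ID": 1, "efficacy": -0.2}]

merged = add_variables(students, survey, "STU_ID")
print(merged)
```

Matching on the key rather than on row position is what keeps each student’s survey values attached to the right test scores.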
7. Checking for duplicate cases
Duplicate cases output
• Will appear as a new variable “PrimaryLast”
• Will need to decide how to handle on a case-by-case basis
• Merging datasets incorrectly can result in duplicates
  • If variables are identical, delete one
  • If variables are different, check that identification variables are correct
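A duplicate check of this kind can be sketched in a few lines of Python. The flag below loosely mirrors the idea of the “PrimaryLast” variable by marking the last case in each set of matching IDs as 1 and earlier duplicates as 0; that coding, the function names, and the sample rows are assumptions for illustration rather than a description of SPSS’s exact output.

```python
from collections import Counter

def duplicate_ids(rows, key):
    """Return the ID values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(k for k, c in counts.items() if c > 1)

def flag_primary_last(rows, key):
    """Flag the last case for each ID as 1 and earlier duplicates as 0."""
    last_index = {row[key]: i for i, row in enumerate(rows)}
    return [1 if last_index[row[key]] == i else 0 for i, row in enumerate(rows)]

# Hypothetical merged file in which student 2 appears twice.
rows = [
    {"STU_ID": 1, "math": 52.1},
    {"STU_ID": 2, "math": 47.3},
    {"STU_ID": 2, "math": 47.3},
    {"STU_ID": 3, "math": 60.0},
]

dups = duplicate_ids(rows, "STU_ID")
flags = flag_primary_last(rows, "STU_ID")
print(dups, flags)
```

The flag only locates the duplicates; whether the rows are identical or point to a bad identification variable still has to be checked case by case.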
8. Missing data
• Methods that bias results:
  • Mean substitution, listwise or pairwise deletion
• Methods that can provide less biased estimates
  • Single imputation regression (better than above, but restricts variability)
  • Expectation-maximization (EM): best of SPSS options, works well when data are missing at random
• Analyze → Missing Value Analysis
• Be sure to read up on “missing completely at random”, “missing at random”, and “missing not at random”
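One reason mean substitution distorts results is that every filled-in case sits exactly at the mean, so the variable’s variance shrinks. A minimal sketch with made-up scores (the function name and data are illustrative only):

```python
from statistics import mean, variance

def mean_substitute(values):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

# Hypothetical test scores with three missing values.
scores = [40.0, 50.0, None, 60.0, None, 70.0, 30.0, None]

observed = [v for v in scores if v is not None]
filled = mean_substitute(scores)

var_observed = variance(observed)  # variance of the real scores
var_filled = variance(filled)      # variance after mean substitution
print(var_observed, var_filled)
```

The mean is unchanged, but the variance drops, which in turn shrinks standard errors and weakens correlations with other variables.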
Other Resources
Dr. Lakin
AERA Research Grants and Dissertation Grants
“The program seeks to stimulate research on U.S. education issues using data from the large-scale, national and international data sets supported by the National Center for Education Statistics (NCES), NSF, and other federal agencies, and to increase the number of education researchers using these data sets.”
• Suggestions based on personal observations and the RFP:
  • Must use a strong quasi-experimental design (Schneider et al., Estimating Causal Effects: Using Experimental and Observational Designs)
    • Regression discontinuity, propensity score matching, etc.
  • Bringing in new quantitative approaches from other fields is also very appealing (economics, epidemiology, etc.)
  • Check past grants to see which datasets are “neglected” (more recent datasets better)
  • Ideas that involve multiple datasets in meaningful research are more successful
  • Analyses of international datasets have been more successful recently
Other opportunities
• IES Research Grants do fund secondary data analyses with Exploration grant goals (any subject area): http://ies.ed.gov/funding/
• IES data training workshops: http://ies.ed.gov/whatsnew/conferences/?cid=2
• AERA annual meeting usually has data training events:
  • PDC02: Analyzing NAEP Assessment Data with Plausible Values…
  • PDC13: Advanced Analysis using Adult International Large Scale Assessment Databases
  • PDC16: Using NAEP Data on the Web for Educational Policy Research
  • Several on quantitative methods (including propensity scores)
• AERA Institute on Statistical Analysis for Education Policy (summer)
• IES/NCES hosts STATS-DC conferences and summer institutes to train researchers in using specific datasets
Q&A
Presentation files are available from
http://www.auburn.edu/~jml0035/
(Under “Conference materials and resources”)