Getting Started with Large Scale Datasets
Dr. Joni M. Lakin
Dr. Margaret Ross
Dr. Yi Han
Presentation Files Are Available: http://www.auburn.edu/~jml0035/
(Under “Conference materials and resources” at the bottom of the page)
Opening questions
• How many of you primarily use SPSS for data analysis?
• How many are comfortable with using syntax (in SPSS or other programs)?
• How many already have plans to use a specific dataset?
• How many are just curious about what’s available?
What Data is Available?
Dr. Yi Han
U.S. National Datasets
• NCES
U.S. National Datasets
• Restricted use licenses
http://nces.ed.gov/nationsreportcard/researchcenter/license.aspx
International Datasets
PISA PIAAC
Accessing Data and Getting Started
Dr. Margaret Ross
See PDFs
Key Issues in Working with Large Datasets
Dr. Joni Lakin
Key issues
1. Statistical weighting in SPSS
2. Practical significance and large samples
3. Matrix sampling
4. Plausible values
SPSS skills that make working with large datasets easier:
5. Keeping and managing syntax
6. Merging datasets
7. Checking for duplicate cases
8. Missing data imputation
1. Statistical weighting in SPSS
• Weights allow us to better approximate the full population
  • If African American students are 18% of the population but 9% of my sample, I could weight each African American student 2.0 (so each observation is counted twice in analyses) to get results that better reflect population-level effects.
• Types of weights
  • Scale weights = multiply observations to create a weighted sample of the same size as the population
  • Proportional weights = may be below 1, to keep the overall weighted sample size the same as the actual sample
• Note
  • When you’re reporting results, you can report the weighted sample size, but you should also report the unweighted sample size
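The 18% versus 9% example above can be worked through in a few lines. This is a minimal sketch in Python rather than SPSS, using made-up counts: `population_share`, `sample_counts`, and `population_size` are illustrative assumptions, not values from any real dataset.

```python
# Hypothetical sample: group "A" is 9% of the sample but 18% of the population.
population_share = {"A": 0.18, "B": 0.82}   # assumed population proportions
sample_counts = {"A": 90, "B": 910}         # assumed sample of 1,000 cases
population_size = 50_000                    # assumed population size

n = sum(sample_counts.values())

# Proportional weight = population share / sample share.
# The weighted counts still add up to the sample size n.
proportional = {
    g: population_share[g] / (sample_counts[g] / n) for g in sample_counts
}

# Scale weight = number of population members each sampled case stands for.
# The weighted counts add up to the population size.
scale = {
    g: population_share[g] * population_size / sample_counts[g]
    for g in sample_counts
}

weighted_n = sum(proportional[g] * sample_counts[g] for g in sample_counts)
weighted_pop = sum(scale[g] * sample_counts[g] for g in sample_counts)

print(proportional)  # group A gets a weight of about 2.0, group B just under 1
print(round(weighted_n), round(weighted_pop))
```

Under these assumptions the proportional weights leave the weighted sample size at 1,000, while the scale weights inflate it to the assumed population size, which is why both weighted and unweighted sample sizes are worth reporting.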
Using weights
These “weight” values are already in large datasets
ELS:2002 Race, UNWEIGHTED
Race                           Freq.       %    Pie chart %
Amer. Indian/Alaska Native       130      .8     1
Asian, Hawaii/Pac. Islander     1460     9.0    10
Black or African American       2020    12.5    13
Hispanic, no race specified      996     6.1     7
Hispanic, race specified        1221     7.5     8
More than one race               735     4.5     5
White, non-Hispanic             8682    53.6    57
Total                          16197   100.0

ELS:2002 Race, WEIGHTED
Race                           Freq.       %    Pie chart %
Amer. Indian/Alaska Native     32781     1.0     1
Asian, Hawaii/Pac. Islander   142518     4.2     4
Black or African American     491321    14.4    14
Hispanic, no race specified   243607     7.1     7
Hispanic, race specified      298648     8.8     9
More than one race            147896     4.3     4
White, non-Hispanic          2054103    60.2    60
Total                        3410873   100.0
2. Practical significance and large datasets
• Because of large sample size, many negligible effects (and ALL correlations) will be significant
• Must consider effect sizes and practical significance
Independent Samples Test
ELS:2002 variables                 t       df    Sig.
Math test score                  8.71    8593   <.001
Reading test score              -4.14    8593   <.001
Mathematics self-efficacy       14.65    8593   <.001
English self-efficacy scale     -2.19    8593    .029
Wow!! All significant!!
Practical significance and large datasets
• Actually negligible differences for reading and small differences for math
Independent Samples Test
ELS:2002 variables                 t       df    Sig.    Cohen’s d
Math test score                  8.71    8593   <.001     0.19
Reading test score              -4.14    8593   <.001    -0.09
Mathematics self-efficacy       14.65    8593   <.001     0.32
English self-efficacy scale     -2.19    8593    .029    -0.05
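The slides do not say how Cohen’s d was computed. As a rough check, the sketch below uses the common approximation d ≈ 2t / √df, which assumes the two groups are roughly equal in size (an assumption, since the group sizes are not shown). The function name `cohens_d_from_t` is illustrative.

```python
import math

def cohens_d_from_t(t, df):
    """Approximate Cohen's d from an independent-samples t statistic.

    Assumes the two groups are roughly equal in size, so d ~ 2t / sqrt(df).
    """
    return 2 * t / math.sqrt(df)

# t statistics as shown in the table above; df is the same for every row.
t_values = {
    "Math test score": 8.71,
    "Reading test score": -4.14,
    "Mathematics self-efficacy": 14.65,
    "English self-efficacy scale": -2.19,
}
df = 8593

d_values = {name: round(cohens_d_from_t(t, df), 2) for name, t in t_values.items()}
for name, d in d_values.items():
    print(f"{name}: d = {d:.2f}")
```

Under that equal-groups assumption the approximation lands on the same two-decimal values as the table, which shows how a t of 8.71 can still correspond to a small effect when df is in the thousands.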
3. Matrix sampling (be aware of…)
• Used in large-scale assessments when
  • Large domain being sampled (e.g., world history)
  • Need to cover many topics in limited time
  • Individual estimates of the constructs are less important than aggregate estimates (state level achievement)
• Usually requires IRT (item response theory) scoring methods to allow for comparable scores across examinees completing different items
Table from von Davier et al., http://www.ierinstitute.org/fileadmin/Documents/IERI_Monograph/IERI_Monograph_Volume_02_Chapter_01.pdf
4. Plausible values
• Can result from matrix sampling (with IRT models), bootstrapping, and missing data imputation
  • In matrix sampling, individual estimates of skills are less reliable and plausible values better capture this error variance compared to single scores
• Results in multiple estimates of the student’s true score on the construct (will appear as multiple variables)
• Poor practice = averaging plausible values before analysis
  • Produces biased estimates (von Davier et al., see notes)
• Better practice = using methods that analyze the different estimates together and produce standard error bars
  • Refer to von Davier et al. link in notes
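One widely used way to "analyze the different estimates together" is to run the same analysis once per plausible value and then pool the results with the standard multiple-imputation combining rules (Rubin's rules). The sketch below is a minimal illustration with invented numbers. The function name and the input values are assumptions, and with real survey data the per-analysis sampling variances would normally come from the dataset's replicate weights or other design-based methods.

```python
import statistics

def combine_plausible_values(estimates, sampling_variances):
    """Pool one statistic estimated separately on each plausible value.

    estimates: the statistic (e.g., a group mean) from each plausible value
    sampling_variances: the squared standard error from each analysis
    """
    m = len(estimates)
    pooled = statistics.mean(estimates)            # pooled point estimate
    within = statistics.mean(sampling_variances)   # average sampling variance
    between = statistics.variance(estimates)       # variance across plausible values (m - 1 denominator)
    total_variance = within + (1 + 1 / m) * between
    return pooled, total_variance ** 0.5

# Hypothetical results from running the same analysis on five plausible values
estimates = [502.0, 498.0, 505.0, 500.0, 495.0]
sampling_variances = [4.0, 4.2, 3.9, 4.1, 3.8]

pooled, se = combine_plausible_values(estimates, sampling_variances)
print(f"pooled estimate = {pooled:.1f}, standard error = {se:.2f}")
```

Averaging the plausible values first and running a single analysis would drop the between-value term, so the resulting standard error would be too small.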
5. Keeping and managing syntax
• From any command window, you can select “Paste” to write the command to a syntax file
• Makes sure analyses start with the same data selections: sample weights, split files, selecting relevant cases
• Good for keeping a record of computed and recoded variables
6. Merging datasets
• Add cases = add more participants’ data
• Add variables = add variables for same participants from another dataset
Merging datasets: Adding variables
• Have to exclude duplicate variables from one dataset
  • Check that values are really identical (if not, change variable name)
• Use Key Variables to match cases
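The logic of an "add variables" merge can be sketched outside SPSS. The Python example below matches records on a key variable and renames a duplicated variable when its values disagree. The student IDs, variable names, and the `_followup` suffix are all hypothetical.

```python
# Hypothetical records keyed by student ID (the "key variable").
base_year = {
    101: {"math": 52.1, "sex": "F"},
    102: {"math": 47.8, "sex": "M"},
    103: {"math": 60.3, "sex": "F"},
}
follow_up = {
    101: {"reading": 55.0, "sex": "F"},
    102: {"reading": 49.5, "sex": "F"},   # conflicts with the base-year value
    104: {"reading": 51.2, "sex": "M"},   # no base-year record
}

merged = {}
conflicts = []
for student_id in sorted(base_year.keys() | follow_up.keys()):
    row = dict(base_year.get(student_id, {}))
    for name, value in follow_up.get(student_id, {}).items():
        if name in row and row[name] != value:
            # Same variable name but different values: keep both under distinct names.
            row[f"{name}_followup"] = value
            conflicts.append((student_id, name))
        else:
            row[name] = value
    merged[student_id] = row

print(merged[102])
print(conflicts)
```

Listing the conflicts separately makes it easy to check whether a duplicated variable is really identical across files before dropping one copy.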
7. Checking for duplicate cases
Duplicate cases output
• Will appear as a new variable “PrimaryLast”
• Will need to decide how to handle on a case-by-case basis
  • Merging datasets incorrectly can result in duplicates
  • If variables are identical, delete one
  • If variables are different, check that identification variables are correct
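The same kind of check can be sketched in Python. The flag below is loosely modeled on the “PrimaryLast” idea (the last case in each group of matching IDs is marked 1, earlier duplicates 0). That coding is an assumption for illustration, so check the SPSS output for the exact definition. The case data are invented.

```python
from collections import Counter

# Hypothetical cases; "id" is the identification variable.
cases = [
    {"id": 101, "math": 52.1},
    {"id": 102, "math": 47.8},
    {"id": 102, "math": 47.8},   # exact duplicate: safe to delete one
    {"id": 103, "math": 60.3},
    {"id": 103, "math": 58.0},   # same ID, different data: check the ID
]

id_counts = Counter(case["id"] for case in cases)
last_index = {case["id"]: i for i, case in enumerate(cases)}

for i, case in enumerate(cases):
    # 1 = unique case or last case in its group of matching IDs, 0 = earlier duplicate
    case["primary_last"] = 1 if last_index[case["id"]] == i else 0

duplicated_ids = sorted(i for i, count in id_counts.items() if count > 1)
print(duplicated_ids)
print([case["primary_last"] for case in cases])
```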
8. Missing data
• Methods that bias results:
  • Mean substitution, listwise or pairwise deletion
• Methods that can provide less biased estimates
  • Single imputation regression (better than above, but restricts variability)
  • Expectation-maximization (EM): best of SPSS options, works well when data is missing at random
• Analyze → Missing Value Analysis
• Be sure to read up on “missing completely at random”, “missing at random”, and “missing not at random”
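One reason mean substitution is listed among the biasing methods can be seen in a small example: filling every gap with the observed mean leaves the mean unchanged but shrinks the spread. The scores below are invented for illustration.

```python
import statistics

# Hypothetical test scores; None marks a missing value.
scores = [42.0, 55.0, None, 61.0, None, 48.0, 70.0, None, 39.0, 58.0]

observed = [s for s in scores if s is not None]
mean_observed = statistics.mean(observed)

# Mean substitution: replace every missing value with the observed mean.
filled = [mean_observed if s is None else s for s in scores]

sd_observed = statistics.stdev(observed)
sd_filled = statistics.stdev(filled)

print(f"SD of observed scores:      {sd_observed:.2f}")
print(f"SD after mean substitution: {sd_filled:.2f}")
```

The smaller standard deviation after substitution carries through to understated standard errors and weakened correlations in later analyses.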
Other Resources
Dr. Lakin
AERA Research Grants and Dissertation Grants
“The program seeks to stimulate research on U.S. education issues using data from the large-scale, national and international data sets supported by the National Center for Education Statistics (NCES), NSF, and other federal agencies, and to increase the number of education researchers using these data sets.”
• Suggestions based on personal observations and the RFP:
  • Must use a strong quasi-experimental design (Schneider et al., Estimating Causal Effects: Using Experimental and Observational Designs)
    • Regression discontinuity, propensity score matching, etc.
  • Bringing in new quantitative approaches from other fields is also very appealing (economics, epidemiology, etc.)
  • Check past grants to see which datasets are “neglected” (more recent datasets are better)
  • Ideas that involve multiple datasets in meaningful research are more successful
  • Analyses of international datasets have been more successful recently
Other opportunities
• IES Research Grants do fund secondary data analyses with Exploration grant goals (any subject area): http://ies.ed.gov/funding/
• IES data training workshops: http://ies.ed.gov/whatsnew/conferences/?cid=2
• AERA annual meeting usually has data training events:
  • PDC02: Analyzing NAEP Assessment Data with Plausible Values…
  • PDC13: Advanced Analysis using Adult International Large Scale Assessment Databases
  • PDC16: Using NAEP Data on the Web for Educational Policy Research
  • Several on quantitative methods (including propensity scores)
• AERA Institute on Statistical Analysis for Education Policy (summer)
• IES/NCES hosts STATS-DC conferences and summer institutes to train researchers in using specific datasets
Q&A
Presentation files are available from
http://www.auburn.edu/~jml0035/
(Under “Conference materials and resources”)