data organization & creating reproducible analyses · data organization & creating...
TRANSCRIPT
Data Organization &
Creating Reproducible Analyses
Nick Jackson
University of Southern California
Department of Psychology
09/20/2013
Overview
Raw Data Organization
◦ Vocabulary
◦ Entering/Storing Data in Excel
◦ Data Checks/Reducing Errors
◦ Data Backup/Security
Creating Reproducible Analyses
◦ Directory Structures
◦ Workflow for Projects (cleaning->archiving)
◦ Workflow for Analyses (cleaning->analysis->present/archive)
Vocabulary
Database
◦ “An organized collection of data”-Wikipedia
◦ Most of us don’t actually use “real” databases
SQL, Oracle, dBASE, MS Access
More Specifically: where the links between different
data files are automatic (ie. no merging)
◦ We mostly store data in Excel or SPSS and
pretend it is a database.
Vocabulary
Wide File: One row of data per subject
◦ Pseudonym: Flat File
Long File: more than one row of data per
subject
◦ Pseudonyms: Relational File
Wide File: Primary Index Variables
Cases
An Observation Observations
Long File:
Cases
Variable Primary Index Secondary Index
An Observation Observations
Developing an Excel Database
System – Basic Rules No “Mississippi” tables
◦ Excel tables that contain ALL of your data at
once can cause problems
Enter Wide Data in wide format, Long
Data in long format.
One Excel File per survey/measure (not
Sheets!)
Developing an Excel Database
System-Variable Name Rules The primary (and secondary) index should have
the EXACT same name in every file.
Only the primary (and secondary) index should have the same names within and between files.
Be consistent with variable names, keep them simple and free of unusual characters or spaces ◦ Use ”_” character to join two words together
instead_of_a_blank_space.
Never leave a blank column between variable names-this causes random sorting.
ESS:
PSQI:
Primary Indices Don’t
Match:
id≠Id
Inconsistent Capitalization:
These should start lower case, OR
all other variables should be capitalized
Improper Spaces:
Variable names shouldn’t
have spaces in it.
Inconsistent Naming:
Underscore “_” should be
removed OR other
variables should have
underscores.
Blank Column:
There should be no empty
columns between variables
Types of Data http://en.wikipedia.org/wiki/Numerical_data
Types of Data-Considerations
Default formatting in excel is “General”
which is fine for everything but dates.
Collect discrete variables as continuous
when possible. DOB vs Age vs age groups.
NEVER enter data as already reverse
coded.
Only use strings when you cannot assign
a number
Types of Data-Considerations
Be consistent with entering string data
◦ Medications:
“Aspirin” vs “ASPIRIN” vs “Apsirin”
Use Alphanumeric codes when possible
(ATC\DDD)
Time and Date data
◦ Use the correct formatting
12:01 PM
09/20/2013
Handling Missing Data
Have indicators of missing data
◦ Use numbers as the indicator
NOT a possible real value in any excel sheet
Typically -9 or -99 or -999
◦ Not for String/Dates/Times
Handling Unknown/Don’t
Know/NA Responses Same rules as missing data indicator
◦ Usually -8 or -88 or -888
The Data Dictionary
Key Elements
◦ Variable Name –the name exactly as it
appears in the data
◦ Variable Label- a lengthier description of
the variable
◦ Variable Values-tells you what discrete
values mean (ie. 2=Male, -99=Missing)
Data Dictionary
Reducing Errors in Data Entry
Data Validation tools in Excel
◦ Restrict possible values
Step 1: Highlight Variables/Columns. Select Data Tab, Click on Data Validation
A) Highlight Variables
B) Select Data Tab C) Click on Data Validation
Using Excel Data Validation Tools to Restrict Variable Values
Step 2 (For Discrete Data): Select to Allow: List.
Input the potential values for the variable (including the missing data indicator) in the Source: box
D) Select to Allow: List E) Input Potential Values
Step 2 (For Interval/Ratio Data): Select to Allow: (Whole Number OR Decimal).
Determine the criteria for the Data Box (Between or Not Between)
Input Minimum and Maximum values.
D) Select to Allow: Whole
Number OR Decimal
F) Select Appropriate
Minimum and
Maximum
E) Choose
Between OR Not
Between from
Data: box
Reducing Errors in Data Entry
Use Data validation tools in Excel
Enter with a number pad
◦ 1-0.01% Error Rate
Use split windows to keep case numbers
and variable names visible
Click on this, and drag the window down
Splitting windows in Excel
Splitting windows in Excel
Reducing Errors in Data Entry
Use Data validation tools in Excel
Enter with a number pad
Use split windows to keep case numbers
and variable names visible
Use Column Borders to denote separate
physical pages of your survey within your
excel file.
Applying a border in Excel
Select this Then this Which will produce….THIS!
Checking for Errors
Use the excel data filters
Re-enter data because
◦ No hand-entry method protects against row
shifts
◦ Acceptable, but incorrect values can be
present
◦ In general, a 10% re-entry.
Errors Happen.
Backing up Your Data
Should exist in at least 2 physically
separate locations.
Regular backups
◦ Easy to reconcile versions
Servers, Clouds, RAID Drives?
◦ Dropbox, Skydrive, Google Drive: Not HIPPA
or FERPA compliant
Back-up of your data every time you
change something in the raw data.
Data Security
Protecting the Identity of your subjects
◦ Assign ID’s that have NO significance to any
identifying characteristic of the subject
Identifying characteristics: names (initials), dates (of
birth, death), years, addresses, phone number, email,
zip codes, geo-codes, SSN, MRN, VIN, DL #, License
Plate, IP address, health plan/insurance numbers
◦ Never store Identifying characteristics in files
with your other data.
Data Security
Keeping people out of your data
◦ Physical Threats
Use anti-theft devices (cables)
BIOS Password on the hard drive + Logon Password
Remote wipes
◦ Cyber Threats
2-step verification for any cloud/email services
Traffic encryption, VPNs
Servers should have restricted access Also how often backups sent offsite.
Data Security
1) Click the Office button.
2) Scroll over the
Prepare button.
3) Click on Encrypt
Document.
You can encrypt your excel file.
Creating Reproducible Analyses
Without replication we cease to have science ◦ can you replicate your own findings?
◦ Is everyone reinventing the wheel with your rawdata?
Journals are beginning to require submission of rawdata and syntax
“The NIH recently declared that the sharing of data is essential for translating research into knowledge and products that improve health. As such, all investigators seeking more than $500,000 in grant support per year are now required to include a plan for data sharing” –Savage & Vickers 2009
Creating Reproducible Analyses
The difficultly of replication
◦ Can you find it?
Issues of version control (multiple versions)
Everything is labeled “final”
◦ Undocumented steps
Don’t end up on here:
http://retractionwatch.wordpress.com/
Creating Reproducible Analyses
Data Workflow (Long, 2009) ◦ Clean -> Analyze-> Present-> Archive
Replication is our guiding principle ◦ Simple Complex systems don’t get used (usability)
High efficiency
◦ Standardized Create a template
Be consistent
◦ Scalable Easy to do with 2 or 200 (automation)
Setting up a directory structure
Standardized directory structures
(templates)
◦ Easy to find things
The naming of data files in your
directories 2-4 letter acronym_nameYYYYMMDD
Ensures proper sorting and version
control
Data Workflow
Project Level (Data Manager)
◦ Clean - > Archive
Analysis Level (Data Analyst/Researcher)
◦ Clean -> Analyze -> Present / Archive
Data Workflow: Project Level (Cleaning)
Rawdata
Clean Data
isac_cesd20130915.xlsx
isac_bdi20071124.xlsx
isac_wasi20090616.sav
Programs
isac_master_wideDATE.sav
isac_master_longDATE.sav
setup.sps
Merge: cesd, bdi, wasi
Data Workflow: Project Level (Cleaning)
Programs:
◦ Setup
Cleans and aggregates the files
Data Workflow: Project Level (Archive)
Rawdata
isac_cesd20130915.xlsx
isac_bdi20071124.xlsx
isac_wasi20090616.sav
Archive
isac_cesd20130920.xlsx
Data Workflow
Analysis Level
◦ Clean -> Analyze -> Present /Archive
Data Workflow: Analysis Level
A reminder of the directory structures for the Analysis
Contents of my \Analysis folder:
Data Workflow: Analysis Level (Cleaning)
Method 1
Rawdata
Clean Data
isac_cesd20130915.xlsx
isac_bdi20071124.xlsx
isac_master_wide20130920.sav
isac_master_long20130920.sav
cleaning.sps
Analysis
201309 Example For
Presentation (Jackson)
Analysis Data
Analysis
Rawdata
Syntax
isac_analysisDATE.sav
Data Workflow: Analysis Level (Cleaning)
Method 2
Rawdata
Clean Data
isac_cesd20130915.xlsx
isac_bdi20071124.xlsx
isac_master_wide20130920.sav
isac_master_long20130920.sav
cleaning.sps
Analysis
201309 Example For
Presentation (Jackson)
Analysis Data
Analysis
Rawdata
Syntax
isac_analysisDATE.sav
I use this folder
only for rawdata
files that are
unique to the
specific analysis
Data Workflow: Analysis Level
(Analyze)
cleaning.sps
Analysis
201309 Example For
Presentation (Jackson)
Analysis Data
Analysis
Rawdata
Syntax
isac_analysisDATE.sav
analysis.sps
Data Workflow: Analysis Level
Syntax:
◦ Cleaning.sps
Cleans and aggregates the files
◦ Analysis.sps
Conducts the analysis
Makes NO changes to the data
DOES NOT write out any files
Data Workflow: Analysis Level
(Present/Archive)
You have analyzed your data and put
together a paper, abstract, poster, etc.
ARCHIVE your syntax on the date of
submission
◦ Put dates at the end of the syntax file
◦ DO NOT TOUCH THE SYNTAX FILE
AGAIN (I usually comment out the
commands)
Create Abstact.sps syntax file
Data Workflow: Analysis Level
(Present/Archive)
cleaning20131001.sps
Analysis
201309 Example For
Presentation (Jackson)
Analysis Data
Syntax
isac_analysis20130920.sav
analysis20131001.sps
_Abstract
Add dates to Syntax
apsabstract_20131001.docx
abstract20131001.sps Create abstract Syntax file
cleaning.sps
analysis.sps Probably still analyzing stuff for
creating abstract into a poster,
so I keeping working versions of
my needed syntax files
I generally do not archive my analysis
folders, but rather focus on archiving the
syntax when major events happen in an
analysis (submissions, revisions,
acceptance).
Data Workflow: Analysis Level
(Present/Archive)
The important things
Never overwrite a rawdata file.
◦ Every change to your rawdata should be a new
file with a new date.
Files are only opened or saved with syntax
All changes to files are done with syntax
Cleaning/Setup Syntax:
◦ We read files in and write separate files out
Analysis Syntax:
◦ We read files in and NEVER write or change files
The important things
When important things happen, we
archive the SYNTAX by commenting out
all of the commands and placing dates on
the end.