data organization & creating reproducible analyses · data organization & creating...

52
Data Organization & Creating Reproducible Analyses Nick Jackson University of Southern California Department of Psychology 09/20/2013

Upload: truongbao

Post on 01-Apr-2018

235 views

Category:

Documents


2 download

TRANSCRIPT

Page 1: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Organization &

Creating Reproducible Analyses

Nick Jackson

University of Southern California

Department of Psychology

09/20/2013

Page 2: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Overview

Raw Data Organization

◦ Vocabulary

◦ Entering/Storing Data in Excel

◦ Data Checks/Reducing Errors

◦ Data Backup/Security

Creating Reproducible Analyses

◦ Directory Structures

◦ Workflow for Projects (cleaning->archiving)

◦ Workflow for Analyses (cleaning->analysis->present/archive)

Page 3: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Vocabulary

Database

◦ “An organized collection of data”-Wikipedia

◦ Most of us don’t actually use “real” databases

SQL, Oracle, dBASE, MS Access

More Specifically: where the links between different

data files are automatic (ie. no merging)

◦ We mostly store data in Excel or SPSS and

pretend it is a database.

Page 4: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Vocabulary

Wide File: One row of data per subject

◦ Pseudonym: Flat File

Long File: more than one row of data per

subject

◦ Pseudonyms: Relational File

Page 5: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Wide File: Primary Index Variables

Cases

An Observation Observations

Long File:

Cases

Variable Primary Index Secondary Index

An Observation Observations

Page 6: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Developing an Excel Database

System – Basic Rules No “Mississippi” tables

◦ Excel tables that contain ALL of your data at

once can cause problems

Enter Wide Data in wide format, Long

Data in long format.

One Excel File per survey/measure (not

Sheets!)

Page 7: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Developing an Excel Database

System-Variable Name Rules The primary (and secondary) index should have

the EXACT same name in every file.

Only the primary (and secondary) index should have the same names within and between files.

Be consistent with variable names, keep them simple and free of unusual characters or spaces ◦ Use ”_” character to join two words together

instead_of_a_blank_space.

Never leave a blank column between variable names-this causes random sorting.

Page 8: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

ESS:

PSQI:

Primary Indices Don’t

Match:

id≠Id

Inconsistent Capitalization:

These should start lower case, OR

all other variables should be capitalized

Improper Spaces:

Variable names shouldn’t

have spaces in it.

Inconsistent Naming:

Underscore “_” should be

removed OR other

variables should have

underscores.

Blank Column:

There should be no empty

columns between variables

Page 9: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Types of Data http://en.wikipedia.org/wiki/Numerical_data

Page 10: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Types of Data-Considerations

Default formatting in excel is “General”

which is fine for everything but dates.

Collect discrete variables as continuous

when possible. DOB vs Age vs age groups.

NEVER enter data as already reverse

coded.

Only use strings when you cannot assign

a number

Page 11: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Types of Data-Considerations

Be consistent with entering string data

◦ Medications:

“Aspirin” vs “ASPIRIN” vs “Apsirin”

Use Alphanumeric codes when possible

(ATC\DDD)

Time and Date data

◦ Use the correct formatting

12:01 PM

09/20/2013

Page 12: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Handling Missing Data

Have indicators of missing data

◦ Use numbers as the indicator

NOT a possible real value in any excel sheet

Typically -9 or -99 or -999

◦ Not for String/Dates/Times

Page 13: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Handling Unknown/Don’t

Know/NA Responses Same rules as missing data indicator

◦ Usually -8 or -88 or -888

Page 14: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

The Data Dictionary

Key Elements

◦ Variable Name –the name exactly as it

appears in the data

◦ Variable Label- a lengthier description of

the variable

◦ Variable Values-tells you what discrete

values mean (ie. 2=Male, -99=Missing)

Page 15: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Dictionary

Page 16: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Reducing Errors in Data Entry

Data Validation tools in Excel

◦ Restrict possible values

Page 17: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Step 1: Highlight Variables/Columns. Select Data Tab, Click on Data Validation

A) Highlight Variables

B) Select Data Tab C) Click on Data Validation

Using Excel Data Validation Tools to Restrict Variable Values

Step 2 (For Discrete Data): Select to Allow: List.

Input the potential values for the variable (including the missing data indicator) in the Source: box

D) Select to Allow: List E) Input Potential Values

Page 18: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Step 2 (For Interval/Ratio Data): Select to Allow: (Whole Number OR Decimal).

Determine the criteria for the Data Box (Between or Not Between)

Input Minimum and Maximum values.

D) Select to Allow: Whole

Number OR Decimal

F) Select Appropriate

Minimum and

Maximum

E) Choose

Between OR Not

Between from

Data: box

Page 19: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Reducing Errors in Data Entry

Use Data validation tools in Excel

Enter with a number pad

◦ 1-0.01% Error Rate

Use split windows to keep case numbers

and variable names visible

Page 20: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Click on this, and drag the window down

Splitting windows in Excel

Page 21: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Splitting windows in Excel

Page 22: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Reducing Errors in Data Entry

Use Data validation tools in Excel

Enter with a number pad

Use split windows to keep case numbers

and variable names visible

Use Column Borders to denote separate

physical pages of your survey within your

excel file.

Page 23: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Applying a border in Excel

Select this Then this Which will produce….THIS!

Page 24: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Checking for Errors

Use the excel data filters

Re-enter data because

◦ No hand-entry method protects against row

shifts

◦ Acceptable, but incorrect values can be

present

◦ In general, a 10% re-entry.

Page 25: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Errors Happen.

Page 26: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Backing up Your Data

Should exist in at least 2 physically

separate locations.

Regular backups

◦ Easy to reconcile versions

Servers, Clouds, RAID Drives?

◦ Dropbox, Skydrive, Google Drive: Not HIPPA

or FERPA compliant

Back-up of your data every time you

change something in the raw data.

Page 27: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Security

Protecting the Identity of your subjects

◦ Assign ID’s that have NO significance to any

identifying characteristic of the subject

Identifying characteristics: names (initials), dates (of

birth, death), years, addresses, phone number, email,

zip codes, geo-codes, SSN, MRN, VIN, DL #, License

Plate, IP address, health plan/insurance numbers

◦ Never store Identifying characteristics in files

with your other data.

Page 28: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Security

Keeping people out of your data

◦ Physical Threats

Use anti-theft devices (cables)

BIOS Password on the hard drive + Logon Password

Remote wipes

◦ Cyber Threats

2-step verification for any cloud/email services

Traffic encryption, VPNs

Servers should have restricted access Also how often backups sent offsite.

Page 29: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Security

1) Click the Office button.

2) Scroll over the

Prepare button.

3) Click on Encrypt

Document.

You can encrypt your excel file.

Page 30: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Creating Reproducible Analyses

Without replication we cease to have science ◦ can you replicate your own findings?

◦ Is everyone reinventing the wheel with your rawdata?

Journals are beginning to require submission of rawdata and syntax

“The NIH recently declared that the sharing of data is essential for translating research into knowledge and products that improve health. As such, all investigators seeking more than $500,000 in grant support per year are now required to include a plan for data sharing” –Savage & Vickers 2009

Page 31: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Creating Reproducible Analyses

The difficultly of replication

◦ Can you find it?

Issues of version control (multiple versions)

Everything is labeled “final”

◦ Undocumented steps

Don’t end up on here:

http://retractionwatch.wordpress.com/

Page 32: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Creating Reproducible Analyses

Data Workflow (Long, 2009) ◦ Clean -> Analyze-> Present-> Archive

Replication is our guiding principle ◦ Simple Complex systems don’t get used (usability)

High efficiency

◦ Standardized Create a template

Be consistent

◦ Scalable Easy to do with 2 or 200 (automation)

Page 33: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Setting up a directory structure

Standardized directory structures

(templates)

◦ Easy to find things

Page 34: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps
Page 35: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

The naming of data files in your

directories 2-4 letter acronym_nameYYYYMMDD

Ensures proper sorting and version

control

Page 36: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow

Project Level (Data Manager)

◦ Clean - > Archive

Analysis Level (Data Analyst/Researcher)

◦ Clean -> Analyze -> Present / Archive

Page 37: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Project Level (Cleaning)

Rawdata

Clean Data

isac_cesd20130915.xlsx

isac_bdi20071124.xlsx

isac_wasi20090616.sav

Programs

isac_master_wideDATE.sav

isac_master_longDATE.sav

setup.sps

Merge: cesd, bdi, wasi

Page 38: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Project Level (Cleaning)

Programs:

◦ Setup

Cleans and aggregates the files

Page 39: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps
Page 40: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Project Level (Archive)

Rawdata

isac_cesd20130915.xlsx

isac_bdi20071124.xlsx

isac_wasi20090616.sav

Archive

isac_cesd20130920.xlsx

Page 41: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow

Analysis Level

◦ Clean -> Analyze -> Present /Archive

Page 42: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Analysis Level

A reminder of the directory structures for the Analysis

Contents of my \Analysis folder:

Page 43: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Analysis Level (Cleaning)

Method 1

Rawdata

Clean Data

isac_cesd20130915.xlsx

isac_bdi20071124.xlsx

isac_master_wide20130920.sav

isac_master_long20130920.sav

cleaning.sps

Analysis

201309 Example For

Presentation (Jackson)

Analysis Data

Analysis

Rawdata

Syntax

isac_analysisDATE.sav

Page 44: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Analysis Level (Cleaning)

Method 2

Rawdata

Clean Data

isac_cesd20130915.xlsx

isac_bdi20071124.xlsx

isac_master_wide20130920.sav

isac_master_long20130920.sav

cleaning.sps

Analysis

201309 Example For

Presentation (Jackson)

Analysis Data

Analysis

Rawdata

Syntax

isac_analysisDATE.sav

I use this folder

only for rawdata

files that are

unique to the

specific analysis

Page 45: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Analysis Level

(Analyze)

cleaning.sps

Analysis

201309 Example For

Presentation (Jackson)

Analysis Data

Analysis

Rawdata

Syntax

isac_analysisDATE.sav

analysis.sps

Page 46: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Analysis Level

Syntax:

◦ Cleaning.sps

Cleans and aggregates the files

◦ Analysis.sps

Conducts the analysis

Makes NO changes to the data

DOES NOT write out any files

Page 47: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps
Page 48: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Analysis Level

(Present/Archive)

You have analyzed your data and put

together a paper, abstract, poster, etc.

ARCHIVE your syntax on the date of

submission

◦ Put dates at the end of the syntax file

◦ DO NOT TOUCH THE SYNTAX FILE

AGAIN (I usually comment out the

commands)

Create Abstact.sps syntax file

Page 49: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

Data Workflow: Analysis Level

(Present/Archive)

cleaning20131001.sps

Analysis

201309 Example For

Presentation (Jackson)

Analysis Data

Syntax

isac_analysis20130920.sav

analysis20131001.sps

_Abstract

Add dates to Syntax

apsabstract_20131001.docx

abstract20131001.sps Create abstract Syntax file

cleaning.sps

analysis.sps Probably still analyzing stuff for

creating abstract into a poster,

so I keeping working versions of

my needed syntax files

Page 50: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

I generally do not archive my analysis

folders, but rather focus on archiving the

syntax when major events happen in an

analysis (submissions, revisions,

acceptance).

Data Workflow: Analysis Level

(Present/Archive)

Page 51: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

The important things

Never overwrite a rawdata file.

◦ Every change to your rawdata should be a new

file with a new date.

Files are only opened or saved with syntax

All changes to files are done with syntax

Cleaning/Setup Syntax:

◦ We read files in and write separate files out

Analysis Syntax:

◦ We read files in and NEVER write or change files

Page 52: Data Organization & Creating Reproducible Analyses · Data Organization & Creating Reproducible Analyses ... health plan/insurance numbers ... isac_master_long20130920.sav cleaning.sps

The important things

When important things happen, we

archive the SYNTAX by commenting out

all of the commands and placing dates on

the end.