SIPI Data Days 2019, N. Thompson, PhD, 2019/07/12
TRANSCRIPT
Welcome to data days
N. Thompson, PhD
SIPI data days 2019
My story: I’m Nicole...
❏ Cuban-American
❏ Grew up all over USA
❏ Loves:
  ❏ fantasy, sci-fi
  ❏ human language and culture
  ❏ physics and chemistry
❏ Wanted to tie it all together
I’m now a behavioral ecologist
After 3 degrees and lots of exploring...
My job is to research animal behavior and physiology.
❏ Analyze data almost every day.
❏ Greatest tool: A computer programming language called “R”.
My goals for you
❏ Apply the principles of tidy data and data visualization
❏ Use curiosity and creativity to generate and answer research questions on socially relevant topics
After our two day-long sessions, you will be able to:
❏ Manipulate & explore a data set using the R programming language
❏ Visualize patterns in data with R
… and have fun!
What is R? A language to talk to your computer
Diagram courtesy of Garrett Grolemund
Who is R for? Everyone.
What is data science?
Wickham & Grolemund, r4ds
Exploratory data analysis is one important part
Wickham & Grolemund, r4ds
Your capstone team projects
❏ Choose data sets
❏ Become familiar with them and form research questions
❏ Use functions in R to answer questions
  ❏ Data transformations and summaries (package dplyr)
  ❏ Data visualizations (package ggplot2)
❏ Present questions and findings to class in 15 min
The date of the 15-minute group presentations is TBD.
Our schedule
Day 1: 7/12/19
Literacy: Choose and describe a data set, create research questions
Transformations: Exploring data sets with R by subsets, transformations, and summaries
Day 2: 7/19/19
Graphics best practices: Evaluate and interpret visualizations
Visualizations: Exploring data sets with R by graphical plotting
*Exploring = answering questions*
Let’s meet R in RStudio
Troubleshooting
Run “?function_name” in the console for help.
Google “R” plus the error message, function name, or task.
Ask a friend.
Ask me!
Know you can do it.
Introduction to Tidy Data
Importance of data literacy
https://en.wikipedia.org/wiki/Data
https://www.digitaltveurope.com/2019/05/31/data-to-drive-40-of-tv-ad-spend-by-2020/
https://stackoverflow.com/questions/40182253/complex-headers-in-angular2-data-table
What is tidy data?
❏ Data are in a table
❏ Each variable gets a column
❏ Each observation gets a row
❏ Each cell is a single value
❏ Each type of observation gets its own table
Fig 12.1, Wickham & Grolemund “R for Data Science”
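The tidy data rules above can be sketched in R. This is a minimal illustration with made-up numbers; the table names `untidy` and `tidy` and the columns `state`, `year`, and `cases` are assumptions for the example, not one of the project data sets.

```r
# Hypothetical values, for illustration only.
# Untidy: the variable "year" is hidden inside the column names.
untidy <- data.frame(
  state      = c("AZ", "NM"),
  cases_2016 = c(10, 20),
  cases_2017 = c(15, 25)
)

# Tidy: each variable (state, year, cases) gets a column,
# and each observation (one state in one year) gets a row.
tidy <- data.frame(
  state = rep(untidy$state, times = 2),
  year  = rep(c(2016, 2017), each = nrow(untidy)),
  cases = c(untidy$cases_2016, untidy$cases_2017)
)

print(tidy)
```

The tidy version has more rows and fewer columns, and every cell holds a single value.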
Tidy data sets have data dictionaries
Data dictionary: a description of each variable in a data set, including its data type and units.
Soon, you will write your own data dictionaries in teams.
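One possible way to keep a data dictionary is as a small table of its own. The sketch below assumes hypothetical variables (`age`, `bmi`, `diagnosis`); the names, descriptions, and units are illustrative and not copied from any of the project data sets.

```r
# Hypothetical data dictionary: one row per variable in the data set.
data_dictionary <- data.frame(
  variable    = c("age", "bmi", "diagnosis"),
  description = c("Age of participant", "Body mass index", "Diabetes diagnosis"),
  data_type   = c("continuous", "continuous", "logical"),
  units       = c("years", "kg/m^2", "TRUE/FALSE"),
  stringsAsFactors = FALSE
)

print(data_dictionary)
```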
Example tidy data set: Diabetes risk factors in Pima women from AZ
From: https://www.kaggle.com/uciml/pima-indians-diabetes-database
Data dictionary: define the variables
Diabetes risk factors in Pima women from AZ
[Figure: data table with columns labeled by variable type: continuous, categorical, logical]
Is it tidy?
Diabetes risk factors in Pima women from AZ
❏ Data are in a table
❏ Each variable gets a column
❏ Each observation gets a row - a woman >21 yrs old
❏ Each cell is a single value
❏ Each type of observation gets its own table - diagnosis and measurements per woman
Your turn…
❏ Break into teams of 3 - lead detective, scribe, & reporter
❏ Choose data sets - view in RStudio
❏ Learning goals for 1st group activity:
  ❏ create a data dictionary for the chosen data set
  ❏ formulate research questions and diagnose limitations of the data set
Team roles
Reporter: communicates the team’s findings, process, and questions to the class as a whole.
Lead detective: drives the team toward its goal, takes charge of plans of action, watches the clock.
Scribe: writes down the team’s initial answers on worksheets and writes initial code.
Project data sets:
1. Cancer rates by US state in 2017
2. Human trafficking in the USA in 2016 (some untidiness!)
3. Crime rates in major metropolitan areas
4. Gun crime in the USA 2012-2014
5. Diabetes risk factors among Pima women in AZ
Exploratory Data Analysis (EDA) in R
Moving on from tidy data… time to start exploring
Wickham & Grolemund, r4ds
Key functions you will learn (see handouts)
dplyr functions:
%>%
select()
filter()
mutate()
summarise()
group_by()
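A minimal sketch of how these dplyr functions can chain together, assuming the dplyr package is installed. The `patients` table and its columns are hypothetical and the numbers are made up.

```r
library(dplyr)

# Hypothetical measurements, for illustration only.
patients <- data.frame(
  id        = 1:6,
  age       = c(25, 31, 47, 52, 29, 60),
  weight_kg = c(60, 72, 80, 90, 55, 68),
  height_m  = c(1.60, 1.70, 1.65, 1.75, 1.58, 1.62),
  diagnosis = c(FALSE, FALSE, TRUE, TRUE, FALSE, TRUE)
)

result <- patients %>%                               # %>% passes the result to the next step
  select(age, weight_kg, height_m, diagnosis) %>%    # keep only these columns
  filter(age >= 30) %>%                              # keep only rows that meet a condition
  mutate(bmi = weight_kg / height_m^2) %>%           # add a new column
  group_by(diagnosis) %>%                            # split rows into groups
  summarise(n = n(), mean_bmi = mean(bmi))           # one summary row per group

print(result)
```

Read the pipe `%>%` as "then": take the patients, then select columns, then filter rows, and so on.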
Base R arithmetic & notation:
<- “assignment”
==, != “equal to”, “not equal to”
>, <, >=, <= inequalities
&, | “and” (intersection), “or” (union)
str(), View(), c()
mean(), sd(), sum()
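The base R notation above can be tried on a small vector. This is a sketch with made-up ages; `View()` is left out because it opens a viewer window in RStudio rather than printing to the console.

```r
# Hypothetical ages, for illustration only.
ages <- c(25, 31, 47, 52)   # <- assigns; c() combines values into a vector

ages == 31                  # equal to
ages != 31                  # not equal to
ages >= 31                  # greater than or equal to
ages > 30 & ages < 50       # "and": both conditions must be TRUE
ages < 30 | ages > 50       # "or": at least one condition must be TRUE

str(ages)                   # compact description of the object

mean_age  <- mean(ages)     # average
sd_age    <- sd(ages)       # standard deviation
total_age <- sum(ages)      # total
```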
Key functions you will learn cont’d (see handouts)
Functions for data types:
class()
is.na()
as.numeric() - continuous
as.character() - categorical
as.factor() - categorical
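A short sketch of the data type functions in use. The values are made up, and the object names `raw`, `ages`, `groups`, and `labels_text` are assumptions for the example.

```r
# Hypothetical values that were read in as text, with one missing value.
raw <- c("21", "35", NA, "50")

class(raw)                  # "character"
is.na(raw)                  # TRUE where a value is missing

ages <- as.numeric(raw)     # convert text to numbers (continuous)
class(ages)                 # "numeric"

groups <- as.factor(c("yes", "no", "yes"))  # categorical with fixed levels
class(groups)               # "factor"
levels(groups)              # the categories, sorted alphabetically

labels_text <- as.character(groups)         # back to plain text
```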
Base subsetting:
Data[a, b] - a indexes rows, b indexes columns
Data$name - select a column
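Base subsetting can be sketched on a small table. The data frame `Data` and its columns `id` and `age` are hypothetical and the values are made up.

```r
# Hypothetical table, for illustration only.
Data <- data.frame(
  id  = c("a", "b", "c"),
  age = c(25, 31, 47),
  stringsAsFactors = FALSE
)

first_row <- Data[1, ]             # row 1, all columns
age_col   <- Data[, 2]             # all rows, column 2
one_cell  <- Data[2, "age"]        # row 2, column named "age"
ages      <- Data$age              # select a column by name
older     <- Data[Data$age > 30, ] # rows where a condition is TRUE
```

Leaving the row or column index blank means "keep all of them".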
Learning to code...
1. Observe live coding
2. Copy sections of live code
3. Fill in blanks and perform exercises solo
4. Share progress with teammates
To our consoles!
Benefits of tidy data
❏ Consistent and predictable structure
❏ Prevents errors in your own analyses
❏ Increases clarity for others to follow your analyses