utah data manipulation with plyr - hadley wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11....
TRANSCRIPT
![Page 1: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/1.jpg)
Hadley WickhamRice University / RStudio
Data manipulation with plyr
November 2012
http://stat405.had.co.nz/utah
Monday, November 12, 12
![Page 2: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/2.jpg)
• Visualizing data with R
• Working in R
• Transforming data
• Reshaping data
Day 1
Transform
Visualise
Model
Day 2• Fitting and interpreting
• Understanding model uncertainty
• Comparing models
• Modelling/Machine learning survey
Monday, November 12, 12
![Page 3: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/3.jpg)
1. Baby names data
2. Four tools for data manipulation
3. Groupwise operations
4. ddply
5. Challenges
Monday, November 12, 12
![Page 4: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/4.jpg)
CC BY http://www.flickr.com/photos/the_light_show/2586781132
Top 1000 male and female baby names in the US, from 1880 to 2008.258,000 records (1000 * 2 * 129)But only five variables: year, name, soundex, sex and prop.
Baby names
Monday, November 12, 12
![Page 5: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/5.jpg)
install.packages(c("plyr", "ggplot2"))
library(plyr)library(ggplot2)
options(stringsAsFactors = FALSE)
bnames <- read.csv("bnames2.csv.bz2")bnames <- read.csv(choose.file())
Monday, November 12, 12
![Page 6: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/6.jpg)
head(bnames) year name prop sex soundex1 1880 John 0.081541 boy J5002 1880 William 0.080511 boy W4503 1880 James 0.050057 boy J5204 1880 Charles 0.045167 boy C6425 1880 George 0.043292 boy G6206 1880 Frank 0.027380 boy F652
tail(bnames) year name prop sex soundex257995 2008 Diya 0.000128 girl D000257996 2008 Carleigh 0.000128 girl C642257997 2008 Iyana 0.000128 girl I500257998 2008 Kenley 0.000127 girl K540257999 2008 Sloane 0.000127 girl S450258000 2008 Elianna 0.000127 girl E450
Monday, November 12, 12
![Page 7: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/7.jpg)
Extract your name from the dataset. Plot the trend over time.
Your turn
Monday, November 12, 12
![Page 8: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/8.jpg)
garrett <- bnames[bnames$name == "Garrett", ]
qplot(year, prop, data = garrett, geom = "line")
Monday, November 12, 12
![Page 9: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/9.jpg)
subset summarisemutatearrange
Monday, November 12, 12
![Page 10: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/10.jpg)
Revision
Recall the four functions that filter rows, create summaries, add new variables and rearrange the rows.You have 30 seconds!
Monday, November 12, 12
![Page 11: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/11.jpg)
Function Package
subset base
summarise plyr
mutate plyr
arrange plyr
They all have similar syntax. The first argument is a data frame, and all other arguments are interpreted in the context of that data frame. Each returns a data frame.
Monday, November 12, 12
![Page 12: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/12.jpg)
subset(df, color == "blue")
color valueblue 1black 2blue 3blue 4black 5
color valueblue 1blue 3blue 4
Monday, November 12, 12
![Page 13: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/13.jpg)
summarise(df, double = 2 * value)
color valueblue 1black 2blue 3blue 4black 5
double246810
Monday, November 12, 12
![Page 14: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/14.jpg)
summarise(df, total = sum(value))
color valueblue 1black 2blue 3blue 4black 5
total15
Monday, November 12, 12
![Page 15: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/15.jpg)
mutate(df, double = 2 * value)
color valueblue 1black 2blue 3blue 4black 5
color value doubleblue 1 2black 2 4blue 3 6blue 4 8black 5 10
Monday, November 12, 12
![Page 16: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/16.jpg)
arrange(df, color)
color value4 11 25 33 42 5
color value1 22 53 44 15 3
Monday, November 12, 12
![Page 17: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/17.jpg)
arrange(df, desc(color))
color value4 11 25 33 42 5
color value5 34 13 42 51 2
Monday, November 12, 12
![Page 18: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/18.jpg)
Using the data frame containing your name:Reorder from highest to lowest popularity.Calculate the min, mean, and max proportions for your name over the years. Return these in a data frame.
Add a new column that changes the proportion to a percentage
Your turn
Monday, November 12, 12
![Page 19: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/19.jpg)
arrange(garrett, desc(prop))
summarise(garrett, min = min(prop), mean = mean(prop), max = max(prop))
mutate(garrett, perc = prop * 100)
Monday, November 12, 12
![Page 20: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/20.jpg)
Group wise operations
Monday, November 12, 12
![Page 21: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/21.jpg)
Total number of people per name
name total1 Aaden 9592 Aaliyah 396653 Aarav 2194 Aaron 5094645 Ab 256 Abagail 26827 Abb 168 Abbey 143489 Abbie 1662210 Abbigail 6800
Monday, November 12, 12
![Page 22: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/22.jpg)
Total number of people per name
Do we have enough
information to calculate this?
name total1 Aaden 9592 Aaliyah 396653 Aarav 2194 Aaron 5094645 Ab 256 Abagail 26827 Abb 168 Abbey 143489 Abbie 1662210 Abbigail 6800
Monday, November 12, 12
![Page 23: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/23.jpg)
Your turnIn groups, calculate the totals for each name. If it is difficult, just devise a strategy for doing it.Hint: Start by calculating the total for a single name (e.g, your name).
name total1 Aaden 9592 Aaliyah 396653 Aarav 2194 Aaron 5094645 Ab 256 Abagail 26827 Abb 168 Abbey 143489 Abbie 1662210 Abbigail 6800
Monday, November 12, 12
![Page 24: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/24.jpg)
garrett <- subset(bnames, name == "garrett")sum(garrett$n)
# Orsummarise(garrett, total = sum(n))
# But how could we do this for every name?
Monday, November 12, 12
![Page 25: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/25.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_total <- summarise(john, total = sum(n))william_total <- summarise(william, total = sum(n))james_total <- summarise(james, total = sum(n))...
# Combinetotals <- rbind(john_total, william_total, james_total...)
Monday, November 12, 12
![Page 26: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/26.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_total <- summarise(john, total = sum(n))william_total <- summarise(william, total = sum(n))james_total <- summarise(james, total = sum(n))...
# Combinetotals <- rbind(john_total, william_total, james_total...)
The a
pproach
is so
und, b
ut the
re
are 67
82 na
mes :(
Monday, November 12, 12
![Page 27: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/27.jpg)
Your turn
Look at the previous code and answer these three questions:What data set did we split into pieces?What variable did we use to split up the data set?What function did we apply to each piece?
Monday, November 12, 12
![Page 28: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/28.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_total <- summarise(john, total = sum(n))william_total <- summarise(william, total = sum(n))james_total <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
Monday, November 12, 12
![Page 29: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/29.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_total <- summarise(john, total = sum(n))william_total <- summarise(william, total = sum(n))james_total <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
Split this data set
Monday, November 12, 12
![Page 30: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/30.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_total <- summarise(john, total = sum(n))william_total <- summarise(william, total = sum(n))james_total <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
Monday, November 12, 12
![Page 31: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/31.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_total <- summarise(john, total = sum(n))william_total <- summarise(william, total = sum(n))james_total <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
with this variable
Monday, November 12, 12
![Page 32: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/32.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- summarise(john, total = sum(n))william_years <- summarise(william, total = sum(n))james_years <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
Monday, November 12, 12
![Page 33: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/33.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- summarise(john, total = sum(n))william_years <- summarise(william, total = sum(n))james_years <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
then apply this function
Monday, November 12, 12
![Page 34: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/34.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- nrow(john)william_years <- nrow(william)james_years <- nrow(james)...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
Monday, November 12, 12
![Page 35: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/35.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- nrow(john)william_years <- nrow(william)james_years <- nrow(james)...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
put results into a single data
frame
Monday, November 12, 12
![Page 36: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/36.jpg)
ddply
Monday, November 12, 12
![Page 37: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/37.jpg)
ddply(
split-combine-apply with ddply
)
Monday, November 12, 12
![Page 38: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/38.jpg)
ddply(
split-combine-apply with ddply
)
data set to split
variable(s) to split by
function to apply to each group
Monday, November 12, 12
![Page 39: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/39.jpg)
ddply(bnames2
split-combine-apply with ddply
)
data set to split
variable(s) to split by
function to apply to each group
Monday, November 12, 12
![Page 40: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/40.jpg)
ddply(bnames2
split-combine-apply with ddply
, "name" )
data set to split
variable(s) to split by
function to apply to each group
Monday, November 12, 12
![Page 41: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/41.jpg)
ddply(bnames2
split-combine-apply with ddply
, "name", summarise)
data set to split
variable(s) to split by
function to apply to each group
Monday, November 12, 12
![Page 42: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/42.jpg)
ddply(bnames2
split-combine-apply with ddply
, "name", summarise)
data set to split
variable(s) to split by
function to apply to each group
Q: How many arguments did summarise get in our previous code? Will it know what they are above?
Monday, November 12, 12
![Page 43: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/43.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- summarise(john, total = sum(n))william_years <- summarise(william, total = sum(n))james_years <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
Monday, November 12, 12
![Page 44: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/44.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- summarise(john, total = sum(n))william_years <- summarise(william, total = sum(n))james_years <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
first argument is the piece from
above
Monday, November 12, 12
![Page 45: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/45.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- summarise(john, total = sum(n))william_years <- summarise(william, total = sum(n))james_years <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
Monday, November 12, 12
![Page 46: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/46.jpg)
# Splitjohn <- subset(bnames2, name == "John")william <- subset(bnames2, name == "William")james <- subset(bnames2, name == "James")...
# Applyjohn_years <- summarise(john, total = sum(n))william_years <- summarise(william, total = sum(n))james_years <- summarise(james, total = sum(n))...
# combinetotals <- rbind(john_total, william_total, james_total, ...)
first argument is the piece from
above
the second argument is what we want summarise to do
Monday, November 12, 12
![Page 47: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/47.jpg)
ddply(bnames2, "name", summarise, ... )
split-combine-apply with ddply
Monday, November 12, 12
![Page 48: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/48.jpg)
ddply(bnames2, "name", summarise, ... )
split-combine-apply with ddply
data set to split
variable(s) to split by
function to use on each group
Monday, November 12, 12
![Page 49: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/49.jpg)
ddply(bnames2, "name", summarise, ... )
split-combine-apply with ddply
data set to split
variable(s) to split by
function to use on each group
extra arguments for function
Monday, November 12, 12
![Page 50: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/50.jpg)
Your turn
In your group, use ddply to finish calculating the totals for each name. Hint: be sure to save your output to something.
name total1 Aaden 9592 Aaliyah 396653 Aarav 2194 Aaron 5094645 Ab 256 Abagail 26827 Abb 168 Abbey 143489 Abbie 1662210 Abbigail 6800
Monday, November 12, 12
![Page 51: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/51.jpg)
totals <- ddply(bnames2, "name", summarise, total = sum(n))
Monday, November 12, 12
![Page 52: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/52.jpg)
Al 2Bo 4Bo 0Bo 5Ed 5Ed 10
name n
name n
Al 2
Bo 4Bo 0Bo 5
Ed 5Ed 10
Split
name n
name n
Al 2Bo 9Ed 15
Combine
name total
2
9
15
Applytotal
total
total
ddply(bnames2, "name", summarise, total = sum(n))Monday, November 12, 12
![Page 53: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/53.jpg)
Number per soundex
• How can we compute the total number of people who ever had each soundex?
Monday, November 12, 12
![Page 54: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/54.jpg)
Number per soundex
• How can we compute the total number of people who ever had each soundex?
Predict (in your head) the code that will do this.
Monday, November 12, 12
![Page 55: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/55.jpg)
Repeat the same operation, but use soundex instead of name. This tells us the total number of people whose names ever sounded like each soundex.What is the most common sound? What name does it correspond to?
Your turn
Monday, November 12, 12
![Page 56: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/56.jpg)
stotals <- ddply(bnames2, "soundex", summarise, total = sum(n))stotals <- arrange(stotals, desc(n))
subset(bnames, soundex == "W450")
Monday, November 12, 12
![Page 57: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/57.jpg)
Workflow
1. Extract a single group2. Find a function that solves it for just
that group
3. Combine the function with ddply to solve it for all groups
Monday, November 12, 12
![Page 58: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/58.jpg)
Transformations
• How can we make a new variable that shows the rank of each name for each year? We should treat boys names as distinct from girls names (even if they’re spelled the same).
Monday, November 12, 12
![Page 59: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/59.jpg)
Transformations
Hint: We’ll use rank()
• How can we make a new variable that shows the rank of each name for each year? We should treat boys names as distinct from girls names (even if they’re spelled the same).
Monday, November 12, 12
![Page 60: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/60.jpg)
Extract a single year’s worth of names. Make sure they are all the same sex. Try year = 2008 and sex = boy.
Your turn: Step 1
Monday, November 12, 12
![Page 61: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/61.jpg)
Add a rank variable to our subset. Use rank().
Your turn: Step 2
Monday, November 12, 12
![Page 62: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/62.jpg)
Combine mutate(boys2008, rank = rank(desc(prop)))
with ddply and apply to the entire data set.
Your turn: Step 3
Monday, November 12, 12
![Page 63: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/63.jpg)
boys2008 <- subset(bnames2, year == 2008 & sex == "boy")
mutate(boys2008, rank == rank(desc(prop)))
ranks <- ddply(bnames2, c("year", "sex"), mutate, rank = rank(desc(prop)))
Monday, November 12, 12
![Page 64: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/64.jpg)
Challenges
Monday, November 12, 12
![Page 65: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/65.jpg)
Solving new problems
• Simple problems: identify which one of subset, summarise, mutate or arrange that you need. Combine with ddply.
• Complex problems: figure out a sequence of simple problems that leads you from the raw data to the final problem.
Monday, November 12, 12
![Page 66: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/66.jpg)
Warmups
Which names were most popular in 1999?Work out the average yearly usage of each name.List the 10 names with the highest average proportions.
Monday, November 12, 12
![Page 67: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/67.jpg)
# Which names were most popular in 1999?subset(bnames2, year == 1999 & rank < 10)n1999 <- subset(bnames2, year == 1999)head(arrange(n1999, desc(prop)), 10)
# Average usageoverall <- ddply(bnames2, "name", summarise, prop1 = mean(prop), prop2 = sum(prop) / 129)
# Top 10 nameshead(arrange(overall, desc(prop)), 10)
Monday, November 12, 12
![Page 68: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/68.jpg)
library(stringr)
# extracts first and last letter of each namebnames2 <- mutate(bnames2, first = str_sub(name, 1, 1), last = str_sub(name, -1, -1))
Set up for challenges
Monday, November 12, 12
![Page 69: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/69.jpg)
How has the total proportion of babies with names in the top 1000 changed over time?How has the popularity of different initials changed over time?
Challenge 1
Monday, November 12, 12
![Page 70: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/70.jpg)
top1000 <- ddply(bnames2, c("year","sex"), summarise, prop = sum(prop), npop = sum(prop > 1/1000))
qplot(year, prop, data = top1000, colour = sex, geom = "line")qplot(year, npop, data = top1000, colour = sex, geom = "line")
Monday, November 12, 12
![Page 71: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/71.jpg)
init <- ddply(bnames2, c("year","first"), summarise, prop = sum(prop)/2)
qplot(year, prop, data = init, colour = first, geom = "line")
Monday, November 12, 12
![Page 72: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/72.jpg)
Challenge 2
For each name, find the year in which it was most popular, and the rank in that year. (Hint: you might find which.max useful). Print all names that have been the most popular name at least once.
Monday, November 12, 12
![Page 73: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/73.jpg)
most_pop <- ddply(bnames2, "name", summarise, year = year[which.max(prop)], rank = min(rank))most_pop <- ddply(bnames2, "name", subset, prop == max(prop))
subset(bnames2, rank == 1)
Monday, November 12, 12
![Page 74: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/74.jpg)
Challenge 3
What two names (boy and girl) have been in the top 10 most often? (Hint: you'll have to do this in three steps. Think about what they are before starting)
Monday, November 12, 12
![Page 75: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/75.jpg)
top10 <- subset(bnames2, rank <= 10)counts <- ddply(top10, c("sex", "name"), nrow) # orcounts <- count(top10, c("sex", "name"))
ddply(counts, "sex", subset, freq == max(freq)) # orhead(arrange(counts, desc(freq)), 10
Monday, November 12, 12
![Page 76: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/76.jpg)
For each soundex, find the most common name in that group.Hint: use count (see ?count)
Challenge 4
Monday, November 12, 12
![Page 77: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/77.jpg)
names <- count(bnames2, c("soundex", "name"), "n")
ddply(names, "soundex", subset, freq == max(freq))
Monday, November 12, 12
![Page 78: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/78.jpg)
More about plyr
Monday, November 12, 12
![Page 79: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/79.jpg)
array data frame list nothing
array
data frame
list
n replicates
function arguments
aaply adply alply a_ply
daply ddply dlply d_ply
laply ldply llply l_ply
raply rdply rlply r_ply
maply mdply mlply m_ply
Monday, November 12, 12
![Page 81: utah Data manipulation with plyr - Hadley Wickhamstat405.had.co.nz/utah/plyr.pdf · 2012. 11. 12. · 9 Abbie 16622 10 Abbigail 6800 Monday, November 12, 12. Your turn In groups,](https://reader035.vdocuments.mx/reader035/viewer/2022071219/6055369c78844c692920c6e1/html5/thumbnails/81.jpg)
Monday, November 12, 12