bigr data user
TRANSCRIPT
-
7/27/2019 Bigr Data User
1/54
Hadley Wickham @hadleywickham
Chief Scientist, RStudio
Bigger data analysis
July 2013
http://bit.ly/bigrdata2
Wednesday, July 10, 13
http://creativecommons.org/licenses/by-nc/3.0/
-
1. What is data analysis?
2. Transforming data
3. Visualising data
-
Data analysis is the process
by which data becomes understanding, knowledge
and insight
-
Frequent data analysis: learn to program
http://www.flickr.com/photos/compleo/5414489782
-
Transform
Visualise
Model
Tidy
-
Cognition time Computation time
http://www.flickr.com/photos/mutsmuts/4695658106
-
Transform
Visualise
Model
Tidy
reshape2
ggplot2
plyr
stringr
lubridate
-
Computation time Cognition time
-
Transform
Visualise
Model
Tidy
bigvis
dplyr
-
Data
Every commercial US flight 2000-2011:
~76 million flights
Total database: ~11 Gb
>100 variables, but I'll focus on a handful: airline, delay, distance, flight
time and speed.
-
[Figure: a table with columns name and n (Al 2; Bo 4; Bo 0; Bo 5; Ed 5; Ed 10) is split by name, n is summed within each piece (2; 9; 15), and the pieces are combined into a table with columns name and total (Al 2; Bo 9; Ed 15).]
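The grouped summary pictured above can be sketched in base R. This is only an illustration using the toy values from the figure; the data frame name `df` is an assumption, and the talk itself uses plyr and dplyr rather than `aggregate()`.

```r
# Toy data copied from the figure: one row per (name, n) observation.
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)

# Split by name, sum n within each group, and combine into one row per name.
totals <- aggregate(n ~ name, data = df, FUN = sum)

# Rename the summed column to match the figure's output table.
names(totals)[names(totals) == "n"] <- "total"

print(totals)
```

Each distinct name ends up as a single row, so the output has as many rows as there are groups rather than as many as there are observations.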
-
input \ output       array   data frame   list    nothing
array                aaply   adply        alply   a_ply
data frame           daply   ddply        dlply   d_ply
list                 laply   ldply        llply   l_ply
n replicates         raply   rdply        rlply   r_ply
function arguments   maply   mdply        mlply   m_ply
-
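[Figure: bar chart of how often each plyr function is used (Never, Occasionally, Often, All the time), with count on the horizontal axis (0 to 150). Functions as listed on the chart: ddply, ldply, dlply, llply, d_ply, laply, adply, daply, l_ply, aaply, alply, a_ply.]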
-
Data analysis verbs
select: subset variables
filter: subset rows
mutate: add new columns
summarise: reduce to a single row
arrange: re-order the rows
+ group by
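The verbs listed above can be chained in a short sketch. It assumes a current CRAN release of dplyr, whose function names (for example `group_by()`) may differ from the 2013 development version shown in the talk; the `flights` data frame and its column names are hypothetical toy data, not the real flights database.

```r
library(dplyr)  # assumes a current CRAN release of dplyr

# Hypothetical toy data; the column names are illustrative only.
flights <- data.frame(
  carrier = c("AA", "AA", "UA", "UA", "UA"),
  dist    = c(500, 1500, 300, 2400, 800),  # miles
  time    = c(80, 200, 60, 300, 120),      # minutes in the air
  delay   = c(5, -3, 20, 0, 12)            # minutes
)

by_carrier <- flights %>%
  select(carrier, dist, time, delay) %>%   # select: subset variables
  filter(dist > 400) %>%                   # filter: subset rows
  mutate(speed = dist / (time / 60)) %>%   # mutate: add a new column (mph)
  group_by(carrier) %>%                    # group by: one group per carrier
  summarise(mean_delay = mean(delay),      # summarise: one row per group
            mean_speed = mean(speed)) %>%
  arrange(desc(mean_delay))                # arrange: re-order the rows

print(by_carrier)
```

Without the grouping step, `summarise()` would reduce the whole table to a single row; with it, the same call yields one row per carrier.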
-
h
-
# Often work with the same grouping variables
# multiple times, so define upfront. Also refer
# to variables in the same way
daily_df
-
library(data.table)
h_dt
-
# And dplyr also works seamlessly with databases:
ontime
-
# Behind the scenes
library(dplyr)
ontime
translate_sql(Year > 2005, ontime)
# Year > 2005.0
translate_sql(Year > 2005L, ontime)
# Year > 2005
translate_sql(Origin == "IAD" || Dest == "IAD", ontime)
# Origin = 'IAD' OR Dest = 'IAD'
years
-
Data sources
Data frames (dplyr)
Data tables (dplyr)
SQLite tables (dplyr)
PostgreSQL, MySQL, SQL Server, ...
MonetDB (planned)
Google BigQuery (bigrquery)
-
daily_df
-
# It might even live on the web
library(dplyr)
library(bigrquery)
h_bq
-
dplyr
Currently experimental and incomplete, but it works, and you're welcome to try it
out.
library(devtools)
install_github("assertthat")
install_github("dplyr")
install_github("bigrquery")
Needs a development environment (http://www.rstudio.com/ide/docs/packages/prerequisites)
-
library(ggplot2)
library(bigvis)
# Can't use data frames :(
dist
-
qplot(dist, speed, colour = delay) +
  scale_colour_gradient2()
One hour later...
-
x
-
   user  system elapsed
  2.785   0.010   2.806
-
Goals
Support exploratory analysis (e.g. in R)
Fast on commodity hardware
100,000,000 in
-
Insight
Bottleneck is number of pixels:
1d: 3,000; 2d: 3,000,000
Process:
Condense (bin & summarise)
Smooth
Visualise
-
Bin
[Figure: the x axis divided into bins, annotated with an origin and a width.]
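Fixed-width binning followed by a per-bin summary can be sketched in a few lines of base R. This is an illustration only, not bigvis's implementation: the helper `bin_index()` is hypothetical, the values in `x` are made up, and the sketch assumes the common convention that a value x falls in bin floor((x - origin) / width).

```r
# Hypothetical helper: 0-based index of the fixed-width bin containing each x.
bin_index <- function(x, origin, width) {
  floor((x - origin) / width)
}

x      <- c(3, 7, 12, 18, 18.5, 25, 41)  # made-up values
origin <- 0
width  <- 10

idx <- bin_index(x, origin, width)

# Condense: one row per occupied bin, with the count and mean of its values.
condensed <- data.frame(
  bin_start = origin + sort(unique(idx)) * width,
  count     = as.vector(table(idx)),
  mean      = as.vector(tapply(x, idx, mean))
)

print(condensed)
```

The condensed table has at most one row per bin, so its size is bounded by the number of bins rather than by the number of observations.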
-
Summarise
Count
Mean
Std. dev.
Quantiles
Histogram, KDE
Regression, Loess
Boxplots, Quantile regression smoothing
-
[Figure: .count (0 to 1,500,000) against dist (0 to 5000).]
dist_s
-
[Figure: .count (0 to 1,500,000) against time (0 to 3000), with an NA value marked.]
time_s
-
[Figure: .count (0 to 750,000) against time (0 to 1000).]
autoplot(time_s, na.rm = TRUE)
-
[Figure: .count (0 to 750,000) against time (0 to 500).]
autoplot(time_s[time_s < 500, ])
-
[Figure: .count (0 to 1,500,000) against time (0 to 60).]
autoplot(time_s %% 60)
-
[Figure: speed (200 to 600) against dist (0 to 5000), with a .count legend running from 1e+00 to 1e+06.]
sd1
-
[Figure: speed (0 to 800) against dist (0 to 5000), with a .count legend running from 0e+00 to 6e+05.]
sd2
-
[Figure: speed (0 to 600) against dist (0 to 5000), with a .count legend running from 0e+00 to 6e+05.]
sd2
-
Demo
shiny::runApp("mt/", 8002)
-
To do...
Bigvis and dplyr currently
complementary, but not at all integrated
Also need better tools for
modelling large data: biglm helpful,
but only one class of model