  • 7/27/2019 Bigr Data User

    1/54

    Hadley Wickham @hadleywickham

    Chief Scientist, RStudio

    Bigger data analysis

    July 2013

    http://bit.ly/bigrdata2

    Wednesday, July 10, 13

    http://creativecommons.org/licenses/by-nc/3.0/
  • 2/54

    1. What is data analysis?

    2. Transforming data

    3. Visualising data

  • 4/54

    Data analysis is the process by which data becomes understanding, knowledge and insight

  • 6/54

    Frequent data analysis

    learn to program

    http://www.flickr.com/photos/compleo/5414489782
  • 7/54

    Transform

    Visualise

    Model

    Tidy

  • 8/54

    Cognition time Computation time

    http://www.flickr.com/photos/mutsmuts/4695658106
  • 9/54

    Transform

    Visualise

    Model

    Tidy

    reshape2

    ggplot2

    plyr

    stringr

    lubridate

  • 10/54

    Computation time Cognition time

  • 11/54

    Transform

    Visualise

    Model

    Tidy

    bigvis

    dplyr

  • 12/54

    Data

    Every commercial US flight 2000-2011: ~76 million flights

    Total database: ~11 GB

    >100 variables, but I'll focus on a handful: airline, delay, distance, flight time and speed.

  • 14/54

    name  n
    Al    2
    Bo    4
    Bo    0
    Bo    5
    Ed    5
    Ed    10

    name  total
    Al    2
    Bo    9
    Ed    15
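The diagram on this slide can be sketched in base R as three explicit steps: split the rows by name, reduce each piece to a total, and combine the results. This is only an illustration using the six rows shown on the slide. The variable names `pieces`, `summaries` and `totals` are made up here, and this is not how plyr or dplyr implement the operation internally.

```r
# The six rows shown on the slide
df <- data.frame(
  name = c("Al", "Bo", "Bo", "Bo", "Ed", "Ed"),
  n    = c(2, 4, 0, 5, 5, 10)
)

# Split: one small data frame per name
pieces <- split(df, df$name)

# Apply: reduce each piece to a one-row summary
summaries <- lapply(pieces, function(piece) {
  data.frame(name = piece$name[1], total = sum(piece$n))
})

# Combine: stack the one-row summaries into a single data frame
totals <- do.call(rbind, summaries)
rownames(totals) <- NULL

print(totals)
```

The result has one row per name, with the same totals as the slide (Al 2, Bo 9, Ed 15).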

  • 15/54

                          array   data frame   list    nothing
    array                 aaply   adply        alply   a_ply
    data frame            daply   ddply        dlply   d_ply
    list                  laply   ldply        llply   l_ply
    n replicates          raply   rdply        rlply   r_ply
    function arguments    maply   mdply        mlply   m_ply
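The table follows a simple naming rule: the first letter encodes the input type, the second the output type, followed by "ply". As a small sketch of that rule (base R only, with the row and column labels taken from the slide and the object names `inputs`, `outputs` and `plyr_names` invented for this example), the whole grid can be regenerated from the two sets of letters:

```r
# First letter: input type (rows of the table)
inputs <- c("array" = "a", "data frame" = "d", "list" = "l",
            "n replicates" = "r", "function arguments" = "m")

# Second letter: output type (columns of the table)
outputs <- c("array" = "a", "data frame" = "d", "list" = "l", "nothing" = "_")

# Build every input/output combination and append "ply"
plyr_names <- outer(inputs, outputs, function(i, o) paste0(i, o, "ply"))
dimnames(plyr_names) <- list(input = names(inputs), output = names(outputs))

print(plyr_names)
```

For example, `plyr_names["data frame", "list"]` gives the name of the function that takes a data frame and returns a list.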

  • 17/54

    [Bar chart: count (axis 0 to 150) for each function (fun), filled by use: Never, Occasionally, Often, All the time. Functions in the order shown: ddply, ldply, dlply, llply, d_ply, laply, adply, daply, l_ply, aaply, alply, a_ply]

  • 19/54

    Data analysis verbs

    select: subset variables

    filter: subset rows

    mutate: add new columns

    summarise: reduce to a single row

    arrange: re-order the rows

    + group by
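A minimal sketch of the five verbs plus grouping, written against the dplyr API as currently released on CRAN (the experimental 2013 version described in this talk may have differed). The `flights` data frame and its column names are hypothetical toy data, not the real flights database.

```r
library(dplyr)

# Hypothetical toy data; column names are invented for illustration
flights <- data.frame(
  carrier  = c("AA", "AA", "UA", "UA", "UA"),
  distance = c(500, 1200, 300, 2400, 800),
  air_time = c(80, 170, 55, 310, 120),
  delay    = c(5, -3, 40, 12, 0)
)

long_haul <- flights %>%
  select(carrier, distance, air_time) %>%        # select: subset variables
  filter(distance > 400) %>%                     # filter: subset rows
  mutate(speed = distance / air_time * 60) %>%   # mutate: add new columns
  arrange(desc(speed))                           # arrange: re-order the rows

# summarise: reduce to a single row
overall <- summarise(flights, mean_delay = mean(delay))

# + group by: the same summary, once per carrier
by_carrier <- flights %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(delay), n = n())

print(long_haul)
print(overall)
print(by_carrier)
```

Without grouping, `summarise()` returns one row for the whole table; with `group_by()` it returns one row per group.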

  • 21/54

    # Often work with the same grouping variables
    # multiple times, so define upfront. Also refer
    # to variables in the same way
    daily_df

  • 22/54

    library(data.table)
    h_dt

  • 23/54

    # And dplyr also works seamlessly with databases:
    ontime

  • 24/54

    # Behind the scenes
    library(dplyr)
    ontime
    translate_sql(Year > 2005, ontime)
    # Year > 2005.0
    translate_sql(Year > 2005L, ontime)
    # Year > 2005
    translate_sql(Origin == "IAD" || Dest == "IAD", ontime)
    # Origin = 'IAD' OR Dest = 'IAD'
    years
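To give a feel for what "behind the scenes" means, here is a toy translator in base R that walks an unevaluated R expression and emits SQL text. It is only a sketch of the general idea: `to_sql` and `sql_ops` are hypothetical names made up for this example, it handles just a handful of operators, and it is not dplyr's actual `translate_sql()` implementation.

```r
# Map a few R operators to their SQL spelling (illustrative subset only)
sql_ops <- c("==" = "=", "||" = "OR", "&&" = "AND",
             ">" = ">", "<" = "<", ">=" = ">=", "<=" = "<=")

to_sql <- function(expr) {
  if (is.character(expr)) return(paste0("'", expr, "'"))    # string literal
  if (is.integer(expr))   return(format(expr))              # 2005L -> 2005
  if (is.numeric(expr))   return(format(expr, nsmall = 1))  # 2005  -> 2005.0
  if (is.name(expr))      return(as.character(expr))        # column name
  if (is.call(expr) && is.name(expr[[1]])) {
    op <- as.character(expr[[1]])
    if (op == "(") return(paste0("(", to_sql(expr[[2]]), ")"))
    if (op %in% names(sql_ops) && length(expr) == 3) {
      # Binary operator: translate both sides recursively
      return(paste(to_sql(expr[[2]]), sql_ops[[op]], to_sql(expr[[3]])))
    }
  }
  stop("Unsupported expression: ", deparse(expr))
}

print(to_sql(quote(Year > 2005)))
print(to_sql(quote(Year > 2005L)))
print(to_sql(quote(Origin == "IAD" || Dest == "IAD")))
```

The key design point is that the expression is never evaluated in R: `quote()` captures it as a syntax tree, and the function only inspects that tree.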

  • 25/54

    Data sources

    Data frames (dplyr)

    Data tables (dplyr)

    SQLite tables (dplyr)

    PostgreSQL, MySQL, SQL Server, ...

    MonetDB (planned)

    Google BigQuery (bigrquery)

  • 26/54

    daily_df

  • 27/54

    # It might even live on the web
    library(dplyr)
    library(bigrquery)
    h_bq

  • 28/54

    dplyr

    Currently experimental and incomplete, but it works, and you're welcome to try it out.

    library(devtools)
    install_github("assertthat")
    install_github("dplyr")
    install_github("bigrquery")

    Needs a development environment (http://www.rstudio.com/ide/docs/packages/prerequisites)

  • 30/54

    library(ggplot2)
    library(bigvis)

    # Can't use data frames :(
    dist

  • 32/54

    qplot(dist, speed, colour = delay) +
      scale_colour_gradient2()

    One hour later...

  • 33/54

    x

  • 35/54

       user  system elapsed
      2.785   0.010   2.806

  • 36/54

    Goals

    Support exploratory analysis (e.g. in R)

    Fast on commodity hardware

    100,000,000 in

  • 37/54

    Insight

    Bottleneck is number of pixels: 1d: 3,000; 2d: 3,000,000

    Process:

    Condense (bin & summarise)

    Smooth

    Visualise
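The condense and smooth steps can be sketched in base R, without the bigvis package. The data below are simulated (not the flights data), and the bin width, the 5-bin running mean and all object names are arbitrary choices for this illustration rather than anything bigvis prescribes.

```r
set.seed(1)
x <- rexp(1e6, rate = 1 / 700)   # simulated values standing in for distances

# Condense: fixed-width bins defined by an origin and a width
origin <- 0
width  <- 20
bin    <- floor((x - origin) / width) + 1           # 1-based bin index
counts <- tabulate(bin)                             # one count per bin
centre <- origin + (seq_along(counts) - 0.5) * width

# Smooth: centred 5-bin running mean of the counts (NA at both ends)
kernel   <- rep(1 / 5, 5)
smoothed <- stats::filter(counts, kernel, sides = 2)

condensed <- data.frame(centre = centre,
                        count  = counts,
                        smooth = as.numeric(smoothed))

# Visualise: uncomment to draw the condensed data
# plot(condensed$centre, condensed$count, type = "h")
# lines(condensed$centre, condensed$smooth, col = "red")

print(nrow(condensed))
```

However many raw values go in, the condensed table has one row per bin, so the amount of data to draw is bounded by the number of bins rather than by the number of observations.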

  • 38/54

    Bin

    ⌊(x - origin) / width⌋

  • 39/54

    Summarise

    Count

    Mean

    Std. dev.

    Quantiles

    Histogram, KDE

    Regression, Loess

    Boxplots, Quantile regression smoothing
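As a rough illustration of summarising within bins, the sketch below computes a count, mean and standard deviation of one variable for each bin of another, using base R's `tapply()`. The data are simulated and the names (`dist`, `speed`, `summary_by_bin`) and the bin width are assumptions made for the example. It does not use bigvis itself.

```r
set.seed(2)
dist  <- runif(5000, 0, 3000)                        # simulated distances
speed <- 300 + 0.05 * dist + rnorm(5000, sd = 30)    # simulated speeds

# Bin dist into fixed-width bins (origin 0), labelled by their left edge
width <- 250
bin   <- floor(dist / width) * width

# One row per bin: count, mean and standard deviation of speed
summary_by_bin <- data.frame(
  dist  = sort(unique(bin)),
  count = as.vector(tapply(speed, bin, length)),
  mean  = as.vector(tapply(speed, bin, mean)),
  sd    = as.vector(tapply(speed, bin, sd))
)

print(summary_by_bin)
```

Counts per bin are what a histogram draws, while per-bin means play a role similar to a regression or loess curve.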

  • 40/54

    [Plot: .count (ticks 0 to 1,500,000) against dist (ticks 0 to 5,000)]

    dist_s

  • 42/54

    [Plot: .count (ticks 0 to 1,500,000) against time (ticks 0 to 3,000, plus NA)]

    time_s

  • 43/54

    [Plot: .count (ticks 0 to 750,000) against time (ticks 0 to 1,000)]

    autoplot(time_s, na.rm = TRUE)

  • 44/54

    [Plot: .count (ticks 0 to 750,000) against time (ticks 0 to 500)]

    autoplot(time_s[time_s < 500, ])

  • 45/54

    [Plot: .count (ticks 0 to 1,500,000) against time (ticks 0 to 60)]

    autoplot(time_s %% 60)

  • 46/54

    [Plot: speed (ticks 200 to 600) against dist (ticks 0 to 5,000), coloured by .count on a log scale (1e+00 to 1e+06)]

  • 47/54

    [Plot as on the previous slide]

    sd1

  • 49/54

    [Plot: speed (ticks 0 to 800) against dist (ticks 0 to 5,000), coloured by .count on a linear scale (0e+00 to 6e+05)]

  • 50/54

    [Plot as on the previous slide]

    sd2

  • 51/54

    [Plot: speed (ticks 0 to 600) against dist (ticks 0 to 5,000), coloured by .count on a linear scale (0e+00 to 6e+05)]

    sd2

  • 52/54

    Demo

    shiny::runApp("mt/", 8002)

  • 54/54

    To do...

    Bigvis and dplyr currently complementary, but not at all integrated

    Also need better tools for modelling large data; biglm helpful, but only one class of model