Visualising Big Data

Post on 11-Apr-2017

  • Amit Kapoor@amitkaps

    Visualising Big Data

  • Visualise Million Data Points


  • Data Sample

    Sampling can be effective (with overweighting of unusual values),

    but it requires multiple plots or careful tuning of parameters.
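
    One way to read "overweighting unusual values" is sampling with unequal probabilities. The following is a minimal base R sketch on simulated data, not code from the talk: the `score` measure (distance from the median in MAD units) and the weight formula are assumptions chosen purely for illustration.

```r
set.seed(42)

# Simulated data standing in for a large dataset (assumption: one numeric variable)
n <- 1e5
x <- rnorm(n)

# Hypothetical "unusualness" score: distance from the median in MAD units
score <- abs(x - median(x)) / mad(x)

# Every point keeps a base weight of 1; unusual points get extra weight
w <- 1 + score^2

# Draw a weighted sample without replacement
idx <- sample.int(n, size = 2000, replace = FALSE, prob = w)
x_sample <- x[idx]

# Share of extreme values in the full data versus in the weighted sample
share_full   <- mean(abs(x) > 3)
share_sample <- mean(abs(x_sample) > 3)
```

    The weighted sample keeps far more of the tail than a uniform sample would, which is also why it distorts densities: any plot of `x_sample` has to be read with the weights in mind, and the choice of weight function is one of the parameters that needs tuning.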

  • Data Sample

    Model

    Models are great as they scale nicely.

    But, visualisation is required as

    I don't know what I don't know.

  • Data Sample

    Model

    Binning

    Binning can solve a lot of these challenges

    Bin - Summarize - Smooth: A framework

    for visualising big data - Hadley Wickham (2013)

  • Visualising big data is the process of creating generalized histograms

  • Approach

    BIN : fixed size bins, bin index = floor((x - origin) / width)

    SUMMARIZE : summary stats = count, mean, stdev

    SMOOTH : smoothing e.g. kernel mean, regression

    VISUALISE : visualise using standard plots
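
    The four steps above can be sketched in base R without any extra packages. This is an illustrative sketch on simulated data, not the bigvis implementation: the variable names, the bin width of 0.1 and the use of a count-weighted `loess` fit for the smoothing step are all assumptions.

```r
set.seed(1)

# Simulated data: one million (x, y) pairs
x <- runif(1e6, 0, 10)
y <- sin(x) + rnorm(1e6, sd = 0.5)

origin <- 0
width  <- 0.1

# BIN: integer bin index for every observation
bin <- floor((x - origin) / width)

# SUMMARISE: count, mean and standard deviation of y within each bin
binned <- data.frame(
  x     = origin + (sort(unique(bin)) + 0.5) * width,  # bin centres
  count = as.vector(tapply(y, bin, length)),
  mean  = as.vector(tapply(y, bin, mean)),
  sd    = as.vector(tapply(y, bin, sd))
)

# SMOOTH: local regression over the bin means, weighted by bin count
fit <- loess(mean ~ x, data = binned, weights = count, span = 0.2)
binned$smooth <- predict(fit)

# VISUALISE: a standard plot of 100 rows instead of one million points
plot(binned$x, binned$mean, pch = 16, col = "orange",
     xlab = "x", ylab = "mean of y per bin")
lines(binned$x, binned$smooth, lwd = 2)
```

    After the binning and summarising steps the plot only has to draw one row per bin, so the cost of drawing no longer depends on the number of raw observations.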

  • Bigvis Package in R

    Aim: To plot 100 million points in under 5 seconds.

    Approach:
    - Plotting using standard R libraries
    - Processing done in (fast) compiled C++ code, using the Rcpp package
    - Outlier removal in big data
    - Smoothing to highlight trends & suppress noise

  • Diamonds dataset

    ggplot(diamonds) + aes(carat, price) + geom_point(alpha = 0.2, colour = "orange")

    50k observations e.g. price, carat of diamonds

  • Condense (bin + summarise)

    library(bigvis)
    library(ggplot2)

    Nbin

  • Plotting the Condense

    p

  • Both Points & Condensed

    q

  • Movies dataset

    ggplot(movies) + aes(length, rating) + geom_point(alpha = 0.2, colour = "orange")

    130k observations e.g. length, rating of movies on IMDB

  • Let us see the outliers

      title                                            length  rating
    1 Matrjoschka                                        5700     8.5
    2 The Cure for Insomnia                              5220     5.9
    3 The Longest Most Meaningless Movie in the World    2880     7.3
    4 The Hazards of Helen                               1428     6.6
    5 ****                                               1100     6.9

  • Condense (bin + summarise)

    library(bigvis)
    library(ggplot2)

    Nbin

  • Condensed Plot

    p

  • Remove Outliers

    p %>% peel(BinData)

    Create 10,000 bins, summarize by count & peel off the 1% of outliers
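
    The idea behind peeling can be illustrated in base R: bin on a grid, count per cell, then drop the sparsest cells until roughly 1% of the observations are gone. This is a simplified stand-in written for illustration on simulated data, and it should not be taken as how `peel()` in bigvis is implemented; the grid width of 0.5 and all variable names are assumptions.

```r
set.seed(7)

# Simulated data: a dense cloud plus a few extreme values on x
n <- 2e5
x <- c(rnorm(n), runif(200, 20, 60))
y <- c(rnorm(n), rnorm(200))

# Bin on a 2-d grid and count observations per cell
bx <- floor((x - min(x)) / 0.5)
by <- floor((y - min(y)) / 0.5)
counts <- as.data.frame(table(bx = bx, by = by), stringsAsFactors = FALSE)
counts <- counts[counts$Freq > 0, ]

# Peel: order cells from sparsest to densest and drop cells
# while the dropped share of observations stays at or below 1%
counts <- counts[order(counts$Freq), ]
cum_share <- cumsum(counts$Freq) / sum(counts$Freq)
kept <- counts[cum_share > 0.01, ]

# Share of observations that survive, and the x-range before and after
kept_share <- sum(kept$Freq) / sum(counts$Freq)
range_before <- range(as.numeric(counts$bx))
range_after  <- range(as.numeric(kept$bx))
```

    Because the extreme values sit in cells that hold only one or two observations each, they are the first to go, and the remaining cells cover a much tighter range that a plot can show in detail.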

  • Smoothing

    smoothBinData
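
    As a rough picture of what smoothing binned summaries means, here is a base R sketch of a kernel mean: each bin's value is replaced by a Gaussian-weighted average of nearby bin means, with bin counts as extra weights. The function `kernel_mean`, the bandwidth `h = 0.3` and the simulated bins are all hypothetical and are not taken from bigvis.

```r
set.seed(3)

# Simulated binned summary: bin centres, counts and noisy bin means
centres <- seq(0.05, 9.95, by = 0.1)
counts  <- rpois(length(centres), lambda = 50) + 1
means   <- sin(centres) + rnorm(length(centres), sd = 0.3)

# Kernel mean: Gaussian weights in x, multiplied by the bin counts
kernel_mean <- function(x, y, w, h) {
  vapply(x, function(x0) {
    k <- dnorm((x - x0) / h) * w
    sum(k * y) / sum(k)
  }, numeric(1))
}

smoothed <- kernel_mean(centres, means, counts, h = 0.3)

# Mean squared error against the underlying curve, before and after smoothing
mse_raw    <- mean((means - sin(centres))^2)
mse_smooth <- mean((smoothed - sin(centres))^2)
```

    The smoothing runs over the bins rather than the raw observations, so its cost depends on the number of bins only. A larger `h` suppresses more noise but flattens real features, so the bandwidth is a judgment call.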

  • Big Data Visualisation

    Approach: Bin - Summarize - Smooth - Visualise
    - Interactively plot nearly 100 million data points in-memory for EDA in R
    - Can be extended to in-database, e.g. for binning
    - Can be parallelised, e.g. summarize on count, mean
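
    Summaries such as count and mean parallelise well because per-bin counts and sums are additive: each chunk of data can be summarised on its own and the partial results added up afterwards. A minimal base R sketch on simulated data follows; `summarise_chunk` and the four-way split are hypothetical, and the chunks are processed sequentially here for simplicity.

```r
set.seed(11)

# Simulated data
x <- runif(4e5, 0, 10)
y <- x^2 + rnorm(4e5)

width <- 1
nbins <- 10

# Per-chunk summary: count and sum of y for every bin (fixed set of bins)
summarise_chunk <- function(x, y) {
  bin <- factor(floor(x / width), levels = 0:(nbins - 1))
  data.frame(
    count = as.vector(table(bin)),
    sum   = vapply(split(y, bin), sum, numeric(1))
  )
}

# Split the row indices into 4 chunks and summarise each one independently
# (this lapply is the step that could be handed to parallel workers)
chunks <- split(seq_along(x), rep(1:4, length.out = length(x)))
parts  <- lapply(chunks, function(i) summarise_chunk(x[i], y[i]))

# Combine: counts and sums simply add up; the mean is derived at the end
total <- Reduce(`+`, parts)
total$mean <- total$sum / total$count

# Reference result computed in one pass over all the data
direct_mean <- as.vector(tapply(y, floor(x / width), mean))
```

    The same additive structure is what makes an in-database version plausible: a grouped count and sum per bin can be computed where the data lives, and only the small binned table needs to reach R.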

  • Amit Kapoor@amitkaps

    amitkaps.com
    narrativeviz.com

    Data

    Visual

    Story