visualising big data

Amit Kapoor @amitkaps Visualising Big Data


Post on 11-Apr-2017


TRANSCRIPT

Page 1: Visualising Big Data

Amit Kapoor @amitkaps

Visualising Big Data

Page 2: Visualising Big Data

Visualise Million Data Points

x <- rnorm(1000000, mean=0, sd=2)
y <- rnorm(1000000, mean=0, sd=2)

xy <- data.frame(x,y)

Same order as the number of pixels on my MacBook Air: 1400 x 900
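To make the pixel comparison concrete, here is a minimal base R sketch that maps a million simulated points onto a canvas of the size quoted on the slide and counts how many land on each pixel. The helper `to_pixel`, the seed, and the choice to stretch the data range across the whole canvas are illustrative assumptions, not something taken from the slides.

```r
set.seed(42)                       # arbitrary seed, for reproducibility only
n <- 1e6
x <- rnorm(n, mean = 0, sd = 2)
y <- rnorm(n, mean = 0, sd = 2)

# Canvas size as quoted on the slide
width_px  <- 1400
height_px <- 900

# Hypothetical helper: map values onto integer pixel indices 1..n_px
to_pixel <- function(v, n_px) {
  r <- range(v)
  pmin(floor((v - r[1]) / (r[2] - r[1]) * n_px) + 1, n_px)
}
px <- to_pixel(x, width_px)
py <- to_pixel(y, height_px)

# Count how many points land on each pixel
pixel_id <- (py - 1) * width_px + px
counts   <- tabulate(pixel_id, nbins = width_px * height_px)

total_pixels <- width_px * height_px
used_pixels  <- sum(counts > 0)

cat("Pixels available:      ", total_pixels, "\n")
cat("Pixels actually drawn: ", used_pixels, "\n")
cat("Largest pile on one px:", max(counts), "\n")
```

Because the points cluster around the centre, far fewer pixels are drawn than there are points, which is the overplotting problem the rest of the deck addresses.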

Data

Page 3: Visualising Big Data

Data Sample

Sampling can be effective (with overweighting of unusual values)

Requires multiple plots or careful tuning of parameters
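One way to read "overweighting unusual values" is to sample with probabilities that grow with how far a point sits from the bulk of the data. The sketch below assumes a simple robust distance score and a `1 + score^2` weight. Both are illustrative choices (exactly the kind of tuning parameter the slide warns about), and it samples with replacement to keep the draw fast.

```r
set.seed(1)                        # arbitrary seed, for reproducibility only
n <- 1e6
x <- rnorm(n, mean = 0, sd = 2)
k <- 10000                         # sample size (assumed)

# Assumed "unusualness" score: distance from the median in MAD units
score <- abs(x - median(x)) / mad(x)

# Assumed weighting: typical points keep weight ~1, extreme points get more
w <- 1 + score^2

# Weighted sample (with replacement) versus a plain uniform sample
weighted_sample <- x[sample.int(n, size = k, replace = TRUE, prob = w)]
uniform_sample  <- x[sample.int(n, size = k)]

# Share of points beyond 3 standard deviations (|x| > 6) in each sample
share_weighted <- mean(abs(weighted_sample) > 6)
share_uniform  <- mean(abs(uniform_sample) > 6)
cat("Extreme share, weighted:", share_weighted, "\n")
cat("Extreme share, uniform: ", share_uniform, "\n")
```

The weighted sample shows the tails more clearly, but it no longer reflects the true density, so it usually has to be read alongside a uniform sample.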

Page 4: Visualising Big Data

Data Sample

Model

Models are great as they scale nicely.

But visualisation is still required, as “I don’t know what I don’t know.”

Page 5: Visualising Big Data

Data Sample

Model Binning

Binning can solve a lot of these challenges

“Bin-summarise-smooth: A framework for visualising large data” - Hadley Wickham (2013)

Page 6: Visualising Big Data
Page 7: Visualising Big Data

“Visualising big data is the process of creating generalized histograms”

Page 8: Visualising Big Data

Approach

BIN : fixed-width bins, bin index = floor((x - origin) / width)

SUMMARIZE : summary stats = count, mean, stdev

SMOOTH : smoothing e.g. kernel mean, regression

VISUALISE : visualise using standard plots
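The four steps above can be sketched in base R on simulated data. This is a hand-rolled illustration, not the bigvis implementation: the simulated `sin` signal, the bin width, the bandwidth `h`, and the count-weighted Gaussian kernel mean are all assumptions chosen for the example.

```r
set.seed(7)                        # arbitrary seed, for reproducibility only
n <- 1e5
x <- runif(n, 0, 10)
z <- sin(x) + rnorm(n, sd = 0.5)   # noisy signal to summarise

# BIN: fixed-width bins, index = floor((x - origin) / width)
origin <- 0
width  <- 0.1
bin_id <- floor((x - origin) / width)

# SUMMARIZE: count, mean and standard deviation of z within each bin
binned <- data.frame(
  x     = origin + (sort(unique(bin_id)) + 0.5) * width,  # bin centres
  count = as.vector(tapply(z, bin_id, length)),
  mean  = as.vector(tapply(z, bin_id, mean)),
  sd    = as.vector(tapply(z, bin_id, sd))
)

# SMOOTH: kernel mean over bin centres, weighted by bin count
h <- 0.3                           # assumed bandwidth
smooth_mean <- sapply(binned$x, function(x0) {
  k <- dnorm((binned$x - x0) / h) * binned$count
  sum(k * binned$mean) / sum(k)
})

# VISUALISE: a standard plot of 100 bins instead of 100,000 points
plot(binned$x, binned$mean, pch = 16, col = "grey50",
     xlab = "x", ylab = "mean of z per bin")
lines(binned$x, smooth_mean, col = "orange", lwd = 2)
```

After the binning step, all later work runs on the 100-row summary rather than the raw data, which is why the approach scales.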

Page 9: Visualising Big Data

Bigvis Package in R

Aim: To plot 100 million points in under 5 seconds.

Approach:
- Plotting using standard R libraries
- Processing done in (fast) compiled C++ code, using the Rcpp package
- Outlier removal in big data
- Smoothing to highlight trends & suppress noise

Page 10: Visualising Big Data

Diamonds dataset

ggplot(diamonds) + aes(carat, price) + geom_point(alpha = 0.2, colour = "orange")

50k observations e.g. price, carat of diamonds

Page 11: Visualising Big Data

Condense (bin + summarise)

library(bigvis)
library(ggplot2)

# Bin carat and price into Nbin fixed-width bins each, then count per cell
Nbin <- 20
BinData <- with(diamonds, condense(
  bin(carat, find_width(carat, Nbin)),
  bin(price, find_width(price, Nbin))))

Page 12: Visualising Big Data

Plotting the Condense

p <- ggplot(BinData) + aes(carat, price, fill=.count) + geom_tile()

Created 20 bins and summarised using count

Page 13: Visualising Big Data

Both Points & Condensed

q <- p + geom_point(data = diamonds, aes(fill = NULL), alpha = 0.2, colour = "orange")

Created 20 bins, summarised using count & added base data

Page 14: Visualising Big Data

Movies dataset

ggplot(movies) + aes(length, rating) + geom_point(alpha = 0.2, colour = "orange")

130k observations e.g. length, rating of movies on IMDB

Page 15: Visualising Big Data

Let us see the outliers

  title                                            length rating
1 Matrjoschka                                        5700    8.5
2 The Cure for Insomnia                              5220    5.9
3 The Longest Most Meaningless Movie in the World    2880    7.3
4 The Hazards of Helen                               1428    6.6
5 ****                                               1100    6.9

Page 16: Visualising Big Data

Condense (bin + summarise)

library(bigvis)
library(ggplot2)

# Bin length and rating into Nbin fixed-width bins each, then count per cell
Nbin <- 1e4
BinData <- with(movies, condense(
  bin(length, find_width(length, Nbin)),
  bin(rating, find_width(rating, Nbin))))

Page 17: Visualising Big Data

Condensed Plot

p <- ggplot(BinData) + aes(length, rating, fill=.count) + geom_tile()

Created 10000 bins and summarised using count

Page 18: Visualising Big Data

Remove Outliers

p %+% peel(BinData)

Created 10000 bins, summarised using count & peeled 1% outliers

Page 19: Visualising Big Data

Smoothing

smoothBinData <- smooth(peel(BinData), h=c(20, 1))
autoplot(smoothBinData)

Created 20 bins, summarised using count, peeled 1% outliers & smoothed

Page 20: Visualising Big Data

Big Data Visualisation

● Approach: Bin - Summarize - Smooth - Visualise
● “Interactively” plot nearly 100 million data points in-memory for EDA in R
● Can be extended to in-database, e.g. for binning
● Can be parallelised, e.g. summarize on count, mean
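The last two points rest on the fact that per-bin counts and sums are additive, so partial summaries computed on separate chunks (or by a database GROUP BY) can be merged afterwards. A minimal base R sketch of that idea follows. The helper `summarise_chunk`, the four-way split, and the bin settings are hypothetical, and `lapply` stands in for a parallel map such as `parallel::mclapply`.

```r
set.seed(3)                        # arbitrary seed, for reproducibility only
n <- 1e5
x <- rnorm(n, sd = 2)
z <- x^2 + rnorm(n)

origin <- 0
width  <- 0.5

# Hypothetical helper: per-bin count and sum of z for one chunk of data
summarise_chunk <- function(x, z, origin, width) {
  bin_id <- floor((x - origin) / width)
  data.frame(
    bin   = sort(unique(bin_id)),
    count = as.vector(tapply(z, bin_id, length)),
    total = as.vector(tapply(z, bin_id, sum))
  )
}

# Split the rows into 4 chunks and summarise each one independently
chunks   <- split(seq_len(n), rep(1:4, length.out = n))
partials <- lapply(chunks, function(i) summarise_chunk(x[i], z[i], origin, width))

# Merge: counts and totals simply add up; the mean is derived at the end
combined <- aggregate(cbind(count, total) ~ bin,
                      data = do.call(rbind, partials), FUN = sum)
combined$mean <- combined$total / combined$count

# Reference: the same summary computed in one pass over all the data
full <- summarise_chunk(x, z, origin, width)
full$mean <- full$total / full$count
print(all.equal(combined$mean, full$mean))
```

A standard deviation can be merged the same way if each chunk also carries a per-bin sum of squares.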

Page 21: Visualising Big Data

Amit Kapoor @amitkaps

amitkaps.com
narrativeviz.com

Data

Visual

Story
