Visualising Big Data

Amit Kapoor (@amitkaps)


Visualise a Million Data Points

x <- rnorm(1000000, mean = 0, sd = 2)
y <- rnorm(1000000, mean = 0, sd = 2)
xy <- data.frame(x, y)

Same order as the number of pixels on my MacBook Air: 1440 x 900

Data

Data / Sample

Sampling can be effective (when unusual values are overweighted).

But it requires multiple plots or careful tuning of parameters.
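One way to overweight unusual values when sampling is to keep each point with a probability that grows with its distance from the centre. The sketch below is only an illustration in base R: the weighting rule `0.002 * (1 + d2)` and the names `d2`, `p_keep` and `xy_sample` are assumptions, not something the talk specifies.

```r
set.seed(42)
n <- 1000000
xy <- data.frame(x = rnorm(n, mean = 0, sd = 2),
                 y = rnorm(n, mean = 0, sd = 2))

# Squared distance from the origin: a crude "unusualness" score (assumption).
d2 <- xy$x^2 + xy$y^2

# Keep probability grows with d2 and is capped at 1 (hypothetical rule).
p_keep <- pmin(1, 0.002 * (1 + d2))

# Independent keep/drop decision per point, so the sample size is random.
keep <- runif(n) < p_keep
xy_sample <- xy[keep, ]

plot(xy_sample$x, xy_sample$y, pch = ".", asp = 1,
     xlab = "x", ylab = "y", main = "Sample with overweighted tails")
```

The sample deliberately over-represents the tails, so apparent density in the plot should be read with care. Changing the constant `0.002` changes both the sample size and how strongly the tails stand out, which is the kind of tuning the slide warns about.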

Data / Sample / Model

Models are great as they scale nicely.

But visualisation is still required, because “I don’t know what I don’t know.”

Data / Sample / Model / Binning

Binning can solve a lot of these challenges

“Bin-summarise-smooth: a framework for visualising large data” - Hadley Wickham (2013)

“Visualising big data is the process of creating generalized histograms”

Approach

BIN: fixed-width bins, bin index = floor((x - origin) / width)

SUMMARIZE: summary statistics per bin, e.g. count, mean, standard deviation

SMOOTH: smooth the summaries, e.g. kernel mean, regression

VISUALISE: visualise using standard plots
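The four steps can be sketched in base R on a million simulated points. This is a simplified illustration, not the bigvis implementation: the bin `width` of 0.1, the smoothing `bandwidth` of 0.5 and all variable names are assumptions, and the kernel smooth ignores the bin counts for brevity.

```r
set.seed(1)
n <- 1000000
x <- rnorm(n, mean = 0, sd = 2)
y <- rnorm(n, mean = 0, sd = 2)

# BIN: fixed-width bins along x.
origin <- floor(min(x))
width  <- 0.1
bin_id <- floor((x - origin) / width)

# SUMMARISE: count and mean of y within each bin.
count  <- tapply(y, bin_id, length)
mean_y <- tapply(y, bin_id, mean)
binned <- data.frame(
  centre = origin + (as.numeric(names(count)) + 0.5) * width,
  count  = as.vector(count),
  mean_y = as.vector(mean_y)
)

# SMOOTH: kernel mean of the per-bin means, evaluated at the bin centres.
smoothed <- ksmooth(binned$centre, binned$mean_y, kernel = "normal",
                    bandwidth = 0.5, x.points = binned$centre)

# VISUALISE: standard plots of a few hundred bins instead of a million points.
plot(binned$centre, binned$count, type = "h",
     xlab = "x", ylab = "count", main = "Binned counts")
plot(binned$centre, binned$mean_y, pch = 20, col = "grey",
     xlab = "x", ylab = "mean of y", main = "Binned means with kernel smooth")
lines(smoothed$x, smoothed$y, col = "orange", lwd = 2)
```

After binning, every later step works on the small `binned` table, so its cost depends on the number of bins rather than the number of raw points.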

Bigvis Package in R

Aim: To plot 100 million points in under 5 seconds.

Approach:
- Plotting using standard R libraries
- Processing done in (fast) compiled C++ code, using the Rcpp package
- Outlier removal in big data
- Smoothing to highlight trends & suppress noise

Diamonds dataset

ggplot(diamonds) + aes(carat, price) + geom_point(alpha = 0.2, colour = "orange")

About 54k observations, e.g. price and carat of diamonds

Condense (bin + summarise)

library(bigvis)
library(ggplot2)

# Bin carat and price into roughly Nbin bins each, then count per 2-D bin
Nbin <- 20
BinData <- with(diamonds, condense(
  bin(carat, find_width(carat, Nbin)),
  bin(price, find_width(price, Nbin))))

Plotting the Condensed Data

p <- ggplot(BinData) + aes(carat, price, fill=.count) + geom_tile()

20 bins per variable, summarised by count
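To see what a two-dimensional bin-and-count step produces, here is a rough base R sketch on simulated data. It does not use bigvis and is not how `condense()` is implemented; the variables `u` and `v`, the helper `bin_index()` and the equal-width rule are all assumptions made for illustration.

```r
set.seed(7)
n <- 50000
u <- rexp(n, rate = 1)        # simulated stand-in for a skewed variable
v <- 4 * u + rnorm(n)         # simulated stand-in for a related variable

# Hypothetical helper: zero-based index of an equal-width bin over the range of x.
bin_index <- function(x, nbins) {
  width <- diff(range(x)) / nbins
  idx <- floor((x - min(x)) / width)
  pmin(idx, nbins - 1)        # the maximum value falls into the last bin
}

nbins <- 20
bu <- factor(bin_index(u, nbins), levels = 0:(nbins - 1))
bv <- factor(bin_index(v, nbins), levels = 0:(nbins - 1))

# One count per (u bin, v bin) cell; empty cells are kept as zeros.
counts <- table(bu, bv)
z <- matrix(as.numeric(counts), nrow = nbins, ncol = nbins)

# Tile plot of the counts, on a log scale so sparse cells stay visible.
image(0:nbins, 0:nbins, log1p(z),
      xlab = "u bin", ylab = "v bin", main = "2-D bin counts")
```

The result is a 20 x 20 grid of counts, which plays the same role as the `.count` column mapped to `fill` in the tile plot.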

Both Points & Condensed

q <- p + geom_point(data = diamonds, aes(fill = NULL), alpha = 0.2, colour = "orange")

20 bins per variable, summarised by count, with the raw points overlaid

Movies dataset

ggplot(movies) + aes(length, rating) + geom_point(alpha = 0.2, colour = "orange")

130k observations, e.g. length and rating of movies on IMDB

Let us see the outliers

  title                                            length rating
1 Matrjoschka                                        5700    8.5
2 The Cure for Insomnia                              5220    5.9
3 The Longest Most Meaningless Movie in the World    2880    7.3
4 The Hazards of Helen                               1428    6.6
5 ****                                               1100    6.9

Condense (bin + summarise)

library(bigvis)
library(ggplot2)

# Use many more bins here, so the long tail of movie lengths is resolved
Nbin <- 1e4
BinData <- with(movies, condense(
  bin(length, find_width(length, Nbin)),
  bin(rating, find_width(rating, Nbin))))

Condensed Plot

p <- ggplot(BinData) + aes(length, rating, fill=.count) + geom_tile()

10,000 bins per variable, summarised by count

Remove Outliers

p %+% peel(BinData)

10,000 bins per variable, summarised by count, with 1% of outliers peeled off
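The idea of peeling can be sketched on a table of bin counts: remove the sparsest bins until just under 1% of the observations are gone. The helper `peel_sparse()` below is hypothetical and only illustrates the principle; the `peel()` function in bigvis may choose which bins to drop differently.

```r
# Hypothetical helper: drop the sparsest bins that together hold at most
# (1 - keep) of all observations.
peel_sparse <- function(binned, keep = 0.99) {
  ord  <- order(binned$count)              # sparsest bins first
  cum  <- cumsum(binned$count[ord])        # observations removed so far
  drop <- ord[cum <= (1 - keep) * sum(binned$count)]
  if (length(drop) > 0) binned[-drop, , drop = FALSE] else binned
}

# Toy table of bin counts (made-up numbers for illustration only).
binned <- data.frame(bin = 1:5, count = c(995, 1, 2, 1, 1))
peeled <- peel_sparse(binned, keep = 0.99)
print(peeled)
```

In this toy table the four sparse bins hold 5 of 1000 observations, which is below the 1% budget, so only the dense bin survives and the plot axes no longer stretch to cover isolated outliers.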

Smoothing

smoothBinData <- smooth(peel(BinData), h = c(20, 1))
autoplot(smoothBinData)

Binned, summarised by count, 1% of outliers peeled off & smoothed (h = 20 for length, 1 for rating)

Big Data Visualisation

● Approach: Bin - Summarize - Smooth - Visualise
● “Interactively” plot nearly 100 million data points in-memory for EDA in R
● Can be extended to in-database processing, e.g. for binning
● Can be parallelised, e.g. summarising by count or mean

Amit Kapoor (@amitkaps)

amitkaps.com
narrativeviz.com

