A Statistical Viewpoint on Data Science, Data Mining and Big Data Alec Stephenson DATA ANALYTICS, DIGITAL PRODUCTIVITY AND SERVICES

Post on 14-Dec-2015

Page 1:

A Statistical Viewpoint on Data Science, Data Mining and Big Data
Alec Stephenson
DATA ANALYTICS, DIGITAL PRODUCTIVITY AND SERVICES

Page 2: Introduction

Statistics Vs Data Science
Statistician Vs Data Scientist
Data Science in Predictive Analytics
Data Science in Consulting
Big Data: Are Statisticians Relevant?

Page 3: Data Science Venn Diagram (Drew Conway)

Page 4: Statistician Vs Data Scientist

[Figure: Interest plotted against Time, 2006 to 2014, for the terms STATISTICIAN and DATA SCIENTIST.]

Page 5:

I am a Data Scientist:
On LinkedIn
On my email signature
To market myself to internal and external clients

I am a Statistician:
At academic conferences
Providing expertise for journal articles
Any role as a technical expert

Page 6: Is There A Greater Demand For Data Scientists?

Experfy: www.experfy.com
Melbourne Data Science Meet-Up: www.meetup.com/Data-Science-Melbourne/

BUT: Kaggle Connect no longer exists (March-December 2013)

Page 7: Data Science Skills

Essential:
Statistical Modelling: e.g. R, MATLAB, Python
Data Munging: e.g. Perl, Python, Ruby

Additional:
Fast Computation: C, C++, Java
Data Storage: SQL, NoSQL
Big Data: MapReduce, Mahout, Hive, Pig

Page 8: Data Mining Competitions (www.kaggle.com)

Good For Building Essential Skills In Predictive Analytics

Only Three Steps To Winning:
Data Munging
Machine Learning / Statistical Modelling
Ensembling

Page 9: Data Mining Competitions (www.kaggle.com)

General Advice:
Just because you have data does not mean that you have to use it
There is no such thing as a single best model
Different models can capture different features
Visualize the data

Page 10: Data Mining Competitions (www.kaggle.com)

General Advice:
If something takes more than one minute to run, do you really need to run it?
Spend more time on trying different data transformations and models, and less on parameter specification
Just have a go. How much time can you afford?
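As a rough illustration of the advice to try different transformations rather than tune parameters, the following base R sketch fits the same simple linear model under three transformations and compares held-out error on the original scale. The choice of the built-in `cars` data, the three transformations, and the naive back-transformation are all assumptions made for illustration; they are not taken from the slides.

```r
# Assumption: the built-in cars data (speed, dist) stands in for competition data.
set.seed(100)
ind <- sample(nrow(cars), 10)      # hold out 10 rows for testing
train <- cars[-ind, ]
test <- cars[ind, ]

# Mean squared error against the held-out stopping distances
mse <- function(pred) mean((pred - test$dist)^2)

# Same model family, three different transformations of the data
m_raw  <- lm(dist ~ speed, data = train)
m_sqrt <- lm(sqrt(dist) ~ speed, data = train)
m_log  <- lm(log(dist) ~ log(speed), data = train)

# Predictions are back-transformed naively to the original scale
# (this ignores retransformation bias, which is acceptable for a quick comparison)
scores <- c(raw    = mse(predict(m_raw, test)),
            sqrt   = mse(predict(m_sqrt, test)^2),
            loglog = mse(exp(predict(m_log, test))))
scores
```

Each fit takes a fraction of a second, so many transformations can be screened in the time a single parameter grid search would need.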

Page 11: Data Mining Competitions (www.kaggle.com)

Usually Good Methods:
Gradient boosting machine (gbm / mboost)
Random forest (randomForest)
Elastic net (glmnet)
Support Vector Machine (kernlab / e1071)
Neural networks (nnet)

Page 12: Data Mining Competitions (www.kaggle.com)

Usually Not-So-Good Methods:
Recursive Partitioning (rpart / tree)
Nearest neighbour (class)
Multivariate Adaptive Regression Splines (earth)
Naive Bayes (e1071)

Page 13: Data Mining Example I

library(randomForest)
library(gbm)
library(glmnet)

# Numeric iris measurements only (drop the Species column)
data <- as.matrix(iris[,-5])
# Hold out 15 of the 150 rows as a test set
set.seed(100)
ind <- sample(150, 15)
train <- data[-ind,]
test <- data[ind,]

Page 14: Data Mining Example II

# Random forest: predict column 1 (Sepal.Length) from the other three columns
set.seed(100)
m1 <- randomForest(train[,2:4], train[,1], ntree = 1000, mtry = 2)
pm1 <- predict(m1, test[,-1])
mean((pm1 - test[,1])^2)   # test-set mean squared error

# Gradient boosting machine
set.seed(100)
m2 <- gbm.fit(train[,2:4], train[,1], distribution = "gaussian",
  n.trees = 10000, shrinkage = 0.001, interaction.depth = 2)
pm2 <- predict(m2, test[,-1], n.trees = 10000)
mean((pm2 - test[,1])^2)

# Elastic net
set.seed(100)
m3 <- glmnet(train[,2:4], train[,1], family = "gaussian", alpha = 0.5)
pm3 <- predict(m3, test[,-1])
pm3 <- pm3[,ncol(pm3)]   # keep the predictions for the last lambda on the path
mean((pm3 - test[,1])^2)

Page 15: Data Mining Example III

# Ensembling: test-set mean squared error of simple averages of the predictions
mean(((pm1 + pm2)/2 - test[,1])^2)
mean(((pm1 + pm3)/2 - test[,1])^2)
mean(((pm2 + pm3)/2 - test[,1])^2)
mean(((pm1 + pm2 + pm3)/3 - test[,1])^2)
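A single 15-row test set gives a noisy comparison, so one possible extension is to repeat the split several times and average the errors. The sketch below does this in base R only. The two `lm()` fits (linear and quadratic) are stand-ins chosen so that the example runs without extra packages, and the names `dat`, `one_split` and `results` are hypothetical; none of this is the author's code.

```r
# Assumption: two lm() fits stand in for the random forest, gbm and glmnet models.
dat <- iris[, 1:4]
names(dat) <- c("y", "x1", "x2", "x3")   # y is Sepal.Length

mse <- function(pred, obs) mean((pred - obs)^2)

# One random 135/15 split: fit both models, score each and their simple average
one_split <- function(seed) {
  set.seed(seed)
  ind <- sample(nrow(dat), 15)
  train <- dat[-ind, ]
  test <- dat[ind, ]
  m1 <- lm(y ~ x1 + x2 + x3, data = train)
  m2 <- lm(y ~ poly(x1, 2) + poly(x2, 2) + poly(x3, 2), data = train)
  p1 <- predict(m1, test)
  p2 <- predict(m2, test)
  c(linear    = mse(p1, test$y),
    quadratic = mse(p2, test$y),
    average   = mse((p1 + p2) / 2, test$y))
}

# Repeat over 20 seeds; rows are models, columns are splits
results <- sapply(1:20, one_split)
rowMeans(results)   # mean test error per model across splits
```

Whether the averaged predictions beat either individual model depends on the data and the models, which is why the comparison is worth running rather than assuming.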

Page 16: Prediction: Competitions Vs Clients

Predictive analytics is a black box
Simplicity vs Predictive Accuracy
Communication with client

Reporting: methods or conclusions
Variable Importance
Client Implementation
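The slides do not say how variable importance should be measured. One common, model-agnostic choice is permutation importance: shuffle one predictor at a time and record how much the prediction error increases. The sketch below is a minimal base R version under that assumption, using an `lm()` fit on the iris measurements as a stand-in for a black-box model; the helper `perm_importance` is hypothetical.

```r
# Assumption: lm() stands in for a black-box model; any model with predict() would do.
set.seed(100)
dat <- iris[, 1:4]
fit <- lm(Sepal.Length ~ ., data = dat)

# Baseline error with the data left intact
base_mse <- mean((predict(fit, dat) - dat$Sepal.Length)^2)

# Average increase in error when one predictor is shuffled, over n_rep shuffles
perm_importance <- function(var, n_rep = 50) {
  increases <- replicate(n_rep, {
    shuffled <- dat
    shuffled[[var]] <- sample(shuffled[[var]])
    mean((predict(fit, shuffled) - dat$Sepal.Length)^2) - base_mse
  })
  mean(increases)
}

predictors <- setdiff(names(dat), "Sepal.Length")
importance <- sapply(predictors, perm_importance)
sort(importance, decreasing = TRUE)
```

A ranked list like this is often easier to communicate to a client than the fitted model itself, although it describes the model's behaviour and not any causal effect.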

Page 17: Big Data

Means different things to different people
SKA: 10 petabytes per hour by 2025
Statisticians typically consider a few gigabytes to be a huge dataset

Do statisticians have a role to play?

Page 18: Big Data 3V's: Volume Velocity Variety

Volume: MB, GB, TB, PB, ...
Velocity: Real-Time, Hourly, Weekly, Batch
Variety: Structured, Unstructured

Veracity: How accurate?
Value: How valuable?

Page 19: Gartner Hype Cycle 2013

Page 20: Big Data: A typical statistician…

Will say that they are heavily involved in big data
Will use big data for marketing purposes

Will never have programmed a MapReduce job
Will never have used datasets of 0.5TB+
Will not know about big data technologies

Why is this?

Page 21:

Statisticians may have a role in:
Deciding what data is relevant to the question
Subsetting and sampling big data
Modelling these subsets

Statisticians may not have a role:
If you need to touch all of the data (0.5TB+)
Restriction to linear (or linearithmic) algorithms
Sums / Averages / Graph Search / Sorting
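The two situations on this slide can be contrasted in a small base R sketch. The first part samples a subset and models it; the second computes an average in a single pass over chunks, which is the kind of linear-time operation that touches every row. The simulated data frame `big`, its size, the sample size and the chunk size are all assumptions chosen so that the example runs quickly; real big data would not fit in memory like this.

```r
# Assumption: a simulated 1e6-row data frame stands in for a much larger data set.
set.seed(100)
n <- 1e6
big <- data.frame(x = rnorm(n))
big$y <- 2 + 3 * big$x + rnorm(n)

# Role for the statistician: sample a subset and model it
ind <- sample(n, 10000)
fit_sub <- lm(y ~ x, data = big[ind, ])
coef(fit_sub)   # estimates from 1% of the rows

# Touching all of the data: a single-pass mean, one chunk at a time
chunk_size <- 1e5
total <- 0
count <- 0
for (start in seq(1, n, by = chunk_size)) {
  chunk <- big$y[start:min(start + chunk_size - 1, n)]
  total <- total + sum(chunk)
  count <- count + length(chunk)
}
running_mean <- total / count
running_mean
```

The sampled fit needs only a small fraction of the rows, whereas the running mean must visit every row but keeps just two numbers in memory between chunks.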

Page 22: Big Data???

Robust Statistics and Extremes
8 – 11 September, 2014
Australian National University

Statistics today is faced with many challenges, especially relating to such topical issues as the analysis of "big data" through to understanding the complexities of climate change - and many others. Floods, fires, variations in temperature on local through to global scales, etc., have provided impetus for recent vigorous redevelopments of extreme value analysis. Extremely large data sets and high dimensional data now becoming available in genetics, finance, physics, astronomy, and many other areas, have spurred exponential advances in statistical theory and practice with special emphasis on robustness issues, in recent years. The need to analyse large, linked, data sets in health, crime, agriculture, surveys, and industry, just to name a few, has revolutionised our profession. It's an exciting time to be a statistician.

The aim of the Robust Statistics and Extremes (RS&E) conference is to provide an opportunity for researchers to present up-to-date accounts of the present state of the art in the topics of Robust Statistics and Extremes. A number of distinguished speakers, both international and Australian, will give keynote addresses in their areas of interest. Special provision will be made for student participation.