data science - tools and methods: practitioner perspectives


Upload: scott-allen-mongeau

Post on 21-Apr-2017


Company Confidential - For Internal Use Only

Copyright © 2017, SAS Institute Inc. All rights reserved.

ANALYTICS TOOLS AND METHODS:

PRACTITIONER PERSPECTIVES

Guest Lecturer: Scott Allen Mongeau

Data Scientist

Cyber Analytics

Cell: + 31 (0)6 8370 3097

[email protected]

BIG DATA AND BUSINESS ANALYTICS

Masters of Business and Information Management

2016 - 2017

dr Jan van Dalen


Education

• PhD (ABD)

• MBA

• MA Financial Mgmt

• Cert. Finance

• GD IT Mgmt

• MA Com Tech

Experience

• SAS Institute: Sr. Mgr. Business Solutions
• Deloitte: Manager Analytics
• Nyenrode University: Lecturer Analytics
• SARK7: Owner / Principal Consultant
• Genentech Inc. / Roche: Principal Analyst / Sr. Mgr.
• Atradius: Sr. R&D Engineer
• CFSI: CIO

Data Scientist

Cyber Analytics

[email protected]

+31 (0)64 235 3427

Scott Allen Mongeau

Certified Analytics Professional (CAP)

YouTube

• Introduction to Advanced Analytics

• Introduction to Cognitive Analytics

• TedX RSM: Data Analytics

Blog: sctr7.com

Twitter: sark7

Web: sark7.com

IT solutions

Research

methods

Finance

Data

analytics

Consulting

DATA ANALYTICS MARKET LEADER

• 40 years of BUSINESS ANALYTICS
• #1: World’s LARGEST privately held software company
• 14,000 SAS employees worldwide
• 93 of the top 100 companies on the GLOBAL 500 LIST
• 80,000+ customer sites in 148 countries
• US $ 3.2 B: continuous revenue growth since 1976
• 23%: annual reinvestment in R&D


ANALYTICS SOLUTIONS

• FORECASTING: Leveraging historical data to drive better insight into proactive decision-making
• DATA MINING / MACHINE LEARNING: Mine transaction databases to create models of likely outcomes
• TEXT ANALYTICS: Finding treasures in unstructured data like social media or survey tools that could uncover insights about key business challenges
• OPTIMIZATION: Analyze massive amounts of data in order to accurately identify areas likely to produce the most profitable results
• STATISTICS
• Data Management (Integration, Quality & Governance)

MOORE’S LAW: EXPONENTIAL GROWTH OF COMPUTING POWER

25,000 x

Home computers

High-capacity servers

Smartphone

explosion

Cloud, AI / Watson, IoT

2015


PEOPLE & ORGANIZATION:

DATA SCIENTIST ROLE


Calvin.Andrus (2012) http://en.wikipedia.org/wiki/File:DataScienceDisciplines.png

SEEKING THE

‘DATA SCIENTIST’


DATA SCIENCE

PROFESSIONAL

PERSPECTIVES

http://www.oreilly.com/data/free/2016-data-science-salary-survey.csp


DATA SCIENCE AS

PEOPLE/ROLES


DATA ANALYTICS

• Data science

• Statistician

• Data miner / machine learning

• Text analytics / mining

BUSINESS ANALYTICS

• Business analyst

• BI solutions

• Visualization / interface design

• Functional domain specialty (e.g. marketing analytics)

DATA MANAGEMENT

• Information / data architecture

• Database management

• Data engineering

• Data quality / governance / MDM

OPERATIONS

• Analytics engineering / operations

• Security

• IT systems management

BUSINESS / ORGANIZATIONAL

• Decision Management

• Change management

• Analytics project management

• Domain expert / functional specialty / business manager

DATA SCIENCE

PEOPLE/ROLES

CORE DATA SCIENCE SKILLSET

IT

• BI/reports/dashboards

• Programming

• Systems/software dev

• Algorithms

• Systems administration

• User interface design/visualization

Mathematics

• Econometrics

• Graph analysis

• Matrix mathematics

• Multivariate analysis

• Probability

• Survival analysis

• Statistics

• Spatial analysis

• Temporal analysis

Business Domain

• Finance

• Operations

• Sales/marketing

• HR

Data Engineering

• Big & fast data solutions

• Data manipulation/ETL

• Database design

• Data structures

• Graphical

• NoSQL

• Unstructured data

Data Science

• Machine Learning

• Optimization

• Predictive analytics

• Simulation

• Text/semantic analytics

Research

• Scientific method

• Experimental design

• Research methodologies

• Social science methods

• Survey research


TECHNOLOGY & TOOLS:

DATA ANALYTICS TOOLS & TECH

APPLIED TECHNIQUES & TECHNOLOGIES

• Algorithms

(ex: computational complexity, CS theory)

• Back-end programming

(ex: Java/Rails/Objective-C)

• Bayesian/Monte-Carlo statistics

(ex: MCMC, BUGS)

• Big and distributed data

(ex: Hadoop, Map/Reduce)

• Business

(ex: management, business development, budgeting)

• Classical statistics

(ex: general linear model, ANOVA)

• Data manipulation

(ex: regexes, R, SAS, web scraping)

• Front-end programming

(ex: JavaScript, HTML, CSS)

• Graphical models

(ex: social networks, Bayes networks)

• Machine learning

(ex: decision trees, neural nets, SVM, clustering)

• Math

(ex: linear algebra, real analysis, calculus)

• Optimization

(ex: linear, integer, convex, global)

• Science

(ex: experimental design, technical writing/publishing)

• Simulation

(ex: discrete, agent-based, continuous)

• Solutions development

(ex: design, project management)

• Spatial statistics

(ex: geographic covariates, GIS)

• Structured data

(ex: SQL, JSON, XML)

• Surveys and marketing

(ex: multinomial modeling)

• Systems administration

(ex: *nix, DBA, cloud tech.)

• Temporal statistics

(ex: forecasting, time-series analysis)

• Unstructured data

(ex: NoSQL, text mining)

• Visualization

(ex: statistical graphics, mapping, web-based dataviz)

SOURCE: “Analyzing the Analyzers”

http://www.datasciencecentral.com/profiles/blogs/how-to-become-a-data-scientist?overrideMobileRedirect=1


DATA SCIENCE

TECHNOLOGIES

TOOLS:

MULTIPLE TOOLS


DATA SCIENCE

TECHNOLOGIES

TOOLS:

DBMS AND HADOOP...


Enterprise Big Data

Browser Open Source

TOOLS


PROCESS & METHODS:

DATA ANALYTICS


Figure: analytics VALUE plotted against SOPHISTICATION.

• DESCRIPTIVE: What happened?
• PREDICTIVE: What are trends?
• PRESCRIPTIVE: What to do?
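The three questions can be made concrete with a toy example. This is only a sketch: the sales figures, the price and unit cost, and the function names `describe`, `predict_next` and `prescribe` are all hypothetical and do not come from the lecture.

```python
from statistics import mean

# Hypothetical monthly unit sales (made-up numbers for illustration).
sales = [100, 110, 120, 130]


def describe(history):
    """Descriptive: what happened? Summarize the past."""
    return mean(history)


def predict_next(history):
    """Predictive: what are trends? Fit a least-squares line and extrapolate."""
    n = len(history)
    xs = list(range(n))
    x_bar, y_bar = mean(xs), mean(history)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, history)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept + slope * n


def prescribe(forecast, candidates, price=5.0, unit_cost=3.0):
    """Prescriptive: what to do? Pick the order quantity with the best profit."""

    def profit(quantity):
        # Units beyond the forecast demand are assumed to stay unsold.
        return price * min(quantity, forecast) - unit_cost * quantity

    return max(candidates, key=profit)


average_sales = describe(sales)
forecast = predict_next(sales)
order_quantity = prescribe(forecast, candidates=range(100, 181, 10))
print(average_sales, forecast, order_quantity)
```

Each step consumes the output of the previous one, which is one way to read the rising sophistication on the slide.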

Figure: VALUE versus SOPHISTICATION, from DESCRIPTIVE through PREDICTIVE to PRESCRIPTIVE, with methods placed along the curve:

• Business Intelligence (BI)
• Econometrics
• Forecasting
• Machine Learning
• Operations Management

Figure: analytics maturity model. Axes: business value (Transactional to Strategic) and analytics maturity (Aspirational to Transformed). Foundation: DATA QUALITY. Label: Advanced Analytics.

• DESCRIPTIVE: Understanding Patterns (Business Intelligence, Data visualization)
• DIAGNOSTICS: Identifying Factors & Causes
• PREDICTIVE: Forecasting & Probabilities
• PRESCRIPTIVE: Optimizing Systems
• SEMANTIC: Understanding Social Context & Meaning

CRISP-DM

Provost; Fawcett. Data Science for Business

Chapter 2: Business Problems and Data Science Solutions

SAS ANALYTICS LIFECYCLE

• PROBLEM FRAMING
• DATA SELECTION & GATHERING
• DATA EXPLORATION
• TRANSFORM & SELECT
• MODEL BUILDING
• MODEL VALIDATION
• MODEL DEPLOYMENT
• EVALUATE & MONITOR RESULTS

Phases: FRAMING & DISCOVERY; EXPLANATION & PREDICTION

Fair use: illustrate publication and article of issue in question. The Economist.

http://en.wikipedia.org/wiki/Category:Fair_use_The_Economist_magazine_covers


Wikipedia commons http://en.m.wikipedia.org/wiki/File:Mond-vergleich.svg


Scientific test…


Public domain Agricultural Research Service

http://en.wikipedia.org/wiki/File:Orange_juice_1.jpg

GNU Free Documentation License: Ibanix Suzuki Shahid DL650 motorcycle

http://commons.wikimedia.org/wiki/File:Suzuki_vstrom_dl650_motorcycle.jpg


PREDICTIVE ANALYTICS:

SUPERVISED MACHINE LEARNING

Supervised learning - predictive

• k-nearest neighbors (k-NN)

• Decision Trees (DT)

(random forests, boosted trees)

• Naïve Bayes classifier

• Neural networks

• Support Vector Machine (SVM)

• Ensembles / Ensemble Learning

Decision Tree

Machine Learning

Support Vector Machines

MACHINE LEARNING PREDICTION (SUPERVISED)

Figure: a classification engine (labeled “CAR Engine”) built on a training set and checked on a validation set.

• Classes: Non-criminal (NORMAL), Criminal (UNUSUAL)
• Inputs: Device, Time of day, Source location, IP, Threat intelligence, Amount, Destination location
• Profile labels: At risk profile, Secure profile, Known devices, Average amount, Known location, Known destination
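The training set / validation set idea can be sketched with a one-rule classifier. This is a minimal illustration under stated assumptions: the transactions are made up, only the Amount input is used, and `fit_stump` and `accuracy` are hypothetical helper names, not part of any tool shown in the lecture.

```python
# Hypothetical labeled transactions: (amount, label).
# label 1 = UNUSUAL, label 0 = NORMAL. All values are made up for illustration.
transactions = [
    (20, 0), (35, 0), (50, 0), (900, 1),
    (15, 0), (700, 1), (40, 0), (1200, 1),
    (60, 0), (850, 1), (25, 0), (950, 1),
]

# Hold out the last third as a validation set; fit only on the training set.
split = len(transactions) * 2 // 3
training_set, validation_set = transactions[:split], transactions[split:]


def fit_stump(rows):
    """Learn a one-rule classifier: flag as UNUSUAL when amount > threshold."""
    amounts = sorted({amount for amount, _ in rows})
    # Candidate thresholds are midpoints between consecutive observed amounts.
    candidates = [(a + b) / 2 for a, b in zip(amounts, amounts[1:])]

    def errors(threshold):
        return sum((amount > threshold) != bool(label) for amount, label in rows)

    return min(candidates, key=errors)


def accuracy(threshold, rows):
    """Share of rows where the rule agrees with the known label."""
    correct = sum((amount > threshold) == bool(label) for amount, label in rows)
    return correct / len(rows)


threshold = fit_stump(training_set)
validation_accuracy = accuracy(threshold, validation_set)
print("threshold:", threshold)
print("training accuracy:", accuracy(threshold, training_set))
print("validation accuracy:", validation_accuracy)
```

The validation rows play no part in choosing the threshold, so their accuracy is the more honest estimate of how the rule would behave on new transactions.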


EXAMPLE MACHINE LEARNING TOOLS

Open source

• R

• Python

• Weka

Commercial

• SAS BASE & JMP

• SAS Enterprise Miner

• IBM SPSS

• Oracle Data Mining

• RapidMiner

Ranjit Bose, (2009),"Advanced analytics: opportunities and challenges",

Industrial Management & Data Systems, Vol. 109 Iss 2 pp. 155 - 172

http://dx.doi.org/10.1108/02635570910930073

MACHINE LEARNING ENGINES

WEKA, SAS Enterprise Miner

DEMO: SAS ENTERPRISE MINER

Interface areas shown: Workflow, Configuration, Models / utilities, Data, IDE


• Data preparation

• Model development

• Model management

• Model deployment

http://www.sas.com/en_gb/insights/articles/analytics/Industrialize-your-analytics-today.html


CONFUSION MATRIX

A confusion matrix separates out the decisions made by the classifier, making explicit how one class is being confused for another. In this way different sorts of errors may be dealt with separately.

Provost; Fawcett. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking. Chapter 7: Decision Analytic Thinking
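A confusion matrix for a two-class problem can be tallied in a few lines. The labels below are hypothetical, and `confusion_matrix` is an illustrative helper rather than the API of any package named in this lecture.

```python
from collections import Counter

# Hypothetical actual and predicted labels (1 = positive, 0 = negative).
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]


def confusion_matrix(actual, predicted):
    """Count each (actual, predicted) combination separately."""
    counts = Counter(zip(actual, predicted))
    return {
        "TP": counts[(1, 1)],  # positives correctly flagged
        "FN": counts[(1, 0)],  # positives missed
        "FP": counts[(0, 1)],  # negatives wrongly flagged
        "TN": counts[(0, 0)],  # negatives correctly passed
    }


matrix = confusion_matrix(actual, predicted)

# The two kinds of error get their own rates, so they can be handled separately.
false_positive_rate = matrix["FP"] / (matrix["FP"] + matrix["TN"])
false_negative_rate = matrix["FN"] / (matrix["FN"] + matrix["TP"])

print(matrix)
print("false positive rate:", false_positive_rate)
print("false negative rate:", false_negative_rate)
```

Overall accuracy here is 8 out of 10, but the matrix shows that this single number hides one missed positive and one false alarm, which may carry very different costs.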

RECEIVER OPERATING CHARACTERISTICS (ROC) & AREA UNDER THE CURVE (AUC)

“A ROC graph is a two-dimensional plot of a classifier with false positive rate on the x axis against true positive rate on the y axis. ROC graph depicts relative trade-offs that a classifier makes between benefits (true positives) and costs (false positives).”

Provost; Fawcett. Data Science for Business
Chapter 8: Visualizing Model Performance

Area Under the Curve (AUC): area under a classifier’s curve expressed as a fraction of the unit square. Its value ranges from zero to one.
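The ROC points and the AUC can be computed directly from scored cases. This is a sketch with hypothetical labels and scores; it assumes all scores are distinct, and `roc_points` and `area_under_curve` are illustrative names.

```python
# Hypothetical classifier output: true label (1/0) and score for six cases.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]


def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) points.

    The threshold is lowered one case at a time, from the highest score down.
    Assumes all scores are distinct.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), reverse=True)
    points = [(0.0, 0.0)]
    true_positives = false_positives = 0
    for _, label in ranked:
        if label == 1:
            true_positives += 1
        else:
            false_positives += 1
        points.append((false_positives / negatives, true_positives / positives))
    return points


def area_under_curve(points):
    """Trapezoidal area under the ROC points, as a fraction of the unit square."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


points = roc_points(labels, scores)
auc = area_under_curve(points)
print(points)
print("AUC:", auc)
```

With these numbers the positive case outscores the negative case in 8 of the 9 positive/negative pairs, which matches the computed area of 8/9.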

CUMULATIVE RESPONSE / LIFT CURVE

• Lift: how much the line representing the model performance is lifted up over the random performance diagonal
• E.g. “our model gives a two times (or a 2X) lift”: this means that at the chosen threshold (often not mentioned), the lift curve shows that the model’s targeting is twice as good as random

Provost; Fawcett. Data Science for Business. Chapter 8: Visualizing Model Performance
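A lift figure always belongs to a threshold, as a small calculation shows. The ten scored customers below are hypothetical, and `lift_at` is an illustrative helper, not a function from the book or from any tool in this lecture.

```python
# Hypothetical campaign data: 10 customers, 4 of whom responded (label 1).
labels = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.95, 0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]


def lift_at(labels, scores, fraction):
    """Response rate in the top-scored fraction divided by the overall rate."""
    ranked = [label for _, label in sorted(zip(scores, labels), reverse=True)]
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(ranked[:top_n]) / top_n
    base_rate = sum(ranked) / len(ranked)
    return top_rate / base_rate


lift_top_half = lift_at(labels, scores, 0.5)
lift_everyone = lift_at(labels, scores, 1.0)
print("lift when targeting the top 50%:", lift_top_half)
print("lift when targeting everyone:", lift_everyone)
```

Targeting the top-scored half reaches responders at twice the base rate (a 2X lift), while targeting everyone gives a lift of exactly 1, which is the random baseline.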


DESCRIPTIVE ANALYTICS:

UNSUPERVISED MACHINE LEARNING

Unsupervised learning

• Cluster analysis

• Factor analysis

• Self-Organizing Maps (SOMs)

k-means clustering

Machine Learning

MACHINE LEARNING: R / RStudio

Interface areas shown: Workflow, Configuration, Data, Results, Scripting environment, Graphical results, Models

DESCRIPTIVE (UNSUPERVISED): CLUSTER ANALYSIS FOR PATTERN DETECTION

Cluster Analysis using SAS Enterprise Guide
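For readers without SAS Enterprise Guide, the idea of cluster analysis can be sketched with a plain k-means loop. This is not the procedure SAS uses; the points are hypothetical, and the naive "first k points" initialization is an assumption made to keep the example deterministic.

```python
from math import dist
from statistics import mean

# Hypothetical 2-D observations forming two visibly separate groups.
points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]


def k_means(points, k, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then re-center."""
    # Naive deterministic start: the first k points serve as initial centroids.
    centroids = list(points[:k])
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Move each centroid to the mean of its members (keep it if empty).
        centroids = [
            tuple(mean(axis) for axis in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


centroids, clusters = k_means(points, k=2)
print("centroids:", centroids)
print("clusters:", clusters)
```

No labels are supplied at any point: the two groups emerge only from the distances between observations, which is what makes the method unsupervised.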


BIG DATA:

BACKGROUND AND EXAMPLE

ONLINE IN 60 SECONDS…

Qmee

http://blog.qmee.com/qmee-online-in-60-seconds/

DATA ANALYTICS DRIVERS: V4C

Drivers: Social and mobile, Data analytics, Interactive platforms, Real-Time systems

V4C:

• VOLUME
• VELOCITY
• VARIETY
• VARIABILITY
• COMPLEXITY

MODEL ERRORS: INHERENT RANDOMNESS

• Cases where prediction is not “deterministic”
• Bayes rate
  – Theoretical maximum accuracy that can be achieved for a problem
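The ceiling that inherent randomness puts on accuracy can be worked out when the true probabilities are known. In this sketch the three segments and their probabilities are hypothetical; in practice these probabilities are unknown, which is why the Bayes rate is a theoretical limit.

```python
import random

# Hypothetical segments: name -> (share of all cases, P(positive | segment)).
segments = {"A": (0.3, 0.9), "B": (0.4, 0.5), "C": (0.3, 0.2)}


def bayes_accuracy(segments):
    """Best achievable accuracy: always predict the more likely class."""
    return sum(share * max(p, 1 - p) for share, p in segments.values())


def simulate_best_classifier(segments, n=20000, seed=0):
    """Draw random cases and score the classifier that knows the true odds."""
    rng = random.Random(seed)
    names = list(segments)
    weights = [segments[name][0] for name in names]
    correct = 0
    for _ in range(n):
        name = rng.choices(names, weights=weights)[0]
        p = segments[name][1]
        label = rng.random() < p      # the outcome itself is random
        prediction = p >= 0.5         # the best possible guess for this segment
        correct += label == prediction
    return correct / n


ceiling = bayes_accuracy(segments)
simulated = simulate_best_classifier(segments)
print("theoretical maximum accuracy:", ceiling)
print("simulated accuracy of the ideal classifier:", simulated)
```

Even a classifier that knows the true probabilities is right only about 71% of the time here, because segment B is a coin flip and segments A and C still contain exceptions.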

MODEL ERRORS: BIAS

• Bias: even with ‘Big Data’, a model will never reach the perfect accuracy of the true model
• Example
  – Linear regression model to predict response to an advertising campaign…
  – Model is an abstraction…
  – True model is always more complex

MODEL ERRORS: VARIANCE

• Variance: procedures with more variance tend to produce models with larger errors
• Accuracy tends to vary across training sets
• Given a finite sample set…
  – Different models emerge from different samples
  – Different models tend to have different accuracy

BALANCE: BIAS VERSUS VARIANCE

Big Data

• Complex model
• Many variables
• Low bias…
• but high variance
• Subject to overfitting

Strong models

• Tested abstraction
• Few, but significant variables
• Low variance…
• but high bias

Jno. T-62 tank in Russian service. http://www.aviation.ru/jno/Kubinka02

http://commons.wikimedia.org/wiki/File:T-62_tank_in_Russian_service_(2).jpg
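The trade-off can be observed by refitting two models on many fresh training sets and watching how their predictions at one point behave. Everything in this sketch is an assumption made for illustration: the true relationship `true_f`, the noise level, the sample size, and the choice of a mean-only model versus a 1-nearest-neighbor model as the "simple" and "complex" extremes.

```python
import random
from statistics import mean, pvariance


def true_f(x):
    """Hypothetical true relationship, unknown to the models."""
    return x * x


def sample_training_set(rng, n=20, noise=0.3):
    """Draw a fresh finite sample: x uniform on [0, 1], y = true_f(x) + noise."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]


def simple_model(train, x0):
    """High bias, low variance: ignore x0 and predict the average response."""
    return mean(y for _, y in train)


def complex_model(train, x0):
    """Low bias, high variance: copy the response of the nearest training point."""
    return min(train, key=lambda row: abs(row[0] - x0))[1]


def bias_and_variance(model, x0=0.9, repetitions=500, seed=1):
    """Refit on many training sets; return (squared bias, variance) at x0."""
    rng = random.Random(seed)
    predictions = [model(sample_training_set(rng), x0) for _ in range(repetitions)]
    squared_bias = (mean(predictions) - true_f(x0)) ** 2
    return squared_bias, pvariance(predictions)


simple_bias, simple_variance = bias_and_variance(simple_model)
complex_bias, complex_variance = bias_and_variance(complex_model)
print("simple model:  squared bias", simple_bias, "variance", simple_variance)
print("complex model: squared bias", complex_bias, "variance", complex_variance)
```

The simple model is consistently wrong in the same direction (bias), while the complex model is right on average but swings with every new sample (variance).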

Statistical Learning with Big Data

http://web.stanford.edu/~hastie/TALKS/SLBD_new.pdf


EXPLANATION:

CAUSAL MODELING

EXPLANATORY ANALYTICS

• Explanatory performance is NOT EQUAL to predictive efficacy (and vice versa): this reflects the difference between inductive and deductive methods/thinking
• This is a (sometimes heated) methodological debate amongst practitioners/academics…
• Is it really a debate, or a religious (professional/Kuhnian) dispute? Econometrics + machine learning (H. Varian)

MACHINE LEARNING AND ECONOMETRICS

• Varian, Hal R. 2014. Machine Learning and Econometrics. Stanford lecture slides:

https://web.stanford.edu/class/ee380/Abstracts/140129-slides-Machine-Learning-and-Econometrics.pdf

• Varian, Hal R. 2013. Big Data: New Tricks for Econometrics. Paper:

http://people.ischool.berkeley.edu/~hal/Papers/2013/ml.pdf

MODEL MANAGEMENT

• Ensemble learning…
  – Promising: averages over many predictive cases to reduce impact of variance
  – However, is CORRELATIVE, not CAUSAL
• CAUSAL data analysis requires
  – Investment in data acquisition
  – Similarity measurements
  – Expected value calculations
  – Correlation understanding
  – Identifying informative variables
  – Fitting equations to data
  – Significance testing
  – Domain knowledge
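The claim that ensemble learning reduces the impact of variance can be checked with bootstrap aggregation (bagging) of a deliberately unstable model. This is a sketch under assumptions: the true relationship, the noise level, the 1-nearest-neighbor base model and the 25 bootstrap resamples are all illustrative choices, not the lecture's method.

```python
import random
from statistics import mean, pvariance


def true_f(x):
    """Hypothetical true relationship, unknown to the models."""
    return x * x


def sample_training_set(rng, n=20, noise=0.3):
    """Draw a fresh finite sample: x uniform on [0, 1], y = true_f(x) + noise."""
    xs = [rng.random() for _ in range(n)]
    return [(x, true_f(x) + rng.gauss(0, noise)) for x in xs]


def nearest_neighbor(train, x0):
    """Unstable base model: copy the response of the nearest training point."""
    return min(train, key=lambda row: abs(row[0] - x0))[1]


def bagged_prediction(train, x0, rng, n_models=25):
    """Average the base model over bootstrap resamples of the training set."""
    predictions = []
    for _ in range(n_models):
        bootstrap = [rng.choice(train) for _ in train]
        predictions.append(nearest_neighbor(bootstrap, x0))
    return mean(predictions)


def prediction_variance(use_ensemble, x0=0.5, repetitions=300, seed=2):
    """Variance of the prediction at x0 across many independent training sets."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(repetitions):
        train = sample_training_set(rng)
        if use_ensemble:
            predictions.append(bagged_prediction(train, x0, rng))
        else:
            predictions.append(nearest_neighbor(train, x0))
    return pvariance(predictions)


single_variance = prediction_variance(use_ensemble=False)
ensemble_variance = prediction_variance(use_ensemble=True)
print("single model variance:", single_variance)
print("bagged ensemble variance:", ensemble_variance)
```

The averaging steadies the prediction, but nothing in it says why the response moves with x, which is the correlative-versus-causal caveat on the slide.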