Big Data Science: A Path Forward

Upload: dan-mallinger

Post on 09-Jul-2015

DESCRIPTION

This talk uses a case study to demonstrate core data science capabilities in Big Data, infrastructure requirements, and talent profiles that translate to early success. Using the challenge of classifying events on a consumer-oriented website, the discussion is aimed at a wide audience:

- Practitioners will learn two key techniques for early success
- Technologists will learn how teams rely on key infrastructure and where engineers play a valuable role in data science
- Hiring managers will expand their knowledge of the skills required to bring business value with data

TRANSCRIPT

Page 1: BIG Data Science:  A Path Forward

June 2013

BIG DATA SCIENCE: A PATH FORWARD

Page 2: BIG Data Science:  A Path Forward

CONFIDENTIAL | 2

linkedin.com/in/danmallinger/

@danmallinger

www.thinkbiganalytics.com

Data Science Lead @ Think Big

Product/Brand Obsessive

Teacher

Occasional Engineer

Page 3: BIG Data Science:  A Path Forward

TODAY

A high-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.

Page 4: BIG Data Science:  A Path Forward

INFRASTRUCTURE, TALENT, & CAPABILITIES

Understand our organizational needs for data science

Infrastructure: Technological tools and platforms.

Talent: Staff hired and trained.

Capabilities: Data science techniques utilized.

Hadoop NoSQL Analytics SQL/MPP Real Time

Scripting MapReduce Data Exploration Basic Modeling PhD Math

Visualization Clustering Categorization Continuous Models Text Analysis

Page 5: BIG Data Science:  A Path Forward

ANALYTICS TOOLS

Boxed Solutions: Mahout & Platform

Boxed solutions are limited but can be a good source of early velocity.

Toolkits: RHadoop, Scikit, etc.

You will need toolkits to solve unique problems, but smart techniques make that easier.

Page 6: BIG Data Science:  A Path Forward

DATA

Presenter

Gigabytes from Stack Overflow

Questions from users

With metadata

Users have reputations

Questions open or closed

Audience

Follow along, thinking about your data, to learn in a familiar context and plan

Page 7: BIG Data Science:  A Path Forward

STEP 1: EXPLORE

select count(1) as total
     , sum(has_code)
     , avg(body_count)
     , stddev_samp(body_count)
     , corr(reputation, owner_questions)
     , histogram_numeric(body_count, 10)
from questions
;

Patterns through Hive

Patterns through Tableau
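For readers who want to try the same exploration on a small sample outside Hive, here is a minimal Python sketch of the summary statistics in the query above. The column names come from the slide; the sample rows are made up, and the equal-width histogram is a simplification (Hive's histogram_numeric returns non-uniformly spaced bins).

```python
import math
from statistics import mean, stdev


def pearson(xs, ys):
    """Pearson correlation, the statistic reported by corr() in the query."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def equal_width_histogram(values, bins):
    """Counts per equal-width bin (a simplification of histogram_numeric)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp the maximum value
        counts[idx] += 1
    return counts


def summarize(questions):
    """Mirror the aggregates in the Hive query over a list of dict rows."""
    body = [q["body_count"] for q in questions]
    return {
        "total": len(questions),
        "with_code": sum(q["has_code"] for q in questions),
        "avg_body": mean(body),
        "sd_body": stdev(body),  # sample standard deviation, like stddev_samp
        "corr": pearson(
            [q["reputation"] for q in questions],
            [q["owner_questions"] for q in questions],
        ),
        "hist": equal_width_histogram(body, 10),
    }


# Made-up rows for illustration only.
sample = [
    {"has_code": 1, "body_count": 120, "reputation": 10, "owner_questions": 1},
    {"has_code": 0, "body_count": 40, "reputation": 20, "owner_questions": 2},
    {"has_code": 1, "body_count": 300, "reputation": 30, "owner_questions": 3},
    {"has_code": 0, "body_count": 80, "reputation": 40, "owner_questions": 4},
]
summary = summarize(sample)
```

This is only useful on a sample that fits in memory; the point of the Hive query is that the same aggregates run over the full data set.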

Page 8: BIG Data Science:  A Path Forward

STEP 2: FEATURE BUILDING

Summaries of unstructured data

Time-since metrics

select transform(…) using ‘python …’

Clustering: Browsing cohorts

/bin/mahout canopy

SQL Windowing: Cross-Record Features
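The slide elides both the transform columns and the script, so the following is only a sketch of what a Python script behind a Hive transform might look like. Hive streams rows to the script as tab-separated lines on standard input and reads tab-separated lines back. The input columns (question_id, created_at, body), the timestamp format, and the three output features are all assumptions made for illustration.

```python
import io
import sys
from datetime import datetime


def featurize(line, now):
    """Turn one tab-separated input row into a tab-separated feature row.

    Assumed (hypothetical) input columns: question_id, created_at, body.
    """
    question_id, created_at, body = line.rstrip("\n").split("\t", 2)
    created = datetime.strptime(created_at, "%Y-%m-%d %H:%M:%S")
    days_since = (now - created).days   # a time-since metric
    word_count = len(body.split())      # a summary of unstructured text
    has_code = int("<code>" in body)    # flag for an embedded code block
    features = (question_id, days_since, word_count, has_code)
    return "\t".join(str(v) for v in features)


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Stream rows from stdin to stdout, as a transform script would."""
    now = datetime.now()
    for line in stdin:
        if line.strip():
            stdout.write(featurize(line, now) + "\n")


# Small in-memory demonstration with a made-up row.
demo_in = io.StringIO("42\t2013-06-01 00:00:00\t<code>x = 1</code> why\n")
demo_out = io.StringIO()
main(demo_in, demo_out)
```

In an actual streaming script the file would end with a plain call to `main()` so that it reads from Hive's input stream rather than the in-memory demo.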

Page 9: BIG Data Science:  A Path Forward

PARALLEL MODELS IN HADOOP

• Sample (don’t parallelize)
• Naturally parallel
    • SVD
    • Random Forests
• Estimators and Ensembles
    • Bootstrapping
    • Localizing
• Advanced Parallelization
    • Linear models with SGD
    • Neural networks

Page 10: BIG Data Science:  A Path Forward

STEP 3: STRUCTURED MODEL (BAGGING)

Single R model run many times over samples and aggregated

m <- C5.0(status ~ …)

Bagging a Model

Mapper 1:

Define n reducer keys

Send any record to reducer i with probability p

Reducer 1:

Key: Id of sample

Value: List of records

Perform analysis over records

Reducer 2:

Key: One (shared by all models)

Value: List of models

Aggregate the models (e.g. average)
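The mapper and reducer roles above can be simulated in a single process to see how the pieces fit. This is a sketch under stated assumptions, not the talk's implementation: a one-feature threshold model stands in for the C5.0 tree, records are hypothetical (x, label) pairs, and the final aggregation averages the models' votes.

```python
import random
from collections import defaultdict
from statistics import mean


def mapper(records, n_samples, p, seed=0):
    """Mapper 1: send each record to each sample key with probability p."""
    rng = random.Random(seed)
    for record in records:
        for sample_id in range(n_samples):
            if rng.random() < p:
                yield sample_id, record


def fit_stump(sample):
    """Fit a threshold halfway between the two class means (stand-in model)."""
    ones = [x for x, y in sample if y == 1]
    zeros = [x for x, y in sample if y == 0]
    if not ones or not zeros:
        return None  # a sample that is missing a class cannot be fitted
    m1, m0 = mean(ones), mean(zeros)
    return {"threshold": (m1 + m0) / 2, "high_is_one": m1 >= m0}


def reducer_fit(pairs):
    """Reducer 1: group records by sample id and fit one model per sample."""
    groups = defaultdict(list)
    for sample_id, record in pairs:
        groups[sample_id].append(record)
    models = [fit_stump(sample) for sample in groups.values()]
    return [m for m in models if m is not None]


def predict(models, x):
    """Reducer 2's aggregate in use: the average vote across all models."""
    votes = [int((x > m["threshold"]) == m["high_is_one"]) for m in models]
    return mean(votes)


# Made-up data: label is 1 when x is at least 50.
records = [(x, 1 if x >= 50 else 0) for x in range(100)]
models = reducer_fit(mapper(records, n_samples=10, p=0.3, seed=1))
```

Calling `predict(models, 90)` returns the share of sampled models voting for class 1. In a real job each reducer would fit its model on a different node, which is what makes the approach parallel.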

Page 11: BIG Data Science:  A Path Forward

WHERE ARE WE?

We’ve used Big Data to create a structured model that flags questions that won’t be closed.

But we haven’t used unstructured data.

Page 12: BIG Data Science:  A Path Forward

TEXT ANALYSIS

• Is “the big dog” really different from “dog is big”?

• How about “I like eggs but hate tofu” and “I hate eggs but like tofu?”

• Language has lexical and syntactical features

• Different techniques leverage these in different ways

Bag of Words: Structure doesn’t matter

n-gram: Structure matters (but not that much)

Feature Extraction: BACON! BACON! BACON!
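A short Python sketch makes the bag-of-words versus n-gram distinction concrete using the eggs and tofu sentences from this slide. The whitespace tokenizer is a deliberate simplification chosen for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def bag_of_words(text):
    """Token counts only; word order is discarded."""
    return Counter(tokenize(text))


def ngrams(text, n=2):
    """Counts of n consecutive tokens; local word order is kept."""
    tokens = tokenize(text)
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)   # identical word counts
same_bigrams = ngrams(a) == ngrams(b)           # bigrams tell them apart
shared = set(ngrams(a)) & set(ngrams(b))        # bigrams common to both
```

The two sentences have identical bags of words, so a bag-of-words model cannot separate them, while their bigram sets differ in everything except ("eggs", "but").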

Page 13: BIG Data Science:  A Path Forward

STEP 4: UNSTRUCTURED MODEL

Information Gain

Similar to Hadoop’s Word Count

Create counts for token/category pairs

Use counts to calculate Information Gain

MR Job 1:

Calculate information gain (IG) for all tokens.

MR Job 2:

Select tokens with largest IG.

Create structured data for record, tokens:

question #4 | 0 | 1 | 0 | 1 | 1

MR Job 3:

Build a classifier over the newly structured data (prior slides)
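The first two jobs above can be sketched in a few lines of Python on a toy corpus. This is an in-memory illustration, not the MapReduce jobs themselves: the documents, categories, and function names are made up, and information gain is computed as the category entropy minus the entropy that remains after splitting on whether a token is present.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy in bits of a collection of category counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts)


def information_gain(docs, token):
    """IG of a token over docs given as (set_of_tokens, category) pairs."""
    all_counts = Counter(cat for _, cat in docs)
    with_counts = Counter(cat for toks, cat in docs if token in toks)
    without_counts = all_counts - with_counts
    n = len(docs)
    n_with = sum(with_counts.values())
    remaining = (n_with / n) * entropy(with_counts.values())
    remaining += ((n - n_with) / n) * entropy(without_counts.values())
    return entropy(all_counts.values()) - remaining


def select_tokens(docs, k):
    """Job 2, part one: keep the k tokens with the largest IG."""
    vocab = sorted(set().union(*(toks for toks, _ in docs)))
    ranked = sorted(vocab, key=lambda t: -information_gain(docs, t))
    return ranked[:k]


def to_row(tokens, selected):
    """Job 2, part two: one 0/1 column per selected token."""
    return [int(t in tokens) for t in selected]


# Made-up corpus for illustration only.
docs = [
    ({"how", "python"}, "open"),
    ({"how", "java"}, "open"),
    ({"plz", "how"}, "closed"),
    ({"plz", "urgent"}, "closed"),
]
best = select_tokens(docs, 1)
row = to_row({"plz", "how"}, best)
```

In this toy corpus "plz" separates the two categories perfectly, so it carries a full bit of information gain, while a token that never appears carries none. The 0/1 rows are the structured data that a classifier such as the one on the prior slides can consume.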

Page 14: BIG Data Science:  A Path Forward

WHERE ARE WE?

We’ve created two models: one structured, one unstructured.

But they don’t work together.

Page 15: BIG Data Science:  A Path Forward

STEP 5: ENSEMBLE MODEL

Ensembling

Join many models together by using their output as input to an ensemble model.

Best when models perform differently. Exploit differences with nonlinearities, like interaction effects.

Mapper 1:

Load multiple models

Score each record with every model and output the scores

Reducer 1:

Key: Id of record

Value: List of model outputs

Join model outputs to make new records

MR Job 2:

Build a model over the output data as if it were raw data.
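The ensembling flow above can be simulated in one process as follows. Everything here is a hypothetical stand-in: the two base models are simple rules rather than the talk's trained models, the records and labels are made up, and the meta-model is a lookup table over combinations of base outputs, which is one simple way to capture interaction effects.

```python
from collections import Counter, defaultdict


def score_records(records, models):
    """Mapper 1: emit (record_id, (model_name, output)) for every model."""
    for rec_id, features in records.items():
        for name, model in models.items():
            yield rec_id, (name, model(features))


def join_outputs(pairs, model_names):
    """Reducer 1: collect model outputs per record id into one new row."""
    by_id = defaultdict(dict)
    for rec_id, (name, output) in pairs:
        by_id[rec_id][name] = output
    return {
        rec_id: tuple(outputs[name] for name in model_names)
        for rec_id, outputs in by_id.items()
    }


def fit_meta(rows, labels):
    """MR Job 2: learn the majority label for each combination of outputs."""
    table = defaultdict(Counter)
    for rec_id, row in rows.items():
        table[row][labels[rec_id]] += 1
    return {row: counts.most_common(1)[0][0] for row, counts in table.items()}


# Hypothetical base models: one over structured data, one over text.
models = {
    "structured": lambda f: int(f["reputation"] > 100),
    "text": lambda f: int("plz" not in f["body"]),
}
# Made-up records and labels for illustration only.
records = {
    1: {"reputation": 500, "body": "how do I sort"},
    2: {"reputation": 5, "body": "plz help"},
    3: {"reputation": 500, "body": "plz fix"},
    4: {"reputation": 5, "body": "why does this fail"},
}
labels = {1: "open", 2: "closed", 3: "closed", 4: "open"}

rows = join_outputs(score_records(records, models), ["structured", "text"])
meta = fit_meta(rows, labels)
```

Because the meta-model keys on the full combination of outputs, it can learn that the text model should override the structured model in some cells, which a plain average of the two could not express.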

Page 16: BIG Data Science:  A Path Forward

WHERE ARE WE?

We’ve created two models, one structured and one unstructured, and have ensembled them to create a single, powerful model and solve a practical business problem.

Page 17: BIG Data Science:  A Path Forward

HOW DID WE GET HERE?

This required simple infrastructure, a blend of analysis and scripting skills, and an understanding of BIG data science techniques, but not a team of PhDs or a billion dollars.

Page 18: BIG Data Science:  A Path Forward

Questions?

www.thinkbiganalytics.com

@danmallinger