big data science: a path forward

June 2013

BIG DATA SCIENCE: A PATH FORWARD

CONFIDENTIAL | 2

linkedin.com/in/danmallinger/

@danmallinger

www.thinkbiganalytics.com

Data Science Lead @ Think Big

Product/Brand Obsessive

Teacher

Occasional Engineer

TODAY

• High-level exploration of the skills, tools, and techniques needed to achieve early success and to help you build your data science practice.

Understand our organizational needs for data science

Infrastructure: Technological tools and platforms.

Talent: Staff hired and trained.

Capabilities: Data science techniques utilized.

INFRASTRUCTURE, TALENT, & CAPABILITIES

Infrastructure: Hadoop, NoSQL, Analytics, SQL/MPP, Real Time

Talent: Scripting, MapReduce, Data Exploration, Basic Modeling, PhD Math

Capabilities: Visualization, Clustering, Categorization, Continuous Models, Text Analysis

Boxed Solutions: Mahout & Platform

Toolkits: RHadoop, Scikit, etc.

You will need toolkits to solve unique problems

but smart techniques make that easier.

Boxed solutions are limited

but can be a good source of early velocity.

ANALYTICS TOOLS

Presenter:

Gigabytes from Stack Overflow

Questions from users

With metadata

Users have reputations

Questions open or closed

Audience:

Follow along

Thinking about your data

To learn in a familiar context and plan

DATA



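select count(1) as total                                 -- number of questions
     , sum(has_code) as with_code                        -- rows flagged has_code (assumes a 0/1 flag)
     , avg(body_count) as avg_body                       -- mean of body_count
     , stddev_samp(body_count) as sd_body                -- sample standard deviation
     , corr(reputation, owner_questions) as rep_corr     -- Pearson correlation
     , histogram_numeric(body_count, 10) as body_hist    -- histogram with 10 bins
from questions
;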
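For a quick sanity check on a small local sample, the same summaries can be sketched in plain Python. The rows below are made-up values for illustration, not Stack Overflow data, and the equal-width histogram is a simpler stand-in for Hive's histogram_numeric, which picks non-uniform bins.

```python
import math

# Hypothetical sample rows: (has_code, body_count, reputation, owner_questions).
rows = [
    (1, 120, 50, 3),
    (0, 40, 10, 1),
    (1, 200, 400, 12),
    (0, 80, 5, 2),
]

def mean(xs):
    return sum(xs) / len(xs)

def stddev_samp(xs):
    # Sample standard deviation (divides by n - 1).
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))

def corr(xs, ys):
    # Pearson correlation coefficient.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def histogram(xs, bins):
    # Equal-width bin counts; the maximum value falls in the last bin.
    lo, hi = min(xs), max(xs)
    width = (hi - lo) / bins or 1
    counts = [0] * bins
    for x in xs:
        counts[min(int((x - lo) / width), bins - 1)] += 1
    return counts

has_code, body_count, reputation, owner_questions = zip(*rows)
summary = {
    "total": len(rows),
    "with_code": sum(has_code),
    "avg_body": mean(body_count),
    "sd_body": stddev_samp(body_count),
    "rep_corr": corr(reputation, owner_questions),
    "body_hist": histogram(body_count, 4),
}
```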

STEP 1: EXPLORE




Patterns through Hive Patterns through Tableau

Summaries of unstructured data

Time-since metrics

select transform(…) using 'python …'
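A minimal sketch of what such a transform script might look like. Hive streams rows to the script as tab-separated lines. The input layout (question id, then body) and the feature definitions are assumptions made for illustration: has_code is taken to mean the body contains a <code> tag, and body_count a whitespace-delimited word count.

```python
def extract_features(body):
    """Return (has_code, body_count) for one question body.

    Assumed definitions: has_code is 1 when the body contains a <code> tag,
    body_count is the number of whitespace-delimited words.
    """
    has_code = 1 if "<code>" in body else 0
    body_count = len(body.split())
    return has_code, body_count

def transform(lines):
    # Each input line is "question_id<TAB>body"; emit one tab-separated row.
    for line in lines:
        question_id, body = line.rstrip("\n").split("\t", 1)
        has_code, body_count = extract_features(body)
        yield f"{question_id}\t{has_code}\t{body_count}"
```

Used as a streaming script, the driver would read sys.stdin and print each row, for example: for row in transform(sys.stdin): print(row).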

Clustering: Browsing cohorts

/bin/mahout canopy

STEP 2: FEATURE BUILDING




SQL Windowing Cross-Record Features
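A cross-record feature such as "time since this user's previous question" is the kind of thing a SQL window function (a lag over a per-user partition ordered by time) produces. A rough Python equivalent, assuming hypothetical (owner_id, timestamp) pairs:

```python
def time_since_previous(events):
    """events: iterable of (owner_id, timestamp_seconds) pairs.

    Returns (owner_id, timestamp, seconds_since_previous) tuples ordered by
    owner and time. An owner's first question gets None.
    """
    last_seen = {}
    out = []
    for owner, ts in sorted(events):
        prev = last_seen.get(owner)
        out.append((owner, ts, None if prev is None else ts - prev))
        last_seen[owner] = ts
    return out

# Hypothetical events for two users.
features = time_since_previous([("a", 100), ("b", 50), ("a", 160), ("a", 400)])
```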

• Sample (don’t parallelize)

• Naturally parallel

    • SVD

    • Random Forests

• Estimators and Ensembles

    • Bootstrapping

    • Localizing

• Advanced Parallelization

    • Linear models with SGD

    • Neural networks

PARALLEL MODELS IN HADOOP




Single R model, run many times over samples and aggregated

m <- C5.0(status ~ …)

STEP 3: STRUCTURED MODEL (BAGGING)




Mapper 1:

Define n reducer keys

Send any record to reducer i with probability p

Reducer 1:

Key: Id of sample

Value: List of records

Perform analysis over records

Reducer 2:

Key: One

Value: List of models

Aggregate the models (e.g. average)

Bagging a Model
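The mapper/reducer flow above can be simulated in a single Python process. This is a sketch under stated assumptions: the per-sample "model" is a least-squares slope through the origin, standing in for the R C5.0 model, and the function names and data are hypothetical.

```python
import random
from collections import defaultdict

def mapper(records, n_samples, p, rng):
    # Mapper 1: send each record to each of the n sample keys with probability p.
    for record in records:
        for key in range(n_samples):
            if rng.random() < p:
                yield key, record

def fit_sample(records):
    # Reducer 1: fit one model per sample (here, slope of y on x through 0).
    sxx = sum(x * x for x, _ in records)
    sxy = sum(x * y for x, y in records)
    return sxy / sxx

def aggregate(models):
    # Reducer 2: aggregate the models by averaging their parameters.
    return sum(models) / len(models)

def bag(records, n_samples=10, p=0.5, seed=0):
    rng = random.Random(seed)
    samples = defaultdict(list)  # stands in for the shuffle between phases
    for key, record in mapper(records, n_samples, p, rng):
        samples[key].append(record)
    return aggregate([fit_sample(recs) for recs in samples.values()])

# Hypothetical data: y is roughly 2x with a small deterministic wobble.
data = [(x, 2 * x + (1 if x % 2 == 0 else -1)) for x in range(1, 101)]
bagged_slope = bag(data)
```

Averaging works here because the model is a single number. For tree models, aggregation would more likely mean voting across the models' predictions.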

WHERE ARE WE?




We’ve created a structured model

to flag questions that won’t be closed

using Big Data.

But we haven’t used unstructured data.

TEXT ANALYSIS




• Is “the big dog” really different from “dog is big”?

• How about “I like eggs but hate tofu” and “I hate eggs but like tofu”?

• Language has lexical and syntactical features

• Different techniques leverage these in different ways

Bag of Words: Structure doesn’t matter

n-gram: Structure matters (but not that much)

Feature Extraction: BACON! BACON! BACON!
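The eggs and tofu example can be checked directly. In this small sketch (lowercased whitespace tokenization is an assumption), the two sentences have identical bags of words but different bigrams:

```python
from collections import Counter

def tokens(text):
    # Naive tokenizer: lowercase and split on whitespace.
    return text.lower().split()

def bag_of_words(text):
    # Token counts only; word order is discarded.
    return Counter(tokens(text))

def ngrams(text, n=2):
    # Counts of consecutive n-token windows; local order is kept.
    toks = tokens(text)
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

a = "I like eggs but hate tofu"
b = "I hate eggs but like tofu"

same_bag = bag_of_words(a) == bag_of_words(b)
same_bigrams = ngrams(a) == ngrams(b)
```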

STEP 4: UNSTRUCTURED MODEL




Similar to Hadoop’s Word Count

Create counts for token/category pairs

Use counts to calculate Information Gain

MR Job 1:

Calculate information gain (IG) for all tokens.

MR Job 2:

Select tokens with largest IG.

Create structured data for record, tokens:

question #4 | 0 | 1 | 0 | 1 | 1

MR Job 3:

Build a classifier over the newly structured data (prior slides)

Information Gain
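A small in-memory sketch of the first two jobs: compute information gain per token from token/category counts, keep the highest-scoring tokens, and encode each record as a 0/1 vector. The documents, categories, and function names are hypothetical; a real job would compute the same counts with MapReduce.

```python
import math
from collections import Counter

def entropy(counts):
    # Shannon entropy in bits of a collection of category counts.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def information_gain(docs, token):
    # docs: list of (set_of_tokens, category) pairs.
    all_cats = Counter(cat for _, cat in docs)
    with_tok = Counter(cat for toks, cat in docs if token in toks)
    without_tok = all_cats - with_tok
    n = len(docs)
    conditional = sum(
        sum(part.values()) / n * entropy(part.values())
        for part in (with_tok, without_tok)
        if part
    )
    return entropy(all_cats.values()) - conditional

def top_tokens(docs, k):
    # Job 2: select the k tokens with the largest IG (ties broken by name).
    vocab = set().union(*(toks for toks, _ in docs))
    return sorted(vocab, key=lambda t: (-information_gain(docs, t), t))[:k]

def encode(toks, selected):
    # One 0/1 column per selected token.
    return [int(t in toks) for t in selected]

# Hypothetical tokenized questions and their status.
docs = [
    ({"how", "python", "error"}, "open"),
    ({"python", "list"}, "open"),
    ({"homework", "plz"}, "closed"),
    ({"plz", "error"}, "closed"),
]
selected = top_tokens(docs, 2)
row = encode({"python", "error"}, selected)
```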

WHERE ARE WE?




We’ve created two models

One structured,

one unstructured.

But they don’t work together.

STEP 5: ENSEMBLE MODEL




Join many models together

By using their output

As input to ensemble model.

Best when models perform differently

Exploit differences with nonlinearities

Like interaction effects.

Ensembling

Mapper 1:

Load multiple models

Score the models per record and output

Reducer 1:

Key: Id of record

Value: List of model outputs

Join model outputs to make new records

MR Job 2:

Build a model over the output data as if it were raw data.
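The ensembling flow above, sketched in a single process. The two base models, the records, and the labels are hypothetical stand-ins. The second-stage model is a lookup table keyed on the combination of base outputs, chosen because it can pick up interactions between the models; any classifier could take its place.

```python
from collections import defaultdict

def score_models(records, models):
    # Mapper 1: score every model on every record, keyed by record id.
    for rec_id, features in records:
        for name, model in models.items():
            yield rec_id, (name, model(features))

def join_outputs(scored, model_names):
    # Reducer 1: join the model outputs for each record into one new record.
    grouped = defaultdict(dict)
    for rec_id, (name, output) in scored:
        grouped[rec_id][name] = output
    return {rid: tuple(outs[n] for n in model_names) for rid, outs in grouped.items()}

def fit_lookup(rows, labels):
    # MR Job 2: build a model over the joined outputs as if they were raw data.
    votes = defaultdict(list)
    for rec_id, row in rows.items():
        votes[row].append(labels[rec_id])
    table = {row: sum(v) / len(v) for row, v in votes.items()}
    return lambda row: table.get(row, 0.5)  # 0.5 for unseen combinations

# Hypothetical base models: one structured, one text-based.
models = {
    "structured": lambda f: int(f["reputation"] > 100),
    "text": lambda f: int("plz" not in f["body"]),
}
records = [
    (1, {"reputation": 500, "body": "how to sort"}),
    (2, {"reputation": 5, "body": "plz help"}),
    (3, {"reputation": 500, "body": "plz fix"}),
    (4, {"reputation": 5, "body": "why error"}),
]
labels = {1: 1, 2: 0, 3: 1, 4: 0}  # 1 = question stays open

rows = join_outputs(score_models(records, models), list(models))
ensemble = fit_lookup(rows, labels)
```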

We’ve created two models:

one structured,

one unstructured

and have ensembled them

to create a single, powerful model

and solve a practical business problem.

WHERE ARE WE?




This required simple infrastructure

a blend of analysis and scripting skills

an understanding of BIG data science techniques

but not a team of PhDs or a billion dollars.

HOW DID WE GET HERE?




Questions?

www.thinkbiganalytics.com

@danmallinger
