Machine Learning with Hadoop
Boston HUG, 2012
TRANSCRIPT
Why Now?
• But Moore’s law has applied for a long time
• Why is Hadoop/Big Data exploding now?
• Why not 10 years ago?
• Why not 20?
8/9/2013
Size Matters, but …
• If it were just availability of data then existing big companies would adopt big data technology first
They didn’t
Or Maybe Cost
• If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte
They didn’t
Backwards adoption
• Under almost any threshold argument startups would not adopt big data technology first
They did
Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
Analytics Scaling Laws
• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
[Figure: value vs. scale, 0–2000. A diminishing-returns value curve is annotated, in order of increasing scale: "Anybody with eyes", "Intern with a spreadsheet", "In-house analytics", "Industry-wide data consortium", "NSA, non-proliferation". Initially, linear cost scaling actually makes things worse; then a tipping point is reached and things change radically. The net value optimum has a sharp peak well before maximum effort.]
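The sharp-peak claim can be sketched numerically. A minimal Java sketch, assuming an illustrative diminishing-returns value curve (the 80-20 shape) and a linear cost with a low constant; the curve shapes and constants here are made up for illustration, not taken from the talk:

```java
public class NetValue {
    // Illustrative assumptions: value shows rapidly diminishing
    // returns (big gains for little initial effort), while cost
    // scales linearly with a low constant.
    static double value(double scale) { return 1.0 - Math.exp(-scale / 300.0); }
    static double cost(double scale)  { return 0.0005 * scale; }
    static double net(double scale)   { return value(scale) - cost(scale); }

    // Scan scales 0..2000 for the net-value optimum.
    static double bestScale() {
        double best = 0;
        for (int s = 1; s <= 2000; s++) {
            if (net(s) > net(best)) best = s;
        }
        return best;
    }
}
```

With these constants the optimum lands well before the maximum scale of 2000, matching the sharp peak in the figure.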
Prerequisites for Tipping
• To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
Agenda
• Mahout outline
– Recommendations
– Clustering
– Classification
• Supervised on-line learning
• Feature hashing
• Hybrid Parallel/Sequential Systems
• Real-time learning
Classification in Detail
• Naive Bayes Family
– Hadoop based training
• Decision Forests
– Hadoop based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
– Now with MORE topping!
How it Works
• We are given “features”
– Often binary values in a vector
• Algorithm learns weights
– The weighted sum of feature × weight values is the key
• Each weight is a single real value
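As a concrete sketch of "algorithm learns weights", here is a minimal on-line logistic regression in plain Java. This is hand-rolled for illustration, not Mahout's actual SGD API; feature vectors are represented as arrays of active (binary) slot indices, and the learning rate is an arbitrary choice:

```java
public class OnlineLogistic {
    final double[] weights;      // one real value per feature slot
    final double learningRate;

    OnlineLogistic(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    static double sigmoid(double x) { return 1.0 / (1.0 + Math.exp(-x)); }

    // Score is the sigmoid of the weighted sum over active features.
    double predict(int[] activeFeatures) {
        double sum = 0;
        for (int f : activeFeatures) sum += weights[f];
        return sigmoid(sum);
    }

    // One SGD step: nudge each active weight toward the target (0 or 1).
    void train(int[] activeFeatures, int target) {
        double error = target - predict(activeFeatures);
        for (int f : activeFeatures) weights[f] += learningRate * error;
    }
}
```

Training is sequential and fast: each example touches only the weights of its active features.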
Features
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India
hospital directory, I am pleased to propose a
confidential business deal for our mutual
benefit. I have in my possession, instruments
(documentation) to transfer the sum of
33,100,000.00 eur thirty-three million one hundred
thousand euros, only) into a foreign company's
bank account for our favor.
...
Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>
Hi Ted, was a pleasure talking to you last night
at the Hadoop User Group. I liked the idea of
going for lunch together. Are you available
tomorrow (Friday) at noon?
But …
• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots
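One standard way to get binary vectors with lots of slots is feature hashing: hash each token to a slot index and set that bit. The sketch below is a minimal hand-rolled version of the idea (Mahout ships dedicated feature encoders that elaborate on it, e.g. with multiple hash probes); the class name and slot count are illustrative:

```java
public class HashedEncoder {
    // Hash each token into one of `slots` positions and set that bit.
    // Collisions are tolerated: with enough slots they rarely matter.
    static boolean[] encode(String[] tokens, int slots) {
        boolean[] vector = new boolean[slots];
        for (String token : tokens) {
            int slot = Math.floorMod(token.hashCode(), slots);
            vector[slot] = true;
        }
        return vector;
    }
}
```

The payoff is that no dictionary has to be built or shipped around: the hash function is the dictionary.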
Training Data
[Diagram: raw data is joined, combined, and transformed into training examples with target values; parsing turns the examples into tokens; encoding turns tokens into vectors, which feed the training algorithm.]
Full Scale Training
[Diagram: input flows through map-reduce feature extraction and down-sampling, is joined with side-data, and then feeds sequential SGD learning; the hand-off from map-reduce is now via NFS.]
Hybrid Model Development
[Diagram: on the big-data cluster, logs are grouped by user into user sessions, transaction patterns are counted, and the result is merged with account info into training data; a shared filesystem hands the training data to legacy modeling (PROC LOGISTIC), which produces the model.]
Enter the Pig Vector
• Pig UDF’s for
– Vector encoding
– Model training
define EncodeVector
    org.apache.mahout.pig.encoders.EncodeVector(
        '10', 'x+y+1',
        'x:numeric, y:numeric, z:numeric');

vectors = foreach docs generate newsgroup, EncodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;
Real-time Developments
• Storm + Hadoop + MapR
– Real-time with Storm
– Long-term with Hadoop
– State checkpoints with MapR
• Add the Bayesian Bandit for on-line learning
Mobile Network Monitor
[Diagram: geo-dispersed ingest servers feed transaction data into batch aggregation and HBase, which serve a real-time dashboard and alerts plus a retro-analysis interface.]
A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and while it is in the air ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
A First Conclusion
• Probability as expressed by humans is subjective and depends on information and experience
A Second Conclusion
• A single number is a bad way to express uncertain knowledge
• A distribution of values might be better
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
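The four steps above amount to Thompson sampling with Beta posteriors. A minimal sketch in plain Java, using the order-statistic trick to sample a Beta with integer parameters; the class names and the uniform Beta(1, 1) prior are illustrative choices:

```java
import java.util.Arrays;
import java.util.Random;

public class BayesianBandit {
    final int[] wins, losses;   // observed outcomes per bandit
    final Random rand;

    BayesianBandit(int arms, long seed) {
        wins = new int[arms];
        losses = new int[arms];
        rand = new Random(seed);
    }

    // Sample from Beta(a, b) for integer a, b >= 1: the a-th smallest
    // of a+b-1 independent uniforms has exactly that distribution.
    double sampleBeta(int a, int b) {
        double[] u = new double[a + b - 1];
        for (int i = 0; i < u.length; i++) u[i] = rand.nextDouble();
        Arrays.sort(u);
        return u[a - 1];
    }

    // Sample a payoff rate from each bandit's posterior and play the
    // bandit whose sample is largest.
    int choose() {
        int best = 0;
        double bestSample = -1;
        for (int arm = 0; arm < wins.length; arm++) {
            double p = sampleBeta(wins[arm] + 1, losses[arm] + 1);
            if (p > bestSample) { bestSample = p; best = arm; }
        }
        return best;
    }

    void update(int arm, boolean win) {
        if (win) wins[arm]++; else losses[arm]++;
    }
}
```

Uncertain bandits get sampled optimistically often enough to be explored; well-understood bad bandits almost never win the sample, so play concentrates on the best arm.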
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models
Deployment with Storm/MapR
[Diagram: impression logs and click logs feed a targeting engine and a conversion detector; a model selector routes RPC traffic among several online models, each with its own training process; results flow via RPC to a conversion dashboard. All state managed transactionally in MapR file system.]
Service Architecture
[Diagram: the same deployment layered over MapR lockless storage services and MapR pluggable service management; Storm runs the online components (targeting engine, conversion detector, model selector, online models) and Hadoop runs the batch side (impression and click logs, training), with RPC connecting them to the conversion dashboard.]
Find Out More
• Me: [email protected]
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning