Boston HUG
DESCRIPTION
Talk about why and how machine learning works with Hadoop, with recent developments for real-time operation.
TRANSCRIPT
Machine Learning with Hadoop
Agenda
• Why Big Data? Why now?
• What can you do with big data?
• How does it work?
Slow Motion Explosion
Why Now?
• But Moore’s law has applied for a long time
• Why is Hadoop/Big Data exploding now?
• Why not 10 years ago?
• Why not 20?
2/15/12
Size Matters, but …
• If it were just availability of data, then existing big companies would have adopted big data technology first
They didn’t
Or Maybe Cost
• If it were just a net positive value, then finance companies should have adopted first because they have a higher opportunity value per byte
They didn’t
Backwards Adoption
• Under almost any threshold argument, startups would not adopt big data technology first
They did
Everywhere at Once?
• Something very strange is happening
– Big data is being applied at many different scales
– At many value scales
– By large companies and small
Why?
Analytics Scaling Laws
• Analytics scaling is all about the 80-20 rule
– Big gains for little initial effort
– Rapidly diminishing returns
• The key to net value is how costs scale
– Old school – exponential scaling
– Big data – linear scaling, low constant
• Cost/performance has changed radically
– IF you can use many commodity boxes
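As a toy illustration of the scaling argument above (every function and constant here is made up for the demo, not taken from the talk), a diminishing-returns value curve set against exponential vs. linear cost shows the net-value optimum moving far out when cost becomes linear with a low constant:

```python
import math

# Illustrative curves: value shows 80-20 diminishing returns, while cost
# scales either exponentially (old school) or linearly with a low
# constant (big data). All numbers are invented for the sketch.
def value(effort):
    return 100 * math.log1p(effort)

def old_cost(effort):
    return math.exp(0.5 * effort)     # exponential scaling

def bigdata_cost(effort):
    return 2.0 * effort               # linear scaling, low constant

def optimum(cost, efforts):
    # Effort level that maximizes net value = value - cost.
    return max(efforts, key=lambda e: value(e) - cost(e))

efforts = [e / 10 for e in range(1, 1000)]
print("old-school optimum effort:", optimum(old_cost, efforts))
print("big-data optimum effort:  ", optimum(bigdata_cost, efforts))
```

With these made-up curves the exponential cost pins the optimum at small effort, while linear cost pushes it roughly an order of magnitude further out.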
[Chart: net value vs. analytic effort, with effort levels annotated “We knew that” / “We should have known that” / “We didn’t know that!” / “You’re kidding, people do that?” and, in parallel, “Anybody with eyes” / “Intern with a spreadsheet” / “In-house analytics” / “Industry-wide data consortium” / “NSA, non-proliferation”]
• Net value optimum has a sharp peak well before maximum effort
• But scaling laws are changing both slope and shape
– More than just a little – they are changing a LOT!
• Initially, linear cost scaling actually makes things worse
• A tipping point is reached and things change radically …
Pre-requisites for Tipping
• To reach the tipping point, algorithms must scale out horizontally
– On commodity hardware
– That can and will fail
• Data practice must change
– Denormalized is the new black
– Flexible data dictionaries are the rule
– Structured data becomes rare
So that is why, and why now
What can you do with it? And how?
Agenda
• Mahout outline
– Recommendations
– Clustering
– Classification
• Supervised on-line learning
• Feature hashing
• Hybrid Parallel/Sequential Systems
• Real-time learning
Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– Fast on-line (sequential) training
– Now with MORE topping!
How it Works
• We are given “features”
– Often binary values in a vector
• The algorithm learns weights
– The weighted sum of feature * weight is the key
• Each weight is a single real value
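A minimal sketch of the scoring step described above, using made-up features and weights (a real classifier such as Mahout’s SGD learns these weights from training data):

```python
import math

# Invented weights for the sketch: positive pushes toward "spam",
# negative toward "not spam".
weights = {"viagra": 2.1, "million": 1.7, "lunch": -1.3, "hadoop": -2.0}

def score(features):
    # Weighted sum: each present binary feature contributes its weight.
    return sum(weights.get(f, 0.0) for f in features)

def probability_spam(features):
    # A logistic link turns the weighted sum into a probability.
    return 1.0 / (1.0 + math.exp(-score(features)))

print(probability_spam({"million", "viagra"}))  # high
print(probability_spam({"lunch", "hadoop"}))    # low
```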
An Example
Features
From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence
Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 EUR (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....
Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>
Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
But …
• Text and words aren’t suitable features
• We need a numerical vector
• So we use binary vectors with lots of slots
Feature Encoding
Hashed Encoding
Feature Collisions
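A sketch of what hashed encoding might look like (the slot count, probe scheme, and hash choice here are illustrative; Mahout’s encoders use Murmur hashing and far larger vectors, and multiple probes per feature soften the effect of collisions):

```python
import hashlib

NUM_SLOTS = 16  # illustrative; real systems use many more slots

def slot(word, probe, num_slots=NUM_SLOTS):
    # Deterministic hash of (probe, word) into a slot index.
    digest = hashlib.md5(f"{probe}:{word}".encode()).hexdigest()
    return int(digest, 16) % num_slots

def encode(words, num_slots=NUM_SLOTS, probes=2):
    # Binary vector with `probes` hashed locations set per feature.
    # Collisions simply land two features in the same slot.
    vector = [0] * num_slots
    for word in words:
        for p in range(probes):
            vector[slot(word, p, num_slots)] = 1
    return vector

print(encode(["hadoop", "user", "group"]))
```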
Training Data
Full Scale Training
[Diagram: input and side-data pass through feature extraction and down-sampling, then a data join, then sequential SGD learning; the map-reduce stages now run via NFS]
Hybrid Model Development
[Diagram: on the big-data cluster, logs are grouped by user into user sessions and transaction patterns are counted to produce training data; across a shared filesystem, the legacy modeling side merges that training data with account info and runs PROC LOGISTIC to produce the model]
Enter the Pig Vector
• Pig UDFs for
– Vector encoding
– Model training

define EncodeVector org.apache.mahout.pig.encoders.EncodeVector(
    '10', 'x+y+1', 'x:numeric, y:numeric, z:numeric');
vectors = foreach docs generate newsgroup, encodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;
Real-time Developments
• Storm + Hadoop + MapR
– Real-time with Storm
– Long-term with Hadoop
– State checkpoints with MapR
• Add the Bayesian Bandit for on-line learning
Aggregate Splicing
• Hadoop handles the past
• Storm handles the present
Mobile Network Monitor
[Diagram: geo-dispersed ingest servers feed transaction data through batch aggregation into HBase, which serves a real-time dashboard and alerts plus a retro-analysis interface]
A Quick Diversion
• You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
• I flip the coin and, while it is in the air, ask again
• I catch the coin and ask again
• I look at the coin (and you don’t) and ask again
• Why does the answer change?
– And did it ever have a single value?
A First Conclusion
• Probability as expressed by humans is subjective and depends on information and experience
A Second Conclusion
• A single number is a bad way to express uncertain knowledge
• A distribution of values might be better
[Chart: three distributions over the heads probability – “I dunno”, “5 and 5”, and “2 and 10”]
Bayesian Bandit
• Compute distributions based on data
• Sample p1 and p2 from these distributions
• Put a coin in bandit 1 if p1 > p2
• Else, put the coin in bandit 2
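The four steps above are Thompson sampling with beta posteriors; a small simulation with two bandits (the payoff rates are invented for the demo) shows the loop steering coins toward the better bandit:

```python
import random

# True payoff rates are unknown to the algorithm; invented for the demo.
true_rates = [0.12, 0.20]
wins = [0, 0]
losses = [0, 0]

random.seed(42)
for _ in range(10000):
    # Compute distributions based on data: Beta(1 + wins, 1 + losses)
    # is the posterior for each bandit's payoff probability.
    # Sample p1 and p2 from these distributions ...
    samples = [random.betavariate(1 + wins[i], 1 + losses[i])
               for i in (0, 1)]
    # ... and put the coin in bandit 1 if p1 > p2, else in bandit 2.
    choice = 0 if samples[0] > samples[1] else 1
    if random.random() < true_rates[choice]:
        wins[choice] += 1
    else:
        losses[choice] += 1

plays = [wins[i] + losses[i] for i in (0, 1)]
print("plays per bandit:", plays)  # most plays go to the better bandit
```

Early on the posteriors are wide, so both bandits get explored; as evidence accumulates the worse bandit’s samples rarely win, which is how sampling unifies exploration and exploitation.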
The Basic Idea
• We can encode a distribution by sampling
• Sampling allows unification of exploration and exploitation
• Can be extended to more general response models
Deployment with Storm/MapR
All state managed transactionally in MapR file system
Service Architecture
[Diagram: Storm and Hadoop running on MapR Lockless Storage Services and MapR Pluggable Service Management]
Find Out More
• Me: [email protected], [email protected], [email protected]
• MapR: http://www.mapr.com
• Mahout: http://mahout.apache.org
• Code: https://github.com/tdunning