data mining-2011-09
DESCRIPTION
Talk given on September 20 to the Bay Area data mining group. The basic idea is that integrating map-reduce programs with the real world is easier than ever.
TRANSCRIPT
Data-mining, Hadoop and the Single Node
Map-Reduce
[Diagram: the map-reduce data flow: input, map, shuffle, reduce, output]
MapR's Streaming Performance
[Bar chart: read and write throughput in MB per sec (0-2250) for MapR vs. Hadoop on two hardware configurations: 11 x 7200rpm SATA and 11 x 15Krpm SAS. Higher is better.]
Tests: i. 16 streams x 120GB ii. 2000 streams x 1GB
Terasort on MapR
[Bar charts: elapsed time in minutes for a 1.0 TB sort (0-60) and a 3.5 TB sort (0-300), MapR vs. Hadoop. Lower is better.]
10+1 nodes: 8 core, 24GB DRAM, 11 x 1TB SATA 7200 rpm
Data Flow Expected Volumes
[Diagram: expected data volumes into and out of a single node]
Network: 6 x 1Gb/s = 600 MB/s
Storage: 12 x 100MB/s = 900 MB/s
MUCH faster for some operations
[Chart: file create rate vs. number of files (millions), on the same 10 nodes]
Universal export to self
[Diagram: every cluster node runs its own NFS server and exports the cluster to itself; tasks on each node do their I/O through that local NFS mount. Nodes are identical.]
Sharded text Indexing
[Diagram: input documents go to map tasks; reducers build each shard's index on local disk, then copy it to clustered index storage; search engines copy the index back to local disk before loading it]
• Assign documents to shards
• Index text to local disk and then copy index to distributed file store
• Copy to local disk typically required before index can be loaded
Conventional data flow
[Diagram: the same flow: map, reduce to local disk, copy to clustered index storage, download to the search engine's local disk]
• Failure of a reducer causes garbage to accumulate on the local disk
• Failure of a search engine requires another download of the index from clustered storage
Simplified NFS data flows
[Diagram: input documents go to map tasks; reducers write the index into their task work directories via NFS; search engines read the mirrored index directly from clustered index storage]
• Index to task work directory via NFS
• Failure of a reducer is cleaned up by the map-reduce framework
• Search engine reads mirrored index directly
K-means, the movie
[Diagram: input points and current centroids feed "assign to nearest centroid"; assignments are aggregated into new centroids, which loop back for the next iteration]

But …

Parallel Stochastic Gradient Descent
[Diagram: input feeds several "train sub-model" tasks; the sub-models are averaged into a single model]

Variational Dirichlet Assignment
[Diagram: input feeds "gather sufficient statistics"; the statistics are used to update the model]
Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to HDFS
[Annotations: centroids are read from HDFS to local disk by the distributed cache; the mapper reads them from local disk via the distributed cache; output is written by map-reduce]
Old tricks, new dogs
• Mapper
– Assign point to cluster
– Emit cluster id, (1, point)
• Combiner and reducer
– Sum counts, weighted sum of points
– Emit cluster id, (n, sum/n)
• Output to MapR FS (instead of HDFS)
[Annotations: centroids are read via NFS; output is written by map-reduce]
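A minimal plain-Java sketch of the mapper and combiner logic from these two slides, with the Hadoop plumbing omitted; all class and method names here are illustrative, not Mahout's actual k-means implementation:

    import java.util.Arrays;

    public class KMeansSketch {

      // Mapper logic: assign a point to its nearest centroid, then
      // emit (cluster id, (1, point)).
      static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < centroids.length; i++) {
          double d = 0;
          for (int j = 0; j < point.length; j++) {
            double diff = point[j] - centroids[i][j];
            d += diff * diff;
          }
          if (d < bestDist) { bestDist = d; best = i; }
        }
        return best;
      }

      // Combiner/reducer logic: merge (n1, mean1) with (n2, mean2)
      // into (n1 + n2, weighted mean): the "emit (n, sum/n)" step.
      static double[] merge(int n1, double[] mean1, int n2, double[] mean2) {
        double[] out = new double[mean1.length];
        for (int j = 0; j < out.length; j++) {
          out[j] = (n1 * mean1[j] + n2 * mean2[j]) / (n1 + n2);
        }
        return out;
      }

      public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {5, 5}};
        System.out.println(nearest(new double[] {4.5, 5.2}, centroids)); // 1
        System.out.println(Arrays.toString(
            merge(1, new double[] {4, 4}, 3, new double[] {6, 6})));     // [5.5, 5.5]
      }
    }

Because the weighted mean is associative, the same merge works for both the combiner and the reducer.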
Poor man’s Pregel
• Mapper
• Lines in bold can use conventional I/O via NFS
while not done:
    read and accumulate input models
    for each input:
        accumulate model
    write model
    synchronize
    reset input format
emit summary
Mahout
• Scalable Data Mining for Everybody
What is Mahout?
• Recommendations (people who x this also x that)
• Clustering (segment data into groups of similar things)
• Classification (learn decision making from examples)
• Stuff (LDA, SVD, frequent item-set, math)
Classification in Detail
• Naive Bayes Family
– Hadoop-based training
• Decision Forests
– Hadoop-based training
• Logistic Regression (aka SGD)
– fast on-line (sequential) training
So What?
• Online training has low overhead for small and moderate size data-sets
[Chart annotation: "big" starts here]
An Example

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....

And Another

Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?
Mahout’s SGD
• Learns on-line, per example
– O(1) memory
– O(1) time per training example
• Sequential implementation
– fast, but not parallel
Special Features
• Hashed feature encoding
• Per-term annealing
– learn the boring stuff once
• Auto-magical learning knob turning
– learns correct learning rate, learns correct learning rate for learning learning rate, ...
Feature Encoding
Hashed Encoding
Feature Collisions
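In code, hashed encoding looks roughly like the sketch below, using the encoder classes from Mahout's org.apache.mahout.vectorizer.encoders package (the dimension and sample text are made up, and exact APIs may vary by Mahout version). Each term is hashed directly into a fixed-size vector, so no dictionary is needed, and the occasional collisions pictured on the previous slide are simply tolerated by the learner:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    // Sketch: hash words into a fixed-size feature vector. Two words
    // may collide in a slot; with enough dimensions this rarely hurts.
    public class HashedEncodingSketch {
      public static void main(String[] args) {
        Vector features = new RandomAccessSparseVector(1000); // fixed size

        StaticWordValueEncoder words = new StaticWordValueEncoder("words");
        ConstantValueEncoder intercept = new ConstantValueEncoder("intercept");

        for (String w : "propose a confidential business deal".split(" ")) {
          words.addToVector(w, features);    // hashes w into the vector
        }
        intercept.addToVector("1", features); // bias term

        System.out.println(features);
      }
    }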
Learning Rate Annealing
[Chart: learning rate vs. # training examples seen, decaying as training proceeds]
Per-term Annealing
[Chart: learning rate vs. # training examples seen, with a common feature's rate decaying quickly while a rare feature's rate stays high]
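In formulas, the two charts contrast schedules of roughly this shape (illustrative only; eta_0, tau, and alpha are free parameters here, and Mahout's actual schedule differs in detail):

    \eta(t) = \frac{\eta_0}{(1 + t/\tau)^{\alpha}}
    \qquad \text{vs.} \qquad
    \eta_k(t) = \frac{\eta_0}{(1 + n_k(t)/\tau)^{\alpha}}

where t counts all training examples seen, and n_k(t) counts only those containing feature k, so rare features keep learning long after common ones have cooled off.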
General Structure
• OnlineLogisticRegression
– Traditional logistic regression
– Stochastic Gradient Descent
– Per-term annealing
– Too fast (for the disk + encoder)
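A minimal training loop against OnlineLogisticRegression, as a sketch (the constructor, chained setters, and L1 prior are from Mahout's SGD package; the tiny inline data set is invented):

    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.Vector;

    // Sketch: train a 2-class logistic regression one example at a
    // time. Each call to train() is O(1) in time and memory.
    public class OlrSketch {
      public static void main(String[] args) {
        OnlineLogisticRegression learner =
            new OnlineLogisticRegression(2, 3, new L1())
                .learningRate(1)
                .lambda(1.0e-4);

        double[][] x = {{1, 0, 1}, {1, 1, 0}, {1, 0, 0}, {1, 1, 1}};
        int[] y = {1, 0, 0, 1};              // toy labels: x[2] drives y

        for (int pass = 0; pass < 100; pass++) {
          for (int i = 0; i < x.length; i++) {
            learner.train(y[i], new DenseVector(x[i]));
          }
        }
        Vector probe = new DenseVector(new double[] {1, 0, 1});
        System.out.println(learner.classifyScalar(probe)); // P(class 1)
      }
    }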
Next Level
• CrossFoldLearner
– contains multiple primitive learners
– online cross validation
– 5x more work
And again
• AdaptiveLogisticRegression
– 20 x CrossFoldLearner
– evolves good learning and regularization rates
– 100 x more work than basic learner
– still faster than disk + encoding
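As a sketch, usage looks about like this (the data is synthetic, and the getBest() chain follows Mahout's API but may vary by version):

    import java.util.Random;
    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.DenseVector;

    // Sketch: AdaptiveLogisticRegression runs a pool of
    // CrossFoldLearners and evolves learning and regularization
    // rates while it trains. This data is trivially separable.
    public class AlrSketch {
      public static void main(String[] args) {
        AdaptiveLogisticRegression learner =
            new AdaptiveLogisticRegression(2, 2, new L1());
        Random rand = new Random(42);
        for (int i = 0; i < 10000; i++) {
          double x = rand.nextGaussian();
          int label = x > 0 ? 1 : 0;
          learner.train(label, new DenseVector(new double[] {1, x}));
        }
        learner.close();

        CrossFoldLearner best = learner.getBest().getPayload().getLearner();
        System.out.println(best.classifyScalar(new DenseVector(new double[] {1, 2})));
      }
    }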
A comparison
• Traditional view
– 400 x (read + OLR)
• Revised Mahout view
– 1 x (read + mu x 100 x OLR) x eta
– mu = efficiency from killing losers early
– eta = efficiency from stopping early
Click modeling architecture
[Diagram: input and side-data flow through feature extraction and down-sampling (map-reduce) and a data join (map-reduce) into sequential SGD learning; one path is labeled "now via NFS"]
Click modeling architecture
[Diagram: the same architecture, but the data join fans out to several sequential SGD learners running side by side; map-reduce cooperates with NFS]
Deployment
• Training
– ModelSerializer.writeBinary(..., model)
• Deployment
– m = ModelSerializer.readBinary(...)
– r = m.classifyScalar(featureVector)
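Fleshed out slightly (the file path is a placeholder, and ModelSerializer's exact signatures vary across Mahout versions, so treat this as illustrative):

    import java.io.FileInputStream;
    import java.io.IOException;
    import org.apache.mahout.classifier.sgd.ModelSerializer;
    import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    // Sketch: load a trained model once at startup, then score
    // incoming feature vectors with classifyScalar().
    public class ServeModel {
      public static void main(String[] args) throws IOException {
        OnlineLogisticRegression m = ModelSerializer.readBinary(
            new FileInputStream("model.bin"),  // placeholder path
            OnlineLogisticRegression.class);

        Vector features = new RandomAccessSparseVector(1000); // must match training dimension
        // ... encode one incoming example into `features` ...
        double r = m.classifyScalar(features);
        System.out.println("score = " + r);
      }
    }

The model is read once; each classifyScalar() call after that is just a dot product and a logistic, which is what makes a simple server farm enough.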
The Upshot
• One machine can go fast
– SITM trains on 2 billion examples in 3 hours
• Deployability pays off big
– simple sample server farm