Running with Elephants: Predictive Analytics with HDInsight

Post on 06-May-2015


DESCRIPTION

Amazon and Twitter do it; Wal-Mart and Facebook too. What about you? Big Data predictive analytics is pervasive, and with HDInsight it has never been more approachable. In this session you become part of the demo as your clickstream data at our fictional e-commerce website drives user and product recommendations using the built-in Mahout (Taste) algorithms. In this action-packed session, real-world, practical solutions for moving data into and out of HDFS (with Sqoop), using MongoDB or HBase as a source/destination, and of course handling Mahout processing in distributed mode will all be covered.

TRANSCRIPT

Running with Elephants

Predictive Analytics with Mahout & HDInsight

You are the demo….

SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net

Create an Account…

Rate some beers…

Don’t worry, your info will only be sold to the HIGHEST bidder

Agenda

• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations

Making the Business Case

Objective: Increase Revenue
• Increase # of Orders
• Increase Items per Order (Cross-Sell; reduce website navigational inefficiency)
• Increase Average Item Price (Up-Sell)

Business Case Example

• Up-Sell → Increase Unit Price
• Cross-Sell → Increase Unit Qty
• Both → Increased Revenue

Recommendation Engines

• Take observation data and use data mining/machine learning algorithms to predict outcomes

• Assumptions:
  • People with similar interests have common preferences
  • A sufficiently large number of preferences is available

Recommendation Options

• Collaborative Filtering (Mahout)
  • User-Based
  • Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
  • Association
  • Clustering

Technology

Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop

HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (seamlessly scale in the cloud)
• HortonWorks Data Platform/HDP (on-premise solution)

Generating Recommendations

1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
   • Build User/Item matrix
   • Calculate User Similarity
   • Form Neighborhoods
   • Generate Recommendations

Sources of Data

• Explicit (provided deliberately by the user)
  • Ratings
  • Feedback
  • Demographics
  • Psychographics (Personality/Lifestyle/Attitude)
  • Ephemeral Need (need of the moment)

• Implicit (inferred from behavior)
  • Purchase History
  • Click/Browse History

• Product/Item
  • Taxonomy
  • Attributes
  • Descriptions

Our focus for today

Data Preparation

• Clean-Up:
  • Remove outliers (Z-score)
  • Remove frequent buyers (skew)
  • Normalize data (unity-based)

• Format data into a CSV input file: <User ID>,<Item ID>,<Rating>
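The preparation steps above can be sketched in plain Java. This is an illustrative helper, not code from the deck; the class and method names are hypothetical:

```java
import java.util.Arrays;

// Sketch: unity-based (min-max) normalization of raw ratings, then
// formatting rows as the <UserID>,<ItemID>,<Rating> CSV lines the
// Mahout recommender jobs described here expect as input.
public class PrepareRatings {

    // Scale values into [0, 1]: (x - min) / (max - min).
    public static double[] unityNormalize(double[] raw) {
        double min = Arrays.stream(raw).min().getAsDouble();
        double max = Arrays.stream(raw).max().getAsDouble();
        double range = max - min;
        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = range == 0 ? 0.0 : (raw[i] - min) / range;
        }
        return out;
    }

    // One input line in the format the slide describes.
    public static String csvLine(long userId, long itemId, double rating) {
        return userId + "," + itemId + "," + rating;
    }

    public static void main(String[] args) {
        double[] norm = unityNormalize(new double[]{1, 3, 5});
        System.out.println(Arrays.toString(norm));   // [0.0, 0.5, 1.0]
        System.out.println(csvLine(7, 42, norm[1])); // 7,42,0.5
    }
}
```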

How it Works?

• Build a User/Item Matrix

[Slide shows a User/Item matrix: rows are users 1..N, columns are items 1..n; a cell contains 1 where that user has rated that item]

Neighborhood Formation

[Slide shows a diagram of users U1–U7 clustered into neighborhoods of similar users]

Neighborhood Formation

• Requires some experimentation
• Similarity Metrics:
  • Pearson Correlation
  • Euclidean Distance
  • Spearman Correlation
  • Cosine
  • Tanimoto Coefficient
  • Log-Likelihood
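Two of the metrics listed above can be sketched in a few lines of plain Java. This is an illustrative helper, not Mahout's API; it computes similarity over two users' ratings of the same items:

```java
// Sketch: Pearson correlation and cosine similarity between two users'
// rating vectors (assumed to cover the same co-rated items, in order).
public class Similarity {

    // Cosine similarity: dot(a,b) / (|a| * |b|).
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Pearson correlation: covariance / (stddev_a * stddev_b).
    public static double pearson(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va  += (a[i] - ma) * (a[i] - ma);
            vb  += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }
}
```

Perfectly correlated rating vectors score 1.0 under both metrics; the choice between them (and the others listed) is the experimentation the slide mentions.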

How it Works?

• Find users similar to U5

• Use a similarity metric to find the k nearest neighbors (kNN)

• U1 & U7 are identified as most similar to U5

[Slide shows the User/Item matrix with U5's row highlighted and U1 & U7 marked as the most similar users]

How it Works?

• Generate Recommendations:
  • Find items U5 has not reviewed (I1 and I6)
  • Predict each rating by taking a weighted sum

[Slide shows the User/Item matrix with predicted ratings (e.g. 0.5 and 0.7) filled in for the target user's unrated items]
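The weighted-sum prediction can be shown in a few lines. This is an illustrative sketch, not Mahout code; the similarity values below are made up for the example:

```java
// Sketch: predict a rating for an unrated item as the neighbors' ratings
// weighted by each neighbor's similarity to the target user.
public class WeightedPrediction {

    // neighborSims[i]    = similarity of neighbor i to the target user
    // neighborRatings[i] = that neighbor's rating of the item
    public static double predict(double[] neighborSims, double[] neighborRatings) {
        double weighted = 0, simSum = 0;
        for (int i = 0; i < neighborSims.length; i++) {
            weighted += neighborSims[i] * neighborRatings[i];
            simSum   += Math.abs(neighborSims[i]);
        }
        return weighted / simSum;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: U1 (sim 0.9) rated the item 1.0;
        // U7 (sim 0.6) rated it 0.5. Prediction is near 0.8.
        System.out.println(predict(new double[]{0.9, 0.6},
                                   new double[]{1.0, 0.5}));
    }
}
```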

Pseudo-Code Implementation

for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    calculate a running average of v's preference for i, weighted by s
return the top-ranked (by weighted average) items i

Optimization: restrict the inner loop to u's neighborhood
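One way to realize the pseudo-code above in plain Java (an illustrative sketch, not Mahout's implementation; all names are hypothetical). The inner loop is already restricted to a precomputed neighborhood, as the slide suggests:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch: score every item the target user u has not rated with a
// similarity-weighted average of the neighborhood's ratings.
public class SimpleRecommender {

    // ratings.get(user).get(item) = rating
    // sims.get(v) = similarity of neighbor v to the target user u
    public static Map<Integer, Double> recommend(
            Map<Integer, Map<Integer, Double>> ratings,
            int u, Map<Integer, Double> sims) {
        Map<Integer, Double> num = new HashMap<>();
        Map<Integer, Double> den = new HashMap<>();
        Map<Integer, Double> seen = ratings.getOrDefault(u, Map.of());
        for (Map.Entry<Integer, Double> nb : sims.entrySet()) {
            double s = nb.getValue();
            for (Map.Entry<Integer, Double> r :
                    ratings.getOrDefault(nb.getKey(), Map.of()).entrySet()) {
                if (seen.containsKey(r.getKey())) continue; // u already rated it
                num.merge(r.getKey(), s * r.getValue(), Double::sum);
                den.merge(r.getKey(), Math.abs(s), Double::sum);
            }
        }
        Map<Integer, Double> scores = new TreeMap<>();
        num.forEach((item, n) -> scores.put(item, n / den.get(item)));
        return scores; // item -> predicted rating; caller takes the top ranked
    }
}
```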

Mahout Implementation

• Real-Time Recommendations
  • Write Java code and host in a JVM instance
  • Limited scalability
  • Requires training data
  • Integration typically handled through web services

• Batch-Based Recommendations
  • Uses MapReduce jobs on Hadoop
  • Offline and slow, yet scalable
  • Out-of-the-box recommender jobs

Mahout MapReduce Implementation
1 – Generate List of ItemIDs
2 – Create Preference Vector
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
   • Compute Weights
   • Compute Similarities
   • Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (Steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend

Integrating Mahout

• Real-Time
  • Requires Java coding
  • Web service
  • Process:
    • Load training data (memory pressure)
    • Generate recommendations

• Batch
  • ETL from source
  • Generate input file (UserID, ItemID, Rating)
  • Load to HDFS
  • Process with Mahout/Hadoop
  • ETL output from HDFS/Hadoop
  • Output format: UserID [ItemID:Estimated Rating, ...]
    e.g. 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
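The ETL step that pulls results back out has to parse lines in the format shown above. A minimal parsing sketch (illustrative names, not a Mahout API; it assumes the user ID precedes the bracketed list on each line):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: parse one batch-output line of the form
// "<UserID>\t[<ItemID>:<EstimatedRating>,...]" into usable values.
public class RecommendationParser {

    public static long userId(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('[')).trim());
    }

    public static Map<Long, Double> parse(String line) {
        String list = line.substring(line.indexOf('[') + 1, line.lastIndexOf(']'));
        Map<Long, Double> recs = new LinkedHashMap<>(); // keep ranking order
        for (String pair : list.split(",")) {
            String[] kv = pair.split(":");
            recs.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1]));
        }
        return recs;
    }

    public static void main(String[] args) {
        String line = "7\t[1:4.5,2:4.0]";
        System.out.println(userId(line)); // 7
        System.out.println(parse(line));  // {1=4.5, 2=4.0}
    }
}
```

The parsed map can then be bulk-loaded into Hive, MongoDB, HBase, or a source system, as the next slide discusses.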

Handling Recommendations

Storing Recommendations:
• Hive
  • Data warehouse system for Hadoop
  • Hive ODBC Driver
• MongoDB
  • Leading NoSQL database
  • JSON-like storage with flexible schema
  • C#/.NET MongoDB driver
• HBase
  • Open-source, distributed, column-oriented database modeled after Google’s BigTable
  • Use Pig/MapReduce to process output files and load the HBase table
  • Java API for easy reading
• Source System (SQL Server, etc.)

Evaluating the Recommendations

• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into test & training sets:
  • Average Difference
  • Root-Mean-Square

• How it works:

                      I1    I2    I3
Estimated Review      3.5   4.0   1.5
Actual Review         4.0   2.0   2.0
Absolute Difference   0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0

Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
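The worked example above can be reproduced directly; a small self-contained sketch (illustrative names, no Mahout dependency):

```java
// Sketch: the two evaluation metrics from the slide, applied to the
// worked example: estimates {3.5, 4.0, 1.5} vs. actuals {4.0, 2.0, 2.0}.
public class Evaluate {

    public static double avgAbsDiff(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) sum += Math.abs(est[i] - act[i]);
        return sum / est.length;
    }

    public static double rmse(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) {
            double d = est[i] - act[i];
            sum += d * d;
        }
        return Math.sqrt(sum / est.length);
    }

    public static void main(String[] args) {
        double[] est = {3.5, 4.0, 1.5};
        double[] act = {4.0, 2.0, 2.0};
        System.out.println(avgAbsDiff(est, act)); // 1.0
        System.out.println(rmse(est, act));       // sqrt(1.5), about 1.22
    }
}
```

RMSE penalizes large misses (the 2.0 error on I2) more heavily than the average difference does, which is why the two scores differ.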

Evaluating the Recommendations

    DataModel model = new FileDataModel(new File("ratings.csv"));
    RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

    RecommenderBuilder bldr = new RecommenderBuilder() {
        @Override
        public Recommender buildRecommender(DataModel model) throws TasteException {
            // Use the Pearson correlation to calculate similarity
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Generate neighborhoods of approx. 10 users
            UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
            return new GenericUserBasedRecommender(model, hood, similarity);
        }
    };

    // Use 70% of the data to train the model and 30% to test
    double score = eval.evaluate(bldr, model, 0.7, 1.0);

Challenges

1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality

Context Challenges

January: 20 degrees & snowing... what should we recommend?

Other Challenges

• Cold Start
  • Occurs when either a new item or a new user is introduced
  • Can be handled by:
    • Substituting an average item/user profile
    • Using another recommendation technique (Content-Based)

• Data Sparsity
  • Too many items/users make finding intersections difficult

• Popularity Bias
  • Skewed towards popular items; people with “unique” taste are left out

• Curse of Dimensionality
  • More items/users lead to more noise and greater error

Resources

Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman

Hadoop: The Definitive Guide, by Tom White
