Running with Elephants: Predictive Analytics with HDInsight

Post on 06-May-2015


DESCRIPTION

Amazon and Twitter do it; Wal-Mart and Facebook too. What about you? Big Data predictive analytics is pervasive, and with HDInsight it has never been more approachable. In this session you become part of the demo as your clickstream data at our fictional e-commerce website drives user and product recommendations using the built-in Mahout (Taste) algorithms. In this action-packed session, real-world, practical solutions for moving data into and out of HDFS (with Sqoop), using MongoDB or HBase as a source/destination, and of course handling Mahout processing in distributed mode will all be covered.

TRANSCRIPT

Running with Elephants

Predictive Analytics with Mahout & HDInsight

You are the demo….

SQL Brewhaus
http://sqlbrewhaus.azurewebsites.net

Create an Account…

Rate some beers…

Don’t worry, your info will only be sold to the HIGHEST bidder

Agenda

• Business Case for Recommendations
• How a Recommendation Engine Works
• Recommendation Implementation & Integration
• Evaluating Recommendations
• Challenges of Implementing Recommendations

Making the Business Case

Objective: Increase Revenue
• Increase # of Orders
• Increase Items per Order (Cross-Sell; reduce website navigational inefficiency)
• Increase Average Item Price (Up-Sell)

Business Case Example

• Up-Sell → Increase Unit Price
• Cross-Sell → Increase Unit Qty
• Both → Increased Revenue

Recommendation Engines

• Take observation data and use data mining/machine learning algorithms to predict outcomes

• Assumptions:
  • People with similar interests have common preferences
  • A sufficiently large number of preferences is available

Recommendation Options

• Collaborative Filtering (Mahout)
  • User-Based
  • Item-Based
• Content-Based (Mahout Clustering)
• Data Mining (SSAS)
  • Association
  • Clustering

Technology

Mahout
• A scalable machine learning library
• Fast, Efficient & Pragmatic
• Many of the algorithms can be run on Hadoop

HDInsight
• Hadoop on Windows
• HDInsight on Windows Azure (seamlessly scale in the cloud)
• HortonWorks Data Platform/HDP (on-premise solution)

Generating Recommendations

1. Sources of Data
2. Clean & Prepare Data
3. Generate Recommendations
   • Build User/Item matrix
   • Calculate User Similarity
   • Form Neighborhoods
   • Generate Recommendations

Sources of Data

• Explicit (provided deliberately by the user)
  • Ratings
  • Feedback
  • Demographics
  • Psychographics (Personality/Lifestyle/Attitude)
  • Ephemeral Need (need of the moment)

• Implicit (inferred from behavior)
  • Purchase History
  • Click/Browse History

• Product/Item
  • Taxonomy
  • Attributes
  • Descriptions

Our focus for today

Data Preparation

• Clean-Up:
  • Remove outliers (Z-score)
  • Remove frequent buyers (skew)
  • Normalize data (unity-based)

• Format data into a CSV input file: <User ID>,<Item ID>,<Rating>
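The preparation steps above can be sketched in plain Java. This is an illustrative helper, not code from the deck; the class and method names are hypothetical:

```java
import java.util.Arrays;

// Sketch: unity-based (min-max) normalization of raw ratings, then
// formatting rows as the <UserID>,<ItemID>,<Rating> CSV lines the
// Mahout recommender jobs described here expect as input.
public class PrepareRatings {

    // Scale values into [0, 1]: (x - min) / (max - min).
    public static double[] unityNormalize(double[] raw) {
        double min = Arrays.stream(raw).min().getAsDouble();
        double max = Arrays.stream(raw).max().getAsDouble();
        double range = max - min;
        double[] out = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            out[i] = range == 0 ? 0.0 : (raw[i] - min) / range;
        }
        return out;
    }

    // One input line in the format the slide describes.
    public static String csvLine(long userId, long itemId, double rating) {
        return userId + "," + itemId + "," + rating;
    }

    public static void main(String[] args) {
        double[] norm = unityNormalize(new double[]{1, 3, 5});
        System.out.println(Arrays.toString(norm));   // [0.0, 0.5, 1.0]
        System.out.println(csvLine(7, 42, norm[1])); // 7,42,0.5
    }
}
```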

How it Works?

• Build a User/Item Matrix

[Slide shows a User/Item matrix: rows are users 1..N, columns are items 1..n; a cell contains 1 where that user has rated that item]

Neighborhood Formation

[Slide shows a diagram of users U1–U7 clustered into neighborhoods of similar users]

Neighborhood Formation

• Requires some experimentation
• Similarity Metrics:
  • Pearson Correlation
  • Euclidean Distance
  • Spearman Correlation
  • Cosine
  • Tanimoto Coefficient
  • Log-Likelihood
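Two of the metrics listed above can be sketched in a few lines of plain Java. This is an illustrative helper, not Mahout's API; it computes similarity over two users' ratings of the same items:

```java
// Sketch: Pearson correlation and cosine similarity between two users'
// rating vectors (assumed to cover the same co-rated items, in order).
public class Similarity {

    // Cosine similarity: dot(a,b) / (|a| * |b|).
    public static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Pearson correlation: covariance / (stddev_a * stddev_b).
    public static double pearson(double[] a, double[] b) {
        int n = a.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += a[i]; mb += b[i]; }
        ma /= n; mb /= n;
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (a[i] - ma) * (b[i] - mb);
            va  += (a[i] - ma) * (a[i] - ma);
            vb  += (b[i] - mb) * (b[i] - mb);
        }
        return cov / Math.sqrt(va * vb);
    }
}
```

Perfectly correlated rating vectors score 1.0 under both metrics; the choice between them (and the others listed) is the experimentation the slide mentions.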

How it Works?

• Find users similar to U5

• Use a similarity metric to find the k nearest neighbors (kNN)

• U1 & U7 are identified as most similar to U5

[Slide shows the User/Item matrix with U5's row highlighted and U1 & U7 marked as the most similar users]

How it Works?

• Generate Recommendations:
  • Find items U5 has not reviewed (I1 and I6)
  • Predict each rating by taking a weighted sum

[Slide shows the User/Item matrix with predicted ratings (e.g. 0.5 and 0.7) filled in for the target user's unrated items]
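The weighted-sum prediction can be shown in a few lines. This is an illustrative sketch, not Mahout code; the similarity values below are made up for the example:

```java
// Sketch: predict a rating for an unrated item as the neighbors' ratings
// weighted by each neighbor's similarity to the target user.
public class WeightedPrediction {

    // neighborSims[i]    = similarity of neighbor i to the target user
    // neighborRatings[i] = that neighbor's rating of the item
    public static double predict(double[] neighborSims, double[] neighborRatings) {
        double weighted = 0, simSum = 0;
        for (int i = 0; i < neighborSims.length; i++) {
            weighted += neighborSims[i] * neighborRatings[i];
            simSum   += Math.abs(neighborSims[i]);
        }
        return weighted / simSum;
    }

    public static void main(String[] args) {
        // Hypothetical numbers: U1 (sim 0.9) rated the item 1.0;
        // U7 (sim 0.6) rated it 0.5. Prediction is near 0.8.
        System.out.println(predict(new double[]{0.9, 0.6},
                                   new double[]{1.0, 0.5}));
    }
}
```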

Pseudo-Code Implementation

for each item i that u has no preference for
  for each user v that has a preference for i
    compute similarity s between u and v
    calculate a running average of v's preference for i, weighted by s
return the top-ranked (by weighted average) items i

Optimization: restrict the inner loop to u's neighborhood
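One way to realize the pseudo-code above in plain Java (an illustrative sketch, not Mahout's implementation; all names are hypothetical). The inner loop is already restricted to a precomputed neighborhood, as the slide suggests:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Sketch: score every item the target user u has not rated with a
// similarity-weighted average of the neighborhood's ratings.
public class SimpleRecommender {

    // ratings.get(user).get(item) = rating
    // sims.get(v) = similarity of neighbor v to the target user u
    public static Map<Integer, Double> recommend(
            Map<Integer, Map<Integer, Double>> ratings,
            int u, Map<Integer, Double> sims) {
        Map<Integer, Double> num = new HashMap<>();
        Map<Integer, Double> den = new HashMap<>();
        Map<Integer, Double> seen = ratings.getOrDefault(u, Map.of());
        for (Map.Entry<Integer, Double> nb : sims.entrySet()) {
            double s = nb.getValue();
            for (Map.Entry<Integer, Double> r :
                    ratings.getOrDefault(nb.getKey(), Map.of()).entrySet()) {
                if (seen.containsKey(r.getKey())) continue; // u already rated it
                num.merge(r.getKey(), s * r.getValue(), Double::sum);
                den.merge(r.getKey(), Math.abs(s), Double::sum);
            }
        }
        Map<Integer, Double> scores = new TreeMap<>();
        num.forEach((item, n) -> scores.put(item, n / den.get(item)));
        return scores; // item -> predicted rating; caller takes the top ranked
    }
}
```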

Mahout Implementation

• Real-Time Recommendations
  • Write Java code and host in a JVM instance
  • Limited scalability
  • Requires training data
  • Integration typically handled through web services

• Batch-Based Recommendations
  • Uses MapReduce jobs on Hadoop
  • Offline and slow, yet scalable
  • Out-of-the-box recommender jobs

Mahout MapReduce Implementation
1 – Generate List of ItemIDs
2 – Create Preference Vector
3 – Count Unique Users
4 – Transpose Preference Vectors
5 – Row Similarity
   • Compute Weights
   • Compute Similarities
   • Similarity Matrix
6 – Pre-Partial Multiply, Similarity Matrix
7 – Pre-Partial Multiply, Preferences
8 – Partial Multiply (Steps 6 & 7)
9 – Filter Items
10 – Aggregate & Recommend

Integrating Mahout

• Real-Time
  • Requires Java coding
  • Web service
  • Process:
    • Load training data (memory pressure)
    • Generate recommendations

• Batch
  • ETL from source
  • Generate input file (UserID, ItemID, Rating)
  • Load to HDFS
  • Process with Mahout/Hadoop
  • ETL output from HDFS/Hadoop
  • Output format: UserID [ItemID:Estimated Rating, ...]
    e.g. 7 [1:4.5,2:4.5,3:4.5,4:4.5,5:4.5,6:4.5,7:4.5]
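The ETL step that pulls results back out has to parse lines in the format shown above. A minimal parsing sketch (illustrative names, not a Mahout API; it assumes the user ID precedes the bracketed list on each line):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch: parse one batch-output line of the form
// "<UserID>\t[<ItemID>:<EstimatedRating>,...]" into usable values.
public class RecommendationParser {

    public static long userId(String line) {
        return Long.parseLong(line.substring(0, line.indexOf('[')).trim());
    }

    public static Map<Long, Double> parse(String line) {
        String list = line.substring(line.indexOf('[') + 1, line.lastIndexOf(']'));
        Map<Long, Double> recs = new LinkedHashMap<>(); // keep ranking order
        for (String pair : list.split(",")) {
            String[] kv = pair.split(":");
            recs.put(Long.parseLong(kv[0].trim()), Double.parseDouble(kv[1]));
        }
        return recs;
    }

    public static void main(String[] args) {
        String line = "7\t[1:4.5,2:4.0]";
        System.out.println(userId(line)); // 7
        System.out.println(parse(line));  // {1=4.5, 2=4.0}
    }
}
```

The parsed map can then be bulk-loaded into Hive, MongoDB, HBase, or a source system, as the next slide discusses.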

Handling Recommendations

Storing Recommendations:
• Hive
  • Data warehouse system for Hadoop
  • Hive ODBC Driver
• MongoDB
  • Leading NoSQL database
  • JSON-like storage with flexible schema
  • C#/.NET MongoDB driver
• HBase
  • Open-source, distributed, column-oriented database modeled after Google’s BigTable
  • Use Pig/MapReduce to process output files and load the HBase table
  • Java API for easy reading
• Source System (SQL Server, etc.)

Evaluating the Recommendations

• How good are your recommendations?
• How do you evaluate the recommendation engine?
• Two options, both of which split the data into test & training sets:
  • Average Difference
  • Root-Mean-Square

• How it works:

                      I1    I2    I3
Estimated Review      3.5   4.0   1.5
Actual Review         4.0   2.0   2.0
Absolute Difference   0.5   2.0   0.5

Average Difference = (0.5 + 2.0 + 0.5) / 3 = 1.0

Root-Mean-Square = √((0.5² + 2.0² + 0.5²) / 3) ≈ 1.22
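The worked example above can be reproduced directly; a small self-contained sketch (illustrative names, no Mahout dependency):

```java
// Sketch: the two evaluation metrics from the slide, applied to the
// worked example: estimates {3.5, 4.0, 1.5} vs. actuals {4.0, 2.0, 2.0}.
public class Evaluate {

    public static double avgAbsDiff(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) sum += Math.abs(est[i] - act[i]);
        return sum / est.length;
    }

    public static double rmse(double[] est, double[] act) {
        double sum = 0;
        for (int i = 0; i < est.length; i++) {
            double d = est[i] - act[i];
            sum += d * d;
        }
        return Math.sqrt(sum / est.length);
    }

    public static void main(String[] args) {
        double[] est = {3.5, 4.0, 1.5};
        double[] act = {4.0, 2.0, 2.0};
        System.out.println(avgAbsDiff(est, act)); // 1.0
        System.out.println(rmse(est, act));       // sqrt(1.5), about 1.22
    }
}
```

RMSE penalizes large misses (the 2.0 error on I2) more heavily than the average difference does, which is why the two scores differ.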

Evaluating the Recommendations

    DataModel model = new FileDataModel(new File("ratings.csv"));
    RecommenderEvaluator eval = new AverageAbsoluteDifferenceRecommenderEvaluator();

    RecommenderBuilder bldr = new RecommenderBuilder() {
        @Override
        public Recommender buildRecommender(DataModel model) throws TasteException {
            // Use the Pearson correlation to calculate similarity
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Generate neighborhoods of approx. 10 users
            UserNeighborhood hood = new NearestNUserNeighborhood(10, similarity, model);
            return new GenericUserBasedRecommender(model, hood, similarity);
        }
    };

    // Use 70% of the data to train the model and 30% to test
    double score = eval.evaluate(bldr, model, 0.7, 1.0);

Challenges

1. Context
2. Cold Start
3. Data Sparsity
4. Popularity Bias
5. Curse of Dimensionality

Context Challenges

January: 20 degrees & snowing... what should we recommend?

Other Challenges

• Cold Start
  • Occurs when either a new item or a new user is introduced
  • Can be handled by:
    • Substituting an average item/user profile
    • Using another recommendation technique (Content-Based)

• Data Sparsity
  • Too many items/users make finding intersections difficult

• Popularity Bias
  • Skewed towards popular items; people with “unique” taste are left out

• Curse of Dimensionality
  • More items/users lead to more noise and greater error

Resources

Mahout in Action, by Sean Owen, Robin Anil, Ted Dunning, Ellen Friedman

Hadoop: The Definitive Guide, by Tom White
