dreamforce_2012_hadoop_use_cases

How Salesforce.com Uses Hadoop Some Data Science Use Cases Narayan Bharadwaj Jed Crosby salesforce.com salesforce.com @nadubharadwaj @JedCrosby

Upload: narayan-bharadwaj

Post on 06-May-2015


Page 1: Dreamforce_2012_Hadoop_Use_Cases

How Salesforce.com Uses Hadoop: Some Data Science Use Cases

Narayan Bharadwaj, salesforce.com, @nadubharadwaj

Jed Crosby, salesforce.com, @JedCrosby

Page 2: Dreamforce_2012_Hadoop_Use_Cases

Safe Harbor

Safe harbor statement under the Private Securities Litigation Reform Act of 1995: This presentation may contain forward-looking statements that involve risks, uncertainties, and assumptions. If any such uncertainties materialize or if any of the assumptions proves incorrect, the results of salesforce.com, inc. could differ materially from the results expressed or implied by the forward-looking statements we make. All statements other than statements of historical fact could be deemed forward-looking, including any projections of product or service availability, subscriber growth, earnings, revenues, or other financial items and any statements regarding strategies or plans of management for future operations, statements of belief, any statements concerning new, planned, or upgraded services or technology developments and customer contracts or use of our services. The risks and uncertainties referred to above include – but are not limited to – risks associated with developing and delivering new functionality for our service, new products and services, our new business model, our past operating losses, possible fluctuations in our operating results and rate of growth, interruptions or delays in our Web hosting, breach of our security measures, the outcome of intellectual property and other litigation, risks associated with possible mergers and acquisitions, the immature market in which we operate, our relatively limited operating history, our ability to expand, retain, and motivate our employees and manage our growth, new releases of our service and successful customer deployment, our limited history reselling non-salesforce.com products, and utilization and selling to larger enterprise customers. Further information on potential factors that could affect the financial results of salesforce.com, inc. is included in our annual report on Form 10-Q for the most recent fiscal quarter ended July 31, 2012.
These documents and others containing important disclosures are available on the SEC Filings section of the Investor Information section of our Web site. Any unreleased services or features referenced in this or other presentations, press releases or public statements are not currently available and may not be delivered on time or at all. Customers who purchase our services should make the purchase decisions based upon features that are currently available. Salesforce.com, inc. assumes no obligation and does not intend to update these forward-looking statements.

Page 3: Dreamforce_2012_Hadoop_Use_Cases

Agenda

• Technology

• Hadoop use cases

• Use case discussion

• Product Metrics

• User Behavior Analysis

• Collaborative Filtering

• Q&A

Every time you see the elephant, we will attempt to explain a Hadoop-related concept.

Page 4: Dreamforce_2012_Hadoop_Use_Cases

Got “Cloud Data”?

800 million transactions/day

Terabytes/day

130k customers

Millions of users

Page 5: Dreamforce_2012_Hadoop_Use_Cases

Technology

Page 6: Dreamforce_2012_Hadoop_Use_Cases

Hadoop Overview

- Started by Doug Cutting at Yahoo!

- Based on two Google papers:
  Google File System (GFS): http://research.google.com/archive/gfs.html
  Google MapReduce: http://research.google.com/archive/mapreduce.html

- Hadoop is an open source Apache project:
  Hadoop Distributed File System (HDFS)
  Distributed Processing Framework (MapReduce)

- Several related projects: HBase, Hive, Pig, Flume, ZooKeeper, Mahout, Oozie, HCatalog

Page 7: Dreamforce_2012_Hadoop_Use_Cases

Our Hadoop Ecosystem

Apache Pig

Page 8: Dreamforce_2012_Hadoop_Use_Cases

Contributions

@pRaShAnT1784 : Prashant Kommireddi

Lars Hofhansl

@thefutureian : Ian Varley

Page 9: Dreamforce_2012_Hadoop_Use_Cases

Use Cases

Page 10: Dreamforce_2012_Hadoop_Use_Cases

• Product Metrics
• User behavior analysis
• Capacity planning
• Monitoring intelligence
• Collections
• Query Runtime Prediction
• Early Warning System
• Collaborative Filtering
• Search Relevancy

Legend: Internal App, Product feature

Hadoop Use Cases

Page 11: Dreamforce_2012_Hadoop_Use_Cases

Product Metrics

Page 12: Dreamforce_2012_Hadoop_Use_Cases

• Track feature usage/adoption across 130k+ customers (e.g., Accounts, Contacts, Visualforce, Apex, …)

• Track standard metrics across all features (e.g., #Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime, …)

• Track features and metrics across all channels: API, UI, Mobile

• Primary audience: Executives, Product Managers

Product Metrics – Problem Statement

Page 13: Dreamforce_2012_Hadoop_Use_Cases

[Diagram. Stage labels: Feature (What?); Feature Metadata (Instrumentation); Crunch it (How?); Storage & Processing; Daily Summary (Output); Fancy UI (Visualize)]

Data Pipeline

Page 14: Dreamforce_2012_Hadoop_Use_Cases

[Diagram. Components: Feature Metrics (Custom Object); Trend Metrics (Custom Object); Client Machine; Pig script generator; Java Program; Hadoop; Log Files; User Input (Page Layout); Reports, Dashboards. Connector labels: Log Pull; API; Workflow; Formula Fields]

Product Metrics Pipeline

Page 15: Dreamforce_2012_Hadoop_Use_Cases

| Id | Feature Name | PM | Instrumentation | Metric1 | Metric2 | Metric3 | Metric4 | Status |
|---|---|---|---|---|---|---|---|---|
| F0001 | Accounts | John | /001 | #requests | #UniqOrgs | #UniqUsers | AvgRT | Dev |
| F0002 | Contacts | Nancy | /003 | #requests | #UniqOrgs | #UniqUsers | AvgRT | Review |
| F0003 | API | Eric | A | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed |
| F0004 | Visualforce | Roger | V | #requests | #UniqOrgs | #UniqUsers | AvgRT | Decom |
| F0005 | Apex | Kim | axapx | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed |
| F0006 | Custom Objects | Chun | /aXX | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed |
| F0008 | Chatter | Jed | chcmd | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed |
| F0009 | Reports | Steve | R | #requests | #UniqOrgs | #UniqUsers | AvgRT | Deployed |

Feature Metrics (Custom Object)

Page 16: Dreamforce_2012_Hadoop_Use_Cases

Feature Metrics (Custom Object)

Page 17: Dreamforce_2012_Hadoop_Use_Cases

User Input (Page Layout)

Formula Field

Workflow Rule

Page 18: Dreamforce_2012_Hadoop_Use_Cases

User Input (Child Custom Object)

Child Objects

Page 19: Dreamforce_2012_Hadoop_Use_Cases

Apache Pig

Page 20: Dreamforce_2012_Hadoop_Use_Cases

-- Define UDFs
DEFINE GFV GetFieldValue('/path/to/udf/file');

-- Load data
A = LOAD '/path/to/cloud/data/log/files' USING PigStorage();

-- Filter data: keep only log records of type 'U'
B = FILTER A BY GFV(*, 'logRecordType') == 'U';

-- Extract fields
C = FOREACH B GENERATE GFV(*, 'orgId') AS orgId, GFV(*, 'userId') AS userId ……..

-- Group
G = GROUP C BY ……

-- Compute output metrics
O = FOREACH G {
    orgs = C.orgId;
    uniqueOrgs = DISTINCT orgs;
    -- Emit one row per group with its distinct-org count
    GENERATE group, COUNT(uniqueOrgs);
}

-- Store or dump results
STORE O INTO '/path/to/user/output';

Basic Pig Script Construct
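The group-and-aggregate pattern in the script can be checked off-cluster with a few lines of plain Python. This is a minimal sketch, not the deck's implementation: the record layout `(feature, org_id, user_id, response_ms)`, the function name `daily_feature_metrics`, and the sample rows are all assumptions made for illustration. It computes the four standard metrics named earlier (#Requests, #UniqueOrgs, #UniqueUsers, AvgResponseTime) per feature.

```python
from collections import defaultdict


def daily_feature_metrics(records):
    """Aggregate hypothetical (feature, org_id, user_id, response_ms) records."""
    requests = defaultdict(int)
    orgs = defaultdict(set)
    users = defaultdict(set)
    total_ms = defaultdict(float)
    for feature, org_id, user_id, response_ms in records:
        requests[feature] += 1            # #Requests
        orgs[feature].add(org_id)         # distinct orgs seen for this feature
        users[feature].add(user_id)       # distinct users seen for this feature
        total_ms[feature] += response_ms  # running sum for the average
    return {
        feature: {
            "requests": requests[feature],
            "unique_orgs": len(orgs[feature]),
            "unique_users": len(users[feature]),
            "avg_response_ms": total_ms[feature] / requests[feature],
        }
        for feature in requests
    }


# Made-up sample rows, for illustration only.
sample = [
    ("Accounts", "org1", "u1", 100.0),
    ("Accounts", "org1", "u2", 300.0),
    ("Accounts", "org2", "u3", 200.0),
    ("Contacts", "org1", "u1", 50.0),
]
metrics = daily_feature_metrics(sample)
```

Each entry of `metrics` corresponds to one row of a daily summary for one feature.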

Page 21: Dreamforce_2012_Hadoop_Use_Cases

Java Pig Script Generator (Client)

Page 22: Dreamforce_2012_Hadoop_Use_Cases

| Id | Date | #Requests | #Unique Orgs | #Unique Users | Avg ResponseTime |
|---|---|---|---|---|---|
| F0001 | 06/01/2012 | <big> | <big> | <big> | <little> |
| F0002 | 06/01/2012 | <big> | <big> | <big> | <little> |
| F0003 | 06/01/2012 | <big> | <big> | <big> | <little> |
| F0001 | 06/02/2012 | <big> | <big> | <big> | <little> |
| F0002 | 06/02/2012 | <big> | <big> | <big> | <little> |
| F0003 | 06/03/2012 | <big> | <big> | <big> | <little> |

Trend Metrics (Custom Object)

Page 23: Dreamforce_2012_Hadoop_Use_Cases

Upload to Trend Metrics (Custom Object)

Page 24: Dreamforce_2012_Hadoop_Use_Cases

Visualization (Reports & Dashboards)

Page 25: Dreamforce_2012_Hadoop_Use_Cases

Visualization (Reports & Dashboards)

Page 26: Dreamforce_2012_Hadoop_Use_Cases

Collaborate, Iterate (Chatter)

Page 27: Dreamforce_2012_Hadoop_Use_Cases

[Diagram: Product Metrics Pipeline, repeated from Page 14]

Recap

Page 28: Dreamforce_2012_Hadoop_Use_Cases

User Behavior Analysis

Page 29: Dreamforce_2012_Hadoop_Use_Cases

Problem Statement

How do we reduce the number of clicks on the user interface?

Need to understand top user click paths. What are they typically trying to do?

What are the user clusters/personas?

Approach:

• Markov transition for click path, D3.js visuals

• K-means (unsupervised) clustering for user groups

Page 30: Dreamforce_2012_Hadoop_Use_Cases

Markov Transitions for "Setup" Pages
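A first-order Markov transition table can be estimated by counting consecutive page pairs in each click path and normalizing each row. The sketch below is illustrative only: the function name `transition_probabilities`, the page names, and the sample paths are assumptions, not data from the talk.

```python
from collections import Counter, defaultdict


def transition_probabilities(click_paths):
    """Estimate P(next page | current page) from lists of visited pages."""
    counts = defaultdict(Counter)
    for path in click_paths:
        # Each consecutive pair (src, dst) is one observed transition.
        for src, dst in zip(path, path[1:]):
            counts[src][dst] += 1
    # Normalize each source page's outgoing counts so they sum to 1.
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }


# Hypothetical click paths, one list per user session.
paths = [
    ["Home", "Setup", "Users"],
    ["Home", "Setup", "Profiles"],
    ["Home", "Reports"],
    ["Setup", "Users"],
]
probs = transition_probabilities(paths)
```

The resulting nested dictionary is the kind of edge-weighted graph a D3.js visualization could draw, with high-probability edges marking the top click paths.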

Page 31: Dreamforce_2012_Hadoop_Use_Cases

K-means clustering of "Setup" Pages
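K-means groups users whose page-visit profiles are close to each other. The following is a minimal sketch of Lloyd's algorithm under stated assumptions: each user is a hypothetical vector of visit counts over three pages, and the first `k` points seed the centroids so the run is deterministic. A production job would more likely use a library such as Mahout, which the deck lists among the related projects.

```python
def kmeans(points, k, iterations=20):
    """Lloyd's algorithm; the first k points are used as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids


# Hypothetical visit counts per user over three pages.
users = [
    (9, 1, 0),
    (0, 1, 8),
    (8, 2, 1),
    (1, 0, 9),
]
labels, centers = kmeans(users, k=2)
```

Here users 0 and 2 end up in one cluster and users 1 and 3 in the other; each centroid can be read as a persona's average usage profile.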

Page 32: Dreamforce_2012_Hadoop_Use_Cases

Collaborative Filtering

Jed Crosby

Page 33: Dreamforce_2012_Hadoop_Use_Cases

Show similar files within an organization

• Content-based approach

• Community-based approach

Collaborative Filtering – Problem Statement

Page 34: Dreamforce_2012_Hadoop_Use_Cases

Popular File

Page 35: Dreamforce_2012_Hadoop_Use_Cases

Related File

Page 36: Dreamforce_2012_Hadoop_Use_Cases

Amazon published this algorithm in 2003: "Amazon.com Recommendations: Item-to-Item Collaborative Filtering," by Gregory Linden, Brent Smith, and Jeremy York. IEEE Internet Computing, January-February 2003.

At Salesforce, we adapted this algorithm for Hadoop, and we use it to recommend files to view and users to follow.

We found this relationship using item-to-item collaborative filtering

Page 37: Dreamforce_2012_Hadoop_Use_Cases

Annual Report

Vision Statement

Dilbert Cartoon

Darth Vader Cartoon

Disk Usage Report

Example: CF on 5 files

Page 38: Dreamforce_2012_Hadoop_Use_Cases

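| User | Annual Report | Vision Statement | Dilbert Cartoon | Darth Vader Cartoon | Disk Usage Report |
|---|---|---|---|---|---|
| Miranda (CEO) | 1 | 1 | 1 | 0 | 0 |
| Bob (CFO) | 1 | 1 | 1 | 0 | 0 |
| Susan (Sales) | 0 | 1 | 1 | 1 | 0 |
| Chun (Sales) | 0 | 0 | 1 | 1 | 0 |
| Alice (IT) | 0 | 0 | 1 | 1 | 1 |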

View History Table

Page 39: Dreamforce_2012_Hadoop_Use_Cases

Annual Report

Disk Usage Report

Darth Vader Cartoon

Dilbert Cartoon

Vision Statement

Relationships Between the Files

Page 40: Dreamforce_2012_Hadoop_Use_Cases

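[Graph: each edge is labeled with the number of users who viewed both files]

| | Annual Report | Vision Statement | Dilbert Cartoon | Darth Vader Cartoon | Disk Usage Report |
|---|---|---|---|---|---|
| Annual Report | - | 2 | 2 | 0 | 0 |
| Vision Statement | 2 | - | 3 | 1 | 0 |
| Dilbert Cartoon | 2 | 3 | - | 3 | 1 |
| Darth Vader Cartoon | 0 | 1 | 3 | - | 1 |
| Disk Usage Report | 0 | 0 | 1 | 1 | - |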

Relationships Between the Files

Page 41: Dreamforce_2012_Hadoop_Use_Cases

| Annual Report | Vision Statement | Dilbert Cartoon | Darth Vader Cartoon | Disk Usage Report |
|---|---|---|---|---|
| Dilbert (2) | Dilbert (3) | Vision Stmt. (3) | Dilbert (3) | Dilbert (1) |
| Vision Stmt. (2) | Annual Rpt. (2) | Darth Vader (3) | Vision Stmt. (1) | Darth Vader (1) |
| | Darth Vader (1) | Annual Rpt. (2) | Disk Usage (1) | |
| | | Disk Usage (1) | | |

The popularity problem: notice that Dilbert appears first in every other file's list. This is probably not what we want.

The solution: divide each relationship tally by the square root of the product of the two files' popularities.

Sorted Relationships for Each File

Page 42: Dreamforce_2012_Hadoop_Use_Cases

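[Graph: each edge is labeled with the normalized relationship score]

| | Annual Report | Vision Statement | Dilbert Cartoon | Darth Vader Cartoon | Disk Usage Report |
|---|---|---|---|---|---|
| Annual Report | - | .82 | .63 | 0 | 0 |
| Vision Statement | .82 | - | .77 | .33 | 0 |
| Dilbert Cartoon | .63 | .77 | - | .77 | .45 |
| Darth Vader Cartoon | 0 | .33 | .77 | - | .58 |
| Disk Usage Report | 0 | 0 | .45 | .58 | - |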

Normalized Relationships Between the Files

Page 43: Dreamforce_2012_Hadoop_Use_Cases

| Annual Report | Vision Statement | Dilbert Cartoon | Darth Vader Cartoon | Disk Usage Report |
|---|---|---|---|---|
| Vision Stmt. (.82) | Annual Report (.82) | Darth Vader (.77) | Dilbert (.77) | Darth Vader (.58) |
| Dilbert (.63) | Dilbert (.77) | Vision Stmt. (.77) | Disk Usage (.58) | Dilbert (.45) |
| | Darth Vader (.33) | Annual Report (.63) | Vision Stmt. (.33) | |
| | | Disk Usage (.45) | | |

High relationship tallies AND similar popularity values now drive closeness.

Sorted relationships for each file, normalized by file popularities
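The normalized scores on these slides are consistent with a cosine-style formula, which the appendix also alludes to. The notation here is an assumption introduced for illustration: c_ij is the relationship tally for files i and j (the number of users who viewed both), and p_i is the popularity of file i (the number of users who viewed it).

```latex
\[
\mathrm{sim}(i, j) = \frac{c_{ij}}{\sqrt{p_i \, p_j}},
\qquad
\mathrm{sim}(\text{Dilbert}, \text{Darth Vader})
  = \frac{3}{\sqrt{5 \cdot 3}}
  = \sqrt{\tfrac{3}{5}}
  \approx 0.77
\]
```

The same formula gives 2 / sqrt(2 · 3) ≈ 0.82 for Annual Report and Vision Statement, so a pair of moderately popular files that are almost always viewed together can outrank a pairing with the universally viewed Dilbert cartoon.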

Page 44: Dreamforce_2012_Hadoop_Use_Cases

1) Compute file popularities

2) Compute relationship tallies and divide by file popularities

3) Sort and store the results

The Item-to-Item CF Algorithm

Page 45: Dreamforce_2012_Hadoop_Use_Cases

MapReduce Overview

Map, Shuffle, Reduce

(adapted from http://code.google.com/p/mapreduce-framework/wiki/MapReduce)
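The three phases can be imitated in memory to see what each one contributes. This is a toy sketch, not Hadoop's API: `run_mapreduce`, `wc_mapper`, and `wc_reducer` are hypothetical names, and the example job is the classic word count.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle (group by key), and reduce over an in-memory list."""
    # Map: each input record yields zero or more (key, value) pairs.
    mapped = [pair for record in records for pair in mapper(record)]
    # Shuffle: collect all values that share a key.
    grouped = defaultdict(list)
    for key, value in mapped:
        grouped[key].append(value)
    # Reduce: each (key, values) group yields zero or more output pairs.
    return [out for key in sorted(grouped) for out in reducer(key, grouped[key])]


def wc_mapper(line):
    # Emit (word, 1) for every word on the line.
    for word in line.split():
        yield word, 1


def wc_reducer(word, ones):
    # Sum the 1s collected for this word.
    yield word, sum(ones)


counts = dict(run_mapreduce(["a b a", "b c"], wc_mapper, wc_reducer))
```

On a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data across the network; the data flow is the same.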

Page 46: Dreamforce_2012_Hadoop_Use_Cases

<user, file>

Inverse identity map

<file, List<user>>

Reduce

<file, (user count)>

Result is a table of (file, popularity) pairs that you store in the Hadoop distributed cache.

1. Compute File Popularities

Page 47: Dreamforce_2012_Hadoop_Use_Cases

(Miranda, Dilbert), (Bob, Dilbert), (Susan, Dilbert), (Chun, Dilbert), (Alice, Dilbert)

Inverse identity map

<Dilbert, {Miranda, Bob, Susan, Chun, Alice}>

Reduce

(Dilbert, 5)

Example: File popularity for Dilbert
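Step 1 can be reproduced on the five-file example with a small in-memory sketch. The `VIEWS` list transcribes the View History Table; the function names are hypothetical, and a plain dictionary stands in for the Hadoop shuffle.

```python
from collections import defaultdict

# (user, file) pairs transcribed from the View History Table.
VIEWS = [
    ("Miranda", "Annual Report"), ("Miranda", "Vision Statement"), ("Miranda", "Dilbert"),
    ("Bob", "Annual Report"), ("Bob", "Vision Statement"), ("Bob", "Dilbert"),
    ("Susan", "Vision Statement"), ("Susan", "Dilbert"), ("Susan", "Darth Vader"),
    ("Chun", "Dilbert"), ("Chun", "Darth Vader"),
    ("Alice", "Dilbert"), ("Alice", "Darth Vader"), ("Alice", "Disk Usage"),
]


def inverse_identity_map(user, file):
    # Swap key and value so the shuffle groups users by file.
    yield file, user


def count_reducer(file, users):
    # Popularity is the number of users who viewed the file.
    yield file, len(users)


def file_popularities(views):
    grouped = defaultdict(list)  # stands in for the shuffle phase
    for user, file in views:
        for key, value in inverse_identity_map(user, file):
            grouped[key].append(value)
    return dict(out for file, users in grouped.items() for out in count_reducer(file, users))


popularity = file_popularities(VIEWS)
```

The resulting `popularity` dictionary plays the role of the (file, popularity) table that the slide stores in the distributed cache.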

Page 48: Dreamforce_2012_Hadoop_Use_Cases

<user, file>

Identity map

<user, List<file>>

Reduce

<(file1, file2), Integer(1)>, <(file1, file3), Integer(1)>, … <(file(n-1), file(n)), Integer(1)>

Relationships have their file IDs in alphabetical order to avoid double counting.

2a. Compute Relationship Tallies − Find All Relationships in View History Table

Page 49: Dreamforce_2012_Hadoop_Use_Cases

(Miranda, Annual Report), (Miranda, Vision Statement), (Miranda, Dilbert)

Identity map

<Miranda, {Annual Report, Vision Statement, Dilbert}>

Reduce

<(Annual Report, Dilbert), Integer(1)>, <(Annual Report, Vision Statement), Integer(1)>, <(Dilbert, Vision Statement), Integer(1)>

Example 2a: Miranda’s (CEO) File Relationship Votes
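The reduce step of 2a amounts to emitting every unordered pair from one user's file list. A minimal sketch, with the hypothetical function name `relationship_votes`, shows how sorting the list first enforces the alphabetical-order rule:

```python
from itertools import combinations


def relationship_votes(user, files):
    """Reduce step for one user: emit ((file_a, file_b), 1) per pair viewed."""
    # Sorting puts each pair in alphabetical order, so (A, B) and (B, A)
    # are never both emitted and no relationship is double counted.
    for pair in combinations(sorted(set(files)), 2):
        yield pair, 1


votes = list(
    relationship_votes("Miranda", ["Annual Report", "Vision Statement", "Dilbert"])
)
```

A user who viewed n files emits n(n-1)/2 votes, which is where the quadratic worst-case cost of the algorithm comes from.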

Page 50: Dreamforce_2012_Hadoop_Use_Cases

<(file1, file2), Integer(1)>

Identity map

<(file1, file2), List<Integer(1)>>

Reduce: count and divide by popularities

<file1, (file2, similarity score)>, <file2, (file1, similarity score)>

Note that we emit each result twice, once for each file that belongs to a relationship.

2b. Tally the Relationship Votes − Just a Word Count, Where Each Relationship Occurrence is a Word

Page 51: Dreamforce_2012_Hadoop_Use_Cases

<(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>, <(Dilbert, Vader), Integer(1)>

Identity map

<(Dilbert, Vader), {1, 1, 1}>

Reduce: count and divide by popularities

<Dilbert, (Vader, sqrt(3/5))>, <Vader, (Dilbert, sqrt(3/5))>

Example 2b: the Dilbert/Darth Vader Relationship
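The reducer for 2b can be sketched as follows. The function name `similarity_reducer` is hypothetical, and an ordinary dictionary stands in for the popularity table held in the distributed cache; the score uses the tally divided by the square root of the product of the two popularities, which reproduces the sqrt(3/5) shown above.

```python
from math import sqrt


def similarity_reducer(pair, votes, popularity):
    """Count the votes for one file pair and normalize by file popularities."""
    file_a, file_b = pair
    tally = sum(votes)  # word-count style tally of the relationship
    score = tally / sqrt(popularity[file_a] * popularity[file_b])
    # Emit the result twice, keyed by each file in the relationship.
    yield file_a, (file_b, score)
    yield file_b, (file_a, score)


popularity = {"Dilbert": 5, "Vader": 3}  # from step 1
results = list(similarity_reducer(("Dilbert", "Vader"), [1, 1, 1], popularity))
```

Emitting under both keys lets the next job group all of a file's neighbors together without a second pass over the pairs.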

Page 52: Dreamforce_2012_Hadoop_Use_Cases

<file1, (file2, similarity score)>

Identity map

<file1, List<(file2, similarity score)>>

Reduce

<file1, {top n similar files}>

Store the results in your location of choice

3. Sort and Store Results

Page 53: Dreamforce_2012_Hadoop_Use_Cases

<Dilbert, (Annual Report, .63)>, <Dilbert, (Vision Statement, .77)>, <Dilbert, (Disk Usage, .45)>, <Dilbert, (Darth Vader, .77)>

Identity map

<Dilbert, {(Annual Report, .63), (Vision Statement, .77), (Disk Usage, .45), (Darth Vader, .77)}>

Reduce

<Dilbert, {Darth Vader, Vision Statement}> (Top 2 files)

Store results

Example 3: Sorting the Results for Dilbert
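The final reduce is a top-n selection over one file's scored neighbors. In this sketch `top_n_reducer` is a hypothetical name and `heapq.nlargest` does the selection; the scores are the rounded values from the example above. Darth Vader and Vision Statement tie at .77, so their relative order in the output is a tie-breaking detail rather than a meaningful ranking.

```python
import heapq


def top_n_reducer(file, scored, n):
    """Keep the n highest-scoring (neighbor, score) pairs for one file."""
    best = heapq.nlargest(n, scored, key=lambda item: item[1])
    yield file, [name for name, _ in best]


scored = [
    ("Annual Report", 0.63),
    ("Vision Statement", 0.77),
    ("Disk Usage", 0.45),
    ("Darth Vader", 0.77),
]
top2 = dict(top_n_reducer("Dilbert", scored, n=2))
```

Using a bounded selection instead of a full sort keeps the reducer's memory proportional to n when a file has many neighbors.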

Page 54: Dreamforce_2012_Hadoop_Use_Cases

• Cosine formula and normalization trick to avoid the distributed cache

• Mahout has CF

• Asymptotic order of the algorithm is O(M*N^2) in the worst case, but is helped by sparsity.

Appendix

Page 55: Dreamforce_2012_Hadoop_Use_Cases

Narayan Bharadwaj
Director, Product Management
@nadubharadwaj

Jed Crosby
Data Scientist
@JedCrosby

Page 56: Dreamforce_2012_Hadoop_Use_Cases