Recommending Semantic Nearest Neighbors Using Storm and Dato

Presented at Dato Conf, SF Personalization @ StumbleUpon Recommending Semantic Nearest Neighbors using Storm and Dato

Upload: ashok-venkatesan

Post on 24-Jan-2018


Category:

Engineering



TRANSCRIPT

Page 1: Recommending Semantic Nearest Neighbors Using Storm and Dato

Presented at Dato Conf, SF

Personalization @ StumbleUpon

Recommending "Semantic Nearest Neighbors"

using Storm and Dato

Page 2

OVERVIEW  

Page 3

StumbleUpon – Choose Topics, Discover Content

Page 4

Bookmark, Organize and Share

Page 5

Recommendations – Matching User With Content

1. Understand User
2. Understand Content
3. Recommend
4. Get Feedback

[Figure labels: TELEVISION, MUSIC, TRENDING, FRIENDS, LIKEMINDED USERS, EXPERTS, ANIMALS, DOGS, PHOTOGRAPHY, MOVIES, ARTS, HUMOR]

Page 6

Architecture Overview

Stages: 1. INGESTION, 2. CHECK QUALITY, 3. OFFLINE COMPUTATIONS, 4. RECS, 5. ONLINE COMPUTATION

[Diagram components: New Content, Ingestion Queue, Discovery Queue, Content Analysis, Cold Start Model, MySQL, HBase, ES, Event Queue, Event Processors, Rec Models, Recommendation Engine]

Page 7

CONTEXTUAL RECS

Page 8

Problem

•  Problem:
   –  Recommend items based on the topics discovered in the page the user is currently on

•  Strategies:
   –  Find semantically similar items
   –  Find items that dig further into a specific topic
   –  Find items that dig further into a broader topic
   –  Others…

Page 9

Constraints/SLAs

•  Very quick “Ingestion to Recommendation” turnaround time (x10 seconds)
   –  Adopt stream processing with at-least-once processing guarantees
   –  Build idempotent subsystems
   –  Capitalize on non-linearity wherever possible

•  Low latency retrieval of recs (x10 ms)
   –  Pre-compute recs
   –  Retrieve recs in θ(1) time

•  Horizontally scalable design
   –  Utilize distributed processing systems/data stores
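The combination of at-least-once delivery, idempotent subsystems and pre-computed recs can be illustrated with a minimal sketch. The `RecStore` class, its version field and the in-memory dictionaries are assumptions made for illustration; the slides do not describe how the actual stores are written to.

```python
class RecStore:
    """Hypothetical in-memory stand-in for a key-value store of pre-computed recs."""

    def __init__(self):
        self._recs = {}       # url_id -> list of recommended url_ids
        self._versions = {}   # url_id -> version of the last applied update

    def apply(self, url_id, version, recs):
        """Apply an update; replayed or stale messages leave the state unchanged."""
        if self._versions.get(url_id, -1) >= version:
            return False      # duplicate or out-of-date delivery
        self._recs[url_id] = list(recs)
        self._versions[url_id] = version
        return True

    def get(self, url_id):
        """Average-case constant-time lookup of pre-computed recs."""
        return self._recs.get(url_id, [])


store = RecStore()
message = ("url-1", 1, ["url-7", "url-9"])
first = store.apply(*message)
replayed = store.apply(*message)   # at-least-once delivery may repeat a message
```

Because a replayed message is a no-op, the upstream stream processor can safely redeliver after a failure, and readers only ever perform a dictionary lookup.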

Page 10

Approach Overview

•  (Offline) Utilize a high quality dataset to build a topic model
•  (Online) For each URL ingested,
   –  Extract text features that summarize the documents
•  Use pre-built topic models for
   –  Filtering noisy keywords
   –  Finding general topics
   –  Finding specific topics
   –  Computing topic hashes
•  Compute similarity/relevance
•  Store for quick retrieval
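The "compute similarity, then store for quick retrieval" steps could look roughly like the sketch below. It assumes each item is already represented by a topic-proportion vector and uses brute-force cosine similarity; the slides do not name the similarity measure, and the item names and vectors are made up.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(topic_vectors, k=2):
    """Return {item: [k most similar other items]} by brute force."""
    table = {}
    for item, vec in topic_vectors.items():
        scored = [(cosine(vec, other_vec), other)
                  for other, other_vec in topic_vectors.items() if other != item]
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        table[item] = [other for _, other in scored[:k]]
    return table


# Hypothetical topic proportions over three topics.
topic_vectors = {
    "dog-training": [0.8, 0.1, 0.1],
    "puppy-photos": [0.7, 0.2, 0.1],
    "film-reviews": [0.1, 0.1, 0.8],
}
neighbors = nearest_neighbors(topic_vectors, k=1)
```

The resulting table is what would be written to a key-value store, so that serving a recommendation is a single lookup. Brute force is quadratic in the number of items; at scale, a search index would take over the candidate retrieval.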

Page 11

Feature Extraction

[Diagram labels: Wikipedia Annotation [2], Detect Language, Parse, Noun Chunking [1], Cleanup, Remove Boilerplate, Coalesce Tags, Compute Tag Score]

[1] Manning, Christopher D., et al. “The Stanford CoreNLP natural language processing toolkit.” Proceedings of 52nd Annual Meeting of the ACL: System Demonstrations. 2014.
[2] Milne, David, and Ian H. Witten. "An open-source toolkit for mining Wikipedia." Artificial Intelligence 194 (2013): 222-239.
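The slide does not say how tags are coalesced or scored. As one plausible reading, the sketch below merges tags that differ only in case or whitespace and scores each by its relative frequency; both choices are assumptions, and the tag strings are made up.

```python
from collections import Counter


def coalesce_and_score(tags):
    """Merge tags that differ only in case/whitespace; score by relative frequency."""
    normalized = [" ".join(t.lower().split()) for t in tags if t.strip()]
    counts = Counter(normalized)
    total = sum(counts.values())
    return {tag: count / total for tag, count in counts.items()}


scores = coalesce_and_score(
    ["Border Collie", "border  collie", "Agility", "dog training"]
)
```

Here the two spellings of "border collie" collapse into one tag with score 0.5, while the other two tags score 0.25 each.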

Page 12

Topic Modeling

[3] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet Allocation." Journal of Machine Learning Research 3 (2003): 993-1022.

image courtesy: http://parkcu.com/blog/latent-dirichlet-allocation/

Page 13

LDA with Topic Associations

•  Similar to constrained clustering, LDA can be run with topic associations [4].
•  Perform hierarchical/agglomerative clustering on SU’s taxonomy to obtain K=75 clusters of topic sets.
•  Use the topic sets as possible labels for the latent topic z.
•  The words themselves are not learnt for the specific topic they have been mapped to.

[4] Andrzejewski, David, and Xiaojin Zhu. “Latent Dirichlet Allocation with topic-in-set knowledge.” Proceedings of the NAACL HLT 2009 Workshop on Semi-Supervised Learning for Natural Language Processing. Association for Computational Linguistics, 2009.
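A toy illustration of the topic-in-set idea: in collapsed Gibbs sampling, a word with an association may only be assigned to topics in its permitted set, while unconstrained words are sampled as usual. This is a teaching sketch in pure Python with two topics and a made-up corpus, not the production model or the GraphLab Create implementation; the `allowed` mapping and the hyperparameter values are assumptions.

```python
import random


def constrained_lda(docs, num_topics, allowed, iters=30, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling; a word listed in `allowed` may only be
    assigned to one of its permitted topics."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    every_topic = list(range(num_topics))
    doc_topic = [[0] * num_topics for _ in docs]   # n(d, k)
    topic_word = [{} for _ in every_topic]         # n(k, w)
    topic_total = [0] * num_topics                 # n(k)

    def update(d, w, k, delta):
        doc_topic[d][k] += delta
        topic_word[k][w] = topic_word[k].get(w, 0) + delta
        topic_total[k] += delta

    # Initialise every token with a random topic from its permitted set.
    z = []
    for d, doc in enumerate(docs):
        z.append([rng.choice(sorted(allowed.get(w, every_topic))) for w in doc])
        for w, k in zip(doc, z[d]):
            update(d, w, k, +1)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                update(d, w, z[d][i], -1)          # remove the token's current topic
                candidates = sorted(allowed.get(w, every_topic))
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k].get(w, 0) + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in candidates
                ]
                z[d][i] = rng.choices(candidates, weights=weights)[0]
                update(d, w, z[d][i], +1)
    return z, topic_word


docs = [["dog", "puppy", "leash", "dog"],
        ["film", "actor", "film", "scene"],
        ["dog", "film", "puppy", "actor"]]
allowed = {"dog": {0}, "film": {1}}   # hypothetical topic-set constraints
assignments, topic_word = constrained_lda(docs, num_topics=2, allowed=allowed)
```

Restricting the candidate set is equivalent to multiplying the usual sampling probability by an indicator that the topic lies in the word's set, which is how the constraint steers unconstrained co-occurring words such as "puppy" or "actor" toward the seeded topics.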

Page 14

Example

[Figure panels: Topic Associations (Pre LDA); Top Words in a Topic (Post LDA)]

Page 15

Using the Topic Models

•  Choose relevant topics
   –  Rank/Threshold by to get
•  Filtering noisy tags
   –  Rank/Threshold by
•  Getting specific words
   –  Rank/Threshold by
•  Getting general words
   –  Rank/Threshold by

Page 16

GraphLab Create I

Image Courtesy: https://dato.com/products/create/technology.html

Page 17

GraphLab Create II

•  Allows fast prototyping on a single machine
   –  Python interface to a C++ backend
   –  Scalable data structures (tabular and graphs) made available
   –  Out-of-core implementation of standard ML algorithms
   –  Makes basic data engineering and visualization tasks easy

•  Easy to deploy microservices (predictive services) around models built using GraphLab Create/pandas/scikit-learn
   –  RESTful API hosted over a Tornado server
   –  Distributed cache
   –  Amazon CloudWatch for monitoring (for AWS deploys)

•  (Con) Debugging the service can be difficult

Page 18

Storm Basics [5]

•  Distributed Realtime Computation System
   –  Fault tolerant, scalable and guaranteed processing
   –  Master --> Zookeeper --> Worker Nodes

•  Workers
   –  Spout: stream sources
   –  Bolt: computation units

•  Data Flow
   –  Streams: unbounded sequence of tuples
   –  Topologies: a network of spouts and bolts

[5] http://www.slideshare.net/ptgoetz/cassandra-and-storm-at-health-market-sceince
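The spout/bolt/topology vocabulary can be made concrete with a single-process analogy built from Python generators. This is not the Storm API and provides none of its distribution or fault tolerance; the function names, the stubbed fetch and the example URLs are all hypothetical.

```python
def url_spout(urls):
    """Spout: the stream source; emits one-field tuples holding a URL."""
    for url in urls:
        yield (url,)


def fetch_bolt(stream):
    """Bolt: consumes tuples and emits new ones; real fetching is stubbed out."""
    for (url,) in stream:
        yield (url, "text of " + url)


def count_bolt(stream):
    """Bolt: emits (url, number of whitespace-separated words)."""
    for url, text in stream:
        yield (url, len(text.split()))


def run_topology(urls):
    """Topology: the wiring of a spout into a chain of bolts."""
    return list(count_bolt(fetch_bolt(url_spout(urls))))


results = run_topology(["a.example", "b.example"])
```

In Storm the same wiring is declared once and the runtime schedules spout and bolt instances across worker nodes, replaying tuples that fail to be fully processed.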

Page 19

Architecture

SIMILAR ITEM TOPOLOGY

[Diagram components: URLs, Kafka Broker, Webpage Surveyor Service, Fetch Page HTML, HTML to Text, Text to (Tags, Concepts), Merge, TMS*, Models, Build Topic Model, To S3, Get Similar Items]

1. Topic Model Query
2.a. Load ES
2.b. Get Similar Items
3. Load Similar Items for quick lookup

*TMS – Topic Model Service

Page 20

Some Numbers I

•  Number of Storm Workers: 3
•  Number of ES Nodes: 3
•  Training:
   –  Document Size: 2M
   –  Vocabulary: 400K
   –  Time: ~8s/iteration (16 cores)
•  Predictive service performance:
   –  Peak requests handled: 200/min
   –  Avg response time: 110 ms
•  URL turnaround time: 10s
•  Number of URLs ingested: 70/min
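As a rough back-of-the-envelope reading of these figures, Little's law (items in flight = arrival rate × time in system) gives the average concurrency each rate implies. This assumes steady state and treats the quoted rates and latencies as averages, which the slide does not state.

```python
def in_flight(rate_per_min, latency_seconds):
    """Little's law: average number of items in the system at steady state."""
    return rate_per_min / 60.0 * latency_seconds


urls_in_flight = in_flight(70, 10)          # ingestion: 70 URLs/min, 10 s turnaround
requests_in_flight = in_flight(200, 0.110)  # service: 200 req/min, 110 ms response
```

Under those assumptions roughly a dozen URLs are in the pipeline at any moment, and the predictive service averages well under one concurrent request even at its quoted peak.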

Page 21

Some Numbers II

Page 22

THANKS. QUESTIONS?