Enhancing Discovery with Solr and Mahout
Los Angeles/OC Apache Lucene/Solr User Group meeting held at Shopzilla in LA on January 19th, 2012.
1 CONFIDENTIAL |
Thinking Lucene Think Lucid
Grant Ingersoll, Chief Scientist, Lucid Imagination
Enhancing Discovery with Solr and Mahout
2 CONFIDENTIAL | Copyright Lucid Imagination
Evolution
• Documents
  - Models
  - Feature Selection
• User Interaction
  - Clicks
  - Ratings/Reviews
  - Learning to Rank
  - Social Graph
• Queries
  - Phrases
  - NLP
• Content Relationships
  - Page Rank, etc.
  - Organization
Minding the Intersection
Search
Discovery Analytics
Topics
• Background
  - Apache Mahout
  - Apache Solr and Lucene
• Recommendations with Mahout
  - Collaborative Filtering
• Discovery with Solr and Mahout
• Discussion
Apache Lucene in a Nutshell
• http://lucene.apache.org/java
• Java-based Application Programming Interface (API) for adding search and indexing functionality to applications
• Fast and efficient scoring and indexing algorithms
• Lots of contributions to make common tasks easier:
  - Highlighting, spatial, Query Parsers, Benchmarking tools, etc.
• Most widely deployed search library on the planet
Apache Solr in a Nutshell
• http://lucene.apache.org/solr
• Lucene-based Search Server + other features and functionality
• Access Lucene over HTTP:
  - Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
• Most programming tasks in Lucene are taken care of in Solr
• Faceting (guided navigation, filters, etc.)
• Replication and distributed search support
• Lucene Best Practices
Apache Mahout in a Nutshell
• An Apache Software Foundation project to create scalable machine learning libraries under the Apache Software License
  - http://mahout.apache.org
• The Three C's:
  - Collaborative Filtering (recommenders)
  - Clustering
  - Classification
• Others:
  - Frequent Item Mining
  - Primitive collections
  - Math stuff
http://dictionary.reference.com/browse/mahout
Recommendations with Mahout
Recommenders
• Collaborative Filtering (CF)
  - Provide recommendations solely based on preferences expressed between users and items
  - "People who watched this also watched that"
• Content-based Recommendations (CBR)
  - Provide recommendations based on the attributes of the items and user profile
  - 'Modern Family' is a sitcom, Bob likes sitcoms
    - => Suggest Modern Family to Bob
• Mahout is geared towards CF, but can be extended to do CBR
  - Classification can also be used for CBR
• Aside: search engines can also solve these problems
To Rate or Not?
• In many instances, users don't provide actual ratings
  - Clicks, views, etc.
• Non-Boolean ratings can also often introduce unnecessary noise
  - Even a low rating often has a positive correlation with highly rated items in the real world
• Example: Should we recommend Frankenstein to Bob?

        Dracula   Jane Eyre   Frankenstein   Java Programming
Bob     1         4           ???            -
Mary    5         1           4              -
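One way to see why the question on this slide is not trivial is to compare a rating-based similarity with a Boolean one on the Bob/Mary table. The sketch below is illustrative only and is not Mahout code; the function names `pearson` and `tanimoto` are chosen here for the example, and the "-" cells (Java Programming) are treated as "no preference" and left out.

```python
import math

# Ratings from the slide; "-" and "???" cells are simply absent.
ratings = {
    "Bob": {"Dracula": 1, "Jane Eyre": 4},
    "Mary": {"Dracula": 5, "Jane Eyre": 1, "Frankenstein": 4},
}


def pearson(a, b):
    """Pearson correlation over the items both users rated (None if undefined)."""
    shared = sorted(set(a) & set(b))
    if len(shared) < 2:
        return None
    xs = [a[item] for item in shared]
    ys = [b[item] for item in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def tanimoto(a, b):
    """Boolean similarity: shared items divided by all items either user touched."""
    items_a, items_b = set(a), set(b)
    union = items_a | items_b
    return len(items_a & items_b) / len(union) if union else 0.0


rating_similarity = pearson(ratings["Bob"], ratings["Mary"])
boolean_similarity = tanimoto(ratings["Bob"], ratings["Mary"])
print(rating_similarity, boolean_similarity)
```

With ratings, Bob and Mary look like opposites on the two books they share; ignoring the rating values, they look fairly alike because they read the same books. Which view is more useful depends on the data, which is the point the slide raises.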
Collaborative Filtering with Mahout
• Extensive framework for collaborative filtering
• Recommenders
  - User based
  - Item based
  - Slope One
• Online and Offline support
  - Offline can utilize Hadoop

          Item 1   Item 2   …   Item m
User 1    -        0.5          0.9
User 2    0.1      0.3          -
…
User n    0.8      0.7          0.1

Recommendations for User X
User Similarity
[Figure: Users 1-4 and Items 1-4]
What should we recommend for User 1?
Item Similarity
[Figure: Users 1-4 and Items 1-4]
What should we recommend for User 1?
Slope One
• Intuition: There is a linear relationship between rated items
  - Y = mX + b where m = 1
• Solve for b upfront based on existing ratings: b = (Y - X)
  - Find the average difference in preference value for every pair of items
• Online can be very fast, but requires up-front computation and memory

User   Item 1   Item 2
A      3.5      2
B      ?        3

User A: 3.5 - 2 = 1.5
Item 1 (User B) = 3 + 1.5 = 4.5
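The Slope One arithmetic on this slide can be reproduced with a few lines of Python. This is a minimal sketch of the weighted variant (each item pair is weighted by how many users rated both), not Mahout's own recommender; the names `average_differences` and `predict` are made up for the example.

```python
from collections import defaultdict


def average_differences(ratings):
    """For each ordered item pair (i, j), average r_i - r_j over users who rated both."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for user_ratings in ratings.values():
        for i, r_i in user_ratings.items():
            for j, r_j in user_ratings.items():
                if i != j:
                    totals[(i, j)] += r_i - r_j
                    counts[(i, j)] += 1
    diffs = {pair: totals[pair] / counts[pair] for pair in totals}
    return diffs, counts


def predict(ratings, user, target):
    """Predict the user's rating for target: known rating + average difference (b)."""
    diffs, counts = average_differences(ratings)
    weighted_sum = 0.0
    weight = 0
    for item, rating in ratings[user].items():
        pair = (target, item)
        if pair in diffs:
            weighted_sum += (rating + diffs[pair]) * counts[pair]
            weight += counts[pair]
    return weighted_sum / weight if weight else None


# The two users from the slide: A rated both items, B rated only Item 2.
ratings = {
    "A": {"Item 1": 3.5, "Item 2": 2.0},
    "B": {"Item 2": 3.0},
}
prediction = predict(ratings, "B", "Item 1")
print(prediction)  # 4.5, matching the slide: 3 + 1.5
```

The pairwise averages are the "up-front computation and memory" the slide mentions; in a real system they would be computed once and reused rather than rebuilt on every call as they are here.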
Online and Offline Recommendations
• Online
  - Predates Hadoop
  - Designed to run on a single node
    - Matrix size of ~100M interactions
  - API for integrating with your application
• Offline
  - Hadoop based
  - Designed to run on large cluster
  - Several approaches:
    - RecommenderJob, ItemSimilarityJob, ParallelALSFactorizationJob
RecommenderJob
• Essentially does matrix multiplication using distributed techniques
• $MAHOUT_HOME/bin/examples/asf-email-examples.sh

       101  102  103  104  105       User A       Recs
101      7    2    0    1    3         3.0          30
102      2    8    3    5    2         0            37
103      0    3    3    6    4    X    4.0    =     38
104      1    5    6    4    7         3.0          53
105      3    2    4    7    9         2.0          64
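The multiplication on this slide can be checked on a single machine. The sketch below uses plain Python lists and assumes the 5x5 matrix holds item-to-item weights (for example co-occurrence counts) and that a 0 in User A's vector means "no preference yet"; both readings are assumptions, since the slide does not label them. RecommenderJob performs the equivalent computation in a distributed fashion on Hadoop.

```python
item_ids = [101, 102, 103, 104, 105]

# Item-item matrix from the slide (assumed to be co-occurrence counts).
item_matrix = [
    [7, 2, 0, 1, 3],
    [2, 8, 3, 5, 2],
    [0, 3, 3, 6, 4],
    [1, 5, 6, 4, 7],
    [3, 2, 4, 7, 9],
]

# User A's preference vector from the slide.
user_a = [3.0, 0.0, 4.0, 3.0, 2.0]


def score_items(matrix, prefs):
    """Matrix-vector product: one score per item row."""
    return [sum(weight * pref for weight, pref in zip(row, prefs)) for row in matrix]


scores = score_items(item_matrix, user_a)
print(scores)  # [30.0, 37.0, 38.0, 53.0, 64.0]

# Only items the user has not already expressed a preference for are candidates.
recs = sorted(
    ((item, score) for item, score, pref in zip(item_ids, scores, user_a) if pref == 0),
    key=lambda pair: -pair[1],
)
print(recs)  # [(102, 37.0)]
```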
Discovery with Solr
Discovery with Solr
• Goals:
  - Guide users to results without having to guess at keywords
  - Encourage serendipity
  - Never show empty results
• Out of the Box:
  - Faceting
  - Spell Checking
  - More Like This
  - Clustering (Carrot2)
• Extend
  - Clustering (with Mahout)
  - Frequent Item Mining (with Mahout)
Clustering
• Automatically group similar content together to aid users in discovering related items and/or avoiding repetitive content
• Solr has search result clustering
  - Pluggable
  - Default implementation uses Carrot2
• Mahout has Hadoop-based large-scale clustering
  - K-Means, Minhash, Dirichlet, Canopy, Spectral, etc.
Discovery In Action
• Pre-reqs:
  - Apache Ant 1.7.x, Subversion (SVN)
• Command Line 1:
  - svn co https://svn.apache.org/repos/asf/lucene/dev/trunk solr-trunk
  - cd solr-trunk/solr/
  - ant example
  - cd example
  - java -Dsolr.clustering.enabled=true -jar start.jar
• Command Line 2:
  - cd exampledocs; java -jar post.jar *.xml
• http://localhost:8983/solr/browse?q=&debugQuery=true&annotateBrowse=true
Solr + Mahout
Basics
• Most Mahout tasks are offline
• Solr provides many touch points for integration:
  - ClusteringEngine
    - Clustering results
  - SearchComponent
    - Suggestions: related searches, clusters, MLT, spellchecking
  - UpdateProcessor
    - Classification of documents
  - FunctionQuery
Example: Frequent Itemset Mining
• Discover frequently co-occurring items
• Use Case: Related Searches from Solr Logs
• Hadoop and sequential versions
  - Parallel FP Growth
• Input:
  - <optional document id>TAB<TOKEN1>SPACE<TOKEN2>SPACE
  - Comma, pipe also allowed as delimiters
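To make the input format and the idea of "frequently co-occurring items" concrete, here is a small Python sketch that parses lines in that format and counts item pairs that appear together at least a minimum number of times. It is a naive pair counter, deliberately not FP-Growth and not Mahout code; the helper names and the sample lines are invented for illustration.

```python
import re
from collections import Counter
from itertools import combinations


def parse_transaction(line):
    """Drop the optional 'id<TAB>' prefix, then split on space, comma, or pipe."""
    if "\t" in line:
        line = line.split("\t", 1)[1]
    return [token for token in re.split(r"[ ,|]+", line.strip()) if token]


def frequent_pairs(lines, min_support):
    """Count unordered item pairs per transaction; keep those seen >= min_support times."""
    counts = Counter()
    for line in lines:
        items = sorted(set(parse_transaction(line)))
        counts.update(combinations(items, 2))
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical transactions: two with a document id, one without.
lines = [
    "1\tsolr faceting",
    "2\tsolr faceting search",
    "solr,search",
]
pairs = frequent_pairs(lines, min_support=2)
print(pairs)
```

Counting every pair explicitly grows quickly with transaction length, which is why a real deployment would use an algorithm such as the Parallel FP Growth implementation named on the slide.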
FIM on Solr Query Logs
• Goal:
  - Extract user queries from Solr logs
  - Feed into FIM to generate Related Keyword Searches
• Context:
  - Solr Query logs
  - bin/mahout regexconverter --input $PATH_TO_LOGS --output /tmp/solr/output --regex "(?<=(\?|&)q=).*?(?=&|$)" --overwrite --transformerClass url --formatterClass fpg
  - bin/mahout fpg --input /tmp/solr/output/ -o /tmp/solr/fim/output -k 25 -s 2 --method mapreduce
  - bin/mahout seqdumper --seqFile /tmp/solr2/results/frequentpatterns/part-r-00000
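The regular expression passed to regexconverter can be tried out in Python to see what it captures. The sketch below applies the same pattern to a few made-up request lines, URL-decodes the match (roughly what a "url" transformer suggests), and joins the tokens with spaces. The sample lines and the helper names are assumptions for illustration; real Solr log lines may be shaped differently, and this is not how Mahout implements the step.

```python
import re
from urllib.parse import unquote_plus

# Same pattern as on the slide: the value of the q= parameter,
# up to the next '&' or the end of the line.
QUERY_RE = re.compile(r"(?<=(\?|&)q=).*?(?=&|$)")


def extract_query(log_line):
    """Return the URL-decoded q= value, or None when the line has no q= parameter."""
    match = QUERY_RE.search(log_line)
    if match is None:
        return None
    return unquote_plus(match.group(0))


def to_transaction(query):
    """Normalise whitespace so each query becomes one space-separated line of tokens."""
    return " ".join(query.split())


# Hypothetical request lines, not taken from a real log.
sample_lines = [
    "GET /solr/select?q=chris+hostetter&rows=10",
    "/solr/select?rows=10&q=faceted%20search",
    "/solr/admin/ping",
]
queries = [extract_query(line) for line in sample_lines]
transactions = [to_transaction(q) for q in queries if q]
print(transactions)
```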
Output
• Key: Chris: Value:
  ([Chris, Hostetter],870),
  ([Chris],870),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power, Mastering],18),
  ([Search, Faceted, Chris, Hostetter, Webcast, Power],18),
  ([Search, Faceted, Chris, Hostetter],18),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone, QA, Refcard],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors, DZone],12),
  ([Solr, new, Chris, Hostetter, webcast, along, sponsors],12),
  ([Solr, new, Chris, Hostetter, webcast, along],12),
  ([Solr, new, Chris, Hostetter, webcast],12),
  ([Solr, new, Chris, Hostetter],12)
Resources
• http://lucene.apache.org
• http://mahout.apache.org
• http://manning.com/owen
• http://manning.com/ingersoll
• http://[email protected]
• grant@[email protected]
• @gsingers
Appendix
Mahout Overview
Math Vectors/Matrices/SVD
Recommenders Clustering Classification Freq. Pattern Mining
Genetic
Utilities/Integration Lucene/Vectorizer
Collections (primitives)
Apache Hadoop
Applications
Examples
See http://cwiki.apache.org/confluence/display/MAHOUT/Algorithms