scaling recommendations, semantic search, & data analytics with solr
TRANSCRIPT
Scaling Recommendations, Semantic Search, & Data Analytics with Solr
Trey GraingerDirector of Engineering, Search & Analytics
@
Atla
Atlanta Solr Meetup 2014.10.21, Atlanta Tech Village
Sponsored by:
About Me
Trey Grainger Director of Engineering, Search & Analytics
• Joined CareerBuilder in 2007 as Software Engineer• MBA, Management of Technology – GA Tech• BA, Computer Science, Business, & Philosophy – Furman University• Mining Massive Datasets (in progress) - Stanford University
• Fun outside of CB: • Author (Solr in Action), plus several research papers• Frequent conference speaker• Founder of Celiaccess.com, the gluten-free search engine• Lucene/Solr contributor
Overview
• Intro• CareerBuilder’s Search Infrastructure• Solr as a Recommendation Engine• Semantic Search with Solr • Solr-powered Data Analytics• Q & A
My Search Team
Joe StreekySearch Framework Development Manager
Search Infrastructure Team Core Search Team
Job Search Team Candidate Search Team Relevancy & Recommendations Team
Applied Search Teams:
About Me
Joseph Streeky Manager, Search Framework Development
• Joined CareerBuilder in 2005 as Software Engineer• BS, Computer Science – GA Tech• Natural Language Processing – Columbia University• Software Engineering for SaaS – University of California, Berkeley
About Search @CareerBuilder
• 2 million active jobs each month
• 60 million actively searchable resumes
• 450 globally distributed search servers (in the U.S., Europe, & the cloud)
• Thousands of unique, dynamically generated search indexes
• 1.5 billion search documents
• 2-3 million searches an hour
Our Search Platform
• Generic Search API wrapping Solr + our domain stack
• Goal: Abstract away search into a simple API so that any engineer can build search-based products with no prior search background
• 3 Supported Methods (with rich syntax):– AddDocument– DeleteDocument– Search*users pass along their own dynamically-defined schemas on each call
Business Case for Recommendations
• For companies like CareerBuilder, recommendations can provide as much or even greater business value (i.e. views, sales, job applications) than user-driven search capabilities.
• Recommendations create stickiness to pull users back to your company’s website, app, etc.
• John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.
• Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.
• Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.
• Jane is a nurse educator in Boston seeking between $40K and $60K working in the state of Massachusetts
Consider the information you know about your users
http://localhost:8983/solr/jobs/select/? fl=jobtitle,city,state,salary& q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 ) AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA”) AND _val_:"map(salary, 40000, 60000,10, 0)”
*Example from chapter 16 of Solr in Action
Query for Jane
Jane is a nurse educator in Boston seeking between $40K and $60K working in the state of Massachusetts
{ ... "response":{"numFound":22,"start":0,"docs":[ {"jobtitle":"Clinical Educator (New England/ Boston)", "city":"Boston", "state":"MA", "salary":41503},
…]}}
*Example documents available @ https://github.com/treygrainger/solr-in-action/blob/first-edition/example-docs/ch16/
Search Results for Jane
{"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183},
{"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359}
• We built a recommendation engine!
• What is a recommendation engine?
– A system that uses known information (or derived information from that known information) to automatically suggest relevant content
• Our example was just an attribute based recommendation… we’ll see that behavioral-based (i.e. collaborative filtering) is also possible.
What did we just do?
Redefining “Search Engine”
• “Lucene is a high-performance, full-featured text search engine library…”
Yes, but really…
• Lucene is a high-performance, fully-featured token matching and scoring library… which can perform full-text searching.
Redefining “Search Engine”
or, in machine learning speak:
• A Lucene index is multi-dimensional sparse matrix… with very fast and powerful lookup and vector multiplication capabilities.
• Think of each field as a matrix containing each term mapped to each document
The Lucene Inverted Index (traditional text example)
Term Documents
a doc1 [2x]
brown doc3 [1x]
, doc5 [1x]
cat doc4 [1x]
cow doc2 [1x]
, doc5 [1x]
…...
once doc1 [1x]
, doc5 [1x]
over doc2 [1x]
, doc3 [1x]
the doc2 [2x]
, doc3 [2x]
, doc4
[2x], doc5
[1x]
… …
Document Content Field
doc1 once upon a time, in a land far, far away
doc2 the cow jumped over the moon.
doc3 the quick brown fox jumped over the lazy dog.
doc4 the cat in the hat
doc5 The brown cow said “moo” once.
… …
What you SEND to Lucene/Solr:How the content is INDEXED into Lucene/Solr (conceptually):
Match Text Queries to Text Fields
/solr/select/?q=jobcontent:(software engineer)
Job Content Field Documents
… …engineer doc1, doc3, doc4,
doc5
…mechanical doc2, doc4, doc6
… …software doc1, doc3, doc4,
doc7, doc8
… …
doc5
doc7 doc8
doc1 doc3 doc4
engineer
software
software engineer
Beyond Text Searching
• Lucene/Solr is a search matching engine
• When Lucene/Solr search text, they are matching tokens in the query with tokens in the index
• Anything that can be searched upon can form the basis of matching and scoring:– text, attributes, locations, results of functions, user
behavior, classifications, etc.
Approaches to Recommendations
• Content-based– Attribute-based
• i.e. income level, hobbies, location, experience
– Classification-based• i.e. “medical//nursing//oncology”, “animal//dog//terrier”
– Textual Similarity-based• i.e. Solr’s MoreLikeThis Request Handler & Search Handler
– Concept-based• i.e. Solr => “software engineer”, “java”, “search”, “open source”
• Collaborative Filtering• “Users who liked that also liked this…”
• Hybrid Approaches
Collaborative Filtering
Term Documentsuser1 doc1, doc5user2 doc2user3 doc2user4 doc1, doc3,
doc4, doc5user5 doc1, doc4… …
Document “Users who bought this product” field
doc1 user1, user4, user5
doc2 user2, user3
doc3 user4doc4 user4, user5
doc5 user4, user1… …
What you SEND to Lucene/Solr: How the content is INDEXED into Lucene/Solr (conceptually):
Step 1: Find similar users who like the same documents
Document “Users who bought this product” field
doc1 user1, user4, user5
doc2 user2, user3
doc3 user4
doc4 user4, user5doc5 user4, user1… …
Top-scoring results (most similar users):1) user4 (2 shared likes)2) user5 (2 shared likes)3) user 1 (1 shared like)
doc1
user1 user4 user5
user4 user5
doc4
q=documentid: ("doc1" OR "doc4")
*Source: Solr in Action, chapter 16
Step 2: Search for docs “liked” by those similar users
Term Documents
user1 doc1, doc5
user2 doc2
user3 doc2
user4 doc1, doc3, doc4, doc5
user5 doc1, doc4
… …
Top recommended documents:1) doc1 (matches user4, user5, user1)2) doc4 (matches user4, user5)3) doc5 (matches user4, user1)4) doc3 (matches user4)
// doc2 does not match
Most similar users:1) user4 (2 shared likes)2) user5 (2 shared likes)3) user 1 (1 shared like)
/solr/select/?q=userlikes:("user4"^2 OR "user5"^2 OR "user1"^1)
*Source: Solr in Action, chapter 16
solrconfig.xml: <requestHandler name="/mlt" class="solr.MoreLikeThisHandler" />
Query:
/solr/jobs/mlt/?df=jobdescription& fl=id,jobtitle& rows=3& q=J2EE& // recommendations based on top scoring doc
mlt.fl=jobtitle,jobdescription& // inspect these fields for interesting terms
mlt.interestingTerms=details& // return the interesting terms
mlt.boost=true
Content-based Recommendations: More Like This (Query)
*Example from chapter 16 of Solr in Action
More Like This (Results)
{"match":{"numFound":122,"start":0,"docs":[ {"id":"fc57931d42a7ccce3552c04f3db40af8dabc99dc",
"jobtitle":"Senior Java / J2EE Developer"}] }, "response":{"numFound":2225,"start":0,"docs":[ {"id":"0e953179408d710679e5ddbd15ab0dfae52ffa6c",
"jobtitle":"Sr Core Java Developer"}, {"id":"5ce796c758ee30ed1b3da1fc52b0595c023de2db",
"jobtitle":"Applications Developer"}, {"id":"1e46dd6be1750fc50c18578b7791ad2378b90bdd",
"jobtitle":"Java Architect/ Lead Java Developer - WJAV Java - Java in Pittsburgh PA"},]},
"interestingTerms":[ "jobdescription:j2ee",1.0, "jobdescription:java",0.68131137, "jobdescription:senior",0.52161527, "jobtitle:developer",0.44706684, "jobdescription:source",0.2417754, "jobdescription:code",0.17976432, "jobdescription:is",0.17765637, "jobdescription:client",0.17331646, "jobdescription:our",0.11985878, "jobdescription:for",0.07928475, "jobdescription:a",0.07875194, "jobdescription:to",0.07741922, "jobdescription:and",0.07479082]}}
*Example from chapter 16 of Solr in Action
More Like This (passing in external document)
/solr/jobs/mlt/?df=jobdescription& fl=id,jobtitle& mlt.fl=jobtitle,jobdescription& mlt.interestingTerms=details& mlt.boost=true
stream.body=Solr is an open source enterprise search platform from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Providing distributed search and index replication, Solr is highly scalable. Solr is the most popular enterprise search engine. Solr 4 adds NoSQL features.
*Example from chapter 16 of Solr in Action
More Like This (Results)
{"response":{"numFound":2221,"start":0,"docs":[ {"id":"eff5ac098d056a7ea6b1306986c3ae511f2d0d89 ",
"jobtitle":"Enterprise Search Architect…"}, {"id":"37abb52b6fe63d601e5457641d2cf5ae83fdc799 ",
"jobtitle":"Sr. Java Developer"}, {"id":"349091293478dfd3319472e920cf65657276bda4 ",
"jobtitle":"Java Lucene Software Engineer"},]},
"interestingTerms":[
"jobdescription:search",1.0,
"jobdescription:solr",0.9155779, "jobdescription:features",0.36472517, "jobdescription:enterprise",0.30173126, "jobdescription:is",0.17626463, "jobdescription:the",0.102924034, "jobdescription:and",0.098939896]} }
*Example from chapter 16 of Solr in Action
Understanding Our Users
• Machine learning algorithms can help us understand what matters most to different groups of users.
Example: Willingness to relocate for a job (miles per percentile)
Software Engineers
Restaurant Workers
Search & Recommendations are on a continuum...
• Why limit yourself to JUST explicit search or JUST automated recommendations?
• By augmenting your user’s explicit queries with information you know about them, you can personalize their search results.
• Examples:– A known software engineer runs a blank keyword search in New York…
• Why not show software engineering higher in the results?
– A new user runs a keyword-only search for nurse• Why not use the user’s IP address to boost documents geographically
closer?
Clustering Query
/solr/clustering/?q=(solr or lucene) &rows=100 &carrot.title=titlefield &carrot.snippet=titlefield &LingoClusteringAlgorithm.desiredClusterCountBase=25//clustering & grouping don’t currently play nicely
Allows you to dynamically identify “concepts” and their prevalence within a user’s top search results
Original Query: q=(solr or lucene)
// can be a user’s search, their job title, a list of skills, // or any other keyword rich data source
Clustering Results
Clusters Identified:Developer (22) Java Developer (13) Software (10) Senior Java Developer (9) Architect (6) Software Engineer (6) Web Developer (5) Search (3) Software Developer (3) Systems (3) Administrator (2) Hadoop Engineer (2) Java J2EE (2) Search Development (2) Software Architect (2) Solutions Architect (2)
Stage 1: Identify Concepts
content:(“Developer”^22 or “Java Developer”^13 or “Software ”^10 or “Senior Java Developer”^9 or “Architect ”^6 or “Software Engineer”^6 or “Web Developer ”^5 or “Search”^3 or “Software Developer”^3 or “Systems”^3 or “Administrator”^2 or “Hadoop Engineer”^2 or “Java J2EE”^2 or “Search Development”^2 or “Software Architect”^2 or “Solutions Architect”^2)
// Your can also add the user’s location or the original keywords to the // recommendations search if it helps results quality for your use-case.
Stage 2: Use Semantic Links in your relevancy calculation
• Our primary approach: Search Co-occurrences[1] + Point-wise Mutual Information[1] + PGMHD[2]
• Strategy: Map/Reduce job which computes similar searches run for the same users
John searched for “java developer” and “j2ee”Jane searched for “registered nurse” and “r.n.” and “nurse”.Zeke searched for “java developer” and “scala” and “jvm”
• By mining the searches of tens millions of search terms per day, we get a list of top related searches, using multiple statistical measures.
• We also tie each search term to the top category of jobs (i.e java developer, truck driver, etc.), so that we know in what context people search for each term.
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
[2] K. Aljadda, M.Korayem, C. Ortiz, T. Grainger, J. Miller, W. York. "PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems," in IEEE Big Data 2014
Synonym Discovery Techniques
Examples of “related search terms”
Example: “accounting”accountant 8880,accounts payable 5235,finance 3675,accounting clerk 3651,bookkeeper 3225,controller 2898,staff accountant 2866,accounts receivable 2842
Example: “RN”:registered nurse 6588,rn registered nurse 4300,nurse 2492,nursing 912,lpn 707,healthcare 453,rn case manager 446,registered nurse rn 404,director of nursing 321,case manager 292
Categories of related terms...
Synonyms: cpa => Certified Public Accountant rn => Registered Nurse r.n. => Registered Nurse
Ambiguous Terms*: driver => driver (trucking) ~80% driver => driver (software) ~20%
Related Terms: r.n. => nursing, bsn hadoop => mapreduce, hive, pig
*disambiguation occurs based upon context and popularity
Why Solr for Analytics?
• Allows “ad-hoc” querying of data by keywords
• Is good at on-the-fly aggregate calculations (facets + stats + functions + grouping)
• Solr is horizontally scalable, and thus able to handle billions of documents
• Insanely Fast queries, encouraging user exploration
//Range Faceting&facet.range=years_experience&facet.range.start=0&facet.range.end=10&facet.range.gap=1&facet.range.other=after
Faceting Overview/solr/select/?q=…&facet=true
//Field Faceting&facet.field=city
"facet_ranges":{ "years_experience":{ "counts":[ "0",1010035, "1",343831, … "9",121090 ], … "after":59462}}
"facet_fields":{ "city":[ "new york, ny",2337, "los angeles, ca",1693, "chicago, il",1535, … ]}
"facet_queries":{ "0 to 10 km":1187, "10 to 25 km":462, "25 to 50 km":794, "50+":105296 },
//Query Faceting:&facet.query={!frange key="0 to 10 km" l=0 u=10 incll=false}geodist()
&facet.query={!frange key="10 to 25 km" l=10 u=25 incll=false}geodist()
&facet.query={!frange key="25 to 50 km" l=25 u=50 incll=false}geodist()
&facet.query={!frange key="50+" l=50 incll=false}geodist()
&sfield=location&pt=37.7770,-122.4200
/solr/select/q=...&facet=true&facet.field=state
/solr/select/?q=…&facet=true&facet.field=month*
/solr/select/?q=…&facet=true&facet.field=military_experience
Building Blocks…
*string field in format 201305
Building Blocks…
/solr/select/?q="construction worker"&fq=city:"las vegas, nv"&facet=true&facet.field=company
/solr/select/?q="construction worker"&fq=city:"las vegas, nv"&facet=true&facet.field=lastjobtitle
Building Blocks…
/solr/select/? q=...&facet=true&facet.field=experience_ranges
/solr/select/?q=...&facet=true&facet.field=management_experience
Query 1:/solr/select/?...fq={!geofilt sfield=latlong pt=37.777,-122.420 d=80}&facet=true&facet.field=city&
"facet_fields":{ "city":[ "san francisco, ca",11713, "san jose, ca",3071, "oakland, ca",1482, "palo alto, ca",1318, "santa clara, ca",1212, "mountain view, ca",1045, "sunnyvale, ca",1004, "fremont, ca",726, "redwood city, ca",633, "berkeley, ca",599]}Query 2:
/solr/select/?...&facet=true&facet.field=city&fq=( _query_:"{!geofilt sfield=latlong pt=37.7770,-122.4200 d=20} " //san francisco OR _query_:"{!geofilt sfield=latlong pt=37.338,-121.886 d=20} " //san jose … OR _query_:"{!geofilt sfield=latlong pt=37.870,-122.271 d=20} " //berkeley
)
Geo-spatial Analytics
SOLR-2894: “Distributed Pivot Faceting”
Status: This feature was developed primarily by the CareerBuilder search team and committed by Chris Hostetter to the latest released version of Solr (4.10).
#1 Most requested Solr feature
56
SOLR-3583: “Stats within (pivot) facets”
Status: We have submitted a patch (built on top of distributed pivot facets), but this will likely be replaced with SOLR-6350 + SOLR 6351 in the future.
/solr/select?q=...&facet=true&facet.pivot=state,city&
facet.stats.percentiles=true&facet.stats.percentiles.averages=true&
facet.stats.percentiles.field=compensation&
f.compensation.stats.percentiles.requested=10,25,50,75,90&f.compensation.stats.percentiles.lower.fence=1000&f.compensation.stats.percentiles.upper.fence=200000&f.compensation.stats.percentiles.gap=1000
"facet_pivot":{ "state,city":[{ "field":"state", "value":"california", "count":1872280, "statistics":[ "compensation",[ "percentiles",[ "10.0","26000.0", "25.0","31000.0", "50.0","43000.0", "75.0","66000.0", "90.0","94000.0"], "percentiles_average",52613.72, "percentiles_count",1514592]], "pivot":[{ "field":"city", "value":"los angeles, ca", "count":134851, "statistics":{ "compensation":[ "percentiles",[ "10.0","26000.0", "25.0","31000.0", "50.0","45000.0", "75.0","70000.0", "90.0","95000.0"], "percentiles_average",54122.45, "percentiles_count",213481]}} … ]}]}
SOLR-3583: “Stats within (pivot) facets”
Real-world Use Case
Stats Pivot Faceting (Percentiles)Stats Pivot Faceting (Average)
Another
Pivot…Field Facet
Key Takeaways
• Traditional search & recommendations are at two ends of a continuum between user-driven and automatic matching, and Solr is really good at giving you access to that full continuum.
• Searching on text is one of many forms of matching. If you can migrate to searching on behaviors, entities, and concepts, you will see much better, more personalized results.
Solr is a highly-scalable platform for rapid matching across large amounts of unstructured and structured data. Performing real-time analytics at scale is not only possible, but incredibly fast and flexible.
2014 Publications & PresentationsBooks:Solr in Action - A comprehensive guide to implementing scalable
search using Apache Solr
Research papers:● Towards a Job title Classification System● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from
User Behavior● sCooL: A system for academic institution name normalization● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific jargon● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems● SKILL: A System for Skill Identification and Normalization (pending publication)
Speaking Engagements:● WSDM 2014 Workshop: “Web-Scale Classification: Classifying Big Data from the Web”● Atlanta Solr Meetup● Atlanta Big Data Meetup● The Second International Symposium on Big Data and Data Analytics● Lucene/Solr Revolution 2014● RecSys 2014● IEEE Big Data Conference 2014
Contact Info
Yes, WE ARE HIRING @CareerBuilder. Come talk with me if you are interested…
▪ Trey [email protected]@treygrainger
Other presentations: http://www.treygrainger.com http://solrinaction.com
Meetup discount (42% off): solrmuau
Other Presentations: