TRANSCRIPT
Bay Area Search
Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation
2015.11.10
Trey Grainger Director of Engineering, Search & Recommendations
About Me
• Joined CareerBuilder in 2007 as a Software Engineer
• MBA, Management of Technology – Georgia Tech
• BA, Computer Science, Business, & Philosophy – Furman University
• Mining Massive Datasets (in progress) – Stanford University

Fun outside of CB:
• Co-author of Solr in Action, plus a handful of research papers
• Frequent conference speaker
• Founder of Celiaccess.com, the gluten-free search engine
• Lucene/Solr contributor
Agenda
• Introduction
• Traditional Keyword Search vs. Personalization vs. Semantic Search
• Searching on Intent
  - Type-ahead Prediction
  - Spelling Correction
  - Entity / Entity-type Resolution
  - Contextual Disambiguation
  - Semantic Query Parsing
  - Query Augmentation
  - The Knowledge Graph
• Conclusion
At CareerBuilder, Solr Powers...
Search by the Numbers

Powering 50+ search experiences including:
• 100 million+ searches per day
• 30+ software developers, data scientists + analysts
• 500+ search servers
• 1.5 billion+ documents indexed and searchable
• 1 global search technology platform
...and many more
Conceptual Framework for Information Retrieval (three overlapping circles):
• Traditional Keyword Search
• Recommendations
• Semantic Search
Overlaps: Personalized Search, Augmented Search, and Domain-aware Matching, with User Intent at the center.
Traditional Search
Classic Lucene relevancy algorithm (though BM25 will become the default soon):
*Source: Solr in Action, chapter 3
Score(q, d) = coord(q, d) · queryNorm(q) · ∑ over t in q of ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) )

Where:
  t = term; d = document; q = query; f = field
  tf(t in d) = sqrt(numTermOccurrencesInDocument)
  idf(t) = 1 + log(numDocs / (docFreq + 1))
  coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
  queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
  sumOfSquaredWeights = q.getBoost()² · ∑ over t in q of (idf(t) · t.getBoost())²
  norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()
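The practical scoring function above can be transcribed almost directly into Python (a simplified sketch: all boosts and lengthNorm are treated as 1, and `docs` is a toy corpus of tokenized documents invented for illustration):

```python
import math

def idf(term, docs):
    # idf(t) = 1 + log(numDocs / (docFreq + 1))
    doc_freq = sum(1 for d in docs if term in d)
    return 1 + math.log(len(docs) / (doc_freq + 1))

def score(query_terms, doc, docs):
    # Simplified classic Lucene scoring; boosts and lengthNorm omitted.
    matched = [t for t in query_terms if t in doc]
    coord = len(matched) / len(query_terms)                 # coord(q, d)
    sum_sq = sum(idf(t, docs) ** 2 for t in query_terms)    # sumOfSquaredWeights
    query_norm = 1 / math.sqrt(sum_sq)                      # queryNorm(q)
    raw = sum(math.sqrt(doc.count(t)) * idf(t, docs) ** 2 for t in matched)
    return raw * coord * query_norm

docs = [["solr", "search", "engine"], ["java", "search"], ["solr", "java"]]
full_match = score(["solr", "search"], docs[0], docs)     # both terms match
partial_match = score(["solr", "search"], docs[1], docs)  # only "search" matches
```

As expected, the document matching both query terms outscores the partial match, since both its raw term score and its coord factor are higher.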
That's great, but what about domain-specific knowledge?

• News search: popularity and freshness drive relevance
• Restaurant search: geographical proximity and price range are critical
• Ecommerce: likelihood of a purchase is key
• Movie search: more popular titles are generally more relevant
• Job search: category of job, salary range, and geographical proximity matter

TF * IDF of keywords can't hold its own against good domain-specific relevance factors!
Example of a domain-specific relevancy calculation

News website (each function weighted 25%):

/select?
  fq=$myQuery&
  q=_query_:"{!func}scale(query($myQuery),0,100)" AND
    _query_:"{!func}div(100,map(geodist(),0,1,1))" AND
    _query_:"{!func}recip(rord(publicationDate),0,100,100)" AND
    _query_:"{!func}scale(popularity,0,100)"&
  myQuery="street festival"&
  sfield=location&
  pt=33.748,-84.391

*Example from chapter 16 of Solr in Action
Fancy boosting functions

Separating "relevancy" and "filtering" from the query:

q=_val_:"$keywords"&
fq={!cache=false v=$keywords}&
keywords=solr

Keywords (50%) + distance (25%) + category (25%):

q=_val_:"scale(mul(query($keywords),1),0,50)" AND
  _val_:"scale(sum($radiusInKm,mul(query($distance),-1)),0,25)" AND
  _val_:"scale(mul(query($category),1),0,25)"&
keywords=solr&
radiusInKm=48.28&
distance=_val_:"geodist(latitudelongitude.latlon_is,33.77402,-84.29659)"&
category=jobtitle:"java developer"&
fq={!cache=false v=$keywords}
Personalization / Recommendations
John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development.
Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry.
Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job.
Jane is a nurse educator in Boston seeking between $40K and $60K
Beyond domain knowledge… consider per-user knowledge
http://localhost:8983/solr/jobs/select/?
  fl=jobtitle,city,state,salary&
  q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 )
    AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA" )
    AND _val_:"map(salary, 40000, 60000, 10, 0)"
*Example from chapter 16 of Solr in Action
Query for Jane
Jane is a nurse educator in Boston seeking between $40K and $60K
Search Results for Jane

{ ...
  "response":{"numFound":22,"start":0,"docs":[
      {"jobtitle":"Clinical Educator (New England/ Boston)",
       "city":"Boston", "state":"MA", "salary":41503},
      {"jobtitle":"Nurse Educator",
       "city":"Braintree", "state":"MA", "salary":56183},
      {"jobtitle":"Nurse Educator",
       "city":"Brighton", "state":"MA", "salary":71359},
      ...]}}

*Example documents available @ http://github.com/treygrainger/solr-in-action/
What did we just do?

We built a recommendation engine!

What is a recommendation engine?
"A system that uses known information (or information derived from that known information) to automatically suggest relevant content"

Our example was just an attribute-based recommendation... but we can also use any behavioral-based features as well (i.e., collaborative filtering).
For full coverage of building a recommendation engine in Solr…
See my talk from Lucene Revolution 2012 (Boston):
Personalized Search
Why limit yourself to JUST explicit search or JUST automated recommendations?
By augmenting your user’s explicit queries with information you know about them, you can personalize their search results.
Examples:
A known software engineer runs a blank job search in New York...
Why not show software engineering jobs higher in the results?

A new user runs a keyword-only search for nurse...
Why not use the user's IP address to boost documents geographically closer?
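A sketch of what that query-time augmentation might look like (the field names `category` and `location`, and the boost weights, are hypothetical; `bq` and `bf` are standard edismax boost parameters, which re-rank rather than filter):

```python
def personalize(keywords, user):
    # Explicit query goes in q; user-derived preferences become boost
    # queries (bq) and boost functions (bf), so they re-rank results
    # without filtering anything out.
    params = {"defType": "edismax", "q": keywords or "*:*", "bq": []}
    if user.get("job_category"):
        params["bq"].append('category:"%s"^5' % user["job_category"])
    if user.get("latlon"):
        lat, lon = user["latlon"]
        # Boost documents geographically closer to the user (e.g., from IP)
        params["bf"] = "recip(geodist(location,%s,%s),1,10,1)" % (lat, lon)
    return params

# A known software engineer in New York runs a blank search:
engineer = {"job_category": "software engineering", "latlon": (40.71, -74.01)}
params = personalize("", engineer)
```

A blank search still returns everything (`*:*`), but the user's known category and location quietly re-order the results.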
Semantic Search
What's the problem we're trying to solve today?

User's Query:
machine learning research and development Portland, OR software engineer AND hadoop, java

Traditional Query Parsing:
(machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)

Semantic Query Parsing:
"machine learning" AND "research and development" AND "Portland, OR" AND "software engineer" AND hadoop AND java

Semantically Expanded Query:
("machine learning"^10 OR "data scientist" OR "data mining" OR "artificial intelligence")
AND ("research and development"^10 OR "r&d")
AND ("Portland, OR"^10 OR "Portland, Oregon" OR {!geofilt pt=45.512,-122.676 d=50 sfield=geo})
AND ("software engineer"^10 OR "software developer")
AND (hadoop^10 OR "big data" OR hbase OR hive)
AND (java^10 OR j2ee)
...we also really want to search on "things", not "strings"...
(e.g., entities like job level, job title, company, and school + degree)
Building an Intent Engine

Search Box → Intent Engine → Query Re-writing → Search Results

Intent Engine components (backed by the Knowledge Graph):
• Type-ahead Prediction
• Spelling Correction
• Entity / Entity-type Resolution
• Contextual Disambiguation
• Semantic Query Parsing
• Query Augmentation

Relevancy Engine ("re-expressing intent"):
• Machine-learned Ranking
• User Feedback (Clarifying Intent)
Type-ahead Predictions
Semantic Autocomplete
• Shows top terms for any search
• Breaks out job titles, skills, companies, related keywords, and other categories
• Understands abbreviations, alternate forms, misspellings
• Supports full Boolean syntax and multi-term autocomplete
• Enables fielded search on entities, not just keywords
Spelling Correction
Entity / Entity-type Resolution
Differentiating related terms
Synonyms:
  cpa => certified public accountant
  rn => registered nurse
  r.n. => registered nurse

Ambiguous Terms*:
  driver => driver (trucking) ~80% likelihood
  driver => driver (software) ~20% likelihood

Related Terms:
  r.n. => nursing, bsn
  hadoop => mapreduce, hive, pig
*differentiated based upon user and query context
Building a Taxonomy of Entities
Many ways to generate this:
• Topic modelling
• Clustering of documents
• Statistical analysis of interesting phrases
• Buy a dictionary (often doesn't work for domain-specific search problems)
• ...

Our strategy:
Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain [1]
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
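A crude version of that mining step just normalizes and counts multi-word queries (a sketch only; the cited paper's approach is considerably more sophisticated, and the count threshold here is arbitrary):

```python
from collections import Counter

def mine_phrases(query_log, min_count=2):
    # Normalize queries, then keep multi-word ones searched often enough
    # to be candidate domain-specific phrases for the taxonomy.
    counts = Counter(q.strip().lower() for q in query_log)
    return {q: n for q, n in counts.items() if n >= min_count and " " in q}

log = ["Registered Nurse", "registered nurse", "registered nurse",
       "java", "java developer", "java developer"]
phrases = mine_phrases(log)
```

Here `phrases` keeps "registered nurse" and "java developer" with their frequencies, while the single-word query "java" is excluded.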
Entity-type Recognition
Build classifiers trained on external data sources (Wikipedia, DBPedia, WordNet, etc.), as well as on our own domain data.
The subject for a future talk / research paper…
Example entities: java developer, registered nurse, emergency room, director, Portland, OR, part-time
Entity types: job title, skill, job level, location, work type
Contextual Disambiguation
How do we handle phrases with ambiguous meanings?
Example Related Keywords (representing multiple meanings):
  driver => truck driver, linux, windows, courier, embedded, cdl, delivery
  architect => autocad drafter, designer, enterprise architect, java architect, architectural designer, data architect, oracle, java, architectural drafter, autocad, drafter, cad, engineer
  ...
Discovering ambiguous phrases
1) Classify the users who ran each search in the search logs (i.e., by the job title classifications of the jobs to which they applied)

2) Create a probabilistic graphical model of those classifications mapped to each keyword phrase

3) Segment the search term => related search terms list by classification, to return a separate related terms list per classification
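The three steps above can be sketched as follows (a toy model, not the production PGMHD implementation; the session data and classification labels are invented for illustration):

```python
from collections import defaultdict, Counter

def segment_related_terms(sessions):
    # sessions: (user_classification, phrases_searched) pairs; the
    # classification (step 1) is assumed to come from the job titles of
    # the jobs each user applied to.
    class_counts = defaultdict(Counter)  # phrase -> classification counts
    cooccur = defaultdict(Counter)       # (phrase, class) -> co-searched terms
    for user_class, phrases in sessions:
        for p in phrases:
            class_counts[p][user_class] += 1
            for other in phrases:
                if other != p:
                    cooccur[(p, user_class)][other] += 1

    def senses(phrase):
        # Step 2: P(classification | phrase); step 3: a separate
        # related-terms list per classification.
        total = sum(class_counts[phrase].values())
        return [{"classification": c,
                 "probability": n / total,
                 "related": [t for t, _ in cooccur[(phrase, c)].most_common(5)]}
                for c, n in class_counts[phrase].most_common()]
    return senses

sessions = [
    ("trucking", ["driver", "cdl", "delivery"]),
    ("trucking", ["driver", "truck driver"]),
    ("software", ["driver", "linux", "embedded"]),
]
driver_senses = segment_related_terms(sessions)("driver")
```

For "driver" this yields a trucking sense (probability 2/3, related to cdl, delivery, truck driver) and a software sense (probability 1/3, related to linux, embedded).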
Disambiguated meanings (represented as term vectors)

Example Related Keywords (Disambiguated Meanings):

architect
  1: enterprise architect, java architect, data architect, oracle, java, .net
  2: architectural designer, architectural drafter, autocad, autocad drafter, designer, drafter, cad, engineer

driver
  1: linux, windows, embedded
  2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier

designer
  1: design, print, animation, artist, illustrator, creative, graphic artist, graphic, photoshop, video
  2: graphic, web designer, design, web design, graphic design, graphic designer
  3: design, drafter, cad designer, draftsman, autocad, mechanical designer, proe, structural designer, revit
… …
Using the disambiguated meaningsIn a situation where a user searches for an ambiguous phrase, what information can we use to pick the correct underlying meaning?
1. Any pre-existing knowledge about the user:
   • User is a software engineer
   • User has previously run searches for "c++" and "linux"

2. Context within the query:
   • User searched for windows AND driver vs. courier OR driver
3. If all else fails (and there is no context), use the most commonly occurring meaning.
driver 1: linux, windows, embedded2: truck driver, cdl driver, delivery driver, class b driver, cdl, courier
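A minimal sketch of that selection logic, assuming each sense carries the related-term list mined earlier (the sense IDs and term lists follow the driver example; ordering by commonness is an assumption):

```python
def pick_sense(senses, context_terms):
    # senses: list of (sense_id, related_terms), ordered by how commonly
    # each meaning occurs. Pick the sense whose related terms overlap the
    # query/user context most; with no overlap, fall back to the most
    # common meaning.
    context = {t.lower() for t in context_terms}
    best = max(senses, key=lambda s: len(context & set(s[1])))
    if context & set(best[1]):
        return best[0]
    return senses[0][0]

driver_senses = [
    (2, ["truck driver", "cdl", "courier", "delivery"]),  # most common (~80%)
    (1, ["linux", "windows", "embedded"]),
]
```

`pick_sense(driver_senses, ["windows"])` chooses the software sense, `pick_sense(driver_senses, ["courier"])` the trucking sense, and with no context at all the most common (trucking) sense wins.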
Semantic Query Parsing
Query Parsing: The whole is greater than the sum of the parts
project manager vs. "project" AND "manager"
building architect vs. "building" AND "architect"
software architect vs. "software" AND "architect"

Consider:
  a "software architect" designs and builds software
  a "building architect" uses software to design architecture
User's Query:
machine learning research and development Portland, OR software engineer AND hadoop java

Traditional Query Parsing:
(machine AND learning AND research AND development AND portland) OR (software AND engineer AND hadoop AND java)
≠
Identifying the correct phrase (not just the parts) is crucial here!
Probabilistic Query Parser
Goal: given a query, predict which combinations of keywords should be combined together as phrases
Example: senior java developer hadoop
Possible Parsings:
  senior, java, developer, hadoop
  "senior java", developer, hadoop
  "senior java developer", hadoop
  "senior java developer hadoop"
  "senior java", "developer hadoop"
  senior, "java developer", hadoop
  senior, java, "developer hadoop"
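Enumerating those candidate parsings is a short recursion over adjacent-term groupings (a sketch; the real parser then scores each candidate against phrase statistics from a language model):

```python
def parsings(terms):
    # Enumerate every way to group adjacent terms into phrases:
    # 2^(n-1) candidate segmentations for n terms.
    if not terms:
        return [[]]
    results = []
    for i in range(1, len(terms) + 1):
        head = " ".join(terms[:i])
        results += [[head] + rest for rest in parsings(terms[i:])]
    return results

candidates = parsings("senior java developer hadoop".split())
```

For the four-term query there are 2³ = 8 candidate segmentations, including the single-phrase and all-singles extremes.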
Input: senior hadoop developer java ruby on rails perl
Semantic Search Architecture – Query Parsing

1) Generate the previously discussed taxonomy of domain-specific phrases
   • You can mine query logs or the actual text of documents for significant phrases within your domain [1]
2) Feed these phrases to SolrTextTagger [2] (uses a Lucene FST for high-throughput term lookups)
3) Use SolrTextTagger to perform entity extraction on incoming queries (tagging documents is also possible)
4) Also invoke the probabilistic parser to dynamically identify unknown phrases from a corpus of data (language model)
5) Shown on the next slides: pass extracted entities to a Query Augmentation phase to rewrite the query with enhanced semantic understanding
[1] K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014.
[2] https://github.com/OpenSextant/SolrTextTagger
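A toy stand-in for that tagging step (greedy longest match against the phrase dictionary; SolrTextTagger does this far more efficiently with an FST, and this sketch ignores punctuation and partial overlaps):

```python
def tag_entities(query, known_phrases):
    # At each position, take the longest known phrase starting there;
    # otherwise emit the single term unchanged.
    terms = query.lower().split()
    out, i = [], 0
    while i < len(terms):
        match = None
        for j in range(len(terms), i, -1):
            candidate = " ".join(terms[i:j])
            if candidate in known_phrases:
                match = candidate
                break
        if match:
            out.append(match)
            i += match.count(" ") + 1
        else:
            out.append(terms[i])
            i += 1
    return out

phrases = {"machine learning", "research and development", "software engineer"}
tagged = tag_entities("machine learning research and development software engineer", phrases)
```

The query is segmented into the three known phrases rather than seven independent terms, which is exactly what the semantic query parser needs downstream.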
Query Augmentation
Semantic Search Architecture – Query Augmentation

Keywords: machine learning

Inputs to the Knowledge Graph (built from search behavior, application behavior, etc., plus a job title classifier, skills extractor, job level classifier, etc.):

Known keyword phrases (stored in an FST): java developer, machine learning, registered nurse

Related Phrases:
  machine learning: { data mining .9, matlab .8, data scientist .75, artificial intelligence .7, neural networks .55 }

Common Job Titles:
  machine learning: { software engineer .65, data manager .3, data scientist .25, hadoop engineer .2 }

Related Occupations:
  machine learning: { 15-1031.00 .58 Computer Software Engineers, Applications;
                      15-1011.00 .55 Computer and Information Scientists, Research;
                      15-1032.00 .52 Computer Software Engineers, Systems Software }

Modified Query:
keywords:((machine learning)^10 OR
  { AT_LEAST_2: ("data mining"^0.9, matlab^0.8, "data scientist"^0.75, "artificial intelligence"^0.7, "neural networks"^0.55) })
{ BOOST_TO_TOP: (job_title:("software engineer" OR "data manager" OR "data scientist" OR "hadoop engineer")) }
Query Enrichment
Document Enrichment
Knowledge Graph
Knowledge Graph API

Serves as a "data science toolkit" API that allows dynamically navigating and pivoting through multiple levels of relationships between items in our domain: compare the relationships of skills to keywords, job titles to skills to keywords, skills to government occupation codes, skills to experience level, etc.

• Core similarity engine, exposed via API: any product can leverage our core relationship scoring engine to score any list of entities against any other list
• Full domain support: keywords, job titles, skills, companies, job levels, locations, and all other taxonomies
• Intersections, overlaps, & relationship scoring, many levels deep: users can either provide a list of items to score, or else have the system dynamically discover the most related items (or both)
Knowledge Graph
So how does it work?
Foreground vs. Background Analysis
Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context.

z = (countFG(x) - totalDocsFG · probBG(x)) / sqrt(totalDocsFG · probBG(x) · (1 - probBG(x)))
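In Python, the statistic is a direct transcription of the formula above (the foreground/background counts below are hypothetical, chosen to echo the "hive" vs. "teacher" contrast in the example response):

```python
import math

def relatedness(count_fg, total_docs_fg, count_bg, total_docs_bg):
    # z = (countFG(x) - totalDocsFG * probBG(x))
    #     / sqrt(totalDocsFG * probBG(x) * (1 - probBG(x)))
    prob_bg = count_bg / total_docs_bg
    expected = total_docs_fg * prob_bg
    return (count_fg - expected) / math.sqrt(expected * (1 - prob_bg))

# Hypothetical counts: "hive" is overrepresented among docs matching the
# foreground query "Hadoop"; "teacher" is underrepresented.
z_hive = relatedness(count_fg=300, total_docs_fg=1000,
                     count_bg=369, total_docs_bg=100000)
z_teacher = relatedness(count_fg=5, total_docs_fg=1000,
                        count_bg=9923, total_docs_bg=100000)
```

A large positive z means the term is far more common in the foreground than the background predicts (highly related); a negative z means it is less common than chance (unrelated, or anti-related).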
Foreground Query: "Hadoop"

{ "type":"keywords",
  "values":[
    { "value":"hive", "relatedness":0.9773, "popularity":369 },
    { "value":"java", "relatedness":0.9236, "popularity":15653 },
    { "value":".net", "relatedness":0.5294, "popularity":17683 },
    { "value":"bee", "relatedness":0.0, "popularity":0 },
    { "value":"teacher", "relatedness":-0.2380, "popularity":9923 },
    { "value":"registered nurse", "relatedness":-0.3802, "popularity":27089 } ] }

We are essentially boosting terms which are more related to some known feature (and ignoring terms which are equally likely to appear in the background corpus).
Knowledge Graph
Knowledge Graph – Potential Use Cases
Cross-walk between Types
• Have an ID field, but want to enable free text search on the most associated entity with that ID?
• Have a "state" (geo) search box, but want to accept any free-text location and map it to the right state?
• Have an old classification taxonomy and want to know how the values from the old system now map into the new values?

Build User Profiles from Search Logs
• If someone searches for "Java", and then "JQuery", and then "CSS", and then "JSP", what do those have in common?
• What if they search for "Java", and then "C++", and then "Assembly"?

Discover Relationships Between Anything
• If I want to become a data scientist and know Python, what libraries should I learn?
• If my last job was mid-level software engineer and my current job is Engineering Lead, what are my most likely next roles?

Traverse arbitrarily deep, sort on anything
• Build an instant co-occurrence matrix, sort the top values by their relatedness, and then add in any number of additional dimensions (RAM permitting).

Data Cleansing
• Have dirty taxonomies and need to figure out which items don't belong?
• Need to understand the conceptual cohesion of a document (vs. spammy or off-topic content)?
Knowledge Graph
2014 - 2015 Publications & Presentations

Books:
● Solr in Action - A comprehensive guide to implementing scalable search using Apache Solr

Research papers:
● Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon - 2014
● Towards a Job Title Classification System - 2014
● Augmenting Recommendation Systems Using a Model of Semantically-related Terms Extracted from User Behavior - 2014
● sCooL: A System for Academic Institution Name Normalization - 2014
● PGMHD: A Scalable Probabilistic Graphical Model for Massive Hierarchical Data Problems - 2014
● SKILL: A System for Skill Identification and Normalization - 2015
● Carotene: A Job Title Classification System for the Online Recruitment Domain - 2015
● WebScalding: A Framework for Big Data Web Services - 2015
● A Pipeline for Extracting and Deduplicating Domain-Specific Knowledge Bases - 2015
● Macau: Large-Scale Skill Sense Disambiguation in the Online Recruitment Domain - 2015
● Improving the Quality of Semantic Relationships Extracted from Massive User Behavioral Data - 2015
● Query Sense Disambiguation Leveraging Large Scale User Behavioral Data - 2015

Speaking Engagements:
● Over a dozen in the last year: Lucene/Solr Revolution 2014, WSDM 2014, Atlanta Solr Meetup, Atlanta Big Data Meetup, Second International Symposium on Big Data and Data Analytics, RecSys 2014, IEEE Big Data Conference 2014 (x2), AAAI/IAAI 2015, IEEE Big Data 2015 (x6), Lucene/Solr Revolution 2015, and Bay Area Search Meetup
So What’s Next?
(Semantic Query Augmentation architecture slide, shown earlier, revisited)

This piece: how do you construct the best possible queries?
The answer… Learning to Rank (Machine-learned Ranking)
That can be a topic for next time…
(Intent Engine architecture slide, shown earlier, revisited)
(Conceptual Framework for Information Retrieval slide, shown earlier, revisited)
Additional References:
Bonus Slides

Audience question: how can you discover terms / related terms without having query logs to mine?
One Option: Clustering on documents to find semantic links
Setting up Clustering in solrconfig.xml

<searchComponent name="clustering" enable="true" class="solr.clustering.ClusteringComponent">
  <lst name="engine">
    <str name="name">default</str>
    <str name="carrot.algorithm">org.carrot2.clustering.lingo.LingoClusteringAlgorithm</str>
    <str name="MultilingualClustering.defaultLanguage">ENGLISH</str>
  </lst>
</searchComponent>

<requestHandler name="/clustering" enable="true" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="clustering.engine">default</str>
    <bool name="clustering.results">true</bool>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
Clustering Query
/solr/clustering/?q=solr
  &rows=100
  &carrot.title=titlefield
  &carrot.snippet=titlefield
  &LingoClusteringAlgorithm.desiredClusterCountBase=25

// clustering & grouping don't currently play nicely together
Allows you to dynamically identify “concepts” and their prevalence within a user’s top search results
Original Query: q=solr
Clustering Results
Clusters Identified:
  Developer (22), Java Developer (13), Software (10), Senior Java Developer (9), Architect (6), Software Engineer (6), Web Developer (5), Search (3), Software Developer (3), Systems (3), Administrator (2), Hadoop Engineer (2), Java J2EE (2), Search Development (2), Software Architect (2), Solutions Architect (2)
Identify Relationships:
q="solr" OR ("Developer"^0.22 OR "Java Developer"^0.13 OR "Software"^0.10 OR "Senior Java Developer"^0.09 OR "Architect"^0.06 OR "Software Engineer"^0.06 OR "Web Developer"^0.05 OR "Search"^0.03 OR "Software Developer"^0.03 OR "Systems"^0.03 OR "Administrator"^0.02 OR "Hadoop Engineer"^0.02 OR "Java J2EE"^0.02 OR "Search Development"^0.02 OR "Software Architect"^0.02 OR "Solutions Architect"^0.02)
Just plug in those semantic relationships as before…
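Turning those cluster counts into boosts can be sketched as follows (a toy helper: weighting each label by count/100 mirrors the example query, and the labels are taken from the clusters above):

```python
def clusters_to_boost_query(base_query, clusters):
    # clusters: {cluster_label: document_count}; weight each label by
    # count/100 and OR the boosted phrases into the original query so
    # they influence ranking without filtering.
    boosts = " OR ".join(
        '"%s"^%.2f' % (label, count / 100.0)
        for label, count in sorted(clusters.items(), key=lambda kv: -kv[1]))
    return "%s OR (%s)" % (base_query, boosts)

clusters = {"Developer": 22, "Java Developer": 13, "Software": 10}
boosted = clusters_to_boost_query("solr", clusters)
```

The result, `solr OR ("Developer"^0.22 OR "Java Developer"^0.13 OR "Software"^0.10)`, plugs the dynamically discovered concepts back into the query as soft relevance signals.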
Contact Info
Yes, WE ARE HIRING. Come talk with me if you are interested…
Trey Grainger [email protected] @treygrainger
http://solrinaction.com
Conference discount (39% off): 39solrmu
Other presentations: http://www.treygrainger.com