Search at LinkedIn by Sriram Sankar and Kumaresh Pattabiraman

Recruiting Solutions Search at LinkedIn Sriram Sankar, Principal Staff Engineer Kumaresh Pattabiraman, Senior Product Manager

Upload: the-hive

Post on 20-Aug-2015

TRANSCRIPT

Page 1: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Recruiting Solutions

Search at LinkedIn

Sriram Sankar, Principal Staff Engineer
Kumaresh Pattabiraman, Senior Product Manager

Page 2: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

https://www.youtube.com/watch?v=obCHKPYHuhA

Page 3: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Search at LinkedIn

Personalized professional search

Part of a bigger product experience

But a really big part of it

Page 4: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Some history . . .

Page 5: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Approach to Search

Off-the-shelf components (Lucene)
Extended to address Lucene limitations (Sensei, Bobo, Zoie, Content Store)
Specialized verticals (Cleo, Krati)
Stack adopted for other purposes (recommendations, newsfeed, ads, analytics, etc.)

Page 6: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Lucene

An open source API that supports search functionality:
– Add new documents to the index
– Delete documents from the index
– Construct queries
– Search the index using the query
– Score the retrieved documents

Page 7: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

The Search Index

Inverted Index: Mapping from (search) terms to list of documents (they are present in)

Forward Index: Mapping from documents to metadata about them
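The two mappings above can be sketched in a few lines of Python. This is only an illustration: the document ids, the `industry` metadata field, and the helper name `build_indexes` are assumptions, not part of the talk.

```python
from collections import defaultdict

# Two toy documents. The ids (1, 2) and the "industry" field are assumptions
# made up for this illustration.
documents = {
    1: {"text": "blah blah Kumaresh blah LinkedIn blah", "industry": "software"},
    2: {"text": "blah Sriram blah LinkedIn blah blah", "industry": "software"},
}


def build_indexes(docs):
    """Return (inverted_index, forward_index) for a dict of documents."""
    inverted = defaultdict(list)  # term -> posting list of doc ids, ascending
    forward = {}                  # doc id -> metadata about the document
    for doc_id in sorted(docs):
        doc = docs[doc_id]
        # Each distinct word of the text becomes a search term.
        for term in sorted(set(doc["text"].lower().split())):
            inverted[term].append(doc_id)
        # Everything except the raw text is kept as per-document metadata.
        forward[doc_id] = {k: v for k, v in doc.items() if k != "text"}
    return dict(inverted), forward


inverted_index, forward_index = build_indexes(documents)
print(inverted_index["linkedin"])  # posting list for the term "linkedin"
print(forward_index[2])            # metadata stored for document 2
```

The inverted index answers "which documents contain this term?", while the forward index answers "what do we know about this document?".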

Page 8: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

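Example documents:
BLAH BLAH BLAH Kumaresh BLAH BLAH LinkedIn BLAH BLAH BLAH BLAH
BLAH BLAH Sriram BLAH LinkedIn BLAH BLAH BLAH BLAH BLAH BLAH BLAH

[Figure: the two documents, numbered 1 and 2, shown with an Inverted Index over the terms Kumaresh, Sriram, and LinkedIn and a Forward Index keyed by document number]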

Page 9: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

The Search Index

The lists are called posting lists
Up to hundreds of millions of posting lists
Up to hundreds of millions of documents
Posting lists may contain as few as a single hit and as many as tens of millions of hits
Terms can be
– words in the document
– inferred attributes about the document

Page 10: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Lucene Queries

“Sriram Sankar”
Sriram Kumaresh
+Sriram +LinkedIn
+Kumaresh connection:418001
+Kumaresh industry:software connection:418001^4
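As a rough illustration of how queries like these could be evaluated against posting lists, here is a minimal Python sketch. It handles only a simplified subset of the syntax (`+` for required terms, `^n` for boosts, `field:value` treated as an ordinary term) and ignores phrase queries. The function names and the toy posting lists are assumptions, and the scoring (sum of boosts of matching terms) is a stand-in, not Lucene's formula.

```python
def parse_query(query):
    """Split a simplified query into (required, optional) lists of (term, boost)."""
    required, optional = [], []
    for token in query.split():
        target = required if token.startswith("+") else optional
        term, _, boost = token.lstrip("+").partition("^")
        target.append((term.lower(), float(boost) if boost else 1.0))
    return required, optional


def search(query, postings):
    """Return {doc_id: score}, where score is the sum of boosts of matching terms."""
    required, optional = parse_query(query)
    if required:
        # Every '+' term must match: intersect the posting lists.
        candidates = set.intersection(
            *(set(postings.get(term, ())) for term, _ in required)
        )
    else:
        # No required terms: a document matches if any optional term matches.
        candidates = set().union(*(postings.get(term, ()) for term, _ in optional))
    return {
        doc_id: sum(
            boost
            for term, boost in required + optional
            if doc_id in postings.get(term, ())
        )
        for doc_id in candidates
    }


# Hypothetical posting lists for two documents.
postings = {
    "sriram": [2],
    "kumaresh": [1],
    "linkedin": [1, 2],
    "connection:418001": [2],
    "industry:software": [1, 2],
}
print(search("+Sriram +LinkedIn", postings))
print(search("+LinkedIn connection:418001^4", postings))
```

In the second query the boosted `connection:418001` term is optional, so it does not restrict the result set; it only raises the score of documents that match it.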

Page 11: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Lucene Scoring

As documents are added to the index, Lucene maintains some metadata on the terms (e.g., term position, tf/idf)

Lucene accepts scoring information via query modifications, boosts, etc.

Lucene assigns a score to each retrieved document using this information
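To make the idea of tf/idf-based scoring with boosts concrete, here is a small Python sketch of a textbook tf-idf score. It is not Lucene's actual similarity formula; the function name, the idf variant (`log(N / df) + 1`), and the toy documents are assumptions chosen for simplicity.

```python
import math
from collections import Counter


def tf_idf_scores(query_terms, docs, boosts=None):
    """Score each document as the sum over query terms of tf * idf * boost."""
    boosts = boosts or {}
    n_docs = len(docs)
    # Term frequencies per document, computed once up front.
    term_counts = {
        doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()
    }
    scores = {}
    for doc_id, counts in term_counts.items():
        score = 0.0
        for term in query_terms:
            tf = counts[term]
            if tf == 0:
                continue
            # Document frequency: in how many documents the term occurs.
            df = sum(1 for c in term_counts.values() if c[term] > 0)
            idf = math.log(n_docs / df) + 1.0
            score += tf * idf * boosts.get(term, 1.0)
        if score > 0:
            scores[doc_id] = score
    return scores


docs = {1: "kumaresh linkedin linkedin", 2: "sriram linkedin"}
plain = tf_idf_scores(["sriram", "linkedin"], docs)
boosted = tf_idf_scores(["sriram", "linkedin"], docs, boosts={"sriram": 4.0})
print(plain, boosted)
```

The rare term ("sriram") contributes more than the common one ("linkedin"), and a query-time boost scales that contribution further.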

Page 12: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Sensei

Layer over Lucene that provides:
– Sharding
– Cluster management
– Enhanced query language

Page 13: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Page 14: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Sensei BQL

SELECT *
FROM cars
WHERE price > 2000.00
USING RELEVANCE MODEL my_model (favoriteColor:"black", favoriteTag:"cool")
  DEFINED AS (String favoriteColor, String favoriteTag)
  BEGIN
    float boost = 1.0;
    if (tags.contains(favoriteTag))
      boost += 0.5;
    if (color.equals(favoriteColor))
      boost += 1.2;
    return _INNER_SCORE * boost;
  END

Page 15: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Live Updates – Zoie and Content Store

The index reader has to be reopened before earlier live updates are visible

The only way to perform a live update is to replace the entire document – which requires access to the unchanged attributes also
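The second limitation is the reason a content store is needed: to change one attribute, the full document has to be rebuilt from stored attributes and reindexed. The Python sketch below illustrates that pattern with a toy in-memory index. The class and method names are hypothetical, and the term-to-doc-id dictionary is only a stand-in for a Lucene index.

```python
class ContentStoreBackedIndex:
    """Toy index in which every update replaces the whole document."""

    def __init__(self):
        self.content_store = {}  # doc id -> full set of attributes
        self.index = {}          # term -> set of doc ids (stand-in for Lucene)

    @staticmethod
    def _terms(doc):
        return {f"{field}:{value}" for field, value in doc.items()}

    def _remove(self, doc_id):
        old = self.content_store.get(doc_id)
        if old is None:
            return
        for term in self._terms(old):
            self.index[term].discard(doc_id)
            if not self.index[term]:
                del self.index[term]

    def put(self, doc_id, doc):
        """Delete the old version (if any) and index the full new document."""
        self._remove(doc_id)
        self.content_store[doc_id] = dict(doc)
        for term in self._terms(doc):
            self.index.setdefault(term, set()).add(doc_id)

    def partial_update(self, doc_id, changes):
        """Merge changes with the stored attributes, then replace the document."""
        full = {**self.content_store[doc_id], **changes}
        self.put(doc_id, full)


idx = ContentStoreBackedIndex()
idx.put(1, {"name": "sriram", "industry": "software"})
idx.partial_update(1, {"industry": "internet"})
print(sorted(idx.index))
```

Without the content store, `partial_update` would have no way to recover the unchanged `name` attribute when rebuilding the document.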

Page 16: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Zoie

Page 17: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Search Content Store

[Diagram components: Activity Feeds, Inserts, Deletes, Search Content Store, Lucene Index]

Page 18: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Faceting

Page 19: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Bobo

Page 20: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Typeahead (Instant Search)

Results as you type

Conventional wisdom: Inverted indices cannot support typeahead

Cleo, Krati

Page 21: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Fast forward to last year – and growing pains . . .

Page 22: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Scalability

Rebuilding index from scratch extremely difficult

Not possible to use complex algorithms during indexing

Live updates at document granularity

Inflexible scoring – both at Lucene and Sensei levels

Page 23: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Fragmentation

Too many open source components glued together with primary developers spread across many companies

Different instantiations starting to diverge to deal with their specific growing pains – so diverging stacks and distracted engineers

Page 24: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Our new search stack . . .
Two verticals already in production

Page 25: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Life of a Query

[Diagram components: User Query, Query Rewriter/Planner, Search Shard (multiple), Results Merging, Search Results]
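One plausible reading of this slide is a scatter-gather flow: rewrite the query, send it to every shard, and merge the per-shard top results. The Python sketch below illustrates that reading only. The rewriter, the term-count scoring, and all function names are assumptions, not the actual implementation.

```python
import heapq
from itertools import islice


def rewrite(user_query):
    """Hypothetical rewriter: lower-case the query and split it into terms."""
    return user_query.lower().split()


def search_shard(shard, terms, k):
    """Return the shard's top-k (score, doc_id) pairs, best first."""
    hits = []
    for doc_id, text in shard.items():
        words = text.lower().split()
        score = sum(words.count(term) for term in terms)  # toy scoring
        if score > 0:
            hits.append((score, doc_id))
    return heapq.nlargest(k, hits)


def search(user_query, shards, k=10):
    terms = rewrite(user_query)
    # Scatter: every shard computes its own top-k independently.
    per_shard = [search_shard(shard, terms, k) for shard in shards]
    # Gather: merge the already-sorted shard results and keep the global top-k.
    merged = heapq.merge(*per_shard, reverse=True)
    return [doc_id for _, doc_id in islice(merged, k)]


shards = [
    {1: "sriram linkedin", 2: "kumaresh"},
    {3: "linkedin search", 4: "sriram sriram linkedin"},
]
print(search("Sriram LinkedIn", shards, k=2))
```

Because each shard returns at most k results, the merge step handles at most k results per shard regardless of how large the shards are.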

Page 26: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Life of a Query – Within A Search Shard

[Diagram components: Rewritten Query, INDEX, Retrieve a Document, Score the Document, Top Results, Top Results From Shard]

Page 27: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Life of a Query – Within A Rewriter

[Diagram components: Query, Rewriter State, Rewriter Module (three), DATA MODEL (three), Rewritten Query]

Page 28: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Life of Data - Offline

[Diagram components: Raw Data, Derived Data, DATA MODEL (five), INDEX]

Page 29: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Benefits of New Stack

A complete search engine
Frequent reindexing possible (a full reset)
Resharding becomes easy
Clear separation of infrastructure and relevance functions
A single stack with a single identity!

Page 30: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Early Termination

We order documents in the index based on a static rank – from most important to least important

An offline relevance algorithm assigns a static rank to each document on which the sorting is performed

This allows retrieval to be early-terminated (assuming a strong correlation between static rank and importance of result for a specific query)

Happens to work well with personalized search also
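A minimal Python sketch of the idea, assuming document ids are assigned in static-rank order (0 is the most important document), so every posting list is already sorted from most to least important. The function name and the toy posting lists are illustrative only.

```python
def early_terminated_and(posting_lists, k):
    """Intersect posting lists sorted in static-rank order; stop after k hits."""
    results = []
    if not posting_lists:
        return results
    positions = [0] * len(posting_lists)
    while len(results) < k:
        try:
            heads = [plist[pos] for plist, pos in zip(posting_lists, positions)]
        except IndexError:
            break  # one of the lists is exhausted, no further matches possible
        target = max(heads)
        if all(head == target for head in heads):
            # All lists agree on this document: it matches every term.
            results.append(target)
            positions = [pos + 1 for pos in positions]
        else:
            # Advance only the lists that are behind the largest head.
            positions = [
                pos + 1 if head < target else pos
                for pos, head in zip(positions, heads)
            ]
    return results


# Hypothetical posting lists; ids double as static-rank positions.
java = [0, 1, 2, 5, 8, 9]
search_term = [1, 2, 3, 5, 9]
print(early_terminated_and([java, search_term], k=2))
```

With `k=2` the loop stops after the first two matches and never reads the tail of either list, which is where the savings come from on long posting lists.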

Page 31: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

New Strategy for Live Updates

Lucene segments are “document-partitioned”
We have enhanced Lucene with “term-partitioned” segments
We use 3 term-partitioned segments:
– Base index (never changed)
– Live update buffer
– Snapshot index
Fault tolerant, and performant
No more content store!
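The slide does not say how the three segments are combined at query time. The Python sketch below shows one possible arrangement, under loud assumptions: all segments share one document-id space, a term's posting list is the union of its lists in the three segments, and the live update buffer is periodically flushed into the snapshot index. Removing terms from a document is not modeled, and the class and method names are hypothetical.

```python
class SegmentedIndex:
    """Toy model of a base index plus snapshot index plus live update buffer."""

    def __init__(self, base):
        # Base index: built offline and never changed afterwards.
        self.base = {term: tuple(sorted(ids)) for term, ids in base.items()}
        self.snapshot = {}  # term -> set of doc ids, flushed live updates
        self.live = {}      # term -> set of doc ids, most recent updates

    def add_term(self, doc_id, term):
        """Live update at term granularity: only the new term is written."""
        self.live.setdefault(term, set()).add(doc_id)

    def flush(self):
        """Move the live update buffer into the snapshot index (assumed policy)."""
        for term, ids in self.live.items():
            self.snapshot.setdefault(term, set()).update(ids)
        self.live = {}

    def postings(self, term):
        """Union of the term's posting lists across all three segments."""
        ids = set(self.base.get(term, ()))
        ids |= self.snapshot.get(term, set())
        ids |= self.live.get(term, set())
        return sorted(ids)


idx = SegmentedIndex({"skill:java": [1, 3]})
idx.add_term(2, "skill:java")
before_flush = idx.postings("skill:java")
idx.flush()
after_flush = idx.postings("skill:java")
print(before_flush, after_flush)
```

Because an update touches only the affected term, the unchanged attributes of the document never need to be looked up, which fits the slide's "No more content store!" point.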

Page 32: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

[Diagram components: Base Index, Snapshot Index, Live Update Buffer]

Page 33: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Data Distribution

BitTorrent-based data distribution framework

More details at a later time

Page 34: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Relevance

Offline analysis – resulting in a better index and data models

Query rewriting – for better and more accurate recall

Scoring – to fine tune each of the retrieved results

Reranking – selection of top results for overall result set quality

Blending – to combine results from multiple verticals

Page 35: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Machine Learned Scorers

Goal: To automatically build a function whose arguments are interesting features of the query and the document

Input to the machine learning system is a set of training data that describes how the function should behave on various combinations of feature values

The function takes the form of standard templates – a linear formula is commonly used (due to simplicity)
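As a small illustration of the linear template, the Python sketch below fits a one-feature linear function to training data by ordinary least squares and then evaluates a multi-feature linear scorer. The function names, feature names, and numbers are made up for the example; nothing here reflects the actual LinkedIn models.

```python
def fit_single_feature(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def linear_score(features, weights, bias=0.0):
    """Linear template: bias plus a weighted sum of feature values."""
    return bias + sum(weights[name] * value for name, value in features.items())


# Training data: feature value -> desired score (hypothetical).
slope, intercept = fit_single_feature([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)

# Scoring one (query, document) pair with hypothetical features and weights.
score = linear_score(
    {"title_match": 1.0, "shared_connections": 2.0},
    {"title_match": 0.5, "shared_connections": 0.25},
    bias=1.0,
)
print(score)
```

The training data determines the weights; at query time, scoring is just the weighted sum, which is part of why linear models are attractive for their simplicity.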

Page 36: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Linear Regression on a Single Feature

Page 37: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

LinkedIn Scorer: Different Linear Models for Different Intents

Relevance models incorporate user features:

score = P (Document | Query, User)

Tree with linear regression leaves

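[Diagram: decision tree with tests "X2 = ?" (branches X2 = 0 and X2 = 1) and "X10 < 0.1234 ?" (branches Yes and No), with a linear regression model at each leaf]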
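A tree with linear regression leaves can be sketched as a few branch tests that pick which linear model to apply. In the Python sketch below, the tests on X2 and X10 come from the slide, but the arrangement of the branches, the leaf weights, and all names are assumptions for illustration.

```python
def linear(weights, bias):
    """Build a linear model over a feature dict keyed by feature index."""
    def model(x):
        return bias + sum(weight * x[index] for index, weight in weights.items())
    return model


# Hypothetical leaf models, one per branch of the tree.
LEAF_X2_ZERO = linear({1: 2.0}, bias=0.0)
LEAF_X10_LOW = linear({1: 1.0, 3: 0.5}, bias=1.0)
LEAF_X10_HIGH = linear({3: 4.0}, bias=0.0)


def score(x):
    """Route the feature vector through the tree, then apply the leaf's model."""
    if x[2] == 0:
        return LEAF_X2_ZERO(x)
    if x[10] < 0.1234:
        return LEAF_X10_LOW(x)
    return LEAF_X10_HIGH(x)


print(score({1: 1.0, 2: 0, 3: 2.0, 10: 0.5}))
print(score({1: 1.0, 2: 1, 3: 2.0, 10: 0.05}))
print(score({1: 1.0, 2: 1, 3: 2.0, 10: 0.5}))
```

Each leaf can weight the same features differently, which is one way to get different linear models for different intents.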

Page 38: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Going Forward

Further standardize infrastructure for relevance components

Scatter-gather

Java GC issues

Extend infrastructure to browser/device

Reintegrate diverging stacks

Page 39: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Product Overview

Page 40: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

LinkedIn’s Vision

“Create economic opportunity for every member of the global workforce”

Page 41: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

The Economic Graph

Page 42: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Search is core to the economic graph vision

Page 43: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Who uses search?

Casual User: LI as professional identity
Job Seeker: LI as a way to get the day job
Outbound professional (Recruiter / Sales): LI as day job

Page 44: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Casual User

Name Search
Topic Search

Page 45: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Instant: Name Search

Search all members by name or approximate name

Page 46: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Unified Search: Topic Search

One federated search result page with all relevant entities about the topic

Page 47: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Outbound professional

Exploratory people search

Page 48: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Instant: Search Suggestions

Entity-aware suggestions for companies, skills & titles

Page 49: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Instant: Just one keystroke

From name search to exploratory search

Page 50: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

People Search

Explore using facets and advanced search fields

Page 51: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

People Search

Leverage the network through shared connections

Page 52: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Recruiter & Sales Navigator

Products powered by search

Page 53: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Job Seeker

Job Search

Page 54: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Instant: Search Suggestions

Entity-aware suggestions for companies, skills & titles

Page 55: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Job Search

Explore using facets and advanced search fields

Page 56: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Job Search

Leverage the network through relationship to job poster or connections in the company

Page 57: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Other Search Users include…

Students – University Search
Information Seekers / Researchers – Content Search
Advertisers / Content Marketers – Company & Group Search

Page 58: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman

Bringing it all together

Users: 300 Million+ members
Product: Search the economic graph of 300M profiles, 3B endorsements, 300K jobs, 3M companies, 2M groups, 25K schools, 100M+ pieces of professional content
Platform: One index, one unified search stack

Page 59: Search at Linkedin by Sriram Sankar and Kumaresh Pattabiraman
