@indeedeng: imhotep - large scale analytics and machine learning at indeed

go.indeed.com/IndeedEngTalks


Imhotep Large Scale Analytics and Machine Learning at Indeed

Jeff Plaisance, Engineering Manager

I help people get jobs.

Indeed is a Search Engine for Jobs

Indeed is a data driven organization

Data driven organizations need great tools

What does Imhotep allow you to do?

● Decision Tree Building
● Analytics

Indeed’s Analytics Philosophy

Analytics systems should be:
1. Interactive
2. Not Sampled
3. Not Approximate

Imhotep answers questions

What was the weekly average query time in the last quarter from people doing the query “software”?


What percent of jobsearch results pages are for page 2 and beyond?


What are the 5 most common queries in each country?

Total Job Searches From 2014-03-09 to 2014-03-23

?

Query Location

Impression

Document

query: “indeed software engineer”
location: “austin”
impressions: 10
clicks: 2
time: 2014-03-17T12:00:00

Shard

[Figure: a shard containing documents 0 through 14]

Server

[Figure: a server holding six shards (2014/03/02, 2014/03/09, 2014/03/11, 2014/03/12, 2014/03/22, 2014/03/24), each containing documents]

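Cluster

[Figure: three servers, each holding one shard: Server A (2014-03-02), Server B (2014-03-03), Server C (2014-03-04)]

Cluster

[Figure: Server B holds shards 2014-03-02 and 2014-03-03, Server C holds 2014-03-04, and Server A holds no shards]

Cluster

[Figure: the same shard layout, with a Client and a Session shown alongside Server A]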

Total Job Searches From 2014-03-09 to 2014-03-23

secret

Total Job Searches From 2014-03-09 to 2014-03-23 Per Day

[Chart: x-axis ticks at 2014-03-09, 2014-03-16, and 2014-03-23]

Metrics

● 64 bit integers
● Exactly one value per doc
● Random access by doc id

Metrics

● Time
● Clicks
● Impressions
● Revenue
● … or anything else that is a number

Groups

● Documents are placed into numbered groups
● Every document starts in group 1
● Group 0 means “filtered out”

Groups

● Groups are stateful and scoped to a session
● Regroup operations update group for each doc in shard

Metric Regroup

● Iterate over doc_id->metric lookup
● Set group to (value - start) / bucket_width
● Useful for making graphs (buckets on x-axis)

[Figure: the range from start to end divided into buckets 1 through 5, each one bucket width wide]
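A minimal Python sketch of the metric regroup step, under a few assumptions that are not in the slides: bucket numbering starts at group 1 (matching the figure), values outside the start/end range go to group 0, and the function name `metric_regroup` and list-based layout are purely illustrative rather than Imhotep's API.

```python
def metric_regroup(metric, groups, start, end, bucket_width):
    """Assign each doc still in a non-zero group to a bucket by its metric value.

    metric: list of ints, one value per doc id (random access by doc id).
    groups: list of current group numbers, one per doc id (0 = filtered out).
    Returns a new list of group numbers.
    """
    new_groups = []
    for doc_id, value in enumerate(metric):
        if groups[doc_id] == 0 or not (start <= value < end):
            new_groups.append(0)  # assumption: out-of-range docs are filtered out
        else:
            # The +1 is an assumption so that bucket numbering starts at group 1.
            new_groups.append((value - start) // bucket_width + 1)
    return new_groups


# Hypothetical data: time values bucketed into buckets 10 units wide.
times = [100, 105, 112, 129, 149, 200]
groups = [1] * len(times)  # every document starts in group 1
bucketed = metric_regroup(times, groups, start=100, end=150, bucket_width=10)
print(bucketed)
```

Each resulting group number corresponds to one bucket on the x-axis of a graph.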

Get Group Stats

● For each group, sums a metric for all docs in that group

Bucket By Day

1. Regroup on time metric
2. Get Group Stats for count metric (always 1)
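The two steps can be sketched in Python as follows. This is an illustration under assumptions, not Imhotep code: the timestamps are hypothetical seconds relative to the start of the range, and `get_group_stats` is a made-up name for the primitive described on the previous slide.

```python
from collections import defaultdict

SECONDS_PER_DAY = 86400


def get_group_stats(metric, groups):
    """Sum a metric over all docs in each non-zero group."""
    stats = defaultdict(int)
    for doc_id, group in enumerate(groups):
        if group != 0:  # group 0 means "filtered out"
            stats[group] += metric[doc_id]
    return dict(stats)


# Hypothetical documents: seconds since the start of the queried range.
times = [10, 20, 86400 + 5, 2 * 86400 + 1, 2 * 86400 + 2, 2 * 86400 + 3]
count = [1] * len(times)  # the count metric is always 1

# Step 1: regroup on the time metric (day number + 1, so groups start at 1).
groups = [t // SECONDS_PER_DAY + 1 for t in times]

# Step 2: get group stats for the count metric.
searches_per_day = get_group_stats(count, groups)
print(searches_per_day)
```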

Total Job Searches From 2014-03-09 to 2014-03-23 Per Day

[Chart: x-axis ticks at 2014-03-09, 2014-03-16, and 2014-03-23]

Total and US Job Searches From 2014-03-09 to 2014-03-23 Per Day

[Chart: x-axis ticks at 2014-03-09, 2014-03-16, and 2014-03-23]

Inverted Indexes

Inverted Index

● Like index in the back of a book
● words = terms, page numbers = doc ids
● Term list is sorted
● Doc list for each term is sorted

Standard Index

doc id  query     country  impressions  clicks
0       software  Canada   10           1
1       blank     Canada   10           0
2       sales     US       5            0
3       software  US       8            1
4       blank     US       10           1

Constructing an Inverted Index

        query                   country     impressions  clicks
doc id  blank  sales  software  Canada  US  5   8   10   0   1
0                     ✔         ✔                   ✔        ✔
1       ✔                       ✔                   ✔    ✔
2              ✔                        ✔   ✔            ✔
3                     ✔                 ✔       ✔            ✔
4       ✔                               ✔           ✔        ✔

Constructing an Inverted Index

field        term      0  1  2  3  4
query        blank        ✔        ✔
             sales           ✔
             software  ✔        ✔
country      Canada    ✔  ✔
             US              ✔  ✔  ✔
impressions  5               ✔
             8                  ✔
             10        ✔  ✔        ✔
clicks       0            ✔  ✔
             1         ✔        ✔  ✔

Inverted Index

field        term      doc list
query        blank     1, 4
             sales     2
             software  0, 3
country      Canada    0, 1
             US        2, 3, 4
impressions  5         2
             8         3
             10        0, 1, 4
clicks       0         1, 2
             1         0, 3, 4

Inverted Indexes

Allow you to:
● Quickly find all documents containing a term
● Intersect several terms to perform boolean queries
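As an illustration of both points, here is a small Python sketch that builds an inverted index over the five example documents from the slides and intersects two doc lists. The function names and the dictionary layout are assumptions made for this sketch; they are not how Lucene or Imhotep store an index.

```python
from collections import defaultdict

# The five example documents from the slides.
docs = [
    {"query": "software", "country": "Canada", "impressions": 10, "clicks": 1},
    {"query": "blank", "country": "Canada", "impressions": 10, "clicks": 0},
    {"query": "sales", "country": "US", "impressions": 5, "clicks": 0},
    {"query": "software", "country": "US", "impressions": 8, "clicks": 1},
    {"query": "blank", "country": "US", "impressions": 10, "clicks": 1},
]


def build_inverted_index(docs):
    """Map (field, term) -> sorted list of doc ids containing that term."""
    index = defaultdict(list)
    for doc_id, doc in enumerate(docs):  # ascending doc ids keep lists sorted
        for field, term in doc.items():
            index[(field, term)].append(doc_id)
    return index


def intersect(a, b):
    """Intersect two sorted doc id lists with a linear two-pointer walk."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


index = build_inverted_index(docs)
# Boolean query: country:US AND query:software
us_software = intersect(index[("country", "US")], index[("query", "software")])
print(us_software)
```

Because both doc lists are sorted, the intersection needs only one pass over each list.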

Lucene

● Open source inverted index implementation
● Reasonably fast
● Widely used, well tested

Global and US Job Searches From 2014-03-09 to 2014-03-23 Per Day

[Chart: x-axis ticks at 2014-03-09, 2014-03-16, and 2014-03-23]

Searches in the US only

field    term    doc list
country  Canada  0, 1
         US      2, 3, 4

Query Regroup

● Regroup all docs which do not match a boolean query to group zero

field    term    doc list
country  Canada  0, 1
         US      2, 3, 4
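A short Python sketch of a query regroup for the query country:US, using the doc list from the example index. The function name and the list-of-groups representation are assumptions for illustration only.

```python
def query_regroup(groups, matching_doc_ids):
    """Move every doc that does not match the query to group 0."""
    matching = set(matching_doc_ids)
    return [g if doc_id in matching else 0 for doc_id, g in enumerate(groups)]


us_docs = [2, 3, 4]  # doc list for country:US
groups = [1, 1, 1, 1, 1]  # every document starts in group 1
us_only = query_regroup(groups, us_docs)
print(us_only)
```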

Term Regroup

Splits docs in a group into one of two new groups based on presence/absence of a term

[Figure: group 1 is split into groups 2 and 3, one for country:US and one for everything else]

Multiterm Regroup

Generalization of term regroup to N terms and N+1 new groups

[Figure: group 1 is split into groups 2 through 5, one each for country:US, country:CA, country:FR, and everything else]
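The following Python sketch treats term regroup as the N = 1 case of multiterm regroup. The way new group numbers are assigned here (term i gets `target_group + 1 + i`, everything else gets the last new group) is an assumption for this sketch; the slides do not specify the numbering.

```python
def multiterm_regroup(groups, target_group, term_doc_lists):
    """Split docs in target_group into N+1 new groups for N terms.

    term_doc_lists: list of doc id lists, one per term.
    Assumed numbering: term i gets target_group + 1 + i, and docs matching
    none of the terms get the last new group.
    """
    n = len(term_doc_lists)
    new_groups = list(groups)
    # Default: everything in the target group goes to the "everything else" group.
    for doc_id, g in enumerate(groups):
        if g == target_group:
            new_groups[doc_id] = target_group + 1 + n
    # Docs containing term i move to that term's group.
    for i, doc_list in enumerate(term_doc_lists):
        for doc_id in doc_list:
            if groups[doc_id] == target_group:
                new_groups[doc_id] = target_group + 1 + i
    return new_groups


groups = [1, 1, 1, 1, 1]
canada, us = [0, 1], [2, 3, 4]

# Term regroup: country:US vs. everything else.
term_split = multiterm_regroup(groups, 1, [us])
# Multiterm regroup with two terms: country:US, country:Canada, everything else.
multi_split = multiterm_regroup(groups, 1, [us, canada])
print(term_split, multi_split)
```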

Total and US Job Searches From 2014-03-09 to 2014-03-23 Per Day

[Chart: x-axis ticks at 2014-03-09, 2014-03-16, and 2014-03-23]

Inverted Index Compression

Size of Organic Dataset for last 5 months
● Original: 102 TB
● Inverted: 51 TB

Inverted Index Optimizations

● Compressed data structures
  ○ Better use of RAM and processor cache
  ○ Better use of memory bandwidth
  ○ Increased CPU usage and time
● Micro optimizations matter!

Delta / Varint Encoding

● Doc id lists are sorted
● Delta between a doc id and the previous doc id is sufficient
● Deltas are usually small integers
● Use fewer bits for small integers and more bits for large integers

Delta Encoding

field  term     doc list
query  nursing  34, 86, 247, 301, 674, 714

deltas: 34, 52, 161, 54, 373, 40
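The delta step for the nursing doc list can be reproduced with a few lines of Python. The function names are illustrative, and the first delta is taken relative to 0, which matches the example values.

```python
def delta_encode(doc_ids):
    """Replace each doc id with its difference from the previous one."""
    deltas = []
    previous = 0
    for doc_id in doc_ids:
        deltas.append(doc_id - previous)
        previous = doc_id
    return deltas


def delta_decode(deltas):
    """Rebuild the sorted doc id list with a running sum."""
    doc_ids = []
    total = 0
    for delta in deltas:
        total += delta
        doc_ids.append(total)
    return doc_ids


nursing = [34, 86, 247, 301, 674, 714]
encoded = delta_encode(nursing)
print(encoded)  # [34, 52, 161, 54, 373, 40]
```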

Small Integer Compression

● Golomb/Rice
● Varint
● Bit Packing (Binary Packing)
● PForDelta

Varint Encoding

9838 as a 32 bit integer:

00000000 00000000 00100110 01101110

First byte: the low 7 bits, with the top bit set to 1 because more bytes follow:

11101110

Second byte: the next 7 bits, with the top bit set to 0 because this is the last byte:

01001100
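A scalar Python sketch of this encoding, consistent with the 9838 example: 7 bits per byte, low bits first, with the top bit of each byte marking that more bytes follow. The function names are illustrative and this is not the Flamdex implementation.

```python
def varint_encode(value):
    """Encode a non-negative int, low 7 bits first; top bit = more bytes follow."""
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # continuation bit set
        else:
            out.append(low7)  # last byte, continuation bit clear
            return bytes(out)


def varint_decode(data):
    """Decode a byte string of concatenated varints into a list of ints."""
    values, current, shift = [], 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7  # more bytes belong to this value
        else:
            values.append(current)
            current, shift = 0, 0
    return values


encoded = varint_encode(9838)
print([format(b, "08b") for b in encoded])  # ['11101110', '01001100']
```

The decoder handles one byte at a time and branches on every byte, which is the cost the vectorized decoder later in the talk is designed to avoid.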

Inverted Index Compression

Size of Organic Dataset for last 5 months
● Original: 102 TB
● Inverted: 51 TB
● Delta / Varint: 17 TB

Flamdex

● Two files per field (terms/docs)
● Can add fields without rebuilding index
● Faster varint decoding
● No TF or positions (or wasted time decoding them)

Varints

Pros:
● Compression
● Can fit more of index in RAM
● Higher information throughput per byte read from disk

Varints

Cons:
● Decodes one byte at a time
● Lots of branch mispredictions
● Not fast to decode

Vectorized Varint Decoding

01001010 11001000 01110001 01001110
10011011 01101010 10110101 00010111
01110110 10001101 10110011 11000001

pmovmskb: Extract top bit of each byte

010010100111

Lookup in 4096 entry lookup table

Pattern of leading bits determines:
● how many varints to decode
● sizes and offsets of varints
● length of longest varint in bytes
● number of bytes to consume

Decoding options for:
● up to twelve 1 byte varints
● six 1-2 byte varints
● four 1-3 byte varints
● two 1-5 byte varints

● Decode six 1-2 byte varints in parallel
● Need to pad out all 1 byte varints to 2 bytes

pshufb: Intel SSSE3 instruction to shuffle bytes


Decode 6 varints from 9 bytes

01001010 11001000 01110001 01001110
10011011 01101010 10110101 00010111
01110110 10001101 10110011 11000001

Pad out 1 byte ints to 2 bytes

01001010 00000000 11001000 01110001
01001110 00000000 10011011 01101010
10110101 00010111 01110110 00000000

Reverse bytes in 2 byte varints

00000000 01001010 01110001 11001000
00000000 01001110 01101010 10011011
00010111 10110101 00000000 01110110

Mask out leading 1’s (the continuation bits)

00000000 01001010 01110001 01001000
00000000 01001110 01101010 00011011
00010111 00110101 00000000 01110110

Shift top bytes of each varint 1 bit right (mask/shift/or)

00000000 01001010 00111000 11001000
00000000 01001110 00110101 00011011
00001011 10110101 00000000 01110110

● ~10 instructions
● No branches
● Less than 2 instructions per varint

● Imhotep spends ~40% of its CPU time decoding varints
● Vectorized decoder ~3-5x faster
  ○ Decompresses at 1.5 GB per second
  ○ ~2x overall system performance
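The data flow of the vectorized decoder can be imitated in plain Python on the twelve example bytes. This sketch only mirrors the logic (extract the top bits, use their pattern to find varint boundaries, then decode). It is not SIMD code, and the helper names `top_bits` and `decode_from_mask` are made up for illustration; the real decoder replaces the loop with a table lookup and byte shuffles.

```python
# The twelve example bytes from the slides.
data = bytes(int(b, 2) for b in (
    "01001010 11001000 01110001 01001110 "
    "10011011 01101010 10110101 00010111 "
    "01110110 10001101 10110011 11000001"
).split())


def top_bits(block):
    """What pmovmskb computes: the top bit of each byte, here as a string."""
    return "".join("1" if b & 0x80 else "0" for b in block)


def decode_from_mask(block, mask, count):
    """Use the continuation-bit pattern to find varint boundaries, then decode.

    Returns (values, bytes_consumed). A vectorized decoder would look the
    boundaries up in a precomputed table and shuffle bytes with pshufb.
    """
    values, pos = [], 0
    while len(values) < count:
        length = mask.index("0", pos) - pos + 1  # bytes up to the first clear bit
        value = 0
        for i in range(length):
            value |= (block[pos + i] & 0x7F) << (7 * i)
        values.append(value)
        pos += length
    return values, pos


mask = top_bits(data)
values, consumed = decode_from_mask(data, mask, count=6)
print(mask, values, consumed)
```

The six decoded values match the final bit patterns on the slides (for example, 00111000 11001000 is 14536), and 9 of the 12 bytes are consumed.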

Top 5 Locations

Term Stats

atlanta 49

austin 14

boston 25

chicago 28

dallas 13

houston 36

new york 68

san francisco 54

Term Stats Iterator

● For each term in a field, sum metrics across all docs containing that term

● How do we compute this across many machines?

[Animation: three sorted term stats streams are merged term by term. Stream 1: atlanta 16, austin 3, boston 12, dallas 5. Stream 2: atlanta 12, austin 4, chicago 19, dallas 8. Stream 3: atlanta 21, austin 7, boston 13, chicago 9. At each step the smallest remaining term is taken from every stream that has it and the stats are summed. Output: atlanta 49, austin 14, boston 25, chicago 28, dallas 13.]
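A compact Python sketch of this merge, using the three example streams. It relies on the standard library's `heapq.merge` for the k-way merge of sorted inputs; the function name `merge_term_stats` is an assumption for this sketch and not Imhotep's API.

```python
import heapq
from itertools import groupby
from operator import itemgetter

# The three sorted streams from the example, as (term, stat) pairs.
stream_1 = [("atlanta", 16), ("austin", 3), ("boston", 12), ("dallas", 5)]
stream_2 = [("atlanta", 12), ("austin", 4), ("chicago", 19), ("dallas", 8)]
stream_3 = [("atlanta", 21), ("austin", 7), ("boston", 13), ("chicago", 9)]


def merge_term_stats(*streams):
    """Merge sorted (term, stat) streams, summing stats for equal terms."""
    merged = heapq.merge(*streams, key=itemgetter(0))
    for term, group in groupby(merged, key=itemgetter(0)):
        yield term, sum(stat for _, stat in group)


result = list(merge_term_stats(stream_1, stream_2, stream_3))
print(result)
```

Because every input is sorted by term, the merge is streaming: only the current head of each stream has to be held in memory.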

[Figure: merge tree. Streams TS 1 through TS 6 are merged into Term Stats 1-6; the merged streams TS 1-6, TS 7-12, and TS 13-18 are then merged into Term Stats 1-18.]

Amdahl’s Law

● The speedup of a program using multiple processors is limited by the time needed for the sequential fraction of the program

Amdahl’s Law

● Sequential part of FTGS is last step in merge

● Can we distribute some part of the final merge?

Hash Partition + Interleave

● Send all stats for each unique term to the same thread based on a hash of the term

● Interleave merged terms

[Figure: streams TS 1-6, TS 7-12, and TS 13-18 combined into Term Stats 1-18]

[Animation: hash partition + interleave on the same three streams. Terms are split by hash into two partitions: atlanta, boston, and dallas in one; austin and chicago in the other. Each partition is merged independently, giving atlanta 49, boston 25, dallas 13 and austin 14, chicago 28. The two merged streams are then interleaved into the final sorted output: atlanta 49, austin 14, boston 25, chicago 28, dallas 13.]
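A Python sketch of hash partition + interleave on the example streams. Several choices here are assumptions for illustration: `zlib.crc32` as the hash, a dictionary per partition instead of a streaming merge, and partitions processed one after another rather than on separate threads.

```python
import heapq
import zlib
from collections import defaultdict


def partition_of(term, num_partitions):
    """Stable hash of the term; crc32 is an arbitrary choice for this sketch."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions


def hash_partition_merge(streams, num_partitions):
    """Merge term stats in independent partitions, then interleave the results."""
    # Step 1: route every (term, stat) pair to the partition that owns the term.
    partitions = [defaultdict(int) for _ in range(num_partitions)]
    for stream in streams:
        for term, stat in stream:
            partitions[partition_of(term, num_partitions)][term] += stat
    # Each partition could be handled by its own thread; here they run in turn.
    merged_partitions = [sorted(p.items()) for p in partitions]
    # Step 2: interleave. Partitions hold disjoint terms, so no summing is needed.
    return list(heapq.merge(*merged_partitions))


streams = [
    [("atlanta", 16), ("austin", 3), ("boston", 12), ("dallas", 5)],
    [("atlanta", 12), ("austin", 4), ("chicago", 19), ("dallas", 8)],
    [("atlanta", 21), ("austin", 7), ("boston", 13), ("chicago", 9)],
]
merged = hash_partition_merge(streams, num_partitions=2)
print(merged)
```

The final interleave only compares stream heads and never sums, so the remaining sequential step does less work per term than a full merge.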

Shard Distribution

● Lots of datasets for different event types
● Each dataset is split into one shard per (hour/day)
● Each shard has 2 replicas for fault tolerance
● How do we assign shards to machines?

Shard Distribution Considerations

● Space
● Load
● Hot Spots
● Adding/Removing machines

Homogeneous vs. Heterogeneous Systems

● Must decide how or if you will handle heterogeneous hardware

● Cannot balance for both space and load on heterogeneous hardware

[Figure: a 1 TB machine and a 3 TB machine]

Homogeneous vs. Heterogeneous

[Figure: 12 shards, 50% capacity used; labels mark a read hotspot and wasted space]

Hot Spots

When accessing any subset of a dataset, evenly spread the load across CPUs, drives, network cards

This is hard

Hot Spots

Maybe random is good enough?

On average about 10% more data read from the most loaded machine than the least

Two Choice Randomized Load Balancing

● 2 replicas of each shard to choose from
● Greedily choose the machine that currently has the least load from this client
● On average about 1% more data read from the most loaded machine than the least
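A small Python simulation contrasting a random replica choice with the greedy two choice rule. Everything about the setup is hypothetical (20 machines, 2000 shards, one unit of load per shard, replicas placed at random), so the printed numbers illustrate the effect and are not expected to reproduce the 10% and 1% figures from the slides.

```python
import random


def assign_shards(shard_replicas, two_choice):
    """Pick one replica machine per shard and return the per-machine load.

    shard_replicas: list of (machine_a, machine_b) pairs, one per shard.
    two_choice: if True, greedily pick the less loaded replica;
                if False, pick one of the two at random.
    """
    load = {}
    for a, b in shard_replicas:
        load.setdefault(a, 0)
        load.setdefault(b, 0)
        if two_choice:
            chosen = a if load[a] <= load[b] else b
        else:
            chosen = random.choice((a, b))
        load[chosen] += 1  # assumption: every shard costs one unit of load
    return load


random.seed(0)  # fixed seed so the sketch is repeatable
machines = list(range(20))
# Hypothetical placement: each shard has 2 replicas on distinct random machines.
shards = [tuple(random.sample(machines, 2)) for _ in range(2000)]

for name, flag in (("random", False), ("two choice", True)):
    load = assign_shards(shards, two_choice=flag)
    print(name, "max:", max(load.values()), "min:", min(load.values()))
```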

Rendezvous Hashing

● Assignment of a shard to machines determined only by the machines that exist in the cluster

● Hash all pairs of shard ID and machine ID and pick the largest two

Rendezvous Hashing

Shard ID: organic.2014-03-02T06:00:00

H(Shard ID + m1) = 0.592624
H(Shard ID + m2) = 0.294647
H(Shard ID + m3) = 0.736681
H(Shard ID + m4) = 0.647578
H(Shard ID + m5) = 0.835598

Rendezvous Hashing

[Figure: the five hashes on a number line from 0 to 1; from largest to smallest: m5, m3, m4, m1, m2]

Rendezvous Hashing

● No coordination required - deterministic algorithm used to determine assignment

● No centralized storage for shard to machine assignment
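A Python sketch of rendezvous hashing with two replicas per shard. The hash function (SHA-256 truncated to 64 bits) and the way the shard ID and machine ID are combined are assumptions for illustration, so the scores will not match the example values on the slides.

```python
import hashlib


def score(shard_id, machine_id):
    """Hash a (shard, machine) pair to a float in [0, 1).

    SHA-256 and the "+" separator are arbitrary choices for this sketch.
    """
    digest = hashlib.sha256(f"{shard_id}+{machine_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def assign(shard_id, machines, replicas=2):
    """Return the machines with the largest hashes for this shard."""
    ranked = sorted(machines, key=lambda m: score(shard_id, m), reverse=True)
    return ranked[:replicas]


machines = ["m1", "m2", "m3", "m4", "m5"]
shard = "organic.2014-03-02T06:00:00"
owners = assign(shard, machines)
print(owners)

# Removing a machine that does not own the shard leaves the assignment unchanged.
bystander = next(m for m in machines if m not in owners)
remaining = [m for m in machines if m != bystander]
print(assign(shard, remaining) == owners)
```

Any client that knows the list of machines can compute the same assignment locally, which is why no coordination or central lookup table is needed.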

Rendezvous Hashing

Expected max hash for a shard is

Probability that new machine will get shard
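The slide text names these two quantities without giving expressions for them. As a sketch under stated assumptions (each hash is an independent draw from Uniform(0, 1), and n, a symbol introduced here, is the number of machines already in the cluster), the standard order-statistics results are:

```latex
\mathbb{E}\left[\max_{1 \le i \le n} H_i\right] = \frac{n}{n+1},
\qquad
\Pr\left[\text{new machine has the largest hash}\right] = \frac{1}{n+1},
\qquad
\Pr\left[\text{new machine is in the top two}\right] = \frac{2}{n+1}.
```

The last expression covers the case of two replicas per shard. Under these assumptions, adding one machine to a cluster of n moves only about a 1/(n+1) share of each replica set.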


What was the weekly average query time in the last quarter from people doing the query “software”?

1. Query Regroup on query:software
2. Metric Regroup on time, width 7 days
3. Get Group Stats on query time and count, divide after summing

Ramses


What percent of jobsearch results pages are for page 2 and beyond?

1. Get Group Stats on count
2. Query Regroup on “-page:1”
3. Get Group Stats on count
4. Divide -page:1 count by total count

Ramses


What are the 5 most common queries in each country?

1. Multiterm Regroup on all values of country
2. Term Group Stats Iteration on query
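The two steps can be sketched in Python on a handful of hypothetical searches. The data, the group numbering, and the use of `collections.Counter` are all assumptions for illustration; a real session would run these primitives across shards on many machines.

```python
from collections import Counter, defaultdict

# Hypothetical jobsearch documents: (country, query) per search.
searches = [
    ("US", "software"), ("US", "sales"), ("US", "software"),
    ("CA", "nursing"), ("CA", "software"), ("CA", "nursing"),
    ("US", "nursing"), ("US", "software"),
]

# Step 1: multiterm regroup on country, one group per distinct value.
countries = sorted({country for country, _ in searches})
group_of = {country: i + 1 for i, country in enumerate(countries)}
groups = [group_of[country] for country, _ in searches]

# Step 2: term group stats on query, summing the count metric (always 1)
# for every (group, term) pair.
stats = defaultdict(Counter)
for doc_id, (_, query) in enumerate(searches):
    stats[groups[doc_id]][query] += 1

# Keep the 5 most common queries in each group.
top_queries = {
    country: stats[group_of[country]].most_common(5) for country in countries
}
print(top_queries)
```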

IQL

select count()

from jobsearch

‘2014-01-01’

‘2014-03-26’

group by country, query[5]

[Animation: the same query is repeated with its parts labeled in turn: Metrics, Dataset, Regroup, Term Group Stats]


Imhotep

Large Scale Analytics and Machine Learning

● Varint Decoding: High Performance Vector Instructions

● Stream Merging: Hash Partition + Interleave

● Shard Distribution: Rendezvous Hashing

We’re Open Sourcing Imhotep

How You Can Use Imhotep

Data Ingestion
● TSV Uploader
● Hadoop

Data Access
● Imhotep Primitives
● IQL

Next @IndeedEng Talk

Large Scale Interactive Analytics with Imhotep

Tom Bergman, Product Manager
Zak Cocos, Manager of Marketing Sciences

April 30, 2014

http://engineering.indeed.com/talks


