Faceting Optimizations for Solr: Presented by Toke Eskildsen, State & University Library, Denmark
OCTOBER 13-16, 2016 AUSTIN, TX

Page 2: Faceting optimizations for Solr

Toke Eskildsen
Search Engineer / Solr Hacker
State and University Library, Denmark
@TokeEskildsen / [email protected]

Page 3: Overview

• Web scale at the State and University Library, Denmark
• Field faceting 101
• Optimizations
  - Reuse
  - Tracking
  - Caching
  - Alternative counters

Page 4: Web scale for a small web

• Denmark
  - Consolidation circa 10th century
  - 5.6 million people
• Danish Net Archive (http://netarkivet.dk)
  - Constitution 2005
  - 20 billion items / 590TB+ raw data

Page 5: Indexing 20 billion web items / 590TB into Solr

• Solr index size is 1/9th of real data = 70TB
• Each shard holds 200M documents / 900GB
  - Shards built chronologically by a dedicated machine
  - Projected 80 shards
  - Current build time per shard: 4 days
  - Total build time is 20 CPU-core years
  - So far only 7.4 billion documents / 27TB in index

Page 6: Searching a 7.4 billion document / 27TB Solr index

• SolrCloud with 2 machines, each having
  - 16 HT-cores, 256GB RAM, 25 * 930GB SSD
  - 25 shards @ 900GB
  - 1 Solr/shard/SSD, Xmx=8g, Solr 4.10
  - Disk cache 100GB or < 1% of index size

Page 7: Danish Net Archive Search, late 2014

Page 8: String faceting 101 (single shard)

counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
for entry: priorityQueue
    result.add(resolveTerm(entry.ordinal), entry.count)

ord  term  counter
0    A     0
1    B     3
2    C     0
3    D     1006
4    E     1
5    F     1
6    G     0
7    H     0
8    I     3
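The counting loop above can be sketched as runnable Java. This is a minimal illustration, not Solr's implementation: the class name `SimpleFaceting`, the `docOrdinals` array standing in for `getOrdinals(docID)`, and the choice to skip terms with a count of 0 are all assumptions made for the example. Resolving ordinals back to terms is left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

/** Sketch of single-shard faceting over term ordinals (hypothetical class, not Solr code). */
final class SimpleFaceting {
    /**
     * @param docOrdinals docOrdinals[docID] holds the term ordinals of that document
     * @param hits        docIDs matched by the query
     * @param numOrdinals number of unique terms in the field
     * @param topX        number of facet entries to return
     * @return {ordinal, count} pairs, highest count first
     */
    static List<int[]> topTerms(int[][] docOrdinals, int[] hits, int numOrdinals, int topX) {
        int[] counter = new int[numOrdinals];      // one counter per unique term
        for (int docID : hits) {
            for (int ordinal : docOrdinals[docID]) {
                counter[ordinal]++;
            }
        }
        // Min-heap on count: only the topX largest entries are kept.
        PriorityQueue<int[]> queue = new PriorityQueue<>((a, b) -> Integer.compare(a[1], b[1]));
        for (int ordinal = 0; ordinal < counter.length; ordinal++) {
            if (counter[ordinal] == 0) {
                continue;                          // assumption: terms without hits are not reported
            }
            queue.add(new int[] {ordinal, counter[ordinal]});
            if (queue.size() > topX) {
                queue.poll();                      // drop the entry with the smallest count
            }
        }
        List<int[]> result = new ArrayList<>(queue);
        result.sort((a, b) -> Integer.compare(b[1], a[1]));
        return result;
    }
}
```

Note that the second loop visits every ordinal regardless of how many documents matched, which is the cost the later slides address.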

Page 9: Test setup 1 (easy start)

• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - Single shard 250M documents / 900GB
• URL field
  - Single String value
  - 200M unique terms
• 3 concurrent “users”
• Random search terms

Page 10: Vanilla Solr, single shard, 250M documents, 200M values, 3 users

Page 11: Allocating and dereferencing 800MB arrays

Page 12: Reuse the counter

counter = new int[ordinals]
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
<counter is no longer referenced and will be garbage collected at some point>

Page 13: Reuse the counter

counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)

Note: The JSON Facet API in Solr 5 already supports reuse of counters
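One way to picture `pool.getCounter()` and `pool.release(counter)` is a small pool of zeroed arrays. The sketch below is an assumption about how such a pool could look, not the code used in the talk or in Solr; the class name `CounterPool` and the `maxPooled` limit are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;

/** Sketch of a pool handing out zeroed int[] counters for reuse (hypothetical class). */
final class CounterPool {
    private final int numOrdinals;
    private final int maxPooled;
    private final Deque<int[]> idle = new ArrayDeque<>();

    CounterPool(int numOrdinals, int maxPooled) {
        this.numOrdinals = numOrdinals;
        this.maxPooled = maxPooled;
    }

    /** Returns a counter with all entries at 0, reusing an idle one when available. */
    synchronized int[] getCounter() {
        int[] counter = idle.poll();
        return counter != null ? counter : new int[numOrdinals];
    }

    /** Clears the counter and keeps it for the next request unless the pool is full. */
    void release(int[] counter) {
        Arrays.fill(counter, 0);           // cleared outside the lock so other threads are not blocked
        synchronized (this) {
            if (idle.size() < maxPooled) {
                idle.push(counter);
            }
        }
    }
}
```

The array is still cleared in full on release, so reuse avoids allocation and garbage collection but not the cost of touching every entry.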

Page 14: Using and clearing 800MB arrays

Page 15: Reusing counters vs. not doing so

Page 16: Reusing counters, now with readable visualization

Page 17: Reusing counters, now with readable visualization

Why does it always take more than 500ms?

Page 18: Iteration is not free

counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter[ordinal]++
for ordinal = 0 ; ordinal < counter.length ; ordinal++
    priorityQueue.add(ordinal, counter[ordinal])
pool.release(counter)

200M unique terms × 4 bytes per int = 800MB

Page 19: Tracking updated counters

ord      0     1     2     3     4     5     6     7     8
counter  0     0     0     0     0     0     0     0     0

tracker  N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A

Page 20: Tracking updated counters

ord      0     1     2     3     4     5     6     7     8
counter  0     0     0     1     0     0     0     0     0

tracker  3     N/A   N/A   N/A   N/A   N/A   N/A   N/A   N/A

Operations so far: counter[3]++

Page 21: Tracking updated counters

ord      0     1     2     3     4     5     6     7     8
counter  0     1     0     1     0     0     0     0     0

tracker  3     1     N/A   N/A   N/A   N/A   N/A   N/A   N/A

Operations so far: counter[3]++, counter[1]++

Page 22: Tracking updated counters

ord      0     1     2     3     4     5     6     7     8
counter  0     3     0     1     0     0     0     0     0

tracker  3     1     N/A   N/A   N/A   N/A   N/A   N/A   N/A

Operations so far: counter[3]++, counter[1]++, counter[1]++, counter[1]++

Page 23: Tracking updated counters

ord      0     1     2     3     4     5     6     7     8
counter  0     3     0     1006  1     1     0     0     3

tracker  3     1     8     4     5     N/A   N/A   N/A   N/A

Operations listed on the slide: counter[3]++, counter[1]++, counter[1]++, counter[1]++, counter[8]++, counter[8]++, counter[4]++, counter[8]++, counter[5]++, counter[1]++, counter[1]++, counter[1]++

Page 24: Tracking updated counters

counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        if counter[ordinal]++ == 0 && tracked < maxTracked
            tracker[tracked++] = ordinal
if tracked < maxTracked
    for i = 0 ; i < tracked ; i++
        priorityQueue.add(tracker[i], counter[tracker[i]])
else
    for ordinal = 0 ; ordinal < counter.length ; ordinal++
        priorityQueue.add(ordinal, counter[ordinal])

ord      0     1     2     3     4     5     6     7     8
counter  0     3     0     1006  1     1     0     0     3

tracker  3     1     8     4     5     N/A   N/A   N/A   N/A
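The tracking idea can be tried out with a small Java class. This is a sketch under assumptions: `TrackedCounter`, `nonZeroCounts` and `trackerUsable` are names invented here, and results are returned as `{ordinal, count}` pairs instead of being fed to a priority queue. The fallback rule mirrors the pseudocode: once the tracker is full, extraction scans the whole counter array.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of a counter that remembers which ordinals were touched (hypothetical class). */
final class TrackedCounter {
    private final int[] counter;
    private final int[] tracker;   // ordinals in the order they were first incremented
    private int tracked = 0;

    TrackedCounter(int numOrdinals, int maxTracked) {
        this.counter = new int[numOrdinals];
        this.tracker = new int[maxTracked];
    }

    void increment(int ordinal) {
        // The post-increment always runs; the ordinal is recorded only on its first hit.
        if (counter[ordinal]++ == 0 && tracked < tracker.length) {
            tracker[tracked++] = ordinal;
        }
    }

    /** True while the tracker is known to hold every touched ordinal. */
    boolean trackerUsable() {
        return tracked < tracker.length;
    }

    /** Returns {ordinal, count} pairs for all counters above 0. */
    List<int[]> nonZeroCounts() {
        List<int[]> out = new ArrayList<>();
        if (trackerUsable()) {
            for (int i = 0; i < tracked; i++) {          // visits only touched ordinals
                out.add(new int[] {tracker[i], counter[tracker[i]]});
            }
        } else {
            for (int ordinal = 0; ordinal < counter.length; ordinal++) {   // full scan fallback
                if (counter[ordinal] > 0) {
                    out.add(new int[] {ordinal, counter[ordinal]});
                }
            }
        }
        return out;
    }
}
```

With few matching documents the extraction cost depends on the number of touched ordinals instead of the number of unique terms in the field.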

Page 25: Tracking updated counters

Page 26: Distributed faceting

Phase 1) All shards perform faceting. The Merger calculates the top-X terms.
Phase 2) The term counts are requested from the shards that did not return them in phase 1. The Merger calculates the final counts for the top-X terms.

for term: fineCountRequest.getTerms()
    result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))
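The merger's side of the two phases can be sketched as follows. This is a simplified illustration, not Solr's distributed faceting code: the class `FacetMerger`, the use of one `Map<String, Integer>` per shard response, and the tie-breaking by term are assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the merger in two-phase distributed faceting (hypothetical class). */
final class FacetMerger {
    /** Phase 1: sum the counts reported by the shards and pick the top-X terms. */
    static List<String> topTerms(List<Map<String, Integer>> shardResponses, int topX) {
        Map<String, Integer> summed = new HashMap<>();
        for (Map<String, Integer> response : shardResponses) {
            response.forEach((term, count) -> summed.merge(term, count, Integer::sum));
        }
        List<String> terms = new ArrayList<>(summed.keySet());
        // Highest summed count first; ties are broken by term to keep the order deterministic.
        terms.sort((a, b) -> {
            int byCount = Integer.compare(summed.get(b), summed.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return new ArrayList<>(terms.subList(0, Math.min(topX, terms.size())));
    }

    /** Phase 2 planning: for each shard, the top terms it did not report in phase 1. */
    static List<List<String>> fineCountRequests(List<Map<String, Integer>> shardResponses,
                                                List<String> top) {
        List<List<String>> requests = new ArrayList<>();
        for (Map<String, Integer> response : shardResponses) {
            List<String> missing = new ArrayList<>();
            for (String term : top) {
                if (!response.containsKey(term)) {
                    missing.add(term);
                }
            }
            requests.add(missing);
        }
        return requests;
    }
}
```

Each shard then answers its list of missing terms with exact counts, which is the fine-counting step shown in the pseudocode above.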

Page 27: Test setup 2 (more shards, smaller field)

• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 9 shards @ 250M documents / 900GB
• domain field
  - Single String value
  - 1.1M unique terms per shard
• 1 concurrent “user”
• Random search terms

Page 28: Pit of Pain™ (or maybe “Horrible Hill”?)

Page 29: Fine counting can be slow

Phase 1: Standard faceting
Phase 2:

for term: fineCountRequest.getTerms()
    result.add(term, searcher.numDocs(query(field:term), base.getDocIDs()))

Page 30: Alternative fine counting

counter = pool.getCounter()
for docID: result.getDocIDs()
    for ordinal: getOrdinals(docID)
        counter.increment(ordinal)
(the counting above is the same as in phase 1 and yields the counter table below)
for term: fineCountRequest.getTerms()
    result.add(term, counter.get(getOrdinal(term)))

ord  counter
0    0
1    3
2    0
3    1006
4    1
5    1
6    0
7    0
8    3
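The lookup-based fine count can be sketched in Java as below. It is an illustration under assumptions: `FineCounter` is a made-up class, `docOrdinals` stands in for `getOrdinals(docID)`, and `getOrdinal(term)` is modelled as a binary search over a sorted term array whose index is taken to be the ordinal.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch of fine counting by re-running the phase 1 count and looking terms up (hypothetical). */
final class FineCounter {
    /**
     * @param sortedTerms unique terms of the field in sorted order; the index is the ordinal
     * @param docOrdinals docOrdinals[docID] holds the term ordinals of that document
     * @param hits        docIDs matched by the query
     * @param requested   terms the merger asked exact counts for
     */
    static Map<String, Integer> fineCount(String[] sortedTerms, int[][] docOrdinals,
                                          int[] hits, List<String> requested) {
        int[] counter = new int[sortedTerms.length];
        for (int docID : hits) {                       // same counting as in phase 1
            for (int ordinal : docOrdinals[docID]) {
                counter[ordinal]++;
            }
        }
        Map<String, Integer> result = new LinkedHashMap<>();
        for (String term : requested) {
            int ordinal = Arrays.binarySearch(sortedTerms, term);
            // A term unknown to this shard has a count of 0.
            result.put(term, ordinal >= 0 ? counter[ordinal] : 0);
        }
        return result;
    }
}
```

The cost is one counting pass plus one cheap lookup per requested term, instead of one search per requested term.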

Page 31: Using cached counters from phase 1 in phase 2

counter = pool.getCounter(key)
for term: query.getTerms()
    result.add(term, counter.get(getOrdinal(term)))
pool.release(counter)
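A keyed lookup such as `pool.getCounter(key)` implies that filled counters from phase 1 are kept around for a while. The sketch below shows one possible shape for that, as a small least-recently-used cache. It is an assumption, not the structure used in the talk: `CounterCache`, the String key, and the eviction policy are all invented for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of an LRU cache of filled counters, keyed by e.g. query plus field (hypothetical). */
final class CounterCache {
    private final Map<String, int[]> filled;

    CounterCache(final int maxEntries) {
        // An access-ordered LinkedHashMap evicts the least recently used counter when full.
        this.filled = new LinkedHashMap<String, int[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, int[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    /** Called after phase 1 with the counter that was just filled. */
    synchronized void put(String key, int[] counter) {
        filled.put(key, counter);
    }

    /** Called in phase 2; returns null if the counter was evicted and must be recomputed. */
    synchronized int[] get(String key) {
        return filled.get(key);
    }
}
```

On a cache miss the shard can fall back to recounting, so the cache only needs to hold counters for the short gap between the two phases.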

Page 32: Pit of Pain™ practically eliminated

Page 33: Pit of Pain™ practically eliminated

Stick figure CC BY-NC 2.5 Randall Munroe xkcd.com

Page 34: Test setup 3 (more shards, more fields)

• Solr setup
  - 16 HT-cores, 256GB RAM, SSD
  - 23 shards @ 250M documents / 900GB
• Faceting on 6 fields
  - url: ~200M unique terms / shard
  - domain & host: ~1M unique terms each / shard
  - type, suffix, year: < 1000 unique terms / shard

Page 35: 1 machine, 7 billion documents / 23TB total index, 6 facet fields

Page 36: High-cardinality can mean different things

Single shard / 250,000,000 docs / 900GB

Field   References     Max docs/term  Unique terms
domain  250,000,000    3,000,000      1,100,000
url     250,000,000    56,000         200,000,000
links   5,800,000,000  5,000,000      610,000,000

2440 MB / counter (610,000,000 unique terms × 4-byte int, i.e. the links field)

Page 37: Different distributions

Panels: domain (1.1M), url (200M), links (600M)
Annotations: High max, Low max, Very long tail, Short tail

Page 38: Theoretical lower limit per counter: log2(max_count) bits (rounded up; exactly ceil(log2(max_count + 1)))

Examples shown: max=1, max=7, max=2047, max=3, max=63
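The bit widths for the example maxima can be computed with a few lines of Java. This is a small worked sketch; `CounterBits` and its method names are invented for the example. It uses the exact form ceil(log2(max + 1)), with a floor of 1 bit.

```java
/** Sketch: bits needed per counter and the resulting memory use (hypothetical helper). */
final class CounterBits {
    /** Bits needed to represent every value in 0..maxCount, at least 1. */
    static int bitsRequired(long maxCount) {
        if (maxCount < 0) {
            throw new IllegalArgumentException("maxCount must be >= 0");
        }
        // 64 minus the number of leading zero bits is the position of the highest set bit.
        return Math.max(1, 64 - Long.numberOfLeadingZeros(maxCount));
    }

    /** Bytes needed for the given number of counters at a fixed bit width, rounded up. */
    static long packedBytes(long ordinals, int bits) {
        return (ordinals * bits + 7) / 8;
    }
}
```

For the maxima on the slide this gives 1, 3, 11, 2 and 6 bits, and `packedBytes(200_000_000L, 32)` reproduces the 800MB of a plain `int[]` for 200M unique terms.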

Page 39: int vs. PackedInts

Structure                     domain      url           links
int[ordinals]                 4 MB        780 MB        2350 MB
PackedInts(ordinals, maxBPV)  3 MB (72%)  420 MB (53%)  1760 MB (75%)
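To show what packing counters at a fixed number of bits per value involves, here is a minimal fixed-width packed array in Java. It is a sketch written for this transcript and does not use or reproduce Lucene's `PackedInts` API; the class `PackedCounter` and its methods are assumptions.

```java
/** Sketch of counters stored at a fixed bit width inside a long[] (not Lucene's PackedInts). */
final class PackedCounter {
    private final long[] blocks;
    private final int bitsPerValue;
    private final long mask;

    PackedCounter(int size, int bitsPerValue) {
        if (bitsPerValue < 1 || bitsPerValue > 63) {
            throw new IllegalArgumentException("bitsPerValue must be in 1..63");
        }
        this.bitsPerValue = bitsPerValue;
        this.mask = (1L << bitsPerValue) - 1;
        this.blocks = new long[(int) (((long) size * bitsPerValue + 63) / 64)];
    }

    long get(int index) {
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        long value = blocks[block] >>> shift;
        if (shift + bitsPerValue > 64) {               // value continues in the next block
            value |= blocks[block + 1] << (64 - shift);
        }
        return value & mask;
    }

    void set(int index, long value) {
        if ((value & ~mask) != 0) {
            throw new IllegalArgumentException("value does not fit in " + bitsPerValue + " bits");
        }
        long bitPos = (long) index * bitsPerValue;
        int block = (int) (bitPos >>> 6);
        int shift = (int) (bitPos & 63);
        blocks[block] = (blocks[block] & ~(mask << shift)) | (value << shift);
        if (shift + bitsPerValue > 64) {
            int inFirst = 64 - shift;                  // bits that fitted in the first block
            blocks[block + 1] = (blocks[block + 1] & ~(mask >>> inFirst)) | (value >>> inFirst);
        }
    }

    /** Throws if the counter would exceed the maximum for the chosen bit width. */
    void increment(int index) {
        set(index, get(index) + 1);
    }

    long memoryBytes() {
        return blocks.length * 8L;
    }
}
```

Every counter gets the width needed for the largest count in the field, which is why a long-tailed field with a single high maximum saves little with this layout.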

Page 40: n-plane-z counters

Figure labels: Platonic ideal, Harsh reality; Plane d, Plane c, Plane b, Plane a

Page 41

Figure labels: Plane d, Plane c, Plane b, Plane a

L: 0 ≣ 000000

Page 42

Figure labels: Plane d, Plane c, Plane b, Plane a

L: 0 ≣ 000000
L: 1 ≣ 000001

Page 43

Figure labels: Plane d, Plane c, Plane b, Plane a

L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011

Page 44

Figure labels: Plane d, Plane c, Plane b, Plane a

L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101

Page 45

Figure labels: Plane d, Plane c, Plane b, Plane a

L: 0 ≣ 000000
L: 1 ≣ 000001
L: 2 ≣ 000011
L: 3 ≣ 000101
L: 4 ≣ 000111
L: 5 ≣ 001001
L: 6 ≣ 001011
L: 7 ≣ 001101
...
L: 12 ≣ 010111

Page 46: Comparison of counter structures

Structure                     domain      url           links
int[ordinals]                 4 MB        780 MB        2350 MB
PackedInts(ordinals, maxBPV)  3 MB (72%)  420 MB (53%)  1760 MB (75%)
n-plane-z                     1 MB (30%)  66 MB (8%)    311 MB (13%)

Page 47: Speed comparison

Page 48: I could go on about

• Threaded counting
• Heuristic faceting
• Fine count skipping
• Counter capping
• Monotonically increasing tracker for n-plane-z
• Regexp filtering

Page 49: What about huge result sets?

• Rare for explorative term-based searches
• Common for batch extractions
• Threading works poorly as #shards > #CPUs
• But how bad is it really?

Page 50: Really bad! 8 minutes

Page 51: Heuristic faceting

• Use sampling to guess top-X terms
  - Re-use the existing tracked counters
  - 1:1000 sampling seems usable for the field links, which has 5 billion references per shard
• Fine-count the guessed terms
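The sample-then-verify idea can be sketched in Java as follows. The details are assumptions made for illustration: `HeuristicFaceting` is a made-up class, sampling takes every `step`-th hit, `overProvision` multiplies the number of guessed candidates, and the exact count is supplied by the caller as a function, standing in for the per-term fine count.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntUnaryOperator;

/** Sketch of heuristic faceting: guess candidates from a sample, then fine-count them. */
final class HeuristicFaceting {
    /** Counts ordinals for every step-th hit only; step = 1 counts all hits. */
    static int[] sampleCount(int[][] docOrdinals, int[] hits, int numOrdinals, int step) {
        int[] counter = new int[numOrdinals];
        for (int i = 0; i < hits.length; i += step) {
            for (int ordinal : docOrdinals[hits[i]]) {
                counter[ordinal]++;
            }
        }
        return counter;
    }

    /**
     * @param sampled       counter filled from the sample
     * @param topX          number of facet entries wanted
     * @param overProvision candidates guessed = topX * overProvision
     * @param fineCount     returns the exact count for an ordinal (assumed to be provided)
     * @return {ordinal, exactCount} pairs, highest exact count first
     */
    static List<int[]> heuristicTop(int[] sampled, int topX, int overProvision,
                                    IntUnaryOperator fineCount) {
        List<Integer> candidates = new ArrayList<>();
        for (int ordinal = 0; ordinal < sampled.length; ordinal++) {
            if (sampled[ordinal] > 0) {
                candidates.add(ordinal);
            }
        }
        candidates.sort((a, b) -> Integer.compare(sampled[b], sampled[a]));
        int guessed = Math.min(topX * overProvision, candidates.size());
        List<int[]> exact = new ArrayList<>();
        for (int ordinal : candidates.subList(0, guessed)) {
            exact.add(new int[] {ordinal, fineCount.applyAsInt(ordinal)});
        }
        exact.sort((a, b) -> Integer.compare(b[1], a[1]));
        return new ArrayList<>(exact.subList(0, Math.min(topX, exact.size())));
    }
}
```

The returned counts are exact for the returned terms, but a term that the sample misses entirely cannot appear in the result, which is the validity risk that guessing more candidates reduces.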

Page 52: Over-provisioning helps validity

Page 53: 10 seconds < 8 minutes

Page 54: Web scale for a small web

• Denmark
  - Consolidation circa 10th century
  - 5.6 million people
• Danish Net Archive (http://netarkivet.dk)
  - Constitution 2005
  - 20 billion items / 590TB+ raw data

Page 55: Never enough time, but talk to me about

• Threaded counting
• Monotonically increasing tracker for n-plane-z
• Regexp filtering
• Fine count skipping
• Counter capping