the evolution of lucene & solr numerics from strings to points: presented by steve rowe,...

OCTOBER 11-14, 2016 BOSTON, MA

Upload: lucidworks

Post on 07-Jan-2017


Page 1: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

October 11-14, 2016 • Boston, MA

Page 2: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

The Evolution of Lucene & Solr Numerics from Strings to Points

Steve Rowe Senior Software Engineer, Lucidworks

@steven_a_rowe

Page 3: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Agenda

1. {Long time ago, yesterday}: History
2. Today: Benchmarks
3. Tomorrow: Future developments

Not on the agenda: geospatial; stats/analytics; streaming expressions

Page 4: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

Lucene 0.01 (March 2000): Modified UTF-8 terms
Lucene 1.2 (June 2002)
Lucene 1.4 (July 2004): FieldCache
Lucene 1.9 (Feb. 2006): NumberTools
Solr 1.1 (Dec. 2006): 1. BCDTypeField 2. SortableTypeField 3. TypeField
Lucene 2.4 (Oct. 2008): UTF-8 terms
Lucene 2.9 (Sept. 2009) and Solr 1.4 (Nov. 2009): Trie numerics
Lucene/Solr 4.0 (Oct. 2012): 1. Flexible indexing 2. Binary terms 3. DocValues 4. -=NumberTools
Solr 5.0 (Feb. 2015): 1. -=BCDTypeField 2. -=SortableTypeField 3. -=TypeField
Lucene/Solr 5.2 (June 2015): Auto-prefix terms
Lucene/Solr 6.0 (Apr. 2016): 1. Dimensional Points 2. Trie deprecated
Lucene/Solr 6.2 (Aug. 2016): -=Auto-prefix terms

Page 5: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

Binary terms

1. Modified UTF-8: null is 2 bytes (C0 80); UTF-16 surrogate code units are 3 bytes each; length is in UTF-16 chars
2. String -> byte sequence
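The modified UTF-8 rules on this slide can be made concrete with a small Python sketch. The function name `modified_utf8` is made up for illustration and this is not Lucene code; it only applies the stated rules (null as C0 80, every UTF-16 code unit encoded on its own) so the difference from standard UTF-8 is visible.

```python
def modified_utf8(text: str) -> bytes:
    """Encode text with the modified UTF-8 rules described on the slide.

    Hypothetical helper for illustration only: each UTF-16 code unit is
    encoded independently, so a surrogate pair takes 3 + 3 bytes instead
    of the 4 bytes standard UTF-8 would use.
    """
    out = bytearray()
    units = text.encode("utf-16-be")  # two bytes per UTF-16 code unit
    for i in range(0, len(units), 2):
        cu = (units[i] << 8) | units[i + 1]
        if cu == 0:
            out += b"\xc0\x80"  # null is never written as a single 0x00 byte
        elif cu < 0x80:
            out.append(cu)
        elif cu < 0x800:
            out.append(0xC0 | (cu >> 6))
            out.append(0x80 | (cu & 0x3F))
        else:  # includes surrogate code units D800..DFFF
            out.append(0xE0 | (cu >> 12))
            out.append(0x80 | ((cu >> 6) & 0x3F))
            out.append(0x80 | (cu & 0x3F))
    return bytes(out)


null_bytes = modified_utf8("\x00")
emoji_modified = modified_utf8("\U0001F600")      # one supplementary character
emoji_standard = "\U0001F600".encode("utf-8")
```

Here the supplementary character takes 6 bytes in modified UTF-8 versus 4 in standard UTF-8, which is one reason a plain "String -> byte sequence" mapping is simpler to reason about.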

Page 6: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

TypeField

1. In the beginning, everything was a String
2. Solr Int/Long/etc.: base 10 variable-width String
3. To make string-encoded integers sortable, left-zero-pad to fixed width, e.g. 15 -> 000015
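A quick Python sketch of why the zero-padding matters. The helper name `zero_pad` and the width of 6 are illustrative assumptions taken from the slide's "15 -> 000015" example; negative numbers are deliberately left out.

```python
def zero_pad(n: int, width: int = 6) -> str:
    """Left-zero-pad a non-negative integer to a fixed width (illustrative helper)."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    s = str(n)
    if len(s) > width:
        raise ValueError("value does not fit in the chosen width")
    return s.zfill(width)


values = [15, 2, 100]

# Variable-width strings sort lexicographically, not numerically.
plain_order = sorted(str(v) for v in values)

# Fixed-width strings sort in the same order as the numbers they encode.
padded_order = sorted(zero_pad(v) for v in values)
```

With variable-width strings, "100" sorts before "15" and "2"; with padding, string order and numeric order agree.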

Page 7: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

NumberTools, BCDTypeField, SortableTypeField

1. NumberTools: base 36 long
2. BCD: base 10k int/long
3. Sortable Int/Float/Long/Double/Date: 32-bit = 12 bits/char; 64-bit = 14 bits/char
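As a rough illustration of the "base 36 long" idea, the Python sketch below encodes a signed 64-bit value as a fixed-width base-36 string whose lexicographic order matches numeric order. It is written in the spirit of NumberTools, not copied from it: the function names, the sign-prefix scheme, and the width constant are assumptions made for this example.

```python
DIGITS = "0123456789abcdefghijklmnopqrstuvwxyz"  # ASCII order matches digit value
WIDTH = 13  # 36**13 > 2**63, so 13 base-36 digits hold any 63-bit magnitude


def to_base36(n: int) -> str:
    """Base-36 representation of a non-negative integer."""
    if n == 0:
        return "0"
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out))


def long_to_sortable_string(v: int) -> str:
    """Fixed-width, sign-prefixed base-36 string (illustrative, not Lucene's code)."""
    if not -(2**63) <= v < 2**63:
        raise ValueError("value is outside the signed 64-bit range")
    if v < 0:
        prefix = "-"   # '-' sorts before '0' in ASCII, so negatives come first
        v += 2**63     # shift negatives into [0, 2**63) while preserving their order
    else:
        prefix = "0"
    return prefix + to_base36(v).rjust(WIDTH, "0")


samples = [7, -(2**63), 0, -1, 2**63 - 1, -42, 1000]
by_string = sorted(samples, key=long_to_sortable_string)
```

Sorting by the encoded string gives the same order as sorting the numbers, which is the property every string-based numeric encoding on this slide is after.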

Page 8: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

FieldCache: uninverted per-doc array of native field values, constructed at search time
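"Uninverting" can be pictured with a toy Python sketch. The dictionary-based postings structure and the `uninvert` function are hypothetical stand-ins, not Lucene's data structures; the point is only that a term -> documents mapping is turned into a document -> value array that sorting can use.

```python
def uninvert(postings, max_doc, parse=int, missing=0):
    """Build a per-document value array from a term -> doc ids mapping.

    Toy model of what a field cache does at search time; `postings`,
    `parse` and `missing` are illustrative assumptions.
    """
    values = [missing] * max_doc
    for term, doc_ids in postings.items():
        value = parse(term)          # decode the string term back to a native value
        for doc_id in doc_ids:
            values[doc_id] = value
    return values


# Inverted index for a zero-padded numeric field: term -> documents containing it.
postings = {"000015": [0, 3], "000002": [1], "000100": [2]}

per_doc = uninvert(postings, max_doc=4)

# Sorting documents by field value is now a plain array lookup per document.
doc_order = sorted(range(4), key=per_doc.__getitem__)
```

The cost is that the whole array has to be built by walking every term the first time it is needed.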

Page 9: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

Trie numerics: introduced in Lucene 2.9 (Sept. 2009) and Solr 1.4 (Nov. 2009); deprecated in Lucene/Solr 6.0 (Apr. 2016)

Page 10: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Trie numerics

From http://www.thetaphi.de/share/Schindler-TrieRange.ppt:

[Figure: trie of indexed terms with the query range marked. Full-precision terms: 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644. Prefix terms: 42, 44, 52, 63, 64 and 4, 5, 6.]

1. Fast range queries
2. Fewer terms required than term range queries
3. 7-bit encoded to minimize disk footprint
4. Adjustable "precisionStep": number of bits to shift when generating synthetic terms
5. Synthetic prefix terms created by stripping low bits and prepending the shift amount in the first byte
   1. E.g.: for 423, synthetic terms 42 and 4 are also indexed
   2. When searching the range [423, 642], the lowest-precision terms covering the range are used: 423, 44, 5, 63, 641, 642 (6 terms), versus the 11 terms required by a term range query.
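The [423, 642] example can be reproduced with a small Python sketch. It follows the slide's decimal simplification (one digit per level) rather than the real bit-based encoding with a precisionStep and a shift byte, and the function names `index_terms` and `split_range` are made up for this illustration.

```python
def index_terms(value, digits=3):
    """Terms indexed for one value as (shift, prefix): full precision plus each shorter prefix."""
    return [(shift, value // 10**shift) for shift in range(digits)]


def split_range(lo, hi, digits=3):
    """Cover [lo, hi] with the coarsest possible (shift, prefix) terms."""
    cover = []
    shift = 0
    while lo <= hi:
        has_lower = lo % 10 != 0   # lo does not start a block of ten
        has_upper = hi % 10 != 9   # hi does not end a block of ten
        next_lo = lo // 10 + (1 if has_lower else 0)
        next_hi = hi // 10 - (1 if has_upper else 0)
        if shift == digits - 1 or next_lo > next_hi:
            # Nothing coarser fits: enumerate the rest at this precision.
            cover.extend((shift, p) for p in range(lo, hi + 1))
            break
        if has_lower:
            cover.extend((shift, p) for p in range(lo, lo // 10 * 10 + 10))
        if has_upper:
            cover.extend((shift, p) for p in range(hi // 10 * 10, hi + 1))
        lo, hi, shift = next_lo, next_hi, shift + 1
    return cover


# Values from the slide's figure.
values = [421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644]
indexed = {term for v in values for term in index_terms(v)}

# Only covering terms that actually exist in the index need to be visited.
used = [term for term in split_range(423, 642) if term in indexed]

# A plain term range query visits every full-precision term in the range.
plain = [v for v in values if 423 <= v <= 642]
```

In this toy index, `used` holds the six terms named on the slide (423, 641, 642 at full precision; 44 and 63 one level up; 5 at the top level), while `plain` holds eleven.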

Page 11: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

DocValues: field cache constructed at index time

Page 12: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

Flexible indexing: simplify/enable new index formats via modularization

Page 13: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

1. Auto-prefix terms: generalization of the trie numeric strategy, in the block tree terms dictionary.
2. Intended to replace trie numerics: LUCENE-5966
3. Removed in favor of points.

Page 14: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Yesterday

Dimensional Points

Page 15: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Dimensional Points

1. All point values in a field have the same fixed width (max 128 bits)
2. 1D - 8D
3. Block k-d tree
4. Points are sorted; recursively partitioned along the longest dimension; then, at a target cardinality, the "leaf block" is written out.
5. An in-memory binary tree index points to the leaf blocks.
6. Adaptive optimal partitioning (versus trie numerics, which generates terms irrespective of local density).

1-8 dimensions

1-16 bytes per dimension
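The partitioning described on this slide can be sketched in a few lines of Python. This is a toy in-memory model, not Lucene's block k-d tree: the names `build_leaves`, `bounding_box` and `range_search` and the tiny leaf size are assumptions, and the leaves are scanned linearly instead of being reached through a binary tree index.

```python
def build_leaves(points, max_leaf_size=2):
    """Recursively split on the widest dimension until blocks are small enough."""
    if len(points) <= max_leaf_size:
        return [points]
    dims = range(len(points[0]))
    # Pick the dimension with the largest spread of values.
    split_dim = max(dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points))
    ordered = sorted(points, key=lambda p: p[split_dim])
    mid = len(ordered) // 2
    return build_leaves(ordered[:mid], max_leaf_size) + build_leaves(ordered[mid:], max_leaf_size)


def bounding_box(block):
    """Per-dimension minimum and maximum of a leaf block."""
    dims = range(len(block[0]))
    return ([min(p[d] for p in block) for d in dims],
            [max(p[d] for p in block) for d in dims])


def range_search(leaves, lo, hi):
    """Collect points inside the box [lo, hi], skipping blocks that cannot match."""
    dims = range(len(lo))
    hits = []
    for block in leaves:
        bmin, bmax = bounding_box(block)
        if any(bmax[d] < lo[d] or bmin[d] > hi[d] for d in dims):
            continue  # the whole block lies outside the query box
        hits.extend(p for p in block if all(lo[d] <= p[d] <= hi[d] for d in dims))
    return hits


points = [(1, 9), (2, 3), (4, 1), (5, 8), (7, 2), (8, 7), (9, 5), (3, 6)]
leaves = build_leaves(points)
hits = range_search(leaves, lo=(2, 2), hi=(8, 7))
```

Because splits follow the data, dense regions end up with more, smaller-extent blocks, which is the "adaptive" contrast with trie numerics drawn on the slide.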

Page 16: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Dimensional Points

1. Lucene-only - no Solr support yet
2. Optimized for query types: range, distance, nearest-neighbor, and point-in-polygon
3. Multi-valued support
4. Not supported: value retrieval (store if you need this)
5. Not supported: sorting or faceting (use DocValues for these)

Page 17: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Dimensional Points

1D Native
  Implementations: LongPoint, IntPoint, DoublePoint, FloatPoint, BinaryPoint
  Supported queries: 1. any in set 2. exact 3. range

1D 128-bit
  Implementations: BigIntegerPoint, InetAddressPoint
  Supported queries: 1. any in set 2. exact 3. range

1D-4D Range
  Implementations: LongRangeField, IntRangeField, DoubleRangeField, FloatRangeField
  Supported queries (given a range): 1. intersects 2. contains 3. within

2D Geospatial
  Implementations: LatLonPoint
  Supported queries: 1. within box 2. within distance 3. within polygon 4. nearest neighbor

3D Geospatial
  Implementations: Geo3DPoint
  Supported queries: 1. within shape

Page 18: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Today

Mike McCandless benchmarked pre-6.0 1D points and found*:

1. Points were substantially faster at both index and query time than the equivalent Trie numeric type.
2. Index size was smaller with points.
3. Query-time heap usage with points was much lower.

Adrien Grand re-ran Mike's benchmark against a Lucene 6.2 snapshot** and drew similar conclusions: "36% faster at query time, 71% faster at index time and used 66% less disk and 85% less memory"

* https://www.elastic.co/blog/lucene-points-6.0
** https://www.elastic.co/blog/searching-numb3rs-in-5.0

Page 19: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Today

I benchmarked fixed range queries against trie and point long, int and double fields in an index of 25 million NYC taxi trips, using modified tools from luceneutil.

I created an index with three versions of each long, int and double field:

1. Trie numerics with the default precision step
2. Point fields
3. Trie numerics with a precision step the same width as the numbers - this should provide a maximum performance threshold for String ranges.

Page 20: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Today

                        Indexing time   Index size
Points                  31s             1.2GiB
Trie                    53s             1.6GiB
Single-precision trie   19s             0.7GiB

The index has 24 fields defined: 6 string fields, 1 text field, 2 long fields, 1 int field, and 14 double fields.

Page 21: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Today

field               cardinality   hits     type            query time
passenger_count     10            7.5M     IntPoint        86ms
                                           TrieInt/8       114ms
                                           TrieInt/32      116ms
pick_up_date_time   4.1M          10.4M    LongPoint       69ms
                                           TrieLong/16     105ms
                                           TrieLong/64     365ms
trip_distance       4,754         9.6M     DoublePoint     116ms
                                           TrieDouble/16   92ms
                                           TrieDouble/64   105ms

Page 22: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

Tomorrow

1. Add support for PointFields in Solr: SOLR-8396
2. David Smiley will be working on adding a Solr adaptor for LatLonPoint in the near future.
3. Trie numerics will be removed from Lucene in 7.0, but Solr may take ownership to provide a longer backcompat timeframe.
4. FieldCache may be removed from Lucene / moved to Solr: LUCENE-7283

Page 23: The Evolution of Lucene & Solr Numerics from Strings to Points: Presented by Steve Rowe, Lucidworks

References

1. Numeric Range Queries with Lucene TrieRange: http://www.thetaphi.de/share/Schindler-TrieRange.ppt
2. Generic XML-based Framework for Metadata Portals: http://epic.awi.de/17813/1/Sch2007br.pdf
3. Fun with flexible indexing: http://blog.mikemccandless.com/2010/10/fun-with-flexible-indexing.html
4. Searching numb3rs in 5.0: https://www.elastic.co/blog/searching-numb3rs-in-5.0
5. Multi-dimensional points, coming in Apache Lucene 6.0: https://www.elastic.co/blog/lucene-points-6.0
6. Bkd-tree: A Dynamic Scalable kd-tree: http://www.madalgo.au.dk/~large/Papers/bkdsstd03.ps
7. Luceneutil: https://github.com/mikemccand/luceneutil/