October 11-14, 2016 • Boston, MA
The Evolution of Lucene & Solr Numerics from Strings to Points
Steve Rowe, Senior Software Engineer, Lucidworks
@steven_a_rowe
Agenda
1. {Long time ago, yesterday}: History
2. Today: Benchmarks
3. Tomorrow: Future developments
Not on the agenda: geospatial; stats/analytics; streaming expressions
Yesterday
Timeline ("-=X" marks the removal of X):
Lucene 0.01, March 2000: Modified UTF-8 terms
Lucene 1.2, June 2002
Lucene 1.4, July 2004: FieldCache
Lucene 1.9, Feb. 2006: NumberTools
Solr 1.1, Dec. 2006: 1. BCDTypeField 2. SortableTypeField 3. TypeField
Lucene 2.4, Oct. 2008: UTF-8 terms
Lucene 2.9, Sept. 2009 / Solr 1.4, Nov. 2009: Trie numerics
Lucene/Solr 4.0, Oct. 2012: 1. Flexible indexing 2. Binary terms 3. DocValues 4. -=NumberTools
Solr 5.0, Feb. 2015: 1. -=BCDTypeField 2. -=SortableTypeField 3. -=TypeField
Lucene/Solr 5.2, June 2015: Auto-prefix terms
Lucene/Solr 6.0, Apr. 2016: 1. Dimensional Points 2. Trie deprecated
Lucene/Solr 6.2, Aug. 2016: -=Auto-prefix terms
Yesterday
Binary terms
1. Modified UTF-8: null is 2 bytes (C0 80); UTF-16 surrogate code units are 3 bytes each; length is counted in UTF-16 chars
2. String -> byte sequence
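The modified UTF-8 rules listed above can be checked with a small sketch. It is written in Python purely for brevity; the function name `modified_utf8` is hypothetical, and the encoder follows the JVM's modified UTF-8 rules as summarized on the slide rather than any Lucene code.

```python
import struct

def modified_utf8(s):
    """Encode s as modified UTF-8; return (bytes, length in UTF-16 chars)."""
    # Split the string into UTF-16 code units, keeping surrogates as-is.
    raw = s.encode("utf-16-be", "surrogatepass")
    units = struct.unpack(">%dH" % (len(raw) // 2), raw)
    out = bytearray()
    for unit in units:
        if unit == 0:
            # Null is written as the 2-byte sequence C0 80, never as a 00 byte.
            out += b"\xc0\x80"
        else:
            # Every other code unit, including each surrogate, is encoded on
            # its own, so a surrogate pair costs 3 + 3 bytes instead of 4.
            out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out), len(units)

smiley = "\U0001F600"                 # one code point, two UTF-16 code units
print(modified_utf8("\x00"))          # null: 2 bytes
print(modified_utf8(smiley))          # 6 bytes, length 2
print(smiley.encode("utf-8"))         # standard UTF-8 needs only 4 bytes
```

ASCII text is unaffected: the two encodings differ only for null and for characters outside the Basic Multilingual Plane.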
Yesterday
1. In the beginning, everything was a String
2. Solr Int/Long/etc.: base 10 variable-width String
3. To make string-encoded integers sortable, left-zero-pad to fixed width, e.g. 15 -> 000015
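A minimal sketch of why the padding matters, assuming non-negative integers and a width of 6 as in the example above (the helper name `pad` is hypothetical). Strings compare character by character, so variable-width decimal strings sort in the wrong order until they are padded to the same width.

```python
def pad(n, width=6):
    """Left-zero-pad a non-negative integer to a fixed-width decimal string."""
    if n < 0:
        raise ValueError("this sketch only handles non-negative integers")
    return str(n).zfill(width)

values = [15, 7, 100, 2]

# Variable-width strings sort lexicographically, not numerically.
print(sorted(str(v) for v in values))   # ['100', '15', '2', '7']

# Fixed-width strings sort in numeric order.
print(sorted(pad(v) for v in values))   # ['000002', '000007', '000015', '000100']
```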
Yesterday
1. NumberTools: base 36 long
2. BCD: base 10k int/long
3. Sortable Int/Float/Long/Double/Date: 32-bit=12 bits/char; 64-bit=14 bits/char
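A larger radix packs more bits into each character, which is the point of all three encodings above. The following is a sketch of the base 36 idea only, for non-negative values; it is not NumberTools' exact format, and `to_sortable_base36` and `WIDTH` are hypothetical names.

```python
import string

# 36 symbols whose ASCII order matches their numeric order: 0-9, then a-z.
DIGITS = string.digits + string.ascii_lowercase
WIDTH = 13  # 36**13 > 2**63, so 13 characters cover every non-negative long

def to_sortable_base36(n):
    """Encode a non-negative 64-bit value as a fixed-width base 36 string."""
    if not 0 <= n < 2**63:
        raise ValueError("this sketch only handles 0 <= n < 2**63")
    out = []
    while n:
        n, r = divmod(n, 36)
        out.append(DIGITS[r])
    return "".join(reversed(out)).rjust(WIDTH, "0")

print(to_sortable_base36(35))          # 000000000000z
print(to_sortable_base36(36))          # 0000000000010
print(len(str(2**63 - 1)))             # 19 characters in zero-padded base 10
```

Under these assumptions a long needs 13 characters instead of the 19 that zero-padded base 10 requires, while string order still matches numeric order.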
Yesterday
FieldCache: uninverted per-doc array of native field values, constructed at search time
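To illustrate what "uninverted" means, here is a toy sketch that turns an inverted index (value -> documents) into a per-document array (document -> value). The dict-based postings, the `uninvert` helper and the single-valued field are all assumptions made for the example, not Lucene data structures.

```python
def uninvert(postings, max_doc, missing=0):
    """Build a per-doc array of values from value -> [doc ids] postings."""
    values = [missing] * max_doc
    for value, doc_ids in postings.items():
        for doc in doc_ids:
            values[doc] = value
    return values

# Inverted view: which documents contain each numeric value.
postings = {10: [0, 3], 25: [1], 7: [2]}

# Uninverted view: the value of each document, indexed by doc id.
per_doc = uninvert(postings, max_doc=4)
print(per_doc)                                       # [10, 25, 7, 10]

# Sorting hits by field value is then a cheap array lookup per document.
print(sorted(range(4), key=per_doc.__getitem__))     # [2, 0, 3, 1]
```

Building this array means walking every term of the field, which is why doing it lazily at search time is costly on first use.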
Trie numerics
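[Figure, from http://www.thetaphi.de/share/Schindler-TrieRange.ppt: a trie over the indexed terms 421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644, with second-level prefix terms 42, 44, 52, 63, 64 and top-level prefix terms 4, 5, 6; the queried range is marked.]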
1. Fast range queries
2. Fewer terms required than term range queries
3. 7-bit encoded to minimize disk footprint
4. Adjustable “precisionStep”: number of bits to shift when generating synthetic terms
5. Synthetic prefix terms created by stripping low bits and prepending the shift amount in the first byte
   1. E.g.: for 423, synthetic terms 42 and 4 are also indexed
   2. When searching range [423, 642], the lowest-precision terms covering the range are used: 423, 44, 5, 63, 641, 642 (6 terms), versus the 11 terms required by a term range query.
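The worked example above can be reproduced with a decimal-digit toy. This is a sketch under stated assumptions, not Lucene's implementation: real trie fields strip bits in steps of precisionStep rather than decimal digits, and this version consults the set of indexed values, which a real implementation need not do. The names `prefix_terms` and `covering_terms` are hypothetical.

```python
from collections import defaultdict

def prefix_terms(value, width=3):
    """Full-precision term plus the synthetic prefix terms, e.g. 423 -> 423, 42, 4."""
    s = str(value).zfill(width)
    return [s[:n] for n in range(width, 0, -1)]

def covering_terms(indexed, lo, hi, width=3):
    """Pick the lowest-precision indexed terms that cover exactly [lo, hi]."""
    by_prefix = defaultdict(list)
    for v in indexed:
        for term in prefix_terms(v, width):
            by_prefix[term].append(v)

    result = []

    def visit(prefix):
        values = by_prefix.get(prefix, [])
        if not values:
            return                          # nothing indexed under this prefix
        if all(lo <= v <= hi for v in values):
            result.append(prefix)           # one term covers the whole subtree
        elif len(prefix) < width:
            for digit in "0123456789":      # partially covered: descend
                visit(prefix + digit)

    for digit in "0123456789":
        visit(digit)
    return result

indexed = [421, 423, 445, 446, 448, 521, 522, 632, 633, 634, 641, 642, 644]
print(prefix_terms(423))                     # ['423', '42', '4']
print(covering_terms(indexed, 423, 642))     # ['423', '44', '5', '63', '641', '642']
print(sum(423 <= v <= 642 for v in indexed)) # 11 full-precision terms in range
```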
Yesterday
DocValues: field cache constructed at index-time
Yesterday
Flexible indexing: simplify/enable new index formats via modularization
Yesterday
1. Auto-prefix terms: generalization of the trie numeric strategy, in the block tree terms dictionary.
2. Intended to replace trie numerics: LUCENE-5966
3. Removed in favor of points.
Dimensional Points
1. All point values in a field have the same fixed width (max 128 bits)
2. 1D - 8D
3. Block k-d tree
4. Points are sorted; recursively partitioned along the longest dimension; then, at a target cardinality, the “leaf block” is written out.
5. An in-memory binary tree index points to the leaf blocks.
6. Adaptive optimal partitioning (versus trie numerics, which generates terms irrespective of local density.)
[Figure labels: 1-8 dimensions; 1-16 bytes per dimension]
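The partitioning steps above can be sketched in a few lines. This is a toy in-memory version under stated assumptions: points are small tuples of integers, leaves are Python lists rather than on-disk blocks, and `build`, `range_query` and `max_leaf_size` are hypothetical names, not Lucene's API.

```python
def build(points, max_leaf_size=2):
    """Recursively partition points (assumes max_leaf_size >= 1)."""
    if len(points) <= max_leaf_size:
        return {"leaf": sorted(points)}     # target cardinality reached
    dims = range(len(points[0]))
    # Split along the dimension whose values span the widest interval.
    split_dim = max(
        dims, key=lambda d: max(p[d] for p in points) - min(p[d] for p in points)
    )
    ordered = sorted(points, key=lambda p: p[split_dim])
    mid = len(ordered) // 2                 # split at the median, so leaves adapt to density
    return {
        "dim": split_dim,
        "split": ordered[mid][split_dim],
        "left": build(ordered[:mid], max_leaf_size),    # values <= split
        "right": build(ordered[mid:], max_leaf_size),   # values >= split
    }

def range_query(node, lo, hi):
    """Return all points p with lo[d] <= p[d] <= hi[d] in every dimension d."""
    if "leaf" in node:
        return [p for p in node["leaf"]
                if all(l <= x <= h for x, l, h in zip(p, lo, hi))]
    d, split = node["dim"], node["split"]
    hits = []
    if lo[d] <= split:                      # the left side may hold matches
        hits += range_query(node["left"], lo, hi)
    if hi[d] >= split:                      # the right side may hold matches
        hits += range_query(node["right"], lo, hi)
    return hits

points = [(1, 9), (2, 3), (4, 4), (7, 1), (8, 8), (9, 2), (5, 5), (3, 7)]
tree = build(points)
print(sorted(range_query(tree, (2, 2), (5, 7))))   # [(2, 3), (3, 7), (4, 4), (5, 5)]
```

Because each split is at the median of the points actually present, dense regions get more, smaller cells, whereas the trie scheme generates the same prefix terms regardless of how the values are distributed.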
Dimensional Points
1. Lucene-only - no Solr support yet
2. Optimized for query types: range, distance, nearest-neighbor, and point-in-polygon
3. Multi-valued support
4. Not supported: value retrieval (store if you need this)
5. Not supported: sorting or faceting (use DocValues for these)
Dimensional Points
Point type | Implementations | Supported queries
1D Native | LongPoint, IntPoint, DoublePoint, FloatPoint, BinaryPoint | any in set, exact, range
1D 128-bit | BigIntegerPoint, InetAddressPoint | any in set, exact, range
1D-4D Range | LongRangeField, IntRangeField, DoubleRangeField, FloatRangeField | intersects, contains, within (given a range)
2D Geospatial | LatLonPoint | within box, within distance, within polygon, nearest neighbor
3D Geospatial | Geo3DPoint | within shape
Today
Mike McCandless benchmarked pre-6.0 1D points and found*:
1. Points were substantially faster at both index time and query time than the equivalent Trie numeric type.
2. Index size was smaller with points.
3. Query-time heap usage with points was much lower.
Adrien Grand re-ran Mike’s benchmark against a Lucene 6.2 snapshot** and drew similar conclusions: “36% faster at query time, 71% faster at index time and used 66% less disk and 85% less memory”
* https://www.elastic.co/blog/lucene-points-6.0
** https://www.elastic.co/blog/searching-numb3rs-in-5.0
Today
I benchmarked fixed range queries against trie and point long, int and double fields in 25 million NYC taxi trips, using modified tools from luceneutil.
I created an index with three versions of each long, int and double field:
1. Trie numerics with the default precision step
2. Point fields
3. Trie numerics with a precision step the same width as the numbers - this should provide a maximum performance threshold for String ranges.
Today
Type | Indexing time | Index size
Points | 31s | 1.2GiB
Trie | 53s | 1.6GiB
Single-precision trie | 19s | 0.7GiB
The index has 24 fields defined: 6 string fields, 1 text field, 2 long fields, 1 int field, and 14 double fields.
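Today
field | cardinality | hits | type | query time
passenger_count | 10 | 7.5M | IntPoint | 86ms
passenger_count | 10 | 7.5M | TrieInt/8 | 114ms
passenger_count | 10 | 7.5M | TrieInt/32 | 116ms
pick_up_date_time | 4.1M | 10.4M | LongPoint | 69ms
pick_up_date_time | 4.1M | 10.4M | TrieLong/16 | 105ms
pick_up_date_time | 4.1M | 10.4M | TrieLong/64 | 365ms
trip_distance | 4,754 | 9.6M | DoublePoint | 116ms
trip_distance | 4,754 | 9.6M | TrieDouble/16 | 92ms
trip_distance | 4,754 | 9.6M | TrieDouble/64 | 105ms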
Tomorrow
1. Add support for PointFields in Solr: SOLR-8396
2. David Smiley will be working on adding a Solr adaptor for LatLonPoint in the near future.
3. Trie numerics will be removed from Lucene in 7.0, but Solr may take ownership to provide a longer backcompat timeframe.
4. FieldCache may be removed from Lucene / moved to Solr: LUCENE-7283
References
1. Numeric Range Queries with Lucene TrieRange: http://www.thetaphi.de/share/Schindler-TrieRange.ppt
2. Generic XML-based Framework for Metadata Portals: http://epic.awi.de/17813/1/Sch2007br.pdf
3. Fun with flexible indexing: http://blog.mikemccandless.com/2010/10/fun-with-flexible-indexing.html
4. Searching numb3rs in 5.0: https://www.elastic.co/blog/searching-numb3rs-in-5.0
5. Multi-dimensional points, coming in Apache Lucene 6.0: https://www.elastic.co/blog/lucene-points-6.0
6. Bkd-tree: A Dynamic Scalable kd-tree: http://www.madalgo.au.dk/~large/Papers/bkdsstd03.ps
7. Luceneutil: https://github.com/mikemccand/luceneutil/