Text Tokenization
ElasticSearch Boston Meetup - 10/14/14
Bryan Warner - [email protected]


DESCRIPTION

Text Tokenization strategies leveraging ElasticSearch and Apache Lucene

TRANSCRIPT

Page 1: Text Tokenization

Text Tokenization
ElasticSearch Boston Meetup - 10/14/14

Bryan Warner - [email protected]

Page 2: Text Tokenization

Intro

● Traackr is an advanced influencer discovery and monitoring tool

● We offer users the ability to identify influential authors who are relevant against a particular set of keywords

● Keywords can be a mix of exact and non-exact phrases (up to 50)

Page 3: Text Tokenization

Background

● Over the past year, we’ve done a lot of tuning around our text tokenization strategies

● Strategies that work very well for non-exact phrase matching have adverse side-effects for exact phrase matching (and vice-versa)

● Despite our best efforts, a single analyzed field for content could not satisfy both use cases

Page 4: Text Tokenization

Tokenization Primer

● Whitespace Tokenizer

● Invisible characters aren’t handled (e.g. Left-to-Right markers)

● @foobar node.js is great. (#tech) => @foobar[1], node.js[2], is[3], great.[4], (#tech)[5]

● Basic - Used in conjunction with a word-delimiter filter

@Override
protected boolean isTokenChar(int c) {
    return !Character.isWhitespace(c);
}
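For reference, a minimal index-settings sketch (the analyzer name ws_text is an assumption, not from the slides) that exposes the built-in whitespace tokenizer as a custom analyzer and reproduces the token stream shown above:

"index": {
  "analysis": {
    "analyzer": {
      "ws_text": {
        "type": "custom",
        "tokenizer": "whitespace",
        "filter": []
      }
    }
  }
}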

Page 5: Text Tokenization

Tokenization Primer

● Pattern Tokenizer

● Separate text into terms via regular expressions (default is \W+)

● @foobar node.js is great. (#tech) => @foobar[1], node.js[2], is[3], great.[4], #tech[5]

● Regex can often become unwieldy, especially with I18n concerns
● Contextual token separators not really feasible
● Used in conjunction with a word-delimiter filter

"tokenizer": { "custom_pattern_tokenizer”: { "type": "pattern", "pattern": "[\\(\\)\\s]+", "group": -1}}

Page 6: Text Tokenization

Tokenization Primer

Other tokenizers / filters to keep in mind...

● nGram & Edge nGram Tokenizers○ Favored in autocomplete use-cases○ Produce a lot of tokens

● Path Hierarchy / Email-URL Tokenizers
  ○ Specialty use-cases

● Stemmers (Porter, Snowball)
  ○ These are filters applied post-tokenization
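As an illustration of the autocomplete point above (the analyzer/tokenizer names and gram sizes are assumptions, not from the slides), a minimal Edge nGram setup of the kind typically used:

"index": {
  "analysis": {
    "analyzer": {
      "autocomplete": {
        "type": "custom",
        "tokenizer": "autocomplete_edge",
        "filter": ["lowercase"]
      }
    },
    "tokenizer": {
      "autocomplete_edge": {
        "type": "edgeNGram",
        "min_gram": 2,
        "max_gram": 10
      }
    }
  }
}

With these settings, "tech" is indexed as te, tec, tech - which is why these tokenizers produce a lot of tokens.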

Page 7: Text Tokenization

Word Delimiter

● Word Delimiter Filter

● Produces sub-word tokens based on numerous rules (non-alphanumeric chars, case transitions, etc.)

● Great for relaxed queries, but causes a headache with exact phrases

● @foobar node.js is great. (#tech) => @foobar[1], foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6], tech[6]

"filter":{ "custom_word_delimiter":{ "type":"word_delimiter", "generate_word_parts":"1", "generate_number_parts":"0", "catenate_words":"1", "catenate_numbers":"0", "catenate_all":"0", "split_on_case_change":"1", "preserve_original":"1"}}

Page 8: Text Tokenization

Word Delimiter

● In the prev. example, a query_string search on ‘@foobar’ will match text containing foobar (with or w/o the ‘@’)

● To solve this, the word delimiter can be configured with a type table

● @foobar node.js is great. (#tech) => @foobar[1], node.js[2], node[2], js[3], nodejs[3], is[4], great[5], great.[5], #tech[6]

● However, many edge-cases still arise and the increased # of tokens has an effect on relevance

"filter":{ "custom_word_delimiter":{ …

"type_table":[ "# => ALPHA", "@ => ALPHA", ]}}

Page 9: Text Tokenization

Standard Tokenizer

● Implements the Word Break rules from the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29

● It does so by assigning character classes to underlying Unicode characters
  ○ ALetter, Numeric, Extended_NumLetter
  ○ Single_Quote, Double_Quote
  ○ Mid_Number, Mid_Letter, Mid_NumLetter

● These character classes determine how word boundaries are detected in a piece of text - smart in that it’s all contextual

Page 10: Text Tokenization

Standard Tokenizer

● For instance, let’s examine the Mid_NumLetter character class

○ If assigned, it states that a Unicode character will be treated as a word boundary break unless surrounded by alpha-numerics

○ By default, the ‘.’ is categorized as a Mid_NumLetter

● Most punctuation symbols (e.g. parenthesis, brackets, ‘@’, ‘#’, etc.) are treated as hard word boundary breaks

● @foobar node.js is great. (#tech) => foobar[1], node.js[2], is[3], great[4], tech[5]

Page 11: Text Tokenization

Standard Tokenizer

● Unfortunately, there isn’t an easy way to customize the default word boundary rules as implemented by the Standard Tokenizer algorithm.

● One can either...
  ○ Use a mapping char_filter to map punctuation symbols like ‘@’, ‘#’, etc. to other characters that are treated like alpha-numerics (e.g. the ‘_’ is treated as an extended num-letter), as sketched below
  ○ Copy the Standard Tokenizer source and make modifications … which is what the Javadoc actually says to do :)
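A sketch of the first option (the char_filter and analyzer names are assumptions): a mapping char_filter rewrites ‘@’ and ‘#’ to ‘_’ before the Standard Tokenizer runs, so the symbols survive word-boundary detection - at the cost that the indexed tokens contain ‘_’ instead of the original symbol:

"index": {
  "analysis": {
    "char_filter": {
      "symbols_to_underscore": {
        "type": "mapping",
        "mappings": ["@=>_", "#=>_"]
      }
    },
    "analyzer": {
      "mapped_standard": {
        "type": "custom",
        "char_filter": ["symbols_to_underscore"],
        "tokenizer": "standard",
        "filter": ["lowercase"]
      }
    }
  }
}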

Page 12: Text Tokenization

Standard Token. Extension

● Now, there’s an ES plugin to do what we need: https://github.com/bbguitar77/elasticsearch-analysis-standardext

● We can override the default character class for any Unicode character that we desire

● @foobar node.js is great. (#tech) => @foobar[1], node.js[2], is[3], great[4], #tech[5]

"tokenizer": { "my_standard_ext": { "type": "standard_ext", "mappings": [ "@=>EXNL", "#=>EXNL" ] }}

Page 13: Text Tokenization

Standard Token. Extension

The supported word-boundary property types are:
● L -> A Letter
● N -> Numeric
● EXNL -> Extended Number Letter (preserved at start, middle, end of alpha-numeric characters - e.g. '_')
● MNL -> Mid Number Letter (preserved between alpha-numeric characters - e.g. '.')
● MN -> Mid Number (preserved between numerics - e.g. ',')
● ML -> Mid Letter (preserved between letters - e.g. ':')
● SQ -> Single-quote
● DQ -> Double-quote

Page 14: Text Tokenization

Standard Token. Extension

"index": {
  "analysis": {
    "analyzer": {
      "my_analyzer": {
        "type": "custom",
        "char_filter": [],
        "tokenizer": "my_standard_ext",
        "filter": ["lowercase", "stop"]
      }
    },
    "tokenizer": {
      "my_standard_ext": {
        "type": "standard_ext",
        "mappings": ["@=>EXNL", "#=>EXNL"]
      }
    }
  }
}

Page 15: Text Tokenization

Standard Token. Extension

● Advantages

○ Reap all the benefits of the Standard Tokenizer (context-based segmentation rules, special char. handling, etc.) while retaining some flexibility in how certain Unicode characters are handled

○ Works very well for analyzed fields where we don’t want to incur the overhead / side-effects of the word-delimiter filter

○ Simpler configuration - tokenization rules are not spread between the tokenizer & token-filters

Page 16: Text Tokenization

Outcome

Best strategy for us was having two separate analyzed fields for content

Strictly analyzed field - exact phrase matching -

"analysis": {
  ...
  "strict_text": {
    "type": "custom",
    "char_filter": ["html_strip"],
    "tokenizer": "my_standard_ext",
    "filter": ["lowercase", "icu_folding"]
  }
  ...
}

Broadly analyzed field - relaxed phrase matching -

"analysis": {
  ...
  "broad_text": {
    "type": "custom",
    "char_filter": ["html_strip"],
    "tokenizer": "my_standard_ext",
    "filter": ["custom_word_delim", "lowercase", "stop", "icu_folding"]
  }
  ...
}
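Putting the two analyzers together, a mapping sketch (the type name and field names are assumptions) where the same content is indexed into both fields via a multi-field:

"mappings": {
  "post": {
    "properties": {
      "content": {
        "type": "string",
        "analyzer": "broad_text",
        "fields": {
          "exact": {
            "type": "string",
            "analyzer": "strict_text"
          }
        }
      }
    }
  }
}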

Page 17: Text Tokenization

Outcome

When a mix of exact & non-exact phrases is present, the user query is transformed into a simple bool query composed of two underlying query_string queries

BoolQueryBuilder query = new BoolQueryBuilder();
if (!relaxedKeywords.isEmpty())
    query.should(buildQueryStringQuery(relaxedKeywords));
if (!exactKeywords.isEmpty())
    query.should(buildQueryStringQuery(exactKeywords));
query.minimumNumberShouldMatch(1);
query.disableCoord(true);
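For reference, the query DSL this roughly produces, assuming buildQueryStringQuery points relaxed phrases at the broadly analyzed field and exact phrases at the strictly analyzed field (the field names follow the mapping sketch above, and the keyword values are purely illustrative):

{
  "bool": {
    "should": [
      { "query_string": { "default_field": "content",       "query": "node.js" } },
      { "query_string": { "default_field": "content.exact", "query": "\"#tech\"" } }
    ],
    "minimum_should_match": 1,
    "disable_coord": true
  }
}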

Page 18: Text Tokenization

Questions

● Questions?

● If you’re interested in learning more about Traackr, please see us after the presentation or email us - we’re hiring!
  ○ [email protected]
  ○ 5k referral program

● Tech Blog - http://traackr-people.tumblr.com