Elasticsearch: The Definitive Guide
Clinton Gormley and Zachary Tong
Beijing Cambridge Farnham Köln Sebastopol Tokyo

Table of Contents
Foreword xxi
Preface xxiii
Part I. Getting Started
1. You Know, for Search 3
Installing Elasticsearch 4
Installing Marvel 5
Running Elasticsearch 5
Viewing Marvel and Sense 6
Talking to Elasticsearch 6
Java API 6
RESTful API with JSON over HTTP 7
Document Oriented 9
JSON 9
Finding Your Feet 10
Let's Build an Employee Directory 10
Indexing Employee Documents 10
Retrieving a Document 13
Search Lite 13
Search with Query DSL 16
More-Complicated Searches 16
Full-Text Search 17
Phrase Search 19
Highlighting Our Searches 19
Analytics 21
Tutorial Conclusion 23
Distributed Nature 23
Next Steps 24
2. Life Inside a Cluster 27
An Empty Cluster 28
Cluster Health 28
Add an Index 29
Add Failover 31
Scale Horizontally 32
Then Scale Some More 33
Coping with Failure 34
3. Data In, Data Out 37
What Is a Document? 38
Document Metadata 39
_index 39
_type 39
_id 40
Other Metadata 40
Indexing a Document 40
Using Our Own ID 40
Autogenerating IDs 41
Retrieving a Document 42
Retrieving Part of a Document 43
Checking Whether a Document Exists 44
Updating a Whole Document 44
Creating a New Document 45
Deleting a Document 46
Dealing with Conflicts 47
Optimistic Concurrency Control 49
Using Versions from an External System 51
Partial Updates to Documents 52
Using Scripts to Make Partial Updates 53
Updating a Document That May Not Yet Exist 54
Updates and Conflicts 55
Retrieving Multiple Documents 56
Cheaper in Bulk 58
Don't Repeat Yourself 62
How Big Is Too Big? 62
4. Distributed Document Store 63
Routing a Document to a Shard 63
How Primary and Replica Shards Interact 64
Creating, Indexing, and Deleting a Document 65
Retrieving a Document 67
Partial Updates to a Document 68
Multidocument Patterns 69
Why the Funny Format? 71
5. Searching: The Basic Tools 73
The Empty Search 74
hits 75
took 75
shards 75
timeout 76
Multi-index, Multitype 76
Pagination 77
Search Lite 78
The _all Field 79
More Complicated Queries 80
6. Mapping and Analysis 81
Exact Values Versus Full Text 82
Inverted Index 83
Analysis and Analyzers 86
Built-in Analyzers 86
When Analyzers Are Used 87
Testing Analyzers 88
Specifying Analyzers 89
Mapping 89
Core Simple Field Types 90
Viewing the Mapping 91
Customizing Field Mappings 91
Updating a Mapping 93
Testing the Mapping 94
Complex Core Field Types 95
Multivalue Fields 95
Empty Fields 95
Multilevel Objects 96
Mapping for Inner Objects 96
How Inner Objects are Indexed 97
Arrays of Inner Objects 97
7. Full-Body Search 99
Empty Search 99
Query DSL 100
Structure of a Query Clause 101
Combining Multiple Clauses 102
Queries and Filters 103
Performance Differences 103
When to Use Which 104
Most Important Queries and Filters 104
term Filter 104
terms Filter 104
range Filter 104
exists and missing Filters 105
bool Filter 105
match_all Query 106
match Query 106
multi_match Query 106
bool Query 107
Combining Queries with Filters 107
Filtering a Query 108
Just a Filter 109
A Query as a Filter 109
Validating Queries 110
Understanding Errors 110
Understanding Queries 111
8. Sorting and Relevance 113
Sorting 113
Sorting by Field Values 114
Multilevel Sorting 115
Sorting on Multivalue Fields 115
String Sorting and Multifields 116
What Is Relevance? 117
Understanding the Score 118
Understanding Why a Document Matched 121
Fielddata 121
9. Distributed Search Execution 123
Query Phase 124
Fetch Phase 125
Search Options 127
preference 127
timeout 128
routing 128
search_type 129
scan and scroll 129
10. Index Management 133
Creating an Index 133
Deleting an Index 134
Index Settings 134
Configuring Analyzers 135
Custom Analyzers 136
Creating a Custom Analyzer 137
Types and Mappings 139
How Lucene Sees Documents 139
How Types Are Implemented 140
Avoiding Type Gotchas 140
The Root Object 142
Properties 142
Metadata: _source Field 143
Metadata: _all Field 144
Metadata: Document Identity 146
Dynamic Mapping 147
Customizing Dynamic Mapping 149
date_detection 149
dynamic_templates 150
Default Mapping 151
Reindexing Your Data 152
Index Aliases and Zero Downtime 153
11. Inside a Shard 155
Making Text Searchable 156
Immutability 157
Dynamically Updatable Indices 157
Deletes and Updates 160
Near Real-Time Search 161
refresh API 162
Making Changes Persistent 163
flush API 167
Segment Merging 168
optimize API 170
Part II. Search in Depth
12. Structured Search 175
Finding Exact Values 175
term Filter with Numbers 176
term Filter with Text 177
Internal Filter Operation 180
Combining Filters 181
Bool Filter 181
Nesting Boolean Filters 183
Finding Multiple Exact Values 184
Contains, but Does Not Equal 185
Equals Exactly 186
Ranges 187
Ranges on Dates 188
Ranges on Strings 189
Dealing with Null Values 189
exists Filter 190
missing Filter 192
exists/missing on Objects 193
All About Caching 194
Independent Filter Caching 194
Controlling Caching 195
Filter Order 196
13. Full-Text Search 199
Term-Based Versus Full-Text 199
The match Query 201
Index Some Data 201
A Single-Word Query 202
Multiword Queries 203
Improving Precision 204
Controlling Precision 205
Combining Queries 206
Score Calculation 207
Controlling Precision 207
How match Uses bool 208
Boosting Query Clauses 209
Controlling Analysis 211
Default Analyzers 213
Configuring Analyzers in Practice 215
Relevance Is Broken! 216
14. Multifield Search 219
Multiple Query Strings 219
Prioritizing Clauses 220
Single Query String 221
Know Your Data 222
Best Fields 223
dis_max Query 224
Tuning Best Fields Queries 225
tie_breaker 226
multi_match Query 227
Using Wildcards in Field Names 228
Boosting Individual Fields 229
Most Fields 229
Multifield Mapping 230
Cross-fields Entity Search 233
A Naive Approach 233
Problems with the most_fields Approach 234
Field-Centric Queries 234
Problem 1: Matching the Same Word in Multiple Fields 235
Problem 2: Trimming the Long Tail 235
Problem 3: Term Frequencies 236
Solution 237
Custom _all Fields 237
cross-fields Queries 238
Per-Field Boosting 240
Exact-Value Fields 241
15. Proximity Matching 243
Phrase Matching 244
Term Positions 244
What Is a Phrase 245
Mixing It Up 246
Multivalue Fields 247
Closer Is Better 248
Proximity for Relevance 249
Improving Performance 251
Rescoring Results 251
Finding Associated Words 252
Producing Shingles 253
Multifields 255
Searching for Shingles 255
Performance 257
16. Partial Matching 259
Postcodes and Structured Data 260
prefix Query 261
wildcard and regexp Queries 262
Query-Time Search-as-You-Type 264
Index-Time Optimizations 266
Ngrams for Partial Matching 266
Index-Time Search-as-You-Type 267
Preparing the Index 267
Querying the Field 269
Edge n-grams and Postcodes 272
Ngrams for Compound Words 273
17. Controlling Relevance 277
Theory Behind Relevance Scoring 277
Boolean Model 278
Term Frequency/Inverse Document Frequency (TF/IDF) 278
Vector Space Model 281
Lucene's Practical Scoring Function 284
Query Normalization Factor 285
Query Coordination 286
Index-Time Field-Level Boosting 288
Query-Time Boosting 288
Boosting an Index 289
t.getBoost() 290
Manipulating Relevance with Query Structure 290
Not Quite Not 291
boosting Query 292
Ignoring TF/IDF 293
constant_score Query 293
function_score Query 295
Boosting by Popularity 296
modifier 298
factor 300
boost_mode 301
max_boost 303
Boosting Filtered Subsets 303
filter Versus query 304
functions 305
score_mode 305
Random Scoring 305
The Closer, The Better 307
Understanding the price Clause 310
Scoring with Scripts 310
Pluggable Similarity Algorithms 312
Okapi BM25 312
Changing Similarities 315
Configuring BM25 316
Relevance Tuning Is the Last 10% 317
Part III. Dealing with Human Language
18. Getting Started with Languages 321
Using Language Analyzers 322
Configuring Language Analyzers 323
Pitfalls of Mixing Languages 325
At Index Time 325
At Query Time 326
Identifying Language 326
One Language per Document 327
Foreign Words 328
One Language per Field 329
Mixed-Language Fields 331
Split into Separate Fields 331
Analyze Multiple Times 331
Use n-grams 332
19. Identifying Words 335
standard Analyzer 335
standard Tokenizer 336
Installing the ICU Plug-in 337
icu_tokenizer 337
Tidying Up Input Text 339
Tokenizing HTML 339
Tidying Up Punctuation 340
20. Normalizing Tokens 343
In That Case 343
You Have an Accent 344
Retaining