  • Elasticsearch: The Definitive Guide

    Clinton Gormley and Zachary Tong

    Beijing • Cambridge • Farnham • Köln • Sebastopol • Tokyo

  • Table of Contents

    Foreword xxi

    Preface xxiii

    Part I. Getting Started

    1. You Know, for Search 3

    Installing Elasticsearch 4

    Installing Marvel 5

    Running Elasticsearch 5

    Viewing Marvel and Sense 6

    Talking to Elasticsearch 6

    Java API 6

    RESTful API with JSON over HTTP 7

    Document Oriented 9

    JSON 9

    Finding Your Feet 10

    Let's Build an Employee Directory 10

    Indexing Employee Documents 10

    Retrieving a Document 13

    Search Lite 13

    Search with Query DSL 16

    More-Complicated Searches 16

    Full-Text Search 17

    Phrase Search 19

    Highlighting Our Searches 19

    Analytics 21

    Tutorial Conclusion 23

    Distributed Nature 23

    Next Steps 24

    2. Life Inside a Cluster 27

    An Empty Cluster 28

    Cluster Health 28

    Add an Index 29

    Add Failover 31

    Scale Horizontally 32

    Then Scale Some More 33

    Coping with Failure 34

    3. Data In, Data Out 37

    What Is a Document? 38

    Document Metadata 39

    _index 39

    _type 39

    _id 40

    Other Metadata 40

    Indexing a Document 40

    Using Our Own ID 40

    Autogenerating IDs 41

    Retrieving a Document 42

    Retrieving Part of a Document 43

    Checking Whether a Document Exists 44

    Updating a Whole Document 44

    Creating a New Document 45

    Deleting a Document 46

    Dealing with Conflicts 47

    Optimistic Concurrency Control 49

    Using Versions from an External System 51

    Partial Updates to Documents 52

    Using Scripts to Make Partial Updates 53

    Updating a Document That May Not Yet Exist 54

    Updates and Conflicts 55

    Retrieving Multiple Documents 56

    Cheaper in Bulk 58

    Don't Repeat Yourself 62

    How Big Is Too Big? 62

    4. Distributed Document Store 63

    Routing a Document to a Shard 63

    How Primary and Replica Shards Interact 64

    Creating, Indexing, and Deleting a Document 65

    Retrieving a Document 67

    Partial Updates to a Document 68

    Multidocument Patterns 69

    Why the Funny Format? 71

    5. Searching - The Basic Tools 73

    The Empty Search 74

    hits 75

    took 75

    shards 75

    timeout 76

    Multi-index, Multitype 76

    Pagination 77

    Search Lite 78

    The _all Field 79

    More Complicated Queries 80

    6. Mapping and Analysis 81

    Exact Values Versus Full Text 82

    Inverted Index 83

    Analysis and Analyzers 86

    Built-in Analyzers 86

    When Analyzers Are Used 87

    Testing Analyzers 88

    Specifying Analyzers 89

    Mapping 89

    Core Simple Field Types 90

    Viewing the Mapping 91

    Customizing Field Mappings 91

    Updating a Mapping 93

    Testing the Mapping 94

    Complex Core Field Types 95

    Multivalue Fields 95

    Empty Fields 95

    Multilevel Objects 96

    Mapping for Inner Objects 96

    How Inner Objects are Indexed 97

    Arrays of Inner Objects 97

    7. Full-Body Search 99

    Empty Search 99

    Query DSL 100

    Structure of a Query Clause 101

    Combining Multiple Clauses 102

    Queries and Filters 103

    Performance Differences 103

    When to Use Which 104

    Most Important Queries and Filters 104

    term Filter 104

    terms Filter 104

    range Filter 104

    exists and missing Filters 105

    bool Filter 105

    match_all Query 106

    match Query 106

    multi_match Query 106

    bool Query 107

    Combining Queries with Filters 107

    Filtering a Query 108

    Just a Filter 109

    A Query as a Filter 109

    Validating Queries 110

    Understanding Errors 110

    Understanding Queries 111

    8. Sorting and Relevance 113

    Sorting 113

    Sorting by Field Values 114

    Multilevel Sorting 115

    Sorting on Multivalue Fields 115

    String Sorting and Multifields 116

    What Is Relevance? 117

    Understanding the Score 118

    Understanding Why a Document Matched 121

    Fielddata 121

    9. Distributed Search Execution 123

    Query Phase 124

    Fetch Phase 125

    Search Options 127

    preference 127

    timeout 128

    routing 128

    search_type 129

    scan and scroll 129

    10. Index Management 133

    Creating an Index 133

    Deleting an Index 134

    Index Settings 134

    Configuring Analyzers 135

    Custom Analyzers 136

    Creating a Custom Analyzer 137

    Types and Mappings 139

    How Lucene Sees Documents 139

    How Types Are Implemented 140

    Avoiding Type Gotchas 140

    The Root Object 142

    Properties 142

    Metadata: _source Field 143

    Metadata: _all Field 144

    Metadata: Document Identity 146

    Dynamic Mapping 147

    Customizing Dynamic Mapping 149

    date_detection 149

    dynamic_templates 150

    Default Mapping 151

    Reindexing Your Data 152

    Index Aliases and Zero Downtime 153

    11. Inside a Shard 155

    Making Text Searchable 156

    Immutability 157

    Dynamically Updatable Indices 157

    Deletes and Updates 160

    Near Real-Time Search 161

    refresh API 162

    Making Changes Persistent 163

    flush API 167

    Segment Merging 168

    optimize API 170

    Part II. Search in Depth

    12. Structured Search 175

    Finding Exact Values 175

    term Filter with Numbers 176

    term Filter with Text 177

    Internal Filter Operation 180

    Combining Filters 181

    Bool Filter 181

    Nesting Boolean Filters 183

    Finding Multiple Exact Values 184

    Contains, but Does Not Equal 185

    Equals Exactly 186

    Ranges 187

    Ranges on Dates 188

    Ranges on Strings 189

    Dealing with Null Values 189

    exists Filter 190

    missing Filter 192

    exists/missing on Objects 193

    All About Caching 194

    Independent Filter Caching 194

    Controlling Caching 195

    Filter Order 196

    13. Full-Text Search 199

    Term-Based Versus Full-Text 199

    The match Query 201

    Index Some Data 201

    A Single-Word Query 202

    Multiword Queries 203

    Improving Precision 204

    Controlling Precision 205

    Combining Queries 206

    Score Calculation 207

    Controlling Precision 207

    How match Uses bool 208

    Boosting Query Clauses 209

    Controlling Analysis 211

    Default Analyzers 213

    Configuring Analyzers in Practice 215

    Relevance Is Broken! 216

    14. Multifield Search 219

    Multiple Query Strings 219

    Prioritizing Clauses 220

    Single Query String 221

    Know Your Data 222

    Best Fields 223

    dis_max Query 224

    Tuning Best Fields Queries 225

    tie_breaker 226

    multi_match Query 227

    Using Wildcards in Field Names 228

    Boosting Individual Fields 229

    Most Fields 229

    Multifield Mapping 230

    Cross-fields Entity Search 233

    A Naive Approach 233

    Problems with the most_fields Approach 234

    Field-Centric Queries 234

    Problem 1: Matching the Same Word in Multiple Fields 235

    Problem 2: Trimming the Long Tail 235

    Problem 3: Term Frequencies 236

    Solution 237

    Custom _all Fields 237

    cross-fields Queries 238

    Per-Field Boosting 240

    Exact-Value Fields 241

    15. Proximity Matching 243

    Phrase Matching 244

    Term Positions 244

    What Is a Phrase 245

    Mixing It Up 246

    Multivalue Fields 247

    Closer Is Better 248

    Proximity for Relevance 249

    Improving Performance 251

    Rescoring Results 251

    Finding Associated Words 252

    Producing Shingles 253

    Multifields 255

    Searching for Shingles 255

    Performance 257

    16. Partial Matching 259

    Postcodes and Structured Data 260

    prefix Query 261

    wildcard and regexp Queries 262

    Query-Time Search-as-You-Type 264

    Index-Time Optimizations 266

    Ngrams for Partial Matching 266

    Index-Time Search-as-You-Type 267

    Preparing the Index 267

    Querying the Field 269

    Edge n-grams and Postcodes 272

    Ngrams for Compound Words 273

    17. Controlling Relevance 277

    Theory Behind Relevance Scoring 277

    Boolean Model 278

    Term Frequency/Inverse Document Frequency (TF/IDF) 278

    Vector Space Model 281

    Lucene's Practical Scoring Function 284

    Query Normalization Factor 285

    Query Coordination 286

    Index-Time Field-Level Boosting 288

    Query-Time Boosting 288

    Boosting an Index 289

    t.getBoost() 290

    Manipulating Relevance with Query Structure 290

    Not Quite Not 291

    boosting Query 292

    Ignoring TF/IDF 293

    constant_score Query 293

    function_score Query 295

    Boosting by Popularity 296

    modifier 298

    factor 300

    boost_mode 301

    max_boost 303

    Boosting Filtered Subsets 303

    filter Versus query 304

    functions 305

    score_mode 305

    Random Scoring 305

    The Closer, The Better 307

    Understanding the price Clause 310

    Scoring with Scripts 310

    Pluggable Similarity Algorithms 312

    Okapi BM25 312

    Changing Similarities 315

    Configuring BM25 316

    Relevance Tuning Is the Last 10% 317

    Part III. Dealing with Human Language

    18. Getting Started with Languages 321

    Using Language Analyzers 322

    Configuring Language Analyzers 323

    Pitfalls of Mixing Languages 325

    At Index Time 325

    At Query Time 326

    Identifying Language 326

    One Language per Document 327

    Foreign Words 328

    One Language per Field 329

    Mixed-Language Fields 331

    Split into Separate Fields 331

    Analyze Multiple Times 331

    Use n-grams 332

    19. Identifying Words 335

    standard Analyzer 335

    standard Tokenizer 336

    Installing the ICU Plug-in 337

    icu_tokenizer 337

    Tidying Up Input Text 339

    Tokenizing HTML 339

    Tidying Up Punctuation 340

    20. Normalizing Tokens 343

    In That Case 343

    You Have an Accent 344

    Retaining