ElasticSearch – Search in the Age of Clouds

Christian Meder

Bernhard Pflugfelder

inovex GmbH

Background

‣  open source (free software)

‣  Linux

‣  Web

‣  Java

‣  Android

‣  CTO@inovex

‣  Christian Meder

Christian Meder

Speaker

2

Background

‣  Lucene

‣  Solr

‣  Text Mining Technologies, Information Retrieval

‣  Hadoop

‣  Java

‣  Big Data Engineer@inovex

Bernhard Pflugfelder

Speaker

3

‣  Search is everywhere

‣  Elasticsearch

‣  Examples

‣  Overview

‣  Features

Agenda

4

Search, what?

5

Enterprise Search

Search applications

6

Online shops

Search applications

7

Semantic search

Search applications

8

Navigation & Information access

Search applications

9

Data analysis

Search applications

10

http://datarpm.com/product

Log-file Analysis

Search applications

11

http://kibana.org/

Document store

Search applications

12

‣  Can you think of other scenarios where search applications will also do a good job?

‣  Recall the key capabilities of search technologies:

‣  Persistence

‣  Flexible data model

‣  Unstructured data, but not only unstructured data

‣  Extremely quick access to data

‣  Horizontal scalability

There are plenty of application scenarios where search technologies should be considered!

Document store

Search applications

13

Open source

Search technologies

14

http://lucene.apache.org

http://lucene.apache.org/solr/

http://www.elasticsearch.org

Lucene is an open source, pure Java API for enabling information retrieval

‣  Originally developed by Doug Cutting in 1999; joined the Apache Software Foundation (Jakarta) in 2001 and became a TLP in 2005

‣  Licensed under the Apache License 2.0

‣  Pure Java library with ports and bindings for other languages:

‣  Lucene.NET (http://lucenenet.apache.org)

‣  PyLucene (http://lucene.apache.org/pylucene/)

‣  and more: http://wiki.apache.org/lucene-java/LuceneImplementations

‣  Large and very active developer community, well documented and supported (38 active committers!)

‣  Current stable release: 4.2.1

‣  Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/lucene-java/PoweredBy

Overview

15

http://lucene.apache.org/

Solr is a standalone enterprise search server & document store based on Lucene

‣  Created by Yonik Seeley at CNET Networks in 2004

‣  Entered the Apache Incubator in 2006 and graduated in 2007

‣  Licensed under the Apache License 2.0

‣  Seeley and others founded Lucid Imagination (now LucidWorks)

‣  Large and very active developer community, well documented and supported (strong relationship to the Lucene community as well)

‣  Current stable release: 4.2.1

‣  Widely used and adopted for commercial / non-commercial projects: http://wiki.apache.org/solr/PublicServers

Overview

16

http://lucene.apache.org/solr/

“You know, for search” (Shay Banon)

Search technologies

17

Elasticsearch is a “distributed-from-scratch” search server based on Lucene

Created by Shay Banon, who made the first version public in February 2010:

Elasticsearch itself was born out of my frustration with the fact that there isn’t really a good, open source, solution for distributed search engine out there, which also combines what I expect of search engines after building Compass (and on that, I will blog later…). I have been working on this for the past several months, pouring my search and distributed knowledge into this (and portions of my heart and time ;) )

[http://www.elasticsearch.org/blog/2010/02/08/youknowforsearch.html]

Motivation

18

http://www.elasticsearch.org/

‣  Current stable version 0.20.6 works with Lucene 3.6

‣  Available version 0.90 RC2 includes Lucene 4.2.1 integration

‣  Licensed under the Apache License 2.0

‣  Small but growing group of core developers

‣  Strong support from respected Lucene committers

‣  Company elasticsearch.com founded in 2012

‣  By the people behind elasticsearch.org

‣  www.elasticsearch.com

Overview

19


Customers

20


‣  Code search runs on a cluster

‣  26 storage nodes holding the searchable data

‣  8 client nodes coordinating query requests

‣  Each storage node has 2 TB of SSD-based storage

‣  17 TB of indexed data is stored in the cluster

‣  distributed across the cluster with a replication factor of 1

‣  34 TB of indexed data overall

GitHub

21
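The storage figures on this slide follow from simple replica arithmetic. A minimal sketch, assuming that a replication factor of 1 means one additional copy of every primary shard, and assuming that the 2 TB of SSD storage applies to each of the 26 storage nodes; the function name is illustrative only:

```python
def total_index_size_tb(primary_data_tb, replication_factor):
    """Primary data plus one full copy of it per replica."""
    return primary_data_tb * (1 + replication_factor)


# 17 TB of primary index data, each shard replicated once.
total_tb = total_index_size_tb(17, 1)

# Assumption: 2 TB of SSD per storage node, 26 storage nodes.
capacity_tb = 26 * 2

print("indexed data incl. replicas (TB):", total_tb)
print("raw SSD capacity (TB):", capacity_tb)
print("fits on the cluster:", total_tb <= capacity_tb)
```

Under these assumptions the replicated index occupies roughly two thirds of the raw SSD capacity, which leaves headroom for growth and shard relocation.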


‣  Question-and-answer website

‣  aggregates questions and answers by topic

‣  Sources are the web in general and social media

‣  Goals for search:

‣  low latency for queries

‣  increased relevancy of results

‣  evaluated Elasticsearch against Solr and Sphinx

‣  “After much benchmarking with our data set, we discovered that ElasticSearch was clearly the fastest of the possible search platforms we were considering.”

Quora

22


Quora

23


http://www.quora.com/Full-Text-Search-on-Quora/What-technology-does-Quora-use-for-its-full-text-search-infrastructure/answer/Adrien-Lucas-Ecoffet?srid=pilt&share=1

Soundcloud

24

http://bed-con.org/2013/wp-content/uploads/2013/04/Wie_SoundCloud_skaliert.pdf


Moloch

25

https://github.com/aol/moloch


Huffington Post

26

http://blogs.vmware.com/vfabric/2013/03/scaling-real-time-comments-huffpost-live-with-rabbitmq.html


Search pipeline

27

‣  Scalable, High-Performance Indexing

‣  over 95GB/hour on modern hardware

‣  small RAM requirements

‣  incremental indexing as fast as batch indexing

‣  index size roughly 20-30% the size of text indexed

‣  Powerful, Accurate and Efficient Search Algorithms

‣  ranked searching -- best results returned first

‣  many powerful query types

‣  fielded searching (e.g., title, author, contents)

‣  date-range searching

‣  sorting by any field

‣  multiple-index searching with merged results

‣  allows simultaneous update and searching

[From http://lucene.apache.org/core/features.html]

Highlights

28

http://lucene.apache.org/

‣  Pure Java application

‣  Powered by Lucene

‣  Document-oriented

‣  Schema-less

‣  HTTP API with JSON In & Out

‣  Indexing / Updating

‣  Searching

‣  Administration / Monitoring

‣  Extendable by plugins

‣  Distribution is a fundamental paradigm of Elasticsearch

Overview

29
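To make "HTTP API with JSON In & Out" concrete, here is a minimal sketch that builds (but does not send) the HTTP request for indexing a single document. The index name gutenberg and type books are taken from the mapping slide later in this deck; the host, the document ID, the field values and the helper name are assumptions for illustration.

```python
import json
import urllib.request


def build_index_request(base_url, index, doc_type, doc_id, document):
    """Build an HTTP PUT request that would index one JSON document."""
    url = f"{base_url}/{index}/{doc_type}/{doc_id}"
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical document; only the field names match the mapping slide.
request = build_index_request(
    "http://localhost:9200", "gutenberg", "books", "1",
    {"title": "Faust", "author": "Johann Wolfgang von Goethe"},
)

print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To send it to a running node: urllib.request.urlopen(request)
```

The response of a real node is JSON as well, so the same json module can decode it.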


Architecture

30

[Figure: cluster with one master node and two further nodes; primary and replica copies of shards 1, 2 and 3 are distributed across the nodes]


‣  Index distribution by auto sharding

‣  Automatic replication and balancing

‣  Fault tolerant + high availability

‣  Cluster building & management

‣  node detection through zen discovery

‣  nodes communicate via unicast / multicast

‣  automatic master election

‣  master / data node roles can be configured per node

‣  Master is responsible for:

‣  routing the search request

‣  including new nodes into the cluster

‣  Index / query routing (automatic / individual)

Architecture

31
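Automatic and individual routing can both be pictured as "hash the routing key, take it modulo the number of primary shards". The sketch below illustrates that idea only: zlib.crc32 is a stand-in chosen because it is deterministic, not the hash function Elasticsearch uses internally, and the shard count of 3 and all keys are assumptions.

```python
import zlib


def shard_for(routing_key, number_of_primary_shards):
    """Map a routing key to a primary shard number.

    Stand-in hash: zlib.crc32 (assumption, not Elasticsearch's own hash).
    """
    digest = zlib.crc32(routing_key.encode("utf-8"))
    return digest % number_of_primary_shards


# Automatic routing: the document ID is the routing key.
for doc_id in ["1", "2", "3", "42"]:
    print("doc", doc_id, "-> shard", shard_for(doc_id, 3))

# Individual routing: a custom key (here a hypothetical user ID) keeps
# all documents of one user on the same shard.
print("user-17 -> shard", shard_for("user-17", 3))
```

Because the result depends on the number of primary shards, that number cannot be changed for an existing index without redistributing the documents.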


Elasticsearch-head

32


https://github.com/mobz/elasticsearch-head

Elasticsearch-head

33


https://github.com/mobz/elasticsearch-head

Schema-less, but

34


‣  Define a mapping for type book

‣  Retrieve the current mapping for type book

Schema-less, but

35

# Mapping for type "books"; the put-mapping body is keyed by the type name
cat > book.json <<'EOF'
{ "books" : { "properties" : {
    "id"         : { "type" : "string" },
    "title"      : { "type" : "string" },
    "author"     : { "type" : "string" },
    "subject"    : { "type" : "string" },
    "view_count" : { "type" : "integer" },
    "created"    : { "type" : "date", "format" : "dateOptionalTime" }
} } }
EOF

# Define the mapping (the index gutenberg must already exist)
curl -XPUT 'localhost:9200/gutenberg/books/_mapping' -d @book.json

# Retrieve the current mapping
curl 'localhost:9200/gutenberg/books/_mapping?pretty=1'


‣  Search on terms, numeric values, dates, numeric ranges, date/time ranges

‣  Lots of query types

‣  terms, phrases, fuzzy, wildcard, ranges

‣  faceting, filtering

‣  Geospatial search (GeoShape query)

‣  Configurable caching for

‣  Filter queries

‣  Field values

‣  Near-real-time (NRT) search with separate API

‣  Sorting, Highlighting

‣  MoreLikeThis

‣  Multi Tenancy

Search highlights

36
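Several of these highlights can be combined in one search request body. The following sketch builds such a body in the query DSL of the 0.20/0.90 era described in this talk (filtered query, facets); later Elasticsearch versions replaced facets with aggregations, so treat the exact keys as tied to that era. Field names come from the books mapping slide; the search text, the date and the helper name are assumptions.

```python
import json


def build_search_body(text, created_from, facet_field):
    """Full-text match, date-range filter, terms facet, highlighting, sorting."""
    return {
        "query": {
            "filtered": {
                "query": {"match": {"title": text}},
                "filter": {"range": {"created": {"from": created_from}}},
            }
        },
        "facets": {
            "by_" + facet_field: {"terms": {"field": facet_field}}
        },
        "highlight": {"fields": {"title": {}}},
        "sort": [{"view_count": {"order": "desc"}}],
    }


body = build_search_body("faust", "2012-01-01", "author")
print(json.dumps(body, indent=2))

# Would be sent as the body of: POST localhost:9200/gutenberg/books/_search
```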


Faceted search

37


Suggestion

38


Highlighting

39


Local search

40


Multi Tenancy

41


‣  Gateway module stores cluster metadata to:

‣  Local FS, Shared FS, Hadoop, Amazon S3

‣  River:

‣  Pluggable service to constantly pull data

‣  Managed via a dedicated REST endpoint

‣  Implementations for CouchDB, MongoDB, JDBC, Solr, …

‣  Bulk indexing

‣  Default: single document indexing

‣  Bulk indexing via the _bulk REST endpoint

‣  Lucene analyzers are configured in elasticsearch.yml or via the API

Some more features

42
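Bulk indexing differs from single-document indexing mainly in the request body: the _bulk endpoint expects newline-delimited JSON, with one action line followed by one source line per document and a trailing newline. A minimal sketch of building such a body; the index and type names follow the mapping slide, while the documents and the helper name are made up for illustration.

```python
import json


def build_bulk_body(index, doc_type, documents):
    """Return a newline-delimited JSON body for the _bulk endpoint."""
    lines = []
    for doc_id, source in documents.items():
        # Action line: what to do and where.
        action = {"index": {"_index": index, "_type": doc_type, "_id": doc_id}}
        lines.append(json.dumps(action))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"


# Hypothetical documents keyed by ID.
bulk_body = build_bulk_body("gutenberg", "books", {
    "1": {"title": "Faust"},
    "2": {"title": "Effi Briest"},
})
print(bulk_body)

# Would be sent as the body of: POST localhost:9200/_bulk
```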


‣  Query types such as term, terms, match, wildcard, fuzzy, range, …

‣  Multi Search

‣  Get

‣  Multi Get

‣  Filter

‣  Facets

‣  Highlighting

‣  Suggest

‣  MoreLikeThis

‣  Index boosting

‣  Explain

‣  Percolate

Search API

43
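As one example from this list, Multi Get fetches several documents by ID in a single round trip. The sketch below builds the request body for the _mget endpoint in the index/type/ID addressing used throughout this talk; the IDs and the helper name are assumptions.

```python
import json


def build_multi_get_body(index, doc_type, doc_ids):
    """Return the body of a Multi Get request for the given document IDs."""
    return {
        "docs": [
            {"_index": index, "_type": doc_type, "_id": doc_id}
            for doc_id in doc_ids
        ]
    }


mget_body = build_multi_get_body("gutenberg", "books", ["1", "2", "7"])
print(json.dumps(mget_body, indent=2))

# Would be sent as the body of: GET localhost:9200/_mget
```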


‣  Create, Delete, Exists, Open, Close, Optimize, Refresh, Flush, Settings

‣  Index templates (mappings + settings)

‣  Get, Put, Delete Mapping

‣  Get, Update Settings

‣  Snapshot

‣  Aliases

‣  Warmers

‣  Statistics, Status

Indices API

44
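Aliases are useful for switching applications from an old index to a rebuilt one without changing client code. A minimal sketch of the request body for the _aliases endpoint, which applies its list of actions in one step; the index names gutenberg_v1 and gutenberg_v2 and the helper name are hypothetical.

```python
import json


def build_alias_swap(alias, old_index, new_index):
    """Return an _aliases body that moves an alias from one index to another."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


alias_body = build_alias_swap("gutenberg", "gutenberg_v1", "gutenberg_v2")
print(json.dumps(alias_body, indent=2))

# Would be sent as the body of: POST localhost:9200/_aliases
```

Clients keep querying the alias name, so a reindex into a new index with a changed mapping stays invisible to them.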


‣  Live configuration of cluster settings

‣  minimum master nodes

‣  cache sizes

‣  routing

‣  allocation

‣  moving shards

‣  moving replicas

‣  Cluster health & status

‣  Nodes info & stats, Shutdown all / specific nodes

Cluster API

45
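The "minimum master nodes" setting is usually chosen as a majority of the master-eligible nodes, so that two halves of a split cluster cannot both elect a master. A minimal sketch of that arithmetic and of a matching body for the cluster settings endpoint, assuming the zen discovery setting name discovery.zen.minimum_master_nodes used by the Elasticsearch versions of this era; the node count of 3 and the helper names are illustrative.

```python
import json


def quorum(master_eligible_nodes):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible_nodes // 2 + 1


def build_cluster_settings(master_eligible_nodes):
    """Return a transient cluster settings update with the quorum value."""
    return {
        "transient": {
            "discovery.zen.minimum_master_nodes": quorum(master_eligible_nodes)
        }
    }


settings_body = build_cluster_settings(3)
print(json.dumps(settings_body, indent=2))

# Would be sent as the body of: PUT localhost:9200/_cluster/settings
```

Transient settings are lost on a full cluster restart; the same key under "persistent" survives it.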


+  Elasticsearch feels lightweight

+  Simple but effective architecture

+  Easy to use, even for distributed search

+  Highly mature, even though ES is young

+  High-performance search (at least based on current benchmarks seen)

+  Modern technologies used (HTTP, JSON, NoXML, Java, Guava)

-  Still a small community and a small group of core developers

-  Missing data connectors (e.g. Solr's DataImportHandler)

-  Missing search features: result grouping & search result clustering

-  Fewer query types

-  Fewer options for boosting (e.g. function queries)

-  Fewer analyzers

Pros & Cons

46


‣  The world is becoming data-driven and user-driven

‣  large data volumes

‣  multiple sources

‣  many users need access

‣  Therefore search technologies such as Elasticsearch become important:

‣  Easy aggregation of data from multiple sources

‣  Provides a unified access layer through search

‣  Scalable regarding data volume and users

‣  Highly configurable

‣  Elasticsearch is easy to use, distributed, scalable, and fast at search

Wrap up

47


Thank you!

End

48
