99 problems, but the search ain't one
DESCRIPTION
ElasticSearch is the new kid on the search block. Built on top of Lucene and adhering to the best concepts of so-called NoSQL movement, ElasticSearch is a distributed, highly available, fast RESTful search engine, ready to be plugged into Web applications.TRANSCRIPT
99 Problems, ButThe Search Ain’t OneAndrei Zmievski • PHP UK •!Feb 25, 2011
who am I?
curl http://localhost:9200/speaker/info/andrei
{“name”: “Andrei Zmievski”, “projects”: [“PHP”, “PHP-GTK”, “Smarty”, “Unicode/i18n”], “likes”: [“coding”, “beer”, “brewing”, “photography”], “twitter”: “@a”, “email”: “[email protected]”}
what is elasticsearch?
a search engine for the NoSQL generation
domain-driven
distributed
RESTful
Hitchhiker’s Guide to the Galaxy (no, really)
document model
document-oriented
JSON-based
schema-free
based on Lucene
multi-tenancy
distributed, out of the box
engine
nomenclature
index
type
document
_id
node
3 easy steps
1. index!"#$%&'()*+%,--./00$1!2$,13-/45660!17803.92:9#0;%&<=
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7==-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N=
requ
est
>
%%%%?1:?/-#"9
%%%%?OB7<9P?/?!178?
%%%%?O-I.9?/?3.92:9#?
%%%%?OB<?/?;?
Nresp
onse
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
%%%%%%?O-I.9?%/%?3.92:9#?E
%%%%%%?OB<?%/%?5?E
%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
!!!!"#$#%&"!'!()%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
%%%%%%?O-I.9?%/%?3.92:9#?E
%%%%%%?OB<?%/%?5?E
%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse
total number of hits
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
!!!!!!"*+,-./"!'!"0$,1")%%%%%%?O-I.9?%/%?3.92:9#?E
%%%%%%?OB<?%/%?5?E
%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse the index of the doc
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
!!!!!!"*#23."!'!"43.%5.6")%%%%%%?OB<?%/%?5?E
%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse the type of the doc
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
%%%%%%?O-I.9?%/%?3.92:9#?E
!!!!!!"*+-"!'!"7")%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse
the id of the doc
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
%%%%%%?O-I.9?%/%?3.92:9#?E
!!!!!!"*+-"!'!"7")%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse
the id of the doc
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
%%%%%%?O-I.9?%/%?3.92:9#?E
!!!!!!"*+-"!'!"7")%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse
the hit score
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%?-11:?%/%TE
%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
%%%%%%?O-I.9?%/%?3.92:9#?E
!!!!!!"*+-"!'!"7")%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
8!!!!",%9."'!":,-6.+!;9+.<45+")!!!!"#%&5"'!"==!>6$?&.94)!?@#!#A.!B.%60A!:+,C#!D,.")!!!!"&+5.4"'!E"0$-+,F")!"?..6")!"3A$#$F6%3A2"G)!!!!"#H+##.6"'!"%")!!!!"A.+FA#"'!(IJK%N%J%N%N
resp
onse
the original source
2. search!"#$%,--./00$1!2$,13-/45660!17803.92:9#0O392#!,QRSF99#
requ
est
>%"#$$5"!'!L)%%?O3,2#<3?%/%>
%%%%?-1-2$?%/%;E
%%%%?3"!!9338"$?%/%;E
%%%%?82B$9<?%/%6
%%NE
%%?,B-3?%/%>
%%%%?-1-2$?%/%;E
%%%%?@2PO3!1#9?%/%6UV46LM64E
%%%%?,B-3?%/%G%>
%%%%%%?OB7<9P?%/%?!178?E
%%%%%%?O-I.9?%/%?3.92:9#?E
%%%%%%?OB<?%/%?5?E
%%%%%%?O3!1#9?%/%6UV46LM64E
%%%%%%?O31"#!9?%/%
>
%%%%?72@9?/%?A7<#9B%C@B9D3:B?E
%%%%?-2$:?/%?44%(#1F$9@3E%F"-%-,9%*92#!,%AB7=-%)79?E
%%%%?$B:93?/%G?!1<B7H?E%?F99#?E%?.,1-1H#2.,I?JE
%%%%?-KB--9#?/%?2?E
%%%%?,9BH,-?/%;LM
N%N%J%N%N
resp
onse
the execution time
3. profit
that’s up to you
demo
distributed model
provides:
performance
resiliency (high-availability)
shards
a portion of the document space
each one is a separate Lucene index
thus, many per-index settings are available
document is sharded by its _id value
but can be assigned (routed) to a shard deterministically
zero-conf discovery
zen (multicast and unicast)
cloud (EC2 via API)
auto-routing
master node:
maintains cluster state
reassigns shards if nodes leave/join cluster
any node can serve as the request router
the query is handled via scatter-gather mechanism
replicas
each shard can have 1 or more replicas
# of replicas can be updated dynamically after index creation
replicas can be used for querying in parallel
shard allocationnode 1
start with a single node
shard allocation
PUT /person { “index”: { “number_of_shards”: 2, “number_of_replicas”: 1}}
node 1person1person2
shard allocationnode 1person1person2
node 2person1person2
start the second node
shard allocationnode 1 node 2 node 3 node 4person1person2
person1person2
start 2 more nodes
shard allocationnode 1 node 2 node 3 node 4person1
person2person1
person2
start 2 more nodes
document shardingnode 1 node 2 node 3 node 4person1
person2person1
person2
PUT /person/info/1{ … }
document shardingnode 1 node 2 node 3 node 4person1
person2person1
person2
hashed to shard 1PUT /person/info/1{ … }
document shardingnode 1 node 2 node 3 node 4person1
person2person1
person2
replicated
PUT /person/info/1{ … }
document shardingnode 1 node 2 node 3 node 4person1
person2person1
person2
PUT /person/info/2{ … }
document shardingnode 1 node 2 node 3 node 4person1
person2person1
person2
hashed to shard 2
PUT /person/info/2{ … }
document shardingnode 1 node 2 node 3 node 4person1
person2person1
person2
replicated
PUT /person/info/2{ … }
scatter-gathernode 1 node 2 node 3 node 4person1
person2person1
person2
GET /person/_search?q=name:thomas
shard allocationnode 1 node 2 node 3 node 4person1
person2person1
person2
GET /person/_search?q=name:thomas
shard allocationnode 1 node 2 node 3 node 4person1
person2person1
person2
GET /person/_search?q=name:thomas
shard allocationnode 1 node 2 node 3 node 4person1
person2person1
person2
GET /person/_search?q=name:thomas
transactional model
per-document consistency
no need to commit/flush
uses write-behind transaction log
write consistency (W) can be controlled
one, quorum, or all
(near) real-time search
1 second refresh rate by default
_refresh API also
index storage
node data considered transient
can be stored in local file system, JVM heap, native OS memory, or FS & memory combination
persistent storage requires a gateway
gateways
persistent store for cluster state and indices
asynchronous, translog-based write strategy
allows full recovery if a cluster restart is needed
supported gateways:local
shared FS
Hadoop via HDFS
S3
mapping
describes document structure to the search engine
automatically created with sensible defaults
explicit mapping can be provided (generally, a good idea)
can run into merge conflicts
mapping
important meta fields:
_source
_all
_boost
mapping types
simple:
string, integer/long, float/double, boolean, and null)
complex:
array, object
sample mapping
>?"39#?/%%%%%%?<9#B!:?E
%?-B-$9?/%%%%%?W17X-%(27B!?E
%?-2H3?/%%%%%%G?.#18B$B7H?E%?<9F"HHB7H?E%?.,.?JE
%?.13-W2-9?/%%?56;6&;5&55+;M/;Y/;5?E
%?.#B1#B-I?/%%5Ndocu
men
t
>?.13-?/%>
%%?.#1.9#-B93?%/%>
%%%%?"39#?/%>?-I.9?/%?3-#B7H?E%?B7<9P?/%?71-O272$IZ9<?NE
%%%%?@9332H9?/%>?-I.9?/%?3-#B7H?E%[F113-\/%;UVNE
%%%%?-2H3?/%>?-I.9?/%?3-#B7H?E%?B7!$"<9OB7O2$$?/%?71?NE
%%%%?.13-W2-9?%/%>?-I.9?%/%?<2-9?E%[3-1#9\/%[71\NE
%%%%?.#B1#B-I?%/%>?-I.9?%/%?B7-9H9#?N
NNN
map
ping
analyzers
break down (tokenize) and normalize fields during indexing and query strings at search time
analyzer = tokenizer + token filters (0 or more)
*-27<2#<%A72$IZ9#%S
%%%*-27<2#<%+1:97BZ9#%]
%%%%%%%*-27<2#<%+1:97%^B$-9#%]
%%%%%%%_1K9#!239%+1:97%^B$-9#%]
%%%%%%%*-1.%+1:97%^B$-9#
analyzers
analyzers, tokenizers, and filters can be customizedB7<9P/
%%272$I3B3/
%%%%272$IZ9#/
%%%%%%.@&%,F/%%%%%%%%-I.9/%!"3-1@
%%%%%%%%-1:97BZ9#/%3-27<2#<
%%%%%%%%8B$-9#/%G3-27<2#<E%$1K9#!239E%3-1.E
%%%%%%%%%%%%%%%%%23!BB81$<B7HE%.1#-9#*-9@Jelas
ticse
arch
.ym
l
`
?-B-$9?/%>?-I.9?/%?3-#B7H?E%?272$IZ9#?/%?9"$27H?NE
`
map
ping
API
API conventions
append ?pretty=true to get readable JSON
boolean values: false/0/off = false, rest is true
JSONP support via callback parameter
API structure
http://host:port/[index]/[type]/[_action/id]
GET http://es:9200/_status
GET http://es:9200/twitter/_status
POST http://es:9200/twitter/tweet/1
GET http://es:9200/twitter/tweet/1
API structure
http://host:port/[index]/[type]/[_action/id]
GET http://es:9200/twitter/tweet/_search
GET http://es:9200/twitter/user/_search
GET http://es:9200/twitter/tweet,user/_search
GET http://es:9200/twitter,facebook/_search
GET http://es:9200/_search
_cluster API structure
GET /_cluster/health
GET /_cluster/health/index1,index2
GET /_cluster/nodes/stats
GET /_cluster/nodes/nodeId1,nodeId2/stats
API {core}
index
bulk
delete
delete by query
get
count
search
query
from/size paging
sort
highlighting
selective fields
API {indices}
create
delete
open/close
get/put/delete mapping
refresh
optimize
snapshot
update settings
analyze
status
flush
API {cluster}
health
state
nodes info
nodes stats
nodes shutdown
Query DSL
term / terms
range
prefix
bool
fuzzy
wildcard
query_string
default_operator
analyzer
phrase_slop
etc
filters
share some similar features with queries (term, range, etc)
why use a filter?
filters
faster than queries
cached (depends on the filter)
the cache is used for different queries against the same filter
no scoring
more useful ones: term, terms, range, prefix, and, or, not, exists, missing, query
facets
provide aggregated data based on the search request
terms, histogram, date histogram, range, statistical, and more
geo search
implemented as filters (and a facet)
geo_distance
geo_bounding_box
geo_polygon
interfaces
REST
including memcached
Java /!Groovy
Language clients (REST/Thrift):
pyes, PHP (standalone and symfony), Ruby, Perl
Flume sink implementation
elastica
similar to the other PHP ElasticSearch client
API naming is consistent with Zend Framework
can be extended for new filters, facets, etc
still under development
elastica
$es = new Elastica_Client('vm', 9200);$index = new Elastica_Index($es, 'test');$index->create(array(), true);$type = new Elastica_Type($index, 'person');$doc = new Elastica_Document(1, array('name' => 'Andrei Zmievski', 'email' => '[email protected]', 'username' => 'andrei', 'bills' => array(2, 3, 5)));$type->addDocument($doc);
$qs = new Elastica_Query_QueryString('andrei');$query = new Elastica_Query($qs);$resultSet = $type->search($query);print $resultSet->count();
exam
ple
data import
ES is not the primary data store (usually)
to import/synchronize data:
write an agent (Gearman, message queues, etc)
use rivers (CouchDB, RabbitMQ, Twitter)
10 more features
versioning
index aliases
parent/child docs
scripting
dynamic mapping templates
load balancing nodes
plugins
more_like_this
multi_field mapping
percolation
References
http://github.com/elasticsearch/elasticsearch
http://www.elasticsearch.org/community/forum
IRC: #elasticsearch on irc.freenode.net
twitter: @elasticsearch
HTTP://ZMIEVSKI.ORG/TALKS