elasticsearch field data types

FIELD DATA TYPES by Bo Andersen - codingexplained.com

Upload: bo-andersen

Post on 22-Jan-2018


OUTLINE

➤ Core data types

➤ String, numeric, date, boolean, binary

➤ Complex data types

➤ Object, array, nested

➤ Geo data types

➤ Geo-point, Geo-shape

➤ Specialized data types

➤ IPv4, completion, token count, attachment

CORE DATA TYPES

STRING

➤ String field types accept string values

➤ Can be sub-divided into full text and keywords

➤ We will take a look at these next

STRING - FULL TEXT

➤ Typically used for text based relevance searches (e.g. search for products by name)

➤ Full text fields are analyzed

➤ Data is passed through an analyzer to convert the string into a list of individual terms, before being indexed

➤ This allows Elasticsearch to search for individual words within a full text field

➤ Full text fields are not used for sorting and are rarely used for aggregations
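As a rough illustration of what analysis does, the Python sketch below lowercases a string, splits it into terms, and builds a small inverted index. It is only an approximation of an analyzer; the names `analyze` and `build_inverted_index`, the regular expression, and the sample product names are assumptions made for this example and are not part of Elasticsearch.

```python
import re


def analyze(text):
    """Hypothetical stand-in for an analyzer: lowercase, then split on non-word characters."""
    return re.findall(r"\w+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


# Sample product names (made up for this sketch).
docs = {1: "Wireless Optical Mouse", 2: "Optical Drive"}
index = build_inverted_index(docs)

# A search for a single word can now find every document containing it.
print(sorted(index["optical"]))
```

Because each word becomes its own term, a search for "optical" can match both products even though neither name is exactly "optical".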

STRING - KEYWORDS

➤ Exact values such as tags, status, e-mail addresses, etc.

➤ Keyword fields are not analyzed

➤ The exact string value is added to the index as a single term

➤ Typically used for filtering

➤ E.g. find all products where status is "On Discount"

➤ Also often used for sorting and aggregations
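The contrast with full text can be sketched in Python by indexing each value as one untouched term. This is an illustration under assumptions, not how Elasticsearch is implemented; `build_keyword_index` and the sample statuses are hypothetical.

```python
def build_keyword_index(docs):
    """Index the exact string value of each document as a single term."""
    index = {}
    for doc_id, value in docs.items():
        index.setdefault(value, set()).add(doc_id)
    return index


# Sample product statuses (made up for this sketch).
statuses = {1: "On Discount", 2: "Sold Out", 3: "On Discount"}
keyword_index = build_keyword_index(statuses)

# Filtering needs the exact value; a partial or differently cased value finds nothing.
on_discount = sorted(keyword_index.get("On Discount", set()))
partial = sorted(keyword_index.get("discount", set()))
print(on_discount, partial)
```

Since the whole value is one term, it can also be counted or sorted on directly, which is why keyword fields suit aggregations and sorting.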

NUMERIC

➤ Supports the following numeric types

➤ long (signed 64-bit integer)

➤ integer (signed 32-bit integer)

➤ short (signed 16-bit integer)

➤ byte (signed 8-bit integer)

➤ double (double-precision 64-bit floating point)

➤ float (single-precision 32-bit floating point)
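The bit widths above determine the value range of each integer type. The Python sketch below derives those ranges and picks the smallest type that fits a value; the helper names `signed_range` and `smallest_type_for` are made up for this example.

```python
# Bit widths of the integer types listed above, smallest first.
INTEGER_TYPES = {"byte": 8, "short": 16, "integer": 32, "long": 64}


def signed_range(bits):
    """Return the (min, max) values of a signed two's-complement integer."""
    return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1


def smallest_type_for(value):
    """Pick the narrowest integer type whose range contains the value."""
    for name, bits in INTEGER_TYPES.items():
        low, high = signed_range(bits)
        if low <= value <= high:
            return name
    raise ValueError("value does not fit in a signed 64-bit integer")


for name, bits in INTEGER_TYPES.items():
    print(name, signed_range(bits))
```

Choosing the narrowest type that covers the expected values is a reasonable starting point, for example `short` for a value such as 200.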

DATE

➤ Dates in Elasticsearch can be either

➤ Strings containing formatted dates

➤ E.g. 2016-01-01 or 2016/01/01 12:00:00

➤ A long number representing milliseconds since the epoch

➤ An integer representing seconds since the epoch

➤ Internally stored as a long number representing milliseconds since the epoch
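To make the internal representation concrete, the Python sketch below converts a formatted date string into milliseconds since the epoch. It is an illustration only: `to_epoch_millis` is a hypothetical helper, it assumes values without a time zone are UTC, and it treats every integer as milliseconds (it does not handle the seconds variant).

```python
from datetime import datetime, timezone


def to_epoch_millis(value):
    """Accept an ISO-style date string or epoch milliseconds; return epoch milliseconds."""
    if isinstance(value, int):
        return value  # assumption: integers are already milliseconds
    parsed = datetime.fromisoformat(value)
    if parsed.tzinfo is None:
        # Assumption of this sketch: values without a zone are UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return int(parsed.timestamp() * 1000)


print(to_epoch_millis("2016-01-01"))
print(to_epoch_millis("2016-01-01T12:00:00+00:00"))
```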

DATE - FORMATS

➤ Defaults to strict_date_optional_time||epoch_millis

➤ Accepts dates with an optional time that conform to strict_date_optional_time, or milliseconds since the epoch

➤ Examples

➤ 2016-01-01 (date only)

➤ 2016-01-01T12:00:00Z (date including time)

➤ 1410020500000 (milliseconds since the epoch)

➤ Multiple formats can be specified by separating them with the || separator

➤ E.g. yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis
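The idea of trying several formats in turn can be sketched in Python. The `strptime` patterns below are rough Python equivalents chosen for this example, not the pattern syntax Elasticsearch uses, and `parse_date` is a hypothetical helper that assumes UTC.

```python
from datetime import datetime, timezone

# Rough Python equivalents of "yyyy-MM-dd HH:mm:ss" and "yyyy-MM-dd"
# (an assumption of this sketch); epoch_millis is handled as the fallback.
PATTERNS = ["%Y-%m-%d %H:%M:%S", "%Y-%m-%d"]


def parse_date(value):
    """Try each pattern in order, then fall back to milliseconds since the epoch."""
    text = str(value)
    for pattern in PATTERNS:
        try:
            return datetime.strptime(text, pattern).replace(tzinfo=timezone.utc)
        except ValueError:
            continue  # this pattern did not match, try the next one
    return datetime.fromtimestamp(int(text) / 1000, tz=timezone.utc)


print(parse_date("2016-01-01 12:00:00"))
print(parse_date("2016-01-01"))
print(parse_date(1410020500000))
```

A value that matches none of the patterns and is not a number raises `ValueError` in this sketch.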

BOOLEAN

➤ Boolean fields accept true and false values as in JSON

➤ Can also accept strings and numbers which are interpreted as either true or false

➤ False values

➤ false, "false", "off", "no", "0", "" (empty string), 0, 0.0

➤ True values

➤ Anything that is not false
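The rule above can be written as a small Python function. This is only a sketch of the rule as stated on the slide; `interpret_boolean` is a hypothetical name and the behavior may differ between Elasticsearch versions.

```python
# The false values listed above; everything else counts as true.
FALSE_VALUES = (False, "false", "off", "no", "0", "", 0, 0.0)


def interpret_boolean(value):
    """Return False for the listed false values and True for anything else."""
    return value not in FALSE_VALUES


for sample in (False, "off", "", 0, "yes", 1, "true"):
    print(repr(sample), interpret_boolean(sample))
```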

BINARY

➤ A binary value as a Base64 encoded string

➤ E.g. aHR0cDovL2NvZGluZ2V4cGxhaW5lZC5jb20=

➤ Not searchable
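For illustration, the example value above can be decoded and re-encoded with Python's standard `base64` module. The variable names are made up for this sketch.

```python
import base64

# The Base64 string from the example above.
encoded = "aHR0cDovL2NvZGluZ2V4cGxhaW5lZC5jb20="

# Decode to the raw bytes, then encode again to show the round trip.
decoded = base64.b64decode(encoded)
round_trip = base64.b64encode(decoded).decode("ascii")

print(decoded)
print(round_trip == encoded)
```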

COMPLEX DATA TYPES

OBJECT

➤ JSON documents are hierarchical

➤ A document may contain inner objects, which in turn may contain inner objects

➤ In Elasticsearch, documents are indexed as flat lists of key-value pairs

{

"message": "Some text...",

"customer.age": 26,

"customer.address.city": "Copenhagen",

"customer.address.country": "Denmark"

}
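The flattening can be sketched in a few lines of Python. The nested input document below is an assumption, reconstructed from the flat keys shown above, and `flatten` is a hypothetical helper rather than anything Elasticsearch exposes.

```python
def flatten(obj, prefix=""):
    """Flatten nested dictionaries into a single level with dotted key paths."""
    flat = {}
    for key, value in obj.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, path))  # recurse into inner objects
        else:
            flat[path] = value
    return flat


# Assumed hierarchical form of the flat example above.
document = {
    "message": "Some text...",
    "customer": {
        "age": 26,
        "address": {"city": "Copenhagen", "country": "Denmark"},
    },
}

flat_document = flatten(document)
for key, value in flat_document.items():
    print(key, "=", value)
```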

ARRAY

➤ Elasticsearch does not have a dedicated array type

➤ Any field can contain zero or more values by default

➤ All values in an array must be of the same data type

➤ When adding a field dynamically, the first value in the array determines the field type

➤ Examples

➤ Array of strings: ["Elasticsearch", "rocks"]

➤ Array of integers: [1, 2]

➤ Array of arrays: [1, [2, 3]] - equivalent of [1, 2, 3]

➤ Array of objects: [{ "name": "Andy", "age": 26 }, { "name": "Brenda", "age": 32 }]

ARRAY - OBJECTS

➤ Arrays of objects do not work as you would expect

➤ You cannot query each object independently of the other objects in the array

➤ Lucene has no concept of inner objects

➤ Elasticsearch flattens object hierarchies into a list of field names and values

{ "users": [{ "name": "Andy", "age": 26 }, { "name": "Brenda", "age": 32 }] }

is stored similar to this:

{ "users.name": ["Andy", "Brenda"], "users.age": [32, 26] }

➤ The association between "Andy" and 26 is lost

➤ A search for a user named "Andy" who is 26 years old would return incorrect results!

➤ If you need to be able to do this, then you must use the nested data type
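The lost association can be reproduced with a short Python sketch. It mimics the flattening described above with plain lists; `flatten_object_array` and `flat_match` are hypothetical helpers, not Elasticsearch APIs.

```python
def flatten_object_array(field, objects):
    """Collect the values of each key across all objects into one multi-value field."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat


def flat_match(flat, name, age):
    """Match if the name occurs anywhere AND the age occurs anywhere."""
    return name in flat["users.name"] and age in flat["users.age"]


users = [{"name": "Andy", "age": 26}, {"name": "Brenda", "age": 32}]
flat = flatten_object_array("users", users)

# No single user is named Andy and aged 32, yet the flattened document matches.
print(flat_match(flat, "Andy", 32))
```

Each condition is satisfied by a different object, so the document matches even though no single user satisfies both.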

NESTED

➤ If you need to index arrays of objects and to maintain the independence of each object in the array, you should use the nested data type

➤ Internally, nested objects index each object in the array as a separate hidden document

➤ Each nested object can be queried independently of the others, with a nested query

➤ A nested query is executed against the nested objects as if they were indexed as separate documents (internally, this is actually the case)
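The effect of keeping each object separate can be sketched in Python by checking all conditions against one object at a time. This only imitates the behavior described above; `nested_match` is a hypothetical helper and does not show the actual query syntax.

```python
def nested_match(objects, **criteria):
    """Match only if a single object satisfies every criterion."""
    return any(
        all(obj.get(key) == value for key, value in criteria.items())
        for obj in objects
    )


users = [{"name": "Andy", "age": 26}, {"name": "Brenda", "age": 32}]

# Each object is evaluated on its own, so name and age must belong together.
print(nested_match(users, name="Andy", age=26))
print(nested_match(users, name="Andy", age=32))
```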

GEO DATA TYPES

GEO-POINT

➤ Latitude-longitude pairs

➤ Used for geographical operations on documents (searching, sorting, ...)

{

"location": {

"lat": 33.5206608,

"lon": -86.8024900

}

}

{

"location": "33.5206608,-86.8024900"

}

{

"location": "drm3btev3e86"

}

{

"location": [-86.8024900,33.5206608]

}
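Three of the four formats above can be normalized to a (lat, lon) pair with a short Python sketch. Note the ordering: the string form is "lat,lon", while the array form is [lon, lat]. `to_lat_lon` is a hypothetical helper, and geohash strings are deliberately not decoded here.

```python
def to_lat_lon(location):
    """Normalize an object, "lat,lon" string, or [lon, lat] array to (lat, lon)."""
    if isinstance(location, dict):
        return float(location["lat"]), float(location["lon"])
    if isinstance(location, str) and "," in location:
        lat, lon = location.split(",")  # string form: latitude first
        return float(lat), float(lon)
    if isinstance(location, (list, tuple)):
        lon, lat = location  # array form: longitude first
        return float(lat), float(lon)
    raise ValueError("unsupported format (geohash strings are not decoded in this sketch)")


as_object = to_lat_lon({"lat": 33.5206608, "lon": -86.8024900})
as_string = to_lat_lon("33.5206608,-86.8024900")
as_array = to_lat_lon([-86.8024900, 33.5206608])
print(as_object, as_string, as_array)
```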

GEO-SHAPE

➤ Geo shapes such as rectangles and polygons

➤ Should be used when either the data being indexed or the queries being executed contain shapes other than just points

➤ LineString

➤ Array of two or more positions (array of arrays). Straight line in the case of two points

➤ Polygon

➤ An array of arrays, where each array contains points

➤ The first and last points in the outer array must be the same (to close the polygon)

➤ ...
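The closing rule for polygons can be checked with a small Python sketch. The shape below is a made-up, GeoJSON-style example with [lon, lat] positions; `is_closed_ring` and `close_ring` are hypothetical helpers.

```python
def is_closed_ring(ring):
    """A ring is closed when it has at least four positions and ends where it starts."""
    return len(ring) >= 4 and ring[0] == ring[-1]


def close_ring(ring):
    """Append the first position if the ring is not already closed."""
    return ring if ring[0] == ring[-1] else ring + [ring[0]]


# Sample polygon (coordinates invented for this sketch), one outer ring.
polygon = {
    "type": "polygon",
    "coordinates": [
        [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]
    ],
}

outer_ring = polygon["coordinates"][0]
print(is_closed_ring(outer_ring))
print(close_ring([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]))
```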

SPECIALIZED DATA TYPES

IPV4

➤ Used to map IPv4 addresses

➤ Internally, values are indexed as long values
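How a dotted IPv4 address maps to a single number can be shown with Python's standard `ipaddress` module. This illustrates the numeric representation only; `ipv4_to_long` is a hypothetical helper, and the addresses are sample values.

```python
import ipaddress


def ipv4_to_long(address):
    """Convert a dotted IPv4 address to its 32-bit numeric value."""
    return int(ipaddress.IPv4Address(address))


# Numeric values make range checks simple comparisons.
low = ipv4_to_long("192.168.1.0")
high = ipv4_to_long("192.168.1.255")
candidate = ipv4_to_long("192.168.1.1")
print(candidate, low <= candidate <= high)
```

Storing addresses as numbers is what makes range-style lookups over addresses straightforward.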

COMPLETION

➤ The completion suggester is a so-called prefix suggester

➤ It does not do spell correction, but enables basic auto-complete functionality

➤ Useful for providing the user with suggestions while searching, e.g. like on Google

➤ Stores an FST (Finite State Transducer) as part of the index

➤ Allows for very fast loads and executions

➤ You don't have to worry about this - just know when to use this type
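What "prefix suggester" means can be sketched in Python with a sorted list and binary search. This is not an FST and not how Elasticsearch implements completion; `suggest` and the sample terms are assumptions made for this example.

```python
import bisect


def suggest(sorted_terms, prefix, limit=5):
    """Return up to `limit` terms from a sorted list that start with the prefix."""
    start = bisect.bisect_left(sorted_terms, prefix)
    results = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order means no later term can match
        results.append(term)
        if len(results) == limit:
            break
    return results


# Sample suggestions (made up for this sketch).
terms = sorted(["elastic", "elasticsearch", "elk stack", "kibana", "logstash"])

print(suggest(terms, "ela"))
print(suggest(terms, "elsa"))  # a misspelled prefix yields nothing: no spell correction
```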

TOKEN COUNT

➤ An integer field which accepts string values

➤ The string values are analyzed, and the number of tokens is indexed

➤ Example

➤ A name property could have a length field of the type token_count

➤ Then, a search query could be executed to find persons whose name contains X tokens (split by space, for instance)
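The example above can be imitated in Python by counting whitespace-separated tokens and filtering on the count. This is a sketch under assumptions: `token_count` is a hypothetical helper, splitting on whitespace stands in for a real analyzer, and the names are sample data.

```python
def token_count(value):
    """Count tokens by splitting on whitespace (a stand-in for a real analyzer)."""
    return len(value.split())


# Sample persons; "name.length" plays the role of the token count field.
people = [{"name": "Bo Andersen"}, {"name": "John Fitzgerald Kennedy"}]
for person in people:
    person["name.length"] = token_count(person["name"])

# Find persons whose name contains exactly three tokens.
three_tokens = [p["name"] for p in people if p["name.length"] == 3]
print(three_tokens)
```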

ATTACHMENT

➤ Lets Elasticsearch index attachments in common formats

➤ E.g. PDF, XLS, PPT, ...

➤ Attachment content is stored as a Base64 encoded string

➤ This functionality is available as a plugin that must be installed

➤ sudo /path/to/elasticsearch/bin/plugin install mapper-attachments

➤ Must be installed on every node of a cluster

➤ Nodes must be restarted after the installation

THANK YOU FOR WATCHING!