Cassandra 3 new features @ Geecon Krakow 2016

Upload: duyhai-doan

Post on 16-Apr-2017

Page 1: Cassandra 3 new features  @ Geecon Krakow 2016

Cassandra 3.0 new features

DuyHai DOAN Apache Cassandra Evangelist

Kraków, 11-13 May 2016

Duyhai DOAN (@doanduyhai) Kraków, 11-13 May 2016

Who Am I ?

Apache Cassandra Evangelist
•  talks, meetups, confs
•  open-source devs (Achilles, Apache Zeppelin)
•  OSS Cassandra point of contact

[email protected] ☞ @doanduyhai

Datastax
•  Founded in April 2010
•  We contribute a lot to Apache Cassandra™
•  400+ customers (25 of the Fortune 100), 450+ employees
•  Headquarters in the San Francisco Bay Area
•  EU headquarters in London, offices in France and Germany
•  Datastax Enterprise = OSS Cassandra + extra features

Agenda
•  Materialized Views (MV)
•  User Defined Functions (UDF) & User Defined Aggregates (UDA)
•  JSON syntax
•  New SASI full text search

Materialized Views (MV)

Why Materialized Views ?
•  Relieve the pain of manual denormalization

CREATE TABLE user(id int PRIMARY KEY, country text, …);

CREATE TABLE user_by_country(
    country text,
    id int,
    …,
    PRIMARY KEY(country, id));

Materialized Views creation

Instead of maintaining the denormalized table by hand:

CREATE TABLE user_by_country (
    country text,
    id int,
    firstname text,
    lastname text,
    PRIMARY KEY(country, id));

let Cassandra maintain it as a view of the base table:

CREATE MATERIALIZED VIEW user_by_country
AS SELECT country, id, firstname, lastname
FROM user
WHERE country IS NOT NULL AND id IS NOT NULL
PRIMARY KEY(country, id);

Materialized View Demo

Materialized Views Performance
•  Write performance
    •  slower than normal write
    •  local lock + read-before-write cost (but paid only once for all views)
    •  for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views
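The write cost described above can be illustrated with a toy model. This is only a sketch in plain Python, not Cassandra code: the `BaseTable` class, its `view_mutations` counter and the sample rows are hypothetical names invented for the illustration, and locking, batchlog and replication are ignored.

```python
# Toy in-memory model of a base table and its materialized views.
# Assumption: every name here is hypothetical and only mirrors the slide.
class BaseTable:
    def __init__(self, view_keys):
        self.rows = {}  # primary key -> row dict
        # one dict per view, keyed by (view partition value, primary key)
        self.views = {col: {} for col in view_keys}
        self.view_mutations = 0  # extra mutations generated for the views

    def upsert(self, pk, row):
        # read-before-write: done once, shared by all views
        old = self.rows.get(pk)
        self.rows[pk] = row
        for col, view in self.views.items():
            if old is not None and old[col] != row[col]:
                # the view key changed: DELETE the stale view row
                del view[(old[col], pk)]
                self.view_mutations += 1
            # INSERT (or overwrite) the current view row
            view[(row[col], pk)] = row
            self.view_mutations += 1


users = BaseTable(view_keys=["country"])
users.upsert(1, {"country": "FR", "firstname": "Duy Hai"})
users.upsert(1, {"country": "PL", "firstname": "Duy Hai"})
```

The second `upsert` changes the view key, so it costs one DELETE plus one INSERT for the single view, which is the mv_count x 2 worst case from the slide.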

Materialized Views Performance
•  Write performance vs manual denormalization
    •  MV better because no client-server network traffic for read-before-write
    •  MV better because less network traffic for multiple views (client-side BATCH)
•  Makes developer life easier → priceless

Materialized Views Performance
•  Read performance vs secondary index
    •  MV better because single node read (secondary index can hit many nodes)
    •  MV better because single read path (secondary index = read index + read data)

Materialized Views Consistency
•  Consistency level
    •  CL honoured for base table, ONE for MV + local batchlog
•  Weaker consistency guarantees for MV than for base table

Q & A

User Defined Functions (UDF)

Rationale
•  Push computation server-side
    •  save network bandwidth (1000 nodes!)
    •  simplify client-side code
    •  provide standard & useful functions (sum, avg …)
    •  accelerate analytics use-cases (pre-aggregation for Spark)

How to create a UDF ?

CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
[keyspace.]functionName (param1 type1, param2 type2, …)
CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
RETURNS returnType
LANGUAGE language
AS $$
    // source code here
$$;

(param1 type1, param2 type2, …)
•  param name = name to refer to in the code
•  type = Cassandra type

CALLED ON NULL INPUT
•  function is always called, so a null-check is mandatory in the code

RETURNS NULL ON NULL INPUT
•  if any input is null, function execution is skipped and null is returned
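The difference between the two null-handling clauses can be sketched outside Cassandra. The snippet below is a plain-Python analogy under stated assumptions: the decorator and function names (`called_on_null_input`, `returns_null_on_null_input`, `max_of`, `max_or_other`) are hypothetical, and `None` stands in for a CQL null.

```python
# Plain-Python analogy of the two UDF null-handling modes.
# Assumption: decorator and function names are invented for illustration.
def called_on_null_input(func):
    """CALLED ON NULL INPUT: the body always runs and must handle None itself."""
    return func


def returns_null_on_null_input(func):
    """RETURNS NULL ON NULL INPUT: skip the body if any argument is None."""
    def wrapper(*args):
        if any(arg is None for arg in args):
            return None
        return func(*args)
    return wrapper


@returns_null_on_null_input
def max_of(a, b):
    # no null-check needed: never invoked with None
    return a if a >= b else b


@called_on_null_input
def max_or_other(a, b):
    # explicit null-checks are mandatory in this mode
    if a is None:
        return b
    if b is None:
        return a
    return a if a >= b else b
```

With these definitions, `max_of(3, None)` short-circuits to `None` without running the body, while `max_or_other(3, None)` runs the body and falls back to the non-null argument.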

Cassandra types
•  primitives (boolean, int, …)
•  collections (list, set, map)
•  tuples
•  UDT

LANGUAGE language: JVM supported languages
•  Java, Scala
•  Javascript (slow)
•  Groovy, Jython, JRuby
•  Clojure (JSR 223 impl issue)

UDF Demo

User Defined Aggregate (UDA)
•  Real use-case for UDF
•  Aggregation server-side → huge network bandwidth saving
•  Provide behavior similar to Group By, Sum, Avg etc …

How to create a UDA ?

CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
[keyspace.]aggregateName(type1, type2, …)
SFUNC accumulatorFunction
STYPE stateType
[FINALFUNC finalFunction]
INITCOND initCond;

•  aggregateName(type1, type2, …): only type, no param name
•  STYPE: state type
•  INITCOND: initial state value

Accumulator function signature:
accumulatorFunction(stateType, type1, type2, …) RETURNS stateType

Accumulator function ≈ foldLeft function

Optional final function signature: finalFunction(stateType)
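The foldLeft analogy can be made concrete with a short sketch. It is written in plain Python rather than CQL, and the helper `run_aggregate` and the functions `avg_state` and `avg_final` are hypothetical names chosen for this illustration; they simply play the roles of SFUNC, FINALFUNC and INITCOND.

```python
from functools import reduce


# Assumption: run_aggregate is an invented helper that mimics how an
# aggregate combines SFUNC, INITCOND and the optional FINALFUNC.
def run_aggregate(rows, sfunc, initcond, finalfunc=None):
    # fold the accumulator function over the rows, starting from INITCOND
    state = reduce(sfunc, rows, initcond)
    # apply the optional final function to the last state
    return finalfunc(state) if finalfunc is not None else state


# Average as an aggregate: the state is a (count, sum) tuple.
def avg_state(state, value):
    count, total = state
    return (count + 1, total + value)


def avg_final(state):
    count, total = state
    return total / count if count else None


average = run_aggregate([10, 20, 60], avg_state, (0, 0), avg_final)
```

The accumulator receives the state first and the column value second and returns a new state, matching the accumulatorFunction(stateType, type1, …) signature; the final function turns the last state into the result.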

UDA Demo

Gotchas
•  UDA in Cassandra is not distributed !
•  Do not execute UDA on a large number of rows (10^6 for ex.)
    •  single fat partition
    •  multiple partitions
    •  full table scan
•  → Increase client-side timeout
    •  default Java driver timeout = 12 secs

Cassandra UDA or Apache Spark ?

Consistency Level | Single/Multiple Partition(s) | Recommended Approach
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely

Q & A

JSON Syntax

Why JSON ?
•  JSON is a very good exchange format
•  But a terrible schema …
•  How to have best of both worlds ?
    •  use Cassandra schema
    •  convert rows to JSON format
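The "schema in the database, JSON on the wire" idea can be sketched as follows. This is a plain-Python illustration, not the CQL JSON syntax itself: `USER_SCHEMA`, `row_to_json` and `json_to_row` are hypothetical names, and the Python types only stand in for Cassandra column types.

```python
import json

# Assumption: a hypothetical schema mapping column names to Python types,
# standing in for the column types declared in a Cassandra table.
USER_SCHEMA = {"id": int, "country": str, "firstname": str}


def row_to_json(row, schema):
    """Serialize a row to JSON, emitting exactly the schema's columns."""
    return json.dumps({col: row.get(col) for col in schema})


def json_to_row(payload, schema):
    """Parse a JSON document and validate it against the schema."""
    row = {}
    for col, value in json.loads(payload).items():
        if col not in schema:
            raise ValueError(f"unknown column: {col}")
        if value is not None and not isinstance(value, schema[col]):
            raise TypeError(f"bad type for column {col}")
        row[col] = value
    return row


parsed = json_to_row('{"id": 1, "country": "FR"}', USER_SCHEMA)
encoded = row_to_json({"id": 1, "country": "FR", "firstname": "Duy Hai"}, USER_SCHEMA)
```

The JSON document is only an exchange format here: unknown columns and wrongly typed values are rejected because the schema, not the document, remains the source of truth.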

JSON Syntax Demo

SASI full text search index

Why SASI ?
•  Searching (and full text search) was always a pain point for Cassandra
    •  limited search predicates (=, <=, <, > and >= only)
    •  limited scope (only on primary key columns)
•  Existing secondary index performance is poor
    •  reversed-index
    •  use Cassandra itself as index storage …
    •  limited predicate ( = ). Inequality predicate = full cluster scan 😱

How is it implemented ?
•  New index structure = suffix trees
•  Extended predicates (=, inequalities, LIKE %)
•  Full text search (tokenizers, stop-words, stemming …)
•  Query Planner to optimize AND predicates
•  NO, we don’t use Apache Lucene
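To give an intuition for why indexing suffixes enables LIKE '%term%' queries, here is a deliberately simplified sketch. It uses a sorted list of suffixes searched with `bisect` instead of a real suffix tree, and it says nothing about the actual SASI on-disk format; the `SuffixIndex` class and its methods are hypothetical.

```python
import bisect


# Assumption: a toy in-memory index; a sorted suffix list stands in for
# the suffix tree mentioned on the slide.
class SuffixIndex:
    def __init__(self):
        self.entries = []  # sorted list of (suffix, row_id)

    def add(self, row_id, text):
        text = text.lower()
        # index every suffix of the value
        for i in range(len(text)):
            bisect.insort(self.entries, (text[i:], row_id))

    def contains(self, term):
        """Row ids matching LIKE '%term%': term is a prefix of some suffix."""
        term = term.lower()
        start = bisect.bisect_left(self.entries, (term,))
        hits = set()
        # suffixes sharing the prefix are contiguous in sorted order
        for suffix, row_id in self.entries[start:]:
            if not suffix.startswith(term):
                break
            hits.add(row_id)
        return hits


index = SuffixIndex()
index.add(1, "Cassandra")
index.add(2, "Sandwich")
index.add(3, "Spark")
```

A substring match becomes a prefix match on the indexed suffixes, so the lookup is a binary search plus a short scan rather than a pass over every value.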

Who made it ?
•  Open source contribution by an engineering team from …

Full Text Search Demo

When is it available ?
•  Right now with Cassandra ≥ 3.5
    •  available in Cassandra 3.4 but critical bugs
•  Later improvement
    •  index on collections (List, Set & Map) !
    •  OR clause (WHERE (xxx OR yyy) AND zzz)
    •  != operator

SASI vs Search Engine

SASI vs Solr/ElasticSearch/Datastax Enterprise Search ?
•  Cassandra is not a search engine !!! (database = durability)
•  always slower because 2 passes (SASI index read + original Cassandra data)
•  no scoring
•  no ordering (ORDER BY)
•  no grouping (GROUP BY) → Apache Spark for analytics

Q & A

Thank You @doanduyhai

[email protected]

https://academy.datastax.com/