Cassandra 3 new features @ GeeCON Kraków 2016
TRANSCRIPT
Cassandra 3.0 new features
DuyHai DOAN Apache Cassandra Evangelist
Duyhai DOAN (@doanduyhai) Kraków, 11-13 May 2016
Who Am I?
Apache Cassandra Evangelist
• talks, meetups, confs
• open-source devs (Achilles, Apache Zeppelin)
• OSS Cassandra point of contact
☞ [email protected]
☞ @doanduyhai
DataStax
• Founded in April 2010
• We contribute a lot to Apache Cassandra™
• 400+ customers (25 of the Fortune 100), 450+ employees
• Headquarters in the San Francisco Bay Area
• EU headquarters in London, offices in France and Germany
• DataStax Enterprise = OSS Cassandra + extra features
Agenda
• Materialized Views (MV)
• User Defined Functions (UDF) & User Defined Aggregates (UDA)
• JSON syntax
• New SASI full text search
Materialized Views (MV)
Why Materialized Views?
• Relieve the pain of manual denormalization
CREATE TABLE user(id int PRIMARY KEY, country text, …);
CREATE TABLE user_by_country(country text, id int, …, PRIMARY KEY(country, id));
Materialized Views creation
-- Before: denormalized table maintained manually by the application
CREATE TABLE user_by_country (
    country text,
    id int,
    firstname text,
    lastname text,
    PRIMARY KEY(country, id));
-- After: view maintained by Cassandra from the base table
CREATE MATERIALIZED VIEW user_by_country AS
    SELECT country, id, firstname, lastname
    FROM user
    WHERE country IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY(country, id);
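As a rough sketch of how such a view is used: writes go to the base table only, and the view is read like a regular table. The full column list of user and the sample values below are assumptions for illustration, since the schema shown earlier elides some columns.

```cql
-- Assumed base table (must exist before the view is created).
-- Columns other than id and country are hypothetical, chosen to match
-- the view's SELECT list.
CREATE TABLE IF NOT EXISTS user (
    id int PRIMARY KEY,
    country text,
    firstname text,
    lastname text
);

-- Write to the base table only; Cassandra keeps user_by_country in sync.
INSERT INTO user (id, country, firstname, lastname)
VALUES (1, 'FR', 'John', 'DOE');

-- Read through the view, using the view's own partition key.
SELECT id, firstname, lastname
FROM user_by_country
WHERE country = 'FR';
```

Direct INSERT, UPDATE or DELETE statements against the view itself are rejected; all changes go through the base table.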
Materialized View Demo
Materialized Views Performance
• Write performance
  • slower than normal write
  • local lock + read-before-write cost (but paid only once for all views)
  • for each base table update, worst case: mv_count x 2 (DELETE + INSERT) extra mutations for the views (e.g. 3 views → up to 6 extra mutations)
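To make the DELETE + INSERT worst case concrete, here is a minimal sketch of a base table update that changes a column used in the view's primary key. It assumes the user table and user_by_country view from the earlier slides, with a row (id = 1, country = 'FR') already present; the commented statements only describe the effect on the view and are not something the client issues.

```cql
-- The client issues a single update on the base table.
UPDATE user SET country = 'US' WHERE id = 1;

-- Conceptually, the view then needs the equivalent of:
--   DELETE FROM user_by_country WHERE country = 'FR' AND id = 1;
--   INSERT INTO user_by_country (country, id, ...) VALUES ('US', 1, ...);
-- The old value 'FR' is only known after reading the current base row,
-- which is where the read-before-write cost comes from.
```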
Materialized Views Performance
• Write performance vs manual denormalization
  • MV better because no client-server network traffic for read-before-write
  • MV better because less network traffic for multiple views (client-side BATCH)
  • Makes developer life easier → priceless
Materialized Views Performance
• Read performance vs secondary index
  • MV better because single node read (secondary index can hit many nodes)
  • MV better because single read path (secondary index = read index + read data)
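The two read paths being compared can be sketched as follows, assuming the user table and user_by_country view from the earlier slides. The index name user_country_idx is hypothetical.

```cql
-- Option 1: secondary index on the base table.
-- The query may have to contact many nodes, and each node reads
-- its local index first, then the matching data.
CREATE INDEX IF NOT EXISTS user_country_idx ON user (country);
SELECT * FROM user WHERE country = 'FR';

-- Option 2: materialized view partitioned by country.
-- The query targets the replicas of a single partition and follows
-- the normal read path.
SELECT * FROM user_by_country WHERE country = 'FR';
```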
Materialized Views Consistency
• Consistency level
  • CL honoured for base table, ONE for MV + local batchlog
  • Weaker consistency guarantees for MV than for base table
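A small cqlsh session can illustrate what the weaker guarantee means in practice. This is only a sketch under the assumptions of the earlier slides (user table and user_by_country view, with firstname and lastname columns); the sample values are made up, and the timing comments paraphrase the bullets above rather than describe the exact write path.

```cql
-- cqlsh shell command (not part of CQL): consistency level for this session.
CONSISTENCY QUORUM;

-- The requested level applies to the base table write.
INSERT INTO user (id, country, firstname, lastname)
VALUES (2, 'PL', 'Jan', 'KOWALSKI');

-- The view row is propagated separately, so a read through the view
-- issued right after the write may not see the new row yet.
SELECT * FROM user_by_country WHERE country = 'PL' AND id = 2;
```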
Q & A
User Defined Functions (UDF)
Rationale
• Push computation server-side
  • save network bandwidth (1000 nodes!)
  • simplify client-side code
  • provide standard & useful functions (sum, avg …)
  • accelerate analytics use-case (pre-aggregation for Spark)
How to create a UDF?
CREATE [OR REPLACE] FUNCTION [IF NOT EXISTS]
    [keyspace.]functionName (param1 type1, param2 type2, …)
    CALLED ON NULL INPUT | RETURNS NULL ON NULL INPUT
    RETURNS returnType
    LANGUAGE language
    AS $$ // source code here $$;
• param1 type1, param2 type2, …: param name to refer to in the code, type = Cassandra type
• CALLED ON NULL INPUT: always called. Null-check mandatory in code!
• RETURNS NULL ON NULL INPUT: if any input is null, function execution is skipped and null is returned
• RETURNS returnType: Cassandra types
  • primitives (boolean, int, …)
  • collections (list, set, map)
  • tuples
  • UDT
• LANGUAGE language: JVM supported languages
  • Java, Scala
  • Javascript (slow)
  • Groovy, Jython, JRuby
  • Clojure (JSR 223 impl issue)
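A minimal sketch of two Java UDFs, one for each null-handling mode, may help make the template concrete. The function names maxof and safe_length and the scores table are hypothetical, and the sketch assumes user defined functions are enabled on the nodes (enable_user_defined_functions: true in cassandra.yaml).

```cql
-- RETURNS NULL ON NULL INPUT: the body is skipped when any argument is null,
-- so no null-check is needed.
CREATE OR REPLACE FUNCTION maxof(current int, candidate int)
    RETURNS NULL ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS $$ return Math.max(current, candidate); $$;

-- CALLED ON NULL INPUT: the body always runs, so it must handle null itself.
CREATE OR REPLACE FUNCTION safe_length(input text)
    CALLED ON NULL INPUT
    RETURNS int
    LANGUAGE java
    AS $$ return input == null ? 0 : input.length(); $$;

-- Hypothetical table used only to show the call syntax.
CREATE TABLE IF NOT EXISTS scores (
    player text PRIMARY KEY,
    round1 int,
    round2 int
);

SELECT player, maxof(round1, round2), safe_length(player) FROM scores;
```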
UDF Demo
User Defined Aggregate (UDA)
• Real use-case for UDF
• Aggregation server-side → huge network bandwidth saving
• Provide similar behavior for Group By, Sum, Avg etc …
How to create a UDA?
CREATE [OR REPLACE] AGGREGATE [IF NOT EXISTS]
    [keyspace.]aggregateName(type1, type2, …)
    SFUNC accumulatorFunction
    STYPE stateType
    [FINALFUNC finalFunction]
    INITCOND initCond;
• aggregateName(type1, type2, …): only type, no param name
• STYPE stateType: state type
• INITCOND initCond: initial state
• SFUNC accumulatorFunction: accumulator function signature:
  accumulatorFunction(stateType, type1, type2, …) RETURNS stateType
  Accumulator function ≈ foldLeft function
• FINALFUNC finalFunction: optional final function signature: finalFunction(stateType)
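The pieces fit together as in the following sketch of an average aggregate, modelled on the well-known example from the Cassandra documentation. The names avg_state, avg_final, my_avg and the sensor_data table are hypothetical; the state is a (count, sum) tuple that the accumulator updates row by row, foldLeft style, and the final function turns it into the result.

```cql
-- Accumulator: (state, value) -> new state. State = (count, sum).
CREATE OR REPLACE FUNCTION avg_state(state tuple<int, bigint>, val int)
    CALLED ON NULL INPUT
    RETURNS tuple<int, bigint>
    LANGUAGE java
    AS $$
        if (val != null) {
            state.setInt(0, state.getInt(0) + 1);
            state.setLong(1, state.getLong(1) + val.intValue());
        }
        return state;
    $$;

-- Final function: state -> result. Returns null when no row was seen.
CREATE OR REPLACE FUNCTION avg_final(state tuple<int, bigint>)
    CALLED ON NULL INPUT
    RETURNS double
    LANGUAGE java
    AS $$
        if (state.getInt(0) == 0) return null;
        double r = state.getLong(1);
        r /= state.getInt(0);
        return Double.valueOf(r);
    $$;

-- Aggregate: only the input type is declared, no parameter name.
CREATE OR REPLACE AGGREGATE my_avg(int)
    SFUNC avg_state
    STYPE tuple<int, bigint>
    FINALFUNC avg_final
    INITCOND (0, 0);

-- Hypothetical table; the aggregate is applied to a single partition.
CREATE TABLE IF NOT EXISTS sensor_data (
    sensor_id int,
    ts timestamp,
    value int,
    PRIMARY KEY (sensor_id, ts)
);

SELECT my_avg(value) FROM sensor_data WHERE sensor_id = 42;
```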
UDA Demo
Gotchas
• UDA in Cassandra is not distributed!
• Do not execute UDA on a large number of rows (10^6 for ex.)
  • single fat partition
  • multiple partitions
  • full table scan
• → Increase client-side timeout
  • default Java driver timeout = 12 secs
Cassandra UDA or Apache Spark?
Consistency Level | Single/Multiple Partition(s) | Recommended Approach
ONE               | Single partition             | UDA with token-aware driver because node local
ONE               | Multiple partitions          | Apache Spark because distributed reads
> ONE             | Single partition             | UDA because data-locality lost with Spark
> ONE             | Multiple partitions          | Apache Spark definitely
Q & A
JSON Syntax
Why JSON?
• JSON is a very good exchange format
• But a terrible schema …
• How to have best of both worlds?
  • use Cassandra schema
  • convert rows to JSON format
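In CQL this looks roughly as follows: the table keeps its typed schema, and rows are converted to and from JSON at the statement level. The sketch reuses the user table from the materialized views section, with an assumed column list (firstname and lastname are not spelled out in the original base table) and made-up values.

```cql
-- Assumed schema, for illustration only.
CREATE TABLE IF NOT EXISTS user (
    id int PRIMARY KEY,
    country text,
    firstname text,
    lastname text
);

-- Insert a whole row from a JSON document; keys map to column names
-- and values are validated against the column types.
INSERT INTO user JSON '{"id": 1, "country": "FR", "firstname": "John", "lastname": "DOE"}';

-- Return each row as a single JSON document.
SELECT JSON * FROM user WHERE id = 1;

-- Convert individual values with fromJson() / toJson().
INSERT INTO user (id, country) VALUES (2, fromJson('"PL"'));
SELECT id, toJson(country) FROM user WHERE id = 2;
```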
JSON Syntax Demo
SASI full text search index
Why SASI?
• Searching (and full text search) was always a pain point for Cassandra
  • limited search predicates (=, <=, <, > and >= only)
  • limited scope (only on primary key columns)
• Existing secondary index performance is poor
  • reversed-index
  • use Cassandra itself as index storage …
  • limited predicate ( = ). Inequality predicate = full cluster scan 😱
How is it implemented?
• New index structure = suffix trees
• Extended predicates (=, inequalities, LIKE %)
• Full text search (tokenizers, stop-words, stemming …)
• Query Planner to optimize AND predicates
• NO, we don’t use Apache Lucene
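As a minimal sketch of what this looks like from CQL, the statements below create a SASI index in CONTAINS mode and run a LIKE query against it. The user table, its lastname column and the index name user_lastname_sasi are assumptions for illustration; the index class and option names are those of the SASI implementation shipped with Cassandra.

```cql
-- Case-insensitive substring search on a text column.
CREATE CUSTOM INDEX IF NOT EXISTS user_lastname_sasi ON user (lastname)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = {
    'mode': 'CONTAINS',
    'analyzer_class': 'org.apache.cassandra.index.sasi.analyzer.NonTokenizingAnalyzer',
    'case_sensitive': 'false'
};

-- Matches any lastname containing "oe", regardless of case.
SELECT * FROM user WHERE lastname LIKE '%oe%';
```

For tokenized full text search (stop-words, stemming), the StandardAnalyzer class from the same package would be the one to configure instead.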
Who made it?
• Open source contribution by an engineering team from …
Full Text Search Demo
When is it available?
• Right now with Cassandra ≥ 3.5
  • available in Cassandra 3.4 but with critical bugs
• Later improvements
  • index on collections (List, Set & Map)
  • OR clause (WHERE (xxx OR yyy) AND zzz)
  • != operator
SASI vs Search Engine
SASI vs Solr/ElasticSearch/DataStax Enterprise Search?
• Cassandra is not a search engine!!! (database = durability)
• always slower because 2 passes (SASI index read + original Cassandra data)
• no scoring
• no ordering (ORDER BY)
• no grouping (GROUP BY) → Apache Spark for analytics
Q & A