cassandra day sv 2014: building a flexible, real-time big data applications platform on apache...

Building a Flexible, Real-time Big Data Applications Platform

on Cassandra with Kiji

Cassandra Day Silicon Valley, 07 April 2014

Clint Kelly, Member of Technical Staff, WibiData

Overview

• The Kiji Project
• The Kiji data model and KijiSchema
• Mapping Kiji to Cassandra
• Status and future work
• Try it now!

Should there be any intro page that talks about WibiData anywhere?

The Kiji Project

Want to build this...

Have this...

Open source components

• Batch processing
  – Extract, transform, load
  – Train machine learning models
• Scalable storage
  – Time-series data
• Serialization
  – Complex data types

Hadoop, C*, HBase, Avro

[Diagram: Kiji component stack: KijiSchema, KijiMR, KijiREST, KijiHive, KijiScoring, KijiExpress]

KijiSchema

• Schemas and data serialization
• Complex, atomic data types

record UserLog {
  long timestamp;
  int user_id;
  string url;
  long session_id;
}

• Schema evolution
• Table metadata
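To illustrate what schema evolution buys you, here is a minimal Python sketch of one Avro resolution rule: a field that exists only in the reader's schema is filled from its default. It does not use the Avro library. `USER_LOG_V2`, `resolve`, the default of -1, and the sample values are all hypothetical and chosen only for illustration.

```python
_NO_DEFAULT = object()

# Reader schema as (field name, default) pairs; a stand-in for an Avro schema.
USER_LOG_V2 = [
    ("timestamp", _NO_DEFAULT),
    ("user_id", _NO_DEFAULT),
    ("url", _NO_DEFAULT),
    ("session_id", -1),  # assumed to be added later, so it carries a default
]


def resolve(written_record, reader_fields):
    """Project a decoded record onto the reader's fields, filling defaults."""
    resolved = {}
    for name, default in reader_fields:
        if name in written_record:
            resolved[name] = written_record[name]
        elif default is not _NO_DEFAULT:
            resolved[name] = default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# A record written before session_id existed (made-up sample values).
old_record = {"timestamp": 1396560123, "user_id": 42, "url": "/songs/help"}
new_view = resolve(old_record, USER_LOG_V2)
```

Old data stays readable after the schema gains a field, as long as the new field declares a default.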

Kiji batch components

• Scala DSL ➔ describe MapReduce computations

• Machine learning library
• Hive adapter


Kiji real-time components

• REST server
• Scoring server


Kiji Summary

• Bridge between open-source technologies and real-time, big data applications

• Users are building real systems with Kiji now!
  – Personalized recommendation systems for retail
  – Energy usage and analytics reporting

The Kiji data model and KijiSchema

Tables are composed of rows.

[Diagram: a row with an “entity ID” column and a “data” column]

We call row keys “entity IDs.”

[Diagram: a row with entity ID (0xfa, “bob”) and its data]

We support composite entity IDs (with hashed and unhashed components).

[Diagram: a row with entity ID (0xfa, “bob”) and column families “info” and “songs”]

Data in rows is organized into “column families.”

[Diagram: row (0xfa, “bob”) with columns info:email, info:payment, songs:let it be, songs:help, songs:helter skelter]

Column families contain columns, named as “family:qualifier.”

[Diagram: the column songs:let it be holds a stack of versions; one is timestamped 1396560123]

Individual columns can have many different timestamped versions.
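The model described so far (entity ID, then family, then qualifier, then timestamped versions) can be pictured as nested maps. The Python sketch below is only an illustration of that shape. `KijiRowSketch` and its methods are hypothetical names, not the Kiji API, and the cell values are placeholders.

```python
from collections import defaultdict


class KijiRowSketch:
    """Toy model of one row: family -> qualifier -> {timestamp: value}."""

    def __init__(self, entity_id):
        self.entity_id = entity_id
        self._cells = defaultdict(lambda: defaultdict(dict))

    def put(self, family, qualifier, timestamp, value):
        self._cells[family][qualifier][timestamp] = value

    def get_versions(self, family, qualifier, max_versions=1):
        """Return up to max_versions (timestamp, value) pairs, newest first."""
        versions = self._cells[family][qualifier]
        newest_first = sorted(versions.items(), reverse=True)
        return newest_first[:max_versions]


row = KijiRowSketch(entity_id=(0xFA, "bob"))
row.put("info", "email", 1396560000, "bob@gmail.com")
row.put("songs", "let it be", 1396560100, "play-1")
row.put("songs", "let it be", 1396560123, "play-2")

latest = row.get_versions("songs", "let it be")
history = row.get_versions("songs", "let it be", max_versions=5)
```

Reading the newest version is the default here, and asking for more versions walks back in time.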


Data values can be complex records

record SongPlay {
  long song_id;
  int user_rating;
  long session_id;
  device_type device;
}

Locality groups

Separate logical organization of data (column families) from physical attributes (caching, compression, etc.)

[Diagram: a row with an entity ID and column families info, songs_today, songs_prev_year]

• “real_time” (in-memory, uncompressed, TTL = 1 day): need this data ASAP for real-time scoring.
• “batch” (compressed, TTL = 12 months): use this data only for batch jobs.

Locality groups

Always refer to columns by logical name (“family:qualifier”).


KijiSchema summary

• Data model similar to Cassandra, HBase, BigTable
• Contains time dimension (not present in C*)
• Logical and physical organization separate
• Complex schemas with Avro

Mapping Kiji to Cassandra

Implementation notes

• Built for Cassandra 2.0.6+
• Native protocol / Java driver (no Thrift)
• Asynchronous API
• Assume users have Hadoop, ZooKeeper

Mapping a Kiji table ➔ Cassandra

• Locality group ➔ Table
• Entity ID ➔ Primary key
  – Hashed components ➔ partition key
  – Unhashed components ➔ clustering columns
• Family, qualifier, timestamp ➔ clustering columns
• Cell values ➔ blobs
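The mapping rules above can be sketched as a small generator that emits a CREATE TABLE statement for one locality group. This is an illustrative sketch, not Kiji's actual code. `locality_group_cql` is a hypothetical helper, and the choice of ascending order for everything except the timestamp is an assumption made so that the newest cell comes first.

```python
def locality_group_cql(table_name, hashed, unhashed):
    """Return a CREATE TABLE statement for one locality group.

    hashed / unhashed: lists of (column_name, cql_type) entity ID components.
    """
    columns = list(hashed) + list(unhashed) + [
        ("family", "text"),
        ("qualifier", "text"),
        ("timestamp", "bigint"),
        ("value", "blob"),
    ]
    column_defs = ",\n".join(f"  {name} {cql_type}" for name, cql_type in columns)
    # Hashed components form the partition key.
    partition_key = ", ".join(name for name, _ in hashed)
    # Unhashed components, then family, qualifier, timestamp, cluster the rows.
    clustering = [name for name, _ in unhashed] + ["family", "qualifier", "timestamp"]
    order = ", ".join(
        f"{name} {'DESC' if name == 'timestamp' else 'ASC'}" for name in clustering
    )
    return (
        f"CREATE TABLE {table_name} (\n{column_defs},\n"
        f"  PRIMARY KEY (({partition_key}), {', '.join(clustering)})\n"
        f") WITH CLUSTERING ORDER BY ({order});"
    )


cql = locality_group_cql(
    "users_locality_group_fast",
    hashed=[("userid", "bigint")],
    unhashed=[("username", "text")],
)
```

Every locality group gets the same trailing columns (family, qualifier, timestamp, value). Only the entity ID components differ from table to table.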


CQL for Kiji locality group

CREATE TABLE users_locality_group_fast (
  userid bigint,
  username text,
  family text,
  qualifier text,
  timestamp bigint,
  value blob,
  PRIMARY KEY (userid, username, family, qualifier, timestamp)
) WITH CLUSTERING ORDER BY (username ASC, family ASC, qualifier ASC, timestamp DESC);

TODO: Show row diagram, arrows pointing to components?

cqlsh:kiji_music> SELECT * FROM kiji_table_users;

 userid | username | family | qualifier      | timestamp | value
--------+----------+--------+----------------+-----------+---------------
   0xfa |      bob |   info |          email | 139653249 | 1243970104327
   0xfa |      bob |  songs |     abbey road | 139656012 | 0981274331032
   0xfa |      bob |  songs |           help | 139625013 | 9074132704129
   0xfa |      bob |  songs |           help | 139621359 | 1923079210370
   0xfa |      bob |  songs |           help | 139625013 | 4745018223497
   0xfa |      bob |  songs | helter skelter | 139621324 | 7710423974234

Physical organization of data on disk

0xfa:bob:info:email:t0:bob@gmail.com

0xfa:bob:info:payment:t1:AMEX1234...

0xfa:bob:songs:let it be:t5:...

0xfa:bob:songs:let it be:t4:…

0xfa:bob:songs:let it be:t2:…

0xfa:bob:songs:help:t2:…

0xfa:bob:songs:helter skelter:t1:…

Efficient queries = continuous scans!
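Why the on-disk order matters can be seen in a short Python sketch. If cells are kept sorted by (family, qualifier, timestamp descending), then "everything in one family" is a single slice of the sorted list. This is a toy model of the idea and not how Cassandra stores data internally. The cell values other than the email address are placeholders, and `scan_family` is a hypothetical helper.

```python
from bisect import bisect_left

# Cells of one partition, as (family, qualifier, timestamp, value), unsorted.
cells = [
    ("songs", "help", 2, "v1"),
    ("info", "email", 0, "bob@gmail.com"),
    ("songs", "let it be", 5, "v2"),
    ("songs", "let it be", 2, "v3"),
    ("info", "payment", 1, "v4"),
    ("songs", "let it be", 4, "v5"),
    ("songs", "helter skelter", 1, "v6"),
]


def sort_key(cell):
    """Clustering order: family ASC, qualifier ASC, timestamp DESC."""
    family, qualifier, timestamp, _ = cell
    return (family, qualifier, -timestamp)


stored = sorted(cells, key=sort_key)
keys = [sort_key(cell) for cell in stored]


def scan_family(family):
    """Return all cells of one family as a single contiguous slice."""
    start = bisect_left(keys, (family,))
    end = bisect_left(keys, (family + "\x00",))
    return stored[start:end]


songs = scan_family("songs")
```

Two binary searches locate the slice boundaries, and no cell outside the slice is ever examined.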

Kiji queries ➔ CQL queries

All data in “info” column family for “bob” ➔

SELECT qualifier, value FROM music
WHERE userid=0xfa AND username='bob' AND family='info';


Data in “info:email” and last play of “help” for “bob” ➔

SELECT value FROM music
WHERE userid=0xfa AND username='bob' AND family='info' AND qualifier='email';

SELECT value FROM music
WHERE userid=0xfa AND username='bob' AND family='songs' AND qualifier='help'
LIMIT 1;


All songs played by “bob” on April 2nd ➔

SELECT qualifier, value FROM music
WHERE userid=0xfa AND username='bob' AND family='songs'
  AND timestamp >= 1396396800 AND timestamp <= 1396483200
ALLOW FILTERING; 😱😱


Bad Request: PRIMARY KEY part timestamp cannot be restricted (preceding part qualifier is either not restricted or by a non-EQ relation)
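The rule behind this error can be modelled in a few lines of Python: a clustering column may be restricted only if every clustering column before it is restricted with an equality. This is a simplified sketch of the rule as stated in the error message for the Cassandra 2.0 line the talk targets. It ignores IN clauses, secondary indexes, and ALLOW FILTERING, and `check_restrictions` is a hypothetical helper, not driver or server code.

```python
CLUSTERING_COLUMNS = ["username", "family", "qualifier", "timestamp"]


def check_restrictions(restrictions, clustering=CLUSTERING_COLUMNS):
    """Return the first clustering column that may not be restricted, or None.

    restrictions maps a column name to "EQ" or "RANGE"; absent means unrestricted.
    """
    prefix_is_eq = True
    for column in clustering:
        kind = restrictions.get(column)
        if kind is not None and not prefix_is_eq:
            return column
        if kind != "EQ":
            # Unrestricted or range-restricted: nothing after this may be restricted.
            prefix_is_eq = False
    return None


# The failing query: qualifier is skipped, timestamp has a range.
bad = check_restrictions({"username": "EQ", "family": "EQ", "timestamp": "RANGE"})
# Pinning the qualifier first makes the timestamp range legal.
good = check_restrictions(
    {"username": "EQ", "family": "EQ", "qualifier": "EQ", "timestamp": "RANGE"}
)
```

A check like this could run on the client before a query is sent, so the user gets a warning instead of a server-side error.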

Queries that do not map well to CQL

• Break up into multiple CQL queries
  – Hooray for Session#executeAsync!
• Filter on the client
  – Potentially very expensive, but functional
  – Provide warning to user
• Educate users about table layout
  – Layout in previous example is terrible for that query
• Most issues related to “time” dimension
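The first two strategies can be sketched in Python. The sketch assumes the caller already knows which qualifiers to ask for (for example from an earlier query), so that each statement pins the qualifier and the timestamp range becomes legal. `per_qualifier_queries` and `filter_on_client` are hypothetical helpers. The `%s` markers are bind placeholders for whichever driver submits the statements, for instance asynchronously, as the slide suggests.

```python
def per_qualifier_queries(table, key, family, qualifiers, t_start, t_end):
    """Return one (cql, params) pair per qualifier.

    key is (userid, username). table must be a trusted identifier because it
    is formatted into the statement rather than bound as a parameter.
    """
    cql = (
        "SELECT qualifier, value FROM {table} "
        "WHERE userid=%s AND username=%s AND family=%s AND qualifier=%s "
        "AND timestamp >= %s AND timestamp <= %s;"
    ).format(table=table)
    userid, username = key
    return [
        (cql, (userid, username, family, qualifier, t_start, t_end))
        for qualifier in qualifiers
    ]


def filter_on_client(rows, t_start, t_end):
    """Fallback: keep only rows whose timestamp falls in [t_start, t_end]."""
    return [row for row in rows if t_start <= row["timestamp"] <= t_end]


queries = per_qualifier_queries(
    "music", (0xFA, "bob"), "songs",
    ["help", "let it be"], 1396396800, 1396483200,
)
```

Splitting trades one expensive filtered scan for several cheap, well-formed queries. The client-side filter remains available when the qualifiers are not known in advance.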

MapReduce

• Wrote new InputFormat, OutputFormat
• Hadoop 2.x
• Multiple C* queries per RecordReader
• Does not use Thrift

Project status and next steps

Initial release in ~ 2 weeks

• Cassandra as part of the Bento Box
• Cassandra working in KijiSchema, KijiMR

Support in the coming months

• Cassandra integration with KijiREST, KijiScoring, KijiExpress, etc.

• Expose Cassandra-specific features to users
  – Variable consistency levels
  – Load-balancing policies
  – Diagnostics (e.g., route tracing)
• Kiji support in CQLSH
  – Decode Avro values

Thanks to Cassandra community

• Great help on mailing lists for users, dev, java driver

• Webinars, meetups, C* Summit all available online

• Free training from DataStax
• Very easy to get up-to-speed

Try it now -- Kiji Bento Box

• Latest compatible versions of all components
• Hadoop, ZooKeeper, HBase
• Cassandra in ~2 weeks

www.kiji.org/getstarted

Mention hiring?

KijiSchema

• Schemas and data serialization
• Complex data types (e.g., nested maps)
• Schema evolution
• Metadata
• Composite row keys
• Transparent paging
• Data-definition language, REPL


Schema support

Support for complex schemas with Avro

record UserLog {
  long timestamp;
  int user_id;
  string url;
}

KijiSchema allows schema versioning

Column name translation

• “family:qualifier” -> “A:B”
• Saves disk space
• Improves performance
• User-facing tools translate names
• Possible to turn this off

All data in family “songs” for user “bob” ➔

SELECT qualifier, value FROM music
WHERE userid=0xfa AND username='bob' AND family='songs';
