C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

Real Time Analytics with Cassandra, Hive, and Solr

Aaron Stannard, Founder & CEO of MarkedUp

Powerful analytics tools for native apps

Understand your audience.

Gain valuable data on your users.

Monitor your app’s health.

Log errors and crashes remotely.

Drive more sales.

Better data = more revenue.

Do we really need real-time analytics?

Real time analytics isn’t inherently superior or necessary.

Building your own real-time analytics service with Cassandra and DataStax Enterprise

Cassandra Setup on EC2

Write Strategy

Read Strategy

Analytics Schema Strategy

•  All row keys should be predictable (not always possible)

•  Utilize physical sortability of columns

•  Use predictably sortable data types for column names (integers, dates)

•  Learn to love composite keys

•  Batch mutations are your friend

•  Use distributed counters for real-time metrics

•  Use TTL for automatic data expiration (if necessary)
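The schema guidelines above can be sketched in plain Python without a Cassandra cluster. This is a minimal illustration, not the schema MarkedUp used: the `app_id:metric:YYYYMMDD` row key layout, the hour-of-day column names, and the `CounterStore` class are all assumptions made for the example. It shows a predictable composite row key, integer column names that sort naturally, and counter-style increments.

```python
from collections import defaultdict
from datetime import datetime, timezone


def row_key(app_id: str, metric: str, ts: datetime) -> str:
    """Composite, predictable row key (hypothetical layout).

    Any reader who knows the app, the metric, and the day can compute
    the key directly instead of scanning for it.
    """
    return f"{app_id}:{metric}:{ts:%Y%m%d}"


def column_name(ts: datetime) -> int:
    """Integer column name (hour of day) so columns sort predictably."""
    return ts.hour


class CounterStore:
    """In-memory stand-in for a counter column family (illustration only)."""

    def __init__(self):
        # row key -> {column name -> counter value}
        self.rows = defaultdict(lambda: defaultdict(int))

    def increment(self, app_id, metric, ts, amount=1):
        self.rows[row_key(app_id, metric, ts)][column_name(ts)] += amount

    def read_day(self, app_id, metric, day):
        """Return one day's counters in column order."""
        row = self.rows.get(row_key(app_id, metric, day), {})
        return sorted(row.items())


store = CounterStore()
event_time = datetime(2013, 6, 11, 14, 5, tzinfo=timezone.utc)
store.increment("app1", "sessions", event_time)
store.increment("app1", "sessions", event_time.replace(hour=9))
store.increment("app1", "sessions", event_time)
print(store.read_day("app1", "sessions", event_time))
```

In a real cluster the sort order comes from the column comparator rather than from `sorted`, and the increments would be counter writes; the point of the sketch is only that a whole day of hourly metrics is one row lookup with a key the reader can compute.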

Time Series Schema 0: All Knowns

Time Series Schema 1: Bounded Number of Unknowns

Time Series Schema 2: Unbounded Number of Unknowns

Schema Tips

Adding Hive and Hadoop to the Mix

Mo’ data, mo’ problems

When is Hadoop necessary?

•  Large volumes of data (100GB+)

•  Queries require retrospective / historical analysis

•  Need consistent results

•  Need to perform multi-stage analysis

•  Speed isn’t a concern (Hadoop is sloooooooooow)

Hadoop on easy mode: Hive

•  SQL abstraction on top of Hadoop (more familiar)

•  Easier to deploy and test

•  Simplifies data warehousing

•  Easy to automatically import data from Cassandra

•  DSE eliminates need for HDFS

C* to Hive

Hive Syntax

Query: count the number of items where “key” is greater than 100

RDBMS> select key, count(1) from kv1 where key > 100 group by key;

Hive> select key, count(1) from kv1 where key > 100 group by key;
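For readers less familiar with SQL, the result of the query above can be sketched in a few lines of Python. This only illustrates what the query computes, not how Hive executes it; the `count_by_key` helper and the sample rows are hypothetical.

```python
from collections import Counter


def count_by_key(rows, threshold=100):
    """Count rows per key, keeping only keys greater than the threshold.

    Mirrors: select key, count(1) from kv1 where key > 100 group by key;
    `rows` is an iterable of (key, value) pairs standing in for table kv1.
    """
    return Counter(key for key, _value in rows if key > threshold)


sample_rows = [(99, "a"), (101, "b"), (101, "c"), (250, "d")]
for key, count in sorted(count_by_key(sample_rows).items()):
    print(key, count)
```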

Hive Tips and Tricks

•  Don’t write data from Hive back to a hot Cassandra column family

•  If writing data from Hive to Cassandra, use dedicated column families

•  You can write to multiple places on a single Hive read (table, CSV file, etc.)

•  Use sampling to test Hive queries on scaled-down data sets

How do you count millions of distinct items in real-time?

•  Solr: Lucene-based indexing engine

•  Part of Apache Foundation

•  Full-text search

•  Faceted search

•  Distributed

•  Integrates well with Cassandra

Solr Index Setup

Solr Search

Questions or Comments?

[email protected]

https://markedup.com/
