C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

Post on 30-Nov-2014

DESCRIPTION

Building analytics systems is an increasingly common requirement for BI teams inside companies both big and small, and a feat made even more challenging when analytic results have to be produced in real-time. In this presentation the team from MarkedUp Analytics will show you techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.

TRANSCRIPT

Real Time Analytics with Cassandra, Hive, and Solr

Aaron Stannard, Founder & CEO of MarkedUp

Powerful analytics tools for native apps

Understand your audience.

Gain valuable data on your users.

Monitor your app’s health.

Log errors and crashes remotely.

Drive more sales.

Better data = more revenue.

Do we really need real-time analytics?

Real time analytics isn’t inherently superior or necessary.

Building your own real-time analytics service with Cassandra and DataStax Enterprise

Cassandra Setup on EC2

Write Strategy

Read Strategy

Analytics Schema Strategy

•  All row keys should be predictable (not always possible)

•  Utilize physical sortability of columns

•  Use predictably sortable data types for column names (integers, dates)

•  Learn to love composite keys

•  Batch mutations are your friend

•  Use distributed counters for real-time metrics

•  Use TTL for automatic data expiration (if necessary)
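
The first few points above can be sketched in plain Python. This is only an illustration under stated assumptions, not the schema shown in the talk: the key layout (app id, metric name, UTC day), the hour-of-day column names, and the names `row_key`, `column_name`, `record_event`, and `read_day` are all hypothetical, and a nested dictionary stands in for a Cassandra counter column family.

```python
from collections import defaultdict
from datetime import datetime

# In-memory stand-in for a counter column family (an assumption for
# illustration): row key -> {column name -> count}.
counters = defaultdict(lambda: defaultdict(int))


def row_key(app_id: str, metric: str, ts: datetime) -> str:
    """Predictable row key: a reader can rebuild it from app, metric, and day."""
    return f"{app_id}:{metric}:{ts:%Y%m%d}"


def column_name(ts: datetime) -> int:
    """Sortable column name: the hour of the day as an integer (0-23)."""
    return ts.hour


def record_event(app_id: str, metric: str, ts: datetime) -> None:
    """Increment the counter for the hour bucket that contains ts."""
    counters[row_key(app_id, metric, ts)][column_name(ts)] += 1


def read_day(app_id: str, metric: str, day: datetime) -> list:
    """Return one day's (hour, count) pairs in column order."""
    row = counters.get(row_key(app_id, metric, day), {})
    return sorted(row.items())


# Hypothetical events for a hypothetical app id and metric name.
for ts in (
    datetime(2013, 6, 11, 9, 15),
    datetime(2013, 6, 11, 9, 40),
    datetime(2013, 6, 11, 14, 5),
):
    record_event("app-42", "sessions", ts)

result = read_day("app-42", "sessions", datetime(2013, 6, 11))
print(result)  # [(9, 2), (14, 1)]
```

Because the row key can be computed from values the reader already knows, a day's data is a single-row lookup, and integer column names keep the hourly buckets in order without any sorting on the client side of a real cluster.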

 

Time Series Schema 0: All Knowns

Time Series Schema 1: Bounded Number of Unknowns

Time Series Schema 2: Unbounded Number of Unknowns

Schema Tips

Adding Hive and Hadoop to the Mix

Mo’ data, mo’ problems

When is Hadoop necessary?

•  Large volumes of data (100GB+)

•  Queries require retrospective / historical analysis

•  Need consistent results

•  Need to perform multi-stage analysis

•  Speed isn’t a concern (Hadoop is sloooooooooow)

Hadoop on easy mode: Hive

•  SQL abstraction on top of Hadoop (more familiar)

•  Easier to deploy and test

•  Simplifies data warehousing

•  Easy to automatically import data from Cassandra

•  DSE eliminates need for HDFS

C* to Hive

Hive Syntax

Query: count the number of items where “key” is greater than 100

RDBMS> select key, count(1) from kv1 where key > 100 group by key;

Hive> select key, count(1) from kv1 where key > 100 group by key;

Hive Tips and Tricks

•  Don’t write data from Hive back to a hot Cassandra column family

•  If writing data from Hive to Cassandra, use dedicated column families

•  You can write to multiple places on a single Hive read (table, CSV file, etc.)

•  Use sampling to test Hive queries on scaled-down data sets

How do you count millions of distinct items in real-time?

•  Solr: Lucene-based indexing engine

•  Part of the Apache Foundation

•  Full-text search

•  Faceted search

•  Distributed

•  Integrates well with Cassandra
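
One way faceted search can answer a distinct-count question is to facet on a field and count the buckets that come back. The sketch below only builds the request URL with standard Solr faceting parameters (`facet`, `facet.field`, `facet.limit`, `facet.mincount`). It is not the setup from the talk: the host and port, the core name `events`, the fields `user_id` and `app_id`, and the function name `facet_count_url` are assumptions made for illustration.

```python
from urllib.parse import urlencode


def facet_count_url(base_url: str, field: str, filter_query: str) -> str:
    """Build a Solr select URL that facets on `field` and returns no documents."""
    params = [
        ("q", "*:*"),            # match every document
        ("fq", filter_query),    # narrow to the slice being counted
        ("rows", 0),             # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", field),  # one bucket per distinct value
        ("facet.limit", -1),     # do not cap the number of buckets
        ("facet.mincount", 1),   # skip values with no matching documents
        ("wt", "json"),
    ]
    return f"{base_url}/select?{urlencode(params)}"


# Hypothetical core and field names.
url = facet_count_url("http://localhost:8983/solr/events", "user_id", "app_id:42")
print(url)
```

The number of buckets in the `facet_fields` section of the response is then the number of distinct values of the field within the filtered slice.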

Solr Index Setup

Solr Search

Questions or Comments?

aaron@markedup.com    https://markedup.com/