C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

Real Time Analytics with Cassandra, Hive, and Solr

Upload: planet-cassandra

Post on 30-Nov-2014


Category:

Technology



DESCRIPTION

Building analytics systems is an increasingly common requirement for BI teams inside companies both big and small, and it becomes even more challenging when analytic results have to be produced in real time. In this presentation the team from MarkedUp Analytics will show you techniques for leveraging Cassandra, Hadoop, and Hive to build a manageable and scalable analytics system capable of handling a wide range of business cases and needs.

TRANSCRIPT

Page 1: C* Summit 2013: High Throughput Analytics with Cassandra by Aaron Stannard

Real Time Analytics with Cassandra, Hive, and Solr

Page 2:

Real Time Analytics with Cassandra, Hive, and Solr Aaron Stannard, Founder & CEO of MarkedUp

Page 3:

Powerful analytics tools for native apps

Understand your audience.

Gain valuable data on your users.

Monitor your app’s health.

Log errors and crashes remotely.

Drive more sales.

Better data = more revenue.

Page 4:
Page 5:

Do we really need real-time analytics?

Page 6:
Page 7:

Real-time analytics isn’t inherently superior or necessary.

Page 8:
Page 9:

Building your own real-time analytics service with Cassandra and DataStax Enterprise

Page 10:

Cassandra Setup on EC2

Page 11:

Write Strategy

Page 12:

Read Strategy

Page 13:

Analytics Schema Strategy

•  All row keys should be predictable (not always possible)

•  Utilize physical sortability of columns

•  Use predictably sortable data types for column names (integers, dates)

•  Learn to love composite keys

•  Batch mutations are your friend

•  Use distributed counters for real-time metrics

•  Use TTL to automate data expiration (if necessary)
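
The bullets above can be sketched in code. What follows is a minimal Python illustration, not the schema used at MarkedUp: the table names `metrics_by_day` and `raw_events`, the columns `row_key`, `minute`, `hits`, and `payload`, the key layout, and the choice of CQL as the write interface are all assumptions made for this example. It shows a predictable row key, an integer column name that sorts naturally, a counter batch, and a TTL on raw data.

```python
from datetime import datetime, timezone


def row_key(app_id: str, metric: str, ts: datetime) -> str:
    """Predictable row key: computable from the app, metric, and day alone."""
    return f"{app_id}:{metric}:{ts:%Y%m%d}"


def column_name(ts: datetime) -> int:
    """Sortable column name: minute of the day, from 0 to 1439."""
    return ts.hour * 60 + ts.minute


def counter_batch(app_id: str, events) -> str:
    """Build one CQL counter batch from (metric, timestamp) pairs.

    Values are inlined for readability only; real code should use
    prepared statements instead of string formatting.
    """
    lines = ["BEGIN COUNTER BATCH"]
    for metric, ts in events:
        lines.append(
            "  UPDATE metrics_by_day SET hits = hits + 1 "
            f"WHERE row_key = '{row_key(app_id, metric, ts)}' "
            f"AND minute = {column_name(ts)};"
        )
    lines.append("APPLY BATCH;")
    return "\n".join(lines)


def ttl_insert(ttl_seconds: int) -> str:
    """Parameterized insert whose data expires after ttl_seconds."""
    return (
        "INSERT INTO raw_events (row_key, minute, payload) "
        f"VALUES (?, ?, ?) USING TTL {ttl_seconds};"
    )


if __name__ == "__main__":
    now = datetime(2013, 6, 11, 9, 30, tzinfo=timezone.utc)
    print(counter_batch("app42", [("sessions", now), ("crashes", now)]))
    print(ttl_insert(30 * 24 * 3600))
```

Because the row key is derived only from values the reader already knows, a dashboard query for a given app, metric, and day can go straight to one row and slice a contiguous range of minutes.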

 

Page 14:

Time Series Schema 0: All Knowns

Page 15:

Time Series Schema 1: Bounded Number of Unknowns

Page 16:

Time Series Schema 2: Unbounded Number of Unknowns

Page 17:

Schema Tips

Page 18:

Adding Hive and Hadoop to the Mix

Mo’ data, mo’ problems

Page 19:

When is Hadoop necessary?

•  Large volumes of data (100GB+)

•  Queries require retrospective / historical analysis

•  Need consistent results

•  Need to perform multi-stage analysis

•  Speed isn’t a concern (Hadoop is sloooooooooow)

Page 20:

Hadoop on easy mode: Hive

•  SQL abstraction on top of Hadoop (more familiar)

•  Easier to deploy and test

•  Simplifies data warehousing

•  Easy to automatically import data from Cassandra

•  DSE eliminates need for HDFS

Page 21:

C* to Hive

Page 22:

Hive Syntax

Query: count the number of items where “key” is greater than 100

RDBMS> select key, count(1) from kv1 where key > 100 group by key;

Hive> select key, count(1) from kv1 where key > 100 group by key;
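
To make the "same syntax" point concrete, here is a small sketch that runs the slide's query unchanged against an ordinary relational engine. It uses Python's built-in `sqlite3` module as a stand-in RDBMS; the `kv1` column types and the sample rows are assumptions, since the slide shows only the query text.

```python
import sqlite3

# The query exactly as it appears on the slide.
QUERY = "select key, count(1) from kv1 where key > 100 group by key;"


def run_query(rows):
    """Load (key, value) rows into an in-memory kv1 table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("create table kv1 (key integer, value text)")
        conn.executemany("insert into kv1 (key, value) values (?, ?)", rows)
        # Sort explicitly: GROUP BY alone does not guarantee output order.
        return sorted(conn.execute(QUERY).fetchall())
    finally:
        conn.close()


if __name__ == "__main__":
    sample = [(50, "a"), (150, "b"), (150, "c"), (200, "d")]
    print(run_query(sample))  # [(150, 2), (200, 1)]
```

The same statement submitted to Hive would be compiled into Hadoop jobs rather than executed directly, which is where the difference in latency comes from.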

Page 23:

Hive Tips and Tricks

•  Don’t write data from Hive back to a hot Cassandra column family

•  If writing data from Hive to Cassandra, use dedicated column families

•  You can write to multiple places on a single Hive read (table, CSV file, etc.)

•  Use sampling to test Hive queries on scaled-down data sets

Page 24:

How do you count millions of distinct items in real-time?

Page 25:

•  Solr: Lucene-based indexing engine

•  Part of Apache Foundation

•  Full-text search

•  Faceted search

•  Distributed

•  Integrates well with Cassandra
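
One way faceted search can answer a distinct-count question is to facet on the field of interest and count the values that come back. The sketch below is only an illustration of that idea and is not necessarily the setup used in the talk: the host, the core name `events`, and the field `user_id` are hypothetical, and requesting every facet value with `facet.limit=-1` may be too expensive on very large indexes.

```python
from urllib.parse import urlencode


def facet_count_url(base_url: str, core: str, field: str, query: str = "*:*") -> str:
    """Build a Solr select URL that returns facet counts and no documents."""
    params = {
        "q": query,
        "rows": 0,            # only the facet section is needed
        "wt": "json",
        "facet": "true",
        "facet.field": field,
        "facet.limit": -1,    # return every distinct value
        "facet.mincount": 1,  # skip values with no matching documents
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


def distinct_from_facets(facet_list) -> int:
    """Count distinct values in Solr's flat [value, count, value, count, ...] list."""
    return len(facet_list) // 2


if __name__ == "__main__":
    # Hypothetical host, core, and field names.
    print(facet_count_url("http://localhost:8983/solr", "events", "user_id"))
    print(distinct_from_facets(["alice", 3, "bob", 1]))  # 2
```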

Page 26:

Solr Index Setup

Page 27:

Solr Search

Page 28:

Questions or Comments?

[email protected]    https://markedup.com/