Cassandra in xPatterns


Upload: planet-cassandra

Post on 15-Jan-2015


Page 1: Cassandra in xPatterns

1 Atigeo Confidential

Cassandra in xPatterns

Cassandra Day Seattle, July 2014

David Talby, SVP Engineering

Claudiu Barbura, Sr. Director Engineering

Page 2: Cassandra in xPatterns

Agenda

• xPatterns Architecture

• xPatterns dashboard application (Demo)

• Export to NoSql API & GUI (Demo)

• Data model optimization

• Publishing from HDFS/Hive/Shark to Cassandra

• Generated REST APIs: instrumentation, throttling & auto-retries

• Geo-Replication: cross-data-center replication, encryption & failover

• Lessons learned from 0.6 through 2.0.6

Page 3: Cassandra in xPatterns

Page 4: Cassandra in xPatterns

Page 5: Cassandra in xPatterns

Export to NoSql API

• Datasets in the warehouse need to be exposed to high-throughput, low-latency real-time APIs. Each application requires extra processing on top of the core datasets, so additional transformations are executed to build data marts inside the warehouse

• The exporter tool builds an efficient data model and exports data from a Shark/Hive table to a Cassandra Column Family through a custom Spark job with configurable throughput (configurable Spark processors against a Cassandra ring). An instrumentation dashboard is embedded; logs, progress and instrumentation events are pushed through SSE

• Data modeling is driven by the read access patterns provided by an application engineer building dashboards and visualizations: lookup key, columns (record fields to read), paging, sorting, filtering

• The end result of a job run is a REST API endpoint (instrumented, monitored, resilient, geo-replicated) that uses the underlying generated Cassandra data model and feeds the data in the dashboards

• Configuration API provided for creating export jobs and executing them (ad-hoc or scheduled).
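
As an illustration of how a read access pattern could drive the generated data model, here is a minimal sketch that turns a lookup key, column list, sort field and filter fields into a CQL table definition. It is not the exporter's actual code: the `AccessPattern` class, the `to_cql` function, the `export` keyspace and the all-`text` column types are assumptions made for this example, and sorting by the last clustering column only holds when the filter fields are restricted by equality.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AccessPattern:
    """Hypothetical description of how a dashboard reads one exported table."""
    table: str
    lookup_key: List[str]                 # fields the API looks records up by
    columns: List[str]                    # record fields returned to the caller
    sort_by: Optional[str] = None         # field used for sorting and paging
    descending: bool = False
    filter_by: List[str] = field(default_factory=list)  # equality-filter fields


def to_cql(p: AccessPattern, keyspace: str = "export") -> str:
    """Map lookup key -> partition key, filter/sort fields -> clustering columns."""
    clustering = list(p.filter_by)
    if p.sort_by and p.sort_by not in clustering:
        clustering.append(p.sort_by)

    # Key fields must also be declared as columns; keep first-seen order.
    names: List[str] = []
    for name in p.lookup_key + clustering + p.columns:
        if name not in names:
            names.append(name)

    # Assumption: every column is typed as text to keep the sketch short.
    col_defs = ", ".join(f"{n} text" for n in names)
    key_parts = ["(" + ", ".join(p.lookup_key) + ")"] + clustering
    cql = (f"CREATE TABLE {keyspace}.{p.table} ({col_defs}, "
           f"PRIMARY KEY ({', '.join(key_parts)}))")

    if clustering:
        order = ", ".join(
            f"{c} {'DESC' if (c == p.sort_by and p.descending) else 'ASC'}"
            for c in clustering
        )
        cql += f" WITH CLUSTERING ORDER BY ({order})"
    return cql


if __name__ == "__main__":
    pattern = AccessPattern(
        table="referrals",
        lookup_key=["provider_id"],
        columns=["patient_count"],
        sort_by="claim_count",
        descending=True,
    )
    print(to_cql(pattern))
```

The point of the sketch is the direction of the design: the table is derived from the queries the dashboard will issue, so each API call maps to a single partition read.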

Page 6: Cassandra in xPatterns

Page 7: Cassandra in xPatterns

Mesos/Spark cluster

Page 8: Cassandra in xPatterns

Cassandra multi DC ring – write latency

Page 9: Cassandra in xPatterns

Nagios monitoring

Page 10: Cassandra in xPatterns

Page 11: Cassandra in xPatterns

Referral Provider Network

• One of the many applications that we built for our largest healthcare customers using the xPatterns APIs and tools on the new upgraded infrastructure: ELT Pipeline, Jaws, Export to NoSql API. The dashboard for the RPN application was built using D3.js and Angular against the generic API published by the export tool.

• The application allows for building a graph of downstream and upstream referred and referring providers, grouped by specialty, with computed aggregates like patient counts, claim counts and total charged amounts. RPN is used for both fraud detection and for aiding a clinic buying decision, by following the busiest graph paths.

• The dataset behind the app consists of 8 billion medical records, from which we extracted 1.7 million providers (Shark warehouse) and built 53 million relationships in the graph (persisted in Cassandra)

• While we demo the graph building we will also look at the Graphite instrumentation dashboard for analyzing the runtime performance of the geo-replicated Cassandra read operations (latency in the 20-50ms range)
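
The idea of "following the busiest graph paths" can be sketched in a few lines. The example below is an assumption-laden illustration, not the RPN implementation: the edge tuples, the `build_graph` and `busiest_downstream_path` helpers, the greedy strategy and the use of patient counts as the weight are all hypothetical, and the sample data is made up.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (referring provider, referred provider, patient count); made-up sample shape.
Edge = Tuple[str, str, int]


def build_graph(edges: List[Edge]) -> Dict[str, Dict[str, int]]:
    """Aggregate patient counts per (referring, referred) provider pair."""
    graph: Dict[str, Dict[str, int]] = defaultdict(dict)
    for src, dst, patients in edges:
        graph[src][dst] = graph[src].get(dst, 0) + patients
    return graph


def busiest_downstream_path(graph: Dict[str, Dict[str, int]],
                            start: str, max_hops: int = 5) -> List[str]:
    """Greedily follow the downstream referral with the highest patient count."""
    path = [start]
    visited = {start}
    current = start
    for _ in range(max_hops):
        # Skip providers already on the path so referral cycles terminate.
        candidates = {d: n for d, n in graph.get(current, {}).items()
                      if d not in visited}
        if not candidates:
            break
        current = max(candidates, key=candidates.get)
        path.append(current)
        visited.add(current)
    return path


if __name__ == "__main__":
    edges = [("A", "B", 10), ("A", "C", 25), ("C", "D", 5), ("C", "A", 50)]
    print(busiest_downstream_path(build_graph(edges), "A"))
```

The upstream direction (referring providers) would work the same way on a graph built with the edge endpoints swapped.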

Page 12: Cassandra in xPatterns

Page 13: Cassandra in xPatterns

Graphite – Cassandra multi DC ring

Page 14: Cassandra in xPatterns

VPC-to-VPC IPSEC Tunnel

Page 15: Cassandra in xPatterns

Lessons learned 0.6 - 2.0.6

• NTP: synchronize ALL clocks (servers and clients)

• Reduce the number of CFs (avoid OOM … memtable_total_space_in_mb)

• Rows not too skinny and not too wide (avoid OOM)
  o Less memory pressure during high-throughput writes
  o Reduced network I/O, fewer rows, more column slices
  o Key cache & bloom filter index size affects perf
  o Efficient compaction, avoid hot spots

• Custom serialization and dynamic columns for maximum perf gain (40%)

• Do not drop CFs before emptying them (truncate/compact first)

• Monitoring, instrumentation, automatic restarts

• ConsistencyLevel: ONE is best … for our use cases

• Key cache, Snappy (LZ4) compression, vnodes
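
One common way to keep rows "not too skinny and not too wide" is to split a logical row into a fixed number of buckets. The sketch below shows that general technique only; the slides do not say that xPatterns bucketed its rows this way, and the `bucketed_row_key` and `read_row_keys` helpers, the `:` key separator and the default bucket count are assumptions for illustration.

```python
import zlib
from collections import Counter
from typing import List


def bucketed_row_key(lookup_key: str, column_name: str, buckets: int = 16) -> str:
    """Pick the physical row for a column so one logical row spans `buckets` rows."""
    # crc32 is stable across processes, unlike Python's built-in hash() for str.
    bucket = zlib.crc32(column_name.encode("utf-8")) % buckets
    return f"{lookup_key}:{bucket}"


def read_row_keys(lookup_key: str, buckets: int = 16) -> List[str]:
    """All physical rows a reader must query to reassemble the logical row."""
    return [f"{lookup_key}:{b}" for b in range(buckets)]


if __name__ == "__main__":
    # Spread 1000 hypothetical columns of one logical row over 8 physical rows.
    widths = Counter(
        bucketed_row_key("provider-42", f"referral-{i}", buckets=8)
        for i in range(1000)
    )
    for key in read_row_keys("provider-42", buckets=8):
        print(key, widths.get(key, 0))
```

The bucket count is the tuning knob: more buckets mean narrower rows and less memory pressure per row, at the cost of more rows to read back for a single lookup.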

Page 17: Cassandra in xPatterns

© 2013 Atigeo, LLC. All rights reserved. Atigeo and the xPatterns logo are trademarks of Atigeo. The information herein is for informational purposes only and represents the current view of Atigeo as of the date of this presentation. Because Atigeo must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Atigeo, and Atigeo cannot guarantee the accuracy of any information provided after the date of this presentation. ATIGEO MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.