Diagnosing Problems in Production: Cassandra Summit 2014

Diagnosing Problems in Production
Jon Haddad, Technical Evangelist, DataStax
Blake Eggleston, Software Developer, DataStax
©2013 DataStax Confidential. Do not distribute without consent.

Upload: jon-haddad

Post on 29-Nov-2014


Category:

Technology



DESCRIPTION

At the 2014 Cassandra Summit we covered how to make sure your production experience with Cassandra is top-notch: which tools should be put in place beforehand, and which tools you need to identify problems in real time. Presented by Jon Haddad & Blake Eggleston.

TRANSCRIPT

Page 1: Diagnosing Problems in Production: Cassandra Summit 2014


Jon Haddad, Technical Evangelist, DataStax
Blake Eggleston, Software Developer, DataStax

Diagnosing Problems in Production


Page 2: Diagnosing Problems in Production: Cassandra Summit 2014

Preventative Measures
• OpsCenter
• Metrics Integration
• Munin
• Monit
• Nagios / Icinga
• Graphite / StatsD (application level)
• Variety of 3rd party monitoring services

Page 3: Diagnosing Problems in Production: Cassandra Summit 2014

Narrow Down the Problem
• Weird consistency issues - NTP?
  • Last write wins - if servers have different time, which is the last write?
• Problems with Streaming / Repair - version conflicts
• Cleanup after you add nodes (reclaim disk space)
• Slow queries
  • Compaction
  • Histograms
  • Tracing
• Nodes flapping / failing
  • Check OpsCenter
  • Dig into system metrics
  • JVM GC issues
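The "last write wins" point can be made concrete with a small sketch. It is a toy model, not Cassandra's actual reconciliation code: the `Write`, `stamp`, and `resolve` names are hypothetical, and it only assumes that each write carries a timestamp taken from the writing server's clock and that the higher timestamp wins.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp_us: int  # microseconds, as reported by the writer's (possibly skewed) clock


def stamp(value: str, true_time_us: int, clock_offset_us: int) -> Write:
    """Timestamp a write using a clock that is off by clock_offset_us."""
    return Write(value, true_time_us + clock_offset_us)


def resolve(a: Write, b: Write) -> Write:
    """Last write wins: keep the write with the higher timestamp.

    Ties go to b here; that tie-break is an arbitrary choice for this sketch.
    """
    return a if a.timestamp_us > b.timestamp_us else b


# Server A's clock is accurate; server B's clock runs 2 seconds behind.
first = stamp("old", true_time_us=10_000_000, clock_offset_us=0)
second = stamp("new", true_time_us=11_000_000, clock_offset_us=-2_000_000)

winner = resolve(first, second)
print(winner.value)
```

Although "new" was written one second after "old" in real time, the skewed clock gives it the lower timestamp, so the older value wins. This is the kind of weird consistency issue that keeping NTP in sync is meant to prevent.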

Page 4: Diagnosing Problems in Production: Cassandra Summit 2014

Compaction
• Compaction merges SSTables
• Too much compaction?
  • OpsCenter provides insight into compaction cluster wide
  • nodetool
    • compactionhistory
    • getcompactionthroughput
• Leveled vs Size Tiered
  • Leveled on SSD + Read Heavy
  • Size tiered on spinning rust
  • Size tiered is great for write heavy time series workloads
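As a rough illustration of "compaction merges SSTables", the sketch below models an SSTable as a plain dictionary and merges several of them into one. This is a toy model under stated assumptions, not how Cassandra implements compaction: the `SSTable` alias and the `compact` function are hypothetical, and the only rule applied is that the newest timestamp for each key survives.

```python
from typing import Dict, Iterable, Tuple

# Toy SSTable: partition key -> (write timestamp, value).
SSTable = Dict[str, Tuple[int, str]]


def compact(sstables: Iterable[SSTable]) -> SSTable:
    """Merge several toy SSTables into one, keeping the newest version of each key."""
    merged: SSTable = {}
    for table in sstables:
        for key, (timestamp, value) in table.items():
            if key not in merged or timestamp > merged[key][0]:
                merged[key] = (timestamp, value)
    # Real SSTables are sorted on disk; sort the keys to mirror that.
    return dict(sorted(merged.items()))


older = {"a": (1, "a1"), "b": (1, "b1")}
newer = {"c": (2, "c1"), "b": (2, "b2")}

compacted = compact([older, newer])
print(compacted)
```

Before the merge, a read for key "b" would have to consult both tables; afterwards one table holds the single current version. That is why the choice of compaction strategy matters for read-heavy workloads.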

Page 5: Diagnosing Problems in Production: Cassandra Summit 2014

System Utilities
• iostat
  • disk level statistics
• htop
  • process overview
• iftop & netstat
  • network utilities
• dstat
  • all the above in 1 tool
• strace
  • …for the hardcore

Page 6: Diagnosing Problems in Production: Cassandra Summit 2014

Histograms
• proxyhistograms
  • High level read and write times
  • Includes network latency
• cfhistograms <keyspace> <table>
  • reports stats for single table on a single node
  • Used to identify tables with performance problems
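To show how a latency histogram helps single out a slow table, here is a minimal sketch that reads a percentile off bucketed counts. It does not parse real nodetool output: the `percentile` helper and the sample buckets are made up for illustration, and each bucket is assumed to be a pair of (upper bound in microseconds, number of requests).

```python
from typing import List, Tuple


def percentile(buckets: List[Tuple[int, int]], pct: float) -> int:
    """Return the upper bound of the bucket that contains the pct-th percentile."""
    total = sum(count for _, count in buckets)
    if total == 0:
        raise ValueError("histogram is empty")
    threshold = total * pct / 100.0
    ordered = sorted(buckets)
    seen = 0
    for upper_bound, count in ordered:
        seen += count
        if seen >= threshold:
            return upper_bound
    return ordered[-1][0]


# Made-up read latencies for one table: (bucket upper bound in us, request count).
read_latency = [(100, 50), (200, 30), (500, 15), (5000, 5)]

for pct in (50, 95, 99):
    print(pct, percentile(read_latency, pct))
```

In this invented data the median looks healthy while the 99th percentile is an order of magnitude worse, which is the sort of tail a per-table histogram exposes and an average hides.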

Page 7: Diagnosing Problems in Production: Cassandra Summit 2014

Query Tracing

Page 8: Diagnosing Problems in Production: Cassandra Summit 2014

JVM GC Overview
• What is garbage collection?
  • Manual vs automatic memory management
• Generational garbage collection (ParNew & CMS)
  • New Generation
  • Old Generation

Page 9: Diagnosing Problems in Production: Cassandra Summit 2014

New Generation
• New objects are created in the new gen
• Minor GC
  • Occurs when new gen fills up
  • Stop the world
  • Dead objects are removed
  • Live objects are promoted into old gen
  • Removing objects is fast, promoting objects is slow
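The minor GC cycle described on this slide can be sketched as a toy simulation. It is a deliberate simplification and not how the JVM works internally: the `ToyHeap` class is hypothetical, capacity is counted in objects rather than bytes, and survivor spaces are ignored so that every live object is promoted immediately.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Obj:
    name: str
    live: bool  # whether something still references this object at GC time


@dataclass
class ToyHeap:
    new_gen_capacity: int
    new_gen: List[Obj] = field(default_factory=list)
    old_gen: List[Obj] = field(default_factory=list)
    minor_gcs: int = 0

    def allocate(self, obj: Obj) -> None:
        """New objects go into the new gen; a full new gen triggers a minor GC first."""
        if len(self.new_gen) >= self.new_gen_capacity:
            self.minor_gc()
        self.new_gen.append(obj)

    def minor_gc(self) -> None:
        """Drop dead objects (cheap) and promote live ones to the old gen (costly)."""
        self.old_gen.extend(obj for obj in self.new_gen if obj.live)
        self.new_gen.clear()
        self.minor_gcs += 1


heap = ToyHeap(new_gen_capacity=4)
for i in range(10):
    # Every fifth object is long-lived; the rest are garbage by the next collection.
    heap.allocate(Obj(name=f"obj{i}", live=(i % 5 == 0)))

print(heap.minor_gcs, [obj.name for obj in heap.old_gen], len(heap.new_gen))
```

The old gen only grows by the objects that were still live when a minor GC ran, so the fewer objects survive a collection, the less promotion work each pause has to do.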

Page 10: Diagnosing Problems in Production: Cassandra Summit 2014

Old Generation
• Objects are promoted to old gen from new gen
• Major GC
  • Occurs when old gen fills up to some percentage
  • Mostly concurrent
  • 2 short stop the world pauses
• Full GC
  • Occurs when old gen fills up or objects can’t be promoted
  • Stop the world
  • Collects all generations
  • These are bad!

Page 11: Diagnosing Problems in Production: Cassandra Summit 2014

GC Profiling
• OpsCenter gc stats
  • Look for correlations between gc spikes and read/write latency
• Cassandra GC Logging
  • Can be activated in cassandra-env.sh
• jstat
  • prints gc activity
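As one way to work with jstat output, the sketch below parses the table printed by `jstat -gcutil <pid> <interval>` into dictionaries keyed by column name. The `parse_gcutil` helper is hypothetical and the sample text is made up (laid out like Java 7 output). Column names differ between JVM versions, so the parser reads them from the header row and assumes every cell is numeric.

```python
from typing import Dict, List


def parse_gcutil(output: str) -> List[Dict[str, float]]:
    """Turn jstat -gcutil style text into one dict per sample, keyed by column name."""
    rows = [line.split() for line in output.strip().splitlines() if line.strip()]
    header, samples = rows[0], rows[1:]
    return [dict(zip(header, (float(cell) for cell in sample))) for sample in samples]


# Made-up sample; percentages are occupancy, YGC/FGC are collection counts.
SAMPLE = """
  S0     S1     E      O      P     YGC     YGCT    FGC    FGCT     GCT
  0.00  45.12  60.30  72.10  59.80    120    3.450     2    0.900    4.350
  0.00  45.12  88.75  72.10  59.80    120    3.450     2    0.900    4.350
 12.40   0.00   5.20  73.00  59.80    121    3.480     2    0.900    4.380
"""

samples = parse_gcutil(SAMPLE)
minor_gcs = samples[-1]["YGC"] - samples[0]["YGC"]
full_gcs = samples[-1]["FGC"] - samples[0]["FGC"]
print(minor_gcs, full_gcs, samples[-1]["O"])
```

Tracking how the full GC count and old gen occupancy change between samples gives a quick way to line GC activity up against latency spikes seen elsewhere.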

Page 12: Diagnosing Problems in Production: Cassandra Summit 2014

GC Profiling
• What to look out for:
  • Long, multi-second pauses
    • Caused by Full GCs. Old gen is filling up faster than the concurrent GC can keep up with it. Typically means garbage is being promoted out of the new gen too soon
  • Long minor GC
    • Many of the objects in the new gen are being promoted to the old gen.
    • Most commonly caused by new gen being too big
    • Sometimes caused by objects being promoted prematurely

Page 13: Diagnosing Problems in Production: Cassandra Summit 2014


Jon: @rustyrazorblade

Blake: @blakeeggleston