Running Cassandra in AWS
Post on 06-May-2015
DESCRIPTION
For this upcoming meetup, we welcome Patrick Eaton, PhD, Systems Architect at Stackdriver, and Joey Imbasciano, Cloud Platform Engineer at Stackdriver.

What you'll learn at this meetup:
  - Why Stackdriver chose Cassandra over other DB offerings
  - Stackdriver's data pipeline that runs into Cassandra
  - Operating Cassandra
  - Running on AWS
  - Stackdriver's approach to disaster recovery

Patrick and Joey will present their use of Apache Cassandra at Stackdriver, some lessons learned, and technical tips, with a Q&A to end the evening.
1. Running Cassandra in AWS
  Patrick Eaton, PhD - email@example.com - @PatrickREaton
  Joey Imbasciano - firstname.lastname@example.org - @_joeyi
2. Stackdriver at a Glance
  - Stackdriver's hosted intelligent monitoring service helps SaaS companies innovate more by reducing the burden of day-to-day operations
  - Cloud-native and cloud-aware
  - Designed for complex distributed applications
  - Founded by cloud/infrastructure industry veterans (Microsoft, VMware, EMC, Endeca, Red Hat) with deep systems and DevOps expertise
  - Team of ~25, based in downtown Boston

3. Intelligent Monitoring
  Discover
  - Customers' cloud-hosted applications
  - Infrastructure inventory
  - Logical units, like groups/clusters
  - Services, hosted and self-managed
  - Elastic resources
  Monitor
  - Various data sources
  - Provider metrics
  - Host metrics
  - Custom metrics
  - Endpoints
  - Events
  - Health
  - Rich visualizations
  Analyze
  - Integrate data sources
  - Aggregate metrics
  - Report utilization, cost, etc.
  - Detect policy violations
  - Recommend actions

4. Lambda Architecture
  - Typical of modern architectures for on-line applications
  - Formalized by Nathan Marz
  - Composed of "batch", "speed", and "serving" layers
  - Batch layer
    - Store of record
    - Compute arbitrary views
  - Speed layer
    - Low latency updates
    - Streaming algorithms
  - Serving layer
    - Combine data from batch and speed layers to answer queries
  [Diagram labels: Data, Batch, Speed, Serving]

5. Stackdriver Architecture
  - Shares characteristics of lambda architecture
  - Indexing (speed) path
    - Make "live" data available "pre-analysis"
  - Analysis (batch) path
    - Compute aggregations
    - Create recommendations
  - Query (serving) layer
    - Combine "live" and analyzed data to answer queries
    - May require on-the-fly analysis
  - Alerting (speed) path (not discussed here)
    - Stream processing to detect policy-based anomalies
  [Diagram labels: Data, Indexing (Speed), Analysis (Batch), Alerting (Speed), Database, Query (Serving), Notification (Serving)]

6. Database Options
  - We chose Cassandra!
    - True P2P architecture
    - Good support for write-heavy workloads
    - Compatible data model for time series data
      - Column per metric type, timestamps as columns
  - Why not MySQL?
    - Experience with operating large, sharded deployments
    - Relational data model not a good match
  - Why not HBase?
    - Operational complexity: ZooKeeper, Hadoop, HDFS, ...
    - Special "Master" role
  - Why not Dynamo?
    - Avoid vendor lock-in and high cost

7. Stackdriver Architecture ++
  - Archival pipeline stores all data
    - Very small surface area, battle-tested
    - Critical for disaster recovery
    - S3 considered durable enough
    - Replicated for availability
  - Archive means Cassandra is "soft state"
  - C* consolidates analysis and indexing results
  - Properties of data in C*
    - Immutable data
    - Append-only
    - Read-1, write-1 consistency
  - Scales out easily
    - Indexers, archivers, analyzers, query servers
  [Diagram labels: Data, Archive, S3, Index, Analyze, Cassandra (Inventory, Data Series, Roll-ups, Analysis, Recs), Query]

8. Cassandra at Stackdriver
  - Cluster configuration
    - Version: DataStax Community Edition 1.2.10
    - Replication factor: 3
    - Vnodes
    - Murmur3Partitioner
    - Ec2Snitch
      - Aids in request efficiency
      - Enables Cassandra to ensure replicas are in different Availability Zones
    - phi_convict_threshold: 8 -> 12
      - Used to determine when nodes are down
      - AWS network can be spotty

9. Cassandra Topology in AWS
  - Keep it balanced!
  [Diagrams: "Where we started..." and "Where we are...", with nodes numbered 1, 2, 3 and placed in us-east-1a, us-east-1b, and us-east-1c]

10. Cassandra EC2 Node Configuration
  - m1.xlarge
    - 4 cores
    - 15 GB RAM
    - 4 ephemeral disks available
  - 4 disks in RAID-0 for data volume and commit log
    - ext4, mounted with defaults,noatime
    - mdadm RAID-0
  - Compactions
    - Heavy read/write IO

11. Cassandra Automation and Operations
  - Combination of Boto, Fabric, & Puppet
    - Boto for AWS API
    - Fabric + Puppet for bootstrapping
    - Fabric for operations
  - One command to:
    - Launch a new cluster
    - Upsize a cluster
    - Replace a dead node
    - Remove existing nodes
    - List nodes in a cluster

12. Our (Internal) Slogan

13. Cassandra Backups using S3
  - No Cassandra-powered backups
  - Restore from S3
  - Useful for major version upgrades
  - Restore flow:
    1. Data is archived when it is received
    2. Bulk loader reads from S3
    3. M/R re-analyzes data
    4. Cassandra is repopulated
  [Diagram labels: Data, S3, Bulk Loader, Map Reduce, Cassandra]

14. Disaster Recovery in the Wild
  - On October 23, Stackdriver suffered a total loss of our C* cluster
    - Exhausted memory due to number of open file descriptors (see graph)
    - We did not notice the problem until it was too late
    - Nodes began crashing, resulting in an inconsistent view of the ring
    - Attempted to restart the cluster unsuccessfully for ~2 hours
  - Provisioned new 36-node cluster in ~2 hours
    - Directed live data to new cluster
    - Started bulk restore operation from archive
    - Full-fidelity data and aggregations
  - No data loss, thanks to the archival pipeline
  - See http://www.stackdriver.com/post-mortem-october-23-stackdriver-outage/

15. Cluster Restoration Process
  [Diagram labels: S3, Map Reduce, Bulk Loader, Historical Data, New Cluster, UI, API, Gateway, New Data, Old Cluster]

16. Thank you!
  - Yes, we are hiring!
  - Patrick Eaton - email@example.com - @PatrickREaton
  - Joey Imbasciano - firstname.lastname@example.org - @_joeyi
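
Editor's sketch for slide 9 ("Keep it balanced!"): Ec2Snitch treats each Availability Zone as a rack, so when replicas are spread one per zone, a zone with fewer nodes leaves each of its nodes holding more data. The Python below is an illustration only, not Stackdriver's tooling. The function names `plan_zone_assignment` and `is_balanced` are hypothetical, and the round-robin policy is an assumption. The zone names come from slide 9 and the 36-node size from slide 14.

```python
from collections import Counter

# Zone names taken from slide 9 of the deck.
ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]


def plan_zone_assignment(node_count, zones):
    """Assign nodes to zones round-robin so zone sizes differ by at most one.

    Hypothetical helper: the deck does not describe how nodes were placed.
    """
    if node_count < 0:
        raise ValueError("node_count must be non-negative")
    if not zones:
        raise ValueError("at least one zone is required")
    return [zones[i % len(zones)] for i in range(node_count)]


def is_balanced(assignment, zones):
    """Return True when no zone has more than one node above any other zone."""
    counts = Counter(assignment)
    per_zone = [counts.get(zone, 0) for zone in zones]
    return max(per_zone) - min(per_zone) <= 1


# Example: the 36-node replacement cluster size mentioned on slide 14.
plan = plan_zone_assignment(36, ZONES)
print(sorted(Counter(plan).items()))
# [('us-east-1a', 12), ('us-east-1b', 12), ('us-east-1c', 12)]
print(is_balanced(plan, ZONES))  # True
```

A check like `is_balanced` could run before an "upsize" or "replace a dead node" operation to flag a plan that would leave one zone short. Choosing node counts that are multiples of the zone count keeps the split exact.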