c* summit 2013: optimizing the public cloud for cost and scalability with cassandra - the metricshub...

Optimizing the Public Cloud for Cost and Scalability with Cassandra

#CASSANDRA13

Charles Lamanna, Senior Development Lead (@clamanna)

Ricardo Villalobos, Senior Cloud Architect (@ricvilla)

MetricsHub: keep services up and running for the lowest possible cost


Live Status

Cost Awareness

Alerts and Notifications

Actions and Scaling


Growth: 2000+ customers in 6 months


[Chart: Number of MetricsHub Customers, 10/18/2012 to 6/25/2013 (y-axis 0 to 2500)]


[Chart: Number of VMs Monitored by MetricsHub, 10/18/2012 to 6/25/2013 (y-axis 0 to 9000)]


[Chart: Number of MetricsHub Employees, 10/18/2012 to 6/25/2013 (y-axis 0 to 8)]


Storing data: 200M data points per hour


Planning for huge data ingestion rates

• MetricsHub requires high scale, real-time data:

  • 1,000 data points per minute per VM

  • 12 data points per endpoint per minute

  • 500+ data points per storage account per hour

• Need to aggregate, analyze and take actions based on this data stream (in near real-time)

• Must be cheap, scalable and reliable
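As a rough back-of-the-envelope check, the per-resource rates above can be converted into points per hour. The sketch below does only that arithmetic; the fleet sizes passed in at the bottom are hypothetical placeholders chosen for illustration, not MetricsHub's actual numbers, and the function name is made up for this example.

```python
# Per-resource rates taken from the slide above.
POINTS_PER_VM_PER_MINUTE = 1_000
POINTS_PER_ENDPOINT_PER_MINUTE = 12
POINTS_PER_STORAGE_ACCOUNT_PER_HOUR = 500  # the slide says "500+", so this is a floor


def points_per_hour(vms: int, endpoints: int, storage_accounts: int) -> int:
    """Estimate hourly data points for a given (hypothetical) fleet."""
    return (
        vms * POINTS_PER_VM_PER_MINUTE * 60
        + endpoints * POINTS_PER_ENDPOINT_PER_MINUTE * 60
        + storage_accounts * POINTS_PER_STORAGE_ACCOUNT_PER_HOUR
    )


if __name__ == "__main__":
    # Hypothetical fleet sizes, used only to illustrate the arithmetic.
    print(points_per_hour(vms=3_000, endpoints=5_000, storage_accounts=1_000))
```

At these rates a single VM contributes 60,000 points per hour versus 720 for an endpoint, so VM metrics dominate the total.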


Looked at Redis…

• Perform aggregation in memory (using INCR and other native operations)

• Flush aggregate data from Redis to persistent storage at a regular interval

• Is fast and powerful, with a good OSS community
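The aggregate-in-memory, flush-periodically pattern can be sketched as follows. To stay runnable without a server, a plain Python dict stands in for Redis; the class name, the `record` / `flush` methods, and the key format are all hypothetical and are not taken from the talk.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryAggregator:
    """Stand-in for Redis-side aggregation: counters live only in memory."""

    def __init__(self) -> None:
        # key -> [sum, count]; in Redis these would be two INCR-style counters.
        self._buckets: Dict[str, List[int]] = defaultdict(lambda: [0, 0])

    def record(self, key: str, value: int) -> None:
        bucket = self._buckets[key]
        bucket[0] += value  # analogous to incrementing a "sum" counter by value
        bucket[1] += 1      # analogous to INCR on a "count" counter

    def flush(self, persist: Callable[[str, int, int], None]) -> int:
        """Hand every aggregate to persistent storage, then free the memory."""
        snapshot, self._buckets = self._buckets, defaultdict(lambda: [0, 0])
        for key, (total, count) in snapshot.items():
            persist(key, total, count)
        return len(snapshot)
```

Here `persist` is whatever writes to durable storage. Until `flush` runs, every distinct key keeps occupying memory, so the flush interval bounds how much RAM the aggregator needs.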


… but it was fragile, and expensive for this use case

• RAM/Memory in the public cloud is *expensive* (but storage is *cheap*)

• Flushing the data requires complex coordination

• If we did not flush quickly enough – out of memory!


Looked at SQL…

• Create tables for different time windows and granularities

• Roll over from table-to-table (and drop entire tables when the data expires)

• Update in place (for counters, min, max, etc.) in a reliable way


… but SQL did not fit

• Higher write than read volume pushed boundaries of the servers

• Requires complex sharding after just a few dozen new customers

• Is possible, but not worth the operational cost


Then we tried Cassandra (and never went back)

• Scales fluidly

  • Grows horizontally – double the nodes, double capacity

  • Add / remove capacity / nodes with no downtime

• Highly available

  • No single point of failure

  • Replication factor (i.e. hot copies) is just a config switch


… and by the way

• Little-to-no operations cost

  • New nodes take minutes to set up

  • Nodes just keep running for months on end

• “Aggregate on write” – no jobs required!

  • Atomic distributed counters make it easy to do aggregates on write

• …and a nice kicker: has *great* perf / COGS in Azure


Architecture: 68 virtual machines (PAAS and IAAS)


[Architecture diagram]

Clients: End User Web Browsers; Monitored Customer Resources (e.g. websites; SQL databases); Monitored Virtual Machines; Endpoints

PaaS: Portal Web Role (3 instances); Web API Web Role (8 instances); Jobs Worker Role (24 instances)

IaaS: Cassandra VM Cluster (32 XL instances)

Services: Table Storage; SQL Database; Blob storage (replicated data in multiple datacenters)

Avoiding state

• Application logic / code all lives on stateless machines

• Keeps it simple: decreases human operations cost

• Use Azure PAAS offerings (Web and Worker roles)


PaaS

Windows Azure Cloud Services (PAAS)

• Scale horizontally (grew from 1 to 30+ instances)

• Managed by the platform (patched; coordinated recycling; failover; etc.)

• 1 click deployment from Visual Studio (with automatic load balancer swaps)

Web Role Worker Role


Jobs Worker Role

Runs recurring tasks to pull, generate and analyze data

Jobs are synchronized and scheduled using Windows Azure Tables and Queues



Web API Role

RESTful endpoint for saving and reading custom metrics.

Highly concurrent, secure & scalable.


Portal Web Role

Interface for our customers – shows trends, charts and issues.


Cassandra Cluster

Maintains all state for metrics / time series data.

Windows Azure Virtual Machines (IaaS)

Management Portal

Scripting (Windows, Linux and Mac)

REST API

Starting: Select Image and VM Size, then New Disk Persisted in Storage, then Boot VM from New Disk

32 nodes, 8 “pods” of 4 nodes

Each pod is exposed via a single endpoint (e.g. port 9160, port 9161, …)

Exposing the pods

• Each pod of 4 nodes has a single load balanced endpoint

• Clients (on our stateless roles) treat the endpoints as a pool

• A client blacklists and skips an endpoint if it starts producing a lot of errors
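One way a client-side pool with blacklisting might look is sketched below. This is an illustration under assumptions, not the MetricsHub client: the class name, the consecutive-error threshold, the cooldown period, and the round-robin policy are all hypothetical, since the slides do not say how errors are counted or when an endpoint is retried.

```python
import time
from typing import Callable, Dict, List, Optional


class EndpointPool:
    """Round-robin over pod endpoints, skipping ones that fail too often."""

    def __init__(self, endpoints: List[str], max_errors: int = 3,
                 cooldown_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._endpoints = list(endpoints)
        self._max_errors = max_errors          # hypothetical threshold
        self._cooldown = cooldown_seconds      # hypothetical blacklist duration
        self._clock = clock
        self._errors: Dict[str, int] = {e: 0 for e in self._endpoints}
        self._blacklisted_until: Dict[str, float] = {}
        self._next = 0

    def pick(self) -> Optional[str]:
        """Return the next usable endpoint, or None if all are blacklisted."""
        now = self._clock()
        for _ in range(len(self._endpoints)):
            endpoint = self._endpoints[self._next]
            self._next = (self._next + 1) % len(self._endpoints)
            until = self._blacklisted_until.get(endpoint)
            if until is None or until <= now:
                return endpoint
        return None

    def report_success(self, endpoint: str) -> None:
        self._errors[endpoint] = 0

    def report_error(self, endpoint: str) -> None:
        self._errors[endpoint] += 1
        if self._errors[endpoint] >= self._max_errors:
            # Too many consecutive errors: skip this endpoint for a while.
            self._blacklisted_until[endpoint] = self._clock() + self._cooldown
            self._errors[endpoint] = 0
```

A caller would `pick()` an endpoint before each request and call `report_success` or `report_error` afterwards. The injectable `clock` exists only so the cooldown can be tested without sleeping.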


Where does the data go?

• Data files are on 8 mounted network-backed disks (*not* ephemeral disks)

• Data disks are geo-replicated (3 copies local; 1 remote) for “free” DR

• Azure data disks offer great throughput (VMs end up CPU bound)


Our Column Families (CQL 3)

CREATE TABLE oneminute (
    rk text,
    ck text,
    cnt counter,
    sum counter,
    PRIMARY KEY (rk, ck)
);


Updating values…

Real-time “average” values at any granularity, for any time window

update oneminute / tenminute / oneday
set sum = sum + {sample_value}, cnt = cnt + 1
where rk = '{customer_name}' and ck = '{metric_path}'
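The effect of issuing that update against each granularity can be simulated in memory. In the sketch below, nested dicts stand in for the three counter tables, and the `ck` layout (metric path followed by a zero-padded bucket start time) is an assumption made for illustration; the slides do not show how time is encoded in the key. Values are kept as integers because Cassandra counters are 64-bit integers.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Bucket widths in seconds for the three tables named on the slide.
GRANULARITIES = {"oneminute": 60, "tenminute": 600, "oneday": 86_400}

# table -> (rk, ck) -> [sum, cnt]; a dict stands in for each counter table.
Tables = Dict[str, Dict[Tuple[str, str], List[int]]]


def new_tables() -> Tables:
    return {name: defaultdict(lambda: [0, 0]) for name in GRANULARITIES}


def column_key(metric_path: str, epoch_seconds: int, width: int) -> str:
    """Hypothetical ck layout: metric path plus zero-padded bucket start."""
    bucket_start = epoch_seconds - (epoch_seconds % width)
    return f"{metric_path}/{bucket_start:012d}"


def record_sample(tables: Tables, customer: str, metric_path: str,
                  epoch_seconds: int, sample_value: int) -> None:
    """Apply 'sum = sum + value, cnt = cnt + 1' once per granularity."""
    for table, width in GRANULARITIES.items():
        ck = column_key(metric_path, epoch_seconds, width)
        cell = tables[table][(customer, ck)]
        cell[0] += sample_value  # sum counter
        cell[1] += 1             # cnt counter
```

Each incoming sample touches one cell per table, so the coarser tables are always up to date and no separate roll-up job is needed.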


Reading values…

*ONE* round trip to fetch a metric over time (e.g. CPU over past week)

select * from oneminute
where rk = '{customer_name}'
  and ck < '{metric_path_start}'
  and ck >= '{metric_path_end}'
order by ck desc;
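Once the rows for a range are back, the average for each bucket is simply `sum / cnt`. The sketch below emulates the range read over one partition using a sorted in-memory list; the function name and the `lower_ck` / `upper_ck` parameters are hypothetical, and the half-open range (lower bound inclusive, upper bound exclusive) is an assumption about how the bounds are meant to be read.

```python
from bisect import bisect_left
from typing import List, Tuple

Row = Tuple[str, int, int]  # (ck, sum, cnt)


def range_averages(rows: List[Row], lower_ck: str,
                   upper_ck: str) -> List[Tuple[str, float]]:
    """Return (ck, average) for lower_ck <= ck < upper_ck, highest ck first.

    `rows` must be sorted by ck ascending, as a clustering column would be.
    """
    keys = [ck for ck, _, _ in rows]
    lo = bisect_left(keys, lower_ck)
    hi = bisect_left(keys, upper_ck)
    selected = rows[lo:hi]
    # sum / cnt recovers the average; skip empty buckets to avoid dividing by zero.
    return [(ck, total / cnt) for ck, total, cnt in reversed(selected) if cnt]
```

Because the column keys sort lexicographically, a time-ordered key layout lets a single contiguous slice cover the whole window.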


What’s next?

• Windows Azure Virtual Networks to connect / secure all of our resources (PAAS + IAAS + Services)

• Expand Cassandra cluster across datacenter boundaries for improved availability

• Integrate with more off-the-shelf Azure components to reduce operational overhead


[Diagram: cloud platform overview]

Global Physical Infrastructure: servers / network / datacenters

Platform properties: automated; elastic; managed resources; usage based

REST API + other services

Compute: websites; cloud services; VMs

Data management: SQL database; noSQL database; blob

Networking: connect; virtual network; traffic manager
