c* summit 2013: optimizing the public cloud for cost and scalability with cassandra - the metricshub...

Optimizing the Public Cloud for Cost and Scalability with Cassandra. #CASSANDRA13. Charles Lamanna (Senior Development Lead, @clamanna) and Ricardo Villalobos (Senior Cloud Architect, @ricvilla)

Upload: planet-cassandra

Post on 01-Nov-2014


Category:

Technology



DESCRIPTION

MetricsHub is a monitoring and scalability service for public clouds that lets companies continuously gather data from their systems and auto-scale their deployments to optimize service costs. Taking advantage of Cassandra's rapid ingestion rates, reliable replication model, and ease of deployment, MetricsHub can handle billions of data points per day. During this session, you will learn about the architecture supporting this service, which combines the power of PaaS and IaaS on the Windows Azure platform.

TRANSCRIPT

Page 1: C* Summit 2013: Optimizing the Public Cloud for Cost and Scalability with Cassandra - The MetricsHub Story by Charles Lamanna and Ricardo Villalobos

Optimizing the Public Cloud for Cost and Scalability with Cassandra

#CASSANDRA13

Charles Lamanna, Senior Development Lead, @clamanna

Ricardo Villalobos, Senior Cloud Architect, @ricvilla

Page 2:

MetricsHub: keep services up and running for the lowest possible cost

Page 3:

Live Status

Cost Awareness

Alerts and Notifications

Actions and Scaling

Page 6:

Growth: 2000+ customers in 6 months

Page 7:

[Chart: Number of MetricsHub Customers, 10/18/2012 to 6/25/2013; y-axis 0 to 2500]

Page 8:

[Chart: Number of VMs Monitored by MetricsHub, 10/18/2012 to 6/25/2013; y-axis 0 to 9000]

Page 9:

[Chart: Number of MetricsHub Employees, 10/18/2012 to 6/25/2013; y-axis 0 to 8]

Page 10:

Storing data: 200M data points per hour

Page 11:

Planning for huge data ingestion rates

• MetricsHub requires high scale, real-time data:
  • 1,000 data points per minute per VM
  • 12 data points per endpoint per minute
  • 500+ data points per storage account per hour

• Need to aggregate, analyze and take actions based on this data stream (in near real-time)

• Must be cheap, scalable and reliable
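To get a feel for what these per-resource rates add up to, here is a small back-of-the-envelope sketch. The three rates come from the slide; the fleet sizes in the usage line are hypothetical and only show the arithmetic. Because the slide says "500+" for storage accounts, the result is a lower bound.

```python
# Per-resource rates quoted on the slide.
POINTS_PER_VM_PER_MINUTE = 1_000
POINTS_PER_ENDPOINT_PER_MINUTE = 12
POINTS_PER_STORAGE_ACCOUNT_PER_HOUR = 500  # slide says "500+", so this is a floor


def points_per_hour(vms: int, endpoints: int, storage_accounts: int) -> int:
    """Lower-bound estimate of data points ingested per hour."""
    return (
        vms * POINTS_PER_VM_PER_MINUTE * 60
        + endpoints * POINTS_PER_ENDPOINT_PER_MINUTE * 60
        + storage_accounts * POINTS_PER_STORAGE_ACCOUNT_PER_HOUR
    )


if __name__ == "__main__":
    # Hypothetical fleet, chosen only to illustrate the calculation.
    print(points_per_hour(vms=100, endpoints=50, storage_accounts=10))
```

The VM term dominates: each monitored VM alone contributes 60,000 data points per hour.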

Page 12:

Looked at Redis…

• Perform aggregation in memory (using INCR and other native operations)

• Flush aggregate data from Redis to persistent storage at a regular interval

• Is fast and powerful, and has a good OSS community

Page 13:

… but it was fragile, and expensive for this use case

• RAM/Memory in the public cloud is *expensive* (but storage is *cheap*)

• Flushing the data requires complex coordination

• If we did not flush quickly enough – out of memory!
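The aggregate-in-memory-then-flush pattern and its failure mode can be sketched without a Redis server. In the sketch below a plain Python class stands in for Redis counters, a dict stands in for persistent storage, and `max_keys` is a hypothetical, crude proxy for a memory budget. None of these names come from the talk.

```python
from collections import defaultdict


class InMemoryAggregator:
    """Stand-in for Redis-style INCR counters held in RAM (hypothetical helper)."""

    def __init__(self, max_keys: int):
        self.max_keys = max_keys  # crude proxy for a memory budget
        self.sums = defaultdict(float)
        self.counts = defaultdict(int)

    def record(self, key: str, value: float) -> None:
        if key not in self.counts and len(self.counts) >= self.max_keys:
            # The failure mode from the slide: nothing was flushed in time.
            raise MemoryError("aggregates not flushed quickly enough")
        self.sums[key] += value
        self.counts[key] += 1

    def flush(self, sink: dict) -> int:
        """Move every aggregate into persistent storage (a dict here) and free RAM."""
        flushed = len(self.counts)
        for key in self.counts:
            total, n = sink.get(key, (0.0, 0))
            sink[key] = (total + self.sums[key], n + self.counts[key])
        self.sums.clear()
        self.counts.clear()
        return flushed
```

With several writers, `flush` would also have to be coordinated so that no increments are lost between the read and the clear, which is the "complex coordination" the slide refers to.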

Page 14:

Looked at SQL…

• Create tables for different time windows and granularities

• Roll over from table-to-table (and drop entire tables when the data expires)

• Update in place (for counters, min, max, etc.) in a reliable way
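A minimal sketch of this table-per-window scheme, using Python's built-in `sqlite3` only so that it runs anywhere. The talk does not show its SQL schema, so the table naming scheme, the column names and the helper functions are all assumptions made for illustration.

```python
import sqlite3


def table_for(granularity: str, window: str) -> str:
    # Hypothetical naming scheme: one table per granularity and time window.
    # Names are interpolated into SQL, so they must come from trusted code.
    return f"metrics_{granularity}_{window}"


def record(conn, granularity, window, metric, value):
    table = table_for(granularity, window)
    conn.execute(
        f"CREATE TABLE IF NOT EXISTS {table} ("
        "metric TEXT PRIMARY KEY, cnt INTEGER, total REAL, lo REAL, hi REAL)"
    )
    # Update in place; insert only if the row did not exist yet.
    cur = conn.execute(
        f"UPDATE {table} SET cnt = cnt + 1, total = total + ?, "
        "lo = MIN(lo, ?), hi = MAX(hi, ?) WHERE metric = ?",
        (value, value, value, metric),
    )
    if cur.rowcount == 0:
        conn.execute(
            f"INSERT INTO {table} VALUES (?, 1, ?, ?, ?)",
            (metric, value, value, value),
        )


def expire(conn, granularity, window):
    # Expiry drops the whole table instead of deleting rows one by one.
    conn.execute(f"DROP TABLE IF EXISTS {table_for(granularity, window)}")
```

For example, `record(conn, "oneminute", "2013_06_25", "cpu", 10.0)` creates the window's table on first use, and `expire(conn, "oneminute", "2013_06_25")` removes it once the data is no longer needed.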

Page 15:

… but SQL did not fit

• Higher write than read volume pushed boundaries of the servers

• Requires complex sharding after just a few dozen new customers

• Is possible, but not worth the operational cost

Page 16:

Then we tried Cassandra (and never went back)

• Scales fluidly
  • Grows horizontally – double the nodes, double capacity
  • Add / remove capacity / nodes with no downtime

• Highly available
  • No single point of failure
  • Replication factor (i.e. hot copies) is just a config switch
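As an illustration of the "config switch" point: in CQL 3 the replication factor is an option on the keyspace. The sketch below only builds the statement text. The keyspace name, the use of `SimpleStrategy` and the factor of 3 are assumptions, since the talk does not state which settings MetricsHub used.

```python
def create_keyspace_cql(keyspace: str, replication_factor: int) -> str:
    """Build a CQL 3 CREATE KEYSPACE statement using SimpleStrategy."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return (
        f"CREATE KEYSPACE {keyspace} WITH replication = "
        f"{{'class': 'SimpleStrategy', 'replication_factor': {replication_factor}}};"
    )


if __name__ == "__main__":
    # Hypothetical keyspace name and replication factor.
    print(create_keyspace_cql("metricshub", 3))
```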

Page 17:

… and by the way

• Little-to-no operations cost
  • New nodes take minutes to set up
  • Nodes just keep running for months on end

• “Aggregate on write” – no jobs required!
  • Atomic distributed counters make it easy to do aggregates on write

• …and a nice kicker: has *great* perf / COGS in Azure

Page 18:

Architecture: 68 virtual machines (PaaS and IaaS)

Page 19:

[Architecture diagram]

Clients: End User Web Browsers; Monitored Customer Resources (e.g. websites; SQL databases); Monitored Virtual Machines

PaaS: Portal Web Role (3 instances); Web API Web Role (8 instances); Jobs Worker Role (24 instances)

IaaS: Cassandra VM Cluster (32 XL instances)

Services: Table Storage; SQL Database; Blob storage

Endpoints; replicated data in multiple datacenters

Page 20:

Avoiding state

• Application logic / code all lives on stateless machines

• Keeps it simple: decreases human operations cost

• Use Azure PAAS offerings (Web and Worker roles)

Page 21:

Windows Azure Cloud Services (PAAS)

• Scale horizontally (grew from 1 to 30+ instances)

• Managed by the platform (patched; coordinated recycling; failover; etc.)

• 1 click deployment from Visual Studio (with automatic load balancer swaps)

Web Role Worker Role

Page 22:

Jobs Worker Role (24 instances)

Runs recurring tasks to pull, generate and analyze data.

Jobs are synchronized and scheduled using Windows Azure Tables and Queues.

Page 23:

Web API Web Role (8 instances)

RESTful endpoint for saving and reading custom metrics.

Highly concurrent, secure & scalable.

Page 24:

Portal Web Role (3 instances)

Interface for our customers: shows trends, charts and issues.

Page 25:

Cassandra VM Cluster (32 XL instances)

Maintains all state for metrics / time series data.

Page 26:

Windows Azure Virtual Machines (IaaS)

Management Portal

Scripting (Windows, Linux and Mac)

REST API

Starting: Select Image and VM Size → New Disk Persisted in Storage → Boot VM from New Disk

Page 27:

32 nodes, 8 “pods” of 4 nodes

Page 28:

Exposing the pods

[Diagram: one pod exposed via a single endpoint (port 9160), the next pod exposed via a single endpoint (port 9161), …]

• Each pod of 4 nodes has a single load balanced endpoint

• Clients (on our stateless roles) treat the endpoints as a pool

• Clients blacklist and skip an endpoint if it starts producing a lot of errors
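The client-side behaviour described above can be sketched as a round-robin pool with a per-endpoint error count. This is not the MetricsHub client: the class name, the `max_errors` threshold and the reset-on-success rule are assumptions, and a real client would probably also let a blacklisted endpoint back in after a cool-down period.

```python
import itertools


class EndpointPool:
    """Round-robin over pod endpoints, skipping ones that keep failing."""

    def __init__(self, endpoints, max_errors=3):
        self.endpoints = list(endpoints)
        self.max_errors = max_errors  # hypothetical blacklist threshold
        self.errors = {ep: 0 for ep in self.endpoints}
        self._cycle = itertools.cycle(self.endpoints)

    def next_endpoint(self):
        # Look at each endpoint at most once per call.
        for _ in range(len(self.endpoints)):
            ep = next(self._cycle)
            if self.errors[ep] < self.max_errors:
                return ep
        raise RuntimeError("every endpoint is blacklisted")

    def report_error(self, ep):
        self.errors[ep] += 1

    def report_success(self, ep):
        self.errors[ep] = 0


if __name__ == "__main__":
    # Hypothetical host name; the two ports are the ones shown on the slide.
    pool = EndpointPool([("cassandra.example", 9160), ("cassandra.example", 9161)])
    print(pool.next_endpoint())
```

Because each endpoint fronts a load-balanced pod of 4 nodes, skipping one endpoint takes a whole pod out of rotation while the other pods keep serving requests.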

Page 29:

Where does the data go?

• Data files are on 8 mounted network backed disks (*not* ephemeral disks)

• Data disks are geo-replicated (3 copies local; 1 remote) for “free” DR

• Azure data disks offer great throughput (VMs end up CPU bound)

Page 30:

Our Column Families (CQL 3)

CREATE TABLE oneminute (
  rk text,
  ck text,
  cnt counter,
  sum counter,
  PRIMARY KEY (rk, ck)
);

Page 31:

Updating values…

Realtime “average” values at any granularity, for any time window

update oneminute/tenminute/oneday
set sum = sum + {sample_value}, cnt = cnt + 1
where rk = '{customer_name}' and ck = '{metric_path}'
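The aggregate-on-write idea behind this update can be simulated in memory: every sample increments `cnt` and `sum` in one row per granularity, and an average is `sum / cnt` at read time. The dict below stands in for the counter tables. The clustering-key layout (metric name plus zero-padded bucket start) is a hypothetical choice, since the talk does not show how `{metric_path}` encodes time. Sample values are integers here because Cassandra counter columns hold 64-bit integers.

```python
from collections import defaultdict

# Bucket widths in seconds for the three tables named on the slide.
GRANULARITIES = {"oneminute": 60, "tenminute": 600, "oneday": 86_400}

# (table, rk, ck) -> [cnt, sum]; an in-memory stand-in for the counter tables.
tables = defaultdict(lambda: [0, 0])


def clustering_key(metric: str, ts: int, width: int) -> str:
    # Hypothetical ck layout: metric name plus zero-padded bucket start,
    # so keys for one metric sort chronologically as text.
    return f"{metric}:{ts - ts % width:012d}"


def record(customer: str, metric: str, ts: int, value: int) -> None:
    """Apply 'sum = sum + value, cnt = cnt + 1' to every granularity."""
    for table, width in GRANULARITIES.items():
        row = tables[(table, customer, clustering_key(metric, ts, width))]
        row[0] += 1
        row[1] += value


def average(table: str, customer: str, metric: str, ts: int) -> float:
    """Average for the bucket containing ts, or NaN if nothing was recorded."""
    key = (table, customer, clustering_key(metric, ts, GRANULARITIES[table]))
    cnt, total = tables.get(key, (0, 0))
    return total / cnt if cnt else float("nan")
```

No background job is needed to roll one-minute data up into ten-minute or daily data, because all three rows are updated when the sample arrives.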

Page 32:

Reading values…

*ONE* round trip to fetch a metric over time (e.g. CPU over past week)

select * from oneminute
where rk = '{customer_name}' and ck < '{metric_path_start}' and ck >= '{metric_path_end}'
order by ck desc;
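The reason a single round trip is enough is that all rows for one `rk` live in one partition, sorted by `ck`, so a time range is a contiguous slice. The sketch below imitates that slice with `bisect` over a sorted list. It reads the slide's bounds as "start" being the newer, exclusive bound and "end" the older, inclusive one, which is an interpretation. The key layout in the usage example is hypothetical.

```python
import bisect


def range_scan(partition, start_ck, end_ck):
    """Return rows with end_ck <= ck < start_ck, newest first.

    `partition` is a list of (ck, cnt, sum) rows already sorted by ck,
    the way one Cassandra partition keeps its clustering keys in order.
    """
    keys = [row[0] for row in partition]
    lo = bisect.bisect_left(keys, end_ck)    # first ck >= end_ck
    hi = bisect.bisect_left(keys, start_ck)  # first ck >= start_ck (excluded)
    return partition[lo:hi][::-1]


def averages(rows):
    """Turn (ck, cnt, sum) rows into (ck, average) pairs."""
    return [(ck, total / cnt) for ck, cnt, total in rows if cnt]


if __name__ == "__main__":
    # Hypothetical keys: metric name plus zero-padded bucket start.
    partition = [
        ("cpu:000000000000", 2, 40),
        ("cpu:000000000060", 1, 50),
        ("cpu:000000000120", 4, 100),
    ]
    print(averages(range_scan(partition, "cpu:000000000120", "cpu:000000000000")))
```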

Page 33:

What’s next?

• Windows Azure Virtual Networks to connect / secure all of our resources (PAAS + IAAS + Services)

• Expand Cassandra cluster across datacenter boundaries for improved availability

• Integrate with more off-the-shelf Azure components to reduce operational overhead

Page 34:

[Diagram: Windows Azure platform]

Global Physical Infrastructure: servers / network / datacenters

Automated; elastic; managed resources; usage based

REST API + other services

Compute: websites; cloud services; VMs

Data management: SQL database; noSQL database; blob

Networking: connect; virtual network; traffic manager
