Amazon RDS for MySQL – Diagnostics, Security, and Data Migration (DAT302) | AWS re:Invent 2013


DESCRIPTION

Learn how to monitor your database performance closely and troubleshoot database issues quickly using a variety of features provided by Amazon RDS and MySQL, including database events, logs, and engine-specific features. You also learn security best practices for Amazon RDS for MySQL and how to move data effectively between Amazon RDS and on-premises instances. Lastly, you learn the latest about MySQL 5.6 and how to take advantage of its newest features with Amazon RDS.

TRANSCRIPT

© 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

DAT302 - A Closer Look at Amazon RDS for MySQL - Deep Dive into Diagnostics, Security, and Migration

Pavan Pothukuchi, Sr. Product Manager, Amazon RDS

Sorin Stoina, Operations Lead, Optaros

Antonio Graeff, Technology Director, Titans Group

November 14, 2013

What’s in 2013?

Features

Diagnostics

Monitoring


• CloudWatch & alarms

• SNS notifications

• Other monitoring tools

• Connect dots!

RDS – CloudWatch & Alarms

Freeable Memory

• Scale Instance Up

• Use Read Replicas

Swap Usage

RDS – CloudWatch & Alarms

Metric trend                     | Potential action
Freeable Space                   | Scale storage up
Binary Log Disk Usage            | Check read replicas
Freeable Memory and Swap Usage   | Scale compute
Write Latency and Queue Depth    | Add Provisioned IOPS
DB Connections                   | Check connection pooling
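The metric trends and actions listed above can be turned into a simple lookup. The following is a minimal sketch, not the presenters' tooling: the CloudWatch metric names are my mapping of the slide's informal labels, and every threshold value is a hypothetical placeholder to be tuned per workload.

```python
# Mapping of CloudWatch RDS metric names to the actions on the slide.
# The metric names are an assumed translation of the slide's labels.
ACTIONS = {
    "FreeStorageSpace": "Scale storage up",
    "BinLogDiskUsage": "Check read replicas",
    "FreeableMemory": "Scale compute",
    "SwapUsage": "Scale compute",
    "WriteLatency": "Add Provisioned IOPS",
    "DiskQueueDepth": "Add Provisioned IOPS",
    "DatabaseConnections": "Check connection pooling",
}

# For these metrics a LOW value is the warning sign; for the rest, a HIGH value.
LOW_IS_BAD = {"FreeStorageSpace", "FreeableMemory"}


def suggest_actions(latest, thresholds):
    """Return the sorted, de-duplicated actions for metrics past their threshold.

    latest:     {metric_name: most recent value}
    thresholds: {metric_name: limit} (hypothetical values chosen by the operator)
    """
    actions = set()
    for name, value in latest.items():
        limit = thresholds.get(name)
        if limit is None or name not in ACTIONS:
            continue  # no threshold configured, or metric not covered by the slide
        breached = value < limit if name in LOW_IS_BAD else value > limit
        if breached:
            actions.add(ACTIONS[name])
    return sorted(actions)
```

In practice the `latest` values would come from CloudWatch, and a CloudWatch alarm per metric can do the same comparison without custom code.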

RDS - Event Notifications


Database Logs

• Error Log

• Slow Query Log

• General Log

Error Log

• Archived every 5 min

• Retained for 24 hours

• Example: Unable to start MySQL

Sample log content:
InnoDB: Initializing buffer pool, size = 6.0G
InnoDB: Completed initialization of buffer pool
InnoDB: Fatal error: cannot allocate memory for the buffer pool

• Action: Audit memory parameters (e.g., innodb_buffer_pool_size)
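The error log check above can be automated once the log has been downloaded. This is a minimal sketch that assumes the log lines look like the sample on the slide; the function name and return convention are illustrative only.

```python
import re

# Patterns taken from the sample log content on the slide.
FATAL_RE = re.compile(r"Fatal error: cannot allocate memory for the buffer pool")
SIZE_RE = re.compile(r"Initializing buffer pool, size = ([\d.]+)([KMG])")
UNITS = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def buffer_pool_failure(log_text):
    """Return the requested buffer pool size in bytes if allocation failed.

    Returns None when the fatal error is absent, or when the size line
    cannot be found in the given text.
    """
    if not FATAL_RE.search(log_text):
        return None
    match = SIZE_RE.search(log_text)
    if match is None:
        return None
    return int(float(match.group(1)) * UNITS[match.group(2)])
```

Comparing the returned size with the memory of the instance class shows whether innodb_buffer_pool_size in the parameter group is set too high.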

Slow Query Log

• Download from AWS Management Console

• Access from tables

• Connect the dots

Other Monitoring Tools

• MONyog

• Percona

• New Relic

• Graphite

• Splunk

Security

Security Internet

IAM

VPC


• Run DB in a private subnet

• Use a separate security group for the DB

• Connect through CNAME

• Use

AWS Identity and Access Management (IAM)

• DO NOT share AWS account credentials

• Create IAM users

• Tag resources

• Delegate access
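Tagging and delegation combine naturally in an IAM policy. The following is a minimal sketch of a policy document built in Python, assuming a hypothetical tag key and value; the specific actions chosen (describe everything, reboot only tagged instances) are an illustration and do not come from the talk.

```python
import json


def tagged_db_policy(tag_key, tag_value):
    """Build an IAM policy document (as a dict) for a delegated IAM user.

    The user may describe all RDS resources, but may reboot only DB
    instances carrying the given tag. Tag key and value are placeholders.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Read-only visibility across RDS.
                "Effect": "Allow",
                "Action": ["rds:Describe*", "rds:ListTagsForResource"],
                "Resource": "*",
            },
            {
                # A mutating action, limited to instances with the matching tag.
                "Effect": "Allow",
                "Action": ["rds:RebootDBInstance"],
                "Resource": "arn:aws:rds:*:*:db:*",
                "Condition": {
                    "StringEquals": {f"rds:db-tag/{tag_key}": tag_value}
                },
            },
        ],
    }


def policy_json(tag_key, tag_value):
    """Serialize the policy for pasting into the IAM console or an API call."""
    return json.dumps(tagged_db_policy(tag_key, tag_value), indent=2)
```

For example, `policy_json("environment", "staging")` yields a document that could be attached to an IAM user instead of sharing the account credentials.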

Data Migration

Advanced Migration

[Diagram: replication set up from On Premises to AWS; tables T1 and T2 shown on both sides]

Customer Highlights

• e-Commerce solutions: Sorin Stoina

• Value Added Services - Telecom: Antonio Graeff

Who we are, What we do

• Optaros is a global digital commerce service partner

• Hosting and support for multiple customers

• New and emerging shopping models
  – Flash sales
  – Private event retailing

• High traffic “Daily Deal” sites
  – 5 million unique visitors
  – 2000 page views per second
  – 15 add-to-carts per second
  – 3 orders per second

• Using AWS since 2009, RDS since 2010

Private Event Retailing (PER)

• “Daily Deal” or “Private Sales”

• 24, 48, or 72 hour events

• Massive discounts designed to entice customers

• Invitation only
  – Customers are selected based on purchase history
  – Email blast is sent as the event starts

• Users can “reserve” items for a limited time by adding them to their cart

• “Cyber Monday every Monday”

Typical Shopping Cart Architecture

PER Traffic Pattern

RDS in E-Commerce

• Highly transactional, ACID is a must

• Highly available
  – Multi-AZ: fail-over, on-the-fly changes to RDS instances

• Massive write and read-intensive loads
  – Writes: sign-up, add to cart, checkout (Provisioned IOPS)
  – Reads: catalog browsing, stock availability (read replicas)

• Operational efficiency
  – High/low peak traffic ratio is huge, sometimes as high as 100:1
  – 50+ database servers with 5 devops engineers
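The split between write traffic on the master and read traffic on read replicas can be sketched at the application layer. This is a minimal illustration, not Optaros's implementation: the class name, the endpoint strings, and the "first keyword" routing rule are all assumptions.

```python
import itertools


class EndpointRouter:
    """Send plain SELECTs to read replicas (round-robin) and the rest to the master."""

    def __init__(self, master, replicas):
        self.master = master
        # cycle() over an empty list would never yield, so keep None instead.
        self._replicas = itertools.cycle(replicas) if replicas else None

    def endpoint_for(self, sql):
        words = sql.split(None, 1)
        first = words[0].upper() if words else ""
        is_plain_read = first == "SELECT" and "FOR UPDATE" not in sql.upper()
        if is_plain_read and self._replicas is not None:
            return next(self._replicas)
        # Writes, locking reads, and everything else go to the master.
        return self.master
```

Replicas lag the master slightly, so reads that must see the latest write (for example, stock checks during checkout) are safer on the master even though they are SELECTs.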

Tools & Techniques

• Jenkins – Event prep automation

• CloudFormation – Environment management

• CloudWatch for metrics – And Graphite for good measure

• Percona toolkit – http://www.percona.com/software/percona-toolkit

• MONyog

• Optaros Cloud Console – Database monitor

Jenkins


• We have automated jobs to “Scale up” the infrastructure:
  – Frontend servers: increase auto-scaling array to 30+
  – Start up to 10 extra cache machines
  – RDS read replicas: start 4 read replicas in parallel

• Jobs complete within 30 minutes; this used to take a lot longer before parallel read replica creation
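Starting several read replicas in parallel can be sketched with a thread pool. The example below is an assumption-laden illustration rather than the actual Jenkins job: `create_replica` is an injected callable (hypothetical), and the replica naming scheme is invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def create_replicas_in_parallel(create_replica, source_id, count):
    """Request `count` read replicas of `source_id` concurrently.

    create_replica(source_id, replica_id) is supplied by the caller. With
    boto3 it could wrap rds.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_id,
        SourceDBInstanceIdentifier=source_id)
    which only submits the request; waiting for availability is separate.
    Returns the callable's results in replica order.
    """
    if count < 1:
        return []
    # Hypothetical naming scheme: <source>-replica0, <source>-replica1, ...
    replica_ids = [f"{source_id}-replica{i}" for i in range(count)]
    with ThreadPoolExecutor(max_workers=count) as pool:
        # pool.map preserves the order of replica_ids in its results.
        return list(pool.map(lambda rid: create_replica(source_id, rid), replica_ids))
```

Injecting the callable keeps the sketch testable without AWS credentials and leaves retry and wait logic to the caller.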

AWS CloudFormation

• Keep your RDS parameter groups, security groups and network ACLs in sync across environments

sorin-macbook:stacks sorin$ stack -d cross-client-tools-prod.rb
@@ -7188,7 +7188,7 @@
   "innodb_purge_threads": 1,
   "max_allowed_packet": 20971520,
   "max_connect_errors": "10000",
-  "query_cache_size": 33554432,
+  "query_cache_size": 65554432,
   "thread_cache_size": 32,
   "tx_isolation": "READ-COMMITTED"

Amazon CloudWatch and Graphite

• Graphite is our central system for metrics
  – Pull RDS data from CloudWatch into Graphite
  – Parse InnoDB and system variables and push to Graphite
  – Application and system metrics go in there as well

• Single dashboard for the whole application

• Graphite’s API is polled by other alerting and monitoring systems as well
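Pushing datapoints into Graphite uses its plaintext protocol: one `path value timestamp` line per datapoint, sent over TCP (port 2003 by default). The sketch below shows only that formatting and sending step; the metric prefix and the datapoint source are hypothetical, and fetching from CloudWatch is left out.

```python
import socket


def to_graphite_lines(prefix, metric, datapoints):
    """Format (unix_timestamp, value) pairs as Graphite plaintext lines."""
    path = f"{prefix}.{metric}"
    return [f"{path} {value} {int(ts)}" for ts, value in datapoints]


def send_to_graphite(lines, host, port=2003):
    """Send preformatted lines to a Carbon listener over TCP."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall(payload)
```

For example, `to_graphite_lines("rds.prod-db", "CPUUtilization", [(1384423200, 42.5)])` produces the single line `rds.prod-db.CPUUtilization 42.5 1384423200`.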


MONyog


• Commercial app for MySQL management

• Monitors and alerts on key metrics

• Useful diagnostics
  – Caches
  – Deadlocks
  – Temporary tables
  – etc.

• Advice on best practices

MONyog alert

Server: prod rds-read-replica0

Sampling timeframe: All Time/Current

Name: Currently running threads

Group: Current Connections

Type: Critical

Threshold: 500

Value: 1204

Advice: If the database is overloaded you'll get an increased number of queries running. Occasional spikes are OK for a very short period of time. Too many active threads indicate that:

1. MySQL is taking too much time to process your requests.

2. You are continuously retrieving/updating large datasets.

Make sure that queries are tuned to use indexes. Execute SHOW FULL PROCESSLIST to find queries that are getting locked continuously. Try isolating long running queries by enabling the slow query log.

Percona Toolkit

• http://percona.com/software/percona-toolkit

• pt-query-digest in particular
  – Can be used on the slow query log or a tcpdump file
  – Since you can’t log in to the RDS host, run tcpdump on your application server
  – # tcpdump -i eth0 -s 65535 -x -n -q -tttt port 3306 > tcpdump.out
  – # pt-query-digest --type=tcpdump tcpdump.out

• pt-table-checksum won’t work
  – It requires special privileges
  – Fortunately, it’s really easy to rebuild read replicas
  – sync_binlog can be a problem when using read replicas
  – Less of a problem with MySQL 5.6 crash-safe slaves

In-House Database Monitor

• “Snapshot” InnoDB status and process list every 10 seconds

• Go back in time up to 7 days

• Helps identify contentions, rogue queries, etc.

• Uses Amazon S3 for storage
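The interval and retention figures above imply a fixed number of stored snapshots per instance. A minimal sketch of the arithmetic and of one possible S3 key layout follows; the key format and function names are hypothetical, since the talk does not describe how the monitor names its objects.

```python
from datetime import datetime, timedelta, timezone

SNAPSHOT_INTERVAL = timedelta(seconds=10)  # from the slide
RETENTION = timedelta(days=7)              # from the slide


def snapshot_key(instance_id, taken_at):
    """Build a time-partitioned object key (hypothetical layout)."""
    return f"{instance_id}/{taken_at:%Y/%m/%d/%H%M%S}.txt"


def expired(taken_at, now):
    """True when a snapshot is older than the retention window."""
    return now - taken_at > RETENTION


def snapshots_retained():
    """Number of snapshots kept per instance at steady state."""
    return RETENTION // SNAPSHOT_INTERVAL
```

At one snapshot every 10 seconds for 7 days, that is 60,480 objects per instance. Partitioning keys by date makes "go back in time" a prefix listing, and an S3 lifecycle expiration rule could handle the pruning instead of custom code.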


Up Next

• Manage read replicas using CloudFormation

• Use Provisioned IOPS more for lower latency

• Upgrade more environments to MySQL 5.6

• Better disaster recovery – cross-region DB snapshot


Titans Group

• VAS (Value Added Services) provider for mobile and fixed-line carriers and ISPs

• White label personal cloud, mobile security and mobile learning products

• Over 10 million active users in 17 countries in Latin America

Carrier billing platform

• Complex business rules (trial and subscription periods, bundle, self-renewal)

• Lots of safeguards to prevent overcharge

• High volume, high value data

• Uptime counts: lost transaction is lost revenue

• Transactions concentrated in some days of the month

• Many different regulatory issues for logging, privacy and data retention

Before


• Single pair of on-premises MySQL servers in master-slave configuration

• Less than 100k transactions a day but growing fast

• No full-time DBA

• Rapidly iterating the application (while converting from PHP to Python)

Problems

• Upgrading memory, CPU and storage (SSD) and still hitting hardware bottlenecks

• Database for queues (please, don't!)

The turning point

• AWS announces Provisioned IOPS Storage for RDS in September 2012

• Let's migrate!

Migrating from on-premises to RDS

• Then: dump from MySQL and load on RDS, replay binary logs on RDS (downtime)

• Percona Toolkit pt-table-sync for sanity checks

• Now (much easier!): RDS as slave, promote slave to master (almost online)
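The "RDS as slave, then promote" path can be outlined as the SQL statements run against the RDS instance. The slide does not name the commands; this sketch assumes the RDS for MySQL stored procedures `mysql.rds_set_external_master`, `mysql.rds_start_replication`, `mysql.rds_stop_replication`, and `mysql.rds_reset_external_master`, and every host, user, and binlog value shown is a placeholder.

```python
def external_master_statements(host, port, user, password,
                               binlog_file, binlog_pos, use_ssl=False):
    """Statements that point an RDS instance at an on-premises master.

    Values are interpolated directly, so they are assumed to be trusted
    and free of quote characters. The binlog file and position come from
    the master at the time the initial dump was taken.
    """
    set_master = (
        "CALL mysql.rds_set_external_master("
        f"'{host}', {int(port)}, '{user}', '{password}', "
        f"'{binlog_file}', {int(binlog_pos)}, {1 if use_ssl else 0});"
    )
    return [set_master, "CALL mysql.rds_start_replication;"]


def cutover_statements():
    """Statements run at cutover, once RDS has caught up with the master."""
    return [
        "CALL mysql.rds_stop_replication;",
        "CALL mysql.rds_reset_external_master;",
    ]
```

Between the two steps, the application is pointed at the RDS endpoint after writes to the old master have stopped and replication lag has reached zero, which is what keeps the migration "almost online".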

After


• Several RDS instances

• Specialized databases by function (contracts, transactions, whitelists, blacklists)

• Several million transactions a day and still growing fast

• Still no full-time DBA

How RDS helped us

• Focus on application versus focus on database operation

• Easy scaling up

• Multi-AZ - High availability (99.95% Uptime SLA)

• Read Replica – for read load and ad hoc analysis

• Snapshots - For testing and archival

• Tagging - Cost reporting by product and client

Performance monitoring - New Relic


Log management - Splunk

Next steps

• Automate data lifecycle management

• Migrating cold data from RDS to Redshift and to S3 and Glacier
