RDS for MySQL: No BS Operations and Patterns

Laine Campbell, CEO PalominoDB

The Party Line

Relational Database Service

● Fully Managed
● Simple to Deploy
● Easy to Scale
● Reliable
● Cost Effective

Fully Managed

Ignore the man behind the curtain

● Backups
● Provisioning
● Patching
● Performance Management
● Failover
● Replication

Fully Managed

Backups

Snapshot based, same as EBS.

Snapshots cause spikes in latency (avoided in Multi-AZ).

Snapshots are taken from the master, or from the standby in Multi-AZ.

● Set up automatic schedules
● Point-in-time recovery via binlogs
● User-executed snapshots

RDS Backups

Can I snapshot a replica?

Nope. Back up from your master.

Of course, you can promote a replica, then snapshot it for testbeds.

RDS Backups

I like RDS Backups

When using Multi-AZ

AND

When loads are minimal

It's like unicorns are flying my binlogs to heaven

Fully Managed

Provisioning

Rapid Master Launches

Master in a few minutes (or it's free?)

Standby in a different AZ? Push a button!

Rapid Replica Builds

Need more replicas? Push a button!

RDS Provisioning

Provisioning your master

Standalone - no failover or redundancy

Multi-AZ - standby in a separate availability zone

Pick your Version

Pick your maintenance window

RDS Provisioning

Overview of AZs and Regions

Amazon Regions equate to data centers in different geographical regions. (The 99.5% SLA is based on more than one AZ being unavailable.)

Availability zones are isolated from one another in the same region to minimize impact of failures.

RDS does not interact across regions.

RDS Provisioning

Can multiple AZs save me?

Amazon states AZs do not share:

● Cooling
● Network
● Security
● Generators
● Facilities

RDS Provisioning

Can multiple AZs save me?

Apr 2011 - US East Region EBS Failed

* Incorrect network failover.
* Saturated intra-node communications.
* Cascading failures impacted EBS in all AZs.

Jul 2012 - US East Partial Impact

* Electrical storms impacted multiple sites.
* Failover of metadata DB took too long.
* EBS I/O was frozen to minimize corruption.

RDS Provisioning

Can multiple AZs save me?

They can reduce risk.

Cross-AZ latency can vary by as much as 3x (too slow to allow MySQL Cluster across AZs).

A Multi-AZ failover can create a degraded performance condition when minimal latency is required.

Multi-AZ Failover

From AWS Docs

RDS Provisioning

Multi-AZ Magical Failover

Replicates via unicorn express

Fails over quite often, with up to 30 seconds of downtime

You do not get to choose your failover AZ

Typical I/O write impact for synchronous replication (aka unicorn express)

Multi-AZ Failover

From AWS Blog

RDS Provisioning

Pick Your Version

MySQL 5.1 or MySQL 5.5

:( No MariaDB :(
:( No XtraDB :(
:( No Drizzle :(
:( No TokuDB :(

RDS Provisioning

Pick Your Maintenance Window

A 30 minute window in which your software patching can occur.

Can be different for different instances.

You need to plan ahead for instances to be out of service.

RDS Provisioning

They'll shut off my DB????

RDS Provisioning

Auto Minor Version Upgrade

If you choose no, you will not experience automatic upgrades (and thus downtime).

Some critical security patches can still be applied.

The RDS team is fairly good about communicating upgrades.

RDS Provisioning

Basic Instance Types

Micro - 630 MB RAM, 2 ECU - Low I/O
Small - 1.7 GB RAM, 1 ECU - Med I/O
Large - 7.5 GB RAM, 4 ECU - High I/O
XLarge - 15 GB RAM, 8 ECU - High I/O

RDS Provisioning

Fancy Instance Types

High Mem XL - 17.1 GB RAM, 6.5 ECU - High I/O
High Mem 2XL - 34 GB RAM, 13 ECU - High I/O
High Mem 4XL - 68 GB RAM, 26 ECU - High I/O

RDS Provisioning

Storage Provisioning

From 5 GB to 3 TB

At 300 GB, EBS volumes start to get striped.

Striping = better performance

Provisioned IOPS (up to 30,000) = more stable I/O, and costs more too!

RDS Provisioning

Virtual Private Cloud (VPC)

Allows you to create your own virtual network simulating traditional DC networks.

You must create a DB Subnet Group in VPC

VPC Subnets cannot cross availability zones.

VPC security group allows access control to your DB

RDS Provisioning

Virtual Private Cloud (VPC)

Mixed architectures, with some instances in a VPC and some outside it, create major issues.

Auto-scaling becomes difficult.

Don't do it!

RDS Provisioning

Database Security Groups

Controls all MySQL access to RDS instances.

Defaults to "deny all"

Access can be granted by IP Range and EC2 sec groups.

RDS Provisioning

Database Security Groups

Don't grant access to 10.x.x.x, use a security group.

IPs entered with CIDR - Classless Inter-Domain Routing

Make sure you understand CIDR! (or you may have unwelcome visitors!)
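To make the CIDR warning concrete, here is a minimal Python sketch using only the standard-library ipaddress module. The example ranges and the 256-address threshold are illustrative assumptions, not values from the talk.

```python
import ipaddress


def describe_cidr(cidr: str) -> dict:
    """Summarize which addresses a CIDR range would let in."""
    net = ipaddress.ip_network(cidr, strict=True)
    return {
        "first": str(net[0]),
        "last": str(net[-1]),
        "addresses": net.num_addresses,
    }


def is_too_broad(cidr: str, max_addresses: int = 256) -> bool:
    """Flag ranges wider than a hypothetical per-rule limit."""
    net = ipaddress.ip_network(cidr, strict=True)
    return net.num_addresses > max_addresses


# Compare a whole private /8 with a single subnet and a single host.
for example in ("10.0.0.0/8", "10.1.2.0/24", "203.0.113.7/32"):
    print(example, describe_cidr(example), "too broad:", is_too_broad(example))
```

Running a check like this before adding a rule shows how quickly a short prefix length opens the instance to far more hosts than intended.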

RDS Provisioning

Parameter Groups

Defines parameters used by your RDS instances.

There is a "default" group that you can modify.

One or more RDS instances can map to an individual parameter group.

RDS Provisioning

Parameter Group Best Practices

Don't ever use the default group.

The default group doesn't allow dynamic parameter changes. Everything requires a restart.

Build different groups for each MySQL master/replica grouping.

RDS Provisioning

Parameter Group Best Practices

Use different parameter groups for masters vs. replicas.

Consider using different parameter groups for different replica types (app query, ad hoc, ETL)

Remember to use test environments. Test!!!

RDS Provisioning

Why different parameter groups?

Granularity - Do you want to apply the same parameter to everything in the cluster?

● Read Only?
● Slow Logging?
● innodb_flush_method

RDS Provisioning

Provisioning your Replicas

Does not have to be the same instance type as the master.

Pick your availability zone (great for mapping replicas to app servers in the same AZ).

Don't forget to apply a different parameter group than your master.

RDS Provisioning

Provisioning your Replicas

Adding a replica impacts your master's performance (if not in Multi-AZ).

You can only launch in serial, and each launch can take a non-trivial amount of time.

Adding many replicas can take a while. Script it!
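One way to script serial replica launches is sketched below in Python. It assumes a boto3-style RDS client is passed in (boto3 is not mentioned in the talk); create_db_instance_read_replica and the db_instance_available waiter are boto3 calls, while the function name, the replica spec format and the identifiers are hypothetical.

```python
def launch_replicas_serially(rds, source_id, replicas):
    """Create read replicas one at a time, waiting for each to come up.

    rds       -- a boto3-style RDS client (assumption: boto3.client("rds"))
    source_id -- identifier of the master instance
    replicas  -- list of dicts with hypothetical keys 'id',
                 'instance_class' and 'az'
    """
    waiter = rds.get_waiter("db_instance_available")
    launched = []
    for spec in replicas:
        # Replicas may use a different instance class and AZ than the master.
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=spec["id"],
            SourceDBInstanceIdentifier=source_id,
            DBInstanceClass=spec["instance_class"],
            AvailabilityZone=spec["az"],
        )
        # Block until this replica is available before starting the next,
        # since launches have to happen in serial.
        waiter.wait(DBInstanceIdentifier=spec["id"])
        launched.append(spec["id"])
    return launched
```

Passing the client in as an argument keeps the function easy to dry-run against a stub before pointing it at a real account. Applying a replica-specific parameter group would be a separate step that this sketch leaves out.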

RDS Provisioning

What can I do with my replica?

Send queries to it

Promote it to a master

Poke it with a stick

Use it for special purposes (mysqldump, ETL, ad hoc)

RDS Provisioning

Sending queries to the replica?

Set up Route53 CNAMEs - weighted round robin.

Internal elastic load balancer in the VPC.

VPC/Route53 does not do a mysql health check.

HAProxy can be leveraged.
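Since neither Route53 nor the VPC load balancer checks MySQL itself, an external check has to decide which replicas are fit to serve reads. The Python sketch below shows one possible rule, assuming the caller has already run SHOW SLAVE STATUS on each replica and passes the row in as a dict. The function names and the 30-second lag threshold are hypothetical.

```python
def replica_is_healthy(slave_status, max_lag_seconds=30):
    """Decide whether a replica should receive read traffic.

    slave_status -- dict built from one SHOW SLAVE STATUS row,
                    or None if the replica could not be reached.
    """
    if slave_status is None:
        return False
    # Both replication threads must be running.
    if slave_status.get("Slave_IO_Running") != "Yes":
        return False
    if slave_status.get("Slave_SQL_Running") != "Yes":
        return False
    # Seconds_Behind_Master is NULL (None) when replication is broken.
    lag = slave_status.get("Seconds_Behind_Master")
    return lag is not None and int(lag) <= max_lag_seconds


def healthy_replicas(statuses, max_lag_seconds=30):
    """Return the sorted hostnames whose status passes the check."""
    return sorted(
        host
        for host, status in statuses.items()
        if replica_is_healthy(status, max_lag_seconds)
    )
```

A check like this could back an HAProxy health endpoint or a script that adjusts the Route53 weights. Which of the two fits better depends on the deployment.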

RDS Provisioning

Replica Master Promotion

This is a great way to build a test environment.

Can be leveraged for rolling migrations

But a replica can't have a replica! Must promote first!

RDS Provisioning

Replica promotion for failover

This can be used instead of Multi-AZ. Why?

When using sync_binlog=0, a master failover in Multi-AZ may strand your replicas.

The old log doesn't close correctly. The replica cannot proceed. And you can't move to the next log!

RDS Provisioning

All of my replicas must be rebuilt!

A Day in the Life

What does an RDS DBA do?

A Day in the Life

What does an RDS DBA do?

Need a replica?
Push a button or call an API.

Need to create a test environment?
Promote a replica, call an API.

New Cluster?
Push a button or call an API.

A Day in the Life

What does an RDS DBA do?

Need a backup?
Push a button or call an API.

Need to recover a database?
Push a button or call an API.

New Cluster?
Push a button or call an API.

A Day in the Life

Need to do a query review?

You don't have access to the logs at the filesystem level.

You can look in the console or via API for some initial diagnostics.

A Day in the Life

Query Reviews

Need to do a REAL query review?

Log to the CSV table (mysql.slow_log), then dump it in slow-log format:

mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT( '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H%i%s'), '\n', '# User@Host: ', user_host, '\n', '# Query_time: ', TIME_TO_SEC(query_time), ' Lock_time: ', TIME_TO_SEC(lock_time), ' Rows_sent: ', rows_sent, ' Rows_examined: ', rows_examined, '\n', sql_text, ';' ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log

pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt
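To picture what the SELECT above writes for each row of mysql.slow_log, here is a small Python sketch that mirrors the same CONCAT. It is an illustration only: the function name and the sample row are made up, and the row is assumed to arrive as a dict with datetime and timedelta values.

```python
from datetime import datetime, timedelta


def format_slow_log_row(row: dict) -> str:
    """Render one slow_log row the way the CONCAT in the query does."""
    return (
        # DATE_FORMAT '%y%m%d %H%i%s' corresponds to strftime '%y%m%d %H%M%S'.
        f"# Time: {row['start_time'].strftime('%y%m%d %H%M%S')}\n"
        f"# User@Host: {row['user_host']}\n"
        # TIME_TO_SEC yields whole seconds, so sub-second detail is lost.
        f"# Query_time: {int(row['query_time'].total_seconds())}"
        f" Lock_time: {int(row['lock_time'].total_seconds())}"
        f" Rows_sent: {row['rows_sent']}"
        f" Rows_examined: {row['rows_examined']}\n"
        f"{row['sql_text']};"
    )


# Hypothetical sample row.
sample = {
    "start_time": datetime(2013, 4, 23, 14, 5, 9),
    "user_host": "app[app] @ [10.0.0.5]",
    "query_time": timedelta(seconds=3),
    "lock_time": timedelta(0),
    "rows_sent": 1,
    "rows_examined": 1000,
    "sql_text": "SELECT 1",
}
print(format_slow_log_row(sample))
```

The whole-second truncation visible here is the reason short queries show up with a query time of 0.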

A Day in the Life

Query Reviews

No Microsecond Patch

Using long-query-time=0 logs all queries.
But they record a query time of 0.
You have no accurate profile of query time for queries under 1 second.

You also can't use tcpdump on the MySQL instance. We often use it when logging everything would drop performance on the DB instance to unacceptable levels.

WHICH IT CAN

A Day in the Life

Need to rotate logs?

call mysql.rds_rotate_slow_log;

call mysql.rds_rotate_general_log;

A Day in the Life

Need to kill a process?

call mysql.rds_kill_query (99);

kills the current query for this thread.

call mysql.rds_kill (99);

kills the thread.

A Day in the Life

Managing Replication

Need to stop replication? Break it yourself!

call mysql.rds_skip_repl_error;

Skips the current replication error.

A Day in the Life

Reviewing Status Trends

Global Status History

A scheduled event snapshots global status into mysql.rds_global_status_history.

You can trend this into many tools.
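Most trending tools want rates rather than raw cumulative counters. As a rough sketch, the Python below turns a series of (timestamp, counter value) samples into per-second rates. It assumes the samples have already been fetched from the history table and makes no assumption about that table's column names. The function name is hypothetical.

```python
from datetime import datetime


def counter_rates(samples):
    """Convert cumulative counter samples into per-second rates.

    samples -- list of (timestamp, cumulative_value), sorted by time.
    Returns a list of (timestamp, rate) for each usable interval.
    """
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        elapsed = (t1 - t0).total_seconds()
        # Skip bad intervals and counter resets (for example after a
        # restart or failover, when the counter starts again from zero).
        if elapsed <= 0 or v1 < v0:
            continue
        rates.append((t1, (v1 - v0) / elapsed))
    return rates
```

The resulting pairs can then be pushed to whichever trending system is already in use.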

Monitoring MySQL

CloudWatch

● CPUUtilization
● Database Connections
● FreeStorageSpace
● Network In/Out
● Read/Write IOPS
● Read/Write Bytes
● Read/Write Latency

Monitoring MySQL

Where are the MySQL Metrics?

Cloudwatch doesn't expose them.

You can use Cacti, Graphite, Zabbix, etc. for trending.

Monitoring MySQL

Can I alert on cloudwatch metrics?

Cloudwatch allows you to set up your alerts.

But you probably want all metrics and alerts in the same system, don't you?

Monitoring MySQL

Also cloudwatch is unreliable

It often doesn't poll at every interval.

Can miss/skip important events.
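Skipped polling intervals are easy to spot once the datapoint timestamps are in hand. The Python sketch below lists the gaps in a series, assuming a fixed collection period. The function name and the 60-second default are illustrative assumptions, and fetching the datapoints is left out.

```python
from datetime import datetime


def find_gaps(timestamps, period_seconds=60):
    """Return (previous, next, missing_count) for every hole in the series."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        # Number of expected datapoints that never arrived in between.
        missing = int((cur - prev).total_seconds() // period_seconds) - 1
        if missing > 0:
            gaps.append((prev, cur, missing))
    return gaps
```

An alert on the gap count itself is one way to notice that the monitoring, rather than the database, has gone quiet.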

Monitoring MySQL

What can I use?

Nagios can poll mysql directly

Poll from graphite

Some things that suck

Moving data in and out

Want to do a dump and load upgrade?

Want to migrate to a new region?

Want to do multi-layer replication?


Migrations/Upgrades out of RDS

● Take a replica out of service.
● Dump your data.
● Upgrade your binaries.
● Load your data.
● Give replicas to your replica.
● Failover reads, then writes.

MINIMAL DOWNTIME


Migrations/Upgrades in RDS

● Dump a bunch of tables.
● Load deltas via tons of scripting.
● Keep the deltas on each table minimal.
● Take a few hours downtime.
● Sync the delta.
● Test.
● Go live and drink a lot.


This also applies to:

Moving data between regions.

Migration to EC2 from RDS.

Migrating to a datacenter from AWS

Patterns for RDS

Prototyping and Testing:

Rapid build and destroy.

Short lifecycles.

Quick testing lifecycles.

Patterns for RDS

Moderate Uptime SLAs:

Region-level SLA is 99.5% across two AZs (43.8 hours of downtime per year)

Add in failover times for multi-AZ master (6 more hours)

Expect around 4 days of downtime without multi-region
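The 43.8 hour figure follows directly from the SLA percentage. A minimal Python sketch of the arithmetic, with hypothetical function names:

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def downtime_hours_per_year(availability_pct: float) -> float:
    """Hours per year an SLA of availability_pct allows to be down."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def total_downtime_days(availability_pct: float, extra_hours: float = 0.0) -> float:
    """SLA downtime plus any extra hours, expressed in days."""
    return (downtime_hours_per_year(availability_pct) + extra_hours) / 24


print(f"{downtime_hours_per_year(99.5):.1f} hours per year at 99.5%")
print(f"{total_downtime_days(99.5, extra_hours=6):.1f} days including failovers")
```

The two itemized figures sum to roughly 50 hours, a little over two days. The talk does not break down the rest of its "around 4 days" estimate, so treat that figure as the speaker's planning number and not as something this arithmetic derives.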

Patterns for RDS

That doesn't include:

Downtime from bad queries

Downtime from user error

Downtime from upgrades/migrations

Patterns for RDS

Relaxed Latency Requirements:

Multi-AZ can introduce cross-AZ latency without AZ-specific architectural design.

EBS storage can introduce unpredictable latency without P-IOPS.

Snapshots of the master, replica builds and Multi-AZ failovers can impact write latency.

Patterns for RDS

Relaxed Latency Requirements:

If you use write-through cache, this can be mitigated

If you use significant caching, this can be mitigated

If you use AZ aware design, this can be mitigated

Patterns for RDS

Dataset Specifics:

Small datasets can allow for rapid region migrations

Read only datasets can also allow for this

Data you don't mind losing can also allow for this

Patterns for RDS

No DBA(s):

You still need DBAs to design, tune and configure.

But RDS does reduce some DBA overhead.

With investment in automation, this overhead is not significant.

Still, automation requires money/hours. If you have no budget, RDS is a good way to start.

War Stories

Obama for America:

US-East Region

Multi-AZ

5 Clusters, 30 Instances

Provisioned IOPs, 1 TB Storage

Obama for America

Data Growth:

Opsview had no visibility into the OS, and thus we were surprised regularly by storage growth. Had to build custom plugins.

Upgrading storage or instance size in Multi-AZ can cause an unpredictable downtime window.

Downtime is small, but the whole process can take 30 minutes and you don't know when the REAL downtime will occur.

Obama for America

Hurricane Sandy:

Hurricane Sandy was poised to strike Virginia and US East.

Luckily we had built out EC2 and data migration scripts.

Took 3 days solid for the whole team to build out the US-West region.

Obama for America

Human Error:

While doing rolling DDL, sql_log_bin was disabled at the global level on the master. (Damn you 5.5!!!!)

No access to binlogs made troubleshooting very challenging.

An hour of troubleshooting because we blamed the disk and had no visibility.

Had to rebuild all replicas in serial overnight once.

Obama for America

Migration to P-IOPs:

Things that make you go hmmm....

War Stories

Call of Duty, Black Ops 2:

5 Clusters, 25 instances.

US East

Multi-AZ

Provisioned IOPs

CoD Black Ops 2

Hurricane Sandy:

Data migration scripts were not set up for continuous replication.

Had to draw a line in the sand on when to move data.

Any additional data would be lost if cutover occurred.

CoD Black Ops 2

Multi-AZ Failover:

Writes required sync_binlog=0

Master failed over to standby.

All replicas stopped replicating.

DBA couldn't “change master”

Read load swarmed the master while we rebuilt.

CoD Black Ops 2

Provisioned IOPs:

Came out, super exciting!

Let's migrate!

Oh, no push button migration.

2 Senior DBAs, 3 weeks to build migration scripts and test/migrate.

Q&A

Laine Campbell, CEO PalominoDB

http://www.slideshare.net/lainecampbell
