rds for mysql, no bs operations and patterns
DESCRIPTION
Amazon's RDS for MySQL is a wonderful tool with significant value. It can also create a lot of havoc if you are not aware of its limitations and changes before you make it a core part of your environment. In this deck, we discuss those issues.
TRANSCRIPT
RDS for MySQL
No BS Operations and Patterns
Laine Campbell, CEO PalominoDB
The Party Line
Relational Database Service
Fully Managed
Simple to Deploy
Easy to Scale
Reliable
Cost Effective
Fully Managed
Ignore the man behind the curtain
Backups
Provisioning
Patching
Performance Management
Failover
Replication
Fully Managed
Backups
Snapshot Based - Same as EBS
Snapshots cause spikes in latency
Avoided in Multi-AZ
Snapshots are taken from master
Or the standby in Multi-AZ
Set up automatic schedules
Point in Time Recovery via binlogs
User executed snapshots
RDS Backups
Can I snapshot a replica?
Nope. Back up from your master.
Of course, you can promote a replica, then snapshot it for testbeds.
RDS Backups
I like RDS Backups
When using Multi-AZ
AND
When loads are minimal
It's like unicorns are flying my binlogs to heaven
Fully Managed
Provisioning
Rapid Master Launches
Master in a few minutes (or it's free?)
Standby in a different AZ? Push a button!
Rapid Replica Builds
Need more replicas? Push a button!
RDS Provisioning
Provisioning your master
Standalone - no failover or redundancy
Multi-AZ - standby in a separate availability zone
Pick your Version
Pick your maintenance window
RDS Provisioning
Overview of AZ and Regions
Amazon Regions equate to data-centers in different geographical regions. (99.5% SLA based on more than one AZ being unavailable)
Availability zones are isolated from one another in the same region to minimize impact of failures.
RDS does not interact across regions.
RDS Provisioning
Can multiple AZs save me?
Amazon states AZs do not share:
● Cooling
● Network
● Security
● Generators
● Facilities
RDS Provisioning
Can multiple AZs save me?
Apr, 2011 - US East Region EBS Failed
* Incorrect network failover.
* Saturated intra-node communications.
* Cascading failures impacted EBS in all AZs.
Jul, 2012 - US East Partial Impact
* Electrical storms impacted multiple sites.
* Failover of metadata DB took too long.
* EBS I/O was frozen to minimize corruption.
RDS Provisioning
Can multiple AZs save me?
They can reduce risk.
Cross-AZ latency can vary by as much as 3x. (Too slow to allow MySQL Cluster across AZs.)
A Multi-AZ failover can degrade performance when minimal latency is required.
Multi-AZ Failover
From AWS Docs
RDS Provisioning
Multi-AZ Magical Failover
Replicates via unicorn express
Fails over quite often, with up to 30 seconds of downtime
You do not get to choose your failover AZ
Typical I/O write impact for synchronous replication, aka unicorn express
Multi-AZ Failover
From AWS Blog
RDS Provisioning
Pick Your Version
MySQL 5.1 or MySQL 5.5
:( No MariaDB :(
:( No XtraDB :(
:( No Drizzle :(
:( No TokuDB :(
RDS Provisioning
Pick Your Maintenance Window
30 minute window in which your software patching can occur
Can be different for different instances
You need to plan ahead for instances to be out of service.
RDS Provisioning
They'll shut off my DB????
RDS Provisioning
Auto-Version Minor Upgrade
If you choose no, you will not experience automatic upgrades (and thus downtime).
Some critical security patches can still be applied.
RDS team is fairly good about communicating upgrades.
RDS Provisioning
Basic Instance Types
Micro - 630 MB RAM, 2 ECU - Low I/O
Small - 1.7 GB RAM, 1 ECU - Med I/O
Large - 7.5 GB RAM, 4 ECU - High I/O
XLarge - 15 GB RAM, 8 ECU - High I/O
RDS Provisioning
Fancy Instance Types
High Mem XL - 17.1 GB RAM, 6.5 ECU - High I/O
High Mem 2XL - 34 GB RAM, 13 ECU - High I/O
High Mem 4XL - 68 GB RAM, 26 ECU - High I/O
RDS Provisioning
Storage Provisioning
From 5 GB to 3 TB
At 300 GB, EBS Volumes start to get striped.
Striping = better performance
Provisioned IOPS (up to 30,000) = more stable I/O and costs more too!
RDS Provisioning
Virtual Private Cloud (VPC)
Allows you to create your own virtual network simulating traditional DC networks.
You must create a DB Subnet Group in VPC
VPC Subnets cannot cross availability zones.
VPC security group allows access control to your DB
RDS Provisioning
Virtual Private Cloud (VPC)
Mixed architectures, with some VPC and some non-VPC, create major issues.
Auto-scaling becomes difficult.
Don't do it!
RDS Provisioning
Database Security Groups
Controls all MySQL access to RDS instances.
Defaults to "deny all"
Access can be granted by IP Range and EC2 sec groups.
RDS Provisioning
Database Security Groups
Don't grant access to 10.x.x.x, use a security group.
IPs entered with CIDR - Classless Inter-Domain Routing
Make sure you understand CIDR! (or you may have unwelcome visitors!)
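To make the CIDR warning concrete, here is a small sketch using Python's standard ipaddress module. The example ranges and client addresses are illustrative assumptions, not values from the deck. It shows how much address space a prefix actually opens up.

```python
import ipaddress

def describe_cidr(cidr):
    """Return (first address, last address, address count) for a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=True)
    return str(net[0]), str(net[-1]), net.num_addresses

def is_allowed(client_ip, cidr):
    """True if client_ip falls inside the CIDR block."""
    return ipaddress.ip_address(client_ip) in ipaddress.ip_network(cidr)

# Hypothetical ranges: a whole /8, a single /24 subnet, and one host.
for cidr in ("10.0.0.0/8", "10.1.2.0/24", "10.1.2.15/32"):
    first, last, count = describe_cidr(cidr)
    print(f"{cidr}: {first} - {last} ({count} addresses)")
```

Granting 10.0.0.0/8 admits every address that starts with 10, which in a shared private address space may include hosts you do not control. A /32 admits exactly one host.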
RDS Provisioning
Parameter Groups
Defines parameters used by your RDS instances.
There is a "default" group that you can modify.
One or more RDS instances can map to an individual parameter group.
RDS Provisioning
Parameter Group Best Practices
Don't ever use the default group.
The default group doesn't allow dynamic parameter changes. Everything requires a restart.
Build different groups for each mysql master/replica grouping.
RDS Provisioning
Parameter Group Best Practices
Use different parameter groups for masters vs. replicas.
Consider using different parameter groups for different replica types (app query, ad hoc, ETL)
Remember to use test environments. Test!!!
RDS Provisioning
Why different parameter groups?
Granularity - Do you want to apply the same parameter to everything in the cluster?
● Read Only?
● Slow Logging?
● innodb_flush_method
RDS Provisioning
Provisioning your Replicas
Does not have to be the same instance type as the master.
Pick your availability zone (great for mapping replicas to app servers in the same AZ.)
Don't forget to apply a different parameter group than your master.
RDS Provisioning
Provisioning your Replicas
Adding a replica impacts your master performance. (If not in Multi-AZ)
You can only launch in serial - and it can take a non-trivial amount of time to launch.
Adding many replicas can take a while. Script it!
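As one way to "script it", the sketch below creates replicas one at a time and waits for each to become available before starting the next. It assumes `rds` is a boto3 RDS client (`boto3.client("rds")`), which the deck does not mention. The instance identifiers, instance class, and polling interval are hypothetical.

```python
import time

def build_replicas_serially(rds, source_id, replica_ids, instance_class,
                            poll_seconds=30, sleep=time.sleep):
    """Create read replicas one at a time, waiting for each to be available.

    `rds` is assumed to be a boto3 RDS client; any object exposing the same
    two methods works, which keeps the function easy to dry-run.
    """
    built = []
    for replica_id in replica_ids:
        rds.create_db_instance_read_replica(
            DBInstanceIdentifier=replica_id,
            SourceDBInstanceIdentifier=source_id,
            DBInstanceClass=instance_class,
        )
        # Poll until this replica is available before starting the next one.
        while True:
            resp = rds.describe_db_instances(DBInstanceIdentifier=replica_id)
            status = resp["DBInstances"][0]["DBInstanceStatus"]
            if status == "available":
                break
            sleep(poll_seconds)
        built.append(replica_id)
    return built
```

A production version would likely add a timeout and handle failed builds. This sketch only shows the serial loop.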
RDS Provisioning
What can I do with my replica?
Send queries to it
Promote it to a master
Poke it with a stick
Use it for special purposes (mysqldump, ETL, ad hoc)
RDS Provisioning
Sending queries to the replica?
Set up Route53 CNAMEs - weighted round robin.
Internal elastic load balancer in the VPC.
VPC/Route53 does not do a mysql health check.
HAProxy can be leveraged.
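Since Route53 and the VPC load balancer do not check MySQL health, the routing layer has to combine weighting with its own check. The sketch below illustrates that idea on the client side. It is not HAProxy configuration. The hostnames and weights are hypothetical, and `is_healthy` stands in for a real check that would run a query against each replica.

```python
import random

def pick_replica(replicas, is_healthy, rng=random):
    """Pick one healthy replica at random, in proportion to its weight.

    replicas: list of (hostname, weight) pairs.
    is_healthy: callable(hostname) -> bool. A real check would connect to
    MySQL and verify, for example, that replication is running.
    """
    healthy = [(host, weight) for host, weight in replicas if is_healthy(host)]
    if not healthy:
        raise RuntimeError("no healthy replicas available")
    hosts = [host for host, _ in healthy]
    weights = [weight for _, weight in healthy]
    return rng.choices(hosts, weights=weights, k=1)[0]

# Hypothetical usage: replica-b is marked unhealthy, so it is never chosen.
replicas = [("replica-a.example", 3), ("replica-b.example", 1)]
print(pick_replica(replicas, lambda host: host != "replica-b.example"))
```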
RDS Provisioning
Replica master Promotion
This is a great way to build a test environment.
Can be leveraged for rolling migrations
But a replica can't have a replica! Must promote first!
RDS Provisioning
Replica promotion for failover
This can be used instead of Multi-AZ. Why?
When using sync_binlog=0, a master failover in Multi-AZ may strand your replicas.
Old log doesn't close correctly. Replica cannot proceed. And you can't move to the next log!
RDS Provisioning
All of my replicas must be rebuilt!
A Day in the Life
What does an RDS DBA do?
A Day in the Life
What does an RDS DBA do?
Need a replica?
Push a button or call an API.
Need to create a test environment?
Promote a replica, call an API.
New Cluster?
Push a button or call an API.
A Day in the Life
What does an RDS DBA do?
Need a backup?
Push a button or call an API.
Need to recover a database?
Push a button or call an API.
New Cluster?
Push a button or call an API.
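For the "call an API" path, here is a minimal sketch of a user-initiated snapshot and a point-in-time recovery. It assumes `rds` is a boto3 RDS client, which the deck does not name. The instance and snapshot identifiers are hypothetical.

```python
def snapshot_instance(rds, instance_id, snapshot_id):
    """Request a user-initiated snapshot of an RDS instance."""
    return rds.create_db_snapshot(
        DBInstanceIdentifier=instance_id,
        DBSnapshotIdentifier=snapshot_id,
    )

def restore_to_latest(rds, source_id, target_id):
    """Restore a NEW instance from the source's latest restorable time.

    Point-in-time recovery always creates a new instance; it does not
    rewind the source in place.
    """
    return rds.restore_db_instance_to_point_in_time(
        SourceDBInstanceIdentifier=source_id,
        TargetDBInstanceIdentifier=target_id,
        UseLatestRestorableTime=True,
    )
```

With a real client this would be called as, for example, `snapshot_instance(boto3.client("rds"), "prod-master", "prod-master-pre-ddl")`, where both names are placeholders.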
A Day in the Life
Need to do a query review?
You don't have access to the logs at the filesystem level.
You can look in the console or via API for some initial diagnostics.
A Day in the Life
Query Reviews
Need to do a REAL query review?
Log to the csv table - slow_log
mysql -u user -p -h host.rds.amazonaws.com -D mysql -s -r -e "SELECT CONCAT( '# Time: ', DATE_FORMAT(start_time, '%y%m%d %H%i%s'), '\n', '# User@Host: ', user_host, '\n', '# Query_time: ', TIME_TO_SEC(query_time), ' Lock_time: ', TIME_TO_SEC(lock_time), ' Rows_sent: ', rows_sent, ' Rows_examined: ', rows_examined, '\n', sql_text, ';' ) FROM mysql.slow_log" > /tmp/mysql.slow_log.log
pt-query-digest --limit 100% /tmp/mysql.slow_log.log > /tmp/query-digest.txt
A Day in the Life
Query Reviews
No Microsecond Patch
Using long-query-time=0 logs all queries
But they are recorded with a query time of 0
You have no accurate profile of query time for queries under 1 sec.
You also can't use TCPDump on the MySQL Instance.
We often use TCPDump when logging everything would drop performance on the DB instance to unacceptable levels.
WHICH IT CAN
A Day in the Life
Need to rotate logs?
call mysql.rds_rotate_slow_log;
call mysql.rds_rotate_general_log;
A Day in the Life
Need to kill a process?
call mysql.rds_kill_query (99);
kills the current query for this thread.
call mysql.rds_kill (99);
kills the thread.
A Day in the Life
Managing Replication
Need to stop replication? Break it yourself!
call mysql.rds_skip_repl_error;
Skips the current replication error.
A Day in the Life
Reviewing Status Trends
Global Status History
A scheduled event snapshots global status into mysql.rds_global_status_history.
You can trend this into many tools.
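Most status counters are cumulative, so trending them usually means converting successive snapshots into rates. The sketch below shows that conversion. The input shape (a timestamp plus a dict of counters per snapshot) is an assumption for illustration, not the actual layout of mysql.rds_global_status_history.

```python
def per_second_rates(snapshots):
    """Turn cumulative counter snapshots into per-second rates.

    snapshots: list of (unix_timestamp, {counter_name: cumulative_value}),
    ordered by time.
    Returns a list of (timestamp, {counter_name: rate}) per interval.
    """
    rates = []
    for (t0, prev), (t1, curr) in zip(snapshots, snapshots[1:]):
        elapsed = t1 - t0
        if elapsed <= 0:
            continue  # skip out-of-order or duplicate samples
        interval = {}
        for name, value in curr.items():
            # A drop in a cumulative counter means it was reset
            # (for example by a restart), so that interval is skipped.
            if name in prev and value >= prev[name]:
                interval[name] = (value - prev[name]) / elapsed
        rates.append((t1, interval))
    return rates

# Hypothetical samples taken 60 seconds apart.
samples = [(0, {"Questions": 100}), (60, {"Questions": 700})]
print(per_second_rates(samples))
```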
Monitoring MySQL
Cloudwatch
CPUUtilization
Database Connections
FreeStorageSpace
Network In/Out
Read/Write IOPs
Read/Write Bytes
Read/Write Latency
Monitoring MySQL
Where are the MySQL Metrics?
Cloudwatch doesn't expose them.
You can use: Cacti, Graphite, Zabbix, etc... for trending.
Monitoring MySQL
Can I alert on cloudwatch metrics?
Cloudwatch allows you to set up your alerts.
But you probably want all metrics and alerts in the same system, don't you?
Monitoring MySQL
Also cloudwatch is unreliable
It often doesn't poll at every interval.
Can miss/skip important events.
Monitoring MySQL
What can I use?
Nagios can poll mysql directly
Poll from graphite
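As a sketch of feeding MySQL metrics into Graphite, the functions below format counters in Graphite's plaintext protocol ("path value timestamp", one metric per line) and send them over TCP. The metric prefix, host name, and the idea that the values come from polling SHOW GLOBAL STATUS are assumptions for illustration. Port 2003 is Graphite's usual plaintext listener, but check your own setup.

```python
import socket
import time

def graphite_lines(prefix, metrics, timestamp=None):
    """Format a dict of metrics as Graphite plaintext protocol lines."""
    ts = int(time.time() if timestamp is None else timestamp)
    return [f"{prefix}.{name} {value} {ts}"
            for name, value in sorted(metrics.items())]

def send_to_graphite(lines, host, port=2003, timeout=5):
    """Send preformatted lines to a Graphite plaintext listener over TCP."""
    payload = ("\n".join(lines) + "\n").encode("ascii")
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(payload)

# Hypothetical values, as if read from SHOW GLOBAL STATUS on one instance.
print(graphite_lines("mysql.db1", {"Threads_running": 4, "Questions": 1200}))
```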
Some things that suck
Moving data in and out
Want to do a dump and load upgrade?
Want to migrate to a new region?
Want to do multi-layer replication?
Some things that suck
Migrations/Upgrades out of RDS
Take a replica out of service.
Dump your data.
Upgrade your binaries.
Load your data.
Give replicas to your replica.
Failover reads, then writes.
MINIMAL DOWNTIME
Some things that suck
Migrations/Upgrades in RDS
Dump a bunch of tables.
Load deltas via tons of scripting.
Keep the deltas on each table minimal.
Take a few hours downtime.
Sync the delta.
Test.
Go live and drink a lot.
Some things that suck
This also applies to:
Moving data between regions.
Migration to EC2 from RDS.
Migrating to a datacenter from AWS
Patterns for RDS
Prototyping and Testing:
Rapid build and destroy.
Short lifecycles.
Quick testing lifecycles.
Patterns for RDS
Moderate Uptime SLAs:
Region Level SLA is 99.5% across two AZs (43.8 hours of downtime per year)
Add in failover times for Multi-AZ master (6 more hours)
Expect around 4 days of downtime without multi-region
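The 43.8 hour figure follows directly from the SLA percentage. The sketch below reproduces it and adds the 6 hours of failover time quoted above. Nothing else is assumed beyond a 365-day year.

```python
HOURS_PER_YEAR = 365 * 24  # 8760

def allowed_downtime_hours(sla_percent):
    """Hours per year an SLA of sla_percent leaves uncovered."""
    return round(HOURS_PER_YEAR * (100 - sla_percent) / 100, 1)

region_hours = allowed_downtime_hours(99.5)   # regional SLA
failover_hours = 6                            # Multi-AZ failovers, from the slide
total_hours = round(region_hours + failover_hours, 1)

print(f"region: {region_hours} h, with failovers: {total_hours} h "
      f"(about {total_hours / 24:.1f} days)")
```

These two items alone come to about 49.8 hours, a little over two days. The larger planning figure on the slide is the author's estimate and is not derived here.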
Patterns for RDS
That doesn't include:
Downtime from bad queries
Downtime from user error
Downtime from upgrades/migrations
Patterns for RDS
Relaxed Latency Requirements:
Multi-AZ can introduce cross-AZ latency without AZ specific architectural design.
EBS storage can introduce unpredictable latency without P-IOPS
Snapshots of master, replica builds and Multi-AZ failovers can impact write latency.
Patterns for RDS
Relaxed Latency Requirements:
If you use write-through cache, this can be mitigated
If you use significant caching, this can be mitigated
If you use AZ aware design, this can be mitigated
Patterns for RDS
Dataset Specifics:
Small datasets can allow for rapid region migrations
Read only datasets can also allow for this
Data you don't mind losing can also allow for this
Patterns for RDS
No DBA(s):
You still need DBAs to design, tune and configure.
But RDS does reduce some DBA overhead.
With investment in automation, this overhead is not significant.
Still, automation requires money/hours. If you have no budget, RDS is a good way to start.
War Stories
Obama for America:
US-East Region
Multi-AZ
5 Clusters, 30 Instances
Provisioned IOPs, 1 TB Storage
Obama for America
Data Growth:
Opsview had no visibility into the OS, and thus we were surprised regularly by storage growth. Had to build custom plugins.
Upgrading storage or instance size in Multi-AZ can cause an unpredictable downtime window.
Downtime is small, but the whole process can take 30 minutes and you don't know when the REAL downtime will occur.
Obama for America
Hurricane Sandy:
Hurricane Sandy was poised to strike Virginia and US East.
Luckily we had built out EC2 and data migration scripts.
Took 3 days solid for the whole team to build out US-West region.
Obama for America
Human Error:
While doing rolling DDL, sql_log_bin was disabled at the global level on master. (Damn you 5.5!!!!)
No access to binlogs made troubleshooting very challenging.
An hour of troubleshooting because we blamed the disk and had no visibility.
Had to rebuild all replicas in serial overnight once
Obama for America
Migration to P-IOPs:
Things that make you go hmmm....
War Stories
Call of Duty, Black Ops 2:
5 Clusters, 25 instances.
US East
Multi-AZ
Provisioned IOPs
CoD Black Ops 2
Hurricane Sandy:
Data migration scripts not set up for continuous replication.
Had to draw a line in the sand on when to move data.
Any additional data would be lost if cutover occurred.
CoD Black Ops 2
Multi-AZ Failover:
Writes required sync_binlog=0
Master failed over to standby.
All replicas stopped replicating.
DBA couldn't “change master”
Read load swarmed the master while we rebuilt.
CoD Black Ops 2
Provisioned IOPs:
Came out, super exciting!
Let's migrate!
Oh, no push button migration.
2 Senior DBAs, 3 weeks to build migration scripts and test/migrate.
Q&A
Laine Campbell, CEO PalominoDB
http://www.slideshare.net/lainecampbell