Lessons learned while automating MySQL in the AWS cloud
Stephane Combaudon, DB Engineer - Slice
Our environment
● 5 DB stacks
– Data volume ranging from 30GB to 2TB+.
● Master + N slaves for each stack.
– The master handles all application traffic.
– Specialized slaves (backups, reports, custom jobs).
● Stacks are duplicated across several dimensions
– Regions (US, JP)
– Environments (QA, Staging, Prod)
Problems we wanted to fix
● Hosted in the AWS cloud, but relying on a 3rd party vendor for DB automation.
● The 3rd party vendor became a liability
– Expensive
– Automation only works with MySQL 5.5
– Security issues
– Failover unavailable
Our goals
● Create our own MySQL automation!
● Instance lifecycle
– DBAs/SAs create an instance from a template.
– Software gets provisioned automatically.
– Data gets provisioned automatically.
– Replication (if slave) starts automatically.
● Bonus: add ability to fail over to a slave easily.
● How can we get there?
Technical Solution Overview
● Creating instances from a template
– CloudFormation
● Installing software
– Chef
● Data provisioning
– Galera? Custom scripts?
● High availability
– Galera? MHA?
CloudFormation
● Provides a way to manage AWS resources through templates (infrastructure as code).
● A CloudFormation template
– Is a JSON file.
– Describes the configuration of your resources.
● Pro: any AWS resource can be described.
● Con: the learning curve is steep.
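As a rough illustration of the format, here is a minimal, hypothetical template describing a single EC2 instance for a MySQL server. Everything in it (parameter, resource name, instance type, tag) is invented for this sketch, not taken from the talk:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch: one EC2 instance for a MySQL server",
  "Parameters": {
    "ImageId": { "Type": "AWS::EC2::Image::Id" }
  },
  "Resources": {
    "DbInstance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "ImageId": { "Ref": "ImageId" },
        "InstanceType": "m4.xlarge",
        "Tags": [{ "Key": "Role", "Value": "mysql" }]
      }
    }
  }
}
```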
AWS Components - 1st try
[Diagram: Master and Slaves 1…N in an Autoscaling Group; MHA Manager on a standalone EC2 instance]
Data Provisioning
● Galera
– Natively solves data provisioning + HA issue.
– But not a good fit for all our workloads + app changes needed.
● Let's write a custom provisioning script!
– For a master
● Do nothing. We only create a master for a new (empty) stack.
– For a slave
● Restore the latest available backup.
● Start replication.
– But how will servers know whether they're a master or a slave?
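The "restore the latest available backup" step needs a way to pick the newest backup. A minimal sketch, assuming backups are uploaded to S3 under keys that embed a timestamp (the key layout below is invented for illustration):

```python
from datetime import datetime

def latest_backup_key(keys):
    """Pick the most recent backup from a list of S3 keys.

    Assumes a hypothetical naming convention like
    'backups/<stack>/2017-04-25T03-00-00.xbstream'.
    """
    def ts(key):
        # Extract the timestamp part of the file name and parse it.
        stamp = key.rsplit("/", 1)[-1].split(".")[0]
        return datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%S")
    return max(keys, key=ts)
```

The provisioning script would then download that key and run the XtraBackup restore before starting replication.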
MHA
● Automated vs semi-automated mode
– The app is not ready for automatic MySQL failover.
– Semi-automated mode is chosen
● Master failure detection is manual; slave promotion is a single command.
● MHA requirement
– The MHA configuration needs to know the exact instances of the replication topology.
Back to AWS components
● CloudFormation allows you to add dependencies between components
– Create the MHA Manager.
– Add IP of MHA Manager in some file of the MySQL servers when they are created by CloudFormation.
– During MySQL bootstrap, add IP of MySQL server to MHA config.
● But there’s a catch: if MHA Manager goes down, we lose our failover ability.
AWS Components - 2nd try
[Diagram: Master and Slaves 1…N in an Autoscaling Group; MHA Manager in an Autoscaling Group of 1 instance]
Back to AWS components again
● CloudFormation can no longer know the IP of the MHA Manager in advance.
– Therefore MySQL servers can no longer register themselves in the MHA config file.
● Once again, we need service discovery.
Service Discovery - 1
● No such service available in our infrastructure.
● We tried several options
– ZooKeeper, etcd: another infrastructure to manage.
– DynamoDB: race conditions.
● In the end the AWS API seemed a strong enough option.
Service Discovery - 2
● All components in a CF stack share a tag (aws:cloudformation:stack-name).
● Within a CF stack, the names of the ASGs are predictable.
● We can then find the IP addresses of all instances within a specific ASG.
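Putting the three facts together, discovery reduces to filtering EC2 DescribeInstances results by those tags. A sketch, assuming data shaped like the `Reservations` list of a DescribeInstances response (e.g. from boto3's `describe_instances()`); the stack and ASG names in the usage are illustrative:

```python
def instance_ips(reservations, stack_name, asg_name):
    """Return private IPs of running instances belonging to a given
    CloudFormation stack and Autoscaling Group."""
    ips = []
    for reservation in reservations:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if (tags.get("aws:cloudformation:stack-name") == stack_name
                    and tags.get("aws:autoscaling:groupName") == asg_name
                    and inst["State"]["Name"] == "running"):
                ips.append(inst["PrivateIpAddress"])
    return ips
```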
Back to MHA config
● Now instances are able to register themselves when bootstrapping.
● Upon instance termination
– The MHA config needs to be updated.
– Hooks can be added to run a custom script, but they are not very fast.
– What else can we do?
Another MHA problem - 1
● MHA command lines are not very user friendly.
● We built a wrapper script.
– Simpler options
– Autocompletion
mha@manager$ masterha_master_switch --conf=/etc/mha.conf --master_state=alive --new_master_host=172.25.2.73 --orig_master_is_new_slave --interactive=0
root@manager# db_ha promote --new_master=172.25.2.73
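The wrapper essentially expands a short command into the full masterha_master_switch invocation shown above. A minimal sketch of that expansion (the flags come from the slide; the function name and its defaults are ours):

```python
def promote_command(new_master, conf="/etc/mha.conf"):
    """Expand the friendly `db_ha promote` options into the
    underlying masterha_master_switch invocation."""
    return [
        "masterha_master_switch",
        "--conf=%s" % conf,
        "--master_state=alive",
        "--new_master_host=%s" % new_master,
        "--orig_master_is_new_slave",
        "--interactive=0",
    ]
```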
Another MHA problem - 2
● Wait, couldn't we also sync the MHA conf when running this script?
– Yes, of course!
● The MHA conf is synced on demand by this script
– Ensures the conf is always up-to-date when we need it.
– No more need to care about MySQL instance termination.
Another MHA problem - 3
● So far, so good, but
– Some of our slaves are not suitable at all to become master.
● We want no_master=1 in the MHA config for these servers.
– The MHA Manager just knows a bunch of MySQL servers; how can it add the no_master flag?
● We need to refine our AWS components diagram again.
AWS Components - 3rd try
[Diagram: Master and potential-master slaves in ASG1 (Potential Masters); remaining slaves in ASG2 (Slaves only); MHA Manager in an Autoscaling Group of 1 instance]
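Once the Manager can tell ASG1 hosts from ASG2 hosts, generating the server sections of the MHA config is straightforward. A sketch (the host lists, ordering, and IPs are illustrative; the `[serverN]`, `hostname`, and `no_master` keys are standard MHA config syntax):

```python
def render_mha_conf(potential_masters, slaves_only):
    """Render MHA server sections: hosts from ASG1 can be promoted,
    hosts from ASG2 get no_master=1 so MHA never picks them."""
    lines = []
    for i, host in enumerate(potential_masters + slaves_only, start=1):
        lines.append("[server%d]" % i)
        lines.append("hostname=%s" % host)
        if host in slaves_only:
            lines.append("no_master=1")
    return "\n".join(lines)
```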
Recap so far
● At this point
– We can create an arbitrary number of MySQL servers.
– MHA config is synced automatically.
– Any node (MySQL or Manager) that fails is rebuilt automatically thanks to ASGs.
● All good? Not exactly…
Back to Data Provisioning
● We have separate code paths for masters and slaves.
– But how do we know if a new instance is a master or a slave?
● Let's use the AWS API again
– If the instance is part of ASG2: slave.
– If the instance is part of ASG1: the 1st instance is the master, the others are slaves.
– We add a replication_role tag to each instance.
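That decision rule can be sketched as a small function. Taking "1st instance of ASG1" to mean the oldest-launched member is an assumption on our part:

```python
def replication_role(instance_id, asg_name, asg1_members):
    """Decide the replication_role of a new instance.

    asg1_members: instance ids of ASG1, assumed ordered by launch
    time (oldest first). Rule: ASG2 => slave; first instance of
    ASG1 => master; any other ASG1 instance => slave.
    """
    if asg_name == "ASG2":
        return "slave"
    return "master" if asg1_members and asg1_members[0] == instance_id else "slave"
```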
Backups - XtraBackup vs EBS snapshots
● EBS snapshots
– Simple to use and super fast (incremental backups).
● XtraBackup
– Very complex, super slow. Incremental backups are difficult.
● Let's use EBS snapshots then?
– Well, not so fast
Backups vs Restores
● EBS snapshots are great for backups, not for restores.
– Data is lazily loaded from S3, i.e. warmup takes forever.
● Example on our write-heaviest cluster
– Restore + replication catchup with XtraBackup: 8-9 hours.
– Same with EBS snapshots: I gave up after 2 days.
Backup Script
● XtraBackup takes a full backup.
● The backup is uploaded to S3.
● Frequency of backups is stack-dependent
– Configuration file in S3
● Tags are added on backup servers
– Timestamp and status of the latest backup.
– Progress bar while a backup is running.
Roadmap
● Migration to 5.7
– Automation already supports both 5.5 and 5.7.
● Better monitoring of errors on restores.
● Integration with PMM
– Implemented but broken.
● Realtime binlog streaming to Elastic File System
– Implemented but broken.
● Group Replication instead of MHA.