Lessons learned while automating MySQL in the AWS cloud
Stephane Combaudon, DB Engineer - Slice
Our environment
● 5 DB stacks
– Data volume ranging from 30GB to 2TB+.
● Master + N slaves for each stack.
– The master handles all application traffic.
– Specialized slaves (backups, reports, custom jobs).
● Stacks are duplicated across several dimensions
– Regions (US, JP)
– Environments (QA, Staging, Prod)
Problems we wanted to fix
● Hosted in the AWS cloud, but relying on a 3rd party vendor for DB automation.
● The 3rd party vendor became a liability
– Expensive
– Automation only works with MySQL 5.5
– Security issues
– Failover unavailable
Our goals
● Create our own MySQL automation!
● Instance lifecycle
– DBAs/SAs create an instance from a template.
– Software gets provisioned automatically.
– Data gets provisioned automatically.
– Replication (if slave) starts automatically.
● Bonus: add ability to fail over to a slave easily.
● How can we get there?
Technical Solution Overview
● Creating instances from a template
– CloudFormation
● Installing software
– Chef
● Data provisioning
– Galera? Custom scripts?
● High availability
– Galera? MHA?
CloudFormation
● Provides a way to manage AWS resources through templates (infrastructure as code).
● A CloudFormation template
– Is a JSON file.
– Describes the configuration of your resources.
● Pro: any AWS resource can be described.
● Con: the learning curve is steep.
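As a rough illustration of the format, here is a minimal, hypothetical template describing a single EC2 instance for a MySQL server. Everything in it (parameter, resource name, instance type, tag) is invented for this sketch, not taken from the talk:

```json
{
  "AWSTemplateFormatVersion": "2010-09-09",
  "Description": "Sketch: one EC2 instance for a MySQL server",
  "Parameters": {
    "ImageId": { "Type": "AWS::EC2::Image::Id" }
  },
  "Resources": {
    "DbInstance": {
      "Type": "AWS::EC2::Instance",
      "Properties": {
        "ImageId": { "Ref": "ImageId" },
        "InstanceType": "m4.xlarge",
        "Tags": [{ "Key": "Role", "Value": "mysql" }]
      }
    }
  }
}
```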
AWS Components - 1st try
[Diagram: Master and Slaves 1…N in an Autoscaling Group; MHA Manager on a standalone EC2 instance]
Data Provisioning
● Galera
– Natively solves data provisioning + HA issue.
– But not a good fit for all our workloads + app changes needed.
● Let's write a custom provisioning script!
– For a master
● Do nothing. We only create a master for a new (empty) stack.
– For a slave
● Restore the latest available backup.
● Start replication.
– But how will servers know whether they're a master or a slave?
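The "restore the latest available backup" step needs a way to pick the newest backup. A minimal sketch, assuming backups are uploaded to S3 under keys that embed a timestamp (the key layout below is invented for illustration):

```python
from datetime import datetime

def latest_backup_key(keys):
    """Pick the most recent backup from a list of S3 keys.

    Assumes a hypothetical naming convention like
    'backups/<stack>/2017-04-25T03-00-00.xbstream'.
    """
    def ts(key):
        # Extract the timestamp part of the file name and parse it.
        stamp = key.rsplit("/", 1)[-1].split(".")[0]
        return datetime.strptime(stamp, "%Y-%m-%dT%H-%M-%S")
    return max(keys, key=ts)
```

The provisioning script would then download that key and run the XtraBackup restore before starting replication.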
MHA
● Automated vs semi-automated mode
– The app is not ready for automatic MySQL failover.
– Semi-automated mode is chosen
● Master failure detection is manual; slave promotion is a single command.
● MHA requirement
– The MHA configuration needs to know the exact instances of the replication topology.
Back to AWS components
● CloudFormation allows you to add dependencies between components
– Create the MHA Manager.
– Add IP of MHA Manager in some file of the MySQL servers when they are created by CloudFormation.
– During MySQL bootstrap, add IP of MySQL server to MHA config.
● But there’s a catch: if MHA Manager goes down, we lose our failover ability.
AWS Components - 2nd try
[Diagram: Master and Slaves 1…N in an Autoscaling Group; MHA Manager in an Autoscaling Group of 1 instance]
Back to AWS components again
● CloudFormation can no longer know the IP of the MHA Manager in advance.
– Therefore MySQL servers can no longer register themselves in the MHA config file.
● Once again, we need service discovery.
Service Discovery - 1
● No such service available in our infrastructure.
● We tried several options
– ZooKeeper, etcd: another infrastructure to manage.
– DynamoDB: race conditions.
● In the end the AWS API seemed a strong enough option.
Service Discovery - 2
● All components in a CF stack share a tag (aws:cloudformation:stack-name).
● Within a CF stack, the names of the ASGs are predictable.
● We can then find the IP addresses of all instances within a specific ASG.
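Putting the three facts together, discovery reduces to filtering EC2 DescribeInstances results by those tags. A sketch, assuming data shaped like the `Reservations` list of a DescribeInstances response (e.g. from boto3's `describe_instances()`); the stack and ASG names in the usage are illustrative:

```python
def instance_ips(reservations, stack_name, asg_name):
    """Return private IPs of running instances belonging to a given
    CloudFormation stack and Autoscaling Group."""
    ips = []
    for reservation in reservations:
        for inst in reservation["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if (tags.get("aws:cloudformation:stack-name") == stack_name
                    and tags.get("aws:autoscaling:groupName") == asg_name
                    and inst["State"]["Name"] == "running"):
                ips.append(inst["PrivateIpAddress"])
    return ips
```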
Back to MHA config
● Now instances are able to register themselves when bootstrapping.
● Upon instance termination
– The MHA config needs to be updated.
– Hooks can be added to run a custom script, but they are not very fast.
– What else can we do?
Another MHA problem - 1
● MHA command lines are not very user friendly.
● We built a wrapper script.
– Simpler options
– Autocompletion
mha@manager$ masterha_master_switch --conf=/etc/mha.conf --master_state=alive --new_master_host=172.25.2.73 --orig_master_is_new_slave --interactive=0
root@manager# db_ha promote --new_master=172.25.2.73
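The wrapper essentially expands a short command into the full masterha_master_switch invocation shown above. A minimal sketch of that expansion (the flags come from the slide; the function name and its defaults are ours):

```python
def promote_command(new_master, conf="/etc/mha.conf"):
    """Expand the friendly `db_ha promote` options into the
    underlying masterha_master_switch invocation."""
    return [
        "masterha_master_switch",
        "--conf=%s" % conf,
        "--master_state=alive",
        "--new_master_host=%s" % new_master,
        "--orig_master_is_new_slave",
        "--interactive=0",
    ]
```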
Another MHA problem - 2
● Wait, couldn't we also sync the MHA conf when running this script?
– Yes, of course!
● The MHA conf is synced on demand by this script
– Ensures the conf is always up-to-date when we need it.
– No more need to care about MySQL instance termination.
Another MHA problem - 3
● So far, so good, but
– Some of our slaves are not suitable at all to become master.
● We want no_master=1 in the MHA config for these servers.
– The MHA Manager just knows a bunch of MySQL servers; how can it add the no_master flag?
● We need to refine our AWS components diagram again.
AWS Components - 3rd try
[Diagram: Master and potential-master slaves in ASG1 (Potential Masters); remaining slaves in ASG2 (Slaves only); MHA Manager in an Autoscaling Group of 1 instance]
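Once the Manager can tell ASG1 hosts from ASG2 hosts, generating the server sections of the MHA config is straightforward. A sketch (the host lists, ordering, and IPs are illustrative; the `[serverN]`, `hostname`, and `no_master` keys are standard MHA config syntax):

```python
def render_mha_conf(potential_masters, slaves_only):
    """Render MHA server sections: hosts from ASG1 can be promoted,
    hosts from ASG2 get no_master=1 so MHA never picks them."""
    lines = []
    for i, host in enumerate(potential_masters + slaves_only, start=1):
        lines.append("[server%d]" % i)
        lines.append("hostname=%s" % host)
        if host in slaves_only:
            lines.append("no_master=1")
    return "\n".join(lines)
```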
Recap so far
● At this point
– We can create an arbitrary number of MySQL servers.
– MHA config is synced automatically.
– Any node (MySQL or Manager) that fails is rebuilt automatically thanks to ASGs.
● All good? Not exactly…
Back to Data Provisioning
● We have separate code paths for masters and slaves.
– But how do we know if a new instance is a master or a slave?
● Let's use the AWS API again
– If the instance is part of ASG2: slave.
– If the instance is part of ASG1: the 1st instance is the master, the others are slaves.
– We add a replication_role tag to each instance.
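That decision rule can be sketched as a small function. Taking "1st instance of ASG1" to mean the oldest-launched member is an assumption on our part:

```python
def replication_role(instance_id, asg_name, asg1_members):
    """Decide the replication_role of a new instance.

    asg1_members: instance ids of ASG1, assumed ordered by launch
    time (oldest first). Rule: ASG2 => slave; first instance of
    ASG1 => master; any other ASG1 instance => slave.
    """
    if asg_name == "ASG2":
        return "slave"
    return "master" if asg1_members and asg1_members[0] == instance_id else "slave"
```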
Backups - XtraBackup vs EBS snapshots
● EBS snapshots
– Simple to use and super fast (incremental backups).
● XtraBackup
– Very complex, super slow. Incremental backups are difficult.
● Let's use EBS snapshots then?
– Well, not so fast
Backups vs Restores
● EBS snapshots are great for backups, not for restores.
– Data is lazily loaded from S3, i.e. warmup takes forever.
● Example on our write-heaviest cluster
– Restore + replication catchup with XtraBackup: 8-9 hours.
– Same with EBS snapshots: I gave up after 2 days.
Backup Script
● XtraBackup takes a full backup.
● The backup is uploaded to S3.
● Frequency of backups is stack-dependent
– Configuration file in S3
● Tags are added on backup servers
– Timestamp and status of the latest backup.
– Progress bar while a backup is running.
Roadmap
● Migration to 5.7
– Automation already supports both 5.5 and 5.7.
● Better monitoring of errors on restores.
● Integration with PMM
– Implemented but broken.
● Realtime binlog streaming to Elastic File System
– Implemented but broken.
● Group Replication instead of MHA.