
  • Lessons learned while automating MySQL in the AWS cloud

    Stephane Combaudon, DB Engineer - Slice

  • 2

    Our environment

    ● 5 DB stacks
      – Data volume ranging from 30GB to 2TB+.

    ● Master + N slaves for each stack.
      – The master handles all application traffic.
      – Specialized slaves (backups, reports, custom jobs).

    ● Stacks are duplicated in several dimensions
      – Regions (US, JP)
      – Environments (QA, Staging, Prod)

  • 3

    Problems we wanted to fix

    ● Hosted in the AWS cloud, but relying on a 3rd party vendor for DB automation.

    ● The 3rd party vendor became a liability
      – Expensive
      – Automation only works with MySQL 5.5
      – Security issues
      – Failover unavailable

  • 4

    Our goals

    ● Create our own MySQL automation!

    ● Instance lifecycle
      – DBA/SA people create an instance from a template.
      – Software gets provisioned automatically.
      – Data gets provisioned automatically.
      – Replication (if slave) starts automatically.

    ● Bonus: add ability to fail over to a slave easily.

    ● How can we get there?

  • 5

    Technical Solution Overview

    ● Creating instances from a template – CloudFormation

    ● Installing software – Chef

    ● Data provisioning – Galera? Custom scripts?

    ● High availability – Galera? MHA?

  • 6

    CloudFormation

    ● Provides a way to manage AWS resources through templates (infrastructure as code).

    ● A CloudFormation template
      – Is a JSON file.
      – Describes the configuration of your resources (see the sketch below).

    ● Pro: any AWS resource can be described.
    ● Con: learning curve is steep.
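    As a rough illustration of "infrastructure as code" (the stack name, resource and properties below are hypothetical, not the actual templates behind these stacks), a minimal template declaring one EC2 instance can be written and launched with boto3:

      import json
      import boto3

      # Hypothetical minimal template: a single EC2 instance for a MySQL server.
      # Real templates for these stacks would also declare Auto Scaling groups,
      # security groups, EBS volumes, etc.
      template = {
          "AWSTemplateFormatVersion": "2010-09-09",
          "Description": "Minimal MySQL instance (illustrative only)",
          "Resources": {
              "MySQLInstance": {
                  "Type": "AWS::EC2::Instance",
                  "Properties": {
                      "ImageId": "ami-12345678",   # placeholder AMI
                      "InstanceType": "m4.large",
                      "Tags": [{"Key": "Role", "Value": "mysql"}],
                  },
              }
          },
      }

      cfn = boto3.client("cloudformation")
      cfn.create_stack(StackName="mysql-qa-us", TemplateBody=json.dumps(template))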

  • 7

    AWS Components - 1st try

    [Diagram: Master and Slave 1 … Slave N in an Autoscaling Group; MHA Manager on a standalone EC2 instance]

  • 8

    Data Provisioning

    ● Galera
      – Natively solves data provisioning + the HA issue.
      – But not a good fit for all our workloads + app changes needed.

    ● Let’s write a custom provisioning script! (Sketched after this slide.)
      – For a master
        ● Do nothing. We only create a master for a new (empty) stack.
      – For a slave
        ● Restore the latest available backup.
        ● Start replication.
      – But how will servers know whether they’re a master or a slave?
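    A minimal sketch of that provisioning logic; the helpers below are hypothetical stand-ins for code the deck does not show:

      # Hypothetical helpers; the real implementations are not shown in the deck.
      def get_replication_role(instance_id):
          """Return "master" or "slave" (decided via the AWS API, see later slides)."""
          raise NotImplementedError

      def restore_latest_backup():
          """Fetch the latest backup (e.g. from S3) and prepare the data directory."""
          raise NotImplementedError

      def start_replication():
          """CHANGE MASTER TO ... ; START SLAVE"""
          raise NotImplementedError

      def provision(instance_id):
          # Runs at instance bootstrap, after the software has been installed by Chef.
          if get_replication_role(instance_id) == "master":
              return            # a master is only created for a new, empty stack: nothing to load
          restore_latest_backup()
          start_replication()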

  • 9

    MHA

    ● Automated vs semi-automated mode
      – The app is not ready for automatic MySQL failover.
      – Semi-automated mode is chosen.
        ● Master failure detection is manual, slave promotion is a single command.

    ● MHA requirement
      – The MHA configuration needs to know the exact instances of the replication topology.

  • 10

    Back to AWS components

    ● CloudFormation allows you to add dependencies between components
      – Create the MHA Manager.
      – Add the IP of the MHA Manager to a file on the MySQL servers when they are created by CloudFormation.
      – During MySQL bootstrap, add the IP of the MySQL server to the MHA config.

    ● But there’s a catch: if MHA Manager goes down, we lose our failover ability.

  • 11

    AWS Components - 2nd try

    [Diagram: Master and Slave 1 … Slave N in an Autoscaling Group; MHA Manager in an Autoscaling Group of 1 instance]

  • 12

    Back to AWS components again

    ● CloudFormation is no longer able to know the IP of the MHA Manager in advance.
      – Therefore MySQL servers can no longer register themselves in the MHA config file.

    ● Once again, we need service discovery.

  • 13

    Service Discovery - 1

    ● No such service available in our infrastructure.

    ● We tried several options
      – Zookeeper, etcd: another infrastructure to manage.
      – DynamoDB: race conditions.

    ● In the end the AWS API seemed a strong enough option.

  • 14

    Service Discovery - 2

    ● All components in a CF stack share a tag (aws:cloudformation:stack-name)

    ● Within a CF stack, names of ASGs are predictable.

    ● We can then find the IP addresses of all instances within a specific ASG (see the sketch after this slide).
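    A minimal sketch of that lookup with boto3 (the stack and ASG names are hypothetical); it relies on the aws:cloudformation:stack-name and aws:autoscaling:groupName tags that AWS adds to instances launched by a CloudFormation-managed ASG:

      import boto3

      def private_ips(stack_name, asg_name):
          """Return private IPs of all running instances of one ASG in one CF stack."""
          ec2 = boto3.client("ec2")
          resp = ec2.describe_instances(Filters=[
              {"Name": "tag:aws:cloudformation:stack-name", "Values": [stack_name]},
              {"Name": "tag:aws:autoscaling:groupName", "Values": [asg_name]},
              {"Name": "instance-state-name", "Values": ["running"]},
          ])
          return [inst["PrivateIpAddress"]
                  for reservation in resp["Reservations"]
                  for inst in reservation["Instances"]]

      # Hypothetical names; the real naming scheme is not shown in the deck.
      print(private_ips("mysql-prod-us", "mysql-prod-us-ASG1"))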

  • 15

    Back to MHA config

    ● Now instances are able to register themselves when bootstrapping.

    ● Upon instance termination
      – The MHA config needs to be updated.
      – Hooks can be added to run a custom script, but they are not very fast.
      – What else can we do?

  • 16

    Another MHA problem - 1

    ● MHA command lines are not very user friendly.

    ● We built a wrapper script (sketched after the commands below).
      – Simpler options
      – Autocompletion

    mha@manager$ masterha_master_switch --conf=/etc/mha.conf --master_state=alive --new_master_host=172.25.2.73 --orig_master_is_new_slave --interactive=0

    root@manager# db_ha promote --new_master=172.25.2.73
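    A rough sketch of such a wrapper (db_ha and its promote sub-command come from the slide; the implementation below is an assumption, wrapping exactly the masterha_master_switch invocation shown above):

      #!/usr/bin/env python3
      """Hypothetical db_ha-style wrapper around MHA's command line."""
      import argparse
      import subprocess

      def promote(new_master):
          # Wraps the much longer masterha_master_switch invocation shown above.
          subprocess.run([
              "masterha_master_switch",
              "--conf=/etc/mha.conf",
              "--master_state=alive",
              "--new_master_host=" + new_master,
              "--orig_master_is_new_slave",
              "--interactive=0",
          ], check=True)

      if __name__ == "__main__":
          parser = argparse.ArgumentParser(prog="db_ha")
          sub = parser.add_subparsers(dest="command", required=True)
          p = sub.add_parser("promote", help="switch the master role to another host")
          p.add_argument("--new_master", required=True)
          args = parser.parse_args()
          if args.command == "promote":
              promote(args.new_master)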

  • 17

    Another MHA problem - 2

    ● Wait, couldn’t we also sync the MHA conf when running this script?
      – Yes, of course!

    ● The MHA conf is synced on demand by this script
      – Ensures the conf is always up-to-date when we need it.
      – No more need to care about MySQL instance termination.

  • 18

    Another MHA problem - 3

    ● So far, so good, but
      – Some of our slaves are not suitable at all to become master.
        ● We want no_master=1 in the MHA config for these servers.
      – The MHA Manager just knows a bunch of MySQL servers; how can it add the no_master flag?

    ● We need to refine our AWS components diagram again (a config-sync sketch follows the next diagram).

  • 19

    AWS Components - 3rd try

    [Diagram: Master and slaves eligible for promotion in ASG1 (Potential Masters); the remaining slaves in ASG2 (Slaves only); MHA Manager in an Autoscaling Group of 1 instance]
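    With the ASGs split this way, the config sync can mark every server discovered in ASG2 with no_master=1. A minimal sketch (the file layout, section parameters and ASG naming are assumptions; private_ips is the discovery helper sketched on the service-discovery slide):

      def write_mha_conf(stack, path="/etc/mha.conf"):
          lines = ["[server default]", "user=mha", "manager_workdir=/var/log/mha", ""]
          server_id = 0
          # ASG1 holds servers allowed to become master, ASG2 holds slaves that never should.
          for asg, no_master in ((stack + "-ASG1", False), (stack + "-ASG2", True)):
              for ip in private_ips(stack, asg):
                  server_id += 1
                  lines.append("[server%d]" % server_id)
                  lines.append("hostname=" + ip)
                  if no_master:
                      lines.append("no_master=1")   # never promote these slaves
                  lines.append("")
          with open(path, "w") as f:
              f.write("\n".join(lines))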

  • 20

    Recap so far

    ● At this point
      – We can create an arbitrary number of MySQL servers.
      – MHA config is synced automatically.
      – Any node (MySQL or Manager) that fails is rebuilt automatically thanks to ASGs.

    ● All good? Not exactly…

  • 21

    Back to Data Provisioning

    ● We have separate code paths for master and slaves.
      – But how do we know if a new instance is a master or a slave?

    ● Let’s use the AWS API again (see the sketch after this slide)
      – If the instance is part of ASG2: slave.
      – If the instance is part of ASG1: the 1st instance is the master, the others are slaves.
      – We add a replication_role tag to each instance.
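    A hedged sketch of that decision with boto3 (the ASG naming and the "oldest instance in ASG1 is the master" tie-break are assumptions drawn from the bullets above):

      import boto3

      ec2 = boto3.client("ec2")

      def assign_replication_role(stack, instance_id):
          """Decide master/slave from ASG membership and record it as an EC2 tag."""
          def instances_in(asg):
              resp = ec2.describe_instances(Filters=[
                  {"Name": "tag:aws:cloudformation:stack-name", "Values": [stack]},
                  {"Name": "tag:aws:autoscaling:groupName", "Values": [asg]},
                  {"Name": "instance-state-name", "Values": ["running"]},
              ])
              return [i for r in resp["Reservations"] for i in r["Instances"]]

          asg1 = instances_in(stack + "-ASG1")   # potential masters
          if instance_id in {i["InstanceId"] for i in asg1}:
              # The 1st (oldest) instance of ASG1 is the master, the others are slaves.
              oldest = min(asg1, key=lambda i: i["LaunchTime"])
              role = "master" if instance_id == oldest["InstanceId"] else "slave"
          else:
              role = "slave"                     # everything in ASG2 is a slave
          ec2.create_tags(Resources=[instance_id],
                          Tags=[{"Key": "replication_role", "Value": role}])
          return role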

  • 22

    Backups - XtraBackup vs EBS snapshots

    ● EBS snapshots
      – Simple to use and super fast (incremental backups).

    ● XtraBackup
      – Very complex, super slow. Incremental backups are difficult.

    ● Let’s use EBS snapshots then?
      – Well, not so fast

  • 23

    Backups vs Restores

    ● EBS snapshots are great for backups, not for restores.
      – Data is lazily loaded from S3, i.e. warmup takes forever.

    ● Example on our write-heaviest cluster
      – Restore + replication catch-up with XtraBackup: 8-9 hours.
      – Same with EBS snapshots: I gave up after 2 days.

  • 24

    Backup Script

    ● XtraBackup takes a full backup.
    ● The backup is uploaded to S3 (the whole flow is sketched after this slide).
    ● Frequency of backups is stack-dependent
      – Configuration file in S3

    ● Tags are added on backup servers
      – Timestamp and status of the latest backup.
      – Progress bar if a backup is being taken.
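    A hedged sketch of that backup flow (the bucket name, paths, tag keys and the exact xtrabackup invocation are assumptions; a real script would stream the backup instead of staging it on disk and would also report progress):

      import datetime
      import subprocess
      import boto3

      def run_backup(instance_id, bucket="mysql-backups"):
          ec2 = boto3.client("ec2")
          s3 = boto3.client("s3")
          stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
          target = "/backups/" + stamp

          # Mark the backup as in progress on the instance itself.
          ec2.create_tags(Resources=[instance_id],
                          Tags=[{"Key": "backup_status", "Value": "running"}])

          # Full physical backup with XtraBackup, then compress and upload to S3.
          subprocess.run(["xtrabackup", "--backup", "--target-dir=" + target], check=True)
          subprocess.run(["tar", "-czf", target + ".tar.gz", "-C", target, "."], check=True)
          s3.upload_file(target + ".tar.gz", bucket, instance_id + "/" + stamp + ".tar.gz")

          # Record timestamp and status of the latest backup as tags.
          ec2.create_tags(Resources=[instance_id], Tags=[
              {"Key": "backup_status", "Value": "ok"},
              {"Key": "last_backup", "Value": stamp},
          ])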

  • 25

    Roadmap

    ● Migration to 5.7
      – Automation already supports both 5.5 and 5.7.

    ● Better monitoring of errors on restores.

    ● Integration with PMM
      – Implemented but broken.

    ● Realtime binlog streaming to Elastic Filesystem
      – Implemented but broken.

    ● Group Replication instead of MHA.

  • 26

    The end

    ● Thanks for attending!!

    ● Questions/comments – stephane@slice.com
