ZERO-DOWNTIME DATACENTER FAILOVERS
(SWITCHING HOSTING PROVIDERS FOR DUMMIES)
1 — Luka Kladaric @ AWS Adria 2017.
WHO?
Luka Kladaric
formerly a web developer for >10 years
now: freelancing, consulting, architecting, securing
migrating an entire company's infrastructure
from Rackspace to Amazon AWS
60 virtual machines
3 baremetal boxes (db)
assorted networking equipment
the migration took 2 months to execute
but a year and a half to prepare
FOUND STATE
hand-crafted build server, unreproducible
half the servers are not deployable from scratch
or their deployability is unknown
same mysql account used by everyone everywhere
that mysql account is "root"
that mysql db is 1.5 TB big
no access to LB config
has a bunch of magic in it
changes often result in issues and outages
no server metrics / perfdata
no idea if overprovisioned and by how much
no access to disaster recovery instance
in case the primary DC went down
(access goes through primary DC)
RACKSPACE WAS REALLY TERRIBLE
a constant pain to deal with
unexpected outages, causes never explained
unresponsive support team
zero flexibility
HOW LONG WOULD IT TAKE TO MIGRATE THIS?
optimistically: 3 months
conservatively: 6-9 months
realistically: a year
NO LEADERSHIP BUY-IN
2 failed attempts to get approval
Infrastructure team makes a pact:
"Do Things The Right Way From Now On"
mask cleanup work as ongoing maintenance
A YEAR AND A HALF LATER...
majority of the issues were fixed
or at least significantly improved
PLOT TWIST
RACKSPACE STARTS FALLING APART
New estimate: 19 man-days
(after final push for preparation)
SAVINGS ESTIMATE
$18k -> $6k
that's -66%
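The percentage on this slide can be checked with a few lines of arithmetic. This is only a sketch of the calculation: the two dollar figures come from the slide, the billing period is not stated there, and the function name is made up for illustration.

```python
def percent_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent (negative means cheaper)."""
    return (new - old) / old * 100.0


before_usd = 18_000  # "$18k" from the slide; billing period not stated
after_usd = 6_000    # "$6k" from the slide

change = percent_change(before_usd, after_usd)
print(f"{change:.1f}%")  # -66.7%, which the slide shortens to -66%
```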
GOT APPROVAL!
Actually executed in 25-30 man-days
over 2 months
HOW?
"upgrading the fleet to Ubuntu 16.04"
all servers rebuilt and redeployed with Ansible
build server rebuilt from scratch
deployed from Ansible
all build jobs defined in code
no more tweaking jobs through UI
CloudFlare implemented for faster DNS failover
all LB logic slowly moved to our own haproxies
haproxy configuration auto-generated from Ansible
makes it easy to shuffle things around
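The idea of generating haproxy configuration from an inventory, rather than editing it by hand, can be sketched as below. The talk did this with Ansible; the Python here only mimics the principle. The inventory, backend names, addresses and the choice of `balance roundrobin` are all hypothetical, not taken from the talk.

```python
from typing import Dict, List, Tuple

# Hypothetical inventory: backend name -> list of (server name, "address:port").
INVENTORY: Dict[str, List[Tuple[str, str]]] = {
    "web": [("web1", "10.0.0.11:8080"), ("web2", "10.0.0.12:8080")],
    "api": [("api1", "10.0.0.21:9000")],
}


def render_backend(name: str, servers: List[Tuple[str, str]]) -> str:
    """Render one haproxy backend section with health checks enabled."""
    lines = [f"backend {name}", "    balance roundrobin"]
    for server_name, address in servers:
        lines.append(f"    server {server_name} {address} check")
    return "\n".join(lines)


def render_config(inventory: Dict[str, List[Tuple[str, str]]]) -> str:
    """Render all backends in a stable (sorted) order so diffs stay small."""
    sections = [render_backend(name, inventory[name]) for name in sorted(inventory)]
    return "\n\n".join(sections) + "\n"


if __name__ == "__main__":
    print(render_config(INVENTORY))
```

Moving a service then means changing one inventory entry and regenerating, which is what makes shuffling things around cheap.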
all apps slowly migrated to be served through haproxies
avoiding Rackspace LB magic
VPN bridge between DCs
~20 MB/s, ~20ms ping
good enough to treat as a "local" connection
for shorter periods of time
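A back-of-the-envelope calculation shows what these link numbers imply and why "for shorter periods" matters. The throughput, ping and database size come from the slides. Decimal units, a fully sustained link, and the figure of 25 sequential queries per request are assumptions made only for this sketch.

```python
LINK_MB_PER_S = 20   # "~20 MB/s" from the slide
ROUND_TRIP_MS = 20   # "~20ms ping" from the slide
DB_SIZE_TB = 1.5     # size of the mysql db mentioned earlier in the talk


def bulk_copy_hours(size_tb: float, mb_per_s: float) -> float:
    """Hours to push size_tb over the link, assuming 1 TB = 1,000,000 MB."""
    size_mb = size_tb * 1_000_000
    return size_mb / mb_per_s / 3600


def added_latency_ms(sequential_queries: int, round_trip_ms: float) -> float:
    """Extra time per request when each query crosses the bridge in turn."""
    return sequential_queries * round_trip_ms


print(f"{bulk_copy_hours(DB_SIZE_TB, LINK_MB_PER_S):.1f} hours for a full copy")
print(f"{added_latency_ms(25, ROUND_TRIP_MS):.0f} ms added for 25 cross-DC queries")
```

Under these assumptions a full copy of the database takes most of a day, and a chatty request picks up about half a second, which is tolerable during a cutover but not as a permanent setup.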
mysql master-master replication between DCs
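One common concern with master-master replication is two masters handing out the same auto-increment id. The slides do not say how this was handled. A usual approach, assumed here, is to give each master a different `auto_increment_offset` with a shared `auto_increment_increment`. The sketch below is a simplified model of the resulting id sequences, not MySQL itself.

```python
from itertools import islice
from typing import Iterator


def auto_increment_ids(offset: int, increment: int) -> Iterator[int]:
    """Simplified model: ids of the form offset + n * increment."""
    value = offset
    while True:
        yield value
        value += increment


# Two masters, one per DC (assumed settings, not from the talk).
dc_a_ids = list(islice(auto_increment_ids(offset=1, increment=2), 5))
dc_b_ids = list(islice(auto_increment_ids(offset=2, increment=2), 5))

print(dc_a_ids)  # [1, 3, 5, 7, 9]
print(dc_b_ids)  # [2, 4, 6, 8, 10]
```

Because the two sequences never overlap, rows inserted on either side can replicate to the other without primary key collisions.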
app servers in both DCs
haproxies in both DCs
aware of app servers in both DCs
but preferring local ones
"no request left behind"
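The "prefer local, fall back to remote" behaviour of the haproxies can be modelled in a few lines. This is a sketch of the selection logic only, with hypothetical server names. In haproxy itself something similar can be expressed by marking remote servers with the `backup` keyword, but the slides do not say which mechanism was actually used.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Server:
    name: str
    dc: str
    healthy: bool = True


def pick_server(servers: List[Server], local_dc: str, counter: int = 0) -> Optional[Server]:
    """Round-robin over healthy local servers; use remote ones only if none are left."""
    healthy = [s for s in servers if s.healthy]
    local = [s for s in healthy if s.dc == local_dc]
    pool = local or healthy  # fall back to the other DC when local is empty
    if not pool:
        return None
    return pool[counter % len(pool)]


# Hypothetical fleet with one app server per DC.
fleet = [Server("app-old-1", "rackspace"), Server("app-new-1", "aws")]

print(pick_server(fleet, local_dc="rackspace").name)  # app-old-1
fleet[0].healthy = False
print(pick_server(fleet, local_dc="rackspace").name)  # app-new-1
```

A request that lands on a proxy in the DC being drained still gets served, just from the other side of the VPN bridge.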
near-instant failover with DNS at CloudFlare
and even stray requests get handled
metrics, metrics, metrics
(Datadog ftw)
RESULTS
core production migrated in days
internal tools migrated within a week or two
developer tools migrated within a month
(git hosting, build server, etc)
obscure legacy services migrated within 2 months
all hardware at Rackspace
decommissioned within 3 months
side effect: actual HA instead of fake HA
old "two or more of everything" approach
translated well into Availability Zones
AND IT WAS GOOD
QUESTIONS?
THANK YOU!
Luka [email protected]
@kll