The two little bugs that almost
brought down Booking.com
Jean-François Gagné (System Engineer)
jeanfrancois DOT gagne AT booking.com
April 25, 2017 – Percona Live Santa Clara 2017
2
For a, b and c relatively small:
Consequence(a + b + c) is much bigger than
Conseq.(a) + Conseq.(b) + Conseq.(c)
MySQL/MariaDB replication at Booking.com
● Typical Booking.com MySQL/MariaDB replication deployment:
                 +---+
                 | M |
                 +---+
                   |
  +------+-- ... --+---------------+-------- ...
  |      |         |               |
+---+  +---+     +---+           +---+
| S1|  | S2|     | Sn|           | M1|
+---+  +---+     +---+           +---+
                                   |
                                   +-- ... --+
                                   |         |
                                 +---+     +---+
                                 | T1|     | Tm|
                                 +---+     +---+
3
Impacted setup (simplified)
                                        +---+
                                        | M |
                                        +---+
                                          |
               +---------- .... ----------+--------------+
               |                          |              |
             +---+                      +---+          +---+
             | M1|                      | Mi|          | Mj|
             +---+                      +---+          +---+
               |                          |              |
  +-----+- .. -+-----+             +- .. -+        +-----+-- .. -+-----+
  |     |      |     |             |      |        |     |       |     |
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|         | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
4
Upgrade from 5.5 to new major version
                                        +---+
                                        | M |
                                        +---+
                                          |
               +---------- .... ----------+--------------+
               |                          |              |
             +---+                      +---+          +---+
             | M1|                      | Mi|          | Mj|
             +---+                      +---+          +---+
               |                          |              |
  +-----+- .. -+-----+             +- .. -+        +-----+-- .. -+-----+
  |     |      |     |             |      |        |     |       |     |
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|         | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
5
Bad transaction on the master
                                        +---+
                                        | M | <<-- “bad transaction”
                                        +---+
                                          |
               +---------- .... ----------+--------------+
               |                          |              |
             +---+                      +---+          +---+
             | M1|                      | Mi|          | Mj|
             +---+                      +---+          +---+
               |                          |              |
  +-----+- .. -+-----+             +- .. -+        +-----+-- .. -+-----+
  |     |      |     |             |      |        |     |       |     |
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|         | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
6
Oops! M2 (shown as Mi in the diagram) runs OOM and is killed
                                        +---+
                                        | M | <<-- “bad transaction”
                                        +---+
                                          |
               +---------- .... ----------+--------------+
               |                          |              |
             +---+                      +\-/+          +---+
             | M1|                      | Mi|          | Mj|
             +---+                      +/-\+          +---+
               |                          |              |
  +-----+- .. -+-----+             +- .. -+        +-----+-- .. -+-----+
  |     |      |     |             |      |        |     |       |     |
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|         | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+         +---+  +---+    +---+ +---+   +---+ +---+
7
Oops²! All “blue” nodes run OOM and are killed
                                        +---+
                                        | M | <<-- “bad transaction”
                                        +---+
                                          |
               +---------- .... ----------+--------------+
               |                          |              |
             +---+                      +\-/+          +---+
             | M1|                      | Mi|          | Mj|
             +---+                      +/-\+          +---+
               |                          |              |
  +-----+- .. -+-----+             +- .. -+        +-----+-- .. -+-----+
  |     |      |     |             |      |        |     |       |     |
+\-/+ +---+  +\-/+ +---+         +---+  +---+    +---+ +\-/+   +---+ +\-/+
| S1| | S2|  | S.| | Sn|         | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+/-\+ +---+  +/-\+ +---+         +---+  +---+    +---+ +/-\+   +---+ +/-\+
8
What is the “bad” transaction?
● DELETE FROM TABLE WHERE …lot of rows…;
● Transaction of ~2 GB in the binary logs (RBR)
● Obviously a bug in the application
(but it should not have triggered an OOM)
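The slides do not show how the application bug was fixed. As a purely illustrative sketch, one common way to keep a large delete from becoming a single multi-gigabyte row-based transaction is to delete in small batches and commit after each one. The example below uses SQLite as a stand-in so that it runs anywhere; the table `events`, the column `expired`, and both function names are hypothetical. On MySQL the same idea would more likely be written with `DELETE ... LIMIT` in a loop.

```python
import sqlite3


def make_demo_db(n_rows=10_000):
    """Build an in-memory table where every odd id is flagged as expired."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, expired INTEGER)")
    conn.executemany(
        "INSERT INTO events (id, expired) VALUES (?, ?)",
        ((i, i % 2) for i in range(n_rows)),
    )
    conn.commit()
    return conn


def delete_in_batches(conn, batch_size=1000):
    """Delete expired rows in many small transactions; return rows deleted."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE id IN "
            "(SELECT id FROM events WHERE expired = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # one small transaction per batch
        if cur.rowcount == 0:
            break  # nothing left to delete
        total += cur.rowcount
    return total


if __name__ == "__main__":
    conn = make_demo_db()
    print("deleted:", delete_in_batches(conn))
```

Each batch is committed separately, so no single transaction grows with the total number of rows matched.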
9
What needs to be done next?
● Reminder: 5.5 is not replication crash safe
● Next version is crash safe, but can’t…
● Crashed slaves either OOM again or are corrupted
● We need to re-clone all crashed slaves!
10
What saved us?
● Engaged team of skilled DBAs: all joined to help
● Data not too sensitive to replication delay
● Data not too sensitive to “skipping transactions”
● pt-slave-restart
● IDEMPOTENT mode
● A torrenting cloning tool
11
What could have helped us?
● A “working” torrenting cloning tool…
  ● Not used often enough, so we did not know it was broken
    (fixed in less than 2 hours)
● An AUTO-FIX/AUTO-REPAIR mode (RBR)
  ● Instead of skipping transactions (and making data diverge),
    it should repair (fix) slave drift (and make data converge)
  https://bugs.mysql.com/bug.php?id=54250
  http://blog.wl0.org/2016/05/the-differences-between-idempotent-and-my-suggested-auto-repair-mode/
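To make the diverge/converge distinction concrete, here is a toy model in Python. It is not how MySQL applies row events. It assumes a table is a dict keyed by primary key and that each event carries a full after-image. `STRICT` and `IDEMPOTENT` are simplified readings of the real `slave_exec_mode` values. `AUTO_REPAIR` models the mode asked for on this slide and is not an existing server option. All function and field names are hypothetical.

```python
def apply_event(table, event, mode):
    """Apply one row event to `table` (dict: primary key -> row value)."""
    kind, pk = event["kind"], event["pk"]
    if kind == "insert":
        if pk in table and mode == "STRICT":
            raise KeyError(f"duplicate key {pk}")
        table[pk] = event["after"]  # non-strict modes overwrite the row
    elif kind == "update":
        if pk not in table:
            if mode == "STRICT":
                raise KeyError(f"row {pk} not found")
            if mode == "IDEMPOTENT":
                return  # error suppressed, change skipped: drift remains
        # Row exists, or AUTO_REPAIR re-creates it from the after-image.
        table[pk] = event["after"]
    elif kind == "delete":
        if pk not in table and mode == "STRICT":
            raise KeyError(f"row {pk} not found")
        table.pop(pk, None)  # already gone is fine in non-strict modes


def replay(slave, events, mode):
    """Replay events on a copy of the slave's table and return the result."""
    table = dict(slave)
    for event in events:
        apply_event(table, event, mode)
    return table


# The master has rows 1 and 2; the slave has drifted and lost row 2.
SLAVE = {1: "a"}
EVENTS = [
    {"kind": "update", "pk": 2, "after": "B"},
    {"kind": "insert", "pk": 3, "after": "c"},
]

if __name__ == "__main__":
    for mode in ("IDEMPOTENT", "AUTO_REPAIR"):
        print(mode, replay(SLAVE, EVENTS, mode))
```

In this model, `IDEMPOTENT` keeps replication running but leaves row 2 missing on the slave, whereas `AUTO_REPAIR` ends with the same rows as the master. `STRICT` stops at the first event with an error.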
12
● We are hiring!
● MySQL Engineer / DBA
● System Administrator
● System Engineer
● Site Reliability Engineer
● Developer / Designer
● Technical Team Lead
● Product Owner
● Data Scientist
● And many more…
● https://workingatbooking.com/
Want to know more…
Thanks
Jean-François Gagné
jeanfrancois DOT gagne AT booking.com