
The two little bugs that almost brought down Booking.com

Jean-François Gagné (System Engineer)

jeanfrancois DOT gagne AT booking.com

April 25, 2017 – Percona Live Santa Clara 2017

For a, b and c relatively small:

Consequence(a + b + c) is much bigger than Consequence(a) + Consequence(b) + Consequence(c)

MySQL/MariaDB replication at Booking.com

● Typical Booking.com MySQL/MariaDB replication deployment:

+---+
| M |
+---+
  |
  +------+-- ... --+---------------+-------- ...
  |      |         |               |
+---+  +---+     +---+           +---+
| S1|  | S2|     | Sn|           | M1|
+---+  +---+     +---+           +---+
                                   |
                                   +-- ... --+
                                   |         |
                                 +---+     +---+
                                 | T1|     | Tm|
                                 +---+     +---+

Impacted setup (simplified)

+---+
| M |
+---+
  |
  +---------- .... ----------+--------------+
  |                          |              |
+---+                      +---+          +---+
| M1|                      | Mi|          | Mj|
+---+                      +---+          +---+
  |                          |              |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+

Upgrade from 5.5 to new major version

+---+
| M |
+---+
  |
  +---------- .... ----------+--------------+
  |                          |              |
+---+                      +---+          +---+
| M1|                      | Mi|          | Mj|
+---+                      +---+          +---+
  |                          |              |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+

Bad transaction on the master

+---+
| M | <<-- “bad transaction”
+---+
  |
  +---------- .... ----------+--------------+
  |                          |              |
+---+                      +---+          +---+
| M1|                      | Mi|          | Mj|
+---+                      +---+          +---+
  |                          |              |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+

Oops! M2 runs out of memory (OOM) and is killed

+---+
| M | <<-- “bad transaction”
+---+
  |
  +---------- .... ----------+--------------+
  |                          |              |
+---+                      +\-/+          +---+
| M1|                      | Mi|          | Mj|
+---+                      +/-\+          +---+
  |                          |              |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+   +---+  +---+   +---+ +---+   +---+ +---+

Oops²! All “blue” servers run out of memory and are killed

+---+
| M | <<-- “bad transaction”
+---+
  |
  +---------- .... ----------+--------------+
  |                          |              |
+---+                      +\-/+          +---+
| M1|                      | Mi|          | Mj|
+---+                      +/-\+          +---+
  |                          |              |
  +-----+- .. -+-----+       +- .. -+       +-----+-- .. -+-----+
  |     |      |     |       |      |       |     |       |     |
+\-/+ +---+  +\-/+ +---+   +---+  +---+   +---+ +\-/+   +---+ +\-/+
| S1| | S2|  | S.| | Sn|   | T.|  | Tm|   | U1| | U2|   | U.| | Uo|
+/-\+ +---+  +/-\+ +---+   +---+  +---+   +---+ +/-\+   +---+ +/-\+

What is the “bad” transaction?

● DELETE FROM TABLE WHERE …lot of rows…;

● Transaction of ~2 GB in the binary logs (RBR)

● Obviously a bug in the application (but it should not have triggered an OOM)
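On the application side, one common way to avoid producing a single multi-gigabyte row-based transaction is to delete in small batches and commit after each one. The sketch below only illustrates that idea and is not taken from the talk: it uses Python's built-in sqlite3 as a stand-in for MySQL, and the function name `delete_in_batches`, the table `events`, and the column `expired` are all hypothetical. In MySQL the same pattern is usually written with `DELETE ... LIMIT n` in a loop.

```python
import sqlite3


def delete_in_batches(conn, table, where_sql, batch_size=1000):
    """Delete matching rows in small transactions; return the number of non-empty batches.

    `table` and `where_sql` are trusted strings in this sketch (no user input).
    """
    batches = 0
    while True:
        cur = conn.execute(
            f"DELETE FROM {table} WHERE rowid IN "
            f"(SELECT rowid FROM {table} WHERE {where_sql} LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # each batch is its own (small) transaction
        if cur.rowcount == 0:
            break
        batches += 1
    return batches


# Demo on an in-memory database: half of the 10,000 rows are marked expired.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany(
    "INSERT INTO events (expired) VALUES (?)",
    [(i % 2,) for i in range(10_000)],
)
conn.commit()

batches = delete_in_batches(conn, "events", "expired = 1", batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(batches, remaining)
```

With batches of this kind, each replicated transaction stays small, at the cost of the overall delete no longer being atomic.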

What needs to be done next?

● Reminder: 5.5 is not replication crash safe

● Next version is crash safe, but can’t…

● Crashed slaves either OOM again or are corrupted

● We need to re-clone all crashed slaves!

What saved us?

● Engaged team of skilled DBAs: all joined to help

● Data not too sensitive to replication delay

● Data not too sensitive to “skipping transactions”

● pt-slave-restart

● IDEMPOTENT mode

● A torrenting cloning tool

What could have helped us?

● A “working” torrenting cloning tool…

  ● Not used often enough, so we did not know it was broken (fixed in less than 2 hours)

● An AUTO-FIX/AUTO-REPAIR mode (RBR)

  ● Instead of skipping transactions (and making data diverge), it should repair (fix) slave drift (and make data converge)

  ● https://bugs.mysql.com/bug.php?id=54250

  ● http://blog.wl0.org/2016/05/the-differences-between-idempotent-and-my-suggested-auto-repair-mode/
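The difference between skipping and repairing can be illustrated with a toy model. The sketch below is an assumption-laden illustration, not server code: a table is a plain dict keyed by primary key, `apply_update` is a hypothetical function, and the `"AUTO_REPAIR"` branch reflects one reading of the proposal linked above (write the event's after image when the row is missing), which is a suggestion and not an existing MySQL setting. `"STRICT"` and `"IDEMPOTENT"` loosely mirror the two values of `slave_exec_mode` for the row-not-found case only.

```python
def apply_update(table, pk, after, mode):
    """Apply one row-based UPDATE event to `table` (a dict keyed by primary key)."""
    if pk in table:
        table[pk] = after  # normal case: the row exists on the slave
        return "applied"
    if mode == "STRICT":
        # Replication stops with a "row not found" error.
        raise LookupError(f"row {pk} not found on slave")
    if mode == "IDEMPOTENT":
        # The error is suppressed; the slave keeps missing the row.
        return "skipped"
    if mode == "AUTO_REPAIR":
        # Hypothetical behaviour: use the after image to recreate the row.
        table[pk] = after
        return "repaired"
    raise ValueError(f"unknown mode: {mode}")


# State of the master after it updated row 2 to "c".
master = {1: "a", 2: "c"}


def replay(mode):
    """Replay that update on a slave that has drifted (row 2 is missing)."""
    slave = {1: "a"}
    outcome = apply_update(slave, 2, "c", mode)
    return outcome, slave == master


for mode in ("IDEMPOTENT", "AUTO_REPAIR"):
    print(mode, replay(mode))
```

In this model the idempotent replay leaves the slave different from the master, while the repairing replay makes the two converge, which is the distinction the bullet above draws.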

● We are hiring!

● MySQL Engineer / DBA

● System Administrator

● System Engineer

● Site Reliability Engineer

● Developer / Designer

● Technical Team Lead

● Product Owner

● Data Scientist

● And many more…

● https://workingatbooking.com/

Want to know more…

Thanks

Jean-François Gagné

jeanfrancois DOT gagne AT booking.com
