Page 1: The two little bugs that almost - Percona fileThe two little bugs that almost brought down Booking.com Jean-François Gagné (System Engineer) jeanfrancois DOT gagneAT booking.com

The two little bugs that almost brought down Booking.com

Jean-François Gagné (System Engineer)

jeanfrancois DOT gagne AT booking.com

April 25, 2017 – Percona Live Santa Clara 2017

For a, b and c relatively small:

Consequence(a + b + c) is much bigger than Consequence(a) + Consequence(b) + Consequence(c)

MySQL/MariaDB replication at Booking.com

● Typical Booking.com MySQL/MariaDB replication deployment:

                        +---+
                        | M |
                        +---+
                          |
  +------+-- ... --+---------------+-------- ...
  |      |         |               |
+---+  +---+     +---+           +---+
| S1|  | S2|     | Sn|           | M1|
+---+  +---+     +---+           +---+
                                   |
                              +-- ... --+
                              |         |
                            +---+     +---+
                            | T1|     | Tm|
                            +---+     +---+

Impacted setup (simplified)

                           +---+
                           | M |
                           +---+
                             |
        +---------- .... ----------+--------------+
        |                          |              |
      +---+                      +---+          +---+
      | M1|                      | Mi|          | Mj|
      +---+                      +---+          +---+
        |                          |              |
  +-----+- .. -+-----+      +- .. -+        +-----+-- .. -+-----+
  |     |      |     |      |      |        |     |       |     |
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|  | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+

Upgrade from 5.5 to new major version

                           +---+
                           | M |
                           +---+
                             |
        +---------- .... ----------+--------------+
        |                          |              |
      +---+                      +---+          +---+
      | M1|                      | Mi|          | Mj|
      +---+                      +---+          +---+
        |                          |              |
  +-----+- .. -+-----+      +- .. -+        +-----+-- .. -+-----+
  |     |      |     |      |      |        |     |       |     |
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|  | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+

Bad transaction on the master

                           +---+
                           | M | <<-- “bad transaction”
                           +---+
                             |
        +---------- .... ----------+--------------+
        |                          |              |
      +---+                      +---+          +---+
      | M1|                      | Mi|          | Mj|
      +---+                      +---+          +---+
        |                          |              |
  +-----+- .. -+-----+      +- .. -+        +-----+-- .. -+-----+
  |     |      |     |      |      |        |     |       |     |
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|  | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+

Oops! M2 (Mi in the diagram) runs out of memory (OOM) and is killed

                           +---+
                           | M | <<-- “bad transaction”
                           +---+
                             |
        +---------- .... ----------+--------------+
        |                          |              |
      +---+                      +\-/+          +---+
      | M1|                      | Mi|          | Mj|
      +---+                      +/-\+          +---+
        |                          |              |
  +-----+- .. -+-----+      +- .. -+        +-----+-- .. -+-----+
  |     |      |     |      |      |        |     |       |     |
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+
| S1| | S2|  | S.| | Sn|  | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+---+ +---+  +---+ +---+  +---+  +---+    +---+ +---+   +---+ +---+

Oops again! All “blue” servers run OOM and are killed

                           +---+
                           | M | <<-- “bad transaction”
                           +---+
                             |
        +---------- .... ----------+--------------+
        |                          |              |
      +---+                      +\-/+          +---+
      | M1|                      | Mi|          | Mj|
      +---+                      +/-\+          +---+
        |                          |              |
  +-----+- .. -+-----+      +- .. -+        +-----+-- .. -+-----+
  |     |      |     |      |      |        |     |       |     |
+\-/+ +---+  +\-/+ +---+  +---+  +---+    +---+ +\-/+   +---+ +\-/+
| S1| | S2|  | S.| | Sn|  | T.|  | Tm|    | U1| | U2|   | U.| | Uo|
+/-\+ +---+  +/-\+ +---+  +---+  +---+    +---+ +/-\+   +---+ +/-\+

What is the “bad” transaction?

● DELETE FROM TABLE WHERE …lot of rows…;

● Transaction of ~2 GB in the binary logs (RBR)

● Obviously a bug in the application (but it should not have triggered an OOM)
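A common way to keep a mass delete from turning into one huge row-based binary log transaction is to delete in small batches and commit after each batch. The sketch below is not from the talk: it uses Python's built-in sqlite3 module as a stand-in for a MySQL connection so that it runs anywhere, and the table name `bookings_tmp`, the `expired` column and the batch size are all hypothetical.

```python
import sqlite3


def delete_in_batches(conn, batch_size=1000):
    """Delete matching rows in small transactions; return the number of batches."""
    batches = 0
    while True:
        # Hypothetical table and predicate; only `batch_size` rows per statement.
        cur = conn.execute(
            "DELETE FROM bookings_tmp WHERE id IN ("
            "  SELECT id FROM bookings_tmp WHERE expired = 1 LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # each batch is committed as its own small transaction
        if cur.rowcount == 0:
            break  # nothing left to delete
        batches += 1
    return batches


# Demo data: 5000 rows, every second one marked as expired.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bookings_tmp (id INTEGER PRIMARY KEY, expired INTEGER)")
conn.executemany(
    "INSERT INTO bookings_tmp (expired) VALUES (?)",
    [(1 if i % 2 == 0 else 0,) for i in range(5000)],
)
conn.commit()

batches = delete_in_batches(conn, batch_size=1000)
remaining = conn.execute("SELECT COUNT(*) FROM bookings_tmp").fetchone()[0]
print(batches, remaining)
```

On MySQL itself, a single-table `DELETE ... WHERE ... LIMIT n` expresses the same batch directly, without the subquery used here for SQLite.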

What needs to be done next?

● Reminder: 5.5 is not replication crash-safe

● Next version is crash-safe, but can’t…

● Crashed slaves either OOM again or are corrupted

● We need to re-clone all crashed slaves!

What saved us?

● Engaged team of skilled DBAs: all joined to help

● Data not too sensitive to replication delay

● Data not too sensitive to “skipping transactions”

● pt-slave-restart

● IDEMPOTENT mode

● A torrenting cloning tool

What could have helped us?

● A “working” torrenting cloning tool…

● Not used often enough, so we did not know it was broken (fixed in less than 2 hours)

● An AUTO-FIX/AUTO-REPAIR mode (RBR)

● Instead of skipping transactions (and making data diverge), it should repair slave drift (and make data converge)

● https://bugs.mysql.com/bug.php?id=54250

● http://blog.wl0.org/2016/05/the-differences-between-idempotent-and-my-suggested-auto-repair-mode/
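The difference between tolerating drift and repairing it can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not MySQL code: a table is a Python dict, a row event is a `(kind, key, after_image)` tuple, `STRICT` and `IDEMPOTENT` loosely mimic the two values of MySQL's `slave_exec_mode`, and `AUTO_REPAIR` is a hypothetical mode modeled on the suggestion above (it does not exist in MySQL).

```python
def apply_event(rows, event, mode):
    """Apply one row event to a replica table (dict: primary key -> value)."""
    kind, key, after = event
    present = key in rows
    if kind == "insert":
        if present and mode == "STRICT":
            raise RuntimeError(f"duplicate key {key!r}")
        rows[key] = after  # tolerant modes overwrite the existing row
    elif kind == "update":
        if present:
            rows[key] = after
        elif mode == "STRICT":
            raise RuntimeError(f"row {key!r} not found")
        elif mode == "AUTO_REPAIR":
            rows[key] = after  # hypothetical: rebuild the row from the after-image
        # IDEMPOTENT: the missing row is ignored, so the drift stays
    elif kind == "delete":
        if present:
            del rows[key]
        elif mode == "STRICT":
            raise RuntimeError(f"row {key!r} not found")
        # tolerant modes: the row is already gone, nothing to do
    else:
        raise ValueError(f"unknown event kind {kind!r}")


master = {1: "a", 2: "b"}
event = ("update", 2, "B")
apply_event(master, event, "STRICT")  # master now holds {1: "a", 2: "B"}

# Two replicas that have drifted: row 2 is missing on both.
idempotent_replica = {1: "a"}
repaired_replica = {1: "a"}
apply_event(idempotent_replica, event, "IDEMPOTENT")
apply_event(repaired_replica, event, "AUTO_REPAIR")

print(idempotent_replica == master)  # False: the update was skipped, data diverges
print(repaired_replica == master)    # True: the row was recreated, data converges
```

In this model the tolerant mode keeps replication running but leaves the replica different from the master, while the repairing mode uses the row image already present in the event to bring the replica back in line.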

● We are hiring!

● MySQL Engineer / DBA

● System Administrator

● System Engineer

● Site Reliability Engineer

● Developer / Designer

● Technical Team Lead

● Product Owner

● Data Scientist

● And many more…

● https://workingatbooking.com/

Want to know more…

Thanks

Jean-François Gagné

jeanfrancois DOT gagne AT booking.com

