MySQL Group Replication


Upload: ulf-wendel

Post on 14-Jun-2015


DESCRIPTION

MySQL Group Replication is a new 'synchronous', multi-master, auto-everything replication plugin for MySQL introduced with MySQL 5.7. It is the perfect tool for small MySQL clusters of 3-20 machines to gain high availability and high performance. It stands for high availability because the failure of a replica doesn't stop the cluster. Failed nodes can rejoin the cluster and new nodes can be added in a fully automatic way, with no DBA intervention required. It stands for high performance because multiple masters process writes, not just one as with MySQL Replication. Running applications on it is simple: no read-write splitting, no fiddling with eventual consistency and stale data. The cluster offers strong consistency (generalized snapshot isolation). It is based on group communication principles, hence the name.

TRANSCRIPT

1. MySQL Group Replication: 'Synchronous', multi-master, auto-everything. Ulf Wendel, MySQL/Oracle

2. The speaker says... MySQL 5.7 introduces a new kind of replication: MySQL Group Replication. At the time of writing (10/2014) MySQL Group Replication is available as a preview release on labs.mysql.com. In common user terms it features (virtually) synchronous, multi-master, auto-everything replication.

3. Proper wording... An eager update everywhere system based on the database state machine approach atop a group communication system offering virtual synchrony and reliable total ordering messaging. MySQL Group Replication offers generalized snapshot isolation.

4. The speaker says... And here is a more technical description....

5. WHAT ?! Hmm, how does it compare?

6. The speaker says... The technical description given for MySQL Group Replication may sound confusing because it has elements from both distributed systems and database systems theory. Between around 1996 and 2006 the two research communities jointly formulated the replication method implemented by MySQL Group Replication.
As a web developer or MySQL DBA you are not expected to know distributed systems theory inside out. Yet to understand the properties of MySQL Group Replication and to get the most out of it, we'll have to touch on some of the concepts. Let's first see how the new stuff compares to the existing.

7. Goals of distributed databases
Availability: cluster as a whole unaffected by loss of nodes
Scalability: geographic distribution; scale size in terms of users and data; database specific: read and/or write load
Distribution Transparency: access, location, migration, relocation (while in use), replication, concurrency, failure

8. The speaker says... MySQL Group Replication is about building a distributed database. To catalog it and compare it with the existing MySQL solutions in this area, we can ask what the goals of distributed databases are.
The goals lead to some criteria that are used to give a first, brief overview.
Goal: a distributed database cluster strives for maximum availability and scalability while maintaining distribution transparency.
Criteria: availability, scalability, distribution transparency.

9. MySQL clustering cheat sheet
Availability. MySQL Replication: primary = SPoF, no auto failover. MySQL Cluster: shared nothing, auto failover. MySQL Fabric: SPoF monitored, auto failover.
Scalability. MySQL Replication: reads. MySQL Cluster: partial replication, node limit. MySQL Fabric: partial replication, no node limit.
Scale on WAN. MySQL Replication: asynchronous. MySQL Cluster: synchronous (WAN option). MySQL Fabric: asynchronous (depends).
Distribution Transparency. MySQL Replication: R/W splitting. MySQL Cluster: SQL: yes (low level: no). MySQL Fabric: special clients, no distributed queries.

10. The speaker says... Already today MySQL has three solutions to build a distributed MySQL cluster: MySQL Replication, MySQL Cluster and MySQL Fabric. Each system has different optimizations, none can achieve all the goals of a distributed cluster at once. Some goals are orthogonal.
Take MySQL Cluster. MySQL Cluster is a shared nothing system. Data storage is redundant, nodes fail independently. Transparent sharding (partial replication) ensures read and write scalability until the maximum number of nodes is reached. Great for clients: any SQL node runs any SQL, synchronous updates become visible immediately everywhere. But it won't scale on slow WAN connections.

11. How Group Replication fits in (columns: Repl., Cluster, Group Repl., Fabric; only Cluster and Group Repl. are filled in)
Availability. Cluster: shared nothing, auto failover. Group Repl.: shared nothing, auto failover/join.
Scalability. Cluster: partial replication, node limit. Group Repl.: full replication, read and some write scalability.
Scale on WAN. Cluster: synchronous (WAN option). Group Repl.: (virtually) synchronous.
Distribution Transparency. Cluster: SQL: yes (low level: no). Group Repl.: all nodes run all SQL.

12. The speaker says... MySQL Group Replication has many of the desirable properties of MySQL Cluster. It's strong on availability and client friendly due to the distribution transparency. No complex client or application logic is required to use the cluster.
So, how do the two differ? Unlike MySQL Cluster, MySQL Group Replication supports the InnoDB storage engine. InnoDB is the dominant storage engine for web applications. This makes MySQL Group Replication a very attractive choice for small clusters (3-7 nodes) running Drupal or WordPress in LAN settings! Also, Group Replication is not synchronous in the strict technical sense. For practical matters it is.

13. Group Replication (vs. Cluster)
Availability: nodes fail independently; cluster continues operation in case of node failures
Scalability: geographic distribution: n/a, needs fast messaging; all nodes accept writes, mild write scalability; all nodes accept reads, full read scalability
Distribution Transparency: full replication: all nodes have all the data; fail stop model: developer freed from worrying about consistency

14. The speaker says... Another major difference between MySQL Cluster and MySQL Group Replication is the use of partial replication versus full replication. MySQL Cluster has transparent sharding (partial replication) built in. On the inside, on the level of so-called MySQL Cluster data nodes, not every node has all the data. Writes don't add work to all nodes of the cluster but only to a subset of them. Partial replication is the only known solution to write scalability. With MySQL Group Replication all nodes have all the data. Writes can be executed concurrently on different nodes but each write must be coordinated with every other node. Time to dig deeper >:).

15. Eager update everywhere... ?!

16. A developer's categorization...
Two questions: Where are transactions run (Primary Copy or Update Everywhere)? When does synchronization happen (Eager or Lazy)?
Eager, Primary Copy: (MySQL semi-synch Replication)
Eager, Update Everywhere: MySQL Cluster, MySQL Group Replication, 3rd party: Galera
Lazy, Primary Copy: MySQL Replication/Fabric, 3rd party: Tungsten
Lazy, Update Everywhere: MySQL Cluster Replication

17. The speaker says... I've described MySQL Group Replication as an eager update everywhere system.
The term comes from a categorization of different database replication systems by two questions:
- where can transactions be run?
- when are transactions synchronized between nodes?
The answers to the questions tell a developer which challenges to expect. The answers determine which additional tasks an application must handle when it is run on a cluster instead of a single server.

18. Lazy causes work... [Diagram: a client sets price = 1.23 on one node; the other three nodes show price = 1.00, price = 1.23 and price = 0.98.]

19. The speaker says... When you try to scale an application by running it on a lazy (asynchronous) replication cluster instead of a single server, you will soon have users complaining about outdated and incorrect data. Depending on which node the application connects to after a write, a user may or may not see his own updates. This can happen neither on a single server system nor on an eager (synchronous) replication cluster. Lazy replication causes extra work for the developer.
BTW, have a look at PECL/mysqlnd_ms. It abstracts the problem of consistency for you. Things like read-your-writes boil down to a single function call.

20. Primary Copy causes work... [Diagram: one primary and three copies; labels: Write (primary), Read (every node).]

21. The speaker says... Judging from the developer perspective only, primary copy is an undesired replication solution. In a primary copy system only one node accepts writes. The other nodes copy the updates performed on the primary. Because of the read-write splitting, the replication system does not need to coordinate conflicting operations. Great for the replication system author, bad for the developer. As a developer you must ensure that all write operations are directed to the primary node... Again, have a look at PECL/mysqlnd_ms.
MySQL Replication follows this approach. Worse, MySQL Replication is a lazy primary copy system.

22. Love: Eager Update Everywhere [Diagram: three nodes, each accepting writes and reads, all showing price = 1.23.]

23. The speaker says... From a developer perspective an eager update everywhere system, like MySQL Group Replication, is indistinguishable from a single node. The only extra work it brings you is load balancing, but that is the case with any cluster. An eager update everywhere cluster improves distribution transparency and removes the risk of reading stale data. Transparency and flexibility are improved because any transaction can be directed to any replica. (Sometimes synchronization happens as part of the commit, thus strong consistency can be achieved.) Fault tolerance is better than with Primary Copy. There is no single point of failure, such as a single primary, that can cause a total outage of the cluster. Nodes may fail individually without bringing the cluster down immediately.

24. HOW? Distributed + DB? Database state machine?

25. The speaker says... In the mid-1990s two observations made the database and distributed systems theory communities wonder if they could develop a joint replication approach.
First, Gray et al. (database community) showed that the common two-phase locking has an expected deadlock rate that grows with the third power of the number of replicas. Second, Schiper and Raynal noted that transactions have common properties with group communication principles (distributed systems) such as ordering, agreement/'all-or-nothing' and even durability.

26. Three building blocks
State machine replication: trivial to understand
Atomic Broadcast: database meets distributed systems community; OMG, how easy state machine replication is to implement!
Deferred Update Database Replication: database meets distributed systems community; how we gain high availability and high performance; what those MySQL Replication team blogs talk about ;-)

27. The speaker says... Finally, in 1999 Pedone, Guerraoui and Schiper published the paper The Database State Machine Approach.
The paper combines two well-known building blocks for replication with a messaging primitive common in the distributed systems world: atomic broadcast.
MySQL Group Replication is slightly different from this 1999 version; it more closely follows a later refinement from 2005 plus a bit of additional ease-of-use. However, by the end of this chapter you will have learned how MySQL Cluster and MySQL Group Replication differ beyond InnoDB support and built-in sharding.

28. State machine replication [Diagram: the input 'Set A = 1' goes to three replicas; each produces the output A = 1.]

29. The speaker says... The first building block is trivial: a state machine. A state machine takes some input and produces some output. Assume your state machines are deterministic. Then, if you have a set of replicas all running the same state machine and they all get the same input, they all will produce the same output. As an aside: state machine replication is also known as active replication. Active means that every replica executes all the operations, so active adds compute load to every replica. With passive replication, also called primary-backup replication, one replica (primary) executes the operations and forwards the results to the others. Passive suffers from its dependence on primary availability and possibly on network bandwidth.

30. Requirement: Agreement [Diagram: the input 'Set A = 1' is sent to three replicas; the outputs shown are A = 1 and A = NULL.]

31. The speaker says... Here's more trivia about the state machine replication approach. There are two requirements for it to work. Quite obviously, every replica has to receive all input to come to the same output. And the precondition for receiving input is that the replica is still alive.
In academic words the requirement is: agreement. Every non-faulty replica receives every request. Non-faulty replicas must agree on the input.

32. Requirement: Order
Operations: 1) Set A = 1, 2) Set B = 1, 3) Set B = A * 2
Replica with input 1, 2, 3: A = 1, B = 2
Replica with input 1, 3, 2: A = 1, B = 1
Replica with input 3, 1, 2: A = 1, B = 1

33. The speaker says... The second trivial requirement for state machine replication is ordering. To produce the same output any two state machines must execute the very same input including the ordering of input operations. The academic wording goes: if a replica processes request r1 before r2, then no replica processes request r2 before r1. Note that if operations commute, some reordering may still lead to correct output. The sequence A = 1, B = 1, B = A * 2 and the sequence B = 1, A = 1, B = A * 2 produce the same output.
(Unrelated here: the database scaling talk touches the fancy commutative replicated data types Riak offers... hot!)

34. Atomic Broadcast
Distributed systems messaging abstraction; meets all replicated state machine requirements
Agreement: if a site delivers a message m then every site delivers m
Order: no two sites deliver any two messages in different orders
Termination: if a site broadcasts message m and does not fail, then every site eventually delivers m; we need this in asynchronous environments

35. The speaker says... State machine replication is the first building block for understanding the database state machine approach. The second building block is a messaging abstraction from the distributed systems world called atomic broadcast. Atomic broadcast provides all the properties required for state machine replication: agreement and ordering. It adds a property needed for communication in an asynchronous system, such as a system communicating via network messages: termination.
All in all, this greatly simplifies state machine replication and contributes to a simple, layered design.

36. Delivery, durability, group [Diagram: a client and an outsider 'Mr. X' next to a group of replicas; caption: sent first, possibly delivered second.]

37. The speaker says... The atomic broadcast properties given are copied literally from the original paper describing the database state machine replication approach. There are two things in it not explained yet.
First, atomic broadcast defines properties in terms of message delivery. The delivery property not only ensures total ordering despite slow transport but also covers message loss (MySQL desires uniform agreement here, which is more than Corosync provides) and even the crash and recovery of processors (durability)! A recovering processor must first deliver outstanding messages before it continues.
Second, note that atomic broadcast introduces the notion of a group. Only (correct) members of a group can exchange messages.

38. Deferred Update: the best? [Diagram: functional model with clients and replicas; phases: Client Request, Server Coordination, Execution, Agreement, Client Response.]

39. The speaker says... We are almost there. The third building block of database state machine replication is deferred update database replication. The slide shows a generic functional model used by Pedone and Schiper in 2010 to illustrate their choice of deferred update.
The argument goes that deferred update combines the best of the two most prominent object replication techniques: active and passive replication. Only the combination of the best from the two will give both high availability and high performance.
Translation: MySQL Group Replication can, in theory, have higher overall throughput than MySQL Replication. Do you love the theory ;-) ? As a DBA you should.

40. Active Replication (SM) [Diagram steps: client sends op to all; requests get ordered; execution; all reply to client.]

41. The speaker says... In an active replication system, a pure state machine replication system, the client operations are forwarded to all replicas and each replica individually executes the operation. The two challenges are to ensure that all replicas execute requests in the same order and that all replicas decide the same. Recall that we are talking about multi-threaded database servers here.
A downside is that every replica has to execute the operation. If the operation is expensive in terms of CPU, this can be a waste of CPU time.
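The ordering requirement of state machine (active) replication can be replayed in a few lines of Python, using the three operations from slide 32. This is only a sketch under stated assumptions: the operation table `OPS`, the helper `run_replica` and the modelling of SQL NULL as Python `None` are illustrative choices, not something taken from the talk or from MySQL.

```python
from typing import Callable, Dict, List, Optional

State = Dict[str, Optional[int]]


def set_a(state: State) -> None:
    """Operation 1: Set A = 1."""
    state["A"] = 1


def set_b(state: State) -> None:
    """Operation 2: Set B = 1."""
    state["B"] = 1


def set_b_from_a(state: State) -> None:
    """Operation 3: Set B = A * 2 (assumption: NULL * 2 stays NULL)."""
    state["B"] = None if state["A"] is None else state["A"] * 2


# Hypothetical operation table, numbered as on slide 32.
OPS: Dict[int, Callable[[State], None]] = {1: set_a, 2: set_b, 3: set_b_from_a}


def run_replica(order: List[int]) -> State:
    """A deterministic state machine: apply the operations in the given order."""
    state: State = {"A": None, "B": None}
    for op_id in order:
        OPS[op_id](state)
    return state


# Same input in the same order: every replica ends in the same state.
in_order = [run_replica([1, 2, 3]) for _ in range(3)]

# Same input in different orders: the replicas may diverge.
reordered = [run_replica(order) for order in ([1, 2, 3], [1, 3, 2], [3, 1, 2])]

# Operations 1 and 2 commute, so swapping them is harmless.
commuted = run_replica([2, 1, 3])

print(in_order)
print(reordered)
print(commuted)
```

Running it shows the three identically ordered replicas agreeing, while the reordered ones end with different values for B, which is exactly what a total order on delivery is meant to prevent.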
42. Passive Replication [Diagram steps: client sends op to primary; only primary executes; primary forwards changes to the backups; primary replies to client.]

43. The speaker says... The alternative is passive replication or primary-backup replication. Here, the client talks to only one server, the primary. Only the primary server executes client operations. After computation of the result, the primary forwards the changes to the backups, which apply them.
The problem here is that the primary determines the system's throughput. None of the backups can contribute its computing power to the overall system throughput.

44. Multi-primary (pass.) replication
What we want... for performance: more than one primary; for scalability: no distributed locking; ... and of course: transactions
Two-staged transaction protocol [Diagram: a client and three primaries; stages: transaction processing, transaction termination.]

45. The speaker says... Multi-primary (passive) replication has all the ingredients desired.
Transaction processing is two-staged. First, a client picks any replica to execute a transaction. This replica becomes the primary of the transaction. The transaction executes locally; this stage is called transaction processing. In the second stage, during transaction termination, the primaries jointly decide whether the transaction can commit or must abort.
Because updates are not immediately applied, database folks call this deferred update: our last building block.

46. Deferred Update DB Replication
Deterministic certification: reads execute locally, updates get certified; certification ensures transaction serializability; replicas decide independently about the certification result
[Diagram: a read is served by one primary; for a write, Rs/Ws/U (readset, writeset, updates) are sent to all primaries.]

47. The speaker says... One property of transactions is isolation. Isolation is also known as serializability: the concurrent execution of transactions should be equivalent to a serial execution of the same transactions.
In a Deferred Update system, read transactions are processed and terminated on one replica and serialized locally.
Updates must be certified. After the transaction processing the readset, writeset and updates are sent to all other replicas. The servers then decide in a deterministic procedure whether (one-copy) serializability holds, that is, whether the transaction commits. Because it's a deterministic procedure, the servers can certify transactions independently!

48. Options for termination
Atomic Broadcast based: this is what is used, by MySQL, by DBSM
Optimization: Reordering (atop Atomic Broadcast): in theory it means fewer transaction aborts
Optimization limit: Generic Broadcast based: this has issues, which make it nasty
Atomic Commit based: more transaction aborts than atomic broadcast

49. The speaker says... There are several ways of implementing the termination protocol and the certification. There are two truly distinct choices: atomic broadcast and atomic commit. Atomic commit causes more transaction aborts than atomic broadcast. So, it's out and atomic broadcast remains.
Atomic broadcast can in theory be further optimized towards fewer transaction aborts using reordering. For practical matters, this is about where the optimizations end. A weaker (and possibly faster) generic broadcast causes problems in the transactional model. For databases, it could be an over-optimization.

50. Generic certification test
Transactions have a state: Executing, Committing, Committed, Aborted
Reads are handled locally
Updates are sent to all replicas: readset and writeset are forwarded
On each replica: search for 'conflicting' transactions. Can be serialized with all previous transactions? Commit! Commit? Abort local transactions that overlap with the update

51. The speaker says... No matter what termination procedure is used, the basic procedure for certification in the deferred update model is always the same. Updates/writes need certification.
The data read and the data written by a transaction are forwarded to all other replicas.
Every replica searches for potentially 'conflicting' transactions; the details depend on the termination procedure. A transaction is decided to commit if it does not violate serializability with all previous transactions. Any local transaction currently running and conflicting with the update is aborted.

52. Database State Machine
Deferred Update Database Replication as a state machine; Atomic Broadcast based termination
[Architecture diagram, top to bottom: MySQL; plugin services and transaction hooks; plugins: MySQL Group Replication with Capture, Apply, Recover; replication protocol incl. termination protocol/certifier; Group Communication System.]

53. The speaker says... The Database State Machine Approach combines all the bits and pieces. Let's do a bottom-up summary. Atomic broadcast not only frees the database developer from bothering with networking APIs, it also solves the nasty bits of communicating in an asynchronous network. It provides properties that meet the requirements of state machine replication. A deterministic state machine is what one needs to implement the termination protocol within deferred update replication. Deferred update replication does not use distributed locking, which Gray proved problematic, and it combines the best of active and passive replication. Side effects: simple replication protocol, layered code.

54. The termination algorithm
Updates are sent to all replicas: readset and writeset are forwarded
Step 1 - On each replica: certify. Is there any committed transaction that conflicts? (In the original paper: check for write-read conflicts between the committing transaction and committed transactions: does the committing transaction's readset overlap with any committed transaction's writeset? Works slightly differently in MySQL.)
Step 2 - On each replica: commitment. Apply transactions decided to commit. Handle concurrent local transactions: remote wins

55. The speaker says... The termination process has two logical steps, just like the general one presented earlier. The very details of how exactly two transactions are checked for conflicts in the first step don't matter here. MySQL Group Replication is using a refinement of the algorithm tailored to its own needs. As a developer all you need to know is: a remote transaction always wins no matter how expensive local transactions are. And, keep conflicting writes on one replica. It's faster.
The puzzling bit on the slide is the rule to check a committing transaction against any committed transaction for conflicts. Any !? Not any... only concurrent.

56. What's concurrent?
Any other transaction that does not precede the current one
Recall: total ordering
Recall: asynchronous, delay between broadcast and delivery
[Diagram: five replicas; broadcast and delivery of messages 1 and 2 in total order.]

57. The speaker says... The definition of what concurrent means is a bit tricky. It's defined through a negation, and that's confusing at first glance but hopefully becomes clear on the next slide.
Concurrent to a transaction is any other transaction that does not precede it. If we know the order of all transactions in the entire cluster, then we can tell which transactions precede one another.
Atomic broadcast ensures total order on delivery. Some implementations decide on ordering when sending and that number (logical clock) could be used. Any logical clock works.

58. Certify against all previous?
[Diagram: five replicas; transactions 2, 3 and 4 in total order.]
Broadcast: transaction 4 is based on all previous up to 2
Certification when 4 is delivered: check conflicts with trx > 2 and trx < 4

59. The speaker says... The slide has an example of how to find the other transactions that are concurrent to a given one.
When a transaction enters the committing state and is broadcast, the broadcast includes the logical time (= total order number on the slide) of the latest transaction committed on the replica.
Eventually the transaction is delivered on all sites. Upon delivery the certification considers all transactions that happened after the logical time of the transaction to be certified. All those transactions come before the one to be certified in the total order, but they executed concurrently at different replicas. We don't have to look further into the past. Further in the past is stuff that's been decided on already.

60. TIME TO BREATHE. MySQL is different anyway...

61. The speaker says... Good news! The algorithm used by MySQL Group Replication is different and simpler. For correctness, the precedes relation is still relevant. But it comes for free...

62. A developer's view on commit [Diagram: a client and five replicas; labels: BEGIN, COMMIT, Result, Execute, Certify, Apply.]

63. The speaker says... We are not done with the theory yet but let's do some slides that take the developer's perspective. Assuming you have to scale a PHP application, assuming a small cluster of a handful of MySQL servers is enough and assuming these servers are co-located on racks, then MySQL Group Replication is your best possible choice.
Did you get this from the theory? Replication is 'synchronous'. On commit you wait only for the server you are connected to. Once your transaction is broadcast, you are done. You don't wait for the other servers to execute the transaction. With uniform atomic broadcast, once your transaction is broadcast, it cannot get lost. (That's why I torture you with theory.)

64. MySQL Replication [Diagram: a client, a master and slaves; labels: BEGIN, COMMIT, OK, Execute, Bin log etc., Fetch, Apply.]

65. The speaker says... If your network is slow, or if mother earth, the speed of light and network message round trip time add too much to your transaction execution time, then asynchronous MySQL Replication is a better choice.
In MySQL Replication the master (primary) never waits for the network. Not even to broadcast updates. Slaves asynchronously pull changes. Besides pushing work onto the developer, this approach has the downside that a hardware crash on the master can cause transaction loss. Slaves may or may not have pulled the latest data.

66. MySQL Semi-sync Replication [Diagram: a client, a master and slaves; labels: BEGIN, COMMIT, Wait for first ACK, OK, Execute, Bin log, Fetch, Apply.]

67. The speaker says... In the times of MySQL 5.0 the MySQL Community suggested that, to avoid transaction loss, the master should wait for one slave to acknowledge it has fetched the update from the master. The fact that it's fetched does not mean that it's been applied. The update may not be visible to clients yet.
It is a back and forth whether database replication should be asynchronous or not. It depends on your needs.
Back to theory after this break.

68. Back to theory! Virtual Synchrony?

69. Virtual Synchrony
Groups and views; a turbo-charged version of Atomic Broadcast
[Diagram: processes P1 to P4 exchange messages M1 to M4; a view change (VC) turns G1 = {P1, P2, P3} into G2 = {P1, P2, P3, P4}.]

70. The speaker says... Good news! Virtual Synchrony and Atomic Broadcast are essentially the same. Our Atomic Broadcast definition assumes a static group. Adding group members, removing members or detecting failed ones is not covered.
Virtual Synchrony handles all these membership changes. Whenever an existing group agrees on changes, a new view is installed through a view change (VC) event.
(The term 'virtual': it's not synchronous. There is a short message delay that we don't want to wait for. Yet, the system appears to be synchronous to most real life observers.)

71. Virtual Synchrony
View changes act as a message barrier; that's a case causing troubles in Two-Phase Commit
[Diagram: messages M5 to M8; a view change (VC) turns G2 = {P1, P2, P3, P4} into G3 = {P1, P2, P3}.]

72. The speaker says... View changes are message barriers. If the group members suspect a member to have failed, they install a new view. Maybe the former member was not dead but just too slow to respond, or disconnected for a brief period. False alarm. The former member then tries to broadcast some updates. Virtual Synchrony ensures that the updates will not be seen by the remaining members. Furthermore, the former member will realize that it was excluded.
Some GCS implementing virtual synchrony even provide abstractions that ensure a joining member learns all updates it missed (state transfer) before it rejoins.

73. Auto-everything: failover
MySQL Group Replication has a pluggable GCS API
Split brain handling? Depends on GCS and/or GCS config
Default GCS is Corosync
[Diagram: six MySQL nodes.]

74. The speaker says... Good news! The Virtual Synchrony group membership advantages are fully exposed to the user level: node failures are detected and handled automatically. PECL/mysqlnd_ms can help you on the client side. It's a minor tweak to have it automatically learn about the remaining MySQL servers. Expect an update release soon.
MySQL Group Replication works with any Group Communication System that can be accessed from C and implements Virtual Synchrony. The default choice is Corosync. Split brain handling is GCS dependent. MySQL follows view change notifications of the GCS.

75. Auto-everything: joining
Elastic cluster grows and shrinks on demand
State transfer done via asynch replication channel
[Diagram: MySQL nodes; a donor performs the state transfer to the joiner.]

76. The speaker says... Good news! When adding a server you don't fiddle with the very details. You start the server, tell it to join the cluster and wait for it to catch up.
The server picks a donor, begins fetching updates using much of the existing MySQL Replication code infrastructure and that's it.

77. Back to theory! Generalized Snapshot Isolation

78. Deferred Update tweak
Transaction readset does not need to be broadcast: readset is hard to extract and can be huge
Weaker serializability level than 1SR; sufficient for InnoDB default isolation
[Diagram: a read is served by one primary; for a write, V/Ws/U are sent to all primaries.]

79. The speaker says... Good news! This is the last bit of theory. The original Database State Machine proposal was followed by a simpler to implement proposal in 2005. If the cluster's serialization level is marginally lowered to snapshot, certification becomes easier. Generalized snapshot isolation can be achieved without having to broadcast the readset of transactions. Recording the readset of a transaction is difficult in most existing databases. Also, readsets can be huge.
Snapshot isolation is an isolation level for multi-version concurrency control. MVCC? InnoDB! Somehow... Whatever, this is the base algorithm of the MySQL Group Replication termination.

80. Snapshot Isolation
Concurrent and write conflict? First committer wins!
Reads use the snapshot from the beginning of the transaction
T1 (first committer): BEGIN(v1), W(v1, x=1), COMMIT!, x:v2=1
T2: BEGIN(v1), W(v1, x=2), ..., COMMIT?
Conflict (both change x): concurrent write (version 1)

81. The speaker says... In Snapshot Isolation, transactions take a snapshot when they begin. All reads return data from this snapshot. Although any other concurrent transaction may update the underlying data while the transaction still runs, the change is invisible; the transaction runs in isolation. If two concurrent transactions change the same data item they conflict. In case of conflicts, the first committer wins.
MVCC requires that, as part of the update of a data item, its version is incremented. Future transactions will base their snapshot on the new version.

82. The actual termination protocol [Diagram: a replica broadcasts Write(v2, x=1) to five replicas; each certifies it against its certification index (object x: latest version 1; object y: latest version 13); result: OK.]

83. The speaker says... Every replica checks the version of a write during certification. It compares the version number of the write's data item with the latest it knows of. If the version is higher than or equal to the one found in the replica's certification index, the write is accepted. A lower number indicates that someone has already updated the data item before. Because the first committer must win, a write showing a lower version number than the one in the certification index must abort.
(The certification index fills over time and is truncated periodically by MySQL. MySQL reports the size through Performance Schema tables.)

84. Hmm... Does it work?

85. It's a preview, there are limits
General: InnoDB only; Corosync lacks uniform agreement; no rules to prevent split-brain (it's a preview, you're allowed to fool yourself if you misconfigure the GCS!)
Isolation level: primary key based; foreign keys and unique keys not supported yet
No concurrent DDL

86. That's it, folks! Questions?

87. The speaker says... (Oh, a question. Flips slide)

88. Network messages pffft!
MySQL super hero at Facebook @markcallaghan, Sep 30: For MySQL sync replication, when all commits originate from 1 master is there 1 network round trip or 2? http://mysqlhighavailability.com/mysql-group-replication-hello-world
@Ulf_Wendel: @markcallaghan AFAIK, on the logical level, there should be one. Some of your questions might depend on the GCS used. The GCS is pluggable
@markcallaghan: @Ulf_Wendel @h_ingo Henrik tells me it is "certification based" so I remain confused

89. GCS != MySQL Semi-sync
It's many round trips, how many depends on GCS
Default GCS is Corosync, Corosync is Totem ring
Corosync uses a privilege-based approach for total ordering
Many options: fixed sequencer, moving sequencer, ...
Where you run your updates only impacts collision rate
[Diagram: three MySQL servers, each paired with a Corosync instance.]

90. The speaker says... No Mark, MySQL Group Replication cannot be understood as a replacement for MySQL Semi-sync Replication. The question about network round trips is hard to answer. Atomic Broadcast and Virtual Synchrony stack many subprotocols together. Let's consider a stable group, no network failure, Totem. Totem orders messages using a token that circulates along a virtual ring of all members. Whoever has the token has the privilege to broadcast. Others wait for the token to appear. Atomic Broadcast gives us all or nothing messaging. It takes at least another full round on the ring to be sure the broadcast has been received by all. How many round trips is that? Welcome to distributed systems...

91. THE END. Contact: [email protected]

92. The speaker says... Thank you for your attendance! Upcoming shows: Talk&Show! - YourPlace, any time
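Looking back at the termination protocol on slides 80 to 83, the version check with 'first committer wins' can be sketched in Python. This is a minimal illustration under loud assumptions: the class `CertificationIndex`, the helper `deliver`, and the rule that an accepted write bumps the item to 'based-on version + 1' are all hypothetical simplifications. It is not how MySQL Group Replication implements its certifier.

```python
from typing import Dict, List, Optional, Tuple

# A writeset maps each data item to the version the transaction's snapshot saw.
WriteSet = Dict[str, int]


class CertificationIndex:
    """Hypothetical per-replica index: data item -> latest certified version."""

    def __init__(self, initial: Optional[Dict[str, int]] = None) -> None:
        self.latest: Dict[str, int] = dict(initial or {})

    def certify(self, writes: WriteSet) -> bool:
        """Accept the writeset unless some item was already updated past its snapshot."""
        for item, based_on in writes.items():
            if based_on < self.latest.get(item, 0):
                return False  # an earlier committer already changed this item
        for item, based_on in writes.items():
            # Assumption: committing a write creates the next version of the item.
            self.latest[item] = based_on + 1
        return True


def deliver(index: CertificationIndex,
            ordered: List[Tuple[str, WriteSet]]) -> List[Tuple[str, bool]]:
    """Certify transactions in the (total) order in which they were delivered."""
    return [(name, index.certify(writes)) for name, writes in ordered]


# T1 and T2 both wrote x based on version 1 (the conflict on slide 80); T3 wrote y.
delivery_order = [("T1", {"x": 1}), ("T2", {"x": 1}), ("T3", {"y": 13})]

# Two replicas start with the index shown on slide 82 and see the same order.
replica_a = CertificationIndex({"x": 1, "y": 13})
replica_b = CertificationIndex({"x": 1, "y": 13})
decisions_a = deliver(replica_a, delivery_order)
decisions_b = deliver(replica_b, delivery_order)

print(decisions_a)
print(decisions_a == decisions_b)
```

Because the check is deterministic and both replicas process the same delivery order, they reach the same commit and abort decisions without exchanging any further messages: T1 commits, T2 aborts as the second committer on x, and T3 commits.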