Replication in the Wild: Ankara Cloud Meetup, Feb 2017

REPLICATION IN THE WILD Ensar Basri Kahveci

Upload: ankaracloud

Post on 19-Mar-2017



Hello!

Ensar Basri Kahveci

Distributed Systems Engineer @ Hazelcast
Ph.D. Candidate @ Bilkent CS

twitter: metanet | github: metanet | linkedin: basrikahveci

- IMDG: storage + computation + messaging
- Open source, distributed, highly available, elastic, scalable
- Distributed Java collections, JCache, HD store
- Embedded or client-server deployment

- Clients: Java, Scala, C++, C#, Python, Node.js, ...

- Integration modules

- https://blog.hazelcast.com/announcing-hazelcast-imdg-3-8

- Rolling upgrades
- User code deployment
- Hot restart improvements
- WAN replication improvements

REPLICATION
- Putting a data set into multiple nodes.
- Each replica has a full copy.
- A few reasons for replication:
  - Performance
  - Availability and fault tolerance

REPLICATION + PARTITIONING
- Mostly used with partitioning.
- Two partitions: P1, P2
- Two replicas for each partition.
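The two-partition, two-replica layout above can be sketched as follows. This is a minimal illustration, assuming CRC32 hashing for the key-to-partition mapping and round-robin replica placement; the names `partition_for` and `assign_replicas` and the node names are hypothetical and do not describe how any particular system places its replicas.

```python
import zlib


def partition_for(key: str, partition_count: int) -> int:
    """Map a key to a partition with a stable hash (CRC32, chosen for illustration)."""
    return zlib.crc32(key.encode("utf-8")) % partition_count


def assign_replicas(partition_count, nodes, replica_count):
    """Place the replicas of each partition on distinct nodes, round-robin."""
    if replica_count > len(nodes):
        raise ValueError("need at least as many nodes as replicas per partition")
    table = {}
    for p in range(partition_count):
        # The first node in each list acts as the owner, the rest hold backups.
        table[p] = [nodes[(p + i) % len(nodes)] for i in range(replica_count)]
    return table


# Two partitions (P1, P2 on the slide; numbered 0 and 1 here), two replicas each.
table = assign_replicas(partition_count=2, nodes=["node-a", "node-b"], replica_count=2)
```

With two nodes, each node owns one partition and holds a backup of the other, so losing a single node leaves a full copy of both partitions.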

NOTHING FOR FREE!
- Very easy to do when the data is immutable.
- Problems start when we have multiple copies of the data and we want to update them.
- Two main difficulties:
  - Handling updates,
  - Handling failures.

CAP PRINCIPLE
- Proposed by Eric Brewer in 2000 [13],
- Proved by Gilbert and Lynch in 2002 [19].
- A shared-data system cannot achieve perfect consistency and availability in the presence of partitions: CP vs AP.
- Widespread acceptance, and yet a lot of criticism [15-21].

CONSISTENCY AND AVAILABILITY
- Degrees of consistency:
  - Data centric, client centric
- Degrees of availability:
  - High availability, sticky availability, non-availability
- Replication is directly related to C and A. [25]

THE DANGERS OF REPLICATION AND A SOLUTION
- Gray et al. [1] classify replication models by 2 parameters:
  - Where to make updates: primary copy or update anywhere
  - When to make updates: eagerly or lazily

WHERE: PRIMARY COPY
- There is a single replica managing the updates.
- Concurrency control is easy.
- No conflicts and no conflict-handling logic.
- Updates are executed on the primary and secondaries in the same order.
- When the primary fails, a new primary is elected.
- Ensuring a single and good primary is hard.
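The ordering guarantee of the primary copy model can be sketched as below. This is a toy illustration, not any product's protocol: the `Replica` and `Primary` classes are hypothetical, failures and elections are left out, and the primary forwards each update synchronously only to keep the example short.

```python
class Replica:
    def __init__(self):
        self.data = {}
        self.applied = 0  # sequence number of the last applied update

    def apply(self, seq, key, value):
        # Updates must be applied strictly in sequence order.
        if seq != self.applied + 1:
            raise ValueError(f"expected seq {self.applied + 1}, got {seq}")
        self.data[key] = value
        self.applied = seq


class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries

    def write(self, key, value):
        # The primary alone assigns the order, so there is nothing to reconcile.
        seq = self.applied + 1
        self.apply(seq, key, value)
        for secondary in self.secondaries:
            secondary.apply(seq, key, value)
        return seq


secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
primary.write("x", 1)
primary.write("x", 2)
```

Because every replica applies the same updates in the same sequence, all copies end in the same state without any conflict-handling logic.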

WHERE: UPDATE ANYWHERE
- Each replica can initiate a transaction to make an update.
- Complex concurrency control.
- Deadlocks or conflicts are possible.
- In practice, there is also multi-leader.

WHEN: EAGER REPLICATION
- Synchronously updates all replicas as part of one atomic transaction.
- Provides strong consistency.
- Degree of availability can degrade on node failures.
- Consensus algorithms.

WHEN: LAZY REPLICATION
- Updates each replica with a separate transaction.
- Updates can execute quite fast.
- Degree of availability is high.
- Eventual consistency.
- Data copies can diverge.
- Data loss or conflicts can occur.

WHERE? / WHEN?
- EAGER
  - PRIMARY COPY
    - strong consistency, simple concurrency
    - slow, inflexible
  - UPDATE ANYWHERE
    - strong consistency, complex concurrency
    - slow, expensive, deadlocks
- LAZY
  - PRIMARY COPY
    - fast, sticky availability
    - eventual consistency, simple concurrency
    - inconsistency
  - UPDATE ANYWHERE
    - fast, flexible
    - high availability, eventual consistency
    - inconsistency, conflicts

WHERE? / WHEN?
- EAGER
  - PRIMARY COPY
    - Multi Paxos [5]
    - etcd and Consul (Raft) [6]
    - Zookeeper (Zab) [7]
    - Kafka
    - VoltDB [24]
  - UPDATE ANYWHERE
    - Paxos [5]
    - Hazelcast Cluster State Change [12]
    - MySQL 5.7 Group Replication [23]
- LAZY
  - PRIMARY COPY
    - Hazelcast
    - MongoDB
    - ElasticSearch
    - Redis
  - UPDATE ANYWHERE
    - Dynamo [4]
    - Cassandra
    - Riak
    - Hazelcast Active-Active WAN Replication [22]

PRIMARY COPY + EAGER REPLICATION
- When the primary fails, secondaries are guaranteed to be up to date.
- Raft, Kafka etc.
- Majority approach can be used.
- In Kafka, in-sync-replica set [11] is maintained.
- Secondaries can be used for reads.
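The arithmetic behind the majority approach can be sketched as follows. The helper names are hypothetical and the sketch only covers the counting rule used by majority-based protocols; Kafka's in-sync-replica set follows a different rule and is not modelled here.

```python
def majority(cluster_size: int) -> int:
    """Smallest number of replicas that is more than half of the cluster."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many replicas can fail while a majority is still reachable."""
    return cluster_size - majority(cluster_size)


def is_committed(ack_count: int, cluster_size: int) -> bool:
    """An update counts as committed once a majority has acknowledged it."""
    return ack_count >= majority(cluster_size)
```

Any two majorities of the same cluster share at least one replica, which is why a newly elected primary that contacts a majority is guaranteed to learn of every committed update. A 5-node cluster needs 3 acknowledgements and tolerates 2 failures; growing to 6 nodes raises the majority to 4 without tolerating more failures.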

UPDATE ANYWHERE + EAGER REPLICATION
- Each replica can initiate a new transaction.
- Concurrent transactions can compete with each other.
- Possibility of deadlocks.
- In the basic Paxos algorithm, there is no designated leader.

PRIMARY COPY + LAZY REPLICATION
- The primary copy can execute updates fast.
- Secondaries can fall behind the primary. This is called replication lag.
- It can lead to data loss during leader failover, but still no conflicts.
- Implies sticky availability.
- Secondaries can be used for reads.
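Replication lag and the resulting data loss on failover can be sketched as below. This is a toy model under stated assumptions: the `LazyPrimary` class and its backlog queue are hypothetical, the secondary is a plain dictionary, and replication is driven by hand instead of a background thread.

```python
from collections import deque


class LazyPrimary:
    def __init__(self):
        self.data = {}
        self.backlog = deque()  # updates acknowledged but not yet replicated

    def write(self, key, value):
        # Acknowledge as soon as the local copy is updated.
        self.data[key] = value
        self.backlog.append((key, value))
        return "ack"

    def replicate_one(self, secondary):
        # Ship the oldest pending update, preserving the primary's order.
        if self.backlog:
            key, value = self.backlog.popleft()
            secondary[key] = value


primary = LazyPrimary()
secondary = {}
primary.write("a", 1)
primary.write("b", 2)
primary.replicate_one(secondary)

lag = len(primary.backlog)  # updates the secondary has not seen yet
# If the primary crashed now and the secondary were promoted,
# these acknowledged keys would be missing from the new primary.
lost_keys = set(primary.data) - set(secondary)
```

The secondary never sees updates out of order, so nothing conflicts; the only risk is that acknowledged writes still sitting in the backlog disappear with the failed primary.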

UPDATE ANYWHERE + LAZY REPLICATION
- Dynamo-style [4] highly available databases.
- Quorums.
- Concurrent updates are first-class citizens.
- Possibility of conflicts
  - Avoiding, discarding, detecting & resolving conflicts
- Eventual convergence
  - Write repair, read repair and anti-entropy

QUORUMS
- W + R > N
  - W = 3, R = 1, N = 3
  - W = 2, R = 2, N = 3
- If W or R is not met
  - Sloppy quorums and hinted handoff
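The condition W + R > N guarantees that every read quorum shares at least one replica with every write quorum. The sketch below checks that claim by brute force for small clusters; the function names are hypothetical and the check covers only strict quorums, not sloppy quorums or hinted handoff.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """The quorum condition from the slide."""
    return w + r > n


def every_read_meets_every_write(n: int, w: int, r: int) -> bool:
    """Enumerate all write sets of size w and read sets of size r
    and verify that each pair has at least one replica in common."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )
```

For N = 3, both configurations on the slide (W = 3, R = 1 and W = 2, R = 2) satisfy the condition, while W = 1, R = 1 does not: a read may land on a replica that has not received the latest write.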

CONFLICT-FREE REPLICATED DATA TYPES (CRDTs)
- Special data types that achieve strong eventual consistency and monotonicity [2]
- No conflicts
- Merge function has 3 properties:
  - Commutative: A+B=B+A
  - Associative: A+(B+C)=(A+B)+C
  - Idempotent: A+A=A
- Riak Data Types [3]
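A grow-only counter (G-Counter) is one of the simplest state-based CRDTs and shows the three merge properties in a few lines. The sketch below is illustrative only: the `GCounter` class is hypothetical and is not the implementation behind Riak Data Types. Each replica increments its own slot, and merge takes the per-replica maximum.

```python
class GCounter:
    def __init__(self, counts=None):
        self.counts = dict(counts or {})  # replica id -> increments seen from it

    def increment(self, replica_id, amount=1):
        if amount < 1:
            raise ValueError("a grow-only counter cannot decrease")
        self.counts[replica_id] = self.counts.get(replica_id, 0) + amount

    def value(self):
        return sum(self.counts.values())

    def merge(self, other):
        # Per-replica maximum: commutative, associative and idempotent.
        keys = self.counts.keys() | other.counts.keys()
        return GCounter(
            {k: max(self.counts.get(k, 0), other.counts.get(k, 0)) for k in keys}
        )

    def __eq__(self, other):
        return isinstance(other, GCounter) and self.counts == other.counts


a = GCounter()
a.increment("r1", 2)
b = GCounter()
b.increment("r2", 3)
c = GCounter()
c.increment("r1", 1)
c.increment("r3", 4)
```

Because merge is insensitive to order, grouping and repetition, replicas can exchange state in any order and any number of times and still converge to the same value.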

DISCARDING CONFLICTS: LAST WRITE WINS
- When 2 updates are concurrent, define an arbitrary order among them.
  - i.e., pretend that one of them is more recent.
- Attach a timestamp to each write.
- Cassandra uses physical timestamps [8], [9].
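Last write wins can be sketched as a merge that keeps the write with the larger timestamp. The tuple layout and the tie-break on replica id are assumptions made for this illustration and do not describe Cassandra's actual rules.

```python
def lww_merge(write_a, write_b):
    """Each write is a (timestamp, replica_id, value) tuple.
    The higher timestamp wins; equal timestamps are ordered by replica id
    so that every replica picks the same winner."""
    return max(write_a, write_b, key=lambda w: (w[0], w[1]))


w1 = (100, "r1", "red")
w2 = (105, "r2", "blue")
winner = lww_merge(w1, w2)  # w2 wins; the value "red" is silently discarded
```

The merge is deterministic, so replicas converge, but the losing write is dropped without notice. With physical timestamps, clock skew between nodes decides which write that is.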

DETECTING CONFLICTS: VECTOR CLOCKS
- In the Dynamo paper [4], each update is done against a particular version of a data entry.
- Multiple versions of a data entry can exist together.
- Vector clocks [10] are used to track causality.
- The system can determine the authoritative version: syntactic reconciliation
- The system cannot reconcile multiple versions: semantic reconciliation

VECTOR CLOCKS
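Comparing two vector clocks tells whether one version descends from the other or whether the two are concurrent. The sketch below represents a clock as a dictionary from node name to counter; the `compare` function and the node names are hypothetical.

```python
def compare(vc_a, vc_b):
    """Return 'equal', 'before', 'after' or 'concurrent' for clock a relative to b.
    A node missing from a clock counts as 0."""
    nodes = vc_a.keys() | vc_b.keys()
    a_le_b = all(vc_a.get(n, 0) <= vc_b.get(n, 0) for n in nodes)
    b_le_a = all(vc_b.get(n, 0) <= vc_a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b descends from a: a can be dropped
    if b_le_a:
        return "after"       # a descends from b: b can be dropped
    return "concurrent"      # neither descends from the other: a conflict


d1 = {"sx": 1}
d2 = {"sx": 2}
d3 = {"sx": 2, "sy": 1}
d4 = {"sx": 2, "sz": 1}
```

Here `d2` supersedes `d1`, and both `d3` and `d4` supersede `d2`, so the system can discard the older versions on its own (syntactic reconciliation). `d3` and `d4` are concurrent, so both versions must be kept until the client merges them (semantic reconciliation).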

RESOLVING CONFLICTS AND EVENTUAL CONVERGENCE
- Write repair
- Read repair
- Anti-entropy
  - Merkle trees
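Anti-entropy with Merkle trees lets two replicas find the key ranges that differ by comparing hashes from the root down instead of shipping all the data. The sketch below is a simplified illustration: the bucket layout, SHA-256, and the function names are assumptions, and the number of buckets must be a power of two.

```python
import hashlib


def digest(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(buckets):
    """Return the tree as a list of levels, leaves first and root last.
    Each bucket is a dict holding the entries of one key range."""
    level = [digest(repr(sorted(b.items())).encode("utf-8")) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [digest(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_buckets(tree_a, tree_b):
    """Walk from the root down, descending only into subtrees whose hashes differ."""
    suspects = [0]
    for depth in range(len(tree_a) - 1, -1, -1):
        suspects = [i for i in suspects if tree_a[depth][i] != tree_b[depth][i]]
        if depth > 0:
            # Expand each differing node into its two children one level down.
            suspects = [child for i in suspects for child in (2 * i, 2 * i + 1)]
    return suspects


buckets_a = [{"k1": 1}, {"k2": 2}, {"k3": 3}, {"k4": 4}]
buckets_b = [{"k1": 1}, {"k2": 2}, {"k3": 30}, {"k4": 4}]
out_of_sync = differing_buckets(build_tree(buckets_a), build_tree(buckets_b))
```

When the roots match, the replicas are in sync and nothing else is compared. Otherwise only the buckets returned by `differing_buckets` need to be exchanged.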

RECAP
- We apply replication to make distributed systems performant, available and fault tolerant.
- It suffers from core problems of distributed systems.
- Various replication protocols are built based on when and where to make updates.
- No silver bullet. It is a world of trade-offs.

REFERENCES
[1] Gray, Jim, et al. "The dangers of replication and a solution." ACM SIGMOD Record 25.2 (1996): 173-182.
[2] Shapiro, Marc, et al. "Conflict-free replicated data types." Symposium on Self-Stabilizing Systems. Springer, Berlin, Heidelberg, 2011.
[3] http://docs.basho.com/riak/kv/2.2.0/learn/concepts/crdts/
[4] DeCandia, Giuseppe, et al. "Dynamo: Amazon's highly available key-value store." ACM SIGOPS Operating Systems Review 41.6 (2007): 205-220.
[5] Lamport, Leslie. "Paxos made simple." ACM SIGACT News 32.4 (2001): 18-25.
[6] Ongaro, Diego, and John K. Ousterhout. "In Search of an Understandable Consensus Algorithm." USENIX Annual Technical Conference. 2014.
[7] Hunt, Patrick, et al. "ZooKeeper: Wait-free Coordination for Internet-scale Systems." USENIX Annual Technical Conference. Vol. 8. 2010.
[8] http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks
[9] https://aphyr.com/posts/299-the-trouble-with-timestamps
[10] Raynal, Michel, and Mukesh Singhal. "Logical time: Capturing causality in distributed systems." Computer 29.2 (1996): 49-56.
[11] http://kafka.apache.org/documentation.html#replication
[12] http://docs.hazelcast.org/docs/latest/manual/html-single/index.html#managing-cluster-and-member-states
[13] E. Brewer, "Towards Robust Distributed Systems," Proc. 19th Ann. ACM Symp. Principles of Distributed Computing (PODC 00), ACM, 2000, pp. 7-10.
[14] https://codahale.com/you-cant-sacrifice-partition-tolerance/
[15] http://blog.nahurst.com/visual-guide-to-nosql-systems
[16] http://www.allthingsdistributed.com/2008/12/eventually_consistent.html
[17] https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
[18] https://www.infoq.com/articles/cap-twelve-years-later-how-the-rules-have-changed
[19] Gilbert, Seth, and Nancy Lynch. "Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services." ACM SIGACT News 33.2 (2002): 51-59.
[20] https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html
[21] https://henryr.github.io/cap-faq/
[22] http://docs.hazelcast.org/docs/3.7/manual/html-single/index.html#wan-replication
[23] https://dev.mysql.com/doc/refman/5.7/en/group-replication.html
[24] https://www.voltdb.com/architecture
[25] Bailis, Peter, et al. "Highly available transactions: Virtues and limitations." Proceedings of the VLDB Endowment 7.3 (2013): 181-192.

THANKS!
Any questions?