Disque: a detailed overview of the distributed implementation - Salvatore Sanfilippo

Disque A new distributed message queue @antirez - Redis Labs

Upload: distributed-matters

Post on 17-Jan-2017



TRANSCRIPT

Disque: a new distributed message queue

@antirez - Redis Labs

Redis roots

• In memory, optional persistence.

• Same protocol.

• BSD license.

Asynchronous job execution, microservices bus, distributed timer

API

ADDJOB queue job <timeout>
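Since Disque keeps the Redis protocol, a command such as ADDJOB can be framed the same way a Redis command is: as a RESP array of bulk strings. The sketch below shows only that framing. The function name `resp_encode` is hypothetical, and for brevity it treats arguments as NUL-terminated strings, so it is not binary safe.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Encode argv[0..argc-1] as a RESP array of bulk strings into buf.
 * Returns the number of bytes written (excluding the trailing NUL),
 * or -1 if buf is too small. Hypothetical helper, not part of Disque. */
int resp_encode(char *buf, size_t cap, int argc, const char **argv) {
    assert(buf != NULL && argv != NULL && argc > 0);

    /* Array header: "*<number of arguments>\r\n". */
    int n = snprintf(buf, cap, "*%d\r\n", argc);
    if (n < 0 || (size_t)n >= cap) return -1;
    size_t off = (size_t)n;

    for (int i = 0; i < argc; i++) {
        /* Each argument: "$<length>\r\n<bytes>\r\n". */
        size_t len = strlen(argv[i]);
        n = snprintf(buf + off, cap - off, "$%zu\r\n%s\r\n", len, argv[i]);
        if (n < 0 || (size_t)n >= cap - off) return -1;
        off += (size_t)n;
    }
    return (int)off;
}
```

Calling it with the four arguments `ADDJOB`, `queue`, `job`, `0` fills the buffer with the bytes a client would write to the socket for that command.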

Disque Job IDs

DI8497c0098d456946843784d3ea41af5525c741bf05a0SQ

Node ID prefix (32 bit)

Unique Message ID (128 bit)

TTL in minutes (16 bit)
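The three fields can be pulled out of the sample ID with a few lines of C. This is a sketch under assumptions: it takes the ID to be `DI`, then the fields as hex in the order the slide lists them (8, 32 and 4 characters), then `SQ`, which matches the 48-character sample above. The type `job_id_parts` and the function `parse_job_id` are hypothetical names, not Disque's own.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SAMPLE_JOB_ID_LEN 48 /* "DI" + 8 + 32 + 4 hex chars + "SQ" (assumed layout). */

/* Hypothetical container for the three fields named on the slide. */
typedef struct {
    uint32_t node_prefix;    /* Node ID prefix (32 bit). */
    char message_id[33];     /* Unique message ID (128 bit), kept as 32 hex chars. */
    uint16_t ttl_minutes;    /* TTL in minutes (16 bit). */
} job_id_parts;

/* Parse exactly n (<= 8) hex characters into *out. Returns 0 on success. */
static int parse_hex(const char *s, size_t n, uint32_t *out) {
    uint32_t v = 0;
    for (size_t i = 0; i < n; i++) {
        char c = s[i];
        uint32_t d;
        if (c >= '0' && c <= '9') d = (uint32_t)(c - '0');
        else if (c >= 'a' && c <= 'f') d = (uint32_t)(c - 'a') + 10;
        else if (c >= 'A' && c <= 'F') d = (uint32_t)(c - 'A') + 10;
        else return -1;
        v = (v << 4) | d;
    }
    *out = v;
    return 0;
}

/* Split a job ID into its parts. Returns 0 on success, -1 on bad input. */
int parse_job_id(const char *id, job_id_parts *out) {
    assert(id != NULL && out != NULL);
    uint32_t prefix, ttl, scratch;

    if (strlen(id) != SAMPLE_JOB_ID_LEN) return -1;
    if (memcmp(id, "DI", 2) != 0 || memcmp(id + 46, "SQ", 2) != 0) return -1;
    if (parse_hex(id + 2, 8, &prefix) != 0) return -1;
    /* Validate the 128-bit message ID in four 32-bit chunks. */
    for (int i = 0; i < 4; i++)
        if (parse_hex(id + 10 + i * 8, 8, &scratch) != 0) return -1;
    if (parse_hex(id + 42, 4, &ttl) != 0) return -1;

    out->node_prefix = prefix;
    memcpy(out->message_id, id + 10, 32);
    out->message_id[32] = '\0';
    out->ttl_minutes = (uint16_t)ttl;
    return 0;
}
```

Under that assumed layout, the TTL field of the sample ID is `05a0`, that is 1440 minutes, or 24 hours.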

GETJOB FROM q1 q2

ACKJOB id1 id2 …

Disque is all about explicit acknowledgements.

Delivery semantics

• At least once by default.

• At most once also available.

ADDJOB queue job 0 RETRY 3600

ADDJOB queue job 0 TTL 86400

ADDJOB queue job 0 DELAY 3600

GUARANTEES

Synchronous replication

ADDJOB myqueue task1 0 REPLICATE 3

ADDJOB queue job 0 ASYNC

Persistent (optionally)

LOADJOB … data …

DELJOB … id …

Append Only File

Disque & CAP

• AP.

• Immutable messages (mostly).

• Converge to ACK state.

• CAP “A” availability (single node partition).

At least once delivery

• Liveness: eventually the message will be delivered.

• Safety: messages not yet delivered at least once will never be evicted from the cluster.

• (Unless the message TTL is reached.)

At most once delivery

• Safety: messages already dequeued will never be queued a second time.

• A direct result of replicating to just one node and enqueuing just one time (retry time set to zero).

Federation: all nodes are really the same

Best effort ordering

Main design sacrifice

NACK and retries counters

• Alternative to explicit dead letters.

• Counters consistency is best effort.

• (But it does not matter.)

• GETJOB exposes the two counters.

Disque tries hard to avoid multiple deliveries.

WHY?

• Costly: think of spikes after partitions, or of the CP stores needed to de-dup.

• In certain uses, neither de-dup nor idempotency is needed if the duplication rate is acceptable.

• Not so hard: worth it.

INTERNALS

Message states

ACTIVE, QUEUED, ACKED

ACTIVE

• Node has a copy.

• Not available for delivery.

• ACTIVE -> QUEUED (On retry timer)

• ACTIVE -> ACKED (On ACK received)

QUEUED

• Node has a copy.

• Will deliver via GETJOB.

• QUEUED -> ACTIVE (On delivery)

• QUEUED -> ACKED (On ACK received)

ACKED

• Propagate via SETACK!

• Perform Garbage Collection of message.

• ACKED -> EVICTED (On successful GC)
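The transitions listed for ACTIVE, QUEUED and ACKED can be written down as a small transition function. This is only a sketch of the slides' state diagram: the enum and function names are hypothetical, EVICTED is modeled as a state purely for illustration, and any event the slides do not mention leaves the state unchanged.

```c
#include <assert.h>

/* States named on the slides (identifiers are hypothetical). */
typedef enum { JOB_ACTIVE, JOB_QUEUED, JOB_ACKED, JOB_EVICTED } job_state;

/* Events that label the transitions on the slides. */
typedef enum {
    EV_RETRY_TIMER,   /* Retry timer fired. */
    EV_DELIVERED,     /* Job handed to a client via GETJOB. */
    EV_ACK_RECEIVED,  /* ACK received. */
    EV_GC_DONE        /* Garbage collection succeeded. */
} job_event;

/* Return the state that follows s on event e; unlisted pairs keep s. */
job_state job_next_state(job_state s, job_event e) {
    assert(s >= JOB_ACTIVE && s <= JOB_EVICTED);
    switch (s) {
    case JOB_ACTIVE:
        if (e == EV_RETRY_TIMER) return JOB_QUEUED;
        if (e == EV_ACK_RECEIVED) return JOB_ACKED;
        break;
    case JOB_QUEUED:
        if (e == EV_DELIVERED) return JOB_ACTIVE;
        if (e == EV_ACK_RECEIVED) return JOB_ACKED;
        break;
    case JOB_ACKED:
        if (e == EV_GC_DONE) return JOB_EVICTED;
        break;
    case JOB_EVICTED:
        break; /* Terminal: the node no longer has the job. */
    }
    return s;
}
```

Note that ACKED is only left through garbage collection, which matches the "converge to ACK state" point made earlier: once a node knows a job is acknowledged, nothing moves it back to QUEUED.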

WILLQUEUE message

• Sent 500 ms before the ACTIVE -> QUEUED state change.

• (Diagram: nodes in the QUEUED and ACTIVE states, linked by a WILLQUEUE message.)

QUEUED message

• Sent on the ACTIVE -> QUEUED state change.

• (Diagram: nodes in the ACKED and QUEUED states; labels: "Reset retry timer", "Dequeue if ID1 > ID2", "SETACK".)

(Diagram: "Known source", "Any other node".)

NEEDJOBS and YOURJOBS

• Exponential delay + broadcast & ad-hoc.
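The labels on the QUEUED-message slides ("Reset retry timer", "Dequeue if ID1 > ID2", "SETACK") suggest that a node reacts to an incoming QUEUED message according to the state of its own copy. The sketch below is one reading of those labels, not the author's code. The mapping from local state to action is an assumption, as is taking ID1 to be the sender's node ID and ID2 the receiver's. All identifiers are hypothetical.

```c
#include <assert.h>
#include <string.h>

typedef enum { JOB_ACTIVE, JOB_QUEUED, JOB_ACKED } job_state;

typedef enum {
    ACT_RESET_RETRY_TIMER, /* Postpone our own re-queue of the job. */
    ACT_DEQUEUE,           /* Drop our queued copy; the sender will deliver. */
    ACT_SEND_SETACK        /* Tell the sender the job is already acknowledged. */
} queued_action;

/* Decide what to do when another node reports it queued a job we also hold.
 * Assumed mapping: ACKED -> SETACK, QUEUED -> ID tie-break, ACTIVE -> reset. */
queued_action on_queued_msg(job_state local,
                            const char *sender_id, const char *my_id) {
    assert(sender_id != NULL && my_id != NULL);
    switch (local) {
    case JOB_ACKED:
        return ACT_SEND_SETACK;
    case JOB_QUEUED:
        /* Both nodes have the job queued: only the node with the smaller
         * ID dequeues, so exactly one side keeps it. */
        if (strcmp(sender_id, my_id) > 0) return ACT_DEQUEUE;
        return ACT_RESET_RETRY_TIMER;
    case JOB_ACTIVE:
    default:
        return ACT_RESET_RETRY_TIMER;
    }
}
```

Whichever direction the comparison takes in the real implementation, the useful property is the same: when two nodes with distinct IDs both queue a job and exchange QUEUED messages, one of them dequeues and the other does not, so the job is neither delivered twice nor dropped by both.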

NEEDJOBS triggers

• Clients blocked in GETJOB (and queues are empty).

• Queue drops to zero messages (and import rate > 0).

Message owners

Each node has, for each message, a list* of owners.

* A possibly inconsistent list.

Ehm… some C code.

/* Job representation in memory. */
typedef struct job {
    char id[JOB_ID_LEN];      /* Job ID. */
    unsigned int state:4;     /* Job state: one of JOB_STATE_* states. */
    unsigned int gc_retry:4;  /* GC attempts counter, for exponential delay. */
    uint8_t flags;            /* Job flags. */
    uint16_t repl;            /* Replication factor. */
    uint32_t etime;           /* Job expire time. */
    uint64_t ctime;           /* Job creation time in ms+counter. */
    uint32_t delay;           /* Delay before queueing this job for the 1st time. */
    uint32_t retry;           /* Job re-queue time. */
    uint16_t num_nacks;       /* Number of NACKs this node observed. */
    uint16_t num_deliv;       /* Number of deliveries this node observed. */
    robj *queue;              /* Job queue name. */
    sds body;                 /* Body, or NULL if job is just an ACK. */
    dict *nodes_delivered;    /* Nodes that may have a copy. */
    dict *nodes_confirmed;    /* Nodes that confirmed copy or ack. */
    mstime_t qtime;           /* Next queue time. */
    mstime_t awakeme;         /* Time at which we need to take actions. */
} job;

Immutable, converging, inconsistent

github.com/antirez/disque