cassandra internals overview

CASSANDRA INTERNALS OVERVIEW

DATASTAX BOOTCAMP 2015

Sam Tunnicliffe

@beobal

OVERVIEW

System startup

Messaging

Gossip

Schema Propagation

Request Coordination

STARTUP

org.apache.cassandra.service.CassandraDaemon

protected void setup()

Load config

Run preflight checks

Load schema

Clean up local temporary state

Recover CommitLog

Schedule background compactions

Initialize storage service

PREFLIGHT CHECKS

Sane clock

JNI

JVM & Instrumentation

Filesystem permissions

System keyspace status

Upgrades (#8049)

Incompatible SSTables (#8049)

CLEAN UP LOCAL STATE

Truncate compactions_in_progress

Scrub data directories

STARTUP

org.apache.cassandra.db.commitlog.CommitLog

public int recover() throws IOException

INITIALIZE STORAGE SERVICE

org.apache.cassandra.service.StorageService

public synchronized void initServer() throws ConfigurationException

Load ring state (unless disabled)

Start gossip & get initial ring info

Set tokens

BOOTSTRAP

Abort if other range movements are happening

Fetch bootstrap data

Build secondary indexes

INITIALIZE STORAGE SERVICE

Load ring state (unless disabled)

Start gossip & get initial ring info

Set tokens

Setup auth resources

Ensure gossip stabilized

STARTUP

-- it is done --

MESSAGINGSERVICE

org.apache.cassandra.net.MessagingService

Low level one-way messaging

public void sendOneWay(MessageOut message, InetAddress to)

Async Request/Response

public int sendRR(MessageOut message, InetAddress to, IAsyncCallback cb)

MESSAGINGSERVICE

org.apache.cassandra.net.MessagingService

Reads

public int sendRRWithFailure(MessageOut message,
                             InetAddress to,
                             IAsyncCallbackWithFailure cb)

Writes

public int sendRR(MessageOut<? extends IMutation> message,
                  InetAddress to,
                  AbstractWriteResponseHandler handler,
                  boolean allowHints)

MESSAGINGSERVICE

Pre-emptively drops messages when overwhelmed

Dropped if time at execution > send time + timeout

Timeout value dependent on message type

Most client-initiated requests can be dropped

(see MessagingService.DROPPABLE_VERBS)
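The drop rule above can be sketched as a small predicate. This is only an illustration: the class name, the three-verb enum, and the timeout values are assumptions made for the example, and the real set of droppable verbs is whatever MessagingService.DROPPABLE_VERBS contains.

```java
import java.util.EnumSet;
import java.util.Set;

final class DropPolicySketch {
    // Hypothetical verb set; the real list lives in MessagingService.DROPPABLE_VERBS.
    enum Verb { READ, MUTATION, GOSSIP }

    static final Set<Verb> DROPPABLE = EnumSet.of(Verb.READ, Verb.MUTATION);

    // Illustrative per-verb timeouts in milliseconds (assumed values).
    static long timeoutMillis(Verb verb) {
        switch (verb) {
            case READ: return 5_000;
            case MUTATION: return 2_000;
            default: return 10_000;
        }
    }

    // Drop only droppable verbs whose execution time is past send time + timeout.
    static boolean shouldDrop(Verb verb, long sentAtMillis, long nowMillis) {
        return DROPPABLE.contains(verb) && nowMillis > sentAtMillis + timeoutMillis(verb);
    }
}
```

In this sketch a mutation sent at t=0 is dropped if it is only picked up after t=2000 ms, while a gossip message is never dropped regardless of age.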

GOSSIP

What it does do:

Disseminates members' state around the cluster

Versioned: generation (per JVM) & version (per value)

Heartbeats: incremented every gossip round

Application state:

Status

Tokens

Release & schema version

DC & Rack

Addresses

Data size

Health
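The generation/version pair lends itself to a short comparison sketch. The class and method names here are hypothetical, and the rule shown (generation compared first, version only within the same generation) is an assumption drawn from the slide's "per JVM" and "per value" wording rather than a copy of the Gossiper code.

```java
final class GossipVersionSketch {
    // Returns true when the remote (generation, version) pair supersedes the local one.
    static boolean remoteIsNewer(int localGeneration, int localVersion,
                                 int remoteGeneration, int remoteVersion) {
        if (remoteGeneration != localGeneration) {
            // Different JVM incarnation: the generation alone decides.
            return remoteGeneration > localGeneration;
        }
        // Same JVM incarnation: the higher value version wins.
        return remoteVersion > localVersion;
    }
}
```

Under this rule, state from a newer generation replaces older state even if its version counter is lower.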

GOSSIP

What it doesn't do:

Notify about up or down nodes

Propagate schema

Transmit data files

Distribute mutations

GOSSIP

https://wiki.apache.org/cassandra/ArchitectureGossip

GOSSIP

org.apache.cassandra.gms.Gossiper

private class GossipTask implements Runnable
{
    public void run()
    {...

Each round (1 second) gossip to:

1 live endpoint

maybe 1 unreachable endpoint

maybe 1 seed - if neither of the above
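A rough sketch of the per-round target selection follows. It is not the GossipTask code: the class name is hypothetical, a coin flip stands in for the real "maybe" probability, and "neither of the above" is read as "neither chosen peer is a seed", which is an interpretation of the slide.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

final class GossipRoundSketch {
    // Picks the peers for one gossip round, following the three rules on the slide.
    static List<String> pickTargets(List<String> live, List<String> unreachable,
                                    List<String> seeds, Random random) {
        List<String> targets = new ArrayList<>();
        if (!live.isEmpty()) {
            // Always one random live endpoint.
            targets.add(live.get(random.nextInt(live.size())));
        }
        if (!unreachable.isEmpty() && random.nextBoolean()) {
            // "Maybe" one unreachable endpoint; the coin flip is an assumed probability.
            targets.add(unreachable.get(random.nextInt(unreachable.size())));
        }
        boolean gossipedToSeed = false;
        for (String target : targets) {
            if (seeds.contains(target)) gossipedToSeed = true;
        }
        if (!gossipedToSeed && !seeds.isEmpty()) {
            // Simplification: always add a seed when none was contacted above.
            targets.add(seeds.get(random.nextInt(seeds.size())));
        }
        return targets;
    }
}
```

Passing the Random in makes the selection reproducible in tests; a real task would also exclude the local node from every list.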

SCHEMA MIGRATION

Another custom protocol

Also uses MessagingService

Target schema objects serialized as Mutations

diff/merge schema representations

SCHEMA PUSH

org.apache.cassandra.service.MigrationManager

private static Future<?> announce(final Collection<Mutation> schema)

SCHEMA PULL

org.apache.cassandra.service.MigrationManager

public void scheduleSchemaPull(InetAddress endpoint, EndpointState state)

COORDINATION

Client request arrives at coordinator

Transformed into actionable command(s):

IReadCommand

IMutation

Coordinator distributes execution around the cluster

Replicas perform commands and respond to coordinator

Gather responses and determine client response

COORDINATION

org.apache.cassandra.service

StorageProxy

AbstractWriteResponseHandler

AbstractReadExecutor

org.apache.cassandra.locator

AbstractReplicationStrategy

IEndpointSnitch

https://wiki.apache.org/cassandra/ArchitectureInternals

COORDINATING WRITES

org.apache.cassandra.service.StorageProxy

public static void mutate(Collection<? extends IMutation> mutations,
                          ConsistencyLevel consistency_level)

Get endpoints using replication strategy

Get pending endpoints from ring metadata

Deliver mutations to both sets of endpoints

Collate responses & determine client response

Maybe store local hints for unreachable replicas
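The "collate responses" step comes down to counting acknowledgements against a target derived from the consistency level. The sketch below shows that arithmetic for four levels only; the class and enum names are hypothetical, and pending endpoints (which raise the target while ranges are moving) are deliberately not modelled.

```java
final class WriteAckSketch {
    // A reduced, illustrative subset of consistency levels.
    enum Level { ANY, ONE, QUORUM, ALL }

    // Number of acknowledgements the coordinator waits for.
    static int blockFor(Level level, int replicationFactor) {
        switch (level) {
            case ANY:
            case ONE:
                return 1;
            case QUORUM:
                return replicationFactor / 2 + 1; // strict majority of replicas
            default:
                return replicationFactor; // ALL
        }
    }

    // True once enough acknowledgements have arrived to answer the client.
    static boolean satisfied(Level level, int replicationFactor, int acks) {
        return acks >= blockFor(level, replicationFactor);
    }
}
```

With a replication factor of 3, QUORUM waits for 2 acknowledgements and ALL waits for 3.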

DATA REPLICATION

org.apache.cassandra.locator.SimpleStrategy

org.apache.cassandra.locator.NetworkTopologyStrategy

https://wiki.apache.org/cassandra/ArchitectureInternals
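The simpler of the two strategies can be approximated as a clockwise walk around the token ring. This is a minimal sketch under assumptions: tokens are plain longs, nodes are strings, and the class name is hypothetical. It ignores racks and datacenters, which is what NetworkTopologyStrategy adds.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

final class SimpleRingSketch {
    // Walks the ring clockwise from the key's token, collecting distinct nodes until rf is reached.
    static List<String> replicasFor(long keyToken, TreeMap<Long, String> ring, int rf) {
        List<String> replicas = new ArrayList<>();
        if (ring.isEmpty()) return replicas;
        // First node token >= key token, wrapping around to the start of the ring.
        Long start = ring.ceilingKey(keyToken);
        if (start == null) start = ring.firstKey();
        List<String> ordered = new ArrayList<>(ring.tailMap(start, true).values());
        ordered.addAll(ring.headMap(start, false).values());
        for (String node : ordered) {
            if (replicas.size() == rf) break;
            if (!replicas.contains(node)) replicas.add(node);
        }
        return replicas;
    }
}
```

For a ring with tokens 0, 100 and 200 owned by a, b and c, a key with token 150 and a replication factor of 2 lands on c and then wraps to a.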









DELIVERING MUTATIONS

org.apache.cassandra.service.StorageProxy

public static void sendToHintedEndpoints(final Mutation mutation,
                                         Iterable<InetAddress> targets,
                                         AbstractWriteResponseHandler responseHandler,
                                         String localDataCenter)

Mutations sent to replicas using MessagingService

ResponseHandler registered as callback

Callback registry triggers an event on expiry

Sent directly within local datacenter

Forwarded via single node in each remote DC
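The local-versus-remote routing described above might look roughly like this. The class name and the endpoint-to-DC map are assumptions made for the example, and picking the first replica seen in each remote DC as the forwarder is a simplification.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CrossDcRoutingSketch {
    // Returns the endpoints the coordinator messages directly: every local replica,
    // plus one replica per remote DC that forwards the mutation to its DC peers.
    static List<String> directTargets(Map<String, String> dcByEndpoint, String localDc) {
        List<String> direct = new ArrayList<>();
        Map<String, String> forwarderByDc = new LinkedHashMap<>();
        for (Map.Entry<String, String> entry : dcByEndpoint.entrySet()) {
            if (entry.getValue().equals(localDc)) {
                direct.add(entry.getKey());
            } else {
                // Keep only the first replica seen for each remote DC.
                forwarderByDc.putIfAbsent(entry.getValue(), entry.getKey());
            }
        }
        direct.addAll(forwarderByDc.values());
        return direct;
    }
}
```

The point of the design is to send one cross-DC message per remote datacenter instead of one per remote replica.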

HINTS

Nodes can be down

Writes may time out

In which case we may hint

Enabled/disabled globally or enabled per-DC

Writing a hint counts towards ConsistencyLevel.ANY

Deliver hints when a node comes back up & periodically

Too many hints in progress for a replica means we bail early

Determine point of failure by WriteType
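The hinting rules above can be condensed into a small decision function. Everything named here is hypothetical (the class, the Outcome enum, the parameters), and the overload check is a simplification of the "too many hints in progress" rule rather than the StorageProxy logic.

```java
import java.util.Set;

final class HintDecisionSketch {
    enum Outcome { HINT, SKIP, OVERLOADED }

    // enabledDcs is assumed to mean "hint only for these DCs"; an empty set means all DCs.
    static Outcome decide(boolean hintsEnabled, Set<String> enabledDcs, String replicaDc,
                          int hintsInProgressForReplica, int maxInProgress) {
        if (!hintsEnabled) return Outcome.SKIP;
        if (!enabledDcs.isEmpty() && !enabledDcs.contains(replicaDc)) return Outcome.SKIP;
        // Too many hints already queued for this replica: bail early.
        if (hintsInProgressForReplica > maxInProgress) return Outcome.OVERLOADED;
        return Outcome.HINT;
    }
}
```

A caller would write the hint locally on HINT, silently move on for SKIP, and fail the request on OVERLOADED.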

LOGGED BATCHES

org.apache.cassandra.service.StorageProxy

public static void mutateAtomically(Collection<Mutation> mutations,


CommitLog for batches

Guarantee eventual success of batched statements

Strives to distribute across racks in local DC

On success, cleanup log entries asynchronously

Failed batches replayed by the nodes holding the logs

WriteType.BATCH_LOG

WriteType.BATCH

COORDINATING READS

org.apache.cassandra.service.StorageProxy

public static List<Row> read(List<ReadCommand> commands,
                             ConsistencyLevel consistencyLevel,
                             ClientState state)

Partition based reads

Read Repair & Data vs Digest Requests

Rapid Read Protection & (non)speculating executors

Distribution is slightly more complex than for writes

IDENTIFY TARGET ENDPOINTS

org.apache.cassandra.service.AbstractReadExecutor

public static AbstractReadExecutor getReadExecutor(ReadCommand command,
                                                   ConsistencyLevel consistencyLevel)

Use replication strategy to get live endpoints

Snitch sorts by proximity & health of replicas

Consult table metadata for Read Repair Decision

READ REPAIR DECISION

Apply filter to sorted list of all live replicas

NONE: closest n replicas required by CL

GLOBAL: all live replicas

DC_LOCAL: all local replicas, plus closest n remotes needed to satisfy CL

Default Global Chance: 0.0

Default Local Chance: 0.1

Gives us a list of replicas to send read requests to
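The three filters can be sketched as one function over a proximity-sorted replica list. The Replica record, the class name, and the blockFor parameter (the number of replicas the consistency level requires) are assumptions introduced for the example.

```java
import java.util.ArrayList;
import java.util.List;

final class ReadRepairFilterSketch {
    enum Decision { NONE, GLOBAL, DC_LOCAL }

    // Hypothetical replica record: an endpoint name and whether it sits in the local DC.
    static final class Replica {
        final String name;
        final boolean local;
        Replica(String name, boolean local) { this.name = name; this.local = local; }
    }

    // sortedLive is assumed to be ordered closest-first, as a snitch would return it.
    static List<String> targets(List<Replica> sortedLive, Decision decision, int blockFor) {
        List<String> out = new ArrayList<>();
        switch (decision) {
            case GLOBAL:
                for (Replica r : sortedLive) out.add(r.name);
                break;
            case DC_LOCAL:
                for (Replica r : sortedLive) if (r.local) out.add(r.name);
                // Top up with the closest remotes if the locals cannot satisfy the CL.
                for (Replica r : sortedLive) {
                    if (out.size() >= blockFor) break;
                    if (!r.local) out.add(r.name);
                }
                break;
            default: // NONE: just the closest replicas the CL needs
                for (Replica r : sortedLive) {
                    if (out.size() >= blockFor) break;
                    out.add(r.name);
                }
        }
        return out;
    }
}
```

Which Decision applies to a given read would be drawn at random per request using the table's global and local read repair chances.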

RAPID READ PROTECTION

Never

Always

Always

Fixed timeout

Table latency percentile
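The four settings differ only in how long the coordinator waits before sending an extra request. The sketch below expresses that as a single threshold function; the names, the return convention (-1 for never, 0 for immediately), and the sorted-array stand-in for the table's latency histogram are all assumptions.

```java
final class SpeculativeRetrySketch {
    enum Kind { NEVER, ALWAYS, FIXED, PERCENTILE }

    // Returns the wait in ms before an extra request is sent, 0 for "send it up front",
    // or -1 for "never speculate". For FIXED, value is the timeout in ms; for PERCENTILE,
    // value is a fraction in (0, 1] applied to the sorted latency samples.
    static long thresholdMillis(Kind kind, double value, long[] sortedLatenciesMs) {
        switch (kind) {
            case NEVER:
                return -1;
            case ALWAYS:
                return 0;
            case FIXED:
                return (long) value;
            default: // PERCENTILE
                if (sortedLatenciesMs.length == 0) return -1;
                int index = (int) Math.ceil(value * sortedLatenciesMs.length) - 1;
                int clamped = Math.max(0, Math.min(index, sortedLatenciesMs.length - 1));
                return sortedLatenciesMs[clamped];
        }
    }
}
```

A percentile setting adapts to the table's observed latency, whereas a fixed timeout stays the same however the table behaves.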

LIGHTS, CAMERA, EXECUTION

Fire off each command using read executor

Requests are sent via MessagingService

Closest replica(s) sent full data requests

Others get digest requests

RESOLUTION

Resolution can have two outcomes:

RESOLUTION

DigestMismatchException

Trigger a foreground read repair of all targeted replicas
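The digest check behind this outcome can be illustrated with a few lines of Java. This is a sketch only: rows are plain strings, MD5 is an assumed choice of hash, the class name is hypothetical, and an IllegalStateException stands in for DigestMismatchException.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.List;

final class DigestResolveSketch {
    // Hashes a row the way a digest-only replica would before replying.
    static byte[] digest(String row) {
        try {
            return MessageDigest.getInstance("MD5").digest(row.getBytes(StandardCharsets.UTF_8));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // MD5 is a required algorithm on the Java platform
        }
    }

    // Returns the data row if every digest matches it; otherwise signals a mismatch.
    static String resolve(String dataRow, List<byte[]> digests) {
        byte[] expected = digest(dataRow);
        for (byte[] replicaDigest : digests) {
            if (!Arrays.equals(expected, replicaDigest)) {
                // Stand-in for DigestMismatchException.
                throw new IllegalStateException("digest mismatch");
            }
        }
        return dataRow;
    }
}
```

A caller that catches the mismatch would then reissue the read as full data requests to every targeted replica.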

FOREGROUND READ REPAIR

All data requests, no digests

Includes replicas contacted initially

Effectively ConsistencyLevel.ALL

Specialized resolver: RowDataResolver

Retry any short reads

May also perform background Read Repair

OVERVIEW OVER
