cassandra internals overview by sam tunnicliffe (1)
DESCRIPTION
Cassandra Internals presentaitonTRANSCRIPT
-
CASSANDRAINTERNALS OVERVIEW
DATASTAX BOOTCAMP 2015Sam Tunnicliffe
[email protected] / @beobal
-
OVERVIEWSystem startupMessagingGossipSchema PropagationRequest Coordination
-
STARTUPorg.apache.cassandra.service.CassandraDaemonprotectedvoidsetup()
Load config
Run preflight checks
Load schema
Clean up local temporary state
Recover CommitLog
Schedule background compactions
Initialize storage service
-
PREFLIGHT CHECKSSane clockJNIJVM & InstrumentationFilesystem permissionsSystem keyspace statusUpgrades (#8049)Incompatible SSTables (#8049)
-
STARTUPorg.apache.cassandra.service.CassandraDaemonprotectedvoidsetup()
Load config
Run pre-flight checks
Load schema
Clean up local temporary state
Recover CommitLog
Schedule background compactions
Initialize storage service
-
CLEAN UP LOCAL STATETruncate compactions_in_progressScrub data directories
-
STARTUPorg.apache.cassandra.db.commitlog.CommitLogpublicintrecover()throwsIOException
Load config
Run pre-flight checks
Load schema
Clean up local temporary state
Recover CommitLog
Schedule background compactions
Initialize storage service
-
INITIALIZE STORAGE SERVICEorg.apache.cassandra.service.StorageServicepublicsynchronizedvoidinitServer()throwsConfigurationException
Load ring state (unless don't)
Start gossip & get initial ring info
Set tokens
-
BOOTSTRAPAbort if other range movements happening
Fetch bootstrap data
Build secondary indexes
-
INITIALIZE STORAGE SERVICELoad ring state (unless don't)
Start gossip & get initial ring info
Set tokens
Setup auth resources
Ensure gossip stabilized
-
STARTUPLoad config
Run preflight checks
Load schema
Clean up local temporary state
Recover CommitLog
Schedule background compactions
Initialize storage service
-
-- it is done --
STARTUP
-
MESSAGINGSERVICEorg.apache.cassandra.net.MessagingService
Low level one-way messagingpublicvoidsendOneWay(MessageOutmessage,InetAddressto)
Async Request/ResponsepublicintsendRR(MessageOutmessage,InetAddressto,IAsyncCallbackcb)
-
MESSAGINGSERVICEorg.apache.cassandra.net.MessagingService
ReadspublicintsendRRWithFailure(MessageOutmessage,InetAddressto,IAsyncCallbackWithFailurecb)
WritespublicintsendRR(MessageOut
-
MESSAGINGSERVICEPre-emptively drops messages when overwhelmed
Dropped if time at execution > send time + timeout
Timeout value dependant on message type
Most client-initated requests can be dropped
(see MessagingService.DROPPABLE_VERBS)
-
GOSSIPWhat it does do:
Disseminates members' state around the clusterVersioned: generation (per JVM) & version (per value)Heartbeats: incremented every gossip roundApplication state:
StatusTokensRelease & schema versionDC & RackAddressesData sizeHealth
-
GOSSIPWhat doesn't it do:
Notify about up or down nodesPropagate schemaTransmit data filesDistribute mutations
-
GOSSIP
https://wiki.apache.org/cassandra/ArchitectureGossip
-
GOSSIPorg.apache.cassandra.gms.GossiperprivateclassGossipTaskimplementsRunnable{publicvoidrun(){...
Each round (1 second) gossip to:
1 live endpointmaybe 1 unreachable endpointmaybe 1 seed - if neither of the above
-
SCHEMA MIGRATIONAnother custom protocol
Also uses MessagingService
Target schema objects serialized as Mutations
diff/merge schema representations
-
SCHEMA PUSHorg.apache.cassandra.service.MigrationManagerprivatestaticFutureannounce(finalCollectionschema)
-
SCHEMA PULLorg.apache.cassandra.service.MigrationManagerpublicvoidscheduleSchemaPull(InetAddressendpoint,EndpointStatestate)
-
Client request arrives at coordinator:
COORDINATION
Transformed into actionable command(s):
IReadCommandIMutation
Coordinator distributes execution around the cluster
Replicas perform commands and respond to coordinator
Gather responses and determine client response
-
COORDINATIONorg.apache.cassandra.service
StorageProxyAbstractWriteResponseHandlerAbstractReadExecutor
org.apache.cassandra.locatorAbstractReplicationStrategyIEndpointSnitch
-
https://wiki.apache.org/cassandra/ArchitectureInternals
COORDINATING WRITESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutate(Collection
-
DATA REPLICATIONorg.apache.cassandra.locator.SimpleStrategy
-
DATA REPLICATIONorg.apache.cassandra.locator.NetworkTopologyStrategy
-
https://wiki.apache.org/cassandra/ArchitectureInternals
COORDINATING WRITESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutate(Collection
-
DELIVERING MUTATIONSorg.apache.cassandra.service.StorageProxypublicstaticvoidsendToHintedEndpoints(finalMutationmutation,Iterabletargets,AbstractWriteResponseHandlerresponseHandler,StringlocalDataCenter)
Mutations sent to replicas using MessagingService
ResponseHandler registered as callback
Callback registry triggers an event on expiry
Sent directly within local datacenter
Forwarded via single node in each remote DC
- COORDINATING WRITESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutate(Collection
-
HINTSNodes can be down
Writes may timeout
In which case we may hint
Enabled/disabled globally or enabled per-DC
Writing a hint counts towards ConsistencyLevel.ANY
Deliver hints when a node comes back up & periodically
Too many hints in progress for a replica means we bail early
-
Determine point of failure by WriteType
LOGGED BATCHESorg.apache.cassandra.service.StorageProxypublicstaticvoidmutateAtomically(Collectionmutations,ConsistencyLevelconsistency_level)
CommitLog for batches
Guarantee eventual success of batched statements
Strives to distribute to across racks in local DC
On success, cleanup log entries asynchronously
Failed batches replayed by the nodes holding the logs
WriteType.BATCH_LOGWriteType.BATCH
-
COORDINATING READSorg.apache.cassandra.service.StorageProxypublicstaticListread(Listcommands,ConsistencyLevelconsistencyLevel,ClientStatestate)
Partition based reads
Read Repair & Data vs Digest Requests
Rapid Read Protection & (non)speculating executors
Distribution is more slightly complex than for writes
-
IDENTIFY TARGET ENDPOINTSorg.apache.cassandra.service.AbstractReadExecutorpublicstaticAbstractReadExecutorgetReadExecutor(ReadCommandcommand,ConsistencyLevelconsistencyLevel)
Use replication strategy to get live endpoints
Snitch sorts by proximity & health of replicas
Consult table metadata for Read Repair Decision
-
READ REPAIR DECISIONApply filter to sorted list of all live replicas
NONE: closest n replicas required by CLGLOBAL: all live replicasDC_LOCAL: all local replicas
Add closest n remotes needed to satisfy CLDefault Global Chance: 0.0Default Local Chance: 0.1
Give us a list of replicas to send read requests
-
RAPID READ PROTECTIONNever
Always
Fixed timeout
Table latency percentile
-
LIGHTS, CAMERA, EXECUTIONFire off each command using read executor
Requests are sent via MessagingService
Closest replica(s) sent full data requests
Others get digest requests
-
RESOLUTIONResolution can have two outcomes:
-
RESOLUTIONDigestMismatchException
Trigger a foreground read repairOf all targetted replicas
-
FOREGROUND READ REPAIRAll data requests, no digests
Includes replicas contacted initially
Effectively ConsistencyLevel.ALL
Specialized resolver: RowDataResolver
Retry any short reads
May also perform background Read Repair
-
OVERVIEW OVER