Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief Architect (Basho)
DESCRIPTION
About Basho: Basho makes and distributes Riak CS. Built on Riak, Basho's open source, scalable datastore used by thousands in production, Riak CS is made for companies that need large file storage that can't go down. About the speaker: Andy Gross, Basho's Chief Architect, will take you on a tour of Riak CS, talk about how and why Basho built it, and the architecture that underpins it. He'll also highlight various use cases featuring Fortune 500 companies that rely on Riak CS.
TRANSCRIPT
Riak and Riak CS
Andy Gross <@argv0>
Chief Architect, Basho Technologies
Silicon Valley Cloud Computing Group
April 2, 2013
Basho
120+ employees, offices in SF, MA, London, Japan
Founded in 2008, open sourced Riak in 2009
Sponsors of the Riak open source database (Apache 2)
Sell Enterprise features (multi-DC replication), support, training.
Riak CS (S3-compat storage) released in March 2012
Now Open Source (Apache 2)
Cloud storage software backed by Riak
S3 API
Formerly closed-source
Per-tenant reporting
Pluggable authentication
Detailed stats
DTrace support
Multi-datacenter replication (Enterprise)
Preliminary integration with CloudStack
REDACTED
what is a cloud service?
operationally simple
horizontally scalable
globally distributed
highly available
no SPOFs
fault tolerant
you can’t outsource these properties
“use pacemaker” = wrong answer
“use mysql best practices for redundancy” = wrong answer
“just plug it into a SAN” = wrong answer
all cloud services need reliable, distributed state storage
storage is the most important and hardest part
Riak CS uses Riak
What is Riak?
Key-Value store (plus extras)
Distributed, horizontally scalable
Eventually consistent
Fault-tolerant
Highly-available
Inspired by Amazon’s Dynamo
Key-Value
Simple operations - get, put, delete
Value is mostly opaque (some metadata)
Extras
MapReduce
Secondary Indexes
Full-text search (optional)
Distributed & Horizontally Scalable
Default configuration is in a cluster
Load and data are spread evenly via consistent hashing
Scalable: Add more nodes to get more X
Fault-Tolerant
Symmetry: All nodes participate equally
Decentralized: no central control, no SPOF
All data is replicated 3x by default
Cluster transparently survives...
node failure
network partitions
Built on Erlang/OTP (designed for FT)
Highly-Available
Any node can serve client requests
Fallbacks (sloppy quorums) are used when nodes are down
Always accepts write requests
Accepts read requests as long as R of N nodes are alive
Per-request quorums
Inspired by Amazon’s Dynamo
Masterless, peer-coordinated replication
Consistent hashing
Eventually consistent
Quorum reads and writes
Anti-entropy: read repair, hinted handoff
[Diagram: Large Object. Five Riak nodes, each paired with a Riak CS instance exposing an S3 API and a Reporting API]
1. User uploads an object
2. Riak CS breaks the object into 1 MB chunks
3. Riak CS streams chunks to Riak nodes
4. Riak replicates and stores chunks
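The upload steps above can be sketched roughly as follows. This is an illustrative sketch only, not Riak CS code: the `put` callback stands in for whatever writes a chunk to Riak, and the `blocks/<name>/<index>` key scheme and the manifest list are assumptions made for the example.

```python
CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size named on the slide


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE) -> list[bytes]:
    """Break an object into fixed-size chunks; the last chunk may be shorter."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def store_object(name: str, data: bytes, put) -> list[str]:
    """Send each chunk to the store under its own key and return the key list."""
    manifest = []
    for index, chunk in enumerate(split_into_chunks(data)):
        key = f"blocks/{name}/{index}"  # hypothetical key scheme
        put(key, chunk)                 # hypothetical stand-in for a Riak put
        manifest.append(key)
    return manifest
```

Reading the object back is then a matter of fetching the keys in manifest order and concatenating the chunks.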
Principles
Always-writable
Incrementally scalable
Symmetrical
Decentralized
Focus on SLAs, tail latency
Techniques
Consistent Hashing
Vector Clocks
Read Repair
Anti-Entropy
Hinted Handoff
Gossip Protocol
Consistent Hashing
Invented by Danny Lewin and others @ MIT/Akamai
Minimizes remapping of keys when number of hash slots changes
Originally applied to CDNs, used in Dynamo for replica placement
Enables incremental scalability, even spread
Minimizes hot spots
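A minimal sketch of the idea, assuming a generic ring with several hash points per node. It is not Riak's actual partition layout, and `HashRing`, `points_per_node` and `preference_list` are names chosen for this example.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha1(value.encode()).digest(), "big")


class HashRing:
    def __init__(self, nodes, points_per_node=64):
        self.points_per_node = points_per_node
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        """Place several points for the node on the ring to even out the spread."""
        for i in range(self.points_per_node):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect(self._ring, (_hash(key), ""))
        found = []
        for offset in range(len(self._ring)):
            node = self._ring[(start + offset) % len(self._ring)][1]
            if node not in found:
                found.append(node)
                if len(found) == n:
                    break
        return found
```

When a node is added to a ring like this, the only keys whose first owner changes are the ones that now land on the new node, which is the "minimizes remapping" property from the slide.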
Vector Clocks
Introduced by Mattern et al. in 1988
Extends Lamport’s timestamps (1978)
Each value in Dynamo tagged with vector clock
Allows detection of stale values, logical siblings
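To make the stale-versus-sibling distinction concrete, here is a small sketch that models a vector clock as a plain dict of per-actor counters. The function names and the dict representation are assumptions for illustration, not Riak's encoding.

```python
def increment(clock: dict, actor: str) -> dict:
    """Return a copy of the clock with the actor's counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify clock a relative to clock b."""
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "newer"
    if b_covers_a:
        return "stale"
    return "siblings"  # concurrent updates: neither clock descends from the other
```

Two writes that both start from the same clock but are coordinated by different actors compare as siblings, so neither value can be silently discarded.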
Read Repair
Update stale versions opportunistically on reads (instead of writes)
Pushes system toward consistency, after returning value to client
Reflects focus on a cheap, always-available write path
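A simplified sketch of read repair, with several assumptions: each replica is a plain dict, each stored value is a `(version, data)` tuple with an integer version standing in for a vector clock, and the repair runs inline before returning, whereas the slide describes it happening after the reply to the client.

```python
def read_with_repair(replicas: list[dict], key: str, r: int = 2):
    """Answer from the first r replies, then bring stale replicas up to date."""
    replies = [replica.get(key) for replica in replicas]  # (version, data) or None
    quorum = [reply for reply in replies[:r] if reply is not None]
    if len(quorum) < r:
        raise LookupError(f"fewer than {r} replicas answered for {key!r}")
    result = max(quorum, key=lambda reply: reply[0])

    # Repair step: overwrite any replica that is missing the key or holds an
    # older version with the newest version seen across all replies.
    newest = max((reply for reply in replies if reply is not None),
                 key=lambda reply: reply[0])
    for replica, reply in zip(replicas, replies):
        if reply is None or reply[0] < newest[0]:
            replica[key] = newest
    return result[1]
```

The write path stays cheap because no replica is checked at write time; staleness is only discovered and fixed when someone reads the key.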
Hinted Handoff
Any node can accept writes for other nodes if they’re down
All messages include a destination
Data accepted by node other than destination is handed off when node recovers
As long as a single node is alive the cluster can accept a write
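The flow can be sketched as below. `Node`, `write` and `handoff` are hypothetical names for this illustration; a real cluster would choose the fallback node from the ring rather than take it as an argument.

```python
class Node:
    def __init__(self, name: str):
        self.name = name
        self.up = True
        self.data = {}
        self.hints = []  # (destination name, key, value) held for down nodes


def write(key, value, destination: Node, fallback: Node) -> None:
    """Store on the destination, or keep a hinted copy on the fallback if it is down."""
    if destination.up:
        destination.data[key] = value
    else:
        fallback.hints.append((destination.name, key, value))


def handoff(fallback: Node, recovered: Node) -> None:
    """Deliver hinted writes to the recovered node and drop them from the fallback."""
    remaining = []
    for dest, key, value in fallback.hints:
        if dest == recovered.name and recovered.up:
            recovered.data[key] = value
        else:
            remaining.append((dest, key, value))
    fallback.hints = remaining
```

Because every hint carries its destination, the fallback knows exactly where each write belongs once that node is reachable again.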
Anti-Entropy
Replicas maintain a Merkle Tree of keys and their versions/hashes
Trees periodically exchanged with peer vnodes
Merkle tree enables cheap comparison
Only values with different hashes are exchanged
Pushes system toward consistency
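A rough sketch of why the comparison is cheap, using a two-level hash tree (one hash per key bucket plus a root) rather than a full Merkle tree. The bucket count and function names are assumptions for the example, and both stores are local dicts here, where real replicas would exchange only the hashes.

```python
import hashlib

BUCKETS = 16  # assumed number of leaf buckets


def _digest(text: str) -> str:
    return hashlib.sha1(text.encode()).hexdigest()


def _bucket(key: str) -> int:
    return int(_digest(key), 16) % BUCKETS


def build_tree(store: dict) -> dict:
    """Hash each bucket of key/value hashes, then hash the bucket hashes into a root."""
    buckets = [[] for _ in range(BUCKETS)]
    for key in sorted(store):
        buckets[_bucket(key)].append(f"{key}={_digest(str(store[key]))}")
    leaves = [_digest("|".join(entries)) for entries in buckets]
    return {"root": _digest("".join(leaves)), "leaves": leaves}


def differing_keys(store_a: dict, store_b: dict) -> set:
    """Compare roots first; only inspect keys in buckets whose hashes differ."""
    tree_a, tree_b = build_tree(store_a), build_tree(store_b)
    if tree_a["root"] == tree_b["root"]:
        return set()  # replicas agree, nothing to exchange
    suspect = {i for i in range(BUCKETS) if tree_a["leaves"][i] != tree_b["leaves"][i]}
    keys = set(store_a) | set(store_b)
    return {k for k in keys
            if _bucket(k) in suspect and store_a.get(k) != store_b.get(k)}
```

When replicas already agree, a single root comparison is enough, which is what makes running this periodically affordable.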
Gossip Protocol
Decentralized approach to managing global state
Trades off atomicity of state changes for a decentralized approach
Volume of gossip can overwhelm networks without care
Hinted Handoff
• Node fails
• Requests go to fallback
• Node comes back
• “Handoff” - data returns to recovered node
• Normal operations resume
hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
Anatomy of a Request
get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
[Diagram: client, Riak coordinating node running the Get Handler (FSM), and the cluster shown as The Ring with partitions 6 through 16]
hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) == 10, 11, 12
R=2
[Replica versions shown in the diagram: v1 and v2]
Read Repair
get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)
[Diagram: client, Riak coordinating node running the Get Handler (FSM), and the cluster ring with partitions 6 through 16; R=2, replicas hold v1 and v2, and the v1 replica is updated to v2]
Riak Architecture
Erlang/OTP Runtime
Riak KV
Client APIs: HTTP, Protocol Buffers, Erlang local client
Request Coordination: get, put, delete, map-reduce
Riak Core: membership, consistent hashing, handoff, node-liveness, gossip, buckets
vnode master
vnodes
storage backend
JS Runtime
riak is a solid foundation for building cloud services
Coming Soon:
Riak CS 1.4 (Q2)
Swift API
Keystone Integration
S3 Features
COPY Object
Object Versioning
Riak CS 1.5 (Q3)
Server side encryption
More S3 features
Enhanced CloudStack and OpenStack integration
Riak
Coming Later (2014)
Erasure coding
Reduced redundancy storage
Native indexing/search
RICON East - May 13-14, NYC
A distributed systems conference for developers
Speakers from Comcast, State Farm, UC Berkeley, Harvard, and many more
Use discount code SVCloud20 for 20% off tickets
http://ricon.io/east.html
thanks!/questions?
download riakcs: http://docs.basho.com/riakcs/latest/riakcs-downloads/
hack riakcs: http://github.com/basho/riak_cs
work at basho: http://bashojobs.theresumator.com
follow basho on twitter: http://twitter.com/basho