Building Reliable Cloud Storage with Riak and CloudStack - Andy Gross, Chief Architect (Basho)

Post on 08-May-2015

DESCRIPTION

About Basho: Basho makes and distributes Riak CS. Built on Riak, Basho's open source, scalable datastore used by thousands in production, Riak CS is made for companies that need large file storage that can't go down. About the speaker: Andy Gross, Basho's Chief Architect, will take you on a tour of Riak CS, talk about how and why Basho built it, and the architecture that underpins it. He'll also highlight various use cases featuring Fortune 500 companies that rely on Riak CS.

TRANSCRIPT

Riak and Riak CS

Andy Gross <@argv0>

Chief Architect, Basho Technologies

Silicon Valley Cloud Computing Group

April 2, 2013

Basho

120+ employees, offices in SF, MA, London, Japan

Founded in 2008, open sourced Riak in 2009

Sponsors of the Riak open source database (Apache 2)

Sell Enterprise features (multi-DC replication), support, training.

Riak CS (S3-compat storage) released in March 2012

Now Open Source (Apache 2)

Cloud storage software backed by Riak

S3 API

Formerly closed-source

Per-tenant reporting

Pluggable authentication

Detailed stats

DTrace support

Multi-datacenter replication (Enterprise)

Preliminary integration with CloudStack

what is a cloud service?

operationally simple

horizontally scalable

globally distributed

highly available

no SPOFs

fault tolerant

you can’t outsource these properties

“use pacemaker” = wrong answer

“use mysql best practices for redundancy” = wrong answer

“just plug it into a SAN” = wrong answer

all cloud services need reliable, distributed state storage

storage is the most important and hardest part

Riak CS uses Riak

What is Riak?

Key-Value store (plus extras)

Distributed, horizontally scalable

Eventually consistent

Fault-tolerant

Highly-available

Inspired by Amazon’s Dynamo

Simple operations - get, put, delete

Value is mostly opaque (some metadata)

Extras

MapReduce

Secondary Indexes

Full-text search (optional)

Key-Value

Distributed & Horizontally Scalable

Default configuration is in a cluster

Load and data are spread evenly via consistent hashing

Scalable: Add more nodes to get more X

Fault-Tolerant

Symmetry: All nodes participate equally

Decentralized: no central control, no SPOF

All data is replicated 3x by default

Cluster transparently survives...

node failure

network partitions

Built on Erlang/OTP (designed for FT)

Highly-Available

Any node can serve client requests

Fallbacks (sloppy quorums) are used when nodes are down

Always accepts write requests

Accepts read request as long as R/N nodes are alive

Per-request quorums

Inspired by Amazon’s Dynamo

Masterless, peer-coordinated replication

Consistent hashing

Eventually consistent

Quorum reads and writes

Anti-entropy: read repair, hinted handoff

[Diagram: Large Object storage. Five Riak nodes, each paired with a Riak CS instance that exposes the S3 API and the Reporting API]

1. User uploads an object

2. Riak CS breaks object into 1 MB chunks

3. Riak CS streams chunks to Riak nodes

4. Riak replicates and stores chunks
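
The upload steps above can be sketched in a few lines of Python. This is only an illustration of the chunking idea, not Riak CS code: the `put` and `get` callables stand in for a key-value client, and the key layout (`blocks/<uuid>/<seq>`) and manifest shape are assumptions made for the example.

```python
import uuid

CHUNK_SIZE = 1024 * 1024  # 1 MB, the chunk size named on the slide


def split_into_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Yield (sequence_number, chunk) pairs covering data in order."""
    for seq, offset in enumerate(range(0, len(data), chunk_size)):
        yield seq, data[offset:offset + chunk_size]


def store_object(data: bytes, put):
    """Split data into chunks and hand each one to put(key, value).

    Returns a manifest listing the block keys in order. The key layout
    and the manifest fields are hypothetical, chosen for this sketch.
    """
    object_id = str(uuid.uuid4()).upper()
    block_keys = []
    for seq, chunk in split_into_chunks(data):
        key = f"blocks/{object_id}/{seq}"
        put(key, chunk)  # the store behind put() is responsible for replication
        block_keys.append(key)
    return {"object_id": object_id, "size": len(data), "blocks": block_keys}


def fetch_object(manifest, get) -> bytes:
    """Reassemble an object by reading its blocks in manifest order."""
    return b"".join(get(key) for key in manifest["blocks"])
```

With a plain dict as the store, `store_object(data, store.__setitem__)` followed by `fetch_object(manifest, store.__getitem__)` returns the original bytes.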

Principles

Always-writable

Incrementally scalable

Symmetrical

Decentralized

Focus on SLAs, tail latency

Techniques

Consistent Hashing

Vector Clocks

Read Repair

Anti-Entropy

Hinted Handoff

Gossip Protocol

Consistent Hashing

Invented by Danny Lewin and others @ MIT/Akamai

Minimizes remapping of keys when number of hash slots changes

Originally applied to CDNs, used in Dynamo for replica placement

Enables incremental scalability, even spread

Minimizes hot spots
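
To make the "minimizes remapping" point concrete, here is a minimal sketch of a generic consistent hash ring in Python. It is the textbook technique with several ring points per node, not Riak's own ring implementation; the class name `HashRing`, the `points_per_node` parameter, and the node names are assumptions for illustration.

```python
import bisect
import hashlib


def _hash(value: str) -> int:
    """Map a string to a position on the ring (a 160-bit integer)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest(), "big")


class HashRing:
    def __init__(self, nodes=(), points_per_node=64):
        self.points_per_node = points_per_node
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place several points for the node on the ring."""
        for i in range(self.points_per_node):
            point = _hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        """Return the owner of the first point clockwise from the key's hash."""
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Usage: add a fifth node and count how many keys change owner.
ring = HashRing(["node1", "node2", "node3", "node4"])
keys = [f"blocks/{i}" for i in range(10000)]
before = {k: ring.node_for(k) for k in keys}
ring.add_node("node5")
moved = [k for k in keys if ring.node_for(k) != before[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved")
```

Every key that moves lands on the new node; keys owned by the other nodes stay where they were, which is what makes incremental scaling cheap.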

Vector Clocks

Introduced by Mattern et al, in 1988

Extends Lamport’s timestamps (1978)

Each value in Dynamo tagged with vector clock

Allows detection of stale values, logical siblings
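
A small sketch may help show how a vector clock distinguishes a stale value from a sibling. The representation below (a dict from actor name to counter) and the function names are assumptions for illustration, not Riak's internal encoding.

```python
def increment(clock: dict, actor: str) -> dict:
    """Return a copy of clock with actor's counter advanced by one."""
    updated = dict(clock)
    updated[actor] = updated.get(actor, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if clock a has seen every event recorded in clock b."""
    return all(a.get(actor, 0) >= count for actor, count in b.items())


def compare(a: dict, b: dict) -> str:
    """Classify two clocks as equal, one newer than the other, or siblings."""
    a_covers_b, b_covers_a = descends(a, b), descends(b, a)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a_newer"   # b is stale
    if b_covers_a:
        return "b_newer"   # a is stale
    return "siblings"      # concurrent updates, neither saw the other


def merge(a: dict, b: dict) -> dict:
    """Clock that descends from both inputs (used after resolving siblings)."""
    return {actor: max(a.get(actor, 0), b.get(actor, 0)) for actor in a.keys() | b.keys()}
```

For example, two writers that both start from `{"node_a": 1}` and update independently produce clocks that compare as `"siblings"`, while a clock and its own successor compare as stale versus newer.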

Read Repair

Update stale versions opportunistically on reads (instead of writes)

Pushes system toward consistency, after returning value to client

Reflects focus on a cheap, always-available write path
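
The read repair idea can be sketched as follows. This is a simplified model, assuming integer versions that are totally ordered (a real Dynamo-style store compares vector clocks) and repairing synchronously, whereas the slide describes repair happening after the value has been returned to the client. `Replica` and `get_with_read_repair` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    data: dict = field(default_factory=dict)  # key -> (version, value)

    def get(self, key):
        return self.data.get(key)

    def put(self, key, version, value):
        """Store the value unless an equal or newer version is already held."""
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)


def get_with_read_repair(replicas, key):
    """Read from all replicas, return the newest value, and repair stale ones.

    Returns (value, names_of_repaired_replicas).
    """
    replies = {r.name: r.get(key) for r in replicas}
    found = [reply for reply in replies.values() if reply is not None]
    if not found:
        return None, []
    newest = max(found, key=lambda reply: reply[0])
    stale = [r for r in replicas if replies[r.name] != newest]
    for replica in stale:
        # A real system would do this in the background after replying.
        replica.put(key, *newest)
    return newest[1], [r.name for r in stale]
```

A read that sees v1 on one replica and v2 on the other two returns v2 and rewrites the lagging replica, so the next read finds nothing to repair.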

Hinted Handoff

Any node can accept writes for other nodes if they’re down

All messages include a destination

Data accepted by node other than destination is handed off when node recovers

As long as a single node is alive the cluster can accept a write

Anti-Entropy

Replicas maintain a Merkle Tree of keys and their versions/hashes

Trees periodically exchanged with peer vnodes

Merkle tree enables cheap comparison

Only values with different hashes are exchanged

Pushes system toward consistency
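
As a rough illustration of why a Merkle tree makes the comparison cheap, the sketch below hashes keys into a fixed number of leaf segments, builds a binary hash tree over them, and walks two trees from the root so that only differing segments are visited. The segment count, the tree layout, and the function names are assumptions for this example and do not describe Riak's on-disk trees.

```python
import hashlib

LEAVES = 8  # number of leaf segments; must be a power of two in this sketch


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(store: dict) -> list:
    """Return the tree as a list of levels, leaves first, root level last.

    store maps key -> a string identifying the stored version.
    """
    segments = [[] for _ in range(LEAVES)]
    for key in sorted(store):
        index = _h(key.encode("utf-8"))[0] % LEAVES
        segments[index].append(f"{key}={store[key]}")
    level = [_h("\n".join(segment).encode("utf-8")) for segment in segments]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def differing_segments(a: list, b: list) -> list:
    """Descend from the root, following only branches whose hashes differ."""
    pending = [(len(a) - 1, 0)]  # (level, index), starting at the root
    result = []
    while pending:
        depth, index = pending.pop()
        if a[depth][index] == b[depth][index]:
            continue  # identical subtree, nothing below needs checking
        if depth == 0:
            result.append(index)
        else:
            pending.append((depth - 1, 2 * index))
            pending.append((depth - 1, 2 * index + 1))
    return sorted(result)
```

Two replicas that differ in a single key disagree on exactly one leaf segment, so only the keys in that segment need to be exchanged.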

Gossip Protocol

Decentralized approach to managing global state

Trades off atomicity of state changes for a decentralized approach

Volume of gossip can overwhelm networks without care
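
A toy simulation can show how gossip spreads state without a central coordinator. This is a generic push-pull gossip sketch, not Riak's ring gossip: each node holds a dict of `member -> (version, status)`, and the function names and the 16-node setup are assumptions for illustration.

```python
import random


def merge(mine: dict, theirs: dict) -> dict:
    """Keep the higher-versioned entry for each member."""
    merged = dict(mine)
    for member, entry in theirs.items():
        if member not in merged or entry[0] > merged[member][0]:
            merged[member] = entry
    return merged


def gossip_until_converged(states: dict, rng: random.Random, max_rounds: int = 100) -> int:
    """Each round, every node exchanges state with one randomly chosen peer.

    Returns the number of rounds needed for all nodes to hold equal state.
    """
    names = list(states)
    for round_number in range(1, max_rounds + 1):
        for name in names:
            peer = rng.choice([n for n in names if n != name])
            combined = merge(states[name], states[peer])
            states[name] = combined
            states[peer] = dict(combined)
        if all(states[n] == states[names[0]] for n in names):
            return round_number
    raise RuntimeError("state did not converge within max_rounds")


# Usage: 16 nodes that each start out knowing only about themselves.
states = {f"node{i}": {f"node{i}": (1, "up")} for i in range(16)}
rounds = gossip_until_converged(states, random.Random(0))
print(f"converged after {rounds} rounds")
```

There is no moment at which all nodes switch to the new state together, which is the atomicity the slide says is traded away; each round also costs one exchange per node, which is where the network volume comes from.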

Hinted Handoff

• Node fails

• Requests go to fallback

• Node comes back

• “Handoff” - data returns to recovered node

• Normal operations resume
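
The sequence above can be modelled in a few lines. This is a minimal sketch assuming one owner and one fallback per write; the `Node` class and the `write` and `hand_off` functions are hypothetical names, and real handoff also involves replication and failure detection that are left out here.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.up = True
        self.data = {}   # key -> value for keys this node owns
        self.hints = []  # (intended_owner_name, key, value) held for other nodes


def write(key, value, owner, fallback):
    """Write to the owner if it is up, otherwise to the fallback with a hint.

    Returns the name of the node that accepted the write.
    """
    if owner.up:
        owner.data[key] = value
        return owner.name
    # The hint records the intended destination so the data can return later.
    fallback.hints.append((owner.name, key, value))
    return fallback.name


def hand_off(fallback, nodes_by_name):
    """Deliver hinted writes to owners that are back up; keep the rest."""
    remaining = []
    for owner_name, key, value in fallback.hints:
        owner = nodes_by_name[owner_name]
        if owner.up:
            owner.data[key] = value
        else:
            remaining.append((owner_name, key, value))
    fallback.hints = remaining
```

While the owner is down the write still succeeds on the fallback; once the owner is marked up again, `hand_off` moves the data home and normal writes go straight to the owner.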

hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

Anatomy of a Request

client: get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

Riak coordinating node: Get Handler (FSM)

hash(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”) == 10, 11, 12

[Diagram: The Ring, showing cluster partitions 6 through 16; with R=2 the replicas reply with versions v1 and v2]
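
The request path in this slide (hash the key to a run of partitions, then wait for R replies) can be sketched as below. This is a simplified model: the ring size of 64, the modulo mapping from hash to partition, and integer versions are assumptions for the example, and `preference_list` and `coordinate_get` are hypothetical names rather than Riak functions.

```python
import hashlib

RING_SIZE = 64  # number of partitions; a hypothetical value for this sketch
N = 3           # replicas per key (the slides state 3x replication by default)


def preference_list(key: str, ring_size: int = RING_SIZE, n: int = N) -> list:
    """Return the n consecutive partition indexes responsible for key."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    first = int.from_bytes(digest, "big") % ring_size
    return [(first + i) % ring_size for i in range(n)]


def coordinate_get(key, partitions, r=2):
    """Query the preference list and succeed once r partitions have replied.

    partitions maps partition index -> dict of key -> (version, value), or
    None when that partition is unreachable. Returns the newest value seen.
    """
    replies = []
    for index in preference_list(key):
        vnode = partitions.get(index)
        if vnode is None:
            continue                    # unreachable replica, no reply
        replies.append(vnode.get(key))  # a reply may be None (not found)
        if len(replies) >= r:
            break
    if len(replies) < r:
        raise RuntimeError(f"only {len(replies)} of {r} required replies")
    found = [reply for reply in replies if reply is not None]
    return max(found, key=lambda reply: reply[0])[1] if found else None
```

With R=2 and three replicas, the read still succeeds when any one replica is unreachable, and it fails only when fewer than two can answer.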

Read Repair

client: get(“blocks/6307C89A-710A-42CD-9FFB-2A6B39F983EA”)

Riak coordinating node: Get Handler (FSM)

[Diagram: cluster partitions 6 through 16; with R=2 the replicas reply with v1 and v2, and the replica holding v1 is updated to v2]

Riak Architecture

Erlang/OTP Runtime

Riak KV

Client APIs: HTTP, Protocol Buffers, Erlang local client

Request Coordination: get, put, delete, map-reduce

Riak Core: membership, consistent hashing, handoff, node-liveness, gossip, buckets

vnode master

vnodes

storage backend

JS Runtime

riak is a solid foundation for building cloud services

Coming Soon:

Riak CS 1.4 (Q2)

Swift API

Keystone Integration

S3 Features

COPY Object

Object Versioning

Riak CS 1.5 (Q3)

Server side encryption

More S3 features

Enhanced CloudStack and OpenStack integration

Riak

Coming Later (2014)

Erasure coding

Reduced redundancy storage

Native indexing/search

RICON East - May 13-14, NYC

A distributed systems conference for developers

Speakers from Comcast, State Farm, UC Berkeley, Harvard, and many more

Use discount code SVCloud20 for 20% off tickets

http://ricon.io/east.html

thanks! / questions?

download riakcs: http://docs.basho.com/riakcs/latest/riakcs-downloads/

hack riakcs: http://github.com/basho/riak_cs

work at basho: http://bashojobs.theresumator.com

follow basho on twitter: http://twitter.com/basho
