Building Distributed Systems Using Helix
DESCRIPTION
Kishore Gopalakrishna (Staff Software Engineer @ LinkedIn) gave this talk at ApacheCon in February 2013.
TRANSCRIPT
![Page 1: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/1.jpg)
Building distributed systems using Helix
Kishore Gopalakrishna, @kishoreg1980
http://www.linkedin.com/in/kgopalak
http://helix.incubator.apache.org
Apache Incubation, Oct 2012
@apachehelix
![Page 2: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/2.jpg)
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
![Page 3: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/3.jpg)
Examples of distributed data systems
![Page 4: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/4.jpg)
Lifecycle
Single Node
Multi node
• Partitioning
• Discovery
• Co-location
Fault tolerance
• Replication
• Fault detection
• Recovery
Cluster Expansion
• Throttle data movement
• Re-distribution
![Page 5: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/5.jpg)
Typical Architecture
(Diagram: four nodes, each running an app, connected over the network to a cluster manager.)
![Page 6: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/6.jpg)
Distributed search service
(Diagram: index shards P.1 to P.6, each shown twice as INDEX SHARD and REPLICA, spread across Node 1, Node 2, and Node 3.)
Partition management
• Multiple replicas
• Even distribution
• Rack aware placement
Fault tolerance
• Fault detection
• Auto create replicas
• Controlled creation of replicas
Elasticity
• Re-distribute partitions
• Minimize movement
• Throttle data movement
![Page 7: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/7.jpg)
Distributed data store
(Diagram: partitions P.1 to P.12 spread across Node 1, Node 2, and Node 3, with replicas marked MASTER or SLAVE.)
Partition management
• Multiple replicas
• 1 designated master
• Even distribution
Fault tolerance
• Fault detection
• Promote slave to master
• Even distribution
• No SPOF
Elasticity
• Minimize downtime
• Minimize data movement
• Throttle data movement
![Page 8: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/8.jpg)
Message consumer group
• Similar to Message groups in ActiveMQ
– guaranteed ordering of the processing of related messages across a single queue
– load balancing of the processing of messages across multiple consumers
– high availability / auto-failover to other consumers if a JVM goes down
• Applicable to many messaging pub/sub systems like Kafka, RabbitMQ, etc.
![Page 9: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/9.jpg)
Message consumer group
(Diagram panels: ASSIGNMENT, SCALING, FAULT TOLERANCE)
![Page 10: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/10.jpg)
Zookeeper provides low level primitives. We need high level primitives.
Application on Zookeeper
• File system
• Lock
• Ephemeral
Application on Framework on Consensus System
• Node
• Partition
• Replica
• State
• Transition
![Page 11: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/11.jpg)
![Page 12: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/12.jpg)
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
![Page 13: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/13.jpg)
Terminologies
| Term | Definition |
| --- | --- |
| Node | A single machine |
| Cluster | Set of nodes |
| Resource | A logical entity, e.g. database, index, task |
| Partition | Subset of the resource |
| Replica | Copy of a partition |
| State | Status of a partition replica, e.g. Master, Slave |
| Transition | Action that lets replicas change status, e.g. Slave -> Master |
![Page 14: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/14.jpg)
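Core concept
State Machine
• States: Offline (O), Slave (S), Master (M)
• Transitions: O->S, S->O, S->M, M->S
Constraints
• States: M=1, S=2
• Transitions: concurrent(O->S) < 5
Objectives
• Partition placement
• Failure semantics
(Diagram: states O, S, and M connected by transitions $t_1$ to $t_4$, annotated with COUNT=2 on S, COUNT=1 on M, $t_1 \le 5$, and the objectives $\text{minimize}\left(\max_{n_j \in N} S(n_j)\right)$ and $\text{minimize}\left(\max_{n_j \in N} M(n_j)\right)$.)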
![Page 15: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/15.jpg)
Helix solution
Message consumer group
Distributed search
(Diagram: Offline/Online state machines with the transitions labeled "Start consumption" and "Stop consumption", and the constraint labels MAX=1, MAX=3 (number of replicas), and MAX per node=5.)
![Page 16: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/16.jpg)
IDEALSTATE
Configuration
• 3 nodes
• 3 partitions
• 2 replicas
• StateMachine
Constraints
• 1 Master
• 1 Slave
• Even distribution
| Partition | Replica placement and replica state (node:state) |
| --- | --- |
| P1 | N1:M, N2:S |
| P2 | N2:M, N3:S |
| P3 | N3:M, N1:S |
![Page 17: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/17.jpg)
CURRENT STATE
| Node | Partition states |
| --- | --- |
| N1 | P1:OFFLINE, P3:OFFLINE |
| N2 | P2:MASTER, P1:MASTER |
| N3 | P3:MASTER, P2:SLAVE |
![Page 18: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/18.jpg)
EXTERNAL VIEW
| Partition | Replica states (node:state) |
| --- | --- |
| P1 | N1:O, N2:M |
| P2 | N2:M, N3:S |
| P3 | N3:M, N1:O |
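The relationship between the IDEALSTATE and the EXTERNAL VIEW can be sketched in plain Java. This is not the Helix API: the class `StateDiffSketch` and its methods are hypothetical names chosen for illustration, and the nested maps simply mirror the tables on these slides. The sketch lists every replica whose observed state differs from its target state, which is the kind of gap a controller has to close by issuing transitions.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class StateDiffSketch {
    // Builds partition -> (node -> state) from {partition, node, state} rows.
    static Map<String, Map<String, String>> view(String[][] rows) {
        Map<String, Map<String, String>> v = new LinkedHashMap<>();
        for (String[] r : rows) {
            v.computeIfAbsent(r[0], k -> new LinkedHashMap<>()).put(r[1], r[2]);
        }
        return v;
    }

    // Target assignment, copied from the IDEALSTATE slide.
    public static Map<String, Map<String, String>> idealState() {
        return view(new String[][] {
            {"P1", "N1", "M"}, {"P1", "N2", "S"},
            {"P2", "N2", "M"}, {"P2", "N3", "S"},
            {"P3", "N3", "M"}, {"P3", "N1", "S"}});
    }

    // Observed assignment, copied from the EXTERNAL VIEW slide.
    public static Map<String, Map<String, String>> externalView() {
        return view(new String[][] {
            {"P1", "N1", "O"}, {"P1", "N2", "M"},
            {"P2", "N2", "M"}, {"P2", "N3", "S"},
            {"P3", "N3", "M"}, {"P3", "N1", "O"}});
    }

    // Lists replicas whose observed state differs from the target state.
    // A replica missing from the observed view is treated as Offline ("O").
    public static List<String> mismatches(Map<String, Map<String, String>> ideal,
                                          Map<String, Map<String, String>> observed) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<String, Map<String, String>> p : ideal.entrySet()) {
            Map<String, String> cur =
                observed.getOrDefault(p.getKey(), Collections.<String, String>emptyMap());
            for (Map.Entry<String, String> r : p.getValue().entrySet()) {
                String have = cur.getOrDefault(r.getKey(), "O");
                if (!have.equals(r.getValue())) {
                    out.add(p.getKey() + "@" + r.getKey() + ": " + have + "->" + r.getValue());
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : mismatches(idealState(), externalView())) {
            System.out.println(line);
        }
    }
}
```

Each entry is a gap between observed and target state, not necessarily a single legal transition: with the MasterSlave model from this deck, a replica that is Offline but should be Master would have to pass through Slave first.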
![Page 19: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/19.jpg)
Helix Based System Roles
(Diagram: PARTICIPANT nodes Node 1, Node 2, and Node 3 host partitions P.1 to P.12; a SPECTATOR holds the partition routing logic; the Controller works from the CURRENT STATE and the IDEAL STATE and exchanges COMMAND and RESPONSE messages with the participants.)
![Page 20: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/20.jpg)
Logical deployment
![Page 21: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/21.jpg)
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
![Page 22: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/22.jpg)
Helix based solution
1. Define
2. Configure
3. Run
![Page 23: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/23.jpg)
Define: State model definition
• States
– All possible states
– Priority
• Transitions
– Legal transitions
– Priority
• Applicable to each partition of a resource
• e.g. MasterSlave
(Diagram: state machine over O, S, and M.)
![Page 24: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/24.jpg)
Define: state model
import org.apache.helix.model.StateModelDefinition;

// State names used by the builder calls below.
String MASTER = "MASTER", SLAVE = "SLAVE", OFFLINE = "OFFLINE";

StateModelDefinition.Builder builder = new StateModelDefinition.Builder("MASTERSLAVE");
// Add states and their rank to indicate priority.
builder.addState(MASTER, 1);
builder.addState(SLAVE, 2);
builder.addState(OFFLINE);

// Set the initial state when the node starts.
builder.initialState(OFFLINE);

// Add transitions between the states.
builder.addTransition(OFFLINE, SLAVE);
builder.addTransition(SLAVE, OFFLINE);
builder.addTransition(SLAVE, MASTER);
builder.addTransition(MASTER, SLAVE);
![Page 25: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/25.jpg)
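Define: constraints
| Scope | State | Transition |
| --- | --- | --- |
| Partition | Y | Y |
| Resource | - | Y |
| Node | Y | Y |
| Cluster | - | Y |

(Diagram: O, S, M state machine with COUNT=2 on S and COUNT=1 on M.)
| Scope | State | Transition |
| --- | --- | --- |
| Partition | M=1, S=2 | - |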
![Page 26: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/26.jpg)
Define: constraints
// Static constraint: at most one MASTER per partition.
builder.upperBound(MASTER, 1);

// Dynamic constraint: "R" refers to the replica count configured for the resource.
builder.dynamicUpperBound(SLAVE, "R");

// Unconstrained
builder.upperBound(OFFLINE, -1);
![Page 27: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/27.jpg)
Define: participant plug-in code
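The plug-in code on this slide did not survive extraction. As a rough stand-in, the sketch below shows the general shape of participant callbacks: one method per legal transition, named after the target and source state, invoked when a transition command arrives. Everything here is hypothetical illustration in plain Java (`ParticipantSketch`, `MasterSlaveCallbacks`, `dispatch`, and the log messages); a real Helix participant extends Helix's own state model classes and receives Helix message and context objects rather than a partition name, so consult the project documentation for the actual signatures.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

public class ParticipantSketch {
    /** Application callbacks, one per legal transition of the MasterSlave model. */
    public static class MasterSlaveCallbacks {
        public final List<String> log = new ArrayList<>();

        public void onBecomeSlaveFromOffline(String partition) {
            log.add(partition + ": start replicating");
        }

        public void onBecomeMasterFromSlave(String partition) {
            log.add(partition + ": accept writes");
        }

        public void onBecomeSlaveFromMaster(String partition) {
            log.add(partition + ": stop accepting writes");
        }

        public void onBecomeOfflineFromSlave(String partition) {
            log.add(partition + ": stop replicating");
        }
    }

    static String capitalize(String s) {
        return s.substring(0, 1).toUpperCase() + s.substring(1).toLowerCase();
    }

    // Maps a transition such as OFFLINE -> SLAVE to "onBecomeSlaveFromOffline".
    public static String callbackName(String from, String to) {
        return "onBecome" + capitalize(to) + "From" + capitalize(from);
    }

    // Invokes the callback for one transition; an unknown transition is rejected.
    public static void dispatch(Object callbacks, String partition, String from, String to) {
        try {
            Method m = callbacks.getClass().getMethod(callbackName(from, to), String.class);
            m.invoke(callbacks, partition);
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("No callback for " + from + "->" + to, e);
        }
    }

    public static void main(String[] args) {
        MasterSlaveCallbacks cb = new MasterSlaveCallbacks();
        dispatch(cb, "P1", "OFFLINE", "SLAVE");
        dispatch(cb, "P1", "SLAVE", "MASTER");
        for (String line : cb.log) {
            System.out.println(line);
        }
    }
}
```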
![Page 28: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/28.jpg)
Step 2: configure
helix-admin --zkSvr <zkAddress>
CREATE CLUSTER
--addCluster <clusterName>
ADD NODE
--addNode <clusterName instanceId(host:port)>
CONFIGURE RESOURCE
--addResource <clusterName resourceName partitions statemodel>
REBALANCE -> SET IDEALSTATE
--rebalance <clusterName resourceName replicas>
![Page 29: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/29.jpg)
zookeeper view IDEALSTATE
![Page 30: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/30.jpg)
Step 3: Run
START CONTROLLER
run-helix-controller --zkSvr localhost:2181 --cluster MyCluster
START PARTICIPANT
![Page 31: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/31.jpg)
zookeeper view
![Page 32: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/32.jpg)
Znode content
CURRENT STATE EXTERNAL VIEW
![Page 33: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/33.jpg)
Spectator plug-in code
![Page 34: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/34.jpg)
Helix execution modes
![Page 35: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/35.jpg)
IDEALSTATE
Configuration
• 3 nodes
• 3 partitions
• 2 replicas
• StateMachine
Constraints
• 1 Master
• 1 Slave
• Even distribution
| Partition | Replica placement and replica state (node:state) |
| --- | --- |
| P1 | N1:M, N2:S |
| P2 | N2:M, N3:S |
| P3 | N3:M, N1:S |
![Page 36: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/36.jpg)
Execution modes
• Who controls what
| | AUTO REBALANCE | AUTO | CUSTOM |
| --- | --- | --- | --- |
| Replica placement | Helix | App | App |
| Replica state | Helix | Helix | App |
Auto rebalance vs. Auto
(Panels: AUTO REBALANCE, AUTO)
![Page 38: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/38.jpg)
In action
Auto rebalance: MasterSlave p=3 r=2 N=3
| Node 1 | Node 2 | Node 3 |
| --- | --- | --- |
| P1:M | P2:M | P3:M |
| P2:S | P3:S | P1:S |

On failure: Auto create replica and assign state
| Node 1 | Node 2 | Node 3 |
| --- | --- | --- |
| P1:O | P2:M | P3:M |
| P2:O | P3:S | P1:S |
| | P1:M | P2:S |

Auto: MasterSlave p=3 r=2 N=3
| Node 1 | Node 2 | Node 3 |
| --- | --- | --- |
| P1:M | P2:M | P3:M |
| P2:S | P3:S | P1:S |

On failure: Only change states to satisfy constraint
| Node 1 | Node 2 | Node 3 |
| --- | --- | --- |
| P1:M | P2:M | P3:M |
| P2:S | P3:S | P1:M |
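The AUTO behaviour on failure (placement stays fixed, only states change) can be sketched as follows. This is a hypothetical illustration, not Helix code: `AutoModeSketch` and `assignStates` are made-up names, and the rule "the first live node in the placement list becomes master" is an assumption chosen to keep the example small.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AutoModeSketch {
    // Assigns states for one partition without moving replicas to other nodes.
    // Nodes that are down are marked Offline; the first live node becomes Master,
    // every other live node in the placement list becomes Slave.
    public static Map<String, String> assignStates(List<String> placement, Set<String> liveNodes) {
        Map<String, String> states = new LinkedHashMap<>();
        boolean masterAssigned = false;
        for (String node : placement) {
            if (!liveNodes.contains(node)) {
                states.put(node, "O");
            } else if (!masterAssigned) {
                states.put(node, "M");
                masterAssigned = true;
            } else {
                states.put(node, "S");
            }
        }
        return states;
    }

    public static void main(String[] args) {
        // P1 is placed on N1 (master) and N3 (slave), as in the tables above.
        List<String> p1 = Arrays.asList("N1", "N3");
        Set<String> allUp = new HashSet<>(Arrays.asList("N1", "N2", "N3"));
        Set<String> n1Down = new HashSet<>(Arrays.asList("N2", "N3"));
        System.out.println(assignStates(p1, allUp));   // {N1=M, N3=S}
        System.out.println(assignStates(p1, n1Down));  // {N1=O, N3=M}
    }
}
```

With Node 1 down, the replica of P1 on Node 3 is promoted and no new replica is created, which matches the "only change states" case. AUTO REBALANCE would additionally pick a new node for the lost replica.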
![Page 39: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/39.jpg)
Custom mode: example
![Page 40: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/40.jpg)
Custom mode: handling failure
• Custom code invoker
– Code that lives on all nodes, but active in one place
– Invoked when node joins/leaves the cluster
– Computes new idealstate
– Helix controller fires the transition without violating constraints

IDEALSTATE (before)
| Partition | Replicas (node:state) |
| --- | --- |
| P1 | N1:M, N2:S |
| P2 | N2:M, N3:S |
| P3 | N3:M, N1:S |

IDEALSTATE (after)
| Partition | Replicas (node:state) |
| --- | --- |
| P1 | N1:S, N2:M |
| P2 | N2:M, N3:S |
| P3 | N3:M, N1:S |

Transitions
1. N1 M->S
2. N2 S->M
1 & 2 in parallel violate single master constraint
Helix sends 2 after 1 is finished
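The ordering rule on this slide can be sketched as a small scheduler for a single partition. The names (`TransitionOrderSketch`, `schedule`) are hypothetical and the loop is a simplification, not the Helix controller: it holds back any promotion to Master until the number of current masters is below the limit, so the demotion always goes out first.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class TransitionOrderSketch {
    // Orders the transitions for one partition so that the number of masters
    // never exceeds maxMasters at any step. Keys are nodes, values are states.
    public static List<String> schedule(Map<String, String> current,
                                        Map<String, String> target,
                                        int maxMasters) {
        Map<String, String> state = new LinkedHashMap<>(current);
        List<String> order = new ArrayList<>();
        boolean progress = true;
        while (progress) {
            progress = false;
            for (String node : target.keySet()) {
                String from = state.getOrDefault(node, "O");
                String to = target.get(node);
                if (from.equals(to)) {
                    continue;
                }
                long masters = state.values().stream().filter("M"::equals).count();
                if (to.equals("M") && masters >= maxMasters) {
                    continue; // promotion must wait until a master has stepped down
                }
                state.put(node, to);
                order.add(node + " " + from + "->" + to);
                progress = true;
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, String> current = new LinkedHashMap<>();
        current.put("N1", "M");
        current.put("N2", "S");
        // The promotion is listed first on purpose; the scheduler reorders it.
        Map<String, String> target = new LinkedHashMap<>();
        target.put("N2", "M");
        target.put("N1", "S");
        System.out.println(schedule(current, target, 1)); // [N1 M->S, N2 S->M]
    }
}
```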
![Page 41: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/41.jpg)
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
![Page 42: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/42.jpg)
Tools
• Chaos monkey
• Data driven testing and debugging
• Rolling upgrade
• On demand task scheduling and intra-cluster messaging
• Health monitoring and alerts
![Page 43: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/43.jpg)
Data driven testing
• Instrument
– Zookeeper, controller, participant logs
• Simulate
– Chaos monkey
• Analyze
– Invariants are
– Respect state transition constraints
– Respect state count constraints
– And so on
• Debugging made easy
– Reproduce exact sequence of events
![Page 44: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/44.jpg)
Structured Log File - sample
| timestamp | partition | instanceName | sessionId | state |
| --- | --- | --- | --- | --- |
| 1323312236368 | TestDB_123 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236426 | TestDB_123 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236530 | TestDB_123 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236530 | TestDB_91 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236561 | TestDB_123 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | SLAVE |
| 1323312236561 | TestDB_91 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236685 | TestDB_123 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | SLAVE |
| 1323312236685 | TestDB_91 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236685 | TestDB_60 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236719 | TestDB_123 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | SLAVE |
| 1323312236719 | TestDB_91 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | SLAVE |
| 1323312236719 | TestDB_60 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | OFFLINE |
| 1323312236814 | TestDB_123 | express1-md_16918 | ef172fe9-09ca-4d77b05e-15a414478ccc | SLAVE |
![Page 45: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/45.jpg)
No more than R=2 slaves
| Time | State | Number of slaves | Instance |
| --- | --- | --- | --- |
| 42632 | OFFLINE | 0 | 10.117.58.247_12918 |
| 42796 | SLAVE | 1 | 10.117.58.247_12918 |
| 43124 | OFFLINE | 1 | 10.202.187.155_12918 |
| 43131 | OFFLINE | 1 | 10.220.225.153_12918 |
| 43275 | SLAVE | 2 | 10.220.225.153_12918 |
| 43323 | SLAVE | 3 | 10.202.187.155_12918 |
| 85795 | MASTER | 2 | 10.220.225.153_12918 |
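The "Number of slaves" column can be reproduced by replaying the state changes in order, and the invariant check is then a comparison against R. The sketch below does this for the seven rows of this table. The class and method names (`SlaveCountInvariant`, `slaveCounts`, `violations`) are hypothetical, and the event layout {time, state, instance} is an assumption made for the example rather than the actual log format used by the Helix tools.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SlaveCountInvariant {
    // Rows copied from the slide: {time, new state, instance}.
    public static final String[][] SAMPLE = {
        {"42632", "OFFLINE", "10.117.58.247_12918"},
        {"42796", "SLAVE", "10.117.58.247_12918"},
        {"43124", "OFFLINE", "10.202.187.155_12918"},
        {"43131", "OFFLINE", "10.220.225.153_12918"},
        {"43275", "SLAVE", "10.220.225.153_12918"},
        {"43323", "SLAVE", "10.202.187.155_12918"},
        {"85795", "MASTER", "10.220.225.153_12918"},
    };

    // Replays the events and returns the number of slaves after each one.
    public static List<Integer> slaveCounts(String[][] events) {
        Map<String, String> stateByInstance = new HashMap<>();
        List<Integer> counts = new ArrayList<>();
        for (String[] e : events) {
            stateByInstance.put(e[2], e[1]);
            int slaves = 0;
            for (String s : stateByInstance.values()) {
                if (s.equals("SLAVE")) {
                    slaves++;
                }
            }
            counts.add(slaves);
        }
        return counts;
    }

    // Returns the timestamps at which the slave count exceeds maxSlaves.
    public static List<String> violations(String[][] events, int maxSlaves) {
        List<Integer> counts = slaveCounts(events);
        List<String> out = new ArrayList<>();
        for (int i = 0; i < events.length; i++) {
            if (counts.get(i) > maxSlaves) {
                out.add(events[i][0]);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(slaveCounts(SAMPLE));   // [0, 1, 1, 1, 2, 3, 2]
        System.out.println(violations(SAMPLE, 2)); // [43323]
    }
}
```

The single violation at time 43323 corresponds to the row with 3 slaves; it clears at 85795 when one slave is promoted to master.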
![Page 46: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/46.jpg)
How long was it out of whack?
| Number of slaves | Time | Percentage |
| --- | --- | --- |
| 0 | 1082319 | 0.5 |
| 1 | 35578388 | 16.46 |
| 2 | 179417802 | 82.99 |
| 3 | 118863 | 0.05 |

| Number of masters | Time | Percentage |
| --- | --- | --- |
| 0 | 15490456 | 7.164960359 |
| 1 | 200706916 | 92.83503964 |

83% of the time, there were 2 slaves to a partition
93% of the time, there was 1 master to a partition
![Page 47: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/47.jpg)
Invariant 2: State transitions
| FROM | TO | COUNT |
| --- | --- | --- |
| MASTER | SLAVE | 55 |
| OFFLINE | DROPPED | 0 |
| OFFLINE | SLAVE | 298 |
| SLAVE | MASTER | 155 |
| SLAVE | OFFLINE | 0 |
![Page 48: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/48.jpg)
Outline
• Introduction
• Architecture
• How to use Helix
• Tools
• Helix usage
![Page 49: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/49.jpg)
Helix usage at LinkedIn
Espresso
![Page 50: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/50.jpg)
In flight
• Apache S4
– Partitioning, co-location
– Dynamic cluster expansion
• Archiva
– Partitioned replicated file store
– Rsync based replication
• Others in evaluation
– Bigtop
![Page 51: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/51.jpg)
Auto scaling software deployment tool
• States
– Download, Configure, Start
– Active, Standby
• Constraint for each state
– Download < 100
– Active 1000
– Standby 100
(Diagram: state machine over Offline, Download, Configure, Start, Active, and Standby, annotated with the limits < 100, 1000, and 100.)
![Page 52: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/52.jpg)
Summary
• Helix: A generic framework for building distributed systems
• Modifying/enhancing system behavior is easy
– Abstraction and modularity is key
• Simple programming model: declarative state machine
![Page 53: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/53.jpg)
Roadmap
• Features
– Span multiple data centers
– Automatic load balancing
– Distributed health monitoring
– YARN generic Application Master for real time apps
– Stand alone Helix agent
![Page 54: Building Distributed Systems Using Helix](https://reader033.vdocuments.mx/reader033/viewer/2022060108/55502094b4c905af648b52dc/html5/thumbnails/54.jpg)
website http://helix.incubator.apache.org
user [email protected]
twitter @apachehelix, @kishoreg1980