www.opendaylight.org

MD-SAL Clustering Internals

Moiz Raja Open Daylight Summit 2015

My Collaborators

▪ Abhishek Kumar
▪ Basheeruddin Ahmed
▪ Colin Dixon
▪ Harman Singh
▪ Kamal Rameshan
▪ Robert Varga
▪ Tony Tkacik
▪ Tom Pantelis
▪ Luis Gomez
▪ Phillip Shea
▪ Radhika Hirannaiah
▪ and many more…

Agenda

▪ Architecture
▪ Modules
▪ Flows
▪ Diagnostics
▪ Questions

Architecture

Subsystems

[Diagram: two subsystems. The Distributed Data Store is drawn across member-1, member-2 and member-3; the Remote RPC Connector is drawn between member-1 and member-2.]

High Level Architecture

[Diagram labels: Distributed Data Store, Remote RPC Connector; Persistence, Remoting, Clustering]

Actor Systems

[Diagram labels: Distributed Data Store, Remote RPC Connector; Actor Hierarchy, Configuration, Dispatchers]

Data Synchronization

▪ Data store: Synchronized Data Tree, Raft for Distributed Consensus
▪ Remote RPC: Synchronized RPC Registry, Gossip for data distribution

Distributed Data Store Architecture

Accessing Remote Data

[Diagram labels: Client, member-1, member-2]

Location Transparency

[Diagram labels: Client, DistributedDataStore, member-1, member-2]

DistributedDataStore

[Diagram labels: DOMStore, DistributedDataStore]

Communication

[Diagram labels: Client, DistributedDataStore, Shard, member-1, member-2]

Data Distribution

[Diagram labels: Client, DistributedDataStore, member-1, member-2, member-3; shards topology and inventory]

Module Based Shards

[Diagram: the data tree root / with the subtrees /inventory, /topology and /toaster, plus a shard named default]
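The module-to-shard mapping on this slide can be pictured as a simple lookup. The following is a minimal sketch, not the data store's code: the names `MODULE_SHARDS` and `shard_for_path` are hypothetical, and it assumes that any path without a shard of its own lands in `default`.

```python
# Hypothetical module-to-shard table, mirroring the subtrees on the slide.
MODULE_SHARDS = {"inventory": "inventory", "topology": "topology", "toaster": "toaster"}
DEFAULT_SHARD = "default"


def shard_for_path(path: str) -> str:
    """Return the shard name for a data tree path such as '/inventory/nodes'."""
    # The first path segment identifies the top-level module.
    segments = [s for s in path.split("/") if s]
    if not segments:
        return DEFAULT_SHARD  # the root '/' itself (assumption)
    return MODULE_SHARDS.get(segments[0], DEFAULT_SHARD)
```

For example, `shard_for_path("/inventory/nodes")` resolves to the inventory shard, while a path under an unlisted module falls back to `default`.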

HA

[Diagram: Client, DistributedDataStore and the inventory leader on member-1; inventory follower-1 and inventory follower-2 on member-2 and member-3]

Raft Distributed Consensus

Election
▪ starts up / recovers: Follower
▪ Follower times out, starts elections: Candidate
▪ Candidate times out, restarts elections: Candidate
▪ Candidate receives votes from majority of nodes: Leader
▪ discovers node with higher term: back to Follower

Replication/Consensus
[Diagram labels: leader, follower-1, follower-2]
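The election half of the diagram is a small state machine. Here is a minimal sketch of it, assuming the standard Raft transitions; the `Role`, `Event`, `next_role` and `has_majority` names are hypothetical and are not taken from sal-akka-raft.

```python
from enum import Enum


class Role(Enum):
    FOLLOWER = "Follower"
    CANDIDATE = "Candidate"
    LEADER = "Leader"


class Event(Enum):
    ELECTION_TIMEOUT = "times out"
    MAJORITY_VOTES = "receives votes from majority of nodes"
    HIGHER_TERM_SEEN = "discovers node with higher term"


def next_role(role: Role, event: Event) -> Role:
    """Return the role after `event`, following the election diagram."""
    if event is Event.HIGHER_TERM_SEEN:
        return Role.FOLLOWER  # any role steps down to Follower
    if event is Event.ELECTION_TIMEOUT and role in (Role.FOLLOWER, Role.CANDIDATE):
        return Role.CANDIDATE  # start, or restart, an election
    if event is Event.MAJORITY_VOTES and role is Role.CANDIDATE:
        return Role.LEADER
    return role  # all other combinations leave the role unchanged


def has_majority(votes: int, cluster_size: int) -> bool:
    """True when `votes` is a strict majority of `cluster_size` nodes."""
    return votes > cluster_size // 2
```

In a three-member cluster, two votes (the candidate's own plus one more) are enough for `has_majority` to hold.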

Journal replication

[Diagram: leader, follower-1 and follower-2 each hold the same journal: transaction-1, transaction-2, transaction-3, transaction-4]

Snapshot Replication

[Diagram labels: leader, follower-1, follower-2]

Durability/Recovery

▪ Journal
▪ Snapshots
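Journal and snapshots usually combine in recovery like this: restore the most recent snapshot, then replay only the journal entries written after it. Below is a minimal sketch under that assumption, using a plain key/value map as the replicated state; `recover`, `JournalEntry` and `Snapshot` are hypothetical names.

```python
from typing import Dict, List, Optional, Tuple

# A journal entry is (index, key, value); a snapshot is (last_included_index, state).
JournalEntry = Tuple[int, str, str]
Snapshot = Tuple[int, Dict[str, str]]


def recover(snapshot: Optional[Snapshot], journal: List[JournalEntry]) -> Dict[str, str]:
    """Rebuild state from the latest snapshot plus the journal entries after it."""
    last_index, state = (snapshot[0], dict(snapshot[1])) if snapshot else (-1, {})
    for index, key, value in sorted(journal):
        if index > last_index:  # entries at or before the snapshot are already included
            state[key] = value
    return state
```

The snapshot bounds recovery time, because only the tail of the journal has to be replayed.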

Remote RPC Architecture

Invoking a Remote RPC

[Diagram labels: Consumer, Provider, member-1, member-2]

Location Transparency

[Diagram labels: Consumer, Provider, RpcProviderProxy, RemoteRpcBroker, member-1, member-2]

RPC Registry

[Diagram labels: Provider, RPC Registration Listener, RPC Registry]

RPC Registry Replication - Gossip

▪ Local bucket updates change version (modify: version=1 becomes version=2)
▪ All buckets and their versions known to all members (for example m1,v1; m2,v5; m3,v7)
▪ Every 1 second members send all known bucket versions to any one peer (status)
▪ On receiving a status: local versions higher – send update; local versions lower – send status to sender
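The comparison a member makes when a status message arrives can be sketched as follows. This is only an illustration of the rule on the slide, not the BucketStore code; `compare_status` is a hypothetical name, and treating an unknown bucket as version -1 is an assumption.

```python
from typing import Dict, Tuple

Versions = Dict[str, int]  # bucket owner (member name) -> bucket version


def compare_status(local: Versions, remote: Versions) -> Tuple[Versions, bool]:
    """Handle a status message carrying the sender's known bucket versions.

    Returns (buckets_to_send, reply_with_status):
    - buckets_to_send: buckets whose local version is higher than the sender's
    - reply_with_status: True if any local version is lower than the sender's,
      so that the sender learns it should send an update back
    """
    to_send = {m: v for m, v in local.items() if v > remote.get(m, -1)}
    behind = any(v > local.get(m, -1) for m, v in remote.items())
    return to_send, behind
```

With local versions `m1,v1; m2,v5; m3,v7` and a sender reporting `m2,v4; m3,v9`, the member would push its `m2` bucket and answer with its own status to obtain the newer `m3` bucket.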

Modules

▪ sal-clustering-commons
▪ sal-akka-raft
▪ sal-remoterpc-connector
▪ sal-distributed-datastore
▪ sal-clustering-config
▪ sal-akka-raft-example
▪ sal-dummy-distributed-datastore
▪ clustering-test-app

sal-clustering-commons

▪ Some common messages
▪ Actor base classes
▪ The Protobuf messages used in Helium
▪ The Protobuf NormalizedNode serialization code
▪ The NormalizedNode streaming code
▪ Other miscellaneous utility classes

sal-akka-raft

▪ Implementation of the Raft algorithm on top of akka
▪ Uses akka-persistence for durability
▪ Provides a base class called RaftActor, which can be extended by anyone who wants to replicate state
▪ See sal-akka-raft-example, which provides a simple implementation of a replicated HashMap

sal-distributed-datastore

▪ ConcurrentDOMDataBroker
▪ DistributedDataStore
▪ Implementation of the DOMStore SPI
▪ Shard built on top of RaftActor
▪ Creates Shards based on Sharding strategy
▪ Code for a client to interact with the Shard Leader

sal-remoterpc-connector

▪ RemoteRpcProvider
▪ Default RPC Provider. Invoked when an RPC is not found in the local MD-SAL registry.
▪ Code for BucketStore, which provides a mechanism to replicate state based on Gossip
▪ Code for RpcBroker, which allows invoking a remote RPC

Data store flows

Startup

[Sequence diagram: DistributedConfigDataStoreProviderModule calls createInstance; the DistributedDataStore is created with its ActorContext and waitTillReadyLatch (create & waitTillReady); the ShardManager creates Shard1, Shard2, Shard3, Shard4]

Recovery

[Sequence diagram: each Shard reads its last known state from disk and reports ready to the ShardManager; countDown is then called on waitTillReadyLatch]

Waiting for Ready

▪ Recovery must be complete
▪ All Shard Leaders must be known
▪ Three messages are monitored by ShardManager
  ▪ Cluster.MemberStatusUp: used to figure out the address of a cluster member
  ▪ LeaderStateChanged: used to figure out if a Follower has a different Leader
  ▪ ShardRoleChanged: used to figure out any changes in a Shard's Role
▪ Waiting is not infinite; by default it lasts only 90 seconds, but this is configurable
▪ Will block the config sub-system
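The wait described above behaves like a one-shot latch with a timeout. Here is a minimal sketch, assuming a single countDown once every shard is ready; `ReadyLatch` and `all_shards_ready` are hypothetical names, and the 90-second default is the value quoted on the slide.

```python
import threading


class ReadyLatch:
    """Minimal one-shot latch: wait_till_ready blocks until count_down is called."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def count_down(self) -> None:
        self._ready.set()

    def wait_till_ready(self, timeout_seconds: float = 90.0) -> bool:
        # Returns True if ready, False if the timeout elapsed first.
        return self._ready.wait(timeout_seconds)


def all_shards_ready(recovered: dict, leaders: dict) -> bool:
    """Ready when every shard has finished recovery and knows its leader."""
    return all(recovered.values()) and all(
        leaders.get(shard) is not None for shard in recovered
    )
```

Because the caller blocks inside `wait_till_ready`, anything started after it (on the slide, the config sub-system) is held up until the latch opens or the timeout expires.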

Creating a Transaction

[Sequence diagram: newReadWriteTransaction on DistributedDataStore creates a TransactionProxy]

First Operation

[Flowchart: TransactionProxy write(“inventory”, node) leads to ActorContext.findPrimary, which uses PrimaryCache.lookup/ShardManager.findPrimary.
Found? N: NoOpTransactionContext. Y: Local? Y: LocalTransactionContext, N: RemoteTransactionContext]
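The two decisions in the flowchart (Found? and Local?) reduce to a short function. This is a sketch only: `choose_context` and its parameters are hypothetical, and a plain dictionary stands in for the PrimaryCache.lookup/ShardManager.findPrimary step.

```python
from typing import Dict, Optional


def choose_context(shard: str, primaries: Dict[str, str], local_member: str) -> str:
    """Pick the transaction context type for the first operation on `shard`.

    `primaries` maps shard name -> member hosting the shard leader and stands
    in for the PrimaryCache.lookup / ShardManager.findPrimary step.
    """
    primary: Optional[str] = primaries.get(shard)
    if primary is None:
        return "NoOpTransactionContext"  # no primary found
    if primary == local_member:
        return "LocalTransactionContext"  # leader is on this member
    return "RemoteTransactionContext"  # leader is on another member
```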

Transactions

[Diagram, two panels. Local Transaction: Client, DistributedDataStore and the inventory leader all on member-1. Remote Transaction: Client and DistributedDataStore on member-1, the inventory leader on member-2]

Local Transaction Optimization

[Sequence diagram on member-1: LocalTransactionContext and Shard - Leader; operations write, merge, delete, ready]

Remote Transaction Optimization

[Sequence diagram: RemoteTransactionContext on member-1, Shard Leader on member-2; operations write, merge, delete, ready; messages write mod, merge mod, delete mod]

Transaction Rate Limiting

[Diagram: rate-limit = 100 Tx/Sec; three Tx Cohorts against the Shard Leader on member-2 take 20ms, 50ms and 15ms; after rate-limit/2 transactions done, new-rate-limit = 25 Tx/Sec]

Operation Limiting

[Sequence diagram: RemoteTransactionContext on member-1, Shard Leader on member-2; operations write, merge, delete, …; messages write mod, merge mod, delete mod, …; further operations block]
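One common way to get the blocking behaviour shown here is a counting semaphore: each operation takes a permit before it is sent, and the permit is returned when the operation is acknowledged. The slide shows only the block, so the semaphore, the timeout, and the `OperationLimiter` name in this sketch are assumptions.

```python
import threading


class OperationLimiter:
    """Caps the number of outstanding (unacknowledged) operations."""

    def __init__(self, max_outstanding: int) -> None:
        self._permits = threading.BoundedSemaphore(max_outstanding)

    def acquire(self, timeout_seconds: float) -> bool:
        # Blocks while the limit is reached; False means the wait timed out.
        return self._permits.acquire(timeout=timeout_seconds)

    def release(self) -> None:
        # Called when the Shard Leader acknowledges an operation.
        self._permits.release()
```

A client that calls `acquire` before every write, merge or delete cannot run arbitrarily far ahead of a slow Shard Leader.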

Commit Coordination

[Diagram: Shard Leader on member-2 with the Shard CommitCoordinator holding Tx1, Tx2, Tx3. Messages in order: Tx1 - ready, Tx2 - ready, Tx3 - ready, Tx1 - commit, Tx3 - commit, Tx3 - abort, Tx2 - commit]

Managing the in-memory journal: Replicated To All

[Diagram: Client sends commit transaction to the leader; the txn appears in the journals of leader, follower-1 and follower-2]

Managing the in-memory journal: Cluster member unavailable

[Diagram: Client sends commit transaction to the leader; leader, follower-1 and follower-2 are shown with journals holding differing numbers of txn entries]
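The two journal slides suggest that in-memory entries can be dropped only once every member holds them, which is why the journal grows while a member is unavailable. Here is a simplified sketch of that idea; the function names are hypothetical, the `replicatedToAllIndex` term comes from the AppendEntries log lines in this deck, and the real trimming rules (for example their interaction with snapshots) are not modelled.

```python
from typing import Dict, List


def replicated_to_all_index(last_index: int, match_index: Dict[str, int]) -> int:
    """Highest journal index that the leader and every follower hold (-1 if none)."""
    return min([last_index, *match_index.values()])


def trim_journal(journal: List[int], replicated_to_all: int) -> List[int]:
    """Drop in-memory entries (given by index) that every member already has."""
    return [index for index in journal if index > replicated_to_all]
```

If follower-2 is down and its match index stays at -1, `replicated_to_all_index` stays at -1 and `trim_journal` keeps every entry.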

Data Change Notifications

[Diagram: Client sends commit transaction to the leader; the txn is replicated to follower-1 and follower-2; notify]

RPC Connector flows

Startup

[Diagram labels: RemoteRpcBrokerModule createInstance, RemoteRpcProvider, RpcManager, RpcBroker, RpcRegistry, RemoteRpcImpl, RpcListener]

Default RPC Delegate

[Sequence diagram: RpcManager reads all rpc definitions from SchemaContext and calls registerImplementation(remoteRpcImpl) on DOMRpcProviderService]

RPC Registered

[Diagram labels: RpcProviderRegistry addRoutedRpcImpl, RoutedRpcRegistration registerPath, RpcListener, RpcRegistry]

Invoking a Remote RPC

[Flowchart: RemoteRpcImpl invokeRpc consults the RpcRegistry. Route found? Yes: RpcBroker, ExecuteRpc, FooService. No: throw Exception]
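The flowchart can be read as "look up the route, then either execute remotely or fail". Below is a minimal sketch of that control flow with in-memory stand-ins: the dictionaries replace the RpcRegistry and the per-member RpcBroker, and `invoke_rpc` and `RouteNotFound` are hypothetical names.

```python
from typing import Callable, Dict

Handler = Callable[[dict], dict]


class RouteNotFound(Exception):
    """Raised when no member has registered the requested RPC."""


def invoke_rpc(registry: Dict[str, str], brokers: Dict[str, Dict[str, Handler]],
               rpc_name: str, rpc_input: dict) -> dict:
    """Look up the route for `rpc_name` and execute it on the owning member."""
    member = registry.get(rpc_name)  # "Route found?"
    if member is None:
        raise RouteNotFound(rpc_name)
    # Stand-in for sending ExecuteRpc to the RpcBroker on that member.
    return brokers[member][rpc_name](rpc_input)
```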

Invoking a Remote RPC

[Sequence diagram: Consumer and RemoteRpcImpl on member-1, RpcBroker and Provider on member-2; calls invokeRpc, findRoute (RpcRegistry), ExecuteRpc, invokeRpc]

Data store Diagnostics

Transaction Tracing

Client
Created txn member-2-txn-9400 of type READ_WRITE on chain member-2-txn-chain-13
Tx member-2-txn-9400 read /(urn:opendaylight:inventory?...
Tx member-2-txn-9400 Readying 1 transactions for commit
Tx member-2-txn-9400 commit
Tx member-2-txn-9400: commit succeeded

Server
member-3-shard-inventory-operational: Creating transaction : shard-member-2-txn-9400
member-3-shard-inventory-operational: Readying transaction member-2-txn-9400
member-3-shard-inventory-operational: Committing transaction member-2-txn-9400

Callouts on the log lines: Cluster Member Initiator, Counter, Transaction Type, Module, Data store type
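When grepping logs it helps to split these identifiers into the parts named by the callouts. The following is a small sketch that does so with regular expressions. It assumes member names of the form `member-N` and exactly the two data store types seen in this deck; the function and class names are hypothetical.

```python
import re
from typing import NamedTuple


class TxnId(NamedTuple):
    member: str   # cluster member that initiated the transaction
    counter: int  # per-member transaction counter


class ShardId(NamedTuple):
    member: str      # member hosting this shard replica
    module: str      # shard (module) name
    store_type: str  # data store type: operational or config


def parse_txn_id(text: str) -> TxnId:
    m = re.fullmatch(r"(member-\d+)-txn-(\d+)", text)
    if m is None:
        raise ValueError(f"not a transaction id: {text!r}")
    return TxnId(m.group(1), int(m.group(2)))


def parse_shard_id(text: str) -> ShardId:
    m = re.fullmatch(r"(member-\d+)-shard-(.+)-(operational|config)", text)
    if m is None:
        raise ValueError(f"not a shard id: {text!r}")
    return ShardId(m.group(1), m.group(2), m.group(3))
```

Applied to the trace above, the transaction was initiated on member-2 and served by the inventory shard of the operational data store on member-3.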

Replication Tracing

Leader
Sending AppendEntries to follower member-2-shard-topology-operational: AppendEntries [term=2, leaderId=member-1-shard-topology-operational, prevLogIndex=520, prevLogTerm=2, entries=[Entry{index=521, term=2}], leaderCommit=520, replicatedToAllIndex=-1]

Follower
handleAppendEntries: AppendEntries [term=2, leaderId=member-2-shard-topology-operational, prevLogIndex=520, prevLogTerm=2, entries=[Entry{index=521, term=2}], leaderCommit=520, replicatedToAllIndex=-1]
handleAppendEntries returning : AppendEntriesReply [term=2, success=true, logLastIndex=521, logLastTerm=2, followerId=member-1-shard-topology-operational]

Leader
handleAppendEntriesReply from member-2-shard-topology-operational: applying to log – commitIndex: 521, lastAppliedIndex: 520
handleAppendEntriesReply - FollowerLogInformation for member-2-shard-topology-operational updated: matchIndex: 521, nextIndex: 522
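The last log line shows the leader's bookkeeping after a successful reply: matchIndex becomes the follower's logLastIndex and nextIndex is one past it. The sketch below reproduces that update. The success branch follows the log above; the failure branch is the textbook Raft fallback and is an assumption about this implementation, and the Python field names are illustrative.

```python
from dataclasses import dataclass


@dataclass
class FollowerLogInformation:
    match_index: int = -1  # highest index known to be replicated on the follower
    next_index: int = 0    # index of the next entry to send to the follower


def handle_append_entries_reply(info: FollowerLogInformation,
                                success: bool, log_last_index: int) -> None:
    """Update the leader's view of a follower after an AppendEntriesReply."""
    if success:
        info.match_index = log_last_index
        info.next_index = log_last_index + 1
    else:
        # Standard Raft fallback: retry from one entry earlier.
        info.next_index = max(0, info.next_index - 1)
```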

Shard MBean

org.opendaylight.controller:type=DistributedOperationalDataStore,Category=Shards,name=member-1-shard-inventory-operational

Variable parts of the name: Operational / Config; member-1 / member-2 / member-3; default / inventory / topology; operational / config

Attributes

• AbortTransactionsCount
• CommitIndex
• CommittedTransactionsCount
• CurrentTerm
• FailedTransactionsCount
• FollowerInfo
• FollowerInitialSyncStatus
• InMemoryJournalDataSize
• InMemoryJournalLogSize
• LastApplied
• LastCommittedTransactionTime
• LastIndex
• LastTerm
• Leader
• RaftState
• ReadOnlyTransactionCount
• ReadWriteTransactionCount
• WriteOnlyTransactionCount
• VotedFor
• and more…
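Since the MBean name is assembled from a few variable parts, a helper can generate it for any member, shard, and data store type. The sketch below follows the spelling on this slide (`...DataStore`). The GeneralRuntimeInfo slide spells the type `DistributedConfigDatastore`, so the exact capitalization should be checked against a running controller; `shard_mbean_name` is a hypothetical helper.

```python
def shard_mbean_name(member: str, shard: str, store_type: str) -> str:
    """Build the Shard MBean name for e.g. ('member-1', 'inventory', 'operational')."""
    if store_type not in ("operational", "config"):
        raise ValueError(f"unknown data store type: {store_type!r}")
    # Spelling of the type follows the Shard MBean slide (an assumption).
    type_part = "Distributed" + store_type.capitalize() + "DataStore"
    shard_name = f"{member}-shard-{shard}-{store_type}"
    return f"org.opendaylight.controller:type={type_part},Category=Shards,name={shard_name}"
```

The resulting string can be pasted into a JMX console search to find the attributes listed above for a particular shard replica.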

ShardManager MBean

org.opendaylight.controller:type=DistributedOperationalDataStore,Category=ShardManager,name=shard-manager-operational

Variable parts of the name: Operational / Config; operational / config

Attributes

• LocalShards
• SyncStatus

Data store GeneralRuntimeInfo MBean

org.opendaylight.controller:type=DistributedConfigDatastore,name=GeneralRuntimeInfo

Variable part of the name: Operational / Config

Attributes

• TransactionCreationRateLimit

Transaction Commit Rate MBean

org.opendaylight.controller.cluster.datastore:name=distributed-data-store.config.commit.rate

Variable part of the name: operational / config

Attributes

• 50thPercentile
• 75thPercentile
• 90thPercentile
• and so on…
• Count
• Min
• Max
• StdDev

Message Statistics MBean

org.opendaylight.controller.actor.metric:name=/user/shardmanager-config.msg-rate.ActorInitialized

Variable parts of the name: operational / config; Message Name (here ActorInitialized)

Attributes

• 50thPercentile
• 75thPercentile
• 90thPercentile
• and so on…
• Count
• Min
• Max
• StdDev

Remote RPC Broker Diagnostics

RemoteRpcBroker MBean

org.opendaylight.controller:type=RemoteRpcBroker,name=RemoteRpcRegistry

Attributes

• BucketVersions
• GlobalRpc
• LocalRegisteredRoutedRpc

Operations

• findRpcByName
• findRpcByRoute

Message Statistics MBean

org.opendaylight.controller.actor.metric:name=/user/rpc/registry.msg-rate.AddOrUpdateRoutes

Variable part of the name: Message Name (here AddOrUpdateRoutes)

Attributes

• 50thPercentile
• 75thPercentile
• 90thPercentile
• and so on…
• Count
• Min
• Max
• StdDev

Suggested Next Steps…

▪ Deploy a cluster
▪ Run clustering integration tests
▪ Write an application that works in the cluster
▪ File bugs to report features that you find missing
▪ Try running dsBenchMark on a cluster
▪ Test out replication using the dummy data store
▪ Check out the code
▪ Send email to controller-[email protected] with questions