TRANSCRIPT

Upgrading under the weight of all that state
Quinton Anderson
Context

Canonical Model

[Diagram: multiple Sources feed Raw Data, which is refined into Business Data and exposed through Access Layers behind a Load Balancer; annotations like //TODO, Function and Cntrl-V mark the ad hoc logic that accumulates around the model.]
Scaling
Downstream systems
• Specialised management systems
• Reporting Systems
• Product management

Channel & product systems
Master Data Management

Hadoop
• Leverage all data & reduce integration costs
• Comprehensive dataset – internal & external, realtime & batch, structured & unstructured
• Advanced analytics / machine learning

Group Data Warehouse
• Understand our business
• Accurate, conformed, and reconciled data
• Access layer to support BI & reporting

BI/Reporting
• User facing tools
• Regulatory reporting
• Dashboarding
• Self service BI for the masses
[Flow labels: Customer record & insights; All data; Price, conversation, credit dec. etc.; Financial Data; Subset of data; User access; Reconciled data; Information for people]
Core Financial Systems and functions
• P&L
• Recon
• General Ledger
• Etc…

Closed loop, automated ‘decisions’

Decisioning
• Personalise/optimise decisions, maximise customer value
• E.g. price, credit decision, next conversation, experience
Core information repositories
Analytics applications
Other systems
Channels
Hadoop
Rules
Serving and decisioning
Analytic Records
Systems Of Record
Core Banking Payments
[Diagram: Event Streams from the channels (www) flow into an Event Processor; Raw Data and Derived Data feed a Feature Store and an Event Store; Scoring and Machine Learning run over them. Customer information data is loaded, data is analysed & processed, and insights & events are captured, all behind Integration API/Service Discovery.]
> 4000 Daily Batch Jobs
> 6 PB of State and growing
HBase, Cassandra, HDFS, Influx, Elasticsearch, Kafka, Etcd, Zookeeper, OpenStack Swift
Oracle, MySQL, Postgres
Hundreds of services
MR1, MR2, Spark, Akka
Dev, Test, Staging, Prod 1, Prod 2, Etc…
== Complexity
Imperative:
Culture
Architecture
Immutable
Someone else’s computer
State Locality
Workload non-locality
Flexible over optimal
Practically, it is a closed system
State management is my problem
All abstractions are leaky
[Stack diagram, top to bottom: Apps (Spark, MR, Impala, etc; Marathon, Chronos, Cassandra, etc) on Mesos and Yarn; Docker + Calico; OpenStack — Nova (OS on KVM) and Nova/Ironic (OS on Firmware + Hardware + Tags) — with Repo(s) and CI/CD pipelines feeding each layer.]
Strategies
Outsource the problem, and tool away the resulting issues
Delete it, tool away the resulting issues
Be stateless, tool away the resulting issues
Implement some patterns, incrementally optimise. Tool away the resulting issues
Excess Capacity
Patterns

[Diagram, shown twice: a Consumer calls a Router, which directs requests between the old Web App + DB pair and the new Web App + DB pair.]
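The router pattern above can be sketched as a thin routing function that sends a configurable fraction of consumers to the new stack. The backend names and the hashing scheme here are illustrative assumptions, not the talk's implementation; the key property is that routing is deterministic per consumer, so each consumer's state stays on one side.

```python
import hashlib

# Hypothetical backends: the "old" web app + DB pair keeps serving most
# traffic while the "new" pair is phased in behind the router.
BACKENDS = {"old": "old-webapp.internal", "new": "new-webapp.internal"}

def route(consumer_id: str, new_fraction: float = 0.1) -> str:
    """Deterministically route a consumer: the same consumer always lands
    on the same backend, so its state stays in one place."""
    digest = hashlib.md5(consumer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return BACKENDS["new"] if bucket < new_fraction * 100 else BACKENDS["old"]
```

Ramping `new_fraction` from 0 to 1 migrates traffic incrementally, which is what lets the pattern absorb risk a slice at a time.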
L4 HAProxy

[Slide sequence, upgrading the pool one node at a time behind the balancer:
Old Old Old Old
Old Old Old Old New
Old Old Old New New
Old Old New New New
Old New New New New
New New New New New]
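The slide sequence above — add excess capacity first, then swap one node at a time behind the L4 balancer — can be simulated as a simple loop. The node labels are placeholders; a real run would drain, upgrade, health-check and re-enable each node at the marked step.

```python
def rolling_upgrade(pool):
    """Replace 'Old' nodes behind the balancer one at a time, yielding the
    pool after each step (mirroring the slide sequence)."""
    pool = list(pool) + ["New"]      # add excess capacity before removing anything
    yield tuple(pool)
    for i in range(len(pool)):
        if pool[i] == "Old":
            pool[i] = "New"          # drain, upgrade, health-check, re-enable
            yield tuple(pool)        # the balancer never loses more than one node

steps = list(rolling_upgrade(["Old"] * 4))
```

Each yielded step flips exactly one node, so capacity never drops below the original pool size.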
== Incrementally accept risk
In place upgrade
Stateful
CAP, PACELC
Data models
Atomicity
Access patterns
Implementation approaches = ??
Upgrade Duration O(N)
for node in nodes:
    if info[node]['instance']:
        if Status(node).run().wait() == AVAILABLE_FOR_MAINTENANCE:
            MaintenanceMode(node).run().wait()
            Upgrade(node).run().wait()
            health = HealthTests(node).run().wait()
            UpdateStatus(node, health).run().wait()
all_good = True
host = self.cdh.get_host(self.host_map[self.node_name])
if host.healthSummary != 'GOOD':
    all_good = False

# Look up the host by its roles
for c in self.cdh.get_all_clusters():
    for s in c.get_all_services():
        for r in s.get_all_roles():
            h = r.hostRef
            if h.hostId == self.host_map[self.node_name]:
                if r.healthSummary != 'GOOD':
                    all_good = False
return all_good
O(log N)
val nodeComputation = for {
  _          <- Status(node)
  _          <- MaintenanceMode(node)
  _          <- Upgrade(node)
  nodeResult <- HealthTests(node)
} yield nodeResult

val upgrade = for {
  node <- group
  comp <- nodeComputation(node)
} yield comp.exec

groups.map(upgrade)
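The same idea rendered in Python, to show why running the per-node pipeline over groups rather than one node at a time moves the wall-clock duration from O(N) toward O(N / group size). The step function is a stand-in for the Status/MaintenanceMode/Upgrade/HealthTests sequence above, not the talk's actual task framework.

```python
from concurrent.futures import ThreadPoolExecutor

def upgrade_node(node):
    """Stand-in for: status -> maintenance mode -> upgrade -> health tests,
    run sequentially for a single node."""
    return f"{node}:GOOD"

def upgrade_in_groups(nodes, group_size):
    """Upgrade group_size nodes concurrently; groups run one after another,
    so the cluster never has more than group_size nodes out at once."""
    results = []
    with ThreadPoolExecutor(max_workers=group_size) as pool:
        for start in range(0, len(nodes), group_size):
            group = nodes[start:start + group_size]
            results.extend(pool.map(upgrade_node, group))
    return results
```

The group size is the risk dial: 1 reproduces the O(N) loop, larger groups trade availability margin for duration.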
[Stack diagram, top to bottom: Apps (Spark, MR, Impala, etc; Marathon, Chronos, Cassandra, etc) on Mesos and Yarn; Docker + Calico; OpenStack — Nova (OS on KVM) and Nova/Ironic (OS on Firmware + Hardware + Tags) — with Repo(s) and CI/CD pipelines feeding each layer.]
Workflow
[Diagram: Jenkins pipeline per environment — Branch → PR → Merge to Master; Deploy to Dev from the branch, Deploy to Test from Master, then a Change Plan.]
clusters:
  green-cluster:
    dns:
      nameservers:
        - x.x.x.x
    data_domain: *.*.*
    etcd:
      token: green-cluster
    masters:
      able:
        provision_id: 1
        lan:
          - mac: 0c:c4:7a:c1:2e:92
            ip: 1.1.11.151/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: a7af76ad-6583-4209-ba5f-cf1477b6405e
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
      theta:
        provision_id: 2
        lan:
          - mac: 0c:c4:7a:a9:04:0c
            ip: 1.1.11.53/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: 8ff1fd1c-4893-11e6-a447-2f366077ca0e
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
      tobias:
        provision_id: 3
        lan:
          - mac: 0c:c4:7a:a8:f6:ac
            ip: 1.11.11.52/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: c89fdd08-232c-40fe-b965-49fc3e4dcba7
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
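A sketch of how such a cluster definition might drive provisioning. The config is shown here as a plain dict rather than parsed YAML (the starred values above are anchors resolved elsewhere), and `provision_order` is a hypothetical helper, not the talk's tooling; the point is that `provision_id` gives a deterministic bring-up order for quorum members.

```python
# Simplified stand-in for the green-cluster masters above; the real entries
# also carry lan/mac/ironic_id details consumed by the provisioning calls.
cluster = {
    "masters": {
        "able":   {"provision_id": 1, "flavor": "ramish-baremetal-flavor2"},
        "theta":  {"provision_id": 2, "flavor": "ramish-baremetal-flavor2"},
        "tobias": {"provision_id": 3, "flavor": "ramish-baremetal-flavor2"},
    }
}

def provision_order(cluster):
    """Return master names sorted by provision_id, so etcd/Mesos quorum
    members come up in a deterministic order."""
    masters = cluster["masters"].items()
    return [name for name, m in sorted(masters, key=lambda kv: kv[1]["provision_id"])]
```

Each name in the returned list would then be handed to the Ironic/Nova provisioning step with its flavor, image and network details.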
Recommendations
Instrument as much of deployment and provisioning as you can
Optimise incrementally, learn the right hard lessons
Allow for manual intervention, but attack it aggressively
Encourage your people to intervene
Prevent Pets
Spend more time on testing