TRANSCRIPT

Upgrading under the weight of all that state
Quinton Anderson
Context

Canonical Model

[Diagram: multiple Sources feed Raw Data, which is refined into Business Data and exposed through Access Layers behind a Load Balancer; annotations like //TODO, Function and Cntrl-V mark the ad hoc logic that accumulates around the model.]
Scaling
Downstream systems
• Specialised management systems
• Reporting Systems
• Product management

Channel & product systems
Master Data Management

Hadoop
• Leverage all data & reduce integration costs
• Comprehensive dataset – internal & external, realtime & batch, structured & unstructured
• Advanced analytics / machine learning

Group Data Warehouse
• Understand our business
• Accurate, conformed, and reconciled data
• Access layer to support BI & reporting

BI/Reporting
• User facing tools
• Regulatory reporting
• Dashboarding
• Self service BI for the masses
[Flow labels: Customer record & insights; All data; Price, conversation, credit dec. etc.; Financial Data; Subset of data; User access; Reconciled data; Information for people]
Core Financial Systems and functions
• P&L
• Recon
• General Ledger
• Etc…

Closed loop, automated ‘decisions’

Decisioning
• Personalise/optimise decisions, maximise customer value
• E.g. price, credit decision, next conversation, experience
Core information repositories
Analytics applications
Other systems
Channels
Hadoop
Rules
Serving and decisioning
Analytic Records
Systems Of Record
Core Banking Payments
[Diagram: Event Streams from the channels (www) flow into an Event Processor; Raw Data and Derived Data feed a Feature Store and an Event Store; Scoring and Machine Learning run over them. Customer information data is loaded, data is analysed & processed, and insights & events are captured, all behind Integration API/Service Discovery.]
> 4000 Daily Batch Jobs
> 6 PB of State and growing
HBase, Cassandra, HDFS, Influx, Elasticsearch, Kafka, Etcd, Zookeeper, OpenStack Swift
Oracle, MySQL, Postgres
Hundreds of services
MR1, MR2, Spark, Akka
Dev, Test, Staging, Prod 1, Prod 2, Etc…
== Complexity
Imperative:
Culture
Architecture
Immutable
Someone else’s computer
State Locality
Workload non-locality
Flexible over optimal
Practically, it is a closed system
State management is my problem
All abstractions are leaky
[Stack diagram, top to bottom: Apps (Spark, MR, Impala, etc; Marathon, Chronos, Cassandra, etc) on Mesos and Yarn; Docker + Calico; OpenStack — Nova (OS on KVM) and Nova/Ironic (OS on Firmware + Hardware + Tags) — with Repo(s) and CI/CD pipelines feeding each layer.]
Strategies
Outsource the problem, and tool away the resulting issues
Delete it, tool away the resulting issues
Be stateless, tool away the resulting issues
Implement some patterns, incrementally optimise. Tool away the resulting issues
Excess Capacity
Patterns

[Diagram, shown twice: a Consumer calls a Router, which directs requests between the old Web App + DB pair and the new Web App + DB pair.]
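The router pattern above can be sketched as a thin routing function that sends a configurable fraction of consumers to the new stack. The backend names and the hashing scheme here are illustrative assumptions, not the talk's implementation; the key property is that routing is deterministic per consumer, so each consumer's state stays on one side.

```python
import hashlib

# Hypothetical backends: the "old" web app + DB pair keeps serving most
# traffic while the "new" pair is phased in behind the router.
BACKENDS = {"old": "old-webapp.internal", "new": "new-webapp.internal"}

def route(consumer_id: str, new_fraction: float = 0.1) -> str:
    """Deterministically route a consumer: the same consumer always lands
    on the same backend, so its state stays in one place."""
    digest = hashlib.md5(consumer_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return BACKENDS["new"] if bucket < new_fraction * 100 else BACKENDS["old"]
```

Ramping `new_fraction` from 0 to 1 migrates traffic incrementally, which is what lets the pattern absorb risk a slice at a time.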
L4 HAProxy

[Slide sequence, upgrading the pool one node at a time behind the balancer:
Old Old Old Old
Old Old Old Old New
Old Old Old New New
Old Old New New New
Old New New New New
New New New New New]
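The slide sequence above — add excess capacity first, then swap one node at a time behind the L4 balancer — can be simulated as a simple loop. The node labels are placeholders; a real run would drain, upgrade, health-check and re-enable each node at the marked step.

```python
def rolling_upgrade(pool):
    """Replace 'Old' nodes behind the balancer one at a time, yielding the
    pool after each step (mirroring the slide sequence)."""
    pool = list(pool) + ["New"]      # add excess capacity before removing anything
    yield tuple(pool)
    for i in range(len(pool)):
        if pool[i] == "Old":
            pool[i] = "New"          # drain, upgrade, health-check, re-enable
            yield tuple(pool)        # the balancer never loses more than one node

steps = list(rolling_upgrade(["Old"] * 4))
```

Each yielded step flips exactly one node, so capacity never drops below the original pool size.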
== Incrementally accept risk
In place upgrade
Stateful
CAP, PACELC
Data models
Atomicity
Access patterns
Implementation approaches = ??
Upgrade Duration O(N)
for node in nodes:
    if info[node]['instance']:
        if Status(node).run().wait() == AVAILABLE_FOR_MAINTENANCE:
            MaintenanceMode(node).run().wait()
            Upgrade(node).run().wait()
            health = HealthTests(node).run().wait()
            UpdateStatus(node, health).run().wait()
all_good = True
host = self.cdh.get_host(self.host_map[self.node_name])
if host.healthSummary != 'GOOD':
    all_good = False

# Look up the host by its roles
for c in self.cdh.get_all_clusters():
    for s in c.get_all_services():
        for r in s.get_all_roles():
            h = r.hostRef
            if h.hostId == self.host_map[self.node_name]:
                if r.healthSummary != 'GOOD':
                    all_good = False
return all_good
O(log N)
val nodeComputation = for {
  _          <- Status(node)
  _          <- MaintenanceMode(node)
  _          <- Upgrade(node)
  nodeResult <- HealthTests(node)
} yield nodeResult

val upgrade = for {
  node <- group
  comp <- nodeComputation(node)
} yield comp.exec

groups.map(upgrade)
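The same idea rendered in Python, to show why running the per-node pipeline over groups rather than one node at a time moves the wall-clock duration from O(N) toward O(N / group size). The step function is a stand-in for the Status/MaintenanceMode/Upgrade/HealthTests sequence above, not the talk's actual task framework.

```python
from concurrent.futures import ThreadPoolExecutor

def upgrade_node(node):
    """Stand-in for: status -> maintenance mode -> upgrade -> health tests,
    run sequentially for a single node."""
    return f"{node}:GOOD"

def upgrade_in_groups(nodes, group_size):
    """Upgrade group_size nodes concurrently; groups run one after another,
    so the cluster never has more than group_size nodes out at once."""
    results = []
    with ThreadPoolExecutor(max_workers=group_size) as pool:
        for start in range(0, len(nodes), group_size):
            group = nodes[start:start + group_size]
            results.extend(pool.map(upgrade_node, group))
    return results
```

The group size is the risk dial: 1 reproduces the O(N) loop, larger groups trade availability margin for duration.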
[Stack diagram, top to bottom: Apps (Spark, MR, Impala, etc; Marathon, Chronos, Cassandra, etc) on Mesos and Yarn; Docker + Calico; OpenStack — Nova (OS on KVM) and Nova/Ironic (OS on Firmware + Hardware + Tags) — with Repo(s) and CI/CD pipelines feeding each layer.]
Workflow
[Diagram: Jenkins pipeline per environment — Branch → PR → Merge to Master; Deploy to Dev from the branch, Deploy to Test from Master, then a Change Plan.]
clusters:
  green-cluster:
    dns:
      nameservers:
        - x.x.x.x
    data_domain: *.*.*
    etcd:
      token: green-cluster
    masters:
      able:
        provision_id: 1
        lan:
          - mac: 0c:c4:7a:c1:2e:92
            ip: 1.1.11.151/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: a7af76ad-6583-4209-ba5f-cf1477b6405e
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
      theta:
        provision_id: 2
        lan:
          - mac: 0c:c4:7a:a9:04:0c
            ip: 1.1.11.53/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: 8ff1fd1c-4893-11e6-a447-2f366077ca0e
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
      tobias:
        provision_id: 3
        lan:
          - mac: 0c:c4:7a:a8:f6:ac
            ip: 1.11.11.52/24
            vlan: 11
            gateway: 1.1.1.1
        ironic_id: c89fdd08-232c-40fe-b965-49fc3e4dcba7
        flavor: ramish-baremetal-flavor2
        image: *mesos-master-green
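A sketch of how such a cluster definition might drive provisioning. The config is shown here as a plain dict rather than parsed YAML (the starred values above are anchors resolved elsewhere), and `provision_order` is a hypothetical helper, not the talk's tooling; the point is that `provision_id` gives a deterministic bring-up order for quorum members.

```python
# Simplified stand-in for the green-cluster masters above; the real entries
# also carry lan/mac/ironic_id details consumed by the provisioning calls.
cluster = {
    "masters": {
        "able":   {"provision_id": 1, "flavor": "ramish-baremetal-flavor2"},
        "theta":  {"provision_id": 2, "flavor": "ramish-baremetal-flavor2"},
        "tobias": {"provision_id": 3, "flavor": "ramish-baremetal-flavor2"},
    }
}

def provision_order(cluster):
    """Return master names sorted by provision_id, so etcd/Mesos quorum
    members come up in a deterministic order."""
    masters = cluster["masters"].items()
    return [name for name, m in sorted(masters, key=lambda kv: kv[1]["provision_id"])]
```

Each name in the returned list would then be handed to the Ironic/Nova provisioning step with its flavor, image and network details.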
Recommendations
Instrument as much of deployment and provisioning as you can
Optimise incrementally, learn the right hard lessons
Allow for manual intervention, but attack it aggressively
Encourage your people to intervene
Prevent Pets
Spend more time on testing