it tech agile config
8/2/2019 IT Tech Agile Config
1/33
CERN IT Department
CH-1211 Genève 23, Switzerland
www.cern.ch/it
The Agile Infrastructure Project
Part 1: Configuration Management
Tim Bell
Gavin McCance
-
Configuration and Operations Tools
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
https://agileinf.cern.ch/jira/
IT Technical Forum 27 Jan 2012 2
-
Project scope
The project is reviewing the entire CERN computer-centre management toolset
What happens from the bare metal up
Asset management, inventory
Sysadmin tools and maintenance workflows
Service management and configuration tools
Dynamic configuration for virtual hosts
Operations monitoring
Workflow automation and continuous deployment
-
Configuration and Operations Tools
-
Why?
Current production system built around the Quattor toolset is successfully managing 10k
servers
(CERN) Quattor + many CERN components
Why are we changing the toolset?
-
What are the issues
Uncompressible technical debt
The cost to develop and maintain our own solution is
not reducing and clearly exceeds our resources
Small community (less funding) and a general support
problem. At CERN, we've fallen into the "sticky hands"
support model
We need better automation and integration
between the sub-components
Lack of automated workflow: everything is a ticket
emailScript: your added value in the process is often your
CERN password
The 15-min CDB commit walk: context-switch cost
-
What are the issues
Transferrable skills and training
Learning curve for our tools is steep and remains high
It's easier to hire people who have skills in a widely used
tool than in your internal tools
Depending on where you look
-
Job adverts (indeed.com)
Index of millions of worldwide job posts across thousands of job sites
These are the sort of posts our departing staff will be applying for.
[Chart: job-advert trends, series labelled Puppet and Quattor]
-
Integration is hard
IPv6, virtualisation, Windows Server all need a
solution
We could leverage lots of open source tools
But piecemeal integration of these requires high
investment due to our complex system
Years of organic growth have made the system way too hairy
It's often easier to reinvent rather than integrate
Lack of dynamic-ness in the infrastructure
We hack the config system for dynamic VMs
It's critical to look at the system as a whole
-
Where to look?
Large ops community out there taking the tool
chain approach whose scaling needs match ours:
O(100k) servers, many apps
Become standard and join this community
-
Use Puppet for the core
The tool space has exploded in the last few years
In configuration management and ops
Large, shared tool forges, and lots of experience
Puppet and Chef are the clear leaders for the core tool
other tools in our scope try to integrate with those
Many large-scale enterprises use Puppet
Its declarative approach fits better with what we're used to
Large installations: friendly, wide-base community and
commercial support and training
You can buy books on it
-
Scaling challenges: nodes
Currently we have O(10k) physical nodes
IaaS approach:
Moving to virtual machines
More (smaller, load-balanced) service nodes
VMs for raw compute (batch or pilot jobs)
Homogeneous: compute + storage on the same node
Add another computer centre, 24/48 SMT cores per node,
you get 100k to 300k virtual nodes to be managed
A 99.6% (1) node update success rate means 1200 manual
interventions to fix it
(1) in a recent intervention on lxbatch
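The 1200 figure can be checked with a short sketch, assuming the 99.6% success rate is applied to the upper estimate of 300k virtual nodes. The function name is illustrative and not from the slides.

```python
def manual_interventions(nodes: int, success_rate: float) -> int:
    """Expected number of nodes left needing a manual fix after one update."""
    return round(nodes * (1.0 - success_rate))


# Upper and lower estimates of the virtual node count from the slide.
print(manual_interventions(300_000, 0.996))
print(manual_interventions(100_000, 0.996))
```

At the lower estimate of 100k nodes the same rate still leaves several hundred nodes to fix by hand, which is why the failure rate matters more as the node count grows.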
-
Scaling challenges: people
Many, diverse applications (clusters) managed by different
teams
..and 700+ other unmanaged Linux nodes in VMs that could
benefit from a simple configuration system
-
Agile Infrastructure 1st Try
First started investigating tools in September using part-
time resources from CF, DB, DSS, GT, OIS and PES
Trying iterative agile-sprint style (Scrum): short sprints,
feedback, sprint review, visible
Take first, best-guess at architecture and tool selection, iterate
Mixed success with this agile style
What works: Good visibility and reviews.
Daily scrum meeting useful.
Weekly review meeting open to management.
What doesn't: The time-boxing part of Scrum
sprints is hard with part-time resources
The project planning now foresees more dedication of staff
-
Agile Infrastructure 1st Try
We're currently running:
OpenStack as cloud software for virtual machines, image
management, bulk storage
Future IT forum presentation
Puppet for the configuration management core
with Foreman as a dashboard
-
Foreman dashboard
-
Agile Infrastructure 1st Try
We're currently running:
OpenStack as cloud software for virtual machines, image
management, bulk storage
Future IT forum presentation
Puppet for the configuration management core
with Foreman as a dashboard
None of the tools are perfect out-of-the-box
..but we'd rather submit patches to a good open source tool than
re-implement it
We've experienced very good community support: RFCs and patches
are quickly accepted
Very active community: often problems are fixed and missing features
implemented before you even report them
-
Agile Infrastructure 1st Try
We're currently running:
yum for software distribution (replacing spma)
git for template management: why git?
Almost all the Puppet (and Chef) usage schemes out
there assume you use git to handle the templates
Many of the tools we can benefit from also assume git
We should not be different from the rest of the community
-
Puppet
Client/server architecture
puppetmaster: horizontally scalable Rails application
X509 cert authenticated nodes: integrate with CERN CA
-
Puppet
Puppet runs on the client, applying
the configuration changes
It detects the current state and only
runs if there's something to do
It runs every few minutes
New configuration will be ~immediately applied (fail-fast).
This is a change from CDB where latent changes can be stacked up
Normal mode is client-side compile (assume success)
No more CDB commit waits
Change from CDB: the compilation fails later
Good monitoring is a pre-req: puppet sends reports back to
the puppetmaster
The Foreman tool can collect these for you
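The "detect the current state and only act if there's something to do" behaviour can be illustrated with a minimal Python sketch. It is not Puppet code: `FileResource`, `converge` and `run` are hypothetical names, and a plain dictionary stands in for the node's filesystem.

```python
from dataclasses import dataclass


@dataclass
class FileResource:
    """Desired state of one file: a hypothetical stand-in for a managed resource."""
    path: str
    content: str


def converge(resource: FileResource, node_files: dict) -> bool:
    """Bring node_files in line with the resource; return True if a change was made."""
    if node_files.get(resource.path) == resource.content:
        return False  # already in the desired state: nothing to do
    node_files[resource.path] = resource.content
    return True


def run(catalog: list, node_files: dict) -> dict:
    """One agent run: converge every resource and build a report of what changed."""
    changed = [r.path for r in catalog if converge(r, node_files)]
    return {"changed": changed, "total": len(catalog)}


node = {"/etc/motd": "old banner\n"}
catalog = [FileResource("/etc/motd", "Welcome to lxbatch\n")]
first = run(catalog, node)   # the file differs, so it is rewritten
second = run(catalog, node)  # the node already matches, so nothing changes
print(first, second)
```

The second run reports no changes, which is what makes it safe to run every few minutes. The returned dictionary plays the role of the report that a dashboard could collect.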
-
Puppet language
Puppet uses its own Ruby-like language for the templates
to assert the desired state of the nodes
With Ruby fall-back for hard stuff (we've only needed this
once)
Being declarative rather than procedural, there are quirks
Takes a bit of practice to get it
There are books, online docs, online cook-books, and a large
community to help
It dispenses with the need for ncm components
All the work is done by puppet on the node itself: you just
provide the template part to assert what you want done
Less software -> easier to move to new OS versions
-
Externals
Puppet uses an external DB for much of the configuration
that we currently store in textual CDB templates
Node function + hardware
Moving a host between clusters is a DB update
Your configuration can use variables the node detects itself
e.g. reconfigure daemons based on where a newly live-migrated VM
has found itself
Query the compiled configuration of other hosts
e.g. "Open my firewall to the lxadm nodes"
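As a rough illustration of keeping node function in a database and deriving configuration from other hosts' data, here is a minimal Python sketch. The in-memory `NODE_DB`, the host names and the rule format are all assumptions made for the example, not the actual Puppet or CERN schema.

```python
# Hypothetical external node database: host -> cluster (node function).
NODE_DB = {
    "lxadm01": "lxadm",
    "lxadm02": "lxadm",
    "web01": "webserver",
}


def move_host(host: str, cluster: str) -> None:
    """Moving a host between clusters is a single update in the database."""
    NODE_DB[host] = cluster


def hosts_in(cluster: str) -> list:
    """Query the database for every host currently assigned to a cluster."""
    return sorted(h for h, c in NODE_DB.items() if c == cluster)


def firewall_rules(allowed_cluster: str, port: int) -> list:
    """Derive 'open my firewall to the lxadm nodes' from the other hosts' data."""
    return [f"allow {h} tcp/{port}" for h in hosts_in(allowed_cluster)]


print(firewall_rules("lxadm", 22))
move_host("web01", "lxadm")         # reassigning the host is one DB update
print(firewall_rules("lxadm", 22))  # the derived rules follow automatically
```

The firewall rules are never edited by hand here: they are recomputed from the database, so moving a host changes every configuration that depends on cluster membership.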
-
Moving towards PaaS
Parametrisable recipes
Just fill in the blanks
The aim is to make it easy to use pre-canned recipes
without even touching a Puppet template
e.g. stick a standard CERN SSO-enabled apache / mod_wsgi
/ Django server on my box
with these parameters
Moving us in the PaaS direction
Ultimately, it would be better if you never even needed to log
into this node (J2EE public service, IT web hosting service, MySQL service)
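The "fill in the blanks" idea can be sketched in Python with the standard library's `string.Template`. The recipe text, the parameter names and the default value below are invented for illustration and do not reflect a real CERN recipe.

```python
from string import Template

# Hypothetical pre-canned recipe: the user only supplies the parameters.
RECIPE = Template(
    "server_name $server_name\n"
    "wsgi_app    $wsgi_app\n"
    "sso         $sso\n"
)

# Defaults supplied by the recipe author (assumed for this example).
DEFAULTS = {"sso": "enabled"}


def render(**params) -> str:
    """Fill in the recipe; raises KeyError if a required blank is left empty."""
    values = {**DEFAULTS, **params}
    return RECIPE.substitute(values)


print(render(server_name="myapp.example.org", wsgi_app="myapp.wsgi"))
```

The user never touches the recipe itself, only the parameters, and a missing parameter fails loudly instead of producing a half-configured service.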
-
Standard workflow
CDB on lxadm: check out from CDB -> update templates -> CDB commit -> run and check on test node -> notify with nc-client -> iterate (n minutes)
Puppet on lxadm: check out from git -> update templates -> git commit and push -> run and check on test node -> notify with mcollective -> iterate (1 minute)
Puppet-apply on test node: check out from git on the test node -> update templates -> run puppet-apply -> check on test node -> notify with mcollective -> iterate
Further boxes in the diagram: check on foreman (twice), check on node(s), git commit and push
-
Modernising our processes
Our software processes for the computer centre are fairly
limited ("fire-and-forget" broadcasts to project-elfms)
and rather manual
The manual test/ -> preprod/ -> prod/ template dance
Our toolset RPMs are built on a laptop and uploaded to swrep
by hand
Add standard CI (e.g. Jenkins, Bamboo, Cruise)
and automated build (Koji) as the only route to
get new packages into the CC
.. then automate the testing
e.g. suitably tagged RPMs are automatically deployed to /test
nodes.
-
Modernising our processes
We're working out which of the many puppet / git models
suits us
code review, sign-off and automated notification for changes
that will affect multiple clusters
How to automate the test/preprod/prod advancement
Pre-req is flexible monitoring and alarming
you need to trust that an automation failure will be signalled
to you
Script-generated emails are banned
Need good monitoring to hang these notifications on
Integrate components rather than use emailScript
Script-generated tickets (where your value in the process is
your password) are banned
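One possible shape for automated test/preprod/prod advancement is sketched below in Python. It assumes a `healthy(stage, change)` callback that stands in for a monitoring query. Both the callback and the `promote` function are hypothetical: the slides leave the actual model open.

```python
STAGES = ["test", "preprod", "prod"]


def promote(change: str, healthy) -> list:
    """Deploy a change stage by stage; stop at the first stage whose check fails.

    `healthy(stage, change)` is an assumed hook into the monitoring system.
    Returns the stages the change was deployed to, in order.
    """
    deployed = []
    for stage in STAGES:
        deployed.append(stage)
        if not healthy(stage, change):
            break  # the failure is left for monitoring to signal; no email is sent
    return deployed


# A change that is healthy everywhere reaches prod.
print(promote("new-rpm", lambda stage, change: True))
# A change that fails in preprod never reaches prod.
print(promote("bad-rpm", lambda stage, change: stage != "preprod"))
```

The advancement is only as trustworthy as the health check, which is why flexible monitoring is listed as the pre-requisite.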
-
Current tool snapshot (liable to change)
Jenkins
Koji, Mock
Puppet
Foreman
AIMS/PXE
Foreman
Yum repo
Pulp
Puppet stored config DB
mcollective, yum
JIRA
Lemon
git, SVN
OpenStack Nova
Hardware database
-
Preliminary timelines
Year  What                   Actions
2011                         Agree overall principles
2012                         Prepare formal project plan
                             Establish IaaS in CERN CC
                             Production Agile Infrastructure
                             Monitoring Implementation as per WG
                             Migrate lxcloud
                             Early adopters to Agile Infrastructure
2013  LSD 1                  Extend IaaS to remote CC
      New Data Centre        Business Continuity
                             Support Experiment App re-work
                             Migrate CVI
                             General migration to Agile with SLC6 and Windows 8
2014  LSD 1 (to November)    Phase out Quattor/CDB/
Aggressive schedule if we are to make it for new data centre
-
Initial steps
Decide on tools now and integrate them together
to make a production setup (Q1)
We can still change... but we're starting to commit
Looking for early adopters (from Q1)
In particular to understand the people-scaling / ACL
issues: which of the git/puppet models is best?
e.g. PES/OIS services: batch/VMs, JIRA, Drupal
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012
Help with integration / coding
Help with ideas
Help with building the task list
-
Summary
IT has started a new project to move our infrastructure to a
new toolset based around industry standard open source
components
Puppet for the core configuration tool
Better integration between components
Use of more modern software processes to aid deployment
Better monitoring
Engage with the community rather than re-implement
Overall project scope is wider (future IT forums)
Cloud and virtualisation, improved monitoring
Please get involved early and give feedback
https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure
-
Backup slides
-
Code ownership model
The "sticky hands" support model (you touched it last!)
We're working out an FE-based model where
Code is owned by the related service Functional-Element
Ownership confers the responsibility to maintain a decent
standard config for the computer centre, and the
responsibility to roll out new versions of that code/config
Patches from interested people can be offered, and if you
take them, you support them
not the guy that gave you the patch
-
mcollective and messaging
mcollective is a notification framework
Mix of CERN's not.d / wassh
It broadcasts instructions to run pre-canned tasks to nodes
selected by a filter, collects the results from the nodes,
then renders that result for the CLI
e.g. "restart all my webservers", "do a puppet run now"
It requires a messaging framework that all nodes subscribe
to (to receive the notification)
Typically: ActiveMQ or RabbitMQ
Both OpenStack and our (future) monitoring system need a
CC-wide messaging system as well
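The filter, broadcast, collect and render steps can be mimicked in a few lines of Python. This is an in-process simulation only: real mcollective delivers the request through the message broker, and the node names, facts and function names here are made up for the example.

```python
def broadcast(nodes: dict, node_filter, task) -> dict:
    """Run a pre-canned task on every node matching the filter; collect replies."""
    return {
        name: task(facts)
        for name, facts in sorted(nodes.items())
        if node_filter(facts)
    }


def render(results: dict) -> str:
    """Render the collected replies for the command line, one node per line."""
    return "\n".join(f"{name}: {reply}" for name, reply in results.items())


# Hypothetical nodes with the facts a filter could select on.
NODES = {
    "web01": {"role": "webserver"},
    "web02": {"role": "webserver"},
    "db01": {"role": "database"},
}

# "Restart all my webservers": filter on role, then run the canned task.
results = broadcast(
    NODES,
    lambda facts: facts["role"] == "webserver",
    lambda facts: "restarted",
)
print(render(results))
```

In the real system the sender does not know the node list in advance: every node subscribes to the messaging framework and decides for itself whether the filter applies.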