it tech agile config

Upload: nadia-metoui

Post on 06-Apr-2018


  • 8/2/2019 IT Tech Agile Config

    1/33

    CERN IT Department

    CH-1211 Genève 23, Switzerland

    www.cern.ch/it

    The Agile Infrastructure Project

    Part 1: Configuration Management

    Tim Bell

    Gavin McCance

  • 8/2/2019 IT Tech Agile Config

    2/33

    Configuration and Operations Tools

    https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure

    https://agileinf.cern.ch/jira/

    IT Technical Forum 27 Jan 2012 2

  • 8/2/2019 IT Tech Agile Config

    3/33

    Project scope

    The project is reviewing the entire CERN computer-centre management toolset

    What happens from the bare metal up

    Asset management, inventory

    Sysadmin tools and maintenance workflows

    Service management and configuration tools

    Dynamic configuration for virtual hosts

    Operations monitoring

    Workflow automation and continuous deployment

  • 8/2/2019 IT Tech Agile Config

    4/33

    Configuration and Operations Tools

  • 8/2/2019 IT Tech Agile Config

    5/33

    Why?

    Current production system, built around the Quattor toolset, is successfully managing 10k servers

    (CERN) Quattor + many CERN components

    Why are we changing the toolset?

  • 8/2/2019 IT Tech Agile Config

    6/33

    What are the issues

    Uncompressible technical debt

    The cost to develop and maintain our own solution is not reducing and clearly exceeds our resources

    Small community (less funding) and a general support problem. At CERN, we've fallen into the "sticky hands" support model

    We need better automation and integration between the sub-components

    Lack of automated workflow: everything is a ticket

    emailScript: your added value in the process is often your CERN password

    The 15-min CDB commit "walk": context-switch cost

  • 8/2/2019 IT Tech Agile Config

    7/33

    What are the issues

    Transferrable skills and training

    Learning curve for our tools is steep and remains high

    It's easier to hire people who have skills in a widely-used tool than in your internal tools

    Depending on where you look

  • 8/2/2019 IT Tech Agile Config

    8/33

    Job adverts: indeed.com

    [Chart comparing job adverts that mention Puppet and Quattor]

    Index of millions of worldwide job posts across thousands of job sites

    These are the sort of posts our departing staff will be applying for.

  • 8/2/2019 IT Tech Agile Config

    9/33

    Integration is hard

    IPv6, virtualisation, Windows Server all need a solution

    We could leverage lots of open source tools

    But piecemeal integration of these requires high investment due to our complex system

    Years of organic growth have made the system way too hairy

    It's often easier to reinvent rather than integrate

    Lack of dynamic-ness in the infrastructure

    We hack the config system for dynamic VMs

    It's critical to look at the system as a whole

  • 8/2/2019 IT Tech Agile Config

    10/33

    Where to look?

    Large ops community out there taking the tool-chain approach, whose scaling needs match ours: O(100k) servers, many apps

    Become standard and join this community

  • 8/2/2019 IT Tech Agile Config

    11/33

    Use Puppet for the core

    The tool space has exploded in the last few years

    In configuration management and ops

    Large, shared tool forges, and lots of experience

    Puppet and Chef are the clear leaders for the core tool

    Other tools in our scope try to integrate with those

    Many large-scale enterprises use Puppet

    Its declarative approach fits better with what we're used to

    Large installations: friendly, wide-base community, plus commercial support and training

    You can buy books on it

  • 8/2/2019 IT Tech Agile Config

    12/33

    Scaling challenges: nodes

    Currently we have O(10k) physical nodes

    IaaS approach:

    Moving to virtual machines

    More (smaller, load-balanced) service nodes

    VMs for raw compute (batch or pilot jobs)

    Homogeneous: compute + storage on the same node

    Add another computer centre, 24/48 SMT cores per node, and you get 100k to 300k virtual nodes to be managed

    A 99.6% (1) node update success rate on 300k nodes still means 1200 manual interventions to fix it

    (1) in a recent intervention on lxbatch
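The arithmetic behind the 1200 figure can be checked with a short sketch. The function name `failed_updates` is illustrative only, and the node counts are the round numbers quoted on the slides.

```python
def failed_updates(total_nodes: int, success_rate: float) -> int:
    """Expected number of nodes left needing a manual fix after one update."""
    return round(total_nodes * (1.0 - success_rate))


# Node counts from the slides: today's O(10k), then the 100k to 300k range.
for nodes in (10_000, 100_000, 300_000):
    print(f"{nodes:>7} nodes -> {failed_updates(nodes, 0.996)} manual interventions")
```

At the upper end of the range, the same success rate that is tolerable today turns into more than a thousand hand repairs per update.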

  • 8/2/2019 IT Tech Agile Config

    13/33

    Scaling challenges: people

    Many, diverse applications (clusters) managed by different teams

    ...and 700+ other unmanaged Linux nodes in VMs that could benefit from a simple configuration system

  • 8/2/2019 IT Tech Agile Config

    14/33

    Agile Infrastructure 1st Try

    First started investigating tools in September using part-time resources from CF, DB, DSS, GT, OIS and PES

    Trying an iterative agile-sprint style (Scrum): short sprints, feedback, sprint review, visible

    Take a first best guess at architecture and tool selection, then iterate

    Mixed success with this agile style

    What works: good visibility and reviews. Daily scrum meeting useful. Weekly review meeting open to management.

    What doesn't: the time-boxing part of Scrum sprints is hard with part-time resources

    The project planning now foresees more dedication of staff

  • 8/2/2019 IT Tech Agile Config

    15/33

    Agile Infrastructure 1st Try

    We're currently running:

    OpenStack as cloud software for virtual machines, image management, bulk storage (subject of a future IT forum presentation)

    Puppet for the configuration management core

    with Foreman as a dashboard

  • 8/2/2019 IT Tech Agile Config

    16/33

    Foreman dashboard

  • 8/2/2019 IT Tech Agile Config

    17/33

    Agile Infrastructure 1st Try

    We're currently running:

    OpenStack as cloud software for virtual machines, image management, bulk storage (subject of a future IT forum presentation)

    Puppet for the configuration management core

    with Foreman as a dashboard

    None of the tools are perfect out-of-the-box

    ...but we'd rather submit patches to a good open source tool than re-implement it

    We've experienced very good community support: RFCs and patches are quickly accepted

    Very active community: often problems are fixed and missing features implemented before you even report them

  • 8/2/2019 IT Tech Agile Config

    18/33

    Agile Infrastructure 1st Try

    We're currently running:

    yum for software distribution (replacing spma)

    git for template management: why git?

    Almost all the Puppet (and Chef) usage schemes out there assume you use git to handle the templates

    Many of the tools we can benefit from also assume git

    We should not be different from the rest of the community

  • 8/2/2019 IT Tech Agile Config

    19/33

    Puppet

    Client/server architecture

    puppetmaster: horizontally scalable Rails application

    X509 cert authenticated nodes: integrate with CERN CA

  • 8/2/2019 IT Tech Agile Config

    20/33

    Puppet

    Puppet runs on the client, applying the configuration changes

    It detects the current state and only runs if there's something to do

    It runs every few minutes

    New configuration will be applied almost immediately (fail-fast)

    This is a change from CDB, where latent changes can be stacked up

    Normal mode is client-side compile (assume success)

    No more CDB commit waits

    Change from CDB: the compilation fails later

    Good monitoring is a pre-req: puppet sends reports back to the puppetmaster

    The Foreman tool can collect these for you
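The "detect the current state and only act if there's something to do" behaviour can be illustrated with a minimal Python sketch. This is not Puppet's implementation: the `FileResource` class, the `converge` function and the dictionary standing in for a node's file system are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileResource:
    """Desired state of one file (hypothetical resource type)."""
    path: str
    content: str


def converge(desired, system):
    """Change only what differs from the desired state; return the changed paths."""
    changed = []
    for res in desired:
        if system.get(res.path) != res.content:
            system[res.path] = res.content
            changed.append(res.path)
    return changed


# A dict stands in for the node's current state.
system = {"/etc/motd": "old banner"}
desired = [
    FileResource("/etc/motd", "Welcome"),
    FileResource("/etc/ntp.conf", "server ntp.example.org"),
]

first_run = converge(desired, system)   # both files need changing
second_run = converge(desired, system)  # nothing left to do
print(first_run, second_run)
```

The list of changed paths returned by each run is the kind of information a report sent back to the server would carry, which is why monitoring of those reports matters.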

  • 8/2/2019 IT Tech Agile Config

    21/33

    Puppet language

    Puppet uses its own Ruby-like language for the templates to assert the desired state of the nodes

    With Ruby fall-back for hard stuff (we've only needed this once)

    Being declarative rather than procedural, there are quirks

    Takes a bit of practice to get it

    There are books, online docs, online cook-books, and a large community to help

    It dispenses with the need for ncm components

    All the work is done by puppet on the node itself; you just provide the template part to assert what you want done

    Less software -> easier to move to new OS versions

  • 8/2/2019 IT Tech Agile Config

    22/33

    Externals

    Puppet uses an external DB for much of the configuration that we currently store in textual CDB templates

    Node function + hardware

    Moving a host between clusters is a DB update

    Your configuration can use variables the node detects itself

    e.g. reconfigure daemons based on where a newly live-migrated VM has found itself

    Query the compiled configuration of other hosts

    e.g. "Open my firewall to the lxadm nodes"

  • 8/2/2019 IT Tech Agile Config

    23/33

    Moving towards PaaS

    Parametrisable recipes: "Just fill in the blanks"

    The aim is to make it easy to use pre-canned recipes without even touching a Puppet template

    e.g. "stick a standard CERN SSO-enabled apache / mod_wsgi / Django server on my box with these parameters"

    Moving us in the PaaS direction

    Ultimately, it would be better if you never even needed to log into this node (J2EE public service, IT web hosting service, MySQL service)
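A "fill in the blanks" recipe can be pictured as a function with sensible defaults that the user overrides selectively. The sketch below is only an analogy in Python; the recipe name `django_server_recipe` and its parameters (`sso`, `wsgi_processes`, `listen_port`) are invented for illustration and do not correspond to any existing CERN recipe.

```python
# Hypothetical defaults a service team might ship with the recipe.
DEFAULTS = {"sso": True, "wsgi_processes": 2, "listen_port": 443}


def django_server_recipe(app_name, **overrides):
    """Return the full desired configuration: defaults plus the user's blanks."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        # Reject typos early instead of silently ignoring them.
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {"app_name": app_name, **DEFAULTS, **overrides}


# The user only states what differs from the standard setup.
config = django_server_recipe("myapp", wsgi_processes=8)
print(config)
```

Rejecting unknown parameters keeps the recipe's interface small, so users never need to read the underlying template.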

  • 8/2/2019 IT Tech Agile Config

    24/33

    Standard workflow

    [Diagram: three edit-test loops compared side by side. Box labels are listed in transcript order.]

    CDB on lxadm: check out from CDB; update templates; CDB commit; run and check on test node; notify with nc-client; iterate (n minutes)

    Puppet on lxadm: check out from git; update templates; git commit and push; run and check on test node; notify with mcollective; iterate (1 minute)

    Puppet-apply on test node: check out from git on the test node; update templates; run puppet-apply; check on test node; notify with mcollective; iterate

    Further boxes in the diagram: check on foreman; check on node(s); check on foreman; git commit and push

  • 8/2/2019 IT Tech Agile Config

    25/33

    Modernising our processes

    Our software processes for the computer centre are fairly limited (fire-and-forget broadcasts to project-elfms) and rather manual

    The manual test/ -> preprod/ -> prod/ template dance

    Our toolset RPMs are built on a laptop and uploaded to swrep by hand

    Add standard CI (e.g. Jenkins, Bamboo, Cruise) and automated build (Koji) as the only route to get new packages into the CC

    ...then automate the testing

    e.g. suitably tagged RPMs are automatically deployed to /test nodes.

  • 8/2/2019 IT Tech Agile Config

    26/33

    Modernising our processes

    We're working out which of the many puppet / git models suits us

    Code review, sign-off and automated notification for changes that will affect multiple clusters

    How to automate the test/preprod/prod advancement

    Pre-req is flexible monitoring and alarming

    You need to trust that an automation failure will be signalled to you

    Script-generated emails are banned

    Need good monitoring to hang these notifications on

    Integrate components rather than use emailScript

    Script-generated tickets (where your value in the process is your password) are banned
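One way to picture automated test/preprod/prod advancement gated on monitoring is the small sketch below. It is an assumption-laden illustration, not the project's design: the `next_stage` function and the boolean `monitoring_ok` input are hypothetical, standing in for whatever the monitoring system would actually report.

```python
STAGES = ["test", "preprod", "prod"]


def next_stage(current, monitoring_ok):
    """Advance a change one stage, but only if monitoring reports it healthy."""
    if not monitoring_ok:
        # Stay put: the failure is surfaced by monitoring, not by email.
        return current
    index = STAGES.index(current)
    return STAGES[min(index + 1, len(STAGES) - 1)]


print(next_stage("test", monitoring_ok=True))      # moves on to preprod
print(next_stage("preprod", monitoring_ok=False))  # held back in preprod
```

The gate only makes sense if the health signal can be trusted, which is why monitoring and alarming are listed as the prerequisite.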

  • 8/2/2019 IT Tech Agile Config

    27/33

    Current tool snapshot (liable to change)

    [Diagram of the tool chain; box labels:]

    Jenkins

    Koji, Mock

    Puppet

    Foreman

    AIMS/PXE

    Foreman

    Yum repo

    Pulp

    Puppet stored config DB

    mcollective, yum

    JIRA

    Lemon

    git, SVN

    OpenStack Nova

    Hardware database

  • 8/2/2019 IT Tech Agile Config

    28/33

    Preliminary timelines

    Year: 2011
    Actions: Agree overall principles

    Year: 2012
    Actions: Prepare formal project plan; Establish IaaS in CERN CC; Production Agile Infrastructure; Monitoring Implementation as per WG; Migrate lxcloud; Early adopters to Agile Infrastructure

    Year: 2013
    What: LSD 1; New Data Centre
    Actions: Extend IaaS to remote CC; Business Continuity; Support Experiment App re-work; Migrate CVI; General migration to Agile with SLC6 and Windows 8

    Year: 2014
    What: LSD 1 (to November)
    Actions: Phase out Quattor/CDB/

    Aggressive schedule if we are to make it for new data centre

  • 8/2/2019 IT Tech Agile Config

    29/33

    Initial steps

    Decide on tools now and integrate them together to make a production setup (Q1)

    We can still change... but we're starting to commit

    Looking for early adopters (from Q1)

    In particular to understand the people-scaling / ACL issues: which of the git/puppet models is best?

    e.g. PES/OIS services: batch/VMs, JIRA, Drupal

    https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure/EarlyAdopters2012

    Help with integration / coding

    Help with ideas

    Help with building the task list

  • 8/2/2019 IT Tech Agile Config

    30/33

    Summary

    IT has started a new project to move our infrastructure to a new toolset based around industry-standard open source components

    Puppet for the core configuration tool

    Better integration between components

    Use of more modern software processes to aid deployment

    Better monitoring

    Engage with the community rather than re-implement

    Overall project scope is wider (future IT forums)

    Cloud and virtualisation, improved monitoring

    Please get involved early and give feedback

    https://twiki.cern.ch/twiki/bin/view/AgileInfrastructure

  • 8/2/2019 IT Tech Agile Config

    31/33

    Backup slides

  • 8/2/2019 IT Tech Agile Config

    32/33

    Code ownership model

    The "sticky hands" support model (you touched it last!)

    We're working out an FE-based model where

    Code is owned by the related service Functional Element

    Ownership confers the responsibility to maintain a decent standard config for the computer centre, and the responsibility to roll out new versions of that code/config

    Patches from interested people can be offered, and if you take them, you support them

    not the guy that gave you the patch

  • 8/2/2019 IT Tech Agile Config

    33/33

    mcollective and messaging

    mcollective is a notification framework

    A mix of CERN's not.d / wassh

    It broadcasts instructions to run pre-canned tasks to nodes selected by a filter, collects the results from the nodes, then renders that result for the CLI

    e.g. "restart all my webservers", "do a puppet run now"

    It requires a messaging framework that all nodes subscribe to (to receive the notification)

    Typically: ActiveMQ or RabbitMQ

    Both OpenStack and our (future) monitoring system need a CC-wide messaging system as well
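The broadcast, filter, collect and render cycle described above can be sketched in plain Python. This is an in-memory toy, not mcollective's API: the `Node` class, the `broadcast` and `render` functions and the node names are all invented, and a real deployment would deliver the request through the message broker rather than a function call.

```python
class Node:
    """Toy node that can run a pre-canned task and reply."""

    def __init__(self, name, role):
        self.name = name
        self.role = role

    def handle(self, task):
        return f"{task}: done"


def broadcast(nodes, node_filter, task):
    """Send the task to every node matching the filter and collect the replies."""
    return {node.name: node.handle(task) for node in nodes if node_filter(node)}


def render(results):
    """Format the collected replies for display on the command line."""
    return "\n".join(f"{name}  {reply}" for name, reply in sorted(results.items()))


nodes = [
    Node("web01", "webserver"),
    Node("web02", "webserver"),
    Node("db01", "database"),
]

# "restart all my webservers": select by role, run the task, show the replies.
results = broadcast(nodes, lambda node: node.role == "webserver", "restart httpd")
print(render(results))
```

In the real system the filter is evaluated by each node after it receives the message, which is what lets one request reach an arbitrary subset of the computer centre.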
