PuppetConf 2015: Practical CI/CD with Puppet Code and Configuration

Welcome everyone, let’s get started. Today we’re going to talk about some practical tips and techniques that we’ve found for doing CI/CD with Puppet.

Both principal engineers at TWC on the OpenStack team. We both have varied backgrounds ranging from software development to operations to IT and engineering.

●  Our OpenStack team started with four people about two years ago
●  The team is a mix of backgrounds, traditional IT folks and traditional developers. This makes for an interesting mix of skillsets
●  At the beginning we only had a small amount of automation experience. We had 1 person who knew Puppet.
●  We got OpenStack into production and then quickly realized that our lack of process was hurting delivery speed
○  Started working on a plan for CI/CD last summer

●  So to give you an idea of how far we’ve come, this is the requisite slide of stats.

●  Don’t need to read this to you, our CI/CD efforts started pretty small, but have grown over the last year.

●  Background on our Puppet deployment:
○  Puppet open source, 3.7 release
○  Puppet master per environment per data center (infra, dev, staging, prod)
■  We don’t really use puppet directory environments: our only environment is production and the directory for that has two symlinks in it

○  We don’t use PuppetDB for a lot, mostly for monitoring configuration and a few other minor things

○  Using hiera for managing differences across dev, staging and prod environments (a rough hiera.yaml sketch follows this list)

○  Using hiera-eyaml backend for encrypted secrets such as passwords, SSL certs, etc

○  We use r10k to manage puppet module installation and versioning
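For context, a Puppet 3-era hiera.yaml for this kind of layered setup might look roughly like the following; the hierarchy levels and the %{::env} fact are illustrative, not our exact configuration:

# Hypothetical hiera.yaml: yaml for plain data, eyaml for encrypted secrets
:backends:
  - eyaml
  - yaml
:hierarchy:
  - "node/%{::fqdn}"
  - "environment/%{::env}"     # dev / staging / prod overrides
  - common
:yaml:
  :datadir: /etc/puppet/hieradata
:eyaml:
  :datadir: /etc/puppet/hieradata
  :pkcs7_private_key: /etc/puppet/keys/private_key.pkcs7.pem
  :pkcs7_public_key: /etc/puppet/keys/public_key.pkcs7.pem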

This is a very high level diagram that shows how changes start in our dev environments and eventually end up deployed into production.

We will cover some of this in more detail as we talk. Each of these yellow boxes here represents an OpenStack region, and you can see we have multiple regions in Staging, Production, and some dev environments.

This is what a pair of OpenStack regions looks like. Regions are usually used to represent different data centers.

We’re not going to spend a lot of time talking about OpenStack in this talk, but we just wanted to show you that we’re deploying applications that aren’t just simple web servers with a database server behind them.

If you do want to know more about puppet and openstack, there’s a talk tomorrow at 1:30

Our OpenStack deployments are fairly complex:
Several multi-master MySQL database clusters, in some cases spanning data centers
Multiple RabbitMQ clusters
Many services running on each node type (control plane servers have over 20 OpenStack services running on them)
For the small “Compute” and “Storage” boxes, there are a lot more than 3 of each of those
There are a lot of other node types for monitoring, hardware load balancing, etc that aren’t shown here
There are also a lot of cross-node dependencies here that we can’t really represent and keep this readable.
So next Matt is going to talk about how we deal with this complexity in our development environments.

4 min - Matt
●  How do you develop & test your changes before committing them?

○  sshing into the puppet master is not a great solution and doesn’t scale well

○  our solution is virtual development environments

●  What are virtual development environments? These are actually OpenStack VMs running the various OpenStack node types and a puppet master.
●  Because we like playing on the hard setting, we run our virtual development environments on top of our production cloud.
○  One benefit of that is we’re not relying solely on monitoring to tell us if something is wrong, we’re all using it all the time.

A bit more about these environments... We build these nodes with the exact same tool chain and process we use to build real production hardware; it helps exercise the tools.
●  Every team member has their own environments

○  Team member can pick and choose which node types they need and how many

○  Can have multiple environments, which can be connected with a router to simulate multiple-DCs

○  Share environments between team members to help diagnose problems

●  So once someone has developed new puppet code in an environment, what’s the next step?

The team member commits the change locally, and then they need to submit the change for review so other team members can look at it and automated testing can churn away on it.

Gerrit may be new to some of you. Gerrit is a code review tool that we use; it was originally developed for use by the Android project and is now used by a number of open source projects including OpenStack and Wikipedia.

For us, all changes go through Gerrit and almost all repositories in Gerrit have automated testing.

People joining our team either already know how to use Gerrit, because they’ve contributed to the OpenStack project, or they’ll need to learn how to use it so they can contribute upstream.

We see a lot of benefits from doing code review:
Code quality goes up any time you have someone else even skim over a change that you’re proposing
This is also a good opportunity for knowledge sharing and mentoring
One thing we didn’t really anticipate is that it provides a better sense of shared ownership of your configuration and code
If you make a change and 3 other people review it, and then it has a problem, it’s really hard to point fingers
One nice feature of Gerrit is that it’s very easy to integrate pre-merge testing with Jenkins
This means that you can prevent merging changes to your master branch that don’t work
We’ll talk about that more later

So once a change in our internal repos is approved and merged, what’s the next step?

8:30 min - Clayton
●  No matter how you install your puppet modules, you want to have a stable, reproducible environment.
●  To achieve that you’ll need to do some sort of version pinning for those modules.
●  How do you pin module versions without it being painful?

○  We’d like to talk about an approach we think is pretty straightforward and works well with r10k or anything else that can read a Puppetfile

This should give you an idea of what our git repo layout for Puppet looks like. We have a single top-level repo we call puppet-config that controls exactly what gets deployed and how it’s configured.
The master branch contains all the hiera data for configuration and Puppetfiles for specifying the puppet modules we use.
For those that aren’t familiar, a Puppetfile is a Ruby-based file format that tools like r10k and librarian-puppet use for specifying where to retrieve modules from, and what tag or commit to use when installing them.

You can see here that the Puppetfile refers to specific git repos that each contain a single puppet module.

During deploys r10k reads the Puppetfile, clones new repos if needed, and checks out new revisions in repos that are already cloned.

Those of you that aren’t asleep yet or working an outage may have noticed we also have this Puppetfile.yaml thing, the wacky thing we came up with.

To refresh your memory, this is what a normal Puppetfile looks like. In our case, everything is installed from git, and everything is locked to a tag or commit. As mentioned before, we don’t put branch names in here.
Also, a tag or commit is more efficient, since r10k can check whether those are up to date without network access.
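A pinned entry looks roughly like this (the module name, URL and hash here are made up, not our real values):

# Everything installed from git, pinned to an exact tag or commit
mod 'ntp',
  :git => 'https://git.example.com/puppet/puppet-ntp.git',
  :ref => '4f8e3c1a9bd27c55e0f6a1b2c3d4e5f67890abcd'   # commit hash or tag, never a branch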

This is what a Puppetfile.yaml entry actually looks like. This is an approach we use for all of our internal modules.
You’ll see here that we have a “cirrus” module and there are two highlighted sections, with the Git URL and the commit hash. These are the only fields that are actually used when doing a deploy; other fields are all informational.
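The slide showed something along these lines; this is a simplified sketch with made-up values, and the field names are only meant to suggest the shape, not the exact schema:

cirrus:
  git: https://git.example.com/puppet/cirrus.git      # used during deploys
  ref: 4f8e3c1a9bd27c55e0f6a1b2c3d4e5f67890abcd       # commit hash, used during deploys
  author: jdoe                                        # informational only
  subject: Add keystone v3 support                    # informational only
  date: 2015-09-28                                    # informational only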

For multiple modules, we just have more entries in this file in addition to cirrus.
Before we were using Gerrit, we used to have to commit a change in a puppet module, tag it, then go back and change the reference in the Puppetfile.
That was really tedious, and when we were moving to Gerrit, we realized that having to put up two code reviews for every change was going to be miserable.

When we moved to using Gerrit, we put this new file in place. Now when a change is merged in Gerrit, a Jenkins job is triggered to update this Puppetfile.yaml file with the new commit hash.
This has made pinning internal modules to a specific commit completely painless.
A side effect of this is that for every puppet module change, there is a corresponding commit in the top-level “puppet-config” repo.
The Jenkins job that updates this file also puts a lot of the metadata shown here in the commit message.
This allows us to understand what has changed across all modules by just looking at the commit history of the puppet-config repo.

If you look at the top of our Puppetfile, you get an idea of what’s going on here. The comment says “This loads the YAML file that Jenkins maintains of the latest commits approved through Gerrit”.
Since the Puppetfile is just Ruby, we can embed code here to extend the functionality.
This code checks to see if a Puppetfile.yaml file exists, and if so, it loads the data in that file and imports all the entries into the Puppetfile.
This approach requires no special support from any tool that reads a Puppetfile. Since the file is just YAML, it’s easy to parse, update, etc.
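In sketch form (simplified, reusing the made-up field names from the Puppetfile.yaml example above), the embedded Ruby amounts to something like:

require 'yaml'

# Load the YAML file that Jenkins maintains of the latest commits approved
# through Gerrit, and turn each entry into a regular Puppetfile mod() call.
yaml_file = 'Puppetfile.yaml'
if File.exist?(yaml_file)
  YAML.load_file(yaml_file).each do |name, info|
    mod name, :git => info['git'], :ref => info['ref']
  end
end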


●  Another TWC employee, Phil Zimmerman, gave a great talk last Puppetconf about how they handle a similar problem. He’s since open sourced a tool called Reaktor that covers some of the same space as our approach

○  Focused heavily on creation of new dynamic environments by modifying the Puppetfile and running r10k automatically for you

○  Provides an opinionated workflow that will probably work well for many people

○  Our approach was inspired by Phil’s work. We already had Ruby code in our Puppetfile, so we felt like having something easily parsable was important.

●  This summer Camptocamp released their ‘puppetfile-updater’ tool.
○  Provides Rake tasks that allow you to automate updating your Puppetfile.

14:30 minutes - Clayton
How do you have confidence that your code won’t cause deployment issues before merging? We’ve found that automated pre-merge testing is one of the best ways to cut down on deployment issues.

●  If you’re doing code review using pull requests or something similar, you’re generally going to be waiting for a human to review the code. In this situation we think it makes a lot of sense to put effort into providing the reviewer with as much information as possible.

●  Every time a new change is put up for review, our pre-merge test jobs trigger
●  Most of our tests are pretty simple things like puppet-lint, syntax checks, unit tests, etc.
●  One thing we do that we think is a little unusual is automated puppet catalog compiles
●  We feel like this is a really great technique and want to tell you more about how we use it.

First some background:
○  When you run puppet, the Puppet master builds a catalog of all managed resources on the target node
○  The inputs to this process are your puppet code, your hiera config and the facts from the node
■  If we have all three of these, we can reproduce this catalog compile in any environment.
○  We already have the code and the hiera config in git
○  To get the facts, we have an automated job that collects Puppet fact information from all nodes several times a day and checks that into git
○  This gives us the ingredients to generate catalogs for any node, any time, from anywhere.
●  When a change is proposed, we build catalogs for one of every node type in every environment (dev, staging, prod).
○  Not just a single catalog: it generates a view of what the catalog would be before and after the proposed change. We then take these two and we diff them using Zack Smith’s catalog diff puppet module (a rough sketch of the per-node mechanics follows this list).
○  Jenkins then takes the output and posts a comment on the code review that summarizes which nodes would have changes, and a link to more details.
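Per node, the mechanics look roughly like this; the node name is made up, and the exact invocation depends on your Puppet version and the catalog_diff release (this sketch assumes open source Puppet 3.x with the checked-in facts in place):

# Compile the node's catalog from the current master branch...
puppet master --compile control01.prod.example.com > before/control01.json

# ...then check out the proposed change and compile again
puppet master --compile control01.prod.example.com > after/control01.json

# Diff the two catalogs with zack/catalog_diff
puppet catalog diff before/control01.json after/control01.json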

So I know, and you know from the beginning of the talk, that we’ve got quite a few services that use MySQL, and we configure them using the puppet-mysql module

What happens if we accidentally delete the puppet-mysql line from our Puppetfile?
The catalog compile job fails, and we have Gerrit configured so that a reviewer can’t just ignore this and merge the change without seeing it.
That is great, because we all make stupid mistakes, maybe not always this stupid, but we caught this mistake before it was merged.
So if this mistake was less obvious, how would we know what the issue was? Let’s look at the Jenkins link given with the FAILURE message.

The Puppet catalog compile output looks something like this for one of the failed nodes

It’s going to give you the same error you’d get if you were trying to run puppet agent to do a deploy

Since the first message is that it cannot find mysql::server, we get a really good idea where to start looking

Another example that is a little more interesting: about six months ago we stopped using a piece of software that required us to use MongoDB (not a coincidence).
We forgot to remove the MongoDB puppet module. I was pretty sure that we weren’t using it anymore, but we wanted to be sure.
Put up a code review to remove it and what you see here is the result.
Two important points here:
Our catalog compile was successful, so we know this won’t cause any failed Puppet runs because of something still using Mongo
Our diff shows us that 0 nodes were changed, so we know there should be no change in functionality.
We know we can merge this change and everything will be a little cleaner than when we started.

We originally used an “OpenStack Cloud in a Box” setup that came with tons of extra hiera config and puppet modules that we never used and didn’t really understand.
This has been a key tool for being able to clean up that giant mess. We’ve used this to remove modules, and clean up thousands of lines of hiera config and puppet code that wasn’t needed or was poorly organized.

This has given the people on our team with the most Puppet experience a really powerful tool, but it’s also allowed people with little or no Puppet experience the ability to propose a change and then see what will happen. It allows anyone on the team to put together a what-if scenario and see what the result is.

This tool allows a new person on the team to propose, for example, a change for production-only and then see if it actually does what they think it might.

We try to keep the puppet modules we use up to date, but with over 90 puppet modules in use, we don’t have the time to dive in and read every code change for every upgrade.
This is another area where catalog diffs are incredibly useful: you need to know what *actually* changed. A change log is great, but it may not exist, may not be complete, etc.

This is the output from Jenkins after we updated the Puppetfile to start using the latest released version of the puppet-ntp module

The catalog compile is successful, which is a good sign.
You’ll see that 65 nodes are showing as changed. We don’t do catalog compiles for every node, just one of every node type in each data center and each environment. So for example, one monitoring server in staging west, one in staging east, and then one each in prod east and west. That adds up to 65 nodes right now.

At the bottom you see a link to details about the changes, let’s see what that looks like:

We’re going to do a small demo; you can type in the tinyurl link at the bottom to follow along, or if you just want to try out the tool.

For viewing catalog diffs we’re using Camptocamp’s catalog diff viewer. This is relatively new, released a few months ago.
First thing you can see is that 65 nodes have changes, none failed, etc.
No new or removed resources.
Two copies of the config file diff may be related to the concat module.
We use the same NTP servers in all environments, but if we used different ones, that would be reflected here also.
On the catalog-diff-viewer project page, there is also a demo where you can upload your own catalog diff files and try it out for yourself.
https://dl.dropboxusercontent.com/u/23807/puppetconf-2015/example1/index.html
http://tinyurl.com/puppetconf-2015-cicd1

●  This test can verify that the combination of Puppet code and config will run at all. Finding out your puppet code throws a fatal error when you’re trying to deploy to production is no fun.

●  This provides more detail than simple syntax checks, and it can test things that rspec alone cannot. It also is essentially free in terms of effort once you set it up.
○  No extra code to write for each change
○  New puppet users don’t need to immediately invest in learning rspec, which lowers the barrier to entry for new developers.
●  Faster than integration testing
○  This test takes about 3-4 minutes to run catalogs for 65 nodes. Integration tests can take that long to even get started.
●  What it doesn’t do:
○  If you have changes in the Ruby code for custom functions, custom types, etc, this won’t help you evaluate that
○  It also won’t help you determine what services might get restarted, or if packages will be installed or upgraded, etc
■  Those things depend on the existing environment. Puppet only knows the desired state.
■  We’re planning on enhancing our integration testing to address this, and we’ll be talking about integration testing in a few more slides

So you may be thinking “This sounds awesome, I can stop writing rspec/beaker/etc tests!”

Not really, but these are complementary.
Rspec is great for modules that have complex interactions, for example because of multiple operating system support, or ones that provide a lot of flexibility to the end user.

To be honest, we don’t write a lot of rspec tests for our internal modules, but a lot of that has to do with having a very homogenous environment

We do write rspec tests for nearly everything we contribute to other projects, assuming they already have tests.

We’re both core reviewers on the OpenStack puppet modules, rspec tests are required for all changes and for good reason

These modules support multiple operating systems and are very mix and match in their configuration

However, we don’t see much value in writing rspec tests that just repeat what you’ve already stated in your puppet manifests.

Lastly, we don’t use beaker, but we use other tools that provide us with similar functionality. This is mostly because beaker was pretty new and raw when we started doing integration testing. If we were starting over, we’d probably look closely at beaker.

28:30 Matt

How can you test your deployments and know what the effect will be on your nodes?

So what is our integration test story?

We’re doing automated builds of our core node types: puppetmaster, load-balancer, identity, control, compute.
We’re mainly testing “Can you still build these node types?”
Keeps us from breaking node rebuilds. Since Puppet is building these nodes, it exercises features like our identity service, service provisioning, etc, so it’s still a really useful check.

You don’t want to find an issue with a node rebuild when prod hardware dies.

Originally it took about an hour to build this environment; we’ve gotten that down to 35 minutes with some optimizations we’ll talk about on the next slide.

These integration environments are transient: each gets built anew for each test the same way we’d build a new production environment.
Nodepool is an OpenStack-specific tool for pre-provisioning Jenkins slaves and destroying them automatically after a job completes. There are alternatives to nodepool like Beaker, Cloud Formation, etc.

The most important benefit has been the ability to catch breaking changes post-commit, which isn’t ideal, but it’s better than nothing.

We can go back and see what the last commit that worked was rather than trying to guess.

One of the reasons we didn’t do integration testing earlier was because it would take 1.5 to 2 hours to build an environment serially due to cross-node dependencies when we did it with Vagrant

We have 5 nodes to build. It takes us about 20 minutes to build a puppet master, and another 30 minutes to build our most complex node (which also depends on 3 other nodes already being built).
Given this dependency chain and the time involved, how do we build this whole environment in 35 minutes? Parallelism!

How does parallelization apply here? How can we get puppet running on all nodes once the master is up without requiring multiple puppet runs?

Simplest cross-node dependency is that the puppet master must be up before any clients talk to it. Our script for bootstrapping client nodes just calls curl in a loop against the puppet master right before trying to run puppet agent. This allows us to spawn all the VMs at once and get the base packages installed.
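In sketch form (the hostname is made up), that bootstrap wait is just:

# Wait for the puppet master to answer before the first agent run
until curl -sk https://puppetmaster.dev.example.com:8140 >/dev/null; do
  sleep 10
done
puppet agent --test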

But the key piece was inspired by something most of you running PuppetDB have probably seen.

PuppetDB has a custom provider that allows it to retry an HTTP request every x seconds for up to y length of time before giving up.

We took this and created our own TCP and HTTP validation functions based on that code

This allows us to use validator resources to block puppet runs until the dependency is met. Essentially we can start our puppet runs on all nodes at once (once the puppet master is up) and they will block until their dependencies are met, not fail.

Another puppet-openstack contributor Yanis Guenane had the same idea and his implementation is now part of the in-progress puppet-healthcheck module.

http_conn_validator is not merged yet. We’ve switched over to a forked version of his code and have it in production now; you can find that at the twc-openstack GitHub org.
The health check is useful in other places for us.

We have services that we manage with puppet and custom providers that depend on those services being up:

●  We have an app-server that exposes a REST API and we know it takes 45 seconds to start up because it’s written in Java
●  We have a custom provider that will provision services in the application, but clearly it needs the REST API to be responding to do that.
●  A few issues:
○  First, on initial server provisioning it will probably take two puppet runs to provision the app service, because the app won’t be up when the custom provider tries to provision it
○  Second, if the app server has to be restarted, the puppet run will probably fail even if the service is already provisioned, because the application may not be responding yet when the provider tries to check the provisioned state.
○  Lastly, this is a cause of intermittent failures: sometimes the app won’t get restarted, and sometimes when the app is restarted it will be 45 seconds between when it is restarted and when puppet tries to use the app_service_provisioner. In those cases you won’t see an error.
●  To solve this we put the http_conn_validator resource between the two (sketched below). This will always ensure that the given URL is responding before trying to provision services against it.
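Here is a hedged sketch of that pattern in Puppet; the parameter names follow the conn_validator style but may differ from the module version you use, and app_service_provisioner stands in for our custom type with made-up values:

service { 'app-server':
  ensure => running,
}

# Block the rest of the run until the REST API actually responds
http_conn_validator { 'app-server-api':
  host    => 'localhost',
  port    => 8080,
  timeout => 120,                       # headroom for the ~45 second Java startup
  require => Service['app-server'],
}

# The custom provider only fires once the validator has succeeded
app_service_provisioner { 'example-service':
  ensure  => present,
  require => Http_conn_validator['app-server-api'],
}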

We feel like this area has huge potential. Some of the things we have either in progress or planned right now are:
Add the ability to build a staging- or prod-like environment with a specific git tag, and then run a deploy to a new tag
This will allow us to catch things like service restarts and package upgrades that are dependent on the existing environment.
Our plan is to do some log processing after deploying to a newly built environment to summarize changes.
Integrate post-deploy checks - we have a test suite we run post deploy now, but it’s not automatic yet
To be honest, we thought we’d have most of these things done before this talk, and some of these are nearly done.

36:30 min - Matt
●  Okay, let’s talk a bit about deployments and CD


I’d like to talk briefly about how we do deployment in general before we get to continuous delivery, since they’re closely linked

We plan for a normal, should-be-boring deploy to staging and prod at least once a week.

These are deployed from a specific tag.
First deployed to Staging, and after validation & testing they’re deployed to production.
So how does this deployment process actually work?

●  Ansible is used to orchestrate configuration management changes via Puppet (a rough sketch follows this list)
○  This includes code checkouts and running r10k
○  This handles all node ordering and pre and post health checks on nodes that need it; I’ll cover some examples of this in the following slides.
●  Jenkins is used to drive Ansible scripts
○  Provides access control, auditing & allows multiple people to view deploys while they’re going on and afterwards
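To give a feel for the shape of this, here is a hypothetical, heavily simplified play; host groups, paths and commands are made up and this is not our actual playbook:

# Update the puppet master first, then roll puppet across nodes in order
- hosts: puppetmaster
  tasks:
    - name: check out the requested puppet-config tag and deploy modules
      command: r10k puppetfile install
      args:
        chdir: /etc/puppet/puppet-config

- hosts: control
  serial: 1                                    # one control node at a time
  tasks:
    - name: pre-deploy health check
      command: /usr/local/bin/node-health-check
    - name: run puppet
      command: puppet agent --test
      register: puppet_run
      failed_when: puppet_run.rc not in [0, 2] # exit 2 = changes applied successfully
    - name: post-deploy health check
      command: /usr/local/bin/node-health-check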

●  We are currently doing automated deployment into our shared dev environment if the integration test passes.

●  Why is this useful? It used to be that deployment to our shared development environment was entirely manual.

○  People would deploy when they needed to test something there, but otherwise it wasn’t guaranteed that a change would get deployed there before it even went to staging!

●  Our automation captures a change list so we know the full list of changes being deployed.

●  Anytime a deployment is started or completed its announced in our team chat with a clickable link to watch the deploy and see what’s changing.

●  Like everyone else, we want our deploys to be more boring.
○  Right now there is too much ceremony around our deploys - manual review of changesets
○  We want to move to smaller patch sets
■  Less risk
■  Fewer issues with people trying to make a deadline and deploying half-baked changes.
●  The solution requires a few things:
○  more automation, including validation
○  more tooling
●  We’d like to extend the CD model into staging and prod rather than just dev.
○  Before we do that we need to have better reporting on what will be changing in the deploy.

●  This one is pretty obvious, but if you have multiple tests that take a minute or two to run, run them in parallel, instead of in the same CI job.

○  Another option is to break your jobs into multiple parts, or change them to use multiple cores when possible

○  We use Jenkins Matrix jobs to break our catalog compiles out into multiple slaves

○  We use xargs parallel mode to parallelize our catalog compiles across multiple cores

○  We contributed code to the catalog-diff module to use the ‘parallel’ gem when available to use multiple cores

●  Use disposable (one-time use) slaves for CI work
○  This allows you to use eatmydata to disable all synchronous writes
○  With ext4 you can also do things like noatime, nobarrier, and disable journaling
○  Combined, these made our integration tests about twice as fast.

●  Build a cache of your puppet modules (see the sketch after this list):
○  If you have CI jobs that need to check out all of your puppet modules, that can be really slow. For us that takes almost 4 minutes by itself.
○  We have a jenkins job that runs once a day that runs r10k to check out all the modules we use. We create an archive from that and then have CI runs wget that file and extract it *before* running r10k, so r10k only has to update new and changed modules.
○  With 90 modules, this drops our r10k deploy time to about 5 seconds
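Roughly, the daily cache job and the CI-side restore look like this (the URL and paths are illustrative):

# Daily Jenkins job: build the module cache
r10k puppetfile install                       # checks out all ~90 modules into ./modules
tar czf puppet-modules-cache.tar.gz modules
# ...publish puppet-modules-cache.tar.gz somewhere CI slaves can fetch it...

# Start of each CI run: restore the cache, then let r10k do a fast update
wget -q http://ci-artifacts.example.com/puppet-modules-cache.tar.gz
tar xzf puppet-modules-cache.tar.gz
r10k puppetfile install                       # only new and changed modules are fetched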

We just wanted to thank the people here for the projects they’ve worked on. We’ve worked with each of these people and sent them pull requests; they’ve all been great to work with.

Today we’ve shared our entire CI/CD process for our OpenStack deployment. This includes getting changes in puppet and openstack code and puppet config through our entire pipeline. Starting at development in our virtualized dev environment to submitting the code to Gerrit and undergoing code review and pre-merge testing. And ending with integration testing and deployments through our shared dev, staging, and prod environments. Hopefully at least some of this process would be useful to you when you’re planning your deployment work. Thanks for your time and if there are any questions ask now or come find us after the talk!