what do the "cool kids" know about devops?

© 2014 IBM Corporation

Session: 2427
What the Cool Kids are Doing with DevOps
Bill Holtshouser
Senior Strategist, Mobile, DevOps, Cloud
IBM Rational

Please note…

IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.

Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.

The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.

Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.


Introduction

• This session is based on an examination of a series of “born on the web” companies to see what common patterns and other learnings can be derived from their DevOps journeys, with the goal of extracting guidance for IBM’s clients

• We used only publicly available information such as published conference presentations, company blogs, videos, news stories and white papers

• Important: Everything here is strictly our opinion; none of the companies mentioned reviewed or endorsed these opinions in any way!


Key Takeaways

• “Born on the Web” startups like Etsy, Netflix and others have been leaders in applying a DevOps approach to SW development and delivery – but they are essentially built from the ground up to do so

• These companies display numerous common DevOps-related traits in the areas of Culture, Organization, Practices, Automation and Measurements

• Although your enterprise won’t be able to replicate all aspects of these “cool kid” companies and how they have applied DevOps (nor should you even try), there are some important learnings from them that can inform your own DevOps approach


Does this story sound familiar?

One way to address the issue…


Believe it or not, Dev and Ops weren’t always separate

“Back in the dawn of the computer age, there was no distinction between dev and ops. If you developed, you operated. You mounted the tapes, you flipped the switches on the front panel, you rebooted when things crashed, and possibly even replaced the burned out vacuum tubes. And you got to wear a geeky white lab coat…”

“Dev and ops started to separate in the ‘60s, when programmers dumped boxes of punch cards into readers and “computer operators” scurried around mounting tapes in response to IBM JCL. The operators also pulled printouts from line printers and put them in labeled cubbyholes, where you got your output filed under your last name.”

– John Allspaw, Etsy


So…just who are these “Cool Kids” anyway?


Sidebar: Continuous Delivery is more than just “fast Continuous Integration”

Continuous Delivery
• Websites, SaaS offerings
• Multiple pushes to production per day
• Highly decoupled, independent feature sets
• Single image/single stream
• New practices and patterns

Continuous Integration
• Traditional applications, appliances, mobile apps, Web APIs
• Delivery to production every few days to weeks
• Coordinated releases, multiple version streams
• Established Agile practices

Continuous Engineering
• Complex embedded systems
• Complex product release and update cycles
• Management of variants and versions
• Engineering practices


Five essential elements of “Cool Kids” DevOps success

• Culture
• Organization
• Practices
• Automation
• Measurement


Cool Kids and Culture - key learnings

• Trust leads to an acceptance of “reasonable” risk
  – Organization, tools, automation, instrumentation can all reduce risk
• Risk = PROBABILITY of Error x COST of Error
  – Not all risks are created equal; zero risk is unattainable
  – Cost depends on Time to Fix
• Learning from mistakes > blame
  – …but there is still Karma: repeated mistakes may lead to loss of privilege

At Etsy, employees have a high degree of creative freedom and, when things go wrong, accountability without blame. “We actually trust people,” CTO Chad Dickerson says. He calls the approach a “radical decentralization of authority.” – Inc. Magazine, 12/13


• ALL exhibit a high degree of delegation
  – …which leads to velocity
• In order to delegate, the Cool Kids trust… but verify
  – E.g. via instrumentation, measurement

Re-defining the attitude towards “failure”


• Netflix allows failure to happen continuously and wants its SW to be able to deal with it; in fact, it takes steps to encourage errors (Simian Army)

• In effect, they treat “failure” as simply another STEP in the SW development process

http://techblog.netflix.com/2011/07/netflix-simian-army.html

Cool Kids and Culture – more learnings

• Adopt an “Ops First” design mentality
  – Don’t build what you can’t manage
• Recognize the importance of build
  – They don’t just give the build system to the “worst programmer” or newest hire, but establish a focused role


Bottom line: a culture of trust is required


[Diagram: Rapid delivery requires low risk. Contributing factors: small feature sets, independent services, progressive exposure, rapid feedback, reliable rollback, high delegation & trust.]

Risk = Probability of error x Cost of error
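The risk formula on these slides can be made concrete with a small worked sketch. The numbers below are invented purely for illustration, and the way cost is modelled (a hypothetical cost per minute of breakage multiplied by time to fix) is an assumption layered on the slide's remark that cost depends on Time to Fix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRisk:
    """Expected cost of shipping one change: probability of error x cost of error."""

    probability_of_error: float  # chance the change breaks something, 0..1
    cost_per_minute: float       # hypothetical business cost while broken
    minutes_to_fix: float        # time to detect, then roll back or repair

    @property
    def cost_of_error(self) -> float:
        # Assumption: cost grows linearly with the time the problem stays live.
        return self.cost_per_minute * self.minutes_to_fix

    @property
    def risk(self) -> float:
        return self.probability_of_error * self.cost_of_error


# Illustrative values only: a large, infrequent release versus a small push
# that is easy to detect and roll back.
big_release = ChangeRisk(probability_of_error=0.20, cost_per_minute=1000.0, minutes_to_fix=240.0)
small_push = ChangeRisk(probability_of_error=0.02, cost_per_minute=1000.0, minutes_to_fix=5.0)

print(f"big release risk: {big_release.risk:,.0f}")
print(f"small push risk:  {small_push.risk:,.0f}")
```

Under these assumed figures, the small push carries a far lower expected cost. It is less likely to fail, and rapid feedback plus reliable rollback shrink the time to fix.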

Adrian Cockcroft of Netflix on Culture

“Culture is very hard to create or modify but easy to destroy. This is because everyone has to buy into it for it to be effective, and then every manager has to hire only people who are compatible with the culture, and also get rid of people who turn out not to fit in, even if they are doing good work.

So the short answer is: start a new company from scratch with the culture you want, and pay a lot of attention to who you hire. I don't think it is possible to do a culture shift if there are more than a roomful of people involved.

Even with a roadmap and a guide, you probably won't be able to follow this path if you are in a large established company. Your existing culture won't let you.”

http://perfcap.blogspot.com/2012/03/ops-devops-and-noops-at-netflix.html


Organization follows Culture

Traditional Culture
• My priority is to deliver code… fast.
• My priority is to keep the site up and running.

DevOps Culture
• We’re all on the same team! Want some pizza?


Organization follows Culture

• Conway’s Law (you build what you are) applies
  – …also applies to how you’re organized
• Feature teams, not platform teams
  – Small teams: “two pizza” rule
• Organize for an “end-to-end” responsibility for delivery
  – Positive approach to fixing mistakes – learning, not “blame and shame”
• Many common patterns are seen in QA…
  – Shared responsibility across a team, everybody does QA, or co-located QA
  – Small Quality Engineering CoE team provides common tools/practices
  – But NOT a separate/antagonistic QA org (“clean up your own mess”)
• Small DevOps “toolsmith” teams
  – A.K.A. Systems Release Engineering
  – Provide common tools & processes for automation, logging, monitoring…
  – There to help, NOT to do it for you
• Finally - no “throwing it over the wall”…


…basically, you need to be getting away from this

WORKED FINE IN TEST

OPS PROBLEM NOW

Practices that “make perfect” for the Cool Kids

• “Light” planning and specs
  – Etsy high level planning done in 60 day chunks and two week periods; specs kept very light – no more than what is required
• Cut the cord with traditional release process
  – Developers coordinate and drive the release of their own code without need for a centralized release cycle
  – Netflix goes farther than most: “NoOps”
• Speed, speed, speed
  – It’s all about rapid deployment; some deploy updates to their site 25x per day
• Progressive rollout of new features, “dark” releases
  – Concept of “config flags”: new features are there but not yet enabled, then launched with a simple switch in the code
• They talk about it…a LOT
  – Lots of internal and external forums / blogs among the Cool Kids
  – Example: Etsy “Code as Craft” site www.codeascraft.com


http://www.codeascraft.com/

Practices: a single image simplifies things

• Most of these companies manage a single production image that they completely control
  – They don’t have to worry about shipping releases to customers who might or might not install those releases
• …therefore there are no branches in their version control – everything is checked into the trunk


Practices: “Continuous Delivery Heresy” (Yes, you can do too much testing)

• Testing everything on every check-in is good…but it isn’t the endgame
  – LinkedIn has only a few thousand unit tests
• Testing in a non-production environment can reach a point of diminishing returns
  – Ever-growing lists of unit tests, often testing very obscure scenarios, often overlapping and redundant
  – Limited by your ability to predict real world scenarios
• LinkedIn practice: get to production environment as soon as practical
  – Progressive rollout minimizes the risk when deploying to production…


Practices: Progressive Rollout

• Progressive rollout of new features, “dark” releases:
  – Deploy to one server with all features disabled to ensure no performance or resource regressions (also known as “canarying”)
  – Turn on features for a small population, and measure (“smoke test”)
  – Turn it on for up to 1% of users, and measure
  – Progressively roll out to all servers, continuing to measure
  – Config Flags (also known as feature flags or gatekeepers [LinkedIn]) control which users see which features
• In order to successfully do Progressive Rollout, you’ll need two more of our five essential elements:
  – Automation, both to progressively roll out and to roll back if a problem is discovered
  – Measurement (tied to Instrumentation), in order to be able to rapidly measure the impact
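As a rough illustration of the progressive rollout steps on this slide, here is a minimal sketch of a percentage-based config flag with a measure-then-advance rule. It is not how Facebook, LinkedIn or Etsy implement their gatekeepers. The function names, the hashing scheme, the stage percentages and the error-rate threshold are all assumptions made for the example.

```python
import hashlib
from typing import Mapping, Sequence


def bucket(user_id: str, flag: str) -> float:
    """Map a user deterministically to a value in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return (int(digest[:8], 16) % 10000) / 100.0


def is_enabled(flag: str, user_id: str, rollout_percent: Mapping[str, float]) -> bool:
    """The feature is on for a user whose bucket falls below the flag's rollout percentage.

    Flags missing from the config default to 0%, i.e. deployed "dark".
    """
    return bucket(user_id, flag) < rollout_percent.get(flag, 0.0)


def next_stage(
    current_percent: float,
    error_rate: float,
    max_error_rate: float,
    stages: Sequence[float] = (0.0, 1.0, 10.0, 50.0, 100.0),  # hypothetical stages
) -> float:
    """Advance one stage when the measured error rate is healthy, else roll back to 0%."""
    if error_rate > max_error_rate:
        return 0.0  # automated rollback: switch the feature off for everyone
    for stage in stages:
        if stage > current_percent:
            return stage
    return 100.0


# Example: a hypothetical flag exposed to 1% of users, then advanced after measuring.
config = {"new_checkout": 1.0}
exposed = sum(is_enabled("new_checkout", f"user-{i}", config) for i in range(10000))
print(f"{exposed} of 10000 simulated users see the feature")
config["new_checkout"] = next_stage(1.0, error_rate=0.001, max_error_rate=0.01)
print(f"next rollout stage: {config['new_checkout']}%")
```

Because the bucket is derived from a hash of the flag and user ID, a user who sees the feature at 1% keeps seeing it at 10% and beyond. Measurements therefore stay comparable from one stage to the next.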


Progressive Rollout console at Facebook


Practices: Fire When Ready!

• These companies tend to avoid “release-defining features” that can hold up the entire release
• Cool Kids pattern: release features when they are ready - the release train waits for nobody
  – Also known as date-based releases - the date of release is fixed, but the features in that release are flexible
• For this to work, you must respect forward and backwards compatibility of API (service) interfaces
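One common way to respect forward and backwards compatibility between independently released services is a "tolerant reader": ignore fields you do not recognize and supply defaults for fields that older senders omit. The sketch below assumes a JSON payload and a hypothetical `Listing` message. Neither comes from the companies discussed here.

```python
import json
from dataclasses import dataclass


@dataclass
class Listing:
    """Hypothetical service message; `currency` was added in a later version."""

    id: int
    title: str
    currency: str = "USD"  # default keeps payloads from older senders valid


def parse_listing(payload: str) -> Listing:
    """Read only the fields this version knows about.

    Backwards compatible: a missing `currency` falls back to the default.
    Forwards compatible: unknown extra fields from newer senders are ignored.
    """
    data = json.loads(payload)
    return Listing(
        id=int(data["id"]),
        title=str(data["title"]),
        currency=str(data.get("currency", "USD")),
    )


old_sender = '{"id": 1, "title": "Hand-knit scarf"}'
new_sender = '{"id": 2, "title": "Mug", "currency": "EUR", "gift_wrap": true}'
print(parse_listing(old_sender))
print(parse_listing(new_sender))
```

With readers written this way, a service on the release train can ship a new field without waiting for every consumer to upgrade first.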


Cool Kids and Automation

• In general, the Cool Kids automate as much as possible
  – Etsy has invested a lot in automated unit / functional testing, dev tooling and monitoring, use of dashboards
  – Netflix has a heavy degree of automation across the board
• Automate even the infrastructure, but keep it simple
  – LinkedIn, Flickr and Netflix generally build up their infrastructure from just a single OS image
  – From here, configure individual servers using automated scripts driven by tool of choice (e.g. IBM UrbanCode)
  – Also commonly seen was use of “Phoenix” servers (vs. “Snowflakes”), which can be re-built at any time then “burned to the ground” if needed
• … but only automate what can be measured


Think you don’t need to keep an eye on automation?

http://windowsitpro.com/windows-7/aggressive-configmgr-based-windows-7-deployment-takes-down-emory-university

“During TechEd 2014, the Emory University IT department prepared and deployed Windows 7 upgrades to the campuses computers. If you've worked with ConfigMgr at all, you know that there are checks-and-balances that can be employed to ensure that only specifically targeted systems will receive an OS upgrade. In Emory University's case, the check-and-balance method failed and instead of delivering the upgrade to applicable computers, delivered Windows 7 to ALL computers including laptops, desktops, and even servers.

I'll stop for a second to let you take that in.

Yes, even servers.

By the time it was realized what exactly had happened, the Windows 7 sequence had repartitioned, reformatted, and installed Windows 7. Emory IT powered off the ConfigMgr server, hoping to stop the deployment before it was too late, but – it was too late. Even the ConfigMgr server had been repartitioned and reformatted…”

– Windows IT Pro, May 19, 2014

Finally: Instrument and Measure


• LinkedIn: “Measurement is better than prediction”
• Provide a common framework to make it easy for developers to choose what to log simply by tagging or registering it
  – “Push” from services works better than “pull” or polling
  – In many cases, developers need do no more than push key/value pairs to a logging system
  – LinkedIn collects 500K+ metrics per minute at an average of 400 metrics per service
• Instrument user behaviors to improve the user experience
  – Esurance: “we mined the data to figure out what people were doing most often, make those tasks the most prominent and make them addressable in as few clicks as possible”
• Metrics dashboards also display deployment activity
  – So if there’s a problem, you can easily tie the start time of the issue to the preceding pushes
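To make the "push key/value pairs" and "tie the issue to the preceding pushes" ideas concrete, here is a minimal in-memory sketch. It is not LinkedIn's tooling. The `MetricsSink` class, its method names and the 15-minute window are hypothetical, and a real system would send the data to a logging or metrics service rather than hold it in a Python object.

```python
import time
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class MetricsSink:
    """Toy stand-in for a metrics/logging system that services push to."""

    def __init__(self) -> None:
        self.samples: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
        self.deploys: List[Tuple[float, str]] = []

    def push(self, name: str, value: float, ts: Optional[float] = None) -> None:
        """Record one key/value sample; the service pushes, nothing polls it."""
        self.samples[name].append((time.time() if ts is None else ts, value))

    def record_deploy(self, description: str, ts: Optional[float] = None) -> None:
        """Record deployment activity alongside the metrics."""
        self.deploys.append((time.time() if ts is None else ts, description))

    def deploys_before(self, issue_start: float, window_seconds: float) -> List[str]:
        """List deployments that landed within the window before an issue began."""
        earliest = issue_start - window_seconds
        return [d for t, d in sorted(self.deploys) if earliest <= t <= issue_start]


sink = MetricsSink()
sink.record_deploy("search-service build 101", ts=1000.0)
sink.record_deploy("checkout-service build 57", ts=4000.0)
sink.push("checkout.error_rate", 0.001, ts=3900.0)
sink.push("checkout.error_rate", 0.080, ts=4120.0)  # errors spike after the push

# Which pushes preceded the spike by at most 15 minutes?
print(sink.deploys_before(issue_start=4120.0, window_seconds=900.0))
```

Keeping deployment events in the same timeline as the metrics is what lets a dashboard overlay the two and point at the push most likely to have caused a regression.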

Monitoring at LinkedIn

• LinkedIn developed and then open sourced tools for monitoring and graphing data being pushed to its logs…
  – inGraph, inFormed


So…what are the Cool Kids DevOps takeaways?


Culture
• Cultural change takes time – take reasonable steps
  – Team-building, cross-training, improved communication
  – Maybe include your Ops team in requirements / feature reviews and planning (e.g. via IBM RRC, RTC)

Organization
• Don’t turn your organization upside down
  – Experiment on a few smaller, low-risk projects
  – Maybe create DevOps "center of excellence"
  – Tear down walls between teams

Practices
• “Continuous Integration” is a good starting point
  – Push all builds to the last stage before release
  – Eat your own dog food (get employees involved to test)
  – Try progressive rollout or dark release of features

So…what are the Cool Kids DevOps takeaways?


Automation
• Start by automating a few areas that you can easily see and track the results from
  – E.g. Test / build pipeline, possibly using UrbanCode Deploy

Measurement
• First, assess your current process and consider the changes you want to make – then consider how to measure them
  – Instrument and measure anything you intend to automate

• But above all, be honest
  – Assess your own DevOps maturity and aspirations – where are you and where do you want to be?


IBM can help: DevOps Adoption Framework delivers measurable outcomes
Enable lean adoption of DevOps capabilities

[Diagram: DevOps Adoption Framework. Adoption Model: self-assessments, adoption paths, adoption services. Solutions: practices, tooling, services. Community: stories, enablement, feedback, knowledge sharing. Maturity progression: from process-based, process-heavy, manual and silo-ed (inefficient), through more predictable, more transparent and more continuous (leaner), to product-based, agile, automated, collaborative and optimizing (leaner and smarter). Lifecycle with DevOps continuous feedback: Steer (continuous business planning), Develop/Test (collaborative development, continuous testing), Deploy (continuous release and deployment), Operate (continuous monitoring, continuous customer feedback & optimization).]


Where to start: DevOps Adoption Roadmap
Assess desired outcome and supporting practices to drive strategy and rollout

Step 1: Business Goal Determination – What am I trying to achieve?
• Think through business-level drivers for improvement
• Define measurable goals for your organizational investment
• Look across silos and include key Dev and Ops stakeholders

Step 2: Current Practice Assessment – Where am I currently?
• What do you measure and currently achieve
• What don’t you measure, but should to improve
• What practices are difficult, incubating, well-scaled
• How do your team members agree with these findings

Step 3: Objective & Prioritized Capabilities – What are my priorities?
• Start where you are today and where your improvement goals
• Consider changes to People, Practices and Technology
• Prioritize change using goals, complexities and dependencies

Step 4: Strategy/Roadmap – What new practices should help me grow?
• Understand your appetite for cross-functional change
• Target improvements with the biggest bang for the buck
• Roadmap and agree on an actionable plan
• Use measurable milestones that include early wins


Connect with me on Twitter at @BillHoltshouser or LinkedIn at www.linkedin.com/pub/bill-holtshouser/4/815/66a/


Acknowledgements and Disclaimers

© Copyright IBM Corporation 2012. All rights reserved.
– U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo, ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml

Other company, product, or service names may be trademarks or service marks of others.

Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.

The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.

All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.
