what do the "cool kids" know about devops?
Post on 17-Oct-2014
DESCRIPTION
Facebook, Netflix, Flickr, Etsy, LinkedIn, eSurance, Instagram and Salesforce.com; you know their names. As a consumer, you’ve probably used services provided by many of them. These are some of the “born on the web” companies of the last couple of decades that have helped pioneer new, web-based business models - and in the process become dominant players in their markets, or created new markets altogether. Call them the “Cool Kids”. What you may not know, however, is that these companies are also strong adopters of a DevOps approach when it comes to software development and delivery. In this presentation we take a look at these companies to discern patterns related to how they have applied DevOps in the areas of Culture, Organization, Practices, Automation and Measurements. Even if your company bears no resemblance at all to the Cool Kids, you can take away some important learnings from them as you look to apply DevOps to your own software initiatives. This presentation is a result of a joint project executed by IBM strategists Bill Holtshouser and Carl Zetie, both of the Rational division in IBM Software Group, during the first half of 2014.
TRANSCRIPT
© 2014 IBM Corporation
Session: 2427
What the Cool Kids are Doing with DevOps
Bill Holtshouser
Senior Strategist, Mobile, DevOps, Cloud
IBM Rational
Please note…
IBM’s statements regarding its plans, directions, and intent are subject to change or withdrawal without notice at IBM’s sole discretion.
Information regarding potential future products is intended to outline our general product direction and it should not be relied on in making a purchasing decision.
The information mentioned regarding potential future products is not a commitment, promise, or legal obligation to deliver any material, code or functionality. Information about potential future products may not be incorporated into any contract. The development, release, and timing of any future features or functionality described for our products remains at our sole discretion.
Performance is based on measurements and projections using standard IBM benchmarks in a controlled environment. The actual throughput or performance that any user will experience will vary depending upon many factors, including considerations such as the amount of multiprogramming in the user’s job stream, the I/O configuration, the storage configuration, and the workload processed. Therefore, no assurance can be given that an individual user will achieve results similar to those stated here.
Introduction
• This session is based on an examination of a series of “born on the web” companies to see what common patterns and other learnings can be derived from their DevOps journeys, with the goal of extracting guidance for IBM’s clients
• We used only publicly available information such as published conference presentations, company blogs, videos, news stories and white papers
• Important: Everything here is strictly our opinion; none of the companies mentioned reviewed or endorsed these opinions in any way!
Key Takeaways
• “Born on the Web” startups like Etsy, Netflix and others have been leaders in applying a DevOps approach to SW development and delivery – but they are essentially built from the ground up to do so
• These companies display numerous common DevOps-related traits in the areas of Culture, Organization, Practices, Automation and Measurements
• Although your enterprise won’t be able to replicate all aspects of these “cool kid” companies and how they have applied DevOps (nor should you even try), there are some important learnings from them that can inform your own DevOps approach
Does this story sound familiar?
One way to address the issue…
Believe it or not, Dev and Ops weren’t always separate
“Back in the dawn of the computer age, there was no distinction between dev and ops. If you developed, you operated. You mounted the tapes, you flipped the switches on the front panel, you rebooted when things crashed, and possibly even replaced the burned out vacuum tubes. And you got to wear a geeky white lab coat…”
“Dev and ops started to separate in the ‘60s, when programmers dumped boxes of punch cards into readers and “computer operators” scurried around mounting tapes in response to IBM JCL. The operators also pulled printouts from line printers and put them in labeled cubbyholes, where you got your output filed under your last name.” – John Allspaw, Etsy
So…just who are these “Cool Kids” anyway?
Sidebar: Continuous Delivery is more than just “fast Continuous Integration”
Continuous Delivery
• Websites, SaaS offerings
• Multiple pushes to production per day
• Highly decoupled, independent feature sets
• Single image/single stream
• New practices and patterns
Continuous Integration
• Traditional applications, appliances, mobile apps, Web APIs
• Delivery to production every few days to weeks
• Coordinated releases, multiple version streams
• Established Agile practices
Continuous Engineering
• Complex embedded systems
• Complex product release and update cycles
• Management of variants and versions
• Engineering practices
Five essential elements of “Cool Kids” DevOps success
• Culture
• Organization
• Practices
• Automation
• Measurement
Cool Kids and Culture - key learnings
• Trust leads to an acceptance of “reasonable” risk
– Organization, tools, automation, instrumentation can all reduce risk
• Risk = PROBABILITY of Error x COST of Error
– Not all risks are created equal; zero risk is unattainable
– Cost depends on Time to Fix
• Learning from mistakes > blame
– …but there is still Karma: repeated mistakes may lead to loss of privilege
• ALL exhibit a high degree of delegation
– …which leads to velocity
• In order to delegate, the Cool Kids trust… but verify
– E.g. via instrumentation, measurement
At Etsy, employees have a high degree of creative freedom and, when things go wrong, accountability without blame. “We actually trust people,” CTO Chad Dickerson says. He calls the approach a “radical decentralization of authority.” – Inc. Magazine, 12/13
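As a rough illustration of the risk formula on this slide, the sketch below treats the cost of an error as growing linearly with the time to fix. The function names, the per-minute cost and the example numbers are assumptions made up for illustration; none of them come from the companies discussed.

```python
def cost_of_error(minutes_to_fix: float, cost_per_minute: float) -> float:
    """Cost of an error, assumed here to grow linearly with time to fix."""
    return minutes_to_fix * cost_per_minute


def risk(probability_of_error: float, cost: float) -> float:
    """Risk = PROBABILITY of error x COST of error."""
    if not 0.0 <= probability_of_error <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability_of_error * cost


# Hypothetical numbers: a change that takes four hours to fix versus
# one that can be rolled back within four minutes, at the same error rate.
slow_to_fix = risk(0.25, cost_of_error(minutes_to_fix=240, cost_per_minute=100))
fast_to_fix = risk(0.25, cost_of_error(minutes_to_fix=4, cost_per_minute=100))

print(slow_to_fix, fast_to_fix)
```

With the probability of error held constant, shortening the time to fix shrinks the risk in proportion, which is one way to read the emphasis on rapid feedback and reliable rollback.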
Re-defining the attitude towards “failure”
• Netflix allows failure to happen continuously and wants its SW to be able to deal with it; in fact, it takes steps to encourage errors (Simian Army)
• In reality, Netflix looks at “failure” as simply another STEP in the SW development process
http://techblog.netflix.com/2011/07/netflix-simian-army.html
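The idea of deliberately encouraging errors can be sketched with a toy simulation. This is not Netflix's Simian Army code; the `Cluster` class, `inject_failure` function and instance ids are hypothetical, and the "cluster" is just an in-memory set. The point is only that a service built with redundancy keeps answering after a randomly chosen instance is terminated.

```python
import random


class Cluster:
    """Toy in-memory stand-in for a group of redundant service instances."""

    def __init__(self, instance_ids):
        self.healthy = set(instance_ids)

    def terminate(self, instance_id):
        self.healthy.discard(instance_id)

    def handle_request(self):
        """Serve from any healthy instance; fail only if none are left."""
        if not self.healthy:
            raise RuntimeError("no healthy instances")
        return f"served by {min(self.healthy)}"


def inject_failure(cluster, rng):
    """Terminate one randomly chosen healthy instance and return its id."""
    victim = rng.choice(sorted(cluster.healthy))
    cluster.terminate(victim)
    return victim


rng = random.Random(42)  # seeded so the run is repeatable
cluster = Cluster(["i-1", "i-2", "i-3"])
victim = inject_failure(cluster, rng)
response = cluster.handle_request()  # still succeeds with two instances left
print(f"terminated {victim}; {response}")
```

Running failure injection continuously, rather than once, is what turns "failure" into a routine step that the software has to handle.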
Cool Kids and Culture – more learnings
• Adopt an “Ops First” design mentality
– Don’t build what you can’t manage
• Recognize the importance of build
– They don’t just give the build system to the “worst programmer” or newest hire, but establish a focused role
Bottom line: a culture of trust is required
Rapid delivery requires low risk:
• Small feature sets
• Independent services
• Progressive exposure
• Rapid feedback
• Reliable rollback
• High delegation & trust
Risk = Probability of error x Cost of error
Adrian Cockcroft of Netflix on Culture
“Culture is very hard to create or modify but easy to destroy. This is because everyone has to buy into it for it to be effective, and then every manager has to hire only people who are compatible with the culture, and also get rid of people who turn out not to fit in, even if they are doing good work.
So the short answer is: start a new company from scratch with the culture you want, and pay a lot of attention to who you hire. I don't think it is possible to do a culture shift if there are more than a roomful of people involved.
Even with a roadmap and a guide, you probably won't be able to follow this path if you are in a large established company. Your existing culture won't let you.”
http://perfcap.blogspot.com/2012/03/ops-devops-and-noops-at-netflix.html
Organization follows Culture
Traditional Culture: “My priority is to deliver code… fast.” / “My priority is to keep the site up and running.”
DevOps Culture: “We’re all on the same team! Want some pizza?”
• Conway’s Law (you build what you are) applies
– …also applies to how you’re organized
• Feature teams, not platform teams
– Small teams: “two pizza” rule
• Organize for an “end-to-end” responsibility for delivery
– Positive approach to fixing mistakes – learning, not “blame and shame”
• Many common patterns are seen in QA…
– Shared responsibility across a team, everybody does QA, or co-located QA
– Small Quality Engineering CoE team provides common tools/practices
– But NOT a separate/antagonistic QA org (“clean up your own mess”)
• Small DevOps “toolsmith” teams
– A.K.A. Systems Release Engineering
– Provide common tools & processes for automation, logging, monitoring…
– There to help, NOT to do it for you
• Finally - no “throwing it over the wall”…
…basically, you need to be getting away from this:
WORKED FINE IN TEST
OPS PROBLEM NOW
Practices that “make perfect” for the Cool Kids
• “Light” planning and specs
– Etsy high level planning done in 60 day chunks and two week periods; specs kept very light – no more than what is required
• Cut the cord with traditional release process
– Developers coordinate and drive the release of their own code without need for a centralized release cycle
– Netflix goes farther than most: “NoOps”
• Speed, speed, speed
– It’s all about rapid deployment; some deploy updates to their site 25x per day
• Progressive rollout of new features, “dark” releases
– Concept of “config flags”: new features are there but not yet enabled, then launched with a simple switch in the code
• They talk about it…a LOT
– Lots of internal and external forums / blogs among the Cool Kids
– Example: Etsy “Code as Craft” site www.codeascraft.com
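The “config flags” idea from this slide can be sketched in a few lines. The `ConfigFlags` class, the `new_checkout` flag name and the page strings are hypothetical; real systems typically load flags from configuration rather than holding them in memory, and this sketch does not reproduce any particular company's implementation.

```python
class ConfigFlags:
    """Hypothetical in-memory flag store, for illustration only."""

    def __init__(self, flags=None):
        self._flags = dict(flags or {})

    def is_enabled(self, name: str) -> bool:
        # Unknown flags default to off, so new code stays dark until launched.
        return self._flags.get(name, False)

    def enable(self, name: str) -> None:
        self._flags[name] = True


def render_checkout(flags: ConfigFlags) -> str:
    """Both code paths are deployed; the flag decides which one runs."""
    if flags.is_enabled("new_checkout"):
        return "new checkout page"
    return "old checkout page"


flags = ConfigFlags({"new_checkout": False})
before = render_checkout(flags)  # feature is deployed but dark
flags.enable("new_checkout")     # the "launch" is a flag change, not a deploy
after = render_checkout(flags)
print(before, "->", after)
```

Because the new code is already in production when the flag flips, deployment and launch become two separate, individually reversible steps.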
Practices: a single image simplifies things
• Most of these companies manage a single production image that they completely control
– They don’t have to worry about shipping releases to customers who might or might not install those releases
• …therefore there are no branches in their version control – everything is checked into the trunk
Practices: “Continuous Delivery Heresy” (Yes, you can do too much testing)
• Testing everything on every check-in is good…but it isn’t the endgame
– LinkedIn has only a few thousand unit tests
• Testing in a non-production environment can reach a point of diminishing returns
– Ever-growing lists of unit tests, often testing very obscure scenarios, often overlapping and redundant
– Limited by your ability to predict real world scenarios
• LinkedIn practice: get to production environment as soon as practical
– Progressive rollout minimizes the risk when deploying to production…
Practices: Progressive Rollout
• Progressive rollout of new features, “dark” releases:
– Deploy to one server with all features disabled to ensure no performance or resource regressions (also known as “canarying”)
– Turn on features for a small population, and measure (“smoke test”)
– Turn it on for up to 1% of users, and measure
– Progressively roll out to all servers, continuing to measure
– Config Flags (also known as feature flags or gatekeepers [LinkedIn]) control which users see which features
• In order to successfully do Progressive Rollout, you’ll need two more of our five essential elements:
– Automation, both to progressively roll out and to roll back if a problem is discovered
– Measurement (tied to Instrumentation), in order to be able to rapidly measure the impact
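The rollout steps on this slide can be sketched as a loop that widens exposure stage by stage and rolls back automatically when a measurement looks bad. The stage percentages, the error-rate threshold and the function names are assumptions chosen for illustration, and the measurement is passed in as a plain function rather than read from a real monitoring system.

```python
import hashlib

# Assumed stages, as a percentage of users: 0% stands for "deployed with
# all features disabled", followed by a small population, 1%, then everyone.
STAGES = [0.0, 0.1, 1.0, 100.0]


def bucket(user_id: str) -> float:
    """Map a user to a stable value in [0, 100) so exposure only ever grows."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return (int(digest, 16) % 10_000) / 100.0


def sees_feature(user_id: str, percent: float) -> bool:
    """A user sees the feature once the rollout percentage passes their bucket."""
    return bucket(user_id) < percent


def roll_out(measure_error_rate, max_error_rate=0.01):
    """Advance through the stages; return 0.0 (rolled back) on a bad measurement."""
    for percent in STAGES:
        if measure_error_rate(percent) > max_error_rate:
            return 0.0  # automated rollback
    return STAGES[-1]


# Hypothetical measurements: one healthy feature, one that breaks at 1%.
healthy = roll_out(lambda percent: 0.001)
broken = roll_out(lambda percent: 0.05 if percent >= 1.0 else 0.001)
print(healthy, broken)
```

Hashing the user id keeps each user in the same bucket across stages, so a user who has seen the feature at 1% still sees it when the rollout widens.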
Progressive Rollout console at Facebook
Practices: Fire When Ready!
• These companies tend to avoid “release-defining features” that can hold up the entire release
• Cool Kids pattern: release features when they are ready - the release train waits for nobody
– Also known as date-based releases - the date of release is fixed, but the features in that release are flexible
• For this to work, you must respect forward and backwards compatibility of API (service) interfaces
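A date-based release can be sketched as a simple filter: the date is fixed, and whatever is ready by then ships. The `Feature` class, the `cut_release` function, and the feature names and dates below are hypothetical examples, not taken from any of the companies discussed.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Feature:
    name: str
    ready_on: Optional[date]  # None means the feature is not ready yet


def cut_release(features, release_date):
    """Ship what is ready by the fixed date; the rest waits for the next train."""
    shipped, waiting = [], []
    for feature in features:
        ready = feature.ready_on is not None and feature.ready_on <= release_date
        (shipped if ready else waiting).append(feature.name)
    return shipped, waiting


features = [
    Feature("search-filters", date(2014, 3, 1)),
    Feature("gift-cards", date(2014, 3, 20)),
    Feature("wishlists", None),
]
shipped, waiting = cut_release(features, release_date=date(2014, 3, 15))
print(shipped, waiting)
```

No feature can delay the train here, which only works if the features left behind do not break the interfaces that the shipped ones depend on.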
Cool Kids and Automation
• In general, the Cool Kids automate as much as possible
– Etsy has invested a lot in automated unit / functional testing, dev tooling and monitoring, use of dashboards
– Netflix has a heavy degree of automation across the board
• Automate even the infrastructure, but keep it simple
– LinkedIn, Flickr and Netflix generally build up their infrastructure from just a single OS image
– From here, configure individual servers using automated scripts driven by tool of choice (e.g. IBM UrbanCode)
– Also commonly seen was use of “Phoenix” servers (vs. “Snowflakes”), which can be re-built at any time then “burned to the ground” if needed
• … but only automate what can be measured
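The “Phoenix” server idea can be sketched as a pure function from a base image and a role to a server description. Everything below is a simulation with hypothetical image, role and package names; it does not call IBM UrbanCode or any other real provisioning tool.

```python
# Hypothetical single OS image that every server starts from.
BASE_IMAGE = {"os": "base-os-image", "packages": frozenset()}

# Hypothetical roles and the packages their automated scripts install.
ROLE_SCRIPTS = {
    "web": ["web-server", "app-runtime"],
    "db": ["database"],
}


def build_server(role: str) -> dict:
    """Build a server from scratch: base image first, then the role script."""
    packages = set(BASE_IMAGE["packages"])
    for package in ROLE_SCRIPTS[role]:
        packages.add(package)  # idempotent: adding twice changes nothing
    return {"os": BASE_IMAGE["os"], "role": role, "packages": frozenset(packages)}


first = build_server("web")
rebuilt = build_server("web")  # "burn it down" and rebuild from the same inputs
print(first == rebuilt)
```

Because the build depends only on the image and the script, a rebuilt server matches the original, whereas a hand-tuned “Snowflake” accumulates changes that no script can reproduce.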
Think you don’t need to keep an eye on automation?
http://windowsitpro.com/windows-7/aggressive-configmgr-based-windows-7-deployment-takes-down-emory-university
“During TechEd 2014, the Emory University IT department prepared and deployed Windows 7 upgrades to the campuses computers. If you've worked with ConfigMgr at all, you know that there are checks-and-balances that can be employed to ensure that only specifically targeted systems will receive an OS upgrade. In Emory University's case, the check-and-balance method failed and instead of delivering the upgrade to applicable computers, delivered Windows 7 to ALL computers including laptops, desktops, and even servers.
I'll stop for a second to let you take that in.
Yes, even servers.
By the time it was realized what exactly had happened, the Windows 7 sequence had repartitioned, reformatted, and installed Windows 7. Emory IT powered off the ConfigMgr server, hoping to stop the deployment before it was too late, but – it was too late. Even the ConfigMgr server had been repartitioned and reformatted…”
– Windows IT Pro, May 19, 2014
Finally: Instrument and Measure
• LinkedIn: “Measurement is better than prediction”
• Provide a common framework to make it easy for developers to choose what to log simply by tagging or registering it
– “Push” from services works better than “pull” or polling
– In many cases, developers need do no more than push key/value pairs to a logging system
– LinkedIn collects 500K+ metrics per minute at an average of 400 metrics per service
• Instrument user behaviors to improve the user experience
– Esurance: “we mined the data to figure out what people were doing most often, make those tasks the most prominent and make them addressable in as few clicks as possible”
• Metrics dashboards also display deployment activity
– So if there’s a problem, you can easily tie the start time of the issue to the preceding pushes
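Two of the points on this slide, pushing key/value pairs and tying an issue to the preceding deployments, can be sketched together. The `MetricsSink` class, the metric keys, the service names and the timestamps are hypothetical; this is an in-memory stand-in and does not reproduce LinkedIn's tooling.

```python
import time
from collections import defaultdict


class MetricsSink:
    """Hypothetical in-memory stand-in for a central logging/metrics system."""

    def __init__(self):
        self.metrics = defaultdict(list)  # key -> [(timestamp, value), ...]
        self.deployments = []             # [(timestamp, description), ...]

    def push(self, key, value, timestamp=None):
        """Services push key/value pairs themselves; nothing polls them."""
        when = time.time() if timestamp is None else timestamp
        self.metrics[key].append((when, value))

    def record_deployment(self, description, timestamp):
        self.deployments.append((timestamp, description))

    def deployments_before(self, issue_start, window_seconds):
        """Return deployments made in the window leading up to an issue."""
        earliest = issue_start - window_seconds
        return [d for t, d in self.deployments if earliest <= t <= issue_start]


sink = MetricsSink()
sink.record_deployment("search-service build 101", timestamp=1000)
sink.record_deployment("profile-service build 57", timestamp=5000)
sink.push("profile.errors", 3, timestamp=5100)  # errors appear after the second push
suspects = sink.deployments_before(issue_start=5100, window_seconds=600)
print(suspects)
```

Keeping deployment events in the same store as the metrics is what lets a dashboard overlay the two and point at the most recent push.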
Monitoring at LinkedIn
• LinkedIn developed and then open sourced tools for monitoring and graphing data being pushed to its logs…
inGraph, inFormed
So…what are the Cool Kids DevOps takeaways?
Culture
• Cultural change takes time – take reasonable steps
– Team-building, cross-training, improved communication
– Maybe include your Ops team in requirements / feature reviews and planning (e.g. via IBM RRC, RTC)
Organization
• Don’t turn your organization upside down
– Experiment on a few smaller, low-risk projects
– Maybe create DevOps "center of excellence"
– Tear down walls between teams
Practices
• “Continuous Integration” is a good starting point
– Push all builds to the last stage before release
– Eat your own dog food (get employees involved to test)
– Try progressive rollout or dark release of features
Automation
• Start by automating a few areas that you can easily see and track the results from
– E.g. Test / build pipeline, possibly using UrbanCode Deploy
Measurement
• First, assess your current process and consider the changes you want to make – then consider how to measure them
– Instrument and measure anything you intend to automate
• But above all, be honest
– Assess your own DevOps maturity and aspirations – where are you and where do you want to be?
IBM can help: DevOps Adoption Framework delivers measurable outcomes
Enable lean adoption of DevOps capabilities
[Figure: DevOps Adoption Framework]
• Adoption Model: self-assessments, adoption paths, adoption services
• Solutions: practices, tooling, services
• Lifecycle: Steer, Develop/Test, Deploy, Operate, linked by DevOps Continuous Feedback
• Capabilities: Continuous Business Planning, Collaborative Development, Continuous Testing, Continuous Release and Deployment, Continuous Monitoring, Continuous Customer Feedback & Optimization
• Progression: Inefficient (process-based, process-heavy, manual, silo-ed) to Leaner to Leaner and Smarter (product-based, agile, automated, collaborative, optimizing; more predictable, more transparent, more continuous)
• Community: stories, enablement, feedback, knowledge sharing; where and how to get Lean expertise and technologies
Where to start: DevOps Adoption Roadmap
Assess desired outcome and supporting practices to drive strategy and rollout
Step 1: Business Goal Determination – What am I trying to achieve?
• Think through business-level drivers for improvement
• Define measurable goals for your organizational investment
• Look across silos and include key Dev and Ops stakeholders
Step 2: Current Practice Assessment – Where am I currently?
• What do you measure and currently achieve
• What don’t you measure, but should to improve
• What practices are difficult, incubating, well-scaled
• How do your team members agree with these findings
Step 3: Objective & Prioritized Capabilities – What are my priorities?
• Start where you are today and where your improvement goals
• Consider changes to People, Practices and Technology
• Prioritize change using goals, complexities and dependencies
Step 4: Strategy/Roadmap – What new practices should help me grow?
• Understand your appetite for cross-functional change
• Target improvements with the biggest bang for the buck
• Roadmap and agree on an actionable plan
• Use measurable milestones that include early wins
Connect with me on Twitter at @BillHoltshouser or LinkedIn at www.linkedin.com/pub/bill-holtshouser/4/815/66a/
Acknowledgements and Disclaimers
© Copyright IBM Corporation 2012. All rights reserved.
– U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.
IBM, the IBM logo, ibm.com are trademarks or registered trademarks of International Business Machines Corporation in the United States, other countries, or both. If these and other IBM trademarked terms are marked on their first occurrence in this information with a trademark symbol (® or ™), these symbols indicate U.S. registered or common law trademarks owned by IBM at the time this information was published. Such trademarks may also be registered or common law trademarks in other countries. A current list of IBM trademarks is available on the Web at “Copyright and trademark information” at www.ibm.com/legal/copytrade.shtml
Other company, product, or service names may be trademarks or service marks of others.
Availability. References in this presentation to IBM products, programs, or services do not imply that they will be available in all countries in which IBM operates.
The workshops, sessions and materials have been prepared by IBM or the session speakers and reflect their own views. They are provided for informational purposes only, and are neither intended to, nor shall have the effect of being, legal or other guidance or advice to any participant. While efforts were made to verify the completeness and accuracy of the information contained in this presentation, it is provided AS-IS without warranty of any kind, express or implied. IBM shall not be responsible for any damages arising out of the use of, or otherwise related to, this presentation or any other materials. Nothing contained in this presentation is intended to, nor shall have the effect of, creating any warranties or representations from IBM or its suppliers or licensors, or altering the terms and conditions of the applicable license agreement governing the use of IBM software.
All customer examples described are presented as illustrations of how those customers have used IBM products and the results they may have achieved. Actual environmental costs and performance characteristics may vary by customer. Nothing contained in these materials is intended to, nor shall have the effect of, stating or implying that any activities undertaken by you will result in any specific sales, revenue growth or other results.