Gamifying Operational Excellence The Service Score Card

Author: others

Post on 06-Jul-2020

TRANSCRIPT

  • Gamifying Operational Excellence

    The Service Score Card

  • Agenda

    1 The Problem

    2 A Solution idea

    3 A Solution tour

    4 The results

    5 Takeaways, Lessons Learnt & Questions

  • About me: Danny ☃ Lawrence

    “If it's not broken, I’ll fix it.”

    From Australia, on loan as

    Staff SRE @ LinkedIn

    jobs, companies, recruiter

    & Finder of encoding bugs

  • Good news, SREcon.

    You passed the ☃ test.

  • Some terms (before we really get started)

  • Operational Excellence: effective and efficient delivery of information, technology, and services required by end users that add measurable value.

  • Operational Excellence: doing everything required to make sure all of your services are as fast and as reliable as possible.

  • Gamification: application of game-design elements and game principles in non-game contexts.

  • Some background (LinkedIn SRE crash course)

  • Mostly Java

    Multitudes of services

    Doing lots of things

    Service-oriented architecture

    Everything talks to everything

    My direct team looks after 80+ services

    We have 200+ SREs

  • The Problem (What started this whole thing)

  • Problem 1: The GOOD & The BAD

  • BAD services wake me up

  • GOOD services let me sleep

  • What makes a GOOD service at LinkedIn is a moving target.

  • Technologies and dependencies change over time.

  • Some examples

    Upgrading dependencies & libraries: Java / Jetty / Play / Tomcat

    Correct usage of TLS

    Switching databases / caches

    Migrate from SVN to Git

    Reduce application startup time

    Set up error budgeting

    True up the number of metrics

  • A GOOD service can turn into a BAD service if you are not checking it.

  • Unfortunately, BAD services do not magically turn into GOOD services.

  • Problem 2: Knowing what is BAD

  • Problem 3: Knowing why it’s BAD

  • Problem 4: Tribal knowledge about how to get to GOOD

  • The only thing we appear to hate more than not having documentation... is writing documentation.

  • The Problem: summary

    BAD services wake me up

    Time will cause GOOD to turn BAD

    Hard to know what is BAD

    Hard to know why it is BAD

    Not sure how to fix the BAD

  • The Service ScoreCard (A solution)

  • In order to determine the health of the services we support, we define a list of production requirements.

  • Apply a weight to each requirement.

  • Codify each requirement into a check.

  • Execute these checks for each service.

  • Tally up the results for each service.

  • Grade the service from “F” to “A+”.

  • Add all the services into a high-score system.
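The weight, tally, grade and high-score steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the talk does not give the grade boundaries or the scoring formula, so `GRADE_BANDS`, `CheckResult`, `score`, `grade` and `leaderboard` are hypothetical names, and the weighted-percentage formula is one plausible way to tally results.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    weight: int    # weight applied to the requirement
    state: float   # 1.0 = pass, 0.0 = fail, in between = partial credit


# Hypothetical boundaries (minimum percent, letter); the talk only says "F" to "A+".
GRADE_BANDS = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D")]


def score(results):
    """Weighted percentage of passed checks for one service."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    earned = sum(r.weight * r.state for r in results)
    return 100.0 * earned / total


def grade(percent):
    """Map a percentage onto a letter using the assumed bands."""
    for floor, letter in GRADE_BANDS:
        if percent >= floor:
            return letter
    return "F"


def leaderboard(services):
    """Sort services by score, best first: the 'high score' table."""
    rows = [(name, score(results)) for name, results in services.items()]
    rows.sort(key=lambda row: row[1], reverse=True)
    return [(name, pct, grade(pct)) for name, pct in rows]
```

With two equally weighted checks where one passes and one fails, `score` returns 50.0 and `grade` returns "F" under these assumed bands. Raising a check's weight makes failing it cost more.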

  • Then publish those scores to the company.

  • This is great, but how do I improve the score?

    How can I add X check into the system?

  • What makes a check?

  • Checks are one type of plugin.

    Fetch plugins gather data.

    Check plugins check the data.

  • We use the fetch plugin to gather remote data from: SVN, Git, configuration DBs, host databases, monitoring systems, build systems, deployment systems.

  • Basically, if we can fetch it, then we do so.

  • We build a giant context object.

  • The check plugin will look at our context object.

  • All plugins are small Python scripts, where small is 10~30 LOC.

  • Simply return 2 or 3 things.

    state*: True, False, None or 0.0 - 1.0

    message*: short string

    data: Python dict of interesting things.
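One way to handle the "2 or 3 things" a plugin returns is to normalise them into a single record before scoring. The sketch below is an assumption about how that could look, not the scorecard's actual code: `PluginResult` and `normalize` are hypothetical names, the starred fields are read as required, and `None` is read as "not applicable".

```python
from typing import NamedTuple, Optional


class PluginResult(NamedTuple):
    state: Optional[float]   # None is treated here as "not applicable"
    message: str
    data: dict


def normalize(raw):
    """Turn a plugin's (state, message[, data]) tuple into a PluginResult."""
    if len(raw) == 2:
        state, message = raw
        data = {}
    elif len(raw) == 3:
        state, message, data = raw
    else:
        raise ValueError("a plugin must return 2 or 3 values")

    if state is None:
        value = None
    elif isinstance(state, bool):
        # True / False become full credit / no credit
        value = 1.0 if state else 0.0
    elif isinstance(state, (int, float)) and 0.0 <= state <= 1.0:
        value = float(state)
    else:
        raise ValueError("state must be True, False, None or 0.0 - 1.0")

    return PluginResult(value, str(message), dict(data or {}))
```

For example, `normalize((True, "ok"))` yields a state of 1.0 with an empty data dict, while a state outside 0.0 - 1.0 is rejected rather than silently clamped.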

  • Example fetch plugin

  • import requests as r  # assumption: "r" on the slide is the requests library

    @ssc.tags("ownership")
    def fetch_ownership(service_name):
        "Fetch all the ownership data of a service"
        # Ask the ownership service (http://owners/) about this service
        o = r.get("http://owners/" + service_name)
        # state, message, data
        return True, "gathered data", o.json()

  • Example check plugin

  • @ssc.weight(5)
    @ssc.tags('ownership')
    @ssc.wiki('http://wiki/ssc_eng_owner')
    def check_eng_team(ctx):
        "ensure ENG ownership of a service"
        # Pass when the fetched ownership data names an engineering team
        if ctx.ownership.eng_team:
            return True, ctx.ownership.eng_team
        return False, "missing eng_team"
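The talk does not show how the `ssc` decorators or the plugin runner work. The sketch below is one possible shape, with everything in it assumed: decorators store metadata on the function, fetch results are filed on the context under the plugin's first tag, and the fetch plugin returns canned data so the example runs offline. `_annotate`, `build_context` and `run_checks` are hypothetical names.

```python
from types import SimpleNamespace


def _annotate(**meta):
    """Build a decorator that stores metadata on the plugin function."""
    def decorator(func):
        func.ssc_meta = {**getattr(func, "ssc_meta", {}), **meta}
        return func
    return decorator


# Stand-in for the talk's `ssc` module; the real implementation is not shown.
ssc = SimpleNamespace(
    weight=lambda n: _annotate(weight=n),
    tags=lambda *names: _annotate(tags=names),
    wiki=lambda url: _annotate(wiki=url),
)


@ssc.tags("ownership")
def fetch_ownership(service_name):
    """Fetch plugin with canned data in place of the HTTP call."""
    return True, "gathered data", {"eng_team": "jobs-team"}


@ssc.weight(5)
@ssc.tags("ownership")
@ssc.wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    """Ensure ENG ownership of a service."""
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


def build_context(service_name, fetchers):
    """Run every fetch plugin and file its data under the plugin's first tag."""
    ctx = SimpleNamespace(service_name=service_name)
    for fetch in fetchers:
        _state, _message, data = fetch(service_name)
        setattr(ctx, fetch.ssc_meta["tags"][0], SimpleNamespace(**data))
    return ctx


def run_checks(ctx, checks):
    """Run each check against the context and attach its metadata."""
    report = []
    for check in checks:
        state, message = check(ctx)[:2]
        report.append({
            "check": check.__name__,
            "state": state,
            "message": message,
            "weight": check.ssc_meta.get("weight", 1),
            "wiki": check.ssc_meta.get("wiki"),
        })
    return report
```

Calling `run_checks(build_context("jobs-server", [fetch_ownership]), [check_eng_team])` produces one report row carrying the check's state, message, weight and wiki link, which is the information the scoring and "how to fix" steps need.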

  • Putting it all together

  • Problems

    Understanding what is BAD

    Knowing why it is BAD

    Not sure how to fix the BAD

  • Problems

    Understanding what is BAD

  • Problems

    Understanding what is BAD

    Knowing why it is BAD

  • Problems

    Understanding what is BAD

    Knowing why it is BAD

    Not sure how to fix the BAD

  • What is the check?

    Why is it important?

    How long will it take to fix?

    How will it be fixed?

  • AngularJS

    image: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/), https://angular.io/presskit.html (2017)

  • {{service_name}} becomes jobs-server

  • {{context.ownership.eng_owner}} becomes jobs-team

  • Using our fetched data in the wiki

  • {{service_name}}

  • {html}

    {html}

    https://cdn/angularjs.js

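  • var query = $location.search();
    var service_name = query['service_name'];
    var url = 'http://ssc/api/' + service_name;
    // Pass the URL to $http.get; .then() also works on AngularJS 1.6+, where .success() was removed
    $http.get(url).then(
      function(response) {
        // Expose the fetched context to the wiki page's {{ctx...}} bindings
        $scope.ctx = response.data;
      }
    );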

  • {{ctx.ownership.owner_eng}}

    {{ctx.number_of_hosts}}

    {{ctx.product.lib.jetty.version}}

    {{ctx.hosts.hostnames}}

    {{ctx.is_deployed_in_prod}}

    {{ctx.commits.last_commit}}
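In the talk, AngularJS resolves these placeholders in the browser. To make the lookup itself concrete, here is a small Python sketch of the same idea: walking a dotted path through a nested context. It is not how AngularJS is implemented. `render` and `lookup` are hypothetical names, and leaving unknown placeholders untouched is a choice made for this sketch.

```python
import re

# Matches {{ dotted.path }} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def lookup(root, dotted_path):
    """Follow a path such as 'ctx.ownership.owner_eng' through nested dicts."""
    value = root
    for key in dotted_path.split("."):
        value = value[key]
    return value


def render(template, scope):
    """Replace each {{path}} with its value; unknown paths are left untouched."""
    def substitute(match):
        try:
            return str(lookup(scope, match.group(1)))
        except (KeyError, TypeError):
            return match.group(0)
    return PLACEHOLDER.sub(substitute, template)
```

Given a scope where `service_name` is "jobs-server" and `ctx.ownership.owner_eng` is "jobs-team", the template "{{service_name}} is owned by {{ctx.ownership.owner_eng}}" renders as "jobs-server is owned by jobs-team". One wiki page can therefore serve every service.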

  • Problems

    Understanding what is BAD

    Knowing why it is BAD

    Not sure how to fix the BAD

  • Now

    Reports show what is BAD

    Checks validate why it is BAD

    Wiki shows how to fix the BAD

  • No more of these emails

    “If you use lib-core, then upgrade it, we found a bug”

  • How many of my 80 services use this lib?

    How do I check?

    How do I upgrade?
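Once every service has a fetched context, the "how many of my services use this lib?" question becomes a query over stored data. The sketch below assumes contexts are plain dicts shaped like the `ctx.product.lib.<name>.version` placeholder shown earlier, and that versions are purely numeric dotted strings. `services_using` and `needs_upgrade` are hypothetical names.

```python
def services_using(contexts, lib_name):
    """Map service name -> version for every service whose context lists lib_name."""
    found = {}
    for service_name, ctx in contexts.items():
        libs = ctx.get("product", {}).get("lib", {})
        if lib_name in libs:
            found[service_name] = libs[lib_name].get("version")
    return found


def needs_upgrade(contexts, lib_name, fixed_version):
    """List services running a version of lib_name older than fixed_version."""
    def as_tuple(version):
        # Assumes numeric dotted versions such as "1.10.0"
        return tuple(int(part) for part in version.split("."))

    return sorted(
        name
        for name, version in services_using(contexts, lib_name).items()
        if version is not None and as_tuple(version) < as_tuple(fixed_version)
    )
```

Comparing versions as integer tuples rather than strings matters here: "1.10.0" is newer than "1.3.0", which a plain string comparison would get wrong.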

  • Where does this tool fit?

  • pre-commit Build Deployment Monitoring

  • pre-commit Build Deployment Monitoring

    Service Scorecard

    API

  • Service Scorecard

    API

    hack-days Reporting Deployment Monitoring

  • Results & Outcomes

  • What do we do with the scores?

  • Priority #1: Getting the grades better

  •                                  When we started    Now
    Average grade for my team        40%                80%
    Average score across SRE         35%                60%
    Checks in 24 hours               15,560             89,859
    Number of checks per service     15                 31

  • We can now explore new ways to use the scores.

  • Carrot & Stick

  • Carrot / GOOD

    Stick / BAD

  • No SRE support for F Grade services.

  • F Grade services generally cause the most problems.

  • No deploy moratorium for A+ services.

  • A+ services generally cause the fewest problems.

  • A services are allowed to deploy 24/7.

  • Premium SRE support for A+ services.

  • Priority build queues for GOOD services.

  • Tiger teams to raise the scores on F Grade services.

  • Hack Days

  • FREE BEER

  • Basically any problem can be solved with FREE BEER

  • OR T-Shirts

  • Influence where we allocate open headcount.

  • Simple way to get things done.

  • Takeaways & Lessons Learnt

  • Everyone cares about Reliability, everyone is a Site Reliability Engineer.

  • Everyone cares about Reliability, you just need to empower them.

  • Hack Days are important, this POC was built in an afternoon.

  • Getting the data was easy, finding interesting ways to use it is hard.

  • Make it as easy as possible to do the right thing.

  • Cheers !

  • Q & A