

Gamifying Operational Excellence
The Service Scorecard

Agenda
1 The Problem
2 A Solution idea
3 A Solution tour
4 The results
5 Takeaways, lessons learnt & questions

About me: Danny ☃ Lawrence
“If it's not broken, I’ll fix it.”
From Australia, on loan as
Staff SRE @ LinkedIn
jobs, companies, recruiter
& Finder of encoding bugs


Good news, SREcon.
You passed the ☃ test.

Some terms (before we really get started)

Operational Excellence: effective and efficient delivery of information, technology, and services required by end users
that add measurable value.

Operational Excellence: doing everything required to make sure
all of your services are as fast and as reliable as possible.

Gamification: application of game-design elements and game
principles in non-game contexts.

Some background (LinkedIn SRE crash course)

Mostly Java
Multitudes of services
Doing lots of things
Service-oriented architecture
Everything talks to everything
My direct team looks after 80+ services
We have 200+ SREs

The Problem (What started this whole thing)

Problem 1: The GOOD
& The BAD

BAD services wake me up

GOOD services let me sleep

What makes a GOOD service at LinkedIn is a moving target.

Technologies and dependencies change
over time.

Some examples
Upgrading dependencies & libraries
Java / Jetty / Play / Tomcat
Correct usage of TLS
Switching databases / caches
Migrate from SVN to Git
Reduce application startup time
Set up error budgeting
True up the number of metrics

A GOOD service can turn into a BAD service
if you are not checking it.

Unfortunately, BAD services
do not magically turn into
GOOD services.

Problem 2: Knowing what is BAD

Problem 3: Knowing why it’s BAD

Problem 4: Tribal knowledge
about how to get to GOOD

The only thing we appear to hate more than not having documentation
...is writing documentation.

The Problem summary

BAD services wake me up
Time will cause GOOD to turn BAD
Hard to know what is BAD
Hard to know why it is BAD
Not sure how to fix the BAD

The Service Scorecard (A solution)

In order to determine the health of the services we support,
we define a list of production requirements.

Apply a weight to each requirement

Codify each requirement into a check.

Execute these checks for each service

Tally up the results for each service.

Grade the service from “F” to “A+”
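The steps above (weight each requirement, run the checks, tally, grade) can be sketched roughly as follows. The grade boundaries and the `CheckResult` type are assumptions made for illustration; the talk does not state the actual thresholds or data model.

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    weight: int    # importance of the requirement
    state: float   # 1.0 = pass, 0.0 = fail, in between = partial credit


# Hypothetical grade boundaries as (minimum percent, grade), highest first.
GRADE_BOUNDARIES = [(97, "A+"), (90, "A"), (80, "B"), (70, "C"), (60, "D")]


def score(results):
    """Weighted percentage of requirements met, from 0.0 to 100.0."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    earned = sum(r.weight * r.state for r in results)
    return 100.0 * earned / total


def grade(percent):
    """Map a percentage to a letter grade; anything below the last boundary is F."""
    for floor, letter in GRADE_BOUNDARIES:
        if percent >= floor:
            return letter
    return "F"


# Example values are made up.
results = [
    CheckResult("eng_team", 5, 1.0),
    CheckResult("uses_tls", 3, 0.0),
    CheckResult("startup_time", 2, 0.5),
]
```

With these made-up results, 6 of 10 weighted points are earned, so `grade(score(results))` gives "D" under the assumed boundaries.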

Add all the services into a high-score system

Then

Publish those scores to the company

This is great, but how do I improve the score?
How can I add check X to the system?

What makes a check?

Checks are one type of plugin.
Fetch plugins gather data.
Check plugins check the data.

We use the fetch plugin to gather remote data from:
SVN, Git, configuration DBs, host databases, monitoring systems, build systems, deployment systems.

Basically, if we can fetch it,
then we do so.

We build a giant context object.
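A minimal sketch of how fetched data might be merged into one context object. It assumes each fetcher returns a (state, message, data) tuple, that the data is stored under a key per fetcher, and that nested dicts are exposed as attributes so a check can read something like `ctx.ownership.eng_team`. The function names and canned values are hypothetical; the real fetchers call remote systems.

```python
from types import SimpleNamespace


def to_namespace(value):
    """Recursively convert dicts so keys can be read as attributes."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    if isinstance(value, list):
        return [to_namespace(v) for v in value]
    return value


def build_context(service_name, fetchers):
    """Run every fetcher and collect its data under the fetcher's key."""
    raw = {"service_name": service_name}
    for key, fetch in fetchers.items():
        state, message, data = fetch(service_name)
        # Keep going when a fetch fails; checks can then report missing data.
        raw[key] = data if state else {}
    return to_namespace(raw)


# Stand-in fetchers returning canned data instead of calling real systems.
def fetch_ownership(service_name):
    return True, "gathered data", {"eng_team": "jobs-team"}


def fetch_hosts(service_name):
    return True, "gathered data", {"hostnames": ["host1", "host2"]}


ctx = build_context(
    "jobs-server",
    {"ownership": fetch_ownership, "hosts": fetch_hosts},
)
```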

The check plugin will look at our context object.

All plugins are small Python scripts, where small is 10~30 LOC

Simply return 2 or 3 things.
state*: True, False, None or 0.0 - 1.0
message*: short string
data: Python dict of interesting things.
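One way a runner could normalise what a plugin returns, sketched under assumptions that are not from the talk: True counts as 1.0, False as 0.0, and None means the check does not apply. The function name `normalise` is hypothetical.

```python
def normalise(result):
    """Turn a plugin's 2- or 3-item return value into (score, message, data).

    Assumptions: True -> 1.0, False -> 0.0, None -> no score (not applicable).
    """
    if len(result) == 2:
        state, message = result
        data = {}
    elif len(result) == 3:
        state, message, data = result
    else:
        raise ValueError("expected (state, message) or (state, message, data)")

    if state is None:
        score = None
    elif isinstance(state, bool):
        score = 1.0 if state else 0.0
    else:
        score = float(state)
        if not 0.0 <= score <= 1.0:
            raise ValueError("numeric state must be between 0.0 and 1.0")
    return score, message, data
```

Treating every state as a number between 0.0 and 1.0 keeps the later tally simple, and partial credit comes for free.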

Example fetch plugin

# r is assumed to be the requests library (import requests as r)
@ssc.tags("ownership")
def fetch_ownership(service_name):
    "Fetch all the ownership data of a service"
    o = r.get("http://owners/" + service_name)
    return True, "gathered data", o.json()


Example check plugin

@ssc.weight(5)
@ssc.tags('ownership')
@ssc.wiki('http://wiki/ssc_eng_owner')
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"
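The internals of the `ssc` module are not shown in the talk. As a sketch only, decorators like these could simply attach metadata to the plugin function, which a runner then reads back. Everything below (`_annotate`, `ssc_meta`, `run_check`, and the stand-in `ssc` namespace) is hypothetical.

```python
from types import SimpleNamespace


def _annotate(**meta):
    """Build a decorator that stores metadata on the plugin function."""
    def decorator(func):
        func.ssc_meta = {**getattr(func, "ssc_meta", {}), **meta}
        return func
    return decorator


# Stand-in for the real ssc module, which the talk does not show.
ssc = SimpleNamespace(
    weight=lambda value: _annotate(weight=value),
    tags=lambda *names: _annotate(tags=names),
    wiki=lambda url: _annotate(wiki=url),
)


@ssc.weight(5)
@ssc.tags("ownership")
@ssc.wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


def run_check(check, ctx):
    """Run one check and bundle its result with its metadata."""
    state, message = check(ctx)[:2]
    return {"name": check.__name__, "state": state, "message": message,
            **check.ssc_meta}


# Two tiny contexts: one with an owner, one without.
good = SimpleNamespace(ownership=SimpleNamespace(eng_team="jobs-team"))
bad = SimpleNamespace(ownership=SimpleNamespace(eng_team=None))
```

Keeping the weight, tags and wiki link on the function itself means a failing check can point straight at its fix-it page without any separate registry.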


Putting it all together

Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD

Problems
Understanding what is BAD

[Screenshots: Service Scorecard]

Problems
Understanding what is BAD
Knowing why it is BAD

[Screenshots: Service Scorecard]

Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD


What is the check?
Why is it important?
How long will it take to fix?
How will it be fixed?

AngularJS
image: CC BY 4.0 https://angular.io/presskit.html (2017)

{{service_name}} becomes jobs-server

{{context.ownership.eng_owner}} becomes jobs-team
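The substitution itself can be illustrated in a few lines of Python. This is only a sketch of the idea: the talk does the rendering in the browser with AngularJS, and the `render` and `lookup` helpers here are hypothetical.

```python
import re

# Matches {{ dotted.path }} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")


def lookup(scope, dotted_path):
    """Follow a dotted path such as 'context.ownership.eng_owner' through nested dicts."""
    value = scope
    for part in dotted_path.split("."):
        value = value[part]
    return value


def render(template, scope):
    """Replace every {{path}} with its value; unknown paths are left untouched."""
    def substitute(match):
        try:
            return str(lookup(scope, match.group(1)))
        except (KeyError, TypeError):
            return match.group(0)
    return PLACEHOLDER.sub(substitute, template)


scope = {
    "service_name": "jobs-server",
    "context": {"ownership": {"eng_owner": "jobs-team"}},
}
page = render("{{service_name}} is owned by {{context.ownership.eng_owner}}", scope)
```

Here `page` ends up as "jobs-server is owned by jobs-team", so one wiki page can serve every service.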

Using our fetched data in the wiki

{{service_name}}

var query = $location.search();
var service_name = query['service_name'];
var url = 'http://ssc/api/' + service_name;
$http.get(url).success(
  function(ctx) {
    $scope.ctx = ctx;
  }
);


{{ctx.ownership.owner_eng}}
{{ctx.number_of_hosts}}
{{ctx.product.lib.jetty.version}}
{{ctx.hosts.hostnames}}
{{ctx.is_deployed_in_prod}}
{{ctx.commits.last_commit}}

Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD

Now
Reports show what is BAD
Checks validate why it is BAD
Wiki shows how to fix the BAD

No more of these emails
“If you use a lib-core, then upgrade it, we found a bug”

How many of my 80 services use this lib?
How do I check?
How do I upgrade?
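With the fetched contexts at hand, the first two questions become a small query. A sketch, assuming each context keeps library versions under `product.lib.<name>.version` (modelled on the `ctx.product.lib.jetty.version` template) and that versions are plain dotted numbers. The version numbers and the second and third service names are made up.

```python
def parse_version(text):
    """Turn '9.2.10' into (9, 2, 10) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def services_needing_upgrade(contexts, lib_name, fixed_in):
    """Names of services that use lib_name at a version older than fixed_in."""
    wanted = parse_version(fixed_in)
    affected = []
    for service_name, ctx in contexts.items():
        lib = ctx.get("product", {}).get("lib", {}).get(lib_name)
        if lib is not None and parse_version(lib["version"]) < wanted:
            affected.append(service_name)
    return sorted(affected)


# Hypothetical contexts for three services.
contexts = {
    "jobs-server": {"product": {"lib": {"lib-core": {"version": "1.4.2"}}}},
    "companies-server": {"product": {"lib": {"lib-core": {"version": "1.10.0"}}}},
    "recruiter-server": {"product": {"lib": {}}},
}
affected = services_needing_upgrade(contexts, "lib-core", "1.5.0")
```

Comparing versions as integer tuples avoids the trap where "1.10.0" sorts before "1.5.0" as a string.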

Where does this tool fit?

[Diagram: pre-commit, Build, Deployment, Monitoring; Service Scorecard; API]

[Diagram: Service Scorecard; API; hack-days, Reporting, Deployment, Monitoring]

Results &
Outcomes

What do we do with the scores?

Priority #1: Getting the grades better

                               When we started   Now
Average grade for my team      40%               80%
Average score across SRE       35%               60%
Checks in 24 hours             15,560            89,859
Number of checks per service   15                31

We can now explore new ways to use the scores

Carrot &
Stick

Carrot / GOOD
Stick / BAD

No SRE support for
F Grade services.

F Grade services generally cause the
most problems.

No deploy moratorium for
A+ services

A+ services generally cause the
least problems.

A services are allowed to deploy 24/7

Premium SRE support for A+ services

Priority build queues for
GOOD services.

Tiger teams to raise the scores on
F Grade services

Hack Days

FREE BEER

Basically any problem can be solved with
FREE BEER

OR T-Shirts

Influence where we allocate open headcount

Simple way to get things done

Takeaways &
Lessons Learnt

Everyone cares about Reliability.

Everyone cares about Reliability,
Everyone is a Site Reliability Engineer.

Everyone cares about Reliability,
You just need to empower them.

Hack Days are important,
This POC was built in an afternoon.

Getting the data was easy,
Finding interesting ways to use it is hard.

Make it as easy as possible to do the right thing.

Cheers!

Q & A