-
Gamifying Operational Excellence
The Service Score Card
-
Agenda
1 The Problem
2 A Solution idea
3 A Solution tour
4 The results
5 Takeaways, Lessons Learnt & Questions
-
about me
Danny ☃ Lawrence
“If it's not broken, I’ll fix it.”
From Australia, on loan as
Staff SRE @ LinkedIn
jobs, companies, recruiter
& Finder of encoding bugs
-
Good news, SREcon.
You passed the ☃ test.
-
Some terms (before we really get started)
-
Operational Excellence
Effective and efficient delivery of information, technology, and services required by end users
that add measurable value.
-
Operational Excellence
Doing everything required to make sure
all of your services are as fast and as reliable as possible.
-
Gamification
Application of game-design elements and game
principles in non-game contexts.
-
Some background (LinkedIn SRE crash course)
-
Mostly Java
Multitudes of services
Doing lots of things
Service-oriented architecture
Everything talks to everything
My direct team looks after 80+ services
We have 200+ SREs
-
The Problem (What started this whole thing)
-
Problem 1: The GOOD
& The BAD
-
BAD services wake me up
-
GOOD services let me sleep
-
What makes a GOOD service at LinkedIn is a moving target.
-
Technologies and dependencies change
over time.
-
Some examples
Upgrading dependencies & libraries
Java / Jetty / Play / Tomcat
Correct usage of TLS
Switching databases / caches
Migrate from SVN to Git
Reduce application startup time
Set up error budgeting
True up the number of metrics
-
A GOOD service can turn into a BAD service
if you are not checking it.
-
Unfortunately, BAD services
do not magically turn into
GOOD services.
-
Problem 2: Knowing what is BAD
-
Problem 3: Knowing why it’s BAD
-
Problem 4: Tribal knowledge
about how to get to GOOD
-
The only thing we appear to hate more than not having documentation...
is writing documentation.
-
The Problem (summary)
-
BAD services wake me up
Time will cause GOOD to turn BAD
Hard to know what is BAD
Hard to know why it is BAD
Not sure how to fix the BAD
-
The Service Scorecard (A solution)
-
In order to determine the health
of the services we support,
we define a list of production requirements.
-
Apply a weight to each requirement
-
Codify each requirement into a check.
-
Execute these checks for each service
-
Tally up the results for each service.
-
Grade the service from “F” to “A+”
-
Add all the services into a high-score system
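
The steps above (weight each requirement, run the checks, tally, grade, rank) can be sketched roughly as follows. This is a minimal illustration, not the actual Service Scorecard code: the `CheckResult` type, the function names and especially the grade boundaries are assumptions, since the talk only says grades run from "F" to "A+".

```python
from dataclasses import dataclass


@dataclass
class CheckResult:
    name: str
    weight: int    # importance assigned to the requirement
    passed: bool   # outcome of the codified check


# Hypothetical boundaries (lower bound in percent, grade), best grade first.
GRADE_BOUNDARIES = [(95, "A+"), (85, "A"), (70, "B"), (55, "C"), (40, "D"), (0, "F")]


def score(results):
    """Return the weighted percentage of passed checks (0.0 to 100.0)."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in results if r.passed)
    return 100.0 * earned / total


def grade(percent):
    """Map a percentage onto a letter grade using the boundaries above."""
    for lower, letter in GRADE_BOUNDARIES:
        if percent >= lower:
            return letter
    return "F"


def high_scores(services):
    """Rank services (name -> list of CheckResult) from best to worst."""
    table = [(name, score(results)) for name, results in services.items()]
    table.sort(key=lambda row: row[1], reverse=True)
    return [(name, pct, grade(pct)) for name, pct in table]
```

With weights in place, one failed heavy check costs more than several failed light ones, which is the point of weighting each requirement.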
-
Then publish those scores to the company
-
This is great, but how do I improve the score?
How can I add X check into the system?
-
What makes a check?
-
Checks are one type of plugin.
Fetch plugins gather data.
Check plugins check the data.
-
We use the fetch plugin to gather remote data from:
SVN, Git, configuration DBs, host databases, monitoring systems, build systems, deployment systems.
-
Basically, if we can fetch it,
then we do so.
-
We build a giant context object.
-
The check plugin will look at our context object.
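
A rough sketch of that flow: fetch plugins run first, their data is merged into one context object, and a check plugin then reads from it. The plugin registry, the canned data and the `build_context` helper are assumptions made for illustration; the real fetch plugins call remote systems.

```python
from types import SimpleNamespace


def fetch_ownership(service_name):
    # Stand-in for a real fetch plugin: returns canned data instead of
    # calling a remote ownership system.
    return True, "gathered data", {"eng_team": "jobs-team"}


def fetch_hosts(service_name):
    # Second stand-in plugin with made-up host data.
    return True, "gathered data", {"hostnames": ["host1", "host2"]}


# Hypothetical registry: context key -> fetch plugin.
FETCH_PLUGINS = {"ownership": fetch_ownership, "hosts": fetch_hosts}


def to_namespace(value):
    """Recursively convert dicts so checks can write ctx.a.b instead of ctx["a"]["b"]."""
    if isinstance(value, dict):
        return SimpleNamespace(**{k: to_namespace(v) for k, v in value.items()})
    return value


def build_context(service_name):
    """Run every fetch plugin and merge the successful results into one object."""
    data = {"service_name": service_name}
    for key, plugin in FETCH_PLUGINS.items():
        state, message, payload = plugin(service_name)
        if state:
            data[key] = payload
    return to_namespace(data)


def check_eng_team(ctx):
    """Check plugin: it only looks at the context, it never fetches anything."""
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


ctx = build_context("jobs-server")
print(check_eng_team(ctx))
```

Keeping fetching and checking separate means each remote system is queried once per service, however many checks read the result.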
-
All plugins are small Python scripts,
where small is 10~30 LOC
-
Simply return 2 or 3 things.
state*: True, False, None or 0.0 - 1.0
message*: short string
data: Python dict of interesting things.
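
One way a runner might normalise that return value is sketched below. The `normalize` helper is hypothetical, and treating `None` as "not applicable" is an assumption; the talk does not say how `None` is scored.

```python
def normalize(result):
    """Split a plugin's 2- or 3-tuple into (score, message, data).

    score is a float in [0.0, 1.0], or None when the plugin returned None
    as its state (assumed here to mean "not applicable").
    """
    if len(result) == 2:
        state, message = result
        data = {}
    else:
        state, message, data = result

    if state is None:
        score = None
    elif isinstance(state, bool):
        score = 1.0 if state else 0.0
    else:
        # Partial credit: clamp numeric states into the 0.0 - 1.0 range.
        score = min(1.0, max(0.0, float(state)))
    return score, message, data
```

For example, `normalize((0.25, "1 of 4 hosts upgraded"))` keeps the partial credit, while `normalize((True, "ok"))` becomes a full 1.0.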
-
Example fetch plugin
-
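import requests as r  # assumption: "r" on the slide is the requests library

# ssc is the Service Scorecard plugin module used on the slides
@ssc.tags("ownership")
def fetch_ownership(service_name):
    "Fetch all the ownership data of a service"
    o = r.get("http://owners/" + service_name)
    return True, "gathered data", o.json()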
-
Example check plugin
-
@ssc.weight(5)
@ssc.tags('ownership')
@ssc.wiki('http://wiki/ssc_eng_owner')
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"
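
The talk does not show how the `ssc` decorators are implemented. One plausible approach, sketched here under that caveat, is for each decorator to attach metadata to the function so the runner can read the weight, tags and wiki link later. Everything below is a hypothetical stand-in, not the real `ssc` module.

```python
from types import SimpleNamespace


def weight(value):
    """Record how much the check counts towards the score."""
    def decorator(func):
        func.weight = value
        return func
    return decorator


def tags(*names):
    """Record the tags of a plugin (how tags are used is not stated in the talk)."""
    def decorator(func):
        func.tags = set(names)
        return func
    return decorator


def wiki(url):
    """Record the wiki page that explains the check."""
    def decorator(func):
        func.wiki = url
        return func
    return decorator


@weight(5)
@tags("ownership")
@wiki("http://wiki/ssc_eng_owner")
def check_eng_team(ctx):
    "ensure ENG ownership of a service"
    if ctx.ownership.eng_team:
        return True, ctx.ownership.eng_team
    return False, "missing eng_team"


# The function still behaves normally; the metadata rides along with it.
example_ctx = SimpleNamespace(ownership=SimpleNamespace(eng_team="jobs-team"))
print(check_eng_team.weight, check_eng_team.wiki, check_eng_team(example_ctx))
```

Because the decorators return the original function unchanged, a check stays a plain function that can be called and tested on its own.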
-
Putting it all together
-
Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD
-
Problems
Understanding what is BAD
-
[Image-only slides: Service Scorecard]
-
Problems
Understanding what is BAD
Knowing why it is BAD
-
[Image-only slides: Service Scorecard]
-
Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD
-
What is the check?
Why is it important?
How long will it take to fix?
How will it be fixed?
-
AngularJS
Image: CC BY 4.0, https://angular.io/presskit.html (2017)
https://creativecommons.org/licenses/by/4.0/
-
{{service_name}} becomes jobs-server
-
{{context.ownership.eng_owner}} becomes jobs-team
-
Using our fetched data in the wiki
-
{{service_name}}
-
{html}
https://cdn/angularjs.js
{html}
-
var query = $location.search();
var service_name = query['service_name'];
var url = 'http://ssc/api/' + service_name;
// .success() was removed in AngularJS 1.6; .then() works across 1.x versions
$http.get(url).then(
  function(response) {
    $scope.ctx = response.data;
  }
);
-
{{ctx.ownership.owner_eng}}
{{ctx.number_of_hosts}}
{{ctx.product.lib.jetty.version}}
{{ctx.hosts.hostnames}}
{{ctx.is_deployed_in_prod}}
{{ctx.commits.last_commit}}
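
On the wiki this substitution is done by AngularJS in the browser. The idea itself, replacing `{{dotted.path}}` placeholders with values from the fetched context, can be illustrated in a few lines of Python. This is only a sketch of the concept with hypothetical helper names; it does not reproduce AngularJS expression evaluation.

```python
import re

# Matches {{ name }} or {{ a.b.c }} placeholders.
PLACEHOLDER = re.compile(r"\{\{\s*([A-Za-z_][\w.]*)\s*\}\}")


def resolve(data, dotted_path):
    """Walk nested dicts following a dotted path such as 'ctx.ownership.owner_eng'."""
    current = data
    for part in dotted_path.split("."):
        current = current[part]
    return current


def render(template, data):
    """Replace every known placeholder; leave unknown ones untouched."""
    def substitute(match):
        try:
            return str(resolve(data, match.group(1)))
        except (KeyError, TypeError):
            return match.group(0)
    return PLACEHOLDER.sub(substitute, template)


page = "{{service_name}} is owned by {{ctx.ownership.owner_eng}}"
data = {
    "service_name": "jobs-server",
    "ctx": {"ownership": {"owner_eng": "jobs-team"}},
}
print(render(page, data))
```

One wiki page per check can then serve every service: the page text stays generic and the fetched context fills in the service-specific details.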
-
Problems
Understanding what is BAD
Knowing why it is BAD
Not sure how to fix the BAD
-
Now
Reports show what is BAD
Checks validate why it is BAD
Wiki shows how to fix the BAD
-
No more of these emails
“If you use lib-core, then upgrade it, we found a bug”
-
How many of my 80 services use this lib?
How do I check?
How do I upgrade?
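
With a context object stored per service, the first two questions become a query over fetched data. The sketch below assumes each context is a nested dict shaped like the `ctx.product.lib.jetty.version` field shown earlier and that versions are purely numeric and dot-separated; the function names and sample data are made up.

```python
def parse_version(text):
    """Turn '1.2.0' into (1, 2, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def services_using(contexts, lib, below=None):
    """Return {service_name: version} for services that depend on `lib`.

    If `below` is given, only services on an older version are returned.
    """
    hits = {}
    for name, ctx in contexts.items():
        info = ctx.get("product", {}).get("lib", {}).get(lib)
        if info is None:
            continue  # this service does not use the library
        version = info["version"]
        if below is None or parse_version(version) < parse_version(below):
            hits[name] = version
    return hits


# Made-up contexts for three services.
contexts = {
    "jobs-server": {"product": {"lib": {"lib-core": {"version": "1.2.0"}}}},
    "search": {"product": {"lib": {"lib-core": {"version": "1.4.1"}}}},
    "feed": {"product": {"lib": {}}},
}
print(services_using(contexts, "lib-core", below="1.3.0"))
```

The same condition can be wrapped in a check plugin, so an out-of-date library shows up in the service's score rather than in an email.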
-
Where does this tool fit?
-
pre-commit Build Deployment Monitoring
Service Scorecard
API
-
Service Scorecard
API
hack-days Reporting Deployment Monitoring
-
Results &
Outcomes
-
What do we do with the scores?
-
Priority #1: Getting the grades better
-
                               When we started   Now
Average grade for my team      40%               80%
Average score across SRE       35%               60%
Checks in 24 hours             15,560            89,859
Number of checks per service   15                31
-
We can now explore new ways to use the scores
-
Carrot &
Stick
-
Carrot / GOOD
Stick / BAD
-
No SRE support for
F Grade services.
-
F Grade services generally cause the
most problems.
-
No deploy moratorium for
A+ services
-
A+ services generally cause the
least problems.
-
A services are allowed to deploy 24/7
-
Premium SRE support for A+ services
-
Priority build queues for
GOOD services.
-
Tiger teams to raise the scores on
F Grade services
-
Hack Days
-
FREE BEER
-
Basically any problem can be solved with
FREE BEER
-
OR T-Shirts
-
Influence where we allocate open headcount
-
Simple way to get things done
-
Takeaways &
Lessons Learnt
-
Everyone cares about Reliability.
-
Everyone cares about Reliability,
Everyone is a Site Reliability Engineer.
-
Everyone cares about Reliability,
You just need to empower them.
-
Hack Days are important,
This POC was built in an afternoon.
-
Getting the data was easy,
Finding interesting ways to use it is hard.
-
Make it as easy as possible to do the right thing.
-
Cheers!
-
Q & A