I'm No Hero: Full Stack Reliability at LinkedIn

Upload: todd-palino

Post on 23-Jan-2017

TRANSCRIPT

Page 1: I'm No Hero: Full Stack Reliability at LinkedIn

I’m No Hero
Full Stack Reliability at LinkedIn

Page 2: I'm No Hero: Full Stack Reliability at LinkedIn

SITE RELIABILITY ENGINEERING ©2016 LinkedIn Corporation. All Rights Reserved.

Todd Palino
Staff Site Reliability Engineer
LinkedIn, Data Infrastructure Streaming

Page 3: I'm No Hero: Full Stack Reliability at LinkedIn

What is Site Reliability Engineering?

Page 4: I'm No Hero: Full Stack Reliability at LinkedIn

Types of SRE

Embedded

Central (or Production SRE)

Tools and Infrastructure

Page 5: I'm No Hero: Full Stack Reliability at LinkedIn
Page 6: I'm No Hero: Full Stack Reliability at LinkedIn

We Can’t Do It Alone

The Kafka SRE team is 3 people in the US, and 1.5 SREs in Bangalore

We manage over 6000 application instances
– 100 Kafka clusters, with 1800 brokers
– Over 1 trillion messages a day

The environment is never static from one day to the next
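
For a rough sense of that scale, the figures on this slide can be turned into averages. This is a back-of-the-envelope sketch, not a measurement: it treats "over 6000" and "over 1 trillion" as exact floors, counts the Bangalore team as 1.5 people, and assumes traffic is spread evenly across brokers and across the day, which real clusters will not do.

```python
# Figures taken from the slide; everything derived below is an average.
SRES = 3 + 1.5                          # US team plus Bangalore
INSTANCES = 6000                        # "over 6000" application instances
BROKERS = 1800
MESSAGES_PER_DAY = 1_000_000_000_000    # "over 1 trillion"

SECONDS_PER_DAY = 24 * 60 * 60

instances_per_sre = INSTANCES / SRES
messages_per_second = MESSAGES_PER_DAY / SECONDS_PER_DAY
# Assumes an even spread across brokers, which is a simplification.
messages_per_broker_per_second = messages_per_second / BROKERS

print(f"{instances_per_sre:.0f} instances per SRE")
print(f"{messages_per_second:,.0f} messages/s across all clusters")
print(f"{messages_per_broker_per_second:,.0f} messages/s per broker")
```

Even as averages, more than a thousand instances per engineer and millions of messages per second suggest why the rest of the stack has to carry much of the load.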

Page 7: I'm No Hero: Full Stack Reliability at LinkedIn

Maslow’s Hierarchy

Page 8: I'm No Hero: Full Stack Reliability at LinkedIn

Todd’s Hierarchy of Reliability

Page 9: I'm No Hero: Full Stack Reliability at LinkedIn

Infrastructure as a Service

SREs do not deploy hardware and OS

Production Operations
– Datacenter Technicians
– Systems Operations
– Network Operations

Provide all basic OS and network services

There is still tweaking for individual applications

Page 10: I'm No Hero: Full Stack Reliability at LinkedIn

Common Repositories

All source code and configurations are committed to one place

Subversion and Git centrally managed

Consistent management
– Precommit checks
– ACLs and Review boards

Connects directly to the build systems
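
As an illustration of what a precommit check might look like, here is a minimal sketch. The slides do not say which checks LinkedIn runs; the function name `precommit_violations` and both rules (JSON configuration must parse, no leftover merge-conflict markers) are hypothetical examples chosen for this sketch.

```python
import json


def precommit_violations(changed_files):
    """Return human-readable problems; an empty list means the commit may proceed.

    changed_files maps a path to the file's new text content.
    """
    problems = []
    for path, text in sorted(changed_files.items()):
        # Hypothetical rule 1: JSON configuration must parse.
        if path.endswith(".json"):
            try:
                json.loads(text)
            except json.JSONDecodeError as err:
                problems.append(f"{path}: invalid JSON ({err.msg})")
        # Hypothetical rule 2: no unresolved merge-conflict markers.
        if any(line.startswith("<<<<<<<") for line in text.splitlines()):
            problems.append(f"{path}: unresolved merge conflict marker")
    return problems


if __name__ == "__main__":
    sample = {
        "conf/app.json": '{"port": 8080,}',   # trailing comma is invalid JSON
        "src/main.py": "print('ok')\n",
    }
    for problem in precommit_violations(sample):
        print(problem)
```

Because source and configuration live in one place, a check like this can be applied to every commit rather than being reimplemented per team.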

Page 11: I'm No Hero: Full Stack Reliability at LinkedIn

Containerization

Most of our stack is Java
– Python is well-supported
– Always a few one-offs

Java applications have Tomcat and Jetty containers
– Hooks for monitoring
– Client libraries are managed by the team that owns the application

Provides a consistent control surface for applications

Page 12: I'm No Hero: Full Stack Reliability at LinkedIn

Build and Deployment

When code is committed, it is automatically built
– Successes become deployment artifacts
– Failures are tracked via Jira

Build systems are centrally managed

Common tools
– Dependency management and introspection
– Version management
– Error budgeting
– Deployment
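
The slide lists error budgeting without saying how the tooling computes it. As a sketch of the commonly used availability-based definition (an assumption here, not a description of LinkedIn's tool), the budget is the share of a time window that a service may be unavailable while still meeting its target. The function names and the 30-day default window are illustrative.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of unavailability allowed by an availability target over a window."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target, downtime_minutes, window_days=30):
    """Fraction of the budget still unspent; negative means the budget is blown."""
    budget = error_budget_minutes(slo_target, window_days)
    if budget == 0:
        raise ValueError("a 100% target leaves no budget to spend")
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    budget = error_budget_minutes(0.999)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print(f"after 10.8 minutes down, {budget_remaining(0.999, 10.8):.0%} remains")
```

A 99.9% target over 30 days works out to 43.2 minutes, so a deployment tool could plausibly hold back risky releases once most of that has been spent.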

Page 13: I'm No Hero: Full Stack Reliability at LinkedIn

Monitoring

Monitoring, graphing, and alerting as a service

Completely self-service
– Applications annotate metrics and they are automatically collected
– Monitoring dashboards can be created by anyone

Automatic metrics and dashboards for common features
– HTTP servers, system and OS metrics
– Client libraries (such as Kafka)

Additional metrics can be published outside the container
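
To make "applications annotate metrics and they are automatically collected" concrete, here is a minimal sketch of the pattern. LinkedIn's stack is mostly Java and its actual annotation mechanism is not shown in the slides; the `metric` decorator, the `_REGISTRY` dictionary, and the metric names below are all hypothetical.

```python
import time

# Hypothetical in-process registry: metric name -> zero-argument callable.
_REGISTRY = {}


def metric(name):
    """Annotation that registers a function as the source of a named metric."""
    def decorator(func):
        _REGISTRY[name] = func
        return func
    return decorator


def collect():
    """What a collector might do on each pass: read every registered metric."""
    return {name: read() for name, read in sorted(_REGISTRY.items())}


class RequestStats:
    def __init__(self):
        self.requests = 0
        self.started = time.monotonic()

    def handle(self):
        self.requests += 1


stats = RequestStats()


@metric("http.requests.total")
def requests_total():
    return stats.requests


@metric("process.uptime.seconds")
def uptime_seconds():
    return time.monotonic() - stats.started


if __name__ == "__main__":
    for _ in range(3):
        stats.handle()
    print(collect())
```

The application author only adds the annotation; discovery and collection happen without any per-application monitoring configuration, which is what makes the service self-service.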

Page 14: I'm No Hero: Full Stack Reliability at LinkedIn

Site Up

Page 15: I'm No Hero: Full Stack Reliability at LinkedIn

Site Up

With the stack supporting it, applications sit on top
– SREs architect and run the application
– SRE and developers respond to failures

The NOC monitors high-level metrics
– Overall site health and growth metrics
– They also coordinate incident response

Incident response is blameless
– Fix the problem, don’t fix the blame

Page 16: I'm No Hero: Full Stack Reliability at LinkedIn

Review and Revise

All components are constantly improving
– Incidents expose issues in the infrastructure
– Feedback from usage of the tools

Steering committees discuss large-scale changes
– Production Operations, SRE, and Development all have their own
– Composed of individual contributors, not managers

Open collaboration
– Common repositories mean everyone can help

Page 17: I'm No Hero: Full Stack Reliability at LinkedIn