I'm No Hero: Full Stack Reliability at LinkedIn

SITE RELIABILITY ENGINEERING ©2016 LinkedIn Corporation. All Rights Reserved.

Todd Palino
Staff Site Reliability Engineer
LinkedIn, Data Infrastructure Streaming

What is Site Reliability Engineering?


Types of SRE

Embedded

Central (or Production SRE)

Tools and Infrastructure


We Can’t Do It Alone

The Kafka SRE team is 3 people in the US, and 1.5 SREs in Bangalore

We manage over 6000 application instances
– 100 Kafka clusters, with 1800 brokers
– Over 1 trillion messages a day

The environment is never static from one day to the next
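To put the figures above in perspective, the sketch below works out a few averages from them. The inputs are the numbers on the slide, treated as lower bounds where the slide says "over"; the function name and the assumption that load is spread evenly across brokers are illustrative only, and real peaks will be higher than these averages.

```python
# Figures taken from the slide; "over N" values are treated as lower bounds.
APP_INSTANCES = 6000
CLUSTERS = 100
BROKERS = 1800
MESSAGES_PER_DAY = 1_000_000_000_000
SRE_HEADCOUNT = 3 + 1.5  # US team plus Bangalore

SECONDS_PER_DAY = 24 * 60 * 60


def scale_summary():
    """Return rough averages; assumes load is spread evenly (a simplification)."""
    msgs_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY
    return {
        "messages_per_second": msgs_per_sec,
        "messages_per_second_per_broker": msgs_per_sec / BROKERS,
        "brokers_per_cluster": BROKERS / CLUSTERS,
        "instances_per_sre": APP_INSTANCES / SRE_HEADCOUNT,
    }


if __name__ == "__main__":
    for name, value in scale_summary().items():
        print(f"{name}: {value:,.0f}")
```

On these inputs the averages come to roughly 11.6 million messages per second overall and more than 1,300 application instances per SRE, which is why the rest of the stack has to carry much of the load.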


Maslow’s Hierarchy


Todd’s Hierarchy of Reliability


Infrastructure as a Service

SREs do not deploy hardware and OS

Production Operations
– Datacenter Technicians
– Systems Operations
– Network Operations

Provide all basic OS and network services

There is still tweaking for individual applications


Common Repositories

All source code and configurations are committed to one place

Subversion and Git centrally managed

Consistent management
– Precommit checks
– ACLs and Review boards

Connects directly to the build systems
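As a rough illustration of what precommit checks combined with ACLs and review approval could look like, here is a minimal sketch. Everything in it is hypothetical: the `Change` type, the `ACLS` table, the user names, and the three rules are assumptions made for the example, not LinkedIn's actual checks.

```python
from dataclasses import dataclass, field


@dataclass
class Change:
    author: str
    path: str
    message: str
    approved_by: list = field(default_factory=list)


# Hypothetical ACL table: path prefix -> users allowed to commit under it.
ACLS = {
    "kafka/": {"alice", "bob"},
    "tools/": {"carol"},
}


def precommit_errors(change, acls=ACLS):
    """Return a list of reasons to reject the change (empty means accept)."""
    errors = []
    if not change.message.strip():
        errors.append("empty commit message")

    # Find the first ACL entry whose prefix covers the changed path.
    owners = next(
        (users for prefix, users in acls.items() if change.path.startswith(prefix)),
        None,
    )
    if owners is None:
        errors.append("no ACL covers path")
    elif change.author not in owners:
        errors.append("author not in ACL")

    # An author approving their own change does not count as a review.
    reviewers = [r for r in change.approved_by if r != change.author]
    if not reviewers:
        errors.append("no independent review approval")
    return errors
```

Because every repository goes through the same gate, a rule added here would apply to every team at once, which is the point of consistent management.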


Containerization

Most of our stack is Java
– Python is well-supported
– Always a few one-offs

Java applications have Tomcat and Jetty containers
– Hooks for monitoring
– Client libraries are managed by the team that owns the application

Provides a consistent control surface for applications


Build and Deployment

When code is committed, it is automatically built
– Successes become deployment artifacts
– Failures are tracked via Jira

Build systems are centrally managed

Common tools
– Dependency management and introspection
– Version management
– Error budgeting
– Deployment
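The commit-to-artifact flow described on this slide can be sketched as a single routing step. This is a simplified model, not the real build system: `BuildResult`, `route_build`, and the artifact naming scheme are hypothetical, and the ticket is a plain dictionary standing in for a Jira issue rather than a call to any Jira API.

```python
from dataclasses import dataclass


@dataclass
class BuildResult:
    commit: str
    succeeded: bool
    log: str = ""


def route_build(result, artifacts, tickets):
    """Successful builds become deployment artifacts; failures become tickets."""
    short = result.commit[:7]
    if result.succeeded:
        artifact = f"app-{short}.tar.gz"  # hypothetical naming scheme
        artifacts.append(artifact)
        return artifact
    # Stand-in for filing an issue in a tracker such as Jira.
    ticket = {"summary": f"Build failed for {short}", "detail": result.log}
    tickets.append(ticket)
    return ticket


if __name__ == "__main__":
    artifacts, tickets = [], []
    route_build(BuildResult("abc1234def", True), artifacts, tickets)
    route_build(BuildResult("9999999aaa", False, "compile error"), artifacts, tickets)
    print(artifacts)
    print(tickets)
```

Either way, every commit ends in something trackable: an artifact that can be deployed or a ticket that someone owns.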


Monitoring

Monitoring, graphing, and alerting as a service

Completely self-service
– Applications annotate metrics and they are automatically collected
– Monitoring dashboards can be created by anyone

Automatic metrics and dashboards for common features
– HTTP servers, system and OS metrics
– Client libraries (such as Kafka)

Additional metrics can be published outside the container
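One way to picture "annotate metrics and they are automatically collected" is a decorator that registers a metric the moment the code is loaded, so a collector can find it without extra wiring. The sketch below is an assumption about the general pattern, not LinkedIn's monitoring library: `REGISTRY`, `metered`, `collect`, and the metric name are all hypothetical.

```python
import functools
import time

# Hypothetical in-process registry that a collector would scrape.
REGISTRY = {}


def metered(name):
    """Annotate a function so calls, errors, and total latency are recorded."""
    def decorator(func):
        # Registering at decoration time makes the metric visible before any call.
        stats = REGISTRY.setdefault(
            name, {"calls": 0, "errors": 0, "total_seconds": 0.0}
        )

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            except Exception:
                stats["errors"] += 1
                raise
            finally:
                stats["calls"] += 1
                stats["total_seconds"] += time.perf_counter() - start
        return wrapper
    return decorator


def collect():
    """Snapshot every registered metric, as an automatic collector might."""
    return {name: dict(stats) for name, stats in REGISTRY.items()}


@metered("orders.lookup")
def lookup(order_id):
    if order_id < 0:
        raise ValueError("negative id")
    return {"id": order_id}
```

The application owner only adds the annotation; collection and dashboards can then be built on whatever `collect()` returns, which is what makes the service self-service.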


Site Up

With the stack supporting it, applications sit on top
– SREs architect and run the application
– SRE and developers respond to failures

The NOC monitors high-level metrics
– Overall site health and growth metrics
– They also coordinate incident response

Incident response is blameless
– Fix the problem, don’t fix the blame


Review and Revise

All components are constantly improving
– Incidents expose issues in the infrastructure
– Feedback from usage of the tools

Steering committees discuss large-scale changes
– Production Operations, SRE, and Development all have their own
– Composed of individual contributors, not managers

Open collaboration
– Common repositories mean everyone can help