wtf is a microservice - rafael schloming, datawire

WTF is a microservice?

Rafael Schloming

Co-Founder & Chief Architect

datawire.io

History

Datawire

● Founded in 2014
● Focused on microservices

Me

● Lots of distributed systems experience
● Starting from zero with microservices

What is a microservice?

Wikipedia: “...no industry consensus…”

● “...implementation approach for SoA…”
● “...processes that communicate with each other to fulfill a goal…”
● “...Naturally enforces a modular structure...”

Everything else:

● Volumes of essays… good, bad, and ugly...

Three aspects of Microservices

Technology

Process

People

From Three Sources

Experts

Bootstrapping

Migrating

Starting Point

Technical:

● An application composed of a network of small services
● Building your application from microservices forces you to create clear boundaries, better abstractions, ...

Process:

● ???

People:

● ???

The Expert Source

Read just about every firsthand story out there

Went to conferences

Talked to everyone we could

Started the practitioner summit

And armed with a little bit of knowledge, we started filling in our picture…

People Picture

Developer Happiness/Tooling/Platform Team

● Builds the infrastructure

Service teams

● Build the features

Technical Picture

Control Plane

● Service Discovery
● Logging + Metrics
● Configuration
● Smart Endpoints

Traffic Layer

● HTTP
● RPC
● Messaging

Reference Architecture

First Picture

Technical:

● A network of small services
● Connected via a control plane and traffic layer

Process:

● ???

People:

● Platform team and service teams

The Bootstrap Perspective

Five engineers building an out of the box control plane...

Ingest interesting application level events:
● start, stop, heartbeat events
● log messages

Store them in an appropriate piece of infrastructure:
● Service registry
● Log store

Transform and Present:
● Realtime view of: routing table, service health
● Historic view of: request traces, ...
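A minimal sketch of the ingest, store, and present steps above. This is an illustration, not Datawire's implementation: the `Event` fields, the `ControlPlane` class, and the 30-second heartbeat timeout are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One application-level event; field names are illustrative."""
    kind: str          # "start", "stop", "heartbeat", or "log"
    service: str       # logical service name
    address: str       # node address, e.g. "10.0.0.5:8080"
    timestamp: float   # seconds, from whatever clock the caller uses
    message: str = ""  # only meaningful for "log" events


@dataclass
class ControlPlane:
    heartbeat_timeout: float = 30.0  # assumed liveness window, in seconds
    registry: dict = field(default_factory=dict)   # (service, address) -> last seen
    log_store: list = field(default_factory=list)  # append-only list of log events

    def ingest(self, event: Event) -> None:
        """Store each event in the piece of infrastructure that fits it."""
        key = (event.service, event.address)
        if event.kind in ("start", "heartbeat"):
            self.registry[key] = event.timestamp
        elif event.kind == "stop":
            self.registry.pop(key, None)
        elif event.kind == "log":
            self.log_store.append(event)
        else:
            raise ValueError(f"unknown event kind: {event.kind}")

    def routing_table(self, now: float) -> dict:
        """Present a realtime view: live addresses grouped by service."""
        table: dict = {}
        for (service, address), last_seen in self.registry.items():
            if now - last_seen <= self.heartbeat_timeout:
                table.setdefault(service, []).append(address)
        return {service: sorted(addrs) for service, addrs in table.items()}
```

A node that stops sending heartbeats simply ages out of the routing table, so an explicit stop event is an optimization rather than a requirement.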

Ubiquitous Data Processing Pipeline

Ingest -> Source of Truth -> Transform -> Present

Template for many data driven businesses…

V1: Started with Discovery

Requirements:
● highly available
● low throughput
● low latency
● low operational complexity
● able to survive a complete restart
● capable of handling spikes

Initial Choices:
● vert.x + hazelcast
● websockets
● smart clients
● auth0 + python shim

Total Services: 2

V2: Added Tracing (PoC)

Requirements:
● high throughput
● highish latency ok
● cannot impact application

Initial choices:
● vert.x, hazelcast (only retained transient buffer of last 1000 log messages)
● websockets
● smart circular buffer minimized impact on application

Total Services: 3
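One way to picture a circular buffer that protects the application: a bounded queue that silently overwrites the oldest message instead of blocking or growing. A sketch only; the `TraceBuffer` name, the `dropped` counter, and the `drain` method are hypothetical, and the slides do not say how the real buffer worked.

```python
from collections import deque


class TraceBuffer:
    """Keeps only the most recent log messages; older ones are dropped."""

    def __init__(self, capacity: int = 1000):
        # A deque with maxlen discards from the opposite end when full.
        self._buffer = deque(maxlen=capacity)
        self.dropped = 0  # how many messages were overwritten

    def append(self, message: str) -> None:
        """Constant-time append that never blocks the caller."""
        if len(self._buffer) == self._buffer.maxlen:
            self.dropped += 1
        self._buffer.append(message)

    def drain(self) -> list:
        """Hand the buffered messages to a sender and empty the buffer."""
        messages = list(self._buffer)
        self._buffer.clear()
        return messages
```

The tradeoff is explicit: under load the tracer loses old messages, but memory use and per-call cost in the application stay fixed.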

V3: Added Persistence for Tracing

Requirements:
● keep extended history
● provide full text search
● filtering, sorting, etc

Initial Choices:
● elasticsearch for storage/search
● query service

Total Services: 4

First hint of pain...

Rerouting data pathways:
● touched multiple services
● coupled changes

Poor local dev experience:
● manually fire up and wire the whole fabric

Slow deployment pipeline:
● bunched up changes

All this resulted in a big scary cutover

V4: Adding Persistence for Discovery

Requirements:
● track errors associated with particular service nodes
● store routing strategies

Initial Choices:
● postgres (RDS) for persistence

Yet another big cutover… enough is enough!

Let’s fix our tooling once and for all...

Deployment Requirements

Stuff we had tried:

● Deliver everything as a docker image
  ○ Still too much wiring to bootstrap the system
● Use kubernetes for everything
  ○ Nice dev experience with minikube, but we use amazon services

Need to meet both dev & operational requirements…

● Fast dev cycle
● Good visibility
● Fast rollback
● Ability to leverage commodity services

Deployment Redesign

● Complete system definition in git
  ○ Contains all the information necessary to bootstrap the system from scratch in all of its operating environments…
● System definition is well factored with respect to its environments…
  ○ Abstract definition: “my service needs postgres and redis”
  ○ Development: service -> docker image, postgres -> docker image, redis -> docker image
    ■ Use minikube to run the whole system
  ○ Test: <same as dev for now>
  ○ Production: service -> docker image, postgres -> RDS, redis -> elasticache
    ■ Kubernetes cluster for stateless services
● Tooling caters to the needs of each environment
  ○ Development: fast feedback cycle
  ○ Test: repeatable environments
  ○ Production: quick and safe updates/rollbacks
● Tooling helps maintain environment parity
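The split between an abstract definition and per-environment bindings can be sketched as two lookup tables and a resolver. The mappings mirror the slide; the names `REQUIREMENTS`, `PROVIDERS`, `resolve`, and the service name `my-service` are hypothetical, and a real system definition would live in files in git rather than in Python dicts.

```python
# Abstract definition: what each service needs, independent of environment.
REQUIREMENTS = {"my-service": ["postgres", "redis"]}

# How each environment satisfies a requirement (taken from the slide;
# "test" is the same as development for now).
PROVIDERS = {
    "development": {"postgres": "docker image", "redis": "docker image"},
    "test": {"postgres": "docker image", "redis": "docker image"},
    "production": {"postgres": "RDS", "redis": "elasticache"},
}


def resolve(service: str, environment: str) -> dict:
    """Return the concrete plan for one service in one environment."""
    providers = PROVIDERS[environment]
    # The service itself ships as a docker image in every environment.
    plan = {service: "docker image"}
    for dependency in REQUIREMENTS[service]:
        plan[dependency] = providers[dependency]
    return plan
```

Because the service only declares what it needs, moving a dependency from a container to a managed service changes one table entry and nothing in the service definition.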

DevOps?

DevOps is presented as a solution to an organizational problem, but we all sat in the same room…

We were thinking about operational factors from day one:

● throughput, latency, availability, …
● building a service, not a server

This forced us to follow an incremental process:

● tooling for this process was inadequate
● when we thought about the process it helped us figure out the tooling

Process: Architecture vs Development (SoA vs SoD)

Systems (their shape in particular) are traditionally architected

Architecture

● lots of up front thinking
● slow feedback cycle

Development

● frequent small changes
● quick feedback cycle
● measure the impact at every step

Microservices are about enabling a developmental methodology for systems

Methodology for Developing Systems

Principles:
● small frequent changes
● rapid feedback and good visibility

Applied to codebases:
● Tooling for rapid feedback: compilers, incremental builds, test suites
● Tooling for good visibility: printf, logging, debuggers, profilers

Applied to systems:
● Key characteristics go beyond just logic and correctness
● Performance within specified tolerance of the running system is a critical feature

Tests don’t cut it anymore...

Update the Dev Cycle

Tests assess impact on correctness...

Build -> Test -> Deploy

We need a way to assess impact on the system…

Build -> Test -> Assess Impact -> Deploy

How do you measure system level impact?

● Measure impact against defined Service Level Objectives (SLOs):
  ○ throughput, latency, and availability (error rate)
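The "Assess Impact" stage can be pictured as a gate that compares measurements from the running system against the service's SLOs. A minimal sketch under stated assumptions: the `SLO` and `Measurement` types, their field names, and the use of p99 latency are illustrative choices, not something the slides specify.

```python
from dataclasses import dataclass


@dataclass
class SLO:
    min_throughput_rps: float  # requests per second the service must sustain
    max_p99_latency_ms: float  # assumed latency percentile for the example
    max_error_rate: float      # fraction of failed requests, 0.0 to 1.0


@dataclass
class Measurement:
    throughput_rps: float
    p99_latency_ms: float
    error_rate: float


def assess_impact(slo: SLO, observed: Measurement) -> list:
    """Return the violated objectives; an empty list means safe to proceed."""
    violations = []
    if observed.throughput_rps < slo.min_throughput_rps:
        violations.append("throughput")
    if observed.p99_latency_ms > slo.max_p99_latency_ms:
        violations.append("latency")
    if observed.error_rate > slo.max_error_rate:
        violations.append("availability")
    return violations
```

A pipeline would run this between Test and Deploy, for example against a canary or staging instance, and stop or roll back when the list is not empty.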

Back to the Experts...

● Canary Testing
● Circuit Breakers
● Dark Launching
● Tracing
● Metrics
● Deployment

All ways to enable the dev cycle for running systems:

● make small frequent changes
● measure the impact on the running system
● provide good visibility
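To make one item from the list above concrete, here is a sketch of a circuit breaker: after a run of failures it stops calling a dependency for a cool-down period, so a bad change in one service does less damage to its callers. The class name, thresholds, and injectable `clock` are assumptions for the example and are not taken from any particular library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=10.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds
        self.clock = clock                # injectable so tests can fake time
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
            # One more failure will reopen the circuit immediately.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result
```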

Second Picture

Technical:

● A network of small services
● Scaffolding to safely enable small frequent changes

Process:

● Service oriented Development
● Small frequent changes with good visibility and feedback

People:

● Platform team and service teams

The Migration Perspective

Variety of stages...

● Monolith: django, rails, ...
● Monolith++: mothership + several little ducklings
● SoA-ish: small flock of services (maybe 5-10)
● Inbetweeners…

Some moving really slowly...

● Months to create just one microservice…

Some moving much faster…

● What’s the difference?

Migration is about people

Starting point: team vs tech

● Picking a tech stack for the entire eng org to adopt is slow
  ○ lots of organizational friction
● Replatforming/refactoring an entire existing monolith is slow
  ○ lots of organizational and orchestrational friction

● Creating a relatively autonomous team to tackle a particular problem in the form of a service is faster

Growing pains: stability vs progress

● some orgs hit a sticking point, some didn’t

The People Picture: Dividing up the Work

The work has two aspects:
● build the features (dev)
● keep the system running (ops)

You can’t usefully divide up the work along these lines:
● new features are the biggest source of instability (bugs)
● separate roles create misaligned incentives ⇒ (devops)
● yet a big part of the work is keeping things running

Microservices is about how to go about dividing up work:
● break the big app into smaller ones
● divide operational responsibility in a way that aligns incentives

Third Picture

Technical:

● A network of small services
● Scaffolding to quickly and safely enable small frequent changes

Process:

● Service oriented Development
● Small frequent changes with good visibility and feedback

People:

● Dividing up the work
● Service teams deliver features to users
● Platform team supports service teams

The Hard Way

1. Start with Tech
2. Reverse Engineer The Process + People
3. Make lots of mistakes along the way
4. Learn from them

The Easy Way

1. Understand the principles of People and Process
2. Use this as a framework to
  a. pick tech that fits
  b. learn from other people's mistakes

Microservices Cheat Sheet (What, Why & How)

People

Microservices are a way to divide the work of building a cloud application

This work falls into two categories:
● Keep the system running (ops)
● Build new features (dev)

Dividing work along these categories creates conflicting incentives between progress and stability. New features from dev eventually become the biggest source of instability for ops.

Unifying these roles (devops) allows you to minimize the tradeoff between progress and stability, but you now need to divide up the work by dividing up the app. This results in a network of services.

Give your dev teams operational responsibility!

Define service level objectives & agreements for each service:
● SLOs: throughput, latency, availability
● SLAs: what happens when these aren’t met

Commoditize common operational overhead.

Process

Microservices are built from a process of frequent small changes with rapid feedback and good visibility

This is the application of the traditional dev cycle to systems rather than codebases, and for it to work, key system properties must become first-class features for developers.

Extend the dev cycle to include a stage to assess the impact on key system properties (SLOs)

Build -> Test -> Deploy ⇒ Build -> Test -> Assess Impact -> Deploy

Technology

Microservices are an application that is made up of a network of small services

This requires dev tooling to support quickly and safely assessing system impact.

This requires fast deployment tooling and good visibility into key system level properties:
● Throughput
● Latency
● Availability (error-rate)

Depending on your system, this may require tooling for:
● Fancy request routing (for canary testing, dark launching)

Start with a fast deployment pipeline that incorporates basic system level metrics and monitoring for each service.

Questions?

Microservices Cheat Sheet (What, Why & How)

People

Microservices are a way to divide the work of building a cloud application

Two aspects of work: keep it running (ops), build new features (dev)

Dividing by aspect creates conflicting incentives between progress and stability.

Unifying roles (devops) to minimize tradeoff... divide work by dividing the app

Give your dev teams operational responsibility!

Define service level objectives & agreements for each service:
● SLOs: throughput, latency, availability
● SLAs: what happens when these aren’t met

Commoditize common operational overhead.

Microservices Cheat Sheet (What, Why & How)

Process

Microservices are built from a process of frequent small changes with rapid feedback and good visibility

This is the application of the traditional dev cycle to systems rather than codebases, and for it to work, key system properties must become first-class features for developers.

Extend the dev cycle to include a stage to assess the impact on key system properties (SLOs)

Build -> Test -> Deploy ⇒ Build -> Test -> Assess Impact -> Deploy

Microservices Cheat Sheet (What, Why & How)

Technology

Microservices are an application that is made up of a network of small services

This requires dev tooling to support quickly and safely assessing system impact.

This requires fast deployment tooling and good visibility into key system level properties:
● Throughput
● Latency
● Availability (error-rate)

Depending on your system, this may require tooling for:
● Fancy request routing (for canary testing, dark launching)

Start with a fast deployment pipeline that incorporates basic system level metrics and monitoring for each service.
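As an illustration of the request routing mentioned above, canary testing can be as simple as sending a small, configurable fraction of traffic to the new version and watching its metrics. A sketch only: `pick_version`, `simulate`, and the weight values are hypothetical, and a real router would usually sit in a proxy or client library rather than in application code.

```python
import random


def pick_version(canary_weight: float, rng: random.Random) -> str:
    """Route one request: 'canary' with probability canary_weight, else 'stable'."""
    if not 0.0 <= canary_weight <= 1.0:
        raise ValueError("canary_weight must be between 0 and 1")
    return "canary" if rng.random() < canary_weight else "stable"


def simulate(requests: int, canary_weight: float, seed: int = 0) -> dict:
    """Count how many of `requests` land on each version."""
    rng = random.Random(seed)  # seeded so a run is repeatable
    counts = {"stable": 0, "canary": 0}
    for _ in range(requests):
        counts[pick_version(canary_weight, rng)] += 1
    return counts
```

Raising the weight in small steps, and dropping it back to zero when error rate or latency degrades, gives both the small frequent change and the fast rollback in one mechanism.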

DevOps: you can’t split the work (along these lines)

[Diagram labels: Dev, Ops, User, DevOps]

Features are the largest source of bugs

[Diagram labels: Dev, Ops, User]

Microservices: Divide the work by dividing the app

[Diagram labels: Dev, Infra, DevOps, User]

Dividing up Work

[Diagram labels: Dev, Infra, Ops, User]