cs 350 lecture 5-3 resilient design -...

48
CS 350 Lecture 5-3 Resilient Design Fall 2019 SoC, KAIST Doo-Hwan Bae [email protected]

Upload: others

Post on 22-May-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

CS 350 Lecture 5-3Resilient Design

Fall 2019

SoC, KAIST

Doo-Hwan Bae

[email protected]

Page 2: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Resilence Design Patterns

Resources

• https://docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency

• http://microservices.io/patterns/monolithic.html

• https://conferences.oreilly.com/software-architecture/sa-eu-2017/public/schedule/detail/61746

• https://www.thoughtworks.com/de/insights/blog/scaling-microservices-event-stream

CS350, SoC, KAIST 2

Page 3: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Contents

• What & Why?

• Resilient Patterns

“We will prepare for the armies of illogical users who do crazy, unpredictable things.” (by Michael Nygard)

CS350, SoC, KAIST 3

Page 4: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

What?

• Resilience:• Ability of a system to handle unexpected situations

• Best case: without the user noticing it

• Worst case: with a graceful degradation of service

• Part of design activity

CS350, SoC, KAIST 4

Page 5: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Why? (1/2)

• Distributed systems are everywhere

• Fallacies of distributed systems (wrong perception/assumption)• Network is reliable, secure, homogeneous• Zero latency• Infinite bandwidth• No change on topology• One administrator• …

• Failures in distributed systems are not the exception• Normal, and even worse is ‘not predictable’• What do we do with such systems?

• Option 1: Develop a fail-free system• Many internet-service companies give up this option!

• Option 2: Embrace failures and increase availability of the system

CS350, SoC, KAIST 5

Page 6: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Why? (2/2)

• It is getting worse and worse with recent IT evolution• Too complex to manage with traditional approaches

• Some of such system examples are• Cloud-based system

• Microservices

• Zero downtime (100% availability)

• Mobile

• IoT, CPS

• Social Web

• System of Systems

CS350, SoC, KAIST 6

Page 7: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Resilience Approach

• Availability = MTTF / (MTTF + MTTR)- MTTF: Mean Time To Failure- MTTR: Mean Time to Repair

• How can we increase the availability of a (distributed) system?• Increase MTTF: minimize errors/failures, reliable h/w, ..• Reduce MTTR: How?

• Failure types: Crash failure, Omission failure, Timing failure, Response failure, …

CS350, SoC, KAIST 7

Page 8: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Whole Picture for Resilient Design

CS350, SoC, KAIST 8

Page 9: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Whole Picture for Resilient Design Techniques (by Uwe Friedrichsen, Resilient SW Design In a Nutshell)

CS350, SoC, KAIST 9

Page 10: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Isolation

CS350, SoC, KAIST 10

Page 11: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Isolation

• System must not fail as a whole

• Split system in parts and isolate parts against each other

• Avoid cascading failures

• Foundations of resilient software design• Separation of concerns

• High cohesion, low coupling

• Isolation patterns• Bulkhead Design• Monolithic vs. Microarchitecture

CS350, SoC, KAIST 11

Page 12: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Bulkhead Pattern

CS350, SoC, KAIST 12

Page 13: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Bulkhead Pattern (1/2)• Isolate elements of an application into pools so that if one fails, the others will continue to function.

CS350, SoC, KAIST 13

Page 14: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Bulkheads Pattern (2/2)

• Core isolation pattern

• Diverse implementation choices available, such as microservice,

• Shaping good bulkheads is extremely hard• Software design issue

• Needs understanding of SE principles, domain knowledge, and system behavior, future technology evolution, etc…

CS350, SoC, KAIST 14

Page 15: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Monolithic Architecture (1/2) (http://microservices.io/patterns/monolithic.html)

CS350, SoC, KAIST 15

Page 16: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Monolithic Architecture (2/2)

• Benefits• Simple to develop – most of current tools support

• Simple to deploy – deploy WAR file

• Simple to scale – by running multiple copies

• Drawbacks• Difficult to understand

• Difficult to continuous deployments

• Requires a long-term commitment to a technology

CS350, SoC, KAIST 16

Page 17: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Microservice Architecture (1/3)

• Partition a system

into small manageable

pieces, loosely coupled

CS350, SoC, KAIST 17

Page 18: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Microservice Architecture (2/3)

• Benefits• Enables continuous delivery and deployment of large, complex applications• Organize the development effort with multiple, autonomous teams• Easier for a developer to understand• Application starts faster, more productive

• Drawbacks• Additional complexity of developing a distributed system• Difficult to test• Deployment complexity• Increased memory consumption

• M (number of different services) times more JVM

• Difficult to coordinate between teams, multiple services.

CS350, SoC, KAIST 18

Page 19: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Microservice Architecture(3/3)

• When to use the microservice architecture• Startup?• Large-scale service provision?

• How to decompose the application into services• In short, it is an ‘art’! (design is art!)• Some strategies

• Decompose by business capability• Single Responsibility Principle (SRP), • Use case, • Functional cohesion

• How to maintain data consistency• In order to ensure loose coupling, each service has its own database. Then,

how to guarantee data inconsistency?• Check ‘Saga pattern’, ‘Event sourcing’

CS350, SoC, KAIST 19

Page 20: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Communication Paradigm

CS350, SoC, KAIST 20

Page 21: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Communication Paradigm

• Heavily influence resilient patterns to be used

• Request-Response vs. Event-Driven• Request-Response

• Event-Driven

CS350, SoC, KAIST 21

Page 22: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Request-Response vs. Event Driven (1/2)

CS350, SoC, KAIST 22

Page 23: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Request-Response vs. Event Driven (2/2) Orchestration vs. Choreography• Which one looks better?

• Why?

CS350, SoC, KAIST 23

Page 24: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Online Shop Example (1/3)

CS350, SoC, KAIST 24

Page 25: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Online Shop Example (2/3)

CS350, SoC, KAIST 25

Page 26: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Online Shop Example (3/3)

CS350, SoC, KAIST 26

Page 27: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Day 20 Wrap-Up

• What/Why Resilient Design Patterns?• Have to deal with issues on distributed application development

• Such issues used to be system developers’ concern in the past.• However, nowadays software engineers need to deal with them from

software design phase to implementation/maintenance phases.

CS350, SoC, KAIST 27

Page 28: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Detect

CS350, SoC, KAIST 28

Page 29: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Detect: Circuit Breaker (1/2)

• Most often cited resilient pattern

• Takes downstream unit offline if calls fail multiple times.

• Circuit breaker detects failures and prevents the application from trying to perform the action that is doomed to fail (until it's safe to retry).

• Handle faults that might take a variable amount of time to fix when connecting to a remote service or resource.

CS350, SoC, KAIST 29

Page 30: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Detect: Circuit Breaker(2/2)

CS350, SoC, KAIST 30

Page 31: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Recover

CS350, SoC, KAIST 31

Page 32: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Recover: Retry

• Basic recovery pattern

• Recover from omission or other transient errors

• Limit retries to minimize extra load on an already loaded resource

• Limit reties to avoid recurring errors

CS350, SoC, KAIST 32

Page 33: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Recover: Rollback & Roll Forward

Rollback

• Roll back state and/or execution path to a define safe state

• Recover from internal errors caused by external failures

• Use checkpoints and safe points to provide safe rollback points

• Limit retries to avoid recurring errors

Roll Forward

• Advance execution past the point of error

• Often used as escalation if retry or rollback do not succeed

• Not applicable if skipped activity is essential

CS350, SoC, KAIST 33

Page 34: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Recover: Reset & Failover

Reset

• Often used as radical escalation of all other measures failed

• Restart service

• Reset data to a guaranteed consistent state

Failover

• Used as escalation if other measures failed

• Requires redundancy

CS350, SoC, KAIST 34

Page 35: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Mitigate

CS350, SoC, KAIST 35

Page 36: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Mitigate: Fallback

• Execute an alternative action if the original action fails

• Baiss for most mitigation patterns

• Silently ignore the error and continue processing

• Return a predefined default value of an error occurs

CS350, SoC, KAIST 36

Page 37: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Mitigate: Queues for Resources

• Protect resource from temporary overload situations

• Avoid losing requests by queuing them in front of resource

• However, unlimited queues can create excessive latency

CS350, SoC, KAIST 37

Page 38: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Mitigate: Share Load

• Use if additional resources for load sharing are available

• Share load among resources to keep throughput good

• Can be implemented statically or dynamically

• Minimize amount of synchronization needed between resources

CS350, SoC, KAIST 38

Page 39: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Prevent

CS350, SoC, KAIST 39

Page 40: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Prevent: Error Injection

• Inject errors at runtime and observe how the system reacts• Chaos engineering at Netflix

• Make sure to inject errors of all types

(Routine maintenance)

• Keep preventable errors from occurring

• Check system periodically and fix detected faults and errors

CS350, SoC, KAIST 40

Page 41: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Complement

CS350, SoC, KAIST 41

Page 42: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Complement: Redundancy

• Core resilient concept

• Basis for many recovery and mitigation patterns

• Often different variants implemented in a system• N-version program

CS350, SoC, KAIST 42

Page 43: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Complement: Escalation

• Failed units may not have enough time or information to handle errors

• Escalation peer with more time and information needed

• Separate error handling flow from processing flow

• Often multi-level hierarchies

CS350, SoC, KAIST 43

Page 44: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Treat: Hot deployment

• Hot-deployable services are those which can be added to or removed from the running server. It is the ability to change ON-THE-FLY what’s currently deployed without redeploying it.

• Hot deployment is VERY hot for development. The time savings realized when your developers can simply run their build and have the new code auto-deploy instead of build, shutdown, startup is massive.

• Pros: business never stops

• Cons: may require large resources

CS350, SoC, KAIST 44

Page 45: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Whole Picture for Resilient Design Techniques (by Uwe Friedrichsen, Resilient SW Design In a Nutshell)

CS350, SoC, KAIST 45

Page 46: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Using Resilience Patterns

• Patterns are options, not obligations

• Do not pick too many patterns

• Each pattern increase complexity which is the enemy of robustness

• Each pattern costs money

• Look for complementary patterns

CS350, SoC, KAIST 46

Page 47: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Netflix’s Resilient Design Patterns Used

• Choose right ones

for you.

CS350, SoC, KAIST 47

Page 48: CS 350 Lecture 5-3 Resilient Design - KAISTse.kaist.ac.kr/wp-content/uploads/2019/11/CS350-Lecture... · 2019-11-24 · •Distributed systems are every corner of our society •Attempts

Wrap-Up

• Distributed systems are every corner of our society

• Attempts rather to have a fail-free system, better to have a resilient system.

• Resilient SW design patterns (or approaches) need to be mastered for distributed software development (design)

• Try to use of existing ones,

• Even better, “create your own patterns!”

CS350, SoC, KAIST 48