TRANSCRIPT
Retro: Targeted Resource Management in Multi-tenant Distributed Systems
Jonathan Mace, Brown University
Peter Bodik, MSR Redmond
Rodrigo Fonseca, Brown University
Madanlal Musuvathi, MSR Redmond
Resource Management in Multi-Tenant Systems

• April 2011 – Amazon EBS failure: a failure in one availability zone cascades to the shared control plane, causing thread pool starvation for all zones
• Aug. 2012 – Azure Storage outage: an aggressive background task responds to increased hardware capacity with a deluge of warnings and logging
• Nov. 2014 – Visual Studio Online outage: a code change increases database usage, shifting the bottleneck to an unmanaged application-level lock
• 2014 – Communication with Cloudera: shared storage layer bottlenecks circumvent the resource management layer

The result: degraded performance, violated SLOs, and system outages.
Containers and VMs provide isolation at the OS / hypervisor level, but tenants also contend inside shared systems: storage, databases, queueing, etc.
Retro:
• Monitors resource usage of each tenant in near real-time
• Actively schedules tenants and activities
• High-level, centralized policies encapsulate the resource management logic
• Abstractions are not specific to a resource type or system
• Policies achieve different goals: guarantee average latencies, fair-share a resource, etc.
Hadoop Distributed File System (HDFS)
• HDFS NameNode – filesystem metadata (serving requests such as Rename and Read)
• HDFS DataNodes – replicated block storage
[Figure: HDFS NameNode and DataNodes, with per-tenant request rates plotted over time]
A typical stack of shared systems on a single node:
• Local storage
• HDFS DataNode
• HBase RegionServer
• MapReduce Shuffler
• Hadoop YARN NodeManager, with YARN containers running MapReduce tasks
Goals
• Co-ordinated control across processes, machines, and services
• Handle system-level and application-level resources
• Principals: tenants and background tasks
• Real-time and reactive
• Efficient: only control what is needed
Architecture
Tenant Requests
Workflows
Purpose: identify requests from different users and background activities
• e.g., all requests from a tenant over time
• e.g., data balancing in HDFS
A workflow is the unit of resource measurement, attribution, and enforcement. It tracks a request across varying levels of granularity, and is orthogonal to threads, processes, network flows, etc.
Resources
Purpose: cope with the diversity of resources. What we need:
1. Identify overloaded resources.
   Slowdown: the ratio of how slow the resource is now compared to its baseline performance with no contention.
   Slowdown = (queue time + execute time) / execute time
   e.g., 100 ms queued + 10 ms executing ⇒ slowdown = 11
2. Identify culprit workflows.
   Load: the fraction of current utilization that we can attribute to each workflow, measured as time spent executing.
   e.g., 10 ms executing ⇒ load = 10
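A minimal sketch of how these two metrics could be computed from per-operation measurements (class and method names here are illustrative, not Retro's actual API):

    import java.util.HashMap;
    import java.util.Map;

    // Aggregates per-operation measurements for one resource into the
    // slowdown and per-workflow load metrics defined above.
    class ResourceStats {
        private long totalQueueNanos = 0;
        private long totalExecNanos = 0;
        private final Map<String, Long> execNanosByWorkflow = new HashMap<>();

        // Record one operation: its queueing time and execution time,
        // attributed to the workflow that issued it.
        void record(String workflowId, long queueNanos, long execNanos) {
            totalQueueNanos += queueNanos;
            totalExecNanos += execNanos;
            execNanosByWorkflow.merge(workflowId, execNanos, Long::sum);
        }

        // Slowdown = (queue time + execute time) / execute time.
        // e.g., 100 ms queued, 10 ms executing => slowdown 11.
        double slowdown() {
            if (totalExecNanos == 0) return 1.0;
            return (double) (totalQueueNanos + totalExecNanos) / totalExecNanos;
        }

        // Fraction of this resource's utilization attributable to a workflow.
        double loadFraction(String workflowId) {
            if (totalExecNanos == 0) return 0.0;
            return (double) execNanosByWorkflow.getOrDefault(workflowId, 0L)
                    / totalExecNanos;
        }
    }

In Retro, such locally aggregated statistics are reported centrally once per second (see the architecture below).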
Control Points
Goal: enforce resource management decisions
• Decoupled from resources
• Rate-limits workflows, agnostic to the underlying implementation, e.g., a token bucket or a priority queue
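For concreteness, a minimal token-bucket sketch (one of the possible control-point implementations named above; names and structure are illustrative):

    // A per-workflow token bucket: admit a request only when the workflow
    // has accumulated enough tokens at its policy-assigned rate.
    class TokenBucket {
        private final double capacity;  // maximum burst, in tokens
        private double ratePerSec;      // refill rate, set by the policy
        private double tokens;
        private long lastRefillNanos = System.nanoTime();

        TokenBucket(double ratePerSec, double capacity) {
            this.ratePerSec = ratePerSec;
            this.capacity = capacity;
            this.tokens = capacity;
        }

        // Called when a policy adjusts this workflow's rate.
        synchronized void setRate(double ratePerSec) {
            this.ratePerSec = ratePerSec;
        }

        // Try to admit a request costing `cost` tokens; false means throttle.
        synchronized boolean tryAcquire(double cost) {
            long now = System.nanoTime();
            tokens = Math.min(capacity,
                    tokens + ratePerSec * (now - lastRefillNanos) / 1e9);
            lastRefillNanos = now;
            if (tokens < cost) return false;
            tokens -= cost;
            return true;
        }
    }

Because the interface is just "set a rate per workflow", a policy never needs to know whether the control point sits in front of a disk, a queue, or an RPC handler.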
Retro's architecture:
1. Pervasive Measurement – measurements are aggregated locally, then reported centrally once per second
2. Centralized Controller – policies run in a continuous control loop over a global, abstracted view of the system, via the Retro Controller API
3. Distributed Enforcement – co-ordinates enforcement using a distributed token bucket

Retro is a "control plane" for resource management: the global, abstracted view of the system makes policies easier to write, and reusable.
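The controller's loop can be pictured as follows; this is a sketch under assumed interfaces (Report, Policy, and fetchReports are illustrative names, not Retro's actual API):

    import java.util.List;

    // A policy consumes the latest per-resource / per-workflow reports and
    // reacts, e.g., by pushing new rates to control points.
    interface Policy {
        void update(List<Report> reports);
    }

    record Report(String resource, String workflow,
                  double slowdown, double loadFraction) {}

    class RetroController {
        private final List<Policy> policies;

        RetroController(List<Policy> policies) {
            this.policies = policies;
        }

        // Continuous control loop: once per second, give every policy the
        // freshly aggregated global view of the system.
        void run() throws InterruptedException {
            while (true) {
                List<Report> reports = fetchReports();
                for (Policy p : policies) {
                    p.update(reports);
                }
                Thread.sleep(1000);
            }
        }

        // Stub: in a real deployment this would gather the 1-second
        // aggregates reported by the instrumented processes.
        private List<Report> fetchReports() {
            return List.of();
        }
    }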
Example: LatencySLO
• H: high-priority workflows with a latency guarantee, e.g., "200ms average request latency"
• L: low-priority workflows, which may use spare capacity
The policy monitors latencies, attributes interference, and throttles the interfering workflows.
The LatencySLO policy, in pseudocode:

    foreach candidate in H:
        miss[candidate] = latency(candidate) / guarantee[candidate]
    W = candidate in H with max miss[candidate]

    foreach rsrc in resources():        // calculate importance of each resource for W
        importance[rsrc] = latency(W, rsrc) * log(slowdown(rsrc))

    foreach lopri in L:                 // calculate low-priority workflow interference
        interference[lopri] = Σ_rsrc importance[rsrc] * load(lopri, rsrc) / load(rsrc)

    foreach lopri in L:                 // normalize interference
        interference[lopri] /= Σ_k interference[k]

    foreach lopri in L:
        if miss[W] > 1:                 // throttle
            scalefactor = 1 – α * (miss[W] – 1) * interference[lopri]
        else:                           // release
            scalefactor = 1 + β
        foreach cpoint in controlpoints():   // apply new rates
            set_rate(cpoint, lopri, scalefactor * get_rate(cpoint, lopri))

In three steps:
1. Select the high-priority workflow W with the worst performance relative to its guarantee.
2. Weight the low-priority workflows by their interference with W.
3. Throttle the low-priority workflows proportionally to their weight.
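The same loop as a runnable Java sketch, against assumed measurement and control-point interfaces (Metrics, ControlPoint, and the constants ALPHA and BETA are illustrative stand-ins, not Retro's actual API):

    import java.util.HashMap;
    import java.util.Map;

    // One iteration of the LatencySLO policy, following the pseudocode above.
    class LatencySloPolicy {
        static final double ALPHA = 0.1, BETA = 0.05; // assumed damping factors

        interface Metrics {
            double latency(String workflow);                  // average latency
            double latency(String workflow, String resource); // latency at one resource
            double slowdown(String resource);
            double load(String workflow, String resource);
            double load(String resource);
            Iterable<String> resources();
        }

        interface ControlPoint {
            double getRate(String workflow);
            void setRate(String workflow, double rate);
        }

        void update(Map<String, Double> guarantees,  // high-priority workflows H
                    Iterable<String> lowPriority,    // low-priority workflows L
                    Metrics m, Iterable<ControlPoint> cpoints) {
            // Step 1: select the workflow W missing its guarantee the most.
            String w = null;
            double worstMiss = Double.NEGATIVE_INFINITY;
            for (var e : guarantees.entrySet()) {
                double miss = m.latency(e.getKey()) / e.getValue();
                if (miss > worstMiss) { worstMiss = miss; w = e.getKey(); }
            }
            if (w == null) return;

            // Step 2: weight each low-priority workflow by its interference with W.
            Map<String, Double> interference = new HashMap<>();
            double total = 0;
            for (String lo : lowPriority) {
                double v = 0;
                for (String r : m.resources()) {
                    double importance = m.latency(w, r) * Math.log(m.slowdown(r));
                    v += importance * m.load(lo, r) / m.load(r);
                }
                interference.put(lo, v);
                total += v;
            }

            // Step 3: throttle proportionally while W misses its SLO; release otherwise.
            for (String lo : lowPriority) {
                double share = total > 0 ? interference.get(lo) / total : 0;
                double scale = (worstMiss > 1)
                        ? 1 - ALPHA * (worstMiss - 1) * share
                        : 1 + BETA;
                for (ControlPoint cp : cpoints) {
                    cp.setRate(lo, scale * cp.getRate(lo));
                }
            }
        }
    }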
Other types of policy:
• Bottleneck Fairness – detect the most overloaded resource, then fair-share it between the tenants using it
• Dominant Resource Fairness – estimate demands and capacities from measurements
These policies are concise, any resource can be the bottleneck (the policy doesn't care), and nothing is system-specific.
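As an illustration, bottleneck fairness reduces to a few lines against the same kind of assumed interfaces (the equal split below is a simplification of real fair sharing):

    import java.util.ArrayList;
    import java.util.List;

    // Simplified bottleneck fairness: find the resource with the highest
    // slowdown, then split an assumed rate budget evenly among the
    // workflows that put load on it. Interfaces are illustrative.
    class BottleneckFairnessPolicy {
        interface Metrics {
            Iterable<String> resources();
            double slowdown(String resource);
            double load(String workflow, String resource);
        }

        interface ControlPoint {
            void setRate(String workflow, double rate);
        }

        void update(List<String> workflows, Metrics m,
                    ControlPoint cpoint, double rateBudget) {
            // 1. Detect the most overloaded resource (slowdown 1.0 = no contention).
            String bottleneck = null;
            double worst = 1.0;
            for (String r : m.resources()) {
                if (m.slowdown(r) > worst) { worst = m.slowdown(r); bottleneck = r; }
            }
            if (bottleneck == null) return; // nothing is contended

            // 2. Fair-share: an equal rate for every workflow using the bottleneck.
            List<String> users = new ArrayList<>();
            for (String wf : workflows) {
                if (m.load(wf, bottleneck) > 0) users.add(wf);
            }
            for (String wf : users) {
                cpoint.setRate(wf, rateBudget / users.size());
            }
        }
    }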
Evaluation
Instrumentation
Retro is implemented in Java: an instrumentation library, plus a central controller implementation.
To enable Retro in a system:
• Propagate the workflow ID within the application (like X-Trace, Dapper)
• Instrument resources with wrapper classes
Overheads:
• Resource instrumentation is automatic, using AspectJ
• Overall, 50-200 lines per system to modify RPCs
• Retro's overhead is at most 1-2% on throughput and latency
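A minimal sketch of workflow-ID propagation in the X-Trace/Dapper style (names are illustrative; the real instrumentation must also carry the ID across RPC boundaries, which is where the 50-200 modified lines per system come in):

    // Carry the current workflow ID in a thread-local so that instrumented
    // resources can attribute their usage to the right workflow.
    final class WorkflowContext {
        private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

        static String get() {
            return CURRENT.get();
        }

        // Run a task on behalf of a workflow, restoring the previous ID after.
        static void runAs(String workflowId, Runnable task) {
            String previous = CURRENT.get();
            CURRENT.set(workflowId);
            try {
                task.run();
            } finally {
                if (previous == null) CURRENT.remove();
                else CURRENT.set(previous);
            }
        }
    }

A resource wrapper class can then call WorkflowContext.get() when it records queue and execution times.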
Experiments
• Systems: HDFS, HBase, YARN MapReduce, ZooKeeper
• Workflows: MapReduce jobs (HiBench), HBase (YCSB), HDFS clients, background data replication, background heartbeats
• Resources: CPU, disk, network (all systems); locks, queues (HDFS, HBase)
• Policies: LatencySLO, Bottleneck Fairness, Dominant Resource Fairness
We ran policies for a mixture of systems, workflows, and resources, with results on clusters of up to 200 nodes. See the paper for full experiment results; this talk shows the LatencySLO policy results.
[Figure: SLO-normalized latency over time (log scale, 0.1-1000; 30 minutes) for six workflows - HDFS read 8k, HBase read 1 cached row, HBase read 1 row, HBase cached table scan, HDFS mkdir, HBase table scan - against the SLO target]
[Figure: resource slowdown over the same period (0-15), identifying the contended resources: disk, CPU, HDFS NameNode lock, HDFS NameNode queue, HBase queue]
[Figure: the same experiment with the SLO policy enabled - SLO-normalized latency (0.1-10) converges to the SLO target once the policy kicks in]
[Figure: per-workflow SLO-normalized latency with the policy enabled, over 30 minutes; insets show HBase table scan (0.2-1), HDFS mkdir (0.5-1), and HBase cached table scan (0.5-1)]
Conclusion
Retro: resource management for shared distributed systems.
• Centralized resource management
• Comprehensive: resources, processes, tenants, background tasks
• Abstractions for writing concise, general-purpose policies: workflows, resources (slowdown, load), and control points
http://cs.brown.edu/~jcmace