
“Good Enough” Service Model and Description

Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber

Common Tasks for Practitioners

• Capacity planning and deployment

• Detect failure and repair (monitor)

• Anticipate problems and prevent (trend)

These tasks are done while:

• system components change

• overall service architecture changes

Common (not best) Practices

• A “production service” is composed of a series of poorly understood components which provide one or more network services.
– Components often don’t scale well.
– There is little insight into the performance characteristics of these components.

• Architectural descriptions are incomplete. There is only a vague understanding of how components interact with one another.
– Typically the only record is an out-of-date Visio diagram of the relationships between service components running in-house software.
– Other dependencies, such as name resolution, are often not understood.
– The more time that passes, the less people understand the service.

• Insufficient investment in scaling / availability
– Simple load balancing using a network device, without a full plan
– Rendezvous and state not thought out: see options on NetScaler

• Staff often lack systems background and understanding

Common Problems

• Little insight into how the service scales
– As a result, the service is typically driven off the cliff repeatedly.
– Hardware is added based on which machine seemed to fail first, which may or may not fix the problem.

• Little insight into which components depend on others
– Changing one component often has an adverse effect on other parts of the service.
– Diagnosis is slow and hard.
– People are afraid to touch pieces.

What’s Good Enough

• A guide that gets within +/-30% of the number and ratio of servers needed to handle a given load

• A definitive and accurate description of how components interact which would aid monitoring and debugging

Greybox Model of a Service

The following would be needed for a model:

• Description of components

• Instrumentation of basic machine resources

• Instrumentation for component interfaces

• Description of how components interact in the overall service

• Traffic capture and replay facility

• Performance thumbnails (graphs) describing behavior under various loads

General Component Description

[Diagram: a single component with internal resources (CPU, Memory, Disk, Network) and three Protocol interfaces, each labeled Req and Rec.]

Service Description Diagram

[Diagram: service components App1, Cache, and SQL Database, drawn with their CPU, Memory, Disk, and Network resources and with interfaces labeled HTTP and SQLNET.]

Instrumentation for Interfaces

• Each interface will minimally capture requests and responses
– Simple logging of requests and responses with timestamps is sufficient

• Recommended Additions
– A global transaction ID passed through each interface, which enables path-based analysis
– Integrated capture / replay à la RADlab’s liblog
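A minimal sketch of the interface logging described above. The `InterfaceLog` class, the record field names, and the `path_of` helper are hypothetical and only illustrate the idea: timestamped request/response records, plus a global transaction ID that lets a request's path be reconstructed afterwards.

```python
import json
import time
import uuid


class InterfaceLog:
    """Timestamped request/response log for one component interface.

    Hypothetical helper: class and field names are illustrative assumptions.
    """

    def __init__(self, component, protocol, clock=time.time):
        self.component = component
        self.protocol = protocol
        self.clock = clock          # injectable so tests can use a fake clock
        self.records = []

    def record(self, kind, txn_id, payload):
        entry = {
            "ts": self.clock(),
            "component": self.component,
            "protocol": self.protocol,
            "kind": kind,           # "request" or "response"
            "txn_id": txn_id,       # global transaction ID, passed downstream
            "payload": payload,
        }
        self.records.append(entry)
        return entry

    def dump(self):
        """Serialize the log as one JSON object per line."""
        return "\n".join(json.dumps(r, sort_keys=True) for r in self.records)


def new_txn_id():
    """Create a transaction ID at the edge of the service."""
    return uuid.uuid4().hex


def path_of(txn_id, logs):
    """Path-based analysis: components that saw the request, in time order."""
    hits = [r for log in logs for r in log.records
            if r["txn_id"] == txn_id and r["kind"] == "request"]
    return [r["component"] for r in sorted(hits, key=lambda r: r["ts"])]
```

Each component would call `record` once when a request arrives and once when the response leaves, always forwarding the same `txn_id` to the components it calls.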

Traffic Capture and Replay Facility

• Ability to capture real world traffic (traces)

• Ability to replay real world traffic at specified rates

• Alternatively, a synthetic load generator
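One way to think about "replay at specified rates" is as a schedule of send times derived from the captured timestamps. The sketch below is illustrative only: the function names and the `speedup` parameter are assumptions, and the trace is assumed to be a list of capture timestamps in seconds, sorted in ascending order.

```python
def replay_schedule(trace_timestamps, speedup=1.0):
    """Return send offsets (seconds from replay start) for a captured trace.

    A speedup of 2.0 replays the trace at twice the captured rate while
    preserving the relative spacing (burstiness) of the requests.
    """
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    if not trace_timestamps:
        return []
    start = trace_timestamps[0]
    return [(ts - start) / speedup for ts in trace_timestamps]


def fixed_rate_schedule(n_requests, requests_per_second):
    """Synthetic alternative: evenly spaced requests at a target rate."""
    if requests_per_second <= 0:
        raise ValueError("requests_per_second must be positive")
    return [i / requests_per_second for i in range(n_requests)]
```

A replay driver would then sleep until each offset and issue the corresponding captured request; a synthetic generator would do the same with generated requests.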

Basic Scaling Methodology

1. Apply increasing load to an individual component, with the components it depends on being sufficiently responsive that the component’s internal resources are the bottleneck.

2. If internal resources aren’t consumed before the machine hits a bottleneck, investigate external components.

3. Find where the component goes non-linear. Back that value off by 15% and call the component “rated” for that workload.

4. Based on the number of requests each component needs to issue to fulfill a request, it should be quite easy to work out how many components of each sort are required to service a specified number of requests.
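Steps 3 and 4 can be sketched as a small calculation. This is an illustration under stated assumptions, not the authors' tooling: "non-linear" is taken to mean that completed work falls more than a `tolerance` (10% here, an arbitrary choice) below a straight line through the lightest-load sample, and the function names are hypothetical.

```python
import math


def find_knee(samples, tolerance=0.10):
    """Return the highest offered load that still scales linearly.

    samples: list of (offered_load, completed_per_second), sorted by load.
    The baseline slope comes from the lightest-load sample; a sample is
    treated as non-linear once it falls more than `tolerance` below it.
    """
    base_load, base_done = samples[0]
    slope = base_done / base_load
    knee = base_load
    for load, done in samples[1:]:
        if done < slope * load * (1.0 - tolerance):
            break
        knee = load
    return knee


def rated_capacity(samples, backoff=0.15, tolerance=0.10):
    """Step 3: back the knee off by 15% to get the rated load."""
    return find_knee(samples, tolerance) * (1.0 - backoff)


def machines_needed(front_end_rps, fanout, rated_rps):
    """Step 4: fanout is the number of requests this component receives
    for each front-end request."""
    return math.ceil(front_end_rps * fanout / rated_rps)
```

For example, if a component is rated at 340 requests/s and each front-end request causes 3 requests to it, then 10,000 front-end requests/s call for ceil(30000 / 340) = 89 machines.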

Performance Thumbnails (1)

• Components as a Transfer Function

• FIXME: Insert graphs of resource starvation, deadlock, livelock

Performance Thumbnails (2)

• Create “Expected” Overlay Graph
– X axis is number of requests
– Y axis is:
    • Work performed
    • Resources used
    • Number of requests to each downstream component

• Generate graphs in the production service
– X axis is time
– Y axis is <see above>

• Purpose
– Gives the operations team a good feel for what a component should look like

Service Description Markup Details

• Specify Component Relationships Once
– This should only need to change when the relationships between components change
– But permit fault-isolated service units

• Building Blocks
– component(*) = wildcard, any component of this type
– component(x) = variable substitution, in same service unit
– component(specific) = only the specified service unit

• Example
    cache(*)          appserverA(*)     http
    appserverA(x)     coredb(x)         sqlnet
    appserverA(pod1)  blobserver(pod1)  http
    dbmonitor(ops)    coredb(*)         http
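The building blocks above can be read as a small matching language. The sketch below parses lines of the form "source destination protocol" and checks whether a given connection is permitted. It is an interpretation, not the authors' implementation: the function names are hypothetical, and it assumes that a single-letter unit such as `x` is a variable while anything longer (such as `pod1` or `ops`) names a specific service unit.

```python
import re

_TERM = re.compile(r"^(\w+)\((\*|\w+)\)$")


def _term(text):
    m = _TERM.match(text)
    if m is None:
        raise ValueError(f"bad term: {text!r}")
    return m.group(1), m.group(2)


def parse_rule(line):
    """Parse 'src(unit) dst(unit) protocol' into a 5-tuple of strings."""
    src, dst, proto = line.split()
    s_type, s_unit = _term(src)
    d_type, d_unit = _term(dst)
    return s_type, s_unit, d_type, d_unit, proto


def _unit_ok(pattern, unit, bindings):
    if pattern == "*":                            # wildcard: any unit
        return True
    if len(pattern) == 1 and pattern.isalpha():   # assumed: one letter = variable
        return bindings.setdefault(pattern, unit) == unit
    return pattern == unit                        # specific service unit


def permitted(rules, src, dst, proto):
    """src and dst are (component_type, service_unit) pairs."""
    for s_type, s_unit, d_type, d_unit, r_proto in rules:
        if (s_type, d_type, r_proto) != (src[0], dst[0], proto):
            continue
        bindings = {}   # variable bindings are local to one rule
        if _unit_ok(s_unit, src[1], bindings) and _unit_ok(d_unit, dst[1], bindings):
            return True
    return False
```

With the four example rules, an `appserverA` in `pod2` may reach the `coredb` in `pod2` over `sqlnet` but not the one in `pod1`, which is the fault isolation the variable form is meant to express.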

Generate Configuration and Enforce Service Model!

• Models Typically Don’t Stay Accurate
– The model provides little benefit to the component developers, so updates often lag, just like most “documentation”
– People who need the model typically have to infer it through path-based analysis or network flow analysis
    • Inferred models typically miss things
    • It’s already broken

• Enforcing the model prevents drift!
– On machines, via a kernel packet filter such as ipfilter
– In the network
– Changes won’t work unless the model is updated
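To make "generate configuration" concrete, the sketch below expands model rules against a machine inventory into an explicit allow-list, with everything else denied. It is only an illustration: the inventory format, the function names, and the single-letter-variable convention are assumptions, and the output is a generic set of tuples rather than real ipfilter syntax.

```python
def is_variable(unit):
    """Assumed convention: a single letter (e.g. 'x') is a variable."""
    return len(unit) == 1 and unit.isalpha()


def _units_match(s_pat, s_unit, d_pat, d_unit):
    # The same variable on both sides means "same service unit".
    if is_variable(s_pat) and s_pat == d_pat:
        return s_unit == d_unit
    src_ok = s_pat == "*" or is_variable(s_pat) or s_pat == s_unit
    dst_ok = d_pat == "*" or is_variable(d_pat) or d_pat == d_unit
    return src_ok and dst_ok


def expand_allow_list(rules, inventory):
    """Expand model rules into concrete (src_host, dst_host, protocol) entries.

    rules: (src_type, src_unit, dst_type, dst_unit, protocol) tuples.
    inventory: (host, component_type, service_unit) tuples.
    """
    allowed = set()
    for s_type, s_pat, d_type, d_pat, proto in rules:
        for s_host, s_t, s_u in inventory:
            if s_t != s_type:
                continue
            for d_host, d_t, d_u in inventory:
                if d_t == d_type and _units_match(s_pat, s_u, d_pat, d_u):
                    allowed.add((s_host, d_host, proto))
    return allowed


def is_allowed(allowed, src_host, dst_host, proto):
    """Default deny: anything not generated from the model is rejected."""
    return (src_host, dst_host, proto) in allowed
```

Because the allow-list is regenerated from the model, a new dependency between components only starts working once the model has been updated.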

Advanced Scaling

• As individual components are stressed, see how internal resources are consumed.
– You might want to change hardware/software platforms if one resource is consumed before all the others.

• If you increase the number of machines and you don’t get a corresponding performance increase, then something other than internal resources is the bottleneck

• Testing with all the components might reveal emergent behavior
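The two checks above can be expressed as simple calculations. This sketch is illustrative: the function names and the 0.9 utilization threshold are assumptions, not values from the original.

```python
def scaling_efficiency(machines_before, rate_before, machines_after, rate_after):
    """Fraction of the ideal (linear) speedup actually achieved.

    A value well below 1.0 after adding machines suggests that something
    other than internal resources is the bottleneck.
    """
    ideal = machines_after / machines_before
    actual = rate_after / rate_before
    return actual / ideal


def exhausted_resources(utilization, threshold=0.9):
    """Return resources at or above the threshold, busiest first.

    utilization: mapping of resource name to a fraction in [0, 1].
    """
    hot = [(u, name) for name, u in utilization.items() if u >= threshold]
    return [name for u, name in sorted(hot, reverse=True)]
```

If a single resource shows up in `exhausted_resources` long before the others, that is the case where a different hardware/software platform may be worth considering.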

Suggested Benefits

• “Hints” which guide machine deployment
– Ratio of machines for various functions
– Number of machines given an anticipated load

• Provides insight to operators (mental model)

• Filters for alert management

• May permit auto-tuned monitors

• Rough insight into bottlenecks
– When adding a machine fails to improve performance, that points to a more serious problem

• Lowers the risk of breaking the service when releasing an updated component

• Can be used for a first pass at IDS and protection

Experiment Manager

• A system which would run each of the components through a load test to get initial scaling numbers

• Use the initial scaling number to prioritize which combination of components should be first tested together.

• Map the space generated by each of the component transfer functions to find an optimal configuration

• Run for as long as time permits

• The ratio of a component’s input to its output is really interesting

Other Opportunities for Research

• Measurement Methodology
– Look at Margo’s work on micro benchmarks
– Peak to average work-load

• Logging / capture

• Crisp description of inter-dependence

• Map of cliffs

• Exploration of making critical resources first class items (locks, etc)

• More natural good/bad for SLT

• How to capture state oriented bindings
