![Page 1: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/1.jpg)
“Good Enough” Service Model and Description
Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber
![Page 2: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/2.jpg)
Common Tasks for Practitioners
• Capacity planning and deployment
• Detect failure and repair (monitor)
• Anticipate problems and prevent (trend)
These tasks are done while:
• system components change
• overall service architecture changes
![Page 3: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/3.jpg)
Common (not best) Practices
• A “production service” is composed of a series of poorly understood components which provide one or more network services.
– Components often don’t scale well
– There is little insight into the performance characteristics of these components
• Architectural descriptions are incomplete. There is only a vague understanding of how components interact with other components.
– Typically the only record is an out-of-date Visio diagram of the relationships between service components running in-house generated software
– Often there is no understanding of other dependencies, such as name resolution
– The more time that passes, the less people understand the service
• Insufficient investment in scaling / availability
– Simple load balancing using a network device without a full plan
– Rendezvous and state not thought out: see options on NetScaler
• Staff often lack systems background and understanding
![Page 4: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/4.jpg)
Common Problems
• Little insight into how the service scales
– As a result the service is typically driven off the cliff repeatedly
– Hardware is added based on which machine seemed to fail first, which may or may not fix the problem
• Little insight into which components depend on others
– Often, changing one component has an adverse effect on other parts of the service
– Diagnosis is slow and hard
– People are afraid to touch pieces
![Page 5: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/5.jpg)
What’s Good Enough
• A guide which would get within ±30% of the number and ratio of servers needed to handle a given load
• A definitive and accurate description of how components interact, which would aid monitoring and debugging
![Page 6: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/6.jpg)
Greybox Model of a Service
The following would be needed for a model:
• Description of components
• Instrumentation of basic machine resources
• Instrumentation for component interfaces
• Description of how components interact in the overall service
• Traffic capture and replay facility
• Performance thumbnails (graphs) describing behavior under various loads
![Page 7: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/7.jpg)
General Component Description
[Diagram: a single component drawn as a box of internal resources (CPU, Memory, Disk, Network) with three Protocol interfaces, each labeled Req and Rec.]
![Page 8: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/8.jpg)
Service Description Diagram
[Diagram: four components, each with CPU, Memory, Disk, and Network resources, joined by HTTP, SQLNET, and SQL interfaces. Labels recoverable from the slide: App1, Cache, Database.]
![Page 9: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/9.jpg)
Instrumentation for Interfaces
• Each interface will minimally capture:
– Simple logging of requests and responses with timestamps is sufficient
• Recommended Additions
– Global transaction ID which is passed through each interface, which enables path-based analysis
– Integrated capture / replay à la RAD Lab’s liblog
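A minimal sketch of the interface instrumentation described above: each interface logs request and response records with timestamps, and a transaction ID is passed along so the path of one request can be reconstructed. The names `instrument` and `path_of`, the record fields, and the in-memory list used as a log are all assumptions made for illustration; they are not taken from the slides or from liblog.

```python
import time
import uuid


def instrument(interface_name, handler, log):
    """Wrap `handler` so every call appends request and response records to `log`."""
    def wrapped(request, txn_id=None):
        # Reuse the caller's transaction ID if one was passed in; otherwise
        # this interface is the entry point and mints a new one.
        txn_id = txn_id or uuid.uuid4().hex
        log.append({"ts": time.time(), "iface": interface_name,
                    "txn": txn_id, "event": "request", "body": request})
        response = handler(request, txn_id)
        log.append({"ts": time.time(), "iface": interface_name,
                    "txn": txn_id, "event": "response", "body": response})
        return response
    return wrapped


def path_of(log, txn_id):
    """Return the interfaces a transaction touched, in order of first request."""
    return [r["iface"] for r in log
            if r["txn"] == txn_id and r["event"] == "request"]


# Usage: a hypothetical front end that calls a database interface per request.
log = []
db = instrument("sqlnet", lambda req, txn: {"rows": 1}, log)
web = instrument("http", lambda req, txn: db("SELECT 1", txn), log)
web("GET /", txn_id="t1")
print(path_of(log, "t1"))  # ['http', 'sqlnet']
```

Passing the ID explicitly keeps the sketch short; a real service would more likely carry it in a protocol header.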
![Page 10: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/10.jpg)
Traffic Capture and Replay Facility
• Ability to capture real world traffic (traces)
• Ability to replay real world traffic at specified rates
• Alternatively, a synthetic load generator
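Replaying a captured trace at a specified rate can be sketched as scaling the gaps between recorded timestamps. The trace format (a sorted list of `(timestamp_seconds, request)` pairs) and the function names are assumptions for illustration only.

```python
import time


def replay_schedule(trace, speedup=1.0):
    """Yield (delay_seconds, request) pairs that replay `trace` `speedup` times faster.

    `trace` is a list of (timestamp_seconds, request) tuples sorted by timestamp.
    """
    previous = None
    for ts, request in trace:
        # The first request is sent immediately; later ones keep their
        # original spacing divided by the speedup factor.
        gap = 0.0 if previous is None else (ts - previous) / speedup
        previous = ts
        yield gap, request


def replay(trace, send, speedup=1.0, sleep=time.sleep):
    """Send each request through `send`, waiting the scaled gap before each one."""
    for delay, request in replay_schedule(trace, speedup):
        sleep(delay)
        send(request)
```

A `speedup` of 2.0 doubles the offered rate while preserving the shape of the captured traffic; the `sleep` parameter exists so the schedule can be tested without waiting.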
![Page 11: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/11.jpg)
Basic Scaling Methodology
1. Apply increasing load to each component individually, making sure the components it depends on are sufficiently responsive that the component’s own internal resources are the bottleneck.
2. If internal resources aren’t consumed before the machine hits a bottleneck, investigate external components.
3. Find where the component goes non-linear. Back that value off by 15% and call the component “rated” for that workload.
4. Based on the number of downstream requests a component needs to issue to fulfill one request, it should be quite easy to figure out how many components of each sort will be required to service a specified number of requests.
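Steps 3 and 4 above can be sketched as a small calculation. The slides do not define “goes non-linear”, so this sketch assumes it means completed throughput falling more than a tolerance below offered load; that criterion, the function names, and the sample numbers are all illustrative assumptions. Only the 15% back-off comes from the slide.

```python
import math


def find_knee(samples, tolerance=0.10):
    """Return the highest offered load at which throughput still tracks load.

    `samples` is a list of (offered_rps, completed_rps) pairs in increasing
    load order. The component is treated as non-linear once completed
    throughput falls more than `tolerance` below the offered load.
    """
    knee = 0.0
    for offered, completed in samples:
        if completed < offered * (1.0 - tolerance):
            break
        knee = offered
    return knee


def rated_capacity(knee_rps, backoff=0.15):
    """Back the knee off by 15% (the slide's figure) to get the rated load."""
    return knee_rps * (1.0 - backoff)


def servers_needed(front_end_rps, fanout, rated_rps):
    """Servers of one type needed when each front-end request issues `fanout`
    requests to that type and each server is rated for `rated_rps`."""
    return math.ceil(front_end_rps * fanout / rated_rps)


# Hypothetical load-test results for one component.
samples = [(100, 100), (200, 199), (300, 297), (400, 340), (500, 350)]
knee = find_knee(samples)            # 300: at 400 offered, only 340 complete
rated = rated_capacity(knee)         # about 255 requests per second
print(servers_needed(1000, 2, rated))  # 8
```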
![Page 12: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/12.jpg)
Performance Thumbnails (1)
• Components as a Transfer Function
• FIXME: Insert graphs of resource starvation, deadlock, livelock
![Page 13: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/13.jpg)
Performance Thumbnails (2)
• Create “Expected” Overlay Graph
– X axis is number of requests
– Y axis is:
  • Work performed
  • Resources used
  • Number of requests to each downstream component
• Generate graphs in the production service
– X axis is time
– Y axis is <see above>
• Purpose
– Gives the operations team a good feel for what a component should look like
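One way to use an “expected” overlay is to look up the expected value for the current request count and compare it with what production shows. The sketch below assumes the expected curve is stored as sorted `(requests, value)` points and uses linear interpolation between them; the 20% tolerance is an arbitrary placeholder, not a figure from the slides.

```python
from bisect import bisect_left


def expected_value(curve, requests):
    """Linearly interpolate the expected Y value for a given request count.

    `curve` is a list of (requests, value) points sorted by requests.
    Values outside the measured range are clamped to the nearest endpoint.
    """
    xs = [x for x, _ in curve]
    if requests <= xs[0]:
        return curve[0][1]
    if requests >= xs[-1]:
        return curve[-1][1]
    i = bisect_left(xs, requests)
    (x0, y0), (x1, y1) = curve[i - 1], curve[i]
    return y0 + (y1 - y0) * (requests - x0) / (x1 - x0)


def deviates(curve, requests, observed, tolerance=0.20):
    """True if `observed` differs from the expected value by more than `tolerance`."""
    expected = expected_value(curve, requests)
    return abs(observed - expected) > tolerance * expected


# Hypothetical curve: CPU percent as a function of requests per second.
cpu_curve = [(0, 5.0), (100, 25.0), (200, 45.0)]
print(expected_value(cpu_curve, 150))  # 35.0
print(deviates(cpu_curve, 150, 60.0))  # True
```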
![Page 14: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/14.jpg)
Service Description Markup Details
• Specify Component Relationships Once
– This should only need to be changed when the relationship between components changes
– But Permit Fault Isolated Service Units
• Building Blocks
– component(*) = wildcard, any component of this type
– component(x) = variable substitution, in same service unit
– component(specific) = only the specified service unit
• Example
    cache(*) appserverA(*) http
    appserverA(x) coredb(x) sqlnet
    appserverA(pod1) blobserver(pod1) http
    dbmonitor(ops) coredb(*) http
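The example rules above can be checked mechanically. The sketch below assumes each rule reads “requester(selector) provider(selector) protocol”, that `x` on both sides means the two components must be in the same service unit, and that any selector other than `*` or `x` is a literal service unit name. The slides give only the three building blocks, so the direction of each rule and the parsing details are assumptions.

```python
RULES_TEXT = """
cache(*) appserverA(*) http
appserverA(x) coredb(x) sqlnet
appserverA(pod1) blobserver(pod1) http
dbmonitor(ops) coredb(*) http
"""


def _split(term):
    """Split 'name(selector)' into (name, selector)."""
    name, _, selector = term.partition("(")
    return name, selector.rstrip(")")


def parse_rules(text):
    """Parse lines of the form 'src(sel) dst(sel) protocol' into tuples."""
    rules = []
    for line in text.strip().splitlines():
        src, dst, protocol = line.split()
        rules.append((_split(src), _split(dst), protocol))
    return rules


def _unit_ok(selector, unit):
    # '*' and a lone 'x' match any unit; anything else must match exactly.
    return selector in ("*", "x") or selector == unit


def allowed(rules, src, dst, protocol):
    """Return True if (type, unit) `src` may talk to (type, unit) `dst`."""
    (src_type, src_unit), (dst_type, dst_unit) = src, dst
    for (s_type, s_sel), (d_type, d_sel), proto in rules:
        if (s_type, d_type, proto) != (src_type, dst_type, protocol):
            continue
        if s_sel == "x" and d_sel == "x":
            # Variable substitution: both ends must be in the same service unit.
            if src_unit == dst_unit:
                return True
            continue
        if _unit_ok(s_sel, src_unit) and _unit_ok(d_sel, dst_unit):
            return True
    return False


rules = parse_rules(RULES_TEXT)
print(allowed(rules, ("appserverA", "pod1"), ("coredb", "pod1"), "sqlnet"))  # True
print(allowed(rules, ("appserverA", "pod1"), ("coredb", "pod2"), "sqlnet"))  # False
```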
![Page 15: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/15.jpg)
Generate Configuration and Enforce Service Model!
• Models Typically Don’t Stay Accurate
– The model provides little benefit to the component developers, so updates often lag, just like most “documentation”
– People who need the models typically have to infer the model through path-based analysis or network flow analysis
  • Inferred models typically miss things
  • It’s already broken
• Enforcing the model prevents drift!
– On machines, via kernel packet filters like ipfilter
– In the network
– Changes won’t work unless the model is updated
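Where enforcement is not yet in place, drift can at least be detected by comparing the declared model with observed traffic. This is a weaker, after-the-fact check than the enforcement the slide proposes. The sketch assumes flows have already been mapped to `(source_component, destination_component, protocol)` tuples; that mapping and the function name are hypothetical.

```python
def find_drift(model_edges, observed_flows):
    """Compare declared edges with observed flows.

    Returns (undeclared, unused): flows seen but missing from the model,
    and modeled edges that carried no observed traffic.
    """
    model = set(model_edges)
    observed = set(observed_flows)
    return sorted(observed - model), sorted(model - observed)


# Hypothetical model and flow capture.
model = [("cache", "appserverA", "http"), ("appserverA", "coredb", "sqlnet")]
flows = [("cache", "appserverA", "http"), ("appserverA", "blobserver", "http")]
undeclared, unused = find_drift(model, flows)
print(undeclared)  # [('appserverA', 'blobserver', 'http')]
print(unused)      # [('appserverA', 'coredb', 'sqlnet')]
```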
![Page 16: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/16.jpg)
Advanced Scaling
• As individual components are stressed, see how internal resources are consumed.
– You might want to change hardware/software platforms if one resource is consumed before all others.
• If you increase the number of machines and you don’t get a corresponding performance increase, then something other than internal resources is the bottleneck.
• Testing with all the components might reveal emergent behavior.
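The second point above can be turned into a rough check: compare the throughput gained from adding machines with the gain that linear scaling would predict. The 50% threshold and the function names below are arbitrary assumptions for illustration.

```python
def scaling_efficiency(machines_before, rps_before, machines_after, rps_after):
    """Fraction of the ideal (linear) throughput gain that was actually achieved."""
    if machines_after <= machines_before:
        raise ValueError("machines_after must exceed machines_before")
    ideal_gain = rps_before * (machines_after - machines_before) / machines_before
    actual_gain = rps_after - rps_before
    return actual_gain / ideal_gain


def external_bottleneck_suspected(machines_before, rps_before,
                                  machines_after, rps_after, threshold=0.5):
    """True when added machines delivered less than `threshold` of the ideal gain."""
    efficiency = scaling_efficiency(machines_before, rps_before,
                                    machines_after, rps_after)
    return efficiency < threshold


# Doubling from 4 to 8 machines only moved throughput from 1000 to 1100 rps.
print(scaling_efficiency(4, 1000, 8, 1100))             # 0.1
print(external_bottleneck_suspected(4, 1000, 8, 1100))  # True
```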
![Page 17: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/17.jpg)
Suggested Benefits
• “Hints” which guide machine deployment
– Ratio of machines for various functions
– Number given an anticipated load
• Provides insight to operators (mental model)
• Filters for alert management
• May permit auto-tuned monitors
• Rough insight into bottlenecks
– When adding a machine fails to improve performance, that points to a more serious problem
• Lowers the risk of breaking the service when releasing an updated component
• Can be used for a first pass at IDS and protection
![Page 18: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/18.jpg)
Experiment Manager
• A system which would run each of the components through a load test to get initial scaling numbers
• Use the initial scaling numbers to prioritize which combinations of components should be tested together first.
• Map the space generated by each of the component transfer functions to find an optimal configuration
• Run for as long as time permits
• The ratio of a component’s input to output is really interesting
![Page 19: “Good Enough” Service Model and Description Randy Katz, Jeff Mogul, Giovanni Pacifici, John Sontag, Ion Stoica, Mark Verber](https://reader036.vdocuments.mx/reader036/viewer/2022082517/56649e0d5503460f94af6bbf/html5/thumbnails/19.jpg)
Other Opportunities for Research
• Measurement Methodology
– Look at Margo’s work on micro benchmarks
– Peak to average work-load
• Logging / capture
• Crisp description of inter-dependence
• Map of cliffs
• Exploration of making critical resources first-class items (locks, etc.)
• More natural good/bad for SLT
• How to capture state-oriented bindings