Distributed Debugging
Presenter: Chi-Hung Lu


  • Slide 1
  • Presenter: Chi-Hung Lu
  • Slide 2
  • Problems: Distributed applications are hard to validate. Application state is distributed across many distinct execution environments. Protocols involve complex interactions among a collection of networked machines. Failures must be handled, ranging from network problems to crashing nodes. Intricate sequences of events can trigger complex errors as a result of mishandled corner cases.
  • Slide 3
  • Approaches: logging-based debugging (X-Trace, Bi-directional Distributed BackTracker (BDB), Pip); deterministic replay (WiDS, Friday, Jockey); model checking (MaceMC)
  • Slide 4
  • R. Fonseca et al., NSDI '07
  • Slide 5
  • Problem Description: It is difficult to diagnose the source of a problem in an Internet application. Current network diagnostic tools focus on only one particular protocol and do not share information about the application among the user, the service, and the network operators.
  • Slide 6
  • Examples: traceroute can locate IP connectivity problems but cannot reveal proxy or DNS failures; an HTTP monitoring suite can locate application problems but cannot diagnose routing problems.
  • Slides 7-10
  • Examples: diagrams tracing a request path through the User, DNS Server, Proxy, and Web Server
  • Slide 11
  • X-Trace: An integrated tracing framework. It records the network paths that are taken. X-Trace is invoked when an application task is initiated; X-Trace metadata carrying a task identifier is inserted into the request and propagated down to lower layers through protocol interfaces (see the sketch below).
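A minimal sketch, not the paper's implementation, of how a client might tag a task with X-Trace-style metadata and hand it down to a lower layer; the header name, field layout, and helper functions here are illustrative assumptions.

    import os

    def new_metadata():
        # Metadata for a fresh task: a random TaskID and a root OpID (illustrative layout).
        return {"TaskID": os.urandom(4).hex(), "OpID": os.urandom(4).hex(), "ParentID": None}

    def next_operation(md):
        # Metadata for a child operation: same TaskID, new OpID, chained to the parent.
        return {"TaskID": md["TaskID"], "OpID": os.urandom(4).hex(), "ParentID": md["OpID"]}

    def http_get(url, md):
        # Application layer: carry the metadata in a request header (hypothetical name).
        headers = {"X-Trace": f'{md["TaskID"]},{md["OpID"]},{md["ParentID"]}'}
        # Propagation "down": the lower layer is handed metadata derived from this operation.
        send_over_tcp(url, headers, next_operation(md))

    def send_over_tcp(url, headers, md):
        # Stand-in for a lower layer that would stamp its own messages with the same TaskID.
        print("sending", url, headers, "lower-layer metadata:", md)

    http_get("http://example.com/", new_metadata())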
  • Slide 12
  • Task Tree: X-Trace tags all network operations resulting from a particular task with the same task identifier. The task tree is the set of network operations connected with an initial task. The task tree can be reconstructed after collecting the trace data in the form of reports (a sketch of this reconstruction follows).
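A minimal sketch of rebuilding a task tree offline from collected reports; the report field names follow the metadata fields described later, and the sample values are invented.

    from collections import defaultdict

    def build_task_tree(reports, task_id):
        # Group one task's reports and link each operation to its parent operation.
        children = defaultdict(list)
        for r in reports:
            if r["TaskID"] == task_id:
                children[r["ParentID"]].append(r["OpID"])
        return children  # children[None] holds the root operation(s)

    reports = [
        {"TaskID": "t1", "OpID": "a", "ParentID": None},  # client issues HTTP request
        {"TaskID": "t1", "OpID": "b", "ParentID": "a"},   # proxy forwards it
        {"TaskID": "t1", "OpID": "c", "ParentID": "b"},   # web server handles it
    ]
    print(build_task_tree(reports, "t1"))  # {None: ['a'], 'a': ['b'], 'b': ['c']}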
  • Slide 13
  • An example of the task tree: a simple HTTP request through a proxy
  • Slide 14
  • X-Trace Components: data (X-Trace metadata; network path; task tree) and reports (used to reconstruct the task tree)
  • Slides 15-16
  • Propagation of X-Trace Metadata: the propagation of X-Trace metadata through the task tree (diagrams)
  • Slide 17
  • The X-Trace metadata fields and their usage: Flags (bits that specify which of the three optional components are present); TaskID (a unique integer ID); TreeInfo (ParentID, OpID, EdgeType); Destination (the address to which X-Trace reports should be sent); Options (a mechanism to accommodate future extensions). A structural sketch follows.
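A sketch of that metadata layout as a data structure; the Python representation and field widths are assumptions for illustration, not the wire format.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass
    class XTraceMetadata:
        flags: int                                        # which optional components are present
        task_id: int                                      # unique task identifier
        tree_info: Optional[Tuple[int, int, str]] = None  # (ParentID, OpID, EdgeType)
        destination: Optional[str] = None                 # where reports should be sent
        options: Optional[bytes] = None                   # room for future extensions

    md = XTraceMetadata(flags=0b001, task_id=0xDEADBEEF, tree_info=(0x1, 0x2, "NEXT"))
    print(md)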
  • Slides 18-19
  • Operation of X-Trace Metadata (diagrams)
  • Slides 20-22
  • X-Trace Report Architecture (diagrams)
  • Slide 23
  • Usage Scenario (1): a web request and recursive DNS queries
  • Slide 24
  • Usage Scenario (2): a request fault annotated with user input
  • Slide 25
  • Usage Scenario (3): a client and a server communicate over the i3 overlay network
  • Slides 26-28
  • Usage Scenario (3): the Internet Indirection Infrastructure (i3) (diagrams)
  • Slide 29
  • Usage Scenario (3): tree for normal operation
  • Slide 30
  • Usage Scenario (3): the receiver host fails
  • Slide 31
  • Usage Scenario (3): the middlebox process crashes
  • Slide 32
  • Usage Scenario (3): the middlebox host fails
  • Slide 33
  • Discussion: report loss; non-tree request structures; partial deployment; managing report traffic; security considerations
  • Slide 34
  • X. Liu et al., NSDI '07
  • Slide 35
  • Problem Description: Log mining is both labor-intensive and fragile. Latent bugs are often distributed across multiple nodes. Logs reflect incomplete information about an execution. Distributed applications are non-deterministic.
  • Slide 36
  • Goals: efficiently verify application properties; provide fairly complete information about an execution; reproduce buggy runs deterministically and faithfully
  • Slide 37
  • Approach: Log the actual execution of a distributed system. Apply predicate checking in a centralized simulator over a run driven by testing scripts or replayed from logs. Output a violation report along with message traces. An execution is interpreted as a sequence of events, which are dispatched to corresponding handling routines (see the sketch below).
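A minimal sketch, under assumed names, of the checking loop this describes: each event is dispatched to a handler over simulated state, and the user's predicates are evaluated after every event. None of these identifiers are the actual WiDS API.

    def run_checker(events, handlers, predicates, state):
        # Replay the event sequence and evaluate every predicate after each event.
        violations = []
        for i, event in enumerate(events):
            handlers[event["type"]](state, event)        # dispatch to the handling routine
            for name, pred in predicates.items():
                if not pred(state):
                    violations.append((i, name, dict(state)))
        return violations

    # Toy example: a leader-election trace that violates a "single leader" invariant.
    state = {"leader_count": 0}
    handlers = {"elect": lambda s, e: s.update(leader_count=s["leader_count"] + 1)}
    predicates = {"single_leader": lambda s: s["leader_count"] <= 1}
    events = [{"type": "elect"}, {"type": "elect"}]
    print(run_checker(events, handlers, predicates, state))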
  • Slide 38
  • Components: a versatile script language that allows a developer to refine system properties into straightforward assertions, and a checker that inspects for violations
  • Slide 39
  • Architecture: components of the WiDS Checker (diagram)
  • Slide 40
  • Architecture: Reproduce real runs by logging all non-deterministic events using Lamport's logical clock. Check user-defined predicates, using a versatile scripting language to specify the system states being observed and the predicates for invariants and correctness. Screen out false alarms with auxiliary information for liveness properties. Trace root causes using a visualization tool.
  • Slide 41
  • Programming with WiDS: WiDS APIs are mostly member functions of the WiDSObject class. The WiDS runtime maintains an event queue to buffer pending events and dispatches them to the corresponding handling routines (a sketch of this model follows).
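A minimal sketch of that programming model under assumed names: objects expose handler methods, and a runtime queue buffers and dispatches pending events. SimObject and Runtime are illustrative stand-ins, not the WiDSObject API.

    import heapq

    class SimObject:
        def on_message(self, msg):
            print("handling", msg)

    class Runtime:
        def __init__(self):
            self.queue = []     # pending events as (time, seq, obj, event)
            self.seq = 0        # tie-breaker so heap ordering is deterministic
        def post(self, time, obj, event):
            heapq.heappush(self.queue, (time, self.seq, obj, event))
            self.seq += 1
        def run(self):
            while self.queue:
                _, _, obj, event = heapq.heappop(self.queue)
                obj.on_message(event)   # dispatch to the handling routine

    rt = Runtime()
    node = SimObject()
    rt.post(2, node, "pong")
    rt.post(1, node, "ping")
    rt.run()    # dispatches "ping" then "pong"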
  • Slide 42
  • Enabling Replay: Logging records all WiDS non-determinism, redirects OS calls and logs their results, and embeds a Lamport clock in each outgoing message. Checkpoints support partial replay by saving the WiDS process context. Replay starts from the beginning or from a checkpoint and replays events in serialized Lamport order (see the sketch below).
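A minimal sketch of the Lamport-clock stamping and the serialized replay order; the Node class and its logging format are illustrative assumptions, not WiDS code.

    class Node:
        def __init__(self, name):
            self.name, self.clock, self.log = name, 0, []

        def local_event(self, what):
            self.clock += 1
            self.log.append((self.clock, self.name, what))

        def send(self, other, msg):
            self.clock += 1
            self.log.append((self.clock, self.name, "send " + msg))
            other.receive(self.clock, msg)      # the clock value is embedded in the message

        def receive(self, sender_clock, msg):
            self.clock = max(self.clock, sender_clock) + 1   # Lamport clock update
            self.log.append((self.clock, self.name, "recv " + msg))

    a, b = Node("A"), Node("B")
    a.local_event("init")
    a.send(b, "hello")
    b.local_event("work")
    # Replay: merge the per-node logs and process events in serialized Lamport order.
    for entry in sorted(a.log + b.log):
        print(entry)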
  • Slide 43
  • Checker: Observes memory state. Defines states and evaluates predicates, refreshing its database for each event and maintaining history. Re-evaluates only the modified predicates. Provides auxiliary information for violations, since liveness properties are only guaranteed to hold eventually. A sketch of the re-evaluation step follows.
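A minimal sketch of re-evaluating only the predicates affected by an event; the dependency tracking shown here (mapping each predicate to the state keys it reads) is an assumed simplification.

    def recheck(changed_keys, predicates, state):
        # predicates maps name -> (state keys it depends on, check function).
        results = {}
        for name, (deps, check) in predicates.items():
            if deps & changed_keys:            # skip predicates whose inputs did not change
                results[name] = check(state)
        return results

    state = {"queue_len": 3, "leader": "n1"}
    predicates = {
        "bounded_queue": ({"queue_len"}, lambda s: s["queue_len"] < 100),
        "has_leader":    ({"leader"},    lambda s: s["leader"] is not None),
    }
    state["queue_len"] = 250                   # an event modified the queue length
    print(recheck({"queue_len"}, predicates, state))   # only bounded_queue is rechecked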
  • Slides 44-46 (figures)
  • Slide 47
  • Visualization Tools: message flow graph
  • Slide 48
  • Evaluation: benchmark and result summary
  • Slide 49
  • Performance: running time for evaluating predicates
  • Slide 50
  • Logging Overhead: percentage of logging time
  • Slide 51
  • Discussion: The system is debugged by those who developed it; bugs are hunted by those who are intimately familiar with the system.