Exploring History with Hawk - SUSECON
TRANSCRIPT
Exploring History with Hawk: An Introduction to Cluster Forensics
Kristoffer Grönlund
High Availability Software Developer
2
This tutorial
• High Availability in 5 minutes
• Introduction to HAWK
  ‒ What's new in HAWK 2
• History Explorer
  ‒ Cluster Forensics
  ‒ Example Usage
• Summary
3
About me
• Kristoffer Grönlund
• Developer
  ‒ crmsh
  ‒ hawk
  ‒ resource-agents
• Maintainer
  ‒ fence-agents
  ‒ haproxy
High Availability
5
High Availability
6
What is a cluster?
• Cluster → 1 - 32* Nodes
• Node → Single machine in cluster
  ‒ Hardware or virtualized
  ‒ Remote nodes
• Site → Physical location
  ‒ Local
  ‒ Metro
  ‒ Geographical
* Scale beyond 32 nodes with remote nodes
7
Resources
• Agent Classes
  ‒ Open Cluster Framework (OCF) Agents
‒ resource-agents
‒ systemd services
‒ Fencing agents
‒ Init scripts
• Examples:
  ‒ Web Server, File Server
‒ Databases
‒ Filesystems, IP Addresses
‒ VMs, resources in VMs...
8
Constraints
• Order
  ‒ Start resource A before resource B
• Location
  ‒ Resource A prefers a node
• Colocation
  ‒ Resource A runs with resource B
• Score
  ‒ Mandatory vs. Preference
  ‒ Numeric value or +/- infinity
  ‒ Resource stickiness
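In crmsh configure syntax, the constraint types above could be expressed like this (a sketch; the resource names web and db, the node name node1, and the concrete scores are hypothetical):

```
# Hypothetical resources "web" and "db" on a node "node1".
order ord-db-web inf: db web              # start db before web (mandatory)
location loc-web-node1 web 200: node1     # web prefers node1 with score 200
colocation col-web-db inf: web db         # keep web on the same node as db
rsc_defaults resource-stickiness=100      # prefer to stay where running
```

Positive finite scores express a preference; inf makes the constraint mandatory, and -inf forbids the placement.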
9
Overview
[Architecture diagram: Corosync provides the messaging / infrastructure layer. On top of it sits the resource allocation layer: on the Designated Coordinator (DC), the Cluster Resource Manager runs the Policy Engine and holds the Cluster Information Base (CIB); the other nodes run Cluster Resource Managers with CIB replicas. Local Resource Managers invoke the Resource Agents, which control the Resources.]
10
Fencing
• Dealing with Schrödinger's cat
• Goal: Preventing corruption
• Storage based: SBD
  ‒ Recommended if possible
  ‒ No special hardware required
• Hardware based: IPMI, iLO, …
  ‒ Many supported devices
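Initializing SBD might look roughly like this (a sketch: the device path is a placeholder, and the full procedure is in the SLE HA documentation):

```
# Placeholder device path; use a stable by-id path to the shared disk.
sbd -d /dev/disk/by-id/my-shared-disk create   # initialize the SBD partition
sbd -d /dev/disk/by-id/my-shared-disk dump     # verify the on-disk header

# Then point the sbd daemon at the device in /etc/sysconfig/sbd:
#   SBD_DEVICE="/dev/disk/by-id/my-shared-disk"
```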
11
12
Tools
• crmsh
  ‒ Command line interface
• HAWK
  ‒ Web interface
13
Learn more
• www.suse.com/documentation/sle-ha-12/
• Two node cluster in two commands
node1 # ha-cluster-init
node2 # ha-cluster-join -c node1
Introducing HAWK
15
HAWK - Overview
• “High Availability Web Konsole”
• Monitoring
• Configuration / Administration
• Dashboard
16
HAWK - Technical details
• Installed by ha-cluster-bootstrap
• Runs on the cluster nodes
• Ruby on Rails
• https://<node>:7630/
17
HAWK - Security
• Default user is hacluster
‒ Remember to change the password
• HTTPS for secure access
• Replace SSL certificate with your own
  ‒ /etc/hawk/hawk.key
‒ /etc/hawk/hawk.pem
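A self-signed replacement certificate could be generated with openssl along these lines (the CN value is an assumption; substitute the hostname clients will use to reach the node):

```shell
# Generate a 2048-bit key and a self-signed certificate valid for one year.
# CN=node1 is a placeholder hostname.
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
    -subj "/CN=node1" -keyout hawk.key -out hawk.pem

# Install them (as root) into the paths listed above, then restart HAWK:
#   cp hawk.key /etc/hawk/hawk.key
#   cp hawk.pem /etc/hawk/hawk.pem
```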
HAWK 0.7
19
Status
20
Dashboard
HAWK 2
22
A New Look
• Complete visual overhaul
  ‒ More intuitive
  ‒ Similar to other SUSE tools
• Improved features
  ‒ History Explorer
‒ More powerful wizards
‒ Integrated help
• Supports new cluster features
23
Upgrading to HAWK 2
zypper install hawk2
24
Login
25
Status
26
Dashboard
27
Graph
28
Simulator
29
Simulator, node event
30
Simulator, results
31
Creating resources
32
Command log
Wizards
34
Wizards
• Apply a complete cluster configuration
• Helps configure constraints and groups
• Install and configure required software
35
Wizards
36
Wizard, configuration
37
Wizard, verify changes
38
Wizard, advanced options
39
Wizard, optional steps
40
Wizard, verify changes (1)
41
Wizard, verify changes (2)
42
Command line wizards
crm script
list
show virtual-ip
verify virtual-ip id=admin-ip ip=10.13.37.42
run virtual-ip id=...
History Explorer
44
Cluster Forensics
• Something went wrong
  ‒ How can we figure it out?
  ‒ Pitfalls
• Understanding the cluster logs
  ‒ Use the history explorer
  ‒ Get a cluster report
45
Root Cause Analysis
• Start at the evidence
• Trace backwards
• Know the application
• Assume you know nothing
46
Jumping To Conclusions
• Always stay on the evidence
• When the evidence runs out, we are guessing
• Guessing is OK!
  ‒ But know when you are guessing
47
The Evidence
• Failed Cluster Action
  ‒ Software bugs, crashes
  ‒ Configuration error
• Failed Node
  ‒ Hardware failure
  ‒ Communication error
48
Collecting data
crm report -f '2015-10-10 12:00' -t '2015-10-10 14:00' strange_event
49
Understanding the logs
2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header at http://localhost:8000
2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr#011(Started sle12sp1b)
2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530055+02:00 sle12sp1a crmd[1590]: notice: Processing graph 156 (ref=pe_calc-dc-1444585224-290) derived from /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b
2015-10-11T19:40:24.740118+02:00 sle12sp1a crmd[1590]: notice: Initiating action 6: start admin_addr_start_0 on sle12sp1b
2015-10-11T19:40:24.801183+02:00 sle12sp1a crmd[1590]: notice: Initiating action 1: monitor admin_addr_monitor_10000 on sle12sp1b
2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-55.bz2): Complete
50
Internal components
• Cluster Information Base (CIB)
• Cluster Resource Management daemon (crmd)
• Local Resource Management daemon (lrmd)
• Policy Engine (pengine)
• Fencing daemon (stonithd)
51
Policy Engine
• Designated Controller (DC)
  ‒ Elected automatically
‒ Calculates ideal cluster state
‒ Decides on actions to achieve state
52
Transition
• Sequence of actions to reach new state
• Records state before and after transition
• Saved to /var/lib/pacemaker/pengine/
• Numbered with a sequence number
  ‒ Number sequence may reset to 0 if the DC is re-elected
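One of these saved files can be replayed outside the cluster with crm_simulate (the file name is taken from the log excerpt on the previous slide; exact flags may vary between Pacemaker versions):

```
# Simulate the transition recorded in pe-input-55.bz2 and show the
# actions the policy engine would have scheduled:
crm_simulate -S -x /var/lib/pacemaker/pengine/pe-input-55.bz2
```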
53
Cluster Actions
• <resource>_<action>_<interval> (interval in milliseconds; 0 for one-shot actions)
• Actions‒ start
‒ stop
‒ promote
‒ demote
‒ monitor
‒ migrate_to
‒ migrate_from
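This naming scheme can be taken apart with plain shell parameter expansion; a small sketch (the helper name parse_op_key is made up for illustration):

```shell
# Split a Pacemaker operation key <resource>_<action>_<interval> into parts.
# Resource ids may themselves contain underscores, so peel fields from the right.
parse_op_key() {
    key=$1
    interval=${key##*_}       # trailing field: interval in milliseconds
    rest=${key%_*}
    action=${rest##*_}        # operation name (start, stop, monitor, ...)
    resource=${rest%_*}       # whatever remains is the resource id
    echo "$resource $action $interval"
}

parse_op_key admin_addr_monitor_10000   # → admin_addr monitor 10000
parse_op_key admin_addr_stop_0          # → admin_addr stop 0
```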
54
Cluster Actions
• Error Codes
0: Success
1: Generic Error
2: Argument Error
3: Unimplemented Action
4: Insufficient Permissions
5: Required Component Is Missing
6: Configuration Error
7: Resource Was Not Running
8: Running As Primary
9: Failed As Primary
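When reading failed-action output, a small shell helper could decode these codes into the labels above; a sketch (the function name is made up):

```shell
# Translate an OCF resource agent exit code into the label from the table.
ocf_rc_name() {
    case $1 in
        0) echo "Success" ;;
        1) echo "Generic Error" ;;
        2) echo "Argument Error" ;;
        3) echo "Unimplemented Action" ;;
        4) echo "Insufficient Permissions" ;;
        5) echo "Required Component Is Missing" ;;
        6) echo "Configuration Error" ;;
        7) echo "Resource Was Not Running" ;;
        8) echo "Running As Primary" ;;
        9) echo "Failed As Primary" ;;
        *) echo "Unknown ($1)" ;;
    esac
}

ocf_rc_name 7   # → Resource Was Not Running
```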
55
Cluster Action Failure
• Unexpected result when performing action
• Triggers transition
• May also trigger fencing (stop failure)
56
Node Failure
• Quorum = Majority vote
  ‒ Improves availability
  ‒ Avoids fence loops
  ‒ Downside: Need more nodes
• Smaller partitions are fenced
57
Node Failure
• Crash / reboot
• Network issues
• Leads to chaos without fencing
  ‒ Cluster no longer knows if node is running resources
• Uncommunicative nodes are fenced
  ‒ Enforces a known state
58
History Explorer
• Command line:
  ‒ crm history
• Collect logs from cluster nodes
• Analyse transitions
• Present summary of events
• View configuration
• Transition graph
• Transition diff
• Extract logs during a particular transition
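On the command line, a session with the history explorer might look like this (subcommand names are from crmsh's history level; the timestamps follow the earlier log excerpt, and exact arguments may vary by version):

```
crm history
timeframe "2015-10-11 19:30" "2015-10-11 20:00"   # narrow the scope
info                                              # summary of events
transition                                        # analyse the latest transition
```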
Example configuration
[Diagram: nodes demo-node1 and demo-node2; web server resources srv1 and srv2, each with a location score of 200; a group g-proxy containing the proxy and proxy-vip resources; a ping-based score of 50.]
65
Example Description
• Two web servers
  ‒ Port 8000
• HAProxy
  ‒ Port 80
  ‒ Load balancer (round robin)
• Failed action: proxy killed with kill -9, detected by the monitor operation
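The HAProxy side of this setup could be configured roughly like this (a sketch: host and backend names follow the diagram, all other values are assumptions):

```
# Minimal haproxy.cfg sketch: round-robin over the two web servers.
defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s

frontend www
    bind *:80
    default_backend webfarm

backend webfarm
    balance roundrobin
    server srv1 demo-node1:8000 check
    server srv2 demo-node2:8000 check
```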
66
Failed Action
77
Pitfalls
78
Too many logs
• History explorer can get slow
  ‒ Run HAWK in offline mode to avoid burdening the cluster
• Find the relevant transitions
• Narrow the scope
• Command line:
  ‒ timeframe <from> <to>
79
End of the tracks
• Analysing action failure
  ‒ Example: monitor fails for unknown reasons
  ‒ Probes: before starting a resource, Pacemaker checks if it is already running
  ‒ Success Is Failure: a probe that finds the resource running is reported as an error
• Know your application
  ‒ Start at the action failure, read application logs backwards
  ‒ At this point, the cluster can't help you
80
General Confusion
• Which node wrote this log?
  ‒ Was it even running the resource in question?
• Get back to the evidence
  ‒ If in doubt, start over
• Cancelled Transitions
  ‒ Sometimes, the history explorer gets confused
‒ Fencing can cancel a transition
‒ By default, Pacemaker fences offline nodes at startup
81
Possible Problems
• Network Latency
  ‒ Does your network fulfill the requirements?
• Disk is full
• Misconfiguration
  ‒ Use csync2 or a configuration management tool
• Fencing device failure
  ‒ Is fencing enabled?
‒ Does the fencing device work?
‒ Use SBD
82
Resource tracing
• crm resource trace <resource>
• /var/lib/heartbeat/trace_ra/<agent>/
• Note: Trace is written on the node where the resource runs
• Complete trace of every action
  ‒ Can be a lot of data: remember to untrace!
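Tracing the proxy resource from the earlier example could look like this (resource and agent names follow the demo; the trace directory is the one listed above):

```
# Enable tracing for the resource (optionally limit it to one operation):
crm resource trace proxy monitor

# ...reproduce the failure, then inspect the per-action trace files:
ls /var/lib/heartbeat/trace_ra/haproxy/

# Traces grow quickly: switch them off again when done.
crm resource untrace proxy
```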
83
Summary
• Try The New Hawk
• Use The History Explorer
• Follow The Evidence
  ‒ Action Failure Leads To Actions
‒ Node Failure Leads To Fencing
‒ Without Fencing, Anything Can Happen
84
Open Source
https://github.com/ClusterLabs/hawk
https://github.com/ClusterLabs/crmsh
Thank you.
85
Questions?
www.suse.com
86
Unpublished Work of SUSE LLC. All Rights Reserved.
This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.
General Disclaimer
This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.