exploring history with hawk - susecon · 2020. 7. 2. · corosync messaging / infrastructure...

Exploring History with Hawk An Introduction to Cluster Forensics Kristoffer Grönlund High Availability Software Developer [email protected]


Exploring History with Hawk

An Introduction to Cluster Forensics

Kristoffer Grönlund

High Availability Software Developer

[email protected]


This tutorial

• High Availability in 5 minutes

• Introduction to HAWK

‒ What's new in HAWK 2

• History Explorer

‒ Cluster Forensics

‒ Example Usage

• Summary


About me

• Kristoffer Grönlund

‒ Developer

‒ crmsh

‒ hawk

‒ resource-agents

‒ Maintainer

‒ fence-agents

‒ haproxy

High Availability

What is a cluster?

• Cluster → 1 - 32* Nodes

• Node → Single machine in cluster

‒ Hardware or virtualized

‒ Remote nodes

• Site → Physical location

‒ Local

‒ Metro

‒ Geographical

* Scale beyond 32 nodes with remote nodes


Resources

• Agent Classes

‒ Open Cluster Framework (OCF) Agents

‒ resource-agents

‒ systemd services

‒ Fencing agents

‒ Init scripts

• Examples:

‒ Web Server, File Server

‒ Databases

‒ Filesystems, IP Addresses

‒ VMs, resources in VMs...


Constraints

• Order

‒ Start resource A before resource B

• Location

‒ Resource A prefers node

• Colocation

‒ Resource A with resource B

• Score

‒ Mandatory vs. Preference

‒ Numeric value or +/- infinity

‒ Resource stickiness
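A minimal sketch of how scores combine, assuming Pacemaker's convention that infinity is represented by the value 1,000,000 and that -infinity wins over +infinity. The name `add_scores` is made up for this illustration and is not part of any cluster tool.

```python
INFINITY = 1_000_000  # assumed representation of "infinity" as a score


def add_scores(a: int, b: int) -> int:
    """Combine two scores, treating +/- INFINITY as mandatory values."""
    # -INFINITY (a mandatory "never") beats everything, including +INFINITY.
    if a <= -INFINITY or b <= -INFINITY:
        return -INFINITY
    if a >= INFINITY or b >= INFINITY:
        return INFINITY
    # Finite preferences add up, clamped to the infinite range.
    return max(-INFINITY, min(INFINITY, a + b))


print(add_scores(200, 50))              # a location preference plus stickiness
print(add_scores(200, INFINITY))        # a mandatory placement
print(add_scores(INFINITY, -INFINITY))  # a ban overrides a mandatory placement
```

Finite scores express preferences that can be outweighed; an infinite score turns the constraint into a rule.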


Overview

Diagram: layered cluster architecture (Corosync, Pacemaker, Resources)

• Messaging / Infrastructure

‒ Corosync

• Resource Allocation

‒ Designated Coordinator (DC): Cluster Resource Manager, Policy Engine, Cluster Information Base (CIB), Local Resource Manager

‒ Other node: Cluster Resource Manager, CIB Replica, Local Resource Manager

• Resource Agents

‒ Resources


Fencing

• Dealing with Schrödinger's cat

• Goal: Preventing corruption

• Storage based: SBD

‒ Recommended if possible

‒ No special hardware required

• Hardware based: IPMI, iLO, …

‒ Many supported devices


Tools

• crmsh

‒ Command line interface

• HAWK

‒ Web interface


Learn more

• www.suse.com/documentation/sle-ha-12/

• Two node cluster in two commands

node1 # ha-cluster-init

node2 # ha-cluster-join -c node1


Introducing HAWK


HAWK - Overview

• “High Availability Web Konsole”

• Monitoring

• Configuration / Administration

• Dashboard


HAWK - Technical details

• Installed by ha-cluster-bootstrap

• Runs on the cluster nodes

• Ruby on Rails

• https://<node>:7630/


HAWK - Security

• Default user is hacluster

‒ Remember to change the password

• HTTPS for secure access

• Replace SSL certificate with your own

‒ /etc/hawk/hawk.key

‒ /etc/hawk/hawk.pem

HAWK 0.7

Status

Dashboard

HAWK 2


A New Look

• Complete visual overhaul

‒ More intuitive

‒ Similar to other SUSE tools

• Improved features

‒ History Explorer

‒ More powerful wizards

‒ Integrated help

• Supports new cluster features


Upgrading to HAWK 2

zypper install hawk2

Login

Status

Dashboard

Graph

Simulator

Simulator, node event

Simulator, results

Creating resources

Command log

Wizards

• Apply a complete cluster configuration

• Helps configure constraints and groups

• Install and configure required software

Wizard, configuration

Wizard, verify changes

Wizard, advanced options

Wizard, optional steps

Wizard, verify changes (1)

Wizard, verify changes (2)

Command line wizards

crm script list

crm script show virtual-ip

crm script verify virtual-ip id=admin-ip ip=10.13.37.42

crm script run virtual-ip id=...


History Explorer


Cluster Forensics

• Something went wrong

‒ How can we figure it out?

‒ Pitfalls

• Understanding the cluster logs

‒ Use the history explorer

‒ Get a cluster report


Root Cause Analysis

• Start at the evidence

• Trace backwards

• Know the application

• Assume you know nothing


Jumping To Conclusions

• Always stay on the evidence

• When the evidence runs out, we are guessing

• Guessing is OK!

‒ But know when you are guessing


The Evidence

• Failed Cluster Action

‒ Software bugs, crashes

‒ Configuration error

• Failed Node

‒ Hardware failure

‒ Communication error


Collecting data

crm report -f '2015-10-10 12:00' -t '2015-10-10 14:00' strange_event
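A small sketch of building the same command for a time window around an incident. It only assembles the argument list and does not run anything; the `-f`, `-t` and report-name arguments are taken from the command above, while `report_command` and its parameters are hypothetical names for this illustration.

```python
from datetime import datetime, timedelta


def report_command(incident: datetime, before: timedelta, after: timedelta, name: str) -> list[str]:
    """Build (but do not run) a `crm report` command line for a window around an incident."""
    fmt = "%Y-%m-%d %H:%M"  # same timestamp format as in the command above
    start = (incident - before).strftime(fmt)
    end = (incident + after).strftime(fmt)
    return ["crm", "report", "-f", start, "-t", end, name]


cmd = report_command(
    datetime(2015, 10, 10, 13, 0),
    before=timedelta(hours=1),
    after=timedelta(hours=1),
    name="strange_event",
)
print(cmd)
```

Keeping the window narrow keeps the resulting report small, which also helps when it is later loaded into the history explorer.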


Understanding the logs

2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]
2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header at http://localhost:8000
2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr#011(Started sle12sp1b)
2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530055+02:00 sle12sp1a crmd[1590]: notice: Processing graph 156 (ref=pe_calc-dc-1444585224-290) derived from /var/lib/pacemaker/pengine/pe-input-55.bz2
2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b
2015-10-11T19:40:24.740118+02:00 sle12sp1a crmd[1590]: notice: Initiating action 6: start admin_addr_start_0 on sle12sp1b
2015-10-11T19:40:24.801183+02:00 sle12sp1a crmd[1590]: notice: Initiating action 1: monitor admin_addr_monitor_10000 on sle12sp1b
2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-55.bz2): Complete
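Each log entry follows the pattern `<timestamp> <host> <source>[<pid>]: <message>`. A minimal sketch of splitting an entry into those fields, assuming the syslog layout seen in the excerpt; `LogEntry` and `parse_line` are hypothetical names, not part of crmsh or Hawk.

```python
import re
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    timestamp: str
    host: str
    source: str  # e.g. crmd, pengine, or a resource agent such as apache(srv2)
    pid: int
    message: str


# <timestamp> <host> <source>[<pid>]: <message>
LINE_RE = re.compile(r"^(\S+) (\S+) ([^\[\s]+)\[(\d+)\]: (.*)$")


def parse_line(line: str) -> Optional[LogEntry]:
    """Return the fields of one log entry, or None if the line has another shape."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts, host, source, pid, message = m.groups()
    return LogEntry(ts, host, source, int(pid), message)


sample = (
    "2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: "
    "Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2"
)
entry = parse_line(sample)
print(entry.host, entry.source, entry.pid)
```

Knowing which host and which daemon wrote an entry is the first step in tracing backwards from a failure.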


Internal components

• Cluster Information Base (CIB)

• Cluster Resource Management daemon (crmd)

• Local Resource Management daemon (lrmd)

• Policy Engine (pengine)

• Fencing daemon (stonithd)


Policy Engine

• Designated Controller (DC)

‒ Elected automatically

‒ Calculates ideal cluster state

‒ Decides on actions to achieve state


Transition

• Sequence of actions to reach new state

• Records state before and after transition

• Saved to /var/lib/pacemaker/pengine/

• Numbered with sequence number

‒ Number sequence may reset to 0 if DC is re-elected
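Because the sequence can reset, sorting saved transitions by number alone may put them in the wrong order. A minimal sketch that spots such resets in a list of file names already ordered by time, assuming the `pe-input-<n>.bz2` naming seen in the log excerpt; `sequence_number` and `find_resets` are hypothetical helpers.

```python
import re

# File naming assumed from the log excerpt: .../pe-input-55.bz2
PE_FILE_RE = re.compile(r"pe-input-(\d+)\.bz2$")


def sequence_number(path: str) -> int:
    """Extract the sequence number from a saved policy engine input file name."""
    m = PE_FILE_RE.search(path)
    if m is None:
        raise ValueError(f"not a policy engine input file: {path}")
    return int(m.group(1))


def find_resets(paths_in_time_order: list[str]) -> list[int]:
    """Return the indexes at which the sequence number drops below its predecessor."""
    numbers = [sequence_number(p) for p in paths_in_time_order]
    return [i for i in range(1, len(numbers)) if numbers[i] < numbers[i - 1]]


paths = [
    "/var/lib/pacemaker/pengine/pe-input-54.bz2",
    "/var/lib/pacemaker/pengine/pe-input-55.bz2",
    "/var/lib/pacemaker/pengine/pe-input-0.bz2",  # numbering started over here
    "/var/lib/pacemaker/pengine/pe-input-1.bz2",
]
print(find_resets(paths))
```

When a reset shows up, rely on timestamps rather than sequence numbers to order the transitions.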


Cluster Actions

• <resource>_<action>_<nn>

• Actions

‒ start

‒ stop

‒ promote

‒ demote

‒ monitor

‒ migrate_to

‒ migrate_from
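A minimal sketch of splitting an action name such as `admin_addr_monitor_10000` into its parts. It parses from the right, since both resource names and some action names contain underscores. The trailing number is read here as the operation interval in milliseconds (0 for one-off actions), which is an interpretation of `<nn>` based on the log excerpt; `parse_action_key` is a hypothetical helper.

```python
ACTIONS = ("start", "stop", "promote", "demote", "monitor", "migrate_to", "migrate_from")


def parse_action_key(key: str) -> tuple[str, str, int]:
    """Split '<resource>_<action>_<nn>' into (resource, action, nn)."""
    head, _, last = key.rpartition("_")
    if not head or not last.isdigit():
        raise ValueError(f"unexpected action key: {key}")
    # Try the longest action names first so that 'migrate_to' is matched as a whole.
    for action in sorted(ACTIONS, key=len, reverse=True):
        suffix = "_" + action
        if head.endswith(suffix) and len(head) > len(suffix):
            return head[: -len(suffix)], action, int(last)
    raise ValueError(f"unknown action in key: {key}")


print(parse_action_key("admin_addr_monitor_10000"))
print(parse_action_key("admin_addr_stop_0"))
```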


Cluster Actions

• Error Codes

0: Success

1: Generic Error

2: Argument Error

3: Unimplemented Action

4: Insufficient Permissions

5: Required Component Is Missing

6: Configuration Error

7: Resource Was Not Running

8: Running As Primary

9: Failed As Primary
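The codes above can be kept in a small lookup table when reading logs by script. This is a sketch only: the descriptions are copied from the list, and `describe` and `is_unexpected` are hypothetical helper names. An action is treated as failed when its result differs from the expected one.

```python
# Return codes and descriptions as listed above.
RETURN_CODES = {
    0: "Success",
    1: "Generic Error",
    2: "Argument Error",
    3: "Unimplemented Action",
    4: "Insufficient Permissions",
    5: "Required Component Is Missing",
    6: "Configuration Error",
    7: "Resource Was Not Running",
    8: "Running As Primary",
    9: "Failed As Primary",
}


def describe(rc: int) -> str:
    """Translate a numeric return code into the description from the list."""
    return RETURN_CODES.get(rc, f"Unknown return code {rc}")


def is_unexpected(rc: int, expected: int = 0) -> bool:
    """True when the result differs from what the cluster expected."""
    return rc != expected


# A recurring monitor on a started resource expects 0; 7 means it has stopped.
print(describe(7), is_unexpected(7))
```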


Cluster Action Failure

• Unexpected result when performing action

• Triggers transition

• May also trigger fencing (stop failure)


Node Failure

• Quorum = Majority vote

‒ Improves availability

‒ Avoids fence loops

‒ Downside: Need more nodes

• Smaller partitions are fenced
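The majority rule can be written down in two lines. This sketch assumes one vote per node and ignores special cases such as dedicated two-node settings; `quorum_threshold` and `has_quorum` are names made up for the example.

```python
def quorum_threshold(total_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return total_nodes // 2 + 1


def has_quorum(partition_size: int, total_nodes: int) -> bool:
    """True if a partition of this size holds a majority of the cluster."""
    return partition_size >= quorum_threshold(total_nodes)


# A five-node cluster split 3/2: only the larger partition keeps quorum.
print(has_quorum(3, 5), has_quorum(2, 5))
# A four-node cluster split 2/2: neither side has a majority.
print(has_quorum(2, 4))
```

The even split shows the downside mentioned above: with an even node count, a clean split leaves no partition with a majority, so more (or an odd number of) nodes are needed.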


Node Failure

• Crash / reboot

• Network issues

• Leads to chaos without fencing

‒ Cluster no longer knows if node is running resources

• Uncommunicative nodes are fenced

‒ Enforces a known state


History Explorer

• Command line:

‒ crm history

• Collect logs from cluster nodes

• Analyse transitions

• Present summary of events

• View configuration

• Transition graph

• Transition diff

• Extract logs during a particular transition
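To illustrate what extracting the logs for one transition means, here is a rough sketch that collects the lines between a "Calculated Transition" message and the matching "Complete" message. It relies only on the message wording in the log excerpt from the "Understanding the logs" slide and is not how the history explorer itself is implemented; `transition_lines` is a hypothetical helper.

```python
import re


def transition_lines(lines: list[str], number: int) -> list[str]:
    """Return the log lines from 'Calculated Transition <n>' up to its 'Complete' line."""
    start = re.compile(rf"Calculated Transition {number}:")
    end = re.compile(rf"Transition {number} \(.*\): Complete")
    collected: list[str] = []
    inside = False
    for line in lines:
        if not inside and start.search(line):
            inside = True
        if inside:
            collected.append(line)
            if end.search(line):
                break
    return collected


log = [
    "2015-10-11T19:40:24.528651+02:00 sle12sp1a pengine[1589]: notice: Restart admin_addr#011(Started sle12sp1b)",
    "2015-10-11T19:40:24.528851+02:00 sle12sp1a pengine[1589]: notice: Calculated Transition 156: /var/lib/pacemaker/pengine/pe-input-55.bz2",
    "2015-10-11T19:40:24.530701+02:00 sle12sp1a crmd[1590]: notice: Initiating action 16: stop admin_addr_stop_0 on sle12sp1b",
    "2015-10-11T19:40:24.836022+02:00 sle12sp1a crmd[1590]: notice: Transition 156 (Complete=4, Pending=0, Fired=0, Skipped=0, Incomplete=0, Source=/var/lib/pacemaker/pengine/pe-input-55.bz2): Complete",
]
result = transition_lines(log, 156)
for line in result:
    print(line)
```

Note that policy engine decisions logged just before the "Calculated Transition" line, such as the `Restart admin_addr` entry, fall outside this simple window.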


Example configuration

Diagram: nodes demo-node1 and demo-node2; resources srv1 and srv2; group g-proxy with proxy, proxy-vip and ping; constraint scores 200, 200 and 50


Example Description

• Two web servers

‒ Port 8000

• HAProxy

‒ Port 80

‒ Load balancer (round robin)

• Failed action: kill -9 proxy detected by monitor

Failed Action

History Explorer

Pitfalls


Too many logs

• History explorer can get slow

‒ Run HAWK in offline mode to avoid burdening cluster

• Find the relevant transitions

• Narrow the scope

• Command line:

‒ timeframe <from> <to>
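Narrowing the scope amounts to keeping only the entries whose timestamp falls inside a window. A minimal sketch of that filter, assuming each line starts with an ISO 8601 timestamp carrying a UTC offset, as in the log excerpt on the "Understanding the logs" slide; `in_timeframe` is a hypothetical helper and not the implementation behind `timeframe`.

```python
from datetime import datetime


def in_timeframe(line: str, start: datetime, end: datetime) -> bool:
    """True if the line's leading timestamp lies within [start, end]."""
    stamp = line.split(" ", 1)[0]
    try:
        when = datetime.fromisoformat(stamp)
    except ValueError:
        return False  # no parseable timestamp at the start of the line
    if when.tzinfo is None:
        return False  # cannot be compared with offset-aware bounds
    return start <= when <= end


log = [
    "2015-10-11T19:40:11.717167+02:00 sle12sp1a crmd[1590]: notice: State transition S_TRANSITION_ENGINE -> S_IDLE [ input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd ]",
    "2015-10-11T19:40:19.777412+02:00 sle12sp1a apache(srv2)[20777]: INFO: Successfully retrieved http header at http://localhost:8000",
    "2015-10-11T19:40:24.524292+02:00 sle12sp1a crmd[1590]: notice: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]",
]
start = datetime.fromisoformat("2015-10-11T19:40:20+02:00")
end = datetime.fromisoformat("2015-10-11T19:40:30+02:00")
selected = [line for line in log if in_timeframe(line, start, end)]
print(len(selected))
```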


End of the tracks

• Analysing action failure

‒ Example: monitor fails for unknown reasons

‒ Probes

‒ Before starting a resource, Pacemaker checks if it is running

‒ Success Is Failure

• Know your application

‒ Start at action failure, read application logs backwards

‒ At this point, the cluster can't help you


General Confusion

• Which node wrote this log?

‒ Was it even running the resource in question?

• Get back to the evidence

‒ If in doubt, start over

• Cancelled Transitions

‒ Sometimes, the history explorer gets confused

‒ Fencing can cancel a transition

‒ By default, Pacemaker fences offline nodes at startup


Possible Problems

• Network Latency

‒ Does your network fulfill the requirements?

• Disk is full

• Misconfiguration

‒ Use csync2 or configuration management tool

• Fencing device failure

‒ Is fencing enabled?

‒ Does the fencing device work?

‒ Use SBD


Resource tracing

• crm resource trace <resource>

• /var/lib/heartbeat/trace_ra/<agent>/

• Note: Trace is written on node where resource runs

• Complete trace of every action

‒ Can be a lot of data: remember to untrace!


Summary

• Try The New Hawk

• Use The History Explorer

• Follow The Evidence

‒ Action Failure Leads To Actions

‒ Node Failure Leads To Fencing

‒ Without Fencing, Anything Can Happen


Open Source

https://github.com/ClusterLabs/hawk

https://github.com/ClusterLabs/crmsh

Thank you.

Questions?

www.suse.com

Unpublished Work of SUSE LLC. All Rights Reserved.

This work is an unpublished work and contains confidential, proprietary and trade secret information of SUSE LLC. Access to this work is restricted to SUSE employees who have a need to know to perform tasks within the scope of their assignments. No part of this work may be practiced, performed, copied, distributed, revised, modified, translated, abridged, condensed, expanded, collected, or adapted without the prior written consent of SUSE. Any use or exploitation of this work without authorization could subject the perpetrator to criminal and civil liability.

General Disclaimer

This document is not to be construed as a promise by any participating company to develop, deliver, or market a product. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. SUSE makes no representations or warranties with respect to the contents of this document, and specifically disclaims any express or implied warranties of merchantability or fitness for any particular purpose. The development, release, and timing of features or functionality described for SUSE products remains at the sole discretion of SUSE. Further, SUSE reserves the right to revise this document and to make changes to its content, at any time, without obligation to notify any person or entity of such revisions or changes. All SUSE marks referenced in this presentation are trademarks or registered trademarks of Novell, Inc. in the United States and other countries. All third-party trademarks are the property of their respective owners.