HawkEye: A Monitoring and Management Tool for Distributed Systems
DESCRIPTION
HawkEye: A Monitoring and Management Tool for Distributed Systems. Todd Tannenbaum, Department of Computer Sciences, University of Wisconsin-Madison, http://www.cs.wisc.edu/condor. What does Condor have? …lots of core technology for building a distributed system.
TRANSCRIPT
HawkEye: A Monitoring and Management Tool for Distributed Systems
Todd Tannenbaum
Department of Computer Sciences
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
What does Condor have?
› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine
› …lots of core technology for managing a workload of tasks
› …lots of really, truly skilled and experienced developers and researchers at building distributed systems. Some of the best. Standout state employees. Honest. Email for Wisconsin Gov Scott McCallum:
One day an avid Condor user asked:
Say, could Condor technology be used for distributed system administration??
Time to think…
› Gathered up our experiences with our own management tasks, looked at the mature Condor technology available to us, and the HawkEye effort was born.
› Completely separate from Condor from the end user's perspective. Can install HawkEye, or Condor, or both.
First Component: MONITORING
› Sysadmins first need information about what is happening on the machines they are responsible for.
  › Both current and past
  › Information must be consolidated and easily accessible
  › Information must be dynamic
Condor ClassAds
› Technology for an entity to describe itself
› Simple attribute value pairs
    [
      load_average = 1.3
      free_Swap_space_mb = 140
      number_of_processes = 92
      keyboard_idle_secs = 6
      ram = 128
      total_swap = 512
      total_memory = ram + total_swap
      busy = load_average > 1.0
    ]
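
The attribute list above mixes plain values (load_average) with expressions over other attributes (total_memory, busy). The following is a minimal Python sketch of that idea, not the Condor ClassAd library: the ToyAd class is hypothetical, and Python's own eval stands in for the real ClassAd expression language, which has its own syntax and semantics.

```python
class ToyAd:
    """Toy stand-in for a ClassAd: attribute names mapped to values or expressions."""

    def __init__(self, **attrs):
        self.attrs = attrs

    def __getitem__(self, name):
        # eval() calls this to resolve attribute names used inside an expression.
        return self.evaluate(name)

    def evaluate(self, name):
        value = self.attrs[name]
        if isinstance(value, str):
            # Assumption of this sketch: strings are expressions over other attributes.
            return eval(value, {"__builtins__": {}}, self)
        return value


ad = ToyAd(
    load_average=1.3,
    ram=128,
    total_swap=512,
    total_memory="ram + total_swap",
    busy="load_average > 1.0",
)
print(ad.evaluate("total_memory"))  # 640
print(ad.evaluate("busy"))          # True
```

Because expressions are evaluated on demand, changing load_average changes busy without touching the expression itself.
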
Condor ClassAds, cont.
› No fixed schema
› Attributes can contain values or expressions
› Serialize Ads in XML
› Open source libraries in C++ and Java to:
  › Manipulate Ads and Ad attributes
  › Store Ads
  › Query collections of Ads
› Bindings for Perl and others on the way…
[Diagram: two HawkEye Monitoring Agents and a HawkEye Manager, connected by ClassAd updates via secure UDP]
[Diagram: one HawkEye Manager with many HawkEye Monitoring Agents]
HawkEye Monitoring Agent
[Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; ClassAd updates via secure UDP]
Monitor Agent, cont.
› Updates are sent periodically
  › Information does not get stale
› Updates also serve as a heartbeat monitor
  › Know when a machine is down
› Out of the box, the update ClassAd has many attributes about the machine of interest for system administration
  › Current prototype = 184 attributes
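
To illustrate how periodic updates can double as a heartbeat, here is a small Python sketch. It is not HawkEye code: the HeartbeatTracker class, the 60-second update interval, and the rule of three missed updates are all assumptions made for the example, since the slides do not state how long the manager waits before treating a machine as down.

```python
import time


class HeartbeatTracker:
    """Hypothetical tracker: a machine is 'down' if its updates stop arriving."""

    def __init__(self, update_interval_secs=60, missed_updates_allowed=3):
        # Assumed policy: tolerate a few missed updates before raising the alarm.
        self.timeout = update_interval_secs * missed_updates_allowed
        self.last_seen = {}

    def record_update(self, machine, now=None):
        # Every incoming update refreshes the machine's last-seen timestamp.
        self.last_seen[machine] = time.time() if now is None else now

    def machines_down(self, now=None):
        now = time.time() if now is None else now
        return sorted(m for m, t in self.last_seen.items() if now - t > self.timeout)


tracker = HeartbeatTracker(update_interval_secs=60, missed_updates_allowed=3)
tracker.record_update("node1", now=0)
tracker.record_update("node2", now=0)
tracker.record_update("node1", now=200)   # node2 has gone silent
print(tracker.machines_down(now=250))     # ['node2']
```
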
What if I want to monitor something you didn’t think about?
Custom Attributes
[Diagram labels: /proc, kstat…; Hawkeye_Startup_Agent; Hawkeye_Monitor; HawkEye Monitoring Agent; HawkEye Manager; data from the hawkeye_update_attribute command line tool]
Create your own HawkEye plugins, or share plugins with others
Role of HawkEye Manager
› Store all incoming ClassAds in an indexed resident data structure
  › Fast response to client tool queries about current state
  › “Show me all machines with a load average > 10”
› Periodically store ClassAd attributes into a Round Robin Database
  › Store information over time
  › “Show me a graph with the load average for this machine over the past week”
› Speak to clients via CEDAR, HTTP
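
The two storage roles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Manager's implementation: ToyManager is a hypothetical class, ads are plain dictionaries, the "index" is only a lookup by machine name (queries scan every ad), and a fixed-length deque stands in for a Round Robin Database, which in real tools also consolidates older samples.

```python
from collections import defaultdict, deque


class ToyManager:
    """Hypothetical manager: latest ad per machine plus a fixed-size history."""

    def __init__(self, history_slots=4):
        self.current = {}  # machine name -> most recent ad
        # Oldest samples are overwritten once the slots are full.
        self.history = defaultdict(lambda: deque(maxlen=history_slots))

    def receive(self, machine, ad):
        self.current[machine] = ad

    def query(self, predicate):
        # Answer "show me all machines where ..." from the in-memory ads.
        return sorted(m for m, ad in self.current.items() if predicate(ad))

    def sample(self, attribute):
        # Called periodically: record one attribute value per machine.
        for machine, ad in self.current.items():
            if attribute in ad:
                self.history[machine].append(ad[attribute])


manager = ToyManager(history_slots=3)
for load in (2.0, 4.0, 11.5, 12.0):
    manager.receive("node1", {"load_average": load})
    manager.receive("node2", {"load_average": 0.5})
    manager.sample("load_average")

print(manager.query(lambda ad: ad["load_average"] > 10))  # ['node1']
print(list(manager.history["node1"]))                      # [4.0, 11.5, 12.0]
```

The history for node1 holds only the three most recent samples; the first one (2.0) has been overwritten, which is the property that keeps a round-robin store at a constant size.
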
Several different clients
› Command-line, GUI, Web-based
But sysadmins also sometimes have to do work…
› Task: copy a new library onto the local disk of each machine.
  › Just a script to copy via rcp/scp to every machine… or is it?
Running tasks on behalf of the sysadmin
› Submit your sysadmin tasks to HawkEye
  › Tasks are stored in a persistent queue by the Manager
  › Tasks can leave the queue upon completion, or repeat after specified intervals
  › Tasks can have complex interdependencies via DAGMan
  › Records are kept on which task ran where
› Sounds like Condor, eh? Yes, but simpler…
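
The once-only versus repeating behavior, and the record of which task ran where, can be sketched as follows. This is a hypothetical in-memory model (ToyTaskQueue and the task names are invented for the example); it omits persistence and DAGMan-style dependencies, and it assumes repeat intervals are positive.

```python
import heapq


class ToyTaskQueue:
    """Hypothetical task queue with once-only and repeating tasks."""

    def __init__(self):
        self._heap = []      # entries: (due_time, sequence, name, repeat_secs)
        self._seq = 0        # tie-breaker so equal due times keep submit order
        self.audit_log = []  # entries: (time, task name, machine)

    def submit(self, name, due=0, repeat_secs=None):
        heapq.heappush(self._heap, (due, self._seq, name, repeat_secs))
        self._seq += 1

    def run_due(self, now, machine):
        # Run every task whose due time has passed, logging where it ran.
        while self._heap and self._heap[0][0] <= now:
            _, _, name, repeat_secs = heapq.heappop(self._heap)
            self.audit_log.append((now, name, machine))
            if repeat_secs is not None:
                # Repeating tasks re-enter the queue; one-shot tasks leave it.
                self.submit(name, due=now + repeat_secs, repeat_secs=repeat_secs)


queue = ToyTaskQueue()
queue.submit("copy_library")                 # runs once
queue.submit("check_disk", repeat_secs=300)  # repeats every 300 seconds
queue.run_due(now=0, machine="node1")
queue.run_due(now=300, machine="node1")
print([name for _, name, _ in queue.audit_log])
# ['copy_library', 'check_disk', 'check_disk']
```
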
Run Tasks in response to monitoring information
› ClassAd “Requirements” Attribute
› Example: Send email if a machine is low on disk space or low on swap space
  › Submit an email task with an attribute:
    Requirements = free_disk < 5 || free_swap < 5
› Example w/ task interdependency: If load average is high and OS=Linux and console is idle, submit a task which runs “top”; if top sees Netscape, submit a task to kill Netscape
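
A rough Python sketch of the first example: the Requirements expression is evaluated against each machine's attributes, and the task is triggered only where it holds. The machine names and values are made up, the function name is hypothetical, and a Python lambda stands in for the ClassAd expression (Python's `or` replaces `||`).

```python
def matching_machines(requirements, machine_ads):
    """Return the machines whose attributes satisfy the task's requirements."""
    return sorted(name for name, ad in machine_ads.items() if requirements(ad))


# Hypothetical monitoring data; the slide does not give units for these values.
machine_ads = {
    "node1": {"free_disk": 3, "free_swap": 200},
    "node2": {"free_disk": 900, "free_swap": 2},
    "node3": {"free_disk": 900, "free_swap": 200},
}


def low_on_space(ad):
    # Python rendering of: Requirements = free_disk < 5 || free_swap < 5
    return ad["free_disk"] < 5 or ad["free_swap"] < 5


for machine in matching_machines(low_on_space, machine_ads):
    print(f"send email: {machine} is low on disk or swap space")
```

Only node1 and node2 satisfy the expression here, so the email task would fire for those two machines and not for node3.
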
HawkEye Design Goals
› Monitoring
  › Reliable presence
  › Get data off the node in an extensible, consistent manner
› Run Tasks
  › In response to probe information
  › Repeat or once-only semantics
  › Audit log
› Independent and self-contained
› Cross-Platform
Current Status
› Just Beginning this project
› Initial release early summer
› Prototypes already running – Stop in and see initial HawkEye Work
Rm 3385 on Weds 9am – 12pm
Thank you!
I was an overworked sysadmin. Now I have more free time thanks to HawkEye!