Monitoring Temperature and Fan Speed Using Ganglia and Winbond Chips
Caitie McCaffrey, Yemi Adesanya
August 2006
“The SLAC Computing Services Group is dedicated to providing leadership and support in computing and communications to the laboratory as a whole, and to physics research, in particular”
Major Concerns
• Power consumption
• Cooling
• Monitoring
What Is My Computer Doing?
• I/O rate
• CPU usage
• Memory usage
• Temperature
• Fan speed
• Load
Monitoring Software
-low overhead
-scalable
-low impact on individual machines
“Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids”
• Scalable: overhead grows with the number of clusters, not the number of nodes
• Works on multiple operating systems
• Stores data in a Round Robin Database (RRD)
• Measures metrics like CPU usage, load, I/O rate, and memory usage
GMOND, GMETAD, GMETRIC
Ganglia Architecture
[Diagram: two clusters, with nodes labeled A, B, C and 1, 2, 3, 4]
• Cluster One: all machines know the state of the entire cluster
• Cluster Two: machines 1 and 3 know the state of the entire cluster
• Updates RRD, polls clusters periodically
http://www.slac.stanford.edu/comp/unix/ganglia/index.html
GMETRIC
Allows users to monitor additional metrics beyond the core set monitored by the gmond daemon
• Name
• Value
• Type
• Units
gmetric --conf=/var/ganglia/gmond.conf -n CPUTemp1 -v 75 -t uint8 -u Celsius
Useful because it lets us be more machine-specific: we can monitor temperature and fan speed
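The gmetric call above can be assembled programmatically. The sketch below is in Python rather than the Perl used in this project, and the helper name `gmetric_command` is hypothetical. It only builds the argument list from the four fields (name, value, type, units); actually running it assumes a host with gmond and gmetric installed.

```python
import shlex

# Metric types accepted by gmetric's -t option.
GMETRIC_TYPES = {"string", "int8", "uint8", "int16", "uint16",
                 "int32", "uint32", "float", "double"}

def gmetric_command(name, value, metric_type, units,
                    conf="/var/ganglia/gmond.conf"):
    """Build the argument list for one gmetric call (hypothetical helper)."""
    if metric_type not in GMETRIC_TYPES:
        raise ValueError(f"unsupported gmetric type: {metric_type}")
    return ["gmetric", f"--conf={conf}",
            "-n", name, "-v", str(value), "-t", metric_type, "-u", units]

cmd = gmetric_command("CPUTemp1", 75, "uint8", "Celsius")
print(shlex.join(cmd))
# To publish the metric on a host running gmond, one could then call:
#   subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems if a metric name or unit ever contains spaces.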
A little bit on hardware
Noma - batch machines
• Tyan Thunder LE-T motherboard
• Winbond w83782d (lm_sensors compatible)
• 2 Pentium III processors
Why is temperature important?
• Chip specifications give a temperature range
• Behavior is unpredictable outside the temperature range
• Temperature gives clues to weird machine behavior
• Pentiums have a max temp of 77°-82° C
[Photo: Tyan Thunder LE-T]
What’s a Noma?
• Horse from Noma County, Japan
• Smallest native Japanese pony, 10.1-10.3 hands
• Very rare: 27 purebred Nomas left (1988)
Some more machines
COB, DON, TORI, MORAB, ORLOV, NOMA
caitiem@noma0449 $ sensors
w83782d-i2c-0-29
Adapter: SMBus PIIX4 adapter at 0580
Algorithm: Non-I2C SMBus adapter
VCore 1: +1.48 V (min = +4.08 V, max = +4.08 V)
VCore 2: +1.26 V (min = +4.08 V, max = +4.08 V)
+3.3V: +3.37 V (min = +2.97 V, max = +3.63 V)
+5V: +4.97 V (min = +4.50 V, max = +5.48 V)
+12V: +12.08 V (min = +10.79 V, max = +13.11 V)
-12V: -1.03 V (min = -13.21 V, max = -10.90 V)
-5V: +2.84 V (min = -5.51 V, max = -4.51 V)
V5SB: +5.12 V (min = +4.50 V, max = +5.48 V)
VBat: +3.34 V (min = +2.70 V, max = +3.29 V)
fan1: 8231 RPM (min = 3000 RPM, div = 2)
fan2: 8333 RPM (min = 3000 RPM, div = 2)
fan3: 0 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor
    ALARM
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
    ALARM
temp3: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
    ALARM
vid: +1.450 V
alarms: Chassis intrusion detection ALARM
beep_enable:
    Sound alarm disabled
Perl
Fills the gap between low-level languages like C and C++ and high-level languages like shell.
-mostly fast
-basically unlimited
-good for working with text
-portable
Regular Expressions
/^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)/
matches
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
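To see what the expression captures, here is a small sketch that applies the same pattern to lines of sensors output. It is written in Python rather than Perl (the pattern syntax is the same in both), and the helper name `parse_temps` is hypothetical. The first group is the sensor number and the second is the reading.

```python
import re

# Same pattern as on the slide: sensor index, then the reading after the "+".
TEMP_RE = re.compile(r"^temp([0-9]):\s+\+([0-9]+\.*[0-9]*)")

def parse_temps(sensors_output):
    """Return {sensor index: temperature in degrees C} (hypothetical helper)."""
    temps = {}
    for line in sensors_output.splitlines():
        match = TEMP_RE.match(line.strip())
        if match:
            temps[int(match.group(1))] = float(match.group(2))
    return temps

sample = """\
fan1: 8231 RPM (min = 3000 RPM, div = 2)
temp1: +77°C (limit = +60°C) sensor = thermistor
temp2: +65.0°C (limit = +60°C, hysteresis = +50°C) sensor = thermistor
"""
print(parse_temps(sample))  # {1: 77.0, 2: 65.0}
```

The fan line is skipped because it does not start with `temp`, and both the integer reading (`+77`) and the decimal reading (`+65.0`) are accepted.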
Sample Time - Decreasing
• Time interval = 12.15 minutes
• Fri Aug 11 03:04:05 PDT 2006
• FanSpeed1 8035
• FanSpeed2 7941
• Temp 1: 77
• Change: 0
• Temp 2: 64.0
• Change: 0
• Temp 3: 64.0
• Change: 1
• Time interval = 9.8415 minutes
• Fri Aug 11 03:16:15 PDT 2006
Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute
New time = old time * Decrement ^ (Change / Trigger)
*If new time < MinTime, then new time = MinTime
New time = 12.15 * 0.9 ^ (1 / 0.5) = 9.8415
Want sample time to decrease faster when temperatures are changing faster
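The decrease rule can be checked numerically. This is a minimal sketch in Python, not the original Perl, and the function name `decrease_interval` is hypothetical. The default parameter values are the ones listed on the slide.

```python
def decrease_interval(old_time, change, trigger=0.5, decrement=0.9, min_time=1.0):
    """Shrink the sample interval (minutes) in proportion to the temperature change.

    Each `trigger` degrees of change multiplies the interval by `decrement`,
    and the result is never allowed to fall below `min_time`.
    """
    new_time = old_time * decrement ** (change / trigger)
    return max(new_time, min_time)

# Slide example: a change of 1 degree is two triggers, so 12.15 * 0.9^2.
print(round(decrease_interval(12.15, 1), 4))  # 9.8415
# A large change is clamped at MinTime.
print(decrease_interval(1.2, 5))              # 1.0
```

Because the exponent scales with the size of the change, a 2-degree jump shortens the interval by 0.9^4 rather than 0.9^2, which is what makes the sample time drop faster when temperatures move faster.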
Sample Time - Increasing
• Time interval = 12.15 minutes
• Fri Aug 11 08:25:18 PDT 2006
• Found FanSpeed1 8035
• Found FanSpeed2 7941
• Temp 1: 77
• Change: 0
• Temp 2: 64.0
• Change: 0
• Temp 3: 64.0
• Change: 0
• Time interval = 13.5 minutes
• Fri Aug 11 08:37:28 PDT 2006
Parameters
• Trigger = 0.5 degrees
• Decrement = 0.9
• MaxTime = 15 minutes
• MinTime = 1 minute
NewTime = OldTime / Decrement
NewTime = 12.15 / 0.9 = 13.5
Want sample time to increase when temperature is changing slowly or not at all
*If we increase by large amounts we could miss valuable data
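Putting the two rules together gives one update step per sample. The sketch below is Python, not the original Perl, and rests on assumptions the slides do not spell out: that the largest absolute change across the sensors drives the decision, that a change at or above Trigger selects the decrease rule, and that the interval is capped at MaxTime. The name `next_interval` is hypothetical.

```python
def next_interval(old_time, changes, trigger=0.5, decrement=0.9,
                  min_time=1.0, max_time=15.0):
    """Return the next sample interval in minutes (illustrative sketch).

    Assumption: the largest absolute change across sensors decides the branch.
    """
    change = max(abs(c) for c in changes)
    if change >= trigger:
        # Temperatures are moving: shrink the interval, faster for bigger changes.
        new_time = old_time * decrement ** (change / trigger)
    else:
        # Temperatures are steady: grow the interval by one small step.
        new_time = old_time / decrement
    return min(max(new_time, min_time), max_time)

print(round(next_interval(12.15, [0, 0, 1]), 4))  # 9.8415 (decreasing slide)
print(round(next_interval(12.15, [0, 0, 0]), 4))  # 13.5   (increasing slide)
```

Growing by a single factor of 1 / 0.9 per steady sample keeps the increase gradual, in line with the note about not missing valuable data.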
[Figure: noma0450 and noma0449]
Currently up and running on two Nomas
• noma0449
• noma0450
Will be installed on all Nomas
Can be used on any Ganglia-monitored machine with a compatible Winbond chip
Many thanks to the DOE, the SCCS systems group, and especially Yemi Adesanya, John Goebel, & Karl Amrhein for all their help throughout the summer.
Smartmontools for SCSI devices
• Command: smartctl -l error /dev/sda
Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
Non-medium error count: 0
http://smartmontools.sourceforge.net/smartmontools_scsi.html
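The read and write rows of the error counter log are simple to pull apart by whitespace. The sketch below is Python rather than the Perl used in this project. It assumes the seven-column layout shown on this slide, and the field names in `ErrorCounters` are hypothetical labels for those columns.

```python
from typing import NamedTuple

class ErrorCounters(NamedTuple):
    """One row of the error counter log (field names are hypothetical)."""
    minor_delay: int            # corrected with no delay
    major_delay: int            # corrected with a delay
    rereads_rewrites: int
    total_corrected: int
    algorithm_invocations: int
    gigabytes_processed: float  # units of 10^9 bytes
    total_uncorrected: int

def parse_error_log(text):
    """Return {'read': ErrorCounters, 'write': ErrorCounters} for rows found."""
    rows = {}
    for line in text.splitlines():
        label, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and label.strip() in ("read", "write") and len(fields) == 7:
            rows[label.strip()] = ErrorCounters(
                int(fields[0]), int(fields[1]), int(fields[2]), int(fields[3]),
                int(fields[4]), float(fields[5]), int(fields[6]))
    return rows

sample = """\
Error counter log:
          Errors Corrected    Total      Total   Correction     Gigabytes    Total
              delay:       [rereads/    errors   algorithm      processed    uncorrected
            minor | major  rewrites]  corrected  invocations   [10^9 bytes]  errors
read:     234237        0         0    234237     234237        605.516           0
write:         0        0         0         0          0       1457.589           0
Non-medium error count: 0
"""
log = parse_error_log(sample)
print(log["read"].minor_delay, log["write"].gigabytes_processed)  # 234237 1457.589
```

Header lines and the non-medium error count are ignored because their labels are not `read` or `write`.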
Corrected Errors
• Minor / Fast
  -Correction algorithm works successfully
  -No delay to reading later sectors
  -These are OK
• Major / Slow
  -Correction algorithm works successfully
  -Delay in reading later sectors
  -Not so good
Uncorrected Errors
• Correction algorithm fails
• Very bad
Other Information
• Total [rereads/rewrites]: errors corrected by applying retries
• Total errors corrected: number of all correctable errors
• Correction algorithm invocations: number of times the algorithm is used
• Gigabytes processed: amount of data (in units of 10^9 bytes) successfully and unsuccessfully read or written
This indicates there might be a problem
This should be a flag as well
This is OK: it is correcting the errors and not losing any time doing so
Monitors
• Read Uncorrected Errors
• Read Delayed Errors
• Read No Delay Errors
• Write Uncorrected Errors
• Write Delayed Errors
• Write No Delay Errors
• Total Uncorrected Errors
• Total Delayed Errors
Collects Data Once a Day
errorsWatch
-Noma
-Don
-Tori
-Cob
-Morab
-Orlov
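As a rough illustration of how the eight monitored values relate to the smartctl counters, the sketch below maps minor (fast) corrections to "No Delay" and major (slow) corrections to "Delayed", following the earlier slide on corrected errors. It is Python rather than Perl, the function name `errors_watch_metrics` and its dictionary keys are hypothetical, and treating each total as the sum of the read and write counts is an assumption rather than something the slides state.

```python
def errors_watch_metrics(read, write):
    """Derive the eight monitored values from read/write counters.

    `read` and `write` are dicts with hypothetical keys
    'minor', 'major' and 'uncorrected'.
    Assumption: each total is the sum of the read and write counts.
    """
    return {
        "Read Uncorrected Errors": read["uncorrected"],
        "Read Delayed Errors": read["major"],
        "Read No Delay Errors": read["minor"],
        "Write Uncorrected Errors": write["uncorrected"],
        "Write Delayed Errors": write["major"],
        "Write No Delay Errors": write["minor"],
        "Total Uncorrected Errors": read["uncorrected"] + write["uncorrected"],
        "Total Delayed Errors": read["major"] + write["major"],
    }

# Counters taken from the smartctl example earlier in the talk.
metrics = errors_watch_metrics(
    read={"minor": 234237, "major": 0, "uncorrected": 0},
    write={"minor": 0, "major": 0, "uncorrected": 0},
)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

For the example disk, every value is zero except Read No Delay Errors, which is the case the slides describe as OK.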