evolution of the monitoring in the lhcb online system · icinga2 icinga2 early development stages...

15
Evolution of the Monitoring in the LHCb Online System Christophe Haen E.Bonaccorsi and N.Neufeld LHCb Online team (CERN), GENEVA, CH 10th October 2013

Upload: others

Post on 02-Jun-2020

2 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Evolution of the Monitoring in the LHCb OnlineSystem

Christophe HaenE.Bonaccorsi and N.Neufeld

LHCb Online team (CERN), GENEVA, CH

10th October 2013

Page 2: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Plan

1 The LHCb Online System

2 Feedback of the current infrastructure

3 AlternativesNagios4ShinkenIcinga2

4 Benchmark

5 Conclusion

Page 3: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

The LHCb Online System

LHCb

One of the four largeexperiments of the LHC.

Relies on large andheterogeneous ITinfrastructure.

Thousands of servers,different hardwareconfigurations, greatvariety of tasks

Futur monitoring for LHCb 1 C. Haen

Page 4: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

The LHCb Online System

Distributed monitoringinfrastructure

Single Icinga 1.8.4instance

ido2db with local MySQL(SSD disks)

mod gearman 1.4.2

NRPE and NSClient++

Nand for mail aggregation

Futur monitoring for LHCb 2 C. Haen

Page 5: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Positive aspects

Pros

Performance: 40 000 checks in a 5 mn window without latency

Ease of scalability with mod gearman

Group and template functionalities of the configuration:factorization

New web interface

Mail aggregation is good and necessary

Futur monitoring for LHCb 3 C. Haen

Page 6: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Negative aspects

Cons

Icinga instance is a single point of failure

Dependency system unsatisfactory

Performance with big environment failures

Very static: no easy access to live information, noconfiguration change while running

(Configuration parsing and loading time in the database)

Futur monitoring for LHCb 4 C. Haen

Page 7: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Nagios4

Nagios4

Major release, currently in beta version

Performance improvements

Better algorithms

Give up fork system for worker processes (mod gearman like)

Claim: -87% iops, -42% CPU, -64% memory

Configuration logic slightly changed (Beware!)

Futur monitoring for LHCb 5 C. Haen

Page 8: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Shinken

Shinken (part 1)

Pioneer of the nextgeneration tools

All in Python

Extends Nagios’philosophy

Innovative technicaldesign

Futur monitoring for LHCb 6 C. Haen

Page 9: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Shinken

Shinken (part 2)

Dynamic (“calculated”checks, cluster support,virtualization, etc)

Extends Nagios’configuration (servicesapplied to templates,composition of templates,macro “foreach” etc)

Automatic configurationgeneration

Business oriented

Futur monitoring for LHCb 7 C. Haen

Page 10: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Icinga2

Icinga2

Early development stages

Separate branch fromIcinga 1.x

C++ with lots of Boost

Distributed core

Totally differentconfiguration

Remote agent

Dynamic

Business oriented

Futur monitoring for LHCb 8 C. Haen

Page 11: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Test bench

Candidates

Tunned Icinga + 15 remote gearman workers

Shinken (slightly tuned)

Out of the box Icinga2

Out of the box nagios4

Procedure

60 000 services on 2 000 hosts

No historical data

t=0: everything OK

t=1000s: 90% services fail

t=2000s: everything recovers

Futur monitoring for LHCb 9 C. Haen

Page 12: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Latency

Latency

Icinga gearman:knock on effect

Shinken: increasewhen big failures

nagios4: flateverywhere

Icinga2: bump atthe beginning

Futur monitoring for LHCb 10 C. Haen

Page 13: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Service checks

Service checks

Icinga gearman:slow increase at1000s

Shinken: stepincrease

nagios4: steepincrease after 1000s

Icinga2: very fast atstartup

Futur monitoring for LHCb 11 C. Haen

Page 14: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Reaction time

Reaction time

Icinga gearman:relatively slow

Shinken: stepfunction, slow

nagios4: fast

Icinga2: very fast

Futur monitoring for LHCb 12 C. Haen

Page 15: Evolution of the Monitoring in the LHCb Online System · Icinga2 Icinga2 Early development stages Separate branch from Icinga 1.x C++ with lots of Boost Distributed core Totally di

Conclusion

Conclusion

Icinga with Gearman was a good move

Still has some weaknesses

Nagios4 is not an option

Icinga2 extremely promising performance wise

Shinken seems slower, but very dynamic and many features

Further tests to be done when stable version for both

Futur monitoring for LHCb 13 C. Haen