[op5-users] Me trying to test Merlin and Ninja
Andreas Ericsson
ae at op5.se
Tue Aug 4 13:49:03 CEST 2009
Mathieu Gagné wrote:
> Hi,
>
Hi Mathieu. Thanks for your thorough testing.
> This message is kind of a follow-up on my first message on Nagios-devel:
> "My impression about Merlin and Ninja".
>
> I tried several kinds of setups:
>
> 1) NOC + Poller (on same server)
> - Everything on the same server: NOC + Poller
> - All hosts/services definitions on the same server
>
> 2) NOC + Poller
> - Identical hosts/services definitions on all servers. (NOC + Poller)
> - With a hostgroup telling which hosts the poller is responsible for.
>
> 3) NOC + 2 x Pollers (independent)
> - NOC has all hosts/services definitions
> - Pollers have only hosts/services definitions they are responsible for.
> - Hosts are organized in 2 hostgroups, one for each Poller. Hostgroups
> are configured on the NOC and both Pollers.
>
> 4) NOC + 2 x Pollers (independent) and one Poller having a Peer
> - NOC has all hosts/services definitions
> - Pollers have only hosts/services definitions they are responsible for.
> - Hosts are organized in 2 hostgroups, one for each Poller/Peer.
> Hostgroups are configured on the NOC and both Pollers/Peer.
>
>
> My hope is to be able to handle +40k hosts and around 100k services
> (maybe more) with ~40 pollers. (depending on the overall performance)
>
This sounds very interesting, and it should be perfectly doable.
> I'm not the best QA in town, however here is what I found while trying
> to use/test Merlin. Please tell me if I'm doing anything wrong.
>
>
> 1) Merlin crashes with this error when I try to stop it on the NOC:
> *** glibc detected *** /usr/local/merlin/merlind: free(): invalid
> pointer: 0x08307b68 ***
>
> Any idea about how I can debug and pin-point the source of the problem?
>
If you get a coredump (which you should), you can run the following
command and send the output to me directly, or to this list:
gdb -ex bt -ex quit /path/to/merlind /path/to/corefile
That will, hopefully, tell me which line of which source file issues the
invalid free, which makes troubleshooting exceptionally simple.
Does this also happen on the pollers, or only on the NOC server?
>
> 2) Local commands are not transported between NOC, Pollers and Peers.
>
> If I disable notification on a service on a Poller, the database gets
> updated and I can see notifications are disabled in Ninja. However it
> does not get updated on the NOC nor the Peer. Also, the database gets
> updated by the NOC or the Peer later on and notifications are now back
> to "enabled" in Ninja.
>
> It's the same for notifications, active checks and probably all other
> local commands.
>
This is because all events received from the network are processed by
the database handler, but not all events have corresponding handlers in
the eventbroker module. This means that when a host or service status
update occurs on the NOC (which happens whenever a check result is received
from either the NOC or the poller), the status gets transmitted to the
poller, which then inserts the NOC's point of view into the database.
If you wait a bit more you should see the status go back to disabled,
since that's what it is in Nagios' core on the poller. It's basically
a race between the poller and the NOC: whichever issues a status update
event last wins.
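To make the race concrete, here is a minimal sketch in Python. It is not
Merlin's code: StatusTable and handle_status_event are made-up names, and
the sketch assumes that every incoming status event simply overwrites the
stored row, as described above.

```python
# Hypothetical model of the race; none of these names exist in Merlin.
class StatusTable:
    """Stands in for the status table that Ninja reads."""

    def __init__(self):
        self.rows = {}

    def handle_status_event(self, source, service, notifications_enabled):
        # Nothing is compared against the local Nagios core here: every
        # status event overwrites the row, so the last one to arrive
        # decides what Ninja shows.
        self.rows[service] = {
            "notifications_enabled": notifications_enabled,
            "written_by": source,
        }


db = StatusTable()
# Notifications are disabled on the poller, which updates the database.
db.handle_status_event("poller", "web01;HTTP", False)
# A check result reaches the NOC, which sends out its own (stale) view.
db.handle_status_event("noc", "web01;HTTP", True)
stale = db.rows["web01;HTTP"]["notifications_enabled"]
# The poller's next status update restores the value from its own core.
db.handle_status_event("poller", "web01;HTTP", False)
final = db.rows["web01;HTTP"]["notifications_enabled"]
print(stale, final)
```

The value flips to "enabled" after the NOC's event and back to "disabled"
after the poller's next one, which matches what you saw in Ninja.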
>
> 3) Active checks are scheduled on every server (NOC, Poller and Peer)
>
> Active checks for hosts and services are scheduled on every server:
> NOC, Poller and Peer.
>
> This results in a constant "conflict" between all Nagios instances.
> Isn't the Poller supposed to be the only one scheduling checks or am I
> missing something? Did I misconfigure everything?
>
> If it's normal behavior, what's the purpose of Merlin then? If the NOC
> still schedules every host/service the poller is supposed to take care
> of and propagates check results to everyone, what's the purpose of all this?
>
Checks should be disabled for the hosts in question on the NOC after the
poller handling those checks has connected. Peers should schedule the
same checks, but once one peer has executed one check the other peer will
reschedule it to the normal check interval + 15 seconds, which ensures
that the second peer will only take over the check if the first peer has
a check latency of 15 seconds or more.
I'll take a look at the NOC check disabling issue though.
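The peer rescheduling described above boils down to simple arithmetic. A
minimal Python sketch, assuming timestamps in seconds; the function and
constant names are invented for illustration and do not come from the
Merlin sources:

```python
# Sketch only: function and constant names are made up for illustration.
PEER_TAKEOVER_DELAY = 15  # seconds, the figure given above


def reschedule_after_peer_result(peer_check_time, check_interval):
    """Next local run time after a peer reports that it ran the check."""
    return peer_check_time + check_interval + PEER_TAKEOVER_DELAY


def should_take_over(now, local_next_check):
    """True once our own delayed slot arrives without being pushed back."""
    return now >= local_next_check


# The other peer ran the check at t=1000 with a 300 second check interval.
next_local = reschedule_after_peer_result(1000, 300)
print(next_local)  # 1315
```

The peer that ran the check is due again at t=1300. As long as it runs the
check and reports the result before t=1315, the local slot gets pushed back
once more and the local peer never executes the check itself.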
>
> 4) Question: Which server is responsible for sending notifications?
>
> Will I get 3 notices because the NOC, Poller and Peer decided they
> should send the notification?
>
The original idea was that the NOC is responsible for sending notifications.
This is to save on notification-specific hardware, which usually costs
a bit of money.
The intention now is to have both pollers and NOCs be capable of sending
notifications. I'm not entirely clear on how the semantics should be or
how one should configure this. It's likely that SMS-notifications need to
be sent from a server in the right country, for example, to avoid costly
international rates, while email notifications could probably be sent
from pretty much any server. I'm guessing it would make sense to add a new
pragma ("handles_notifications = yes/no") to the node-sections in the
merlin configuration file.
>
> 5) Which server has authority over the database?
>
> When I start the NOC, all hosts/services definitions are exported to the
> database. Everything is fine in Ninja.
>
> However, if I start/restart a Poller, only hosts/services the poller is
> responsible for get exported to the database and everything else is
> dropped. Am I missing something?
>
Each poller is supposed to have only its share of the configuration in the
database. This lets users with widespread networks and tiered
responsibility use Merlin and Nagios to delegate responsibility
to the admin in question. Imagine you have a world-wide network with admins
responsible for each branch office, and a team of admins responsible for a
quadrant of the world (think Asia, Europe, USA, world). "World" is the admin
team responsible for the core network of the world, with sub-teams in Asia
etc. The Asian team should only see the servers in Asia and therefore they
only see the stuff their poller is monitoring.
>
> 6) STDOUT, STDERR and STDIN should be closed when daemonizing.
>
> Otherwise I'm getting logs on my console while I'm actually trying to
> work on it...
>
True. I'll fix this.
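For reference, the usual approach is to point the three standard
descriptors at /dev/null instead of leaving them attached to the terminal.
A minimal sketch in Python for a POSIX system (Merlin itself is C, and
redirect_to_devnull is a hypothetical helper, not something in the Merlin
sources):

```python
import os


def redirect_to_devnull(fds=(0, 1, 2)):
    """Point the given descriptors (stdin, stdout, stderr by default)
    at /dev/null so that nothing reaches the controlling terminal."""
    devnull = os.open(os.devnull, os.O_RDWR)
    try:
        for fd in fds:
            # dup2 silently closes fd first if it is open.
            os.dup2(devnull, fd)
    finally:
        if devnull not in fds:
            os.close(devnull)


# Demonstrated on a throwaway pipe rather than the real console.
read_end, write_end = os.pipe()
redirect_to_devnull((write_end,))
os.write(write_end, b"this goes nowhere\n")
os.close(read_end)
os.close(write_end)
```

A daemon would call this with the default descriptors once it has forked
and opened its log file.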
>
> 7) Split-brain situations
>
> In the event the NOC and the Poller can't talk to each other but both
> can talk to the database, this would cause a split-brain situation.
>
NOC and poller should never connect to the same database. That will cause
conflicts such as the ones you've seen above.
>
> 8) Latency
>
> There's a ~3 second latency before the NOC or Poller gets updated with
> the other's status.
>
> This is just an observation, no real problem here.
>
A small delay is sort of expected, really. 3 seconds is ok imo.
>
> 9) Sources are a mess! :-O
>
> Sources should be better organized. Just take a look at Nagios or
> NDOutils, everything is organized in folders: includes, src, etc.
>
There's no sane reason to separate sources and header files unless
you're writing a library, in which case you want to separate public
header files into a separate folder. In those cases it almost always
makes sense to let the .c files include the private headers and let
the private headers include the public ones.
The {src,include}/ insanity comes from autoconf, which merlin doesn't
use (and never will use as a primary means of configuring the build).
There will be a Documentation folder though, which will hold all the
relevant documentation beyond the README.
> There's also a lack of comments in the sources. If you want people to
> contribute and help with Merlin, they should be able to understand what
> the code is doing.
>
> A great example is the source code of ARC in ZFS:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3647
>
> Although it is an extreme example, it shows how good sources can/should
> be documented.
>
That is indeed an extreme example, and much of it would have been way
better off as a separate technical document. I will add appropriate
comments to Merlin, but I will not do things like
    /* flush the buffer */
    flush(fd);
which is just totally super-redundant commentary that annoys people who
actually read the code. A comment that explains what a small piece of
code does without also explaining why that is necessary comes under the
heading of "useless cruft".
>
> 10) NOC port in merlin.conf
>
> Merlin should bail out if no port is specified in merlin.conf on a Poller
> for noc{}, otherwise it seems Merlin will try to connect to TCP port 0
> on the NOC and fail miserably every second. Make it mandatory or use a
> default one.
>
> It seems it's not an issue for poller{} or peer{}. Can someone confirm?
>
The default should be 15551, as specified in the documentation. I suppose
this is a bug, so I'll investigate it.
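The intended behaviour could be sketched like this in Python. The 15551
default is the documented one mentioned above; resolve_port and the
dictionary of node settings are hypothetical, purely for illustration, and
do not reflect how the Merlin sources are organized:

```python
DEFAULT_MERLIN_PORT = 15551  # the documented default mentioned above


def resolve_port(node_settings):
    """Return the port for a noc{}, poller{} or peer{} section,
    falling back to the default when none is configured."""
    raw = node_settings.get("port")
    if raw is None:
        return DEFAULT_MERLIN_PORT
    port = int(raw)
    if not 1 <= port <= 65535:
        # Refuse to continue instead of trying to connect to port 0.
        raise ValueError("invalid port: %r" % (raw,))
    return port


print(resolve_port({"address": "192.0.2.10"}))  # 15551
print(resolve_port({"address": "192.0.2.10", "port": "15552"}))  # 15552
```

Either way the daemon never ends up retrying a connection to TCP port 0.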
>
> 11) Ninja shows active checks disabled on host when only a service has
> active checks disabled
>
> Try disabling active checks on a service. You will see Ninja reporting
> active checks being disabled on both host and service when only active
> checks on the service are disabled.
>
>
> 12) Files organization in Merlin
>
> Binary should be installed in /usr/local/merlin/bin
> Configuration file should be installed in /usr/local/merlin/etc
> Etc...
>
That's primarily a matter of style.
>
> 13) Configuration file's name and syntax
>
> To be consistent with Nagios, NRPE and NDOutils, configuration file
> should be named merlin.cfg.
>
Ignoring this. You can name your nagios and nrpe configuration files
whatever you like and it will still work the same. The unix tradition
is to use ".conf" for configuration files. Windows and its original
limited 8.3 naming scheme had to come up with something else, so the
.cfg suffix stems from there.
> Syntax in merlin.cfg seems to be very permissive. Several forms are
> supported:
> => key = value
> => key = value;
> => key value
> => key value;
>
> Which one is recommended? README and example.conf use both "key = value"
> and "key = value;" in their examples which could be confusing.
>
You can use a semicolon as a delimiter between key/value statements, so you
can put multiple statements on a single line if you wish, like so:
    key1 = value1; key2 = value2;
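A rough Python sketch of a parser that accepts the four forms you listed,
plus several statements on one line. This is only an illustration of the
syntax, not Merlin's actual parser; parse_statements is a made-up name, and
sections ({}) and comments are deliberately ignored:

```python
def parse_statements(line):
    """Split one configuration line into (key, value) pairs.

    Accepts "key = value", "key = value;", "key value" and "key value;",
    with a semicolon separating multiple statements on the same line.
    """
    pairs = []
    for statement in line.split(";"):
        statement = statement.strip()
        if not statement:
            continue  # nothing after a trailing semicolon
        if "=" in statement:
            key, value = statement.split("=", 1)
        else:
            parts = statement.split(None, 1)
            if len(parts) != 2:
                raise ValueError("no value in statement: %r" % statement)
            key, value = parts
        pairs.append((key.strip(), value.strip()))
    return pairs


print(parse_statements("key1 = value1; key2 = value2;"))
# [('key1', 'value1'), ('key2', 'value2')]
```

All four single-statement forms come out as the same ("key", "value") pair.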
>
> 14) Hard-coded value in daemon.c
>
> This value is hard-coded, which isn't great:
>
> static char *import_program = "php /home/exon/git/monitor/merlin/import.php";
>
It can (and obviously should) be configured. The hardcoded value is a
remnant from my testing.
>
> 15) Status durations are not transported between NOC, Poller and Peer.
>
> Although statuses are shared between them, the status duration isn't, which
> could be confusing if people rely on the CGI interface to get their
> information.
>
>
> 16) Inconsistency in merlin.conf
>
> You specify "address" in the main section but need to specify "port" in
> daemon{}. Only the daemon listens on "address" so putting it in daemon{}
> would be the right thing to do. No?
>
Actually, you can specify "port" in the global section to set a new default
port (I think, I need to read the sources again to verify this). You're
right that only the daemon needs the "address" part though. I'll amend this
but retain the old behaviour with a warning that it's deprecated and will
no longer work as of a year from now. Thanks for noticing.
>
> 17) Folder name in archive
>
> When I extract the archive at
> "http://www.op5.org/op5media/op5.org/downloads/merlin-0.6.2-beta2.p1.tar.gz"
> I get a folder named merlin/.
>
> Would it be possible to name it merlin-0.6.2-beta2.p1/ so I can download
> multiple versions without getting name conflicts?
>
Yes. I have no idea who created those packages, but I always add the version
to the directory when I create packages.
Thanks for your thorough testing. I'll take a look at the deficiencies you
reported and see what can be done about them. Some I'll leave alone
though, as they're just a matter of personal style rather than any functional
issue (sort of like "you should indent like this instead"). Hope you're ok
with that.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.