[op5-users] Me trying to test Merlin and Ninja
Mathieu Gagné
mgagne at iweb.com
Wed Jul 29 23:13:38 CEST 2009
Hi,
This message is kind of a follow-up on my first message on Nagios-devel:
"My impression about Merlin and Ninja".
I tried several kinds of setups:
1) NOC + Poller (on same server)
- Everything on the same server: NOC + Poller
- All hosts/services definitions on the same server
2) NOC + Poller
- Identical hosts/services definitions on all servers. (NOC + Poller)
- With a hostgroup telling which hosts the poller is responsible for.
3) NOC + 2 x Pollers (independent)
- NOC has all hosts/services definitions
- Pollers have only the hosts/services definitions they are responsible for.
- Hosts are organized in 2 hostgroups, one for each Poller. Hostgroups
are configured on the NOC and both Pollers.
4) NOC + 2 x Pollers (independent) and one Poller having a Peer
- NOC has all hosts/services definitions
- Pollers have only the hosts/services definitions they are responsible for.
- Hosts are organized in 2 hostgroups, one for each Poller/Peer.
Hostgroups are configured on the NOC and both Pollers/Peer.
My hope is to be able to handle 40k+ hosts and around 100k services
(maybe more) with ~40 pollers, depending on the overall performance.
I'm not the best QA in town, however here is what I found while trying
to use/test Merlin. Please tell me if I'm doing anything wrong.
1) Merlin crashes with this error when I try to stop it on the NOC:
*** glibc detected *** /usr/local/merlin/merlind: free(): invalid
pointer: 0x08307b68 ***
Any idea how I can debug and pinpoint the source of the problem?
2) Local commands are not transported between NOC, Pollers and Peers.
If I disable notifications on a service on a Poller, the database gets
updated and I can see notifications are disabled in Ninja. However, the
change is not propagated to the NOC or the Peer. Also, the database gets
overwritten by the NOC or the Peer later on, and notifications are back
to "enabled" in Ninja.
It's the same for notifications, active checks and probably all other
local commands.
3) Active checks are scheduled on every server (NOC, Poller and Peer)
Active checks for hosts and services are scheduled on every server:
NOC, Poller and Peer.
This results in a constant "conflict" between all Nagios instances.
Isn't the Poller supposed to be the only one scheduling checks, or am I
missing something? Did I misconfigure everything?
If this is normal behavior, what's the purpose of Merlin then? If the NOC
still schedules every host/service the poller is supposed to take care
of and propagates check results to everyone, what's the purpose of all this?
4) Question: Which server is responsible for sending notifications?
Will I get 3 notices because the NOC, Poller and Peer decided they
should send the notification?
5) Which server has authority over the database?
When I start the NOC, all hosts/services definitions are exported to the
database. Everything is fine in Ninja.
However, if I start/restart a Poller, only the hosts/services the poller
is responsible for get exported to the database and everything else is
dropped. Am I missing something?
6) STDOUT, STDERR and STDIN should be closed when daemonizing.
Otherwise I'm getting logs on my console while I'm actually trying to
work on it...
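To illustrate what I mean, here is a minimal sketch of the usual POSIX
approach: point the three standard descriptors at /dev/null once the
daemon has forked. The function name detach_stdio() is my own, not
something that exists in Merlin, and I have not checked how merlind
currently daemonizes.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Point stdin, stdout and stderr at /dev/null so a daemon no longer
 * reads from or writes to the terminal it was started from.
 * Hypothetical helper, not part of Merlin.
 * Returns 0 on success, -1 on failure.
 */
int detach_stdio(void)
{
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;

    /* dup2() closes the target descriptor (if open) and replaces it */
    if (dup2(fd, STDIN_FILENO) < 0 ||
        dup2(fd, STDOUT_FILENO) < 0 ||
        dup2(fd, STDERR_FILENO) < 0) {
        if (fd > STDERR_FILENO)
            close(fd);
        return -1;
    }

    /* drop the extra descriptor unless it already is one of 0, 1 or 2 */
    if (fd > STDERR_FILENO)
        close(fd);
    return 0;
}
```

Anything the daemon wants to report after this call would have to go
to a log file or syslog instead of the console.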
7) Split-brain situations
In the event the NOC and the Poller can't talk to each other but both
can talk to the database, this would cause a split-brain situation.
8) Latency
There's a latency of ~3 seconds before the NOC or Poller gets updated
with the other's status.
This is just an observation, no real problem here.
9) Sources are a mess! :-O
Sources should be better organized. Just take a look at Nagios or
NDOutils, everything is organized in folders: includes, src, etc.
There's also a lack of comments in the sources. If you want people to
contribute and help with Merlin, they should be able to understand what
the code is doing.
A great example is the source code of the ARC in ZFS:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3647
Although it is an extreme example, it shows how good sources can/should
be documented.
10) NOC port in merlin.conf
Merlin should bail out if no port is specified in merlin.conf on a Poller
for noc{}; otherwise it seems Merlin will try to connect to TCP port 0
on the NOC and fail miserably every second. Make it mandatory or use a
default one.
It seems it's not an issue for poller{} or peer{}. Can someone confirm?
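Here is a rough sketch of the behavior I have in mind for the port
setting: fall back to a default when nothing is configured, and reject
anything that is not a valid TCP port so the daemon can refuse to start.
The name resolve_port() and the DEFAULT_PORT value are placeholders of
mine, not taken from the Merlin sources.

```c
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>

#define DEFAULT_PORT 15551 /* placeholder default, not taken from Merlin */

/*
 * Resolve the port for a noc{}, poller{} or peer{} block.
 * value is the raw "port" setting, or NULL if none was given.
 * Returns the port to use, or -1 if the setting is invalid, in which
 * case the caller should bail out instead of retrying forever.
 */
int resolve_port(const char *value)
{
    char *end;
    long port;

    /* nothing configured: use the default rather than port 0 */
    if (value == NULL || *value == '\0')
        return DEFAULT_PORT;

    errno = 0;
    port = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0')
        return -1; /* not a plain decimal number */
    if (port < 1 || port > 65535)
        return -1; /* outside the valid TCP port range */
    return (int)port;
}
```

With this, an explicit "port = 0" is treated as a configuration error
while a missing port silently gets the default.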
11) Ninja shows active checks disabled on host when only a service has
active checks disabled
Try disabling active checks on a service. You will see Ninja reporting
active checks being disabled on both host and service when only active
checks on the service are disabled.
12) File organization in Merlin
Binaries should be installed in /usr/local/merlin/bin
The configuration file should be installed in /usr/local/merlin/etc
Etc...
13) Configuration file's name and syntax
To be consistent with Nagios, NRPE and NDOutils, the configuration file
should be named merlin.cfg.
The syntax of the configuration file seems to be very permissive.
Several forms are supported:
=> key = value
=> key = value;
=> key value
=> key value;
Which one is recommended? README and example.conf use both "key = value"
and "key = value;" in their examples, which could be confusing.
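To show how permissive this is, here is a small sketch of a parser that
accepts all four forms listed above. It is my own illustration and not
how Merlin's parser is written; the function name parse_setting() and
the buffer-based interface are assumptions.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Split one configuration line into key and value, accepting
 * "key = value", "key = value;", "key value" and "key value;".
 * Hypothetical helper, not Merlin's actual parser.
 * Returns 0 on success, -1 if no key/value pair fits the buffers.
 */
int parse_setting(const char *line, char *key, size_t ksz,
                  char *val, size_t vsz)
{
    const char *p = line, *start;
    size_t len;

    while (isspace((unsigned char)*p))
        p++;

    /* the key runs up to the first whitespace or '=' */
    start = p;
    while (*p && !isspace((unsigned char)*p) && *p != '=')
        p++;
    len = (size_t)(p - start);
    if (len == 0 || len >= ksz)
        return -1;
    memcpy(key, start, len);
    key[len] = '\0';

    /* skip the separator: whitespace with an optional single '=' */
    while (isspace((unsigned char)*p))
        p++;
    if (*p == '=')
        p++;
    while (isspace((unsigned char)*p))
        p++;

    /* the value is the rest, minus trailing whitespace and one ';' */
    start = p;
    len = strlen(start);
    while (len > 0 && isspace((unsigned char)start[len - 1]))
        len--;
    if (len > 0 && start[len - 1] == ';')
        len--;
    while (len > 0 && isspace((unsigned char)start[len - 1]))
        len--;
    if (len == 0 || len >= vsz)
        return -1;
    memcpy(val, start, len);
    val[len] = '\0';
    return 0;
}
```

All four spellings end up as the same key/value pair, which is why the
documentation should pick one form and stick to it.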
14) Hard-coded value in daemon.c
This value is hard-coded in daemon.c, which isn't great:
static char *import_program = "php /home/exon/git/monitor/merlin/import.php";
15) Status durations are not transported between NOC, Poller and Peer.
Although statuses are shared between them, the status duration isn't,
which could be confusing if people rely on the CGI interface to get
their information.
16) Inconsistency in merlin.conf
You specify "address" in the main section but need to specify "port" in
daemon{}. Only the daemon listens on "address", so putting it in daemon{}
would be the right thing to do. No?
"ipc_socket" is used by the daemon and the module, right? Keeping it in
the "main" section would be OK, I think.
17) Folder name in archive
When I extract the archive at
"http://www.op5.org/op5media/op5.org/downloads/merlin-0.6.2-beta2.p1.tar.gz"
I get a folder named merlin/.
Would it be possible to name it merlin-0.6.2-beta2.p1/ so I can download
multiple versions without getting name conflicts?
18) Typo in Ninja
Typo here:
./ninja/application/views/themes/default/icons/16x16/nofity-disabled.png
Views should be fixed and the file renamed.
That is all for now. Do not hesitate to ask questions if I'm not clear. :)
--
Mathieu