[op5-users] Me trying to test Merlin and Ninja

Mathieu Gagné mgagne at iweb.com
Wed Jul 29 23:13:38 CEST 2009


Hi,

This message is a follow-up to my first message on Nagios-devel: 
"My impression about Merlin and Ninja".

I tried several kinds of setups:

1) NOC + Poller (on same server)
- Everything on the same server: NOC + Poller
- All hosts/services definitions on the same server

2) NOC + Poller
- Identical hosts/services definitions on all servers. (NOC + Poller)
- With a hostgroup telling which hosts the poller is responsible for.

3) NOC + 2 x Pollers (independent)
- NOC has all hosts/services definitions
- Pollers have only the hosts/services definitions they are responsible for.
- Hosts are organized in 2 hostgroups, one for each Poller. Hostgroups 
are configured on the NOC and both Pollers.

4) NOC + 2 x Pollers (independent) and one Poller having a Peer
- NOC has all hosts/services definitions
- Pollers have only the hosts/services definitions they are responsible for.
- Hosts are organized in 2 hostgroups, one for each Poller/Peer. 
Hostgroups are configured on the NOC and both Pollers/Peer.


My hope is to be able to handle 40k+ hosts and around 100k services 
(maybe more) with ~40 pollers, depending on the overall performance.

I'm not the best QA in town, however here is what I found while trying 
to use/test Merlin. Please tell me if I'm doing anything wrong.


1) Merlin crashes with this error when I try to stop it on the NOC:
*** glibc detected *** /usr/local/merlin/merlind: free(): invalid 
pointer: 0x08307b68 ***

Any idea about how I can debug and pin-point the source of the problem?


2) Local commands are not transported between NOC, Pollers and Peers.

If I disable notification on a service on a Poller, the database gets 
updated and I can see notifications are disabled in Ninja. However it 
does not get updated on the NOC nor the Peer. Also, the database gets 
updated by the NOC or the Peer later on and notifications are now back 
to "enabled" in Ninja.

It's the same for notifications, active checks and probably all other 
local commands.


3) Active checks are scheduled on every server (NOC, Poller and Peer)

Active checks for hosts and services are scheduled on every server: 
NOC, Poller and Peer.

This results in a constant "conflict" between all Nagios instances. 
Isn't the Poller supposed to be the only one scheduling checks, or am I 
missing something? Did I misconfigure everything?

If this is normal behavior, what's the purpose of Merlin then? If the NOC 
still schedules every host/service the poller is supposed to take care 
of and propagates check results to everyone, what's the purpose of all this?


4) Question: Which server is responsible for sending notifications?

Will I get 3 notices because the NOC, Poller and Peer decided they 
should send the notification?


5) Which server has authority over the database?

When I start the NOC, all hosts/services definitions are exported to the 
database. Everything is fine in Ninja.

However, if I start/restart a Poller, only the hosts/services the poller is 
responsible for get exported to the database and everything else is 
dropped. Am I missing something?


6) STDOUT, STDERR and STDIN should be closed when daemonizing.

Otherwise I'm getting logs on my console while I'm actually trying to 
work on it...
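
To illustrate what I mean, here is a minimal sketch of the usual POSIX 
approach: point the three standard descriptors at /dev/null once the 
daemon has detached. The function names (redirect_to_devnull, 
detach_stdio) are hypothetical and not taken from the Merlin sources; 
where such a call would belong in merlind is an assumption on my part.

```c
#include <fcntl.h>
#include <unistd.h>

/* Point one descriptor at /dev/null.
 * Returns 0 on success, -1 on error. */
int redirect_to_devnull(int fd, int flags)
{
    int nullfd = open("/dev/null", flags);

    if (nullfd < 0)
        return -1;
    if (nullfd != fd) {
        if (dup2(nullfd, fd) < 0) {
            close(nullfd);
            return -1;
        }
        close(nullfd);
    }
    return 0;
}

/* Hypothetical helper: meant to be called in the child after
 * fork()/setsid(), so nothing ends up on the controlling terminal. */
int detach_stdio(void)
{
    if (redirect_to_devnull(STDIN_FILENO, O_RDONLY) < 0)
        return -1;
    if (redirect_to_devnull(STDOUT_FILENO, O_WRONLY) < 0)
        return -1;
    return redirect_to_devnull(STDERR_FILENO, O_WRONLY);
}
```

Redirecting to /dev/null rather than plainly closing the descriptors 
keeps 0, 1 and 2 occupied, so a later open() or socket() cannot 
accidentally reuse them.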


7) Split-brain situations

In the event the NOC and the Poller can't talk to each other but both 
can talk to the database, this would cause a split-brain situation.


8) Latency

There's a latency of ~3 seconds before the NOC or Poller gets updated with 
the other's status.

This is just an observation, no real problem here.


9) Sources are a mess! :-O

Sources should be better organized. Just take a look at Nagios or 
NDOutils, everything is organized in folders: includes, src, etc.

There's also a lack of comments in the sources. If you want people to 
contribute and help with Merlin, they should be able to understand what 
the code is doing.

A great example is the source code of the ARC in ZFS:
http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3647

Although it is an extreme example, it shows how good sources can/should 
be documented.


10) NOC port in merlin.conf

Merlin should bail out if no port is specified in merlin.conf on a Poller 
for noc{}; otherwise it seems Merlin will try to connect to TCP port 0 
on the NOC and fail miserably every second. Make it mandatory or use a 
default one.

It seems it's not an issue for poller{} or peer{}. Can someone confirm?
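
A rough sketch of the check I have in mind, assuming the port arrives 
from the config parser as a string (or NULL when missing). The function 
resolve_port and the EXAMPLE_DEFAULT_PORT value are placeholders of mine, 
not something taken from the Merlin code.

```c
#include <errno.h>
#include <stdlib.h>

/* Placeholder default, an assumption for this example only. */
#define EXAMPLE_DEFAULT_PORT 15551

/* Returns a usable TCP port, or -1 if the configured value is invalid.
 * A missing value (NULL or empty string) falls back to the default. */
int resolve_port(const char *value)
{
    char *end;
    long port;

    if (value == NULL || *value == '\0')
        return EXAMPLE_DEFAULT_PORT;

    errno = 0;
    port = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0')
        return -1;              /* not a plain decimal number */
    if (port < 1 || port > 65535)
        return -1;              /* port 0 and out-of-range values rejected */
    return (int)port;
}
```

The caller can then refuse to start (or at least log an error once) when 
-1 comes back, instead of retrying a connection to port 0 forever.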


11) Ninja shows active checks disabled on host when only a service has 
active checks disabled

Try disabling active checks on a service. You will see Ninja reporting 
active checks being disabled on both host and service when only active 
checks on the service are disabled.


12) File organization in Merlin

Binaries should be installed in /usr/local/merlin/bin
The configuration file should be installed in /usr/local/merlin/etc
Etc...


13) Configuration file's name and syntax

To be consistent with Nagios, NRPE and NDOutils, the configuration file 
should be named merlin.cfg.

The syntax in merlin.cfg seems to be very permissive. Several forms are 
supported:
=> key = value
=> key = value;
=> key   value
=> key   value;

Which one is recommended? README and example.conf use both "key = value" 
and "key = value;" in their examples which could be confusing.
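
To show why all four forms end up equivalent, here is a small sketch of 
a permissive line parser that treats "=" and a trailing ";" as optional. 
This is my own illustration (parse_setting is a made-up name), not the 
parser Merlin actually uses.

```c
#include <ctype.h>
#include <string.h>

/* Split a "key = value;" style line into key and value.
 * Returns 0 on success, -1 if a part is missing or does not fit. */
int parse_setting(const char *line, char *key, size_t klen,
                  char *val, size_t vlen)
{
    size_t n;

    while (isspace((unsigned char)*line))
        line++;

    /* key: everything up to whitespace or '=' */
    n = strcspn(line, " \t=");
    if (n == 0 || n >= klen)
        return -1;
    memcpy(key, line, n);
    key[n] = '\0';
    line += n;

    /* separator: optional whitespace, optional '=', optional whitespace */
    while (isspace((unsigned char)*line))
        line++;
    if (*line == '=')
        line++;
    while (isspace((unsigned char)*line))
        line++;

    /* value: rest of the line minus trailing whitespace and ';' */
    n = strlen(line);
    while (n > 0 && (isspace((unsigned char)line[n - 1]) ||
                     line[n - 1] == ';'))
        n--;
    if (n == 0 || n >= vlen)
        return -1;
    memcpy(val, line, n);
    val[n] = '\0';
    return 0;
}
```

With a parser like this, picking one form for the README and 
example.conf is purely a documentation decision.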


14) Hard-coded value in daemon.c

This value is hard-coded in daemon.c, which isn't great:

static char *import_program = "php /home/exon/git/monitor/merlin/import.php";
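
One possible alternative, as a sketch only: build the command at startup 
from a directory that comes from the configuration file or a compile-time 
prefix. The function name build_import_command and the merlin_dir 
parameter are hypothetical.

```c
#include <stdio.h>

/* Build the import command from a configurable install directory.
 * Returns 0 on success, -1 if the result does not fit in buf. */
int build_import_command(char *buf, size_t len, const char *merlin_dir)
{
    int n = snprintf(buf, len, "php %s/import.php", merlin_dir);

    if (n < 0 || (size_t)n >= len)
        return -1;   /* encoding error or truncated output */
    return 0;
}
```

Called with "/usr/local/merlin", this yields 
"php /usr/local/merlin/import.php" without any developer-specific path 
in the binary.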


15) Status durations are not transported between NOC, Poller and Peer.

Although statuses are shared between them, the status duration isn't, which 
could be confusing if people rely on the CGI interface to get their 
information.


16) Inconsistency in merlin.conf

You specify "address" in the main section but need to specify "port" in 
daemon{}. Only the daemon listens on "address", so putting it in daemon{} 
would be the right thing to do. No?

"ipc_socket" is used by both the daemon and the module, right? Keeping it in 
the "main" section would be OK, I think.


17) Folder name in archive

When I extract the archive at 
"http://www.op5.org/op5media/op5.org/downloads/merlin-0.6.2-beta2.p1.tar.gz" 
I get a folder named merlin/.

Would it be possible to name it merlin-0.6.2-beta2.p1/ so I can download 
multiple versions without getting name conflicts?


18) Typo in Ninja

Typo here:
./ninja/application/views/themes/default/icons/16x16/nofity-disabled.png

Views should be fixed and the file renamed.


That is all for now. Do not hesitate to ask questions if I'm not clear. :)

--
Mathieu

