[op5-users] Me trying to test Merlin and Ninja

Johannes Dagemark jd at op5.com
Thu Jul 30 12:13:30 CEST 2009


Hi Mathieu

Thanks for your extensive testing and feedback, it's really appreciated.

Andreas is on vacation and back on Tuesday next week, but I'll answer 
the parts I can.

The distributed functionality in Merlin is very limited right now; we have 
been focusing on the status info -> db parts, since they are needed for 
Ninja to work.

Nonetheless, the issues you posted are a great compilation of Merlin 
problems. I have posted them to our Mantis bug tracker, which basically 
means that sooner or later we will act on them. I also really have to 
focus on making our Mantis installation available to the public, to make 
it easier to follow the progress :)

Mathieu Gagné wrote:
> Hi,
>
> This message is kind of a follow-up on my first message on Nagios-devel: 
> "My impression about Merlin and Ninja".
>
> I tried several kinds of setups:
>
> 1) NOC + Poller (on same server)
> - Everything on the same server: NOC + Poller
> - All hosts/services definitions on the same server
>
> 2) NOC + Poller
> - Identical hosts/services definitions on all servers. (NOC + Poller)
> - With a hostgroup telling which hosts the poller is responsible for.
>
> 3) NOC + 2 x Pollers (independent)
> - NOC has all hosts/services definitions
> - Pollers have only the hosts/services definitions they are responsible for.
> - Hosts are organized in 2 hostgroups, one for each Poller. Hostgroups 
> are configured on the NOC and both Pollers.
>
> 4) NOC + 2 x Pollers (independent) and one Poller having a Peer
> - NOC has all hosts/services definitions
> - Pollers have only the hosts/services definitions they are responsible for.
> - Hosts are organized in 2 hostgroups, one for each Poller/Peer. 
> Hostgroups are configured on the NOC and both Pollers/Peer.
>
>
> My hope is to be able to handle +40k hosts and around 100k services 
> (maybe more) with ~40 pollers. (depending on the overall performance)
>
> I'm not the best QA in town, however here is what I found while trying 
> to use/test Merlin. Please tell me if I'm doing anything wrong.
>
>
> 1) Merlin crashes with this error when I try to stop it on the NOC:
> *** glibc detected *** /usr/local/merlin/merlind: free(): invalid 
> pointer: 0x08307b68 ***
>
> Any idea about how I can debug and pin-point the source of the problem?
>
>   
I have to wait for Andreas for this one.
> 2) Local commands are not transported between NOC, Pollers and Peers.
>
> If I disable notification on a service on a Poller, the database gets 
> updated and I can see notifications are disabled in Ninja. However it 
> does not get updated on the NOC nor the Peer. Also, the database gets 
> updated by the NOC or the Peer later on and notifications are now back 
> to "enabled" in Ninja.
>
> It's the same for notifications, active checks and probably all other 
> local commands.
>
>   
Reported in Mantis and will be fixed
> 3) Active checks are scheduled on every server (NOC, Poller and Peer)
>
> Active checks for hosts and services are scheduled on every server: 
> NOC, Poller and Peer.
>
> This results in a constant "conflict" between all Nagios instances. 
> Isn't the Poller supposed to be the only one scheduling checks or am I 
> missing something? Did I misconfigure everything?
>
> If this is normal behavior, what's the purpose of Merlin then? If the NOC 
> still schedules every host/service the poller is supposed to take care 
> of and propagates check results to everyone, what's the purpose of all this?
>   
It does not sound normal... I have reported it and will check with Andreas.
> 4) Question: Which server is responsible for sending notifications?
>
> Will I get 3 notices because the NOC, Poller and Peer decided they 
> should send the notification?
>
>   
I would say that it depends on how you would like it to be. In some 
cases it makes sense to have only the NOC send out notifications, and 
in other cases you would like the pollers to send notifications as well. 
How to do this is not really clear right now, but I would like the user 
to be able to simply enable and disable notifications per poller from the 
GUI as usual. We need to make sure that Merlin can handle this though.
> 5) Which server has authority over the database?
>
> When I start the NOC, all hosts/services definitions are exported to the 
> database. Everything is fine in Ninja.
>
> However, if I start/restart a Poller, only the hosts/services the poller 
> is responsible for get exported to the database and everything else is 
> dropped. Am I missing something?
>   
The starting point is that the NOC should be responsible for updating 
the database. This is also something that we need to figure out.
>
> 6) STDOUT, STDERR and STDIN should be closed when daemonizing.
>
> Otherwise I'm getting logs on my console while I'm actually trying to 
> work on it...
>
>   
Agreed and reported.
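
For reference, a minimal sketch of the usual approach: once the daemon has 
forked, it points stdin, stdout and stderr at /dev/null instead of leaving 
them attached to the terminal. The function names below are hypothetical 
and this is not Merlin's actual code, just an illustration of the idea.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: make target_fd refer to /dev/null.
 * Returns 0 on success, -1 on error. */
int redirect_fd_to_devnull(int target_fd)
{
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;
    if (fd == target_fd)
        return 0; /* target was closed and open() reused its number */

    int rc = dup2(fd, target_fd) < 0 ? -1 : 0;
    close(fd); /* the temporary descriptor is no longer needed */
    return rc;
}

/* Hypothetical helper: detach all three standard streams.
 * Intended to be called in the daemon process after fork()/setsid(). */
int detach_standard_streams(void)
{
    if (redirect_fd_to_devnull(STDIN_FILENO) < 0)
        return -1;
    if (redirect_fd_to_devnull(STDOUT_FILENO) < 0)
        return -1;
    if (redirect_fd_to_devnull(STDERR_FILENO) < 0)
        return -1;
    return 0;
}
```

Redirecting to /dev/null rather than just closing the descriptors keeps 
numbers 0-2 occupied, so a later open() or socket() cannot end up as 
"stdout" by accident.
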
> 7) Split-brain situations
>
> In the event the NOC and the Poller can't talk to each other but both 
> can talk to the database, this would cause a split-brain situation.
>
>   
Yep, I reported this as well.
> 8) Latency
>
> There's a ~3-second latency before the NOC or Poller gets updated with 
> each other's status.
>
> This is just an observation, no real problem here.
>
>   
I have no comment here and no idea what's causing it.
> 9) Sources are a mess! :-O
>
> Sources should be better organized. Just take a look at Nagios or 
> NDOutils, everything is organized in folders: includes, src, etc.
>
> There's also a lack of comments in the sources. If you want people to 
> contribute and help with Merlin, they should be able to understand what 
> the code is doing.
>
> A great example is the source code of ARC in ZFS:
> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3647
>
> Although it is an extreme example, it shows how good sources can/should 
> be documented.
>
>   
Agreed. I started out as a sysadmin, and when downloading code I simply 
hate bad structure; it gives an instant feeling of bad quality. Andreas 
has been working pretty hard on implementing all the new functionality in 
Merlin, so I'm inclined to be lenient about this. We will fix it, but 
a patch is always welcome :)
> 10) NOC port in merlin.conf
>
> Merlin should bail out if no port is specified in merlin.conf on a Poller 
> for noc{}, otherwise it seems Merlin will try to connect to TCP port 0 
> on the NOC and fail miserably every second. Make it mandatory or use a 
> default one.
>
> It seems it's not an issue for poller{} or peer{}. Can someone confirm?
>
>   
Reported. We will check it and fix it if it's an issue.
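
A minimal sketch of the "use a default one" option, assuming the parsed 
port is 0 when the setting is missing. The function name is hypothetical, 
and the default value 15551 is only a placeholder assumption for the 
example, not a confirmed Merlin setting.

```c
/* Placeholder default for illustration only (an assumption). */
#define EXAMPLE_DEFAULT_PORT 15551

/* Hypothetical helper: decide which TCP port to connect to.
 * configured == 0 means "not set in the config file".
 * Returns the port to use, or -1 if the value is out of range,
 * in which case the caller should refuse to start. */
int resolve_port(long configured)
{
    if (configured == 0)
        return EXAMPLE_DEFAULT_PORT;
    if (configured < 0 || configured > 65535)
        return -1;
    return (int)configured;
}
```

With something like this, a noc{} block without a port would fall back to 
the default instead of retrying against TCP port 0, and a nonsense value 
would be rejected at startup.
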
> 11) Ninja shows active checks disabled on host when only a service has 
> active checks disabled
>
> Try disabling active checks on a service. You will see Ninja reporting 
> active checks being disabled on both host and service when only active 
> checks on the service are disabled.
>
>   
I'm actually not able to run any external commands right now :( so I have 
not been able to verify it. I did report it though, and we will fix this.
> 12) Files organization in Merlin
>
> Binary should be installed in /usr/local/merlin/bin
> Configuration file should be installed in /usr/local/merlin/etc
> Etc...
>
>   
Yep, or perhaps $nagiosdir/addons/merlin/[bin,etc]

> 13) Configuration file's name and syntax
>
> To be consistent with Nagios, NRPE and NDOutils, configuration file 
> should be named merlin.cfg.
>
> Syntax in merlin.cfg seems to be very permissive. Several syntaxes are 
> supported:
> => key = value
> => key = value;
> => key   value
> => key   value;
>
> Which one is recommended? README and example.conf use both "key = value" 
> and "key = value;" in their examples which could be confusing.
>
>   
Agree and reported.
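
To illustrate what "permissive" means here, a minimal sketch of a line 
parser that accepts all four forms listed above ("=" optional, trailing 
";" optional). This is not Merlin's actual parser; the function name and 
buffer handling are assumptions made for the example.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: split one config line into key and value.
 * Accepts "key = value", "key = value;", "key value" and "key value;".
 * Returns 0 on success, -1 if key or value is empty or does not fit. */
int parse_kv(const char *line, char *key, size_t ksz, char *val, size_t vsz)
{
    const char *p = line;
    size_t n;

    while (isspace((unsigned char)*p))
        p++;

    /* The key runs until whitespace or '='. */
    const char *kstart = p;
    while (*p && !isspace((unsigned char)*p) && *p != '=')
        p++;
    n = (size_t)(p - kstart);
    if (n == 0 || n >= ksz)
        return -1;
    memcpy(key, kstart, n);
    key[n] = '\0';

    /* Skip the separator: whitespace with an optional '='. */
    while (isspace((unsigned char)*p))
        p++;
    if (*p == '=') {
        p++;
        while (isspace((unsigned char)*p))
            p++;
    }

    /* The value is the rest of the line, minus trailing whitespace
     * and one optional trailing ';'. */
    const char *vend = p + strlen(p);
    while (vend > p && isspace((unsigned char)vend[-1]))
        vend--;
    if (vend > p && vend[-1] == ';')
        vend--;
    while (vend > p && isspace((unsigned char)vend[-1]))
        vend--;
    n = (size_t)(vend - p);
    if (n == 0 || n >= vsz)
        return -1;
    memcpy(val, p, n);
    val[n] = '\0';
    return 0;
}
```

Whichever form ends up being the recommended one, the README and 
example.conf would only need to show that single form, even if the parser 
keeps accepting the others.
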
> 14) Hard-coded value in daemon.c
>
> This value is hard-coded in daemon.c, which isn't great:
>
> static char *import_program = "php /home/exon/git/monitor/merlin/import.php";
>
>   
Not going to argue about that :)
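
A minimal sketch of one way out, assuming the value gets read from the 
configuration file and only falls back to a built-in default when it is 
not set. The function name and the fallback path are hypothetical, chosen 
for illustration only.

```c
#include <stddef.h>

/* Hypothetical fallback, used only when nothing is configured. */
#define FALLBACK_IMPORT_PROGRAM "php /usr/local/merlin/import.php"

/* Hypothetical helper: pick the import command to run.
 * configured is the value read from the config file, or NULL/"" if
 * the setting is missing. */
const char *resolve_import_program(const char *configured)
{
    if (configured != NULL && configured[0] != '\0')
        return configured;
    return FALLBACK_IMPORT_PROGRAM;
}
```
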
> 15) Status durations are not transported between NOC, Poller and Peer.
>
> Although statuses are shared between them, the status duration isn't, 
> which could be confusing if people rely on the CGI interface to get 
> their information.
>
>   
reported
> 16) Inconsistency in merlin.conf
>
> You specify "address" in the main section but need to specify "port" in 
> daemon{}. Only the daemon listens on "address", so putting it in daemon{} 
> would be the right thing to do. No?
>
> "ipc_socket" is used by the daemon and the module, right? Keeping it in 
> the "main" section would be OK I think.
>
>   
reported
> 17) Folder name in archive
>
> When I extract the archive at 
> "http://www.op5.org/op5media/op5.org/downloads/merlin-0.6.2-beta2.p1.tar.gz" 
> I get a folder named merlin/.
>
> Would it be possible to name it merlin-0.6.2-beta2.p1/ so I can download 
> multiple versions without getting name conflicts?
>
>   
My bad, fixed now.
> 18) Typo in Ninja
>
> Typo here:
> ./ninja/application/views/themes/default/icons/16x16/nofity-disabled.png
>
> Views should be fixed and the file renamed.
>
>   
This was fixed already in 
http://git.op5.org/git/?p=nagios/ninja.git;a=commitdiff;h=3b264d50c422381689dc247fd0751f003564b591

> That is all for now. Do not hesitate to ask questions if I'm not clear. :)
>
>   
Once again, thanks! :)

A lot of the issues you found come from Merlin not really being ready 
when it comes to distributed setups. We have had a lot of discussions 
about Merlin, what it should do, how it should behave, etc., and now that 
people are actually trying to use it outside our office it gets a lot 
easier to see what's important, what we missed and what's not so important.

Most likely you will get a mail from Andreas next week giving some more 
answers and probably asking some questions.

Until then,

Take care
/Johannes
> --
> Mathieu


-- 
Johannes Dagemark
CTO / VP Engineering
________________________________________

op5 AB
Första Långgatan 19
SE-413 27 Gothenburg
cell: +46 733-70 90 24
fax:  +46 31-774 04 32
Email: jd at op5.com
http://www.op5.com/


