[op5-users] Me trying to test Merlin and Ninja

Mathieu Gagné mgagne at iweb.com
Tue Aug 4 20:55:21 CEST 2009


Hi,


On 04/08/09 7:49 AM, Andreas Ericsson wrote:
>> My hope is to be able to handle +40k hosts and around 100k services
>> (maybe more) with ~40 pollers. (depending on the overall performance)
>>
>
> This sounds very interesting, and it should be perfectly doable.

That should have read 10 pollers. We are planning to segment our 
datacenter into groups of 4k servers.

If I set up peers, will checks be load-balanced between them? I mean, 
should I be able to set up 2 peers and monitor twice as many 
hosts/services?


>> I'm not the best QA in town, however here is what I found while trying
>> to use/test Merlin. Please tell me if I'm doing anything wrong.
>>
>>
>> 1) Merlin crashes with this error when I try to stop it on the NOC:
>> *** glibc detected *** /usr/local/merlin/merlind: free(): invalid
>> pointer: 0x08307b68 ***
>>
>> Any idea about how I can debug and pin-point the source of the problem?
>>
>
> If you get a coredump (which you should), you can run the following
> command and send the output to me directly, or to this list:
>
> gdb -ex bt -ex quit /path/to/merlind /path/to/corefile
>
> That will, hopefully, tell me on which line in which sourcefile the
> invalid free is issued from which makes troubleshooting exceptionally
> simple.
>
> Does this also happen on the pollers, or is it only on the noc server?

On the NOC server. No coredump though. Where would they be located?


>> 2) Local commands are not transported between NOC, Pollers and Peers.
>>
>> If I disable notification on a service on a Poller, the database gets
>> updated and I can see notifications are disabled in Ninja. However it
>> does not get updated on the NOC nor the Peer. Also, the database gets
>> updated by the NOC or the Peer later on and notifications are now back
>> to "enabled" in Ninja.
>>
>> It's the same for notifications, active checks and probably all other
>> local commands.
>>
>
> This is because all events received from the network are processed by
> the database handler but not all events have corresponding handlers in
> the eventbroker module. This means that when a host or servicestatus
> update occurs on the NOC (which happens whenever a check is received
> from either the NOC or the poller), the status gets transmitted to the
> poller which then inserts the NOC's point of view into the database.
> If you wait a bit more you should see the status go back to disabled
> since that's what it is in Nagios' core on the poller. It's basically
> a race between the poller and the NOC; Whichever issues a status update
> event last will win.

Should I remove the database{} part on the poller? :-/
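
If I read the explanation correctly, the database simply keeps whichever 
status update arrives last. A minimal Python sketch of that race (the 
dict standing in for the database and the function name are mine, not 
Merlin's):

```python
# Hypothetical model: one shared status table, keyed by (host, service).
# Each node writes its own view; nothing is merged, so the last write wins.

def apply_updates(updates):
    """Apply (source, key, notifications_enabled) updates in arrival order."""
    table = {}
    for source, key, notifications_enabled in updates:
        table[key] = {"source": source,
                      "notifications_enabled": notifications_enabled}
    return table


key = ("web01", "HTTP")  # made-up host and service
updates = [
    ("poller", key, False),  # notifications disabled on the poller
    ("noc", key, True),      # the NOC still believes they are enabled
]
final = apply_updates(updates)
print(final[key])
```

With the NOC's update arriving last, Ninja would show notifications as 
enabled again, which matches what I am seeing.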


>> 3) Active checks are scheduled on every servers (NOC, Poller and Peer)
>>
>> Active checks for hosts and services are scheduled on every servers:
>> NOC, Poller and Peer.
>>
>> This results in a constant "conflict" between all Nagios instances.
>> Isn't the Poller supposed to be the only one scheduling checks or am I
>> missing something? Did I misconfigured everything?
>>
>> If it's a normal behavior, what's the purpose of Merlin then? If the NOC
>> still schedules every hosts/services the poller is supposed to take care
>> of and propagate check results to everyone, what's the purpose of all this?
>>
>
> Checks should be disabled for the hosts in question on the NOC after the
> poller handling those checks have connected. Peers should schedule the
> same checks, but once one peer has executed one check the other peer will
> reschedule it to the normal check interval + 15 seconds, which ensures
> that the second peer will only take over the check if the first peer has
> a check latency of 15 seconds or more.

Ok. Will the peer force a reschedule of the check to "check interval + 
15 seconds" when it receives a status update from the "main" peer?

How can we make sure checks are load-balanced fairly? (if they should be)
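
To make sure I understand the rule, here is how I picture it in Python. 
This is only my reading of your description; the function names and the 
constant's name are my own, not taken from the Merlin sources:

```python
PEER_GRACE_SECONDS = 15  # value from the description above; name is mine


def next_check_after_peer_result(result_time, check_interval,
                                 grace=PEER_GRACE_SECONDS):
    """Time at which the second peer would run the check itself.

    Assumes the peer that did NOT run the check pushes its own schedule to
    the normal interval plus a grace period, so it only steps in when the
    first peer is at least `grace` seconds late.
    """
    return result_time + check_interval + grace


def second_peer_takes_over(first_peer_latency, grace=PEER_GRACE_SECONDS):
    """True when the first peer is late enough for the second to run it."""
    return first_peer_latency >= grace


# A check with a 300 s interval whose result arrived at t=1000:
print(next_check_after_peer_result(1000, 300))
print(second_peer_takes_over(4), second_peer_takes_over(20))
```

If that is right, the peer that runs a check first effectively keeps it 
for as long as it stays on time, hence my question about fairness.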


> I'll take a look at the NOC check disabling issue though.

Ok. Thanks.


>> 4) Question: Which server is responsible for sending notifications?
>>
>> Will I get 3 notices because the NOC, Poller and Peer decided they
>> should send the notification?
>>
>
> The original idea is that the NOC is responsible for sending notifications.
> This is to save up on notification-specific hardware that usually costs
> a bit of money.
>
> The intention now is to have pollers and noc's both be capable of sending
> notifications. I'm not entirely clear on how the semantics should be or
> how one should configure this. It's likely that SMS-notifications need to
> be sent from a server in the right country, for example, to avoid costly
> international rates, while email notifications could probably be sent
> from pretty much any server. I'm guessing it would make sense to add a new
> pragma ("handles_notifications = yes/no") to the node-sections in the
> merlin configuration file.

Ok.


>> 5) Which server has authority over the database?
>>
>> When I start the NOC, all hosts/services definitions are exported to the
>> database. Everything is fine in Ninja.
>>
>> However, if I start/restart a Poller, only hosts/services the poller is
>> responsible of get exported to the database and everything else is
>> dropped. Am I missing something?
>>
>
> Each poller is supposed to only have its share of the configuration in the
> database. This is to enable users with widespread networks and tiered
> responsibility to be able to use Merlin and Nagios to delegate responsibility
> to the admin in question. Imagine you have a world-wide network with admins
> responsible for each branch office, and a team of admins responsible for a
> quadrant of the world (think asia, europe, usa, world). "world" is the admin
> team responsible for the core network of the world, with sub-teams in asia
> etc. The asian team should only see the servers in asia and therefore they
> only see the stuff their poller is monitoring.

There's a TRUNCATE TABLE in the import script. I don't know how I'm 
supposed to not get my database wiped at each restart with it. :D
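
Here is the effect as I understand it, reproduced with SQLite from 
Python against a single shared database, which is how I set things up. 
The table and host names are made up, and since SQLite has no TRUNCATE 
TABLE, DELETE FROM stands in for it:

```python
import sqlite3


def import_config(conn, hosts):
    """Mimic an import that empties the table before loading its own hosts."""
    conn.execute("DELETE FROM host")  # stand-in for TRUNCATE TABLE
    conn.executemany("INSERT INTO host (name) VALUES (?)",
                     [(h,) for h in hosts])
    conn.commit()


def host_names(conn):
    """Return all host names currently in the table, sorted."""
    return [row[0] for row in
            conn.execute("SELECT name FROM host ORDER BY name")]


conn = sqlite3.connect(":memory:")  # one database shared by NOC and poller
conn.execute("CREATE TABLE host (name TEXT PRIMARY KEY)")

import_config(conn, ["asia-01", "europe-01", "usa-01"])  # NOC starts
after_noc = host_names(conn)

import_config(conn, ["asia-01"])  # poller restarts with its share only
after_poller = host_names(conn)

print(after_noc)
print(after_poller)
```

After the poller's import, only its own host is left in the table.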


>> 6) STDOUT, STDERR and STDIN should be closed when demonizing.
>>
>> Otherwise I'm getting logs on my console while I'm actually trying to
>> work on it...
>>
>
> True. I'll fix this.

Thanks.
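
For the record, what I have in mind is the usual step of pointing the 
three standard descriptors at /dev/null once the daemon has detached. A 
rough Python sketch of that step only (not the full daemonizing 
sequence, and not Merlin code):

```python
import os


def redirect_to_devnull(fds=(0, 1, 2)):
    """Point each descriptor in `fds` at /dev/null.

    os.dup2() closes whatever each descriptor referred to before, so
    anything the process writes there afterwards is silently discarded.
    """
    devnull = os.open(os.devnull, os.O_RDWR)
    try:
        for fd in fds:
            os.dup2(devnull, fd)
    finally:
        if devnull not in fds:
            os.close(devnull)


# Demonstration on a pipe instead of the real stdout, so the terminal
# running this example is left alone.
read_end, write_end = os.pipe()
redirect_to_devnull([write_end])
os.write(write_end, b"log line\n")  # goes to /dev/null, not the pipe
leftover = os.read(read_end, 16)    # the pipe has no writer left: EOF
print(leftover)
```

The pipe is only there to show that nothing reaches the old destination 
any more; a daemon would call the function with the default (0, 1, 2).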


>> 7) Split-brain situations
>>
>> In the event the NOC and the Poller can't talk to each others but both
>> can talk to the database, this would cause a split-brain situation.
>>
>
> NOC and poller should never connect to the same database. That will cause
> conflicts such as the ones you've seen up above.

So you are telling me that every NOC and poller should have its own 
independent database? :-O

Knowing this answers my questions in (2) and (5). It should be 
documented somewhere. People coming from NDOutils will assume they can 
put everything in the same database.

And my understanding is that Ninja shouldn't be installed on an 
independent server but on the NOC and pollers. Right?


>> 8) Latency
>>
>> There's a ~3 seconds latency before the NOC or Poller get updated with
>> each others status.
>>
>> This is just an affirmation, no real problem here.
>>
>
> A small delay is sort of expected, really. 3 seconds is ok imo.

Ok.


>> 9) Sources are a mess! :-O
>>
>> Sources should be better organized. Just take a look at Nagios or
>> NDOutils, everything is organized in folders: includes, src, etc.
>>
>
> There's no sane reason to separate sources and header files unless
> you're writing a library, in which case you want to separate public
> header files into a separate folder. In those cases it almost always
> makes sense to let the .c files include the private headers and let
> the private headers include the public ones.
>
> The {src,include}/ insanity comes from autoconf, which merlin doesn't
> use (and never will use as a primary means of configuring the build).
>
> There will be a Documentation folder though, which will hold all the
> relevant documentation beyond the README.

Ok.


>> There's also a lack of comments in the sources. If you want peoples to
>> contribute and help with Merlin, they should be able to understand what
>> the code is doing.
>>
>> A great example is the source code of ARC n ZFS:
>> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3647
>>
>> Although it is an extreme example, it shows how good sources can/should
>> be documented.
>>
>
> That is indeed an extreme example, and much of it would have been way
> better off as a separate technical document. I will add appropriate
> comments to Merlin, but I will not do things like
>
>     /* flush the buffer */
>     flush(fd);
>
> which is just totally super-redundant commentary that annoys people who
> actually read the code. A comment that explains what a small piece of
> code does without also explaining why that is necessary comes under the
> heading of "useless cruft".


I'm not asking for that kind of basic comment. I was just trying to 
understand what the code does from a sysadmin's perspective, since 
external documentation is lacking.

Fun fact: http://www.ohloh.net/p/merlin_mod/factoids/1764700


>> 10) NOC port in merlin.conf
>>
>> Merlin should bailout if no port is specified in merlin.conf on a Poller
>> for noc{}, otherwise it seems Merlin will try to connect to TCP port 0
>> on the NOC and miserably fail every seconds. Make it mandatory or use a
>> default one.
>>
>> It seems it's not an issue for poller{} or peer{}. Can someone confirm?
>>
>
> The default should be 15551, as specified in the documentation. I suppose
> this is a bug, so I'll investigate it.

Ok. Thanks.
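
Something along these lines is what I would expect. 15551 is the 
documented default you mention, while the function and the dict layout 
are only for illustration:

```python
DEFAULT_PORT = 15551  # default named in the reply above


def effective_port(node_config, default=DEFAULT_PORT):
    """Return the port to connect to, falling back to the default.

    A missing, empty, or zero port is treated as "not configured" rather
    than as a request to connect to TCP port 0.
    """
    raw = node_config.get("port")
    if raw in (None, ""):
        return default
    port = int(raw)
    if port == 0:
        return default
    if not 0 < port < 65536:
        raise ValueError("port out of range: %r" % (raw,))
    return port


# 192.0.2.10 is a documentation address, not a real NOC.
print(effective_port({"address": "192.0.2.10"}))
print(effective_port({"address": "192.0.2.10", "port": "15552"}))
```

Bailing out with an error instead of falling back would also be fine by 
me, as long as it does not retry port 0 forever.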


>> 11) Ninja shows active checks disabled on host when only a service has
>> active checks disabled
>>
>> Try disabling active checks on a service. You will see Ninja reporting
>> active checks being disabled on both host and service when only active
>> checks on the service are disabled.

Is it a bug or not?


>> 12) Files organization in Merlin
>>
>> Binary should be installed in /usr/local/merlin/bin
>> Configuration file should be installed in /usr/local/merlin/etc
>> Etc...
>>
>
> That's primarily a matter of style.

I disagree. What if Nagios had one big folder with everything in it: 
the daemon, plugins and configuration files? You might call it style; 
I call it organization.


>> 13) Configuration file's name and syntax
>>
>> To be consistent with Nagios, NRPE and NDOutils, configuration file
>> should be named merlin.cfg.
>>
>
> Ignoring this. You can name your nagios and nrpe configuration files
> whatever you like and it will still work the same. The unix tradition
> is to use ".conf" for configuration files. Windows and it's original
> retarded 8+3 naming scheme had to come up with something else, so the
> .cfg suffix stems from there.

It's just for the sake of consistency, which is lacking. You can 
obviously ignore my comment.


>> Syntax in merlin.cfg seems to be very permissive. A lot of syntax are
>> supported:
>> =>  key = value
>> =>  key = value;
>> =>  key   value
>> =>  key   value;
>>
>> Which one is recommended? README and example.conf use both "key = value"
>> and "key = value;" in their examples which could be confusing.
>>
>
> You can use semi-colon as a key+value delimiter, so you can put multiple
> statements on a single line if you wish, like so:
>    key1 = value1; key2 = value2;

Ok.
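
So, if I understand the rules, a parser roughly like the following would 
accept all four forms. This is a guess at the grammar based on your 
answer and the examples, not a port of Merlin's parser:

```python
def parse_statements(line):
    """Split a config line into (key, value) pairs.

    Assumed rules: ';' ends a statement, '=' between key and value is
    optional, and surrounding whitespace is ignored.
    """
    pairs = []
    for statement in line.split(";"):
        statement = statement.strip()
        if not statement:
            continue  # trailing ';' or empty statement
        if "=" in statement:
            key, _, value = statement.partition("=")
        else:
            parts = statement.split(None, 1)  # split on any whitespace run
            key = parts[0]
            value = parts[1] if len(parts) > 1 else ""
        pairs.append((key.strip(), value.strip()))
    return pairs


for sample in ("key = value", "key = value;", "key   value",
               "key1 = value1; key2 = value2;"):
    print(parse_statements(sample))
```

Nested sections such as daemon{} are left out on purpose; the sketch 
only covers the key/value lines.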


>
>>
>> 14) Hard-coded value in daemon.c
>>
>> This value is hard-coded in daemon.c, which isn't great:
>>
>> static char *import_program = "php /home/exon/git/monitor/merlin/import.php";
>>
>
> It can (and obviously should) be configured. The hardcoded value is a
> remnant from my testing.

Ok.


>> 15) Status duration are not transported between NOC, Poller and Peer.
>>
>> Although status are shared between them, the status duration isn't which
>> could be confusing if peoples rely on the CGI interface to get their
>> information.

Is it a bug or not?


>> 16) Inconsistency in merlin.conf
>>
>> You specify "address" in the main section but need to specify "port" in
>> daemon{}. Only the daemon listen on "address" so putting it in daemon{}
>> would be the right thing to do. No?
>>
>
> Actually, you can specify "port" in the global section to set a new default
> port (I think, I need to read the sources again to verify this). You're
> right that only the daemon needs the "address" part though. I'll amend this
> but retain the old behaviour with a warning that it's deprecated and will
> no longer work as of a year from now. Thanks for noticing.

Ok. Thanks.


>> 17) Folder name in archive
>>
>> When I extract the archive at
>> "http://www.op5.org/op5media/op5.org/downloads/merlin-0.6.2-beta2.p1.tar.gz"
>> I get a folder named merlin/.
>>
>> Would it be possible to name it merlin-0.6.2-beta2.p1/ so I can download
>> multiple versions without getting name conflicts?
>>
>
> Yes. I have no idea who created those packages, but I always add the version
> to the directory when I create packages.

Thanks.


> Thanks for your thorough testing. I'll take a look at the deficiencies you
> reported and see what can be done about them. Some I'll leave unregarded
> though, as it's just a matter of personal style rather than any functional
> issue (sort of like "you should indent like this instead"). Hope you're ok
> with that.

Code is poetry. It's why I have a lot of issues with the way sources are 
documented and organized. :)


We are basically looking for a solution to deploy Nagios in a very large 
scale environment. Being able to scale and retrieve host/service status 
easily are the main keys.

We are doing very well with NDOutils at this time: multiple independent 
Nagios instances logging to NDOutils. We applied the following patches 
to Nagios and NDOutils to improve performance and to reduce the amount 
of data exported to NDOutils:
- http://svn.opsview.org/opsview/tags/nagios-patch-day/opsview-base/patches/nagios_stop_logging_retained_states_to_ndo.patch
- http://svn.opsview.org/opsview/tags/nagios-patch-day/opsview-base/patches/ndoutils_stop_logging_retained_states_to_ndo.patch


For now, we will probably wait and see what Merlin becomes in the coming 
months and reevaluate it once it is out of beta/testing.

--
Mathieu

