[op5-users] Me trying to test Merlin and Ninja

Andreas Ericsson ae at op5.se
Wed Aug 5 11:42:07 CEST 2009


Mathieu Gagné wrote:
> Hi,
> 
> 
> On 04/08/09 7:49 AM, Andreas Ericsson wrote:
>>> My hope is to be able to handle +40k hosts and around 100k services
>>> (maybe more) with ~40 pollers. (depending on the overall performance)
>>>
>> This sounds very interesting, and it should be perfectly doable.
> 
> You should have read 10 pollers. We are planning on segmenting our
> datacenter into groups of 4k servers.
> 
> By setting up peers, will checks be load-balanced between them? I mean,
> should I be able to set up 2 peers and monitor twice as many hosts/services?
> 

Not exactly twice as many, but perhaps 90-95% more. Merlin has a slight
overhead which is inevitable, so it can never be twice as many.
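
As a rough illustration of that estimate, here is the arithmetic for a
hypothetical single node that handles 50000 checks. The baseline is an
assumed figure, not a measurement; only the 90-95% gain comes from the
estimate above.

```shell
# Estimated capacity of two peers, given what one node handles alone
# and the expected gain in percent (integer arithmetic).
peered_capacity() {
    echo $(( $1 + $1 * $2 / 100 ))
}

single=50000                      # hypothetical single-node capacity
peered_capacity "$single" 90      # prints 95000
peered_capacity "$single" 95      # prints 97500
```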

> 
>>> I'm not the best QA in town, however here is what I found while trying
>>> to use/test Merlin. Please tell me if I'm doing anything wrong.
>>>
>>>
>>> 1) Merlin crashes with this error when I try to stop it on the NOC:
>>> *** glibc detected *** /usr/local/merlin/merlind: free(): invalid
>>> pointer: 0x08307b68 ***
>>>
>>> Any idea about how I can debug and pin-point the source of the problem?
>>>
>> If you get a coredump (which you should), you can run the following
>> command and send the output to me directly, or to this list:
>>
>> gdb -ex bt -ex quit /path/to/merlind /path/to/corefile
>>
>> That will, hopefully, tell me on which line in which sourcefile the
>> invalid free is issued, which makes troubleshooting exceptionally
>> simple.
>>
>> Does this also happen on the pollers, or is it only on the noc server?
> 
> On the NOC server. No coredump though. Where would they be located?
> 

Wherever you started merlind from. Make sure coredumps are enabled in the
shell that starts merlind before it's started, though (ulimit -c unlimited).
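
A minimal sketch of that sequence follows. The helper name, the working
directory and the core file name are assumptions for illustration; on many
Linux systems the core file is named "core" or "core.PID", depending on
kernel settings.

```shell
# Hypothetical helper: run a command from a chosen directory with
# core dumps enabled, so any core file lands in that directory.
start_with_cores() {
    workdir=$1
    shift
    (
        cd "$workdir" || exit 1
        # Lift the core size limit for this subshell and its children only.
        ulimit -c unlimited || echo "warning: could not raise core limit" >&2
        "$@"
    )
}

# Example (paths are placeholders):
#   start_with_cores /var/tmp /usr/local/merlin/merlind <your usual arguments>
# After a crash, inspect the core left in /var/tmp:
#   gdb -ex bt -ex quit /usr/local/merlin/merlind /var/tmp/core
```

The limit is set inside a subshell, so it does not change the limits of
the shell you call the helper from.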

> 
>>> 2) Local commands are not transported between NOC, Pollers and Peers.
>>>
>>> If I disable notification on a service on a Poller, the database gets
>>> updated and I can see notifications are disabled in Ninja. However it
>>> does not get updated on the NOC nor the Peer. Also, the database gets
>>> updated by the NOC or the Peer later on and notifications are now back
>>> to "enabled" in Ninja.
>>>
>>> It's the same for notifications, active checks and probably all other
>>> local commands.
>>>
>> This is because all events received from the network are processed by
>> the database handler but not all events have corresponding handlers in
>> the eventbroker module. This means that when a host or servicestatus
>> update occurs on the NOC (which happens whenever a check is received
>> from either the NOC or the poller), the status gets transmitted to the
>> poller which then inserts the NOC's point of view into the database.
>> If you wait a bit more you should see the status go back to disabled
>> since that's what it is in Nagios' core on the poller. It's basically
>> a race between the poller and the NOC; whichever issues a status update
>> event last will win.
> 
> Should I remove the database{} part on the poller? :-/
> 

Pollers and NOCs need to write to different databases. If you want your
pollers to write to databases, you need to create separate databases for
them.
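
One way to set that up, sketched under the assumption of a MySQL-style
server. The node names and the "merlin_" prefix are made up for
illustration. Each instance's database{} section would then point at its
own database.

```shell
# Print one CREATE DATABASE statement per Merlin instance.
# The node names and the "merlin_" prefix are hypothetical.
merlin_db_statements() {
    for node in "$@"; do
        printf 'CREATE DATABASE IF NOT EXISTS merlin_%s;\n' "$node"
    done
}

merlin_db_statements noc poller1 poller2
# To apply, pipe the output into your database client, e.g. "| mysql -u root -p".
```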

> 
>>> 3) Active checks are scheduled on every server (NOC, Poller and Peer)
>>>
>>> Active checks for hosts and services are scheduled on every server:
>>> NOC, Poller and Peer.
>>>
>>> This results in a constant "conflict" between all Nagios instances.
>>> Isn't the Poller supposed to be the only one scheduling checks or am I
>>> missing something? Did I misconfigure everything?
>>>
>>> If this is normal behavior, what's the purpose of Merlin then? If the NOC
>>> still schedules every host/service the poller is supposed to take care
>>> of and propagate check results to everyone, what's the purpose of all this?
>>>
>> Checks should be disabled for the hosts in question on the NOC after the
>> poller handling those checks has connected. Peers should schedule the
>> same checks, but once one peer has executed one check the other peer will
>> reschedule it to the normal check interval + 15 seconds, which ensures
>> that the second peer will only take over the check if the first peer has
>> a check latency of 15 seconds or more.
> 
> Ok. Will the peer force a reschedule of the check to "check interval +
> 15 seconds" if it receives a status update from the "main" peer?
> 
> How can we make sure checks are load-balanced fairly? (if they should be)
> 

I'm not exactly sure. They probably won't be, but does it matter so long as
all checks are executed properly?
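
The takeover rule quoted above (normal check interval + 15 seconds) can be
sketched roughly like this. It illustrates the timing only and is not
Merlin's actual code; the function names are hypothetical.

```shell
PEER_GRACE=15   # seconds of latency tolerated before the other peer steps in

# next_takeover_check LAST_RUN INTERVAL
# Epoch second at which the idle peer would run the check itself.
next_takeover_check() {
    echo $(( $1 + $2 + PEER_GRACE ))
}

# should_take_over NOW LAST_RUN INTERVAL
# Succeeds once the peer that ran the check last is 15 seconds or more late.
should_take_over() {
    [ "$1" -ge "$(next_takeover_check "$2" "$3")" ]
}

# A check last run at t=1000 with a 300 second interval:
next_takeover_check 1000 300    # prints 1315
```

As long as the first peer keeps its latency under 15 seconds, its result
arrives before t=1315 and the second peer pushes its own copy of the check
out again, so it never runs it.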

> 
>>> 5) Which server has authority over the database?
>>>
>>> When I start the NOC, all hosts/services definitions are exported to the
>>> database. Everything is fine in Ninja.
>>>
>>> However, if I start/restart a Poller, only hosts/services the poller is
>>> responsible for get exported to the database and everything else is
>>> dropped. Am I missing something?
>>>
>> Each poller is supposed to only have its share of the configuration in the
>> database. This is to enable users with widespread networks and tiered
>> responsibility to be able to use Merlin and Nagios to delegate responsibility
>> to the admin in question. Imagine you have a world-wide network with admins
>> responsible for each branch office, and a team of admins responsible for a
>> quadrant of the world (think Asia, Europe, USA, world). "world" is the admin
>> team responsible for the core network of the world, with sub-teams in Asia
>> etc. The Asian team should only see the servers in Asia and therefore they
>> only see the stuff their poller is monitoring.
> 
> There's a TRUNCATE TABLE in the import script. I don't know how I'm 
> supposed to not get my database wiped at each restart with it. :D
> 

Again, each Merlin instance needs to write to a separate database instance.
A lot of your problems seem to stem from the fact that you've set everything
up to write to the same database.

> 
>>> 7) Split-brain situations
>>>
>>> In the event the NOC and the Poller can't talk to each other but both
>>> can talk to the database, this would cause a split-brain situation.
>>>
>> NOC and poller should never connect to the same database. That will cause
>> conflicts such as the ones you've seen up above.
> 
> So you are telling me each NOC/Poller should all have independent 
> databases? :-O
> 

Yes. :-)

> Knowing this answers my questions in (2) and (5). It should be
> documented somewhere. People coming from NDOutils will think you can
> install everything in the same database.
> 

You're right. I'll amend the README.

> And my understanding is that Ninja shouldn't be installed on an 
> independent server but on the NOC and pollers. Right?
> 

Ninja can be installed on an independent server just fine. The only
thing that will get tricky with that approach is command submission,
which I feel we should be able to solve without too much hassle.
Just make sure Ninja reads from the database whose status you want it
to display and you should be golden.

> 
>>> There's also a lack of comments in the sources. If you want people to
>>> contribute and help with Merlin, they should be able to understand what
>>> the code is doing.
>>>
>>> A great example is the source code of ARC in ZFS:
>>> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3647
>>>
>>> Although it is an extreme example, it shows how good sources can/should
>>> be documented.
>>>
>> That is indeed an extreme example, and much of it would have been way
>> better off as a separate technical document. I will add appropriate
>> comments to Merlin, but I will not do things like
>>
>>     /* flush the buffer */
>>     flush(fd);
>>
>> which is just totally super-redundant commentary that annoys people who
>> actually read the code. A comment that explains what a small piece of
>> code does without also explaining why that is necessary comes under the
>> heading of "useless cruft".
> 
> 
> I'm not asking for that kind of basic comment. I was just trying to
> understand what the code was doing from a sysadmin perspective since 
> external documentation was lacking.
> 
> Fun fact: http://www.ohloh.net/p/merlin_mod/factoids/1764700
> 

http://www.ohloh.net/p/git/factoids/1796803

The 20% average comes in quite large parts from overly chatty projects
which insert the stupid type of comments I mentioned earlier. I'd happily
take patches to enhance the commentary along the lines of explaining
functions etc, although I usually prefer to refactor large functions into
very small ones that explain themselves with their code at a single glance.

> 
>>> 11) Ninja shows active checks disabled on host when only a service has
>>> active checks disabled
>>>
>>> Try disabling active checks on a service. You will see Ninja reporting
>>> active checks being disabled on both host and service when only active
>>> checks on the service are disabled.
> 
> Is it a bug or not?
> 

This most likely stems from the fact that you're using the same database
for all NOCs and pollers. Things will get weird when you do that.

> 
>>> 12) Files organization in Merlin
>>>
>>> Binary should be installed in /usr/local/merlin/bin
>>> Configuration file should be installed in /usr/local/merlin/etc
>>> Etc...
>>>
>> That's primarily a matter of style.
> 
> I disagree with your thought. What if Nagios had one big folder with 
> everything in it? The daemon, plugins and configuration files. You might 
> call it style, I call it organization.
> 

Now you're talking about different pieces of code that do vastly different
things. That's like saying "what if KDE and X.org had all their source
files in one directory?", which is just plain stupid. Nagios should stay
split into core, plugins and gui. If I did that split for Merlin, there
would be module, daemon and common subdirectories, with two files in
daemon, one file in module and the rest in "common". Such a split would
just make it harder for me to work on the sources and would make creating
a slick Makefile harder.

> 
>>> 15) Status durations are not transported between NOC, Poller and Peer.
>>>
>>> Although statuses are shared between them, the status duration isn't, which
>>> could be confusing if people rely on the CGI interface to get their
>>> information.
> 
> Is it a bug or not?
> 

This could be caused by the one-for-all database approach you've used.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

