[op5-users] Merlin: A Few Questions

Andreas Ericsson ae at op5.se
Mon Jan 18 10:44:46 CET 2010


On 01/14/2010 04:58 AM, Eric Schoeller wrote:
> Good Evening,
> 

Good morning :)

> 
> I am new to the list, and have just setup my first 2 node NOC/Poller/Peer
> merlin installation. I have been digging through the mailing list
> archives and the wiki to try and gain a better understanding of
> exactly how Merlin works ... but I still have a few questions.
> Please excuse me if they're silly!
> 

I shall answer them to the best of my abilities.

>      1. For a pair of NOCs peering with each other, should both be
>         using the same mysql database? I can't seem to find any
>         explanation on why/why not.
> 

No, they should use separate databases, or they'll double the load on
the database server. The idea is that peers are redundant, and the
database is necessary to view the information when you're using Ninja.
If you aren't using Ninja, you can safely configure Merlin to not use
a database at all, by simply omitting the "database {}" block in the
merlin config file.

>      2. Does configuration sharing work yet between NOC peers?
> 

No, it does not. It's fairly straightforward to hack something up
when all configuration is to be shared though, since all the files
are to be synced.

>      3. How are notifications handled between a pair of NOCs?
>         Specifically, if a notification is sent out from one
>         NOC, is merlin responsible for telling the other not
>         to notify?
> 

Good question. I'll have to dig into the code to answer that.

>      4. Is there a diagram depicting how merlin integrates with
>         the rest of the Nagios core?
> 

Unless there is some general diagram of how NEB modules work, no.
A basic explanation goes like this, though:

Something happens in Nagios (a check is scheduled, a notification
is sent, etc, etc). If a module has plugged itself in to receive
information about events of that kind, the specific part of the
module that plugs into that particular event is run.

The merlin daemon constantly checks for incoming events from other
merlin daemons and forwards them to the module when received. When
the module receives an event from the daemon, it alters the internals
of Nagios to match whatever it received, with a few exceptions.
See below for one of those.
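
As a rough sketch of the flow described above (this is not Merlin's or
Nagios' actual code; the class names, event names and the "outbox" are
made up for illustration, and handing local events to the daemon is an
assumption about how the module's callbacks are used):

```python
from collections import defaultdict


class EventBroker:
    """Toy stand-in for the Nagios event broker (hypothetical API)."""

    def __init__(self):
        self._callbacks = defaultdict(list)

    def register(self, event_type, callback):
        # A module plugs itself in for one kind of event.
        self._callbacks[event_type].append(callback)

    def emit(self, event_type, data):
        # Only the callbacks registered for this event type are run.
        return [cb(data) for cb in self._callbacks[event_type]]


class ToyModule:
    """Toy stand-in for the merlin module (hypothetical)."""

    def __init__(self, broker, state):
        self.state = state   # stands in for the internals of Nagios
        self.outbox = []     # events the module would hand to the daemon
        broker.register("service_check", self.on_local_event)

    def on_local_event(self, data):
        # Something happened locally; pass it on (assumed behaviour).
        self.outbox.append(data)
        return "forwarded"

    def on_peer_event(self, data):
        # The daemon relayed an event from a peer: update local state to match.
        self.state[data["service"]] = data["status"]


broker = EventBroker()
state = {}
module = ToyModule(broker, state)

broker.emit("service_check", {"service": "web/HTTP", "status": "OK"})
broker.emit("notification", {"service": "web/HTTP"})  # nothing registered, nothing runs
module.on_peer_event({"service": "db/MySQL", "status": "CRITICAL"})
```

After this runs, the outbox holds only the service check event, and the
local state reflects what the peer reported.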

>      5. From my understanding, two NOC peers are essentially just
>         two instances of Nagios running independently of each other,
>         scheduling their own checks etc. Merlin simply passes check
>         results between the two? How does this interface with the
>         Nagios scheduler? If checkA was run on peerB, peerB would
>         submit that result to peerA ... but what if peerA is in the
>         process of running checkA as well, or it's scheduled to run
>         within a few nanoseconds? Can the scheduler on peerA remove
>         that check from the queue since it already received a result?
>         Perhaps the answer to #4 will shed light on #5!
> 

When the merlin module receives a check result from its peer, it
re-schedules its own check of that host or service to a time 15
seconds later than the one the peer scheduled. This means that if one
instance of Nagios lags behind by more than 15 seconds, the other
instance will simply take over running those checks. When the first
instance then receives the completed check result from its peer, it
re-schedules its own check in the same way, so that loadbalancing
works again.

There might be a small number of checks that are run by both instances
before the scheduling queues are arranged nicely.
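
The 15-second rule can be sketched roughly like this. It is a toy model
under my own assumptions, not the module's real scheduling code: the
function names and the dictionary used as a schedule are hypothetical,
and times are plain seconds.

```python
PEER_OFFSET = 15  # seconds, as described above


def on_peer_result(own_schedule, check, peer_next_run):
    """Re-schedule our own run of `check` PEER_OFFSET seconds after the peer's."""
    own_schedule[check] = peer_next_run + PEER_OFFSET


def who_runs(own_next_run, peer_next_run, peer_lag):
    """Return which instance gets to the check first, given the peer's lag."""
    peer_actual = peer_next_run + peer_lag
    return "peer" if peer_actual <= own_next_run else "self"


schedule = {}
on_peer_result(schedule, "web/HTTP", peer_next_run=300)

print(schedule["web/HTTP"])                         # 315
print(who_runs(schedule["web/HTTP"], 300, 5))       # peer
print(who_runs(schedule["web/HTTP"], 300, 20))      # self
```

With 5 seconds of lag the peer still runs the check first. With 20
seconds of lag the local instance reaches its own scheduled time first
and takes over.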

>      6. Is there a chance that an event_handler would ever get executed
>         twice (on peerA and peerB) when it should have only been executed
>         once (on peerA) ?
> 

Event handlers aren't disabled by merlin, but they are only run by the
host that executes the check. Since some checks can be run by both
instances at the same time, it's possible that an event handler is run
twice even though it should only be executed once.

>      7. Has anyone integrated DNX with Merlin successfully? I tried to
>         load dnxServer.so along with running merlin, and got some complaints:
> 
>      [1263441167] 3: Non-control packet of type 0 with zero size length
> (this should never happen)
>      [1263441167] 7: Read 64 bytes from 172.20.0.16. protocol: 0, type:
> 0, len: 0
>      [1263441167] 6: ipc socket isn't ready to accept data: Success
>      [1263441167] 3: Unknown callback type. Weird, to say the least...
>      [1263441167] 6: Data available from peer 'a-test ' (172.20.0.16)
>      [1263441167] 3: recv(7, (buf + total), 1070757272, MSG_DONTWAIT |
> MSG_NOSIGNAL) returned -1 (Bad address)
>      [1263441167] 4: Bogus read in proto_read_event(). got -1, expected
> 1070793228
>      [1263441167] 3: read() from peer node a-test  failed: Bad address
> 

This looks like you're trying to send DNX data into the merlin daemon.
That doesn't work at all, since the protocols are completely different.
DNX uses an XML-based protocol. Merlin uses a binary one.

>      8. What are the popular methods for synchronizing configurations
>         between nagios hosts (if configuration sharing doesn't work)
>         Ideas that come to mind: DRBD, csync2, rsync etc.
> 

We've had a fair amount of success with scp, rsync and git. scp and
rsync have the advantage of being fairly simple to work with. git has
the advantage of adding revision control to your Nagios configuration
and makes merging possible in case the config is updated on both hosts
at the same time.
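
A minimal sketch of the rsync variant, for illustration only. The
config path and the peer hostname are placeholders, not something from
our setup, and the helper function is hypothetical. It only builds the
command line (as a dry run by default) so you can inspect it before
running anything.

```python
import shlex


def rsync_push(config_dir, peer, dry_run=True):
    """Build an rsync command mirroring config_dir to the same path on peer."""
    src = config_dir.rstrip("/") + "/"   # trailing slash: copy the contents
    cmd = ["rsync", "-az", "--delete"]   # archive mode, compress, drop stale files
    if dry_run:
        cmd.append("--dry-run")          # show what would change, copy nothing
    cmd += [src, f"{peer}:{src}"]
    return cmd


# Placeholder path and hostname, adjust to your installation.
cmd = rsync_push("/etc/nagios", "peer-b.example.com")
print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run() once the dry run
looks right. Nagios on the receiving host still has to reload its
configuration after the files change.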

>      9. I believe these three log-lines from merlin aren't good. Can
>         someone explain briefly what they mean, and under what
>         circumstances one would encounter them?
> 
>      ipc socket isn't ready to accept data: Success

This just means that the socket isn't ready to be read from. The kernel
might be filling it right at that moment, or it may not be connected.

>      ipc is not connected

It means the module hasn't connected to the (unix domain) ipc socket,
or that the connection has been reset, or that connecting to the ipc
socket failed for some reason.

>      Nulling OOB ptr 0. type: 0; offset: 0x30302e303030333b; len: 336;
> overshot with 3472326097204425195 bytes
> 


These would be found when Merlin thinks it has received a packet of
a type that it really hasn't. Either the sending end has failed to
pack up the strings in the packet and set the proper offsets, or
someone is sending junk into the merlin daemon.
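
One way to check for junk of this kind is to look at the bogus offset
as raw bytes. The sketch below uses only the numbers from the log line
above and assumes a little-endian machine (an assumption, since the
architecture isn't stated):

```python
offset = 0x30302e303030333b        # "offset" from the log line
length = 336                       # "len" from the log line
overshot = 3472326097204425195     # "overshot with ... bytes" from the log line

# Reinterpret the 64-bit offset as the 8 bytes it would occupy in memory.
raw = offset.to_bytes(8, "little")

print(raw)                           # b';3000.00'
print(offset - length == overshot)   # True
```

The "offset" decodes to printable ASCII, which suggests that text from
somewhere in the packet is being read where an offset was expected.
That fits both explanations given above but doesn't tell them apart.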

If you're using one 32-bit machine and one 64-bit machine, you might
get issues of this kind. I haven't actually tested running a merlin
network with that setup. It's easily fixable if that's the problem,
though it will take time I'm afraid I do not have, since Ninja
requires my full attention at the moment, unless customers start
screaming for merlin loadbalancing. The good thing about being a
hired programmer is that you get paid. The bad thing is that you
don't get to do what you want ;)
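
To illustrate the 32-bit/64-bit concern in general terms: the layout
below is made up for the example and is not Merlin's real packet
format. It assumes two fields whose size follows the machine's word
size, packed by a 32-bit sender and read by a 64-bit receiver.

```python
import struct

# Hypothetical layout: an offset followed by a length.
# A 32-bit sender packs each field as 4 bytes.
sent = struct.pack("<II", 16, 336)

# A 64-bit receiver expecting 8-byte fields reads both values as one.
(misread_offset,) = struct.unpack("<Q", sent)

print(len(sent))         # 8
print(misread_offset)    # 1443109011472, i.e. 16 + (336 << 32)
```

The receiver ends up with an offset far outside the packet, which is
the kind of value the out-of-bounds check would complain about.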

> 
> Sorry for the barrage of questions. I am clearly interested in this
> software<g>  I appreciate your patience and any responses or
> suggestions you may have.
> 

Questions are good. If no one asked any questions we would have
no idea what area of a product we should work on.

Thanks for the interest, and feel free to ask again if there's
anything else you're wondering about.

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

