[op5-users] merlin noc-poller setup
Andreas Ericsson
ae at op5.se
Fri Dec 4 10:38:34 CET 2009
On 12/03/2009 05:52 PM, Daniel Tuecks wrote:
> Hello List!
>
Hi Daniel. Thanks for trying out Merlin, and thank you very much for the
very detailed writeup. I'll hack and slash a bit in your mail and respond
to the bits I can.
> I am trying to set up nagios/merlin in a noc/poller scenario. Therefore I
> have installed two Nagios 3.2.0 servers (nagios-a and nagios-b) with
> sample-configs (make install-config). For testing purposes I created two
> hosts (server-a and server-b) and put each of those hosts in separate
> hostgroups (server-group-a and server-group-b).
>
> Now I want
> - nagios-b to check server-a & server-b.
> - Nagios server 'nagios-a' should only check server-a.
> - Nagios server 'nagios-a' should only receive passive check results for
> server-b from nagios-b:
>
So basically server-a will be monitored by two Nagios instances. I'm
with you so far.
>
> ------NOC----------   send pasv results   ------POLLER-------
> |    nagios-a     | <<------------------- |    nagios-b     |
> -------------------       server-b        -------------------
>    |                                         |
>    |-- check & display server-a              |-- check & display server-a
>    |-- display server-b                      |-- check & display server-b
>
> (see below for nagios / merlin config files)
>
> With merlin-0.6.6.tar.gz I can't get this done. I see a very strange
> behaviour in the neb.log:
>
> ## 'server-b' is a member of 'server-group-b', so can't add to poller for 'server-group-b'
> yes, it is a member of server-group-b.
>
> ## 'server-group-b' is a selection without hosts. Are you sure you want this?
> no, it is not a selection without hosts (as stated one line before)
>
> Checks for hostgroup 'server-group-b' were not disabled on nagios-a. Both
> nagios-servers were still checking both hostgroups (and therefore both servers)
>
Right. This is a known bug, and it lives on because I haven't had time to
fully investigate it. The patches living in 'next' take care of it, but
I can't fully merge them yet since we're (sadly) not focusing on making
distributed checks work well yet. There's a quite simple patch one can
apply to get things to a working state. I'll make sure to get that done
today, so try 'next' on Monday and it should work better.
>
> With merlin checked out from git (and switched to 'next') things were also a
> little bit strange.
>
> The neb.log on nagios-a looked like this:
>
> ## [1259848914] 6: Reaping ipc events
> ## [1259848914] 6: Received control packet code 3 for selection 'server-group-b'
> ## [1259848914] 6: Disabling active checks for hosts in hostgroup 'server-group-b'
> ## [1259848914] 6: Disabling active checks for services of hosts in hostgroup 'server-group-b'
> ## [1259848914] 6: Received control packet code 2 for selection 'server-group-b'
> ## [1259848914] 6: Enabling active checks for hosts in hostgroup 'server-group-b'
> ## [1259848914] 6: Enabling active checks for services of hosts in hostgroup 'server-group-b'
> ## [1259848914] 6: Received control packet code 3 for selection 'server-group-b'
> ## [1259848914] 6: Disabling active checks for hosts in hostgroup 'server-group-b'
>
This is clearly weird.
> Now 'server-group-b' was disabled on nagios-server 'a'. This looked
> promising but the nagios daemon died after a few seconds.
>
Yes, that happens for some reason when a host or service state event is
received by the module. I'll have to disable that and go back to making
it only send check-results instead, which worked just fine.
>
> Furthermore I'd say that the alternating reception of "code 3" and "code 2"
> packets looks fishy.
>
Indeed.
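For anyone who wants to check their own neb.log for the same back-and-forth, here is a minimal Python sketch. It is not part of Merlin. The regular expression is modeled only on the log lines quoted above, so the real log format may differ, and the names find_flapping and window are made up for illustration.

```python
import re
from collections import defaultdict

# Pattern modeled on the quoted log lines only; an assumption, not a
# documented Merlin log format.
CTRL_RE = re.compile(
    r"\[(?P<ts>\d+)\]\s+\d+:\s+Received control packet code (?P<code>\d+) "
    r"for selection '(?P<sel>[^']+)'"
)

def find_flapping(lines, window=1):
    """Return {selection: [(timestamp, code), ...]} for every selection that
    received two different control codes within `window` seconds."""
    events = defaultdict(list)
    for line in lines:
        m = CTRL_RE.search(line)
        if m:
            events[m.group("sel")].append((int(m.group("ts")), int(m.group("code"))))
    flapping = {}
    for sel, seq in events.items():
        # Compare each event with the one that follows it.
        for (t1, c1), (t2, c2) in zip(seq, seq[1:]):
            if c1 != c2 and t2 - t1 <= window:
                flapping[sel] = seq
                break
    return flapping

sample = [
    "[1259848914] 6: Received control packet code 3 for selection 'server-group-b'",
    "[1259848914] 6: Received control packet code 2 for selection 'server-group-b'",
    "[1259848914] 6: Received control packet code 3 for selection 'server-group-b'",
]
print(find_flapping(sample))
```

To run it against a real file, pass the open file object instead of the sample list, e.g. find_flapping(open(path)) with path pointing at your own neb.log.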
> The neb.log on nagios-server 'b':
>
> ## [1259848862] 6: Merlin Module Loaded
> ## [1259848862] 6: setting connect and disconnect handlers
> ## [1259848862] 6: Coredumps in /root
> ## [1259848862] 6: Merlin module v0.6.6p18-g26db1b7e41ae initialized successfully
> ## [1259848902] 6: Object configuration parsed.
> ## [1259848902] 6: Creating hash tables
> ## [1259848902] 6: Attempting ipc connect
> ## [1259848902] 6: Shoutcasting active status through IPC socket
> ## [1259848902] 6: Running on_connect hook for module
> ## [1259848902] 6: ipc successfully connected
> ## [1259848902] 6: Reaping ipc events
> ## [1259848902] 4: Handled 0 'ipc' events in 0.043 seconds in: 0, out: 0
> ## [1259848902] 6: Scheduling next ipc reaping at 1259848907
> ## [1259848902] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848902] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848907] 6: Reaping ipc events
> ## [1259848907] 4: Handled 0 'ipc' events in 4.335 seconds in: 0, out: 0
> ## [1259848907] 6: Scheduling next ipc reaping at 1259848912
> ## [1259848908] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848912] 7: Processing callback NEBCALLBACK_HOST_STATUS_DATA
> ## [1259848912] 4: key 'server-b' doesn't match any possible selection
> ## [1259848912] 7: Processing callback NEBCALLBACK_HOST_STATUS_DATA
> ## [1259848912] 4: key 'server-b' doesn't match any possible selection
> ## [1259848912] 6: Reaping ipc events
> ## [1259848912] 4: Handled 0 'ipc' events in 9.404 seconds in: 0, out: 0
> ## [1259848912] 6: Scheduling next ipc reaping at 1259848917
> ## [1259848912] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848914] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848917] 6: Reaping ipc events
> ## [1259848917] 4: Handled 0 'ipc' events in 14.519 seconds in: 0, out: 0
> ## [1259848917] 6: Scheduling next ipc reaping at 1259848922
> ## [1259848920] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848922] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848922] 6: Reaping ipc events
> ## [1259848922] 4: Handled 0 'ipc' events in 19.319 seconds in: 0, out: 0
> ## [1259848922] 6: Scheduling next ipc reaping at 1259848927
> ## [1259848926] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848927] 6: Reaping ipc events
> ## [1259848927] 4: Handled 0 'ipc' events in 24.400 seconds in: 0, out: 0
> ## [1259848927] 6: Scheduling next ipc reaping at 1259848932
> ## [1259848932] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848932] 7: Processing callback NEBCALLBACK_SERVICE_STATUS_DATA
> ## [1259848932] 4: key 'server-b' doesn't match any possible selection
> ## [1259848932] 7: Processing callback NEBCALLBACK_SERVICE_STATUS_DATA
> ## [1259848932] 4: key 'server-b' doesn't match any possible selection
> ## [1259848932] 6: Reaping ipc events
> ## [1259848932] 4: Handled 0 'ipc' events in 29.657 seconds in: 0, out: 0
> ## [1259848932] 6: Scheduling next ipc reaping at 1259848937
> ## [1259848932] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848937] 6: Reaping ipc events
> ## [1259848937] 4: Handled 0 'ipc' events in 34.452 seconds in: 0, out: 0
> ## [1259848937] 6: Scheduling next ipc reaping at 1259848942
> ## [1259848938] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848942] 6: Reaping ipc events
> ## [1259848942] 4: Handled 0 'ipc' events in 39.496 seconds in: 0, out: 0
> ## [1259848942] 6: Scheduling next ipc reaping at 1259848947
> ## [1259848942] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848944] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848947] 6: Reaping ipc events
> ## [1259848947] 4: Handled 0 'ipc' events in 44.334 seconds in: 0, out: 0
> ## [1259848947] 6: Scheduling next ipc reaping at 1259848952
> ## [1259848950] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848952] 7: Processing callback NEBCALLBACK_SERVICE_STATUS_DATA
> ## [1259848952] 4: key 'localhost' doesn't match any possible selection
> ## [1259848952] 7: Processing callback NEBCALLBACK_SERVICE_STATUS_DATA
> ## [1259848952] 4: key 'localhost' doesn't match any possible selection
> ## [1259848952] 6: Reaping ipc events
> ## [1259848952] 4: Handled 0 'ipc' events in 49.431 seconds in: 0, out: 0
> ## [1259848952] 6: Scheduling next ipc reaping at 1259848957
> ## [1259848952] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848956] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848957] 6: Reaping ipc events
> ## [1259848957] 4: Handled 0 'ipc' events in 54.492 seconds in: 0, out: 0
> ## [1259848957] 6: Scheduling next ipc reaping at 1259848962
> ## [1259848962] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848962] 6: Reaping ipc events
> ## [1259848962] 4: Handled 0 'ipc' events in 59.349 seconds in: 0, out: 0
> ## [1259848962] 6: Scheduling next ipc reaping at 1259848967
> ## [1259848962] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848967] 6: Reaping ipc events
> ## [1259848967] 4: Handled 0 'ipc' events in 64.410 seconds in: 0, out: 0
> ## [1259848967] 6: Scheduling next ipc reaping at 1259848972
> ## [1259848968] 7: Processing callback NEBCALLBACK_PROGRAM_STATUS_DATA
> ## [1259848972] 7: Processing callback NEBCALLBACK_HOST_STATUS_DATA
> ## [1259848972] 4: key 'localhost' doesn't match any possible selection
> ## [1259848972] 7: Processing callback NEBCALLBACK_HOST_STATUS_DATA
> ## [1259848972] 4: key 'localhost' doesn't match any possible selection
> ## [1259848972] 7: Processing callback NEBCALLBACK_SERVICE_STATUS_DATA
> ## [1259848972] 4: key 'localhost' doesn't match any possible selection
> ## [1259848972] 7: Processing callback NEBCALLBACK_SERVICE_STATUS_DATA
> ## [1259848972] 4: key 'localhost' doesn't match any possible selection
> ## [1259848972] 6: Reaping ipc events
>
> The suspicious line here was:
> ## key 'server-b' doesn't match any possible selection
>
> What does that mean?
>
It means it has no notion that it should send results for that host
anywhere, so it doesn't forward it to the daemon. According to your
configuration, that's not what should've happened, is it?
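To illustrate what that message implies, here is a rough Python sketch of a host-to-selection lookup. This is not Merlin's actual code (the module is written in C); the function names build_host_table and route and the dictionary layout are assumptions made purely for illustration.

```python
def build_host_table(hostgroup_members, selected_hostgroups):
    """Map each host name to the selection (hostgroup) it belongs to.

    hostgroup_members: {hostgroup_name: [host_name, ...]} (hypothetical layout)
    selected_hostgroups: hostgroup names that are configured as selections
    """
    table = {}
    for group in selected_hostgroups:
        for host in hostgroup_members.get(group, ()):
            table[host] = group
    return table

def route(host, table):
    """Describe what would happen to an event for `host`."""
    selection = table.get(host)
    if selection is None:
        # No selection knows this host, so the event is not forwarded.
        return f"key '{host}' doesn't match any possible selection"
    return f"forward '{host}' to selection '{selection}'"

members = {"server-group-a": ["server-a"], "server-group-b": ["server-b"]}
table = build_host_table(members, ["server-group-b"])
print(route("server-b", table))
print(route("localhost", table))
```

In this toy model, a host that is missing from the table produces the same kind of message as in the log above, which is what you would see if the hostgroup members never made it into the lookup table.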
> Another thing that puzzles me is the option to create binary logs
> (ipc_debug_read/write). Is this supposed to work? I can't find those
> logs...
>
That option has been removed, but there's no deprecation warning for it.
I'll make sure to add one ASAP.
> I have no clue what I'm doing so terribly wrong here.. I really hope
> you guys can put me in the right direction :)
>
You aren't doing anything wrong at all. It's Merlin that's still a bit
buggy. Seeing the patches in next at work was good though. I'll merge
the hostgroup assignment ones to 'master' and see what I can do about
the CTRL_ACTIVE/CTRL_INACTIVE dance (the received control codes you saw).
I'll also make sure to stop forwarding host and service state events to
the module and re-enable sending host and service check results instead,
since that actually worked without crashing Nagios.
This shouldn't take long, but will probably be in 'next' rather than
'master'.
Oh, and thanks for the config :)
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.