[op5-users] Merlin testing and troubleshooting
Russell Jennings
russ at geekwhiz.com
Tue Oct 27 22:23:12 CET 2009
Sean,
An important point that you may have missed (as I once did): in a
NOC/poller setup, while both nodes need to have the same configs, it
works more like passive check submission. That is, Merlin just needs
there to be a valid entry (on the NOC) for the host/service; the actual
check execution should not happen there. For those remote hosts that the
poller will submit data for, I have set them up on the NOC as passive
checks, since Nagios itself cannot actively check them.
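
As a rough illustration of that (a sketch only, not copied from a real
config: the host name, template name and check arguments are made up),
a passive-only service entry on the NOC might look something like the
block below. active_checks_enabled and passive_checks_enabled are
standard Nagios object directives; whether your Merlin version needs
anything beyond this is an assumption you should verify.

```
# Hypothetical NOC-side definition for a service that the poller checks.
define service {
    use                     generic-service   ; assumed existing template
    host_name               remote-host01     ; hypothetical remote host
    service_description     PING
    check_command           check_ping!100.0,20%!500.0,60%   ; placeholder, same as on the poller
    active_checks_enabled   0    ; the NOC does not execute this check itself
    passive_checks_enabled  1    ; results are accepted when submitted from elsewhere
}
```

The entry stays valid and identical in name to the poller's copy, so
incoming results have a host/service to land on, but the NOC's own
scheduler leaves it alone.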
Hope that is in some way helpful to you...
thanks,
Russell
On Oct 27, 2009, at 1:12 PM, Sean Millichamp wrote:
> Hello everyone,
>
> I am currently testing and evaluating Merlin for a multiple site
> NOC/poller deployment, with eventual redundancy/load sharing via
> peering. It seems very promising and we are excited about the
> possibility of deploying it.
>
> I started with the latest version from git and, after encountering the
> same problems about failed poller->noc status updates described in
> http://lists.op5.com/pipermail/op5-users/2009-October/000524.html I
> reverted to v0.6.2-beta4.
>
> I then restarted testing from scratch with two simple configurations,
> each of which I've seen problems with.
>
> 1) The first was a two node peer configuration. Each host had identical
> Nagios configurations (simple active ICMP service tests against a couple
> of hosts) and identical Merlin configurations other than each "peer"
> entry referencing the other node. In this configuration both hosts were
> exchanging updates over the network, and successfully updating the local
> databases which always showed in sync. However, both peers were running
> all checks still. If all peers continue to run all checks this doesn't
> seem like it would fulfill the load-sharing design aspect of Merlin
> peers.
>
> According to Andreas Ericsson in
> http://lists.op5.com/pipermail/op5-users/2009-August/000322.html
>
> "Peers should schedule the same checks, but once one peer has executed
> one check the other peer will reschedule it to the normal check interval
> + 15 seconds, which ensures that the second peer will only take over the
> check if the first peer has a check latency of 15 seconds or more."
>
> I don't see any evidence that either peer perpetually defers any of the
> checks, nor that the check is adjusted to be delayed versus the
> scheduled time on the other host (based on the next_check time in the
> respective databases - maybe not the right thing to be watching).
>
> Is this functionality expected to be working right now? The message
> from August seems to indicate so, but I haven't been able to achieve it.
>
> 2) I then moved on to a NOC<->poller relationship. I took the same two
> hosts, and configured one node as a noc and one as the poller. Two of
> my three ICMP service checks were in the hostgroup assigned to the
> poller, and one was in a hostgroup left unassigned on the NOC. Based on
> some documentation, I trimmed the configuration on the poller to contain
> only the hosts/services in the hostgroup assigned to it. I left the
> full configuration on the NOC, expecting that it would disable the
> checks that the poller was responsible for.
>
> The poller continued to check its 2 hosts and the NOC continued to check
> the poller's 2 hosts and its 1 host. I saw no sign that the checks were
> disabled. I did see an odd looking log message logged by the event
> broker on the NOC that might relate to this:
>
> [1256658948] 4: 'lab-nagdl02' is a member of 'hostgroup2', so can't add to poller for 'hostgroup2'
> [1256658948] 4: 'lab-nagpv01' is a member of 'hostgroup2', so can't add to poller for 'hostgroup2'
> [1256658948] 4: 'hostgroup2' is a selection without hosts. Are you sure you want this?
>
> hostgroup2 is the hostgroup configured to be checked on the poller node
> and the listed hosts are the two members of hostgroup2.
>
> NOC's merlin.conf node entry:
>
> poller lab-nagdl02 {
>     address = 10.211.54.6;
>     port = 15551;
>     hostgroup = hostgroup2;
> }
>
> Poller's merlin.conf node entry:
>
> noc lab-nagdl04 {
>     address = 10.211.54.8;
>     port = 15551;
> }
>
> Any ideas? From reading the documentation and the posts on the mailing
> list, it feels like this should work and disable the checks properly,
> but I can't figure out what I might be doing wrong. I have spent some
> time reading the code, but I am still struggling with understanding the
> flow of things and haven't been able to clearly locate the areas of the
> code responsible for these behaviors.
>
> Thanks in advance,
> Sean
>
>
> _______________________________________________
> op5-users mailing list
> op5-users at lists.op5.com
> http://lists.op5.com/mailman/listinfo/op5-users