[op5-users] Merlin testing and troubleshooting
Sean Millichamp
sean at bruenor.org
Tue Oct 27 18:12:20 CET 2009
Hello everyone,
I am currently testing and evaluating Merlin for a multiple site
NOC/poller deployment, with eventual redundancy/load sharing via
peering. It seems very promising and we are excited about the
possibility of deploying it.
I started with the latest version from git and, after hitting the
same failed poller->noc status updates described in
http://lists.op5.com/pipermail/op5-users/2009-October/000524.html, I
reverted to v0.6.2-beta4.
I then restarted testing from scratch with two simple configurations,
each of which showed problems.
1) The first was a two node peer configuration. Each host had identical
Nagios configurations (simple active ICMP service tests against a couple
of hosts) and identical Merlin configurations other than each "peer"
entry referencing the other node. In this configuration both hosts were
exchanging updates over the network, and successfully updating the local
databases, which always showed in sync. However, both peers were still
running all checks. If all peers continue to run all checks, this doesn't
seem to fulfill the load-sharing design aspect of Merlin peers.
According to Andreas Ericsson in
http://lists.op5.com/pipermail/op5-users/2009-August/000322.html
"Peers should schedule the same checks, but once one peer has executed
one check the other peer will reschedule it to the normal check interval
+ 15 seconds, which ensures that the second peer will only take over the
check if the first peer has a check latency of 15 seconds or more."
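To make the quoted behaviour concrete, here is a minimal sketch in Python of how I read it. This is only an illustration of the arithmetic, not Merlin's actual code; the function names are made up for this example, and the only value taken from the quote is the 15-second offset.

```python
# Offset taken from the quoted description; everything else is hypothetical.
PEER_GRACE_SECONDS = 15


def next_check_after_own_run(executed_at, check_interval):
    """Next check time on the peer that actually executed the check."""
    return executed_at + check_interval


def next_check_after_peer_run(executed_at, check_interval,
                              grace=PEER_GRACE_SECONDS):
    """Next check time on the other peer, once it learns of the result.

    It is pushed out by the normal interval plus the grace period, so it
    only fires if the first peer is late by at least `grace` seconds.
    """
    return executed_at + check_interval + grace


def peer_takes_over(first_peer_latency, grace=PEER_GRACE_SECONDS):
    """True if the first peer is late enough for the second to run the check."""
    return first_peer_latency >= grace
```

With a 300-second interval and a check executed at time t, I would therefore expect next_check to be t + 300 on one peer and t + 315 on the other.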
I don't see any evidence that either peer perpetually defers any of the
checks, nor that the check is adjusted to be delayed versus the
scheduled time on the other host (based on the next_check time in the
respective databases - maybe not the right thing to be watching).
Is this functionality expected to be working right now? The message
from August seems to indicate so, but I haven't been able to achieve it.
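For reference, this is roughly the comparison I have been doing by eye, written out as a small Python sketch. It assumes the next_check values for each service have already been copied out of the two databases into dictionaries by hand; the key format, the function name and the tolerance are my own choices, and nothing here depends on the Merlin database schema.

```python
def deferral_report(next_check_a, next_check_b, grace=15, tolerance=2):
    """Compare next_check values taken from two peers.

    Both arguments map a check name to a next_check timestamp (seconds).
    For every check present on both peers, return a tuple of
    (offset in seconds, whether the offset looks like a deferral).
    An offset counts as a deferral if its magnitude is within
    `tolerance` seconds of `grace`, regardless of which peer is ahead.
    """
    report = {}
    for check in sorted(set(next_check_a) & set(next_check_b)):
        offset = next_check_b[check] - next_check_a[check]
        looks_deferred = abs(abs(offset) - grace) <= tolerance
        report[check] = (offset, looks_deferred)
    return report
```

If the deferral were working as described, I would expect most checks to show an offset of about 15 seconds in one direction or the other. A single snapshot may of course be misleading, which is why next_check may not be the right thing to watch.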
2) I then moved on to a NOC<->poller relationship. I took the same two
hosts and configured one node as a NOC and one as the poller. Two of
my three ICMP service checks were in the hostgroup assigned to the
poller, and one was in a hostgroup left unassigned on the NOC. Based on
some documentation, I trimmed the configuration on the poller to contain
only the hosts/services in the hostgroup assigned to it. I left the
full configuration on the NOC, expecting that it would disable the
checks that the poller was responsible for.
The poller continued to check its 2 hosts and the NOC continued to check
the poller's 2 hosts and its 1 host. I saw no sign that the checks were
disabled. I did see some odd-looking messages logged by the event
broker on the NOC that might relate to this:
[1256658948] 4: 'lab-nagdl02' is a member of 'hostgroup2', so can't add to poller for 'hostgroup2'
[1256658948] 4: 'lab-nagpv01' is a member of 'hostgroup2', so can't add to poller for 'hostgroup2'
[1256658948] 4: 'hostgroup2' is a selection without hosts. Are you sure you want this?
hostgroup2 is the hostgroup configured to be checked on the poller node
and the listed hosts are the two members of hostgroup2.
NOC's merlin.conf node entry:
poller lab-nagdl02 {
    address = 10.211.54.6;
    port = 15551;
    hostgroup = hostgroup2;
}
Poller's merlin.conf node entry:
noc lab-nagdl04 {
    address = 10.211.54.8;
    port = 15551;
}
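To spell out what I expected from this configuration, here is a small Python sketch of the check ownership I had in mind. It only models my expectation and is not how Merlin implements the selection; the function name is invented, and "hostgroup1" and "host-a" are placeholders for the unassigned hostgroup and its single host.

```python
def expected_check_owner(hostgroup_members, poller_hostgroups):
    """Map each host to the node expected to run its checks.

    hostgroup_members: hostgroup name -> list of member hosts.
    poller_hostgroups: hostgroup name -> poller assigned to that hostgroup.
    A host in any poller-assigned hostgroup belongs to that poller;
    every other host stays on the NOC.
    """
    owner = {}
    for hostgroup, hosts in hostgroup_members.items():
        poller = poller_hostgroups.get(hostgroup)
        for host in hosts:
            if poller is not None:
                owner[host] = poller          # poller assignment wins
            else:
                owner.setdefault(host, "noc")  # default: NOC keeps the check
    return owner


# Hostgroup2 and its members come from the log above; the rest is placeholder.
hostgroup_members = {
    "hostgroup1": ["host-a"],
    "hostgroup2": ["lab-nagdl02", "lab-nagpv01"],
}
poller_hostgroups = {"hostgroup2": "lab-nagdl02"}

owners = expected_check_owner(hostgroup_members, poller_hostgroups)
```

In other words, I expected the NOC to keep only the check for the one host outside hostgroup2 and to leave the two hostgroup2 members to the poller.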
Any ideas? From reading the documentation and the posts on the mailing
list, it feels like this should work and disable the checks properly,
but I can't figure out what I might be doing wrong. I have spent some
time reading the code, but I am still struggling with understanding the
flow of things and haven't been able to clearly locate the areas of the
code responsible for these behaviors.
Thanks in advance,
Sean