[op5-users] [PATCH] NOC/poller check disabling fixes
Sean Millichamp
sean at bruenor.org
Wed Oct 28 15:25:34 CET 2009
I've looked into the mechanisms for the NOC disabling checks handled by
hostgroups associated with connected pollers. While working on this I
have come across some apparent bugs, based on my limited understanding
of what Merlin is trying to do.
The attached patch fixes what appears to be a typo in a call to
handle_control() where pkt.hdr.len is passed into the function instead
of pkt.hdr.code. Since len currently seems to always be 0 for a
control packet this prevented handle_control from ever doing anything.
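To make the effect of the typo concrete, here is a minimal sketch. Only
the field names pkt.hdr.len and pkt.hdr.code and the function name
handle_control come from the code discussed above; the struct layout,
the CTRL_* constants, the handler body and the two dispatch_* wrappers
are hypothetical stand-ins, not Merlin's actual definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins: NOT Merlin's real definitions. */
enum { CTRL_NONE = 0, CTRL_ACTIVE = 1, CTRL_INACTIVE = 2 };

struct pkt_header {
    int code;   /* control code: what the peer is announcing */
    int len;    /* payload length: 0 for a control packet */
};

struct packet {
    struct pkt_header hdr;
};

/* Returns 1 if the control code was acted on, 0 if it was ignored. */
static int handle_control(int code)
{
    switch (code) {
    case CTRL_ACTIVE:
    case CTRL_INACTIVE:
        return 1;   /* a real handler would toggle checks here */
    default:
        return 0;   /* unknown or zero code: nothing to do */
    }
}

/* Buggy call site: passes the payload length, which is 0. */
static int dispatch_buggy(const struct packet *pkt)
{
    assert(pkt != NULL);
    return handle_control(pkt->hdr.len);
}

/* Fixed call site: passes the control code. */
static int dispatch_fixed(const struct packet *pkt)
{
    assert(pkt != NULL);
    return handle_control(pkt->hdr.code);
}
```

With a control packet whose code is CTRL_INACTIVE and whose len is 0,
dispatch_buggy() falls through to the default case and does nothing,
while dispatch_fixed() reaches the handler.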
Once I fixed the typo addressed in the patch I found what appears to be
a more subtle bug that I haven't come up with a good way of addressing
yet. I thought I had a simple fix, but implementing it causes Nagios to
fairly immediately core dump (for no reason I have been able to
ascertain yet). Here is the problem:
When a Nagios NOC starts up it looks for poller entries and adds all
"hostgroup" entries it finds to a "selection" array (via add_selection
in module.c:slurp_selection()). Each hostgroup then becomes associated
with a selection index starting at 0 and increasing for the number of
hostgroups found.
Then it proceeds to module.c:setup_host_hash_tables() where it cycles
through all known hostgroups and, if the hostgroup is one that was added
to the selection index (get_sel_id returns an id >= 0), adds the member
hosts to the hash_table:
hash_add(host_hash_table, m->host_name, (void *)id);
where "id" is the hostgroup selection index (0 through nsel-1).
Now, here is the crux of the problem: When the host is later retrieved
with hash_find_val(hostname) it can return a selection index in the
range of 0 to (nsel-1) (as stored in setup_host_hash_tables()) OR, if
the host was never added (because it wasn't in a poller-owned hostgroup)
it returns NULL (line 124 in hash.c:find_hash()). NULL converts to 0
when treated as an index, so any host that was never added is
indistinguishable from a member of the first poller hostgroup (index 0)
when create_{host,service}_lists() are called in control.c.
The result is that when the poller which is responsible for the
hostgroup in the index 0 slot connects, all of the hosts associated with
that hostgroup AND all of the unassociated hosts have their checks
disabled.
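The ambiguity can be reproduced outside Merlin with a toy lookup table.
Everything in this sketch is hypothetical (the toy_* helpers, the array
based table, the sel_id_* functions); it is not Merlin's hash.c and it
is not the fix I tried. It only shows that a stored index of 0 and a
NULL "not found" result collapse to the same value, and that one
possible workaround, assuming the stored value is only ever used as an
index, is to store id + 1 so that 0 is reserved for "not found".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for the host hash table; NOT Merlin's hash.c. */
#define MAX_HOSTS 16

struct entry {
    const char *key;
    void *val;
};

static struct entry table[MAX_HOSTS];
static size_t n_entries;

static void toy_add(const char *key, void *val)
{
    assert(n_entries < MAX_HOSTS);
    table[n_entries].key = key;
    table[n_entries].val = val;
    n_entries++;
}

/* Like the lookup described above: returns NULL when the key is absent. */
static void *toy_find_val(const char *key)
{
    size_t i;

    for (i = 0; i < n_entries; i++) {
        if (strcmp(table[i].key, key) == 0)
            return table[i].val;
    }
    return NULL;
}

/* Ambiguous: a missing host and a host in selection 0 both yield 0. */
static int sel_id_ambiguous(const char *host)
{
    return (int)(intptr_t)toy_find_val(host);
}

/* Possible workaround: store id + 1, so 0 (NULL) means "not found". */
static void add_host_offset(const char *host, int id)
{
    toy_add(host, (void *)(intptr_t)(id + 1));
}

/* Returns the selection id, or -1 if the host was never added. */
static int sel_id_offset(const char *host)
{
    return (int)(intptr_t)toy_find_val(host) - 1;
}
```

A host stored with a raw index of 0 and a host that was never stored
both come back as 0 from sel_id_ambiguous(). With the offset scheme, a
host added via add_host_offset(host, 0) comes back as 0 from
sel_id_offset(), while an unknown host comes back as -1, so callers can
skip it.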
I hope this explanation makes sense. It took me quite a while to
unravel what the code was trying to do and I've had to infer a lot of
the expected behavior.
Still working on this, but I'm hoping one of the developers can chime in
and at least tell me if I am on the right track here.
Thanks!
Sean
-------------- next part --------------
A non-text attachment was scrubbed...
Name: merlin-handle_control-typo.patch
Type: text/x-patch
Size: 479 bytes
Desc: not available
Url : http://lists.op5.com/pipermail/op5-users/attachments/20091028/d5c20847/attachment.bin