[op5-users] [PATCH] NOC/poller check disabling fixes
Andreas Ericsson
ae at op5.se
Wed Oct 28 15:50:34 CET 2009
On 10/28/2009 03:25 PM, Sean Millichamp wrote:
> I've looked into the mechanisms for the NOC disabling checks handled by
> hostgroups associated with connected pollers. While working on this I
> have come across some apparent bugs, based on my limited understanding
> of what Merlin is trying to do.
>
> The attached patch fixes what appears to be a typo in a call to
> handle_control() where pkt.hdr.len is passed into the function instead
> of pkt.hdr.code. Since len currently seems to always be 0 for a
> control packet this prevented handle_control from ever doing anything.
>
Well spotted. I've applied a slightly different version of this patch,
letting handle_control() get the entire packet instead of just parts of
it, since I'll need to use the control packets to transfer configuration
files later, which means they have to handle the data too eventually.
I'll push this out momentarily. Many thanks.
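
For anyone following along, here is roughly what the difference looks
like. This is only a sketch: the struct layout, the sketch_* names and
the control codes are made up for illustration and are not the real
Merlin definitions.

```c
#include <stdint.h>

/* Hypothetical, simplified packet layout; not the real Merlin structs. */
struct sketch_hdr {
    uint16_t type;
    uint16_t code;   /* which control action is requested */
    uint32_t len;    /* body length; 0 for today's control packets */
};

struct sketch_pkt {
    struct sketch_hdr hdr;
    char body[64];
};

/* Hypothetical control codes, for illustration only. */
enum { SKETCH_CTRL_ACTIVE = 1, SKETCH_CTRL_INACTIVE = 2 };

/* Value-based shape: whatever the caller passes is treated as the code.
 * Passing hdr.len (always 0) means no branch ever matches. */
static int handle_control_by_value(int code)
{
    switch (code) {
    case SKETCH_CTRL_ACTIVE:   return 1;
    case SKETCH_CTRL_INACTIVE: return 2;
    }
    return 0; /* nothing done */
}

/* Whole-packet shape: the handler reads hdr.code itself, and could
 * later look at hdr.len and body too. */
static int handle_control_pkt(const struct sketch_pkt *pkt)
{
    return handle_control_by_value(pkt->hdr.code);
}
```

With a packet whose code is SKETCH_CTRL_INACTIVE and whose len is 0,
handle_control_by_value(pkt.hdr.len) does nothing, while
handle_control_pkt(&pkt) takes the intended branch.
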
> Once I fixed the typo addressed in the patch I found what appears to be
> a more subtle bug that I haven't come up with a good way of addressing
> yet. I thought I had a simple fix, but implementing it causes Nagios to
> fairly immediately core dump (for no reason I have been able to
> ascertain yet). Here is the problem:
>
> When a Nagios NOC starts up it looks for poller entries and adds all
> "hostgroup" entries it finds to a "selection" array (via add_selection
> in module.c:slurp_selection()). Each hostgroup then becomes associated
> with a selection index starting at 0 and increasing for the number of
> hostgroups found.
>
> Then it proceeds to module.c:setup_host_hash_tables() where it cycles
> through all known hostgroups and, if the hostgroup is one that was added
> to the selection index (get_sel_id returns an id >= 0) adds the member
> hosts to the hash_table:
> hash_add(host_hash_table, m->host_name, (void *)id);
> where "id" is the hostgroup selection index (0 through nsel-1)
>
> Now, here is the crux of the problem: When the host is later retrieved
> with hash_find_val(hostname) it can return a selection index in the
> range of 0 to (nsel-1) (as stored in setup_host_hash_tables()) OR, if
> the host was never added (because it wasn't in a poller-owned hostgroup)
> it returns NULL (line 124 in hash.c:find_hash()). NULL equates to 0 and
> so any host that was never added is treated as if it were a member of
> the first poller hostgroup (index 0) when create_{host,service}_lists()
> are called in control.c
>
Ah, right. Bollocks. I'll have to redo that, I guess :-/
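
To make the ambiguity concrete, here is a small sketch. The table and
the sketch_* functions are hypothetical stand-ins, not the code in
hash.c; the only thing carried over from the description above is that
the index is stored as a casted pointer and a missing key yields NULL.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for the host hash table: a tiny fixed array. */
struct sketch_entry {
    const char *key;
    void *val;
};

static struct sketch_entry sketch_table[] = {
    { "web1", (void *)(intptr_t)0 },  /* hostgroup selection index 0 */
    { "db1",  (void *)(intptr_t)1 },  /* hostgroup selection index 1 */
};

/* Returns the stored value, or NULL when the key is absent. */
static void *sketch_find_val(const char *key)
{
    size_t i;

    for (i = 0; i < sizeof(sketch_table) / sizeof(sketch_table[0]); i++) {
        if (!strcmp(sketch_table[i].key, key))
            return sketch_table[i].val;
    }
    return NULL;
}

/* Mirrors the cast back to an integer: NULL collapses to index 0. */
static int sketch_sel_id(const char *host)
{
    return (int)(intptr_t)sketch_find_val(host);
}
```

Here sketch_sel_id("web1") and sketch_sel_id("unowned") both come back
as 0, so the caller cannot tell a host in the first poller hostgroup
from one that was never added.
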
> The result is that when the poller which is responsible for the
> hostgroup in the index 0 slot connects, all of the hosts associated with
> that hostgroup AND all of the unassociated hosts have their checks
> disabled.
>
> I hope this explanation makes sense. It took me quite a while to
> unravel what the code was trying to do and I've had to infer a lot of
> the expected behavior.
>
> Still working on this, but I'm hoping one of the developers can chime in
> and at least tell me if I am on the right track here.
>
You're on the right track. The proper solution is to allocate an integer,
store the selection index in it, and stash a pointer to that in the
hash table instead of relying on the int-to-pointer juggling we're
currently doing. Incidentally, this will also get rid of some
compilation warnings on 64-bit systems.
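
Something along these lines, as a rough sketch only. The sketch_* helper
names and the use of -1 for "not owned by any poller" are assumptions
for illustration, not what will necessarily end up in the tree.

```c
#include <stdlib.h>

/* Allocate one int holding the selection index. The returned pointer
 * is what would be stored in the hash table; NULL on allocation failure. */
static int *sketch_sel_id_new(int id)
{
    int *p = malloc(sizeof(*p));

    if (p)
        *p = id;
    return p;
}

/* Translate a lookup result back to an index. A NULL result now
 * unambiguously means the host was never added (assumed: -1). */
static int sketch_sel_id_get(const void *val)
{
    return val ? *(const int *)val : -1;
}
```

A stored index 0 is now a valid non-NULL pointer to an int holding 0,
so sketch_sel_id_get() can tell it apart from a missing host, and no
integer is ever cast to or from a pointer.
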
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.