[op5-users] Nagios/Merlin NEB-related interaction bug?
Sean Millichamp
sean at bruenor.org
Fri Oct 30 21:44:50 CET 2009
I have been chasing a very elusive (but very consistent) segfault for a
couple of days now, and after countless hours stepping through in
gdb/ddd I think I have finally found it. I have been doing all of my
testing on Nagios 3.2.0 with the latest Merlin from git with the patch I
submitted to fix the hash table handling. The patch is required (in my
setup) to trigger the crash, but the bug that causes the crash isn't
related to the patch.
I have compiled everything with -O0 to assist in debugging. I would
very much appreciate a review of this analysis and see if someone can
find flaw with it.
The scenario:
1) Nagios starts up. It does its usual thing and eventually registers
Merlin as an event broker module. During this time it calls
module.c:nebmodule_init() as the first time execution is passed to
Merlin.
2) In nebmodule_init() Merlin does some basic initialization stuff and
calls neb_register_callback (line 412) as so:
neb_register_callback(NEBCALLBACK_PROCESS_DATA, neb_handle, 0, post_config_init);
3) Back on the Nagios side, flow moves to neb_register_callback. Since
this is the first (and only) registration for the
NEBCALLBACK_PROCESS_DATA callback it malloc()s a new nebcallback
structure and places the pointer to that structure in
neb_callback_list[7], which was previously NULL. This completes
neb_make_callbacks()'s duties and flow returns to Merlin.
4) Merlin returns from the nebmodule_init() function and Nagios resumes
like normal.
5) At some point shortly thereafter, a number of
NEBCALLBACK_PROCESS_DATA events are generated. Nagios calls
neb_make_callbacks() like it does for all callback-eligible events.
The program flow moves into module.c:post_config_init(), but quickly
returns to Nagios due to this check not being satisfied:
if (*(int *)ds != NEBTYPE_PROCESS_EVENTLOOPSTART)
return 0;
6) Eventually, another NEBCALLBACK_PROCESS_DATA event occurs. This one
is of more interest to us. Nagios calls neb_make_callbacks(), like it
does for every event. However, this one is a little different. This
time the event data matches NEBTYPE_PROCESS_EVENTLOOPSTART. So,
post_config_init() proceeds to do its additional initialization. One of
these acts is to:
neb_deregister_callback(NEBCALLBACK_PROCESS_DATA, post_config_init);
7) neb_deregister_callback() takes control. The neb_callback_list[]
array is an array with an index for each callback type. Each of those
indexes contains either a NULL (representing no callbacks registered),
or a linked list of one or more entries that contains the information
required to perform the callback and maintain the linked list structure.
As part of the de-registering process, the linked list for
NEBCALLBACK_PROCESS_DATA is searched. It only has one item, so we find
the first item as a match for post_config_init(). The function sets
neb_callback_list[7] = NULL, indicating there are no more callbacks for
this event type AND then frees the callback event structure that had
contained the required callback information. Its duties done,
neb_deregister_callback() returns back to post_config_init().
8) post_config_init() proceeds to perform a number of additional
initialization, including numerous malloc()s to allocate the hash and
list structures that Merlin needs. Its job done, post_config_init()
returns a success value. Control passes back to neb_make_callbacks()
9) Now, neb_make_callbacks() has to pick back up where it left off, here
is a shortened snippet of the code it is running:
for(temp_callback=neb_callback_list[callback_type];temp_callback!=NULL;temp_callback=temp_callback->next){
callbackfunc=temp_callback->callback_func;
cbresult=callbackfunc(callback_type,data);
}
callbackfunc() points to our post_config_init() function, so the call
flow returns to the end of the for loop. The very next thing it does
is: temp_callback=temp_callback->next. This is where the problem is.
temp_callback is pointing to the nebcallback structure that had
previously been free()d - and then overwritten by all of the subsequent
malloc()s in post_config_init()). temp_callback->next is no longer a
valid pointer, but is assigned to temp_callback anyway.
temp_callback isn't NULL (as it should have been), so the for loop
proceeds, and then tries to access temp_callback->callback_func.
temp_callback doesn't point to valid memory and a segfault is
immediately triggered.
I believe that the reason I didn't see this until I fixed the hash
table/selection handling was that after being fixed
setup_host_hash_tables() and create_object_lists() began allocating and
using more memory - one of the blocks that they were assigned must have
been the free()d nebcallback structure.
At a first pass, the solution seems to be to either re-engineer Merlin's
initialization sequence to not have to do the register/deregister dance
or requires a Nagios fix to handle the situation where a deregister was
called while in the callback function being deregistered. I haven't
spent a lot of time thinking about a fix yet. I think I'm done for the
weekend :)
Hopefully this makes sense and helps someone else trace what I did.
I'll be happy to answer any questions that I can.
Thanks,
Sean
More information about the op5-users
mailing list