[op5-users] Nagios/Merlin NEB-related interaction bug?

Sean Millichamp sean at bruenor.org
Fri Oct 30 21:44:50 CET 2009


I have been chasing a very elusive (but very consistent) segfault for a
couple of days now, and after countless hours stepping through in
gdb/ddd I think I have finally found it.  I have been doing all of my
testing on Nagios 3.2.0 with the latest Merlin from git with the patch I
submitted to fix the hash table handling.  The patch is required (in my
setup) to trigger the crash, but the bug that causes the crash isn't
related to the patch.

I have compiled everything with -O0 to assist in debugging.  I would
very much appreciate a review of this analysis and see if someone can
find flaw with it.

The scenario:

1) Nagios starts up.  It does its usual thing and eventually registers
Merlin as an event broker module.  During this time it calls
module.c:nebmodule_init() as the first time execution is passed to
Merlin.

2) In nebmodule_init() Merlin does some basic initialization stuff and
calls neb_register_callback (line 412) as so:
	neb_register_callback(NEBCALLBACK_PROCESS_DATA, neb_handle, 0, post_config_init);

3) Back on the Nagios side, flow moves to neb_register_callback.  Since
this is the first (and only) registration for the
NEBCALLBACK_PROCESS_DATA callback it malloc()s a new nebcallback
structure and places the pointer to that structure in
neb_callback_list[7], which was previously NULL.  This completes
neb_make_callbacks()'s duties and flow returns to Merlin.

4) Merlin returns from the nebmodule_init() function and Nagios resumes
like normal.

5) At some point shortly thereafter, a number of
NEBCALLBACK_PROCESS_DATA events are generated.  Nagios calls
neb_make_callbacks() like it does for all callback-eligible events.  

The program flow moves into module.c:post_config_init(), but quickly
returns to Nagios due to this check not being satisfied:

 	if (*(int *)ds != NEBTYPE_PROCESS_EVENTLOOPSTART)
		return 0;

6) Eventually, another NEBCALLBACK_PROCESS_DATA event occurs.  This one
is of more interest to us.  Nagios calls neb_make_callbacks(), like it
does for every event.  However, this one is a little different.  This
time the event data matches NEBTYPE_PROCESS_EVENTLOOPSTART. So,
post_config_init() proceeds to do its additional initialization.  One of
these acts is to:

	neb_deregister_callback(NEBCALLBACK_PROCESS_DATA, post_config_init);

7) neb_deregister_callback() takes control.  The neb_callback_list[]
array is an array with an index for each callback type.  Each of those
indexes contains either a NULL (representing no callbacks registered),
or a linked list of one or more entries that contains the information
required to perform the callback and maintain the linked list structure.

As part of the de-registering process, the linked list for
NEBCALLBACK_PROCESS_DATA is searched.  It only has one item, so we find
the first item as a match for post_config_init().  The function sets
neb_callback_list[7] = NULL, indicating there are no more callbacks for
this event type AND then frees the callback event structure that had
contained the required callback information.  Its duties done,
neb_deregister_callback() returns back to post_config_init().  

8) post_config_init() proceeds to perform a number of additional
initialization, including numerous malloc()s to allocate the hash and
list structures that Merlin needs.  Its job done, post_config_init()
returns a success value.  Control passes back to neb_make_callbacks() 

9) Now, neb_make_callbacks() has to pick back up where it left off, here
is a shortened snippet of the code it is running:

	for(temp_callback=neb_callback_list[callback_type];temp_callback!=NULL;temp_callback=temp_callback->next){
		callbackfunc=temp_callback->callback_func;

		cbresult=callbackfunc(callback_type,data);
	}

callbackfunc() points to our post_config_init() function, so the call
flow returns to the end of the for loop.  The very next thing it does
is: temp_callback=temp_callback->next.  This is where the problem is.
temp_callback is pointing to the nebcallback structure that had
previously been free()d - and then overwritten by all of the subsequent
malloc()s in post_config_init()).  temp_callback->next is no longer a
valid pointer, but is assigned to temp_callback anyway.

temp_callback isn't NULL (as it should have been), so the for loop
proceeds, and then tries to access temp_callback->callback_func.
temp_callback doesn't point to valid memory and a segfault is
immediately triggered.

I believe that the reason I didn't see this until I fixed the hash
table/selection handling was that after being fixed
setup_host_hash_tables() and create_object_lists() began allocating and
using more memory - one of the blocks that they were assigned must have
been the free()d nebcallback structure.

At a first pass, the solution seems to be to either re-engineer Merlin's
initialization sequence to not have to do the register/deregister dance
or requires a Nagios fix to handle the situation where a deregister was
called while in the callback function being deregistered.  I haven't
spent a lot of time thinking about a fix yet.  I think I'm done for the
weekend :)

Hopefully this makes sense and helps someone else trace what I did.
I'll be happy to answer any questions that I can.

Thanks,
Sean




More information about the op5-users mailing list