[op5-users] Merlin crashed on me?

Andreas Ericsson ae at op5.se
Wed Jul 1 23:52:03 CEST 2009


Frater, Greg J wrote:
>  
> Frater, Greg J wrote:
>>> I never get neb.log file, should I?  When I start nagios I see a 
>>> console message that says 'Starting nagios:Logging to 
>>> '/usr/local/nagios/merlin/logs/neb.log' but the log file never
>>> appears.
> 
>> This almost certainly has to do with directory permissions. You can
>> try, as root, doing
> 
>>   # chmod 777 /usr/local/nagios/merlin/logs
>>   # (restart nagios)
> 
>> and it should start working.
> 
> It did start working, the neb.log file is now being written, previously
> the permissions were set as follows:  
> 
> drwxr-xr-x 2 root root    4096 Jun 16 09:07 logs
> 
> 
>>> Ah, there's my crash, it dumped while I was writing this message.  
> 
>> Yes, we have a 64-bit system up and running now, but I still haven't
>> seen any crashes on it so I'm guessing we're just not exercising it as
>> heavily as you are. Does the crash by any chance always happen after
>> receiving the same type of event? Inspecting the last 10 or so lines of
>> daemon.log after a crash should tell you if this is so, since it logs
>> the event type quite a long time before it starts messing around with
>> free()ing any pointers.
> 
> Crash #1
> daemon.log
> [1246458609] 7: select() returned 1 (errno = 0: Success)
> [1246458609] 6: inbound data available on ipc socket
> [1246458609] 7: Successfully read 1 NEBCALLBACK_PROGRAM_STATUS_DATA event (352 bytes; 288 bytes body) from socket 7
> [1246458609] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246458611] 7: select() returned 1 (errno = 0: Success)
> [1246458611] 6: inbound data available on ipc socket
> [1246458611] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (546 bytes; 482 bytes body) from socket 7
> [1246458611] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246458611] 7: select() returned 1 (errno = 0: Success)
> [1246458611] 6: inbound data available on ipc socket
> [1246458611] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (486 bytes; 422 bytes body) from socket 7
> 
> 
> Crash #2
> [1246462221] 7: select() returned 1 (errno = 0: Success)
> [1246462221] 6: inbound data available on ipc socket
> [1246462221] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (546 bytes; 482 bytes body) from socket 7
> [1246462221] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246462221] 7: select() returned 1 (errno = 0: Success)
> [1246462221] 6: inbound data available on ipc socket
> [1246462221] 7: Successfully read 1 NEBCALLBACK_SERVICE_CHECK_DATA event (575 bytes; 511 bytes body) from socket 7
> [1246462221] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246462221] 7: select() returned 1 (errno = 0: Success)
> [1246462221] 6: inbound data available on ipc socket
> [1246462221] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (486 bytes; 422 bytes body) from socket 7
> 

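Ooh, I'd quite like to know what that host check looks like. It seems as
if it crashes on the same host-check result both times: the last event
logged before each crash is a NEBCALLBACK_HOST_CHECK_DATA event of 486
bytes (422 bytes body). Matching on size alone is quite a poor heuristic,
but still.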
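
To check whether the pattern holds across further crashes, something
like the following could pull the last few events out of daemon.log
after each one. It is only a rough sketch: the regex is an assumption
based on the "Successfully read" lines pasted above (one event per
line, unwrapped), and the name last_events is made up for this example.

```python
import re
from collections import deque

# Assumed line format, taken from the pasted daemon.log excerpts:
# [<epoch>] <level>: Successfully read 1 <TYPE> event (<N> bytes; <M> bytes body) ...
EVENT_RE = re.compile(
    r"\[(?P<ts>\d+)\] \d+: Successfully read 1 (?P<type>NEBCALLBACK_\w+) event "
    r"\((?P<total>\d+) bytes; (?P<body>\d+) bytes body\)"
)

def last_events(lines, count=3):
    """Return the last `count` events as (timestamp, type, total, body) tuples."""
    tail = deque(maxlen=count)  # keeps only the most recent matches
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            tail.append(
                (int(m["ts"]), m["type"], int(m["total"]), int(m["body"]))
            )
    return list(tail)
```

Calling last_events(open("daemon.log")) right after a crash and
comparing the final tuple between crashes shows whether the event type
and sizes match each time.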

I'll re-enable the debugging machinery that dumps inbound messages to a
binary logfile. When that's done, I'll need you to run Merlin until it
crashes again so I get the sequence of events leading up to the actual
crash in the format Merlin sees them. If I replay the same event-chain
on our 64-bit machine, I *should* get the same crash you're getting. If
that's the case, finding and fixing this bug should be fairly trivial.
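
As an illustration of the record-and-replay idea only: the sketch below
writes each raw inbound event to a binary log and later feeds the events
back, in order, to a handler. The framing (a 4-byte length prefix before
each event) and the names dump_event and replay_events are hypothetical
and chosen for this example; this is not Merlin's actual binary log
layout.

```python
import struct

# Hypothetical framing: 4-byte big-endian length, then the raw event bytes.
# This is NOT Merlin's real on-disk format, just an illustration.
LEN = struct.Struct("!I")

def dump_event(stream, raw):
    """Append one raw inbound event to the binary log stream."""
    stream.write(LEN.pack(len(raw)))
    stream.write(raw)

def replay_events(stream, handler):
    """Feed each logged event to `handler` in original order; return the count."""
    count = 0
    while True:
        prefix = stream.read(LEN.size)
        if not prefix:
            break  # clean end of log
        if len(prefix) < LEN.size:
            raise ValueError("truncated length prefix")
        (size,) = LEN.unpack(prefix)
        raw = stream.read(size)
        if len(raw) < size:
            raise ValueError("truncated event body")
        handler(raw)
        count += 1
    return count
```

Because the events are stored byte-for-byte and replayed in the order
they arrived, the replaying side sees the same sequence the crashing
daemon saw.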

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

