[op5-users] Merlin crashed on me?
Andreas Ericsson
ae at op5.se
Wed Jul 1 23:52:03 CEST 2009
Frater, Greg J wrote:
>
> Frater, Greg J wrote:
>>> I never get neb.log file, should I? When I start nagios I see a
>>> console message that says 'Starting nagios:Logging to
>>> '/usr/local/nagios/merlin/logs/neb.log' but the log file never
>>> appears.
>
>> This almost certainly has to do with directory permissions. You can
>> try, as root, doing
>
>> # chmod 777 /usr/local/nagios/merlin/logs
>> # (restart nagios)
>
>> and it should start working.
>
> It did start working, the neb.log file is now being written, previously
> the permissions were set as follows:
>
> drwxr-xr-x 2 root root 4096 Jun 16 09:07 logs
>
>
>>> Ah, there's my crash, it dumped while I was writing this message.
>
>> Yes, we have a 64-bit system up and running now, but I still haven't
>> seen any crashes on it so I'm guessing we're just not exercising it as
>> heavily as you are. Does the crash by any chance always happen after
>> receiving the same type of event? Inspecting the last 10 or so lines of
>> daemon.log after a crash should tell you if this is so, since it logs
>> the event type quite a long time before it starts messing around with
>> free()ing any pointers.
>
> Crash #1
> daemon.log
> [1246458609] 7: select() returned 1 (errno = 0: Success)
> [1246458609] 6: inbound data available on ipc socket
> [1246458609] 7: Successfully read 1 NEBCALLBACK_PROGRAM_STATUS_DATA event (352 bytes; 288 bytes body) from socket 7
> [1246458609] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246458611] 7: select() returned 1 (errno = 0: Success)
> [1246458611] 6: inbound data available on ipc socket
> [1246458611] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (546 bytes; 482 bytes body) from socket 7
> [1246458611] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246458611] 7: select() returned 1 (errno = 0: Success)
> [1246458611] 6: inbound data available on ipc socket
> [1246458611] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (486 bytes; 422 bytes body) from socket 7
>
>
> Crash #2
> [1246462221] 7: select() returned 1 (errno = 0: Success)
> [1246462221] 6: inbound data available on ipc socket
> [1246462221] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (546 bytes; 482 bytes body) from socket 7
> [1246462221] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246462221] 7: select() returned 1 (errno = 0: Success)
> [1246462221] 6: inbound data available on ipc socket
> [1246462221] 7: Successfully read 1 NEBCALLBACK_SERVICE_CHECK_DATA event (575 bytes; 511 bytes body) from socket 7
> [1246462221] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1246462221] 7: select() returned 1 (errno = 0: Success)
> [1246462221] 6: inbound data available on ipc socket
> [1246462221] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (486 bytes; 422 bytes body) from socket 7
>
Ooh, I'd quite like to know what that host check looks like. It seems as
if it crashes on the same host-check result both times (judging by the
size only, which is quite a poor heuristic, but still).

I'll re-enable the debugging machinery that dumps inbound messages to a
binary logfile. When that's done, I'll need you to run Merlin until it
crashes again so I get the sequence of events leading up to the actual
crash in the format Merlin sees them. If I replay the same event-chain
on our 64-bit machine, I *should* get the same crash you're getting. If
that's the case, finding and fixing this bug should be fairly trivial.
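
In the meantime, the "same last event" comparison above can be automated
rather than eyeballed. What follows is only a minimal sketch in Python, not
part of Merlin: the regular expression is inferred solely from the
daemon.log excerpts quoted in this thread, it assumes each event sits on a
single line in the actual file (the mail client wrapped them here), and the
helper names last_events and same_final_event are made up for illustration.

```python
import re
from collections import deque

# Pattern inferred from the quoted excerpts, e.g.
# [1246458611] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (486 bytes; 422 bytes body) from socket 7
EVENT_RE = re.compile(
    r"\[(\d+)\] \d+: Successfully read 1 (\w+) event "
    r"\((\d+) bytes; (\d+) bytes body\)"
)


def last_events(lines, count=3):
    """Return the last `count` events as (timestamp, type, total, body)."""
    tail = deque(maxlen=count)  # keeps only the most recent matches
    for line in lines:
        m = EVENT_RE.search(line)
        if m:
            tail.append(
                (int(m.group(1)), m.group(2), int(m.group(3)), int(m.group(4)))
            )
    return list(tail)


def same_final_event(log_a, log_b):
    """True if both logs end on an event of the same type and sizes."""
    a, b = last_events(log_a, 1), last_events(log_b, 1)
    # Compare everything except the timestamp.
    return bool(a) and bool(b) and a[0][1:] == b[0][1:]


# Final event lines of the two crashes quoted above, unwrapped.
crash1 = [
    "[1246458611] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (546 bytes; 482 bytes body) from socket 7",
    "[1246458611] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (486 bytes; 422 bytes body) from socket 7",
]
crash2 = [
    "[1246462221] 7: Successfully read 1 NEBCALLBACK_SERVICE_CHECK_DATA event (575 bytes; 511 bytes body) from socket 7",
    "[1246462221] 7: Successfully read 1 NEBCALLBACK_HOST_CHECK_DATA event (486 bytes; 422 bytes body) from socket 7",
]

if __name__ == "__main__":
    print(same_final_event(crash1, crash2))
```

To run it against real logs, pass open file objects (for example saved
copies of daemon.log from each crash) instead of the sample lists. A match
on type and size is still only a hint that the same check result triggers
the crash; the binary dump is what will actually confirm it.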
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.