[op5-users] Réf. : Re: Réf. : Re: Nagios start time delay with Merlin
nicolas.raspail at bnpparibas.com
Wed Aug 26 13:57:37 CEST 2009
op5-users-bounces at lists.op5.com wrote on 26/08/2009 13:30:10:
<snip>
> >>
> >
> > Hi
> >
> > sorry, that will be a long mail this one. Here is the missing information
> > from my previous report.
> >
>
> That's ok. I can glance at details and find the important bits quite
> easily. :-)
>
> > Versions used :
> > * RHEL 5.3 x86_64
> > * Nagios 3.2.0
> > * merlin 0.6.2-beta2
>
> You should probably update to merlin-0.6.2-beta5. It does quite a lot
> less work while achieving the same results.
>
I will try the new beta. Is a tarball available, or must this version be
retrieved from the git repository?
> > * MySQL 5.0.45 (on its own server)
> >
> >
> > First step :
> > * merlind is running
> > * nagios is running
> > * I enable the broker module in nagios.cfg and restart it
> > * this is what I get in the daemon.log
> >
> > [1251280642] 7: select() returned 1 (errno = 2: No such file or directory)
> >
> > [1251280642] 7: Accepting inbound connection on ipc socket
> > [1251280642] 7: sel_val: 9; ipc_listen_sock: 5; ipc_sock: 9; net_sock: 6
> > [1251280642] 7: select() returned 1 (errno = 2: No such file or directory)
> >
> > [1251280642] 6: inbound data available on ipc socket
> >
> > [1251280642] 7: sel_val: 9; ipc_listen_sock: 5; ipc_sock: 9; net_sock: 6
> > [1251280642] 7: select() returned 1 (errno = 0: Success)
> >
> > [1251280642] 6: inbound data available on ipc socket
> >
>
> This is normal, albeit very chatty. Now that the first beta is out
> the door and things have been running rather smoothly the past two
> months I'll make it log quite a lot less.
OK, that will help when debugging other things in Merlin.
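Until then, the chatty lines can be filtered out when reading daemon.log. A rough sketch, assuming that the number after the timestamp is a syslog-style priority (7 = debug, 6 = info); I have not checked this against the Merlin source, and the names parse_line and filter_log are only mine:

```python
import re

# Assumed layout of a daemon.log line: "[<epoch>] <level>: <message>"
LINE_RE = re.compile(r"^\[(\d+)\]\s+(\d+):\s+(.*)$")

def parse_line(line):
    """Return (epoch, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2)), m.group(3)

def filter_log(lines, max_level=6):
    """Yield parsed entries whose level is <= max_level (drops level 7)."""
    for line in lines:
        parsed = parse_line(line)
        if parsed is not None and parsed[1] <= max_level:
            yield parsed

sample = [
    "[1251280642] 7: select() returned 1 (errno = 0: Success)",
    "[1251280642] 6: inbound data available on ipc socket",
    "",
]
kept = list(filter_log(sample))
for epoch, level, message in kept:
    print(epoch, level, message)
```

With max_level=6 only the "inbound data available" line of the sample is kept.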
>
> > [1251280642] 7: Successfully read 1 (null) event (131 bytes; 67 bytes body) from socket 9
> >
> > [1251280642] 7: nagios_paths[0]: /bnp/apps/nagios/etc/nagios.cfg
> > [1251280642] 7: nagios_paths[1]: /bnp/apps/nagios/var/objects.cache
> > [1251280642] 7: Executing import command 'php /bnp/apps/nagios/merlin/import.php --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb --db-user=merlin --db-pass=Chae2yei --db-host=eqd-nagios-sql'
>
> This happens when the merlin daemon receives an event that triggers an
> import of Nagios' configurations and status data. It happens once per
> restart of either the Merlin daemon or the Nagios daemon.
Ok, I understand why
>
> > [1251280659] 6: dbi_conn_query_null(): Failed to run [SELECT host_name, current_state, state_type FROM merlindb.host ORDER BY host_name]: 2006: MySQL server has gone away
> >
> > * merlind crashed and I don't know why it says that the MySQL
> > server has gone away because it is working fine.
> >
>
> Hmm. It should try to reconnect at that point instead. Not sure if it
> did that in beta2.
Ok
>
> >
> > Second step :
> > * nagios is still running and doing some checks
> > * I restart merlin
> > * this is what I get in the daemon.log
> >
> > [1251280908] 6: Initializing IPC socket '/bnp/apps/nagios/merlin/ipc.sock' for daemon
> > [1251280908] 6: Primed object states for 1967 hosts and 14654 services
> > [1251280908] 6: Merlin daemon successfully initialized
> > [1251280908] 7: sel_val: 6; ipc_listen_sock: 5; ipc_sock: -1; net_sock: 6
> > [1251280909] 7: select() returned 1 (errno = 0: Success)
> >
> > [1251280909] 7: Accepting inbound connection on ipc socket
> > [1251280909] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> > [1251280909] 7: select() returned 1 (errno = 0: Success)
> >
> > [1251280909] 6: inbound data available on ipc socket
> >
> > [1251280909] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> > [1251280909] 7: select() returned 1 (errno = 0: Success)
> >
> > [1251280909] 6: inbound data available on ipc socket
> >
> > [1251280909] 7: Successfully read 1 (null) event (131 bytes; 67 bytes body) from socket 7
> >
> > [1251280909] 7: nagios_paths[0]: /bnp/apps/nagios/etc/nagios.cfg
> > [1251280909] 7: nagios_paths[1]: /bnp/apps/nagios/var/objects.cache
> > [1251280909] 7: Executing import command 'php /bnp/apps/nagios/merlin/import.php --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb --db-user=merlin --db-pass=Chae2yei --db-host=eqd-nagios-sql'
> > [1251280926] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> > [1251280926] 7: select() returned 1 (errno = 0: Success)
> >
> > [1251280926] 6: inbound data available on ipc socket
> >
> > [1251280926] 7: Successfully read 1 NEBCALLBACK_SERVICE_STATUS_DATA event (832 bytes; 768 bytes body) from socket 7
> >
> > [1251280926] 7: Updating status for service 'bnp-check-gprime-swap-snmp' on host 'cg41-026'
> > [1251280926] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> > [1251280926] 7: select() returned 1 (errno = 0: Success)
> >
> > and so on
> >
> > * All seems to be okay
> >
>
> Yup.
>
> >
> > Third step
> > * merlind is still running
> > * nagios is restarted
> > * this is what I get in the daemon.log
> >
> > [1251281935] 7: nagios_paths[0]: /bnp/apps/nagios/etc/nagios.cfg
> > [1251281935] 7: nagios_paths[1]: /bnp/apps/nagios/var/objects.cache
> > [1251281935] 7: Executing import command 'php /bnp/apps/nagios/merlin/import.php --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb --db-user=merlin --db-pass=Chae2yei --db-host=eqd-nagios-sql'
> > [1251281952] 7: sel_val: 8; ipc_listen_sock: 5; ipc_sock: 8; net_sock: 6
> > [1251281952] 7: select() returned 1 (errno = 0: Success)
> >
> > [1251281952] 6: inbound data available on ipc socket
> >
> > [1251281952] 7: Successfully read 1 NEBCALLBACK_HOST_STATUS_DATA event (802 bytes; 738 bytes body) from socket 8
> >
> > [1251281952] 7: Updating status for host 'SGRS-rf0-LTE0'
> > [1251281952] 7: sel_val: 8; ipc_listen_sock: 5; ipc_sock: 8; net_sock: 6
> > [1251281952] 7: select() returned 1 (errno = 0: Success)
> >
> > [1251281952] 6: inbound data available on ipc socket
> >
> > [1251281952] 7: Successfully read 1 NEBCALLBACK_HOST_STATUS_DATA event (798 bytes; 734 bytes body) from socket 8
> >
> > [1251281952] 7: Updating status for host 'anasty-bd2'
> > [1251281952] 7: sel_val: 8; ipc_listen_sock: 5; ipc_sock: 8; net_sock: 6
> > [1251281952] 7: select() returned 1 (errno = 0: Success)
> >
> > and so on
> > * nagios was restarted at timestamp 1251281934 and, since then,
> > no checks were made until timestamp 1251282809! Nearly 15 minutes of
> > update, that is a lot of time.
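For the record, the gap between those two timestamps can be checked with a few lines of Python. This is only a sanity check on the numbers quoted above; the variable names are mine:

```python
# Epoch timestamps taken from the report above.
restart_epoch = 1251281934      # Nagios restarted
first_check_epoch = 1251282809  # first check seen after the restart

gap = first_check_epoch - restart_epoch
minutes, seconds = divmod(gap, 60)
print(f"{gap} s = {minutes} min {seconds} s")  # prints "875 s = 14 min 35 s"
```

So the delay is just under the 15 minutes mentioned.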
>
> And far, far more than we're experiencing here. Merlin is designed in
> such a way that it rather drops messages than interferes with the
> running Nagios daemon, so what you're seeing is almost certainly not
> a result of Merlin doing something weird.
>
> This should be alleviated by upgrading to the latest Merlin version
> though, since it ignores all events Nagios throws at it until Nagios
> has entered the main event execution loop. Under such circumstances,
> the startup time can't be affected at all by Merlin.
>
OK, but without Merlin, my Nagios starts running checks immediately. With
NDO, there is also a delay (7 to 8 minutes), and the MySQL server is very
busy during this period. Maybe the problem lies elsewhere, but I can't
find where.
My Nagios is compiled with the following options:
./configure --prefix=/bnp/apps/nagios --with-nagios-user=nagios \
    --with-nagios-group=nagios --with-command-group=nagioscmd \
    --enable-nanosleep --enable-event-broker
> > The load on the mysql server is mostly null
> > during this time. On the Nagios server, merlind doesn't use a lot of cpu
> >
>
> That sounds fairly normal. Neither MySQL nor merlind uses a lot of
> CPU in any of our setups.
ok, nice to hear that :)
>
> >
> > Also, during these tests, I have seen my check and host latencies grow
> > and now, with Merlin enabled, I have a large number of orphaned checks.
> >
> > With Merlin I have the following latencies (and it is increasing as I
> > write my email) :
> >
> > Service Check Latency: 0.00 / 1299.86 / 160.156 sec
> > Host Check Execution Time: 2.54 / 3.19 / 2.564 sec
> > Host Check Latency: 0.00 / 723.49 / 304.415 sec
> >
> > Before Merlin, i don't have the exact values, but I remember that the
> > service latency was under 50s and the host latency under 1s
> >
>
> Was the latency slowly increasing before, or was it totally stable?
> Merlin does add a small overhead to the processing of each check
> result, status update and a plethora of other things. If your
> latency was previously increasing slowly, Merlin will make it
> increase faster. If it was stable before, it's possible that the
> (very small) overhead that Merlin adds is pushing it over the
> limit so that the latency starts converging on infinity.
Before Merlin, the latency was totally stable. I understand why Merlin and
NDO add a small overhead, but what I'm facing is a huge overhead.
I have disabled Merlin and enabled NDO to compare. Here is my current
latency with NDO after 15 to 20 minutes:
Service Check Execution Time: 0.04 / 30.03 / 0.389 sec
Service Check Latency: 0.00 / 1952.96 / 300.960 sec
Host Check Execution Time: 2.54 / 2.69 / 2.566 sec
Host Check Latency: 0.00 / 17.98 / 5.105 sec
And the latency is still decreasing as I write my email.
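To put the two sets of figures side by side, here is a small sketch that parses the lines above. It assumes the three values are min / max / average, as I read them on the Nagios performance page; parse_stats is just a name I picked:

```python
import re

# Assumed line layout: "<label>: <min> / <max> / <avg> sec"
STAT_RE = re.compile(
    r"^(?P<label>[^:]+):\s*(?P<min>[\d.]+)\s*/\s*(?P<max>[\d.]+)\s*/\s*(?P<avg>[\d.]+)\s*sec$"
)

def parse_stats(text):
    """Map each label to a (min, max, avg) tuple of floats."""
    stats = {}
    for line in text.splitlines():
        m = STAT_RE.match(line.strip())
        if m:
            stats[m.group("label")] = (
                float(m.group("min")), float(m.group("max")), float(m.group("avg"))
            )
    return stats

merlin = parse_stats("""
Service Check Latency: 0.00 / 1299.86 / 160.156 sec
Host Check Execution Time: 2.54 / 3.19 / 2.564 sec
Host Check Latency: 0.00 / 723.49 / 304.415 sec
""")
ndo = parse_stats("""
Service Check Execution Time: 0.04 / 30.03 / 0.389 sec
Service Check Latency: 0.00 / 1952.96 / 300.960 sec
Host Check Execution Time: 2.54 / 2.69 / 2.566 sec
Host Check Latency: 0.00 / 17.98 / 5.105 sec
""")

# Compare the average (third value) for the labels present in both runs.
for label in sorted(merlin.keys() & ndo.keys()):
    print(f"{label}: Merlin avg {merlin[label][2]} s, NDO avg {ndo[label][2]} s")
```

The two runs were not measured after the same uptime, so this is only a rough comparison.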
Regards
Nicolas