[op5-users] Réf. : Re: Nagios start time delay with Merlin
Andreas Ericsson
ae at op5.se
Wed Aug 26 13:30:10 CEST 2009
nicolas.raspail at bnpparibas.com wrote:
> op5-users-bounces at lists.op5.com wrote on 26/08/2009 10:54:16:
>
>> nicolas.raspail at bnpparibas.com wrote:
>>> Hi
>>>
>>> I'm currently evaluating Merlin to replace NDO. But since I have
> enabled
>>> Merlin, when Nagios is reloaded (by the gui or external command),
> during
>>> several minutes, merlind fills its log with some information saying
> that
>>> the database is updated. And during this time, no checks are performed
> by
>>> Nagios. I have see something like that with NDO, but with lower delay.
>>>
>>> Is it the expected behavior or I have missed something ?
>> It might be. "some information" is a pretty vague description. What's
>> happening is most likely that it claims to update the database
> correctly,
>> which just shows that it's working and doing something. Next time you
>> report something you find odd, please include a cut-n-paste section of
>> the log. That way I'll be able to give you a fairly exact description
>> of what's happening.
>>
>>
>>> Is there a
>>> specific way to reload Nagios when merlin is used ? We are using
> NagiosQL
>>> to manage the configuration and restart Nagios from its GUI (external
>>> command, like in the Nagios GUI).
>>>
>> That should work just fine.
>>
>>> Also, is there a configuration item to tell merlind to stop logging
>>> something like that
>>>
>>> [1251181261] 7: Updating status for service 'some service' on host
>>> 'some-host'
>>> [1251181261] 7: sel_val: 9; ipc_listen_sock: 5; ipc_sock: 9; net_sock:
> 6
>>> [1251181261] 7: select() returned 1 (errno = 0: Success)
>>>
>>> We have 1987 hosts, 14689 services, and the log size grows up quickly.
>>>
>> Not while Merlin is still in beta, no. The mechanic to set the desired
>> log_levels is already there, but it's not honored by the actual logging
>> mechanism. Later, you'll be able to set
>>
>> log_levels = info
>>
>> and have merlin only log things of severity "info" or above.
>>
>
> Hi
>
> sorry, that will be a long mail this one. Here is the missing information
> from my previous report.
>
That's ok. I can glance at details and find the important bits quite
easily. :-)
> Versions used :
> * RHEL 5.3 x86_64
> * Nagios 3.2.0
> * merlin 0.6.2-beta2
You should probably update to merlin-0.6.2-beta5. It does quite a lot
less work while achieving the same results.
> * MySQL 5.0.45 (on its own server)
>
>
> First step :
> * merlind is running
> * nagios is running
> * I enable the broker module in nagios.cfg and restart it
> * this is what I get in the daemon.log
>
> [1251280642] 7: select() returned 1 (errno = 2: No such file or directory)
>
> [1251280642] 7: Accepting inbound connection on ipc socket
> [1251280642] 7: sel_val: 9; ipc_listen_sock: 5; ipc_sock: 9; net_sock: 6
> [1251280642] 7: select() returned 1 (errno = 2: No such file or directory)
>
> [1251280642] 6: inbound data available on ipc socket
>
> [1251280642] 7: sel_val: 9; ipc_listen_sock: 5; ipc_sock: 9; net_sock: 6
> [1251280642] 7: select() returned 1 (errno = 0: Success)
>
> [1251280642] 6: inbound data available on ipc socket
>
This is normal, albeit very chatty. Now that the first beta is out
the door and things have been running rather smoothly the past two
months I'll make it log quite a lot less.
> [1251280642] 7: Successfully read 1 (null) event (131 bytes; 67 bytes
> body) from socket 9
>
> [1251280642] 7: nagios_paths[0]: /bnp/apps/nagios/etc/nagios.cfg
> [1251280642] 7: nagios_paths[1]: /bnp/apps/nagios/var/objects.cache
> [1251280642] 7: Executing import command 'php
> /bnp/apps/nagios/merlin/import.php
> --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
> --cache=/bnp/apps/nagios/var/objects.cache --db-name=m
> erlindb --db-user=merlin --db-pass=Chae2yei --db-host=eqd-nagios-sql'
This happens when the merlin daemon receives an event that triggers an
import of Nagios' configurations and status data. It happens once per
restart of either the Merlin daemon or the Nagios daemon.
> [1251280659] 6: dbi_conn_query_null(): Failed to run [SELECT host_name,
> current_state, state_type FROM merlindb.host ORDER BY host_name]: 2006:
> MySQL server has gone away
>
> * merlind crashed and I don't know why it says that the MySQL
> server has gone away because it is working fine.
>
Hmm. It should try to reconnect at that point instead. Not sure if it
did that in beta2.
>
> Second step :
> * nagios is still running and doing some checks
> * I restart merlin
> * this is what I get in the daemon.log
>
> [1251280908] 6: Initializing IPC socket '/bnp/apps/nagios/merlin/ipc.sock'
> for daemon
> [1251280908] 6: Primed object states for 1967 hosts and 14654 services
> [1251280908] 6: Merlin daemon successfully initialized
> [1251280908] 7: sel_val: 6; ipc_listen_sock: 5; ipc_sock: -1; net_sock: 6
> [1251280909] 7: select() returned 1 (errno = 0: Success)
>
> [1251280909] 7: Accepting inbound connection on ipc socket
> [1251280909] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1251280909] 7: select() returned 1 (errno = 0: Success)
>
> [1251280909] 6: inbound data available on ipc socket
>
> [1251280909] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1251280909] 7: select() returned 1 (errno = 0: Success)
>
> [1251280909] 6: inbound data available on ipc socket
>
> [1251280909] 7: Successfully read 1 (null) event (131 bytes; 67 bytes
> body) from socket 7
>
> [1251280909] 7: nagios_paths[0]: /bnp/apps/nagios/etc/nagios.cfg
> [1251280909] 7: nagios_paths[1]: /bnp/apps/nagios/var/objects.cache
> [1251280909] 7: Executing import command 'php
> /bnp/apps/nagios/merlin/import.php
> --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
> --cache=/bnp/apps/nagios/var/objects.cache --db-name=m
> erlindb --db-user=merlin --db-pass=Chae2yei --db-host=eqd-nagios-sql'
> [1251280926] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1251280926] 7: select() returned 1 (errno = 0: Success)
>
> [1251280926] 6: inbound data available on ipc socket
>
> [1251280926] 7: Successfully read 1 NEBCALLBACK_SERVICE_STATUS_DATA event
> (832 bytes; 768 bytes body) from socket 7
>
> [1251280926] 7: Updating status for service 'bnp-check-gprime-swap-snmp'
> on host 'cg41-026'
> [1251280926] 7: sel_val: 7; ipc_listen_sock: 5; ipc_sock: 7; net_sock: 6
> [1251280926] 7: select() returned 1 (errno = 0: Success)
>
> and so on
>
> * All seems to be okay
>
Yup.
>
> Third step
> * merlind is still running
> * nagios is restarted
> * this is what I get in the daemon.log
>
> [1251281935] 7: nagios_paths[0]: /bnp/apps/nagios/etc/nagios.cfg
> [1251281935] 7: nagios_paths[1]: /bnp/apps/nagios/var/objects.cache
> [1251281935] 7: Executing import command 'php
> /bnp/apps/nagios/merlin/import.php
> --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
> --cache=/bnp/apps/nagios/var/objects.cache --db-name=m
> erlindb --db-user=merlin --db-pass=Chae2yei --db-host=eqd-nagios-sql'
> [1251281952] 7: sel_val: 8; ipc_listen_sock: 5; ipc_sock: 8; net_sock: 6
> [1251281952] 7: select() returned 1 (errno = 0: Success)
>
> [1251281952] 6: inbound data available on ipc socket
>
> [1251281952] 7: Successfully read 1 NEBCALLBACK_HOST_STATUS_DATA event
> (802 bytes; 738 bytes body) from socket 8
>
> [1251281952] 7: Updating status for host 'SGRS-rf0-LTE0'
> [1251281952] 7: sel_val: 8; ipc_listen_sock: 5; ipc_sock: 8; net_sock: 6
> [1251281952] 7: select() returned 1 (errno = 0: Success)
>
> [1251281952] 6: inbound data available on ipc socket
>
> [1251281952] 7: Successfully read 1 NEBCALLBACK_HOST_STATUS_DATA event
> (798 bytes; 734 bytes body) from socket 8
>
> [1251281952] 7: Updating status for host 'anasty-bd2'
> [1251281952] 7: sel_val: 8; ipc_listen_sock: 5; ipc_sock: 8; net_sock: 6
> [1251281952] 7: select() returned 1 (errno = 0: Success)
>
> and so on
> * nagios has been restarted at timestamp 1251281934 and since
> that, no check have been made until timestamp 1251282809 ! 15 mins of
> update, that is a lot of time.
And far, far more than we're experiencing here. Merlin is designed in
such a way that it rather drops messages than interferes with the
running Nagios daemon, so what you're seeing is almost certainly not
a result of Merlin doing something weird.
This should be alleviated by upgrading to the latest Merlin version
though, since it ignores all events Nagios throws at it until Nagios
has entered the main event execution loop. Under such circumstances,
the startup time can't be affected at all by Merlin.
> The load on the mysql server is mostly null
> during this time. On the Nagios server, merlind doesn't use a lot of cpu
>
That sounds fairly normal. Neither MySQL nor merlind uses a lot of
CPU in any of our setups.
>
> Also, During these tests, I have see my check and hosts latencies grow up
> and now, with Merlin enabled, I have a large number of orphaned checks.
>
> With Merlin I have the following latencies (and it is increasing as I
> write my email) :
>
> Service Check Latency: 0.00 / 1299.86 / 160.156 sec
> Host Check Execution Time: 2.54 / 3.19 / 2.564 sec
> Host Check Latency: 0.00 / 723.49 / 304.415 sec
>
> Before Merlin, i don't have the exact values, but I remember that the
> service latency was under 50s and the host latency under 1s
>
Was the latency slowly increasing before, or was it totally stable?
Merlin does add a small overhead to the processing of each check
result, status update and a plethora of other things. If your
latency was previously increasing slowly, Merlin will make it
increase faster. If it was stable before, it's possible that the
(very small) overhead that Merlin adds is pushing it over the
limit so that the latency starts converging on infinity.
--
Andreas Ericsson andreas.ericsson at op5.se
OP5 AB www.op5.se
Tel: +46 8-230225 Fax: +46 8-230231
Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.
More information about the op5-users
mailing list