[op5-users] Re: merlind crash after losing mysql connection

nicolas.raspail at bnpparibas.com nicolas.raspail at bnpparibas.com
Tue Sep 1 10:04:42 CEST 2009


op5-users-bounces at lists.op5.com wrote on 31/08/2009 17:04:24:
> nicolas.raspail at bnpparibas.com wrote:
> > op5-users-bounces at lists.op5.com wrote on 31/08/2009 10:38:27:
> > 
> >> nicolas.raspail at bnpparibas.com wrote:

<snip>

> > 
> > I have just installed the 0.6.2-beta10 version of merlind. On that
> > subject, is there a place where we can find the running version?
> 
> Well, yes and no. If you're building from git, you should be able to
> see the exact version in the logs when Merlin is loaded. I just noticed
> that there was a bug in the gen-version.sh script that caused it to not
> print the DEF_VER variable properly when building from tarballs.
> 

I can't access the git repository because of firewalls and proxies, so
I grabbed the tarball from your web interface :)

> > I see nothing in the logs of nagios or merlind. And in the source, in
> > the file gen-version.sh, there is only DEF_VER=v0.6.1 and this script
> > seems to do nothing.
> > 
> 
> You're not meant to run the script manually. It's run by invoking
> 'make', and it's supposed to create a file called version.c
> 
> > [merlin at eqd-nagios01 merlin]$ ./gen-version.sh 
> > #include "shared.h"
> > const char *merlin_version = "";
> > 
> > But let's get to why I installed the new version: the loss of the
> > MySQL connection!
> > 
> 
> Right. I *think* I may have fixed this one, although a mismerge between
> the upstream Nagios core and some of our own patches forced me to
> redirect my efforts a while.

ok

> 
> > ** After I switch over my MySQL server, I can see a lot of messages
> > like this in the log:
> > 
> > [1251728254] 6: dbi_conn_query_null(): Failed to run [UPDATE
> > merlindb.service SET initial_state = 0, flap_detection_enabled = 1,
> > low_flap_threshold = 0.000000, high_flap_threshold = 0.000000,
> > check_freshness = 0, freshness_threshold = 0,
> > process_performance_data = 1, active_checks_enabled = 1,
> > passive_checks_enabled = 1, event_handler_enabled = 1,
> > obsess_over_service = 1251669600, problem_has_been_acknowledged = 0,
> > acknowledgement_type = 0, check_type = 0, current_state = 0,
> > last_state = 0, last_hard_state = 0, state_type = 1,
> > current_attempt = 1, current_event_id = 0, last_event_id = 0,
> > current_problem_id = 0, last_problem_id = 0, latency = 0.032000,
> > execution_time = 0.044196, notifications_enabled = 1,
> > last_notification = 0, next_check = 1251729152,
> > should_be_scheduled = 1, last_check = 1251728252,
> > last_state_change = 1248342907, last_hard_state_change = 1248342907,
> > has_been_checked = 1, current_notification_number = 0,
> > current_notification_id = 0, check_flapping_recovery_notification = 0,
> > scheduled_downtime_depth = 0, pending_flex_downtime = 0,
> > is_flapping = 0, flapping_comment_id = 0,
> > percent_state_change = 0.000000, output = 'SNMP OK - TODEFINE',
> > long_output = '', perf_data = '' WHERE host_name = 'xxxx' AND
> > service_description = 'bnp-check-snmpd']: 2006: MySQL server has gone
> > away
> 
> Does it ever say that it's managed to connect to the MySQL server in
> the first place?

Yes, I think so, because my database gets updated.

Here is a log extract from daemon.log, taken when I started merlind:

[1251727986] 6: Initializing IPC socket '/bnp/apps/nagios/merlin/ipc.sock' for daemon
[1251727986] 6: Primed object states for 0 hosts and 0 services
[1251727986] 6: Merlin daemon  successfully initialized
[1251727997] 6: Accepting inbound connection on ipc socket
[1251727997] 6: inbound data available on ipc socket

[1251727997] 6: Executing import command 'php
    /bnp/apps/nagios/merlin/import.php
    --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
    --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb
    --db-user=merlin --db-pass=xxx --db-host=eqd-nagios-sql
    --status-log=/bnp/apps/nagios/var/status.dat'
[1251727997] 6: Handled 5 ipc events in 0.002 seconds
[1251727998] 6: inbound data available on ipc socket
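
For anyone who wants to read these entries with real dates, or count how many queries failed, here is a rough Python sketch. The line pattern is only inferred from the extract above (`[epoch] level: message`), and `parse_line` and `summarise` are hypothetical helper names, not part of Merlin. The third sample line is an abbreviated stand-in for the long UPDATE entries.

```python
import re
from datetime import datetime, timezone

# Pattern inferred from the extract above: "[epoch] level: message".
LINE_RE = re.compile(r"^\[(\d+)\] (\d+): (.*)$")

def parse_line(line):
    """Return (utc_datetime, level, message), or None if the line does not match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.fromtimestamp(int(m.group(1)), tz=timezone.utc)
    return ts, int(m.group(2)), m.group(3)

def summarise(lines):
    """Count parsed entries and entries that report a failed query."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    failed = [p for p in parsed if "Failed to run" in p[2]]
    return {"entries": len(parsed), "failed_queries": len(failed)}

sample = [
    "[1251727986] 6: Merlin daemon  successfully initialized",
    "[1251727997] 6: Accepting inbound connection on ipc socket",
    # Abbreviated stand-in for a failed UPDATE entry.
    "[1251728254] 6: dbi_conn_query_null(): Failed to run [UPDATE ...]: 2006: MySQL server has gone away",
]
for line in sample:
    ts, level, msg = parse_line(line)
    print(ts.isoformat(), level, msg[:40])
print(summarise(sample))
```

The timestamps are converted to UTC here; local time (CEST) is two hours ahead.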


> 
> > 
> > Until that point, everything seems to be fine and the merlind process
> > is still running.
> > 
> > ** But when the MySQL server is up again, I see a lot of messages like
> > this:
> > 
> > [1251728254] 6: Handled 110 ipc events in 0.086 seconds
> > [1251728255] 6: inbound data available on ipc socket
> > 
> > [1251728255] 6: dbi_conn_query_null(): Failed to run [UPDATE
> > merlindb.program_status SET is_running = 1, last_alive = 1251728255,
> > program_start = 1251727995, pid = 24400, daemon_mode = 1,
> > last_command_check = 1251728254, last_log_rotation = 0,
> > notifications_enabled = 1, active_service_checks_enabled = 1,
> > passive_service_checks_enabled = 1, active_host_checks_enabled = 1,
> > passive_host_checks_enabled = 1, event_handlers_enabled = 1,
> > flap_detection_enabled = 0, failure_prediction_enabled = 1,
> > process_performance_data = 0, obsess_over_hosts = 0,
> > obsess_over_services = 0, modified_host_attributes = 0,
> > modified_service_attributes = 0, global_host_event_handler = '',
> > global_service_event_handler = ''WHERE instance_id = 0]: 2006: MySQL
> > server has gone away
> 
> Basically the same, then. Again, has it ever stated that it has
> successfully connected to the MySQL server?
> 
> > 
> > And after that, the logs are filled with the same messages: handled
> > ipc events, inbound data and failed queries.
> > 
> > I restarted the merlind process, it ran an import, and after that it
> > works fine again.
> > 
> 
> Ok. In that case it's not a configuration error. Can you try using the
> latest git snapshot (download it directly from git for simpler updates)
> and see if that solves this particular problem?
> 
> The latest core code changes can be found in v0.6.2-beta11.
> 
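
To make the symptom concrete: the queries keep failing with error 2006 even after the server is back, which is what a reconnect-and-retry step would normally avoid. Below is a minimal Python sketch of that general idea, using a fake connection object. All the names (`GoneAway`, `FlakyConnection`, `run_with_reconnect`) are hypothetical, and this says nothing about how merlind or libdbi actually handle it.

```python
# Hypothetical stand-ins: GoneAway, FlakyConnection and run_with_reconnect are
# illustrative names only; they are not Merlin or libdbi APIs.
class GoneAway(Exception):
    """Stands in for MySQL client error 2006 (server has gone away)."""

class FlakyConnection:
    """Fake connection that fails until reconnect() has been called."""
    def __init__(self):
        self.alive = False
        self.executed = []

    def reconnect(self):
        self.alive = True

    def query(self, sql):
        if not self.alive:
            raise GoneAway(2006)
        self.executed.append(sql)

def run_with_reconnect(conn, sql, max_retries=1):
    """Run sql; on a lost connection, reconnect and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            conn.query(sql)
            return True
        except GoneAway:
            if attempt == max_retries:
                return False
            conn.reconnect()
    return False

conn = FlakyConnection()
ok = run_with_reconnect(conn, "UPDATE program_status SET is_running = 1")
print(ok, conn.executed)
```

With `max_retries=0` the same call gives up after the first failure, which matches the behaviour seen in the logs.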

I will try later in the day. But there is one last weird thing. Yesterday,
after I restarted the merlind process, ninja stopped displaying any
information about hosts and services. After some searching, I saw that the
host table showed 0 rows. Other tables get updated, but not host.

The logs show the process restart and the initial import:

[1251729329] 6: Handled 1 ipc events in 0.001 seconds
[1251729338] 6: Initializing IPC socket '/bnp/apps/nagios/merlin/ipc.sock' for daemon
[1251729339] 6: Primed object states for 2004 hosts and 14809 services
[1251729339] 6: Merlin daemon  successfully initialized
[1251729342] 6: Accepting inbound connection on ipc socket
[1251729342] 6: inbound data available on ipc socket

[1251729342] 6: Executing import command 'php
    /bnp/apps/nagios/merlin/import.php
    --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
    --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb
    --db-user=merlin --db-pass=xxx --db-host=eqd-nagios-sql
    --status-log=/bnp/apps/nagios/var/status.dat'
[1251729342] 6: Executing import command 'php
    /bnp/apps/nagios/merlin/import.php
    --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
    --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb
    --db-user=merlin --db-pass=xxx --db-host=eqd-nagios-sql
    --status-log=/bnp/apps/nagios/var/status.dat'
[1251729342] 6: Executing import command 'php
    /bnp/apps/nagios/merlin/import.php
    --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
    --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb
    --db-user=merlin --db-pass=xxx --db-host=eqd-nagios-sql
    --status-log=/bnp/apps/nagios/var/status.dat'
[1251729342] 6: Handled 66 ipc events in 0.477 seconds
[1251729358] 6: inbound data available on ipc socket

[1251729358] 6: Executing import command 'php
    /bnp/apps/nagios/merlin/import.php
    --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
    --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb
    --db-user=merlin --db-pass=xxx --db-host=eqd-nagios-sql
    --status-log=/bnp/apps/nagios/var/status.dat'
[1251729358] 6: Handled 50 ipc events in 0.019 seconds
[1251729359] 6: inbound data available on ipc socket

But the next import was only run at timestamp 1251767006:

[1251767006] 6: Executing import command 'php
    /bnp/apps/nagios/merlin/import.php
    --nagios-cfg=/bnp/apps/nagios/etc/nagios.cfg
    --cache=/bnp/apps/nagios/var/objects.cache --db-name=merlindb
    --db-user=merlin --db-pass=xxx --db-host=eqd-nagios-sql
    --status-log=/bnp/apps/nagios/var/status.dat'

It seems that 37648 seconds (about 10.5 hours) passed before another
import. Maybe this is the correct behaviour, but I don't really understand
why the host table contained 0 rows for several minutes or hours.
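
The gap can be checked directly from the timestamps of the "Executing import command" entries quoted above. A small Python sketch; the list is copied by hand from those entries and `gaps` is just an illustrative helper:

```python
from datetime import timedelta

# Epoch timestamps of the "Executing import command" entries quoted above.
import_times = [1251729342, 1251729342, 1251729342, 1251729358, 1251767006]

def gaps(timestamps):
    """Return the differences in seconds between consecutive timestamps."""
    return [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

import_gaps = gaps(import_times)
longest = max(import_gaps)
print(import_gaps)                                   # [0, 0, 16, 37648]
print(longest, "seconds =", timedelta(seconds=longest))  # 37648 seconds = 10:27:28
```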

> Thanks for your reports. I really appreciate them :-)

You're welcome. We are looking for a good tool to give a team access to
the Nagios events. We are testing NDOutils and Merlin. With NDO, nagios
takes too long to restart, which is not acceptable. With Merlin, nagios
starts immediately, that's cool :)
We just want to be sure that there are no side effects with Merlin, so I am
testing it. When things seem to settle down, I will look at the database
schema to see how they will be able to get the events from the merlin
database.

Regards

Nicolas



