[op5-users] Ref.: Re: merlind crash after losing mysql connection

Andreas Ericsson ae at op5.se
Mon Aug 31 17:04:24 CEST 2009


nicolas.raspail at bnpparibas.com wrote:
> op5-users-bounces at lists.op5.com wrote on 31/08/2009 10:38:27:
> 
>> nicolas.raspail at bnpparibas.com wrote:
>>> Hi
>>>
>>> Merlind (version 0.6.2-beta5) crashes if it loses the connection to
>>> the MySQL server. My MySQL server runs inside a cluster, and for
>>> various reasons I switched it to another node. To be sure that
>>> everything was okay, I checked the Merlin log. As expected, I see
>>> the following message:
>>>
>>> [1251471412] 6: dbi_conn_query_null(): Failed to run [SELECT host_name,
>>> current_state, state_type FROM merlindb.host ORDER BY host_name]:
>>> 2006: MySQL server has gone away
>>>
>>> What is not expected is that no merlind process is running any
>>> more.
>> Do you mean that no merlind process is running, or that only one is?
>>
>>> I saw the same behaviour with beta2, as reported in a previous
>>> email, and Andreas answered:
>>> "Hmm. It should try to reconnect at that point instead. Not sure if it
>>> did that in beta2."
>>>
>> It should, although it should log that it does. What version of libdbi
>> are you using? There have been some changes to the error handling in
>> recent versions of libdbi, so perhaps your version no longer returns
>> DBI_ERROR_NOCONN when it notices the database connection has died. I'll
>> run some tests, add some logging, and make sure it at least tries to
>> reconnect.
>>
>>> It seems that even with beta5, merlind is not trying to reconnect.
>>>
>> Well, the code in sql.c hasn't changed between beta5 and beta10, so it
>> wouldn't help to upgrade for this particular problem.
>>
>> I'll get back to you in an hour when I've run those tests.
> 
> Hi
> 
> I have just installed the 0.6.2-beta10 version of merlind. On that
> subject, is there a place where we can find the running version?

Well, yes and no. If you're building from git, you should be able to
see the exact version in the logs when Merlin is loaded. I just noticed
that there was a bug in the gen-version.sh script that caused it to not
print the DEF_VER variable properly when building from tarballs.

> I see nothing in the logs of nagios or merlind. And in the source, in
> the file gen-version.sh, there is only DEF_VER=v0.6.1, and this script
> seems to do nothing.
> 

You're not meant to run the script manually. It's run by invoking
'make', and it's supposed to create a file called version.c

> [merlin at eqd-nagios01 merlin]$ ./gen-version.sh 
> #include "shared.h"
> const char *merlin_version = "";
> 
> But let's get to why I installed the new version: the loss of the
> MySQL connection!
> 

Right. I *think* I may have fixed this one, although a mismerge between
the upstream Nagios core and some of our own patches forced me to
redirect my efforts for a while.
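
The behaviour discussed in this thread (retry the query after the
database layer reports a lost connection) can be sketched roughly as
below. This is not the code in sql.c and it does not call libdbi: the
fake_db type, the query_status values and all function names are
hypothetical stand-ins, chosen so that the example compiles on its own.
In a real daemon the "no connection" case would correspond to libdbi
reporting DBI_ERROR_NOCONN.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a database handle; not a libdbi type. */
struct fake_db {
    bool connected;        /* is the (simulated) connection up? */
    int connect_attempts;  /* how many times fake_connect() was called */
    int fail_connects;     /* how many upcoming connect attempts fail */
};

enum query_status { QUERY_OK, QUERY_NOCONN, QUERY_ERROR };

/* Simulated connect: fails while fail_connects > 0, then succeeds. */
static bool fake_connect(struct fake_db *db)
{
    db->connect_attempts++;
    if (db->fail_connects > 0) {
        db->fail_connects--;
        db->connected = false;
    } else {
        db->connected = true;
    }
    return db->connected;
}

/* Simulated query: only distinguishes "no connection" from success. */
static enum query_status fake_query(struct fake_db *db, const char *sql)
{
    if (sql == NULL)
        return QUERY_ERROR;
    return db->connected ? QUERY_OK : QUERY_NOCONN;
}

/* Run a query; if the connection is gone, reconnect and rerun it,
 * giving up after max_retries reconnect attempts. Other errors are
 * returned to the caller unchanged. */
static enum query_status query_with_reconnect(struct fake_db *db,
                                              const char *sql,
                                              int max_retries)
{
    enum query_status st = fake_query(db, sql);
    int tries = 0;

    while (st == QUERY_NOCONN && tries < max_retries) {
        tries++;
        if (fake_connect(db))
            st = fake_query(db, sql);
    }
    return st;
}
```

Bounding the number of retries keeps the caller from blocking forever
while the server is down; a real daemon would probably also log each
attempt and wait between attempts.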

> ** After I switch over my MySQL server, I can see a lot of messages
> like this in the log:
> 
> [1251728254] 6: dbi_conn_query_null(): Failed to run [UPDATE
> merlindb.service SET initial_state = 0, flap_detection_enabled = 1,
> low_flap_threshold = 0.000000, high_flap_threshold = 0.000000,
> check_freshness = 0, freshness_threshold = 0,
> process_performance_data = 1, active_checks_enabled = 1,
> passive_checks_enabled = 1, event_handler_enabled = 1,
> obsess_over_service = 1251669600, problem_has_been_acknowledged = 0,
> acknowledgement_type = 0, check_type = 0, current_state = 0,
> last_state = 0, last_hard_state = 0, state_type = 1,
> current_attempt = 1, current_event_id = 0, last_event_id = 0,
> current_problem_id = 0, last_problem_id = 0, latency = 0.032000,
> execution_time = 0.044196, notifications_enabled = 1,
> last_notification = 0, next_check = 1251729152,
> should_be_scheduled = 1, last_check = 1251728252,
> last_state_change = 1248342907, last_hard_state_change = 1248342907,
> has_been_checked = 1, current_notification_number = 0,
> current_notification_id = 0, check_flapping_recovery_notification = 0,
> scheduled_downtime_depth = 0, pending_flex_downtime = 0,
> is_flapping = 0, flapping_comment_id = 0,
> percent_state_change = 0.000000, output = 'SNMP OK - TODEFINE',
> long_output = '', perf_data = '' WHERE host_name = 'xxxx' AND
> service_description = 'bnp-check-snmpd']: 2006: MySQL server has gone
> away

Does it ever say that it's managed to connect to the MySQL server in
the first place?

> 
> Up to that point, everything seems to be fine and the merlind process
> is still running.
> 
> ** But when the MySQL server is up again, I see a lot of messages like
> this:
> 
> [1251728254] 6: Handled 110 ipc events in 0.086 seconds
> [1251728255] 6: inbound data available on ipc socket
> 
> [1251728255] 6: dbi_conn_query_null(): Failed to run [UPDATE
> merlindb.program_status SET is_running = 1, last_alive = 1251728255,
> program_start = 1251727995, pid = 24400, daemon_mode = 1,
> last_command_check = 1251728254, last_log_rotation = 0,
> notifications_enabled = 1, active_service_checks_enabled = 1,
> passive_service_checks_enabled = 1, active_host_checks_enabled = 1,
> passive_host_checks_enabled = 1, event_handlers_enabled = 1,
> flap_detection_enabled = 0, failure_prediction_enabled = 1,
> process_performance_data = 0, obsess_over_hosts = 0,
> obsess_over_services = 0, modified_host_attributes = 0,
> modified_service_attributes = 0, global_host_event_handler = '',
> global_service_event_handler = ''WHERE instance_id = 0]: 2006: MySQL
> server has gone away

Basically the same, then. Again, has it ever stated that it has
successfully connected to the MySQL server?

> 
> After that, the logs fill up with the same messages: handled ipc
> events, inbound data, and failed queries.
> 
> I restarted the merlind process; it ran an import, and after that it
> works fine again.
> 

Ok. In that case it's not a configuration error. Can you try using the
latest git snapshot (download it directly from git for simpler updates)
and see if that solves this particular problem?

The latest core code changes can be found in v0.6.2-beta11.

Thanks for your reports. I really appreciate them :-)

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

