[op5-users] active checks & merlin

Andreas Ericsson ae at op5.se
Tue Sep 29 09:46:51 CEST 2009


On 09/28/2009 11:24 PM, Russell Jennings wrote:
> So it was a bit confusing/surprising, when I got merlin running,
> that active checks weren't actually active. Took me a bit to figure
> out that it's kinda a passive submittal. Is there a way to make the
> active checks active? Such as when merlin sees that nagios wants to
> execute a check on a hostgroup it controls, actively tell the
> responsible node(s) to execute that check and get back the data? THAT
> would be sweet. Would give me back the "reschedule the next check..."
> link in nagios, which I use in troubleshooting. If the NOC could
> control when the checks are executed like that, it would give the
> NOC more control over when it gets data. Just thought
> I'd throw this idea out.. but you dudes are pretty smart in my book,
> so I wouldn't be surprised if there's a why for this.
>

Why, thank you :-)

There is, sort of, but we didn't write it. DNX is a check scheduler
that uses slave nodes to do its actual work. I don't think you can
decide which node does which checks with DNX, but perhaps I'm mistaken.
I haven't looked at it for quite some time.

The idea is that commands should be distributed to the proper nodes
for handling, although that's not implemented yet. Also, since merlin
is a two-way communication thing, we need to make sure we know *which*
commands to send which way, and how to configure it.
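
To make the idea concrete, here is a minimal Python sketch of routing a
forced service check to the node responsible for a hostgroup. It is not
how merlin works (the feature is not implemented). The hostgroup-to-node
table, the node names and both function names are hypothetical; only the
SCHEDULE_FORCED_SVC_CHECK external command and its field order come from
Nagios.

```python
import time

# Hypothetical mapping of hostgroup -> responsible node.
# This is NOT merlin's configuration format, just an illustration.
NODE_FOR_HOSTGROUP = {"web-servers": "poller-1", "db-servers": "poller-2"}


def forced_check_command(host, service, when=None):
    """Format a Nagios SCHEDULE_FORCED_SVC_CHECK external command line."""
    when = int(time.time()) if when is None else int(when)
    # Fields: [submit time] COMMAND;host;service description;check time
    return f"[{when}] SCHEDULE_FORCED_SVC_CHECK;{host};{service};{when}"


def route(hostgroup, host, service, when=None):
    """Return (node, command line), or None if no node owns the hostgroup."""
    node = NODE_FOR_HOSTGROUP.get(hostgroup)
    if node is None:
        return None  # nothing to forward; the check would be handled locally
    return node, forced_check_command(host, service, when)
```

The open question from the paragraph above is still there: something
has to decide which commands travel in which direction, and the lookup
table is only a stand-in for that configuration.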

> Right now I have an active check that runs every 240 minutes that
> generates an unknown state. I'll scale it down in production to maybe
> 5 minutes or something (so long as it's bigger than the check_interval
> on the node). So, I figure that if it is able to stay in the
> unknown state for that long, after the max check and all is filled in,
> then there actually IS a problem. I don't know if there's a better /
> smarter way to handle this (so if you've got one, I'm all ears!)
>
> Though also, if I let a service stay unknown, and it exceeds the
> max_retry (so the status becomes something like 3/3), when it DOES get
> data back, it stays as 3/3, and doesn't reset back to 1/3. Not sure
> what to make of this, but it toyed with me, since it means that if I
> fix the issue and all, it still keeps it in 3/3 so the next unknown
> that comes in results in an alert immediately.
>

This depends. If the status of the check doesn't change from UNKNOWN
to something else, it should stay at 3/3. Note that the status of the
check has nothing whatsoever to do with what the plugin prints for
output; it is determined by the plugin's exit code.
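
A small Python sketch of that last point: the state comes from the
plugin's exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN), whatever
the text says. The function name and the "misleading" plugin are made up
for illustration, and mapping out-of-range exit codes to UNKNOWN is a
simplification of this sketch, not a statement about what Nagios does.

```python
import subprocess
import sys

# Standard Nagios plugin return codes.
STATES = {0: "OK", 1: "WARNING", 2: "CRITICAL", 3: "UNKNOWN"}


def state_from_plugin(argv):
    """Run a plugin command; return (state name, first line of its output)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    # The state is taken from the exit code only; the text is never parsed.
    # Codes outside 0-3 fall back to UNKNOWN (assumption of this sketch).
    state = STATES.get(proc.returncode, "UNKNOWN")
    lines = proc.stdout.splitlines()
    first_line = lines[0] if lines else ""
    return state, first_line


# Hypothetical plugin that prints "OK" but exits with code 3 (UNKNOWN).
misleading = [
    sys.executable,
    "-c",
    "import sys; print('OK - all good'); sys.exit(3)",
]

if __name__ == "__main__":
    print(state_from_plugin(misleading))
```

So a plugin that prints "OK" while exiting with 3 is still UNKNOWN as
far as the attempt counter is concerned.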

-- 
Andreas Ericsson                   andreas.ericsson at op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231

Considering the successes of the wars on alcohol, poverty, drugs and
terror, I think we should give some serious thought to declaring war
on peace.

