[op5-users] Ninja backend and architecture

Matthias Flacke Matthias.Flacke at gmx.de
Tue Mar 31 00:50:12 CEST 2009


Hi Andreas & Johannes,

thanks for your comprehensive answers! Comments and more questions
following inline:

Andreas Ericsson wrote:
> Yes, we're using the Merlin module, which was originally designed
> as an event transport module/daemon pair, much like NDOUtils. We
> found that the NDO database scheme made it scale very poorly, so
> we had to design our own. The idea is to make Nagios scale to
> tens of thousands (or hundreds of thousands) hosts, 

Can you explain your decision a bit more? We all know about the
constraints and weaknesses of NDO. But was performance the only
reason to create a proprietary DB model and abandon compatibility
with all the addons based on NDO?

> which becomes
> rather simple since the Merlin module allows events to be sent not
> only from the Nagios daemon on the same host, but also over the
> network. The protocol is open, so other applications can use it
> to transfer data to a merlin daemon as well.

Is there some specification for this protocol (besides the C header
files ;))?

From your NEB presentation in Nuremberg I understood Merlin as a
means to exchange events between several Nagios instances. What I
didn't get: who is the real player, Nagios or Merlin? If all the
monitoring objects are instantiated in the Nagios instances, how do
you achieve goals like
- load balancing
- redundancy / failover
- configuration synchronization
via Merlin? Do you hijack the Nagios scheduling to some extent?

BTW - if you combine multiple Nagios instances for redundancy and
failover, does this also include the DB backend? Or is there a
single DB that has to be secured in the classical sense of
clustering and HA?

> This'll be hard without some means of drawing stuff, but here's
> what happens when an event is triggered inside Nagios:
> * Nagios calls the eventbroker module part of Merlin.
> * merlinmod creates a merlin packet from the event and
>  - transfers it to the merlin module if the connection is live
> or
>  - writes it to a backlog file for later feeding to the unix
>    domain socket connecting the module to the daemon

Backlog is a nice idea, since Nagios lives in the present and does
not care about results it never received once the retry_intervals
have passed. But is Nagios capable of dealing with such events from
the past? How does that fit into concepts like timeperiods, flapping,
reporting, performance metering etc.?
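
To make the quoted send-or-backlog step concrete, here is a minimal
sketch of how such a forwarder could behave. It is not Merlin's code:
the packet framing (4-byte length prefix plus JSON body), the names
encode_packet and EventForwarder, and the single flat backlog file are
all assumptions made for illustration. The "ts" field is likewise
hypothetical; it only shows where an original event time could travel.

```python
import json
import os
import struct
import time


def encode_packet(event_type, payload, timestamp=None):
    """Frame an event as 4-byte big-endian length + JSON body.

    Hypothetical framing for illustration, NOT Merlin's wire format.
    """
    body = json.dumps({
        "type": event_type,
        "ts": time.time() if timestamp is None else timestamp,
        "data": payload,
    }).encode("utf-8")
    return struct.pack("!I", len(body)) + body


class EventForwarder:
    """Send packets over a live link, or append them to a backlog file."""

    def __init__(self, backlog_path, sock=None):
        self.backlog_path = backlog_path
        self.sock = sock  # connected socket-like object, or None if down

    def forward(self, packet):
        """Return 'sent' if the link took the packet, else 'backlogged'."""
        if self.sock is not None:
            try:
                self.sock.sendall(packet)
                return "sent"
            except OSError:
                self.sock = None  # treat the link as down from now on
        with open(self.backlog_path, "ab") as fh:
            fh.write(packet)
        return "backlogged"

    def flush_backlog(self, sock):
        """Replay backlogged packets in order on a fresh connection.

        Returns the number of packets replayed.
        """
        count = 0
        if os.path.exists(self.backlog_path):
            with open(self.backlog_path, "rb") as fh:
                data = fh.read()
            offset = 0
            while offset < len(data):
                (length,) = struct.unpack_from("!I", data, offset)
                end = offset + 4 + length
                sock.sendall(data[offset:end])
                offset = end
                count += 1
            os.remove(self.backlog_path)
        self.sock = sock  # only adopt the link once the replay succeeded
        return count
```

In this sketch every packet carries the time at which the event was
created, so a receiver could tell a replayed event from a live one.
Whether Merlin or Nagios actually uses such a timestamp is exactly the
open question above.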

>> - Is it perhaps planned to add some kind of an API to structure the
>> interaction between GUI and backend?
>>
> 
> Yes. We have to do that since a new status map will also be included
> in the GUI, and the people writing the status map are contracted and
> shouldn't have to worry about database layout and things like that.
> We're aware of possible changes that may need to be done to the db
> layout later, and we want a stable API that everyone can use to pull
> stuff out of the database. Per Åsberg knows more about that, as I've
> only been working on the datafeeding part.

I would appreciate some more info about that API, which is IMO
more important than bells and whistles in the GUI ;) Does it only
cover data retrieval and manipulation, or will it also cover things
like event handling, command queuing, messaging etc.?

> I expect to work quite a
> bit on the API once the Merlin stuff is about 90% complete though.

Good news ;) One last question: is it already time to start playing
around with Merlin? AFAIR you stated somewhere that it's not yet
production stable...

Cheers,
-Matthias

