[op5-users] Me trying to test Merlin and Ninja
Mattias Ryrlén
mattias.ryrlen at op5.com
Thu Jul 30 14:01:14 CEST 2009
On Thu, Jul 30, 2009 at 12:13 PM, Johannes Dagemark<jd at op5.com> wrote:
> Hi Mathieu
>
> Thanks for your extensive testing and feedback, it's really appreciated.
>
> Andreas is on vacation and back on Tuesday next week, but I'm going to
> answer the parts I can.
>
> The distributed stuff is very limited in Merlin right now, we have been
> focusing on the status info -> db parts since it's needed for Ninja to work.
>
> Nonetheless, the issues you posted are a great compilation of Merlin
> problems. I have posted them to our Mantis bugtracker, which basically
> means that sooner or later we will take action on them. I now also really
> have to focus on making our Mantis installation available to the public
> to make it easier to follow the progress :)
>
> Mathieu Gagné wrote:
>> Hi,
>>
>> This message is kind of a follow-up on my first message on Nagios-devel:
>> "My impression about Merlin and Ninja".
>>
>> I tried several kinds of setups:
>>
>> 1) NOC + Poller (on same server)
>> - Everything on the same server: NOC + Poller
>> - All hosts/services definitions on the same server
>>
>> 2) NOC + Poller
>> - Identical hosts/services definitions on all servers. (NOC + Poller)
>> - With a hostgroup telling which hosts the poller is responsible for.
>>
>> 3) NOC + 2 x Pollers (independent)
>> - NOC has all hosts/services definitions
>> - Pollers have only the hosts/services definitions they are responsible for.
>> - Hosts are organized in 2 hostgroups, one for each Poller. Hostgroups
>> are configured on the NOC and both Pollers.
>>
>> 4) NOC + 2 x Pollers (independent) and one Poller having a Peer
>> - NOC has all hosts/services definitions
>> - Pollers have only the hosts/services definitions they are responsible for.
>> - Hosts are organized in 2 hostgroups, one for each Poller/Peer.
>> Hostgroups are configured on the NOC and both Pollers/Peer.
>>
>>
>> My hope is to be able to handle +40k hosts and around 100k services
>> (maybe more) with ~40 pollers. (depending on the overall performance)
>>
>> I'm not the best QA in town, however here is what I found while trying
>> to use/test Merlin. Please tell me if I'm doing anything wrong.
>>
>>
>> 1) Merlin crashes with this error when I try to stop it on the NOC:
>> *** glibc detected *** /usr/local/merlin/merlind: free(): invalid
>> pointer: 0x08307b68 ***
>>
>> Any idea about how I can debug and pin-point the source of the problem?
>>
>>
> I have to wait for Andreas for this one.
>> 2) Local commands are not transported between NOC, Pollers and Peers.
>>
>> If I disable notification on a service on a Poller, the database gets
>> updated and I can see notifications are disabled in Ninja. However it
>> does not get updated on the NOC nor the Peer. Also, the database gets
>> updated by the NOC or the Peer later on and notifications are now back
>> to "enabled" in Ninja.
>>
>> It's the same for notifications, active checks and probably all other
>> local commands.
>>
>>
> Reported in Mantis and will be fixed
>> 3) Active checks are scheduled on every server (NOC, Poller and Peer)
>>
>> Active checks for hosts and services are scheduled on every server:
>> NOC, Poller and Peer.
>>
>> This results in a constant "conflict" between all Nagios instances.
>> Isn't the Poller supposed to be the only one scheduling checks or am I
>> missing something? Did I misconfigure everything?
>>
>> If it's normal behavior, what's the purpose of Merlin then? If the NOC
>> still schedules every host/service the poller is supposed to take care
>> of and propagates check results to everyone, what's the purpose of all this?
>>
> It does not sound normal... I have reported it and will check with Andreas.
>> 4) Question: Which server is responsible for sending notifications?
>>
>> Will I get 3 notices because the NOC, Poller and Peer decided they
>> should send the notification?
>>
>>
> I would say that it depends on how you would like it to be. In some
> cases it makes sense to have only the NOC send out notifications and
> in other cases you would like the pollers to send notifications as well.
> How to do this is not really clear right now, but I would like it if the
> user could simply enable and disable notifications per poller from the
> GUI as usual. We need to make sure that Merlin can handle this though.
>> 5) Which server has authority over the database?
>>
>> When I start the NOC, all hosts/services definitions are exported to the
>> database. Everything is fine in Ninja.
>>
>> However, if I start/restart a Poller, only the hosts/services the poller is
>> responsible for get exported to the database and everything else is
>> dropped. Am I missing something?
>>
> The starting point is that the NOC should be responsible for updating
> the database. Also something that we need to figure out.
>>
>> 6) STDOUT, STDERR and STDIN should be closed when daemonizing.
>>
>> Otherwise I'm getting logs on my console while I'm actually trying to
>> work on it...
>>
>>
> agreed and reported.
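For reference, the usual way to handle point 6 is to point the three standard descriptors at /dev/null after forking, rather than just closing them. Below is a minimal sketch of that step only; redirect_stdio_to_devnull() is a hypothetical helper name and this is not Merlin's actual code.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper, not taken from the Merlin sources.
 * Points stdin, stdout and stderr at /dev/null so a daemon stops
 * writing to the terminal it was started from.
 * Returns 0 on success, -1 on error. */
int redirect_stdio_to_devnull(void)
{
    int fd = open("/dev/null", O_RDWR);
    if (fd < 0)
        return -1;

    if (dup2(fd, STDIN_FILENO) < 0 ||
        dup2(fd, STDOUT_FILENO) < 0 ||
        dup2(fd, STDERR_FILENO) < 0) {
        if (fd > STDERR_FILENO)
            close(fd);
        return -1;
    }

    /* Only close the extra descriptor if it is not one of 0, 1 or 2. */
    if (fd > STDERR_FILENO)
        close(fd);
    return 0;
}
```

Redirecting instead of closing keeps descriptors 0-2 occupied, so a later open() or socket() cannot accidentally end up as "stdout".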
>> 7) Split-brain situations
>>
>> In the event the NOC and the Poller can't talk to each other but both
>> can talk to the database, this would cause a split-brain situation.
>>
>>
> yepp, I reported this as well.
>> 8) Latency
>>
>> There's a latency of ~3 seconds before the NOC or Poller gets updated with
>> the other's status.
>>
>> This is just an observation, no real problem here.
>>
>>
> I have no comment here and no idea what's causing it.
>> 9) Sources are a mess! :-O
>>
>> Sources should be better organized. Just take a look at Nagios or
>> NDOutils, everything is organized in folders: includes, src, etc.
>>
>> There's also a lack of comments in the sources. If you want people to
>> contribute and help with Merlin, they should be able to understand what
>> the code is doing.
>>
>> A great example is the source code of the ARC in ZFS:
>> http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/fs/zfs/arc.c#3647
>>
>> Although it is an extreme example, it shows how good sources can/should
>> be documented.
>>
>>
> Agreed. I started out as a sysadmin, and when downloading code I simply
> hate bad structure; it gives an instant feeling of bad quality. Andreas
> has been working pretty hard on implementing all the new functionality in
> Merlin, so I'm inclined to be lenient about this. We will fix this, but
> a patch is always welcome :)
>> 10) NOC port in merlin.conf
>>
>> Merlin should bail out if no port is specified in merlin.conf on a Poller
>> for noc{}, otherwise it seems Merlin will try to connect to TCP port 0
>> on the NOC and fail miserably every second. Make it mandatory or use a
>> default one.
>>
>> It seems it's not an issue for poller{} or peer{}. Can someone confirm?
>>
>>
> reported, will check and fix it if it's an issue.
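To make the two options in point 10 concrete (fall back to a default, or refuse an unusable value), here is a minimal sketch. resolve_port() and DEFAULT_MERLIN_PORT are hypothetical names; 15551 is only assumed as the default because it is the port used in the example config in this thread.

```c
#include <errno.h>
#include <stdlib.h>

/* Assumption: 15551 is borrowed from the example config in this thread. */
#define DEFAULT_MERLIN_PORT 15551

/* Hypothetical helper, not taken from the Merlin sources.
 * Returns a usable TCP port (1..65535), the default when no value
 * was configured, or -1 when the configured value is invalid. */
int resolve_port(const char *value)
{
    char *end = NULL;
    long port;

    /* No "port" setting at all: use the default instead of port 0. */
    if (value == NULL || *value == '\0')
        return DEFAULT_MERLIN_PORT;

    errno = 0;
    port = strtol(value, &end, 10);
    if (errno != 0 || end == value || *end != '\0' ||
        port < 1 || port > 65535)
        return -1;

    return (int)port;
}
```

A caller would treat -1 as a fatal configuration error at startup instead of retrying a connection to port 0 every second.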
>> 11) Ninja shows active checks disabled on host when only a service has
>> active checks disabled
>>
>> Try disabling active checks on a service. You will see Ninja reporting
>> active checks being disabled on both host and service when only active
>> checks on the service are disabled.
>>
>>
> I'm actually not able to do any external commands right now :( so I have
> not been able to verify it. I did report it though and we will fix this.
>> 12) Files organization in Merlin
>>
>> Binary should be installed in /usr/local/merlin/bin
>> Configuration file should be installed in /usr/local/merlin/etc
>> Etc...
>>
>>
> yepp, or perhaps $nagiosdir/addons/merlin/[bin,etc]
>
>> 13) Configuration file's name and syntax
>>
>> To be consistent with Nagios, NRPE and NDOutils, configuration file
>> should be named merlin.cfg.
>>
>> Syntax in merlin.cfg seems to be very permissive. Several syntaxes are
>> supported:
>> => key = value
>> => key = value;
>> => key value
>> => key value;
>>
>> Which one is recommended? README and example.conf use both "key = value"
>> and "key = value;" in their examples which could be confusing.
>>
>>
> Agree and reported.
When it comes to the merlin.conf file, you should use one of the two
options you mention, "key = value" with or without the trailing ;

poller nagios-poller-1 {
    address = x.x.x.x
    port = 15551
    hostgroup = poller1
}

or

poller nagios-poller-1 { address = x.x.x.x; port = 15551; hostgroup = poller1; }
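To illustrate how a permissive parser can accept all four spellings listed in point 13, here is a rough sketch that splits a single setting into key and value. parse_setting() is a hypothetical function and this is not how Merlin's own parser is implemented; it handles one setting at a time, so a one-line block would first have to be split on ';'.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from the Merlin sources.
 * Accepts "key = value", "key = value;", "key value" and "key value;".
 * Returns 0 on success, -1 if key or value is missing or too long. */
int parse_setting(const char *line, char *key, size_t key_len,
                  char *value, size_t value_len)
{
    const char *p = line, *start, *end;
    size_t n;

    /* Key: first run of characters up to whitespace or '='. */
    while (isspace((unsigned char)*p))
        p++;
    start = p;
    while (*p && !isspace((unsigned char)*p) && *p != '=')
        p++;
    n = (size_t)(p - start);
    if (n == 0 || n >= key_len)
        return -1;
    memcpy(key, start, n);
    key[n] = '\0';

    /* Separator: whitespace with an optional '='. */
    while (isspace((unsigned char)*p))
        p++;
    if (*p == '=')
        p++;
    while (isspace((unsigned char)*p))
        p++;

    /* Value: rest of the line minus trailing whitespace and one ';'. */
    start = p;
    end = start + strlen(start);
    while (end > start && isspace((unsigned char)end[-1]))
        end--;
    if (end > start && end[-1] == ';')
        end--;
    while (end > start && isspace((unsigned char)end[-1]))
        end--;
    n = (size_t)(end - start);
    if (n == 0 || n >= value_len)
        return -1;
    memcpy(value, start, n);
    value[n] = '\0';
    return 0;
}
```

With this, "port = 15551", "port = 15551;", "port 15551" and "port 15551;" all yield the key "port" and the value "15551".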
>> 14) Hard-coded value in daemon.c
>>
>> This value is hard-coded in daemon.c, which isn't great:
>>
>> static char *import_program = "php
>> /home/exon/git/monitor/merlin/import.php";
>>
>>
> Not going to argue about that :)
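One way to get rid of the hard-coded path in point 14 would be to build the command from a configurable script path with a fallback. The sketch below is only an illustration: build_import_command() is a hypothetical helper, and the default path is an assumption based on the /usr/local/merlin install directory mentioned earlier in this thread.

```c
#include <stddef.h>
#include <stdio.h>

/* Assumption: install prefix taken from /usr/local/merlin/merlind above. */
#define DEFAULT_IMPORT_SCRIPT "/usr/local/merlin/import.php"

/* Hypothetical helper, not taken from the Merlin sources.
 * Writes "php <script_path>" into buf, using the default path when
 * none is configured. Returns 0 on success, -1 if buf is too small. */
int build_import_command(char *buf, size_t len, const char *script_path)
{
    int n;

    if (script_path == NULL || *script_path == '\0')
        script_path = DEFAULT_IMPORT_SCRIPT;

    n = snprintf(buf, len, "php %s", script_path);
    if (n < 0 || (size_t)n >= len)
        return -1;
    return 0;
}
```

The script path could then come from a setting in merlin.conf instead of a developer's home directory.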
>> 15) Status durations are not transported between NOC, Poller and Peer.
>>
>> Although statuses are shared between them, the status duration isn't, which
>> could be confusing if people rely on the CGI interface to get their
>> information.
>>
>>
> reported
>> 16) Inconsistency in merlin.conf
>>
>> You specify "address" in the main section but need to specify "port" in
>> daemon{}. Only the daemon listens on "address", so putting it in daemon{}
>> would be the right thing to do. No?
>>
>> "ipc_socket" is used by the daemon and the module, right? Keeping it in
>> the "main" section would be OK, I think.
>>
>>
> reported
>> 17) Folder name in archive
>>
>> When I extract the archive at
>> "http://www.op5.org/op5media/op5.org/downloads/merlin-0.6.2-beta2.p1.tar.gz"
>> I get a folder named merlin/.
>>
>> Would it be possible to name it merlin-0.6.2-beta2.p1/ so I can download
>> multiple versions without getting name conflicts?
>>
>>
> my bad, fixed now
>> 18) Typo in Ninja
>>
>> Typo here:
>> ./ninja/application/views/themes/default/icons/16x16/nofity-disabled.png
>>
>> Views should be fixed and the file renamed.
>>
>>
> This was fixed already in
> http://git.op5.org/git/?p=nagios/ninja.git;a=commitdiff;h=3b264d50c422381689dc247fd0751f003564b591
>
>> That is all for now. Do not hesitate to ask questions if I'm not clear. :)
>>
>>
> Once again, thanks! :)
>
> A lot of the issues you found are related to Merlin not being really
> ready when it comes to distributed stuff. We have discussed a lot about
> Merlin, what it should do, how it should behave, etc and now when people
> actually try to use it outside of our office it gets a lot easier to see
> what's important, what we missed and what's not so important.
>
> Most likely you will get a mail from Andreas next week giving some more
> answers and probably asking some questions.
>
> Until then,
>
> Take care
> /Johannes
>> --
>> Mathieu
>> _______________________________________________
>> op5-users mailing list
>> op5-users at lists.op5.com
>> http://lists.op5.com/mailman/listinfo/op5-users
>>
>
>
> --
> Johannes Dagemark
> CTO / VP Engineering
> ________________________________________
>
> op5 AB
> Första Långgatan 19
> SE-413 27 Gothenburg
> cell: +46 733-70 90 24
> fax: +46 31-774 04 32
> Email: jd at op5.com
> http://www.op5.com/
>
--
Vänliga hälsningar / Best Regards
Mattias Ryrlén
__________________________
op5 AB
Första Långgatan 19
SE-413 27 Göteborg
Mobil: +46 735-17 70 99
Support: +46 31-774 09 24
www.op5.com