• 12Apr
    Author: ben Categories: Infrastructure Comments: 4

    At Applied Trust we run Nagios, the excellent open source monitoring system, to ensure the availability of our internal system infrastructure and that of our clients. The system monitors roughly 2,740 services on 468 hosts. It has performed flawlessly for the past six or seven years, but over the last few months we began experiencing performance problems (see the 12 month CPU usage graph below). In particular, we were being alerted regularly about excessive load on the monitoring system and, as a consequence of high load, false positive timeouts when connecting to services.

    12 month CPU usage

    12 month CPU usage

    We started debugging the problem by identifying which resource in particular was causing load problems. Using the iostat command, we noticed that both I/O and wait User CPU time seemed consistently high:

    $ iostat 2 30
     Linux 2.4.21-4.ELsmp ([snip])        03/03/2009
    
     avg-cpu:  %user   %nice %system %iowait   %idle
               30.77   0.00  5.29    12.24     51.71

    High I/O wait can be caused by a variety of problems. Several layers of complexity are eliminated in our case since the monitoring system uses local storage (e.g. not attached to a SAN/NAS/some other remote storage system). However, the storage is configured in RAID1 with Linux software RAID, and dmesg revealed a problem with the mirror:

    md: Autodetecting RAID arrays.
     [events: 0000002a]
     [events: 00000033]
    md: autorun ...
    md: considering sdb3 ...
    md:  adding sdb3 ...
    md:  adding sda3 ...
    md: created md0
    md: bind<sda3,1>
    md: bind<sdb3,2>
    md: running: <sdb3><sda3>
    md: sdb3's event counter: 00000033
    md: sda3's event counter: 0000002a
    md: superblock update time inconsistency -- using the most recent one
    md: freshest: sdb3
    md: kicking non-fresh sda3 from array!
    md: unbind<sda3,1>
    md: export_rdev(sda3)
    md: RAID level 1 does not need chunksize! Continuing anyway.
    md0: max total readahead window set to 124k
    md0: 1 data-disks, max readahead per data-disk: 124k
    raid1: device sdb3 operational as mirror 1
    raid1: md0, not all disks are operational -- trying to recover array
    raid1: raid set md0 active with 1 out of 2 mirrors
    md0: no spare disk to reconstruct array! -- continuing in degraded mode
    md: ... autorun DONE.

    A consistency problem on the sda3 partition degraded the mirror. To determine if the drive was bad, we used smartctl tool:

    $ sudo /usr/sbin/smartctl -Hc /dev/sda3
    smartctl version 5.1-11 Copyright (C) 2002-3 Bruce Allen
    Home page is http://smartmontools.sourceforge.net/ [^]
    
    SMART Sense: Ok!
    Current Drive Temperature:     31 C
    Drive Trip Temperature:        68 C

    The drive seems fine, so we added the partition back to the array with mdadm /dev/md0 –add /dev/sda3 and monitored the rebuild process with watch ‘cat /proc/mdstat’. Unfortunately, after the array was rebuilt and mdadm -D /dev/md0 showed both drives as active, the load problems persisted. This was unlikely to fix the problem anyway since a RAID mirror has very little overall impact on I/O performance.

    Since the storage subsystem seemed sane, we took a step back and looked at the application layer. Using lsof, it appeared that the serviceperf.log file, which holds performance metric data gathered along with service monitoring commands, was showing constant usage:

    $ sudo /usr/sbin/lsof /usr/local/nagios
    COMMAND     PID   USER   FD   TYPE DEVICE      SIZE   NODE NAME
    syslog-ng  6023   root    5w   REG   58,3   8654863  81927 /usr/local/nagios/syslog/10.34.34.59.log
    nagios     6404 nagios  txt    REG   58,3    521120  32782 /usr/local/nagios/bin/nagios
    nagios     6404 nagios  txt    REG   58,3   8654863  81927 /usr/local/nagios/syslog/10.34.34.59.log
    nagios     6404 nagios  txt    REG   58,3 513157004  22182 /usr/local/nagios/var/serviceperf.log
    perl       6663   root  cwd    DIR   58,3      4096  32769 /usr/local/nagios/bin
    tail       6664   root  cwd    DIR   58,3      4096  32769 /usr/local/nagios/bin
    bash      11802  chris  cwd    DIR   58,3      4096      2 /usr/local/nagios
    atreporte 16000 apache  cwd    DIR   58,3      4096 180289 /usr/local/nagios/reporter
    atreporte 16000 apache    3r   REG   58,3     15152 180230 /usr/local/nagios/reporter/cgi-lib.pl

    This make sense since most of our service checks collect performance data for use with rrdtool for graphing and capacity planning purposes. The nagios.cmd file also appeared regularly in the lsof list (though not depicted above), but since it’s a named pipe it does not have an impact on I/O wait time. To determine if the serviceperf.log was the true perpetrator, we removed performance processing from the Nagios configuration for a few hours. Load dropped dramatically, so we needed to reduce the time it took to read and write to the serviceperf.log file.

    Our first idea was to add the noatime and nodiratime mount option to the filesystem that the serviceperf.log file was on (in our case, /usr/local/nagios) to eliminate inode access time updates, which can have a small impact on filesystem performance. This is as simple as executing sudo mount -o remount,noatime,nodiratime. After remounting and monitoring for a few hours, the load did not drop significantly.

    The next tactic we tried was targeted at the other problem we identified: high user CPU time. Our graphs of CPU over the past year (see above) on the monitoring system indicated that user CPU was growing slowly but steadily. The blue line representing User CPU grew from 15% to 30% over a year. Anything running in user space, such as Nagios service checks, will all be classified as user CPU. At some point in the past year we may have added enough service monitoring that it finally impacted user CPU. Since many of our monitored services rely on a custom Perl script that would certainly execute in User mode, we tried migrating to the Nagios embedded Perl interpreter, which adds a number of advantages.

    In the process of migrating, we ran in to a goofy problem compiling Nagios with embedded Perl that caused debug messages to be printed. A number of other users reported that adding #undef DEBUG to the file include/epn_nagios.h fixed the problem. We compiled with ./configure –prefix=/usr/local/nagios –enable-embedded-perl –with-perlcache –with-command-group=nagios and waited. Although all of our custom Perl executed succesfully, the embedded interpreter did not have an impact. Back to the drawing board.

    We decided on two tactics for next steps: reduce the frequency between service checks (fewer over all checks would reduce the number of processes consuming user CPU time) and move the serviceperf.log file to a tmpfs filesystem. For most of our service checks we had previously used intervals between 2 and 5 minutes. We moved to a 10-15 minute interval for most checks. The serviceperf.log data is not useful once it is read and committed to an RRD file, so moving it to an in-memory file system is not risky. We installed memory in the server, created and mounted the tmpfs, and move the serviceperf.log to the new filesystem. These two steps finally solved our load problems. Compare the one week graph below to the 12 month graph above. It’s not hard to see when the changes were put in place.

    1 week CPU usage

    1 week CPU usage

    Hopefully these troubleshooting steps are useful for someone else with a busy Nagios server.

    [Slashdot] [Digg] [Reddit] [del.icio.us] [Technorati] [StumbleUpon]

4 Responses

WP_Floristica
  • Hendrik Says:

    I found something new for me! Thanks!

  • Kumaran Says:

    I think moving the service check timings to an interval of 10-15mins would have had an impact in improved performance. 10-15 mins can be ideal for a few devices but not something that I would recommend in a more dynamic environment. For example, for the WAN infras that we configure nagios 5mins checks themselves are too big an interval as a network/interface flap for a couple of mins can cause serious issues as a lot of HA systems would kickin. So, sooner the engineers know is better. And sometimes you may miss a whole flap completely in these 10-15mins interval

    CPU 30-50% isn’t an impact again.

  • ben Says:

    @Kumaran – Yes, I absolutely agree. Some service checks need to be more granular (2-5 minutes) and others can be monitored on the 10-15 minute scale. It really depends on the service’s purpose and the amount of downtime that can be tolerated. Ultimately, I’d say the interval should be up to the service owner.

  • Mike Kniaziewicz Says:

    Great article Ben! I like the way you draw the picture then present solutions.

    We noticed an increase with our load averages over a period of time. My enterprise monitors over 1700 hosts and 3100 services and is growing. We developed the same type of solutions you propose in your article.

    Performing a “lsof” we were able to identify directories being hit the hardest, so we placed those directories on SAN disks, which reduced the load dramatically. Directories included /usr/lib/mysql and /var/opt/nagios.

    To help reduce the load even further we developed time periods for monitoring certain aspects of the enterprise. We only monitor development and test servers during business hours. We also only monitor services that impact production during the night time hours. We found that receiving a notification at 2 AM about a high CPU was not the answer, so we track these problems then work on solutions the following morning.

    You may also want to turn-up monitoing on systems that the end users are reporting problems. You can create custom templates and even call them problem_child.cfg. You would add hosts to this template only when a need occurs to step up the monitoring.

    Thanks Ben for the article and I found it quite insightful. Keep up the fine work.

Leave a Comment

Please note: Comment moderation is enabled and may delay your comment. There is no need to resubmit your comment.