monitoring interview questions

Top 15 monitoring interview questions

Best practices for backup checking?

It is a common situation: an administrator sets up an automatic backup system and then forgets about it. Only after a system fails does the administrator notice that the backup system broke some time ago, or that the backups cannot be restored because of some fault, leaving no current backup to restore from. What are the best practices for avoiding such situations?
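To make the failure mode concrete, here is a minimal Python sketch of the kind of automated check that would catch a silently broken backup job. The directory layout, the thresholds, and the function names are assumptions for illustration. It only checks freshness and size; it does not prove that a backup can actually be restored, which would still need a periodic test restore.

```python
import os
import time


def newest_backup(directory):
    """Return the most recently modified regular file in directory, or None."""
    paths = [os.path.join(directory, name) for name in os.listdir(directory)]
    files = [path for path in paths if os.path.isfile(path)]
    return max(files, key=os.path.getmtime) if files else None


def check_backup(directory, max_age_seconds, min_size_bytes, now=None):
    """Return a list of problems; an empty list means the basic checks passed."""
    now = time.time() if now is None else now
    path = newest_backup(directory)
    if path is None:
        return ["no backup files found"]
    problems = []
    # A stale newest file suggests the backup job stopped running.
    if now - os.path.getmtime(path) > max_age_seconds:
        problems.append("newest backup is too old: " + path)
    # A tiny file suggests the job ran but produced an empty or truncated dump.
    if os.path.getsize(path) < min_size_bytes:
        problems.append("newest backup is suspiciously small: " + path)
    return problems
```

A cron job or monitoring check could call `check_backup` and raise an alert whenever the returned list is not empty.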

Source: (StackOverflow)

How can I see how much bandwidth each Apache Virtual Host is using?

I have Apache set up to serve several Virtual Hosts, and I would like to see how much bandwidth each site uses. I can see how much the entire server uses, but I would like more detailed reports.

Most of the things I have found out there are for limiting bandwidth to virtual hosts, but I don't want to do that; I just want to see which sites are using how much bandwidth.

This isn't for billing purposes, just for information.

Is there an apache module I should use? Or is there some other way to do this?
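One log-based approach, sketched here under assumptions: if a dedicated access log is written with a format such as `%v %b` (virtual host name and response size; `%O` from mod_logio would count bytes on the wire including headers), the totals can be summed per host afterwards. The two-column log layout and the function name are hypothetical.

```python
from collections import Counter


def bytes_per_vhost(lines):
    """Sum the byte counts per virtual host from '<vhost> <bytes>' lines."""
    totals = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        vhost, size = parts
        # %b logs "-" when no body was sent; treat that as zero bytes.
        totals[vhost] += int(size) if size.isdigit() else 0
    return totals


# Made-up sample lines in the assumed '<vhost> <bytes>' layout.
sample = [
    "example.com 5120",
    "blog.example.com 300",
    "example.com -",
    "example.com 880",
]
for vhost, total in bytes_per_vhost(sample).most_common():
    print(vhost, total)
```

Reading the real log would just mean passing an open file object instead of the sample list.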

Source: (StackOverflow)

How do I configure monit to start a process as a specific user?

Monit runs as root, but I don't want to start my processes (mysql, mongrel, apache and so on) as root.

Source: (StackOverflow)

Is Zabbix the right tool for me?

I just want to monitor a small handful of servers (fewer than 10).

From reading various places, it sounds like the leading contenders (for open source at least) are:

  • nagios
  • munin
  • zabbix

From what I have read a lot of people tend to use munin and nagios together -- munin for history and graphs, and nagios for alerting.

On the other hand it sounds like Zabbix is a more complete solution and easier to configure than either of the other two. So I was thinking of going that route.

My questions right now are:

  1. What are the general disadvantages of Zabbix?
  2. Does Zabbix have a small footprint on boxes it is monitoring?
  3. Do I really need to set up an entire separate server for it? I currently have a server that is under very light load -- can I dual-purpose it?

Source: (StackOverflow)

How to monitor and log the memory/cpu usage of processes over time? [closed]

I am looking for a way to diagnose issues such as swap death, where a process with ballooning memory (such as apache) fills up swap and kills the whole machine.

I'm already using cacti and I can set up nagios (though would rather not) or munin but as far as I can tell they can't record individual program usage - just overall status.

I know I can roll a script that appends (>>) to some file every 30s, but I'd like to see if a mature solution already exists.

Again, ideally it would:

  • record processes' memory usage every N seconds
  • record processes' CPU usage every N seconds
  • support charts and history
  • support averages - like mysqld has used 43% CPU in the last day and averaged 400MB memory
  • be free and open source

Process names are not and should not be known in advance - the idea is to just let it monitor and then have a look at the top offenders.

My system is Linux (OpenSUSE).
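For reference, the hand-rolled script mentioned above could look roughly like the following Python sketch. It assumes a Linux `/proc` filesystem and records resident memory only (no CPU usage, charts, or averages). The function names and the log line layout are illustrative assumptions, not an existing tool.

```python
import os


def parse_status(text):
    """Extract (name, rss_kb) from the text of /proc/<pid>/status."""
    name, rss_kb = None, 0
    for line in text.splitlines():
        key, _, value = line.partition(":")
        if key == "Name":
            name = value.strip()
        elif key == "VmRSS":
            rss_kb = int(value.split()[0])  # value looks like "  1234 kB"
    return name, rss_kb


def sample_processes(proc_root="/proc"):
    """Return {pid: (name, rss_kb)} for every process readable right now."""
    snapshot = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # only numeric directories are processes
        try:
            with open(os.path.join(proc_root, entry, "status")) as handle:
                snapshot[int(entry)] = parse_status(handle.read())
        except OSError:
            continue  # the process exited between listdir and open
    return snapshot


def top_offenders(snapshot, count=5):
    """Return the `count` largest processes by resident memory."""
    ranked = sorted(snapshot.items(), key=lambda item: item[1][1], reverse=True)
    return ranked[:count]


def log_once(path, snapshot, timestamp):
    """Append one 'timestamp pid name rss_kb' line per top offender."""
    with open(path, "a") as out:
        for pid, (name, rss_kb) in top_offenders(snapshot):
            out.write(f"{int(timestamp)} {pid} {name} {rss_kb}\n")
```

Calling `log_once(path, sample_processes(), time.time())` from a loop or cron job every N seconds gives a crude history without needing process names in advance.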

Source: (StackOverflow)

Monitoring production server [closed]

Hey all,
We have 3 dedicated servers, split into several VPSes using OpenVZ. We're using munin to monitor the VPSes that host the production sites, and monit on some of the VPSes to make sure services are restarted when they fail.

The thing is, we need a much better way to monitor all of our servers. Since we have up to 14 VPSes, we'd like to have a central hub where we could see not only the data collected by munin, but also extra stats on the network and the performance of our services.

Some of our requirements:
- SMS notification on failure (with the ability to set up certain custom checks).
- Log analyzer for the apache error_log and some other logs.
- Must be central (meaning one server and several nodes collecting the data).
- Doesn't need to be easy to install, but must be easy to maintain.
- Needs to be free.

I've been pointed to nagios and splunk. What do you think? Thanks,

Source: (StackOverflow)

Is there a Windows equivalent of Unix 'CPU steal time'?

In order to assess performance monitoring accuracy on virtualization platforms, the CPU steal time has become an increasingly relevant metric - see "EC2 monitoring: the case of stolen CPU" for an instructive summary in the context of Amazon EC2, and IBM's paper on CPU time accounting for a more in-depth technical explanation (including illustrations) of the concept:

Steal time is the percentage of time a virtual CPU waits for a real CPU while the hypervisor is servicing another virtual processor.

Accordingly, it is exposed in most related Unix/Linux monitoring tools nowadays - see e.g. columns %steal or st in sar or top:

st -- Steal Time
The amount of CPU 'stolen' from this virtual machine by the hypervisor for other tasks (such as running another virtual machine).

I've been unable to figure out how to capture the same metric on Windows, though. Is this possible already? (Ideally for the Windows Server 2008 R2 AMIs on EC2, and via the respective Windows Performance Counters, of course.)
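For comparison, this is roughly how the figure can be derived on Linux: the aggregate `cpu` line of `/proc/stat` carries cumulative tick counters, with steal as the eighth column, and the percentage is the steal delta over the total delta between two samples. The Python sketch below is illustrative only; the function names are assumptions, and it expects a kernel that reports the steal column.

```python
def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(value) for value in fields[1:]]


def steal_percent(before_line, after_line):
    """Percentage of elapsed CPU time reported as steal between two samples."""
    before, after = cpu_times(before_line), cpu_times(after_line)
    # Columns: user nice system idle iowait irq softirq steal [guest guest_nice].
    # Guest time is already counted inside user/nice, so only the first eight
    # columns are summed for the total.
    deltas = [new - old for old, new in zip(before[:8], after[:8])]
    total = sum(deltas)
    return 100.0 * deltas[7] / total if total else 0.0
```

The two input lines would be the first line of `/proc/stat` read a few seconds apart.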

Source: (StackOverflow)

What am I looking for in a Monitoring Solution?

This is a Canonical Question about Monitoring Software.

Also Related: What tool do you use to monitor your servers?

I need to monitor my servers; what do I need to consider when deciding on a monitoring solution?

Source: (StackOverflow)

Linux: logwatch(8) is too noisy. How can I control the noise level?

Our Linux systems run the logwatch(8) utility by default. On a RedHat/CentOS/SL system, Logwatch is called by the /etc/cron.daily/ cronjob, which then sends a daily email with the results. These emails have a subject like:

Subject: Logwatch for $HOSTNAME

The problem is that by default these daily emails are too noisy and contain a lot of superfluous information (HTTP errors, daily disk usage, etc.) that is already monitored by other services (Nagios, Cacti, central syslog, etc.). For 100 systems, the email load is unbearable. People ignore the emails, which means that we may miss problems that logwatch does pick up.

How can I reduce the amount of noise generated by logwatch, but still use logwatch to notify us of significant problems?

I'll post my own answer below, but I would like to see what others have done.

Note: I have a similar question regarding FreeBSD, at FreeBSD: periodic(8) is too noisy. How can I control the noise level?

Source: (StackOverflow)

Do SSDs support SMART?

S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology) is a wonderful technology for detecting hard drive failure before it actually happens.

But is S.M.A.R.T. relevant for SSDs?

Source: (StackOverflow)

Colorize Monitoring of Logs

I sometimes monitor apache and php error logs using tail under FreeBSD. Is there any way to get colorized output, either using tail or some other command line app?

Alternatively, what is your favorite way to monitor the various web-related logs in realtime?
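As an illustration of what is being asked for, a small filter can wrap each line in ANSI color codes and be fed from `tail -f`. The Python sketch below is only that; the keyword list and the names are hypothetical and would need adjusting to the actual log format.

```python
import sys

RESET = "\033[0m"
# Hypothetical keyword-to-color mapping; the first matching keyword wins.
COLORS = [
    ("error", "\033[31m"),   # red
    ("warn", "\033[33m"),    # yellow
    ("notice", "\033[36m"),  # cyan
]


def colorize(line):
    """Wrap a log line in the ANSI color of the first keyword it contains."""
    lowered = line.lower()
    for keyword, color in COLORS:
        if keyword in lowered:
            return color + line + RESET
    return line


def main():
    """Echo stdin to stdout line by line, adding color as lines arrive."""
    for line in sys.stdin:
        sys.stdout.write(colorize(line.rstrip("\n")) + "\n")
        sys.stdout.flush()
```

Saved to a file that calls `main()` at the bottom, this could be used as `tail -f error_log | python colorize.py` (the file name is an assumption).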

Source: (StackOverflow)

monit: check process without pidfile

I'm looking for a way to kill all processes with a given name that have been running for more than X amount of time. I spawn many instances of this particular executable, and sometimes it goes into a bad state and runs forever, taking up a lot of cpu.

I'm already using monit, but I don't know how to check for a process without a pid file. The rule would be something like this:

kill all processes named xxxx that have a running time greater than 2 minutes

How would you express this in monit?
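Outside of monit, the rule itself can be sketched in Python as below. This is not monit syntax. It assumes a `ps` that supports the `etimes` (elapsed seconds) column, as procps-ng on Linux does, and the function names are made up for illustration.

```python
import os
import signal
import subprocess


def long_running_pids(ps_output, name, max_seconds):
    """Return PIDs of processes called `name` that are older than max_seconds.

    Expects header-less text with three columns: pid, elapsed seconds, command.
    """
    pids = []
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        pid, elapsed, command = parts
        if command.strip() == name and int(elapsed) > max_seconds:
            pids.append(int(pid))
    return pids


def kill_long_running(name, max_seconds):
    """Send SIGTERM to every matching process; returns the PIDs signalled."""
    # Separate -o options with a trailing '=' suppress the header row.
    output = subprocess.check_output(
        ["ps", "-e", "-o", "pid=", "-o", "etimes=", "-o", "comm="], text=True
    )
    pids = long_running_pids(output, name, max_seconds)
    for pid in pids:
        os.kill(pid, signal.SIGTERM)
    return pids
```

For the rule above, a periodic job would call `kill_long_running("xxxx", 120)`.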

Source: (StackOverflow)

What is the difference between OpenTSDB and Graphite?

As far as I can tell, here are the main differences:

  1. OpenTSDB does not degrade data over time, unlike Graphite, where the size of the database is pre-determined.
  2. OpenTSDB can store metrics per second, as opposed to Graphite, which has minute intervals (I'm not sure of this: the Graphite docs show retention policies that store metrics every minute, but I don't know if this is the minimum unit of time we can play with).
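To illustrate why the Graphite database size is fixed up front, the sketch below works out how many points each archive holds for a retention string in the `precision:duration` form used in the Graphite docs. The example string `10s:6h,1m:7d` and the function names are assumptions chosen for illustration.

```python
# Seconds per unit suffix used in retention strings.
UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800, "y": 31536000}


def to_seconds(token):
    """Convert a token such as '10s' or '7d' to seconds."""
    return int(token[:-1]) * UNIT_SECONDS[token[-1]]


def parse_retentions(spec):
    """Return [(seconds_per_point, number_of_points), ...] for a retention string."""
    archives = []
    for part in spec.split(","):
        precision, duration = part.strip().split(":")
        step = to_seconds(precision)
        # Each archive holds a fixed number of slots: duration / precision.
        archives.append((step, to_seconds(duration) // step))
    return archives


for step, points in parse_retentions("10s:6h,1m:7d"):
    print(f"one point every {step}s, {points} points kept")
```

Because every archive has a fixed slot count, older data is overwritten or rolled up into the coarser archive rather than kept at full resolution.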

I want to make an informed decision about which tool to use to store metrics. Have I missed any other differences between these two systems? How performant and scalable are they?

Bonus Question: Is there any other time series system I should look at?

Source: (StackOverflow)

Get notification from supervisord when a job exits

Is there any way supervisord can automatically restart a failed/exited/terminated job and send me a notification email with a dump of the last x lines of the log file?
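One possible building block is supervisord's event listener mechanism, which passes `key:value` header and payload lines to a listener process. The Python sketch below covers only the parsing and the "last x lines" part. The helper names are hypothetical, the listener handshake and the actual email sending are left out, and the restart itself would depend on the program's `autorestart` setting.

```python
from collections import deque


def parse_tokens(line):
    """Turn 'key:value key:value' text into a dict."""
    return dict(token.split(":", 1) for token in line.split())


def tail(path, count):
    """Return the last `count` lines of a text file."""
    with open(path, errors="replace") as handle:
        # A bounded deque keeps only the final `count` lines in memory.
        return list(deque(handle, maxlen=count))


def should_notify(header, payload):
    """True for a process exit event that supervisord did not expect."""
    return (
        header.get("eventname") == "PROCESS_STATE_EXITED"
        and payload.get("expected") == "0"
    )
```

A listener would call `should_notify` for each event and, when it returns true, put the output of `tail` on the job's log file into the notification body.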

Source: (StackOverflow)

A better "top" command for Mac OS X? [closed]

The top command on OS X is pretty crappy. The one included with most Linux distros allows you to change the sort-by column using < and >, has a coloured mode (toggled by pressing the z key), and offers a bunch of other useful options.

Is there a replacement command line tool? Ideally I would like htop for OS X, but because it relies on the /proc/ filesystem (see this thread) it has not been ported (and probably never will be).

The obvious answer is "Activity Monitor", but I'm looking for a command line tool!

Source: (StackOverflow)