Sysadmin etc: Time Machine Monitoring

Time Machine is great - just set it up and forget about it, and it backs up all your files automatically.

In theory.

In practice it can just stop, and not back up for no apparent reason. The backup disk could fill up. Someone could disable it. There could be errors. And my users don't necessarily report this, so I've got no way of knowing.

So, we need monitoring.

I found this post:
http://smoove-operator.blogspot.com/2010/09/monitoring-timemachine-backups-with.html
which grepped the logs for the end of a backup and uploaded this information to the Nagios server, which monitored it.

Interesting, I thought, but not quite what I'm after. Most of our Macs are desk bound, so they're on the office network. And they're switched off at night, generally, so any periodic job is likely to fail to run at the right moment.

All the script does is return a date stamp of the last backup done, so how about we use SNMP to return that to the monitoring server instead?

Easy enough to adapt the script:

#!/usr/bin/env ruby

# Get the last backup log line we have, with no trailing newline
last_backup = `/usr/bin/syslog -T sec -F '\$Time - \$Sender -\$Message' | grep backupd | grep 'Backup completed' | tail -1`.chomp

# Make sure it exists - print 0 if not
if !last_backup.empty?
  # Get the unix timestamp out of the last message
  backup_stamp = last_backup.split("-")[0].strip
  puts backup_stamp
else
  puts 0
end

First off, I've rewritten it in Ruby - just a personal (and company) preference. It now just outputs the seconds since the epoch of the last successful backup, or 0 if no completed backup is found in the log.
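To make the parsing step concrete, here is a small sketch of the same split applied to a single log line. The helper name extract_backup_stamp and the sample line are made up for illustration; the exact message text backupd writes may differ between OS X versions.

```ruby
# Illustrative helper (not part of the original script): pull the epoch
# seconds out of one line in the "$Time - $Sender -$Message" format.
def extract_backup_stamp(line)
  line = line.to_s.strip
  return 0 if line.empty?
  # Everything before the first "-" is the timestamp field.
  line.split("-", 2).first.strip.to_i
end

# Hypothetical sample line; the real message text may differ.
sample = "1331640000 - com.apple.backupd -Backup completed successfully.\n"
puts extract_backup_stamp(sample) # prints 1331640000
puts extract_backup_stamp("")     # prints 0
```

Returning an integer rather than a string makes it easier to compare against the current time later on.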

Stick it into /usr/local/bin/tm_check (making sure it's executable) and add this to /etc/snmp/snmpd.conf:

exec tm_check /usr/local/bin/tm_check

Start up snmpd:

sudo launchctl load -w /System/Library/LaunchDaemons/org.net-snmp.snmpd.plist

and the client side is ready to go.
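As a rough sketch of how the monitoring side could read that value back, the snippet below shells out to snmpget. It rests on several assumptions that are not from the setup above: that tm_check is the first exec entry (so its output appears at UCD-SNMP-MIB::extOutput.1), that SNMP v2c with a community string of "public" is allowed by snmpd.conf, and that the net-snmp command line tools are installed on the monitoring host. The function names are hypothetical.

```ruby
require "open3"

# Assumed OID: UCD-SNMP-MIB::extOutput.1, the output of the first "exec" entry.
EXT_OUTPUT_OID = ".1.3.6.1.4.1.2021.8.1.101.1"

# Extract epoch seconds from snmpget's value output; nil if there are no digits
# (for example when the agent answers "No Such Instance ...").
def parse_snmp_stamp(raw)
  match = raw.to_s.match(/\d+/)
  match ? match[0].to_i : nil
end

# Returns Integer epoch seconds, or nil if the probe failed or timed out.
# The community string default is an assumption, not taken from the post.
def probe_last_backup(host, community = "public")
  out, _err, status = Open3.capture3(
    "snmpget", "-v", "2c", "-c", community, "-Oqv", "-t", "5", "-r", "1",
    host, EXT_OUTPUT_OID
  )
  status.success? ? parse_snmp_stamp(out) : nil
rescue Errno::ENOENT
  nil # snmpget is not installed on this machine
end
```

Treating every failure as nil keeps "machine is off" and "probe broke" in the same bucket, which suits a check that deliberately ignores whether the desktop is up.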

We use Nagios to monitor our servers etc. However, a few caveats occurred to me when thinking about desktop machines:

1. I don't care if the machine is up or down. In fact, I really don't want to be in a position where that is recorded at all. It is too close to watching what the employee is doing, i.e. when they get in in the morning and leave at night. Not my job! So we need to stop host checks.

2. Similarly, if the SNMP probe doesn't return, then the machine is probably off, so let's not worry. So the check script records the last backup date in a file, and if the SNMP probe times out, the cached date from that file is returned. This is valid since the cached date is the worst case: the real last backup can only be the same or more recent.

3. Don't tell me by email and especially not SMS. We have a screen on the wall which shows current alerts (using NagLite) and any failures will be shown there. So, my host template has:

notification_options n

active_checks_enabled 0

included in it, and

notification_options n

in the service template.

Finally, the check script - I've just adapted one of the existing ones to give a framework and added in the snmpget to get the last date, thus it's still in Perl :) Get it here.

The cache directory is /var/cache/nagios3/tm_cache/ - there's one file per host (to prevent race conditions when updating the files).
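The actual check script is Perl and isn't reproduced here, but the cache fallback idea can be sketched in Ruby along these lines. The function names, the thresholds, and the choice to report UNKNOWN when a host has never been seen are my assumptions for illustration, not what the real script does.

```ruby
require "fileutils"
require "tmpdir"

# Return the freshest known backup time for a host, one cache file per host.
# probe_value: Integer epoch seconds from SNMP, or nil if the probe timed out.
def last_known_backup(host, probe_value, cache_dir)
  cache_file = File.join(cache_dir, host)
  cached = File.exist?(cache_file) ? File.read(cache_file).to_i : 0
  if probe_value && probe_value > cached
    FileUtils.mkdir_p(cache_dir)
    File.write(cache_file, probe_value.to_s)
    return probe_value
  end
  cached # worst case: the real last backup is this date or newer
end

# Map backup age to a Nagios exit code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.
def backup_status(stamp, now, warn_secs, crit_secs)
  return 3 if stamp.zero? # never seen a backup for this host (assumed policy)
  age = now - stamp
  return 2 if age >= crit_secs
  return 1 if age >= warn_secs
  0
end

# Demo in a temporary directory instead of /var/cache/nagios3/tm_cache/.
Dir.mktmpdir do |dir|
  last_known_backup("mac1", 1_331_640_000, dir) # probe answered, value cached
  stamp = last_known_backup("mac1", nil, dir)   # probe timed out, cache used
  # Three days old, with hypothetical 2 day warning and 4 day critical limits.
  puts backup_status(stamp, 1_331_640_000 + 3 * 86_400, 2 * 86_400, 4 * 86_400)
end
```

Only ever moving the cached value forward means a probe that returns 0 (for example after the log has rotated) doesn't wipe out a known good date.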

And that's about it!


Tuesday, 13 March 2012
