Sunday, 18 March 2012

CouchDB Re-Index

On Friday we discovered that our CouchDB instances had inconsistent results for queries to their indexes, i.e. the same query on two different servers gave different results on the same data.

Checking the logs revealed that one of them had problems in the index.

Reindexing most of our databases takes a few minutes; however, one of them has about 5 million documents in it, and that takes a good 12 hours to re-index completely.

So re-indexing a CouchDB database takes 4 steps:

1. Take it offline - move the work to another one in the cluster
2. Delete the current indexes - they live in a sub-directory of the CouchDB data directory, named after the database with a leading . and _design appended. Renaming is another option for the faint-hearted (there's a rough sketch of this after the list)
3. Restart CouchDB, so it notices that the indexes have gone
4. Access one view for each design document - you only need to do one, since it will index against all the views at once.
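
For step 2, the rename approach boils down to something like this - a minimal sketch, assuming the default data directory of /var/lib/couchdb and a database called "databasename" (adjust both for your install):

#!/usr/bin/env ruby
# Move a CouchDB view index directory out of the way so it gets rebuilt.
# The data directory and database name below are assumptions - adjust to taste.
require 'fileutils'

data_dir  = "/var/lib/couchdb"
db        = "databasename"
index_dir = File.join(data_dir, ".#{db}_design")

if File.directory?(index_dir)
  FileUtils.mv(index_dir, "#{index_dir}.old-#{Time.now.to_i}")
  puts "moved #{index_dir} - restart CouchDB and hit a view to rebuild"
else
  puts "no index directory found at #{index_dir}"
end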

I've written a script which automates step 4. One of the issues with CouchDB replication is that a replicated document doesn't update the view (unlike a directly saved document), so I have a script which pokes each view every minute or so to stop any staleness building up. We use CouchRest Model, so it uses those classes to do the accessing. This means the script is a little specific to us, but I hope you'll get the idea.


#!/usr/bin/env ruby
# poke all the classes in the database
require 'couchrest'
require 'couchrest_model'
require 'will_paginate'
require 'will_paginate_couchrest'
SERVER = CouchRest.new("http://localhost:5984")
DB     = SERVER.database!("databasename")
require '/opt/local/apps/couchrestclasses.rb'
while true do
  puts "waking .... Starting CouchrestClass"
  begin
    # touching a view via the model class forces the index to catch up
    blah = CouchrestClass.all(:limit => 10)
  rescue
    # the query can error (e.g. time out while the index is rebuilding) - just carry on
    puts "CouchrestClass Done"
  end

  # repeat the above once for each couchrest class

  puts "done... sleeping"
  sleep 20
end
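
If you're not using CouchRest Model, the same poking can be done against CouchDB's plain HTTP API by walking the design documents and hitting one view from each. Here's a rough, untested sketch using just the Ruby standard library - the URL and database name are assumptions, and the error handling is deliberately crude:

#!/usr/bin/env ruby
# Generic "poke every view" loop using CouchDB's HTTP API directly.
require 'net/http'
require 'json'
require 'uri'

COUCH = "http://localhost:5984"
DB    = "databasename"

def couch_get(path, params = {})
  uri = URI("#{COUCH}/#{DB}/#{path}")
  uri.query = URI.encode_www_form(params) unless params.empty?
  JSON.parse(Net::HTTP.get(uri))
end

loop do
  # list just the design documents
  designs = couch_get("_all_docs",
                      "startkey"     => '"_design/"',
                      "endkey"       => '"_design0"',
                      "include_docs" => true)

  designs["rows"].each do |row|
    name = row["id"].sub("_design/", "")
    view = (row["doc"]["views"] || {}).keys.first
    next unless view
    begin
      # hitting any one view updates all the views in that design document
      couch_get("_design/#{name}/_view/#{view}", "limit" => 1)
      puts "poked #{name}/#{view}"
    rescue StandardError => e
      # a long-running index build can time the request out - just carry on
      puts "#{name}/#{view}: #{e.class}"
    end
  end

  sleep 60
end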

Thursday, 15 March 2012

Lion Weirdness

We've got 2 Lion machines in the office - one is my laptop, and I discovered one of the newer iMacs is also running Lion.

I obviously started experimenting with backups on my laptop and used a named account (chris) to do the Time Machine backup, and it worked OK. Everywhere else uses a generic "tm" account to connect to the server, and this also works.

However, I thought I should standardise, so I changed the ownership of my backup on the server and connected with the "tm" user, and it errored! It said that the server didn't support some AFP features it needed. So I changed the ownership back to "chris" and it went back to working.

So today I tried to set up the other Lion machine in the office, and it gave the same error! Creating a new user account on the server and connecting with that made it work.

Lots of weirdness, since, as far as I can tell, there really isn't much of a difference.

But hey, it works :)

Only 6 more machines to setup...

Tuesday, 13 March 2012

Time Machine on OpenIndiana

We've been using a few ReadyNAS boxes for Time Machine for a while now, but they're not without their problems - they're slow, and a bit limited in capacity. They are getting on - they're the 1000S model, so pretty much original - they even pre-date Netgear's acquisition of Infrant! They also don't work very well with Lion for Time Machine.

So I've been playing with OpenSolaris, and now OpenIndiana, since I love ZFS - we use Sun/Oracle 7000 series storage on our production systems, so having something equivalent in the office is sensible.

So to build a Time Machine server, the main component needed is an Apple Filing Protocol server.

OpenIndiana has a good wiki for documentation, and http://wiki.openindiana.org/oi/Using+OpenIndiana+as+a+storage+server and http://wiki.openindiana.org/oi/Netatalk are very good guides for getting Netatalk up and running.

Netatalk works out of the box - merely create a tm user account, and add a line to
/usr/local/etc/netatalk/AppleVolumes.default
consisting of:
/space2/timemachine/ timemachine options:tm

where /space2/timemachine/ is a specially created ZFS filesystem.
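
For reference, creating that filesystem and the backup user might look something like the following (the pool name space2 is inferred from the path above, and the quota is entirely optional):

zfs create space2/timemachine
zfs set quota=1T space2/timemachine
useradd tm
passwd tm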

One final tweak needed in /usr/local/etc/netatalk/afpd.conf is to change the following line

- -tcp -noddp -uamlist uams_dhx.so,uams_dhx2.so -nosavepassword -setuplog "default log_debug"

to

- -tcp -noddp -uamlist uams_dhx.so,uams_dhx2_passwd.so -nosavepassword -setuplog "default log_debug"

This stops the daemon crashing at odd points.

Time Machine Monitoring

Time Machine is great - just set it up and forget about it, and it backs up all your files automatically.

In theory.

In practice it can just stop backing up for no apparent reason. The backup disk could fill up. Someone could disable it. There could be errors. And my users don't necessarily report this, so I've got no way of knowing.

So, we need monitoring.

I found this post:
http://smoove-operator.blogspot.com/2010/09/monitoring-timemachine-backups-with.html
which grepped the logs looking for the end of a backup and uploaded this information to the Nagios server, which monitored it.

Interesting, I thought, but not quite what I'm after. Most of our Macs are desk bound, so they're on the office network. And they're switched off at night, generally, so any periodic job is likely to fail to run at the right moment.

All the script is really doing is returning a timestamp of the last backup done, so how about we use SNMP to return that to the monitoring.

Easy enough to adapt the script:


#!/usr/bin/env ruby
# Get the last backup time we have, with no newline
last_backup = `/usr/bin/syslog -T sec -F '\$Time - \$Sender -\$Message' | grep backupd | grep 'Backup completed' | tail -1`
last_backup.chomp!
# Make sure it exists - report 0 if not
if !last_backup.empty?
  # Get the unix timestamp out of the last message
  backup_stamp = (last_backup.split "-")[0]
  puts backup_stamp
else
  puts 0
end

First off, I've re-written it in Ruby - just a personal (and company) preference. It now just outputs the seconds since the epoch of the last successful backup.

Stick it into /usr/local/bin/tm_check and add this to /etc/snmp/snmpd.conf:

exec tm_check /usr/local/bin/tm_check
Then start up snmpd:

sudo launchctl load -w /System/Library/LaunchDaemons/org.net-snmp.snmpd.plist


and the client side is ready to go.

We use Nagios for monitoring our servers etc. However, there are a few caveats which occurred to me when thinking about desktop machines:

1. I don't care if the machine is up or down. In fact, I really don't want to be in a position where that is recorded at all. It is too close to watching what the employee is doing - i.e. when they arrive in the morning and leave at night. Not my job! So I need to stop host checks.

2. Similarly, if the SNMP probe doesn't return, then the machine is probably off, so let's not worry. So the check script records the last backup date in a file, and if the SNMP request times out then the cached date from the file is returned - this is valid since that date is the worst case (there's a rough sketch of this logic further down).

3. Don't tell me by email and especially not SMS. We have a screen on the wall which shows current alerts (using NagLite) and any failures will be shown there. So, my host template has:


  notification_options          n
  active_checks_enabled         0

included in it, and

  notification_options          n

in the service template.

Finally, the check script - I've just adapted one of the existing ones to give a framework and added in the snmpget to fetch the last date, so it's still in Perl :) Get it here.

The cache directory is in: /var/cache/nagios3/tm_cache/ - there's one file per host (to prevent file updating race conditions)
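
If you'd rather roll your own, the logic from caveat 2 boils down to something like this minimal Ruby sketch (not the Perl script linked above). The community string, the assumption that tm_check is the first exec entry in snmpd.conf, and the warning/critical thresholds are all mine - adjust to taste:

#!/usr/bin/env ruby
# check_tm <host> - Nagios check for the last Time Machine backup time,
# falling back to a cached value if the host doesn't answer.
host = ARGV[0] or abort "usage: check_tm <host>"
cache_file = File.join("/var/cache/nagios3/tm_cache", host)

# The first "exec" entry in snmpd.conf shows up as UCD-SNMP-MIB::extOutput.1
raw   = `snmpget -v 2c -c public -t 5 -Oqv #{host} .1.3.6.1.4.1.2021.8.1.101.1 2>/dev/null`
stamp = raw.strip.delete('"').to_i

if stamp > 0
  File.open(cache_file, "w") { |f| f.puts stamp }   # remember the latest answer
elsif File.exist?(cache_file)
  stamp = File.read(cache_file).to_i                # host is probably off - use the cache
end

if stamp.zero?
  puts "UNKNOWN: no backup recorded for #{host}"
  exit 3
end

age_hours = (Time.now.to_i - stamp) / 3600
if age_hours > 72
  puts "CRITICAL: last backup #{age_hours} hours ago"
  exit 2
elsif age_hours > 36
  puts "WARNING: last backup #{age_hours} hours ago"
  exit 1
else
  puts "OK: last backup #{age_hours} hours ago"
  exit 0
end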

And that's about it!