Starting Resque Workers Sequentially

March 22, 2012

Why Start Sequentially?

At work we run a main resque worker server that has about 15-20 workers running at any given time. We also distribute about 5 or 6 workers to our app servers to get more work done and as a safeguard in case the main worker server goes down. Sometimes our clients enqueue hundreds of thousands of jobs in a very short period of time. To handle the extra burden we will spin up a clone or two of the worker server, each running 30 or so workers.

You typically start workers like this:

QUEUE=* rake resque:work

We use god to start up all of our workers this way. This worked pretty well for a while.
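
A god watch for one of these workers looks something like this (the path, queue list, and worker count here are placeholders, not our exact config):

# config/resque.god -- illustrative only
rails_root = "/var/www/app/current"

20.times do |num|
  God.watch do |w|
    w.name     = "resque-#{num}"
    w.group    = "resque"
    w.dir      = rails_root
    w.env      = { "QUEUE" => "*", "RAILS_ENV" => "production" }
    w.start    = "bundle exec rake -f #{rails_root}/Rakefile resque:work"
    w.interval = 30.seconds
    w.keepalive
  end
end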

Here’s the rub: starting 20 workers on a single machine is brutal. Really brutal. All 20 workers try to boot a heavy Rails app environment at the same time and end up in resource contention. Load on a 4-core system climbs to something like 10, and after a deploy our workers would take upwards of 10 minutes to start.

Solution: Add File Locks in the Rakefile

The solution is to start workers sequentially on the same machine by hooking into the worker tasks in rake. This is what worked for us:

# lib/tasks/resque.rake
require 'fileutils'

# Copied from http://thomasmango.com/2010/05/27/resque-in-production/
namespace :resque do
  LOCKFILE = File.join(File.dirname(__FILE__), '..', '..', 'tmp', 'worker_start.lock')

  desc "wait for lock file to clear before starting"
  task :wait_for_lock do
    begin
      File.open(LOCKFILE, File::CREAT | File::EXCL | File::WRONLY) do |f|
        f.write(Process.pid)
      end
    rescue Errno::EEXIST => e
      sleep(1)
      retry
    end
  end

  desc "remove the lock file so the next worker can start"
  task :clear_lock do
    FileUtils.rm(LOCKFILE, :force => true)
  end

  # Hook into resque:preload: wait for the lock, load the Rails
  # environment, then clear the lock so the next worker can start.
  task :preload => [:wait_for_lock, :environment] do
    # Rails is loaded by this point; let the next worker through.
    Rake::Task['resque:clear_lock'].invoke
  end
end
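
This hook works because resque's bundled rake tasks already make resque:work depend on resque:preload. Roughly, resque's tasks.rb declares:

# From resque's tasks.rb (paraphrased)
task :work => [ :preload, :setup ] do
  # ... creates a Resque::Worker for QUEUE and starts it ...
end

So nothing changes about how the workers are started. God still runs QUEUE=* rake resque:work for each one; the workers simply queue up on the lock file while the previous one finishes booting Rails.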

There is a much more elegant solution in Ruby’s File#flock, but it didn’t seem to work in this case. In every contrived script I could come up with, an exclusive flock (file lock) behaved as expected, but inside a rake task like this the lock was never exclusive and all the workers would start at once. This solution is ugly but does the trick. Without so much resource contention, each worker now starts in under 30 seconds and can begin work immediately while the other workers wait their turn.
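
For reference, the flock-based approach looks something like this in isolation (load_rails is just a placeholder for the expensive startup work):

# Same lock file as above; load_rails stands in for booting the app.
File.open('tmp/worker_start.lock', File::CREAT | File::WRONLY) do |f|
  f.flock(File::LOCK_EX)   # blocks until no other process holds the lock
  load_rails               # do the slow work while holding the lock
end                        # lock is released when the file is closed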