Async Jobs That Survive Chaos | Behind the Scenes of The M.Akita Chronicles
This post is part of a series; follow along through the tag /themakitachronicles. This is part 5.
Let me be blunt: most developers treat background jobs as if they were scripts that run once and that’s it. “Ah, throw it in Sidekiq and we’re good.” No, we’re not good. Jobs that handle important things (sending emails, publishing content, calling external APIs) need to be treated as first-class citizens in your architecture.
In this post I’ll show how Rails 8 with ActiveJob and SolidQueue changed my perspective on async processing. This is not theory — these are patterns that emerged from a real system running in production every week.
The Problem: Jobs Are Fragile by Nature
Picture the scenario: you have a job that assembles a newsletter, publishes it on a blog through the GitHub API, waits for a podcast to be ready, and then fires off emails to hundreds of subscribers. Any step can fail. The API can time out. The podcast can take longer than expected. The server can restart in the middle of the delivery.
If you treat this as a linear script, you’re going to suffer. The right question is not “how do I make this job run?” — it’s “how do I make this job recover when something goes wrong?”
Pattern 1: retry_on With Specific Exceptions
ActiveJob’s retry_on is absurdly powerful, but most people use it wrong. Look at the classic anti-pattern:
class MeuJob < ApplicationJob
  retry_on StandardError, wait: 5.seconds, attempts: 3
end
This runs the job at most 3 times on any error whatsoever and then gives up. It’s useless for almost anything in the real world. The pattern that actually works is to create specific exceptions for transient states:
class PodcastNotReady < StandardError; end

class PublishAndSendJob < ApplicationJob
  retry_on PodcastNotReady, wait: 15.minutes, attempts: 16

  def perform(newsletter_id)
    metadata = check_podcast_metadata(newsletter_id)
    raise PodcastNotReady, "Waiting for podcast" unless metadata
    # keep working...
  end
end
See what’s happening? The job polls every 15 minutes for up to 16 attempts, so it waits just under 4 hours for the podcast to be ready (15 waits × 15 minutes = 3h45). This is not an infinite loop: it has a clear limit. And when it hits that limit, you handle the timeout explicitly:
rescue PodcastNotReady => e
  if executions >= 16
    # Last attempt: publish without the podcast and notify the team
    publish_without_podcast(newsletter)
    notify_team("Podcast was not ready in time")
  else
    raise # re-raise so retry_on schedules the next attempt
  end
end
This is graceful degradation. The system doesn’t stop because one component failed; it adapts and keeps going.
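To make the retry budget concrete, here is a minimal plain-Ruby sketch of the same "retry with a hard limit, then degrade" logic. It is not ActiveJob code: `run_with_retry_budget`, its parameters, and the injectable `sleeper` are hypothetical names invented for illustration, and the waiting is faked so the example runs instantly.

```ruby
# Plain-Ruby model of "retry with a hard limit, then degrade".
# Hypothetical helper; ActiveJob's retry_on does the scheduling in the real system.
class PodcastNotReady < StandardError; end

# Calls the block up to `attempts` times.
# Returns [:ok, value] on success, or [:degraded, fallback.call]
# once the attempt budget is exhausted.
def run_with_retry_budget(attempts:, wait_seconds:, fallback:, sleeper: ->(s) { sleep(s) })
  executions = 0
  begin
    executions += 1
    [:ok, yield(executions)]
  rescue PodcastNotReady
    # Out of budget: degrade instead of retrying forever.
    return [:degraded, fallback.call] if executions >= attempts

    sleeper.call(wait_seconds)
    retry
  end
end

# Usage: the podcast never becomes ready.
waited = 0
result = run_with_retry_budget(
  attempts: 16,
  wait_seconds: 15 * 60,
  fallback: -> { "published without podcast" },
  sleeper: ->(seconds) { waited += seconds } # fake sleep, just add up the time
) { |_attempt| raise PodcastNotReady }

p result       # => [:degraded, "published without podcast"]
p waited / 60  # => 225 (minutes: 15 waits between 16 attempts)
```

The point of the sketch is the shape: the limit and the fallback live right next to the retry, so "what happens when we give up" is never left implicit.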
Pattern 2: Distributed Locks
When you have multiple paths that can trigger the same job (a safety cron, a manual trigger, an API), you need to make sure only one instance runs at a time.
The concept is simple: a file-based (or database-based) lock that expires automatically:
class DeployLock
  LOCK_DIR = Rails.root.join("tmp/locks")
  DEFAULT_TTL = 30.minutes

  def self.with_lock(name, ttl: DEFAULT_TTL)
    # Acquire outside the begin/ensure: if acquire! raises because someone
    # else holds the lock, we must not release their lock on the way out.
    acquire!(name, ttl: ttl)
    begin
      yield
    ensure
      release(name)
    end
  end
end
On a single server with SQLite, this is trivial: no Redis, no external coordination. The lock has a TTL so it doesn’t hang forever if the process dies. And the ensure guarantees it is released even if the block raises an exception.
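The `acquire!` and `release` helpers are not shown in the post. As a rough sketch of what a file-based version could look like, the example below uses an exclusive file create as the lock and the file's mtime as the TTL clock. The `FileLock` class, its `AlreadyLocked` error, and the seconds-based `ttl:` are assumptions made for this illustration, not the author's implementation.

```ruby
require "fileutils"
require "tmpdir"

# Hypothetical file-based lock with a TTL (illustration only).
class FileLock
  class AlreadyLocked < StandardError; end

  def initialize(dir)
    @dir = dir
    FileUtils.mkdir_p(@dir)
  end

  def with_lock(name, ttl: 1800)
    acquire!(name, ttl: ttl) # raises before we enter the ensure-protected block
    begin
      yield
    ensure
      release(name)
    end
  end

  def acquire!(name, ttl:)
    path = path_for(name)
    # A lock older than its TTL is treated as abandoned by a dead process.
    FileUtils.rm_f(path) if stale?(path, ttl)
    # CREAT | EXCL fails with EEXIST when the file already exists,
    # so only one process can create the lock file.
    File.open(path, File::WRONLY | File::CREAT | File::EXCL) { |f| f.write(Process.pid.to_s) }
  rescue Errno::EEXIST
    raise AlreadyLocked, "#{name} is already locked"
  end

  def release(name)
    FileUtils.rm_f(path_for(name))
  end

  private

  def path_for(name)
    File.join(@dir, "#{name}.lock")
  end

  def stale?(path, ttl)
    File.exist?(path) && (Time.now - File.mtime(path)) > ttl
  rescue Errno::ENOENT
    false # the file vanished between the two calls
  end
end

# Usage, in a throwaway directory:
Dir.mktmpdir do |dir|
  locks = FileLock.new(dir)
  locks.with_lock("publish") { puts "publishing..." }
end
```

One caveat worth stating: the stale check and the delete are two separate steps, so two processes that both see an expired lock can race each other. For a single server with a handful of triggers that is probably acceptable; for anything stricter, a database row with a unique constraint is the safer base.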
Pattern 3: Atomic Email Delivery
This is where most newsletter systems fail spectacularly. The scenario: you’re sending 500 emails and the server restarts on email 247. What happens when the job re-runs?
If you just iterated a list, you’re going to resend the first 247. Your subscribers will love getting the newsletter twice.
The solution is atomic claiming per recipient:
# Before sending, create a record per subscriber
subscribers.each do |sub|
  EmailDelivery.create!(
    newsletter: newsletter,
    subscriber: sub,
    status: "pending"
  )
end

# At send time, claim one delivery atomically.
# The row lock only holds for the duration of a transaction.
delivery = EmailDelivery.transaction do
  claimed = EmailDelivery.where(status: ["pending", "failed"]).lock.first
  claimed&.update!(status: "sending")
  claimed
end

if delivery
  send_email(delivery)
  delivery.update!(status: "sent")
end
If the server dies between “sending” and “sent”, the record stays as “sending”, and a recovery job (RecoverStaleDeliveriesJob) periodically moves those records to “unknown” after a timeout. An “unknown” delivery is ambiguous (the email may or may not have gone out), so it is never resent automatically. That’s the kind of detail that separates a hobby system from a production system.
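The state machine behind this is small enough to model in plain Ruby. The sketch below is an in-memory stand-in, with a Mutex playing the role of the database lock. `DeliveryLedger` and its methods are hypothetical names; the real system keeps these rows in EmailDelivery and runs the recovery step from RecoverStaleDeliveriesJob.

```ruby
# In-memory model of the delivery states: pending/failed -> sending -> sent,
# with stale "sending" rows parked as "unknown". Illustration only.
class DeliveryLedger
  Delivery = Struct.new(:subscriber, :status, :claimed_at)

  def initialize(subscribers)
    @mutex = Mutex.new
    @rows = subscribers.map { |s| Delivery.new(s, "pending", nil) }
  end

  # Atomically flips one claimable row to "sending" and returns it (or nil).
  def claim_next(now: Time.now)
    @mutex.synchronize do
      row = @rows.find { |r| %w[pending failed].include?(r.status) }
      return nil unless row

      row.status = "sending"
      row.claimed_at = now
      row
    end
  end

  def mark_sent(row)
    @mutex.synchronize { row.status = "sent" }
  end

  # Rows stuck in "sending" past the timeout become "unknown": we cannot tell
  # whether the email left, so claim_next never picks them up again.
  # Returns how many rows were moved.
  def recover_stale(timeout:, now: Time.now)
    @mutex.synchronize do
      @rows.select { |r| r.status == "sending" && now - r.claimed_at > timeout }
           .each { |r| r.status = "unknown" }
           .size
    end
  end

  def statuses
    @mutex.synchronize { @rows.map(&:status) }
  end
end

ledger = DeliveryLedger.new(%w[ana bob carol])
ledger.mark_sent(ledger.claim_next) # ana is delivered
ledger.claim_next                   # bob is claimed, then the process "crashes"

# Later, a new run: recover first, then keep sending.
ledger.recover_stale(timeout: 600, now: Time.now + 3600)
while (row = ledger.claim_next)
  ledger.mark_sent(row)             # only carol is left
end

p ledger.statuses # => ["sent", "unknown", "sent"]
```

Bob ends up as "unknown" rather than getting a second copy, which is exactly the trade-off described above: a human decides what to do with ambiguous deliveries.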
Pattern 4: Orchestrator Jobs
A common mistake is cramming too much logic into a single job. The pattern that works is having orchestrator jobs that delegate to specialized jobs:
PublishAndSendJob (orchestrator)
├── Waits for podcast (retry_on PodcastNotReady)
├── PublishToBlogJob.publish(newsletter)
└── SendNewsletterJob.perform_now(newsletter.id)
    └── SendSingleEmailJob (per subscriber)
The orchestrator coordinates the sequence. Each specialized job knows how to do exactly one thing and can be re-executed independently. If email sending fails, you can retrigger just the SendNewsletterJob without republishing the blog.
Pattern 5: Safety Nets With Cron
Don’t trust a single execution chain. SolidQueue’s cron (config/recurring.yml) acts as a safety net:
send_newsletter:
  class: SendNewsletterJob
  schedule: "0 12 * * 1" # Monday 9am BRT (UTC-3)
If the 7am PublishAndSendJob failed catastrophically, the 9am cron fires the SendNewsletterJob as a fallback. The job checks whether the newsletter has already been sent and no-ops if so. Idempotency is the keyword here: the job can run as many times as you want and the result is always the same.
But watch out for the opposite failure mode: a job that reschedules itself indefinitely when it finds no work is a ticking time bomb. Here, if the job doesn’t find a ready newsletter, it simply returns, and the cron tries again the following week.
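Both rules (no-op when already sent, plain return when nothing is ready) fit in a few lines. The sketch below is plain Ruby with hypothetical names: `Issue` stands in for the newsletter record, and `SendNewsletterFallback` stands in for the body of SendNewsletterJob.

```ruby
# Hypothetical in-memory stand-in for the newsletter record.
Issue = Struct.new(:title, :ready, :sent_at)

class SendNewsletterFallback
  # `deliver` is any callable that actually sends the issue.
  def initialize(deliver:)
    @deliver = deliver
  end

  # Safe to call from the main chain and again from the cron.
  def perform(issue, now: Time.now)
    # Nothing to do: just return. No self-rescheduling; the next
    # scheduled run will look again.
    return :nothing_ready if issue.nil? || !issue.ready
    # Already handled by an earlier run: idempotent no-op.
    return :already_sent if issue.sent_at

    @deliver.call(issue)
    issue.sent_at = now
    :sent
  end
end

sends = 0
job = SendNewsletterFallback.new(deliver: ->(_issue) { sends += 1 })
issue = Issue.new("Issue #5", true, nil)

p job.perform(issue) # => :sent
p job.perform(issue) # => :already_sent
p job.perform(nil)   # => :nothing_ready
p sends              # => 1
```

Running it twice sends once, which is all the cron fallback needs in order to be harmless on a normal week.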
Pattern 6: Status Notifications
Every long-running job should communicate its progress. A simple concern handles it:
module DiscordStatus
  extend ActiveSupport::Concern

  def notify_start(message)
    DiscordNotifier.send(channel: status_channel, text: "▶️ #{message}")
  end

  def notify_done(message)
    DiscordNotifier.send(channel: status_channel, text: "✅ #{message}")
  end

  def notify_error(message)
    DiscordNotifier.send(channel: status_channel, text: "❌ #{message}")
  end
end
Each job includes the concern and calls notify_start at the beginning, notify_done on success, notify_error on rescue. When something goes wrong at 3am, you wake up and see exactly where it stopped, with no digging through logs required.
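If calling the three methods by hand in every job gets repetitive, one option is to wrap the work in a block. This is only a sketch of that idea in plain Ruby: `StatusReporting`, `with_status`, and the `notifier:` callable are hypothetical and stand in for the DiscordNotifier calls above.

```ruby
# Hypothetical wrapper around start/done/error notifications.
# `notifier` is any object that responds to #call(text).
module StatusReporting
  def with_status(label, notifier:)
    notifier.call("▶️ #{label} started")
    result = yield
    notifier.call("✅ #{label} finished")
    result
  rescue StandardError => e
    notifier.call("❌ #{label} failed: #{e.class}: #{e.message}")
    raise # re-raise so the job framework still sees the failure
  end
end

class DemoJob
  include StatusReporting
end

messages = []
collect = ->(text) { messages << text }

DemoJob.new.with_status("Send newsletter", notifier: collect) { :ok }

begin
  DemoJob.new.with_status("Publish blog", notifier: collect) { raise "boom" }
rescue RuntimeError
  # the error was reported and then re-raised
end

puts messages
```

The re-raise matters: the notification is for humans, but retry_on still needs to see the exception to do its job.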
SolidQueue: The End of Redis Dependency
A note on SolidQueue, which ships as default in Rails 8: using the same SQLite database for jobs and application data drastically simplifies operations. No need for a separate Redis running. No need to worry about Redis restarting and losing jobs that were in memory.
Jobs are persisted in the database. If the server restarts, they’re there waiting. The retry state is preserved. It’s absurdly simpler than the alternative, and for the vast majority of cases, the performance is more than enough.
Conclusion
Rails 8 with ActiveJob and SolidQueue didn’t invent anything revolutionary. What it did was make it ridiculously easy to implement patterns that used to require heavy infrastructure:
- retry_on with specific exceptions and clear limits
- Distributed locks with automatic TTL
- Atomic claiming for operations that can’t duplicate
- Orchestrator jobs that delegate and recover
- Safety crons as fallback
- Notifications for operational visibility
None of these patterns is complicated on its own. Put them together, though, and that’s what makes the difference between a system that breaks at the first failure and one that holds up in production — and lets you sleep peacefully on Monday.