What We Actually Learned

The system grew without a grand plan. Payroll events came first - we needed to decouple email delivery from the approval flow. Then leave events, because the notification routing was getting tangled. Then employee lifecycle, billing, auth, attendance, compensation, imports.

Every major domain covered. Events for every state change that matters. Subscribers for logging, analytics, notifications, Slack, and more.

What Worked

Decoupling is real, not theoretical. When we added PostHog analytics, we wrote one subscriber. Zero changes to existing operations. When we added Slack notifications, same story. The subscriber registers itself, declares which event patterns it cares about, and starts receiving. The operations that emit those events were never opened.
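A stripped-down sketch of that registration flow, in plain Ruby. EventBus, its register and emit methods, and the glob-style patterns are assumptions made for illustration, not the production API:

```ruby
# Illustrative only: EventBus, register and emit are hypothetical names.
module EventBus
  @subscribers = []

  class << self
    # Store the subscriber together with the glob-style patterns it listens to.
    def register(subscriber, patterns)
      @subscribers << [subscriber, Array(patterns)]
    end

    # Hand the event to every subscriber with at least one matching pattern.
    def emit(name, payload = {})
      @subscribers.each do |subscriber, patterns|
        next unless patterns.any? { |pattern| File.fnmatch(pattern, name) }

        subscriber.call(name, payload)
      end
    end
  end
end

# A new integration is one new class; nothing that emits events changes.
class AnalyticsSubscriber
  PATTERNS = ["payroll.*", "leave.*"].freeze

  attr_reader :tracked

  def initialize
    @tracked = []
    EventBus.register(self, PATTERNS) # the subscriber registers itself
  end

  def call(name, payload)
    @tracked << [name, payload]
  end
end
```

The emitting side only ever calls EventBus.emit. It has no reference to AnalyticsSubscriber, which is why adding or removing a subscriber never touches an operation.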

The notifier pattern pays compound interest. Every product request that starts with “also send an email when…” is a one-line change. Add a subscribe_to entry, write the action method, done. We’ve never had to modify an operation to change who gets notified or when. The ROI on that abstraction increases with every new event.
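To make the "one-line change" concrete, here is a minimal sketch of a notifier. The subscribe_to signature, the BaseNotifier class, and the payload keys are assumptions for the example; the real ones may differ:

```ruby
# Hypothetical base class: maps event names to action methods.
class BaseNotifier
  def self.subscriptions
    @subscriptions ||= {}
  end

  # Assumed signature: event name first, then the action method to run.
  def self.subscribe_to(event_name, action)
    subscriptions[event_name] = action
  end

  # Route an incoming event to its action, or ignore it.
  def handle(event_name, payload)
    action = self.class.subscriptions[event_name]
    return :ignored unless action

    public_send(action, payload)
  end
end

class LeaveNotifier < BaseNotifier
  subscribe_to "leave.approved", :approved
  subscribe_to "leave.rejected", :rejected # the new product request: one entry

  def approved(payload)
    "To #{payload[:employee_email]}: your leave was approved"
  end

  # ...plus its action method. The operation that emits the event is untouched.
  def rejected(payload)
    "To #{payload[:employee_email]}: your leave was rejected"
  end
end
```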

Schemas prevent drift. At scale, we’d absolutely have inconsistent payload shapes without the schema DSL. from_subject forces you to think about what data an event carries upfront - not six months later when a subscriber crashes on a missing field.
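As a rough illustration of the idea, here is one way a from_subject-style declaration could pin down a payload shape. The actual DSL is not shown in this post, so the signature, the EventSchema class, and the payload_for helper below are all hypothetical:

```ruby
# Hypothetical base class; the real from_subject signature may look different.
class EventSchema
  class << self
    # Declare which attributes are copied from the subject record into the payload.
    def from_subject(*fields)
      @subject_fields = fields
    end

    def subject_fields
      @subject_fields || []
    end

    # Build the payload from the subject; every event of this type gets the same keys.
    def payload_for(subject)
      subject_fields.to_h { |field| [field, subject.public_send(field)] }
    end
  end
end

class LeaveApprovedSchema < EventSchema
  from_subject :id, :employee_id, :days
end

# Stand-in for an ActiveRecord model.
Leave = Struct.new(:id, :employee_id, :days, :internal_note)
```

Because the field list lives in one place, a subscriber can rely on those keys being present, and anything not declared (internal_note here) never leaks into the payload.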

Per-subscriber error isolation was the right default. Slack has gone down. PostHog has had latency spikes. Both times, the event system kept running. Emails were delivered, logs were written, todos were created. The circuit breakers healed on their own. We’ve never had to intervene manually for a failing subscriber.
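The shape of that isolation can be sketched in a few dozen lines. This is a simplified illustration, not the production code: the class names, the failure threshold, the cooldown, and the injectable clock are all assumptions.

```ruby
# Hypothetical breaker: opens after `threshold` consecutive failures,
# then lets a trial call through once `cooldown` seconds have passed.
class CircuitBreaker
  def initialize(threshold: 3, cooldown: 30,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @threshold = threshold
    @cooldown = cooldown
    @clock = clock
    @failures = 0
    @opened_at = nil
  end

  def open?
    return false unless @opened_at
    return true if @clock.call - @opened_at < @cooldown

    # Cooldown elapsed: close the breaker so the next call is a trial.
    @opened_at = nil
    @failures = 0
    false
  end

  def record_success
    @failures = 0
  end

  def record_failure
    @failures += 1
    @opened_at = @clock.call if @failures >= @threshold
  end
end

class Dispatcher
  def initialize(subscribers, **breaker_options)
    # One breaker per subscriber, so one outage never affects the others.
    @entries = subscribers.map { |s| [s, CircuitBreaker.new(**breaker_options)] }
  end

  # Returns one status per subscriber: :ok, :failed or :skipped.
  def dispatch(event)
    @entries.map do |subscriber, breaker|
      next :skipped if breaker.open?

      begin
        subscriber.call(event)
        breaker.record_success
        :ok
      rescue StandardError
        breaker.record_failure
        :failed
      end
    end
  end
end
```

The rescue sits inside the per-subscriber loop, which is the whole point: an exception from one subscriber is recorded against its own breaker and the loop moves on to the next.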

What Surprised Us

Idempotency is harder than it sounds. The Redis SETNX pattern works, but the edge cases are sharp. What happens when a job fails after claiming the idempotency key but before completing the work? The key is set, the work isn’t done, retries skip it. We settled on a 24-hour TTL as a pragmatic balance - long enough to cover normal retry windows, short enough that truly failed events can be reprocessed manually if someone notices.
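The edge case is easier to see in code. The sketch below swaps Redis for a small in-memory store so it runs on its own; set_if_absent mimics the semantics of Redis SET with the NX and EX options. The store class, the key format, and the process_once helper are assumptions for illustration.

```ruby
# In-memory stand-in for Redis, with an injectable clock for the demo.
class MemoryStore
  def initialize(clock:)
    @clock = clock
    @expiries = {}
  end

  # Like `SET key value NX EX ttl`: true only if the key was absent or expired.
  def set_if_absent(key, ttl:)
    now = @clock.call
    return false if @expiries.key?(key) && @expiries[key] > now

    @expiries[key] = now + ttl
    true
  end
end

IDEMPOTENCY_TTL = 24 * 60 * 60 # 24 hours, in seconds

# Runs the block at most once per (event, subscriber) within the TTL.
def process_once(store, event_id, subscriber_name)
  key = "events:#{event_id}:#{subscriber_name}" # hypothetical key format
  return :skipped unless store.set_if_absent(key, ttl: IDEMPOTENCY_TTL)

  yield # if this raises, the key stays claimed until the TTL expires
  :processed
end
```

If the block raises, the claim is already recorded, so a retry inside the TTL returns :skipped without doing the work. Only after the key expires can the event be processed again, which is the trade-off the 24-hour TTL is balancing.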

Context preservation is the whole game. When an operation runs in a web request, Current.user and Current.company are set by middleware. When the event is processed in a Sidekiq job three seconds later, they’re gone. The emitter has to capture those values at emit time and carry them in the payload. Getting this wrong means events without tenant context - in a multi-tenant system, that’s not a bug, it’s a security hole.
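A minimal sketch of capture-at-emit, restore-at-process. In a Rails app Current would typically be an ActiveSupport::CurrentAttributes subclass; here a small thread-local module stands in for it so the example runs without Rails. The Emitter module, the with_event_context helper, and the decision to raise when tenant context is missing are assumptions for the example.

```ruby
# Stand-in for a Rails Current object, backed by thread-local storage.
module Current
  class << self
    def user_id
      Thread.current[:current_user_id]
    end

    def user_id=(value)
      Thread.current[:current_user_id] = value
    end

    def company_id
      Thread.current[:current_company_id]
    end

    def company_id=(value)
      Thread.current[:current_company_id] = value
    end

    def reset
      self.user_id = nil
      self.company_id = nil
    end
  end
end

module Emitter
  # Snapshot the request context into the event itself, at emit time.
  def self.build(name, data)
    raise ArgumentError, "no tenant context for #{name}" if Current.company_id.nil?

    { name: name, data: data,
      context: { user_id: Current.user_id, company_id: Current.company_id } }
  end
end

# Worker side: restore the snapshot around the handler, then clean up.
def with_event_context(event)
  Current.user_id = event[:context][:user_id]
  Current.company_id = event[:context][:company_id]
  yield
ensure
  Current.reset
end
```

Refusing to build an event without a company is one way to make the failure loud instead of silently emitting an event that belongs to no tenant.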

Lenient validation was the right call. Our schemas validate required params but only log warnings - they don’t raise. We debated strict validation: reject events with missing fields. In practice, a missing optional field on a billing event shouldn’t prevent a payroll notification from sending. Production systems need to be forgiving about things that don’t matter and strict about things that do.
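In code, "validate but don't raise" can be as small as this. The LenientSchema class and the log message format are assumptions for illustration:

```ruby
require "logger"

# Hypothetical schema: reports missing required params without raising.
class LenientSchema
  def initialize(event_name, required:, logger: Logger.new($stderr))
    @event_name = event_name
    @required = required
    @logger = logger
  end

  # Returns true when the payload is complete. Missing params are logged
  # as warnings and the caller carries on with dispatch either way.
  def validate(payload)
    missing = @required.reject { |param| payload.key?(param) }
    missing.each do |param|
      @logger.warn("#{@event_name}: missing required param #{param}")
    end
    missing.empty?
  end
end
```

The return value lets a caller count or alert on incomplete events, while the absence of a raise keeps a malformed payload from blocking delivery to subscribers that never needed the missing field.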

The Two Queues

Not all events are equal. A payslip notification after payroll approval is time-sensitive. A PostHog analytics event is not.

We split dispatch into two Sidekiq queues: :events_critical for payroll, leave, billing, and auth events, and :events for everything else. The critical queue gets priority.

queue_as do
  event_name.start_with?(*CRITICAL_PREFIXES) ? :events_critical : :events
end

Three lines. But they mean a surge of import events doesn’t delay someone’s leave approval notification. This was a late addition that should have been there from the start.
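For reference, the same routing rule as a standalone function that can be run outside a job class. The contents of CRITICAL_PREFIXES are an assumption based on the four domains listed above, and so is the trailing dot on each prefix:

```ruby
# Assumed prefix list. The trailing dot keeps a name like
# "authorship.changed" from matching the "auth." prefix by accident.
CRITICAL_PREFIXES = %w[payroll. leave. billing. auth.].freeze

# Pick the queue for an event based on its domain prefix.
def queue_for(event_name)
  event_name.start_with?(*CRITICAL_PREFIXES) ? :events_critical : :events
end

queue_for("leave.approved")      # => :events_critical
queue_for("import.row_failed")   # => :events
queue_for("authorship.changed")  # => :events
```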

What We’d Do Differently

Start with fewer events. Early on, we emitted events for everything. Record created? Event. Record updated? Event. Status changed? Event. Most had zero subscribers. An event nobody listens to is dead code with a runtime cost.

The bar should be: will something else need to react to this? If yes, emit. If no, don’t. You can always add an event later when a subscriber needs it. You can’t easily remove one that other systems depend on.

Name events for the business, not the code. employee.suspended is clear. employee.status_updated is not - updated to what? Events should describe what happened in domain language, not what method ran.

The Design in One Sentence

Operations emit. Subscribers react. Nobody coordinates.

The operation’s job ends when the state changes. The event carries the news. Subscribers decide independently what to do about it. Not event sourcing. Not a distributed system. Just a pattern for keeping a growing Rails monolith organized - one where adding a new side effect never means touching the code that triggered it.

In the next post, we’ll trace one event - a leave approval - through every layer of the system, start to finish.