Mert Tosun
Distributed Cron Jobs with Temporal and Queues

Blog Author · 4 min read · Backend

Traditional cron works well on a single server, but modern backend systems run across many instances, zones, and deployment cycles. In this environment, "run every hour" is no longer trivial. Jobs can execute twice, be skipped during failover, overlap dangerously, or create noisy retries that overload dependent services. Distributed scheduling requires durability, determinism, and clear ownership of execution state.

Temporal and queue-backed workers offer a strong solution for this class of problems. Instead of relying on local machine schedules, you define workflows as durable state machines. Triggers are persisted, retries are explicit, timeouts are controlled, and recovery is automatic after process crashes. This turns a scheduled job from a best-effort background task into a reliable business process.

Why naive cron breaks at scale

Common pain points in distributed cron implementations include:

  • Multiple pods running the same schedule after autoscaling.
  • Missed windows during deployments or restarts.
  • Non-idempotent job handlers causing duplicate side effects.
  • Long-running tasks overlapping with next intervals.
  • Poor visibility into what ran, failed, or was skipped.

Even leader-election-based schedulers can fail during network partitions or when heartbeats are delayed, because two nodes may each briefly believe they hold the lease. The issue is not only "who runs the job," but "how execution intent is durably recorded."
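One way to picture "durably recorded intent" is a table with a uniqueness constraint on the job name and schedule window. The sketch below is only an illustration, not how Temporal stores schedules: the `job_runs` table, the `claim_window` helper, and the pod names are all hypothetical, and an in-memory SQLite database stands in for a shared durable store.

```python
import sqlite3
from datetime import datetime, timezone


def window_start(now: datetime, interval_seconds: int) -> int:
    """Return the epoch second at which the current schedule window began."""
    epoch = int(now.timestamp())
    return epoch - (epoch % interval_seconds)


def claim_window(conn: sqlite3.Connection, job: str, start: int, owner: str) -> bool:
    """Record intent to run `job` for this window; True only for the first claimant."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO job_runs (job, window_start, owner) VALUES (?, ?, ?)",
            (job, start, owner),
        )
    # An ignored insert changes zero rows, so rowcount tells us who won.
    return cur.rowcount == 1


# Hypothetical schema: the primary key is what prevents a second run per window.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE job_runs ("
    " job TEXT NOT NULL, window_start INTEGER NOT NULL, owner TEXT NOT NULL,"
    " PRIMARY KEY (job, window_start))"
)

now = datetime(2024, 1, 1, 10, 17, tzinfo=timezone.utc)
start = window_start(now, 3600)
first = claim_window(conn, "hourly-export", start, "pod-a")   # wins the window
second = claim_window(conn, "hourly-export", start, "pod-b")  # sees it is taken
print(first, second)
```

The point of the design is that the decision lives in the store rather than in any pod's memory, so a restarted or partitioned pod cannot silently run the same window again.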

Temporal scheduling model

Temporal gives you two key building blocks: schedules and workflows. A schedule starts workflow executions according to time rules, and each workflow's state is durably persisted in its Temporal event history. If a worker dies mid-run, another worker replays that history and resumes from the last recorded event. No custom checkpointing logic is required in most cases.

This model gives predictable behavior:

  • At-least-once activity execution, with workflow state rebuilt by deterministic replay.
  • Controlled retries per activity with backoff policies.
  • Timeouts at workflow and activity levels.
  • Explicit compensation logic for partial failure paths.

For business-critical periodic tasks such as billing cycles, reconciliation, or data exports, this predictability is essential.
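To make "controlled retries with backoff" concrete, the following sketch computes the waits an exponential backoff policy would produce. The `RetryPolicy` dataclass here is a hypothetical stand-in written for this example, not the Temporal SDK class; its field names are chosen to resemble the usual retry-policy knobs (initial interval, coefficient, cap, attempt limit).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class RetryPolicy:
    """Hypothetical stand-in for an activity retry policy (not an SDK type)."""
    initial_interval: float = 1.0     # seconds before the first retry
    backoff_coefficient: float = 2.0  # multiplier applied after each retry
    maximum_interval: float = 60.0    # cap on any single wait
    maximum_attempts: int = 5         # total attempts, including the first


def retry_delays(policy: RetryPolicy) -> List[float]:
    """Return the waits (in seconds) before attempts 2..maximum_attempts."""
    delays = []
    for retry in range(policy.maximum_attempts - 1):
        delay = policy.initial_interval * (policy.backoff_coefficient ** retry)
        delays.append(min(delay, policy.maximum_interval))
    return delays


policy = RetryPolicy(initial_interval=1.0, backoff_coefficient=2.0,
                     maximum_interval=10.0, maximum_attempts=6)
print(retry_delays(policy))  # [1.0, 2.0, 4.0, 8.0, 10.0]
```

Capping the interval keeps a long outage from pushing retries hours apart, while the attempt limit bounds how much load a failing dependency receives.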

Queue integration and backpressure

Many scheduled workflows eventually publish work to queues. This is where architectural discipline matters. Do not let a scheduler fan out an unbounded number of messages during spikes. Bound concurrency and size batches according to downstream capacity.

A practical pattern:

  1. Temporal schedule starts a coordinator workflow.
  2. Coordinator computes due work items.
  3. Items are enqueued with idempotency keys.
  4. Consumers process items under a visibility timeout and an agreed retry policy.
  5. Results are aggregated and checkpointed in workflow state.

This separation makes the system resilient. Scheduling remains reliable, while throughput is controlled by queue consumers and worker pool policies.
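The backpressure part of this pattern can be sketched with a bounded in-process queue. This is a simplified model under stated assumptions: `asyncio.Queue` stands in for a real message broker, and `handle`, `consumer`, and `coordinator` are hypothetical names. The producer blocks whenever the buffer is full, so it can never run ahead of the consumers by more than `buffer_size` items.

```python
import asyncio
from typing import List


async def handle(item: str) -> str:
    """Hypothetical consumer handler; the sleep stands in for downstream work."""
    await asyncio.sleep(0.01)
    return f"done:{item}"


async def consumer(queue: asyncio.Queue, results: List[str]) -> None:
    """Pull items until cancelled, acknowledging each one after handling."""
    while True:
        item = await queue.get()
        try:
            results.append(await handle(item))
        finally:
            queue.task_done()


async def coordinator(items: List[str], workers: int, buffer_size: int) -> List[str]:
    """Enqueue due items through a bounded buffer drained by a fixed worker pool."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=buffer_size)
    results: List[str] = []
    tasks = [asyncio.create_task(consumer(queue, results)) for _ in range(workers)]
    for item in items:
        await queue.put(item)  # blocks while the buffer is full: backpressure
    await queue.join()         # wait until every item has been acknowledged
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)
    return results


results = asyncio.run(
    coordinator([f"item-{n}" for n in range(10)], workers=3, buffer_size=2)
)
print(len(results))
```

Throughput here is set by `workers` and `buffer_size`, not by how quickly the coordinator can compute due items, which is the separation the pattern is after.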

Idempotency and duplicate safety

In distributed systems, exactly-once delivery is usually an illusion. Design for at-least-once execution and make handlers idempotent. Assign deterministic operation keys (tenant + period + taskType) and store completion markers in a durable store. Before applying side effects, check if the key has already been processed.
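A minimal sketch of the key-and-marker check might look like this. `operation_key`, `CompletionStore`, and `run_once` are hypothetical names, and the dictionary is only a stand-in for a durable store; a real implementation would also need the check and the write to be atomic (for example via a unique constraint or a transaction).

```python
import hashlib
from typing import Callable, Dict, List, Optional


def operation_key(tenant: str, period: str, task_type: str) -> str:
    """Derive a deterministic key from tenant + period + taskType."""
    raw = "\x1f".join((tenant, period, task_type))  # unit separator avoids ambiguity
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class CompletionStore:
    """In-memory stand-in for a durable store of completion markers."""

    def __init__(self) -> None:
        self._done: Dict[str, str] = {}

    def get(self, key: str) -> Optional[str]:
        return self._done.get(key)

    def mark(self, key: str, result: str) -> None:
        self._done[key] = result


def run_once(store: CompletionStore, key: str, side_effect: Callable[[], str]) -> str:
    """Apply the side effect only if the key has no completion marker yet."""
    previous = store.get(key)
    if previous is not None:
        return previous  # duplicate delivery: reuse the recorded result
    result = side_effect()
    store.mark(key, result)
    return result


calls: List[int] = []


def issue_invoice() -> str:
    """Hypothetical side effect that must not happen twice."""
    calls.append(1)
    return f"invoice-{len(calls)}"


store = CompletionStore()
key = operation_key("acme", "2024-01", "billing")
first = run_once(store, key, issue_invoice)
second = run_once(store, key, issue_invoice)  # retried delivery of the same work
print(first, second, len(calls))  # invoice-1 invoice-1 1
```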

For external APIs, persist outbound request IDs and map responses back to your operation key. If retries occur, reuse the same idempotency token whenever possible. This prevents duplicate invoices, repeated notifications, or double ledger mutations.
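The token-reuse idea can be illustrated with a fake external service. Everything here is an assumption made for the example: `FakePaymentAPI` models a provider that deduplicates on an idempotency token, the `outbound` dictionary stands in for a durable mapping from operation key to request ID, and the lost responses are simulated.

```python
import uuid
from typing import Dict


class FakePaymentAPI:
    """Hypothetical external API that deduplicates on an idempotency token."""

    def __init__(self, fail_first: int) -> None:
        self.fail_remaining = fail_first
        self.charges: Dict[str, str] = {}

    def charge(self, token: str, amount_cents: int) -> str:
        # The charge is applied (at most once per token) before the response
        # is "lost", which is the case that makes naive retries dangerous.
        if token not in self.charges:
            self.charges[token] = f"charge-{len(self.charges) + 1}"
        if self.fail_remaining > 0:
            self.fail_remaining -= 1
            raise TimeoutError("response lost")
        return self.charges[token]


def token_for(operation_key: str, outbound: Dict[str, str]) -> str:
    """Return the persisted request ID for this operation, creating it once."""
    return outbound.setdefault(operation_key, str(uuid.uuid4()))


def charge_with_retries(api: FakePaymentAPI, outbound: Dict[str, str],
                        operation_key: str, amount_cents: int,
                        attempts: int = 3) -> str:
    """Retry the call, always sending the same token for the same operation."""
    for attempt in range(attempts):
        token = token_for(operation_key, outbound)
        try:
            return api.charge(token, amount_cents)
        except TimeoutError:
            if attempt == attempts - 1:
                raise
    raise ValueError("attempts must be at least 1")


api = FakePaymentAPI(fail_first=2)
outbound: Dict[str, str] = {}
result = charge_with_retries(api, outbound, "acme:2024-01:billing", 1999)
print(result, len(api.charges))  # charge-1 1
```

Because the token is looked up from the persisted mapping on every attempt, a retry after a crash would send the same token as the original request rather than minting a new one.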

Operational controls and incident readiness

A robust scheduling platform needs more than code:

  • Dashboards for scheduled runs, success ratio, and lag.
  • Alerting on missed executions and retry storms.
  • Dead-letter queue visibility with replay tooling.
  • Runbooks for pausing schedules safely during incidents.
  • Audit logs for manual trigger and skip actions.

Temporal visibility APIs and queue metrics together provide a near-complete operational picture. Teams can answer: what was supposed to run, what actually ran, and what is currently blocked.
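The "supposed to run versus actually ran" question reduces to comparing two lists of timestamps. The sketch below assumes you can export actual start times from your own run history; `expected_runs`, `missed_runs`, and the five-minute tolerance are hypothetical choices for illustration, not part of any Temporal API.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def expected_runs(start: datetime, end: datetime, every: timedelta) -> List[datetime]:
    """List the scheduled fire times in the half-open range [start, end)."""
    runs = []
    t = start
    while t < end:
        runs.append(t)
        t += every
    return runs


def missed_runs(expected: Iterable[datetime], actual: Iterable[datetime],
                tolerance: timedelta) -> List[datetime]:
    """Return scheduled times with no actual start within `tolerance` after them."""
    starts = list(actual)
    return [e for e in expected
            if not any(e <= a <= e + tolerance for a in starts)]


day = datetime(2024, 1, 1, tzinfo=timezone.utc)
expected = expected_runs(day, day + timedelta(hours=4), timedelta(hours=1))
actual = [
    day + timedelta(seconds=5),            # 00:00 run, slight lag
    day + timedelta(hours=1, seconds=10),  # 01:00 run
    day + timedelta(hours=3, minutes=2),   # 03:00 run, two minutes late
]
missed = missed_runs(expected, actual, tolerance=timedelta(minutes=5))
print([m.isoformat() for m in missed])  # ['2024-01-01T02:00:00+00:00']
```

A check like this could feed the missed-execution alert, with the tolerance tuned to the schedule lag you consider acceptable.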

Migration from legacy cron

You do not need a big-bang rewrite. Start with high-impact jobs where missed or duplicated execution is expensive. Wrap legacy handlers behind idempotent interfaces, then orchestrate them from Temporal workflows. As confidence grows, retire machine-level cron entries gradually.

A typical migration sequence:

  1. Add idempotency to existing handlers.
  2. Move trigger ownership to Temporal schedule.
  3. Add queue buffering and bounded consumers.
  4. Add observability and on-call runbooks.
  5. Remove legacy cron definitions.

Distributed cron is not about replacing one timer with another. It is about making time-based business processes reliable under real-world failures. Temporal plus queue-centric execution gives engineering teams the control plane they need to run scheduled workloads safely at scale.