Building a Distributed Job Orchestration Platform with Kafka and Spring Boot

Design notes on a Spring Boot + Kafka + PostgreSQL job platform with resilient state, callbacks, and clear separation between API and workers.

Modern backend systems rely heavily on asynchronous processing. Tasks like sending emails, generating reports, processing payments, or interacting with external services are often too slow or unreliable to execute directly inside a synchronous API request.

Instead, these workloads are typically offloaded to background workers through messaging systems such as Kafka or RabbitMQ. This allows APIs to remain responsive while dedicated worker services process jobs independently.

To better understand these patterns, I built a distributed job orchestration platform using Java, Spring Boot, Kafka, and PostgreSQL. The system models how real-world SaaS and fintech platforms handle asynchronous workflows, retries, external integrations, and failure recovery.

The project consists of:

  • A REST API service responsible for accepting and tracking jobs
  • A Kafka-based messaging layer for distributing work
  • A worker service responsible for executing jobs asynchronously
  • PostgreSQL as the durable source of truth for job state
  • Support for external workflows through webhook callbacks

The system supports multiple job types, including internally processed tasks (such as email and report jobs) as well as externally executed jobs that complete asynchronously through callbacks.

High-Level Architecture

At a high level, the platform separates job ingestion from job execution.

Clients submit jobs to the API service, which persists the job state in PostgreSQL and publishes a message to Kafka. Worker services consume these messages independently and execute the appropriate handlers based on the job type.

For external jobs, the worker communicates with a third-party service and leaves the job in a RUNNING state until a callback is received by the API.

This separation between API and worker services helps model a common distributed systems pattern:

  • APIs remain lightweight and responsive
  • Workers scale independently
  • Kafka acts as a buffer between ingestion and execution
  • PostgreSQL remains the authoritative system of record

One important design decision in this project was ensuring that jobs are persisted and transitioned to QUEUED before publishing messages to Kafka. This prevents workers from consuming messages for jobs that do not yet exist in the database, avoiding inconsistent execution states.

Job Lifecycle & Processing Flow

At the core of the platform is a simple job lifecycle model:

PENDING → QUEUED → RUNNING → COMPLETED / FAILED

Each state represents a specific stage in the processing pipeline and helps coordinate work safely across distributed components.
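
In code, the lifecycle fits in a small enum with a transition guard. A minimal sketch (the enum below is illustrative, not the exact class in the repository):

```java
// Lifecycle states; the allowed transitions mirror the pipeline above.
public enum JobStatus {
    PENDING, QUEUED, RUNNING, COMPLETED, FAILED;

    // True if moving from this state to `next` is a legal transition.
    public boolean canTransitionTo(JobStatus next) {
        return switch (this) {
            case PENDING -> next == QUEUED;
            case QUEUED -> next == RUNNING;
            case RUNNING -> next == COMPLETED || next == FAILED;
            case COMPLETED -> false;        // terminal
            case FAILED -> next == QUEUED;  // retries re-queue failed jobs
        };
    }
}
```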

When a client submits a job, the API first persists the job in PostgreSQL with a PENDING status. The job is then immediately transitioned to QUEUED before a message is published to Kafka.

This ordering is intentional. In distributed systems, it is important to think carefully about the relationship between durable state and asynchronous messaging. If a Kafka message were published before the database transaction completed, a worker could potentially consume and execute a job that does not yet exist in persistent storage.

By persisting the job state before publishing to Kafka, the system ensures that workers only process jobs that are already visible in the database.
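
A sketch of that submission path, assuming Spring Data JPA and a KafkaTemplate (the Job, JobRequest, and JobRepository types here are illustrative, not the exact classes in the repository):

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;

@Service
public class JobSubmissionService {

    private final JobRepository jobRepository;                 // Spring Data JPA repository
    private final KafkaTemplate<String, String> kafkaTemplate;

    public JobSubmissionService(JobRepository jobRepository,
                                KafkaTemplate<String, String> kafkaTemplate) {
        this.jobRepository = jobRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    public Job submit(JobRequest request) {
        // 1. Persist first, so the job exists before any worker can see it.
        Job job = jobRepository.save(new Job(request.type(), request.payload(), JobStatus.PENDING));

        // 2. Transition to QUEUED while the job is still only visible in the database.
        job.setStatus(JobStatus.QUEUED);
        job = jobRepository.save(job);

        // 3. Only then publish; keying by job id keeps all messages for
        //    one job on the same partition.
        kafkaTemplate.send("jobs", job.getId().toString(), job.getId().toString());
        return job;
    }
}
```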

Once the job reaches Kafka, the worker service consumes the message and attempts to claim the job for execution. Before processing begins, the worker validates that the job is still in the QUEUED state. This acts as a lightweight safeguard against duplicate Kafka delivery or repeated execution attempts.

After claiming the job, the worker transitions it to RUNNING and dispatches execution to the appropriate handler based on the job type.
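
On the consuming side, the worker is roughly a Kafka listener that claims the job and dispatches to a handler. A sketch, assuming a jobs topic and an illustrative JobService whose tryClaim performs the QUEUED-to-RUNNING transition:

```java
import java.util.Map;
import java.util.Optional;
import java.util.UUID;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@Component
public class JobWorker {

    private final JobService jobService;
    // One handler per job type; assumed to be assembled in a @Configuration class.
    private final Map<JobType, JobHandler> handlers;

    public JobWorker(JobService jobService, Map<JobType, JobHandler> handlers) {
        this.jobService = jobService;
        this.handlers = handlers;
    }

    @KafkaListener(topics = "jobs", groupId = "job-workers")
    public void onMessage(String jobId) {
        // Atomically claim QUEUED -> RUNNING; empty means the job is missing
        // or already claimed (e.g. a duplicate delivery), so we drop the message.
        Optional<Job> claimed = jobService.tryClaim(UUID.fromString(jobId));
        claimed.ifPresent(job -> handlers.get(job.getType()).execute(job));
    }
}
```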

The platform currently supports three types of jobs:

Job Type   Description
EMAIL      Simulates asynchronous email delivery
REPORT     Simulates long-running report generation
EXTERNAL   Calls a third-party webhook and waits for asynchronous completion via callback

Internal jobs (EMAIL and REPORT) complete entirely inside the worker process. Once execution finishes successfully, the worker updates the database row to COMPLETED. If execution fails, the job transitions to FAILED.

External jobs follow a different lifecycle. Instead of completing immediately, the worker sends an outbound HTTP request to a third-party service and leaves the job in a RUNNING state. The external system is then responsible for notifying the platform asynchronously through a callback endpoint:

POST /jobs/callback

This mirrors how many real-world systems operate. Payment processors, document conversion pipelines, and third-party integrations often execute asynchronously and communicate completion through webhooks instead of synchronous HTTP responses.

The callback endpoint accepts either COMPLETED or FAILED updates and validates an optional shared secret using the X-Callback-Secret header before updating the job state.
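
A sketch of that endpoint, with the secret check ahead of the state change. The header name comes from the design above; the request record and JobService method are illustrative:

```java
import java.util.UUID;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.http.HttpStatus;
import org.springframework.http.ResponseEntity;
import org.springframework.web.bind.annotation.*;

@RestController
public class JobCallbackController {

    // status is expected to be COMPLETED or FAILED.
    public record CallbackRequest(UUID jobId, JobStatus status) {}

    @Value("${jobs.callback-secret:}") // empty default disables the check
    private String expectedSecret;

    private final JobService jobService;

    public JobCallbackController(JobService jobService) {
        this.jobService = jobService;
    }

    @PostMapping("/jobs/callback")
    public ResponseEntity<Void> onCallback(
            @RequestHeader(value = "X-Callback-Secret", required = false) String secret,
            @RequestBody CallbackRequest body) {

        // Reject callbacks with a wrong secret when one is configured.
        if (!expectedSecret.isEmpty() && !expectedSecret.equals(secret)) {
            return ResponseEntity.status(HttpStatus.UNAUTHORIZED).build();
        }
        jobService.completeExternal(body.jobId(), body.status());
        return ResponseEntity.ok().build();
    }
}
```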

This separation between job submission, execution, and completion allows the system to model asynchronous workflows without blocking API requests or long-running worker threads.

Design Decisions & Tradeoffs

Building distributed systems is rarely just about getting components to communicate with each other. Most of the complexity comes from handling consistency, failures, retries, and coordination between services that operate independently.

While building this platform, several design decisions significantly influenced how the system behaved under asynchronous and failure-prone conditions.

Database vs Kafka Ordering

One of the first design questions was deciding when a job should be published to Kafka relative to when it is persisted in PostgreSQL.

A naive implementation might publish the Kafka message first and then insert the database row afterward. While this may appear harmless, it creates a dangerous race condition in distributed systems.

If the worker consumes the Kafka message before the database transaction commits successfully, the worker may attempt to process a job that does not yet exist in persistent storage. This creates what is sometimes referred to as a phantom execution scenario.

To avoid this, the platform follows a database-first ordering strategy:

  1. Persist the job in PostgreSQL
  2. Transition the job to QUEUED
  3. Publish the Kafka message

This guarantees that any worker consuming the message can safely load the job state from the database.

The tradeoff is that database persistence and Kafka publishing are still separate operations. If the application crashes after the database commit but before the Kafka publish succeeds, the job may remain permanently in the QUEUED state without ever being processed.

Production systems often solve this problem using the Outbox Pattern, where events are first written into an outbox table inside the same database transaction and later published asynchronously by a separate relay process.

For this project, I intentionally avoided implementing a full outbox architecture in order to keep the system smaller and easier to reason about while still modeling realistic distributed systems concerns.
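
For reference, the heart of the pattern is writing the event row in the same transaction as the job row; a separate relay then drains the table and publishes to Kafka. A minimal sketch of the write side, with illustrative OutboxEvent and OutboxRepository types (the relay process is omitted):

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class OutboxJobSubmissionService {

    private final JobRepository jobRepository;
    private final OutboxRepository outboxRepository;

    public OutboxJobSubmissionService(JobRepository jobRepository,
                                      OutboxRepository outboxRepository) {
        this.jobRepository = jobRepository;
        this.outboxRepository = outboxRepository;
    }

    @Transactional // job row and outbox row commit or roll back together
    public Job submit(JobRequest request) {
        Job job = jobRepository.save(new Job(request.type(), request.payload(), JobStatus.QUEUED));
        // A relay process polls this table, publishes each row to Kafka,
        // and marks it sent once the broker acknowledges the publish.
        outboxRepository.save(new OutboxEvent("jobs", job.getId().toString()));
        return job;
    }
}
```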

At-Least-Once Processing & Idempotency

Kafka consumers typically operate under an at-least-once delivery model. This means a message may occasionally be delivered more than once due to retries, consumer restarts, or offset commit timing.

As a result, worker services must assume duplicate delivery is possible.

To reduce duplicate execution, the worker validates the persisted job state before processing any message. Only jobs currently in the QUEUED state are eligible for execution.

This creates a lightweight idempotency safeguard:

  • If a duplicate Kafka message arrives after a job has already transitioned to RUNNING or COMPLETED, the worker simply ignores it
  • The database acts as the authoritative coordination mechanism between distributed components

This approach does not provide perfect exactly-once semantics, but it reflects how many production systems balance simplicity, reliability, and operational complexity.
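
The state check is most robust as a single conditional UPDATE, so two consumers racing on the same message cannot both win. A sketch using a Spring Data @Modifying query (entity and method names are illustrative):

```java
import java.util.UUID;
import org.springframework.data.jpa.repository.JpaRepository;
import org.springframework.data.jpa.repository.Modifying;
import org.springframework.data.jpa.repository.Query;
import org.springframework.data.repository.query.Param;
import org.springframework.transaction.annotation.Transactional;

public interface JobRepository extends JpaRepository<Job, UUID> {

    // Compare-and-set on the status column; returns the number of rows
    // updated. A duplicate delivery sees 0 and skips execution entirely.
    @Modifying
    @Transactional
    @Query("UPDATE Job j SET j.status = :to WHERE j.id = :id AND j.status = :from")
    int transition(@Param("id") UUID id,
                   @Param("from") JobStatus from,
                   @Param("to") JobStatus to);
}
```

A worker claims a job with transition(id, JobStatus.QUEUED, JobStatus.RUNNING) and treats a return value of 0 as “someone else already owns this job.”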

Designing around at-least-once delivery also influenced retry behavior. Failed jobs can be re-queued intentionally through retry APIs or automatically through watchdog recovery mechanisms, making idempotent execution an important consideration throughout the system.

External Jobs & Callback-Based Workflows

Internal asynchronous jobs are relatively straightforward because execution and completion occur entirely inside the worker process.

External workflows are significantly more complex.

For EXTERNAL jobs, the worker sends an outbound HTTP request to a third-party service and transitions the job into a RUNNING state. Instead of waiting synchronously for the external system to finish, the platform relies on a callback endpoint to receive completion updates asynchronously.

This pattern mirrors many real-world integrations:

  • payment gateways
  • document processing systems
  • media conversion pipelines
  • external verification services

Using callbacks instead of synchronous waiting offers several advantages:

  • workers remain non-blocking
  • long-running external operations do not tie up worker threads
  • external systems can complete work independently

However, it also introduces new reliability concerns:

  • external systems may never respond
  • callbacks may arrive multiple times
  • malicious or invalid callbacks may be sent

To address these concerns, the platform includes:

  • watchdog-based timeout recovery for stale RUNNING jobs
  • retry limits for repeated failures
  • optional callback secret validation using X-Callback-Secret

This part of the system was particularly interesting because it highlighted how asynchronous external integrations often introduce more complexity than the internal job processing itself.

Failure Handling & Reliability

The earlier sections describe how jobs move through PENDING, QUEUED, and RUNNING when everything cooperates. This section describes what happens when the real world ignores that script: networks flap, brokers redeliver, processes die mid-flight, and external vendors go quiet. Handling those cases is what separates a toy demo from something that resembles engineering.

Retries and bounded optimism

Transient failures matter. A flaky HTTP call from a worker or a temporary database blip might succeed on a second attempt, so retrying selectively is worthwhile. Blind infinite retries can just as easily amplify damage: the same buggy handler might hammer an external dependency, flood topics with duplicates, or leave work that should be FAILED stuck in RUNNING.

The implementation uses an explicit retry budget: after a soft failure, a job can transition back to QUEUED and be re-published, up to a configured number of attempts. Retries complement the QUEUED-state idempotency gate described earlier: retries are deliberate re-queues, while duplicate Kafka deliveries are filtered out by the state check rather than blindly re-running business logic.

Put differently, retries are optimistic and the database-backed lifecycle is pessimistic; the combination is deliberate.
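
A sketch of that bounded-retry decision on a soft failure (the attempt counter and limit are illustrative):

```java
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

@Service
public class RetryService {

    private static final int MAX_ATTEMPTS = 3; // retry budget per job

    private final JobRepository jobRepository;
    private final KafkaTemplate<String, String> kafkaTemplate;

    public RetryService(JobRepository jobRepository,
                        KafkaTemplate<String, String> kafkaTemplate) {
        this.jobRepository = jobRepository;
        this.kafkaTemplate = kafkaTemplate;
    }

    @Transactional
    public void handleSoftFailure(Job job) {
        job.setAttempts(job.getAttempts() + 1);
        if (job.getAttempts() < MAX_ATTEMPTS) {
            // Budget remains: send the job around the loop again.
            job.setStatus(JobStatus.QUEUED);
            jobRepository.save(job);
            kafkaTemplate.send("jobs", job.getId().toString(), job.getId().toString());
        } else {
            // Budget exhausted: fail terminally instead of retrying forever.
            job.setStatus(JobStatus.FAILED);
            jobRepository.save(job);
        }
    }
}
```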

Watchdog recovery and stale RUNNING jobs

Internal jobs (EMAIL and REPORT) complete inside the worker. EXTERNAL jobs deliberately sit in RUNNING until a webhook arrives, which may be never. In production systems, callbacks get lost to bad routing, buggy partners, disagreeing clocks, or human configuration errors just often enough that “wait forever” is not an acceptable policy.

The platform runs a watchdog that periodically scans for jobs that have been RUNNING beyond a configurable threshold. Stale jobs are moved out of perpetual limbo: returned to QUEUED while retry budget remains, marked FAILED once retries are exhausted, or surfaced for downstream inspection.
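
A sketch of that watchdog as a scheduled sweep (assumes @EnableScheduling is active and that the repository exposes a derived query on an updatedAt timestamp; the threshold values are illustrative):

```java
import java.time.Duration;
import java.time.Instant;
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

@Component
public class StaleJobWatchdog {

    private static final Duration RUNNING_TIMEOUT = Duration.ofMinutes(15);

    private final JobRepository jobRepository;
    private final RetryService retryService;

    public StaleJobWatchdog(JobRepository jobRepository, RetryService retryService) {
        this.jobRepository = jobRepository;
        this.retryService = retryService;
    }

    @Scheduled(fixedDelay = 60_000) // sweep once a minute
    public void recoverStaleJobs() {
        Instant cutoff = Instant.now().minus(RUNNING_TIMEOUT);
        // Finds RUNNING jobs whose last update is older than the cutoff.
        for (Job job : jobRepository.findByStatusAndUpdatedAtBefore(JobStatus.RUNNING, cutoff)) {
            // Re-queue within the retry budget, or fail once it is spent.
            retryService.handleSoftFailure(job);
        }
    }
}
```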

That single mechanism ties together several lessons at once:

  • timers are coordination primitives, not an afterthought
  • RUNNING does not mean “success”; it means ownership was claimed somewhere
  • external completion is a hypothesis that can expire, not a guarantee

Stale RUNNING jobs become a surprisingly common failure mode once callbacks exist; acknowledging that early keeps the lifecycle honest.

Dead-letter queue (DLQ)

The main jobs topic assumes most messages deserve another chance, or are at least safe to rewind and retry. Messages that violate that assumption (malformed payloads, messages referencing jobs that are no longer in a valid state, or failures explicitly marked as poison) would otherwise be retried indefinitely.

Publishing to a dedicated DLQ acknowledges terminal failure explicitly. Operators (or tooling) gain a choke point for inspection without blocking normal throughput. Conceptually this mirrors how teams separate “recoverable backlog” from “things we should not blindly re-drive.”

Routing poison messages to a dedicated DLQ instead of letting them loop forever adds a few moving parts, but it buys considerable operational clarity once multiple workers and retries are involved.
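
Spring Kafka ships the pieces for this routing. A sketch wiring a DeadLetterPublishingRecoverer behind a bounded back-off (by default, failed records go to a topic named after the original with a .DLT suffix, e.g. jobs.DLT):

```java
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.kafka.listener.DeadLetterPublishingRecoverer;
import org.springframework.kafka.listener.DefaultErrorHandler;
import org.springframework.util.backoff.FixedBackOff;

@Configuration
public class KafkaErrorHandlingConfig {

    // Spring Boot applies a single CommonErrorHandler bean to its
    // auto-configured listener container factory.
    @Bean
    public DefaultErrorHandler errorHandler(KafkaTemplate<String, String> template) {
        // Two retries, one second apart, then dead-letter the record
        // instead of blocking the main topic.
        var recoverer = new DeadLetterPublishingRecoverer(template);
        return new DefaultErrorHandler(recoverer, new FixedBackOff(1000L, 2L));
    }
}
```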

Worker crashes mid-execution

A worker that crashes after claiming a job leaves work half-finished: the offset for the in-flight message may never be committed, so Kafka will redeliver it after a restart. Earlier design choices mitigate the worst outcomes:

  • if a replay arrives while PostgreSQL already shows RUNNING or COMPLETED, the duplicate handling short-circuits
  • watchdog recovery catches jobs that were claimed but never finished
  • internal and external jobs recover differently, because only external jobs are expected to sit in RUNNING for long stretches while waiting on timers and callbacks

The honest summary is subtle: no single abstraction gives you “exactly once job execution.” Instead, you accumulate partial guarantees (idempotent transitions, timeouts, retries, a DLQ) that together approximate correctness under plausible failure timelines.

Failure handling turned out to be the place where textbook lifecycle diagrams collide with crashing JVMs and silent third-party webhooks. It is also where distributed systems genuinely become interesting.

What I Learned

Building this stack mostly taught me humility about how much incidental complexity never appears on the architecture diagrams.

Distributed systems force pessimism sooner than synchronous CRUD stacks do. At-least-once delivery stops being jargon and becomes invariant thinking: duplicates exist, reordering happens, clocks drift, and the database row is often the referee.

Consistency is inseparable from tradeoffs. Persisting QUEUED before publishing to Kafka closed one race but surfaced another (publish failures after commit). External callbacks elegantly model slow partners, but they amplify uncertainty about eventual completion unless you deliberately bound how long a job may stay in RUNNING.

Asynchronous systems grow complex fast. A single callback path does not add effort linearly: each integration brings shared secrets, timeouts, replay handling for duplicates and bad actors, observability gaps, and the mental bookkeeping of partial state.

Above all, reliability rewards intention. Retries left implicit become infinite loops or silent drops. Omitted timeouts become stuck jobs disguised as “still processing.” A DLQ that was never modeled becomes noisy logs nobody reads until production breaks.

This project made clear that “correct when everything connects” was never a sufficient specification; correctness under partial failure is closer to what production demands.

Future Improvements

A few obvious next steps stayed out of scope to keep boundaries clean:

Direction               Rough idea
Outbox pattern          Co-locate publishes with transactional writes to remove the orphaned QUEUED edge case cleanly
Metrics & tracing       Golden signals plus trace IDs across HTTP, DB, Kafka, callbacks
Operator dashboard      Quick visibility into stuck jobs without SQL spelunking
Redis                   Faster coordination or rate limiting if contention grew
Kubernetes deployment   Beyond Docker Compose topology for repeatable horizontal scaling drills

Nothing here invalidates current behavior; each item deepens observability or operational guarantees.

Conclusion

The distributed job orchestration platform stitches together REST APIs, transactional PostgreSQL state, Kafka fan-out, and worker-side execution, with external work bridged via callbacks. Building it forced hard questions about retries, stalled RUNNING work, DLQ routing, and consumer crash semantics.

Walking through lifecycle design, deliberate ordering, defensive consumption, recovery timers, and dead-letter ergonomics clarified how resilient asynchronous platforms are assembled from overlapping partial guarantees rather than from a single silver bullet abstraction.

Implementation details for all services, topics, migrations, smoke tests, and end-to-end failure scenarios live in the repository:

github.com/abubakaransari326/Distributed-Job-Orchestration-Platform

If you skim one thing besides the diagrams, skim the watchdog and DLQ paths; they are modest in code footprint but defend against a disproportionate share of failures.