What is idempotency and why does it matter for webhooks?

Idempotency means processing the same webhook event twice produces the same result as processing it once. It matters because webhook providers retry on timeout, and your endpoint will occasionally receive duplicate events. The fix: store a processed event ID in Redis or Postgres, and skip re-processing if that ID was already handled. Without this, a payment webhook retry might create two orders.

How do you handle webhook signature verification?

At Creative Codes, every production webhook endpoint verifies the request signature before processing. Most providers (Stripe, GitHub, Shopify) send an HMAC-SHA256 signature in a header. You reconstruct the signature from the raw request body and your webhook secret, then compare using a constant-time comparison function. Never use the parsed JSON body for verification: it loses whitespace and may differ from what the provider signed.

What's a dead-letter queue and when do you need one?

A dead-letter queue (DLQ) is where failed webhook events go after all retry attempts are exhausted. You need it when webhooks process critical business events: orders, payments, alerts. Without a DLQ, a transient database error during a payment webhook means that event is silently lost. Our production webhook systems always include a DLQ with alerting so nothing disappears without visibility.


Automation | June 1, 2026 | 8 min read

Webhook-Driven Automation: Architecture Patterns That Actually Work

Webhooks are the foundation of event-driven automation. Here's how to receive them reliably, process them safely, handle retries correctly, and recover cleanly when things go wrong.

Muhammad Hassan

Founder, Creative Codes. 8 years on backends; last 3 deep on AI agents, RAG pipelines, and production scraping. Python, LangGraph, Playwright, n8n, FastAPI.


Webhooks are the simplest-looking part of an automation pipeline and the most common source of silent data loss. CRM events, payment notifications, form submissions, marketplace alerts: they all arrive as HTTP POST requests to an endpoint you control. Getting this right means your automation runs when it should and never drops an event.

What can go wrong with webhooks

The naive webhook handler is a function that receives the request, processes the payload synchronously, and returns 200. This works for demos. In production, it creates problems:

Timeouts: if processing takes more than a few seconds, the sender will retry. You'll process the same event twice.
Downstream failures: if your database is slow, your automation logic throws an exception, or a third-party API is down, the sender sees a 5xx error and retries. Again: duplicate processing.
No visibility: when something goes wrong, you have no record of what arrived or what was done with it.
No recovery: a crash mid-processing means the event is lost, or processed twice after retry.

Good webhook architecture separates receiving the event from processing it.

Pattern 1: Receive fast, process async

The webhook receiver has one job: validate the request, store the raw payload, and return 200 immediately. Processing happens in a background worker.

python

from fastapi import FastAPI, BackgroundTasks, HTTPException, Request
import hmac, hashlib, json, os

app = FastAPI()

# Shared secret issued by the webhook sender (the variable name is illustrative)
WEBHOOK_SECRET = os.environ["WEBHOOK_SECRET"]

@app.post("/webhooks/crm")
async def receive_crm_event(request: Request, background_tasks: BackgroundTasks):
    # Read the raw bytes: the signature covers these, not the parsed JSON
    body = await request.body()

    # Validate signature before storing (verify_signature is shown under Pattern 3)
    signature = request.headers.get("X-Webhook-Signature", "")
    if not verify_signature(body, signature, secret=WEBHOOK_SECRET):
        raise HTTPException(status_code=401, detail="Invalid signature")

    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        raise HTTPException(status_code=400, detail="Invalid JSON")

    # store_raw_event persists the event (Redis or a database in production).
    # It should return the sender's event ID so retries map to the same record.
    event_id = store_raw_event(payload)

    # Queue processing in the background so the response is not blocked
    background_tasks.add_task(process_event, event_id, payload)

    return {"received": True, "event_id": event_id}

The sender gets a fast 200. Processing happens asynchronously. If processing fails, the raw event is already stored and can be replayed.
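The handler above calls store_raw_event without showing it. Here is a minimal sketch of one way to write it, assuming a SQLite table named raw_events and a sender that puts a unique "id" field in every payload. The table name, the columns, and the (conn, body) signature are illustrative assumptions, not the production implementation.

```python
import json
import sqlite3

def init_store(conn: sqlite3.Connection) -> None:
    # event_id is the primary key, so a retried delivery cannot create a second row
    conn.execute(
        """CREATE TABLE IF NOT EXISTS raw_events (
               event_id    TEXT PRIMARY KEY,
               body        BLOB NOT NULL,
               status      TEXT NOT NULL DEFAULT 'queued',
               received_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
           )"""
    )

def store_raw_event(conn: sqlite3.Connection, body: bytes) -> str:
    """Persist the raw request body and return the sender's event ID."""
    # Assumption: the sender includes a unique "id" field in the payload
    event_id = str(json.loads(body)["id"])
    # INSERT OR IGNORE keeps the first copy if the same event arrives again
    conn.execute(
        "INSERT OR IGNORE INTO raw_events (event_id, body) VALUES (?, ?)",
        (event_id, body),
    )
    conn.commit()
    return event_id

conn = sqlite3.connect(":memory:")
init_store(conn)
first = store_raw_event(conn, b'{"id": "evt_1", "type": "contact.created"}')
again = store_raw_event(conn, b'{"id": "evt_1", "type": "contact.created"}')
```

Storing the exact bytes rather than the parsed dict means a replay sees what the sender actually sent, and keying on the sender's ID is what lets the later idempotency check recognise a retry.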

Pattern 2: Idempotency keys

Most webhook senders will retry on timeout or 5xx. You will receive the same event multiple times. Your processing logic must be idempotent: processing the same event twice should produce the same result as processing it once.

The standard approach is tracking which events have already been processed:

python

def process_event(event_id: str, payload: dict):
    # Check if already processed
    if is_already_processed(event_id):
        return  # Skip silently — this is a retry
    
    try:
        # Do the actual work
        result = execute_automation_logic(payload)
        
        # Mark as processed AFTER success
        mark_as_processed(event_id, result)
    except Exception as e:
        log_processing_failure(event_id, str(e))
        raise

For n8n workflows, use the event ID as part of the de-duplication check in the first node. Store processed event IDs in your database or in Redis with a TTL that matches your sender's retry window (usually 24-72 hours).
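The TTL-based store can be sketched without a Redis dependency. The class below is a hypothetical in-memory stand-in that mimics "set if absent, with expiry" semantics. With redis-py the same claim is usually a single set call with nx=True and ex set to the TTL, but treat that mapping as an assumption to check against your client version.

```python
import time
from typing import Dict, Optional

class TTLDedupStore:
    """In-memory stand-in for a key store with set-if-absent plus expiry."""

    def __init__(self, ttl_seconds: float) -> None:
        self.ttl_seconds = ttl_seconds
        self._expires_at: Dict[str, float] = {}

    def claim(self, event_id: str, now: Optional[float] = None) -> bool:
        """Return True if this call is the first to see event_id within the TTL."""
        now = time.monotonic() if now is None else now
        expiry = self._expires_at.get(event_id)
        if expiry is not None and expiry > now:
            return False  # already claimed and not yet expired: treat as a retry
        self._expires_at[event_id] = now + self.ttl_seconds
        return True

    def release(self, event_id: str) -> None:
        """Drop a claim so a later retry is processed (call this when work fails)."""
        self._expires_at.pop(event_id, None)

store = TTLDedupStore(ttl_seconds=72 * 3600)  # upper end of the 24-72 hour window
first = store.claim("evt_1", now=0.0)
retry = store.claim("evt_1", now=60.0)
after_ttl = store.claim("evt_1", now=72 * 3600 + 1.0)
```

Claiming before doing the work narrows the gap between the "already processed?" check and the "mark as processed" write, but the claim then has to be released if processing fails, otherwise the sender's retry is skipped.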

Pattern 3: Signature verification

Every production webhook receiver must verify the request signature. Without this, anyone who discovers your webhook URL can inject fake events into your automation.

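Most webhook senders (Stripe, GitHub, HubSpot, Shopify) sign requests with HMAC-SHA256. The core pattern is the same across providers, but the details differ: the header name, the encoding (GitHub sends hex, Shopify sends base64), and whether a timestamp is part of the signed string (as with Stripe). Check the provider's documentation before reusing a verifier.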

python

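import hashlib
import hmac

def verify_signature(body: bytes, signature: str, secret: str) -> bool:
    # Recompute the HMAC-SHA256 of the raw request body with the shared secret
    expected = hmac.new(
        secret.encode(),
        body,
        hashlib.sha256,
    ).hexdigest()

    # Use constant-time comparison to prevent timing attacks. Comparing bytes
    # avoids a TypeError if the header contains non-ASCII characters.
    return hmac.compare_digest(expected.encode(), signature.encode())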

Some providers prefix the signature with a scheme identifier (e.g., sha256=abc123). Strip that prefix before comparing.
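Prefix stripping can be folded into the verifier. This is a sketch for hex-encoded signatures only. The function name and the default "sha256=" prefix are illustrative, and the secret is a test value.

```python
import hashlib
import hmac

def verify_prefixed_signature(body: bytes, header_value: str, secret: str,
                              prefix: str = "sha256=") -> bool:
    """Verify a hex HMAC-SHA256 signature that may carry a scheme prefix."""
    if header_value.startswith(prefix):
        header_value = header_value[len(prefix):]
    expected = hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()
    # Constant-time comparison on bytes
    return hmac.compare_digest(expected.encode(), header_value.encode())

secret = "test-secret"  # illustrative value only
body = b'{"id": "evt_1"}'
good = "sha256=" + hmac.new(secret.encode(), body, hashlib.sha256).hexdigest()

with_prefix = verify_prefixed_signature(body, good, secret)
without_prefix = verify_prefixed_signature(body, good[len("sha256="):], secret)
# Same JSON, different whitespace: the signature no longer matches
reserialized = verify_prefixed_signature(b'{"id":"evt_1"}', good, secret)
```

The last call shows why verification has to run on the raw bytes: re-serializing the parsed JSON changes the whitespace, so the digest changes too.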

Pattern 4: Dead-letter queue

When processing fails after retries, the event goes to a dead-letter queue (DLQ) rather than being silently dropped. The DLQ holds events that couldn't be processed so they can be inspected, fixed, and replayed.

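For n8n-based automation, we implement this as a separate n8n workflow that triggers on failure and writes to a dedicated "failed events" table in the database. For Python-based services, we use Redis pub/sub or a simple database table. Redis pub/sub does not persist messages, so anything published while no subscriber is connected is lost. The database table (or a Redis list or stream) is what makes the DLQ durable.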
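The "simple database table" option might look like the sketch below. It uses SQLite for illustration. The column names (payload, failure_reason, resolved) follow the names used elsewhere in this article, while the helper signatures and the conn parameter are assumptions.

```python
import json
import sqlite3

def init_dlq(conn: sqlite3.Connection) -> None:
    conn.execute(
        """CREATE TABLE IF NOT EXISTS dead_letter_queue (
               event_id       TEXT PRIMARY KEY,
               payload        TEXT NOT NULL,
               failure_reason TEXT NOT NULL,
               failed_at      TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
               resolved       INTEGER NOT NULL DEFAULT 0
           )"""
    )

def write_to_dead_letter_queue(conn: sqlite3.Connection, event_id: str,
                               payload: dict, error: str) -> None:
    # OR REPLACE: if the event lands in the DLQ again, keep only the latest error
    conn.execute(
        "INSERT OR REPLACE INTO dead_letter_queue (event_id, payload, failure_reason) "
        "VALUES (?, ?, ?)",
        (event_id, json.dumps(payload), error),
    )
    conn.commit()

def dlq_depth(conn: sqlite3.Connection) -> int:
    row = conn.execute(
        "SELECT COUNT(*) FROM dead_letter_queue WHERE resolved = 0"
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
init_dlq(conn)
write_to_dead_letter_queue(conn, "evt_1", {"amount": 10}, "connection refused")
write_to_dead_letter_queue(conn, "evt_2", {"total": 5}, "KeyError: 'amount'")
write_to_dead_letter_queue(conn, "evt_1", {"amount": 10}, "timeout")
```

Keeping the payload and the error in the same row is what makes later inspection and replay possible without digging through logs.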

python

MAX_RETRIES = 5  # illustrative value; tune to your workload

def handle_processing_failure(event_id: str, payload: dict, error: str, attempt: int):
    # attempt is 1-based: the first failure arrives with attempt=1
    if attempt < MAX_RETRIES:
        # Schedule retry with exponential backoff
        retry_delay = 2 ** attempt * 60  # 2min, 4min, 8min...
        schedule_retry(event_id, payload, delay_seconds=retry_delay)
    else:
        # Move to DLQ
        write_to_dead_letter_queue(event_id, payload, error)
        send_slack_alert(f"Event {event_id} moved to DLQ after {attempt} attempts: {error}")

Pattern 5: Observability

Every webhook you receive should produce a log entry with:

Event ID
Event type
Timestamp received
Processing status (queued, processing, completed, failed)
Processing duration

This is table stakes for debugging production issues. "Why didn't my automation trigger?" has one of two answers: the event wasn't received, or the event was received but processing failed. Without logs, you can't tell which.
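One way to produce such an entry is a single JSON line per event. This is a sketch using the standard logging module. The logger name, the field names, and the helper functions are illustrative choices, not a fixed schema.

```python
import json
import logging
from datetime import datetime, timezone
from typing import Optional

logger = logging.getLogger("webhooks")
VALID_STATUSES = {"queued", "processing", "completed", "failed"}

def build_log_entry(event_id: str, event_type: str, status: str,
                    duration_ms: Optional[float] = None,
                    received_at: Optional[datetime] = None) -> dict:
    """Assemble the five fields listed above into one dict."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    received_at = received_at or datetime.now(timezone.utc)
    return {
        "event_id": event_id,
        "event_type": event_type,
        "received_at": received_at.isoformat(),
        "status": status,
        "duration_ms": duration_ms,  # None until processing has finished
    }

def log_webhook_event(**fields) -> str:
    """Emit the entry as one JSON line and return it."""
    line = json.dumps(build_log_entry(**fields), sort_keys=True)
    logger.info(line)
    return line

line = log_webhook_event(
    event_id="evt_1",
    event_type="contact.created",
    status="completed",
    duration_ms=42.5,
    received_at=datetime(2026, 6, 1, tzinfo=timezone.utc),
)
```

One line per status change, keyed by event ID, is enough to answer "was it received?" and "did processing fail?" with a single search.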

In n8n, use the execution log aggressively. Add a database write node early in the workflow to record that the webhook was received and is being processed. This creates a paper trail independent of n8n's own execution history.

Putting it together: n8n implementation

For our n8n-based automation builds, the webhook pattern looks like this:

1. Webhook trigger node: receives the event, returns 200
2. Signature verification (Function node or HTTP Request to validation service)
3. De-duplication check (database lookup for event ID)
4. Set node: normalize the payload to a consistent internal format
5. Business logic: the actual automation steps
6. Database write: record completion with outcome
7. Error workflow: separate workflow triggered on failure, writes to DLQ and sends Slack alert

The key is that step 7 is a separate workflow, not error handling inside the main workflow. This ensures that failure in error handling doesn't swallow the original error.

Monitoring webhook SLA in production

Most teams monitor what their automation does but not how quickly or reliably it receives. Webhook SLA monitoring tracks three things.

1. Delivery latency. The time between a webhook being sent by the provider and your endpoint returning 200. Most providers expose this in their dashboards (Stripe calls it "response time", HubSpot shows "processing duration"). Track the p50, p95, and p99 across a rolling 24-hour window. For CRM-triggered automations, a p99 under 500ms is healthy. A p99 above 2 seconds means your endpoint is struggling and retries are accumulating.

python

import time

@app.post("/webhooks/crm")
async def receive_crm_event(request: Request, background_tasks: BackgroundTasks):
    started = time.perf_counter()  # monotonic clock, safe for measuring durations
    # ... validation, storage ...
    background_tasks.add_task(process_event, event_id, payload)

    # Handler time only: network transit from the provider is not included
    duration_ms = (time.perf_counter() - started) * 1000
    log_webhook_latency(event_type=payload.get("type"), latency_ms=duration_ms)
    return {"received": True}

Track this in a webhook_events table with received_at, event_type, and response_ms columns. A weekly query using percentile_cont(0.99) WITHIN GROUP (ORDER BY response_ms) tells you whether your SLA is healthy.
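If the latencies are already in application memory, the same percentiles can be approximated in a few lines. The sketch below uses the nearest-rank method, which returns an actual sample rather than interpolating the way percentile_cont does, so the two can differ slightly on small samples. The function name and sample values are illustrative.

```python
import math
from typing import Sequence

def percentile(values: Sequence[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least p% at or below it."""
    if not values:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

response_ms = [120, 80, 95, 2300, 110]  # illustrative samples
p50 = percentile(response_ms, 50)
p99 = percentile(response_ms, 99)
```

With only five samples a single slow request dominates the p99, which is why a rolling window with enough traffic matters more than the exact percentile method.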

2. DLQ depth. Your dead-letter queue should be empty under normal operation. If DLQ depth rises, events are failing to process and accumulating. Wire a cron job to alert when DLQ depth exceeds 5 events:

python

def check_dlq_health():
    # Assumes db.query returns the scalar COUNT(*) value rather than a row object
    depth = db.query("SELECT COUNT(*) FROM dead_letter_queue WHERE resolved = false")
    if depth > 5:
        send_slack_alert(f":warning: DLQ depth {depth}: events need attention")

Run this every 5 minutes. DLQ depth is a leading indicator: it catches problems before downstream systems notice missing data.

3. Replay success rate. When you replay events from the DLQ, track how many succeed on first replay versus continuing to fail. A healthy system replays 95%+ of DLQ events on the first attempt. Low replay success rate means the underlying bug isn't fixed, just deferred.

We set one alert threshold per metric: p99 latency over 2s, DLQ depth over 5, or replay success below 80% triggers a Slack message. These three catch essentially every webhook reliability issue before a user notices.
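The three thresholds can be checked in one place. This is a sketch, with the threshold values taken from the paragraph above. The dataclass, the function names, and the choice to treat "nothing replayed yet" as healthy are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class WebhookHealth:
    p99_latency_ms: float
    dlq_depth: int
    replay_success_rate: float  # fraction between 0 and 1

def replay_success_rate(succeeded: int, attempted: int) -> float:
    # Assumption: with no replays attempted, report 1.0 so no alert fires
    return 1.0 if attempted == 0 else succeeded / attempted

def breached_thresholds(health: WebhookHealth) -> List[str]:
    """Return one message per breached threshold (empty list means healthy)."""
    alerts = []
    if health.p99_latency_ms > 2000:
        alerts.append(f"p99 latency {health.p99_latency_ms:.0f}ms exceeds 2000ms")
    if health.dlq_depth > 5:
        alerts.append(f"DLQ depth {health.dlq_depth} exceeds 5")
    if health.replay_success_rate < 0.80:
        alerts.append(f"replay success {health.replay_success_rate:.0%} is below 80%")
    return alerts

healthy = breached_thresholds(WebhookHealth(450.0, 0, replay_success_rate(19, 20)))
degraded = breached_thresholds(WebhookHealth(2600.0, 7, replay_success_rate(3, 5)))
```

Each returned message can be passed to whatever alerting hook is already in place, such as the Slack helper used elsewhere in this article.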

Webhook replay in practice

Replaying events from the DLQ is straightforward if you've stored the raw payload:

python

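def replay_dlq_event(event_id: str):
    event = db.get("SELECT * FROM dead_letter_queue WHERE event_id = %s", event_id)
    if event is None:
        raise LookupError(f"No DLQ entry for event {event_id}")
    # Clear any processed flag so the idempotency check lets the event through
    db.execute("DELETE FROM processed_events WHERE event_id = %s", event_id)
    process_event(event_id, event.payload)  # raises if processing fails again
    # Mark resolved only after success so the DLQ depth check stops counting it
    db.execute("UPDATE dead_letter_queue SET resolved = true WHERE event_id = %s", event_id)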

Two cases to handle explicitly: events that failed due to a temporary downstream outage (Stripe was down, HubSpot rate-limited), and events that failed due to a code bug. For the first, replay works immediately once the outage clears. For the second, you need to fix the code first, then replay.

Keep a failure_reason field in the DLQ. A simple admin query that shows the last error per event tells you instantly whether you're looking at an infrastructure issue ("connection refused") or a code bug ("KeyError: 'amount'"). The distinction determines whether you need ops intervention or a deploy. Either way, you make the decision in under a minute rather than digging through logs trying to reconstruct what happened.
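A first-pass triage of failure_reason can be automated with simple string matching. The marker lists below are illustrative guesses rather than an exhaustive taxonomy. Anything that matches neither list is reported as "unknown" and still needs a human look.

```python
INFRA_MARKERS = ("connection refused", "timed out", "timeout", "rate limit")
CODE_MARKERS = ("keyerror", "typeerror", "attributeerror")

def classify_failure(failure_reason: str) -> str:
    """Return 'infrastructure', 'code', or 'unknown' for a stored error string."""
    reason = failure_reason.lower()
    if any(marker in reason for marker in INFRA_MARKERS):
        return "infrastructure"  # likely ops issue: replay once the outage clears
    if any(marker in reason for marker in CODE_MARKERS):
        return "code"  # likely bug: fix and deploy before replaying
    return "unknown"

infra = classify_failure("connection refused")
bug = classify_failure("KeyError: 'amount'")
other = classify_failure("unexpected response shape")
```

Grouping the DLQ by this label gives a quick split between events that can be replayed right away and events that should wait for a deploy.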

If you're building webhook-driven automation and want it to handle retries, failures, and edge cases correctly, let's talk about the architecture.

For a production example of these patterns in a CRM pipeline handling 15+ automated workflows, see the CRM Operations Automation case study.
