There is a specific kind of failure that I have seen break more production automations than any other. It is not the failure that crashes loudly with an error message you cannot miss. It is the failure that does nothing. The workflow runs, the execution shows green, and somewhere in the middle, data quietly vanished.
You only discover it when a client asks why their Notion database has not been updated in three days, or when a payment record never made it from Stripe into your accounting system, or when you realize the contacts that were supposed to sync from your CRM last Tuesday simply do not exist anywhere.
This is what happens when automations are built for the demo, not for production. The demo always works. Production is where the real questions get answered: what happens when the API is down for 90 seconds? What happens when Stripe sends the same webhook twice? What happens when 3 of 200 items in a batch fail, and the other 197 keep going?
This post is about building n8n workflows that answer those questions correctly. That means retries, error workflows, dead-letter queues, idempotency, batch error handling, and monitoring. I am going to cover each of these in depth, with specific n8n patterns and code you can actually use.
Why the default n8n behavior loses data
When you build a workflow in n8n and do not configure any error handling, this is what happens when a node fails:
The execution stops at the point of failure. Every item that has not yet been processed is abandoned. No notification goes anywhere. The execution is marked as "Error" in n8n's execution list, but only if someone opens n8n and looks at that list. If you are running n8n unattended on a server, which is the whole point of automation, you will not know it happened.
// What n8n does by default when a node errors:
// 1. Execution stops immediately at the failing node.
// 2. All items that haven't run yet are silently skipped.
// 3. No notification is sent anywhere.
// 4. The execution is marked "Error" in the executions list
// — but only if someone opens n8n and looks.
//
// If you are running n8n unattended (as you should be),
// you will not know this happened. The data is gone.
This is the correct default for a development tool. When you are building and testing, you want immediate failure. You want to see what broke and where. Silent continuation would make debugging impossible.
But the moment a workflow goes into production and starts running on a schedule or responding to real webhook events, that default behavior becomes dangerous. The workflow that processes your incoming leads at 2am is not going to have anyone watching the execution log. The automation that syncs your CRM into Notion every hour will fail quietly for days before anyone notices.
The gap between "it works in testing" and "it runs reliably unattended" is almost entirely a question of error handling. Let me take you through each layer of it.
The two types of failure you need to handle differently
Before getting into specific patterns, it is worth being clear about what kind of failures we are dealing with, because the right response depends entirely on the cause.
Transient failures are temporary. The API was briefly unavailable, the network had a hiccup, the connection timed out because the server was overloaded. These failures often resolve on their own in seconds or minutes. The right response is to wait and retry.
Permanent failures will not resolve with retrying. The authentication token expired. The payload is malformed. The record you are trying to update does not exist. The API endpoint returned a 400 Bad Request because you sent something it does not understand. Retrying these wastes time and can make things worse.
Good error handling distinguishes between these two types and responds differently to each. Retrying a bad request 10 times does not help. Retrying a network timeout once or twice usually does.
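To make that distinction concrete, here is a minimal sketch of a classifier you could drop into a Code node. The function name `classifyFailure` and the exact grouping of status codes are my assumptions for illustration, not something n8n provides; adjust the sets to match the APIs you actually call.

```javascript
// Hypothetical helper: decide whether a failed HTTP call is worth retrying.
// The status-code groupings below are an assumption, not an n8n feature.
const PERMANENT_STATUSES = new Set([400, 401, 403, 404, 422]);
const TRANSIENT_STATUSES = new Set([408, 429, 500, 502, 503, 504]);

function classifyFailure(err) {
  const status = err?.response?.status;
  if (status === undefined) {
    // No HTTP response at all: timeout, DNS failure, dropped connection.
    return 'transient';
  }
  if (PERMANENT_STATUSES.has(status)) return 'permanent';
  if (TRANSIENT_STATUSES.has(status)) return 'transient';
  // Unknown codes are treated as permanent so that a human looks at them
  // instead of the workflow retrying blindly.
  return 'permanent';
}
```

A retry loop would call this first and only wait-and-retry when the result is `'transient'`.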
Retry strategies: what n8n gives you and when to use it
n8n has built-in retry functionality on every node. You find it in the node's Settings tab. Three settings control it: "Retry On Fail" (toggle), "Max Tries" (how many total attempts), and "Wait Between Tries" (seconds between each attempt).
This is simple to set up and covers a real class of failures. I use it on any node that talks to an external API, particularly for transient failures like connection timeouts or brief 503 errors. Two to three retries with a five-second wait between them will handle the vast majority of temporary network problems.
// n8n built-in retry settings (per node, in Settings tab)
// "Retry On Fail": toggle on
// "Max Tries": 2-5 (default 3 when retry is on)
// "Wait Between Tries": seconds between attempts
// What this buys you:
// - Transient network errors
// - Brief API unavailability
// - Connection timeouts on slow endpoints
// What this does NOT protect against:
// - Rate limit errors (429) — retrying immediately makes it worse
// - Auth errors (401, 403) — retrying does nothing
// - Malformed payload errors (400) — retrying does nothing
// - Downstream service outages lasting > a few seconds
There is an important limitation here. n8n's built-in retry uses a fixed wait time, not exponential backoff. For most transient errors, this is fine. For rate limit errors, it is not enough, and retrying immediately can make things significantly worse.
Immediate vs. exponential backoff
Exponential backoff means waiting longer on each successive retry. First retry: 1 second. Second retry: 2 seconds. Third retry: 4 seconds. Fourth: 8 seconds. The gap grows exponentially.
The reason for this pattern is not arbitrary. When you hit a rate limit or a service under load, retrying immediately adds to the load on that service. If 100 workflows all get rate-limited at the same moment and all retry in 5 seconds, you get another spike of requests in 5 seconds that causes another wave of failures. Exponential backoff spreads these retries out, reducing the probability of synchronized retry storms.
For rate limit scenarios specifically, you also want to respect the Retry-After header if the API sends one. Notion, Slack, and most well-designed APIs will tell you exactly how long to wait. Ignoring that header and retrying on your own schedule is both less effective and less polite to the API you depend on.
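As a rough sketch of how the two ideas combine: prefer the server's Retry-After value when there is one, and fall back to exponential backoff when there is not. The helper name `retryDelayMs` and the one-minute cap on the fallback are assumptions of mine. The header itself can carry either a number of seconds or an HTTP date, so the sketch handles both.

```javascript
// Hypothetical helper: how long to wait before the next attempt.
const BASE_DELAY_MS = 1000;
const MAX_BACKOFF_MS = 60000; // assumed cap for the fallback backoff only

function retryDelayMs(attempt, retryAfterHeader, now = Date.now()) {
  const raw = retryAfterHeader == null ? '' : String(retryAfterHeader).trim();
  if (raw !== '') {
    // Form 1: delay in seconds, e.g. "30"
    const seconds = Number(raw);
    if (Number.isFinite(seconds) && seconds >= 0) {
      return seconds * 1000;
    }
    // Form 2: an HTTP date, e.g. "Wed, 21 Oct 2015 07:28:10 GMT"
    const dateMs = Date.parse(raw);
    if (!Number.isNaN(dateMs)) {
      return Math.max(dateMs - now, 0);
    }
  }
  // No usable header: exponential backoff (1s, 2s, 4s, ...) with a cap.
  return Math.min(BASE_DELAY_MS * 2 ** attempt, MAX_BACKOFF_MS);
}
```

In a retry loop you would pass the zero-based attempt number and whatever the failed response exposed as its `retry-after` header, then sleep for the returned number of milliseconds.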
For complex retry logic, a Code node in n8n gives you full control:
// Exponential backoff: wait longer on each retry.
// n8n's "Wait Between Tries" is a fixed delay, not backoff.
// For true backoff, handle it in a Code node or use a loop.
// Exponential-backoff retry in a Code node:
const MAX_RETRIES = 4; // total attempts, including the first
const BASE_DELAY_MS = 1000;

async function callWithRetry(fn) {
  let lastError;
  for (let attempt = 0; attempt < MAX_RETRIES; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      // Do not retry auth errors or bad requests
      const status = err?.response?.status;
      if (status === 401 || status === 403 || status === 400) {
        throw err;
      }
      if (attempt < MAX_RETRIES - 1) {
        // Exponential: wait 1s, 2s, then 4s before the fourth and final attempt
        const delay = BASE_DELAY_MS * Math.pow(2, attempt);
        await new Promise(r => setTimeout(r, delay));
      }
    }
  }
  throw lastError;
}
Notice the logic that skips retrying on 401, 403, and 400 status codes. These are permanent failures. A 401 means your authentication is wrong. Retrying it 10 times does not change your authentication credentials. A 400 means you sent a malformed request. Retrying the same malformed request repeatedly does nothing. The code catches these cases and throws immediately, letting the error propagate to your error handler where it belongs.
When retrying makes things worse
There is a category of operation where retrying without idempotency guarantees creates more damage than the original failure. If an operation is not idempotent, meaning if running it twice produces a different result than running it once, then retrying it can cause duplicates.
Creating a new record in a database is the most common example. If you create a Notion page, the request succeeds, but the response gets lost before n8n receives it, n8n will think the node failed and retry. Now you have two identical Notion pages. If this is a client record, a billing entry, or a contract, that duplication is a real problem.
The solution for this class of operation is to make the operation idempotent before you retry it. I cover that in detail in the idempotency section. But the short version is: always check if the operation you are about to perform has already been performed before performing it again.
Error workflows: building a central error handler in n8n
n8n has a built-in mechanism for centralized error handling: the Error Trigger node. When you create a workflow that starts with an Error Trigger and then connect that workflow to another workflow, n8n will automatically fire the error handler workflow whenever the connected workflow fails.
This is one of the most underused features in n8n. Most people either ignore errors entirely or add error handling directly inside each workflow, which means maintaining error handling logic in 20 different places. A central error handler solves this: one workflow, consistent behavior, and a single place to improve your alerting over time.
The Error Trigger node and what it gives you
When your error handler workflow fires, the Error Trigger node outputs a single item containing everything you need to understand what went wrong:
// Error Trigger node — fires whenever any execution in the
// connected workflow fails. The input item contains:
{
"execution": {
"id": "1234",
"url": "https://your-n8n.com/execution/1234",
"retryOf": null,
"error": {
"message": "Request failed with status code 429",
"stack": "Error: Request failed...",
"name": "NodeApiError"
},
"lastNodeExecuted": "Update Notion Page",
"mode": "trigger"
},
"workflow": {
"id": "42",
"name": "CRM Sync — Attio to Notion",
"url": "https://your-n8n.com/workflow/42"
}
}
// Key fields to capture:
//   workflow.name: which workflow failed
//   execution.id: link directly to the failed run
//   execution.error.message: what went wrong
//   execution.lastNodeExecuted: which node was running
The fields you almost always want to capture and act on are: the workflow name (so you know which automation failed), the execution ID (which gives you a direct link to the failed run), the error message (what went wrong), the last node that executed (where it went wrong), and a timestamp (when it went wrong). The payload above carries no timestamp of its own, so the handler generates one when it runs.
With these five pieces of information, anyone on your team can open the failed execution in n8n and understand immediately what happened and where to look.
Building the error handler workflow
My standard error handler workflow has four nodes after the Error Trigger. First, a Code node that extracts and formats the fields I need. Second, a node that writes to a persistent error log (more on what that looks like in a moment). Third, a Slack node that sends an alert to a dedicated errors channel. Fourth, an email node as a fallback in case the Slack node itself fails.
The Code node that extracts fields is worth being explicit about, because the raw Error Trigger output is nested and verbose. You want a flat, clean object that downstream nodes can work with easily:
// Central error handler workflow structure:
//
// [Error Trigger]
// |
// [Extract fields] (Code node)
// |
// [Write to error log] (Notion / Google Sheet / Postgres)
// |
// [Send Slack alert]
// |
// [Send email fallback] (if Slack fails)
// Code node: Extract fields
const execution = $input.first().json.execution;
const workflow = $input.first().json.workflow;
return [{
  json: {
    workflow_name: workflow.name,
    workflow_id: workflow.id,
    execution_id: execution.id,
    execution_url: execution.url,
    error_message: execution.error?.message || 'Unknown error',
    failed_node: execution.lastNodeExecuted,
    timestamp: new Date().toISOString(),
    status: 'failed',
    retry_count: 0
  }
}];
The error log database
I use a Notion database as my error log. Not because Notion is the only option, but because it is usually already in the client's stack, it is easy to query and filter, and it gives you a place to add notes about each error as you investigate.
The schema I use:
// Notion database schema for the error log:
// Properties:
// Workflow Name — Title
// Error Message — Text
// Failed Node — Text
// Execution URL — URL
// Timestamp — Date
// Status — Select: failed | retrying | resolved
// Retry Count — Number
// Notes — Text (for manual investigation notes)
// Notion API call (via n8n Notion node):
// Operation: Create a database item
// Database ID: your-error-log-db-id
// Properties:
{
"Workflow Name": { "title": [{ "text": { "content": "{{ $json.workflow_name }}" } }] },
"Error Message": { "rich_text": [{ "text": { "content": "{{ $json.error_message }}" } }] },
"Failed Node": { "rich_text": [{ "text": { "content": "{{ $json.failed_node }}" } }] },
"Execution URL": { "url": "{{ $json.execution_url }}" },
"Timestamp": { "date": { "start": "{{ $json.timestamp }}" } },
"Status": { "select": { "name": "failed" } },
"Retry Count": { "number": 0 }
}
The Status field is important. It starts at "failed" when the record is created. When I am actively investigating, I change it to something that indicates that. When the issue is resolved, I mark it resolved. This gives you a clear queue of outstanding problems and a history of what was fixed and when.
The Notes field is where the real value accumulates over time. When you fix an issue, write down what caused it and what you did. Six months later when the same error recurs, you will thank yourself for that note.
The Slack alert
The Slack alert is what makes error handling actionable in practice. Writing to a database is good for record-keeping. An alert in Slack is what actually gets someone to look at it within the next few minutes.
I use a dedicated #n8n-errors channel for this. Putting automation errors in a general team channel creates noise that makes people tune it out. A dedicated channel means anyone who sees a message there knows it needs attention.
// Slack alert message for failed execution:
// Send to a dedicated #n8n-errors channel.
// Use Block Kit for readable formatting.
{
"channel": "#n8n-errors",
"blocks": [
{
"type": "header",
"text": {
"type": "plain_text",
"text": "Workflow Failed",
"emoji": true
}
},
{
"type": "section",
"fields": [
{ "type": "mrkdwn", "text": "*Workflow:*\n{{ $json.workflow_name }}" },
{ "type": "mrkdwn", "text": "*Failed Node:*\n{{ $json.failed_node }}" },
{ "type": "mrkdwn", "text": "*Error:*\n{{ $json.error_message }}" },
{ "type": "mrkdwn", "text": "*Time:*\n{{ $json.timestamp }}" }
]
},
{
"type": "actions",
"elements": [
{
"type": "button",
"text": { "type": "plain_text", "text": "View Execution" },
"url": "{{ $json.execution_url }}"
}
]
}
]
}
The "View Execution" button in the Slack message is the most practically useful part of this. One click takes you directly to the failed execution in n8n, where you can see exactly which node failed, what data was flowing through it, and what error it produced. This turns "something failed" into "this specific node failed on this specific data for this specific reason" in about 30 seconds.
Connecting the error handler to your workflows
In n8n, you connect an error handler workflow to another workflow through the workflow settings. Open the workflow you want to protect, go to its settings, and set the "Error Workflow" field to your central error handler workflow.
You need to do this for every workflow that should report errors. Yes, this is a manual step per workflow. Yes, it is worth doing. For anything business-critical, from CRM syncs to payment processing to client onboarding automations, this connection is not optional.
One thing to be aware of: the error handler workflow itself can fail. If your Slack workspace is down, the Slack node in your error handler will error. This is why I add the email fallback. If Slack fails, the email still goes out. The email might be less pretty, but it gets through.
Dead-letter patterns: what happens when retries are not enough
Retries handle transient failures. Error workflows capture and alert on failures. But there is a third layer that most n8n setups are missing: a place for items that have failed multiple times and need human intervention before they can be processed.
In message queue systems, this place is called a dead-letter queue. The concept is simple. When a message fails processing after a configured number of retries, instead of being silently dropped, it is moved to a separate store where it waits for someone to look at it, fix whatever caused the failure, and replay it.
The key word there is "replay." A dead-letter pattern is not just about knowing something failed. It is about preserving the original data so that when you fix the underlying issue, you can process the data that was lost without having to recreate it from scratch.
Building a dead-letter database
In n8n, you build this pattern with a database that stores failed items along with enough context to replay them later. I use Notion for this too, though a spreadsheet or any database works fine.
// Dead-letter queue pattern in n8n.
// After N retries, items that still fail move to a "dead letter" store
// where they wait for manual review and replay.
// Notion dead-letter database schema:
// Properties:
// Item ID — Title (the original record's unique ID)
// Payload — Text (JSON stringified original data)
// Workflow — Text (which workflow failed)
// Node — Text (which node failed)
// Error — Text (error message)
// Retry Count — Number
// First Failed — Date
// Last Attempted — Date
// Status — Select: waiting | replaying | resolved | abandoned
// To replay an item:
// 1. Change Status to "replaying"
// 2. Trigger a replay workflow via webhook or manual trigger
// 3. The replay workflow reads payload from this database
// 4. If it succeeds, update Status to "resolved"
// 5. If it fails again, increment Retry Count and reset to "waiting"
// Replay trigger workflow (simplified):
// notionClient is assumed to be an already-initialized Notion SDK client;
// it is not an n8n built-in. In n8n itself, a Notion node does this step.
const deadLetterItem = $input.first().json;
const payload = JSON.parse(deadLetterItem.payload);
// Re-run the original processing logic with the saved payload
// ... your normal processing code here ...
// On success, mark resolved in Notion:
await notionClient.pages.update({
  page_id: deadLetterItem.notion_page_id,
  properties: {
    Status: { select: { name: 'resolved' } }
  }
});
The Payload field is the critical one. It holds the original data that the workflow was trying to process when it failed, JSON-stringified and stored in full. When you are ready to replay, you read this payload back out, run it through the processing logic again, and if it succeeds, mark the record resolved.
The First Failed and Last Attempted dates give you visibility into items that have been sitting in the dead-letter store for a long time. An item that first failed two weeks ago and has never been retried is a signal that something needs attention. Maybe the fix is simple. Maybe there is a systemic issue that is blocking a whole category of items.
The replay workflow
A replay workflow is a separate workflow that reads from the dead-letter store and re-processes the stored items. The simplest version: a webhook trigger, a Notion query to fetch items in "waiting" status, and then the same processing logic as the original workflow.
When you replay an item, you want to be careful about two things. First, make sure you are replaying with the original data, not some modified version. The point of storing the payload is to have an exact copy of what the system was trying to process. Second, handle the case where replay also fails. If it fails again, increment the retry count, update the Last Attempted date, and leave the Status as "waiting" for another attempt later. If it succeeds, mark it resolved.
The Status field is what lets you use the dead-letter store as a work queue. You can filter by Status: waiting to see what needs attention, by Status: replaying to see what is actively being retried, and by Status: resolved to see the history of what was fixed.
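The replay bookkeeping described above is small enough to sketch as a pure function. This is an illustration under my own assumptions: the function name `applyReplayResult` and the camelCase field names are hypothetical stand-ins for the Notion properties in the schema, and the actual write back to Notion is left out.

```javascript
// Hypothetical sketch: compute the next state of a dead-letter record
// after a replay attempt. The stored payload is never modified.
function applyReplayResult(record, succeeded, now = new Date()) {
  const attemptedAt = now.toISOString();
  if (succeeded) {
    return { ...record, status: 'resolved', lastAttempted: attemptedAt };
  }
  // Replay failed again: count the attempt and put it back in the queue.
  return {
    ...record,
    status: 'waiting',
    retryCount: record.retryCount + 1,
    lastAttempted: attemptedAt,
  };
}
```

Returning a new object rather than mutating the input keeps the original record available for logging if the write to the dead-letter store itself fails.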
When to route to dead letters vs. immediate alert
Not every failure should go to the dead-letter store. Some failures need immediate human attention, not a queue. If a payment fails to process, you want to know that now, not discover it during a weekly dead-letter review.
The distinction I use: route to dead letters when the item can be safely replayed later without time-sensitive consequences. Send an immediate alert when the failure has business consequences that get worse with time. In practice, payments and anything customer-facing go to immediate alert. Background data syncs, report generation, and internal data processing go to dead letters with a standard alert cadence.
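If you want that routing decision in one place rather than scattered across IF nodes, a tiny lookup is enough. The category labels and the function name `routeFailure` here are hypothetical; the assumption is that each workflow tags its failures with a category before handing them to the error handler.

```javascript
// Hypothetical routing rule: time-sensitive categories alert immediately,
// everything else waits in the dead-letter store.
const IMMEDIATE_ALERT_CATEGORIES = new Set(['payment', 'customer_facing']);

function routeFailure(failure) {
  return IMMEDIATE_ALERT_CATEGORIES.has(failure.category)
    ? 'immediate_alert'
    : 'dead_letter';
}
```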
Idempotency: the concept that separates reliable automations from fragile ones
I mentioned idempotency briefly in the retry section. Now I want to go deep on it, because it is the concept that most n8n workflows are missing and it is responsible for a large fraction of the "weird duplicate data" problems I have been called in to diagnose.
Idempotency means that performing an operation multiple times produces the same result as performing it once. An idempotent operation is safe to retry. A non-idempotent operation is not.
Updating a record to have a specific value is idempotent. Running it 10 times, the record ends up with that value. Running it once, same result.
Creating a new record is not idempotent. Running it 10 times creates 10 records. Running it once creates one.
Sending an email is not idempotent. Running it 10 times sends 10 emails.
Charging a customer is definitely not idempotent.
The challenge in automation is that workflows that include non-idempotent operations will inevitably be run more than once on the same input. Webhook providers retry deliveries. n8n itself has a retry mechanism. Someone may manually replay a failed execution. These are all legitimate, expected behaviors. The question is whether your workflow handles them correctly.
Idempotency keys
An idempotency key is a stable, unique identifier for a unit of work. "Stable" means it is the same every time the same operation is attempted. "Unique" means it is different for genuinely different operations.
The best idempotency keys are the ones provided by the triggering system. Stripe event IDs are globally unique and stable. Webhook delivery IDs, when the provider includes them, are stable. Record IDs from your CRM, combined with the operation type, can serve as keys.
The worst idempotency keys are ones you generate yourself at the start of a run. A UUID generated when the workflow starts is different every time the workflow runs, even if it is processing the same data. That defeats the purpose entirely.
// Idempotency key: a stable, unique identifier for each unit of work.
// The key must be the same if the same operation is retried.
// Good idempotency keys:
// - Stripe event ID: evt_1OoX2tBzX2x9Y (globally unique)
// - Webhook delivery ID: header X-Webhook-Id or similar
// - Record ID + operation: notion_page_abc123_update_crm_status
// - Order ID + step: ORD-8821_send_confirmation
// Bad idempotency keys:
// - Current timestamp (changes on each run)
// - UUID generated at start of run (different on retry)
// - Record ID alone (same record might need multiple operations)
// In n8n: store processed keys in a Notion database, Google Sheet,
// or a Redis cache if you have one available.
// Check-before-process pattern:
const itemId = $input.first().json.stripe_event_id; // stable key
// 1. Query your idempotency store.
// Assumes this.helpers.httpRequest and $env are available to the Code node
// in your n8n version and that NOTION_TOKEN is set. If not, use an
// HTTP Request node with stored credentials instead.
const existing = await this.helpers.httpRequest({
  method: 'POST',
  url: 'https://api.notion.com/v1/databases/YOUR_DB_ID/query',
  headers: {
    'Authorization': 'Bearer ' + $env.NOTION_TOKEN,
    'Notion-Version': '2022-06-28'
  },
  body: {
    filter: {
      property: 'Event ID',
      rich_text: { equals: itemId }
    }
  },
  json: true
});
if (existing.results.length > 0) {
  // Already processed: stop this branch
  return []; // empty output stops downstream nodes
}
// 2. Record as in-progress before doing any work
// (prevents race conditions if two triggers fire simultaneously)
Why webhook retries cause duplicates
When an external service (Stripe, Attio, or any other webhook provider) sends a webhook to your n8n workflow, it expects a successful 2xx response within a certain time window. If n8n takes too long to respond, or if the connection drops before the response is delivered, the provider assumes the delivery failed and retries.
From n8n's perspective, this looks like two separate trigger events. Without idempotency checks, n8n has no way to know it already processed this event. It processes it again. If processing includes creating a record or sending a notification, you now have two of each.
The pattern is straightforward: before processing any webhook event, check whether you have already processed an event with this ID. If yes, return immediately and do nothing. If no, record the ID and process.
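The check-then-record step can be sketched with an in-memory store. This is only an illustration: `createIdempotencyStore` is a hypothetical helper, and a `Set` in memory does not survive restarts or span executions, so in a real n8n workflow the same two operations would run against Notion, a sheet, or Redis.

```javascript
// Hypothetical in-memory idempotency store, for illustration only.
function createIdempotencyStore() {
  const seen = new Set();
  return {
    // Returns true the first time an ID is claimed, false on every repeat.
    claim(eventId) {
      if (seen.has(eventId)) return false;
      seen.add(eventId);
      return true;
    },
    // Release a claim when processing failed, so a later retry can run.
    release(eventId) {
      seen.delete(eventId);
    },
  };
}

// Usage: skip the event when the claim fails.
const store = createIdempotencyStore();
const firstDelivery = store.claim('evt_123');  // true: process it
const secondDelivery = store.claim('evt_123'); // false: duplicate, do nothing
```

Claiming the ID before doing the work, and releasing it only on failure, is what keeps a provider retry that arrives mid-processing from starting a second run.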
For a deeper look at webhook delivery mechanics, idempotency, and signature verification, I wrote a full post covering these concepts in the context of webhook handlers generally. The principles there apply directly to how n8n receives and processes webhook triggers.
Implementing idempotency in n8n without a dedicated database
The simplest approach in n8n is a Notion database or Google Sheet as your idempotency store. Before processing, query the store for the event ID. If found, output an empty array to stop the workflow. If not found, insert the event ID and continue.
This is not as fast as a Redis cache, but for most n8n use cases it is more than sufficient. A Notion query for a single record takes under a second. The overhead is acceptable for workflows that are not processing thousands of events per minute.
If you are running high-volume workflows and the Notion query latency is a concern, n8n can also talk to Redis, either through its built-in Redis node or through HTTP calls to a Redis REST API. Redis can typically check and set an idempotency key in a few milliseconds.
Upsert patterns: making create operations idempotent
When you need to create-or-update a record based on an identifier, rather than always creating, the upsert pattern makes that operation idempotent.
Instead of blindly creating a new record, you first search for an existing record with the matching ID. If it exists, update it. If it does not exist, create it. Either way, the result is exactly one record with the correct data.
// Upsert pattern: safe to run multiple times, produces same result.
// Instead of: INSERT then handle duplicate errors
// Use: INSERT OR UPDATE based on a unique key
// In n8n Notion node:
// Operation: "Update a database item" with a search-first pattern.
// Or use the "Create or Update" (upsert) pattern:
// Step 1: Search for existing record
// (notionClient and contactId are assumed to be defined earlier:
// an initialized Notion SDK client and the contact's stable Attio ID.)
const searchResult = await notionClient.databases.query({
  database_id: 'your-crm-db-id',
  filter: {
    property: 'Attio Contact ID',
    rich_text: { equals: contactId }
  }
});
if (searchResult.results.length > 0) {
  // Record exists: update it
  await notionClient.pages.update({
    page_id: searchResult.results[0].id,
    properties: { /* updated fields */ }
  });
} else {
  // Record does not exist: create it
  await notionClient.pages.create({
    parent: { database_id: 'your-crm-db-id' },
    properties: { /* all fields including Attio Contact ID */ }
  });
}
// This is safe to run 10 times on the same contact.
// The result is always one record, always up to date.
In n8n, you can implement this pattern with a Notion node in Query mode followed by an IF node that branches on whether the query returned results. Many n8n nodes also have built-in upsert operations. The Notion node has "Update a database item" which can be paired with a prior search. The Airtable node has a similar capability. Check your specific node for what it supports natively before writing a custom Code node to do it manually.
Upserts are also the correct pattern for CRM syncs, where the source system sends you an update event for a record that may or may not already exist in your destination. Using upserts means you never have to worry about whether your sync was running when the record was first created. It will create it on the first sync and update it on every subsequent one.
Partial failure in batch processing: the problem with all-or-nothing
One of the most common n8n patterns is processing a list of items: all new contacts from a CRM, all invoices from a billing system, all records that match some filter. The default behavior when any item fails is to stop the entire batch. This is the worst possible default for production use.
Imagine you are syncing 200 contacts from Attio into Notion. Item 47 has malformed data, maybe a Unicode character in the name field that the Notion API does not accept. n8n fails at item 47 and stops. Items 48 through 200 never run. You now have 46 synced contacts and 154 that will not be synced until you manually investigate, fix the problem, and replay the entire batch.
The better approach is to let each item succeed or fail independently, capture which ones failed and why, and continue processing the rest. This is exactly what "Continue on Fail" enables.
Continue on Fail: what it does and what to do with the errors
Enable Continue on Fail in any node's Settings tab (newer n8n versions expose this as the "On Error" setting with a Continue option). When this is on, items that cause errors pass through to the next node with an "error" key added to their data, rather than stopping the execution.
After the node with Continue on Fail enabled, you split the output using an IF node that checks whether the error key exists. Items with errors go to one branch, which logs them to your error database. Items without errors continue to the normal downstream processing.
// Continue on Fail: per-node setting in n8n.
// When enabled: if the node errors on an item, that item gets an
// "error" key in its output, and processing continues downstream.
// Items that succeeded pass through normally.
// After a node with Continue on Fail, split your flow:
// Use an IF node to separate errored items from successful ones.
// Check for error in IF node condition:
// Left value: {{ $json.error }}
// Condition: exists (is not empty)
// True branch (errored items):
// - Write to your error log
// - Continue with next batch item
// False branch (successful items):
// - Continue with normal downstream processing
// What the error output looks like:
{
"error": {
"message": "Could not find page",
"name": "NodeApiError",
"status": 404
},
"original_data": {
"contact_id": "abc123",
"name": "Jane Smith"
}
}
The important part of this pattern is that the error items get logged with enough context to replay them later. You want the original data (so you can fix and replay it), the error message (so you understand what went wrong), and a reference to which workflow and node failed (so you can find the right place to fix it).
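If you would rather do the split in a Code node than in an IF node, the same check is a few lines. This is a sketch that assumes the item shape shown above, with errored items carrying an `error` key inside `json`; `partitionByError` is a hypothetical name.

```javascript
// Hypothetical helper: separate errored items from successful ones.
// Assumes n8n-style items of the form { json: { ... } }.
function partitionByError(items) {
  const failed = [];
  const succeeded = [];
  for (const item of items) {
    if (item.json && item.json.error) {
      failed.push(item);
    } else {
      succeeded.push(item);
    }
  }
  return { failed, succeeded };
}
```

The `failed` list goes to your error log with its original data intact, and the `succeeded` list carries on to the normal downstream nodes.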
Loop Over Items: controlling batch size and failure isolation
n8n's Loop Over Items node processes items one at a time (or in configurable batch sizes). This is useful for rate-limit compliance, but it is also useful for failure isolation. When you process items inside a loop with Continue on Fail enabled on critical nodes, a failure on item 47 does not affect items 48 through 200. Each iteration is independent.
At the end of a batch loop, I add a Code node that builds a run summary: how many items were processed, how many succeeded, how many failed, and what the individual errors were. This summary can then be sent to Slack or written to a report database, giving you a clear picture of each batch run without having to dig through individual execution logs.
// Processing a batch of items with per-item error tracking.
// Pattern: Loop Over Items node + Continue on Fail + status tracking.
// Workflow structure:
// [Trigger] → [Get all contacts] → [Loop Over Items]
//                                          |
//                                [Process one contact]
//                                (Continue on Fail: ON)
//                                          |
//                                 [IF: did it error?]
//                                   |             |
//                              [Log error]  [Mark success]
//                                   |             |
//                                   +-→ [Merge] ←-+
// Code node to build a run summary:
const items = $input.all();
let succeeded = 0;
let failed = 0;
const errors = [];

for (const item of items) {
  if (item.json.error) {
    failed++;
    errors.push({
      // The error output shown earlier keeps the record ID under
      // original_data.contact_id; fall back to .id for other sources.
      id: item.json.original_data?.contact_id ?? item.json.original_data?.id,
      error: item.json.error.message ?? String(item.json.error)
    });
  } else {
    succeeded++;
  }
}

return [{
  json: {
    total: items.length,
    succeeded,
    failed,
    errors,
    run_at: new Date().toISOString()
  }
}];
This summary is also the trigger for routing items to your dead-letter store. If the failed count is greater than zero, write each failed item (with its payload and error) to the dead-letter database and alert via Slack. The batch continues to completion. Nothing is lost. The failed items are waiting to be replayed when you have fixed the underlying issue.
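As a rough sketch of the alerting half of that step, the function below turns a run summary into a Slack message body, or returns null when nothing failed. The name `buildBatchAlert`, the five-line cap, and the message wording are illustrative assumptions; the `{ text }` shape is what a Slack incoming webhook accepts.

```javascript
// Turn a run summary into a Slack message body, or null when nothing failed.
function buildBatchAlert(summary, workflowName) {
  if (summary.failed === 0) return null;
  const shown = summary.errors.slice(0, 5); // keep the alert short
  const lines = shown.map((e) => `- ${e.id ?? 'unknown id'}: ${e.error}`);
  const hidden = summary.errors.length - shown.length;
  const more = hidden > 0 ? `\n...and ${hidden} more` : '';
  return {
    text:
      `${workflowName}: ${summary.failed} of ${summary.total} items failed ` +
      `(${summary.succeeded} succeeded)\n${lines.join('\n')}${more}`,
  };
}

const alert = buildBatchAlert(
  {
    total: 200,
    succeeded: 197,
    failed: 3,
    errors: [
      { id: 'abc123', error: 'Could not find page' },
      { id: 'def456', error: 'Rate limited' },
      { id: 'ghi789', error: 'Validation failed' },
    ],
  },
  'CRM to Notion sync'
);
const noAlert = buildBatchAlert(
  { total: 10, succeeded: 10, failed: 0, errors: [] },
  'CRM to Notion sync'
);
console.log(alert.text);
console.log(noAlert);
```

The alert carries only a short preview. The full payloads belong in the dead-letter store, where they can be queried and replayed.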
Splitting large batches to limit blast radius
For very large batches, there is another consideration: if a batch of 2000 items fails partway through due to a systemic issue, you have a much larger cleanup job than if you had processed the same 2000 items in batches of 50.
In n8n, the Loop Over Items node has a Batch Size setting. Setting this to 50 means the loop processes 50 items at a time, writes results, then continues. If something goes wrong at item 847, you have batches 1 through 16 fully processed and written, and only batch 17 needs to be replayed. This limits the blast radius of any single failure significantly.
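The arithmetic behind that example can be checked with a small sketch. The helpers `chunk` and `batchNumberFor` are hypothetical names used only for illustration; n8n's Loop Over Items node does the batching for you.

```javascript
// Split an array into consecutive batches of `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// 1-based batch number that contains a given 1-based item position.
function batchNumberFor(itemPosition, size) {
  return Math.ceil(itemPosition / size);
}

const items = Array.from({ length: 2000 }, (_, i) => i + 1);
const batches = chunk(items, 50);
const failedBatch = batchNumberFor(847, 50);

console.log(batches.length); // 40
console.log(failedBatch); // 17
console.log(batches[failedBatch - 1][0], batches[failedBatch - 1][49]); // 801 850
```

With a batch size of 50, item 847 falls in batch 17 (items 801 through 850), so at most 50 items need replaying rather than 2000.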
The tradeoff is speed. Processing in smaller batches is slower than processing everything at once. For most operational workflows this is an acceptable tradeoff. The occasional business that processes millions of records per hour needs a different architecture than n8n anyway.
Tracking which items succeeded across runs
When you replay a failed batch, you need to avoid reprocessing items that already succeeded. This is where the idempotency patterns from the previous section connect with batch processing. If every item in your batch has a stable unique ID, and you check that ID before processing, then replaying the full batch is safe: already-processed items are skipped, and only the failed ones actually run.
Without this, replaying a batch means running through 200 items again and hoping your processing is idempotent. Sometimes it is, sometimes it is not. Building the idempotency check in explicitly means you do not have to hope.
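One way to make that check explicit is to filter the batch against the set of already-processed IDs before any processing happens. This is a sketch under assumptions: `selectUnprocessed` is a hypothetical helper, `processedIds` stands in for whatever idempotency store you use, and `contact_id` is assumed to be the stable ID field.

```javascript
// Separate items that still need processing from items already handled.
function selectUnprocessed(items, processedIds, getId = (item) => item.json.contact_id) {
  const toProcess = [];
  const skipped = [];
  for (const item of items) {
    const id = getId(item);
    if (id === undefined || id === null) {
      // Without a stable ID there is no safe way to tell a replay from a new item.
      throw new Error('Item has no stable ID; it cannot be replayed safely');
    }
    (processedIds.has(id) ? skipped : toProcess).push(item);
  }
  return { toProcess, skipped };
}

const processedIds = new Set(['a1', 'a2']);
const batch = [
  { json: { contact_id: 'a1' } },
  { json: { contact_id: 'a2' } },
  { json: { contact_id: 'a3' } },
];
const { toProcess, skipped } = selectUnprocessed(batch, processedIds);
console.log(toProcess.map((i) => i.json.contact_id)); // [ 'a3' ]
console.log(skipped.length); // 2
```

Throwing on a missing ID is a deliberate choice in this sketch: an item that cannot be identified is better sent to the error path than processed twice.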
For more on how batch processing interacts with API rate limits, I covered rate limit mechanics, exponential backoff, and queue throttling in depth in a dedicated post. The two topics are tightly connected: good batch error handling also needs to account for rate limit responses from the APIs you are calling.
The dead-letter queue: preserving failed items for replay
Earlier I described the concept of a dead-letter pattern and what the database schema looks like. Now I want to be specific about how to build the queue and how to operate it.
The dead-letter queue serves two purposes. One: it preserves failed items so data is never permanently lost due to a transient or fixable failure. Two: it gives you a structured place to track outstanding problems and confirm when they are resolved.
// Dead-letter queue pattern in n8n.
// After N retries, items that still fail move to a "dead letter" store
// where they wait for manual review and replay.
// Notion dead-letter database schema:
// Properties:
// Item ID — Title (the original record's unique ID)
// Payload — Text (JSON stringified original data)
// Workflow — Text (which workflow failed)
// Node — Text (which node failed)
// Error — Text (error message)
// Retry Count — Number
// First Failed — Date
// Last Attempted — Date
// Status — Select: waiting | replaying | resolved | abandoned
// To replay an item:
// 1. Change Status to "replaying"
// 2. Trigger a replay workflow via webhook or manual trigger
// 3. The replay workflow reads payload from this database
// 4. If it succeeds, update Status to "resolved"
// 5. If it fails again, increment Retry Count and reset to "waiting"
// Replay trigger workflow (simplified):
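const deadLetterItem = $input.first().json;

// The Payload property holds the original data as a JSON string.
let payload;
try {
  payload = JSON.parse(deadLetterItem.payload);
} catch (err) {
  throw new Error(`Dead-letter payload is not valid JSON: ${err.message}`);
}

// Re-run the original processing logic with the saved payload
// ... your normal processing code here ...

// On success, mark resolved in Notion.
// notionClient is assumed to be an initialized @notionhq/client Client;
// inside n8n, a Notion node placed after this Code node can do the same update.
await notionClient.pages.update({
  page_id: deadLetterItem.notion_page_id,
  properties: {
    Status: { select: { name: 'resolved' } }
  }
});
The workflow that writes to the dead-letter queue is triggered by your central error handler or by the Continue on Fail branch in your batch processing workflows. When an item lands in the dead-letter store, it is not dropped. It is waiting.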
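The status transitions in replay steps 4 and 5 can be sketched as a small pure function, which keeps the rules testable outside n8n. The function name `applyReplayResult` and the camelCase field names are assumptions; map them to your own dead-letter properties.

```javascript
// Apply the outcome of one replay attempt to a dead-letter record.
// Success resolves the record; failure increments the retry count and
// puts the record back in "waiting".
function applyReplayResult(record, succeeded, nowIso = new Date().toISOString()) {
  if (record.status !== 'replaying') {
    throw new Error(`Expected status "replaying", got "${record.status}"`);
  }
  if (succeeded) {
    return { ...record, status: 'resolved', lastAttempted: nowIso };
  }
  return {
    ...record,
    status: 'waiting',
    retryCount: record.retryCount + 1,
    lastAttempted: nowIso,
  };
}

const inFlight = { itemId: 'abc123', status: 'replaying', retryCount: 2 };
const resolved = applyReplayResult(inFlight, true, '2024-01-10T12:00:00.000Z');
const requeued = applyReplayResult(inFlight, false, '2024-01-10T12:00:00.000Z');
console.log(resolved.status, resolved.retryCount); // resolved 2
console.log(requeued.status, requeued.retryCount); // waiting 3
```

Refusing to update a record that is not in "replaying" guards against two replay runs touching the same item at once.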
Building the replay mechanism
The replay mechanism is a separate workflow. Its trigger can be manual (you go in, change a record's status to "replaying", and trigger the workflow manually) or automated (a scheduled workflow that queries for waiting items and replays them once an hour).
For most setups I recommend starting with manual replay. This forces you to look at each failed item, understand why it failed, and confirm the underlying issue is fixed before replaying. Automated replay is convenient, but it can mask systemic issues if you are not careful. An item that keeps failing and getting replayed automatically can generate noise without anyone noticing the root cause.
Once your error handling is mature and your common failure modes are well understood, you can add automated replay for specific categories of failures where you are confident the retry is safe.
Alerting cadence from the dead-letter store
In addition to the real-time alerts from your error handler workflow, I add a scheduled workflow that runs daily and reports on the state of the dead-letter queue. The daily report shows: how many items are in waiting status, how many were added in the last 24 hours, how many are more than a week old.
Items older than a week are a specific concern. They usually indicate either a systemic issue that keeps blocking replay, or an item that someone forgot about. The daily report keeps that category visible without requiring someone to manually check the dead-letter database every day.
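A minimal sketch of the three numbers in that daily report, assuming each dead-letter record has been read into an object with `status` and `firstFailed` fields (hypothetical names). Counting only waiting items as "older than a week" is also an assumption about what the report should highlight.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Summarize the dead-letter store for the daily report.
function summarizeDeadLetters(records, now = new Date()) {
  const ageMs = (r) => now.getTime() - new Date(r.firstFailed).getTime();
  const waiting = records.filter((r) => r.status === 'waiting');
  return {
    waiting: waiting.length,
    addedLast24h: records.filter((r) => ageMs(r) <= DAY_MS).length,
    olderThanWeek: waiting.filter((r) => ageMs(r) > 7 * DAY_MS).length,
  };
}

const report = summarizeDeadLetters(
  [
    { status: 'waiting', firstFailed: '2024-01-09T12:00:00Z' },
    { status: 'waiting', firstFailed: '2024-01-01T00:00:00Z' },
    { status: 'resolved', firstFailed: '2024-01-09T18:00:00Z' },
    { status: 'waiting', firstFailed: '2024-01-05T00:00:00Z' },
  ],
  new Date('2024-01-10T00:00:00Z')
);
console.log(report); // { waiting: 3, addedLast24h: 2, olderThanWeek: 1 }
```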
Monitoring: the difference between an automation that works in a demo and one that runs unattended
Error handling is reactive. Something goes wrong and you find out. Monitoring is proactive. You are watching the system continuously and you can detect problems before they cause visible damage.
For most n8n deployments, full observability infrastructure is overkill. You do not need Datadog. You do need a few specific things that will catch the failure modes that error handling alone will not.
The silent skips problem
Error handling catches nodes that throw exceptions. It does not catch nodes that succeed but silently produce no output when they should have produced output.
This is more common than you might expect. A Notion query that returns zero results does not error. It succeeds with an empty array. If you expected it to return 50 records and it returned 0, your workflow will continue to run happily, processing nothing, and everything downstream will be silently skipped. No error, no alert, no record in your error log. Just missing data.
The way to catch this is count assertions. After any critical node, add a Code node that checks whether the output count matches your expectation. If the query should have returned at least one item and it returned zero, that is an error condition. Throw an error. Let it propagate to your error handler.
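A minimal sketch of a minimum-count assertion. The helper name `assertMinCount` is hypothetical. It counts only items with a non-empty `json`, on the assumption that the upstream node has Always Output Data enabled and therefore emits one empty item when the query finds nothing.

```javascript
// Throw when a query returned fewer real items than expected.
function assertMinCount(items, min, label) {
  const realItems = items.filter((item) => Object.keys(item.json ?? {}).length > 0);
  if (realItems.length < min) {
    throw new Error(`${label}: expected at least ${min} item(s), got ${realItems.length}`);
  }
  return realItems;
}

// In an n8n Code node this would be:
//   return assertMinCount($input.all(), 1, 'Notion contacts query');
const ok = assertMinCount([{ json: { id: 'abc123' } }], 1, 'Notion contacts query');
console.log(ok.length); // 1

let emptyResultError = null;
try {
  assertMinCount([{ json: {} }], 1, 'Notion contacts query');
} catch (err) {
  emptyResultError = err;
}
console.log(emptyResultError.message);
```

Because the function throws, the failure travels the same path as any other node error and reaches the central error handler.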
// Minimal monitoring checklist for unattended n8n workflows.
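// 1. Execution success rate, tracked daily.
//    Table and column names here are illustrative; check your own schema.
//    (n8n's own executions table is execution_entity, with camelCase
//    columns such as "startedAt".)
// SELECT
//   DATE(started_at) AS day,
//   COUNT(*) AS total,
//   SUM(CASE WHEN status = 'success' THEN 1 ELSE 0 END) AS succeeded,
//   SUM(CASE WHEN status = 'error' THEN 1 ELSE 0 END) AS failed
// FROM n8n_executions
// GROUP BY DATE(started_at)
// ORDER BY day DESC;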
// 2. Silent skips — items that passed through but produced no output:
// After each critical node, add a Code node that verifies
// the output count matches expectations.
// If it does not match, write to error log and alert.
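const inputItems = $input.all();
const inputCount = inputItems.length;
// processedItems stands for the items after your transform. The pass-through
// map below is a placeholder; replace it with your own logic.
const processedItems = inputItems.map((item) => ({ json: { ...item.json } }));
const outputItems = processedItems;
if (outputItems.length !== inputCount) {
  // Some items were silently dropped
  throw new Error(
    `Item count mismatch: expected ${inputCount}, got ${outputItems.length}`
  );
}
return outputItems;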
// 3. Stale trigger detection:
// For each time-triggered workflow, record the last successful run time.
// A separate watchdog workflow (runs every hour) checks these timestamps.
// If any workflow hasn't run in 2x its expected interval, alert immediately.
// 4. Data freshness checks:
// For sync workflows, verify that the most recently modified record
// in your destination (Notion, CRM) was modified within
// the workflow's trigger interval. Stale data = something is not running.
Stale trigger detection
Scheduled workflows can stop running without producing any errors. This happens more often than people expect. You update n8n. The workflow gets disabled during the update. You forget to re-enable it. Or a credential expires and the trigger itself can no longer authenticate. The workflow does not fire. No error is generated anywhere. Everything looks fine from the outside.
The pattern that catches this: at the end of each critical workflow, write a "last successful run" timestamp to a simple store (a Notion page, a Google Sheet, a single-cell database). A separate watchdog workflow, running every hour, reads these timestamps and compares them to the expected run interval. If any workflow has not run within twice its expected interval, the watchdog alerts immediately.
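The watchdog comparison can be sketched as follows. The entry fields (`name`, `lastSuccessfulRun`, `expectedIntervalMs`) are hypothetical names for whatever you keep in your timestamp store.

```javascript
// Return the workflows whose last successful run is older than
// twice their expected interval.
function findStaleWorkflows(entries, now = new Date()) {
  return entries
    .map((entry) => ({
      ...entry,
      sinceLastRunMs: now.getTime() - new Date(entry.lastSuccessfulRun).getTime(),
      limitMs: 2 * entry.expectedIntervalMs,
    }))
    .filter((entry) => entry.sinceLastRunMs > entry.limitMs);
}

const HOUR_MS = 60 * 60 * 1000;
const stale = findStaleWorkflows(
  [
    // Hourly workflow, last ran 3 hours ago: past its 2-hour limit.
    { name: 'hourly-sync', lastSuccessfulRun: '2024-01-10T09:00:00Z', expectedIntervalMs: HOUR_MS },
    // Daily workflow, last ran 30 hours ago: still inside its 48-hour limit.
    { name: 'daily-report', lastSuccessfulRun: '2024-01-09T06:00:00Z', expectedIntervalMs: 24 * HOUR_MS },
  ],
  new Date('2024-01-10T12:00:00Z')
);
console.log(stale.map((w) => w.name)); // [ 'hourly-sync' ]
```

The factor of two leaves room for a run that starts late or takes longer than usual without paging anyone.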
This is simple to build and catches an entire class of failures that error handling will never see, because there is no error to handle. The workflow simply did not run.
Data freshness checks
For sync workflows, the stale trigger check can be augmented with a data freshness check. Instead of just asking "did the workflow run?", you ask "did the workflow do something useful when it ran?"
In a CRM sync, for example, you can check whether the most recently modified record in your Notion database was modified within your sync interval. If you sync every hour and the newest record in Notion is four hours old, that is a problem even if the workflow appears to have run. Maybe it ran and found nothing to sync. Maybe it ran and silently skipped everything. The data freshness check catches both cases.
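A minimal sketch of that check. `checkFreshness` and its `toleranceFactor` parameter are assumptions added for illustration; a factor above 1 may suit sources that legitimately go quiet for a while.

```javascript
// Flag a destination as stale when its newest record is older than the allowed age.
function checkFreshness(newestModifiedIso, syncIntervalMs, now = new Date(), toleranceFactor = 1) {
  const ageMs = now.getTime() - new Date(newestModifiedIso).getTime();
  const maxAgeMs = syncIntervalMs * toleranceFactor;
  return { ageMs, maxAgeMs, stale: ageMs > maxAgeMs };
}

const HOUR_MS = 60 * 60 * 1000;
const now = new Date('2024-01-10T12:00:00Z');

// Hourly sync, newest record is four hours old.
const result = checkFreshness('2024-01-10T08:00:00Z', HOUR_MS, now);
console.log(result.stale); // true

// Hourly sync, newest record is thirty minutes old.
const recent = checkFreshness('2024-01-10T11:30:00Z', HOUR_MS, now);
console.log(recent.stale); // false
```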
Execution success rate tracking
If you have access to the n8n database (self-hosted instances running on Postgres or SQLite), you can query execution metrics directly. Success rate over time, average execution duration, error rate by workflow. This gives you a dashboard view of system health without requiring any additional tooling.
Even without direct database access, you can approximate this by having each workflow write a status record at the end of every run (success or fail). A scheduled report aggregates these records daily and sends a summary. This gives you trend data: is this workflow failing more often this week than last week? Is execution time trending up, which might indicate something is getting slower?
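The daily aggregation over those status records might look like the sketch below. The record fields (`finishedAt` as an ISO timestamp, `status` as `'success'` or anything else) are assumptions about what each workflow writes at the end of a run.

```javascript
// Group status records by calendar day and compute a success rate per day.
function dailySuccessRates(records) {
  const byDay = new Map();
  for (const record of records) {
    const day = record.finishedAt.slice(0, 10); // YYYY-MM-DD from an ISO timestamp
    const entry = byDay.get(day) ?? { day, total: 0, succeeded: 0, failed: 0 };
    entry.total += 1;
    if (record.status === 'success') entry.succeeded += 1;
    else entry.failed += 1;
    byDay.set(day, entry);
  }
  return [...byDay.values()]
    .map((entry) => ({ ...entry, successRate: entry.succeeded / entry.total }))
    .sort((a, b) => (a.day < b.day ? 1 : -1)); // newest day first
}

const rates = dailySuccessRates([
  { finishedAt: '2024-01-09T01:00:00Z', status: 'success' },
  { finishedAt: '2024-01-09T02:00:00Z', status: 'error' },
  { finishedAt: '2024-01-10T01:00:00Z', status: 'success' },
  { finishedAt: '2024-01-10T02:00:00Z', status: 'success' },
  { finishedAt: '2024-01-10T03:00:00Z', status: 'success' },
  { finishedAt: '2024-01-10T04:00:00Z', status: 'error' },
]);
console.log(rates);
```

Comparing the newest day's `successRate` with the previous few days gives the week-over-week trend without any extra tooling.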
The practical minimum
For most production n8n deployments I work with at Abhiman Labs, the practical monitoring minimum is: a central error handler with Slack alerts, count assertions after critical queries, stale trigger detection for scheduled workflows, and a daily summary of the dead-letter queue state.
This is not complex to build. Each piece is a small workflow. Together they give you enough visibility to catch the vast majority of production failures within minutes, before they cause business impact that is difficult to recover from.
Putting it all together: what a production-hardened n8n workflow looks like
Let me describe what a production-hardened workflow actually looks like in practice, using a CRM-to-Notion sync as a concrete example. This is the kind of workflow I build for clients at Abhiman Labs where reliability genuinely matters.
The trigger layer
The workflow starts with a webhook trigger from Attio (the CRM). Every time a contact is created or updated in Attio, it fires a webhook to n8n.
The first nodes after the trigger are not processing nodes. They are safety nodes:
First, an idempotency check. Extract the webhook delivery ID from the header (or compute a stable hash of the payload). Query the idempotency store for that ID. If it exists, output an empty array and stop. If it does not exist, record the ID and continue.
Second, a payload validation check. Confirm the required fields are present and in the expected format. If validation fails, route to the dead-letter store immediately with a clear error about what was missing. Do not try to process malformed data.
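Both safety checks can be sketched in plain Node.js. The header name `x-delivery-id` and the required field names are hypothetical; check what your CRM actually sends. Inside an n8n Code node, using `crypto` may require allowing built-in modules on your instance.

```javascript
const crypto = require('crypto');

// Recursively sort object keys so the same payload always serializes identically.
function canonicalize(value) {
  if (Array.isArray(value)) return value.map(canonicalize);
  if (value !== null && typeof value === 'object') {
    const sorted = {};
    for (const key of Object.keys(value).sort()) sorted[key] = canonicalize(value[key]);
    return sorted;
  }
  return value;
}

// Prefer the delivery ID from the header; fall back to a stable hash of the body.
function idempotencyKey(headers, body, headerName = 'x-delivery-id') {
  const deliveryId = headers[headerName];
  if (deliveryId) return `delivery:${deliveryId}`;
  const hash = crypto
    .createHash('sha256')
    .update(JSON.stringify(canonicalize(body)))
    .digest('hex');
  return `hash:${hash}`;
}

// Report which required fields are missing or empty.
function validatePayload(body, requiredFields) {
  const missing = requiredFields.filter(
    (field) => body[field] === undefined || body[field] === null || body[field] === ''
  );
  return { valid: missing.length === 0, missing };
}

const keyA = idempotencyKey({}, { id: 'c1', name: 'Jane' });
const keyB = idempotencyKey({}, { name: 'Jane', id: 'c1' });
const keyC = idempotencyKey({ 'x-delivery-id': 'evt_1' }, { id: 'c1' });
const check = validatePayload({ id: 'c1', email: '' }, ['id', 'email']);
console.log(keyA === keyB, keyC, check);
```

Canonicalizing before hashing matters because two deliveries of the same event are not guaranteed to serialize their keys in the same order.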
The processing layer
With idempotency checked and payload validated, the actual processing begins. All nodes that call external APIs have "Retry On Fail" enabled with two to three retries and a five-second wait. For the Notion API calls specifically, I add an extra Code node with exponential backoff logic because Notion's rate limits require more sophisticated retry handling than a fixed-delay retry provides.
The core operation is an upsert: search Notion for an existing record matching the Attio contact ID, update it if found, create it if not. This makes the entire processing layer idempotent. Even if this exact event fires three times, the result is always one record with the most recent data.
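The upsert logic itself is small. The sketch below runs against an in-memory store so it can be executed as written; `upsertContact`, `createMemoryStore`, and the `attioId` field are hypothetical names, and in the real workflow the three store methods would be Notion search, update, and create calls.

```javascript
// Upsert keyed on the external contact ID.
async function upsertContact(store, contact) {
  const existing = await store.findByExternalId(contact.attioId);
  if (existing) {
    return { action: 'updated', record: await store.update(existing.id, contact) };
  }
  return { action: 'created', record: await store.create(contact) };
}

// In-memory stand-in for the destination, for illustration only.
function createMemoryStore() {
  const rows = new Map();
  let nextId = 1;
  return {
    rows,
    async findByExternalId(attioId) {
      return [...rows.values()].find((row) => row.attioId === attioId) ?? null;
    },
    async create(contact) {
      const row = { id: nextId++, ...contact };
      rows.set(row.id, row);
      return row;
    },
    async update(id, contact) {
      const row = { ...rows.get(id), ...contact };
      rows.set(id, row);
      return row;
    },
  };
}

async function demo() {
  const store = createMemoryStore();
  const events = [
    { attioId: 'c1', name: 'Jane' },
    { attioId: 'c1', name: 'Jane Smith' },
    { attioId: 'c1', name: 'Jane Smith' }, // duplicate delivery
  ];
  const actions = [];
  for (const event of events) actions.push((await upsertContact(store, event)).action);
  return { actions, rowCount: store.rows.size };
}

demo().then((result) => console.log(result));
```

Three deliveries of the same contact leave one row behind, which is the property that makes replaying the whole layer safe.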
Continue on Fail is enabled on the Notion API node. If the upsert fails after retries, the item does not stop the workflow. It passes through with an error key, gets caught by the IF node, and is routed to the dead-letter store with the original payload preserved.
The error and monitoring layer
The workflow is connected to the central error handler workflow in its settings. Any unhandled exception (beyond what Continue on Fail covers) triggers the error handler, which writes to the error log and fires the Slack alert.
At the end of a successful run, the workflow writes a timestamp to the "last successful run" store that the watchdog workflow checks hourly.
For batch versions of this workflow (where I pull all contacts modified in the last 24 hours rather than processing one at a time), the loop structure with per-item Continue on Fail and a run summary at the end replaces the single-item processing. The dead-letter behavior is the same: any item that fails after retries goes to the dead-letter store, not to silence.
The operations layer
Beyond the technical architecture, there are operational practices that make this sustainable:
The dead-letter queue gets reviewed weekly at minimum. Items in "waiting" status that are more than three days old get escalated. The daily summary report keeps this visible without requiring manual database checks.
When a failure pattern repeats (same error, same node, multiple times), that is the signal to fix the root cause, not just replay the items. Repeatedly replaying items from a broken pattern without fixing the pattern is not error handling. It is technical debt accumulation.
When something does fail and gets resolved, the Notes field in the error log gets updated with what the cause was and what fixed it. This knowledge accumulates over time and makes future debugging faster.
When you actually need all of this
I want to be honest about something. Not every n8n workflow needs every layer of error handling described in this post.
A workflow that runs once a week and generates a report that someone reviews before acting on it is low-stakes. If it fails, the person reviews last week's report instead. Missing data is not a crisis. For this workflow, basic error handler alerts and maybe retry settings on the API calls are plenty.
A workflow that processes client payments, updates client records, and triggers downstream automations that depend on accurate data is high-stakes. A failure there can mean a client does not get what they paid for, or data gets corrupted in a way that takes hours to clean up, or an important client-facing process gets missed entirely. This workflow needs idempotency, dead letters, monitoring, the full stack.
The question to ask for each workflow: what is the business cost of a silent failure? If the answer is "someone might not get the report they expected," the lightweight approach is fine. If the answer is "a client might be charged incorrectly" or "we might miss an important contract," you need the full error handling architecture.
I have had conversations with people who were running financial workflows with no error handling at all, not even basic retry settings on the nodes. They had not thought about what would happen if a node failed because nothing had failed yet. Then it did. The cleanup was painful and expensive. The error handling I have described in this post would have taken a few hours to build. The cleanup took days.
n8n-specific considerations worth knowing
A few things specific to n8n that are worth being aware of as you implement these patterns.
Error workflow connections are per-workflow, not global
In n8n, you connect an error handler workflow to each workflow individually through that workflow's settings. There is no global error handler setting. This means you need to consciously connect each production workflow to your error handler. Building a checklist that you go through before marking a workflow as production-ready is useful here.
The execution log is finite
n8n stores execution data for a configurable number of executions or days. After that, old executions are pruned. If you need long-term audit trails, do not rely on n8n's execution log. Write the important data to an external store (your Notion error log, a spreadsheet, a database) as part of the workflow. Assume the execution log will not be available more than a few days or weeks out.
Credentials and the silent expiry problem
API credentials expire. OAuth tokens expire. When they do, every node using that credential will fail with a 401. If you have retry logic that does not exclude 401 errors, you will retry 401s until you hit your max tries limit and then fail. If you do exclude 401s (as the exponential backoff code example does), you fail fast and correctly to your error handler.
The more proactive approach is to monitor credential expiry directly. For credentials with known expiry times, add them to a maintenance calendar and rotate them proactively before they expire. For OAuth credentials, prefer the ones that auto-refresh, and test that the auto-refresh is actually working. OAuth refresh token expiry is a common silent failure mode because the initial auth worked fine but the refresh fails quietly weeks later.
n8n's built-in execution retry vs. workflow-level retry
n8n has a feature that lets you manually retry a failed execution from the execution log interface. This is useful for one-off failures and debugging. It is not a substitute for the dead-letter pattern described in this post, because it requires someone to go into n8n, find the failed execution, and manually trigger the retry. That is fine for debugging. It does not scale for any failure pattern that recurs in production.
The dead-letter store is better for production because it is structured, queryable, and operates independently of how long n8n retains execution history. When an execution gets pruned from n8n's log, the dead-letter record with the original payload is still there, waiting to be replayed.
Summary: what production-grade n8n error handling looks like
The default n8n behavior stops on failure and produces no alerts. This is correct for development. It is dangerous for production. The gap between "works in testing" and "runs reliably unattended" is almost entirely a function of error handling.
Retry settings: Enable on any node that calls an external API. Use two to three retries with a short fixed delay for transient failures. Use Code node exponential backoff for rate limit scenarios. Never retry on 400, 401, or 403 errors.
Error workflows: Create a single central error handler workflow connected to all production workflows. It should extract the key fields, write to a persistent error log, and fire a Slack alert. Every production workflow must be connected to this handler.
Dead-letter queues: Failed items after retries should be stored with their original payload in a structured database, not dropped. The dead-letter store should have a replay mechanism and should be reviewed regularly.
Idempotency: Any workflow that can be triggered more than once on the same data must have idempotency checks. Use stable, externally-provided IDs as idempotency keys. Check before processing. Use upsert patterns for create operations.
Batch processing: Enable Continue on Fail on critical nodes. Split error items from successful ones. Log errors individually. Generate a run summary at the end of each batch. Route failed items to the dead-letter store, not to silence.
Monitoring: Add count assertions after critical queries. Implement stale trigger detection with a watchdog workflow. Send a daily dead-letter summary. Track data freshness for sync workflows.
This architecture takes a few days to build properly the first time. Once it exists, it applies to every new workflow you build. The incremental cost per workflow is small: connect it to the error handler, add idempotency where needed, enable Continue on Fail on critical nodes. Most of the scaffolding is shared.
The business case for this work is simple. An automation that loses data silently is worse than no automation at all. At least without automation, you know you have to do the work manually. With a broken automation, you think the work is being done when it is not. By the time you discover the problem, you may have days or weeks of missing data to reconstruct.