
Designing financial systems that don't fail

March 22, 2026 · 3 min read

Financial systems have a different failure model than consumer apps. A 500 error on Twitter is annoying. A 500 error mid-transfer could mean lost funds, double charges, or inconsistent ledger state.

The patterns that matter: idempotency keys on every mutation, event sourcing for audit trails, saga patterns for distributed transactions, and pessimistic locking on balance updates. But more than any pattern, you need to model your failure modes explicitly before writing a line of code.

Ask: "What happens if this service crashes here?" for every network call. The answer tells you what you need to build.

Idempotency is non-negotiable

The most important reliability pattern in financial systems is idempotency. Every mutation endpoint should accept an Idempotency-Key header. If the client retries with the same key, the server must return the same result — not execute the operation again.

However, a naive "read-then-write" check is vulnerable to race conditions. If two identical requests reach the server at nearly the same time, both can read null from the database before either has written the key, and both will execute the transfer.
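To make the race concrete, here is a minimal sketch that uses an in-memory Map in place of a database. The names `store`, `executed`, `naiveTransfer`, and `demoRace` are hypothetical, and the `await` between the read and the write simply stands in for a database round trip.

```typescript
// Hypothetical in-memory stand-in for an idempotency table.
const store = new Map<string, string>()
let executed = 0 // counts how many times the "transfer" actually ran

// Naive read-then-write: check for the key, do the work, then record the key.
async function naiveTransfer(key: string): Promise<string> {
  const existing = store.get(key) // read
  if (existing !== undefined) return existing

  // Yields control, standing in for a database round trip.
  // A second request can run its own read while this one is suspended.
  await Promise.resolve()

  executed += 1 // the side effect that was supposed to run exactly once
  const result = `transfer-${executed}`
  store.set(key, result) // the write arrives too late to stop the duplicate
  return result
}

// Two identical requests issued concurrently both pass the read check.
async function demoRace(): Promise<number> {
  await Promise.all([naiveTransfer('key-1'), naiveTransfer('key-1')])
  return executed // 2: the transfer ran twice for one idempotency key
}
```

Both calls complete their read before either reaches its write, so the existence check protects nothing under concurrency.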

To prevent this, you must insert the key first to lock the intent, relying on a database-level UNIQUE constraint:

async function processTransfer(params: {
  idempotencyKey: string
  from: string
  to: string
  amount: bigint
}) {
  // 1. Lock the key first. Rely on DB-level UNIQUE constraints to fail duplicates.
  try {
    await db.idempotency.create({
      data: { 
        key: params.idempotencyKey, 
        status: 'STARTED' 
      }
    })
  } catch (err: any) {
    if (err?.code === 'P2002') { // Prisma unique constraint violation code
      const existing = await db.idempotency.findUnique({
        where: { key: params.idempotencyKey }
      })
      // A missing row means the first attempt failed and released the key in the meantime.
      if (!existing || existing.status === 'STARTED') {
        throw new Error('Concurrent request in-flight. Please retry.')
      }
      return existing.result
    }
    throw err
  }

  // 2. Execute transfer (ideally inside a db transaction with pessimistic locking)
  let result
  try {
    result = await executeTransfer(params)
  } catch (err) {
    // 4. On failure, remove the key so the client can retry the operation.
    // This assumes executeTransfer throws only when nothing was committed.
    await db.idempotency.delete({
      where: { key: params.idempotencyKey }
    })
    throw err
  }

  // 3. Mark as success and store result. This sits outside the try block so that
  // a failure here never deletes the key of a transfer that already executed.
  await db.idempotency.update({
    where: { key: params.idempotencyKey },
    data: { status: 'SUCCESS', result }
  })

  return result
}

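By using a database-level constraint as the lock, a network timeout followed by a retry cannot start a second transfer while the key exists. The duplicate request is rejected at the DB layer before any external transaction is initiated. One gap remains: if the process crashes after step 2 but before step 3, the key stays in STARTED indefinitely, so a production system also needs a timeout or reconciliation job for stale keys.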

The saga pattern for distributed transactions

When a single financial operation spans multiple services (deduct from account A, credit account B, record the ledger entry, send the notification), you can't wrap it in a single database transaction. Each service owns its data.

The saga pattern breaks the operation into a series of local transactions, each with a compensating action for rollback:

| Step | Action | Compensation |
|------|--------|--------------|
| 1 | Reserve funds in source account | Release reservation |
| 2 | Send to settlement provider | Cancel settlement |
| 3 | Credit destination account | Reverse credit |
| 4 | Record ledger entry | Mark as reversed |
| 5 | Send notification | (None, eventually consistent) |

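If step 3 fails, the saga runs the compensating actions for steps 2 and 1, in that order. This gives you an all-or-nothing outcome without distributed locking, but not isolation: other readers can observe intermediate states while the saga is in flight.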
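The rollback behaviour can be sketched as a small in-process orchestrator. This is an illustration under simplifying assumptions, not a production design: `SagaStep`, `runSaga`, and `demoSteps` are hypothetical names, the steps are no-ops, and a real orchestrator would persist saga state and retry failed compensations.

```typescript
// A saga step pairs a local action with the compensation that undoes it.
interface SagaStep {
  name: string
  action: () => Promise<void>
  compensate: () => Promise<void>
}

// Runs steps in order; on failure, compensates the completed steps in reverse.
async function runSaga(steps: SagaStep[]): Promise<{ ok: boolean; log: string[] }> {
  const log: string[] = []
  const completed: SagaStep[] = []

  for (const step of steps) {
    try {
      await step.action()
      completed.push(step)
      log.push(`done:${step.name}`)
    } catch {
      log.push(`failed:${step.name}`)
      // Undo the most recent step first. The failed step itself is not
      // compensated because its local transaction never committed.
      for (const done of completed.reverse()) {
        await done.compensate()
        log.push(`undo:${done.name}`)
      }
      return { ok: false, log }
    }
  }
  return { ok: true, log }
}

// Usage: the third step fails, so the second and first are compensated.
const noop = async (): Promise<void> => {}
const demoSteps: SagaStep[] = [
  { name: 'reserve', action: noop, compensate: noop },
  { name: 'settle', action: noop, compensate: noop },
  { name: 'credit', action: async () => { throw new Error('credit failed') }, compensate: noop },
]
```

Running `runSaga(demoSteps)` yields the log `done:reserve`, `done:settle`, `failed:credit`, `undo:settle`, `undo:reserve`, matching the table: a failure at step 3 unwinds steps 2 and 1.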

The one question rule

The most valuable reliability practice I've adopted is deceptively simple: before writing any network call, ask "what happens if this crashes here?"

async function handlePaymentWebhook(payload: WebhookEvent) {
  // What if we crash after saving to DB but before calling the notification service?
  await db.payment.update({ where: { id: payload.id }, data: { status: 'confirmed' } })

  // What if we crash here? The payment is confirmed but the user doesn't know.
  await notificationService.send(payload.userId, 'Payment confirmed!')
}

The answer to "what if this crashes here?" tells you exactly what recovery mechanism you need. In the example above, the answer is a background job: save the notification intent to the database (in the same transaction as the payment update) and let a worker deliver it reliably.
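One way to sketch that recovery mechanism is an outbox table drained by a worker. The version below keeps everything in memory so it can run on its own. `payments`, `outbox`, `confirmPayment`, and `drainOutbox` are hypothetical names, and the synchronous `confirmPayment` stands in for a real database transaction that commits both writes together.

```typescript
// Hypothetical in-memory stand-ins; a real system would use database tables.
interface OutboxRow {
  id: number
  userId: string
  message: string
  delivered: boolean
}

const payments = new Map<string, string>() // payment id -> status
const outbox: OutboxRow[] = []

// Stand-in for one database transaction: the status change and the
// notification intent are recorded together, or not at all.
function confirmPayment(paymentId: string, userId: string): void {
  payments.set(paymentId, 'confirmed')
  outbox.push({ id: outbox.length + 1, userId, message: 'Payment confirmed!', delivered: false })
}

// Worker: sends every pending row and marks it delivered only after the send
// succeeds. A crash or failed send leaves the row pending for the next run.
async function drainOutbox(
  send: (userId: string, message: string) => Promise<void>
): Promise<number> {
  let deliveredCount = 0
  for (const row of outbox.filter(r => !r.delivered)) {
    try {
      await send(row.userId, row.message)
      row.delivered = true
      deliveredCount += 1
    } catch {
      // Leave the row pending; it is retried on the next run.
    }
  }
  return deliveredCount
}
```

Delivery here is at-least-once: if the worker crashes after sending but before marking the row, the notification is sent again on the next run, so the receiving side should tolerate duplicates.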

Model your failure modes first. The code for the happy path will write itself.
