Designing for Failure: Patterns for Resilient Distributed Systems

In distributed systems, failure isn’t a possibility; it’s a certainty. Networks partition, services crash, databases time out. The question isn’t if things will fail, but how your system will respond when they do.

This post covers practical patterns for building resilient distributed systems that degrade gracefully under failure.

The Fallacies of Distributed Computing

Before diving into patterns, let’s acknowledge the uncomfortable truths:

The network is not reliable
Latency is not zero
Bandwidth is not infinite
The network is not secure
Topology does change
There is not one administrator
Transport cost is not zero
The network is not homogeneous

Every pattern we discuss is designed to cope with these realities.

Pattern 1: Retries with Exponential Backoff

The simplest resilience pattern: if something fails, try again. But naive retries can make things worse.

The Problem with Naive Retries

// DON'T DO THIS
async function callService() {
  while (true) {
    try {
      return await fetch('/api/data');
    } catch (e) {
      // Immediately retry - this can overwhelm a struggling service
    }
  }
}

If a service is overloaded, hammering it with retries makes the problem worse.

Exponential Backoff with Jitter

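// Resolves after `ms` milliseconds
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

async function callWithBackoff<T>(
  fn: () => Promise<T>,
  maxRetries = 5 // total attempts, including the first call
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      // Out of attempts: surface the last error to the caller
      if (attempt === maxRetries - 1) throw error;

      // Exponential backoff: 100ms, 200ms, 400ms, 800ms...
      const baseDelay = 100 * Math.pow(2, attempt);

      // Add jitter to prevent thundering herd
      const jitter = Math.random() * baseDelay * 0.5;

      await sleep(baseDelay + jitter);
    }
  }
  // Only reachable when maxRetries <= 0, in which case fn was never called
  throw new Error('maxRetries must be at least 1');
}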

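The jitter is crucial: without it, clients that failed at the same moment also retry at the same moment, and each synchronized wave hits the recovering service all at once. This is the “thundering herd” problem.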
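
One detail the function above leaves open is that the delay doubles without limit. A common refinement is to cap it. The following is a minimal sketch of that idea, not a definitive implementation: `backoffDelay`, `capMs`, and the injectable `random` source are hypothetical names introduced here for illustration.

```typescript
// Hypothetical helper: delay in ms before retry number `attempt` (0-based).
// `capMs` and the injectable `random` source are illustrative assumptions.
function backoffDelay(
  attempt: number,
  baseMs = 100,
  capMs = 10000,
  random: () => number = Math.random
): number {
  // Exponential growth, clamped so late attempts do not wait unboundedly.
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt));
  // Same jitter rule as callWithBackoff: up to 50% extra on top of the base.
  const jitter = random() * exponential * 0.5;
  return exponential + jitter;
}

// With jitter switched off, attempts 0, 1, 2, 3 and 10 give the raw schedule.
const schedule = [0, 1, 2, 3, 10].map((n) =>
  backoffDelay(n, 100, 10000, () => 0)
);
```

Passing the random source as a parameter keeps the function deterministic under test while defaulting to `Math.random` in normal use.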

Pattern 2: Circuit Breaker

Retries help with transient failures, but what about prolonged outages? You don’t want to keep trying (and timing out) when a service is down.

The Circuit Breaker States

┌───────────┐  failure threshold  ┌───────────┐
│  CLOSED   │────────────────────▶│   OPEN    │
│ (normal)  │                     │ (failing) │
└───────────┘                     └─────┬─────┘
      ▲                                 │
      │         timeout elapsed         │
      │                                 ▼
      │                           ┌───────────┐
      └───────────────────────────│ HALF-OPEN │
            success               │  (probe)  │
                                  └───────────┘

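Closed: Normal operation; requests flow through and failures are counted
Open: Failures reached the threshold; requests fail immediately without calling the service (fast failure)
Half-Open: After the timeout, allow a trial request through to test recovery; success closes the circuit, failure reopens it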

Implementation

class CircuitBreaker {
  private state: 'closed' | 'open' | 'half-open' = 'closed';
  private failures = 0;
  private lastFailure: number = 0;
  
  constructor(
    private threshold = 5,
    private timeout = 30000
  ) {}
  
  async call<T>(fn: () => Promise<T>): Promise<T> {
    if (this.state === 'open') {
      if (Date.now() - this.lastFailure > this.timeout) {
        this.state = 'half-open';
      } else {
        throw new Error('Circuit is open');
      }
    }
    
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (error) {
      this.onFailure();
      throw error;
    }
  }
  
  private onSuccess() {
    this.failures = 0;
    this.state = 'closed';
  }
  
  private onFailure() {
    this.failures++;
    this.lastFailure = Date.now();
    if (this.failures >= this.threshold) {
      this.state = 'open';
    }
  }
}
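
One gap in the class above: once it moves to half-open, every concurrent caller is let through, not just one probe. The following is a hedged sketch of a single-probe variant. `SingleProbeBreaker`, the `probeInFlight` flag, the `getState` accessor, and the injectable `now` clock are all assumptions added for illustration and testing; they are not part of the original class.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class SingleProbeBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;
  private probeInFlight = false; // assumption: guards the half-open trial

  constructor(
    private threshold = 5,
    private timeout = 30000,
    private now: () => number = () => Date.now() // injectable clock (assumption)
  ) {}

  getState(): BreakerState {
    return this.state;
  }

  async call<T>(fn: () => Promise<T>): Promise<T> {
    // After the timeout, the next caller becomes the probe.
    if (this.state === 'open' && this.now() - this.openedAt > this.timeout) {
      this.state = 'half-open';
    }
    // Reject while open, and reject everyone except the probe while half-open.
    if (
      this.state === 'open' ||
      (this.state === 'half-open' && this.probeInFlight)
    ) {
      throw new Error('Circuit is open');
    }
    const isProbe = this.state === 'half-open';
    if (isProbe) this.probeInFlight = true;

    try {
      const result = await fn();
      this.failures = 0;
      this.state = 'closed';
      return result;
    } catch (error) {
      this.failures++;
      // A failed probe reopens immediately; otherwise wait for the threshold.
      if (isProbe || this.failures >= this.threshold) {
        this.state = 'open';
        this.openedAt = this.now();
      }
      throw error;
    } finally {
      if (isProbe) this.probeInFlight = false;
    }
  }
}
```

The flag is set before the first `await`, so a second caller arriving while the probe is pending sees it and fails fast.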

Pattern 3: Bulkheads

In ships, bulkheads are partitions that prevent water from flooding the entire vessel. In software, bulkheads isolate failures to prevent cascading effects.

Resource Isolation

// Separate connection pools for different services
const orderServicePool = new ConnectionPool({ maxSize: 10 });
const inventoryServicePool = new ConnectionPool({ maxSize: 10 });
const paymentServicePool = new ConnectionPool({ maxSize: 5 });

// If inventory service is slow, it only exhausts its own pool
// Order and payment services continue working

Thread Pool Isolation

For CPU-bound work, use separate thread pools:

const criticalWorkPool = new WorkerPool({ workers: 4 });
const backgroundWorkPool = new WorkerPool({ workers: 2 });

// Background work can't starve critical operations

Pattern 4: Saga for Distributed Transactions

When a business process spans multiple services, each with its own data store, a single database transaction can’t cover all of the steps. Sagas provide an alternative.

The Problem

CreateOrder saga:
1. Reserve inventory  ✓
2. Process payment    ✓
3. Create shipment    ✗ (failed!)

// Now what? We need to undo steps 1 and 2

Compensating Transactions

Each step in a saga has a compensating action that undoes its effect:

const createOrderSaga = {
  steps: [
    {
      action: reserveInventory,
      compensation: releaseInventory
    },
    {
      action: processPayment,
      compensation: refundPayment
    },
    {
      action: createShipment,
      compensation: cancelShipment
    }
  ]
};

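type SagaStep<C> = {
  action: (context: C) => Promise<void>;
  compensation: (context: C) => Promise<void>;
};

async function executeSaga<C>(saga: { steps: SagaStep<C>[] }, context: C) {
  const completedSteps: SagaStep<C>[] = [];

  for (const step of saga.steps) {
    try {
      await step.action(context);
      completedSteps.push(step);
    } catch (error) {
      // Compensate in reverse order. A failed compensation is logged and
      // skipped so the remaining completed steps are still undone.
      for (const completed of completedSteps.reverse()) {
        try {
          await completed.compensation(context);
        } catch (compensationError) {
          console.error('Compensation failed', compensationError);
        }
      }
      throw error;
    }
  }
}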
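
To make the compensation order concrete, here is a small traced run. It is a sketch under stated assumptions: the `step` factory, the `log` array, and `runSaga` are hypothetical stand-ins, with `runSaga` following the same control flow as `executeSaga` but without the shared context.

```typescript
// Hypothetical stubs: each step only records what it did.
type Step = {
  name: string;
  action: () => Promise<void>;
  compensation: () => Promise<void>;
};

const log: string[] = [];

const step = (name: string, fails = false): Step => ({
  name,
  action: async () => {
    if (fails) throw new Error(`${name} failed`);
    log.push(`do:${name}`);
  },
  compensation: async () => {
    log.push(`undo:${name}`);
  },
});

// Same control flow as executeSaga, minus the shared context.
async function runSaga(steps: Step[]): Promise<void> {
  const done: Step[] = [];
  for (const s of steps) {
    try {
      await s.action();
      done.push(s);
    } catch (error) {
      for (const c of done.reverse()) await c.compensation();
      throw error;
    }
  }
}

async function demo(): Promise<string[]> {
  try {
    await runSaga([step('inventory'), step('payment'), step('shipment', true)]);
  } catch (error) {
    // The saga rethrows the original error after compensating.
  }
  return log;
}
```

Calling `demo()` records `do:inventory`, `do:payment`, `undo:payment`, `undo:inventory`. The failed shipment step is never compensated, because it never completed.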

Pattern 5: Idempotency

In distributed systems, messages can be delivered more than once: a retry after a lost response, for example, repeats a request that already succeeded. Your handlers must handle duplicates gracefully.

Idempotency Keys

async function processPayment(request: PaymentRequest) {
  // Check if we've already processed this request
  const existing = await db.payments.findByIdempotencyKey(
    request.idempotencyKey
  );
  
  if (existing) {
    return existing.result; // Return cached result
  }
  
  // Process the payment
  const result = await paymentProvider.charge(request);
  
  // Store result with idempotency key
  await db.payments.create({
    idempotencyKey: request.idempotencyKey,
    result
  });
  
  return result;
}
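
Note that the lookup and the insert above are separate steps, so two duplicates arriving at nearly the same time can both miss the lookup and both charge. In a database, a unique constraint on the key is the usual guard. The following sketch shows the same idea in a single process, assuming an in-memory map is acceptable; `runOnce` and `inFlight` are hypothetical names for illustration.

```typescript
// Hypothetical in-memory store, keyed by idempotency key. A real system
// would persist this, for example behind a unique database constraint.
const inFlight = new Map<string, Promise<unknown>>();

function runOnce<T>(key: string, operation: () => Promise<T>): Promise<T> {
  const existing = inFlight.get(key);
  if (existing) return existing as Promise<T>;

  // Register the promise synchronously, before any await, so a concurrent
  // duplicate finds it instead of starting a second operation.
  const pending = operation().catch((error) => {
    inFlight.delete(key); // allow a retry with the same key after a failure
    throw error;
  });
  inFlight.set(key, pending);
  return pending;
}
```

Duplicates that arrive while the first call is pending, or after it has succeeded, receive the first call’s result. A failed call clears its key so the client can try again.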

Natural Idempotency

Design operations to be naturally idempotent when possible:

// NOT idempotent: each duplicate delivery adds another 100
// (account 42 is an illustrative ID)
await db.execute('UPDATE accounts SET balance = balance + 100 WHERE id = 42');

// Idempotent: a conditional write that only applies while the balance is still 1000
await db.execute('UPDATE accounts SET balance = 1100 WHERE id = 42 AND balance = 1000');

Putting It All Together

Real systems combine these patterns:

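class ResilientServiceClient {
  private circuitBreaker = new CircuitBreaker();

  // httpClient is a placeholder for any client exposing post()
  constructor(
    private httpClient: { post(request: Request): Promise<Response> }
  ) {}

  async call(request: Request): Promise<Response> {
    // Circuit breaker prevents calls to failing services.
    // It wraps the retries, so a call counts as one failure
    // only after every attempt has failed.
    return this.circuitBreaker.call(async () => {
      // Retries with backoff handle transient failures
      return callWithBackoff(async () => {
        // Timeout prevents hanging (withTimeout is an assumed helper
        // that rejects if the promise has not settled within 5000 ms)
        return withTimeout(
          this.httpClient.post(request),
          5000
        );
      });
    });
  }
}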
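
The `withTimeout` helper used above is not defined in this post. One possible sketch, built on `Promise.race`, is shown here; treat the signature and the error message as assumptions.

```typescript
// Hypothetical helper: rejects if `promise` has not settled within `ms`.
function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Timed out after ${ms}ms`)),
      ms
    );
  });

  // Whichever settles first wins. Always clear the timer afterwards so it
  // does not fire later or keep the process alive.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
```

This only stops waiting; it does not cancel the underlying request. If the client supports cancellation, for example through an `AbortSignal`, wiring that in would also free the connection.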

Key Takeaways

Expect failure: Design your system assuming components will fail
Fail fast: Don’t wait for timeouts when you know something is broken
Fail gracefully: Degrade functionality rather than failing completely
Isolate failures: Prevent failures in one component from cascading
Make operations idempotent: Handle duplicate messages safely

Building resilient distributed systems is hard. We’re working on infrastructure that handles these patterns for you. Join our pilot program to learn more.