Fundamentals of System Design - Failure Models

Distributed systems consist of a set of loosely coupled services that end users perceive as a single coherent system. However, because components are loosely coupled and communicate over the network, failures are inevitable. Network congestion, a few instances going down, a slow dependency: there are many ways a failure can be introduced into the system.

Timeout Management

Each request, and each retry attempt, should have a reasonable timeout to prevent it from hanging indefinitely.

Without proper request timeouts, the service would consume excessive resources. For example, at 100 requests per second with each in-flight request holding around 200 KB of memory, requests that hang for just 50 seconds tie up roughly 1 GB of memory. If timeouts aren’t applied appropriately, the service can run into out-of-memory issues fairly quickly.

It’s also worth noting that the underlying work for abandoned requests should be canceled once the timeout is hit; otherwise, the resources are never actually relinquished.

A good basis for choosing the timeout value is the latency distribution of the downstream service. Pick an acceptable rate of false timeouts (for example, 0.1%, which corresponds to the p99.9 latency) and set the timeout at that percentile. Also compare the p99.9 and p50 latencies: if they are close together, pad the timeout value so that a small latency shift doesn’t cause a spike in timeouts.

package main

import (
    "context"
    "fmt"
    "log"
    "net/http"
    "time"
)

// Basic timeout middleware.
// Note: if the handler keeps writing to w after the timeout branch has fired,
// the two writes race; the standard library's http.TimeoutHandler guards
// against this and is usually the safer choice in production.
func TimeoutMiddleware(timeout time.Duration) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            ctx, cancel := context.WithTimeout(r.Context(), timeout)
            defer cancel()
            
            // Create channel to signal completion
            done := make(chan struct{})
            
            // Run handler in goroutine
            go func() {
                next.ServeHTTP(w, r.WithContext(ctx))
                close(done)
            }()
            
            // Wait for either completion or timeout
            select {
            case <-done:
                return
            case <-ctx.Done():
                w.WriteHeader(http.StatusGatewayTimeout)
                w.Write([]byte("Request timeout"))
            }
        })
    }
}

// Handler that respects context timeout
func slowHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    
    // Simulate slow operation
    select {
    case <-time.After(3 * time.Second):
        w.Write([]byte("Operation completed"))
    case <-ctx.Done():
        // Context cancelled/timeout - stop processing
        log.Println("Request timeout:", ctx.Err())
        return
    }
}

// Handler with database query example
func databaseHandler(w http.ResponseWriter, r *http.Request) {
    ctx := r.Context()
    
    // Pass context to database queries
    // rows, err := db.QueryContext(ctx, "SELECT * FROM users")
    // if err != nil {
    //     if ctx.Err() == context.DeadlineExceeded {
    //         log.Println("Database query timeout")
    //         return
    //     }
    //     // Handle other errors
    // }
    
    // Simulate work
    for i := 0; i < 10; i++ {
        select {
        case <-ctx.Done():
            log.Println("Context cancelled, stopping work")
            return
        default:
            time.Sleep(500 * time.Millisecond)
            fmt.Printf("Processing step %d\n", i+1)
        }
    }
    
    w.Write([]byte("Database operation completed"))
}

func main() {
    mux := http.NewServeMux()
    
    // Apply timeout middleware
    mux.Handle("/slow", TimeoutMiddleware(2*time.Second)(http.HandlerFunc(slowHandler)))
    mux.Handle("/db", TimeoutMiddleware(5*time.Second)(http.HandlerFunc(databaseHandler)))
    mux.HandleFunc("/fast", func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte("Fast response"))
    })
    
    log.Println("Server starting on :8080")
    log.Fatal(http.ListenAndServe(":8080", mux))
}

Retries

Follow a retry strategy that avoids thundering herds, where calls that failed at almost the same time are all retried at almost the same time. Use exponential backoff with jitter: exponential backoff waits longer on every retry, giving a potentially overloaded upstream more time to recover, while jitter adds randomness so the retries don’t cluster together.
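
A minimal sketch of this idea, assuming a generic op func() error; the function name, base delay, cap, and attempt count below are illustrative placeholders, not from any specific library:

package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"
)

// retryWithBackoff retries op with exponential backoff and full jitter.
func retryWithBackoff(ctx context.Context, maxAttempts int, op func() error) error {
	const (
		baseDelay = 100 * time.Millisecond
		maxDelay  = 5 * time.Second
	)
	var err error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if err = op(); err == nil {
			return nil
		}
		// Exponential backoff: the window doubles on every attempt, capped at maxDelay.
		backoff := baseDelay * time.Duration(1<<attempt)
		if backoff > maxDelay {
			backoff = maxDelay
		}
		// Full jitter: sleep a random duration in [0, backoff).
		sleep := time.Duration(rand.Int63n(int64(backoff)))
		select {
		case <-time.After(sleep):
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	return fmt.Errorf("all %d attempts failed, last error: %w", maxAttempts, err)
}

A call site would simply wrap the outbound request, for example err := retryWithBackoff(ctx, 4, func() error { return callUpstream(ctx) }).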

Issues and guidelines when retrying:
  • Drop in throughput: Retries consume capacity that could otherwise serve new requests, so a large number of retries per request reduces the number of requests the server can handle per unit of time.
  • SLA violations: Take the SLA into account when setting the retry count and timeout period.
  • Retry sparingly: Not all requests can or should be retried. Status codes such as 429, 500, 502, 503, and 504 are good candidates for retry. Status codes such as 400 or 404, on the other hand, indicate a client-side problem (a malformed request or a missing resource), so retrying won’t change the outcome. A small helper for this check is sketched after this list.
  • Retriable operations: Be aware of the potential side effects of the operation; only retry operations whose side effects are safe to repeat.
  • Retry only at one level: Retrying at multiple layers of the call stack multiplies the total number of attempts and amplifies load on the lowest-level service, so retry at a single, well-chosen layer.
  • Use load shedding and backpressure: The upstream service should shed any additional requests that it can’t handle.
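
As referenced above, a minimal sketch of the status-code check; the function name and the exact set of codes are illustrative, not taken from a specific library:

package main

import "net/http"

// isRetryable reports whether an HTTP response status code is worth retrying.
// 429 and most 5xx responses are typically transient; other 4xx responses
// indicate a client-side problem that a retry won't fix.
func isRetryable(statusCode int) bool {
	switch statusCode {
	case http.StatusTooManyRequests, // 429
		http.StatusInternalServerError, // 500
		http.StatusBadGateway,          // 502
		http.StatusServiceUnavailable,  // 503
		http.StatusGatewayTimeout:      // 504
		return true
	default:
		return false
	}
}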

Use Idempotent APIs to Make Retries Safe

Most HTTP methods are idempotent; POST and PATCH are the notable exceptions. For those, incorporate a unique caller-provided client request identifier into the API contract. Requests from the same caller with the same client request identifier can be considered duplicates and handled accordingly.
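
A minimal in-memory sketch of this idea, assuming the identifier arrives in an Idempotency-Key header; the header name, store, and handler below are illustrative, not a prescribed API:

package main

import (
	"net/http"
	"sync"
)

// idempotencyStore remembers responses for client request identifiers.
// A real implementation would use a shared store (e.g., a database) with a TTL.
type idempotencyStore struct {
	mu   sync.Mutex
	seen map[string][]byte // request ID -> previously returned response body
}

func newIdempotencyStore() *idempotencyStore {
	return &idempotencyStore{seen: make(map[string][]byte)}
}

// Handle replays the stored response for a duplicate request ID, or runs
// process once and records its result.
func (s *idempotencyStore) Handle(w http.ResponseWriter, r *http.Request, process func() []byte) {
	key := r.Header.Get("Idempotency-Key")
	if key == "" {
		http.Error(w, "missing Idempotency-Key header", http.StatusBadRequest)
		return
	}

	s.mu.Lock()
	if body, ok := s.seen[key]; ok {
		s.mu.Unlock()
		w.Write(body) // Duplicate: replay the original result, don't redo the work
		return
	}
	s.mu.Unlock()

	// Note: two concurrent requests with the same key can both reach this point;
	// a production implementation would lock per key or insert atomically.
	body := process()

	s.mu.Lock()
	s.seen[key] = body
	s.mu.Unlock()

	w.Write(body)
}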

Database Design Adjustments (Upsert Operation)

Prefer to use upsert operations (which update a record if it exists or insert it if it doesn’t) or apply unique constraints that prevent duplicates from being added in the first place.
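
For example, a sketch with database/sql against PostgreSQL; the users table, its columns, and the ON CONFLICT clause are illustrative assumptions:

package main

import (
	"context"
	"database/sql"
)

// upsertUser inserts the user, or updates the email if the id already exists.
// Retrying this call with the same arguments leaves the row in the same state,
// which is what makes the retry safe.
func upsertUser(ctx context.Context, db *sql.DB, id int64, email string) error {
	// PostgreSQL syntax; other databases have equivalents
	// (e.g., MySQL's INSERT ... ON DUPLICATE KEY UPDATE).
	_, err := db.ExecContext(ctx,
		`INSERT INTO users (id, email)
		 VALUES ($1, $2)
		 ON CONFLICT (id) DO UPDATE SET email = EXCLUDED.email`,
		id, email)
	return err
}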

Idempotency in Messaging Systems

Idempotency can be enforced by storing a log of processed message IDs and checking each incoming message against it.
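
A minimal in-process sketch of that check; the message type, consumer, and in-memory set are illustrative, and a real consumer would persist processed IDs alongside its state:

package main

import (
	"log"
	"sync"
)

// Message is a simplified stand-in for a record from a queue or log.
type Message struct {
	ID      string
	Payload []byte
}

// Consumer skips messages whose IDs it has already processed.
type Consumer struct {
	mu        sync.Mutex
	processed map[string]struct{} // log of processed message IDs
}

func NewConsumer() *Consumer {
	return &Consumer{processed: make(map[string]struct{})}
}

func (c *Consumer) Handle(msg Message, process func(Message) error) error {
	c.mu.Lock()
	if _, done := c.processed[msg.ID]; done {
		c.mu.Unlock()
		log.Printf("skipping duplicate message %s", msg.ID)
		return nil
	}
	c.mu.Unlock()

	if err := process(msg); err != nil {
		return err // not recorded, so the message can be redelivered and retried
	}

	c.mu.Lock()
	c.processed[msg.ID] = struct{}{}
	c.mu.Unlock()
	return nil
}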

Failover Caching

Most outages are temporary, and thanks to self-healing and advanced load balancing we should be able to keep our service running through these glitches. This is where failover caching can help: when a dependency is unavailable, previously cached data can still be served to the application.

Only use failover caching when serving outdated data is better than serving nothing at all.
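
A minimal sketch of the pattern, assuming a generic fetch function and an in-memory last-known-good cache; the names and structure are illustrative:

package main

import (
	"log"
	"sync"
)

// FailoverCache keeps the last successful value per key and falls back to it
// when the primary fetch fails.
type FailoverCache struct {
	mu       sync.RWMutex
	lastGood map[string][]byte
}

func NewFailoverCache() *FailoverCache {
	return &FailoverCache{lastGood: make(map[string][]byte)}
}

// Get tries the primary fetch first; on failure it serves the stale copy, if any.
func (c *FailoverCache) Get(key string, fetch func(string) ([]byte, error)) ([]byte, error) {
	value, err := fetch(key)
	if err == nil {
		c.mu.Lock()
		c.lastGood[key] = value
		c.mu.Unlock()
		return value, nil
	}

	c.mu.RLock()
	stale, ok := c.lastGood[key]
	c.mu.RUnlock()
	if ok {
		log.Printf("fetch for %q failed (%v); serving stale cached value", key, err)
		return stale, nil
	}
	return nil, err // nothing cached, propagate the failure
}

In practice the stale entry would also carry a timestamp or TTL so consumers know how outdated the data they are receiving is.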

Circuit Breaker Pattern

The Circuit Breaker pattern prevents an application from performing an operation that’s likely to fail. A circuit breaker acts as a proxy for operations that might fail. The proxy should monitor the number of recent failures and use this information to decide whether to allow the operation to proceed or to return an exception immediately.

A circuit breaker consists of the following three states:

  • Closed: The request is routed to the upstream service without any hindrance
  • Open: The request fails immediately without invoking the upstream service
  • Half-Open: A limited number of requests are allowed to invoke the upstream service

The retry logic should be sensitive to any exceptions that the circuit breaker returns and stop retry attempts if the circuit breaker indicates that a fault isn’t transient.

Service meshes often support circuit breaking as a sidecar or as a standalone capability without modifying application code.

package main

import (
	"context"
	"log"
	"time"

	"github.com/sony/gobreaker"
)

// PaymentService (not shown here) wraps the actual call to the payment provider.
type ResilientPaymentClient struct {
	service        *PaymentService
	circuitBreaker *gobreaker.CircuitBreaker
}

func NewResilientPaymentClient(service *PaymentService) *ResilientPaymentClient {
	// Configure circuit breaker
	settings := gobreaker.Settings{
		Name: "PaymentService",
		// Maximum requests allowed in half-open state before deciding to close
		MaxRequests: 2,
		// Time period after which failure counts reset in closed state
		Interval: 10 * time.Second,
		// Time to wait before attempting half-open state
		Timeout: 5 * time.Second,
		// Function to determine when circuit should trip open
		ReadyToTrip: func(counts gobreaker.Counts) bool {
			failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
			return counts.Requests >= 3 && failureRatio >= 0.6
		},
		// Callback for state transitions (useful for monitoring)
		OnStateChange: func(name string, from gobreaker.State, to gobreaker.State) {
			log.Printf("[CIRCUIT BREAKER] %s: State changed from %s to %s", 
				name, from.String(), to.String())
		},
	}

	return &ResilientPaymentClient{
		service:        service,
		circuitBreaker: gobreaker.NewCircuitBreaker(settings),
	}
}

func (r *ResilientPaymentClient) ProcessPayment(ctx context.Context, amount float64) (string, error) {
	// Execute through circuit breaker
	result, err := r.circuitBreaker.Execute(func() (interface{}, error) {
		return r.service.ProcessPayment(ctx, amount)
	})
	
	if err != nil {
		return "", err
	}
	
	return result.(string), nil
}

Load Shedding and Backpressure

When demand exceeds capacity, it is better to reject excess requests early (load shedding) and signal callers to slow down (backpressure) than to let every request degrade. The sketch below enforces a hard cap on concurrent work and drops requests probabilistically as system load rises.

package main

import (
	"log"
	"math/rand"
	"runtime"
	"sync/atomic"
	"time"
)

type LoadShedder struct {
	maxConcurrent   int64
	currentLoad     atomic.Int64
	cpuThreshold    float64
	dropProbability atomic.Value // stores float64
}

func NewLoadShedder(maxConcurrent int, cpuThreshold float64) *LoadShedder {
	ls := &LoadShedder{
		maxConcurrent: int64(maxConcurrent),
		cpuThreshold:  cpuThreshold,
	}
	ls.dropProbability.Store(0.0)
	
	// Background goroutine to monitor system load
	go ls.monitorSystemLoad()
	
	return ls
}

// Monitor CPU and adjust drop probability
func (ls *LoadShedder) monitorSystemLoad() {
	ticker := time.NewTicker(1 * time.Second)
	defer ticker.Stop()
	
	for range ticker.C {
		// Get current CPU usage (simplified - in production use actual metrics)
		numGoroutines := runtime.NumGoroutine()
		cpuUsage := float64(numGoroutines) / 1000.0 // Simulated CPU usage
		
		// Calculate drop probability based on load
		var dropProb float64
		if cpuUsage > ls.cpuThreshold {
			// Linear increase from threshold to 100%
			dropProb = (cpuUsage - ls.cpuThreshold) / (1.0 - ls.cpuThreshold)
			if dropProb > 0.9 {
				dropProb = 0.9 // Never drop more than 90%
			}
		}
		
		ls.dropProbability.Store(dropProb)
		
		if dropProb > 0 {
			log.Printf("[LOAD SHEDDER] System overloaded. Goroutines: %d, Drop probability: %.2f%%",
				numGoroutines, dropProb*100)
		}
	}
}

// ShouldAcceptRequest decides whether to accept or reject a request
func (ls *LoadShedder) ShouldAcceptRequest() bool {
	// Check 1: Hard limit on concurrent requests
	current := ls.currentLoad.Load()
	if current >= ls.maxConcurrent {
		log.Println("[LOAD SHEDDER] ❌ Rejected: Max concurrent limit reached")
		return false
	}
	
	// Check 2: Probabilistic rejection based on system load
	dropProb := ls.dropProbability.Load().(float64)
	if dropProb > 0 && rand.Float64() < dropProb {
		log.Printf("[LOAD SHEDDER] ❌ Rejected: Probabilistic shedding (%.0f%% drop rate)", 
			dropProb*100)
		return false
	}
	
	return true
}

// Acquire increments the load counter
func (ls *LoadShedder) Acquire() {
	ls.currentLoad.Add(1)
}

// Release decrements the load counter
func (ls *LoadShedder) Release() {
	ls.currentLoad.Add(-1)
}

Bulkhead Pattern

Partition service instances into different groups based on consumer load and availability requirements. This design helps isolate failures and allows you to sustain service functionality for some consumers even during a failure.

package main

import (
	"context"
	"fmt"
	"log"
	"sync"
	"time"
)

// Bulkhead represents a resource pool with limited capacity
type Bulkhead struct {
	name        string
	semaphore   chan struct{} // Counting semaphore that limits concurrent workers
	maxWorkers  int
	timeout     time.Duration
	mu          sync.Mutex
	activeCount int
}

// NewBulkhead creates a new bulkhead with specified capacity
func NewBulkhead(name string, maxWorkers int, timeout time.Duration) *Bulkhead {
	return &Bulkhead{
		name:       name,
		semaphore:  make(chan struct{}, maxWorkers),
		maxWorkers: maxWorkers,
		timeout:    timeout,
	}
}

// Execute runs a function within the bulkhead's resource limits
func (b *Bulkhead) Execute(ctx context.Context, fn func() error) error {
	// Try to acquire a permit (worker slot)
	select {
	case b.semaphore <- struct{}{}: // Acquired
		b.mu.Lock()
		b.activeCount++
		active := b.activeCount
		b.mu.Unlock()
		
		log.Printf("[%s] Acquired permit. Active workers: %d/%d", 
			b.name, active, b.maxWorkers)
		
		defer func() {
			<-b.semaphore // Release permit
			b.mu.Lock()
			b.activeCount--
			active := b.activeCount
			b.mu.Unlock()
			log.Printf("[%s] Released permit. Active workers: %d/%d",
				b.name, active, b.maxWorkers)
		}()
		
		// Execute the function
		return fn()
		
	case <-time.After(b.timeout):
		return fmt.Errorf("bulkhead %s: timeout waiting for available worker", b.name)
		
	case <-ctx.Done():
		return ctx.Err()
	}
}

Written on December 28, 2025