SRE Documentation

Incident Postmortem: Database Connection Pool Exhaustion

This sample demonstrates blameless postmortem writing: clear timelines, root cause analysis, impact assessment, and actionable follow-up items.

Incident Summary

  • Incident ID: INC-2024-0918
  • Date: September 18, 2024
  • Duration: 47 minutes (14:23 - 15:10 UTC)
  • Severity: SEV-1 (Customer-facing service degradation)
  • Services affected: API Gateway, User Service, Payments Service
  • Incident Commander: Sarah Chen
  • Author: DevOps Team

Executive Summary

On September 18, 2024, our primary API experienced severe latency and partial outages for 47 minutes due to database connection pool exhaustion. The root cause was a combination of a traffic spike from a new enterprise customer onboarding and a connection leak in the recently deployed User Service v2.4.0.

Approximately 12% of API requests failed during the incident window. No data loss occurred. The issue was mitigated by rolling back User Service to v2.3.2 and temporarily increasing database connection limits.

Impact

Customer Impact

  • ~15,000 users experienced failed or slow requests
  • 12% error rate across all API endpoints (baseline: 0.1%)
  • P99 latency increased from 200ms to 8,500ms
  • Mobile app users saw "Unable to connect" errors
  • 3 enterprise customers filed support tickets

Business Impact

  • Estimated revenue impact: $12,400 (failed checkout transactions)
  • 23 support tickets opened during incident
  • Status page updated to "Degraded Performance" for 52 minutes

Internal Impact

  • 7 engineers pulled into incident response
  • Deployment freeze enacted for 24 hours
  • Scheduled maintenance postponed

Timeline (All times UTC)

Detection

  • 14:23 — PagerDuty alert: "API Gateway error rate > 5%" triggered
  • 14:24 — On-call engineer (Mike R.) acknowledges alert
  • 14:26 — Second alert: "Database connection pool at 95% capacity"
  • 14:28 — Incident declared SEV-2, war room opened

Investigation

  • 14:30 — Initial hypothesis: DDoS attack. Traffic analysis shows legitimate traffic from new enterprise customer
  • 14:35 — Database team confirms connection pool exhausted (500/500 connections in use)
  • 14:38 — Incident escalated to SEV-1, Sarah C. assumes Incident Commander role
  • 14:42 — User Service logs show connections not being released after requests complete
  • 14:45 — Correlation identified: User Service v2.4.0 deployed at 13:15 today

Mitigation

  • 14:48 — Decision: Roll back User Service to v2.3.2
  • 14:52 — Rollback initiated
  • 14:58 — Rollback complete, connections beginning to release
  • 15:02 — Connection pool at 60% capacity
  • 15:05 — Error rate returns to baseline (0.1%)
  • 15:10 — Incident resolved, monitoring continues

Post-Incident

  • 15:30 — Status page updated to "Operational"
  • 16:00 — Initial customer communications sent
  • 17:00 — Engineering all-hands briefing
  • Sep 20 — Postmortem review meeting held

Root Cause Analysis

Primary Cause

A code change in User Service v2.4.0 introduced a connection leak in the getUserProfile() function. When a request included the optional include_preferences parameter, the function returned early without releasing the database connection back to the pool.

// BEFORE (v2.4.0 - Bug)
async function getUserProfile(userId, options) {
  const conn = await pool.getConnection();
  const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);

  if (options.include_preferences) {
    const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
    return { user, preferences: prefs };  // Connection never released!
  }

  conn.release();
  return { user };
}

// AFTER (v2.4.1 - Fixed)
async function getUserProfile(userId, options) {
  const conn = await pool.getConnection();
  try {
    const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);

    if (options.include_preferences) {
      const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
      return { user, preferences: prefs };
    }

    return { user };
  } finally {
    conn.release();  // Always release connection
  }
}

Contributing Factors

  • Traffic spike: New enterprise customer onboarding increased traffic by 40%, accelerating connection pool exhaustion
  • Insufficient test coverage: The include_preferences code path was not covered by integration tests
  • Missing connection pool monitoring: Alert threshold was set at 95%, providing insufficient lead time
  • No connection timeout: Leaked connections persisted indefinitely instead of timing out
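
To see how these factors compound, a back-of-the-envelope model of time-to-exhaustion (all numbers below are illustrative assumptions, not production telemetry):

```javascript
// Back-of-the-envelope model of pool exhaustion under a connection leak.
// All numbers are illustrative assumptions, not production telemetry.
function minutesToExhaustion({ poolSize, baselineActive, leaksPerMinute }) {
  const headroom = poolSize - baselineActive;
  return headroom / leaksPerMinute;
}

// Suppose ~4 requests/min hit the leaking include_preferences path: the
// 300-connection headroom lasts about 75 minutes, roughly the gap between
// the 13:15 deploy and the 14:35 exhaustion.
const beforeSpike = minutesToExhaustion({ poolSize: 500, baselineActive: 200, leaksPerMinute: 4 });

// A 40% traffic spike raises the leak rate proportionally and shortens the runway.
const afterSpike = minutesToExhaustion({ poolSize: 500, baselineActive: 200, leaksPerMinute: 4 * 1.4 });
```

The point of the model: a leak's danger scales with traffic, so a deploy that survives quiet hours can still take down peak load.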

Why Wasn't This Caught Earlier?

  • Unit tests mocked database connections, hiding the leak
  • Staging environment receives far less traffic, so the pool was never exhausted during pre-release testing
  • Code review focused on business logic, not resource management
  • No automated static analysis for resource leak patterns
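
A leak like this is catchable with an integration-style check that asserts the pool returns to zero active connections after each call, instead of mocking the pool away. A hypothetical sketch using a minimal in-memory pool stand-in (not our production test code):

```javascript
// Minimal in-memory pool stand-in that counts checkouts and releases.
// Hypothetical sketch only.
function makeCountingPool() {
  let active = 0;
  return {
    async getConnection() {
      active++;
      return {
        async query() { return []; }, // pretend query
        release() { active--; },
      };
    },
    activeCount: () => active,
  };
}

// The buggy v2.4.0 shape: the early return skips conn.release().
async function getUserProfile(pool, userId, options) {
  const conn = await pool.getConnection();
  const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
  if (options.include_preferences) {
    const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
    return { user, preferences: prefs }; // leak
  }
  conn.release();
  return { user };
}

// The assertion that would have failed in CI: after any call,
// the pool must be back at zero active connections.
async function leakCheck() {
  const pool = makeCountingPool();
  await getUserProfile(pool, 1, { include_preferences: true });
  return pool.activeCount(); // > 0 means a leak
}
```

Because the counting pool exercises the real checkout/release flow, the leaking code path fails the check immediately, where a mocked connection would pass.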

What Went Well

  • Alert fired within 2 minutes of threshold breach
  • On-call response within 1 minute of alert
  • Clear incident command structure enabled efficient coordination
  • Rollback procedure worked flawlessly
  • Customer communication was timely and clear
  • No data loss or corruption occurred

What Went Poorly

  • Initial hypothesis (DDoS) delayed root cause identification by ~10 minutes
  • Connection pool monitoring alert triggered too late (95% threshold)
  • No automated connection leak detection in CI/CD pipeline
  • Rollback decision took 10 minutes due to uncertainty about data impact
  • Status page update was delayed by 5 minutes after incident was declared

Action Items

Immediate (This Sprint)

  • [P0] Deploy User Service v2.4.1 with connection leak fix — Owner: Backend Team — Due: Sep 20
  • [P0] Lower connection pool alert threshold to 70% — Owner: SRE — Due: Sep 19
  • [P0] Add connection idle timeout (5 minutes) to database config — Owner: DBA — Due: Sep 19
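
The idle-timeout item is best done in the database/driver configuration (option names vary by driver), but the mechanism can be sketched in application code. A hypothetical sweeper, for illustration only:

```javascript
// Hypothetical idle-connection sweeper: reclaims connections checked out
// longer than maxIdleMs. Illustration only; prefer the driver's or
// database's own idle timeout setting in production.
function makeSweeper(maxIdleMs, now = Date.now) {
  const checkedOut = new Map(); // conn -> checkout timestamp

  return {
    track(conn) { checkedOut.set(conn, now()); },
    untrack(conn) { checkedOut.delete(conn); },
    // Release anything held past the deadline; returns count reclaimed.
    sweep() {
      let reclaimed = 0;
      for (const [conn, since] of checkedOut) {
        if (now() - since > maxIdleMs) {
          conn.release();
          checkedOut.delete(conn);
          reclaimed++;
        }
      }
      return reclaimed;
    },
  };
}
```

With a 5-minute cap, a leak like this one could still degrade service, but the pool would self-heal instead of staying exhausted until rollback.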

Short-term (This Quarter)

  • [P1] Add integration tests for all database code paths — Owner: Backend Team — Due: Oct 15
  • [P1] Implement static analysis for resource leak patterns — Owner: Platform Team — Due: Nov 1
  • [P1] Create runbook for connection pool exhaustion incidents — Owner: SRE — Due: Sep 30
  • [P2] Add connection pool metrics to team dashboard — Owner: SRE — Due: Oct 8

Long-term (Next Quarter)

  • [P2] Evaluate a dedicated external connection pooler (PgBouncer) — Owner: DBA — Due: Q2
  • [P2] Implement chaos engineering for database failure scenarios — Owner: SRE — Due: Q2
  • [P3] Add automated rollback on error rate spike — Owner: Platform Team — Due: Q2
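
The automated-rollback item could start from a guard as simple as the one below. The thresholds are illustrative assumptions, not agreed production values:

```javascript
// Hypothetical rollback guard: flag a deploy for automatic rollback when the
// error rate exceeds a multiple of baseline for several consecutive samples.
// Thresholds here are illustrative, not our agreed production values.
function shouldRollback(samples, { baseline = 0.001, multiplier = 10, consecutive = 3 } = {}) {
  if (samples.length < consecutive) return false;
  return samples
    .slice(-consecutive)
    .every((errorRate) => errorRate > baseline * multiplier);
}

// With a 0.1% baseline, three consecutive samples above 1% trigger rollback.
// During this incident the error rate crossed that line before 14:30; a guard
// like this could have started the rollback ~20 minutes sooner than a human did.
```

Requiring consecutive samples avoids rolling back on a single noisy data point; the trade-off is a few minutes of added detection latency.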

Lessons Learned

Technical

  • Always use try/finally or connection managers for database resources
  • Mock objects in unit tests can hide resource management bugs
  • Connection pool metrics are critical—alert early, not at capacity
  • Idle connection timeouts provide a safety net for leaks
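
The try/finally lesson generalizes: wrap checkout and release in a small connection-manager helper so there is no release() call to forget. A hypothetical sketch:

```javascript
// Hypothetical connection-manager helper: the callback receives a connection
// and the helper guarantees release on every path, including thrown errors.
async function withConnection(pool, fn) {
  const conn = await pool.getConnection();
  try {
    return await fn(conn);
  } finally {
    conn.release();
  }
}

// getUserProfile rewritten on top of the helper: early returns are now safe
// because release happens in withConnection, not in each code path.
async function getUserProfile(pool, userId, options) {
  return withConnection(pool, async (conn) => {
    const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
    if (options.include_preferences) {
      const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
      return { user, preferences: prefs };
    }
    return { user };
  });
}
```

Centralizing release in one helper means a leak requires a bug in one place rather than correctness in every call site.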

Process

  • Code reviews should include a "resource management" checklist item
  • Large customer onboardings should trigger proactive scaling
  • Having a clear rollback decision tree reduces incident duration
  • Status page automation would improve customer communication speed

Supporting Data

Error Rate During Incident

Time (UTC)    Error Rate    Requests/min
14:20         0.1%          8,500
14:25         2.3%          9,200
14:30         8.7%          9,100
14:35         12.4%         8,800
14:40         11.8%         7,200  (traffic shedding)
14:45         10.2%         6,900
14:50         9.1%          7,400
14:55         5.6%          7,800
15:00         1.2%          8,200
15:05         0.1%          8,600

Connection Pool Utilization

Time (UTC)    Active    Available    Max
14:00         180       320          500
14:15         245       255          500
14:25         420       80           500
14:30         498       2            500
14:35         500       0            500 (exhausted)
15:00         380       120          500
15:10         195       305          500

References

  • Incident Slack channel: #inc-2024-0918-db-connections
  • Related PR: user-service#1847 (connection leak fix)
  • Grafana dashboard: Database Connections Overview
  • Previous related incident: INC-2023-0412 (similar pattern)
