Incident Postmortem: Database Connection Pool Exhaustion
This sample demonstrates blameless postmortem writing: clear timelines, root cause analysis, impact assessment, and actionable follow-up items.
Incident Summary
- Incident ID: INC-2024-0918
- Date: September 18, 2024
- Duration: 47 minutes (14:23 - 15:10 UTC)
- Severity: SEV-1 (Customer-facing service degradation)
- Services affected: API Gateway, User Service, Payments Service
- Incident Commander: Sarah Chen
- Author: DevOps Team
Executive Summary
On September 18, 2024, our primary API experienced severe latency and partial outages for 47 minutes due to database connection pool exhaustion. The root cause was a combination of a traffic spike from a new enterprise customer onboarding and a connection leak in the recently deployed User Service v2.4.0.
Approximately 12% of API requests failed during the incident window. No data loss occurred. The issue was mitigated by rolling back User Service to v2.3.2 and temporarily increasing database connection limits.
Impact
Customer Impact
- ~15,000 users experienced failed or slow requests
- 12% error rate across all API endpoints (baseline: 0.1%)
- P99 latency increased from 200ms to 8,500ms
- Mobile app users saw "Unable to connect" errors
- 3 enterprise customers filed support tickets
Business Impact
- Estimated revenue impact: $12,400 (failed checkout transactions)
- 23 support tickets opened during incident
- Status page updated to "Degraded Performance" for 52 minutes
Internal Impact
- 7 engineers pulled into incident response
- Deployment freeze enacted for 24 hours
- Scheduled maintenance postponed
Timeline (All times UTC)
Detection
- 14:23 — PagerDuty alert: "API Gateway error rate > 5%" triggered
- 14:24 — On-call engineer (Mike R.) acknowledges alert
- 14:26 — Second alert: "Database connection pool at 95% capacity"
- 14:28 — Incident declared SEV-2, war room opened
Investigation
- 14:30 — Initial hypothesis: DDoS attack. Traffic analysis shows legitimate traffic from new enterprise customer
- 14:35 — Database team confirms connection pool exhausted (500/500 connections in use)
- 14:38 — Incident escalated to SEV-1, Sarah C. assumes Incident Commander role
- 14:42 — User Service logs show connections not being released after requests complete
- 14:45 — Correlation identified: User Service v2.4.0 deployed at 13:15 today
Mitigation
- 14:48 — Decision: Roll back User Service to v2.3.2
- 14:52 — Rollback initiated
- 14:58 — Rollback complete, connections beginning to release
- 15:02 — Connection pool at 60% capacity
- 15:05 — Error rate returns to baseline (0.1%)
- 15:10 — Incident resolved, monitoring continues
Post-Incident
- 15:30 — Status page updated to "Operational"
- 16:00 — Initial customer communications sent
- 17:00 — Engineering all-hands briefing
- Sep 20 — Postmortem review meeting held
Root Cause Analysis
Primary Cause
A code change in User Service v2.4.0 introduced a connection leak in the
getUserProfile() function. When requests included the optional
include_preferences parameter, the code path taken did not properly
release the database connection back to the pool.
// BEFORE (v2.4.0 - Bug)
async function getUserProfile(userId, options) {
  const conn = await pool.getConnection();
  const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
  if (options.include_preferences) {
    const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
    return { user, preferences: prefs }; // Connection never released!
  }
  conn.release();
  return { user };
}
// AFTER (v2.4.1 - Fixed)
async function getUserProfile(userId, options) {
  const conn = await pool.getConnection();
  try {
    const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
    if (options.include_preferences) {
      const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
      return { user, preferences: prefs };
    }
    return { user };
  } finally {
    conn.release(); // Always release connection
  }
}
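A step beyond the v2.4.1 fix is to keep acquire/release out of business logic entirely. The sketch below shows the "connection manager" pattern with a synchronous stand-in pool; FakePool and withConnection are hypothetical names for illustration, not part of the real service. With a real async driver, the helper would be async and await the callback inside the same try/finally.

```javascript
// Stand-in pool that tracks how many connections are checked out.
class FakePool {
  constructor(max) {
    this.max = max;
    this.inUse = 0;
  }
  getConnection() {
    if (this.inUse >= this.max) throw new Error('pool exhausted');
    this.inUse += 1;
    return { release: () => { this.inUse -= 1; } };
  }
}

// Centralizing acquire/release means no early return or thrown
// error in the callback can skip the release.
function withConnection(pool, fn) {
  const conn = pool.getConnection();
  try {
    return fn(conn);
  } finally {
    conn.release();
  }
}

const pool = new FakePool(5);
const result = withConnection(pool, () => 'ok');
```

Because the release lives in one place, a leak like the v2.4.0 bug becomes structurally impossible: even a callback that throws leaves the pool fully released.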
Contributing Factors
- Traffic spike: New enterprise customer onboarding increased traffic by 40%, accelerating connection pool exhaustion
- Insufficient test coverage: The include_preferences code path was not covered by integration tests
- Missing connection pool monitoring: Alert threshold was set at 95%, providing insufficient lead time
- No connection timeout: Leaked connections persisted indefinitely instead of timing out
Why Wasn't This Caught Earlier?
- Unit tests mocked database connections, hiding the leak
- Staging environment receives far less traffic, so the leak drained the pool too slowly to be noticed before release
- Code review focused on business logic, not resource management
- No automated static analysis for resource leak patterns
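One way to close this gap is an integration-style test that asserts the pool has zero checked-out connections after every handler call. The sketch below uses simplified stand-ins (a counting makePool and a condensed copy of the v2.4.0 bug), not the real service code; against v2.4.0, a post-call assertion of zero active connections would fail and flag the leak in CI.

```javascript
// Instrumented stand-in pool that counts checked-out connections.
function makePool() {
  let inUse = 0;
  return {
    getConnection() {
      inUse += 1;
      return { query: () => ({}), release: () => { inUse -= 1; } };
    },
    activeCount: () => inUse,
  };
}

// Condensed copy of the v2.4.0 bug: the include_preferences branch
// returns without releasing the connection.
function getUserProfile(pool, userId, options) {
  const conn = pool.getConnection();
  const user = conn.query();
  if (options.include_preferences) {
    return { user, preferences: conn.query() }; // leaked
  }
  conn.release();
  return { user };
}

const pool = makePool();
getUserProfile(pool, 42, { include_preferences: true });
const leaked = pool.activeCount(); // asserting this is 0 would catch the bug
```

The key point is that the assertion runs against a real (or realistically instrumented) pool rather than a mock, so resource handling is exercised, not hidden.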
What Went Well
- Alert fired within 2 minutes of threshold breach
- On-call response within 1 minute of alert
- Clear incident command structure enabled efficient coordination
- Rollback procedure worked flawlessly
- Customer communication was timely and clear
- No data loss or corruption occurred
What Went Poorly
- Initial hypothesis (DDoS) delayed root cause identification by ~10 minutes
- Connection pool monitoring alert triggered too late (95% threshold)
- No automated connection leak detection in CI/CD pipeline
- Rollback decision took 10 minutes due to uncertainty about data impact
- Status page update was delayed by 5 minutes after incident was declared
Action Items
Immediate (This Sprint)
- [P0] Deploy User Service v2.4.1 with connection leak fix — Owner: Backend Team — Due: Sep 20
- [P0] Lower connection pool alert threshold to 70% — Owner: SRE — Due: Sep 19
- [P0] Add connection idle timeout (5 minutes) to database config — Owner: DBA — Due: Sep 19
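The idle-timeout item might look like the fragment below. The option names follow mysql2-style pool configuration and are an assumption; drivers differ, so check your driver's documentation before copying values.

```javascript
// Hypothetical pool settings; option names assume a mysql2-style driver.
const poolConfig = {
  connectionLimit: 500,        // hard cap, matching the DB-side limit
  maxIdle: 50,                 // keep at most 50 idle connections around
  idleTimeout: 5 * 60 * 1000,  // ms: reap connections idle for 5 minutes
};
```

An idle timeout does not fix a leak, but it bounds the damage: a leaked connection is reclaimed within minutes instead of persisting until restart.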
Short-term (This Quarter)
- [P1] Add integration tests for all database code paths — Owner: Backend Team — Due: Oct 15
- [P1] Implement static analysis for resource leak patterns — Owner: Platform Team — Due: Nov 1
- [P1] Create runbook for connection pool exhaustion incidents — Owner: SRE — Due: Sep 30
- [P2] Add connection pool metrics to team dashboard — Owner: SRE — Due: Oct 8
Long-term (Next Quarter)
- [P2] Evaluate connection pooling library upgrade (PgBouncer) — Owner: DBA — Due: Q2
- [P2] Implement chaos engineering for database failure scenarios — Owner: SRE — Due: Q2
- [P3] Add automated rollback on error rate spike — Owner: Platform Team — Due: Q2
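The automated-rollback item could start from a check like this hedged sketch: trip a rollback when the error rate stays above a threshold for several consecutive samples. Function names, thresholds, and sample values are illustrative, not the real pipeline.

```javascript
// Returns true when the last `window` samples all exceed `threshold`.
// A real deployment controller would feed this from the metrics pipeline.
function shouldRollback(errorRates, threshold = 0.05, window = 3) {
  if (errorRates.length < window) return false;
  return errorRates.slice(-window).every((rate) => rate > threshold);
}

// Illustrative samples, roughly tracking this incident's error-rate curve.
const incidentSamples = [0.001, 0.023, 0.087, 0.124, 0.118];
const trip = shouldRollback(incidentSamples);
```

Requiring several consecutive bad samples avoids rolling back on a single noisy data point, at the cost of a few minutes of detection latency.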
Lessons Learned
Technical
- Always use try/finally or connection managers for database resources
- Mock objects in unit tests can hide resource management bugs
- Connection pool metrics are critical—alert early, not at capacity
- Idle connection timeouts provide a safety net for leaks
Process
- Code reviews should include a "resource management" checklist item
- Large customer onboardings should trigger proactive scaling
- Having a clear rollback decision tree reduces incident duration
- Status page automation would improve customer communication speed
Supporting Data
Error Rate During Incident
Time (UTC) Error Rate Requests/min
14:20 0.1% 8,500
14:25 2.3% 9,200
14:30 8.7% 9,100
14:35 12.4% 8,800
14:40 11.8% 7,200 (traffic shedding)
14:45 10.2% 6,900
14:50 9.1% 7,400
14:55 5.6% 7,800
15:00 1.2% 8,200
15:05 0.1% 8,600
Connection Pool Utilization
Time (UTC) Active Available Max
14:00 180 320 500
14:15 245 255 500
14:25 420 80 500
14:30 498 2 500
14:35 500 0 500 (exhausted)
15:00 380 120 500
15:10 195 305 500
References
- Incident Slack channel: #inc-2024-0918-db-connections
- Related PR: user-service#1847 (connection leak fix)
- Grafana dashboard: Database Connections Overview
- Previous related incident: INC-2023-0412 (similar pattern)