Incident Postmortem: Database Connection Pool Exhaustion
This sample demonstrates blameless postmortem writing: clear timelines, root cause analysis, impact assessment, and actionable follow-up items.
Incident Summary
- Incident ID: INC-2024-0918
- Date: September 18, 2024
- Duration: 47 minutes (14:23 - 15:10 UTC)
- Severity: SEV-1 (Customer-facing service degradation)
- Services affected: API Gateway, User Service, Payments Service
- Incident Commander: Sarah Chen
- Author: DevOps Team
Executive Summary
On September 18, 2024, our primary API experienced severe latency and partial outages for 47 minutes due to database connection pool exhaustion. The root cause was a combination of a traffic spike from a new enterprise customer onboarding and a connection leak in the recently deployed User Service v2.4.0.
Approximately 12% of API requests failed during the incident window. No data loss occurred. The issue was mitigated by rolling back User Service to v2.3.2 and temporarily increasing database connection limits.
Impact
Customer Impact
- ~15,000 users experienced failed or slow requests
- 12% error rate across all API endpoints (baseline: 0.1%)
- P99 latency increased from 200ms to 8,500ms
- Mobile app users saw "Unable to connect" errors
- 3 enterprise customers filed support tickets
Business Impact
- Estimated revenue impact: $12,400 (failed checkout transactions)
- 23 support tickets opened during incident
- Status page updated to "Degraded Performance" for 52 minutes
Internal Impact
- 7 engineers pulled into incident response
- Deployment freeze enacted for 24 hours
- Scheduled maintenance postponed
Timeline (All times UTC)
Detection
- 14:23 — PagerDuty alert: "API Gateway error rate > 5%" triggered
- 14:24 — On-call engineer (Mike R.) acknowledges alert
- 14:26 — Second alert: "Database connection pool at 95% capacity"
- 14:28 — Incident declared SEV-2, war room opened
Investigation
- 14:30 — Initial hypothesis: DDoS attack. Traffic analysis shows legitimate traffic from new enterprise customer
- 14:35 — Database team confirms connection pool exhausted (500/500 connections in use)
- 14:38 — Incident escalated to SEV-1, Sarah C. assumes Incident Commander role
- 14:42 — User Service logs show connections not being released after requests complete
- 14:45 — Correlation identified: User Service v2.4.0 deployed at 13:15 today
Mitigation
- 14:48 — Decision: Roll back User Service to v2.3.2
- 14:52 — Rollback initiated
- 14:58 — Rollback complete, connections beginning to release
- 15:02 — Connection pool at 60% capacity
- 15:05 — Error rate returns to baseline (0.1%)
- 15:10 — Incident resolved, monitoring continues
Post-Incident
- 15:30 — Status page updated to "Operational"
- 16:00 — Initial customer communications sent
- 17:00 — Engineering all-hands briefing
- Sep 20 — Postmortem review meeting held
Root Cause Analysis
Primary Cause
A code change in User Service v2.4.0 introduced a connection leak in the
getUserProfile() function. When requests included the optional
include_preferences parameter, the code path taken did not properly
release the database connection back to the pool.
// BEFORE (v2.4.0 - Bug)
async function getUserProfile(userId, options) {
  const conn = await pool.getConnection();
  const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
  if (options.include_preferences) {
    const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
    return { user, preferences: prefs }; // Connection never released!
  }
  conn.release();
  return { user };
}
// AFTER (v2.4.1 - Fixed)
async function getUserProfile(userId, options) {
  const conn = await pool.getConnection();
  try {
    const user = await conn.query('SELECT * FROM users WHERE id = ?', [userId]);
    if (options.include_preferences) {
      const prefs = await conn.query('SELECT * FROM preferences WHERE user_id = ?', [userId]);
      return { user, preferences: prefs };
    }
    return { user };
  } finally {
    conn.release(); // Always release connection
  }
}
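A step beyond the v2.4.1 fix is to keep acquire/release out of business logic entirely. The sketch below shows the "connection manager" pattern with a synchronous stand-in pool; FakePool and withConnection are hypothetical names for illustration, not part of the real service. With a real async driver, the helper would be async and await the callback inside the same try/finally.

```javascript
// Stand-in pool that tracks how many connections are checked out.
class FakePool {
  constructor(max) {
    this.max = max;
    this.inUse = 0;
  }
  getConnection() {
    if (this.inUse >= this.max) throw new Error('pool exhausted');
    this.inUse += 1;
    return { release: () => { this.inUse -= 1; } };
  }
}

// Centralizing acquire/release means no early return or thrown
// error in the callback can skip the release.
function withConnection(pool, fn) {
  const conn = pool.getConnection();
  try {
    return fn(conn);
  } finally {
    conn.release();
  }
}

const pool = new FakePool(5);
const result = withConnection(pool, () => 'ok');
```

Because the release lives in one place, a leak like the v2.4.0 bug becomes structurally impossible: even a callback that throws leaves the pool fully released.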
Contributing Factors
- Traffic spike: New enterprise customer onboarding increased traffic by 40%, accelerating connection pool exhaustion
- Insufficient test coverage: The include_preferences code path was not covered by integration tests
- Missing connection pool monitoring: Alert threshold was set at 95%, providing insufficient lead time
- No connection timeout: Leaked connections persisted indefinitely instead of timing out
Why Wasn't This Caught Earlier?
- Unit tests mocked database connections, hiding the leak
- Staging environment receives far less traffic, so the leak drained the pool too slowly to be noticed before release
- Code review focused on business logic, not resource management
- No automated static analysis for resource leak patterns
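One way to close this gap is an integration-style test that asserts the pool has zero checked-out connections after every handler call. The sketch below uses simplified stand-ins (a counting makePool and a condensed copy of the v2.4.0 bug), not the real service code; against v2.4.0, a post-call assertion of zero active connections would fail and flag the leak in CI.

```javascript
// Instrumented stand-in pool that counts checked-out connections.
function makePool() {
  let inUse = 0;
  return {
    getConnection() {
      inUse += 1;
      return { query: () => ({}), release: () => { inUse -= 1; } };
    },
    activeCount: () => inUse,
  };
}

// Condensed copy of the v2.4.0 bug: the include_preferences branch
// returns without releasing the connection.
function getUserProfile(pool, userId, options) {
  const conn = pool.getConnection();
  const user = conn.query();
  if (options.include_preferences) {
    return { user, preferences: conn.query() }; // leaked
  }
  conn.release();
  return { user };
}

const pool = makePool();
getUserProfile(pool, 42, { include_preferences: true });
const leaked = pool.activeCount(); // asserting this is 0 would catch the bug
```

The key point is that the assertion runs against a real (or realistically instrumented) pool rather than a mock, so resource handling is exercised, not hidden.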
What Went Well
- Alert fired within 2 minutes of threshold breach
- On-call response within 1 minute of alert
- Clear incident command structure enabled efficient coordination
- Rollback procedure worked flawlessly
- Customer communication was timely and clear
- No data loss or corruption occurred
What Went Poorly
- Initial hypothesis (DDoS) delayed root cause identification by ~10 minutes
- Connection pool monitoring alert triggered too late (95% threshold)
- No automated connection leak detection in CI/CD pipeline
- Rollback decision took 10 minutes due to uncertainty about data impact
- Status page update was delayed by 5 minutes after incident was declared
Action Items
Immediate (This Sprint)
- [P0] Deploy User Service v2.4.1 with connection leak fix — Owner: Backend Team — Due: Sep 20
- [P0] Lower connection pool alert threshold to 70% — Owner: SRE — Due: Sep 19
- [P0] Add connection idle timeout (5 minutes) to database config — Owner: DBA — Due: Sep 19
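The idle-timeout item might look like the fragment below. The option names follow mysql2-style pool configuration and are an assumption; drivers differ, so check your driver's documentation before copying values.

```javascript
// Hypothetical pool settings; option names assume a mysql2-style driver.
const poolConfig = {
  connectionLimit: 500,        // hard cap, matching the DB-side limit
  maxIdle: 50,                 // keep at most 50 idle connections around
  idleTimeout: 5 * 60 * 1000,  // ms: reap connections idle for 5 minutes
};
```

An idle timeout does not fix a leak, but it bounds the damage: a leaked connection is reclaimed within minutes instead of persisting until restart.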
Short-term (This Quarter)
- [P1] Add integration tests for all database code paths — Owner: Backend Team — Due: Oct 15
- [P1] Implement static analysis for resource leak patterns — Owner: Platform Team — Due: Nov 1
- [P1] Create runbook for connection pool exhaustion incidents — Owner: SRE — Due: Sep 30
- [P2] Add connection pool metrics to team dashboard — Owner: SRE — Due: Oct 8
Long-term (Next Quarter)
- [P2] Evaluate connection pooling library upgrade (PgBouncer) — Owner: DBA — Due: Q2
- [P2] Implement chaos engineering for database failure scenarios — Owner: SRE — Due: Q2
- [P3] Add automated rollback on error rate spike — Owner: Platform Team — Due: Q2
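The automated-rollback item could start from a check like this hedged sketch: trip a rollback when the error rate stays above a threshold for several consecutive samples. Function names, thresholds, and sample values are illustrative, not the real pipeline.

```javascript
// Returns true when the last `window` samples all exceed `threshold`.
// A real deployment controller would feed this from the metrics pipeline.
function shouldRollback(errorRates, threshold = 0.05, window = 3) {
  if (errorRates.length < window) return false;
  return errorRates.slice(-window).every((rate) => rate > threshold);
}

// Illustrative samples, roughly tracking this incident's error-rate curve.
const incidentSamples = [0.001, 0.023, 0.087, 0.124, 0.118];
const trip = shouldRollback(incidentSamples);
```

Requiring several consecutive bad samples avoids rolling back on a single noisy data point, at the cost of a few minutes of detection latency.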
Lessons Learned
Technical
- Always use try/finally or connection managers for database resources
- Mock objects in unit tests can hide resource management bugs
- Connection pool metrics are critical—alert early, not at capacity
- Idle connection timeouts provide a safety net for leaks
Process
- Code reviews should include a "resource management" checklist item
- Large customer onboardings should trigger proactive scaling
- Having a clear rollback decision tree reduces incident duration
- Status page automation would improve customer communication speed
Supporting Data
Error Rate During Incident
Time (UTC) Error Rate Requests/min
14:20 0.1% 8,500
14:25 2.3% 9,200
14:30 8.7% 9,100
14:35 12.4% 8,800
14:40 11.8% 7,200 (traffic shedding)
14:45 10.2% 6,900
14:50 9.1% 7,400
14:55 5.6% 7,800
15:00 1.2% 8,200
15:05 0.1% 8,600
Connection Pool Utilization
Time (UTC) Active Available Max
14:00 180 320 500
14:15 245 255 500
14:25 420 80 500
14:30 498 2 500
14:35 500 0 500 (exhausted)
15:00 380 120 500
15:10 195 305 500
References
- Incident Slack channel: #inc-2024-0918-db-connections
- Related PR: user-service#1847 (connection leak fix)
- Grafana dashboard: Database Connections Overview
- Previous related incident: INC-2023-0412 (similar pattern)