Troubleshooting Guide
Diagnosing CrashLoopBackOff in Kubernetes
CrashLoopBackOff means a container repeatedly starts, crashes, and is throttled by Kubernetes’ backoff.
This guide provides a safe diagnosis workflow and common fixes.
What CrashLoopBackOff actually means
Kubernetes restarts containers based on the pod’s restart policy (typically Always for Deployments).
When the container exits repeatedly, Kubernetes delays subsequent restarts (backoff), resulting in CrashLoopBackOff.
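To confirm how a specific pod will be restarted, the restart policy and per-container restart count can be read directly. A quick sketch using jsonpath (substitute your pod and namespace):
# Restart policy (usually Always for Deployment-managed pods)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.restartPolicy}'
# Restart count for each container
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[*].restartCount}'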
Fast triage checklist (5 minutes)
- Identify the failing container and last exit code.
- Check recent logs (including previous instance logs).
- Inspect pod events (image pull errors, probes, OOM kills).
- Validate config/secrets mounts and environment variables.
- Confirm readiness/liveness probe behavior.
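As a starting point for this checklist, sorting pods by restart count surfaces the noisiest ones first (the field path below assumes the first container is the one crashing):
kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'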
Step 1 — Locate the failing pod and container
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
In describe output, look for:
- Last State and Exit Code
- Reason (for example, OOMKilled, Error)
- Events at the bottom (probe failures, mount errors, image pulls)
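If the describe output is long, the same termination details can be pulled directly with jsonpath (a sketch; the index 0 assumes a single-container pod):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'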
Step 2 — Read logs (including the previous crash)
If the container restarts quickly, the current logs may capture only a moment before the next crash. Use --previous to view the prior instance.
# Current container logs
kubectl logs <pod-name> -n <namespace> -c <container-name>
# Logs from the previous crashed instance
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous
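When the loop is fast, timestamps and a bounded tail make it easier to line log lines up with restart events; these are standard kubectl logs flags:
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous --timestamps --tail=100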
Step 3 — Confirm whether probes are causing restarts
Misconfigured probes are a common cause of restart loops, especially when:
- The app needs more startup time than initialDelaySeconds provides
- The readiness endpoint is too heavy and times out
- Liveness probes check a dependency (DB/cache) that may be temporarily down
Check events for probe failures:
kubectl describe pod <pod-name> -n <namespace> | sed -n '/Events/,$p'
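If the events show liveness failures, one option is to relax the probe timings in the container spec. A minimal sketch, assuming an HTTP health endpoint at /healthz on port 8080 (path, port, and values are illustrative, not taken from your manifest):
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to boot before liveness checks start
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 6       # tolerate transient slowness before restarting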
Step 4 — Common root causes and fixes
A) Application exits immediately (bad args/config)
Symptoms: exit code 1 or 2, logs show “unknown flag”, “missing env var”, or “cannot parse config”.
Fix:
- Validate env vars and secrets exist and are spelled correctly
- Confirm config file path and mount permissions
- Compare container args/command to a known-good release
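If the container dies too fast to inspect, one approach is to run a copy of the pod with its command overridden so you can look around; kubectl debug supports this (the copy name is illustrative, and the image is assumed to contain a shell):
kubectl debug <pod-name> -n <namespace> -it --copy-to=<pod-name>-debug --container=<container-name> -- sh
# inside the shell: check env vars and config mounts, then run the original command by hand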
B) OOMKilled (out of memory)
Symptoms: Reason: OOMKilled in pod status; restarts increase under load.
Fix:
- Increase memory limits/requests appropriately
- Investigate memory leaks (heap dumps, profiling)
- Lower concurrency temporarily to stabilize
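Before raising limits, check what is currently set and compare it to observed usage. A hedged sketch (placeholder values; kubectl top requires metrics-server):
# Current requests/limits on the first container
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].resources}'
# Observed usage, if metrics-server is installed
kubectl top pod <pod-name> -n <namespace>
# Example resources block with a raised memory limit (size from real usage)
resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"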
C) Image or runtime errors
Symptoms: events show ImagePullBackOff, ErrImagePull, or the binary can’t execute (wrong architecture).
Fix:
- Verify image tag exists and registry credentials are valid
- Confirm image architecture matches nodes (amd64/arm64)
- Pin to a known-good tag (avoid floating latest)
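Two quick checks for this class of failure: confirm the exact image reference the pod is using, and compare node CPU architecture against the image build (the -L column shows the standard kubernetes.io/arch label):
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].image}'
kubectl get nodes -L kubernetes.io/arch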
D) Dependency not ready (DB/cache)
Symptoms: logs show connection refused/timeouts; the app exits rather than retrying.
Fix:
- Add retry/backoff logic in the application
- Increase the startup probe window (use startupProbe when appropriate; a sketch follows this list)
- Separate readiness checks from dependency health when possible
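A startupProbe delays liveness checks until the app has come up once, which helps when a dependency is slow to appear. A minimal sketch, again assuming a /healthz endpoint on port 8080 (illustrative values):
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # allows roughly 150s of startup before Kubernetes gives up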
Safe remediation patterns
- Don’t delete blindly: capture describe output and logs first.
- Roll back quickly: if the crash correlates with a recent rollout, revert to the last stable version (see the commands after this list).
- Change one thing: adjust probes or config in small steps to avoid hiding the real cause.
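For the rollback case, the standard Deployment commands are (substitute your Deployment name):
# Review recent revisions
kubectl rollout history deployment/<deployment-name> -n <namespace>
# Revert to the previous revision
kubectl rollout undo deployment/<deployment-name> -n <namespace>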
When to escalate
Escalate to engineering with concrete evidence:
- Exit code + reason (OOMKilled, Error)
- Relevant log excerpt (with timestamps)
- Recent deployment changes (image tag, config map, secret, flags)
- Probe failures and thresholds
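A small sketch for capturing that evidence before anything is restarted or deleted (output file names are arbitrary):
kubectl describe pod <pod-name> -n <namespace> > pod-describe.txt
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous --timestamps > crash-logs.txt
kubectl get events -n <namespace> --sort-by=.lastTimestamp > events.txt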