Troubleshooting Guide

Diagnosing CrashLoopBackOff in Kubernetes

CrashLoopBackOff means a container repeatedly starts, crashes, and is throttled by Kubernetes’ backoff. This guide provides a safe diagnosis workflow and common fixes.

What CrashLoopBackOff actually means

Kubernetes restarts containers according to the pod’s restartPolicy (Always for Deployments). When a container keeps exiting, the kubelet delays each subsequent restart with an exponentially increasing backoff (capped at five minutes by default), and the pod reports CrashLoopBackOff while it waits for the next attempt.
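
Watching the pod makes the cycle visible (a quick check; <pod-name> and <namespace> are the same placeholders used in the steps below):

# STATUS alternates between Running/Error and CrashLoopBackOff while RESTARTS climbs
kubectl get pod <pod-name> -n <namespace> -w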

Fast triage checklist (5 minutes)

  1. Identify the failing container and last exit code.
  2. Check recent logs (including previous instance logs).
  3. Inspect pod events (image pull errors, probes, OOM kills).
  4. Validate config/secrets mounts and environment variables.
  5. Confirm readiness/liveness probe behavior.
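
Two commands cover most of this first pass (the steps below expand on each item):

# Pods sorted by restart count (highest last)
kubectl get pods -n <namespace> --sort-by='.status.containerStatuses[0].restartCount'

# Recent events in the namespace, newest last
kubectl get events -n <namespace> --sort-by='.lastTimestamp'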

Step 1 — Locate the failing pod and container

kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>

In describe output, look for:

  • Last State and Exit Code
  • Reason (for example, OOMKilled, Error)
  • Events at the bottom (probe failures, mount errors, image pulls)
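
To pull just the last exit code and reason without scanning the full output, a jsonpath query works (a sketch, assuming the first container is the one crashing):

kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}{" "}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}'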

Step 2 — Read logs (including the previous crash)

If the container restarts quickly, the current logs may be empty or cut off mid-startup. Use --previous to view output from the instance that actually crashed.

# Current container logs
kubectl logs <pod-name> -n <namespace> -c <container-name>

# Logs from the previous crashed instance
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous
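
When the crash is fast, limiting and timestamping the output makes it easier to line up against pod events (optional flags, shown as a sketch):

# Last 100 lines of the previous instance, with timestamps
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous --tail=100 --timestamps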

Step 3 — Confirm whether probes are causing restarts

Misconfigured probes are a common cause of restart loops, especially when:

  • The app needs more startup time than initialDelaySeconds provides
  • The readiness endpoint is too heavy and times out
  • Liveness probes check a dependency (DB/cache) that may be temporarily down

Check events for probe failures:

kubectl describe pod <pod-name> -n <namespace> | sed -n '/Events/,$p'
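
If events point to probes, tuning usually means giving the app more room before liveness kicks in. A minimal sketch of the relevant container-spec fields (the paths, port, and timings are placeholders to adjust per app):

livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30   # give the app time to start before liveness checks begin
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3       # tolerate transient failures before restarting
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  periodSeconds: 5
  timeoutSeconds: 2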

Step 4 — Common root causes and fixes

A) Application exits immediately (bad args/config)

Symptoms: exit code 1 or 2, logs show “unknown flag”, “missing env var”, or “cannot parse config”.

Fix:

  • Validate env vars and secrets exist and are spelled correctly
  • Confirm config file path and mount permissions
  • Compare container args/command to a known-good release
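
A couple of commands help confirm what the container actually receives (a sketch; <secret-name> and <configmap-name> are placeholders for the objects the pod references):

# Env vars and args as rendered into the pod spec
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[0].env}{"\n"}{.spec.containers[0].args}{"\n"}'

# Confirm the referenced objects exist and contain the expected keys
kubectl describe secret <secret-name> -n <namespace>
kubectl describe configmap <configmap-name> -n <namespace>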

B) OOMKilled (out of memory)

Symptoms: Reason: OOMKilled in pod status; restarts increase under load.

Fix:

  • Increase memory limits/requests appropriately
  • Investigate memory leaks (heap dumps, profiling)
  • Lower concurrency temporarily to stabilize
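
Raising the allocation is a container-spec change along these lines (a minimal sketch; the values are placeholders to size per workload):

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"   # OOMKilled means the container exceeded this limit

If metrics-server is installed, kubectl top pod <pod-name> -n <namespace> --containers shows current usage to size these values against.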

C) Image or runtime errors

Symptoms: events show ImagePullBackOff or ErrImagePull, or the binary can’t execute (for example, “exec format error” from an architecture mismatch).

Fix:

  • Verify image tag exists and registry credentials are valid
  • Confirm image architecture matches nodes (amd64/arm64)
  • Pin to a known-good tag (avoid floating latest)
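
Two checks narrow this down quickly (a sketch):

# Exact image reference the pod is trying to pull
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].image}{"\n"}'

# Node architectures in the cluster (compare against the image's build)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.architecture}{"\n"}{end}'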

D) Dependency not ready (DB/cache)

Symptoms: logs show connection refused/timeouts; the app exits rather than retrying.

Fix:

  • Add retry/backoff logic in the application
  • Increase startup probe window (use startupProbe when appropriate)
  • Separate readiness checks from dependency health when possible
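
A startupProbe gives slow or dependency-bound startup its own window so liveness doesn’t fire during initialization. A minimal sketch (path, port, and timings are placeholders):

startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 5
  failureThreshold: 30   # allows up to ~150s of startup before liveness takes over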

Safe remediation patterns

  • Don’t delete blindly: capture describe output and logs first.
  • Roll back quickly: if the crash correlates with a recent rollout, revert to the last stable version (see the command sketch after this list).
  • Change one thing: adjust probes or config in small steps to avoid hiding the real cause.
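
For the rollback case, the standard sequence looks like this (a sketch, assuming the workload is a Deployment):

# Review recent revisions, revert to the last stable one, and watch the rollout
kubectl rollout history deployment/<deployment-name> -n <namespace>
kubectl rollout undo deployment/<deployment-name> -n <namespace>
kubectl rollout status deployment/<deployment-name> -n <namespace>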

When to escalate

Escalate to engineering with concrete evidence:

  • Exit code + reason (OOMKilled, Error)
  • Relevant log excerpt (with timestamps)
  • Recent deployment changes (image tag, config map, secret, flags)
  • Probe failures and thresholds
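
Capturing that evidence into files keeps the handoff simple (a sketch; the output filenames are arbitrary):

kubectl describe pod <pod-name> -n <namespace> > pod-describe.txt
kubectl logs <pod-name> -n <namespace> -c <container-name> --previous --timestamps > crash-logs.txt
kubectl get events -n <namespace> --sort-by='.lastTimestamp' > events.txt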
