Debugging Kubernetes pod failures: a field guide

A triage-first field guide to debugging Kubernetes pods. Read STATUS, describe Events and Last State, and route each failure to its fix instead of guessing.

By Intellira Engineering, Editorial team

Start here: read the status, do not guess

When a pod is not running, the fastest path to a fix is not a hunch about the application. It is a fixed three-command triage that tells you which class of failure you are in, after which you go straight to the right fix. Run these in order:

kubectl get pods                       # read STATUS + RESTARTS
kubectl describe pod <pod>             # read Events + State / Last State
kubectl logs <pod> --previous          # read the crashed container's output

STATUS tells you the phase or the waiting reason. describe gives you the Events stream and the container's Last State (reason + exit code). logs --previous shows what the container printed before it died. Most pod failures are fully classified by these three outputs. The rest of this guide is the decision tree for reading them, plus a link to the dedicated fix for each failure. It does not repeat those fixes — it gets you to the right one.
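As a convenience, the three commands can be wrapped in one shell function. This is a minimal sketch, not part of kubectl: the name triage_pod and the "default" namespace fallback are assumptions made for illustration.

```shell
# Hypothetical helper: run the three triage commands for one pod, in order.
# Usage: triage_pod <pod> [namespace]
triage_pod() {
  pod="$1"
  ns="${2:-default}"   # assumption: fall back to the "default" namespace

  echo "== STATUS and RESTARTS =="
  kubectl get pod "$pod" --namespace "$ns"

  echo "== Events, State, Last State =="
  kubectl describe pod "$pod" --namespace "$ns"

  echo "== Output of the previous (crashed) container =="
  # --previous returns an error when there is no earlier container instance,
  # which is normal for failures that happen before the container starts.
  kubectl logs "$pod" --namespace "$ns" --previous 2>/dev/null \
    || echo "(no previous container logs available)"
}
```

Calling triage_pod web-0 prod (both names are placeholders) prints the three outputs under labelled headings so they can be read top-down.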

Why a systematic approach beats guessing

A Kubernetes pod can fail to run for reasons that live in completely different subsystems: the scheduler could not place it, the kubelet could not pull its image, a referenced Secret does not exist, the container crashed on startup, the kernel OOM-killed it, or the node it landed on went unhealthy. These look superficially similar — "my pod isn't up" — but the evidence that confirms each one is different, and so is the fix.

Guessing is expensive because the failure classes are disjoint. Editing resource limits does nothing for a missing ConfigMap; fixing an image tag does nothing for a node that stopped reporting. The triage below is built around one fact: Kubernetes already records which class you are in, in two places — the pod phase and the container state's Reason — and you only need to read them in the right order.

The triage framework

Step 1 — kubectl get pods: phase and restarts

A pod has a phase, a high-level summary of where it is in its lifecycle. The official definitions (Kubernetes: Pod Lifecycle):

  • Pending — "The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network."
  • Running — "The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting."
  • Succeeded — all containers exited 0 and will not restart.
  • Failed — all containers terminated and at least one failed.
  • Unknown — "the state of the Pod could not be obtained," typically a communication error with the node.

The STATUS column you see in kubectl get pods is often not the bare phase but a more specific waiting/terminated reason such as ContainerCreating, ImagePullBackOff, CrashLoopBackOff, or CreateContainerConfigError. That string is your first routing key. The RESTARTS column is the second: a high and climbing restart count means the container starts and dies repeatedly, which is the signature of a crash loop rather than a placement or config problem.

Step 2 — kubectl describe pod: Events and Last State

kubectl describe pod <pod> is the highest-value single command. It returns the container state (Waiting, Running, or Terminated), the restart count, resource requests and limits, the pod conditions, and — most importantly — the Events stream and the container's Last State (Kubernetes: Debug Running Pods).

Two fields decide most cases:

  • Events — the scheduler and kubelet narrate here. "0/3 nodes are available: insufficient cpu" is a scheduling failure; "Failed to pull image" is a registry failure; "configmap … not found" is a config failure. The official guidance for a stuck Pending pod is explicit: "There should be messages from the scheduler about why it can not schedule your pod" (Kubernetes: Debug Pods).
  • Last State — for a container that has restarted, this shows the previous termination's Reason and exit code. Reason: OOMKilled with exit code 137 means the kernel killed it; a non-zero application exit code means the program itself failed.

If describe is not enough, kubectl get pod <pod> -o yaml returns "essentially all of the information the system has about the Pod" (Kubernetes: Debug Running Pods).
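When only the Last State fields are needed, they can be pulled out with a jsonpath query instead of scanning the full describe output. The sketch below assumes a single-container pod (index 0 of containerStatuses); the function name last_state is hypothetical.

```shell
# Hypothetical helper: print "<reason> <exitCode>" for the previous
# termination of the first container in a pod.
# Assumption: the container of interest is containerStatuses[0].
last_state() {
  pod="$1"
  ns="${2:-default}"
  kubectl get pod "$pod" --namespace "$ns" \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason} {.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}
```

For a pod whose container has never terminated, both fields are empty, which is itself a signal that the failure is not a crash.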

Step 3 — kubectl logs --previous: what the dead container said

When a container has already crashed and restarted, the live logs are from the new attempt and may be empty. The --previous flag "retrieves logs from the prior, terminated container, which is where you can find the specific stack trace or error message that reveals the cause of the crash" (Google Cloud: Troubleshoot CrashLoopBackOff events). For any crash loop, this is the command that contains the actual answer.

The decision: route by phase, then by reason

Read the outputs top-down and branch:

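  • STATUS is Pending → the pod has most likely not been scheduled yet (the Pending phase also covers image download, but that normally shows as ContainerCreating or an image-pull reason). Read Events for the scheduler's reason. → Pod Pending or Node NotReady.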
  • STATUS is ContainerCreating and stays there → scheduled, but the kubelet cannot finish setup (volume, network, image). → ContainerCreating stuck.
  • STATUS is ImagePullBackOff / ErrImagePull → the image cannot be pulled. → ImagePullBackOff.
  • STATUS is CreateContainerConfigError → a referenced ConfigMap or Secret is missing or malformed. → CreateContainerConfigError.
  • STATUS is CrashLoopBackOff with climbing RESTARTS → the container starts then dies. Check Last State reason/exit code. → CrashLoopBackOff.
  • Last State reason is OOMKilled (exit 137) → memory limit or node pressure. → OOMKilled.
  • A whole node's pods go Unknown/NotReady at once → suspect the node, not the app. → Node NotReady.
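The routing above can be written down as a small lookup. This is an illustrative sketch only: route_failure is a hypothetical name, the returned strings are the page titles used in this guide, and giving the Last State reason priority over STATUS is an assumption based on OOM kills usually surfacing as CrashLoopBackOff.

```shell
# Sketch of the decision list: map a STATUS string (and optionally the
# Last State reason) to the page that covers the fix.
route_failure() {
  status="$1"
  last_reason="${2:-}"

  # Assumption: a recorded OOM kill is more specific than the STATUS string.
  if [ "$last_reason" = "OOMKilled" ]; then
    echo "OOMKilled"
    return 0
  fi

  case "$status" in
    Pending)                       echo "Pod Pending" ;;
    ContainerCreating)             echo "ContainerCreating stuck" ;;
    ImagePullBackOff|ErrImagePull) echo "ImagePullBackOff" ;;
    CreateContainerConfigError)    echo "CreateContainerConfigError" ;;
    CrashLoopBackOff)              echo "CrashLoopBackOff" ;;
    OOMKilled)                     echo "OOMKilled" ;;
    Unknown)                       echo "Node NotReady" ;;
    *)
      echo "unclassified: read describe Events"
      return 1
      ;;
  esac
}
```

For example, route_failure CrashLoopBackOff OOMKilled prints OOMKilled, while route_failure ErrImagePull prints ImagePullBackOff.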

Recognising each failure (then go to the fix)

Each section below is recognition only — the signature you see in triage. Follow the link for the full diagnosis and fix.

Pending — the scheduler could not place it

You see phase Pending for more than a few seconds, and describe Events carry a scheduler message. Per the docs: "If a Pod is stuck in Pending it means that it can not be scheduled onto a node. Generally this is because there are insufficient resources of one type or another" (Kubernetes: Debug Pods). Typical Events name insufficient CPU/memory, taints the pod does not tolerate, node-affinity that matches nothing, or unbound PVCs. Full triage: Pod Pending.

ContainerCreating — scheduled but cannot finish setup

The pod is bound to a node but STATUS stays ContainerCreating. The docs: "If a Pod is stuck in the Waiting state, then it has been scheduled to a worker node, but it can't run on that machine" (Kubernetes: Debug Pods). Common Events point at volume attach/mount failures, CNI/network plugin errors, or a long image pull. Full triage: ContainerCreating stuck.

ImagePullBackOff / ErrImagePull — the image cannot be pulled

STATUS is ErrImagePull on the first failure and ImagePullBackOff once the kubelet starts backing off retries. The Events line is a "Failed to pull image" message. Causes cluster around a wrong image name or tag, a missing or wrong imagePullSecret, a private registry the node cannot authenticate to, or registry rate limiting. The official first checks are blunt: confirm the image name is correct, confirm it was pushed, and try to pull it manually (Kubernetes: Debug Pods). Full triage: ImagePullBackOff.

CreateContainerConfigError — a referenced ConfigMap or Secret is missing

STATUS is CreateContainerConfigError, and describe names the object — for example, "configmap … not found." This is a configuration reference problem: the container spec points at a ConfigMap or Secret (as an env var or volume) that does not exist or has the wrong key, not a code or image fault. It surfaces before the container ever starts, so there are no application logs to read — the evidence is entirely in describe. Full triage: CreateContainerConfigError.
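Because the failure is a missing reference, one way to confirm it is to ask the API server whether the named object exists. The helper below is a sketch under assumptions: require_object is a hypothetical name, and it checks only that the object exists, not that it contains the key the container spec asks for.

```shell
# Hypothetical check: succeed only if the named ConfigMap or Secret exists.
# Usage: require_object <configmap|secret> <name> [namespace]
require_object() {
  kind="$1"
  name="$2"
  ns="${3:-default}"
  if kubectl get "$kind" "$name" --namespace "$ns" >/dev/null 2>&1; then
    echo "ok: $kind/$name exists in $ns"
  else
    echo "missing: $kind/$name in $ns" >&2
    return 1
  fi
}
```

A call such as require_object configmap app-config prod (placeholder names) returns non-zero when the object is absent, so it can gate a deploy script.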

CrashLoopBackOff — the container starts, then dies, repeatedly

STATUS is CrashLoopBackOff and RESTARTS keeps climbing. This is not an error in itself — it is the kubelet rate-limiting a container that keeps exiting. When a container exits, the kubelet restarts it "with an exponential back-off delay (10s, 20s, 40s, …), that is capped at 300 seconds (5 minutes)"; once a container has run successfully for long enough the back-off timer resets (Kubernetes: Pod Lifecycle, Google Cloud: Troubleshoot CrashLoopBackOff events).
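To make the quoted schedule concrete, the arithmetic can be sketched as a function that doubles from 10 seconds and stops at the 300-second cap. It models only the documented numbers; the name backoff_delay is hypothetical and this is not the kubelet's implementation.

```shell
# Delay in seconds before the Nth restart under the documented schedule:
# 10s, doubling each time, capped at 300s.
backoff_delay() {
  n="$1"      # restart number, starting at 1
  delay=10
  i=1
  while [ "$i" -lt "$n" ] && [ "$delay" -lt 300 ]; do
    delay=$((delay * 2))
    i=$((i + 1))
  done
  if [ "$delay" -gt 300 ]; then
    delay=300
  fi
  echo "$delay"
}
```

Running it for restarts 1 through 7 prints 10, 20, 40, 80, 160, 300, 300, which is why a long-running crash loop settles at one attempt every five minutes.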

To recognise why it loops, read Last State and logs --previous. A non-zero application exit code points at a startup bug, a missing dependency, or a bad config value; exit 137 with Reason: OOMKilled is a memory kill (see below); a failing liveness probe restarts an otherwise healthy process. Full triage: CrashLoopBackOff.

OOMKilled — the kernel killed it for using too much memory

In triage you see Last State with Reason: OOMKilled and exit code 137 (128 + signal 9, SIGKILL). Two distinct mechanisms produce it: a container exceeding its own limits.memory, or the node itself running out of memory, in which case the kernel OOM killer picks a victim using oom_score_adj values that the kubelet sets largely by QoS class, not by who leaked. Kubelet eviction under memory pressure is a related but separate outcome: the pod is marked Failed with reason Evicted rather than OOMKilled. Recognition is the same line; the fix depends on which mechanism you are in. For the full breakdown (confirming the kill, the reactive kernel enforcement, finding the real consumer, and stopping recurrence), use the deep guide Kubernetes OOMKilled: a complete debugging guide, or the quick-fix page OOMKilled.
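The "128 + signal" arithmetic behind exit code 137 applies to other signals as well. The sketch below decodes an exit code along those lines; describe_exit_code is a hypothetical name, and the signal numbers listed are the conventional Linux values.

```shell
# Decode a container exit code: values above 128 conventionally mean
# "terminated by signal (code - 128)".
describe_exit_code() {
  code="$1"
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      9)  name="SIGKILL" ;;
      11) name="SIGSEGV" ;;
      15) name="SIGTERM" ;;
      *)  name="signal $sig" ;;
    esac
    echo "killed by $name (128 + $sig)"
  else
    echo "application exited with code $code"
  fi
}
```

describe_exit_code 137 prints "killed by SIGKILL (128 + 9)". Note that 137 alone only says SIGKILL was delivered; the OOMKilled reason in Last State is what attributes it to memory.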

Node NotReady — suspect the node, not the pod

If many pods on the same node go Unknown or stop reporting at once, the problem is likely the node. A node's Ready condition is "True if the node is healthy and ready to accept pods, False if the node is not healthy … and Unknown if the node controller has not heard from the node in the last node-monitor-grace-period (default is 50 seconds)." When Ready stays Unknown or False past that grace period, the control plane adds a node.kubernetes.io/unreachable or node.kubernetes.io/not-ready taint, and "existing pods scheduled to the node may be evicted due to the application of NoExecute taints" (Kubernetes: Node Status). The tell is the blast radius: a whole node's workloads degrade together. Full triage: Node NotReady.
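A quick way to test the blast-radius theory is to read the node's Ready condition and list everything scheduled on it. The two helpers below are a sketch; their names are hypothetical, while the jsonpath filter and the spec.nodeName field selector are standard kubectl features.

```shell
# Print the status of the node's Ready condition: True, False, or Unknown.
node_ready_status() {
  kubectl get node "$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}'
}

# List every pod bound to the node, across all namespaces.
pods_on_node() {
  kubectl get pods --all-namespaces \
    --field-selector "spec.nodeName=$1" -o wide
}
```

If node_ready_status reports False or Unknown and pods_on_node shows the affected workloads all sitting on that node, investigate the node before touching any application.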

The pod is healthy but traffic does not flow

Everything above is about a pod that will not start. A different class of problem is a pod that is Running and Ready but still unreachable — the failure is in networking, not the pod lifecycle. If kubectl get pods looks clean and the symptom is a connection that hangs, is refused, or a name that will not resolve, route to the connectivity pages instead:

  • Connections refused or no backends — the Service has no ready endpoints (selector mismatch, pods not Ready, or a targetPort mismatch). → Service has no endpoints.
  • Connections hang and time out — traffic is being dropped, often by a NetworkPolicy that isolates the pod without an allow rule. → Traffic blocked by NetworkPolicy.
  • Name resolution fails — DNS is broken before any connection is attempted (CoreDNS, ndots, or a blocked DNS egress policy). → DNS resolution failures.

The first triage question is the same as for the node-vs-pod split: is the pod itself failing, or is it fine and something between the client and the pod is in the way? Answer that before debugging either layer.
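For the "no backends" case, one hedged way to check is to ask whether the Service's Endpoints object lists any ready addresses. The sketch assumes the cluster still serves the core Endpoints API (newer clusters favour EndpointSlice); service_has_endpoints is a hypothetical name.

```shell
# Report whether a Service has any ready endpoint addresses.
# Assumption: the Endpoints object shares the Service's name and namespace.
service_has_endpoints() {
  svc="$1"
  ns="${2:-default}"
  ips="$(kubectl get endpoints "$svc" --namespace "$ns" \
    -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)"
  if [ -n "$ips" ]; then
    echo "ready endpoints: $ips"
  else
    echo "no ready endpoints for $svc in $ns"
    return 1
  fi
}
```

An empty result points at the selector, readiness, or targetPort causes listed above rather than at NetworkPolicy or DNS.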

General prevention and operational principles

  • Set requests and limits deliberately. Requests drive scheduling; limits cap blast radius and shape OOM behavior. Requests larger than any node can satisfy leave the pod Pending; memory limits set too low trigger OOM kills. See the OOMKilled guide for the QoS tradeoffs.
  • Treat probes as load-bearing config. A liveness probe that is too aggressive turns a slow-but-healthy startup into a CrashLoopBackOff. Tune initialDelaySeconds and timeouts to real startup behavior, or add a startupProbe so liveness checks begin only after the application has started.
  • Pin image references and pre-stage imagePullSecrets. Mutable tags and per-namespace secret drift are the two most common ImagePullBackOff sources; both are preventable at deploy time.
  • Validate manifests before they reach the cluster. Running kubectl apply --validate -f mypod.yaml (Kubernetes: Debug Pods) and verifying referenced ConfigMaps/Secrets exist catches CreateContainerConfigError in CI rather than in production.
  • Watch Events, not just dashboards. Most of the answers above appear in the pod's Events stream first. kubectl get events --namespace=<ns> surfaces scheduling and pull failures across a namespace; add --all-namespaces to see the whole cluster (Kubernetes: Debug Running Pods).
  • Separate node failures from pod failures early. Before deep-diving an app, check whether the blast radius is one pod or one node — it changes the entire investigation.
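As one way to act on the Events advice, the listing can be narrowed to warnings and sorted so the newest appear last. This is a sketch: recent_warnings is a hypothetical name, and sorting by lastTimestamp is an assumption that suits most event types.

```shell
# Show Warning events in a namespace, oldest first.
# Usage: recent_warnings [namespace]
recent_warnings() {
  ns="${1:-default}"
  kubectl get events --namespace "$ns" \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}
```

Scheduler failures, image pull errors, and missing ConfigMap references all surface as Warning events, so this view is a reasonable first filter.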

For the full catalogue of per-error fixes, see the Kubernetes troubleshooting index.

Sources

By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.
