Start here: read the status, do not guess
When a pod is not running, the fastest path to a fix is not a hunch about the application. It is a fixed three-command triage that tells you which class of failure you are in, after which you go straight to the right fix. Run these in order:
kubectl get pods # read STATUS + RESTARTS
kubectl describe pod <pod> # read Events + State / Last State
kubectl logs <pod> --previous # read the crashed container's output
STATUS tells you the phase or the waiting reason. describe gives you the
Events stream and the container's Last State (reason + exit code). logs --previous shows what the container printed before it died. Most pod
failures are fully classified by these three outputs. The rest of this guide
is the decision tree for reading them, plus a link to the dedicated fix for
each failure. It does not repeat those fixes — it gets you to the right one.
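As a convenience, the three commands can be wrapped in one shell function. This is a minimal sketch, not part of any official tooling: the function name `triage_pod` and the default namespace `default` are assumptions made for illustration.

```shell
# Run the three triage commands for a single pod, in order.
# Usage: triage_pod <pod> [namespace]   (namespace defaults to "default")
triage_pod() {
  local pod="$1" ns="${2:-default}"
  if [ -z "$pod" ]; then
    echo "usage: triage_pod <pod> [namespace]" >&2
    return 2
  fi
  echo "== STATUS and RESTARTS =="
  kubectl get pod "$pod" -n "$ns"
  echo "== Events and Last State =="
  kubectl describe pod "$pod" -n "$ns"
  echo "== Output of the previous (crashed) container =="
  # --previous fails when the container has never restarted; report that and move on.
  kubectl logs "$pod" -n "$ns" --previous || echo "(no previous container logs)"
}
```

Calling `triage_pod api-1 prod` prints the three outputs under labelled headers so they can be read top-down.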
Why a systematic approach beats guessing
A Kubernetes pod can fail to run for reasons that live in completely different subsystems: the scheduler could not place it, the kubelet could not pull its image, a referenced Secret does not exist, the container crashed on startup, the kernel OOM-killed it, or the node it landed on went unhealthy. These look superficially similar — "my pod isn't up" — but the evidence that confirms each one is different, and so is the fix.
Guessing is expensive because the failure classes are disjoint. Editing
resource limits does nothing for a missing ConfigMap; fixing an image tag
does nothing for a node that stopped reporting. The triage below is built
around one fact: Kubernetes already records which class you are in, in two
places — the pod phase and the container state's Reason — and you
only need to read them in the right order.
The triage framework
Step 1 — kubectl get pods: phase and restarts
A pod has a phase, a high-level summary of where it is in its lifecycle. The official definitions (Kubernetes: Pod Lifecycle) are:
- Pending — "The Pod has been accepted by the Kubernetes cluster, but one or more of the containers has not been set up and made ready to run. This includes time a Pod spends waiting to be scheduled as well as the time spent downloading container images over the network."
- Running — "The Pod has been bound to a node, and all of the containers have been created. At least one container is still running, or is in the process of starting or restarting."
- Succeeded — all containers exited 0 and will not restart.
- Failed — all containers terminated and at least one failed.
- Unknown — "the state of the Pod could not be obtained," typically a communication error with the node.
The STATUS column you see in kubectl get pods is often not the bare phase
but a more specific waiting/terminated reason such as ContainerCreating,
ImagePullBackOff, CrashLoopBackOff, or CreateContainerConfigError. That
string is your first routing key. The RESTARTS column is the second: a high
and climbing restart count means the container starts and dies repeatedly,
which is the signature of a crash loop rather than a placement or config
problem.
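To make the two routing keys concrete, the filter below reads `kubectl get pods --no-headers` output on stdin and prints only the pods worth a closer look. It is a rough sketch: the function name `flag_pods` and the default threshold of 3 restarts are assumptions, and it relies on the default column order (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Print "<name> <status> <restarts>" (tab-separated) for pods whose STATUS is
# not Running/Completed, or whose restart count is at or above a threshold.
# Usage: kubectl get pods --no-headers | flag_pods [min_restarts]
flag_pods() {
  awk -v min="${1:-3}" '
    {
      name = $1; status = $3
      restarts = $4 + 0   # RESTARTS may read "12 (30s ago)"; $4 is still the count
      if ((status != "Running" && status != "Completed") || restarts >= min)
        printf "%s\t%s\t%d\n", name, status, restarts
    }'
}
```

A pod that shows `Running` but has restarted many times is flagged too, since a climbing count is the crash-loop signature even when the current attempt happens to be up.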
Step 2 — kubectl describe pod: Events and Last State
kubectl describe pod <pod> is the highest-value single command. It returns
the container state (Waiting, Running, or Terminated), the restart
count, resource requests and limits, the pod conditions, and — most
importantly — the Events stream and the container's Last State
(Kubernetes: Debug Running Pods).
Two fields decide most cases:
- Events: the scheduler and kubelet narrate here. "0/3 nodes are available: 3 Insufficient cpu" is a scheduling failure; "Failed to pull image" is a registry failure; "configmap … not found" is a config failure. The official guidance for a stuck Pending pod is explicit: "There should be messages from the scheduler about why it can not schedule your pod" (Kubernetes: Debug Pods).
- Last State: for a container that has restarted, this shows the previous termination's Reason and exit code. Reason: OOMKilled with exit code 137 means the kernel killed it; any other non-zero exit code usually means the program itself failed.
If describe is not enough, kubectl get pod <pod> -o yaml returns
"essentially all of the information the system has about the Pod"
(Kubernetes: Debug Running Pods).
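When only the Last State fields are needed, they can be pulled straight from the pod object with a JSONPath query instead of reading the full describe output. The sketch below assumes the standard `status.containerStatuses[].lastState.terminated` fields; the function name `last_state` and the default namespace are illustrative.

```shell
# Print "<container> <reason> <exit code>" (tab-separated) for each container's
# previous termination. Reason and exit code are empty for containers that
# have never terminated.
# Usage: last_state <pod> [namespace]
last_state() {
  kubectl get pod "$1" -n "${2:-default}" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}
```

For a pod whose container was memory-killed, the reason column would read OOMKilled and the exit code 137.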
Step 3 — kubectl logs --previous: what the dead container said
When a container has already crashed and restarted, the live logs are from
the new attempt and may be empty. The --previous flag "retrieves logs
from the prior, terminated container, which is where you can find the
specific stack trace or error message that reveals the cause of the crash"
(Google Cloud: Troubleshoot CrashLoopBackOff events).
For any crash loop, this is the command that contains the actual answer.
The decision: route by phase, then by reason
Read the outputs top-down and branch:
- Phase is Pending → the pod is not scheduled yet. Read Events for the scheduler's reason. → Pod Pending or Node NotReady.
- STATUS is ContainerCreating and stays there → scheduled, but the kubelet cannot finish setup (volume, network, image). → ContainerCreating stuck.
- STATUS is ImagePullBackOff / ErrImagePull → the image cannot be pulled. → ImagePullBackOff.
- STATUS is CreateContainerConfigError → a referenced ConfigMap or Secret is missing or malformed. → CreateContainerConfigError.
- STATUS is CrashLoopBackOff with climbing RESTARTS → the container starts then dies. Check Last State reason/exit code. → CrashLoopBackOff.
- Last State reason is OOMKilled (exit 137) → memory limit or node pressure. → OOMKilled.
- A whole node's pods go Unknown / NotReady at once → suspect the node, not the app. → Node NotReady.
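The branch list above can be written down as a lookup from the STATUS string to the page to read next. This is only a sketch of the routing: the function name `route_status` is hypothetical, and the fallback messages are wording chosen for this example.

```shell
# Map a STATUS string from `kubectl get pods` to the guide page that covers it.
# Usage: route_status <STATUS>
route_status() {
  case "$1" in
    Pending)                       echo "Pod Pending" ;;
    ContainerCreating)             echo "ContainerCreating stuck" ;;
    ErrImagePull|ImagePullBackOff) echo "ImagePullBackOff" ;;
    CreateContainerConfigError)    echo "CreateContainerConfigError" ;;
    CrashLoopBackOff)              echo "CrashLoopBackOff" ;;
    OOMKilled)                     echo "OOMKilled" ;;
    Unknown)                       echo "Node NotReady" ;;
    Running|Completed)             echo "no start-up failure" ;;
    *)                             echo "unclassified: read describe Events" ;;
  esac
}
```

A lookup like this only covers the first routing key. A CrashLoopBackOff result still needs the Last State reason and exit code to tell an application bug from a memory kill.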
Recognising each failure (then go to the fix)
Each section below is recognition only — the signature you see in triage. Follow the link for the full diagnosis and fix.
Pending — the scheduler could not place it
You see phase Pending for more than a few seconds, and describe Events
carry a scheduler message. Per the docs: "If a Pod is stuck in Pending it
means that it can not be scheduled onto a node. Generally this is because
there are insufficient resources of one type or another"
(Kubernetes: Debug Pods).
Typical Events name insufficient CPU/memory, taints the pod does not
tolerate, node-affinity that matches nothing, or unbound PVCs. Full triage:
Pod Pending.
ContainerCreating — scheduled but cannot finish setup
The pod is bound to a node but STATUS stays ContainerCreating. The docs:
"If a Pod is stuck in the Waiting state, then it has been scheduled to a
worker node, but it can't run on that machine"
(Kubernetes: Debug Pods).
Common Events point at volume attach/mount failures, CNI/network plugin
errors, or a long image pull. Full triage:
ContainerCreating stuck.
ImagePullBackOff / ErrImagePull — the image cannot be pulled
STATUS is ErrImagePull on the first failure and ImagePullBackOff once
the kubelet starts backing off retries. The Events line is a "Failed to pull
image" message. Causes cluster around a wrong image name or tag, a missing or
wrong imagePullSecret, a private registry the node cannot authenticate to,
or registry rate limiting. The official first checks are blunt: confirm the
image name is correct, confirm it was pushed, and try to pull it manually
(Kubernetes: Debug Pods).
Full triage: ImagePullBackOff.
CreateContainerConfigError — a referenced ConfigMap or Secret is missing
STATUS is CreateContainerConfigError, and describe names the object —
for example, "configmap … not found." This is a configuration reference
problem: the container spec points at a ConfigMap or Secret (as an env var or
volume) that does not exist or has the wrong key, not a code or image fault.
It surfaces before the container ever starts, so there are no application
logs to read — the evidence is entirely in describe. Full triage:
CreateContainerConfigError.
CrashLoopBackOff — the container starts, then dies, repeatedly
STATUS is CrashLoopBackOff and RESTARTS keeps climbing. This is not an
error in itself — it is the kubelet rate-limiting a container that keeps
exiting. When a container exits, the kubelet restarts it "with an exponential
back-off delay (10s, 20s, 40s, …), that is capped at 300 seconds (5
minutes)"; once a container has run for 10 minutes without problems, the
kubelet resets the back-off timer (Kubernetes: Pod Lifecycle,
Google Cloud: Troubleshoot CrashLoopBackOff events).
To recognise why it loops, read Last State and logs --previous. A
non-zero application exit code points at a startup bug, a missing
dependency, or a bad config value; exit 137 with Reason: OOMKilled is a
memory kill (see below); a failing liveness probe restarts an otherwise
healthy process. Full triage:
CrashLoopBackOff.
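The back-off rule quoted above (start at 10 seconds, double each time, cap at 300) explains why a crash loop appears to slow down after a few restarts. The sketch below just computes that sequence; the function name `backoff_schedule` is illustrative, and the kubelet's real timing may differ slightly from the idealized numbers.

```shell
# Print the first N restart delays, in seconds, for a doubling back-off
# that starts at 10s and is capped at 300s.
# Usage: backoff_schedule [N]   (N defaults to 7)
backoff_schedule() {
  local n="${1:-7}" delay=10 i=0 out=""
  while [ "$i" -lt "$n" ]; do
    out="$out${out:+ }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
  echo "$out"
}

# backoff_schedule 7   prints: 10 20 40 80 160 300 300
```

From the sixth restart on, attempts are five minutes apart, so a fix may take that long to show up in RESTARTS unless the pod is deleted and recreated.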
OOMKilled — the kernel killed it for using too much memory
In triage you see Last State with Reason: OOMKilled and exit code
137 (128 + signal 9, SIGKILL). Two distinct mechanisms produce it: a
container exceeding its own limits.memory, or node-level memory exhaustion,
where the kernel OOM killer picks a victim using an oom_score_adj that the
kubelet derives from the pod's QoS class, not from who leaked. (A kubelet
node-pressure eviction looks different: the pod goes Failed with reason
Evicted rather than OOMKilled.) Recognition is the same line; the fix
depends on which mechanism you are in. For the full breakdown (confirming
the kill, the reactive kernel enforcement, finding the real consumer, and
stopping recurrence), use the deep guide Kubernetes OOMKilled: a complete
debugging guide, or the quick-fix page OOMKilled.
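The arithmetic behind exit code 137 generalizes: by shell convention, an exit code above 128 means the process was ended by signal number (code minus 128). The helper below is a small sketch of that decoding; the name `explain_exit` is hypothetical, the signal numbers are the usual Linux ones, and an application can in principle return such a code itself, so treat the result as a hint.

```shell
# Decode a container exit code into a short human-readable hint.
# Usage: explain_exit <code>
explain_exit() {
  local code="$1" sig name
  if [ "$code" -eq 0 ]; then
    echo "exited cleanly"
  elif [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    case "$sig" in
      6)  name=SIGABRT ;;
      9)  name=SIGKILL ;;
      11) name=SIGSEGV ;;
      15) name=SIGTERM ;;
      *)  name="signal $sig" ;;
    esac
    echo "killed by $name (128 + $sig)"
  else
    echo "application error (exit $code)"
  fi
}

# explain_exit 137   prints: killed by SIGKILL (128 + 9)
```

Exit 137 alone does not prove a memory kill, since anything that sends SIGKILL produces it. The Reason: OOMKilled line is what confirms the cause.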
Node NotReady — suspect the node, not the pod
If many pods on the same node go Unknown or stop reporting at once, the
problem is likely the node. A node's Ready condition is "True if the node
is healthy and ready to accept pods, False if the node is not healthy …
and Unknown if the node controller has not heard from the node in the last
node-monitor-grace-period (default is 50 seconds)." When Ready stays
Unknown or False past that grace period, the control plane adds a
node.kubernetes.io/unreachable or node.kubernetes.io/not-ready taint, and
"existing pods scheduled to the node may be evicted due to the application of
NoExecute taints" (Kubernetes: Node Status).
The tell is the blast radius: a whole node's workloads degrade together.
Full triage: Node NotReady.
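One rough way to measure the blast radius is to count unhealthy pods per node. The sketch below reads `kubectl get pods -A -o wide --no-headers` output on stdin; it assumes the wide layout ends with the NODE, NOMINATED NODE, and READINESS GATES columns, so the node name is the third field from the end. The function name `unhealthy_by_node` is illustrative.

```shell
# Count pods whose STATUS is not Running/Completed, grouped by node,
# highest count first.
# Usage: kubectl get pods -A -o wide --no-headers | unhealthy_by_node
unhealthy_by_node() {
  awk '
    # With -A, STATUS is field 4. RESTARTS can add extra fields such as
    # "(5m ago)", so the node is located from the end of the line instead.
    $4 != "Running" && $4 != "Completed" { count[$(NF-2)]++ }
    END { for (node in count) print node, count[node] }
  ' | sort -k2,2nr -k1,1
}
```

If one node accounts for most of the unhealthy pods, check `kubectl get nodes` before touching any application. Pods that were never scheduled appear under `<none>`.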
The pod is healthy but traffic does not flow
Everything above is about a pod that will not start. A different class of
problem is a pod that is Running and Ready but still unreachable — the
failure is in networking, not the pod lifecycle. If kubectl get pods looks
clean and the symptom is a connection that hangs, is refused, or a name that
will not resolve, route to the connectivity pages instead:
- Connections refused or no backends: the Service has no ready endpoints (selector mismatch, pods not Ready, or a targetPort mismatch). → Service has no endpoints.
- Connections hang and time out: traffic is being dropped, often by a NetworkPolicy that isolates the pod without an allow rule. → Traffic blocked by NetworkPolicy.
- Name resolution fails: DNS is broken before any connection is attempted (CoreDNS, ndots, or a blocked DNS egress policy). → DNS resolution failures.
The first triage question is the same as for the node-vs-pod split: is the pod itself failing, or is it fine and something between the client and the pod is in the way? Answer that before debugging either layer.
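For the first of those connectivity cases, a quick check is whether the Service has any ready endpoint addresses at all. This sketch queries the Endpoints object, which newer clusters supplement with EndpointSlice objects; the function name `service_has_endpoints` and the default namespace are assumptions.

```shell
# Succeed (exit 0) if the Service has at least one ready endpoint address.
# Usage: service_has_endpoints <service> [namespace]
service_has_endpoints() {
  local ips
  # .subsets[].addresses holds ready addresses only; not-ready pods are
  # listed under notReadyAddresses and are deliberately ignored here.
  ips=$(kubectl get endpoints "$1" -n "${2:-default}" \
        -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)
  [ -n "$ips" ]
}
```

A typical use is `service_has_endpoints web prod || echo "no ready endpoints"`. An empty result points at the selector, pod readiness, or targetPort rather than at the network path.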
General prevention and operational principles
- Set requests and limits deliberately. Requests drive scheduling, so right-sized requests avoid Pending; limits cap blast radius and shape OOM behavior. Setting requests too high leaves the scheduler with no node that fits; setting limits too low triggers OOM kills. See the OOMKilled guide for the QoS tradeoffs.
- Treat probes as load-bearing config. A liveness probe that is too aggressive turns a slow-but-healthy startup into a CrashLoopBackOff. Tune initialDelaySeconds and timeouts to real startup behavior.
- Pin image references and pre-stage imagePullSecrets. Mutable tags and per-namespace secret drift are two common ImagePullBackOff sources; both are preventable at deploy time.
- Validate manifests before they reach the cluster. Running kubectl apply --validate -f mypod.yaml (Kubernetes: Debug Pods) and verifying that referenced ConfigMaps/Secrets exist catch CreateContainerConfigError in CI rather than in production.
- Watch Events, not just dashboards. Most of the answers above appear in the pod's Events stream first. kubectl get events --namespace=<ns> surfaces scheduling and pull failures across a namespace; add --all-namespaces to see the whole cluster (Kubernetes: Debug Running Pods).
- Separate node failures from pod failures early. Before deep-diving an app, check whether the blast radius is one pod or one node, because that changes the entire investigation.
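The "verify referenced ConfigMaps exist" step can be a small pre-deploy gate. The sketch below takes the names as arguments and does not parse the manifest, so the list of names is an input you supply; the function name `missing_configmaps` is hypothetical, and the same pattern would apply to Secrets with `kubectl get secret`.

```shell
# Print every named ConfigMap that does not exist in the namespace.
# Returns non-zero if any are missing, so it can fail a CI step.
# Usage: missing_configmaps <namespace> <name>...
missing_configmaps() {
  local ns="$1" name missing=0
  shift
  for name in "$@"; do
    if ! kubectl get configmap "$name" -n "$ns" >/dev/null 2>&1; then
      echo "missing configmap: $ns/$name"
      missing=1
    fi
  done
  return "$missing"
}
```

This checks existence only. A ConfigMap that exists but lacks the key a container references would still fail at start-up, so key checks need a separate step.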
For the full catalogue of per-error fixes, see the Kubernetes troubleshooting index.
Sources
- Kubernetes: Pod Lifecycle. Pod phases, container states, restart back-off (10s → 300s cap).
- Kubernetes: Debug Pods. Pending vs Waiting diagnosis, image-pull checks, kubectl apply --validate.
- Kubernetes: Debug Running Pods. kubectl describe pod, -o yaml, Events, Last State.
- Kubernetes: Node Status. Ready condition (True/False/Unknown), 50s node-monitor-grace-period, NoExecute taints.
- Google Cloud: Troubleshoot CrashLoopBackOff events. Back-off behavior, recognizing via STATUS/RESTARTS, kubectl logs --previous.
By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.