What Node NotReady means
A node is NotReady when its kubelet stops posting a healthy Ready status to
the control plane. The kubelet keeps a heartbeat alive two ways: it updates the
Node .status (default every 5 minutes) and renews a Lease object in the
kube-node-lease namespace (default every 10 seconds). If those stop or the
status turns unhealthy, the node controller stops scheduling to the node and,
after a timeout, taints it NoExecute so its pods are evicted and rescheduled —
turning one bad node into a cluster-wide capacity problem. See
node status and conditions.
First, classify it: False vs Unknown
This single distinction decides where you look. The Ready condition is either
False (kubelet is talking and says it is unhealthy) or Unknown (the control
plane has not heard from the node within node-monitor-grace-period, default
50s):
kubectl get node <node> -o jsonpath='{.status.conditions[?(@.type=="Ready")]}'
- Ready=False → the kubelet is reaching the API server. Look on the node: runtime, resource pressure, kubelet config. The control plane adds the node.kubernetes.io/not-ready:NoExecute taint.
- Ready=Unknown → the node is unreachable (network, crashed kernel, dead kubelet). Look at connectivity and the node lease. The control plane adds the node.kubernetes.io/unreachable:NoExecute taint.
Both the taint timing and the 50s default come from the kube-controller-manager's NodeMonitorGracePeriod setting (the --node-monitor-grace-period flag); the node controller applies the taints.
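The False-versus-Unknown split above can be wrapped in a small helper. This is a minimal sketch, not an official tool: classify_ready is a hypothetical function name, and the hints it prints only restate the classification above.

```shell
# classify_ready STATUS
# STATUS is the Ready condition's status string: True, False, or Unknown.
# Hypothetical helper; prints where to look first for each case.
classify_ready() {
  case "$1" in
    True)    echo "ready: nothing to do" ;;
    False)   echo "kubelet reporting unhealthy: check runtime, resource pressure, kubelet config" ;;
    Unknown) echo "node unreachable: check connectivity and the node lease" ;;
    *)       echo "unrecognized status: $1" >&2; return 1 ;;
  esac
}

# Usage against a live cluster (replace <node>):
#   classify_ready "$(kubectl get node <node> \
#     -o jsonpath='{.status.conditions[?(@.type=="Ready")].status}')"
```

Keeping the kubectl call outside the function means the classification can be checked without a cluster.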
Diagnose it
kubectl get nodes
kubectl describe node <node>
# Conditions: Ready, plus MemoryPressure / DiskPressure / PIDPressure
# Look at LastHeartbeatTime and the Taints line
kubectl get lease <node> -n kube-node-lease # renewTime stale => unreachable
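To put a number on "stale", the lease's renewTime can be converted into an age in seconds. The sketch below assumes GNU date (the -d option) and a hypothetical helper name, lease_age_seconds; an age far above the 10-second renew interval suggests the kubelet has stopped renewing.

```shell
# lease_age_seconds RENEW_TIME [NOW_EPOCH]
# RENEW_TIME is the lease's spec.renewTime, e.g. 2024-05-01T12:00:00.000000Z.
# NOW_EPOCH defaults to the current time. Assumes GNU date (-d).
lease_age_seconds() {
  ts=${1%%.*}   # drop fractional seconds, if present
  ts=${ts%Z}    # drop a trailing Z so it can be re-added uniformly
  renew=$(date -u -d "${ts}Z" +%s) || return 1
  now=${2:-$(date -u +%s)}
  echo $(( now - renew ))
}

# Usage against a live cluster (replace <node>):
#   lease_age_seconds "$(kubectl get lease <node> -n kube-node-lease \
#     -o jsonpath='{.spec.renewTime}')"
```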
On the node itself (if reachable):
systemctl status kubelet
journalctl -u kubelet --no-pager | tail -80
systemctl status containerd # or crio / docker
df -h ; df -i # disk AND inode usage
Causes, each end to end
Kubelet or container runtime down (Ready=False)
The kubelet crashed or hung, or the CRI runtime (containerd / CRI-O) is down, so the kubelet cannot manage pods and reports unhealthy.
- Diagnose: systemctl status kubelet and systemctl status containerd. A missing runtime socket (ls -la /run/containerd/containerd.sock) or a crash-looping kubelet in journalctl -u kubelet confirms it.
- Fix: restart the failed service (systemctl restart containerd && systemctl restart kubelet) and read the logs for the crash reason: a bad config flag, an OOM-killed kubelet, or a corrupt runtime state directory. The kubelet re-registers and the node returns to Ready.
MemoryPressure (Ready=False)
Available memory dropped below the kubelet's memory.available eviction
threshold (hard default <100Mi), so the kubelet sets MemoryPressure=True and
starts evicting pods.
- Diagnose: kubectl describe node shows MemoryPressure True; events show Evicted pods. The kubelet evicts by QoS class: BestEffort first, then Burstable over their requests, Guaranteed last.
- Fix: find and cap the offending workload (set memory limits), or add node capacity. For critical pods use Guaranteed QoS (request == limit) so they are evicted last. Thresholds are set with --eviction-hard. See node-pressure eviction.
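To see which pods on a node sit first in line, their QoS classes can be listed in the rough eviction order described above. This is a sketch only: eviction_order is a hypothetical helper, and it ranks by QoS class alone, ignoring actual memory usage.

```shell
# eviction_order
# Reads "<pod> <qosClass>" lines on stdin and prints them in rough eviction
# order: BestEffort, then Burstable, then Guaranteed. Unknown classes go last.
eviction_order() {
  awk '
    BEGIN { rank["BestEffort"] = 1; rank["Burstable"] = 2; rank["Guaranteed"] = 3 }
    { r = ($2 in rank) ? rank[$2] : 9; print r, $1, $2 }
  ' | sort -s -k1,1n | cut -d" " -f2-
}

# Usage against a live cluster (replace <node>):
#   kubectl get pods -A --field-selector spec.nodeName=<node> --no-headers \
#     -o custom-columns=POD:.metadata.name,QOS:.status.qosClass | eviction_order
```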
DiskPressure — bytes or inodes (Ready=False)
DiskPressure=True fires when nodefs.available (default under 10%),
nodefs.inodesFree (under 5%), or imagefs.available (under 15%) breach their hard
thresholds. Inode exhaustion is the trap: df -h can show free space while
df -i is at 100%.
- Diagnose: df -h and df -i on the node; DiskPressure True in kubectl describe node. Check container logs, dead containers, and unused images filling nodefs / imagefs.
- Fix: free space or grow the disk. The kubelet first reclaims by pruning unused images and dead containers; if a runaway log or emptyDir is the cause, cap it. Per-threshold defaults are in node-pressure eviction.
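Because bytes and inodes have to be checked separately, one filter can be run over both df layouts. The helper below is a minimal sketch with a hypothetical name, over_threshold. It treats df's use percentage as an approximation of the kubelet's own accounting, so values near a threshold may not match exactly.

```shell
# over_threshold MAX_USE_PERCENT
# Reads `df -P` or `df -Pi` output on stdin and prints "<mount> <use%>" for
# every filesystem at or above MAX_USE_PERCENT. Column 5 is Capacity (bytes)
# or IUse% (inodes) in both layouts. Mount points containing spaces are not handled.
over_threshold() {
  awk -v max="$1" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= max) print $6, $5 }'
}

# On the node:
#   df -P  | over_threshold 90   # bytes: roughly nodefs.available under 10%
#   df -Pi | over_threshold 95   # inodes: roughly nodefs.inodesFree under 5%
```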
PIDPressure (Ready=False)
pid.available fell below the threshold (hard default <4%) — a process or
fork-bomb workload exhausted the node's PIDs, so the kubelet sets
PIDPressure=True and cannot start new pods.
- Diagnose: kubectl describe node shows PIDPressure True; on the node, compare ps -eLf | wc -l against cat /proc/sys/kernel/pid_max.
- Fix: kill or cap the offending workload and set pod/PID limits. See PIDPressure and node-pressure eviction.
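The two numbers from the diagnose step can be turned into a percentage to compare against the threshold. This is a rough sketch with a hypothetical helper name, pid_available_percent; it uses integer arithmetic and approximates, rather than reproduces, the kubelet's own pid.available calculation.

```shell
# pid_available_percent USED MAX
# Prints the share of PIDs still free, as a whole percentage (rounded down).
pid_available_percent() {
  used=$1
  max=$2
  [ "$max" -gt 0 ] || return 1
  echo $(( (max - used) * 100 / max ))
}

# On the node (ps prints one header line, so the count is high by one):
#   pid_available_percent "$(ps -eLf | wc -l)" "$(cat /proc/sys/kernel/pid_max)"
```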
Network partition — node unreachable (Ready=Unknown)
The node cannot reach the API server, so its lease in kube-node-lease goes
stale and the control plane flips Ready to Unknown after
node-monitor-grace-period. The CNI plugin itself can also report
NetworkUnavailable=True.
- Diagnose: kubectl get lease <node> -n kube-node-lease shows a stale renewTime; NetworkUnavailable may be True. From the node, test reachability to the API server endpoint. Suspect a recent security-group / firewall / route change or a CNI (Calico, Cilium, flannel) failure.
- Fix: restore connectivity (security group, route, VPN, CNI pod). Once the lease renews, Ready returns to True. See node heartbeats.
Expired kubelet certificate or clock skew (Ready=False or Unknown)
An expired kubelet client cert makes the API server reject the heartbeat; large clock skew breaks TLS validity windows. Both stop status updates.
- Diagnose: journalctl -u kubelet shows x509: certificate has expired or Unauthorized. Check skew with timedatectl / chronyc tracking.
- Fix: rotate the kubelet cert (kubeadm renews /var/lib/kubelet/pki, or re-run TLS bootstrap), restart the kubelet, and fix NTP so the clock stays in sync.
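Checking the certificate's expiry directly can confirm the diagnosis before the log message appears. The sketch below assumes GNU date and openssl on the node. days_until_expiry is a hypothetical helper, and the kubelet-client-current.pem file name is the usual kubeadm layout, so treat it as an assumption and adjust if yours differs.

```shell
# days_until_expiry NOT_AFTER [NOW_EPOCH]
# NOT_AFTER is the date part of `openssl x509 -enddate -noout` output,
# e.g. "May  1 12:00:00 2025 GMT". Prints whole days remaining (truncated
# toward zero; a negative number means the certificate has already expired).
days_until_expiry() {
  end=$(date -u -d "$1" +%s) || return 1
  now=${2:-$(date -u +%s)}
  echo $(( (end - now) / 86400 ))
}

# On the node (file name assumed; see the note above):
#   not_after=$(openssl x509 -enddate -noout \
#     -in /var/lib/kubelet/pki/kubelet-client-current.pem | cut -d= -f2)
#   days_until_expiry "$not_after"
```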
What happens to the pods
When Ready stays False/Unknown past node-monitor-grace-period, the node
controller adds the not-ready or unreachable NoExecute taint. Each pod is
evicted after its tolerationSeconds — Kubernetes injects a default of 300s
(5 minutes) for both taints unless the pod sets its own. Set a shorter value on
latency-sensitive workloads, or a longer one to ride out brief node restarts. See
taint-based eviction.
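One way to set a shorter tolerationSeconds on an existing workload is a JSON merge patch. The helper below is a sketch under assumptions: toleration_patch is a hypothetical name, the target is assumed to be a Deployment-style object with a pod template, and a merge patch replaces the whole tolerations list, so add any other tolerations the workload needs.

```shell
# toleration_patch SECONDS
# Prints a JSON merge patch that sets tolerationSeconds for both NoExecute
# node taints on a pod template. Replaces the entire tolerations list.
toleration_patch() {
  secs=$1
  printf '{"spec":{"template":{"spec":{"tolerations":['
  printf '{"key":"node.kubernetes.io/not-ready","operator":"Exists","effect":"NoExecute","tolerationSeconds":%d},' "$secs"
  printf '{"key":"node.kubernetes.io/unreachable","operator":"Exists","effect":"NoExecute","tolerationSeconds":%d}' "$secs"
  printf ']}}}}\n'
}

# Usage (deployment name is a placeholder):
#   kubectl patch deployment <name> --type merge -p "$(toleration_patch 30)"
```

With the default 50s grace period, a 30s toleration means pods are evicted roughly 80 seconds after the last heartbeat instead of roughly 350.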
If a node will not recover, cordon and drain it, then replace it:
kubectl cordon <node>
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
How Intellira diagnoses this
Intellira reads node conditions, the node lease, and recent events read-only,
then correlates the NotReady transition with what changed — a node-pool image
update, a CNI change, or a workload that exhausted disk or PIDs. It classifies
the node as pressure (Ready=False) versus unreachable (Ready=Unknown) up
front and points at the likely trigger with evidence, instead of leaving you to
SSH around.
Sources
- Node status and conditions
- Nodes — heartbeats and the node controller
- Node-pressure eviction
- Taints and tolerations — taint-based eviction
By Intellira Engineering. AI-assisted draft, reviewed by the Intellira engineering team; claims cited inline; last verified 2026-06-02.