When troubleshooting K8s issues, there are three core commands:

- `kubectl describe pod/node <name>`: check the resource's Events to identify the root cause.
- `kubectl logs <pod-name>`: check application logs to resolve program issues.
- `kubectl get <resource-type>`: check the status of resources.
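As a minimal sketch, the three commands chain into a quick triage loop. The function name `triage_pod` and the example pod name `web-0` are hypothetical; substitute your own.

```shell
#!/bin/sh
# Quick pod triage using the three core commands.
triage_pod() {
  pod="$1"
  # 1. Status at a glance: Pending, CrashLoopBackOff, Running 0/1, ...?
  kubectl get pod "$pod" -o wide
  # 2. Events usually name the root cause (scheduling, image pull, probes).
  kubectl describe pod "$pod" | sed -n '/^Events:/,$p'
  # 3. Application logs for in-container failures.
  kubectl logs "$pod" --tail=50
}
# Usage: triage_pod web-0
```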
## Layer 1: Pod Status Codes

| Status | Core Reason | Core Troubleshooting Steps |
|---|---|---|
| `Pending` | Cannot be scheduled: the scheduler cannot find a suitable node. | 1. `kubectl describe pod <name>`, check Events for the specific reason: insufficient CPU/memory; taint/toleration mismatch; affinity/anti-affinity rule mismatch; PVC not bound. |
| `ImagePullBackOff` / `ErrImagePull` | Image pull failed: the kubelet cannot pull the container image from the registry. | 1. `kubectl describe pod <name>`, check Events for the specific reason: incorrect image name or tag (check the YAML); private registry authentication failed (check `imagePullSecrets`); network issue (log in to the node and test with `docker`/`crictl pull`). |
| `CrashLoopBackOff` | Container is crashing repeatedly: the container exits right after starting, and the kubelet keeps restarting it. | 1. `kubectl logs <pod-name> --previous` (logs of the previous crash, extremely important). 2. `kubectl logs <pod-name>` (current logs). 3. Investigate application bugs, configuration errors, or out-of-memory issues based on the logs. |
| `RunContainerError` | Container runtime error: the configuration is correct, but the underlying container runtime (e.g., containerd) cannot start the container. | 1. `kubectl describe pod <name>`; Events will show `RunContainerError`. 2. SSH into the node and use `journalctl -u containerd` (or `docker`) to check the runtime logs for lower-level error messages. |
| `CreateContainerConfigError` | Container configuration error: a resource needed to create the container (e.g., a ConfigMap or Secret) has a problem. | 1. `kubectl describe pod <name>`; Events will clearly state which resource is missing or malformed. |
| `Running` (but Ready is `0/1`) | Readiness probe failed: the Pod is running but not ready to receive traffic. | 1. `kubectl describe pod <name>`; Events will record `Readiness probe failed`. 2. Check the `readinessProbe` configuration (initial delay, timeout), or see whether a downstream service the application depends on is failing. |
| `Terminating` (stuck) | Pod cannot terminate properly: usually a finalizer is blocking deletion, or a volume cannot be unmounted. | 1. `kubectl describe pod <name>`, check Events for storage errors like `FailedDetachVolume`. 2. `kubectl edit pod <name>`, check the `metadata.finalizers` field; a finalizer added by a controller may not have been cleaned up. |
| `Unknown` | Status is unknown: typically the node controller cannot reach the kubelet on the Pod's node. | 1. Almost equivalent to the node being `NotReady`. Immediately check the health of the Pod's host node (see Layer 4). |
| Job `Failed`: `BackoffLimitExceeded` | Job retry limit exceeded: the Pods created by the Job kept failing, and after reaching the retry limit the Job is marked as failed. | 1. `kubectl get pods -l job-name=<job-name>` to find the failed Pods created by the Job. 2. `kubectl logs <failed-pod-name>` to view the logs and identify the root cause of the task's failure. |
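For the most common case, `CrashLoopBackOff`, the steps above can be sketched as one helper. The function name `crashloop_info` is hypothetical; the jsonpath fields are standard pod-status fields.

```shell
#!/bin/sh
# Sketch of a CrashLoopBackOff investigation.
crashloop_info() {
  pod="$1"
  # Restart count and the previous container's exit code (see Layer 2):
  kubectl get pod "$pod" -o jsonpath='{.status.containerStatuses[0].restartCount} {.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
  # Logs from the crashed (previous) container instance:
  kubectl logs "$pod" --previous --tail=100
}
# Usage: crashloop_info <pod-name>
```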
## Layer 2: Container Exit Codes

| Exit Code | Meaning | Core Troubleshooting Steps |
|---|---|---|
| 1 | General application error. | 1. Check application logs: `kubectl logs <pod-name> --previous`. |
| 126 / 127 | Command not executable / command not found. | 1. Check the Dockerfile (`chmod +x`) and the command path in your YAML. |
| 137 | `OOMKilled` (out of memory). | 1. `kubectl describe pod <name>` to confirm `Reason: OOMKilled`. 2. Increase `resources.limits.memory`. |
| 139 | Segmentation fault (SIGSEGV): a code bug. | 1. Notify the developers to debug the code. |
| 143 | Graceful termination (SIGTERM): normal behavior. | 1. Occurs during Pod deletion or updates; no action needed. |
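Codes above 128 follow the Unix convention *exit code = 128 + signal number*, which explains 137 (128 + 9, SIGKILL, what the OOM killer sends) and 143 (128 + 15, SIGTERM). You can verify this locally without a cluster:

```shell
#!/bin/sh
# Exit code = 128 + signal number when a process is killed by a signal.
sh -c 'kill -KILL $$'        # SIGKILL = 9, what OOMKilled delivers
echo "after SIGKILL: $?"     # prints 137 (128 + 9)
sh -c 'kill -TERM $$'        # SIGTERM = 15, graceful termination
echo "after SIGTERM: $?"     # prints 143 (128 + 15)
```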
## Layer 3: Network Status Codes and Errors

| Error/Status | Core Reason | Core Troubleshooting Steps |
|---|---|---|
| Endpoints are empty | The Service selector does not match any Pods. | 1. `kubectl describe svc <name>` to check the `Selector`. 2. `kubectl get pods --show-labels` to compare with the Pods' labels. |
| HTTP 502/503/504 | Ingress gateway error / service unavailable / timeout. | 1. Check Endpoints and Pod health comprehensively (`CrashLoopBackOff`, `0/1 Ready`). 2. For 504: check Pod logs and resource usage (`kubectl top pod`) to determine whether the application is slow to respond. |
| HTTP 499 | Client closed request (a non-standard Nginx status code). In short, the backend took too long to respond and the client gave up. | 1. Check backend response time: use `kubectl logs <ingress-controller-pod>` to identify which endpoint (URL) frequently returns 499, and confirm whether its `request_time` is too long. 2. Check client timeout settings: confirm whether the caller (browser, app, or another microservice) has a very short request timeout. 3. Investigate application performance bottlenecks: look for slow database queries or slow calls to third-party services. |
| Connection refused | The network path is clear, but no process is listening on the target Pod's port. | 1. `kubectl exec -it <pod-name> -- netstat -tulnp` to confirm the application is listening on the correct port. 2. Check the application's startup logs for port-binding errors. |
| Connection timed out | Packets are being dropped in the network, usually due to a NetworkPolicy or firewall issue. | 1. Check NetworkPolicies: `kubectl get networkpolicy -A` to confirm whether a policy is blocking this traffic. 2. Check node security groups or the underlying network firewall. |
| No route to host | Typically an issue with the inter-node network (CNI). | 1. Check that the CNI plugin's Pods (`calico-node`, `flannel-ds`, etc.) are running correctly on all nodes. |
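The "empty Endpoints" check above can be sketched as one helper that puts the selector, the current endpoints, and the pod labels side by side. The function name `endpoints_check` and the default namespace argument are illustrative choices.

```shell
#!/bin/sh
# Sketch: does a Service's selector actually match any Pods?
endpoints_check() {
  svc="$1"; ns="${2:-default}"
  # The selector the Service uses:
  kubectl get svc "$svc" -n "$ns" -o jsonpath='{.spec.selector}{"\n"}'
  # Pod IPs (if any) currently backing it; empty means no match:
  kubectl get endpoints "$svc" -n "$ns" -o jsonpath='{.subsets[*].addresses[*].ip}{"\n"}'
  # Compare against the labels the Pods actually carry:
  kubectl get pods -n "$ns" --show-labels
}
# Usage: endpoints_check <service-name> [namespace]
```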
## Layer 4: Node Status Codes

| Status | Core Reason | Core Troubleshooting Steps |
|---|---|---|
| `NotReady` | Node lost contact: communication between the kubelet and the API server is interrupted. | 1. SSH into the node and check `kubelet`, `containerd`, `df -h`, and `free -m`, in that order. |
| `SchedulingDisabled` | Scheduling is disabled: the node has been cordoned, so no new Pods will be scheduled on it. | 1. This is an administrative action, not a failure. Use `kubectl uncordon <node-name>` to resume scheduling. |
| `MemoryPressure` | Available memory on the node is too low. | 1. The node may start evicting Pods. Log in to the node and use `top` to find the memory hogs. |
| `DiskPressure` | Disk space on the node is insufficient. | 1. Log in to the node, use `df -h` to locate the full partition, and clean up images, containers, and logs. |
| `PIDPressure` | The node is running out of process IDs. | 1. Log in to the node and check for fork bombs or applications creating too many threads/processes. |
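The node-side checks above can be collected into one sketch to run after SSHing into a suspect node. The function name `node_triage` is hypothetical; each command maps to one of the pressure conditions in the table.

```shell
#!/bin/sh
# Node-side triage sketch: run on the node itself, not through kubectl.
node_triage() {
  systemctl status kubelet --no-pager     # NotReady: is the kubelet running?
  journalctl -u kubelet --no-pager -n 50  # recent kubelet errors
  df -h                                   # DiskPressure: any full partitions?
  free -m                                 # MemoryPressure: available memory
  cat /proc/sys/kernel/pid_max            # PIDPressure: the PID ceiling
}
# Usage: node_triage
```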
## Layer 5: Storage Status Codes

| Status / Event | Core Reason | Core Troubleshooting Steps |
|---|---|---|
| PVC: `Pending` | The PVC cannot bind to a PV. | 1. `kubectl describe pvc <name>`, check Events to see whether it is a PV mismatch or a StorageClass issue. |
| Pod Event: `FailedMount` | Volume mount failed. | 1. `kubectl describe pod <name>`; Events will give detailed reasons, such as NFS permissions or cloud-disk status. |
| Pod Event: `FailedDetachVolume` | Volume detach failed: usually the underlying storage (e.g., a cloud disk) is busy or faulty. | 1. This will leave the Pod stuck in the `Terminating` state. 2. Check the CSI plugin logs or the cloud provider's console for the volume's status. |
| App log: `Read-only file system` | The file system is read-only: the Pod hits an error when writing to a PV. | 1. `kubectl exec -it <pod-name> -- mount` to view mount information and confirm whether the mount option is `ro` (read-only). 2. The storage backend itself may have failed and entered a read-only protective mode. |
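The PVC-side checks above can be sketched as one helper. The function name `pvc_triage` is hypothetical; pass your own PVC name.

```shell
#!/bin/sh
# Storage triage sketch for a Pending PVC.
pvc_triage() {
  pvc="$1"
  # Events explain the bind failure (PV mismatch, StorageClass problem):
  kubectl describe pvc "$pvc" | sed -n '/^Events:/,$p'
  # Is there a matching PV, and what phase (Available/Bound/Released) is it in?
  kubectl get pv
  # Which StorageClasses exist to provision it dynamically?
  kubectl get storageclass
}
# Usage: pvc_triage <pvc-name>
```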