KAI Road of Kubernetes 04 — What Probes are, and why Running is not Ready

The previous chapter summed up Service in one line:

Pods change, so the entry point should not drift.

But a stable entry point does not mean every Pod behind it can actually serve traffic.

This is one of the first traps in Kubernetes:

The Service looks normal. DNS works. But requests still time out, return 502, or fail randomly.

The sentence I would keep is this:

Running only means the process is up. Ready means it can safely receive traffic. Probes are the status signals Kubernetes acts on.

Figure: a Service sends traffic only to a Ready Pod, a Running but Not Ready Pod is blocked, and a stuck Pod is restarted by the liveness probe. A Service gives the entry point, but probes decide which Pods can stand behind it. Running means started. Ready means traffic-safe.

Do not treat Running as Ready

When a Pod reaches Running, it means the container process has started.

Your application may still be doing work:

connecting to a database
loading configuration
warming caches
waiting for a dependency to become usable

So Running does not mean “stable enough to receive user traffic.”

If the Service starts routing traffic too early, users do not experience “fast startup.” They experience unstable responses.

Think of probes like opening a restaurant

I remember this chapter with a restaurant example.

The lights are on, but that does not mean the restaurant is ready for customers.

The kitchen may still be preparing ingredients. The POS terminal may not be connected. The staff may still be doing the morning setup.

If the door says “open” too early, customers walk in and cannot actually order anything.

Probes answer three different questions for Kubernetes:

startupProbe: am I still starting normally, so do not judge me too early?
readinessProbe: can I receive orders now, or should traffic stay away?
livenessProbe: am I stuck badly enough that I should be restarted?

The three probe roles should not be mixed

I compress the split like this:

readiness controls traffic, liveness controls restarts, and startup protects slow boot.

Figure: startup, readiness, and liveness probes as separate roles around one Pod. Keep the three roles separate: startup protects slow boot, readiness controls traffic admission, and liveness handles truly stuck containers.

Those probes can check the same endpoint, or they can check separate endpoints.

The important part is not whether the endpoint string is identical.

The important part is that the meaning does not fight itself.

If readiness and liveness are treated as the same signal, the system can turn “temporarily not ready” into “restart this container.” That is how a small delay becomes an incident.
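To make the split concrete, here is a minimal Python sketch of an app that answers the two questions separately. The `AppState` flags, the function names, and the exact status codes are my own illustration, not something Kubernetes prescribes. The only rule it leans on is that an HTTP probe counts any status from 200 up to (but not including) 400 as success.

```python
from dataclasses import dataclass
from http.server import BaseHTTPRequestHandler, HTTPServer


@dataclass
class AppState:
    """Hypothetical in-process flags that the application keeps up to date."""
    deadlocked: bool = False       # set by a watchdog when the main loop stops making progress
    dependencies_ok: bool = False  # set once the database and configuration are usable


def liveness_status(state: AppState) -> int:
    # Liveness only asks: "is this process beyond recovery?"
    # A failing dependency must not fail this check, or a short outage becomes a restart.
    return 500 if state.deadlocked else 200


def readiness_status(state: AppState) -> int:
    # Readiness asks: "should traffic reach me right now?"
    return 200 if (state.dependencies_ok and not state.deadlocked) else 503


STATE = AppState()


class ProbeHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            code = liveness_status(STATE)
        elif self.path == "/ready":
            code = readiness_status(STATE)
        else:
            code = 404
        self.send_response(code)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Blocks forever; call this from the application's entry point."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

With this shape, a database outage flips `/ready` to 503 while `/healthz` keeps returning 200, so the Pod leaves the traffic pool without being restarted.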

What Kubernetes actually does

Here is the technically precise version without making it too heavy:

The kubelet runs probes periodically, using mechanisms such as HTTP, TCP, exec, or gRPC.
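When readinessProbe fails failureThreshold times in a row, the container, and therefore the Pod, becomes not ready. The Pod's endpoint in the EndpointSlices of matching Services is marked not ready, so it drops out of the ready backend set.
When livenessProbe fails failureThreshold times in a row, the kubelet treats the container as unhealthy and restarts that container, subject to the Pod's restartPolicy.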
If startupProbe is configured, liveness and readiness probes do not run until the startup probe succeeds.

These are defined Kubernetes behaviors, not just community convention.

Reference: https://kubernetes.io/docs/concepts/workloads/pods/probes/
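The threshold rule is easier to see as a tiny state machine. This is a simplified model I wrote for illustration, not kubelet code: `ProbeTracker` and its fields are hypothetical names, and the model ignores timeouts, initial delays, and the restart itself. It only shows that failures have to be consecutive before the state flips.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    """Toy model of how consecutive probe results flip a healthy/unhealthy state."""
    failure_threshold: int = 3
    success_threshold: int = 1
    healthy: bool = True
    streak: int = 0  # length of the current run of results that disagree with `healthy`

    def observe(self, ok: bool) -> bool:
        if ok == self.healthy:
            # A result that agrees with the current state resets the run.
            self.streak = 0
        else:
            self.streak += 1
            needed = self.success_threshold if ok else self.failure_threshold
            if self.streak >= needed:
                self.healthy = ok
                self.streak = 0
        return self.healthy


tracker = ProbeTracker(failure_threshold=3)
results = [True, False, False, True, False, False, False]
states = [tracker.observe(r) for r in results]
print(states)  # [True, True, True, True, True, True, False]
```

Two failures followed by a success do not change anything. Only the third failure in a row flips the state, which is the point where a readiness probe would take the Pod out of rotation or a liveness probe would trigger a restart.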

A simplified configuration example

apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  containers:
    - name: web
      image: ghcr.io/example/web:1.0.0
      ports:
        - containerPort: 8080
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30   # 30 checks x 10s period = up to 300s to finish starting
      readinessProbe:
        httpGet:
          path: /ready
          port: 8080
        periodSeconds: 5
        timeoutSeconds: 2
        failureThreshold: 3
      livenessProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        timeoutSeconds: 2
        failureThreshold: 3

The exact numbers are not the main point.

The responsibility split is the point:

startupProbe gives slow startup a protected window
readinessProbe controls whether the Pod joins the Service traffic pool
livenessProbe handles truly stuck states

Common beginner mistakes

1. Treating Running as traffic-ready

This is the most common misreading.

Without a readinessProbe, Kubernetes treats a container as ready as soon as it starts, so a Service having endpoints does not guarantee that those endpoints can actually serve traffic yet.

2. Using liveness to solve a readiness problem

If a dependency fails briefly and liveness kills the container, the restart adds a cold start on top of the original blip, and the workload can end up less stable than before.

Not ready should often mean “stop sending traffic for now,” not “restart immediately.”

3. Making readiness too heavy

If the probe endpoint itself is expensive, the health check can become part of the load problem.

The probe should tell Kubernetes whether the app can serve. It should not become a mini production request.
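One way to keep the probe cheap is to cache the result of the expensive dependency check for a few seconds, so the probe handler usually just returns a stored boolean. The sketch below is only one possible approach; `CachedCheck` and `database_reachable` are hypothetical names, and the 5 second TTL is an arbitrary choice.

```python
import time
from typing import Callable


class CachedCheck:
    """Wraps an expensive check and re-runs it at most once per TTL window."""

    def __init__(self, check: Callable[[], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._value = False
        self._expires_at = float("-inf")  # forces a real check on the first call

    def __call__(self) -> bool:
        now = self._clock()
        if now >= self._expires_at:
            self._value = self._check()
            self._expires_at = now + self._ttl
        return self._value


def database_reachable() -> bool:
    # Placeholder for a real dependency check (hypothetical).
    return True


# The readiness handler would call `is_ready()` instead of hitting the database each time.
is_ready = CachedCheck(database_reachable, ttl_seconds=5.0)
```

The trade-off is freshness: with a 5 second TTL, readiness can lag a real failure by up to 5 seconds, which is usually acceptable next to a periodSeconds of similar size.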

4. Giving slow-starting services no startupProbe

Without a startup probe, a slow boot path can be killed by liveness before it has a fair chance to become healthy.
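A quick sanity check is to compute how long a container may keep failing a probe before the kubelet acts on it. The helper below is a rough sketch: it treats the window as initialDelaySeconds + periodSeconds x failureThreshold and ignores probe timeouts and scheduling jitter, so read the result as an approximation. The 90 second boot time is a made-up number for illustration.

```python
def failure_window_seconds(period_seconds: int, failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Approximate time a container may keep failing a probe before the kubelet acts."""
    return initial_delay_seconds + period_seconds * failure_threshold


# Values taken from the example manifest in this chapter.
startup_window = failure_window_seconds(period_seconds=10, failure_threshold=30)
liveness_window = failure_window_seconds(period_seconds=10, failure_threshold=3)

boot_seconds = 90  # hypothetical worst-case boot time for the app

print(startup_window)                   # 300
print(liveness_window)                  # 30
print(boot_seconds <= liveness_window)  # False: liveness alone would kill this boot
print(boot_seconds <= startup_window)   # True: the startup probe leaves enough room
```

The design choice follows from the numbers: rather than loosening liveness until it tolerates a 90 second boot, keep liveness tight and let the startup probe carry the long window.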

How I would inspect this

If someone says “the Service is not working,” I check probes before blaming the network.

kubectl get pod -n <ns>
kubectl describe pod <pod-name> -n <ns>
kubectl logs <pod-name> -n <ns> --previous
kubectl get endpointslice -n <ns> -l kubernetes.io/service-name=<svc-name>

I look for:

whether the READY column stays at 0/1
whether RESTARTS keeps increasing
whether Events show readiness failures or liveness failures
whether EndpointSlice still has ready backends
whether probe path, port, timeout, or threshold is too aggressive
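When a Service has many backends, I find it easier to summarize the EndpointSlice output than to read it by eye. The sketch below assumes the JSON shape of `kubectl get endpointslice ... -o json` (a list under `items`, each with `endpoints` carrying `conditions.ready` and `targetRef.name`). The function name and the sample data are mine; the sample is hand-written, not real cluster output.

```python
import json


def ready_by_pod(endpoint_slices: dict) -> dict:
    """Map each backing Pod name to whether its endpoint is marked ready."""
    result = {}
    for item in endpoint_slices.get("items", []):
        for endpoint in item.get("endpoints") or []:
            target = endpoint.get("targetRef") or {}
            # Fall back to the addresses when the endpoint is not backed by a named object.
            name = target.get("name") or ",".join(endpoint.get("addresses", []))
            ready = (endpoint.get("conditions") or {}).get("ready")
            # The API describes a missing `ready` value as unknown, usually treated as ready.
            result[name] = True if ready is None else bool(ready)
    return result


# Hand-written sample shaped like the kubectl JSON output (illustrative only).
SAMPLE = json.loads("""
{
  "items": [
    {
      "endpoints": [
        {"addresses": ["10.0.0.11"], "conditions": {"ready": true},
         "targetRef": {"kind": "Pod", "name": "web-a"}},
        {"addresses": ["10.0.0.12"], "conditions": {"ready": false},
         "targetRef": {"kind": "Pod", "name": "web-b"}}
      ]
    }
  ]
}
""")

summary = ready_by_pod(SAMPLE)
print(summary)  # {'web-a': True, 'web-b': False}
```

In practice the input would come from the EndpointSlice command above with `-o json` added, for example by saving it to a file and loading it with `json.load`.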

Figure: troubleshooting path for Kubernetes probes. For Service issues, walk through Probe, Ready state, Events, and EndpointSlice before blaming DNS.

The short version:

When a Service does not work, ask whether the Pod is Ready before asking whether the network is broken.

My mental model for probes

I do not like memorizing probes as just “health checks.”

That is too empty.

I treat probes as a contract:

A workload uses probes to tell Kubernetes when it can receive traffic, when it is only slow, and when it is truly stuck enough to restart.

When that contract is right, the platform helps the workload converge.

When that contract is wrong, the platform can reliably manufacture failure.

Three takeaways

Running is not Ready; Ready is what makes a Pod traffic-safe.
readiness controls traffic, liveness controls restarts, and startup protects slow boot.
Debug Service issues through Probe -> Ready state -> EndpointSlice; this is often more useful than blaming DNS first.

Next in the series: ConfigMap / Secret: why configuration and keys should not be baked into the container image.