Why Your Pods Are Restarting


Misconfigured probes are one of the most common — and most silent — causes of production pod restarts. Here’s what actually goes wrong.

You deployed your app. It’s running. And then at 2am, alerts fire — pods are restarting. CPU looks fine. Memory looks fine. No OOM. No crash.

The culprit? Probes.

Liveness and readiness probes are supposed to make your app more resilient. But when misconfigured, they do the opposite — they restart healthy pods, drop traffic during load spikes, and create cascading failures that are infuriatingly hard to trace.

This post breaks down how probes actually work, where they go wrong, and the patterns that hold up in production.

The Difference — And Why It Matters

Kubernetes has three probe types. Most teams only configure two of them.

Liveness probe → “Is this pod alive or should it be restarted?” If this fails, Kubernetes kills and restarts the container. Use it to detect deadlocks and unrecoverable states.

Readiness probe → “Is this pod ready to receive traffic?” If this fails, Kubernetes removes the pod from the Service endpoints. The pod keeps running — it just gets no traffic. Use it to signal when the app is busy, warming up, or temporarily unable to serve.

Startup probe → “Has the app finished starting up?” Until this succeeds, liveness and readiness probes are disabled. If it never succeeds within its failure budget, the container is killed and restarted. Essential for slow-starting apps. Hugely underused.

livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 5
  failureThreshold: 3
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Gives app up to 300s (30 * 10) to start before liveness kicks in
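For context, probes are fields on each container in the pod spec, not on the pod or the Deployment itself. The sketch below shows where they sit in a full manifest. The Deployment name, labels, image and replica count are placeholders invented for illustration; only the probe paths and port come from the snippet above.

```yaml
# Hypothetical Deployment: name, labels, image and replicas are placeholders
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:1.0.0
          ports:
            - containerPort: 8080
          # Probes are per-container fields, siblings of image and ports
          startupProbe:
            httpGet:
              path: /actuator/health
              port: 8080
            periodSeconds: 10
            failureThreshold: 30
          livenessProbe:
            httpGet:
              path: /actuator/health/liveness
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /actuator/health/readiness
              port: 8080
            periodSeconds: 5
            failureThreshold: 3
```

In a pod with several containers, each container carries its own probes, and a liveness failure restarts only that container.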

The Most Common Misconfiguration: Pointing Probes at the Wrong Endpoint

A very common pattern — especially in Spring Boot apps — is pointing both liveness and readiness probes at the same generic /health endpoint.

# ❌ This looks fine. It isn't.
livenessProbe:
  httpGet:
    path: /health
    port: 8080
readinessProbe:
  httpGet:
    path: /health
    port: 8080

The problem: /health typically checks everything — database connectivity, downstream services, message queue connections. If your DB is slow or an upstream service is degraded, /health returns unhealthy. Your liveness probe fails. Kubernetes restarts your pod.

But the pod itself was perfectly fine. You just killed a healthy container because of a dependency issue it couldn’t control.

The fix: Separate your health endpoints by concern.

# ✅ Correct — separate endpoints for separate concerns
livenessProbe:
  httpGet:
    path: /actuator/health/liveness   # Only: is the JVM alive? Is the app deadlocked?
    port: 8080
readinessProbe:
  httpGet:
    path: /actuator/health/readiness  # Checks: DB, queues, dependencies
    port: 8080

  • Liveness should only check internal app health: is the process responsive, is there a deadlock, is the thread pool exhausted? It should NOT check external dependencies.
  • Readiness can and should check dependencies, because if your DB is down, you shouldn’t be receiving traffic anyway.

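In Spring Boot, management.endpoint.health.probes.enabled=true exposes these two endpoints (it is switched on automatically when the app detects it is running on Kubernetes). Dependency checks such as the database are not part of the readiness group by default, so they have to be added to it. Other frameworks have equivalents.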
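As a rough sketch of that Spring Boot setup, the application.yml below enables the probe endpoints and adds a database check to the readiness group. It assumes Spring Boot 2.3 or later and a configured DataSource, whose health indicator is registered under the id db. Which indicators belong in the readiness group depends on your app, so treat the include list as an example rather than a recommendation.

```yaml
# application.yml (Spring Boot 2.3+), illustrative only
management:
  endpoint:
    health:
      probes:
        enabled: true            # exposes /actuator/health/liveness and /readiness
      group:
        liveness:
          include: livenessState # internal state only, no external dependencies
        readiness:
          # readinessState is the built-in indicator; db is assumed to exist
          # because the app has a DataSource configured
          include: readinessState,db
```

With this in place, a database outage marks the pod as not ready without touching the liveness result.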

The failureThreshold Trap

This is the one that catches teams out during traffic spikes.

# ❌ Aggressive — will restart pods during momentary slowdowns
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 2   # 2 failures = restart. That's only 20 seconds.

With failureThreshold: 2 and periodSeconds: 10, your pod gets restarted after roughly 20 seconds of slow health responses. During an SQS burst or a GC pause, your app might be legitimately busy for 15–30 seconds. The probe fails. Pod restarts. The restarted pod’s share of the load lands on the remaining pods, which slow down in turn. More restarts. You’re in a loop.

# ✅ Tolerant — survives transient slowdowns
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 5       # 50 seconds before restart
  timeoutSeconds: 5         # Give the endpoint time to respond under load
  successThreshold: 1

General guidance:

  • failureThreshold: 3–5 for liveness (you want restarts for real problems, not blips)
  • failureThreshold: 2–3 for readiness (faster to pull from rotation is fine)
  • Always set timeoutSeconds — default is 1 second, which is almost always too low under load

The Startup Probe: Stop Using initialDelaySeconds as a Workaround

Before startup probes existed, the common pattern was:

# ❌ Old workaround — brittle
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 120   # Hope the app starts within 2 minutes

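This is fragile. If the app actually starts in 30 seconds, the container runs without liveness checks for the remaining 90. On a slow node (cold start, JVM warmup), the delay can run out before the app is up, and liveness failures kill the container mid-startup.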

Startup probes solve this properly:

# ✅ Startup probe handles slow starts, then hands off to liveness
startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
  # Max startup time: 30 * 10 = 300 seconds
  # Once this passes once, it stops running — liveness takes over
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  periodSeconds: 10
  failureThreshold: 5
  # No initialDelaySeconds needed — startup probe already handled the wait

The startup probe runs until it succeeds once, then steps aside. Liveness and readiness run from that point forward. This is the correct pattern for any app with a non-trivial startup time.

Thread Starvation: The Hidden Probe Failure Cause

Here’s one you won’t find in the official docs.

Your app uses an async thread pool for processing (SQS consumers, background jobs, etc.). Under load, the thread pool saturates. HTTP handler threads that hand work to that pool block waiting for a free slot, and those are the same threads that serve your health endpoint. The probe times out. Pod restarts.

This shows up as:

  • Probe failures correlated with traffic spikes
  • No CPU/memory pressure
  • App logs show work being processed right up until the restart

The diagnosis:

# Check probe failure events
kubectl describe pod <pod-name> -n <namespace> | grep -A5 "Liveness\|Readiness"
# Look for this pattern:
# Liveness probe failed: Get http://...:8080/health: context deadline exceeded
# alongside normal application log output

The fixes:

  1. Separate your health endpoint thread pool from your application thread pool
  2. Increase timeoutSeconds on the probe
  3. Increase failureThreshold to tolerate transient slowdowns
  4. Right-size your executor pools so they don’t saturate under normal load
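For fix 1, one option in Spring Boot is to move the actuator endpoints to a separate management port. They are then served by their own embedded server and request threads instead of competing with application traffic. The sketch below assumes Spring Boot; port 8081 is an arbitrary choice, and the probe values are carried over from the earlier examples.

```yaml
# application.yml (Spring Boot): serve actuator endpoints on their own port.
# Port 8081 is an arbitrary value picked for this sketch.
management:
  server:
    port: 8081
  endpoint:
    health:
      probes:
        enabled: true
---
# Probe fragment for the container spec, pointed at the management port
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8081
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5
```

The trade-off is that the probe no longer exercises the main connector, so it can keep passing while the application port is unable to accept requests. Whether that is acceptable depends on what you want liveness to detect.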

Exec Probes: Use With Caution

HTTP probes are the right default. But some teams use exec probes — running a command inside the container:

# Works, but has hidden costs
livenessProbe:
  exec:
    command:
      - cat
      - /tmp/healthy
  periodSeconds: 5

The problem: every exec probe spawns a new process inside the container. With a low periodSeconds (frequent probing) and many pods, this creates measurable CPU overhead, and the spawned processes count against PID limits. Stick to HTTP or TCP probes unless you have a specific reason not to.
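For services with no HTTP endpoint, a TCP probe avoids the per-check process spawn. This is a minimal sketch that assumes the app listens on 8080 as in the other examples; the timing values are illustrative.

```yaml
# TCP probe: the kubelet opens a connection to the port, no process is spawned
livenessProbe:
  tcpSocket:
    port: 8080
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 5
```

Keep in mind that a TCP check only confirms the port accepts connections. It will not notice a request handler that is deadlocked behind an open socket, so an HTTP probe is still the better choice when one is available.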

A Production-Ready Probe Template

Here’s a template that holds up well for most HTTP services:

startupProbe:
  httpGet:
    path: /actuator/health
    port: 8080
  failureThreshold: 30      # Up to 5 min to start
  periodSeconds: 10
  timeoutSeconds: 5
livenessProbe:
  httpGet:
    path: /actuator/health/liveness
    port: 8080
  initialDelaySeconds: 0    # Startup probe handles this
  periodSeconds: 10
  failureThreshold: 5       # 50s before restart
  timeoutSeconds: 5
readinessProbe:
  httpGet:
    path: /actuator/health/readiness
    port: 8080
  initialDelaySeconds: 0
  periodSeconds: 5
  failureThreshold: 3       # 15s before removing from rotation
  timeoutSeconds: 3

Adjust failureThreshold and periodSeconds based on your app’s observed behaviour under load — not based on defaults.

Quick Debugging Reference

# See probe failure events for a pod
kubectl describe pod <pod-name> -n <namespace>
# Watch pod restarts in real time
kubectl get pods -n <namespace> -w
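# Check restart count and when the previous container instance terminated
kubectl get pod <pod-name> -n <namespace> \
  -o jsonpath='{.status.containerStatuses[0].restartCount}{"\n"}{.status.containerStatuses[0].lastState.terminated.finishedAt}{"\n"}'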
# Manually test your health endpoint from inside the container (requires curl in the image)
kubectl exec -it <pod-name> -n <namespace> -- \
  curl -s http://localhost:8080/actuator/health/liveness

Summary

| Mistake | Fix |
| --- | --- |
| Both probes on same /health endpoint | Separate liveness and readiness endpoints |
| failureThreshold: 2 | Use 3–5 for liveness, 2–3 for readiness |
| initialDelaySeconds as startup workaround | Use a dedicated startup probe |
| No timeoutSeconds set | Always set it; default of 1s is too low under load |
| Exec probes at high frequency | Prefer HTTP probes; exec spawns a process each time |
| Liveness checks external dependencies | Liveness = internal health only |

Probes are one of those Kubernetes features that work invisibly when configured right — and cause completely baffling incidents when they’re not. Getting them right is less about memorising the fields and more about understanding what each probe is actually supposed to answer.

Follow me (https://www.linkedin.com/in/fuzailn) for more Platform Engineering content: real problems, real fixes. Had a probe incident of your own? Share it in the comments.

Originally published on Medium.
