Debugging Kubernetes with Events and Logs

Use `kubectl get events` with timestamp sorting and container logs to quickly diagnose Kubernetes issues.


Why Events First?

Events are the first place to check when troubleshooting. They often contain direct clues about failures, scaling issues, or misconfigurations. Before reaching for describe or logs, run:

kubectl get events --sort-by=.metadata.creationTimestamp -o wide  

The -o wide flag adds the event source, subobject, first-seen time, and count, which is useful context; add -A to list events across all namespaces when you don't yet know where the failure lives.
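
To jump straight to problems, you can also filter server-side. For example, this shows only Warning events (the field selector is standard kubectl, nothing custom):

kubectl get events --field-selector type=Warning --sort-by=.metadata.creationTimestamp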

Actionable Workflow

  1. Check Events:

    • Run the command above.
    • Filter by namespace if needed: kubectl get events -n <namespace> --sort-by=.metadata.creationTimestamp.
    • Look for Warning events or repeated failures.
  2. Inspect Logs with --previous:

    • For crashed pods, get logs from the previous container instance:
      kubectl logs <pod-name> --previous  
      
    • Combine with tail for recent entries:
      kubectl logs <pod-name> --previous | tail -n 100  
      
  3. Describe as Last Resort:

    • If events and logs don’t reveal the issue, use describe:
      kubectl describe pod <pod-name>  
      
    • Focus on the Events and Conditions sections. (A combined sketch of all three steps follows this list.)
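
Taken together, the three steps make a serviceable one-shot helper. The sketch below is illustrative; the script name and argument handling are mine, not a standard tool:

#!/usr/bin/env bash
# triage.sh <namespace> <pod> — run the events → logs → describe workflow in order
NS="$1"
POD="$2"
# Step 1: the 20 most recent events in the namespace
kubectl get events -n "$NS" --sort-by=.metadata.creationTimestamp | tail -n 20
# Step 2: logs from the previous container instance, if one exists
kubectl logs -n "$NS" "$POD" --previous 2>/dev/null | tail -n 100
# Step 3: describe as the last resort
kubectl describe pod -n "$NS" "$POD"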

Tooling

  • kubectl: Master get events, logs --previous, and describe.
  • k9s: TUI for browsing events and logs in real time.
  • Lens: IDE with event filtering and log aggregation.

Tradeoffs and Caveats

  • Events Are Rate-Limited: Kubernetes aggregates and drops events under heavy load, and the API server expires them after a TTL (one hour by default). Critical clues might be missing if the control plane is overwhelmed.
  • Logs Are Transient: The kubelet keeps logs for only the most recent terminated container instance, so once a pod is deleted (or restarts again), --previous has nothing to show. Always pair with centralized logging (e.g., Loki, ELK).
  • Timestamps Can Lie: Clock skew between nodes can misorder events. Use --sort-by but cross-check against other data (see the sorting tip after this list).
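
One partial mitigation for ordering problems: sort on the event's last-seen time instead of its creation time (core v1 Event objects carry both fields):

kubectl get events --sort-by=.lastTimestamp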

Example Policy: Log Retention

Enforce a logging policy in your cluster:

# Example Fluentd config to forward container logs to Loki
apiVersion: v1
kind: ConfigMap
metadata:
  name: fluentd-config
data:
  fluent.conf: |
    # Tail the per-container log files the kubelet writes on each node
    <source>
      @type tail
      path /var/log/containers/*.log
      pos_file /var/log/fluentd-containers.log.pos
      tag kubernetes.*
      <parse>
        # Docker writes JSON logs; use the "cri" parser for containerd
        @type json
      </parse>
    </source>
    # Ship everything tagged kubernetes.* to Loki
    # (requires the fluent-plugin-grafana-loki output plugin)
    <match kubernetes.**>
      @type loki
      url http://loki:3100/loki/api/v1/push
    </match>

Retain logs for 30 days and alert on persistent crash loops.
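
For the alerting half, a quick manual check is to list back-off events cluster-wide; BackOff is the reason the kubelet records when a container is crash-looping:

kubectl get events -A --field-selector reason=BackOff,type=Warning --sort-by=.lastTimestamp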

Troubleshooting Common Issues

  • No Events Showing:

    • Check if the kube-apiserver is overloaded.
    • Verify the event TTL (the kube-apiserver's --event-ttl flag, one hour by default).
  • Pod Gone, Logs Missing:

    • Without centralized logging, logs are lost. Push for a logging stack in such cases.
    • Check node-level logs (e.g., journalctl) as a last resort; see the example after this list.
  • Describe Overwhelm:

    • Pipe output to less for readability:
      kubectl describe pod <pod> | less  
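
When you do end up at node-level logs, a reasonable starting point on a systemd node (exact paths vary by distro and container runtime):

# On the affected node itself, not via kubectl
journalctl -u kubelet --since "1 hour ago"
# Raw per-container log files written on the node
ls -l /var/log/containers/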
      

Prevention

  • Monitor Event Streams: Use Prometheus to alert on event rates or specific warning types (a crude CLI stand-in is sketched after this list).
  • Review Terminations: Regularly check kubectl get events for patterns in pod failures.
  • Train Teams: Teach engineers to default to events and logs before describe. In my experience this cuts mean time to diagnosis by roughly half.
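
Until real alerting is wired up, a crude stand-in is to poll for Warning events from a terminal:

# Re-run the Warning filter every 30 seconds
watch -n 30 "kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp | tail -n 20"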

Source thread: What K8s debugging trick would you have wished you knew on day one?
