Start Chaos Experiments with Known Weaknesses, Not Random Targets

Begin chaos experiments by targeting services with known fragility or recent changes, not random workloads.

JR

2 minute read

Why This Matters

Kubernetes makes failure injection easy, but chaos without focus wastes time and risks overloading stable systems. Targeting known weaknesses first surfaces actionable issues, builds team confidence, and avoids “chaos for chaos' sake.”

Actionable Workflow

  1. Identify candidates:

    • Review monitoring/alerting for recurring errors (e.g., kubectl get events -A --field-selector type=Warning to surface warning events cluster-wide).
    • List services with recent deployments (kubectl get deployments -o wide --sort-by='{.metadata.creationTimestamp}').
    • Flag dependencies (databases, caches, external APIs) that lack redundancy.
  2. Prioritize:

    • Rank by blast radius (e.g., a payment service > a background job processor).
    • Focus on services with incomplete observability or undocumented failover.
  3. Validate:

    • Check if the target has existing chaos tests (the CRD name depends on your tool, e.g., kubectl get podchaos -A for Chaos Mesh or kubectl get chaosexperiments -A for Litmus).
    • Confirm exclusion policies (e.g., kubectl describe namespace kube-system | grep exclusion).
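The prioritization step above can be sketched as a scoring heuristic. A minimal sketch, where the tiers, weights, and candidate services are all hypothetical placeholders, not recommendations:

```python
# Hypothetical blast-radius ranking for chaos-experiment candidates.
# Tier weights, field names, and services are illustrative only.

CRITICALITY = {"user-facing": 3, "internal-api": 2, "background-job": 1}

def blast_radius_score(service):
    """Higher score = riskier to users and more likely to be fragile,
    so experiment on it first, with tighter constraints."""
    score = CRITICALITY.get(service["tier"], 1)
    if service.get("recent_deploy"):
        score += 2  # recent changes are the most likely source of fragility
    if not service.get("observability_complete", True):
        score += 1  # incomplete observability makes failures harder to debug
    return score

def rank_candidates(services):
    return sorted(services, key=blast_radius_score, reverse=True)

candidates = [
    {"name": "payment-service", "tier": "user-facing", "recent_deploy": True},
    {"name": "report-worker", "tier": "background-job"},
    {"name": "inventory-api", "tier": "internal-api",
     "observability_complete": False},
]

for svc in rank_candidates(candidates):
    print(svc["name"], blast_radius_score(svc))
```

The exact weights matter less than having an explicit, reviewable ranking: it turns "which service do we break first?" from a debate into a diff.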

Policy Example

# Exclusion policy snippet (applied via namespace labels)
metadata:
  labels:
    chaos.excluded: "true"
    chaos.exclusion_reason: "third_party_app_no_control"

Rules:

  • Default allow chaos in all non-excluded namespaces.
  • Exclusions require SRE + application team sign-off, documented in a shared registry.
  • No exclusions for “fear of breaking things” without a remediation plan.
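The default-allow rule above reduces to a single label check. A minimal sketch, assuming a controller that sees namespaces as label dictionaries (the namespace names and mocked objects below are illustrative, not from a live cluster):

```python
# Default-allow filter: a namespace is a valid chaos target unless it
# carries the chaos.excluded="true" label. Namespaces are mocked as
# plain dicts; a real controller would list them via the Kubernetes API.

def is_chaos_target(namespace):
    labels = namespace.get("labels", {})
    return labels.get("chaos.excluded") != "true"

namespaces = [
    {"name": "payments", "labels": {}},
    {"name": "third-party", "labels": {
        "chaos.excluded": "true",
        "chaos.exclusion_reason": "third_party_app_no_control"}},
    {"name": "kube-system", "labels": {"chaos.excluded": "true"}},
]

targets = [ns["name"] for ns in namespaces if is_chaos_target(ns)]
print(targets)  # only non-excluded namespaces remain
```

Keeping the filter this simple is the point: exclusions live in cluster state (labels), not in the chaos tool's config, so the sign-off registry stays the single source of truth.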

Tooling

  • Chaos Mesh: Use the action and selector fields of a PodChaos resource to target specific workloads.
    Example:
    apiVersion: chaos-mesh.org/v1alpha1
    kind: PodChaos
    metadata:
      name: payment-service-pod-kill
    spec:
      action: pod-kill   # valid PodChaos actions include pod-kill, pod-failure, container-kill
      mode: one
      selector:
        labelSelectors:
          app: payment-service
  • Litmus: Target workloads via the ChaosEngine's appinfo fields (appns, applabel, appkind) to constrain targets.
  • Monitoring integration: Pair with Prometheus/Grafana to auto-detect anomalies post-injection.
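The monitoring integration can start as a simple before/after comparison of error rates. A toy sketch, assuming you can fetch the two rates (in practice from a Prometheus range query); the threshold and sample values are illustrative:

```python
# Toy post-injection anomaly check: flag the experiment if the error
# rate grew by more than an allowed factor. Values are illustrative;
# real numbers would come from Prometheus before and after injection.

def anomalous(pre_error_rate, post_error_rate, max_growth=2.0):
    """Return True if the post-injection error rate exceeds the budget."""
    if pre_error_rate == 0:
        # Any errors appearing from a clean baseline count as an anomaly.
        return post_error_rate > 0
    return post_error_rate / pre_error_rate > max_growth

print(anomalous(0.01, 0.015))  # 1.5x growth, within budget -> False
print(anomalous(0.01, 0.05))   # 5x growth, flag it -> True
```

A check like this can gate automatic experiment abort: if the function returns True, the chaos operator halts the injection and restores the workload.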

Tradeoffs

  • Over-isolation: Excluding too many components (e.g., all stateful services) reduces chaos effectiveness.
  • False confidence: Fixing only targeted issues might miss systemic problems (e.g., network partitions).
  • Operational overhead: Maintaining exclusion policies requires ongoing collaboration between SRE and app teams.

Troubleshooting

  • No targets found:
    • Check if monitoring is underconfigured (e.g., no alerts for deployment failures).
    • Ensure recent deployments are labeled correctly.
  • Experiments not triggering:
    • Verify chaos operator permissions (kubectl auth can-i create chaosexperiments --all-namespaces).
    • Check for conflicting exclusion labels.
  • Unexpected downtime:
    • Review blast radius constraints (e.g., kubectl get chaosresults -A).
    • Audit exclusion requests for abuse (e.g., app teams blocking tests without cause).

Final Note

Chaos experiments should evolve from debugging known issues to validating systemic resilience. Start small, iterate, and formalize exclusions as part of platform governance—not ad-hoc exceptions.

Source thread: For teams running chaos experiments on Kubernetes, how do you pick the first target?
