Start Chaos Experiments with Known Weaknesses, Not Random Targets
Begin chaos experiments by targeting services with known fragility or recent changes, not random workloads.
Why This Matters
Kubernetes makes failure injection easy, but chaos without focus wastes time and risks overloading stable systems. Targeting known weaknesses first surfaces actionable issues, builds team confidence, and avoids “chaos for chaos' sake.”
Actionable Workflow
- Identify candidates:
  - Review monitoring/alerting for recurring errors (e.g., `kubectl get pods --show-labels | grep -i error`).
  - List services with recent deployments (`kubectl get deployments -o wide --sort-by='{.metadata.creationTimestamp}'`).
  - Flag dependencies (databases, caches, external APIs) that lack redundancy.
- Prioritize:
  - Rank by blast radius (e.g., a payment service > a background job processor).
  - Focus on services with incomplete observability or undocumented failover.
- Validate:
  - Check if the target has existing chaos tests (e.g., `kubectl get chaosexperiments -A`).
  - Confirm exclusion policies (e.g., `kubectl describe namespace kube-system | grep exclusion`).
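The identify-and-prioritize steps above can be sketched as a simple scoring pass. This is an illustrative heuristic only: the `Candidate` record, field names, and weights are hypothetical, not part of any chaos tool.

```python
# Illustrative heuristic for picking a first chaos target.
# All records, field names, and weights are hypothetical.
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    recent_deploy: bool      # changed in the last few days
    recurring_errors: bool   # recurring alerts in monitoring
    blast_radius: int        # 1 (background job) .. 3 (customer-facing)
    has_failover_docs: bool  # undocumented failover raises priority

def score(c: Candidate) -> int:
    s = 0
    if c.recent_deploy:
        s += 2           # fresh changes are the most likely weak point
    if c.recurring_errors:
        s += 3           # known fragility beats random selection
    s += c.blast_radius  # bigger impact -> more valuable to validate
    if not c.has_failover_docs:
        s += 1           # undocumented failover is a hidden risk
    return s

candidates = [
    Candidate("payment-service", True, True, 3, False),
    Candidate("report-generator", False, False, 1, True),
    Candidate("session-cache", True, False, 2, False),
]

ranked = sorted(candidates, key=score, reverse=True)
print([c.name for c in ranked])
```

Feeding the scorer from your real monitoring and deployment data (rather than a hand-written list) is the useful version of this; the point is that the ranking criteria are explicit and reviewable.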
Policy Example
```yaml
# Exclusion policy snippet (applied via namespace labels)
metadata:
  labels:
    chaos.excluded: "true"
    chaos.exclusion_reason: "third_party_app_no_control"
```
Rules:
- Default allow chaos in all non-excluded namespaces.
- Exclusions require SRE + application team sign-off, documented in a shared registry.
- No exclusions for “fear of breaking things” without a remediation plan.
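The rules above can be expressed as a small default-allow gate. The label keys mirror the policy snippet, but the `chaos_allowed` function and the sign-off registry structure are hypothetical sketches, not a real admission control.

```python
# Default-allow chaos gate based on namespace labels.
# Label keys mirror the exclusion snippet; the registry is hypothetical.
EXCLUSION_REGISTRY = {
    # namespace -> sign-offs recorded in the shared registry
    "vendor-apps": {"sre_signoff": True, "app_team_signoff": True},
}

def chaos_allowed(namespace: str, labels: dict) -> bool:
    if labels.get("chaos.excluded") != "true":
        return True  # default allow in all non-excluded namespaces
    signoffs = EXCLUSION_REGISTRY.get(namespace, {})
    # An exclusion only holds if it has a documented reason AND both
    # sign-offs; otherwise it is invalid and chaos stays allowed.
    valid = (
        "chaos.exclusion_reason" in labels
        and signoffs.get("sre_signoff")
        and signoffs.get("app_team_signoff")
    )
    return not valid

print(chaos_allowed("payments", {}))  # no exclusion label -> allowed
print(chaos_allowed("vendor-apps", {
    "chaos.excluded": "true",
    "chaos.exclusion_reason": "third_party_app_no_control",
}))  # documented, signed-off exclusion -> blocked
```

Treating an undocumented exclusion as invalid (rather than silently honoring it) is what enforces the "no exclusions without a remediation plan" rule.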
Tooling
- Chaos Mesh: Use the `action` and `selector` fields to target specific workloads. Example:

  ```yaml
  apiVersion: chaos-mesh.org/v1alpha1
  kind: PodChaos
  metadata:
    name: payment-pod-kill
  spec:
    action: pod-kill
    mode: one
    selector:
      labelSelectors:
        app: payment-service
  ```

- Litmus: Leverage `ChaosExperiment` templates with affinity rules to constrain targets.
- Monitoring integration: Pair with Prometheus/Grafana to auto-detect anomalies post-injection.
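If you run experiments against several services, it helps to template the manifests rather than hand-edit YAML per target. A minimal sketch, assuming the Chaos Mesh `PodChaos` schema; the `pod_chaos` helper and its defaults are illustrative, not part of either tool.

```python
# Build a Chaos Mesh PodChaos manifest targeting pods by label.
# The helper and its defaults are illustrative; the fields follow
# the chaos-mesh.org/v1alpha1 PodChaos schema.
def pod_chaos(name: str, labels: dict, action: str = "pod-kill") -> dict:
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name},
        "spec": {
            "action": action,           # pod-kill, pod-failure, container-kill
            "mode": "one",              # affect a single matching pod
            "selector": {"labelSelectors": labels},
        },
    }

manifest = pod_chaos("payment-pod-kill", {"app": "payment-service"})
print(manifest["spec"]["selector"]["labelSelectors"])
```

Serializing the dict with a YAML library and applying it via `kubectl apply -f -` keeps the target selection in reviewable code next to the scoring logic above.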
Tradeoffs
- Over-isolation: Excluding too many components (e.g., all stateful services) reduces chaos effectiveness.
- False confidence: Fixing only targeted issues might miss systemic problems (e.g., network partitions).
- Operational overhead: Maintaining exclusion policies requires ongoing collaboration between SRE and app teams.
Troubleshooting
- No targets found:
  - Check if monitoring is underconfigured (e.g., no alerts for deployment failures).
  - Ensure recent deployments are labeled correctly.
- Experiments not triggering:
  - Verify chaos operator permissions (`kubectl auth can-i create chaosexperiments --all-namespaces`).
  - Check for conflicting exclusion labels.
- Unexpected downtime:
  - Review blast-radius constraints and experiment outcomes (e.g., `kubectl get chaosresults -A`).
  - Audit exclusion requests for abuse (e.g., app teams blocking tests without cause).
Final Note
Chaos experiments should evolve from debugging known issues to validating systemic resilience. Start small, iterate, and formalize exclusions as part of platform governance—not ad-hoc exceptions.
Source thread: For teams running chaos experiments on Kubernetes, how do you pick the first target?
