
JR

4 minute read

What “Config Hell” Actually Looks Like (And How to Escape It)

If you’ve spent any time managing production Kubernetes clusters, you’ve probably heard the term “Config Hell.” It’s the chaotic state where configurations sprawl uncontrollably, drift between environments, and resist consistent management. But what does this look like in practice? And how do you fix it when it happens?

Let’s cut through the theory and walk through real-world symptoms, diagnosis steps, and repairs.


Symptoms of Config Hell

1. Inconsistent Configuration Artifacts

When multiple teams or developers write Helm charts, Kustomizations, or raw YAMLs without standards, you end up with a mess:

  • Helm charts with varying structures, no versioning, or hardcoded values.
  • Kustomize overlays that “work on my laptop” but fail in CI/CD.
  • Teams using different tools (ArgoCD vs. Helm vs. raw kubectl apply) for the same cluster.
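
To make the “hardcoded values” bullet concrete, a chart template with environment details baked in looks roughly like this (an illustrative fragment, not a complete manifest or any particular project):

# templates/deployment.yaml -- values baked straight into the template
containers:
  - name: api
    image: registry.internal/api:1.4.2     # image tag pinned by hand, invisible to values.yaml
    env:
      - name: DB_HOST
        value: prod-db.internal             # environment-specific value hardcoded

# The same fragment, parameterized the way the chart's values.yaml expects:
containers:
  - name: api
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
    env:
      - name: DB_HOST
        value: "{{ .Values.db.host }}"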

Example: A team switches from Kustomize to Helm mid-project, leaving half the cluster described by kustomization.yaml overlays and the other half by Helm charts. No one knows which is authoritative.

2. YAML Drift

Configurations in Git don’t match what’s running in the cluster. This happens when:

  • Manual edits are made to live resources.
  • Tools render or apply YAML with inputs that aren’t tracked (e.g., helm upgrade run with --values files that never land in version control).
  • Secrets or environment-specific values are hardcoded instead of parameterized.

Example: A developer runs helm upgrade -f dev-values.yaml locally, but the CI/CD pipeline uses a different values file. The cluster state becomes a Frankenstein of overlapping changes.
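
A sketch of that mismatch, with illustrative file names and values; the two files quietly disagree, and the cluster reflects whichever invocation ran last:

# dev-values.yaml -- applied by hand from a laptop
replicaCount: 1
image:
  tag: 2.3.0-rc1
resources:
  limits:
    memory: 256Mi

# values-ci.yaml -- applied by the pipeline
replicaCount: 3
image:
  tag: 2.2.7
resources:
  limits:
    memory: 512Mi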

3. Toolchain Sprawl

Mixing tools without a clear strategy leads to:

  • ArgoCD managing some apps, Helm for others, and manual kubectl for “quick fixes.”
  • Policies enforced in one cluster (e.g., via OPA/Gatekeeper) but not others.
  • Configuration generation pipelines that chain multiple tools (e.g., Helm → Kustomize → Sealed Secrets), creating opaque dependencies.

Example: A team migrates from AWS EKS to GKE but leaves behind half-configured AWS-specific IAM roles and VPC settings, bloating the config repos.

4. Permission and Security Debt

Over time, IAM policies, RBAC roles, and network policies accumulate without review:

  • Service accounts with excessive permissions.
  • Legacy roles bound to users who left the company.
  • No clear ownership of config changes.

Example: A Helm chart creates a ClusterRoleBinding that grants cluster-admin to a service account, “just to get it working.” Months later, no one remembers why it’s there.
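
That “just to get it working” binding usually looks something like this (names are illustrative):

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: app-fix-temp              # “temporary”, months ago
subjects:
  - kind: ServiceAccount
    name: payments-api            # an ordinary workload service account
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin             # full control over the entire cluster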


Diagnosing Config Hell

  1. Audit Configuration Sources

    • List all tools in use: Helm, Kustomize, ArgoCD, Terraform, etc.
    • Identify which configurations are versioned vs. ad-hoc.
    • Check for hardcoded values, secrets in repos, or environment-specific hacks.
  2. Check Version Control Hygiene

    • Are all configs in Git? If not, stop reading and fix that first.
    • Are Helm values or Kustomize overlays properly versioned?
    • Are there multiple branches with conflicting changes?
  3. Review Toolchain Dependencies

    • Map the pipeline from code to cluster: Which tools generate or mutate configs?
    • Are there “hidden” steps (e.g., manual sed/awk scripts in CI)?
  4. Scan for Access Drift

    • Use rbac-lookup or kubeaudit to audit RBAC permissions, and kube-bench for CIS configuration checks.
    • Look for roles bound to non-existent users or service accounts.
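
The stale bindings from step 4 are easy to spot once you know the shape: a subject that no longer maps to anyone. Names here are illustrative:

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: debug-access-jsmith       # created during an incident, never removed
  namespace: payments
subjects:
  - kind: User
    name: jsmith@example.com      # left the company; nothing cleans this up automatically
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: edit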

Repair Steps

1. Standardize Configuration Artifacts

Pick one config management approach and stick to it:

  • Helm: Enforce a chart structure (e.g., values.yaml, templates/, Chart.yaml).
  • Kustomize: Use bases and overlays consistently (a minimal layout is sketched after this list).
  • Policy Enforcement: Use OPA/Gatekeeper or Kyverno to validate all incoming configs.
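
For the Kustomize option, “consistently” mostly means one environment-agnostic base per app and one thin overlay per environment. A minimal layout sketch (paths and patch names are illustrative):

# base/kustomization.yaml -- shared, environment-agnostic manifests
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml

# overlays/prod/kustomization.yaml -- only what differs in prod
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - path: replica-count.yaml      # e.g. bumps replicas for production traffic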

Actionable Workflow:

  1. Inventory all existing configs.
  2. Choose a standard (e.g., Helm 3 with charts published to an OCI registry).
  3. Migrate non-compliant configs incrementally.
  4. Block non-standard deployments via CI/CD gates.

2. Enforce GitOps

  • All changes must come from Git.
  • Use ArgoCD, Flux, or similar to sync cluster state to Git.
  • Require pull requests for all changes, even “quick fixes.”
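
With ArgoCD, “sync cluster state to Git” comes down to one Application per app pointing at a path in the repo. A minimal sketch; the repo URL, path, and names are placeholders:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: payments
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/platform-config.git   # placeholder repo
    targetRevision: main
    path: apps/payments/overlays/prod
  destination:
    server: https://kubernetes.default.svc
    namespace: payments
  syncPolicy:
    automated:
      prune: true        # remove resources that disappear from Git
      selfHeal: true     # revert manual kubectl edits back to the Git state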

Example Policy (Rego for OPA/Gatekeeper):

package helm_chart_validations

violation[{"msg": msg}] {
  # Assumes the manifest itself is the input (conftest-style evaluation);
  # under Gatekeeper the object being admitted lives at input.review.object instead.
  input.kind == "HelmRelease"

  # "not ... ==" also fires when the label is missing entirely, not only when it differs.
  not input.metadata.labels.app == "approved"
  msg := "HelmRelease must have label 'app=approved'"
}
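
To enforce that rule at admission with Gatekeeper, the Rego gets embedded in a ConstraintTemplate and reads the object under review from input.review.object. A rough sketch, with illustrative names:

apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: requiredapprovallabel          # must be the lowercase form of the kind below
spec:
  crd:
    spec:
      names:
        kind: RequiredApprovalLabel    # the Constraint kind this template generates
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package requiredapprovallabel

        violation[{"msg": msg}] {
          obj := input.review.object   # Gatekeeper nests the admitted object here
          obj.kind == "HelmRelease"
          not obj.metadata.labels.app == "approved"
          msg := "HelmRelease must have label 'app=approved'"
        }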

3. Clean Up IAM and RBAC

  • Delete unused roles and bindings.
  • Replace cluster-admin with least-privilege roles (sketched below).
  • Use tools like rbac-lookup or kubeaudit to find overprivileged accounts.
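
A concrete target for “least privilege”: a namespaced Role scoped to the verbs the workload actually needs, bound to the service account that previously held cluster-admin. Resources, verbs, and names below are placeholders:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: payments-deployer
  namespace: payments
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "patch"]    # only what the deploy pipeline actually uses
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: payments-deployer
  namespace: payments
subjects:
  - kind: ServiceAccount
    name: payments-api                 # the account that used to have cluster-admin
    namespace: payments
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: payments-deployer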

4. Document and Train

  • Create a “Config Playbook” with standards for Helm, Kustomize, and RBAC.
  • Train teams on why consistency matters (e.g., “Your hacky values.yaml will bite you in 6 months”).

Prevention

Policy Example: Enforce Label Consistency

package label_validations

# The "in" keyword needs this import on OPA releases before 1.0; it is built in from 1.0 onward.
import future.keywords.in

violation[{"msg": msg}] {
  # Assumes the manifest itself is the input (conftest-style evaluation).
  input.kind in ["Deployment", "StatefulSet", "Service"]
  not input.metadata.labels.team
  msg := "Resources must have 'team' label for ownership tracking"
}

Tooling to Avoid Hell

  • Policy as Code: OPA/Gatekeeper, Kyverno (enforce standards at admission).
  • GitOps: ArgoCD, Flux (sync configs from Git).
  • Config Management: Helm (with versioned charts), Kustomize (for overlays).
  • Audit: kube-bench, rbac-lookup, kubectl describe for manual checks.

Final Thoughts

Config Hell isn’t about tools—it’s about discipline. The goal isn’t perfection; it’s consistency and visibility. When you standardize artifacts, enforce GitOps, and audit regularly, you turn chaos into something manageable.

And remember: That Helm chart you wrote two years ago? It’s not your fault. But it’s your job to fix it.


Date: 2026-02-16

Source thread: What does “config hell” actually look like in the real world?
