Run Only What You Must in Kubernetes


February 16, 2026 JR


Self-hosting databases and stateful services in Kubernetes reduces costs but demands operational maturity for reliability and recovery.

Running everything in your cluster is tempting for cost control, but it shifts undifferentiated operational burden to your team. Decisions should hinge on three factors: criticality, expertise, and total cost of ownership.

When to Run in Cluster

Self-host stateful workloads like PostgreSQL or RabbitMQ only if:

- You’ve proven HA/DR workflows (e.g., backups, failover, point-in-time recovery).
- Your team can debug storage, networking, and operator issues under pressure.
- Long-term cost savings outweigh the operational tax.
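The last condition is arithmetic, and the operational tax is easy to leave out of it. Here is a minimal sketch, assuming you can estimate monthly infrastructure spend for both options and the engineer hours the self-hosted option consumes. The class, function, and all input figures are hypothetical and are not taken from the examples in this post.

```python
from dataclasses import dataclass


@dataclass
class CostInputs:
    """Hypothetical monthly cost estimates for one stateful service."""
    managed_monthly: float            # managed-service bill for the same capacity
    self_hosted_infra_monthly: float  # nodes, disks, and backup storage when self-hosted
    ops_hours_monthly: float          # engineer hours on upgrades, incidents, restore tests
    hourly_rate: float                # fully loaded cost of one engineer hour


def twelve_month_savings(c: CostInputs) -> float:
    """Fractional 12-month savings of self-hosting vs. managed (negative = costs more)."""
    if c.managed_monthly <= 0:
        raise ValueError("managed_monthly must be positive")
    managed = 12 * c.managed_monthly
    # The operational tax: engineer time counts toward self-hosted TCO.
    self_hosted = 12 * (c.self_hosted_infra_monthly + c.ops_hours_monthly * c.hourly_rate)
    return (managed - self_hosted) / managed


if __name__ == "__main__":
    example = CostInputs(managed_monthly=5000, self_hosted_infra_monthly=2000,
                         ops_hours_monthly=20, hourly_rate=100)
    # Infrastructure alone looks 60% cheaper; with labor, savings fall to 20%.
    print(f"{twelve_month_savings(example):.0%}")  # 20%
```

With these made-up numbers, an infrastructure-only comparison suggests 60% savings. Counting 20 engineer hours a month reduces that to 20%, which is below the 25% threshold used in the policy example.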

Example: A fintech startup runs HA PostgreSQL on Azure with the Crunchy PostgreSQL Operator. The team automates backups with pgBackRest, monitors with Prometheus, and enforces role-based access control. Their cost is roughly 40% lower than a managed service.

When to Avoid Cluster Hosting

Outsource if:

- Your DBAs or RabbitMQ admins lack Kubernetes fluency (debugging pods isn’t their job).
- Downtime or data loss would cripple business continuity.
- Managed services (e.g., Azure Database for PostgreSQL, Amazon RDS) meet your compliance and latency needs.

Example: A retail SaaS team runs PostgreSQL on VMs, where DBAs manage backups with cron jobs and WAL archiving. Kubernetes hosts only stateless apps. The tradeoff is higher cloud costs, but no exposure to misconfigured persistent volumes.

Actionable Workflow

1. Audit: List all services in your cluster and tag each as stateful or stateless.
2. Evaluate: For each stateful workload, answer:
   - Can we recover from a zone failure in under 15 minutes?
   - Do we have automated backups with restores tested monthly?
   - Is there a runbook for common failures (e.g., disk full, operator crashes)?
3. Decide:
   - If any answer is “no,” migrate to VMs or a managed service.
   - If all answers are “yes,” keep the workload in the cluster but enforce SLOs (e.g., 99.9% uptime, RTO under 1 hour).
4. Document: Create a decision matrix for future services.
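The audit, evaluate, and decide steps can be sketched as a small script. This is a minimal sketch under several assumptions. Workloads arrive as already-parsed manifest dicts, such as the `items` from `kubectl get statefulsets,deployments -A -o json`. A workload is treated as stateful if it is a StatefulSet or mounts a PersistentVolumeClaim, which is a heuristic. The three readiness answers are recorded by hand. All function and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any


def is_stateful(manifest: dict[str, Any]) -> bool:
    """Audit step: tag a workload manifest as stateful (heuristic, not exhaustive)."""
    if manifest.get("kind") == "StatefulSet":
        return True
    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    # A Deployment that mounts a PVC still keeps state on a persistent volume.
    return any("persistentVolumeClaim" in vol for vol in pod_spec.get("volumes", []))


@dataclass
class Readiness:
    """Evaluate step: answers to the three questions for one stateful workload."""
    zone_recovery_under_15_min: bool
    backups_restore_tested_monthly: bool
    has_failure_runbook: bool


def decide(r: Readiness) -> str:
    """Decide step: any 'no' moves the workload out of the cluster."""
    answers = (r.zone_recovery_under_15_min, r.backups_restore_tested_monthly,
               r.has_failure_runbook)
    if all(answers):
        return "retain in cluster; enforce SLOs"
    return "migrate to VMs or managed service"


if __name__ == "__main__":
    workloads = [
        {"kind": "StatefulSet", "metadata": {"name": "postgres"}},
        {"kind": "Deployment", "metadata": {"name": "web"},
         "spec": {"template": {"spec": {"volumes": [{"emptyDir": {}}]}}}},
    ]
    for w in workloads:
        print(w["metadata"]["name"], "stateful" if is_stateful(w) else "stateless")
    print(decide(Readiness(True, True, False)))  # migrate to VMs or managed service
```

The script only encodes the decision. The answers in `Readiness` still have to come from real drills and restore tests, and the output can seed the decision matrix from the Document step.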

Policy Example

**Stateful Workload Policy**  
1. Databases and message brokers default to VMs or managed services unless:  
   - Team demonstrates HA/DR validation in staging.  
   - Cost analysis shows >25% savings over 12 months.  
2. All cluster-hosted stateful workloads require:  
   - Automated backups with retention and test restores.  
   - Monitoring for storage latency, replication lag, and pod health.  
   - Annual game-day testing for failover and disaster recovery.
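The policy can also be encoded as a check to run during service reviews. This is a minimal sketch, assuming each workload's evidence is collected into a record like the one below. The field names are hypothetical. The 25% and 12-month thresholds come from rule 1. “Annual” game-day testing is interpreted here as at least one game day in the last 365 days.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional


@dataclass
class Evidence:
    """Hypothetical evidence record for one cluster-hosted stateful workload."""
    ha_dr_validated_in_staging: bool
    projected_12_month_savings: float   # fraction, e.g. 0.30 for 30%
    automated_backups: bool
    backup_retention_days: int
    last_test_restore: Optional[date]
    monitors_storage_latency: bool
    monitors_replication_lag: bool
    monitors_pod_health: bool
    last_game_day: Optional[date]


def policy_violations(e: Evidence, today: date) -> List[str]:
    """Return unmet policy items; an empty list means cluster hosting is allowed."""
    problems = []
    # Rule 1: conditions for overriding the VM/managed-service default.
    if not e.ha_dr_validated_in_staging:
        problems.append("HA/DR not validated in staging")
    if e.projected_12_month_savings <= 0.25:
        problems.append("12-month savings not above 25%")
    # Rule 2: ongoing requirements for every cluster-hosted stateful workload.
    if not e.automated_backups or e.backup_retention_days <= 0:
        problems.append("no automated backups with retention")
    if e.last_test_restore is None:
        problems.append("no test restore on record")
    if not (e.monitors_storage_latency and e.monitors_replication_lag
            and e.monitors_pod_health):
        problems.append("monitoring incomplete")
    if e.last_game_day is None or today - e.last_game_day > timedelta(days=365):
        problems.append("no game day in the last year")
    return problems
```

Returning a list of reasons, not a single boolean, makes the output directly usable in a review ticket. Each entry names the policy item that still needs evidence.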

Tooling

- Operators: Crunchy PostgreSQL Operator, Zalando Postgres Operator, RabbitMQ Cluster Operator.
- Monitoring: Prometheus and Grafana for metrics; Alertmanager for threshold alerts.
- Backup: Velero for Kubernetes resource and persistent volume backups; pgBackRest for PostgreSQL.
- Compliance: OPA Gatekeeper to enforce policies (e.g., no privileged containers).

Conclusion

Running everything in Kubernetes is a technical and financial gamble. Keep in the cluster only the services where your team can deliver better uptime and cost efficiency than managed alternatives. For the rest, pay the managed-service premium; your team's time is better spent on core product value.

Source thread: Do you run everything in your cluster?
