Run Only What You Must in Kubernetes
Self-hosting databases and stateful services in Kubernetes can reduce costs, but it demands the operational maturity to keep them reliable and recoverable.
Running everything in your cluster is tempting for cost control, but it shifts undifferentiated operational burden to your team. Decisions should hinge on three factors: criticality, expertise, and total cost of ownership.
When to Run in Cluster
Self-host stateful workloads like PostgreSQL or RabbitMQ only if:
- You’ve proven HA/DR workflows (e.g., backups, failover, point-in-time recovery).
- Your team can debug storage, networking, and operator issues under pressure.
- Long-term cost savings outweigh the operational tax.
Example: A fintech startup uses the Crunchy PostgreSQL Operator on Azure for HA databases. They automate backups with pgBackRest, monitor with Prometheus, and enforce role-based access. Cost: roughly 40% lower than managed services.
When to Avoid Cluster Hosting
Outsource if:
- Your DBAs/RabbitMQ admins lack Kubernetes fluency (debugging pods isn’t their job).
- Downtime or data loss would cripple business continuity.
- Managed services (e.g., Azure SQL, Amazon RDS) fit your compliance and latency needs.
Example: A retail SaaS team runs PostgreSQL on VMs. DBAs manage backups via cron jobs and WAL archiving. Kubernetes hosts only stateless apps. Tradeoff: higher cloud costs, but no exposure to misconfigured persistent volumes.
Actionable Workflow
- Audit: List all services in your cluster. Tag each as stateful or stateless.
- Evaluate: For each stateful workload, answer:
  - Can we recover from a zone failure in <15 mins?
  - Do we have automated backups that are tested monthly?
  - Is there a runbook for common failures (e.g., disk full, operator crashes)?
- Decide:
  - If answers are “no,” migrate to VMs or a managed service.
  - If “yes,” retain in cluster but enforce SLOs (e.g., 99.9% uptime, RTO <1hr).
- Document: Create a decision matrix for future services.
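The Evaluate and Decide steps can be sketched in a few lines of Python. This is only an illustration: the `StatefulWorkload` record, its field names, and the sample workloads are hypothetical, and the sketch assumes that a single "no" is enough to trigger migration, which the workflow above does not state explicitly.

```python
from dataclasses import dataclass


@dataclass
class StatefulWorkload:
    """Hypothetical record holding the three evaluation answers for one workload."""
    name: str
    zone_failure_recovery_minutes: float  # measured time to recover from a zone failure
    backups_automated_and_tested_monthly: bool
    has_failure_runbook: bool


def decide(workload: StatefulWorkload, max_recovery_minutes: float = 15.0) -> str:
    """Return 'retain' only if every evaluation question is answered 'yes'.

    Assumption: any single 'no' means the workload should move to VMs
    or a managed service.
    """
    answers = [
        workload.zone_failure_recovery_minutes < max_recovery_minutes,
        workload.backups_automated_and_tested_monthly,
        workload.has_failure_runbook,
    ]
    return "retain" if all(answers) else "migrate"


# Made-up sample data, not taken from any real cluster.
workloads = [
    StatefulWorkload("orders-postgres", 8.0, True, True),
    StatefulWorkload("events-rabbitmq", 40.0, True, False),
]
decisions = {w.name: decide(w) for w in workloads}
print(decisions)
```

The resulting dictionary is a reasonable starting point for the decision matrix in the Document step, since it records the outcome per workload alongside the answers that produced it.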
Policy Example
**Stateful Workload Policy**
1. Databases and message brokers default to VMs or managed services unless:
   - Team demonstrates HA/DR validation in staging.
   - Cost analysis shows >25% savings over 12 months.
2. All cluster-hosted stateful workloads require:
   - Automated backups with retention and test restores.
   - Monitoring for storage latency, replication lag, and pod health.
   - Annual game-day testing for failover and disaster recovery.
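Rule 1 of the policy reduces to a small calculation, sketched below. The function names and cost figures are hypothetical. The sketch assumes both conditions must hold (HA/DR validation and the savings threshold), and that the self-hosted figure already includes the staff time spent operating the workload.

```python
def savings_fraction(managed_cost_12m: float, self_hosted_cost_12m: float) -> float:
    """Fraction saved by self-hosting, relative to the managed 12-month cost."""
    if managed_cost_12m <= 0:
        raise ValueError("managed_cost_12m must be positive")
    return (managed_cost_12m - self_hosted_cost_12m) / managed_cost_12m


def may_run_in_cluster(
    ha_dr_validated_in_staging: bool,
    managed_cost_12m: float,
    self_hosted_cost_12m: float,
    min_savings: float = 0.25,
) -> bool:
    """Apply rule 1: default to VMs or managed services unless both conditions hold.

    Assumption: the two conditions are combined with AND, and savings must be
    strictly greater than the threshold (">25%").
    """
    savings = savings_fraction(managed_cost_12m, self_hosted_cost_12m)
    return ha_dr_validated_in_staging and savings > min_savings


# Illustrative numbers only: 60,000 managed vs. 42,000 self-hosted over 12 months.
print(may_run_in_cluster(True, 60_000, 42_000))   # 30% savings, validated
print(may_run_in_cluster(True, 60_000, 48_000))   # 20% savings, below threshold
print(may_run_in_cluster(False, 60_000, 30_000))  # large savings, but HA/DR unproven
```

Keeping the threshold as a parameter makes it easy to revisit the 25% figure later without touching the rule itself.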
Tooling
- Operators: Crunchy PostgreSQL, Zalando PostgreSQL, RabbitMQ Operator.
- Monitoring: Prometheus + Grafana for metrics; Alertmanager for thresholds.
- Backup: Velero for cluster-wide snapshots; pgBackRest for PostgreSQL.
- Compliance: OPA/Gatekeeper to enforce policies (e.g., no privileged containers).
Conclusion
Running everything in Kubernetes is a technical and financial gamble. Prioritize services where your team can deliver better uptime and cost efficiency than managed alternatives. For the rest, pay the tax—your time is better spent on core product value.
Source thread: Do you run everything in your cluster?
