VirtualKubelet in Production: When and Why It Fits
VirtualKubelet bridges Kubernetes with external systems, enabling flexible pod scheduling without overcommitting cluster resources.
Problem Context
You’re likely running VirtualKubelet to decouple compute resource management from Kubernetes nodes. Common drivers:
- Bursty workloads requiring ephemeral capacity (e.g., batch jobs, ML inference)
- Integration with non-Kubernetes systems (Slurm, Lambda, ACI)
- Avoiding overprovisioning for sporadic or unpredictable demand
- Isolating risky or untrusted workloads (e.g., user-submitted models)
Workflow: Diagnose and Implement
- Assess workload patterns
  - Identify stateless, short-lived pods or those requiring external execution environments
  - Example: ML inference jobs on Hugging Face models that spike during business hours
- Select a VirtualKubelet provider
  - Slurm, AWS Lambda, Azure ACI, or custom implementations
  - Match provider to existing infrastructure (e.g., Slurm for HPC clusters)
- Deploy VirtualKubelet and provider

      # Example: Deploy VirtualKubelet with Slurm provider
      # Verify the manifest URL against your provider's documentation before applying
      kubectl apply -f https://raw.githubusercontent.com/virtual-kubelet/virtual-kubelet/master/deploy/slurm/provider.yaml

- Configure node selectors or taints
  - Use node selectors to route specific workloads to VirtualKubelet nodes
  - Example policy:

        # Fragment: add metadata.name and spec.containers for a complete Pod
        apiVersion: v1
        kind: Pod
        metadata:
          annotations:
            node.kubernetes.io/instance-type: virtual-kubelet
        spec:
          nodeSelector:
            # Must match the name the virtual node registers with
            kubernetes.io/hostname: virtual-kubelet-node
- Test with canary deployments
  - Monitor scheduling latency and resource utilization
  - Check node status:

        kubectl get nodes -l node.kubernetes.io/instance-type=virtual-kubelet
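The node status check in the last step can also be scripted for a canary run. The following is a minimal sketch, assuming the node list was exported with `kubectl get nodes -l node.kubernetes.io/instance-type=virtual-kubelet -o json`; the function name `virtual_node_readiness` is hypothetical, and the sketch relies only on the standard `Ready` condition in each node's status.

```python
import json


def virtual_node_readiness(nodes_json: str) -> dict:
    """Map each node name to True when its Ready condition has status "True".

    Expects the JSON document printed by `kubectl get nodes ... -o json`.
    Nodes without a Ready condition are reported as not ready.
    """
    readiness = {}
    for node in json.loads(nodes_json).get("items", []):
        name = node["metadata"]["name"]
        conditions = node.get("status", {}).get("conditions", [])
        readiness[name] = any(
            cond.get("type") == "Ready" and cond.get("status") == "True"
            for cond in conditions
        )
    return readiness


def not_ready_nodes(nodes_json: str) -> list:
    """Return the names of nodes that are not Ready, sorted for stable output."""
    return sorted(
        name for name, ready in virtual_node_readiness(nodes_json).items() if not ready
    )
```

A canary job could save the `kubectl` output to a file, pass its contents to `not_ready_nodes`, and fail the rollout when the returned list is non-empty.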
Tooling
- Providers: Slurm, AWS Lambda, Azure ACI, Google Cloud Functions
- Monitoring: Prometheus + VirtualKubelet metrics endpoint (/metrics)
- Logging: Fluentd or Loki integration for provider-specific logs
- Debugging:

      kubectl describe pod <virtualized-pod>
      kubectl logs <virtual-kubelet-pod> --container=slurm-provider
Tradeoffs and Caveats
- Complexity: Adds another layer to debug (provider health, network policies, RBAC)
- Latency: External provisioning (e.g., Lambda cold starts) can delay pod startup
- Dependency: Provider stability risks (e.g., AWS Lambda service limits or outages)
- Not for stateful workloads: VirtualKubelet nodes often lack persistent storage guarantees
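The latency tradeoff above is easier to reason about with a number attached. The following is a rough sketch that measures startup delay from a single pod object as returned by `kubectl get pod <name> -o json`. It assumes the provider reports `containerStatuses` with `startedAt` timestamps, which not every VirtualKubelet provider is guaranteed to do; the function name `startup_latency_seconds` is hypothetical.

```python
from datetime import datetime
from typing import Optional


def _parse_ts(ts: str) -> datetime:
    # Kubernetes serializes these timestamps as UTC with second precision,
    # e.g. "2024-05-01T09:00:00Z".
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")


def startup_latency_seconds(pod: dict) -> Optional[float]:
    """Seconds from pod creation until the last container started.

    Returns None when any container has not started yet, or when the
    provider reports no container statuses at all.
    """
    created = _parse_ts(pod["metadata"]["creationTimestamp"])
    statuses = pod.get("status", {}).get("containerStatuses", [])
    if not statuses:
        return None
    starts = []
    for cs in statuses:
        state = cs.get("state", {})
        # Short-lived batch containers may already be terminated; both
        # states carry a startedAt field.
        started = (state.get("running") or state.get("terminated") or {}).get("startedAt")
        if started is None:
            return None
        starts.append(_parse_ts(started))
    return (max(starts) - created).total_seconds()
```

Comparing this value between pods on VirtualKubelet nodes and pods on regular nodes gives a first estimate of how much delay external provisioning adds.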
Troubleshooting Common Issues
- Pods stuck in Pending
  - Check provider logs for quota limits or authentication errors
  - Verify RBAC permissions for the VirtualKubelet service account
- Node not ready
  - Describe the VirtualKubelet node:

        kubectl describe node <virtual-node>

  - Ensure provider pods are running and connected
- Unexpected evictions
  - Monitor provider-specific resource limits (e.g., Lambda memory thresholds)
  - Adjust resource requests and limits in pod specs (these determine the pod's QoS class)
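For the Pending case above, it helps to separate pods the scheduler could not place from pods that were bound to the virtual node but never started by the provider. The following is a minimal sketch over the output of `kubectl get pods -o json`, parsed into a dict. The function name `pending_pod_reasons` is hypothetical, and the sketch uses only the standard `PodScheduled` condition and container `waiting` state.

```python
def pending_pod_reasons(pod_list: dict) -> list:
    """Return (pod name, reason, message) for every pod in phase Pending.

    An unscheduled pod reports the scheduler's reason (typically
    "Unschedulable"); a scheduled pod reports the first container
    waiting reason, which points at the provider side.
    """
    findings = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        if status.get("phase") != "Pending":
            continue
        unscheduled = [
            cond for cond in status.get("conditions", [])
            if cond.get("type") == "PodScheduled" and cond.get("status") == "False"
        ]
        waiting = [
            cs["state"]["waiting"]
            for cs in status.get("containerStatuses", [])
            if cs.get("state", {}).get("waiting")
        ]
        source = unscheduled[0] if unscheduled else (waiting[0] if waiting else {})
        findings.append((
            pod["metadata"]["name"],
            source.get("reason", "Unknown"),
            source.get("message", ""),
        ))
    return findings
```

An `Unschedulable` reason usually points at node selectors or taints, while a container waiting reason suggests looking at the provider logs and its quota or authentication state.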
Prevention and Maintenance
- Policy: Enforce node selectors for VirtualKubelet workloads to prevent accidental scheduling on physical nodes
- Monitoring: Alert on VirtualKubelet node health and provider-specific metrics
- Upgrades: Test provider updates in staging; version skew between VirtualKubelet and Kubernetes can cause scheduler conflicts
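The node-selector policy above could be checked before manifests reach the cluster, for example in CI. The following is a sketch under stated assumptions: the selector value is the one from the example policy earlier in this post, and the taint key `virtual-kubelet.io/provider` is assumed to be the one your virtual nodes carry, so confirm it with `kubectl describe node <virtual-node>`. The function name `routing_violations` is hypothetical.

```python
# Assumptions: adjust both constants to match your cluster.
VIRTUAL_NODE_SELECTOR = {"kubernetes.io/hostname": "virtual-kubelet-node"}
PROVIDER_TAINT_KEY = "virtual-kubelet.io/provider"


def _tolerates_provider_taint(toleration: dict) -> bool:
    # A toleration with no key and operator "Exists" matches every taint.
    if not toleration.get("key"):
        return toleration.get("operator") == "Exists"
    # The taint value and effect are not compared here; this is only a key check.
    return toleration["key"] == PROVIDER_TAINT_KEY


def routing_violations(pod: dict) -> list:
    """Return human-readable problems that would keep a pod off the virtual node."""
    spec = pod.get("spec", {})
    problems = []
    selector = spec.get("nodeSelector") or {}
    for key, value in VIRTUAL_NODE_SELECTOR.items():
        if selector.get(key) != value:
            problems.append(f"nodeSelector must set {key}={value}")
    tolerations = spec.get("tolerations") or []
    if not any(_tolerates_provider_taint(t) for t in tolerations):
        problems.append(f"no toleration for taint key {PROVIDER_TAINT_KEY}")
    return problems
```

An empty result means the manifest is pinned to the virtual node and tolerates its taint; anything else can fail the pipeline. An admission policy inside the cluster would be the stricter option, since a CI check only covers manifests that pass through CI.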
VirtualKubelet isn’t a silver bullet; it’s a tool for specific gaps. If your workload fits the pattern (ephemeral, external, or bursty), it reduces operational overhead. Otherwise, stick to standard nodes.
Source thread: Why are you running VirtualKubelets?
