Rethinking the Kubernetes pod-and-node relationship for AI workloads
Extend Kubernetes pods to external environments without re-architecting your stack
AI workloads require high performance, scalability, and strong security, and Kubernetes provides a great platform for orchestrating them.
However, Kubernetes ties pods to specific worker nodes, which introduces some limitations.
Let’s take an example of a simple AI workload deployed as a single pod in the Kubernetes cluster. If this pod requires a specialised GPU or any other accelerator, you must first ensure that a worker node has the specific GPU(s) or accelerator devices available for the pod to use. From an operations standpoint, this means creating a new worker node and adding it to the cluster.
Likewise, if an AI workload needs more compute (CPU, memory) than any worker node in the cluster can provide, you must follow the same procedure of creating a new worker node of the required size and adding it to the cluster.
What if you want to run the AI workload on a machine with a different architecture (e.g. ARM) for cost reasons? You’ll need to follow the same procedure of creating and adding a new worker node to the cluster.
What if AI workloads could be deployed in an environment external to the worker node(s) while still being managed seamlessly within the Kubernetes cluster?
Can we relax the relationship between the pod and the Kubernetes worker node?
This is where the remote hypervisor feature of the Kata Containers runtime comes into play. Using the Kata remote hypervisor feature, you can run pods external to the worker nodes. We also use the term peer-pods to refer to the pods created externally by the Kata remote hypervisor.
The following diagram shows the high-level architecture. The cloud-api-adaptor is an implementation of the Kata remote hypervisor.
How the Kata remote hypervisor enables flexible AI workload deployment
The Kata remote hypervisor extends Kubernetes’ compute capabilities by enabling pods to run on external infrastructure (outside of worker nodes).
It ensures that:
- The pod remains a first-class Kubernetes object, fully manageable via the control plane.
- There is no change in the user experience when interacting with the pod.
- AI workloads can execute on specialised accelerators (GPUs, TPUs, AI chips) or even different architecture platforms (e.g. ARM) without modifying the primary Kubernetes cluster infrastructure.
- You can apply policies (Kata agent policies and eBPF policies) to confine the workload and protect your infrastructure, or to protect the workload from the infrastructure, depending on your needs (see the sketch after this list).
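For the Kata agent policy in particular, here is a minimal sketch of how a policy is typically attached to a pod. The annotation key is an assumption based on the upstream Kata agent policy mechanism, and the Rego policy document itself is elided; it is usually generated with a policy tool rather than written by hand.

metadata:
  annotations:
    # Base64-encoded Rego policy enforced by the Kata agent inside the pod VM.
    # Annotation key assumed from the upstream Kata agent policy feature;
    # the encoded policy value is intentionally elided here.
    io.katacontainers.config.agent.policy: <base64-encoded-rego-policy>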
The following diagram describes the high-level pod creation workflow using the Kata remote hypervisor. As shown, when a user creates a pod with the kata-remote runtimeClass, the Kubernetes scheduler schedules the pod to a worker node that has the Kata remote hypervisor configured. Upon receiving the pod creation request, the Kata remote hypervisor on the worker node creates an AWS EC2 instance, sets up the networking and runs the pod in it. From a user standpoint, you continue working with the pod like any other pod in the cluster.
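To make this concrete, here is a minimal sketch of the objects involved: a RuntimeClass that points at the remote hypervisor handler, and a pod that opts in via runtimeClassName. The handler name and node-selector label are assumptions based on a typical kata-deploy style installation, and the image name is a placeholder; adjust these to match your environment.

# RuntimeClass pointing at the Kata remote hypervisor handler (assumed names).
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: kata-remote
handler: kata-remote
scheduling:
  nodeSelector:
    # Assumed label on nodes where the remote hypervisor is configured.
    katacontainers.io/kata-runtime: "true"
---
# A pod becomes a peer-pod simply by referencing the RuntimeClass.
apiVersion: v1
kind: Pod
metadata:
  name: ai-inference-sample
spec:
  runtimeClassName: kata-remote
  containers:
  - name: inference
    image: <some-registry>/<user>/ai-inference-image   # placeholder image

From here, kubectl get pods, logs and exec behave exactly as they do for any other pod, even though the pod itself runs in an external VM.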
Why does this matter for AI workloads?
✅ Elastic Compute Scaling — Deploy your AI workloads on demand across different environments while still managing them from a single Kubernetes cluster deployed in one environment.
✅ Enhanced Security and Isolation — Keep your sensitive AI models, data, and inference pipelines isolated from the shared worker nodes, reducing attack surfaces.
✅ Confidential Computing Guarantees — Run workloads that require confidential computing guarantees in the cloud, even when the cluster itself is deployed on-premises.
✅ Seamless Multi-Cloud Deployment — Run AI agents or inference workloads across different cloud providers while maintaining control via a single Kubernetes cluster deployed in one environment.
✅ Optimised Resource Utilisation — Avoid overloading Kubernetes worker nodes by offloading AI inference to external compute resources as needed.
✅ Extend your development environment from laptop to cloud — During experimentation and development on your laptop, if you need access to specialised hardware for testing, you can use this approach to spin up a pod in the cloud from your locally installed Kubernetes cluster.
Is this the right approach for all use cases?
While the Kata remote hypervisor has clear advantages, it may not be ideal for every need. A peer pod has higher startup latency and is more resource-heavy than a regular pod, since each peer pod runs in its own virtual machine.
Further, if your workload relies on Kubernetes CSI storage for persistent data, it won’t work as-is.
You’ll need to make the data available to the pod directly via a network mount or by using the relevant SDKs. This is not an issue for most AI workloads, as they typically use SDKs to access the data directly.
For example, there is the Amazon S3 Connector for PyTorch, and there are implementations of the Python Filesystem Interface (fsspec) from the major cloud providers.
Some references:
- https://github.com/awslabs/s3-connector-for-pytorch
- https://github.com/Azure/azure-sdk-for-python
- https://github.com/fsspec/s3fs
- https://github.com/fsspec/adlfs
- https://github.com/fsspec/gcsfs
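To make the SDK-based approach concrete, here is a minimal sketch using s3fs (one of the fsspec implementations listed above) to read objects directly from S3 instead of relying on CSI-backed volumes. The bucket and object names are hypothetical placeholders, and credentials are assumed to come from the usual AWS environment variables or an attached instance role.

# Minimal sketch: stream training data straight from S3 with fsspec/s3fs.
# Bucket and key names below are hypothetical placeholders.
import s3fs

# Credentials are picked up from AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
# environment variables (e.g. injected from a Kubernetes Secret) or a role.
fs = s3fs.S3FileSystem()

# List objects under a dataset prefix.
for path in fs.ls("my-training-bucket/datasets/"):
    print(path)

# Read a single object without writing it to local disk first.
with fs.open("my-training-bucket/datasets/sample.csv", "rb") as f:
    data = f.read()
print(f"read {len(data)} bytes")

The Amazon S3 Connector for PyTorch follows the same idea, exposing S3 prefixes directly as PyTorch datasets, so the workload never needs a persistent volume.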
If it’s not possible to use an SDK in your application for accessing storage, you can use tools like mountpoint-s3 to mount S3-compatible storage as a directory inside the pod.
You can run mountpoint-s3, packaged in a container image, as a sidecar container to mount the S3 bucket as a directory and access it from your workload container.
Here is a sample pod manifest snippet based on the approach described above.
apiVersion: v1
kind: Pod
metadata:
  name: test-ai
  labels:
    app: test-ai
spec:
  runtimeClassName: kata-remote
  initContainers:
  # Native sidecar (init container with restartPolicy: Always) that mounts
  # the S3 bucket via FUSE; privileged access is needed to create /dev/fuse.
  - name: s3-mount
    image: <some-registry>/<user>/mountpoint-s3-container
    command: ["sh", "-c"]
    args:
    - mknod /dev/fuse -m 0666 c 10 229 && mkdir /mycontainer && mount-s3 "$S3_BUCKET" /mycontainer && sleep infinity
    securityContext:
      privileged: true
    env:
    - name: S3_BUCKET
      valueFrom:
        secretKeyRef:
          name: s3-secret
          key: S3_BUCKET
    - name: AWS_ACCESS_KEY_ID
      valueFrom:
        secretKeyRef:
          name: s3-secret
          key: AWS_ACCESS_KEY_ID
    - name: AWS_SECRET_ACCESS_KEY
      valueFrom:
        secretKeyRef:
          name: s3-secret
          key: AWS_SECRET_ACCESS_KEY
    restartPolicy: Always
  containers:
  - name: ai-sample
    ...
Conclusion
For AI workloads, the peer-pods approach based on the Kata remote hypervisor unlocks a powerful way to balance flexibility and security while keeping Kubernetes at the core of your AI infrastructure.
To try out peer-pods, head to the following GitHub project. If you are looking for an enterprise-supported option, take a look at the following link.
I hope this helps you when planning your AI infrastructure stack.