Getting Started with GPU Workloads in Managed Kubernetes
GPU node groups in UpCloud Managed Kubernetes (UKS) come with NVIDIA GPU drivers pre-installed and the hardware exposed directly to the container runtime.
To enable Kubernetes to schedule and track GPU resources (using nvidia.com/gpu resource limits), you must install an NVIDIA cluster component. You have two paths to choose from depending on your architectural needs:
- Option A: NVIDIA Device Plugin (Lightweight & Simple) – Ideal for standard workloads where containers consume whole, un-partitioned GPUs (e.g., on NVIDIA L4 or L40S nodes).
- Option B: NVIDIA GPU Operator (Advanced Features & Slicing) – Required if you want to use Multi-Instance GPU (MIG) on supported hardware (e.g., NVIDIA H100 or B200) or advanced scheduling features.
Note: The NVIDIA GPU Operator already bundles the device plugin. If you deploy both independently, they will conflict, causing double-allocation errors and scheduling failures. Choose only one option below.
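One way to guard against the conflict described above is to look for an existing device plugin DaemonSet before installing either option. The sketch below is illustrative only: the function names `list_daemonsets` and `count_device_plugins` are hypothetical helpers, and the match assumes the DaemonSet name contains the string `nvidia-device-plugin`.

```shell
# Hypothetical helper: print every DaemonSet in the cluster as "namespace/name".
list_daemonsets() {
  kubectl get daemonsets --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name}{"\n"}{end}'
}

# Hypothetical helper: read "namespace/name" lines on stdin and print how many
# mention the NVIDIA device plugin. Assumes the name contains that substring.
count_device_plugins() {
  grep -c 'nvidia-device-plugin' || true
}

# Usage against a live cluster (not executed here):
#   list_daemonsets | count_device_plugins
```

A non-zero count before installation suggests a device plugin is already deployed, so it is worth inspecting the cluster before continuing.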
Prerequisites
- An UpCloud Managed Kubernetes cluster with a GPU node group added.
- kubectl and helm configured in your local terminal.
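A quick way to confirm the local tooling is in place might look like the following. This is a minimal sketch; `require_cmd` is a hypothetical helper name, not part of kubectl or helm.

```shell
# Hypothetical helper: return 0 if the named command is on PATH,
# otherwise print a message to stderr and return 1.
require_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "missing required tool: $1" >&2
  return 1
}

# Usage (not executed here): check both tools, then confirm cluster access.
#   require_cmd kubectl && require_cmd helm && kubectl get nodes
```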
Option A: Install the NVIDIA Device Plugin (Simple Path)
If you only need basic GPU scheduling and do not intend to slice your GPUs into smaller pieces, the standalone device plugin is the lightest approach.
1. Label your GPU nodes
Before installing the plugin, you must ensure your GPU nodes are correctly targeted. The NVIDIA Device Plugin relies on Node Feature Discovery (NFD) style labels to identify NVIDIA hardware.
Apply the official NVIDIA vendor PCI presence label to your GPU node(s):
kubectl label node <GPU-NODE-NAME> feature.node.kubernetes.io/pci-10de.present=true
2. Deploy the Helm chart
Run the following commands to install it:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin -n nvidia-device-plugin --create-namespace
If your cluster contains a mix of GPU and non-GPU node groups, install with a node selector instead so the DaemonSet only targets your GPU-capable hardware:
helm install nvidia-device-plugin nvdp/nvidia-device-plugin -n nvidia-device-plugin --create-namespace \
  --set nodeSelector.gpu='NVIDIA-L40S'
The plugin will now only target nodes labeled with gpu: NVIDIA-L40S.
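If you prefer to keep the selector in a file rather than on the command line, the same setting can be expressed as Helm values. The sketch below assumes the `gpu: NVIDIA-L40S` label from the example above; `write_values` and the file name `device-plugin-values.yaml` are hypothetical.

```shell
# Hypothetical helper: write a values file equivalent to
# --set nodeSelector.gpu='NVIDIA-L40S' to the path given as $1.
write_values() {
  cat > "$1" <<'EOF'
nodeSelector:
  gpu: NVIDIA-L40S
EOF
}

# Usage against a live cluster (not executed here):
#   write_values device-plugin-values.yaml
#   helm upgrade --install nvidia-device-plugin nvdp/nvidia-device-plugin \
#     -n nvidia-device-plugin --create-namespace -f device-plugin-values.yaml
#   kubectl get nodes -l gpu=NVIDIA-L40S   # nodes the selector will match
```

Listing the nodes with the same label selector is a cheap way to confirm the plugin will land where you expect.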
Option B: Install the NVIDIA GPU Operator
If you intend to implement hardware slicing or advanced policy management, use the GPU Operator. Because UpCloud already provides and manages the underlying NVIDIA drivers and container toolkit on the host, disable the operator's driver and toolkit installation in Helm so that it does not overwrite the host configuration.
Execute the following commands to install the GPU Operator:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false
Verify the installation
Regardless of whether you chose Option A or Option B, you need to ensure the scheduling resources are properly exposed to Kubernetes.
1. Check the DaemonSet status
Verify that your choice of plugin daemon is actively scheduled and running on your nodes:
For Device Plugin (Option A):
kubectl -n nvidia-device-plugin get ds nvidia-device-plugin
For GPU Operator (Option B):
kubectl -n gpu-operator get ds nvidia-device-plugin-daemonset
Ensure that the DESIRED and AVAILABLE counts match the total number of GPU nodes in your cluster. The GPU Operator's own controller pods can usually run on non-GPU nodes without issue.
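The DESIRED and AVAILABLE comparison can also be scripted. The following is a minimal sketch, assuming the DaemonSet status fields `desiredNumberScheduled` and `numberAvailable`; the helper names `ds_counts` and `ds_ready` are hypothetical.

```shell
# Hypothetical helper: print "<desired> <available>" for a DaemonSet.
# $1 is the namespace, $2 is the DaemonSet name.
ds_counts() {
  kubectl -n "$1" get ds "$2" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberAvailable}'
}

# Hypothetical helper: read "<desired> <available>" from stdin and succeed
# only when desired is greater than zero and equal to available.
ds_ready() {
  read -r desired available || true
  [ -n "$desired" ] && [ "$desired" -gt 0 ] 2>/dev/null && [ "$desired" = "$available" ]
}

# Usage against a live cluster (not executed here):
#   ds_counts <NAMESPACE> <DAEMONSET-NAME> | ds_ready && echo "device plugin ready"
```

A missing available count is treated as not ready, which covers the case where no plugin pod has started yet.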
2. Confirm the GPU Capacity
Verify that the nvidia.com/gpu resource capacity is actively being reported by your nodes:
# List your nodes to get names
kubectl get nodes
# Inspect your specific GPU node
kubectl describe node <GPU-NODE-NAME>
Look for the Capacity and Allocatable blocks in the output. You should see the resource listed:
Capacity:
  nvidia.com/gpu: 1
Allocatable:
  nvidia.com/gpu: 1
Run a quick GPU smoke test
To confirm everything works end-to-end, deploy a one-off Job that requests a single GPU and queries its status via nvidia-smi.
Create a file named gpu-smoke-test.yaml:
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["bash", "-lc", "nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1
Apply the manifest, wait for it to execute, and inspect the logs:
kubectl apply -f gpu-smoke-test.yaml
kubectl wait --for=condition=complete job/gpu-smoke-test --timeout=60s
kubectl logs job/gpu-smoke-test
If everything is functioning correctly, you’ll see the GPU model and driver info.
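To turn that visual check into a pass/fail result, you could search the Job logs for the driver banner that nvidia-smi prints. This is a sketch under the assumption that a successful run includes the text `Driver Version:`; `logs_show_gpu` is a hypothetical helper name.

```shell
# Hypothetical helper: succeed if stdin contains the nvidia-smi driver banner.
# Assumes a healthy nvidia-smi run prints a line containing "Driver Version:".
logs_show_gpu() {
  grep -q 'Driver Version:'
}

# Usage against a live cluster (not executed here): verify, then clean up.
#   kubectl logs job/gpu-smoke-test | logs_show_gpu \
#     && kubectl delete job gpu-smoke-test
```

Matching on the driver line rather than on the tool name avoids a false pass when nvidia-smi starts but cannot reach the driver.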
Once verified, your UpCloud Managed Kubernetes cluster is ready to schedule real GPU workloads.
Next Steps for Slicing & MIG
If you want to split up your GPUs into smaller units, your approach depends on the hardware architecture:
- For MIG-supported hardware: If you are running hardware that supports Multi-Instance GPU (MIG) partitioning (such as NVIDIA H100 or B200), you can configure dedicated hardware-level slices. Please follow our comprehensive NVIDIA B200 and H100 Multi-Instance GPU (MIG) Configuration Guide to set up custom slicing layouts.
- For L40S and other architecture types: If you are running hardware that does not natively support MIG (such as the NVIDIA L40S), you can still safely oversubscribe and split up your devices using software-based sharing. Please follow our NVIDIA Time-Slicing on Kubernetes Guide to define logical GPU replicas.
