Getting Started with GPU Workloads in Managed Kubernetes

Last updated on: June 1, 2026

GPU node groups in UpCloud Managed Kubernetes (UKS) come with NVIDIA GPU drivers pre-installed and the hardware exposed directly to the container runtime.

To enable Kubernetes to schedule and track GPU resources (using nvidia.com/gpu resource limits), you must install an NVIDIA cluster component. You have two paths to choose from depending on your architectural needs:

Option A: NVIDIA Device Plugin (Lightweight & Simple) – Ideal for standard workloads where containers consume whole, un-partitioned GPUs (e.g., on NVIDIA L4 or L40S nodes).
Option B: NVIDIA GPU Operator (Advanced Features & Slicing) – Required if you want to use Multi-Instance GPU (MIG) on supported hardware (e.g., NVIDIA H100 or B200) or advanced scheduling features.

Warning

The NVIDIA GPU Operator already contains the device plugin internally. If you deploy both independently, they will conflict, causing double-allocation errors and scheduling failures. Choose only one option below.
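Before installing either option, it can help to confirm that the other one is not already present. The following is a minimal sketch, not an official check: the helper name `detect_gpu_component_conflict` is hypothetical, and it assumes the Helm release names used in this guide (`nvidia-device-plugin` and `gpu-operator`).

```shell
# Reads Helm release names (one per line) from stdin and reports whether
# both the standalone device plugin and the GPU Operator are installed.
# Assumption: release names match the ones used in this guide.
detect_gpu_component_conflict() {
  releases=$(cat)
  has_plugin=no
  has_operator=no
  if printf '%s\n' "$releases" | grep -qx 'nvidia-device-plugin'; then
    has_plugin=yes
  fi
  if printf '%s\n' "$releases" | grep -qx 'gpu-operator'; then
    has_operator=yes
  fi
  if [ "$has_plugin" = yes ] && [ "$has_operator" = yes ]; then
    echo "conflict: both nvidia-device-plugin and gpu-operator are installed"
    return 1
  fi
  echo "ok"
}
```

A typical call would pipe the release list into the helper: `helm list --all-namespaces --short | detect_gpu_component_conflict`.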

Prerequisites

An UpCloud Managed Kubernetes cluster with a GPU node group added.
kubectl and helm configured in your local terminal.

Option A: Install the NVIDIA Device Plugin (Simple Path)

If you only need basic GPU scheduling and do not intend to slice your GPUs into smaller pieces, the standalone device plugin is the lightest approach.

1. Label your GPU nodes

Before installing the plugin, you must ensure your GPU nodes are correctly targeted. The NVIDIA Device Plugin relies on Node Feature Discovery (NFD) style labels to identify NVIDIA hardware.

Apply the official NVIDIA vendor PCI presence label to your GPU node(s):

kubectl label node <GPU-NODE-NAME> feature.node.kubernetes.io/pci-10de.present=true
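If the node group contains several GPU nodes, the same label can be applied in a loop and then checked with a label selector. This is a small sketch: the function names `label_gpu_nodes` and `list_labeled_gpu_nodes` are hypothetical helpers, and the node names you pass in are your own.

```shell
# Applies the NVIDIA PCI vendor label to every node name passed as an argument.
# --overwrite keeps the call safe to repeat if the label already exists.
label_gpu_nodes() {
  for node in "$@"; do
    kubectl label node "$node" \
      feature.node.kubernetes.io/pci-10de.present=true --overwrite
  done
}

# Lists the nodes that currently carry the label, to confirm the result.
list_labeled_gpu_nodes() {
  kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true
}
```

For example, `label_gpu_nodes <GPU-NODE-1> <GPU-NODE-2>` followed by `list_labeled_gpu_nodes` should show both nodes.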

2. Deploy the Helm chart

Run the following commands to install it:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update
helm install nvidia-device-plugin nvdp/nvidia-device-plugin -n nvidia-device-plugin --create-namespace

If your cluster contains a mix of GPU and non-GPU node groups, pass a node selector so the daemon only targets your GPU-capable hardware:

helm install nvidia-device-plugin nvdp/nvidia-device-plugin -n nvidia-device-plugin --create-namespace \
  --set nodeSelector.gpu='NVIDIA-L40S'

The plugin will now only target nodes labeled with gpu: NVIDIA-L40S.
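A node selector that matches no nodes leaves the DaemonSet with zero pods, so it is worth counting the matching nodes before installing. The sketch below assumes your GPU nodes really carry the `gpu=NVIDIA-L40S` label from the example above; `count_nodes_matching` is a hypothetical helper name.

```shell
# Counts the nodes that match a label selector such as gpu=NVIDIA-L40S.
# `kubectl get nodes -o name` prints one "node/<name>" line per match.
count_nodes_matching() {
  kubectl get nodes -l "$1" -o name | wc -l | tr -d ' '
}
```

Running `count_nodes_matching gpu=NVIDIA-L40S` should print the number of GPU nodes you expect the plugin to target. If it prints 0, check the labels with `kubectl get nodes --show-labels`.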

Option B: Install the NVIDIA GPU Operator

If you intend to implement hardware slicing or advanced policy management, use the GPU Operator. Because UpCloud already provides and manages the underlying NVIDIA drivers and container toolkit on the host, disable the operator's driver and toolkit installation through Helm flags so that it does not overwrite the host configuration.

Execute the following commands to install the GPU Operator:

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia && helm repo update

helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set driver.enabled=false \
  --set toolkit.enabled=false

Verify the installation

Regardless of whether you chose Option A or Option B, you need to ensure the scheduling resources are properly exposed to Kubernetes.

1. Check the DaemonSet status

Verify that your choice of plugin daemon is actively scheduled and running on your nodes:

For Device Plugin (Option A):

kubectl -n nvidia-device-plugin get ds nvidia-device-plugin

For GPU Operator (Option B):

kubectl -n gpu-operator get ds nvidia-device-plugin-daemonset

Ensure that the DESIRED and AVAILABLE counts match the total number of GPU nodes in your cluster. With Option B, the GPU Operator's own controller pod can usually run on non-GPU nodes without issue.
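The same comparison can be scripted by reading the DaemonSet status fields directly. This is a sketch under the assumption that you pass the namespace and DaemonSet name from your own installation; `daemonset_ready` is a hypothetical helper name.

```shell
# Succeeds when a DaemonSet's desired and available pod counts match.
# Usage: daemonset_ready <namespace> <daemonset-name>
daemonset_ready() {
  counts=$(kubectl -n "$1" get ds "$2" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberAvailable}')
  desired=${counts%% *}
  available=${counts#* }
  # numberAvailable is omitted from the status while it is zero.
  [ -n "$available" ] || available=0
  echo "desired=$desired available=$available"
  [ "$desired" = "$available" ]
}
```

For the Option A install above, the call would be `daemonset_ready nvidia-device-plugin nvidia-device-plugin`. The helper only compares the two counts, so you still need to check that the desired count equals the number of GPU nodes.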

2. Confirm the GPU Capacity

Verify that the nvidia.com/gpu resource capacity is actively being reported by your nodes:

# List your nodes to get names
kubectl get nodes

# Inspect your specific GPU node
kubectl describe node <GPU-NODE-NAME>

Look for the Capacity and Allocatable blocks in the output. You should see the resource listed successfully:

Capacity:
  nvidia.com/gpu:  1
Allocatable:
  nvidia.com/gpu:  1
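On clusters with many nodes, a single table of allocatable GPUs is easier to read than describing each node in turn. The following sketch uses kubectl custom columns; the helper names `list_gpu_allocatable` and `total_gpu_allocatable` are hypothetical.

```shell
# Prints each node with its allocatable nvidia.com/gpu count.
# Nodes that do not report the resource show <none>.
list_gpu_allocatable() {
  kubectl get nodes \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Sums the GPU column, skipping the header row and <none> entries.
total_gpu_allocatable() {
  list_gpu_allocatable |
    awk 'NR > 1 && $2 ~ /^[0-9]+$/ { sum += $2 } END { print sum + 0 }'
}
```

`total_gpu_allocatable` should print the number of GPUs you expect across all GPU node groups. A lower number suggests that the plugin is not running on every GPU node.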

Run a quick GPU smoke test

To ensure everything is working end-to-end, deploy a one-off Job that requests a single GPU and queries its operational status via nvidia-smi.

Create a file named gpu-smoke-test.yaml:

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-smoke-test
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: cuda
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          command: ["bash", "-lc", "nvidia-smi"]
          resources:
            limits:
              nvidia.com/gpu: 1

Apply the manifest, wait for it to execute, and inspect the logs:

kubectl apply -f gpu-smoke-test.yaml
kubectl wait --for=condition=complete job/gpu-smoke-test --timeout=60s
kubectl logs job/gpu-smoke-test

If everything is functioning correctly, you’ll see the GPU model and driver info.
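To make the check scriptable and to clean up afterwards, the logs can be searched for the banner that nvidia-smi prints at the top of its output. This is a minimal sketch: `smoke_test_passed` and `cleanup_smoke_test` are hypothetical helper names, and the Job name is assumed to be the one from the manifest above.

```shell
# Reads Job logs from stdin and succeeds if they contain the nvidia-smi banner.
smoke_test_passed() {
  grep -q 'NVIDIA-SMI'
}

# Removes the finished Job and its pod; does nothing if it is already gone.
cleanup_smoke_test() {
  kubectl delete job gpu-smoke-test --ignore-not-found
}
```

A possible sequence is `kubectl logs job/gpu-smoke-test | smoke_test_passed && cleanup_smoke_test`, which deletes the Job only after a successful run, so a failed Job stays available for debugging.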

Once verified, your UpCloud Managed Kubernetes cluster is fully prepared to schedule real GPU workloads.

Next Steps for Slicing & MIG

If you want to split up your GPUs into smaller units, your approach depends on the hardware architecture:

For MIG-supported hardware: If you are running hardware that supports Multi-Instance GPU (MIG) partitioning (such as NVIDIA H100 or B200), you can configure dedicated hardware-level slices. Please follow our comprehensive NVIDIA B200 and H100 Multi-Instance GPU (MIG) Configuration Guide to set up custom slicing layouts.
For L40S and other architecture types: If you are running hardware that does not natively support MIG (such as the NVIDIA L40S), you can still safely oversubscribe and split up your devices using software-based sharing. Please follow our NVIDIA Time-Slicing on Kubernetes Guide to define logical GPU replicas.

Contributed by: Ville Vesilehto
