Time-Slicing with NVIDIA GPUs

Last updated on: June 1, 2026

GPU Time-Slicing allows you to oversubscribe a single physical GPU by interleaving multiple workloads on the same compute engine. The system creates virtual resource replicas, allowing multiple containers to share the GPU. This is ideal for light AI/ML inference, development environments, and small parallel processing jobs that don't constantly saturate the hardware.

Supported GPU Architectures

NVIDIA L40S & L4: Support time-slicing, which divides the accelerator among multiple smaller workloads.
NVIDIA B200 & H100: Also support hardware-level MIG partitioning for strict isolation. See our dedicated guide, Configuring Multi-Instance GPU (MIG) on NVIDIA B200 and H100.

Prerequisites

A running UpCloud Managed Kubernetes (UKS) cluster.
A node group equipped with an NVIDIA GPU that supports time-slicing (referred to as <GPU-NODE>).
kubectl and helm installed and configured.

1. Choose a Deployment Method

You can configure time-slicing using either the lightweight NVIDIA Device Plugin or the more advanced NVIDIA GPU Operator. For a complete breakdown of both options, see our base guide, Getting Started with GPU Workloads in Managed Kubernetes. This guide walks you through configuring time-slicing with the standalone NVIDIA Device Plugin deployment.

2. Create the Time-Slicing Configuration

Define how many slices you want. Create a local file named time-slicing-config.yaml.

This configuration instructs the plugin to advertise 10 virtual replicas of each physical GPU:

version: v1
sharing:
  timeSlicing:
    renameByDefault: true
    resources:
      - name: nvidia.com/gpu
        replicas: 10

3. Upgrade the Device Plugin

This step assumes you have already installed the NVIDIA Device Plugin by following our documentation. You can now upgrade the device plugin chart with your configuration.

$ helm upgrade nvidia-device-plugin nvdp/nvidia-device-plugin -n nvidia-device-plugin \
    --set-file config.map.config=time-slicing-config.yaml

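The --set-file flag reads the contents of time-slicing-config.yaml and passes them to the chart as the value of config.map.config, which the plugin then loads as its configuration.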
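If you prefer to keep chart settings in a values file, the same setting can likely be expressed as shown below. This is a minimal sketch, assuming the chart accepts the configuration under the same config.map.config key used by the command above; the file name values-time-slicing.yaml is hypothetical.

```yaml
# values-time-slicing.yaml (hypothetical file name)
# Intended to mirror: --set-file config.map.config=time-slicing-config.yaml
config:
  map:
    # "config" is the key name used in the --set-file flag above
    config: |-
      version: v1
      sharing:
        timeSlicing:
          renameByDefault: true
          resources:
            - name: nvidia.com/gpu
              replicas: 10
```

With this approach you would pass -f values-time-slicing.yaml to helm upgrade in place of the --set-file flag. Keeping the values in a file makes the configuration easier to review and track in version control.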

4. Verify the Shared Resources

Once the plugin pods are running, check that the Kubernetes scheduler recognizes the multiplied capacity by inspecting the node's resources:

$ kubectl describe node <GPU-NODE>

Look for the Capacity and Allocatable blocks. You should see 10 virtual replicas available for scheduling:

Capacity:
  nvidia.com/gpu.shared:  10
Allocatable:
  nvidia.com/gpu.shared:  10

Your applications can now target these shared slices by adding a standard resource limit to their deployment specs (e.g., limits: nvidia.com/gpu.shared: 1). Multiple pods will run concurrently on the same card until all 10 replica slots are consumed.
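As an illustration of such a limit, the following Deployment sketch starts three pods that each request one shared slice. The name time-slicing-demo and the container image tag are assumptions chosen for the example, not values from this guide; substitute your own workload image. If your GPU node group is tainted, you may also need to add matching tolerations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: time-slicing-demo          # hypothetical name
spec:
  replicas: 3                      # three pods, each taking one of the 10 slices
  selector:
    matchLabels:
      app: time-slicing-demo
  template:
    metadata:
      labels:
        app: time-slicing-demo
    spec:
      containers:
        - name: cuda
          # Assumed image; any CUDA-capable image should work
          image: nvidia/cuda:12.4.1-base-ubuntu22.04
          # List the visible GPU, then keep the pod alive for inspection
          command: ["sh", "-c", "nvidia-smi -L && sleep infinity"]
          resources:
            limits:
              nvidia.com/gpu.shared: 1   # one time-sliced replica
```

If all three pods reach the Running state and their logs list the same physical GPU, the slices are being shared as expected.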

Looking ahead: Dynamic Resource Allocation (DRA)

🚀 Coming soon: Dynamic Resource Allocation

As of Kubernetes 1.35, Dynamic Resource Allocation (DRA) is generally available (GA) as an advancement beyond traditional static device plugins. DRA allows workloads to request and reconfigure specific hardware resources, such as the slices configured here. A dedicated guide for setting up DRA is coming soon. You can read more about DRA here.

Contributed by: Onni Pylvänen