- NVIDIA L40S & L4: Support Time-Slicing, which divides a single accelerator among multiple smaller workloads.
- NVIDIA B200 & H100: Also support hardware-level MIG partitioning for strict isolation. See our dedicated guide on Configuring Multi-Instance GPU (MIG) on NVIDIA B200 and H100.
Time-Slicing with NVIDIA GPUs
GPU Time-Slicing allows you to oversubscribe a single physical GPU by interleaving multiple workloads on the same compute engine. The system creates virtual resource replicas, allowing multiple containers to share the GPU. This is ideal for light AI/ML inference, development environments, and small parallel processing jobs that don't constantly saturate the hardware.
Prerequisites
- A running UpCloud Managed Kubernetes (UKS) cluster.
- A node group equipped with an NVIDIA GPU that supports Time-Slicing (referred to as <GPU-NODE>).
- kubectl and helm installed and configured.
1. Choose a Deployment Method
You can configure Time-Slicing using either the lightweight NVIDIA Device Plugin or the more advanced NVIDIA GPU Operator. For a complete breakdown of both choices, see our base guide, Getting Started with GPU Workloads in Managed Kubernetes. This guide walks you through configuring Time-Slicing using the standalone NVIDIA Device Plugin deployment.
2. Create the Time-Slicing Configuration
Define how many slices you want. Create a local file named time-slicing-config.yaml.
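This layout instructs the plugin to create 10 virtual replicas of the physical card. Because renameByDefault is set to true, the replicas are advertised as nvidia.com/gpu.shared rather than nvidia.com/gpu: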
version: v1
sharing:
  timeSlicing:
    renameByDefault: true
    resources:
    - name: nvidia.com/gpu
      replicas: 10
3. Upgrade the Device Plugin
This step assumes you have already installed the NVIDIA Device Plugin by following our documentation. Upgrade the device plugin chart with your configuration:
$ helm upgrade nvidia-device-plugin nvdp/nvidia-device-plugin -n nvidia-device-plugin \
    --set-file config.map.config=time-slicing-config.yaml
The --set-file flag reads your local configuration file and passes its contents to the chart as the plugin configuration.
4. Verify the Shared Resources
Once the plugin pods are running, check that the Kubernetes scheduler recognizes the multiplied capacity by describing the node:
$ kubectl describe node <GPU-NODE>
Look for the Capacity and Allocatable blocks. You should see 10 virtual replicas available for scheduling:
Capacity:
  nvidia.com/gpu.shared: 10
Allocatable:
  nvidia.com/gpu.shared: 10
Your applications can now target these shared slices by adding standard resource limits to their pod specs (e.g., limits: nvidia.com/gpu.shared: 1). Multiple pods will run concurrently on the same card until all 10 replica slots are consumed.
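To illustrate how a workload might claim a slice, here is a minimal sketch of a Deployment whose pods each request one shared replica. The name timeslice-demo and the container image are placeholders chosen for this example, not part of the setup above. Depending on how your GPU node group is configured, you may also need tolerations or a runtimeClassName, which are omitted here.

```yaml
# Hypothetical demo workload: three pods, each claiming one time-sliced replica.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: timeslice-demo            # placeholder name
spec:
  replicas: 3                     # all three can land on the same physical GPU
  selector:
    matchLabels:
      app: timeslice-demo
  template:
    metadata:
      labels:
        app: timeslice-demo
    spec:
      containers:
      - name: cuda
        # Assumed image; substitute your own CUDA-based application image.
        image: nvidia/cuda:12.4.1-base-ubuntu22.04
        # List the visible GPU, then idle so the pod keeps its slice.
        command: ["sh", "-c", "nvidia-smi -L && sleep 3600"]
        resources:
          limits:
            nvidia.com/gpu.shared: 1   # one of the 10 advertised replicas
```

If this is applied with kubectl apply -f, all three pods should be schedulable onto the same GPU node, and the remaining shared capacity shrinks accordingly. Keep in mind that Time-Slicing provides no memory isolation between pods, so every workload sharing the card draws from the same GPU memory pool.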
Looking ahead: Dynamic Resource Allocation (DRA)
As of Kubernetes 1.35, Dynamic Resource Allocation (DRA) has graduated to GA as an advancement beyond traditional static device plugins. DRA allows workloads to request and reconfigure specific hardware resources, such as the slices configured here. A dedicated guide for setting up DRA is coming soon. You can read more about DRA here.
