Fractional GPUs using Nvidia's KAI Scheduler¶

At KubeCon Europe in April 2025, Nvidia announced and launched the Kubernetes AI (KAI) Scheduler, an open source project maintained by Nvidia.

The KAI Scheduler is an advanced Kubernetes scheduler that lets cluster administrators dynamically allocate GPU resources to workloads. Users of the Rafay Platform can start using the KAI Scheduler immediately through the integrated Catalog.

To help you understand the basics quickly, we have also created a brief video introducing the concepts and a live demonstration showcasing how you can allocate fractional GPU resources to workloads.

Fractional GPUs¶

In a previous blog, we discussed how Kubernetes lets you request fractional CPU units for a workload but not fractional GPU units: every GPU request has to be an integer. With the KAI Scheduler, you are no longer limited to integer GPU allocations. Let's look at two common strategies for fractional GPUs with the KAI Scheduler.

Strategy 1: Fractional GPUs¶

With this strategy, administrators can allocate a fraction of the underlying GPU to a workload. In the image below, one Nvidia GPU is shared among three workloads, each allocated a fraction of it.

Workload 1 (0.5)
Workload 2 (0.3)
Workload 3 (0.2)

The workload's manifest without the KAI scheduler would look something like the following:

apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
      resources:
        limits:
          nvidia.com/gpu: "1"
        requests:
          nvidia.com/gpu: "1"

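With the KAI Scheduler, the manifest no longer requests an integer nvidia.com/gpu resource. Instead, it sets the gpu-fraction annotation, adds a queue label, and names kai-scheduler as the scheduler: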

apiVersion: v1
kind: Pod
metadata:
  name: gpu-fraction
  labels:
    runai/queue: test
  annotations:
    gpu-fraction: "0.5"
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
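
To make the 0.5 / 0.3 / 0.2 split described above concrete, the other two workloads could be declared along the following lines. Treat this as a sketch rather than a tested configuration: it reuses the queue label, annotation, and scheduler name from the manifest above, and the Pod names gpu-fraction-b and gpu-fraction-c are hypothetical. Whether all three Pods actually land on the same physical GPU depends on what is free in your cluster.

```yaml
# Workload 2: requests 30% of a GPU (Pod name is hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-fraction-b
  labels:
    runai/queue: test        # same queue as the example above
  annotations:
    gpu-fraction: "0.3"
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
---
# Workload 3: requests the remaining 20% (Pod name is hypothetical)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-fraction-c
  labels:
    runai/queue: test
  annotations:
    gpu-fraction: "0.2"
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
```

Together with the 0.5 request above, the three fractions add up to 1.0, i.e. one whole GPU.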

Strategy 2: Fractional GPU Memory¶

With this strategy, administrators can allocate a portion of the underlying GPU's memory to a workload. In the image below, one Nvidia GPU with 4 GB of memory is shared among three workloads, each allocated part of that memory.

Workload 1 (2 GB)
Workload 2 (1 GB)
Workload 3 (1 GB)

With the KAI Scheduler, the workload's manifest requests GPU memory through the gpu-memory annotation, with the value expressed in MiB:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem
  labels:
    runai/queue: test
  annotations:
    gpu-memory: "2000" # in MiB
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
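
For one of the smaller workloads in the 2 GB / 1 GB / 1 GB split, only the annotation value changes. The following is a sketch under the same assumptions as the manifest above (the test queue and the kai-scheduler scheduler name); the Pod name gpu-mem-small is hypothetical.

```yaml
# A 1 GB workload: same shape as the manifest above, smaller memory request
apiVersion: v1
kind: Pod
metadata:
  name: gpu-mem-small        # hypothetical name
  labels:
    runai/queue: test
  annotations:
    gpu-memory: "1000"       # in MiB, following the convention used above
spec:
  schedulerName: kai-scheduler
  containers:
    - name: ubuntu
      image: ubuntu
      args: ["sleep", "infinity"]
```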

Summary¶

In this blog, we looked at how organizations can use Nvidia's newly launched KAI Scheduler to allocate fractions of a GPU, or of its memory, to workloads.

Free Org

Sign up for a free Org if you want to try this yourself with our Get Started guides.

Rafay's AI/ML Products

Learn about Rafay's offerings in AI/ML Infrastructure and Tooling
