Skip to content

Product Blog

Serving LLMs on ARM: Running Rafay Token Factory on NVIDIA DGX Spark

NVIDIA DGX Spark put a Grace Blackwell-class machine on the desk. It is roughly the size of a hardback book, draws a fraction of the power of a rack server, and ships with 128 GB of unified memory that lets you load models far larger than a typical workstation GPU can hold. For developers, researchers, and platform teams, it is one of the most interesting pieces of AI hardware to appear in a long time.

It also comes with a characteristic that trips up a lot of inference tooling: the DGX Spark is ARM-only. The NVIDIA GB10 Grace Blackwell Superchip pairs a Blackwell GPU with a 20-core Arm CPU (10 Cortex-X925 performance cores and 10 Cortex-A725 efficiency cores) over an NVLink-C2C link. By default, it runs DGX OS, an Ubuntu-based, aarch64 operating system.

Inference stacks that assume amd64 containers, x86 wheels, or x86-only base images do not run here without work.

This is exactly the kind of heterogeneity Rafay's Token Factory is designed to absorb. In this post I will walk through how Token Factory turns a single ARM-based DGX Spark into a managed, multi-tenant LLM serving endpoint, using a real deployment of Qwen2-0.5B-Instruct as the example.

DGX Spark in Hand


Why ARM is usually the hard part

When most teams say "deploy an LLM," they mean a workflow that has been quietly assuming x86 for years. The inference server image is built for amd64. The CUDA wheels are compiled for x86. The orchestration layer schedules onto x86 worker nodes. None of that is portable to a Grace Blackwell box by default.

On the DGX Spark, every layer has to be aarch64-native:

  • The container images for the inference engine and its dependencies
  • The CUDA and driver stack matched to the Blackwell GPU on the GB10 (compute capability sm_121, which is distinct from data-center Blackwell parts like the B200)
  • The Kubernetes node components and the GPU operator that exposes the device to pods
  • Any sidecars, gateways, or metering agents that ride alongside the model

The promise of Token Factory is that Rafay invisibly handles this substrate for you. You register the machine as a compute cluster, and from that point on the experience is the same whether the underlying silicon is an x86 H100 server, a GB200 NVL72 rack, or a single ARM DGX Spark on a desk.


The Architecture

Token Factory sits on top of a Kubernetes substrate and exposes LLM inference serving as a set of higher-level objects: a compute cluster (where models run), an endpoint (the network front door), a provider and model (what you are serving), and a model deployment (the running, scalable instance with its inference engine, rate limits, and pricing).

End users never see Kubernetes. They get an OpenAI-compatible API, an API key, and a usage dashboard. The operator/service provider gets multi-tenancy, metering, and governance. The DGX Spark just happens to be the place where the tokens are generated.


Step 1 — Provision Kubernetes on DGX Spark

Token Factory runs on a Kubernetes substrate, so before any model can be served, the DGX Spark systems needs a cluster on it. This is the first place an ARM-only machine usually causes friction: many Kubernetes distributions and installers still assume x86 worker nodes, ship amd64-only system images, or pull control-plane components that have no aarch64 build. On a Grace Blackwell box, all of that has to be native ARM or it simply does not come up.

We provision the cluster using Rafay MKS, Rafay's upstream, CNCF-conformant Kubernetes distribution for bare metal and VM environments. MKS is built to run directly on the hardware you bring, and it supports a fully aarch64-native, ARM-only deployment, which is what makes it a fit for the DGX Spark. There is no x86 control-plane node hiding in the topology; the entire cluster runs on the Grace Blackwell silicon.

On a single DGX Spark the result is a compact, single-node cluster where the control plane and the worker role co-reside on the same machine:

  • Control plane and kubelet run as aarch64 components on the Spark's Arm cores. The 20-core Grace CPU has more than enough headroom to host the control plane and still leave the bulk of its cores, and the unified memory, for inference.
  • The container runtime and CNI are ARM-native. (A common pattern here is to pair a CNI such as Calico or Cilium for pod networking)
  • The NVIDIA GPU Operator installs the aarch64 driver, CUDA runtime, and Kubernetes device plugin that expose the GB10's Blackwell GPU to pods. This is the layer that turns "a GPU exists in the box" into "the scheduler can place a model on it," and it is the piece most likely to break on a non-x86 platform if the components are not architecture-matched. MKS handles this as part of bringing up a GPU cluster.

Practically, the operator points Rafay at the DGX Spark, which bootstraps Kubernetes and the GPU software stack on the node, and a few minutes later there is a healthy, GPU-aware, ARM-native cluster ready to serve workloads. The DGX Spark is now a Kubernetes node like any other, except that every layer of that stack is aarch64.


Step 2 — Register the DGX Spark as a Compute Cluster

The next step is to bring the Kubernetes cluster on DGX Spark under management as a GPU compute cluster. Once the Kubernetes cluster is registered, Token Factory will automatically discover the hardware and surfaces it in the Rafay Console.

GB10 in Compute Cluster

A few things in this view are worth calling out:

  • The single node, spark-2ceb, reports its accelerator as NVIDIA-GB10, and the GPU Inventory by Type panel classifies it as Blackwell. The platform correctly identifies the GB10 silicon rather than treating it as a generic device.
  • The CPU panel reads 6,511 / 20,000 mCores. That 20,000 millicore ceiling is the 20 Arm cores of the Grace CPU, visible directly in the dashboard.
  • Memory reads 16.14 GB / 119.67 GB. That ~120 GB pool is the DGX Spark's 128 GB of coherent unified memory, which is the whole reason this box can hold sizable models in the first place.
  • GPUs show 1 allocated / 1 total at 100%, the single Blackwell GPU on the superchip.

From here, day-2 operations are managed: utilization telemetry, monitoring alerts etc. The DGX Spark is now a first-class citizen of the fleet, indistinguishable in workflow from any other GPU cluster.


Step 3 — Define the endpoint, provider, and model

With the cluster connected, the operator stitches together the serving objects. In Token Factory terms:

  • The endpoint is the addressable front door for inference traffic.
  • The provider and model describe what is being served. Here the model is Qwen2-0.5B-Instruct, a small instruct-tuned model that is a sensible first deployment on a single-GPU box: it loads comfortably into unified memory, serves quickly, and is a clean way to validate the full path before moving to larger models.
  • Model sharing controls which tenants and projects can see and consume the model, which is what makes a single DGX Spark useful to more than one team.

Info

Note that none of these objects care about the CPU architecture underneath. The endpoint and model abstractions are the same on ARM as on x86, which is the point.


Step 4 — Create the Model Deployment

The model deployment is where everything comes together: the model, the endpoint, the target GPU, the inference engine, rate limits, and pricing. This is the screen the operator fills in to actually bring the model online.

Model Deployment on GB10

Reading down the form:

  • Details name the deployment Qwen2-0.5B-Instruct-model
  • Model & Endpoint bind Qwen2-0.5B-Instruct to the token-factory-on-gb10-spark endpoint.
  • GPU Type is the important part. Token Factory presents NVIDIA-GB10 as a selectable target with 1 GPU available of 1 total, and the deployment is pinned to it. The platform has already done the hard scheduling work of matching the model to the Blackwell GPU on the ARM node. The operator simply selects it.
  • The remaining sections, Inference Engine, Rate Limiting, and Pricing, configure how the model is served, how aggressively clients can call it, and how usage is metered and charged back.
  • Click Create Model Deployment, and the model comes online on the DGX Spark.

Info

Behind that Inference Engine selection is where the ARM-native heavy lifting lives. Token Factory pulls aarch64 inference-engine images and the matching CUDA stack for the GB10, schedules the serving pod onto the Spark node via the GPU operator, and wires it to the endpoint. The operator never builds an ARM container, never recompiles a wheel, and never debugs an architecture mismatch. They pick an engine from a dropdown.


Step 5 — Operate & Use It

Once deployed, the DGX Spark behaves like a managed inference service, not a hobbyist's desktop experiment:

  • End users consume an OpenAI-compatible API. They generate API keys, send requests, and watch their own usage dashboards. They have no idea, and no need to know, that the tokens are coming off an ARM superchip under someone's desk.
  • Rate limiting protects the single GPU from being overwhelmed by any one tenant, which matters far more on a one-GPU box than on a large cluster.
  • Usage metering and pricing make the deployment chargeable, so even a desktop-class machine can participate in showback or a paid internal service.
  • Multi-tenancy and model sharing let several teams share the one DGX Spark with isolation and quotas, turning a single device into shared infrastructure.

Scaling is the natural next move. The same workflow that brought up one DGX Spark brings up a second, and NVIDIA's own design anticipates this: two Sparks can be linked over ConnectX networking into a 256 GB combined-memory pair for models in the 405B-parameter range. Token Factory treats those as additional capacity in the fleet, and the model-deployment abstraction is unchanged.


Why this Matters

The DGX Spark is a preview of where a lot of AI compute is heading: ARM-based, memory-rich, energy-efficient, and increasingly distributed out toward the edge rather than concentrated in a few data centers. The hardware is genuinely exciting. The operational reality, an aarch64 stack that breaks x86 assumptions at every layer, is where most teams lose weeks.

Rafay Token Factory collapses that gap. The same five-object workflow, compute cluster, endpoint, provider and model, model deployment, and end-user consumption, applies whether the silicon is x86 or ARM, a single desktop or a rack-scale GB200 system. You register the machine, pick the GPU, pick the engine, and serve tokens. The architecture under the hood becomes an implementation detail, which is exactly what platform software is supposed to do.

A Grace Blackwell machine on your desk, serving a production-style LLM endpoint with metering, rate limits, and multi-tenancy, and no one had to think about ARM. That is the whole idea!

Bring Rafay Into Your AI Workflows with the Rafay MCP Server

AI assistants are now part of everyday work for platform, DevOps, and SRE teams. We use them to debug code, make sense of configuration, and understand how systems behave. But when it comes to managing Kubernetes clusters and platform infrastructure, these assistants hit a wall: they have no secure, real-time view of your environment.

Without a secure window into your actual operational state AI tools are forced to guess rely on stale data or require engineers to manually copy paste massive YAML files and CLI outputs into chat windows.

To bridge this gap without compromising on security, we are thrilled to introduce the Rafay MCP Server.

Kubernetes v1.36 for Rafay MKS

As part of our continuous effort to bring the latest Kubernetes versions to our users, support for Kubernetes v1.36 (codename ハル / Haru) is now available on the Rafay Operations Platform for MKS cluster types.

Both new cluster provisioning and in-place upgrades of existing clusters are supported. As with most Kubernetes releases, this version deprecates and removes a number of features. To ensure zero impact to our customers, we have validated every feature in the Rafay Kubernetes Operations Platform on this Kubernetes version.

Recommended: Platform Version 1.3.0

Rafay Platform Version 1.3.0 is the default selection in the UI when provisioning new Kubernetes clusters. It includes containerd 2.3.0 (CRI 2.3.0). Rafay recommends using Platform Version 1.3.0 with Kubernetes v1.36 to take advantage of all new stable features in the 1.36 release. For upgrading existing clusters, you can upgrade to Platform Version 1.3.0 separately first, or together with the Kubernetes v1.36 upgrade.

Kubernetes v1.36 Release

Automated GPU Health Monitoring with NVIDIA NVSentinel on the Rafay Platform

GPU clusters are expensive and GPU failures are costly. In modern AI infrastructure, organizations operate large fleets of NVIDIA GPUs that can cost tens of thousands of dollars each. When a GPU develops a hardware fault (e.g. a double-bit ECC error, a thermal throttle, or a silent data corruption event), the consequences ripple outward: training jobs fail hours into a run, inference latency spikes, and expensive hardware sits idle while engineers scramble to diagnose the root cause.

Traditional monitoring catches these problems eventually, but rarely fixes them. Diagnosing and remediating GPU faults still requires deep expertise, and remediation timelines are measured in hours or days. For organizations running AI workloads at scale — and especially for GPU cloud providers who must deliver uptime SLAs to their tenants — this gap between detection and resolution translates directly into SLA breaches, lost revenue, and eroded customer trust.

NVIDIA's answer to this challenge is NVSentinel — an open-source, Kubernetes-native system that continuously monitors GPU health and automatically remediates issues before they disrupt workloads.

In this blog, we describe how Rafay integrates with NVSentinel enabling GPU cloud operators and enterprises to deploy intelligent GPU fault detection and self-healing across their entire fleet — consistently, repeatably, and at scale.

Rafay and NVSentinel

NVIDIA Dynamo: Turning Disaggregated Inference Into a Production System

In Part 1, we covered the core idea behind disaggregated inference. That architectural split is no longer just a research pattern. Disaggregated inference changes inference from a simple “deploy a container on GPUs” exercise into a distributed system problem.

Once prefill and decode are separated, the platform has to coordinate routing, GPU-to-GPU KV cache transfer, placement, autoscaling, service discovery, and fault handling across multiple worker pools. NVIDIA Dynamo provides the distributed inference framework for this, and Kubernetes provides the control plane foundation to operate it at scale. 

In this blog post, we will review NVIDIA's Dynamo project with a focus on what it does and when it it makes sense to use it.

NVIDIA Dynamo Logo

Introduction to Disaggregated Inference: Why It Matters

The explosive growth of generative AI has placed unprecedented demands on GPU infrastructure. Enterprises and GPU cloud providers are deploying large language models at scale, but the underlying inference serving architecture often can't keep up.

In this first blog post on disaggregated inference, we will discuss how it differs from traditional serving, why it matters for platform teams managing GPU infrastructure, and how the ecosystem—from NVIDIA Dynamo to open-source frameworks—is making it production-ready.

Disaggregated Inference

Fine Tuning as a Service using Rafay and Unsloth Studio

Fine-tuning large language models used to be an exercise reserved for teams with deep MLOps expertise and bespoke infrastructure. With Unsloth Studio — an open-source web UI for training and running LLMs — the barrier to entry has dropped considerably.

But packaging Unsloth Studio into a repeatable, self-service experience that neo clouds and enterprise can offer their end users? That still requires thoughtful orchestration.

In this post, we walk through how to deliver Unsloth Studio as a one-click, app-store-style experience using Rafay's App Marketplace. By the end, you'll understand how to create an Unsloth Studio App SKU, configure it for end users, test it, and share it across customer organizations — all without requiring your users to know anything about Kubernetes, Docker, or GPU scheduling.

Unsloth Studio in Rafay

Running GPU Infrastructure on Kubernetes: What Enterprise Platform Teams Must Get Right

KubeCon + CloudNativeCon Europe 2026, Amsterdam


If you are at KubeCon this week in Amsterdam, you are likely hearing the same question repeatedly: how do we actually operate GPU infrastructure on Kubernetes at enterprise scale? The announcements from NVIDIA — the DRA Driver donation, the KAI Scheduler entering CNCF Sandbox, GPU support for Kata Containers expand what is technically possible. But for enterprise platform teams, the harder problem is not capability. It is operating GPU infrastructure efficiently and responsibly once demand arrives.

This post is written for platform teams building internal GPU platforms — on-premises, in sovereign environments, or in hybrid models. You are not just provisioning infrastructure. You are governing access to some of the most expensive and constrained resources in the organization.

At scale, GPU inefficiency is not accidental. It is structural:

  • Idle GPUs that remain allocated but unused
  • Over-provisioned workloads consuming more than needed
  • Fragmented capacity that cannot satisfy real workloads
  • Lack of cost visibility and accountability

Solving this requires more than infrastructure. It requires a governed platform model.

Advancing GPU Scheduling and Isolation in Kubernetes

KubeCon + CloudNativeCon Europe 2026, Amsterdam


At KubeCon Europe 2026, NVIDIA made a set of significant open-source contributions that advance how GPUs are managed in Kubernetes. These developments span across: resource allocation (DRA), scheduling (KAI), and isolation (Kata Containers). Specifically, NVIDIA donated its DRA Driver for GPUs to the Cloud Native Computing Foundation, transferring governance from a single vendor to full community ownership under the Kubernetes project. The KAI Scheduler was formally accepted as a CNCF Sandbox project, marking its transition from an NVIDIA-governed tool to a community-developed standard. And NVIDIA collaborated with the CNCF Confidential Containers community to introduce GPU support for Kata Containers, extending hardware-level workload isolation to GPU-accelerated workloads. Together, these contributions move GPU infrastructure closer to a first-class, community-owned, scheduler-integrated model.

From Docker Image to 1-Click App: Enabling Self-Service for Custom Apps

In the Developer Pods series (part-1, part-2 and part-3), we made a simple point: most users do not want infrastructure. They want outcomes.

They do not want tickets. They do not want YAML. They do not want to think about pods, namespaces, ingress, or DNS. They want a working environment or application, available quickly, through a clean self-service experience. That was the core theme behind Developer Pods: Kubernetes is a powerful engine, but it should not be the user interface.

The next step is just as important: letting end users deploy applications packaged as Docker containers into shared, multi-tenant Kubernetes clusters with a true 1-click experience.

Rafay’s 3rd Party App Marketplace is designed for exactly this. It allows providers to curate and publish containerized apps from Docker Hub, third-party vendors, or open-source communities, package them with defaults, user overrides, and policies, and expose them as a secure, governed self-service experience for users across multiple tenants.

Docker App