Skip to content

LLM

Serving LLMs on ARM: Running Rafay Token Factory on NVIDIA DGX Spark

NVIDIA DGX Spark put a Grace Blackwell-class machine on the desk. It is roughly the size of a hardback book, draws a fraction of the power of a rack server, and ships with 128 GB of unified memory that lets you load models far larger than a typical workstation GPU can hold. For developers, researchers, and platform teams, it is one of the most interesting pieces of AI hardware to appear in a long time.

It also comes with a characteristic that trips up a lot of inference tooling: the DGX Spark is ARM-only. The NVIDIA GB10 Grace Blackwell Superchip pairs a Blackwell GPU with a 20-core Arm CPU (10 Cortex-X925 performance cores and 10 Cortex-A725 efficiency cores) over an NVLink-C2C link. By default, it runs DGX OS, an Ubuntu-based, aarch64 operating system.

Inference stacks that assume amd64 containers, x86 wheels, or x86-only base images do not run here without work.

This is exactly the kind of heterogeneity Rafay's Token Factory is designed to absorb. In this post I will walk through how Token Factory turns a single ARM-based DGX Spark into a managed, multi-tenant LLM serving endpoint, using a real deployment of Qwen2-0.5B-Instruct as the example.

DGX Spark in Hand


Why ARM is usually the hard part

When most teams say "deploy an LLM," they mean a workflow that has been quietly assuming x86 for years. The inference server image is built for amd64. The CUDA wheels are compiled for x86. The orchestration layer schedules onto x86 worker nodes. None of that is portable to a Grace Blackwell box by default.

On the DGX Spark, every layer has to be aarch64-native:

  • The container images for the inference engine and its dependencies
  • The CUDA and driver stack matched to the Blackwell GPU on the GB10 (compute capability sm_121, which is distinct from data-center Blackwell parts like the B200)
  • The Kubernetes node components and the GPU operator that exposes the device to pods
  • Any sidecars, gateways, or metering agents that ride alongside the model

The promise of Token Factory is that Rafay invisibly handles this substrate for you. You register the machine as a compute cluster, and from that point on the experience is the same whether the underlying silicon is an x86 H100 server, a GB200 NVL72 rack, or a single ARM DGX Spark on a desk.


The Architecture

Token Factory sits on top of a Kubernetes substrate and exposes LLM inference serving as a set of higher-level objects: a compute cluster (where models run), an endpoint (the network front door), a provider and model (what you are serving), and a model deployment (the running, scalable instance with its inference engine, rate limits, and pricing).

End users never see Kubernetes. They get an OpenAI-compatible API, an API key, and a usage dashboard. The operator/service provider gets multi-tenancy, metering, and governance. The DGX Spark just happens to be the place where the tokens are generated.


Step 1 — Provision Kubernetes on DGX Spark

Token Factory runs on a Kubernetes substrate, so before any model can be served, the DGX Spark systems needs a cluster on it. This is the first place an ARM-only machine usually causes friction: many Kubernetes distributions and installers still assume x86 worker nodes, ship amd64-only system images, or pull control-plane components that have no aarch64 build. On a Grace Blackwell box, all of that has to be native ARM or it simply does not come up.

We provision the cluster using Rafay MKS, Rafay's upstream, CNCF-conformant Kubernetes distribution for bare metal and VM environments. MKS is built to run directly on the hardware you bring, and it supports a fully aarch64-native, ARM-only deployment, which is what makes it a fit for the DGX Spark. There is no x86 control-plane node hiding in the topology; the entire cluster runs on the Grace Blackwell silicon.

On a single DGX Spark the result is a compact, single-node cluster where the control plane and the worker role co-reside on the same machine:

  • Control plane and kubelet run as aarch64 components on the Spark's Arm cores. The 20-core Grace CPU has more than enough headroom to host the control plane and still leave the bulk of its cores, and the unified memory, for inference.
  • The container runtime and CNI are ARM-native. (A common pattern here is to pair a CNI such as Calico or Cilium for pod networking)
  • The NVIDIA GPU Operator installs the aarch64 driver, CUDA runtime, and Kubernetes device plugin that expose the GB10's Blackwell GPU to pods. This is the layer that turns "a GPU exists in the box" into "the scheduler can place a model on it," and it is the piece most likely to break on a non-x86 platform if the components are not architecture-matched. MKS handles this as part of bringing up a GPU cluster.

Practically, the operator points Rafay at the DGX Spark, which bootstraps Kubernetes and the GPU software stack on the node, and a few minutes later there is a healthy, GPU-aware, ARM-native cluster ready to serve workloads. The DGX Spark is now a Kubernetes node like any other, except that every layer of that stack is aarch64.


Step 2 — Register the DGX Spark as a Compute Cluster

The next step is to bring the Kubernetes cluster on DGX Spark under management as a GPU compute cluster. Once the Kubernetes cluster is registered, Token Factory will automatically discover the hardware and surfaces it in the Rafay Console.

GB10 in Compute Cluster

A few things in this view are worth calling out:

  • The single node, spark-2ceb, reports its accelerator as NVIDIA-GB10, and the GPU Inventory by Type panel classifies it as Blackwell. The platform correctly identifies the GB10 silicon rather than treating it as a generic device.
  • The CPU panel reads 6,511 / 20,000 mCores. That 20,000 millicore ceiling is the 20 Arm cores of the Grace CPU, visible directly in the dashboard.
  • Memory reads 16.14 GB / 119.67 GB. That ~120 GB pool is the DGX Spark's 128 GB of coherent unified memory, which is the whole reason this box can hold sizable models in the first place.
  • GPUs show 1 allocated / 1 total at 100%, the single Blackwell GPU on the superchip.

From here, day-2 operations are managed: utilization telemetry, monitoring alerts etc. The DGX Spark is now a first-class citizen of the fleet, indistinguishable in workflow from any other GPU cluster.


Step 3 — Define the endpoint, provider, and model

With the cluster connected, the operator stitches together the serving objects. In Token Factory terms:

  • The endpoint is the addressable front door for inference traffic.
  • The provider and model describe what is being served. Here the model is Qwen2-0.5B-Instruct, a small instruct-tuned model that is a sensible first deployment on a single-GPU box: it loads comfortably into unified memory, serves quickly, and is a clean way to validate the full path before moving to larger models.
  • Model sharing controls which tenants and projects can see and consume the model, which is what makes a single DGX Spark useful to more than one team.

Info

Note that none of these objects care about the CPU architecture underneath. The endpoint and model abstractions are the same on ARM as on x86, which is the point.


Step 4 — Create the Model Deployment

The model deployment is where everything comes together: the model, the endpoint, the target GPU, the inference engine, rate limits, and pricing. This is the screen the operator fills in to actually bring the model online.

Model Deployment on GB10

Reading down the form:

  • Details name the deployment Qwen2-0.5B-Instruct-model
  • Model & Endpoint bind Qwen2-0.5B-Instruct to the token-factory-on-gb10-spark endpoint.
  • GPU Type is the important part. Token Factory presents NVIDIA-GB10 as a selectable target with 1 GPU available of 1 total, and the deployment is pinned to it. The platform has already done the hard scheduling work of matching the model to the Blackwell GPU on the ARM node. The operator simply selects it.
  • The remaining sections, Inference Engine, Rate Limiting, and Pricing, configure how the model is served, how aggressively clients can call it, and how usage is metered and charged back.
  • Click Create Model Deployment, and the model comes online on the DGX Spark.

Info

Behind that Inference Engine selection is where the ARM-native heavy lifting lives. Token Factory pulls aarch64 inference-engine images and the matching CUDA stack for the GB10, schedules the serving pod onto the Spark node via the GPU operator, and wires it to the endpoint. The operator never builds an ARM container, never recompiles a wheel, and never debugs an architecture mismatch. They pick an engine from a dropdown.


Step 5 — Operate & Use It

Once deployed, the DGX Spark behaves like a managed inference service, not a hobbyist's desktop experiment:

  • End users consume an OpenAI-compatible API. They generate API keys, send requests, and watch their own usage dashboards. They have no idea, and no need to know, that the tokens are coming off an ARM superchip under someone's desk.
  • Rate limiting protects the single GPU from being overwhelmed by any one tenant, which matters far more on a one-GPU box than on a large cluster.
  • Usage metering and pricing make the deployment chargeable, so even a desktop-class machine can participate in showback or a paid internal service.
  • Multi-tenancy and model sharing let several teams share the one DGX Spark with isolation and quotas, turning a single device into shared infrastructure.

Scaling is the natural next move. The same workflow that brought up one DGX Spark brings up a second, and NVIDIA's own design anticipates this: two Sparks can be linked over ConnectX networking into a 256 GB combined-memory pair for models in the 405B-parameter range. Token Factory treats those as additional capacity in the fleet, and the model-deployment abstraction is unchanged.


Why this Matters

The DGX Spark is a preview of where a lot of AI compute is heading: ARM-based, memory-rich, energy-efficient, and increasingly distributed out toward the edge rather than concentrated in a few data centers. The hardware is genuinely exciting. The operational reality, an aarch64 stack that breaks x86 assumptions at every layer, is where most teams lose weeks.

Rafay Token Factory collapses that gap. The same five-object workflow, compute cluster, endpoint, provider and model, model deployment, and end-user consumption, applies whether the silicon is x86 or ARM, a single desktop or a rack-scale GB200 system. You register the machine, pick the GPU, pick the engine, and serve tokens. The architecture under the hood becomes an implementation detail, which is exactly what platform software is supposed to do.

A Grace Blackwell machine on your desk, serving a production-style LLM endpoint with metering, rate limits, and multi-tenancy, and no one had to think about ARM. That is the whole idea!

NVIDIA NIM Operator: Bringing AI Model Deployment to the Kubernetes Era

In the previous blog, we learnt the basics about NIM (NVIDIA Inference Microservices). In this follow-on blog, we will do a deep dive into the NIM Kubernetes Operator, a Kubernetes-native extension that automates the deployment and management of NVIDIA’s NIM containers. By combining the strengths of Kubernetes orchestration with NVIDIA’s optimized inference stack, the NIM Operator makes it dramatically easier to deliver production-grade generative AI at scale.

NIM Operator

NVIDIA NIM: Why It Matters—and How It Stacks Up

Generative AI is moving from experiments to production, and the bottleneck is no longer training—it’s serving: getting high-quality model inference running reliably, efficiently, and securely across clouds, data centers, and the edge.

NVIDIA’s answer is NIM (NVIDIA Inference Microservices). NIM a set of prebuilt, performance-tuned containers that expose industry-standard APIs for popular model families (LLMs, vision, speech) and run anywhere there’s an NVIDIA GPU. Think of NIM as a “batteries-included” model-serving layer that blends TensorRT-LLM optimizations, Triton runtimes, security hardening, and OpenAI-compatible APIs into one deployable unit.

NIM Logo

Family vs. Lineage: Unpacking Two Often-Confused Ideas in the LLM World

LLMs have begun to resemble sprawling family trees. Folks that are relatively new to LLMs will notice two words appear constantly in technical blogs: "family" and "lineage".

They sound interchangeable and users frequently conflate them. But, they describe different slices of an LLM’s life story.

Important

Understanding the differences is more than trivia. This determines how you pick models, tune them, and keep inference predictable at scale.

LLM Family vs Lineage

Why “Family” Matters in the World of LLMs

When GPU bills run into six digits and every millisecond of latency counts, platform teams learn that vocabulary choices and hidden-unit counts aren’t the only things that separate one model checkpoint from another.

LLMs travel in families—lineages of models that share a common architecture, tokenizer, and training recipe. Think of them the way you might think of Apple’s M-series chips or Toyota’s Prius line: the tuning changes, the size varies, but the underlying design stays stable enough that tools, drivers, and workflows remain interchangeable.

In this blog, we will learn about what we mean by a family for LLMs and why this matters for Inference.

LLM Family

Choosing Your Engine for LLM Inference: The Ultimate vLLM vs. TensorRT LLM Guide

This is the next blog in the series of blogs on LLMs and Generative AI. When deploying large language models (LLMs) for inference, it is critical to consider: efficiency, scalability, and performance. Users will likely be very familiar with two market leading options: vLLM and Nvidia's TensorRT LLM.

In this blog, we dive into their pros and cons, helping users select the most appropriate option for their use case.

vLLM vs TensorRT LLM

Demystifying Quantization: Why It Matters for LLMs and Inference Efficiency

As Large Language Models (LLMs) like GPT, LLaMA, and DeepSeek reach hundreds of billions of parameters, the demand for high-speed, low-cost inference has skyrocketed. Quantization is a technique that helps drastically reduces model size and computational requirements by using lower-precision numbers. In this blog, we will discuss quantization and why it is essential.

Quantization

Compiling a LLM for High Performance Inference

This is the next blog in the blog series on LLMs and Inference. In the previous blog on LLMs and Inference, we discussed about the safetensors format for LLMs. In this blog, we will walk through a critical step for LLM Inference.

Compiling a Large Language Model (LLM) generally refers to optimizing the model’s computational graph and kernel execution to improve inference or training performance on specific hardware (like GPUs or TPUs). Think of this as the next logical step that is performed after loading a model.

LLM Compilation

How to Select the Right GPU for Open Source LLMs?

Deploying and operating an open-source Large Language Model (LLM) requires careful planning when selecting the right GPU model and memory capacity. Choosing the optimal configuration is crucial for performance, cost efficiency, and scalability. However, this process comes with several challenges.

In this blog, we will describe the factors that you need to consider to select the optimal GPU model for your LLM. We have also published a table capturing optimal GPU models to deploy and use Top-10 open source LLMs.

How many GPUs