
Overview

The vLLM template enables streamlined deployment of vLLM-based large language model (LLM) inference services on GPU-enabled Kubernetes clusters. In this guide, we will use Rafay's curated template for Inference, available from the Template Catalog. An Org Admin for the user's Rafay Org has the privileges to share system templates from Rafay's Catalog with specific or all projects.

  • Service profiles are based on environment templates powered by Rafay's Environment Manager.
  • Users can also create and configure custom environment templates for use cases beyond those supported out of the box in Rafay's Template Catalog.

Info

Please check Rafay's Public Roadmap or contact support for details on additional templates for the Template Catalog.

Please ensure that you have properly configured a cluster with GPUs and an Ingress Controller by following the infrastructure-related instructions. The endpoint URL for vLLM will be exposed via an https-based Ingress on a domain.
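
If you have kubectl access, an optional sanity check of both prerequisites can save time later. This is a sketch that assumes the NVIDIA device plugin advertises the nvidia.com/gpu resource and that the ingress controller runs in the ingress-nginx namespace; adjust names to your environment.

# Confirm that at least one node advertises allocatable GPUs
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Confirm the ingress controller is running (namespace may differ in your setup)
kubectl get pods -n ingress-nginx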

As an Org Admin, navigate to "System->Template Catalog".

  • On the "vLLM Inference on K8s" template card, click on "Get Started"
  • Follow the wizard by providing a name, version and project
  • Click "Continue"

Share Template

Info

To aid with testing and evaluation, the Rafay platform provides the option to automatically publish the DNS and inject a certificate for the https URL on a Rafay managed domain. We will use this option for our exercise.


Configure Template

When the template is executed by the Rafay agent operating behind the firewall, the agent receives the configured template, the associated Infrastructure as Code (IaC), credentials, and policies from the Rafay Platform. It then executes this code on behalf of the user.

1. Specify Agent

Let's configure the template to be received and executed by the Rafay Agent we created in a prior step.

  • Under the "Agents" tab, click on "Add Agent"
  • Select the name of the agent you configured in the prior step
  • Ensure that the override configuration is set to "Not Allowed" because we do not want downstream users to be able to change this
  • Save your changes

Select Agent


2. Config Context

The config context will typically encapsulate credentials and environment variables required for the agent to perform its job. In this case, we will configure the Rafay Agent with credentials so that it can make programmatic (API) calls to the specified Rafay Org and Hugging Face.

To get the Rafay API Key + Secret for the administrator user:

  • Navigate to "My Tools -> Manage Keys" and click on "New API Key".
  • Copy the API Key + Secret combination.

Info

Click here to learn more about API Key & Secret for programmatic access.

Now, we are ready to configure our agent's config context.

  • Under the "Config Contexts" tab in the environment template, edit "kubeconfig-mounter"
  • Expand "Environment Variables" and you should see three entries: HF Token, API Key & Controller Endpoint
  • Click on Edit for API Key
  • Paste the API Key/Secret string from the above step into the value section
  • Set Override to "Not Allowed" so that downstream users have no visibility into or access to the config context
  • Save & Continue

Config Context Step 1

  • Click on Edit for HF Token
  • Paste the Hugging Face Token from your Hugging Face account into the value section (a quick way to sanity-check the token is shown below)
  • Set Override to "Not Allowed" so that downstream users have no visibility into or access to the config context
  • Save & Continue

Config Context Step 2
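
Before pasting the token, you can optionally verify that it is valid. A minimal sketch, assuming your token is exported as HF_TOKEN:

# Returns your Hugging Face account details if the token is valid
curl -sS -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2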

Controller API Endpoint

For self-hosted Rafay Controller deployments, the agent will need to be configured to point to the controller's URL. In this Get Started guide, we will be using the URL for Rafay's SaaS option.

  • Click on Edit for "Controller Endpoint".
  • Note that Rafay's SaaS Endpoint URL is already configured and can be updated if required
  • Set Override to "Not Allowed" so that downstream users have no visibility into or access to the config context
  • Save & Continue

Controller Endpoint

  • Click Save

Config Context Step 3


3. Input Variables

We will update the input variables by adding a small LLM to the list of available models.

  • Under the "Input Variables" tab in the environment template, edit "Model"
  • Click "Add Restricted Value"
  • Enter "facebook/opt-125m" into the new restricted value section
  • Click Save

Input Variables
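
If you want to confirm the model identifier before saving, the Hugging Face Hub API can be queried directly. A minimal check, assuming curl and jq are available:

# Prints the model id if "facebook/opt-125m" exists on the Hugging Face Hub
curl -sS https://huggingface.co/api/models/facebook/opt-125m | jq -r '.id'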

  • Click Save as Draft

Configure PaaS

Next, we will configure a custom PaaS service profile to allow self service users to deploy the template.

  • Navigate to PaaS Studio
  • Select "Service Profiles"
  • Select the project where the template was previously created
  • Click "New Service Profile"
  • Enter a name for the profile
  • Select the previously created template and version
  • Select "Inference Endpoints" for the service type
  • Select "Yes" for "Will compute be auto-created"
  • Click Save & Continue

Create Profile

On the following screen, navigate to "Input Settings":

  • Deselect "Override" for all variables except "Model"; this allows the user to choose the model they want to use
  • Update the following input variable values

  Name                  | Value
  ----------------------|-----------------------------------
  cluster_name          | Name of Cluster
  Extra Args            | --dtype=half --max-model-len 2046
  Ingress Controller IP | Ingress IP
  Ingress Namespace     | Ingress Namespace
  Model                 | facebook/opt-125m
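
The Extra Args above are passed through to the vLLM server: --dtype=half loads the model weights in FP16 and --max-model-len caps the context window at 2046 tokens. As an optional sketch, assuming vLLM is installed locally (pip install vllm) on a GPU machine and your version ships the vllm CLI, the same flags can be smoke-tested like this:

# Start a local OpenAI-compatible server with the same flags (defaults to port 8000)
vllm serve facebook/opt-125m --dtype=half --max-model-len 2046

# In another shell, confirm the model is being served
curl -sS http://localhost:8000/v1/models
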
  • Navigate to "Output Settings"
  • Click "Add Output"
  • Enter the name "api_key"
  • Click "Add Output"
  • Enter the name "url"
  • Click Save Changes

Create Profile


Deploy

Next, we will use the Developer Hub to deploy an instance of the service profile.

  • Navigate to Developer Hub
  • Select the project where the template was previously created
  • Click "Workspaces"
  • Click "New Workspace"
  • Enter a name for the workspace
  • Click Save

Create Workspace

  • Click "Custom Services"
  • Click "New Custom Service"
  • Click "Select" on the LLM service card
  • Enter a name for the instance
  • Click Deploy

Create Instance

After a short period of time, the instance will be deployed.

Create Instance
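
If you have kubectl access to the cluster, you can optionally confirm that the service came up. These are generic sketches; the actual namespace and resource names depend on how the template provisions the service.

# Look for the vLLM pods and the ingress that exposes the endpoint
kubectl get pods -A | grep -i vllm
kubectl get ingress -A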


Utilize

Next, we will use the newly deployed model.

  • Copy the URL and the API Key and substitute those values in the command below.
curl YOUR_URL/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [
      {
        "role": "user",
        "content": "Explain the main steps involved in training a large language model."
      }
    ]
  }'
  • Execute the command to see the response of the model
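
As an optional follow-up, assuming jq is installed, you can first verify the endpoint and API key by listing the served models, then pipe the chat completion through jq to print only the reply text.

# Verify the endpoint and API key by listing the models being served
curl -sS YOUR_URL/v1/models -H "Authorization: Bearer YOUR_API_KEY"

# Same request as above, piped through jq to extract the assistant's reply
curl -sS YOUR_URL/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model": "facebook/opt-125m", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.choices[0].message.content'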