
Overview

The vLLM template enables streamlined deployment of vLLM-based large language model (LLM) inference services on GPU-enabled Kubernetes clusters. In this guide, we will use Rafay's curated template for Inference, available from the Template Catalog. An Org Admin for the user's Rafay Org has the privileges to share system templates from Rafay's Catalog with specific or all projects.

  • Service profiles are based on environment templates powered by Rafay's Environment Manager.
  • Users can also create and configure custom environment templates for use cases beyond those supported out of the box in Rafay's Template Catalog.

Info

Please check Rafay's Public Roadmap or contact support for details on additional templates for the Template Catalog.

Please ensure that you have properly configured a cluster with GPUs and an Ingress Controller by following the infrastructure-related instructions. The endpoint URL for vLLM will be exposed via an https-based Ingress on a domain.
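
If you have kubectl access, an optional sanity check of both prerequisites can save time later. This is a sketch that assumes the NVIDIA device plugin advertises the nvidia.com/gpu resource and that the ingress controller runs in the ingress-nginx namespace; adjust names to your environment.

# Confirm that at least one node advertises allocatable GPUs
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'

# Confirm the ingress controller is running (namespace may differ in your setup)
kubectl get pods -n ingress-nginx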

As an Org Admin, navigate to "System->Template Catalog".

  • On the "vLLM Inference on K8s" template card, click on "Get Started"
  • Follow the wizard by providing a name, version and project
  • Click "Continue"

Share Template

Info

To aid with testing and evaluation, the Rafay platform provides the option to automatically publish the DNS and inject a certificate for the https URL on a Rafay managed domain. We will use this option for our exercise.


Configure Template

When the template is executed by the Rafay agent operating behind the firewall, the agent receives the configured template, the associated Infrastructure as Code (IaC), credentials, and policies from the Rafay Platform. It then executes this code on behalf of the user.

1. Specify Agent

Let's configure the template to be received and executed by the Rafay Agent we created in a prior step.

  • Under the "Agents" tab, click on "Add Agent"
  • Select the name of the agent you configured in the prior step
  • Ensure that the override configuration is set to "Not Allowed" because we do not want downstream users to be able to change this
  • Save your changes

Select Agent


2. Config Context

The config context will typically encapsulate credentials and environment variables required for the agent to perform its job. In this case, we will configure the Rafay Agent with credentials so that it can make programmatic (API) calls to the specified Rafay Org and Hugging Face.

To get the Rafay API Key + Secret for the administrator user:

  • Navigate to "My Tools -> Manage Keys" and click on "New API Key".
  • Copy the API Key + Secret combination.

Info

Click here to learn more about API Key & Secret for programmatic access.

Now, we are ready to configure our agent's config context.

  • Under the "Config Contexts" tab in the environment template, edit "kubeconfig-mounter"
  • Expand "Environment Variables" and you should see three entries: HF Token, API Key & Controller Endpoint
  • Click on Edit for API Key
  • Paste the API Key/Secret string from the above step into the value section
  • Set Override to "Not Allowed" so that downstream users have no visibility into or access to the config context
  • Save & Continue

Config Context Step 1

  • Click on Edit for HF Token
  • Paste the Hugging Face Token from your Hugging Face account into the value section (a quick way to sanity-check the token is shown below)
  • Set Override to "Not Allowed" so that downstream users have no visibility into or access to the config context
  • Save & Continue

Config Context Step 2
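
Before pasting the token, you can optionally verify that it is valid. A minimal sketch, assuming your token is exported as HF_TOKEN:

# Returns your Hugging Face account details if the token is valid
curl -sS -H "Authorization: Bearer $HF_TOKEN" https://huggingface.co/api/whoami-v2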

Controller API Endpoint

For self-hosted Rafay Controller deployments, the agent will need to be configured to point to the controller's URL. In this Get Started guide, we will be using the URL for Rafay's SaaS option.

  • Click on Edit for "Controller Endpoint".
  • Note that Rafay's SaaS Endpoint URL is already configured and can be updated if required
  • Set Override to "Not Allowed" so that downstream users have no visibility into or access to the config context
  • Save & Continue

Controller Endpoint

  • Click Save

Config Context Step 3


3. Input Variables

We will update the input variables by adding a small LLM to the list of available models.

  • Under the "Input Variables" tab in the environment template, edit "Model"
  • Click "Add Restricted Value"
  • Enter "facebook/opt-125m" into the new restricted value section
  • Click Save

Input Variables
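
If you want to confirm the model identifier before saving, the Hugging Face Hub API can be queried directly. A minimal check, assuming curl and jq are available:

# Prints the model id if "facebook/opt-125m" exists on the Hugging Face Hub
curl -sS https://huggingface.co/api/models/facebook/opt-125m | jq -r '.id'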

  • Click Save as Draft

Configure PaaS

Next, we will configure a custom PaaS service profile to allow self service users to deploy the template.

  • Navigate to PaaS Studio
  • Select "Service Profiles"
  • Select the project where the template was previously created
  • Click "New Service Profile"
  • Enter a name for the profile
  • Select the previously created template and version
  • Select "Inference Endpoints" for the service type
  • Select "Yes" for "Will compute be auto-created"
  • Click Save & Continue

Create Profile

On the following screen, navigate to "Input Settings":

  • Deselect "Override" for all variables except "Model"; this allows the user to choose the model they want to use
  • Update the following input variable values

  Name                  | Value
  ----------------------|-----------------------------------
  cluster_name          | Name of Cluster
  Extra Args            | --dtype=half --max-model-len 2046
  Ingress Controller IP | Ingress IP
  Ingress Namespace     | Ingress Namespace
  Model                 | facebook/opt-125m
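
The Extra Args above are passed through to the vLLM server: --dtype=half loads the model weights in FP16 and --max-model-len caps the context window at 2046 tokens. As an optional sketch, assuming vLLM is installed locally (pip install vllm) on a GPU machine and your version ships the vllm CLI, the same flags can be smoke-tested like this:

# Start a local OpenAI-compatible server with the same flags (defaults to port 8000)
vllm serve facebook/opt-125m --dtype=half --max-model-len 2046

# In another shell, confirm the model is being served
curl -sS http://localhost:8000/v1/models
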
  • Navigate to "Output Settings"
  • Click "Add Output"
  • Enter the name "api_key"
  • Click "Add Output"
  • Enter the name "url"
  • Click Save Changes

Create Profile


Deploy

Next, we will use the Developer Hub to deploy an instance of the service profile.

  • Navigate to Developer Hub
  • Select the project where the template was previously created
  • Click "Workspaces"
  • Click "New Workspace"
  • Enter a name for the workspace
  • Click Save

Create Workspace

  • Click "Custom Services"
  • Click "New Custom Service"
  • Click "Select" on the LLM service card
  • Enter a name for the instance
  • Click Deploy

Create Instance

After a short period of time, the instance will be deployed.

Create Instance
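
If you have kubectl access to the cluster, you can optionally confirm that the service came up. These are generic sketches; the actual namespace and resource names depend on how the template provisions the service.

# Look for the vLLM pods and the ingress that exposes the endpoint
kubectl get pods -A | grep -i vllm
kubectl get ingress -A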


Utilize

Next, we will use the newly deployed model.

  • Copy the URL and the API Key and substitute those values in the command below.
curl YOUR_URL/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "facebook/opt-125m",
    "messages": [
      {
        "role": "user",
        "content": "Explain the main steps involved in training a large language model."
      }
    ]
  }'
  • Execute the command to see the response of the model
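
As an optional follow-up, assuming jq is installed, you can first verify the endpoint and API key by listing the served models, then pipe the chat completion through jq to print only the reply text.

# Verify the endpoint and API key by listing the models being served
curl -sS YOUR_URL/v1/models -H "Authorization: Bearer YOUR_API_KEY"

# Same request as above, piped through jq to extract the assistant's reply
curl -sS YOUR_URL/v1/chat/completions \
  -X POST \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{"model": "facebook/opt-125m", "messages": [{"role": "user", "content": "Hello"}]}' \
  | jq -r '.choices[0].message.content'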