NVIDIA GPU Operator on Azure AKS
What Will You Do
In this part of the self-paced exercise, you will provision an Azure AKS cluster with a GPU node pool based on a declarative cluster specification.
Step 1: Cluster Spec
- Open Terminal (on macOS/Linux) or Command Prompt (Windows) and navigate to the folder where you forked the Git repository
- Navigate to the folder "/getstarted/gpuaks/cluster"
The "aks-gpu.yaml" file contains the declarative specification for our Azure AKS Cluster.
Cluster Details
Update the following values in the spec file to match your environment.
- project: defaultproject
- cloudprovider: azure-cc
- location: northcentralus
- resourceGroupName: Tim-RG
apiVersion: rafay.io/v1alpha1
kind: Cluster
metadata:
  name: demo-gpu-aks
  project: defaultproject
spec:
  blueprint: default-aks
  cloudprovider: azure-cc
  clusterConfig:
    apiVersion: rafay.io/v1alpha1
    kind: aksClusterConfig
    metadata:
      name: demo-gpu-aks
    spec:
      managedCluster:
        apiVersion: "2022-07-01"
        identity:
          type: SystemAssigned
        location: northcentralus
        properties:
          apiServerAccessProfile:
            enablePrivateCluster: true
          dnsPrefix: demo-gpu-aks-dns
          kubernetesVersion: 1.25.6
          networkProfile:
            loadBalancerSku: standard
            networkPlugin: kubenet
        sku:
          name: Basic
          tier: Free
        type: Microsoft.ContainerService/managedClusters
      nodePools:
      - apiVersion: "2022-07-01"
        location: northcentralus
        name: primary
        properties:
          count: 1
          enableAutoScaling: true
          maxCount: 1
          maxPods: 110
          minCount: 1
          mode: System
          orchestratorVersion: 1.25.6
          osType: Linux
          type: VirtualMachineScaleSets
          vmSize: Standard_NC4as_T4_v3
        type: Microsoft.ContainerService/managedClusters/agentPools
      resourceGroupName: Tim-RG
  proxyconfig: {}
  type: aks
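The four environment-specific values listed above can also be substituted with a small script instead of by hand. The following is a minimal sketch, assuming the spec is treated as plain text and that every occurrence of a key (for example, both "location" entries) should receive the same value. The function name update_spec and the replacement values are hypothetical and are not part of the RCTL tooling.

```python
import re

# Keys the tutorial asks you to change; anything else is rejected.
ENV_KEYS = ("project", "cloudprovider", "location", "resourceGroupName")


def update_spec(spec_text: str, values: dict) -> str:
    """Return spec_text with each `key: value` line rewritten for the given keys.

    The text is edited line by line, so indentation and key order are preserved.
    Every occurrence of a key is replaced (e.g. both `location` lines).
    """
    for key, new_value in values.items():
        if key not in ENV_KEYS:
            raise KeyError(f"unexpected key: {key}")
        # Allow leading spaces/tabs and an optional list dash before the key.
        pattern = re.compile(
            rf"^([ \t]*(?:-[ \t]+)?{re.escape(key)}:[ \t]*).*$", re.MULTILINE
        )
        # A function replacement avoids backslash escapes in new_value being interpreted.
        spec_text = pattern.sub(lambda m: m.group(1) + new_value, spec_text)
    return spec_text


# Example on a short excerpt of the spec (placeholder value, not a real project).
excerpt = "metadata:\n  name: demo-gpu-aks\n  project: defaultproject\n"
print(update_spec(excerpt, {"project": "my-project"}))
```

To apply this to the real file, read aks-gpu.yaml into a string, pass it through update_spec, and write the result back. Review the output before running rctl, since the script does not validate the YAML.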
Step 2: Provision Cluster
- On your command line, navigate to the "cluster" subfolder
- Type the command
rctl apply -f aks-gpu.yaml
If there are no errors, you will be presented with a "Task ID" (the "taskset_id" field in the response) that you can use to check progress and status. Note that this step creates infrastructure in your Azure account and can take approximately 20-30 minutes to complete.
{
  "taskset_id": "x28y6ek",
  "operations": [
    {
      "operation": "ClusterCreation",
      "resource_name": "demo-gpu-aks",
      "status": "PROVISION_TASK_STATUS_PENDING"
    },
    {
      "operation": "NodegroupCreation",
      "resource_name": "primary",
      "status": "PROVISION_TASK_STATUS_PENDING"
    },
    {
      "operation": "BlueprintSync",
      "resource_name": "demo-gpu-aks",
      "status": "PROVISION_TASK_STATUS_PENDING"
    }
  ],
  "comments": "The status of the operations can be fetched using taskset_id",
  "status": "PROVISION_TASKSET_STATUS_PENDING"
}
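If you capture this response in a script, the task set ID and per-operation statuses can be extracted with a few lines of Python. This is a minimal sketch that assumes the response has the shape shown above. The function names summarize_taskset and pending_operations are hypothetical, and the only status value assumed is the "..._PENDING" suffix that appears in the sample.

```python
import json


def summarize_taskset(response_text: str) -> dict:
    """Parse the JSON printed by `rctl apply` and map each operation to its status."""
    response = json.loads(response_text)
    summary = {
        "taskset_id": response["taskset_id"],
        "status": response["status"],
        "operations": {},
    }
    for op in response.get("operations", []):
        # Key on (operation, resource_name): one resource can appear in several operations.
        summary["operations"][(op["operation"], op["resource_name"])] = op["status"]
    return summary


def pending_operations(summary: dict) -> list:
    """Return the (operation, resource_name) pairs whose status ends in _PENDING."""
    return [
        key
        for key, status in summary["operations"].items()
        if status.endswith("_PENDING")
    ]
```

For the sample response, summarize_taskset returns the ID "x28y6ek" and pending_operations lists all three operations, since none has started yet.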
- In the web console, navigate to the project in your Org
- Click on Infrastructure -> Clusters. The new cluster should be listed there while it is being provisioned
- Click on the cluster name to monitor progress
Step 3: Verify Cluster
Once provisioning is complete, you should see a healthy cluster in the web console
- Click on the kubectl link and type the following command
kubectl get nodes -o wide
You should see something like the following
NAME                              STATUS   ROLES   AGE     VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION      CONTAINER-RUNTIME
aks-primary-14718340-vmss000002   Ready    agent   8m38s   v1.25.6   10.224.0.4    <none>        Ubuntu 22.04.2 LTS   5.15.0-1041-azure   containerd://1.7.1+a
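The same check can be automated by parsing the table that kubectl prints. The sketch below assumes the default table layout shown above, with NAME and STATUS as the first two columns. The function names node_statuses and all_ready are hypothetical helpers, not part of kubectl.

```python
def node_statuses(kubectl_output: str) -> dict:
    """Map node name -> STATUS from the table printed by `kubectl get nodes`."""
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    if not lines:
        return {}
    header = lines[0].split()
    if header[:2] != ["NAME", "STATUS"]:
        raise ValueError("expected NAME and STATUS as the first two columns")
    statuses = {}
    for line in lines[1:]:
        # Only the first two fields are used, so spaces in OS-IMAGE do not matter.
        fields = line.split()
        statuses[fields[0]] = fields[1]
    return statuses


def all_ready(statuses: dict) -> bool:
    """True only if at least one node exists and every node is exactly "Ready".

    Compound states such as "Ready,SchedulingDisabled" are treated as not ready.
    """
    return bool(statuses) and all(s == "Ready" for s in statuses.values())
```

Feeding the sample output above into node_statuses yields a single entry for aks-primary-14718340-vmss000002, and all_ready returns True for it.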
Recap
Congratulations! At this point, you have successfully configured and provisioned an Azure AKS cluster with a GPU node pool in your account using the RCTL CLI. You are now ready to move on to the next step, where you will create and deploy a custom cluster blueprint that contains the GPU Operator as an addon.