MIG Mixed
The Multi-Instance GPU (MIG) feature allows GPUs (starting with the NVIDIA Ampere architecture) to be securely partitioned into up to seven separate GPU instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity, where running different workloads in parallel can maximize utilization.
For Cloud Service Providers (CSPs), who have multi-tenant use cases, MIG ensures one client cannot impact the work or scheduling of other clients, in addition to providing enhanced isolation for customers.
In this guide, you will configure MIG using a Mixed strategy. This means that a single node can expose multiple MIG device types at the same time.
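Under the mixed strategy, the NVIDIA device plugin advertises each MIG profile as its own extended resource name (for example nvidia.com/mig-1g.5gb), so a pod can request a specific slice size. As an illustration, a container targeting a 1g.5gb slice would request it like this (the profile name depends on your GPU and MIG configuration):
resources:
  limits:
    nvidia.com/mig-1g.5gb: 1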
Assumptions¶
- You have provisioned or imported one or more Kubernetes clusters into a Project in your Rafay Org that contain one or more MIG supported GPUs.
- You have NOT already deployed the NVIDIA GPU Operator on the cluster
- You have set up the RCTL CLI
Deploy Mixed Strategy Multi-Instance GPU (MIG)¶
To deploy a mixed strategy MIG on the managed Kubernetes cluster, perform the following steps:
Step 1: Create GPU Operator Namespace¶
In this step, you will create a namespace for the GPU Operator which will be installed in a later step.
- Download the namespace specification file
- Update the Project name in the YAML file with the name of the project to create the resource in
- Execute the following RCTL command to create the namespace
rctl --v3 apply -f 01-gpu-operator-namespace.yaml
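For reference, the downloaded file declares the namespace as a Rafay v3 resource. A minimal sketch is shown below; the namespace and project names are placeholders and the downloaded file remains the authoritative version.
apiVersion: infra.k8smgmt.io/v3
kind: Namespace
metadata:
  name: gpu-operator
  project: your-project-name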
Step 2: Add the NVIDIA Helm Repository¶
In this step, you will add the NVIDIA Helm repository into the controller. This repository will be used to pull the Helm chart of the GPU Operator.
- Download the repository specification file
- Update the Project name in the YAML file with the name of the project to create the resource in
- Execute the following RCTL command to create the repository
rctl --v3 apply -f 02-nvidia-helm-repository.yaml
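The repository points at NVIDIA's public Helm chart repository, https://helm.ngc.nvidia.com/nvidia. For context, this is the same repository you would add if you were using the Helm CLI directly:
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update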
Step 3: Create GPU Operator Resource Quota Custom Addon¶
In this step, you will create a custom addon for the GPU Operator Resource Quota.
- Download the resource quota addon specification file
- Download the GPU Operator Resource Quota specification file
- Update the Project name in the 03-addon-gpu-operator-quota.yaml file with the name of the project to create the resource in
- Execute the following RCTL command to create the resource quota addon
rctl --v3 apply -f 03-addon-gpu-operator-quota.yaml
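The resource quota manifest referenced by this addon typically limits the GPU Operator namespace to system-critical pods, in line with NVIDIA's guidance. A sketch of such a ResourceQuota is shown below; the names and pod limit are illustrative and the downloaded file is authoritative.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-operator-quota
  namespace: gpu-operator
spec:
  hard:
    pods: 100
  scopeSelector:
    matchExpressions:
      - operator: In
        scopeName: PriorityClass
        values:
          - system-node-critical
          - system-cluster-critical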
Step 4: Create GPU Operator Custom Addon¶
In this step, you will create a custom addon for the GPU Operator.
- Download the GPU Operator addon specification file
- Download the GPU Operator values file
- Update the Project name in the 04-addon-nvidia-gpu-operator.yaml file with the name of the project to create the resource in
- Execute the following RCTL command to create the GPU Operator addon
rctl --v3 apply -f 04-addon-nvidia-gpu-operator.yaml
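The values file is where the MIG strategy for the GPU Operator is selected. The setting that makes this a mixed-strategy deployment is shown below; the downloaded values file contains additional settings that are omitted here.
mig:
  strategy: mixed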
Step 5: Create Blueprint¶
In this step, you will create a cluster blueprint which contains the previously created addons for the GPU Operator.
- Download the blueprint specification file
- Update the Project name in the YAML file with the name of the project to create the resource in
- Execute the following RCTL command to create the blueprint
rctl --v3 apply -f 05-blueprint-gpu.yaml
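The blueprint bundles the two addons created above on top of a base blueprint, so applying the blueprint to a cluster deploys the resource quota and the GPU Operator together. The sketch below only suggests the general shape of such a spec; the exact field names and versions are assumptions, and the downloaded file should be treated as authoritative.
apiVersion: infra.k8smgmt.io/v3
kind: Blueprint
metadata:
  name: gpu-blueprint
  project: your-project-name
spec:
  version: v1
  base:
    name: default
  customAddons:
    - name: gpu-operator-quota
      version: v1
    - name: nvidia-gpu-operator
      version: v1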
Step 6: Label GPU Nodes¶
In this step, you will label the GPU nodes with the MIG configuration. MIG configurations can be found here.
- Execute the following command for each GPU node, being sure to update the node name in the command
kubectl label nodes <node-name> nvidia.com/mig.config=all-balanced
Applying this label results in the following MIG configuration on the node (shown here for the A100 40GB GPU used later in this guide):
mig-devices:
  "1g.5gb": 2
  "2g.10gb": 1
  "3g.20gb": 1
Step 7: Apply Node Taints¶
In this step, you will apply a taint to the GPU nodes so that only workloads that tolerate the taint, such as the test application, are scheduled on them.
- Execute the following command for each GPU node, being sure to update the node name in the command
kubectl taint nodes <node-name> nvidia.com/gpu=Present:NoSchedule
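Note that a pod is only scheduled onto these tainted nodes if it tolerates the taint. The test workload deployed in the next step is expected to carry a matching toleration; in a pod spec it would look similar to the following illustrative snippet:
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule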
Step 8: Deploy Test Application¶
- Download the test application specification file
- Under Applications, select Workloads, then create a New Workload with the name gpu-mixed-mig-testapp
- Set Package Type to k8s YAML
- Select a namespace
- Click CONTINUE
- Upload the 06-test-workload.yaml file that was downloaded and then go to the placement of the workload
- Select the target cluster from the list of available clusters and click Save and go to publish
- Publish the workload and make sure that it gets published successfully in the target cluster before moving to the next step.
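Before verifying MIG, you can optionally confirm that the test pods are running on the GPU node. The pods carry the app-group=gpu-workload label that is also used in the next step:
kubectl get pods -n <Namespace of Test Application> -l app-group=gpu-workload -o wide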
Step 9: Verify MIG¶
After deploying the application to the cluster, verify that each test application pod is using a different MIG device provided by MIG.
- Execute the following command, being sure to update the namespace to the one where the test application was deployed
kubectl logs --all-containers -l app-group=gpu-workload -n <Namespace of Test Application>
In the output, you will see that the parent GPU UUID is the same for every pod, while each pod reports the unique UUID and profile of the MIG device it was allocated.
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-936508b7-d2e5-3b4d-07ab-c2062d6a1a5c)
MIG 2g.10gb Device 0: (UUID: MIG-d7d7b514-3ae1-5bda-a834-e4d556b11131)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-936508b7-d2e5-3b4d-07ab-c2062d6a1a5c)
MIG 3g.20gb Device 0: (UUID: MIG-28eab685-093e-510d-835a-03802a89fba2)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-936508b7-d2e5-3b4d-07ab-c2062d6a1a5c)
MIG 1g.5gb Device 0: (UUID: MIG-46156cd5-50ac-5c9a-98c2-f3bcfde699ad)
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-936508b7-d2e5-3b4d-07ab-c2062d6a1a5c)
MIG 1g.5gb Device 0: (UUID: MIG-9206a249-a7db-5533-8d56-caa6a8ad0cf5)
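You can also confirm the mixed-strategy resources on the node itself; with the mixed strategy, each MIG profile appears as its own allocatable resource on the node:
kubectl describe node <node-name> | grep nvidia.com/mig-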
Recap¶
Congratulations! You have now successfully deployed MIG with a mixed strategy configuration in your cluster.