Troubleshooting

The issues span different areas such as deployment failures, cluster-level misconfigurations, UI access errors, and authentication challenges. The goal is to help users identify and resolve these problems quickly.

A common theme observed across various templates is capacity-related issues, often involving resource allocation, node scaling, and workload identity settings. Addressing these correctly ensures stable cluster operations and smooth Kubeflow deployment.

Each issue is documented with its error message, possible cause, recommended workaround, and additional comments where necessary.


Kubeflow - Errors and Troubleshooting Guide

Here are some scenarios that may arise when using the Kubeflow template.

a. YAML Parse Error Due to Multi-Arch Encoding

Error Message

Reason:
2 problems:

- activity in progress: group.res-gke-feast-gcp.output
- activity failed: group.res-gke-kubeflow-gcp.output: activity failed: group.res-gke-kubeflow-gcp.output: exit status 1

Error: YAML parse error on istio/templates/Secret/._kubeflow-gateway-tls-secret.yaml: error converting YAML to JSON: yaml: control characters are not allowed

  with helm_release.istio,
  on main.tf line 108, in resource "helm_release" "istio":
  108: resource "helm_release" "istio" {

Possible Causes

  • The Docker image was built and pushed to ArtifactDriver from a Mac machine.
  • macOS uses a different (multi-arch) image encoding, which can prevent the GitOps Agent from pulling the correct Helm charts from the Docker image.

Resolution Steps

  • Rebuild the Docker image from an agent running on an Ubuntu machine (a sketch is shown below).
  • Push the newly built image to ArtifactDriver.
  • Retry deploying the Helm chart.
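
For example, a single-architecture rebuild might look like the following. The registry path, image name, and tag are placeholders, and this assumes ArtifactDriver is addressed like a standard Docker registry:

# Hypothetical example: rebuild a single-architecture (linux/amd64) image from an Ubuntu agent
docker build --platform linux/amd64 -t <registry>/<image>:<tag> .
# Push the rebuilt image to ArtifactDriver
docker push <registry>/<image>:<tag>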

Note: This issue typically does not occur in production but may happen in development environments.

b. Deployment Failure - Missing Required Variable (rafay_project)

Error Message

Reason:
activity failed: group.res-gke-infra-gcp.destroy: activity failed: group.res-gke-infra-gcp.destroy: exit status 1

Error: No value for required variable

  on variables.tf line 71:
  71: variable "rafay_project" {

The root module input variable "rafay_project" is not set and has no default
value. Use a -var or -var-file command line argument to provide a value for
this variable.

Possible Causes

  • The error indicates that the rafay_project variable is not set and has no default value in the Terraform configuration.
  • This may occur if the GitOps Agent is out of date and missing the necessary updates.

Resolution Steps

  • Update the GitOps Agent by pulling the latest Docker image:
docker pull <latest-agent-image>
  • Redeploy after pulling the updated image.
  • If the issue persists, manually verify that the rafay_project variable is correctly set in the Terraform configuration.
  • Pass the required variable explicitly, or supply it through a tfvars file (see the sketch below):
terraform apply -var="rafay_project=<project_name>"
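
A minimal sketch of the tfvars approach; the file name and project value are illustrative only:

# Illustrative only: provide rafay_project through a variables file
cat > terraform.tfvars <<'EOF'
rafay_project = "<project_name>"
EOF
terraform apply -var-file=terraform.tfvars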

Best Practices

If pulling the latest image and redeploying does not resolve the issue, there may be an underlying problem with the agent or the Terraform configuration.

c. Invalid TLS and Domain Selectors

Error Message

handle failed: unable to build run config for trigger 01JHNBMC8S7FCZYKRJ1JHKB6GY:  
environment template kubeflow-gcp-template variable TLS Certificate selector is invalid;  
environment template kubeflow-gcp-template variable TLS Key selector is invalid;  
environment template kubeflow-gcp-template variable Rafay Domain selector is invalid.

Possible Causes

  • The Config Context for system-dns-config is out of date.
  • The GitOps Agent is not running the latest image, leading to invalid selector references.

Resolution Steps

  • Update the GitOps Agent by pulling the latest Docker image:
docker pull <image_name>
  • Redeploy the environment template after updating the agent.

Note: This issue should not occur in production but may appear in development environments.

d. Deployment Failure - Invalid API Key in Function Call

Error Message

Error: Error in function call

  on outputs.tf line 9, in output "host":
   9:   value = yamldecode(data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig).clusters[0].cluster.server
    ├────────────────
    │ while calling yamldecode(src)
    │ data.rafay_download_kubeconfig.kubeconfig_cluster.kubeconfig is """"

Call to function "yamldecode" failed: on line 1, column 1: missing start of
document.

Possible Causes

  • This error occurs when an incorrect API Key is used for the deployment.
  • The API Key provided does not match the organization the deployment is running in, leading to a failure when retrieving the Kubernetes configuration.

Resolution Steps

  • Verify that the API Key being used is valid and active.
  • Ensure the API Key corresponds to the correct organization where the deployment is running.
  • If necessary, generate a new API Key from the Rafay Controller UI and update the deployment configuration.

Note: This issue is commonly caused by incorrect API Key usage. Ensuring the API Key matches the deployment organization will prevent this error.

e. CSRF Check Failed

Error Message

CSRF check failed. This may happen if you opened the login form in more than 1 tab. Please try to log in again.

Possible Causes

  • This error occurs when attempting to access the Kubeflow UI and logging in via Okta immediately after deployment.
  • The DNS entry may not have fully propagated, causing temporary login failures.

Resolution Steps

  • Wait for 1-5 minutes and try logging in again.
  • If the issue persists, wait for 2-7 minutes to allow DNS propagation to complete (a DNS check is sketched below).
  • Clear browser cache and cookies before retrying.
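
To confirm whether DNS has propagated, the Kubeflow hostname can be queried directly; the hostname below is a placeholder for the domain configured for your deployment:

# Replace with the Kubeflow domain configured for the environment
nslookup <kubeflow-domain>
# or, where dig is available
dig +short <kubeflow-domain>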

Note: This issue is temporary and usually resolves once DNS propagation is complete.

f. Access Denied After Successful Environment Deployment

Error Message

Kubeflow UI leads to the following error after successful environment deployment.

Access Denied

Possible Causes

  • The oidc-authservice-0 Pod in the cluster has not initialized properly.
  • This prevents proper authentication, leading to an access denial.

Resolution Steps

  • Navigate to Infrastructure → Clusters → <underlying_cluster_name> → Resources → Pods.
  • Locate oidc-authservice-0 in the istio-system namespace.
  • Delete the Pod to force a restart.

Best Practices

  • The Pod restarts automatically, going through a container initialization sequence.
  • Once it returns to a Running status (1/1), the Kubeflow UI should be accessible (a kubectl sketch follows below).
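
The same steps can also be performed from the command line; a minimal sketch, assuming kubectl access to the underlying cluster:

# Delete the stuck auth Pod so its controller recreates it
kubectl delete pod oidc-authservice-0 -n istio-system
# Watch until the Pod returns to a Running (1/1) status
kubectl get pod oidc-authservice-0 -n istio-system -w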

Capacity Issues

Below are some capacity-related issues that can occur across templates.

a. Helm Release Name Already in Use

Error Message

Error: cannot re-use a name that is still in use

  with helm_release.feast,
  on main.tf line 73, in resource "helm_release" "feast":
  73: resource "helm_release" "feast" {

time=2025-01-07T01:15:00.755Z level=ERROR msg="failed to run open tofu job" error-source=provider error="exit status 1

Error: cannot re-use a name that is still in use

  with helm_release.feast,
  on main.tf line 73, in resource "helm_release" "feast":
  73: resource "helm_release" "feast" {

Possible Causes

  • The Helm release name is already in use, preventing reinstallation.
  • This issue often occurs when the same environment deployment has been redeployed multiple times (2-4 times or more) without proper cleanup.

Resolution Steps

  • Navigate to Infrastructure → Clusters → <underlying_cluster_name> → Kubectl and run the following command to list Helm releases:

helm ls -A
  • Identify the conflicting release name (feast).
  • Uninstall the existing Helm release by running:

helm uninstall feast -n feast
  • feast is both the release name and the namespace in this case.
  • If unsure about the namespace, locate it under Infrastructure → Clusters → <underlying_cluster_name> → Resources → Pods/Deployments, or filter Helm releases from the command line as shown below.
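
A short sketch of locating the namespace from the command line instead; the namespace placeholder comes from the command output:

# List the feast release across all namespaces to find where it is installed
helm ls -A --filter '^feast$'
# Uninstall it from the namespace reported above
helm uninstall feast -n <namespace_from_output>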

Note: This issue commonly occurs when an environment deployment is redeployed multiple times without cleaning up previous Helm releases.

b. EOF Error Preventing Request Execution

Error Message

Error: 1 error occurred:
    * an error on the server ("EOF") has prevented the request from succeeding (post serviceaccounts)

Possible Cause

  • The Google account or service account used for deployment has been signed out
  • A connectivity issue interrupted the deployment process

Resolution Steps

  • Redeploy the Environment run to resume the process from where it left off
  • Ensure that the Google account/service account is active and signed in (see the check below)
  • Check for any network connectivity issues before redeploying
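
Before redeploying, the active Google account can be checked with gcloud; the key file path is a placeholder:

# Show which account/service account is currently active
gcloud auth list
# Re-authenticate a service account if it has been signed out (placeholder key file)
gcloud auth activate-service-account --key-file=<path-to-key>.json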

Note: This is a temporary issue that can be resolved by redeploying.

c. Cluster Configuration Issue - Workload Identity Not Enabled

Error Message

Option: Enable Workload Identity

This option must be set to True if the underlying cluster is deployed via the `system-gke-cluster` template; if the cluster is deployed via the Rafay Controller UI's `New Cluster` provisioning, the `Enable Workload Identity` checkbox must be checked.

Possible Cause

  • Workload Identity was not enabled during cluster creation.
  • If deployed using the system-gke-cluster template, the Enable Workload Identity option must be set to True.
  • If deployed via the Rafay Controller UI, the Enable Workload Identity checkbox must be checked.

Resolution Steps

  • Verify cluster settings: check whether Workload Identity is enabled in the GCP Kubernetes cluster settings (a gcloud check is sketched below).
  • If Workload Identity is not enabled: delete the existing cluster and create a new cluster with Workload Identity enabled.
  • Ensure the required IAM roles are assigned for authentication.
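
Whether Workload Identity is already enabled on an existing GKE cluster can be checked with gcloud; the cluster name and region below are placeholders:

# Prints the workload pool (for example, <project-id>.svc.id.goog) if Workload Identity is enabled;
# prints nothing if it is not
gcloud container clusters describe <cluster_name> --region <region> --format="value(workloadIdentityConfig.workloadPool)"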

Best Practices

If Workload Identity is not enabled, Kubeflow deployment will fail to bring up MLOps services properly.


vCluster Environment Template - Errors and Troubleshooting Guide

a. Failure at Environment Deployment, No Activity Starts

Error Message

The error is shown in the Audit Logs Console.

If the environment deployment fails and no activity is initiated, it is often caused by an issue with the Agent.

Possible Causes

  • The agent is not assigned to the environment
  • The agent is inactive or disconnected

Resolution Steps

  • If no agent is present, deploy a new agent or share an existing agent from another project
  • Check if the agent is healthy; if not, restart the agent
  • Edit the Environment Deployment, associate it with a healthy agent, and click Deploy again

b. Failure in group.*.artifact

Error Message

invalid driver config: failed to evaluate "$ctx.activities[\"group.res-gen-vcluster.artifact\"].output.files[\"job.tar.zst\"].token)$": invalid expression: output: undefined field: "job.tar.zst":

This error occurs when the Git repository or repodriver associated with the resource template is inaccessible, or there are storage issues at the backend.

Possible Causes

  • The Git repository associated with the resource template is not accessible, or the repodriver is unavailable
  • There is a storage issue at the backend
  • The repodriver architecture is not compatible with the node where the agent is running

Resolution Steps

  • Check whether the Git repository is accessible (a sketch is shown below)
  • Ensure the resource template has the correct repository, path, and branch defined
  • Verify the agent’s health and restart if necessary
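
Repository accessibility can be checked from the machine running the agent; a minimal sketch with a placeholder URL and branch:

# Succeeds only if the repository and branch are reachable with the configured credentials
git ls-remote <repository_url> <branch_name>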

c. Namespace Already Exists

Error Message

The error is shown in the Audit Logs Console.

If the namespace is in a terminating state and the vCluster template attempts to create the same namespace, the environment creation will fail.

Resolution Step

Change the namespace name and retry the deployment.
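
Before retrying, the state of the old namespace can be confirmed; a short sketch with a placeholder namespace name:

# A STATUS of Terminating means the previous namespace has not been fully removed yet
kubectl get namespace <namespace_name>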

d. Cluster Name Already Exists in the Cluster Infrastructure Console

Error Message

The error is shown in the Audit Logs Console.

By default, the vCluster template uses the environment name as the cluster name. This error occurs if a cluster with the same name already exists.

Resolution Step

Re-deploy the vCluster template with a different name.

e. Not Enough Resources

Error Message

The error is shown in the Audit Logs Console.

vCluster runs on the host cluster. If the host cluster does not have sufficient resources (minimum 4 CPUs) or if other clusters and workloads have already consumed available resources, the vCluster deployment may fail.

Resolution Step

Select a host cluster with enough free resources to support vCluster deployment.
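
Free capacity on a candidate host cluster can be checked with kubectl; the second command requires the metrics server to be installed:

# Compare allocatable CPU/memory against what is already requested on each node
kubectl describe nodes | grep -A 6 "Allocated resources"
# Show current CPU/memory usage per node (requires the metrics server)
kubectl top nodes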

f. Host Cluster Unreachable

The error is shown in the Audit Logs Console.

When the host cluster is unreachable, the deployment may fail with an error.

Possible Causes

  • The host cluster is not reachable from the agent
  • The cluster name is incorrectly typed
  • The host cluster is unhealthy

Resolution Steps

  • Verify connectivity between the agent and the host cluster to ensure communication is possible (a reachability sketch is shown below)
  • Check the cluster name for any typos or mismatches in the configuration
  • Inspect the cluster’s health status and ensure it is running properly. Restart or troubleshoot if necessary
  • Review network configurations to confirm that no firewall rules or security policies are blocking access
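
Basic reachability of the host cluster can be verified from the agent's environment; a minimal sketch, assuming a kubeconfig context for the host cluster:

# Both commands should respond without errors if the host cluster is reachable and healthy
kubectl cluster-info --context <host_cluster_context>
kubectl get nodes --context <host_cluster_context>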

g. Blueprint Sync Fails

A blueprint sync may fail for the following reasons:

  • Connectivity issues with the controller
  • Failure to deploy certain blueprint components
  • Insufficient resources in the host cluster

Resolution Step

You may need to destroy the vCluster and recreate it on the right host cluster.