Troubleshooting

Background

There are scenarios where cluster provisioning succeeds but the blueprint sync fails on day 0. The blueprint sync can also fail later, when you update software add-ons, policies, or services. This can occur, for example, when:

  • software add-ons in the blueprint are not correctly defined
  • the version of a software add-on or service (for example, OPA Gatekeeper) is incompatible with the cluster version
  • dependencies between software add-ons are not defined, resulting in a failure
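The third cause above, undeclared dependencies between add-ons, can be illustrated with a small pre-flight check. This is only a sketch under stated assumptions: the add-on names and the dictionary layout are hypothetical and do not reflect the platform's blueprint format. It uses the standard-library `graphlib` module (Python 3.9+) to derive an install order and to flag dependencies that are missing from the blueprint.

```python
from graphlib import CycleError, TopologicalSorter


def plan_addon_order(addons: dict[str, list[str]]) -> list[str]:
    """Return an install order in which every add-on follows its dependencies.

    `addons` maps an add-on name to the names it depends on. This layout is
    a hypothetical stand-in for a blueprint, not the platform's real format.
    """
    # A dependency that is not itself listed in the blueprint cannot be installed.
    missing = sorted(
        (name, dep)
        for name, deps in addons.items()
        for dep in deps
        if dep not in addons
    )
    if missing:
        details = ", ".join(f"{name} -> {dep}" for name, dep in missing)
        raise ValueError(f"dependencies not present in the blueprint: {details}")

    try:
        # static_order() yields each node only after all of its dependencies.
        return list(TopologicalSorter(addons).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


# Hypothetical add-on names, for illustration only.
blueprint_addons = {
    "cert-manager": [],
    "ingress": ["cert-manager"],
    "app-dashboard": ["ingress"],
}
print(plan_addon_order(blueprint_addons))
```

Running a check like this before publishing a blueprint version surfaces ordering problems locally instead of at sync time.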

Viewing Details Of a Blueprint Sync Failure

When a blueprint sync failure occurs, you can view the details in the Web Console by clicking the Blueprint Sync: Failed expand icon, and then take the necessary action(s).

Provisioning Failure

  • Hover over the red notification to view the reason for the failed status.

Red Notification

  • Check the details of the failed add-on(s) and the reason for the failure.

Provisioning Failure

You can make the required changes to the blueprint configuration and retry the blueprint sync.


What happens when a blueprint sync fails

Generally, when a blueprint sync fails, workload deployments on the cluster are blocked. This is because the blueprint may carry important policies, for example OPA Gatekeeper policies that define rules for application deployments. If those policies are not installed on the cluster, it can lead to security and compliance issues.

Therefore, it is recommended to act immediately to make the cluster usable again. Let's look at the steps that can be performed next.


Steps To Take Upon a Blueprint Sync Failure

  • Diagnose the general error: Sometimes the error is obvious and requires only a simple tweak to the blueprint and/or the underlying software add-ons to get it working again. For example, in the picture below, the error indicates that the OPA Gatekeeper version defined in the blueprint (3.7.1) is incompatible with the cluster version (v1.25).

Error 1

  • Triage the specific add-on deployments that failed: Sometimes only specific add-on deployments fail. Clicking the status of an add-on reveals more details.

Provisioning Failure

  • Roll back to a previous version: If you are immediately blocked, you can roll back by initiating the same blueprint update process and selecting the previous version of the blueprint.

Scenarios

Let's cover some scenarios in which blueprint sync failures can occur.

Scenario 1: Incompatibility between managed add-on/service and the cluster version

The error below is an example that occurs when provisioning a Kubernetes 1.25 cluster with a custom blueprint that contains OPA Gatekeeper 3.7.1, a version that is incompatible with Kubernetes 1.25.

Error 1

In this case, the platform throws a validation error indicating that you must update the OPA Gatekeeper version in the installation profile in your blueprint to 3.11.0 or higher.
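The kind of check behind such a validation error can be sketched as follows. This is a minimal illustration, not the platform's implementation: the function names and the lookup table are hypothetical, and the only rule encoded is the one stated above (Kubernetes 1.25 requires OPA Gatekeeper 3.11.0 or higher).

```python
def parse_version(text: str) -> tuple[int, ...]:
    """Turn '3.11.0' or 'v1.25' into a tuple of integers for comparison."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


# Hypothetical table: minimum Gatekeeper version per Kubernetes minor release.
# Only the 1.25 entry is taken from the text; other releases are not covered.
MIN_GATEKEEPER = {(1, 25): (3, 11, 0)}


def gatekeeper_is_compatible(k8s_version: str, gatekeeper_version: str) -> bool:
    """Return False when a recorded minimum exists and is not met."""
    minimum = MIN_GATEKEEPER.get(parse_version(k8s_version)[:2])
    if minimum is None:
        # Assumption of this sketch: no recorded rule means no known conflict.
        return True
    return parse_version(gatekeeper_version) >= minimum


print(gatekeeper_is_compatible("v1.25", "3.7.1"))   # False
print(gatekeeper_is_compatible("v1.25", "3.11.0"))  # True
```

Comparing integer tuples rather than raw strings matters here: as strings, "3.7.1" sorts after "3.11.0", which would wrongly report the older version as newer.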

Scenario 2: Cluster is unreachable

There may be a situation where your cluster is down or unreachable. In that case, the blueprint sync fails with a timeout error (specifically, kubeapi-proxy reports a timeout).

Timeout error
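Before retrying the sync, it can help to confirm whether the cluster's API endpoint accepts connections at all. The sketch below is a plain TCP check using Python's standard `socket` module. The host name is a placeholder, and port 6443 is an assumption (a common Kubernetes API server port; your cluster may use a different one). It only tests reachability from the machine where it runs, which may differ from the path that kubeapi-proxy uses.

```python
import socket


def api_server_reachable(host: str, port: int = 6443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections, and name-resolution failures.
        return False


if __name__ == "__main__":
    # "cluster.example.internal" is a placeholder, not a real endpoint.
    print(api_server_reachable("cluster.example.internal", timeout=2.0))
```

A `False` result points to a network or cluster availability problem rather than a problem with the blueprint itself.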

Scenario 3: Using the wrong default blueprint based on the cluster type

The system comes with blueprints for specific cluster types that can be used by default, or used as the base for a custom or golden blueprint. If the wrong default is used, however, it can lead to errors.

For example, for MKS clusters, the blueprint that should be used is default, minimal, or default-upstream. Using any other default blueprint leads to an error.

Error 1
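The rule for MKS clusters can be expressed as a simple lookup that you might run before applying a blueprint. This is a hedged sketch: the table and function are hypothetical, only the MKS entry is taken from the text above, and the valid defaults for other cluster types are not listed here.

```python
# Hypothetical lookup of allowed default (base) blueprints per cluster type.
# Only the "mks" entry comes from the text; other types are intentionally absent.
ALLOWED_DEFAULT_BLUEPRINTS = {
    "mks": {"default", "minimal", "default-upstream"},
}


def is_valid_default_blueprint(cluster_type: str, blueprint: str) -> bool:
    """Return True if `blueprint` is an allowed default for `cluster_type`."""
    allowed = ALLOWED_DEFAULT_BLUEPRINTS.get(cluster_type.lower())
    if allowed is None:
        raise KeyError(f"no rule recorded for cluster type {cluster_type!r}")
    return blueprint in allowed


print(is_valid_default_blueprint("MKS", "default-upstream"))    # True
print(is_valid_default_blueprint("MKS", "some-other-default"))  # False
```

Raising an error for unknown cluster types, rather than returning a guess, keeps the check from silently approving a combination it knows nothing about.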

Scenario 4: Attempting To Override Blueprint Fleet Configuration

A blueprint fleet can be used to update multiple clusters at once with a given blueprint. In this case, each cluster is assigned a fleet label. However, if you try to manually perform a one-time update of a single cluster that is part of the fleet, the update fails.

Error 1

To overcome this, remove the fleet label from that cluster and try again. See the blueprint fleet documentation for more details.

Scenario 5: The cluster runs out of space/memory for new add-on deployments

In some cases, your cluster may not have enough space or memory for an add-on deployment that comes with the new version of the blueprint.