
Troubleshooting

This section describes errors that frequently occur during cluster provisioning.


Resource Provisioning Failures

Scenario 1: Instance Type Not Supported

The following error is an example of what might occur when provisioning a cluster or adding a new nodegroup to an existing cluster.

Error 1

Validation

To resolve this issue, perform the following validations for instance types in a region:

  • Check that your Cloud Credentials (role-based, or access key ID and secret access key) have the required permissions to call the AWS EC2 APIs. If the Cloud Credentials are role-based, ensure that the role has all the required IAM policies
  • Check whether the configuration specifies an instance type that is not available in the selected region
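The second check can be scripted. The following is a minimal sketch, not part of the product, that assumes a boto3 EC2 client created with the same cloud credentials and region; the function name `unavailable_instance_types` is hypothetical. It uses the EC2 `DescribeInstanceTypeOfferings` API to find requested instance types that the region does not offer.

```python
def unavailable_instance_types(ec2_client, instance_types):
    """Return the requested instance types that the client's region does not offer.

    `ec2_client` is assumed to be a boto3 EC2 client, for example
    boto3.client("ec2", region_name="us-west-2").
    """
    requested = list(instance_types)
    offered = set()
    kwargs = {
        "LocationType": "region",
        "Filters": [{"Name": "instance-type", "Values": requested}],
    }
    # Follow NextToken in case the result is split across several pages.
    while True:
        page = ec2_client.describe_instance_type_offerings(**kwargs)
        for offering in page.get("InstanceTypeOfferings", []):
            offered.add(offering["InstanceType"])
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    return sorted(set(requested) - offered)
```

An empty list means every requested type is offered in the region. If the call itself is rejected, the credentials most likely lack the `ec2:DescribeInstanceTypeOfferings` permission, which corresponds to the first check.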

Scenario 2: Availability Zones

The following error is an example of what might occur during EKS cluster provisioning when the cloud credentials do not have permission to create resources in the selected region.

Error 2

Validation

Verify that the cloud credentials used for cluster provisioning have permission to create resources in the configured region.
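One way to check this without creating anything is the IAM policy simulator. The sketch below is an illustration under assumptions: it expects a boto3 IAM client, the function name `denied_actions` is hypothetical, and the action names in the usage note are examples only, not the complete set of permissions that provisioning requires. It calls `SimulatePrincipalPolicy` with the `aws:RequestedRegion` context key set to the target region.

```python
def denied_actions(iam_client, principal_arn, actions, region):
    """Return {action: decision} for each action that is not allowed in `region`.

    `iam_client` is assumed to be a boto3 IAM client; `principal_arn` is the
    ARN of the IAM user or role behind the cloud credentials.
    """
    response = iam_client.simulate_principal_policy(
        PolicySourceArn=principal_arn,
        ActionNames=list(actions),
        ContextEntries=[
            {
                "ContextKeyName": "aws:RequestedRegion",
                "ContextKeyValues": [region],
                "ContextKeyType": "string",
            }
        ],
    )
    # EvalDecision is "allowed", "explicitDeny", or "implicitDeny".
    return {
        result["EvalActionName"]: result["EvalDecision"]
        for result in response["EvaluationResults"]
        if result["EvalDecision"] != "allowed"
    }
```

For example, `denied_actions(iam, role_arn, ["eks:CreateCluster", "cloudformation:CreateStack"], "us-west-2")` returns an empty dictionary when the simulator allows both actions. The simulation is an approximation, so a clean result does not guarantee that provisioning will succeed.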


Scenario 3: Instance Type Permission

The following error is an example of what might occur when the cloud credentials do not have permission to use a particular instance type specified in the EKS cluster configuration.

Error 2

Validation

  • Check the permissions of the cloud credentials and use an instance type that they are allowed to launch
  • Correct the permissions in AWS so that the configured instance type can be used
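The permission check can be approximated with an EC2 dry run, which evaluates permissions without launching anything. This is a rough sketch assuming a boto3 EC2 client; the function name `can_launch_instance_type` and the AMI ID passed to it are placeholders. With `DryRun=True`, EC2 answers with the error code `DryRunOperation` when the request would have been permitted and `UnauthorizedOperation` when it would not.

```python
def can_launch_instance_type(ec2_client, instance_type, image_id):
    """Return True if a dry-run RunInstances call for `instance_type` is permitted.

    `ec2_client` is assumed to be a boto3 EC2 client built from the same
    cloud credentials; `image_id` must be an AMI that exists in the region.
    """
    try:
        ec2_client.run_instances(
            ImageId=image_id,
            InstanceType=instance_type,
            MinCount=1,
            MaxCount=1,
            DryRun=True,  # permission check only, nothing is launched
        )
    except Exception as exc:
        # botocore's ClientError exposes the parsed error as `exc.response`.
        code = getattr(exc, "response", {}).get("Error", {}).get("Code")
        if code == "DryRunOperation":
            return True
        if code == "UnauthorizedOperation":
            return False
        raise  # any other error (bad AMI, throttling, ...) is not a permission answer
    raise RuntimeError("dry run returned without an error, which is unexpected")
```

A result of `False` points to an IAM policy that restricts the instance type for these credentials. The real node launch may use different parameters (subnet, tags, launch template), so treat the result as a hint rather than proof.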

Scenario 4: K8s Version Upgrade

When upgrading the k8s version to 1.25, the following error occurs if the aws-load-balancer-controller version is 2.4.6. The preflight check fails and the upgrade is halted.

Error 2

Validation

Update the aws-load-balancer-controller to v2.4.7, and then upgrade the k8s version to 1.25.
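The version rule can be expressed as a small comparison. The sketch below is illustrative only and is not the product's preflight check; the names `parse_version` and `lbc_blocks_upgrade` are hypothetical. It assumes that any controller version older than 2.4.7 is affected when the target k8s version is 1.25 or later, whereas the text above only confirms the 2.4.6 to 2.4.7 case for 1.25.

```python
MIN_CONTROLLER_VERSION = (2, 4, 7)  # minimum aws-load-balancer-controller version named above
AFFECTED_K8S_VERSION = (1, 25)      # assumption: 1.25 and later are affected


def parse_version(text):
    """Turn 'v2.4.7' or '2.4.7' into a tuple of ints such as (2, 4, 7)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def lbc_blocks_upgrade(controller_version, target_k8s_version):
    """Return True if the controller must be updated before the k8s upgrade."""
    if parse_version(target_k8s_version)[:2] < AFFECTED_K8S_VERSION:
        return False
    return parse_version(controller_version) < MIN_CONTROLLER_VERSION
```

Tuples compare element by element, so `2.10.0` is correctly treated as newer than `2.4.7`, which a plain string comparison would get wrong. Pre-release suffixes such as `-rc1` are not handled.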


Scenario 5: Removal of PSPs

The following error is an example of what might occur when PSPs (Pod Security Policies) are found during the k8s version upgrade to 1.25.

Error 2

Validation

PSPs are no longer supported in k8s v1.25. Remove the PSPs and then retry the upgrade.
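To see which PSPs still exist before retrying, the cluster can be queried with kubectl. This is a minimal sketch assuming kubectl is installed and pointed at the cluster being upgraded; the helper names `fetch_psp_json` and `psp_names` are hypothetical. It parses the JSON list returned by `kubectl get podsecuritypolicies -o json`.

```python
import json
import subprocess


def fetch_psp_json():
    """Run kubectl against the current context and return the raw JSON output."""
    completed = subprocess.run(
        ["kubectl", "get", "podsecuritypolicies", "-o", "json"],
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


def psp_names(kubectl_json):
    """Extract the PSP names from `kubectl get podsecuritypolicies -o json` output."""
    data = json.loads(kubectl_json)
    return [item["metadata"]["name"] for item in data.get("items", [])]
```

Calling `psp_names(fetch_psp_json())` lists the remaining policies. Review each one and migrate any workloads that depend on it before deleting it.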


AWS Cloud Errors

EKS cluster provisioning might fail due to various AWS Cloud errors. These errors can stem from resource limits, network connectivity issues, misconfigurations in the provisioning process, insufficient permissions, outages in required AWS services, software bugs, and region-specific constraints.

To view the failure and its underlying cause, click Provision Status of the failed cluster.

Error 2

Click the Errors tab and expand the Cloud Error(s) section to view detailed information about the AWS CloudFormation errors encountered during cluster provisioning. Use these details to identify the root cause and take corrective action.

Error 2

Logs & Events

AWS Cloud Debug Logs & Events - Coming Soon

If cluster provisioning fails on Day 0, diagnose and resolve the underlying issue promptly. In addition to the Errors tab, users can click the Logs & Events tab to view the CloudFormation stack events from AWS. These logs date back to the creation of the stacks, so the events leading up to the provisioning failure can be examined in full.
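The same stack events can also be read directly from AWS. The following is a sketch under assumptions, not the product's implementation: it expects a boto3 CloudFormation client, the function name `failed_stack_events` is hypothetical, and the stack name must be looked up in the AWS account. It uses `DescribeStackEvents` and keeps only events whose status ends in `_FAILED`.

```python
def failed_stack_events(cfn_client, stack_name):
    """Return the failed events of a CloudFormation stack, oldest first.

    `cfn_client` is assumed to be a boto3 CloudFormation client, for example
    boto3.client("cloudformation", region_name="us-west-2").
    """
    failures = []
    kwargs = {"StackName": stack_name}
    # Follow NextToken because long-lived stacks return several pages of events.
    while True:
        page = cfn_client.describe_stack_events(**kwargs)
        for event in page.get("StackEvents", []):
            if event.get("ResourceStatus", "").endswith("_FAILED"):
                failures.append(
                    {
                        "timestamp": event["Timestamp"],
                        "resource": event.get("LogicalResourceId"),
                        "status": event["ResourceStatus"],
                        "reason": event.get("ResourceStatusReason", ""),
                    }
                )
        token = page.get("NextToken")
        if not token:
            break
        kwargs["NextToken"] = token
    # The API returns newest events first; sort so the root cause comes first.
    failures.sort(key=lambda item: item["timestamp"])
    return failures
```

The earliest failed event usually carries the most useful `reason`, because later failures are often a consequence of the rollback that the first one triggered.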

Logs & Events are available for failures during infra creation and deletion, during bootstrap node creation and deletion, and while bootstrap creation is in progress.

Error 2

Logs & Events are also available when cluster provisioning is complete but the operation status is not ready, even though nodes are being created. This discrepancy might occur if the blueprint status is not initiated when incorrect images are applied.

Error 2

If nodegroup creation fails on Day 2, users can pull the Logs & Events of that specific nodegroup. To do so, click the corresponding 'nodegroup creation failed' link and review the details.

Error 2

In addition to retrieving logs and events through the user interface, users can also pull cloud logs and events using the API and CLI.