
Troubleshooting

Cluster provisioning will fail if issues are detected that cannot be automatically resolved. When this occurs, the user is presented with an error message on the Web Console with a link to download the "error logs".

Provisioning Failure


Environmental

Environmental or misconfiguration issues can result in provisioning failures. Some of them are documented below.

The platform provides tooling for "pre-flight" checks on the nodes before provisioning is initiated. These pre-flight checks are designed to quickly detect environmental or misconfiguration issues that would otherwise cause cluster provisioning to fail.

Important

Please initiate provisioning ONLY after the pre-flight checks have passed successfully.
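
In addition to the platform's built-in pre-flight checks, a few quick manual checks on each node can surface common environmental problems before provisioning is initiated. A minimal sketch using standard Linux tooling; the controller URL is a placeholder and the checks shown are illustrative, not a platform requirement:

# Verify outbound HTTPS connectivity from the node (replace the URL with your controller endpoint)
curl -sS -o /dev/null -w "%{http_code}\n" https://console.example.com

# Confirm the node clock is synchronized
timedatectl status | grep synchronized

# Confirm swap is disabled (prints nothing when swap is off)
swapon --show

# Check available disk space on the root volume
df -h /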


RCTL CLI

Users who provision and manage cluster lifecycle with the RCTL CLI and declarative cluster specifications may encounter errors reported by RCTL. Some of the commonly encountered issues are listed below.

Credentials Upload

This occurs when the RCTL CLI is unable to securely copy (scp) the credentials and the conjurer binary to the remote node(s).

./rctl apply -f singlenode-test.yaml
{
  "taskset_id": "lk5owy2",
  "operations": [
    {
      "operation": "ClusterCreation",
      "resource_name": "singlenode-test1",
      "status": "PROVISION_TASK_STATUS_PENDING"
    },
    {
      "operation": "BlueprintSync",
      "resource_name": "singlenode-test1",
      "status": "PROVISION_TASK_STATUS_INPROGRESS"
    }
  ],
  "comments": "The status of the operations can be fetched using taskset_id",
  "status": "PROVISION_TASKSET_STATUS_PENDING"
}

Downloading Installer And Credentials
Copying Installer and credentials to node:  35.84.184.226
scpFileToRemote() failed, err: failed to dial: dial tcp 35.84.184.226:22: connect: operation timed out%
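
The dial timeout above typically means the node is not reachable on the SSH port from the machine running RCTL. A minimal sketch for confirming network reachability, using the IP address from the failure message (adjust the port if your nodes use a non-default SSH port):

# Check whether TCP port 22 on the node is reachable from the RCTL host
nc -vz 35.84.184.226 22

# Alternative without netcat, using bash's /dev/tcp with a timeout
timeout 5 bash -c 'cat < /dev/null > /dev/tcp/35.84.184.226/22' && echo "port open" || echo "port unreachable"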

Incorrect SSH Details

Provisioning will fail if incorrect SSH details are specified in the cluster specification file.

Error: Error performing apply on cluster "vyshak-mks-ui-test1": server error [return code: 400]: {"operations":null,"error":{"type":"Processing Error","status":400,"title":"Processing request failed","detail":{"Message":"Active Provisioning is in progress, cannot initiate another\n"}}}
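
Before re-applying the specification, it can help to confirm that the SSH user, key, and port in the cluster spec actually allow a login to the node. A minimal sketch; the user, key path, and address below are placeholders to be replaced with the values from your spec:

# Attempt a non-interactive login using exactly the values from the cluster spec
ssh -i ~/.ssh/id_rsa -p 22 -o BatchMode=yes -o ConnectTimeout=10 ubuntu@35.84.184.226 'echo ssh-ok'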

Incorrect Node Details

Provisioning will fail if incorrect node details are specified in the cluster specification file.

Error: Error performing apply on cluster "vyshak-singlenode-test1": server error [return code: 400]: {"operations":null,"error":{"type":"Processing Error","status":400,"title":"Processing request failed","detail":{"Message":"Cluster with name \"vyshak-singlenode-test1\" is partially provisioned\n"}}}

Managed Storage

Customers using rook-ceph managed storage must deploy the default-upstream blueprint to the cluster.

Step 1: Verify Blueprint Sync

The rook-ceph storage is provided as an add-on in the default-upstream blueprint, so users can verify the rook-ceph managed storage deployment using the blueprint sync icon. Refer to Update Blueprint to learn more about the blueprint sync status.

Step 2: Verify Health of Pods

On successful blueprint sync, users can view the rook-ceph pods running as shown in the example below:

kubectl -n rook-ceph get pod
NAME                                             READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-4r8c5                           3/3     Running     0          4m1s
csi-cephfsplugin-provisioner-b54db7d9b-mh7mb     6/6     Running     0          4m
csi-rbdplugin-6684r                              3/3     Running     0          4m1s
csi-rbdplugin-provisioner-5845579d68-sq7f2       6/6     Running     0          4m1s
rook-ceph-mgr-a-f576c8dc4-76z96                  1/1     Running     0          3m8s
rook-ceph-mon-a-6f6684764f-sljtr                 1/1     Running     0          3m29s
rook-ceph-operator-75fbfb7756-56hq8              1/1     Running     1          17h
rook-ceph-osd-0-5c466fd66f-8lsq2                 1/1     Running     0          2m55s
rook-ceph-osd-prepare-oci-robbie-tb-mks1-b9t7s   0/1     Completed   0          3m5s
rook-discover-5q2cq                              1/1     Running     1          17h
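
Beyond confirming that the pods are running, the overall Ceph health can also be checked. A minimal sketch, assuming the CephCluster resource lives in the rook-ceph namespace and that the optional rook-ceph-tools toolbox deployment is installed:

# The CephCluster resource reports overall health (HEALTH_OK is expected)
kubectl -n rook-ceph get cephcluster

# If the rook-ceph-tools toolbox is deployed, query Ceph directly
kubectl -n rook-ceph exec deploy/rook-ceph-tools -- ceph status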

Node Provision/Upgrade

When a node provision/upgrade/addition fails, perform the below steps:

  1. SSH to the faulty node where the provision/upgrade/addition failed
  2. Run systemctl status salt-minion-rafay to check the minion status (active/inactive)
  3. If the Salt minion is not active, restart it using the command systemctl restart salt-minion-rafay
  4. To check for any errors, inspect the logs of the active minion in the file /opt/rafay/salt/var/log/salt/minion (the commands above are combined into a short sketch after this list)
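
These steps can be run as a short shell session on the affected node; a minimal sketch (sudo may or may not be required depending on the login user):

# Check whether the Rafay salt minion is active
systemctl status salt-minion-rafay

# Restart it if it is inactive
sudo systemctl restart salt-minion-rafay

# Inspect the minion log for errors
cat /opt/rafay/salt/var/log/salt/minion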

Support

If you are unable to resolve the issue yourself, please contact Support directly or via the private Slack channel provided for your organization. The support organization is available 24x7 and will be able to assist you immediately.

Please make sure that you have downloaded the "error log file" that was shown during failure. Provide this to the support team for troubleshooting.


Remote Diagnosis and Resolution

For customers using the SaaS Controller, as long as the nodes are operational (i.e., running and reachable), the support team can, with your permission, remotely debug, diagnose, and resolve issues for you. Support will inform you whether the underlying issue is due to misconfiguration (e.g., network connectivity) or environmental problems (e.g., bad storage).

Important

Support DOES NOT require any form of inbound connectivity to perform remote diagnosis and fixes.