Troubleshooting
Cluster provisioning fails when issues are detected that cannot be overcome automatically. When this occurs, the Web Console displays an error message with a link to download the "error logs".
Environmental¶
Environmental or misconfiguration issues can result in provisioning failures. Some of them are documented below.
The platform provides tooling for "pre-flight" checks on the nodes before initiating provisioning. These pre-flight checks are designed to quickly detect environmental or misconfiguration issues that would cause cluster provisioning to fail.
Important
Please initiate provisioning ONLY after the pre-flight checks have passed successfully.
RCTL CLI¶
Users who use the RCTL CLI with declarative cluster specifications for cluster provisioning and lifecycle management may encounter errors reported by RCTL. Some of the commonly encountered issues are listed below.
Credentials Upload¶
This occurs when the RCTL CLI is unable to secure copy the credentials and the conjurer binary to the remote node(s).
./rctl apply -f singlenode-test.yaml
{
"taskset_id": "lk5owy2",
"operations": [
{
"operation": "ClusterCreation",
"resource_name": "singlenode-test1",
"status": "PROVISION_TASK_STATUS_PENDING"
},
{
"operation": "BlueprintSync",
"resource_name": "singlenode-test1",
"status": "PROVISION_TASK_STATUS_INPROGRESS"
}
],
"comments": "The status of the operations can be fetched using taskset_id",
"status": "PROVISION_TASKSET_STATUS_PENDING"
}
Downloading Installer And Credentials
Copying Installer and credentials to node: 35.84.184.226
scpFileToRemote() failed, err: failed to dial: dial tcp 35.84.184.226:22: connect: operation timed out
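The timeout in the output above indicates that TCP port 22 on the node could not be reached from the machine running RCTL. The following is a minimal sketch of a reachability check that could be run before applying the cluster specification. It assumes bash and the coreutils `timeout` command are available; the function name `ssh_port_reachable` is hypothetical and is not part of RCTL.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of RCTL): returns 0 if a TCP connection
# to HOST:PORT can be opened within 5 seconds, non-zero otherwise.
ssh_port_reachable() {
    local host="$1" port="${2:-22}"
    # bash's /dev/tcp pseudo-device opens a TCP connection;
    # timeout bounds how long we wait for it.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$host" "$port" 2>/dev/null
}

# Example usage: check a node before applying the cluster specification.
# if ssh_port_reachable 35.84.184.226; then
#     ./rctl apply -f singlenode-test.yaml
# else
#     echo "port 22 unreachable; check firewall or security group rules" >&2
# fi
```

A failed check only shows that the port is unreachable from the local machine. Incorrect SSH credentials would still need to be verified separately.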
Incorrect SSH Details¶
Provisioning will fail if incorrect SSH details are specified in the cluster specification file.
Error: Error performing apply on cluster "vyshak-mks-ui-test1": server error [return code: 400]: {"operations":null,"error":{"type":"Processing Error","status":400,"title":"Processing request failed","detail":{"Message":"Active Provisioning is in progress, cannot initiate another\n"}}}
Incorrect Node Details¶
Provisioning will fail if incorrect node details are specified in the cluster specification file.
Error: Error performing apply on cluster "vyshak-singlenode-test1": server error [return code: 400]: {"operations":null,"error":{"type":"Processing Error","status":400,"title":"Processing request failed","detail":{"Message":"Cluster with name \"vyshak-singlenode-test1\" is partially provisioned\n"}}}
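For scripted workflows, it can help to recognize the two server messages shown in the examples above. The following is a minimal sketch that matches on those message substrings. The function name `classify_rctl_error` and the labels it prints are hypothetical, and the sketch assumes the error text has already been captured into a variable.

```shell
# Hypothetical helper: map an RCTL error line to a short label by matching
# the message substrings shown in the examples above.
classify_rctl_error() {
    case "$1" in
        *"Active Provisioning is in progress"*) echo "provisioning-in-progress" ;;
        *"is partially provisioned"*)           echo "partially-provisioned" ;;
        *)                                      echo "unknown" ;;
    esac
}

# Example usage (assumes rctl exits non-zero on failure, which is an
# assumption and not confirmed here):
# err="$(./rctl apply -f singlenode-test.yaml 2>&1)" || classify_rctl_error "$err"
```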
Managed Storage¶
Customers using rook-ceph managed storage must deploy the default-upstream blueprint to the cluster.
Step 1: Verify Blueprint Sync¶
The rook-ceph storage is provided as an add-on with the default-upstream blueprint, so users can verify the rook-ceph managed storage deployment using the blueprint sync icon. Refer to Update Blueprint to learn more about the blueprint sync status.
Step 2: Verify Health of Pods¶
After a successful blueprint sync, users can view the running rook-ceph pods as shown in the example below:
kubectl -n rook-ceph get pod
NAME                                             READY   STATUS      RESTARTS   AGE
csi-cephfsplugin-4r8c5                           3/3     Running     0          4m1s
csi-cephfsplugin-provisioner-b54db7d9b-mh7mb     6/6     Running     0          4m
csi-rbdplugin-6684r                              3/3     Running     0          4m1s
csi-rbdplugin-provisioner-5845579d68-sq7f2       6/6     Running     0          4m1s
rook-ceph-mgr-a-f576c8dc4-76z96                  1/1     Running     0          3m8s
rook-ceph-mon-a-6f6684764f-sljtr                 1/1     Running     0          3m29s
rook-ceph-operator-75fbfb7756-56hq8              1/1     Running     1          17h
rook-ceph-osd-0-5c466fd66f-8lsq2                 1/1     Running     0          2m55s
rook-ceph-osd-prepare-oci-robbie-tb-mks1-b9t7s   0/1     Completed   0          3m5s
rook-discover-5q2cq                              1/1     Running     1          17h
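On larger clusters, scanning the pod list by eye can be tedious. The following is a minimal sketch that filters the default `kubectl get pod` output for pods whose status is neither Running nor Completed. The function name `unhealthy_pods` is hypothetical, and the sketch assumes the default column layout in which STATUS is the third column.

```shell
# Hypothetical helper: read `kubectl get pod` output on stdin and print the
# names of pods whose STATUS is neither Running nor Completed.
unhealthy_pods() {
    # NR > 1 skips the header row; column 3 is STATUS in the default output.
    awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1 }'
}

# Example usage:
# kubectl -n rook-ceph get pod | unhealthy_pods
```

Empty output means every listed pod is either Running or Completed. The sketch does not inspect the READY column.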
Node Provision/Upgrade¶
When a node provision/upgrade/addition fails, perform the below steps:
- SSH to the faulty node where the provision/upgrade/addition failed
- Run systemctl status salt-minion-rafay to check the minion status (active/inactive)
- If the Salt minion is not active, restart it using the command systemctl restart salt-minion-rafay
- To check for any errors, review the log of the active minion with cat /opt/rafay/salt/var/log/salt/minion
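The steps above can be sketched as a small function to run on the affected node. This is an illustrative sketch only: the function name `check_minion` is hypothetical, the number of log lines shown is arbitrary, and restarting the service is assumed to require root privileges.

```shell
# Hypothetical helper: restart the Salt minion only when it is not active,
# then show the most recent lines of its log.
check_minion() {
    unit="salt-minion-rafay"
    log="/opt/rafay/salt/var/log/salt/minion"

    # `systemctl is-active --quiet` exits 0 only when the unit is active.
    if systemctl is-active --quiet "$unit"; then
        echo "$unit is active"
    else
        echo "$unit is not active, restarting"
        systemctl restart "$unit"
    fi

    # Print the tail of the minion log if the file is readable.
    if [ -r "$log" ]; then
        tail -n 50 "$log"
    fi
    return 0
}

# Example usage (on the node, as root):
# check_minion
```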
Support¶
If you are unable to resolve the issue yourself, please contact Support directly or through the private Slack channel provided for your organization. The support organization is available 24x7 and will be able to assist you immediately.
Please make sure that you have downloaded the "error log file" that was shown during failure. Provide this to the support team for troubleshooting.
Remote Diagnosis and Resolution¶
For customers using the SaaS Controller, the support team can, with your permission, remotely debug, diagnose, and resolve issues as long as the nodes are operational (i.e. running and reachable). Support will inform you if the underlying issue is due to misconfiguration (e.g. network connectivity) or environmental issues (e.g. bad storage).
Important
Support DOES NOT require any form of inbound connectivity to perform remote diagnosis and fixes.