Diagnose cluster health

Arguments: $ARGUMENTS

Goal: decide overall health, flag nodes not Ready, surface pods in problematic states, and call out events that indicate real problems. The rafay_get cluster response is the control-plane source of truth for Rafay-reported status and health (field names vary by API version); use that before kubectl, then correlate with the data plane.

Inputs

Key	Required	Description
`cluster_name`	Yes	Rafay cluster resource name.
`project_name`	Yes	Rafay project that owns the cluster. Always supply explicitly—do not rely on `RAFAY_PROJECT` or other defaults.

If cluster_name or project_name is missing, stop and ask the user to supply both:

cluster_name: <cluster>
project_name: <project>

MCP mapping: cluster_name → name on rafay_get / rafay_execute with resource_type=cluster; project_name → project-name on every Rafay call (required).
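The mapping above can be sketched as a small helper. This is only an illustration: `build_rafay_args` is a hypothetical name, and the flat dictionary shape is an assumption about how the host passes tool arguments; the real rafay MCP tool schema may differ.

```python
def build_rafay_args(inputs: dict, resource_type: str = "cluster") -> dict:
    """Map the skill inputs to the argument names in the MCP mapping.

    Hypothetical helper: the argument dict shape is an assumption,
    not the documented rafay MCP schema.
    """
    # Both inputs are required; never fall back to RAFAY_PROJECT or defaults.
    missing = [key for key in ("cluster_name", "project_name") if not inputs.get(key)]
    if missing:
        raise ValueError("missing required input(s): " + ", ".join(missing))
    return {
        "resource_type": resource_type,
        "name": inputs["cluster_name"],          # cluster_name -> name
        "project-name": inputs["project_name"],  # project_name -> project-name
    }
```

A `ValueError` here corresponds to the "stop and ask the user" step: the caller would surface the missing keys and request the two-key block again.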

Prerequisites

The rafay MCP is authenticated. Pass project-name on every call using the supplied project_name (do not skip it because RAFAY_PROJECT is set).
Read the host’s rafay MCP tool descriptors before calling tools; if unsure of resource types, call rafay_describe first.

Managing output volume

Full cluster dumps (get pods -A, unbounded events -A) are often too large for tools and for the user. Default to a narrow → widen pattern:

Summarize in prose from smaller queries; paste only lines that support the verdict (or a short capped excerpt).
Use describe sparingly: only for nodes or pods that are already flagged (NotReady, Pending for a long time, CrashLoopBackOff, etc.), and cap it at a small handful unless the user asks for more.
Widen (broader get, extra namespaces, full event stream) only if narrow checks are clean but Rafay still reports the cluster as unhealthy, or the user requests a full inventory.

If kubectl output is truncated or the command times out, say so and tighten the query (namespace, field selector, Warning-only events).
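One way to produce the "short capped excerpt" mentioned above is to keep the table header plus the most recent rows. This is a minimal sketch; `capped_excerpt` and its default of 40 rows are illustrative assumptions, not part of the skill.

```python
def capped_excerpt(output: str, max_rows: int = 40) -> str:
    """Keep the header line and the last max_rows rows of tabular kubectl output."""
    max_rows = max(1, max_rows)  # guard against a zero or negative cap
    lines = output.strip().splitlines()
    if not lines:
        return ""
    header, rows = lines[0], lines[1:]
    if len(rows) <= max_rows:
        return "\n".join(lines)
    omitted = len(rows) - max_rows
    # State the omission explicitly so the reader knows the excerpt is partial.
    marker = f"... {omitted} earlier row(s) omitted ..."
    return "\n".join([header, marker, *rows[-max_rows:]])
```

Keeping the tail rather than the head suits output sorted by time, where the newest rows come last.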

When kubectl via MCP fails

rafay_execute with action=kubectl can fail even when rafay_get cluster works—e.g. network/connectivity to the cluster through Rafay, agent or exec/tunnel path misconfiguration or outage, timeouts, RBAC denied, or the Kubernetes API unreachable from the agent.

When that happens:

Tell the user plainly that kubectl through Rafay MCP failed and that the data-plane connection (or the path from Rafay to the cluster API) appears broken or unavailable—use wording that matches the error (timeout vs permission vs connection refused), and quote the tool error if it is safe to share.
Do not pretend nodes/pods/events were checked; base the verdict on rafay_get cluster only and label the assessment incomplete regarding the workload layer.
Optionally retry one minimal probe (e.g. kubectl get nodes) if the failure might be transient; if it fails again, stop retrying and treat the path as down.
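The "match the wording to the error, retry once, then stop" behaviour can be sketched as follows. The substrings matched here are assumptions about typical error text, and `run_probe` is a hypothetical callable standing in for a minimal rafay_execute kubectl call; neither comes from the rafay MCP documentation.

```python
def classify_kubectl_error(message: str) -> str:
    """Bucket an error message so the report can use matching wording."""
    text = message.lower()
    if "timeout" in text or "timed out" in text or "deadline exceeded" in text:
        return "timeout"
    if "forbidden" in text or "unauthorized" in text or "permission" in text:
        return "permission"
    if "connection refused" in text or "no route to host" in text or "unreachable" in text:
        return "connection"
    return "unknown"


def probe_with_one_retry(run_probe) -> dict:
    """Run a minimal probe, retrying at most once before treating the path as down.

    run_probe is a hypothetical callable that returns output text or raises
    RuntimeError carrying the tool error message.
    """
    last_error = ""
    for attempt in (1, 2):
        try:
            return {"ok": True, "output": run_probe(), "attempts": attempt}
        except RuntimeError as exc:
            last_error = str(exc)
    return {
        "ok": False,
        "kind": classify_kubectl_error(last_error),
        "error": last_error,
        "attempts": 2,
    }
```

When the result has `ok` set to `False`, the assessment should be labelled incomplete for the workload layer and based on rafay_get cluster only.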

Workflow

1. Control plane: cluster object

rafay_get with resource_type=cluster, name=cluster_name, and project-name=project_name.

The cluster payload usually includes status and health-related fields (e.g. high-level state/phase, conditions, readiness, connectivity to the control plane, agent or infrastructure signals, and error or message text—exact keys depend on the API). Parse the full status (and top-level health indicators if present); do not skip this in favor of kubectl.

Quote or summarize using API field names and values verbatim; tie the narrative to those fields (e.g. “status.… shows …”).
If Rafay reports the cluster as not ready, degraded, disconnected, or similar, treat that as the primary signal and still run kubectl to explain or confirm (nodes/pods/events).
If Rafay reports healthy but kubectl shows widespread node or pod failure, call that mismatch explicitly in the synthesis.

2. Data plane: nodes

rafay_execute with resource_type=cluster, action=kubectl, same cluster name, project-name=project_name, and a command that includes the kubectl prefix (per server schema).

Node lists are usually small (dozens, not thousands). Start compact; add -o wide only if zone / IP / version matter for the question:

kubectl get nodes

Then for any node not Ready or with unknown state:

kubectl describe node <node-name>

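Interpret the node conditions Ready, MemoryPressure, DiskPressure, PIDPressure, and NetworkUnavailable, plus SchedulingDisabled, which is not a condition but appears in the STATUS column when a node is cordoned. List nodes that are NotReady or SchedulingDisabled and, if the user cares about topology, nodes that are missing expected labels/roles.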
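As an illustration of the first pass over the compact node list, the sketch below flags nodes from the default `kubectl get nodes` table. It assumes the usual NAME / STATUS column order; `flag_nodes` and the sample table are hypothetical.

```python
SAMPLE_NODES = """\
NAME     STATUS                     ROLES           AGE   VERSION
node-a   Ready                      control-plane   40d   v1.29.4
node-b   NotReady                   <none>          40d   v1.29.4
node-c   Ready,SchedulingDisabled   <none>          12d   v1.29.4
"""


def flag_nodes(table: str) -> list:
    """Return nodes that are not Ready or are cordoned, from `kubectl get nodes` text."""
    flagged = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 2:
            continue
        name, states = fields[0], fields[1].split(",")
        issues = []
        if "Ready" not in states:  # covers NotReady and Unknown
            issues.append("not Ready")
        if "SchedulingDisabled" in states:
            issues.append("cordoned")
        if issues:
            flagged.append({"node": name, "status": fields[1], "issues": issues})
    return flagged
```

Only the flagged nodes would then be passed to `kubectl describe node`, which keeps output volume low.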

3. Data plane: pods in bad states

Do not default to kubectl get pods -A -o wide—that scales poorly on busy clusters.

Default (cluster-wide, high-signal only)—pods that are not steady-state Running or completed Jobs:

kubectl get pods -A --field-selector=status.phase!=Running,status.phase!=Succeeded -o wide

If the user gives a namespace, scope there first:

kubectl get pods -n <namespace> -o wide

If that filtered list is empty, either the workload layer looks fine or the failing pods still report phase Running (e.g. CrashLoopBackOff in the STATUS column, or 0/1 in the READY column). Pods stuck in ImagePullBackOff or ErrImagePull normally stay in phase Pending, so the filter above should already list them. Then either:

Use events (Warning, namespaces tied to symptoms) and describe on implicated objects, or
Spot-check known app namespaces with kubectl get pods -n <namespace> -o wide, or
Check one namespace at a time; avoid kubectl get pods -A unless the user asks for a full inventory or narrow checks are exhausted.

Flag pods that are worth reporting, including (non-exhaustive):

Phase: Pending (especially long-running), Failed, Unknown
Restart-heavy: high RESTARTS or CrashLoopBackOff
Image / pull: ImagePullBackOff, ErrImagePull
Scheduling: Pending with 0/N containers ready—check reason in describe
Stuck terminating or unusual Init failures

For top offenders (limit to a handful unless the user wants full detail):

kubectl describe pod <pod> -n <namespace>

Note: Succeeded pods from Jobs/CronJobs may be normal; distinguish one-off completion from stuck or failing workloads.
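The flagging criteria in this step can be sketched against the text of a namespaced or `-A` pod listing. The column order (NAMESPACE, NAME, READY, STATUS, RESTARTS) matches `kubectl get pods -A`; the function name, the restart threshold of 5, and the sample rows are assumptions for illustration.

```python
SAMPLE_PODS = """\
NAMESPACE   NAME      READY   STATUS             RESTARTS        AGE
shop        api-1     1/1     Running            0               3d
shop        api-2     0/1     CrashLoopBackOff   12 (2m ago)     3d
shop        web-1     0/1     Running            0               5m
batch       job-x     0/1     Completed          0               1h
infra       agent-1   0/1     ImagePullBackOff   0               9m
"""


def flag_pods(table: str, restart_threshold: int = 5) -> list:
    """Return (namespace, name, reasons) for pods worth reporting."""
    flagged = []
    for line in table.strip().splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        namespace, name, ready, status, restarts = fields[:5]
        reasons = []
        if status not in ("Running", "Completed"):
            reasons.append(status)  # e.g. Pending, CrashLoopBackOff, ErrImagePull
        elif status == "Running":
            ready_count, _, total = ready.partition("/")
            if ready_count != total:
                reasons.append(f"{ready} ready")  # Running but not fully Ready
        if restarts.isdigit() and int(restarts) >= restart_threshold:
            reasons.append(f"{restarts} restarts")
        if reasons:
            flagged.append((namespace, name, reasons))
    return flagged
```

Completed pods are skipped here on the assumption that they are finished Jobs; a Job that keeps failing would still show up through its Failed or restarting pods.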

4. Data plane: events to care about

Event streams are often noisier and larger than node lists. Default to Warnings cluster-wide, then sort by time:

kubectl get events -A --field-selector type=Warning --sort-by=.lastTimestamp

In the answer, prioritize the most recent slice (e.g. the last ~30–80 lines' worth of content); do not paste thousands of lines. Group events that repeat the same reason and involved object into one line with a count.

If Warning-only is empty or the story needs Normal events, narrow by namespace (same as failing pods) before going fully unfiltered:

kubectl get events -n <namespace> --sort-by=.lastTimestamp

Only if still inconclusive, use a broader event query—and still cap what you quote to recent, relevant rows.

Focus on:

Reasons such as Failed, FailedScheduling, BackOff, Unhealthy, Killing, FailedMount, FailedAttachVolume, and Evicted; also OOMKilled, which usually shows up as a container's last termination reason in pod describe rather than as an event reason
Bursting repeats on the same object (same involved object + reason)
Very old events alone with no matching current pod/node issues—treat as historical unless the user asks for a timeline
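Grouping repeated reason + involved object pairs into one counted line might look like the sketch below. It assumes the default `kubectl get events -A` column order (NAMESPACE, LAST SEEN, TYPE, REASON, OBJECT, MESSAGE); `group_events` and the sample rows are hypothetical.

```python
from collections import Counter

SAMPLE_EVENTS = """\
NAMESPACE   LAST SEEN   TYPE      REASON    OBJECT        MESSAGE
shop        12m         Warning   BackOff   pod/api-2     Back-off restarting failed container
shop        7m          Warning   BackOff   pod/api-2     Back-off restarting failed container
infra       5m          Warning   Failed    pod/agent-1   Failed to pull image
shop        2m          Warning   BackOff   pod/api-2     Back-off restarting failed container
"""


def group_events(table: str) -> list:
    """Collapse repeated (namespace, object, reason) rows into counted lines."""
    counts = Counter()
    for line in table.strip().splitlines()[1:]:  # skip the header row
        # maxsplit=5 keeps the free-text MESSAGE column in one piece
        parts = line.split(None, 5)
        if len(parts) < 5:
            continue
        namespace, _last_seen, _type, reason, obj = parts[:5]
        counts[(namespace, obj, reason)] += 1
    # Most frequent first, so bursting repeats lead the summary.
    return [f"{ns} {obj} {reason} x{n}" for (ns, obj, reason), n in counts.most_common()]
```

The counted lines are meant for the summary; the raw rows still matter for judging whether a burst is current or stale.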

5. Optional: addons / blueprint sync

If symptoms point at platform components or addon drift, rafay_list with resource_type=cluster_addon, name=cluster_name, and project-name=project_name can show sync failures. For a full blueprint/addon narrative, prefer the diagnose-blueprint-sync skill.

Synthesis

Produce a short verdict:

Healthy / degraded / unhealthy: lead with what rafay_get cluster reported for status/health, then nodes and critical pods—unless kubectl never succeeded, in which case say data-plane checks could not run and that connectivity/exec via Rafay appears broken, then summarize Rafay-only signals. When both exist, note any disagreement between Rafay and the data plane.
Nodes: list NotReady or cordoned nodes and the main condition/message.
Pods: grouped by namespace (or cluster-wide top N): name, state/reason, restarts, what to check next.
Events: only actionable or recurring lines; separate current vs stale noise.

Call out mismatches (e.g. Rafay says healthy but many nodes NotReady) and the next concrete check or fix.
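The verdict logic can be summarized in a short sketch. The mapping from signals to healthy / degraded / unhealthy below is an assumption made for illustration; the skill leaves the exact thresholds to judgment, and `synthesize` is a hypothetical name.

```python
def synthesize(rafay_healthy: bool, kubectl_ok: bool,
               bad_nodes: list, bad_pods: list) -> dict:
    """Combine the Rafay-reported status with data-plane findings into a verdict."""
    if not kubectl_ok:
        # Data-plane checks never ran: verdict rests on rafay_get cluster only.
        return {
            "verdict": "healthy" if rafay_healthy else "unhealthy",
            "incomplete": True,
            "mismatch": False,
        }
    data_plane_bad = bool(bad_nodes or bad_pods)
    # Mismatch: Rafay says healthy but the data plane is bad, or the reverse.
    mismatch = rafay_healthy == data_plane_bad
    if rafay_healthy and not data_plane_bad:
        verdict = "healthy"
    elif not rafay_healthy and data_plane_bad:
        verdict = "unhealthy"
    else:
        verdict = "degraded"
    return {"verdict": verdict, "incomplete": False, "mismatch": mismatch}
```

A result with `mismatch` set is the case to call out explicitly, and a result with `incomplete` set should say that connectivity or exec via Rafay appears broken.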

Escalation

On auth errors, missing cluster, rafay_execute / kubectl failures, timeouts, or empty kubectl output where output is expected, state the blocker explicitly. For kubectl failures, include that the connection or exec path to the cluster may be broken and suggest the user verify cluster/agent connectivity, Rafay agent health, and permissions for the executing identity. Ask the user to confirm cluster_name, project_name, MCP credentials, and that the cluster can run kubectl through Rafay. If either input is missing or wrong, ask for the structured YAML block again with both keys.