

The Controller continuously monitors both clusters and workloads deployed on the managed clusters. When a critical issue with the cluster or the workload is detected, the Controller generates an "Alert".

Alerts are generated when observed events persist and cannot be resolved automatically after a number of retries. The entire history of alerts is persisted on the Controller, and a reverse chronological view is available to Org Admins on the Console.

Alert Lifecycle

All alerts start life as "Open Alerts". When the underlying issue is resolved (automatically or manually) and no longer manifests, the alert is automatically "Closed".

Filters are provided to help sort and manage the alerts appropriately:

  • Alert Status (Open/Closed)
  • Type
  • Cluster
  • Severity
  • Timeframe

For every alert, the following data is presented to the user:

  • Date: When the issue was first observed and the alert was automatically generated
  • Duration: How long the issue has persisted
  • Type: See details below
  • Cluster: The cluster in which the issue was observed
  • Severity: How severe the alert is (Critical/Warning/Info)
  • Summary: Brief description of the issue
  • Description: Detailed description of the issue behind the alert
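The fields above can be modeled as a simple record. A minimal sketch in Python, assuming a hypothetical `Alert` class (the field names mirror the list above but are not the Controller's actual schema; `duration` is derived from the observation time):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

@dataclass
class Alert:
    """Illustrative alert record; not the Controller's actual schema."""
    date: datetime            # when the issue was first observed
    type: str                 # alert type (see scenarios below)
    cluster: str              # cluster in which the issue was observed
    severity: str             # "Critical", "Warning", or "Info"
    summary: str              # brief description of the issue
    description: str          # detailed description of the issue
    closed_at: Optional[datetime] = None  # set when the alert is closed

    @property
    def duration(self) -> timedelta:
        """How long the issue has persisted (grows while the alert is open)."""
        end = self.closed_at or datetime.now(timezone.utc)
        return end - self.date
```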

Closed Alerts

Alert Severity

All alerts have an associated severity. A Critical alert means the administrator needs to pay immediate attention to address the underlying issue. A Warning alert indicates an underlying issue that is trending poorly and will need attention soon. An Info alert is for informational purposes only.


For application and ops teams, the SLA can be a critical measure of their effectiveness. An alert's duration provides an excellent indication of SLA performance: ideally, issues should be triaged and resolved within minutes.

Alerts Quick View

Cluster administrators are provided with a quick view of all open alerts associated with a cluster. In the Console, navigate to the cluster card to get a bird's eye view of open alerts.

Quick View of Alerts

Alert Scenarios

The table below captures the list of scenarios that are actively monitored. Alerts are automatically generated when these scenarios occur.

Managed Clusters

Monitored Object   Description                                       Severity
Cluster            Health of pods in Critical Monitored Namespaces   Critical
Cluster            Loss of Operator Connectivity to Controller       Critical
Cluster            Low Capacity                                      Warning
Cluster            Very Low Capacity                                 Critical

Pods in Critical Namespaces

Checks whether pods in the critical, monitored namespaces ("kube-system", "rafay-system" and "rafay-infra") are healthy

Network Connectivity

The k8s Operator is unable to reach the Controller over the network

Low Capacity

Less than 20% of overall cluster capacity (CPU and Memory) available for >5 minutes

Very Low Capacity

Less than 10% of overall cluster capacity (CPU and Memory) available for >5 minutes
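The two capacity thresholds above can be expressed as a small classifier. A sketch, assuming utilization is sampled elsewhere and the >5 minute persistence check happens upstream (the function name and signature are illustrative):

```python
from typing import Optional

def capacity_severity(available: float, total: float) -> Optional[str]:
    """Map available cluster capacity (CPU or memory) to an alert severity.

    <10% available -> "Critical" (Very Low Capacity)
    <20% available -> "Warning"  (Low Capacity)
    otherwise      -> None       (healthy)
    """
    fraction = available / total
    if fraction < 0.10:
        return "Critical"
    if fraction < 0.20:
        return "Warning"
    return None
```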

Cluster Nodes

Monitored Object   Description                Severity
Node               Node in Not Ready state    Critical
Node               Node powered down          Critical
Node               High CPU load              Critical
Node               High Memory Load           Critical
Node               Disk Usage Prediction      Warning

Node Not Ready

Cluster node in "Not Ready" state for >5 minutes (e.g. due to Disk, CPU or PID Pressure)

Node Powered Down

Node powered down for >5 minutes

High CPU Load

Greater than 90% sustained CPU utilization over 5 minutes. This can result in CPU throttling of pods

High Memory Load

Greater than 80% sustained Memory utilization over 5 minutes. This can result in pods experiencing OOM Killed issues
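Both load checks follow the same pattern. A sketch, assuming the inputs are sustained utilization fractions over the 5-minute window (function and variable names are illustrative):

```python
def node_load_alerts(cpu_utilization: float, memory_utilization: float) -> list:
    """Return the node load alerts triggered by sustained utilization.

    Inputs are fractions (0.0-1.0) averaged over the 5-minute window.
    """
    alerts = []
    if cpu_utilization > 0.90:
        alerts.append("High CPU Load")     # pods may see CPU throttling
    if memory_utilization > 0.80:
        alerts.append("High Memory Load")  # pods may be OOMKilled
    return alerts
```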

Disk Usage Prediction

Prediction based on growth and usage
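The document does not specify the prediction model. A minimal sketch of one plausible approach, linear extrapolation from a recent growth rate (the function names and the 24-hour warning horizon are assumptions for illustration):

```python
from typing import Optional

def hours_until_disk_full(used: float, capacity: float,
                          growth_per_hour: float) -> Optional[float]:
    """Linearly extrapolate disk usage; None means no fill-up is predicted."""
    if growth_per_hour <= 0:
        return None  # flat or shrinking usage: nothing to predict
    return (capacity - used) / growth_per_hour

def disk_prediction_warning(used: float, capacity: float,
                            growth_per_hour: float,
                            horizon_hours: float = 24.0) -> bool:
    """Raise the Warning when the disk is projected to fill within the horizon."""
    eta = hours_until_disk_full(used, capacity, growth_per_hour)
    return eta is not None and eta <= horizon_hours
```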


Workload Health

Monitored Object   Description    Severity
Workload           Unhealthy      Critical
Workload           Degradation    Critical

Workload Unhealthy

A k8s resource (e.g. a ReplicaSet or DaemonSet) required by the workload is unavailable for >2 minutes

Workload Degradation

The P95 CPU utilization of one or more of the workload's pods is at 90% of the limit for >15 minutes
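The degradation check can be sketched with a nearest-rank P95 over a pod's CPU samples (the function name and sampling details are illustrative; the >15 minute window is assumed to be enforced by the caller):

```python
import math

def is_workload_degraded(cpu_samples: list, cpu_limit: float) -> bool:
    """True when the P95 of the pod's CPU samples is >= 90% of its limit."""
    if not cpu_samples:
        return False
    ordered = sorted(cpu_samples)
    rank = max(0, math.ceil(0.95 * len(ordered)) - 1)  # nearest-rank P95
    return ordered[rank] >= 0.90 * cpu_limit
```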

Pod Health

Monitored Object   Description             Severity
Pod                OOM Killed              Critical
Pod                Pod Pending             Critical
Pod                Frequent Pod Restart    Critical

Pod OOMKilled

Processes in the Pod have used more than the memory limit

Pod Pending

Pod pending for >5 minutes

Frequent Pod Restart

Pod restarted >3 times in 60 minutes
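The restart check can be sketched as a count over a sliding window (names are illustrative; in practice the restart timestamps would come from the pod's container statuses):

```python
from datetime import datetime, timedelta

def is_frequent_restart(restart_times: list, now: datetime,
                        window_minutes: int = 60, threshold: int = 3) -> bool:
    """True when the pod restarted more than `threshold` times in the window."""
    cutoff = now - timedelta(minutes=window_minutes)
    recent = [t for t in restart_times if t >= cutoff]
    return len(recent) > threshold
```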

PVC Health

Monitored Object   Description         Severity
PVC                PVC Unbound         Critical
PVC                Usage Prediction    Warning

PVC Unbound

PVC unbound for >5 minutes

PVC Usage Prediction

PVC projected to run out of capacity within 24 hours.