The Rafay platform is primarily delivered as a cloud-based SaaS offering (self-hosted deployment options are available as well). The Rafay Controller provides workflows that help customers manage the lifecycle of Kubernetes clusters, including workflows that keep the Kubernetes version current and up to date.
Kubernetes versions are expressed as vMajor.vMinor.vPatch. The Kubernetes project typically releases a new vMinor version every three months.
New vPatch updates are made available to address security issues and/or bug fixes.
The Kubernetes project maintains release branches for the most recent three minor releases. Applicable fixes, including security fixes, are typically made available only to these three release branches.
Rafay actively tracks the Kubernetes project for the availability of patches and minor/major versions. New versions are immediately put through a round of testing and qualification before being made available to customers. By default, the latest version of Kubernetes is used for cluster provisioning. Customers can optionally specify an older minor version from the Kubernetes support matrix during cluster provisioning. For example, the screenshot below shows the K8s version selection dropdown for provisioning on bare metal/VMs.
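The version scheme and the "most recent three minor releases" rule can be illustrated with a minimal sketch. The helper names (`parse_version`, `maintained_branches`, `is_maintained`) and the sample version list are assumptions made for illustration; they are not part of Rafay or Kubernetes tooling.

```python
import re
from typing import NamedTuple


class K8sVersion(NamedTuple):
    major: int
    minor: int
    patch: int


_VERSION_RE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_version(text: str) -> K8sVersion:
    """Parse a string such as 'v1.16.12' into (major, minor, patch)."""
    match = _VERSION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a vMajor.vMinor.vPatch version: {text!r}")
    return K8sVersion(*(int(part) for part in match.groups()))


def maintained_branches(released: list[str], count: int = 3) -> list[tuple[int, int]]:
    """Return the `count` most recent (major, minor) branches, newest first."""
    branches = {(v.major, v.minor) for v in map(parse_version, released)}
    return sorted(branches, reverse=True)[:count]


def is_maintained(version: str, released: list[str]) -> bool:
    """True if `version` belongs to one of the maintained release branches."""
    v = parse_version(version)
    return (v.major, v.minor) in maintained_branches(released)


# Sample data only; a real list would come from the published releases.
released = ["v1.14.10", "v1.15.12", "v1.16.3", "v1.16.12", "v1.17.9"]
print(maintained_branches(released))
print(is_maintained("v1.14.1", released))
```

Comparing parsed integer tuples rather than raw strings matters here, because a string comparison would rank v1.14.3 above v1.14.10.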
In Place k8s Upgrades
- Customer applications should not suffer from a lack of orchestration capabilities (e.g. autoscaling) during the k8s upgrade process.
- Kubernetes upgrades can be scheduled and performed in the customer's preferred maintenance windows.
- Kubernetes upgrades can be performed with a canary approach, i.e. one canary cluster first, followed by the remaining clusters.
- Customer applications should be able to operate in a heterogeneous k8s environment for extended periods of time, i.e. some clusters on the latest version and the rest on a prior version.
HA clusters have multiple Kubernetes masters deployed on three separate nodes. The master nodes are upgraded one at a time, ensuring there is no disruption to either customer containers or core control/management functions.
Non-HA, single-node systems have one Kubernetes master. When Kubernetes is upgraded on these systems, the control functions are paused briefly until the upgrade is complete. It is worth emphasizing that there is no impact on the customer's containers on this system while Kubernetes is being upgraded.
The worker nodes are upgraded one at a time, ensuring there is no disruption to customer containers. Before a worker node is upgraded, it is tainted with "NoSchedule" so that new pods are not scheduled on it. Once the upgrade is complete, the taint is removed.
Note that during the worker node upgrade process, there is no disruption to the data path to the customer applications.
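The worker node sequence described above (taint, upgrade, remove taint, one node at a time) can be modelled with a small in-memory sketch. It does not call the Kubernetes API; the `Node` class, the taint key, and the `upgrade_node` callback are hypothetical names used only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

NO_SCHEDULE = "NoSchedule"
UPGRADE_TAINT_KEY = "example.com/upgrade-in-progress"  # hypothetical taint key


@dataclass
class Node:
    name: str
    version: str
    taints: list[dict] = field(default_factory=list)
    pods: list[str] = field(default_factory=list)

    def schedulable(self) -> bool:
        """A node accepts new pods only if it carries no NoSchedule taint."""
        return all(t["effect"] != NO_SCHEDULE for t in self.taints)


def upgrade_workers(nodes: list[Node], target: str,
                    upgrade_node: Callable[[Node, str], None]) -> None:
    """Upgrade worker nodes strictly one at a time.

    Each node is tainted before its upgrade and untainted afterwards.
    Existing pods are never touched, so they keep running in place.
    """
    for node in nodes:
        taint = {"key": UPGRADE_TAINT_KEY, "effect": NO_SCHEDULE}
        node.taints.append(taint)      # block new pods from landing here
        upgrade_node(node, target)     # hypothetical per-node upgrade step
        node.taints.remove(taint)      # upgrade complete: accept pods again


def fake_upgrade(node: Node, target: str) -> None:
    # Stand-in for the real upgrade step; only records the new version.
    node.version = target


workers = [Node("w1", "v1.14.1", pods=["app-a"]),
           Node("w2", "v1.14.1", pods=["app-b"])]
upgrade_workers(workers, "v1.14.10", fake_upgrade)
```

In this sketch a failing `upgrade_node` leaves the taint in place, which keeps new pods off a node whose state is uncertain; that is a design choice of the example, not documented product behavior.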
When new Kubernetes versions (vMinor or vPatch) are made available by Rafay, cluster administrators are notified. For example, the cluster shown below is running an older version of Kubernetes (v1.16.12) and displays a red "upgrade available" notification badge.
Clicking on the notification badge presents the available upgrade options. For example, in the screenshot below, the user can upgrade to:
- The latest vPatch (v1.14.1 to v1.14.10)
- The next possible vMinor (v1.14.1 to v1.15.12)
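The two upgrade options listed above can be derived from a list of qualified versions. The following is a minimal sketch; the function names and the `available` list are assumptions, and how the Rafay Controller actually computes the options is not described in this document.

```python
from typing import Optional


def parse(version: str) -> tuple[int, int, int]:
    """Turn 'v1.14.10' into (1, 14, 10) so versions compare numerically."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def upgrade_options(current: str, available: list[str]) -> dict[str, Optional[str]]:
    """Return the latest vPatch on the current minor and on the next minor."""
    cur = parse(current)
    newer = [v for v in available if parse(v) > cur]
    same_minor = [v for v in newer if parse(v)[:2] == cur[:2]]
    next_minor = [v for v in newer if parse(v)[:2] == (cur[0], cur[1] + 1)]
    return {
        "latest_patch": max(same_minor, key=parse, default=None),
        "next_minor": max(next_minor, key=parse, default=None),
    }


# Sample data only.
available = ["v1.14.3", "v1.14.10", "v1.15.12", "v1.16.12"]
print(upgrade_options("v1.14.1", available))
```

With this sample list, a cluster on v1.14.1 is offered v1.14.10 as the latest vPatch and v1.15.12 as the next vMinor. v1.16.12 is not offered because it would skip a minor version.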
Only authorized users with appropriate RBAC are allowed to perform Kubernetes upgrades.
Rafay performs an "in-place" upgrade of Kubernetes. During the Kubernetes upgrade process, the nodes are cordoned (not drained) before they are upgraded. As a result, pods already resident on the node can remain where they are running with no loss of transient data in local volumes.
Cordoning a node prevents new pods from being scheduled on it. Draining a node evicts the pods currently running on it so that they are rescheduled onto a different node, which can disrupt those pods.
Preflight checks are automatically performed before the cluster is upgraded/downgraded to the new Kubernetes version. The process is terminated if preflight checks do not pass.
The following preflight checks are performed and have to pass before the upgrade process is allowed to proceed.
- Cluster Readiness (i.e. Is cluster actually provisioned and in a READY state?)
- Control Channel Health (i.e. Is the OS level control channel to the Controller active?)
- Kubeadm Internal Preflight Checks (i.e. verifies the cluster’s health, node health, etc.)
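The gating behavior of the preflight checks above (every check must pass, and the process stops at the first failure) can be sketched as follows. The check callables are hypothetical stubs; the real checks run against the cluster and are not shown in this document.

```python
from typing import Callable

Check = Callable[[], bool]


class PreflightError(RuntimeError):
    """Raised when a preflight check fails and the upgrade must not start."""


def run_preflight(checks: list[tuple[str, Check]]) -> list[str]:
    """Run checks in order; stop at the first failure.

    Returns the names of the checks that passed when all of them pass.
    """
    passed = []
    for name, check in checks:
        if not check():
            raise PreflightError(f"preflight check failed: {name}")
        passed.append(name)
    return passed


# Hypothetical stubs standing in for the three documented checks.
checks = [
    ("cluster readiness", lambda: True),
    ("control channel health", lambda: True),
    ("kubeadm internal preflight", lambda: True),
]
run_preflight(checks)  # the upgrade would proceed only after this returns
```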
The software binaries for the target Kubernetes version are downloaded from the Rafay Controller as a single TAR file. The typical size of the downloaded tar file is ~40 MB.
Post upgrade validation checks are automatically performed after the cluster is upgraded/downgraded to the new Kubernetes version. The following checks are performed and have to pass before the upgrade is deemed successful.
- Node Ready Check (i.e. did all the cluster nodes report back as READY after upgrade?)
- Pods Running Check (i.e. are all the pods in the critical namespaces running after upgrade?)
Performing k8s upgrades on unhealthy clusters can result in long validation windows because the process keeps retrying for 10 minutes to give the cluster and pods time to settle down.
The upgrade process typically takes a few minutes (3-5 minutes) and depends on the number of nodes in the cluster and the network bandwidth available for software downloads. Note that the time taken for every step is measured and displayed to the user.
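The post-upgrade validation can be thought of as a polling loop with a deadline. The sketch below assumes a 10-minute window as stated above; the 10-second polling interval, the function name, and the check callables are assumptions made for illustration.

```python
import time
from typing import Callable


def wait_until_healthy(checks: list[Callable[[], bool]],
                       timeout_s: float = 600.0,   # 10 minutes, per the text
                       interval_s: float = 10.0,   # assumed polling interval
                       clock: Callable[[], float] = time.monotonic,
                       sleep: Callable[[float], None] = time.sleep) -> bool:
    """Re-run all checks until they pass together or the deadline expires.

    Returns True as soon as every check passes, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if all(check() for check in checks):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


def nodes_ready() -> bool:
    return True  # hypothetical stub: did every node report READY?


def critical_pods_running() -> bool:
    return True  # hypothetical stub: are pods in critical namespaces running?


upgrade_ok = wait_until_healthy([nodes_ready, critical_pods_running])
```

A healthy cluster passes on the first iteration, while an unhealthy one consumes the whole window before the upgrade is reported as failed, which is why validation takes longer on unhealthy clusters. The `clock` and `sleep` parameters exist only so the loop can be exercised without real waiting.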
If the upgrade process is unsuccessful, users are presented with the option to (a) retry or (b) roll back.
Retries can be useful when transient errors occur, such as a failed download of binaries on remote clusters with poor network connectivity. A rollback returns the cluster to its original state before the upgrade was performed.
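The retry and rollback choices can be modelled as transitions on a simple job record. This is a minimal sketch under stated assumptions: the `UpgradeJob` class, its status strings, and the `apply` callback are hypothetical, and transient failures are represented by `OSError` (which includes `ConnectionError`).

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class UpgradeJob:
    original: str            # version before the upgrade
    target: str              # version being upgraded to
    current: str = ""        # version the cluster is on right now
    status: str = "pending"  # pending | failed | succeeded | rolled_back

    def __post_init__(self) -> None:
        self.current = self.current or self.original

    def run(self, apply: Callable[[str], None]) -> None:
        """Attempt the upgrade; a transient error marks the job as failed."""
        try:
            apply(self.target)
        except OSError:  # e.g. a download interrupted on a poor link
            self.status = "failed"
        else:
            self.current, self.status = self.target, "succeeded"

    def retry(self, apply: Callable[[str], None]) -> None:
        """Option (a): run the same upgrade again after a failure."""
        if self.status != "failed":
            raise ValueError("only a failed upgrade can be retried")
        self.run(apply)

    def rollback(self) -> None:
        """Option (b): return to the version recorded before the upgrade."""
        if self.status != "failed":
            raise ValueError("only a failed upgrade can be rolled back")
        self.current, self.status = self.original, "rolled_back"
```

Keeping `original` on the job record is what makes rollback possible without any further input from the user.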
The Rafay Controller maintains a history of all successful and unsuccessful upgrades.
- Navigate to the Cluster
- Click on Activity
- Click on the "eye" icon to display the details associated with a particular upgrade job
An audit entry is generated when a Kubernetes upgrade is performed. This makes it possible to retroactively determine who performed the upgrade, when, and from which version to which version.