
AI/ML Superpowers for Kubernetes Troubleshooting

In the last two blogs (part 1 and part 2), we discussed the challenges customers face with running AI/ML on Kubernetes and innovative solutions to address these challenges. In this blog, we will flip this on its head and look at how AI/ML can make Kubernetes easier to use and operate.

Inflection Point for AI

The technology behind AI has improved steadily over the past few decades, with each phase building on the one before it.

The 2000s

The primary advances in this phase were using machines for analyzing data, finding patterns, generating insights, making predictions and automating tasks at a pace and on a scale that was previously impossible.

The 2010s

The primary advances in this phase were in perception capabilities such as computer vision, detection and classification, and voice recognition.

The 2020s

This decade has so far been the phase of "Generative AI", built on GPT-based large language models (LLMs). Because LLMs are trained on massive datasets, they can pick up a great deal of history, context and intent. In a nutshell, the working theory is that anything that can be conveyed through language can be addressed by Generative AI.

Looking Ahead

The capabilities of GPT models have increased dramatically with each generation, driven in part by rapid growth in the number of parameters. GPT-3, for example, has more than 100 times as many parameters as GPT-2.

  • 2019 - GPT-2 (1.5B parameters)
  • 2020 - GPT-3 (175B parameters)
  • 2023 - GPT-4 (parameter count not disclosed; speculated to be in the trillions)

A few days back, OpenAI published a technical report on GPT-4.

According to the report, this technology already outscores the vast majority of human test takers on a number of professional and academic exams. If you project five years out, it is logical to assume that it is going to be very disruptive and bring about unprecedented productivity gains.

Kubernetes and AI

Ever since OpenAI opened up API access to their LLMs, the Rafay team has been actively experimenting with use cases where AI can be applied to address significant challenges with Kubernetes. One of the first areas we have been exploring is Kubernetes troubleshooting, since we see users struggle with it when they are new to Kubernetes.

AI to Troubleshoot Kubernetes

Kubernetes is a fairly complex orchestration platform with a relatively steep learning curve on the way to expert-level proficiency. Troubleshooting issues in Kubernetes can be complex and time-consuming, which makes it an ideal candidate for GPT-based AI.

Introducing K8sGPT

k8sgpt is a relatively new project. It scans Kubernetes clusters so that developers and SREs can diagnose and triage issues quickly. It leverages LLM-based AI to process complex signals and provide prescriptive guidance to users.
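
To make the "scan, then ask an LLM" idea concrete, here is a minimal Python sketch. It is an illustration only and not how k8sgpt is actually implemented: the function names, the list of waiting reasons and the prompt wording are all assumptions made for this example. It takes a pod list shaped like the output of `kubectl get pods -o json`, picks out containers stuck in a problematic waiting state, and turns each one into a plain-language question that could be sent to an LLM.

```python
# Illustrative sketch only; not the k8sgpt implementation.
# Waiting reasons treated as problems (an assumed, non-exhaustive list).
PROBLEM_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "CreateContainerConfigError",
}


def find_pod_problems(pod_list):
    """Return one finding per container stuck in a problematic waiting state.

    `pod_list` is a dict shaped like the output of `kubectl get pods -o json`.
    """
    findings = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        statuses = pod.get("status", {}).get("containerStatuses", [])
        for status in statuses:
            waiting = status.get("state", {}).get("waiting")
            if waiting and waiting.get("reason") in PROBLEM_REASONS:
                findings.append({
                    "pod": f"{meta.get('namespace', 'default')}/{meta.get('name', '?')}",
                    "container": status.get("name", "?"),
                    "reason": waiting["reason"],
                    "message": waiting.get("message", ""),
                })
    return findings


def build_prompt(finding):
    """Turn a finding into a question for an LLM (hypothetical wording)."""
    return (
        f"Container '{finding['container']}' in pod '{finding['pod']}' "
        f"is waiting with reason {finding['reason']}. "
        f"Reported message: \"{finding['message']}\". "
        "Explain the likely cause and suggest a fix."
    )


if __name__ == "__main__":
    # Hand-written sample data standing in for real cluster output.
    sample = {"items": [{
        "metadata": {"name": "web-1", "namespace": "shop"},
        "status": {"containerStatuses": [{
            "name": "app",
            "state": {"waiting": {"reason": "ImagePullBackOff",
                                  "message": "Back-off pulling image"}},
        }]},
    }]}
    for finding in find_pod_problems(sample):
        print(build_prompt(finding))
```

The sketch stops short of calling an LLM on purpose: the point is that the scan produces structured findings, and only those findings need to be described in natural language for the model.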

Although k8sgpt can be used as a standalone CLI utility to analyze clusters on demand, it is more practical to deploy it as a Kubernetes Operator. In this mode, users can monitor their clusters continuously and can optionally integrate it with monitoring tools such as Prometheus.


Developers and SREs can use the K8sGPT CLI utility with Rafay's Zero Trust Kubectl Access to analyze their cloaked, remote clusters operating behind firewalls.

This approach can be practical when a developer or SRE wishes to analyze a cluster using K8sGPT on demand.
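
As a rough illustration of the on-demand workflow, the Python sketch below builds and runs a `k8sgpt analyze` command against a kubeconfig downloaded for Zero Trust Kubectl Access. The kubeconfig path and the helper names are hypothetical. The `--kubeconfig`, `--explain`, `--namespace` and `--output` flags are the ones we understand the k8sgpt CLI to accept; confirm them with `k8sgpt analyze --help` for your version. The sketch also assumes the CLI has already been configured with credentials for its AI backend.

```python
import os
import subprocess


def k8sgpt_analyze_cmd(kubeconfig, namespace=None, explain=True):
    """Build the argument list for an on-demand k8sgpt analysis."""
    cmd = ["k8sgpt", "analyze", "--kubeconfig", kubeconfig, "--output", "json"]
    if explain:
        # Ask the configured AI backend to explain each finding.
        cmd.append("--explain")
    if namespace:
        cmd += ["--namespace", namespace]
    return cmd


def run_analysis(kubeconfig, namespace=None):
    """Run k8sgpt and return its raw JSON output as a string."""
    cmd = k8sgpt_analyze_cmd(kubeconfig, namespace=namespace)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical location of the downloaded ZTKA kubeconfig.
    ztka_kubeconfig = os.path.expanduser("~/.kube/ztka-kubeconfig.yaml")
    print(" ".join(k8sgpt_analyze_cmd(ztka_kubeconfig, namespace="default")))
```

Because the kubeconfig points at the Zero Trust Kubectl proxy rather than at the cluster directly, the same command works for clusters behind firewalls without any extra network setup on the developer's side.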

flowchart LR
    subgraph rafay[Rafay]
        ztka[Zero Trust Kubectl Proxy -ZTKA]
    end

    subgraph syd[Sydney]
        k1[Kubernetes Cluster]
    end
    subgraph sfo[San Francisco]
        k2[Kubernetes Cluster]
    end
    subgraph nyc[New York City]
        k5[Kubernetes Cluster]
    end

    rafay-.->nyc[New York City]

    subgraph bay[Developer/SRE]
        direction TB
        k8sgpt[K8sGPT CLI] --> kubectl[ZTKA kubeconfig]
    end
    k8sgpt -->openai[OpenAI API]

K8sGPT Operator

For continuous monitoring, it is ideal to deploy the K8sGPT operator on every cluster in the organization. With Rafay, this is a one-time task: users create a cluster blueprint with a K8sGPT add-on and deploy the blueprint to all their clusters.

This approach is practical at scale for ongoing, continuous monitoring and analysis of the Kubernetes clusters in the organization. Issues can be highlighted to users early, allowing them to act before the issues escalate.
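
To give a feel for how continuous findings might be surfaced to users, here is a small Python sketch that summarizes analysis results. It assumes each result is a dict with a `spec` holding `kind`, `name` and a list of `error` entries, which is our understanding of the result objects the operator writes to the cluster. Treat that shape, and the helper name, as assumptions and check them against the operator version you deploy.

```python
from collections import Counter


def summarize_results(results):
    """Count findings per resource kind and list the affected objects.

    `results` is a list of dicts with an ASSUMED shape:
    {"spec": {"kind": ..., "name": ..., "error": [{"text": ...}, ...]}}
    """
    by_kind = Counter()
    affected = []
    for result in results:
        spec = result.get("spec", {})
        if not spec.get("error"):
            # No errors recorded for this object, so nothing to report.
            continue
        by_kind[spec.get("kind", "Unknown")] += 1
        affected.append(spec.get("name", "?"))
    return dict(by_kind), sorted(affected)


if __name__ == "__main__":
    # Hand-written sample data, not real operator output.
    sample = [
        {"spec": {"kind": "Pod", "name": "default/web-1",
                  "error": [{"text": "Back-off pulling image"}]}},
        {"spec": {"kind": "Service", "name": "default/api",
                  "error": [{"text": "Service has no endpoints"}]}},
        {"spec": {"kind": "Pod", "name": "default/ok", "error": []}},
    ]
    counts, names = summarize_results(sample)
    print(counts, names)
```

A summary like this could feed a dashboard or an alert, so that users see a short list of affected objects per cluster instead of having to inspect each cluster by hand.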

flowchart LR
    subgraph rafay[Rafay]
        bp[Cluster Blueprint]
    end

    subgraph syd[Kubernetes Cluster]
        k1[K8sGPT Operator]
    end
    k1-.->open[OpenAI API]
    subgraph sfo[Kubernetes Cluster]
        k2[K8sGPT Operator]
    end
    k2-.->open[OpenAI API]
    subgraph nyc[Kubernetes Cluster]
        k3[K8sGPT Operator]
    end
    k3-.->open[OpenAI API]


Try K8sGPT with Rafay

Follow our step-by-step documentation if you are interested in trying the K8sGPT Operator using Rafay. Users can bring in the K8sGPT operator as an add-on in a cluster blueprint and deploy it to all clusters in their Rafay Orgs. You can also watch a short video of this in action.


New developments in Generative AI are appearing on an almost daily basis, and the Rafay team continues to keep a close eye on them. Our goal is to leverage this transformative technology in an unobtrusive manner, as we have done with pretty much all our integrations.

Blog Ideas

Sincere thanks to the readers who spend time on our product blogs. This blog was authored because many readers asked us about AI/ML and Kubernetes. Please contact the Rafay Product Team if you would like us to write about other topics.