Index

October 8, 2024
in Product Blog, Workload Identity, Azure AKS, GitOps
4 min read

Leveraging Workload Identity with Rafay's GitOps Approach - Part 2

Continuing from Part 1 of this series, which introduced Workload Identity for Azure AKS, Part 2 explores how to use Workload Identity with Rafay's GitOps approach, enabling your Kubernetes pods to securely access Azure resources.

Accessing Azure Resources

October 7, 2024
in Product Blog, Pod Identity, EKS Pod Identity Associations
4 min read

Using Amazon EKS Pod Identity and Associations with Rafay - Part 2

Continuing from Part 1 of this series, which compared Pod Identity with IRSA for Amazon EKS, Part 2 explores how to use Amazon EKS Pod Identity with the Rafay platform. This blog post will guide you through deploying the Amazon EKS Pod Identity Agent and configuring role associations, enabling your Kubernetes pods to securely access AWS services.

Pod Accessing AWS service

October 4, 2024
in Product Blog, Break Glass, Kubernetes
3 min read

Break Glass Workflows for Developer Access to Kubernetes Clusters - Introduction

In any large-scale, production-grade Kubernetes setup, maintaining the security and integrity of the clusters is critical. However, there are exceptional circumstances, such as production outages or critical bugs, where developers need emergency access to a Kubernetes cluster to resolve issues.

This is where a "Break Glass" process comes into play. It is a controlled procedure that grants temporary, elevated access to developers in critical situations, with the appropriate safeguards in place to minimize risks.
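The idea of temporary, elevated access can be sketched as a time-bound grant record. This is only an illustration, not Rafay's implementation: the `BreakGlassGrant` class, its field names, and the one-hour time-to-live are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class BreakGlassGrant:
    """Hypothetical record of an emergency access grant (illustrative only)."""
    user: str
    cluster: str
    role: str            # the elevated role being granted
    reason: str          # recorded for later audit
    granted_at: datetime
    ttl: timedelta       # how long the elevated access lasts

    @property
    def expires_at(self) -> datetime:
        return self.granted_at + self.ttl

    def is_active(self, now: datetime) -> bool:
        # Access is valid from the grant time up to, but not including, expiry.
        return self.granted_at <= now < self.expires_at


# Example: a one-hour grant issued during an outage (all values are made up).
grant = BreakGlassGrant(
    user="dev@example.com",
    cluster="prod-cluster",
    role="cluster-admin",
    reason="production outage",
    granted_at=datetime(2024, 10, 4, 12, 0, tzinfo=timezone.utc),
    ttl=timedelta(hours=1),
)
```

Keeping the reason and the expiry on the grant itself is one way to make the safeguards explicit: access lapses on its own, and every grant leaves an auditable trail.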

Break Glass

October 3, 2024
in Product Blog, Pod Identity, IRSA
6 min read

Pod Identity versus IRSA for Amazon EKS - Part 1

When managing containerized applications on Amazon Elastic Kubernetes Service (EKS), a critical concern is granting permissions to your applications so that they can securely access AWS resources. Traditionally, AWS has provided mechanisms like IAM Roles for Service Accounts (IRSA) to enable fine-grained permissions management within EKS clusters. However, EKS Pod Identity, a newer feature, offers a more refined and efficient solution.

In this blog, we’ll explore how EKS Pod Identity differs from IRSA and why it represents a significant improvement for identity management in Amazon EKS-based environments. Let's assume an application running in our EKS cluster needs to securely access data in an AWS S3 bucket.

App Accessing AWS S3

October 2, 2024
in Product Blog, MLOps
4 min read

Bringing DevOps and Automation to Machine Learning via MLOps

The vast majority of organizations are new to AI/ML. As a result, most in-house systems and processes supporting it are likely ad hoc. Industry analysts like Gartner forecast that organizations will need to quickly transition from pilots to production with AI/ML in order to make it across the chasm.

Most organizations already have reasonably mature DevOps processes and systems in place. So, going mainstream with AI should be a walk in the park. Correct? It turns out that this is not really true. As Chirag Dekate of Gartner puts it: “IT leaders responsible for AI are discovering the AI pilot paradox, where launching pilots is deceptively easy but deploying them into production is notoriously challenging.”

In this blog, we will try to answer the following questions:

Why do we need a new process called MLOps when most organizations already have reasonably mature DevOps practices? How is MLOps different from DevOps?

DevOps vs MLOps

September 29, 2024
in Product Blog, Nvidia, GPU Metrics, GPU, Framebuffer
5 min read

GPU Metrics - Framebuffer

In the previous blog, we discussed why tracking and reporting GPU power usage matters. In this blog, we will dive deeper into another critical GPU metric: GPU framebuffer usage.

GPU Framebuffer

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

September 28, 2024
in Product Blog, Nvidia, GPU, GPU Metrics, Power
4 min read

GPU Metrics - Power

In the previous blog, we discussed why tracking and reporting GPU SM Clock metrics matters. In this blog, we will dive deeper into another critical GPU metric: GPU Power.

GPU Power

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

September 27, 2024
in Product Blog, Nvidia, GPU, GPU Metrics, SM Clock
4 min read

GPU Metrics - SM Clock

In the previous blog, we discussed why tracking and reporting GPU Memory Utilization metrics matters. In this blog, we will dive deeper into another critical GPU metric: GPU SM Clock. The GPU SM clock (Streaming Multiprocessor clock) metric refers to the clock speed at which the GPU's cores (SMs) are running.

The SM is the main processing unit of the GPU, responsible for executing compute tasks such as deep learning operations, simulations, and graphics rendering. Monitoring the SM clock speed can help you assess the performance and health of your GPU during workloads and detect potential bottlenecks caused by clock speed throttling.
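As a rough sketch of how an SM clock reading might be used to flag possible throttling, the check below compares the observed clock against the GPU's maximum SM clock. The function name, the sample values, and the 0.9 threshold are hypothetical; a low clock can also simply mean the GPU is idle, so a check like this is only meaningful while a workload is running.

```python
def is_possibly_throttled(current_mhz: float, max_mhz: float,
                          threshold: float = 0.9) -> bool:
    """Return True if the SM clock is well below its maximum.

    `threshold` is an assumed cutoff (fraction of the maximum clock),
    not a vendor-recommended value.
    """
    if max_mhz <= 0:
        raise ValueError("max_mhz must be positive")
    if current_mhz < 0:
        raise ValueError("current_mhz must not be negative")
    return (current_mhz / max_mhz) < threshold


# Example readings in MHz (made-up values for illustration).
print(is_possibly_throttled(1200, 1980))  # clock far below maximum
print(is_possibly_throttled(1950, 1980))  # clock close to maximum
```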

GPU SM Clock

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

September 26, 2024
in Product Blog, Nvidia, GPU, GPU Metrics, Memory Utilization
4 min read

GPU Metrics - Memory Utilization

In the introductory blog on GPU metrics, we discussed the GPU metrics that matter and why they matter. In this blog, we will dive deeper into one of those critical metrics: GPU Memory Utilization.

GPU memory utilization refers to the percentage of the GPU’s dedicated memory (i.e. framebuffer) that is currently in use. It measures how much of the available GPU memory is occupied by data such as models, textures, tensors, or intermediate results during computation.
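The definition above boils down to a simple ratio of used to total framebuffer memory. A minimal sketch follows; the function name and the sample sizes are hypothetical and not tied to any particular monitoring tool.

```python
def memory_utilization_percent(used_bytes: int, total_bytes: int) -> float:
    """Percentage of the GPU's dedicated memory currently in use."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    if not 0 <= used_bytes <= total_bytes:
        raise ValueError("used_bytes must be between 0 and total_bytes")
    return 100.0 * used_bytes / total_bytes


GIB = 1024 ** 3

# Example: 20 GiB in use on a GPU with 80 GiB of framebuffer (made-up values).
print(memory_utilization_percent(20 * GIB, 80 * GIB))  # 25.0
```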

GPU Memory Utilization

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.

September 25, 2024
in Product Blog, Nvidia, GPU, GPU Metrics
5 min read

What GPU Metrics to Monitor and Why?

With the increasing reliance on GPUs for compute-intensive tasks such as machine learning, deep learning, data processing, and rendering, both infrastructure administrators and users of GPUs (i.e. data scientists, ML engineers and GenAI app developers) require timely insight into the performance, efficiency, and overall health of their GPU resources.

To make data-driven decisions, these users need access to key metrics for their GPUs. This is the first blog in a series where we will describe the GPU metrics that you should track and monitor. In subsequent blogs, we will do a deep dive into each metric, why it matters and how to use it effectively.

Intro to GPU Metrics

Important

Navigate to documentation for Rafay's integrated capabilities for Multi Cluster GPU Metrics Aggregation & Visualization.