Key Kubernetes Challenges for AI/ML in the Enterprise - Part 1

This post is based on what we have learned over the last two years working closely with customers who make extensive use of Kubernetes for AI/ML.

This is part 1 of a two-part series. In part 1, we start by looking at why Kubernetes is particularly compelling for AI/ML, and then describe some of the key challenges that organizations encounter with AI/ML on Kubernetes.

In part 2, we will look at ways organizations can address these challenges.

Why Kubernetes for Machine Learning

Machine learning depends on processing large data sets quickly and efficiently, so capabilities such as job parallelization, data segmentation, and batch processing are critical. Kubernetes supports these patterns natively, for example through Jobs that run many pods in parallel and to completion, and as a result it is very well suited for machine learning.
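To make the parallel batch pattern concrete, here is a minimal sketch in Python that builds a Kubernetes `batch/v1` Job manifest. It uses the standard Job fields `completions`, `parallelism`, and `completionMode: Indexed`, so that each pod can pick its own data shard from the `JOB_COMPLETION_INDEX` environment variable that Kubernetes sets for indexed Jobs. The job name, image, and `process_shard.py` script are hypothetical placeholders, not something taken from a real deployment.

```python
import json


def build_indexed_job(name, image, shards, max_parallel):
    """Return a batch/v1 Job manifest that processes `shards` data shards,
    running at most `max_parallel` pods at the same time."""
    if not 1 <= max_parallel <= shards:
        raise ValueError("max_parallel must be between 1 and shards")
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": shards,         # one successful pod per shard
            "parallelism": max_parallel,   # upper bound on concurrent pods
            "completionMode": "Indexed",   # pods receive JOB_COMPLETION_INDEX
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",
                            "image": image,
                            # Hypothetical script that reads the
                            # JOB_COMPLETION_INDEX environment variable
                            # to decide which shard to process.
                            "command": ["python", "process_shard.py"],
                        }
                    ],
                }
            },
        },
    }


# Hypothetical job name and image, for illustration only.
manifest = build_indexed_job(
    name="feature-extract",
    image="registry.example.com/ml/worker:latest",
    shards=8,
    max_parallel=4,
)
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and submitted to a cluster in the usual way. The point is only that segmentation (shards), parallelism, and run-to-completion batch semantics are all expressed in a single built-in resource.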

Now, let's explore the key challenges that organizations experience supporting these initiatives.

Infra Setup and Maintenance Complexity

One of the biggest challenges organizations encounter is the complexity of infrastructure setup and maintenance. It was recently reported that data scientists are forced to spend 60-80% of their time on infrastructure-related tasks and only 4% on actual testing with data.

This is unacceptable from a user productivity perspective. Organizations want infrastructure to be abstracted away from data scientists and delivered to them "on demand" through a "self-service" experience.

Steep Learning Curve

Data scientists have a difficult enough job learning and keeping up with constant advances in the AI/ML space. It is not practical to expect them to become experts in Kubernetes and the associated ecosystem as well.

Excellent projects such as Kubeflow and MLflow attempt to streamline this experience for data scientists, but they still require users to be intimately familiar with Kubernetes.

Security & Governance

As AI/ML goes mainstream and starts supporting the primary revenue stream of an organization, the teams behind it find themselves having to demonstrate that they operate with world-class security and governance. Ignoring this can be very problematic and can result in unnecessary audits and similar disruptions.

User Access

This is an acute, daily problem: we see organizations struggle to provide data scientists and other associated users with "secure, remote access" to both the infrastructure and the ML system or platform.

To ensure uptime, these users need visibility into health metrics for the underlying compute and storage infrastructure, the GPUs, and their applications. The lack of an integrated, intuitive access experience can result in a dramatic loss of user productivity and, potentially, system downtime.

Typically, there are at least three classes of users that need access to support the organization's AI/ML systems and operations.

Employees

These are typically data scientists and operations, FinOps, and security personnel who need "seamless, role-based access" to do their jobs.
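As one illustration of what role-based access can look like on Kubernetes, the sketch below builds a namespaced RBAC `Role` and a matching `RoleBinding` using the standard `rbac.authorization.k8s.io/v1` API. The namespace, group name, role name, and the exact set of permissions are assumptions chosen for the example. A real policy would depend on the organization's identity provider and its security requirements.

```python
import json


def build_namespace_access(namespace, group, role_name="ml-workload-editor"):
    """Return (Role, RoleBinding) manifests that let members of `group`
    run and inspect batch workloads in a single namespace."""
    role = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "Role",
        "metadata": {"name": role_name, "namespace": namespace},
        "rules": [
            {   # read-only access to pods and their logs (core API group)
                "apiGroups": [""],
                "resources": ["pods", "pods/log"],
                "verbs": ["get", "list", "watch"],
            },
            {   # manage batch Jobs, but nothing cluster-wide
                "apiGroups": ["batch"],
                "resources": ["jobs"],
                "verbs": ["create", "delete", "get", "list", "watch"],
            },
        ],
    }
    binding = {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{group}", "namespace": namespace},
        "subjects": [
            {
                "kind": "Group",
                "name": group,
                "apiGroup": "rbac.authorization.k8s.io",
            }
        ],
        "roleRef": {
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "Role",
            "name": role_name,
        },
    }
    return role, binding


# Hypothetical namespace and group names, for illustration only.
role, binding = build_namespace_access("ml-team-a", "data-scientists")
print(json.dumps(role, indent=2))
print(json.dumps(binding, indent=2))
```

Scoping the role to one namespace keeps each class of user confined to the workloads it actually needs. The same helper could be called with a different group and a narrower rule set for other roles.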

3rd Party/ISVs

Organizations frequently work with specialized third-party ISV software for AI/ML that needs to be deployed and operated in their infrastructure. Authorized employees of the ISV need to be given access so that they can "remotely" deploy, operate, and manage the AI/ML application.

A good example of this is a specialized AI-based diagnostic application used at a hospital. The hospital is unlikely to be in the business of developing bespoke AI/ML applications, since that is not core to its business.

Contractors

It is extremely common for organizations to leverage contractors extensively to support their AI/ML initiatives.

Summary

We live in very interesting times. Every organization we work with is exploring how it can leverage AI/ML to transform its core business, and is treating this as a "sink or swim" decision.

In part 2 of this series, we will look at how organizations can address these challenges at scale and dramatically accelerate the adoption and use of AI/ML internally.