Microsoft Adopts Kubernetes to Scale Machine Learning Cloud Workloads

Microsoft and AI specialist Litbit collaborate on an AI system that uses Kubernetes to automatically scale unpredictable machine learning workloads.

Add Kubernetes to the growing list of technologies that Microsoft is using to push the boundaries of cloud-based artificial intelligence.

Kubernetes, the popular open-source container orchestration platform, had a breakout year in 2017. Now, in addition to helping enterprises manage their application container deployments in the cloud or on-premises, Kubernetes is being recruited to give cloud-based AI workloads the room they need to get the job done when demand picks up.

Microsoft unveiled a new auto-scaling system that uses Kubernetes to expand or shrink the amount of cloud-computing resources allocated to machine learning training workloads. The system was developed in partnership with Litbit, a San Jose, Calif., technology startup that uses Internet of Things data to create "AI Personas" that workplaces can use to augment the capabilities of their employees based on their collective experiences and know-how.

For example, an organization can create and train a persona that helps its field technicians detect and diagnose equipment problems before they jump into a work truck and physically visit machinery that is acting up, saving time and expense.

It's a tall order and an unpredictable one, it turns out. Litbit discovered that AI training workloads varied wildly, since customers train their personas at different times.

"Some of these training jobs (e.g., Spark ML) make heavy use of CPUs, while others (e.g., TensorFlow) make heavy use of GPUs. In the latter case, some jobs retrain a single layer of the neural net and finish very quickly, while others need to train an entire new neural net and can take several hours or even days," explained Microsoft representatives in a blog post.

Microsoft and Litbit settled on Kubernetes, partly because of its proven cluster management technology, but also because of the strong community support the project had attracted in a few short years. Although it started off at Google, the project is considered a crown jewel of the Linux Foundation's Cloud Native Computing Foundation (CNCF).

The companies set out to solve the problem of highly variable machine learning workloads by configuring a Kubernetes cluster on Azure with GPU support using the Azure CNI Networking plugin for Kubernetes. They then applied a node-level auto-scaler using the Helm package manager for Kubernetes, followed by some configuration changes to get the system up and running.
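To make the idea of node-level auto-scaling concrete, here is a minimal sketch in Python of the kind of decision such a scaler has to make: given the GPUs already in use and the GPUs requested by queued training jobs, how many nodes should the cluster have? This is an illustration only, not the code Microsoft and Litbit deployed. The names `PendingJob` and `desired_node_count`, the `gpus_per_node` value and the one-node minimum are all assumptions made for the example; only the 40-node ceiling comes from the figures reported for the project.

```python
import math
from dataclasses import dataclass


@dataclass
class PendingJob:
    """A queued training job (hypothetical type, for illustration only)."""
    name: str
    gpus: int  # number of GPUs the job requests


def desired_node_count(busy_gpus, pending_jobs, gpus_per_node=1,
                       min_nodes=1, max_nodes=40):
    """Return how many nodes the cluster should have.

    gpus_per_node and min_nodes are assumed values; max_nodes=40 mirrors
    the peak cluster size reported in the article.
    """
    # Total demand: GPUs running jobs already hold plus GPUs queued jobs want.
    needed_gpus = busy_gpus + sum(job.gpus for job in pending_jobs)
    # Round up so a partially used node is still provisioned.
    wanted_nodes = math.ceil(needed_gpus / gpus_per_node)
    # Clamp to the allowed range so the cluster never drops below the
    # minimum or grows past the ceiling.
    return max(min_nodes, min(max_nodes, wanted_nodes))


if __name__ == "__main__":
    queue = [PendingJob("retrain-last-layer", 1), PendingJob("full-network", 4)]
    print(desired_node_count(busy_gpus=2, pending_jobs=queue, gpus_per_node=2))
```

When the queue empties and running jobs finish, the same function returns a smaller number, which is the signal to remove idle nodes. A production auto-scaler also has to account for CPU and memory requests, node startup time and a cool-down period before scaling in, none of which this sketch attempts.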

The project was a success. The system has been running for four months and has enabled Litbit to scale up to 40 nodes at a time and seamlessly downsize when demand abates. Microsoft has posted a complete walk-through of the Kubernetes auto-scaler on its developer blog.

Reflecting the container craze that has gripped enterprise DevOps teams, Microsoft has doubled down on its support for Kubernetes.

During the KubeCon conference earlier in December, Microsoft announced that its Azure Container Service is now abbreviated AKS, signifying the company's customer-focused, Kubernetes-centric approach to cloud-native application development. The company also unveiled a new connector called Virtual Kubelet, which allows users to target Azure Container Instances (ACI), the company's rapid container creation and deployment service.

Pedro Hernandez

Pedro Hernandez is a contributor to eWEEK and the IT Business Edge Network, the network for technology professionals. Previously, he served as a managing editor for the network of...