VMware Private AI Foundation: Deep Learning VM Images

Enhancing efficiency in setting up deep learning environments for data scientists

My Role

UX Designer — Interaction Design, User Flows, Prototyping

Team

Grace, PM

Qi, SWE

Fanny, SWE

Timeline

November 2023 - January 2024

Overview

Deep Learning VM Images are VMware's tailored solution for deploying high-performance virtual environments optimized for deep learning tasks. These VM images are designed to support popular machine learning frameworks and tools such as TensorFlow, PyTorch, and JupyterLab, making it easier for data scientists to start working with their models quickly. By streamlining the deployment process and enabling easy GPU resource allocation, Deep Learning VM Images eliminate the complexities of setting up and configuring deep learning environments.
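Inside such a VM, a data scientist can quickly confirm the bundled tooling is present before starting work. A minimal Python sketch of that sanity check (the package list here is assumed for illustration; the actual bundle varies by image):

```python
import importlib.util

# Frameworks the Deep Learning VM image is described as bundling.
# This exact list is an assumption for illustration purposes.
EXPECTED_PACKAGES = ["tensorflow", "torch", "jupyterlab"]

def check_environment(packages):
    """Return a dict mapping each package name to True if it is importable."""
    return {name: importlib.util.find_spec(name) is not None
            for name in packages}

if __name__ == "__main__":
    for pkg, present in check_environment(EXPECTED_PACKAGES).items():
        print(f"{pkg}: {'installed' if present else 'MISSING'}")
```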

I spearheaded the design of these VM images, focusing on simplifying the setup and providing pre-configured, optimized environments. This approach enhances deployment efficiency, allowing data scientists to rapidly access and utilize powerful deep learning tools.

Highlights

Streamlined hardware customization

Integrating the "Customize Hardware" step into the "Deploy from VM Template" workflow simplifies VM setup by allowing direct GPU resource configuration. This improvement eliminates the need for separate workflows, enhancing deployment efficiency and addressing a key user pain point.

Automated toolkit setup

The addition of the "Customize Template" step simplifies VM preparation by enabling the selection and installation of frameworks and dependencies in advance. This enhancement ensures data scientists can immediately begin their work without the hassle of manual setup.

CONTEXT

The challenges of setting up a deep learning environment

Managing dependencies

Data scientists frequently invest substantial time managing versions and dependencies, particularly in complex machine learning (ML) environments with GPU utilization. The intricate network of libraries, frameworks, toolkits, and drivers connecting ML applications to GPUs can be fragile and interdependent.

When everything aligns perfectly, it operates smoothly, but any misalignment can lead to significant disruptions and challenges.
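The fragility described above is at heart a version-compatibility constraint between frameworks and the underlying CUDA toolkit. A toy sketch of that constraint, using a hypothetical compatibility table (real matrices are published by NVIDIA and the framework vendors; these entries are illustrative only):

```python
# Hypothetical compatibility table: which CUDA toolkit versions each
# (framework, version) build supports. Illustrative values only.
SUPPORTED_CUDA = {
    ("torch", "2.1"): {"11.8", "12.1"},
    ("tensorflow", "2.15"): {"12.2"},
}

def stack_is_consistent(framework, fw_version, cuda_version):
    """Return True if this framework build supports the installed CUDA toolkit."""
    return cuda_version in SUPPORTED_CUDA.get((framework, fw_version), set())
```

A pre-configured image sidesteps this check entirely by shipping a combination that is already known to align.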

Competing with cloud providers

To streamline the setup process and help customers start quickly, major public cloud providers like Google, AWS, and Azure offer specialized Deep Learning images. These images are optimized for deep learning tasks and come pre-configured with popular frameworks such as TensorFlow and PyTorch, along with various NVIDIA libraries and integrated development environments (IDEs).

By delivering these ready-to-use environments, these providers save data scientists significant time and effort, allowing them to focus on developing their machine learning models.

TARGET USERS

Meet the virtual infrastructure (VI) admins and data scientists

Understanding their roles and needs

Anita,

Virtual Infrastructure (VI) Admin

JTBD

I want to efficiently allocate hardware resources, so that I can ensure optimal VM performance without delays

Goals

Streamline VM deployment while ensuring available resources like GPUs and NICs are allocated efficiently

Pain points

Manually assigning hardware after deployment wastes time and adds complexity

Lack of real-time resource visibility makes configuration difficult

Juggling multiple workflows for hardware adjustments slows down operations

Bob,

Data Scientist

JTBD

I want to quickly start my deep learning / machine learning work, so that I can focus on training models rather than configuring environments.

Goals

Start work immediately with pre-configured environments that include all necessary frameworks.

Pain points

Installing frameworks manually wastes time and risks version conflicts.

Inconsistent environments across VMs hinder collaboration and reproducibility.

PROBLEM

The deployment process is rather roundabout

Missing GPU resource configuration in “Deploy VM from Template” workflow

Currently, when virtual infrastructure (VI) admins deploy a VM from a deep learning template for data scientists, they have no option to allocate GPU resources during the creation process. This leads to an inefficient workflow where admins have to deploy the VM, navigate to the VM's detail page, and manually configure the virtual GPU profiles. This process hinders quick and easy deployment, requiring multiple steps for a single task.

Workflow 1: Deploy VM from template

Anita follows the 9-step process to deploy a VM using a deep learning template from her content library.


Workflow 2: Add GPU resources

Step 1. Edit VM settings

After completing the deployment workflow, Anita must go to the VM details page to modify the settings and add GPU resources.

Workflow 2a: Add GPU resources

Step 2. Add Peripheral Component Interconnect (PCI) device

Anita selects “PCI Device” from the “Add New Device” dropdown.

Workflow 2b: Add GPU resources

Step 3. Select vGPU profile

Anita chooses the appropriate vGPU profile that provides the necessary resources for the VM.
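Profile choice typically comes down to matching vGPU framebuffer size to the workload's memory needs. A small sketch of that selection logic, with hypothetical profile names and sizes (real profile names come from the NVIDIA host driver):

```python
# Hypothetical vGPU profiles: name -> framebuffer size in GiB.
# Actual profile names and sizes depend on the GPU and host driver.
VGPU_PROFILES = {
    "grid_a100-4c": 4,
    "grid_a100-8c": 8,
    "grid_a100-20c": 20,
}

def pick_profile(required_gib, profiles=VGPU_PROFILES):
    """Pick the smallest profile whose framebuffer covers the requirement."""
    candidates = [(mem, name) for name, mem in profiles.items()
                  if mem >= required_gib]
    if not candidates:
        raise ValueError(f"no profile offers {required_gib} GiB")
    return min(candidates)[1]
```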

Workflow 2c: Add GPU resources

Step 4. Confirm GPU resources have been added

Anita confirms that the selected vGPU profile has been successfully applied to her VM.

Inadequate Visibility of GPU Resource Availability

The process was further complicated by the inability to check GPU resource availability during deployment. Admins had to power on the VM to determine if the chosen GPU profile was available, potentially causing delays if the necessary resources were already in use.

How can I be sure that this vGPU profile isn't being used by another VM and won't cause issues when I attempt to power on this VM?

IDEATION

Enhancing GPU resource and data science tools selection

Unified GPU resource selection

Allocating GPU resources required a multi-step process after deploying a VM from a template. I proposed integrating a “Customize Hardware” step into this workflow, allowing admins to choose GPU profiles at VM creation time and avoiding additional configuration steps afterwards.

The “Customize Hardware” step already exists in the “Create New VM” from scratch workflow and would be brought over to the “Deploy from Template” workflow.

GPU resource availability visibility

I proposed adding a dedicated “Add GPU Device” action to the dropdown and a column indicating GPU resource availability during selection, showing whether resources are in use by other VMs. However, this approach oversimplifies the issue. Resource availability can change rapidly, and a static display won't account for real-time shifts: even if a resource appears available when selected, it might be occupied by the time the VM is created, leading to power-on failures caused by new reservations from other VMs.
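The underlying problem is a classic check-then-act race: availability read at selection time can be stale by power-on time. A minimal sketch of that race, with a hypothetical one-slot profile pool:

```python
# One remaining instance of a hypothetical vGPU profile.
pool = {"grid_a100-8c": 1}

def snapshot_available(profile):
    """Static availability reading, as a UI column would show it."""
    return pool.get(profile, 0) > 0

def power_on(profile):
    """Reserve the profile at power-on; fail if it is exhausted."""
    if pool.get(profile, 0) == 0:
        raise RuntimeError(f"{profile} exhausted: power-on fails")
    pool[profile] -= 1

admin_sees_available = snapshot_available("grid_a100-8c")  # True at read time
power_on("grid_a100-8c")  # another VM claims the last instance first
# The admin's own power_on("grid_a100-8c") would now raise RuntimeError,
# despite the earlier "available" reading.
```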

Simplified toolkit setup

The “Customize Template” step at the end of the workflow lets users pre-select the required toolkits and frameworks during VM creation. This feature ensures essential tools like TensorFlow and PyTorch are pre-installed, saving time and avoiding post-deployment setup.

SCOPE CHANGE

Prioritizing features for MVP delivery

MVP Focus on GPU resource customization

Given the tight timeline for delivering the MVP, the highest priority was exposing hardware customization during the "Deploy from Template" workflow. This addressed a long-standing frustration: the "Create New VM" workflow allowed GPU resource selection, but the "Deploy from Template" workflow did not. Integrating GPU resources and PCI devices into a single "Deploy from Template" process significantly enhances deployment efficiency and eliminates the need to modify hardware settings after the VM is already deployed.

Future iterations post-MVP

Initial plans to display GPU availability during the MVP phase were deferred. The complexity of dynamically managing GPU resources, along with its dependencies on other teams, was moved to the larger Private AI Foundation Accelerator Resource Management project, which I led. This shift was necessary due to the extensive effort required, which exceeded the scope of the current project.

FINAL DESIGNS

Small but mighty improvements

Improvements in action

Here’s how these improvements benefit administrators and data scientists:

1. Deploy Private AI Foundation Workload Domain

Anita first deploys the Private AI Foundation workload domain, which provisions the GPU-enabled infrastructure her VMs will run on.


2. Subscribe to content library

She subscribes to the content library in vCenter, which includes VMware's Deep Learning VM image, and is now ready to create her own VM.

3. Deploy VM from template

She launches the “Deploy from Template” workflow and completes steps 1-8.

4. Customize hardware

She no longer needs to deploy the VM before adding hardware resources like GPUs; instead, she allocates them directly during the deployment process.

5. Select GPU device

After choosing "PCI Device" from the dropdown menu, she selects the necessary GPU resources for her VM.

6. Confirm desired GPU device is selected

Once she selects the desired GPU resources for her VM, she verifies that the correct profile is applied and proceeds to the next step.

7. Select data science toolkit

She chooses the software bundle that defines the data science toolkit for the VM, then inputs the required details, such as container versions and configuration tokens.

8. Deep Learning VM is deployed!

She deploys the deep learning VM and provides the login details to the data scientist, allowing them to begin training models and working within the environment.