Focus on Managing Your Data Science Workspaces, Not Your Data Volumes

Mike Oglesby

April 7, 2021

We’ve all been there. The project works great as a proof of concept, but when it comes time to move to production, progress stalls. Challenges around data management and traceability become painful roadblocks. Unfortunately, this problem is all too common in the world of enterprise AI. Although the emerging machine learning operations (MLOps) ecosystem offers many tools for iterative AI model training and deployment, most of those tools don’t streamline data management. And those that do handle data management are often complex and force data scientists to manage storage resources separately from their data science workspaces.

To address this gap, we’ve developed the NetApp® Data Science Toolkit for Kubernetes, which is included in the newly released version 1.2 of the NetApp Data Science Toolkit. This toolkit abstracts storage resources and Kubernetes workloads up to the data science workspace level. Best of all, these capabilities are packaged in a simple, easy-to-use interface that’s designed for data scientists and data engineers. Using the familiar form of a Python program, the toolkit enables data scientists and engineers to provision and destroy JupyterLab workspaces in just seconds. These workspaces can contain terabytes, or even petabytes, of storage capacity, allowing data scientists to store all of their training datasets directly in their project workspaces. Gone are the days of separately managing workspaces and data volumes.

All of the under-the-hood storage and Kubernetes operations, which would otherwise require help from both a DevOps engineer and a storage administrator, are executed automatically. These self-service capabilities can significantly speed up AI projects, removing time-consuming IT request-response cycles.

Clone workspaces in seconds

With the NetApp Data Science Toolkit for Kubernetes, a data scientist can almost instantaneously create a JupyterLab workspace that’s an exact copy of an existing workspace, even if that workspace contains terabytes or petabytes of data and notebooks. Data scientists can quickly create clones of JupyterLab workspaces that they can modify as needed, while preserving the original “gold-source” workspace. These operations are built on top of NetApp Trident, NetApp’s enterprise-class dynamic storage orchestrator for Kubernetes, and NetApp’s highly efficient and battle-tested cloning technology. And they can be performed directly by data scientists who don’t have storage or Kubernetes expertise. Operations that used to take days or weeks, and the assistance of both a DevOps engineer and a storage administrator, now take a data scientist just seconds.
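To build some intuition for why a clone can appear in seconds regardless of how much data the workspace holds, here is a toy Python model of clone-by-reference. It is only a conceptual sketch, not NetApp’s implementation and not the toolkit’s API: the `Workspace` class and its methods are hypothetical names invented for this illustration. The idea it shows is that cloning freezes the current state into a shared layer and copies no file data, so the cost of a clone does not grow with the size of the workspace.

```python
class Workspace:
    """Toy workspace whose files live in layers that clones can share."""

    def __init__(self, parent=None):
        self._parent = parent  # frozen layer shared with other workspaces, or None
        self._own = {}         # files written since this workspace was last cloned

    def write(self, path, data):
        # Writes always land in this workspace's private layer.
        self._own[path] = data

    def read(self, path):
        # Look in the private layer first, then fall back to shared layers.
        layer = self
        while layer is not None:
            if path in layer._own:
                return layer._own[path]
            layer = layer._parent
        raise FileNotFoundError(path)

    def clone(self):
        # Freeze the current state into a shared base layer. No file data is
        # copied: the original and the clone both point at the same base.
        base = Workspace(self._parent)
        base._own = self._own
        self._parent, self._own = base, {}
        return Workspace(base)


gold = Workspace()
gold.write("train.csv", b"original data")
gold.write("notebook.ipynb", b"{}")

experiment = gold.clone()
experiment.write("train.csv", b"relabelled data")  # diverges only in the clone
gold.write("notes.txt", b"gold only")              # later writes stay private

print(gold.read("train.csv"))
print(experiment.read("train.csv"))
```

In this model, the gold-source workspace still reads its original `train.csv` after the clone has been modified, and unmodified files such as `notebook.ipynb` are the very same objects in both workspaces rather than copies.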

Traceability made easy

Data scientists can also save space-efficient, read-only copies of existing JupyterLab workspaces. Based on Trident and NetApp Snapshot™ technology, this functionality can be used to version workspaces and implement workspace-to-model traceability. Best of all, since datasets can now be stored directly within workspaces, there is no need to separately implement dataset traceability. Dataset-to-model traceability is built into the workspace. In regulated industries, traceability is a baseline requirement, and implementing it is often extremely cumbersome. Now, with the Data Science Toolkit for Kubernetes, it’s amazingly easy.
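As a rough illustration of what workspace-to-model traceability can look like in practice, the Python sketch below keeps a small log that ties each model version to the workspace snapshot it was trained from. Everything here is an assumption made for the example: `TrainingRecord`, `TraceabilityLog`, and the workspace, snapshot, and version names are hypothetical and are not part of the toolkit. Creating the snapshot itself is assumed to happen elsewhere.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TrainingRecord:
    """Links one model version to the read-only snapshot it was trained from."""
    model_version: str
    workspace: str
    snapshot: str
    recorded_at: str


class TraceabilityLog:
    """Hypothetical in-memory log; a real system would persist these records."""

    def __init__(self):
        self._records = {}

    def record(self, model_version, workspace, snapshot):
        # Refuse to overwrite history: each model version maps to one snapshot.
        if model_version in self._records:
            raise ValueError(f"{model_version} is already recorded")
        rec = TrainingRecord(
            model_version=model_version,
            workspace=workspace,
            snapshot=snapshot,
            recorded_at=datetime.now(timezone.utc).isoformat(),
        )
        self._records[model_version] = rec
        return rec

    def snapshot_for(self, model_version):
        # Answer the audit question: which workspace state produced this model?
        return self._records[model_version].snapshot


log = TraceabilityLog()
log.record("fraud-model-v1", workspace="fraud-ws", snapshot="fraud-ws-snap-001")
log.record("fraud-model-v2", workspace="fraud-ws", snapshot="fraud-ws-snap-002")

print(log.snapshot_for("fraud-model-v1"))
```

Because the dataset lives inside the workspace, the single snapshot name stored per model version covers both the notebooks and the training data.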

Automate your workflows

You can also use the Data Science Toolkit for Kubernetes in conjunction with a workflow management platform, such as Apache Airflow or Kubeflow Pipelines, to automate various AI workflows. Do you have a workflow that involves provisioning or cloning a data scientist workspace? You can use the toolkit to automate the workspace provisioning or cloning step. Do you have a complicated compliance workflow that involves implementing traceability? No problem, you can automate that too.
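The shape of such an automated workflow can be sketched in plain Python. This is a minimal sketch under stated assumptions, not an Airflow or Kubeflow Pipelines definition and not the toolkit’s API: `run_training_workflow` and the three stand-in step functions are hypothetical. In a real pipeline, each step would be a task in the workflow platform, and the clone and snapshot steps would call the toolkit.

```python
from typing import Callable, Dict, List, Tuple


def run_training_workflow(
    clone_workspace: Callable[[str, str], None],
    train_model: Callable[[str], str],
    snapshot_workspace: Callable[[str, str], None],
    source: str,
    run_id: str,
) -> Dict[str, str]:
    """Clone a gold-source workspace, train in the clone, then snapshot it."""
    workspace = f"{source}-{run_id}"
    clone_workspace(source, workspace)          # step 1: isolated working copy
    model_version = train_model(workspace)      # step 2: train inside the clone
    snapshot = f"{workspace}-{model_version}"
    snapshot_workspace(workspace, snapshot)     # step 3: freeze state for traceability
    return {
        "workspace": workspace,
        "model_version": model_version,
        "snapshot": snapshot,
    }


# Stand-in steps that only record what was requested (assumptions for the demo).
calls: List[Tuple[str, ...]] = []


def fake_clone(src: str, dst: str) -> None:
    calls.append(("clone", src, dst))


def fake_train(workspace: str) -> str:
    calls.append(("train", workspace))
    return "v1"


def fake_snapshot(workspace: str, name: str) -> None:
    calls.append(("snapshot", workspace, name))


result = run_training_workflow(
    fake_clone, fake_train, fake_snapshot, source="gold-ws", run_id="run42"
)
print(result)
```

Passing the steps in as callables keeps the workflow logic independent of any one platform, so the same ordering can be tested locally with stand-ins and then wired to real tasks.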

With the NetApp Data Science Toolkit, data scientist self-service really is possible. To learn more, visit the toolkit’s GitHub repository.

Mike Oglesby

Mike is a Sr. Software Engineer at NetApp focused on AI ecosystem solutions and integrations. He architects and develops solutions and integrations that incorporate NetApp's hybrid multicloud data management capabilities with AI ecosystem tools and platforms. Mike has a diverse background spanning containers, DevOps, and business applications. Prior to joining NetApp, Mike worked on a line-of-business application development team at a large global financial services company. Mike resides in Cary, NC, with his wife and young son.
