NetApp Data Science Toolkit 1.1: Bringing the data fabric to data scientists and data engineers

Mike Oglesby

As the adoption of AI in the enterprise continues to expand at a rapid pace, AI training workflows are becoming more complex. Data scientists and data engineers often need to pull data from multiple data sources, and these sources aren't always compatible with each other. This presents a major challenge and causes many AI projects to underdeliver or even fail completely. It is now imperative that data scientists and data engineers have the tools necessary to construct unified data pipelines that incorporate different data sources, environments, platforms, and protocols. The latest version of the NetApp® Data Science Toolkit, version 1.1, enables data scientists and data engineers to directly trigger the movement of datasets—on demand or as a step in an automated workflow. Here's a rundown of what's new in version 1.1.

Triggering a sync operation for a Cloud Sync relationship

You can now use the Data Science Toolkit to synchronize a NetApp Cloud Sync relationship that you previously created in your NetApp Cloud Central account. The Cloud Sync service can replicate data to and from various file and object storage systems. Use cases include the following:
  • Replicating new sensor data from the edge back to the core data center or to the cloud. You can use this data for artificial intelligence or machine learning (AI/ML) model training or retraining.
  • Replicating a newly trained or updated model from the core data center to the edge or to the cloud to be deployed as part of an inferencing application.
  • Replicating data from a Simple Storage Service (S3) data lake to a high-performance environment to use in training an AI/ML model.
  • Replicating data from a Hadoop data lake (through Hadoop NFS Gateway) to a high-performance environment to use in training an AI/ML model.
  • Saving a new version of a trained model to an S3 or Hadoop data lake for permanent storage.
  • Replicating NFS-accessible data from a legacy or third-party system of record to a high-performance environment for use in training an AI/ML model.
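In all of these use cases, the pattern is the same: trigger the sync, then wait for it to finish before the next step of the workflow runs. Here is a minimal sketch of that trigger-and-poll pattern; the `client` object and its `trigger_sync`/`get_status` methods are illustrative stand-ins, not the toolkit's actual API.

```python
import time

def sync_and_wait(client, relationship_id, poll_interval=5.0, timeout=3600.0):
    """Trigger a sync on a replication relationship and block until it finishes.

    `client` is any object exposing trigger_sync(id) and get_status(id);
    these names are hypothetical stand-ins for the toolkit's real interface.
    """
    client.trigger_sync(relationship_id)
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = client.get_status(relationship_id)
        if status == "DONE":
            return
        if status == "FAILED":
            raise RuntimeError(f"sync of {relationship_id} failed")
        time.sleep(poll_interval)
    raise TimeoutError(f"sync of {relationship_id} timed out")
```

Blocking until the sync completes is what makes the operation safe to use as a gating step in an automated pipeline: downstream training never starts against half-replicated data.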

Triggering a sync operation for an asynchronous Mirror or Vault relationship

Using the Data Science Toolkit, you can now synchronize an existing NetApp SnapMirror® relationship whose destination volume is on your storage system. SnapMirror volume replication technology quickly and efficiently replicates data between NetApp storage systems. For example, you can gather data from another storage system and replicate it to your own storage system for AI/ML model training or retraining.
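Because the sync is just one callable operation, it slots naturally into an automated retraining workflow as the stage that runs before training. A hedged sketch of that sequencing (the stage names and callables here are hypothetical placeholders, not toolkit code):

```python
def run_pipeline(stages):
    """Run (name, callable) stages in order, collecting each result.

    Any exception aborts the pipeline, so a failed sync stops the
    workflow before training runs against stale or partial data.
    """
    results = {}
    for name, step in stages:
        results[name] = step()
    return results

# Hypothetical wiring: replicate fresh data first, then retrain on it.
# `sync_snapmirror` and `train_model` stand in for real implementations.
```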

Pulling data from S3

The Data Science Toolkit now lets you pull one or more objects from an S3 bucket. When you pull multiple objects, the operation is multithreaded, so you’ll get better performance than if you loop through objects and pull them one at a time. This S3 pull functionality is particularly useful when a NetApp ONTAP® system is your AI training environment, but you need to collect training data from an S3 object storage data lake, such as NetApp StorageGRID® or an S3-compliant object store in the cloud. One caveat: The multithreaded S3 pull functionality hasn’t been tested at scale, so it might not be appropriate for extremely large datasets.
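The performance gain comes from issuing the per-object downloads concurrently instead of looping over them sequentially. A minimal sketch of that multithreaded pattern follows; the `fetch` callable is a stand-in for a real S3 GET (for example, via boto3) and this is not the toolkit's actual implementation.

```python
from concurrent.futures import ThreadPoolExecutor

def pull_objects(fetch, keys, max_workers=8):
    """Download many objects concurrently.

    `fetch(key)` performs one object download (e.g., an S3 GET) and
    returns its contents; an error in any download propagates to the
    caller. Returns a {key: contents} mapping for all requested keys.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with keys.
        return dict(zip(keys, pool.map(fetch, keys)))
```

Because S3 GETs are I/O-bound, a thread pool overlaps their network latency, which is where the speedup over a one-at-a-time loop comes from.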

Pushing data to S3

The toolkit also lets you push data to an S3 bucket. As with the S3 pull functionality, the S3 push functionality lets you push one file or multiple files; for multiple files, the operation is multithreaded. This S3 push functionality is useful for saving trained models and updated datasets in an S3 object storage data lake. The same caveat applies here: Because the multithreaded S3 push functionality hasn’t been tested at scale, it might not be appropriate for extremely large datasets.
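Pushing a directory of files follows the same shape: enumerate the local files, derive an object key from each file's relative path, and upload concurrently. A sketch with a pluggable `upload` callable (a stand-in for a real S3 PUT, not the toolkit's actual code):

```python
import os
from concurrent.futures import ThreadPoolExecutor

def push_directory(upload, local_dir, key_prefix="", max_workers=8):
    """Upload every file under local_dir concurrently.

    `upload(key, path)` performs one object upload (e.g., an S3 PUT).
    Each key is key_prefix plus the file's path relative to local_dir,
    normalized to forward slashes. Returns the list of keys uploaded.
    """
    tasks = []
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, local_dir).replace(os.sep, "/")
            tasks.append((key_prefix + rel, path))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(lambda t: upload(*t), tasks))
    return [key for key, _ in tasks]
```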

New data volumes are now thin-provisioned by default

The toolkit also includes a minor enhancement to the "create volume" operation. Now, new volumes that you create are thin-provisioned by default. A thin-provisioned volume is one for which storage space isn't reserved in advance. Instead, space is allocated dynamically as data is written, and free space is released back to the storage system when data in the volume is deleted. This approach helps you use your storage space more efficiently, because capacity isn't locked up by volumes that aren't actually using it. If you want to fully allocate storage space for a volume that you're creating, you can specify an optional parameter, which makes the system guarantee sufficient space for the full capacity of the volume.
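The space-efficiency argument can be made concrete with a toy accounting model: a thick volume reserves its full size at creation time, while a thin volume consumes aggregate space only for the data actually written. This is purely an illustration of the concept, not ONTAP's actual allocator.

```python
def free_space(aggregate_size, volumes):
    """Remaining aggregate space under a toy provisioning model.

    Each volume is (size, used, thin): a thick volume reserves `size`
    up front, while a thin volume consumes only `used`.
    """
    consumed = sum(used if thin else size for size, used, thin in volumes)
    return aggregate_size - consumed

# Two 100 GiB volumes, each holding 10 GiB, on a 200 GiB aggregate:
thick = free_space(200, [(100, 10, False), (100, 10, False)])  # 0 GiB left
thin = free_space(200, [(100, 10, True), (100, 10, True)])     # 180 GiB left
```

With thick provisioning the aggregate is fully committed even though only 20 GiB is in use; with thin provisioning the unwritten 180 GiB remains available to other volumes.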

Learn more

For a full list of enhancements, fixes, and changes in version 1.1, refer to the release notes. To download the latest version of the NetApp Data Science Toolkit, visit the toolkit’s GitHub repository.

Mike Oglesby

Mike is a Technical Marketing Engineer at NetApp focused on MLOps and data pipeline solutions. He architects and validates full-stack AI/ML/DL data and experiment management solutions that span the hybrid cloud. Mike has a DevOps background and a strong knowledge of DevOps processes and tools. Prior to joining NetApp, Mike worked on a line-of-business application development team at a large global financial services company. Outside of work, Mike loves to travel. One of his passions is experiencing other places and cultures through their food.
