
Catalog ONTAP data to unleash its power


Win Vahlkamp

With more than half of the world’s data stored in NetApp® ONTAP® unified storage, it’s no wonder that so many customers tell us, “I need a catalog for my data.” However, different types of catalogs serve different ends. Data engineers, scientists, and analysts need to identify schemas, tables, columns, and data types to accelerate exploratory data analysis. Site reliability engineers may need to know how many files are stored in which volumes, and their size and age, to decide what can be tiered to cheaper storage. Geologists may need to identify the domain-specific files to import into the interpretation project for a multi-billion-dollar exploration effort. The common goal of any catalog is to centrally store and find information about stored data, building a semantic (knowledge) layer across the entire enterprise.

Business and technical catalogs

There are two types of catalogs: business catalogs and technical catalogs. A business catalog is not in the query path. It stores an inventory of all the data assets of the entire organization, such as schemas, tables, columns, data lineage, data quality results, data types, and sample data, along with a host of other metadata, to speed up understanding and exploration of the data.

A business catalog facilitates the creation of a standard business glossary, so that all parts of the business have a common understanding and vocabulary of their data. It contains data descriptions, tracks ownership by people and systems, stores security parameters, and gives quick access to the schema and format of the data. It can also inventory all of the available models and all of the pipelines connecting the data. Modern catalogs enable a data mesh by additionally identifying data sources, targets, and the domains that own them, along with notification and contact information.  

Accelerating data science and engineering workflows

Identifying and understanding data is an early step in data science and data engineering workflows, and a data catalog accelerates it. Without a catalog, a data engineer would have to manually examine the database schema, identify the tables, and sample each one with SQL queries to understand the data formats. The data catalog cuts out hours of exploratory time by providing all of that metadata in one location. Numerous commercial and open-source business catalogs exist, such as OpenMetadata, Amundsen (born at Lyft), Apache Atlas, DataHub (born at LinkedIn), Informatica, AWS Glue, and Microsoft Purview, among many others.
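A minimal sketch of the manual exploration a catalog eliminates, using Python’s built-in sqlite3 module (the table, columns, and sample row are purely illustrative, not from any real system):

```python
# Without a catalog, an engineer discovers schema and samples data by hand:
# enumerate the tables, inspect each table's columns, then sample rows.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, amount REAL, region TEXT)")
conn.execute("INSERT INTO orders VALUES (1, 19.99, 'EMEA')")

# Step 1: enumerate the tables.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# Step 2: inspect each table's columns and data types.
schema = {t: [(col[1], col[2]) for col in conn.execute(f"PRAGMA table_info({t})")]
          for t in tables}

# Step 3: sample rows to understand the data format.
sample = conn.execute("SELECT * FROM orders LIMIT 5").fetchall()

print(tables)   # ['orders']
print(schema)   # {'orders': [('order_id', 'INTEGER'), ('amount', 'REAL'), ('region', 'TEXT')]}
print(sample)   # [(1, 19.99, 'EMEA')]
```

A business catalog serves all three answers from one searchable location, so this loop never has to be written.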

A technical catalog, which is in the query path, tracks the location of tables, such as Iceberg and Delta Lake tables in the data lakehouse, and enables SQL queries against them. Hadoop, the original data lake, quickly gained the Hive Metastore to track the location of tables and implement a query layer to reach them. Modern data lakehouses persist table storage in the Iceberg and Delta Lake table formats to enable ACID transactions on columnar files stored in Parquet or other formats. A catalog is needed to track changes to the tables and to map their metadata so that SQL queries can find them. Technical catalogs supported by Iceberg include Nessie (developed and open sourced by Dremio), the REST catalog, the JDBC catalog, and AWS Glue, among others. Apache Spark can be configured to use any of these catalogs.
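As an illustration, a Spark session might be pointed at an Iceberg REST catalog with configuration along these lines (the catalog name, URI, and warehouse path are placeholders, not a reference configuration):

```
spark.sql.extensions                   org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions
spark.sql.catalog.lakehouse            org.apache.iceberg.spark.SparkCatalog
spark.sql.catalog.lakehouse.type       rest
spark.sql.catalog.lakehouse.uri        http://rest-catalog.example.com:8181
spark.sql.catalog.lakehouse.warehouse  s3://warehouse-bucket/
```

With this in place, a query such as `SELECT * FROM lakehouse.db.events` resolves the table’s current metadata through the catalog rather than through hard-coded file paths.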

Domain-specific catalogs

Another special type of catalog is specific to the domain of an industry vertical. Oil and gas upstream exploration workflows have domain-specific file types, such as SEG-Y and ZGY, that only the tools of that domain understand. Therefore, domain-specific catalogs are embedded in the software so that geologists, geophysicists, and engineers can discover, qualify, and explore their data assets. Media and entertainment likewise has file types that rendering and animation tools understand, and therefore has catalogs specific to its uses. Health care, engineering, and other industry verticals have their own embedded catalogs for discovering and inventorying their own data formats. So there is not one catalog to rule them all, because specific data formats exist that not all catalogs can ingest.

Enabling S3 with ONTAP for object storage

Object storage is the de facto standard for modern data infrastructure, and all data catalogs support S3. NetApp ONTAP is the world’s premier unified storage, serving NFS, SMB, and block protocols, but the tools of data science, engineering, and analytics favor object storage. How can data engineers, scientists, and analysts discover and explore the enormous amount of data stored in ONTAP?

Enabling the S3 protocol on all your ONTAP NAS volumes means that your data can be cataloged by any of the data catalog tools on the market today. You don’t need to invest expensive data engineering time to pipeline the data out to an S3 bucket to be inventoried and cataloged. Instead, the data in ONTAP can be unleashed to become the most significant source of value in your organization.
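Once S3 access is enabled, any S3-compatible client can browse the same data in place. A sketch with the AWS CLI, which accepts an alternative endpoint via `--endpoint-url` (the endpoint and bucket names here are hypothetical placeholders, and credentials are assumed to be configured):

```
# List a bucket exposed by ONTAP's S3 server; --endpoint-url directs the
# AWS CLI at the ONTAP endpoint instead of AWS itself.
aws s3 ls s3://engineering-data/ --endpoint-url https://ontap-s3.example.internal
```

Because catalog tools speak the same S3 API, pointing a crawler at that endpoint is typically all that’s needed to begin inventorying the data.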

Read more

Learn how to catalog your data in ONTAP and unlock the value for your data science and analytics workloads. 

Win Vahlkamp

Win is a Data Solutions Architect with more than 25 years of experience in systems architecture and engineering. He is focused on developing open source data solutions across NetApp’s cloud and on-premises portfolio of products and solutions. Previously, he was an Azure Cloud Solutions Architect, Global Technology Strategist, and Senior Solutions Engineer for some of the world’s largest oil companies. In his spare time, Win reads a lot, especially about the System of Profound Knowledge (Deming) and the Theory of Constraints (Goldratt), practices iaido (Japanese swordsmanship), and likes to travel.

