As a data-driven company, NetApp relies on data science, business intelligence (BI), and analytics to learn, improve, and predict across a myriad of possibilities throughout the enterprise. As I discussed in my earlier blog on organizational structures for data scientists, data analytics is essential to planning and driving business. Data scientists, with their portfolio of methods and tools, are the key to unlocking the most value from our data and improving our odds of success.
Before I jump into describing the symbiotic relationship between data science and data lakes, I want to describe how IT can separate data into a data warehouse, a data mart, or a data lake. As an enterprise, we need all three. Broadly, a data warehouse stores large amounts of data from different sources to feed standard reporting, a data mart is a subset of the data warehouse dedicated to a specific organization, and a data lake supports predictive analytics and what-if scenarios.
Our team has spent a lot of time and effort building our data warehouse as a single source of truth. The data is often used for producing standard reports, and if employees know the data sources and the answers they are seeking, the data warehouse can produce bookings or financial reports. Gartner refers to this as a System of Record, and it's the traditional way of doing business.
A data mart begins to answer the "How?" and goes beyond the "What?" of business intelligence. It is meant to address departmental needs and provide information regarding how the department is operating based on a subset of data. For example, we have a data mart for sales operations to provide detailed insights from all sales data. It has data that answers questions not available from a simple data warehouse.
Fast forward about four years from the introduction of the data mart concept to today, and we are moving toward a data lake. A data lake provides an opportunity to build analytics from massive amounts of data, yielding insights beyond straightforward reports or department-specific operational questions. Data lake analytics are predominantly meant for people who specialize in understanding data models and are proficient at working closely with business users to make projections and explore “what-if” scenarios.
The benefits of analyzing a massive data lake cannot be realized by looking at data in a siloed fashion; analysts must understand the process implications in order to come to useful conclusions. For example, data lake analysis cannot be used to make renewal recommendations to the sales team without understanding the sales cycle. A data scientist will spearhead these conversations and requires familiarity with relevant business processes.
In order to prevent misuse of data lake interpretations, we work very closely with each business group on a case-by-case basis. We must ensure each team has the knowledge and maturity to make decisions based on accurate assessments. It requires time and organizational buy-in to obtain the best return on investment.
Our team has been busy building out the data lake technology stack, based on previous experience with a Hadoop-based infrastructure for Active IQ, a system that collects information on customers’ NetApp solutions. Every week the system generates about 100TB of data and 225 million files, and it is still growing.
We have taken those lessons learned, and many of the design patterns, to build our data lake architecture. At the same time, we have identified an approach to integrate the data lake into our next generation data center platform as part of efforts to build a cloud aware enterprise.
While the team continues to build out the data lake infrastructure, other team leaders and I have begun discussions with different business stakeholders to identify data lake use cases. Like a laptop without software, a data lake by itself is of no value unless it can be used to derive predictive analytics with significant business benefit and positive ROI.
To address the complexities of a massive data lake and build our maturity, we felt it was important to establish a Data Science Center of Excellence. Data scientists can answer questions such as when to connect to the data lake, which models to build out, how to establish visibility across models to prevent duplicate efforts, how to ensure alignment with business processes, and more. These resources can provide guidance on when to use the data lake, or when to go old school with the data warehouse or data mart!
Rajesh Shriyan is the Director of IT Enterprise Architecture at NetApp. Rajesh and his team work with NetApp’s business teams to identify application capabilities across the enterprise and map applications to those capabilities. They help determine how an organization can most effectively achieve its current and future objectives.