Build StaaS with microservices: Storage automation (Part 2)

Guest Author: Tony Johnson, SRE Automation Lead, IBM

February 17, 2022

In part 1 of this blog series, I shared how my team at Red Hat began our storage automation journey to address some pain points. I also outlined the objectives we wanted to achieve and the metrics we would use to measure them. In this post, I’ll review what we tackled first: provisioning automation.

The problem of provisioning

One of our biggest issues in implementing storage as a service (StaaS) was request resolution time. Our median time to complete a request was 5 business days, which was way too long. The other issue was process quality. Although we had relatively few errors in our provisioning process, we were not able to triage them effectively. We had to go back to find the original ticket, talk to the engineer who handled it, recover any data needed, and recreate the provision request correctly. This process took time and damaged our team’s reputation within the organization.

Data protection required an additional ad hoc request to a data protection specialist. Because we didn’t have a standard process, we had to ask each requester what they needed in terms of data protection. Our lack of standards led to governance complications, a nonstandard disaster recovery process, and no clear way to show we were meeting the organization’s recovery time objective (RTO) needs.

We needed a provisioning automation project to shorten the turnaround on provisioning requests, eliminate provisioning errors, and standardize our data protection process and capabilities.

But first, we had to decide what data was required as input to the provisioning workflow. We needed some main data points about the application, including:

Recovery point objective (RPO) and RTO requirements, sorted by application criticality
Application criticality for data protection standards
Application data classification for data placement and encryption needs
Service-level definition (Extreme, Performant, or Value); we’ve since added two high-performance levels beyond Extreme
Application performance profile for service-level mapping (IOPS and throughput)
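
To make these inputs concrete, here is a minimal sketch of how they might be modeled as a single record handed to the workflow. It is an illustration only, not our actual schema: the class names, field names, units, and example values are all assumptions.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ServiceLevel(Enum):
    """The three service levels named in this post."""
    EXTREME = "Extreme"
    PERFORMANT = "Performant"
    VALUE = "Value"


@dataclass(frozen=True)
class PerformanceProfile:
    iops: int             # expected I/O operations per second
    throughput_mbps: int  # expected throughput in MB/s (unit is an assumption)


@dataclass(frozen=True)
class ProvisioningInput:
    """Hypothetical record of the data points the workflow needs."""
    application: str
    criticality: str          # e.g. "high"; the scale is an assumption
    data_classification: str  # e.g. "confidential"; the labels are assumptions
    rpo_minutes: int          # recovery point objective
    rto_minutes: int          # recovery time objective
    service_level: Optional[ServiceLevel] = None       # None if not yet known
    performance: Optional[PerformanceProfile] = None   # None if not specified

    def __post_init__(self) -> None:
        # Reject obviously invalid objectives before the workflow starts.
        if self.rpo_minutes < 0 or self.rto_minutes < 0:
            raise ValueError("RPO and RTO must be non-negative")


# Example with made-up values.
request = ProvisioningInput(
    application="billing-db",
    criticality="high",
    data_classification="confidential",
    rpo_minutes=15,
    rto_minutes=60,
)
```

Keeping the service level and performance profile optional reflects that some requests arrive without enough information to map them to a service level.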

We worked with our security and service mapping teams to define RPOs and RTOs, application criticality, and data classification by application, using ServiceNow. We then characterized service levels for volume types like databases and logs.

We also created user acceptance performance profiles for application volumes and set a default performance level for requests where we couldn’t determine a service level. After initial provisioning, we used our monitoring platform to identify service level changes and migration needs.

Meet our new pipeline

Our pipeline starts with a ServiceNow form that asks the requester which application the storage is for, whether they need block or file storage, which hosts need to connect, and what environment and capacity are required. ServiceNow sends the application’s criticality and data classification to the pipeline. The pipeline then configures the storage on the appropriate hardware based on the environment.

If application performance is specified, the request is assigned an appropriate service level. If performance can’t be ascertained, the request is placed in our default service level. Data protection frequency and location needs are configured based on criticality and RPO and RTO requirements. All storage is encrypted based on application data classification.
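
The decision logic in this step can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not our production code: the IOPS thresholds are placeholder numbers, the choice of Value as the default level is a guess, the rule that only high-criticality applications get an offsite copy is hypothetical, and throughput is left out for brevity.

```python
from typing import Dict, Optional, Union

# Assumption: the post does not say which level is the default.
DEFAULT_SERVICE_LEVEL = "Value"

# Hypothetical minimum IOPS per level, highest level first.
# The numbers are placeholders, not real service definitions.
IOPS_FLOORS = [("Extreme", 10_000), ("Performant", 2_000), ("Value", 0)]


def assign_service_level(iops: Optional[int]) -> str:
    """Map a requested IOPS figure to a service level.

    Falls back to the default level when performance is unknown.
    """
    if iops is None:
        return DEFAULT_SERVICE_LEVEL
    for level, floor in IOPS_FLOORS:
        if iops >= floor:
            return level
    return DEFAULT_SERVICE_LEVEL  # reached only for negative input


def protection_policy(criticality: str, rpo_minutes: int) -> Dict[str, Union[int, bool]]:
    """Derive data protection settings from criticality and RPO.

    Taking a snapshot at least once per RPO window keeps worst-case
    data loss within the RPO. The offsite rule is an assumption.
    """
    return {
        "snapshot_interval_minutes": rpo_minutes,
        "offsite_replica": criticality == "high",
    }


level_known = assign_service_level(15_000)   # performance specified
level_unknown = assign_service_level(None)   # performance not ascertained
policy = protection_policy("high", 15)
```

Ordering the thresholds from highest to lowest means the first match is the most demanding level the request qualifies for, and a request with no performance data never blocks: it simply lands in the default level.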

Next, we add the storage to our monitoring platforms via API calls. We developed triage-oriented monitoring for storage administrators and a simple monitoring tool for application owners that provides an overview of application health. We then add the storage to our logging platform, which records macro information about the storage to help us plan annual purchasing and budget requirements. Pipeline outputs are also added to the logging platform for audit and success rate metrics.
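
As a rough illustration of how pipeline outputs can feed audit and success rate metrics, the sketch below records each pipeline run as a structured log line and computes a success rate over a batch of runs. The record fields, stage names, and request IDs are hypothetical, and the sketch does not call any real logging platform API.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class PipelineRun:
    """Hypothetical outcome record for one provisioning request."""
    request_id: str
    stage: str        # last stage reached, e.g. "provision" or "monitoring"
    succeeded: bool
    detail: str = ""  # error text kept for later triage


def to_log_line(run: PipelineRun) -> str:
    """Serialize a run as one JSON line suitable for a log platform."""
    return json.dumps(asdict(run), sort_keys=True)


def success_rate(runs: List[PipelineRun]) -> float:
    """Fraction of runs that succeeded; 0.0 when there are no runs."""
    if not runs:
        return 0.0
    return sum(1 for run in runs if run.succeeded) / len(runs)


runs = [
    PipelineRun("REQ-1", "complete", True),
    PipelineRun("REQ-2", "complete", True),
    PipelineRun("REQ-3", "provision", False, "host not found"),
    PipelineRun("REQ-4", "complete", True),
]
rate = success_rate(runs)
lines = [to_log_line(run) for run in runs]
```

Because each failed run keeps the stage it reached and the error detail, an engineer can triage from the log record alone instead of tracking down the original ticket.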

Finally, the pipeline sends nightly feedback to ServiceNow to tie each application to all of its infrastructure resources.

As a result of our provisioning automation effort, we have achieved our objectives:

Provisioning requests no longer have to sit in a queue waiting for human hands to complete them.
Quality has improved by minimizing human error.
Any errors in the workflow are much easier to triage and fix permanently using our new logging capabilities.
Our storage engineers now have more time to spend enhancing our service, including working toward our strategic initiative: container persistent storage.

In the next part of this blog series, I’ll dive into automation for deployment of new equipment, the efficiency gained, and some surprising quality benefits.

To learn more, check out this video we did at NetApp INSIGHT® 2021, “Build Storage as a Service (StaaS) with Micro-Services”.

Tony Johnson is the SRE Automation Lead for IBM. He is responsible for the automation of hybrid cloud services and platforms for the IBM CIO team. Previously he was Storage Manager at Red Hat, from 2014 to 2020, leading a group of 6 engineers responsible for IT storage.
