The crux of IT tech support is simple: How fast can you identify and successfully resolve an incident? The complexity lies in the details, especially when handling hundreds of tasks per day.
Recently, NetApp IT began automating incident resolution so that some incidents are resolved without human intervention, particularly first-level incidents that generate a high volume of service tickets. The goal of response automation is to restart services, trigger a workflow, or gather log information, all before an engineer receives the ticket. We wanted to use automation to drive faster, more accurate responses to incidents, improve our service delivery, and take the first step toward self-healing in IT operations.
Our challenge was to create a platform that could be easily integrated into our existing ecosystem of monitoring and service management tools, including the ServiceNow CMDB and monitoring tools from Zenoss, Splunk, and NetApp OnCommand® Unified Manager. We knew what we wanted to achieve, but we first had to understand the process.
The bulk of the work in any automation project, and the key to its success, lies in first understanding the process and workflows. Our Command Center engineers were challenged to define a standard response process for a handful of first-level incidents that generate a high volume of service tickets. They started with relatively simple-to-resolve incidents, such as rebalancing storage capacity or restarting an offline application. The team used scripts as building blocks and applied the relevant responses as needed. Today they are building our library of automated responses while continuing to provide day-to-day IT support.
The automated response process is designed to work within our existing ecosystem. When our Zenoss monitoring system receives an incident, it creates a ticket in our ServiceNow service management platform. If the ticket is flagged with auto response enabled, a script is executed using Ansible. The script directs the affected application to run certain commands, collects the results, and places the information into the ticket for a tech support person to access.
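The flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not our production integration: the Ticket class, its field names, and the playbook name are hypothetical stand-ins, and a real setup would read and update tickets through the service management platform rather than an in-memory object. The only external tool assumed is the standard ansible-playbook command with its --limit option.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Ticket:
    """Hypothetical stand-in for a service management ticket."""
    number: str
    host: str                      # affected system reported by monitoring
    playbook: str                  # response script chosen for this incident type
    auto_response_enabled: bool = False
    work_notes: List[str] = field(default_factory=list)


def run_playbook(playbook: str, host: str) -> str:
    """Run an Ansible playbook against a single host and return its output."""
    result = subprocess.run(
        ["ansible-playbook", playbook, "--limit", host],
        capture_output=True,
        text=True,
        check=False,  # a failed run is still useful information for the ticket
    )
    return result.stdout + result.stderr


def auto_respond(
    ticket: Ticket,
    runner: Callable[[str, str], str] = run_playbook,
) -> bool:
    """Execute the response script if the ticket is flagged for auto response.

    Returns True when a script was run and its output was attached to the
    ticket, False when the ticket was left for manual handling.
    """
    if not ticket.auto_response_enabled:
        return False
    output = runner(ticket.playbook, ticket.host)
    ticket.work_notes.append(
        f"[auto response] {ticket.playbook} on {ticket.host}:\n{output}"
    )
    return True
```

The runner is passed in as a parameter so the logic can be exercised without Ansible installed, for example by calling auto_respond(ticket, runner=lambda playbook, host: "service restarted") and then reading ticket.work_notes.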
With auto response, it is important to ensure the problem is really resolved. Until the Command Center becomes fully confident that the script is doing its job, team members will verify the resolution. Although this makes for a slow process, we’ve gained a huge head start in incident resolution. When the tech support person opens the ticket, they can review the basic information gathered by the auto response and begin troubleshooting immediately. This eliminates the delay that comes from running tests to diagnose the issue.
For incidents with auto response enabled, it takes 3 to 4 minutes (on average) to execute an automation script and approximately one day from when the ticket is opened until it is resolved (known as ticket duration). Without auto response, the average ticket duration was three days, and it took an engineer approximately one hour to assess the situation. Other benefits of automating incident resolution include:
Below are some auto response (AR) examples the NetApp IT team has implemented. These examples help ensure the service is always up, prevent interruption, and/or eliminate manual intervention by the IT operations teams to check and restore service.
The NetApp-on-NetApp blogs feature advice from subject matter experts from NetApp IT who share their real experiences using NetApp’s industry-leading data management solutions to support business goals. Visit www.NetAppIT.com to learn more.
Andy Kranjec is the Senior IT Manager of Infrastructure Operations at NetApp. Andy and his team provide process, technical, and administrative leadership to the organization responsible for day-to-day operations and execution of continuous service improvement, change management, and problem management processes.