Welcome to today's deep dive session. My name is Angela Chang, and I am a StorageGRID TME. I'm going to walk you through how you can perform StorageGRID log analysis using the ELK stack. Here is the agenda; today's session has two major parts. The first part is a slide presentation covering the log analysis steps and the ELK server configuration. The second part is a 25-minute demo showing you how to perform log analysis using ELK. I may go through some slides very quickly, as the details will be demonstrated in the demo session. I am the only one presenting, so I will not be able to respond to chat. If you have questions during the presentation or demo, I suggest you hold them until the end.

Each grid node can generate a lot of messages within a minute. If you are familiar with StorageGRID and have been looking into StorageGRID logs, you will find that there are so many messages that it is very difficult to navigate or identify the part you want to look at. The message format differs based on the module that generates it, so it is not easy to parse the messages manually. Some of the messages are not easy to read, for example the audit messages. And there is no central place to gather the logs from the entire system to view related events. This makes log analysis very time consuming.

The StorageGRID 11.6 syslog forward feature allows us to collect all messages in an external, centralized system. This enhances the log analysis and troubleshooting process. The feature also helps prevent logs from filling up the StorageGRID local file system; when local storage fills up, older logs are rotated out. It also allows customers to archive all log messages to comply with company policy. Log entries are like footprints. Log analysis is important because it allows us to trace back in time and see what happened or who did it.

Here are the high-level steps of log analysis. Data is collected into a centralized database, where it is parsed and indexed to speed up search. Log analysis tools correlate events and identify patterns or problems. Monitoring and reporting with a visualization tool gives the administrator a quick overview of ongoing activities.

ELK is a collection of three open-source products: Elasticsearch, Logstash, and Kibana. ELK has become one of the most popular log analytics solutions. Logstash is a data processing pipeline. It is used to collect data from many different sources such as databases, CSV files, and logs; it transforms the data and ingests it into Elasticsearch. Elasticsearch is a search and analytics engine. It takes data from Logstash and stores and indexes it. Kibana is known as the charting tool for the ELK stack. It provides search and data visualization functions for data indexed in Elasticsearch. You can see the flow of the ELK stack here and how it relates to the log analysis process. For testing purposes, I have a single VM with all three products installed.

From the StorageGRID Grid Manager, configure the external syslog server using the TCP or TLS protocol. For TLS, there are two options: one-way authentication or two-way mutual authentication. I'll go through this in detail when we walk through the config file. On the Logstash side, you customize a config file to receive the StorageGRID log messages, then filter and parse them to extract key information before sending them to Elasticsearch for indexing. Lastly, you create some Kibana queries or dashboards to analyze the log messages.

The Logstash config files are located in this directory. You can have multiple .conf files; each represents a pipeline.
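As a minimal sketch of such a pipeline (the port and output path here are placeholders, not the exact values from the session), a starter config that just listens for StorageGRID syslog messages over TCP and writes the raw lines to a file could look like this:

```
# hypothetical starter pipeline, e.g. /etc/logstash/conf.d/storagegrid.conf
input {
  tcp {
    port => 5514              # any non-reserved port; must match the port set in Grid Manager
  }
}

filter {
  # left empty for now; grok/kv parsing is added once the raw formats have been reviewed
}

output {
  file {
    path => "/var/log/logstash/storagegrid-raw.log"   # capture raw messages to study each format
  }
}
```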
The config file has three main sections — input, filter, and output — which I will cover in detail in later slides and in the demo. For the first time, I suggest you start with this sample config file and collect the StorageGRID log messages to a file. This lets you see the different message formats and learn how to parse each type of message.

In the input section, you specify TCP and a port. Any non-reserved port can be used. On the StorageGRID side, you specify the ELK server host and port and select the TCP protocol. If you want to enable server authentication, specify these three lines here, and on the StorageGRID side you select the TLS protocol instead of TCP and upload the CA cert that issued the Logstash server certificate. If you want mutual authentication, add these two lines and specify the CA cert that issued the client certificate. You can skip the SSL certificate authorities line if the client cert is issued by a well-known public CA, as it is already in the server's CA trust store. On the StorageGRID side, upload the client cert and the client private key.

Let's take a look at the StorageGRID log messages when they are sent to Logstash. Most of the messages are unstructured data, and there are many different formats. They need to be parsed to extract the key information from each message. There are a few major components of a message. Each message starts with a numeric value referred to as the syslog priority; the number equals facility times 8 plus severity, and we will look at the details in the next slide. Each message is generated by a program; the majority come from ADE, that is, the StorageGRID core services such as LDR, SSM, or DDS. The log level can be info, notice, warning, and so on. Some messages have a module name, such as S3RQ or SVQY, which represents the module within a program. The trace ID correlates messages belonging to a particular event, mostly S3 request related activities.

Each syslog message includes a priority value at the beginning of the message. The priority value ranges from 0 to 191 and is calculated using the formula priority = facility × 8 + severity. For example, a security-related message has facility 4 and info severity 6; this combination generates a priority value of 4 × 8 + 6 = 38. Here, I suggest you leave all facilities and severities as passthrough. I highlighted the most commonly seen StorageGRID log message syslog priorities in this table.

In the filter section, you define the message patterns, how you want to parse them, and how to break the messages up into individual fields. In Logstash, we use grok to define patterns. However, one pattern does not fit all, as StorageGRID log messages come in different formats. I included a URL where you can find all supported grok patterns, and I will go through the details of defining patterns during the demo. In my testing, I defined six grok patterns, each matching a different StorageGRID message format. If grok cannot match a message to a pattern, it adds a tag called _grokparsefailure. You can use this tag to identify any mismatched patterns and correct them.

This is an audit message sample. I use POSINT (positive integer) to define syslog_pri, and SYSLOGBASE, which defines three fields. The most important information in this message is highlighted, and it is in key-value pair format. I use the kv filter to extract the key-value pairs, so each key becomes a field and I don't have to define each field individually. You can see from the table here that it is much easier to read and interpret the audit messages. Similarly, the bycast.log S3 request related messages contain key-value pairs, but the Event Process Create and S3 Error Response key-value pairs come in slightly different patterns.
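As a rough illustration of the audit-message handling just described — the grok pattern and field names here are assumptions for the sketch, not the exact config shown on the slide:

```
filter {
  # hypothetical match for an audit message: priority, syslog header,
  # microsecond timestamp, then everything after "AUDT:" as one field
  grok {
    match => { "message" => "<%{POSINT:syslog_pri}>%{SYSLOGBASE} %{TIMESTAMP_ISO8601:event_time} \[AUDT:%{GREEDYDATA:audit_message}\]" }
  }
  if [program] == "audit" {
    kv {
      source      => "audit_message"
      value_split => ":"       # key and value are separated by a colon
      field_split => "[]"      # fields are enclosed in square brackets
    }
  }
}
```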
For these bycast.log S3 messages, I need two kv filters to capture the fields; I will go through each sample during the demo. After you define the patterns to process incoming messages, you output them to Elasticsearch. In this example, I output to Elasticsearch running on the same host with an index prefix of grid2 followed by the year, month, and day. This creates one index per day, which you can see from this curl command output.

Now we are ready to use Kibana to perform log analysis. The first time, we create an index pattern that matches one or more Elasticsearch indices. From here you can see the messages are not easy to read. Create some Kibana queries to focus on the messages you are interested in, and customize the view by selecting the important fields to display. Here is a sample of S3 events with selected fields, which I can use to troubleshoot failed requests. Use the Kibana visualization tools to create charts and build a custom dashboard. Here are URL links where you can find useful resources to get you started with ELK, Logstash, and grok.

Let's go through the demo to get a better understanding of how all of these work together to give you insight into ongoing activities inside StorageGRID. On the StorageGRID side, I have already configured the external syslog server. The configuration here matches what is defined in the input section of the Logstash config file. For the syslog priority selection, I am using the default passthrough for all three types of logs.

Next, we look at the filter section. We need to look at the StorageGRID log message patterns, so I captured the messages to a flat file. Here you can see that each message starts with a syslog priority value and that the formats differ based on the program or module that generated the message. First, look at the audit messages. Each one starts with a syslog priority value of 190 and contains the word audit, which helps us identify the audit messages. Let's copy an audit message and learn how to use grok patterns to extract the relevant fields. From the Kibana GUI, you can select Management, Dev Tools, and then Grok Debugger. I have already copied in the audit message and a grok pattern. You can first start with something basic to check whether the pattern is correctly defined. I used POSINT to define the syslog priority along with the predefined SYSLOGBASE grok pattern. The simulated result shows that the pattern is correct and extracts the right syslog priority value. The SYSLOGBASE pattern defines three fields — timestamp, log source, and program — which match the first part of the message. Then there is another timestamp with microseconds, which I define as event time using the TIMESTAMP_ISO8601 pattern. The second part of the message starts with a square bracket followed by the word AUDT, and I use the GREEDYDATA grok pattern to define that field as the audit message. The GREEDYDATA pattern means match everything; here, I match everything after "AUDT:" into this field.

This field actually contains the most important information I want to extract, and it is in yet another pattern: key-value pairs. I need to use another filter to break it up into multiple fields so that each field can be indexed. In the Logstash config, I specify that if the program equals audit, use kv (key-value) to filter the audit message out of the message. The key and value are separated by a colon, and the field separator is a square bracket. Using the kv filter, I don't have to manually define each field. I also removed some fields that I am not interested in or that are duplicates. By default, all fields are strings.
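A minimal sketch of that cleanup, together with the integer conversion described next — the field names here are purely illustrative placeholders:

```
filter {
  mutate {
    # drop fields that are duplicated or not interesting (illustrative names)
    remove_field => [ "message", "audit_message" ]
    # convert selected string fields to integers so they can be sorted or used in calculations
    convert => {
      "object_size" => "integer"
      "duration"    => "integer"
    }
  }
}
```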
I converted some of the fields to integers so that I could use them for calculations or sorting.

Now let's look at the S3RQ messages. These are S3 request related messages. You can see that this message has a trace ID and the module name S3RQ. This is an Event Process Create, which is generated when a client connects to LDR to make an S3 request. There is another S3RQ message with the same trace ID, but it is an S3 Error Response. Let's copy this message into the Grok Debugger. Here you can see that there are many fields, such as node ID, process ID, and module name, that I have to define individually. The most important part of this message is the portion after S3RQ, which I define as message detail; it follows another key-value pattern, so I can use the kv filter to break it down into individual fields. This time the key-value separator is an equal sign and the field separator is a space.

In the Logstash config file, I use a condition: if a trace ID exists and the module name is S3RQ, I use another grok pattern to subdivide the message detail into two fields. The first field is the event type, followed by a separator, and the rest I group as the S3 request. I need two grok pattern definitions here because the pattern of the second message is different: the S3 Error Response event is followed by a colon. If the event type equals Event Process Create, I use the kv filter to break the S3 request down into multiple fields. I removed some duplicated fields and renamed the field that refers to the S3 object's complete path; I renamed it to path to match the field used in the audit messages. For the S3 Error Response, I have to use a field split pattern of a comma and a space, because the separator is not a single character. I rename some fields here to match the names used in other messages. I also renamed the HTTP status code field to replace the space with an underscore, and renamed HTTP method to method to match the field used in the Event Process Create message.

There are other message formats, and I have to use additional grok patterns to define them. For example, this message is similar to the one I defined with a trace ID, except this one doesn't have a trace ID, so you can use the same pattern with the trace ID field removed. This message from the chunk service is another example with a completely different format: it has no module name and no timestamp with microseconds, so I have to define it separately. I need one grok match line for each StorageGRID message format. You can use a similar method — copy the message into the Grok Debugger and work out the proper pattern.

If there is no compliance requirement to keep all the messages, you can drop the messages you are not interested in. For example, if the syslog priority equals 29, which means notice level, and the message does not belong to one of these three modules, I can drop it. The second drop statement is for messages with a log level of info from other programs or modules. The third drop statement is for security events that are not failures.

Once you have defined all the patterns and filters, you output the result to Elasticsearch for indexing. In this example, I output to Elasticsearch running on the same host and defined a prefix of grid2 followed by the year, month, and day to create one index per day. From the Kibana GUI, select Management, Dev Tools, Console, and I can run this GET command to check which indices have been created. There is one index per day, and the document count represents the number of messages stored each day; one message equals one document. The first time, you need to define an index pattern here.
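For reference, a hedged sketch of an output section that would produce daily indices like these, assuming Elasticsearch on the same host and the grid2 prefix used above:

```
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "grid2-%{+YYYY.MM.dd}"   # one index per day
  }
}
```

In the Dev Tools Console, a request such as GET _cat/indices/grid2-* would then list each daily index with its document count, and the Kibana index pattern only needs to match the grid2 prefix.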
I already have a pattern defined that matches all indices with a prefix starting with grid2. I can create an index pattern for each month, meaning that when I run a Kibana query, it runs against that month only. You can click this star to make an index pattern the default. Now we are ready to perform log analysis. This shows there are approximately 2,800 messages in the last day. However, it's not very easy to read the messages in this format. Since I have already broken each message up into multiple fields, I can expand a message and select the fields I want to display in a table format. Let's add the fields I am interested in. You can rearrange the fields or resize the columns, which makes the messages much easier to read.

Let's look at the audit messages. I have a saved query that filters by module name S3RQ or ILMX, or by type ORLM or OVWR, which means object overwritten. I have already selected the fields that give me the details of the audit messages. The log source is the storage node that processed the S3 or ILM request and generated the audit message. There are different S3 tenants, and I can see the type of S3 requests from each tenant. In this table, I can clearly see the bucket name and object key in each request, the object size, and the time it took to complete each request. I can also see which objects are being overwritten.

Next, we look at the ILM rule being applied to each object. ORLM means object rules met. You can see the name of the rule and the location of the object when the object was put into the grid. For example, this object is over 200 KB, it is using an EC rule, and the location is CLEC 2, which is EC profile 2. Here is an example of an EC object at site one using CLEC 1. You can easily verify whether the correct ILM rule is being applied. There are some ORLM messages without a rule; these refer to objects being purged from the grid. You can see the object path and object size. There is no rule because, by default, when you delete an object it is removed from the grid, unless you have object lock or versioning turned on.

As you know, audit messages only record successful client activity, but most of the time I want to know what is causing an S3 request failure. Let's look at the bycast.log messages related to S3 requests. In this query, I search for messages with a module name equal to S3RQ, DDSM, or OBDR, and I exclude all the internal S3 requests. DDSM is the DDS module, OBDR is object data read, and S3RQ is the S3 request module. There are messages with Event Process Create, each with a unique trace ID. The Event Process Create message is generated when the client establishes the connection with the grid. For example, this message is from a client with an IP ending in 174.243 making a PUT request for this object. When there is no other S3RQ message with the same trace ID, it means the request was successful; LDR does not generate a message with status code 200. It generates an audit message, which we looked at earlier.

Here is an example of a failed request. The Event Process Create and S3 Error Response messages with the same trace ID indicate the PUT failed with a NoSuchBucket error. We can expand the message detail field to look at the details. Here is another failed event: these two messages have the same trace ID, and this DELETE request failed with status code 403, access denied. Again, you can expand to see the message details; most likely the client does not have permission to delete the object. In another example, this PUT failed with 403, client authorization invalid access key. Expand the message details to see that the access key ID was not found.
From time to time, customers complain that they got an internal server error 500. In this example, I can see that there are four messages associated with this trace ID. It is a HEAD command. A message from OBDR shows that it tried to look up the object but failed because Cassandra was unavailable. Expand the message detail to see the complete message: lookup object, Cassandra exception, failed to execute CQL at consistency ALL. This means some Cassandra nodes were not available to fulfill consistency ALL. Let's look at whether there are DDSM messages in the same time frame. Here I see a couple of DDSM messages. You can see that a Cassandra connection was not possible, and the next message says the Cassandra connection is now unavailable, disabling the DDS service; all of these messages are coming from the site 1 S1 storage node. There is a Cassandra problem on this node, which is causing the HEAD of a non-existent object to return code 500. A minute later there is a message that the Cassandra connection is now available, meaning the Cassandra issue is resolved. You can see that, using queries and putting the fields into a proper table format, you can easily look at different messages to troubleshoot S3 request or Cassandra related problems.

Most companies also want to look at security related messages. You can search for messages coming from the SSH, su, or sudo programs and focus on the failed messages to determine whether there are unauthorized attempts and someone is trying to hack into the system.

We can use the Kibana visualization tools to build a dashboard. Usually I use Lens to create charts. You can choose which fields from the logs to explore and select the duration you want to chart. In this example, I want to count the S3 requests from each tenant. This chart shows the total S3 requests per tenant, and I can break it down by the type of request. You can change the display title or color scheme. After you complete the chart, you can save it to the library and add it to a dashboard, then organize the charts on the dashboard. When you are done with the dashboard, save it.

I have a predefined dashboard. In this dashboard, I mainly focus on messages related to S3 requests. In the first chart, I use the fields from the LDR S3RQ messages; these include all S3 requests, successful or failed. I count the requests per client IP, broken down by method. In the second chart, I display the HTTP status codes broken down by method. In the third chart, I display the type of S3 requests per tenant. This is based on the audit messages, so only successful requests are counted here. Visualizations and dashboards give me a quick overview of all S3 request related activities. This concludes the StorageGRID log analysis demo using the ELK stack.
With the NetApp StorageGRID 11.6 syslog forward feature, you can configure an external syslog server to collect and analyze StorageGRID log messages. Learn how to configure ELK and use it to identify and troubleshoot failed S3 requests.