Hello everyone. This video is a quick product demonstration of how to use the S3 Select feature with NetApp StorageGRID. S3 Select is a feature originally from AWS that is useful when working with big data sets. It allows you to use SQL queries to retrieve a subset of your object data. With StorageGRID, we added support for the SelectObjectContent API in 11.6. Here on the screen I have a Jupyter notebook set up to walk through this demonstration. The demonstration is very simple, and we will be using the AWS CLI.
Specific to StorageGRID, there are a few considerations and requirements for using S3 Select. First, the object that you want to query must be in CSV format, or be a compressed file, either gzip or bzip2, containing a CSV file. Second, you need a StorageGRID tenant that has the Allow S3 Select option enabled. And finally, the request needs to be sent to a StorageGRID load balancer endpoint, so you have to make sure you have one configured on StorageGRID.
For my demonstration file, I have pulled sample data from the 2020 US Census. The file is titled sub-est2020_all.csv. If I simply head this file, we can take a look at what is in this CSV file, at least for the first five lines. You'll notice the column headers here. A couple of important ones are the population estimate of 2010 and the population estimate of 2020. And here you'll see it's listing states and then the associated data per column. I already have this file uploaded to StorageGRID, but if I list it out just to see the object size, you'll see that it's about 10 megabytes. This data set is just for the demo video, but imagine you are working with a CSV file with millions of lines or more. As your CSV file grows, being able to filter and query a subset with S3 Select matters more and more.
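The format requirement and the "head" step above can be sketched in Python. This is only an illustration, not part of the demo: the helper names `compression_type` and `head_csv` are hypothetical, mapping a file suffix to a compression type is a client-side convention assumed here (StorageGRID is told the compression type in the request rather than inferring it from the name), and the `latin-1` default encoding is an assumption chosen because it can decode any byte sequence.

```python
import bz2
import csv
import gzip
from itertools import islice

# Assumed convention: file suffix -> CompressionType value used in the
# S3 Select input serialization (NONE, GZIP, or BZIP2).
_COMPRESSION_BY_SUFFIX = {".csv": "NONE", ".csv.gz": "GZIP", ".csv.bz2": "BZIP2"}
_OPENERS = {"NONE": open, "GZIP": gzip.open, "BZIP2": bz2.open}


def compression_type(name):
    """Return the CompressionType for a CSV, gzip, or bzip2 file name."""
    lowered = name.lower()
    for suffix, kind in _COMPRESSION_BY_SUFFIX.items():
        if lowered.endswith(suffix):
            return kind
    raise ValueError(f"{name!r} is not a CSV, gzip, or bzip2 CSV file")


def head_csv(path, rows=5, encoding="latin-1"):
    """Return the first `rows` parsed rows of a local (possibly compressed) CSV."""
    opener = _OPENERS[compression_type(path)]
    # newline="" lets the csv module handle \r, \n, and \r\n record endings.
    with opener(path, mode="rt", newline="", encoding=encoding) as handle:
        return list(islice(csv.reader(handle), rows))
```

Calling `head_csv("sub-est2020_all.csv")` on a local copy of the demo file would return the header row plus the first four data rows, which is the same view the `head` command gives in the notebook.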
So the next thing I'm going to do is issue the SelectObjectContent API call and perform a SQL query on the object. In this demonstration, I'm querying for the 2010 population and the 2020 population, and I want to calculate the percent increase in population for the US states. My command here is as follows. I'm using the S3 API. I'm pointing the endpoint to StorageGRID, to that load balancer endpoint. I've selected my profile that has my access key and secret access key. I'm using the new SelectObjectContent API, targeting the bucket that has my CSV file uploaded, and specifying the key as well.
And then here's my SQL expression. I want to select the state name, the census 2010 population, and the population estimate of 2020. I also want to calculate the percentage increase from 2010 to 2020, so I'm taking the estimate of 2020, subtracting the 2010 population, dividing by the 2010 population, and multiplying by 100 to create a percentage. I don't want all the data. I only want the rows where the name is the same as the state name, so I'm looking for all states, and I'm limiting it to 10.
A couple more arguments: for the input serialization, we specify CSV, and we want to use the file header info because it provides our column names. In our case we also have to set the record delimiter to a carriage return. This is specific to my data file. For the output, I also want CSV, and I want to write it to this file called output file.csv.
If I run this command in the Jupyter notebook, it's running right now, as the asterisk shows. And now it's completed, and I can cat that output file. You'll notice that instead of pulling the entire 10 MB file, I have used a SQL query to pull a subset of that data. It's telling me the states, their 2010 population, their 2020 population, and the percentage increase I've calculated, which is about 3% in the case of Alabama, and there are only 10 entries.
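The same request can be sketched with boto3 instead of the AWS CLI. Treat this as a hedged illustration rather than the command shown in the video: the column names (`NAME`, `STNAME`, `CENSUS2010POP`, `POPESTIMATE2020`), the `"\r\n"` default record delimiter, and the helper names are all assumptions, since the transcript does not show them, and the query has not been verified against a StorageGRID system here. The `percent_increase` helper simply mirrors the arithmetic in the SQL expression.

```python
def percent_increase(old, new):
    """Same arithmetic as the SQL expression: (new - old) / old * 100."""
    return (new - old) / old * 100.0


def build_select_params(bucket, key, limit=10, record_delimiter="\r\n"):
    """Build keyword arguments for the S3 SelectObjectContent call.

    Column names and the default record delimiter are assumptions about
    the demo file; adjust them to match the real CSV header and line endings.
    """
    expression = (
        "SELECT s.STNAME, s.CENSUS2010POP, s.POPESTIMATE2020, "
        "(CAST(s.POPESTIMATE2020 AS FLOAT) - CAST(s.CENSUS2010POP AS FLOAT)) "
        "/ CAST(s.CENSUS2010POP AS FLOAT) * 100.0 "
        "FROM S3Object s WHERE s.NAME = s.STNAME "
        f"LIMIT {int(limit)}"
    )
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            # USE: take column names from the first line of the CSV.
            "CSV": {"FileHeaderInfo": "USE", "RecordDelimiter": record_delimiter},
            "CompressionType": "NONE",
        },
        "OutputSerialization": {"CSV": {}},
    }


def run_select(endpoint_url, profile, params, output_path):
    """Send the query to a load balancer endpoint and save the CSV result."""
    import boto3  # third-party; imported lazily so the helpers above work without it

    session = boto3.Session(profile_name=profile)
    s3 = session.client("s3", endpoint_url=endpoint_url)
    response = s3.select_object_content(**params)
    with open(output_path, "wb") as out:
        # The response payload is an event stream; only Records events carry data.
        for event in response["Payload"]:
            if "Records" in event:
                out.write(event["Records"]["Payload"])
```

A call might look like `run_select("https://sg-lb.example.com:10443", "demo", build_select_params("demo-bucket", "sub-est2020_all.csv"), "output_file.csv")`, where the endpoint URL, profile name, bucket name, and output file name are placeholders. As a quick check of the arithmetic, `percent_increase(1000, 1030)` is approximately 3, using made-up figures.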
So S3 Select is a very powerful feature. It allows you to optimize your data queries, which can improve performance, reduce network bandwidth, and offload some of the compute from your application onto your storage platform, in this case StorageGRID. And that concludes the demo. Thank you very much.
Optimize performance for your analytical workloads with StorageGRID S3 Select support. StorageGRID customers can now use simple SQL queries to accelerate S3 object data querying performance for analytical workloads.