Okay, we're going to kick off and get started. Hi everyone, thank you so much for joining us today. My name is Shiva Kelleher and I am your partner technical lead, based out of our Cork, Ireland office. You are joining us for our NetApp on NetApp webinar, Scaling Object Storage for the Enterprise with StorageGRID, and joining us to talk on that topic today is Senior Storage Engineer Ken Lee. Before we get started, some quick housekeeping: you'll see we have a Q&A function, so if you have any questions or comments, feel free to drop them in there. Myself and Chris Long will be keeping an eye on those, so Ken will have an opportunity to answer your questions. This meeting is also being recorded and the recording will be available soon after today, so keep an eye on your inbox for that. If any questions come to mind afterwards, just let us know and we can get those answered. Thank you very much for joining us today. Ken, I'll let you get started.

Can you guys see my screen? — Yeah, we can see your screen. Thanks, Ken.

Hi, everyone. My name is Ken Lee, and hopefully you can hear me clearly; I'm just getting over a cold, so hopefully my voice will make the hour mark. I've been a storage admin here at NetApp for over 14 years now, and my main responsibility right now is StorageGRID. I'm going to go over a little of the history of how we got here, and then I'm going to talk about two use cases that were big for us here at NetApp.

About ten years ago, around the 2015 time frame, we started using StorageGRID, and we did so slowly, only using StorageGRID for static data and doing ONTAP backups using NetBackup. Then we transitioned to AltaVault and started doing SOX-compliance backups.
When AltaVault was sunsetted, we kept the StorageGRID behind the vault environment because of the SOX retention period. After that, we slowly introduced the S3 object store to the rest of NetApp IT, at the beginning mostly for static content like Splunk, Git repositories, and Veeam backups.

In today's environment we currently have three StorageGRIDs. The first one is what we call our corp StorageGRID. It is an internal StorageGRID that is not accessible from the outside. Originally this was a three-site StorageGRID, but after closing our Sunnyvale site it became a two-site grid. The locations are Hillsboro, Oregon and RTP, North Carolina, running version 11.7.0.11 with 67 nodes and a total capacity of six petabytes — 61 data nodes using the SG5660, SG5712, SG5760, and SG6060 models. And we are in the middle — well, no longer in the middle; last week, in fact, we decommissioned the SG5660s. We're running SANtricity 11.80.1R3, using two VM admin nodes and four VM gateway nodes for load balancing. The reason we decommissioned the SG5660 is that it's not supported at the higher version; I believe 11.8 no longer supports that hardware.

Our second StorageGRID is our DMZ environment; this one is customer facing. It also started as a three-site grid until Sunnyvale closed, and is running 11.7.0.11 with a total of 23 nodes, doing two copies, with six petabytes of capacity. There are 19 data nodes; the models are the SG5712 and the SG6060, running SANtricity 11.80.1R3, with two VMs for admin nodes, and we're using Avi, the VMware load balancer, for the gateway nodes. Originally we were using an F5 load balancer. We started an upgrade yesterday, and this environment is now at release 11.8 — but know that when you go to 11.8, one of the ports got moved: 18082 was moved to 18086.
And because this is a DMZ facing our customers, our firewall team had to open that port before any traffic could flow through. So just know that when you're going to 11.8, there's a port change.

Our third StorageGRID is our Active IQ grid, what we call a bare-metal grid. For the compute portion we are using Cisco UCS C220 M5 servers running Red Hat 9.4. Each has two 10-gig interfaces and two SAS ports for the E-Series, running Podman instead of Docker, because Red Hat dropped support for the Docker engine. The back-end E-Series is the E5760. We also have the SG6060 and the SG6160. You might be wondering why — those SGs are appliances and not bare metal. It is because we just added them less than a month ago. When we built this grid, it was a single-site grid; now it is a two-site grid, and the process we're going through is to migrate everything out of the first site into the second site. It's running SANtricity 11.80.1R3. We are also using ONTAP for the UCS storage, for the compute iSCSI boot LUNs. The grid is running 11.9.0.3, and the total number of nodes is 24. Because we started this grid as a single site, we are using erasure coding of 4+2; for any objects less than 200 K, we do three copies locally. The grid has a capacity of six petabytes, and the total number of storage nodes is 14. We have two VM admin nodes, six gateway nodes, and one external HAProxy load-balancer gateway. I will go through the reason why we started with a bare-metal StorageGRID, and why we have so many admin nodes.

Okay, so that's our environment. It's not very different from a real customer's: we built a StorageGRID, we went very slowly with static data, and then we evolved it from there. You'll notice that I never talk about FabricPool, and the reason is that we're NetApp and storage is pretty cheap for us, so we didn't want to add a layer in there.
But we always suggest to customers that the best way to start with StorageGRID is usually backups, which are static data, and FabricPool from your ONTAP environment.

Next I will talk about the use cases, the first one being secure file upload. The name could have changed by now, I'm not sure. If you were here more than ten years ago, I'm sure some of you are familiar with the site upload.netapp.com, where we were using Aspera to allow our customers to upload files to NetApp. It required that our customers download a browser plugin; at the time it only supported Netscape plugins — I know, I'm dating myself — and that plugin allowed the web browser to load and run external application software. That's how we were doing the uploads. At the time, the file size was limited to four gigabytes, so if the plugins were giving our customers upload problems, we just told them, hey, try FTP. And FTP was insecure and not very robust. During that time frame, NetApp was under a lot of pressure from our customers to get a better handle on this issue. When a customer had an issue with ONTAP, they had to upload their core files so that our engineers could do analysis on them to troubleshoot the issue, and large customers tend to have large core files, so uploads would often fail. It got so bad that a customer's CEO would call our CEO and complain because their systems were down.

To address this issue, these were some of the requirements. It must support very large core files: as ONTAP was evolving, core files started to grow, and at the same time we were introducing clustered ONTAP; it was estimated that core files could grow up to 800 gigabytes. The upload must be able to function in different regions, closer to our customers, to decrease the data transfer time. And when an upload fails, it must be able to restart, not start over from the beginning.
We also needed to address the key issues with the legacy upload solution — slow upload rates, and less setup for our customers, like no opening ports or downloading plugins — and to be able, at a minimum, to upload 150 gigs.

The architecture we came up with: the customer must be able to be verified. The initial plan was to have the customer enter a NetApp ID, but that changed to having the customer enter a case number, because if you open a support call, you have a case number. We don't want any client-side plugin — just the browser and nothing else. The developers used a well-known stack: the S3 SDK, AngularJS, and Node.js. During the design phase, they upped the file transfer size to two terabytes. Use the S3 API for security. Use multipart upload for restartability. Use pre-signed URLs for added security, meaning we create a pre-signed URL that will expire in six hours, and allow for pause, resume, and cancel.

The secure upload is an outside-facing application, and therefore our security team would not allow it access into our corp network — but the core files that are uploaded have to go to our engineering domain. That was how the legacy system was, and we didn't want to make any changes to the back-end part. So we used one of the StorageGRID platform services, CloudMirror, to replicate incoming objects from the DMZ StorageGRID into our corp StorageGRID, and from there the data flows to our engineering domain without any changes. What you are seeing on the screen is the original environment with three sites. Once we decommissioned the Sunnyvale site, the grid became a two-site grid; nothing else needed to change, and data flowing to the engineering domain still worked. So how did we get the data from StorageGRID to the engineering domain in a timely fashion once a core file is uploaded? We have two regions, one in the west and the other in the east, and our customers decide which region's endpoint to upload their core files to.
Once the user enters the case number and access is verified, a pre-signed URL is created and S3 multipart upload is used to upload the core files, for speed and restartability. Once the file is uploaded into StorageGRID, we want to get it to the engineering domain as soon as possible, and to do that we use another StorageGRID platform service: SNS, the Simple Notification Service. What we did was configure a bucket to send a notification to the AWS SNS service when an object has been uploaded. There is a program that subscribes to that AWS SNS topic, so it knows right away that an object has been uploaded. The program then logs into the corp StorageGRID and migrates the core file into the engineering domain. So there is very little lag time once the file is uploaded.

After the application went live, everything was great. We were happy, our customers were happy — except for those that were not within the United States. So we deployed a new standalone site in EMEA and used CloudMirror to replicate the data to the corp StorageGRID in the United States. From the application standpoint, only a small change was needed: they just added code that said, hey, if you're in EMEA, the uploader uses the EMEA endpoint and the data gets uploaded to the EMEA StorageGRID. At the same time, the data can also be processed within EMEA if needed, instead of coming all the way to the US.

So why did we design it this way? Why did we not just create a three-site grid — why didn't we just expand our current two-site environment to a three-site environment? Well, we looked at it, and we concluded that if we did that, we would not need CloudMirror, which is better, but we would run into problems with the data. The data itself would eventually get to the US, because StorageGRID is an eventual-consistency environment.
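The bucket-to-SNS hookup described above is configured through the standard S3 bucket-notification API; on StorageGRID, the topic ARN refers to a platform-services endpoint defined in the tenant manager. A minimal sketch, with a hypothetical topic, bucket, and gateway endpoint:

```python
def build_notification_config(topic_arn: str) -> dict:
    """S3 bucket-notification document: fire on every object creation.

    On StorageGRID, `topic_arn` identifies a platform-services
    endpoint configured by the tenant (hypothetical value here).
    """
    return {
        "TopicConfigurations": [
            {
                "TopicArn": topic_arn,
                "Events": ["s3:ObjectCreated:*"],
            }
        ]
    }

def apply(bucket: str, topic_arn: str) -> None:
    """Push the configuration to the grid (requires a live endpoint)."""
    import boto3
    s3 = boto3.client("s3", endpoint_url="https://sg-gateway.example.com:8082")
    s3.put_bucket_notification_configuration(
        Bucket=bucket,
        NotificationConfiguration=build_notification_config(topic_arn),
    )
```

A small worker subscribed to the SNS topic then receives a message for each upload and can start moving the core file immediately, instead of polling the bucket.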
But given our WAN pipe from the US to EMEA, which was very small, we felt the StorageGRID internal communication would be a problem — each node within the grid, meaning all sites, all nodes, needs information about all the other nodes in the grid, and that chatter would just be too much for the WAN link we had. We were not afraid of the data being transferred; it was more that the queues for the internal communication would get too high and, over time, cause issues with the grid itself. And we still had a problem: if you're not within the three regions, your uploads could still be very slow. On top of that, about six months after we rolled out the EMEA site, that location was scheduled to be shut down.

So what was our solution? The solution was AWS S3. Since StorageGRID is S3-compatible with AWS S3, we can use AWS object storage instead of our own StorageGRID when the customer is outside of our regions. All we have to do is open an account with AWS, deploy a very small AWS S3 object store, and point our application to an AWS endpoint within that region. The only modification to our upload software was to add the different AWS endpoints as a variable or a table entry. The user selects the region closest to them, usually outside of the US; the file gets uploaded to AWS within that region, and then we use AWS S3 to replicate that object from outside the United States over the AWS network backbone. We then use Cloud Sync to move the objects out of AWS S3 into our StorageGRID. We shut down EMEA and scaled the application to Sydney, Japan, Singapore, Hong Kong and many other locations — with no application code change, only variable or table regional data, and no need to deploy our own StorageGRID throughout the world. One thing to note: we do not leave our data in AWS. We only use the AWS network to transfer data from outside the United States into NetApp. Once it reaches our StorageGRID, we remove any data in AWS.
One reason is to comply with our data governance, and two is to keep AWS S3 storage usage transient — almost nothing at all. This is the application's front door; I'm sure many of you have seen this before.

So in summary, this application uses the following StorageGRID features: multipart upload to get fast upload speeds; the CloudMirror platform service to move data from the DMZ to our internal corp StorageGRID; the SNS platform service to trigger a notification when an object has been uploaded, so the data can be acted on as soon as possible and moved to our engineering domain; and Cloud Sync to transfer data from the AWS S3 object store into our StorageGRID. Using these NetApp features allows the application to be flexible, scalable, and cost effective. I will pause here to field any questions that you might have.

Yeah, hi Ken. I actually have some questions — thanks for this really interesting case study. You mentioned StorageGRID platform services like CloudMirror. Can you explain how these help solve real problems, like speeding up core file delivery to engineering?

Yeah. So CloudMirror allows you to copy an object from one StorageGRID to another StorageGRID without any coding. All you have to do is turn it on, and then you could use this to mirror to a secondary site. Let's say your customer wants the same data and you don't want to do anything: you just create some rules in StorageGRID and say, hey, if this object is uploaded, I want my partners to have the same data — and then they won't even touch your environment at all. CloudMirror will allow you to do that with minimal effort; you just need an endpoint from your partners. Or even if you want to do disaster recovery, or have the data in a place where, if you get attacked and everything's encrypted in StorageGRID, the data replicated with CloudMirror would be safe and you could bring it back.
So it's a lot more efficient and safer at the same time. — Yes, thank you.

Okay, if there are no other questions in the chat, I'm going to start on the second case. This one is regarding our Active IQ data lake. Just to give you a feel for how large the system is, here are some metrics for Active IQ. There are about 10 trillion data points per month of telemetry data — you can say this is all the AutoSupport data NetApp gathers. That translates to about three petabytes of data-lake information, and another ten petabytes of shared storage that NetApp uses internally. Our customers can get to some of the data via the Active IQ Digital Advisor, mobile apps, and custom APIs.

The original data lake architecture ingested telemetry data into a Hadoop environment using Kafka streaming and Apache Flume. We had many mini clusters where four compute servers were directly attached to one storage array — about 33 mini clusters, which came out to about 4,000-plus cores and seven-plus petabytes of disk storage. It's huge.

This is the new data lake architecture. The major change is to use Dremio in a Kubernetes environment, replacing the original Hadoop environment. Hadoop was a bare-metal environment, and it was more painful to manage than Kubernetes in things like patching and that type of stuff. Apache Spark is still used for batch processing. The amount of disk space needed shrank quite a bit, to three petabytes. We're using 16 Dremio executors — those are nodes on a Kubernetes cluster. Data is stored in Parquet format; that's just what Dremio uses, and Dremio is able to use StorageGRID as the back end to house its data. So originally we had about 130-plus servers with 4,000-plus cores, and we wound that down — more than a 60% reduction in compute.
Even with that much reduction in compute, we got faster query times — a 10 to 20 times improvement. With Hadoop, on average, queries to the data lake took 45 minutes. Today, with StorageGRID, the largest query took about — anybody want to guess? From 45 minutes to... 2.5 minutes. — Oh, I was going to say five minutes. — Yes. So that was for our largest queries: 2.5 minutes. Smaller queries take seconds. That's really big, and that's what StorageGRID is providing — that speed for searching. We got about 30% total cost savings from using less compute, less data center space, and less storage space; we saved about two petabytes of storage. This was a major change for us at NetApp.

As I said earlier, why did we go bare metal instead of appliances for our Active IQ system? Because of the Hadoop environment: we had purchased a whole bunch of E-Series storage for it, and we wanted to reuse that as we freed it up during the migration. That's the first phase: we took seven E-Series and started to build a single-site StorageGRID. Because this would be a single site, and to ensure data accessibility with a single node down while keeping the amount of space as low as possible, we decided to go with erasure coding of 4+2. We couldn't take all the storage right away — we had to take pieces of it — so that's why we did it in phases. The 4+2 erasure code allows us to have one storage node down, and if we had another node down, it shouldn't cause any problems, because of the 4+2.

For our data nodes, the compute is running on UCS C220 M5 servers running Red Hat 9.4. Since Red Hat no longer supports the Docker engine, we're using Red Hat's Podman engine instead. Nothing needs to change — all you do is create an alias for the Docker commands when you set up Red Hat. Each physical host is mapped one-to-one to an E-Series.
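The space trade-off behind the 4+2 choice Ken describes is simple arithmetic. A back-of-the-envelope sketch (not a NetApp sizing tool): under k+m erasure coding, each object is split into k data fragments plus m parity fragments on different nodes, so any m fragments can be lost, at an overhead of (k+m)/k instead of the 3x of full triple copies — which is also why small objects under 200 K get plain copies, since fragment overhead isn't worth it there.

```python
def raw_needed(logical_bytes: int, data_frags: int, parity_frags: int) -> float:
    """Raw capacity required to store `logical_bytes` under k+m erasure coding."""
    return logical_bytes * (data_frags + parity_frags) / data_frags

PB = 10**15
lake = 3 * PB                      # roughly the Active IQ data lake size

ec_4_2 = raw_needed(lake, 4, 2)    # 4+2 EC: 1.5x overhead, survives 2 lost fragments
copies_3 = raw_needed(lake, 1, 2)  # three full copies modeled as 1 data + 2 extra: 3x
```

That is the difference between 4.5 PB and 9 PB of raw disk for a 3 PB lake, while still tolerating two simultaneous node failures.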
The configuration is such that the host boot LUN is attached to our ONTAP storage — that's just our NetApp standard. You could attach it to the E-Series, but we didn't want to do that. The E-Series has two LUNs that StorageGRID uses at the host level, along with the rest of the LUNs; the data LUNs are only visible inside the StorageGRID container.

For an appliance, to log into the OS you SSH into the StorageGRID node IP with a port number of 8022, using the admin user ID. For bare metal, there is an OS interface just like any other Red Hat host, so logging into the host is a little different: you don't use port 8022, you just go in like it was your own Red Hat environment. You run commands and you will see Docker running in the background, and within that you will see StorageGRID running. Because of that, the host name is different from the node name. This is just what an appliance looks like, to give you a comparison: as I stated earlier, SSH into the appliance OS requires you to add port 8022, and all the required storage comes from the back-end E-Series — there is no ONTAP; that's just our standard. On the appliance, both the host OS and the container run Debian. Other than these little minor changes, bare metal is pretty much the same as an appliance — same deployment.

That concludes my presentation. Do we have any questions?

Yeah, thank you so much for your presentation, Ken. And just so everyone can see, you can connect with Ken via the various links here: email, LinkedIn, the NetApp YouTube channel. I actually have two questions of my own, because I always find case studies fascinating, and the huge improvement you made in speed is mind-blowing, to be honest.
So for the previous case study, the Active IQ transformation: we talked a lot about the big improvements you got from it, but how difficult did you find the migration itself? Were there any particular challenges that stood out?

That's a good question. When we went into this project, they allocated a lot of time for migrating, because the data was large, and they were afraid that would be the biggest roadblock to getting the project done. But it wasn't — it was very smooth. Talking to the project team, it just blew their minds that moving the data from the Hadoop environment to StorageGRID was so quick. It was so fast I can't find the words for it. And the speed after they moved the data, when they were testing queries using Dremio — that also blew their minds, because before, to do analysis and stuff like that, they would have to wait 45 minutes to an hour just to get the data to start working on it and see what's going on. Now, with 2.5 minutes for the largest query, they just love it.

Yeah, I'm sure we wish every project went that fast. — Yeah. And that project is still evolving. There are things going on — site closures, and making it fail-safe, because Active IQ is very important to NetApp. The next part we're doing right now is migrating the data from Hillsboro to RTP. And since Active IQ went live, we have a lot more data now too. But with StorageGRID, all I have to do is change the ILM policy and it will start moving the data for us in the background, from one site to the other, without any effort from the application team.

That's fantastic. My second question is also kind of stating the obvious, but I always like to ask: why leverage StorageGRID for Active IQ?
Why not use some other ONTAP system? Yeah, we looked at that, and object storage is meant for large amounts of data. Object stores were built for things like the Internet of Things, where all they do is stream lots and lots of data, and our Active IQ is mostly data points from our install base — all that data, it was just a perfect fit. Originally, Hadoop was the greatest thing at the time, and then over time it just couldn't keep up. But with StorageGRID — like I said earlier, to get customers familiar with StorageGRID, I would always start off with static data. Every customer has some kind of backup; they should just start writing it to StorageGRID to get a feel for static data. And you can start small; you don't have to start very fast. Then, when you have StorageGRID in the environment, you can say, hey, you have ONTAP, why don't you do FabricPool — you already have both environments. From there, open it up to more transactional applications, like JFrog or other things like that, and grow it. StorageGRID's back end is built on Cassandra, which will let you scale to hundreds and hundreds of nodes with minimal effort, and that allows you to scale along with your data growth in the easiest, most efficient way. That's one of the reasons why we want to use StorageGRID for everything that has lots and lots of static data. I also know that as technology moves along, our storage is becoming faster, with SSDs and so on, so we are also starting to use it for transactional data. But with transactional data, people still think about snapshots for recovery, and StorageGRID doesn't have that type of thing — though we might in the future.
So we're going in the right direction. — Yeah, exciting times lie ahead. So that's it for today. I don't see any more questions in the chat, so we'll wrap it up here. Thank you very much to everyone for joining, and thank you very much, Ken, for presenting — I know you're sick. It's all been recorded, so you should get an email in the next few days with that recording. And like I said earlier, if you have any follow-up questions that come to mind, you can shoot me an email and I'll get those answered. So that's it. I will say goodbye, and I hope you have a good morning or evening, wherever you are. Thank you.
Learn how NetApp IT scales unstructured data with StorageGRID to power archives, enable object lock, and leverage advanced features like Cross-Grid Replication.