In this session, we're going to talk about how NetApp IT is leveraging Spot to make cloud possible. I'm sure everyone has seen the confidentiality notice before. As we migrated to the cloud, one of the things we needed to do was keep our costs down, to make sure we're spending the same or less than what we were on-prem. That led us down the path of Spot. For our agenda: what is Spot, or a spot instance? What is an Elastigroup, and how is NetApp IT using Elastigroups? What is Spot Ocean, and how are we using it? And then Spot Eco and how we're using that.

So, what are spot instances? Within AWS, Amazon has enough capacity to handle their busiest day of the year. The rest of the year, a large portion of that compute sits idle. What Amazon did is sell that idle capacity as a product at a discounted rate, with the caveat that they can reclaim it with two minutes' notice if they need it. Under the hood, the instances are the same whether they're on-demand or spot; spot is just less expensive, and Amazon can pull it back with a two-minute warning. You're doing pretty well if you can save in the 50-60% range with spot. According to Amazon's documentation, you can save up to 90%; I'd be interested to see where people are getting 90%, or on what instance types. In our experience, 50-60% is doing well.

So, Elastigroups: what are they? They're a construct of Spot.io, or Spot by NetApp. Under the hood, it's an engine that helps you manage your spot instances. When you're creating an Elastigroup, it's a combination of something like an EC2 run-instances call, an Auto Scaling group, and a couple of other things, all cobbled together in one place to manage that type of workflow for you.
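To make the savings numbers concrete, here's a minimal cost-comparison sketch. The hourly rates are made up for illustration, not real AWS pricing:

```python
def monthly_cost(hourly_rate, hours=730):
    """Approximate monthly cost for one instance (~730 hours/month)."""
    return hourly_rate * hours

def spot_savings(on_demand_hourly, spot_hourly):
    """Fraction saved by running on spot instead of on-demand."""
    return 1 - spot_hourly / on_demand_hourly

# Hypothetical prices for illustration only (not real AWS quotes):
on_demand = 0.10   # $/hour on-demand
spot = 0.045       # $/hour spot, in the "doing pretty well" range

print(f"on-demand: ${monthly_cost(on_demand):.2f}/month")
print(f"spot:      ${monthly_cost(spot):.2f}/month")
print(f"savings:   {spot_savings(on_demand, spot):.0%}")
```

At these example rates the savings come out to 55%, right in the 50-60% band the talk describes as realistic, well short of the 90% headline figure.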
An Elastigroup can have zero nodes in it, on up to some upper limit; I'm not sure what that limit is, but I know it's pretty high. It can act like an Auto Scaling group and tie into load balancers. It also does predictive replacement: we talked about Amazon's two-minute warning before they reclaim your spot instances. Elastigroups do predictive replacement with about a one-hour lead time, and it's pretty accurate, I think about 85% accuracy, on when an instance will be reclaimed. So they replace the instance sooner, allowing more time for any shutdown scripts and quiescing to occur.

An Elastigroup, like I said, can be zero instances, one instance, and on up, and it can be a mixture of on-demand and spot: it could be 100% spot or 100% on-demand. These are really ideal for workloads that leverage a load balancer or an Auto Scaling group.

With Elastigroups, when an instance goes down because it was reclaimed, there are two ways to recreate it. One is to detach and reattach your EBS volumes and ENIs; the second is AMI-based. With the first approach, the reattach, the instance being replaced gets shut down, but the EBS volumes and ENI are kept active even after the EC2 instance is terminated, and then they're used to create a new EC2 instance with those same volumes and ENI. At the end of the day, it looks like your EC2 instance was shut down and started back up. The AMI-based approach looks similar, like a shutdown and restart, but it allows you to span Availability Zones. Because EBS volumes and ENIs are tied to a single AZ, you're going to stay in that AZ with the reattach approach. If you want to be able to span AZs, you can use a snapshot: under the hood, it'll create an AMI based on your EC2 instance and what's attached to it, and when the termination happens, it'll create a new EC2 instance off of that AMI. That again allows you to span AZs.
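The reattach flow above can be sketched as a small simulation. The step names and the instance dictionary are my own for illustration, not Spot's actual API:

```python
def reattach_replacement(instance):
    """Simulate the EBS/ENI reattach recovery flow.

    The EBS volumes and ENI survive the old instance's termination and are
    attached to the replacement, so the new instance keeps the same disks,
    network interface, and (necessarily) the same Availability Zone.
    """
    steps = [
        f"terminate {instance['id']} (EBS volumes and ENI kept active)",
        "launch replacement EC2 instance",
        f"attach existing EBS volumes {instance['volumes']}",
        f"attach existing ENI {instance['eni']}",
    ]
    # Everything carries over except the instance identity.
    new_instance = dict(instance, id=instance["id"] + "-replacement")
    return steps, new_instance

steps, replacement = reattach_replacement(
    {"id": "i-abc123", "volumes": ["vol-1"], "eni": "eni-9", "az": "us-east-1a"}
)
# The replacement stays pinned to the original AZ, since EBS and ENIs are zonal.
print(replacement["az"])
```

The AMI-based variant would differ only in that nothing zonal carries over, which is exactly why it can land the replacement in a different AZ.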
When you're in the Spot console, I just want to give you a view of how you can go through and select the different instance types. You don't have to know the names; you just have to know the sizes, and when you mouse over, it highlights what the actual CPU and memory are. So you can go through the different families and select all the instance types that match your workload. It's very easy to go through, see, and update. You can also have all of your AZs selected for the Elastigroup to use, or you can use a subset. On the bottom left you can see where you select the AZs, and on the bottom right, based on those AZs and the instance types selected in the previous dialog, Spot is doing market scoring. The darker squares are less likely to have a spot instance available, or they're seeing a lot of interruptions.

At the top right you can see the on-demand type. If an Elastigroup has to fall back to an on-demand instance because no spot instances are available based on your selections, you specify what instance type that on-demand fallback should be. You can get back to spot if you want; there are some options within the Elastigroup. One is to fail back as soon as a spot instance is available; one is to never go back; and the third is a maintenance window, so if you have an off hour, you can use that off hour to move back to a spot instance if you want to. Within your list of usable instance types, you can also specify a preferred subset: you can say "here's everything I'm willing to use, but this is really what I want to use," and the Elastigroup will try to match those preferred types before the other types. You can see the market scoring for those as well.
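The three revert-to-spot options can be sketched as a small decision function. The policy names here are mine, paraphrasing the console options, not Spot's actual identifiers:

```python
def should_revert_to_spot(policy, spot_available, in_maintenance_window=False):
    """Decide whether to move an on-demand fallback instance back to spot.

    policy: 'immediate' - fail back as soon as a spot instance is available
            'never'     - stay on on-demand once fallen back
            'window'    - only revert during a configured maintenance window
    """
    if not spot_available:
        return False  # nothing to revert to yet
    if policy == "immediate":
        return True
    if policy == "never":
        return False
    if policy == "window":
        return in_maintenance_window
    raise ValueError(f"unknown policy: {policy}")

print(should_revert_to_spot("immediate", spot_available=True))             # True
print(should_revert_to_spot("window", True, in_maintenance_window=False))  # False
```

The maintenance-window option is the one the talk highlights: it lets the disruption of moving back to spot happen during an off hour rather than whenever capacity appears.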
There's also a minimum instance lifetime. In the wizard, when you're first creating the Elastigroup, you have this option. It's not guaranteed, but it tries to find instances matching your criteria where Spot.io has seen the uptime be that long or more before the instance type is interrupted. Okay, so those are Elastigroups.

How are we using Elastigroups within NetApp IT? We have a collection of people, automation, and hyperscalers that we call CloudOne. Within CloudOne we have a portal for provisioning infrastructure. Within IT we can go through and provision resources or VMs as we want, and application teams can also provision their own resources through this portal. The default for any non-production instance in a hyperscaler is a spot instance. We can change it to on-demand if we want, or vice versa, and that's really just based on the application. We want spot to be opt-in for production versus being opt-out elsewhere.

Then we have something we call smart parking, which is a set of cost-saving automation where we scale the Elastigroups down to zero during off hours, on a schedule per application team. Then before the next day, or whatever the time frame the app team has, it'll scale back up to one so an instance is ready. For smart parking, we figured out, ballpark, what our on-prem costs are and what they would be within AWS. On the top left you can see the on-prem cost; on the bottom it's $189 a month, at least at the time we created this slide. With smart parking, turning the EC2 instance off 12 hours a day, every day of the week, for the year, it's half that amount, so there's quite a bit of savings. On the bottom right it's the same thing, but the instance runs on weekdays and is shut off during the weekends.
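The smart-parking arithmetic behind those slide numbers is straightforward; here's a sketch using the $189/month figure from the talk:

```python
def parked_fraction(hours_on_per_day, days_on_per_week=7):
    """Fraction of a full 24x7 month an instance actually runs."""
    return (hours_on_per_day / 24) * (days_on_per_week / 7)

full_month = 189.0  # $/month running 24x7, per the slide

# Smart parking: off 12 hours a day, every day -> half the cost.
print(f"${full_month * parked_fraction(12):.2f}")        # $94.50

# Weekdays-only: running 24h on 5 of 7 days.
print(f"${full_month * parked_fraction(24, 5):.2f}")     # $135.00
```

The weekday-only schedule lands at $135/month, which matches the roughly $54 of weekend savings mentioned next.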
You save about $54 a month by shutting off during the weekends. So, what are the benefits we've seen from using Elastigroups? More savings than the on-demand instance; we're generally getting in that 30-50% range. It does vary a little bit based on spot prices, which AZ you're in, the region, things of that nature. It makes smart parking really easy, because we just have to scale down to zero, and we get about $1,600 a month in savings through smart parking. Elastigroups can leverage our RIs and savings plans if you're running on-demand, so if you fall back to on-demand, you're not necessarily paying the full on-demand price. You can create on-demand instances and spot instances the same way; it's really just a percentage that you set, so 100% spot, 0%, or anywhere in between. You can also convert back and forth between spot and on-demand using that percentage. And the predictive replacement, that one-hour warning, really helps to quiesce the systems.

As for the challenges we've seen: the first one, I don't want to say it's self-inflicted, but it's something legacy. We're tied to an AZ when we create an instance because of DNS and our CMDB, and it's just something we need to work through and create some automation to solve. It's nothing with the product; it's just legacy on our side. We have seen fluctuations in the amount of savings with spot, and it really varies based on what the spot market's doing. Interruption rates: there can be a lot of them, depending on time of year, instance type and size, and what region you're in. Depending on where you are, you can see more interruptions than others, and even within the same region you might see a disproportionate number in one AZ versus another.
You can also adjust the cost orientation of an Elastigroup. I think there are three different settings; I don't remember all of them. One is to find the cheapest spot market, and at the other end of the spectrum is to find the spot market with the highest uptime. So it does allow you to tweak a little bit, which is good and bad; finding the right setting is itself a challenge.

So, Spot Ocean. If I were to summarize it, it's an engine for managing your Kubernetes worker nodes in hyperscaler clusters. We're using EKS, so EKS is managing the control plane and Ocean is purely managing the worker nodes. It has a controller running in the cluster that feeds data back to the Spot.io account, and that controls the scaling. Ocean will right-size the worker nodes and bin-pack the cluster to reduce your compute to what you're actually using, so you don't have excess capacity in your cluster unless you've set that up. It's using Elastigroups under the hood, so you get that one-hour predictive replacement. It has some built-in reporting around utilization, cost, and right-sizing.

Within an Ocean cluster you have virtual node groups, and your virtual node groups are where you define your worker nodes: whether they're spot, on-demand, or anywhere in between, your labels, your AMI, your instance types; all of that is within a virtual node group. You can have multiple virtual node groups within an Ocean cluster, and in our case all of our clusters have multiple virtual node groups. How we're using that: by default, if you deploy into the cluster, you're going to be running on a spot worker node. If you use a label and a taint toleration in your deployment, you'll end up on an on-demand node.
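The default-spot, opt-in-on-demand routing can be sketched as a selector. The `lifecycle` label and taint key here are hypothetical stand-ins for whatever the cluster's virtual node groups actually use:

```python
def node_lifecycle_for_pod(pod_spec):
    """Pick which virtual node group lifecycle a pod lands on.

    Default is spot; pods carrying our (hypothetical) on-demand node
    selector and tolerating the matching taint land on on-demand workers.
    """
    labels = pod_spec.get("nodeSelector", {})
    toleration_keys = {t.get("key") for t in pod_spec.get("tolerations", [])}
    if labels.get("lifecycle") == "od" and "lifecycle" in toleration_keys:
        return "on-demand"
    return "spot"

print(node_lifecycle_for_pod({}))  # spot
print(node_lifecycle_for_pod({
    "nodeSelector": {"lifecycle": "od"},
    "tolerations": [{"key": "lifecycle", "operator": "Equal", "value": "od"}],
}))  # on-demand
```

Requiring both the label and the toleration mirrors how Kubernetes itself works: the node selector steers the pod toward the on-demand group, and the taint keeps everything else off it.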
And we did that because, while databases running in Kubernetes will come back up after being moved around the cluster, they're overly sensitive to it, and that causes some application issues because of the availability of the database. So having the databases, or those pods, run on on-demand instances gives you more stability there.

The cool thing about scaling with virtual node groups: say you have a deployment, a replica set of three replicas, with anti-affinity configured in that deployment, and there are no workers available with any capacity in the cluster. Instead of spinning up a single worker, waiting for some other logic to kick in, then a second, then a third, Ocean will spin up all three worker nodes at the same time, because it sees the anti-affinity in the replica set. It's a really nice feature, because it gets things going so much faster and helps with your capacity a little bit more.

So, worker nodes: how are they replaced, and what does that look like? Anyone that's done any work in the past with cluster management knows this can be fun to work on. What Ocean does when a node is being replaced: it cordons the instance that's being replaced, so no new pods will go to it; spins up a new instance; waits for it to be added to the cluster; drains the instance that's being removed; and then terminates it. So it's very smooth, and other than your pods moving around the cluster, from an administrative perspective it's very nice and well managed.

This is a view from the console, and you can see here we have a couple of virtual node groups in the worker node list. You can call them whatever you want; we have a naming scheme, maybe not the prettiest scheme, but it's a scheme.
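The cordon-launch-drain-terminate order described above can be sketched as a simulation; the step strings are my own summary of the sequence, not Ocean's API:

```python
def replace_node(cluster_nodes, old, new):
    """Simulate Ocean's worker replacement order for one node."""
    log = []
    log.append(f"cordon {old}")            # stop scheduling new pods onto it
    log.append(f"launch {new}")            # replacement instance spun up first
    cluster_nodes.add(new)
    log.append(f"wait for {new} to join")  # replacement registered in cluster
    log.append(f"drain {old}")             # pods evicted and rescheduled
    cluster_nodes.discard(old)
    log.append(f"terminate {old}")         # only now is the old node removed
    return log

nodes = {"node-a", "node-b"}
log = replace_node(nodes, "node-a", "node-c")
print(sorted(nodes))  # ['node-b', 'node-c']
```

The key property is that the replacement joins the cluster before the old node is drained, so capacity never dips during the swap.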
Towards the middle of the screen you can see the instance types, the AZs the instances are running in, whether each is spot or on-demand, how many pods, and the resource allocations; you can see how much CPU and memory is consumed on each of the worker nodes. On the second line down, the differently colored portion of the CPU and memory is reserved headroom, which is user-configurable within the cluster. We have that so that if the cluster has no excess capacity and there's a new deployment, you don't end up waiting for a new node, or multiple nodes, to come up. The headroom allows that deployment to be deployed as soon as the scheduler sees it, versus you waiting on new nodes, and then Ocean just comes back and fills out the capacity that was used.

So, benefits: we've seen a fair amount of savings using Ocean; it's in the 51-58% range, which is awesome. It also just helps out with the scale-up and scale-down of the clusters. Anyone that's done any automation around that in the past knows the effort; here you just configure a couple of things and Ocean will scale your cluster up and down without you having to touch it.

There's also something called a cluster roll within Ocean. Let's say you go through and update your AMIs and need to replace all your worker nodes. You give Ocean a percentage of the cluster to work against in batches, and it'll replace your worker nodes, just like a node being replaced because of a spot reclamation by AWS; it works in that same fashion, just across multiple workers at the same time. So if you had a cluster of 100 worker nodes and you set 10%, it would work across 10 different batches of 10. And then there's the bin-packing of the pods to optimize utilization, which helps reduce the excess compute capacity needed.

So I have a couple of screenshots from the console.
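The cluster-roll batching is easy to make concrete; this is a sketch of the 100-nodes-at-10% example, not Ocean's actual implementation:

```python
import math

def cluster_roll_batches(nodes, batch_pct):
    """Split worker nodes into batches for a rolling replacement."""
    size = max(1, math.ceil(len(nodes) * batch_pct / 100))
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]

workers = [f"node-{n}" for n in range(100)]
batches = cluster_roll_batches(workers, 10)
print(len(batches), len(batches[0]))  # 10 10
```

Each batch would then go through the same cordon, launch, drain, terminate sequence as a single spot reclamation, just for several workers in parallel.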
This is a week view from early September. It gives you a view across all your clusters: what's being spent, how efficient they are, some right-sizing. Then within each individual cluster there's a cost analysis page where you can see, by namespace, how much money is going to which namespace, and you can expand that out a little more and see the deployments within the namespace. If you look at the bottom of the screen, you'll see for that namespace how much it's costing from a compute standpoint, and expanded out you can see the deployments in there too.

There's a really nice right-sizing tab within the console, and I'm sure this is an issue for anyone using Kubernetes: people still think in VM terms instead of millicores. In the top two graphs, the top line is what's requested and the bottom line is what's actually being used. The space between the two lines is unneeded compute that's in the cluster. And if you look at the bottom, in one example someone requested four CPUs for their pod, but they're using a fraction of a millicore. Based on that report, you'd say, "I need to adjust that deployment and get the request down to something more reasonable," and help shrink those lines together to reduce any wasted resources.

This slide was pretty hard to come up with. We've been using Ocean for almost two years and just haven't had any problems with it. The only problems we've had haven't been Ocean problems; they've been AWS problems, and we were affected because of that. This is probably more of a lessons learned: how you want to maintain your cluster should drive how you create your cluster. When we were first creating our clusters, we used eksctl, but we wanted to manage them with Terraform. We should have just used Terraform to create the clusters from the get-go.
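The gap between the two lines on that right-sizing tab is just requested minus used; here's a sketch of the four-CPU example (the 0.5 millicore usage figure is made up, since the talk only says "a fraction of a millicore"):

```python
def cpu_waste(requested_millicores, used_millicores):
    """Unused CPU request, absolute and as a fraction of the request."""
    waste = requested_millicores - used_millicores
    return waste, waste / requested_millicores

# Slide example: 4 CPUs (4000m) requested; actual usage is a fraction of
# a millicore (0.5m chosen here purely for illustration).
waste_m, waste_frac = cpu_waste(4000, 0.5)
print(f"{waste_m:.1f}m wasted ({waste_frac:.2%} of the request)")
```

At that ratio essentially the entire request is headroom the scheduler must reserve anyway, which is exactly the waste the report is meant to surface.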
So we ended up managing the virtual node groups through Terraform, but not the clusters themselves. You don't really need to make changes to your clusters very often, so it works out, but it's something to think about: plan your day-two operations before you complete your day-zero operations. Also, we have a product that's node-based licensed, so we're working with the application team to figure out how best to run it in the cluster and manage the licenses so we don't go over as nodes change in EKS. Amazon's EKS worker node AMIs work great, but our security team wanted some additional changes made, which led us down the path of creating our own custom worker AMI images. That's not an Ocean issue per se; it's just a security and hyperscaler thing, but it can eat up some time working through that process.

Spot Eco. Spot Eco, essentially, is just managing your RIs for you: it's buying and selling them based on your usage, and it can interact with the RI marketplace. If we look at the image on the right side, the 7% and down are our RIs and savings plans that someone in cloud operations or the FinOps team purchased. The upper 65%, at least how it's depicted, is managed through Eco, and combined that's getting toward 100% coverage. It's not uncommon to buy an RI, one-year or three-year, and then realize you don't really need that instance type; maybe you need a different size or a different family as your usage patterns change. What Eco does is see that you're no longer using that RI, sell it on the marketplace, and then buy a full or partial RI based on what you're actually using. It's very much set-and-forget, though every 6 to 12 months you should go through and check that the settings match your approach, as far as how aggressive to be or how much coverage to have. We're seeing, on average, about $21,000 in savings a month by using Eco.
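The value of that coverage can be sketched as a blended-rate calculation. The hourly rates are hypothetical, and the 72% coverage figure simply combines the slide's ~7% purchased and ~65% Eco-managed portions:

```python
def blended_hourly(usage_hours, ri_coverage, od_rate, ri_rate):
    """Hourly compute bill when a fraction of usage is covered by RIs."""
    covered = usage_hours * ri_coverage
    return covered * ri_rate + (usage_hours - covered) * od_rate

# Hypothetical rates for illustration (not real AWS pricing):
od_rate, ri_rate = 0.10, 0.06

uncovered = blended_hourly(100, 0.0, od_rate, ri_rate)   # all on-demand
covered = blended_hourly(100, 0.72, od_rate, ri_rate)    # ~7% bought + ~65% via Eco
print(f"${uncovered:.2f} -> ${covered:.2f} per 100 fleet-hours")
```

The point of Eco is that it keeps the coverage fraction high as the fleet's instance mix drifts, without anyone manually trading RIs.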
So, takeaways: Elastigroups make the management and usage of spot instances easier, because they handle that reattachment on the back end. Ocean makes management of the clusters easy, so you can focus on the fun stuff, or the tweaks, within the clusters. And Spot Eco just manages your RIs, making it super easy to save some money without having to really do anything. Here are some additional sessions that might be of interest, and our contact info. Thank you.
Learn how NetApp IT is making cloud possible by leveraging Spot by NetApp products to reduce costs. See how Elastigroups, Ocean and ECO are used within NetApp IT to expand into the cloud while keeping costs down. Simplify Kubernetes worker [...]
Scott Stanford is a Senior Site Reliability Engineer and FinOps Practitioner in NetApp IT, focused on SRE practices and FinOps in NetApp IT's hybrid multi-cloud DevOps platform. He is based in North Carolina and has been with NetApp for 11 years. Scott has a diverse background spanning containers, DevOps, development, IT, and software configuration management systems.