Organizations all over the world are in a race to adopt and accelerate AI to transform their businesses and processes. Because of this, AI-ready compute infrastructure is hard to come by, and that necessitates a hybrid cloud AI architecture: organizations have to take advantage of AI-ready compute where it's available, when it's available. I'm Mike Oglesby, a technical marketing engineer focused on AI and machine learning solutions here at NetApp, and in this session I'm going to give you an overview of a solution that we put together with our partners at Domino Data Lab and Amazon Web Services to address this hybrid cloud AI challenge head-on. So, let's dive right in.

First, I need to mention that the information in this presentation is confidential and proprietary to NetApp, and I'll give you a few seconds to read the confidentiality notice.

I mentioned that hybrid cloud architectures are becoming table stakes in the AI world, but these architectures raise several challenges. Data often resides on premises, in different data centers, and AI-ready compute is not always in the same data center as the data. Even when it is there, it might be oversubscribed, with a long queue to use it. So it's necessary to be able to flex workloads to wherever compute is available: you need to be able to move the data to the compute and also push jobs out to different compute resources. We have paired Domino Data Lab and NetApp technology to provide a solution that does just this. It enables data scientists to seamlessly schedule their AI workloads where their compute is available, and it enables them to both bring the data to the compute and send jobs out to compute resources when they become available.

First, I'm going to give a quick overview of this solution. Then I'm going to highlight some challenges around AI scheduling and data access that this solution specifically addresses. Then I'm going to show you a demo of the solution in action, and lastly we'll wrap up with links to some resources that you may want to check out.

So what is this solution? It's pretty simple, really. Essentially, we have paired Domino Data Lab's enterprise AI platform, one of the leading enterprise MLOps platforms on the market, and specifically its Domino Nexus hybrid cloud workload scheduling capabilities, with NetApp's data management and data services, including NetApp BlueXP, which provides a single control plane for the data estate: a single pane of glass for orchestrating all of the data management capabilities and data services.

What sorts of challenges does this solution address? The main one is this: many AI-ready compute environments are oversubscribed. There are teams all over the enterprise that are ready to run AI jobs, and those jobs often have to wait in a queue. So you've got teams of data scientists with jobs that are ready to run, but they're having to wait days, weeks, sometimes even months to get access to the compute they need to run those jobs, and that delays mission-critical AI projects.
So there's a desire to be able to burst these jobs on demand to a remote environment such as the AWS cloud, so that compute resources can be taken advantage of more immediately and projects can be accelerated to the finish line, where they'll actually be delivering business value. Now, in validating this solution we used AWS as the remote environment, but NetApp offers storage services for all of the major clouds and, of course, for the data center. So the remote environment could be any cloud, or it could even be a remote data center where some AI-ready compute is available.

This is something that organizations everywhere want to do, but it's really painful and really challenging, to the point that I've had data scientists tell me that the data platform teams they work with have essentially given up on bursting their jobs because it's too complicated and too complex. And so they're just stuck sitting around waiting in that queue.

Why is it so complex? Well, you often don't want to move all of the data, only a subset, the most important subset for your job, and selecting that subset up front can be challenging. It's often easier to select that subset once you get access to some resources and can start doing some experimentation. Then there's the even greater challenge of performing the manual copy of the data subset. You have to find a host on which to execute that copy, and then it's a manual job that you have to manage. These jobs are often error-prone, so what do you do if there's a failure? How do you ensure data consistency? You inherit everything that comes along with executing a manual job. And once the job is finished, you've got a remote copy that you have to manage, and a lot of governance headaches come along with that. How do you make sure that copy doesn't fall into the wrong hands? How do you make sure access is controlled? For large enterprises with significant governance and regulatory requirements, this is a big concern. How do you make sure that copy isn't lost and left sitting out there where someone could access it in the future? You want to manage the life cycle of that remote copy and be able to easily tear it down or delete it when you're done with it, while of course logging all of the important artifacts from it. Like I mentioned, these challenges have caused a lot of the data science teams we work with to say, "You know what? It's too complicated. We're not going to do this."

So we have applied NetApp FlexCache technology along with Domino Nexus and the hybrid cloud capabilities that the Domino enterprise AI platform brings, and we've greatly simplified this process. With FlexCache technology, you can set up an on-demand, sparsely populated cache in the remote environment. Again, that could be AWS, it could be any cloud, or it could be a remote data center. There's no up-front data selection: you set up a cache, and in the remote environment it looks like all of the data is there, but data only actually moves over as it's accessed.
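Later in the demo, the cache is created through the BlueXP volume caching wizard, but for a sense of what's happening underneath, here is a rough command-line sketch of creating a FlexCache volume on an FSx for ONTAP file system over SSH. This is not the exact workflow from the demo; the SVM and volume names, the management address, and the size are placeholders, and it assumes cluster and SVM peering between the on-premises system and the FSx for ONTAP file system is already in place.

```sh
# Hypothetical sketch: create a sparsely populated FlexCache volume in the
# remote environment that fronts the on-premises origin volume.
# All names, the address, and the size are placeholders.
ssh fsxadmin@management.fsxn.example.com \
  "volume flexcache create -vserver cache_svm -volume chatbot_data_cache \
     -aggr-list aggr1 -origin-vserver onprem_svm -origin-volume chatbot_data \
     -junction-path /chatbot_data_cache -size 1TB"
```

Because the cache is sparse, creation completes quickly regardless of how large the origin volume is (no data is copied up front), and tearing the cache down later leaves the gold copy at the origin untouched.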
So you can go ahead and kick off your job, and as it accesses specific data elements, the actual data moves over. That makes subset selection much simpler, because you can get going with your experiment or job and figure out the subset as you go. There's also some flexibility that comes along with this for big jobs that use very expensive compute resources: before you spin up that compute, you can run a simple job that reads the specific data elements or data assets you need, which pre-hydrates the cache so the data is already there when you spin up the real job. That way you get maximum value out of those expensive compute resources. (A rough sketch of what such a warm-up pass can look like appears a little further down.)

And this is all on NetApp's platform. We've got NetApp platforms on both ends, plus NetApp orchestration and BlueXP, so you get all of the enterprise-caliber governance, security, and anti-ransomware capabilities that NetApp brings to the table. When you're done with the FlexCache, you just delete the cache. You've still got your gold copy of the data in the source environment, and there's no headache around managing a remote copy that's sitting out there somewhere. So all of NetApp's capabilities here address a lot of the governance and regulatory concerns that come along with this process.

So that's the challenge, and that's also enough slideware. Without further ado, I'd like to jump into a demo of this solution in action, addressing the specific scenario I just highlighted. So let's roll the demo.

Hey folks, Mike Oglesby here from NetApp, and I'm going to show you how Domino Nexus, NetApp ONTAP, and BlueXP enable you to seamlessly schedule AI workloads across multiple data centers and clouds. In this particular scenario, I'm running a NetApp ONTAP-backed Domino Nexus instance in both my on-premises data center and in AWS. Let's say I just finished writing the code for an AI training job and I want to go ahead and schedule an instance of that job. From the Domino web console, I can schedule a run, and when I do, Domino Nexus gives me my choice of hardware tier. So let's take a look at the choices. "Local" in this case is AWS, because that's where I'm running my Domino control plane, and you'll see that I have a bunch of choices in AWS. Domino actually tells me how long I'll have to wait if I schedule a job to each of these tiers in AWS. If I scroll down, you'll also see that I have one tier for my local data center, and it's a pretty small tier, so I could also schedule a job to my local data center. But let's say that the job I currently want to schedule requires a GPU that I only have access to in AWS, and that I specifically need this larger GPU tier in AWS for this job. So let's select that tier and click through to the next screen. Here I have the option to attach a Spark cluster; I'm not going to do that for this particular job, so let's click on through. On this final screen, Domino shows me the data assets that my job will have access to.
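Before continuing with the demo, here is the minimal warm-up sketch promised above. It assumes the FlexCache volume is NFS-mounted at /mnt/chatbot-data (a made-up path) on an inexpensive host or pod in the remote environment; reading each file once pulls its blocks across from the origin so a later GPU job starts against a warm cache.

```sh
# Hypothetical warm-up pass: read every file under the cache mount once so the
# data is pulled over from the origin before the expensive GPU tier is spun up.
# /mnt/chatbot-data is a placeholder mount point for the FlexCache volume.
find /mnt/chatbot-data -type f -print0 | xargs -0 -P 8 -n 32 cat > /dev/null
```

You can scope the find to just the directories a particular job needs if you don't want to hydrate the whole volume.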
Back on that final screen, you can see that my job will have access to both the quick-start data set and the image data external data volume, which is actually a data volume on a high-performance Amazon FSx for NetApp ONTAP instance. But my job is not going to have access to this chatbot data external data volume, which is also a data volume on ONTAP. That data volume is unfortunately only available in my on-premises data center. But let's say my job actually needs to access some data on this chatbot data external data volume. Ordinarily, this would be the end of the story: I would have to figure out how to run my job with limited resources in my on-premises data center, or I would be forced to give up on running this job altogether. With NetApp, the good news is I don't have to give up. I can quickly spin up a cache of this data volume in AWS using Amazon FSx for NetApp ONTAP, giving my job access to the entire contents of the data volume. So let's take a look at the process for creating that cache, so I can show you just how easy it is.

To create my cache, I just go into NetApp BlueXP, click on my on-premises NetApp storage system, drag it onto my Amazon FSx for NetApp ONTAP instance, and click on volume caching. I wait a few seconds for the volume caching wizard to load, and the first thing I need to do is choose my origin volume. The volume containing the data I need for my training job is this volume named chatbot-data, so I select it, scroll down, and accept all of the defaults for these options (you can change them as needed). Then I just click on Create Caches. This takes a couple of seconds; it brings me to my volume caching list, and I just need to wait for the cache creation operation to complete. All right, we waited about 30 seconds, and now we see our cache in the volume caching list. The cache has been created.

The next thing we need to do is create a persistent volume claim, a PVC, for this cache in our Amazon EKS cluster, so that we can attach, or mount, the cache to our Domino Data Lab MLOps platform running on Amazon EKS. To create that PVC, we jump over to our terminal. I have already created a PVC definition file, so let's take a look at it. There are only a few things that you need to specify in this file, and you can find a template in the Domino Data Lab documentation. (A generic sketch of what such a file can look like appears at the end of this step.) All we need to do to create the PVC is run this kubectl create command. It takes a couple of seconds, and you'll see it created a PVC named chatbot-data.

The only other thing we need to do is import our new PVC into Domino Data Lab. So now we're back in our Domino web console, and to import the PVC we navigate to the admin area by clicking on Admin down here, and then we go to Data and then External Data Volumes. You'll see that we have two external data volumes: the image data external data volume, which is available in both the on-premises data center and AWS, and the chatbot data external data volume, which is currently only available in the on-premises data center. What we're going to do is import the PVC that we just created so that the chatbot data external data volume is also available in AWS. Importing the PVC is super easy.
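The demo doesn't show the contents of the PVC definition file itself, and the authoritative template lives in the Domino Data Lab documentation, but here is a hedged, generic sketch of the shape such a definition can take: a statically provisioned PersistentVolume pointing at the FlexCache volume's NFS export, plus a PersistentVolumeClaim bound to it, created in one shot with kubectl. The NFS server address, junction path, namespace, and size are all placeholders, not values from the demo environment.

```sh
# Hypothetical sketch (not the Domino-provided template): expose the FlexCache
# volume's NFS export to the EKS cluster as a statically provisioned PV, and
# bind a PVC to it that Domino can then import as an external data volume.
kubectl create -f - <<'EOF'
apiVersion: v1
kind: PersistentVolume
metadata:
  name: chatbot-data
spec:
  capacity:
    storage: 1Ti
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  nfs:
    server: 198.51.100.10        # data LIF of the FSx for ONTAP file system (placeholder)
    path: /chatbot_data_cache    # junction path of the FlexCache volume (placeholder)
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: chatbot-data
  namespace: domino-compute      # Domino compute namespace (placeholder)
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: ""
  volumeName: chatbot-data
  resources:
    requests:
      storage: 1Ti
EOF
```

Once the PVC exists in the cluster, it shows up as an available data plane that can be attached to the external data volume from the Domino admin console, which is the import step that follows.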
Back in the External Data Volumes screen, all we have to do is click on Edit, and you'll see that Domino automatically picks up the fact that we now have a PVC in AWS. It tells us that these new data planes have been detected and will be enabled for this volume. So all we have to do is click on Update, and that's it: our chatbot data external data volume is now available in AWS.

If we jump back over to our main Domino web console, attempt to run another job, once again choose that GPU hardware tier in AWS, and click through the screens, you'll now see that our chatbot data external data volume is in fact available, and our job will have access to the data on that external data volume. So we can go ahead and kick off our job, and that's it. It's that simple to schedule AI workloads across multiple data centers and clouds with Domino Nexus, NetApp ONTAP, and NetApp BlueXP. I hope you enjoyed the demo.

To quickly wrap up: in this session I've shown you how the solution we've put together with Domino Data Lab, specifically pairing NetApp data management capabilities with Domino's Nexus hybrid cloud capabilities, enables seamless workload scheduling. There's a single control plane across data centers and clouds, and you can send your workloads out to your compute where it's available, when it's available. We've paired this with efficient data access, which enables a lot of flexibility: you can either bring the job to the data or bring the data to the job, depending on where your compute resources are located and when they're available. And we've done all of this with industry-standard tooling. Domino offers access to all of the industry-standard AI and machine learning frameworks and tools, so you can use all of your standard tooling on top of this solution.

Lastly, there are some related sessions you might want to check out, so have a look at the recordings for those other Insight sessions. For additional information, there are a few links down here. The first is the link to our technical documentation for this solution, so definitely check that one out. Then we have a link to the landing page for all of NetApp's AI solutions, which you'll want to check out as well. And lastly, there's the link to Domino Data Lab's page, where you can check out their enterprise AI platform and their Nexus hybrid cloud capabilities.

Please do stay connected: follow us on Twitter and Facebook, and join the NetApp community on Discord. If you want to connect with me, you can find me on LinkedIn; the link to my profile is down at the bottom here. And with that, I'd like to thank you for your time. I hope you enjoyed the session.
When undertaking AI projects, enterprises often face a scenario where data resides on-premises, but on-premises compute is over-subscribed. In these situations, it would be useful to be able to "burst" high-priority jobs to the cloud so that [...]